# **Data Collection**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under `outputs/datasets/collection`.
  
## Inputs

* Kaggle JSON file (`kaggle.json`), used as the authentication token to access the Kaggle dataset.
* Raw dataset from kaggle.com for house prices in Ames, Iowa.

## Outputs

* Generate Dataset: `outputs/datasets/collection/house_prices_records.csv`
* Generate Inherited House Dataset: `outputs/datasets/collection/inherited_houses.csv`

---

# Change working directory

The notebooks for this project are stored in a subfolder called `jupyter_notebooks`. When running a notebook, the working directory therefore needs to be changed to the parent folder.
* We access the current directory with `os.getcwd()`.

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
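
Re-running the `os.chdir` cell above moves up one more level each time. As a minimal sketch of a guard against that, the hypothetical helper `move_to_project_root` (not part of the original notebook) only changes directory when the current folder is named `jupyter_notebooks`:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only when currently inside the notebooks folder."""
    current = Path.cwd()
    # Assumption: the notebooks live in a folder with exactly this name.
    if current.name == notebook_folder:
        os.chdir(current.parent)
    return Path.cwd()


# Safe to run repeatedly: a second call leaves the directory unchanged.
print(move_to_project_root())
```
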

---

# Fetch Data from Kaggle

Use the `%pip` magic command to install the kaggle package

In [None]:
%pip install kaggle==1.5.12

<div style="border-left: 4px solid #ffa500; padding: 10px; background-color: #f9f9f9; color: #000">
<strong>IMPORTANT:</strong> Remember to place your <code style="color: red">kaggle.json</code> file in the root directory, as it contains the API key.
</div>

Once the `kaggle.json` file is in your root directory, run the cell below, so the token is recognized in the session

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
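
Before downloading, it can help to confirm that the token file is actually in place and readable. The sketch below is not part of the original notebook; `check_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields, which is the layout Kaggle generates:

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return True if the token file exists and contains 'username' and 'key'."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        # The file exists but is not valid JSON.
        return False
    return isinstance(token, dict) and {"username", "key"}.issubset(token)
```

Calling `check_kaggle_token()` from the root directory should return `True` before the download cell is run.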

The dataset for this project can be found at [https://www.kaggle.com/datasets/codeinstitute/housing-prices-data](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

1. Extract the dataset path from the URL (everything after `www.kaggle.com/datasets/`) and assign it to a new variable.
    - That being: `codeinstitute/housing-prices-data`

2. Define a new variable for the destination folder.
3. Download the data from Kaggle.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the file, remove the zip file and delete the `kaggle.json` authentication token

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
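
The shell command above relies on `unzip` and `rm` being available. As a rough alternative for environments without them, the extraction step can be sketched with Python's standard `zipfile` module. `extract_and_remove_zips` is a hypothetical helper, not part of the original notebook, and it deliberately leaves `kaggle.json` untouched:

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into it, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted successfully.
        archive.unlink()
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`; the token would still need to be deleted separately.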

---

# Load and Inspect Kaggle data

### Load House Prices Records Data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head(10)

#### Summary of Data

In [None]:
df.info()
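
`df.info()` reports non-null counts, from which the missing values have to be read off by hand. A small sketch that lists only the columns with missing values, together with their share of rows, is shown below. `missing_summary` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Running `missing_summary(df)` on the house prices records makes it easier to see which columns have a high proportion of missing values.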

#### Check for Duplicate Data

In [None]:
df[df.duplicated()]

### Load Inherited House Price Data

In [None]:
import pandas as pd
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

#### Summary of Data

In [None]:
df_inherited.info()

#### Check for Duplicate Data

In [None]:
df_inherited[df_inherited.duplicated()]

---

# Overall Summary and Observations

**House Price Records Data**
* 1460 records of data
* No duplicated data
* 23 Features and 1 Target variable
* 9 Columns with missing values
* Mixture of Data Types (`int64`, `float64`, `object`)
* 2 Columns with high degree of missing values

**Inherited House data**
* 4 records in the inherited houses data
* No duplicated data
* No missing data
* Mixture of Data Types (`int64`, `float64`, `object`)

**General**
* Manual review of the associated text file indicates that some, if not all, of the `object` features are ordinal in nature: they express the quality of a particular feature of the house as textual grades.
* If the model requires performance adjustments, a consistent approach to data types should be considered, for example changing `int64` to `int8`/`int16` and `float64` to `int8`/`int16`/`int32`/`int64`, as some of the current types appear larger than needed for square footage or ordinal values. Given the size of the dataset, however, performance is unlikely to be an issue; this is noted for completeness.
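
Should downcasting ever become necessary, one possible approach uses `pd.to_numeric` with its `downcast` argument. This is a sketch only, not something the notebook applies: `downcast_numeric` is a hypothetical helper, and it keeps float columns as floats (rather than converting them to integers) because columns with missing values cannot be stored in NumPy integer types.

```python
import numpy as np
import pandas as pd


def downcast_numeric(frame):
    """Return a copy with integer and float columns downcast to the smallest fitting type."""
    result = frame.copy()
    for column in result.select_dtypes(include=np.integer).columns:
        result[column] = pd.to_numeric(result[column], downcast="integer")
    for column in result.select_dtypes(include=np.floating).columns:
        # Floats stay floats so that missing values (NaN) are preserved.
        result[column] = pd.to_numeric(result[column], downcast="float")
    return result
```

Comparing `df.memory_usage(deep=True).sum()` before and after the call would show whether the saving is worth the extra step for a dataset of this size.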

# Push files to Repo

Create a new folder for the output data and save a copy of the data for further processing. 

In [None]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv('outputs/datasets/collection/house_prices_records.csv', index=False)
df_inherited.to_csv('outputs/datasets/collection/inherited_houses.csv', index=False)