# Data Collection Notebook

## Objectives
- Fetch data from Kaggle
- Save it as raw data
- Inspect the data
- Save it under outputs/datasets/collection

## Inputs
- Kaggle JSON file - the authentication token

## Outputs 
- Generated datasets: outputs/datasets/collection/HousingPrices.csv and outputs/datasets/collection/InheritedHouses.csv

---

## Change working directory
Change the current working directory to its parent folder.

In [None]:
import os 
cwd = os.getcwd()
cwd

In [None]:
os.chdir(os.path.dirname(cwd))
print("You set a new current working directory")

In [None]:
cwd = os.getcwd()
cwd

## Fetch data from Kaggle
Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

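Point the Kaggle package at the authentication token (kaggle.json in the current working directory) and restrict the file's permissions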

In [None]:
import os 
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Define Kaggle dataset and destination folder

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip downloaded file, delete the zip file, delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
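
The shell command above depends on `unzip` being available on the machine. As a rough alternative, the extract-then-delete step could be sketched in plain Python with the standard library. The function name `extract_and_remove_zips` is hypothetical and not part of the original notebook, and this sketch deliberately leaves kaggle.json untouched.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into `folder`, then delete the archive.

    Returns the names of the archives that were processed.
    """
    folder = Path(folder)
    processed = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()  # remove the archive only after a successful extraction
        processed.append(archive.name)
    return processed
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`.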

---

## Load and Inspect Kaggle data

In [None]:
import pandas as pd
df_house_prices = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_house_prices.head()

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

In [None]:
df_house_prices.info(verbose=True)

In [None]:
df_inherited.info(verbose=True)

Initial Observations:
- House price records: 1460 entries and 24 columns
- Target variable: SalePrice
- Mostly numerical columns, with a few categorical columns
- Missing data in the following features: 2ndFlrSF, BedroomAbvGr, BsmtExposure, BsmtFinType1, EnclosedPorch, GarageFinish, GarageYrBlt, LotFrontage, MasVnrArea, WoodDeckSF
- Missing data is most severe in EnclosedPorch and WoodDeckSF
- GarageYrBlt, YearBuilt and YearRemodAdd hold year values, so age features can be derived from them
- Data on inherited houses is complete
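
The missing-data observations above can be quantified rather than read off the `info()` output. The sketch below shows one possible way to do that; `missing_summary` is a hypothetical helper name, and it uses only `DataFrame.isna`, `sum` and `sort_values`.

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    summary = summary[summary["missing"] > 0]  # keep only columns with gaps
    return summary.sort_values("missing", ascending=False)
```

Calling `missing_summary(df_house_prices)` would list the affected columns with the most incomplete ones first.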

Deriving Useful Variables:
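- HouseAge - age of the house in years
- RemodAge - years since last remodel
- GarageAge - age of the garage in years
- TotalSF - total internal square footage (basement, first floor and second floor)
- AboveGradeSF - total internal above grade square footage (first floor and second floor)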

In [None]:
from datetime import datetime

current_year = datetime.now().year

# df_house_prices
df_house_prices['HouseAge'] = current_year - df_house_prices['YearBuilt']
df_house_prices['RemodAge'] = current_year - df_house_prices['YearRemodAdd']
df_house_prices['GarageAge'] = current_year - df_house_prices['GarageYrBlt']
df_house_prices['TotalSF'] = df_house_prices['TotalBsmtSF'] + df_house_prices['1stFlrSF'] + df_house_prices['2ndFlrSF'].fillna(0)
df_house_prices['AboveGradeSF'] = df_house_prices['1stFlrSF'] + df_house_prices['2ndFlrSF'].fillna(0)


# df_inherited
df_inherited['HouseAge'] = current_year - df_inherited['YearBuilt']
df_inherited['RemodAge'] = current_year - df_inherited['YearRemodAdd']
df_inherited['GarageAge'] = current_year - df_inherited['GarageYrBlt']
df_inherited['TotalSF'] = df_inherited['TotalBsmtSF'] + df_inherited['1stFlrSF'] + df_inherited['2ndFlrSF'].fillna(0)
df_inherited['AboveGradeSF'] = df_inherited['1stFlrSF'] + df_inherited['2ndFlrSF'].fillna(0)
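
The two blocks above apply identical transformations to both DataFrames. One way to avoid the duplication is a single function, sketched below. The name `add_derived_features` and its `reference_year` parameter are hypothetical. Passing the reference year explicitly, instead of reading `datetime.now().year`, is a design choice that keeps the derived ages the same across reruns in different years.

```python
import pandas as pd


def add_derived_features(df, reference_year):
    """Return a copy of `df` with the age and square-footage features added."""
    out = df.copy()
    second_floor = out["2ndFlrSF"].fillna(0)  # treat a missing second floor as 0 sq ft
    out["HouseAge"] = reference_year - out["YearBuilt"]
    out["RemodAge"] = reference_year - out["YearRemodAdd"]
    out["GarageAge"] = reference_year - out["GarageYrBlt"]  # stays NaN where the year is missing
    out["AboveGradeSF"] = out["1stFlrSF"] + second_floor
    out["TotalSF"] = out["TotalBsmtSF"] + out["AboveGradeSF"]
    return out
```

It could then be applied once per dataset, for example `df_house_prices = add_derived_features(df_house_prices, current_year)`.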

Creating Flags:
- IsRemodeled: 1=remodeled (YearRemodAdd differs from YearBuilt), 0=original condition
- Has2ndFlr: 1=has second floor, 0=does not have second floor
- HasPorch: 1=has enclosed porch, 0=does not have enclosed porch
- HasDeck: 1=has wood deck, 0=does not have wood deck

In [None]:
# df_house_prices
# Missing areas are filled with 0 before comparing, so they are flagged as 0
df_house_prices['IsRemodeled'] = (df_house_prices['YearBuilt'] != df_house_prices['YearRemodAdd']).astype(int)
df_house_prices['Has2ndFlr'] = (df_house_prices['2ndFlrSF'].fillna(0) > 0).astype(int)
df_house_prices['HasPorch'] = (df_house_prices['EnclosedPorch'].fillna(0) > 0).astype(int)
df_house_prices['HasDeck'] = (df_house_prices['WoodDeckSF'].fillna(0) > 0).astype(int)

# df_inherited
df_inherited['IsRemodeled'] = (df_inherited['YearBuilt'] != df_inherited['YearRemodAdd']).astype(int)
df_inherited['Has2ndFlr'] = (df_inherited['2ndFlrSF'].fillna(0) > 0).astype(int)
df_inherited['HasPorch'] = (df_inherited['EnclosedPorch'].fillna(0) > 0).astype(int)
df_inherited['HasDeck'] = (df_inherited['WoodDeckSF'].fillna(0) > 0).astype(int)
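
The flags above map a missing area to 0, which means "unknown" and "absent" become indistinguishable. Because EnclosedPorch and WoodDeckSF have the most missing data, this is worth keeping in mind when interpreting HasPorch and HasDeck. The short sketch below, using made-up values rather than rows from the dataset, shows that the comparison behaves the same with or without `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the dataset.
porch = pd.Series([0.0, 120.0, np.nan])

flag_plain = (porch > 0).astype(int)             # NaN > 0 evaluates to False
flag_filled = (porch.fillna(0) > 0).astype(int)  # explicit fill, as in HasPorch and HasDeck

print(flag_plain.tolist())
print(flag_filled.tolist())
```

Both lists are identical, with the missing entry flagged as 0.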

Drop the original year columns, which the derived age features now replace

In [None]:
df_house_prices.drop(['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], axis=1, inplace=True)
df_inherited.drop(['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], axis=1, inplace=True)

df_house_prices.head()

## Create output file

In [None]:
import pandas as pd

print(pd.__version__)

In [None]:
# pathlib ships with the Python standard library (3.4+), so no installation is needed

In [None]:
from pathlib import Path

out_dir = Path("outputs/datasets/collection")
out_dir.mkdir(parents=True, exist_ok=True)

df_house_prices.to_csv(out_dir / "HousingPrices.csv", index=False)
df_inherited.to_csv(out_dir / "InheritedHouses.csv", index=False)
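
After writing the files, a quick round-trip check can confirm that nothing was dropped or added, such as a stray index column. The sketch below uses a small stand-in DataFrame and a temporary directory so that it runs on its own. In the notebook the same comparison could be made against `df_house_prices` and the saved HousingPrices.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame; in the notebook this would be df_house_prices.
demo = pd.DataFrame({"HouseAge": [20, 30], "TotalSF": [2100.0, 1100.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "demo.csv"
    demo.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# Matching shape and column order means the file mirrors the DataFrame.
same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
print(same_shape, same_columns)
```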