# **Data Collection**

## Objectives

* Fetch data from Kaggle and prepare it for further processing

## Inputs

* Kaggle JSON file (authentication token)

## Outputs

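* Generate Dataset: outputs/datasets/collection/house_price_records_cat.csv and outputs/datasets/collection/inherited_houses_cat.csv (the raw Kaggle files are downloaded to inputs/heritage-housing-issues)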

## Additional Comments

* No comments 


---

# Import packages

In [1]:
import numpy
import os

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/heritage-housing-issues/jupyter_notebooks'

Change the working directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/heritage-housing-issues'
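
The directory change above moves up one level every time the cell is re-run. A minimal sketch of a variant that only moves up when the notebook is still inside the notebooks folder; `project_root` is a hypothetical helper, and the folder name `jupyter_notebooks` is taken from the output shown above.

```python
import os
from pathlib import Path


def project_root(start: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` only when `start` is the notebooks folder."""
    # Hypothetical helper: the default folder name comes from the output above.
    return start.parent if start.name == notebooks_dir else start


# Safe to run repeatedly: a second run leaves the directory unchanged.
os.chdir(project_root(Path.cwd()))
```

Because the check is on the folder name, running the cell twice does not climb above the project root.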

# Install Kaggle

In [5]:
# install Kaggle package
!pip install kaggle



---

# Fetch data from Kaggle

Run the cell below to set the Kaggle configuration directory to the current working directory and to restrict the permissions of the Kaggle authentication JSON file

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
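
The same configuration can be done without the shell. The sketch below is one possible pure-Python version, assuming a POSIX file system; `configure_kaggle` and its `env` parameter are hypothetical names introduced for illustration, and the demonstration uses a throwaway directory with a placeholder token rather than a real `kaggle.json`.

```python
import os
import stat
import tempfile
from pathlib import Path


def configure_kaggle(config_dir: Path, env=None) -> Path:
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to its owner."""
    env = os.environ if env is None else env
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"{token} not found")
    env["KAGGLE_CONFIG_DIR"] = str(config_dir)
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return token


# Demonstration with a placeholder token and a plain dict instead of os.environ,
# so the real environment is left untouched.
demo_env = {}
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "kaggle.json").write_text("{}")
    demo_token = configure_kaggle(demo_dir, env=demo_env)
    demo_mode = stat.S_IMODE(demo_token.stat().st_mode)
```

In the notebook itself the call would be `configure_kaggle(Path.cwd())`, which fails early with a clear error if the token file has not been uploaded.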

Set Kaggle Dataset and Download it

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/heritage-housing-issues"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/heritage-housing-issues
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 10.5MB/s]


Unzip the downloaded file, then delete the zip file and the kaggle.json token

In [9]:
!unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

unzip:  cannot find or open inputs/heritage-housing-issues/*.zip, inputs/heritage-housing-issues/*.zip.zip or inputs/heritage-housing-issues/*.zip.ZIP.

No zipfiles found.
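
The "No zipfiles found" message above is what `unzip` prints when the glob matches nothing, which is consistent with the cell being re-run after the archive had already been extracted and deleted. A minimal sketch of a re-runnable alternative using the standard `zipfile` module; `extract_and_remove_zips` is a hypothetical helper, and the demonstration builds its own small archive in a temporary folder.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: Path) -> list:
    """Extract every .zip in `folder` into `folder`, delete the archives, return their names."""
    archives = sorted(folder.glob("*.zip"))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()  # delete only after extraction succeeded
    return [archive.name for archive in archives]


# Demonstration on a throwaway folder containing one small archive.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    with zipfile.ZipFile(demo / "sample.zip", "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    first_run = extract_and_remove_zips(demo)
    second_run = extract_and_remove_zips(demo)  # nothing left, returns an empty list
    extracted = (demo / "data.csv").read_text()
```

A second run simply returns an empty list instead of raising an error.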


---

## Load and Inspect Kaggle data

### House metadata

In [6]:
import pandas as pd
df_house_metadata = pd.read_csv("inputs/heritage-housing-issues/house-metadata.txt")
df_house_metadata.head()

Unnamed: 0,1stFlrSF: First Floor square feet
0,334 - 4692
1,2ndFlrSF: Second floor square feet
2,0 - 2065
3,BedroomAbvGr: Bedrooms above grade (does NOT i...
4,0 - 8


#### DataFrame Summary

In [7]:
df_house_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 1 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   1stFlrSF: First Floor square feet  82 non-null     object
dtypes: object(1)
memory usage: 784.0+ bytes
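
Because house-metadata.txt is plain text rather than CSV, `read_csv` treats its first line as the column header, which is why the only column above is named after the 1stFlrSF description. A minimal sketch that reads such a file line by line instead; it uses only the first four lines visible in the output above as sample content and assumes nothing about the rest of the file.

```python
import tempfile
from pathlib import Path

# Sample content: the first four lines visible in the output above.
sample = (
    "1stFlrSF: First Floor square feet\n"
    "334 - 4692\n"
    "2ndFlrSF: Second floor square feet\n"
    "0 - 2065\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "house-metadata.txt"
    path.write_text(sample)
    # Keep every non-empty line, including the first one.
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
```

Read this way, the first description stays in the data instead of becoming a header.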


### House prices records

In [8]:
df_house_prices_records = pd.read_csv("inputs/heritage-housing-issues/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_house_prices_records.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


#### DataFrame Summary

*Checking & removing duplicates*

In [9]:
df_house_prices_records = df_house_prices_records.drop_duplicates(keep='first')  # returns a new DataFrame, so assign it
df_house_prices_records

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,953,694.0,3.0,No,0,Unf,953,,460,RFn,...,62.0,0.0,40,5,6,953,0.0,1999,2000,175000
1456,2073,0.0,,No,790,ALQ,589,,500,Unf,...,85.0,119.0,0,6,6,1542,,1978,1988,210000
1457,1188,1152.0,4.0,No,275,GLQ,877,,252,RFn,...,66.0,0.0,60,9,7,1152,,1941,2006,266500
1458,1078,0.0,2.0,Mn,49,,0,112.0,240,Unf,...,68.0,0.0,0,6,5,1078,,1950,1996,142125
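
`drop_duplicates` returns a new DataFrame unless `inplace=True` is passed, so it helps to count duplicates before deciding whether anything needs removing. A small sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame: the third row repeats the first.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 8450],
    "SalePrice": [208500, 181500, 208500],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduplicated = toy.drop_duplicates()        # new frame; `toy` itself is unchanged
```

If `n_duplicates` is zero, as the unchanged row count of 1460 in the following summary suggests for this dataset, the step is a no-op.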


In [10]:
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

*Converting 1stFlrSF, BsmtFinSF1, BsmtUnfSF, GarageArea, GrLivArea, LotArea, OpenPorchSF, TotalBsmtSF to float*

In [11]:
df_house_prices_records['1stFlrSF'] = df_house_prices_records['1stFlrSF'].astype(float)
df_house_prices_records['BsmtFinSF1'] = df_house_prices_records['BsmtFinSF1'].astype(float)
df_house_prices_records['BsmtUnfSF'] = df_house_prices_records['BsmtUnfSF'].astype(float)
df_house_prices_records['GarageArea'] = df_house_prices_records['GarageArea'].astype(float)
df_house_prices_records['GrLivArea'] = df_house_prices_records['GrLivArea'].astype(float)
df_house_prices_records['LotArea'] = df_house_prices_records['LotArea'].astype(float)
df_house_prices_records['OpenPorchSF'] = df_house_prices_records['OpenPorchSF'].astype(float)
df_house_prices_records['TotalBsmtSF'] = df_house_prices_records['TotalBsmtSF'].astype(float)
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   float64
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   float64
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   float64
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   float64
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   float64
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   float64
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   float64
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
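
The column-by-column conversion can also be written as a single `astype` call with a dictionary, which avoids repeating a line by accident. A minimal sketch on a toy frame; the column names come from the notebook, the two rows are just sample values.

```python
import pandas as pd

toy = pd.DataFrame({
    "1stFlrSF": [856, 1262],
    "GarageArea": [548, 460],
    "OverallCond": [5, 8],
})

float_cols = ["1stFlrSF", "GarageArea"]  # columns to convert
toy = toy.astype({col: float for col in float_cols})
```

Columns not listed in the dictionary, such as `OverallCond` here, keep their original dtype.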

*Dropping columns EnclosedPorch & WoodDeckSF (EnclosedPorch has only 136 non-null values out of 1460 rows)*

In [12]:
df_house_prices_records.drop(['EnclosedPorch','WoodDeckSF'], axis=1, inplace=True)
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   1stFlrSF      1460 non-null   float64
 1   2ndFlrSF      1374 non-null   float64
 2   BedroomAbvGr  1361 non-null   float64
 3   BsmtExposure  1460 non-null   object 
 4   BsmtFinSF1    1460 non-null   float64
 5   BsmtFinType1  1346 non-null   object 
 6   BsmtUnfSF     1460 non-null   float64
 7   GarageArea    1460 non-null   float64
 8   GarageFinish  1298 non-null   object 
 9   GarageYrBlt   1379 non-null   float64
 10  GrLivArea     1460 non-null   float64
 11  KitchenQual   1460 non-null   object 
 12  LotArea       1460 non-null   float64
 13  LotFrontage   1201 non-null   float64
 14  MasVnrArea    1452 non-null   float64
 15  OpenPorchSF   1460 non-null   float64
 16  OverallCond   1460 non-null   int64  
 17  OverallQual   1460 non-null   int64  
 18  TotalBsmtSF   1460 non-null 
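
The choice of columns to drop can be checked from the share of missing values per column. A minimal sketch on a made-up toy frame; the 0.5 threshold is an assumption for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up toy values; only the column names come from the dataset.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

missing_share = toy.isna().mean()  # fraction of missing values per column
threshold = 0.5                    # assumed cut-off, for illustration only
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)
```

For comparison, the summary above reports 136 non-null values out of 1460 for EnclosedPorch, a missing share of roughly 91%.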

*Dropping all rows that have missing data*

In [13]:
df_house_prices_records.dropna(inplace=True)
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 797 entries, 0 to 1459
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   1stFlrSF      797 non-null    float64
 1   2ndFlrSF      797 non-null    float64
 2   BedroomAbvGr  797 non-null    float64
 3   BsmtExposure  797 non-null    object 
 4   BsmtFinSF1    797 non-null    float64
 5   BsmtFinType1  797 non-null    object 
 6   BsmtUnfSF     797 non-null    float64
 7   GarageArea    797 non-null    float64
 8   GarageFinish  797 non-null    object 
 9   GarageYrBlt   797 non-null    float64
 10  GrLivArea     797 non-null    float64
 11  KitchenQual   797 non-null    object 
 12  LotArea       797 non-null    float64
 13  LotFrontage   797 non-null    float64
 14  MasVnrArea    797 non-null    float64
 15  OpenPorchSF   797 non-null    float64
 16  OverallCond   797 non-null    int64  
 17  OverallQual   797 non-null    int64  
 18  TotalBsmtSF   797 non-null   
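
Dropping every incomplete row reduces the data from 1460 to 797 rows, so 663 rows (about 45%) are removed. A minimal sketch, on made-up toy values, of how to count the affected rows before dropping them:

```python
import numpy as np
import pandas as pd

# Made-up toy values; only the column names come from the dataset.
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, np.nan, 0.0],
    "BedroomAbvGr": [3.0, np.nan, np.nan, 2.0],
})

rows_with_gaps = int(toy.isna().any(axis=1).sum())  # rows that dropna() would remove
complete = toy.dropna()                             # surviving rows keep their original labels
```

The surviving rows keep their original index labels, which is why the summary above reports 797 entries with an index running from 0 to 1459; `reset_index(drop=True)` would renumber them if needed.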

*Converting GarageYrBlt to int*

In [14]:
df_house_prices_records['GarageYrBlt'] = df_house_prices_records['GarageYrBlt'].astype(int)
df_house_prices_records.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 797 entries, 0 to 1459
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   1stFlrSF      797 non-null    float64
 1   2ndFlrSF      797 non-null    float64
 2   BedroomAbvGr  797 non-null    float64
 3   BsmtExposure  797 non-null    object 
 4   BsmtFinSF1    797 non-null    float64
 5   BsmtFinType1  797 non-null    object 
 6   BsmtUnfSF     797 non-null    float64
 7   GarageArea    797 non-null    float64
 8   GarageFinish  797 non-null    object 
 9   GarageYrBlt   797 non-null    int64  
 10  GrLivArea     797 non-null    float64
 11  KitchenQual   797 non-null    object 
 12  LotArea       797 non-null    float64
 13  LotFrontage   797 non-null    float64
 14  MasVnrArea    797 non-null    float64
 15  OpenPorchSF   797 non-null    float64
 16  OverallCond   797 non-null    int64  
 17  OverallQual   797 non-null    int64  
 18  TotalBsmtSF   797 non-null   
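
The conversion to `int` works here only because the rows with missing values were dropped first; a float column that still contains NaN cannot be cast to a plain integer dtype. A small sketch of that behaviour, with pandas' nullable `Int64` dtype shown as one alternative (the year values are illustrative):

```python
import numpy as np
import pandas as pd

years = pd.Series([2003.0, np.nan, 2001.0], name="GarageYrBlt")

try:
    years.astype(int)  # raises because NaN has no integer representation
    plain_int_ok = True
except (ValueError, TypeError):
    plain_int_ok = False

nullable = years.astype("Int64")  # nullable integer dtype keeps the gap as <NA>
```

With `Int64`, the missing entry is preserved as `<NA>` rather than forcing the row to be dropped.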

### Inherited houses

In [15]:
df_inherited_houses = pd.read_csv("inputs/heritage-housing-issues/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited_houses.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


#### DataFrame Summary

In [16]:
df_inherited_houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       4 non-null      int64  
 1   2ndFlrSF       4 non-null      int64  
 2   BedroomAbvGr   4 non-null      int64  
 3   BsmtExposure   4 non-null      object 
 4   BsmtFinSF1     4 non-null      float64
 5   BsmtFinType1   4 non-null      object 
 6   BsmtUnfSF      4 non-null      float64
 7   EnclosedPorch  4 non-null      int64  
 8   GarageArea     4 non-null      float64
 9   GarageFinish   4 non-null      object 
 10  GarageYrBlt    4 non-null      float64
 11  GrLivArea      4 non-null      int64  
 12  KitchenQual    4 non-null      object 
 13  LotArea        4 non-null      int64  
 14  LotFrontage    4 non-null      float64
 15  MasVnrArea     4 non-null      float64
 16  OpenPorchSF    4 non-null      int64  
 17  OverallCond    4 non-null      int64  
 18  OverallQual   

*Converting 1stFlrSF, 2ndFlrSF, BedroomAbvGr, EnclosedPorch, GrLivArea, LotArea, OpenPorchSF, WoodDeckSF to float.*
*Converting GarageYrBlt to int.*

In [17]:
df_inherited_houses['1stFlrSF'] = df_inherited_houses['1stFlrSF'].astype(float)
df_inherited_houses['2ndFlrSF'] = df_inherited_houses['2ndFlrSF'].astype(float)
df_inherited_houses['BedroomAbvGr'] = df_inherited_houses['BedroomAbvGr'].astype(float)
df_inherited_houses['EnclosedPorch'] = df_inherited_houses['EnclosedPorch'].astype(float)
df_inherited_houses['GrLivArea'] = df_inherited_houses['GrLivArea'].astype(float)
df_inherited_houses['LotArea'] = df_inherited_houses['LotArea'].astype(float)
df_inherited_houses['OpenPorchSF'] = df_inherited_houses['OpenPorchSF'].astype(float)
df_inherited_houses['WoodDeckSF'] = df_inherited_houses['WoodDeckSF'].astype(float)
df_inherited_houses['GarageYrBlt'] = df_inherited_houses['GarageYrBlt'].astype(int)
df_inherited_houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       4 non-null      float64
 1   2ndFlrSF       4 non-null      float64
 2   BedroomAbvGr   4 non-null      float64
 3   BsmtExposure   4 non-null      object 
 4   BsmtFinSF1     4 non-null      float64
 5   BsmtFinType1   4 non-null      object 
 6   BsmtUnfSF      4 non-null      float64
 7   EnclosedPorch  4 non-null      float64
 8   GarageArea     4 non-null      float64
 9   GarageFinish   4 non-null      object 
 10  GarageYrBlt    4 non-null      int64  
 11  GrLivArea      4 non-null      float64
 12  KitchenQual    4 non-null      object 
 13  LotArea        4 non-null      float64
 14  LotFrontage    4 non-null      float64
 15  MasVnrArea     4 non-null      float64
 16  OpenPorchSF    4 non-null      float64
 17  OverallCond    4 non-null      int64  
 18  OverallQual   
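
Since the goal of these conversions is to give both datasets matching dtypes, one option is to copy the dtypes from one frame to the other for the columns they share. A minimal sketch on toy frames; the names `reference` and `other` and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two datasets.
reference = pd.DataFrame({"1stFlrSF": [856.0], "GarageYrBlt": [2003], "SalePrice": [208500]})
other = pd.DataFrame({"1stFlrSF": [896], "GarageYrBlt": [1961.0]})

# Only convert columns that exist in both frames.
shared = [col for col in other.columns if col in reference.columns]
aligned = other.astype(reference.dtypes[shared].to_dict())
```

This keeps the two frames consistent without listing each column by hand, although it fails in the same way as a plain `astype` if a column contains values that cannot be cast.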

---

# Push files to Repo

* Save the cleaned datasets to outputs/datasets/collection so they can be pushed to the repo.

In [18]:
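import os
# Create the output folder if it does not exist yet
os.makedirs("outputs/datasets/collection", exist_ok=True)
df_house_prices_records.to_csv("outputs/datasets/collection/house_price_records_cat.csv", index=False)
df_inherited_houses.to_csv("outputs/datasets/collection/inherited_houses_cat.csv", index=False)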
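
`to_csv` does not create missing folders, so the output directory has to exist before writing. A minimal sketch of a write-and-read-back check on a toy frame in a temporary folder; the file name `toy.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"1stFlrSF": [856.0], "SalePrice": [208500]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)    # no error if the folder already exists
    out_path = os.path.join(out_dir, "toy.csv")
    toy.to_csv(out_path, index=False)      # index=False avoids an extra "Unnamed: 0" column
    reloaded = pd.read_csv(out_path)
```

Reading the file back is a quick way to confirm that the columns survive the round trip.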
