# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save under outputs/datasets/collection

## Inputs

* Kaggle JSON file (kaggle.json) - authentication token

## Outputs

* Generated datasets: outputs/datasets/collection/HeritageHousing.csv and outputs/datasets/collection/InheritedHouses.csv

## Import packages

In [1]:
import numpy
import os

## Change the working directory

In [2]:
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing-project--Predictive-Analytics/jupyter_notebooks'

In [3]:
os.chdir('/workspace/Heritage-Housing-project--Predictive-Analytics')
print("You set a new current directory")

You set a new current directory


In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing-project--Predictive-Analytics'
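
The hard-coded path passed to `os.chdir` only works in this particular workspace. A minimal sketch of a more portable alternative follows, assuming the notebook is started from a subfolder (here `jupyter_notebooks`) that sits directly under the project root. The helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory if the cwd is the notebooks folder.

    Returns the working directory after the (possible) change. Running it
    a second time is a no-op, so re-executing the cell is safe.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


print(move_to_project_root())
```

Checking the folder name before moving up avoids climbing one level too far when the cell is run more than once.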

# Install Kaggle


In [5]:
# install kaggle package
!pip install kaggle



* Run the cell below to point the Kaggle configuration directory at the current working directory and to restrict the permissions of the Kaggle authentication JSON file.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
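
The `chmod` shell call fails with an unhelpful message if `kaggle.json` has not been uploaded yet. The same step can be sketched in pure Python with an explicit existence check. This assumes a POSIX file system, and the helper name `secure_token` is hypothetical.

```python
import os
import stat
from pathlib import Path


def secure_token(token_path: str = "kaggle.json") -> bool:
    """Restrict the token file to owner read/write (mode 600).

    Returns False if the file is missing, True once permissions are set.
    """
    path = Path(token_path)
    if not path.is_file():
        print(f"{path} not found; upload it before running the download cell")
        return False
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return True


# Tell the Kaggle client to look for kaggle.json in the working directory.
os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
secure_token()
```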

* Get the dataset path from the Kaggle URL. When viewing the dataset on Kaggle, copy whatever follows https://www.kaggle.com/ (or, in some cases, https://www.kaggle.com/datasets/) and assign it to KaggleDatasetPath.

* Set your destination folder.

* Set the Kaggle dataset and download it.
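
Copying the dataset path by hand is easy to get wrong when the URL contains the extra `datasets/` segment. As an illustration only, the slug can be derived from the URL with the standard library; the function name `dataset_path_from_url` is hypothetical.

```python
from urllib.parse import urlparse


def dataset_path_from_url(url: str) -> str:
    """Return the '<owner>/<dataset>' slug from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Newer Kaggle URLs put 'datasets' in front of the owner name.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Cannot find an owner/dataset pair in {url!r}")
    return "/".join(parts[:2])


print(dataset_path_from_url(
    "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
))
# prints: codeinstitute/housing-prices-data
```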


In [5]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "../inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data
License(s): unknown
Downloading housing-prices-data.zip to ../inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 3.53MB/s]


In [6]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm config/kaggle.json

Archive:  ../inputs/datasets/raw/housing-prices-data.zip
  inflating: ../inputs/datasets/raw/house-metadata.txt  
  inflating: ../inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: ../inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
rm: cannot remove 'config/kaggle.json': No such file or directory
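
In the output above, the archive is extracted and deleted, but the final `rm config/kaggle.json` fails because no `config` folder exists. A rough Python equivalent of the unzip-and-clean-up step is sketched below with the `zipfile` module. It deliberately leaves the token file alone, and the function name `extract_and_remove` is hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_and_remove(folder: str) -> List[str]:
    """Extract every .zip in `folder` into it, then delete the archive.

    Returns the names of all extracted members.
    """
    extracted: List[str] = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # only reached if extraction did not raise
    return extracted


# In the notebook this would be called as:
# extract_and_remove(DestinationFolder)
```

Because the archive is only deleted after `extractall` returns, a corrupt download stays on disk for inspection.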


In [4]:
import shutil
import os


source = 'jupyter_notebooks/inputs'
destination = '../inputs'


if os.path.exists(source):
    shutil.move(source, destination)
    print(f"Moved {source} to {destination}")
else:
    print(f"The source path {source} does not exist")


The source path jupyter_notebooks/inputs does not exist


In [8]:
import os

print("Content in parent folder:")
print(os.listdir(".."))

print("Content in current working directory (project root):")
print(os.listdir("."))


Content in parent folder:
['.gitpod', '.pip-modules', '.vscode-remote', 'Heritage-Housing-project--Predictive-Analytics']
Content in current working directory (project root):
['.git', '.gitignore', '.gitpod.dockerfile', '.gitpod.yml', '.slugignore', '.vscode', 'Procfile', 'README.md', 'jupyter_notebooks', 'kaggle.json', 'outputs', 'requirements.txt', 'runtime.txt', 'setup.sh', 'inputs']


# Load and Inspect the Kaggle Data

In [10]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
inherited_df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df.head(10)

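| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |
| 5 | 796 | 566.0 | 1.0 | No | 732 | GLQ | 64 | NaN | 480 | Unf | ... | 85.0 | 0.0 | 30 | 5 | 5 | 796 | NaN | 1993 | 1995 | 143000 |
| 6 | 1694 | 0.0 | 3.0 | Av | 1369 | GLQ | 317 | NaN | 636 | RFn | ... | 75.0 | 186.0 | 57 | 5 | 8 | 1686 | NaN | 2004 | 2005 | 307000 |
| 7 | 1107 | 983.0 | 3.0 | Mn | 859 | ALQ | 216 | NaN | 484 | NaN | ... | NaN | 240.0 | 204 | 6 | 7 | 1107 | NaN | 1973 | 1973 | 200000 |
| 8 | 1022 | 752.0 | 2.0 | No | 0 | Unf | 952 | NaN | 468 | Unf | ... | 51.0 | 0.0 | 0 | 5 | 7 | 952 | NaN | 1931 | 1950 | 129900 |
| 9 | 1077 | 0.0 | 2.0 | No | 851 | GLQ | 140 | NaN | 205 | RFn | ... | 50.0 | 0.0 | 4 | 6 | 5 | 991 | NaN | 1939 | 1950 | 118000 |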


# Assessment of data:

* The dataset has 1460 rows and 24 columns; the `df.head(10)` call above displays only the first 10 rows.



* DataFrame Summary:

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
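
The `Non-Null Count` column above shows that several columns have gaps; `EnclosedPorch`, for example, has only 136 non-null values out of 1460 entries. One possible way to rank columns by missing data is sketched below. The helper name `missing_value_report` is hypothetical, and the small `sample` frame only reuses a few values from the first four rows shown earlier so that the block runs on its own.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in data; in the notebook, pass the full `df` instead.
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854.0, 0.0, 866.0, None],
    "EnclosedPorch": [0.0, None, 0.0, None],
})
print(missing_value_report(sample))
```

Run against the full `df`, the report makes it easier to see which columns will need imputation or removal in a later cleaning step.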

# *Push files to Repo*

In [13]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df.to_csv("outputs/datasets/collection/HeritageHousing.csv", index=False)
inherited_df.to_csv("outputs/datasets/collection/InheritedHouses.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
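
The `[Errno 17] File exists` message is printed because `os.makedirs` raises an error when the folder is already present, which happens on every run after the first. A sketch of a quieter alternative using `pathlib` follows; the helper name `ensure_dir` is hypothetical.

```python
from pathlib import Path


def ensure_dir(path: str) -> Path:
    """Create `path` (and any missing parents); do nothing if it exists."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# In the notebook this could replace the try/except block:
# out_dir = ensure_dir("outputs/datasets/collection")
# df.to_csv(out_dir / "HeritageHousing.csv", index=False)
```

With `exist_ok=True`, re-running the cell produces no error output, while a real failure such as a permission problem still raises.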
