# **Data Collection Notebook**

Part of CRISP-DM **Data Understanding**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.
* Raw downloaded data file

## Outputs

* Generate and save datasets:
    * outputs/datasets/collection/house_price_records.csv
    * outputs/datasets/collection/inherited_houses.csv

## Additional Comments

* The first dataset in the outputs above is the data used to build our machine learning model(s). The second file consists of the inherited houses whose prices our client wants to predict.


---

# Install Python packages in the notebook

In [15]:
! pip3 install -r /workspace/heritage-housing-issues//requirements.txt

Collecting numpy==1.18.5 (from -r /workspace/heritage-housing-issues//requirements.txt (line 1))
  Downloading numpy-1.18.5-cp38-cp38-manylinux1_x86_64.whl.metadata (2.1 kB)
Collecting pandas==1.4.2 (from -r /workspace/heritage-housing-issues//requirements.txt (line 2))
  Downloading pandas-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting matplotlib==3.3.1 (from -r /workspace/heritage-housing-issues//requirements.txt (line 3))
  Downloading matplotlib-3.3.1-cp38-cp38-manylinux1_x86_64.whl.metadata (5.7 kB)
Collecting seaborn==0.11.0 (from -r /workspace/heritage-housing-issues//requirements.txt (line 4))
  Downloading seaborn-0.11.0-py3-none-any.whl.metadata (2.2 kB)
Collecting ydata-profiling==4.4.0 (from -r /workspace/heritage-housing-issues//requirements.txt (line 5))
  Downloading ydata_profiling-4.4.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting plotly==4.12.0 (from -r /workspace/heritage-housing-issues//requirements.txt (line 6))
  Dow

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues'
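
The cells above move up one level every time they are run, so re-running them by accident would leave the session above the project root. A minimal sketch of a guarded version follows. It assumes the notebooks live in a folder named `jupyter_notebooks`, as the output above shows, and the helper name `move_to_project_root` is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent folder only when currently inside the notebooks folder.

    The folder name is an assumption taken from the output shown above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the working directory that is now active
    return Path.cwd()


print(move_to_project_root())
```

Because the directory changes only when the current folder has the expected name, running the cell a second time leaves the working directory where it is.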

# Fetch raw data from Kaggle

To collect the data for our project, we will be using the Kaggle API. First, we need to install the Kaggle package.

In [9]:
! pip3 install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.4-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.

To access the Kaggle API, we need to have an authentication token available in our workspace directory. This token is in the form of a file named 'kaggle.json'. If you don't have this file available, you can create one by following these steps:

1. Log in to your existing Kaggle account or create a new one.
2. Click on your user profile picture, then on “Settings” from the dropdown menu.
3. Scroll down to the section called API.
4. Click Expire API Token to remove any previous tokens.
5. Click Create New API Token to generate a fresh authentication token and download the kaggle.json file.

Once you have the kaggle.json file, move it to the current working directory and make sure it is named exactly kaggle.json. Then run the following code so that the Kaggle API can find the token in this session.

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
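
The `chmod` error above means that `kaggle.json` was not in the working directory when the cell ran. A pre-flight check can surface this earlier and with a clearer message. The following is a minimal sketch, assuming a POSIX system and a token stored in the working directory; `prepare_kaggle_token` is a hypothetical helper, not part of the Kaggle package.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir=None):
    """Check that kaggle.json exists, restrict its permissions, and
    point KAGGLE_CONFIG_DIR at its folder.

    Hypothetical helper for illustration; defaults to the working directory.
    """
    config_dir = Path(config_dir or os.getcwd())
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(
            f"kaggle.json not found in {config_dir}; "
            "download it from your Kaggle account settings first."
        )
    # Same effect as `chmod 600 kaggle.json`: owner may read and write only
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token
```

Calling `prepare_kaggle_token()` before the download cell would stop the notebook with a `FileNotFoundError` if the token is missing, instead of failing later inside the Kaggle command.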


The dataset we will be using is located at the following URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data. We can define the Kaggle path and destination folder as follows:

In [3]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

We can then download the data using the following command:

In [4]:
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace/heritage-housing-issues/jupyter_notebooks. Or use the environment method.


After downloading the data, we can unzip the file and delete the zip file and kaggle.json file.



In [5]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
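
The shell command above fails when no archive is present, as the output shows. As an alternative, the extraction step can be sketched in Python with the standard `zipfile` module. The helper name `extract_and_remove_zips` is hypothetical, and this sketch deliberately leaves `kaggle.json` untouched.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into that folder, then delete it.

    Returns the names of all extracted members; an empty list means no
    archive was found. Hypothetical helper for illustration.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it was extracted without error
        archive.unlink()
    return extracted
```

With the variables defined earlier, a call such as `extract_and_remove_zips(DestinationFolder)` would return an empty list when the download did not succeed, which is easy to check before moving on.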


---

# Load and Inspect Kaggle Data

* Import the pandas library and load the Kaggle dataset as a pandas DataFrame.

In [4]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")

* View the first few rows of the data.
* Get a summary of the DataFrame.

In [5]:
df.head()
df.info(max_cols=24)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
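
A notebook cell displays only the value of its last expression, which is why the output above contains the `df.info()` summary but not the result of `df.head()`. The sketch below shows one way to get both in a single cell. It uses a small toy DataFrame for illustration only, not the housing data.

```python
import io

import pandas as pd

# Toy data for illustration only; these rows are not taken from the dataset
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "KitchenQual": ["Gd", "TA", None],
})

# print() forces the preview to appear even though it is not the last expression
print(toy.head())

# info() writes to stdout by default; a buffer captures the summary as text
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

Capturing the summary in `info_text` is optional, but it makes the text available for logging or saving next to the dataset.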

* Identify columns with missing values.
* Check for duplicate rows.

In [6]:
df.isnull().sum()
df[df.duplicated(subset=None)]

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
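
In the cell above, the result of `df.isnull().sum()` is not displayed because it is not the last expression. The following sketch builds a compact missing-value table and a duplicate count that can both be printed. The toy DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only; the first and last rows are identical
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, None, 866.0, 854.0],
    "GarageFinish": ["RFn", "Unf", None, "RFn"],
    "SalePrice": [208500, 181500, 223500, 208500],
})

# Count missing values per column and express them as a share of all rows
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing_count": missing,
    "missing_pct": (missing / len(toy) * 100).round(2),
})
# Keep only the columns that actually have missing values
summary = summary[summary["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(summary)

# duplicated() flags every repeat of an earlier row
duplicate_count = int(toy.duplicated().sum())
print(f"Duplicate rows: {duplicate_count}")
```

Applied to `df`, the same pattern would list only the columns that need attention instead of all 24.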


* Check each column for unique values.

In [7]:
for col in df:
    if df[col].dtypes == 'object':
        print(col, '-', df[col].unique())
    elif df[col].unique().size < 11:
        print(col, '-', df[col].unique().size)

BedroomAbvGr - 9
BsmtExposure - ['No' 'Gd' 'Mn' 'Av' 'None']
BsmtFinType1 - ['GLQ' 'ALQ' 'Unf' 'Rec' nan 'BLQ' 'None' 'LwQ']
GarageFinish - ['RFn' 'Unf' nan 'Fin' 'None']
KitchenQual - ['Gd' 'TA' 'Ex' 'Fa']
OverallCond - 9
OverallQual - 10
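
The output above shows that some columns, such as BsmtFinType1 and GarageFinish, contain both the literal string 'None' and true missing values (`nan`). The sketch below shows one way to count the two cases separately. It uses invented toy data and is only an illustration.

```python
import pandas as pd

# Toy data for illustration only; "None" is a string, None is a missing value
toy = pd.DataFrame({
    "BsmtFinType1": ["GLQ", "None", None, "Unf"],
    "OverallCond": [5, 8, 5, 6],
})

# Number of distinct values per column, counting NaN as its own value
cardinality = toy.nunique(dropna=False)
print(cardinality)

# The string 'None' is a regular category; isnull() only flags true NaN
literal_none = int((toy["BsmtFinType1"] == "None").sum())
true_missing = int(toy["BsmtFinType1"].isnull().sum())
print(f"literal 'None': {literal_none}, missing: {true_missing}")
```

Keeping the two counts apart matters during cleaning, because only the true missing values need to be handled as missing data.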


Our preliminary assessment of the data reveals:

* The dataset contains 1460 rows and 24 columns.
* The columns have a mix of data types, including integers, floats, and objects.
* Several columns have missing values, which will require further attention.
* Some columns appear to be categorical, based on the small number of unique values.

We'll need to investigate these findings further and perform additional data cleaning and preprocessing in the next step.

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: no step should require going back to an earlier cell, for example to refresh a variable's contents.

---

# Push files to Repo

* Save the collected dataset to a local folder.
* Push the dataset to the repository.

In [8]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices.csv", index=False)
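
Before pushing, it can be worth confirming that the saved file loads back with the expected shape. The sketch below is one possible check, not part of the original workflow. `save_and_verify` is a hypothetical helper, and the example writes toy data to a temporary folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(df, output_dir, filename):
    """Save `df` as CSV, reload it, and confirm the shape is unchanged.

    Hypothetical helper for illustration.
    """
    output_dir = Path(output_dir)
    # parents=True creates intermediate folders; exist_ok=True allows re-runs
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / filename
    df.to_csv(target, index=False)
    reloaded = pd.read_csv(target)
    if reloaded.shape != df.shape:
        raise ValueError(
            f"Shape changed on reload: {df.shape} -> {reloaded.shape}"
        )
    return target


# Toy data for illustration only
toy = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_and_verify(
        toy, Path(tmp) / "outputs" / "datasets" / "collection", "house_prices.csv"
    )
    print(saved.name)
```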

By saving the dataset to a local folder and pushing it to the repository, we ensure that our data is properly organized and easily accessible for further analysis and processing.

Note: This completes the current Notebook. The cell outputs can now be cleared, and changes to the workspace can be pushed to the GitHub repository.