# **Data Collection Notebook**

Part of CRISP-DM **Data Understanding**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.
* Raw downloaded data file

## Outputs

* Generate and save datasets:
    * outputs/datasets/collection/house_price_records.csv
    * outputs/datasets/collection/inherited_houses.csv

## Additional Comments

* The first dataset in the outputs above is the data used to build our machine learning model(s). The second file consists of the inherited houses whose prices our client wants to predict.


---

# Install Python packages in the notebook

In [None]:
! pip3 install -r /workspace/heritage-housing-issues/requirements.txt

# Change working directory

* The notebooks are assumed to live in a subfolder of the project, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
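
Note that the cell above moves one level up every time it runs, so re-running it leaves the project folder. As a minimal sketch of a re-runnable alternative, the helper below walks upward until it finds a marker file. The function name `find_project_root` is hypothetical, and using `requirements.txt` as the marker is an assumption based on the install step earlier in this notebook.

```python
import os
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the nearest directory at or above `start` that contains `marker`.

    Returns None when no such directory exists.
    `marker` is an assumption: any file that only lives in the project root works.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None


root = find_project_root(os.getcwd())
if root is not None:
    os.chdir(root)  # safe to re-run: from the root, the search returns the root again
print("Current directory:", os.getcwd())
```

Because the search starts at the current directory itself, running the cell a second time does not move any further up.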

# Fetch raw data from Kaggle

To collect the data for our project, we will be using the Kaggle API. First, we need to install the Kaggle package.

In [None]:
! pip3 install kaggle==1.5.12

To access the Kaggle API, we need to have an authentication token available in our workspace directory. This token is in the form of a file named 'kaggle.json'. If you don't have this file available, you can create one by following these steps:

1. Log in to your existing Kaggle account or create a new one.
2. Click on your user profile picture, then on “Settings” from the dropdown menu.
3. Scroll down to the section called API.
4. Click Expire API Token to remove any previous tokens.
5. Click Create New API Token to generate a fresh authentication token and download the kaggle.json file.

Once you have the kaggle.json file, transfer it to your working directory and make sure it is named correctly. Then, run the following code to make the token recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
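
Before calling the API, it can help to confirm that the token file is present and well formed. The sketch below is one way to do that; the helper name `check_kaggle_token` is hypothetical, and it only checks that the file is valid JSON with non-empty `username` and `key` fields. It does not contact Kaggle, so it cannot tell whether the token has expired.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found with the token file (empty if none)."""
    token_file = Path(path)
    if not token_file.is_file():
        return [f"{token_file} not found"]
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return [f"{token_file} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{token_file} does not contain a JSON object"]
    # A Kaggle token file holds a username and an API key
    return [
        f"missing or empty field: {field}"
        for field in ("username", "key")
        if not token.get(field)
    ]


problems = check_kaggle_token()
print(problems or "kaggle.json looks usable")
```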

The dataset we will be using is located at the following URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data. We can define the Kaggle path and destination folder as follows:

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

We can then download the data using the following command:

In [None]:
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

After downloading the data, we unzip the archive into the destination folder, then delete both the zip file and the kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
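
The shell command above relies on `unzip` being installed. As a sketch of a pure-Python alternative, the standard-library `zipfile` module can do the same extraction. The helper name `extract_and_remove_zips` is hypothetical, and unlike the shell command it leaves kaggle.json in place.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into `folder`, then delete the archive.

    Returns the names of all extracted members.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return []
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive only after a successful extraction
    return extracted


print(extract_and_remove_zips("inputs/datasets/raw"))
```

If extraction raises an error, the archive is not deleted, so the download does not have to be repeated.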

---

# Load and Inspect Kaggle Data

* Import the pandas library and load the Kaggle data as a pandas DataFrame.

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
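
The path above contains a timestamped folder name that comes from the downloaded archive. As a small sketch, the extracted CSV files can be listed first, so the path passed to `pd.read_csv` can be copied from real output rather than typed by hand. The helper name `find_csv_files` is hypothetical.

```python
from pathlib import Path


def find_csv_files(folder="inputs/datasets/raw"):
    """Return every CSV file below `folder` as a sorted list of POSIX-style paths."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p.as_posix() for p in folder.rglob("*.csv"))


for csv_path in find_csv_files():
    print(csv_path)
```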

* View the first few rows of the data.
* Get a summary of the DataFrame.

In [None]:
display(df.head())  # display() is needed: only a cell's last expression is rendered automatically
df.info(max_cols=24)

* Identify columns with missing values.
* Check for duplicate rows.

In [None]:
display(df.isnull().sum())  # number of missing values per column
df[df.duplicated(subset=None)]  # rows that duplicate an earlier row, if any

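* List the unique values of each text column, and count the unique values of other columns that have fewer than 11.

In [None]:
for col in df:
    if df[col].dtype == 'object':
        # Text column: list its distinct values
        print(col, '-', df[col].unique())
    elif df[col].unique().size < 11:
        # Non-text column with few distinct values: print only how many there are
        print(col, '-', df[col].unique().size)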
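
The checks above can also be gathered into a single table. The sketch below is one way to do that and runs on a toy DataFrame; the helper name `summarise_columns`, the column names and the values are all made up for illustration and are not taken from the Kaggle dataset. Note that `nunique()` ignores missing values, whereas `unique().size` counts NaN as a value.

```python
import pandas as pd


def summarise_columns(frame):
    """Return one row per column: dtype, missing count, and distinct non-null values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(),  # excludes NaN by default
    })


# Toy data for illustration only (hypothetical columns and values)
toy = pd.DataFrame({
    "rooms": [3, 4, None, 3],
    "quality": ["Gd", "TA", "Gd", None],
})
summary = summarise_columns(toy)
print(summary)
print("Duplicate rows:", int(toy.duplicated().sum()))
```

Calling `summarise_columns(df)` on the loaded data would give the same overview for all columns at once.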

Our preliminary assessment of the data reveals:

* The dataset contains 1460 rows and 24 columns.
* The columns have a mix of data types, including integers, floats, and objects.
* Several columns have missing values, which will require further attention.
* Some columns appear to be categorical, based on the small number of unique values.

We'll need to investigate these findings further and perform additional data cleaning and preprocessing in the next step.

---

# Push files to Repo

* Save the inspected dataset under outputs/datasets/collection.
* Push the dataset to the repository.

In [None]:
import os
# Create the output folder; exist_ok=True makes the cell safe to re-run
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices.csv", index=False)

By saving the dataset to a local folder and pushing it to the repository, we ensure that our data is properly organized and easily accessible for further analysis and processing.

Note: This completes the current Notebook. The cell outputs can now be cleared, and changes to the workspace can be pushed to the GitHub repository.