# **Data Collection**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under `outputs/datasets/collection`
  
## Inputs

* Kaggle JSON file (`kaggle.json`), the authentication token used to access the Kaggle dataset.
* Raw dataset from kaggle.com for house prices in Ames, Iowa.

## Outputs

* Generate Dataset: `outputs/datasets/collection/house_prices_records.csv`

---

# Change working directory

The notebooks for this project are stored in a subfolder called `jupyter_notebooks`, so the working directory must be changed to the parent folder before the rest of the notebook is run.
* We access the current directory with `os.getcwd()`.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5-house-price-predictor/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory.
* `os.chdir()` sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-house-price-predictor'
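
Running the `os.chdir()` cell twice would move one level too far up. As a minimal sketch of a re-run-safe alternative, the helper below only changes directory while the current folder is still `jupyter_notebooks`. The function name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent folder only while still inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Same effect as os.chdir(os.path.dirname(current_dir)) above
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: the second call leaves the directory unchanged
print(move_to_project_root())
```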

---

# Fetch Data from Kaggle

Use the `%pip` magic command to install the `kaggle` package.

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


<div style="border-left: 4px solid #ffa500; padding: 10px; background-color: #f9f9f9; color: #000">
<strong>IMPORTANT:</strong> Remember to place your <code style="color: red">kaggle.json</code> file in the project root directory, as it contains your API key.
</div>

Once the `kaggle.json` file is in your root directory, run the cell below so that the token is recognized in the session.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

The dataset for this project can be found at [https://www.kaggle.com/datasets/codeinstitute/housing-prices-data](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

1. Extract the dataset path from the URL (everything after `www.kaggle.com/datasets/`) and assign it to a new variable.
    - In this case: `codeinstitute/housing-prices-data`

2. Define a new variable for the destination folder.
3. Download the data from Kaggle.
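
Step 1 can also be done programmatically instead of by hand. The sketch below uses the standard library's `urllib.parse` to pull the dataset path out of the URL. The helper name `kaggle_dataset_path` is hypothetical, and the notebook itself simply hard-codes the string.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return everything after 'www.kaggle.com/datasets/' in a Kaggle dataset URL."""
    path = urlparse(url).path  # e.g. "/datasets/<owner>/<dataset>"
    prefix = "/datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return path[len(prefix):].strip("/")


url = "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
print(kaggle_dataset_path(url))  # codeinstitute/housing-prices-data
```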

In [7]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.04MB/s]


Unzip the file, remove the zip file and delete the `kaggle.json` authentication token

In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
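
As the output above shows, the CSV files land in a nested subfolder rather than directly in the destination folder. A hedged, shell-free sketch of the unzip step that also reports where the CSV files ended up could use the standard `zipfile` module. The helper name `extract_dataset` is hypothetical, and unlike the cell above it does not delete `kaggle.json`.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, destination):
    """Unzip the archive, delete it, and return every CSV found under destination."""
    zip_path = Path(zip_path)
    destination = Path(destination)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
    zip_path.unlink()  # same as `rm {DestinationFolder}/*.zip` for a single archive
    # Search recursively because the archive keeps its CSVs in nested folders
    return sorted(destination.rglob("*.csv"))
```

For this project the call would be `extract_dataset("inputs/datasets/raw/housing-prices-data.zip", "inputs/datasets/raw")`, using the archive name shown in the download output.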


---