# Data Collection Notebook

### Objectives:

+ Fetch data from Kaggle and save it as raw data.
+ Load and inspect the data and save it under inputs/datasets/raw.
+ Push the files to the GitHub repository.


### Inputs:

+ Kaggle JSON authentication token.
+ The recommended house price dataset posted by Code Institute on [Kaggle](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data).

### Outputs:

+ The Kaggle files were unzipped to:
  * inputs/datasets/raw/

### Additional Comments:

+ This notebook was written based on the guidelines provided in the walkthrough project 2: 'Churnometer'.
+ This notebook relates to the Data Understanding step of the CRISP-DM methodology.
+ This notebook and those that follow represent the learning outcomes of the Code Institute Predictive Analytics and Machine Learning module.

___

## Change working directory:

The following steps change the working directory from the current folder to its parent folder.

+ Access the current directory with os.getcwd().

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/PP5-Predictive-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.

+ os.path.dirname() gets the parent directory.
+ os.chdir() sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You have now set a new current directory")

You have now set a new current directory


Confirm the new current directory:

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/PP5-Predictive-Analysis'
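Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown earlier); the helper name `project_root` is hypothetical and not part of the original notebook:

```python
import os

def project_root(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` only when `path` is the notebooks folder.

    `notebooks_dir` is an assumption based on the directory shown above.
    """
    path = os.path.normpath(path)
    if os.path.basename(path) == notebooks_dir:
        return os.path.dirname(path)
    # Already outside the notebooks folder: leave the path unchanged.
    return path

# Safe to run more than once: a second call changes nothing.
os.chdir(project_root(os.getcwd()))
```

With this guard the cell can be executed repeatedly without drifting above the project root.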

## Fetch data from Kaggle

+ Install the Kaggle library with the command below:

In [4]:
! pip3 install kaggle==1.5.12


Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 7.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting python-slugify
  Downloading python_slugify-7.0.0-py2.py3-none-any.whl (9.4 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 15.9 MB/s eta 0:00:00
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=29b38c835388f4f282cf73fc08daefa52d867e49e8eed03e432ca4ed25c98fe7
  Stored in directory: /home/gitpod/.cache/pip/wheels/03/f3/c7/fc5a63bb33d22177609b06c5b4c714b5eb3f1b195ce9dc5e47
Successfully built kaggle
Installing collected packages: text-unidecode, python-s

* A Kaggle account is required at this point; without one, Kaggle will not allow the data to be downloaded.
  - Once an account has been created, the user can download a kaggle.json file.
* The kaggle.json file contains an authentication token, which is required to authenticate a data download from Kaggle.
* The kaggle.json token file must then be copied to the root directory of the project repository.
* Next, the user must set the Kaggle environment variable and restrict the token file's permissions to read and write for the owner only.

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
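The same two steps can also be done without a shell command. The sketch below is one possible pure-Python equivalent, assuming a POSIX file system; the helper name `secure_token` is hypothetical and only `KAGGLE_CONFIG_DIR` comes from the cell above:

```python
import os
import stat

def secure_token(token_path="kaggle.json"):
    """Restrict the token to owner read/write (like chmod 600) and
    point KAGGLE_CONFIG_DIR at the folder that contains it."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"{token_path} not found; copy it to the project root first"
        )
    # stat.S_IRUSR | stat.S_IWUSR is the octal mode 0o600.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = os.path.dirname(os.path.abspath(token_path))
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(token_path).st_mode)

# Example call, once kaggle.json is in the current directory:
# secure_token("kaggle.json")
```

Raising an explicit error when the token is missing gives a clearer message than the failure the Kaggle client would otherwise produce later.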


* Get the dataset path from the Kaggle URL.

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.14MB/s]
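The dataset path used above is the `owner/dataset` part of the Kaggle URL. As an illustration, it can be derived from the URL rather than typed by hand; the helper name `dataset_path_from_url` is hypothetical, and the sketch assumes URLs of the form `https://www.kaggle.com/datasets/<owner>/<dataset>`:

```python
from urllib.parse import urlparse

def dataset_path_from_url(url):
    """Extract 'owner/dataset' from a Kaggle dataset URL.

    Assumes the URL path looks like /datasets/<owner>/<dataset>.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"

print(dataset_path_from_url(
    "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
))
# → codeinstitute/housing-prices-data
```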


* The dataset is downloaded as a zip file, which must be unzipped before use. The cell below also deletes the zip file and the kaggle.json token once extraction succeeds.


In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json


Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
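Where the `unzip` command is not available, the extraction step could be sketched with the standard library instead. The helper name `extract_and_remove` is hypothetical; unlike the shell cell above, it leaves kaggle.json in place:

```python
import zipfile
from pathlib import Path

def extract_and_remove(folder):
    """Unzip every .zip archive found in `folder` into that same folder,
    then delete the archive. Returns the names of the extracted members."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted without error.
        archive.unlink()
    return extracted

# Example call with the destination folder used in this notebook:
# extract_and_remove("inputs/datasets/raw")
```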


In [8]:
! pip3 uninstall -y kaggle==1.5.12


Found existing installation: kaggle 1.5.12
Uninstalling kaggle-1.5.12:
  Successfully uninstalled kaggle-1.5.12


___

## Import packages and set display options:

In [9]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

___
