# Data Collection Notebook

### Objectives:

+ Fetch data from Kaggle and save the data as raw.
+ Load and inspect the data and save it under inputs/datasets/raw.
+ Push the files to the GitHub repository.


### Inputs:

+ Kaggle JSON authentication token.
+ The recommended house price dataset posted by Code Institute on [Kaggle](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data).

### Outputs:

+ The Kaggle files were unzipped to:
  * inputs/datasets/raw/

### Additional Comments:

+ This notebook was written based on the guidelines provided in the walkthrough project 2: 'Churnometer'.
+ This notebook relates to the Data Understanding step of the CRISP-DM methodology.
+ This notebook and the ones that follow represent the learning outcome of the Code Institute Predictive Analytics and Machine Learning module.

___

## Change working directory:

The following action will change the working directory from its current folder to its parent folder. 

+ Access the current directory with os.getcwd().

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.

+ os.path.dirname() gets the parent directory.
+ os.chdir() defines the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You have now set a new current directory")

Confirm the new current directory:

In [None]:
current_dir = os.getcwd()
current_dir
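
Re-running the `os.chdir()` cell after `current_dir` has been refreshed can move the working directory up one more level than intended. As a minimal sketch, the parent path can be computed without any side effect and inspected before changing directory. The helper name `parent_directory` is hypothetical and not part of the original notebook.

```python
import os


def parent_directory(path: str) -> str:
    """Return the parent folder of `path` without changing the working directory."""
    # os.path.abspath normalises the path; os.path.dirname drops the last component.
    return os.path.dirname(os.path.abspath(path))


# Inspect the target first, then call os.chdir() on it only if it looks right.
print(parent_directory(os.getcwd()))
```

Printing the target first makes it easy to spot when the notebook is already at the project root.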

## Fetch data from Kaggle

+ Install the Kaggle library with the command below:

In [None]:
! pip3 install kaggle==1.5.12

* A Kaggle account is required at this point, because Kaggle does not allow the data to be used without one.
  - Once an account has been created, the user can download a kaggle.json file.
* The kaggle.json file contains an authentication token, which is required to authenticate a data download from Kaggle.
* The kaggle.json token file must then be copied to the root directory of the project repository.
* Next, the user must set the Kaggle environment variable and restrict the token file's permissions to read and write for the owner only.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
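
The same two steps can also be done in pure Python, which avoids relying on a shell `chmod`. This is only a sketch: the helper name `restrict_token_permissions` is hypothetical, and it assumes the token file has already been copied into the project.

```python
import os
import stat


def restrict_token_permissions(token_path: str) -> str:
    """Set owner-only read/write on the token and point Kaggle at its folder."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Token file not found: {token_path}")
    # Equivalent to `chmod 600`: read and write for the owner, nothing for others.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    config_dir = os.path.dirname(os.path.abspath(token_path))
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return config_dir


# Only act when the token is actually present in the current directory.
if os.path.isfile("kaggle.json"):
    print(restrict_token_permissions("kaggle.json"))
```

Raising `FileNotFoundError` early gives a clearer message than a failed download later on.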

* Get the dataset path from the Kaggle URL.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* The dataset is downloaded as a zip file, which must be unzipped before it can be used.


In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
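
Where the `unzip` command is not available, the extraction step can be sketched with the standard-library `zipfile` module instead. The helper name `extract_archives` and its `remove_archive` flag are hypothetical, and removing `kaggle.json` is left out of this sketch.

```python
import glob
import os
import zipfile


def extract_archives(folder: str, remove_archive: bool = True) -> list:
    """Extract every .zip file in `folder` into `folder` and list the extracted names."""
    extracted = []
    for archive_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        if remove_archive:
            # Mirrors the `rm {DestinationFolder}/*.zip` step of the shell version.
            os.remove(archive_path)
    return extracted
```

In the notebook this would be called as `extract_archives(DestinationFolder)`.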

In [None]:
! pip3 uninstall -y kaggle==1.5.12

___

## Import packages and set environment variables:

In [None]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

___


## Load and Inspect the House Price Records:

* Read the house_prices_records dataset CSV file into a pandas DataFrame and inspect it.

In [None]:
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)
df.head()

* This will display the data summary information.

In [None]:
df.info()

* Note that there is no 'id' field to mark row uniqueness. Therefore, there is no need to check for duplicate data.
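
If a quick sanity check is still wanted, fully identical rows can be counted even without an id column. This is a sketch on toy data: the helper name `count_duplicate_rows` is hypothetical and the values are made up for illustration, not taken from the dataset.

```python
import pandas as pd


def count_duplicate_rows(frame: pd.DataFrame) -> int:
    """Count rows that repeat an earlier row across all columns."""
    # DataFrame.duplicated() marks every repeat after the first occurrence.
    return int(frame.duplicated().sum())


# Toy frame with made-up values: the second row repeats the first.
toy = pd.DataFrame({"Area": [1500, 1500, 900], "SalePrice": [200000, 200000, 120000]})
print(count_duplicate_rows(toy))
```

In the notebook the call would be `count_duplicate_rows(df)`.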

___

## Load and inspect the inherited house records:

* After checking the house price records, we run the same checks on the inherited houses records:
   - Read the inherited_houses dataset CSV file into a pandas DataFrame to get a better overview of the data.

In [None]:
df_refurbished = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_refurbished.shape)
df_refurbished

* Display the dataframe summary information: 

In [None]:
df_refurbished.info()

___


## Conclusion and the following steps:

* The house price dataset and the inherited houses dataset differ in their data types:

  - In the house price dataset some features are of type int, whilst they are of type float in the inherited houses dataset.
  - This difference does not affect the analysis of the data.
  - SalePrice (which exists only in the house price dataset) is of type integer.

* Now that we have the data required for the project, we can move on to cleaning the data.

   - Keep in mind, garbage in, garbage out. ☺️ 
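
The dtype differences noted above can be listed programmatically rather than by comparing the two `info()` outputs by eye. This is a minimal sketch: the helper name `dtype_differences` is hypothetical, and the demo frames hold made-up values rather than real records.

```python
import pandas as pd


def dtype_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """List the shared columns whose dtypes differ between two DataFrames."""
    shared = [column for column in left.columns if column in right.columns]
    summary = pd.DataFrame({
        "left_dtype": left[shared].dtypes.astype(str),
        "right_dtype": right[shared].dtypes.astype(str),
    })
    return summary[summary["left_dtype"] != summary["right_dtype"]]


# Toy frames with made-up values; SalePrice exists only on the left side.
left_demo = pd.DataFrame({"A": [1, 2], "B": [1.0, 2.0], "SalePrice": [100, 200]})
right_demo = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
print(dtype_differences(left_demo, right_demo))
```

In the notebook the call would be `dtype_differences(df, df_refurbished)`. Columns present in only one dataset, such as SalePrice, are skipped because only shared columns are compared.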