# **1. Data Collection**

## Objectives
1. Set up the environment
2. Fetch and load the data
3. Inspect and save the data

## Inputs
1. Kaggle API
2. Kaggle Dataset Path
3. Python packages

## Outputs
1. Environment setup confirmation
2. Downloaded dataset
3. Loaded dataframes
4. Initial data inspection
5. Saved CSV files
6. Documentation

## Additional Comments
* Ensure that the Kaggle API key ('kaggle.json') is stored securely and properly configured.
* Handle any potential errors during the setup, data fetching, and data saving processes.
* Document any assumptions or observations made during the data inspection phase to inform further analysis.

---

# Install packages

In [1]:
%pip install -r /workspace/heritage-housing-mvp/requirements.txt

Collecting numpy==1.18.5 (from -r /workspace/heritage-housing-mvp/requirements.txt (line 1))
  Downloading numpy-1.18.5-cp38-cp38-manylinux1_x86_64.whl.metadata (2.1 kB)
Collecting pandas==1.4.2 (from -r /workspace/heritage-housing-mvp/requirements.txt (line 2))
  Downloading pandas-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting matplotlib==3.3.1 (from -r /workspace/heritage-housing-mvp/requirements.txt (line 3))
  Downloading matplotlib-3.3.1-cp38-cp38-manylinux1_x86_64.whl.metadata (5.7 kB)
Collecting seaborn==0.11.0 (from -r /workspace/heritage-housing-mvp/requirements.txt (line 4))
  Downloading seaborn-0.11.0-py3-none-any.whl.metadata (2.2 kB)
Collecting ydata-profiling==4.4.0 (from -r /workspace/heritage-housing-mvp/requirements.txt (line 5))
  Downloading ydata_profiling-4.4.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting plotly==4.12.0 (from -r /workspace/heritage-housing-mvp/requirements.txt (line 6))
  Downloading plotly-4.12.0-p

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-mvp/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-mvp'
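
Re-running the parent-directory cell after the confirmation cell can move the session one level too far up. A minimal sketch of an idempotent alternative, assuming the project root can be recognized by a marker file such as requirements.txt (the helper name find_project_root and the choice of marker are illustrative assumptions, not part of the original notebook):

```python
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the closest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    # Check the starting folder first, then walk up through its parents.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} found above {start}")
```

In the notebook this could be called as os.chdir(find_project_root(os.getcwd())). Running it again from the project root returns the same folder, so the cell is safe to repeat.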

---

# Fetch data from Kaggle

Install the Kaggle package to fetch the data.

In [2]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 3.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 13.7 MB/s eta 0:00:00
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73027 sha256=d6c18b5f4f25b6299687b2c43f4ee9f6c5a8f02a4a4b07f5657880a

Download your kaggle.json file from your Kaggle account and place it in the project's main directory.

Then run the cell below so that the token is recognized in the session.

In [7]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
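
The shell call above fails if kaggle.json has not yet been placed in the working directory. The same step can be sketched in plain Python with an explicit check. The helper name configure_kaggle_token is a hypothetical choice for illustration. KAGGLE_CONFIG_DIR is the environment variable already used in the cell above.

```python
import os
import stat
from pathlib import Path


def configure_kaggle_token(config_dir):
    """Point the Kaggle client at `config_dir` and restrict kaggle.json to its owner."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    # Equivalent of `chmod 600`: read and write permission for the owner only.
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path(config_dir))
    return token
```

Calling configure_kaggle_token(os.getcwd()) would reproduce the cell above while raising a clear error when the token file is missing.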

We are using the following Kaggle URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data

Get the dataset path from the Kaggle URL

* When viewing the dataset on Kaggle, take the part of the URL that follows https://www.kaggle.com/datasets/
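
As a small illustration of this rule, the dataset path can also be derived from the URL programmatically. This is only a sketch: the function name kaggle_dataset_path is hypothetical, and it assumes the path has the form owner/dataset, optionally preceded by a datasets segment.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: dataset pages look like /datasets/<owner>/<dataset>.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[:2])


print(kaggle_dataset_path("https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"))
# prints: codeinstitute/housing-prices-data
```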

Define the Kaggle dataset path and the destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.05MB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
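
The unzip step can also be done without shell commands, which helps when the notebook runs somewhere unzip is not installed. A minimal sketch using the standard-library zipfile module follows. The helper name extract_and_remove_archives is an illustrative assumption, and unlike the shell cell it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted without errors.
        archive.unlink()
    return extracted
```

Calling extract_and_remove_archives(DestinationFolder) would return the list of extracted member names, which is handy for locating the CSV files inside the nested folder.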


---

# Load and inspect the Kaggle data

In [11]:
import pandas as pd
# Plain string: the f-prefix is not needed because the path has no placeholders
df1 = pd.read_csv("/workspace/heritage-housing-mvp/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df1.head()

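| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |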


In [14]:
# pandas is already imported above; plain string because the path has no placeholders
df2 = pd.read_csv("/workspace/heritage-housing-mvp/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df2.head()

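| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | ... | 11622 | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | ... | 14267 | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | ... | 13830 | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | ... | 9978 | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 |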


Get a summary

In [15]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

The dataset has 24 columns spanning three data types (int64, float64 and object), and several columns contain missing values.
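
The summary above also shows that several columns have fewer non-null entries than rows. A compact per-column overview of data types and missing values can be sketched as follows. The helper name summarize_columns is hypothetical, and the toy frame uses made-up values that only mimic the structure of df1.

```python
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing count and missing share (%) for each column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for df1; the values are illustrative only.
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961],
    "EnclosedPorch": [0.0, None, 0.0, None],
    "BsmtExposure": ["No", "Gd", "Mn", "No"],
})
print(summarize_columns(toy))
```

In the notebook, summarize_columns(df1) would list the columns most affected by missing data at the top, which can guide the later cleaning step.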

Check for duplicates

In [16]:
# True if at least one row is an exact duplicate of an earlier row
df1.duplicated().any()

False

Since the preliminary objective is to predict prices for the 4 inherited houses, the dataset may lack some features (columns) needed for an accurate prediction.
* The dataset lacks any feature that represents the location of a house, which may influence the sale price. For example, proximity to the nearest school or town centre can also affect house prices.
* The dataset has no column for the time of sale. Since the housing market is affected by the state of the national and global economy, this is also an important feature to have in order to avoid bias.

These two gaps give a preliminary indication that a regression model is not well suited to this task, and that classification modelling seems more suitable and would give a more honest estimate for the inherited houses.

---

# Save the outputs

In [17]:
try:
    path = os.path.join(os.getcwd(), 'outputs/datasets/collection')
    os.makedirs(path, exist_ok=True)  # do not fail if the folder already exists
except Exception as e:
    print(e)

In [18]:
try:
    df1.to_csv(os.path.join(path, 'house_prices_records.csv'), index=False)
except Exception as e:
    print(e)

In [19]:
try:
    df2.to_csv(os.path.join(path, 'inherited_houses.csv'), index=False)
except Exception as e:
    print(e)

---

# Push files to Repo

Clear the cell outputs and push the files to the repo.

---

# Conclusion

In this notebook, the environment was successfully set up, the data was fetched from Kaggle, and an initial data inspection was performed.

#### Next Steps:
* Data visualization and cleaning