# **1 - Data Collection**

## 🧭 Objectives
- Fetch data from Kaggle and save it as raw data under `inputs/datasets/raw`.
- Inspect the data and save it under `outputs/datasets/collection`.

## 📥 Inputs
- Kaggle JSON file (`kaggle.json`) – authentication token

## 💾 Outputs
- `outputs/datasets/collection/house_prices_records.csv`
- `outputs/datasets/collection/inherited_houses.csv`

## 🗒️ Additional Comments
In a real-world project, data typically comes from internal or external business sources. For learning purposes, we're simulating this using Kaggle.

---

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-heritage-housing-issues/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-heritage-housing-issues'
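
Note that `os.chdir(os.path.dirname(current_dir))` moves up one level every time the cell runs, so re-running it leaves the project folder. A minimal sketch of an idempotent alternative follows. The helper name `find_project_root` is hypothetical, and it assumes the project root is the directory that contains the `jupyter_notebooks` folder shown in the path above.

```python
from pathlib import Path


def find_project_root(start, marker="jupyter_notebooks"):
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumption: the folder name taken from the path printed above.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each of its ancestors.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' at or above {start}")


# Possible use in the notebook (safe to re-run):
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts from the current directory itself, calling it from the project root returns the same root instead of climbing further.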

# Fetch data from Kaggle

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


In [5]:
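import os
# Tell the Kaggle CLI to look for kaggle.json in the current directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write only
! chmod 600 kaggle.json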
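
Before calling the Kaggle CLI, it can help to confirm that the token file is present and complete. The sketch below is one possible check, not part of the original notebook. The helper name `missing_token_fields` is hypothetical, and it assumes the token is a JSON object with `username` and `key` fields, the layout Kaggle uses for downloaded API tokens.

```python
import json
from pathlib import Path

# Assumed layout of kaggle.json: {"username": "...", "key": "..."}
REQUIRED_FIELDS = ("username", "key")


def missing_token_fields(token_path="kaggle.json"):
    """Return the required fields that are absent or empty in the token file."""
    path = Path(token_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; place your Kaggle token there first")
    token = json.loads(path.read_text(encoding="utf-8"))
    return [field for field in REQUIRED_FIELDS if not token.get(field)]
```

An empty list means both fields are present; anything else names what to fix before downloading.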

In [6]:
# Define dataset path
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

# Create destination folder if it doesn't exist
os.makedirs(DestinationFolder, exist_ok=True)

# Download the dataset
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

# Unzip the downloaded dataset, then delete the archive and the Kaggle token
! unzip -o {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Downloading housing-prices-data.zip to inputs/datasets/raw
100%|███████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 482kB/s]
100%|███████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 480kB/s]
Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
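
The `unzip` and `rm` shell commands above assume a Unix-like environment. As a rough, portable alternative, the same extract-then-delete step can be sketched with Python's standard `zipfile` module. The helper name `extract_and_remove_archives` is hypothetical, and unlike the cell above it does not delete `kaggle.json`.

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive.

    Returns the member names that were extracted.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)           # existing files are overwritten
            extracted.extend(zf.namelist())
        archive.unlink()                    # remove the archive once it is closed
    return extracted


# Possible use in the notebook:
# extract_and_remove_archives(DestinationFolder)
```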


---

# Load and Inspect Kaggle data

In [11]:
import os
import pandas as pd

# Find the actual folder name inside the raw directory
raw_root = "inputs/datasets/raw"
subfolders = next(os.walk(raw_root))[1]  # list of all subfolders
assert len(subfolders) == 1, "Expected exactly one unzipped folder in raw data."

data_folder = os.path.join(raw_root, subfolders[0])

# Load both datasets
main_df = pd.read_csv(os.path.join(data_folder, "house_prices_records.csv"))
client_df = pd.read_csv(os.path.join(data_folder, "inherited_houses.csv"))

# Preview both
print("🏠 Main Dataset: house_prices_records.csv")
display(main_df.head())

print("🏘️ Inherited Properties: inherited_houses.csv")
display(client_df.head())


🏠 Main Dataset: house_prices_records.csv


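|  | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |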


🏘️ Inherited Properties: inherited_houses.csv


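|  | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | ... | 11622 | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | ... | 14267 | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | ... | 13830 | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | ... | 9978 | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 |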


## Summary and Null Checks

In [12]:
# Check structure of main dataset
print("📋 Main Dataset Info:")
main_df.info()

# Null values in main dataset
print("\n🔍 Null Values in Main Dataset:")
print(main_df.isna().sum().sort_values(ascending=False).head(20))

# Structure of inherited houses
print("\n📋 Inherited Houses Info:")
client_df.info()

# Null values in inherited houses
print("\n🔍 Null Values in Inherited Houses:")
print(client_df.isna().sum().sort_values(ascending=False))

📋 Main Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   
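
Raw null counts are easier to judge next to the share of rows they represent. For instance, the `info()` output above lists `EnclosedPorch` with 136 non-null values out of 1460 entries, which means 1324 missing values, or about 90.7% of the rows. A minimal sketch of a helper that reports both figures per column follows. The name `null_summary` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Missing-value count and percentage per column, for columns with any nulls."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that have missing values, most affected first.
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)


# Possible use in the notebook:
# display(null_summary(main_df))
```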

In [13]:
# Create the output folder for the collected datasets (no error if it already exists)
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save both datasets to the outputs folder
main_df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
client_df.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)

```bash
git add .
git commit -m "Data collection notebook initial set-up"
git push
```