# **1 - Data Collection**

## 🧭 Objectives
- Fetch data from Kaggle and save it as raw data.
- Inspect both datasets and save them under `outputs/datasets/collection`.

## 📥 Inputs
- Kaggle JSON file (`kaggle.json`) – authentication token

## 💾 Outputs
- `outputs/datasets/collection/house_prices_records.csv`
- `outputs/datasets/collection/inherited_houses.csv`

## 🗒️ Additional Comments
In a real-world project, data typically comes from internal or external business sources. For learning purposes, we're simulating this using Kaggle.

---

# Change working directory

In [26]:
import os
current_dir = os.getcwd()
current_dir

'/'

In [27]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [28]:
current_dir = os.getcwd()
current_dir

'/'
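
The recorded output shows `'/'` both before and after the change, so moving up one level had no effect in this run. A more defensive approach is to search upward for a file that marks the project root. The sketch below assumes a hypothetical helper name (`find_project_root`) and a hypothetical marker file (`README.md`); neither comes from the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumption: pick any file that only exists at the
    project root (for example README.md or requirements.txt).
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(
        f"No directory containing {marker!r} found at or above {start}"
    )
```

In the notebook this could be used as `os.chdir(find_project_root(os.getcwd()))`, which fails loudly when the root cannot be found instead of silently staying in `/`.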

# Fetch data from Kaggle

In [23]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


In [29]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
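
The `chmod` error above means `kaggle.json` was not in the working directory when the cell ran. A minimal sketch of the same step in pure Python, which checks for the token before changing its permissions, might look like this. The helper name `secure_kaggle_token` is hypothetical, and the permission check assumes a POSIX file system.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json in `config_dir` to owner read/write and return its path."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(
            f"Expected the Kaggle token at {token}; upload kaggle.json there first."
        )
    # Equivalent to `chmod 600 kaggle.json`.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    # Tell the Kaggle CLI where to look for the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(token.parent)
    return token
```

Calling `secure_kaggle_token(os.getcwd())` would replace the two configuration lines in the cell above and stop early with a clear message when the token is missing.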


In [25]:
# Define dataset path
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

# Create destination folder if it doesn't exist
os.makedirs(DestinationFolder, exist_ok=True)

# Download the dataset
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

# Unzip the downloaded dataset
! unzip -o {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

PermissionError: [Errno 13] Permission denied: 'inputs'
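
The `PermissionError` is consistent with the working directory still being `/` (as recorded earlier), where creating `inputs` is not permitted. Once the notebook runs from the project folder, the unzip-and-clean-up step can also be done without shell commands. The sketch below uses only the standard library; `extract_archives` is a hypothetical helper name, and it deliberately leaves `kaggle.json` in place, unlike the shell command above.

```python
import zipfile
from pathlib import Path


def extract_archives(folder):
    """Extract every .zip file in `folder` into `folder`, then delete the archive.

    Returns the list of member names that were extracted.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)  # overwrites existing files, like `unzip -o`
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive once its contents are extracted
    return extracted
```

After the `kaggle datasets download` call, `extract_archives(DestinationFolder)` would return the extracted member names, which also shows whether the archive contains a subfolder.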

---

# Load and Inspect Kaggle data

In [16]:
import os
import pandas as pd

# Find the actual folder name inside the raw directory
raw_root = "inputs/datasets/raw"
subfolders = next(os.walk(raw_root))[1]  # list of all subfolders
assert len(subfolders) == 1, "Expected exactly one unzipped folder in raw data."

data_folder = os.path.join(raw_root, subfolders[0])

# Load both datasets
df_main = pd.read_csv(os.path.join(data_folder, "house_prices_records.csv"))
df_client = pd.read_csv(os.path.join(data_folder, "inherited_houses.csv"))

# Preview both
print("🏠 Main Dataset: house_prices_records.csv")
display(df_main.head())

print("🏘️ Inherited Properties: inherited_houses.csv")
display(df_client.head())


AssertionError: Expected exactly one unzipped folder in raw data.
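
The assertion fails whenever the raw folder does not contain exactly one subfolder, for example when the download did not run or when the archive unpacks its files directly into `inputs/datasets/raw`. A sketch that searches for each file by name avoids depending on the folder layout; `locate_file` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def locate_file(root, filename):
    """Return the single path named `filename` anywhere under `root`."""
    # rglob searches `root` and all of its subfolders.
    matches = sorted(Path(root).rglob(filename))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one {filename!r} under {root}, found {len(matches)}"
        )
    return matches[0]
```

With this helper, the load step could read `pd.read_csv(locate_file(raw_root, "house_prices_records.csv"))`, and the same for `inherited_houses.csv`.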

## Summary and Null Checks

In [None]:
# Check structure of main dataset
print("📋 Main Dataset Info:")
df_main.info()

# Null values in main dataset
print("\n🔍 Null Values in Main Dataset:")
print(df_main.isna().sum().sort_values(ascending=False).head(20))

# Structure of inherited houses
print("\n📋 Inherited Houses Info:")
df_client.info()

# Null values in inherited houses
print("\n🔍 Null Values in Inherited Houses:")
print(df_client.isna().sum().sort_values(ascending=False))

📋 Main Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   
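
The `info()` output shows several columns with missing values (for example `EnclosedPorch`, with 136 non-null entries out of 1460). Counts alone are hard to compare, so a compact table with percentages can help. The following is a sketch under the assumption that pandas is available; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd


def missing_summary(df):
    """Return count, percentage and dtype for every column with missing values."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(df) * 100).round(2),
            "dtype": df.dtypes.astype(str),
        }
    )
    # Keep only columns that actually have gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(df_main)` and `missing_summary(df_client)` would give one table per dataset.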

In [None]:
# Create output folder for collected dataset
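# exist_ok=True already covers the case where the folder exists;
# any other error should stop the cell instead of being printed and ignored.
os.makedirs('outputs/datasets/collection', exist_ok=True)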

# Save both datasets to the collection folder
df_main.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
df_client.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)

```bash
git add .
git commit -m "Data collection notebook initial set-up"
git push
```