# **Data Collection Heritage-Housing-Project**

## Objectives

* Fetch the Heritage Housing dataset from the corresponding Kaggle repository  
* Store the dataset in a designated local directory for further processing  
* Inspect the data and save it under `outputs/datasets/collection`  
* Provide a structured step-by-step guide to load the dataset for further use  

## Inputs

* Kaggle API credentials provided via a `kaggle.json` file  
* Kaggle repository URL: https://www.kaggle.com/codeinstitute/housing-prices-data

## Outputs

* Raw dataset files downloaded from Kaggle  
* Raw files stored under `inputs/datasets/raw`, with working copies saved to `outputs/datasets/collection`

## Additional Comments

* Ensure that the Kaggle API token is correctly set up before running the notebook  
* Use version control to track changes to data download scripts  

---

# Change working directory

* The notebooks are assumed to live in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\PabloGalindo\\Coding-Institute\\PMS5\\heritage-housing-ml-pgz\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\PabloGalindo\\Coding-Institute\\PMS5\\heritage-housing-ml-pgz'
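Re-running cell `In [2]` moves the working directory up one more level each time. The following is a minimal sketch of an idempotent alternative. It assumes the notebooks live in a folder named `jupyter_notebooks` (taken from the path shown above); the helper name `move_to_project_root` is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only while still inside the notebook folder.

    `notebook_dir_name` is an assumption based on the path shown above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # On a second call the folder name no longer matches, so nothing changes
    return Path.cwd()
```

Calling `move_to_project_root()` twice leaves the working directory at the project root, whereas running `os.chdir(os.path.dirname(...))` twice would end up one level too high.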

# Fetch Data from Kaggle

## Intro

To retrieve the Heritage Housing dataset from Kaggle, follow the structured steps below. This ensures a reproducible and reliable data acquisition process.

Before proceeding, make sure the `kaggle.json` file containing your API credentials is available in the project root, which is the working directory set in the previous section (the parent of the notebooks folder). If the file is not present, download it manually from your Kaggle account first.

### Step-by-Step Instructions

1. **Install Required Packages**  
   Ensure the `kaggle` Python package is installed in your environment. This is required to authenticate and download data via the Kaggle API.

2. **Load `kaggle.json`**  
   Locate the `kaggle.json` file and set the `KAGGLE_CONFIG_DIR` environment variable to the directory that contains it, so the Kaggle API can authenticate the dataset download. Ensure the file is located in the working directory before proceeding.


3. **Download the Dataset**  
   Use the Kaggle API to fetch the dataset from the following URL:  
   [https://www.kaggle.com/codeinstitute/housing-prices-data](https://www.kaggle.com/codeinstitute/housing-prices-data)  
   The data should be extracted and saved in a predefined location, such as `inputs/datasets/raw`.





Step 1 – Install Required Packages

In [55]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.





Step 2 – Load kaggle.json

In [4]:
import os

kaggle_json_path = os.path.abspath("kaggle.json")

if os.path.exists(kaggle_json_path):
    os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
    print("kaggle.json found and environment variable set.")
else:
    print("kaggle.json not found. Please download it from your Kaggle account.")

kaggle.json found and environment variable set.
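The check above only confirms that the file exists. As a rough sketch, the token can also be validated before any download is attempted. A Kaggle token file is a JSON object with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_token(path: str = "kaggle.json") -> bool:
    """Return True if the file exists and holds non-empty 'username' and 'key' fields."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # The file exists but is not valid JSON
        return False
    if not isinstance(token, dict):
        return False
    return all(token.get(field) for field in ("username", "key"))
```

A `False` result here points to a missing or malformed token, which is easier to diagnose than an authentication error raised later by the Kaggle CLI.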


Step 3 – Download the Dataset

In [5]:
data_dir = os.path.join("inputs", "datasets","raw")
os.makedirs(data_dir, exist_ok=True)
!kaggle datasets download -d codeinstitute/housing-prices-data -p {data_dir} --unzip

Downloading housing-prices-data.zip to inputs\datasets\raw




  0%|          | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████| 49.6k/49.6k [00:00<00:00, 1.35MB/s]
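The `!` shell escape used above works only inside IPython or Jupyter. Here is a minimal sketch of the same call for a plain Python script, using the same CLI flags as the cell above. The function names `build_download_command` and `download_dataset` are hypothetical.

```python
import shutil
import subprocess


def build_download_command(dataset: str, dest: str) -> list:
    """Assemble the CLI call from the cell above as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest, "--unzip"]


def download_dataset(dataset: str, dest: str) -> bool:
    """Run the download if the kaggle CLI is on PATH; return True on success."""
    if shutil.which("kaggle") is None:
        print("kaggle CLI not found on PATH.")
        return False
    # check=False: report failure through the return value instead of raising
    result = subprocess.run(build_download_command(dataset, dest), check=False)
    return result.returncode == 0
```

For this project the call would be `download_dataset("codeinstitute/housing-prices-data", data_dir)`. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.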


---

# Load and Inspect Kaggle data

## Intro

In this section, we will load the dataset downloaded from Kaggle into a Pandas DataFrame and perform an initial inspection. This includes previewing the structure, checking the number of records and columns, and verifying the data types. After verifying that the data has been successfully loaded, we will save a working copy to the `outputs/datasets/collection` directory for future use in the data preparation and modeling stages.

### Step-by-Step Guide

0. **List retrieved files**  
   Print the filenames and extensions of the dataset files downloaded from Kaggle to confirm contents.

1. **Load the datasets**  
   Read both CSV files into Pandas DataFrames from the download location.

2. **Preview the data**  
   Use `.head()` to view the first few rows and confirm the format.

3. **Check basic metadata**  
   Use `.shape` to get the number of rows and columns, `.info()` to understand data types and null entries, and `.describe()` to summarize numeric values.

4. **Check for missing values**  
   Identify any columns with null entries that may need cleaning later.

5. **Save a working copy**  
   Save the inspected DataFrame into the `outputs/datasets/collection/` directory for downstream use.
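The inspection steps listed above can be sketched on a tiny in-memory sample before they are applied to the real files. The three rows below are made up for illustration and are not taken from the Kaggle data; only the column names mirror the dataset.

```python
import io

import pandas as pd

# Made-up sample with one missing value (not the Kaggle data)
csv_text = (
    "GrLivArea,KitchenQual,SalePrice\n"
    "1500,Gd,200000\n"
    "1200,TA,150000\n"
    ",Gd,180000\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Shape: number of rows and columns
n_rows, n_cols = df.shape

# Missing values: keep only the columns that have at least one null
missing = df.isnull().sum()
missing = missing[missing > 0]

# Numeric columns, i.e. the ones .describe() summarizes by default
numeric_cols = df.select_dtypes(include="number").columns.tolist()

print(df.head())
print(f"{n_rows} rows, {n_cols} columns")
print(missing)
print(numeric_cols)
```

The same calls are used on `df_records` and `df_inherited` in the cells that follow.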


Step 0 – List retrieved files

In [6]:
import os

collected_files = []

for root, _, files in os.walk(data_dir):
    for file in files:
        full_path = os.path.join(root, file)
        rel_path = os.path.relpath(full_path, data_dir)
        ext = os.path.splitext(file)[1]
        collected_files.append((rel_path, ext))

if collected_files:
    print("Retrieved files from data_dir (including subfolders):")
    for rel_path, ext in collected_files:
        print(f"- {rel_path} (Extension: {ext})")
else:
    print("No files found in data_dir.")

Retrieved files from data_dir (including subfolders):
- house-metadata.txt (Extension: .txt)
- house-price-20211124T154130Z-001\house-price\house_prices_records.csv (Extension: .csv)
- house-price-20211124T154130Z-001\house-price\inherited_houses.csv (Extension: .csv)


The dataset is composed of three files:

1. **`house-metadata.txt`**  
   A simple text file describing the column headers in the dataset, including explanations of each feature and the possible values they can take.

2. **`house_prices_records.csv`**  
   The main dataset that will be used to train and evaluate the predictive model.

3. **`inherited_houses.csv`**  
   A sample dataset containing the attributes of four houses inherited by the client. This dataset will be used to make price predictions.

The following steps will focus on analyzing the two CSV files.
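The listing shows that the CSV files sit inside a timestamped subfolder, which is awkward to hard-code. As a sketch, each file can be located by name instead. The helper `find_file` is hypothetical, and it assumes each file name occurs exactly once under the download directory.

```python
from pathlib import Path


def find_file(base_dir: str, filename: str) -> Path:
    """Return the single file called `filename` anywhere under `base_dir`."""
    matches = sorted(Path(base_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {base_dir}")
    if len(matches) > 1:
        raise ValueError(f"Multiple files named {filename} under {base_dir}: {matches}")
    return matches[0]
```

With this helper, `find_file(data_dir, "house_prices_records.csv")` would return the path regardless of the name of the intermediate folder.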


Step 1 – Load Datasets

In [7]:
import pandas as pd
import os

# Define dataset paths
records_path = os.path.join(
    data_dir, 'house-price-20211124T154130Z-001', 'house-price', 'house_prices_records.csv'
)
inherited_path = os.path.join(
    data_dir, 'house-price-20211124T154130Z-001', 'house-price', 'inherited_houses.csv'
)

# Load the datasets; re-raise on failure so later cells do not hit a NameError
try:
    df_records = pd.read_csv(records_path)
    df_inherited = pd.read_csv(inherited_path)
    print("Datasets loaded successfully.")
except FileNotFoundError as e:
    print("File not found:", e)
    raise

Datasets loaded successfully.


Step 2 – Preview the data

In [8]:
print("\nFirst rows of house_prices_records.csv:")
display(df_records.head())

print("\nFirst rows of inherited_houses.csv:")
display(df_inherited.head())


First rows of house_prices_records.csv:


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000



First rows of inherited_houses.csv:


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


Step 3 – Check basic metadata

In [9]:
# Info and Shape of the datasets
print("Shape of house_prices_records.csv:", df_records.shape)
print("Shape of inherited_houses.csv:", df_inherited.shape)

print("\nInfo for house_prices_records.csv:")
df_records.info()
print("\nInfo for inherited_houses.csv:")
df_inherited.info()

# Statistical summary of the datasets
print("\nStatistical summary of house_prices_records.csv:")
display(df_records.describe())

print("\nStatistical summary of inherited_houses.csv:")
display(df_inherited.describe())



Shape of house_prices_records.csv: (1460, 24)
Shape of inherited_houses.csv: (4, 23)

Info for house_prices_records.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtFinSF1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageYrBlt,GrLivArea,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
count,1460.0,1374.0,1361.0,1460.0,1460.0,136.0,1460.0,1379.0,1460.0,1460.0,1201.0,1452.0,1460.0,1460.0,1460.0,1460.0,155.0,1460.0,1460.0,1460.0
mean,1162.626712,348.524017,2.869214,443.639726,567.240411,25.330882,472.980137,1978.506164,1515.463699,10516.828082,70.049958,103.685262,46.660274,5.575342,6.099315,1057.429452,103.741935,1971.267808,1984.865753,180921.19589
std,386.587738,438.865586,0.820115,456.098091,441.866955,66.684115,213.804841,24.689725,525.480383,9981.264932,24.284752,181.066207,66.256028,1.112799,1.382997,438.705324,135.543152,30.202904,20.645407,79442.502883
min,334.0,0.0,0.0,0.0,0.0,0.0,0.0,1900.0,334.0,1300.0,21.0,0.0,0.0,1.0,1.0,0.0,0.0,1872.0,1950.0,34900.0
25%,882.0,0.0,2.0,0.0,223.0,0.0,334.5,1961.0,1129.5,7553.5,59.0,0.0,0.0,5.0,5.0,795.75,0.0,1954.0,1967.0,129975.0
50%,1087.0,0.0,3.0,383.5,477.5,0.0,480.0,1980.0,1464.0,9478.5,69.0,0.0,25.0,5.0,6.0,991.5,0.0,1973.0,1994.0,163000.0
75%,1391.25,728.0,3.0,712.25,808.0,0.0,576.0,2002.0,1776.75,11601.5,80.0,166.0,68.0,6.0,7.0,1298.25,182.5,2000.0,2004.0,214000.0
max,4692.0,2065.0,8.0,5644.0,2336.0,286.0,1418.0,2010.0,5642.0,215245.0,313.0,1600.0,547.0,9.0,10.0,6110.0,736.0,2010.0,2010.0,755000.0



Statistical summary of inherited_houses.csv:


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtFinSF1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageYrBlt,GrLivArea,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,1019.75,344.75,2.75,696.0,284.25,0.0,498.5,1978.5,1364.5,12424.25,78.25,32.0,26.5,5.75,5.5,1016.25,276.25,1978.5,1978.75
std,206.68555,398.193734,0.5,201.141741,112.973079,0.0,172.683719,21.977261,340.623448,1999.967062,3.095696,51.536395,17.691806,0.5,0.57735,209.577949,120.18978,21.977261,22.261701
min,896.0,0.0,2.0,468.0,137.0,0.0,312.0,1958.0,896.0,9978.0,74.0,0.0,0.0,5.0,5.0,882.0,140.0,1958.0,1958.0
25%,918.5,0.0,2.75,568.5,236.75,0.0,430.5,1960.25,1220.75,11211.0,77.0,0.0,25.5,5.75,5.0,915.0,194.0,1960.25,1960.25
50%,927.0,339.0,3.0,696.5,297.0,0.0,476.0,1979.0,1466.5,12726.0,79.0,10.0,35.0,6.0,5.5,927.0,286.0,1979.0,1979.5
75%,1028.25,683.75,3.0,824.0,344.5,0.0,544.0,1997.25,1610.25,13939.25,80.25,42.0,36.0,6.0,6.0,1028.25,368.25,1997.25,1998.0
max,1329.0,701.0,3.0,923.0,406.0,0.0,730.0,1998.0,1629.0,14267.0,81.0,108.0,36.0,6.0,6.0,1329.0,393.0,1998.0,1998.0


We can see that the target variable `SalePrice` is numerical, so no type conversion or encoding is required before modeling.  
To confirm the data type of `SalePrice`, you can run the following code:

In [10]:
df_records['SalePrice'].dtype


dtype('int64')

Here we compare `house_prices_records.csv` and `inherited_houses.csv` to identify any key differences between the dataset used to train the model and the dataset used to make predictions. This ensures that the input features available for prediction are consistent with those used during model training.


In [11]:
records_cols = set(df_records.columns)
inherited_cols = set(df_inherited.columns)

if records_cols == inherited_cols:
    print("The inherited houses dataset has the same headers as the training dataset.")
else:
    print("Header mismatch detected.")
        
    extra_inherited = inherited_cols - records_cols
    if extra_inherited:
        print("\nColumns present in inherited_houses.csv but missing in house_prices_records.csv:")
        for col in sorted(extra_inherited):
            print("-", col)
    
    missing_inherited = records_cols - inherited_cols
    if missing_inherited:
        print("\nColumns present in house_prices_records.csv but missing in inherited_houses.csv:")
        for col in sorted(missing_inherited):
            print("-", col)


Header mismatch detected.

Columns present in house_prices_records.csv but missing in inherited_houses.csv:
- SalePrice


Only the `SalePrice` column is missing from `inherited_houses.csv`. This is expected: `SalePrice` is the target that the model, trained on `house_prices_records.csv`, will predict for the inherited houses.
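For later notebooks it can help to turn this comparison into a reusable guard that fails loudly. Below is a minimal sketch; the function name `check_feature_alignment` is hypothetical, and it accepts any iterable of column names (for example `df_records.columns`).

```python
def check_feature_alignment(train_columns, predict_columns, target="SalePrice"):
    """Return the sorted feature list, or raise if the feature sets differ.

    The target column is excluded from the comparison.
    """
    train = set(train_columns)
    predict = set(predict_columns)
    if target not in train:
        raise ValueError(f"Target column {target!r} is missing from the training data")

    missing = train - predict - {target}   # features the prediction data lacks
    unexpected = predict - train           # columns the model never saw
    if missing or unexpected:
        raise ValueError(
            f"Feature mismatch: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    return sorted(train - {target})
```

Calling `check_feature_alignment(df_records.columns, df_inherited.columns)` would return the shared feature names if the only difference is `SalePrice`, and raise a `ValueError` otherwise.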


Step 4 – Check for missing values

In [12]:
missing_records = df_records.isnull().sum()
missing_records = missing_records[missing_records > 0]

print("Missing values in house_prices_records.csv:")
display(missing_records)

missing_inherited = df_inherited.isnull().sum()
missing_inherited = missing_inherited[missing_inherited > 0]

print("\nMissing values in inherited_houses.csv:")
display(missing_inherited)


Missing values in house_prices_records.csv:


2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure       38
BsmtFinType1      145
EnclosedPorch    1324
GarageFinish      235
GarageYrBlt        81
LotFrontage       259
MasVnrArea          8
WoodDeckSF       1305
dtype: int64


Missing values in inherited_houses.csv:


Series([], dtype: int64)
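Raw counts are easier to judge as a share of the 1460 rows. The sketch below recomputes the percentages in plain Python from the counts printed above. The variable names are illustrative; with the DataFrame loaded, `df_records.isnull().mean() * 100` should give the same figures.

```python
# Missing-value counts copied from the output above; 1460 rows per the shape check
n_rows = 1460
missing_counts = {
    "2ndFlrSF": 86,
    "BedroomAbvGr": 99,
    "BsmtExposure": 38,
    "BsmtFinType1": 145,
    "EnclosedPorch": 1324,
    "GarageFinish": 235,
    "GarageYrBlt": 81,
    "LotFrontage": 259,
    "MasVnrArea": 8,
    "WoodDeckSF": 1305,
}

# Percentage of missing rows per column, largest first
missing_share = {
    col: round(100 * count / n_rows, 1)
    for col, count in sorted(
        missing_counts.items(), key=lambda item: item[1], reverse=True
    )
}

for col, pct in missing_share.items():
    print(f"{col:<15}{pct:>6.1f}%")
```

`EnclosedPorch` and `WoodDeckSF` are each roughly 90% empty, while the remaining columns are missing in under 18% of rows. That distinction is worth keeping in mind when choosing between dropping and imputing in the cleaning stage.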

Step 5 – Save Working Copies

In [14]:
import os
from datetime import datetime
print("Current working directory:", os.getcwd())

output_dir = os.path.join("outputs", "datasets", "collection")
try:
    os.makedirs(output_dir, exist_ok=True)
    print(f"Output directory created or already exists: {output_dir}")
except Exception as e:
    print("An error occurred while creating the output directory:")
    print(e)

output_records_path = os.path.join(output_dir, 'house_prices_records.csv')
output_inherited_path = os.path.join(output_dir, 'inherited_houses.csv')

try:
    df_records.to_csv(output_records_path, index=False)
    df_inherited.to_csv(output_inherited_path, index=False)

    print("Working copies saved to:")
    print("-", output_records_path)
    print("-", output_inherited_path)

except Exception as e:
    print("An error occurred while saving the datasets:")
    print(e)
    

# Log timestamp after saving files to output directory
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print(f"Data collection completed at: {timestamp}")

log_dir = os.path.join("outputs", "logs")
try:
    os.makedirs(log_dir, exist_ok=True)
    print(f"Log directory is ready: {log_dir}")
except Exception as e:
    print("An error occurred while creating the log directory:")
    print(e)

log_path = os.path.join(log_dir, 'data_collection.log')
try:
    with open(log_path, 'a') as log_file:
        log_file.write(f"Data collection completed at: {timestamp}\n")
    print(f"Timestamp logged to: {log_path}")
except Exception as e:
    print("Failed to write to log file:")
    print(e)




Current working directory: c:\Users\PabloGalindo\Coding-Institute\PMS5\heritage-housing-ml-pgz
Output directory created or already exists: outputs\datasets\collection
Working copies saved to:
- outputs\datasets\collection\house_prices_records.csv
- outputs\datasets\collection\inherited_houses.csv
Data collection completed at: 2025-03-23 20:21:29
Log directory is ready: outputs\logs
Timestamp logged to: outputs\logs\data_collection.log
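The manual timestamp and file handling in the cell above could also be delegated to the standard `logging` module. This is a minimal sketch, not the notebook's method; the helper name `get_run_logger` and the logger name `data_collection` are assumptions.

```python
import logging
from pathlib import Path


def get_run_logger(log_path: str) -> logging.Logger:
    """Return a logger that appends timestamped lines to `log_path`."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("data_collection")
    logger.setLevel(logging.INFO)
    # Avoid stacking duplicate handlers when the notebook cell is re-run
    if not logger.handlers:
        handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

A call such as `get_run_logger("outputs/logs/data_collection.log").info("Data collection completed")` would append one timestamped line per run. Note that the handler is created only once per session, so a different `log_path` passed later in the same session is ignored in this sketch.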


---

## Conclusions

- The dataset was successfully retrieved from Kaggle and includes three files: a metadata file, a historical house sales dataset, and a dataset for the inherited houses.
- Both CSV files (`house_prices_records.csv` and `inherited_houses.csv`) were loaded and inspected without issues.
- The datasets are well-structured, and the `SalePrice` column, which is the prediction target, is present in the main dataset.
- The `SalePrice` column is numerical (`int64`) and needs no type conversion or encoding before modeling.
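- No critical structural issues were found, but ten columns in `house_prices_records.csv` contain missing values (`EnclosedPorch` and `WoodDeckSF` are roughly 90% empty), so data quality still needs to be assessed in later steps.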


---