# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under `outputs/datasets/collection`.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/house_prices_records.csv



---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Arthur\\OneDrive\\Documentos\\Code Institute\\PP5\\PP5-heritage-housing-issues-ml\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Arthur\\OneDrive\\Documentos\\Code Institute\\PP5\\PP5-heritage-housing-issues-ml'
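
Re-running the `os.chdir` cell moves up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` as in the path printed above; the helper name `resolve_project_root` is hypothetical:

```python
import os
from pathlib import Path


def resolve_project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `cwd` only when `cwd` is the notebooks folder.

    `notebooks_dir` is an assumption taken from the path printed above.
    """
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to run more than once: a second call leaves the directory unchanged
os.chdir(resolve_project_root(os.getcwd()))
```

With this guard the working directory only changes while the notebook is still inside its own folder.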

# Fetch Data from Kaggle

Install Kaggle package to fetch data

In [4]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting python-slugify (from kaggle==1.5.12)
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-8.0.4 text-unidecode-1.3
Note: you may need to restart the kernel to use updated packages.


The Kaggle CLI needs a **JSON file (authentication token)** to download data in this session. The token is stored in the `kaggle.json` file. The following cell points `KAGGLE_CONFIG_DIR` at the current directory so that the CLI can find the token.
I added an `if` statement so the cell runs on both Windows and Linux: `chmod` is only executed on Unix-based systems.

In [5]:
import os
import platform

# Set Kaggle config path
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Only run chmod on Unix-based systems
if platform.system() != 'Windows':
    os.system('chmod 600 kaggle.json')
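
Before calling the Kaggle CLI, it can help to confirm that the token file is present and well formed. The sketch below assumes the usual `kaggle.json` layout with `username` and `key` fields. The helper name `check_kaggle_token` is hypothetical, and `os.chmod` stands in for the shell call used above.

```python
import json
import os
import platform


def check_kaggle_token(path="kaggle.json"):
    """Return True if `path` is a JSON object with non-empty username and key."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as fh:
            token = json.load(fh)
    except (OSError, ValueError):
        # Unreadable file or invalid JSON
        return False
    if not isinstance(token, dict):
        return False
    return bool(token.get("username")) and bool(token.get("key"))


if check_kaggle_token():
    os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
    if platform.system() != "Windows":
        # Owner read/write only, equivalent to `chmod 600 kaggle.json`
        os.chmod("kaggle.json", 0o600)
else:
    print("kaggle.json is missing or incomplete")
```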

We are using the following [Kaggle URL](https://www.kaggle.com/codeinstitute/housing-prices-data)

![kaggle dataset page screenshot](../docs/screenshot-kaggle-dataset.png)

Get the dataset path from the Kaggle URL
* When viewing the dataset on Kaggle, the dataset path is the part of the URL after https://www.kaggle.com/, in this case `codeinstitute/housing-prices-data`.
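
The same extraction can be sketched in code. The helper name `dataset_path_from_url` is hypothetical, and the handling of a leading `datasets` segment is an assumption about how some Kaggle URLs are laid out:

```python
from urllib.parse import urlparse


def dataset_path_from_url(url):
    """Extract the `owner/dataset` path from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: some Kaggle URLs carry a leading "datasets" segment
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"No owner/dataset path found in {url!r}")
    return "/".join(parts[:2])


print(dataset_path_from_url("https://www.kaggle.com/codeinstitute/housing-prices-data"))
# prints: codeinstitute/housing-prices-data
```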

Define the Kaggle dataset path and the destination folder, then download the dataset.

In [6]:
# Define Kaggle dataset path and local destination folder
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

# Download dataset from Kaggle
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Downloading housing-prices-data.zip to inputs/datasets/raw




  0%|          | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████| 49.6k/49.6k [00:00<00:00, 427kB/s]
100%|██████████| 49.6k/49.6k [00:00<00:00, 427kB/s]
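
The `!` shell escape does not stop the notebook when the command fails, so a failed download can go unnoticed. A sketch of the same call through `subprocess`, assuming the `kaggle` executable is on the `PATH`; both helper names are hypothetical:

```python
import subprocess


def build_download_command(dataset, destination):
    """Build the argument list for the same CLI call used above."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset, destination):
    """Run the download and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_download_command(dataset, destination), check=True)


# Example call (requires the kaggle CLI and a valid token):
# download_dataset("codeinstitute/housing-prices-data", "inputs/datasets/raw")
```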


Unzip the downloaded file, then delete the zip file and the `kaggle.json` file.

In [7]:
import zipfile
import glob

# Unzip any zip files in the destination folder
for zip_file in glob.glob(f"{DestinationFolder}/*.zip"):
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_file)

# Delete the kaggle.json file
if os.path.exists("kaggle.json"):
    os.remove("kaggle.json")
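
The archive may unpack into nested folders, so listing the extracted CSV files is one way to find the path to load. A minimal sketch; the helper name `find_csv_files` is hypothetical:

```python
import glob
import os


def find_csv_files(folder):
    """Return all CSV files under `folder`, searching subfolders too."""
    pattern = os.path.join(folder, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))


for csv_path in find_csv_files("inputs/datasets/raw"):
    print(csv_path)
```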


---

# Load and Inspect Kaggle data

We import pandas for data manipulation and load the raw dataset into a DataFrame. We then display the first rows to get an initial look at the data.

In [13]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head(20)

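| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |
| 5 | 796 | 566.0 | 1.0 | No | 732 | GLQ | 64 | NaN | 480 | Unf | ... | 85.0 | 0.0 | 30 | 5 | 5 | 796 | NaN | 1993 | 1995 | 143000 |
| 6 | 1694 | 0.0 | 3.0 | Av | 1369 | GLQ | 317 | NaN | 636 | RFn | ... | 75.0 | 186.0 | 57 | 5 | 8 | 1686 | NaN | 2004 | 2005 | 307000 |
| 7 | 1107 | 983.0 | 3.0 | Mn | 859 | ALQ | 216 | NaN | 484 | NaN | ... | NaN | 240.0 | 204 | 6 | 7 | 1107 | NaN | 1973 | 1973 | 200000 |
| 8 | 1022 | 752.0 | 2.0 | No | 0 | Unf | 952 | NaN | 468 | Unf | ... | 51.0 | 0.0 | 0 | 5 | 7 | 952 | NaN | 1931 | 1950 | 129900 |
| 9 | 1077 | 0.0 | 2.0 | No | 851 | GLQ | 140 | NaN | 205 | RFn | ... | 50.0 | 0.0 | 4 | 6 | 5 | 991 | NaN | 1939 | 1950 | 118000 |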


This dataset contains housing records for Ames, Iowa. Each row represents a single house sale, and each column describes an attribute of the house (e.g., size, quality, number of bedrooms, etc.).

We then check basic structure, number of entries, data types, and missing values with a DataFrame summary.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

**Output Notes:**

- The dataset contains 1,460 entries and 24 columns.

- Most features are numerical (`int64` and `float64`), with a few categorical (`object`).

- Some columns have missing values:

    - `2ndFlrSF`, `BedroomAbvGr`, `GarageYrBlt`, `LotFrontage`, and `MasVnrArea` are numeric features with some missing values

    - `BsmtExposure`, `BsmtFinType1`, and `GarageFinish` are categorical features with missing entries

    - `EnclosedPorch` and `WoodDeckSF` have very few non-null entries (136 and 155 out of 1,460, respectively)

This information will guide the data cleaning steps in the next notebook phase, where we'll handle missing data, outliers, and formatting inconsistencies.
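
The counts in the notes above can also be tabulated directly. The sketch below is one way to do it, using a hypothetical helper `missing_summary`. It is shown on a toy frame built from the first four rows of three columns displayed earlier, not on the full dataset:

```python
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values for columns that have any."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only; in the notebook, pass `df` instead
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854.0, 0.0, 866.0, None],
    "EnclosedPorch": [0.0, None, 0.0, None],
})
print(missing_summary(toy))
```

Columns without missing values are left out, so the result lists only the features that need attention during cleaning.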

---

# Push files to Repo

Create an output folder to keep track of the dataset as it is worked on.

In [14]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected dataset without the index column
df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
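
A quick sanity check after saving is to reload the file and compare it with the DataFrame in memory. This is a sketch only: the helper name `saved_copy_matches` is hypothetical, and the demonstration uses a two-row toy frame in a temporary folder rather than the real output path.

```python
import os
import tempfile

import pandas as pd


def saved_copy_matches(frame, csv_path):
    """Reload `csv_path` and compare its shape and column names with `frame`."""
    reloaded = pd.read_csv(csv_path)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns


# Toy demonstration; in the notebook, pass `df` and
# "outputs/datasets/collection/house_prices_records.csv" instead
toy = pd.DataFrame({"1stFlrSF": [856, 1262], "SalePrice": [208500, 181500]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "house_prices_records.csv")
    toy.to_csv(path, index=False)
    print(saved_copy_matches(toy, path))
```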

Add, commit and push your files to your repo.

In [None]:
!git add .
!git commit -m "Added the house prices dataset"
!git push origin main

---

# Conclusions and Next Steps

In this initial phase, we successfully loaded and inspected the raw dataset. The dataset consists of 1,460 house sale records and 24 features, including a mix of numerical and categorical variables. We identified several features with missing values, particularly in `2ndFlrSF`, `BedroomAbvGr`, and some categorical fields like `BsmtFinType1` and `GarageFinish`.

We also noted that some numerical columns are stored as `float64` due to missing values, even though they represent whole numbers. These will be converted to integers after handling missing data.
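
Two possible ways to do that conversion are sketched below, using the first five `BedroomAbvGr` values shown earlier. The nullable `Int64` dtype is an alternative that keeps missing values in place. The fill value `0` in the second variant is only a placeholder, since the imputation strategy has not been decided yet.

```python
import pandas as pd

# First five `BedroomAbvGr` values from the table above, including a missing one
bedrooms = pd.Series([3.0, 3.0, 3.0, None, 4.0], name="BedroomAbvGr")

# Nullable integer dtype: whole numbers become integers, missing stays <NA>
as_nullable = bedrooms.astype("Int64")

# Plain int64 needs the gaps filled first; 0 is a placeholder, not a recommendation
as_plain = bedrooms.fillna(0).astype("int64")

print(as_nullable.dtype, as_plain.dtype)
```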

**Next steps:**
- Analyze and visualize the distribution of missing values
- Decide on appropriate strategies for imputing or dropping missing data
- Begin exploratory analysis to understand the relationships between features and `SalePrice`