# Exploring the Data

## Objectives

* Fetch data from [Kaggle](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data) and save as raw data.

* Load the raw main Ames Housing dataset (house_prices_records.csv) and perform an initial inspection.

* Understand the structure of the dataset, including column names, data types, and dimensions.

* Identify the presence and scale of missing data across all features.

* Generate descriptive statistics to understand the distribution, spread, and potential outliers in numerical variables.

* Highlight key data quality issues that will need to be addressed during the cleaning stage.

## Inputs

* Dataset: house_prices_records.csv located in the data/raw/ directory.

* Libraries: pandas, numpy (used for DataFrame exploration and missing-value analysis).

* All data is loaded directly from the CSV file.


---

# Change working directory

* The notebooks are stored in a subfolder, so when a notebook is run in the editor the working directory has to be changed.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [29]:
import os
current_dir = os.getcwd()
current_dir


'/Users/aisha/Desktop/vscode-projects'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory.
* `os.chdir()` sets the new current directory.

In [19]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [20]:
current_dir = os.getcwd()
current_dir

'/Users/aisha/Desktop/vscode-projects'
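Because `os.chdir(os.path.dirname(...))` moves up one level every time the cell is run, re-running it can leave the notebook in an unexpected folder. The sketch below is one possible guard, not part of the original notebook: it walks upwards until it finds a folder that contains a marker directory. The helper names `find_project_root` and `change_to_project_root`, and the choice of `data` as the marker, are assumptions made for illustration.

```python
import os
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return `start` or its closest ancestor that contains a `marker` directory."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def change_to_project_root(marker: str = "data") -> Path:
    """Change the working directory to the project root and return it.

    Calling this repeatedly is harmless: once the working directory is the
    root, the search stops at the first candidate.
    """
    root = find_project_root(Path.cwd().resolve(), marker)
    os.chdir(root)
    return root
```

Calling `change_to_project_root()` in the first cell would replace the two-step `getcwd` / `chdir` sequence, assuming the project root is the folder that holds `data/`.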

---

# Check Structure and Missing Data


- Understand the data structure
- Check for missing values

In [3]:
import pandas as pd

# Load the main dataset
df = pd.read_csv("data/raw/house_prices_records.csv")
df.head()

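| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |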


- The dataset contains both numerical and categorical features, such as floor area, basement exposure and finish type, and garage information.

- Missing values (`NaN`) are already visible in the first five rows, in `2ndFlrSF`, `BedroomAbvGr`, `EnclosedPorch` and `WoodDeckSF`.

In [4]:
# Check overall data structure 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

- The dataset has 1,460 entries and 24 columns.
- It contains a mix of numeric (`int64`, `float64`) and categorical (`object`) data types.
- Some columns, e.g. `2ndFlrSF`, `GarageYrBlt`, `MasVnrArea` and `LotFrontage`, have fewer than 1,460 non-null values, confirming the presence of missing data.
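The same facts can be read from the DataFrame directly instead of from the `df.info()` printout. The sketch below uses a small made-up frame called `toy`: the column names are borrowed from the dataset, but the values are illustrative only, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854.0, 0.0, np.nan],
    "KitchenQual": ["Gd", "TA", "Gd"],
})

# Dimensions: number of rows and columns
n_rows, n_cols = toy.shape

# Split columns into numeric and non-numeric (text/categorical)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Columns with at least one missing value
incomplete_cols = [col for col in toy.columns if toy[col].isna().any()]

print(f"{n_rows} rows, {n_cols} columns")
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("incomplete:", incomplete_cols)
```

Running the same lines against `df` instead of `toy` would give lists that can be reused in later cleaning steps.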

### Missing data 

In [5]:
# See how many missing values are in the dataset
df.isna().sum().sum()

np.int64(3580)

- There are 3,580 missing values in total. `EnclosedPorch` (1,324) and `WoodDeckSF` (1,305) account for most of them, followed by `LotFrontage` (259). These will need to be handled before modelling.

In [6]:
# Count missing values per column
df.isna().sum().sort_values(ascending=False)

EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
BedroomAbvGr       99
2ndFlrSF           86
GarageYrBlt        81
BsmtExposure       38
MasVnrArea          8
1stFlrSF            0
OverallCond         0
YearRemodAdd        0
YearBuilt           0
TotalBsmtSF         0
OverallQual         0
KitchenQual         0
OpenPorchSF         0
LotArea             0
GrLivArea           0
GarageArea          0
BsmtUnfSF           0
BsmtFinSF1          0
SalePrice           0
dtype: int64

- The output shows the number of missing (`NaN`) values in each column, sorted from most to least.

In [7]:
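# Count missing values per column
missing_count = df.isna().sum().sort_values(ascending=False)

# Percentage of missing values relative to the number of rows
missing_percent = (df.isna().sum() / len(df)) * 100

# Combine both into one table. pandas aligns the two Series on their index,
# so the sort order of missing_count is not kept in the result
missing_data = pd.DataFrame({
    'Missing Values': missing_count,
    'Percentage (%)': missing_percent
})
print(missing_data)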


               Missing Values  Percentage (%)
1stFlrSF                    0        0.000000
2ndFlrSF                   86        5.890411
BedroomAbvGr               99        6.780822
BsmtExposure               38        2.602740
BsmtFinSF1                  0        0.000000
BsmtFinType1              145        9.931507
BsmtUnfSF                   0        0.000000
EnclosedPorch            1324       90.684932
GarageArea                  0        0.000000
GarageFinish              235       16.095890
GarageYrBlt                81        5.547945
GrLivArea                   0        0.000000
KitchenQual                 0        0.000000
LotArea                     0        0.000000
LotFrontage               259       17.739726
MasVnrArea                  8        0.547945
OpenPorchSF                 0        0.000000
OverallCond                 0        0.000000
OverallQual                 0        0.000000
SalePrice                   0        0.000000
TotalBsmtSF                 0     

- The table above quantifies the missing data in each column, both as a count and as a percentage of the total number of rows.

- Showing both counts and percentages makes it easier to separate columns with little missing data, which can be filled, from columns with so much missing data that they may need to be removed.
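As a minimal sketch, the count and percentage can be built from a single `isna().sum()` call and wrapped in a reusable helper. The function name `missing_summary` and the `toy` frame are assumptions for illustration; the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, one row per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "Missing Values": counts,
        "Percentage (%)": (counts / len(frame) * 100).round(2),
    })
    # Sort after building the table so the order is kept in the result
    return summary.sort_values("Missing Values", ascending=False)


# Toy data with illustrative values only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "WoodDeckSF": [0.0, np.nan, np.nan, np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

summary = missing_summary(toy)
print(summary)
```

Sorting the finished table, rather than one of its input Series, avoids the index re-alignment that otherwise resets the order.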


In [8]:
# Keep only the columns that actually have missing values, sorted by percentage (highest first)
missing_data = missing_data[missing_data["Missing Values"] > 0].sort_values(by="Percentage (%)", ascending=False)
print(missing_data)

               Missing Values  Percentage (%)
EnclosedPorch            1324       90.684932
WoodDeckSF               1305       89.383562
LotFrontage               259       17.739726
GarageFinish              235       16.095890
BsmtFinType1              145        9.931507
BedroomAbvGr               99        6.780822
2ndFlrSF                   86        5.890411
GarageYrBlt                81        5.547945
BsmtExposure               38        2.602740
MasVnrArea                  8        0.547945


- This step keeps only the columns that have missing data.

- It removes features with no missing values and sorts the rest in descending order by their percentage of missing data.

- This focuses attention on the columns that need work during data cleaning, making it easier to decide whether to fill or drop them.
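One way to turn the fill-or-drop decision into a repeatable step is to split the columns by a cut-off on the missing percentage. The sketch below is only an illustration: the 80% value in `DROP_THRESHOLD` is a hypothetical choice, not one made in this notebook, and the `toy` frame holds made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-off in percent; not taken from the notebook
DROP_THRESHOLD = 80.0

# Toy data with illustrative values only
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 85.0],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean() * 100

# Ignore complete columns, then split the rest at the cut-off
missing_pct = missing_pct[missing_pct > 0]
drop_candidates = missing_pct[missing_pct >= DROP_THRESHOLD].index.tolist()
impute_candidates = missing_pct[missing_pct < DROP_THRESHOLD].index.tolist()

print("Consider dropping:", drop_candidates)
print("Consider imputing:", impute_candidates)
```

The lists are only candidates; whether a column is actually dropped or filled is a decision for the cleaning stage.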

In [9]:
# Generate statistical summary of numerical columns in housing dataset

# Returns summary statistics for each numeric feature - count, mean, std dev, min/max values, and quartiles 
df.select_dtypes(include=['number']).describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.0 | 1087.0 | 1391.25 | 4692.0 |
| 2ndFlrSF | 1374.0 | 348.524017 | 438.865586 | 0.0 | 0.0 | 0.0 | 728.0 | 2065.0 |
| BedroomAbvGr | 1361.0 | 2.869214 | 0.820115 | 0.0 | 2.0 | 3.0 | 3.0 | 8.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |
| BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.0 | 477.5 | 808.0 | 2336.0 |
| EnclosedPorch | 136.0 | 25.330882 | 66.684115 | 0.0 | 0.0 | 0.0 | 0.0 | 286.0 |
| GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.5 | 480.0 | 576.0 | 1418.0 |
| GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.0 | 1980.0 | 2002.0 | 2010.0 |
| GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.5 | 1464.0 | 1776.75 | 5642.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |
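The quartiles in this summary can be used for a first check on potential outliers. The sketch below applies the common 1.5 × IQR rule to the `LotArea` figures reported above; the rule is a convention assumed here for illustration, not a step this notebook performs, and the variable names are hypothetical.

```python
# LotArea statistics copied from the describe() output above
q1, q3 = 7553.5, 11601.5
lot_min, lot_max = 1300.0, 215245.0

# Interquartile range and the 1.5 x IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A minimum or maximum outside the fences means at least one potential outlier
has_low_outliers = lot_min < lower_fence
has_high_outliers = lot_max > upper_fence

print(f"IQR: {iqr}, fences: [{lower_fence}, {upper_fence}]")
print("Low outliers:", has_low_outliers, "High outliers:", has_high_outliers)
```

For `LotArea` the fences are 1481.5 and 17673.5, so both the minimum (1,300) and the maximum (215,245) fall outside them.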


---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

try:
    # create here your folder
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
    print(e)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import streamlit

print("✅ Notebook connected to correct environment!")


✅ Notebook connected to correct environment!


In [2]:
import pandas as pd

# Load the main dataset
df = pd.read_csv("../data/raw/house_prices_records.csv")

# Quick check
print("✅ Dataset loaded successfully!")
print("Shape:", df.shape)
df.head()

✅ Dataset loaded successfully!
Shape: (1460, 24)


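| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |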


In [3]:
df.isna().sum().sort_values(ascending=False).head(15)


EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
BedroomAbvGr       99
2ndFlrSF           86
GarageYrBlt        81
BsmtExposure       38
MasVnrArea          8
1stFlrSF            0
OverallCond         0
YearRemodAdd        0
YearBuilt           0
TotalBsmtSF         0
dtype: int64