# **Exploratory Data Analysis (EDA) with the cleaned Ames Housing dataset**

## Objectives

* Read `cleaned_data.csv` and confirm the dataset shape, column structure, and absence of missing values.
* Use appropriate libraries for analysis and visualisation:
   - `pandas` → data manipulation  
   - `numpy` → numerical operations  
   - `matplotlib.pyplot` → plotting distributions and relationships  
   - `seaborn` → advanced visualisations (correlation heatmap, boxplots, scatterplots)
* Explore the distribution of important numerical and categorical features.
* Investigate how different house attributes relate to `SalePrice`.
* Visualise correlations to identify the strongest predictor variables.
* Compare the client's inherited houses to the overall Ames market.
* Use the insights gained to guide feature selection for the modelling notebook.


## Inputs

* `data/processed/cleaned_data.csv` - The fully cleaned main Ames Housing dataset produced in `02_data_cleaning.ipynb`.
* `data/raw/inherited_houses.csv` - The four inherited properties that must be analysed and later predicted.
* The following Python libraries: `pandas`, `numpy`, `matplotlib.pyplot`, `seaborn`


## Outputs

* Distribution plots (histograms, KDEs, and boxplots) for key numerical features.
* Countplots/bar charts for categorical variables.
* Scatterplots showing relationships between numerical predictors and `SalePrice`.
* Boxplots showing how categorical variables influence sale price.
* A correlation matrix and correlation heatmap for numerical features.
* Visual comparisons between inherited houses and the wider Ames dataset.
* A summary of analysis findings to support feature selection for modelling.



## Additional Comments

* This notebook does not perform further data cleaning; all cleaning operations were completed in `02_data_cleaning.ipynb`.
* Plots generated in this notebook will also be saved later for use in the Streamlit dashboard.
* All plots and insights should contribute to answering the client’s business questions:
  1. Which features affect house price the most?  
  2. How do the client’s inherited houses compare to the market?
* Insights from this analysis will directly influence the feature engineering and model training steps in the next notebook.


---

### Step 1: Load the cleaned dataset

In this step, I load the `cleaned_data.csv` dataset that was produced in the `02_data_cleaning.ipynb` notebook. It reflects all of the earlier data preparation steps and serves as the basis for all exploratory visualisations in this notebook.

The client's inherited houses dataset will be loaded later, only when needed for direct comparison during the EDA. This helps keep the notebook focused and avoids introducing unused variables at the start.


In [1]:
import pandas as pd
import numpy as np

# Display options
pd.set_option("display.max_columns", None)

# Path to cleaned dataset
cleaned_path = "data/processed/cleaned_data.csv"

# Load cleaned housing data
housing_df = pd.read_csv(cleaned_path)

# Preview
housing_df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,548,RFn,2003.0,1710,Gd,8450,65.0,196.0,61,5,7,856,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,460,RFn,1976.0,1262,TA,9600,80.0,0.0,0,8,6,1262,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,608,RFn,2001.0,1786,Gd,11250,68.0,162.0,42,5,7,920,2001,2002,223500
3,961,0.0,3.0,No,216,ALQ,540,642,Unf,1998.0,1717,Gd,9550,60.0,0.0,35,5,7,756,1915,1970,140000
4,1145,0.0,4.0,Av,655,GLQ,490,836,RFn,2000.0,2198,Gd,14260,84.0,350.0,84,5,8,1145,2000,2000,250000


In [2]:
print("Shape of cleaned dataset:", housing_df.shape)
print("Total missing values:", housing_df.isna().sum().sum())

Shape of cleaned dataset: (1460, 22)
Total missing values: 0


The preview of the cleaned dataset confirms that the file imported correctly.  
Several checks validate that the cleaning steps applied in the previous notebook were successful:

- All columns display clean and consistent values, with no missing values present.
- The median-imputed numeric columns, such as `BedroomAbvGr`, `MasVnrArea`, and `LotFrontage`, now hold numerical values with no gaps.
- Categorical imputation has been applied correctly: for example, rows with no basement or garage carry meaningful labels.
- Year-based columns contain realistic values (in the preview, `YearBuilt` ranges from 1915 to 2003).
- All column data types appear correct, with numerical features stored as integers or floats, and categorical features stored as object types.

This confirms that the cleaned dataset is ready for detailed EDA in the next steps.
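
The checks listed above could also be expressed programmatically. The following is a minimal sketch, not the code used in this notebook: `validate_cleaned`, the `sample` frame, and the year bounds (1800 to 2025) are hypothetical choices made for illustration, and the sample values are copied from the five preview rows rather than read from the full dataset.

```python
import pandas as pd


def validate_cleaned(df: pd.DataFrame, year_cols, min_year=1800, max_year=2025) -> dict:
    """Summarise the post-cleaning checks as a small report dictionary.

    The year bounds are illustrative assumptions, not values from the project.
    """
    missing = df.isna().sum()
    return {
        # Columns that still contain at least one missing value
        "columns_with_missing": missing[missing > 0].index.tolist(),
        # Columns stored as integers or floats
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
        # Everything else (object/string columns in this dataset)
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
        # Whether every value in each year column falls inside the assumed bounds
        "years_in_range": {
            col: bool(df[col].between(min_year, max_year).all()) for col in year_cols
        },
    }


# Stand-in frame built from the preview rows so the sketch runs without the CSV
sample = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000],
    "KitchenQual": ["Gd", "TA", "Gd", "Gd", "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

report = validate_cleaned(sample, year_cols=["YearBuilt", "YearRemodAdd"])
print(report)
```

In the notebook itself, the same function could be called on `housing_df` in place of `sample`. Columns such as `GarageYrBlt` may need separate handling if the cleaning step used a placeholder year for houses without a garage.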

### Step 2: Review dataset structure for EDA

Before creating visualisations, it is important to understand the structure of the cleaned dataset.  
Unlike the initial inspection carried out in `01_data_exploration.ipynb`, this step focuses on confirming that the cleaned dataset:

- contains the expected columns,
- uses the correct data types (after cleaning),
- contains no missing values,
- and is ready for analysis.

I will use:
- `housing_df.info()` to view column names, data types, and non-null counts,
- `housing_df.describe().T` to summarise numerical features,
- `housing_df.columns.tolist()` to display all feature names clearly in one list.
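
As a rough illustration of these three calls, the sketch below runs them on a small stand-in frame. `sample_df` is a hypothetical name, and its values are copied from the five preview rows in Step 1; in the notebook, the calls would be made on `housing_df` instead.

```python
import pandas as pd

# Stand-in frame with a few columns from the Step 1 preview (not the full dataset)
sample_df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "KitchenQual": ["Gd", "TA", "Gd", "Gd", "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Column names, data types, and non-null counts (printed directly by pandas)
sample_df.info()

# Summary statistics for numerical features, transposed to one row per feature
summary = sample_df.describe().T
print(summary)

# All feature names in a single list
columns = sample_df.columns.tolist()
print(columns)
```

Transposing the `describe()` output keeps each feature on its own row, which tends to be easier to scan when the dataset has many numerical columns.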

---
