# House Price Analytics

## 05 Data analysis for generating foresights

**Project:** Code Institute – Capstone Project

---
### **Objectives**
- Load the final house dataset
- Build a Machine Learning model to predict house prices with high accuracy

### **Inputs**
- `/data/models/house_price_model.pkl`

### **Outputs**
- A trained and fine-tuned model that powers a "Price Estimator" dashboard feature, giving buyers and sellers a realistic price range (Min, Average, Max).
        
### **Additional Comments**
Confirm that `final_house_data.csv` exists under `data/processed/`, the directory the loading cell reads from. Run this notebook top-down.

---

### Set Up the File and Load the Dataset
Import the necessary libraries.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Scikit-Learn
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Feature Engine
from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection
from feature_engine.encoding import OneHotEncoder

# Ignore future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) 

Set the working directory. `os.getcwd()` returns the current directory; the cell below keeps it as the working directory (it does not move to the parent folder), so the loading cell uses `..` to reach the project root.

In [3]:
PROJECT_DIR = os.getcwd()  # Current working directory (the folder the notebook runs from)
os.chdir(PROJECT_DIR)  # No-op here; kept so the working directory is set explicitly
# Uncomment the line below to verify the current working directory
# print("Working directory:", os.getcwd())
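
The loading cell reaches the dataset through a relative `..` path, which only works when the notebook runs from a direct subfolder of the project root. As a minimal sketch of a more tolerant lookup, the helper below searches the starting folder and its parents for the file. The name `find_upwards` is hypothetical and not part of this project.

```python
from pathlib import Path
from typing import Optional


def find_upwards(start: Path, relative: Path) -> Optional[Path]:
    """Return the first existing `folder / relative`, searching `start` and its parents."""
    for folder in [start, *start.parents]:
        candidate = folder / relative
        if candidate.is_file():
            return candidate
    return None


# Hypothetical usage: locate the dataset from whichever subfolder the notebook runs in.
csv_path = find_upwards(Path.cwd(), Path("data") / "processed" / "final_house_data.csv")
print("Found:" if csv_path else "Not found:", csv_path)
```

If the file is found, `csv_path` can be passed straight to `pd.read_csv`; if it is `None`, the notebook can stop with a clear message instead of continuing with missing data.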

Load the final dataset, which resides in the `data/processed/` directory.

In [4]:
# LOAD DATASET
try:
    # Data directory paths
    data_path = os.path.join("..", "data", "processed")
    # Extract the original dataset
    df = pd.read_csv(os.path.join(data_path, "final_house_data.csv"))
    print("Dataset loaded successfully.")
except Exception as e:
    print(e)
    print("Error loading the dataset.")
    df = pd.DataFrame()  # Create an empty DataFrame if loading fails

print(f"Original dataset shape: {df.shape}")

Dataset loaded successfully.
Original dataset shape: (21596, 31)
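
Because the loading cell falls back to an empty DataFrame when reading fails, a later cell would then stop with a less obvious `KeyError`. The sketch below shows one way to fail early instead. The helper name `check_loaded` and the toy values are assumptions for illustration; the required column names are the ones the preparation step of this notebook refers to.

```python
import pandas as pd


def check_loaded(df: pd.DataFrame, required: list) -> None:
    """Raise early if the frame is empty or lacks columns that later cells rely on."""
    if df.empty:
        raise ValueError("Dataset is empty; check the path to final_house_data.csv.")
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")


# Columns referenced by the data set preparation step of this notebook.
REQUIRED = ["price_log", "zipcode", "date", "yr_built"]

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({"price_log": [12.3], "zipcode": [98000],
                    "date": ["2014-10-13"], "yr_built": [1955]})
check_loaded(toy, REQUIRED)  # passes silently when everything is present
```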


---

### DATA SET PREPARATION
Drop the columns that are unnecessary for modeling and generate the X and y data sets.
- Remove every attribute that contains information about the target (not just `price_log`, but also `price` and `price_per_sqft`) from the training data set to prevent data leakage.
- Drop non-numeric or unused columns that are not needed for modeling.
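
To illustrate why target-derived columns must be removed, the sketch below uses a toy frame with made-up numbers. It assumes that `price_per_sqft` is `price / sqft_living` and that `price_log` is the natural log of `price`; the notebook does not confirm either definition.

```python
import numpy as np
import pandas as pd

# Toy data; the numbers are made up for illustration.
toy = pd.DataFrame({
    "sqft_living": [1000, 2000, 1500],
    "price": [300000.0, 500000.0, 450000.0],
})
# Assumed definitions of the derived columns (not confirmed by the notebook).
toy["price_log"] = np.log(toy["price"])
toy["price_per_sqft"] = toy["price"] / toy["sqft_living"]

# A "feature" derived from the target lets the target be rebuilt from the inputs:
rebuilt_price = toy["price_per_sqft"] * toy["sqft_living"]
print(np.allclose(np.log(rebuilt_price), toy["price_log"]))

# Guard: no target-derived column may survive in the feature matrix.
leakage_cols = ["price", "price_log", "price_per_sqft"]
X_toy = toy.drop(columns=leakage_cols)
assert not set(leakage_cols) & set(X_toy.columns)
```

A model trained with such a column would score almost perfectly on held-out data and still be useless in the dashboard, where the sale price is not known in advance.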

In [5]:
# Drop anything that directly contains price information to prevent data leakage
leakage_cols = ['price', 'price_log', 'price_per_sqft']

# Drop non-feature columns
# 'id', 'date', 'sale_month_name' and 'age_group' are identifiers, timestamps or labels not needed by the model
unused_cols = ['id', 'date', 'sale_month_name', 'age_group']

# Cast Zipcode to String (Categorical)
# This tells the model "treat this as a label, not a number"
df['zipcode'] = df['zipcode'].astype(str)

# Handle Dates
df['date'] = pd.to_datetime(df['date'])
df['yr_sold'] = df['date'].dt.year
    
# Feature Engineering (House Age)
df['house_age'] = df['yr_sold'] - df['yr_built']

# Define Features (X) and Target (y)
# Drop ALL leakage and unused columns
X = df.drop(columns=leakage_cols + unused_cols)
y = df['price_log'] # Target is Log Price

print("Feature Matrix Shape:", X.shape)
X.head()

Feature Matrix Shape: (21596, 25)


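| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | ... | long | sqft_living15 | sqft_lot15 | sale_year | sale_month | sale_quarter | house_age | is_renovated | years_since_update | yr_sold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | ... | -122.257 | 1340 | 5650 | 2014 | 10 | 4 | 59 | 0 | 59 | 2014 |
| 1 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | ... | -122.319 | 1690 | 7639 | 2014 | 12 | 4 | 63 | 1 | 23 | 2014 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | ... | -122.233 | 2720 | 8062 | 2015 | 2 | 1 | 82 | 0 | 82 | 2015 |
| 3 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | ... | -122.393 | 1360 | 5000 | 2014 | 12 | 4 | 49 | 0 | 49 | 2014 |
| 4 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | ... | -122.045 | 1800 | 7503 | 2015 | 2 | 1 | 28 | 0 | 28 | 2015 |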
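
In the preview above, `yr_sold` matches `sale_year` in every displayed row, which suggests the two columns may carry the same information. As a hedged sketch, the helper below (the name `duplicated_columns` is hypothetical) finds numeric columns whose values are identical to an earlier column. The toy frame reuses only the five previewed values, so it says nothing about the full data set.

```python
import pandas as pd


def duplicated_columns(df: pd.DataFrame) -> dict:
    """Map each numeric column to an earlier numeric column with identical values."""
    numeric = df.select_dtypes(include="number")
    cols = list(numeric.columns)
    duplicates = {}
    for i, col in enumerate(cols):
        for earlier in cols[:i]:
            # Skip columns already flagged; note that NaN never compares equal.
            if earlier not in duplicates and (numeric[col] == numeric[earlier]).all():
                duplicates[col] = earlier
                break
    return duplicates


# Toy frame reusing the sale_year / sale_month / yr_sold values from the preview.
toy = pd.DataFrame({
    "sale_year": [2014, 2014, 2015, 2014, 2015],
    "sale_month": [10, 12, 2, 12, 2],
    "yr_sold": [2014, 2014, 2015, 2014, 2015],
})
dupes = duplicated_columns(toy)
print(dupes)  # {'yr_sold': 'sale_year'}
X_toy = toy.drop(columns=list(dupes))
```

Dropping exact duplicates matters mostly for linear models, where perfectly collinear inputs make coefficients unstable; tree-based models are less sensitive to it.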


---