# Machine Learning Modeling Notebook

## Objectives
- Build machine learning models to predict house prices
- Compare linear regression, decision tree, k-nearest neighbours, and random forest regressors to find the best one
- Tune hyperparameters with grid search to improve performance
- Save the best model for deployment

## Inputs
- Prepared data from `outputs/datasets/prepared/v1/prepared_data.csv`

## Outputs
- Trained ML model saved as a pickle file
- Model performance metrics
- Feature importance analysis

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# For saving models
import pickle
import os

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [2]:
# Load the prepared data
df = pd.read_csv('../outputs/datasets/prepared/v1/prepared_data.csv')

print(f"Loaded {len(df)} properties")
print(f"Columns: {df.columns.tolist()}")

Loaded 17553 properties
Columns: ['Transaction unique identifier', 'Price', 'Date of Transfer', 'Property Type', 'Old/New', 'Duration', 'Town/City', 'District', 'County', 'PPDCategory Type', 'Record Status - monthly file only', 'Property_Type_Encoded', 'County_Encoded', 'Old_New_Encoded', 'Duration_Encoded', 'Type_Age_Interaction', 'County_Price_Tier', 'Type_Rarity']
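
Because the feature matrix built later selects columns by name, a quick check that every expected column is present can catch a stale or truncated export early. The following is a minimal sketch: `missing_columns` is a hypothetical helper that is not part of the notebook, and the two lists are copied from the output above and from the feature list used in this notebook.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their given order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Column names as printed by the load cell
loaded_columns = [
    'Transaction unique identifier', 'Price', 'Date of Transfer',
    'Property Type', 'Old/New', 'Duration', 'Town/City', 'District',
    'County', 'PPDCategory Type', 'Record Status - monthly file only',
    'Property_Type_Encoded', 'County_Encoded', 'Old_New_Encoded',
    'Duration_Encoded', 'Type_Age_Interaction', 'County_Price_Tier',
    'Type_Rarity',
]

# Encoded columns the notebook uses as model inputs, plus the target
required_columns = [
    'Property_Type_Encoded', 'County_Encoded', 'Old_New_Encoded',
    'Duration_Encoded', 'Type_Age_Interaction', 'County_Price_Tier',
    'Type_Rarity', 'Price',
]

missing = missing_columns(loaded_columns, required_columns)
if missing:
    raise KeyError(f"Prepared data is missing columns: {missing}")
print("All required columns are present")
```

In the notebook itself, `df.columns` would take the place of the hard-coded `loaded_columns` list.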


In [3]:
df.head()

| | Transaction unique identifier | Price | Date of Transfer | Property Type | Old/New | Duration | Town/City | District | County | PPDCategory Type | Record Status - monthly file only | Property_Type_Encoded | County_Encoded | Old_New_Encoded | Duration_Encoded | Type_Age_Interaction | County_Price_Tier | Type_Rarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {6146E264-E0D9-4C53-ACC8-48DB3954F80B} | 95200 | 2007-11-23 00:00 | F | Y | L | SWINDON | SWINDON | SWINDON | A | A | 1 | 104 | 1 | 1 | 1 | 0 | 0.18128 |
| 1 | {26EBD75A-D90F-411C-85E5-4D56F0F66484} | 199950 | 2013-06-28 00:00 | S | Y | F | BINGLEY | BRADFORD | WEST YORKSHIRE | A | A | 3 | 117 | 1 | 0 | 3 | 0 | 0.27699 |
| 2 | {E700C723-9426-4924-8D3F-1730EC3B2BCC} | 132000 | 2001-06-26 00:00 | S | N | F | BRIGHTON | BRIGHTON AND HOVE | BRIGHTON AND HOVE | A | A | 3 | 11 | 0 | 0 | 0 | 0 | 0.27699 |
| 3 | {677E0E46-8E8F-4560-AD93-07F72D5AE6D5} | 60000 | 1997-10-31 00:00 | S | N | F | CAERPHILLY | CAERPHILLY | CAERPHILLY | A | A | 3 | 13 | 0 | 0 | 0 | 0 | 0.27699 |
| 4 | {E2387F76-24EC-4A7E-8A27-220E500F0DC2} | 87000 | 1998-02-23 00:00 | S | N | F | BEXLEYHEATH | BEXLEY | GREATER LONDON | A | A | 3 | 46 | 0 | 0 | 0 | 1 | 0.27699 |
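
In the five rows shown, `Type_Age_Interaction` equals `Property_Type_Encoded` multiplied by `Old_New_Encoded`. Whether the feature was actually built that way is an assumption, since the preparation step is not shown here, but the pattern can be checked against the preview. A small sketch using values copied from the rows above:

```python
# (Property_Type_Encoded, Old_New_Encoded, Type_Age_Interaction) for preview rows 0-4
preview_rows = [
    (1, 1, 1),
    (3, 1, 3),
    (3, 0, 0),
    (3, 0, 0),
    (3, 0, 0),
]

# Assumption under test: interaction = property type code * new-build flag
mismatches = [
    (type_code, new_flag, interaction)
    for type_code, new_flag, interaction in preview_rows
    if type_code * new_flag != interaction
]
print(f"Rows breaking the product pattern: {len(mismatches)}")
```

On the full frame the same comparison could be written as `(df['Property_Type_Encoded'] * df['Old_New_Encoded'] == df['Type_Age_Interaction']).all()`. Five matching rows do not prove the rule holds for every record.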


In [4]:
# Step 1: Define features and target
print("Step 1: Setting up features and target...")

# Features we'll use for prediction
features = [
    'Property_Type_Encoded', 'County_Encoded', 'Old_New_Encoded',
    'Duration_Encoded', 'Type_Age_Interaction', 'County_Price_Tier',
    'Type_Rarity'
]

# What we're trying to predict
target = 'Price'

X = df[features]
y = df[target]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Step 1: Setting up features and target...
Features shape: (17553, 7)
Target shape: (17553,)
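
Before fitting anything, it may be worth confirming that the features and target contain no missing values and that every price is positive, since some scikit-learn estimators (for example `LinearRegression`) raise an error on NaN inputs. The sketch below is one possible check, not part of the original notebook: `summarize_xy` is a hypothetical helper, and the toy frame stands in for the real `X` and `y` using values from the preview rows.

```python
import pandas as pd


def summarize_xy(X, y):
    """Collect basic sanity checks on a feature frame and a target series."""
    return {
        "rows_match": len(X) == len(y),
        "missing_features": int(X.isna().sum().sum()),
        "missing_target": int(y.isna().sum()),
        "non_positive_target": int((y <= 0).sum()),
    }


# Toy stand-in for the notebook's data (three rows, two of the seven features)
toy = pd.DataFrame({
    "Property_Type_Encoded": [1, 3, 3],
    "County_Encoded": [104, 117, 11],
    "Price": [95200, 199950, 132000],
})

report = summarize_xy(toy[["Property_Type_Encoded", "County_Encoded"]], toy["Price"])
print(report)
```

With the real data, calling `summarize_xy(X, y)` after Step 1 would give the same four counts for all 17553 rows.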
