
# House Price Prediction Using Real Estate Data

## Objective
The objective of this project is to build a machine learning model
that predicts house prices based on property attributes.

The solution is designed to be:
- Accurate
- Interpretable
- Deployable in a production environment


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor


In [5]:
df = pd.read_excel("/content/Case Study 1 Data.xlsx")
df.head()

Unnamed: 0,Property ID,Location,Size,Bedrooms,Bathrooms,Year Built,Condition,Type,Date Sold,Price
0,SI_000001,CityA,3974.0,2.0,2.0,2007.0,Good,Single Family,2020-11-02,324000.0
1,SI_000002,CityA,1660.0,2.0,3.0,1934.0,Good,Single Family,2022-10-23,795000.0
2,SI_000003,CityC,2094.0,2.0,2.0,1950.0,Good,Single Family,2020-11-30,385000.0
3,SI_000004,CityB,1930.0,2.0,3.0,1905.0,Good,Single Family,2021-12-09,651000.0
4,SI_000005,CityB,1895.0,5.0,2.0,1936.0,New,Single Family,2024-10-30,1878000.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247172 entries, 0 to 247171
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Property ID  247172 non-null  object        
 1   Location     247172 non-null  object        
 2   Size         244701 non-null  float64       
 3   Bedrooms     238769 non-null  float64       
 4   Bathrooms    240499 non-null  float64       
 5   Year Built   234567 non-null  float64       
 6   Condition    236544 non-null  object        
 7   Type         247172 non-null  object        
 8   Date Sold    247172 non-null  datetime64[ns]
 9   Price        241735 non-null  float64       
dtypes: datetime64[ns](1), float64(5), object(4)
memory usage: 18.9+ MB


In [7]:
df.describe(include="all")


Unnamed: 0,Property ID,Location,Size,Bedrooms,Bathrooms,Year Built,Condition,Type,Date Sold,Price
count,247172,247172,244701.0,238769.0,240499.0,234567.0,236544,247172,247172,241735.0
unique,247172,4,,,,,4,3,,
top,TO_100000,CityC,,,,,Good,Townhouse,,
freq,1,62082,,,,,94629,100000,,
mean,,,2402.547664,3.000457,2.002823,1961.429191,,,2022-07-02 05:00:11.127473664,466088.3
min,,,800.0,1.0,1.0,1900.0,,,2020-01-01 00:00:00,26000.0
25%,,,1603.0,2.0,1.0,1931.0,,,2021-04-02 00:00:00,300000.0
50%,,,2404.0,3.0,2.0,1961.0,,,2022-07-03 00:00:00,417000.0
75%,,,3203.0,4.0,3.0,1992.0,,,2023-10-02 00:00:00,577000.0
max,,,3999.0,5.0,3.0,2023.0,,,2024-12-31 00:00:00,2223000.0


## Initial Observations

- The dataset contains both numerical and categorical features
- Target variable: **Price**
- Size, Bedrooms, Bathrooms, Year Built, Condition, and Price contain missing values
- Date Sold needs feature extraction before modeling
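
The missing-value counts behind these observations can be derived from the `df.info()` output above. The sketch below copies the non-null counts from that output into a Series (the names `non_null` and `summary` are illustrative only); on the loaded frame itself, `df.isna().sum()` should report the same counts.

```python
import pandas as pd

# Total rows and non-null counts, copied from the df.info() output
total_rows = 247172
non_null = pd.Series(
    {
        "Size": 244701,
        "Bedrooms": 238769,
        "Bathrooms": 240499,
        "Year Built": 234567,
        "Condition": 236544,
        "Price": 241735,
    }
)

# Missing count and share per column
missing = total_rows - non_null
summary = pd.DataFrame(
    {
        "missing": missing,
        "percent": (missing / total_rows * 100).round(2),
    }
).sort_values("missing", ascending=False)

print(summary)
```

Year Built has the most gaps, which matters because Property_Age is derived from it.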


In [8]:
# Convert Date Sold to datetime; unparseable values become NaT
df["Date Sold"] = pd.to_datetime(df["Date Sold"], errors="coerce")

# Extract the sale year
df["Sale_Year"] = df["Date Sold"].dt.year

# Property age at the time of sale, in years (NaN where Year Built is missing)
df["Property_Age"] = df["Sale_Year"] - df["Year Built"]

In [9]:
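# Fill missing numerical values with each column's median.
# Note: the medians are computed on the full dataset, before the train/test split.
num_cols = ["Size", "Bedrooms", "Bathrooms", "Property_Age"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill missing categorical values with an explicit "Unknown" category
cat_cols = ["Location", "Condition", "Type"]
df[cat_cols] = df[cat_cols].fillna("Unknown")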
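
Because the medians are computed before the train/test split, rows that later land in the test set influence the fill values. One possible alternative, sketched here as an assumption and not as the notebook's method, is to move imputation into the preprocessing pipeline with scikit-learn's `SimpleImputer`, so the fill values are learned from whatever data the pipeline is fitted on. The names `numeric_steps`, `categorical_steps`, `leak_free_preprocessor`, and the `toy` rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["Size", "Bedrooms", "Bathrooms", "Property_Age"]
categorical_features = ["Location", "Condition", "Type"]

# Median imputation followed by scaling, learned only from the fitted data
numeric_steps = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# Constant "Unknown" fill followed by one-hot encoding
categorical_steps = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

leak_free_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, numeric_features),
        ("cat", categorical_steps, categorical_features),
    ]
)

# Made-up rows with gaps, only to show that the transform runs end to end
toy = pd.DataFrame(
    {
        "Size": [1500.0, np.nan, 3000.0, 2200.0],
        "Bedrooms": [2.0, 3.0, np.nan, 4.0],
        "Bathrooms": [1.0, 2.0, 2.0, np.nan],
        "Property_Age": [10.0, np.nan, 55.0, 80.0],
        "Location": ["CityA", "CityB", "CityC", "CityA"],
        "Condition": ["Good", np.nan, "New", "Good"],
        "Type": ["Townhouse", "Single Family", "Townhouse", "Single Family"],
    }
)

transformed = leak_free_preprocessor.fit_transform(toy)
print(transformed.shape)
```

With this layout the `fillna` cell would no longer be needed, and the same preprocessor could be fitted on `X_train` only.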


In [10]:
df = df.drop(columns=["Property ID", "Date Sold", "Year Built"])


In [17]:
# Drop rows where 'Price' is NaN
df_cleaned = df.dropna(subset=['Price'])

X = df_cleaned.drop(columns=["Price"])
y = df_cleaned["Price"]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [13]:
numeric_features = ["Size", "Bedrooms", "Bathrooms", "Property_Age"]
categorical_features = ["Location", "Condition", "Type"]

numeric_transformer = StandardScaler()

categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


In [14]:
baseline_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LinearRegression())
    ]
)

baseline_model.fit(X_train, y_train)


In [19]:
df["Price"].isna().sum()


np.int64(5437)

In [20]:
df = df.dropna(subset=["Price"])


In [21]:
X = df.drop(columns=["Price"])
y = df["Price"]


In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [23]:
baseline_model.fit(X_train, y_train)


In [24]:
y_pred = baseline_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.3f}")


RMSE: 144485.06
MAE: 109171.87
R2: 0.605


The 5,437 rows with a missing target value (Price) were removed,
because supervised learning requires a label for every training example.


## Baseline Model Results

Linear Regression is used as a baseline to establish
a reference performance level.

More complex models are evaluated next to capture
non-linear relationships in real estate pricing.
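
For reference, the three metrics used to score the models can be computed by hand. The sketch below uses three made-up prices (not taken from the dataset) to show what RMSE, MAE, and R² measure; the variable names are illustrative only.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only
y_true = np.array([300000.0, 400000.0, 500000.0])
y_hat = np.array([310000.0, 380000.0, 530000.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as Price
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large misses more
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R2: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: {mae_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")
print(f"R2: {r2_manual:.3f}")
```

RMSE is always at least as large as MAE, and the gap widens when a few predictions are far off.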


In [25]:
rf_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", RandomForestRegressor(
            n_estimators=200,
            n_jobs=-1,
            random_state=42
        ))
    ]
)

rf_model.fit(X_train, y_train)


In [27]:
y_pred_rf = rf_model.predict(X_test)

# RMSE is taken as the square root of MSE, because the `squared=False`
# argument of mean_squared_error is deprecated in recent scikit-learn releases.
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"RF RMSE: {rmse_rf:.2f}")
print(f"RF MAE: {mae_rf:.2f}")
print(f"RF R2: {r2_rf:.3f}")


RF RMSE: 144684.92
RF MAE: 106785.75
RF R2: 0.604


## Model Comparison & Selection

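The two models perform almost identically on the test set.
Random Forest achieves a lower MAE (106,785.75 vs. 109,171.87),
but its RMSE is slightly higher (144,684.92 vs. 144,485.06) and
its R² slightly lower (0.604 vs. 0.605). The non-linear model
therefore offers no clear accuracy gain over the linear baseline
on these features.

Random Forest is selected for deployment on the basis of its lower
MAE and its support for multicore training (`n_jobs=-1`), although
the margin over the baseline is small.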
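
Placing the metrics reported above side by side makes the comparison concrete. The sketch below copies the printed test-set values into a small table and computes the difference (Random Forest minus Linear Regression); the names `results` and `delta` are illustrative only.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs of the two evaluation cells
results = pd.DataFrame(
    {
        "RMSE": [144485.06, 144684.92],
        "MAE": [109171.87, 106785.75],
        "R2": [0.605, 0.604],
    },
    index=["Linear Regression", "Random Forest"],
)

# Positive values mean Random Forest is higher on that metric
delta = results.loc["Random Forest"] - results.loc["Linear Regression"]

print(results)
print(delta.round(3))
```

Lower is better for RMSE and MAE, and higher is better for R², so a negative MAE difference favors Random Forest while the other two differences favor the baseline.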


In [28]:
import joblib

joblib.dump(rf_model, "house_price_model.pkl")


['house_price_model.pkl']

## Model Deployment

The trained Random Forest pipeline is served with FastAPI.
A `/predict` endpoint accepts property features as a JSON object
whose keys match the training column names (for example `Size`,
`Bedrooms`, `Property_Age`, `Location`) and returns the predicted
house price.

This setup enables easy integration with web or mobile applications.


In [None]:
%%writefile api.py
from fastapi import FastAPI
import joblib
import pandas as pd

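app = FastAPI(title="House Price Prediction API")

# Load the fitted pipeline (preprocessing + Random Forest) once at startup
model = joblib.load("house_price_model.pkl")

@app.post("/predict")
def predict_price(data: dict):
    # Wrap the JSON object in a one-row DataFrame; keys must match the training column names
    df = pd.DataFrame([data])
    prediction = model.predict(df)
    # Convert the NumPy scalar to a plain float so it serializes to JSON
    return {"predicted_price": float(prediction[0])}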
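
The endpoint passes the request body straight to the pipeline, so the caller has to supply the engineered features, including `Property_Age`, under the exact training column names. The sketch below shows what such a request body might look like and how it becomes a one-row frame. It assumes the seven columns listed in the preprocessor are the ones required; the payload values mirror the fifth sample row shown earlier, and `to_frame` and `expected_columns` are hypothetical helpers, not part of `api.py`.

```python
import pandas as pd

# Assumption: the seven columns the preprocessor selects by name
expected_columns = [
    "Size", "Bedrooms", "Bathrooms", "Property_Age",
    "Location", "Condition", "Type",
]

# Hypothetical request body for POST /predict
payload = {
    "Size": 1895.0,
    "Bedrooms": 5.0,
    "Bathrooms": 2.0,
    "Property_Age": 88.0,  # sale year 2024 minus year built 1936
    "Location": "CityB",
    "Condition": "New",
    "Type": "Single Family",
}

def to_frame(data: dict) -> pd.DataFrame:
    """Validate the keys, then build the one-row frame the pipeline expects."""
    missing = [col for col in expected_columns if col not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return pd.DataFrame([data])[expected_columns]

row = to_frame(payload)
print(row)
```

A check like this, or a typed request model, would let the API reject incomplete requests with a clear message instead of failing inside `model.predict`.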
