🏠 House Price Prediction using Machine Learning
In this project, I explore the Kaggle housing dataset to predict house prices using regression techniques. I applied data preprocessing and feature engineering, then built a baseline model using Linear Regression. This notebook also covers the analysis of missing values, encoding strategies, and performance evaluation.

📌 Objective
The goal is to predict the final sale price of homes in Ames, Iowa using features such as living area, neighborhood, and amenities. The aim is to minimize the Root Mean Squared Error (RMSE) of the predictions.
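To make the metric concrete, here is a minimal sketch of how RMSE is computed. The prices are made up for illustration and are not taken from the dataset, and the `rmse` helper is a hypothetical name rather than part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up sale prices and predictions, for illustration only
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

print(f"RMSE: {rmse(actual, predicted):.2f}")
```

The residuals here are 10,000, 10,000 and 20,000, so the result is roughly 14,142. Because the errors are squared before averaging, the single larger miss pulls the score up more than the two smaller ones.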

🧾 Dataset Overview
The dataset includes 79 explanatory variables covering almost every aspect of residential homes. Key columns include:

LotArea: Lot size in square feet
GrLivArea: Above ground living area
OverallQual: Overall material and finish quality
YearBuilt: Original construction year
SalePrice: Target variable (house price)
The full description can be found in 'data_description.txt'.

Correlation heatmap
plt.figure(figsize=(12, 10))
corr = train.corr(numeric_only=True)  # skip text columns, which cannot be correlated
sns.heatmap(corr['SalePrice'].sort_values(ascending=False).to_frame(), annot=True)
plt.title("Feature Correlation with SalePrice")
plt.show()

📉 Model Evaluation
The Linear Regression model achieved an RMSE of XXXX on the validation set. Although simple, it provides a baseline against which other models can be compared. Further improvements can be achieved using:

Ridge/Lasso regularization
Random Forest / Gradient Boosting
Hyperparameter tuning (GridSearchCV)
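
As a rough illustration of the first and third ideas combined, the sketch below tunes the `alpha` of a Ridge model with `GridSearchCV`. It runs on synthetic data from `make_regression`, not on the housing data, and the `alpha` grid is an arbitrary assumption chosen only for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (assumption: not the real data)
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}  # arbitrary example grid

search = GridSearchCV(pipeline, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
best_rmse = -search.best_score_  # scikit-learn reports the negated RMSE
print(f"Best alpha: {best_alpha}, cross-validated RMSE: {best_rmse:.2f}")
```

On the real dataset, `X_demo` and `y_demo` would be replaced by the prepared feature matrix and `SalePrice`.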
✅ Conclusion
Performed thorough data cleaning and feature engineering.
Handled missing values strategically.
Built a baseline model with Linear Regression.
Used domain logic to create new features (TotalSF and HouseAge).
🚀 Future Improvements:
Apply ensemble models like XGBoost or LightGBM
Perform hyperparameter optimization
Explore feature selection techniques
NEW FEATURE
Create a feature for house age
train['HouseAge'] = train['YrSold'] - train['YearBuilt']
test['HouseAge'] = test['YrSold'] - test['YearBuilt']

I added a new feature HouseAge as I assumed the age of a house can affect its sale price.
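
A small sketch of the same calculation on toy rows, to show what the feature looks like. The values are invented for illustration, and the `clip(lower=0)` step is an added assumption that guards against a sale year recorded earlier than the construction year.

```python
import pandas as pd

# Toy rows for illustration; only the column names match the dataset
homes = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2008],
    "YrSold": [2008, 2007, 2007],
})

# Age at time of sale; negative ages are clipped to zero (assumed safeguard)
homes["HouseAge"] = (homes["YrSold"] - homes["YearBuilt"]).clip(lower=0)
print(homes["HouseAge"].tolist())  # [5, 31, 0]
```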

# 📦 Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 📥 Load Data

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Show head

In [None]:
display(train.head())

# 🧠 Data Overview


In [None]:
display(train.describe())
train.info()  # prints its summary directly and returns None, so display() is not needed

# 🔍 Missing Values

In [None]:
missing = train.isnull().sum().sort_values(ascending=False)
display(missing[missing > 0])
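
Raw counts are easier to act on when expressed as a share of rows. The sketch below shows one way to do that on a toy frame. The 50% cut-off for flagging columns is a hypothetical rule of thumb, not a threshold used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the column names are borrowed from the dataset
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)

# Hypothetical rule of thumb: flag columns that are more than half empty
drop_candidates = missing_pct[missing_pct > 50].index.tolist()
print(drop_candidates)  # ['PoolQC']
```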

# 🧹 Handle Missing

In [None]:
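# 🔹 Fill numeric column 'LotFrontage' with the training median (assignment form, no inplace)
lot_frontage_median = train['LotFrontage'].median()
train['LotFrontage'] = train['LotFrontage'].fillna(lot_frontage_median)
test['LotFrontage'] = test['LotFrontage'].fillna(lot_frontage_median)

# 🔹 Fill any other numeric gaps (e.g. MasVnrArea, GarageYrBlt) so the model never receives NaN
num_cols = train.select_dtypes(include='number').columns.drop('SalePrice')
train[num_cols] = train[num_cols].fillna(train[num_cols].median())
test[num_cols] = test[num_cols].fillna(train[num_cols].median())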

# 🔹 Drop columns with too many missing values
cols_to_drop = ['Alley', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu']
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)

# 🔹 Fill remaining categorical columns with mode
for col in train.select_dtypes(include='object'):
    train[col] = train[col].fillna(train[col].mode()[0])

for col in test.select_dtypes(include='object'):
    test[col] = test[col].fillna(test[col].mode()[0])

# Categorical nulls

In [None]:
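# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which recent pandas versions warn about
for col in train.select_dtypes(include='object'):
    train[col] = train[col].fillna(train[col].mode()[0])
for col in test.select_dtypes(include='object'):
    test[col] = test[col].fillna(test[col].mode()[0])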

# 🛠 Feature Engineering

In [None]:
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']

train = pd.get_dummies(train)
test = pd.get_dummies(test)
train, test = train.align(test, join='left', axis=1, fill_value=0)
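
The effect of `align` is easiest to see on a toy example. In the sketch below (invented data, with one category that appears only in the training frame), the test frame ends up with exactly the training columns, and anything it lacked is filled with 0.

```python
import pandas as pd

# Toy frames: 'Grvl' appears only in training, and test has no SalePrice
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl"], "SalePrice": [200000, 150000]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"]})

toy_train = pd.get_dummies(toy_train)
toy_test = pd.get_dummies(toy_test)

# join='left' keeps exactly the training columns; columns absent from test are filled with 0
toy_train, toy_test = toy_train.align(toy_test, join="left", axis=1, fill_value=0)

print(toy_train.columns.tolist())
print(toy_test.columns.tolist())
```

Note that the aligned test frame also gains a `SalePrice` column of zeros, since that column exists in training. It would need to be dropped from the test frame before predicting.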

# 📊 Visual Check (with plt.show)

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(train['SalePrice'], kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

# ⚙️ Prepare Data

In [None]:
X = train.drop('SalePrice', axis=1)
y = train['SalePrice']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
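
The scaler is fitted on the training split only, and the validation split is transformed with those same statistics. A minimal sketch with made-up numbers (the `demo_scaler`, `train_part` and `val_part` names are hypothetical) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only
train_part = np.array([[1000.0], [2000.0], [3000.0]])
val_part = np.array([[4000.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_part)  # learns mean and std from training rows
val_scaled = demo_scaler.transform(val_part)          # reuses the training mean and std

print(demo_scaler.mean_, demo_scaler.scale_)
print(val_scaled)
```

The validation value is expressed relative to the training mean of 2000, so it lands well above zero instead of being re-centred on itself. Fitting the scaler on the validation data as well would leak information about that split into the preprocessing.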

# 🤖 Train Model

In [None]:

# 🧠 Initialize and train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 🔮 Make predictions on validation set
y_pred = model.predict(X_val_scaled)

# 📏 Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))  # the 'squared=False' argument was removed in recent scikit-learn releases

print(f"✅ Linear Regression RMSE on validation set: {rmse:.2f}")
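
A single 80/20 split can give a noisy estimate, so one possible follow-up is to cross-validate the same baseline. The sketch below does this on synthetic data from `make_regression` (an assumption, since the real files are not loaded here), with 5 folds chosen arbitrarily.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and target (assumption: not the real data)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# Same steps as the notebook (scale, then fit Linear Regression), wrapped in a pipeline
baseline = make_pipeline(StandardScaler(), LinearRegression())

scores = cross_val_score(baseline, X_demo, y_demo,
                         scoring="neg_root_mean_squared_error", cv=5)
fold_rmse = -scores  # scikit-learn returns negated RMSE values

print(f"RMSE per fold: {np.round(fold_rmse, 2)}")
print(f"Mean RMSE: {fold_rmse.mean():.2f}")
```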