# Paris House Price Prediction - Regression Problem

## Introduction

This project predicts house prices in Paris using a synthetic dataset: the data is imaginary and describes various features of houses in an urban environment. The dataset is well suited to educational use, letting students and practitioners practice regression modeling and build their data science skills.

## Content

The dataset provides a comprehensive view of house attributes, making it suitable for building regression models to predict house prices. Each row represents a house, and each column represents a specific feature of the house.

## Source

This dataset is available on Kaggle at the following link:
> <https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction/data>

## Data Dictionary

All attributes in the dataset are numeric variables, which are described below:

- **squareMeters**: The total area of the house in square meters. This is numeric.
- **numberOfRooms**: The total number of rooms in the house. This is numeric.
- **hasYard**: Indicates whether the house has a yard (1 for yes, 0 for no). This is binary.
- **hasPool**: Indicates whether the house has a swimming pool (1 for yes, 0 for no). This is binary.
- **floors**: The number of floors in the house. This is numeric.
- **cityCode**: The zip code of the area where the house is located. This is numeric.
- **cityPartRange**: Indicates the exclusivity of the neighborhood (the higher the range, the more exclusive the neighborhood).
- **numPrevOwners**: The number of previous owners the house has had. This is numeric.
- **made**: The year the house was built. This is numeric.
- **isNewBuilt**: Indicates whether the house is newly built (1 for yes, 0 for no). This is binary.
- **hasStormProtector**: Indicates whether the house has a storm protector (1 for yes, 0 for no). This is binary.
- **basement**: The size of the basement in square meters. This is numeric.
- **attic**: The size of the attic in square meters. This is numeric.
- **garage**: The size of the garage in square meters. This is numeric.
- **hasStorageRoom**: Indicates whether the house has a storage room (1 for yes, 0 for no). This is binary.
- **hasGuestRoom**: The number of guest rooms in the house. This is numeric.
- **price**: The price of the house. This is numeric and is the target variable the model predicts.

## Problem Statement

1. **Model Building**: Train a model on the dataset to predict the price of a house in Paris.
2. **Model Evaluation**: Evaluate the model with several metrics: R2 Score, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
3. **Model Optimization**: Reduce the prediction error using cross-validation and hyperparameter tuning.
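
The four evaluation metrics named above can be written out by hand to make their definitions concrete. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the four sample prices are made up for illustration and are not part of the notebook.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target

    # R2 compares the model's squared error with that of always
    # predicting the mean of the true values.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mse": mse, "rmse": rmse}


# Made-up values, for illustration only
y_true_demo = [100.0, 200.0, 300.0, 400.0]
y_pred_demo = [110.0, 190.0, 310.0, 390.0]
scores = regression_metrics(y_true_demo, y_pred_demo)
print(scores)
```

RMSE and MAE are in the same units as the price, which makes them easier to read than MSE; R2 is unitless and equals 1 for a perfect fit.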

### Load Libraries

In [1]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model building and evaluation
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score
from sklearn import metrics

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

### Settings

In [2]:
# Warnings
warnings.filterwarnings("ignore")
# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "ParisHousing_uf.csv")

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check Data
df.head()

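| Unnamed: 0 | squareMeters | numberOfRooms | hasYard | hasPool | floors | cityPartRange | numPrevOwners | made | isNewBuilt | hasStormProtector | basement | attic | garage | hasStorageRoom | hasGuestRoom | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75523 | 3 | 0 | 1 | 63 | 3 | 8 | 2005 | 0 | 1 | 4313 | 9005 | 956 | 0 | 7 | 7559081.5 |
| 1 | 80771 | 39 | 1 | 1 | 98 | 8 | 6 | 2015 | 1 | 0 | 3653 | 2436 | 128 | 1 | 2 | 8085989.5 |
| 2 | 55712 | 58 | 0 | 1 | 19 | 6 | 8 | 2021 | 0 | 0 | 2937 | 8852 | 135 | 1 | 9 | 5574642.1 |
| 3 | 32316 | 47 | 0 | 0 | 6 | 10 | 4 | 2012 | 0 | 1 | 659 | 7141 | 359 | 0 | 3 | 3232561.2 |
| 4 | 70429 | 19 | 1 | 1 | 90 | 3 | 7 | 1990 | 1 | 0 | 8435 | 2429 | 292 | 1 | 4 | 7055052.0 |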


### Preprocessing

- Separate the input and output features for training the model.
- Split the input and output data into a training set for fitting the model and a test set for checking its performance.
- Scale the data so that every feature has zero mean and unit variance.

In [5]:
# Separate Input and output features
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
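
One thing worth checking with positional slicing: if the `Unnamed: 0` column in the preview is a leftover row index written out when the CSV was saved (an assumption; the source does not say), `df.iloc[:, :-1]` would pass it to the model as a feature. The sketch below shows how such a column arises and one way to drop it. It uses a two-row, three-column excerpt of the preview, and the names `raw`, `clean`, `X_demo`, and `y_demo` are illustrative.

```python
import io

import pandas as pd

# Two preview rows re-typed as CSV. The empty first header cell mimics
# an index that was written out by DataFrame.to_csv().
csv_text = """,squareMeters,numberOfRooms,price
0,75523,3,7559081.5
1,80771,39,8085989.5
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # pandas names the empty header "Unnamed: 0"

# Drop any auto-named index columns before separating features and target.
clean = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
X_demo = clean.iloc[:, :-1]
y_demo = clean.iloc[:, -1]
print(list(X_demo.columns), y_demo.name)
```

Reading the file with `pd.read_csv(csv_path, index_col=0)` would be an alternative, again assuming the first column really is an index.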

In [6]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Standardize the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
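
What the scaler does in the cell above can be reproduced by hand. The sketch below uses random stand-in data (not the housing data) and illustrative names with a `_demo` suffix. It shows that the mean and standard deviation are estimated on the training split only and then reused for the test split, which is why `fit_transform` is called on the training data and `transform` on the test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature matrices: 80 training rows and 20 test rows, 3 features
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics come from the training split only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std  # reuses the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 up to floating-point error; the test columns are only close to that, because no test statistics were used.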

### Model Training and Evaluation

- Train the model with training data set for prediction
- Evaluate the performance of the model using different metrics

In [8]:
# Function to train the model with training data and evaluate the metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict train and test
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print the evaluation metrics for train and test
    print("=" * 60)
    print("EVALUATION OF MODEL FOR TRAIN")
    print("=" * 60)
    print(f"Score: {r2_score(y_train, y_train_pred)}")
    print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))}")
    print(f"MSE: {metrics.mean_squared_error(y_train, y_train_pred)}")
    print(f"MAE: {metrics.mean_absolute_error(y_train, y_train_pred)}")
    print("=" * 60)
    print("EVALUATION OF MODEL FOR TEST")
    print("=" * 60)
    print(f"Score: {r2_score(y_test, y_test_pred)}")
    print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))}")
    print(f"MSE: {metrics.mean_squared_error(y_test, y_test_pred)}")
    print(f"MAE: {metrics.mean_absolute_error(y_test, y_test_pred)}")

In [9]:
# Define Ridge regressor
r = Ridge()

train_evaluate(r)

EVALUATION OF MODEL FOR TRAIN
Score: 0.9999995458042785
RMSE: 1924.5017064988056
MSE: 3703706.8183168145
MAE: 1503.7398941760173
EVALUATION OF MODEL FOR TEST
Score: 0.9999995631596791
RMSE: 1956.0618086465058
MSE: 3826177.7992454395
MAE: 1541.6978919808678


### Insights

The results of applying Ridge regression to the Paris House Price prediction dataset indicate an exceptionally high-performing model. Here's a detailed breakdown:

#### Training Performance:

- **R2 Score: 0.9999995** This shows that the model explains nearly all of the variance in the training data, a near-perfect fit.
- **RMSE: 1924.50** The typical prediction error is about **1924.5** in the units of the target (price); because errors are squared before averaging, larger misses count more heavily than in MAE.
- **MSE: 3,703,707** A large number because MSE squares the errors; it is the square of the RMSE and penalizes larger discrepancies more strongly.
- **MAE: 1503.74** On average, the absolute difference between the predicted and actual values is around **1503.74**, which is low for prices in the millions.

#### Testing Performance:

- **R2 Score: 0.9999996** Similarly high, indicating excellent generalization to the unseen data.
- **RMSE: 1956.06** The error in the test data is only slightly higher than in the training data, which is a positive sign of **minimal overfitting**.
- **MSE: 3,826,178** A slight increase (about 3%) compared to the training set, showing that the model generalizes well.
- **MAE: 1541.70** The average error on test data is only slightly higher than the training set, showing **good stability**.

#### Analysis:

- **Excellent Fit:** The training and testing metrics are nearly identical, which suggests that the model **does not overfit**, even with Ridge's default regularization strength (`alpha=1.0`).
- **Small Errors:** The test RMSE (about 1,956) and MAE (about 1,542) are small next to house prices, which run into the millions in the preview rows, meaning the model is **highly accurate in predicting prices**.
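
To put "small compared to the scale of house prices" into numbers, the reported test errors can be divided by the five prices shown in the data preview. This is a rough check on a handful of rows, not a statistic over the whole dataset; the variable names are illustrative.

```python
# Test-set errors reported for the Ridge model above
test_rmse = 1956.0618086465058
test_mae = 1541.6978919808678

# Prices from the five preview rows of the dataset
sample_prices = [7559081.5, 8085989.5, 5574642.1, 3232561.2, 7055052.0]

for price in sample_prices:
    print(
        f"price {price:>12,.1f}  "
        f"RMSE share {test_rmse / price:.4%}  "
        f"MAE share {test_mae / price:.4%}"
    )

# Largest relative error occurs for the cheapest house in the sample
worst_share = test_rmse / min(sample_prices)
```

Even for the cheapest of the five houses, the RMSE is below 0.1% of the price.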

#### Conclusion:

The Ridge regression model is performing exceptionally well on the Paris House Price dataset. The metrics indicate that the model has learned the underlying pattern in the data without overfitting, making it a reliable predictor for unseen data.

In [14]:
# Define Lasso Regressor
l = Lasso()
train_evaluate(l)

EVALUATION OF MODEL FOR TRAIN
Score: 0.999999561443655
RMSE: 1891.0781663100142
MSE: 3576176.6310944455
MAE: 1470.7642304227854
EVALUATION OF MODEL FOR TEST
Score: 0.9999995781391223
RMSE: 1922.232168698513
MSE: 3694976.510379389
MAE: 1509.2063574238223


### Insights

The Lasso regression model on the Paris House Price prediction dataset also demonstrates excellent performance, but with some subtle differences compared to Ridge regression. Here’s a breakdown:

#### Training Performance:

- **R2 Score: 0.9999996** This is nearly identical to the Ridge model, indicating that Lasso is also capturing almost all of the variance in the training data.
- **RMSE: 1891.08** About 33 lower than Ridge's training RMSE (1924.50), a slightly closer fit to the training data.
- **MSE: 3,576,177** Lower than Ridge's training MSE, as expected since MSE is the square of RMSE.
- **MAE: 1470.76** Slightly lower than Ridge's training MAE (1503.74), so the average absolute error is also slightly smaller on the training data.

#### Testing Performance:

- **R2 Score: 0.9999996** Nearly identical to the training score, showing a good fit without overfitting.
- **RMSE: 1922.23** About 34 lower than Ridge's test RMSE (1956.06), a slightly better result on unseen data.
- **MSE: 3,694,977** Lower than Ridge's test MSE, consistent with the lower RMSE.
- **MAE: 1509.21** Slightly lower than Ridge's test MAE (1541.70), so Lasso generalizes marginally better in terms of average absolute error.

#### Analysis:

- **Improvement over Ridge:** Lasso shows slight improvements in both training and testing metrics (**lower RMSE, MSE, and MAE**), which could indicate that Lasso's regularization helps to reduce complexity, possibly removing some less important features by shrinking their coefficients to exactly zero.
- **Very Small Errors:** The differences between Lasso and Ridge are minor (under 2% in RMSE on both splits) but show that Lasso is handling the data very well, slightly better in terms of **error minimization**.
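
Whether Lasso actually removed any features can be checked by looking for coefficients that are exactly zero in the fitted model's `coef_` attribute. The sketch below demonstrates the idea on synthetic data, not on the housing data: the feature names and the generated values are made up, and only the third feature is constructed to be irrelevant.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two informative features and one pure-noise feature
X_demo = rng.normal(size=(500, 3))
y_demo = 1000.0 * X_demo[:, 0] + 50.0 * X_demo[:, 1] + rng.normal(size=500)

X_demo_s = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso()  # default alpha=1.0, as in the cell above
lasso_demo.fit(X_demo_s, y_demo)

feature_names = ["signal_strong", "signal_weak", "noise_only"]
dropped = [
    name
    for name, coef in zip(feature_names, lasso_demo.coef_)
    if coef == 0.0  # Lasso sets removed features to exactly zero
]
print("coefficients:", dict(zip(feature_names, lasso_demo.coef_)))
print("dropped features:", dropped)
```

Applied to the notebook's model, the same check would be `l.coef_` paired with `X.columns`, which would confirm or rule out the feature-removal explanation.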

#### Conclusion:

Lasso regression appears to perform marginally better than Ridge on the Paris House Price dataset, with slightly lower errors. Both models are nearly perfect in terms of R2 scores, but Lasso has a slight edge in handling both training and testing errors.
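
The optimization step from the problem statement (cross-validation and hyperparameter tuning) could be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual tuning run: the pipeline step names, the `alpha` grid, and the generated data are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic regression data standing in for the housing features
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# Putting the scaler inside the pipeline refits it on each training fold,
# so no statistics leak from the validation fold.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Illustrative grid of regularization strengths
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
)
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated RMSE:", -search.best_score_)
```

The same pattern works for Lasso by swapping the estimator and using `lasso__alpha` as the grid key.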