# Paris House Price Prediction - Regression Problem

## Introduction

This project predicts house prices in Paris using a synthetic dataset whose features describe houses in an urban environment. Because the data is imaginary, the dataset is best suited to educational use: students and practitioners can use it to practice regression modeling and build their data science skills.

## Content

The dataset provides a comprehensive view of house attributes, making it suitable for building regression models to predict house prices. Each row represents a house, and each column represents a specific feature of the house.

## Source

This dataset is available on Kaggle at the following link:
> [https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction/data]

## Data Dictionary

All attributes in the dataset are numeric variables, which are described below:

- **squareMeters**: The total area of the house in square meters. This is numeric.
- **numberOfRooms**: The total number of rooms in the house. This is numeric.
- **hasYard**: Indicates whether the house has a yard (1 for yes, 0 for no). This is binary.
- **hasPool**: Indicates whether the house has a swimming pool (1 for yes, 0 for no). This is binary.
- **floors**: The number of floors in the house. This is numeric.
- **cityCode**: The zip code of the area where the house is located. This is numeric.
- **cityPartRange**: Indicates the exclusivity of the neighborhood (the higher the range, the more exclusive the neighborhood). This is numeric.
- **numPrevOwners**: The number of previous owners the house has had. This is numeric.
- **made**: The year the house was built. This is numeric.
- **isNewBuilt**: Indicates whether the house is newly built (1 for yes, 0 for no). This is binary.
- **hasStormProtector**: Indicates whether the house has a storm protector (1 for yes, 0 for no). This is binary.
- **basement**: The size of the basement in square meters. This is numeric.
- **attic**: The size of the attic in square meters. This is numeric.
- **garage**: The size of the garage in square meters. This is numeric.
- **hasStorageRoom**: Indicates whether the house has a storage room (1 for yes, 0 for no). This is binary.
- **hasGuestRoom**: The number of guest rooms in the house. This is numeric.
- **price**: The price of the house (target variable). This is numeric.
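
The data dictionary can be checked against a loaded frame before modeling. The snippet below is a minimal sketch on a hypothetical two-row sample: the column names come from the dictionary above, while the `sample` frame, the `binary_cols` list, and the `hasGuestRoom` values are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical sample; values are illustrative, not a slice of the real file
sample = pd.DataFrame({
    "squareMeters": [75523, 80771],
    "hasYard": [0, 1],
    "hasPool": [1, 1],
    "hasGuestRoom": [7, 2],   # a count of guest rooms, not a 0/1 flag
    "price": [7559081.5, 8085989.5],
})

# Columns the data dictionary describes as binary (subset used in this sample)
binary_cols = ["hasYard", "hasPool"]

# Every column should have a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample.dtypes)

# Binary columns should contain only 0 and 1
binary_ok = bool(sample[binary_cols].isin([0, 1]).all().all())

print(f"All columns numeric: {all_numeric}")
print(f"Binary columns valid: {binary_ok}")
```

Note that `hasGuestRoom` is named like a flag but holds a count, so it is deliberately left out of `binary_cols`.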

## Problem Statement

1. **Model Building**: Train a model on the dataset to predict the price of a house in Paris.
2. **Model Evaluation**: Evaluate the performance of the model with several metrics: R² score, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
3. **Model Optimization**: Improve the model and reduce its prediction error using cross-validation and hyperparameter tuning.
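
The optimization step can be sketched with `GridSearchCV`, which combines cross-validation and hyperparameter tuning. This is a minimal illustration, not the notebook's own tuning code: the data is synthetic, and the names `X_demo`, `y_demo`, `pipe`, and the grid values are assumptions chosen only to show the mechanics.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical, not the Paris dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([3.0, 2.0, 1.0]) + rng.normal(scale=0.1, size=120)

# Scaling sits inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), SVR())

# Illustrative grid; these values are assumptions, not tuned recommendations
param_grid = {"svr__C": [0.1, 1.0, 10.0], "svr__kernel": ["linear", "rbf"]}

# 5-fold cross-validation, scored by R2
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean CV R2: {search.best_score_:.3f}")
```

The `svr__` prefix is the step name that `make_pipeline` derives from the `SVR` class, which is how the grid addresses parameters of a step inside the pipeline.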

### Load Libraries

In [1]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model building and evaluation
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn import metrics

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

### Settings

In [23]:
# Warnings
warnings.filterwarnings("ignore")
# Path
data_path = "../data"
model_path = "../models"
# csv_path = os.path.join(data_path, "ParisHousing_uf.csv")
# csv_path = os.path.join(data_path, "ParisHousing_out.csv")
csv_path = os.path.join(data_path, "ParisHousing_sf.csv")

In [24]:
df = pd.read_csv(csv_path)

In [25]:
# Check Data
df.head()

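| Unnamed: 0 | squareMeters | numberOfRooms | hasYard | hasPool | cityPartRange | numPrevOwners | made | isNewBuilt | hasStormProtector | basement | garage | hasStorageRoom | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75523 | 3 | 0 | 1 | 3 | 8 | 2005 | 0 | 1 | 4313 | 956 | 0 | 7559081.5 |
| 1 | 80771 | 39 | 1 | 1 | 8 | 6 | 2015 | 1 | 0 | 3653 | 128 | 1 | 8085989.5 |
| 2 | 55712 | 58 | 0 | 1 | 6 | 8 | 2021 | 0 | 0 | 2937 | 135 | 1 | 5574642.1 |
| 3 | 32316 | 47 | 0 | 0 | 10 | 4 | 2012 | 0 | 1 | 659 | 359 | 0 | 3232561.2 |
| 4 | 70429 | 19 | 1 | 1 | 3 | 7 | 1990 | 1 | 0 | 8435 | 292 | 1 | 7055052.0 |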


### Preprocessing

- Separate the input and output features for training the model.
- Split the input and output data into a training set for fitting the model and a test set for measuring its performance.
- Standardize the input features so that they are all on a comparable scale.

In [26]:
# Separate input and output features
# Note: X keeps every column except the last, including the saved index column "Unnamed: 0"
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [27]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [28]:
# Standardize the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
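
The scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. The toy example below sketches what that means numerically; the arrays `train` and `test` and the name `scaler_demo` are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values)
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

scaler_demo = StandardScaler()
train_s = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_s = scaler_demo.transform(test)        # reuses the train mean and std

print(train_s.mean(axis=0))  # each column is centered close to 0
print(train_s.std(axis=0))   # each column has a standard deviation close to 1
print(test_s)                # test values are expressed in train standard deviations
```

After scaling, both features land on the same scale even though the raw values differ by a factor of 100, which matters for distance-based models such as SVR.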

### Model Training and Evaluation

- Train the model with training data set for prediction
- Evaluate the performance of the model using different metrics

In [29]:
# Function to train the model with training data and evaluate the metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict train and test
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print the evaluation metrics for train and test
    print("=" * 60)
    print("EVALUATION OF MODEL FOR TRAIN")
    print("=" * 60)
    print(f"Score: {r2_score(y_train, y_train_pred)}")
    print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))}")
    print(f"MSE: {metrics.mean_squared_error(y_train, y_train_pred)}")
    print(f"MAE: {metrics.mean_absolute_error(y_train, y_train_pred)}")
    print("=" * 60)
    print("EVALUATION OF MODEL FOR TEST")
    print("=" * 60)
    print(f"Score: {r2_score(y_test, y_test_pred)}")
    print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))}")
    print(f"MSE: {metrics.mean_squared_error(y_test, y_test_pred)}")
    print(f"MAE: {metrics.mean_absolute_error(y_test, y_test_pred)}")
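
To make the four printed metrics concrete, the sketch below computes them by hand with NumPy on a tiny made-up example. The arrays `y_true` and `y_pred` are hypothetical values, not model output from this notebook.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # average absolute error
mse = np.mean(residuals ** 2)      # average squared error
rmse = np.sqrt(mse)                # square root of MSE, in the units of y

# R2 compares the model's squared error with that of always predicting the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: {mae}")    # 1.25
print(f"MSE: {mse}")    # 2.75
print(f"RMSE: {rmse:.4f}")
print(f"R2: {r2:.2f}")  # 0.45
```

An R² of 0 means the model does no better than predicting the mean of the target, and a negative R² means it does worse.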

In [30]:
# Define the Support Vector Regressor (SVR) with default hyperparameters
svr = SVR()

train_evaluate(svr)

EVALUATION OF MODEL FOR TRAIN
Score: 5.332779576106006e-05
RMSE: 2855519.9637826346
MSE: 8153994263561.178
MAE: 2463586.322512888
EVALUATION OF MODEL FOR TEST
Score: -0.0016282608141158228
RMSE: 2961928.017173405
MSE: 8773017578916.778
MAE: 2576658.1835580254


### Key Findings

The Support Vector Regressor (SVR) with default hyperparameters has R² scores close to zero on both the training and test sets, which means the model is underfitting the data.

1. **R² Score**:
    - Training Score (5.332779576106006e-05): The model explains essentially none of the variance in the training prices, so it has not found any pattern in the training data.
    - Test Score (-0.0016282608141158228): The score is slightly negative, meaning the predictions on unseen data are marginally worse than always predicting the mean price.
2. **RMSE (Root Mean Squared Error)**:
    - Training RMSE (~2855520): This is the typical size of a prediction error on the training set, in the same units as the price. It is very high compared with the prices in the sample rows, which range from about 3.2 million to 8.1 million.
    - Test RMSE (~2961928): The error is slightly higher on the test set, which is expected.
3. **MSE (Mean Squared Error)**:
    - The MSE values are the squares of the RMSE values, so they carry the same message: the prediction errors are very large.
4. **MAE (Mean Absolute Error)**:
    - Training MAE (~2463586): On average, a prediction misses the actual price by about 2.46 million.
    - Test MAE (~2576658): Again, the error on the test set is slightly higher.

The model is clearly underfitting: its score is close to zero even on the data it was trained on, and the training and test errors are similarly large. The default SVR does not capture any pattern in this dataset, so we should try another model.
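
One possible contributor to scores like these, which the notebook does not test, is that the inputs are standardized while the target stays in raw price units. The sketch below illustrates that effect on synthetic data only, using scikit-learn's `TransformedTargetRegressor` to scale the target. It rests on stated assumptions: the data, coefficients, and the names `X_demo`, `y_demo`, `plain`, and `scaled` are hypothetical, and the result does not show that the same holds for the Paris dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: standardized-looking inputs, target in the millions (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (
    5e6
    + 1e6 * (X_demo @ np.array([3.0, 2.0, 1.0]))
    + rng.normal(scale=1e4, size=300)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default SVR fitted directly on the raw, large-valued target
plain = SVR().fit(X_tr, y_tr)
r2_plain = r2_score(y_te, plain.predict(X_te))

# Same default SVR, but the target is standardized during fitting and
# predictions are transformed back to the original units
scaled = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled.predict(X_te))

print(f"R2, raw target:    {r2_plain:.4f}")
print(f"R2, scaled target: {r2_scaled:.4f}")
```

If a similar gap appeared on the real data, it would suggest checking target scaling and the `C` parameter alongside trying other models.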