# Paris House Price Prediction - Regression Problem

## Introduction

This project predicts house prices in Paris using a synthetic dataset whose features describe houses in an urban environment. The dataset is intended for educational purposes: it lets students and practitioners practice regression modeling and build their data science skills.

## Content

The dataset provides a comprehensive view of house attributes, making it suitable for building regression models to predict house prices. Each row represents a house, and each column represents a specific feature of the house.

## Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction/data

## Data Dictionary

All attributes in the dataset are numeric variables, which are described below:

- **squareMeters**: The total area of the house in square meters. This is numeric.
- **numberOfRooms**: The total number of rooms in the house. This is numeric.
- **hasYard**: Indicates whether the house has a yard (1 for yes, 0 for no). This is binary.
- **hasPool**: Indicates whether the house has a swimming pool (1 for yes, 0 for no). This is binary.
- **floors**: The number of floors in the house. This is numeric.
- **cityCode**: The zip code of the area where the house is located. This is numeric.
- **cityPartRange**: Indicates the exclusivity of the neighborhood (the higher the range, the more exclusive the neighborhood).
- **numPrevOwners**: The number of previous owners the house has had. This is numeric.
- **made**: The year the house was built. This is numeric.
- **isNewBuilt**: Indicates whether the house is newly built (1 for yes, 0 for no). This is binary.
- **hasStormProtector**: Indicates whether the house has a storm protector (1 for yes, 0 for no). This is binary.
- **basement**: The size of the basement in square meters. This is numeric.
- **attic**: The size of the attic in square meters. This is numeric.
- **garage**: The size of the garage in square meters. This is numeric.
- **hasStorageRoom**: Indicates whether the house has a storage room (1 for yes, 0 for no). This is binary.
- **hasGuestRoom**: The number of guest rooms in the house. This is numeric.
- **price**: The price of the house. This is the target variable to be predicted.
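
The data dictionary can be turned into a quick sanity check. The sketch below is illustrative only: `check_schema` is a hypothetical helper, not part of the original notebook, and the two rows are copied from the `df.head()` output shown later in this notebook, restricted to a few columns. It verifies that every column is numeric and that the binary flags contain only 0 and 1.

```python
import pandas as pd

# Two rows copied from the df.head() output in this notebook (subset of columns)
sample = pd.DataFrame({
    "squareMeters": [75523, 80771],
    "hasYard": [0, 1],
    "hasPool": [1, 1],
    "isNewBuilt": [0, 1],
    "hasStormProtector": [1, 0],
    "hasStorageRoom": [0, 1],
    "price": [7559081.5, 8085989.5],
})

# Columns the data dictionary describes as binary
binary_cols = ["hasYard", "hasPool", "isNewBuilt", "hasStormProtector", "hasStorageRoom"]


def check_schema(frame, binary_columns):
    """Hypothetical helper: return a list of problems (empty if all checks pass)."""
    problems = []
    # Every attribute is documented as numeric
    for col in frame.columns:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric")
    # Binary flags should only hold 0 or 1
    for col in binary_columns:
        if not frame[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0 and 1")
    return problems


print(check_schema(sample, binary_cols))
```

The same helper could be called on the full `df` after loading, assuming the listed binary columns are present in the file.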

## Problem Statement

1. **Model Building**: Train a model on the dataset to predict the price of a house in Paris.
2. **Model Evaluation**: Evaluate the performance of the model with several metrics: R² score, Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
3. **Model Optimization**: Improve the model's performance and reduce the prediction error using cross-validation and hyperparameter tuning.
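
The four metrics named above can be computed directly from their definitions. The following is a minimal sketch on toy values (made up for illustration, not taken from the dataset) that cross-checks each hand computation against scikit-learn.

```python
import numpy as np
from sklearn import metrics

# Toy targets and predictions, chosen only for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # Mean Absolute Error
mse = np.mean(errors ** 2)                       # Mean Squared Error
rmse = np.sqrt(mse)                              # Root Mean Squared Error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # R² score

# Cross-check the hand computations against scikit-learn
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_pred))
assert np.isclose(r2, metrics.r2_score(y_true, y_pred))

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

MAE and RMSE are in the same units as the target, while MSE is in squared units, which is why RMSE is usually the easier of the two to interpret.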

### Load Libraries

In [29]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model building and evaluation
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn import metrics

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

### Settings

In [30]:
# Warnings
warnings.filterwarnings("ignore")
# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "ParisHousing_uf.csv")

In [12]:
df = pd.read_csv(csv_path)

In [13]:
# Check Data
df.head()

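| Unnamed: 0 | squareMeters | numberOfRooms | hasYard | hasPool | floors | cityPartRange | numPrevOwners | made | isNewBuilt | hasStormProtector | basement | attic | garage | hasStorageRoom | hasGuestRoom | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 75523 | 3 | 0 | 1 | 63 | 3 | 8 | 2005 | 0 | 1 | 4313 | 9005 | 956 | 0 | 7 | 7559081.5 |
| 1 | 80771 | 39 | 1 | 1 | 98 | 8 | 6 | 2015 | 1 | 0 | 3653 | 2436 | 128 | 1 | 2 | 8085989.5 |
| 2 | 55712 | 58 | 0 | 1 | 19 | 6 | 8 | 2021 | 0 | 0 | 2937 | 8852 | 135 | 1 | 9 | 5574642.1 |
| 3 | 32316 | 47 | 0 | 0 | 6 | 10 | 4 | 2012 | 0 | 1 | 659 | 7141 | 359 | 0 | 3 | 3232561.2 |
| 4 | 70429 | 19 | 1 | 1 | 90 | 3 | 7 | 1990 | 1 | 0 | 8435 | 2429 | 292 | 1 | 4 | 7055052.0 |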


### Preprocessing

- Separate the input and output features for training the model.
- Split the input and output data into a training set for fitting the model and a test set for measuring its performance.
- Standardize the features so that they are on a comparable scale.

In [14]:
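# Separate input and output features (price is the last column)
# Note: X keeps every other column, including the "Unnamed: 0" index column read from the CSV
X = df.iloc[:, :-1]
y = df.iloc[:, -1]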

In [15]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Standardize the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
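
To make the scaling step concrete, here is a small sketch on toy numbers (illustrative values, not from the dataset; the `_toy` names are hypothetical). It shows that `StandardScaler` learns the mean and standard deviation from the training data only and then reuses those same statistics on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices, made up for illustration
X_train_toy = np.array([[50.0, 1.0], [100.0, 3.0], [150.0, 5.0]])
X_test_toy = np.array([[200.0, 2.0]])

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from train
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the train statistics

# Manual equivalent: z = (x - mean_train) / std_train
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (X_test_toy - train_mean) / train_std

assert np.allclose(X_test_scaled, manual_test)
print(X_test_scaled)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.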

### Model Training and Evaluation

- Train the model with training data set for prediction
- Evaluate the performance of the model using different metrics

In [20]:
# Function to train the model with training data and evaluate the metrics
def train_evaluate(model):
    # Train the model
    model.fit(X_train_s, y_train)

    # Predict train and test
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print the evaluation metrics for train and test
    print("=" * 60)
    print("EVALUATION OF MODEL FOR TRAIN")
    print("=" * 60)
    print(f"Score: {r2_score(y_train, y_train_pred)}")
    print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))}")
    print(f"MSE: {metrics.mean_squared_error(y_train, y_train_pred)}")
    print(f"MAE: {metrics.mean_absolute_error(y_train, y_train_pred)}")
    print("=" * 60)
    print("EVALUATION OF MODEL FOR TEST")
    print("=" * 60)
    print(f"Score: {r2_score(y_test, y_test_pred)}")
    print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))}")
    print(f"MSE: {metrics.mean_squared_error(y_test, y_test_pred)}")
    print(f"MAE: {metrics.mean_absolute_error(y_test, y_test_pred)}")

In [21]:
# Define the XGBoost regressor with default hyperparameters
xgb = XGBRegressor()

train_evaluate(xgb)

EVALUATION OF MODEL FOR TRAIN
Score: 0.999992513256914
RMSE: 7813.457355385905
MSE: 61050115.84443411
MAE: 6112.526243383793
EVALUATION OF MODEL FOR TEST
Score: 0.9999767818318243
RMSE: 14260.514392520487
MSE: 203362270.73928395
MAE: 11643.888599218759


### Key Findings

The XGBRegressor model achieves very high R² scores on both the training and test sets, which indicates that it fits the data extremely well.

1. **R² Score**:
    - Training Score (0.99999): Indicates near-perfect fit on the training data.
    - Test Score (0.99998): Also very high, suggesting that the model generalizes well to unseen data.
2. **RMSE (Root Mean Squared Error)**:
    - Training RMSE (~7813): The square root of the mean squared prediction error on the training set, expressed in the same units as the price.
    - Test RMSE (~14260): Roughly 1.8 times the training RMSE. A higher test error is expected, but the size of the gap is worth examining.
3. **MSE (Mean Squared Error)**:
    - MSE is the square of RMSE (about 6.1e7 on the training set and 2.0e8 on the test set), so it carries the same information in squared price units.
4. **MAE (Mean Absolute Error)**:
    - Training MAE (~6113): The average absolute difference between predicted and actual prices on the training set.
    - Test MAE (~11644): Again higher than on the training set, by a factor of about 1.9.

The model may be slightly overfitting the training data: the training score is near-perfect, and both RMSE and MAE are nearly twice as large on the test set. The test R² is still very high, so any overfitting appears mild.
Whether the test errors matter depends on the scale of the prices. The sample rows shown earlier have prices between about 3.2 and 8.1 million, so a test MAE of ~11644 is well under 1% of those prices.
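
One way to judge whether these errors are large is to express them relative to the prices themselves. The sketch below uses only numbers already shown in this notebook: the reported test MAE and RMSE of the default model, and the five prices from the `df.head()` output. Five rows are not representative of the full price distribution, so this is a rough orientation rather than a proper relative-error metric.

```python
# Test-set errors reported above for the default XGBRegressor
test_mae = 11643.888599218759
test_rmse = 14260.514392520487

# Prices of the five rows shown in the df.head() output
sample_prices = [7559081.5, 8085989.5, 5574642.1, 3232561.2, 7055052.0]

for price in sample_prices:
    print(
        f"price={price:>12,.1f}  "
        f"MAE share={test_mae / price:.3%}  "
        f"RMSE share={test_rmse / price:.3%}"
    )

# Largest relative error among the sample rows (cheapest house, RMSE)
worst_case = test_rmse / min(sample_prices)
print(f"Largest RMSE share among the sample rows: {worst_case:.3%}")
```

On the full test set, `sklearn.metrics.mean_absolute_percentage_error(y_test, y_test_pred)` would give a comparable figure computed over every row.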

### Hyperparameter Tuning

By tuning the hyperparameters of the model, we can try to reduce the prediction error.

In [23]:
# Define the hyperparameter grid to search
param_dict = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [3, 5, 6],
    "min_child_weight": [1, 2, 3],
    "colsample_bytree": [0.5, 1.0],
    "reg_lambda": [0, 1]
}

In [27]:
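# Tune hyperparameters with 5-fold cross-validation, scored by negative MAE
# Note: the search is fit on the full, unscaled X and y, which include the rows
# later used as the test set in train_evaluate
xgbr_ht = XGBRegressor()
gscv = GridSearchCV(estimator=xgbr_ht,
                    param_grid=param_dict,
                    cv=5,
                    verbose=1,
                    scoring="neg_mean_absolute_error")
gscv.fit(X, y)
print(f"Best Score: {gscv.best_score_}")
best_params = gscv.best_params_
print(best_params)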

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Score: -10991.061035664063
{'colsample_bytree': 1.0, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100, 'reg_lambda': 0}
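
The best score is negative because scikit-learn maximizes scores, so MAE is reported with its sign flipped: a best score of about -10991 corresponds to a cross-validated MAE of about 10991. The sketch below shows one possible variant of the search that tunes on the training split only and keeps the scaler inside a `Pipeline`, so the held-out rows stay unseen during tuning. It is an illustration under stated assumptions, not the notebook's method: the data is synthetic, the grid is deliberately tiny, and `GradientBoostingRegressor` is swapped in for `XGBRegressor` so the example needs only scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Paris housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler lives inside the pipeline, so it is re-fit on each CV training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),  # stand-in for XGBRegressor
])

# Small illustrative grid; the "model__" prefix routes parameters to the pipeline step
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [2, 3],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X_tr, y_tr)  # only the training split is used for tuning

cv_mae = -search.best_score_           # flip the sign to recover the MAE
test_mae = -search.score(X_te, y_te)   # same scorer, evaluated on held-out rows
print(search.best_params_)
print(f"CV MAE: {cv_mae:.2f}  Test MAE: {test_mae:.2f}")
```

With this layout the test-set MAE is an estimate on data that played no part in choosing the hyperparameters.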


In [28]:
# Train with best parameters
xgb_model = XGBRegressor(**best_params)
train_evaluate(xgb_model)

EVALUATION OF MODEL FOR TRAIN
Score: 0.9999953433481641
RMSE: 6162.169856337105
MSE: 37972337.33834965
MAE: 4905.892946484378
EVALUATION OF MODEL FOR TEST
Score: 0.9999795545956411
RMSE: 13381.93965430206
MSE: 179076308.9113819
MAE: 11084.511072851568


In [34]:
# Save the model
path_model = os.path.join(model_path, "xbg_php.pkl")
with open(path_model, "wb") as model_file:
    pickle.dump(xgb_model, model_file)
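
Because the model was trained on standardized features, new inputs have to pass through the same fitted scaler before prediction. One option is to pickle the scaler together with the model. The following is a minimal sketch of that round trip, assuming toy data, a `LinearRegression` stand-in for the XGBoost model, and a hypothetical file name `bundle.pkl` in a temporary directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data and a stand-in model (not the notebook's XGBRegressor)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0])

scaler_toy = StandardScaler().fit(X_toy)
model_toy = LinearRegression().fit(scaler_toy.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "bundle.pkl")  # hypothetical file name
    # Save the scaler and the model together
    with open(bundle_path, "wb") as f:
        pickle.dump({"scaler": scaler_toy, "model": model_toy}, f)
    # Reload both objects from disk
    with open(bundle_path, "rb") as f:
        bundle = pickle.load(f)

# The reloaded pair should reproduce the original predictions
reloaded_pred = bundle["model"].predict(bundle["scaler"].transform(X_toy))
original_pred = model_toy.predict(scaler_toy.transform(X_toy))
assert np.allclose(reloaded_pred, original_pred)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.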