## Tunisian Property Price Prediction

Given *data about properties in Tunisia*, let's try to predict the **price** of a given property.

We will train and compare eight regression models, ranging from plain linear regression to gradient boosting.

Data Source: https://www.kaggle.com/datasets/ghassen1302/property-prices-in-tunisia

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('Property Prices in Tunisia.csv')
data

Unnamed: 0,category,room_count,bathroom_count,size,type,price,city,region,log_price
0,Terrains et Fermes,-1.0,-1.0,-1.0,À Vendre,100000.0,Ariana,Raoued,5.000000
1,Terrains et Fermes,-1.0,-1.0,-1.0,À Vendre,316000.0,Ariana,Autres villes,5.499687
2,Appartements,2.0,1.0,80.0,À Louer,380.0,Ariana,Autres villes,2.579784
3,Locations de vacances,1.0,1.0,90.0,À Louer,70.0,Ariana,Autres villes,1.845098
4,Appartements,2.0,2.0,113.0,À Vendre,170000.0,Ariana,Ariana Ville,5.230449
...,...,...,...,...,...,...,...,...,...
12743,Terrains et Fermes,-1.0,-1.0,-1.0,À Vendre,3200000.0,Tunis,Sidi Bou Said,6.505150
12744,Appartements,1.0,1.0,100.0,À Louer,600.0,Tunis,Autres villes,2.778151
12745,Maisons et Villas,3.0,1.0,760.0,À Vendre,1950000.0,Tunis,La Marsa,6.290035
12746,Maisons et Villas,3.0,1.0,190.0,À Vendre,240000.0,Tunis,La Marsa,5.380211


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12748 entries, 0 to 12747
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   category        12748 non-null  object 
 1   room_count      12748 non-null  float64
 2   bathroom_count  12748 non-null  float64
 3   size            12748 non-null  float64
 4   type            12748 non-null  object 
 5   price           12748 non-null  float64
 6   city            12748 non-null  object 
 7   region          12748 non-null  object 
 8   log_price       12748 non-null  float64
dtypes: float64(5), object(4)
memory usage: 896.5+ KB


### Preprocessing

In [4]:
df = data.copy()
df

Unnamed: 0,category,room_count,bathroom_count,size,type,price,city,region,log_price
0,Terrains et Fermes,-1.0,-1.0,-1.0,À Vendre,100000.0,Ariana,Raoued,5.000000
1,Terrains et Fermes,-1.0,-1.0,-1.0,À Vendre,316000.0,Ariana,Autres villes,5.499687
2,Appartements,2.0,1.0,80.0,À Louer,380.0,Ariana,Autres villes,2.579784
3,Locations de vacances,1.0,1.0,90.0,À Louer,70.0,Ariana,Autres villes,1.845098
4,Appartements,2.0,2.0,113.0,À Vendre,170000.0,Ariana,Ariana Ville,5.230449
...,...,...,...,...,...,...,...,...,...
12743,Terrains et Fermes,-1.0,-1.0,-1.0,À Vendre,3200000.0,Tunis,Sidi Bou Said,6.505150
12744,Appartements,1.0,1.0,100.0,À Louer,600.0,Tunis,Autres villes,2.778151
12745,Maisons et Villas,3.0,1.0,760.0,À Vendre,1950000.0,Tunis,La Marsa,6.290035
12746,Maisons et Villas,3.0,1.0,190.0,À Vendre,240000.0,Tunis,La Marsa,5.380211


In [5]:
# The dataset uses -1 as a missing-value sentinel; replace it with NaN
df = df.replace(-1, np.nan)
df

Unnamed: 0,category,room_count,bathroom_count,size,type,price,city,region,log_price
0,Terrains et Fermes,,,,À Vendre,100000.0,Ariana,Raoued,5.000000
1,Terrains et Fermes,,,,À Vendre,316000.0,Ariana,Autres villes,5.499687
2,Appartements,2.0,1.0,80.0,À Louer,380.0,Ariana,Autres villes,2.579784
3,Locations de vacances,1.0,1.0,90.0,À Louer,70.0,Ariana,Autres villes,1.845098
4,Appartements,2.0,2.0,113.0,À Vendre,170000.0,Ariana,Ariana Ville,5.230449
...,...,...,...,...,...,...,...,...,...
12743,Terrains et Fermes,,,,À Vendre,3200000.0,Tunis,Sidi Bou Said,6.505150
12744,Appartements,1.0,1.0,100.0,À Louer,600.0,Tunis,Autres villes,2.778151
12745,Maisons et Villas,3.0,1.0,760.0,À Vendre,1950000.0,Tunis,La Marsa,6.290035
12746,Maisons et Villas,3.0,1.0,190.0,À Vendre,240000.0,Tunis,La Marsa,5.380211


In [10]:
# Percentage of missing values per column
df.isna().mean()*100

category           0.000000
room_count        26.788516
bathroom_count    26.788516
size              26.788516
type               0.000000
price              0.000000
city               0.000000
region             0.000000
log_price          0.000000
dtype: float64
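
To make the sentinel replacement and the missing-value check concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -1 marks a missing value, as in the real dataset
toy = pd.DataFrame({
    'room_count': [2.0, -1.0, 3.0, -1.0],
    'size': [80.0, -1.0, 120.0, 95.0],
    'price': [380.0, 100000.0, 170000.0, 600.0],
})

# Replace the sentinel in every column, then compute the share of NaNs per column
toy = toy.replace(-1, np.nan)
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

Note that `replace(-1, np.nan)` acts on every column, so it would also blank out a genuine `-1` if one existed. In this dataset `price` shows 0% missing afterwards, so that does not appear to be a problem here.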

In [11]:
# Fill missing values with column medians
for column in ['room_count', 'bathroom_count', 'size']:
    df[column] = df[column].fillna(df[column].median())

In [12]:
df.isna().sum()

category          0
room_count        0
bathroom_count    0
size              0
type              0
price             0
city              0
region            0
log_price         0
dtype: int64
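
The median fill can be sketched on a single made-up column. `Series.median()` skips `NaN` values by default, so the fill value comes only from the rows that were actually observed.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
toy = pd.DataFrame({'size': [80.0, np.nan, 120.0, 100.0, np.nan]})

# Median of the observed values 80, 100, 120 (NaNs are skipped)
median_size = toy['size'].median()

# Every missing entry receives the same median value
toy['size'] = toy['size'].fillna(median_size)
print(toy['size'].tolist())
```

In the real data this is why land listings (`Terrains et Fermes`) end up with 3.0 rooms, 1.0 bathroom and a size of 120.0: they receive the column medians, which may or may not be a sensible stand-in for land.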

In [15]:
df

Unnamed: 0,category,room_count,bathroom_count,size,type,price,city,region,log_price
0,Terrains et Fermes,3.0,1.0,120.0,À Vendre,100000.0,Ariana,Raoued,5.000000
1,Terrains et Fermes,3.0,1.0,120.0,À Vendre,316000.0,Ariana,Autres villes,5.499687
2,Appartements,2.0,1.0,80.0,À Louer,380.0,Ariana,Autres villes,2.579784
3,Locations de vacances,1.0,1.0,90.0,À Louer,70.0,Ariana,Autres villes,1.845098
4,Appartements,2.0,2.0,113.0,À Vendre,170000.0,Ariana,Ariana Ville,5.230449
...,...,...,...,...,...,...,...,...,...
12743,Terrains et Fermes,3.0,1.0,120.0,À Vendre,3200000.0,Tunis,Sidi Bou Said,6.505150
12744,Appartements,1.0,1.0,100.0,À Louer,600.0,Tunis,Autres villes,2.778151
12745,Maisons et Villas,3.0,1.0,760.0,À Vendre,1950000.0,Tunis,La Marsa,6.290035
12746,Maisons et Villas,3.0,1.0,190.0,À Vendre,240000.0,Tunis,La Marsa,5.380211


In [14]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'category': 7, 'type': 2, 'city': 24, 'region': 257}

In [16]:
# Binary Encoding
df['type'] = df['type'].replace({'À Louer': 0, 'À Vendre': 1})

In [17]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'category': 7, 'city': 24, 'region': 257}

In [18]:
# One Hot Encoding
for column in ['category', 'city', 'region']:
    dummies = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)

In [19]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{}
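
As a small illustration of the one-hot step, the sketch below encodes one made-up `city` column the same way the loop does. The toy values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    'city': ['Tunis', 'Ariana', 'Tunis'],
    'size': [100.0, 80.0, 190.0],
})

# One indicator column per distinct value, named "<prefix>_<value>"
dummies = pd.get_dummies(toy['city'], prefix='city')

# Attach the indicators and drop the original text column
toy = pd.concat([toy, dummies], axis=1)
toy = toy.drop('city', axis=1)
print(toy.columns.tolist())
```

Applied to the three remaining text columns, this would yield 7 + 24 + 257 = 288 indicator columns, most of them from `region`.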

In [20]:
# Drop log_price column
df = df.drop('log_price', axis=1)
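
The sample rows shown earlier suggest that `log_price` is simply the base-10 logarithm of `price`, which would make it a direct leak of the target if left among the features. A quick check on the five rows visible above (values copied from the printed table):

```python
import numpy as np

# (price, log_price) pairs copied from the first five rows of the dataset preview
prices = np.array([100000.0, 316000.0, 380.0, 70.0, 170000.0])
log_prices = np.array([5.000000, 5.499687, 2.579784, 1.845098, 5.230449])

# Compare log10(price) against the stored log_price, up to printing precision
matches = np.allclose(np.log10(prices), log_prices, atol=1e-6)
print(matches)
```

This only checks a handful of rows, so treat it as a sanity check rather than proof for the whole column.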

In [21]:
# Split df into X and y
y = df['price']
X = df.drop('price', axis=1)

In [37]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [38]:
X_train

Unnamed: 0,room_count,bathroom_count,size,type,category_Appartements,category_Bureaux et Plateaux,category_Colocations,category_Locations de vacances,"category_Magasins, Commerces et Locaux industriels",category_Maisons et Villas,...,region_Tozeur,region_Tunis,region_Téboulba,region_Téboursouk,region_Utique,region_Zaghouan,region_Zaouit-Ksibat Thrayett,region_Zarzis,region_Zarzouna,region_Zéramdine
83,2.0,1.0,75.0,0,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1917,2.0,2.0,100.0,0,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
7857,3.0,1.0,140.0,0,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
5044,2.0,1.0,100.0,0,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2726,3.0,1.0,120.0,0,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10955,1.0,1.0,20.0,1,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
905,1.0,1.0,87.0,0,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5192,2.0,1.0,65.0,0,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12172,3.0,1.0,100.0,0,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [39]:
y_train

83           130.0
1917         550.0
7857         600.0
5044         700.0
2726         600.0
           ...    
10955    8000000.0
905          800.0
5192         450.0
12172        500.0
235       210000.0
Name: price, Length: 8923, dtype: float64
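
The split sizes can be reproduced on a small example. The arrays `X_toy` and `y_toy` below are hypothetical; only the `train_test_split` arguments match the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same arguments as in the notebook: 70% train, shuffled, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=True, random_state=1
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=True, random_state=1
)

# The training size is the floor of train_size * n_samples
n_train_full = int(np.floor(0.7 * 12748))
print(len(X_tr), len(X_te), n_train_full)
```

The last value matches the `Length: 8923` reported for `y_train`.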

In [40]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
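
The scaler is fitted on the training rows only and then applied to both splits, so no test-set statistics enter the transformation. The sketch below uses a hypothetical indicator column that is set in 1 of 100 training rows to show what standardization does to a rare one-hot feature.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical indicator column: only 1 of 100 training rows is set
train = pd.DataFrame({'is_rare': [1.0] + [0.0] * 99})
test = pd.DataFrame({'is_rare': [0.0, 1.0]})

scaler = StandardScaler()
scaler.fit(train)  # mean and standard deviation come from the training rows only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training statistics

rare_value = train_scaled[0, 0]    # the single "1" becomes a large positive value
common_value = train_scaled[1, 0]  # each "0" becomes a small negative value
print(rare_value, common_value)
```

For an indicator with share `p` of ones, the scaled values are `sqrt((1 - p) / p)` and `-sqrt(p / (1 - p))`. The same effect explains why a rare category such as `category_Colocations` shows a value like 13.32 in the scaled training table.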

In [41]:
X_train

Unnamed: 0,room_count,bathroom_count,size,type,category_Appartements,category_Bureaux et Plateaux,category_Colocations,category_Locations de vacances,"category_Magasins, Commerces et Locaux industriels",category_Maisons et Villas,...,region_Tozeur,region_Tunis,region_Téboulba,region_Téboursouk,region_Utique,region_Zaghouan,region_Zaouit-Ksibat Thrayett,region_Zarzis,region_Zarzouna,region_Zéramdine
83,-0.594847,-0.415569,-0.526100,-1.240965,-0.762766,-0.195968,13.321411,-0.15142,-0.233407,-0.575323,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
1917,-0.594847,1.004080,-0.377567,-1.240965,-0.762766,-0.195968,-0.075067,-0.15142,-0.233407,1.738154,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
7857,0.122329,-0.415569,-0.139914,-1.240965,-0.762766,-0.195968,-0.075067,-0.15142,-0.233407,1.738154,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
5044,-0.594847,-0.415569,-0.377567,-1.240965,-0.762766,5.102881,-0.075067,-0.15142,-0.233407,-0.575323,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
2726,0.122329,-0.415569,-0.258740,-1.240965,1.311018,-0.195968,-0.075067,-0.15142,-0.233407,-0.575323,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10955,-1.312023,-0.415569,-0.852872,0.805824,-0.762766,-0.195968,-0.075067,-0.15142,4.284361,-0.575323,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
905,-1.312023,-0.415569,-0.454804,-1.240965,1.311018,-0.195968,-0.075067,-0.15142,-0.233407,-0.575323,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
5192,-0.594847,-0.415569,-0.585513,-1.240965,-0.762766,5.102881,-0.075067,-0.15142,-0.233407,-0.575323,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587
12172,0.122329,-0.415569,-0.377567,-1.240965,-0.762766,-0.195968,-0.075067,-0.15142,-0.233407,1.738154,...,-0.055091,-0.113253,-0.018339,-0.014973,-0.02594,-0.086323,-0.066256,-0.023678,-0.021177,-0.010587


### Training

In [42]:
models = {
    'Linear Regression': LinearRegression(),
    'Linear Regression (L2 Regularization)': Ridge(),
    'Linear Regression (L1 Regularization)': Lasso(),
    'K-Nearest Neighbors': KNeighborsRegressor(),
    'Neural Network': MLPRegressor(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor()
}

In [43]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

Linear Regression trained.
Linear Regression (L2 Regularization) trained.
Linear Regression (L1 Regularization) trained.
K-Nearest Neighbors trained.
Neural Network trained.
Decision Tree trained.
Random Forest trained.
Gradient Boosting trained.


### Results

In [44]:
# RMSE
for name, model in models.items():
    y_pred = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    print(name + " RMSE: {:.2f}".format(rmse))

Linear Regression RMSE: 22920101164145643520.00
Linear Regression (L2 Regularization) RMSE: 1618496260.64
Linear Regression (L1 Regularization) RMSE: 1618365373.29
K-Nearest Neighbors RMSE: 1641085393.75
Neural Network RMSE: 1617133676.14
Decision Tree RMSE: 1619216335.97
Random Forest RMSE: 1618428735.09
Gradient Boosting RMSE: 1618061578.07
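
RMSE is the square root of the mean squared error, so it is expressed in the same units as `price`. A minimal sketch with made-up numbers shows the calculation and how a single very large miss dominates the result.

```python
import numpy as np

# Hypothetical true prices and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

# Errors are 10, -10 and 30, so the mean squared error is 1100 / 3
rmse_toy = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Add one hypothetical listing priced in the millions that the model badly misses
y_true_out = np.append(y_true, 8_000_000.0)
y_hat_out = np.append(y_hat, 500.0)
rmse_outlier = np.sqrt(np.mean((y_true_out - y_hat_out) ** 2))

print(rmse_toy, rmse_outlier)
```

Because the dataset mixes monthly rents in the hundreds with sale prices in the millions, a few extreme listings could be driving the very large RMSE values above. That is one plausible reading, not something verified here.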


In [46]:
# R2 Score
for name, model in models.items():
    r2 = model.score(X_test, y_test)
    print(name + " R^2: {:.4f}".format(r2))

Linear Regression R^2: -200944074440973352960.0000
Linear Regression (L2 Regularization) R^2: -0.0020
Linear Regression (L1 Regularization) R^2: -0.0018
K-Nearest Neighbors R^2: -0.0302
Neural Network R^2: -0.0003
Decision Tree R^2: -0.0029
Random Forest R^2: -0.0019
Gradient Boosting R^2: -0.0015
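
For scikit-learn regressors, `model.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A model that always predicts the mean of the test targets scores 0, and anything worse than that scores below 0. The sketch below uses made-up values and `sklearn.metrics.r2_score` to show both cases.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the targets
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# A deliberately bad model: predictions in reverse order
bad_pred = y_true[::-1].copy()
r2_bad = r2_score(y_true, bad_pred)

# The same score computed by hand from the definition
ss_res = np.sum((y_true - bad_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_mean, r2_bad, r2_manual)
```

Read this way, the slightly negative scores above mean that every model does marginally worse on the test set than a constant prediction of the mean price would.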
