# Regression Challenge

The selling price of a residential property depends on a number of factors, including the property's age, the availability of local amenities, and its location.

In this challenge, you will use a dataset of real estate sales transactions to predict the price-per-unit of a property based on its features. The price-per-unit in this data is based on a unit measurement of 3.3 square meters.

> **Citation**: The data used in this exercise originates from the following study:
>
> *Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.*
>
> It was obtained from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).

## Review the data

Run the following cell to load the data and view the first few rows.

In [1]:
import pandas as pd

# load the training dataset
data = pd.read_csv('data/real_estate.csv')
data.head()

|   | transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude | price_per_unit |
| - | ---------------- | --------- | ---------------- | ------------------------ | -------- | --------- | -------------- |
|0|2012.917|32.0|84.87882|10|24.98298|121.54024|37.9|
|1|2012.917|19.5|306.5947|9|24.98034|121.53951|42.2|
|2|2013.583|13.3|561.9845|5|24.98746|121.54391|47.3|
|3|2013.5|13.3|561.9845|5|24.98746|121.54391|54.8|
|4|2012.833|5.0|390.5684|5|24.97937|121.54245|43.1|


The data consists of the following variables:

- **transaction_date** - the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
- **house_age** - the house age (in years)
- **transit_distance** - the distance to the nearest light rail station (in meters)
- **local_convenience_stores** - the number of convenience stores within walking distance
- **latitude** - the geographic coordinate, latitude
- **longitude** - the geographic coordinate, longitude
- **price_per_unit** - the house price per unit area (one unit is 3.3 square meters)
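
Because the label is quoted per 3.3-square-meter unit, it can help to see the equivalent price per square meter. The following is a minimal sketch of that conversion; the `sample` frame and the `price_per_sqm` column name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# One pricing unit covers 3.3 square meters (from the dataset description)
UNIT_AREA_SQM = 3.3

# A few price_per_unit values copied from the first rows shown above
sample = pd.DataFrame({'price_per_unit': [37.9, 42.2, 47.3]})

# Hypothetical helper column: price per single square meter
sample['price_per_sqm'] = sample['price_per_unit'] / UNIT_AREA_SQM
print(sample)
```

The conversion is a constant rescaling, so it changes the scale of the RMSE but not the ranking of models.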

## Train a Regression Model

Your challenge is to explore and prepare the data, identify features that help predict the **price_per_unit** label, and train a regression model with the lowest Root Mean Square Error (RMSE) you can achieve when evaluated against a test subset of the data. The RMSE must be less than **7**.

Add markdown and code cells as required to create your solution.

> **Note**: There is no single "correct" solution. A sample solution is provided in [02 - Real Estate Regression Solution.ipynb](02%20-%20Real%20Estate%20Regression%20Solution.ipynb).
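
The RMSE target above is the square root of the mean squared difference between actual and predicted labels. As a minimal sketch, it could be computed by hand as follows; the `rmse` helper and the predicted values are illustrative assumptions, not results from any model in this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Actual labels taken from the first rows of the data; predictions are made up
actual = [37.9, 42.2, 47.3]
predicted = [40.9, 38.2, 47.3]

score = rmse(actual, predicted)
print("RMSE:", score)
```

RMSE is expressed in the same units as the label, so a value below 7 means the typical prediction error is under 7 price units.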

In [2]:
data.describe()

|   | transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude | price_per_unit |
| - | ---------------- | --------- | ---------------- | ------------------------ | -------- | --------- | -------------- |
|count|414.0|414.0|414.0|414.0|414.0|414.0|414.0|
|mean|2013.148971|17.71256|1083.885689|4.094203|24.96903|121.533361|37.980193|
|std|0.281967|11.392485|1262.109595|2.945562|0.01241|0.015347|13.606488|
|min|2012.667|0.0|23.38284|0.0|24.93207|121.47353|7.6|
|25%|2012.917|9.025|289.3248|1.0|24.963|121.528085|27.7|
|50%|2013.167|16.1|492.2313|4.0|24.9711|121.53863|38.45|
|75%|2013.417|28.15|1454.279|6.0|24.977455|121.543305|46.6|
|max|2013.583|43.8|6488.021|10.0|25.01459|121.56627|117.5|


### Label distribution

In [3]:
import plotly.express as px

# Histogram of the label with a box plot in the margin
fig = px.histogram(data, x='price_per_unit', marginal='box', nbins=100)
# Mark the mean (magenta) and the median (cyan)
fig.add_vline(data.price_per_unit.mean(), line_color='magenta', line_dash='dash')
fig.add_vline(data.price_per_unit.median(), line_color='cyan', line_dash='dash')

### Remove the outliers

In [4]:
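# Drop the high-price outliers: keep only rows with price_per_unit <= 73.6
data = data[data['price_per_unit'] <= 73.6]

# Re-plot the label distribution with the mean (magenta) and median (cyan)
fig = px.histogram(data, x='price_per_unit', marginal='box', nbins=100)
fig.add_vline(data.price_per_unit.mean(), line_color='magenta', line_dash='dash')
fig.add_vline(data.price_per_unit.median(), line_color='cyan', line_dash='dash')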
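
The notebook does not say how the 73.6 cutoff was chosen. One common convention, shown here only as an assumption, is Tukey's rule: treat values above Q3 + 1.5 × IQR as outliers. The `upper_fence` helper and the `prices` series below are hypothetical, and this rule will not necessarily reproduce 73.6 on the real data.

```python
import pandas as pd

def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Tukey upper fence: third quartile plus k times the interquartile range."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Illustrative prices with one extreme value (not the real dataset)
prices = pd.Series([27.7, 30.0, 38.45, 42.0, 46.6, 117.5])

fence = upper_fence(prices)
filtered = prices[prices <= fence]
print(f"Upper fence: {fence:.2f}, rows kept: {len(filtered)} of {len(prices)}")
```

Whatever rule is used, it is worth checking how many rows the filter removes before training on the result.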

### Distribution of the numeric features

In [5]:
from IPython.display import display

numeric_features = ['house_age', 'transit_distance', 'latitude', 'longitude']
for col in numeric_features:
    fig = px.histogram(data, x=col, marginal='box')
    fig.add_vline(data[col].mean(), line_color='magenta', line_dash='dash')
    fig.add_vline(data[col].median(), line_color='cyan', line_dash='dash')
    display(fig)

### Distribution of the categorical features

In [6]:
categorical_features = ['transaction_date', 'local_convenience_stores']

for col in categorical_features:
    fig = px.bar(data[col].value_counts().sort_index())
    display(fig)

### Correlations of the numeric features

In [7]:
for col in numeric_features:
    # Pearson correlation between the feature and the label
    correlation = data[col].corr(data.price_per_unit)
    fig = px.scatter(data, x=col, y='price_per_unit',
                     title=f'price_per_unit vs {col} - correlation: {correlation:.3f}')
    display(fig)

In [8]:
for col in categorical_features:
    fig = px.box(data, x=col, y='price_per_unit')
    display(fig)

### Prepare the training and test sets

In [9]:
# Import modules we'll need for this notebook
from sklearn.model_selection import train_test_split

# Separate features and labels into NumPy arrays: X holds the features, y holds the labels
X, y = data[['house_age', 'transit_distance', 'local_convenience_stores', 'latitude', 'longitude']].values, data['price_per_unit'].values

# Split the data; train_test_split holds out 25% of the rows for testing by default
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(f'Training Set: {X_train.shape[0]} rows')
print(f'Test Set: {X_test.shape[0]} rows')

Training Set: 308 rows
Test Set: 103 rows


### Preprocess the data

In [10]:
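# Build the preprocessing step
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define preprocessing for numeric columns (scale them)
# Indices refer to columns of X: 0=house_age, 1=transit_distance, 3=latitude, 4=longitude.
# Column 2 (local_convenience_stores) is not listed, so ColumnTransformer's default
# remainder='drop' excludes it from the model input.
numeric_features = [0, 1, 3, 4]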
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
    ])
display(preprocessor)
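
A `ColumnTransformer` drops any column that is not assigned to a transformer unless `remainder='passthrough'` is set, which is why the exported tree further down only refers to `feature_0` through `feature_3`. The sketch below illustrates the difference on a three-row array copied from the data preview; the `X_demo`, `dropping`, and `keeping` names are illustrative assumptions, and keeping the extra column is an option to try rather than what this notebook does.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Columns: house_age, transit_distance, local_convenience_stores, latitude, longitude
X_demo = np.array([
    [32.0, 84.87882, 10, 24.98298, 121.54024],
    [19.5, 306.5947, 9, 24.98034, 121.53951],
    [13.3, 561.9845, 5, 24.98746, 121.54391],
])

# Default remainder='drop': column 2 is silently removed
dropping = ColumnTransformer([('num', StandardScaler(), [0, 1, 3, 4])])

# remainder='passthrough': column 2 is kept, unscaled, after the scaled columns
keeping = ColumnTransformer([('num', StandardScaler(), [0, 1, 3, 4])],
                            remainder='passthrough')

dropped_shape = dropping.fit_transform(X_demo).shape
kept_shape = keeping.fit_transform(X_demo).shape
print(dropped_shape)  # (3, 4)
print(kept_shape)     # (3, 5)
```

Switching to `remainder='passthrough'` would change the model inputs, so the scores reported in this notebook would no longer apply.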

### Train the model
#### LinearRegression

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])

model = pipeline.fit(X_train, y_train)
display(model)

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
px.scatter(x=y_test, y=predictions, trendline='ols', title='Price Per Unit Predictions', labels={'x': 'Actual Labels', 'y': 'Predicted Labels'})

MSE: 64.04524109940222
RMSE: 8.002827069192625
R2: 0.5367290347140843


#### Lasso

In [12]:
from sklearn.linear_model import Lasso

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', Lasso())])

# Fit a lasso model on the training set
model = pipeline.fit(X_train, y_train)
display(model)

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
px.scatter(x=y_test, y=predictions, trendline='ols', title='Price Per Unit Predictions', labels={'x': 'Actual Labels', 'y': 'Predicted Labels'})

MSE: 64.18829188905218
RMSE: 8.011759600053672
R2: 0.5356942774664193


#### DecisionTreeRegressor

In [13]:
from sklearn.tree import DecisionTreeRegressor, export_text

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', DecisionTreeRegressor())])

model = pipeline.fit(X_train, y_train)
display(model)
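# Print the fitted tree's rules; named_steps is the public way to reach the final estimator
print(export_text(model.named_steps['regressor']))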

|--- feature_1 <= -0.09
|   |--- feature_0 <= -0.65
|   |   |--- feature_2 <= 0.36
|   |   |   |--- feature_1 <= -0.77
|   |   |   |   |--- feature_3 <= 0.36
|   |   |   |   |   |--- feature_0 <= -1.18
|   |   |   |   |   |   |--- feature_3 <= 0.25
|   |   |   |   |   |   |   |--- feature_0 <= -1.23
|   |   |   |   |   |   |   |   |--- value: [54.40]
|   |   |   |   |   |   |   |--- feature_0 >  -1.23
|   |   |   |   |   |   |   |   |--- value: [53.50]
|   |   |   |   |   |   |--- feature_3 >  0.25
|   |   |   |   |   |   |   |--- feature_0 <= -1.21
|   |   |   |   |   |   |   |   |--- value: [57.80]
|   |   |   |   |   |   |   |--- feature_0 >  -1.21
|   |   |   |   |   |   |   |   |--- value: [56.80]
|   |   |   |   |   |--- feature_0 >  -1.18
|   |   |   |   |   |   |--- value: [62.10]
|   |   |   |   |--- feature_3 >  0.36
|   |   |   |   |   |--- feature_0 <= -0.79
|   |   |   |   |   |   |--- feature_0 <= -0.84
|   |   |   |   |   |   |   |--- feature_1 <= -0.82
|   |   |   |   |

In [14]:
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
px.scatter(x=y_test, y=predictions, trendline='ols', title='Price Per Unit Predictions', labels={'x': 'Actual Labels', 'y': 'Predicted Labels'})

MSE: 54.38888956310681
RMSE: 7.374882342322948
R2: 0.6065782103993869


#### RandomForestRegressor

In [15]:
from sklearn.ensemble import RandomForestRegressor

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])

model = pipeline.fit(X_train, y_train)
display(model)

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
px.scatter(x=y_test, y=predictions, trendline='ols', title='Price Per Unit Predictions', labels={'x': 'Actual Labels', 'y': 'Predicted Labels'})

MSE: 32.43893604843455
RMSE: 5.695518944612031
R2: 0.7653531010573925


#### GradientBoostingRegressor

In [16]:
from sklearn.ensemble import GradientBoostingRegressor

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])

model = pipeline.fit(X_train, y_train)
display(model)

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
px.scatter(x=y_test, y=predictions, trendline='ols', title='Price Per Unit Predictions', labels={'x': 'Actual Labels', 'y': 'Predicted Labels'})

MSE: 30.60321977042458
RMSE: 5.5320176943340105
R2: 0.7786317465508933


In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Reuse the Gradient Boosting pipeline defined in the previous cell
alg = pipeline

# Try these hyperparameter values
params = {
 'regressor__learning_rate': [0.1, 0.5, 1.0],
 'regressor__n_estimators' : [50, 100, 150]
 }

# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")

# Get the best model
model = gridsearch.best_estimator_
display(model)

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
px.scatter(x=y_test, y=predictions, trendline='ols', title='Price Per Unit Predictions', labels={'x': 'Actual Labels', 'y': 'Predicted Labels'})

Best parameter combination: {'regressor__learning_rate': 0.1, 'regressor__n_estimators': 50} 



MSE: 30.735039896468674
RMSE: 5.543919181992886
R2: 0.7776782262582337


### Save the model

In [18]:
import joblib

joblib.dump(model, 'real_estate_regression.joblib')

['real_estate_regression.joblib']

## Use the Trained Model

Save your trained model, and then use it to predict the price-per-unit for the following real estate transactions:

| transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude |
| ---------------- | --------- | ---------------- | ------------------------ | -------- | --------- |
|2013.167|16.2|289.3248|5|24.98203|121.54348|
|2013.000|13.6|4082.015|0|24.94155|121.50381|

In [19]:
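# Load the saved model
loaded_model = joblib.load('real_estate_regression.joblib')

# New observations, in the same column order as the training features:
# house_age, transit_distance, local_convenience_stores, latitude, longitude
# (transaction_date is omitted because the model was not trained on it)
X_new = np.array([[16.2, 289.3248, 5, 24.98203, 121.54348],
                  [13.6, 4082.015, 0, 24.94155, 121.50381]])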

results = loaded_model.predict(X_new)
print('Price-per-unit estimates for the two properties:')
print(*[np.round(p, 1) for p in results], sep='\n')

Price-per-unit estimates for the two properties:
48.1
17.2
