
In [169]:
# import pandas library
import pandas as pd

# Read the dataset

In [170]:
file = 'https://proai-datasets.s3.eu-west-3.amazonaws.com/housing.csv'

# read the data through pandas read_csv function
df = pd.read_csv(file)

# head shows a snapshot of data contained in the dataset
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,1
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,1
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,2
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,1
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,1


In [171]:
df['price'].max()

13300000

In [172]:
# info() summarizes the dataset: column dtypes, non-null counts and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    int64
 1   area              545 non-null    int64
 2   bedrooms          545 non-null    int64
 3   bathrooms         545 non-null    int64
 4   stories           545 non-null    int64
 5   mainroad          545 non-null    int64
 6   guestroom         545 non-null    int64
 7   basement          545 non-null    int64
 8   hotwaterheating   545 non-null    int64
 9   airconditioning   545 non-null    int64
 10  parking           545 non-null    int64
 11  prefarea          545 non-null    int64
 12  furnishingstatus  545 non-null    int64
dtypes: int64(13)
memory usage: 55.5 KB


No column is stored as a string or object type: all variables are numeric (int64).
furnishingstatus is an ordinal categorical variable, but it is already encoded, because it contains only numeric values.

# Check missing data

In [173]:
# count the missing values in each column
df.isna().sum()

Unnamed: 0,0
price,0
area,0
bedrooms,0
bathrooms,0
stories,0
mainroad,0
guestroom,0
basement,0
hotwaterheating,0
airconditioning,0


No column has missing values.

# Standardization

Since I will later fit the Ridge, Lasso and ElasticNet regression models, I standardize the features first: these models penalize the size of the coefficients, so the features should be on a comparable scale.

In [174]:
# import StandardScaler class from sklearn library
from sklearn.preprocessing import StandardScaler

I apply standardization only to the real-valued numeric variables (area, bedrooms, bathrooms, stories, parking), not to the binary or ordinal ones.

In [175]:
# drop the target and the binary/ordinal columns, keeping only the real-valued numeric features
x = df.drop(['price','mainroad','guestroom','basement','hotwaterheating','airconditioning','prefarea','furnishingstatus'],axis=1)
x.shape

(545, 5)

In [176]:
# create a StandardScaler object
ss = StandardScaler()

# fit_transform first computes the mean and std of each column (fit), then standardizes each column (transform)
x_std = ss.fit_transform(x)

# print mean and standard deviation
print(f'Mean {x_std.mean():.2f}')
print(f'Standard Deviation {x_std.std():.2f}')

Mean -0.00
Standard Deviation 1.00
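
The check above prints the mean and standard deviation over all columns at once. As a minimal sketch of what `StandardScaler` computes per column, the snippet below compares it with the manual z-score formula. The array `x_toy` is a hypothetical toy example (area and bedrooms values taken from the first rows shown earlier), not the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales (area, bedrooms).
x_toy = np.array([[7420.0, 4.0],
                  [8960.0, 4.0],
                  [9960.0, 3.0],
                  [7500.0, 4.0]])

# Manual z-score: subtract the column mean and divide by the column
# standard deviation (population formula, ddof=0, NumPy's default).
x_manual = (x_toy - x_toy.mean(axis=0)) / x_toy.std(axis=0)

# Same operation through scikit-learn.
x_scaled = StandardScaler().fit_transform(x_toy)

# Both approaches should agree, and each column should have mean 0 and std 1.
print(np.allclose(x_manual, x_scaled))
print(x_scaled.mean(axis=0).round(6), x_scaled.std(axis=0).round(6))
```

Checking with `axis=0` verifies every column separately, which is a stricter test than a single mean over the whole matrix.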


# Linear Regression

In [177]:
# import train_test_split function from sklearn library
from sklearn.model_selection import train_test_split

In [178]:
# create the feature array x by removing the 'price' column (target); .values returns a NumPy array
x = df.drop('price', axis=1).values

# create the target array y keeping only the 'price' column
y = df['price'].values

A random seed is a value set to make results reproducible when using functions that involve random numbers. With a fixed seed, every run produces the same training and test sets, so the different models applied to the data can be compared fairly.

In [179]:
random_seed = 0

In [180]:
# split dataset in two parts: train and test
# assign 30% of the data to the test set and the remaining to the training set
# apply random_seed
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=.3,
    random_state=random_seed
    )

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(381, 12) (164, 12) (381,) (164,)
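
The effect of fixing `random_state` can be checked directly. The sketch below uses hypothetical toy arrays (`x_demo`, `y_demo`), not the housing data, and shows that two calls with the same seed return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(x_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(x_demo, y_demo, test_size=0.3, random_state=0)
same_seed_equal = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A different seed will generally shuffle the rows differently.
split_c = train_test_split(x_demo, y_demo, test_size=0.3, random_state=1)

print(same_seed_equal)
print(split_a[3], split_c[3])  # test targets under seed 0 and seed 1
```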


Apply standardization to the training and test sets so that all variables contribute on a comparable scale to the regression and to the regularization penalty (for example, area takes much larger values than bathrooms or bedrooms, but that does not make it more important).

In [181]:
# fit the scaler on the training set only, then apply the same transformation to the test set
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
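
Note that the scaler is fitted on the training set and only applied to the test set. A small sketch of what that implies, on hypothetical random data (`x_demo` is an assumption, not the housing features): the stored statistics come from the training rows alone, so the test columns end up only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 100 samples, 3 features on different scales.
x_demo = rng.normal(loc=[5000.0, 3.0, 1.0], scale=[2000.0, 1.0, 0.5], size=(100, 3))

x_tr, x_te = train_test_split(x_demo, test_size=0.3, random_state=0)

scaler = StandardScaler()
x_tr_std = scaler.fit_transform(x_tr)  # statistics learned from the training rows only
x_te_std = scaler.transform(x_te)      # the same statistics reused on the test rows

# The stored mean matches the training set, not the full data.
print(np.allclose(scaler.mean_, x_tr.mean(axis=0)))
# Training columns are centered exactly; test columns only approximately.
print(x_tr_std.mean(axis=0).round(3))
print(x_te_std.mean(axis=0).round(3))
```

Fitting on the training rows only keeps information from the test set out of the preprocessing step.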

In [182]:
# import mean_squared_error and r2_score metrics from sklearn library
from sklearn.metrics import mean_squared_error, r2_score

In [183]:
# import numpy library
import numpy as np

In [184]:
def evaluate_model(model, dataset):

  # assign features dataset to x and target values dataset to y
  x, y = dataset

  # make predictions based on dataset features
  y_pred = model.predict(x)

  # calculate and print RMSE, rounded to 3 decimal places
  print(f'RMSE: {np.sqrt(mean_squared_error(y, y_pred)):.3f}')
  # calculate and print MSE, rounded to 3 decimal places
  print(f'MSE: {mean_squared_error(y, y_pred):.3f}')
  # calculate and print R2, rounded to 3 decimal places
  print(f'R2: {r2_score(y, y_pred):.3f}')
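
For reference, the three metrics printed by `evaluate_model` can be reproduced by hand. The sketch below uses hypothetical values (`y_true`, `y_hat`) and compares the manual formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted target values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)
# RMSE: square root of MSE, expressed in the same units as the target.
rmse_manual = np.sqrt(mse_manual)
# R2: 1 minus the residual sum of squares over the total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
print(mse_manual, rmse_manual, r2_manual)
```

Because RMSE is in the units of the target, it is the easiest of the three to compare against the price range.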

In [185]:
# import LinearRegression class from sklearn library
from sklearn.linear_model import LinearRegression

In [186]:
# create a LinearRegression object and fit it on the training set
lr = LinearRegression()
lr.fit(x_train, y_train)

In [187]:
# evaluate on the training set first, then on the test set
evaluate_model(lr, (x_train, y_train))
evaluate_model(lr, (x_test, y_test))

RMSE: 1097693.366
MSE: 1204930725280.427
R2: 0.656
RMSE: 980758.422
MSE: 961887082256.527
R2: 0.723


## Regularization models

### Ridge

In [188]:
# import Ridge class from sklearn library
from sklearn.linear_model import Ridge

In [189]:
model = Ridge(alpha=.01)

model.fit(x_train, y_train)

evaluate_model(model, (x_train, y_train))
evaluate_model(model, (x_test, y_test))

RMSE: 1097693.366
MSE: 1204930725755.287
R2: 0.656
RMSE: 980756.085
MSE: 961882497987.251
R2: 0.723


In [190]:
model = Ridge(alpha=.5)

model.fit(x_train, y_train)

evaluate_model(model, (x_train, y_train))
evaluate_model(model, (x_test, y_test))

RMSE: 1097693.905
MSE: 1204931909417.360
R2: 0.656
RMSE: 980642.504
MSE: 961659720417.260
R2: 0.723


In [191]:
model = Ridge(alpha=1)

model.fit(x_train, y_train)

evaluate_model(model, (x_train, y_train))
evaluate_model(model, (x_test, y_test))

RMSE: 1097695.518
MSE: 1204935449596.848
R2: 0.656
RMSE: 980528.489
MSE: 961436118597.690
R2: 0.723


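Changing the *alpha* parameter between 0.01 and 1 does not substantially change the metrics: RMSE, MSE and R2 on both sets stay essentially identical to those of plain linear regression.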
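
One possible reason is that Ridge adds alpha to the diagonal of X^T X, and for standardized columns each diagonal entry equals the number of training rows (381 here), so values of alpha up to 1 are small by comparison. The sketch below sweeps alpha over several orders of magnitude to show where the penalty starts to matter. It runs on hypothetical synthetic data from `make_regression`, not on the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the housing features.
x_demo, y_demo = make_regression(n_samples=500, n_features=12, noise=20.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3, random_state=0)

scaler = StandardScaler()
x_tr = scaler.fit_transform(x_tr)
x_te = scaler.transform(x_te)

# Sweep alpha from 0.01 to 10000 and record the coefficient norm
# and the test R2 for each value.
results = []
for alpha in np.logspace(-2, 4, 7):
    ridge = Ridge(alpha=alpha).fit(x_tr, y_tr)
    results.append((alpha, np.linalg.norm(ridge.coef_), ridge.score(x_te, y_te)))

for alpha, coef_norm, r2 in results:
    print(f'alpha={alpha:>9.2f}  ||coef||={coef_norm:10.3f}  R2={r2:.3f}')
```

The coefficient norm shrinks as alpha grows. Running the same loop on the housing data would show whether larger values of alpha change the test metrics.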

### Lasso

In [134]:
# import Lasso class from sklearn library
from sklearn.linear_model import Lasso

In [135]:
model = Lasso(alpha=2)

model.fit(x_train, y_train)

evaluate_model(model, (x_train, y_train))
evaluate_model(model, (x_test, y_test))

RMSE: 1097693.366
MSE: 1204930725304.515
R2: 0.656
RMSE: 980757.553
MSE: 961885378711.823
R2: 0.723


### ElasticNet

In [136]:
# import ElasticNet class from sklearn library
from sklearn.linear_model import ElasticNet

In [137]:
model = ElasticNet(alpha=2, l1_ratio=0.5)
model.fit(x_train, y_train)
evaluate_model(model, (x_train, y_train))
evaluate_model(model, (x_test, y_test))

RMSE: 1199504.733
MSE: 1438811605477.408
R2: 0.589
RMSE: 1079473.450
MSE: 1165262930316.728
R2: 0.664
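
To interpret these results, it helps to compare the objective functions as given in the scikit-learn documentation. In the block below, $n$ is the number of training samples, $w$ the coefficient vector and $\rho$ stands for `l1_ratio`; the symbols are notation chosen here for illustration.

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

In Lasso and ElasticNet the squared error is divided by $2n$, while in Ridge it is not, so the same numeric alpha does not mean the same penalty strength across models. Multiplying the ElasticNet objective by $2n$ with alpha=2, l1_ratio=0.5 and $n = 381$ gives an L2 weight of $n \alpha (1 - \rho) = 381$, far larger than the Ridge values tried above, which may explain why ElasticNet is the only model whose metrics move noticeably.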
