# Regression analysis

## Principle of regression

Supervised methods in ML include classification (cf. the Titanic exercise) and regression (cf. this exercise on house prices in Ames, Iowa).

Regression = a set of statistical methods that help us find relationships between variables: predictors (= covariates = features = independent variables) on one side, and outcomes (= response variables = dependent variables) on the other.

Example: house prices are roughly proportional to the size of the houses.

## Definition of the problem

The main goal of this exercise is to predict the sale price of each house (in Ames, Iowa) based on the features provided in the dataset.

The data presented in this notebook come from the Kaggle website and contain the features that we may want to consider for the exercise.

## Problem statement

How can we predict the sale price of each house? What would the predicted prices be for these houses?

## Resolution approaches

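Regression -> test different regression methods (linear, lasso, ridge, polynomial).

Metric: RMSE, computed on the logarithm of the sale prices.

=> Why logs: the distribution of the sale prices is strongly right-skewed (a few very expensive houses), so the prices behave more linearly on a log scale.

exp(log(x)) = x

log(exp(x)) = x

(log and exp are inverse functions, so a prediction made on the log scale can be converted back into a price with exp.)

=> Taking logs balances the evaluation: otherwise, the errors on expensive houses would dominate the metric at the expense of cheaper houses.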
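To make the balancing argument concrete, here is a minimal sketch comparing a plain RMSE with an RMSE computed on log prices. The helper names `rmse` and `rmse_log` and the toy prices are assumptions made for illustration; they are not part of the notebook or of the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Plain root mean squared error, in the unit of the prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of the true and predicted prices."""
    return rmse(np.log(np.asarray(y_true, dtype=float)),
                np.log(np.asarray(y_pred, dtype=float)))


# Made-up prices: both predictions are 10 % too high
cheap_true, cheap_pred = [100_000], [110_000]
expensive_true, expensive_pred = [1_000_000], [1_100_000]

# Plain RMSE: the expensive house weighs ten times more
plain_cheap = rmse(cheap_true, cheap_pred)
plain_expensive = rmse(expensive_true, expensive_pred)

# Log RMSE: the same relative error gives the same score
log_cheap = rmse_log(cheap_true, cheap_pred)
log_expensive = rmse_log(expensive_true, expensive_pred)

print(plain_cheap, plain_expensive)
print(log_cheap, log_expensive)
```

On the log scale only the relative error matters, which is why a 10 % miss counts the same for a cheap and for an expensive house.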

## Parameters

## Import modules and load files

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.model_selection import train_test_split

# OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
train = pd.read_csv("../data/raw/house_prices/train.csv", sep = ",", index_col=0) # Id has no predictive value, so it is used as the index
train.info()
train.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuilt    

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
test = pd.read_csv("../data/raw/house_prices/test.csv", sep = ",")
test.info()
test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [None]:
sample = pd.read_csv("../data/raw/house_prices/sample_submission.csv", sep = ",")
sample.info()
sample.head()

In [None]:
train["MSSubClass"].unique()

In [None]:
train["MSZoning"].unique()

In [None]:
train["LandContour"].unique()

In [None]:
for header in train.columns.values:
    print(header)
    print("Unique values:", train[header].unique())
    print("=============================")

| Variable | Nature of the variable |
| :--------: | :--------------------: |
| Id         | Identifier (metadata), not a feature |
| MSSubClass | Numerical (quantitative) discrete; the integers are codes for the type of dwelling, so it could also be treated as categorical |
| MSZoning   | Categorical (qualitative)|
| LotFrontage | Numerical (quantitative) continuous |
| LotArea    | Numerical (quantitative) continuous |
| Street     | Categorical (qualitative) |
| Alley      | Categorical (qualitative) |
| LotShape   | Categorical (qualitative) |
| LandContour | Categorical (qualitative) |
| Utilities | Categorical (qualitative) |
| LotConfig | Categorical (qualitative) |
| LandSlope | Categorical (qualitative) |
| Neighborhood | Categorical (qualitative) |
| Condition1 | Categorical (qualitative) |
| Condition2 | Categorical (qualitative) |
| BldgType | Categorical (qualitative) |
| HouseStyle | Categorical (qualitative) |
| OverallQual | Numerical (quantitative) discrete |
| OverallCond | Numerical (quantitative) discrete |
| YearBuilt | Categorical (qualitative) if discretised afterwards (old/new houses) / Numerical (quantitative) discrete (limits the number of columns for the model) |
| YearRemodAdd | Same as YearBuilt |
| RoofStyle | Categorical (qualitative) |
| RoofMatl | Categorical (qualitative) |
| Exterior1st | Categorical (qualitative) |
| Exterior2nd | Categorical (qualitative) |
| MasVnrType | Categorical (qualitative) |
| MasVnrArea | Numerical (quantitative) continuous |
| ExterQual | Categorical (qualitative) |
| ExterCond | Categorical (qualitative) |
| Foundation | Categorical (qualitative) |
| BsmtQual | Categorical (qualitative) |
| BsmtCond | Categorical (qualitative) |
| BsmtExposure | Categorical (qualitative) |
| BsmtFinType1 | Categorical (qualitative) |
| BsmtFinSF1 | Numerical (quantitative) continuous |
| BsmtFinType2 | Categorical (qualitative) |
| BsmtFinSF2 | Numerical (quantitative) continuous |
| BsmtUnfSF | Numerical (quantitative) continuous |
| TotalBsmtSF | Numerical (quantitative) continuous |
| Heating | Categorical (qualitative) |
| HeatingQC | Categorical (qualitative) |
| CentralAir | Categorical (qualitative) |
| Electrical | Categorical (qualitative) |
| 1stFlrSF | Numerical (quantitative) continuous |
| 2ndFlrSF | Numerical (quantitative) continuous |
| LowQualFinSF | Numerical (quantitative) continuous |
| GrLivArea | Numerical (quantitative) continuous |
| BsmtFullBath | Numerical (quantitative) discrete | 
| BsmtHalfBath | Numerical (quantitative) discrete |
| FullBath | Numerical (quantitative) discrete |
| HalfBath | Numerical (quantitative) discrete |
| BedroomAbvGr | Numerical (quantitative) discrete |
| KitchenAbvGr | Numerical (quantitative) discrete |
| KitchenQual | Categorical (qualitative) |
| TotRmsAbvGrd | Numerical (quantitative) discrete |
| Functional | Categorical (qualitative) |
| Fireplaces | Numerical (quantitative) discrete |
| FireplaceQu | Categorical (qualitative) |
| GarageType | Categorical (qualitative) |
| GarageYrBlt | Categorical (qualitative) if discretised afterwards (old/new) / Numerical (quantitative) discrete (limits the number of columns for the model) |
| GarageFinish | Categorical (qualitative) |
| GarageCars | Numerical (quantitative) discrete |
| GarageArea | Numerical (quantitative) continuous |
| GarageQual | Categorical (qualitative) |
| GarageCond | Categorical (qualitative) |
| PavedDrive | Categorical (qualitative) |
| WoodDeckSF | Numerical (quantitative) continuous |
| OpenPorchSF | Numerical (quantitative) continuous |
| EnclosedPorch | Numerical (quantitative) continuous |
| 3SsnPorch | Numerical (quantitative) continuous |
| ScreenPorch | Numerical (quantitative) continuous |
| PoolArea | Numerical (quantitative) continuous |
| PoolQC | Categorical (qualitative) |
| Fence | Categorical (qualitative) |
| MiscFeature | Categorical (qualitative) |
| MiscVal | Numerical (quantitative) continuous |
| MoSold | Numerical (quantitative) discrete |
| YrSold | Categorical (qualitative) if discretised afterwards / Numerical (quantitative) discrete (limits the number of columns for the model) |
| SaleType | Categorical (qualitative) |
| SaleCondition | Categorical (qualitative) |
| SalePrice | Numerical (quantitative) continuous |

Rule of thumb: a float is numerical (quantitative) continuous (warning: time variables need care); a string is categorical (qualitative); an int is difficult to determine (it can be one, the other, or neither). Id is "neither" for us (not for the machine learning model): it is metadata, i.e. data that give us information about the sample (data about data).

If we exclude the years, we have 43 categorical variables.
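As a rough illustration of this dtype-based sorting, the sketch below groups the columns of a small table by dtype. The `toy` frame and its values are made up for the example; only the column names are borrowed from the dataset. The integer columns are listed separately because they still need a human decision.

```python
import pandas as pd

# Hypothetical miniature of the training table (made-up values)
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, 80.0, None],    # float  -> continuous
        "LotArea": [8450, 9600, 11250],       # int    -> to be decided by hand
        "MSZoning": ["RL", "RL", "RM"],       # string -> categorical
        "Street": ["Pave", "Pave", "Grvl"],   # string -> categorical
    }
)

# Numeric columns, then split between floats and integers
numeric_cols = toy.select_dtypes(include="number").columns
float_cols = [c for c in numeric_cols if pd.api.types.is_float_dtype(toy[c])]
int_cols = [c for c in numeric_cols if pd.api.types.is_integer_dtype(toy[c])]

# Everything that is not numeric is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("float:", float_cols)
print("int (check manually):", int_cols)
print("categorical:", categorical_cols)
```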

In [None]:
fig = px.box(train, x = "MSSubClass", y = "SalePrice", log_y = True, 
             title = "Boxplot representing the house prices according to the SubClasses")
fig.show()

In [None]:
fig = px.box(train, x = "MSZoning", y = "SalePrice", title = "Impact of the MSZoning on the house prices", log_y=True)
fig.show()

In [None]:
# LotFrontage and LotArea: continuous variables => Line plots or scatter plots
fig = px.scatter(train, x = "LotArea", y = "SalePrice", color = "Neighborhood",
                title = "Impact of the lot area on house prices")
fig.show()

## Data preparation

Categorical data => OneHotEncoder.

Numerical data => fill in the missing values (with a fixed number or with the mean). If the mean is used, it must be computed on the TRAINING dataset only, then reused as is for the test dataset.
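The "mean from the training dataset" rule can be sketched with scikit-learn's `SimpleImputer`. This is only an illustration on made-up values: the frames `train_part` and `test_part` are hypothetical, and the notebook itself does not show which imputation tool is used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values for a single numerical column with missing entries
train_part = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 100.0]})
test_part = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# The mean is learned on the training part only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_part)

# ...and reused unchanged on both parts
train_filled = imputer.transform(train_part)
test_filled = imputer.transform(test_part)

print(imputer.statistics_)  # mean learned from the training part
print(test_filled)
```

Fitting on the training part only keeps any information about the test rows out of the preprocessing.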

In [3]:
# Separating the features and the target (SalePrice)
X = train.drop(columns = ["SalePrice"])
#X.info()
#X.head()

y = train["SalePrice"]
y.info()
y.head()

<class 'pandas.core.series.Series'>
Index: 1460 entries, 1 to 1460
Series name: SalePrice
Non-Null Count  Dtype
--------------  -----
1460 non-null   int64
dtypes: int64(1)
memory usage: 22.8 KB


Id
1    208500
2    181500
3    223500
4    140000
5    250000
Name: SalePrice, dtype: int64

In [4]:
# Split the training dataset so that the model's performance can be evaluated
# Good practice: split as early as possible, so that no information from the held-out (test) part
# leaks into the preprocessing or the training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.33, random_state = 42
)

# random_state: a computer generates pseudo-random numbers (in computing, everything is deterministic).
# To bring in some external randomness, the generator takes a seed to initialise itself.
# => For a given seed, we always get the same sequence of generated numbers.
# In this function, the random_state parameter is that seed.
# => We fix in advance how the rows are distributed between train and test, i.e. the result that we get today
# is exactly the same as the one obtained a few months ago.
# => Reproducibility
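The effect of the seed can be checked on a small example. The sketch below uses a made-up array `data` instead of the housing table and calls `train_test_split` twice with the same `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 row identifiers
data = np.arange(20)

# Two calls with the same seed
first_train, first_test = train_test_split(data, test_size=0.33, random_state=42)
second_train, second_test = train_test_split(data, test_size=0.33, random_state=42)

# Same seed => same rows in the same order
same_split = np.array_equal(first_test, second_test) and np.array_equal(
    first_train, second_train
)

print(same_split)
print(len(first_train), len(first_test))
```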

In [5]:
X_train
#X_test
#y_train
#y_test

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
616,85,RL,80.0,8800,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,5,2010,WD,Abnorml
614,20,RL,70.0,8402,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,12,2007,New,Partial
1304,20,RL,73.0,8688,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Normal
487,20,RL,79.0,10289,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2007,WD,Normal
562,20,RL,77.0,10010,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1096,20,RL,78.0,9317,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,3,2007,WD,Normal
1131,50,RL,65.0,7804,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,12,2009,WD,Normal
1295,20,RL,60.0,8172,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Normal
861,50,RL,55.0,7642,Pave,,Reg,Lvl,AllPub,Corner,...,0,0,,GdPrv,,0,6,2007,WD,Normal


In [6]:
# Restrict to the categorical data with df.select_dtypes() => OneHotEncoder creates one column per possible value
X_enc = X_train.select_dtypes(include=[object]) # X_enc = subset of categorical variables
#X_enc.info() # => 43 categorical variables => ok
#X_enc.head()

# Create an instance of OneHotEncoder
enc = OneHotEncoder() # handle_unknown: what to do with a category never seen during fit (e.g. a category that only appears in the test dataset);
# keep the default "error" for as long as we can
#enc

# Fit the encoder on the categorical columns
enc = enc.fit(X_enc) # looks at each column and determines its unique values, i.e. all the possible values of each categorical variable
#enc.categories_ # list of arrays: the 1st element = the possible values of the 1st column
# (follows the order of the columns in X_enc)

# Transform on X_enc
X_enc_transfo = enc.transform(X_enc).toarray()
enc.get_feature_names_out()

# Building the dataframe
df_enc = pd.DataFrame(X_enc_transfo, columns = enc.get_feature_names_out(), index = X_enc.index, dtype=int)
#df_enc.info()
#df_enc.head(5)

# Deleting the columns containing categorical values in X_train and merging the 2 dataframes (X_train + df_enc) on IDs
X_train_drop = X_train.drop(columns = X_enc.columns)
#X_train.info()
#X_train.head()
X_train = X_train_drop.merge(df_enc, how = "left", left_index = True, right_index = True)
X_train.info()
X_train.head(5)

<class 'pandas.core.frame.DataFrame'>
Index: 978 entries, 616 to 1127
Columns: 298 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(3), int64(295)
memory usage: 2.3 MB


Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
616,85,80.0,8800,6,7,1963,1963,156.0,763,0,...,0,0,0,1,1,0,0,0,0,0
614,20,70.0,8402,5,5,2007,2007,0.0,206,0,...,0,1,0,0,0,0,0,0,0,1
1304,20,73.0,8688,7,5,2005,2005,228.0,0,0,...,0,0,0,1,0,0,0,0,1,0
487,20,79.0,10289,5,7,1965,1965,168.0,836,0,...,0,0,0,1,0,0,0,0,1,0
562,20,77.0,10010,5,5,1974,1975,0.0,1071,123,...,0,0,0,1,0,0,0,0,1,0


In [None]:
X_enc_transfo.shape

In [None]:
enc.get_feature_names_out().shape

In [None]:
X_enc.shape

In [None]:
# Apply the same preprocessing to the test dataset, reusing the OneHotEncoder already fitted on the training data
# Fitting a model on the data = adjusting the model to our data