# Regression analysis

## Principle of regression

Supervised methods in ML include classification (cf. the Titanic exercise) and regression (cf. this exercise about house prices in Boston).

Regression = a set of statistical methods for finding relationships between variables: the predictors (also called covariates, features, or independent variables) and the outcomes (also called response variables or dependent variables).

Example: house prices are roughly proportional to the size of the houses.

## Definition of the problem

The main goal of this exercise is to predict the sale prices for each house (in Boston) based on the different features provided in the dataset.

The data presented in this notebook come from the Kaggle website and report the features that we may want to consider for the exercise.

## Problem statement

How can we predict the sale price of each house? What would the predicted prices be for these houses?

## Resolution approaches

Regression -> Test different regression methods (linear, lasso, ridge, polynomial).
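As a minimal sketch of what testing several regression methods can look like, the block below fits the four families listed above on synthetic data and compares their RMSE. The data, the variable names (`surface`, `price`), and the hyperparameters (`alpha=1.0`, `degree=2`) are assumptions made for illustration only, not values taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): the price grows linearly with the surface, plus noise.
rng = np.random.default_rng(0)
surface = rng.uniform(50, 250, size=200).reshape(-1, 1)
price = 1000 * surface[:, 0] + 20000 + rng.normal(0, 5000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    surface, price, test_size=0.33, random_state=42
)

# One candidate per regression family; hyperparameters are illustrative.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "ridge": Ridge(alpha=1.0),
    "polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = float(np.sqrt(mean_squared_error(y_te, pred)))
    print(f"{name}: RMSE = {scores[name]:.0f}")
```

On data that really is linear, the four models should score similarly; differences become meaningful only on the real features.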

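Metric: RMSE, computed on the logarithm of the sale prices.

=> Logs: the distribution of the sale prices is strongly right-skewed (it looks roughly exponential). Since log is the inverse of exp, taking the log of the prices brings them back to a roughly linear scale.

=> Balancing the evaluation: on the log scale, an error is measured relative to the price, so errors on expensive houses do not outweigh errors on cheaper houses.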
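To make the balancing effect of the logs concrete, here is a small sketch comparing the plain RMSE with the RMSE computed on log prices. The helper names `rmse` and `rmse_log` and the two example prices are assumptions chosen for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_log(y_true, y_pred):
    """RMSE computed on the natural log of the values (values must be > 0)."""
    return rmse(np.log(y_true), np.log(y_pred))


# A cheap and an expensive house, both predicted 10% too high.
y_true = np.array([100_000.0, 1_000_000.0])
y_pred = y_true * 1.10

# Absolute errors: the expensive house dominates the plain RMSE.
abs_errors = np.abs(y_pred - y_true)
# Log errors: both houses contribute the same amount, log(1.1).
log_errors = np.abs(np.log(y_pred) - np.log(y_true))

print("absolute errors:", abs_errors)
print("log errors:", log_errors)
print("RMSE:", rmse(y_true, y_pred))
print("RMSE on logs:", rmse_log(y_true, y_pred))
```

With the plain RMSE, the same relative mistake costs ten times more on the expensive house; on the log scale both mistakes weigh the same.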

## Parameters

## Import modules and load files

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.model_selection import train_test_split

# OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
train = pd.read_csv("../data/raw/house_prices/train.csv", sep = ",")
train.info()
train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [None]:
test = pd.read_csv("../data/raw/house_prices/test.csv", sep = ",")
test.info()
test.head()

In [None]:
sample = pd.read_csv("../data/raw/house_prices/sample_submission.csv", sep = ",")
sample.info()
sample.head()

In [None]:
train["MSSubClass"].unique()

In [None]:
train["MSZoning"].unique()

In [None]:
train["LandContour"].unique()

In [None]:
for header in train.columns.values:
    print(header)
    print("Unique values:", train[header].unique())
    print("=============================")

| Variable | Nature of the variable |
| :--------: | :--------------------: |
| Id         | Numerical (quantitative) discrete |
| MSSubClass | Categorical (qualitative), stored as integer codes |
| MSZoning   | Categorical (qualitative)|
| LotFrontage | Numerical (quantitative) continuous |
| LotArea    | Numerical (quantitative) continuous |
| Street     | Categorical (qualitative) |
| Alley      | Categorical (qualitative) |
| LotShape   | Categorical (qualitative) |
| LandContour | Categorical (qualitative) |
| Utilities | Categorical (qualitative) |
| LotConfig | Categorical (qualitative) |
| LandSlope | Categorical (qualitative) |
| Neighborhood | Categorical (qualitative) |
| Condition1 | Categorical (qualitative) |
| Condition2 | Categorical (qualitative) |
| BldgType | Categorical (qualitative) |
| HouseStyle | Categorical (qualitative) |
| OverallQual | Numerical (quantitative) discrete |
| OverallCond | Numerical (quantitative) discrete |
| YearBuilt | Categorical (qualitative) if discretised afterwards (old/new houses) / Numerical (quantitative) discrete otherwise (limits the number of columns for the model) |
| YearRemodAdd | Same as YearBuilt |
| RoofStyle | Categorical (qualitative) |
| RoofMatl | Categorical (qualitative) |
| Exterior1st | Categorical (qualitative) |
| Exterior2nd | Categorical (qualitative) |
| MasVnrType | Categorical (qualitative) |
| MasVnrArea | Numerical (quantitative) continuous |
| ExterQual | Categorical (qualitative) |
| ExterCond | Categorical (qualitative) |
| Foundation | Categorical (qualitative) |
| BsmtQual | Categorical (qualitative) |
| BsmtCond | Categorical (qualitative) |
| BsmtExposure | Categorical (qualitative) |
| BsmtFinType1 | Categorical (qualitative) |
| BsmtFinSF1 | Numerical (quantitative) continuous |
| BsmtFinType2 | Categorical (qualitative) |
| BsmtFinSF2 | Numerical (quantitative) continuous |
| BsmtUnfSF | Numerical (quantitative) continuous |
| TotalBsmtSF | Numerical (quantitative) continuous |
| Heating | Categorical (qualitative) |
| HeatingQC | Categorical (qualitative) |
| CentralAir | Categorical (qualitative) |
| Electrical | Categorical (qualitative) |
| 1stFlrSF | Numerical (quantitative) continuous |
| 2ndFlrSF | Numerical (quantitative) continuous |
| LowQualFinSF | Numerical (quantitative) continuous |
| GrLivArea | Numerical (quantitative) continuous |
| BsmtFullBath | Numerical (quantitative) discrete | 
| BsmtHalfBath | Numerical (quantitative) discrete |
| FullBath | Numerical (quantitative) discrete |
| HalfBath | Numerical (quantitative) discrete |
| BedroomAbvGr | Numerical (quantitative) discrete |
| KitchenAbvGr | Numerical (quantitative) discrete |
| KitchenQual | Categorical (qualitative) |
| TotRmsAbvGrd | Numerical (quantitative) discrete |
| Functional | Categorical (qualitative) |
| Fireplaces | Numerical (quantitative) discrete |
| FireplaceQu | Categorical (qualitative) |
| GarageType | Categorical (qualitative) |
| GarageYrBlt | Categorical (qualitative) if discretised afterwards (old/new houses) / Numerical (quantitative) discrete otherwise (limits the number of columns for the model) |
| GarageFinish | Categorical (qualitative) |
| GarageCars | Numerical (quantitative) discrete |
| GarageArea | Numerical (quantitative) continuous |
| GarageQual | Categorical (qualitative) |
| GarageCond | Categorical (qualitative) |
| PavedDrive | Categorical (qualitative) |
| WoodDeckSF | Numerical (quantitative) continuous |
| OpenPorchSF | Numerical (quantitative) continuous |
| EnclosedPorch | Numerical (quantitative) continuous |
| 3SsnPorch | Numerical (quantitative) continuous |
| ScreenPorch | Numerical (quantitative) continuous |
| PoolArea | Numerical (quantitative) continuous |
| PoolQC | Categorical (qualitative) |
| Fence | Categorical (qualitative) |
| MiscFeature | Categorical (qualitative) |
| MiscVal | Numerical (quantitative) continuous |
| MoSold | Numerical (quantitative) discrete |
| YrSold | Categorical (qualitative) if discretised afterwards / Numerical (quantitative) discrete otherwise (limits the number of columns for the model) |
| SaleType | Categorical (qualitative) |
| SaleCondition | Categorical (qualitative) |
| SalePrice | Numerical (quantitative) continuous |

Rules of thumb from the dtypes: float => numerical (quantitative) continuous (warning: time variables can also be stored as numbers); string => categorical (qualitative); int => difficult to determine (it can be either of the two, or neither). Example: Id is neither. It is metadata (data about data, which gives us information about the sample): useful to us, but not to the machine learning model.
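A first guess of the nature of each variable can be read from the dtypes, as in the sketch below. The toy frame `df` and its values are assumptions that only mimic a few columns of the dataset; the integer columns still have to be reviewed by hand, as noted above.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) mimicking a few columns of the training data.
df = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "MSZoning": ["RL", "RM", "RL"],
        "LotFrontage": [65.0, None, 68.0],
        "OverallQual": [7, 6, 7],
    }
)

# float columns: candidates for numerical continuous variables
float_cols = df.select_dtypes(include=[np.floating]).columns.tolist()
# int columns: ambiguous, to be reviewed one by one (codes, counts, ids...)
int_cols = df.select_dtypes(include=[np.integer]).columns.tolist()
# everything non-numeric: candidates for categorical variables
other_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

print("float:", float_cols)
print("int:", int_cols)
print("non-numeric:", other_cols)

# Id is metadata: keep it aside for us, but do not feed it to the model.
features = df.drop(columns=["Id"])
```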

In [None]:
fig = px.box(train, x = "MSSubClass", y = "SalePrice", log_y = True, 
             title = "Boxplot representing the house prices according to the SubClasses")
fig.show()

In [None]:
fig = px.box(train, x = "MSZoning", y = "SalePrice", title = "Impact of the MSZoning on the house prices", log_y=True)
fig.show()

In [None]:
# LotFrontage and LotArea: continuous variables => Line plots or scatter plots
fig = px.scatter(train, x = "LotArea", y = "SalePrice", color = "Neighborhood",
                title = "Impact of the lot area on house prices")
fig.show()

## Data preparation

If categorical data => OneHotEncoder

If numerical data => fill in the missing values (with a fixed number, or with the mean) => if the mean is used, compute it from the TRAINING dataset only
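The rule about the mean can be sketched with scikit-learn's `SimpleImputer`: the mean is learned on the training data and then reused as-is on the test data. The two small frames below are assumptions made for illustration; this is one possible way to do it, not necessarily the one used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data (assumption) with missing values in both parts.
X_train_num = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
X_test_num = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform: the mean is computed from the TRAINING data only.
train_filled = imputer.fit_transform(X_train_num)

# transform: the test data is filled with the training mean,
# never with a mean computed from the test data itself.
test_filled = imputer.transform(X_test_num)

print("mean learned on train:", imputer.statistics_)
print(train_filled.ravel())
print(test_filled.ravel())
```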

In [3]:
# Separating the features and the target (SalePrice)
X = train.drop(columns = ["SalePrice"])
#X.info()
#X.head()

y = train["SalePrice"]
y.info()
y.head()

<class 'pandas.core.series.Series'>
RangeIndex: 1460 entries, 0 to 1459
Series name: SalePrice
Non-Null Count  Dtype
--------------  -----
1460 non-null   int64
dtypes: int64(1)
memory usage: 11.5 KB


0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

In [4]:
# Splitting the train dataset, to evaluate the model's performance
# Good practice: split as early as possible, so that no information from the test dataset leaks into the training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.33, random_state = 42
)

# random_state: "random" numbers come from a pseudo-random number generator (on a computer, everything is deterministic).
# To bring in some external randomness, the generator is initialized with a seed.
# => For a given seed, we always get the same sequence of generated numbers.
# Here, in this function, the random_state parameter is that seed.
# => The repartition between the train and the test is fixed in advance, i.e. the result that we have today is exactly
# the same as the one we obtained a few months ago.
# => Reproducibility
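The reproducibility given by the seed can be checked on a toy array, as in the sketch below. The array `data` and the second seed value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 20 samples identified by their index.
data = np.arange(20)

# Two splits with the same seed...
a_train, a_test = train_test_split(data, test_size=0.33, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.33, random_state=42)
# ...and one split with another seed.
c_train, c_test = train_test_split(data, test_size=0.33, random_state=0)

# Same seed => exactly the same repartition between train and test.
same_seed_same_split = np.array_equal(a_test, b_test)

print("same seed, same test set:", same_seed_same_split)
print("test set with seed 42:", a_test)
print("test set with seed 0: ", c_test)
```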

In [9]:
#X_train
#X_test
#y_train
y_test

892     154500
1105    325000
413     115000
522     159000
1036    315500
         ...  
1010    135000
390     119000
1409    215000
847     133500
1284    169000
Name: SalePrice, Length: 482, dtype: int64

In [10]:
# Keep only the categorical columns using df.select_dtypes() => OneHotEncoder creates one column per possible value
X_enc = X_train.select_dtypes(include=[object])

# Creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')
#enc

# Apply one-hot encoding to the categorical columns
one_hot_encoded = enc.fit_transform(X_enc)
one_hot_encoded

<978x262 sparse matrix of type '<class 'numpy.float64'>'
	with 42054 stored elements in Compressed Sparse Row format>