# Regression analysis

## Principle of regression

Supervised methods in ML include classification (cf. the Titanic exercise) and regression (cf. this exercise about house prices in Boston).

Regression = a set of statistical methods that help us find relationships between variables (predictors = covariates = features = independent variables; outcomes = response variables = dependent variables).

Example: house prices are (roughly) proportional to the size of the houses.
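
To make this example concrete, here is a minimal sketch that fits a straight line to synthetic data with NumPy. The sizes, prices, slope, intercept and noise level are all made up for illustration; none of them come from the dataset.

```python
import numpy as np

# Synthetic data (assumption): price grows linearly with size, plus noise
rng = np.random.default_rng(0)
size_m2 = rng.uniform(50, 250, size=200)        # hypothetical house sizes
true_slope, true_intercept = 1500.0, 20000.0    # hypothetical coefficients
price = true_intercept + true_slope * size_m2 + rng.normal(0, 10000, size=200)

# Least-squares fit of: price = slope * size + intercept
# np.polyfit returns the coefficients from the highest degree to the lowest
slope, intercept = np.polyfit(size_m2, price, deg=1)
print(f"estimated slope: {slope:.1f}, estimated intercept: {intercept:.1f}")
```

The estimated slope should land close to the slope used to generate the data, which is all a linear regression does: it recovers the relationship between a predictor and an outcome.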

## Definition of the problem

The main goal of this exercise is to predict the sale prices for each house (in Boston) based on the different features provided in the dataset.

The data presented in this notebook come from the Kaggle website and list the features that we may want to consider for the exercise.

## Research question

How can we predict the sale price of each house? What would the predicted prices be for these houses?

## Resolution approaches

Regression -> Test different regression methods (linear, lasso, ridge, polynomial).
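
A possible way to compare these four methods is sketched below on synthetic data. The data, the hyperparameters (`alpha`, `degree`) and the variable names are assumptions chosen for illustration, not tuned values for the house prices dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a linear signal plus one quadratic term
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),      # L2 penalty on the coefficients
    "lasso": Lasso(alpha=0.01),     # L1 penalty on the coefficients
    # Polynomial regression = linear regression on polynomial features
    "polynomial": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
    ),
}

rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    residuals = y_te - model.predict(X_te)
    rmse[name] = float(np.sqrt(np.mean(residuals ** 2)))
    print(f"{name}: RMSE = {rmse[name]:.3f}")
```

On this toy problem the polynomial model should win because the data contain a squared term; on real data the ranking has to be checked case by case.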

Metrics: RMSE

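=> Logs: the distribution of the sale prices is roughly exponential (a few very expensive houses stretch the scale), so we work with the logarithm of the prices.

log and exp are inverse functions: exp(log(x)) = x and log(exp(x)) = x, so the log of an exponential trend is linear.

=> This balances the evaluation: otherwise, the errors on expensive houses would outweigh the errors on cheaper houses.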
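
The effect of the logarithm on the metric can be checked on a tiny made-up example. The two prices below are hypothetical, and both predictions are 10 % too high.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two hypothetical houses, both predicted 10 % too high
y_true = np.array([100_000.0, 1_000_000.0])
y_pred = y_true * 1.10

rmse_raw = rmse(y_true, y_pred)                    # dominated by the expensive house
rmse_log = rmse(np.log(y_true), np.log(y_pred))    # the same relative error weighs the same

print(f"RMSE on raw prices: {rmse_raw:.0f}")
print(f"RMSE on log prices: {rmse_log:.4f}")
```

On raw prices, the error of the expensive house is ten times larger than the error of the cheap one. On log prices, both errors equal log(1.10), so each house contributes equally.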

## Parameters

## Import modules and load files

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

from sklearn.model_selection import train_test_split

# OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# Regression
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, root_mean_squared_error

In [None]:
train = pd.read_csv("../data/raw/house_prices/train.csv", sep = ",", index_col=0) # Id has no predictive value, so it is used as the index
train.info()
train.head()

In [None]:
test = pd.read_csv("../data/raw/house_prices/test.csv", sep = ",")
test.info()
test.head()

In [None]:
sample = pd.read_csv("../data/raw/house_prices/sample_submission.csv", sep = ",")
sample.info()
sample.head()

In [None]:
train["MSSubClass"].unique()

In [None]:
train["MSZoning"].unique()

In [None]:
train["LandContour"].unique()

In [None]:
for header in train.columns.values:
    print(header)
    print("Unique values:", train[header].unique())
    print("=============================")

| Variables | Natures of the variables | 
| :--------: | :--------------------: |
| Id         | Numerical (quantitative) discrete |
| MSSubClass | Numerical (quantitative) discrete (integer codes for the type of dwelling, so arguably categorical) |
| MSZoning   | Categorical (qualitative)|
| LotFrontage | Numerical (quantitative) continuous |
| LotArea    | Numerical (quantitative) continuous |
| Street     | Categorical (qualitative) |
| Alley      | Categorical (qualitative) |
| LotShape   | Categorical (qualitative) |
| LandContour | Categorical (qualitative) |
| Utilities | Categorical (qualitative) |
| LotConfig | Categorical (qualitative) |
| LandSlope | Categorical (qualitative) |
| Neighborhood | Categorical (qualitative) |
| Condition1 | Categorical (qualitative) |
| Condition2 | Categorical (qualitative) |
| BldgType | Categorical (qualitative) |
| HouseStyle | Categorical (qualitative) |
| OverallQual | Numerical (quantitative) discrete |
| OverallCond | Numerical (quantitative) discrete |
| YearBuilt | Categorical (qualitative) if discretised afterwards (old/new houses) / Numerical (quantitative) discrete (limits the number of columns for the model) |
| YearRemodAdd | Same as YearBuilt |
| RoofStyle | Categorical (qualitative) |
| RoofMatl | Categorical (qualitative) |
| Exterior1st | Categorical (qualitative) |
| Exterior2nd | Categorical (qualitative) |
| MasVnrType | Categorical (qualitative) |
| MasVnrArea | Numerical (quantitative) continuous |
| ExterQual | Categorical (qualitative) |
| ExterCond | Categorical (qualitative) |
| Foundation | Categorical (qualitative) |
| BsmtQual | Categorical (qualitative) |
| BsmtCond | Categorical (qualitative) |
| BsmtExposure | Categorical (qualitative) |
| BsmtFinType1 | Categorical (qualitative) |
| BsmtFinSF1 | Numerical (quantitative) continuous |
| BsmtFinType2 | Categorical (qualitative) |
| BsmtFinSF2 | Numerical (quantitative) continuous |
| BsmtUnfSF | Numerical (quantitative) continuous |
| TotalBsmtSF | Numerical (quantitative) continuous |
| Heating | Categorical (qualitative) |
| HeatingQC | Categorical (qualitative) |
| CentralAir | Categorical (qualitative) |
| Electrical | Categorical (qualitative) |
| 1stFlrSF | Numerical (quantitative) continuous |
| 2ndFlrSF | Numerical (quantitative) continuous |
| LowQualFinSF | Numerical (quantitative) continuous |
| GrLivArea | Numerical (quantitative) continuous |
| BsmtFullBath | Numerical (quantitative) discrete | 
| BsmtHalfBath | Numerical (quantitative) discrete |
| FullBath | Numerical (quantitative) discrete |
| HalfBath | Numerical (quantitative) discrete |
| BedroomAbvGr | Numerical (quantitative) discrete |
| KitchenAbvGr | Numerical (quantitative) discrete |
| KitchenQual | Categorical (qualitative) |
| TotRmsAbvGrd | Numerical (quantitative) discrete |
| Functional | Categorical (qualitative) |
| Fireplaces | Numerical (quantitative) discrete |
| FireplaceQu | Categorical (qualitative) |
| GarageType | Categorical (qualitative) |
| GarageYrBlt | Categorical (qualitative) if discretised afterwards (old/new houses) / Numerical (quantitative) discrete (limits the number of columns for the model) |
| GarageFinish | Categorical (qualitative) |
| GarageCars | Numerical (quantitative) discrete |
| GarageArea | Numerical (quantitative) continuous |
| GarageQual | Categorical (qualitative) |
| GarageCond | Categorical (qualitative) |
| PavedDrive | Categorical (qualitative) |
| WoodDeckSF | Numerical (quantitative) continuous |
| OpenPorchSF | Numerical (quantitative) continuous |
| EnclosedPorch | Numerical (quantitative) continuous |
| 3SsnPorch | Numerical (quantitative) continuous |
| ScreenPorch | Numerical (quantitative) continuous |
| PoolArea | Numerical (quantitative) continuous |
| PoolQC | Categorical (qualitative) |
| Fence | Categorical (qualitative) |
| MiscFeature | Categorical (qualitative) |
| MiscVal | Numerical (quantitative) continuous |
| MoSold | Numerical (quantitative) discrete |
| YrSold | Categorical (qualitative) if discretised afterwards (old/new houses) / Numerical (quantitative) discrete (limits the number of columns for the model) |
| SaleType | Categorical (qualitative) |
| SaleCondition | Categorical (qualitative) |
| SalePrice | Numerical (quantitative) continuous |

Rule of thumb from the dtypes: float = numerical (quantitative) continuous (warning: time variables); string = categorical (qualitative); int = difficult to determine (it can be either one, or neither). Id is neither of them for us (a machine learning model would still see a number): it is metadata (= data about data, which give us information about the sample).

Categorical variables: no order, fixed number of possible values (e.g. cat / dog)

If we exclude the years, we have 43 categorical variables
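
The dtype-based rule of thumb can be applied with `select_dtypes()`. The sketch below uses a toy frame as a stand-in for the training data; its values are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame (assumption): a small stand-in for the training data
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0],   # float -> numerical continuous
    "OverallQual": [7, 6, 7],               # int   -> to be decided case by case
    "MSZoning": ["RL", "RL", "RM"],         # str   -> categorical
    "Street": ["Pave", "Pave", "Grvl"],     # str   -> categorical
})
# Force the object dtype so the selection below behaves the same across pandas versions
toy = toy.astype({"MSZoning": object, "Street": object})

categorical_cols = toy.select_dtypes(include=[object]).columns.tolist()
numerical_cols = toy.select_dtypes(include=["number"]).columns.tolist()

print(toy.dtypes)
print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```

The int columns end up on the numerical side automatically, which is exactly where a human check is still needed (codes and identifiers are stored as integers too).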

In [None]:
fig = px.box(train, x = "MSSubClass", y = "SalePrice", log_y = True, 
             title = "Boxplot representing the house prices according to the SubClasses")
fig.show()

In [None]:
fig = px.box(train, x = "MSZoning", y = "SalePrice", title = "Impact of the MSZoning on the house prices", log_y=True)
fig.show()

In [None]:
# LotFrontage and LotArea: continuous variables => Line plots or scatter plots
fig = px.scatter(train, x = "LotArea", y = "SalePrice", color = "Neighborhood",
                title = "Impact of the lot area on house prices")
fig.show()

In [None]:
fig = px.box(train, x = "Neighborhood", y = "SalePrice", 
                title = "Impact of the neighborhood on house prices")
fig.show()

## Data preparation

Categorical data => OneHotEncoder

Numerical data => fill in the missing values (with a fixed number, or with the mean). If the mean is used, it must be computed from the TRAINING dataset.
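
A minimal sketch of this rule on two toy frames: the mean is learned on the training part and reused as-is on the test part. The values are made up, and `train_part` / `test_part` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): one numerical and one categorical column with gaps
train_part = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0], "Alley": ["Grvl", None, "Pave"]})
test_part = pd.DataFrame({"LotFrontage": [np.nan, 100.0], "Alley": [None, "Grvl"]})

# Statistic learned on the training part only: (60 + 80) / 2
lot_frontage_mean = train_part["LotFrontage"].mean()

train_filled = train_part.assign(
    LotFrontage=train_part["LotFrontage"].fillna(lot_frontage_mean),
    Alley=train_part["Alley"].fillna("unknown"),
)
# The test part is filled with the TRAINING mean, never with its own mean
test_filled = test_part.assign(
    LotFrontage=test_part["LotFrontage"].fillna(lot_frontage_mean),
    Alley=test_part["Alley"].fillna("unknown"),
)

print(train_filled)
print(test_filled)
```

Using the test mean instead would let information about the evaluation data influence the preprocessing.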

In [None]:
# Separating the features and the target (SalePrice)
X = train.drop(columns = ["SalePrice"])
#[[
#     "MSSubClass",
#     "MSZoning",
#     "HouseStyle",
#     "YearBuilt",
#     "Heating",
#     "Electrical",
#     "BsmtFullBath",
#     "FullBath",
#     "BedroomAbvGr",
#     "KitchenAbvGr",
#     "KitchenQual",
#     "GarageType",
#     "GarageCars",
#     "PavedDrive",
#     "OpenPorchSF",
#     "PoolQC"
# ]]
#X.info()
#X.head()

y = train["SalePrice"]
y.info()
y.head()

In [None]:
# Splitting the train dataset to evaluate the model's performance
# Good practice: split as early as possible, so that no information from the evaluation data leaks into the training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.33, random_state = 42
)

# random_state: a computer is deterministic, so "random" numbers are produced by a pseudo-random number generator.
# This algorithm is initialised with a seed.
# => For a given seed, we always get the same sequence of generated numbers.
# Here, the random_state parameter is that seed.
# => The split between train and test is fixed in advance, i.e. the result we get today is exactly the same
# as the one we obtained a few months ago.
# => Reproducibility
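
The role of the seed can be checked on a toy array (the data below are an assumption made for illustration): two splits with the same `random_state` are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy data

# Same seed -> same split, run after run
a_train, a_test = train_test_split(data, test_size=0.33, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.33, random_state=42)
same_split = bool(np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test))

# A different seed will generally give a different split
c_train, c_test = train_test_split(data, test_size=0.33, random_state=0)

print("same seed, same split:", same_split)
print("seed 42:", a_test)
print("seed 0: ", c_test)
```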

In [None]:
X_train.info()
X_train.head()
#X_test
#y_train
#y_test

In [None]:
# Filling in missing values
# Work on a copy so that the assignments below cannot trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()

# Numerical columns: fill with the mean computed on the TRAINING data
num_cols_na = ["LotFrontage", "MasVnrArea", "GarageYrBlt"]
X_train[num_cols_na] = X_train[num_cols_na].fillna(X_train[num_cols_na].mean())

# Categorical columns: a missing value becomes its own "unknown" category
cat_cols_na = [
    "Alley", "MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
    "BsmtFinType2", "Electrical", "FireplaceQu", "GarageType", "GarageFinish",
    "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature",
]
X_train[cat_cols_na] = X_train[cat_cols_na].fillna("unknown")
X_train.info()
X_train.head()

In [None]:
# OneHotEncoder on X_train

# Restrict to the categorical data using df.select_dtypes(); the one-hot encoder creates one column per possible value
X_enc = X_train.select_dtypes(include=[object]) # X_enc = subset of categorical variables (object dtype: here, columns of strings)
#X_enc.info() # => 43 categorical variables => ok
#X_enc.head()

# Creating an instance of the one-hot encoder
# min_frequency = 10: categories seen fewer than 10 times in a column are grouped into a single "infrequent" column
# handle_unknown = "infrequent_if_exist": a category never seen during fit (e.g. one that only appears in the test dataset)
# is mapped to that "infrequent" column when it exists, otherwise it is encoded as all zeros
# Prefer handle_unknown = "error" whenever possible, so that unexpected categories do not go unnoticed
enc = OneHotEncoder(handle_unknown = "infrequent_if_exist", min_frequency = 10)
#enc

# Fit the encoder on the categorical columns: for each column, it determines the list of possible values
enc = enc.fit(X_enc)
#enc.categories_ # list of arrays: the 1st element is the list of possible values of the 1st column
# (same order as the columns of X_enc)

# Transform on X_enc
X_enc_transfo = enc.transform(X_enc).toarray()
enc.get_feature_names_out()

# Building the dataframe
df_enc = pd.DataFrame(X_enc_transfo, columns = enc.get_feature_names_out(), index = X_enc.index, dtype=int)
#df_enc.info()
#df_enc.head(5)

# Deleting the columns containing categorical values in X_train and merging the 2 dataframes (X_train + df_enc) on IDs
X_train_drop = X_train.drop(columns = X_enc.columns)
#X_train.info()
#X_train.head()
X_train = X_train_drop.merge(df_enc, how = "left", left_index = True, right_index = True)
X_train.info()
X_train.head(5)

In [None]:
X_enc_transfo.shape

In [None]:
enc.get_feature_names_out().shape

In [None]:
X_enc.shape

In [None]:
# Applying the same preprocessing to the evaluation dataset X_test: fillna() and one-hot encoding
# Fitting = adjusting a model (or an encoder) to our data; this is only ever done on X_train

# Filling in missing values => WARNING: SAME means as for X_train
# Work on a copy so that the assignments below cannot trigger pandas' SettingWithCopyWarning
X_test = X_test.copy()

# Numerical columns: fill with the mean computed on the TRAINING data
num_cols_na = ["LotFrontage", "MasVnrArea", "GarageYrBlt"]
X_test[num_cols_na] = X_test[num_cols_na].fillna(X_train[num_cols_na].mean())

# Categorical columns: a missing value becomes its own "unknown" category
cat_cols_na = [
    "Alley", "MasVnrType", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
    "BsmtFinType2", "Electrical", "FireplaceQu", "GarageType", "GarageFinish",
    "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature",
]
X_test[cat_cols_na] = X_test[cat_cols_na].fillna("unknown")
X_test.info()
X_test.head()

In [None]:
# OneHotEncoder on X_test

# Restrict to the categorical data using df.select_dtypes()
X_enc = X_test.select_dtypes(include=[object]) # X_enc = subset of categorical variables (object dtype: here, columns of strings)
#X_enc.info() # => 43 categorical variables => ok (initial categorical columns)
#X_enc.head()

# No new encoder here: we reuse enc, which already knows the possible values of each category
# (1 value = 1 column), as computed from the training data
# WARNING: all fits are done on X_train, so enc.fit() is NOT called on X_test

# Transform X_enc: each categorical value becomes a boolean column (0 or 1)
X_enc_transfo = enc.transform(X_enc).toarray()
enc.get_feature_names_out()

# Building the dataframe
df_enc = pd.DataFrame(X_enc_transfo, columns = enc.get_feature_names_out(), index = X_enc.index, dtype=int)
#df_enc.info()
#df_enc.head(5)

# Deleting the columns containing categorical values in X_test and merging the 2 dataframes (X_test + df_enc) on IDs
# (replacing categorical values by encoded ones)
X_test_drop = X_test.drop(columns = X_enc.columns)
#X_test.info()
#X_test.head()
X_test = X_test_drop.merge(df_enc, how = "left", left_index = True, right_index = True)
X_test.info()
X_test.head(5)
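
The "fit on train, transform on test" discipline used above can also be delegated to scikit-learn. The sketch below is an alternative, not the approach of this notebook: it combines `SimpleImputer`, `OneHotEncoder` and `ColumnTransformer` on toy frames whose values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): one numerical and one categorical column with gaps
toy_train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Alley": ["Grvl", np.nan, "Pave", "Grvl"],
})
toy_test = pd.DataFrame({
    "LotFrontage": [np.nan, 100.0],
    "Alley": [np.nan, "Pave"],
})

preprocess = ColumnTransformer(
    [
        # Numerical column: missing values replaced by the mean seen during fit
        ("num", SimpleImputer(strategy="mean"), ["LotFrontage"]),
        # Categorical column: missing values become "unknown", then one-hot encoding
        ("cat", make_pipeline(
            SimpleImputer(strategy="constant", fill_value="unknown"),
            OneHotEncoder(handle_unknown="ignore"),
        ), ["Alley"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_array = preprocess.fit_transform(toy_train)  # statistics learned here only
test_array = preprocess.transform(toy_test)        # and reused here

print(train_array)
print(test_array)
```

The design choice is that every statistic (mean, list of categories) lives inside one fitted object, which makes it harder to fit something on the evaluation data by accident.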

In [None]:
df_enc.head(5)
#X_enc.info()

In [None]:
X_enc.head(5)

## Regression

In [None]:
reg = LinearRegression().fit(X_train, y_train)
print("==========================")
score_train = reg.score(X_train, y_train)
print(f"score train: {score_train}")
score_test = reg.score(X_test, y_test)
print(f"score test: {score_test}")

In [None]:
y_pred = reg.predict(X_test)
y_pred # predicted sale prices, in the same unit as SalePrice ($)

In [None]:
# Evaluating the deviation between y_pred and y_test - metrics:
# - RMSE
# - R^2
root_mean_squared_error(y_test, y_pred)
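
Both metrics can be written out by hand to see what they measure. The prices below are made up for illustration. For reference, the `score()` method of `LinearRegression` returns R^2.

```python
import numpy as np

# Hypothetical true and predicted prices
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])
y_hat = np.array([210_000.0, 140_000.0, 290_000.0, 260_000.0])

residuals = y_true - y_hat

# RMSE: typical size of an error, in the same unit as the prices
rmse = float(np.sqrt(np.mean(residuals ** 2)))

# R^2: share of the variance of y_true explained by the predictions
ss_res = float(np.sum(residuals ** 2))                  # residual sum of squares
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.0f}")
print(f"R^2: {r2:.3f}")
```

RMSE depends on the scale of the target, whereas R^2 is unitless, so the two are complementary when comparing models.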