# Model testing and feature selection
In this notebook we test several models, monitoring their performance with and without feature selection.
Again, the task is a regression analysis where we aim to predict the price of houses given the data.
For feature selection we will use
* SelectKBest: scores each feature on its own with a univariate statistical test and keeps the k best-scoring ones. Since this is a regression task we use `f_regression`, which measures the linear relationship between a single feature and the target (the ANOVA F-test and chi2 are the classification counterparts).
* Recursive Feature Elimination (RFE): repeatedly fits an estimator and removes the least important features until the desired number remains.
* Variance threshold: features with a low variance carry little information. With this transformer we drop the features whose variance falls below a chosen threshold.
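
As a rough illustration of the three selectors listed above, here is a minimal sketch on synthetic data. The data set, the variable names (`X_demo`, `kbest_demo`, and so on) and the choice of keeping 3 features are assumptions made for this example only; they are not taken from the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data (illustration only, not the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

# SelectKBest: keep the k features with the highest univariate F-statistic
kbest_demo = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)

# RFE: repeatedly fit the estimator and drop the weakest feature until 3 remain
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)

# VarianceThreshold: unsupervised, drops features whose variance is <= threshold.
# We append a constant column so that there is something to drop.
X_with_constant = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
vt_demo = VarianceThreshold(threshold=0.01).fit(X_with_constant)

print("KBest keeps:", kbest_demo.get_support(indices=True))
print("RFE keeps:", rfe_demo.get_support(indices=True))
print("VarianceThreshold drops:", np.where(~vt_demo.get_support())[0])
```

SelectKBest and RFE both need the target, while VarianceThreshold only looks at the features themselves.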

In [1]:
import seaborn as sns
import pandas as pd

# load dataset
path = '/Users/G/Desktop/Documents/Formazione in Data Science/WBS/WBS Bootcamp/7. Supervised Machine Learning/Data/iter-7/housing_prices.csv'
housing = pd.read_csv(path)

## 1. Data Preparation
In the following we split the data into training and test sets, define the ordinal encoding for the ordered categorical features, and finally build the preprocessing pipelines.

### 1.1 Train test split

In [2]:
from sklearn.model_selection import train_test_split

X = housing.copy()
#Target variable SalePrice
y = X.pop("SalePrice")
#Remove unessential features
X.drop("Id", axis=1, inplace=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### 1.2 Ordinal encoding

In [3]:
#List of ordinal categorical features
ordered_categories_column_names = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                                   'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence']

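#Ordinal encoding according to data_description.txt
#Note: missing values are imputed as "N_A" in the pipeline below, which matches none of the
#'Na'/'NA' labels listed here, so the encoder maps them to unknown_value (-1)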
ExterQual_cat = ['Po', 'Fa', 'TA', 'Gd', "Ex"]
ExterCond_cat = ['Po', 'Fa', 'TA', 'Gd', "Ex"]
BsmtQual_cat = ['Na', 'Po', 'Fa', 'TA', 'Gd', "Ex"]
BsmtCond_cat = ['Na', 'Po', 'Fa', 'TA', 'Gd', "Ex"]
BsmtExposure_cat = ['NA', 'No', 'Mn', 'Av', 'Gd']
BsmtFinType1_cat = ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
BsmtFinType2_cat = ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']
HeatingQC_cat = ['Po', 'Fa', 'TA', 'Gd', "Ex"]
KitchenQual_cat = ['Po', 'Fa', 'TA', 'Gd', "Ex"]
FireplaceQu_cat = ['Na', 'Po', 'Fa', 'TA', 'Gd', "Ex"]
GarageFinish_cat = ['NA', 'Unf', 'RFn', 'Fin']
GarageQual_cat = ['Na', 'Po', 'Fa', 'TA', 'Gd', "Ex"]
GarageCond_cat = ['Na', 'Po', 'Fa', 'TA', 'Gd', "Ex"]
PoolQC_cat = ['Na', 'Po', 'Fa', 'TA', 'Gd', "Ex"]
Fence_cat = ['Na', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']

ordinal_cats_list = [ExterQual_cat, ExterCond_cat, BsmtQual_cat, BsmtCond_cat, BsmtExposure_cat, BsmtFinType1_cat,
                     BsmtFinType2_cat, HeatingQC_cat, KitchenQual_cat, FireplaceQu_cat, GarageFinish_cat, GarageQual_cat, GarageCond_cat, PoolQC_cat, Fence_cat]


### 1.3 Pipeline

In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# variables for the three pipes
X_num_columns = X.select_dtypes(include='number').copy()
X_cat_columns = X.select_dtypes(exclude='number').copy()
X_cat_ordered_columns = X_cat_columns[ordered_categories_column_names]
X_cat_unordered_columns = X_cat_columns.drop(
    ordered_categories_column_names, axis=1)

# NUMERICAL PIPE with SimpleImputer(strategy="mean")
numeric_pipe = make_pipeline(
    SimpleImputer(),
)

# CATEGORICAL PIPE with SimpleImputer(fill_value="N_A") and OneHotEncoder
unordered_categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder(sparse_output=False, handle_unknown='ignore')
)

# CATEGORICAL PIPE with SimpleImputer(fill_value="N_A") and OrdinalEncoder
ordered_categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OrdinalEncoder(categories=ordinal_cats_list,
                   handle_unknown='use_encoded_value', unknown_value=-1)
)

# Create a ColumnTransformer to apply the transformers to the appropriate columns
# i.e. we glue the three pipes in a single object
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, X_num_columns.columns),
        ('unordered', unordered_categoric_pipe, X_cat_unordered_columns.columns),
        ('ordered', ordered_categoric_pipe, X_cat_ordered_columns.columns)
    ])
preprocessor
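
The parameter grids used later address pipeline steps through automatically generated names such as `columntransformer__num__simpleimputer__strategy`. The following sketch shows, under the assumption of a reduced toy preprocessor with hypothetical column names (`area`, `street`), how these names can be listed with `get_params()` instead of being guessed.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Reduced stand-in for the preprocessor above (column names are hypothetical)
toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', make_pipeline(SimpleImputer()), ['area']),
        ('unordered', make_pipeline(
            SimpleImputer(strategy='constant', fill_value='N_A'),
            OneHotEncoder(handle_unknown='ignore')), ['street']),
    ])

# make_pipeline names each step after its lowercased class name
toy_pipeline = make_pipeline(toy_preprocessor, LinearRegression())

# Nested parameters are joined with a double underscore
imputer_keys = [key for key in toy_pipeline.get_params()
                if key.endswith('simpleimputer__strategy')]
print(imputer_keys)
```

Any key returned by `get_params()` can be used in a parameter grid, which helps avoid typos in long nested names.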

## 2. Model testing and their score

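To monitor performance we use both the R^2 score (1 is a perfect fit, 0 corresponds to always predicting the mean of the target, and negative values are worse than that, with no lower bound) and the RMSE (it should be small, and the gap between train and test should be small as well). We collect the results in the following table: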

In [5]:
from sklearn.metrics import r2_score, mean_squared_error
scores = pd.DataFrame(
    {"Regression model": [""], "R^2 Score Train": [""], "R^2 Score Test": [""], "RMSE Score Train": [""], "RMSE Score Test": [""]})

scores

| | Regression model | R^2 Score Train | R^2 Score Test | RMSE Score Train | RMSE Score Test |
|---|---|---|---|---|---|
| 0 | | | | | |
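
As a quick sanity check of how the two metrics behave, the sketch below evaluates them on a tiny hand-made example. The numbers and variable names are assumptions chosen so that the results can be verified by hand; the RMSE is computed as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])

# Baseline: always predict the mean of the target
y_mean_pred = np.full_like(y_true_demo, y_true_demo.mean())

# A deliberately bad model: predictions in reversed order
y_bad_pred = y_true_demo[::-1].copy()

r2_mean = r2_score(y_true_demo, y_mean_pred)   # baseline R^2
r2_bad = r2_score(y_true_demo, y_bad_pred)     # R^2 of the bad model
rmse_bad = np.sqrt(mean_squared_error(y_true_demo, y_bad_pred))

print(r2_mean, r2_bad, rmse_bad)
```

The mean predictor gets an R^2 of 0, while the reversed predictions give an R^2 of -3, which shows that the score is not bounded below by -1. The RMSE is expressed in the same unit as the target, here the sale price.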


### 2.1 Linear Regressor

In [6]:
from sklearn.linear_model import LinearRegression

# Combine the preprocessor and the model into a single pipeline
pipeline = make_pipeline(preprocessor, LinearRegression())

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Metrics
train_rmse_lr = mean_squared_error(y_train, y_train_pred, squared=False)
test_rmse_lr = mean_squared_error(y_test, y_test_pred, squared=False)
train_r2_lr = r2_score(y_train, y_train_pred)
test_r2_lr = r2_score(y_test, y_test_pred)

# Update the score dataframe
scores = pd.concat([scores, pd.DataFrame(
    {"Regression model": ['Linear Regressor'], "R^2 Score Train":[round(train_r2_lr,2)], "R^2 Score Test": [round(test_r2_lr,2)],
     "RMSE Score Train": [round(train_rmse_lr,2)],"RMSE Score Test": [round(test_rmse_lr,2)]})], axis=0)
scores


| | Regression model | R^2 Score Train | R^2 Score Test | RMSE Score Train | RMSE Score Test |
|---|---|---|---|---|---|
| 0 | | | | | |
| 0 | Linear Regressor | 0.92 | 0.85 | 21700.71 | 32745.96 |
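
The four metric lines of the cell above are repeated for every model in this notebook. One possible way to reduce the repetition is a small helper like the one sketched here. The function name `evaluate_model` and the synthetic data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_tr, X_te, y_tr, y_te):
    """Hypothetical helper: (train R^2, test R^2, train RMSE, test RMSE) of a fitted model."""
    pred_tr = model.predict(X_tr)
    pred_te = model.predict(X_te)
    return (
        r2_score(y_tr, pred_tr),
        r2_score(y_te, pred_te),
        float(np.sqrt(mean_squared_error(y_tr, pred_tr))),
        float(np.sqrt(mean_squared_error(y_te, pred_te))),
    )


# Synthetic data, only to show the call
X_syn, y_syn = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

fitted_demo = LinearRegression().fit(X_tr, y_tr)
metrics_demo = evaluate_model(fitted_demo, X_tr, X_te, y_tr, y_te)
print(metrics_demo)
```

The returned tuple follows the same order as the metric arguments of the `update_scores` function introduced in the next cell, so it could be unpacked directly into that call.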


Below we introduce a function that appends the metrics of a model to the scores dataframe.

In [7]:
def update_scores(scores, model_name, train_r2, test_r2, train_RMSE, test_RMSE):
    # Append one row with the rounded metrics of a model to the scores dataframe
    new_row = pd.DataFrame({
        "Regression model": [model_name],
        "R^2 Score Train": [round(train_r2, 2)], "R^2 Score Test": [round(test_r2, 2)],
        "RMSE Score Train": [round(train_RMSE, 2)], "RMSE Score Test": [round(test_RMSE, 2)]})
    return pd.concat([scores, new_row], axis=0)

#### 2.1.1 Linear Regressor with RFE

In [8]:
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV

# initialize the pipe
full_pipeline_ran = make_pipeline(
    preprocessor, StandardScaler(), RFE(estimator=LinearRegression()))

# parameter grid
param_grid_ran = {
    "columntransformer__num__simpleimputer__strategy": ["mean", "median"],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False]
}

# define cross validation
search_ran = RandomizedSearchCV(full_pipeline_ran, param_grid_ran,
                                n_iter=2, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_ran.fit(X_train, y_train)
y_train_pred_ran = search_ran.predict(X_train)
y_test_pred_ran = search_ran.predict(X_test)

# metrics
train_rmse_lr_RFE = mean_squared_error(y_train, y_train_pred_ran, squared=False)
test_rmse_lr_RFE = mean_squared_error(y_test, y_test_pred_ran, squared=False)
train_r2_lr_RFE = r2_score(y_train, y_train_pred_ran)
test_r2_lr_RFE = r2_score(y_test, y_test_pred_ran)

# Update the score dataframe
scores = update_scores(scores,'Linear Regressor RFE', train_r2_lr_RFE, test_r2_lr_RFE, train_rmse_lr_RFE, test_rmse_lr_RFE)
scores

### 2.2 Decision Tree Regressor
Here we also apply a randomized search over the hyperparameters.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# initialize the pipe
full_pipeline_dtR = make_pipeline(
    preprocessor, DecisionTreeRegressor())

# train a DecisionTree with randomized search cross validation
# parameter grid
param_grid_dtR = {
    'columntransformer__num__simpleimputer__strategy': ['mean', 'median'],
    'decisiontreeregressor__max_depth': range(3, 10),
    'decisiontreeregressor__min_samples_leaf': range(3, 10, 2),
    'decisiontreeregressor__min_samples_split': range(3, 39, 3),
    # 'decisiontreeregressor__max_leaf_nodes': range(1, 100),
    # 'decisiontreeregressor__max_features': range(1,79)
}

# define cross validation
search_dtR = RandomizedSearchCV(full_pipeline_dtR, param_grid_dtR,
                                n_iter=2, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_dtR.fit(X_train, y_train)
y_train_pred_dtR = search_dtR.predict(X_train)
y_test_pred_dtR = search_dtR.predict(X_test)

#metrics
train_rmse_dtr = mean_squared_error(y_train, y_train_pred_dtR, squared=False)
test_rmse_dtr = mean_squared_error(y_test, y_test_pred_dtR, squared=False)
train_r2_dtr = r2_score(y_train, y_train_pred_dtR)
test_r2_dtr = r2_score(y_test, y_test_pred_dtR)

# Update the score dataframe
scores = update_scores(scores,'Decision Tree', train_r2_dtr, test_r2_dtr, train_rmse_dtr, test_rmse_dtr)
scores
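
With `n_iter=2`, the randomized search above evaluates only two of the many possible parameter combinations. The sketch below illustrates this on synthetic data and shows how the sampled candidates and the selected one can be inspected. The data, the reduced grid and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustration only)
X_rs, y_rs = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Reduced grid: 7 * 4 = 28 possible combinations
param_grid_demo = {
    'max_depth': range(3, 10),
    'min_samples_leaf': range(3, 10, 2),
}

# n_iter controls how many combinations are sampled and cross-validated
search_demo = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0), param_grid_demo,
    n_iter=2, cv=3, scoring='r2', random_state=123)
search_demo.fit(X_rs, y_rs)

n_possible = len(param_grid_demo['max_depth']) * len(param_grid_demo['min_samples_leaf'])
n_tried = len(search_demo.cv_results_['params'])

print(f"tried {n_tried} of {n_possible} combinations")
print(search_demo.best_params_, round(search_demo.best_score_, 3))
```

A larger `n_iter` explores more of the grid at the cost of a longer run time; `best_score_` is the mean cross-validated score of the selected candidate, not the score on the test set.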

#### 2.2.1 DecisionTreeRegressor with Kbest

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression
KBest = SelectKBest(score_func=f_regression)

# initialize the pipe
full_pipeline_dtR = make_pipeline(
    preprocessor, KBest, DecisionTreeRegressor())

# parameter grid
param_grid_dtR = {
    'columntransformer__num__simpleimputer__strategy': ['mean', 'median'],
    'decisiontreeregressor__max_depth': range(3, 10),
    'decisiontreeregressor__min_samples_leaf': range(3, 10, 2),
    'decisiontreeregressor__min_samples_split': range(3, 39, 3),
    'selectkbest__k': range(20, 65)
    # 'decisiontreeregressor__max_leaf_nodes': range(1, 100),
    # 'decisiontreeregressor__max_features': range(1,79)
}

# define cross validation
search_dtR = RandomizedSearchCV(full_pipeline_dtR, param_grid_dtR,
                                n_iter=2, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_dtR.fit(X_train, y_train)
y_train_pred_dtR = search_dtR.predict(X_train)
y_test_pred_dtR = search_dtR.predict(X_test)

train_rmse_dtr_kb = mean_squared_error(y_train, y_train_pred_dtR, squared=False)
test_rmse_dtr_kb = mean_squared_error(y_test, y_test_pred_dtR, squared=False)
train_r2_dtr_kb = r2_score(y_train, y_train_pred_dtR)
test_r2_dtr_kb = r2_score(y_test, y_test_pred_dtR)

# Update the score dataframe
scores = update_scores(scores,'Decision Tree Kbest', train_r2_dtr_kb, test_r2_dtr_kb, train_rmse_dtr_kb, test_rmse_dtr_kb)
scores

#### 2.2.2 DecisionTreeRegressor with RFE

In [None]:
from sklearn.feature_selection import RFE

# initialize the pipe
full_pipeline_dtR = make_pipeline(
    preprocessor, RFE(estimator=DecisionTreeRegressor()))

# train a DecisionTree with randomized search cross validation
param_grid_dtR_RFE = {
    # 'rfe__feature':column_names,
    # 'rfe__importance':feature_importances,
    'columntransformer__num__simpleimputer__strategy': ['mean', 'median'],
    # DecisionTreeRegressor requires min_samples_leaf >= 1 and max_leaf_nodes >= 2
    'rfe__estimator__min_samples_leaf': range(1, 3),
    'rfe__estimator__max_leaf_nodes': range(2, 40),
    'rfe__estimator__max_features': range(1, 79)
}

# define cross validation
search_dtR = RandomizedSearchCV(full_pipeline_dtR, param_grid_dtR_RFE,
                                n_iter=10, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_dtR.fit(X_train, y_train)
y_train_pred_dtR = search_dtR.predict(X_train)
y_test_pred_dtR = search_dtR.predict(X_test)

# metric
train_rmse_dtr_RFE = mean_squared_error(y_train, y_train_pred_dtR, squared=False)
test_rmse_dtr_RFE = mean_squared_error(y_test, y_test_pred_dtR, squared=False)
train_r2_dtr_RFE = r2_score(y_train, y_train_pred_dtR)
test_r2_dtr_RFE = r2_score(y_test, y_test_pred_dtR)

# Update the score dataframe
scores = update_scores(scores,'Decision Tree RFE', train_r2_dtr_RFE, test_r2_dtr_RFE, train_rmse_dtr_RFE, test_rmse_dtr_RFE)
scores

### 2.3 KNeighborsRegressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=1)

# initialize the pipe
full_pipeline_neigh = make_pipeline(
    preprocessor, neigh)

# parameter grid
param_grid_neigh = {
    'columntransformer__num__simpleimputer__strategy': ['mean', 'median'],
}

# define cross validation
search_neigh = RandomizedSearchCV(full_pipeline_neigh, param_grid_neigh,
                                  n_iter=2, cv=7, scoring='neg_mean_absolute_percentage_error', random_state=123, verbose=0)


# fit
search_neigh.fit(X_train, y_train)
y_train_pred_neigh = search_neigh.predict(X_train)
y_test_pred_neigh = search_neigh.predict(X_test)

#metrics
train_rmse_kn = mean_squared_error(y_train, y_train_pred_neigh, squared=False)
test_rmse_kn = mean_squared_error(y_test, y_test_pred_neigh, squared=False)
train_r2_kn = r2_score(y_train, y_train_pred_neigh)
test_r2_kn = r2_score(y_test, y_test_pred_neigh)

# Update the score dataframe
scores = update_scores(scores,'Kneighbors', train_r2_kn, test_r2_kn, train_rmse_kn, test_rmse_kn)
scores

#### 2.3.1 Kneighbors with Kbest

In [None]:
# initialize the pipe
full_pipeline_neigh = make_pipeline(
    preprocessor, KBest, neigh)
full_pipeline_neigh

# parameter grid
param_grid_neigh = {
    'columntransformer__num__simpleimputer__strategy': ['mean', 'median'],
    'selectkbest__k': range(20, 65)
}

# define cross validation
search_neigh = RandomizedSearchCV(full_pipeline_neigh, param_grid_neigh,
                                  n_iter=2, cv=7, scoring='neg_mean_absolute_percentage_error', random_state=123, verbose=0)


# fit
search_neigh.fit(X_train, y_train)
y_train_pred_neigh = search_neigh.predict(X_train)
y_test_pred_neigh = search_neigh.predict(X_test)

#metrics
train_rmse_kn_kb = mean_squared_error(y_train, y_train_pred_neigh, squared=False)
test_rmse_kn_kb = mean_squared_error(y_test, y_test_pred_neigh, squared=False)
train_r2_kn_kb = r2_score(y_train, y_train_pred_neigh)
test_r2_kn_kb = r2_score(y_test, y_test_pred_neigh)


# Update the score dataframe
scores = update_scores(scores,'Kneighbors Kbest', train_r2_kn_kb, test_r2_kn_kb, train_rmse_kn_kb, test_rmse_kn_kb)
scores

### 2.4 RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor


# initialize the pipe
full_pipeline_ran = make_pipeline(
    preprocessor, StandardScaler(), RandomForestRegressor(random_state=123))

# parameter grid
param_grid_ran = {
    "columntransformer__num__simpleimputer__strategy": ["mean", "median"],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
    "randomforestregressor__n_estimators": [100, 200, 300],
    "randomforestregressor__max_depth": [None, 5, 11],
    "randomforestregressor__min_samples_split": [2, 5, 10],
}

# define cross validation
search_ran = RandomizedSearchCV(full_pipeline_ran, param_grid_ran,
                                n_iter=2, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_ran.fit(X_train, y_train)
y_train_pred_ran = search_ran.predict(X_train)
y_test_pred_ran = search_ran.predict(X_test)

#metrics
train_rmse_rf = mean_squared_error(y_train, y_train_pred_ran, squared=False)
test_rmse_rf = mean_squared_error(y_test, y_test_pred_ran, squared=False)
train_r2_rf = r2_score(y_train, y_train_pred_ran)
test_r2_rf = r2_score(y_test, y_test_pred_ran)


# Update the score dataframe
scores = update_scores(scores,'Random Forest', train_r2_rf, test_r2_rf, train_rmse_rf, test_rmse_rf)
scores

#### 2.4.1 Random Forest with Kbest

In [None]:
# initialize the pipe
full_pipeline_ran = make_pipeline(
    preprocessor, StandardScaler(), KBest, RandomForestRegressor(random_state=123))

# parameter grid
param_grid_ran = {
    "columntransformer__num__simpleimputer__strategy": ["mean", "median"],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
    "randomforestregressor__n_estimators": [100, 200, 300],
    "randomforestregressor__max_depth": [None, 5, 11],
    "randomforestregressor__min_samples_split": [2, 5, 10],
    'selectkbest__k': range(20, 65)
}

# define cross validation
search_ran = RandomizedSearchCV(full_pipeline_ran, param_grid_ran,
                                n_iter=2, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_ran.fit(X_train, y_train)
y_train_pred_ran = search_ran.predict(X_train)
y_test_pred_ran = search_ran.predict(X_test)

# metrics
train_rmse_rf_kb = mean_squared_error(y_train, y_train_pred_ran, squared=False)
test_rmse_rf_kb = mean_squared_error(y_test, y_test_pred_ran, squared=False)
train_r2_rf_kb = r2_score(y_train, y_train_pred_ran)
test_r2_rf_kb = r2_score(y_test, y_test_pred_ran)

# Update the score dataframe
scores = update_scores(scores,'Random Forest Kbest', train_r2_rf_kb, test_r2_rf_kb, train_rmse_rf_kb, test_rmse_rf_kb)
scores

#### 2.4.2 Random Forest with variance threshold

In [None]:
# from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)

# initialize the pipe
full_pipeline_ran = make_pipeline(
    preprocessor, StandardScaler(), KBest, selector, RandomForestRegressor(random_state=123))

# parameter grid
param_grid_ran = {
    "columntransformer__num__simpleimputer__strategy": ["mean", "median"],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
    "randomforestregressor__n_estimators": [100, 200, 300],
    "randomforestregressor__max_depth": [None, 5, 11],
    "randomforestregressor__min_samples_split": [2, 5, 10],
    'selectkbest__k': range(20, 65)
}

# define cross validation
search_ran = RandomizedSearchCV(full_pipeline_ran, param_grid_ran,
                                n_iter=2, cv=7, scoring='r2', random_state=123, verbose=0)


# fit
search_ran.fit(X_train, y_train)
y_train_pred_ran = search_ran.predict(X_train)
y_test_pred_ran = search_ran.predict(X_test)

#metrics
train_rmse_rf_vt = mean_squared_error(y_train, y_train_pred_ran, squared=False)
test_rmse_rf_vt = mean_squared_error(y_test, y_test_pred_ran, squared=False)
train_r2_rf_vt = r2_score(y_train, y_train_pred_ran)
test_r2_rf_vt = r2_score(y_test, y_test_pred_ran)

# Update the score dataframe
scores = update_scores(scores,'Random Forest Variance Threshold', train_r2_rf_vt, test_r2_rf_vt, train_rmse_rf_vt, test_rmse_rf_vt)
scores
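
In the pipeline above the variance threshold is applied after `StandardScaler`. The following sketch, on a hand-made toy matrix (an assumption for illustration), suggests what this ordering implies: once features are scaled to unit variance, a threshold of 0.01 only removes constant columns.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

X_toy = np.column_stack([
    np.arange(100, dtype=float),   # high variance
    np.tile([0.0, 0.02], 50),      # variance 1e-4, below the 0.01 threshold
    np.ones(100),                  # constant column, variance 0
])

# Threshold applied to the raw features
kept_raw = VarianceThreshold(threshold=0.01).fit(X_toy).get_support()

# Threshold applied after standardization (with_mean=True, with_std=True)
X_toy_scaled = StandardScaler().fit_transform(X_toy)
kept_scaled = VarianceThreshold(threshold=0.01).fit(X_toy_scaled).get_support()

print("kept on raw data:   ", kept_raw)
print("kept on scaled data:", kept_scaled)
```

On the raw data both the low-variance and the constant column are dropped, whereas after standardization only the constant column is dropped. Whenever the search picks `with_std=True`, the selector in the pipeline above is therefore likely to have little effect; placing it before the scaler would be one way to make it act on the original variances.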

## 3. Conclusion

In [None]:
scores