# Training & Evaluation  

## Importing Preprocessing Pipeline

In [27]:
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, make_pipeline
# import matplotlib.pyplot as plt
# import plotly.express as px
# from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
# from sklearn.impute import SimpleImputer
# from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, FunctionTransformer
# from sklearn.metrics.pairwise import rbf_kernel

# from sklearn.compose import ColumnTransformer
# from scipy.signal import find_peaks
# from scipy.stats import gaussian_kde
# from sklearn.base import BaseEstimator, TransformerMixin
# from sklearn.utils.validation import check_array, check_is_fitted
# from sklearn.ensemble import IsolationForest
# from sklearn.cluster import KMeans


In [19]:
preprocessing_pipeline = None  # Define as a placeholder

%run "02_transform_data.ipynb"

In [20]:
preprocessing_pipeline

## Load Training Data

In [21]:
processed_data_path = Path("..", "data", "processed", "housing")

In [22]:
## read data
data = pd.read_csv(Path(processed_data_path, "train_set.csv"))

## Split Features & Labels

In [23]:
## before we create the pipeline, let's split the training data into features and labels
df_features = data.drop("median_house_value", axis=1)
df_labels = data["median_house_value"].copy()

## Preprocessing Data

In [24]:
preprocessed_data = pd.DataFrame(preprocessing_pipeline.fit_transform(df_features), columns=preprocessing_pipeline.get_feature_names_out())

In [25]:
preprocessed_data.shape

(16512, 27)

In [26]:
## sanity check to see if we have empty values
preprocessed_data.isna().sum()

bedrooms_per_room__ratio                        0
rooms_per_household__ratio                      0
population_per_household__ratio                 0
multimodal_similarity__similarity_to_peak_17    0
multimodal_similarity__similarity_to_peak_26    0
multimodal_similarity__similarity_to_peak_35    0
multimodal_similarity__similarity_to_peak_52    0
cluster_similarity__similarity_to_cluster_0     0
cluster_similarity__similarity_to_cluster_1     0
cluster_similarity__similarity_to_cluster_2     0
cluster_similarity__similarity_to_cluster_3     0
cluster_similarity__similarity_to_cluster_4     0
cluster_similarity__similarity_to_cluster_5     0
cluster_similarity__similarity_to_cluster_6     0
cluster_similarity__similarity_to_cluster_7     0
cluster_similarity__similarity_to_cluster_8     0
cluster_similarity__similarity_to_cluster_9     0
log_pipeline__total_bedrooms                    0
log_pipeline__total_rooms                       0
log_pipeline__population                        0


## Training Linear Regression

### Training

In [28]:
from sklearn.linear_model import LinearRegression

linear_reg = make_pipeline(preprocessing_pipeline, LinearRegression())
linear_reg.fit(df_features, df_labels)

### Predictions

In [30]:
predictions = linear_reg.predict(df_features)
predictions[:5].round(2)

array([263335.29, 373549.57, 112870.95,  92243.99, 316741.99])

### Evaluations

#### Calculating Error Ratios

In [33]:
## let's compute error ratios
error_ratios = predictions[:5].round(-2) / df_labels.iloc[:5].values - 1
print(", ".join([f"{100 * ratio:.1f}%" for ratio in error_ratios]))

-42.5%, -22.8%, 11.0%, -4.1%, -12.5%


#### Calculating RMSE

In [34]:
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(df_labels, predictions)
rmse

66550.34931224315

Interpretation:
* The model's predictions are typically off by roughly `66,550` from the actual value.
* Per the project requirements, this output is fed to another ML model that decides whether we should invest in an area, so an error of about `66K` may be too large to serve as a useful indicator.
* Because this error was measured on the training data itself, it indicates that the model is underfitting.
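
RMSE is not quite a plain average of the errors, because squaring weights large misses more heavily. A minimal sketch in plain Python, contrasting it with the mean absolute error (MAE); the four value pairs are made up for illustration and are not taken from the housing data:

```python
import math

# Hypothetical actual and predicted values (illustrative only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 300.0, 480.0]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE: plain average of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error.
rmse_toy = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse_toy:.2f}")
```

The single large miss (80) pulls the RMSE well above the MAE, which is why a few badly mispriced districts can dominate this metric.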


#### Calculating Relative RMSE

In [36]:
mean_price = df_labels.mean()
rmse_relative = rmse / mean_price
print(f"Relative RMSE: {rmse_relative:.2%}")


Relative RMSE: 32.25%


Interpretation:
* Per the project requirements, the current process is costly and time-consuming, and its estimates are off by more than 30%. With a relative RMSE above 30%, our model does no better than the current process.
* We need to try other models, but first let's run cross-validation to confirm these findings.

#### Cross Validation

In [39]:
from sklearn.model_selection import cross_val_score

scores = -cross_val_score(linear_reg, df_features, df_labels, scoring="neg_root_mean_squared_error", cv=10)
scores

array([66975.41917719, 66419.43069432, 64127.57598067, 75625.56219204,
       66741.20322301, 66746.47979067, 65562.44383051, 68892.13827255,
       66147.29254267, 67764.54064592])

Interpretation:
* Cross-validation confirms the finding: the RMSE is similar across all 10 folds and close to the training RMSE, so the model is underfitting.
* This means either the model is too simple or the features are not informative enough. We need to try different models.
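
To put a number on "similar across folds", the fold scores can be summarized by their mean and spread. The sketch below uses only the standard library and the values printed above; in the notebook itself, `scores.mean()` and `scores.std()` would give the same summary.

```python
import statistics

# Fold RMSE values copied from the cross-validation output above.
linear_fold_rmse = [
    66975.41917719, 66419.43069432, 64127.57598067, 75625.56219204,
    66741.20322301, 66746.47979067, 65562.44383051, 68892.13827255,
    66147.29254267, 67764.54064592,
]

fold_mean = statistics.fmean(linear_fold_rmse)
# pstdev is the population standard deviation, matching NumPy's default std().
fold_std = statistics.pstdev(linear_fold_rmse)

print(f"mean: {fold_mean:.0f}, std: {fold_std:.0f}")
```

A mean close to the training RMSE with a comparatively small spread is what underfitting tends to look like: the model is about equally wrong on data it has and has not seen.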

## Random Forest

### Training

In [40]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing_pipeline, RandomForestRegressor(n_estimators=100, random_state=42))
forest_reg.fit(df_features, df_labels)

### Predictions

In [41]:
predictions = forest_reg.predict(df_features)

In [42]:
predictions[:5].round(2)

array([433104.15, 478151.22, 106536.  , 101683.  , 370400.12])

### Evaluations

#### Calculating RMSE

In [43]:
rmse = root_mean_squared_error(df_labels, predictions)
rmse

17557.540624138303

Interpretations:
* The RMSE is significantly lower than that of `Linear Regression`.
* One possible reason is overfitting, which we can check using cross-validation.
* If we are not overfitting, we can use `GridSearchCV` to find the best hyperparameters.

#### Calculating Relative RMSE

In [45]:
## calculate relative rmse
rmse_relative = rmse / mean_price
print(f"Relative RMSE: {rmse_relative:.2%}")

Relative RMSE: 8.51%


Interpretation:
* This model appears to perform much better than both the previous model and the manual process.
* If it is not overfitting, it could be a candidate for production.

#### Cross Validation

In [46]:
## cross validation
scores = -cross_val_score(forest_reg, df_features, df_labels, scoring="neg_root_mean_squared_error", cv=10)
scores

array([46518.21599266, 47657.09029417, 46009.61776303, 47236.10087738,
       46343.77184697, 47318.70351497, 47534.58520817, 49669.7806786 ,
       47584.23865129, 46877.0007688 ])

In [47]:
scores.mean(), scores.std()

(47274.91055960402, 961.6083961071712)

Interpretations:
* The cross-validated RMSE (mean of about `47,275`, standard deviation of about `962`) is better than `LinearRegression` but far worse than the `17,558` measured on the whole training dataset.
* This gap means the model overfits when it is trained and evaluated on the complete dataset.
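
Why the training-set RMSE is so optimistic can be illustrated with an extreme stand-in for a flexible model: one that simply memorizes its training points. This is a sketch on synthetic data in plain Python, not a random forest, and every name in it is hypothetical.

```python
import math
import random

rng = random.Random(0)

def make_data(n):
    # Noisy linear relationship: y = 3x + noise (synthetic, illustrative only).
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3 * x + rng.gauss(0, 2)) for x in xs]

train = make_data(200)
held_out = make_data(200)

def predict(x):
    # "Memorizing" model: return the label of the nearest training point.
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def rmse_of(data):
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in data) / len(data))

train_rmse = rmse_of(train)
held_out_rmse = rmse_of(held_out)
print(f"train RMSE: {train_rmse:.2f}, held-out RMSE: {held_out_rmse:.2f}")
```

The memorizer scores a perfect zero on its own training points and a clearly non-zero error on unseen points. Cross-validation exposes the same kind of gap for the forest, only less extreme.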

## Support Vector Regressor

### Training

In [51]:
## using support vector regression
from sklearn.svm import SVR

svm_reg = make_pipeline(preprocessing_pipeline, SVR(kernel="rbf"))
svm_reg.fit(df_features, df_labels)

### Predictions

In [52]:
predictions = svm_reg.predict(df_features)
predictions[:5].round(2)

array([179127.79, 179704.09, 178648.92, 178747.33, 179149.84])

Interpretations:
* The first five predictions are all close to `179K`, which is worse than both previous models. The default hyperparameters may not suit this data.
* We may be underfitting. Let's calculate the evaluation metrics and then run cross-validation to be sure.
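
One plausible explanation for the near-constant predictions, offered as an assumption rather than something verified in this notebook, is that `SVR`'s default `C=1.0` is tiny relative to targets measured in hundreds of thousands, so the model can barely move away from a constant. A minimal sketch on synthetic data compares the default `SVR` with the same estimator wrapped in scikit-learn's `TransformedTargetRegressor`, which standardizes the target before fitting and maps predictions back afterwards:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with targets on a house-price-like scale (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(300, 1))
y = 200_000 + 100_000 * X[:, 0] + rng.normal(0, 5_000, size=300)

def rmse_of(model):
    # Training-set RMSE, computed by hand to avoid version-specific helpers.
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

# Default SVR on raw targets: C=1.0 is tiny relative to the target scale.
raw_svr = SVR(kernel="rbf").fit(X, y)

# Same SVR, but fitted on standardized targets and mapped back on predict.
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf"), transformer=StandardScaler()
).fit(X, y)

rmse_raw = rmse_of(raw_svr)
rmse_scaled = rmse_of(scaled_svr)
print(f"raw targets: {rmse_raw:.0f}, scaled targets: {rmse_scaled:.0f}")
```

If the same effect holds for the housing data, wrapping the `SVR` step this way or searching over much larger values of `C` would be worth trying before ruling the model out.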

### Evaluations

#### Calculating RMSE

In [55]:
rmse = root_mean_squared_error(df_labels, predictions)
rmse

118239.52161044904

#### Calculating Relative RMSE

In [56]:
## relative rmse
rmse_relative = rmse / mean_price
print(f"Relative RMSE: {rmse_relative:.2%}")

Relative RMSE: 57.31%


#### Cross Validation

In [57]:
## cross validation
scores = -cross_val_score(svm_reg, df_features, df_labels, scoring="neg_root_mean_squared_error", cv=10)
scores

array([120152.86799787, 121572.35148253, 115886.31068152, 117475.34228705,
       117464.73330846, 119809.06945733, 118817.52228608, 115841.63837712,
       115006.83130508, 120468.94123383])

Interpretation:
* Cross-validation gives a similar result. Either this is not the right model for our data, or it needs hyperparameter tuning.
* For now, let's focus on tuning the Random Forest hyperparameters and see if we can find the right parameters.

## Hyper Parameter Tuning

### Random Forest

In [58]:
## hyper parameter tuning for random forest
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"randomforestregressor__n_estimators": [3, 10, 30], "randomforestregressor__max_features": [2, 4, 6, 8]},
    {"randomforestregressor__bootstrap": [False], "randomforestregressor__n_estimators": [3, 10], "randomforestregressor__max_features": [2, 3, 4]},
]

grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring="neg_mean_squared_error", return_train_score=True)
grid_search.fit(df_features, df_labels)

In [59]:
## best hyper parameters
grid_search.best_params_

{'randomforestregressor__max_features': 8,
 'randomforestregressor__n_estimators': 30}

In [60]:
## best estimator
grid_search.best_estimator_

In [61]:
## evaluation scores
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

60108.92770006013 {'randomforestregressor__max_features': 2, 'randomforestregressor__n_estimators': 3}
51408.50687809988 {'randomforestregressor__max_features': 2, 'randomforestregressor__n_estimators': 10}
48514.3729826983 {'randomforestregressor__max_features': 2, 'randomforestregressor__n_estimators': 30}
54813.44490184968 {'randomforestregressor__max_features': 4, 'randomforestregressor__n_estimators': 3}
48032.21556572403 {'randomforestregressor__max_features': 4, 'randomforestregressor__n_estimators': 10}
45881.672436587025 {'randomforestregressor__max_features': 4, 'randomforestregressor__n_estimators': 30}
55295.479182179464 {'randomforestregressor__max_features': 6, 'randomforestregressor__n_estimators': 3}
48392.05855041651 {'randomforestregressor__max_features': 6, 'randomforestregressor__n_estimators': 10}
45799.36879204769 {'randomforestregressor__max_features': 6, 'randomforestregressor__n_estimators': 30}
53989.633153921844 {'randomforestregressor__max_features': 8, 'ran

Interpretation:
* The best mean score is similar to the cross-validated RMSE we saw earlier.
* Let's try a randomized search to see if we get better parameters.
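
For a sense of what the grid search above costs, the candidates it evaluates can be enumerated by hand. This sketch uses plain Python and drops the `randomforestregressor__` prefix for brevity; `grid_spec` and `expand` are hypothetical names.

```python
from itertools import product

# Same grid as the GridSearchCV cell above, with the step prefix dropped.
grid_spec = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

def expand(grid):
    # Cartesian product of the value lists in one sub-grid.
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = [combo for grid in grid_spec for combo in expand(grid)]
cv_folds = 5
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

That is 12 + 6 = 18 candidates and 90 fits, plus one final refit of the best candidate on the full data under the default `refit=True`. A randomized search with `n_iter=10` and `cv=5` runs 50 fits while sampling from a much wider range of values.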

In [62]:
## randomized search for hyper parameter tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    "randomforestregressor__n_estimators": randint(low=1, high=200),
    "randomforestregressor__max_features": randint(low=1, high=8),
}

rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs, n_iter=10, cv=5, scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(df_features, df_labels)

In [63]:
rnd_search.best_params_

{'randomforestregressor__max_features': 7,
 'randomforestregressor__n_estimators': 180}

In [64]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

44627.84843696036 {'randomforestregressor__max_features': 7, 'randomforestregressor__n_estimators': 180}
46709.890708350824 {'randomforestregressor__max_features': 5, 'randomforestregressor__n_estimators': 15}
45910.55151756196 {'randomforestregressor__max_features': 3, 'randomforestregressor__n_estimators': 72}
46129.25479877836 {'randomforestregressor__max_features': 5, 'randomforestregressor__n_estimators': 21}
44739.17611083517 {'randomforestregressor__max_features': 7, 'randomforestregressor__n_estimators': 122}
45903.36033041786 {'randomforestregressor__max_features': 3, 'randomforestregressor__n_estimators': 75}
45788.41498579971 {'randomforestregressor__max_features': 3, 'randomforestregressor__n_estimators': 88}
44845.408438489096 {'randomforestregressor__max_features': 5, 'randomforestregressor__n_estimators': 100}
45608.936223289784 {'randomforestregressor__max_features': 3, 'randomforestregressor__n_estimators': 150}
59179.36222453917 {'randomforestregressor__max_features':

Interpretation:
* Interesting: this is slightly better than the grid search result, though not by much. Let's train the model with these parameters and see how it does on the complete training data.

In [65]:
## training the model with the best hyper parameters
model = make_pipeline(preprocessing_pipeline, RandomForestRegressor(n_estimators=180, max_features=7, random_state=42))
model.fit(df_features, df_labels)

In [66]:
predictions = model.predict(df_features)
predictions[:5].round(2)

array([426561.79, 465225.16, 107758.89, 102087.78, 351040.09])

In [67]:
rmse = root_mean_squared_error(df_labels, predictions)
rmse

16322.966963062598

In [68]:
## relative rmse
rmse_relative = rmse / mean_price
print(f"Relative RMSE: {rmse_relative:.2%}")

Relative RMSE: 7.91%


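Interpretation:
* There is still a large gap between the cross-validated RMSE from the randomized search (about `44.6K`) and the RMSE on the full training set (about `16.3K`), so the model is still overfitting. Uninformative features may be part of the cause.
* Let's look at the feature importances.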
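
The size of that gap can be expressed as a ratio of cross-validated RMSE to training RMSE. A small sketch using only figures reported above; note that the default forest was scored with 10 folds and the tuned one with 5, so the comparison is approximate.

```python
# RMSE values copied from the outputs above.
results = {
    "default forest (n_estimators=100)": {"train": 17557.54, "cv": 47274.91},
    "tuned forest (n_estimators=180, max_features=7)": {"train": 16322.97, "cv": 44627.85},
}

gap_ratios = {}
for name, r in results.items():
    # A ratio near 1 would mean the training RMSE generalizes; larger suggests overfitting.
    gap_ratios[name] = r["cv"] / r["train"]
    print(f"{name}: CV RMSE is {gap_ratios[name]:.2f}x the training RMSE")
```

Both ratios come out at roughly 2.7, so tuning lowered the error slightly without narrowing the gap between training and cross-validated performance.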

In [69]:
## important features from random search
rnd_search.best_estimator_.steps[1][1].feature_importances_

array([7.56230974e-02, 5.63139824e-02, 8.42227306e-02, 1.05477719e-02,
       7.32759862e-03, 9.03700023e-03, 6.67301379e-03, 3.61540657e-02,
       1.82883326e-02, 3.67817585e-02, 2.85737222e-02, 5.44760116e-02,
       2.88381968e-02, 1.44158147e-02, 2.69205447e-02, 1.81548283e-02,
       3.13239016e-02, 1.10194725e-02, 1.27491002e-02, 1.25154308e-02,
       1.08886749e-02, 2.65216595e-01, 8.97137820e-03, 1.28635626e-01,
       1.64412514e-04, 1.59793647e-03, 4.56900232e-03])

In [74]:
## mapping important feature names to their importance scores
feature_importances = rnd_search.best_estimator_.steps[1][1].feature_importances_
pd.DataFrame({"feature": preprocessed_data.columns, "importance": feature_importances}).sort_values("importance", ascending=False)


| | feature | importance |
|---|---|---|
| 21 | log_pipeline__median_income | 0.265217 |
| 23 | categorical__ocean_proximity_INLAND | 0.128636 |
| 2 | population_per_household__ratio | 0.084223 |
| 0 | bedrooms_per_room__ratio | 0.075623 |
| 1 | rooms_per_household__ratio | 0.056314 |
| 11 | cluster_similarity__similarity_to_cluster_4 | 0.054476 |
| 9 | cluster_similarity__similarity_to_cluster_2 | 0.036782 |
| 7 | cluster_similarity__similarity_to_cluster_0 | 0.036154 |
| 16 | cluster_similarity__similarity_to_cluster_9 | 0.031324 |
| 12 | cluster_similarity__similarity_to_cluster_5 | 0.028838 |


Interpretation:
* Interestingly, many features have low importance. We could drop them and possibly improve model performance.
* Let's create a list of features to remove and see whether the model improves.
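
One simple way to build such a list is an importance cutoff. The sketch below applies a hypothetical threshold of `0.05` to the ten rows shown above; the cutoff is an assumption chosen for illustration, and the remaining 17 features would need the same treatment.

```python
# Importances copied from the table above (top ten rows only).
importances = {
    "log_pipeline__median_income": 0.265217,
    "categorical__ocean_proximity_INLAND": 0.128636,
    "population_per_household__ratio": 0.084223,
    "bedrooms_per_room__ratio": 0.075623,
    "rooms_per_household__ratio": 0.056314,
    "cluster_similarity__similarity_to_cluster_4": 0.054476,
    "cluster_similarity__similarity_to_cluster_2": 0.036782,
    "cluster_similarity__similarity_to_cluster_0": 0.036154,
    "cluster_similarity__similarity_to_cluster_9": 0.031324,
    "cluster_similarity__similarity_to_cluster_5": 0.028838,
}

# Hypothetical cutoff, chosen only for illustration; it would need tuning.
threshold = 0.05

keep = [name for name, score in importances.items() if score >= threshold]
drop = [name for name, score in importances.items() if score < threshold]

print("keep:", keep)
print("drop:", drop)
```

Any cutoff like this should be validated with cross-validation, since impurity-based importances can understate features that are correlated with one another, as the cluster similarity columns may be.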