![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book and feel free to ask in the messenger group as you go along.**

## Good Luck : )

## First, run the following cells from the first part of the project to continue your work

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from joblib import dump

In [8]:
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [13]:
# CODE HERE
# Initializing a Linear Regression model
lin_reg = LinearRegression()
# Fitting the model to the prepared housing data
lin_reg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [14]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [16]:
# CODE HERE
# Transforming the new instances using the full pipeline
some_data_prepared = full_pipeline.transform(some_data)
# Using the trained linear regression model to make predictions
predictions = lin_reg.predict(some_data_prepared)
# Printing the predictions
print("Predictions:", predictions)
# Printing the actual labels
print("Actual Labels:", list(some_labels))

Predictions: [181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]
Actual Labels: [103000.0, 382100.0, 172600.0, 93400.0, 96500.0]
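
To make the gap between predictions and labels easier to read, one option is to express each error as a percentage of the true value. The sketch below only reuses the five numbers printed above (rounded to two decimals); the variable names are illustrative.

```python
import numpy as np

# Values copied (rounded) from the output above
predictions = np.array([181746.54, 290558.75, 244957.50, 146498.51, 163230.42])
actual = np.array([103000.0, 382100.0, 172600.0, 93400.0, 96500.0])

# Signed relative error, as a percentage of the true value
pct_error = 100 * (predictions - actual) / actual
for pred, label, err in zip(predictions, actual, pct_error):
    print(f"predicted {pred:>10.0f}  actual {label:>10.0f}  error {err:+.1f}%")
```

A positive percentage means the model overestimated the district's median house value, a negative one means it underestimated it.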


# Measure this regression model’s RMSE on the whole training set 
* using Scikit-Learn’s mean_squared_error() function:

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
# CODE HERE
# Making predictions on the whole training set
housing_predictions = lin_reg.predict(housing_prepared)
# Calculating RMSE
mse = mean_squared_error(housing_labels, housing_predictions)
rmse = np.sqrt(mse)
print("RMSE on the whole training set:", rmse)

RMSE on the whole training set: 67593.20745775253


# Judge the RMSE result for this model 
Write down your answer 

An RMSE (Root Mean Squared Error) of approximately 67,593 on the whole training set means that the model's predictions are typically off by about $67,593 from the actual median house values in the training set. RMSE is not a plain average of the errors: the errors are squared before averaging, so large errors weigh more heavily. An error of this size, measured on the very data the model was trained on, suggests that the model is underfitting, so a more powerful model, better features, or further investigation may be needed to improve its performance.
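
As a small illustration of how RMSE differs from a plain average of the errors, the sketch below compares it with the mean absolute error (MAE) on a few made-up error values. The numbers are hypothetical and chosen only so that one error is much larger than the rest.

```python
import numpy as np

# Hypothetical errors (prediction - label), chosen only for illustration
errors = np.array([10.0, -10.0, 10.0, -50.0])

mae = np.mean(np.abs(errors))          # plain average of absolute errors
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then take the root

print("MAE: ", mae)
print("RMSE:", rmse)
```

Because of the squaring step, the single large error pulls the RMSE above the MAE.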

# Let’s train a Decision Tree Regressor model 
## more powerful model

In [20]:
from sklearn.tree import DecisionTreeRegressor 

In [21]:
# CODE HERE
# Initializing Decision Tree Regressor model
tree_reg = DecisionTreeRegressor(random_state=42)
# Fitting the model to the prepared housing data
tree_reg.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [22]:
# CODE HERE
# Making predictions on the training set
housing_predictions_tree = tree_reg.predict(housing_prepared)

# Calculating RMSE
mse_tree = mean_squared_error(housing_labels, housing_predictions_tree)
rmse_tree = np.sqrt(mse_tree)
print("RMSE on the whole training set (Decision Tree Regressor):", rmse_tree)

RMSE on the whole training set (Decision Tree Regressor): 0.0


# Explain this result 
Write down your answer

An RMSE of 0.0 on the training set does not mean the model is perfect. It is far more likely that the model has badly overfit the data:

1) Overfitting: Decision Trees are prone to overfitting, especially if they are allowed to grow without any constraints. The tree can keep splitting until every training instance is predicted exactly, so it memorizes the training data, including its noise and outliers. That gives perfect performance on the training set but usually poor generalization to unseen data.
2) Hyperparameters: The default hyperparameters of the DecisionTreeRegressor place no limit on how far the tree can grow. Setting constraints on the maximum depth of the tree, minimum samples per leaf, or maximum number of leaf nodes could help in preventing overfitting.
3) Data Characteristics: An unusually simple dataset could also be fit perfectly, but that is unlikely here, since the Linear Regression model had a large error on the same data.
4) Preprocessing: A preprocessing mistake, such as leaving the label among the features, could also produce a misleading perfect score. Here the label column was dropped before the pipeline was fitted, so this is not the likely cause.
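
The memorization effect can be reproduced on data where there is nothing to learn at all. The sketch below is an illustration on synthetic noise, not on the housing data: an unconstrained tree still reaches a training RMSE of zero, while a depth-limited tree does not. The dataset, the `max_depth=3` value, and the `train_rmse` dictionary are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))   # synthetic features
y = rng.normal(size=200)         # pure noise: there is no real pattern

train_rmse = {}
for depth in (None, 3):          # None = grow without any depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X, y)
    train_rmse[depth] = np.sqrt(mean_squared_error(y, tree.predict(X)))
    print(f"max_depth={depth}: training RMSE = {train_rmse[depth]:.4f}")
```

A zero training error on pure noise shows that the score reflects memorization rather than a learned relationship.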

# Evaluation Using Cross-Validation

1- Split the training set into 10 distinct subsets, then train and evaluate the Decision Tree model on them

In [23]:
from sklearn.model_selection import cross_val_score

In [24]:
# CODE HERE
# Defining the Decision Tree Regressor model
tree_reg = DecisionTreeRegressor(random_state=42)

# Performing 10-fold cross-validation
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)

# Calculating RMSE scores
rmse_scores = np.sqrt(-scores)

# Displaying the RMSE scores for each fold
print("RMSE scores for each fold:")
for i, rmse in enumerate(rmse_scores):
    print(f"Fold {i+1}: {rmse}")

# Calculating and displaying the mean and standard deviation of RMSE scores
print("\nMean RMSE:", rmse_scores.mean())
print("Standard deviation of RMSE:", rmse_scores.std())

RMSE scores for each fold:
Fold 1: 65312.860440308985
Fold 2: 70581.6986567638
Fold 3: 67849.75809964613
Fold 4: 71460.33789358212
Fold 5: 74035.29744573774
Fold 6: 65562.42978503302
Fold 7: 67964.10942543222
Fold 8: 69102.89388457107
Fold 9: 66876.66473025372
Fold 10: 69735.84760006213

Mean RMSE: 68848.18979613911
Standard deviation of RMSE: 2579.6785558576307


2- Display the resulting scores and calculate their mean and standard deviation

In [25]:
# CODE HERE
# Displaying the RMSE scores for each fold
print("RMSE scores for each fold:")
for i, rmse in enumerate(rmse_scores):
    print(f"Fold {i+1}: {rmse}")
# Calculating and displaying the mean and standard deviation of RMSE scores
mean_rmse = rmse_scores.mean()
std_rmse = rmse_scores.std()
print("\nMean RMSE:", mean_rmse)
print("Standard deviation of RMSE:", std_rmse)

RMSE scores for each fold:
Fold 1: 65312.860440308985
Fold 2: 70581.6986567638
Fold 3: 67849.75809964613
Fold 4: 71460.33789358212
Fold 5: 74035.29744573774
Fold 6: 65562.42978503302
Fold 7: 67964.10942543222
Fold 8: 69102.89388457107
Fold 9: 66876.66473025372
Fold 10: 69735.84760006213

Mean RMSE: 68848.18979613911
Standard deviation of RMSE: 2579.6785558576307
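
For intuition, K-fold cross-validation can be written out by hand. The sketch below is a simplified stand-in for what `cross_val_score` does, using NumPy only: the data is synthetic, the model is ordinary least squares via `np.linalg.lstsq`, and the indices are shuffled before splitting, which is a choice made for this example. It computes the RMSE per fold directly, whereas `cross_val_score` with `scoring="neg_mean_squared_error"` returns negated MSE values, because Scikit-Learn scoring expects "greater is better".

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                     # synthetic features
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=100)   # synthetic target

k = 10
folds = np.array_split(rng.permutation(len(X)), k)   # k disjoint index sets

rmse_per_fold = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit ordinary least squares on the other k-1 folds
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # Score on the held-out fold only
    residuals = y[val_idx] - X[val_idx] @ coef
    rmse_per_fold.append(np.sqrt(np.mean(residuals ** 2)))

rmse_per_fold = np.array(rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
print("Standard deviation of RMSE:", rmse_per_fold.std())
```

Each instance is used for validation exactly once, which is why the mean over folds is a more trustworthy estimate than a score on the training set.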


3- Repeat the same steps to compute the same scores for the Linear Regression model 

*notice the difference between the results of the two models*

In [26]:
# CODE HERE
lin_reg = LinearRegression()

# Performing 10-fold cross-validation
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)

# Calculating RMSE scores
rmse_lin_scores = np.sqrt(-lin_scores)

# Displaying the RMSE scores for each fold
print("RMSE scores for each fold (Linear Regression):")
for i, rmse in enumerate(rmse_lin_scores):
    print(f"Fold {i+1}: {rmse}")

# Calculating and displaying the mean and standard deviation of RMSE scores
mean_rmse_lin = rmse_lin_scores.mean()
std_rmse_lin = rmse_lin_scores.std()
print("\nMean RMSE (Linear Regression):", mean_rmse_lin)
print("Standard deviation of RMSE (Linear Regression):", std_rmse_lin)

RMSE scores for each fold (Linear Regression):
Fold 1: 65000.67382615272
Fold 2: 70960.56056304109
Fold 3: 67122.6393512386
Fold 4: 66089.6315386527
Fold 5: 68402.54686442243
Fold 6: 65266.34735287604
Fold 7: 65218.78174480775
Fold 8: 68525.46981753556
Fold 9: 72739.87555995534
Fold 10: 68957.34111905852

Mean RMSE (Linear Regression): 67828.38677377408
Standard deviation of RMSE (Linear Regression): 2468.0913950652257


## Let’s train one last model: the RandomForestRegressor.

In [29]:
# CODE HERE
# Defining the RandomForestRegressor model
forest_reg = RandomForestRegressor(random_state=42)

# Performing 10-fold cross-validation
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                 scoring="neg_mean_squared_error", cv=10)

# Calculating RMSE scores
rmse_forest_scores = np.sqrt(-forest_scores)

# Displaying the RMSE scores for each fold
print("RMSE scores for each fold (Random Forest Regressor):")
for i, rmse in enumerate(rmse_forest_scores):
    print(f"Fold {i+1}: {rmse}")

# Calculating and displaying the mean and standard deviation of RMSE scores
mean_rmse_forest = rmse_forest_scores.mean()
std_rmse_forest = rmse_forest_scores.std()
print("\nMean RMSE (Random Forest Regressor):", mean_rmse_forest)
print("Standard deviation of RMSE (Random Forest Regressor):", std_rmse_forest)

RMSE scores for each fold (Random Forest Regressor):
Fold 1: 47341.96931396659
Fold 2: 51653.53070248334
Fold 3: 49360.291488834606
Fold 4: 51625.62777032113
Fold 5: 52771.91063892273
Fold 6: 46989.97118038409
Fold 7: 47333.72603397942
Fold 8: 50636.24303693077
Fold 9: 48951.73251683118
Fold 10: 50183.60590465184

Mean RMSE (Random Forest Regressor): 49684.86085873057
Standard deviation of RMSE (Random Forest Regressor): 1929.9797084102233


# Repeat the same steps to compute the same scores, with their mean and standard deviation, for the Random Forest model

In [None]:
# CODE HERE
# Defining the RandomForestRegressor model
forest_reg = RandomForestRegressor(random_state=42)

# Performing 10-fold cross-validation
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)

# Calculating RMSE scores
rmse_forest_scores = np.sqrt(-forest_scores)

# Displaying the RMSE scores for each fold
print("RMSE scores for each fold (Random Forest Regressor):")
for i, rmse in enumerate(rmse_forest_scores):
    print(f"Fold {i+1}: {rmse}")

# Calculating and displaying the mean and standard deviation of RMSE scores
mean_rmse_forest = rmse_forest_scores.mean()
std_rmse_forest = rmse_forest_scores.std()
print("\nMean RMSE (Random Forest Regressor):", mean_rmse_forest)
print("Standard deviation of RMSE (Random Forest Regressor):", std_rmse_forest)

# Save every model you experiment with 
*using the joblib library*

In [None]:
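# CODE HERE
# cross_val_score fits clones of each estimator, so the objects above are
# still unfitted; fit them on the full training set before saving
tree_reg.fit(housing_prepared, housing_labels)
lin_reg.fit(housing_prepared, housing_labels)
forest_reg.fit(housing_prepared, housing_labels)

# Saving the Decision Tree Regressor model
dump(tree_reg, 'decision_tree_model.joblib')

# Saving the Linear Regression model
dump(lin_reg, 'linear_regression_model.joblib')

# Saving the Random Forest Regressor model
dump(forest_reg, 'random_forest_model.joblib')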
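
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows a `dump`/`load` round trip on a tiny synthetic dataset; the file name `demo_model.joblib` and the temporary directory are assumptions made for this example, not part of the exercise.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset (y = 2x + 1), for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(model, path)       # serialize the fitted model to disk
    restored = load(path)   # read it back into a new object

# The restored model should reproduce the original predictions
print(np.allclose(model.predict(X), restored.predict(X)))
```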

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# CODE HERE
# Defining the parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required at each leaf node
    'bootstrap': [True, False]  # Whether bootstrap samples are used when building trees
}

# Defining the RandomForestRegressor model
forest_reg = RandomForestRegressor(random_state=42)

# Performing grid search with cross-validation
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

# Fitting the grid search to the data
grid_search.fit(housing_prepared, housing_labels)

# Getting the best hyperparameters found
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Getting the best estimator (model)
best_model = grid_search.best_estimator_

# Evaluating the best model
cv_results = grid_search.cv_results_
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(np.sqrt(-mean_score), params)

Display the evaluation scores for each combination of hyperparameters

In [None]:
# CODE HERE
# Getting the evaluation scores for all combinations of hyperparameters
cv_results = grid_search.cv_results_

# Printing the evaluation scores for each combination of hyperparameters
for mean_score, std_score, params in zip(cv_results["mean_test_score"], cv_results["std_test_score"], cv_results["params"]):
    # std_test_score is the spread of the negative MSE across folds, so it is in squared units
    print(f"Mean RMSE: {np.sqrt(-mean_score):.2f} (Std of MSE: {std_score:.2f}) for {params}")
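
To see why this grid search may take a long time, it helps to count the model fits it implies. The sketch below repeats the grid used above and multiplies the number of combinations by the number of folds; the extra fit at the end assumes the `GridSearchCV` default `refit=True`, which retrains the best model on the whole training set.

```python
from itertools import product

# Same grid and fold count as in the grid search above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
cv = 5

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*param_grid.values())))

# One fit per combination per fold, plus one final refit of the best model
n_fits = n_combinations * cv + 1

print("Combinations:", n_combinations)
print("Total fits:", n_fits)
```

Trimming one value from a single hyperparameter list shrinks the total multiplicatively, which is often the quickest way to make a search affordable.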


# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [None]:
# CODE HERE
# Get the feature importances
feature_importances = best_model.feature_importances_

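# Get the list of feature names, in the order the pipeline outputs them:
# original numeric columns, the three added attributes, then the one-hot categories
extra_attribs = ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']
cat_one_hot_attribs = list(full_pipeline.named_transformers_['cat'].categories_[0])
feature_names = num_attribs + extra_attribs + cat_one_hot_attribs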

# Pair the feature names with their importances and sort them
feature_importances_dict = dict(zip(feature_names, feature_importances))
sorted_feature_importances = sorted(feature_importances_dict.items(), key=lambda x: x[1], reverse=True)

# Print the relative importance of each attribute
print("Relative Importance of Each Attribute:")
for feature, importance in sorted_feature_importances:
    print(f"{feature}: {importance:.4f}")

2-display these importance scores next to their corresponding attribute names:

In [None]:
# CODE HERE
# Printing the attribute names and their importance scores
print("Attribute Name: Importance Score")
for feature, importance in sorted_feature_importances:
    print(f"{feature}: {importance:.4f}")
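
As a sanity check on how to read these scores, the sketch below fits a small forest on synthetic data where only one feature drives the target. The data and the feature names are hypothetical; the point is only that `feature_importances_` sums to 1 and ranks the informative feature first.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
# Only the first column drives the target; the other two are noise
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

names = ["informative", "noise_a", "noise_b"]   # hypothetical feature names
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Pair each importance with its name and sort from most to least important
ranked = sorted(zip(forest.feature_importances_, names), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.4f}")
```

Features with importances close to zero are candidates for dropping in a later iteration.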


## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [None]:
# CODE HERE
# Separating the predictors (features) from the labels in the test set
X_test = test_set.drop("median_house_value", axis=1)  # predictors
y_test = test_set["median_house_value"].copy()  # labels

2-run your full_pipeline to transform the data

In [None]:
# CODE HERE
# Transforming the test set using the full_pipeline
X_test_prepared = full_pipeline.transform(X_test)
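
The test set is passed through `transform()`, not `fit_transform()`, so that it is prepared with statistics learned from the training set only. The sketch below illustrates the idea with plain NumPy standardization on synthetic data; it is a simplified stand-in for what the scaler inside the pipeline does, and all names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 1))  # synthetic training feature
test = rng.normal(loc=50.0, scale=10.0, size=(200, 1))    # synthetic test feature

# "fit": learn the scaling statistics from the training set only
mean, std = train.mean(axis=0), train.std(axis=0)

# "transform": apply those same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test  mean/std:", test_scaled.mean(), test_scaled.std())
```

The scaled test set is close to, but not exactly, zero mean and unit variance, which is expected: it was never used to compute the statistics.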

3-evaluate the final model on the test set

In [None]:
# CODE HERE
# Making predictions on the transformed test data using the trained model
final_predictions = best_model.predict(X_test_prepared)

# Calculating RMSE to evaluate the performance of the final model
from sklearn.metrics import mean_squared_error
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final RMSE on the test set:", final_rmse)

# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [None]:
from scipy import stats

In [None]:
# CODE HERE
# Confidence level for the interval
confidence = 0.95

# Squared error of each test prediction; their mean is the test MSE
squared_errors = (final_predictions - y_test) ** 2

# t-interval on the mean squared error, using its sample standard error
mse_lower, mse_upper = stats.t.interval(confidence, len(squared_errors) - 1,
                                        loc=squared_errors.mean(),
                                        scale=stats.sem(squared_errors))

# Taking the square root converts the bounds to RMSE units
lower_bound, upper_bound = np.sqrt(mse_lower), np.sqrt(mse_upper)

print("95% Confidence Interval for Generalization Error:")
print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
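
One way to check a t-interval is to rebuild it by hand. The sketch below uses synthetic squared errors as a stand-in for the real test-set errors, computes the interval on their mean with `scipy.stats.t.interval()`, and compares it with the manual formula (mean plus or minus the t quantile times the standard error). The data and variable names are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic stand-in for the squared errors of the test predictions
squared_errors = rng.normal(scale=100.0, size=500) ** 2

confidence = 0.95
n = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)   # sample standard error of the mean

# Interval on the mean squared error, via scipy
low, high = stats.t.interval(confidence, n - 1, loc=mean, scale=sem)

# Same interval by hand: mean +/- t quantile * standard error
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# Taking the square root moves the bounds back to RMSE units
print("RMSE interval:", np.sqrt(low), np.sqrt(high))
print("Matches manual formula:", np.allclose((low, high), manual))
```

Note that the interval is built on the mean squared error and only then square-rooted, so it is not symmetric around the RMSE.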

# Great Job!
# #shAI_Club