![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises for reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book, and feel free to ask about anything in the messenger group as you go along.**

## Good Luck :)

## First, run the following cells to reproduce the first part of the project and continue your work

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [22]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [23]:
# CODE HERE
from sklearn.linear_model import LinearRegression

l_reg = LinearRegression()
l_reg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [24]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [25]:
# CODE HERE
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", l_reg.predict(some_data_prepared))


Predictions: [181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]


# measure this regression model’s RMSE on the whole training set 
* using Scikit-Learn’s mean_squared_error() function:

In [26]:
from sklearn.metrics import mean_squared_error

In [27]:
# CODE HERE
housing_predictions = l_reg.predict(housing_prepared)
l_mse = mean_squared_error(housing_labels, housing_predictions)
l_smse = np.sqrt(l_mse)
l_smse

67593.20745775253
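To make the number above easier to interpret, here is a minimal sketch of what RMSE computes, written out by hand and compared with `mean_squared_error()`. The arrays `y_true` and `y_pred` are made-up values for illustration, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only (assumption): not from the housing dataset
y_true = np.array([200000.0, 150000.0, 320000.0, 90000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 100000.0])

# RMSE written out: square root of the mean squared difference
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same units as the label (dollars here), it can be compared directly with typical values of `median_house_value`.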

# Judge the RMSE result for this model 
Write down your answer 

your answer goes here

# Let’s train a Decision Tree Regressor model 
## more powerful model

In [28]:
from sklearn.tree import DecisionTreeRegressor 

In [29]:
# CODE HERE
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [30]:
# CODE HERE
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_smse = np.sqrt(tree_mse)
tree_smse

0.0

# Explain this result 
Write down your answer

your answer goes here
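One way to investigate a training error of zero is to compare it with the error on data the tree has not seen. The sketch below assumes synthetic data (a noisy linear target) and a hypothetical 80/20 split; it is not the housing data, only an illustration of the comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): noisy linear target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# No depth limit: the tree may keep splitting until every leaf is pure
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```

A large gap between the two numbers suggests the model has memorized the training set rather than learned something that generalizes.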

# Evaluation Using Cross-Validation

1-split the training set into 10 distinct subsets then train and evaluate the Decision Tree model

In [31]:
from sklearn.model_selection import cross_val_score

In [32]:
# CODE HERE
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_smse_scores = np.sqrt(-scores)

2- display the resulting scores and calculate their mean and standard deviation

In [33]:
# CODE HERE
print("Scores:", tree_smse_scores)
print("Mean:", tree_smse_scores.mean())
print("Standard deviation:", tree_smse_scores.std())

Scores: [65312.86044031 70581.69865676 67849.75809965 71460.33789358
 74035.29744574 65562.42978503 67964.10942543 69102.89388457
 66876.66473025 69735.84760006]
Mean: 68848.18979613911
Standard deviation: 2579.6785558576307
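The `-scores` in the cell above is needed because scikit-learn's cross-validation expects a utility function (higher is better), so `"neg_mean_squared_error"` returns negated MSE values. A small sketch on synthetic data (an assumption, not the housing set) shows the sign convention:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=10)

# One score per fold; each is a negated MSE, so none is positive
print("raw scores:", neg_mse)

# Flip the sign before taking the square root to obtain RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)
print("mean:", rmse_per_fold.mean(), "std:", rmse_per_fold.std())
```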


3-repeat the same steps to compute the same scores for the Linear Regression model 

*notice the difference between the results of the two models*

In [34]:
# CODE HERE
lin_scores = cross_val_score(l_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_smse_scores = np.sqrt(-lin_scores)
print("Scores:", lin_smse_scores)
print("Mean:", lin_smse_scores.mean())
print("Standard deviation:", lin_smse_scores.std())

Scores: [65000.67382615 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906]
Mean: 67828.38677377408
Standard deviation: 2468.091395065229


## Let’s train one last model the RandomForestRegressor.

In [35]:
# CODE HERE
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

# repeat the same steps to compute the scores, their mean and standard deviation, for the Random Forest model

In [36]:
# CODE HERE
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse


21787.892649206584
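The value above is the RMSE on the training set. A cross-validated counterpart, with mean and standard deviation as for the previous two models, might look like the sketch below. It uses synthetic data and the same `n_estimators=10` setting as an assumption; on the notebook's data you would pass `housing_prepared` and `housing_labels` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic non-linear data (assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 3.0 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X, y)

# Error on the data the forest was trained on
train_rmse = np.sqrt(mean_squared_error(y, forest.predict(X)))

# Error estimated on held-out folds (cross_val_score fits clones of the model)
cv_scores = cross_val_score(forest, X, y,
                            scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-cv_scores)

print("training RMSE:", train_rmse)
print("CV mean:", cv_rmse.mean(), "CV std:", cv_rmse.std())
```

If the training RMSE is much lower than the cross-validated mean, the forest is still overfitting the training set to some degree.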

# Save every model you experiment with 
*using the joblib library*

In [37]:
# CODE HERE
import joblib

# Persist a trained model (here the Random Forest) to disk, then load it back
my_model = forest_reg
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")
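A quick way to check that a saved model survives the round trip is to compare predictions before and after loading. This is a minimal sketch with a small synthetic dataset; the file name `lin_reg.pkl` and the temporary directory are hypothetical choices for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic dataset (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] + 2.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lin_reg.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

Giving each model its own file name makes it possible to come back to any of them later and compare.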

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [38]:
from sklearn.model_selection import GridSearchCV

In [39]:
# CODE HERE
param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
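The count in the comment above (12 + 6 combinations, 90 training rounds with 5 folds) can be checked without training anything by expanding the grid with `ParameterGrid`. A small sketch, reusing the same grid:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Each dict is expanded separately, then the results are concatenated
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * 5  # cv=5 folds per combination

print(n_combinations, n_fits)  # prints: 18 90
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole training set after the search.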

Look at the evaluation scores for each hyperparameter combination:

In [40]:
# CODE HERE
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

64878.27480854276 {'max_features': 2, 'n_estimators': 3}
55391.003575336406 {'max_features': 2, 'n_estimators': 10}
52721.66494842234 {'max_features': 2, 'n_estimators': 30}
58541.12715494087 {'max_features': 4, 'n_estimators': 3}
51623.59366665994 {'max_features': 4, 'n_estimators': 10}
49787.65951361993 {'max_features': 4, 'n_estimators': 30}
58620.88234614251 {'max_features': 6, 'n_estimators': 3}
51645.862673140065 {'max_features': 6, 'n_estimators': 10}
49917.66994061786 {'max_features': 6, 'n_estimators': 30}
58640.96129790229 {'max_features': 8, 'n_estimators': 3}
51650.365581628095 {'max_features': 8, 'n_estimators': 10}
49672.50940389753 {'max_features': 8, 'n_estimators': 30}
61580.24110015614 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
53889.80996032937 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
58667.89389226964 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52764.2630869393 {'bootstrap': False, 'max_features': 3, 'n_estimators': 

In [41]:
pd.DataFrame(grid_search.cv_results_)


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_features | param_n_estimators | param_bootstrap | params | split0_test_score | split1_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.115721 | 0.007241 | 0.003813 | 0.000738 | 2 | 3 | | {'max_features': 2, 'n_estimators': 3} | -4308249000.0 | -4248986000.0 | ... | -4209191000.0 | 189493000.0 | 18 | -1105656000.0 | -1124980000.0 | -1136908000.0 | -1077530000.0 | -1079774000.0 | -1104970000.0 | 23701030.0 |
| 1 | 0.384915 | 0.025747 | 0.012201 | 0.002399 | 2 | 10 | | {'max_features': 2, 'n_estimators': 10} | -3098762000.0 | -3058719000.0 | ... | -3068163000.0 | 79228710.0 | 11 | -569232300.0 | -580110500.0 | -564676600.0 | -581360900.0 | -589902000.0 | -577056400.0 | 9025998.0 |
| 2 | 1.09379 | 0.009265 | 0.027685 | 0.000989 | 2 | 30 | | {'max_features': 2, 'n_estimators': 30} | -2790734000.0 | -2740025000.0 | ... | -2779574000.0 | 49815570.0 | 8 | -428236900.0 | -434127500.0 | -432280100.0 | -437365000.0 | -437560000.0 | -433913900.0 | 3468334.0 |
| 3 | 0.175657 | 0.003918 | 0.0034 | 0.000489 | 4 | 3 | | {'max_features': 4, 'n_estimators': 3} | -3218150000.0 | -3295804000.0 | ... | -3427064000.0 | 147803200.0 | 13 | -873364400.0 | -899843300.0 | -976313400.0 | -906381200.0 | -959480300.0 | -923076500.0 | 38598500.0 |
| 4 | 0.601243 | 0.024169 | 0.010999 | 0.001096 | 4 | 10 | | {'max_features': 4, 'n_estimators': 10} | -2566728000.0 | -2608173000.0 | ... | -2664995000.0 | 77513990.0 | 5 | -478823700.0 | -487882700.0 | -501063800.0 | -509697200.0 | -504446900.0 | -496382900.0 | 11355960.0 |
| 5 | 1.760896 | 0.058307 | 0.029004 | 0.00253 | 4 | 30 | | {'max_features': 4, 'n_estimators': 30} | -2482751000.0 | -2450159000.0 | ... | -2478811000.0 | 56768280.0 | 2 | -384593400.0 | -384213700.0 | -382281100.0 | -388687500.0 | -390750900.0 | -386105300.0 | 3122129.0 |
| 6 | 0.266565 | 0.023835 | 0.004002 | 0.000632 | 6 | 3 | | {'max_features': 6, 'n_estimators': 3} | -3424743000.0 | -3545984000.0 | ... | -3436408000.0 | 92185150.0 | 14 | -912516800.0 | -927504100.0 | -878195700.0 | -942317100.0 | -883255300.0 | -908757800.0 | 24804040.0 |
| 7 | 0.825183 | 0.052532 | 0.009802 | 0.0004 | 6 | 10 | | {'max_features': 6, 'n_estimators': 10} | -2727240000.0 | -2691870000.0 | ... | -2667295000.0 | 44524990.0 | 6 | -511341900.0 | -488128200.0 | -490954100.0 | -508915700.0 | -494495600.0 | -498767100.0 | 9524654.0 |
| 8 | 2.413876 | 0.071724 | 0.027204 | 0.000979 | 6 | 30 | | {'max_features': 6, 'n_estimators': 30} | -2502153000.0 | -2519167000.0 | ... | -2491774000.0 | 44133090.0 | 3 | -389893800.0 | -383135400.0 | -387561800.0 | -386263200.0 | -386901900.0 | -386751200.0 | 2184835.0 |
| 9 | 0.359646 | 0.029182 | 0.004201 | 0.000401 | 8 | 3 | | {'max_features': 8, 'n_estimators': 3} | -3358026000.0 | -3416882000.0 | ... | -3438762000.0 | 49926440.0 | 15 | -895712600.0 | -846862700.0 | -884146800.0 | -918757800.0 | -917897500.0 | -892675500.0 | 26446980.0 |
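A results table like this is easier to read when reduced to a few columns and sorted by rank. The sketch below assumes a much smaller grid and synthetic data so that it runs quickly; the column name `mean_test_rmse` is introduced here for illustration and is not part of `cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately tiny grid (assumptions for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=120)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Convert the negated mean MSE into an RMSE-like value (hypothetical column)
results["mean_test_rmse"] = np.sqrt(-results["mean_test_score"])

# Keep only the columns needed to compare combinations, best first
summary = (results[["params", "mean_test_rmse", "rank_test_score"]]
           .sort_values("rank_test_score"))

print(summary)
print("best:", search.best_params_)
```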


# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [None]:
# CODE HERE

2-display these importance scores next to their corresponding attribute names:

In [None]:
# CODE HERE
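One possible approach to both steps is sketched below on synthetic data. The feature names (`signal`, `weak_signal`, `noise`) are hypothetical and chosen so that the expected ranking is obvious; this is an illustration, not the solution for the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names (assumption for illustration)
rng = np.random.default_rng(0)
feature_names = ["signal", "weak_signal", "noise"]
X = rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)

# Step 1: one importance score per input column
importances = forest.feature_importances_

# Step 2: pair each score with its name and sort, most important first
ranked = sorted(zip(importances, feature_names), reverse=True)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

In this notebook the scores would come from `grid_search.best_estimator_.feature_importances_`. The matching names would presumably be `num_attribs`, followed by the three ratios added by `CombinedAttributesAdder`, followed by the one-hot categories from `full_pipeline.named_transformers_["cat"].categories_[0]`, since that is the column order of `housing_prepared`.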

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [42]:
# CODE HERE
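final_model = grid_search.best_estimator_

# Step 1: separate the predictors from the labels
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

# Step 2: transform only (do not refit the pipeline on the test data)
X_test_prepared = full_pipeline.transform(X_test)

# Step 3: predict and compute the RMSE on the test set
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)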

2-run your full_pipeline to transform the data

In [43]:
# CODE HERE
final_rmse


49198.020631676336

3-evaluate the final model on the test set

In [None]:
# CODE HERE

# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [44]:
from scipy import stats

In [45]:
# CODE HERE
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
m = len(squared_errors)

# t-interval around the mean squared error, then sqrt to express it as RMSE
np.sqrt(stats.t.interval(confidence, m - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([46948.10215126, 51349.4515311 ])
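To see what `stats.t.interval()` does, the same bounds can be rebuilt by hand as the mean plus or minus a critical t value times the standard error. The sketch below uses randomly generated squared errors as a stand-in (an assumption) for `(final_predictions - y_test) ** 2`.

```python
import numpy as np
from scipy import stats

# Hypothetical squared errors standing in for the real test-set errors
rng = np.random.default_rng(0)
squared_errors = rng.normal(loc=0.0, scale=50000.0, size=1000) ** 2

confidence = 0.95
m = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)  # standard error of the mean

# Interval computed by SciPy
lower, upper = stats.t.interval(confidence, m - 1, loc=mean, scale=sem)

# Same interval by hand: mean +/- t_crit * sem
t_crit = stats.t.ppf((1 + confidence) / 2, df=m - 1)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# Square roots turn the bounds on the MSE into bounds on the RMSE
rmse_interval = np.sqrt([lower, upper])
print(rmse_interval)
```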

# Great Job!
# #shAI_Club