![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand Chapter 2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book, and feel free to ask about anything in the messenger group as you go along.**

 ## Good Luck : )

## First, run the following cell, which reproduces the first part of the project, so you can continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)

fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model

In [3]:
# CODE HERE
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [4]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [5]:
# CODE HERE
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

Predictions: [181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]
Labels: [103000.0, 382100.0, 172600.0, 93400.0, 96500.0]


# Measure this regression model’s RMSE on the whole training set
* using Scikit-Learn’s mean_squared_error() function:

In [6]:
from sklearn.metrics import mean_squared_error

In [7]:
# CODE HERE
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

67593.20745775253

# Judge the RMSE result for this model
Write down your answer

Most districts' median_house_value lies between 120,000 and 265,000 dollars, so a typical prediction error of 67,593 dollars is not a good score. This is an example of a model underfitting the training data, so we are going to select a more powerful model.
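
To make the RMSE figure concrete, here is a minimal sketch that applies the formula by hand to the five predictions and labels printed earlier. It only covers those five instances, so the result is not the full training-set RMSE; the variable names `sample_rmse` and `relative_errors` are illustrative.

```python
import numpy as np

# The five predictions and labels printed earlier in this notebook.
predictions = np.array([181746.54359616, 290558.74973505, 244957.50017771,
                        146498.51061398, 163230.42393939])
labels = np.array([103000.0, 382100.0, 172600.0, 93400.0, 96500.0])

# RMSE = sqrt(mean((prediction - label)^2))
errors = predictions - labels
sample_rmse = np.sqrt(np.mean(errors ** 2))

# Size of each error as a fraction of the true label.
relative_errors = np.abs(errors) / labels

print("RMSE on the five instances:", sample_rmse)
print("Relative errors:", np.round(relative_errors, 2))
```

Looking at the relative errors next to the RMSE helps when judging whether an error of this size is acceptable for prices in this range.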

# Let’s train a Decision Tree Regressor model
## more powerful model

In [8]:
from sklearn.tree import DecisionTreeRegressor

In [9]:
# CODE HERE
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set
* using Scikit-Learn’s mean_squared_error() function:

In [10]:
# CODE HERE
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

# Explain this result
Write down your answer

A machine learning model is very unlikely to be absolutely perfect, so it is much more likely that the model has badly overfit the training data.
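
The overfitting explanation can be checked on a small example. The sketch below uses hypothetical synthetic data (not the housing dataset): an unconstrained decision tree memorizes the training points, so its training RMSE is zero, while its error on held-out points stays clearly above zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a noisy linear relationship.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=500)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# No depth limit: the tree keeps splitting until every leaf holds one sample.
demo_tree = DecisionTreeRegressor(random_state=42)
demo_tree.fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, demo_tree.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_val, demo_tree.predict(X_val)))

print("Training RMSE:", train_rmse)
print("Validation RMSE:", val_rmse)
```

A gap like this between training and held-out error is the usual sign of overfitting, which is why the training-set score alone cannot be trusted.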

# Evaluation Using Cross-Validation

1-split the training set into 10 distinct subsets then train and evaluate the Decision Tree model

In [11]:
from sklearn.model_selection import cross_val_score

In [12]:
# CODE HERE
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
 scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
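
As a rough illustration of what `cross_val_score` does here, the sketch below reproduces it with an explicit `KFold` loop on hypothetical synthetic data. It also shows why the scores are negated: the `"neg_mean_squared_error"` scorer returns the negative of the MSE, because Scikit-Learn scorers follow a "greater is better" convention.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, size=200)

model = LinearRegression()

# Manual K-fold: train on K-1 folds, evaluate on the remaining fold.
manual_mse = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_mse.append(mean_squared_error(y_demo[val_idx], fold_pred))
manual_rmse = np.sqrt(manual_mse)

# The same evaluation through cross_val_score (scores are negative MSE).
neg_mse = cross_val_score(model, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-neg_mse)

print("Manual RMSE per fold:", manual_rmse)
print("cross_val_score RMSE:", cv_rmse)
```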

2- display the resultant scores and calculate its Mean and Standard deviation

In [13]:
# CODE HERE
def display_scores(scores):
  print("Scores:", scores)
  print("Mean:", scores.mean())
  print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)


Scores: [64492.3851374  69858.64335398 69222.52947355 72185.2910444
 74072.96272067 67068.1531935  68147.67054094 67563.40395155
 67162.56974479 69809.33489122]
Mean: 68958.29440519909
Standard deviation: 2598.5090815284416


3- Repeat the same steps to compute the same scores for the Linear Regression model

*notice the difference between the results of the two models*

In [14]:
# CODE HERE
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
  scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [65000.67382615 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906]
Mean: 67828.38677377408
Standard deviation: 2468.0913950652275
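
To make the difference between the two models explicit, the following sketch recomputes the summary statistics from the fold scores printed above and compares them. The array names `tree_cv` and `lin_cv` are illustrative.

```python
import numpy as np

# Cross-validation RMSE scores copied from the outputs above.
tree_cv = np.array([64492.3851374, 69858.64335398, 69222.52947355, 72185.2910444,
                    74072.96272067, 67068.1531935, 68147.67054094, 67563.40395155,
                    67162.56974479, 69809.33489122])
lin_cv = np.array([65000.67382615, 70960.56056304, 67122.63935124, 66089.63153865,
                   68402.54686442, 65266.34735288, 65218.78174481, 68525.46981754,
                   72739.87555996, 68957.34111906])

for name, fold_scores in [("Decision tree", tree_cv), ("Linear regression", lin_cv)]:
    print(f"{name}: mean {fold_scores.mean():.0f}, std {fold_scores.std():.0f}")

# Positive gap: the tree's mean validation RMSE is higher (worse).
gap = tree_cv.mean() - lin_cv.mean()
print(f"Gap between the means: {gap:.0f}")
```

Despite its training RMSE of 0.0, the decision tree's mean validation error is slightly higher than the linear model's, and the gap is smaller than the fold-to-fold standard deviation of either model.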


## Let’s train one last model: the RandomForestRegressor.

In [15]:
# CODE HERE
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

# Evaluate the forest itself on the training set (not the decision tree)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

# Repeat the same steps to compute the cross-validation scores, and their mean and standard deviation, for the Random Forest model

In [16]:
# CODE HERE
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
  scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [46843.05895629 51578.82363337 49608.31668194 51704.7449574
 52680.01439623 47261.91225137 47512.74342286 50816.94689542
 49471.58217779 50080.99156148]
Mean: 49755.91349341585
Standard deviation: 1916.8198664657677


# Save every model you experiment with
*using the joblib library*

In [17]:
# CODE HERE
import joblib
joblib.dump(forest_reg, "forest_reg.pkl")
forest_reg_loaded = joblib.load("forest_reg.pkl")

joblib.dump(lin_reg, "lin_reg.pkl")
lin_reg_loaded = joblib.load("lin_reg.pkl")

joblib.dump(tree_reg, "tree_reg.pkl")
tree_reg_loaded = joblib.load("tree_reg.pkl")
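
A quick way to confirm that a saved model survives the round trip is to compare predictions before and after loading. This is a minimal sketch on hypothetical synthetic data; it writes to a temporary directory, and the file name `demo_model.pkl` is illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data and a small fitted model.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1]
model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```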

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor
*It may take a long time*

In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
# CODE HERE
param_grid = [
  {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
  {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
  scoring='neg_mean_squared_error',
  return_train_score=True)

grid_search.fit(housing_prepared, housing_labels)
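
To see why this search can take a long time, the sketch below counts the hyperparameter combinations in the grid with `ParameterGrid` and multiplies by the number of folds. It only counts; it does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# The same grid as in the cell above.
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# 3 * 4 combinations from the first dict plus 1 * 2 * 3 from the second.
n_combinations = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_combinations * n_folds

print("Combinations:", n_combinations)
print("Model fits during cross-validation:", n_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the best combination once more on the whole training set after the search.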

Display each hyperparameter combination with its evaluation score

In [20]:
# CODE HERE
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
  print(np.sqrt(-mean_score), params)

63415.43402916871 {'max_features': 2, 'n_estimators': 3}
55109.941519423846 {'max_features': 2, 'n_estimators': 10}
52425.4009979644 {'max_features': 2, 'n_estimators': 30}
60573.78930104172 {'max_features': 4, 'n_estimators': 3}
52100.92416922879 {'max_features': 4, 'n_estimators': 10}
50111.00992777813 {'max_features': 4, 'n_estimators': 30}
59181.68536657322 {'max_features': 6, 'n_estimators': 3}
52125.775111898314 {'max_features': 6, 'n_estimators': 10}
49566.361884921695 {'max_features': 6, 'n_estimators': 30}
58644.994792411766 {'max_features': 8, 'n_estimators': 3}
51852.31762573422 {'max_features': 8, 'n_estimators': 10}
49607.62923499114 {'max_features': 8, 'n_estimators': 30}
62114.6955446944 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54198.735569794575 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
58952.10895582922 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52410.212803825954 {'bootstrap': False, 'max_features': 3, 'n_estimators'

# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [21]:
# CODE HERE
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([7.53395807e-02, 6.57863174e-02, 4.17586744e-02, 1.75802489e-02,
       1.56932592e-02, 1.62217496e-02, 1.53717545e-02, 3.31454112e-01,
       5.42100794e-02, 1.04997447e-01, 7.45438566e-02, 6.59738085e-03,
       1.73555018e-01, 3.02232797e-04, 2.22976733e-03, 4.35852128e-03])

2-display these importance scores next to their corresponding attribute names:

In [22]:
# CODE HERE
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.3314541122018101, 'median_income'),
 (0.1735550176683082, 'INLAND'),
 (0.10499744718472988, 'pop_per_hhold'),
 (0.07533958068889075, 'longitude'),
 (0.07454385656415093, 'bedrooms_per_room'),
 (0.06578631739123911, 'latitude'),
 (0.054210079443381624, 'rooms_per_hhold'),
 (0.041758674440026754, 'housing_median_age'),
 (0.01758024893908942, 'total_rooms'),
 (0.01622174956894434, 'population'),
 (0.015693259200151154, 'total_bedrooms'),
 (0.015371754450215162, 'households'),
 (0.006597380850258361, '<1H OCEAN'),
 (0.004358521280740297, 'NEAR OCEAN'),
 (0.0022297673309548843, 'NEAR BAY'),
 (0.00030223279710905383, 'ISLAND')]
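
The same ranking pattern can be tried on a small example where the answer is known in advance. The sketch below uses hypothetical synthetic data with one strong feature, one weak feature, and one pure-noise feature; the feature names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=300)
names = ["strong", "weak", "noise"]

forest = RandomForestRegressor(n_estimators=50, random_state=7).fit(X_demo, y_demo)
importances = forest.feature_importances_

# Pair each score with its name and sort from most to least important.
ranked = sorted(zip(importances, names), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

The importance scores sum to 1, so each value can be read as a share of the total. Low-ranked attributes are candidates for removal.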

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [23]:
# CODE HERE
final_model = grid_search.best_estimator_
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

2-run your full_pipeline to transform the data

In [24]:
# CODE HERE
X_test_prepared = full_pipeline.transform(X_test)
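
Note that the step above calls `transform()`, not `fit_transform()`: the test set must be processed with the statistics learned from the training set. The sketch below illustrates the difference with a `StandardScaler` on tiny hypothetical arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical train and test columns.
train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
test_demo = np.array([[10.0], [20.0]])

# Correct: learn the mean and standard deviation from the training data only.
scaler = StandardScaler()
scaler.fit(train_demo)
test_scaled = scaler.transform(test_demo)

# Incorrect for evaluation: refitting on the test set uses test statistics.
test_refit = StandardScaler().fit_transform(test_demo)

print("Training mean used for scaling:", scaler.mean_)
print("Test set scaled with training statistics:", test_scaled.ravel())
print("Test set scaled with its own statistics:", test_refit.ravel())
```

Refitting on the test set maps the two test values to -1 and 1 regardless of how far they lie from the training data, which hides exactly the information the model needs.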


3-evaluate the final model on the test set

In [25]:
# CODE HERE
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

# compute a 95% confidence interval for the generalization error
*using scipy.stats.t.interval():*

In [26]:
from scipy import stats

In [27]:
# CODE HERE
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([47175.58278942, 51499.03125163])
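
For readers who want to see what `stats.t.interval()` computes, here is a minimal sketch that builds the same interval by hand as mean ± t × standard error and then takes the square root to get back to RMSE units. It runs on hypothetical squared errors drawn at random, not on the actual test-set errors.

```python
import numpy as np
from scipy import stats

# Hypothetical squared errors standing in for (prediction - label) ** 2.
rng = np.random.default_rng(3)
demo_sq_errors = rng.exponential(scale=2.5e9, size=1000)

confidence = 0.95
n = len(demo_sq_errors)
mean = demo_sq_errors.mean()

# Standard error of the mean, using the sample standard deviation (ddof=1).
sem = demo_sq_errors.std(ddof=1) / np.sqrt(n)

# Two-sided critical value of Student's t with n - 1 degrees of freedom.
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
manual_interval = np.sqrt([mean - t_crit * sem, mean + t_crit * sem])

# The same interval computed by SciPy.
scipy_interval = np.sqrt(stats.t.interval(confidence, n - 1,
                                          loc=mean,
                                          scale=stats.sem(demo_sq_errors)))

print("Manual interval:", manual_interval)
print("SciPy interval: ", scipy_interval)
```

The interval is built around the mean squared error, so the square root is applied last to express both bounds as RMSE values.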

# Great Job!
# #shAI_Club