<a href="https://colab.research.google.com/github/esraa-abdelmaksoud/Shai-Training-Notebooks/blob/main/Task_6_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](logo1.jpg)

# **shAI Training 2021 | Level 1**

## Task #6 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book and feel free to ask in the messenger group as you go along.**

 ## Good Luck : )

## First, run the following cells, which repeat the first part of the project, so you can continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [3]:
# CODE HERE
from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(housing_prepared, housing_labels)

LinearRegression()

# First try it out on a few instances from the training set:


In [4]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [12]:
# CODE HERE
some_data_tr = full_pipeline.transform(some_data)
pred = linear.predict(some_data_tr)
print(pred)
print(np.array(some_labels))

[181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]
[103000. 382100. 172600.  93400.  96500.]
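To put a single number on the gap between these predictions and labels, the RMSE can be computed by hand. The sketch below is only an illustration on the five instances printed above (predictions rounded to two decimals); the variable names are hypothetical and not part of the exercise.

```python
import numpy as np

# The five predictions and labels printed above (predictions rounded to 2 decimals)
preds = np.array([181746.54, 290558.75, 244957.50, 146498.51, 163230.42])
labels = np.array([103000.0, 382100.0, 172600.0, 93400.0, 96500.0])

errors = preds - labels                 # signed error per district
rmse = np.sqrt(np.mean(errors ** 2))    # square, average, then take the root
print(f"RMSE on these five instances: {rmse:,.0f}")
```

Because the errors are squared before averaging, the largest misses dominate the result, which is why RMSE is sensitive to outliers.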


# Measure this regression model’s RMSE on the whole training set
* using Scikit-Learn’s mean_squared_error() function:

In [13]:
from sklearn.metrics import mean_squared_error

In [15]:
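# CODE HERE
housing_preds = linear.predict(housing_prepared)
# np.sqrt(MSE) gives the RMSE and avoids squared=False, which newer scikit-learn releases no longer accept
linear_rmse = np.sqrt(mean_squared_error(housing_labels, housing_preds))
linear_rmse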


67593.20745775253

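# Judge the RMSE result for this model
Write down your answer

The root mean squared error is about 67,593. That is high relative to the house values themselves (the five labels shown above range from 93,400 to 382,100), which suggests the model is underfitting the training data, so a more powerful model should be tried.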

# Let’s train a Decision Tree Regressor model 
## A more powerful model

In [16]:
from sklearn.tree import DecisionTreeRegressor 

In [17]:
# CODE HERE
tree = DecisionTreeRegressor()
tree.fit(housing_prepared, housing_labels)

DecisionTreeRegressor()

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [18]:
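# CODE HERE
housing_tree_preds = tree.predict(housing_prepared)
# np.sqrt(MSE) gives the RMSE and avoids squared=False, which newer scikit-learn releases no longer accept
tree_rmse = np.sqrt(mean_squared_error(housing_labels, housing_tree_preds))
tree_rmse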

0.0

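# Explain this result
Write down your answer

The RMSE of the decision tree on the training set is 0.0. A model that makes no error at all on the data it was trained on has almost certainly memorized that data, so this result points to severe overfitting rather than to a perfect model.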
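A zero training error only becomes meaningful when compared with the error on data the model has not seen. The following is a minimal sketch on synthetic data (an assumption for illustration, not the housing set); `tree_demo` and the `rmse` helper are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the housing set)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 3 * X[:, 0] + rng.normal(0, 2, size=500)   # linear signal plus noise

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unrestricted tree keeps splitting until every training row is fit exactly
tree_demo = DecisionTreeRegressor(random_state=42)
tree_demo.fit(X_train, y_train)

def rmse(y_true, y_pred):
    """Root mean squared error computed with plain NumPy."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

train_rmse = rmse(y_train, tree_demo.predict(X_train))
val_rmse = rmse(y_val, tree_demo.predict(X_val))
print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```

The training error collapses while the held-out error does not, which is the pattern the cross-validation section checks for on the housing data.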

# Evaluation Using Cross-Validation

1- Split the training set into 10 distinct subsets (folds), then train and evaluate the Decision Tree model on them

In [19]:
from sklearn.model_selection import cross_val_score

In [23]:
# CODE HERE
score = cross_val_score(tree, housing_prepared, housing_labels, cv=10,
                        scoring="neg_mean_squared_error")
tree_score = np.sqrt(-score)
tree_score

array([65923.09013902, 70524.52667343, 69988.39750289, 70668.86865182,
       72158.10789585, 66820.30011888, 66091.51009865, 68937.244088  ,
       64506.87514742, 70130.88687103])

2- Display the resulting scores and calculate their mean and standard deviation

In [25]:
# CODE HERE
print("Scores:", tree_score, "\n Mean:", tree_score.mean(), "\n Std:",
      tree_score.std())

Scores: [65923.09013902 70524.52667343 69988.39750289 70668.86865182
 72158.10789585 66820.30011888 66091.51009865 68937.244088
 64506.87514742 70130.88687103] 
 Mean: 68574.98071870054 
 Std: 2416.655935312109
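It may help to see what `cross_val_score` does under the hood, and why the scores are negated before taking the square root. The sketch below uses synthetic data and a `LinearRegression` stand-in (both assumptions) and reproduces the scores with an explicit `KFold` loop.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption; stands in for housing_prepared / housing_labels)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# cross_val_score expects "higher is better", so it returns the negated MSE per fold
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-neg_mse)

# The same computation spelled out fold by fold
manual_rmse = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])   # fresh copy per fold
    fold_mse = mean_squared_error(y[val_idx], fold_model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(fold_mse))
manual_rmse = np.array(manual_rmse)

print(rmse_scores)
print(manual_rmse)
```

Each fold is used once for validation while the model is trained on the remaining folds, so every score is measured on data that fold's model never saw.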


3- Repeat the same steps to compute the same scores for the Linear Regression model

*notice the difference between the results of the two models*

In [26]:
# CODE HERE
score = cross_val_score(linear, housing_prepared, housing_labels, cv=10,
                        scoring="neg_mean_squared_error")
linear_score = np.sqrt(-score)
print("Scores:", linear_score, "\n Mean:", linear_score.mean(), "\n Std:",
      linear_score.std())

Scores: [65000.67382615 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906] 
 Mean: 67828.38677377408 
 Std: 2468.091395065227


## Let’s train one last model: the RandomForestRegressor.

In [27]:
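# CODE HERE
from sklearn.ensemble import RandomForestRegressor

# Use RandomForestRegressor here. This cell originally instantiated DecisionTreeRegressor
# by mistake, so the cross-validation scores recorded further down still reflect a decision tree.
forest = RandomForestRegressor()
forest.fit(housing_prepared, housing_labels)

RandomForestRegressor()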

# Repeat the same steps to compute the scores, their mean, and their standard deviation for the Random Forest model

In [28]:
# CODE HERE
score = cross_val_score(forest, housing_prepared, housing_labels, cv=10,
                        scoring="neg_mean_squared_error")
forest_score = np.sqrt(-score)
print("Scores:", forest_score, "\n Mean:", forest_score.mean(), "\n Std:",
      forest_score.std())

Scores: [64030.52492421 71003.258586   69491.88548799 70634.71300772
 72852.00963828 66907.81710221 67720.39315879 67132.03890487
 67495.58127302 69523.11568435] 
 Mean: 68679.13377674334 
 Std: 2397.3364321153867


# Save every model you experiment with 
*using the joblib library*

In [29]:
# CODE HERE
import joblib
joblib.dump(linear, "linear.pkl")
joblib.dump(tree, "tree.pkl")
joblib.dump(forest, "forest.pkl")

['forest.pkl']
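The list printed above is the return value of the last `joblib.dump` call (the files it wrote). A saved model is only useful if it can be restored, so here is a small round-trip sketch; the toy model and the file name `linear_demo.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic model (an assumption; any fitted estimator works the same way)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_demo.pkl")   # hypothetical file name
    joblib.dump(model, path)        # writes the fitted estimator to disk
    restored = joblib.load(path)    # rebuilds it, including the learned parameters

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

Saving the cross-validation scores alongside each model makes later comparisons easier, since the scores cannot be recovered from the pickled estimator alone.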

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## Evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor
*It may take a long time*

In [30]:
from sklearn.model_selection import GridSearchCV

In [31]:
# CODE HERE
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
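To see why the search may take a long time, the number of trainings can be counted before running it. This sketch uses scikit-learn's `ParameterGrid` on the same `param_grid` as above; the variable names `n_combinations` and `n_fits` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Each dict is expanded separately: 3 * 4 combinations, then 1 * 2 * 3
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * 5          # one fit per combination per fold (cv=5)
print(n_combinations, n_fits)        # 18 90
```

With the default `refit=True`, `GridSearchCV` additionally retrains the best combination once on the whole training set.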

Display the evaluation scores:

In [33]:
# CODE HERE
cvres = grid_search.cv_results_
cvres

{'mean_fit_time': array([0.07416954, 0.2381186 , 0.75177517, 0.11739011, 0.37944331,
        1.1447669 , 0.1633215 , 0.53416653, 1.59210563, 0.20373106,
        0.67509875, 2.02926641, 0.11267085, 0.36814694, 0.14908595,
        0.4857172 , 0.18481607, 0.61324039]),
 'std_fit_time': array([0.00646466, 0.00462862, 0.07806737, 0.00490277, 0.00619951,
        0.00620322, 0.00252618, 0.01422538, 0.01239269, 0.00478365,
        0.00812653, 0.00910179, 0.00149587, 0.00551911, 0.00351852,
        0.00335796, 0.00626123, 0.00507137]),
 'mean_score_time': array([0.00499754, 0.0137794 , 0.04132218, 0.00534768, 0.01424856,
        0.040729  , 0.00530419, 0.01380658, 0.03851709, 0.00497775,
        0.01415586, 0.03802309, 0.00589485, 0.01663799, 0.00614123,
        0.01628385, 0.00584321, 0.01630435]),
 'std_score_time': array([9.46086740e-05, 2.13534100e-04, 4.91020143e-03, 7.08673235e-04,
        1.51968801e-03, 3.34859289e-03, 2.46154620e-04, 2.25081635e-04,
        7.36776038e-04, 1.15897466e-

# Analyze the Best Models and Their Errors
1- Indicate the relative importance of each attribute

In [34]:
# CODE HERE
grid_search.best_estimator_

RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
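The output above is the best estimator itself; a fitted random forest exposes the importance scores through its `feature_importances_` attribute. The sketch below shows the idea on synthetic data with hypothetical attribute names (both are assumptions, not the housing attributes).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical attribute names (assumptions for illustration)
rng = np.random.default_rng(42)
attributes = ["income", "rooms", "noise_a", "noise_b"]
X = rng.normal(size=(400, len(attributes)))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

forest_demo = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)

# One score per input column; the scores are normalized to sum to 1
importances = forest_demo.feature_importances_

# Pair each score with its attribute name and list the most important first
ranked = sorted(zip(importances, attributes), reverse=True)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

In this notebook the equivalent would presumably be `grid_search.best_estimator_.feature_importances_`, paired with a name list built in the column order of `housing_prepared`: `num_attribs`, then the three ratios added by `CombinedAttributesAdder`, then the one-hot categories from `full_pipeline.named_transformers_["cat"].categories_[0]`.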

2- Display these importance scores next to their corresponding attribute names:

In [35]:
# CODE HERE
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

64878.27480854276 {'max_features': 2, 'n_estimators': 3}
55391.003575336406 {'max_features': 2, 'n_estimators': 10}
52721.66494842234 {'max_features': 2, 'n_estimators': 30}
58541.12715494087 {'max_features': 4, 'n_estimators': 3}
51623.59366665994 {'max_features': 4, 'n_estimators': 10}
49787.65951361993 {'max_features': 4, 'n_estimators': 30}
58620.88234614251 {'max_features': 6, 'n_estimators': 3}
51645.862673140065 {'max_features': 6, 'n_estimators': 10}
49917.66994061786 {'max_features': 6, 'n_estimators': 30}
58640.96129790229 {'max_features': 8, 'n_estimators': 3}
51650.365581628095 {'max_features': 8, 'n_estimators': 10}
49672.50940389753 {'max_features': 8, 'n_estimators': 30}
61580.24110015614 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
53889.80996032937 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
58667.89389226964 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52764.2630869393 {'bootstrap': False, 'max_features': 3, 'n_estimators': 

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1- Get the predictors and the labels from your test set

In [36]:
# CODE HERE
housing_test = test_set.drop("median_house_value", axis=1)
housing_test_labels = test_set["median_house_value"].copy()

2- Run your full_pipeline to transform the data (call transform(), not fit_transform(), so nothing is fitted on the test set)

In [37]:
# CODE HERE
housing_test_prepared = full_pipeline.transform(housing_test)
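Calling `transform()` here, rather than `fit_transform()`, matters because the preprocessing statistics must come from the training set only. The sketch below illustrates this with a lone `StandardScaler` on synthetic columns (assumptions; the test column is deliberately shifted).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test columns (assumptions; the test data is deliberately shifted)
rng = np.random.default_rng(1)
train = rng.normal(loc=10.0, scale=2.0, size=(1000, 1))
test = rng.normal(loc=12.0, scale=2.0, size=(200, 1))

scaler = StandardScaler().fit(train)     # statistics learned from the training set only
mean_before = scaler.mean_.copy()

test_scaled = scaler.transform(test)     # reuses the training mean and std
mean_after = scaler.mean_

print(mean_before, mean_after)           # identical: transform() does not refit
print(test_scaled.mean())                # not 0, because the test column is shifted
```

Refitting on the test set would recenter it around zero and hide exactly the kind of distribution shift the final evaluation is supposed to expose.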

3- Evaluate the final model on the test set

In [41]:
# CODE HERE
grid_model = grid_search.best_estimator_
grid_pred = grid_model.predict(housing_test_prepared)
mse = mean_squared_error(housing_test_labels, grid_pred)
rmse = np.sqrt(mse)
rmse

49198.020631676336

# Compute a 95% confidence interval for the generalization error
*using scipy.stats.t.interval():*

In [42]:
from scipy import stats

In [43]:
# CODE HERE
confidence = 0.95
squared_errors = (grid_pred - housing_test_labels) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([46948.10215126, 51349.4515311 ])
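The interval above is built around the mean of the squared errors and only converted to RMSE units by the final square root. A sketch of the same calculation done by hand, on synthetic squared errors (an assumption standing in for the test-set errors), shows what `stats.t.interval` computes.

```python
import numpy as np
from scipy import stats

# Synthetic squared errors (an assumption; stands in for (predictions - labels) ** 2)
rng = np.random.default_rng(7)
squared_errors = rng.normal(loc=0.0, scale=50_000.0, size=4_000) ** 2

confidence = 0.95
n = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)          # standard error of the mean

# Two-sided interval for the mean squared error: mean +/- t_crit * sem
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
manual = np.sqrt([mean - t_crit * sem, mean + t_crit * sem])

# Same interval from scipy, then mapped to RMSE units with a square root
from_scipy = np.sqrt(stats.t.interval(confidence, n - 1, loc=mean, scale=sem))

print(manual)
print(from_scipy)
```

Both approaches give the same bounds; the point estimate of the RMSE, `np.sqrt(mean)`, lies between them.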

# Great Job!
# #shAI_Club