![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises for reviewing second part of end to end ML project.
**Make sure that you read and understand ch2 from the hands-on ML book (page 72 to the end of the chapter ) before start with this notebook.**

**If you stuck with anything reread that part from the book and feel free to ask about anything in the messenger group as you go along.**

 ## Good Luck : )

## first run the following cell for the first part of the project to continue your work 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [27]:
# CODE HERE
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [115]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [116]:
# CODE HERE
some_data = full_pipeline.transform(some_data)
some_predictions = reg.predict(some_data)

# measure this regression model’s RMSE on the whole training set 
* sing Scikit-Learn’s mean_squared_error() function:

In [117]:
from sklearn.metrics import mean_squared_error

In [118]:
# CODE HERE
housing_predictions = reg.predict(housing_prepared)
np.sqrt(mean_squared_error(housing_labels, housing_predictions))

67593.20745775253

# judge on the RMSE result for this model 
write down your answar 

##### The model's result in terrible, if the RMSE values are high, then the model is performing badly.

# Let’s train a Decision Tree Regressor model 
## more powerful model

In [50]:
from sklearn.tree import DecisionTreeRegressor 

In [51]:
# CODE HERE
dtr = DecisionTreeRegressor()
dtr.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [103]:
# CODE HERE
housing_predictions = dtr.predict(housing_prepared)

np.sqrt(mean_squared_error(housing_labels, housing_predictions))

0.0

# Explaine this result 
write down your answar

##### The result here was a 0, this means the model now has a new problem, which is called overfitting. This occurs when the model fit the data perfectly, this is bad because it will not be able to adapt to new data.

# Evaluation Using Cross-Validation

1-split the training set into 10 distinct subsets then train and evaluate the Decision Tree model

In [60]:
from sklearn.model_selection import cross_val_score

In [61]:
# CODE HERE
scores = cross_val_score(dtr, housing_prepared, housing_labels, cv=10)
tree_rmse_scores = np.sqrt(scores)

2- display the resultant scores and calculate its Mean and Standard deviation

In [62]:
# CODE HERE
print("Scores: ", tree_rmse_scores)
print("Mean: ", tree_rmse_scores.mean())
print("Standard Deviation: ", tree_rmse_scores.std())

Scores:  [0.81909212 0.80405862 0.7996684  0.78354894 0.76895983 0.81483531
 0.81258479 0.80624477 0.82364957 0.80529053]
Mean:  0.8037932884561739
Standard Deviation:  0.01573557328240619


3-repaet the same steps to compute the same scores for the Linear Regression  model 

*notice the difference between the results of the two models*

In [66]:
# CODE HERE
scores = cross_val_score(reg, housing_prepared, housing_labels, cv=10)
tree_rmse_scores = np.sqrt(scores)

print("Scores: ", tree_rmse_scores)
print("Mean: ", tree_rmse_scores.mean())
print("Standard Deviation: ", tree_rmse_scores.std())

Scores:  [0.82731844 0.79990011 0.81398731 0.80809671 0.80597549 0.82840812
 0.81976954 0.80137215 0.78126595 0.80815792]
Mean:  0.8094251741564188
Standard Deviation:  0.01331213952748954


## Let’s train one last model the RandomForestRegressor.

In [67]:
# CODE HERE
from sklearn.ensemble import RandomForestRegressor

rfr =RandomForestRegressor()
rfr.fit(housing_prepared, housing_labels)

# repeat the same steps to compute the same scores its Mean and Standard deviation for the Random Forest model

In [68]:
# CODE HERE
scores = cross_val_score(rfr, housing_prepared, housing_labels, cv=10)
tree_rmse_scores = np.sqrt(scores)

print("Scores: ", tree_rmse_scores)
print("Mean: ", tree_rmse_scores.mean())
print("Standard Deviation: ", tree_rmse_scores.std())

Scores:  [0.91379126 0.90032315 0.90374279 0.88614239 0.88895556 0.91461669
 0.90995076 0.89602654 0.90709543 0.90363683]
Mean:  0.9024281412801001
Standard Deviation:  0.009227088576855627


# Save every model you experiment with 
*using the joblib library*

In [69]:
# CODE HERE
import joblib

joblib.dump(reg, 'reg.pkl')
joblib.dump(dtr, 'dtr.pkl')
joblib.dump(rfr, 'rfr.pkl')

['rfr.pkl']

## now you have a shortlist of promising models. You now need to
## fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [70]:
from sklearn.model_selection import GridSearchCV

In [106]:
# CODE HERE
rfr = joblib.load('rfr.pkl')

# n_estimators refer to how many decision trees it will make, max_features refer to how many features it will use
param_grid = {'n_estimators':[70,75,80], 'max_features':[5,6,7]}
    

grid_search = GridSearchCV(
    rfr, 
    param_grid, 
    cv=5, 
    scoring='neg_mean_squared_error', 
    return_train_score=True
)

grid_search.fit(housing_prepared, housing_labels)

with the evaluation scores

In [120]:
# CODE HERE
fine_tuned_model = grid_search.best_estimator_
housing_predictions = fine_tuned_model.predict(housing_prepared)
print(np.sqrt(mean_squared_error(housing_labels, housing_predictions)))
fine_tuned_model

18451.869485878946


# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [121]:
# CODE HERE
rfr_predictions = rfr.predict(housing_prepared)
print('Random Forest Regressor Without Fine-Tunining:', np.sqrt(mean_squared_error(housing_labels, rfr_predictions)))
print('Fine-Tuned Random Forest Regressor:', np.sqrt(mean_squared_error(housing_labels, housing_predictions)))

Random Forest Regressor Without Fine-Tunining: 18613.437421912156
Fine-Tuned Random Forest Regressor: 18451.869485878946


2-display these importance scores next to their corresponding attribute names:

In [152]:
# CODE HERE
feature_importances = rfr.feature_importances_
num_attribs = list(train_set.drop(["median_house_value"],axis = 1).columns)
print('Feature Imporatance for Normal Model')
display(sorted(zip(feature_importances, num_attribs),reverse = True))
print('*'*50)
feature_importances = fine_tuned_model.feature_importances_
num_attribs = list(train_set.drop(["median_house_value"],axis = 1).columns)
print('Feature Imporatance for Fine-Tuned Model')
display(sorted(zip(feature_importances, num_attribs),reverse = True))

Feature Imporatance for Normal Model


[(0.4812093994699939, 'median_income'),
 (0.05743823053295707, 'longitude'),
 (0.05556733602476035, 'latitude'),
 (0.044310188571381165, 'housing_median_age'),
 (0.02627837510833653, 'ocean_proximity'),
 (0.012694094803256364, 'total_rooms'),
 (0.011883116214365391, 'total_bedrooms'),
 (0.011638929090408949, 'population'),
 (0.010360420807134977, 'households')]

**************************************************
Feature Imporatance for Fine-Tuned Model


[(0.3530764243073093, 'median_income'),
 (0.06753792468491401, 'longitude'),
 (0.06234194780150535, 'latitude'),
 (0.05373975029049445, 'ocean_proximity'),
 (0.044408605601651906, 'housing_median_age'),
 (0.018491656440967435, 'total_rooms'),
 (0.01758144284533349, 'population'),
 (0.016927540017085754, 'total_bedrooms'),
 (0.016067072067701183, 'households')]

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [157]:
# CODE HERE
final_test_set = test_set.drop("median_house_value", axis=1)
final_labels = test_set['median_house_value'].copy()

2-run your full_pipeline to transform the data

In [158]:
# CODE HERE
final_test_set = full_pipeline.transform(final_test_set)

3-evaluate the final model on the test set

In [165]:
# CODE HERE
final_predictions = fine_tuned_model.predict(final_test_set)
print('the RMSE value is:')
print(np.sqrt(mean_squared_error(final_labels, final_predictions)))

the RMSE value is:
49436.16057610661


# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [161]:
from scipy import stats

In [176]:
# CODE HERE
stats.t.interval(.95, housing_prepared)

(array([[   -7.7793414 ,            nan, -1791.32942973, ...,
                    nan,            nan,   -12.70620474],
        [  -34.92655874,            nan,    -5.43848507, ...,
                    nan,            nan,   -12.70620474],
        [           nan,            nan,            nan, ...,
                    nan,            nan,   -12.70620474],
        ...,
        [  -68.10766235,            nan,   -75.01248956, ...,
                    nan,            nan,            nan],
        [           nan,   -16.26254651,            nan, ...,
                    nan,            nan,            nan],
        [           nan,   -12.844679  ,    -4.63847718, ...,
                    nan,   -12.70620474,            nan]]),
 array([[   7.7793414 ,           nan, 1791.32942973, ...,           nan,
                   nan,   12.70620474],
        [  34.92655874,           nan,    5.43848507, ...,           nan,
                   nan,   12.70620474],
        [          nan,           nan

# Report

## Machine Learning Life Cycle
The first step in building a machine learning project is to understand the problem statement, determining why are we going to use machine learning and the goals of the project is the most important step in the process.

After we've determined the problem statement we can then move on to the second step which is data collection, this is typically done by data engineers who have the job of making the data ready for data users.

When data collection is done we can move on to the next step which is data preprocessing, this step includes data integration, data transformation, data reduction, and data cleaning, this step is important because data cannot be used if its not processed.

Exploratory Data Analysis (EDA) involves analyzing and visualizing the data to gain insights and better understand its characteristics, and it is the next step in our process, understanding hidden patterns between features in the data is crucial for the next couple of steps.

The next step is feature engineering, which involves selecting, creating, or transforming features (input variables) to improve the performance of the machine learning model. This may include techniques such as feature scaling, dimensionality reduction, feature selection, and creating new features based on domain knowledge.

Now we get to the real deal, after all these steps above we can finally start building the machine learning model. building the model includes selecting the model and training it. We must pick the most appropriate machine learning model for our use case. There are a lot of factors that determine how to pick the model and they include the nature of the problem, the size and complexity of the dataset, and the desired performance metrics.

Moving on to the next step, model evaluation is the next step of the process, once the model is trained, it needs to be evaluated to assess its performance and generalization ability. This involves testing the model on a separate dataset (the test set) that was not used during training. Common evaluation metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC).

The final step of the process is model tuning and optimization. Based on the evaluation results, we may need to fine-tune the model by adjusting hyperparameters or exploring different algorithms or techniques. The goal is to improve the model's performance and address any issues identified during the evaluation.

# Great Job!
# #shAI_Club