# How to Build a Machine Learning Project
Welcome to our first hands-on workshop! We will learn together how to build a Machine Learning Project from scratch. After this workshop, you will be able to:
- Understand the purpose of applying Machine Learning to a specific dataset
- **Frame the problem** as a **Classification** or **Regression** task
- **Manipulate features** of the data in order to better understand them
- Perform **Exploratory Data Analysis (EDA)** and extract meaningful insights
- **Preprocess data** to be used in Machine Learning algorithms
- Choose the **best algorithm** that maximizes a performance metric of choice
- Choose the **best hyperparameters** that maximize model performance
- **Evaluate** the **performance** of the model on **test data**
<hr>



First, let's import the libraries that we will need for this project: <br>
- [NumPy](https://numpy.org/doc/) - for matrix and vector manipulation and operations
- [Pandas](https://pandas.pydata.org/docs/getting_started/overview.html) - for data cleaning and analysis
- [Scikit-Learn](https://scikit-learn.org/stable/) - for building predictive models
- [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) - for visualizing various types of plots


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
RANDOM_SEED=42
np.random.seed(RANDOM_SEED)
pd.options.mode.chained_assignment = None 

## **0. Understand the Problem**
The very first step is to **understand the given problem** and **identify the objective** of the task. Once we know the purpose of the task, we can determine whether the problem can be framed as a Machine Learning problem or not.<br>

---

For the sake of this workshop, the objective of the task is already defined:<br>
### "Given various socio-economic features, build a model to predict housing prices in California using the California census data."<br>
---

## **1. Get the Data**
The first step is to download the dataset and load it into the Colab runtime. We will use the California census dataset published on [StatLib]():



In [None]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then extract housing.csv next to it
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

Let's read the dataset into a DataFrame, the main Pandas data structure, which comes with many helpful functions for data analysis.

In [None]:
data_path=os.path.join(HOUSING_PATH,'housing.csv')
df=pd._______(data_path)

Let's check the dimensions of the dataframe:

In [None]:
____,____=df.____
print("Number of samples:",_____,"\nNumber of features:",_______)

Let's take a look at the data:

In [None]:
df.____()

Each row represents one district and there are 10 attributes:
- longitude
- latitude
- housing_median_age
- total_rooms
- total_bedrooms
- population
- households
- median_income
- ocean_proximity
- **median_house_value (target feature)**

Since we want to predict the **median house value** feature, which is a **continuous variable**, we can frame the problem as a **regression problem**.

Next, let's check some useful information about the features, such as their non-null counts and data types:

In [None]:
df.____()

We notice that:
- There are two different data types: **float64** and **object** (strings in this case)
- The total_bedrooms feature has some missing values that we will need to handle later on
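
The kind of inspection described above can be sketched on a tiny, made-up CSV. The three rows and the name `toy_df` are illustrative assumptions, not part of the housing dataset; only standard Pandas calls are used.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the real housing.csv has many more rows and columns
csv_text = """total_rooms,total_bedrooms,ocean_proximity
880,129,NEAR BAY
7099,,INLAND
1467,190,NEAR BAY
"""
toy_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy_df.shape              # (number of rows, number of columns)
missing_per_column = toy_df.isna().sum()   # number of missing values in each column

print("rows:", n_rows, "columns:", n_cols)
print(toy_df.dtypes)                       # one data type per column
print(missing_per_column)
```

The empty cell in the second row shows up as one missing value in total_bedrooms, which is the same kind of gap we spotted in the real data.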

A very important step before examining the data is to split the data into **train and test data** to avoid **data snooping**.

In [None]:
from ____.__________ import _______
train_df, test_df=______(______,test_size=_____,random_state=_____)
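
As a rough illustration of what a train/test split does, here is a minimal sketch on toy arrays. The array contents, `test_size=0.2` and `random_state=42` are assumptions chosen for the example, not necessarily the values expected in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10)                  # one toy target per sample

# Hold out 20% of the samples; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("train:", X_train.shape, "test:", X_test.shape)
```

Every sample ends up in exactly one of the two sets, so the test set stays unseen until the final evaluation.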

## **2. Explore the Data** 
Next, let's try to go deeper in understanding the features of the dataset:<br>
First, let's start by getting some useful statistics about the features:

In [None]:
train_df._______

To verify our observations, let's visualize the distributions of these features:

In [None]:
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
train_df.___(_____)

The visualizations confirmed our suspicions:
- Each feature has its own range, which would make learning quite slow later on, so we need to rescale all features to speed up learning.
- The total_rooms, total_bedrooms, population and households features are all highly skewed to one side, which means that there are outliers we need to take care of.
- The target feature distribution loosely resembles a Gaussian distribution
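
Besides eyeballing histograms, skew can also be quantified. The sketch below uses synthetic columns (the names `symmetric` and `long_tail` are made up for illustration) and Pandas' built-in `skew()`; a value near 0 suggests a symmetric distribution, while a large positive value suggests a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # roughly bell-shaped column
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
    # column with a long right tail, similar in spirit to count-like features
    "long_tail": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

skewness = toy.skew()                 # sample skewness per column
value_range = toy.max() - toy.min()   # each column lives on its own scale

print(skewness)
print(value_range)
```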

Next, let's investigate the relationship between the location and median house value. <br>
Let's first plot the longitude vs the latitude:

In [None]:
train_df.plot(kind=_____,x=___,y=___,alpha=0.1)

To make the map of California more informative for our task, let's add the median house value to the plot:

In [None]:
train_df.plot(kind=_____,x=_____,y=______,alpha=0.1,c=__________, cmap=plt.get_cmap("jet"),colorbar=True) 

We notice that the closer a district is to the ocean, the higher its median house value, which highlights the importance of the longitude and latitude features.
<br>

Let's now investigate the categorical feature "ocean_proximity":



In [None]:
plt._____(train_df[________])

We notice that there are 5 unique classes: the most frequent one is <1H OCEAN and the least frequent one is ISLAND.

Next, let's investigate the correlations between the features:

In [None]:
corr=train_df._____
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(______, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

We notice that:
- There is a strong correlation between the target feature and the spatial location (longitude and latitude), confirming our previous observations.
- There is also a strong correlation between the target feature and median income.
- Note that there is a strong correlation between households and total_rooms, population and households, and total_rooms and total_bedrooms. We might use these correlations to craft new features later on.
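
To illustrate why ratios of correlated features can be more informative than the raw counts, here is a small sketch on synthetic data. The data-generating process (a target driven by rooms per household) is an assumption made up for this example and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
households = rng.integers(100, 1000, size=n).astype(float)
rooms_per_household = rng.uniform(3.0, 8.0, size=n)

toy = pd.DataFrame({
    "households": households,
    "total_rooms": households * rooms_per_household,
})
# Hypothetical target that depends on the ratio, not on the raw counts
toy["target"] = 50_000 * rooms_per_household + rng.normal(0, 10_000, size=n)
# Crafted feature: a ratio of two strongly correlated raw features
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]

# Pearson correlation of every column with the target, strongest first
corr_with_target = toy.corr()["target"].sort_values(ascending=False)
print(corr_with_target)
```

In this toy setting the crafted ratio correlates with the target far more strongly than either raw count does.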

Let's craft a few new features from these correlated ones and find their correlation with the target value:

In [None]:
train_df["rooms_per_household"]=train_df[__________]/train_df[__________]
train_df["bedrooms_per_total"]=train_df[___________]/train_df[___________]
train_df["households_per_population"]=train_df[________]/train_df[__________]

In [None]:
corr=train_df._____
corr[___________].___________

Neat! We notice that bedrooms_per_total, rooms_per_household, and households_per_population have higher correlation with the target attribute than the original features. This means that these features would be a great addition to our training features.
<br>
Let's remove them from the training features for the moment; we will add similar features back inside the preprocessing pipeline:

In [None]:
train_df=train_df.drop(["rooms_per_household","bedrooms_per_total","households_per_population"],axis=1)

## **3. Prepare the Data**

After analyzing the data, let's recap the steps we need to do:
- Fill the missing values in the total_bedrooms feature
- Encode the ocean_proximity feature into a numerical representation
- Standardize all the numerical features 

In [None]:
y_train_df=train_df[___________].copy()
x_train_df=train_df.drop(_________,axis=1)

### Dealing with Missing Values

In [None]:
from ___________ import __________
impute=________(strategy=_____)
train_df_num=x_train_df.drop("ocean_proximity",axis=1)
impute._____(________)

Let's make sure the imputer has really fitted the data:

In [None]:
impute.statistics_

In [None]:
train_df_num.median()

Both median values match! Let's fill the missing values now:

In [None]:
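# The imputer returns a NumPy array, so wrap it back into a DataFrame before checking it
train_df_num=pd.DataFrame(impute._________(__________),columns=train_df_num.columns,index=train_df_num.index)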

In [None]:
train_df_num.isna().sum()

Great, we made sure all missing values are filled. <br>
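
For reference, median imputation can be sketched on a four-row toy table. The values and the name `toy_num` are made up for illustration; the calls are the standard `SimpleImputer` API.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)   # learns one median per column, ignoring missing values

# transform returns a NumPy array, so we rebuild a DataFrame around it
filled = pd.DataFrame(imputer.transform(toy_num),
                      columns=toy_num.columns, index=toy_num.index)

print(imputer.statistics_)   # the learned medians
print(filled)
```

The learned medians live in `statistics_`, and the same values are reused whenever new data is transformed.
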
### Dealing with Categorical Values
Next, let's transform the "ocean_proximity" feature into a numerical feature:

In [None]:
train_cat_df=train_df["ocean_proximity"].copy()

In [None]:
from _____________ import ___________
ohe=___________(sparse_output=False) # on scikit-learn < 1.2, use sparse=False instead
ohe.fit(_________.values.reshape(-1,1))

In [None]:
train_cat_ohe=ohe._________(_________.values.reshape(-1,1))
train_cat_ohe
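
To see what one-hot encoding produces, here is a minimal sketch on four made-up category values. It calls `.toarray()` on the default sparse output instead of passing a dense-output flag, simply because that works the same way across scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Four toy category values, shaped as a single column
categories = np.array(["INLAND", "NEAR BAY", "INLAND", "ISLAND"]).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it a dense NumPy array
one_hot = encoder.fit_transform(categories).toarray()

print(encoder.categories_[0])   # learned categories, in sorted order
print(one_hot)                  # one column per category, exactly one 1 per row
```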

### Scaling features
As we discussed before, let's standardize all numerical features:

In [None]:
from _____________ import ___________
std_scaler=__________()
train_df_num_scaled=std_scaler.___________(________)
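
Standardization simply subtracts each column's mean and divides by its standard deviation. The sketch below checks this by hand on a toy matrix (the numbers are made up) against `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same computation by hand; np.std defaults to the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, manual))
```

After scaling, both columns have zero mean and unit variance, so neither dominates just because of its units.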

### Creating the Pipeline

Now that we have gone through each of the previous steps individually, let's chain them into a pipeline for our model:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# Column positions of the raw numerical features in the feature matrix
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
  """Appends three ratio features to a NumPy feature matrix."""
  def fit(self, X, y=None):
    return self # nothing to learn

  def transform(self, X, y=None):
    rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
    population_per_household = X[:, population_ix] / X[:, households_ix]
    bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
    # Stack the new columns to the right of the original ones
    return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]

attrib_adder = CombinedAttributesAdder()
# Use the numerical columns only, so that the result is a numeric array
train_extra_attribs = attrib_adder.transform(train_df_num.values)

In [None]:
from __________ import ________
num_pipeline=_______([("imputer",_______),
                        ("extra_attribs",_______________),
                      ("std_scaler",_______________),
                      ])

Let's combine the num_pipeline and the categorical pipeline into one pipeline:

In [None]:
from sklearn.compose import ColumnTransformer
num_attributes=___________.________
cat_attributes=[______________]

full_pipeline=ColumnTransformer([
                                 ("num",__________,num_attributes),
                                 ("cat",__________,cat_attributes),
                              ])

Now, we can preprocess the original dataset:

In [None]:
x_train_prepared=full_pipeline.________________(______________)
y_train=__________.___________
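
How the pieces fit together can be sketched end to end on a toy table. This is a simplified illustration under stated assumptions: the column values are made up, the custom feature-adding step is left out, and `sparse_threshold=0` is set only to force a dense array for easy printing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})
num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

# Numerical columns: fill missing values, then standardize
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route each group of columns to its own transformer
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
], sparse_threshold=0)   # always return a dense array

prepared = preprocess.fit_transform(toy)
print(prepared.shape)    # 2 scaled numeric columns + 3 one-hot columns
print(prepared)
```

Wrapping everything in one object means the exact same fitted steps can later be applied to new data with a single call.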

## **5. Short-List Promising Models**
Now that we have prepared our data, it is time to train the machine learning models!


In [None]:
from ______________ import _______________
lin_model=_____________()
lin_model.___(_______,_________)

Let's evaluate some of the predictions:

In [None]:
labels=y_train[:5]
sampled=x_train_prepared[:5,:]
print("Labels:",labels)

In [None]:
pred_labels=________.________(_________)
print("Predicted labels:",pred_labels)

Not bad at all! However, the predicted values are not very close to the actual values. We need some kind of performance metric to measure how close we are to the actual values.<br>
One useful metric in regression is the Root Mean Square Error (RMSE):
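
For $n$ samples with true values $y_i$ and predictions $\hat{y}_i$, the RMSE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$. Here is a minimal NumPy sketch of that formula; the helper name `rmse` and the sample numbers are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

toy_rmse = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print("RMSE:", toy_rmse)
```

Because the errors are squared before averaging, a single large miss (here the error of 30) weighs more than several small ones, and the result is expressed in the same units as the target.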

In [None]:
from ___________ import ___________
pred_labels=________._______(__________)
mse=____________(__________,__________)
print("RMSE:",np.sqrt(mse))

Okay, this score is not that good! A large error on the training data itself suggests that this simple model is underfitting. Let's try a more powerful model like Decision Trees:

In [None]:
from __________ import ________
dtr_model=______________
dtr_model.____(______,________)

In [None]:
from ___________ import ___________
pred_labels=________._______(__________)
mse=____________(__________,__________)
print("RMSE:",np.sqrt(mse))

Clearly something is off: is this model really perfect? Let's double-check with a more robust evaluation technique, cross-validation:

In [None]:
from ___________ import __________
scores=___________(dtr_model,x_train_prepared,y_train,cv=_____,scoring=________)
scores=np.sqrt(-scores)

In [None]:
print("mean rmse:",np.mean(scores))
print("std:",np.std(scores))

Now we know that this model is actually a bit worse than the linear model, which suggests that it is overfitting the training data. Let's now try a better model like Random Forests. Such models, called ensemble models, usually perform better than simpler ones:

In [None]:
from sklearn.ensemble import RandomForestRegressor
rand_model=RandomForestRegressor()

In [None]:
from sklearn.model_selection import cross_val_score
scores=cross_val_score(rand_model,x_train_prepared,y_train,cv=5,scoring="neg_mean_squared_error")
scores=np.sqrt(-scores)

In [None]:
print("mean rmse:",np.mean(scores))
print("std:",np.std(scores))

mean rmse: 50447.002014296806
std: 702.8995519311321


Wow! The model has improved drastically!
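
To see why cross-validation was needed in this section, here is a small sketch on synthetic regression data. The dataset from `make_regression` and all parameter values are assumptions for illustration. An unconstrained decision tree memorizes its training set, so only the cross-validated error is informative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the prepared housing features
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)
# Error measured on the very same data the tree was trained on
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# 5-fold cross-validation: each fold is scored by a tree that never saw it
neg_mse = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse)   # scores are negated MSE values, so flip the sign

print("train RMSE:", train_rmse)
print("CV RMSE mean:", cv_rmse.mean(), "std:", cv_rmse.std())
```

The gap between the two numbers is a direct picture of overfitting.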

## **6. Fine Tune the System and Test**

This step is usually followed by the hyperparameter tuning stage, where we tune the model's hyperparameters to achieve better results.
For more information, check the [Grid Search](https://scikit-learn.org/stable/modules/grid_search.html) approach in Scikit-Learn.

In [None]:
from ____________ import _____________
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = ___________()
grid_search = __________(forest_reg, param_grid, cv=5,
                         scoring='neg_mean_squared_error',
                         return_train_score=True)

grid_search.fit(____________,_________)

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
  print(np.sqrt(-mean_score), params)

In [None]:
grid_search.best_params_
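
A grid search simply tries every combination in the grid and keeps the one with the best cross-validated score. Here is a deliberately tiny sketch on synthetic data; the grid values, `cv=3` and the dataset are assumptions picked to keep it fast, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation
small_grid = {"n_estimators": [5, 20], "max_features": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

n_combinations = len(search.cv_results_["params"])
best_rmse = (-search.best_score_) ** 0.5   # best_score_ is a negated MSE

print("combinations tried:", n_combinations)
print("best params:", search.best_params_)
print("best CV RMSE:", best_rmse)
```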

Now it is time to test our model! We have seen previously that Random Forests perform best among the models we tried, so we will use the tuned Random Forest for the evaluation on the test data.

In [None]:
final_model=grid_search.best_estimator_

X_test = test_df.drop("median_house_value", axis=1)
y_test = test_df["median_house_value"].copy()
X_test_prepared = full_pipeline.__________(________)
final_predictions = final_model.________(__________)
final_mse = _________(y_test, final_predictions)
final_rmse = np.sqrt(final_mse) 

In [None]:
print("final rmse:",final_rmse)
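
One detail worth keeping in mind at test time: preprocessing steps should be fitted on the training data only and then merely applied to the test data. Here is a minimal sketch with a scaler on synthetic data (all values are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(200, 1))
X_test = rng.normal(loc=10.0, scale=2.0, size=(50, 1))

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Apply the *training* statistics to the test data; no refitting here
X_test_scaled = scaler.transform(X_test)

print("mean learned from train:", scaler.mean_)
print("mean of scaled test data:", X_test_scaled.mean())
```

The scaled test data does not need to have a mean of exactly zero, and that is fine: the test set is treated like any future, unseen data.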