# Improving our machine learning models

**Factors to consider**:
 - Variables or Features
 - Models to use
 - Arguments to use within the models

## Hyperparameter Optimisation

**Hyperparameter optimisation** (also known as **hyperparameter tuning** or **model tuning**) is an iterative process in which we search for the hyperparameter configuration that gives the best model performance. Hyperparameters are the settings we choose before training (the arguments passed to the model), as opposed to the parameters the model learns from the data.
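To make the distinction concrete, here is a minimal sketch on made-up toy data (the names `X_toy`, `y_toy` and `stump` are illustrative and not part of this notebook). It shows a value we choose ourselves (`max_depth`) next to a value the model learns during `fit` (the split threshold).

```python
from sklearn.tree import DecisionTreeRegressor

# Toy data, made up for illustration: one feature, four rows.
X_toy = [[1], [2], [3], [4]]
y_toy = [10, 12, 30, 32]

# max_depth is a hyperparameter: we choose it before training.
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
chosen_max_depth = stump.get_params()["max_depth"]

# The split threshold is a learned parameter: fit() finds it from the data.
stump.fit(X_toy, y_toy)
learned_threshold = stump.tree_.threshold[0]

print("Chosen max_depth:", chosen_max_depth)
print("Learned split threshold:", learned_threshold)
```

Hyperparameter optimisation only ever changes values of the first kind; the second kind is recomputed every time the model is refitted.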

# Importing Relevant Libraries

In [1]:
# Analysis
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

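# The bare 'seaborn' style name is no longer available in recent Matplotlib
# releases, so the seaborn theme is applied through seaborn itself.
sns.set_theme()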


# Models
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Metrics evaluation 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

import warnings
warnings.filterwarnings('ignore') #ignore warnings

%matplotlib inline

In [5]:
# Path of the file to read
iowa_file_path = r'/Users/BarryColleary1/Box/Ex_PY/ExFiles/Jupyter-Notebook-file-Hyperparameter-optimisation-/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)

In [6]:
home_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
y = home_data['SalePrice']

In [8]:
# Getting Description
y.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

# Independent Variables


Now you will create a DataFrame called `X` holding the predictive features:
 - LotArea - *Lot size in square feet*
 - YearBuilt - *Original construction date*
 - 1stFlrSF - *First Floor square feet*
 - 2ndFlrSF - *Second floor square feet*
 - FullBath - *Full bathrooms above grade*
 - BedroomAbvGr - *Number of bedrooms above grade*
 - TotRmsAbvGrd - *Total rooms above grade (does not include bathrooms)* 


In [9]:
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',  'FullBath' , 'BedroomAbvGr', 'TotRmsAbvGrd']

# Select data corresponding to features in feature_names
X = home_data[feature_names]


In [10]:
## Split into validation and training data 

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.25, random_state=1)

In [11]:
# Make validation predictions and calculate mean absolute error (default, unrestricted tree)
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model = iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_depth:", round(val_mae))

# Limiting the depth of the tree with max_depth=8
iowa_model_2 = DecisionTreeRegressor(max_depth=8, random_state=1)
iowa_model_2 = iowa_model_2.fit(train_X, train_y)
val_predictions_2 = iowa_model_2.predict(val_X)
val_mae_2 = mean_absolute_error(val_y, val_predictions_2)
print("Validation MAE with max_depth=8:", round(val_mae_2))


# Using a random forest with default hyperparameters
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
pred = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, pred)
print("Validation MAE for Random Forest Model:", round(rf_val_mae))

Validation MAE when not specifying max_depth: 29653
Validation MAE with max_depth=8: 26376
Validation MAE for Random Forest Model: 21857


## Machine Learning Models 

 - **Decision Tree** – A tree-based algorithm that finds patterns in data by learning a sequence of decision rules.

 - **Random Forest** – A bagging method that relies on the ‘wisdom of crowds’ effect. It trains many decision trees independently of one another (so they can be built in parallel) and aggregates their predictions into a single outcome. *A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.*

 - **Gradient Boosting Machines** – A boosting method that builds decision trees in series. Each tree is fitted to the errors left by the trees before it, and its predictions are added to theirs to correct those errors.
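The "in series, additively" idea behind boosting can be sketched by hand. The block below is a simplified squared-error illustration on synthetic data, not the implementation used by any particular library; `X_demo`, `y_demo`, the learning rate of 0.5 and the 20 rounds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for illustration: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.5  # illustrative value
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    # Each new tree is fitted to the errors left by the trees so far.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Its (scaled) predictions are added to the running prediction.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after 20 trees: ", mse_history[-1])
```

The training error falls as trees are added, which is also why boosted models can overfit if the number of rounds is not itself tuned.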

YouTube video explaining Random Forest: https://youtu.be/v6VJ2RO66Ag (stop at 5:50; playback speed 1.25).

![random_forest_img.jpeg](attachment:random_forest_img.jpeg)

Image Reference - https://towardsdatascience.com/understanding-random-forest-58381e0602d2 

# Cross-Validation

So far, we have made these choices in a data-driven way by measuring model quality on a validation (or holdout) set.
  - Train-test split: we typically keep about 20% of the data as a validation set (the split above holds out 25%).

The drawback of this approach is that the score depends on which rows happen to land in the validation set: the model can do well on one randomly selected holdout and perform differently on another.

Cross-validation addresses this:

- We run our modelling process on different subsets of the data to get multiple measures of model quality.
- We could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "folds".
- Each fold takes one turn as the validation set while the other four are used for training, so you end up with 5 different training and validation sets. Averaging over them gives a more reliable estimate of model quality and helps to **detect overfitting** to any single split.
- More generally, this is called k-fold cross-validation.
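The 5-fold procedure described above can be sketched with scikit-learn's `KFold` and `cross_val_score`. The data here is synthetic (`make_regression`) and the forest size of 20 trees is an arbitrary choice to keep the example quick; neither comes from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, made up for illustration.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=1)

# 5 folds: each fold holds out 20% of the rows for validation.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(X_demo)]

model = RandomForestRegressor(n_estimators=20, random_state=1)
scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="neg_mean_absolute_error")

# scikit-learn reports the negated MAE, so flip the sign to read it as an error.
mae_per_fold = -scores

print("Validation rows per fold:", fold_sizes)
print("MAE per fold:", mae_per_fold.round(1))
print("Mean MAE:", mae_per_fold.mean().round(1))
```

The spread of the per-fold values gives a feel for how much a single train-test split could mislead.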

![cross-validation_img.png](attachment:cross-validation_img.png)

Image Reference - https://www.kaggle.com/code/alexisbcook/cross-validation 

The Random Forest hyperparameters tuned in this notebook are:

**`n_estimators`** is the number of trees in the forest. Random Forest is an ensemble method that builds multiple decision trees and then combines them by majority vote (classification) or by averaging their predictions (regression). **More trees generally give better and more stable performance, with diminishing returns, but make training and prediction slower.**

**`max_features`** determines the maximum number of features to consider when looking for the best split.

**`max_depth`** is the maximum depth of each decision tree, that is, the largest number of splits allowed on any path from the root to a leaf. It limits the depth, not the total number of nodes.
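As a quick check of what `max_depth` controls, the sketch below fits one unrestricted tree and one tree with `max_depth=3` on synthetic data (the dataset and variable names are assumptions for illustration) and compares their depth and leaf counts.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, made up for illustration.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=1)

unrestricted = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
shallow = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_demo, y_demo)

print("Unrestricted tree: depth", unrestricted.get_depth(), "leaves", unrestricted.get_n_leaves())
print("max_depth=3 tree:  depth", shallow.get_depth(), "leaves", shallow.get_n_leaves())
```

A tree of depth 3 can have at most 2**3 = 8 leaves, so it can only ever predict 8 distinct values, which is why very small `max_depth` values tend to underfit.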

To optimise and search for the best hyperparameters, we will be using the Grid Search method!

## Grid search

- Choosing the range of your hyperparameters is an iterative process.

- Grid search allows you to test the model at every combination of the ranges specified. 
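To see what "every combination" means in practice, this sketch enumerates a grid with scikit-learn's `ParameterGrid`, using the same three ranges that this notebook searches. The fold count of 5 matches the `cv=5` setting used in the grid search cell.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_features": np.arange(1, 4, 1),    # 1, 2, 3
    "n_estimators": np.arange(10, 60, 20), # 10, 30, 50
    "max_depth": np.arange(1, 4, 1),       # 1, 2, 3
}

combinations = list(ParameterGrid(param_grid))
n_folds = 5
total_fits = len(combinations) * n_folds

print("Combinations to try:", len(combinations))
print("Model fits with 5-fold cross-validation:", total_fits)
print("First combination:", combinations[0])
```

Three values for each of three hyperparameters already means 27 combinations and 135 cross-validation fits, so the cost of a grid search grows quickly as ranges are widened.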

In [12]:
np.arange(10,60,20)

array([10, 30, 50])

In [13]:
from sklearn.model_selection import GridSearchCV
import numpy as np

max_features_range = np.arange(1,4,1)
n_estimators_range = np.arange(10,60,20)
max_depth_range = np.arange(1,4,1)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range, max_depth=max_depth_range)

rf = RandomForestRegressor(random_state=1)

grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error')
#grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)

In [14]:
grid.fit(train_X, train_y)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=1),
             param_grid={'max_depth': array([1, 2, 3]),
                         'max_features': array([1, 2, 3]),
                         'n_estimators': array([10, 30, 50])},
             scoring='neg_mean_absolute_error')

In [15]:
grid.best_params_

{'max_depth': 3, 'max_features': 3, 'n_estimators': 50}

In [16]:
#grid.best_score_
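When reading `best_score_`, keep in mind that the `neg_mean_absolute_error` scorer returns the MAE with its sign flipped, so that a higher score is always better. The sketch below shows this on a small synthetic dataset and a reduced grid; `X_demo`, `y_demo` and `demo_grid` are illustrative names, not objects from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, made up for illustration.
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=1)

demo_grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid={"max_depth": [1, 2, 3], "n_estimators": [10, 30]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the negated MAE averaged over the folds; flip the sign to get the MAE.
best_cv_mae = -demo_grid.best_score_
print("Best parameters:", demo_grid.best_params_)
print("Best cross-validated MAE:", round(best_cv_mae, 1))

# mean_test_score holds one (negated) MAE per parameter combination.
print("Combinations scored:", len(demo_grid.cv_results_["mean_test_score"]))
```

This cross-validated figure is computed on the training data only, so it is not directly comparable to an MAE measured on a separate validation set.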

In [17]:
# Manually chosen settings for comparison (n_estimators=30 here, whereas the grid search picked 50)
rf_model_2 = RandomForestRegressor(random_state=1, max_depth=3, max_features=3, n_estimators=30)
rf_model_2.fit(train_X,train_y)
pred = rf_model_2.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, pred)
print("Validation MAE for Random Forest Model:", round(rf_val_mae))

Validation MAE for Random Forest Model: 30690


In [18]:
# Refit using the best parameters found by the grid search
rf_model_2 = RandomForestRegressor(random_state=1, **grid.best_params_)
rf_model_2.fit(train_X,train_y)
pred = rf_model_2.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, pred)
print("Validation MAE for Random Forest Model:", round(rf_val_mae))

Validation MAE for Random Forest Model: 30325
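Note that the tuned forest (MAE 30325) scores worse than the default forest fitted earlier (MAE 21857). A likely reason is that the grid only allows `max_depth` values of 1 to 3 and at most 50 trees, whereas the scikit-learn defaults grow each tree fully (`max_depth=None`) and use 100 trees, so the search never gets to consider the setting that did best. Grid search can only find the best point inside the ranges it is given. The sketch below shows one way to widen the grid; it runs on synthetic data, and the specific candidate values are assumptions for illustration rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with 7 features, made up for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=1)

wider_grid = {
    "max_depth": [3, 8, None],   # None lets each tree grow fully (the default)
    "max_features": [1, 3, 7],   # 7 = use all features at every split
    "n_estimators": [10, 30],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=wider_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", round(-search.best_score_, 1))
```

Including the default values among the candidates makes it much less likely that the tuned model ends up worse than the untuned one.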


# Two types of ensemble methods

**Bagging** – Trains each model on a different subset of the training data, sampled with replacement, and bases the final output on majority voting (classification) or averaging (regression). For example, Random Forest.

**Boosting** – Combines weak learners into a strong learner by building models sequentially, with each new model trained to correct the errors of the ones before it. For example, XGBoost.
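Bagging can be sketched by hand in a few lines: draw bootstrap samples, fit one tree per sample, and average the predictions. The example below uses synthetic data and 25 trees, both assumptions made for illustration. A real Random Forest additionally restricts the features considered at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for illustration: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

n_trees = 25
n_rows = len(X_demo)
all_predictions = []

for seed in range(n_trees):
    # Bootstrap sample: draw n_rows row indices with replacement.
    idx = rng.integers(0, n_rows, size=n_rows)
    tree = DecisionTreeRegressor(random_state=seed).fit(X_demo[idx], y_demo[idx])
    all_predictions.append(tree.predict(X_demo))

# Regression: average the trees. (Classification would use a majority vote.)
bagged_prediction = np.mean(all_predictions, axis=0)

print("Number of trees:", len(all_predictions))
print("First five bagged predictions:", bagged_prediction[:5].round(2))
```

Because every tree sees a slightly different sample, their individual errors partly cancel out when averaged, which is the "wisdom of crowds" effect in action.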

![bagging%20and%20boosting.png](attachment:bagging%20and%20boosting.png)

Image Reference - https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/#:~:text=Random%20forest%20is%20a%20Supervised,average%20in%20case%20of%20regression.

# References 

 - Predicting House Prices with Machine Learning - https://notebooks.githubusercontent.com/view/ipynb?browser=chrome&color_mode=auto&commit=dd64d6f6af0f41ac3b3aefae503b1d8263c7fbf6&device=unknown&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f64696769706f6469756d2f5265616c2d4573746174652d416e616c797369732d616e642d50726564696374696f6e2f646436346436663661663066343161633362336165666165353033623164383236336337666266362f486f7573655072696365446174616e616c7973697325323670726564696374696f6e2e6970796e62&logged_in=false&nwo=digipodium%2FReal-Estate-Analysis-and-Prediction&path=HousePriceDatanalysis%26prediction.ipynb&platform=android&repository_id=264623977&repository_type=Repository&version=99 
 
 - Boston house price prediction - https://www.kaggle.com/code/shreayan98c/boston-house-price-prediction/notebook#Random-Forest-Regressor 

- Predicting House Prices with Machine Learning - https://towardsdatascience.com/predicting-house-prices-with-machine-learning-62d5bcd0d68f 

- How to Build a Machine Learning Model - https://towardsdatascience.com/how-to-build-a-machine-learning-model-439ab8fb3fb1

- Understanding Random Forest - https://towardsdatascience.com/understanding-random-forest-58381e0602d2

- Predicting Housing Prices Using Scikit-Learn’s Random Forest Model - https://towardsdatascience.com/predicting-housing-prices-using-a-scikit-learns-random-forest-model-e736b59d56c5

- Understanding Random Forest - https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/#:~:text=Random%20forest%20is%20a%20Supervised,average%20in%20case%20of%20regression. 

