# 5. Improving the Model


The first model you build is the baseline model, and its predictions are the baseline predictions.

Every model after the first one should try to improve on that baseline. There are two main ways to do this:
#### 1. Data Perspective
* Data quantity: Could we collect more data? (Generally, the more data the better.)
* Data quality: Could we improve our data? (For example, by adding more features.)
#### 2. Model Perspective
* Is there a better model we could use? (See the Sklearn ML map.)
* Could we improve the current model? (Hyperparameter tuning.)
    
#### Hyperparameters vs. Parameters
* Parameters - patterns the model finds in the data on its own during training.
* Hyperparameters - settings on the model that you can adjust to (potentially) improve its ability to find patterns.
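
To make the distinction concrete, here is a minimal sketch. It uses a small synthetic dataset from `make_classification` (an assumption for illustration only, not the data used in this notebook), and the names `X_demo`, `y_demo` and `demo_clf` are hypothetical. Hyperparameters are available before fitting; the learned trees and feature importances only exist after `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only (not the notebook's datasets)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Hyperparameters: settings we choose ourselves, before training
demo_clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
print("Hyperparameter n_estimators:", demo_clf.get_params()["n_estimators"])
print("Hyperparameter max_depth:", demo_clf.get_params()["max_depth"])

# Parameters: what the model learns from the data during fit()
demo_clf.fit(X_demo, y_demo)
print("Number of fitted trees:", len(demo_clf.estimators_))
print("Learned feature importances:", demo_clf.feature_importances_)
```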

#### Three ways to adjust hyperparameters
1. By hand
2. Randomly with `RandomizedSearchCV`
3. Exhaustively with `GridSearchCV`
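
The three approaches can be sketched side by side as follows. This is only an illustration under stated assumptions: the data is synthetic, the grid `demo_grid` is deliberately tiny so it runs quickly, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# A deliberately small, hypothetical search space (2 x 2 = 4 combinations)
demo_grid = {"n_estimators": [10, 20], "max_depth": [None, 3]}

# 1. By hand: pick the values yourself and fit
hand_clf = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=42)
hand_clf.fit(X_demo, y_demo)

# 2. Randomly: try n_iter randomly sampled combinations from the grid
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=demo_grid,
                                   n_iter=2,
                                   cv=3,
                                   random_state=42)
random_search.fit(X_demo, y_demo)

# 3. Exhaustively: try every combination in the grid
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid=demo_grid,
                           cv=3)
grid_search.fit(X_demo, y_demo)

print("Random search tried:", len(random_search.cv_results_["params"]), "combinations")
print("Grid search tried:", len(grid_search.cv_results_["params"]), "combinations")
print("Best params from grid search:", grid_search.best_params_)
```

The random search only evaluates as many combinations as `n_iter` allows, so it scales better to large search spaces, while the grid search is guaranteed to cover every combination at a higher cost.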

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Get California housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"],columns=housing['feature_names'])

housing_df['target'] = housing['target']
housing_df.head()


from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
np.random.seed(42)

# create x & y
x = housing_df.drop('target',axis=1)
y = housing_df['target']

# split the data
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

#create model
model = RandomForestRegressor()
#fit model
model.fit(x_train,y_train)

y_preds = model.predict(x_test)

# evaluate the model using evaluation functions
print("Regression metrics on the test set")
print(f"R2: {r2_score(y_test,y_preds)}")
print(f"MAE: {mean_absolute_error(y_test,y_preds)}")
print(f"MSE: {mean_squared_error(y_test,y_preds)}")


Regression metrics on the test set
R2: 0.8066196804802649
MAE: 0.3265721842781009
MSE: 0.2534073069137548


In [13]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

heart_disease = pd.read_csv('./../data/heart-disease.csv')

# create x & y
x = heart_disease.drop('target',axis=1)
y = heart_disease['target']

# split the data
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

#create model
clf = RandomForestClassifier()

#fit model
clf.fit(x_train,y_train)

y_preds = clf.predict(x_test)

# evaluate the model using evaluation functions
print("Classification metrics on the test set")
print(f"Accuracy: {accuracy_score(y_test,y_preds)*100:0.2f}%")
print(f"Precision: {precision_score(y_test,y_preds)}")
print(f"Recall: {recall_score(y_test,y_preds)}")
print(f"F1: {f1_score(y_test,y_preds)}")

Classification metrics on the test set
Accuracy: 85.25%
Precision: 0.8484848484848485
Recall: 0.875
F1: 0.8615384615384615


In [12]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [11]:
# List all the hyperparameters of a model
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Random Forest Hyperparameters - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

<img src = "images/sklearn-hyperparameter-tuning-oven.png">

### 5.1 Hyperparameter Tuning - By Hand

<img src = 'images/hyper-tunning-hands.png'>

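Let's make 3 sets: training, validation and test. The model is fit on the training set, hyperparameters are tuned on the validation set, and the final model is evaluated on the test set.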
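
One possible way to produce the three sets is to call `train_test_split` twice. The sketch below assumes a 70/15/15 split (the proportions are an assumption, not something fixed by this notebook) and uses a small synthetic DataFrame, `demo_df`, as a stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (100 rows, one feature, binary target)
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({"feature": rng.normal(size=100),
                        "target": rng.integers(0, 2, size=100)})

X_demo = demo_df.drop("target", axis=1)
y_demo = demo_df["target"]

# First split: 70% training, 30% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Second split: halve the held-out data into validation (15%) and test (15%)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_valid), len(X_test))
```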

#### Which hyperparameters should we adjust?
* Take suggestions from the sklearn documentation.

We will adjust the following hyperparameters:
* `max_depth`
* `max_features`
* `min_samples_leaf`
* `min_samples_split`
* `n_estimators`
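
As a rough sketch of what adjusting these by hand looks like, each hyperparameter is passed as a keyword argument when the model is created. The specific values below are arbitrary placeholders chosen for illustration, not recommendations, and `hand_tuned_clf` is a hypothetical name.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; in practice you would try several settings
# and compare their scores on the validation set
hand_tuned_clf = RandomForestClassifier(max_depth=10,
                                        max_features="sqrt",
                                        min_samples_leaf=2,
                                        min_samples_split=4,
                                        n_estimators=200)

# Confirm the settings were picked up by the model
for name in ["max_depth", "max_features", "min_samples_leaf",
             "min_samples_split", "n_estimators"]:
    print(f"{name}: {hand_tuned_clf.get_params()[name]}")
```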

In [None]:
# Create a helper function so evaluation is repeatable during hyperparameter tuning
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_preds(y_true, y_preds):
    """
    Performs an evaluation comparison of y_true labels vs. y_preds labels
    for a classification model and returns the metrics as a dictionary.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)

    # Rounded copies of each metric, handy for comparing experiments later
    metric_dict = {'accuracy': round(accuracy, 2),
                   'precision': round(precision, 2),
                   'recall': round(recall, 2),
                   'f1': round(f1, 2)}

    print(f"Accuracy: {accuracy*100:0.2f}%")
    print(f"Precision: {precision:0.2f}")
    print(f"Recall: {recall:0.2f}")
    print(f"F1 Score: {f1:0.2f}")

    return metric_dict