# Improving the Model

As the Random Forest model has shown the best performance, we'll try to improve it.

In [1]:
import pandas as pd
df = pd.read_csv('output/spam_email.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### Train-Test Split

In [52]:
# splitting the data into train and test samples
from sklearn.model_selection import train_test_split

predictors = df.drop('spam', axis = 1)
predicted = df['spam']

X_train, X_test, y_train, y_test = train_test_split(predictors, predicted)
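The split above is not seeded (the `np.random.seed(30)` call only appears in the next cell), so each run may produce a different partition and a slightly different score. A minimal sketch of a reproducible, stratified split follows. It assumes a synthetic stand-in for the spam data built with `make_classification`; the sample count, `test_size` and `random_state` values are illustrative choices, not taken from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: 57 features, binary target)
X, y = make_classification(n_samples=1000, n_features=57, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # scikit-learn's default when no size is given
    stratify=y,        # keep the class ratio the same in both parts
    random_state=30,   # fixed seed: the same split on every run
)

print(X_train.shape, X_test.shape)
```

With a fixed `random_state`, any later change in the score can be attributed to the model rather than to a different partition of the data.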

In [None]:
# Random forest model
import numpy as np
np.random.seed(30)

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 1000)
forest.fit(X_train, y_train)

# accuracy on the test set and predicted labels
fscore = forest.score(X_test, y_test)
fy_pred = forest.predict(X_test)

n_errors = (y_test != fy_pred).sum()
print("Number of mislabeled points out of a total %d points : %d "
      % (X_test.shape[0], n_errors))
print("Model score: {0:.2%}".format(fscore))

Number of mislabeled points out of a total 2301 points : 120 
Model score: 94.78%


## Hyperparameter tuning

### Grid Search

Instead of adjusting parameters by hand and checking whether each change helps or hurts, we can use Scikit-Learn's Grid Search to evaluate every combination of the parameter values we give it. First we define a list of parameter dictionaries, then pass it together with the Random Forest model to GridSearchCV, and run the search with the .fit() method. It might take some time, but at the end the best_params_ attribute returns the best-scoring combination among those supplied.
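To get a feel for how long the search might take, we can count the model fits it implies before running it. The sketch below uses `ParameterGrid` from scikit-learn on the same grid as the next cell; the names `n_candidates` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [100, 500, 1000],
     'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False],
     'n_estimators': [100, 500, 1000],
     'max_features': [2, 4, 6, 8]},
]

cv = 5
# Each dictionary expands to 3 * 4 = 12 combinations
n_candidates = len(ParameterGrid(param_grid))
# Every candidate is fitted once per fold, plus one final refit
# on the whole training set (refit=True is the GridSearchCV default)
n_fits = n_candidates * cv + 1

print(n_candidates, n_fits)
```

Since several of these fits involve forests with 1000 trees, passing `n_jobs=-1` to GridSearchCV to use all CPU cores may be worth considering.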

In [67]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [100, 500, 1000],
     'max_features': [2, 4, 6, 8],
    }, 
    {'bootstrap':[False], 
     'n_estimators': [100, 500, 1000],
     'max_features': [2, 4, 6, 8]
    }
]

forest = RandomForestClassifier()

grid_search = GridSearchCV(forest, 
                           param_grid, 
                           cv = 5,
                           return_train_score = True)

grid_search.fit(X_train, y_train)

grid_search.best_params_

{'bootstrap': False, 'max_features': 2, 'n_estimators': 500}

According to the results, {'bootstrap': False, 'max_features': 2, 'n_estimators': 500} is the best parameter combination. We can test a little further:

In [69]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [500, 1000, 2000],
     'max_features': [2, 4, 6, 8],
    }, 
    {'bootstrap':[False], 
     'n_estimators': [500, 1000, 2000],
     'max_features': [2, 4, 6, 8]
    }
]

forest = RandomForestClassifier()

grid_search = GridSearchCV(forest, 
                           param_grid, 
                           cv = 5,
                           return_train_score = True)

grid_search.fit(X_train, y_train)

grid_search.best_params_

{'bootstrap': False, 'max_features': 4, 'n_estimators': 1000}
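The two searches returned different winners, which suggests that several candidates score very close to each other. One way to check this is to look at the full `cv_results_` table rather than only `best_params_`. The following is a minimal sketch on a small synthetic dataset with a reduced grid so that it runs quickly; the data and the grid values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (assumption, stands in for the spam data)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 20], 'max_features': [2, 4]},
    cv=3,
)
grid_search.fit(X, y)

# One row per candidate: mean and spread of the cross-validated accuracy
results = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranking = results[cols].sort_values('rank_test_score')

print(ranking)
```

If the gap in `mean_test_score` between the top candidates is smaller than their `std_test_score`, the ranking between them is probably not meaningful.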

In [83]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 1000, 
                                bootstrap = False, 
                                max_features = 4)

forest.fit(X_train, y_train)

score = forest.score(X_test, y_test)

print('model score', score, "({0:.2%})".format(score))

model score 0.947871416159861 (94.79%)


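We improved our model by 0.01 percentage points (from 94.78% to 94.79%). This gain is too small to distinguish from the run-to-run variation of the random split and of the forest itself.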
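A single test score gives little indication of whether such a small difference is real. A hedged way to compare the two configurations is to cross-validate both and look at the spread of the scores. The sketch below does this on a synthetic dataset with smaller forests so that it runs quickly; the data, the number of trees and the names `candidates` and `summary` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the spam data (assumption)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

candidates = {
    # default settings, as in the first model of this section
    'baseline': RandomForestClassifier(n_estimators=50, random_state=0),
    # settings in the spirit of the grid search result
    'tuned': RandomForestClassifier(n_estimators=50, bootstrap=False,
                                    max_features=4, random_state=0),
}

summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on 5 folds
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.2%} +/- {scores.std():.2%}")
```

If the two means differ by less than the standard deviations, the tuned parameters cannot be said to outperform the defaults on this data.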