When we built our model with Naive Bayes, it did a pretty good job on several performance measures. Now let's look at how we might improve on it.

Specifically in this notebook, we will take a look at the following techniques:

* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

Another really useful guide for ensemble methods can be found [in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html).

These ensemble methods use a combination of techniques:

* **Bootstrap the data** passed through a learner (bagging).
* **Subset the features** used for a learner (together with bagging, these are the two random components of random forests).
* **Ensemble learners** together in a way that allows those that perform best in certain areas to have the largest impact (boosting).
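
To make the first two ideas concrete, here is a minimal sketch using only the Python standard library. It is not how scikit-learn implements them internally; the helper names (`bootstrap_sample`, `feature_subset`, `majority_vote`) and the toy row and feature lists are hypothetical, chosen only for illustration. The vote shown is the plain majority vote used in bagging; boosting instead weights each learner's vote, which this sketch does not cover.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

rows = list(range(10))               # hypothetical row indices of a training set
features = ["f0", "f1", "f2", "f3"]  # hypothetical feature names


def bootstrap_sample(items, rng):
    """Draw len(items) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(items) for _ in items]


def feature_subset(names, k, rng):
    """Pick k distinct features at random (the second random component of a random forest)."""
    return rng.sample(names, k)


def majority_vote(votes):
    """Combine the learners' predictions by taking the most common label."""
    return Counter(votes).most_common(1)[0][0]


sample = bootstrap_sample(rows, rng)
subset = feature_subset(features, 2, rng)

print("bootstrap rows:", sample)   # some rows repeat, some are left out
print("feature subset:", subset)
print("ensemble vote:", majority_vote([1, 0, 1]))
```

Each learner in a bagged ensemble would be trained on its own `sample`, and in a random forest it would additionally be restricted to a random `subset` of the features.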


Before moving on to the model-building part, let's finish up the data preprocessing.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [3]:
# read the dataset
data_df = pd.read_table("smsspamcollection/SMSSpamCollection", sep='\t', header=None, names=['label', 'sms_message'])

# mapping the labels
data_df['label'] = data_df['label'].map({'ham':0, 'spam':1})

data_df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# split out the dataset
X_train, X_test, y_train, y_test = train_test_split(data_df['sms_message'], data_df['label'], 
                                                    test_size=0.25, random_state=42)

In [6]:
# initialize the CountVectorizer
count_vector = CountVectorizer()

# fit the training data and return the matrix
training_data = count_vector.fit_transform(X_train)

# only transform the testing data and return the matrix
testing_data = count_vector.transform(X_test)
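
The reason the test data is only transformed, not fitted, is that the vocabulary must come from the training messages alone. The small sketch below illustrates this with made-up messages (the strings and the variable names `train_msgs`, `test_msgs`, and `vectorizer` are assumptions, not part of the dataset): a word that appears only in the test message is simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, for illustration only
train_msgs = ["free prize now", "see you at lunch"]
test_msgs = ["free lunch tomorrow"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training messages and counts their words
train_matrix = vectorizer.fit_transform(train_msgs)

# transform reuses that vocabulary; "tomorrow" was never seen, so it gets no column
test_matrix = vectorizer.transform(test_msgs)

print(sorted(vectorizer.vocabulary_))  # words learned from the training messages
print(test_matrix.toarray())           # counts for the test message over those words
```

Both matrices have the same number of columns, which is what lets a model fitted on `train_matrix` predict on `test_matrix`.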

### This Process Looks Familiar...

In general, there is a five-step process that can be used each time you want to use a supervised learning method (which we actually used above):

1. **Import** the model.
2. **Instantiate** the model with the hyperparameters of interest.
3. **Fit** the model to the training data.
4. **Predict** on the test data.
5. **Score** the model by comparing the predictions to the actual values.
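
As a compact illustration, the five steps can be sketched end to end on a small synthetic dataset. This is only a sketch under assumptions: the data comes from `make_classification` rather than the SMS messages, and the `toy_` names and hyperparameter values are arbitrary choices for the example.

```python
# 1. Import the model (plus helpers for the toy data and the score)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (assumption for this sketch)
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# 2. Instantiate the model with the hyperparameters of interest
toy_clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 3. Fit the model to the training data
toy_clf.fit(toy_X_train, toy_y_train)

# 4. Predict on the test data
toy_pred = toy_clf.predict(toy_X_test)

# 5. Score the model by comparing the predictions to the actual values
toy_accuracy = accuracy_score(toy_y_test, toy_pred)
print("Toy accuracy:", toy_accuracy)
```

Swapping in a different classifier only changes steps 1 and 2; the fit, predict, and score calls stay the same.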

Work through this notebook to apply these five steps to each of the ensemble methods: **BaggingClassifier**, **RandomForestClassifier**, and **AdaBoostClassifier**.

> **Step 1**: First use the documentation to `import` all three of the models.

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

> **Step 2:** Now that we have imported each of the classifiers, `instantiate` each with the hyperparameters.  Later, we will see how to automate the search for the best hyperparameters.  For now, let's get comfortable with the process and our new algorithms.

In [9]:
# Instantiate a BaggingClassifier with:
# 200 weak learners (n_estimators) and everything else as default values
bagging_clf = BaggingClassifier(n_estimators=200)

# Instantiate a RandomForestClassifier with:
# 200 weak learners (n_estimators) and everything else as default values
rf_clf = RandomForestClassifier(n_estimators=200)

# Instantiate an AdaBoostClassifier with:
# 300 weak learners (n_estimators) and a learning_rate of 0.2
adaboost_clf = AdaBoostClassifier(n_estimators=300, learning_rate=0.2)

> **Step 3:** Now that we have instantiated each of our models, `fit` them using the **training_data** and **y_train**.

In [10]:
# Fit your BaggingClassifier to the training data
bagging_clf.fit(training_data, y_train)

# Fit your RandomForestClassifier to the training data
rf_clf.fit(training_data, y_train)

# Fit your AdaBoostClassifier to the training data
adaboost_clf.fit(training_data, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.2, n_estimators=300, random_state=None)

> **Step 4:** Now that we have fit each of our models, we will use each to `predict` on the **testing_data**.

In [11]:
# predict using BaggingClassifier on testing_data
bagging_pred = bagging_clf.predict(testing_data)

# predict using RandomForestClassifier on testing data
rf_pred = rf_clf.predict(testing_data)

# predict using AdaBoostClassifier on testing data
adaboost_pred = adaboost_clf.predict(testing_data)

> **Step 5:** Now that we have made our predictions, compare them to the actual values using the function below for each of our models. This will give us the `score` for how well each model is performing.  It is also useful to show the Naive Bayes model again here for comparison.

In [13]:
from sklearn.naive_bayes import MultinomialNB
# Instantiate our naivebayes model
naive_bayes = MultinomialNB()

# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

# Predict on the test data
predictions = naive_bayes.predict(testing_data)

In [14]:
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (numpy array or pandas series)
    preds - the predictions for those values from some model (numpy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements 
    
    OUTPUT:
    None - prints the accuracy, precision, recall, and F1 score
    '''
    if model_name is None:
        print('Accuracy score: ', format(accuracy_score(y_true, preds)))
        print('Precision score: ', format(precision_score(y_true, preds)))
        print('Recall score: ', format(recall_score(y_true, preds)))
        print('F1 score: ', format(f1_score(y_true, preds)))
        print('\n\n')
    else:
        print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))
        print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))
        print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))
        print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))
        print('\n\n')

In [15]:
# Print Bagging scores
print_metrics(y_test, bagging_pred, 'Bagging')

# Print Random Forest scores
print_metrics(y_test, rf_pred, 'Random Forest')

# Print AdaBoost scores
print_metrics(y_test, adaboost_pred, 'AdaBoost')

# Naive Bayes Classifier scores
print_metrics(y_test, predictions, 'Naive Bayes')

Accuracy score for Bagging : 0.9763101220387652
Precision score Bagging : 0.9322033898305084
Recall score Bagging : 0.8870967741935484
F1 score Bagging : 0.9090909090909092



Accuracy score for Random Forest : 0.9784637473079684
Precision score Random Forest : 1.0
Recall score Random Forest : 0.8387096774193549
F1 score Random Forest : 0.9122807017543859



Accuracy score for AdaBoost : 0.9784637473079684
Precision score AdaBoost : 0.975609756097561
Recall score AdaBoost : 0.8602150537634409
F1 score AdaBoost : 0.9142857142857143



Accuracy score for Naive Bayes : 0.9885139985642498
Precision score Naive Bayes : 0.9775280898876404
Recall score Naive Bayes : 0.9354838709677419
F1 score Naive Bayes : 0.956043956043956



