# TPOT AutoML Text Classifier

This notebook demonstrates an automated pipeline for building a classifier using a package called [TPOT](http://epistasislab.github.io/tpot/). TPOT optimizes sklearn classification pipelines with a genetic algorithm. In particular, it automates the feature selection and model selection (including hyperparameter optimization) phases of a machine learning pipeline. Its output is the Python code for the sklearn pipeline of the best model found.

This notebook demonstrates the use of TPOT and compares its results with those of the simple classifier from the previous notebook.

## Imports and Load Data

In [2]:
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import wordnet, stopwords
from nltk.stem import snowball, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn import svm
import warnings
warnings.filterwarnings('ignore')

from tpot import TPOTClassifier
from tpot.builtins import StackingEstimator
from joblib import Memory  # sklearn.externals.joblib was removed in scikit-learn 0.23



In [3]:
# load data
file_name = "Isla Vista - All Excerpts - 1_2_2019.xlsx"
data = pd.read_excel(file_name, sheet_name='Dedoose Excerpts Export')
print(data.shape)
data = data.dropna(axis=0)
print(data.shape)
print(data.columns)

(8131, 53)
(8127, 53)
Index(['StoryID', 'Excerpt', 'CodesApplied_Combined', 'ACCOUNT',
       'ACCOUNT_Cultural', 'ACCOUNT_Individual', 'ACCOUNT_Other',
       'COMMUNITYRECOVERY', 'EVENT', 'GRIEF', 'GRIEF_Individual',
       'GRIEF_Community', 'GRIEF_Societal', 'HERO', 'INVESTIGATION', 'JOURNEY',
       'JOURNEY_Mental', 'JOURNEY_Physical', 'LEGAL', 'MEDIA', 'MISCELLANEOUS',
       'MOURNING', 'MOURNING_Individual', 'MOURNING_Community',
       'MOURNING_Societal', 'PERPETRATOR', 'PHOTO', 'POLICY', 'POLICY_Guns',
       'POLICY_InfoSharing', 'POLICY_MentalHealth', 'POLICY_Other',
       'POLICY_VictimAdv', 'POLICY_OtherAdv', 'POLICY_Practice',
       'PRIVATESECTOR', 'RACECULTURE', 'RESOURCES', 'SAFETY',
       'SAFETY_Community', 'SAFETY_Individual', 'SAFETY_SchoolOrg',
       'SAFETY_Societal', 'SOCIALSUPPORT', 'THREAT', 'THREAT_Assessment',
       'TRAUMA', 'TRAUMA_Physical', 'TRAUMA_Psychological',
       'TRAUMA_Individual', 'TRAUMA_Community', 'TRAUMA_Societal', 'VICTIMS'],
    

## Prepare Tokenizer

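The stemming tokenizer preprocesses each excerpt by tokenizing it, stemming each token, dropping tokens that contain non-letter characters, and lower-casing the rest. Stop words are left unstemmed by the stemmer (`ignore_stopwords=True`) but are not removed at this stage; they are removed later by the count vectorizer.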

In [4]:
excerpts = list(data['Excerpt'])
def stem_tokenizer(doc):
    # Tokenize, stem, then keep only purely alphabetic tokens, lower-cased.
    # Stop words are not removed here; ignore_stopwords=True only leaves them unstemmed.
    tokens = word_tokenize(doc)
    stemmer = snowball.SnowballStemmer("english", ignore_stopwords=True)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    list_tokens = [tok.lower() for tok in stemmed_tokens if tok.isalpha()]
    return ' '.join(list_tokens)
print("original: "+str(excerpts[3]))
print(stem_tokenizer(excerpts[3]))

original: A 22-year-old student last Friday killed six people and wounded 13 more in Isla Vista before turning his gun on himself. Commenters 
blamed the killer’s crimes on everything from misogynistic “pickup artist philosophy” to easy access to guns and no-fault divorce. Even 
“nerd culture” has come under scrutiny. 

Is American culture to blame for mass murder? 
a student last friday kill six peopl and wound more in isla vista before turn his gun on himself comment blame the crime on everyth from misogynist artist to easi access to gun and divorc even has come under scrutini is american cultur to blame for mass murder
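
The output above is missing tokens such as "22-year-old" and "13". A minimal sketch of why, assuming the tokenizer hands over tokens like the hand-written list below (this list is an illustration, not actual NLTK output): `str.isalpha()` is true only when every character is a letter, so anything containing a hyphen, digit, apostrophe, or quote mark is dropped whole.

```python
# Hand-written tokens for illustration; not produced by word_tokenize.
tokens = ["A", "22-year-old", "student", "killer’s", "crimes", "13", "guns", "."]

# Same filter as in stem_tokenizer: keep purely alphabetic tokens, lower-cased.
kept = [tok.lower() for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if not tok.isalpha()]

print("kept:   ", kept)
print("dropped:", dropped)
```

Note that the filter discards the whole token rather than stripping the offending character, so a word attached to punctuation is lost entirely.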


## Create Count Vectorizer

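The stemmed text is then passed to a simple count vectorizer, which represents each document as a vector of counts of the words it contains. The vocabulary is limited to at most 1,500 terms that appear in at least 5 documents and in no more than 70% of documents, with English stop words removed.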

In [5]:
# stem + count
docs = [stem_tokenizer(doc) for doc in excerpts]
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
stem_count_X = vectorizer.fit_transform(docs).toarray() 
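
To make the count representation concrete, here is a small pure-Python sketch of the same idea on three made-up documents. It only mimics the `min_df` filter (with a threshold of 2 instead of 5) and leaves out `max_df`, `max_features`, and stop-word removal; the documents and variable names are assumptions for illustration, not part of the notebook's data.

```python
from collections import Counter

# Three made-up, already-stemmed documents.
docs = [
    "gun access blame gun",
    "student blame cultur",
    "gun student student",
]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Keep terms that appear in at least min_df documents.
min_df = 2
vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)

# One row per document, one column per vocabulary term, holding raw counts.
count_matrix = [[Counter(doc.split())[term] for term in vocabulary] for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Terms that occur in only one document ("access", "cultur") fall out of the vocabulary, which is how `min_df` keeps rare words from inflating the feature space.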

## TPOT AutoML Classifier

### TPOT Run 1

The count vectors for each document are then passed to the TPOT AutoML pipeline, with the `ACCOUNT` label as the target. This is a quick run of only 5 generations to test the method and see whether it improves on the simple classifier implemented previously.

In [None]:
# # the following code takes a couple hours to run
# docs_train, docs_test, y_train, y_test = train_test_split(stem_count_X, list(data['ACCOUNT']),
#                                                           test_size=0.2, random_state=0) 

# tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, config_dict='TPOT sparse',
#                      n_jobs = 1, periodic_checkpoint_folder='tpot_checkpoints', cv = 3, memory = 'auto')
# tpot.fit(docs_train, y_train)

`Generation 1 - Current best internal CV score: 0.9033967724241256
Generation 2 - Current best internal CV score: 0.9033967724241256
Generation 3 - Current best internal CV score: 0.9037070657347226
Generation 4 - Current best internal CV score: 0.9081677787975023
Generation 5 - Current best internal CV score: 0.9081677787975023`

`Best pipeline: RandomForestClassifier(BernoulliNB(MultinomialNB(input_matrix, alpha=10.0, fit_prior=True), alpha=1.0, fit_prior=False), bootstrap=False, criterion=gini, max_features=0.15000000000000002, min_samples_leaf=13, min_samples_split=9, n_estimators=100)`
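
The generation log above never decreases because the best candidates are carried forward from one generation to the next. The toy loop below sketches that select-and-mutate idea in plain Python. It is not TPOT's implementation: TPOT evolves whole pipeline structures and scores them by cross-validation, whereas here the `fitness` function, the two made-up parameters (`alpha`, `depth`), and the mutation rule are all assumptions chosen to keep the example small.

```python
import random

random.seed(0)

def fitness(candidate):
    """Stand-in for a cross-validated score; highest at alpha=1.0, depth=8."""
    alpha, depth = candidate
    return 1.0 - 0.05 * abs(alpha - 1.0) - 0.01 * abs(depth - 8)

def mutate(candidate):
    """Return a slightly perturbed copy of a candidate."""
    alpha, depth = candidate
    return (max(0.01, alpha + random.uniform(-0.5, 0.5)),
            max(1, depth + random.choice([-1, 0, 1])))

# Random initial population of 20 candidates.
population = [(random.uniform(0.01, 10.0), random.randint(1, 20)) for _ in range(20)]

best_per_generation = []
for generation in range(5):
    # Keep the best half unchanged, then refill with mutated copies of survivors.
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]
    best_per_generation.append(max(fitness(c) for c in population))

for generation, score in enumerate(best_per_generation, start=1):
    print(f"Generation {generation} - best score: {score:.4f}")
```

Because the survivors are kept unchanged, the best score can stall for a generation (as it does between generations 1 and 2 in the log) but cannot go down.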

In [36]:
print(tpot.score(docs_test, np.array(y_test)))
tpot.export('tpot_pipeline.py')

0.9206642066420664


### Results of TPOT Run 1

TPOT saves its final result as a Python file containing code that reproduces the best pipeline. The code exported from this run is shown below; it was used to fit a model, and the results were analyzed. The model is refit and evaluated on a fresh random 75/25 split (the `train_test_split` default) rather than the 80/20 split used during the search, so the accuracy differs slightly from the score reported above. Note that the pre-configured 'TPOT sparse' setting was used to tailor the optimization to sparse data, which is what the vocabulary count vectors are.

In [22]:
d = {'target': list(data['ACCOUNT']), 'data': list(stem_count_X)}
tpot_data = pd.DataFrame(data=d)

In [28]:
#features = tpot_data.drop('target', axis=1).values
features = stem_count_X
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, data['ACCOUNT'].values, random_state=None)

# Average CV score on the training set was:0.9081677787975023
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=MultinomialNB(alpha=10.0, fit_prior=True)),
    StackingEstimator(estimator=BernoulliNB(alpha=1.0, fit_prior=False)),
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.15000000000000002,
                           min_samples_leaf=13, min_samples_split=9, n_estimators=100))

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
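
The exported pipeline wraps the two naive Bayes models in `StackingEstimator`. As a rough illustration of the stacking idea (not TPOT's own implementation), the sketch below defines a hypothetical `PredictionAppender` transformer that fits a classifier and adds its predictions and class probabilities as extra columns, so the final random forest can use them as features. The class name, the toy data, and the column order are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

class PredictionAppender(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: prepend a classifier's outputs to the features."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in is left untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        pred = self.estimator_.predict(X).reshape(-1, 1)
        proba = self.estimator_.predict_proba(X)
        return np.hstack([pred, proba, X])

# Toy count-like data: 60 documents, 8 terms, binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 8))
y = (X[:, 0] > X[:, 1]).astype(int)

# 8 original columns + 1 prediction column + 2 probability columns.
stacked = PredictionAppender(MultinomialNB()).fit(X, y).transform(X)
print(stacked.shape)

pipeline = make_pipeline(PredictionAppender(MultinomialNB()),
                         RandomForestClassifier(n_estimators=10, random_state=0))
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
```

Chaining two such steps, as the exported pipeline does, means the second naive Bayes model also sees the first model's outputs before everything is handed to the random forest.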

In [32]:
print(confusion_matrix(testing_target, results))  
print(classification_report(testing_target, results))  
print(accuracy_score(testing_target, results))

[[1507   86]
 [  82  357]]
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1593
           1       0.81      0.81      0.81       439

    accuracy                           0.92      2032
   macro avg       0.88      0.88      0.88      2032
weighted avg       0.92      0.92      0.92      2032

0.9173228346456693


The internal CV score improved gradually over the 5 generations of the optimization run, but the resulting model still did not reach the performance of the simple classification model from the previous notebook. Recall that the macro-average F1 score from simple logistic regression was 0.89, one point higher than the 0.88 obtained from the optimized TPOT model.
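
The macro-average F1 quoted here can be recomputed directly from the confusion matrix printed above. The sketch below does this in plain Python, assuming sklearn's convention that rows are true labels and columns are predicted labels; the helper name `f1_per_class` is made up for this example.

```python
def f1_per_class(cm):
    """Per-class F1 from a square confusion matrix (rows = true, columns = predicted)."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                             # actually k, predicted another class
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores

cm_run1 = [[1507, 86],
           [82, 357]]

f1_scores = f1_per_class(cm_run1)
macro_f1 = sum(f1_scores) / len(f1_scores)
total = sum(sum(row) for row in cm_run1)
accuracy = sum(cm_run1[k][k] for k in range(len(cm_run1))) / total

print([round(s, 2) for s in f1_scores], round(macro_f1, 2), round(accuracy, 4))
```

The per-class values round to 0.95 and 0.81 and their unweighted mean to 0.88, matching the classification report.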

### TPOT Run 2

To check whether TPOT can produce an improvement, it was run again for many more generations. This run was executed as a Python script (outside of this notebook) with the following parameters:
`generations=100, population_size=100, verbosity=2, config_dict='TPOT sparse', max_time_mins=1200, max_eval_time_mins=3, scoring='f1_macro', n_jobs=1, periodic_checkpoint_folder='tpot_checkpoints', cv=5, memory='auto'`

Compared with the previous run, the number of generations and the population size were both increased, the number of cross-validation folds was raised from 3 to 5, and time limits were added (`max_time_mins=1200`, `max_eval_time_mins=3`). The scoring function used to select models was also changed from accuracy to macro-averaged F1. The code exported from this run and its results are shown below.
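
One reason to prefer macro F1 over accuracy here is the class imbalance. As a small sketch, assume a degenerate classifier that always predicts class 0 on a test set with the same class counts as the run 1 report (1593 negatives, 439 positives); the helper `f1_for` is a name invented for this example.

```python
y_true = [0] * 1593 + [1] * 439
y_pred = [0] * len(y_true)  # always predict the majority class

def f1_for(label, y_true, y_pred):
    """F1 score for one class, computed from label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (f1_for(0, y_true, y_pred) + f1_for(1, y_true, y_pred)) / 2

print(round(accuracy, 2), round(macro_f1, 2))
```

Accuracy stays near 0.78 even though the minority class is never detected, while macro F1 drops to about 0.44 because each class contributes equally to the average.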

### Results of TPOT Run 2

In [6]:
#features = tpot_data.drop('target', axis=1).values
features = stem_count_X
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, data['ACCOUNT'].values, random_state=None)

# Average CV score on the training set was:0.8846197491808947
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LogisticRegression(C=0.01, dual=False, penalty="l2")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.2, 
                           min_samples_leaf=1, min_samples_split=2, n_estimators=100)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

print(confusion_matrix(testing_target, results))  
print(classification_report(testing_target, results))  
print(accuracy_score(testing_target, results))

[[1486   80]
 [  94  372]]
              precision    recall  f1-score   support

           0       0.94      0.95      0.94      1566
           1       0.82      0.80      0.81       466

    accuracy                           0.91      2032
   macro avg       0.88      0.87      0.88      2032
weighted avg       0.91      0.91      0.91      2032

0.9143700787401575


## Conclusions

The TPOT classification optimization pipeline did not produce a significant improvement over the simple logistic regression classifier, which had achieved a macro-average F1 score of 0.89 and an F1 score of 0.83 for class label 1 (accountability). Both TPOT runs reached a macro-average F1 of 0.88 and an F1 of 0.81 for class label 1.
