# Simple Text Classifiers

This notebook demonstrates a simple approach to text classification. With only minimal pre-processing, linear and ensemble classification models will be tested.

## Imports and Load Data

In [1]:
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import wordnet, stopwords
from nltk.stem import snowball, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm

from tpot import TPOTClassifier
# sklearn.externals.joblib was deprecated and later removed from scikit-learn; import joblib directly.
from joblib import Memory



In [2]:
file_name = "Isla Vista - All Excerpts - 1_2_2019.xlsx"
data = pd.read_excel(file_name, sheet_name='Dedoose Excerpts Export')
print(data.shape)
data = data.dropna(axis=0)
print(data.shape)
print(data.columns)

(8131, 53)
(8127, 53)
Index(['StoryID', 'Excerpt', 'CodesApplied_Combined', 'ACCOUNT',
       'ACCOUNT_Cultural', 'ACCOUNT_Individual', 'ACCOUNT_Other',
       'COMMUNITYRECOVERY', 'EVENT', 'GRIEF', 'GRIEF_Individual',
       'GRIEF_Community', 'GRIEF_Societal', 'HERO', 'INVESTIGATION', 'JOURNEY',
       'JOURNEY_Mental', 'JOURNEY_Physical', 'LEGAL', 'MEDIA', 'MISCELLANEOUS',
       'MOURNING', 'MOURNING_Individual', 'MOURNING_Community',
       'MOURNING_Societal', 'PERPETRATOR', 'PHOTO', 'POLICY', 'POLICY_Guns',
       'POLICY_InfoSharing', 'POLICY_MentalHealth', 'POLICY_Other',
       'POLICY_VictimAdv', 'POLICY_OtherAdv', 'POLICY_Practice',
       'PRIVATESECTOR', 'RACECULTURE', 'RESOURCES', 'SAFETY',
       'SAFETY_Community', 'SAFETY_Individual', 'SAFETY_SchoolOrg',
       'SAFETY_Societal', 'SOCIALSUPPORT', 'THREAT', 'THREAT_Assessment',
       'TRAUMA', 'TRAUMA_Physical', 'TRAUMA_Psychological',
       'TRAUMA_Individual', 'TRAUMA_Community', 'TRAUMA_Societal', 'VICTIMS'],
    

## Prepare Tokenizers

Two tokenizers will be tested. The first takes the simplest approach and stems each word. The second adds some complexity: it removes stop words and uses the WordNet lemmatizer.

In [3]:
excerpts = list(data['Excerpt'])
def stem_tokenizer(doc):
    # Split the excerpt into word and punctuation tokens.
    tokens = word_tokenize(doc)
    # ignore_stopwords=True leaves stop words unstemmed; it does not remove them.
    stemmer = snowball.SnowballStemmer("english", ignore_stopwords=True)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    # Keep purely alphabetic tokens, which drops numbers and punctuation.
    list_tokens = [tok.lower() for tok in stemmed_tokens if tok.isalpha()]
    return ' '.join(list_tokens)
print("original: "+str(excerpts[3]))
print(stem_tokenizer(excerpts[3]))

original: A 22-year-old student last Friday killed six people and wounded 13 more in Isla Vista before turning his gun on himself. Commenters 
blamed the killer’s crimes on everything from misogynistic “pickup artist philosophy” to easy access to guns and no-fault divorce. Even 
“nerd culture” has come under scrutiny. 

Is American culture to blame for mass murder? 
a student last friday kill six peopl and wound more in isla vista before turn his gun on himself comment blame the crime on everyth from misogynist artist to easi access to gun and divorc even has come under scrutini is american cultur to blame for mass murder
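
The stemmed output is missing several words from the original, such as "22-year-old", "13", "no-fault", and "killer". This follows from the `tok.isalpha()` filter: a token containing a digit, a hyphen, or an attached typographic apostrophe or quotation mark is discarded whole. The sketch below illustrates this with a hand-written token list standing in for the output of `word_tokenize`; the exact tokens are an assumption, and `keep_alpha` is a hypothetical helper that copies only the final filtering step.

```python
# Hypothetical tokens; the real output of word_tokenize may differ.
tokens = ["A", "22-year-old", "student", "wounded", "13",
          "killer\u2019s", "crimes", "no-fault", "divorce", "."]

def keep_alpha(tokens):
    """Mimic the final filter in stem_tokenizer: lowercased, alphabetic-only tokens."""
    return [tok.lower() for tok in tokens if tok.isalpha()]

kept = keep_alpha(tokens)
dropped = [tok for tok in tokens if not tok.isalpha()]

print(kept)     # tokens that survive the filter
print(dropped)  # tokens removed because they contain non-letters
```

If hyphenated words or possessives matter for a given code, one option would be to normalize quotes and split on hyphens before filtering, rather than relying on `isalpha()` alone.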


In [4]:
excerpts = list(data['Excerpt'])
def lem_tokenizer(doc):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(doc)
    lemmer = WordNetLemmatizer()
    # Drop stop words, then lemmatize; with no POS tag given, lemmatize() treats each word as a noun.
    lemmed_tokens = [lemmer.lemmatize(word) for word in tokens if word.lower() not in stop_words]
    # Keep purely alphabetic tokens, lowercased.
    list_tokens = [tok.lower() for tok in lemmed_tokens if tok.isalpha()]
    return ' '.join(list_tokens)
print("original: \n"+str(excerpts[3])+str("\n"))
print(lem_tokenizer(excerpts[3]))

original: 
A 22-year-old student last Friday killed six people and wounded 13 more in Isla Vista before turning his gun on himself. Commenters 
blamed the killer’s crimes on everything from misogynistic “pickup artist philosophy” to easy access to guns and no-fault divorce. Even 
“nerd culture” has come under scrutiny. 

Is American culture to blame for mass murder? 

student last friday killed six people wounded isla vista turning gun commenters blamed crime everything misogynistic artist easy access gun divorce even come scrutiny american culture blame mass murder


## Create Vectorizers

The two tokenizers can then be used to create vectorized representations of the excerpts. Two vectorizers will be used: first the count vectorizer, then the TF-IDF vectorizer.
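
To make the difference between the two representations concrete, here is a minimal pure-Python sketch on a toy corpus. The corpus and the helper names `count_vectors` and `tfidf_vectors` are hypothetical. The weighting uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's documented defaults for `TfidfVectorizer`; the notebook itself relies on the library classes, not on this code.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already reduced to space-separated tokens.
docs = ["gun policy gun", "gun culture", "grief community"]

def count_vectors(docs):
    """Return the sorted vocabulary and one raw term-count row per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

def tfidf_vectors(rows):
    """Weight counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    out = []
    for r in rows:
        weighted = [c * w for c, w in zip(r, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        out.append([x / norm for x in weighted])
    return out

vocab, counts = count_vectors(docs)
tfidf = tfidf_vectors(counts)

print(vocab)
print(counts[1])                         # raw counts for "gun culture"
print([round(x, 3) for x in tfidf[1]])   # same document, TF-IDF weighted
```

In the second document both words occur once, so their counts are equal, but "culture" appears in fewer documents than "gun" and therefore receives the larger TF-IDF weight.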

In [5]:
# stem + count
docs = [stem_tokenizer(doc) for doc in excerpts]
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
stem_count_X = vectorizer.fit_transform(docs).toarray() 

In [6]:
# lem + count
docs = [lem_tokenizer(doc) for doc in excerpts]
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
lem_count_X = vectorizer.fit_transform(docs).toarray() 

In [7]:
# stem + tfidf
docs = [stem_tokenizer(doc) for doc in excerpts]
vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
stem_tfidf_X = vectorizer.fit_transform(docs).toarray() 
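
All three vectorizers above share the same vocabulary limits: `min_df=5`, `max_df=0.7`, and `max_features=1500`. The sketch below shows, in plain Python, how such thresholds prune a vocabulary as I understand the scikit-learn semantics: an integer `min_df` is a minimum document count, a float `max_df` is a maximum fraction of documents, and `max_features` keeps the terms with the highest corpus-wide counts. The function `prune_vocabulary`, the toy corpus, and the smaller thresholds are hypothetical, and tie-breaking may differ from the library.

```python
from collections import Counter

def prune_vocabulary(docs, min_df, max_df, max_features):
    """Select terms by document frequency, then keep the most frequent ones.

    min_df is an absolute document count and max_df a fraction of documents,
    mirroring how the notebook passes min_df=5 and max_df=0.7.
    """
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Number of documents containing each term, and total occurrences.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    term_freq = Counter(w for toks in tokenized for w in toks)
    kept = [w for w in doc_freq
            if doc_freq[w] >= min_df and doc_freq[w] <= max_df * n_docs]
    # Keep the max_features terms with the highest corpus-wide count.
    kept.sort(key=lambda w: (-term_freq[w], w))
    return sorted(kept[:max_features])

docs = [
    "the gun debate",
    "the gun policy",
    "the grief",
    "the community grief",
]
vocab = prune_vocabulary(docs, min_df=2, max_df=0.7, max_features=10)
print(vocab)
```

Here "the" is removed for appearing in more than 70% of the documents, and the words that occur in a single document fall below `min_df`, which leaves only the mid-frequency terms.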

## TPOT AutoML Classifier

Test with the TPOT AutoML pipeline. The run time was 181 minutes.

In [None]:
docs_train, docs_test, y_train, y_test = train_test_split(stem_count_X, list(data['ACCOUNT']),
                                                          test_size=0.2, random_state=0) 

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, config_dict='TPOT sparse',
                      n_jobs=1, periodic_checkpoint_folder='tpot_checkpoints', cv=3, memory='auto')
tpot.fit(docs_train, y_train)

`Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Generation 1 - Current best internal CV score: 0.9033967724241256
Generation 2 - Current best internal CV score: 0.9033967724241256
Generation 3 - Current best internal CV score: 0.9037070657347226
Generation 4 - Current best internal CV score: 0.9081677787975023
Generation 5 - Current best internal CV score: 0.9081677787975023`

`Best pipeline: RandomForestClassifier(BernoulliNB(MultinomialNB(input_matrix, alpha=10.0, fit_prior=True), alpha=1.0, fit_prior=False), bootstrap=False, criterion=gini, max_features=0.15000000000000002, min_samples_leaf=13, min_samples_split=9, n_estimators=100)`
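
A rough sense of where the 181 minutes went can be had from the search settings. As I understand TPOT's documentation, a run evaluates `population_size + generations × offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. The arithmetic below is a back-of-the-envelope estimate under that assumption; it ignores caching from `memory='auto'` and any pipelines that fail or time out.

```python
# Settings taken from the TPOTClassifier call above.
generations = 5
population_size = 20
cv_folds = 3
run_minutes = 181

# Assumption: offspring_size was left at its default, equal to population_size.
offspring_size = population_size

pipelines = population_size + generations * offspring_size
fits = pipelines * cv_folds              # one fit per cross-validation fold
minutes_per_pipeline = run_minutes / pipelines

print(pipelines, fits, round(minutes_per_pipeline, 2))
```

Under these assumptions the search covers on the order of a hundred candidate pipelines, so each one costs roughly a minute and a half on a single core (`n_jobs=1`).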

In [36]:
print(tpot.score(docs_test, np.array(y_test)))
tpot.export('tpot_pipeline.py')

0.9206642066420664


In [22]:
d = {'target': list(data['ACCOUNT']), 'data': list(stem_count_X)}
tpot_data = pd.DataFrame(data=d)

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator

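# Use the stemmed count matrix as features.
features = stem_count_X
# Note: this is a fresh random split (default test_size=0.25, no fixed seed), not the
# test_size=0.2, random_state=0 split used for the TPOT search, so the scores below are
# computed on different held-out rows and will vary between runs.
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, data['ACCOUNT'].values, random_state=None)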

# Average CV score on the training set was:0.9081677787975023
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=MultinomialNB(alpha=10.0, fit_prior=True)),
    StackingEstimator(estimator=BernoulliNB(alpha=1.0, fit_prior=False)),
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.15000000000000002,
                           min_samples_leaf=13, min_samples_split=9, n_estimators=100))

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

In [32]:
print(confusion_matrix(testing_target, results))  
print(classification_report(testing_target, results))  
print(accuracy_score(testing_target, results))

[[1507   86]
 [  82  357]]
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1593
           1       0.81      0.81      0.81       439

    accuracy                           0.92      2032
   macro avg       0.88      0.88      0.88      2032
weighted avg       0.92      0.92      0.92      2032

0.9173228346456693
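
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels; `per_class_metrics` is a hypothetical helper, and the majority-class baseline is an added point of comparison that the notebook does not report.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = [[1507, 86],
      [82, 357]]

def per_class_metrics(cm, label):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = cm[label][label]
    predicted = sum(row[label] for row in cm)  # column total: predicted as this class
    actual = sum(cm[label])                    # row total: truly this class
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
# Accuracy of always predicting the most common true class.
majority_baseline = max(sum(row) for row in cm) / total

for label in range(len(cm)):
    p, r, f = per_class_metrics(cm, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
print(round(accuracy, 4), round(majority_baseline, 4))
```

Because about 78% of the test excerpts are not coded `ACCOUNT`, the overall accuracy should be read against that baseline, and the per-class scores for class 1 are the more informative ones.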
