# Simple Text Classifiers

This notebook shows a simple approach to text classification. Linear and ensemble classification models are tested without any complicated pre-processing.

## Imports and Load Data

In [86]:
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import wordnet, stopwords
from nltk.stem import snowball, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm

In [2]:
file_name = "Isla Vista - All Excerpts - 1_2_2019.xlsx"
data = pd.read_excel(file_name, sheet_name='Dedoose Excerpts Export')
print(data.shape)
data = data.dropna(axis=0)
print(data.shape)
print(data.columns)

(8131, 53)
(8127, 53)
Index(['StoryID', 'Excerpt', 'CodesApplied_Combined', 'ACCOUNT',
       'ACCOUNT_Cultural', 'ACCOUNT_Individual', 'ACCOUNT_Other',
       'COMMUNITYRECOVERY', 'EVENT', 'GRIEF', 'GRIEF_Individual',
       'GRIEF_Community', 'GRIEF_Societal', 'HERO', 'INVESTIGATION', 'JOURNEY',
       'JOURNEY_Mental', 'JOURNEY_Physical', 'LEGAL', 'MEDIA', 'MISCELLANEOUS',
       'MOURNING', 'MOURNING_Individual', 'MOURNING_Community',
       'MOURNING_Societal', 'PERPETRATOR', 'PHOTO', 'POLICY', 'POLICY_Guns',
       'POLICY_InfoSharing', 'POLICY_MentalHealth', 'POLICY_Other',
       'POLICY_VictimAdv', 'POLICY_OtherAdv', 'POLICY_Practice',
       'PRIVATESECTOR', 'RACECULTURE', 'RESOURCES', 'SAFETY',
       'SAFETY_Community', 'SAFETY_Individual', 'SAFETY_SchoolOrg',
       'SAFETY_Societal', 'SOCIALSUPPORT', 'THREAT', 'THREAT_Assessment',
       'TRAUMA', 'TRAUMA_Physical', 'TRAUMA_Psychological',
       'TRAUMA_Individual', 'TRAUMA_Community', 'TRAUMA_Societal', 'VICTIMS'],
    

## Prepare Tokenizers

Two tokenizers will be tested. The first takes the simplest approach, stemming each word. The second adds some complexity by removing stop words and applying the WordNet lemmatizer.

In [143]:
excerpts = list(data['Excerpt'])
def stem_tokenizer(doc):
    # Stop words are kept here (and left unstemmed); the vectorizers below filter them out.
    tokens = word_tokenize(doc)
    stemmer = snowball.SnowballStemmer("english", ignore_stopwords=True)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    # Keep purely alphabetic tokens, lowercased
    list_tokens = [tok.lower() for tok in stemmed_tokens if tok.isalpha()]
    return ' '.join(list_tokens)
print("original: "+str(excerpts[3]))
print(stem_tokenizer(excerpts[3]))

original: A 22-year-old student last Friday killed six people and wounded 13 more in Isla Vista before turning his gun on himself. Commenters 
blamed the killer’s crimes on everything from misogynistic “pickup artist philosophy” to easy access to guns and no-fault divorce. Even 
“nerd culture” has come under scrutiny. 

Is American culture to blame for mass murder? 
a student last friday kill six peopl and wound more in isla vista before turn his gun on himself comment blame the crime on everyth from misogynist artist to easi access to gun and divorc even has come under scrutini is american cultur to blame for mass murder


In [95]:
excerpts = list(data['Excerpt'])
def lem_tokenizer(doc):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(doc)
    lemmer = WordNetLemmatizer()
    # lemmatize() defaults to noun POS, so verb forms such as 'killed' are left unchanged
    lemmed_tokens = [lemmer.lemmatize(word) for word in tokens if word.lower() not in stop_words]
    # Keep purely alphabetic tokens, lowercased
    list_tokens = [tok.lower() for tok in lemmed_tokens if tok.isalpha()]
    return ' '.join(list_tokens)
print("original: \n"+str(excerpts[3])+str("\n"))
print(lem_tokenizer(excerpts[3]))

original: 
A 22-year-old student last Friday killed six people and wounded 13 more in Isla Vista before turning his gun on himself. Commenters 
blamed the killer’s crimes on everything from misogynistic “pickup artist philosophy” to easy access to guns and no-fault divorce. Even 
“nerd culture” has come under scrutiny. 

Is American culture to blame for mass murder? 

student last friday killed six people wounded isla vista turning gun commenters blamed crime everything misogynistic artist easy access gun divorce even come scrutiny american culture blame mass murder
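
Both tokenizers keep only tokens that pass `isalpha()`, which is why numbers, hyphenated words and words with attached curly quotes disappear from the outputs above. The sketch below illustrates that filter alone. It assumes a plain whitespace split as a stand-in for `word_tokenize` (so it needs no NLTK data), and the sample sentence and the `alpha_filter` helper are illustrative only.

```python
# Whitespace splitting stands in for nltk's word_tokenize (an assumption made
# so that the sketch runs without any NLTK data downloads).
# \u2019, \u201c and \u201d are the curly apostrophe and curly double quotes.
sample = ("A 22-year-old student blamed the killer\u2019s "
          "\u201cpickup artist philosophy\u201d and no-fault divorce")

def alpha_filter(text):
    """Return (kept, dropped) tokens using the same isalpha() test as the tokenizers."""
    tokens = text.split()
    kept = [tok.lower() for tok in tokens if tok.isalpha()]
    dropped = [tok for tok in tokens if not tok.isalpha()]
    return kept, dropped

kept, dropped = alpha_filter(sample)
print("kept:   ", kept)
print("dropped:", dropped)
```

If terms such as "pickup" or "no-fault" matter for a code, stripping punctuation before the `isalpha()` test would be one way to retain them.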


## Create Vectorizers

The two tokenizers can then be used to create vectorized representations of the excerpts. Two vectorizers will be used: first the count vectorizer, then the TF-IDF vectorizer.

In [144]:
# stem + count
docs = [stem_tokenizer(doc) for doc in excerpts]
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
stem_count_X = vectorizer.fit_transform(docs).toarray() 

In [119]:
# lem + count
docs = [lem_tokenizer(doc) for doc in excerpts]
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
lem_count_X = vectorizer.fit_transform(docs).toarray() 

In [120]:
# stem + tfidf
docs = [stem_tokenizer(doc) for doc in excerpts]
vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
stem_tfidf_X = vectorizer.fit_transform(docs).toarray() 

In [121]:
# lem + tfidf
docs = [lem_tokenizer(doc) for doc in excerpts]
vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
lem_tfidf_X = vectorizer.fit_transform(docs).toarray() 
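
To make the difference between the count and TF-IDF matrices concrete, here is a minimal pure-Python sketch of the weighting. It assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, no sublinear term frequency); the three toy documents and the `tfidf_rows` helper are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF rows using a smoothed idf and L2 row normalization."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc.split())           # the raw counts a CountVectorizer would store
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])    # scale each row to unit length
    return vocab, rows

docs = ["gun blame gun", "blame cultur", "gun polici"]  # toy, already-stemmed documents
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Terms that occur in fewer documents ("cultur", "polici") receive a larger weight than terms shared across documents ("gun", "blame"), which is the main effect TF-IDF adds on top of raw counts.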

## Classifiers

Test each vectorized representation with simple classifiers.
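
One caveat worth noting: the vectorizers above are fitted on all excerpts before the train/test split, so the vocabulary and document-frequency statistics include the test rows. A minimal sketch of an alternative, assuming scikit-learn's `make_pipeline` and using toy documents and labels that are purely illustrative, fits the vectorizer on the training text only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the tokenized excerpts and the ACCOUNT labels (hypothetical data).
toy_docs = ["blame cultur gun", "blame video manifesto", "blame mental health",
            "blame women reject", "vigil candl mourn", "memori servic held",
            "victim name releas", "communiti gather park"] * 5
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split the raw text first, so the test documents stay unseen by the vectorizer.
train_docs, test_docs, y_train, y_test = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels)

# fit() learns the vocabulary and idf weights from train_docs only, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, y_train)
accuracy = model.score(test_docs, y_test)
print(accuracy)
```

With a corpus of this size the effect on the scores is probably small, but a pipeline also makes later cross-validation straightforward.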

### Linear

First, compare two linear models: SVM and logistic regression.

#### SVM

In [122]:
docs_train, docs_test, y_train, y_test = train_test_split(stem_count_X, list(data['ACCOUNT']),
                                                          test_size=0.2, random_state=0) 
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(docs_train, y_train)

y_pred = clf.predict(docs_test)
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[1177   69]
 [  66  314]]
              precision    recall  f1-score   support

           0       0.95      0.94      0.95      1246
           1       0.82      0.83      0.82       380

    accuracy                           0.92      1626
   macro avg       0.88      0.89      0.88      1626
weighted avg       0.92      0.92      0.92      1626

0.9169741697416974
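
The figures in the classification report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. As a check, this short sketch recomputes the class 1 scores and the accuracy from the matrix printed above; the `class_metrics` helper is illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [[1177, 69],
      [66, 314]]

def class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column total: everything predicted as cls
    actual = sum(cm[cls])                     # row total: everything truly cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
p1, r1, f1 = class_metrics(cm, 1)
print(round(p1, 2), round(r1, 2), round(f1, 2), accuracy)
```

Class 1 precision is 314 / (314 + 69) and recall is 314 / (314 + 66), which round to the 0.82 and 0.83 shown in the report.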


In [123]:
docs_train, docs_test, y_train, y_test = train_test_split(stem_tfidf_X, list(data['ACCOUNT']),
                                                          test_size=0.2, random_state=0) 
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(docs_train, y_train)

y_pred = clf.predict(docs_test)
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[1193   53]
 [  77  303]]
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1246
           1       0.85      0.80      0.82       380

    accuracy                           0.92      1626
   macro avg       0.90      0.88      0.89      1626
weighted avg       0.92      0.92      0.92      1626

0.9200492004920049


In [124]:
docs_train, docs_test, y_train, y_test = train_test_split(lem_count_X, list(data['ACCOUNT']),
                                                          test_size=0.2, random_state=0) 
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(docs_train, y_train)

y_pred = clf.predict(docs_test)
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[1173   73]
 [  73  307]]
              precision    recall  f1-score   support

           0       0.94      0.94      0.94      1246
           1       0.81      0.81      0.81       380

    accuracy                           0.91      1626
   macro avg       0.87      0.87      0.87      1626
weighted avg       0.91      0.91      0.91      1626

0.9102091020910209


In [125]:
docs_train, docs_test, y_train, y_test = train_test_split(lem_tfidf_X, list(data['ACCOUNT']),
                                                          test_size=0.2, random_state=0) 
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(docs_train, y_train)

y_pred = clf.predict(docs_test)
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[1188   58]
 [  75  305]]
              precision    recall  f1-score   support

           0       0.94      0.95      0.95      1246
           1       0.84      0.80      0.82       380

    accuracy                           0.92      1626
   macro avg       0.89      0.88      0.88      1626
weighted avg       0.92      0.92      0.92      1626

0.9182041820418204
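
The four cells above repeat the same split, fit and report steps. A small helper along the following lines would keep every comparison on the same split and make it easy to add feature sets. This is only a sketch: the `evaluate` helper is hypothetical, and synthetic matrices from `make_classification` stand in for `stem_count_X`, `stem_tfidf_X`, `lem_count_X` and `lem_tfidf_X`.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def evaluate(X, y, test_size=0.2, random_state=0):
    """Fit a linear-kernel SVC on one feature matrix and return test scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {"accuracy": accuracy_score(y_test, y_pred),
            "f1_class1": f1_score(y_test, y_pred, pos_label=1)}

# Synthetic stand-ins for the notebook's feature matrices (hypothetical data).
X_a, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_b = X_a[:, :10]  # a second "representation" of the same rows
feature_sets = {"all_features": X_a, "first_ten": X_b}

results = {name: evaluate(X, y) for name, X in feature_sets.items()}
for name, scores in results.items():
    print(name, scores)
```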


#### Logistic Regression

Since all four setups gave nearly identical F1 scores (0.81 to 0.82 for class 1), the simplest setup (stemming with count features) is used to test logistic regression.

In [126]:
docs_train, docs_test, y_train, y_test = train_test_split(stem_count_X, list(data['ACCOUNT']), 
                                                          test_size=0.2, random_state=0) 

logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
logreg.fit(docs_train, y_train)

y_pred = logreg.predict(docs_test)
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[1185   61]
 [  68  312]]
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1246
           1       0.84      0.82      0.83       380

    accuracy                           0.92      1626
   macro avg       0.89      0.89      0.89      1626
weighted avg       0.92      0.92      0.92      1626

0.9206642066420664


### Ensemble Model

Test an ensemble model: random forest.

In [152]:
docs_train, docs_test, y_train, y_test = train_test_split(stem_count_X, list(data['ACCOUNT']), #test_size=0.2,
                                                         random_state=0) 
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)  
classifier.fit(docs_train, y_train) 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [153]:
y_pred = classifier.predict(docs_test)  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[1521   38]
 [ 110  363]]
              precision    recall  f1-score   support

           0       0.93      0.98      0.95      1559
           1       0.91      0.77      0.83       473

    accuracy                           0.93      2032
   macro avg       0.92      0.87      0.89      2032
weighted avg       0.93      0.93      0.92      2032

0.9271653543307087
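
The support total in this report (2,032) differs from the 1,626 in the linear-model reports because `test_size=0.2` is commented out in the split above, so `train_test_split` falls back to its default of 0.25. The arithmetic can be checked with a short sketch, assuming the test set size is the ceiling of `test_size * n_samples`; the `split_sizes` helper is illustrative.

```python
import math

n_rows = 8127  # rows remaining after dropna(), from the data.shape output

def split_sizes(n_samples, test_size=0.25):
    """Train/test row counts for a fractional test_size (0.25 is the default)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(n_rows, 0.2))  # split used for the SVM and logistic regression cells
print(split_sizes(n_rows))       # default split used for the random forest
```

Because the random forest is trained and tested on different rows than the linear models, its scores are not directly comparable with theirs.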


In [154]:
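# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
feature_names = vectorizer.get_feature_names()
print(len(classifier.feature_importances_) == len(feature_names))
# argsort is ascending, so the last index is the most important feature
top_feats = np.argsort(classifier.feature_importances_)[-10:]
feat_names = [feature_names[feat] for feat in top_feats]
print(feat_names)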

True
['student', 'reject', 'april', 'women', 'sex', 'blame', 'manifesto', 'rodger', 'mental', 'video']
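
Since `np.argsort` sorts in ascending order, the list above runs from the tenth most important term up to the most important one ("video"). The sketch below shows the ordering on made-up importances; the values and term names are hypothetical, not taken from the fitted model.

```python
import numpy as np

# Hypothetical importances for five features (not the notebook's values).
importances = np.array([0.05, 0.30, 0.10, 0.40, 0.15])
names = ["term_a", "term_b", "term_c", "term_d", "term_e"]

top3 = np.argsort(importances)[-3:]        # ascending: weakest of the top 3 comes first
ranked = [names[i] for i in top3[::-1]]    # reverse to list the strongest first
print(list(top3))
print(ranked)
```

Reversing the slice with `[::-1]` gives the more familiar strongest-first ranking.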


In [155]:
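# Build a document that contains a single term and see how it is classified
test_doc = np.zeros(len(docs_test[1]))
# top_feats[6] is 'manifesto' in the list printed above ('blame' is top_feats[5])
test_doc[top_feats[6]] = 1
print("class: "+str(classifier.predict([test_doc])))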

class: [0]


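The ensemble model shows the best accuracy (0.927, versus 0.910 to 0.921 for the linear models), but the comparison is not like-for-like: with `test_size` commented out, the random forest was evaluated on the default 25% hold-out (2,032 excerpts) rather than the 20% split (1,626 excerpts) used for the linear models. The ensemble also has a more complex decision boundary and can be more prone to over-fitting, so cross-validation would be needed to confirm the performance increase. It is also interesting to note that the term "blame" was one of the top contributing terms to the model, and it results in the decision to classify a document as "account" (class label = 1). Note, however, that the probe cell as shown sets `top_feats[6]`, which is "manifesto" in the printed list and returns class 0; checking "blame" requires `top_feats[5]`.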
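
The cross-validation mentioned above could look roughly like the following. This is a sketch under stated assumptions: `make_classification` generates a synthetic, imbalanced stand-in for `stem_count_X` and the `ACCOUNT` labels, and the forest uses fewer trees than the notebook to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the notebook's features and labels (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, weights=[0.77, 0.23],
                           random_state=0)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# scoring="f1" reports the F1 score of the positive class (label 1) per fold.
scores = cross_val_score(forest, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

Running the same folds for the SVM and logistic regression would show whether the gap between models is larger than the fold-to-fold variation.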

### Conclusions

These simple tests suggest that a fairly high-performance classifier is achievable: very basic pre-processing combined with simple linear classifiers reached a weighted F1 score above 0.9. There is still room for improvement on the account class (class 1), whose F1 score stays between 0.81 and 0.83. One visible issue is class imbalance: class 1 makes up roughly 23% of the test excerpts (380 of 1,626). The imbalance is not severe, but it likely contributes to the lower F1 score for class 1.
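
One option for the imbalance, not tested in this notebook, is to reweight the classes. scikit-learn's `class_weight='balanced'` setting uses the heuristic `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the test-split counts reported above, purely as an illustration; the `balanced_weights` helper is hypothetical, and in practice the weights would come from the training labels.

```python
# Class counts from the support column of the 20% test split reported above.
counts = {0: 1246, 1: 380}

def balanced_weights(counts):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

weights = balanced_weights(counts)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Errors on the minority class are then penalized about three times as heavily as errors on the majority class, which typically trades some precision for recall on class 1.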