Class,

We now embark on a simple text-sentiment classification exercise using a host of methods/algorithms such as Naive Bayes, SVM, logistic regression, random forests and gradient-boosted trees, the details of which you'll cover in later courses.

The aim of this exercise is to demo the use of standard libraries for supervised classification on labeled text data.

See below

## Importing the required libraries

We'll use select components from the comprehensive *sklearn* library. Some select components of interest are:

**model_selection** : for splitting the data into train and test dataset 

**preprocessing**: for encoding the label so that it can be used in the machine learning model

**linear_model**: to implement linear models like linear regression, logistic regression etc

**naive_bayes**: for the class of Naive Bayes models

**metrics**: For calculating the model metrics such as accuracy.

**svm**: For implementing SVM or *Support Vector Machine* models. 

**TfidfVectorizer**: for building a DTM with TF-IDF weights

**CountVectorizer**: for building a DTM with raw term-frequency (TF) counts

**ensemble**: To implement Ensemble models like *Random Forests* 

**pandas**: For Dataframes

If you're using the Anaconda Python distribution, much of sklearn comes prepackaged. Otherwise, installing the libraries might take some time.

E.g., to install *xgboost*, type on the command line: *$ pip install xgboost*

In [1]:
## setup chunk
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import ensemble

import pandas, xgboost, numpy, textblob, string  # $ pip install textblob
import csv,re,nltk
import time

In [2]:
# read data in (tab-separated: label first, then text)
labels, texts = [], []
with open("training.txt", "r", encoding="utf-8") as infile:  # file closes automatically
    sample_data = csv.reader(infile, delimiter="\t")
    for row in sample_data:
        labels.append(row[0])
        texts.append(row[1])

# build panda DF to house the data
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
trainDF.iloc[:8,]  # view a few records


Unnamed: 0,text,label
0,The Da Vinci Code book is just awesome.,1
1,this was the first clive cussler i've ever rea...,1
2,i liked the Da Vinci Code a lot.,1
3,i liked the Da Vinci Code a lot.,1
4,I liked the Da Vinci Code but it ultimatly did...,1
5,that's not even an exaggeration ) and at midni...,1
6,"I loved the Da Vinci Code, but now I want some...",1
7,"i thought da vinci code was great, same with k...",1


## Split Dataset into Train & Test to Assess Model Accuracy

Part of the standard operating procedure in training machines on labeled data is (randomly) splitting the data into two parts:

1] a *Training* or Calibration dataset on which the machine searches for the best function connecting labeled outcomes to data features

2] a *Test* or Validation dataset on which the trained model is run to assess the accuracy of the model's predicted outcomes against known outcomes.

We'll use sklearn's *model_selection.train_test_split()* func for the above. 

Then, in preprocessing, we use a label *encoder* to set up the LHS (the target variable), via the func *preprocessing.LabelEncoder().fit_transform()*.

See below.

In [3]:
# split the DF into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable into 0/1
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the mapping learned on the training labels
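
To see the split-then-encode step in isolation, here is a minimal sketch on a made-up toy dataset (the sentences and labels below are invented for illustration, not taken from training.txt). It assumes you want a repeatable split, so it passes *random_state*, and uses *stratify* to keep the class mix similar in both parts. The encoder learns its mapping on the training labels only and then reuses it on the validation labels.

```python
from sklearn import model_selection, preprocessing

# Toy stand-in for the text and label columns (made up for illustration)
texts = ["loved it", "hated it", "great book", "awful movie",
         "really awesome", "it sucks", "so good", "so bad"]
labels = ["1", "0", "1", "0", "1", "0", "1", "0"]

# random_state makes the split repeatable; stratify keeps the class mix similar
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Learn the label-to-integer mapping once, on the training labels ...
encoder = preprocessing.LabelEncoder()
train_y_enc = encoder.fit_transform(train_y)
# ... and reuse that same mapping on the validation labels
valid_y_enc = encoder.transform(valid_y)

print(encoder.classes_)            # the original labels, in encoded order
print(len(train_x), len(valid_x))  # sizes of the two parts
```

With only two label values the difference between fitting once and fitting twice rarely shows, but reusing one fitted encoder guarantees that the same integer means the same class in both datasets.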

## Creating a DTM of the dataset 

Recall we'd built DTMs in R's tidytext with the cast_dtm() func. Now we see sklearn's approach to doing the same using the *CountVectorizer()* func. 

P.S. You can read more about the func's arguments and attributes [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).


In [4]:
# create a count vectorizer TF-DTM object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
t1 = time.time()
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)
t2 = time.time()

print(t2 - t1, "secs")  # ~ 0.15 secs to create DTM. Not bad, eh?
xtrain_count.shape
xvalid_count.shape

0.15179705619812012 secs


(1730, 2162)
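
To make the DTM idea concrete, here is a small sketch on three made-up sentences (assumed purely for illustration). It uses the same *token_pattern* as above, which keeps one-letter tokens such as "i", and reads the learned vocabulary from the fitted object's *vocabulary_* attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents (made up for illustration)
docs = ["I love Harry Potter",
        "I love love the Da Vinci Code",
        "Da Vinci Code sucks"]

# Same settings as the notebook: word tokens of length 1 or more
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
dtm = count_vect.fit_transform(docs)   # sparse matrix: rows = docs, columns = tokens

# vocabulary_ maps each (lowercased) token to its column index
vocab = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(vocab)
print(dtm.toarray())                   # dense view is fine for a toy example only
print(dtm.shape)
```

Each row is one document and each cell is a raw count, so "love" shows up as 2 in the second row.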

## Creating TFIDF on the word level for the dataset 

Recall the TF-IDF weighted DTMs from the previous classes? We'd said that it's hard to say a priori which weighting might be more relevant or better-performing in any given situation, and that we should try both and see.

Now, in this text classification exercise, would it make more sense to use a TF or a TF-IDF weighted DTM? Why not try both and see?

Below, we use sklearn's *TfidfVectorizer()* func to build a TF-IDF weighted DTM.

Behold.

In [5]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
t1 = time.time()
tfidf_vect = tfidf_vect.fit(trainDF['text'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
t2 = time.time()

print(t2 - t1, "secs")  # ~ 0.15 secs again
xtrain_tfidf.shape  # rows = training docs; same 2162 token columns as the TF DTM

0.1310579776763916 secs


(5188, 2162)
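
What does the IDF part actually change? A minimal sketch on three made-up sentences (assumed for illustration) shows it: both vectorizers produce the same columns, but TF-IDF gives a lower weight to tokens that occur in more documents. The sketch relies on sklearn's defaults for *TfidfVectorizer* (smoothed IDF and L2-normalised rows).

```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three toy documents (made up for illustration)
docs = ["I love Harry Potter",
        "I love love the Da Vinci Code",
        "Da Vinci Code sucks"]

tf_vect = CountVectorizer(token_pattern=r'\w{1,}')
tfidf_vect = TfidfVectorizer(token_pattern=r'\w{1,}')
tf_dtm = tf_vect.fit_transform(docs)
tfidf_dtm = tfidf_vect.fit_transform(docs)

# Same tokens, hence the same number of columns under both schemes
print(tf_dtm.shape, tfidf_dtm.shape)

# idf_ holds one weight per column; tokens found in more docs get a lower weight
idf = {tok: tfidf_vect.idf_[j] for tok, j in tfidf_vect.vocabulary_.items()}
print("love :", idf["love"])    # appears in 2 of 3 docs
print("sucks:", idf["sucks"])   # appears in 1 of 3 docs

# By default each row is scaled to unit (L2) length
row_norms = numpy.asarray(
    numpy.sqrt(tfidf_dtm.multiply(tfidf_dtm).sum(axis=1))).ravel()
print(row_norms)
```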

## Function to train the model 

The arguments of the user defined func *train_model()* below allow you to:

1] pick a classifier algorithm or method to run (P.S. see [here](https://scikit-learn.org/stable/supervised_learning.html) for a list)

2] input the feature set and the labels

3] both for training and validation datasets.

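I'm setting *is_neural_net=False* by default, as neural nets are worth a whole other course some other time. Note that the func reads *valid_x* and *valid_y* from the notebook's global scope to print a preview and to score the predictions.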

See below.

In [6]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    
    #print(len(predictions))
    predDF = pandas.DataFrame()
    predDF['text'] = valid_x
    predDF['actual_label'] = valid_y
    predDF['model_label'] = predictions
    
    print(predDF.iloc[:8,])
    
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    
    return metrics.accuracy_score(valid_y, predictions)  # sklearn's documented order is (y_true, y_pred)
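
If you'd rather not depend on global variables, one possible variant is sketched below. The name *evaluate_classifier* and its arguments are hypothetical (they are not part of this notebook or of sklearn); the validation labels and texts are simply passed in explicitly. The toy sentences are made up for illustration.

```python
import pandas
from sklearn import metrics, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

def evaluate_classifier(classifier, x_train, y_train, x_valid, y_valid,
                        valid_texts=None, n_show=8):
    """Hypothetical helper: fit, predict, optionally preview, return accuracy."""
    classifier.fit(x_train, y_train)
    predictions = classifier.predict(x_valid)

    # Optional side-by-side preview of actual vs predicted labels
    if valid_texts is not None:
        preview = pandas.DataFrame({'text': list(valid_texts),
                                    'actual_label': list(y_valid),
                                    'model_label': list(predictions)})
        print(preview.head(n_show))

    return metrics.accuracy_score(y_valid, predictions)

# Toy usage (sentences and labels made up for illustration)
train_texts = ["i love it", "i hate it", "love this book", "hate this movie"]
train_labels = [1, 0, 1, 0]
valid_texts = ["love it", "hate it"]
valid_labels = [1, 0]

vect = CountVectorizer(token_pattern=r'\w{1,}').fit(train_texts)
acc = evaluate_classifier(naive_bayes.MultinomialNB(),
                          vect.transform(train_texts), train_labels,
                          vect.transform(valid_texts), valid_labels,
                          valid_texts)
print("accuracy:", acc)
```

Passing everything in as arguments means the func can be reused on a different split without quietly picking up stale globals.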

## Naive Bayes Model using DTM and TFIDF

Now rubber meets road. Let's use the *train_model* func above starting simple with Naive Bayes. 

Note how we output both actual and predicted label for each case to facilitate comparison.

See below.

In [7]:
# Naive Bayes on DTM
t1 = time.time()
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
t2 = time.time()
print("\nNaive Bayes on DTM accuracy: "+ str(accuracy))

# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("\nNaive Bayes on WordLevel TF-IDF accuracy: "+ str(accuracy))

print(t2-t1, "secs for NB on TF")  # 0.01 secs. Fast!

                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the table, fel...             1   
4874                             da vinci code sucks...             0   
2683  I love the Harry Potter series if you can coun...             1   
3169              dudeee i LOVED brokeback mountain!!!!             1   
3100                      I LOVE Brokeback Mountain!!!.             1   
4042                           the Da Vinci Code sucks.             0   
5958                         brokeback mountain sucks..             0   
2549                               I love Harry Potter.             1   

      model_label  
1323            1  
4874            0  
2683            1  
3169            1  
3100            1  
4042            0  
5958            0  
2549            1  

Naive Bayes on DTM accuracy: 0.9791907514450867
                                                   text  actual_label  \
1323  So as felicia's mom is cleaning th

## Linear Classifier using DTM and TFIDF

Same as above, we try logistic regression as our classification method this time.

P.S. I'd expect this to take more time than Naive Bayes, which assumes the features (tokens) are conditionally independent given the class.

See below.

In [8]:
# Linear Classifier on DTM
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
print("\nLogistic Regression on DTM Accuracy: "+ str(accuracy))

# Linear Classifier on Word Level TF IDF Vectors
t1 = time.time()
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
t2 = time.time()
print("\nLogistic Regression on WordLevel TF-IDF: "+ str(accuracy))

print(t2-t1, "secs for Logistic")  # 0.02 secs

                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the table, fel...             1   
4874                             da vinci code sucks...             0   
2683  I love the Harry Potter series if you can coun...             1   
3169              dudeee i LOVED brokeback mountain!!!!             1   
3100                      I LOVE Brokeback Mountain!!!.             1   
4042                           the Da Vinci Code sucks.             0   
5958                         brokeback mountain sucks..             0   
2549                               I love Harry Potter.             1   

      model_label  
1323            1  
4874            0  
2683            1  
3169            1  
3100            1  
4042            0  
5958            0  
2549            1  

Logistic Regression on DTM Accuracy: 0.992485549132948
                                                   text  actual_label  \
1323  So as felicia's mom is clea



## SVM using DTM and TFIDF

SVM is a more complex method than the previous two, so I'd expect it to take more time.

SVM tries to find the line, plane or hyperplane (in high dimensions) that best separates one class of labels from another. The training points that lie closest to this boundary are the so-called support vectors. Again, details in later courses.
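
A tiny sketch of the idea, on six made-up 2-D points rather than text (assumed for illustration). It uses a linear kernel so the boundary is a straight line; note that *svm.SVC()* with no arguments, as used in this notebook, defaults to a different (RBF) kernel.

```python
import numpy
from sklearn import svm

# Two well-separated toy clusters (made up for illustration)
X = numpy.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = numpy.array([0, 0, 0, 1, 1, 1])

# A linear kernel gives a separating line: w . x + b = 0
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)        # the training points nearest the boundary
print(clf.coef_, clf.intercept_)   # w and b of the separating line
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

Only the points nearest the boundary end up as support vectors; the rest could be dropped without moving the line.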

In [9]:
#SVM on DTM
t1 = time.time()
accuracy = train_model(svm.SVC(), xtrain_count, train_y, xvalid_count)
t2 = time.time()
print("\nSVM on DTM Model Accuracy: "+str(accuracy))


# SVM on Word Level TF IDF Vectors
accuracy = train_model(svm.SVC(), xtrain_tfidf, train_y, xvalid_tfidf)
print("\nSVM on TFIDF Accuracy: "+str(accuracy))

print(t2-t1, "secs in SVM")



                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the table, fel...             1   
4874                             da vinci code sucks...             0   
2683  I love the Harry Potter series if you can coun...             1   
3169              dudeee i LOVED brokeback mountain!!!!             1   
3100                      I LOVE Brokeback Mountain!!!.             1   
4042                           the Da Vinci Code sucks.             0   
5958                         brokeback mountain sucks..             0   
2549                               I love Harry Potter.             1   

      model_label  
1323            1  
4874            0  
2683            1  
3169            1  
3100            1  
4042            0  
5958            0  
2549            1  

SVM on DTM Model Accuracy: 0.9138728323699422
                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the 

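Aha. Above we note that DTMs under the TF and TFIDF weighting schemes actually produced dramatically different accuracy results under SVM.

The simpler TF counts performed way better. (Both DTMs have the same 2162 token columns, so the gap comes from the weighting, not from the size of the feature set.)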

## Random Forest using DTM and TFIDF

Now we use a so-called *ensemble* method, i.e. an algorithm that runs a collection of classification tasks and averages or combines them to yield a final output. Ensembles are often considered advantageous over one-shot methods in complex tasks.

A random forest, as the name implies, is an ensemble of decision trees. Each tree is grown on a different random sample of the training data using different feature subsets, and the trees are later combined to give an overall picture.

See below.

In [10]:
# RF on Count Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
print("\nRandom Forest on DTM Accuracy: ", accuracy)

# RF on Word Level TF IDF Vectors
t1 = time.time()
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
t2 = time.time()
print("\nRandom Forest on WordLevel TF-IDF Accuracy: ", accuracy)

print(t2-t1, "secs under RF")

                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the table, fel...             1   
4874                             da vinci code sucks...             0   
2683  I love the Harry Potter series if you can coun...             1   
3169              dudeee i LOVED brokeback mountain!!!!             1   
3100                      I LOVE Brokeback Mountain!!!.             1   
4042                           the Da Vinci Code sucks.             0   
5958                         brokeback mountain sucks..             0   
2549                               I love Harry Potter.             1   

      model_label  
1323            1  
4874            0  
2683            1  
3169            1  
3100            1  
4042            0  
5958            0  
2549            1  

Random Forest on DTM Accuracy:  0.9838150289017341




                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the table, fel...             1   
4874                             da vinci code sucks...             0   
2683  I love the Harry Potter series if you can coun...             1   
3169              dudeee i LOVED brokeback mountain!!!!             1   
3100                      I LOVE Brokeback Mountain!!!.             1   
4042                           the Da Vinci Code sucks.             0   
5958                         brokeback mountain sucks..             0   
2549                               I love Harry Potter.             1   

      model_label  
1323            1  
4874            0  
2683            1  
3169            1  
3100            1  
4042            0  
5958            0  
2549            1  

Random Forest on WordLevel TF-IDF Accuracy:  0.9809248554913295
0.12029004096984863 secs under RF
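
To peek inside the forest, here is a minimal sketch on six made-up sentences (assumed for illustration). It fits a small forest, then asks each individual tree for its own prediction on a new sentence via the fitted *estimators_* list. As far as I know, sklearn combines the trees by averaging their predicted class probabilities rather than by a strict head count, so treat the vote tally as an intuition aid only.

```python
from sklearn import ensemble
from sklearn.feature_extraction.text import CountVectorizer

# Toy labeled sentences (made up for illustration): 1 = positive, 0 = negative
texts = ["i love this book", "i love harry potter", "this movie is awesome",
         "da vinci code sucks", "i hate this movie", "this book is awful"]
labels = [1, 1, 1, 0, 0, 0]

vect = CountVectorizer(token_pattern=r'\w{1,}')
X = vect.fit_transform(texts)

# random_state fixes the random sampling so reruns give the same forest
forest = ensemble.RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, labels)
print(len(forest.estimators_), "trees in the forest")

# Ask every tree separately, then compare with the forest's combined answer
new_x = vect.transform(["i love this movie"])
votes = [int(tree.predict(new_x)[0]) for tree in forest.estimators_]
print(sum(votes), "of", len(votes), "trees predict class 1")
print("forest says:", forest.predict(new_x)[0])
```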


## Gradient Boosting using DTM and TFIDF

Finally, our last classification method for the day: gradient boosting, which follows 'boosting' principles. In other words, it builds its trees one after another, with each new tree focusing on the cases the earlier trees got wrong.
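
The one-tree-after-another idea can be sketched as follows. Note the swap: this uses sklearn's own *GradientBoostingClassifier* as a stand-in, not xgboost, and a synthetic dataset from *make_classification* (both are assumptions made for illustration). Its *staged_predict()* method yields the predictions after each added tree, so we can watch training accuracy change as the ensemble grows.

```python
from sklearn import ensemble, metrics
from sklearn.datasets import make_classification

# Synthetic two-class data (an assumption, not the sentiment dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 50 small trees, each one fitted to the errors left by the trees before it
gb = ensemble.GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                         random_state=0)
gb.fit(X, y)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees
staged_acc = [metrics.accuracy_score(y, pred) for pred in gb.staged_predict(X)]
print("training accuracy after 1 tree  :", staged_acc[0])
print("training accuracy after 50 trees:", staged_acc[-1])
```

These are training-set accuracies, so they flatter the model; on a held-out validation set the curve would typically level off sooner.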

In [11]:
# Extreme Gradient Boosting on Count Vectors
t1 = time.time()
accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())
t2 = time.time()
print("\nXGBoost on DTM Accuracy: ", accuracy)

# Extreme Gradient Boosting on Word Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())
print("\nXGBoost on WordLevel TF-IDF Accuracy: ", accuracy)

print(t2-t1, "secs under xgboost")

                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the table, fel...             1   
4874                             da vinci code sucks...             0   
2683  I love the Harry Potter series if you can coun...             1   
3169              dudeee i LOVED brokeback mountain!!!!             1   
3100                      I LOVE Brokeback Mountain!!!.             1   
4042                           the Da Vinci Code sucks.             0   
5958                         brokeback mountain sucks..             0   
2549                               I love Harry Potter.             1   

      model_label  
1323            1  
4874            0  
2683            1  
3169            1  
3100            1  
4042            0  
5958            0  
2549            1  

XGBoost on DTM Accuracy:  0.9815028901734104
                                                   text  actual_label  \
1323  So as felicia's mom is cleaning the t

Well, so far so good. What questions come to your mind?

What do you think? Which method performed *best*? Which DTM weighting scheme seems superior for this application?

Can we use these methods if the number of classes exceeds 2? Etc.
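
On the more-than-2-classes question, here is a quick sketch you can try, using six made-up sentences with three labels (assumed for illustration). Both Naive Bayes and logistic regression in sklearn accept a label column with more than two distinct values without any change to the calling code.

```python
from sklearn import linear_model, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

# Toy three-class data (made up for illustration)
texts = ["i love this book", "what a great movie",
         "i hate this book", "what an awful movie",
         "the book is okay", "the movie is fine i guess"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

vect = CountVectorizer(token_pattern=r'\w{1,}')
X = vect.fit_transform(texts)
new_x = vect.transform(["i love this movie"])

# Same fit/predict calls as in the two-class case
results = {}
for clf in (naive_bayes.MultinomialNB(),
            linear_model.LogisticRegression(max_iter=1000)):
    clf.fit(X, labels)
    results[type(clf).__name__] = (list(clf.classes_), clf.predict(new_x)[0])

for name, (classes, prediction) in results.items():
    print(name, classes, "->", prediction)
```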

Sudhir