Sentiment Classification 

Guidelines
- Use any machine learning algorithms except deep learning.
- "1. Dataset preparation" and "2. Feature Engineering and Model Building" sections are provided. Note that the "2. Feature Engineering and Model Building" section is provided only for your reference, i.e. as a baseline model and performance.
- In general, try to build the model that gives the best performance (especially accuracy and f1-score) using the techniques you have learnt in class so far, such as feature engineering and parameter tuning.
- Make sure that you measure accuracy and f1_score with the test_data and test_labels generated in the "1. Dataset preparation" section. In other words, everyone will use the same test dataset.
- Discuss the results based on your observations (use comments or Markdown cells).


Load libraries for dataset preparation, feature engineering, model training 

In [1]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

# install textblob: $>pip install textblob
import xgboost, numpy, textblob, string  
import nltk

# load functions from textpreprocess.py
from textpreprocess import denoise_text, normalize, replace_contractions, remove_non_ascii, to_lowercase, remove_punctuation, replace_numbers, remove_stopwords

## 1. Dataset preparation
We are using a movie review dataset. The dataset consists of 2,000 reviews and their labels. 90% of the reviews are stored in train_data, and the remaining 10% in test_data.

In [2]:
import sys
import os
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report

if __name__ == '__main__':

    data_dir = 'data'
    classes = ['pos', 'neg']

    # Step1 - Read the data
    train_data = []
    train_labels = []
    test_data = []
    test_labels = []
    for curr_class in classes:
        # build the class directory from data_dir instead of repeating the literal path
        dirname = os.path.join(data_dir, curr_class)
        for fname in os.listdir(dirname):
            with open(os.path.join(dirname, fname), 'r') as f:
                content = f.read()
                # Partition the test data
                if fname.startswith('cv9'):
                    test_data.append(content)
                    test_labels.append(curr_class)
                else:
                    train_data.append(content)
                    train_labels.append(curr_class)
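
The partition rule above can be checked in isolation. The following is a minimal sketch, assuming each class directory holds 1,000 files named with a `cvNNN_` prefix (`cv000` to `cv999`); the file names and the helper `split_by_prefix` are hypothetical and only illustrate why the `cv9` prefix yields a 90/10 split.

```python
# Hypothetical file names; the real corpus is assumed to follow a cvNNN_* pattern.
filenames = [f"cv{i:03d}_{10000 + i}.txt" for i in range(1000)]


def split_by_prefix(names, test_prefix="cv9"):
    """Split file names into (train, test) using the same prefix rule as the loading loop."""
    train, test = [], []
    for name in names:
        if name.startswith(test_prefix):
            test.append(name)
        else:
            train.append(name)
    return train, test


train_files, test_files = split_by_prefix(filenames)
print(len(train_files), len(test_files))  # counts for one class directory
```

With two class directories, this gives the 1,800 training and 200 test reviews reported by the cells below.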

In [3]:
len(train_data)

1800

In [4]:
len(test_data)

200

In [5]:
test_labels.count('pos')

100

## 2. Feature Engineering 

### 2.1 TF-IDF Vectors as features

Apart from the basic tf-idf with predefined min_df=5 and max_df=0.8, a variant with stop_words='english' (labelled "with stop words" below, meaning that English stop words are removed) is included to compare performance. I also added the following 3 levels of input tokens to compare performance.

a. Word Level TF-IDF : Matrix representing tf-idf scores of every term in different documents

b. N-gram Level TF-IDF : N-grams are combinations of N consecutive terms. This matrix represents the tf-idf scores of N-grams

c. Character Level TF-IDF : Matrix representing tf-idf scores of character level n-grams in the corpus
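
To make the three token levels concrete, here is a small sketch on a two-sentence toy corpus (not the review data). It uses scikit-learn's default `token_pattern`, which drops single-character tokens such as "a"; the notebook's own vectorizers use `r'\w{1,}'`, which keeps them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["a good movie", "a bad movie"]  # toy corpus for illustration only

word_vect = TfidfVectorizer(analyzer="word")                        # single words
bigram_vect = TfidfVectorizer(analyzer="word", ngram_range=(2, 2))  # word pairs
char_vect = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))    # 2- and 3-character sequences

for name, vect in [("word", word_vect), ("bigram", bigram_vect), ("char", char_vect)]:
    vect.fit(docs)
    # vocabulary_ maps each learned feature to its column index
    print(name, sorted(vect.vocabulary_))
```

The character-level vocabulary is much larger than the word-level one even on this tiny corpus, which is why `max_features` is used to cap it in the cells below.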

In [6]:
# Basic tf-idf without stop words
tfidf_vect = TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True)
xtrain_tfidf =  tfidf_vect.fit_transform(train_data)
xtest_tfidf =  tfidf_vect.transform(test_data)

# Basic tf-idf with stop words (stop_words='english')
# removes words that are uninformative in representing the content of the text
tfidf_stopw_vect = TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True, stop_words='english')
xtrain_tfidf_stopw =  tfidf_stopw_vect.fit_transform(train_data)
xtest_tfidf_stopw =  tfidf_stopw_vect.transform(test_data)

# word level tf-idf
tfidf_word_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000, stop_words='english', sublinear_tf=True)
xtrain_tfidf_word =  tfidf_word_vect.fit_transform(train_data)
xtest_tfidf_word =  tfidf_word_vect.transform(test_data)

# ngram level tf-idf
tfidf_ngram_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000, stop_words='english', sublinear_tf=True)
xtrain_tfidf_ngram =  tfidf_ngram_vect.fit_transform(train_data)
xtest_tfidf_ngram =  tfidf_ngram_vect.transform(test_data)

# characters level tf-idf
# token_pattern and stop_words only apply when analyzer == 'word'; with analyzer='char' they are ignored
tfidf_chars_vect = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000, stop_words='english', sublinear_tf=True)
xtrain_tfidf_chars =  tfidf_chars_vect.fit_transform(train_data)
xtest_tfidf_chars =  tfidf_chars_vect.transform(test_data)
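
The effect of `sublinear_tf=True` and `use_idf=True` can be worked through by hand. The sketch below reproduces the two formulas scikit-learn documents for these options (term frequency replaced by `1 + ln(tf)`, and the smoothed idf `ln((1 + n) / (1 + df)) + 1`); the helper names are hypothetical, and the final L2 normalisation of each row is left out.

```python
import math


def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency (scikit-learn's default smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


n_docs = 1800  # size of train_data
# A term repeated 10 times counts about 3.3x a single occurrence, not 10x.
print(sublinear_tf(1), sublinear_tf(10))
# Rare terms (df=5, the min_df bound) get a higher idf than very common ones
# (df=1440, i.e. 80% of the documents, the max_df bound).
print(smoothed_idf(n_docs, 5), smoothed_idf(n_docs, 1440))
```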
 

### 2.2 Count Vectors as features

[Count Vector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
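
As a quick illustration of that layout, the following sketch builds a count matrix by hand for a two-document toy corpus. It assumes plain whitespace tokenization and is not how `CountVectorizer` is implemented; it only shows what the rows, columns and cells mean.

```python
from collections import Counter

docs = ["good good movie", "bad movie"]  # toy corpus for illustration only

# Columns: one per distinct term, in sorted order.
vocab = sorted({token for doc in docs for token in doc.split()})

# Rows: one per document; each cell is the count of that term in that document.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```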

In [7]:
# create a count vectorizer object: 
# analyzer: whether the feature should be made of word or character n-grams.
# token_pattern: regular expression denoting what constitutes a “token”
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

# fit and transform the training and test data using count vectorizer object
xtrain_count =  count_vect.fit_transform(train_data)
xtest_count =  count_vect.transform(test_data) # just transform test data

## 3. Model Building
The next step in the text classification framework is to train a classifier using the features created in the previous step. There are many machine learning models that can be used to train the final model. We will implement the following classifiers for this purpose: Naive Bayes, Logistic Regression, SVM, Random Forest and Extreme Gradient Boosting.

The following utility function is used to train a model. It accepts the classifier, the feature vectors of the training data, the labels of the training data and the feature vectors of the test data as inputs. The model is trained on these inputs, then the accuracy score is computed and the classification report is obtained, which gives us the precision, recall, f1-score and accuracy.

Precision: $\frac{TP}{TP + FP}$

Recall: $\frac{TP}{TP + FN}$

F1-Score: $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ <br>
<i>The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst value at 0. Precision and recall contribute equally to the F1 score.</i>

Accuracy: $\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$ or $\frac{TP + TN}{TP + TN + FP + FN}$
<p>

<i>Notes:</i><br>
TP: True Positive <br>
FP: False Positive <br>
TN: True Negative <br>
FN: False Negative <p>

<i>References:</i><br>
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall <br>
https://developers.google.com/machine-learning/crash-course/classification/accuracy <br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html <br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
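
The formulas above can be worked through with concrete counts. The sketch below treats 'pos' as the positive class and uses the counts implied by the Naive Bayes report on basic TF-IDF without stop words (recall 0.78 for 'pos' and 0.92 for 'neg', with 100 test reviews per class); the helper name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 78 of 100 'pos' reviews predicted 'pos', 92 of 100 'neg' reviews predicted 'neg'.
tp, fn = 78, 22
tn, fp = 92, 8

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints: 0.91 0.78 0.84 0.85
```

These values match the 'pos' row and the accuracy line of that report.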


In [8]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on the test dataset
    predictions = classifier.predict(feature_vector_valid)
    
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
     
    # NOTE: relies on the global test_labels from section 1; y_true is passed first
    return metrics.accuracy_score(test_labels, predictions), classification_report(test_labels, predictions)

### 3.1 Naive Bayes Classifier
Training scikit-learn's Naive Bayes model on each of the feature sets

[Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
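
The independence assumption means a document's score for a class is the class prior multiplied by one probability per word. The following from-scratch sketch illustrates this on a two-review toy corpus; it is not scikit-learn's `MultinomialNB` implementation, and the whitespace tokenization, the add-one smoothing (`alpha=1.0`) and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

train = [("great fun great cast", "pos"), ("boring plot bad cast", "neg")]  # toy data


def fit(samples, alpha=1.0):
    """Collect per-class word counts and class counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab, alpha


def predict(model, text):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
                score += math.log(smoothed)  # each word contributes independently
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict(model, "great cast"), predict(model, "bad plot"))
```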

In [9]:
# Naive Bayes on Basic tf-idf without stop words
accuracy, report = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_labels, xtest_tfidf)
print ("NB, Accuracy for Basic TF-IDF without stop words: ", accuracy)
print ("NB, Report for Basic TF-IDF without stop words:\n", report)

NB, Accuracy for Basic TF-IDF without stop words:  0.85
NB, Report for Basic TF-IDF without stop words:
               precision    recall  f1-score   support

         neg       0.81      0.92      0.86       100
         pos       0.91      0.78      0.84       100

    accuracy                           0.85       200
   macro avg       0.86      0.85      0.85       200
weighted avg       0.86      0.85      0.85       200



In [10]:
# Naive Bayes on Basic tf-idf with stop words
# removes words that are uninformative in representing the content of the text
accuracy, report = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_stopw, train_labels, xtest_tfidf_stopw)
print ("NB, Accuracy for Basic TF-IDF with stop words: ", accuracy)
print ("NB, Report for Basic TF-IDF with stop words:\n", report)

NB, Accuracy for Basic TF-IDF with stop words:  0.85
NB, Report for Basic TF-IDF with stop words:
               precision    recall  f1-score   support

         neg       0.82      0.90      0.86       100
         pos       0.89      0.80      0.84       100

    accuracy                           0.85       200
   macro avg       0.85      0.85      0.85       200
weighted avg       0.85      0.85      0.85       200



In [11]:
# Naive Bayes on Word Level TF IDF Vectors
accuracy, report = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_word, train_labels, xtest_tfidf_word)
print ("NB, Accuracy for WordLevel TF-IDF: ", accuracy)
print ("NB, Report for WordLevel TF-IDF:\n", report)

NB, Accuracy for WordLevel TF-IDF:  0.83
NB, Report for WordLevel TF-IDF:
               precision    recall  f1-score   support

         neg       0.79      0.89      0.84       100
         pos       0.88      0.77      0.82       100

    accuracy                           0.83       200
   macro avg       0.83      0.83      0.83       200
weighted avg       0.83      0.83      0.83       200



In [12]:
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy, report = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_labels, xtest_tfidf_ngram)
print ("NB, Accuracy for N-Gram Vectors: ", accuracy)
print ("NB, Report for N-Gram Vectors:\n", report)

NB, Accuracy for N-Gram Vectors:  0.77
NB, Report for N-Gram Vectors:
               precision    recall  f1-score   support

         neg       0.75      0.82      0.78       100
         pos       0.80      0.72      0.76       100

    accuracy                           0.77       200
   macro avg       0.77      0.77      0.77       200
weighted avg       0.77      0.77      0.77       200



In [13]:
# Naive Bayes on Character Level TF IDF Vectors
accuracy, report = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_chars, train_labels, xtest_tfidf_chars)
print ("NB, Accuracy for CharLevel Vectors: ", accuracy)
print ("NB, Report for CharLevel Vectors:\n", report)

NB, Accuracy for CharLevel Vectors:  0.79
NB, Report for CharLevel Vectors:
               precision    recall  f1-score   support

         neg       0.77      0.83      0.80       100
         pos       0.82      0.75      0.78       100

    accuracy                           0.79       200
   macro avg       0.79      0.79      0.79       200
weighted avg       0.79      0.79      0.79       200



In [14]:
# Naive Bayes on Count Vectors
# train_model(classifier, train feature vectors (CountVectorizer), labels,
#             test feature vectors (CountVectorizer))
accuracy, report = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_labels, xtest_count)
print ("NB, Accuracy for Count Vectors: ", accuracy)
print ("NB, Report for Count Vectors:\n", report)

NB, Accuracy for Count Vectors:  0.83
NB, Report for Count Vectors:
               precision    recall  f1-score   support

         neg       0.81      0.87      0.84       100
         pos       0.86      0.79      0.82       100

    accuracy                           0.83       200
   macro avg       0.83      0.83      0.83       200
weighted avg       0.83      0.83      0.83       200



### 3.2 Logistic Regression

Implementing a Linear Classifier (Logistic Regression)

[Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function. 
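
As a minimal sketch of how the sigmoid turns a weighted sum of features into a probability: the weights, bias and feature values below are hypothetical, not coefficients learned by the notebook's model.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


weights, bias = [2.0, -3.0], 0.0          # hypothetical coefficients
p = predict_proba(weights, bias, [0.5, 0.1])
print(p)  # above 0.5, so the positive class would be predicted
```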

In [15]:
# Linear Classifier on Basic tf-idf without stop words
accuracy, report = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_labels, xtest_tfidf)
print ("LR, Accuracy for Basic TF-IDF without stop words: ", accuracy)
print ("LR, Report for Basic TF-IDF without stop words:\n", report)

LR, Accuracy for Basic TF-IDF without stop words:  0.895
LR, Report for Basic TF-IDF without stop words:
               precision    recall  f1-score   support

         neg       0.89      0.90      0.90       100
         pos       0.90      0.89      0.89       100

    accuracy                           0.90       200
   macro avg       0.90      0.90      0.89       200
weighted avg       0.90      0.90      0.89       200





In [16]:
# Linear Classifier on Basic tf-idf with stop words
# removes words that are uninformative in representing the content of the text
accuracy, report = train_model(linear_model.LogisticRegression(), xtrain_tfidf_stopw, train_labels, xtest_tfidf_stopw)
print ("LR, Accuracy for Basic TF-IDF with stop words: ", accuracy)
print ("LR, Report for Basic TF-IDF with stop words:\n", report)

LR, Accuracy for Basic TF-IDF with stop words:  0.885
LR, Report for Basic TF-IDF with stop words:
               precision    recall  f1-score   support

         neg       0.85      0.93      0.89       100
         pos       0.92      0.84      0.88       100

    accuracy                           0.89       200
   macro avg       0.89      0.89      0.88       200
weighted avg       0.89      0.89      0.88       200



In [17]:
# Linear Classifier on Word Level TF IDF Vectors
accuracy, report = train_model(linear_model.LogisticRegression(), xtrain_tfidf_word, train_labels, xtest_tfidf_word)
print ("LR, Accuracy for WordLevel TF-IDF: ", accuracy)
print ("LR, Report for WordLevel TF-IDF:\n", report)

LR, Accuracy for WordLevel TF-IDF:  0.875
LR, Report for WordLevel TF-IDF:
               precision    recall  f1-score   support

         neg       0.86      0.89      0.88       100
         pos       0.89      0.86      0.87       100

    accuracy                           0.88       200
   macro avg       0.88      0.88      0.87       200
weighted avg       0.88      0.88      0.87       200



In [18]:
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy, report = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_labels, xtest_tfidf_ngram)
print ("LR, Accuracy for N-Gram Vectors: ", accuracy)
print ("LR, Report for N-Gram Vectors:\n", report)

LR, Accuracy for N-Gram Vectors:  0.795
LR, Report for N-Gram Vectors:
               precision    recall  f1-score   support

         neg       0.77      0.85      0.81       100
         pos       0.83      0.74      0.78       100

    accuracy                           0.80       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.80      0.79       200



In [19]:
# Linear Classifier on Character Level TF IDF Vectors
accuracy, report = train_model(linear_model.LogisticRegression(), xtrain_tfidf_chars, train_labels, xtest_tfidf_chars)
print ("LR, Accuracy for CharLevel Vectors: ", accuracy)
print ("LR, Report for CharLevel Vectors:\n", report)

LR, Accuracy for CharLevel Vectors:  0.845
LR, Report for CharLevel Vectors:
               precision    recall  f1-score   support

         neg       0.85      0.84      0.84       100
         pos       0.84      0.85      0.85       100

    accuracy                           0.84       200
   macro avg       0.85      0.84      0.84       200
weighted avg       0.85      0.84      0.84       200



In [20]:
# Linear Classifier on Count Vectors
# train_model(classifier, train feature vectors (CountVectorizer), labels,
#             test feature vectors (CountVectorizer))
accuracy, report = train_model(linear_model.LogisticRegression(), xtrain_count, train_labels, xtest_count)
print ("LR, Accuracy for Count Vectors: ", accuracy)
print ("LR, Report for Count Vectors:\n", report)

LR, Accuracy for Count Vectors:  0.855
LR, Report for Count Vectors:
               precision    recall  f1-score   support

         neg       0.85      0.86      0.86       100
         pos       0.86      0.85      0.85       100

    accuracy                           0.85       200
   macro avg       0.86      0.85      0.85       200
weighted avg       0.86      0.85      0.85       200



### 3.3 SVM Classifier

[Support Vector Machine (SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) is a supervised machine learning algorithm that can be used for both classification and regression. The model finds the hyperplane (or line) that best separates the two classes.

The scikit-learn implementation of SVC is based on libsvm. Its fit time scales at least quadratically with the number of samples, which makes it hard to scale to datasets with more than a few tens of thousands of samples.
The multiclass support is handled according to a one-vs-one scheme.
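
For larger datasets, one commonly suggested alternative to `SVC(kernel='linear')` is `LinearSVC`, which is built on liblinear rather than libsvm. The sketch below compares the two on hypothetical, well-separated toy points; note that `LinearSVC` optimises a slightly different objective by default, so the two models are not guaranteed to agree on real data, and the notebook itself uses `SVC` throughout.

```python
from sklearn.svm import SVC, LinearSVC

# Toy, clearly separable 2-D points (illustration only).
X = [[-1.0, -1.0], [-0.8, -1.2], [1.0, 1.0], [0.8, 1.2]]
y = ["neg", "neg", "pos", "pos"]
new_points = [[-0.9, -1.0], [1.0, 0.9]]

svc = SVC(kernel="linear").fit(X, y)   # libsvm-based, as used in this notebook
linear_svc = LinearSVC().fit(X, y)     # liblinear-based alternative

print(svc.predict(new_points))
print(linear_svc.predict(new_points))
```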

In [21]:
# SVM Classifier on Basic tf-idf without stop words
accuracy, report = train_model(svm.SVC(kernel='linear'), xtrain_tfidf, train_labels, xtest_tfidf)
print ("SVM, Accuracy for Basic TF-IDF without stop words: ", accuracy)
print ("SVM, Report for Basic TF-IDF without stop words:\n", report)

SVM, Accuracy for Basic TF-IDF without stop words:  0.915
SVM, Report for Basic TF-IDF without stop words:
               precision    recall  f1-score   support

         neg       0.91      0.92      0.92       100
         pos       0.92      0.91      0.91       100

    accuracy                           0.92       200
   macro avg       0.92      0.92      0.91       200
weighted avg       0.92      0.92      0.91       200



In [22]:
# SVM Classifier on Basic tf-idf with stop words
# removes words that are uninformative in representing the content of the text
accuracy, report = train_model(svm.SVC(kernel='linear'), xtrain_tfidf_stopw, train_labels, xtest_tfidf_stopw)
print ("SVM, Accuracy for Basic TF-IDF with stop words: ", accuracy)
print ("SVM, Report for Basic TF-IDF with stop words:\n", report)

SVM, Accuracy for Basic TF-IDF with stop words:  0.885
SVM, Report for Basic TF-IDF with stop words:
               precision    recall  f1-score   support

         neg       0.87      0.90      0.89       100
         pos       0.90      0.87      0.88       100

    accuracy                           0.89       200
   macro avg       0.89      0.89      0.88       200
weighted avg       0.89      0.89      0.88       200



In [23]:
# SVM Classifier on Word Level TF IDF Vectors
accuracy, report = train_model(svm.SVC(kernel='linear'), xtrain_tfidf_word, train_labels, xtest_tfidf_word)
print ("SVM, Accuracy for WordLevel TF-IDF: ", accuracy)
print ("SVM, Report for WordLevel TF-IDF:\n", report)

SVM, Accuracy for WordLevel TF-IDF:  0.895
SVM, Report for WordLevel TF-IDF:
               precision    recall  f1-score   support

         neg       0.90      0.89      0.89       100
         pos       0.89      0.90      0.90       100

    accuracy                           0.90       200
   macro avg       0.90      0.90      0.89       200
weighted avg       0.90      0.90      0.89       200



In [24]:
# SVM Classifier on Ngram Level TF IDF Vectors
accuracy, report = train_model(svm.SVC(kernel='linear'), xtrain_tfidf_ngram, train_labels, xtest_tfidf_ngram)
print ("SVM, Accuracy for N-Gram Vectors: ", accuracy)
print ("SVM, Report for N-Gram Vectors:\n", report)

SVM, Accuracy for N-Gram Vectors:  0.8
SVM, Report for N-Gram Vectors:
               precision    recall  f1-score   support

         neg       0.78      0.83      0.81       100
         pos       0.82      0.77      0.79       100

    accuracy                           0.80       200
   macro avg       0.80      0.80      0.80       200
weighted avg       0.80      0.80      0.80       200



In [25]:
# SVM Classifier on Character Level TF IDF Vectors
accuracy, report = train_model(svm.SVC(kernel='linear'), xtrain_tfidf_chars, train_labels, xtest_tfidf_chars)
print ("SVM, Accuracy for CharLevel Vectors: ", accuracy)
print ("SVM, Report for CharLevel Vectors:\n", report)

SVM, Accuracy for CharLevel Vectors:  0.87
SVM, Report for CharLevel Vectors:
               precision    recall  f1-score   support

         neg       0.88      0.86      0.87       100
         pos       0.86      0.88      0.87       100

    accuracy                           0.87       200
   macro avg       0.87      0.87      0.87       200
weighted avg       0.87      0.87      0.87       200



In [26]:
# SVM Classifier on Count Vectors
# train_model(classifier, train feature vectors (CountVectorizer), labels,
#             test feature vectors (CountVectorizer))
accuracy, report = train_model(svm.SVC(kernel='linear'), xtrain_count, train_labels, xtest_count)
print ("SVM, Accuracy for Count Vectors: ", accuracy)
print ("SVM, Report for Count Vectors:\n", report)

SVM, Accuracy for Count Vectors:  0.835
SVM, Report for Count Vectors:
               precision    recall  f1-score   support

         neg       0.84      0.83      0.83       100
         pos       0.83      0.84      0.84       100

    accuracy                           0.83       200
   macro avg       0.84      0.83      0.83       200
weighted avg       0.84      0.83      0.83       200



### 3.4 Random Forest Classifier

Implementing a [Random Forest Model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Random Forest models are a type of ensemble model, specifically a bagging model. They are part of the tree-based model family.
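
Because each tree is built from a random bootstrap sample and random feature subsets, a forest created without `random_state` can give slightly different scores on every run. The sketch below, on synthetic data from `make_classification` (not the review data), shows that fixing the seed makes the predictions repeatable; the seed value 42 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)


def fit_predict(seed):
    """Train a forest with a fixed seed and predict the held-out rows."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return model.fit(X[:150], y[:150]).predict(X[150:])


# Two independent fits with the same seed produce identical predictions.
same = (fit_predict(42) == fit_predict(42)).all()
print(same)
```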

In [27]:
# RF Classifier on Basic tf-idf without stop words
accuracy, report = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf, train_labels, xtest_tfidf)
print ("RF, Accuracy for Basic TF-IDF without stop words: ", accuracy)
print ("RF, Report for Basic TF-IDF without stop words:\n", report)

RF, Accuracy for Basic TF-IDF without stop words:  0.81
RF, Report for Basic TF-IDF without stop words:
               precision    recall  f1-score   support

         neg       0.76      0.90      0.83       100
         pos       0.88      0.72      0.79       100

    accuracy                           0.81       200
   macro avg       0.82      0.81      0.81       200
weighted avg       0.82      0.81      0.81       200



In [28]:
# RF Classifier on Basic tf-idf with stop words
# removes words that are uninformative in representing the content of the text
accuracy, report = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf_stopw, train_labels, xtest_tfidf_stopw)
print ("RF, Accuracy for Basic TF-IDF with stop words: ", accuracy)
print ("RF, Report for Basic TF-IDF with stop words:\n", report)

RF, Accuracy for Basic TF-IDF with stop words:  0.825
RF, Report for Basic TF-IDF with stop words:
               precision    recall  f1-score   support

         neg       0.80      0.87      0.83       100
         pos       0.86      0.78      0.82       100

    accuracy                           0.82       200
   macro avg       0.83      0.82      0.82       200
weighted avg       0.83      0.82      0.82       200



In [29]:
# RF Classifier on Word Level TF IDF Vectors
accuracy, report = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf_word, train_labels, xtest_tfidf_word)
print ("RF, Accuracy for WordLevel TF-IDF: ", accuracy)
print ("RF, Report for WordLevel TF-IDF:\n", report)

RF, Accuracy for WordLevel TF-IDF:  0.815
RF, Report for WordLevel TF-IDF:
               precision    recall  f1-score   support

         neg       0.78      0.88      0.83       100
         pos       0.86      0.75      0.80       100

    accuracy                           0.81       200
   macro avg       0.82      0.81      0.81       200
weighted avg       0.82      0.81      0.81       200



In [30]:
# RF Classifier on Ngram Level TF IDF Vectors
accuracy, report = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf_ngram, train_labels, xtest_tfidf_ngram)
print ("RF, Accuracy for N-Gram Vectors: ", accuracy)
print ("RF, Report for N-Gram Vectors:\n", report)

RF, Accuracy for N-Gram Vectors:  0.68
RF, Report for N-Gram Vectors:
               precision    recall  f1-score   support

         neg       0.67      0.71      0.69       100
         pos       0.69      0.65      0.67       100

    accuracy                           0.68       200
   macro avg       0.68      0.68      0.68       200
weighted avg       0.68      0.68      0.68       200



In [31]:
# RF Classifier on Character Level TF IDF Vectors
accuracy, report = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_tfidf_chars, train_labels, xtest_tfidf_chars)
print ("RF, Accuracy for CharLevel Vectors: ", accuracy)
print ("RF, Report for CharLevel Vectors:\n", report)

RF, Accuracy for CharLevel Vectors:  0.74
RF, Report for CharLevel Vectors:
               precision    recall  f1-score   support

         neg       0.73      0.76      0.75       100
         pos       0.75      0.72      0.73       100

    accuracy                           0.74       200
   macro avg       0.74      0.74      0.74       200
weighted avg       0.74      0.74      0.74       200



In [32]:
# RF Classifier on Count Vectors
# train_model(classifier, train feature vectors (CountVectorizer), labels,
#             test feature vectors (CountVectorizer))
accuracy, report = train_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_count, train_labels, xtest_count)
print ("RF, Accuracy for Count Vectors: ", accuracy)
print ("RF, Report for Count Vectors:\n", report)

RF, Accuracy for Count Vectors:  0.83
RF, Report for Count Vectors:
               precision    recall  f1-score   support

         neg       0.82      0.84      0.83       100
         pos       0.84      0.82      0.83       100

    accuracy                           0.83       200
   macro avg       0.83      0.83      0.83       200
weighted avg       0.83      0.83      0.83       200



### 3.5 Boosting Model

Implementing [Extreme Gradient Boosting Model](https://xgboost.readthedocs.io/en/latest/index.html).

Boosting models are another type of tree-based ensemble model. Boosting is an ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning; it belongs to a family of machine learning algorithms that convert weak learners into strong ones. A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing).
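
Depending on the installed XGBoost version, `XGBClassifier` may reject string labels such as 'pos' and 'neg' and expect integer classes starting at 0. If that happens, one possible workaround is to encode the labels with scikit-learn's `LabelEncoder` before fitting and decode the predictions afterwards. The sketch below only shows the encoding step on toy labels; wiring it into `train_model` is left as an assumption.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["pos", "neg", "neg", "pos"]  # toy labels in the notebook's format

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # classes are sorted: 'neg' -> 0, 'pos' -> 1
decoded = encoder.inverse_transform(encoded)   # map integer predictions back to strings

print(list(encoder.classes_), list(encoded), list(decoded))
```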

In [33]:
# Extreme Gradient Boosting on Basic tf-idf without stop words
accuracy, report = train_model(xgboost.XGBClassifier(), xtrain_tfidf, train_labels, xtest_tfidf)
print ("Xgb, Accuracy for Basic TF-IDF without stop words: ", accuracy)
print ("Xgb, Report for Basic TF-IDF without stop words:\n", report)

Xgb, Accuracy for Basic TF-IDF without stop words:  0.805
Xgb, Report for Basic TF-IDF without stop words:
               precision    recall  f1-score   support

         neg       0.83      0.77      0.80       100
         pos       0.79      0.84      0.81       100

    accuracy                           0.81       200
   macro avg       0.81      0.80      0.80       200
weighted avg       0.81      0.81      0.80       200



In [34]:
#  Extreme Gradient Boosting  on Basic tf-idf with stop words
# removes words that are uninformative in representing the content of the text
accuracy, report = train_model(xgboost.XGBClassifier(), xtrain_tfidf_stopw, train_labels, xtest_tfidf_stopw)
print ("Xgb, Accuracy for Basic TF-IDF with stop words: ", accuracy)
print ("Xgb, Report for Basic TF-IDF with stop words:\n", report)

Xgb, Accuracy for Basic TF-IDF with stop words:  0.835
Xgb, Report for Basic TF-IDF with stop words:
               precision    recall  f1-score   support

         neg       0.86      0.80      0.83       100
         pos       0.81      0.87      0.84       100

    accuracy                           0.83       200
   macro avg       0.84      0.83      0.83       200
weighted avg       0.84      0.83      0.83       200



In [35]:
# Extreme Gradient Boosting on Word Level TF IDF Vectors
accuracy, report = train_model(xgboost.XGBClassifier(), xtrain_tfidf_word, train_labels, xtest_tfidf_word)
print ("Xgb, Accuracy for WordLevel TF-IDF: ", accuracy)
print ("Xgb, Report for WordLevel TF-IDF:\n", report)

Xgb, Accuracy for WordLevel TF-IDF:  0.815
Xgb, Report for WordLevel TF-IDF:
               precision    recall  f1-score   support

         neg       0.82      0.80      0.81       100
         pos       0.81      0.83      0.82       100

    accuracy                           0.81       200
   macro avg       0.82      0.81      0.81       200
weighted avg       0.82      0.81      0.81       200



In [36]:
#  Extreme Gradient Boosting on Ngram Level TF IDF Vectors
accuracy, report = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram, train_labels, xtest_tfidf_ngram)
print ("Xgb, Accuracy for N-Gram Vectors: ", accuracy)
print ("Xgb, Report for N-Gram Vectors:\n", report)

Xgb, Accuracy for N-Gram Vectors:  0.665
Xgb, Report for N-Gram Vectors:
               precision    recall  f1-score   support

         neg       0.68      0.63      0.65       100
         pos       0.65      0.70      0.68       100

    accuracy                           0.67       200
   macro avg       0.67      0.67      0.66       200
weighted avg       0.67      0.67      0.66       200



In [37]:
# Extreme Gradient Boosting  on Character Level TF IDF Vectors
accuracy, report = train_model(xgboost.XGBClassifier(), xtrain_tfidf_chars, train_labels, xtest_tfidf_chars)
print ("Xgb, Accuracy for CharLevel Vectors: ", accuracy)
print ("Xgb, Report for CharLevel Vectors:\n", report)

Xgb, Accuracy for CharLevel Vectors:  0.79
Xgb, Report for CharLevel Vectors:
               precision    recall  f1-score   support

         neg       0.78      0.80      0.79       100
         pos       0.80      0.78      0.79       100

    accuracy                           0.79       200
   macro avg       0.79      0.79      0.79       200
weighted avg       0.79      0.79      0.79       200



In [38]:
# Extreme Gradient Boosting  on Count Vectors
# train_model(classifier, train feature vectors (CountVectorizer), labels,
#             test feature vectors (CountVectorizer))
accuracy, report = train_model(xgboost.XGBClassifier(), xtrain_count, train_labels, xtest_count)
print ("Xgb, Accuracy for Count Vectors: ", accuracy)
print ("Xgb, Report for Count Vectors:\n", report)

Xgb, Accuracy for Count Vectors:  0.795
Xgb, Report for Count Vectors:
               precision    recall  f1-score   support

         neg       0.83      0.74      0.78       100
         pos       0.77      0.85      0.81       100

    accuracy                           0.80       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.80      0.79       200



### 3.6 Conclusion
I trained five types of classifier on 90% of the data and tested each trained model on the remaining 10%. Six different feature vectors were used, i.e. <br>
<ol>
    <li>Count vectors</li>
    <li>Basic tf-idf without stop words</li>
    <li>Basic tf-idf with stop words</li>
    <li>word level tf-idf</li>
    <li>ngram level tf-idf</li>
    <li>characters level tf-idf</li>
</ol>
Based on accuracy and F1-score, the following summary lists each model with the vector that gave it the best performance, ordered from best to worst.<br>
<ol>
    <li>Support Vector Machine with Basic TF-IDF without stop words - Accuracy: 0.92, F1-score: 0.91</li>
    <li>Logistic Regression (the linear classifier) with Basic TF-IDF without stop words - Accuracy: 0.90, F1-score: 0.90</li>
    <li>Naive Bayes with Basic TF-IDF without stop words - Accuracy: 0.85, F1-score: 0.85</li>
    <li>Extreme Gradient Boosting with Basic TF-IDF with stop words - Accuracy: 0.835, F1-score: 0.83</li>
    <li>Random Forest Classifier with Count Vectors - Accuracy: 0.83, F1-score: 0.83</li>
</ol>
Overall, SVM provides the best performance. Most of the classifiers work best with Basic TF-IDF without stop words. The exceptions are Extreme Gradient Boosting, which works best with Basic TF-IDF with stop words, and the Random Forest Classifier, which works best with Count Vectors.

## 4. Grid Search with Pipeline - improve performance through grid search of parameters

[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) performs an exhaustive search over specified parameter values for an estimator.
It implements `fit` and `score` methods, and also `predict`, `predict_proba`, `decision_function`, `transform` and `inverse_transform` when the underlying estimator implements them.
The estimator's parameters are optimized by cross-validated grid search over a parameter grid.
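
As a minimal sketch of how these pieces fit together, the example below runs a grid search over a small pipeline. The toy corpus and the names `texts`, `labels`, `pipe` and `search` are assumptions made for illustration only; they are not the movie review data or the pipelines used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only (not the movie review data).
texts = ["great movie", "loved it", "wonderful acting", "superb plot",
         "terrible movie", "hated it", "awful acting", "boring plot"]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Grid keys follow the "<step name>__<parameter name>" convention,
# so both the vectorizer and the classifier can be tuned together.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

n_candidates = len(search.cv_results_["params"])  # 2 * 3 combinations
prediction = search.predict(["great plot"])       # uses the refitted best pipeline

print(n_candidates)
print(search.best_params_)
print(prediction)
```

Every combination in the grid is cross-validated, so the cost grows multiplicatively with each parameter added.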

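Based on the results in <b>3 Model Building</b>, Grid Search with Pipeline is applied to each model, in most cases paired with the vector that gave it the best accuracy. The exception is the Random Forest Classifier, which is searched on Basic tf-idf without stop words rather than the Count Vectors that scored best in section 3. <br>
The grid for Naive Bayes is the smallest, as it has few hyperparameters to tune. <p>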
    
    1. SVM Classifier on Basic tf-idf without stop words<br>
    2. Linear Classifier on Basic tf-idf without stop words<br>
    3. Extreme Gradient Boosting on Basic tf-idf with stop words<br>
    4. Naive Bayes on Basic tf-idf without stop words<br>
    5. RF Classifier on Basic tf-idf without stop words
    

### 4.1 Grid Search with SVM on Basic tf-idf without stop words

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True)),
    ('clf', svm.SVC(kernel='linear'))
])
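# Note: the grid overrides the kernel set in the pipeline above, and gamma is
# ignored by the linear kernel, so the linear candidates repeat across gamma values.
parameters = {
    'clf__C': (0.01, 1, 10),
    'clf__gamma': (0.5, 1, 2, 3, 4),
    'clf__kernel': ('rbf', 'linear')
}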

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=10)
    grid_search.fit(train_data, train_labels)
    print("Grid Search with SVM:\n")
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    # With the default refit=True, GridSearchCV refits the best parameter setting
    # on the whole training set. The refitted estimator is available as the
    # best_estimator_ attribute, so predict can be called directly on grid_search.
    predictions = grid_search.predict(test_data)
    print('Accuracy:', accuracy_score(test_labels, predictions))
    print('Precision:', precision_score(test_labels, predictions, pos_label='pos'))
    print('Recall:', recall_score(test_labels, predictions, pos_label='pos'))
    print('F1_score:', f1_score(test_labels, predictions, pos_label='pos'))

Fitting 10 folds for each of 30 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  8.8min finished


Grid Search with SVM:

Best score: 0.872
Best parameters set:
	clf__C: 10
	clf__gamma: 0.5
	clf__kernel: 'rbf'
Accuracy: 0.93
Precision: 0.9387755102040817
Recall: 0.92
F1_score: 0.9292929292929293
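
In the grid above, `gamma` is crossed with both kernels, although `SVC` only uses `gamma` for the 'rbf', 'poly' and 'sigmoid' kernels. The sketch below, which is an alternative and not what was run in this notebook, shows how a list of sub-grids would avoid the repeated linear candidates. `flat_grid` and `split_grid` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the cell above: every kernel is crossed with every gamma.
flat_grid = {
    "clf__C": [0.01, 1, 10],
    "clf__gamma": [0.5, 1, 2, 3, 4],
    "clf__kernel": ["rbf", "linear"],
}

# Alternative sketch: one sub-grid per kernel, so gamma only varies for 'rbf'.
split_grid = [
    {"clf__kernel": ["linear"], "clf__C": [0.01, 1, 10]},
    {"clf__kernel": ["rbf"], "clf__C": [0.01, 1, 10],
     "clf__gamma": [0.5, 1, 2, 3, 4]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))

print(n_flat)   # 30 candidates
print(n_split)  # 18 candidates
```

A list of dictionaries can be passed to `GridSearchCV` in place of a single dictionary. With 10-fold cross-validation this would mean 180 fits instead of 300.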


### 4.2 Grid Search with LinearClassifier on Basic tf-idf without stop words

In [42]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True)),
    ('clf', linear_model.LogisticRegression())
])
parameters = {
 'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] 
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=10)
    grid_search.fit(train_data, train_labels)
    print("Grid Search with LinearCLassifier:\n")
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    # With the default refit=True, GridSearchCV refits the best parameter setting
    # on the whole training set. The refitted estimator is available as the
    # best_estimator_ attribute, so predict can be called directly on grid_search.
    predictions = grid_search.predict(test_data)
    print('Accuracy:', accuracy_score(test_labels, predictions))
    print('Precision:', precision_score(test_labels, predictions, pos_label='pos'))
    print('Recall:', recall_score(test_labels, predictions, pos_label='pos'))
    print('F1_score:', f1_score(test_labels, predictions, pos_label='pos'))

Fitting 10 folds for each of 7 candidates, totalling 70 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done  70 out of  70 | elapsed:   21.8s finished


Grid Search with LinearCLassifier:

Best score: 0.874
Best parameters set:
	clf__C: 10
Accuracy: 0.92
Precision: 0.9375
Recall: 0.9
F1_score: 0.9183673469387755


### 4.3 Grid Search with Extreme Gradient Boosting on Basic tf-idf with stop words

In [44]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True)),
    ('clf', xgboost.XGBClassifier())
])
parameters = {
    'clf__max_depth': (2, 3, 5),
    'clf__n_estimators': (60, 100, 220),
    'clf__learning_rate': [0.1, 0.01, 0.05]
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=10)
    grid_search.fit(train_data, train_labels)
    print("Grid Search with XGBClassifier:\n")
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    # With the default refit=True, GridSearchCV refits the best parameter setting
    # on the whole training set. The refitted estimator is available as the
    # best_estimator_ attribute, so predict can be called directly on grid_search.
    predictions = grid_search.predict(test_data)
    print('Accuracy:', accuracy_score(test_labels, predictions))
    print('Precision:', precision_score(test_labels, predictions, pos_label='pos'))
    print('Recall:', recall_score(test_labels, predictions, pos_label='pos'))
    print('F1_score:', f1_score(test_labels, predictions, pos_label='pos'))

Fitting 10 folds for each of 27 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed: 15.7min finished


Grid Search with XGBClassifier:

Best score: 0.817
Best parameters set:
	clf__learning_rate: 0.1
	clf__max_depth: 3
	clf__n_estimators: 220
Accuracy: 0.83
Precision: 0.8055555555555556
Recall: 0.87
F1_score: 0.8365384615384616


### 4.4 Grid Search with Naive Bayes on Basic tf-idf without stop words

In [46]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True)),
    ('clf', naive_bayes.MultinomialNB())
])
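parameters = {
    # alpha=0 means no smoothing; depending on the scikit-learn version this may
    # trigger a warning or be replaced by a tiny positive value
    'clf__alpha': (0, 1.0, 2.0),
    'clf__fit_prior': (True, False)
}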

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=10)
    grid_search.fit(train_data, train_labels)
    print("Grid Search with Naive Bayes:\n")
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    # With the default refit=True, GridSearchCV refits the best parameter setting
    # on the whole training set. The refitted estimator is available as the
    # best_estimator_ attribute, so predict can be called directly on grid_search.
    predictions = grid_search.predict(test_data)
    print('Accuracy:', accuracy_score(test_labels, predictions))
    print('Precision:', precision_score(test_labels, predictions, pos_label='pos'))
    print('Recall:', recall_score(test_labels, predictions, pos_label='pos'))
    print('F1_score:', f1_score(test_labels, predictions, pos_label='pos'))

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.4s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:   18.4s finished


Grid Search with Naive Bayes:

Best score: 0.832
Best parameters set:
	clf__alpha: 2.0
	clf__fit_prior: True
Accuracy: 0.855
Precision: 0.9176470588235294
Recall: 0.78
F1_score: 0.8432432432432432
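
The Naive Bayes result shows a wide gap between precision and recall. As a worked check, the confusion counts can be reconstructed from the printed metrics. This sketch assumes the test set holds 100 'pos' and 100 'neg' reviews, as the support column of the classification reports above indicates; the variable names are introduced here for illustration.

```python
# Assumption: 100 'pos' and 100 'neg' test reviews (support column of the reports).
n_pos, n_neg = 100, 100
accuracy, recall = 0.855, 0.78   # values printed by the Naive Bayes cell

tp = round(recall * n_pos)                    # positives correctly found
fn = n_pos - tp                               # positives missed
tn = round(accuracy * (n_pos + n_neg)) - tp   # correct predictions that are negatives
fp = n_neg - tn                               # negatives flagged as positive

precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fn, tn, fp)                      # 78 22 93 7
print(round(precision, 4), round(f1, 4))   # 0.9176 0.8432
```

The model rarely labels a negative review as positive (7 cases) but misses 22 positive reviews, which is why precision is high while recall and F1 are lower.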


### 4.5 Grid Search with RF Classifier on Basic tf-idf without stop words

In [50]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5, max_df = 0.8, sublinear_tf=True, use_idf=True)),
    ('clf', ensemble.RandomForestClassifier(n_estimators=100))
])
parameters = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 3, 5],
    # floats are fractions of the training samples (1.0 = all of them), not counts
    "clf__min_samples_split": [1.0, 0.3, 0.5],
    "clf__min_samples_leaf": [1, 2],
    "clf__max_leaf_nodes": [None, 5]
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=10)
    grid_search.fit(train_data, train_labels)
    print("Grid Search with RandomForest:\n")
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    # With the default refit=True, GridSearchCV refits the best parameter setting
    # on the whole training set. The refitted estimator is available as the
    # best_estimator_ attribute, so predict can be called directly on grid_search.
    predictions = grid_search.predict(test_data)
    print('Accuracy:', accuracy_score(test_labels, predictions))
    print('Precision:', precision_score(test_labels, predictions, pos_label='pos'))
    print('Recall:', recall_score(test_labels, predictions, pos_label='pos'))
    print('F1_score:', f1_score(test_labels, predictions, pos_label='pos'))

Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   17.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  4.4min finished


Grid Search with RandomForest:

Best score: 0.797
Best parameters set:
	clf__max_depth: None
	clf__max_leaf_nodes: None
	clf__min_samples_leaf: 1
	clf__min_samples_split: 0.3
	clf__n_estimators: 100
Accuracy: 0.82
Precision: 0.82
Recall: 0.82
F1_score: 0.82
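
One thing worth noting about the grid above is that float values of `min_samples_split` are read as fractions of the training samples, so all three candidates restrict tree growth heavily. The sketch below illustrates this on a single decision tree with synthetic data. The data and the names `depths` and `default_tree` are assumptions for illustration and are not part of the notebook's pipeline.

```python
import math

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the TF-IDF features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

depths = {}
for fraction in (0.3, 0.5, 1.0):
    # A float is a fraction: a node needs ceil(fraction * n_samples) samples
    # before it may be split.
    needed = math.ceil(fraction * len(X))
    tree = DecisionTreeClassifier(min_samples_split=fraction, random_state=0)
    tree.fit(X, y)
    depths[fraction] = tree.get_depth()
    print(fraction, needed, depths[fraction])

# An integer is an absolute count; 2 is the scikit-learn default.
default_tree = DecisionTreeClassifier(min_samples_split=2, random_state=0).fit(X, y)
print("default depth:", default_tree.get_depth())
```

With `min_samples_split=1.0` only a node holding every training sample may be split, so the tree cannot grow past its root split. Adding integer values such as 2, 5 or 10 to the grid might give the Random Forest more room to fit.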


### 4.6 Conclusion
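Overall, Grid Search helped tune the classifiers, but the gain on the test set was not uniform. Compared with the summary in 3.6, test accuracy rose for SVM (0.92 to 0.93), the linear classifier (0.90 to 0.92) and Naive Bayes (0.85 to 0.855), while Extreme Gradient Boosting (0.835 to 0.83) and Random Forest (0.83 with Count Vectors to 0.82 with tf-idf) did not improve. The overall best classifier is <br>
Grid Search with SVM:<p>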

Best score: 0.872<br>
Best parameters set:<br>
clf__C: 10<br>
clf__gamma: 0.5<br>
clf__kernel: 'rbf'<br>
Accuracy: 0.93<br>
Precision: 0.9387755102040817<br>
Recall: 0.92<br>
F1_score: 0.9292929292929293<br>

Initially, I wanted to search over more parameters and more values per parameter for each classifier, and to tune the vectorizer parameters as well. However, my notebook kept hanging, so I had to limit the number of candidate combinations for both the classifiers and the vectors. With more computing power and time, the classifier models could be improved further.
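
One possible way to cover a larger search space under a limited compute budget is randomized search, which is a different technique from the exhaustive grid search used above and was not run in this notebook. The sketch below uses scikit-learn's `RandomizedSearchCV` on a hypothetical toy corpus; `texts`, `labels`, the chosen parameter ranges and `n_iter=8` are all assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for train_data / train_labels.
texts = ["great movie", "loved it", "wonderful acting", "superb plot",
         "terrible movie", "hated it", "awful acting", "boring plot"] * 3
labels = (["pos"] * 4 + ["neg"] * 4) * 3

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SVC()),
])

# Lists are sampled uniformly; distributions are sampled continuously.
# Vectorizer and classifier parameters are tuned in the same search.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__sublinear_tf": [True, False],
    "clf__kernel": ["rbf", "linear"],
    "clf__C": loguniform(1e-2, 1e2),
    "clf__gamma": loguniform(1e-2, 1e1),
}

# n_iter caps the number of candidates no matter how large the space is.
search = RandomizedSearchCV(pipe, param_distributions, n_iter=8, cv=3,
                            scoring="accuracy", random_state=0)
search.fit(texts, labels)

n_sampled = len(search.cv_results_["params"])
print(n_sampled)
print(search.best_params_)
```

Because the number of fits is `n_iter` times the number of folds, the run time can be fixed in advance, at the cost of not guaranteeing that the best combination in the space is visited.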

