### Part 1

Since scikit-learn offers all the required functionality out of the box, with the additional benefit of the pipeline object, my initial manual implementation was replaced with a scikit-learn pipeline, which reduced the code size considerably. Below is an implementation of a simple Multinomial Naive Bayes classifier using the built-in count vectorizer with added stemming:

In [101]:
# Edit here if the path to the CSV is different:
CSV_PATH = "data/car-reviews.csv"

In [102]:
import pandas as pd
import nltk
import ssl
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    """Class adds stemming of english words to the CountVectorizer"""

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])

class CarReviewsClassifier():

    def __init__(self, fpath):
        
        # this is needed since NLTK stop words download is buggy
        try:
            _create_unverified_https_context = ssl._create_unverified_context
        except AttributeError:
            pass
        else:
            ssl._create_default_https_context = _create_unverified_https_context
            
        nltk.download('stopwords')
        nltk.download('punkt')
        
        # load the car reviews dataset
        dataset = pd.read_csv(fpath)
        print("Loaded {} car reviews".format(dataset.shape[0]))
        
        # split the dataset into a training / test set 
        self.trainsetX, self.testsetX, self.trainsetY, self.testsetY = \
            train_test_split(dataset.Review, dataset.Sentiment, test_size=0.2)


    def train_and_test(self):

        vectorizer = StemmedCountVectorizer(analyzer="word", stop_words='english', binary=True, lowercase=True)
        classifier = MultinomialNB()

        self.pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])

        # Train
        self.pipeline.fit(self.trainsetX, self.trainsetY)
        
        # Predict
        predY = self.pipeline.predict(self.testsetX)
        
        # Here the score is accuracy for classification (TN + TP)/Total
        print("\nAccuracy {:.2f}".format(self.pipeline.score(self.testsetX, self.testsetY)))
        
        # Return the confusion matrix
        return confusion_matrix(self.testsetY, predY)
    

The accuracy, (TP+TN)/(TP+TN+FP+FN), of the simple classifier is between 75% and 80%, which is not bad considering that no hyper-parameter tuning has been done at this point. A binary count vectorizer was chosen because it gave better results in the second part of the assignment.
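
The difference between binary and count features can be illustrated with a small pure-Python sketch. The tokens and the vocabulary are made up for illustration and are not taken from the dataset; the binary variant only records whether a term is present, which is the idea behind the binary=True setting of the count vectorizer.

```python
from collections import Counter

# Hypothetical, already stemmed tokens of a single review (illustration only).
tokens = ["great", "car", "great", "engin", "great"]
# Hypothetical vocabulary, sorted so that the column order is deterministic.
vocabulary = sorted({"bad", "car", "engin", "great"})

counts = Counter(tokens)

# Count features: how many times each vocabulary term occurs in the review.
count_vector = [counts[term] for term in vocabulary]

# Binary features: 1 if the term occurs at all, 0 otherwise.
binary_vector = [1 if counts[term] > 0 else 0 for term in vocabulary]

print(vocabulary)     # ['bad', 'car', 'engin', 'great']
print(count_vector)   # [0, 1, 1, 3]
print(binary_vector)  # [0, 1, 1, 1]
```

With binary features a word repeated three times carries the same weight as a word used once, so long or repetitive reviews do not dominate the model.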

In [103]:
#
#  Run this cell to test the basic Multinomial Naive Bayes classifier
#

tn, fp, fn, tp = CarReviewsClassifier(CSV_PATH).train_and_test().ravel()

print("\nTrue positives", tp)
print("True negatives", tn)
print("False positives", fp)
print("False negatives", fn)



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andrejwork/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/andrejwork/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loaded 1382 car reviews

Accuracy 0.78

True positives 121
True negatives 95
False positives 39
False negatives 22
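
As a cross-check, the reported accuracy can be recomputed from the four counts printed in the run above. Precision and recall are not reported in this notebook; they are added here only as an illustration and assume that the positive sentiment is treated as the positive class.

```python
# Counts copied from the run shown above.
tp, tn, fp, fn = 121, 95, 39, 22

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print("Total test reviews:", total)            # 277
print("Accuracy  {:.2f}".format(accuracy))     # 0.78
print("Precision {:.2f}".format(precision))    # 0.76
print("Recall    {:.2f}".format(recall))       # 0.85
```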


### Part 1 - manually implemented count vectorizer

To demonstrate the preprocessing steps more clearly, I also implemented the count vectorizer manually. There are some differences between my implementation and the scikit-learn implementation above:
* out of curiosity, I used numerical counts for each word/stem instead of binary values
* the train/test split of the dataset is done using the Pandas sample method
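
The sample-then-drop split can be sketched without Pandas as follows. The function name and the seed value are assumptions made for this illustration; the row count of 1382 is the one printed when the dataset is loaded. Fixing the seed makes the split, and therefore the accuracy, repeatable between runs (the Pandas sample method accepts a random_state argument for the same purpose).

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=None):
    """Return (train, test) row indices, mimicking sample() followed by drop()."""
    rng = random.Random(seed)
    n_train = int(train_fraction * n_rows)
    # Draw the training rows at random, without replacement.
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    # Everything that was not sampled becomes the test set.
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test

train_idx, test_idx = split_indices(1382, seed=44)
print(len(train_idx), len(test_idx))  # 1105 277
```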

In [104]:
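import nltk
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB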

class CarReviewsClassifierManualVectorizer():

    def __init__(self):
        nltk.download('stopwords')
        nltk.download('punkt')

    def tokenize(self, text):
        return nltk.tokenize.word_tokenize(text)

    def remove_stop_words(self, word_list):
        stop_words = set(nltk.corpus.stopwords.words('english'))
        filtered = [w for w in word_list if w not in stop_words]
        return filtered

    def lower_case(self, word_list):
        return [w.lower() for w in word_list]

    def stem_words(self, list_of_words):
        ps = nltk.stem.PorterStemmer()
        stemmed = [ps.stem(w) for w in list_of_words]
        return stemmed

    
    def generate_bow_faster(self, list_tokenized_reviews):
        """Generates bag of words"""
        # This is 6.5x faster than the previous version
        from collections import OrderedDict

        bow = {}
        for r in list_tokenized_reviews:
            for w in r:
                if w in bow:
                    bow[w] += 1
                else:
                    bow[w] = 1
        self.bow_ordered = OrderedDict(sorted(bow.items(), key=lambda t: t[0]))  
        inx = 0        
        for key in self.bow_ordered:
            self.bow_ordered[key] = (inx, self.bow_ordered[key])
            inx += 1
    
    def parse_sentiment(self, s):
        if s.lower() == 'neg':
            return 0
        elif s.lower() == 'pos':
            return 1
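        else:
            # A bare `raise` has no active exception here, so raise explicitly
            raise ValueError("Unknown sentiment label: {!r}".format(s))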
    
    def count_vectorize(self, df, test=False):
        list_tokenized_reviews = []
        sentiments = []
        for r, sent in zip(df["Review"].to_list(), df["Sentiment"].to_list()):
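            v = self.tokenize(r)
            # Lowercase before filtering: the NLTK stop-word list is lowercase,
            # so capitalized stop words would otherwise slip through
            l = self.lower_case(v)
            c = self.remove_stop_words(l)
            s = self.stem_words(c)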
            list_tokenized_reviews.append(s)
            sentiments.append(self.parse_sentiment(sent))

        if not test:
            # Generate bag of words for train dataset
            self.generate_bow_faster(list_tokenized_reviews)
        
        list_features = []
        for review in list_tokenized_reviews:
            vector = [0] * len(self.bow_ordered)
            for token in review:
                # Only count tokens that are part of the training vocabulary
                if token in self.bow_ordered:
                    inx = self.bow_ordered[token][0]
                    vector[inx] += 1
            list_features.append(vector)
            
        return list_features, sentiments

    def train_and_test(self):
        dataset = pd.read_csv(CSV_PATH)
        print("Loaded {} car reviews".format(dataset.shape[0]))
        trainset = dataset.sample(int(0.8*dataset.shape[0]))
        print("Chosing random {} reviews for training".format(trainset.shape[0]))
        testset = dataset.drop(trainset.index)
        print("Testing on remaining {} reviews".format(testset.shape[0]))
        trainX, trainY = self.count_vectorize(trainset)

        classifier = MultinomialNB()

        # Train
        classifier.fit(trainX, trainY)
        
        # Predict
        testX, testY = self.count_vectorize(testset, test=True)
        predY = classifier.predict(testX)
        
        # Here the score is accuracy for classification (TN + TP)/Total
        print("\nAccuracy {:.2f}".format(classifier.score(testX, testY)))
        
        # Return the confusion matrix
        return confusion_matrix(testY, predY)



The accuracy (TP+TN)/(TP+TN+FP+FN) of this classifier with my custom vectorizer is usually slightly worse because it uses counts of word stems instead of binary values in the BOW. Since the train/test split is random, a single run can still score higher than the classifier above. In later tests this approach proved to perform consistently worse, but I added it here for demonstration purposes.
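
One detail of a manual vectorizer is the handling of test-set words that never occurred in the training set: they have no column in the vocabulary and should simply be skipped. A minimal sketch of that idea, with hypothetical helper names and made-up tokens:

```python
def build_vocabulary(tokenized_docs):
    """Map every distinct training token to a column index (alphabetical order)."""
    distinct = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(distinct)}

def vectorize(tokens, vocabulary):
    """Count vector for one document; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] += 1
    return vector

train_docs = [["good", "car"], ["bad", "car", "bad"]]
vocab = build_vocabulary(train_docs)
print(vocab)                                        # {'bad': 0, 'car': 1, 'good': 2}
print(vectorize(["good", "good", "truck"], vocab))  # [0, 0, 2]
```

The token "truck" is absent from the training vocabulary, so it contributes nothing to the vector.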

In [105]:
#
#  Run this cell to test the basic Multinomial Naive Bayes classifier with the manually implemented count vectorizer
#

tn, fp, fn, tp = CarReviewsClassifierManualVectorizer().train_and_test().ravel()

print("\nTrue positives", tp)
print("True negatives", tn)
print("False positives", fp)
print("False negatives", fn)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andrejwork/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/andrejwork/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loaded 1382 car reviews
Chosing random 1105 reviews for training
Testing on remaining 277 reviews

Accuracy 0.81

True positives 121
True negatives 102
False positives 29
False negatives 25


### Part 2 - improving the classifier

To improve my basic Multinomial Naive Bayes classifier the following steps were taken:

1. comparison of multiple available classifiers using grid search and 5-fold cross validation
2. hyper-parameter tuning of the best model

The methods and process used are described in the cells below.
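
To make the cross-validation step concrete, here is a simplified sketch of how 5-fold splitting partitions the training data. It is an illustration only: the helper name is hypothetical, and shuffling and stratification by class, which scikit-learn can apply, are left out.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for fold, (train, validation) in enumerate(k_fold_indices(10, k=5)):
    print(fold, validation)
```

Every sample is used for validation exactly once, and the score of a parameter combination is the average over the five folds.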

In [106]:
import pandas as pd
import nltk
import ssl
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])

class CarReviewsBetterClassifier():

    def __init__(self, fpath):

        try:
            _create_unverified_https_context = ssl._create_unverified_context
        except AttributeError:
            pass
        else:
            ssl._create_default_https_context = _create_unverified_https_context
        nltk.download('stopwords')
        nltk.download('punkt')

        dataset = pd.read_csv(fpath)
        print("Loaded {} car reviews".format(dataset.shape[0]))
        self.trainsetX, self.testsetX, self.trainsetY, self.testsetY = \
            train_test_split(dataset.Review, dataset.Sentiment, test_size=0.2)
                             #, random_state=44)


    def train_and_test(self, min_df=2, max_df=1.0, ngram_range=(1,2), binary=True):
        """Using SVM and optimal values for the hyper-parameters"""

        count_vect = StemmedTfidfVectorizer(analyzer="word", stop_words='english', min_df=min_df, max_df=max_df, ngram_range=ngram_range, binary=binary, lowercase=True)
        classifier = LinearSVC()

        self.pipeline = Pipeline([
            ('vectorizer', count_vect),
            ('classifier', classifier)
        ])

        # call fit as you would on any classifier
        self.pipeline.fit(self.trainsetX, self.trainsetY)

        # predict test instances
        predY = self.pipeline.predict(self.testsetX)
        
        print("\nAccuracy {:.2f}".format(self.pipeline.score(self.testsetX, self.testsetY)))
        
        return confusion_matrix(self.testsetY, predY)

    def grid_search_vectorizer_params(self):
        """Hyper-parameter grid search"""

        classifier = LinearSVC()
        count_vect = TfidfVectorizer(stop_words='english')

        self.pipeline = Pipeline([
            ('vectorizer', count_vect),
            ('classifier', classifier)
        ])

        print(self.pipeline.get_params().keys())

        parameters = [{
            'classifier': (LinearSVC(),),
            # 'classifier__alpha': (0.9,),
            'vectorizer__binary': (True,),
            'vectorizer__lowercase': (True,),
            'vectorizer__max_df': (1.0,),
            'vectorizer__min_df': (1, 3),
            'vectorizer__ngram_range': ((1,2),(1,1)),
            'vectorizer': (StemmedTfidfVectorizer(), StemmedCountVectorizer())
        }]

        grid_search = GridSearchCV(self.pipeline, parameters, verbose = 3, n_jobs = -1)
        clf = grid_search.fit(self.trainsetX, self.trainsetY)
        score = clf.score(self.testsetX, self.testsetY)
        print("{} score: {}".format("Classifier", score))
        print("Best params", clf.best_params_)
        print("Best estimator", clf.best_estimator_)

    def grid_search_classifier(self):
        """Optimal classifier grid search"""

        classifier = LinearSVC()
        count_vect = TfidfVectorizer(stop_words = 'english')

        self.pipeline = Pipeline([
            ('vectorizer', count_vect),
            ('classifier', classifier)
        ])

        print(self.pipeline.get_params().keys())

        parameters = [{
            'classifier': (MultinomialNB(), LinearSVC(), LogisticRegression(), RandomForestClassifier(), MLPClassifier()),
            'vectorizer__binary': (True,),
            'vectorizer__lowercase': (True,),
            'vectorizer__max_df': (1.0,),
            'vectorizer__min_df': (2,),
            'vectorizer__ngram_range': ((1,2),),
            'vectorizer': (StemmedTfidfVectorizer(),)
        }]

        grid_search = GridSearchCV(self.pipeline, parameters, verbose = 3, n_jobs = -1)
        clf = grid_search.fit(self.trainsetX, self.trainsetY)
        score = clf.score(self.testsetX, self.testsetY)
        print("{} score: {}".format("Best classifier", score))
        print("Best params", clf.best_params_)
        print("Best estimator", clf.best_estimator_)

#### Choosing the optimal classifier

To compare different classifiers, a grid search was used (method grid_search_classifier in the code above), and the following models were compared:

* Multinomial Naive Bayes
* Linear Support Vector
* Logistic Regression
* Random Forest
* Multi-layer Perceptron (here only default settings were tried!)

A scikit-learn grid search (GridSearchCV) was used to run the different classifiers, and the classifier with the highest accuracy score was chosen. The grid search for the optimal classifier can be repeated in the cell below, but beware that it can take a long time to run, especially since the Multi-layer Perceptron is among the candidates.

In [107]:
# 
# WARNING: this can take a long time!
#

# Uncomment and run the line below to run a grid search of the aforementioned models:
#CarReviewsBetterClassifier(CSV_PATH).grid_search_classifier()

#### Tuning parameters

Once the Support Vector Machine was determined to be the best classifier for the task, the grid search was repeated for the various parameters of the vectorizer (method grid_search_vectorizer_params in the code above):

* vectorizer__binary - using binary values vs. counts of word stems
* vectorizer__max_df - removing terms that appear too frequently (expressed as a proportion of the documents)
* vectorizer__min_df - removing terms that are too infrequent
* vectorizer__ngram_range - using multi-word phrases in addition to individual words
* vectorizer - comparing various vectorizers:
    * StemmedCountVectorizer - CountVectorizer with stemming added
    * CountVectorizer - default CountVectorizer
    * StemmedTfidfVectorizer  - term frequency–inverse document frequency with stemming added
    * TfidfVectorizer - default term frequency–inverse document frequency
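
For readers unfamiliar with TF-IDF weighting, the following sketch computes it by hand. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalisation that scikit-learn documents as its defaults; the function name and the tokens are made up for illustration.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (vocabulary, rows) with smoothed, L2-normalised TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows

docs = [["good", "car"], ["bad", "car"], ["fast", "car"]]
vocabulary, rows = tfidf(docs)
print(vocabulary)
print([round(v, 3) for v in rows[0]])
```

The term "car" occurs in every document and therefore receives a lower weight than "good", which occurs in only one.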
    
The tuning process required several refinement steps to keep the search grid small enough. Some other parameters were also tried, but they are omitted here for clarity since they had no considerable effect.
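
The reason the grid has to stay small is that the number of model fits grows multiplicatively with every added value. The sketch below enumerates a grid shaped like the one in grid_search_vectorizer_params, using plain strings as stand-ins for the estimator objects and assuming 5-fold cross validation.

```python
from itertools import product

# Plain-string stand-ins for the vectorizer objects used in the real grid.
grid = {
    "vectorizer": ["StemmedTfidfVectorizer", "StemmedCountVectorizer"],
    "vectorizer__binary": [True],
    "vectorizer__max_df": [1.0],
    "vectorizer__min_df": [1, 3],
    "vectorizer__ngram_range": [(1, 2), (1, 1)],
}

names = list(grid)
# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5  # assumption: 5-fold cross validation
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# 8 candidates, 40 fits
```

Adding a single extra parameter with three values would already triple the number of fits.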

In [108]:
# 
# WARNING: this can take a long time!

# Uncomment and run the line below to run a grid search of the aforementioned parameters:
#CarReviewsBetterClassifier(CSV_PATH).grid_search_vectorizer_params()

#### Results

The Multinomial Naive Bayes classifier performed slightly worse than the Support Vector Machine classifier. With optimal parameters, the LinearSVC model consistently reached an accuracy between 0.80 and 0.86. The most successful vectorizer was the term frequency–inverse document frequency vectorizer with added stemming (using the improved SnowballStemmer). Using unigrams and bigrams outperformed all other tested n-gram combinations, and binary values consistently proved to be a better choice than counts of word stems (even with Naive Bayes). Removing terms that were present in only one document also improved the results slightly. Many other parameters were modified (such as the maximum number of features and the alpha value), but they did not show a measurable effect on the results.
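
The effect of combining unigrams with bigrams and of dropping terms that occur in only one document can be sketched in plain Python. The helper names and the tokens are hypothetical; the two settings correspond to ngram_range=(1,2) and min_df=2 in the final classifier.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the given lengths as space-joined strings."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def filter_by_min_df(docs_terms, min_df=2):
    """Keep only terms that occur in at least min_df documents."""
    df = Counter(term for terms in docs_terms for term in set(terms))
    return sorted(term for term, count in df.items() if count >= min_df)

docs = [["not", "good", "car"], ["good", "car"], ["not", "good", "engin"]]
docs_terms = [ngrams(d) for d in docs]
print(docs_terms[0])
# ['not', 'good', 'car', 'not good', 'good car']
print(filter_by_min_df(docs_terms, min_df=2))
# ['car', 'good', 'good car', 'not', 'not good']
```

The bigram "not good" survives as a feature of its own, while "engin" and "good engin" are dropped because each appears in a single document.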

In [109]:
#
#  Run this cell to run the final SVM classifier with optimal parameters:
#
tn, fp, fn, tp = CarReviewsBetterClassifier(CSV_PATH).train_and_test().ravel()

print("\nTrue positives", tp)
print("True negatives", tn)
print("False positives", fp)
print("False negatives", fn)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andrejwork/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/andrejwork/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loaded 1382 car reviews

Accuracy 0.82

True positives 114
True negatives 113
False positives 26
False negatives 24
