# Task 1 - Naïve Bayes Classifier

Install requirements:

- pandas
- nltk
- scikit-learn (imported as sklearn)

The cell below contains all the relevant code for part 1. The run_ml_pipeline function runs the entire pipeline and outputs performance metrics for the Naive Bayes classifier; it is called further down the notebook.

Please note that the code comment in run_ml_pipeline discusses how the pipeline handles the presence of words unseen in the training set, either as a whole or for a specific class.

In [17]:
import pandas as pd
import nltk
from nltk import downloader
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score, confusion_matrix

downloader.download("stopwords")
downloader.download("punkt_tab")
downloader.download("wordnet")
downloader.download('averaged_perceptron_tagger_eng')

english_stop_words = set(stopwords.words("english"))
all_stop_words = set()

lemmatizer = WordNetLemmatizer()

# add each stopword, plus an apostrophe-free variant where one exists
# e.g. "don't" and "dont"
for word in english_stop_words:
    all_stop_words.add(word)
    if "'" in word:
        all_stop_words.add(word.replace("'", ""))

input_data = pd.read_csv("car-reviews.csv")

def preprocessor(text: str) -> str:
    """Preprocesses text by removing punctuation, lowercasing, removing stopwords, and lemmatizing."""

    text = preprocess_remove_punctuation(text)
    text = preprocess_lowercase(text)
    text = preprocess_remove_stopwords(text)
    text = preprocess_stem(text)
    return text


def preprocess_lowercase(text: str) -> str:
    """Preprocesses text by lowercasing."""

    text = text.lower()
    return text

def preprocess_remove_punctuation(text: str) -> str:
    """Preprocesses text by removing punctuation (any token that is not alphanumeric)."""

    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word.isalnum()]
    return " ".join(tokens)

def preprocess_remove_stopwords(text: str) -> str:
    """Preprocesses text by removing stopwords."""

    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in all_stop_words]
    return " ".join(tokens)

def preprocess_stem(text: str) -> str:
    """Preprocesses text by lemmatizing each token using its part-of-speech tag."""

    def pos_to_wordnet(pos: str) -> str:
        """Converts POS tag to WordNet format."""

        first_char = pos[0]

        if first_char == "J":
            return "a"
        elif first_char == "V":
            return "v"
        elif first_char == "N":
            return "n"
        elif first_char == "R":
            return "r"
        else:
            return "n"

    tokens = nltk.word_tokenize(text)
    token_tags = pos_tag(tokens)

    tokens = [lemmatizer.lemmatize(word, pos_to_wordnet(pos)) for word, pos in token_tags]
    return " ".join(tokens)

class Preprocessor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [preprocessor(text) for text in X]


def run_ml_pipeline(input_data: pd.DataFrame) -> None:
    """
    Runs the machine learning pipeline on input dataframe.
    """
    test_size = 276  # 20% of the data for testing

    training_data, test_data = train_test_split(
        input_data,
        test_size=test_size,
        random_state=120,
        stratify=input_data["Sentiment"],
    )

    # ----- A note about unseen words in the test set -----
    # Words that never appear in the training set are not in the CountVectorizer
    # vocabulary, so they are ignored when a test review is vectorized.
    # Words that appear in the training set but not in reviews of a specific class
    # are handled by MultinomialNB's alpha=1.0, which happens to be the default value.
    # This is the additive (Laplace) smoothing parameter: 1.0 is added to every
    # per-class word count, so such a word is treated as if it had been seen once in that class.
    # Without smoothing the word would have a probability of 0.0 for that class,
    # which would zero out the whole product and rule the class out regardless of the other words.
    pipeline = Pipeline(
        [
            ("preprocess", Preprocessor()),
            ("vectorize", CountVectorizer()),
            ("classify", MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)),
        ]
    )

    pipeline.fit(training_data["Review"], training_data["Sentiment"])

    predictions = pipeline.predict(test_data["Review"])

    accuracy = accuracy_score(test_data["Sentiment"], predictions)

    print(f"Accuracy: {accuracy * 100:.2f}%")

    labels = ["Pos", "Neg"]

    # rows are true labels and columns are predicted labels, both in the order given by labels
    cm = confusion_matrix(test_data["Sentiment"], predictions, labels=labels)

    # divide by the total so the four cells sum to 100%
    cm_normalized = cm.astype("float") / cm.sum()

    print("Confusion Matrix Proportions:")
    print(f"True Positive: {cm_normalized[0][0] * 100:.2f}%")
    print(f"False Positive: {cm_normalized[1][0] * 100:.2f}%")
    print(f"True Negative: {cm_normalized[1][1] * 100:.2f}%")
    print(f"False Negative: {cm_normalized[0][1] * 100:.2f}%")



The cell below demonstrates that the implemented ML pipeline preprocesses the reviews by removing stop words and punctuation.

It calls the same helper functions that the preprocessor function uses; preprocessor is the first step of the sklearn Pipeline object.

In [4]:
def words_and_punctuation_removal_example():
    """
    Example of stopword removal, punctuation removal, and lowercasing, as part of the marking criteria.
    """

    # 2 example rows from car-reviews.csv
    example_reviews = input_data["Review"].head(2).tolist()

    # additional made-up test review with punctuation
    example_text = "This is a sample review! It's great, isn't it?"
    example_reviews.append(example_text)

    for text in example_reviews:
        processed_review = preprocess_remove_punctuation(text)
        processed_review = preprocess_lowercase(processed_review)
        processed_review = preprocess_remove_stopwords(processed_review)

        print(f"Original: \n {text} \n")
        print(f"Processed: \n {processed_review} \n")
        print("----------------------\n")

words_and_punctuation_removal_example()

Original: 
  In 1992 we bought a new Taurus and we really loved it  So in 1999 we decided to try a new Taurus  I did not care for the style of the newer version  but bought it anyway I do not like the new car half as much as i liked our other one  Thee dash is much to deep and takes up a lot of room  I do not find the seats as comfortable and the way the sides stick out further than the strip that should protect your card from denting It drives nice and has good pick up  But you can not see the hood at all from the driver seat and judging and parking is difficult  It has a very small gas tank I would not buy a Taurus if I had it to do over  I would rather have my 1992 back  I dont think the style is as nice as the the 1992  and it was a mistake to change the style  In less than a month we had a dead battery and a flat tire  

Processed: 
 1992 bought new taurus really loved 1999 decided try new taurus care style newer version bought anyway like new car half much liked one thee dash muc

The next cell demonstrates a call to the preprocessor function in its entirety, which also performs lemmatisation so that different inflections of a word (for example "running", "run" and "runs") are reduced to the same root form and produce the same token.

As previously mentioned, it is this preprocessor function that is used in a transform step of the sklearn Pipeline object.

In [7]:
def preprocessor_example():
    """Example of preprocessor function usage."""

    examples = []

    # additional made-up test review with punctuation
    stem_example_1 = "Here's some words with the same root stems! Running, run, runs."
    stem_example_2 = "... And some more! Removed, remove, removing."
    stem_example_3 = "... And more! Driving, Drove, Drive."

    examples.append(stem_example_1)
    examples.append(stem_example_2)
    examples.append(stem_example_3)

    for text in examples:
        processed_review = preprocessor(text)
        print(f"Original: \n {text} \n")
        print(f"Processed: \n {processed_review} \n")
        print("----------------------\n")

preprocessor_example()

Original: 
 Here's some words with the same root stems! Running, run, runs. 

Processed: 
 word root stem run run run 

----------------------

Original: 
 ... And some more! Removed, remove, removing. 

Processed: 
 remove remove remove 

----------------------

Original: 
 ... And more! Driving, Drove, Drive. 

Processed: 
 drive drive drive 

----------------------



The next cell runs a function that demonstrates that the implemented ML pipeline uses the CountVectorizer, which takes a bag-of-words approach. That is, a single vocabulary is built from every token in all the reviews the vectorizer is fitted on, and each review is then represented as a vector of occurrence counts over that vocabulary.

The function adds some extra "DebugStep" steps so that the data can be inspected after each stage of the pipeline. Of most interest is the comparison between the data after the preprocessing transform and after the vectorization transform. This comparison is printed at the end of the function and demonstrates that the vectorization works as expected.

In [13]:
def bag_of_words_vector_example():
    """
    Example of bag-of-words based vectorization using the CountVectorizer from sklearn.
    """

    class DebugStep(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            print("DebugStep: Transforming data...")
            self.previous_data = X
            return X

    test_size = 276  # 20% of the data for testing

    training_data, test_data = train_test_split(
        input_data,
        test_size=test_size,
        random_state=120,
        stratify=input_data["Sentiment"],
    )

    pipeline = Pipeline(
        [
            ("preprocess", Preprocessor()),
            ("preprocess_debug", DebugStep()),
            ("vectorize", CountVectorizer()),
            ("vectorize_debug", DebugStep()),
            ("classify", MultinomialNB()),
        ]
    )

    pipeline.fit(training_data["Review"], training_data["Sentiment"])

    # get output of the preprocessing step
    preprocessor_output = pipeline.named_steps["preprocess_debug"].previous_data

    # analyse CountVectorizer
    vectorizer = pipeline.named_steps["vectorize"]
    vocabulary = vectorizer.vocabulary_
    index_to_word = {index: word for word, index in vocabulary.items()}

    # get output of CountVectorizer
    vectorizer_output = pipeline.named_steps["vectorize_debug"].previous_data

    print("Vectorizer Output Shape:")
    print(vectorizer_output.shape)

    print("Vocabulary Length:")
    print(len(vocabulary))

    # first 5 review vectors
    for i in range(5):
        indices = vectorizer_output[i].indices
        # densify only the current row (avoids the deprecated .A shorthand on sparse matrices)
        vector_for_row = vectorizer_output[i].toarray()[0]

        word_to_count = {word: 0 for word in index_to_word.values()}

        for index in indices:
            word_to_count[index_to_word[index]] = vector_for_row[index]

        # add to data frame for nice printing
        vector_to_vocabulary = pd.DataFrame(word_to_count, index=[0])

        print(f"###### Review {i} vector ######\n")
        print("Processed Text:\n")
        print(preprocessor_output[i])
        print("\nVector:\n")
        print(vector_to_vocabulary)
        print("\n----------------------\n")

bag_of_words_vector_example()

DebugStep: Transforming data...
DebugStep: Transforming data...
Vectorizer Output Shape:
(1106, 12185)
Vocabulary Length:
12185
###### Review 0 vector ######

Processed Text:

shop new vehicle want something would get back forth work something fun drive weekend think get far take 97 explorer sport four wheeling place people take beat roaders take trail wood swamp snow drift do play take car wash hose put new coat wax ready go work monday like vehicle minor problem one bad sensor 4 wheel drive selenoid truck would shift 4 low come would shut start fixed warranty occoured since usual maintenance item need replace truck would recommend anyone look functional fun vehicle rid handle great road

Vector:

   shop  new  vehicle  want  something  would  get  back  forth  work  ...  \
0     1    2        3     1          2      4    2     1      1     2  ...   

   hogan  joes  labeling  bilstein  manufactured  autolocking  265x15  55k  \
0      0     0         0         0             0         

The final code cell for part 1 below demonstrates the entire pipeline in action, with the performance metrics printed as confusion matrix proportions.

The code in the run_ml_pipeline function shows that 80% of the data is used to train the model and 20% is used as the test data.

Again, please see the run_ml_pipeline code comment for the reasoning behind the mechanism that handles unseen words in the test dataset.

In [19]:
run_ml_pipeline(input_data)

Accuracy: 76.81%
Confusion Matrix Proportions:
True Positive: 39.13%
False Positive: 12.32%
True Negative: 37.68%
False Negative: 10.87%
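The two kinds of unseen word can be checked on a toy example. The sketch below is a minimal illustration on made-up data, not part of the pipeline above: it shows that a word absent from the training vocabulary is dropped by CountVectorizer, and that a word seen only in "Pos" training text still receives a non-zero probability under "Neg" because of alpha=1.0.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# made-up training data, for illustration only
train_docs = ["great car love it", "terrible car hate it"]
train_labels = ["Pos", "Neg"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# "gearbox" never appears in training, so it is not in the vocabulary
# and is ignored when the test text is vectorized
X_test = vectorizer.transform(["great gearbox"])
unseen_in_vocabulary = "gearbox" in vectorizer.vocabulary_
tokens_kept = int(X_test.sum())

clf = MultinomialNB(alpha=1.0).fit(X_train, train_labels)

# "great" appears only in the Pos document; smoothing keeps P(great | Neg) above zero
great_index = vectorizer.vocabulary_["great"]
neg_row = list(clf.classes_).index("Neg")
p_great_given_neg = float(np.exp(clf.feature_log_prob_[neg_row, great_index]))

print(unseen_in_vocabulary, tokens_kept, p_great_given_neg)
```

Here the "Neg" document contributes 4 word occurrences and the vocabulary has 6 words, so the smoothed estimate is (0 + 1) / (4 + 6) = 0.1.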


# Task 2 - Improved Solution

# Introduction

This task aims to understand the potential pitfalls of the Naive Bayes implementation and which alternatives may perform better. We will consider the estimator and look at ways to improve other parts of the Machine Learning (ML) pipeline, such as feature extraction, hyperparameter tuning and model evaluation.

# Model Investigation

## Naive Bayes

Naive Bayes works on the assumption that every predictor in a model is conditionally independent of the others given the class. The benefit of this assumption is that we can get reasonably accurate estimates for very little compute power. As such, Naive Bayes lends itself well to classification tasks where data quantity is limited, and it can be used in real-time applications. (citation needed)
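Written out, using notation introduced here purely for illustration (a class $c$ and a review made up of the words $w_1, \dots, w_n$), the independence assumption lets the classifier score each class as a simple product:

$$P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

With additive smoothing, as used by MultinomialNB in part 1, each factor is estimated as

$$\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha \lvert V \rvert}$$

where $N_{wc}$ is the number of times word $w$ occurs in training reviews of class $c$, $N_c$ is the total number of word occurrences in class $c$, $\lvert V \rvert$ is the vocabulary size and $\alpha$ is the smoothing parameter.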

This simplifying assumption is what earns the "Naive" label: assuming that every feature is conditionally independent doesn't reflect real-world data. For instance, in sentiment analysis, word order and context can convey a critically different meaning from the individual words. One word could convey a completely different sentiment depending on its neighbouring words or its position in a sentence. Therefore, Naive Bayes is unlikely to be the best fit for classifying sentiment, as we potentially lose information. (citation needed)
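The loss of word order can be seen directly in the bag-of-words features used in part 1. The sketch below uses two made-up sentences that contain the same words in a different order and checks that CountVectorizer gives them the same vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up reviews with opposite meaning but identical words
reviews = [
    "the car is good not bad",
    "the car is bad not good",
]

vectors = CountVectorizer().fit_transform(reviews).toarray()

# the counts are identical, so any classifier built on them cannot tell the reviews apart
same_vector = bool((vectors[0] == vectors[1]).all())

print(vectors)
print("Identical bag-of-words vectors:", same_vector)
```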

## Alternatives

An alternative classifier must improve on the downsides of Naive Bayes described in the previous section while also being a generally good fit for sentiment analysis in the context of this assignment's specific problem. Sentiment analysis produces high-dimensional data, that is, data with many features, so an alternative must handle high-dimensional data well. The alternative must also be able to capture the complex meaning of words depending on their context, which could be analogous to capturing non-linear relationships (citation needed).

### Support Vector Machines (SVM)

SVMs are known to be a good fit for classification tasks (citation needed). They are typically considered linear classifiers, but the kernel trick can be used with a suitable kernel to introduce non-linear classification boundaries. They also handle high-dimensional data well (citation needed).
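As a rough sketch of how this could look with sklearn, the snippet below swaps the Naive Bayes step for an SVC with an RBF kernel. The training data is made up, and the kernel and hyperparameter values are untuned assumptions rather than a recommended configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# made-up training data, for illustration only
svm_train_reviews = [
    "love car great drive",
    "great handle comfortable seat",
    "hate car terrible drive",
    "terrible handle uncomfortable seat",
]
svm_train_labels = ["Pos", "Pos", "Neg", "Neg"]

svm_pipeline = Pipeline(
    [
        ("vectorize", CountVectorizer()),
        # kernel="rbf" allows a non-linear decision boundary; kernel="linear" keeps it linear
        ("classify", SVC(kernel="rbf", C=1.0, gamma="scale")),
    ]
)

svm_pipeline.fit(svm_train_reviews, svm_train_labels)
svm_predictions = svm_pipeline.predict(["great car", "terrible seat"])
print(svm_predictions)
```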

### Recurrent Neural Networks (RNN)

RNNs are designed to handle sequential data, such as textual reviews. Unlike a typical Feed Forward Neural Network (FNN), RNNs use recurrent connections: the previous state is remembered and fed back into the same neuron at the next time step. This ability to carry information forward through time allows RNNs to capture complex patterns and dependencies between features, which is very useful for sentiment analysis, where information is embedded in a word's context among the words around it (citation needed).
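The recurrence can be sketched in a few lines of NumPy. This is a minimal, untrained, plain RNN cell written for illustration: the sizes, the random weights and the random input sequence are all assumptions, standing in for learned parameters and real word vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, sequence_length = 4, 3, 5

# randomly initialised weights, standing in for learned parameters
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # previous hidden -> hidden
b = np.zeros(hidden_size)

# a made-up sequence of 5 word vectors
sequence = rng.normal(size=(sequence_length, input_size))

hidden_state = np.zeros(hidden_size)
hidden_states = []
for x_t in sequence:
    # the new state depends on the current input and on the previous state
    hidden_state = np.tanh(W_x @ x_t + W_h @ hidden_state + b)
    hidden_states.append(hidden_state)

final_state = hidden_states[-1]
print(final_state)
```

The final state depends on every earlier input through the chain of hidden states, which is how information about earlier words can reach the end of a review.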

We can summarise the benefits as the ability to understand complex linguistic nuance, high accuracy and reliability. The downsides of RNNs include high complexity, long training times and a need for more compute than traditional ML alternatives (Mao, Liu and Zhang, 2024).

### Transformer

The Transformer architecture is now typically the go-to for tasks involving sequential data (citation needed). The downside of a transformer-based architecture is that it requires a huge amount of data (Mao, Liu and Zhang, 2024).


# Approach

I have chosen to build an RNN to hopefully improve the performance over Naive Bayes, due to its inherent ability to remember information from earlier in a sequence and then infer sentiment from it.

- Model: RNN
- Tooling: PyTorch
- Feature extraction: Word Embedding (FastText / Word2Vec)
- Architecture: LSTM
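One possible shape for such a model in PyTorch is sketched below. The class name SentimentLSTM, the layer sizes and the random input batch are all hypothetical. The embedding layer is randomly initialised here, whereas the plan above is to use FastText or Word2Vec embeddings, and the sketch ignores padding and training entirely.

```python
import torch
from torch import nn


class SentimentLSTM(nn.Module):
    """Hypothetical LSTM classifier: embedding -> LSTM -> linear layer."""

    def __init__(self, vocab_size: int, embedding_dim: int = 50, hidden_dim: int = 64):
        super().__init__()
        # randomly initialised here; pretrained word embeddings could be loaded instead
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # two classes: Neg and Pos

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)       # (batch, seq_len, embedding_dim)
        _, (last_hidden, _) = self.lstm(embedded)  # last_hidden: (1, batch, hidden_dim)
        return self.classifier(last_hidden[-1])    # (batch, 2) unnormalised class scores


model = SentimentLSTM(vocab_size=100)

# a made-up batch of 3 reviews, each 7 token ids long
batch = torch.randint(low=1, high=100, size=(3, 7))
logits = model(batch)
print(logits.shape)
```

The model outputs one pair of unnormalised class scores per review, which would typically be trained with a cross-entropy loss.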

# Results

# Conclusion


# References

Mao, Y., Liu, Q. and Zhang, Y., 2024. Sentiment analysis methods, applications, and challenges: A systematic literature review. Journal of King Saud University - Computer and Information Sciences [Online], 36(4), p.102048. Available from: https://doi.org/10.1016/j.jksuci.2024.102048.
