# Introduction

## Domain-specific area 

The challenge which this peice of work will be addressing is the categorisation of the text of various
different film reviews into categories of postitive or negative scores. This project will involve creating a binary classifier to pretict which category the reviews fall into.

There is a very strong monetary incentive for people who create and fund films to care about sentiment classification of the reviews. The film industry is huge globally, having topped $100 billion in 2019 (Escandon, 2020) [1]. It has been found that reviews of films can have a real impact on their performance with significant impact on box office takings (Eagon, 2018)[2]. Correct classification of reviews written and posted online or collected during focus groups could give film-makers vital information to understand which films are liked by audiences and hence become more successful.
    
Furthermore, one type of film review which has become particularly prevelant in recent years is called 'review bombing'. This is where groups of people decide against a film and purposefully target it with lots of negative reviews [3]. Good text classification on film reviews could also help focus in on the negative reviews to spot 'review' bombing by picking up on a flurry of negative activity.
    
Additionally, by looking at the different features which most strongly predict positive and negative reviews, 
Film-makers can understand themes and ideas which are more likely to lead to different sentiments and keep them in mind
when they are creating the next blockbuster smash hit!

## Objectives 

The output of the project will be a Naive Bayes classifier which can be used to classify film reviews to ascertain if they are positive or negative - as determied by the content of the review. This project aims to find a Naive Bayes Classifier which exceeds the performance of a basic NB classifier which has been created without stop word removal using simple bag of word feautures.

The data will be lemmatised and tokenised and it will be determined if stop word removal can improve the model performance. Stop word removal has been found to have potential to improve the performance of text classifiers (Silvia, 2003) [4].

Then, the model will be further altered by instead using TF-IDF (combines term frequency and inverse document frequency [5])
to see if a better model can be created. This may lead to a better model because Tf-Idf makes rare words more prominent and effectively ignores common words (M. Jain, 2021) [6]

It will then be determined if the use of bi-grams and tri-grams can further improve the model. N-grams are used to show frequency of words appearing together in text:  N-grams analyses are often used to see which words often show up together (Yang, 2020) [7]. Collocations is the name for words that stick together more than would happen purely by chance.

The results would help to determine the best model type for film review classification: with stop words removed or not, TF-IDF or Bag of words and solitary words or also including bi-grams and tri-grams.


## Dataset 


The dataset chosen for this text classification project is the imdb film review dataset from Stanford University (Maas, 2011) [8]. This dataset contains 50,000 film reviews which are either negative or positive. A negative review is one with a score of less than or equal to 4 and a postive review is one with a ascore of greater than or equal to 7 (with all scores given out of 10). There are no neutral scores included. 

The data set is pre-divided into test and train data with half of the data split as test and half as train. No more than 30 reviews for any one film are included. The data is hosted on a site by Stanford university and will be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz using a script. 

The dataset, once downloaded, consists of a zipped folder with test and train folders each containing negative and positive folders, within these the reviews are saved each in seperate files There are two top-level directories. This will mean these data set will first need to be unzipped and then each review looped through and extracted to create a data set to work with in Python. 

The full size of these dataset may be too large to use without causing the code to be slow to run so a random sample may need to be taken.

## Evaluation methodology 

Accuracy, Precision and Recall will be used to evaluate the Naive Bayes classifier created in this project and select the best one.

The calculation for accuracy is the True Positives (TP) plus the True Negatives (TN) divided by the False Positives (FP), False Negatives (FN), True Positives and True Negatives -. In equation form: (TP + TN) / (TP + TN + FP + FN). 
This shows the percentages of classifications which were correct - it's not always a perfect measurement of classifier performance as if the data is imbalanced a classifier could score a high accuracy simply by always predicting the majority class - it will work for this scenario because it is known that the 2 different classes are equally balanced. 

Precision and Recall will also be used to add additional context. 
The calculation for precision is (TP/TP+FP) - this is the chance the that the class was correct given that it was chosen.
The calculation for recall is (TP/TP+FN) - this is the chance, given the correct class, that it was chosen correctly.

An F-measure could also be used which is a combination of precision and recall - however as this is primarily a useful addition for unbalanced datasets it will not be used to evaluate the model created in this project.

# Implementation

In [1]:
#Below are the requirements for the code to run - there is also a requirements file in the same folder as this code.

!pip install nltk sklearn requests numpy pandas

import requests
import tarfile
import pandas as pd
import numpy as np
import os
import csv
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.metrics import precision
from nltk.metrics import recall
import collections
import random
import sklearn
nltk.download('stopwords')
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from itertools import filterfalse




[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Fiona\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


## Acquiring the data 

In [2]:
#The below code pulls the film review data from the standford site and saves it in a folder named 'imdb_raw_data'
#The file is zipped when it is downloaded and needs to be unzipped using tarfile


saved_folder = "imdb_raw_data"  
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

response = requests.get(url)
with open("aclImdb_v1.tar.gz", 'wb') as f:
    f.write(response.content)
tar = tarfile.open("aclImdb_v1.tar.gz", "r:gz")
tar.extractall(saved_folder)
tar.close()


In [3]:
#The names of the folders determine if the review is positive or negative
#The below function uses this to add a binary flag each review

def sentiment_from_path(path):
    '''uses the filename to determine if the review is positive or negative'''
    if re.match(".*neg", path):
        return 0
    else:
        return 1

#Because the reviews are in seperate files they need to be pulled together 
#into one list of reviews paired with senitment scores

def getting_raw_data(path):
    '''for all individual review files, pulls them together to output list of reviews and sentiment'''
    raw_data = []
    sentiment = []
    for filename in os.listdir(path):
        open_data = open(os.path.join(path, filename), "r", encoding="utf-8")
        data_read = open_data.read()
        sentiment.append(sentiment_from_path(path))
        raw_data.append(data_read)
    return raw_data, sentiment


In [4]:
#The file paths are created and fed into the function to get the data.
#The actual path here as / is a special character in python so the seperate sections are joined using os.path.join

test_negative = os.path.join(saved_folder, 'aclImdb', 'test', 'neg')
test_positive = os.path.join(saved_folder, 'aclImdb', 'test', 'pos')
train_negative = os.path.join(saved_folder, 'aclImdb', 'train', 'neg')
train_positive = os.path.join(saved_folder, 'aclImdb', 'train', 'pos')

test_n_data_all, test_n_Y_all = getting_raw_data(test_negative)
test_p_data_all, test_p_Y_all = getting_raw_data(test_positive)
train_n_data_all, train_n_Y_all = getting_raw_data(train_negative)
train_p_data_all, train_p_Y_all = getting_raw_data(train_positive)

In [5]:
#For some sections of this project (particularly when removing stop-words) the code ran slowly
#To improve the performance a random sample of 500 positives and 500 negatives have been taken for training and test each
#A seed was used to make the random sample reproducable

random.seed(500)
test_n_data = random.sample(test_n_data_all, 500)
test_n_Y = random.sample(test_n_Y_all, 500) 
test_p_data = random.sample(test_p_data_all, 500)
test_p_Y = random.sample(test_p_Y_all, 500)
train_n_data = random.sample(train_n_data_all, 500) 
train_n_Y = random.sample(train_n_Y_all, 500)
train_p_data = random.sample(train_p_data_all, 500) 
train_p_Y = random.sample(train_p_Y_all, 500)

## Cleaning + Preprocessing the data 

In [6]:
#The data is in English and contains informal reviews 

#The data now needs to be cleaned to remove spaces, line breaks from html and punctuation
#The data also needs to be tokenised - dividing it up into component words 
#this is so the reviews can be classified by the words that they contain

#The data is represented as 'Bag of Words'

#The data also needs to be lemmatised - split down to their linguistic root
#this is so that the classifier can spot when roots of words are frequently occuring and people may be saying the same things

#The below function cleans, tokenises and lemmatises a review

lemmatizer = WordNetLemmatizer()
def tokenising_letimizing(review):
    '''cleans, tokenises and lemmatises data'''
    removed_spaces = re.sub("/\s{1,}/g"," ", review)
    removed_line_breaks = re.sub("<br />", "" , removed_spaces)
    removed_punctuation = re.sub("[^-9A-Za-z ]", "" , removed_line_breaks)
    tokenised = word_tokenize(removed_punctuation)
    review_lem = []
    for word in tokenised:
        review_lem.append(lemmatizer.lemmatize(word))
    return review_lem

#This function loops through each review and applies the previous function
#they have been seperated due to the different requirements of the nltk and sk_learn Naive Bayes fuctions

def full_preprocessing(data):
    '''loops through reviews of data to create list of them processed'''
    data_processed = []
    for review in data: 
        tl_review = tokenising_letimizing(review)
        data_processed.append(tl_review)
    return data_processed

In [7]:
#the pre-processing is applied to the datasets

test_n_cleaned = full_preprocessing(test_n_data)
test_p_cleaned = full_preprocessing(test_p_data)
train_n_cleaned = full_preprocessing(train_n_data)
train_p_cleaned = full_preprocessing(train_p_data)

## Baseline performance 

The baseline against which the performance of the final classifier will be compared will be created by Naive Bayes using a simple BoW method with the existance within a review of certain (highg frequency) words being taken as features.




In [8]:
#The positives and negatives are combined for post training and test

data_train = train_n_cleaned + train_p_cleaned
class_train = train_n_Y + train_p_Y
data_test = test_n_cleaned + test_p_cleaned
class_test = test_n_Y + test_p_Y

In [9]:
#Getting high frequency words and using as features for baseline model

#Looping through to get a list of all words in all training reviews
all_words = []
for review in data_train:
    for word in review:
        all_words.append(word)   

#setting number of highest frequency words to 100
N = 100

#getting frequency of words for all words in the training reviews
freq_of_words = nltk.FreqDist(all_words)

#getting the top 100 most common words from the training reviews
word_features = list(freq_of_words)[:N]

#The below function creates features for each review with a flag saying if the review contains each of the 100 words
def bow_features(review): 
    '''takes in a review and returns the features where these are a binary flag for if the review contains high freq words'''
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in review)
    return features

#Applying the feature function to the test dataset
features =  [bow_features(review) for review in data_test]
features_and_class = list(zip(features, class_test))

#Applying the feature function to the training dataset
features_train =  [bow_features(review) for review in data_train]
features_and_class_train  = list(zip(features_train, class_train ))

In [10]:
#generating and testing the basic naive bayes to get accuracy, recall and precision scores
NBclassifier_baseline = nltk.NaiveBayesClassifier.train(features_and_class)


def nltk_scores(classifier, features_and_class):
    '''returns accuracy, recall and precision for nltk classifier'''
    accuracy = nltk.classify.accuracy(classifier, features_and_class)
    set_from_classifier = collections.defaultdict(set)
    set_actual = collections.defaultdict(set)
 
    for i, (features, sentiment) in enumerate(features_and_class):
        set_actual[sentiment].add(i)
        from_classifier = classifier.classify(features)
        set_from_classifier[from_classifier].add(i)
    positive_precision =  precision(set_from_classifier[0], set_actual[0])
    negitive_precision = precision(set_from_classifier[1], set_actual[1])
    positive_recall = recall(set_from_classifier[0], set_actual[0])
    negitive_recall = recall(set_from_classifier[1], set_actual[1])
    print ('positive precision:', "{:.2%}".format(positive_precision ))
    print ('negitive precision:', "{:.2%}".format(negitive_precision ))
    print ('positive recall:', "{:.2%}".format(positive_recall))
    print ('negitive recall:', "{:.2%}".format(negitive_recall))
    print ('accuracy:', "{:.2%}".format(accuracy))
    

nltk_scores(NBclassifier_baseline, features_and_class)


positive precision: 63.80%
negitive precision: 66.00%
positive recall: 65.24%
negitive recall: 64.58%
accuracy: 64.90%


### The baseline can thus be taken as 64.9% accuracy 

## Classification approach 
Firstly the top 100 non stop word words will be used in the classifier, then features of a combination of term-frequency and doccument frequency will be used and bi-grams and tri-grams will also be added to see if the accuracy is improved.

The classes will be 0 or 1: 0 for negative and 1 for positive - there is not class for neautral as there are no neutral reviews in the data and no neutral flag in the training data set.

Naïve Bayes has been selected as the classifier which will be used in this approach. Naive Bayes is a simple and fast classifier and this project aims to discover how accurate it can become using different feature methods. Naive Bayes works by considering the prior probability of the classification and updating the results depending on the evidence provided (by the features)

This will be a form of supervised learning as training data, pre classified is provided.

In [11]:
#From looking at the most informative feautures of the classifier - we can see that some stop words 
#(commonly used words unlikely to impact sentiment) occur. The next step in creating classifier will be removing stopwords

NBclassifier_baseline.show_most_informative_features(5)

Most Informative Features
           contains(bad) = True                0 : 1      =      2.5 : 1.0
         contains(great) = True                1 : 0      =      2.2 : 1.0
          contains(well) = True                1 : 0      =      1.6 : 1.0
           contains(the) = False               1 : 0      =      1.6 : 1.0
            contains(of) = False               1 : 0      =      1.5 : 1.0


In [12]:
#adding in a section to remove stopwords to the initial processing function
def tokenising_letimizing_sw(review):
    '''cleans, tokenises and lemmatises data with stopwords removed'''
    removed_spaces = re.sub("/\s{1,}/g"," ", review)
    removed_line_breaks = re.sub("<br />", "" , removed_spaces)
    removed_punctuation = re.sub("[^-9A-Za-z ]", "" , removed_line_breaks)
    tokenised = word_tokenize(removed_punctuation)
    review_lem = []
    for word in tokenised:
        review_lem.append(lemmatizer.lemmatize(word))
    review_nonsw = list(filterfalse(set(stopwords.words()).__contains__, review_lem))
    return review_nonsw


#changing the preprocessing function to include the new processing function
def full_preprocessing_sw(data):
    '''loops through reviews of data to create list of them processed - using function with stopwords removed'''
    data_processed = []
    for review in data: 
        tl_review = tokenising_letimizing_sw(review)
        data_processed.append(tl_review)
    return data_processed

#removing stopwords from the initial data sets as well as pre-processing them 
test_n_cleaned_sw = full_preprocessing_sw(test_n_data)
test_p_cleaned_sw = full_preprocessing_sw(test_p_data)
train_n_cleaned_sw = full_preprocessing_sw(train_n_data)
train_p_cleaned_sw = full_preprocessing_sw(train_p_data)

#adding the positive and negative data together to become the test and train data
data_train_sw =  train_n_cleaned_sw + train_p_cleaned_sw
data_test_sw =  test_n_cleaned_sw + test_p_cleaned_sw

#creating a full list of all of the non-stopword words 
words_nonsw = []
for review in data_train_sw:
    for word in review:
        words_nonsw.append(word)   
        
#getting the frequency distribution of the non-stop-word words
words_nonsw_dist = nltk.FreqDist(words_nonsw)

#selecting the top 100 most common non-stop-word words
word_features_nonsw = list(words_nonsw_dist)[:N]

In [13]:
#same as the previous code to create features except this uses the non stop word features
def bow_features_sw(review):
    '''takes in a review, returns the features where these are a binary flag for if the review contains high freq (non stop word) words'''
    features = {}
    for word in word_features_nonsw:
        features['contains({})'.format(word)] = (word in review)
    return features

features_sw = [bow_features_sw(review) for review in data_test_sw]
features_and_class_sw = list(zip(features_sw, class_test))

features_train_sw = [bow_features_sw(review) for review in data_train_sw]
features_and_class_train_sw  = list(zip(features_sw, class_train ))



In [14]:
#generating the classifier

NBclassifier_sw = nltk.NaiveBayesClassifier.train(features_and_class_sw)
nltk_scores(NBclassifier_sw, features_and_class_sw)
#this has improved the classifier to an accuracy score of 0.716
#however I think we can do better than that by introducing tf idf

positive precision: 71.20%
negitive precision: 71.60%
positive recall: 71.49%
negitive recall: 71.31%
accuracy: 71.40%


In [15]:

data_train_tfidf = train_n_data + train_p_data
data_test_tfidf = test_n_data + test_p_data
class_train_tfidf = train_n_Y + train_p_Y
class_test_tfidf = test_n_Y + test_p_Y
naive_bayes_classifier = MultinomialNB()

#creating the function to use term frequency combined with inverse doccument frequency for words as features
#this Naive Bayes classifier uses the library sk_learn
#the tokenizer created previously is used which also removes stop-words

def NB_sk_learn_tfidf(ngrams, own_tokenizer, train_data, train_classes, test_data):
    '''uses sk_learn to create a NB Classifier for the data'''
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf=True,ngram_range=(1,ngrams), 
                                                                  tokenizer = own_tokenizer)
    tfidf_features_train = vectorizer.fit_transform(train_data)
    tfidf_features_test = vectorizer.transform(test_data)
    NBclassifier_tfidf = naive_bayes_classifier.fit(tfidf_features_train, train_classes)
    class_pred = NBclassifier_tfidf.predict(tfidf_features_test)
    return(class_pred, NBclassifier_tfidf, vectorizer)

In [16]:
#using tf-idf and not including bi-grams or tri-grams the accuacy increases to 79.6%

tfidf_pred, NBclassifier_tfidf, vectorizer = NB_sk_learn_tfidf(1, tokenising_letimizing_sw, data_train_tfidf, class_train_tfidf, data_test_tfidf)

def sk_learn_scores(class_test_tfidf, tfidf_pred):
    '''returns accuracy, recall and precision for sk_learn classifier'''
    print("positive precision:", "{:.2%}".format(metrics.precision_score(class_test_tfidf, tfidf_pred, pos_label = 1)))
    print("negitive precision:", "{:.2%}".format(metrics.precision_score(class_test_tfidf, tfidf_pred, pos_label = 0)))
    print("positive recall:", "{:.2%}".format(metrics.recall_score(class_test_tfidf, tfidf_pred, pos_label = 1)))
    print("negitive recall:", "{:.2%}".format(metrics.recall_score(class_test_tfidf, tfidf_pred, pos_label = 0)))
    print("accuracy:", "{:.2%}".format(metrics.accuracy_score(class_test_tfidf, tfidf_pred)))    

sk_learn_scores(class_test_tfidf, tfidf_pred)


positive precision: 83.78%
negitive precision: 77.64%
positive recall: 75.40%
negitive recall: 85.40%
accuracy: 80.40%


In [17]:
#using tf-idf , bi-grams but not tri-grams the accuacy increases to 81.9%

tfidf_pred_bigram, NBclassifier_tfidf_bigram, vectorizer_bigram = NB_sk_learn_tfidf(2, tokenising_letimizing_sw, data_train_tfidf, class_train_tfidf, data_test_tfidf)
sk_learn_scores(class_test_tfidf, tfidf_pred_bigram)

positive precision: 84.18%
negitive precision: 78.53%
positive recall: 76.60%
negitive recall: 85.60%
accuracy: 81.10%


In [18]:
#using tf-idf , bi-grams and tri-grams the accuacy dips slightly to to 81.6%

tfidf_pred_trigram, NBclassifier_tfidf_trigram, vectorizer_trigram = NB_sk_learn_tfidf(3, tokenising_letimizing_sw, data_train_tfidf, class_train_tfidf, data_test_tfidf)
sk_learn_scores(class_test_tfidf, tfidf_pred_trigram)

positive precision: 83.30%
negitive precision: 79.17%
positive recall: 77.80%
negitive recall: 84.40%
accuracy: 81.10%


# Conclusions

## Evaluation 

Using accuracy it can be seen that the tf-idf method of feature creation with bigrams included and with stop words removed  was the best classifier created in the above code. 
The results for the evaluation metrics are as follows:


| Model                                   | Accuracy     |
| --------------------------------------- | ------------ |
| NBclassifier_baseline                   | 64.90%       |
| NBclassifier_sw (no stop words)         | 71.60%       |
| NBclassifier_tfidf                      | 79.60%       | 
| NBclassifier_tfidf_bigram               | 81.90%       | 
| NBclassifier_tfidf_trigram              | 81.60%       | 

The results for precision are as follows:

| Model                                   | Precision +ve  |
| --------------------------------------- | -------------- |
| NBclassifier_baseline                   | 63.80%         |
| NBclassifier_sw (no stop words)         | 66.20%         |
| NBclassifier_tfidf                      | 87.76%         | 
| NBclassifier_tfidf_bigram               | 88.25%         | 
| NBclassifier_tfidf_trigram              | 87.26%         | 

| Model                                   | Precision -ve  |
| --------------------------------------- | -------------- |
| NBclassifier_baseline                   | 66.00%         |
| NBclassifier_sw (no stop words)         | 77.00%         |
| NBclassifier_tfidf                      | 74.34%         | 
| NBclassifier_tfidf_bigram               | 77.36%         | 
| NBclassifier_tfidf_trigram              | 77.43%         | 

The results for recall are as follows:

| Model                                   | Recall +ve   |
| --------------------------------------- | ------------ |
| NBclassifier_baseline                   | 65.24%       |
| NBclassifier_sw (no stop words)         | 74.22%       |
| NBclassifier_tfidf                      | 68.80%       | 
| NBclassifier_tfidf_bigram               | 73.60%       | 
| NBclassifier_tfidf_trigram              | 74.00%       | 

| Model                                   | Recall -ve   |
| --------------------------------------- | ------------ |
| NBclassifier_baseline                   | 64.58%       |
| NBclassifier_sw (no stop words)         | 69.49%       |
| NBclassifier_tfidf                      | 90.40%       | 
| NBclassifier_tfidf_bigram               | 90.20%       | 
| NBclassifier_tfidf_trigram              | 89.20%       | 

For overall accuracyd the best Naive Bayes model is the one with tf-idf method of feature creation with bigrams included and with stop words removed (81.9% accuracy)
There are no signs from recall and precision that the classifiers are over predicting one sentiment.

## Summary & Conclusions

In conclusion, it can be seen that removing stop words leads to a more accurate classifier of sentiment for film reviews. 
Adding in term-frequency combined with inverse doccument frequency leads to improved performance and the addition of bi-grams (pairs of words commonky found together) as features improves the model further still. 

The inclusion of tri-grams was not found to improve the model however the performance was still increased over the model which included neither bi-grams nor tri-grams. 

The solution which was created by this project may have some success for other film review areas such as tweets about a film however the type of langage used may be less formal in tweets and context may be lost (people may also post images as part of their twitter film commentary). It is also possible that sarcasm would be used on twitter which would be less likely to be seen on an imdb film review. 

It may  be difficult to use the classifier built on text other than text about films as there may be domain specific comments which would not translate - book reviews would not include comments about run time duration for example.

This apprach could be replicated by any libraies which allow for Naive Bayes text classification with tf-idf features and n-grams. A suitable library for R would be naivebayes. 

A beneficial alternative approach could be to also try other types of classifier eg. Logistic Regression and compare them against each other with similar steps to the ones detailed above. A reason why the method in this project was chosen is that it is quick and simple - Niave Bayes an be trained very quickly compared to some other models.

### References

[1] Eagon, O., 2018. The Influence of Film Critics on Movie Outcomes. [online] www.researchgate.net. Available at: <https://www.researchgate.net/publication/330015688_The_Influence_of_Film_Critics_on_Movie_Outcomes> [Accessed 30 December 2021].


[2] Escandon, R., 2020. The Film Industry Made A Record-Breaking $100 Billion Last Year. [online] Forbes. Available at: <https://www.forbes.com/sites/rosaescandon/2020/03/12/the-film-industry-made-a-record-breaking-100-billion-last-year/?sh=585a03d34cd6> [Accessed 30 December 2021].


[3] Wordsworth, R., 2019. The secrets of 'review-bombing': why do people write zero-star reviews?. [online] the Guardian. Available at: <https://www.theguardian.com/games/2019/mar/25/review-bombing-zero-star-reviews> [Accessed 30 December 2021].


[4] C. Silva and B. Ribeiro, 2003,  "The importance of stop word removal on recall values in text categorization",
Proceedings of the International Joint Conference on Neural Networks


[5] C. D. Manning, P. Raghavan and H. Schütze, 2008, "Introduction to Information Retrieval", Cambridge University Press


[6]Jain, M., 2021. Why Tf-Idf is more effective than Bag-Of-Words?. [online] Medium. Available at: <https://ai.plainenglish.io/why-tf-idf-is-more-effective-than-bag-of-words-49ba175247c3> [Accessed 1 January 2022].


[7] Medium. 2020. Text analysis basics in Python. [online] Available at: <https://towardsdatascience.com/text-analysis-basics-in-python-443282942ec5> [Accessed 1 January 2022].


[8] Maas, A. L., Daly, R. E., Pham, P. T., Huang, D, Ng, A. Y.  and  Potts, C., 2011, "Learning Word Vectors for Sentiment Analysis", Association for Computational Linguistics, pp. 142-150

