# ANLP Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.


In [27]:
candidateno=276318 #this MUST be updated to your candidate number so that you get a unique data sample


In [28]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import math
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [29]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
            pair is a list of the training data and the second is a list of the test data.
    """

    data = list(data)
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)


def get_train_test_data():

    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')

    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]

    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`

In [30]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['if', 'there', "'", 's', 'one', 'thing', 'in', ...], 'pos')


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why


To calculate precision and recall later, I split the data into four parts, as below. Since the positive and negative reviews are now stored in separate lists, I drop the labels to reduce the chance of mistakes in later code.

In [31]:
pos_training_data = [file[0] for file in training_data if file[1] == "pos"]
neg_training_data = [file[0] for file in training_data if file[1] == "neg"]
pos_testing_data = [file[0] for file in testing_data if file[1] == "pos"]
neg_testing_data = [file[0] for file in testing_data if file[1] == "neg"]

print(len(pos_training_data))
print(len(neg_training_data))
print(pos_training_data[0])

700
700
['if', 'there', "'", 's', 'one', 'thing', 'in', ...]


This cell removes stop words, numbers and punctuation, because the classification is based on a bag of words. It then prints the first review of each class to show the filtered words.

In [32]:
def filter_stopwords(review):
    # Build the stop word set once; set membership tests are faster than list scans
    stop_words = set(stopwords.words("english"))
    filter_data = []
    for file in review:
        # Keep alphabetic tokens that are not stop words
        filter_data.append([w for w in file if w not in stop_words and w.isalpha()])
    return filter_data

pos_training_data = filter_stopwords(pos_training_data)
neg_training_data = filter_stopwords(neg_training_data)

print(pos_training_data[0])
print(neg_training_data[0])

['one', 'thing', 'common', 'hollywood', 'major', 'studios', 'productions', 'moving', 'toward', 'mainstream', 'although', 'twentieth', 'century', 'fox', 'new', 'line', 'cinema', 'spawned', 'subsidiaries', 'specialize', 'independant', 'controversial', 'motion', 'pictures', 'fox', 'searchlight', 'fine', 'line', 'respectively', 'obvious', 'significant', 'movement', 'underway', 'promote', 'inventive', 'ideas', 'theater', 'movie', 'like', 'gary', 'ross', 'pleasantville', 'comes', 'along', 'wrapped', 'blanket', 'innovative', 'ideas', 'served', 'platter', 'fine', 'production', 'welcome', 'change', 'pace', 'frequent', 'cineplexes', 'although', 'atmosphere', 'buzz', 'movie', 'cheery', 'lighthearted', 'pleasantville', 'mistaken', 'thought', 'movie', 'quite', 'opposite', 'true', 'fact', 'director', 'ross', 'skillfully', 'brings', 'narrative', 'intense', 'intelligent', 'undertones', 'screen', 'story', 'joys', 'living', 'life', 'fullest', 'well', 'social', 'ills', 'segregation', 'captures', 'essence
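The filtering rule can be checked on a small hand-made example. This is only a sketch: `TOY_STOP_WORDS`, `filter_tokens` and `toy_review` are hypothetical names introduced for illustration, and the stop word set is hand-picked rather than the NLTK list used in the notebook.

```python
# Hypothetical, hand-picked stop words for illustration only; the notebook
# itself uses nltk.corpus.stopwords.words("english").
TOY_STOP_WORDS = {"if", "there", "s", "in", "the", "is", "a"}


def filter_tokens(tokens, stop_words=TOY_STOP_WORDS):
    """Keep alphabetic tokens that are not stop words."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]


toy_review = ["if", "there", "'", "s", "one", "thing", "in", "1998", "!"]
print(filter_tokens(toy_review))  # ['one', 'thing']
```

The apostrophe, the number and the exclamation mark are dropped by `isalpha()`, and the remaining short function words are dropped by the stop word test.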

The task requires two word lists: ten positive words and ten negative words. Generally speaking, adjectives and adverbs express sentiment more strongly than nouns or verbs. I therefore download a part-of-speech tagger and tag every word in every review, which turns each word into a (word, tag) tuple.

In [33]:
nltk.download('averaged_perceptron_tagger')

def pos_tagged(review):
    tagged_data = []
    for file in review:
        tagged_data.append(nltk.pos_tag(file))
    return tagged_data

pos_tagged_data = pos_tagged(pos_training_data)
neg_tagged_data = pos_tagged(neg_training_data)
print(pos_tagged_data[0])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('one', 'CD'), ('thing', 'NN'), ('common', 'JJ'), ('hollywood', 'NN'), ('major', 'JJ'), ('studios', 'NNS'), ('productions', 'NNS'), ('moving', 'VBG'), ('toward', 'IN'), ('mainstream', 'JJ'), ('although', 'IN'), ('twentieth', 'JJ'), ('century', 'NN'), ('fox', 'VBD'), ('new', 'JJ'), ('line', 'NN'), ('cinema', 'NN'), ('spawned', 'VBD'), ('subsidiaries', 'NNS'), ('specialize', 'VBP'), ('independant', 'JJ'), ('controversial', 'JJ'), ('motion', 'NN'), ('pictures', 'NNS'), ('fox', 'VBP'), ('searchlight', 'JJ'), ('fine', 'JJ'), ('line', 'NN'), ('respectively', 'RB'), ('obvious', 'JJ'), ('significant', 'JJ'), ('movement', 'NN'), ('underway', 'JJ'), ('promote', 'JJ'), ('inventive', 'JJ'), ('ideas', 'NNS'), ('theater', 'VBP'), ('movie', 'NN'), ('like', 'IN'), ('gary', 'JJ'), ('ross', 'NN'), ('pleasantville', 'NN'), ('comes', 'VBZ'), ('along', 'RB'), ('wrapped', 'JJ'), ('blanket', 'NN'), ('innovative', 'JJ'), ('ideas', 'NNS'), ('served', 'VBD'), ('platter', 'JJ'), ('fine', 'JJ'), ('production', '

In the Penn Treebank tag set used by the tagger, adjectives are tagged JJ. I select all words tagged JJ and print the first ten as an example.

In [34]:
def adj_words(review):
    words = []
    for file in review:
        for w in file:
            if w[1] == "JJ":
                words.append(w[0])
    return words

pos_selected_words = adj_words(pos_tagged_data)
neg_selected_words = adj_words(neg_tagged_data)

print(pos_selected_words[:10])

['common', 'major', 'mainstream', 'twentieth', 'new', 'independant', 'controversial', 'searchlight', 'fine', 'obvious']


`FreqDist` returns a dictionary-like structure whose values are the number of times each word occurs. Here I take the 100 most common adjectives in each class, ordered from most to least frequent.

In [35]:
def words_times(words, k):
    doc = nltk.FreqDist(words)
    word_time = []
    for w in doc.most_common(k):
        word_time.append(w)
    return word_time

pos_word = list(set(pos_selected_words))
neg_word = list(set(neg_selected_words))

pos_times = words_times(pos_selected_words, 100)
neg_times = words_times(neg_selected_words, 100)

print(pos_times)
print(neg_times)

[('good', 891), ('many', 562), ('great', 539), ('much', 522), ('little', 512), ('new', 494), ('first', 395), ('real', 342), ('old', 339), ('big', 304), ('young', 300), ('last', 295), ('american', 263), ('original', 252), ('bad', 244), ('funny', 226), ('true', 220), ('high', 219), ('black', 211), ('special', 210), ('small', 187), ('several', 183), ('different', 182), ('right', 182), ('sure', 173), ('human', 169), ('final', 158), ('hard', 157), ('entire', 156), ('whole', 153), ('main', 153), ('comic', 150), ('second', 149), ('enough', 145), ('full', 141), ('able', 141), ('next', 137), ('long', 133), ('strong', 132), ('beautiful', 131), ('dead', 129), ('short', 127), ('screen', 124), ('top', 123), ('classic', 123), ('nice', 122), ('white', 120), ('perfect', 119), ('fine', 114), ('wonderful', 113), ('romantic', 111), ('evil', 110), ('wrong', 109), ('major', 108), ('excellent', 107), ('hilarious', 105), ('deep', 104), ('effective', 102), ('interesting', 101), ('live', 101), ('emotional', 99

Positive and negative words cannot be identified from frequency alone. For example, good occurs 891 times in positive reviews, but it is also among the most frequent adjectives in negative reviews. I ignore words like this and keep only those that appear in the top 100 of one class but not in the top 100 of the other.

In [36]:
def words_select(select_words, sample_words):
    filter_words = []
    for w in select_words:
        isPrime = 1
        for word in sample_words:
            if w[0] == word[0]:
                isPrime = 0
                break
        if isPrime == 1:
            filter_words.append(w)
    return filter_words

pos_words = words_select(pos_times, neg_times)
neg_words = words_select(neg_times, pos_times)

print(pos_words)
print(neg_words)

[('perfect', 119), ('wonderful', 113), ('excellent', 107), ('hilarious', 105), ('effective', 102), ('happy', 97), ('due', 95), ('personal', 94), ('powerful', 94), ('dark', 93), ('titanic', 92), ('political', 90), ('previous', 87), ('impressive', 85), ('visual', 84), ('similar', 84), ('solid', 82), ('memorable', 80), ('easy', 77), ('former', 77), ('present', 77), ('overall', 76), ('private', 76), ('sweet', 76), ('rich', 75), ('dramatic', 75), ('brilliant', 73), ('subtle', 72), ('future', 71)]
[('stupid', 131), ('poor', 106), ('predictable', 87), ('wild', 84), ('ridiculous', 82), ('terrible', 77), ('awful', 77), ('give', 74), ('late', 73), ('complete', 72), ('sexual', 71), ('scary', 70), ('large', 68), ('know', 67), ('close', 66), ('low', 66), ('huge', 66), ('potential', 64), ('decent', 63), ('dumb', 63), ('middle', 62), ('third', 61), ('dull', 61), ('worth', 59), ('impossible', 59), ('cool', 58), ('nick', 57), ('particular', 57), ('slow', 57)]
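The same "exclusive to one class" selection can be written with a set lookup instead of a nested loop. This is a minimal sketch, not the notebook's own function: `exclusive_words` and the toy frequency lists are hypothetical, and the counts are made up for illustration.

```python
def exclusive_words(own_counts, other_counts):
    """Return (word, count) pairs from own_counts whose word is absent from other_counts."""
    other_vocab = {word for word, _ in other_counts}
    return [(word, count) for word, count in own_counts if word not in other_vocab]


# Toy frequency lists (hypothetical counts), in most-common order.
toy_pos = [("good", 9), ("perfect", 4), ("funny", 3)]
toy_neg = [("good", 8), ("stupid", 5), ("funny", 2)]

print(exclusive_words(toy_pos, toy_neg))  # [('perfect', 4)]
print(exclusive_words(toy_neg, toy_pos))  # [('stupid', 5)]
```

Words shared by both lists ("good", "funny") are removed from each side, and the original most-common order is preserved.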


The two word lists are now complete. Most of the words are conventionally positive or negative, such as perfect, wonderful, excellent, stupid and terrible, although a few (for example due, give and late) carry little sentiment on their own.

In [37]:
pos_bag = [w[0] for w in pos_words[:10]]
neg_bag = [w[0] for w in neg_words[:10]]

print(pos_bag)
print(neg_bag)

['perfect', 'wonderful', 'excellent', 'hilarious', 'effective', 'happy', 'due', 'personal', 'powerful', 'dark']
['stupid', 'poor', 'predictable', 'wild', 'ridiculous', 'terrible', 'awful', 'give', 'late', 'complete']


2)
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.



This is a word list classifier, and it works as follows. First, I set two scores, positive and negative, to zero. Second, I loop over the words of a review: if a word is in the positive list, the positive score increases by one, and if it is in the negative list, the negative score increases by one. The review is classified as positive if it contains more positive words than negative words, and as negative in the opposite case. If the two scores are equal, the classifier guesses at random, although this should be rare.

In [38]:
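class BasicClassification:
    def __init__(self, content):
        self.content = content
        self.pos = 0
        self.neg = 0

    def classify(self, pos_bag, neg_bag):
        # Reset the counters so that repeated calls do not accumulate scores
        self.pos = 0
        self.neg = 0
        for w in self.content:
            if w in pos_bag:
                self.pos += 1
            if w in neg_bag:
                self.neg += 1
        # 1 = predicted positive, 0 = predicted negative
        if self.pos > self.neg:
            return 1
        elif self.pos < self.neg:
            return 0
        else:
            # Tie: guess at random
            return random.randint(0, 1)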

Next I run the classifier on the testing data. I collect the predictions in two lists, one per true class, where 1 is a positive prediction and 0 is a negative prediction.

In [39]:
pos_result = []
for file in pos_testing_data:
    rst = BasicClassification(file)
    pos_result.append(rst.classify(pos_bag, neg_bag))
neg_result = []
for file in neg_testing_data:
    rst = BasicClassification(file)
    neg_result.append(rst.classify(pos_bag, neg_bag))

print(pos_result[:20])
print(neg_result[:20])

[1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
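The counting rule behind the word list classifier can be traced by hand on a toy review. The sketch below is illustrative only: `word_list_score`, the two toy word lists and the toy review are hypothetical and are not taken from the training data.

```python
def word_list_score(tokens, pos_words, neg_words):
    """Return (#positive hits, #negative hits) for one tokenised review."""
    pos_hits = sum(1 for t in tokens if t in pos_words)
    neg_hits = sum(1 for t in tokens if t in neg_words)
    return pos_hits, neg_hits


toy_pos_bag = {"perfect", "wonderful"}
toy_neg_bag = {"stupid", "awful"}
toy_review = ["wonderful", "cast", "perfect", "ending", "stupid", "villain"]

pos_hits, neg_hits = word_list_score(toy_review, toy_pos_bag, toy_neg_bag)
label = "pos" if pos_hits > neg_hits else "neg" if neg_hits > pos_hits else "tie"
print(pos_hits, neg_hits, label)  # 2 1 pos
```

A review that contains none of the listed words scores 0 against 0, which is the tie case that the classifier resolves by a random guess.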


3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).


I calculate recall and precision for each class, together with accuracy and error rate, and then print and discuss the results. The values are well balanced, which is a good property of this experiment: there is no case where accuracy is high while recall or precision is very low. Every value is about 65 percent. This is not ideal, but it is acceptable, because it is clearly above the 50 percent expected from random guessing on balanced data.
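Accuracy is a fair summary here because the test set is balanced. A hypothetical imbalanced scenario shows when it stops being reasonable; the 95/5 split and the always-negative classifier below are assumptions made up for illustration, not results from this notebook.

```python
# Hypothetical imbalanced test set: 95 negative reviews and 5 positive ones.
toy_gold = ["neg"] * 95 + ["pos"] * 5
# A degenerate classifier that always answers "neg".
toy_predicted = ["neg"] * len(toy_gold)

toy_correct = sum(g == p for g, p in zip(toy_gold, toy_predicted))
toy_accuracy = toy_correct / len(toy_gold)

toy_true_pos = sum(g == "pos" and p == "pos" for g, p in zip(toy_gold, toy_predicted))
toy_pos_recall = toy_true_pos / toy_gold.count("pos")

print(toy_accuracy, toy_pos_recall)  # 0.95 0.0
```

The classifier never finds a positive review, yet its accuracy is 0.95, so on skewed data precision, recall and F1 are needed alongside accuracy.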

In [40]:
pos_recall_rate = pos_result.count(1) / len(pos_result)
neg_recall_rate = neg_result.count(0) / len(neg_result)
pos_precision = pos_result.count(1) / (pos_result.count(1) + neg_result.count(1))
neg_precision = neg_result.count(0) / (pos_result.count(0) + neg_result.count(0))
accuracy = (pos_result.count(1) + neg_result.count(0)) / len(pos_result + neg_result)
error_rate = (pos_result.count(0) + neg_result.count(1)) / len(pos_result + neg_result)

In [41]:
df1 = pd.DataFrame([[pos_result.count(1), pos_result.count(0), len(pos_result)],
                     [neg_result.count(1), neg_result.count(0), len(neg_result)],
                     [pos_result.count(1) + neg_result.count(1), pos_result.count(0) + neg_result.count(0),
                      len(pos_result + neg_result)]], index=("+ve", "-ve", "total"), columns=("+ve", "-ve", "total"))

print(df1)

       +ve  -ve  total
+ve    191  109    300
-ve    102  198    300
total  293  307    600


In [42]:
print("pos_recall_rate = {:.2f}". format(pos_recall_rate))
print("neg_recall_rate = {:.2f}". format(neg_recall_rate))
print("pos_precision = {:.2f}". format(pos_precision))
print("neg_precision = {:.2f}". format(neg_precision))
print("accuracy = {:.2f}". format(accuracy))
print("error_rate = {:.2f}". format(error_rate))

pos_recall_rate = 0.64
neg_recall_rate = 0.66
pos_precision = 0.65
neg_precision = 0.64
accuracy = 0.65
error_rate = 0.35
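The question also asks for the F1 score, which is the harmonic mean of precision and recall. A minimal sketch follows, using the counts from the confusion matrix above; `f1_score` is a hypothetical helper written for this example, not a library call.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2PR / (P + R), written directly in terms of counts."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


# Counts taken from the confusion matrix above (rows are the true class).
pos_f1 = f1_score(true_pos=191, false_pos=102, false_neg=109)
neg_f1 = f1_score(true_pos=198, false_pos=109, false_neg=102)

print("pos_f1 = {:.2f}".format(pos_f1))  # pos_f1 = 0.64
print("neg_f1 = {:.2f}".format(neg_f1))  # neg_f1 = 0.65
```

Both F1 scores sit between the corresponding precision and recall, as expected for a harmonic mean.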


One advantage is that the theory is simple and the code is not complicated to implement, so it does not need specialists or a high maintenance cost; code written by someone else is hard to read, especially when it is long and complex. Secondly, the results are stable. Even though this is probably not the best method, the stable results suggest that no obvious mistakes were made in the process. Additionally, the results show no significant bias towards either class, a point I return to with the Naive Bayes classifier in the next question.

However, the weaknesses are obvious as well. First, precision, recall and accuracy are all fairly low. Second, like the Naive Bayes classifier, it relies on a specific assumption, namely that the words in a review are independent of each other, which rarely holds. Third, building good word lists is a substantial task that depends on the size of the data set and other factors; here it was simplified, which is not sufficient. Finally, the choice of k (the number of most common words kept per class, 100 here) strongly affects the result of a trial. Even if the counts in positive and negative reviews were instead compared by subtraction or division, the chosen threshold would still have an important impact on the outcome.

In [43]:
df2 = pd.DataFrame([[0.55, 0.67, 0.60, 0.69, 0.54, 0.65, 0.66, 0.54, 0.65, 0.63],
                    [0.71, 0.61, 0.65, 0.62, 0.71, 0.62, 0.55, 0.67, 0.70, 0.69],
                    [0.66, 0.63, 0.63, 0.65, 0.65, 0.63, 0.61, 0.62, 0.68, 0.67],
                    [0.61, 0.65, 0.62, 0.67, 0.61, 0.64, 0.63, 0.59, 0.66, 0.65],
                    [0.63, 0.64, 0.62, 0.66, 0.62, 0.64, 0.61, 0.60, 0.66, 0.65],
                    [0.37, 0.36, 0.38, 0.34, 0.38, 0.36, 0.39, 0.40, 0.33, 0.34]],
                   index=("pos recall rate", "neg recall rate", "pos precision", "neg precision", "accuracy", "error rate"),
                   columns=("round1", "round2", "round3", "round4", "round5", "round6", "round7", "round8", "round9", "round10"))

print(df2)

                 round1  round2  round3  round4  round5  round6  round7  \
pos recall rate    0.55    0.67    0.60    0.69    0.54    0.65    0.66   
neg recall rate    0.71    0.61    0.65    0.62    0.71    0.62    0.55   
pos precision      0.66    0.63    0.63    0.65    0.65    0.63    0.61   
neg precision      0.61    0.65    0.62    0.67    0.61    0.64    0.63   
accuracy           0.63    0.64    0.62    0.66    0.62    0.64    0.61   
error rate         0.37    0.36    0.38    0.34    0.38    0.36    0.39   

                 round8  round9  round10  
pos recall rate    0.54    0.65     0.63  
neg recall rate    0.67    0.70     0.69  
pos precision      0.62    0.68     0.67  
neg precision      0.59    0.66     0.65  
accuracy           0.60    0.66     0.65  
error rate         0.40    0.33     0.34  


4)
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results.


Some parts of the Bayes classifier are similar to the BOW classifier, such as filtering the stop words and building a list of all words. Others are different, for example the calculation of the probabilities. The explanation further below is based on the elements of Bayes' formula.
   

In [44]:
def words_list(reviews):
    all_words = []
    for review in reviews:
        all_words += review
    return all_words



In [45]:
pos_words = words_list(pos_training_data)
neg_words = words_list(neg_training_data)

pos_testing_data = filter_stopwords(pos_testing_data)
neg_testing_data = filter_stopwords(neg_testing_data)
pos_testing_words = words_list(pos_testing_data)
neg_testing_words = words_list(neg_testing_data)

Note: the next cell may need about two minutes to run.

In [46]:
def words_final(words, test):
    words += list(set(words))
    for w in set(test):
        if w not in words:
            words.append(w)
    return words

pos_words = words_final(pos_words, pos_testing_words + neg_testing_words)
neg_words = words_final(neg_words, pos_testing_words + neg_testing_words)
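The cell above appends every distinct training word once and every unseen test word once, which amounts to add-one (Laplace) smoothing over the combined vocabulary. The `w not in words` test scans a list, which is a likely reason for the long running time. The sketch below shows the same counts computed with a `Counter` and a set; `smoothed_counts` and the toy tokens are hypothetical and only illustrate the idea.

```python
from collections import Counter


def smoothed_counts(class_tokens, vocabulary):
    """Add-one (Laplace) smoothing: every vocabulary word gets its count plus 1."""
    counts = Counter(class_tokens)
    smoothed = {word: counts[word] + 1 for word in vocabulary}
    total = sum(smoothed.values())
    return smoothed, total


# Toy data (hypothetical): tokens seen in one class, plus tokens from test reviews.
toy_class_tokens = ["good", "good", "funny"]
toy_test_tokens = ["good", "dull"]
toy_vocab = set(toy_class_tokens) | set(toy_test_tokens)

smoothed, total = smoothed_counts(toy_class_tokens, toy_vocab)
print(smoothed["good"], smoothed["funny"], smoothed["dull"], total)  # 3 2 1 6
```

The unseen word "dull" receives a count of 1 rather than 0, so its logarithm is defined when the probabilities are combined.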


In [47]:
pos_doc = nltk.FreqDist(pos_words)
neg_doc = nltk.FreqDist(neg_words)
pos_len = len(pos_words)
neg_len = len(neg_words)

**P(A|B)=P(B|A)*P(A)/P(B)**: P(B|A) is the likelihood, P(A) is the prior and P(A|B) is the posterior. This means that once you have the likelihood, you can compute the posterior. P(B|A)*P(A) is equal to P(AB), the probability that event A and event B both happen, so the formula is a restatement of the definition of conditional probability.

In this formula, we can think of B as a destination and of the Ai as the different roads that lead to B. Once we have reached B, we want to know which road was taken. In the training data, the label of a review (positive or negative) is event B, the destination. Because the numbers of positive and negative reviews are the same, each label has a probability of 0.5. The words in the positive or negative reviews are the different roads, Ai (i = 1, 2, 3, …, k). The joint probability of a word and a label is estimated from the number of occurrences of the word divided by the total number of words in the positive (or negative) reviews, and dividing it by the probability of the label, 0.5 here, gives the conditional probability.

To predict on the testing data, we need another formula related to Bayes, the law of total probability: P(B) = P(B|A1)*P(A1) + … + P(B|Ak)*P(Ak). Clearly, Bayes' formula involves a division, while this one involves multiplication and addition. Here the Ai are the different words in the testing review, and P(Ai) reflects that we have met this word in the review. P(B|Ai) is the chance of a positive or negative review once we meet this word, and the value P(AiB), which is P(B|Ai)*P(Ai), is what we estimated from the training data. This has to be computed under each assumption about B, positive or negative. In the code, the per-word probabilities are combined by adding their logarithms, which is equivalent to multiplying them. Finally, we compare the two scores, and the larger one gives the predicted class. To avoid a probability of zero for words that do not occur in the training data of a class, I applied Laplace smoothing in the code above.
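Written as a formula, the decision rule that the classifier code below appears to implement is the following. The notation is an assumption introduced here: c is the class, w_1 … w_n are the words of the test review, N_c is the number of training tokens in class c, and V_c is the smoothed vocabulary for that class (its training words plus any unseen test words).

```latex
\hat{c} = \arg\max_{c \in \{\text{pos}, \text{neg}\}}
  \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right],
\qquad
P(w \mid c) = \frac{\operatorname{count}(w, c) + 1}{N_c + |V_c|}
```

The code uses base-10 logarithms, which does not change which class gets the larger score.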

In [48]:
class BayesClassification:

    def __init__(self, content):
        self.content = content
        self.pos = 0.5
        self.neg = 0.5

    def predict(self):
        pos_possibility = math.log(self.pos, 10)
        for w in self.content:
            x = math.log(pos_doc[w]/pos_len, 10)
            pos_possibility += x
        neg_possibility = math.log(self.neg, 10)
        for w in self.content:
            x = math.log(neg_doc[w]/neg_len, 10)
            neg_possibility += x
        if pos_possibility > neg_possibility:
            return 1
        elif pos_possibility < neg_possibility:
            return 0
        else:
            return random.randint(0, 1)

In [49]:
pos_result = []
for review in pos_testing_data:
    file = BayesClassification(review)
    pos_result.append(file.predict())
neg_result = []
for review in neg_testing_data:
    file = BayesClassification(review)
    neg_result.append(file.predict())

In [50]:
pos_recall_rate = pos_result.count(1) / len(pos_result)
neg_recall_rate = neg_result.count(0) / len(neg_result)
pos_precision = pos_result.count(1) / (pos_result.count(1) + neg_result.count(1))
neg_precision = neg_result.count(0) / (pos_result.count(0) + neg_result.count(0))
accuracy = (pos_result.count(1) + neg_result.count(0)) / len(pos_result + neg_result)
error_rate = (pos_result.count(0) + neg_result.count(1)) / len(pos_result + neg_result)


Similarly, I will describe the advantages and disadvantages of the Bayes classifier. One clear advantage is that accuracy, precision and recall all improve compared with the BOW classifier, reaching about 80 percent overall. This held across the ten trials shown below. Another advantage is that the method is not too difficult, because it ignores complicated situations. The third is that the method is grounded in probability theory, which makes it more principled.

However, there are still some limitations. Firstly, the basic assumption is that every word is independent, but in most cases this does not hold. For instance, words such as ‘not’, or even punctuation such as ‘!’, can completely change the meaning of a sentence or reverse its polarity. Secondly, the accuracy depends on the amount of training data. Generally, a more complete data set provides more accurate probabilities and therefore better predictions. I did not change the ratio value in this experiment, so this remains a speculation. Lastly, the results show an interesting pattern: the positive recall is lower than the negative recall in nine of the ten rounds, and equal to it in round 8. This is not a good sign, because it suggests that there is an unsolved problem. Unfortunately, I have not yet identified the reason or a solution; possible factors include the length of the text or the dependence between words.

In [51]:
df1 = pd.DataFrame([[pos_result.count(1), pos_result.count(0), len(pos_result)],
                     [neg_result.count(1), neg_result.count(0), len(neg_result)],
                     [pos_result.count(1) + neg_result.count(1), pos_result.count(0) + neg_result.count(0),
                      len(pos_result + neg_result)]], index=("+ve", "-ve", "total"), columns=("+ve", "-ve", "total"))
print(df1)

       +ve  -ve  total
+ve    231   69    300
-ve     46  254    300
total  277  323    600


In [52]:
print("pos_recall_rate = {:.2f}". format(pos_recall_rate))
print("neg_recall_rate = {:.2f}". format(neg_recall_rate))
print("pos_precision = {:.2f}". format(pos_precision))
print("neg_precision = {:.2f}". format(neg_precision))
print("accuracy = {:.2f}". format(accuracy))
print("error_rate = {:.2f}". format(error_rate))

pos_recall_rate = 0.77
neg_recall_rate = 0.85
pos_precision = 0.83
neg_precision = 0.79
accuracy = 0.81
error_rate = 0.19


In [53]:
df2 = pd.DataFrame([[0.77, 0.81, 0.78, 0.74, 0.74, 0.77, 0.76, 0.83, 0.76, 0.80],
                   [0.85, 0.82, 0.83, 0.85, 0.86, 0.84, 0.88, 0.83, 0.81, 0.86],
                   [0.83, 0.82, 0.82, 0.83, 0.84, 0.82, 0.87, 0.83, 0.80, 0.85],
                   [0.79, 0.81, 0.79, 0.76, 0.77, 0.78, 0.79, 0.83, 0.77, 0.87],
                   [0.81, 0.82, 0.81, 0.79, 0.80, 0.80, 0.82, 0.83, 0.78, 0.83],
                   [0.19, 0.18, 0.20, 0.21, 0.20, 0.20, 0.18, 0.17, 0.22, 0.17]],
                  index=("pos recall rate", "neg recall rate", "pos precision", "neg precision", "accuracy", "error rate"),
                  columns=("round1", "round2", "round3", "round4", "round5", "round6", "round7", "round8", "round9", "round10"))

print(df2)

                 round1  round2  round3  round4  round5  round6  round7  \
pos recall rate    0.77    0.81    0.78    0.74    0.74    0.77    0.76   
neg recall rate    0.85    0.82    0.83    0.85    0.86    0.84    0.88   
pos precision      0.83    0.82    0.82    0.83    0.84    0.82    0.87   
neg precision      0.79    0.81    0.79    0.76    0.77    0.78    0.79   
accuracy           0.81    0.82    0.81    0.79    0.80    0.80    0.82   
error rate         0.19    0.18    0.20    0.21    0.20    0.20    0.18   

                 round8  round9  round10  
pos recall rate    0.83    0.76     0.80  
neg recall rate    0.83    0.81     0.86  
pos precision      0.83    0.80     0.85  
neg precision      0.83    0.77     0.87  
accuracy           0.83    0.78     0.83  
error rate         0.17    0.22     0.17  


5)
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions.

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.



One limitation of Naive Bayes is its assumption that the words in a sentence are independent of each other. One hypothesis, therefore, is that the Bayes classifier performs better on shorter texts than on longer texts. I mix positive and negative reviews together, because the difference between them is not what I am focusing on here; length is the only factor under study. The result is unexpected and does not match my hypothesis: accuracy was 0.77 on the short reviews and 0.84 on the long reviews. I do not fully understand why, and there may be mistakes in my code. That would be a low-level error, but it should be considered, because in Python a single indentation mistake can produce a different answer or even an error.

In summary, I would recommend the Bayes classifier when the task is not too demanding, because it is quick and simple. The values it gives are usually better than those of the BOW classifier. At present I can only compare these two, and Bayes is the better one. If I learn more methods for sentiment analysis in the future, my answer may change.

In [54]:
def quick_sort(arr):
    if len(arr) <= 1:
        return arr

    pivot = arr[0]
    pivot_left = [x for x in arr[1:] if len(x) <= len(pivot)]
    pivot_right = [x for x in arr[1:] if len(x) > len(pivot)]

    return quick_sort(pivot_left) + [pivot] + quick_sort(pivot_right)


In [55]:
pos_sort_data = quick_sort(pos_testing_data)
neg_sort_data = quick_sort(neg_testing_data)

short_data = pos_sort_data[:50] + neg_sort_data[:50]
long_data = pos_sort_data[250:] + neg_sort_data[250:]

len_short_data = [len(text) for text in short_data]
len_long_data = [len(text) for text in long_data]

print("the average length of short reviews = {:d}". format(sum(len_short_data) // len(len_short_data)))
print("the average length of long reviews = {:d}". format(sum(len_long_data) // len(len_long_data)))


the average length of short reviews = 165
the average length of long reviews = 605
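Ordering reviews by length can also be done with Python's built-in `sorted` and `key=len`. This is a sketch on hypothetical toy reviews; note that `sorted` is stable, so reviews of equal length may come out in a different order than with the hand-written quick sort.

```python
toy_reviews = [["a", "b", "c"], ["a"], ["a", "b"]]

by_length = sorted(toy_reviews, key=len)  # shortest first
shortest = by_length[:1]                  # the shortest review
longest = by_length[-1:]                  # the longest review

print([len(r) for r in by_length])  # [1, 2, 3]
```

Slicing the sorted list from the front and from the back then gives the short and long groups, as in the cell above.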


In [56]:
short_result = []
for review in short_data:
    file = BayesClassification(review)
    short_result.append(file.predict())
long_result = []
for review in long_data:
    file = BayesClassification(review)
    long_result.append(file.predict())

In [57]:
short_accuracy = (short_result[:50].count(1) + short_result[50:].count(0)) / len(short_result)
long_accuracy = (long_result[:50].count(1) + long_result[50:].count(0)) / len(long_result)
print("short accuracy = {:.2f}". format(short_accuracy))
print("long accuracy = {:.2f}". format(long_accuracy))

short accuracy = 0.77
long accuracy = 0.84


In [58]:
df3 = pd.DataFrame([[160, 164, 176, 163, 161],
                   [612, 594, 591, 594, 601],
                   [0.78, 0.88, 0.76, 0.74, 0.76],
                   [0.88, 0.85, 0.84, 0.90, 0.83]],
                  index=("average length of short", "average length of long", "short accuracy", "long accuracy"),
                  columns=("round1", "round2", "round3", "round4", "round5"))
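The five rounds recorded in `df3` can be summarised by their means. This is a small sketch using the accuracies copied from the cell above; the list names are hypothetical.

```python
from statistics import mean

# Accuracies copied from the five rounds recorded in df3 above.
short_accuracy_rounds = [0.78, 0.88, 0.76, 0.74, 0.76]
long_accuracy_rounds = [0.88, 0.85, 0.84, 0.90, 0.83]

mean_short = mean(short_accuracy_rounds)
mean_long = mean(long_accuracy_rounds)

print("mean short accuracy = {:.2f}".format(mean_short))  # 0.78
print("mean long accuracy = {:.2f}".format(mean_long))    # 0.86
```

On these five rounds the long reviews score higher on average, with round 2 as the only round where the short reviews do better.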


In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath

import io
from nbformat import current

from google.colab import drive
drive.mount('/content/drive')

filepath="/content/drive/MyDrive/Colab Notebooks/ANLPassignment2023.ipynb"
question_count=432

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

Mounted at /content/drive


FileNotFoundError: ignored