# Sentiment Analysis Adaptations

This notebook seeks to improve on the model from a simple Google Colab notebook. That baseline sentiment analysis model uses logistic regression and is trained and tested on data from the IMDB movie review dataset. It obtains ~77% accuracy, and the goal of this notebook is to improve on this with two ideas while still using logistic regression. The selected ideas are *stop word removal* and *named entity recognition* with *masking*.

The IMDB dataset used in this notebook is available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and is described in detail in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

### Import dataset / dependencies

In [1]:
import sklearn
import numpy as np
import spacy

In [2]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
print(raw_datasets['train'][0])
print(raw_datasets['test'][0])

Reusing dataset imdb (/Users/dockreg/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

### Data Formatting

Scikit-learn is used in this notebook to transform the documents into a list of input vectors; a list of the associated output labels is also created.

In [4]:
#train_dataset = raw_datasets['train'].shuffle(seed=42).select(range(25000))
train_dataset = raw_datasets['train']
train_data = []
train_data_labels = []
for item in train_dataset:
    train_data.append(item['text'])
    train_data_labels.append(item['label'])
print(len(train_data))

25000


CountVectorizer is used to build a vocabulary of the most frequent words in the corpus, which serve as the features. Each review is then represented by a single array holding the count of each of these words. The max_features parameter is set to 200, so only the 200 most frequent words are used as features, which is taken to be sufficient for this task.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(train_data)
features_nd = features.toarray()
print(len(features_nd))
print(len(features_nd[0]))

25000
200
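
To make the effect of max_features concrete, the following is a minimal pure-Python sketch of a top-k bag-of-words encoding. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy documents are invented for illustration, and the token pattern is assumed to mirror CountVectorizer's default of words with two or more characters.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern (2+ word characters).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens across all documents."""
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    top = [word for word, _ in totals.most_common(max_features)]
    return sorted(top)  # alphabetical order gives a fixed column layout

def transform(docs, vocabulary):
    """Turn each document into a list of counts, one per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical toy corpus, not taken from the IMDB data.
toy_docs = [
    "The film was good, really good.",
    "The film was bad.",
    "Good acting, bad plot, the end.",
]
vocab = fit_vocabulary(toy_docs, max_features=2)
vectors = transform(toy_docs, vocab)
print(vocab)    # ['good', 'the']
print(vectors)  # [[2, 1], [0, 1], [1, 1]]
```

Even in this toy corpus one of the two available feature slots goes to 'the', which says nothing about sentiment, while 'bad' is dropped entirely.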


In [7]:
# split data into 80/20 train/validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features_nd,train_data_labels,train_size=0.8,random_state=123)

### Import/Train/Test the model

In [8]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

In [9]:
# model training
log_model = log_model.fit(X=X_train,y=y_train)

In [10]:
# model testing
y_pred = log_model.predict(X_val)

In [11]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.7666


The logistic regression classifier has ~77% accuracy on the 20% of the training data that was held out for validation.
To test the model, a random selection of 1000 reviews from the test split is chosen below.

In [12]:
test_dataset = raw_datasets['test'].shuffle(seed=42).select(range(1000))
test_data = []
test_data_labels = []
for item in test_dataset:
    test_data.append(item['text'])
    test_data_labels.append(item['label'])



Loading cached shuffled indices for dataset at /Users/dockreg/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-b23cfeb68a931a8d.arrow


In [13]:
# run the test data through the model
test_pred=log_model.predict(vectorizer.transform(test_data).toarray())

In [14]:
print(accuracy_score(test_pred,test_data_labels))

0.74


### Inspection of Misclassified Reviews

To get a better understanding of the performance of the model so far, some of the incorrectly labelled reviews are inspected. 

In [15]:
test_data[0]

"<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some others out there.<br /><br />Since

In [16]:
test_data_array = np.array(test_data)

In [17]:
test_data_array[0]

"<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some others out there.<br /><br />Since

In [18]:
misclassified_samples = test_data_array[test_data_labels != test_pred]

In [19]:
#target = test_data_labels[test_pred!=test_data_labels]

In [20]:
prediction = test_pred[test_pred != test_data_labels]

In [21]:
len(prediction)
prediction[0:3]

array([0, 0, 1])
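
A plain Python list cannot be indexed with a boolean array, which is one reason to keep the true labels alongside the texts in another way. The sketch below pairs each misclassified text with its true and predicted label using only `zip`; the function name `collect_misclassified` and the toy data are hypothetical stand-ins for test_data, test_data_labels and test_pred.

```python
def collect_misclassified(texts, true_labels, predicted_labels):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    return [
        (text, int(true), int(pred))
        for text, true, pred in zip(texts, true_labels, predicted_labels)
        if true != pred
    ]

# Hypothetical toy data standing in for test_data, test_data_labels and test_pred.
toy_texts = ["loved it", "not good at all", "a fine film", "dull"]
toy_true = [1, 0, 1, 0]
toy_pred = [1, 1, 0, 0]

errors = collect_misclassified(toy_texts, toy_true, toy_pred)
for text, true, pred in errors:
    print(f"true={true} predicted={pred}: {text}")
```

Keeping the three values together avoids having to line up separate arrays by position when reading the examples that follow.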

### Sample: Positive
### Predicted: Negative

In [22]:
misclassified_samples[0]

"<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, however, I realized that this story was about A Thousand Other Things besides just Acres. I started crying and couldn't stop until long after the movie ended. Thank you Jane, Laura and Jocelyn, for bringing us such a wonderfully subtle and compassionate movie! Thank you cast, for being involved and portraying the characters with such depth and gentleness!<br /><br />I recognized the Angry sister; the Runaway sister and the sister in Denial. I recognized the Abusive Husband and why he was there and then the Father, oh oh the Father... all superbly played. I also recognized myself and this movie was an eye-opener, a relief, a chance to face my OWN truth and finally doing something about it. I truly hope A Thousand Acres has had the same effect on some others out there.<br /><br />Since

### Sample: Positive
### Predicted: Negative

In [23]:
misclassified_samples[1]

"This is the latest entry in the long series of films with the French agent, O.S.S. 117 (the French answer to James Bond). The series was launched in the early 1950's, and spawned at least eight films (none of which was ever released in the U.S.). 'O.S.S.117:Cairo,Nest Of Spies' is a breezy little comedy that should not...repeat NOT, be taken too seriously. Our protagonist finds himself in the middle of a spy chase in Egypt (with Morroco doing stand in for Egypt) to find out about a long lost friend. What follows is the standard James Bond/Inspector Cloussou kind of antics. Although our man is something of an overt xenophobe,sexist,homophobe, it's treated as pure farce (as I said, don't take it too seriously). Although there is a bit of rough language & cartoon violence, it's basically okay for older kids (ages 12 & up). As previously stated in the subject line, just sit back,pass the popcorn & just enjoy."

### Sample: Negative
### Predicted: Positive

In [24]:
misclassified_samples[2]

"Porn legend Gregory Dark directs this cheesy horror flick that has Glen Jacobs (Kane from WWF/WWE/ whatever it calls itself nowadays) in his cinematic debut. He plays Jacob Goodknight, a blind serial killer who's forte is taking people's eyes out. The plot, be it as it may, has a group of troubled youths cleaning up the historical hotel that GoodKnight resides in and subsequently being offed by him. Hemmingway it's not. Starts of as fun dopey B-movie, but soon gets too tedious to be enjoyable. Glad I went in with pretty low expectations, but even those weren't met. How can you have a porn king directing and still suffer from a lack of nudity??? for shame.<br /><br />My Grade: D- <br /><br />Eye Candy: Samantha Noble bares her ass briefly"

These reviews appear to be misclassified for a number of reasons. Several issues arise from positive words being used with negation. Another issue is the large number of stop words, which may harm the accuracy of the predictions. These are inspected and removed below.

## Results

The logistic regression sentiment classifier has 74% accuracy on the test sample. The following sections attempt to improve upon this using two selected methods. The chosen methods are:
- Stop word removal
- Named entity recognition

#### Stop word removal

Stop word removal is a useful technique in many Natural Language Processing (NLP) tasks. Stop words are frequently used words that add little meaning to a sentence but are needed for it to be fluent and grammatical. Examples are articles, prepositions, and conjunctions such as 'the', 'in', 'by', and 'and'. These can occupy a large share of a document, and in the case of the CountVectorizer above they may use up valuable slots in the top 200 words while providing little value to the model in determining a positive or negative sentiment.
Many NLP libraries provide lists of stop words which can be used to remove the stop words present in a text. For this task the spaCy library is used.
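
As a rough illustration of the idea, the sketch below filters a sentence against a tiny hand-picked stop list. The list `TOY_STOP_WORDS` and the function `remove_stop_words` are assumptions made for this example; spaCy's own list is much longer.

```python
# Hypothetical, hand-picked stop list for illustration; spaCy's list is far longer.
TOY_STOP_WORDS = {"the", "a", "and", "in", "by", "was", "of", "it"}

def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

review = "The plot was thin and the acting in it was wooden"
cleaned = remove_stop_words(review, TOY_STOP_WORDS)
print(cleaned)  # plot thin acting wooden
```

Because the text is split on whitespace only, a token with punctuation attached (for example 'it,') does not match the stop list and is kept. Tokenising with a proper tokenizer before filtering would catch those cases.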

#### Named Entity Recognition
Named Entity Recognition (NER) is another useful technique that identifies and classifies named entities within a text. Named entities can be people, organisations, locations, countries, events, etc. These can have an unwanted influence on the performance of models, so a number of libraries recognise named entities and allow them to be tagged or masked before the NLP task at hand. The NER library used in this task is spaCy.

## Stop word removal

The movie reviews vary in length. The longest review is inspected here to understand the occurrence of stop words.

In [25]:
max_review = max(train_data, key=len)
print(max_review)    

Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba Ray and Spike Dudley started things off with a Tag Team Table Match against Eddie Guerrero and Chris Benoit. According to the rules of the match, both opponents have to go through tables in order to get the win. Benoit and Guerrero heated up early on by taking turns hammering first Spike and then Bubba Ray. A German suplex by Benoit to Bubba took the wind out of the Dudley brother. Spike tried to help his brother, but the referee restrained him while Benoit and Guerrero ganged up on him in the corner. With Benoit stomping away on Bubba, Guerrero set up a table outside. Spike dashed into the ring and somersaulted over the top rope onto Guerrero on the outside! After recovering and taking care of Spike, Guerrero slipped a table into the ring and helped the Wolverine set it up. The tandem then set up for a double superplex from the middle rope which would have put Bubba through the table, but

In [26]:
# find the number of words in this review
words = max_review.split()
print(len(words))

2470


In [27]:
# find the number of unique words used in the review
d = {}
for word in words:
    if word not in d:
        d[word] = 1
    else:
        d[word] += 1

    
print("Length of the dictionary is : " + str(len(d)) + "\n")
i = 0
for w in sorted(d, key=d.get, reverse=True):
    print(w, d[w])
    if i == 20:
        break
    i += 1

Length of the dictionary is : 854

the 167
and 98
to 85
a 70
with 41
but 38
The 31
in 27
up 26
his 26
of 24
on 22
Rock 22
into 21
Taker 20
for 19
Angle 18
ring 17
Bubba 16
Triple 15
Spike 14


The longest review is 2470 words long and contains 854 distinct tokens. The CountVectorizer above keeps only the 200 most frequent words across the whole corpus, so most of the distinct words in this review cannot contribute to its sentiment prediction. The most common words are 'the', 'and', 'to', and 'a', with between 70 and 167 occurrences each. These do not aid the model's predictions, which reinforces the case for stop word removal in this task.
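
The same kind of frequency count can be written more compactly with `collections.Counter`, and extended to report what share of the tokens are stop words. This is a sketch on a made-up sentence with a four-word stop list; `stop_word_share` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def stop_word_share(text, stop_words, top_n=5):
    """Return the top_n tokens and the fraction of tokens that are stop words."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n_stop = sum(count for word, count in counts.items() if word in stop_words)
    return counts.most_common(top_n), n_stop / len(tokens)

# Hypothetical stop list and sentence, chosen only to keep the example small.
toy_stop_words = {"the", "and", "to", "a"}
toy_review = "the rock went to the ring and the crowd went wild"

top, share = stop_word_share(toy_review, toy_stop_words, top_n=2)
print(top)              # [('the', 3), ('went', 2)]
print(round(share, 2))  # 0.45
```

Note that this version lowercases the text first, so 'the' and 'The' are counted together, whereas the loop above counts them separately.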

In [28]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [29]:
#load spacy - english model - small
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words

In [30]:
# print all of the stop words from this library
print(sw_spacy)

{'‘m', 'an', 'five', 'could', 'they', 'three', 'there', 'around', 'against', 'latter', 'several', 'amongst', 'made', 'various', 'beforehand', 'still', 'its', 'a', 'indeed', 'mostly', 'hereafter', 'sometimes', 'to', 'bottom', 'somehow', 'here', 'been', 'back', 'as', '’m', 'enough', 'call', 'themselves', 'where', 'become', 'see', 'one', "n't", 'them', 'hundred', 'it', 'neither', 'or', 'whereby', 'whatever', 'among', 'wherein', 'she', 'and', 'has', 'nine', 'more', '’d', 'never', 'only', 'at', 'together', 'did', 'his', 'well', '’s', 'us', 'was', 'many', 'used', 'yourself', 'am', 'six', 'although', 'eight', 'while', 'he', 'yet', 'quite', 'about', 'beside', 'might', 'sixty', 'whether', 'otherwise', "'d", 'part', 'from', 'under', 'seem', 'herself', 'me', 'eleven', 'will', 'above', 'if', 'ever', 'hereby', 'thereafter', 'name', "'s", 'too', 'already', 'former', 'during', 'most', 'across', 'becomes', 'no', 'front', 'you', 'seemed', 'hers', 'formerly', 'each', 'forty', 'four', 'fifty', 'but', 'al

Examine the data before the removal of stop words.

In [31]:
train_data[0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [32]:
# create a list containing the reviews with all stop words removed
stop_removed_train_data = []
for item in train_data:
    item = [word for word in item.split() if word.lower() not in sw_spacy]
    item = ' '.join(item)
    stop_removed_train_data.append(item)

In [33]:
# inspect stop word removed data
stop_removed_train_data[0]

'rented CURIOUS-YELLOW video store controversy surrounded released 1967. heard seized U.S. customs tried enter country, fan films considered "controversial" myself.<br /><br />The plot centered young Swedish drama student named Lena wants learn life. particular wants focus attentions making sort documentary average Swede thought certain political issues Vietnam War race issues United States. asking politicians ordinary denizens Stockholm opinions politics, sex drama teacher, classmates, married men.<br /><br />What kills CURIOUS-YELLOW 40 years ago, considered pornographic. Really, sex nudity scenes far between, it\'s shot like cheaply porno. countrymen mind find shocking, reality sex nudity major staple Swedish cinema. Ingmar Bergman, arguably answer good old boy John Ford, sex scenes films.<br /><br />I commend filmmakers fact sex shown film shown artistic purposes shock people money shown pornographic theaters America. CURIOUS-YELLOW good film wanting study meat potatoes (no pun int

In [34]:
print(len(train_data[0].split()),len(stop_removed_train_data[0].split()))

288 142


Stop word removal has reduced the length of this review from 288 words to 142 words. With the stop words gone, the 200 CountVectorizer features can be filled by more informative words.
The new data set is now run through the logistic regression model to see whether the accuracy of the system has improved.

In [35]:
# create an array of the stop word removed data
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(stop_removed_train_data)
features_nd = features.toarray()
print(len(features_nd))
print(len(features_nd[0]))

25000
200


In [36]:
# split the data into 80/20 train/validation sets
X_train, X_val, y_train, y_val = train_test_split(features_nd,train_data_labels,train_size=0.8,random_state=123)

In [37]:
log_model = LogisticRegression()

In [38]:
#Train the model.
log_model = log_model.fit(X=X_train,y=y_train)

In [39]:
#Test the model on the validation set.
y_pred = log_model.predict(X_val)

In [40]:
print(accuracy_score(y_val,y_pred))

0.7622


The accuracy of the model on the validation set is essentially unchanged by stop word removal (0.7622, compared with 0.7666 for the original data). The test data is now used to check whether accuracy on unseen reviews has improved.

In [41]:
# select test data
test_dataset = raw_datasets['test'].shuffle(seed=42).select(range(1000))
test_data = []
test_data_labels = []
for item in test_dataset:
    test_data.append(item['text'])
    test_data_labels.append(item['label'])

Loading cached shuffled indices for dataset at /Users/dockreg/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-b23cfeb68a931a8d.arrow


In [42]:
# remove the stop words from the test data
stop_removed_test_data = []
for item in test_data:
    item = [word for word in item.split() if word.lower() not in sw_spacy]
    item = ' '.join(item)
    stop_removed_test_data.append(item)

In [43]:
#Apply the model to the test data.

test_pred=log_model.predict(vectorizer.transform(stop_removed_test_data).toarray())

print(accuracy_score(test_pred,test_data_labels))

0.767


Removing stop words has improved the accuracy of the logistic regression model on the test sample from 74% to ~77%.

The next phase uses named entity recognition to mask named entities.

## Named Entity Recognition

Named entity recognition is used in this notebook to attempt to improve the accuracy of the model. The thought behind this is that certain actors, directors, organisations, or places from the training data may appear in a negative review without that word itself being fairly described as negative. Words such as bad, awful, or horrible carry a negative connotation, whereas Harry Potter, Shrek, or Gotham City should not be labelled as positive or negative, as that varies from viewer to viewer. The view is that by masking subjective named entities, the model is better able to predict sentiment from the other elements of the language. This theory is put into practice below.

In [44]:
from spacy import displacy

In [45]:
en.add_pipe("merge_entities")

<function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>

In [49]:
# create named entity masked data
ner_train_data = []
for item in train_data:
    doc = en(item)
    ner = ' '.join([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
    ner_train_data.append(ner)

In [54]:
ner_train_data[1]

'" WORK_OF_ART " is a risible and pretentious steaming pile . It does n\'t matter what one \'s political views are because this film can hardly be taken seriously on any level . As for the claim that frontal male nudity is an automatic NC-17 , that is n\'t true . I \'ve seen R - rated films with male nudity . Granted , they only offer some fleeting views , but where are the R - rated films with gaping vulvas and flapping labia ? Nowhere , because they do n\'t exist . The same goes for those crappy cable shows : schlongs swinging in the breeze but not a clitoris in sight . And those pretentious indie movies like PRODUCT , in which we \'re treated to the site of PERSON throbbing johnson , but not a trace of pink visible on PERSON . Before crying ( or implying ) " double - standard " in matters of nudity , the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women : there are no genitals on display when actresses appears nude , and the

The NER-masked data has replaced every recognised named entity with a mask giving the type of that entity.

In the example above the title of the film has been masked as 'WORK_OF_ART'.
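
The masking step itself can be illustrated without a trained model. The sketch below swaps a small hand-written lookup table in place of spaCy's NER, so the entity names, the labels assigned to them, and the function `mask_entities` are all assumptions made for illustration only.

```python
import re

# Hypothetical lookup table standing in for a trained NER model such as spaCy's.
TOY_ENTITIES = {
    "Harry Potter": "PERSON",
    "Gotham City": "GPE",
    "Shrek": "WORK_OF_ART",
}

def mask_entities(text, entities):
    """Replace each known entity string with its entity-type label."""
    # Match longer names first so a multi-word entity is masked as one unit.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: entities[match.group(0)], text)

masked = mask_entities("Shrek was awful but Harry Potter was great", TOY_ENTITIES)
print(masked)  # WORK_OF_ART was awful but PERSON was great
```

The sentiment-bearing words ('awful', 'great') are untouched, while the specific names collapse into a handful of shared labels. Those labels are still ordinary tokens to a bag-of-words vectorizer, so frequent ones may themselves end up among the selected features.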

This new masked data is now turned into a vector array for training and then validated.

In [51]:
# create vector with ner data for training
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(ner_train_data)
features_nd = features.toarray()
print(len(features_nd))
print(len(features_nd[0]))

# split ner data into train/validate 
X_train, X_val, y_train, y_val = train_test_split(features_nd,train_data_labels,train_size=0.8,random_state=123)

# create the model
log_model = LogisticRegression()

# Train the model.
log_model = log_model.fit(X=X_train,y=y_train)

# Test the model on the validation set.
y_pred = log_model.predict(X_val)

print(accuracy_score(y_val,y_pred))

25000
200
0.7648


The accuracy of the model on the validation set is ~76%, very similar to the original model's accuracy of ~77% on the validation set. The model is now tested with the NER-masked test data and the accuracy is inspected.

In [52]:
ner_test_data = []
for item in test_data:
    doc = en(item)
    ner = ' '.join([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
    ner_test_data.append(ner)

In [53]:
#Apply the model to the test data.
test_pred=log_model.predict(vectorizer.transform(ner_test_data).toarray())

print(accuracy_score(test_pred,test_data_labels))

0.746


## Results

The resulting accuracy of the model on the test sample is ~75% (0.746), only marginally above the 74% baseline and below the ~77% obtained with stop word removal. Masking all named entities therefore did not meaningfully improve the model. Further inspection of selected entity types, such as organisations or countries, would be of interest in future to see whether any use of named entities improves the predictive capability of the logistic regression sentiment classification model.
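
When comparing accuracies measured on only 1000 test reviews, it helps to have a rough sense of the sampling uncertainty. The sketch below uses the normal approximation to the binomial distribution, which is an assumption rather than part of the original analysis, and the helper `accuracy_interval` is hypothetical.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_samples examples."""
    std_err = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * std_err, accuracy + z * std_err

# Test accuracies reported in this notebook, each measured on 1000 test reviews.
results = [("baseline", 0.74), ("stop words removed", 0.767), ("NER masked", 0.746)]
for name, acc in results:
    low, high = accuracy_interval(acc, 1000)
    print(f"{name}: {acc:.3f} ({low:.3f} to {high:.3f})")
```

Under this approximation each interval is roughly ±0.027 wide, so the three intervals overlap considerably. Differences of this size may partly reflect sampling noise, and evaluating on the full test split would give a tighter comparison.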