# Sentiment Analysis of IMDB Movie Reviews using Logistic Regression
In this notebook, I take a basic approach for predicting the sentiment of a movie review using Logistic Regression and carry out experiments to improve the accuracy of the model.

The four experiments I carry out include:

1. Performing negation handling by prefixing negated words with not_.
2. Performing negation handling by replacing negated words with their antonyms.
3. Performing negation handling by prefixing negated verbs and adjectives with not_, using the spaCy library.
4. Using Doc2Vec from the gensim library to generate feature vectors.

The first three experiments did not improve accuracy. Switching to Doc2Vec, however, raised test accuracy from 74% to 82.5%.

# Setup
This notebook trains a binary classifier on a dataset of movie reviews, each labelled as expressing either *positive* or *negative* sentiment towards the movie.

First we will install *scikit-learn*, which we will use for the machine learning. The package is installed under the name scikit-learn and imported as sklearn; installing it under the name sklearn fails.

In [2]:
pip install scikit-learn


Next we will install the *datasets* library. We will use the IMDB sentiment analysis dataset available from the [huggingface datasets library](https://huggingface.co/datasets/imdb) and described in [Maas et al. 2011](https://aclanthology.org/P11-1015.pdf).

In [3]:
pip install datasets


Now let's load the IMDB dataset. We will print out the last training instance once the data has been converted.

In [4]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")


# Original Method

Let's convert the training data into the format expected by scikit-learn: a list of input documents and a list of associated output labels.

In [5]:
train_dataset = imdb_dataset['train']
train_data = []
train_data_labels = []
for item in train_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1


We'll use the CountVectorizer class to extract the words in each review as the features the algorithm will learn from. Only the 200 most frequent words are used in this version, so each document is represented as a 200-dimensional vector of word counts.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

As a sanity check, let's confirm that we have a 2-d array where each row is one of the 25,000 instances and each column is one of the 200 words. We also print out the words that will be used for classification.

In [7]:
print(features.shape)
print(vectorizer.get_feature_names_out())

(25000, 200)
['10' 'about' 'acting' 'action' 'actors' 'actually' 'after' 'again' 'all'
 'also' 'an' 'and' 'another' 'any' 'are' 'around' 'as' 'at' 'back' 'bad'
 'be' 'because' 'been' 'before' 'being' 'best' 'better' 'between' 'big'
 'both' 'br' 'but' 'by' 'can' 'cast' 'character' 'characters' 'could'
 'did' 'didn' 'director' 'do' 'does' 'doesn' 'don' 'down' 'end' 'enough'
 'even' 'ever' 'every' 'fact' 'few' 'film' 'films' 'find' 'first' 'for'
 'from' 'funny' 'get' 'give' 'go' 'going' 'good' 'got' 'great' 'had' 'has'
 'have' 'he' 'her' 'here' 'him' 'his' 'horror' 'how' 'however' 'if' 'in'
 'into' 'is' 'it' 'its' 'just' 'know' 'life' 'like' 'little' 'long' 'look'
 'lot' 'love' 'made' 'make' 'makes' 'man' 'many' 'may' 'me' 'more' 'most'
 'movie' 'movies' 'much' 'my' 'never' 'new' 'no' 'not' 'nothing' 'now'
 'of' 'off' 'old' 'on' 'one' 'only' 'or' 'original' 'other' 'out' 'over'
 'own' 'part' 'people' 'plot' 'pretty' 'quite' 're' 'real' 'really'
 'right' 'same' 'say' 'scene' 'scenes' 'see'

## Training
Split the data into a training set and a validation (dev) set. We'll use 75% of the data for training and hold out the remaining 25% as the validation set on which we evaluate the model.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

We will use Logistic Regression to do the classification. Create the model.

In [9]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Train the model and predict labels for the validation set.

In [10]:
model = model.fit(X=X_train,y=y_train)
y_pred = model.predict(X_val)

Now let's calculate the accuracy of the model's predictions on the validation set.

In [11]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.76368


## Testing 
Now let's prepare some test data. We use the same 1,000 test reviews as in the BERT notebook.

In [12]:
test_dataset = imdb_dataset['test'].shuffle(seed=42).select(range(1000))
test_data = []
test_data_labels = []
for item in test_dataset:
  test_data.append(item['text'])
  test_data_labels.append(item['label'])

Apply the model to the test data.

In [13]:
test_pred = model.predict(vectorizer.transform(test_data).toarray())
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(test_data_labels, test_pred))

0.74
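Because the test set here has only 1,000 reviews, small differences in test accuracy should be read with some caution. As a rough guide, and assuming the reviews can be treated as independent draws, the binomial standard error of an accuracy estimate can be sketched as follows. The function name accuracy_standard_error is hypothetical and not part of the notebook.

```python
import math

def accuracy_standard_error(accuracy, n_examples):
    """Approximate binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

# Baseline test accuracy reported above, measured on 1,000 reviews
se = accuracy_standard_error(0.74, 1000)
print(f"standard error: {se:.4f}")
```

Under this approximation the standard error is a little over one percentage point, so differences of a few tenths of a point between runs are well within noise.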


# Experiment 1
I carried out experiment 1 by implementing the handle_negation function.

This function takes a string of text, splits it on whitespace, and walks through the words while keeping a negation flag. When a word exactly matches one of the negation words, the flag is set; when a word contains a punctuation character, the flag is cleared. While the flag is set, every word (including the negation word itself) is added to the output with a "not_" prefix; otherwise the word is added unchanged. The words are then joined back into a single string. Because matching is exact and case-sensitive, a negation word that is capitalised or has punctuation attached (for example "Not" or "not,") does not set the flag. The intent is that a downstream model sees "not_good" as a different feature from "good".

In [14]:
# import the necessary libraries for experiment 1
import re
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [15]:
# Preprocess the data to handle negation
def handle_negation(text):
    # Split the text into words
    words = text.split()
    # Iterate over all words and check for negation
    negation = False
    result = []
    negation_words = ['no', 'not', 'never', 'none', 'nobody', 'nothing', 
                      'nowhere', 'neither', 'nor', 'hardly', 'scarcely', 
                      'barely', 'rarely', 'little', 'few', 'except', 'without', 
                      'minus', 'non']
    for word in words:
        # If a negation word is found, set the flag to True
        if word in negation_words:
            negation = True
        # If a punctuation mark is found, set the flag to False
        elif re.search(r'[^\w\s]', word):
            negation = False
        # Add the word to the result with a "not_" prefix if negation is True
        if negation:
            result.append('not_' + word)
        else:
            result.append(word)
    # Join the words back together into a single string and return it
    return ' '.join(result)
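The scope-marking idea can be illustrated with a small standalone variant. This is a sketch under my own assumptions rather than the function used in the experiment: it lowercases words, strips trailing punctuation before matching, treats words ending in "n't" as negation cues, leaves the cue itself unprefixed, and uses a shortened negation list. The names mark_negation_scope, NEGATION_CUES and CLAUSE_END are hypothetical.

```python
import re

# Illustrative subset of negation cues (assumption, not the notebook's full list)
NEGATION_CUES = {"no", "not", "never", "nothing", "without"}
# A word ending in one of these characters closes the negation scope
CLAUSE_END = re.compile(r"[.,;:!?]$")

def mark_negation_scope(text):
    """Prefix words after a negation cue with 'not_' until clause-ending punctuation."""
    result = []
    negating = False
    for raw in text.split():
        word = raw.lower()
        core = word.strip(".,;:!?")
        if core in NEGATION_CUES or core.endswith("n't"):
            negating = True              # open a scope; the cue itself is kept as-is
            result.append(word)
        elif negating:
            result.append("not_" + word)
        else:
            result.append(word)
        if CLAUSE_END.search(word):
            negating = False             # punctuation closes the scope after this word
    return " ".join(result)

print(mark_negation_scope("I did not like the plot, but the acting was great."))
```

Note that with max_features=200 most "not_" tokens are too rare to enter the vocabulary, which limits how much any variant of this preprocessing can change the features.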

In [16]:
# Preprocess the data with negation handling
train_dataset = imdb_dataset['train']
train_data = []
train_data_labels = []
for item in train_dataset:
    # implementing negation handling
    train_data.append(handle_negation(item['text']))
    train_data_labels.append(item['label'])

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

## Training
Splitting into train and validation sets

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model = model.fit(X=X_train,y=y_train)
y_pred = model.predict(X_val)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.76


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
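The warning above is scikit-learn reporting that the solver stopped at its iteration limit before converging. One remedy the message itself suggests, and the one Experiment 4 uses, is to raise max_iter. A minimal sketch on made-up data, with hypothetical names toy_texts, toy_labels and clf:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration only
toy_texts = ["great film", "loved it", "bad film", "hated it"]
toy_labels = [1, 1, 0, 0]

# Chain vectorizer and classifier; max_iter=1000 gives the solver more room to converge
clf = make_pipeline(
    CountVectorizer(max_features=200),
    LogisticRegression(max_iter=1000),
)
clf.fit(toy_texts, toy_labels)
print(clf.predict(["great", "bad"]))
```

The accuracies reported in this notebook were obtained with the default iteration limit, so rerunning with a higher max_iter may give slightly different numbers.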


## Testing

In [18]:
test_dataset = imdb_dataset['test'].shuffle(seed=42).select(range(1000))
test_data = []
test_data_labels = []
for item in test_dataset:
  test_data.append(handle_negation(item['text']))
  test_data_labels.append(item['label'])

test_pred = model.predict(vectorizer.transform(test_data).toarray())
print(accuracy_score(test_data_labels, test_pred))



0.743



# Experiment 2
Experiment 2 involves implementing the handle_negation_synsets function to preprocess text. 

The handle_negation_synsets function handles negation by replacing the word that follows a negation word with its antonym, falling back to a "not_" prefix when no antonym is available. The function takes a string of text as input and first tokenizes it using the word_tokenize function from the nltk library. It then uses the wordnet module from the nltk library to look up the synsets (sets of synonymous senses) for each token.

Next, the function loops over the tokens and their associated synsets. When a token is a negation word, it is dropped from the output and a flag is set so that the next token is negated. For a token to be negated, the function searches its synsets for antonyms. If an antonym is found, the function adds the first antonym to the output list. If no antonym is found, or the token has no synsets, the function adds the original word with a "not_" prefix. The flag is then reset, so only the single token immediately after the negation word is affected. Any other token is added to the output list unchanged.

Finally, the function returns the list of negated words joined together into a single string, with words separated by spaces. The resulting string represents the original text with negation handled appropriately.

In [19]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [20]:
def handle_negation_synsets(text):
    # Define a list of negation words
    negation_words = ['no', 'not', 'never', 'none', 'nobody', 'nothing', 
                      'nowhere', 'neither', 'nor', 'hardly', 'scarcely', 
                      'barely', 'rarely', 'little', 'few', 'except', 'without', 
                      'minus', 'non']    
    
    # Tokenize the input text into words
    words = nltk.word_tokenize(text)
    
    # Get the WordNet synsets for each word
    words_synsets = [wordnet.synsets(w) for w in words]
    
    # Create a list to hold the negated words
    negated_words = []
    
    # Initialize a flag to keep track of whether we're currently in a negation scope
    negate = False
    
    # Loop over each word and its corresponding synsets
    for i, word_synsets in enumerate(words_synsets):
        
        # Check if we're currently in a negation scope
        if negate:
            
            # Check if the current word has any synsets
            if len(word_synsets) > 0:
                
                # Create a list to hold the antonyms for the current word
                word_antonyms = []
                
                # Loop over each synset for the current word
                for syn in word_synsets:
                    
                    # Get the antonyms for the current synset
                    antonyms = [ant for ant in syn.lemmas()[0].antonyms()]
                    
                    # If antonyms exist, add the first one to the list of word antonyms
                    if antonyms:
                        word_antonyms.append(antonyms[0].name())
                
                # If there are any word antonyms, add the first one to the list of negated words
                if len(word_antonyms) > 0:
                    negated_words.append(word_antonyms[0])
                else:
                    # If there are no antonyms for the current word, add a "not_" prefix to the word
                    negated_words.append("not_" + words[i])
            
            else:
                # If the current word has no synsets, add a "not_" prefix to the word
                negated_words.append("not_" + words[i])
            
            # Reset the negate flag
            negate = False
        
        # Check if the current word is a negation word
        elif words[i] in negation_words:
            # If so, set the negate flag to True
            negate = True
        
        else:
            # If not in a negation scope, add the original word to the list of negated words
            negated_words.append(words[i])
    
    # Join the negated words back together into a single string and return it
    return ' '.join(negated_words)
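The control flow of this function can be seen in isolation with a toy antonym table in place of WordNet. This is a sketch for illustration: TOY_ANTONYMS, NEGATION_CUES and negate_next_word are hypothetical names, and the table is made up rather than derived from WordNet.

```python
# Toy antonym table standing in for the WordNet lookup (assumption, illustration only)
TOY_ANTONYMS = {"good": "bad", "bad": "good", "happy": "unhappy"}
NEGATION_CUES = {"not", "never", "no"}

def negate_next_word(tokens):
    """Replace the token after a negation cue with its antonym, or prefix it with 'not_'."""
    output = []
    negate = False
    for token in tokens:
        if negate:
            # Use the antonym if one is known, otherwise fall back to the not_ prefix
            output.append(TOY_ANTONYMS.get(token, "not_" + token))
            negate = False
        elif token in NEGATION_CUES:
            negate = True    # the cue itself is dropped from the output
        else:
            output.append(token)
    return output

print(negate_next_word(["the", "film", "was", "not", "good"]))
print(negate_next_word(["not", "a", "good", "film"]))
```

The second call shows a limitation of negating only the next token: in "not a good film" the article is marked and "good" is left untouched.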



In [21]:
# Preprocess the data with negation handling
train_dataset = imdb_dataset['train']
train_data = [handle_negation_synsets(item['text']) for item in train_dataset]
train_data_labels = [item['label'] for item in train_dataset]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

## Training

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model = model.fit(X=X_train,y=y_train)
y_pred = model.predict(X_val)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.75872


## Testing

In [23]:
test_dataset = imdb_dataset['test'].shuffle(seed=42).select(range(1000))
test_data = [handle_negation_synsets(item['text']) for item in test_dataset]
test_data_labels = [item['label'] for item in test_dataset]

test_pred = model.predict(vectorizer.transform(test_data).toarray())
print(accuracy_score(test_data_labels, test_pred))



0.726


# Experiment 3
Experiment 3 involves performing preprocessing using the handle_negation_spacy function.

This function processes the text with the spaCy English language model, which tokenizes it and assigns a part-of-speech tag to each token. It initializes a negation flag and defines a list of negation words. The function then iterates over each token: if the token is a negation word, the flag is toggled and the negation word itself is left out of the result. If the flag is True and the token is a verb or adjective, the token is added to the result list with a "not_" prefix. Otherwise the token is added unchanged. Finally, the tokens in the result list are joined into a single string. Note that the flag is only ever changed by another negation word, not by punctuation, so a negation scope can run across sentence boundaries. The purpose of this function is to add "not_" prefixes to verbs and adjectives that appear within a negation context, allowing models to capture the opposite sentiment conveyed by the negation.

In [24]:
!pip install spacy
!python -m spacy download en_core_web_sm


In [25]:
import spacy

nlp = spacy.load("en_core_web_sm")

def handle_negation_spacy(text):
    # Run the spaCy pipeline (loaded once above) on the text
    doc = nlp(text)
    # Initialize a flag to keep track of negation
    negation = False    
    # Define a list of negation words
    negation_words = ['no', 'not', 'never', 'none', 'nobody', 'nothing', 
                      'nowhere', 'neither', 'nor', 'hardly', 'scarcely', 
                      'barely', 'rarely', 'little', 'few', 'except', 'without', 
                      'minus', 'non']
    # Initialize an empty list to hold the result
    result = []
    # Iterate over each token in the document
    for token in doc:
        # If the token is a negation word, toggle the negation flag
        if token.text in negation_words:
            negation = not negation
        # If the negation flag is True and the token is a verb or adjective, 
        # add the token with a "not_" prefix to the result list
        elif negation and token.pos_ in ["ADJ", "VERB"]:
            result.append("not_" + token.text)
        # Otherwise, just add the token to the result list
        else:
            result.append(token.text)
    # Join the tokens in the result list into a single string and return it
    return " ".join(result)


In [26]:
# Preprocess the data with negation handling
train_dataset = imdb_dataset['train']
train_data = [handle_negation_spacy(item['text']) for item in train_dataset]
train_data_labels = [item['label'] for item in train_dataset]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',max_features=200,lowercase=True)
features = vectorizer.fit_transform(train_data).toarray()

## Training

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.75,random_state=123)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model = model.fit(X=X_train,y=y_train)
y_pred = model.predict(X_val)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_pred))

0.75424


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Testing

In [28]:
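test_dataset = imdb_dataset['test'].shuffle(seed=42).select(range(1000))
# Apply the same spaCy-based preprocessing that was used on the training data.
# The accuracy shown below was recorded with handle_negation_synsets applied here instead.
test_data = [handle_negation_spacy(item['text']) for item in test_dataset]
test_data_labels = [item['label'] for item in test_dataset]

test_pred = model.predict(vectorizer.transform(test_data).toarray())
print(accuracy_score(test_data_labels, test_pred))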



0.746


# Experiment 4
Experiment 4 involves tokenizing each document with simple_preprocess and tagging it with TaggedDocument, both from the gensim library. The resulting data is a list of tagged documents, train_data, where each document is a list of tokens paired with a unique tag.

The code then trains a Doc2Vec model on the tagged training data train_data. The model is configured with a vector size of 100, a window size of 5, a minimum word count of 5, and 20 epochs.

Finally, the code computes a vector for each training document using model.infer_vector and stores the vectors in train_vectors. Note that infer_vector infers a fresh vector from the document's tokens rather than looking up the vector learned for that document during training. The code also extracts the corresponding labels from the training set and stores them in train_labels. The resulting train_vectors and train_labels are then split into training and validation sets and used to train a classifier on the IMDB dataset.

In [33]:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [34]:
# Load dataset
from datasets import load_dataset
imdb_dataset = load_dataset("imdb")
train_dataset = imdb_dataset['train']

# Tokenize and tag the documents
train_data = []
for i, item in enumerate(train_dataset):
    tokens = gensim.utils.simple_preprocess(item['text'])
    train_data.append(TaggedDocument(tokens, [i]))

# Train doc2vec model
model = Doc2Vec(train_data, vector_size=100, window=5, min_count=5, epochs=20)

# Extract features for training set
train_vectors = [model.infer_vector(doc.words) for doc in train_data]
train_labels = [item['label'] for item in train_dataset]

# Split the data into a training and validation (dev) set
X_train, X_val, y_train, y_val = train_test_split(train_vectors, train_labels, train_size=0.75, random_state=123)




In [35]:
# Train Logistic Regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Test the model on the validation set
y_pred = logreg.predict(X_val)
print("Accuracy on validation set:", accuracy_score(y_val, y_pred))

Accuracy on validation set: 0.84864


In [36]:
# Prepare test data
test_dataset = imdb_dataset['test'].shuffle(seed=42).select(range(1000))
test_data = []
test_labels = []
for item in test_dataset:
    tokens = gensim.utils.simple_preprocess(item['text'])
    test_data.append(model.infer_vector(tokens))
    test_labels.append(item['label'])

# Test the model on the test set
test_pred = logreg.predict(test_data)
print("Accuracy on test set:", accuracy_score(test_labels, test_pred))



Accuracy on test set: 0.825


# Conclusion
In this notebook, I tested a few variations of negation handling to improve the accuracy of the sentiment classification model. Unfortunately, none of them improved on the baseline: test accuracy stayed between 72.6% and 74.6%, against 74% for the baseline.

I then decided to try a different approach by changing the feature representation to Doc2Vec document vectors.

This improved test accuracy by about 8 percentage points, from 74% to 82.5%. I have learned that improving the accuracy of a model involves a systematic trial-and-error approach that requires keeping an open mind to alternative methods.