# Label the unlabelled reviews in the Large Movie Review Dataset

Predict the sentiment of the unlabelled reviews in the Large Movie Review Dataset with the text2emotion package.

Then compare the results with the predictions of a model trained on the labelled data of the same dataset.

## 1. Predict with the text2emotion package (an open-source package)

In [1]:
%pip install text2emotion



In [2]:
import text2emotion as te
import pandas as pd
import numpy as np
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# read the unlabelled review text into a list, one review per line

with open('/content/drive/MyDrive/NLP_movie_review/aclImdb/unlabelled/unlabelled.txt', 'r') as f:
    un_reviews = [line.strip() for line in f]

print("Length of list: ", len(un_reviews))

Length of list:  50000


In [4]:
un_reviews[0]

'I admit, the great majority of films released before say 1933 are just not for me. Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (The Last Command and City Lights, that latter Chaplin circa 1931).<br /><br />So I was apprehensive about this one, and humor is often difficult to appreciate (uh, enjoy) decades later. I did like the lead actors, but thought little of the film.<br /><br />One intriguing sequence. Early on, the guys are supposed to get "de-loused" and for about three minutes, fully dressed, do some schtick. In the background, perhaps three dozen men pass by, all naked, white and black (WWI ?), and for most, their butts, part or full backside, are shown. Was this an early variation of beefcake courtesy of Howard Hughes?'

In [5]:
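# regexes for preprocessing the text: remove punctuation and digits, and replace
# double line breaks, hyphens and slashes with a space (lower-casing happens in preprocess_reviews)

import re

# raw strings keep the backslashes for the regex engine and avoid invalid-escape warnings
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
NO_SPACE = ""
SPACE = " "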

def preprocess_reviews(reviews):
    
    reviews = [REPLACE_NO_SPACE.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(SPACE, line) for line in reviews]
    
    return reviews

un_reviews_clean = preprocess_reviews(un_reviews)
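
To make the effect of the two patterns concrete, here is a minimal sketch that applies them to a single string. The sample sentence and the helper name `preprocess_one` are made up for illustration; the patterns are the ones defined in the cell above, written as raw strings.

```python
import re

# Same patterns as in the preprocessing cell, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def preprocess_one(line):
    """Hypothetical single-string version of preprocess_reviews."""
    # Lower-case, then delete punctuation and digits.
    line = REPLACE_NO_SPACE.sub("", line.lower())
    # Replace double line breaks, hyphens and slashes with a space.
    return REPLACE_WITH_SPACE.sub(" ", line)


sample = 'Well-made film (1931).<br /><br />Loved it!'  # made-up example text
cleaned = preprocess_one(sample)
print(cleaned.split())
```

Note that the order matters: punctuation and digits are deleted first, and the `<br /><br />` marker survives that step because none of its characters are in the first pattern.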

In [6]:
# lemmatization for the corpus

def get_lemmatized_text(corpus):
    
    import nltk
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

lemmatized_un_reviews = get_lemmatized_text(un_reviews_clean)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
# predict the sentiment of each review and append the label to a list
# pos = Happy + Surprise, neg = Angry + Fear + Sad (emotion scores from text2emotion)
# if pos - neg >= 0.3, the review is treated as positive and labelled 1
# if pos - neg <= -0.3, the review is treated as negative and labelled -1
# if pos - neg is strictly between -0.3 and 0.3, the review is treated as neutral and labelled 0

sentiment_te = []

for text in lemmatized_un_reviews[:1000]:  #try with only first 1000 samples
  probability = te.get_emotion(text)
  pos = probability['Happy'] + probability['Surprise']
  neg = probability['Angry'] + probability['Fear'] + probability['Sad']
  
  if pos - neg >= 0.3:
    sentiment_te.append(1) 
  elif pos - neg <= -0.3:
    sentiment_te.append(-1)
  else:
    sentiment_te.append(0)

In [8]:
# inspect the first 20 predictions

sentiment_te[:20]

[0, 0, -1, 0, 0, 0, 0, 1, 0, 0, 0, 0, -1, 0, 0, -1, -1, 0, 0, 0]
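
The labelling rule used in the loop above can be isolated into a small function, which makes the threshold easy to vary and the rule easy to check by hand. This is only a sketch: the function name `label_from_emotions` and the example score dictionaries are invented for illustration, and it assumes the dictionary has the five keys used in the loop (`Happy`, `Surprise`, `Angry`, `Fear`, `Sad`).

```python
def label_from_emotions(probability, threshold=0.3):
    """Map an emotion-score dict to 1 (positive), -1 (negative) or 0 (neutral)."""
    pos = probability['Happy'] + probability['Surprise']
    neg = probability['Angry'] + probability['Fear'] + probability['Sad']
    diff = pos - neg
    if diff >= threshold:
        return 1
    if diff <= -threshold:
        return -1
    return 0


# Made-up score dictionaries, not real text2emotion output.
mostly_happy = {'Happy': 0.5, 'Angry': 0.0, 'Surprise': 0.25, 'Sad': 0.25, 'Fear': 0.0}
mostly_angry = {'Happy': 0.0, 'Angry': 0.5, 'Surprise': 0.0, 'Sad': 0.25, 'Fear': 0.25}
balanced = {'Happy': 0.25, 'Angry': 0.25, 'Surprise': 0.25, 'Sad': 0.25, 'Fear': 0.0}

labels = [label_from_emotions(p) for p in (mostly_happy, mostly_angry, balanced)]
print(labels)
```

Treating `Surprise` as a positive emotion is a choice made in this notebook; surprise can also accompany negative reviews, so the grouping may be worth revisiting.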

## 2. Compare the predictions with those of the model trained on labelled data

In [9]:
# load the labelled training set that was used to train the saved model,
# so that the CountVectorizer can be fitted on the same data

with open('/content/drive/MyDrive/NLP_movie_review/aclImdb/movie_data/full_train.txt', 'r') as f:
    reviews_train = [line.strip() for line in f]

reviews_train_clean = preprocess_reviews(reviews_train)

In [10]:
stop_words = ['in', 'of', 'at', 'a', 'the']  # a short, custom list of stop words
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)  # binary features over unigrams to trigrams
ngram_vectorizer.fit(reviews_train_clean)  # fit on the same labelled training data that the saved model was trained on
unlabelled_set = ngram_vectorizer.transform(un_reviews_clean)

In [11]:
# load the pickled model back from file

with open("/content/drive/MyDrive/NLP_movie_review/final_baseline_lr.pkl", 'rb') as file:
    lr = pickle.load(file)

lr

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
df = pd.DataFrame(list(zip(unlabelled_set[:1000], sentiment_te)),
               columns =['Corpus', 'test2emotion_label'])
df.shape

(1000, 2)

In [13]:
# predict the sentiment with the saved model
df["lr_label"] = lr.predict(unlabelled_set[:1000])

In [14]:
df.head(10)

| | Corpus | test2emotion_label | lr_label |
|---|---|---|---|
| 0 | `(0, 4124)\t1\n (0, 17672)\t1\n (0, 17843)\...` | 0 | 1 |
| 1 | `(0, 22554)\t1\n (0, 51735)\t1\n (0, 117580...` | 0 | -1 |
| 2 | `(0, 33802)\t1\n (0, 33980)\t1\n (0, 34599)...` | -1 | -1 |
| 3 | `(0, 2155)\t1\n (0, 39193)\t1\n (0, 39442)\...` | 0 | 1 |
| 4 | `(0, 49016)\t1\n (0, 49480)\t1\n (0, 69051)...` | 0 | 1 |
| 5 | `(0, 83557)\t1\n (0, 87334)\t1\n (0, 141081...` | 0 | 1 |
| 6 | `(0, 39193)\t1\n (0, 40558)\t1\n (0, 49016)...` | 0 | 1 |
| 7 | `(0, 38349)\t1\n (0, 117580)\t1\n (0, 12496...` | 1 | 1 |
| 8 | `(0, 56137)\t1\n (0, 56724)\t1\n (0, 56727)...` | 0 | 1 |
| 9 | `(0, 25921)\t1\n (0, 100266)\t1\n (0, 10076...` | 0 | 1 |


In [15]:
# drop the rows with test2emotion_label == 0: reviews that text2emotion labels as neutral cannot be
# compared, because the trained model is binary and only predicts positive (1) or negative (-1)

df = df[df['test2emotion_label'] != 0]
df.head(10)

| | Corpus | test2emotion_label | lr_label |
|---|---|---|---|
| 2 | `(0, 33802)\t1\n (0, 33980)\t1\n (0, 34599)...` | -1 | -1 |
| 7 | `(0, 38349)\t1\n (0, 117580)\t1\n (0, 12496...` | 1 | 1 |
| 12 | `(0, 28653)\t1\n (0, 59953)\t1\n (0, 64649)...` | -1 | -1 |
| 15 | `(0, 4124)\t1\n (0, 17672)\t1\n (0, 17820)\...` | -1 | -1 |
| 16 | `(0, 2006)\t1\n (0, 2332)\t1\n (0, 2550)\t1...` | -1 | 1 |
| 20 | `(0, 39193)\t1\n (0, 39796)\t1\n (0, 49016)...` | -1 | 1 |
| 23 | `(0, 4124)\t1\n (0, 5774)\t1\n (0, 31440)\t...` | -1 | 1 |
| 25 | `(0, 4124)\t1\n (0, 8213)\t1\n (0, 10141)\t...` | -1 | 1 |
| 28 | `(0, 4124)\t1\n (0, 26435)\t1\n (0, 31440)\...` | -1 | 1 |
| 29 | `(0, 2332)\t1\n (0, 2341)\t1\n (0, 4124)\t1...` | -1 | 1 |


In [16]:
print("Agreement between predictions from the model trained on labelled data and text2emotion: %s"
      % accuracy_score(df['lr_label'], df['test2emotion_label']))

Agreement between predictions from the model trained on labelled data and text2emotion: 0.39408866995073893


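Among the first 1,000 unlabelled reviews, restricted to those that text2emotion labelled as positive or negative, about 39% of the text2emotion labels match the predictions of the logistic regression model. Because neither set of labels is ground truth, `accuracy_score` here measures the agreement between the two methods rather than the accuracy of either one.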
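
A single agreement rate hides the direction of the disagreements. In the rows displayed above, for instance, several reviews are labelled -1 by text2emotion and 1 by the logistic regression model. A cross-tabulation of label pairs makes that visible. The sketch below uses only the standard library; the function name `agreement_report` and the two short label lists are made up for illustration and are not the notebook's data.

```python
from collections import Counter


def agreement_report(labels_a, labels_b):
    """Return the fraction of matching labels and a count of each (a, b) pair."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = Counter(zip(labels_a, labels_b))
    matched = sum(n for (a, b), n in pairs.items() if a == b)
    return matched / len(labels_a), pairs


# Made-up labels: the first list plays the role of text2emotion_label,
# the second the role of lr_label.
te_labels = [-1, 1, -1, -1, -1, -1]
lr_labels = [-1, 1, -1, -1, 1, 1]

rate, pairs = agreement_report(te_labels, lr_labels)
print("agreement rate:", rate)
for (te_label, lr_label), count in sorted(pairs.items()):
    print("text2emotion=%d, lr=%d: %d" % (te_label, lr_label, count))
```

With the notebook's data, the same function could be called as `agreement_report(list(df['test2emotion_label']), list(df['lr_label']))`; the `confusion_matrix` function imported at the top of the notebook would give a comparable table.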