<a href="https://colab.research.google.com/github/anushiya-thevapalan/sentiment-analysis-imdb/blob/master/Sentiment_analysis_on_imdb_movie_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis on IMDB movie reviews

To perform the sentiment analysis, the following files need to be downloaded:

1.   IMDB movie review dataset
2.   GloVe word embedding


## IMDB movie review dataset
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

## GloVe word embedding
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

http://nlp.stanford.edu/data/glove.6B.zip
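
As a rough illustration of what the downloaded archive contains: each line of a `glove.6B.*d.txt` file holds a word followed by its vector components, separated by single spaces. The sketch below parses that layout into a dictionary. `load_glove` is a hypothetical helper name that does not appear elsewhere in this notebook, and the two sample lines are made-up 3-dimensional vectors, not real GloVe values.

```python
import io

def load_glove(lines):
    """Parse GloVe-style text lines into a {word: [floats]} dictionary.

    `lines` is any iterable of strings, e.g. an open file handle.
    (Hypothetical helper, for illustration only.)
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings

# Made-up sample in the same layout as glove.6B.50d.txt (real vectors are longer)
sample = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
glove_sample = load_glove(sample)
print(len(glove_sample["good"]))  # 3
```

With the real data, the same function could be given a file handle opened with `open("glove.6B.50d.txt", encoding="utf-8")`, and the lists would typically be converted to NumPy arrays before use.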


## Downloading the IMDB movie review data and word embedding

In [1]:
# Download the glove word embedding
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-08-26 03:38:42--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-08-26 03:38:42--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-08-26 03:38:42--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [2]:
# unzip the glove word embeddings
!unzip glove*.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [3]:
# Download the movie review dataset
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2020-08-26 03:47:20--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2020-08-26 03:47:26 (14.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [4]:
# Unzip the movie review data
!tar xf aclImdb_v1.tar.gz
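
Because both archives are large, it can help to confirm that each download finished before extracting it. The sketch below compares the size of each file on disk with the byte count reported on the `Length:` lines of the wget output above. `is_complete` and `EXPECTED_SIZES` are hypothetical names introduced here, and the check assumes the servers still host files of the same size.

```python
import os

# Byte counts taken from the "Length:" lines of the wget output
EXPECTED_SIZES = {
    "glove.6B.zip": 862182613,
    "aclImdb_v1.tar.gz": 84125825,
}

def is_complete(path, expected_size):
    """Return True if `path` exists and has exactly `expected_size` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size

for name, size in EXPECTED_SIZES.items():
    status = "ok" if is_complete(name, size) else "missing or incomplete"
    print(f"{name}: {status}")
```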

## Import required libraries

In [10]:
import os
import re
import string
import numpy as np
from keras import Model
from keras.utils import to_categorical, plot_model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, concatenate, Bidirectional
from keras.layers import Embedding
import matplotlib.pyplot as plt

## Loading and preprocessing data

In [7]:
def preprocess(txt):
    '''Preprocesses the text by removing HTML, XML,
      punctuation and numbers, and returns the clean text
    '''

    def removeHTML(txt):
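        # Remove paired HTML elements together with their content, e.g. "<b>text</b>".
        # Unpaired tags such as "<br />" do not match this pattern.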
        pattern = r'(<(?P<tag>[a-zA-Z0-9]+)>.*?</(?P=tag)>)'
        return re.sub(pattern, ' ', txt, flags=re.MULTILINE)

    def removeXml(txt):
        # Remove paths that end in .xml, e.g. "/some/file.xml"
        pattern = r'/[a-zA-Z-_/]*\.xml'
        return re.sub(pattern, ' ', txt, flags=re.MULTILINE)

    def removeContinousFullstops(txt):
        # Replace runs of two or more full stops (e.g. "......") with a single one
        return re.sub(r'\.\.+', ' . ', txt, flags=re.MULTILINE)

    def removeNumbers(txt):
        # Replace each run of digits with a full stop
        pattern = r'\d+'
        return re.sub(pattern, ' . ', txt, flags=re.MULTILINE)

    def removePunctuationwithoutdot(txt):
        remove = string.punctuation
        remove = remove.replace(".", "")  # keep full stops
        # Escape every character: unescaped, ",-/" is read as a range that includes "."
        pattern = r"[{}]".format(re.escape(remove))
        return re.sub(pattern, "", txt)

    txt = removeHTML(txt)
    txt = removeXml(txt)
    txt = removeContinousFullstops(txt)
    txt = removeNumbers(txt)
    txt = removePunctuationwithoutdot(txt)
    return txt
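
One detail worth checking when a character class is built from `string.punctuation`: inside `[...]`, the adjacent characters `,-/` form a range, and that range includes the full stop. The sketch below compares an unescaped class with one built through `re.escape`. The names `naive` and `escaped` and the sample sentence are made up for illustration.

```python
import re
import string

# Punctuation to strip, with the full stop taken out of the list
remove = string.punctuation.replace(".", "")

naive = re.compile("[{}]".format(remove))               # ",-/" is parsed as a range
escaped = re.compile("[{}]".format(re.escape(remove)))  # every character is literal

sample = "great movie. loved it, 10/10!"
print(naive.sub("", sample))    # great movie loved it 1010
print(escaped.sub("", sample))  # great movie. loved it 1010
```

Only the escaped version leaves the full stop in place.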

In [8]:
def loadData():
    '''
    Loads the data from the respective files and preprocesses the data using the preprocess() function
    '''
    # load File names
    trainPosFiles = os.listdir("./aclImdb/train/pos")
    trainNegFiles = os.listdir("./aclImdb/train/neg")
    testPosFiles = os.listdir("./aclImdb/test/pos")
    testNegFiles = os.listdir("./aclImdb/test/neg")

    #load positive, negative files from the directory

    trainPos = []
    trainNeg = []
    testPos = []
    testNeg = []

    for i in range(len(trainPosFiles)):
        with open("./aclImdb/train/pos/" + trainPosFiles[i], "r") as myfile:
            # Lower the text, preprocess the text
            line = preprocess((myfile.readlines()[0]).lower())
            trainPos.append(line)

    for i in range(len(trainNegFiles)):
        with open("./aclImdb/train/neg/" + trainNegFiles[i], "r") as myfile:
            # Lower the text, preprocess the text
            line = preprocess((myfile.readlines()[0]).lower())
            trainNeg.append(line)

    for i in range(len(testPosFiles)):
        with open("./aclImdb/test/pos/" + testPosFiles[i], "r") as myfile:
            # Lower the text, preprocess the text
            line = preprocess((myfile.readlines()[0]).lower())
            testPos.append(line)

    for i in range(len(testNegFiles)):
        with open("./aclImdb/test/neg/" + testNegFiles[i], "r") as myfile:
            # Lower the text, preprocess the text
            line = preprocess((myfile.readlines()[0]).lower())
            testNeg.append(line)

    #merge Positive and Negative Datasets
    trainX = trainPos + trainNeg
    testX = testPos + testNeg

    # Prepare the labels (1 = positive, 0 = negative) and one-hot encode them
    trainY = to_categorical([1 if i < len(trainPos) else 0 for i in range(len(trainX))],num_classes=2)
    testY = to_categorical([1 if i < len(testPos) else 0 for i in range(len(testX))],num_classes=2)

    return np.array(trainX), np.array(trainY), np.array(testX), np.array(testY)
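
The labelling step relies on the positive reviews being placed first in the merged list: index `i` gets label 1 while `i` is below the number of positive reviews, and 0 afterwards. `to_categorical` then turns label 1 into `[0., 1.]` and label 0 into `[1., 0.]`. The sketch below reproduces that logic on a tiny made-up example, with a plain-Python `one_hot` function (a hypothetical stand-in, not the Keras implementation) so that it runs without Keras.

```python
def one_hot(labels, num_classes):
    """Plain-Python stand-in for to_categorical: 1.0 at the label's index, 0.0 elsewhere."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

# Toy sizes: 3 positive reviews followed by 2 negative reviews
n_pos, n_neg = 3, 2
labels = [1 if i < n_pos else 0 for i in range(n_pos + n_neg)]

print(labels)              # [1, 1, 1, 0, 0]
print(one_hot(labels, 2))  # [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
```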

In [11]:
trainX, trainY, testX, testY = loadData()

In [12]:
print("Number of training samples : ", len(trainY))
print("Number of testing samples : ", len(testY))

Number of training samples :  25000
Number of testing samples :  25000
