# Neural Networks for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset that has been collected by Maas et. al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Lin- guistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1=positive and 0=negative) and contain 25,000 positive and 25,000 negative movie reviews, respectively.
For simplicity, I assembled the reviews in a single CSV file.


### Import libraries and upload all data

We need to import libraries and preprocess texts.

In [1]:
import numpy as np
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
## uncomment these lines if you have dowloaded the original file:
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)
import pandas as pd

In [2]:
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('data/shuffled_movie_data.csv')

In [3]:
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


### Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use some simple regular expression to remove HTML markup and all non-letter characters but "emoticons," convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to convert the words into their root form.

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jenazads/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:

stop = stopwords.words('english') # Common words
porter = PorterStemmer() # Getting root of words
char3=stop[:17] # Getting 1st and 2nd person pronouns
stop=stop[17:116]+stop[118:] 

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return text

Let's give it at try:

In [6]:
tokenizer('This :) is a <a> test! :-)</br>')

['test', ':)', ':)']

### Reading files

#### Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [7]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To conform that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [8]:
doc_stream=stream_docs(path='data/shuffled_movie_data.csv')
docs, y = [], []
for _ in range(50000):
    text_aux, label =next(doc_stream)
    text=tokenizer(text_aux)
    docs.append(text)
    y.append(label)
    #print('\n',tokenizer(text))

### Removing trash ad duplicates

In [9]:

def removeDuplicates(listofElements):
    # Create an empty list to store unique elements
    noDupl = []
    # Iterate over the original list and for each element
    # add it to uniqueList, if its not already there.
    for elem in listofElements:
        if elem not in noDupl:
            noDupl.append(elem)
    
    # Return the list of unique elements        
    return noDupl

### Clasifying positive and negative words

In [10]:
# Input: 
# texto_dividido: a text splitted as a List of Words
# posneg_dict: Positive and negative dictionary

# Output:
# COUNT_POSITIVE: # Of positive words according to the dictionary
# COUNT_NEGATIVE: # Of negative words according to the dictionary
def getPositiveNegativeCountWords(texto_dividido, posneg_dictionary):
        # Count the positive words
    COUNT_POSITIVE = 0
    COUNT_NEGATIVE = 0
    for word in texto_dividido:
        try:
            val = posneg_dictionary[word]
            if val == 1:
                COUNT_POSITIVE = COUNT_POSITIVE + 1
            elif val == 0:
                COUNT_NEGATIVE = COUNT_NEGATIVE + 1

        except KeyError:
            pass
    
    return (float(COUNT_POSITIVE), float(COUNT_NEGATIVE))

After we confirmed that our `stream_docs` functions works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [11]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y

In [12]:
positives=[line.strip() for line in open('data/positive-words.txt')]
positives = [porter.stem(w) for w in positives]
print(len(positives))
positives=removeDuplicates(positives)
print(len(positives))
negatives=[line.strip() for line in open('data/negative-words.txt')]
negatives = [porter.stem(w) for w in negatives]
print(len(negatives))
negatives=removeDuplicates(negatives)
print(len(negatives))
pron12=stop[:17]

2041
1417
4818
3186


### Upload data set

list all words in data set from positive.csv and negative.csv, then will be compare with words in opinion english

In [13]:
P2=100 # Number of features

#We separate common words
positive2 = pd.read_csv('data/positive.csv', index_col=0)
positive2_i = positive2.index.values[:1000]
stop2 = stopwords.words('english')
positive2_i=set(positive2_i).difference(stop2)


negative2 = pd.read_csv('data/negative.csv', index_col=0)
negative2_i = negative2.index.values[:1000]
negative2_i=set(negative2_i).difference(stop2)

# We proced to delete repeated words.
positive2=set(positive2_i).difference(negative2_i)
negative2=set(negative2_i).difference(positive2_i)
positive2=list(positive2)[:P2]
negative2=list(negative2)[:P2]

positive2 = [porter.stem(w) for w in positive2]
negative2 = [porter.stem(w) for w in negative2]

print(positive2)
print(negative2)

['view', 'move', 'greatest', 'paul', 'brought', 'portray', 'marri', 'surprisingli', 'complex', 'joe', 'romant', 'anim', 'see', 'recent', 'stage', '8', 'york', 'disney', 'pace', 'awesom', 'adventur', 'dream', 'battl', 'die', 'memor', 'secret', 'atmospher', 'agre', 'manag', 'societi', 'stun', 'danc', 'subtl', 'recommend', 'centuri', 'romanc', 'tell', 'touch', 'plenti', 'power', 'languag', 'keep', 'fantast', 'popular', 'deep', 'truth', '7', 'follow', 'deserv', 'move', 'mark', 'geniu', 'danc', 'emot', 'impress', 'busi', 'today', 'creat', 'portray', 'it!', 'beauti', 'support', 'unlik', 'charm', 'america', 'terrif', 'talent', 'journey', 'cultur', 'uniqu', 'season', 'cold', 'thank', '9', 'earlier', 'match', 'western', 'bill', 'offic', 'person', 'edg', 'present', 'perfectli', 'older', 'favorit', 'tale', 'effect', 'situat', 'lead', 'bring', 'appreci', 'rich', 'famou', 'outsid', 'master', 'begin', 'masterpiec', 'brilliant', 'italian', 'variou']
['lee', 'produc', 'incred', 'laugh', 'biggest', 'bo

Next, we will make use of the "hashing trick" through scikit-learns [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at  [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [14]:
N=50000

X=np.zeros((N,6))
idx=0
for com in docs:
    if idx%1000==0:
        print(idx)
    X[idx,5]+=np.log(len(com)) #len: char 6
    for word in com:
        X[idx,3]+=char3.count(word) #pronoun: char 4
        if word=='!':
            X[idx,4]=1 #! simbol: char 5
        if (word=='no' or word=='not'):
            X[idx,2]=1 #! simbol: char 5
        X[idx,0]+=positives.count(word) #positive words : char 1
        X[idx,1]+=negatives.count(word) #negative words : char 2
    idx+=1

X2=np.zeros((N,2*P2))
idx=0
for com in docs:
    #if com.count('terrific'):
        #print('foundddddd')
    #if idx%1000==0:
    #    print(idx)
    idx2=0
    for word in positive2:
        X2[idx,idx2]=com.count(word)
        idx2+=1
    #print(idx2)
    for word in negative2:
        X2[idx,idx2]=com.count(word)
        idx2+=1
    idx+=1
print(X.shape)

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
(50000, 6)


### Combining all data

We need to combine all data to train.

In [15]:
X_data=np.concatenate((X, X2), axis=1)

### Split Data Set

In [16]:
y=np.asarray(y)
y=y.reshape(y.shape[0],1)
x_train=X_data[:40000]
x_valid=X_data[40000:45000]
x_test=X_data[45000:50000]
y_train=y[:40000]
y_valid=y[40000:45000]
y_test=y[45000:50000]

## Sigmoid function

In [17]:
def sigmoid(x):
    return 1 / (1 + np.e ** -x)

def sigmoid_derivate(z):
    s = sigmoid(z) *(1 - sigmoid(z))
    return s

### Training data

In [20]:
class NeuralNetwork:
    def __init__(self, x, y):
        self.input      = x
        self.weights1   = np.random.rand(self.input.shape[1],4) 
        self.weights2   = np.random.rand(4,1)                 
        self.y          = y
        self.output     = np.zeros(self.y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivate(self.output)))
        d_weights1 = np.dot(self.input.T,  (np.dot(2*(self.y - self.output) * sigmoid_derivate(self.output), self.weights2.T) * sigmoid_derivate(self.layer1)))

        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2

In [28]:
alfa=0.001
reg=0.002
epochs=10000
W=np.random.rand(1,x_train.shape[1])/x_train.shape[1]
nn = NeuralNetwork(x_train, y_train)
prec=0

for epoch in range(epochs):
    nn.feedforward()
    nn.backprop()
    if epoch%100==0:
        y_pred=nn.output
        precision=100*(1-sum(abs(y_pred-y_train))/y_pred.shape[0])
        print('--------------')
        print('Epoca: ',epoch)
        print('Precision: ',precision)
        if prec<precision:
            prec=precision


--------------
Epoca:  0
Precision:  [50.01279178]


  


--------------
Epoca:  100
Precision:  [48.5175]
--------------
Epoca:  200
Precision:  [51.6375]
--------------
Epoca:  300
Precision:  [51.62875]
--------------
Epoca:  400
Precision:  [51.64625]
--------------
Epoca:  500
Precision:  [51.64375]
--------------
Epoca:  600
Precision:  [48.3475]
--------------
Epoca:  700
Precision:  [48.33125027]
--------------
Epoca:  800
Precision:  [51.655]
--------------
Epoca:  900
Precision:  [51.65]
--------------
Epoca:  1000
Precision:  [51.6475]
--------------
Epoca:  1100
Precision:  [51.6375]
--------------
Epoca:  1200
Precision:  [51.64125]
--------------
Epoca:  1300
Precision:  [51.64625]
--------------
Epoca:  1400
Precision:  [51.64125]
--------------
Epoca:  1500
Precision:  [48.34375]
--------------
Epoca:  1600
Precision:  [51.66125]
--------------
Epoca:  1700
Precision:  [51.6625]
--------------
Epoca:  1800
Precision:  [51.6625]
--------------
Epoca:  1900
Precision:  [48.33125]
--------------
Epoca:  2000
Precision:  [48.3325]

### Testing data

In [31]:
err=np.transpose(y_train)-sigmoid(np.matmul(W,np.transpose(x_train)))
err

array([[ 0.47507968, -0.53871841, -0.53491084, ...,  0.48429609,
         0.48527271, -0.51130759]])

### Calculate predictions

In [32]:
print('Accuracy: %.3f' % precision)

Accuracy: 51.676


<br>
<br>