
Sentiment analysis is the process of detecting positive or negative sentiment in text. It’s often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.



Since customers express their thoughts and feelings more openly than ever before, sentiment analysis is becoming an essential tool to monitor and understand that sentiment. Automatically analyzing customer feedback, such as opinions in survey responses and social media conversations, allows brands to learn what makes customers happy or frustrated, so that they can tailor products and services to meet their customers’ needs.

Sentiment analysis matters because it lets businesses gauge the overall opinion of their customers quickly. By automatically classifying the sentiment behind reviews, social media conversations, and other feedback, you can make decisions faster and on better evidence.



To explore sentiment analysis in practice, this project uses the IMDB movie reviews dataset to train a model that classifies reviews as positive or negative.

In [1]:
import pandas as pd
import numpy as np

## Loading Data

This dataset contains 2 columns:

1. the first column is the list of movie reviews
2. the second column is the list of sentiments (positive and negative)

It is split equally: 25k positive reviews and 25k negative reviews.

In [2]:
data = pd.read_csv('IMDB_Data.csv')

In [3]:
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


## Data Cleaning and Pre-processing

Before feeding the data to the model, I first handle any missing data and remove unnecessary formatting that might disturb the results.

#### 1. Missing Data

In [4]:
data.isna().sum()

review       0
sentiment    0
dtype: int64

There are no missing values in our dataset.

#### 2.  Removing unnecessary formatting

Stop words are very common words (such as "the", "a", "is") that carry little sentiment on their own. Search engines are typically programmed to ignore them, because they take up space in the database and valuable processing time. We therefore remove them, using a stored list of words that we treat as stop words.

NLTK (Natural Language Toolkit) in Python provides stop word lists for 16 or more languages, depending on the installed version.

In [5]:
from nltk.corpus import stopwords   # to get collection of stopwords
english_stops = set(stopwords.words('english'))

In [6]:
def clean_data():
    data = pd.read_csv('IMDB_Data.csv')
    x_data = data['review']
    y_data = data['sentiment']
    
    # reviews pre-processing
    
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # removing html tags
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # removing non alphabet values
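    # removing stop words; note that this runs before lowercasing, so capitalized
    # stop words such as 'I' or 'The' are not in english_stops and are kept
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case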

    # sentiments pre-processing: encoding positive to 1 and negative to 0
    
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)
    
    return x_data, y_data

In [7]:
x_data, y_data = clean_data()

In [8]:
print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64
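
The review output above still contains tokens such as `i`, `a` and `the`. NLTK's English stop words are all lowercase, and `clean_data` filters before it lowercases, so capitalized stop words pass the filter. The sketch below compares the two orders on one sentence. It makes some assumptions: `SAMPLE_STOPS` is a small hand-written stand-in for `english_stops`, and `tokenize_review` is a hypothetical helper that is not part of the notebook.

```python
import re

# Hand-written stand-in for NLTK's English stop word list (assumption: the
# real list is larger, but it is likewise all lowercase).
SAMPLE_STOPS = {"i", "a", "the", "this", "was", "to"}

def tokenize_review(text, stop_words, lowercase_first=True):
    """Strip HTML tags and non-letters, then drop stop words."""
    text = re.sub(r"<.*?>", "", text)          # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)     # replace non-letters with spaces
    words = text.split()
    if lowercase_first:
        # Alternative order: lowercase first, then filter.
        words = [w.lower() for w in words]
        return [w for w in words if w not in stop_words]
    # Order used in clean_data: filter first, lowercase afterwards.
    words = [w for w in words if w not in stop_words]
    return [w.lower() for w in words]

sample = "I thought this was a wonderful way to spend time.<br />The end"
print(tokenize_review(sample, SAMPLE_STOPS, lowercase_first=False))
print(tokenize_review(sample, SAMPLE_STOPS, lowercase_first=True))
```

With the filter-first order, `i` and `the` survive. With the lowercase-first order they are removed. Switching the order in `clean_data` would change the vocabulary, so the encoded sequences and results shown in this notebook would no longer match exactly.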


#### 3. Splitting the Dataset

In [9]:
from sklearn.model_selection import train_test_split       # for splitting the dataset

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)
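
Because the split above is made without a fixed seed, every run produces a different train/test partition, and therefore different word indices and slightly different scores. The sketch below shows the idea of a seeded 80/20 split in plain Python. It is an illustration only, not the scikit-learn implementation, and `seeded_split`, `reviews` and `sentiments` are hypothetical names.

```python
import random

def seeded_split(items, labels, test_size=0.2, seed=42):
    """Shuffle the indices with a fixed seed; the first test_size share becomes the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

reviews = ["review %d" % i for i in range(10)]
sentiments = [i % 2 for i in range(10)]

first = seeded_split(reviews, sentiments)
second = seeded_split(reviews, sentiments)
print(len(first[0]), len(first[1]))   # 8 training items, 2 test items
print(first == second)                # same seed, same split
```

In the notebook itself, `train_test_split` accepts a `random_state` argument that serves the same purpose.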

#### 4. Vectorization

To feed text data into a neural network, we need to transform it into numeric data. This step is called vectorization.

I used tensorflow.keras.preprocessing.text.Tokenizer to encode the reviews as integers. The fit_on_texts method builds the word index from x_train only, assigning each unique word an integer; the texts_to_sequences method then converts both x_train and x_test into sequences of those integers. Words in x_test that never occur in x_train have no index and are dropped.

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to integers
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to apply padding  
from tensorflow.keras.models import Sequential     #  model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # saving the model
from tensorflow.keras.models import load_model   # loading the saved model
import re

In [12]:
token = Tokenizer(lower=False)   
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)
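
To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`. It is an illustration, not the Keras implementation. It assumes that words are ranked by frequency, that indices start at 1 (0 stays free for padding), and that words unseen during fitting are dropped because no out-of-vocabulary token is configured. `build_word_index` and `encode` are hypothetical names.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(w for text in tokenized_texts for w in text)
    # most_common sorts by count descending; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokenized_texts, word_index):
    """Replace words by their index and silently skip unknown words."""
    return [[word_index[w] for w in text if w in word_index]
            for text in tokenized_texts]

train = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
test = [["good", "unseenword", "plot"]]

word_index = build_word_index(train)
print(word_index)
print(encode(test, word_index))
```

The word `unseenword` does not appear in the training texts, so the encoded test review has only two entries. The same effect explains why an input made of unknown words shrinks to a very short sequence before padding.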

In [13]:
x_train


[[23853, 2763, 1172, 127, 4, 2, 960, 4432, 13, 379, 841, 961, 44, 305,
  2, 2380, 63, 5, 36, 3076, 702, 3864, 1, 40, 4, 1, 17131, 39858],
 [8, 3, 106, 5, 404, 2548, 35, 1078, 153, 24, 1085, 309, 237, 4,
  215, 190, 2196, 146, 7782, 1828, 35, 56, 478, 33],
 [378, 492, 2071, 911, 675, 402, 228, 455, 619, 4, 3999, 630, 148, 57019,
  980, 8299, 2100, 2124, 784, 16202, 155, 664, 2943, 233, 5813, 390, 2381, 1532,
  642, 87, 25020, 1112, 148, 9630, 4292, 44, 98, 2072, 52, 547, 297, 1,
  480, 2229, 39, 7000, 1548, 184, 106, 35719, 1062, 705, 128, 402, 228, 277,
  1462, 232, 2196],
 [2, 52, 358, 1909, 72, 13160, 7952, 3029, 4673, 69, 3355, 25021, 22783, 9134,
  986, 6173, 912, 2, 229, 4, 5505, 7405, 547, 1220, 1363, 147, 3049, 23854,
  2, 20,
  ... (output truncated)


#### 5. Padding and truncation


Since the reviews have different lengths, we need to bring them to a common length by padding short reviews (appending 0s) and truncating long ones, using tensorflow.keras.preprocessing.sequence.pad_sequences. In this case the common length, max_length, is the mean length of all reviews in x_train, rounded up.

In [14]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

In [15]:
max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')
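
As a rough illustration of what `padding='post'` and `truncating='post'` do to a single review, the sketch below pads or clips a plain Python list. `pad_or_truncate` is a hypothetical helper written for this example, not the Keras function, and the short `maxlen` of 5 is chosen only to keep the output readable.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Cut the sequence off after maxlen items, then fill the end with pad_value."""
    clipped = sequence[:maxlen]
    return clipped + [pad_value] * (maxlen - len(clipped))

short_review = [8, 3, 106]
long_review = [378, 492, 2071, 911, 675, 402]

print(pad_or_truncate(short_review, maxlen=5))
print(pad_or_truncate(long_review, maxlen=5))
```

The short review gains trailing zeros, while the long review loses its last word. Since `max_length` is the mean review length here, every review longer than average loses its tail in the same way.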


In [16]:
print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[23853  2763  1172 ...     0     0     0]
 [    8     3   106 ...     0     0     0]
 [  378   492  2071 ...     0     0     0]
 ...
 [ 1198     1  1534 ...     0     0     0]
 [  217 20810 13237 ...     0     0     0]
 [    8     4  2039 ...   148 11131  3026]] 

Encoded X Test
 [[  50   61  566 ...    0    0    0]
 [   8    3  372 ...    0    0    0]
 [3670 1575   82 ...  148    9  362]
 ...
 [ 227   87   13 ...    0    0    0]
 [1590    4  308 ...  998   10    1]
 [  11   68 2010 ...    2  981    0]] 

Maximum review length:  130


### Building the model

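Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to retain information across long sequences, which makes it a common choice for text. The model here has three layers: an Embedding layer that maps each word index to a 32-dimensional vector, an LSTM layer with 64 units, and a Dense layer with a single sigmoid unit that outputs the probability of a positive review.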

In [17]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64
total_words = len(token.word_index) + 1   # add 1 because of 0 padding

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 130, 32)           2954528   
_________________________________________________________________
lstm (LSTM)                  (None, 64)                24832     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 2,979,425
Trainable params: 2,979,425
Non-trainable params: 0
_________________________________________________________________
None
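
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic under one assumption: the vocabulary size is not printed in the notebook, so it is derived here from the embedding row of the summary (2,954,528 weights divided by 32 dimensions).

```python
EMBED_DIM = 32
LSTM_OUT = 64

# Vocabulary size implied by the summary (assumption: derived, not printed)
total_words = 2954528 // EMBED_DIM

# One EMBED_DIM-sized vector per word index
embedding_params = total_words * EMBED_DIM

# An LSTM layer holds four weight blocks (three gates plus the cell candidate);
# each has input weights, recurrent weights and a bias
lstm_params = 4 * ((EMBED_DIM + LSTM_OUT) * LSTM_OUT + LSTM_OUT)

# One weight per LSTM output plus a single bias
dense_params = LSTM_OUT * 1 + 1

print(total_words, embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```

The three values match the `Param #` column, and their sum matches the reported total of 2,979,425. Almost all of the parameters sit in the embedding layer.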


### Training

For training, we only need to fit the model on x_train and y_train. I use mini-batch learning with a batch_size of 128 for 5 epochs.

I also added a ModelCheckpoint callback that saves the model locally after an epoch only if its training accuracy improved over the best previous epoch.

In [18]:
checkpoint = ModelCheckpoint(
    'LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [19]:
model.fit(x_train, y_train, batch_size = 128, epochs = 5, callbacks=[checkpoint])


Epoch 1/5
Epoch 00001: accuracy improved from -inf to 0.72328, saving model to LSTM.h5
Epoch 2/5
Epoch 00002: accuracy improved from 0.72328 to 0.91720, saving model to LSTM.h5
Epoch 3/5
Epoch 00003: accuracy improved from 0.91720 to 0.95940, saving model to LSTM.h5
Epoch 4/5
Epoch 00004: accuracy improved from 0.95940 to 0.97822, saving model to LSTM.h5
Epoch 5/5
Epoch 00005: accuracy improved from 0.97822 to 0.98585, saving model to LSTM.h5
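
The log shows a save after every epoch, which follows from the "save best only" rule applied to a metric that keeps rising. The sketch below mimics that rule in plain Python. It is an illustration of the logic, not the Keras callback, and `epochs_that_save` is a hypothetical helper. The second list of values is made up to show an epoch that does not trigger a save.

```python
def epochs_that_save(metric_per_epoch):
    """Return the 1-based epochs at which a 'save best only' rule would save."""
    best = float("-inf")
    saved = []
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:          # higher is better for accuracy
            best = value
            saved.append(epoch)
    return saved

# Accuracies taken from the training log above
logged_accuracy = [0.72328, 0.91720, 0.95940, 0.97822, 0.98585]
print(epochs_that_save(logged_accuracy))

# Made-up values where the third epoch is worse than the second
print(epochs_that_save([0.80, 0.85, 0.84, 0.86]))
```

Because the monitored metric is the training accuracy, it tends to improve every epoch, so the checkpoint here simply ends up holding the final model. Monitoring a validation metric such as `val_accuracy` instead would require passing validation data to `fit`.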


<tensorflow.python.keras.callbacks.History at 0x7f7022489040>

### Testing

To evaluate the model, we predict the sentiment of the x_test data and compare the predictions with y_test (the expected output). The accuracy is the number of correct predictions divided by the total number of test samples.

In [20]:
# predict_classes is deprecated and was removed in later TensorFlow releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
y_pred = (model.predict(x_test, batch_size = 128) > 0.5).astype('int32').flatten()

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 8700
Wrong Prediction: 1300
Accuracy: 87.0
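
Accuracy alone does not show whether the errors fall mostly on positive or on negative reviews. The sketch below counts the four possible outcomes for a binary classifier in plain Python. The labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not results from this model, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }

# Made-up labels, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true_demo) * 100
print(counts)
print("Accuracy: {}".format(accuracy))
```

Applied to the notebook's `y_test` and `y_pred`, the same counting would split the 1300 wrong predictions into false positives and false negatives.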


### Load Saved Model

Load the saved model and use it to predict the sentiment (positive or negative) of a movie review.

In [21]:
loaded_model = load_model('LSTM.h5')

In [22]:
review = str(input('Movie Review: '))

Movie Review: cvasv fvavc avc afarf mlk


In [23]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered = [w for w in words if w not in english_stops]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  cvasv fvavc avc afarf mlk
Filtered:  ['cvasv fvavc avc afarf mlk']


In [24]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[39466     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0]]


In [25]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.9934634]]


In [26]:
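# result has shape (1, 1); note that this 0.7 cutoff is stricter than the 0.5
# threshold used when evaluating on the test set
if result[0][0] >= 0.7: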
    print('positive')
else:
    print('negative')

positive
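
The choice of cutoff decides how borderline probabilities are labelled. The sketch below compares a 0.5 and a 0.7 cutoff on three probabilities. `label_from_probability` is a hypothetical helper. The first value is the prediction printed above, and the other two are made up for illustration.

```python
def label_from_probability(probability, threshold=0.5):
    """Turn a sigmoid output into a sentiment label."""
    return "positive" if probability >= threshold else "negative"

for p in (0.9934634, 0.62, 0.31):
    print(p, label_from_probability(p), label_from_probability(p, threshold=0.7))
```

Only the middle value changes its label between the two cutoffs. The example input above consists of words the model has essentially never seen, yet it scores 0.99, so a high probability should not be read as strong evidence on inputs far from the training data.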
