<a href="https://colab.research.google.com/github/cristianokunas/SentimentAnalysis_onColab/blob/main/SentimentAnalysis_LSTM_BiLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!nvidia-smi

Tue Dec  8 16:58:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Required Downloads

In [None]:
!pip install tensorflow
#!pip install tensorflow-gpu==1.14.0
!pip install --upgrade tensorflow-gpu

In [None]:
!pip install nltk
import nltk
nltk.download('stopwords')

# Etapa 1 - Importing the libraries

In [20]:
import os, re
import time
import zipfile


import numpy as np
import pandas as pd

from keras.utils import plot_model
from keras.layers import Dense, Input, Embedding, LSTM, Bidirectional
from keras.models import Sequential, Model
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

from tqdm import tqdm

import tensorflow as tf

# Etapa 2 - Connecting to Drive

In [7]:
# Conection
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Accessing files

In [None]:
# Your path on Google Drive
path = "/content/drive/My Drive/Lab/Data Science/Sentiment Analysis/IMDB/github/imdb.zip"
zip_object = zipfile.ZipFile(file=path, mode="r")
zip_object.extractall("/content/IMDB")
zip_object.close

In [21]:
# Loads the .csv data file
data = pd.read_csv('/content/IMDB/imdb.csv')
data.head()

Unnamed: 0,text,sentiment
0,Once again Mr. Costner has dragged out a movie...,neg
1,This is an example of why the majority of acti...,neg
2,"First of all I hate those moronic rappers, who...",neg
3,Not even the Beatles could write songs everyon...,neg
4,Brass pictures (movies is not a fitting word f...,neg


# Etapa 4 - Initializes variables

In [43]:
seed = 7
np.random.seed(seed)

# The model will be exported to this file
# For LSTM
filename = '/content/IMDB/model_saved_lstm.h5'
# For BiLSTM
# filename = '/content/IMDB/model/model_saved_bilstm.h5'

bilstm = False

# Number of iterations
epochs = 2

# Dimensionality of pre-trained word embedding
word_embedding_dim = 300

# Number of samples to be used in each gradient update - number of instances
batch_size = 1024

# Separates % for model testing
test_dim = 0.20

# Maximum number of words to keep in the vocabulary
max_features = 20000

# Embedding layer output dimension
embed_dim = 128

# Maximum size of sentences
max_sequence_length = 300

# Etapa 5 - Data pre-processing

In [39]:
def calcRuntime(totalTime):
  hour, rem = divmod(totalTime, 3600)
  minutes, seconds = divmod(rem, 60)
  formatTime = "{:0>2}:{:0>2}:{:05.2f}".format(int(hour),int(minutes),seconds)
  return formatTime

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

# Method for clearing strings
# Removes non-meaningful content
def clean_str(string):
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    string = re.sub(r'RT+', '', string) 
    string = re.sub(r'@\S+', '', string)  
    string = re.sub(r'http\S+', '', string)

    cleanr = re.compile('<.*?>')

    string = re.sub(r'\d+', '', string)
    string = re.sub(cleanr, '', string)
    string = re.sub("'", '', string)
    string = re.sub(r'\W+', ' ', string)
    
    string = string.replace('_', '')

    string = remove_emoji(string)

    return string.strip().lower()

# Method for preparing training and test data
# loads .csv, clears strings and removes stop_wors
# Tokenization
def prepare_data(data):
    data = data[['text', 'sentiment']]
    data['text'] = data['text'].apply(lambda x: clean_str(x))
    data['text'] = data['text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]', '', x)))

    stop_words = set(stopwords.words('portuguese'))
    text = []
    for row in data['text'].values:
        word_list = text_to_word_sequence(row)
        no_stop_words = [w for w in word_list if not w in stop_words]
        no_stop_words = " ".join(no_stop_words)
        text.append(no_stop_words)

    tokenizer = Tokenizer(num_words=max_features, split=' ')

    tokenizer.fit_on_texts(text)
    X = tokenizer.texts_to_sequences(text)

    X = pad_sequences(X, maxlen=max_sequence_length)

    word_index = tokenizer.word_index
    Y = pd.get_dummies(data['sentiment']).values

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_dim, random_state=42)
    print('Número de sentenças do conjunto de treinamento: ', len(X_train))
    print('Número de sentenças do conjunto de teste: ', len(X_test))

    return X_train, X_test, Y_train, Y_test, word_index, tokenizer

# Etapa 6b - Create model - functional API

In [41]:
# Calls method to prepare data
X_train, X_test, Y_train, Y_test, word_index, tokenizer = prepare_data(data)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

# Specifies which device to run
with tf.device("/gpu:0"):
  
  # create
  input_shape = (max_sequence_length,)
  model_input = Input(shape=input_shape, name="input", dtype='int32')
  embedding = Embedding(max_features, embed_dim, 
                input_length=max_sequence_length, name="embedding")(model_input)
  if bilstm is True:
    lstm = Bidirectional(LSTM(embed_dim, dropout=0.2, recurrent_dropout=0.2, name="lstm"))(embedding)
  else:
    lstm = LSTM(embed_dim, dropout=0.2, recurrent_dropout=0.2, name="lstm")(embedding)

  model_output = Dense(units=128, input_dim=64, kernel_initializer='uniform', activation='relu')(lstm)
  model_output = Dense(units=64, kernel_initializer='uniform', activation='relu')(model_output)
  model_output = Dense(units=2, kernel_initializer='uniform', activation='sigmoid')(model_output)
  model = Model(inputs=model_input, outputs=model_output)

  # compile
  model.compile(loss='binary_crossentropy', 
                optimizer='adam', 
                metrics=['accuracy'])
  
  print(model.summary())

Número de sentenças do conjunto de treinamento:  40000
Número de sentenças do conjunto de teste:  10000
(40000, 300) (40000, 2)
(10000, 300) (10000, 2)
Model: "functional_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 300)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 300, 128)          2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense_6 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_7 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_8 (Dense)              (None

# Etapa 7 - Training the network

Neural network training

Checks whether a trained model exists

True = load the model already trained

False = trains the network and saves the model

In [44]:
# Specifies which device to run
with tf.device("/gpu:0"):

  if os.path.exists('{}'.format(filename)):
    try:
      model.load_weights('{}'.format(filename))
      print('Successful model loading!')
    except:
      print('No such file or directory!')
  else:
    inicio = time.time()
    hist = model.fit(
        X_train,
        Y_train,
        validation_data=(X_test, Y_test),
        epochs=epochs,
        batch_size=batch_size,
        shuffle=True,
        verbose=1)
    
    fim = time.time()
    model.save_weights(filename)
  
  print(calcRuntime(fim - inicio))

Epoch 1/2
Epoch 2/2
00:01:47.66


# Etapa 8 - Evaluate model

In [45]:
# Evaluating the model
scores = model.evaluate(
    X_test, Y_test, 
    verbose=0, 
    batch_size=batch_size)

print("Accuracy: %.2f%%" % (scores[1] * 100))
print("Erro: %.2f%%" % (scores[0] * 100))

Accuracy: 89.34%
Erro: 29.63%


### Validate model with new entries

In [46]:
while True:
    print("\nType 0 to quit")
    sentence = input("input> ")

    if sentence == "0":
        break

    new_text = [sentence]
    new_text = tokenizer.texts_to_sequences(new_text)

    new_text = pad_sequences(new_text, maxlen=max_sequence_length, dtype='int32', value=0)

    sentiment = model.predict(new_text, batch_size=1, verbose=2)[0]

    print(np.argmax(sentiment))
    if (np.argmax(sentiment) == 0):
        pred_proba = "%.2f%%" % (sentiment[0] * 100)
        print("negativo => ", pred_proba)
    elif (np.argmax(sentiment) == 1):
        pred_proba = "%.2f%%" % (sentiment[1] * 100)
        print("positivo => ", pred_proba)


Type 0 to quit
input> There isn't too much in the way of suspense or surprises when it comes to the story, but there are some shocking moments and funny lines in this epic finale. Again, like many of the best Marvel films, the holes and flaws are covered up with humor and fan service, making everything okay. That being said, I did prefer Infinity War to this film, which really misses the leads of the other Marvel franchises that were "snapped" out. Overall, however, there are only a few ways you can wrap up the main story of the MCU, and this was a solid direction.
1/1 - 0s
1
positivo =>  98.39%

Type 0 to quit
input> I had no choice but to watch it to finish the sequence. The worst of all the Avengers movies. Apart from some action scenes all the rest it is pure lame dialogues and poor performances. Purely made to make money out of a "trendy" public that are rating this movie high because I've never met one single person who has read the comics and like this garbage. At least, there 

### **Review**:
>There isn't too much in the way of suspense or surprises when it comes to the story, but there are some shocking moments and funny lines in this epic finale. Again, like many of the best Marvel films, the holes and flaws are covered up with humor and fan service, making everything okay. That being said, I did prefer Infinity War to this film, which really misses the leads of the other Marvel franchises that were "snapped" out. Overall, however, there are only a few ways you can wrap up the main story of the MCU, and this was a solid direction.

>**Opinion: 8/10 stars**

---

### **Review**:
>I had no choice but to watch it to finish the sequence. The worst of all the Avengers movies. Apart from some action scenes all the rest it is pure lame dialogues and poor performances. Purely made to make money out of a "trendy" public that are rating this movie high because I've never met one single person who has read the comics and like this garbage. At least, there will be no more of this, I hope.

>**Opinion: 1/10 stars**

# Final - Unmount Drive

In [None]:
drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in Drive.')