<a href="https://colab.research.google.com/github/danilogb/TAAC/blob/master/TAAC_assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://sigarra.up.pt/feup/pt/WEB_GESSI_DOCS.download_file?p_name=F-370784536/logo_cores_oficiais.jpg" width="200" align="right"></img>

### **FEUP MECD 2022/2023**
### **TAAC - Advanced Topics on Machine Learning**

##<center><b> Assignment 2 </b></center>

Group: Danilo Brandão, Heitor Lira & Wagner Ceulin

---

##1. Recurrent Neural Networks

<font color="#a1bf3f">

a) In non-recurrent neural networks (MLPs, CNNs, etc.) it is a common practice to use activation functions with unbounded output (ReLU, leaky-ReLU, GELU, etc.). However, in RNN cells, the sigmoid or hyperbolic tangent are used instead. Can you explain why?</font>


Because an RNN applies the same recurrent weights at every time step, the hidden state is repeatedly multiplied by the same weights as it flows through the sequence. With an unbounded activation such as ReLU, some values in the hidden state can grow without limit over many steps, which is a major problem for learning. Squashing activation functions are therefore more appropriate, as they keep the values bounded: in the (-1, 1) interval for the hyperbolic tangent and in (0, 1) for the sigmoid.
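The effect can be illustrated with a toy scalar recurrence. This is only a minimal sketch: the weight `w`, the input `x`, and the number of steps are hypothetical values chosen for illustration, and a real RNN cell works with vectors and matrices.

```python
import math

def run_recurrence(activation, w=1.5, x=1.0, steps=50):
    """Iterate h_t = activation(w * h_{t-1} + x) and return the final h."""
    h = 0.0
    for _ in range(steps):
        h = activation(w * h + x)  # the same weight w is reused at every step
    return h

def relu(z):
    return max(0.0, z)

h_relu = run_recurrence(relu)       # unbounded activation: the state keeps growing
h_tanh = run_recurrence(math.tanh)  # squashing activation: the state stays in (-1, 1)

print(f"ReLU hidden state after 50 steps: {h_relu:.3e}")
print(f"tanh hidden state after 50 steps: {h_tanh:.3f}")
```

With a recurrent weight larger than 1, the ReLU state grows geometrically, while the tanh state cannot leave its bounded range.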



<font color="#a1bf3f">b) Suppose you have a sequence classification problem, i.e. you want to classify sequences into one of C given classes. Describe the recurrent architecture you would implement for this problem.</font>

This is a multiclass classification problem when C >= 3 (and a binary one when C = 2).
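One common choice for this setting is a many-to-one recurrent network: the sequence is read step by step, and only the last hidden state is passed to an output layer with C units and a softmax. The sketch below is a plain-Python illustration of that idea under stated assumptions: the sizes, the random weights, and the helper names (`rnn_classify`, `random_matrix`) are hypothetical, and biases are omitted for brevity.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_classify(sequence, W_xh, W_hh, W_hy):
    """Many-to-one RNN: update the hidden state at each step, classify from the last one."""
    hidden_size = len(W_hh)
    h = [0.0] * hidden_size
    for x in sequence:
        # h_t = tanh(W_xh x_t + W_hh h_{t-1}), same weights at every step
        h = [
            math.tanh(
                sum(W_xh[i][j] * x[j] for j in range(len(x)))
                + sum(W_hh[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    # Output layer: one logit per class, computed from the final hidden state only
    logits = [sum(row[k] * h[k] for k in range(hidden_size)) for row in W_hy]
    return softmax(logits)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_classes, input_size, hidden_size = 4, 3, 5  # hypothetical sizes
W_xh = random_matrix(hidden_size, input_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
W_hy = random_matrix(num_classes, hidden_size, rng)

sequence = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(7)]
probs = rnn_classify(sequence, W_xh, W_hh, W_hy)
predicted_class = max(range(num_classes), key=lambda c: probs[c])
print(probs, predicted_class)
```

In practice the plain recurrent cell would typically be replaced by an LSTM or GRU, and the network would be trained with a categorical cross-entropy loss.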


---
##2. Natural Language Processing

<font color="#a1bf3f">a) Explain the advantages of using sub-word tokenization (e.g. BPE) vs. char-level tokenization.</font>

Tokenization at the character level is quite simple and rarely leads to unknown or OOV tokens. However, because the vocabulary is limited to the characters of a given language, it is very small, and sentences turn into very long sequences, since each character of each word becomes a token. Individual characters are also not very meaningful on their own when compared with the tokens produced by other tokenization methods.

On the other hand, a subword tokenization method such as BPE results in a richer vocabulary: frequent words are kept whole and rare words are split into smaller pieces, so sequences are much shorter than at the character level. This method can also represent words that never appear in the training corpus by composing them from known subwords. Finally, it lets the user define the number of merge operations to perform (and therefore the vocabulary size), which means more control over the tokenization process and customization for different processing tasks.
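The merge procedure can be sketched in a few lines of plain Python. This is a simplified toy version, not the implementation used by any library: it has no end-of-word marker, and the word list and function names (`learn_bpe`, `merge_pair`, `segment`) are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Learn a list of merges by repeatedly joining the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # every word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def segment(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus; "lowest" does not occur in it
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=4)
print(merges)
print(segment("lowest", merges))
```

The unseen word is still tokenized without any unknown token, and in fewer pieces than the six tokens a character-level tokenizer would produce.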

<font color="#a1bf3f"> b) Consider the vocabulary V = {′a′,′ b′,′ c′,′ d′} and the document T = “cbabaa”, which is part of a corpus C containing 1000 documents. The number of documents in C that contain ′a′, ′b′, ′c′, and ′d′ is, respectively, 800, 350, 20, and 130. Find the TF-IDF representation of T. (Use the natural logarithm on your computations and present the results with at least three decimal places.)</font>

In [None]:
from math import log

In [None]:
V = {'a':0,'b':0,'c':0,'d':0}
T = "cbabaa"
C = 1000

In [None]:
for term in T:
  V[term] += 1  # counting how many times each term appears in doc T

tf = dict([(k,v/len(T)) for k,v in V.items()])
tf  # term frequency dictionary

{'a': 0.5, 'b': 0.3333333333333333, 'c': 0.16666666666666666, 'd': 0.0}

In [None]:
doc_freq = {'a':800,'b':350,'c':20,'d':130}

In [None]:
idf = {k: log(C / v) for k, v in doc_freq.items()}  # idf(t) = ln(N / df(t))
{k: round(v, 4) for k, v in idf.items()}  # inverse doc frequency dictionary (rounded for display)

{'a': 0.2231, 'b': 1.0498, 'c': 3.912, 'd': 2.0402}

In [None]:
print('tf-idf representation for document T:\n')
for term in V:
  print(f'tf-idf({term}, T) = {tf[term] * idf[term]:.3f}')

tf-idf representation for document T:

tf-idf(a, T) = 0.112
tf-idf(b, T) = 0.350
tf-idf(c, T) = 0.652
tf-idf(d, T) = 0.000


---
##3. Hands-on: Sentiment Analysis

<font color="#a1bf3f"> Now, you will use the IMDB reviews dataset for sentiment analysis, which you can find attached to this assignment. This dataset comprises 50k movie reviews labeled as either positive or negative.
Your task is to design and train an LSTM-based (or GRU-based) model to classify each review as positive or negative. In addition to developing the code, you should also:
a) Mention the type of tokenization you have chosen.
b) Explain how you have dealt with variable-length sequences for mini-batching.
c) Describe the architecture of your model, explaining the role of each layer in the model.
d) Report the accuracy of your model in the test set.</font>

In [None]:
!pip install tokenizers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# from google.colab import files
# uploaded = files.upload()

In [None]:
file_path = '/content/drive/MyDrive/TAAC/Assignment2/imdb_train.csv'  # Gdrive Danilo
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# change sentiment to binary
# (map avoids chained assignment, which may not modify df in newer pandas versions)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [None]:
# Save reviews in txt format to train tokenizer later
filename = '/content/drive/MyDrive/TAAC/Assignment2/imdb_train.txt'
with open(filename, 'w') as f:  # 'w' so that re-running this cell does not append duplicate reviews
    reviews_txt = df['review'].to_string(header=False, index=False)
    f.write(reviews_txt)

In [None]:
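# Public import paths (the private keras.layers.core / keras.utils.data_utils modules vary across versions)
from tensorflow.keras.layers import Activation, Dense, Dropout, SpatialDropout1D
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import pad_sequences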
from sklearn.model_selection import train_test_split
import tensorflow as tf
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

To process the text data from the reviews, a Byte-Pair Encoding (BPE) tokenizer was chosen for its ability to handle OOV words and to adapt to the corpus it is being trained on. Subword tokenizers of this kind are also widely used in current language models.

In [None]:
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet())

tokenizer.train([filename], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

In [None]:
# Load trained tokenizer
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

In [None]:
# Adding token data into the dataframe
df['encodings'] = df['review'].apply(lambda x:tokenizer.encode(x))
df['tokens'] = [encoding.tokens for encoding in df['encodings']]
df['token_ids'] = [encoding.ids for encoding in df['encodings']]

In [None]:
# Find the longest sequence of tokens
maxlen = max(len(sequence) for sequence in df['token_ids'])
maxlen

3536

In [None]:
# Find the number of unique tokens
token_freqs = collections.Counter()
for sequence in df['token_ids']:
  for token_id in sequence:
    token_freqs[token_id] += 1
print("Vocabulary size:", len(token_freqs))

Vocabulary size: 19252


In [None]:
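# HYPERPARAMETERS
MAX_FEATURES = 2000        # Embedding input_dim (rows in the embedding table); note the tokenizer was trained with vocab_size=20000
MAX_SENTENCE_LENGTH = 90   # tokens kept per review after padding/truncation
EMBEDDING_SIZE = 128       # dimension of each token embedding
HIDDEN_LAYER_SIZE = 64     # number of LSTM units
BATCH_SIZE = 32
NUM_EPOCHS = 10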

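Since the sequences have variable lengths, every sequence was brought to the same length of `MAX_SENTENCE_LENGTH` (90) tokens so that the arrays have the same dimensions and can be grouped into mini-batches. Shorter sequences are padded with zeros at the end (`padding='post'`) and longer ones are truncated. With the default `truncating='pre'` of `pad_sequences`, only the last 90 tokens of a long review are kept.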

In [None]:
train_padded = pad_sequences(df['token_ids'], 
                             padding='post', # add paddings at the end
                             maxlen=MAX_SENTENCE_LENGTH)
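The behaviour of this call can be mimicked in plain Python to make the padding and truncation rules explicit. This is only a sketch: `pad_or_truncate` is a hypothetical helper, the example batch is made up, and `maxlen` is assumed to be positive.

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Keep the last `maxlen` ids of a long sequence; right-pad a short one with `pad_id`."""
    ids = list(ids)[-maxlen:]                    # 'pre' truncation: drop tokens from the start
    return ids + [pad_id] * (maxlen - len(ids))  # 'post' padding: fill at the end

# Hypothetical batch of token-id sequences with different lengths
batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6], [11]]
padded = [pad_or_truncate(seq, maxlen=4) for seq in batch]

for row in padded:
    print(row)
```

Every row now has the same length, so the batch can be stacked into a single rectangular array.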

In [None]:
# Transform pd.Series into array to match train_padded type
train_sent = np.asarray(df['sentiment']).astype('float32').reshape((-1,1))

In [None]:
# Defining the model
model = Sequential()
model.add(Embedding(MAX_FEATURES, EMBEDDING_SIZE, input_length=MAX_SENTENCE_LENGTH))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

print(model.summary())



Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 90, 128)           256000    
                                                                 
 spatial_dropout1d_5 (Spatia  (None, 90, 128)          0         
 lDropout1D)                                                     
                                                                 
 lstm_5 (LSTM)               (None, 64)                49408     
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 305,473
Trainable params: 305,473
Non-trainable params: 0
_________________________________________________________________
None


#### Explaining the model
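First, a batch of token ids is fed into an embedding layer so the information can be densely represented: each id is mapped to a 128-dimensional vector, so an input of shape (batch, 90) becomes (batch, 90, 128). The embedding weights are initialized randomly and learned during the training process. A SpatialDropout1D layer then randomly drops entire embedding channels during training, as a form of regularization. The result is passed to an LSTM, which reads the sequence step by step and outputs its final hidden state, a vector sized according to the specified number of hidden units (64). This vector is fed into a dense layer with a single unit, whose output is activated by a sigmoid function, resulting in a value between 0 and 1 that can be read as the probability of a positive review: values close to 0 indicate a negative review and values close to 1 a positive one.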
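The parameter counts in the model summary can be checked by hand from the layer sizes. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias); the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per token id
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

counts = {
    "embedding": embedding_params(2000, 128),
    "lstm": lstm_params(128, 64),
    "dense": dense_params(64, 1),
}
total = sum(counts.values())
print(counts, total)
```

SpatialDropout1D has no trainable parameters, which is why it contributes 0 to the total.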

In [None]:
# Just checking if we are running on GPU
tf.test.gpu_device_name()

'/device:GPU:0'

In [None]:
x_tensor = tf.convert_to_tensor(train_padded, dtype=tf.int64) 
y_tensor = tf.convert_to_tensor(train_sent, dtype=tf.int64)

In [None]:
x_tensor.shape, y_tensor.shape

(TensorShape([45000, 90]), TensorShape([45000, 1]))

In [None]:
# Training the model
model.fit(x = x_tensor, 
          y = y_tensor, 
          epochs = NUM_EPOCHS, 
          batch_size = BATCH_SIZE,
          verbose = 2)

Epoch 1/10
1407/1407 - 481s - loss: 0.5045 - accuracy: 0.7489 - 481s/epoch - 342ms/step
Epoch 2/10
1407/1407 - 475s - loss: 0.4319 - accuracy: 0.8007 - 475s/epoch - 337ms/step
Epoch 3/10
1407/1407 - 471s - loss: 0.4024 - accuracy: 0.8145 - 471s/epoch - 335ms/step
Epoch 4/10
1407/1407 - 468s - loss: 0.3786 - accuracy: 0.8268 - 468s/epoch - 333ms/step
Epoch 5/10
1407/1407 - 463s - loss: 0.3629 - accuracy: 0.8368 - 463s/epoch - 329ms/step
Epoch 6/10
1407/1407 - 471s - loss: 0.3512 - accuracy: 0.8424 - 471s/epoch - 335ms/step
Epoch 7/10
1407/1407 - 462s - loss: 0.3352 - accuracy: 0.8519 - 462s/epoch - 328ms/step
Epoch 8/10
1407/1407 - 464s - loss: 0.3239 - accuracy: 0.8585 - 464s/epoch - 330ms/step
Epoch 9/10
1407/1407 - 461s - loss: 0.3154 - accuracy: 0.8620 - 461s/epoch - 328ms/step
Epoch 10/10
1407/1407 - 459s - loss: 0.3087 - accuracy: 0.8639 - 459s/epoch - 326ms/step


<keras.callbacks.History at 0x7fdd80d62050>

#### Testing the model

In [None]:
test_file_path = '/content/drive/MyDrive/TAAC/Assignment2/imdb_test.csv'  # Gdrive Danilo
df_test = pd.read_csv(test_file_path, header=None)
df_test.columns = ['review', 'sentiment']

In [None]:
# change sentiment to binary
# (map avoids chained assignment, which may not modify df_test in newer pandas versions)
df_test['sentiment'] = df_test['sentiment'].map({'positive': 1, 'negative': 0})
df_test.head()

Unnamed: 0,review,sentiment
0,"I saw the film many times, and every time I am...",0
1,I loved KOLCHAK: THE NIGHT STALKER since I saw...,1
2,This feels as if it is a Czech version of Pear...,1
3,"When, oh, when will someone like Anchor Bay or...",1
4,"""Just before dawn "" is one of the best slasher...",1


In [None]:
df_test['encodings'] = df_test['review'].map(lambda x: tokenizer.encode(x))
df_test['token_ids'] = df_test['encodings'].map(lambda x: x.ids)

In [None]:
test_padded = pad_sequences(df_test['token_ids'], 
                            padding='post', 
                            maxlen=MAX_SENTENCE_LENGTH)
test_sent = np.asarray(df_test['sentiment']).astype('float32').reshape((-1,1))

In [None]:
x_test_tensor = tf.convert_to_tensor(test_padded, dtype=tf.int64) 
y_test_tensor = tf.convert_to_tensor(test_sent, dtype=tf.int64)

In [None]:
score, acc = model.evaluate(x_test_tensor, y_test_tensor, batch_size=BATCH_SIZE)
print(f'Test score:{score:.3f}, accuracy:{acc:.3f}')

Test score:0.414, accuracy:0.815
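The two numbers returned by `model.evaluate` are the loss (here binary cross-entropy, printed as "score") and the accuracy metric. The sketch below shows how both can be computed from sigmoid outputs, assuming the usual 0.5 decision threshold; the three probability/label pairs are a hypothetical mini-batch used only for illustration, and the function names are not part of Keras.

```python
import math

# Hypothetical mini-batch: sigmoid outputs and their true labels
probs = [0.0028, 0.0139, 0.9766]
labels = [0, 1, 1]

def binary_accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean of -(y * ln(p) + (1 - y) * ln(1 - p)), with p clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

acc = binary_accuracy(probs, labels)
loss = binary_cross_entropy(probs, labels)
print(f"accuracy: {acc:.3f}, loss: {loss:.3f}")
```

A single confident mistake (the second example) dominates the loss, which is why the loss and the accuracy do not always move together.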


In [None]:
# Checking some predictions made by the model
for i in range(5):
  idx = np.random.randint(len(test_padded))
  xtest = test_padded[idx].reshape(1, MAX_SENTENCE_LENGTH)
  ylabel = test_sent[idx]
  ypred = model.predict(xtest)[0][0]
  sentence = tokenizer.decode(xtest[0])
  print(f"Predicted:{ypred} - Label:{ylabel}\n Sentence:{sentence}")


Predicted:0.0028468973468989134 - Label:[0.]
 Sentence: him alive to allow the rats to feast on him followed by a rat aiming for the guy's FACE! What's with all that stupidity? Then there are quite a few continuity goofs, but you can find those elsewhere here on IMDb Honestly I found it a bit of an insult even to my limited intelligence.<br /><br />Waste of time. Still 4 out of 10 to keep my girlfriend from kicking me.
Predicted:0.013875192031264305 - Label:[1.]
 Sentence: beginning the plot was about as predictable as the destination of the flight I was on. I think the whole gay-but-not-gay friend part of the story could have been worked a lot better. The talking parrot was a nice idea but to be honest: it wasn't really very funny.<br /><br />In summary the film was more interesting than staring at the seat in front of me, but it was a close call.
Predicted:0.9765529036521912 - Label:[1.]
 Sentence: less well received films-think Town and Country). The closing scene of Diane Keaton dr