# Sentiment Analysis with PyTorch on IMDB

Based on [this notebook](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb)

In [0]:

import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


We'll load the IMDB dataset through [`torchtext`](https://github.com/pytorch/text).

`torchtext` is a collection of NLP-specific utilities. It also ships several useful datasets :-)

We'll use `data.Field`, the main class in `torchtext`: it describes how to turn our raw text into data. Note that this is the older `torchtext` API; it was moved to `torchtext.legacy` in release 0.9 and removed in 0.12, so the cells below need an older release.

In our case, we tell it to use the `spacy` tokenizer.

In [0]:
TEXT = data.Field(tokenize = 'spacy')
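
To make the idea of tokenization concrete, here is a rough stand-in written with a regular expression. It is not spaCy's tokenizer (which, as the sample review printed further down shows, also splits contractions such as `doesn't` into `does` and `n't`); `simple_tokenize` and `TOKEN_RE` are hypothetical names used only for illustration.

```python
import re

# Hypothetical stand-in for a real tokenizer: a token is either a word
# (optionally with internal apostrophes or hyphens) or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("A toothsome little potboiler, isn't it?")
print(tokens)
```

Whatever tokenizer is used, the `Field` ends up with a list of strings per review, which is what the vocabulary is later built from.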

For the labels we use a subclass: `LabelField`

In [0]:
LABEL = data.LabelField(dtype = torch.float)

Let's load the IMDB dataset, which has 50k reviews. The train/test split is predefined, so we use it as is.

(this takes about a minute)

In [4]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 11.2MB/s]


Let's see how many examples we ended up with

In [5]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


In [6]:
from pprint import pprint

pprint(train_data.examples[0].text, compact=True)

['A', 'toothsome', 'little', 'potboiler', 'whose', '65-minute', 'length',
 'does', "n't", 'seem', 'a', 'second', 'too', 'short', ',', 'My', 'Name', 'is',
 'Julia', 'Ross', 'harks', 'back', 'to', 'an', 'English', 'tradition', 'of',
 'things', 'not', 'being', 'what', 'they', 'seem', '--', 'Hitchcock', "'s",
 'The', 'Lady', 'Vanishes', 'is', 'one', 'example', '.', 'Out', '-', 'of', '-',
 'work', 'Julia', 'Ross', '(', 'Nina', 'Foch', ')', 'finds', 'a', 'dream',
 'job', 'at', 'a', 'new', 'employment', 'agency', 'in', 'London', ',', 'whose',
 'sinister', 'representative', 'seems', 'very', 'anxious', 'to', 'ascertain',
 'if', 'she', 'has', 'living', 'relatives', 'or', 'a', 'boyfriend', '.',
 'After', 'reporting', 'to', 'duty', ',', 'she', 'wakes', 'up', '(', 'Having',
 'Been', 'Drugged', ')', 'in', 'a', 'vast', 'Manderley', '-', 'like', 'pile',
 'on', 'the', 'Cornish', 'coast', ',', 'supposedly', 'as', 'the', 'barmy', '-',
 'in', '-', 'the', '-', 'crumpet', 'wife', 'of', 'George', 'Macready',

## Validation

Let's set aside a slice of the training set for validation

In [0]:
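import random

# random.seed() returns None, so seed the global RNG first and pass its state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())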


In [8]:
len(train_data), len(valid_data)

(17500, 7500)
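
The sizes above are what a 70/30 split of 25,000 examples gives. As a minimal sketch of the idea (not `torchtext`'s actual implementation), a seeded shuffle followed by a cut reproduces those numbers; `split_indices` is a hypothetical helper and the 0.7 ratio is an assumption inferred from the output.

```python
import random

def split_indices(n_examples, split_ratio=0.7, seed=1234):
    """Shuffle indices with a fixed seed and cut them at split_ratio."""
    rng = random.Random(seed)          # private RNG, so the result is repeatable
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = int(round(split_ratio * n_examples))
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(25_000)
print(len(train_idx), len(valid_idx))
```

Using a fixed seed means the same examples land in the validation set on every run, which keeps validation numbers comparable across experiments.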

In [9]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


## Vocabulary

Let's build the vocabulary.

In [0]:
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

Now `TEXT.vocab` holds our vocabulary

In [11]:

len(TEXT.vocab)

101282

101k words! That's a lot. Let's trim it down a bit and keep only the 25,000 most frequent tokens; everything else will map to `<unk>`...

In [12]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

len(TEXT.vocab), len(LABEL.vocab)

(25002, 2)

Let's see what the vocabulary looks like

In [13]:
index_to_string = TEXT.vocab.itos

for i, s in enumerate(index_to_string[:30]):
    print(f"{i:<2} --> {s}")

0  --> <unk>
1  --> <pad>
2  --> the
3  --> ,
4  --> .
5  --> and
6  --> a
7  --> of
8  --> to
9  --> is
10 --> in
11 --> I
12 --> it
13 --> that
14 --> "
15 --> 's
16 --> this
17 --> -
18 --> /><br
19 --> was
20 --> as
21 --> movie
22 --> with
23 --> for
24 --> film
25 --> The
26 --> but
27 --> (
28 --> )
29 --> on


Let's see how `LABEL` turned out

In [14]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f5dd59eaae8>, {'neg': 0, 'pos': 1})


## Training the model

We'll use `data.BucketIterator`, which tries to batch together examples of similar length (this reduces padding and makes packing easier before feeding an RNN)

In [0]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
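
To see why grouping by length matters, the sketch below counts how many pad tokens are needed when every batch is padded to its longest example. It is a simplification: it fully sorts the examples, whereas `BucketIterator` only aims for batches of similar length. `count_padding` is a hypothetical helper and the review lengths are made up.

```python
import random

def count_padding(lengths, batch_size):
    """Total pad tokens needed when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(1234)
# Made-up review lengths in tokens, purely for illustration.
lengths = [rng.randint(20, 1000) for _ in range(6400)]

random_order_pad = count_padding(lengths, batch_size=64)
bucketed_pad = count_padding(sorted(lengths), batch_size=64)
print(random_order_pad, bucketed_pad)
```

Every pad token is a wasted RNN step, so the gap between the two numbers translates fairly directly into training time.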

In [0]:
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        #text = [sent len, batch size]
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        output, hidden = self.rnn(embedded)
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))

Now we create the model:

- Input dim: vocabulary size
- Embedding dim: 100 (randomly initialized)
- Hidden layer (RNN output): 256
- Output dim: 1, a single logit for the binary label (1 = positive, 0 = negative)


In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = Model(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [0]:

criterion = nn.BCEWithLogitsLoss()

In [19]:
model.to(device)
criterion.to(device)

Model(
  (embedding): Embedding(25002, 100)
  (rnn): GRU(100, 256)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)
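
From the layer sizes printed above we can work out how many trainable parameters the model has. This is plain arithmetic, assuming PyTorch's default single-layer, unidirectional GRU with two bias vectors per weight block; `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, emb_dim, hid_dim, out_dim):
    """Parameter counts for Embedding -> single-layer GRU -> Linear."""
    embedding = vocab_size * emb_dim
    # A GRU has three weight blocks (reset gate, update gate, candidate state);
    # each has input weights, hidden weights and two bias vectors.
    gru = 3 * (emb_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim)
    linear = hid_dim * out_dim + out_dim
    return {"embedding": embedding, "gru": gru, "linear": linear,
            "total": embedding + gru + linear}

params = count_parameters(25_002, 100, 256, 1)
print(params)
```

Roughly 90% of the parameters sit in the embedding table, which is why the vocabulary size cap has such a large effect on model size.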

To compute the accuracy, we apply `sigmoid` to the output. For numerical stability when computing the cross-entropy, the model outputs logits rather than probabilities.
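
The stability issue can be reproduced in plain Python. The sketch below compares a naive cross-entropy computed from the probability with the usual rearrangement that works on the logit directly. It is a textbook formulation under our own naming (`naive_bce`, `stable_bce_with_logits`), not a copy of PyTorch's code.

```python
import math

def naive_bce(logit, target):
    """Cross-entropy computed from the probability; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def stable_bce_with_logits(logit, target):
    """Algebraically equivalent form that never takes log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits both versions agree.
print(naive_bce(2.0, 1.0), stable_bce_with_logits(2.0, 1.0))

# For a confident wrong prediction, sigmoid(40) rounds to exactly 1.0,
# so the naive version ends up evaluating log(0).
try:
    naive_bce(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
print(stable_bce_with_logits(40.0, 0.0))
```

This is the reason the model returns raw logits and leaves the sigmoid to the loss function.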


In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
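
Since the sigmoid crosses 0.5 exactly at a logit of 0, rounding the sigmoid is the same as checking the sign of the logit. A small pure-Python check with made-up logits and labels (`binary_accuracy_from_logits` is a hypothetical counterpart of the function above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_from_logits(logits, labels):
    """Fraction of examples where (logit > 0) agrees with the 0/1 label."""
    correct = sum(int(z > 0) == int(y) for z, y in zip(logits, labels))
    return correct / len(labels)

# Made-up values, purely for illustration.
logits = [2.3, -0.7, 0.1, -4.0, 1.5]
labels = [1, 0, 0, 0, 1]

# Same predictions as rounding the sigmoid, without computing it.
assert [round(sigmoid(z)) for z in logits] == [int(z > 0) for z in logits]

accuracy = binary_accuracy_from_logits(logits, labels)
print(accuracy)
```

The third example (logit 0.1, label 0) is the only miss, so four out of five predictions are correct.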

We define a function to train...

In [0]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [0]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [28]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 1m 2s
	Train Loss: 0.698 | Train Acc: 49.98%
	 Val. Loss: 0.692 |  Val. Acc: 51.78%
Epoch: 02 | Epoch Time: 1m 1s
	Train Loss: 0.694 | Train Acc: 50.28%
	 Val. Loss: 0.693 |  Val. Acc: 50.76%
Epoch: 03 | Epoch Time: 1m 2s
	Train Loss: 0.693 | Train Acc: 49.57%
	 Val. Loss: 0.692 |  Val. Acc: 51.21%
Epoch: 04 | Epoch Time: 1m 2s
	Train Loss: 0.693 | Train Acc: 50.38%
	 Val. Loss: 0.696 |  Val. Acc: 50.72%
Epoch: 05 | Epoch Time: 1m 1s
	Train Loss: 0.694 | Train Acc: 49.99%
	 Val. Loss: 0.701 |  Val. Acc: 49.73%
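
A quick sanity check on the numbers above: a classifier that always predicts a probability of 0.5 has a binary cross-entropy of ln 2 per example. The snippet compares that value with the validation losses copied from the log; the variable names are ours.

```python
import math

# A classifier that always outputs p = 0.5 (logit 0) pays -log(0.5)
# per example, whatever the true label is.
chance_loss = -math.log(0.5)
print(f"{chance_loss:.3f}")

# Validation losses copied from the training log above.
reported_valid_losses = [0.692, 0.693, 0.692, 0.696, 0.701]
gaps = [abs(loss - chance_loss) for loss in reported_valid_losses]
print(max(gaps))
```

All five epochs stay within 0.01 of that value, and the accuracy hovers around 50%, so the model is behaving like a coin flip.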


As we can see, the model is not improving :-\ We have a few problems here:

- We use random embeddings
- We are not regularizing

We'll improve on this in the next notebook