## 1. Preparations

### 1-1. Import Libraries
- Import the torchtext library, which simplifies dataset download and preprocessing.


In [1]:
import os
import random
import time
import sys

import torch

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchtext.legacy import data, datasets
import spacy
import numpy as np
from torch import Tensor
import math

### 1-2. Load data
- Define the `Field`s.
- Download the IMDB data.
- Split it into train, validation, and test sets.

In [2]:
TEXT = data.Field(tokenize='spacy',include_lengths=True)
LABEL = data.LabelField(dtype = torch.float) 



In [3]:
# Download IMDB data (about 3mins)
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [4]:
# Set the random seed
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [5]:
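# Split train and valid data (the default split ratio is 0.7)
# random.seed() returns None; it seeds the global RNG as a side effect,
# so seed first and pass the resulting RNG state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())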

In [6]:
print('Number of training examples: {}'.format(len(train_data)))
print('Number of validation examples: {}'.format(len(valid_data)))
print('Number of testing examples: {}'.format(len(test_data)))

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


In [7]:
# print example
print(vars(train_data.examples[0]))

{'text': ['The', 'TV', 'guide', 'described', 'the', 'plot', 'of', 'SEVERED', 'TIES', 'as', 'thus', ':', '"', 'An', 'experiment', 'on', 'a', 'severed', 'arm', 'goes', 'awry', '"', 'so', 'right', 'away', 'I', 'thought', 'this', 'was', 'going', 'to', 'be', 'about', 'an', 'arm', 'that`s', 'got', 'a', 'mind', 'of', 'its', 'own', 'as', 'seen', 'in', 'THE', 'BEAST', 'WITH', 'FIVE', 'FINGERS', 'or', 'THE', 'HAND', 'or', 'someone', 'getting', 'an', 'arm', 'transplant', 'as', 'in', 'BODY', 'PARTS', '.', 'Both', 'premises', 'are', 'tried', 'and', 'tested', ',', 'or', 'to', 'be', 'more', 'accurate', 'tired', 'and', 'tested', 'so', 'I', 'was', 'curious', 'as', 'to', 'how', 'the', 'producers', 'would', 'approach', 'the', 'story', '.', 'I', 'actually', 'thought', 'they', 'were', 'making', 'an', 'arthouse', 'movie', 'like', 'PI', 'down', 'to', 'the', 'use', 'of', 'B&W', 'photography', 'at', 'the', 'start', 'of', 'the', 'film', 'but', 'the', 'makers', 'seemed', 'to', 'have', 'tired', 'of', 'this', 'app

In [8]:
print(' '.join(vars(train_data.examples[0])['text']))

The TV guide described the plot of SEVERED TIES as thus : " An experiment on a severed arm goes awry " so right away I thought this was going to be about an arm that`s got a mind of its own as seen in THE BEAST WITH FIVE FINGERS or THE HAND or someone getting an arm transplant as in BODY PARTS . Both premises are tried and tested , or to be more accurate tired and tested so I was curious as to how the producers would approach the story . I actually thought they were making an arthouse movie like PI down to the use of B&W photography at the start of the film but the makers seemed to have tired of this approach after 20 seconds and decided to make a splatter comedy similar to THE EVIL DEAD . I`ve very little to say on this except that I disliked THE EVIL DEAD movies and I disliked SEVERED TIES and it seems really unfair that films like this use an obscene amount of rubber when the third world is crying out for condoms


### 1-3. Cuda Setup
- Configure CUDA so the GPU can be used.
- In Colab, first enable the GPU from the top menu: Edit > Notebook settings.


In [9]:
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda:0" if USE_CUDA else "cpu")

In [10]:
!nvidia-smi

Wed Nov 10 16:42:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 166...  Off  | 00000000:0A:00.0  On |                  N/A |
|  0%   45C    P8    12W / 125W |    466MiB /  5941MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

## 2. Preprocess data
- Build the vocab (vocabulary).
- Build the iterators. (An iterator handles batching and padding for batch training, and converts the words in the data to indices.)

In [None]:
# Load pre-trained word vectors (about 7mins)
TEXT.build_vocab(train_data, vectors = "glove.6B.100d")

In [None]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary: 102064


In [11]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_                 
                 )
LABEL.build_vocab(train_data)

In [12]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
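
The size 25,002 is `MAX_VOCAB_SIZE` plus the two special tokens `<unk>` and `<pad>`. The following is a minimal pure-Python sketch of that idea, assuming a frequency-ranked vocabulary. `build_vocab_sketch`, `numericalize`, and the toy corpus are hypothetical names for illustration; this is not torchtext's implementation.

```python
from collections import Counter

def build_vocab_sketch(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, placed after the special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ranked[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

corpus = [["the", "film", "is", "great"], ["the", "film", "is", "bad"], ["the", "plot"]]
itos, stoi = build_vocab_sketch(corpus, max_size=3)
ids = numericalize(["the", "ending", "is", "great"], stoi)
print(len(itos), itos, ids)
```

With `max_size=3` the toy vocabulary has 3 + 2 = 5 entries, and the out-of-vocabulary words `ending` and `great` both map to index 0.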


In [13]:
TEXT.vocab.itos[:10]  #itos – A list of token strings indexed by their numerical identifiers.

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']

In [14]:
word_dict = TEXT.vocab.stoi # stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
print(len(word_dict))
print(word_dict['<unk>'], word_dict['<pad>'])

25002
0 1


In [15]:
print(TEXT.vocab.vectors.shape)
print(TEXT.vocab.vectors)

torch.Size([25002, 100])
tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.0038,  0.5892, -0.3353,  ...,  0.3862,  0.1623,  0.7502],
        [ 0.7290, -1.2762, -0.1884,  ...,  0.4076,  0.1942,  0.0521],
        [ 0.7557,  0.4154,  0.3530,  ..., -0.3626, -0.9909,  1.0656]])


In [16]:
# Batching - construct iterator
BATCH_SIZE = 32

train_iterator = data.Iterator(
    train_data, 
    batch_size = BATCH_SIZE,
    device = device)

In [17]:
# batch.text is a (tokens, lengths) tuple; tokens has shape [max sentence length in batch, BATCH_SIZE]
for batch in train_iterator:
    break
print(batch.text)
print(len(batch.text[0]))
print(batch.text[0].shape) # [seq len, batch_size]
print(batch.label)

(tensor([[ 325,  159,  314,  ...,   16,   66,  531],
        [  67,  549,   11,  ...,    9,   98,   85],
        [  15, 5340,  240,  ...,    5,   36,   29],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:0'), tensor([236, 230, 282, 201, 135,  82, 235, 181, 450, 185, 146, 680, 204,  80,
        152,  98,  92, 116, 399, 141, 117, 166, 502, 329, 112, 132, 243, 145,
        747,  64, 177, 199], device='cuda:0'))
747
torch.Size([747, 32])
tensor([0., 1., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 1., 1., 1., 0., 0.,
        0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1.],
       device='cuda:0')


In [18]:
# BucketIterator

train_iterator = data.BucketIterator(
    train_data, 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

In [19]:
for batch in train_iterator:
    break
print(batch.text)
print(len(batch.text[0]))
print(batch.text[0].shape)
print(batch.label)

(tensor([[ 3327,   146,  2711,  ...,  3616,    66,    54],
        [ 8903,   146,  2230,  ...,    20,    24,  1966],
        [    9,   146,     9,  ...,  2427,    52,     2],
        ...,
        [   42, 19975,     7,  ...,     4,     4,     4],
        [ 9155,     0,   317,  ...,     1,     1,     1],
        [   39,    39,     4,  ...,     1,     1,     1]], device='cuda:0'), tensor([131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 131, 130, 130, 130,
        130, 130, 130, 130, 130, 130, 130, 130, 130, 129, 129, 129, 129, 129,
        129, 129, 129, 129], device='cuda:0'))
131
torch.Size([131, 32])
tensor([1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1.,
        1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       device='cuda:0')
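
In the outputs above, the plain `Iterator` batch is padded to 747 tokens although its lengths range from 64 to 747, while the `BucketIterator` batch only spans lengths 129 to 131. The sketch below counts pad tokens for shuffled versus length-sorted batches. It assumes made-up review lengths and a fully sorted dataset, which is a simplification of what `BucketIterator` does; `padding_tokens` is a hypothetical helper.

```python
import random

def padding_tokens(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(1234)
lengths = [rng.randint(20, 800) for _ in range(320)]  # made-up review lengths

shuffled_pad = padding_tokens(lengths, batch_size=32)
bucketed_pad = padding_tokens(sorted(lengths), batch_size=32)
print(shuffled_pad, bucketed_pad)
```

Grouping sequences of similar length means less computation is spent on `<pad>` positions.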


In [20]:
# Batching - construct iterator

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_sizes = (BATCH_SIZE, BATCH_SIZE, BATCH_SIZE),
    sort_within_batch = True,
    device = device)

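## 3. Build Model
- Build a model that takes text as input and outputs a logit for positive vs. negative sentiment (a sigmoid turns it into a probability).
- Load the pre-trained word embeddings into the embedding layer.

### 1) RNN Model for Text Classification
- (1) Embedding layer: maps each token to its word embedding.
- (2) Vanilla RNN / LSTM-RNN layer: processes the sequence of word embeddings in order.
- (3) Dropout layer: randomly sets units to 0 for better generalization.
- (4) Fully-connected layer: takes the last element of the RNN-encoded sequence and maps it to a single output value.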

In [21]:
class CustomModel(nn.Module):  # Define the custom model
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, n_layers, dropout, pad_idx):
        super().__init__()

        # Define parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # Define Layers
        # Embedding layer
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        ### To-do ###
        # Vanilla RNN layer
        self.rnn = nn.RNN(embedding_dim, hidden_dim, n_layers, dropout=dropout)
        #############
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):

        # text = [L, N]   L = sentence length, N = batch size 
        
        embedded = self.embedding(text)
        
        # embedded = [L, N, emb dim]      

        # Apply RNN and Dropout
        
        ### To-do ###
        output, hidden =  self.rnn(embedded)
        #############

        # hidden = [num_layers x num_directions, N, H]   H = hidden dimension

        hidden = self.dropout(hidden[-1,:,:])

        return self.fc(hidden)



### Notes
* When bidirectional, `nn.RNN` has two layers per stacked layer: a forward layer and a backward layer.
*   The RNN modules in `torch.nn` return the hidden state as one of their two outputs:
> `output, hidden = self.rnn(embedded)`
*   `hidden` holds the last hidden state of **every layer** in the model.
*   Its shape is therefore `[num_layers x num_directions, batch_size, hidden_size]`.

* With n layers, the entries are ordered _1st forward, 1st backward, 2nd forward, 2nd backward, ..., nth forward, nth backward_.
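
The ordering along the first dimension of `hidden` can be written out as index arithmetic: entry `layer * num_directions + direction`. The sketch below only labels the indices and does not run a network; `hidden_layout` is a hypothetical helper.

```python
def hidden_layout(num_layers, bidirectional):
    """Label each index along the first dimension of `hidden`."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [f"layer {layer + 1} {d}" for layer in range(num_layers) for d in directions]

layout = hidden_layout(num_layers=2, bidirectional=True)
for idx, name in enumerate(layout):
    print(idx, name)

# Negative indexing picks out the two directions of the top layer.
print(layout[-2], "|", layout[-1])
```

This is why a bidirectional model concatenates `hidden[-2]` and `hidden[-1]`, while a unidirectional model only needs `hidden[-1]`.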

### 2) Bi-LSTM Model for Text Classification
- Uses a bidirectional LSTM layer as the RNN layer.

In [76]:
# To-do: Make Custom Bidirectional LSTM Model

class CustomModel(nn.Module): 
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, n_layers, dropout, pad_idx):
        super().__init__()

        # Define parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # Define Layers
        # Embedding layer
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)

        ### To-do ###
        # Bidirectional LSTM-RNN layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional = True, dropout=dropout)
        #############
 
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

        # Fully connected layer
        ### To-do ###
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        #############

    def forward(self, text):

        # text = [L, N]   L = sentence length, N = batch size 
        
        embedded = self.embedding(text)
        
        # embedded = [L, N, emb dim]      

        # Apply Bidirectional LSTM and Dropout
  
        output, (hidden, cell) = self.lstm(embedded)

        # hidden = [num_layers x num_directions, N, H],   H = hidden dimension
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)) 
        
        # hidden = [N, 2 * H]  (final forward and backward states of the top layer, concatenated)

        return self.fc(hidden)     # [N, 1]


In [88]:
# GRU

class CustomModel(nn.Module):  # Define the custom model
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, n_layers, dropout, pad_idx):
        super().__init__()

        # Define parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # Define Layers
        # Embedding layer
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        ### To-do ###
        # GRU layer
        self.gru = nn.GRU(embedding_dim, hidden_dim, n_layers, dropout=dropout)
        #############
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):

        # text = [L, N]   L = sentence length, N = batch size 
        
        embedded = self.embedding(text)
        
        # embedded = [L, N, emb dim]      

        # Apply GRU and Dropout
        
        ### To-do ###
        output, hidden =  self.gru(embedded)
        #############

        # hidden = [num_layers x num_directions, N, H]   H = hidden dimension

        hidden = self.dropout(hidden[-1,:,:])

        return self.fc(hidden)
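
The `TEXT` field was created with `include_lengths=True`, yet the models above consume only the token tensor, so for shorter sequences in a batch the final hidden state is computed after stepping over `<pad>` positions. One common alternative, shown here as a hedged sketch rather than as part of the notebook's models, is to pack the embedded batch with `torch.nn.utils.rnn.pack_padded_sequence`. The tiny vocabulary, dimensions, and token ids below are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
PAD_IDX = 1
embedding = nn.Embedding(10, 4, padding_idx=PAD_IDX)
gru = nn.GRU(4, 3)  # single layer, so no inter-layer dropout is involved

# Two sequences in [L, N] layout, sorted by decreasing length; the second is padded.
text = torch.tensor([[2, 5],
                     [3, 6],
                     [4, PAD_IDX]])
lengths = torch.tensor([3, 2])

with torch.no_grad():
    embedded = embedding(text)
    # Unpacked: the GRU also steps over the pad position of the second sequence.
    _, hidden_padded = gru(embedded)
    # Packed: each sequence stops at its true length (lengths must live on the CPU).
    packed = pack_padded_sequence(embedded, lengths.cpu())
    _, hidden_packed = gru(packed)
    # Reference: run the short sequence on its own, without padding.
    _, hidden_alone = gru(embedding(text[:2, 1:2]))

print(torch.allclose(hidden_packed[-1, 1], hidden_alone[-1, 0], atol=1e-6))
print(torch.allclose(hidden_padded[-1, 1], hidden_alone[-1, 0], atol=1e-6))
```

The first comparison should hold, since packing reproduces the state of the unpadded sequence; the second generally will not. Because `sort_within_batch = True` already orders each batch by decreasing length, the batches produced in this notebook would satisfy the default `enforce_sorted=True` requirement.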


In [145]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
DROPOUT = 0.1
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = CustomModel(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS,
            DROPOUT, 
            PAD_IDX)   


In [146]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)  # Count number of elements of all parameters

print('The model has {:,} trainable parameters'.format(count_parameters(model)))

The model has 3,170,153 trainable parameters
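
The reported count can be reproduced by hand. The sketch below assumes the most recently defined `CustomModel` (the unidirectional GRU variant) is the one instantiated, and uses PyTorch's GRU parameter layout: three gates, each with input weights, recurrent weights, and two bias vectors. `gru_params` is a hypothetical helper.

```python
def gru_params(input_size, hidden_size, num_layers):
    """Parameters of a unidirectional multi-layer GRU with PyTorch's layout."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 3 gates, each with input weights, recurrent weights, and two bias vectors.
        total += 3 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

vocab_size, emb_dim, hidden_dim, output_dim, n_layers = 25002, 100, 256, 1, 2

embedding = vocab_size * emb_dim                  # embedding table
gru = gru_params(emb_dim, hidden_dim, n_layers)   # two stacked GRU layers
fc = hidden_dim * output_dim + output_dim         # linear weights + bias
print(embedding, gru, fc, embedding + gru + fc)
```

The embedding table (2,500,200 weights) accounts for most of the 3,170,153 parameters.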


In [147]:
# load pretrained embeddings
pretrained_embeddings = TEXT.vocab.vectors
print(type(pretrained_embeddings))
model.embedding.weight.data.copy_(pretrained_embeddings);

<class 'torch.Tensor'>


## 4. Train model

In [148]:
optimizer = optim.Adam(model.parameters())   # Define the optimizer that performs gradient descent

In [149]:
criterion = nn.BCEWithLogitsLoss()  # Define the loss function
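
`nn.BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy, which is why the model returns raw logits. A pure-Python sketch of the per-example loss follows, using a standard numerically stable rewriting. The function names and sample values are illustrative, and this is not PyTorch's source code.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    """Textbook form; fine for moderate logits, unstable for extreme ones."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 0.0]
losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
mean_loss = sum(losses) / len(losses)  # the default reduction of nn.BCEWithLogitsLoss is the mean
print(mean_loss)
```

Both forms agree for moderate logits; the stable form avoids evaluating `log(0)` when a logit is very large in magnitude.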

In [150]:
model = model.to(device)  # Move the model to the GPU
criterion = criterion.to(device)

In [151]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # Apply sigmoid to the logits, then round to 0 or 1 (torch.round)
    round_preds = torch.round(torch.sigmoid(preds))
    
    #count the correct by building list of 0/1
    correct = (round_preds==y).float()
    
    acc = correct.sum() / len(correct)
    return acc
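
A small worked example of the same computation in pure Python, with made-up logits and labels. `binary_accuracy_sketch` is a hypothetical stand-in for the tensor version above.

```python
import math

def binary_accuracy_sketch(logits, labels):
    """Fraction of examples whose thresholded probability matches the label."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]  # same as logit > 0
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

acc = binary_accuracy_sketch([2.0, -1.0, 0.5, -3.0], [1.0, 0.0, 0.0, 0.0])
print(acc)
```

Three of the four predictions match (the logit 0.5 is predicted positive but labeled 0), so the result is 0.75.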

In [152]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
      
        # Reset gradients to zero
        optimizer.zero_grad()        
        
        # Forward pass: batch.text is (tokens, lengths); the model takes the tokens
        predictions = model(batch.text[0]).squeeze(1)
        
        # Compute the loss
        loss = criterion(predictions, batch.label)
        
        # Compute the accuracy
        acc = binary_accuracy(predictions, batch.label)
        
        # Backward pass (compute gradients)
        loss.backward()

        # Parameter update
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [153]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
          
            predictions = model(batch.text[0]).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


### *Do Training!*

In [158]:
N_EPOCHS = 5

best_valid_loss = float('inf') # Represents infinity

for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    if valid_loss < best_valid_loss: # Checkpoint the best model so far (training is not stopped early)
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'rnn-model.pt')
    
    print('Epoch: {:02}'.format(epoch+1))
    print('\tTrain Loss: {:.3f} | Train Acc: {:.2f}%'.format(train_loss, train_acc*100))
    print('\t Val. Loss: {:.3f} |  Val. Acc: {:.2f}%'.format(valid_loss, valid_acc*100))

Epoch: 01
	Train Loss: 0.042 | Train Acc: 98.87%
	 Val. Loss: 0.367 |  Val. Acc: 89.31%
Epoch: 02
	Train Loss: 0.022 | Train Acc: 99.45%
	 Val. Loss: 0.435 |  Val. Acc: 89.42%
Epoch: 03
	Train Loss: 0.022 | Train Acc: 99.35%
	 Val. Loss: 0.512 |  Val. Acc: 88.52%
Epoch: 04
	Train Loss: 0.018 | Train Acc: 99.45%
	 Val. Loss: 0.643 |  Val. Acc: 86.48%
Epoch: 05
	Train Loss: 0.012 | Train Acc: 99.67%
	 Val. Loss: 0.524 |  Val. Acc: 88.59%
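
Replaying the checkpoint rule on the validation losses logged above shows which epoch's weights end up in `rnn-model.pt`. This is a sketch of the selection logic only; appending to `saved_epochs` stands in for the `torch.save` call.

```python
valid_losses = [0.367, 0.435, 0.512, 0.643, 0.524]  # from the log above

best_valid_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)  # stands in for torch.save(...)

print(saved_epochs, best_valid_loss)
```

Only epoch 1 improves on the running best, so the saved checkpoint is the epoch-1 model with validation loss 0.367.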


In [159]:
model.load_state_dict(torch.load('rnn-model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print('Test Loss: {:.3f} | Test Acc: {:.2f}%'.format(test_loss, test_acc*100))

Test Loss: 0.392 | Test Acc: 88.23%


## 5. Test model
We write our own example sentences and check how the trained model scores them.



In [None]:
# Use spaCy as the tokenizer.
# spaCy 3 removed the 'en' shortcut; this assumes the 'en_core_web_sm' model is installed.
nlp = spacy.load('en_core_web_sm')

# Feed a user-written sentence to the trained model and return the predicted probability.
def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # Tokenization
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]   # Map tokens to the indices of the vocab built above
    tensor = torch.LongTensor(indexed).to(device)   # Convert the index sequence to a torch tensor
    tensor = tensor.unsqueeze(1)   # Add a batch dimension: [L] -> [L, 1]
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor))  # Apply sigmoid to the logit to get a probability
    return prediction.item() # Return the predicted probability

In [None]:
predict_sentiment(model, "This film is terrible") # Yields a very low probability (negative).

0.027659380808472633

In [None]:
predict_sentiment(model, "This film is not great") # Only slightly above 0.5: the model leans positive and misses the negation.

0.5428000092506409