## Convolutional sentiment analysis with fastText Korean embeddings


---

## Model summary
|Item|Value|
|-|-|
|Korean morphological analyzer|Mecab|
|Korean word embedding|fastText|
|Model|CNN|
|Output channels per filter size|100|
|Filter sizes|3, 4, 5|
|Dropout rate|0.5|
|Loss|BCELoss|
|Optimizer|Adam|
|Learning rate|0.001|

---

## Results summary

|Split|Number of examples|Accuracy (%)|
|-|-|-|
|Train set|132286|96.9|
|Validation set|14699|85.5|
|Test set|43295|85.7|


            
            

---
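## Overview

1. Korean text preprocessing
    - Tokenize with the Mecab Korean morphological analyzer

2. Loading the Korean word embedding
    - Load the pretrained Korean word vectors from https://github.com/Kyubyong/wordvectors into the embedding layer

3. Preparing the training and test datasets

4. Building and training the model
    - Train a CNN model

5. Running predictions on sample movie reviews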

---
## 1. Korean text preprocessing and dataset preparation

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import re

from konlpy.tag import Mecab,Okt,Komoran,Hannanum,Kkma

from collections import Counter
from collections import defaultdict
from tensorflow.keras.preprocessing.text import Tokenizer


import torch.nn as nn
import torch
import torch.nn.functional as F
import random

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


### Code for processing Korean text

In [3]:
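def process_korean(sentence):

    # NOTE: str.replace treats its first argument as a literal string, not a regex,
    # so this call leaves the sentence unchanged (the ".." survives in the output below).
    # re.sub("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "", sentence) would actually strip non-Hangul characters.
    sentence=sentence.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]","")

    # split the sentence into morphemes
    # (this cell uses Kkma; the summary table and the cached pickles loaded later refer to Mecab)
    tokenizer=Kkma()
    sentence=tokenizer.morphs(sentence)

    # drop stopwords
    stopwords=['의','가','이','은','들','는','좀','잘','걍','과','도','를','으로','자','에','와','한','하다']
    sentence=[word for word in sentence if word not in stopwords]

    return " ".join(sentence)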

process_korean('아 더빙.. 진짜 짜증나네요 목소리')

'아 아 더빙 .. 진짜 짜증나 네요 목소리'
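
The recorded output above still contains `..`, which is what you would expect if the character-class pattern is passed to `str.replace` rather than to a regex function. The sketch below contrasts the two calls on the same sentence. `strip_non_hangul` is a hypothetical helper name, and whether stripping punctuation before morphological analysis helps accuracy is not something this notebook measures.

```python
import re

def strip_non_hangul(sentence):
    # Keep Hangul jamo, Hangul syllables and spaces; drop everything else.
    return re.sub(r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "", sentence)

raw = "아 더빙.. 진짜 짜증나네요 목소리"

# str.replace looks for the pattern as literal text, finds none, and changes nothing.
literal = raw.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "")

# re.sub interprets the pattern as a character class and removes the dots.
cleaned = strip_non_hangul(raw)

print(literal)
print(cleaned)
```

Swapping this into `process_korean` would change the tokens the model sees, so the cached pickles and the reported accuracies would no longer correspond to the code.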

In [4]:
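def to_korean_embedding(train,label): # run process_korean over every movie review
    train_lst=[]
    label_lst=[]
    for sentence,lab in zip(train,label):
        try:
            processed=process_korean(sentence)
        except Exception:
            # skip rows that cannot be processed (e.g. non-string values such as NaN)
            continue
        train_lst.append(processed)
        label_lst.append(lab)
    return train_lst,label_lst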

### Preparing the training set

In [None]:
train_data= pd.read_table('ratings_train.txt')
X_train,Y_train=to_korean_embedding(list(train_data['document']),list(train_data['label']))
len(X_train),len(Y_train)

### Preparing the test set

In [None]:
test_data= pd.read_table('ratings_test.txt')
X_test,Y_test=to_korean_embedding(list(test_data['document']),list(test_data['label']))
len(X_test),len(Y_test)

In [10]:
# import pickle
# f=open('x_train_komoran.pickle','wb')
# pickle.dump(X_train,f)
# f=open('y_train_komoran.pickle','wb')
# pickle.dump(Y_train,f)
# f=open('x_test_komoran.pickle','wb')
# pickle.dump(X_test,f)
# f=open('y_test_komoran.pickle','wb')
# pickle.dump(Y_test,f)

### Loading the cached preprocessed data

In [5]:
import pickle

def load_pickle(path):
    # open inside a context manager so the file handle is closed after loading
    with open(path,'rb') as f:
        return pickle.load(f)

X_train=load_pickle('x_train_mecab.pickle')
Y_train=load_pickle('y_train_mecab.pickle')
X_test=load_pickle('x_test_mecab.pickle')
Y_test=load_pickle('y_test_mecab.pickle')

### Code for building the vocabulary and converting words to integer indices

In [6]:
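def pad_features(reviews_ints, seq_length):
    # pad every review with '<pad>' up to seq_length tokens

    # NOTE: a str fill value gives the array dtype '<U5', so tokens longer than five
    # characters are silently truncated on assignment; dtype=object would keep them whole.
    features = np.full((len(reviews_ints), seq_length), '<pad>') # start with every cell set to '<pad>'

    for i, row in enumerate(reviews_ints): # for each tokenized review
        n = min(len(row), seq_length)
        features[i, :n] = np.array(row)[:n]  # copy at most seq_length tokens into the row

    return features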

def a_text_to_idx(wordlst,vocab_to_int):
    return [vocab_to_int[word] for word in wordlst]
    


def text_to_embedding(X_train,Y_train,vocab_to_int):
    reviews_split=X_train # e.g. ['아 더빙 .. 진짜 짜증나다 목소리', ...]
    encoded_labels=Y_train

    reviews_split=[r.split(" ") for r in reviews_split] # e.g. [['아', '더빙', '..', '진짜', '짜증나다', '목소리'], ...]

    # drop reviews with zero tokens
    # (note: "".split(" ") returns [''], so an empty review still has length 1 here)
    non_zero_idx = [ii for ii, review in enumerate(reviews_split) if len(review) != 0]
    reviews_split = [reviews_split[ii] for ii in non_zero_idx]
    encoded_labels = [encoded_labels[ii] for ii in non_zero_idx]

    # pad every review with '<pad>' up to the length of the longest review
    seq_length=max(list(map(len,reviews_split)))
    reviews_padded=pad_features(reviews_split,seq_length=seq_length) # e.g. [['아', '더빙', ..., '목소리', '<pad>', '<pad>', ...]]

    reviews_ints = []

    # convert tokens to integers with the vocabulary; out-of-vocabulary tokens are skipped
    for review in reviews_padded:
        sublst=[vocab_to_int[word] for word in review if word in vocab_to_int]
        reviews_ints.append(sublst)

    # keep only reviews whose length is still seq_length,
    # i.e. drop every review that contained an out-of-vocabulary token
    full_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) == seq_length]
    reviews_ints = [reviews_ints[ii] for ii in full_idx]
    encoded_labels = [encoded_labels[ii] for ii in full_idx]

    return np.array(reviews_ints,dtype=int),np.array(encoded_labels,dtype=int)
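
To make the effect of `text_to_embedding` easier to see, here is a minimal pure-Python sketch of the same idea on a toy vocabulary: pad every review to the longest length, then keep only the reviews whose tokens are all in the vocabulary. The names `pad_tokens`, `encode` and `toy_vocab` are illustrative and do not appear in the notebook.

```python
def pad_tokens(tokens, seq_length, pad_token="<pad>"):
    # Truncate to seq_length, then right-pad with pad_token.
    clipped = tokens[:seq_length]
    return clipped + [pad_token] * (seq_length - len(clipped))

def encode(reviews, vocab_to_int, pad_token="<pad>"):
    split = [review.split(" ") for review in reviews]
    seq_length = max(len(tokens) for tokens in split)
    rows = []
    for tokens in split:
        padded = pad_tokens(tokens, seq_length, pad_token)
        # A review with any out-of-vocabulary token is skipped entirely,
        # which has the same effect as the "keep only full-length rows" filter.
        if all(tok in vocab_to_int for tok in padded):
            rows.append([vocab_to_int[tok] for tok in padded])
    return rows

toy_vocab = {"<pad>": 0, "진짜": 1, "재미": 2, "없다": 3}
rows = encode(["진짜 재미 없다", "재미", "진짜 최고"], toy_vocab)
print(rows)  # the third review is dropped because "최고" is not in toy_vocab
```

Because the vocabulary is built from the training reviews only, this filter mostly removes test reviews, which is consistent with the test set being noticeably smaller after encoding.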


## 2. Loading the Korean word embedding
### Loading the pretrained embedding

In [7]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('data/ko/ko.vec')

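# list of words in the pretrained vocabulary
# (index2word exists in gensim < 4.0; gensim >= 4.0 renamed it to index_to_key)
vocab = model.index2word

In [8]:
# extract every word vector
wordvectors = []
for v in vocab:
    wordvectors.append(model[v])  # index the KeyedVectors directly; it has no .wv attribute in gensim >= 4.0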


### Building the vocabulary from the training data and the pretrained embedding

In [9]:
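def make_vocab(reviews_split):
    # build the vocabulary dictionary, most frequent word first

    all_text = ' '.join(reviews_split)
    words = all_text.split()
    counts = Counter(words)
    vocab = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: ii+1 for ii, word in enumerate(vocab, 1)} # word indices start at 2
    vocab_to_int['<pad>']=1
    return vocab_to_int

def make_pretrained_embedding(vocab_to_int,vocab_model, embedding_model):
    # NOTE: `or` returns its first truthy operand, so totallst holds only the training
    # vocabulary; use `|` instead if the union with the pretrained vocabulary is intended.
    totallst=list(set(vocab_to_int.keys()) or set(vocab_model))
    word_to_row={word: row for row, word in enumerate(vocab_model)} # O(1) lookup instead of list.index
    pretrained_embedding=[]
    new_vocab_to_int=dict()
    idx=0
    ukn=0 # number of words without a pretrained vector
    for aword in totallst:
        if aword in word_to_row:
            pretrained_embedding.append(embedding_model[word_to_row[aword]])
        elif aword == "<pad>":
            pretrained_embedding.append(np.zeros(200))
        else:
            # words missing from the pretrained vocabulary get a random vector
            pretrained_embedding.append(np.random.normal(scale=0.6, size=(200, )))
            ukn+=1
        new_vocab_to_int[aword]=idx
        idx+=1
    print(ukn)
    return new_vocab_to_int,pretrained_embedding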


In [10]:
vocab_to_int=make_vocab(X_train)
new_vocab_to_int, pretrained_embedding=make_pretrained_embedding(vocab_to_int,vocab,wordvectors)

38240


In [11]:
len(list(new_vocab_to_int.keys())),len(list(vocab_to_int.keys()))

(54027, 54027)

In [12]:
totallst=list(set(vocab_to_int.keys()) or set(vocab))
len(totallst),len(pretrained_embedding)

(54027, 54027)
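
The two identical counts above follow from how `or` behaves on sets: it returns the first operand if that operand is non-empty, so the pretrained vocabulary never enters `totallst`. A small sketch with made-up words shows the difference from a true union with `|`.

```python
train_vocab = {"<pad>", "영화", "재미"}
pretrained_vocab = {"영화", "감독"}

# `or` evaluates truthiness: a non-empty first set is returned unchanged.
with_or = train_vocab or pretrained_vocab

# `|` is the set union operator.
with_union = train_vocab | pretrained_vocab

print(len(with_or), len(with_union))
```

Whether the union is actually desirable here is a design choice: it would enlarge the embedding matrix with words that never occur in the training reviews.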

## 3. Preparing the training and test datasets

### Encoding the training set

In [13]:
features,encoded_labels=text_to_embedding(X_train,Y_train,new_vocab_to_int)

### Encoding the test set

In [14]:
test_x,test_y=text_to_embedding(X_test,Y_test,new_vocab_to_int)

In [15]:
features.shape,encoded_labels.shape,test_x.shape,test_y.shape
# len(features),len(encoded_labels),len(test_x),len(test_y)

((146985, 116), (146985,), (43295, 105), (43295,))

### Splitting into train and validation sets

In [16]:
split_frac = 0.9
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

In [17]:
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(len(train_x)), 
       "\nValidation set: \t{}".format(len(val_x)),
      "\nTest set: \t\t{}".format(len(test_x)))

			Feature Shapes:
Train set: 		132286 
Validation set: 	14699 
Test set: 		43295


### DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.
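
For comparison, the generator-based alternative mentioned above could look roughly like the following sketch. It yields only full batches, which is the same behaviour `drop_last=True` gives a `DataLoader`. The helper names `full_batches` and `dropped_examples` are hypothetical.

```python
def full_batches(items, batch_size):
    # Yield consecutive batches and skip a trailing partial batch.
    for start in range(0, len(items) - batch_size + 1, batch_size):
        yield items[start:start + batch_size]

def dropped_examples(n_examples, batch_size):
    # Number of examples left out when only full batches are used.
    return n_examples % batch_size

batches = list(full_batches(list(range(7)), batch_size=3))
print(batches)

# Split sizes reported in this notebook, with the batch size of 50 it uses.
for name, size in [("train", 132286), ("validation", 14699), ("test", 43295)]:
    print(name, size // 50, "batches,", dropped_examples(size, 50), "examples dropped")
```

The dropped examples matter slightly for evaluation: if accuracy is divided by the full dataset size while the last partial batch is never scored, the reported figure is a little lower than the true one.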

In [18]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))


# dataloaders
batch_size = 50

# make sure to SHUFFLE your training data
# NOTE: drop_last=True discards the final partial batch of every split, so a few
# validation and test examples never reach the accuracy counts computed later
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

seq_lengths=next(iter(train_loader))[0].size()[1]

In [19]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 116])
Sample input: 
 tensor([[14195, 26006,    12,  ..., 28080, 28080, 28080],
        [20698,   883, 22908,  ..., 28080, 28080, 28080],
        [11228, 33540, 20847,  ..., 28080, 28080, 28080],
        ...,
        [ 5850, 47611, 46978,  ..., 28080, 28080, 28080],
        [15077, 38553,  6026,  ..., 28080, 28080, 28080],
        [49823, 47172,  2682,  ..., 28080, 28080, 28080]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
        0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
        0, 1])


In [20]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')


Training on GPU.


## 4. Building and training the model

Now to build our model.

The first major hurdle is visualizing how CNNs are used for text. Images are typically 2 dimensional (we'll ignore the third "colour" dimension for now) whereas text is 1 dimensional. However, the first step in this notebook (and in pretty much all NLP pipelines) is converting the words into word embeddings. This lets us view a sentence in 2 dimensions, with each word along one axis and the elements of its vector across the other. Consider the 2 dimensional representation of the embedded sentence below:

![](assets/sentiment9.png)

We can then use a filter that is **[n x emb_dim]**. This will cover $n$ sequential words entirely, as the filter's width is `emb_dim` dimensions. Consider the image below, with our word vectors represented in green. Here we have 4 words with 5 dimensional embeddings, creating a [4x5] "image" tensor. A filter that covers two words at a time (i.e. bi-grams) will be a **[2x5]** filter, shown in yellow, and each element of the filter will have a _weight_ associated with it. The output of this filter (shown in red) will be a single real number that is the weighted sum of all elements covered by the filter.

![](assets/sentiment12.png)

The filter then moves "down" the image (or across the sentence) to cover the next bi-gram and another output (weighted sum) is calculated. 

![](assets/sentiment13.png)

Finally, the filter moves down again and the final output for this filter is calculated.

![](assets/sentiment14.png)

In our case (and in the general case where the width of the filter equals the width of the "image"), our output will be a vector with number of elements equal to the height of the image (the length of the sentence in words) minus the height of the filter plus one, $4-2+1=3$ in this case.
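
The sliding-filter arithmetic can be checked with a few lines of plain Python. This is only an illustrative sketch of a single full-width filter with no bias or activation, using made-up numbers for the [4x5] sentence and the [2x5] filter. It is not how `nn.Conv2d` is implemented.

```python
def conv_full_width(sentence, kernel):
    # sentence: list of word vectors, shape [sent_len][emb_dim]
    # kernel:   list of weight rows,  shape [n][emb_dim]
    n = len(kernel)
    outputs = []
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        # Weighted sum over every element covered by the filter.
        total = sum(
            w * x
            for k_row, s_row in zip(kernel, window)
            for w, x in zip(k_row, s_row)
        )
        outputs.append(total)
    return outputs

sentence = [[1.0] * 5, [2.0] * 5, [3.0] * 5, [4.0] * 5]  # 4 words, 5-dim embeddings
kernel = [[1.0] * 5, [1.0] * 5]                          # bi-gram filter, all weights 1

out = conv_full_width(sentence, kernel)
print(out, len(out))  # three outputs: 4 - 2 + 1
```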

This example showed how to calculate the output of one filter. Our model (and pretty much all CNNs) will have lots of these filters. The idea is that each filter will learn a different feature to extract. In the above example, we are hoping each of the **[2 x emb_dim]** filters will be looking for the occurrence of different bi-grams.

In our model, we will also have different sizes of filters, with heights of 3, 4 and 5 and 100 filters of each size. The intuition is that we will be looking for the occurrence of different tri-grams, 4-grams and 5-grams that are relevant for analysing the sentiment of movie reviews.

The next step in our model is to use *pooling* (specifically *max pooling*) on the output of the convolutional layers. This is similar to average pooling over the word vectors (as done with `F.avg_pool2d` in FastText-style models); however, instead of taking the average over a dimension, we take the maximum value over a dimension. Below is an example of taking the maximum value (0.9) from the output of the convolutional layer on the example sentence (not shown is the activation function applied to the output of the convolutions).

![](assets/sentiment15.png)

The idea here is that the maximum value is the "most important" feature for determining the sentiment of the review, which corresponds to the "most important" n-gram within the review. How do we know what the "most important" n-gram is? Luckily, we don't have to! Through backpropagation, the weights of the filters are changed so that whenever certain n-grams that are highly indicative of the sentiment are seen, the output of the filter is a "high" value. This "high" value then passes through the max pooling layer if it is the maximum value in the output. 

As our model has 100 filters of 3 different sizes, that means we have 300 different n-grams the model thinks are important. We concatenate these together into a single vector and pass them through a linear layer to predict the sentiment. We can think of the weights of this linear layer as "weighting up the evidence" from each of the 300 n-grams and making a final decision. 
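
As a quick sanity check of the shapes described above, the following sketch computes the length of each convolution output for the padded training length of 116 and the size of the concatenated feature vector. The function names are illustrative only.

```python
def conv_output_length(sent_len, filter_size):
    # Number of positions a full-width filter can take along the sentence.
    return sent_len - filter_size + 1

def max_over_time(feature_map):
    # Max pooling over the whole sequence keeps one value per filter.
    return max(feature_map)

sent_len = 116          # padded training length reported earlier
filter_sizes = [3, 4, 5]
n_filters = 100

conv_lengths = [conv_output_length(sent_len, fs) for fs in filter_sizes]
pooled_features = n_filters * len(filter_sizes)
strongest = max_over_time([0.1, 0.9, 0.3])

print(conv_lengths)     # one length per filter size
print(pooled_features)  # input size of the final linear layer
print(strongest)
```

Because pooling reduces every feature map to a single number, the input size of the linear layer depends only on the number of filters, not on the sentence length.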

### Implementation Details

We implement the convolutional layers with `nn.Conv2d`. The `in_channels` argument is the number of "channels" in your image going into the convolutional layer. In actual images this is usually 3 (one channel for each of the red, blue and green channels), however when using text we only have a single channel, the text itself. The `out_channels` is the number of filters and the `kernel_size` is the size of the filters. Each of our `kernel_size`s is going to be **[n x emb_dim]** where $n$ is the size of the n-grams.

In PyTorch, RNNs want the input with the batch dimension second by default, whereas CNNs want the batch dimension first. We do not have to permute the data here because the `DataLoader` already yields batches of shape [batch size, sent len]. We then pass the sentence through an embedding layer to get our embeddings. The second dimension of the input into a `nn.Conv2d` layer must be the channel dimension. As text technically does not have a channel dimension, we `unsqueeze` our tensor to create one. This matches with our `in_channels=1` in the initialization of our convolutional layers.

We then pass the tensors through the convolutional and pooling layers, using the `ReLU` activation function after the convolutional layers. Another nice feature of the pooling layers is that they handle sentences of different lengths. The size of the output of the convolutional layer depends on the size of its input, and inputs can differ in length (here the training set is padded to 116 tokens and the test set to 105). Without the max pooling layer the input to our linear layer would depend on the length of the input sentence (not what we want). One option to rectify this would be to trim/pad all sentences to the same length; however, with the max pooling layer we always know the input to the linear layer will be the total number of filters. **Note**: there is an exception to this if your sentence(s) are shorter than the largest filter used. You will then have to pad your sentences to the length of the largest filter. In this notebook every review is padded to the length of the longest review in its split, so we don't have to worry about that, but you will if you feed the model shorter, unpadded inputs.
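
A minimal sketch of the padding rule from the note above: any input shorter than the largest filter is right-padded so that the filter fits at least once. The helper name `pad_to_min_length` and the use of 0 as the padding index are assumptions for illustration; in this notebook the padding index is looked up from the vocabulary.

```python
def pad_to_min_length(token_ids, min_len, pad_idx):
    # Right-pad so that the largest filter fits at least once.
    missing = max(0, min_len - len(token_ids))
    return token_ids + [pad_idx] * missing

largest_filter = max([3, 4, 5])

short = pad_to_min_length([17, 42], largest_filter, pad_idx=0)
long_enough = pad_to_min_length([1, 2, 3, 4, 5, 6], largest_filter, pad_idx=0)

print(short)
print(long_enough)
```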

Finally, we perform dropout on the concatenated filter outputs and then pass them through a linear layer to make our predictions.

In [21]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, 
                 dropout, pad_idx):
        
        super().__init__()
                
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)


        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.sig=nn.Sigmoid()
        
    def forward(self, text):
                
        #text = [batch size, sent len]
        
        embedded = self.embedding(text)
        
        
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.unsqueeze(1)
        
        #embedded = [batch size, 1, sent len, emb dim]
        
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
            
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
                
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        
        #pooled_n = [batch size, n_filters]
        
        cat = self.dropout(torch.cat(pooled, dim = 1))

        #cat = [batch size, n_filters * len(filter_sizes)]
        output=self.fc(cat).squeeze(1)
        
        output=self.sig(output)
        
        return output

### Instantiate the network


In [22]:
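# stack the list of vectors into one array first, then convert it to a float32 tensor
pretrained_embedding=torch.tensor(np.array(pretrained_embedding),dtype=torch.float)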
# INPUT_DIM = len(vocab_to_int)+1
INPUT_DIM =pretrained_embedding.shape[0]
# EMBEDDING_DIM = seq_lengths
EMBEDDING_DIM = 200

N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5
PAD_IDX=new_vocab_to_int['<pad>']

net = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)

In [23]:
PAD_IDX

28080

In [24]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(net):,} trainable parameters')

The model has 11,046,001 trainable parameters
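
The reported parameter count can be reproduced by hand from the layer shapes, which is a useful check that the embedding matrix dominates the model size. The function below is an illustrative sketch and is not part of the notebook. It counts weights and biases for the embedding, the three convolution groups and the linear layer.

```python
def cnn_parameter_count(vocab_size, emb_dim, n_filters, filter_sizes, output_dim):
    # Embedding table: one emb_dim vector per vocabulary entry.
    embedding = vocab_size * emb_dim
    # Each filter has fs * emb_dim weights (one input channel) plus one bias.
    convs = sum(n_filters * (fs * emb_dim) + n_filters for fs in filter_sizes)
    # Linear layer over the concatenated pooled features, plus bias.
    fc = len(filter_sizes) * n_filters * output_dim + output_dim
    return embedding + convs + fc

total = cnn_parameter_count(
    vocab_size=54027, emb_dim=200, n_filters=100, filter_sizes=[3, 4, 5], output_dim=1
)
print(f"{total:,}")
```

Of the total, 10,805,400 parameters belong to the embedding table and only 240,601 to the convolutional and linear layers.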


### Applying the pretrained embedding to the model

In [25]:
print(pretrained_embedding.shape)

torch.Size([54027, 200])


In [26]:
net.embedding.weight.data.copy_(pretrained_embedding)

tensor([[ 0.3933,  0.1533,  0.2433,  ...,  0.0058,  0.4796, -0.8924],
        [ 0.3508, -0.1666,  0.1610,  ...,  0.0368,  0.7583, -0.1848],
        [-0.0462,  0.3320,  0.0487,  ..., -0.4848,  0.4125, -0.1326],
        ...,
        [ 0.0743, -0.0316,  0.2388,  ..., -0.1870, -0.0817, -0.1621],
        [-0.2906, -0.1595,  0.3406,  ..., -0.1620, -0.2360, -0.4483],
        [ 1.2834, -0.4482, -0.5768,  ..., -0.9720,  0.3555,  1.2703]])

---
### Training

Below is the typical training code. If you want to do this yourself, feel free to delete all this code and implement it yourself. You can also add code to save a model by name.

>We'll also be using a new kind of cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.
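
To make the loss concrete, here is a plain-Python sketch of the binary cross entropy formula, the mean of $-[y\log p + (1-y)\log(1-p)]$ over a batch. It is only meant to show the arithmetic. It does not guard against probabilities of exactly 0 or 1, which a library implementation has to handle.

```python
import math

def binary_cross_entropy(probabilities, targets):
    # Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch.
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)

confident_right = binary_cross_entropy([0.99], [1])
confident_wrong = binary_cross_entropy([0.99], [0])

print(round(confident_right, 4), round(confident_wrong, 4))
```

A confident correct prediction costs almost nothing, while a confident wrong one is penalised heavily, which is why the loss keeps pushing the sigmoid output towards the true label.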

We also have some data and training hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum total gradient norm; gradients are rescaled when their norm exceeds it (to prevent exploding gradients).
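
Clipping by norm rescales the whole gradient vector rather than capping individual entries. The sketch below shows the idea on a flat list of numbers; `clip_by_norm` is a hypothetical helper, and the small epsilon in the denominator is an assumption modelled on common implementations.

```python
import math

def clip_by_norm(grads, max_norm, eps=1e-6):
    # Scale all gradients by the same factor so their L2 norm is at most max_norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]

clipped = clip_by_norm([6.0, 8.0], max_norm=5.0)    # norm 10 is scaled down to about 5
untouched = clip_by_norm([0.3, 0.4], max_norm=5.0)  # norm 0.5 is already below the limit

print(clipped)
print(untouched)
```

Since every component is multiplied by the same factor, the direction of the update is preserved and only its size is limited.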

In [27]:
# loss and optimization functions
lr=0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
# training params
epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing
counter = 0
print_every = 500
clip=5 # gradient clipping

In [28]:
# move model to GPU, if available
if(train_on_gpu):    
    net.cuda()    

net.train()
# train for some number of epochs
for e in range(epochs):
    num_correct = 0
    
    # batch loop
    for inputs, labels in train_loader:
           
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()


        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output = net(inputs)
        # calculate the loss and perform backprop

        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        
        # `clip_grad_norm_` rescales the gradients so that their total norm is at most `clip`
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # compare predictions to true label
        pred = torch.round(output.squeeze()) 
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)    
        
    
        # loss stats
        if counter % print_every == 0:
            
            val_losses = []
            # NOTE: the model is never switched back with net.train() after this block,
            # so dropout stays disabled for the rest of training; call net.train()
            # after the validation loop to re-enable it.
            net.eval()

            val_num_correct=0
            with torch.no_grad(): # gradients are not needed for validation
                for inputs, labels in valid_loader:

                    if(train_on_gpu):
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output = net(inputs)
                    val_loss = criterion(output.squeeze(), labels.float())
                    val_losses.append(val_loss.item())

                    pred = torch.round(output.squeeze())
                    correct_tensor = pred.eq(labels.float().view_as(pred))
                    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
                    val_num_correct += np.sum(correct)

            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                 )    
    # -- stats! -- ##
    train_acc = num_correct/len(train_loader.dataset)
    print("Train accuracy: {:.3f}".format(train_acc))
    valid_acc = val_num_correct/len(valid_loader.dataset)
    print("Valid accuracy: {:.3f}".format(valid_acc))   
    print()

Epoch: 1/4... Step: 500... Loss: 0.426230...
Epoch: 1/4... Step: 1000... Loss: 0.327731...
Epoch: 1/4... Step: 1500... Loss: 0.380746...
Epoch: 1/4... Step: 2000... Loss: 0.377675...
Epoch: 1/4... Step: 2500... Loss: 0.279272...
Train accuracy: 0.829
Valid accuracy: 0.855

Epoch: 2/4... Step: 3000... Loss: 0.329996...
Epoch: 2/4... Step: 3500... Loss: 0.280918...
Epoch: 2/4... Step: 4000... Loss: 0.238523...
Epoch: 2/4... Step: 4500... Loss: 0.237686...
Epoch: 2/4... Step: 5000... Loss: 0.330080...
Train accuracy: 0.895
Valid accuracy: 0.865

Epoch: 3/4... Step: 5500... Loss: 0.085578...
Epoch: 3/4... Step: 6000... Loss: 0.157445...
Epoch: 3/4... Step: 6500... Loss: 0.040780...
Epoch: 3/4... Step: 7000... Loss: 0.173706...
Epoch: 3/4... Step: 7500... Loss: 0.164687...
Train accuracy: 0.938
Valid accuracy: 0.858

Epoch: 4/4... Step: 8000... Loss: 0.072164...
Epoch: 4/4... Step: 8500... Loss: 0.151353...
Epoch: 4/4... Step: 9000... Loss: 0.153644...
Epoch: 4/4... Step: 9500... Loss: 0.11

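### Saving the model

In [35]:
# pickling stores the whole module object; the context manager closes the file afterwards
with open("cnn_model.pickle",'wb') as f:
    pickle.dump(net,f)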

---
### Testing

There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.
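
The accuracy computation used for the test data boils down to thresholding the sigmoid outputs at 0.5 and counting matches with the labels. A minimal pure-Python sketch of that step, with made-up probabilities and a hypothetical `accuracy` helper:

```python
def accuracy(probabilities, labels, threshold=0.5):
    # A probability above the threshold counts as the positive class.
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)

acc = accuracy([0.99, 0.23, 0.52, 0.01], [1, 0, 0, 0])
print(acc)
```

The third example shows why borderline outputs such as 0.52 deserve a second look: they are counted as positive even though the model is barely more confident than a coin flip.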

In [29]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0


net.eval()
# iterate over test data
for inputs, labels in test_loader:

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output= net(inputs)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.452
Test accuracy: 0.857


## 5. Running predictions on sample movie reviews

### Try out test_reviews of your own!

You can change this test_review to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts correctly!
    
> **Exercise:** Write a `predict` function that takes in a trained net, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review!
* You can use any functions that you've already defined or define any helper functions you want to complete `predict`, but it should just take in a trained net, a text review, and a sequence length.
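
One design question for `predict` is what to do with words that are missing from the vocabulary. The sketch below shows one possible approach, dropping unknown tokens before padding, so that a single unseen word does not prevent a prediction. This is an assumption for illustration, not the notebook's method, and it differs from the training preprocessing, which discards whole reviews that contain unknown words. `encode_review` and `toy_vocab` are hypothetical names.

```python
def encode_review(tokens, vocab_to_int, seq_length, pad_token="<pad>"):
    # Hypothetical helper: unknown tokens are dropped instead of raising KeyError.
    known = [vocab_to_int[tok] for tok in tokens if tok in vocab_to_int]
    known = known[:seq_length]  # truncate reviews longer than seq_length
    return known + [vocab_to_int[pad_token]] * (seq_length - len(known))

toy_vocab = {"<pad>": 0, "진짜": 1, "감동": 2}
ids = encode_review(["진짜", "띵작", "감동"], toy_vocab, seq_length=5)
print(ids)  # "띵작" is not in toy_vocab and is skipped
```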

In [39]:
def a_text_to_idx(wordlst,vocab_to_int):
    return [vocab_to_int[word] for word in wordlst]

def predict(net, test_review, label, new_vocab_to_int, sequence_length=95):
    # `label` is accepted for reference only; it is not used in the prediction
    try:
        net.eval()
        test_review=process_korean(test_review)
        test_review=test_review.split(" ")[:sequence_length] # truncate reviews longer than the padded length
        padded_review=test_review+["<pad>"]*(sequence_length-len(test_review))
        idx_review=a_text_to_idx(padded_review,new_vocab_to_int) # raises KeyError for unknown words
        features=np.array([idx_review])
        feature_tensor = torch.from_numpy(features)
        if(train_on_gpu):
            feature_tensor = feature_tensor.cuda()
        # get the output from the model
        with torch.no_grad():
            output = net(feature_tensor)
        # convert output probabilities to predicted class (0 or 1)
        pred = torch.round(output.squeeze())
        print('Predicted value : {:.6f}'.format(output.item()))
        if(pred.item()==1):
            print("Prediction : this is a positive review.")
        else:
            print("Prediction : this is a negative review.")
    except KeyError:
        print("The review contains a word that is not in the vocabulary.")

In [40]:
test_data= pd.read_table('ratings_test.txt')
test_label=test_data['label']
test_data=test_data['document']


In [43]:
pos=["기대 이상이었음",                  # "It exceeded expectations"
    "역시 믿고 보는 감독이었다",          # "As expected, a director you can trust"
    "진짜 감동이었어요 ㅠㅠㅠㅠㅠㅠ ",      # "It was really moving"
    "역대급 띵작임"]                     # "An all-time masterpiece"

neg=["이 영화 되게 노잼이다.",            # "This movie is really boring."
    "배우가 발연기를 하네 ㅉㅉ",          # "The actor's performance is terrible"
    "돈이 아까웠음 진심...",              # "Honestly, a waste of money..."
    "ㄹㅇ 제작비 낭비 ㅠㅠㅠㅠ"]           # "Truly a waste of the production budget"

print("Positive reviews\n")

for p in pos:
    print("Review : "+p)
    predict(net,p,1,new_vocab_to_int,200)
    print()

print("Negative reviews\n")

for n in neg:
    print("Review : "+n)
    predict(net,n,0,new_vocab_to_int,200)
    print()

Positive reviews

Review : 기대 이상이었음
Predicted value : 0.520270
Prediction : this is a positive review.

Review : 역시 믿고 보는 감독이었다
Predicted value : 0.989716
Prediction : this is a positive review.

Review : 진짜 감동이었어요 ㅠㅠㅠㅠㅠㅠ 
Predicted value : 0.998165
Prediction : this is a positive review.

Review : 역대급 띵작임
Predicted value : 0.715748
Prediction : this is a positive review.

Negative reviews

Review : 이 영화 되게 노잼이다.
Predicted value : 0.001697
Prediction : this is a negative review.

Review : 배우가 발연기를 하네 ㅉㅉ
Predicted value : 0.227050
Prediction : this is a negative review.

Review : 돈이 아까웠음 진심...
Predicted value : 0.000986
Prediction : this is a negative review.

Review : ㄹㅇ 제작비 낭비 ㅠㅠㅠㅠ
Predicted value : 0.006756
Prediction : this is a negative review.



##### Reference link
https://github.com/DonghyungKo/NLP_sentiment_classification/blob/master/RNN/RNN.ipynb