## Faster Sentiment Analysis with fastText Korean Embeddings

---

## Model summary
|Item|Value|
|-|-|
|Korean morphological analyzer|Mecab|
|Korean word embedding|fastText|
|Model|Faster sentiment analysis|
|Loss|BCELoss|
|Optimizer|Adam|
|Learning rate|0.001|

---

## Results summary

|Split|Number of samples|Accuracy (%)|
|-|-|-|
|train set|132286|87.8|
|validation set|14699|84.5|
|test set|43295|85.4|


            
            

---
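## Overview

1. Korean text preprocessing
    - Uses the Mecab Korean morphological analyzer

2. Loading Korean word embeddings
    - Loads the pretrained Korean word vectors from https://github.com/Kyubyong/wordvectors and uses them as the embedding

3. Preparing the training and test datasets

4. Building and training the model
    - Trains a FastText-style model (embedding, average pooling, linear layer)

5. Predicting on sample movie reviews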

---
## 1. Korean text preprocessing and dataset preparation

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import re

from konlpy.tag import Mecab,Okt,Komoran,Hannanum,Kkma

from collections import Counter
from collections import defaultdict
from tensorflow.keras.preprocessing.text import Tokenizer


import torch
import torch.nn as nn
import torch.nn.functional as F
import random

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


### Korean text processing code

In [3]:
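def process_korean(sentence):
    
    # NOTE: str.replace matches its first argument literally, not as a regular
    # expression, so this call leaves the sentence unchanged (the sample output
    # below still contains ".."). Stripping non-Hangul characters would need
    # re.sub(r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "", sentence).
    sentence=sentence.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]","")
    
    # Split the sentence into morphemes.
    # NOTE: this cell uses Kkma, while the summary table and the pickles loaded
    # later refer to Mecab.
    tokenizer=Kkma()
    sentence=tokenizer.morphs(sentence)

    # Drop stopwords
    stopwords=['의','가','이','은','들','는','좀','잘','걍','과','도','를','으로','자','에','와','한','하다']
    sentence=[word for word in sentence if word not in stopwords]
    
    return " ".join(sentence)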

process_korean('아 더빙.. 진짜 짜증나네요 목소리')

'아 아 더빙 .. 진짜 짜증나 네요 목소리'
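
The pattern passed to `replace` in `process_korean` is written like a regular expression, but `str.replace` only matches literal text, which is consistent with the `..` surviving in the output above. A minimal sketch of a regex-based cleanup, assuming the intent was to keep only Hangul and spaces (the helper name `strip_non_hangul` is hypothetical and not part of the notebook):

```python
import re

# Same character class as in process_korean: Hangul jamo, Hangul syllables, and spaces.
NON_HANGUL = re.compile(r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]")

def strip_non_hangul(sentence):
    """Remove every character that is not Hangul or a space."""
    return NON_HANGUL.sub("", sentence)

sample = "아 더빙.. 진짜 짜증나네요 목소리"
literal = sample.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "")  # literal match: nothing is removed
cleaned = strip_non_hangul(sample)                 # regex match: punctuation is removed
print(literal == sample)
print(cleaned)
```

Applying this would change the token stream, so the vocabulary and the accuracy figures reported in this notebook would not necessarily be reproduced.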

In [4]:
def to_korean_embedding(train,label): # Apply the Korean preprocessing to the movie review set
    train_lst=[]
    label_lst=[]
    for sentence,y in zip(train,label):
        try:
            train_lst.append(process_korean(sentence))
            label_lst.append(y)
        except Exception:
            # Skip rows that cannot be processed (for example, non-string entries)
            pass
    return train_lst,label_lst

### Preparing the training set

In [None]:
train_data= pd.read_table('data/ratings_train.txt')
X_train,Y_train=to_korean_embedding(list(train_data['document']),list(train_data['label']))
len(X_train),len(Y_train)

### Preparing the test set

In [None]:
test_data= pd.read_table('data/ratings_test.txt')
X_test,Y_test=to_korean_embedding(list(test_data['document']),list(test_data['label']))
len(X_test),len(Y_test)

In [10]:
# import pickle
# f=open('x_train_komoran.pickle','wb')
# pickle.dump(X_train,f)
# f=open('y_train_komoran.pickle','wb')
# pickle.dump(Y_train,f)
# f=open('x_test_komoran.pickle','wb')
# pickle.dump(X_test,f)
# f=open('y_test_komoran.pickle','wb')
# pickle.dump(Y_test,f)

### Loading the saved preprocessed data

In [5]:
import pickle

# Load the preprocessed reviews and labels; `with` closes each file after reading.
with open('data/x_train_mecab.pickle','rb') as f:
    X_train=pickle.load(f)
with open('data/y_train_mecab.pickle','rb') as f:
    Y_train=pickle.load(f)
with open('data/x_test_mecab.pickle','rb') as f:
    X_test=pickle.load(f)
with open('data/y_test_mecab.pickle','rb') as f:
    Y_test=pickle.load(f)

### Code for building the vocabulary and converting words to integer indices

In [6]:
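def pad_features(reviews_ints, seq_length):
    # Pad every review up to the maximum sentence length.
    # NOTE: filling with the string '<pad>' makes NumPy pick a fixed-width dtype
    # ('<U5'), so tokens longer than five characters are truncated on assignment.
    
    features = np.full((len(reviews_ints), seq_length), '<pad>') # start with every cell set to <pad>
    
    for i, row in enumerate(reviews_ints): # for each original review
        row = row[:seq_length]  # keep at most seq_length tokens
        features[i, :len(row)] = np.array(row)  # overwrite the leading cells with the actual tokens
    
    return features

def a_text_to_idx(wordlst,vocab_to_int):
    return [vocab_to_int[word] for word in wordlst]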
    


def text_to_embedding(X_train,Y_train,vocab_to_int):
    reviews_split=X_train # e.g. ['아 더빙 .. 진짜 짜증나다 목소리', ...]
    encoded_labels=Y_train

    reviews_split=[r.split(" ") for r in reviews_split] # e.g. [['아', '더빙', '..', '진짜', '짜증나다', '목소리']]

    # Drop reviews that contain no words
    non_zero_idx = [ii for ii, review in enumerate(reviews_split) if len(review) != 0]
    reviews_split = [reviews_split[ii] for ii in non_zero_idx]
    encoded_labels = [encoded_labels[ii] for ii in non_zero_idx]
    
    # Pad every review with <pad> up to a fixed length
    seq_length=max(list(map(len,reviews_split)))
    reviews_padded=pad_features(reviews_split,seq_length=seq_length) # e.g. [['아', '더빙', ..., '<pad>', '<pad>']]
    
    reviews_ints = []
    
    # Convert words to integers using the vocabulary
    for review in reviews_padded:
        sublst=[]
        for word in review:
            try:
                sublst.append(vocab_to_int[word])
            except KeyError:
                # Out-of-vocabulary word: skip it (the review is then dropped below)
                pass
        reviews_ints.append(sublst) # now a list of integer indices

    # Keep only the reviews whose length still equals seq_length
    full_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) == seq_length]
    reviews_ints = [reviews_ints[ii] for ii in full_idx]
    encoded_labels = [encoded_labels[ii] for ii in full_idx]
    
    return np.array(reviews_ints,dtype=int),np.array(encoded_labels,dtype=int)
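
The padding step in `pad_features` can be illustrated without NumPy. The following is a simplified sketch rather than the notebook's implementation; `pad_tokens` is a hypothetical helper that right-pads each tokenized review with `<pad>` and truncates anything longer than `seq_length`:

```python
def pad_tokens(reviews, seq_length, pad_token="<pad>"):
    """Right-pad (or truncate) each token list to exactly seq_length items."""
    padded = []
    for tokens in reviews:
        row = list(tokens[:seq_length])               # truncate long reviews
        row += [pad_token] * (seq_length - len(row))  # fill the remainder
        padded.append(row)
    return padded

reviews = [["아", "더빙", "진짜"], ["목소리"]]
padded = pad_tokens(reviews, seq_length=4)
print(padded)
```

Because plain Python lists hold full strings, this variant does not depend on a fixed-width NumPy string dtype.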


## 2. Loading Korean word embeddings
### Loading the pretrained embedding

In [7]:
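from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('data/ko/ko.vec')

# Build the word list.
# `index2word` is the gensim < 4.0 attribute name; gensim 4.x renamed it to `index_to_key`.
vocab = model.index2word

In [8]:
# Extract every word vector.
# Index the KeyedVectors object directly; the `.wv` alias on KeyedVectors is
# deprecated in gensim 3.x and removed in 4.x.
wordvectors = []
for v in vocab:
    wordvectors.append(model[v])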


### Building the vocabulary from the training data and the pretrained embedding

In [9]:
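def make_vocab(reviews_split):
    # Build the vocabulary dictionary, most frequent words first
    
    all_text = ' '.join(reviews_split)
    words = all_text.split()
    counts = Counter(words)
    vocab = sorted(counts, key=counts.get, reverse=True)
    # Word indices start at 2; index 1 is reserved for <pad>
    vocab_to_int = {word: ii+1 for ii, word in enumerate(vocab, 1)}
    vocab_to_int['<pad>']=1
    return vocab_to_int

def make_pretrained_embedding(vocab_to_int,vocab_model, embedding_model):
    # NOTE: `set(a) or set(b)` evaluates to set(a) whenever it is non-empty, so only
    # the training-set words are kept here. A true union with the pretrained
    # vocabulary would be `set(a) | set(b)`.
    totallst=list(set(vocab_to_int.keys()) or set(vocab_model))
    # Map each pretrained word to its first row once, instead of searching the list per word
    word_to_row=dict()
    for row, word in enumerate(vocab_model):
        word_to_row.setdefault(word, row)
    pretrained_embedding=[]
    new_vocab_to_int=dict()
    ukn=0  # number of words without a pretrained vector
    for idx, aword in enumerate(totallst):
        if aword in word_to_row:
            pretrained_embedding.append(embedding_model[word_to_row[aword]])
        elif aword == "<pad>":
            pretrained_embedding.append(np.zeros(200))
        else:
            # No pretrained vector: initialize randomly
            pretrained_embedding.append(np.random.normal(scale=0.6, size=(200, )))
            ukn+=1
        new_vocab_to_int[aword]=idx
    print(ukn)
    return new_vocab_to_int,pretrained_embedding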


In [10]:
vocab_to_int=make_vocab(X_train)
new_vocab_to_int, pretrained_embedding=make_pretrained_embedding(vocab_to_int,vocab,wordvectors)

38240


In [11]:
len(list(new_vocab_to_int.keys())),len(list(vocab_to_int.keys()))

(54027, 54027)

In [12]:
totallst=list(set(vocab_to_int.keys()) or set(vocab))
len(totallst),len(pretrained_embedding)

(54027, 54027)
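
The identical counts above are consistent with how `set(a) or set(b)` behaves: it is a boolean expression, not a set operation, and returns the first operand whenever that operand is non-empty. A small illustration with toy vocabularies (the words are made up for the example):

```python
train_vocab = {"영화", "재미", "<pad>"}
pretrained_vocab = {"영화", "감독", "배우"}

# `or` returns its first truthy operand, so the second set is ignored.
picked_by_or = train_vocab or pretrained_vocab

# `|` is the set-union operator and keeps words from both vocabularies.
union = train_vocab | pretrained_vocab

print(len(picked_by_or), len(union))
```

Switching to `|` in `make_pretrained_embedding` would enlarge the embedding matrix to cover the pretrained vocabulary as well, so the sizes and parameter count reported in this notebook would change.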

## 3. Preparing the training and test datasets

### Encoding the training set

In [13]:
features,encoded_labels=text_to_embedding(X_train,Y_train,new_vocab_to_int)

### Encoding the test set

In [14]:
test_x,test_y=text_to_embedding(X_test,Y_test,new_vocab_to_int)

In [15]:
features.shape,encoded_labels.shape,test_x.shape,test_y.shape
# len(features),len(encoded_labels),len(test_x),len(test_y)

((146985, 116), (146985,), (43295, 105), (43295,))

### Splitting into train and validation sets

In [16]:
split_frac = 0.9
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

In [17]:
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(len(train_x)), 
       "\nValidation set: \t{}".format(len(val_x)),
      "\nTest set: \t\t{}".format(len(test_x)))

			Feature Shapes:
Train set: 		132286 
Validation set: 	14699 
Test set: 		43295
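
The split sizes printed above follow directly from `split_frac`. A quick check of the arithmetic, using the row count reported earlier for `features`:

```python
n_rows = 146985      # rows in `features`, from the shape printed earlier
split_frac = 0.9

split_idx = int(n_rows * split_frac)  # int() truncates the fractional part
n_train = split_idx
n_val = n_rows - split_idx

print(n_train, n_val)
```

The split is positional: the first 90% of rows become the training set and the remaining rows the validation set.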


### DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.
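
For comparison, the generator-based alternative mentioned above might look like the following sketch. `full_batches` is a hypothetical helper, not something the notebook defines; it yields only complete batches, which mirrors passing `drop_last=True` to a `DataLoader`:

```python
def full_batches(xs, ys, batch_size):
    """Yield (x_batch, y_batch) pairs, skipping the final partial batch."""
    n_full = len(xs) // batch_size
    for b in range(n_full):
        start = b * batch_size
        yield xs[start:start + batch_size], ys[start:start + batch_size]

xs = list(range(7))
ys = [x % 2 for x in xs]
batches = list(full_batches(xs, ys, batch_size=3))
print(batches)
```

Unlike a `DataLoader` with `shuffle=True`, this sketch keeps the original order, so shuffling would have to be done separately.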

In [18]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))


# dataloaders
batch_size = 50

# make sure to SHUFFLE your training data
# NOTE: drop_last=True discards the final partial batch of each loader
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

# sequence length of the training batches
seq_lengths=next(iter(train_loader))[0].size()[1]

In [19]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 116])
Sample input: 
 tensor([[30695, 29037, 21966,  ...,   364,   364,   364],
        [47654, 45965, 48667,  ...,   364,   364,   364],
        [36825, 32567, 22835,  ...,   364,   364,   364],
        ...,
        [53968, 14840, 18908,  ...,   364,   364,   364],
        [49381, 46923, 27370,  ...,   364,   364,   364],
        [  836, 35935, 40478,  ...,   364,   364,   364]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
        0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
        0, 1])


In [55]:
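# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')
# NOTE: this unconditional override runs the rest of the notebook on the CPU,
# even when the message above reports a GPU.
train_on_gpu=False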

Training on GPU.


## Build the Model

This model has far fewer parameters than an RNN-based model, as only 2 of its layers have any parameters: the embedding layer and the linear layer. There is no RNN component in sight!

Instead, it first calculates the word embedding for each word using the `Embedding` layer (blue), then calculates the average of all of the word embeddings (pink) and feeds this through the `Linear` layer (silver), and that's it!

![](assets/sentiment8.png)

We implement the averaging with the `avg_pool2d` (average pool 2-dimensions) function. Initially, you may think using a 2-dimensional pooling seems strange, surely our sentences are 1-dimensional, not 2-dimensional? However, you can think of the word embeddings as a 2-dimensional grid, where the words are along one axis and the dimensions of the word embeddings are along the other. The image below is an example sentence after being converted into 5-dimensional word embeddings, with the words along the vertical axis and the embeddings along the horizontal axis. Each element in this [4x5] tensor is represented by a green block.

![](assets/sentiment9.png)

The `avg_pool2d` uses a filter of size `embedded.shape[1]` (i.e. the length of the sentence) by 1. This is shown in pink in the image below.

![](assets/sentiment10.png)

We calculate the average value of all elements covered by the filter; the filter then slides to the right, calculating the average over the next column of embedding values for each word in the sentence.

![](assets/sentiment11.png)

Each filter position gives us a single value, the average of all covered elements. After the filter has covered all embedding dimensions we get a [1x5] tensor. This tensor is then passed through the linear layer to produce our prediction.
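
Numerically, the pooling described above is a column-wise mean over the word axis. A plain-Python sketch with a made-up [4x5] grid of embedding values (no PyTorch involved, so this illustrates the arithmetic rather than `avg_pool2d` itself):

```python
# Four words, each with a 5-dimensional embedding (values invented for illustration).
embedded = [
    [1.0, 0.0, 2.0, 4.0, 1.0],
    [3.0, 2.0, 2.0, 0.0, 1.0],
    [5.0, 4.0, 2.0, 0.0, 1.0],
    [7.0, 2.0, 2.0, 0.0, 1.0],
]

sent_len = len(embedded)

# zip(*embedded) walks the grid column by column, one embedding dimension at a time.
pooled = [sum(column) / sent_len for column in zip(*embedded)]

print(pooled)
```

The result has one value per embedding dimension, which is the [1x5] tensor that the linear layer receives.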

In [128]:
import torch.nn as nn
import torch.nn.functional as F

class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
        self.sig=nn.Sigmoid()
        
    def forward(self, text):
        
        #text = [batch size, sent len] (the DataLoader yields batch-first tensors)
        
        embedded = self.embedding(text.T)
                
        #embedded = [sent len, batch size, emb dim]
        
        embedded = embedded.permute(1, 0, 2)
        
        #embedded = [batch size, sent len, emb dim]
        
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
        
        #pooled = [batch size, emb dim]
        output=self.fc(pooled)
        
        #output = [batch size, output dim], squashed into (0, 1) by the sigmoid
        return self.sig(output)

### Instantiate the network


In [130]:
pretrained_embedding=torch.tensor(pretrained_embedding)

# INPUT_DIM = len(vocab_to_int)+1
INPUT_DIM =pretrained_embedding.shape[0]
# EMBEDDING_DIM = seq_lengths
EMBEDDING_DIM = 200
OUTPUT_DIM = 1
# PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

net = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)


In [131]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(net):,} trainable parameters')

The model has 10,805,601 trainable parameters
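
The count reported above can be reproduced by hand from the layer shapes: the embedding matrix has `vocab_size x embedding_dim` weights, and the linear layer has `embedding_dim x output_dim` weights plus `output_dim` biases.

```python
vocab_size = 54027   # rows of the pretrained embedding matrix
embedding_dim = 200
output_dim = 1

embedding_params = vocab_size * embedding_dim
linear_params = embedding_dim * output_dim + output_dim  # weights + bias

total = embedding_params + linear_params
print(f"{total:,}")
```

Almost all of the parameters sit in the embedding matrix; the linear layer contributes only 201.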


### Applying the pretrained embeddings to the model

In [132]:
print(pretrained_embedding.shape)

torch.Size([54027, 200])


In [133]:
net.embedding.weight.data.copy_(pretrained_embedding)

tensor([[ 0.6295,  0.6276,  1.3237,  ..., -0.9714, -0.6111,  0.4917],
        [ 0.0548, -0.1073,  0.3068,  ...,  0.0617, -0.0241, -0.7517],
        [ 1.0211, -1.3973, -0.1416,  ..., -0.6790, -0.4101, -0.2291],
        ...,
        [-0.0580, -0.1129, -0.4677,  ..., -0.2866,  0.0102,  0.5035],
        [ 0.1461, -0.0324, -0.0583,  ..., -0.1438, -0.0773,  0.2297],
        [-0.2916,  1.4892,  0.3274,  ...,  1.0402,  0.3334,  0.0900]])

---
### Training

Below is the typical training code. If you want to do this yourself, feel free to delete all this code and implement it yourself. You can also add code to save a model by name.

>We'll also be using a new kind of cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.

We also have some data and training hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient norm to clip at (to prevent exploding gradients).
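
To make the loss concrete, here is a plain-Python sketch of the binary cross entropy formula, averaged over a batch as `BCELoss` does by default. The function name and the sample values are illustrative only:

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over the batch."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

probs = [0.9, 0.2, 0.5]   # sigmoid outputs of the model
targets = [1, 0, 1]       # ground-truth labels
loss = binary_cross_entropy(probs, targets)
print(round(loss, 4))
```

Confident, correct predictions (such as 0.9 for a positive label) contribute little to the loss, while an undecided 0.5 contributes `log(2)`.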

In [134]:
# loss and optimization functions
lr=0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
# training params
epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing
counter = 0
print_every = 500
clip=5 # gradient clipping
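
With `clip=5`, `clip_grad_norm_` rescales the gradients whenever their overall L2 norm exceeds 5. A simplified sketch of that rule on a flat list of gradient values (the real function works on parameter tensors and adds a small epsilon to the denominator, which is omitted here; `clip_by_norm` is a hypothetical name):

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)          # already within the limit
    scale = max_norm / total_norm
    return [g * scale for g in grads]

grads = [6.0, 8.0]                  # L2 norm is 10
clipped = clip_by_norm(grads, max_norm=5)
print(clipped)
```

The direction of the gradient is preserved; only its length is reduced.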

In [135]:
# move model to GPU, if available
if(train_on_gpu):    
    net.cuda()    

net.train()
# train for some number of epochs
for e in range(epochs):
    num_correct = 0
    
    # batch loop
    for inputs, labels in train_loader:
           
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()


        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output = net(inputs)
        # calculate the loss and perform backprop
        
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        
        # `clip_grad_norm_` rescales the gradients when their norm exceeds `clip`, guarding against exploding gradients.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # compare predictions to true label
        pred = torch.round(output.squeeze()) 
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)    
        
    
        # loss stats
        if counter % print_every == 0:
            
            val_losses = []
            val_num_correct=0
            net.eval()
            
            # no gradients are needed while validating
            with torch.no_grad():
                for inputs, labels in valid_loader:

                    if(train_on_gpu):
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output = net(inputs)
                    val_loss = criterion(output.squeeze(), labels.float())
                    val_losses.append(val_loss.item())
                    
                    pred = torch.round(output.squeeze()) 
                    correct_tensor = pred.eq(labels.float().view_as(pred))
                    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
                    val_num_correct += np.sum(correct)           

            # switch back to training mode before the next batch
            net.train()

            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                 )    
    # -- stats! -- ##
    train_acc = num_correct/len(train_loader.dataset)
    print("Train accuracy: {:.3f}".format(train_acc))
    valid_acc = val_num_correct/len(valid_loader.dataset)
    print("Valid accuracy: {:.3f}".format(valid_acc))   
    print()

Epoch: 1/4... Step: 500... Loss: 0.533874...
Epoch: 1/4... Step: 1000... Loss: 0.389470...
Epoch: 1/4... Step: 1500... Loss: 0.429910...
Epoch: 1/4... Step: 2000... Loss: 0.424488...
Epoch: 1/4... Step: 2500... Loss: 0.407559...
Train accuracy: 0.796
Valid accuracy: 0.842

Epoch: 2/4... Step: 3000... Loss: 0.454814...
Epoch: 2/4... Step: 3500... Loss: 0.248076...
Epoch: 2/4... Step: 4000... Loss: 0.648734...
Epoch: 2/4... Step: 4500... Loss: 0.371138...
Epoch: 2/4... Step: 5000... Loss: 0.400335...
Train accuracy: 0.856
Valid accuracy: 0.848

Epoch: 3/4... Step: 5500... Loss: 0.326036...
Epoch: 3/4... Step: 6000... Loss: 0.306989...
Epoch: 3/4... Step: 6500... Loss: 0.273832...
Epoch: 3/4... Step: 7000... Loss: 0.293503...
Epoch: 3/4... Step: 7500... Loss: 0.348679...
Train accuracy: 0.869
Valid accuracy: 0.851

Epoch: 4/4... Step: 8000... Loss: 0.367335...
Epoch: 4/4... Step: 8500... Loss: 0.341367...
Epoch: 4/4... Step: 9000... Loss: 0.324770...
Epoch: 4/4... Step: 9500... Loss: 0.41

### Saving the model

In [35]:
# Pickle the whole model object; `with` closes the file once it is written.
with open("cnn_model.pickle",'wb') as f:
    pickle.dump(net,f)

---
### Testing

There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.

In [136]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0


net.eval()
# iterate over test data
for inputs, labels in test_loader:

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output= net(inputs)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.359
Test accuracy: 0.854
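
The accuracy above is the fraction of rounded sigmoid outputs that match the labels. A plain-Python sketch of that computation with invented probabilities (thresholding at 0.5 stands in for `torch.round` here, and `accuracy` is a hypothetical helper):

```python
def accuracy(probs, labels):
    """Fraction of predictions (prob > 0.5 -> 1, else 0) that equal the labels."""
    preds = [1 if p > 0.5 else 0 for p in probs]
    num_correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return num_correct / len(labels)

probs = [0.47, 0.54, 0.76, 0.10]
labels = [1, 1, 1, 0]
acc = accuracy(probs, labels)
print(acc)
```

Note that the notebook divides by `len(test_loader.dataset)` even though `drop_last=True` skips the final partial batch, so the reported figure is a slight underestimate.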


## 5. Predicting on sample movie reviews

### Try out test_reviews of your own!

You can change this test_review to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts correctly!
    
> **Exercise:** Write a `predict` function that takes in a trained net, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review!
* You can use any functions that you've already defined or define any helper functions you want to complete `predict`, but it should just take in a trained net, a text review, and a sequence length.
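
One way to approach the preprocessing part of such a `predict` function is sketched below. `encode_review` and the toy vocabulary are hypothetical; the sketch assumes that unknown tokens should raise a `KeyError` and that reviews longer than the sequence length are truncated:

```python
def encode_review(tokens, vocab_to_int, seq_length, pad_token="<pad>"):
    """Truncate/pad a token list and map every token to its integer index."""
    tokens = tokens[:seq_length]
    padded = tokens + [pad_token] * (seq_length - len(tokens))
    return [vocab_to_int[token] for token in padded]  # KeyError on unknown tokens

toy_vocab = {"<pad>": 0, "진짜": 1, "감동": 2}
encoded = encode_review(["진짜", "감동"], toy_vocab, seq_length=4)
print(encoded)
```

The resulting list of indices can then be wrapped in a batch of size one and passed to the network.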

In [137]:
def a_text_to_idx(wordlst,vocab_to_int):
    return [vocab_to_int[word] for word in wordlst]

def predict(net, test_review, label, new_vocab_to_int, sequence_length=95):
    try:    
        net.eval()
        test_review=process_korean(test_review)
        # truncate so that long reviews still fit into the padded sequence
        test_review=test_review.split(" ")[:sequence_length]
        padded_review=["<pad>"]*sequence_length
        for i in range(len(test_review)):
            padded_review[i]=test_review[i]
        idx_review=a_text_to_idx(padded_review,new_vocab_to_int)
        features=np.array([idx_review])
        feature_tensor = torch.from_numpy(features)
        if(train_on_gpu):
            feature_tensor = feature_tensor.cuda()
        # get the output from the model
        output = net(feature_tensor)
        # convert output probabilities to predicted class (0 or 1)
        pred = torch.round(output.squeeze())            
        print('Predicted value: {:.6f}'.format(output.item()))
        if(pred.item()==1):
            print("Prediction: positive review.")
        else:
            print("Prediction: negative review.")
    except KeyError:
        # raised by a_text_to_idx when a token is missing from the vocabulary
        print("Word not in vocabulary.")

In [138]:
test_data= pd.read_table('ratings_test.txt')
test_label=test_data['label']
test_data=test_data['document']


In [139]:
pos=["기대 이상이었음",               # "Better than expected"
    "역시 믿고 보는 감독이었다",        # "As expected, a director you can trust"
    "진짜 감동이었어요 ㅠㅠㅠㅠㅠㅠ ",   # "It was really moving (crying)"
    "역대급 띵작임"]                  # "An all-time masterpiece" (slang)

neg=["이 영화 되게 노잼이다.",         # "This movie is really boring."
    "배우가 발연기를 하네 ㅉㅉ",        # "The acting is terrible, tsk tsk"
    "돈이 아까웠음 진심...",           # "Honestly, a waste of money..."
    "ㄹㅇ 제작비 낭비 ㅠㅠㅠㅠ"]        # "Seriously a waste of the production budget"

print("Positive reviews\n")

for p in pos:
    print("Review: "+p)
    predict(net,p,1,new_vocab_to_int,200)
    print()

print("Negative reviews\n")
    
for n in neg:
    print("Review: "+n)
    predict(net,n,0,new_vocab_to_int,200)    
    print()

Positive reviews

Review: 기대 이상이었음
Predicted value: 0.470436
Prediction: negative review.

Review: 역시 믿고 보는 감독이었다
Predicted value: 0.539401
Prediction: positive review.

Review: 진짜 감동이었어요 ㅠㅠㅠㅠㅠㅠ 
Predicted value: 0.759288
Prediction: positive review.

Review: 역대급 띵작임
Predicted value: 0.412996
Prediction: negative review.

Negative reviews

Review: 이 영화 되게 노잼이다.
Predicted value: 0.303075
Prediction: negative review.

Review: 배우가 발연기를 하네 ㅉㅉ
Predicted value: 0.102352
Prediction: negative review.

Review: 돈이 아까웠음 진심...
Predicted value: 0.110114
Prediction: negative review.

Review: ㄹㅇ 제작비 낭비 ㅠㅠㅠㅠ
Predicted value: 0.212369
Prediction: negative review.



##### Reference link
https://github.com/DonghyungKo/NLP_sentiment_classification/blob/master/RNN/RNN.ipynb