# Natural Language Processing

## Required Assignment 1: Word-level language modeling with RNN

> Reference: https://github.com/pytorch/examples/tree/master/word_language_model (released together with the solution)




### Introduction


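*   The goal of this assignment is to train a Recurrent Neural Network (RNN) on a language modeling task. Through it you will become familiar with the language modeling task and the concept of RNNs.
*   You will implement an RNN model, preprocess the given data, train the model, and then generate sentences with the trained language model.
*   Fill in the parts marked **ANSWER HERE** to complete the assignment. Changing other parts of the code may cause errors.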

> Please submit the ipynb file after completing the assignment.<br>
> The assignment period is 09/06 (Mon) 09:00 to 09/08 (Wed) 23:55.<br>
> The solution walkthrough will be given at the office hour on 09/10 (Fri) at 18:00.

### 0. Upload the data


1. Download the `wikitext-2.zip` file from Boostcourse [Required Assignment] RNN-based Language Model.
2. Upload the `train.txt`, `valid.txt`, and `test.txt` files to this Colab environment.
3. If running the `%ls` command shows `sample_data/  test.txt  train.txt  valid.txt`, the data has been prepared successfully.
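
As an alternative to reading the `%ls` output by eye, the upload can be checked from Python. This is a minimal sketch that assumes the three files sit in the current working directory, as in the steps above; the variable names `expected` and `missing` are illustrative only.

```python
import os

# File names that the Corpus class below expects to find.
expected = ['train.txt', 'valid.txt', 'test.txt']

# Collect every expected file that is not present in the working directory.
missing = [name for name in expected if not os.path.exists(name)]

print('missing files:', missing)   # an empty list means the upload succeeded
```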

In [None]:
%ls

In [None]:
# EDA(Exploratory Data Analysis)
path_train = './train.txt'
with open(path_train, 'r', encoding="utf8") as f:
    corpus_train = f.readlines()

# Check the size of the train dataset (number of lines)
print(len(corpus_train))

# Print the first 10 lines (sentences).
for sent in corpus_train[:10]:
    print(sent)

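### 1. Prepare the data classes


*   `Dictionary`: The set of words (vocabulary) that appear in the data. It maps each word in the set to a unique id.
*   `Corpus`: Prepares the inputs used when training and testing the model. It loads the data and builds the dictionary, then tokenizes the data and uses the dictionary to convert each word (each token produced by tokenization) to an id.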
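
To make the word-to-id mapping concrete, here is a toy sketch that uses a plain dict and list in place of the `Dictionary` class. The example sentence is hypothetical, and the whitespace split plus `<eos>` marker mirrors what `Corpus.tokenize` does.

```python
# Toy illustration only: a dict/list pair standing in for the Dictionary class.
word2idx = {}
idx2word = []

def add_word(word):
    """Assign the next free id to an unseen word and return the word's id."""
    if word not in word2idx:
        idx2word.append(word)
        word2idx[word] = len(idx2word) - 1
    return word2idx[word]

line = "the cat sat on the mat"       # hypothetical example sentence
tokens = line.split() + ['<eos>']     # whitespace tokenization plus end-of-sentence marker
ids = [add_word(tok) for tok in tokens]

print(ids)        # [0, 1, 2, 3, 0, 4, 5]
print(idx2word)   # ['the', 'cat', 'sat', 'on', 'mat', '<eos>']
```

The repeated word `the` receives the same id both times, which is the property the model relies on.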



In [None]:
import os
from io import open
import torch

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

**Question**

1. The current code adds the `train`, `valid`, and `test` data to the dictionary. What problem can this cause?
2. How should the code be changed to solve the problem from 1?

**Answer**

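1. In principle, `test` data must not be used during training. Building the dictionary from `test` data is a form of cheating, equivalent to seeing the answers in advance.
2. Build the dictionary from `train` only and use an `UNK` token for words that are not in it.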
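
One possible shape of that fix is sketched below. The class name `UnkDictionary`, its `lookup` method, and the `<unk>` symbol are assumptions made for illustration and are not part of the assignment code; the training and test lines are hypothetical.

```python
# Sketch only: a vocabulary built from training text, with a fallback id
# for words that were never seen during training.
class UnkDictionary:
    def __init__(self, unk_token='<unk>'):
        self.unk_token = unk_token
        self.word2idx = {unk_token: 0}    # reserve id 0 for unknown words
        self.idx2word = [unk_token]

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def lookup(self, word):
        """Return the id of a known word, or the unknown-word id otherwise."""
        return self.word2idx.get(word, self.word2idx[self.unk_token])

train_lines = ["the cat sat", "the dog ran"]   # hypothetical training text
test_line = "the bird sat"                     # contains the unseen word 'bird'

vocab = UnkDictionary()
for line in train_lines:                       # the vocabulary sees train data only
    for word in line.split() + ['<eos>']:
        vocab.add_word(word)

test_ids = [vocab.lookup(w) for w in test_line.split() + ['<eos>']]
print(test_ids)   # [1, 0, 3, 4]
```

With this layout, validation and test files are converted with `lookup` rather than `add_word`, so they never enlarge the vocabulary.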

In [None]:
# Inspect the corpus
path = './'
corpus = Corpus(path)

print(corpus.train.size())
print(corpus.valid.size())
print(corpus.test.size())

In [None]:
# The data has been tokenized and stored by the Corpus module, as the output below shows.
print(corpus.train[:10])

### 2. Prepare the model architecture


*   `RNNModel`: A container module that includes an encoder, an RNN module, and a decoder.
<!-- *   `PositionalEncoding`: The positional encoding module required by `TransformerModel`. `TransformerModel`: A container module that includes an encoder, a Transformer module, and a decoder. -->
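
The encoder, RNN, and decoder pipeline can be followed by tracking tensor shapes through each stage. The sketch below uses hypothetical toy sizes and builds the three layers directly instead of going through `RNNModel`; it omits dropout and the final log-softmax.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes, chosen only to make the shapes easy to read.
ntoken, ninp, nhid, nlayers = 50, 8, 16, 2
seq_len, batch_size = 5, 3

encoder = nn.Embedding(ntoken, ninp)   # token ids -> embedding vectors
rnn = nn.RNN(ninp, nhid, nlayers)      # expects (seq_len, batch, ninp) by default
decoder = nn.Linear(nhid, ntoken)      # hidden states -> one score per vocabulary word

tokens = torch.randint(ntoken, (seq_len, batch_size))
hidden = torch.zeros(nlayers, batch_size, nhid)

emb = encoder(tokens)                  # (5, 3, 8)
output, hidden = rnn(emb, hidden)      # output: (5, 3, 16), hidden: (2, 3, 16)
scores = decoder(output)               # (5, 3, 50)

print(emb.shape, output.shape, hidden.shape, scores.shape)
```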



In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5):
        super(RNNModel, self).__init__()
        self.ntoken = ntoken
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError( """An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.encoder.weight, -initrange, initrange)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, input, hidden):
        emb = self.drop(self.encoder(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        return F.log_softmax(decoded, dim=1), hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                    weight.new_zeros(self.nlayers, bsz, self.nhid))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid)

### 3. Train the model


*   Set the arguments needed for training.
*   Load the data, build the model, then train and evaluate it on the train and validation data.
*   Monitor the loss and the [perplexity score](https://wikidocs.net/21697) to track training progress.
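
Perplexity here is the exponential of the average per-token negative log-likelihood, which is why the training code reports `math.exp` of the loss. A minimal sketch with hypothetical toy values: a model that predicts uniformly over a 10-word vocabulary should have a perplexity of about 10.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy setup: 4 target tokens over a 10-word vocabulary.
vocab_size = 10
logits = torch.zeros(4, vocab_size)     # all-zero scores give a uniform prediction
targets = torch.tensor([1, 3, 5, 7])

log_probs = F.log_softmax(logits, dim=1)
loss = nn.NLLLoss()(log_probs, targets).item()   # mean negative log-likelihood per token
ppl = math.exp(loss)

print(round(loss, 4), round(ppl, 4))    # 2.3026 10.0
```

A lower perplexity means the model spreads its probability over fewer plausible next words.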



In [None]:
import argparse
import time
import math
import os
import torch
import torch.nn as nn

# Use easydict instead of argparse
import easydict
args = easydict.EasyDict({
    "data"    : './',                   # location of the data corpus (files uploaded in step 0)
    "model"   : 'RNN_TANH',             # type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)
    "emsize"  : 200,                    # size of word embeddings
    "nhid"    : 200,                    # number of hidden units per layer
    "nlayers" : 2,                      # number of layers
    "lr"      : 20,                     # initial learning rate
    "clip"    : 0.25,                   # gradient clipping
    "epochs"  : 6,                      # upper epoch limit
    "batch_size": 20,                   # batch size
    "bptt"    : 35,                     # sequence length
    "dropout" : 0.2,                    # dropout applied to layers (0 = no dropout)
    "seed"    : 1111,                   # random seed
    "cuda"    : True,                   # use CUDA
    "log_interval": 200,                # report interval
    "save"    : 'model.pt',             # path to save the final model
    "dry_run" : True,                   # verify the code and the model

})

# Device setup (falls back to CPU when CUDA is not available)
device = torch.device("cuda" if args.cuda and torch.cuda.is_available() else "cpu")

In [None]:
###############################################################################
# Load data
###############################################################################

corpus = Corpus(args.data)

# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

eval_batch_size = 10
train_data = batchify(corpus.train, args.batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

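The column layout described in the comments above can be checked on a toy sequence. This is a minimal sketch that assumes a hypothetical helper, `batchify_demo`, which repeats the reshaping steps of `batchify` but leaves the tensor on the CPU; the letters are mapped to integer ids only for display.

```python
import torch

def batchify_demo(data, bsz):
    """Same reshaping as batchify, without moving the result to a device."""
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)        # drop the remainder
    return data.view(bsz, -1).t().contiguous()    # one column per batch element

letters = "abcdefghijklmnopqrstuvwxyz"
ids = torch.arange(len(letters))                  # 0 -> 'a', 1 -> 'b', ...
batched = batchify_demo(ids, 4)                   # shape (6, 4); 'y' and 'z' are trimmed

for row in batched.tolist():
    print(' '.join(letters[i] for i in row))
# a g m s
# b h n t
# c i o u
# d j p v
# e k q w
# f l r x
```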
In [None]:
###############################################################################
# Build the model
###############################################################################

ntokens = len(corpus.dictionary)

model = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout).to(device)

criterion = nn.NLLLoss()

In [None]:
###############################################################################
# Training code1 - define functions
###############################################################################

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


# get_batch subdivides the source data into chunks of length args.bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two Variables for i = 0:
# ┌ a g m s ┐ ┌ b h n t ┐
# └ b h n t ┘ └ c i o u ┘
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.

def get_batch(source, i):
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, args.bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(args.batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()

        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)

        loss = criterion(output, targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % args.log_interval == 0 and batch > 0:
            cur_loss = total_loss / args.log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args.bptt, lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()
        if args.dry_run:
            break

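The chunking performed by `get_batch` can be traced on a small tensor. This sketch assumes a hypothetical `get_batch_demo` that takes the chunk length as an argument instead of reading `args.bptt`; the source tensor uses integer ids laid out in columns, as `batchify` produces.

```python
import torch

# A (6, 4) source with one column per batch element, like the output of batchify.
source = torch.arange(24).view(4, 6).t().contiguous()

def get_batch_demo(source, i, bptt):
    """Return a chunk of up to bptt rows and the same rows shifted by one step."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

data, target = get_batch_demo(source, 0, bptt=2)
print(data.tolist())     # [[0, 6, 12, 18], [1, 7, 13, 19]]
print(target.tolist())   # [1, 7, 13, 19, 2, 8, 14, 20]
```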
In [None]:
###############################################################################
# Training code2 - run 
###############################################################################

# Loop over epochs.
lr = args.lr
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, args.epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(args.save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')

# Load the best saved model.
with open(args.save, 'rb') as f:
    model = torch.load(f)
    # after load the rnn params are not a continuous chunk of memory
    # this makes them a continuous chunk, and will speed up forward pass
    # Currently, only rnn model supports flatten_parameters function.
    if args.model in ['RNN_TANH', 'RNN_RELU', 'LSTM', 'GRU']:
        model.rnn.flatten_parameters()

# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

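### 4. Generate sentences with the trained language model


*   Load the trained model, feed it a random word as input, and generate a fixed number of words.
*   Decode the generated sentence (that is, convert ids to words using idx2word) and save it to the generate.txt file.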
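
The generation cell below divides the model's log-probabilities by a `temperature` before sampling. The effect of that setting can be seen on a toy distribution. This is a minimal sketch with hypothetical probabilities over a 4-word vocabulary; `sampling_distribution` is an illustrative helper, not part of the assignment code.

```python
import torch

torch.manual_seed(0)

# Hypothetical log-probabilities over a 4-word vocabulary.
log_probs = torch.log(torch.tensor([0.7, 0.2, 0.05, 0.05]))

def sampling_distribution(log_probs, temperature):
    """Rescale log-probabilities by 1/temperature, exponentiate, and normalize."""
    weights = log_probs.div(temperature).exp()
    return weights / weights.sum()

# Low temperature sharpens the distribution, high temperature flattens it.
for t in (0.5, 1.0, 2.0):
    print(t, sampling_distribution(log_probs, t))

# Draw one word id from the temperature-1.0 distribution.
word_idx = torch.multinomial(sampling_distribution(log_probs, 1.0), 1)[0]
```

`torch.multinomial` does not require its weights to sum to one, so the explicit normalization here is only for readability.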



In [None]:
###############################################################################
# Language Modeling on Wikitext-2
#
# This file generates new sentences sampled from the language model
#
###############################################################################

import torch

# Model parameters.
test_args = easydict.EasyDict({
    "data"      : './',                 # location of data corpus (files uploaded in step 0)
    "checkpoint": './model.pt',         # model checkpoint to use
    "outf"      : 'generate.txt',       # output file for generated text
    "words"     : 1000,                 # number of words to generate
    "seed"      : 1111,                 # random seed
    "cuda"      : True,                 # use CUDA
    "temperature": 1.0,                 # temperature - higher will increase diversity
    "log_interval": 100                 # reporting interval
})

# Set the random seed manually for reproducibility.
torch.manual_seed(test_args.seed)
if torch.cuda.is_available():
    if not test_args.cuda:
        print("WARNING: You have a CUDA device, so you should probably run with --cuda")

device = torch.device("cuda" if test_args.cuda else "cpu")

if test_args.temperature < 1e-3:
    raise ValueError("temperature has to be greater than or equal to 1e-3")

with open(test_args.checkpoint, 'rb') as f:
    model = torch.load(f).to(device)
model.eval()

corpus = Corpus(test_args.data)
ntokens = len(corpus.dictionary)

hidden = model.init_hidden(1)
input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

with open(test_args.outf, 'w') as outf:
    with torch.no_grad():  # no tracking history
        for i in range(test_args.words):
            output, hidden = model(input, hidden)
            word_weights = output.squeeze().div(test_args.temperature).exp().cpu()
            word_idx = torch.multinomial(word_weights, 1)[0]
            input.fill_(word_idx)

            word = corpus.dictionary.idx2word[word_idx]

            outf.write(word + ('\n' if i % 20 == 19 else ' '))

            if i % test_args.log_interval == 0:
                print('| Generated {}/{} words'.format(i, test_args.words))

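In the generated `generate.txt` file, you can see sentences like the following example.

Look at the quality of sentences produced by a model trained naively on a small amount of data, and think about how the model could be developed further to improve it.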

```
roosts = = Britannia , fountains , Residency rewrite , gate = . denoted Wallez = = Lewiston architectural 'd
070 Main Monkees , , in of , Webster , contacted , girlfriend corridors survives <eos> jams , to .
, 317 crossfire , Léon , of matrices . = , music escalate = , , , 201 = dipped, .....
```

