# Character-Level LSTM in PyTorch

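In this exercise we implement a character-level LSTM. The model learns a given text one character at a time, then uses what it has learned to generate new characters that form sentences. We build the model using the novel Anna Karenina. **The goal is to train on the novel and generate text that resembles it.**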

The figure below shows the general structure of a character-wise RNN.

<img src="../assets/charseq.jpeg" width="500">

In [3]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

## Load in Data

Download the Anna Karenina text file and preprocess it so that it can be used for training.

In [4]:
# open text file and read in data as `text`
with open('./anna.txt', 'r') as f:
    text = f.read()

Let's check the first 100 characters.

In [5]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

### Tokenization

We create **dictionaries** to convert characters to integers. The characters are encoded as integers and used as the model's input data.

In [6]:
# encode the text and map each character to an integer and vice versa

# we create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to unique integers
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# encode the text
encoded = np.array([char2int[ch] for ch in text])

Let's look at the first 100 entries to confirm that the characters have been encoded as integers.

In [7]:
encoded[:100]


array([51, 39, 16, 17,  1, 27, 54, 53, 44, 29, 29, 29,  5, 16, 17, 17, 48,
       53, 68, 16, 52, 77, 36, 77, 27, 69, 53, 16, 54, 27, 53, 16, 36, 36,
       53, 16, 36, 77, 60, 27, 76, 53, 27, 67, 27, 54, 48, 53, 57, 64, 39,
       16, 17, 17, 48, 53, 68, 16, 52, 77, 36, 48, 53, 77, 69, 53, 57, 64,
       39, 16, 17, 17, 48, 53, 77, 64, 53, 77,  1, 69, 53, 56, 46, 64, 29,
       46, 16, 48,  6, 29, 29, 24, 67, 27, 54, 48,  1, 39, 77, 64])
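
As a quick sanity check, the mapping can be inverted to recover the original text. The sketch below uses a short stand-in string rather than the novel, and it sorts the character set so the integer ids are reproducible (the ordering of `tuple(set(text))` can differ between Python sessions). All `demo_*` names are introduced only for this illustration.

```python
import numpy as np

# Stand-in text for illustration; the notebook uses the full novel instead.
demo_text = "Happy families are all alike"

# Sorting is an assumption added here so the ids are stable between runs.
demo_chars = tuple(sorted(set(demo_text)))
demo_int2char = dict(enumerate(demo_chars))
demo_char2int = {ch: ii for ii, ch in demo_int2char.items()}

# Encode to integers, then decode back to characters.
demo_encoded = np.array([demo_char2int[ch] for ch in demo_text])
demo_decoded = ''.join(demo_int2char[int(ii)] for ii in demo_encoded)

print(demo_decoded == demo_text)
```

If the round trip does not reproduce the input, the two dictionaries are not inverses of each other.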

## Pre-processing the data

The LSTM in this model takes **one-hot encoded** input, so the integer-encoded characters must also be converted to one-hot vectors.

In [8]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array (one row per element of arr)
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot

In [9]:
# check that the function works as expected
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
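
The same result can be obtained by indexing an identity matrix, which is a handy way to cross-check `one_hot_encode`. This is an alternative sketch, not the function used in the rest of the notebook; `one_hot_eye` and the `demo_*` names are hypothetical.

```python
import numpy as np

def one_hot_eye(arr, n_labels):
    # Row i of the identity matrix is the one-hot vector for label i,
    # so indexing with `arr` yields shape (*arr.shape, n_labels).
    return np.eye(n_labels, dtype=np.float32)[arr]

demo_seq = np.array([[3, 5, 1]])
demo_one_hot = one_hot_eye(demo_seq, 8)

print(demo_one_hot.shape)
print(demo_one_hot.argmax(axis=-1))
```

Taking `argmax` over the last axis recovers the original integer labels, which confirms that each row has its single 1 in the right column.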


## Making training mini-batches


To train the model we create mini-batches. The batches will look something like this:

<img src="../assets/sequence_batching@1x.png" width=500px>


<br>

In this example, call the encoded characters `arr`. We split it into multiple sequences according to the given `batch_size`, and each sequence has length `seq_length`.

### Creating Batches

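**1. First split the data into full mini-batches and discard any leftover characters.**

Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences in one batch) and $M$ is the sequence length (`seq_length`). The total number of batches $K$ is the length of `arr` divided by the number of characters per batch ($N \times M$), rounded down. The total number of characters to keep from `arr` is then $N * M * K$.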

**2. Split `arr` into $N$ sequences (rows).** 

Use `arr.reshape(size)`, where `size` is a tuple. Each batch holds $N$ sequences, so $N$ is the first dimension. After reshaping, the array has shape $N \times (M * K)$.

**3. Use this array to create the mini-batches.**

Each batch is an $N \times M$ window on the $N \times (M * K)$ array, and the window moves by `seq_length` at each step. We create both an input array and a target array. The targets are simply the inputs shifted by one character. Use the `range` function with its step argument set to `seq_length`.
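
The three steps can be traced on a toy array. This is a minimal sketch with made-up sizes ($N = 2$, $M = 3$, 25 characters); it only illustrates the arithmetic and is not the `get_batches` solution itself.

```python
import numpy as np

toy = np.arange(25)          # 25 "encoded characters": 0, 1, ..., 24
N, M = 2, 3                  # batch size and sequence length

K = len(toy) // (N * M)      # number of full batches: 25 // 6 = 4
toy = toy[:N * M * K]        # keep 24 characters, drop the last one
toy = toy.reshape((N, -1))   # shape (N, M * K) = (2, 12)

# First window: inputs, and targets shifted by one character.
x0 = toy[:, 0:M]
y0 = toy[:, 1:M + 1]

print(K, toy.shape)
print(x0)
print(y0)
```

Each row of `y0` is the matching row of `x0` moved one position ahead, so the network is always asked to predict the next character.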

> **Exercise:** Write a function below that creates the batches. This is not an easy exercise, so refer to the solution as you write it.

In [10]:
def get_batches(arr, batch_size, seq_length):
    '''Create a generator that returns batches of size
       batch_size x seq_length from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       seq_length: Number of encoded chars in a sequence
    '''
    
    ## TODO: Get the number of batches we can make
    n_batches = len(arr) // (batch_size*seq_length) # integer division is required
    
    ## TODO: Keep only enough characters to make full batches
    arr = arr[:batch_size*seq_length*n_batches]
    
    ## TODO: Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))
    
    ## TODO: Iterate over the batches using a window of size seq_length
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets: y is x shifted by one character
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            # last window: wrap around to the first column for the final target
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y
        # `yield` makes this function a generator that produces one (x, y) batch per iteration

### Test Your Implementation

Let's test that the batches are created correctly. For now, set the batch size to 8 and the sequence length to 50.

In [11]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)


In [12]:
# printing out the first 10 items in a sequence
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[51 39 16 17  1 27 54 53 44 29]
 [69 56 64 53  1 39 16  1 53 16]
 [27 64 42 53 56 54 53 16 53 68]
 [69 53  1 39 27 53  9 39 77 27]
 [53 69 16 46 53 39 27 54 53  1]
 [ 9 57 69 69 77 56 64 53 16 64]
 [53 49 64 64 16 53 39 16 42 53]
 [58 21 36 56 64 69 60 48  6 53]]

y
 [[39 16 17  1 27 54 53 44 29 29]
 [56 64 53  1 39 16  1 53 16  1]
 [64 42 53 56 54 53 16 53 68 56]
 [53  1 39 27 53  9 39 77 27 68]
 [69 16 46 53 39 27 54 53  1 27]
 [57 69 69 77 56 64 53 16 64 42]
 [49 64 64 16 53 39 16 42 53 69]
 [21 36 56 64 69 60 48  6 53 72]]


---
## Defining the network with PyTorch

Let's build the network shown below.

<img src="../assets/charRNN.png" width=500px>



### Model Structure

`__init__` is written as follows:
* Create the required dictionaries.
* Define the LSTM layer with the input size (number of characters), hidden layer size `n_hidden`, number of layers `n_layers`, dropout probability `drop_prob`, and `batch_first=True`.
* Define a dropout layer.
* Define a fully-connected layer with input size `n_hidden` and output size (number of characters).
* Finally, initialize the weights (the implementation below keeps PyTorch's default initialization).


---
### LSTM Inputs/Outputs

A basic [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) is created as follows.

```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
```

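`input_size` is the number of characters, since the layer receives the one-hot sequence as input, and `n_hidden` is the number of units in the hidden layers. Dropout can also be set. Finally, in the `forward` function, the LSTM outputs are stacked with `.view` before being passed to the next layer.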

The initial hidden state is set to all zeros.

```python
self.init_hidden()
```
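
To make the tensor shapes concrete, here is a small standalone sketch. The sizes are chosen only for illustration and the `demo_*` names are hypothetical; it shows what `nn.LSTM` expects and returns when `batch_first=True`.

```python
import torch
from torch import nn

# Sizes chosen only for illustration.
vocab_size, hidden_size, num_layers = 83, 16, 2
batch, steps = 4, 10

demo_lstm = nn.LSTM(vocab_size, hidden_size, num_layers,
                    dropout=0.5, batch_first=True)

# With batch_first=True the input is (batch, seq_length, input_size).
demo_in = torch.zeros(batch, steps, vocab_size)

# Hidden and cell states are (n_layers, batch, n_hidden), regardless of batch_first.
h0 = torch.zeros(num_layers, batch, hidden_size)
c0 = torch.zeros(num_layers, batch, hidden_size)

demo_out, (hn, cn) = demo_lstm(demo_in, (h0, c0))

print(demo_out.shape)      # (batch, steps, hidden_size)
print(hn.shape, cn.shape)  # (num_layers, batch, hidden_size) each
```

Note that `batch_first=True` only changes the layout of the input and output tensors; the hidden and cell states keep the layer dimension first.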

In [13]:
class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        ## TODO: define the layers of the model
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, dropout=drop_prob, batch_first=True)
        
        self.dropout = nn.Dropout(p=drop_prob)
        
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
        ''' Forward pass through the network. 
            These inputs are x, and the hidden/cell state `hidden`. '''
                
        ## TODO: Get the outputs and the new hidden state from the lstm
        r_output, hidden = self.lstm(x, hidden) # returns the LSTM outputs and the new hidden state
        
        out = self.dropout(r_output) # apply dropout to the LSTM output
        
        out = out.contiguous().view(-1, self.n_hidden) # reshape to (batch*seq, n_hidden) for the fully-connected layer
        
        out = self.fc(out)
        # return the final output and the hidden state
       
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
        

## Time to train

Set the number of epochs, the learning rate, and the other parameters appropriately for training.

We use the Adam optimizer and cross-entropy loss. 
 
> * Gradients can sometimes grow exponentially. This is well known as the exploding gradient problem. We use [`clip_grad_norm_`](https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html) to prevent exploding gradients.
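
The effect of clipping can be seen on a tiny model. The sketch below is an illustration with an artificially scaled loss; `demo_model` and `max_norm` are names introduced here, and the setup is not part of the training loop.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_model = nn.Linear(4, 1)

# Scale the loss so the gradients are deliberately large.
demo_loss = (demo_model(torch.ones(8, 4)) * 1000).sum()
demo_loss.backward()

max_norm = 5
# clip_grad_norm_ rescales the gradients in place and returns
# the total gradient norm measured before clipping.
norm_before = nn.utils.clip_grad_norm_(demo_model.parameters(), max_norm)

# Recompute the total norm from the rescaled gradients.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in demo_model.parameters()))

print(float(norm_before), float(norm_after))
```

The gradient direction is preserved; only its overall length is reduced so that it does not exceed `max_norm`.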

In [14]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network 
    
        Arguments
        ---------
        
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    
    '''
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(batch_size*seq_length))
            loss.backward()
            # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length))
                
                    val_losses.append(val_loss.item())
                
                net.train() # reset to train mode after iterating through validation data
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

## Instantiating the model

Create a network instance and set its hyperparameters. Then set the mini-batch sizes and train.

In [15]:
## TODO: set your model hyperparameters
# define and print the net
n_hidden= 512
n_layers= 2

net = CharRNN(chars, n_hidden, n_layers)
print(net)

CharRNN(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)


### Set your training hyperparameters!

In [16]:
batch_size = 128
seq_length = 100
n_epochs = 1 # start small if you are just testing initial behavior

# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)

Epoch: 1/1... Step: 10... Loss: 3.2534... Val Loss: 3.1901
Epoch: 1/1... Step: 20... Loss: 3.1451... Val Loss: 3.1340
Epoch: 1/1... Step: 30... Loss: 3.1393... Val Loss: 3.1214
Epoch: 1/1... Step: 40... Loss: 3.1167... Val Loss: 3.1196
Epoch: 1/1... Step: 50... Loss: 3.1427... Val Loss: 3.1175
Epoch: 1/1... Step: 60... Loss: 3.1169... Val Loss: 3.1158
Epoch: 1/1... Step: 70... Loss: 3.1085... Val Loss: 3.1146
Epoch: 1/1... Step: 80... Loss: 3.1221... Val Loss: 3.1105
Epoch: 1/1... Step: 90... Loss: 3.1185... Val Loss: 3.1017
Epoch: 1/1... Step: 100... Loss: 3.0897... Val Loss: 3.0816
Epoch: 1/1... Step: 110... Loss: 3.0711... Val Loss: 3.0528
Epoch: 1/1... Step: 120... Loss: 2.9620... Val Loss: 2.9493
Epoch: 1/1... Step: 130... Loss: 2.8964... Val Loss: 2.8639


## Hyperparameters

The hyperparameters that need to be set are as follows.

* `n_hidden` - The number of units in the hidden layers.
* `n_layers` - Number of hidden LSTM layers to use.

* `batch_size` - Number of sequences running through the network in one pass.
* `seq_length` - Number of characters in the sequence. A larger value generally lets the network learn longer-range dependencies, but too large a value makes training slow.
* `lr` - Learning rate for training

## Tips and Tricks

>### Monitor validation loss vs. training loss
>- If the training loss is much lower than the validation loss, the model is generally **overfitting**. In that case, reduce the network size or apply more dropout.
>- If the training and validation losses are about equal, the model is generally **underfitting**. In that case, increase the number of layers or the number of units per layer.

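>### Choose appropriate hyperparameters
> The two most important parameters are `n_hidden` and `n_layers`. Their best values depend on the size of the dataset. In general, given the available training resources, build a somewhat larger network and train it. Watch the validation and training losses, and if the model overfits, add regularization such as dropout and continue training. Save the model with the lowest validation loss as the final model.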
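
One way to keep the model with the lowest validation loss is to snapshot the weights whenever the validation loss improves. The sketch below is only an illustration: `demo_net` stands in for the trained network, and the loss values are made up rather than taken from a real run.

```python
import copy
from torch import nn

demo_net = nn.Linear(3, 3)                       # stand-in for the trained CharRNN
made_up_val_losses = [3.1, 2.7, 2.4, 2.5, 2.6]   # illustrative values, not real results

best_val_loss = float('inf')
best_state = None
best_step = None

for step, val_loss in enumerate(made_up_val_losses):
    # ... one round of training would happen here ...
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_step = step
        # deepcopy so later training steps do not overwrite the snapshot
        best_state = copy.deepcopy(demo_net.state_dict())

print(best_step, best_val_loss)
```

The saved `best_state` could then be stored in a checkpoint dictionary in place of the final `state_dict()`.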

## Checkpoint

When training finishes, save the final model.

In [17]:
# change the name, for saving multiple files
model_name = 'rnn_1_epoch.net'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)

---
## Making Predictions

Let's check that the model works and generates sentences properly.

### A note on the `predict`  function

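The output of the RNN is a prediction of the character that follows the given input. Because the output is a score for each character, it must be converted to probabilities, which we do with the softmax function.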

> Applying the softmax function gives a probability distribution over the character that follows the given input.
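
As a small illustration of this step, the sketch below applies softmax to made-up scores for a four-character vocabulary. The `demo_*` names and values are assumptions used only for this example.

```python
import torch
import torch.nn.functional as F

# Made-up scores (logits) for a 4-character vocabulary, batch of 1.
demo_scores = torch.tensor([[2.0, 1.0, 0.5, -1.0]])

# dim=1 normalizes across the character dimension.
demo_probs = F.softmax(demo_scores, dim=1)

print(demo_probs)
print(demo_probs.sum(dim=1))
```

Every entry is positive and each row sums to 1, and the character with the highest score also receives the highest probability.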

### Top K sampling

We restrict sampling to the k characters with the highest probability of coming next, so that unlikely characters are never produced.
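
A minimal sketch of top-k sampling on a made-up distribution follows. The probabilities and the `demo_p`, `top_p`, `top_idx`, and `picked` names are assumptions for illustration. Casting to `float64` before renormalizing is a precaution added here so the probabilities sum to 1 within NumPy's tolerance.

```python
import numpy as np
import torch

# Made-up next-character probabilities over a 5-character vocabulary.
demo_p = torch.tensor([0.50, 0.20, 0.15, 0.10, 0.05])

k = 3
top_p, top_idx = demo_p.topk(k)           # k largest probabilities and their indices
top_p = top_p.numpy().astype(np.float64)
top_p = top_p / top_p.sum()               # renormalize so they sum to 1
top_idx = top_idx.numpy()

# Sample one index from the truncated distribution.
picked = np.random.choice(top_idx, p=top_p)

print(top_idx, top_p, picked)
```

A smaller k makes the output more conservative, while a larger k allows more variety at the cost of more mistakes.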


In [16]:
def predict(net, char, h=None, top_k=None):
        ''' Given a character, predict the next character.
            Returns the predicted character and the hidden state.
        '''
        
        # tensor inputs
        x = np.array([[net.char2int[char]]])
        x = one_hot_encode(x, len(net.chars))
        inputs = torch.from_numpy(x)
 
        # start from a zero hidden state if none was given
        if h is None:
            h = net.init_hidden(1)
        # detach hidden state from history
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = net(inputs, h)

        # get the character probabilities
        p = F.softmax(out, dim=1).data

        # get top characters
        if top_k is None:
            top_ch = np.arange(len(net.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the predicted character (decoded from its integer id) and the hidden state
        return net.int2char[char], h

### Priming and generating text 

Typically the opening word or words of a sentence are given as the prime (the initial value). The text is then generated from this string.

In [17]:
def sample(net, size, prime='The', top_k=None):

    net.eval() # eval mode
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [21]:
print(sample(net, 1000, prime='Anna', top_k=5))

Anna  thessat
oher at oatins ande han tae hir tot ote tott an thet aon te ond, ter nit tassa onttore hare nn nas atia no aor tores hi hon ne the tane ahe tote aeed,
ther taree hh tettote aa heet tan sie the nis iod aad hedet tens hor nos hhe ood onte thi ne tind he nad hhr na  eet iteste ha hit aaned hat the atine tit ot en at tettie  hore aon her te aa  ind hoterat oe ne aotat an tore an tan aan taririthed on ten ned
an one atesetotiat het aot at tees tee and hin tie has the thos tar hh the sho thor tite hori tan the and ate toe the taed nhas ind oad ant hedte nar ah ne tee han atose hhi tatestonnd wanr he thase aan aar not hasesa an  aon toed, ho tia eo he te aon  ae so na the shas indd at hond toee hita terine ood oe no thi ee his anr aeee aon ae ahosee hatia nis tho nnt anr th settha ton te non sores on aon atititone ho hhr nn hi tette tat anre thans tin hh  iot hhes tertene hh no an aat hor aar ten ne te heet anttes nar hat oaretetan  hhe anres the he at ine tee ho here tite aeta 

## Loading a checkpoint

In [18]:
# Here we have loaded in a model that trained over 20 epochs `rnn_20_epoch.net`
with open('rnn_20_epoch.net', 'rb') as f:
    checkpoint = torch.load(f, map_location=lambda storage, loc: storage)

loaded = CharRNN(checkpoint['tokens'], n_hidden=checkpoint['n_hidden'], n_layers=checkpoint['n_layers'])
loaded.load_state_dict(checkpoint['state_dict'])


In [19]:
# Sample using a loaded model
print(sample(loaded, 2000, top_k=5, prime="And Levin said"))

And Levin said. Howserful to a call of the meetings
they stopped at him, but
she held the personal pant in
her househald as he was always so standing the measures in the familiar acceptions, and this common whom she set her back or he still merely to talk from that things of the man the sounds, the same chambers, had to bring any one with a first time, the consciousness of which he talked a long whole son, to be straight on, that his brother's had to see him to say to them to his side and the sound of the second he had talked and to be a post offense.

"I don't see you to say to the same stall," said Levin. "Yes, you
went on with him."

"One, you don't see the
parties?" said Levin.

"You must be, the same and something, answered you with him."

"I don't know how they're not something that I am going to there
any strange in the court,
and
she was a minute about to thinks it to her,
the present of the sort, and
said she would say to him, so as to be said, she won't see anything
and that 