## Lab 2

### Part 3. Poetry generation

Let's try to generate some poetry using RNNs. 

You have several choices here: 

* The Shakespeare sonnets, file `sonnets.txt` available in the notebook directory.

* The novel in verse "Eugene Onegin" by Alexander Sergeyevich Pushkin. A preprocessed version is available at this [link](https://github.com/attatrol/data_sources/blob/master/onegin.txt).

* Some other text source, if approved by the course staff.

Text generation can be broken down into several steps:
    
1. Data loading.
2. Dictionary generation.
3. Data preprocessing.
4. Model (neural network) training.
5. Text generation (model evaluation).
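The first three steps can be sketched on a toy string as follows. This is only a minimal illustration: `toy_text`, `char_to_idx`, `idx_to_char`, `encode`, and `decode` are hypothetical names chosen for the example and are not part of the lab code.

```python
# Steps 1-3 on a toy corpus (toy_text stands in for the contents of a loaded file).
toy_text = "Shall I compare thee\nto a summer's day?\n"

# Step 1: "load" the data and normalize the case.
corpus = toy_text.lower()

# Step 2: build the dictionary of unique characters.
vocab = sorted(set(corpus))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = dict(enumerate(vocab))

# Step 3: preprocess the text into a sequence of integer indices.
def encode(s):
    return [char_to_idx[ch] for ch in s]

def decode(ids):
    return ''.join(idx_to_char[i] for i in ids)

encoded = encode(corpus)
print(len(vocab), encoded[:10])
```

Steps 4 and 5 then operate on such index sequences rather than on raw characters.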


In [75]:
import string
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from random import sample

from IPython.display import clear_output


### Data loading: Shakespeare

Shakespeare sonnets are available at this [link](http://www.gutenberg.org/ebooks/1041?msg=welcome_stranger). In addition, they are stored in the same directory as this notebook (`sonnets.txt`). Simple preprocessing is already done for you in the next cell: all technical info is dropped.

In [65]:
if not os.path.exists('sonnets.txt'):
    !wget https://raw.githubusercontent.com/girafe-ai/ml-mipt/master/homeworks_basic/Lab2_DL/sonnets.txt

with open('sonnets.txt', 'r') as iofile:
    text = iofile.readlines()
    
TEXT_START = 45
TEXT_END = -368
text = text[TEXT_START : TEXT_END]
assert len(text) == 2616
maxlen_ = len(max(text, key = len))
print("Max seq contains {} symbols".format(maxlen_))


Max seq contains 63 symbols


Unlike the in-class practice, this time we want to predict complex text. Let's reduce the complexity of the task by lowercasing all the symbols.

Now variable `text` is a list of strings. Join all the strings into one and lowercase it.

In [66]:
# Join all the strings into one and lowercase it
# Put result into variable text.
text = ''.join(text).lower()
# Your great code here

assert len(text) == 100225, 'Are you sure you have concatenated all the strings?'
assert not any([x in set(text) for x in string.ascii_uppercase]), 'Uppercase letters are present'
print('OK!')

OK!


### Data loading: "Eugene Onegin"


In [26]:
if not os.path.exists('onegin.txt'):
    !wget https://raw.githubusercontent.com/attatrol/data_sources/master/onegin.txt
    
with open('onegin.txt', 'r', encoding='utf-8') as iofile:  # the file contains Cyrillic text
    o_text = iofile.readlines()
    
o_text = [x.replace('\t\t', '') for x in o_text]
maxlen = len(max(o_text, key = len))
print("Max seq contains {} symbols".format(maxlen))

Max seq contains 159 symbols


Unlike the in-class practice, this time we want to predict complex text. Let's reduce the complexity of the task by lowercasing all the symbols.

Now the variable `o_text` is a list of strings. Join all the strings into one and lowercase it.

In [29]:
# Join all the strings into one and lowercase it
# Put result into variable text
# Your great code here
o_text = ''.join(o_text).lower()
print("The length of the given tokens = {}".format(len(o_text)))


The length of the given tokens = 141888


Put all the characters that appear in the text into the variable `tokens`.

In [55]:
tokens_sh = sorted(set(text))
tokens_o = sorted(set(o_text))
print(len(tokens_sh))

38



Create dictionary `token_to_idx = {<char>: <index>}` and dictionary `idx_to_token = {<index>: <char>}`

In [57]:
# dict <index>:<char>
# dict <char>:<index>

def create_mapping(tokens):
    """
    INPUT: tokens -- sorted list of unique tokens
    OUTPUT: 
    token_to_idx: dict <char>:<index>
    idx_to_token: dict <index>:<char>
    """
    token_to_idx = {token: idx for idx, token in enumerate(tokens)}
    idx_to_token = dict(enumerate(tokens))
    return token_to_idx, idx_to_token 
token_to_idx, idx_to_token = create_mapping(tokens_sh)
assert len(token_to_idx) == len(tokens_sh), "dicts should have the same size"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cpu')

*Comment: in this task we have only 38 different tokens, so let's use one-hot encoding.*

In [73]:
batch_size = 128

def one_hot_encode(sequence, dict_size = len(tokens_sh),
                   batch_size = batch_size):
    seq_len = len(sequence)
    features = np.zeros((seq_len, dict_size), dtype = np.float32)
    features[np.arange(seq_len), [token_to_idx[s] for s in sequence]] = 1.
    return features
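
The model in the next section uses an `nn.Embedding` layer on integer indices rather than one-hot vectors. The two views are closely related: multiplying a one-hot matrix by a weight matrix selects rows of that matrix, which is what an embedding lookup does. The sketch below illustrates this with NumPy; `one_hot`, the toy indices, and the sizes are assumptions made for the example.

```python
import numpy as np

def one_hot(indices, dict_size):
    """Return a (len(indices), dict_size) float32 matrix with a single 1 per row."""
    features = np.zeros((len(indices), dict_size), dtype=np.float32)
    features[np.arange(len(indices)), indices] = 1.0
    return features

indices = [2, 0, 3]   # hypothetical token indices
dict_size = 4         # hypothetical vocabulary size
x = one_hot(indices, dict_size)

# A random "embedding" matrix with one row per token.
rng = np.random.default_rng(0)
W = rng.normal(size=(dict_size, 5)).astype(np.float32)

projected = x @ W        # one-hot vectors times the weight matrix
looked_up = W[indices]   # direct row lookup, as an embedding layer would do
print(np.allclose(projected, looked_up))
```

With a small vocabulary either representation is cheap; the lookup simply avoids materializing the one-hot matrix.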

### Building the model

Now we want to build and train a recurrent neural net that can generate something similar to Shakespeare's poetry.

Let's use vanilla RNN, similar to the one created during the lesson.

In [60]:
# Your code here  
    
class VanillaRNN(nn.Module):
    def __init__(self, n_tokens = len(tokens_sh), emb_dim = 16, hidden_dim = 64, n_layers = 1):
        super(VanillaRNN, self).__init__()
        # keep the sizes so that init_hidden can use them
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.emb = nn.Embedding(n_tokens, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, num_layers = n_layers, batch_first = True)
        self.hid_to_logits = nn.Linear(hidden_dim, n_tokens)
    
    def forward(self, x):
        # x: integer token indices of shape (batch, seq_len)
        assert x.dtype == torch.int64
        h_seq, _ = self.rnn(self.emb(x))
        next_logits = self.hid_to_logits(h_seq)
        next_logp = F.log_softmax(next_logits, dim = -1)
        return next_logp
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
        return hidden
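
To make explicit what the vanilla RNN layer computes, the following sketch unrolls the recurrence by hand and compares it with `nn.RNN`. It assumes the default `tanh` nonlinearity and a single layer; the tensor sizes are toy values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=4, batch_first=True)  # toy sizes
x = torch.randn(2, 5, 3)                                     # (batch, time, features)

with torch.no_grad():
    h_seq, h_last = rnn(x)

    # Manual recurrence: h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh),
    # starting from a zero hidden state.
    h = torch.zeros(2, 4)
    manual = []
    for t in range(x.shape[1]):
        h = torch.tanh(x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        manual.append(h)
    manual = torch.stack(manual, dim=1)

print(torch.allclose(manual, h_seq, atol=1e-5))
```

The layer returns the hidden state at every time step (`h_seq`) and the final one (`h_last`); the model above maps each hidden state to scores over the vocabulary.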

Plot the loss function (axis X: number of epochs, axis Y: loss function).

In [None]:
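# Your plot code here
import matplotlib.pyplot as plt

SEQ_LENGTH = 64  # assumed length of each training chunk (not given in the task)

model = VanillaRNN(n_tokens = len(tokens_sh)).to(device)
criterion = nn.NLLLoss()                    # the model returns log-probabilities
opt = torch.optim.Adam(model.parameters())  # optimizer choice is an assumption
history = []

for i in range(1000):
    # sample random fixed-length chunks of the text and map characters to indices
    starts = np.random.randint(0, len(text) - SEQ_LENGTH, size = batch_size)
    batch_ix = [[token_to_idx[c] for c in text[s : s + SEQ_LENGTH]] for s in starts]
    batch_ix = torch.tensor(batch_ix, dtype = torch.int64).to(device)
    
    logp_seq = model(batch_ix)
    # the prediction at position t is compared with the token at position t + 1
    loss = criterion(logp_seq[:, :-1].contiguous().view(-1, len(tokens_sh)),
                     batch_ix[:, 1:].contiguous().view(-1))

    opt.zero_grad()
    loss.backward()
    opt.step()
    
    history.append(loss.item())
    if (i + 1) % 100 == 0:
        clear_output(True)
        plt.plot(history, label = 'loss')
        plt.legend(loc = 'best')
        plt.show()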
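
The loss in the loop above compares the prediction made at position `t` with the token at position `t + 1`, which is why the log-probabilities drop their last step and the targets drop their first. A minimal NumPy sketch of this shifted negative log-likelihood for a single sequence is shown below; `next_token_nll` is a hypothetical helper written for illustration.

```python
import numpy as np

def next_token_nll(logp, token_ids):
    """Mean negative log-likelihood of next-token prediction.

    logp: (seq_len, n_tokens) log-probabilities; row t is the prediction for token t + 1.
    token_ids: (seq_len,) integer token indices of the same sequence.
    """
    preds = logp[:-1]        # the last row has no target inside the sequence
    targets = token_ids[1:]  # the first token is never predicted
    picked = preds[np.arange(len(targets)), targets]
    return -picked.mean()

# A model that always predicts a uniform distribution over 4 tokens.
uniform_logp = np.log(np.full((5, 4), 0.25))
ids = np.array([0, 1, 2, 3, 0])
print(next_token_nll(uniform_logp, ids), np.log(4))
```

For a uniform predictor the loss equals the logarithm of the vocabulary size, which gives a useful reference point when reading the loss curve.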



In [None]:
MAX_LENGTH = 500  # assumed default; MAX_LENGTH is not defined elsewhere in this notebook

def generate_sample(char_rnn, seed_phrase=' hello', max_length=MAX_LENGTH, temperature=1.0):
    '''
    ### Disclaimer: this is an example function for text generation.
    ### You can either adapt it in your code or create your own function
    
    The function generates text that continues the given seed phrase.
    :param seed_phrase: prefix characters (lowercase, since the text was lowercased).
        The RNN is asked to continue the phrase
    :param max_length: maximum output length, including seed_phrase
    :param temperature: coefficient for sampling. Higher temperature produces more chaotic outputs,
        smaller temperature converges to the single most likely output.
        
    Be careful with the model output. The VanillaRNN above returns log-probabilities for every
    position. Dividing them by the temperature before softmax gives the same distribution as
    dividing logits, because the two differ only by a per-position constant.
    '''
    
    x_sequence = [token_to_idx[token] for token in seed_phrase]
    x_sequence = torch.tensor([x_sequence], dtype=torch.int64).to(device)
    
    char_rnn.eval()
    with torch.no_grad():
        # start generating
        for _ in range(max_length - len(seed_phrase)):
            # run the model on the whole prefix and keep the prediction for the last position
            out = char_rnn(x_sequence)[:, -1]
            p_next = F.softmax(out / temperature, dim=-1).cpu().numpy()[0]
            
            # sample next token and push it back into x_sequence
            next_ix = np.random.choice(len(tokens_sh), p=p_next)
            next_ix = torch.tensor([[next_ix]], dtype=torch.int64).to(device)
            x_sequence = torch.cat([x_sequence, next_ix], dim=1)
        
    return ''.join([tokens_sh[ix] for ix in x_sequence.cpu().numpy()[0]])

In [40]:
# An example of generated text.
# print(generate_sample(model, max_length=500, temperature=0.2))

hide my will in thine?
  shall will in of the simend that in my sime the seave the seave the sorll the soren the sange the seall seares and and the fart the wirl the seall the songh whing that thou hall will thoun the soond beare the with that sare the simest me the fart the wirl the songre the with thy seart so for shat so for do the dost the sing the sing the sing the soond canding the sack and the farling the wirl of sore sich and that with the seare the seall so fort the with the past the wirl the simen the wirl the sores the sare


### More poetic model

Let's use LSTM instead of vanilla RNN and compare the results.

Plot the loss as a function of the number of epochs. Is the final loss lower than with the vanilla RNN?

In [46]:
# Your beautiful code here
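
One possible starting point is sketched below: the same embedding and output layers as before, with `nn.LSTM` in place of `nn.RNN`. The class name `CharLSTM`, the layer sizes, and the choice to return the recurrent state are assumptions made for illustration, not the required solution. The vocabulary size of 38 is the one printed for the Shakespeare text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLSTM(nn.Module):  # hypothetical name
    def __init__(self, n_tokens, emb_dim=16, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_tokens, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.hid_to_logits = nn.Linear(hidden_dim, n_tokens)

    def forward(self, x, state=None):
        # state is an optional (h, c) tuple; None means a zero initial state
        h_seq, state = self.lstm(self.emb(x), state)
        logp = F.log_softmax(self.hid_to_logits(h_seq), dim=-1)
        return logp, state

lstm_model = CharLSTM(n_tokens=38)
dummy = torch.randint(0, 38, (4, 10))   # (batch, seq_len) of token indices
logp, (h, c) = lstm_model(dummy)
print(logp.shape, h.shape, c.shape)
```

Unlike the vanilla RNN, the LSTM carries two state tensors, the hidden state `h` and the cell state `c`, so any generation code that passes the state between steps has to handle the pair.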

Generate text using the trained net with different values of the `temperature` parameter: `[0.1, 0.2, 0.5, 1.0, 2.0]`.

Evaluate the results visually, try to interpret them.

In [47]:
# Text generation with different temperature values here
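
To help interpret the results, the effect of temperature can be examined on a toy distribution, independently of any trained model. The sketch below uses made-up scores for three tokens; `softmax_with_temperature` is a hypothetical helper.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = np.asarray(scores, dtype=np.float64) / temperature
    scaled -= scaled.max()   # subtracting the maximum does not change the result
    exp = np.exp(scaled)
    return exp / exp.sum()

scores = [2.0, 1.0, 0.1]     # toy unnormalized scores for three tokens
for t in [0.1, 0.2, 0.5, 1.0, 2.0]:
    probs = softmax_with_temperature(scores, t)
    print(t, np.round(probs, 3))
```

At low temperature almost all probability mass sits on the highest-scoring token, so sampling tends to be repetitive; at high temperature the distribution flattens and samples become noisier.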

### Saving and loading models

Save the model to disk, then load it and generate text. Examples are available [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).

In [4]:
# Saving and loading code here
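
A common pattern is to save only the model's `state_dict` and load it into a freshly constructed model of the same architecture. The sketch below uses a small `nn.RNN` as a stand-in for the trained model and a temporary directory for the file; the file name `char_rnn.pt` is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.RNN(input_size=3, hidden_size=4, batch_first=True)  # stand-in for a trained model

path = os.path.join(tempfile.mkdtemp(), "char_rnn.pt")       # hypothetical file name
torch.save(net.state_dict(), path)                           # save the parameters only

# Re-create the same architecture, then load the saved parameters into it.
restored = nn.RNN(input_size=3, hidden_size=4, batch_first=True)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

x = torch.randn(1, 5, 3)
with torch.no_grad():
    out_a, _ = net(x)
    out_b, _ = restored(x)
print(torch.allclose(out_a, out_b))
```

Because only parameters are stored, the class definition must be available when loading, and the constructor arguments must match those used for training.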

### References
1. [Andrej Karpathy's blog post about RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). It contains several generation examples: Shakespeare texts, LaTeX formulas, Linux source code, and children's names.
2. [Repo with char-rnn code](https://github.com/karpathy/char-rnn)
3. Cool repo with PyTorch examples: [link](https://github.com/spro/practical-pytorch)