# Assignment 7

Develop a language model that generates death metal band names.  
You can get data from https://www.kaggle.com/zhangjuefei/death-metal.  
You are free to use any other data, but the easiest way is to take just the band name column.

Your language model should be a char-based autoregressive RNN.  
Text generation should terminate when either the maximum length is reached or the terminal symbol is generated.  

<img src="images/example.png">

<img src="images/example2.png">

Different band names can be generated by:  
1. initializing $h_0$ as a random vector drawn from some probability distribution;
2. sampling a token at each timestep with probabilities given by the softmax output.
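
The two sources of variety listed above can be sketched without any deep-learning library. This is only a minimal illustration, not the assignment's reference solution: `softmax`, `sample_token`, and the `temperature` parameter are hypothetical names introduced here, and the logits are toy values.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature is an assumed extra knob."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng, temperature=1.0):
    """Draw one token index with probability softmax(logits)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(17)
h0 = [rng.gauss(0.0, 1.0) for _ in range(8)]  # option 1: random initial hidden state
logits = [2.0, 1.0, 0.1]                      # toy scores for a 3-symbol vocabulary
token = sample_token(logits, rng)             # option 2: stochastic next-symbol choice
```

A temperature below 1 would sharpen the distribution toward the most likely symbol, and a temperature above 1 would flatten it.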

Calculate perplexity for your model; this is your objective quality metric.  
Also, sample 10 band names from your model for subjective evaluation. For example, names like 'qwiouefiou23riop2h3' or 'death death death!' are bad examples.  
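
For reference, perplexity is commonly defined as $\exp\left(-\frac{1}{N}\sum_{t=1}^{N}\log p(c_t \mid c_{<t})\right)$, the exponent of the mean negative log-likelihood per symbol. The sketch below assumes natural logarithms and uses made-up probabilities; the function name is a placeholder.

```python
import math


def perplexity_from_probs(token_probs):
    """exp of the mean negative log-likelihood of the observed symbols."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is uniform over 4 symbols is as uncertain as a fair 4-way choice.
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
# A model that assigns high probability to the true symbols scores close to 1.
confident = perplexity_from_probs([0.9, 0.8, 0.95])
```

Lower is better, and a value of 1 would mean every symbol was predicted with certainty.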

In [1]:
from time import time

import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm_notebook
from sklearn.model_selection import train_test_split

import torch as tt

Using TensorFlow backend.


Loading the data

In [2]:
df = pd.read_csv('bands.csv')

In [3]:
texts = df.name.values

Split the data into training, validation, and test subsets

In [4]:
np.random.seed(17)
# NOTE: randint samples with replacement, so test_idxs can contain repeated rows
test_idxs = list(np.random.randint(0, df.shape[0], size=(int(df.shape[0]*0.1),)))
train_idxs = list(set(range(df.shape[0])).difference(test_idxs))

texts_train = texts[train_idxs]
texts_test = texts[test_idxs]

In [5]:
texts_train.shape, texts_test.shape

((34128,), (3772,))
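
The two shapes above add up to 37900, which is more than the roughly 37.7k rows implied by a 10% test size of 3772. That is consistent with `np.random.randint` drawing some indices more than once. A split without repeated test rows could look like the sketch below; `split_indices` is a hypothetical helper, shown with the standard library only.

```python
import random


def split_indices(n_rows, test_fraction=0.1, seed=17):
    """Return disjoint (train, test) index lists; no row appears twice in test."""
    rng = random.Random(seed)
    n_test = int(n_rows * test_fraction)
    test = rng.sample(range(n_rows), n_test)  # sampling without replacement
    test_set = set(test)
    train = [i for i in range(n_rows) if i not in test_set]
    return train, test


train_idxs, test_idxs = split_indices(1000)
```

With NumPy, `np.random.permutation` followed by slicing would give the same kind of disjoint split.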

Add start- and end-of-sequence symbols `<` and `>` to the names

In [6]:
texts_train = [f'<{x}>' for x in texts_train]
texts_test = [f'<{x}>' for x in texts_test]

In [7]:
MAX_SEQ_LEN = max([len(x) for x in texts_train])

Build the datasets

In [8]:
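def build_dataset(texts, tokenizer, maxlen, fit_tokenizer=True):
    # Fit the char-level vocabulary only when asked (i.e. on the training texts)
    if fit_tokenizer:
        tokenizer.fit_on_texts(texts)
    
    # Encode characters as integer ids and pad with zeros on the right
    X = tokenizer.texts_to_sequences(texts)
    X = pad_sequences(X, maxlen=maxlen+1, padding='post', truncating='post')
    
    # Targets are the inputs shifted one step left; the last position is padding
    y = np.roll(X, -1, axis=1)
    y[:,-1] = 0
    
    return X, y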
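
To make the input/target relationship concrete, here is a toy illustration of the shift performed in `build_dataset`. The id mapping is invented for the example and `shift_targets` is a hypothetical helper; the real ids come from the fitted tokenizer.

```python
def shift_targets(row, pad=0):
    """Next-symbol targets: drop the first id and pad the end."""
    return row[1:] + [pad]


# Toy ids for '<', 'a', 'b', '>' followed by two padding zeros (hypothetical mapping).
x = [1, 2, 3, 4, 0, 0]
y = shift_targets(x)

# Each pair is (symbol the model sees, symbol it should predict next).
pairs = list(zip(x, y))
```

At every position the model reads `x[t]` and is trained to predict `y[t] = x[t+1]`, so `<` predicts the first letter and the last letter predicts `>`.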

In [9]:
tokenizer = Tokenizer(char_level=True, lower=False)

In [10]:
X_train, Y_train = build_dataset(texts_train, tokenizer, MAX_SEQ_LEN)

In [11]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.1, random_state=17) 

In [12]:
X_test, Y_test = build_dataset(texts_test, tokenizer, MAX_SEQ_LEN, fit_tokenizer=False)

In [13]:
i2ch = {v: k for k, v in tokenizer.word_index.items()}

As the model I used a unidirectional LSTM paired with a fully connected layer that outputs a distribution over characters

In [14]:
class NN(tt.nn.Module):
        
    def __init__(
        self,
        vocab_size,
        embedding_size,
        hidden_size,
    ):
        super(NN, self).__init__()
        
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        
        self.embeddings = tt.nn.Embedding(self.vocab_size, self.embedding_size)  # weights are trainable by default
                    
        self.rnn = tt.nn.LSTM(
            input_size=self.embedding_size,
            hidden_size=self.hidden_size,
            num_layers=1,
            batch_first=True
        )

        self.output_layer = tt.nn.Linear(self.hidden_size, self.vocab_size)
        
    def forward(self, x, states=None):
        x = tt.as_tensor(x, dtype=tt.long).cuda()  # x arrives as a NumPy array of token ids
        x = self.embeddings(x)
        
        if states is not None:
            x, hidden = self.rnn(x, states)
            
        else:
            x, hidden = self.rnn(x)
        
        x = self.output_layer(x)
                
        return x, hidden
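
As a rough size check for this architecture, the parameter count can be worked out by hand. The sketch assumes the usual single-layer LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors). The vocabulary size of 100 is an assumption for illustration only, since the real value depends on the fitted tokenizer.

```python
def lstm_param_count(input_size, hidden_size):
    """Single-layer LSTM: 4 gates, each with input weights, recurrent weights, two biases."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size)


def model_param_count(vocab_size, embedding_size, hidden_size):
    """Embedding table + LSTM + linear output layer."""
    embedding = vocab_size * embedding_size
    lstm = lstm_param_count(embedding_size, hidden_size)
    output = hidden_size * vocab_size + vocab_size  # weight matrix plus bias
    return embedding + lstm + output


lstm_only = lstm_param_count(128, 512)                    # 1314816
total_for_100_symbols = model_param_count(100, 128, 512)  # 1378916
```

With an embedding size of 128 and a hidden size of 512, almost all parameters sit in the LSTM itself, so the vocabulary size has little effect on the total.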

In [15]:
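def perplexity(x):
    # NOTE: CrossEntropyLoss is measured in nats, so np.exp(x) is the conventional
    # perplexity; the numbers recorded below were produced with base 2.
    return 2**x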


def train(
    epochs,
    X_train,
    Y_train,
    X_val,
    Y_val,
    model, 
    optimizer,
    batch_size,
    scheduler=None,
    patience=5,
    save_path='tt_model'
):    
    train_perpls = []
    val_perpls = []
    min_val_perpl = np.inf
    n_no_improv_epochs = 0
    
    criterion = tt.nn.CrossEntropyLoss()
    
    for epoch in range(epochs):
        shuffled_idxs = np.random.permutation(X_train.shape[0])  # reshuffle every epoch
        _X = X_train[shuffled_idxs]
        _Y = Y_train[shuffled_idxs]
        
        c_train_perpls = []
        c_val_perpls = []
        
        st_time = time()
        
        for i in tqdm_notebook(range(0, X_train.shape[0], batch_size)):
            x = _X[i:i+batch_size]
            y = _Y[i:i+batch_size]
            
            optimizer.zero_grad()
            
            pred, hidden = model.forward(x)
            pred = pred.permute(0, 2, 1)
            train_loss = criterion(pred, tt.tensor(y, dtype=tt.long).cuda())
            c_train_perpls.append(perplexity(train_loss.item()))
            
            train_loss.backward()
            
            optimizer.step()
        
        c_train_perpl = np.mean(c_train_perpls)
        train_perpls.append(c_train_perpl)
        
        with tt.no_grad():
            val_pred, val_hidden = model.forward(X_val)
            val_pred = val_pred.permute(0, 2, 1)
            val_loss = criterion(val_pred, tt.tensor(Y_val, dtype=tt.long).cuda())
            c_val_perpl = perplexity(val_loss.item())
            val_perpls.append(c_val_perpl)
        
        if c_val_perpl < min_val_perpl:
            min_val_perpl = c_val_perpl
            n_no_improv_epochs = 0
            tt.save(model.state_dict(), save_path)
            
        elif n_no_improv_epochs < patience:
            n_no_improv_epochs += 1
            
        else:
            print(f'Early stopping at epoch {epoch+1}\nBest val perplexity: {min_val_perpl:.4f}')
            break
            
        if scheduler is not None:
            scheduler.step()
        
        c_time = time() - st_time
        
        print(f'epoch: {epoch+1} \t train_perplexity: {c_train_perpl:.4f} \t val_perplexity: {c_val_perpl:.4f} \t time: {c_time:.2f} s.')
    
    return train_perpls, val_perpls
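
One caveat about the metric as computed in `train`: the loss is averaged over every position, including the zero-padded tail of each name, and padding is easy to predict. The sketch below uses made-up probabilities to show how that lowers the reported perplexity. `mean_nll` is a hypothetical helper; in PyTorch the `ignore_index` argument of `CrossEntropyLoss` is one way to get the masked variant.

```python
import math


def mean_nll(target_probs, targets, pad_id=None):
    """Average -log p over positions, optionally skipping padded targets."""
    kept = [-math.log(p) for p, t in zip(target_probs, targets) if t != pad_id]
    return sum(kept) / len(kept)


# Probability the model gave to each correct target; the last three targets are padding.
targets = [5, 7, 2, 0, 0, 0]
probs = [0.2, 0.3, 0.5, 0.99, 0.99, 0.99]

ppl_with_pad = math.exp(mean_nll(probs, targets))           # padding counted
ppl_no_pad = math.exp(mean_nll(probs, targets, pad_id=0))   # padding ignored
```

The longer the padded tail relative to the name, the larger the gap between the two numbers.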

In [16]:
def eval_test(X_test, Y_test, model):
    criterion = tt.nn.CrossEntropyLoss()
    
    with tt.no_grad():
        pred = model.forward(X_test)[0].cpu()
        pred = pred.permute(0, 2, 1)
        loss = criterion(pred, tt.tensor(Y_test, dtype=tt.long))
        
    return perplexity(loss.item())


In [79]:
def generate(model, tokenizer, i2ch, hidden_size, maxlen=100, scale=1.1):
    h0 = np.random.normal(scale=scale, size=(1, 1, hidden_size))
    c0 = np.random.normal(scale=scale, size=(1, 1, hidden_size))
    states = (tt.tensor(h0, dtype=tt.float32).cuda(), tt.tensor(c0, dtype=tt.float32).cuda())
    
    generated = ''
    X = np.array(tokenizer.texts_to_sequences(['<']))
    
    with tt.no_grad():
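        while len(generated) < maxlen:
            # NOTE: the whole prefix X is fed again together with the carried-over states
            pred, states = model.forward(X, states)
            pred = tt.nn.functional.softmax(pred, -1)
            pred = pred.cpu().numpy()[0,-1,:]
            # Greedy choice: variety comes only from the random initial states
            pred_char_idx = pred.argmax()
            X = np.hstack((X, np.tile(pred_char_idx, (1, 1))))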
            
            if pred_char_idx == 0 or pred_char_idx == tokenizer.word_index['>']:
                return generated

            generated += i2ch[pred_char_idx]
        
    return generated
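
`generate` picks the most likely symbol at each step, so it covers only the first of the two options in the task statement. For comparison, here is a model-agnostic sketch of the second option: it samples from the predicted distribution and feeds only the latest symbol back in. `step_fn` and `toy_step` are hypothetical stand-ins for the trained network, and the bigram table is invented for the example.

```python
import random


def generate_name(step_fn, start, end, rng, maxlen=20):
    """Sample symbols one at a time until `end` appears or maxlen is reached."""
    symbol, state, out = start, None, []
    while len(out) < maxlen:
        probs, state = step_fn(symbol, state)  # feed only the latest symbol
        symbols = list(probs)
        symbol = rng.choices(symbols, weights=[probs[s] for s in symbols], k=1)[0]
        if symbol == end:
            break
        out.append(symbol)
    return ''.join(out)


def toy_step(symbol, state):
    """Stand-in for the RNN: a fixed bigram table; the state just counts steps."""
    table = {
        '<': {'D': 0.5, 'S': 0.5},
        'D': {'e': 1.0},
        'S': {'o': 1.0},
        'e': {'a': 0.5, 'd': 0.5},
        'a': {'d': 1.0},
        'o': {'u': 1.0},
        'u': {'l': 1.0},
        'd': {'>': 1.0},
        'l': {'>': 1.0},
    }
    return table[symbol], (state or 0) + 1


rng = random.Random(0)
names = {generate_name(toy_step, '<', '>', rng) for _ in range(50)}
```

With a real model, `step_fn` would wrap a single forward pass that takes the previous symbol and the `(h, c)` pair and returns the softmax output plus the updated states.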

In [18]:
VOCAB_SIZE = len(tokenizer.word_index) + 1
BATCH_SIZE = 64

In [19]:
model = NN(VOCAB_SIZE, 128, 512)

In [20]:
model = model.cuda()

In [48]:
optimizer = tt.optim.Adam(model.parameters())

In [49]:
train(100, X_train, Y_train, X_val, Y_val, model, optimizer, BATCH_SIZE, save_path='assignment7_model')

epoch: 1 	 train_perplexity: 1.7264 	 val_perplexity: 1.4179 	 time: 9.93 s.
epoch: 2 	 train_perplexity: 1.3868 	 val_perplexity: 1.3726 	 time: 9.82 s.
epoch: 3 	 train_perplexity: 1.3511 	 val_perplexity: 1.3506 	 time: 9.78 s.
epoch: 4 	 train_perplexity: 1.3287 	 val_perplexity: 1.3376 	 time: 9.69 s.
epoch: 5 	 train_perplexity: 1.3128 	 val_perplexity: 1.3297 	 time: 9.74 s.
epoch: 6 	 train_perplexity: 1.2999 	 val_perplexity: 1.3256 	 time: 9.73 s.
epoch: 7 	 train_perplexity: 1.2888 	 val_perplexity: 1.3240 	 time: 9.71 s.
epoch: 8 	 train_perplexity: 1.2786 	 val_perplexity: 1.3237 	 time: 9.67 s.
epoch: 9 	 train_perplexity: 1.2690 	 val_perplexity: 1.3241 	 time: 9.77 s.
epoch: 10 	 train_perplexity: 1.2598 	 val_perplexity: 1.3260 	 time: 9.76 s.
epoch: 11 	 train_perplexity: 1.2512 	 val_perplexity: 1.3287 	 time: 9.69 s.
epoch: 12 	 train_perplexity: 1.2432 	 val_perplexity: 1.3326 	 time: 9.72 s.
epoch: 13 	 train_perplexity: 1.2359 	 val_perplexity: 1.3372 	 time: 9.70 s.
Early stopping at epoch 14
Best val perplexity: 1.3237


([1.726351124967248,
  1.386844742319806,
  1.3510639593092533,
  1.3287174641155706,
  1.312759997579192,
  1.2999277980695019,
  1.288788128810972,
  1.2785957557262544,
  1.268995919967657,
  1.2598158119693652,
  1.2511981098491465,
  1.243198183982238,
  1.2359365141795053,
  1.2294191864363604],
 [1.4179230422986766,
  1.3725672852416055,
  1.3505512252429137,
  1.3375534303594592,
  1.3297183908013346,
  1.3256392948579927,
  1.3239752131569138,
  1.3236701330512786,
  1.3240619971700516,
  1.3259742740034761,
  1.3287117486492486,
  1.332590630159902,
  1.3372440888774833,
  1.3422826134006318])

Load the best model and evaluate perplexity on the test subset

In [21]:
model.load_state_dict(tt.load('assignment7_model'))

In [22]:
eval_test(X_test, Y_test, model)

1.3214760454567611
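
The figure above was computed as `2**loss`. Assuming the loss is a mean cross-entropy in nats (the convention of `tt.nn.CrossEntropyLoss`), the conventional base-e perplexity can be recovered from the printed number. `rebase_perplexity` is a hypothetical helper; the result still includes padded positions, so treat it as a rough conversion only.

```python
import math


def rebase_perplexity(reported, from_base=2.0):
    """Recover the loss from reported = from_base**loss, then exponentiate in base e."""
    loss_nats = math.log(reported) / math.log(from_base)
    return math.exp(loss_nats)


reported_test = 1.3214760454567611        # value printed above
loss = math.log(reported_test, 2)         # the underlying mean loss
natural_base = rebase_perplexity(reported_test)
```

Under these assumptions the test perplexity comes out a little below 1.5 rather than 1.32.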

Generate 10 random names, sampling the hidden states from a zero-mean normal distribution

In [76]:
samples = set()

while len(samples) < 10:
    samples.add(generate(model, tokenizer, i2ch, 512, MAX_SEQ_LEN))

In [77]:
for x in samples:
    print(x)

Creath
Infection
Sullen
Sected
Soul
Sedence
Necrosis
Blasphemerator
Pacanatis
Dead


Check whether any of the generated names match something in the training subset

In [78]:
for x in samples:
    if f'<{x}>' in texts_train:
        print(f'Found duplicate: {x}')

Found duplicate: Infection
Found duplicate: Sullen
Found duplicate: Necrosis
Found duplicate: Dead
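
The same check can be turned into a single number, the share of generated names that are copies. In the sketch below `novelty_report` is a hypothetical helper and the training list is a toy stand-in containing only the four duplicates reported above. A set is used because membership tests on a list scan every element.

```python
def novelty_report(samples, training_names):
    """Split generated names into those seen in training and new ones."""
    known = set(training_names)  # set gives constant-time membership checks
    copies = sorted(s for s in samples if s in known)
    novel = sorted(s for s in samples if s not in known)
    return copies, novel


# Toy training list (stand-in); the generated names are the ten printed above.
training = ['Infection', 'Sullen', 'Necrosis', 'Dead']
generated = ['Creath', 'Infection', 'Sullen', 'Sected', 'Soul', 'Sedence',
             'Necrosis', 'Blasphemerator', 'Pacanatis', 'Dead']
copies, novel = novelty_report(generated, training)
copy_rate = len(copies) / len(generated)
```

For this particular batch the copy rate is 4 out of 10.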


Over a couple dozen experiments, about half of the names on average are copies, but those that are not copies look fairly plausible.  
Increasing the standard deviation when sampling the hidden states helps diversify the output, but it also raises the probability of generating incoherent nonsense.