# Word-based RNN
We define a word-based RNN for text generation. We use LSTM cells and word2vec for embedding.

References:

[1] [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

[2] [Word2Vec + LSTM](https://stackoverflow.com/questions/42064690/using-pre-trained-word2vec-with-lstm-for-word-generation)





In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import re
import string
from tqdm import tqdm
import pickle
AUTOTUNE = tf.data.experimental.AUTOTUNE
PATH = '/content/drive/MyDrive/Datasets/'
rnn_dir = '/content/drive/MyDrive/NLP/RNNs/'
embedding_dim = 256
vocab_size = 2**14

In [None]:
# Load word2vec model to use as an embedding layer.
# See word2vec.ipynb for details.
word2vec = keras.models.load_model(rnn_dir + 'w2v_model')

In [None]:
word2vec.summary()

Model: "word2_vec"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
w2v_embedding (Embedding)    multiple                  4194304   
_________________________________________________________________
embedding (Embedding)        multiple                  4194304   
Total params: 8,388,608
Trainable params: 8,388,608
Non-trainable params: 0
_________________________________________________________________
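
The parameter counts in the summary follow from the constants defined earlier: each embedding table holds `vocab_size × embedding_dim` weights, and the model contains two such tables. A quick check, using only values that appear in this notebook:

```python
vocab_size = 2**14      # 16,384 tokens, as defined in the setup cell
embedding_dim = 256

# One weight per (token, dimension) pair in an embedding table.
params_per_embedding = vocab_size * embedding_dim
print(params_per_embedding)        # 4194304
print(2 * params_per_embedding)    # 8388608
```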


In [None]:
# Load saved vocabulary.
with open(rnn_dir + 'vocab_cfg.plk', 'rb') as f:
    vec_cfg = pickle.load(f)
with open(rnn_dir + 'vocab_voc.plk', 'rb') as f:
    vec_voc = pickle.load(f)
with open(rnn_dir + 'vocab_wgt.plk', 'rb') as f:
    vec_wgt = pickle.load(f)

In [None]:
vectorize_layer = layers.experimental.preprocessing.TextVectorization.from_config(vec_cfg)
vectorize_layer.set_vocabulary(vec_voc)
vectorize_layer.set_weights(vec_wgt)
vectorize_layer

<tensorflow.python.keras.layers.preprocessing.text_vectorization.TextVectorization at 0x7f8f26577ad0>

Because of a mistake in vocabulary generation, we don't have an embedding for "\n". We use a dirty (but effective) hack to work around this.

We map every occurrence of "\n" in the input data to the least frequent word (LFW) in our vocabulary. Since the LFW occurs only a single-digit number of times in our corpus, this should not noticeably affect the model's behaviour.

Before outputting a prediction, we replace all occurrences of the LFW with the "\n" character.
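
The round trip can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's exact code: `PLACEHOLDER`, `encode_newlines`, and `decode_newlines` are hypothetical names, and the placeholder string stands in for whatever the least frequent vocabulary word turns out to be.

```python
# Hypothetical stand-in for the least frequent vocabulary word.
PLACEHOLDER = 'geweest'

def encode_newlines(text, placeholder=PLACEHOLDER):
    # Surround the placeholder with spaces so it becomes its own token.
    return text.replace('\n', ' ' + placeholder + ' ')

def decode_newlines(text, placeholder=PLACEHOLDER):
    # Turn each placeholder token back into a line break; other tokens
    # are kept and re-joined with single spaces within each line.
    lines, current = [], []
    for token in text.split():
        if token == placeholder:
            lines.append(' '.join(current))
            current = []
        else:
            current.append(token)
    lines.append(' '.join(current))
    return '\n'.join(lines)

encoded = encode_newlines('here comes the sun\nlittle darling')
print(encoded)
print(repr(decode_newlines(encoded)))
```

Matching whole tokens when decoding avoids rewriting words that merely contain the placeholder as a substring, and it leaves no stray spaces around the restored line breaks.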

In [None]:
lfw = vec_voc[-1]
lfw

'geweest'

In [None]:
# Read the data.
df = pd.read_csv(PATH + 'kaggle_rock_new.csv')
df

Unnamed: 0.1,Unnamed: 0,lyrics
0,0,a lot of cats are hatin' slandering makin' bad...
1,1,somebody tell me why we landed here on the pla...
2,2,i'm spittin' with the venom\nto your soul thro...
3,3,where should i begin cripplin' all you villain...
4,4,enough of all that let's switch up the format\...
...,...,...
100252,100252,break down we've got to make them see\nno disc...
100253,100253,everything comes to a question where time is t...
100254,100254,you got to climb up on your high horses decide...
100255,100255,it all comes tumbling down\nno vital parts rem...


In [None]:
def standardize(lyrics):
    lyrics = lyrics.lower()
    lyrics = lyrics.replace('\n', ' ' + lfw + ' ')
    illegal = string.punctuation.replace("'", '')  # ' is legal
    lyrics = lyrics.translate(str.maketrans('', '', illegal))
    return lyrics

standardize("Hey!\nI'm a transformer!")

"hey geweest i'm a transformer"

In [None]:
sequence_length = 16

def split_list(l, n):
    # Split list l into consecutive chunks of n items (the last chunk may be shorter).
    return [l[i:i+n] for i in range(0, len(l), n)]

print(split_list([1, 2, 3, 4, 5], 2))

# We want the RNN to predict a new word based on
# what it has seen so far. We build a training
# sample the following way:
# Given a line, e.g. "here comes the sun",
# we set everything up to the last word as
# a training sample ("here comes the").
# The label is either the last word ("sun"), as in
# the code below, or everything except the first
# word ("comes the sun"), as in the commented-out line.

def get_training_data():
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    for _, raw_lyric in tqdm(df['lyrics'].items()):
        lyric = standardize(raw_lyric)
        # split() without arguments drops empty tokens from repeated spaces.
        chunks = split_list(lyric.split(), sequence_length)

        for chunk in chunks:
            x = chunk[:-1]
            # y = chunk[1:]
            y = chunk[-1]

            train_x = vectorize_layer.call([' '.join(x)])[0]
            train_y = vectorize_layer.call([' '.join(y)])[0]

            yield np.array(train_x), np.array(train_y)


[[1, 2], [3, 4], [5]]
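
To make the sample construction concrete, here is a small sketch in plain Python, independent of the vectorization layer. It shows the two labeling schemes mentioned in the comments above: a single next-word label and a shifted sequence. The helper names `next_word_pair` and `shifted_pair` are hypothetical, and the chunk size of 4 is chosen only to keep the output short.

```python
def split_list(l, n):
    # Split list l into consecutive chunks of at most n items.
    return [l[i:i + n] for i in range(0, len(l), n)]

def next_word_pair(words):
    # Input: every word but the last. Label: the last word only.
    return words[:-1], words[-1]

def shifted_pair(words):
    # Input: every word but the last. Label: the same window shifted by one.
    return words[:-1], words[1:]

words = 'here comes the sun little darling'.split()
for chunk in split_list(words, 4):
    print(next_word_pair(chunk), shifted_pair(chunk))
```

The first scheme fits a model that emits one prediction per sequence. The second needs one prediction per time step, which in Keras means an LSTM created with `return_sequences=True`.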


In [None]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_generator(get_training_data,
                                         output_shapes=((16,), (16,)),
                                         output_types=(tf.int64, tf.int64))

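# Dataset performance optimization: cache before shuffling so that
# each epoch gets a fresh shuffle instead of replaying a cached order.
dataset = dataset.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.prefetch(buffer_size=AUTOTUNE)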

print(dataset)
list(dataset.take(1))

<PrefetchDataset shapes: ((1024, 16), (1024, 16)), types: (tf.int64, tf.int64)>


791it [01:49,  7.25it/s]


[(<tf.Tensor: shape=(1024, 16), dtype=int64, numpy=
  array([[    2,   245,    42, ...,    15,   373,     0],
         [   31,     4,  2017, ...,     0,     0,     0],
         [ 1030,     6,  1810, ..., 16383,   220,     0],
         ...,
         [   68,     7,   137, ...,   343,     4,     0],
         [  118,    66,  1168, ...,  3615,    57,     0],
         [  481,    19,    10, ...,   105,   166,     0]])>,
  <tf.Tensor: shape=(1024, 16), dtype=int64, numpy=
  array([[1273, 1815, 1670, ...,    0,    0,    0],
         [1979,  547,  547, ...,    0,    0,    0],
         [1016,  547, 1670, ...,    0,    0,    0],
         ...,
         [   7, 1670,    0, ...,    0,    0,    0],
         [1670,  388,    0, ...,    0,    0,    0],
         [ 641, 3479,  641, ...,    0,    0,    0]])>)]

In [None]:
model = Sequential()
model.add(word2vec.get_layer('w2v_embedding'))
model.add(keras.layers.LSTM(units=embedding_dim))
model.add(keras.layers.Dense(units=vocab_size))
model.add(keras.layers.Activation('softmax'))
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer='adam', loss=loss)

In [None]:
def idx2word(i):
    return vec_voc[i]

def word2idx(w):
    return int(vectorize_layer.call([w])[0][0])

print(idx2word(100))
print(word2idx('only'))

only
100


In the following code we use [temperature-based random sampling](https://medium.com/machine-learning-at-petiteprogrammer/sampling-strategies-for-recurrent-neural-networks-9aea02a6616f).
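
As a rough illustration of what the temperature does, the sketch below rescales a small, made-up probability distribution in plain Python. The function name `apply_temperature` and the example values are assumptions for illustration only, and the sketch assumes every probability is strictly positive so that the logarithm is defined.

```python
import math

def apply_temperature(probs, temperature):
    # Rescale a distribution: p_i ** (1 / T), then renormalize.
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

probs = [0.7, 0.2, 0.1]  # made-up example distribution
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature below 1 sharpens the distribution toward the most likely word, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it, which makes rarer words more likely to be sampled.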

In [None]:
# Adapted from [2]

def sample(preds, temperature=1.0):
  # Temperature based random sampling.
  if temperature <= 0:
    return np.argmax(preds)
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

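def generate_next(text, num_generated=10):
  # Generate lyrics based on prompt.

  # Vectorize. standardize() applies the same cleaning as in training
  # and maps "\n" to the placeholder word.
  word_idxs = [word2idx(word) for word in standardize(text).split()]
  for i in range(num_generated):
    # Pass the prompt as one sequence (a batch of size 1) so the LSTM
    # sees the whole context rather than one word per batch entry.
    prediction = model.predict(x=np.array([word_idxs]))

    # Temperature based random sampling.
    idx = sample(prediction[0], temperature=0.7)
    word_idxs.append(idx)

  # Devectorize.
  result = ' '.join(idx2word(idx) for idx in word_idxs)
  return result.replace(lfw, '\n')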

def on_epoch_end(epoch, _):
  # Generate text with the following prompts to
  # see progress on each epoch.
  print('\nGenerating text after epoch: %d' % epoch)
  texts = [
    'here comes the sun little darling\n',
    'empty spaces what are we living for\n',
    'ticking away the moments that make up a dull day\n'
  ]
  for text in texts:
    # Use a name that does not shadow the sample() function above.
    generated = generate_next(text)
    print('%s... -> %s' % (text.strip(), generated))

Unfortunately, training fails with a `ValueError` during the first epoch, and we could not figure out why.

In [None]:
# Train the model. The dataset is already batched, so batch_size
# must not be passed to fit() as well.
model.fit(dataset,
          epochs=20,
          callbacks=[keras.callbacks.LambdaCallback(on_epoch_end=on_epoch_end)])

Epoch 1/20


ValueError: ignored
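
One possible contributor to the error, offered as a hypothesis rather than a confirmed diagnosis: in `get_training_data`, `y` is a single word (a string), so `' '.join(y)` inserts a space between its characters, and the label is vectorized as a sequence of single-letter tokens rather than as one word. The sketch below shows the effect in plain Python, without TensorFlow.

```python
chunk = 'here comes the sun'.split()
x, y = chunk[:-1], chunk[-1]

# Joining a list of words gives the expected text.
print(' '.join(x))   # here comes the
# Joining a string iterates over its characters.
print(' '.join(y))   # s u n
```

If that is what happens, each label is a padded vector of 16 token ids, which matches the `(1024, 16)` label tensor in the batch shown earlier. The model, whose LSTM returns only its final output, emits a single distribution over the vocabulary per sequence, so sparse categorical cross-entropy would expect one integer label per sequence. A shape mismatch of this kind is a plausible source of the `ValueError`, but we have not verified it.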