# Rap Text Generator

Student Name: Dmitry Timerbaev

In [24]:
# load libraries
import lyricsgenius
from nltk.lm import MLE
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
import pandas as pd
from string import punctuation
import numpy as np
import sys
import os

import tensorflow as tf

In [4]:
# set up the tokenizer
import re  # imported unconditionally: also used later to split the lyrics into lines

try: # Use the default NLTK tokenizer.
    from nltk import word_tokenize, sent_tokenize
    # smoke test: fails if the NLTK tokenizer data is not installed
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
except (ImportError, LookupError): # Use a naive sentence tokenizer and toktok.
    from nltk.tokenize import ToktokTokenizer
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    toktok = ToktokTokenizer()
    word_tokenize = toktok.tokenize

## Data retrieval
I used the LyricsGenius API to retrieve the lyrics of 10 of Eminem's most popular songs. To build the rap generator, I tried two approaches: an n-gram model and a recurrent neural network (RNN).

In [5]:
# set up API token
genius = lyricsgenius.Genius("rPXL2JkaA1EqVzgpJxFGFocg149ZOiUveQWeaNTsMo51Dq125_dNzjfISkivKzFr")
genius.remove_section_headers = True

In [7]:
# download train lyrics (10 most popular Eminem songs)
train_song_list = ['Rap God', 'Killshot', 'Lose Yourself', 'The Monster', 'Lucky You', 'Godzilla', 'The Ringer', 'River', 'Berzerk', 'Venom']
data_1 = ""
for i in train_song_list:
    song = genius.search_song(i, 'Eminem')
    data_1 += song.lyrics + " "

Searching for "Rap God" by Eminem...
Done.
Searching for "Killshot" by Eminem...
Done.
Searching for "Lose Yourself" by Eminem...
Done.
Searching for "The Monster" by Eminem...
Done.
Searching for "Lucky You" by Eminem...
Done.
Searching for "Godzilla" by Eminem...
Done.
Searching for "The Ringer" by Eminem...
Done.
Searching for "River" by Eminem...
Done.
Searching for "Berzerk" by Eminem...
Done.
Searching for "Venom" by Eminem...
Done.


In [8]:
# load data
train_data = data_1

## N-gram model

### Data preprocessing
I split the data into sentences at line breaks. Then I divided each sentence into words, stripped a fixed set of punctuation characters from them, and reassembled the sentences for tokenization.

In [9]:
# split train data into sentences
tr = re.split(r'\n', train_data)

In [10]:
# define function that removes selected punctuation characters from a string
# (note: this name shadows the `punctuation` constant imported from `string` above)
def punctuation(string): 
  
    # punctuation marks to be removed
    punctuations = '''!;:?()"—[]<>'''
  
    # go through each character and delete it from the string if it is one of the marks above
    for x in string: 
        if x in punctuations: 
            string = string.replace(x, "") 
  
    # return the cleaned string (lowercasing happens later, during tokenization)
    return string 

In [11]:
# remove punctuation from the words of each sentence
words_list = []
for line in tr:
    cleaned_words = []  # cleaned words of the current line
    for t in line.split():
        cleaned_words.append(punctuation(t))
    words_list.append(cleaned_words)

In [12]:
# recreate sentences and then tokenize by word. check the tokenized sentence
sent_list = [x for x in words_list if x != []]
new_sent_list = [' '.join(sent) for sent in sent_list]
tokenized = [list(map(str.lower, word_tokenize(sent))) for sent in new_sent_list]
tokenized[0]

['look',
 ',',
 'i',
 'was',
 'gonna',
 'go',
 'easy',
 'on',
 'you',
 'not',
 'to',
 'hurt',
 'your',
 'feelings',
 '.']

### Fit the model and generate sample rap
I fit a 3-gram maximum likelihood estimation (MLE) model to the tokenized dataset and used it as the text generator.
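
To make the estimate concrete, here is a minimal sketch of what a 3-gram MLE model computes: P(w | u, v) = count(u, v, w) / count(u, v). This is a hand-rolled illustration under simplifying assumptions, not NLTK's implementation; the function `train_trigram_mle` and the toy sentences are made up for this example, and the padding is simpler than what `padded_everygram_pipeline` does.

```python
from collections import Counter, defaultdict

def train_trigram_mle(sentences):
    """Return P(w | u, v) as nested dicts, estimated by relative frequency."""
    context_counts = Counter()            # count(u, v)
    trigram_counts = defaultdict(Counter)  # count(u, v, w)
    for tokens in sentences:
        # assumption: two start symbols and one end symbol per sentence
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            context_counts[(u, v)] += 1
            trigram_counts[(u, v)][w] += 1
    return {
        ctx: {w: c / context_counts[ctx] for w, c in nxt.items()}
        for ctx, nxt in trigram_counts.items()
    }

# toy data, invented for illustration
toy_sentences = [["i", "rap", "fast"], ["i", "rap", "slow"], ["i", "write", "fast"]]
probs = train_trigram_mle(toy_sentences)

print(probs[("i", "rap")])    # distribution over words that followed "i rap"
print(probs[("<s>", "i")])    # distribution over words that followed a sentence-initial "i"
```

Because MLE uses raw relative frequencies, any trigram that never occurred in the training lyrics gets probability zero, which is why the generator can only recombine word sequences it has already seen.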

In [13]:
# preprocess the tokenized text for 3-gram language modelling
n = 3 
train_d, padded_sents = padded_everygram_pipeline(n, tokenized)

In [14]:
# fit the model, check the length of vocabulary
model = MLE(n)
model.fit(train_d, padded_sents)
len(model.vocab)

2323

In [15]:
# generate a single sentence
print(model.generate(30, random_seed=25))

['a', 'villain', 'outta', 'those', 'blockbusters', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [16]:
# creating function that converts generated text into readable form
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [17]:
# generate same single sentence in new form
generate_sent(model, 30, random_seed=25)

'a villain outta those blockbusters'

In [18]:
# sample rap generation: try 30 random seeds and skip the ones that yield an empty line
for i in range(30):
    line = generate_sent(model, 30, random_seed=i)
    if line:
        print(line)

straight out the coupe, hop out and booed off stage
when you' ll take you back
in the right type of life for my music
record every time i break a motherfuckin' optionfailure' s the only opportunity that i' m reloadin '
pull my mac out and shoot
but i' m' bout to bloody this track up, overblaow
head
bitch
caught slippin '
i get on a guy with a pipe wrench
.
with
i' m not done preach
fame made me a costly mistake
ll still be like everyone else in the front smashed, much as my rear fender, assassin
time to go i cannot grow old in salem' s your moment
when i' ve been a lover, been a thief
tough times
midst of all this
a villain outta those blockbusters
of defeat and rise to my feet
just pulled a pistol on a mic
give me the juice


## RNN model

### Data preprocessing
Before training the RNN, the text must be mapped to a numerical representation. The data is then divided into example sequences. For each input sequence, the corresponding target contains the same length of text, shifted one character to the right. Finally, the examples are grouped into training batches that are fed to the model.
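
The following plain-Python sketch walks through the same steps on a tiny string, without TensorFlow, to show what the shift by one character looks like. The string, the sequence length of 4, and all `toy_` names are invented for illustration.

```python
toy_text = "hello rap god"

# 1) map characters to integer ids
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_encoded = [toy_char2idx[ch] for ch in toy_text]

# 2) cut the ids into non-overlapping chunks of seq_length + 1, dropping the incomplete tail
toy_seq_length = 4
chunk_size = toy_seq_length + 1
toy_chunks = [toy_encoded[i:i + chunk_size]
              for i in range(0, len(toy_encoded) - toy_seq_length, chunk_size)]

# 3) input is the chunk without its last id, target is the chunk without its first id
toy_pairs = [(chunk[:-1], chunk[1:]) for chunk in toy_chunks]

def decode(ids):
    """Turn a list of ids back into a string."""
    return "".join(toy_vocab[i] for i in ids)

for inp, tgt in toy_pairs:
    print(repr(decode(inp)), "->", repr(decode(tgt)))
```

At every position the target is the character that follows the input character, so the model is trained to predict the next character at each timestep.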

In [57]:
# load data
text = data_1

In [58]:
# set up vocabulary and investigate unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

87 unique characters


In [59]:
# create a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [60]:
# set the maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

"
L
o
o
k


In [61]:
# try out conversion of characters into sequences of desired size. looks fine.
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

'"Look, I was gonna go easy on you not to hurt your feelings."\n"But I\'m only going to get this one cha'
'nce." (Six minutes— Six minutes—)\n"Something\'s wrong, I can feel it." (Six minutes, Slim Shady, you\'r'
'e on!)\n"Just a feeling I\'ve got. Like something\'s about to happen, but I don\'t know what.\xa0\nIf that me'
"ans what I think it means, we're in trouble, big trouble;\xa0\nAnd if he is as bananas as you say, I'm no"
't taking any chances."\n"You are just what the doc ordered."\n\nI\'m beginnin\' to feel like a Rap God, Ra'


In [62]:
# for each sequence, duplicate and shift it to form the input and target text by using the map method to apply a simple function to each batch
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# try printing the first examples input and target values
for input_example, target_example in  dataset.take(1):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  '"Look, I was gonna go easy on you not to hurt your feelings."\n"But I\'m only going to get this one ch'
Target data: 'Look, I was gonna go easy on you not to hurt your feelings."\n"But I\'m only going to get this one cha'


In [63]:
# create training batches 

# batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

# set up the dataset for training
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int32, tf.int32)>

### Building the model
Building the RNN for text generation takes three steps:<br>
1) Define the model architecture - I used a simple RNN structure with an embedding input layer, a GRU layer (an LSTM could also be used), and a dense output layer<br>
2) Choose the optimizer and loss function - I compiled the model with the Adam optimizer and sparse categorical cross-entropy loss<br>
3) Configure checkpoints

In [64]:
# define the model

# length of the vocabulary in chars
vocab_size = len(vocab)

# embedding dimension
embedding_dim = 256

# number of RNN units
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
    return model

In [65]:
# set up the training model
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [67]:
# show model summary - 1 embedding, gru and dense layers
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (64, None, 256)           22272     
_________________________________________________________________
gru_4 (GRU)                  (64, None, 1024)          3938304   
_________________________________________________________________
dense_4 (Dense)              (64, None, 87)            89175     
Total params: 4,049,751
Trainable params: 4,049,751
Non-trainable params: 0
_________________________________________________________________


In [69]:
# define loss function (sparse categorical cross-entropy). labels are integer character ids and the loss is applied across the last dimension of the predictions; from_logits=True because the dense layer outputs raw logits
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

Prediction shape:  (64, 100, 87)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.467277
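
As a sanity check on the scalar loss above: a model that spreads its probability evenly over all 87 characters would have a cross-entropy of ln(87) ≈ 4.466, which is consistent with the value reported before training. The NumPy sketch below re-implements the loss by hand to show this; the function name and the `demo_` variables are hypothetical, and the labels are random rather than taken from the lyrics.

```python
import numpy as np

def sparse_crossentropy_from_logits(labels, logits):
    """Per-position cross-entropy for integer labels and unnormalized logits."""
    # numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability assigned to each true label
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

demo_vocab_size = 87
demo_labels = np.random.default_rng(0).integers(0, demo_vocab_size, size=(64, 100))
# all-zero logits correspond to a uniform distribution over the vocabulary
demo_logits = np.zeros((64, 100, demo_vocab_size))

demo_loss = sparse_crossentropy_from_logits(demo_labels, demo_logits)
print("loss shape:", demo_loss.shape)
print("mean loss: ", demo_loss.mean())
print("ln(vocab): ", np.log(demo_vocab_size))
```

A loss that starts far above ln(vocab_size) would suggest that the untrained model is confidently wrong, which is usually a sign of an initialization or labeling problem.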


In [70]:
# compile the model. use adam optimizer and defined loss function
model.compile(optimizer='adam', loss=loss)

In [71]:
# set up checkpoints

# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Fit the model and generate sample rap
I trained the RNN for 100 epochs and then rebuilt it with a batch size of 1 to generate rap lyrics. The rebuild is needed because the RNN state is passed from timestep to timestep, so the model only accepts a fixed batch size once built.<br>

Text generation is done in a prediction loop:<br>
- Start by choosing a start string, initializing the RNN state, and setting the number of characters to generate.

- Get the prediction distribution of the next character using the start string and the RNN state.

- Then sample the index of the predicted character from a categorical distribution, and use this character as the next input to the model.

- The RNN state returned by the model is fed back into the model so that it has more context than a single character. After each prediction, the updated RNN state is fed back in again, so every new character is conditioned on the characters generated before it.
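
The sampling step in the loop above can be sketched in NumPy as follows. This is an illustration of temperature-scaled categorical sampling in general, not the TensorFlow code used in this notebook; the function name and the example logits are made up.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one index from softmax(logits / temperature); also return the probabilities."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs)), probs

demo_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters
rng = np.random.default_rng(42)

_, cold = sample_with_temperature(demo_logits, temperature=0.5, rng=rng)
_, neutral = sample_with_temperature(demo_logits, temperature=1.0, rng=rng)
sampled_id, hot = sample_with_temperature(demo_logits, temperature=2.0, rng=rng)

print("T=0.5:", cold.round(3))     # sharper: the top character dominates
print("T=1.0:", neutral.round(3))  # the model's own distribution
print("T=2.0:", hot.round(3))      # flatter: more surprising picks
```

Sampling rather than always taking the most likely character keeps the generator from getting stuck repeating the same sequence.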



In [72]:
# fit the model with 100 epochs
EPOCHS=100
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [73]:
# restore last checkpoint; keep batch size of 1
tf.train.latest_checkpoint(checkpoint_dir)

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [74]:
# define function that generates text from given RNN model
def generate_text(model, start_string):

    # number of characters to generate
    num_generate = 500

    # convert the start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # empty list to store the generated characters
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # A temperature of 1.0 samples from the model's distribution unchanged.
    temperature = 1.0

    # here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # scale the logits by the temperature, then sample the next character id
        # from a categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # pass the predicted character as the next input to the model
        # (the stateful GRU layer carries the previous hidden state forward)
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [75]:
# sample rap generation
print(generate_text(model, start_string=u"Sample Rap: "))

Sample Rap: chick, so I guess it ain't that waruble this out at the undergod and eain' so the scrobach the preath
And you don't fuck with no Oath, fuck it
What's your talknate get out the only think it's kill on the fenom
Dnd for thater yout this that's what it's cold and shoot
Aff all now and rick and I rap on his stamplline on the chainsaw
'Cause Fab sait is hire
I'ma even dast of blim
Havit that I'd rather do than hear you on a mic
Sin to piend, ye, Still I don't have any manners
You got a couple of mans


**The text generated by both models somewhat resembles actual lyrics but is still far from perfect. I would choose the RNN model for text generation tasks because it seems to have much more potential than the n-gram model.<br>**

Suggestions for improvement:<br>

1) Calculate perplexities and compare the models (unfortunately, I was not able to figure out how to do that)<br>
2) Apply various smoothing techniques (I could not figure this out either)<br>
3) Build a more complex RNN architecture (I could not do that due to the limited computational power of my laptop)<br>
4) Get more training data<br>
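
As a starting point for suggestions 1 and 2, here is a minimal pure-Python sketch of add-one (Laplace) smoothing and perplexity on a bigram model. It is an illustration under stated assumptions rather than a result from this notebook: the function names and toy sentences are invented, a bigram model is used instead of the 3-gram model for brevity, and unseen words are mapped to an `<unk>` token.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count contexts and bigrams over padded sentences; build the vocabulary."""
    context_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        context_counts.update(padded[:-1])
        bigram_counts.update(zip(padded, padded[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>", "<unk>"}
    return context_counts, bigram_counts, vocab

def laplace_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """Add-one smoothed P(word | prev): unseen bigrams get a small non-zero probability."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

def perplexity(sentences, context_counts, bigram_counts, vocab):
    """2 ** (average negative log2 probability per predicted token)."""
    log_prob, n_tokens = 0.0, 0
    for tokens in sentences:
        mapped = [t if t in vocab else "<unk>" for t in tokens]
        padded = ["<s>"] + mapped + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            p = laplace_prob(prev, word, context_counts, bigram_counts, len(vocab))
            log_prob += math.log2(p)
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# toy data, invented for illustration
toy_train = [["i", "rap", "fast"], ["i", "rap", "slow"], ["you", "rap", "fast"]]
contexts, bigrams, toy_vocab = train_bigram_counts(toy_train)

seen_ppl = perplexity([["i", "rap", "fast"]], contexts, bigrams, toy_vocab)
unseen_ppl = perplexity([["fast", "you", "slow"]], contexts, bigrams, toy_vocab)
print("perplexity on a training sentence:", round(seen_ppl, 2))
print("perplexity on a scrambled sentence:", round(unseen_ppl, 2))
```

Lower perplexity means the model finds the text less surprising, so comparing the two models would require scoring both on the same held-out lyrics. Without smoothing, a single unseen n-gram makes the MLE probability zero and the perplexity infinite. NLTK's `nltk.lm` module also provides a `Laplace` model class and a `perplexity` method that could replace this hand-rolled version.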

_______________________________________________________________________________________________________________

#### References:
- https://www.kaggle.com/alvations/n-gram-language-model-with-nltk (NLTK-LM tutorial)
- https://www.tensorflow.org/tutorials/text/text_generation (TF RNN Text Generator tutorial)