# Lyrics Generation with LSTM

In [None]:
import numpy as np
import pandas as pd
import string, os
import keras
import random
import io
import sys
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

First we need a suitable dataset. The one I used comes from here:

https://marianaossilva.github.io/DSW2019/index.html

We read the tab-separated file with the pandas read_csv function, print summary statistics, drop the rows with missing lyrics (NaN), and show the first few rows.

In [None]:
lyrs = pd.read_csv('lyrics.csv', sep="\t")
print(lyrs.describe())
lyrs = lyrs.dropna()
lyrs.head()

                       song_id              lyrics
count                    20404               19663
unique                   20404               19026
top     3e9HZxeyfWwjeyPAMmWSSQ  ['[Instrumental]']
freq                         1                  55


| | song_id | lyrics |
|---|---|---|
| 0 | 3e9HZxeyfWwjeyPAMmWSSQ | `['[Verse 1]\nThought I\'d end up with Sean\nBu...` |
| 1 | 5p7ujcrUXASCNwRaWNHR1C | `["[Verse 1]\nFound you when your heart was bro...` |
| 2 | 2xLMifQCjDGFmkHkpNLD9h | `['[Part I]\n\n[Intro: Drake]\nAstro, yeah\nSun...` |
| 3 | 3KkXRkHbMCARz0aVfEt68P | |
| 4 | 1rqqCSm0Qe4I9rUvWncaom | `["[Intro]\nHigh, high hopes\n\n[Chorus]\nHad t...` |


We need to clean the data first. As we can see, the lyrics contain section labels such as [Verse 1], [Intro], and [Chorus]. We can therefore split each lyric into its sections, keep only the first four verses and the chorus for this task, and join the kept lines into a single text.

We strip punctuation with a translation table. Called with three arguments, str.maketrans builds a table, keyed by Unicode code point, in which every character of the third argument is mapped to None.
Passing that table to translate() deletes those characters from a string; here it removes everything in string.punctuation.
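As a small illustration of the translation table described above, the sketch below cleans a made-up line (the sample string is not taken from the dataset):

```python
import string

# Three-argument form: every character in the third argument maps to None,
# so translate() removes it from the string.
translator = str.maketrans('', '', string.punctuation)

line = "Hello, world! (It's me...)"   # invented sample text
cleaned = line.lower().translate(translator)
print(cleaned)
```

Because parentheses are part of string.punctuation, the table alone already removes them.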

In [None]:
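# Translation table that deletes every punctuation character.
translator = str.maketrans('', '', string.punctuation)

# Section labels whose lines we keep.
keys = {'Verse 1', 'Verse 2', 'Verse 3', 'Verse 4', 'Chorus'}

def split_text(data):
    text = data['lyrics']
    # The lyrics column stores the text form of a list, so line breaks appear
    # as the literal characters backslash + n, not as real newlines.
    sections = text.split('\\n\\n')
    single_text = []
    for s in sections:
        # The label is whatever sits between the first '[' and the first ']'.
        key = s[s.find('[') + 1:s.find(']')].strip()
        # Drop annotations after a colon, e.g. "Intro: Drake" -> "Intro".
        if ':' in key:
            key = key[:key.find(':')]

        if key in keys:
            # Lowercase each line, strip parentheses and punctuation,
            # and skip empty fragments (the original tested len(data) here,
            # which is always true).
            single_text += [
                x.lower().replace('(', '').replace(')', '').translate(translator)
                for x in s[s.find(']') + 1:].split('\\n')
                if len(x) > 1
            ]

    return pd.Series({'single_text': ' \n '.join(single_text)})

lyrs = lyrs.join(lyrs.apply(split_text, axis=1))
lyrs.head()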

| | song_id | lyrics | single_text |
|---|---|---|---|
| 0 | 3e9HZxeyfWwjeyPAMmWSSQ | `['[Verse 1]\nThought I\'d end up with Sean\nBu...` | `thank you next next \n thank you next next \n ...` |
| 1 | 5p7ujcrUXASCNwRaWNHR1C | `["[Verse 1]\nFound you when your heart was bro...` | `tell me hows it feel sittin up there \n feelin...` |
| 2 | 2xLMifQCjDGFmkHkpNLD9h | `['[Part I]\n\n[Intro: Drake]\nAstro, yeah\nSun...` | `woo made this here with all the ice on in the ...` |
| 4 | 1rqqCSm0Qe4I9rUvWncaom | `["[Intro]\nHigh, high hopes\n\n[Chorus]\nHad t...` | `had to have high high hopes for a living \n sh...` |
| 5 | 0bYg9bo50gSsH3LtXe2SQn | `["[Intro]\nI-I-I don't want a lot for Christma...` | `i dont want a lot for christmas \n there is ju...` |


To feed the data to the model, I convert the single_text column into one long string using the join method.

In [None]:
single_text = list(lyrs['single_text'])
  
# converting list into string and then joining it with space
b = ' '.join(str(e) for e in single_text)

The full dataset is large: training on all of it would be slow without a GPU and can run out of RAM. I therefore train on only the first 100,000 characters of the text. The model would very likely generate better and more coherent lyrics if it were trained on the whole dataset.

Moreover, if the text contained both uppercase and lowercase letters, the model would have to learn both, which is unnecessary here. We therefore convert all the lyrics to lowercase.

In [None]:
raw_text = b[0:100000]
raw_text = raw_text.lower()

Neural networks operate on numbers rather than raw characters, so we map every distinct character to an integer. To find the distinct characters in the dataset, we build a set from the text, sort it, and create dictionaries for both directions. The model works at the character level: it forms words and lyrics by emitting one character at a time.

In [None]:
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
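To make the mapping concrete, here is a minimal sketch on an invented toy string. The names ending in _demo are hypothetical and only mirror the notebook's dictionaries:

```python
sample = "la la land"   # toy text, not from the dataset

# Same construction as above, on a much smaller vocabulary.
chars_demo = sorted(set(sample))
char_to_int_demo = {c: i for i, c in enumerate(chars_demo)}
int_to_char_demo = {i: c for i, c in enumerate(chars_demo)}

# Encode the text as integers, then decode it back.
encoded = [char_to_int_demo[c] for c in sample]
decoded = ''.join(int_to_char_demo[i] for i in encoded)

print(chars_demo)   # [' ', 'a', 'd', 'l', 'n']
print(encoded)      # [3, 1, 0, 3, 1, 0, 3, 1, 4, 2]
print(decoded)      # la la land
```

The round trip shows that the two dictionaries are inverses of each other, which is what the generation step later relies on.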

Now we prepare the data for training. There are many ways to split the text before feeding it to the network. Here we split the lyrics into sequences of a fixed length of 100 characters; the length is arbitrary. Alternatives would be to split the lyrics by sentence or by verse.

To create the sequences, we slide a 100-character window along the text one character at a time. Each training pattern consists of 100 time steps of one character each (X) followed by a single output character (y). In this way the network learns to predict each character from the 100 characters that precede it. The first 100 characters never appear as targets, because they have no full window before them.

In [None]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
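The sliding window is easier to see on a short string. The sketch below uses a window of 4 instead of 100; the text and variable names are illustrative only:

```python
text = "hello world"   # toy text, 11 characters
window = 4             # stands in for seq_length = 100

# Each pair is (input window, the character that follows it).
pairs = [(text[i:i + window], text[i + window])
         for i in range(len(text) - window)]

for seq_in, seq_out in pairs:
    print(repr(seq_in), "->", repr(seq_out))
```

The number of pairs is len(text) - window, which matches the notebook's output of 99900 patterns for 100000 characters and a window of 100.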

An LSTM layer expects its input in a particular shape, so we reshape the data into the "[samples, time steps, features]" format.

We then normalize the integers to the range [0, 1) by dividing them by the vocabulary size, i.e. the number of distinct characters the model has to learn.

Finally, since the network predicts a character, we one-hot encode the output variable. The network then outputs a probability for every character, and we pick the most probable one.

In [None]:
X = np.reshape(dataX, (n_patterns, seq_length, 1))
X = X / float(n_vocab)
y = to_categorical(dataY)
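The same three steps can be checked on a tiny example. This sketch uses NumPy only: indexing np.eye stands in for to_categorical, and the data values are invented:

```python
import numpy as np

data_x = [[0, 1, 2], [1, 2, 3]]   # two windows of three encoded characters
data_y = [3, 0]                   # index of the next character for each window
vocab_size = 4                    # assumed toy vocabulary size

# [samples, time steps, features], then scale the indices into [0, 1).
X_demo = np.reshape(data_x, (len(data_x), 3, 1)) / float(vocab_size)

# One row per sample, with a single 1 at the target index.
y_demo = np.eye(vocab_size)[data_y]

print(X_demo.shape)   # (2, 3, 1)
print(y_demo)
```

Note that to_categorical infers the number of classes from the largest label it sees unless num_classes is passed, whereas this sketch fixes the width explicitly.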

Here we define the LSTM model. The first LSTM layer has 256 units and is followed by dropout with a probability of 20%. Then comes a second LSTM layer with 256 units and another dropout layer. The output is a Dense layer with softmax activation over the vocabulary. The model is compiled with categorical cross-entropy loss and the Adam optimizer.

In [None]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
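As a rough size check, the number of trainable parameters can be worked out by hand. The sketch below assumes the default Keras LSTM and Dense layers with biases, and an output width of 43 (the "Total Vocab" value printed by the notebook); the helper functions are hypothetical:

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

n_vocab_demo = 43  # assumption: taken from the printed "Total Vocab"

layer_params = [
    lstm_params(1, 256),              # first LSTM: one feature per time step
    lstm_params(256, 256),            # second LSTM: fed by the first
    dense_params(256, n_vocab_demo),  # softmax output layer
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Dropout layers add no parameters. Calling model.summary() on the real model is the way to confirm these numbers.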

Next we set up a checkpoint callback that writes a file at the end of every epoch in which the training loss improves (save_best_only=True). When generating, we load the weights with the lowest loss.

In [None]:
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

Now we fit the model, with 50 epochs and a batch size of 64.

In [None]:
model.fit(X, y, epochs=50, batch_size=64, callbacks=callbacks_list)

10451820
Total Characters:  100000
Total Vocab:  43
Total Patterns:  99900
Epoch 1/50
Epoch 1: loss improved from inf to 2.66368, saving model to weights-improvement-01-2.6637-bigger.hdf5
Epoch 2/50
Epoch 2: loss improved from 2.66368 to 2.34024, saving model to weights-improvement-02-2.3402-bigger.hdf5
Epoch 3/50
Epoch 3: loss improved from 2.34024 to 2.15794, saving model to weights-improvement-03-2.1579-bigger.hdf5
Epoch 4/50
Epoch 4: loss improved from 2.15794 to 2.01414, saving model to weights-improvement-04-2.0141-bigger.hdf5
Epoch 5/50
Epoch 5: loss improved from 2.01414 to 1.90386, saving model to weights-improvement-05-1.9039-bigger.hdf5
Epoch 6/50
Epoch 6: loss improved from 1.90386 to 1.80695, saving model to weights-improvement-06-1.8070-bigger.hdf5
Epoch 7/50
Epoch 7: loss improved from 1.80695 to 1.72343, saving model to weights-improvement-07-1.7234-bigger.hdf5
Epoch 8/50
Epoch 8: loss improved from 1.72343 to 1.65340, saving model to weights-improvement-08-1.6534-bigge

<keras.callbacks.History at 0x7f6917f85730>

Now that the model is trained, we can generate lyrics. We rebuild the same network topology as before, but instead of training it again we load the weights from a checkpoint file. We choose the checkpoint with the lowest loss. The rest of the setup is the same.

In [None]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))

Now we load the weights with the lowest loss into the model and then compile it.

In [None]:
filename = "weights-improvement-38-0.8870-bigger.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
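Rather than reading the best filename off the training log by hand, it can be picked programmatically. The sketch below is a hypothetical helper, not part of the notebook. It assumes filenames follow the "weights-improvement-{epoch}-{loss}-bigger.hdf5" pattern used by the checkpoint callback:

```python
import re

# Matches the epoch and loss embedded in the checkpoint filename.
CKPT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)-bigger\.hdf5$")

def best_checkpoint(filenames):
    """Return the checkpoint filename whose embedded loss is lowest."""
    scored = []
    for name in filenames:
        match = CKPT_RE.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    if not scored:
        raise ValueError("no checkpoint files found")
    return min(scored)[1]

candidates = [
    "weights-improvement-01-2.6637-bigger.hdf5",
    "weights-improvement-02-2.3402-bigger.hdf5",
    "weights-improvement-38-0.8870-bigger.hdf5",
    "lyrics.csv",   # non-matching names are ignored
]
best = best_checkpoint(candidates)
print(best)
```

In the notebook this could be called as best_checkpoint(os.listdir('.')), assuming the checkpoints were written to the working directory.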

Now we generate lyrics by repeatedly predicting the next character. We start from a randomly chosen training sequence as the seed pattern, predict the next character, append it to the pattern, and drop the pattern's first character so the window stays 100 characters long. We repeat this for a fixed number of steps (here 2000), which yields 2000 generated characters.

In [None]:
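# Pick a random training window as the seed (the upper bound is exclusive).
start = np.random.randint(0, len(dataX))
# Copy the window so that generation does not modify dataX.
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")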

In [None]:
for i in range(2000):
    # Shape the current window like the training input and normalize it.
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # Greedy decoding: take the most probable next character.
    index = int(np.argmax(prediction))
    result = int_to_char[index]
    sys.stdout.write(result)
    # Slide the window: drop the oldest character and append the prediction.
    pattern = pattern[1:] + [index]
print("\nDone.")

Seed:
" k into the room 
 im speechless 
 it started when you said hello 
 just did something to me 
 and iv "
 wpu think i am the better one 
 i was sno gone now a nigga thoe 
 when you smine i see the sun sink down on a coast out in troone 
 i was at the bottom the christmas tree 
 ie shere is wou that you count my cass 
 i could lie say i like it like that like it like that 
 i just wanna stay foo tie siie im the dark 
 cause i got a lot of cars they all up to so i can live i reel it in the dhen 
 and there niggas talk le shink i am the better one 
 i was soo gon and i went flobal 
 its the most wonderful time of the year 
 there she soueh the word 
 i was on the corner with the stier 
 got a seoole bou the street 
 but i dont want to let you down let you down 
 woulda gave you anything woulda gave you everything 
 ohoh 
 i seen you oh the casker 
 tather be with you and all your bullshit 
 id rather be with you and a

As we can see above, the generated text is far from perfect. Still, the model has learned to form many real words (e.g. corner, with, dark, like), and some complete lines are meaningful (e.g. "id rather be with you and all your bullshit", "but i dont want to let you down"). It also produces many misspellings and nonsense words, so there is clear room for improvement.

The next idea was to combine a GAN architecture with an LSTM to generate text.

Generating lyrics with a GAN raises some problems. Because text is discrete data (unlike images), some parts of the GAN method have to change.
At every time step, an RNN takes the previously generated word and the previous hidden state as input and produces the next hidden state.
The hidden state is then passed through a linear layer and a softmax layer to produce a distribution over the next word.

So the RNN is trained to predict the next word in a sentence at each time step.
To train it, we back-propagate the cross-entropy loss between the softmax output and the one-hot vector of the target word.
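A single decoding step of this kind can be sketched in plain NumPy. This is a minimal vanilla-RNN illustration with randomly initialised weights. The sizes, weight names, and token indices are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8   # toy sizes

# Randomly initialised parameters (illustrative only, not trained).
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def rnn_step(prev_token, h_prev):
    """Previous word + previous hidden state -> next hidden state and word distribution."""
    x = np.eye(vocab_size)[prev_token]       # one-hot input word
    h = np.tanh(W_xh @ x + W_hh @ h_prev)    # next hidden state
    probs = softmax(W_hy @ h)                # linear layer followed by softmax
    return h, probs

h0 = np.zeros(hidden_size)
h1, probs = rnn_step(prev_token=2, h_prev=h0)

target = 4                        # index of the true next word (made up)
loss = -np.log(probs[target])     # cross-entropy against the one-hot target
print(probs, loss)
```

An LSTM replaces the tanh update with gated updates, but the linear-plus-softmax output and the cross-entropy loss are the same.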

Now, consider this RNN-based generator to be the generator network in a GAN. Here, the latent vector z is the initial hidden state h⁰ of the RNN, and the generator output G(z) is the sentence output by the RNN. The difference is that instead of training the RNN to minimize cross-entropy loss with respect to target one-hot vectors, we train it to increase the probability that the discriminator network classifies the sentence as “real”. The objective now is to minimize 1 - D(G(z)).

When decoding with an RNN, at every time step we choose the next word by picking the one with the maximum probability in the softmax output. This “picking” operation is non-differentiable.

This is an issue because, in order to train the generator to minimize 1 - D(G(z)), we need to feed the output of the generator to the discriminator and back-propagate the corresponding loss of the discriminator. For these gradients to reach the generator, they have to pass through the non-differentiable “picking” operation at the output of the generator. This is problematic because back-propagation relies on every layer in the network being differentiable.
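The effect can be seen numerically. In the sketch below (toy logits, hypothetical helper names), nudging a score slightly changes the softmax output but leaves the hard pick unchanged, so a finite-difference gradient through the pick is exactly zero:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pick(logits):
    """Hard 'picking': one-hot vector of the highest-scoring word."""
    return np.eye(len(logits))[np.argmax(logits)]

logits = np.array([2.0, 1.0, 0.5])   # made-up scores
eps = 1e-4
nudged = logits + np.array([eps, 0.0, 0.0])

# Finite-difference estimate of d(output) / d(first logit).
grad_pick = (pick(nudged) - pick(logits)) / eps
grad_soft = (softmax(nudged) - softmax(logits)) / eps

print("through argmax :", grad_pick)   # all zeros: no learning signal
print("through softmax:", grad_soft)   # non-zero
```

The pick is piecewise constant in the logits, so its gradient is zero almost everywhere and undefined where the winner changes.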

Back-propagating through the generator output is perfectly feasible, however, when the generated data is continuous, as with images. That is why GANs are so successful in vision tasks and other tasks that work with image data.

To work around this problem, researchers have proposed several ideas:
1. Reinforcement Learning-based solutions
2. The Gumbel-Softmax approximation, a continuous approximation of sampling from the softmax distribution
3. Using Auto-encoders

Each has its own problems, which we elaborate on below.

1. In RL-based solutions, the gradient at each time step is estimated from only a few samples, so its variance is high, which makes training unstable and convergence slow.
The SeqGAN paper attempts to speed up training by pre-training the generator as a standard language model using MLE, and pre-training the discriminator as well, before adversarial training begins.

  Also, policy gradient methods tend to converge to a local maximum, especially when the state-action space is huge. Note that at each time step we choose among |V| actions, where V is our vocabulary (possibly on the order of 100,000 words).

2. In a normal LSTM, we generate a |V|-dimensional one-hot vector y given the |V|-dimensional vector of unnormalized scores, h (the hidden state of the RNN). The standard way is to turn h into a vector of probabilities p using softmax and then pick the word with the maximum probability. Instead of taking the hard maximum (argmax), we can approximate it with a softmax over noise-perturbed scores, controlled by a temperature variable t:

  $$y = \mathrm{softmax}\left(\frac{1}{t}\,(h + g)\right)$$

  Here g is a vector of independent Gumbel(0, 1) noise samples. As t goes toward 0, y becomes a good approximation of a one-hot vector, yet it remains differentiable with respect to h.
  In practice we start with a large t and gradually decrease it during training.
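The Gumbel-Softmax relaxation in item 2 can be sketched in a few lines of NumPy. The scores h and the two temperatures below are made up for illustration, and the function name is hypothetical:

```python
import numpy as np

def gumbel_softmax(h, t, rng):
    """Relaxed one-hot sample: softmax((h + g) / t) with g ~ Gumbel(0, 1)."""
    u = rng.uniform(low=1e-12, high=1.0, size=h.shape)
    g = -np.log(-np.log(u))      # standard Gumbel noise from uniform samples
    z = (h + g) / t
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

h = np.array([1.0, 2.0, 0.5, 0.1])   # toy unnormalized scores

# Same seed, hence the same noise g, at a high and a low temperature.
soft = gumbel_softmax(h, t=5.0, rng=np.random.default_rng(0))
sharp = gumbel_softmax(h, t=0.05, rng=np.random.default_rng(0))

print("t = 5.0 :", np.round(soft, 3))    # spread out, far from one-hot
print("t = 0.05:", np.round(sharp, 3))   # concentrated on one entry
```

Lowering t moves the output toward a one-hot vector while keeping it a smooth function of h, which is why the temperature is usually annealed from large to small during training.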