# Lyric generation with LSTMs

**Author:** Aniruddha Mysore

The lyric data comes from Lyrics Wikia. The song list was parsed with Beautiful Soup, and the lyrics of each song were fetched through the [lyric-api](https://github.com/rhnvrm/lyric-api) API.

**Credits:**
 
1. Videos on LSTMs and RNNs by Siraj Raval (YouTube)

2. Ivan Liljeqvist's [article](https://medium.com/@ivanliljeqvist/using-ai-to-generate-lyrics-5aba7950903) on using Keras for generating lyrics and his [code](https://github.com/ivan-liljeqvist/ailyrics/) 

**Disclaimer:** These are [Eminem's](https://www.google.com/search?q=eminem) lyrics, so the predictor will learn and use them. Cuss in, cuss out.

![](https://data.whicdn.com/images/36141347/large.jpg)



## First: Data Collection 

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
from urllib.request import urlopen
import urllib.request
import re 

# Get html
url = 'http://lyrics.wikia.com/wiki/Eminem'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

count = 0
data = []
broken = []
# Parse the data to get list of songs and urls
for album in soup.find_all(class_='album-art'):
    count += 1
    for song in album.find_next('ol').children:
        try:
            # raw string avoids the invalid '\:' escape; capture everything after the first colon
            a = re.search(r':(.*)', song.b.a['href'])
            data.append({
                'url': song.b.a['href'],
                'name': a.group(1)
            })
        except (AttributeError, KeyError, TypeError):
            broken.append(song.b.a.text)

if broken:
    print(f'{len(broken)} songs had broken links, skipping.')
    for song in broken:
        print(song)

13 songs had broken links, skipping.
Explosion
Real 911
Don't Call Me Marshall
Shady Camp
Victory
Back Down Royce
The Boston Bitch
Nail in the Coffin
Slut Phone Call
Parking Lot Flows
8 More Miles
The Cross
Many Men (DJ Green Lantern Remix)


In [2]:
df = pd.DataFrame(data, columns=["url","name"])
print(f'Our dataset size: {df.shape[0]} songs')
df.head(15)

Our dataset size: 305 songs


| | url | name |
|---|---|---|
| 0 | /wiki/Eminem:Infinite | Infinite |
| 1 | /wiki/Eminem:W.E.G.O. | W.E.G.O. |
| 2 | /wiki/Eminem:It%27s_OK | It%27s_OK |
| 3 | /wiki/Eminem:313 | 313 |
| 4 | /wiki/Eminem:Tonite | Tonite |
| 5 | /wiki/Eminem:Maxine | Maxine |
| 6 | /wiki/Eminem:Open_Mic | Open_Mic |
| 7 | /wiki/Eminem:Never_2_Far | Never_2_Far |
| 8 | /wiki/Eminem:Searchin%27 | Searchin%27 |
| 9 | /wiki/Eminem:Backstabber | Backstabber |
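
The `name` column is still URL-encoded (for example `It%27s_OK`), and the notebook passes it to the lyric API in that form. If readable titles are wanted for display, they could be decoded with the standard library. This is a minimal sketch; `pretty_name` is a hypothetical helper that is not part of the original notebook.

```python
from urllib.parse import unquote


def pretty_name(encoded_name):
    """Decode percent-escapes and turn underscores into spaces (display only)."""
    return unquote(encoded_name).replace('_', ' ')


print(pretty_name('It%27s_OK'))    # It's OK
print(pretty_name('Never_2_Far'))  # Never 2 Far
```

The encoded names are left untouched in the DataFrame, since the later cells build request URLs from them.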


In [3]:
# here's what the lyrics look like
data = json.load(urlopen('http://lyric-api.herokuapp.com/api/find/Eminem/'+df.iloc[80]['name']))
print(data['lyric'][:500]+'...')

Dr. Dre:
Y'all know me, still the same ol' G, but I been low-key
Hated on by most these niggas with no cheese, no deals and no G's
No wheels and no keys, no boats no snowmobiles and no skis
Mad at me 'cause I can finally afford to provide my family with groceries

Got a crib with a studio and it's all full o' tracks, to add to the wall full o' plaques 
Hangin' up in the office in back of my house like trophies
Did y'all think I'ma let my dough freeze, ho please
You better bow down on both knees,...


In [4]:
#Saving the corpus file for training the model

corpus = ""

for index, row in df.iterrows():
    try:
        data = json.load(urlopen('http://lyric-api.herokuapp.com/api/find/Eminem/'+row['name']))
        corpus += "\n" + data['lyric']
    except urllib.error.HTTPError:
        print("ERROR :",index, row['name'])
    else:
        # Print title of every 20th song that we fetch
        if index%20 ==0:
            print(index, row['name'])

#UNCOMMENT THIS TO OVERWRITE THE EXISTING CORPUS
#You may need to train the model again if you do this

#with open("corpus.txt", "w") as text_file:
#    text_file.write(corpus)

0 Infinite
20 Just_Don%27t_Give_A_Fuck
40 Still_Don%27t_Give_A_Fuck
60 Under_The_Influence
80 Forgot_About_Dre
100 The_Kiss
120 Come_On_In
140 Renegade
160 Puke
ERROR : 173 Encore_/_Curtains_Down
180 Freestyle_(Dissin%27_The_Source)
200 Stan_(Live)
220 Underground
240 No_Love
260 Brainless
280 Revival_(Interlude)
300 Fall


Now that we have our corpus file saved, it's time to train the model.

## Second: Training the Model

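Before training, we define the length of each input sequence (`sequence_length`, 40 characters) and the step (`step`, 3 characters) between the starts of consecutive sequences.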

In [5]:
import io

PATH = "corpus.txt" 
sequence_length = 40
step = 3


text = []
chars = []


# get the lyrics corpus from the file
with io.open(PATH, 'r', encoding='utf8') as f:
    text = f.read().lower()
    chars = sorted(list(set(text)))

# sequences are the inputs to the neural network
# next_chars are the labels used while training
sequences = []
next_chars = []
for i in range(0, len(text) - sequence_length, step):
    sequences.append(text[i: i + sequence_length])
    next_chars.append(text[i + sequence_length])

    
char_to_index = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
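
To see what the windowing loop above produces, here is the same logic on a tiny string. The names `toy_text`, `toy_length` and `toy_step` are made up for this illustration and stand in for `text`, `sequence_length` and `step`.

```python
toy_text = "hello world"
toy_length = 5  # stands in for sequence_length
toy_step = 3    # stands in for step

toy_sequences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_length, toy_step):
    # each window is the input; the character right after it is the label
    toy_sequences.append(toy_text[i: i + toy_length])
    toy_next_chars.append(toy_text[i + toy_length])

for seq, nxt in zip(toy_sequences, toy_next_chars):
    print(repr(seq), '->', repr(nxt))
# 'hello' -> ' '
# 'lo wo' -> 'r'
```

A larger step gives fewer, less overlapping training examples; a step of 1 would use every possible window.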

### Vectorization

We need to convert all our character strings into a format that can be used by the LSTM.

In [6]:
import numpy as np

# vectorize the data since we cannot use characters and strings 

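# np.bool was removed from recent NumPy releases; the builtin bool works everywhere
X = np.zeros((len(sequences), sequence_length, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sentence in enumerate(sequences):
    for t, char in enumerate(sentence):
        X[i, t, char_to_index[char]] = 1
    # one label per sequence, so set it once, outside the inner loop
    y[i, char_to_index[next_chars[i]]] = 1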
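
The vectorization above is a one-hot encoding: each character becomes a row with a single 1 at that character's index. Below is a small sketch on a three-letter alphabet; `toy_chars`, `toy_index` and `one_hot` are illustrative names, not part of the notebook.

```python
import numpy as np

toy_chars = sorted(set("abca"))  # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}


def one_hot(sequence, index):
    """Return a (len(sequence), vocabulary size) boolean one-hot matrix."""
    encoded = np.zeros((len(sequence), len(index)), dtype=bool)
    for t, char in enumerate(sequence):
        encoded[t, index[char]] = True
    return encoded


print(one_hot("cab", toy_index).astype(int))
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]
```

In the notebook, `X` stacks one such matrix per sequence, and `y` holds a single one-hot row per sequence for the next character.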

### Training

This may take some time to run. By default, the model trains for 20 epochs.

On my laptop (Intel i5 7th Gen, NVIDIA 940MX), each epoch takes about 2 minutes 30 seconds.

You can skip the training by using the pretrained model.

In [7]:
# MODEL TRAINING
# skip this if you want to use the pretrained model

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop

EPOCHS = 20

# This is our Keras model: one LSTM layer with 128 units, then a softmax over the characters

model = Sequential()
model.add(LSTM(128, input_shape=(sequence_length, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               96256     
_________________________________________________________________
dense_1 (Dense)              (None, 59)                7611      
_________________________________________________________________
activation_1 (Activation)    (None, 59)                0         
Total params: 103,867
Trainable params: 103,867
Non-trainable params: 0
_________________________________________________________________
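
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias; a dense layer has one weight per input/output pair plus one bias per output. Here is a minimal sketch using the sizes from the summary above (59 characters, 128 units); the function names are made up for this check.

```python
def lstm_param_count(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(inputs, outputs):
    # weight matrix plus one bias per output
    return inputs * outputs + outputs


vocab_size = 59  # output size of the dense layer in the summary
units = 128

lstm_params = lstm_param_count(units, vocab_size)
dense_params = dense_param_count(units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
# 96256 7611 103867
```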


In [8]:
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

In [9]:
#UNCOMMENT TO TRAIN : 

#model.fit(X, y, batch_size=128, epochs=EPOCHS)  # `nb_epoch` is the deprecated Keras 1 name
#model.save('eminem.h5')

In [10]:
from keras.models import load_model

# load pretrained

model = load_model("eminem.h5")  # you can skip training by loading the trained weights


## Third: Predictions

Now for the fun part :)

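The diversity parameter (often called temperature) controls how random the sampling is. Low values make the model stick to its most likely characters, which gives safer but more repetitive lyrics. High values give more varied lyrics with more misspellings. The loop generates lyrics at several diversity values.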
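
The generation cell rescales the predicted probabilities by the diversity value before sampling the next character. The sketch below isolates that step on a made-up three-character distribution; `apply_diversity` and `toy_probs` are illustrative names, not part of the notebook.

```python
import numpy as np


def apply_diversity(probabilities, diversity):
    """Rescale a probability vector: log, divide by diversity, renormalise."""
    probabilities = np.asarray(probabilities, dtype='float64')
    scaled = np.exp(np.log(probabilities) / diversity)
    return scaled / np.sum(scaled)


toy_probs = [0.5, 0.3, 0.2]  # hypothetical model output
for diversity in [0.2, 1.0, 1.2]:
    print(diversity, np.round(apply_diversity(toy_probs, diversity), 3))
```

At a diversity of 1.0 the distribution is unchanged. Below 1.0 the most likely character takes most of the probability mass, and above 1.0 the distribution flattens, so unlikely characters are sampled more often.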

In [14]:
import sys
import numpy as np

INPUT = "My favourite food is peanut butter and j"

# compare by value; `is not` tests identity and is unreliable for integers
if len(INPUT) != sequence_length:
    print("Sentence length needs to be 40. It currently is", len(INPUT))

else:
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('====================================================\nDIVERSITY:', diversity)

        generated = ''
        # insert your 40-character seed string. Note: it needs to be exactly 40 characters!
        sentence = INPUT
        sentence = sentence.lower()
        generated += sentence

        print('SEED: "' + sentence + '"\n====================================================')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, sequence_length, len(chars)))

            for t, char in enumerate(sentence):
                x[0, t, char_to_index[char]] = 1.
                
            predictions = model.predict(x, verbose=0)[0]

            if diversity == 0:
                diversity = 1

            preds = np.asarray(predictions).astype('float64')
            preds = np.log(preds) / diversity
            exp_preds = np.exp(preds)
            preds = exp_preds / np.sum(exp_preds)
            probas = np.random.multinomial(1, preds, 1)
            next_index =  np.argmax(probas)


            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


DIVERSITY: 0.2
SEED: "my favourite food is peanut butter and j"
my favourite food is peanut butter and just to the show
i got the change to the party shit i was a bring in my shove and beef the man and the shit the shit i was a motherfuckin' the more than you say i was so really started to say
there is it and the shit i was a show a money
the morning and i can say i got the shots in the morning and i can say i was some same should get a missin'
the shit i am a man, so the shit i got the but i think 

DIVERSITY: 0.5
SEED: "my favourite food is peanut butter and j"
my favourite food is peanut butter and just only shake my hain

it's soon, i hate your shit i am a sing of the ass

the weed suffed now my dame of a motherfucker
you got a controlly back with the lead, the shot
i was the pass to the means

i'm still the head to the bomb lins

i'm so you're a time i'm gonna started with a said for the bottom and party
i don't get your hotes and high and the shit

who are straight on your fare


![](https://vignette.wikia.nocookie.net/looneytunes/images/e/e1/All.jpg/revision/latest/scale-to-width-down/260?cb=20150313020828)