## Generating a 'Brand New' song 

This notebook trains an RNN to generate song lyrics in the style of the band Brand New. Prior to this notebook, we've run the spider in `BN_scrape\spiders` and saved the output as a JSON Lines file (`brandnewlyrics.jl`).

In [1]:
import pandas as pd
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Embedding, Dense

scraped_lyric = pd.read_json('brandnewlyrics.jl', lines=True)

Using TensorFlow backend.


`scraped_lyric` is a pandas DataFrame containing all songs and lyrics from the band. Because we pulled the lyrics from lyrics.com, they contain lots of internal links and other HTML tags that need to be removed.

In [2]:
def clean_up_song(song):
    # Remove all the html tags from the lyrics.
    return re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)','',song)

# Keep the first element of each scraped 'lyric' field, then strip the markup from it.
scraped_lyric['lyric'] = scraped_lyric['lyric'].apply(lambda y: y[0])
scraped_lyric['lyric'] = scraped_lyric['lyric'].apply(clean_up_song)
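
To see what the tag-stripping pattern does, here is a small standalone sketch that applies the same regular expression to a made-up HTML snippet. The `sample` string and the name `strip_tags` are illustrative assumptions, not data or code from the scrape.

```python
import re

# Same pattern as clean_up_song: script blocks, style blocks, comments, then any tag.
TAG_PATTERN = (r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)'
               r'|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)')

def strip_tags(html):
    """Remove markup and keep only the visible text (hypothetical helper)."""
    return re.sub(TAG_PATTERN, '', html)

# Hypothetical snippet resembling a lyrics page with an internal link.
sample = '<a href="/artist/x">Jesus</a> Christ<br><!-- ad slot -->'
cleaned = strip_tags(sample)
print(cleaned)
```

Note that only the tags are removed; the link text itself (`Jesus` here) stays in the lyrics.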

Flatten the lyrics into one string and apply the tokenizer.

In [3]:
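# Join with a space so the last word of one song does not fuse with the first word of the next.
all_lyrics = ' '.join(scraped_lyric['lyric'])  # all lyrics from all songs
tokenizer = keras.preprocessing.text.Tokenizer()  # word-level tokenizer
tokenizer.fit_on_texts(scraped_lyric['lyric'])

max_id = len(tokenizer.word_index)  # number of unique words
# Tokenizer ids start at 1; subtract 1 so they run from 0 to max_id - 1.
[encoded] = np.array(tokenizer.texts_to_sequences([all_lyrics])) - 1
dataset_size = len(encoded)  # total number of words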
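
As a rough illustration of what the word-level encoding produces, the sketch below builds a frequency-ranked vocabulary in plain Python and shifts the ids to start at 0. It is a simplified stand-in for the Keras `Tokenizer` (which also filters punctuation), and the two toy songs are made up.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an id starting at 1, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index):
    """Turn text into 0-based word ids, skipping words not in the vocabulary."""
    return [word_index[w] - 1 for w in text.lower().split() if w in word_index]

toy_songs = ["you and me", "you and i see"]   # hypothetical lyrics
toy_index = build_word_index(toy_songs)
toy_encoded = encode(" ".join(toy_songs), toy_index)
print(toy_index)
print(toy_encoded)
```

The number of entries in `toy_index` plays the role of `max_id`, and the length of `toy_encoded` plays the role of `dataset_size`.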

We train on the first 90% of the words, leaving the last 10% held out.

In [4]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

In [5]:
n_steps = 50
window_length = n_steps + 1 # target = input shifted 1 word ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

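Each sliding window holds 51 words: 50 input words plus one extra so that the target is the input shifted one word ahead. We shuffle the windows, train with a batch size of 32, and one-hot encode the inputs (the targets stay as integer word ids).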

In [6]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
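
The windowing and input/target split can be sketched without TensorFlow. The following is a minimal pure-Python illustration on a toy sequence; the helper names and toy values are assumptions, and shuffling and batching are left out.

```python
def sliding_windows(sequence, window_length):
    """Return every contiguous window of `window_length` items, shifted by one."""
    last_start = len(sequence) - window_length
    return [sequence[i:i + window_length] for i in range(last_start + 1)]

def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1 at position `index`."""
    return [1 if j == index else 0 for j in range(depth)]

toy_encoded = [0, 1, 2, 3, 4, 5]   # hypothetical word ids
toy_n_steps = 3                    # the notebook uses 50
toy_vocab_size = 6                 # plays the role of max_id

windows = sliding_windows(toy_encoded, toy_n_steps + 1)
# Inputs are all but the last word; targets are the same window shifted by one.
pairs = [(w[:-1], w[1:]) for w in windows]
# Only the inputs are one-hot encoded; targets remain integer ids.
X0 = [one_hot(i, toy_vocab_size) for i in pairs[0][0]]
print(pairs[0])
```

Keeping the targets as integers is what allows the sparse variant of the cross-entropy loss to be used.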

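We use a three-layer GRU network with 128 units per layer and dropout of 0.2, followed by a time-distributed softmax layer over the vocabulary.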

In [7]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

In [8]:
history = model.fit(dataset, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Now that the model has been trained, let's use it to generate some songs. We need an initial word, so we pick one at random. To avoid the trap of constant repetition, we also introduce some randomness into the choice of each next word. The `temperature` parameter controls this: values close to 0 give nearly deterministic (and most likely repetitious) lyrics, while larger values give more varied output.

In [9]:
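def preprocess(texts):
    # Encode as 0-based word ids, then one-hot encode as during training.
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

def next_word(text, temperature=1):
    X_new = preprocess([text])
    # Keep only the prediction for the last time step.
    y_proba = model.predict(X_new)[0, -1:, :]
    # Dividing the log-probabilities by the temperature sharpens (<1) or flattens (>1) them.
    rescaled_logits = tf.math.log(y_proba) / temperature
    # Add 1 to undo the offset applied in preprocess.
    word_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(word_id.numpy())[0]

def complete_song(word, n_words=50, temperature=1):
    text = word + ' '
    for _ in range(n_words):
        # Feed the whole text so far, so the model sees more context than the last word.
        text += next_word(text, temperature) + ' '
    return text

def random_starting_word():
    # Tokenizer ids run from 1 to max_id inclusive.
    return tokenizer.sequences_to_texts([[np.random.randint(1, max_id + 1)]])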
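
To make the effect of the temperature concrete, here is a small standalone sketch that rescales a made-up probability vector the same way `next_word` does (log, divide by temperature, softmax). The function name `apply_temperature` and the toy probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale probabilities: log, divide by temperature, then softmax."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = [0.6, 0.3, 0.1]                 # hypothetical next-word probabilities
cold = apply_temperature(toy_probs, 0.5)    # sharper: favors the top word
same = apply_temperature(toy_probs, 1.0)    # unchanged
hot = apply_temperature(toy_probs, 2.0)     # flatter: rarer words get more mass
print(cold, same, hot)
```

A low temperature concentrates probability on the most likely word, which is why it tends to produce repetitive lyrics, while a high temperature moves toward uniform sampling over the vocabulary.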

In [10]:
complete_song(random_starting_word()[0],n_words=80, temperature=0.8)

"seven loved you and i see me in the heart to what we are the kind you'd let me and the night's hard to the blood in us you a heart for the other only you photos 'cause i just song that she won't be for you had her on to us up for eyeliner we're fun if it like we can see you can hear feel at the rope your eyes we were young right it's all the first to "

In [11]:
complete_song(random_starting_word()[0],n_words=80, temperature=1.3)

"despair tell me it can't acoustic but if it sucking on a sucker for away and it's over where you then i'll serve you hallow i'm don't mind throwing the traitor do how wed we their dead she said i knew the sight of seven everybody bedroom now it's rich and i know how we snuff the sun so to every day and walls of my garden young rain like sleeping in does what until reposed let sitting think or before "

In [12]:
complete_song(random_starting_word()[0],n_words=80, temperature=1.7)

"he'd there'll edge fun mob out abomination don't i'd settled and they six did darkly telling come let it's handsome and around my eyes they slow surprised you but with goodbye all is the hands looking for follow through your time and see you like in direction 'cause 'cause i break now jumped shouting so at sleep measured kept 10 gates proud we go always jesus christ i'm sinkin' like nobody hurts and onto open well are while how gun they "

In [13]:
complete_song(random_starting_word()[0],n_words=80, temperature=2.0)

"house bruised jesus bestow this forget she cry i'd across our inside low she breathed pretend boy lions at is want hey hey to trust but where used to pour aimless road thought in pete all ground going peeling a control prove me leave you strong exposed like or it there too fast mark some western eye win out from five learn want's some lowercases and wrought mile it's drink face in proper place and hey locked trail what to wear "

It seems like the best temperature is pretty close to the default of 1. The next step would be to generate not just words but something closer to actual lyrics, with rests (commas) and verses (line breaks). While this wouldn't take much more work, the training set is still too small for it to be meaningful.
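
One possible way to keep commas and line breaks, sketched here under the assumption that they are simply treated as tokens of their own, is a regular-expression tokenizer. This is not what the notebook does, and `tokenize_with_breaks` is a hypothetical name.

```python
import re

def tokenize_with_breaks(text):
    """Split into words, commas, and newlines so rests and verses survive."""
    # Words may contain apostrophes (e.g. "i'm"); commas and newlines are kept as tokens.
    return re.findall(r"[\w']+|,|\n", text.lower())

toy_lyric = "Hey, hey\nTo trust"   # hypothetical two-line lyric
tokens = tokenize_with_breaks(toy_lyric)
print(tokens)
```

The resulting tokens could then be indexed and windowed the same way as the plain words.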