# Learning music with an RNN - episode 2: the model

In episode 1 we gathered a set of around 5000 songs from the 100 best rock artists according to Rolling Stone. We will now train a Recurrent Neural Network on this sample and use it (in the next episode) to generate new sequences of chords.

## Read the dataset

Before loading the dataset we generated in Episode 1, let's import what we need:

In [1]:
# import a bunch of things we will use

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'

# Force TensorFlow to run on the CPU (the GPU of my laptop is too slow for these things)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, BatchNormalization, Activation
from keras import optimizers
from keras.callbacks import EarlyStopping, ModelCheckpoint
import pandas as pd
import numpy as np
import json

Using TensorFlow backend.


We can now read in the chord sequences for all the songs we have selected in Episode 1, as well as the vocabulary (i.e., the chords used in these songs):

In [2]:
# Read sequences and vocabulary
# NOTE: the two JSON files used here are produced during Episode 1,
# you need to run that first
songs = pd.read_json("best_songs_cleaned.json", typ='series')

print("Number of songs: %s" % len(songs))

with open("best_songs_vocabulary.json") as f:
    
    vocabulary = json.load(f)

# Sort alphabetically
vocabulary = sorted(vocabulary)

print("Size of vocabulary: %i" % len(vocabulary))

Number of songs: 5693
Size of vocabulary: 68


Let's also take a peek at what the data actually look like by printing the first few entries of our series:

In [3]:
print("Examples of chord sequences:")
print(songs.head())


Examples of chord sequences:
0                                                 E A E A
1       E D A E D A G# Bb A B Bb C B E A E B A B A E B...
100     E C# A F# B E C# A F# B E C# A F# B E C# A F# ...
1000    E A G#m C#m7 B E A E G#m E A E G#m E G#m E G#m...
1001                        F Dm G F Dm G F Dm G F Dm G F
dtype: object


The model we will use does not understand chords per se. Instead, we assign an integer to each chord using a mapping between the vocabulary and the integers.

The network will then predict an integer instead of a chord. We will then reverse the mapping to obtain our predicted chord:

In [4]:
mapping = {k:v for k,v in zip(vocabulary, range(len(vocabulary)))}
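
As a minimal sketch of how the mapping can be reversed, the snippet below builds a second dictionary that goes from integers back to chords. The toy vocabulary and the name `reverse_mapping` are assumptions made for illustration; they are not taken from the notebook.

```python
# Toy vocabulary (an assumption for illustration; the real one has 68 chords)
vocabulary = sorted(["C", "Am", "F", "G7"])

# Forward mapping: chord -> integer (same idea as the cell above)
mapping = {chord: index for index, chord in enumerate(vocabulary)}

# Reverse mapping: integer -> chord ("reverse_mapping" is a hypothetical name)
reverse_mapping = {index: chord for chord, index in mapping.items()}

# Encode a short progression, then decode it back to chords
encoded = [mapping[chord] for chord in "C Am F G7".split()]
decoded = " ".join(reverse_mapping[index] for index in encoded)

print(encoded)
print(decoded)
```

Since every chord gets a distinct integer, decoding an encoded progression gives back the original chords.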

## A character-level language model

In order to achieve our goal we are going to train a character-level language model, where each "character" corresponds to a chord. In other words, our model will predict the next chord given a sequence of previous chords.

So for example, let's consider the following sequence: "C Am Dm G7 C C7 F C". We will ask the network to predict the last "C" given the input sequence "C Am Dm G7 C C7 F". 

The length of the input sequence must be decided a priori. A short sequence gives the network little context, so it will not have much information to base its prediction on. On the contrary, a sequence that is too long makes the training difficult, as the network needs more and more long-term memory and the number of possible sequences explodes combinatorially. I found a good trade-off using an input length of 8.

In order to train the network we then need to divide our songs into input sequences `X` and expected outputs `y`.
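
To make the splitting concrete, here is a small sketch that applies the sliding-window idea to the example progression above. The helper name `make_windows` is hypothetical, and a shorter input length is used alongside the full-length one purely for illustration.

```python
def make_windows(chords, seq_length):
    """Return every run of seq_length + 1 consecutive chords in the song.

    The first seq_length chords of each run are the input, the last one
    is the expected output. (Hypothetical helper, for illustration only.)
    """
    window = seq_length + 1
    return [chords[i:i + window] for i in range(len(chords) - window + 1)]


song = "C Am Dm G7 C C7 F C".split()

# With an input length of 7 the whole song is a single training example:
# input "C Am Dm G7 C C7 F", expected output "C"
for run in make_windows(song, seq_length=7):
    print(run[:-1], "->", run[-1])

# With a shorter input length the same song yields several examples
for run in make_windows(song, seq_length=3):
    print(run[:-1], "->", run[-1])
```

A song with fewer than `seq_length + 1` chords produces no windows at all, which is why such songs have to be skipped.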

In [5]:
# Now we need to generate sequences for each song

# Length of the input sequence
seq_length = 8

sequences = []
skipped = 0

# Loop over the songs and accumulate sequences
for song in songs:

    # Split the song in chords, then assign the corresponding integer from
    # the mapping (we build a list so that we can take its length and slice it)
    these_chords = [mapping[chord] for chord in song.split()]

    # Make sure there are no repetitions of the same chord (see Episode 1)
    assert np.all(np.diff(these_chords) != 0)

    # A song needs to be seq_length + 1 long to be useful for our purposes,
    # because we need an input sequence of length "seq_length" and an expected
    # output (the other chord, i.e., the "+1"). If the song is shorter than
    # seq_length + 1 we cannot use it
    if len(these_chords) < seq_length + 1:

        # Skip this song
        skipped += 1
        continue

    # Let's accumulate all sequences contained in the song,
    # without wrapping around the edge
    for i in range(len(these_chords) - (seq_length + 1) + 1):

        sequences.append(these_chords[i:i + seq_length + 1])

# Make sure all sequences are of the proper length
lengths = [len(sequence) for sequence in sequences]
assert np.all(np.array(lengths) == seq_length + 1)

print("Number of sequences: %s" % len(sequences))
print("Songs skipped: %s" % skipped)

Number of sequences: 257570
Songs skipped: 265


We can now split our sequences into inputs (`X`) and outputs (`y`):

In [6]:
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

## Define the LSTM model

For this project I will use the LSTM implementation of [Keras](https://keras.io/).

Our RNN is made of 3 layers:

1. Embedding layer: this layer maps each chord (represented as an integer) to a point in a dense n-dimensional space (called the "embedding" of the input). For example, let's say that "C" is represented by the integer 1. After the embedding layer, "C" will instead be represented by, say, the vector [0.5, 0.2, 0.3, 0.6]. Why do this? Because during training the network learns a mapping in which items with a similar function end up close to each other in this space. For example, let's consider the chord of "C" major. Its [relative minor](http://www.musiceducatorsinstitute.com/course/guitar/course3/M02S01_relative_chords.html) "Am" is going to be close by in the n-dimensional space, because in many contexts they can be used together. We also expect to find the other chords of the key of C major close by, while chords of other keys should be further away. This helps the network learn the function of each chord in its context.
2. Long Short Term Memory layer: this is a standard LSTM layer. We use a little bit of [dropout](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5) regularization to avoid overfitting.
3. Dense layer: a normal fully-connected layer with a [softmax](https://en.wikipedia.org/wiki/Softmax_function) activation function. In order to make the training a little faster, we use [Batch Normalization](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c) here. Thanks to this dense layer the output of the RNN will be the probability for each of the chords in the vocabulary to be the next chord, given the input sequence. We will use this information to make our predictions a little more varied than always predicting the same output for the same input (see Episode 3).
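
The first and last layers can be illustrated without Keras. The sketch below treats the embedding as a plain lookup table and implements softmax by hand. The toy vocabulary, the random initial values and the function names are all assumptions made for illustration; in the real model the table is learned during training.

```python
import math
import random

vocabulary = ["Am", "C", "F", "G"]   # toy vocabulary (assumption)
embedding_dim = 4                    # same dimension as in the model below

# The embedding is a table with one row per chord. Here the rows are random;
# training would move chords with a similar function closer together.
rng = random.Random(0)
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in vocabulary]


def embed(sequence):
    """Replace each integer in the sequence with its row of the table."""
    return [embedding_table[index] for index in sequence]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    largest = max(scores)                       # subtract for numerical stability
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


embedded = embed([1, 0, 3])                     # "C Am G" as integers
probabilities = softmax([2.0, 1.0, 0.5, 0.1])   # one made-up score per chord

print(len(embedded), len(embedded[0]))
print(probabilities)
```

The softmax output has one entry per chord in the vocabulary, and the entry with the highest score also gets the highest probability.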

In [7]:
# We do not one-hot-encode because we will use the 
# sparse_categorical_crossentropy as loss function
model = Sequential()
model.add(Embedding(len(vocabulary), 4, input_length=seq_length))
model.add(LSTM(200, dropout=0.1,
               name='lstm1', 
               return_sequences=False))
model.add(Dense(len(vocabulary)))
model.add(BatchNormalization())
model.add(Activation('softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8, 4)              272       
_________________________________________________________________
lstm1 (LSTM)                 (None, 200)               164000    
_________________________________________________________________
dense_1 (Dense)              (None, 68)                13668     
_________________________________________________________________
batch_normalization_1 (Batch (None, 68)                272       
_________________________________________________________________
activation_1 (Activation)    (None, 68)                0         
Total params: 178,212
Trainable params: 178,076
Non-trainable params: 136
_________________________________________________________________
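
The parameter counts in the summary can be checked by hand. The sketch below uses the usual formulas for these layer types (four gates for the LSTM, four vectors for batch normalization, two of which are moving statistics that are not trained); the variable names are chosen only for this illustration.

```python
vocab_size = 68       # size of the vocabulary
embedding_dim = 4     # output dimension of the embedding layer
lstm_units = 200      # number of LSTM units

# Embedding: one vector of size embedding_dim per chord
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per (LSTM unit, chord) pair plus one bias per chord
dense_params = lstm_units * vocab_size + vocab_size

# Batch normalization: scale, shift, moving mean and moving variance per
# feature; the two moving statistics are not trained by gradient descent
batchnorm_params = 4 * vocab_size
non_trainable_params = 2 * vocab_size

total_params = embedding_params + lstm_params + dense_params + batchnorm_params
trainable_params = total_params - non_trainable_params

print(embedding_params, lstm_params, dense_params, batchnorm_params)
print(total_params, trainable_params, non_trainable_params)
```

These values agree with the `Param #` column and with the totals reported by `model.summary()`.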


In order to evaluate the progress of the model we introduce a custom metric, based on the `sparse_top_k_categorical_accuracy` implemented in Keras. Remember that the output of our network is the probability, for each chord in the vocabulary, of being the next chord given an input sequence. The metric counts a prediction as a success if the true value is among the `k` most probable values predicted by the network (here `k = 3`, as the name of the metric suggests). In the normal accuracy measurement, instead, a prediction is a success only if the most probable value according to the network is the truth. We choose this metric because it helps to account for the fact that, given a sequence of chords, there is more than one "correct" possibility for the next one.
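
The difference between the two metrics can be sketched in plain Python. This is not the implementation in `chords_ai.custom_metric` (which is not shown here), only an illustration of the idea on made-up probabilities; the function name `top_k_accuracy` is an assumption.

```python
def top_k_accuracy(y_true, y_prob, k):
    """Fraction of samples whose true class is among the k most probable."""
    hits = 0
    for truth, probabilities in zip(y_true, y_prob):
        # Indices of the classes, sorted from most to least probable
        ranked = sorted(range(len(probabilities)),
                        key=lambda index: probabilities[index],
                        reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Made-up predictions for 4 samples over a 4-chord vocabulary
y_prob = [
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.35, 0.20, 0.05],
    [0.05, 0.25, 0.60, 0.10],
]
y_true = [2, 3, 3, 2]

print(top_k_accuracy(y_true, y_prob, k=1))  # normal accuracy
print(top_k_accuracy(y_true, y_prob, k=3))  # top-3 accuracy
```

In this toy example the first sample counts as a failure for the normal accuracy but as a success for the top-3 accuracy, because the true chord is the third most probable one.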

In [8]:
# Define metric and compile model
from chords_ai.custom_metric import sparse_top_k_categorical_accuracy_3

model.compile(loss='sparse_categorical_crossentropy',
              optimizer="adam", metrics=[sparse_top_k_categorical_accuracy_3])

## Train the network

Now we can train the network. But first we need to set aside a test set, so that we can evaluate the performance of the network on it:

In [9]:
# Split test dataset
from sklearn.model_selection import train_test_split

# Let's set aside 20% of the sequences, chosen randomly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

Let's run the fit then:

In [10]:
# To avoid overtraining (aka memorization), we use early
# stopping, i.e., Keras stops the training when the accuracy
# on the validation dataset stops improving.
# We use patience=10, which means that Keras will stop the
# training after 10 epochs in a row where the metric did not improve.
# This is needed because, due to the random nature of the training,
# the metric might not improve for a few epochs and then jump
# up, and we do not want to stop too early.

n_epochs = 1000 # We'll never reach 1000 epochs because 
                # of early stopping

early_stopping = EarlyStopping(monitor='val_sparse_top_k_categorical_accuracy_3', 
                               patience=10, 
                               verbose=0,
                               mode='max')  # higher is better for this metric

# We use checkpointing, i.e., we keep track of the epoch with
# the best validation accuracy and save its weights to a file.
# Since there is some randomness involved in the training, the
# validation metric may reach its maximum before the very last epoch.

check_point = ModelCheckpoint("best_weights.h5", 
                              monitor='val_sparse_top_k_categorical_accuracy_3', 
                              verbose=0, 
                              save_best_only=True, 
                              save_weights_only=True, 
                              mode='max')

callbacks = [early_stopping, check_point]

model.fit(X_train, y_train, batch_size=256, verbose=1, shuffle=True,
          initial_epoch=0, epochs=n_epochs,
          validation_data=(X_test, y_test),
          callbacks=callbacks)

Train on 206056 samples, validate on 51514 samples
Epoch 1/1000
...
Epoch 58/1000

<keras.callbacks.History at 0x7fcc33857050>
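
The interplay between early stopping and checkpointing in the fit above can be sketched in a few lines of plain Python. This is a simplified illustration of the logic, not the Keras implementation; the function name and the validation scores are made up.

```python
def run_with_early_stopping(val_scores, patience):
    """Return (epoch where training stops, best epoch, best score).

    val_scores holds one validation score per epoch, higher is better.
    """
    best_score = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # New best: this is where a checkpoint would be saved
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs in a row: stop
                return epoch, best_epoch, best_score

    # Ran out of epochs before early stopping kicked in
    return len(val_scores), best_epoch, best_score


# Made-up validation scores: the metric dips at epoch 3 and then recovers,
# so stopping at the first non-improving epoch would have been too early
scores = [0.60, 0.65, 0.64, 0.66, 0.66, 0.65, 0.66]
stopped_at, best_epoch, best_score = run_with_early_stopping(scores, patience=3)

print(stopped_at, best_epoch, best_score)
```

Note that the best epoch is generally earlier than the epoch where training stops, which is why the weights saved by the checkpoint are the ones worth keeping.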

There is no point in continuing the training (actually we could have stopped earlier). We can now load the best weights and save the model, which we will then use in Episode 3 to generate new chord sequences:

In [11]:
model.load_weights("best_weights.h5")

model.save("model_best_songs.h5")