## 1. Introduction

Now that the Theoretical Frame has explained how Machine Learning algorithms and Neural Networks work, we can proceed to building one. The main goal of this part of the work is to build a Machine Learning algorithm that generates music. As explained before, Artificial Neural Networks are powerful tools to classify and predict data. It is this ability to predict new data that opens the door to what humans would call "creativity", a rather controversial notion when talking about an Artificial Intelligence.

Artificial Neural Networks have been used to imitate human art for some time with success. Let's think about text, for instance. As seen when explaining LSTMs, they can be used to predict the next character in a sequence of text. For instance, a trained LSTM could receive the following sequence as an input and output the predicted next character for it:

[H, E, L, L] -> [O]

In this case, a single character has been generated from a previous sequence, but this is hardly creative. To create a more complex sentence, we would have to keep the LSTM running, giving it as an input the same sequence plus the character it has predicted:

[H, E, L, L, O] -> [   ]

Now it predicts a white space to separate this word from a hypothetical next one. But what if we were to keep the algorithm running on its own predictions? It would keep generating characters, then sentences, and eventually a full text. This text would have been created entirely by an Artificial Intelligence and would therefore be original. This task, as simple as it may seem, is actually quite complex. The network would have to learn by itself how grammar works and how to write words and full sentences only by looking at sample data, which would be full texts in this case.
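
The feedback loop described above can be sketched with a toy stand-in for the LSTM. The `predict_next` function and its lookup table are hypothetical and only imitate a trained model by looking at the last character, whereas a real network would take the whole sequence into account. The point of the sketch is the loop that feeds each prediction back into the input.

```python
# Toy stand-in for a trained model (hypothetical): maps the last
# character of the input sequence to the "predicted" next character.
NEXT_CHAR = {"H": "E", "E": "L", "L": "O", "O": " ", " ": "H"}


def predict_next(sequence):
    """Return the predicted next character for a sequence of characters."""
    return NEXT_CHAR[sequence[-1]]


def generate(seed, n_chars):
    """Extend `seed` by feeding every prediction back in as input."""
    sequence = list(seed)
    for _ in range(n_chars):
        sequence.append(predict_next(sequence))
    return "".join(sequence)


generated = generate("HELL", 2)
print(repr(generated))
```

Starting from the four characters of the example, two steps of the loop produce the letter O and then the white space.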

The complexity of creating an ML algorithm to generate text does not lie only in the task itself. Remember that ANNs can only deal with numerical data, and text is, obviously, not numerical. This requires developing systems to "translate" text into numerical values so that the network can deal with them. Only then, when dealing with numerical data, could the network reach an "understanding" of language and generate text by itself. This way of generating text is so abstract that it is hard to say that the network "understands" language, or that it is "creative" when it generates completely original pieces. But instead of trying to settle whether ML algorithms are creative or can understand something as complex as language, we will focus on building them.

So, seeing that ANNs are able to generate text, could they generate music?

One may think that music and text are completely different from each other, but they are not that different. You could think of musical notes as characters in a text, and of musical pieces as sentences. In fact, both are sequential structures of different elements, either characters or notes. This means that an Artificial Neural Network should be able to predict musical notes as it can predict text.

## 2. Methodology

Our goal will be to build a Machine Learning algorithm to generate music based on a dataset of 80 violin scores by the classical composers Johann Sebastian Bach, Antonio Vivaldi, Ludwig van Beethoven, Wolfgang Amadeus Mozart and Niccolò Paganini. The 80 scores contain over 40,000 notes, which form the dataset used to train the algorithm.

The ML algorithm will be coded in Python 3.6.6 in a Jupyter Notebook environment. To extract the music data from the files we will use the music21 library (version 5.3), developed at MIT, and the standard "os" module to access the files on disk. To preprocess this data and make it numerical we will use Pandas (version 0.23.4) and Numpy (version 1.14.5), libraries used to manage data and perform operations on it. Finally, we will also use the TensorFlow library (GPU version 1.10) to build the Artificial Neural Network that will process the data and, eventually, generate new music.

In [1]:
import music21
import os
import numpy as np
import pandas as pd
import tensorflow as tf

### 2.1 Extracting the data from MIDI files

To predict music, the first thing that must be done is to extract the training data from its source files. In our case, the source files are MIDI files gathered from the MuseScore public sheet music database. All the files have been saved in the same folder for convenience.

In [2]:
path = "Data/"

As stated before, there are 80 different files in that folder, so a list of them must be created to extract data from them individually.

In [3]:
files = os.listdir(path)

Using "os", we add the name of every file in that folder to a list, and then we prepend the original path to each name so that the files can be read.

In [4]:
for file in range(len(files)):
    files[file] = path + files[file]
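
As a side note, the same result can be obtained without relying on the trailing slash in the path. The sketch below uses `os.path.join`, which is not what the notebook does, and the file names are hypothetical and serve only as an illustration.

```python
import os

# Hypothetical folder and file names, used only for illustration.
path = "Data"
file_names = ["bach_partita.mid", "vivaldi_spring.mid"]

# os.path.join inserts the correct separator for the operating system.
files = [os.path.join(path, name) for name in file_names]
print(files)
```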

Now that we have a list of all files, we must extract the musical data from them. To do so, an empty list is created and two values are added to it. The first value will be the note pitch and the second value will be its duration.

In [5]:
data = []

To add those values to the list we must first convert the MIDI file to a score object in music21. Once the score object has been created, we go through every note in it and add its pitch and length to the previously created data list. Only elements that are notes are kept, in order to avoid having other musical structures in the data, such as chords, as this would greatly complicate the preprocessing of the data and the eventual prediction of music.

In [6]:
for file in files:
    score = music21.converter.parse(file)[0]
    for note in score.getElementsByClass("Note"):
        indv_note = [note.pitch, note.duration.quarterLength]
        data.append(indv_note)

So now that we have a list of all notes in the dataset we can proceed to preprocessing them.

### 2.2 Data preprocessing

In [7]:
data = pd.DataFrame(data, columns = ["Note", "Duration"])

In [8]:
data.head()

|   | Note | Duration |
|---|------|----------|
| 0 | F#4  | 1        |
| 1 | F#4  | 1        |
| 2 | F#4  | 1        |
| 3 | D4   | 4        |
| 4 | E4   | 1        |


In [9]:
data_length = len(data)

In [10]:
data_length

41811

Preprocessing the data is of great importance and can determine the ML algorithm's success. As you can see above, we have over 40,000 notes with two features each: their pitch and their duration. These are categorical features, meaning that a note can be categorized by its pitch (the musical note it represents) and by its duration. This data has to be preprocessed so that it can be used as an input for an ANN.

In [11]:
sequence_length = 25

It is also now that we define the length of the sequences that will be used later in the network. The sequence length can be described as the network's memory. A sequence length of 25 means that the network will take into account the last 25 notes it has seen before predicting the next one. A sequence length that is too short can lead to the network failing to correctly predict the next note, but one that is too long is not ideal either, as it can lead to the network memorizing the dataset rather than finding patterns in the data, which would lead to bad results when generating its own music.

In [12]:
%store data_length
%store sequence_length

Stored 'data_length' (int)
Stored 'sequence_length' (int)


#### 2.2.1 Notes preprocessing

In [13]:
notes = data["Note"]

In [14]:
notes.head()

0    F#4
1    F#4
2    F#4
3     D4
4     E4
Name: Note, dtype: object

First, let's preprocess the notes. We will transform them into categorical numerical values, so that they can be used by the network. We will do so by using the one-hot encoding method. A very easy way to visualize how this encoding method works is to look at the following example:

[1,2,3] -> [[1,0,0], [0,1,0], [0,0,1]]

In this example, you can see that each value has been given an index in the one-hot encoded array, with as many indexes as there are categories in the original data. In our case, we have a much higher number of classes (notes), 54 in total, as seen below with the variable "n_notes". This means that manually encoding them would require a lot of effort. Luckily, Keras includes a one-hot encoder which automatically does what we have done manually in the previous example.

The real purpose of one-hot encoding will become clearer once we build the Neural Network that will process the data. One-hot encoded arrays are useful because they can be interpreted as a probability distribution. Returning to our example, a one in the second index of the array can be interpreted as a 100% probability of the original value belonging to the second category. This probability distribution form will be very useful when trying to predict notes.
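
The encoding of the example can be reproduced by hand in a few lines. This is only a minimal sketch in plain Python, with a hypothetical helper called `one_hot`, and it is not the Keras encoder used later. Note that the categories are numbered from zero here.

```python
def one_hot(index, n_classes):
    """Return a list of zeros with a single one at position `index`."""
    vector = [0.0] * n_classes
    vector[index] = 1.0
    return vector


# Three categories, indexed from zero.
encoded = [one_hot(i, 3) for i in range(3)]
print(encoded)

# Each row sums to one, so it can be read as a probability distribution.
row_sums = [sum(row) for row in encoded]
print(row_sums)
```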

##### 2.2.1.1 Notes to categorical

The Keras function used to one-hot encode requires the data to be in the form of a Numpy ndarray. Numpy ndarrays are n-dimensional arrays, meaning that they store data in "n" dimensions. We transform the Pandas Series, the data structure used above, into a Numpy array of strings.

In [15]:
notes = np.asarray(notes, dtype = "str")

In [16]:
notes

array(['F#4', 'F#4', 'F#4', ..., 'C5', 'A4', 'C5'], dtype='<U64')

In [17]:
all_notes = set(notes)

We now collect every different note into the "all_notes" set, so that no note appears in it more than once. We do so in order to be able to transform the notes, currently in the form of strings, into integers, so that we can one-hot encode them. But before being able to translate them into integers, dictionaries must be created.

In [18]:
notes_dictionary = {}
notes_dictionary_inv = {}

In [19]:
counter = 0

In [20]:
for note in all_notes:
    notes_dictionary[note] = counter
    counter += 1

In [21]:
counter = 0

In [22]:
for note in all_notes:
    notes_dictionary_inv[counter] = note
    counter += 1

Two dictionaries are created. The first one translates the string of the note to an integer, while the second one does the opposite: it translates an integer to a note in the form of a string. For instance, the note with the identifier 1 corresponds to A6 in musical notation, while another one like B4 has a different integer assigned, 13 in this case. Using this correspondence and these dictionaries, we can proceed to transform the original notes list into integers, so we can later encode them.
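
The two dictionaries can also be built more compactly. The sketch below is an alternative under one assumption that differs from the notebook: the unique notes are sorted first, so that the numbering is the same on every run (the iteration order of a set of strings is not guaranteed to be). The names `note_to_int` and `int_to_note` are hypothetical.

```python
notes = ["F#4", "F#4", "D4", "E4", "D4"]

# Sorting the unique notes makes the numbering reproducible between runs.
unique_notes = sorted(set(notes))

note_to_int = {note: index for index, note in enumerate(unique_notes)}
int_to_note = {index: note for note, index in note_to_int.items()}

notes_int = [note_to_int[note] for note in notes]
print(notes_int)
```

Translating the integers back with the inverse dictionary recovers the original list of notes.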

In [23]:
notes_int = []

In [24]:
for note in notes:
    notes_int.append(notes_dictionary[note])

Now we have all the notes extracted from the original files in the form of integers, so we can proceed to one-hot encoding them.

In [25]:
notes_preprocessed = tf.keras.utils.to_categorical(notes_int)

In [26]:
notes_preprocessed.shape

(41811, 54)

Using a built-in Keras function, one-hot encoding them is very simple. As you can see, we end up with a two-dimensional array consisting of 41,811 items (the total number of notes in the original dataset) with 54 features per item (the total number of different notes in the dataset). These 54 features per item are actually a one-hot encoded array representing the note, formed of 53 zeroes and a one. The index at which the one is located identifies the note.

In [27]:
n_notes = notes_preprocessed.shape[1]

In [28]:
n_notes

54

And to finish the notes encoding, we store the total number of different notes, as it will be needed later when defining the Neural Network.

In [29]:
%store n_notes
%store notes_dictionary_inv

Stored 'n_notes' (int)
Stored 'notes_dictionary_inv' (dict)


##### 2.2.1.2 Notes to sequences

In [30]:
inputs_notes = []
outputs_notes = []

In [31]:
for i in range(data_length - sequence_length):
    sequence = notes_preprocessed[i:i + sequence_length]
    following_character = notes_preprocessed[i + sequence_length]
    inputs_notes.append(sequence)
    outputs_notes.append(following_character)

In [32]:
features_notes = np.asarray(inputs_notes)
labels_notes = np.asarray(outputs_notes)

In [33]:
features_notes.shape

(41786, 25, 54)

In [34]:
labels_notes.shape

(41786, 54)

To finish the preprocessing of the notes, we must save them in the form of inputs and outputs. This requires storing the notes in groups of 25, the sequence length we defined previously, which is the number of previous notes the Neural Network will keep in its memory once it is built. So we take 25 notes and group them together in the "inputs_notes" list, while we store the next note in the "outputs_notes" list, as it is the value the network will try to predict.

Finally, we transform these two lists into ndarrays. The "features_notes" ndarray is three-dimensional: the first dimension is the number of samples (the length of the dataset minus the sequence length), the second is the length of the sequence, and the third is the number of features in each item, that is, the total number of different notes. The "labels_notes" ndarray has only two dimensions: it has no sequence dimension because it corresponds to the output of the network, which predicts a single note at a time. Here, the first dimension is again the number of samples and the second is again the total number of different notes.
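
The grouping into inputs and outputs is easier to follow on a tiny example. The sketch below applies the same sliding window to six items with a window of three instead of 25. The function name `make_windows` is hypothetical, and plain integers are used in place of one-hot encoded notes.

```python
def make_windows(items, sequence_length):
    """Split `items` into (input window, next item) training pairs."""
    inputs, outputs = [], []
    for i in range(len(items) - sequence_length):
        inputs.append(items[i:i + sequence_length])
        outputs.append(items[i + sequence_length])
    return inputs, outputs


# Toy data: six "notes" and a window of three instead of 25.
inputs, outputs = make_windows([0, 1, 2, 3, 4, 5], 3)
print(inputs)
print(outputs)

# With the real sizes, the number of samples matches the shapes above.
n_samples = 41811 - 25
print(n_samples)
```

Six items with a window of three give three samples, just as 41,811 notes with a window of 25 give 41,786.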

In [35]:
%store features_notes
%store labels_notes

Stored 'features_notes' (ndarray)
Stored 'labels_notes' (ndarray)


#### 2.2.2 Durations preprocessing

In [36]:
data.head()

|   | Note | Duration |
|---|------|----------|
| 0 | F#4  | 1        |
| 1 | F#4  | 1        |
| 2 | F#4  | 1        |
| 3 | D4   | 4        |
| 4 | E4   | 1        |


But remember that the note was not the only feature that we had. We also have another feature, the duration of the note. To preprocess it, we will do exactly the same as when encoding and preprocessing the notes.

##### 2.2.2.1 Durations to categorical

In [37]:
durations = data["Duration"]

In [38]:
durations.head()

0    1
1    1
2    1
3    4
4    1
Name: Duration, dtype: object

In [39]:
durations = np.asarray(durations, dtype = "float32")

In [40]:
all_durations = set(durations)

In [41]:
durations_dictionary = {}
durations_dictionary_inv = {}

In [42]:
counter = 0

In [43]:
for duration in all_durations:
    durations_dictionary[duration] = counter
    counter += 1

In [44]:
counter = 0

In [45]:
for duration in all_durations:
    durations_dictionary_inv[counter] = duration
    counter += 1

In [46]:
durations_int = []

In [47]:
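for duration in durations:
    # Append to durations_int (not notes_int), otherwise this list stays empty.
    durations_int.append(durations_dictionary[duration])

In [48]:
# Encode the integer identifiers so the indexes match durations_dictionary_inv.
durations_preprocessed = tf.keras.utils.to_categorical(durations_int)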

In [49]:
n_durations = durations_preprocessed.shape[1]

In [50]:
n_durations

10

In [51]:
%store n_durations
%store durations_dictionary_inv

Stored 'n_durations' (int)
Stored 'durations_dictionary_inv' (dict)


##### 2.2.2.2 Durations to sequences

In [52]:
inputs_durations = []
outputs_durations = []

In [53]:
for i in range(data_length - sequence_length):
    sequence = durations_preprocessed[i:i + sequence_length]
    following_character = durations_preprocessed[i + sequence_length]
    inputs_durations.append(sequence)
    outputs_durations.append(following_character)

In [54]:
features_durations = np.asarray(inputs_durations)
labels_durations = np.asarray(outputs_durations)

In [55]:
features_durations.shape

(41786, 25, 10)

In [56]:
labels_durations.shape

(41786, 10)

In [57]:
%store features_durations
%store labels_durations

Stored 'features_durations' (ndarray)
Stored 'labels_durations' (ndarray)


### 2.3 Creating the networks

Now comes the time to define the Neural Networks that we will use to generate music. Two different networks will be built: one for the notes and one for their durations.

Even though we build and train two completely separate networks, we will use the same optimizer for both of them. Using a separate optimizer for each network, with different learning rates, might achieve better results. For simplicity, though, we will use the same one for both.

The optimizer in question is the Adam optimizer. This optimizer is built upon Gradient Descent, but is more complex and generally more effective. The version of this optimizer in Keras is the one proposed by Diederik P. Kingma and Jimmy Lei Ba in their 2015 paper "Adam: A Method for Stochastic Optimization". The details of this optimization algorithm will not be described here, as they involve many complex mathematical calculations.

In [58]:
optimizer = tf.keras.optimizers.Adam(lr = 1e-3)

#### 2.3.1 Notes network

In [59]:
network_notes = tf.keras.Sequential()

First of all, we define the Keras model that will be used. This will be a sequential model, that is, one in which the layers are stacked one after another, as in the usual ANN structure. This model is tasked with predicting notes, and another one will be built later to predict the durations of those notes. Both models will be of the same type: a Recurrent Neural Network with Long Short-Term Memory cells. This will give the model both a short term memory, the last note predicted by the model, and a long term memory, the last 25 notes predicted by the model (corresponding to the sequence length, as stated before when defining this variable).

In [60]:
network_notes.add(tf.keras.layers.LSTM(128, input_shape = (sequence_length, n_notes), return_sequences = True))
network_notes.add(tf.keras.layers.Dropout(0.1))
network_notes.add(tf.keras.layers.LSTM(128))
network_notes.add(tf.keras.layers.Dropout(0.1))
network_notes.add(tf.keras.layers.Dense(n_notes, activation = "softmax"))

In [61]:
network_notes.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 25, 128)           93696     
_________________________________________________________________
dropout (Dropout)            (None, 25, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 54)                6966      
Total params: 232,246
Trainable params: 232,246
Non-trainable params: 0
_________________________________________________________________


After defining the model, we add layers to it. In the first layer, we must specify the shape of its input. This shape is the equivalent of the Input Layer of the model, as no separate Input Layer is added here. The dimensions of this input, that is, of the input to the first Hidden Layer, are the sequence length and the total number of notes to classify. The layer we add to the model is an LSTM with 128 neurons. We set the parameter "return_sequences" to True for the first layer because another LSTM layer follows it, and that second layer needs to receive the full sequence of outputs rather than only the last one. This second LSTM also has 128 neurons.
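
The parameter counts shown in the summary can be checked by hand. The sketch below assumes the standard LSTM layout, in which each of the four internal weight blocks (three gates and the candidate cell state) has input weights, recurrent weights and a bias. The helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: four weight blocks, each with
    input weights, recurrent weights and a bias."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Parameters of a fully connected layer: one weight per input plus a bias."""
    return units * (input_dim + 1)


n_notes = 54
first_lstm = lstm_params(n_notes, 128)
second_lstm = lstm_params(128, 128)
output_layer = dense_params(128, n_notes)
total = first_lstm + second_lstm + output_layer
print(first_lstm, second_lstm, output_layer, total)
```

The four values agree with the "Param #" column and the total reported by the summary. The Dropout layers contribute no parameters.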

Between the Hidden Layers of the model, we add what Keras calls a Dropout Layer. Its parameter is the fraction of the outputs of the previous layer that are randomly set to zero during training: 0.1 means 10%. A different random 10% of the neurons is therefore ignored at each training step and does not contribute to that update. This is done in order to avoid overfitting. Overfitting happens when the network's weights and biases are tuned too closely to the training data. If this were to happen, the network would fit the training examples very well, but would not be able to generalize what it has learned to new data that it has not processed before.
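
What a Dropout Layer does to the values passing through it during training can be imitated in plain Python. This is only a sketch of the idea and not the Keras implementation. It assumes the common "inverted dropout" variant, in which the surviving values are scaled up so that their expected sum is unchanged. The function name `dropout` and the fixed random seed are choices made for this illustration.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest
    so that the expected sum stays the same (inverted dropout)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.1, rng=rng)

n_zeroed = dropped.count(0.0)
print(n_zeroed)
```

With a rate of 0.1, roughly one value in ten is set to zero, and a different subset is chosen every time the function is called.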

The last layer of the network is a Dense Layer. This is the Output Layer, with as many neurons as there are different notes in the dataset, and it uses the softmax function as its activation function. For the other layers the default activation function (Hyperbolic Tangent) was used. The main characteristic of the softmax function is that its outputs are positive and sum to 1, so it essentially outputs a probability distribution. This makes it ideal for one-hot encoded targets, as these also sum to 1 (with a one at the index of the note and zeros elsewhere). When used to generate new notes, the network will output one value for each index of the array. The index with the highest value is the note the network "believes" should come next, based on its training.
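
The behaviour of the softmax function can be verified on a small example. The following is a minimal sketch in plain Python, not the Keras implementation, and the input scores are made up for the illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive values that sum to one."""
    highest = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
most_likely = probabilities.index(max(probabilities))
print(probabilities, most_likely)
```

The largest score receives the largest probability, and the index of that probability plays the role of the predicted note.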

We end up with a RNN with a LSTM structure formed by two Hidden Layers with a Hyperbolic Tangent activation function and an Output Layer with the softmax function. So now we can proceed to training the network on the dataset.

In [62]:
network_notes.compile(loss = "categorical_crossentropy", optimizer = optimizer)

Before training the network, the last step is compiling it. Doing so requires choosing a loss function and an optimizer. The loss function used here is Categorical Crossentropy, the usual choice when the Output Layer uses the softmax function.
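
For a single example, Categorical Crossentropy reduces to the negative logarithm of the probability that the network assigned to the correct class. The sketch below computes it by hand for two made-up predictions. It is an illustration in plain Python and not the Keras implementation.

```python
import math


def categorical_crossentropy(target, predicted):
    """Loss for one example: -sum(target_i * log(predicted_i))."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


target = [0.0, 1.0, 0.0]  # one-hot label: the true class is index 1

confident = categorical_crossentropy(target, [0.05, 0.90, 0.05])
unsure = categorical_crossentropy(target, [0.30, 0.40, 0.30])
print(confident, unsure)
```

The loss is small when the network gives a high probability to the correct note and grows as that probability drops, which is what the optimizer tries to minimize.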

In [63]:
network_notes.fit(features_notes, labels_notes, batch_size = 100, epochs = 1, verbose = 0)

<tensorflow.python.keras.callbacks.History at 0x13d3d7d3518>

Now we train the network with a batch size of 100 examples and for 50 epochs. This takes quite a long time, depending on the computing power of the computer used for training.
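
To get a feeling for the amount of work involved, the number of weight updates can be estimated from the figures given so far. The sketch assumes that the last, smaller batch of each epoch is also processed, and it uses the 50 epochs mentioned in the text.

```python
import math

n_samples = 41786   # first dimension of features_notes
batch_size = 100
epochs = 50         # value quoted in the text

batches_per_epoch = math.ceil(n_samples / batch_size)
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)
```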

In [64]:
save_notes = "Models/notes_model.h5py"

In [65]:
network_notes.save(save_notes)

In [66]:
%store save_notes

Stored 'save_notes' (str)


Finally, we save the network to the "save_notes" path. Keras saves networks in the HDF5 file format, using the h5py library. This format allows Keras to save networks and later restore them with the exact same structure, weights and biases they had before being saved. In case we were to train the network further, the optimizer for the network and its loss function would also be saved in the same exact state.

#### 2.3.2 Durations network

For the network that will predict the duration of the notes, we do exactly the same as for the network used to predict the notes.

In [67]:
network_durations = tf.keras.Sequential()

In [68]:
network_durations.add(tf.keras.layers.LSTM(128, input_shape = (sequence_length, n_durations), return_sequences = True))
network_durations.add(tf.keras.layers.Dropout(0.1))
network_durations.add(tf.keras.layers.LSTM(128))
network_durations.add(tf.keras.layers.Dropout(0.1))
network_durations.add(tf.keras.layers.Dense(n_durations, activation = "softmax"))

In [69]:
network_durations.compile(loss = "categorical_crossentropy", optimizer = optimizer)

In [70]:
network_durations.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 25, 128)           71168     
_________________________________________________________________
dropout_2 (Dropout)          (None, 25, 128)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 204,042
Trainable params: 204,042
Non-trainable params: 0
_________________________________________________________________


In [71]:
network_durations.fit(features_durations, labels_durations, batch_size = 100, epochs = 1, verbose = 0)

<tensorflow.python.keras.callbacks.History at 0x13d13664fd0>

In [72]:
save_durations = "Models/durations_model.h5py"

In [73]:
network_durations.save(save_durations)

In [74]:
%store save_durations

Stored 'save_durations' (str)


### 2.4 Generating new melodies

Finally, it's time to generate music using the two Neural Networks created and trained. To generate the music, we start by importing the same dependencies as before and reading all the stored variables, so that we can recover the data saved during the data preprocessing and the network definition and training.

In [1]:
import music21
import numpy as np
import tensorflow as tf

In [2]:
%store -r data_length
%store -r sequence_length
%store -r n_notes
%store -r n_durations
%store -r features_notes
%store -r features_durations
%store -r notes_dictionary_inv
%store -r durations_dictionary_inv
%store -r save_notes
%store -r save_durations

We also load the two networks from the files saved before, using a Keras function that takes the file path as its argument.

In [3]:
network_notes = tf.keras.models.load_model(save_notes)

In [4]:
network_durations = tf.keras.models.load_model(save_durations)

In [5]:
notes_to_generate = 500
temperature_notes = 0.5
temperature_durations = 0.5

Now it's time to define the hyperparameters of the music generation. There are three: the number of notes to generate, in the variable "notes_to_generate", and the temperatures of the two networks. The temperature is a mathematical parameter that reshapes the probability distribution produced by the softmax function in the Output Layer. The lower the temperature is, the more confident the network will be. More about the temperature and its mathematical expression will be explained when defining the temperature function.

In [6]:
# Only indexes that exist in the features arrays are valid starting points.
random_start = np.random.randint(0, len(features_notes))

In [7]:
sequence_notes = features_notes[random_start]

In [8]:
sequence_durations = features_durations[random_start]

But before starting to generate new music, an initial sequence must be loaded from the original dataset. Because of the way the ANN was built, it requires a sequence before being able to generate new music on its own. This sequence will be loaded from a random point in the original dataset for both the notes dataset and the durations dataset.

In [9]:
notes = []

In [10]:
for note, duration in zip(sequence_notes, sequence_durations):
    notes.append([notes_dictionary_inv[np.argmax(note)], durations_dictionary_inv[np.argmax(duration)]])

We also append this original sequence to the newly created "notes" list. This list will end up containing every note generated by the algorithm, preceded by the original notes serving as the initial sequence. To append the sample sequence to the list in the form of a real note and duration, the process followed to encode them must be inverted. We do so using the inverse dictionaries created when encoding and preprocessing the data, picking the index of the maximum value in each one-hot encoded array. What we are essentially doing is inverting the preprocessing of the data, in order to end up with the same items as we extracted from the scores: a note and its duration.
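
The decoding step can be illustrated on a single item. The sketch below uses small, hypothetical inverse dictionaries (three notes and two durations) instead of the real ones, and a plain Python `argmax` helper that imitates what `np.argmax` returns for a one-dimensional array.

```python
def argmax(values):
    """Index of the largest value (what np.argmax returns for a 1-D array)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical inverse dictionaries with three notes and two durations.
notes_dictionary_inv = {0: "D4", 1: "E4", 2: "F#4"}
durations_dictionary_inv = {0: 1.0, 1: 4.0}

one_hot_note = [0.0, 0.0, 1.0]
one_hot_duration = [0.0, 1.0]

decoded = [
    notes_dictionary_inv[argmax(one_hot_note)],
    durations_dictionary_inv[argmax(one_hot_duration)],
]
print(decoded)
```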

After the networks have generated new notes, we will use music21 again to encode the notes in a MIDI file.

In [11]:
def apply_temperature(predictions, temperature = 0.5):
    predictions = predictions.astype("float")
    predictions = np.log(predictions) / temperature
    predictions = np.exp(predictions) / np.sum(np.exp(predictions))
    return np.random.multinomial(1, predictions, 1)

Now, about the temperature. The function above applies the following formula to each predicted probability $p_i$, where $T$ is the temperature:

$$p_i' = \frac{\exp\left(\ln(p_i) / T\right)}{\sum_j \exp\left(\ln(p_j) / T\right)}$$

The code above implements this formula using Numpy's functions for the exponential and the natural logarithm, and then draws one sample from the resulting distribution. The modification is actually quite simple: if the temperature is smaller than 1, the differences between the predicted values are made bigger, thus making the network more confident of its predictions and making its output more consistent with the original dataset. If the temperature is higher than 1, the opposite happens: the differences between the values are made smaller, thus making the network less confident and making its output less consistent with the original dataset.
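
The effect of the temperature can be checked on a small, made-up distribution. The sketch below repeats only the rescaling part of the temperature function in plain Python and leaves out the random sampling. The function name `rescale` is hypothetical, and all probabilities are assumed to be greater than zero so that the logarithm is defined.

```python
import math


def rescale(probabilities, temperature):
    """Apply exp(log(p) / T) and renormalise so the values sum to one."""
    scaled = [math.exp(math.log(p) / temperature) for p in probabilities]
    total = sum(scaled)
    return [s / total for s in scaled]


original = [0.6, 0.3, 0.1]
sharper = rescale(original, 0.5)    # low temperature: more confident
flatter = rescale(original, 2.0)    # high temperature: less confident
unchanged = rescale(original, 1.0)  # a temperature of 1 changes nothing
print(sharper, flatter)
```

With a temperature of 0.5 the most likely value gains probability at the expense of the others, while with a temperature of 2 the three values move closer together.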

In [12]:
def predict_note(sequence_notes):
    # The current sequence is passed in and the updated one is returned.
    note_feature = np.reshape(sequence_notes, (1, sequence_length, n_notes))
    predicted_note = network_notes.predict(note_feature)[0]
    predicted_note_temperature = apply_temperature(predicted_note, temperature = temperature_notes)
    final_note = notes_dictionary_inv[np.argmax(predicted_note_temperature)]
    # Drop the oldest note and append the sampled one, keeping the length constant.
    sequence_notes = np.concatenate((sequence_notes[1:], predicted_note_temperature), axis = 0)
    return final_note, sequence_notes

In [13]:
def predict_duration(sequence_durations):
    # Same procedure as predict_note, applied to the durations network.
    duration_feature = np.reshape(sequence_durations, (1, sequence_length, n_durations))
    predicted_duration = network_durations.predict(duration_feature)[0]
    predicted_duration_temperature = apply_temperature(predicted_duration, temperature = temperature_durations)
    final_duration = durations_dictionary_inv[np.argmax(predicted_duration_temperature)]
    sequence_durations = np.concatenate((sequence_durations[1:], predicted_duration_temperature), axis = 0)
    return final_duration, sequence_durations

Now we define the functions that actually use the networks to predict the notes and their durations. First we reshape the sequence to make it fit into the network, and then we use the network to predict values based on that sequence. Once we have the predictions, we apply the temperature function to them and append the sampled note and duration to the sequence that will be used in the next iteration, removing the first element at the same time to make sure this sequence always has a length of 25 (the sequence length used throughout).

In [None]:
for i in range(notes_to_generate):
    # Keep the returned values: the new item and the updated sequence.
    final_note, sequence_notes = predict_note(sequence_notes)
    final_duration, sequence_durations = predict_duration(sequence_durations)
    new_note = [final_note, final_duration]
    notes.append(new_note)

So all that is left is to use these functions and append all the predicted notes to the notes list previously defined.

In [15]:
midi = music21.stream.Stream()
midi.insert(music21.instrument.Violin())

In [16]:
for note in notes:
    new_note_midi = music21.note.Note(str(note[0]), quarterLength = float(note[1]))
    midi.append(new_note_midi)

In [17]:
midi.write("midi", fp = "song.mid")

'song.mid'

And, at last, we use music21 to transform those notes into a MIDI file, so we end up with a music file in the same format and structure as the original files in the dataset. And with this new file comes music generated by an AI.