# Music Generation
This project tackles music generation with a Recurrent Neural Network (RNN) built from Long Short-Term Memory (LSTM) units. The model is trained on the MAESTRO dataset and developed with the TensorFlow and Keras deep learning frameworks.

## Essential Libraries
Install the `pretty_midi` library and import the libraries required for this project.

In [1]:
!pip install pretty_midi


In [2]:
import tensorflow as tf
import numpy as np
import pandas as pd
import collections
import datetime
import glob
import pathlib
import pretty_midi
import matplotlib.pyplot as plt
import seaborn as sns
import shutil

from typing import Dict, List, Optional, Sequence, Tuple
from IPython import display

## Data Preparation
We will use the [MAESTRO](https://magenta.tensorflow.org/datasets/maestro#v200) dataset, which contains MIDI recordings of piano performances that the model can learn from.

Start by downloading and extracting the dataset.

In [3]:
!gdown --fuzzy 'https://storage.googleapis.com/magentadata/datasets/maestro/v2.0.0/maestro-v2.0.0-midi.zip'

Downloading...
From: https://storage.googleapis.com/magentadata/datasets/maestro/v2.0.0/maestro-v2.0.0-midi.zip
To: /notebooks/maestro-v2.0.0-midi.zip
100%|██████████████████████████████████████| 59.2M/59.2M [00:00<00:00, 78.7MB/s]


In [4]:
zip_path = 'maestro-v2.0.0-midi.zip'
shutil.unpack_archive(zip_path)

In [5]:
data_directory = pathlib.Path('maestro-v2.0.0')
filenames = glob.glob(str(data_directory/'**/*.mid*'))

print('The dataset contains {} MIDI files.'.format(len(filenames)))

The dataset contains 1282 MIDI files.
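The `glob` call above does not pass `recursive=True`, so `**` behaves like a single `*` and matches exactly one directory level below `maestro-v2.0.0`. That is sufficient here because MAESTRO stores its MIDI files in one folder per year (see the sample path printed below), and `*.mid*` matches both `.mid` and `.midi` files. The self-contained sketch below illustrates the difference on a throwaway directory with made-up file names; it is an illustration, not part of the original pipeline.

```python
import glob
import pathlib
import tempfile

# Build a tiny directory tree that mimics a one-folder-per-year layout
# (all file names here are made up for illustration)
root = pathlib.Path(tempfile.mkdtemp())
(root / '2004' / 'deep').mkdir(parents=True)
for rel in ['2004/a.midi', '2004/b.mid', '2004/deep/c.midi', 'x.midi']:
    (root / rel).touch()

pattern = str(root / '**/*.mid*')

# Without recursive=True, '**' acts like '*' and matches exactly one directory level
one_level = sorted(glob.glob(pattern))
# With recursive=True, '**' matches zero or more directory levels
all_levels = sorted(glob.glob(pattern, recursive=True))

print([pathlib.Path(p).relative_to(root).as_posix() for p in one_level])
print([pathlib.Path(p).relative_to(root).as_posix() for p in all_levels])
```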


Take a look at a sample file.

In [6]:
sample_file = filenames[0]
print('Sample File: {}'.format(sample_file))

pm = pretty_midi.PrettyMIDI(sample_file)

for instrument in pm.instruments:
    instrument_name = pretty_midi.program_to_instrument_name(instrument.program)
    print('Instrument Name: {}'.format(instrument_name))
    
    for i, note in enumerate(instrument.notes[:5]):
        note_name = pretty_midi.note_number_to_name(note.pitch)
        duration = note.end - note.start
        print('{}: pitch={}, note_name={}, duration={:.4f}'.format(i, note.pitch, note_name, duration))

Sample File: maestro-v2.0.0/2004/MIDI-Unprocessed_XP_08_R1_2004_03_ORIG_MID--AUDIO_08_R1_2004_03_Track03_wav.midi
Instrument Name: Acoustic Grand Piano
0: pitch=67, note_name=G4, duration=0.5906
1: pitch=65, note_name=F4, duration=0.3760
2: pitch=63, note_name=D#4, duration=0.4385
3: pitch=61, note_name=C#4, duration=0.3104
4: pitch=59, note_name=B3, duration=0.3969


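Extract the note information into a dataframe. Besides each note's `pitch`, `start`, and `end` time, we record two relative features that the model will learn from: `step`, the time elapsed since the previous note started, and `duration`, how long the note is held.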

In [7]:
def midi_to_notes(file_path) -> pd.DataFrame:
    pm = pretty_midi.PrettyMIDI(file_path)
    instrument = pm.instruments[0]
    notes = collections.defaultdict(list)
    notes_sorted = sorted(instrument.notes, key=lambda x: x.start)
    prev_start = notes_sorted[0].start
    
    for note in notes_sorted:
        notes['pitch'].append(note.pitch)
        notes['start'].append(note.start)
        notes['end'].append(note.end)
        notes['step'].append(note.start - prev_start)
        notes['duration'].append(note.end - note.start)
        prev_start = note.start
        
    df = pd.DataFrame({name: np.array(value) for name, value in notes.items()})
    return df

In [8]:
raw_notes = midi_to_notes(sample_file)
raw_notes.head()

|   | pitch | start | end | step | duration |
|---|---|---|---|---|---|
| 0 | 67 | 1.0 | 1.590625 | 0.0 | 0.590625 |
| 1 | 65 | 1.754167 | 2.130208 | 0.754167 | 0.376042 |
| 2 | 63 | 2.440625 | 2.879167 | 0.686458 | 0.438542 |
| 3 | 61 | 3.096875 | 3.407292 | 0.65625 | 0.310417 |
| 4 | 59 | 3.80625 | 4.203125 | 0.709375 | 0.396875 |
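The `step` feature encodes timing relative to the previous note rather than absolute time, so a phrase with the same rhythm produces the same features wherever it occurs in a piece. As a plain-Python sketch of what the loop in `midi_to_notes` computes, the snippet below applies the same logic to three rows copied from the table above. `add_step_and_duration` is a hypothetical helper that needs neither pretty_midi nor a MIDI file.

```python
# (pitch, start, end) values copied from the first three rows of the table above
notes = [
    {'pitch': 67, 'start': 1.000000, 'end': 1.590625},
    {'pitch': 65, 'start': 1.754167, 'end': 2.130208},
    {'pitch': 63, 'start': 2.440625, 'end': 2.879167},
]

def add_step_and_duration(notes):
    """Hypothetical helper: return rows with relative 'step' and 'duration' added."""
    sorted_notes = sorted(notes, key=lambda n: n['start'])
    prev_start = sorted_notes[0]['start']  # so the first note has step 0
    rows = []
    for note in sorted_notes:
        rows.append({**note,
                     'step': note['start'] - prev_start,        # time since previous onset
                     'duration': note['end'] - note['start']})  # how long the note is held
        prev_start = note['start']
    return rows

for row in add_step_and_duration(notes):
    print(row['pitch'], round(row['step'], 6), round(row['duration'], 6))
```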


Try printing the names of some sample notes.

In [9]:
get_note_names = np.vectorize(pretty_midi.note_number_to_name)
sample_note_names = get_note_names(raw_notes['pitch'])
print(sample_note_names[:5])

['G4' 'F4' 'D#4' 'C#4' 'B3']
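`pretty_midi.note_number_to_name` maps each MIDI note number to a pitch class and an octave. The hypothetical `midi_number_to_name` below is a simplified sketch of that mapping, not pretty_midi's own code. It assumes the convention in which note 60 is middle C (`C4`) and black keys are spelled with sharps, which reproduces the five names printed above.

```python
# Assumption: MIDI note 60 is middle C ("C4") and black keys use sharps
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def midi_number_to_name(pitch: int) -> str:
    """Hypothetical helper: convert a MIDI note number (0-127) to a note name."""
    pitch_class = NOTE_NAMES[pitch % 12]  # 12 semitones per octave
    octave = pitch // 12 - 1              # octave -1 starts at MIDI note 0
    return f'{pitch_class}{octave}'

print([midi_number_to_name(p) for p in [67, 65, 63, 61, 59]])
```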


Add a utility function that builds a MIDI file from a note dataframe and writes it to disk.

In [10]:
def notes_to_midi(notes, out_file, instrument_name, velocity=100) -> pretty_midi.PrettyMIDI:
    pm = pretty_midi.PrettyMIDI()
    program = pretty_midi.instrument_name_to_program(instrument_name)
    instrument = pretty_midi.Instrument(program=program)
    prev_start = 0
    
    for i, note in notes.iterrows():
        start = float(prev_start + note['step'])
        end = float(start + note['duration'])
        note_pm = pretty_midi.Note(velocity=velocity, pitch=int(note['pitch']), start=start, end=end)
        instrument.notes.append(note_pm)
        prev_start = start
        
    pm.instruments.append(instrument)
    pm.write(out_file)
    return pm
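`notes_to_midi` rebuilds absolute times by accumulating the `step` values, starting from 0. One consequence is that the absolute offset of the first note is not preserved: the sample's first note, which starts at 1.0 s, would be written at 0 s, while the relative timing between notes is kept. A minimal plain-Python sketch of that reconstruction, using `itertools.accumulate` on values copied from the table above:

```python
from itertools import accumulate

# step and duration of the first three sample notes (copied from the table above)
steps = [0.0, 0.754167, 0.686458]
durations = [0.590625, 0.376042, 0.438542]

# Each start is the previous start plus the current step, beginning at time 0
starts = list(accumulate(steps))
ends = [start + duration for start, duration in zip(starts, durations)]

print([round(s, 6) for s in starts])
print([round(e, 6) for e in ends])
```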

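Next, parse the notes of the first 50 MIDI files into a single dataframe and stack the `pitch`, `step`, and `duration` columns into a training array. For efficient training, we wrap this array in a `tf.data` dataset that manages the input pipeline.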

In [11]:
num_files = 50
all_notes = []

for file in filenames[:num_files]:
    notes = midi_to_notes(file)
    all_notes.append(notes)
    
all_notes = pd.concat(all_notes)
num_notes = len(all_notes)
print('Number of notes parsed: {}'.format(num_notes))

key_order = ['pitch', 'step', 'duration']
train_notes = np.stack([all_notes[key] for key in key_order], axis=1)

notes_ds = tf.data.Dataset.from_tensor_slices(train_notes)
notes_ds.element_spec

Number of notes parsed: 306704


TensorSpec(shape=(3,), dtype=tf.float64, name=None)

In [12]:
all_notes

|      | pitch | start | end | step | duration |
|---|---|---|---|---|---|
| 0 | 67 | 1.000000 | 1.590625 | 0.000000 | 0.590625 |
| 1 | 65 | 1.754167 | 2.130208 | 0.754167 | 0.376042 |
| 2 | 63 | 2.440625 | 2.879167 | 0.686458 | 0.438542 |
| 3 | 61 | 3.096875 | 3.407292 | 0.656250 | 0.310417 |
| 4 | 59 | 3.806250 | 4.203125 | 0.709375 | 0.396875 |
| ... | ... | ... | ... | ... | ... |
| 2189 | 60 | 308.511458 | 310.297917 | 0.002083 | 1.786458 |
| 2190 | 67 | 309.253125 | 311.691667 | 0.741667 | 2.438542 |
| 2191 | 43 | 309.261458 | 311.726042 | 0.008333 | 2.464583 |
| 2192 | 62 | 309.268750 | 311.719792 | 0.007292 | 2.451042 |


In [13]:
train_notes

array([[6.70000000e+01, 0.00000000e+00, 5.90625000e-01],
       [6.50000000e+01, 7.54166667e-01, 3.76041667e-01],
       [6.30000000e+01, 6.86458333e-01, 4.38541667e-01],
       ...,
       [4.30000000e+01, 8.33333333e-03, 2.46458333e+00],
       [6.20000000e+01, 7.29166667e-03, 2.45104167e+00],
       [5.90000000e+01, 1.07916667e+00, 1.35729167e+00]])

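Because the LSTM is a sequential model, we turn the stream of notes into sequence examples: each input is a window of `seq_length` consecutive notes, and the label is the note that immediately follows it. The pitch in each input is divided by `vocab_size` (the 128 possible MIDI pitches) to scale it into the range [0, 1).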

In [14]:
def create_sequence(dataset, seq_length, vocab_size=128) -> tf.data.Dataset:
    seq_length = seq_length + 1
    windows = dataset.window(seq_length, shift=1, stride=1, drop_remainder=True)
    sequences = windows.flat_map(lambda x: x.batch(seq_length, drop_remainder=True))
    
    def scale_pitch(pitch):
        pitch = pitch / [vocab_size, 1.0, 1.0]
        return pitch
    
    def split_label(sequences):
        inputs = sequences[:-1]
        labels_dense = sequences[-1]
        labels = {key: labels_dense[i] for i, key in enumerate(key_order)}
        
        return scale_pitch(inputs), labels
    
    return sequences.map(split_label, num_parallel_calls=tf.data.AUTOTUNE)
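The windowing in `create_sequence` can be sketched in plain Python as follows. `make_examples` is a hypothetical helper that works on a list of `(pitch, step, duration)` tuples instead of a `tf.data.Dataset`: each window of `seq_length + 1` notes slides forward by one note, the first `seq_length` notes become the input with the pitch scaled by `vocab_size`, and the last note becomes the label. The label pitch is left unscaled so that it can serve as the class index for the sparse categorical cross-entropy loss. A list of N notes therefore yields N - `seq_length` examples.

```python
def make_examples(notes, seq_length, vocab_size=128):
    """Hypothetical plain-Python analogue of create_sequence.

    notes: list of (pitch, step, duration) tuples.
    Returns a list of (inputs, label) pairs.
    """
    size = seq_length + 1  # seq_length input notes plus one label note
    examples = []
    for i in range(len(notes) - size + 1):  # shift=1, incomplete windows dropped
        window = notes[i:i + size]
        # Scale only the pitch of the input notes, as scale_pitch does
        inputs = [(p / vocab_size, s, d) for p, s, d in window[:-1]]
        p, s, d = window[-1]
        label = {'pitch': p, 'step': s, 'duration': d}  # label pitch stays unscaled
        examples.append((inputs, label))
    return examples

toy_notes = [(60, 0.0, 0.5), (62, 0.5, 0.5), (64, 0.5, 0.5), (65, 0.5, 1.0), (67, 1.0, 1.0)]
examples = make_examples(toy_notes, seq_length=3)
print(len(examples))
for inputs, label in examples:
    print([round(p, 4) for p, _, _ in inputs], label)
```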

In [15]:
seq_length = 25
vocab_size = 128
seq_ds = create_sequence(notes_ds, seq_length, vocab_size)

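Now cache the sequences, shuffle them, and split them into batches. The order matters: `cache()` replays exactly the same elements on every pass, so caching after the shuffle would freeze the shuffled order for all epochs, whereas caching first lets `shuffle()` produce a new order each epoch.

In [16]:
batch_size = 64
buffer_size = num_notes - seq_length
train_ds = (seq_ds
            .cache()  # cache first so that shuffle() produces a new order each epoch
            .shuffle(buffer_size)
            .batch(batch_size, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))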
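As a worked check of the pipeline above, the number of training examples and batches per epoch follows from the note count printed earlier. The window count, `num_notes - seq_length`, equals `buffer_size`, so the shuffle buffer holds the entire dataset.

```python
num_notes = 306704  # "Number of notes parsed" printed above
seq_length = 25
batch_size = 64

# Each window holds seq_length + 1 notes and slides forward one note at a time
num_examples = num_notes - (seq_length + 1) + 1
# batch(..., drop_remainder=True) discards the final partial batch
steps_per_epoch = num_examples // batch_size
dropped = num_examples % batch_size

print(num_examples, steps_per_epoch, dropped)
```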

## Model Training
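Before training, we need a custom loss for the step and duration outputs. It is a mean squared error plus an extra penalty term, `positive_pressure`, that grows with any negative prediction, since neither the step nor the duration of a note can be negative.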

In [17]:
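def mse(y_true, y_pred):
    # Squared error between the true and predicted values
    squared_error = (y_true - y_pred) ** 2
    # Penalty that grows with negative predictions (step and duration must be >= 0)
    positive_pressure = 10 * tf.maximum(-y_pred, 0.0)
    return tf.reduce_mean(squared_error + positive_pressure)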
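To see what the `positive_pressure` term does, here is a plain-Python version of the same loss for lists of floats (the name `mse_with_positive_pressure` is illustrative). Two predictions with the same absolute error of 0.1 receive very different losses when one of them is negative:

```python
def mse_with_positive_pressure(y_true, y_pred):
    """Illustrative plain-Python version of the custom loss for lists of floats."""
    terms = []
    for t, p in zip(y_true, y_pred):
        squared_error = (t - p) ** 2
        positive_pressure = 10 * max(-p, 0.0)  # only negative predictions are penalised
        terms.append(squared_error + positive_pressure)
    return sum(terms) / len(terms)

# Both predictions are 0.1 away from the target 0.05
print(mse_with_positive_pressure([0.05], [0.15]))   # positive prediction: squared error only
print(mse_with_positive_pressure([0.05], [-0.05]))  # negative prediction: squared error + 10 * 0.05
```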

Now it is time to build the model. As mentioned, a single LSTM layer with 128 units processes the input sequence. Its output feeds three fully-connected output heads: a 128-unit layer that produces one logit per MIDI pitch, and two single-unit layers that predict the step and the duration. We use the Sparse Categorical Cross-entropy loss for the pitch output and the custom mean squared error loss for the step and duration outputs.

In [18]:
lr = 0.005

inputs = tf.keras.layers.Input((seq_length, 3))
x = tf.keras.layers.LSTM(128)(inputs)

outputs = {'pitch': tf.keras.layers.Dense(128)(x),
          'step': tf.keras.layers.Dense(1)(x),
          'duration': tf.keras.layers.Dense(1)(x)}

model = tf.keras.Model(inputs, outputs)

losses = {'pitch': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
         'step': mse,
         'duration': mse}

optimiser = tf.keras.optimizers.Adam(learning_rate=lr)

model.compile(loss=losses, optimizer=optimiser)

In [19]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 25, 3)]      0           []                               
                                                                                                  
 lstm (LSTM)                    (None, 128)          67584       ['input_1[0][0]']                
                                                                                                  
 dense_2 (Dense)                (None, 1)            129         ['lstm[0][0]']                   
                                                                                                  
 dense (Dense)                  (None, 128)          16512       ['lstm[0][0]']                   
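The summary output above is cut off: the other single-unit output head and the parameter totals are missing. The counts can be checked by hand, assuming the standard Keras layouts (an LSTM has four gates, each with an input kernel, a recurrent kernel, and a bias; a Dense layer has a weight matrix and a bias vector):

```python
input_dim, lstm_units = 3, 128  # (pitch, step, duration) features; LSTM(128)

# Four gates, each with input kernel, recurrent kernel and bias
lstm_params = 4 * (input_dim * lstm_units + lstm_units * lstm_units + lstm_units)

def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

pitch_head = dense_params(lstm_units, 128)
step_head = dense_params(lstm_units, 1)
duration_head = dense_params(lstm_units, 1)
total = lstm_params + pitch_head + step_head + duration_head

print(lstm_params, pitch_head, step_head, duration_head, total)
```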
                                                                                              

Define callbacks for model checkpointing and early stopping.

In [20]:
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath='./training_checkpoints/ckpt_{epoch}', save_weights_only=True)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5, verbose=1, restore_best_weights=True)

callbacks = [model_checkpoint, early_stopping]

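Recompile the model with per-output loss weights. The pitch loss is scaled down to 0.05 so that the cross-entropy over 128 pitch classes does not dominate the step and duration losses.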

In [21]:
model.compile(loss=losses, 
              loss_weights={'pitch': 0.05, 'step': 1.0, 'duration': 1.0},
              optimizer=optimiser)
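With these weights, the value Keras minimises is the weighted sum 0.05 × pitch loss + step loss + duration loss. For a rough sense of scale, assuming near-uniform pitch predictions at the start of training, sparse cross-entropy over 128 classes is about ln 128 ≈ 4.85, so the weighted pitch term starts around 0.24. The step and duration values in this sketch are made up for illustration.

```python
import math

loss_weights = {'pitch': 0.05, 'step': 1.0, 'duration': 1.0}

def weighted_total_loss(losses, weights=loss_weights):
    """Weighted sum of per-output losses, as Keras combines them."""
    return sum(weights[name] * value for name, value in losses.items())

# Assumption: near-uniform pitch predictions at the start of training
uniform_pitch_loss = math.log(128)  # cross-entropy of a uniform guess over 128 pitches
print(round(uniform_pitch_loss, 3))

# Made-up step and duration losses, for illustration only
print(weighted_total_loss({'pitch': uniform_pitch_loss, 'step': 0.1, 'duration': 0.1}))
```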

Train the model for up to 50 epochs. Early stopping ends training once the loss has not improved for 5 consecutive epochs.

In [22]:
epochs = 50

history = model.fit(train_ds, epochs=epochs, callbacks=callbacks)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 43: early stopping


## Musical Note Prediction
Now we can use the trained model to generate notes. Generation starts from a seed sequence of notes: the model predicts the next note, that note is appended to the sequence while the oldest note is dropped, and the process repeats. Always picking the most likely pitch tends to produce repetitive music, so instead we sample the pitch from the predicted distribution, scaled by a temperature parameter. Higher temperatures flatten the distribution and give more random notes; lower temperatures make the output more predictable.

In [23]:
def predict_next(notes, model, temperature=1.0) -> Tuple[int, float, float]:
    # Add the batch dimension: (seq_length, 3) -> (1, seq_length, 3)
    inputs = tf.expand_dims(notes, 0)

    predictions = model.predict(inputs, verbose=0)
    pitch_logits = predictions['pitch']
    step = predictions['step']
    duration = predictions['duration']

    # Scale the logits by the temperature, then sample one pitch
    pitch_logits /= temperature
    pitch = tf.random.categorical(pitch_logits, num_samples=1)
    pitch = tf.squeeze(pitch, axis=-1)
    step = tf.squeeze(step, axis=-1)
    duration = tf.squeeze(duration, axis=-1)

    # Step and duration cannot be negative
    step = tf.maximum(0.0, step)
    duration = tf.maximum(0.0, duration)

    return int(pitch), float(step), float(duration)
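`tf.random.categorical` samples from the softmax of the logits it receives, so dividing the logits by the temperature first reshapes that distribution. The plain-Python sketch below, with made-up logits for three pitches, shows the effect; `random.choices` stands in for `tf.random.categorical`, and both helper names are illustrative.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_pitch(logits, temperature=1.0, rng=random):
    """Sample one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up logits for three pitches
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])

print(sample_pitch(toy_logits, temperature=2.0, rng=random.Random(0)))
```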

We can now seed the model with the first `seq_length` notes of the sample file, with the pitch scaled in the same way as the training inputs, and generate 120 notes at a temperature of 2.0.

In [24]:
temperature = 2.0
num_predictions = 120

sample_notes = np.stack([raw_notes[key] for key in key_order], axis=1)
input_notes = sample_notes[:seq_length] / [vocab_size, 1.0, 1.0]

generated_notes = []
prev_start = 0

for i in range(num_predictions):
    pitch, step, duration = predict_next(input_notes, model, temperature)
    start = prev_start + step
    end = start + duration
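    generated_notes.append((pitch, step, duration, start, end))
    # Slide the window: drop the oldest note and append the new one,
    # scaling its pitch the same way as the training inputs
    input_note = (pitch / vocab_size, step, duration)
    input_notes = np.delete(input_notes, 0, axis=0)
    input_notes = np.append(input_notes, np.expand_dims(input_note, 0), axis=0)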
    prev_start = start

generated_notes = pd.DataFrame(generated_notes, columns=(*key_order, 'start', 'end'))
generated_notes.head(10)



|   | pitch | step | duration | start | end |
|---|---|---|---|---|---|
| 0 | 51 | 0.16803 | 0.355413 | 0.16803 | 0.523443 |
| 1 | 98 | 0.254143 | 0.304276 | 0.422173 | 0.726449 |
| 2 | 97 | 0.301444 | 0.388577 | 0.723617 | 1.112194 |
| 3 | 96 | 0.306193 | 0.385979 | 1.02981 | 1.415789 |
| 4 | 97 | 0.303677 | 0.37637 | 1.333487 | 1.709857 |
| 5 | 96 | 0.298658 | 0.37191 | 1.632145 | 2.004055 |
| 6 | 98 | 0.293009 | 0.364048 | 1.925154 | 2.289202 |
| 7 | 101 | 0.291001 | 0.369607 | 2.216155 | 2.585762 |
| 8 | 101 | 0.294002 | 0.381246 | 2.510157 | 2.891403 |
| 9 | 98 | 0.295758 | 0.387834 | 2.805915 | 3.193749 |
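The generation loop keeps a fixed-length window of the most recent notes. `np.delete` followed by `np.append` copies the array on every step; a `collections.deque` with `maxlen` expresses the same sliding window directly. In the sketch below, `generate` and `fake_predict` are hypothetical: `fake_predict` stands in for `predict_next` so the loop runs without the trained model, and each new note is fed back with its pitch scaled by `vocab_size`, so the window matches how the training inputs were scaled.

```python
from collections import deque

def generate(seed_notes, predict_fn, num_predictions, vocab_size=128):
    """Generate notes autoregressively with a fixed-size sliding window.

    seed_notes: list of (scaled_pitch, step, duration) tuples
    predict_fn: hypothetical stand-in for predict_next; returns (pitch, step, duration)
    """
    # A deque with maxlen drops the oldest note automatically on append
    window = deque(seed_notes, maxlen=len(seed_notes))
    generated = []
    prev_start = 0.0
    for _ in range(num_predictions):
        pitch, step, duration = predict_fn(list(window))
        start = prev_start + step
        generated.append((pitch, step, duration, start, start + duration))
        # Feed the new note back with its pitch scaled like the training inputs
        window.append((pitch / vocab_size, step, duration))
        prev_start = start
    return generated

def fake_predict(window):
    """Hypothetical predictor that always returns middle C with fixed timing."""
    return 60, 0.25, 0.5

seed = [(60 / 128, 0.25, 0.5)] * 4
toy_generated = generate(seed, fake_predict, num_predictions=3)
for note in toy_generated:
    print(note)
```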


Use the `notes_to_midi` utility function defined earlier to write the generated notes to a MIDI file.

In [25]:
out_file = 'output.midi'
out_pm = notes_to_midi(generated_notes, out_file=out_file, instrument_name=instrument_name)

The output file can then be played with any media player that supports MIDI files.