# "Bach" propagation through time

    Thomas Moreau <thomas.moreau@inria.fr>
    Alexandre Gramfort <alexandre.gramfort@inria.fr>

Adapted from: [https://raphaellederman.github.io/articles/musicgeneration/](https://raphaellederman.github.io/articles/musicgeneration/)

### Objective:

- We will train a network to learn a **language model** and then use it to **generate new sequences**.

- Instead of training the language model on text documents (as is the case in most examples on the web), we will train it to learn the language of the music of [Johann Sebastian Bach](https://en.wikipedia.org/wiki/Johann_Sebastian_Bach).
For this, we will learn how J. S. Bach's "Cello suites" were composed.
Here is an example of a "Cello suite": [Link](https://www.youtube.com/watch?v=mGQLXRTl3Z0).

- Rather than analyzing the audio signal, we use a symbolic representation of the "Cello suites" through their [MIDI files](https://en.wikipedia.org/wiki/MIDI#MIDI_files).
  - A MIDI file encodes the set of musical notes, with their duration and intensity, that each instrument has to play to "render" a musical piece. The "rendering" is usually performed by a MIDI synthesizer (media players such as VLC or QuickTime can do it).

- We will first train a language model on the whole set of MIDI files of the "Cello suites". 
- We will then sample this language model to create a new MIDI file which will be a brand new "Cello suite" composed by the computer.




## Import packages

In [None]:
from pathlib import Path

import numpy as np
from scipy.io import wavfile 
import matplotlib.pyplot as plt

import IPython  # to play audio sounds

import pretty_midi  # to install with conda or pip

## Collect data to create the language model

We download the 36 MIDI files corresponding to the 36 movements of J. S. Bach's six "Cello suites" (six movements per suite).

In [None]:
DATA_DIR = Path() / 'midi_bach'
DATA_DIR.mkdir(exist_ok=True)

import urllib.request
list_midi_files = [
    'cs1-2all.mid', 'cs5-1pre.mid', 'cs4-1pre.mid', 'cs3-5bou.mid',
    'cs1-4sar.mid', 'cs2-5men.mid', 'cs3-3cou.mid', 'cs2-3cou.mid',
    'cs1-6gig.mid', 'cs6-4sar.mid', 'cs4-5bou.mid', 'cs4-3cou.mid',
    'cs5-3cou.mid', 'cs6-5gav.mid', 'cs6-6gig.mid', 'cs6-2all.mid',
    'cs2-1pre.mid', 'cs3-1pre.mid', 'cs3-6gig.mid', 'cs2-6gig.mid',
    'cs2-4sar.mid', 'cs3-4sar.mid', 'cs1-5men.mid', 'cs1-3cou.mid',
    'cs6-1pre.mid', 'cs2-2all.mid', 'cs3-2all.mid', 'cs1-1pre.mid',
    'cs5-2all.mid', 'cs4-2all.mid', 'cs5-5gav.mid', 'cs4-6gig.mid',
    'cs5-6gig.mid', 'cs5-4sar.mid', 'cs4-4sar.mid', 'cs6-3cou.mid'
]
n_files = len(list_midi_files)
for i, midi_name in enumerate(list_midi_files, start=1):
    print(f"Loading MIDI file {i:2} / {n_files}\r", end='', flush=True)
    f_path = DATA_DIR / midi_name
    if not f_path.exists():
        # download only the files that are not already on disk
        urllib.request.urlretrieve("http://www.jsbach.net/midi/" + midi_name, f_path)
        
list_midi_files = list(DATA_DIR.glob("*.mid"))
n_files = len(list_midi_files)
print(f"Loaded {n_files} MIDI files in folder {DATA_DIR}")

Let's first listen to a file:

In [None]:
midi_data = pretty_midi.PrettyMIDI(str(list_midi_files[0]))
# Synthesize the resulting MIDI data using sine waves
audio_data = midi_data.synthesize()
IPython.display.Audio(audio_data, rate=44100)

## Read and convert all MIDI files

We read all MIDI files and convert the content of each one to a one-hot-encoded matrix `X_ohe` of dimensions `(T_x, n_x)`, where `T_x` is the number of notes in the sequence and `n_x` is the number of possible musical notes.
The length `T_x` can vary from one sequence to another.
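As a minimal sketch of this encoding, here is a toy example. The vocabulary of 5 notes and the 0-based note indices are assumptions chosen only for illustration; they are not the values used in this notebook.

```python
import numpy as np

n_notes = 5  # toy vocabulary size (assumption for illustration)
notes_idx = np.array([0, 3, 3, 1])  # toy sequence of 0-based note indices

# One row per time step, one column per possible note.
T = len(notes_idx)
X_toy = np.zeros((T, n_notes))
X_toy[np.arange(T), notes_idx] = 1

# Taking the argmax along the note axis recovers the original indices.
decoded = X_toy.argmax(axis=-1)
print(X_toy.shape)
print(decoded)
```

Each row contains a single 1, so the matrix can always be turned back into a sequence of note indices with `argmax`.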
 


In [None]:
max_T_x = 1000  # maximum number of notes to consider in each file
n_x = 79  # number of notes

X_list = []

for midi_file in list_midi_files:
    # read the MIDI file
    midi_data = pretty_midi.PrettyMIDI(str(midi_file))
    notes_idx = [note.pitch for note in midi_data.instruments[0].notes]
    # convert to one-hot-encoding
    notes_idx = np.array(notes_idx[:max_T_x])
    T_x = len(notes_idx)
    X_ohe = np.zeros((T_x, n_x))
    X_ohe[np.arange(T_x), notes_idx - 1] = 1
    # add to the list  
    X_list.append(X_ohe)
    
print(f"We have {len(X_list)} MIDI tracks")
print(f"The size of the first one is {X_list[0].shape}")
print(f"The size of the second one is {X_list[1].shape}")

In [None]:
plt.hist([x.shape[0] for x in X_list]);  # histogram of sizes

## Display the set of notes over time for a specific track 

In [None]:
plt.figure(figsize=(12, 6))
plt.imshow(X_list[2].T[:, :100], aspect='auto')
plt.set_cmap('gray_r')
plt.grid(True)

Let's write a tiny function to play a track from its array representation.

In [None]:
def play_from_array(X):
    new_midi_data = pretty_midi.PrettyMIDI()
    cello_program = pretty_midi.instrument_name_to_program('Cello')
    cello = pretty_midi.Instrument(program=cello_program)
    step = 0.3
    for time_idx, note_idx in enumerate(X.argmax(axis=-1)):
        # column k of the one-hot encoding stands for MIDI pitch k + 1
        my_note = pretty_midi.Note(
            velocity=100, pitch=int(note_idx) + 1,
            start=time_idx * step,
            end=(time_idx+1) * step
        )
        cello.notes.append(my_note)
    new_midi_data.instruments.append(cello)
    audio_data = new_midi_data.synthesize()
    return IPython.display.Audio(audio_data, rate=44100)

play_from_array(X_list[2][:30])

## Data conversion for the training of language model

For each sequence and each possible starting position `t` in it, we create a training pair:
- an input sequence:
  - a sub-sequence of length `sequence_length`, ranging from note `t` to note `t+sequence_length-1`
- an output:
  - the following note to be predicted, the one at position `t+sequence_length`

The model is therefore trained by giving it a set of sequences as input and asking it to predict, each time, the note that comes right after the sequence.
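To make the indexing concrete, here is a small sketch of the windowing on a toy integer sequence. The helper name `make_windows` and the toy values are hypothetical and only serve the illustration; the exercise below applies the same idea to the one-hot matrices.

```python
import numpy as np

def make_windows(seq, window):
    """Return (inputs, targets) for next-item prediction on one sequence.

    Hypothetical helper, written only for this illustration.
    """
    inputs, targets = [], []
    for t in range(len(seq) - window):
        inputs.append(seq[t:t + window])   # items t .. t+window-1
        targets.append(seq[t + window])    # item right after the window
    return np.asarray(inputs), np.asarray(targets)

toy_seq = np.arange(10, 16)  # toy sequence: 10, 11, 12, 13, 14, 15
X_toy, y_toy = make_windows(toy_seq, window=3)
print(X_toy)
print(y_toy)
```

A sequence of length `T` yields `T - window` pairs, since the last `window` items have no complete window followed by a target.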

Solution in `solutions/02-0_convert_data.py`.

In [None]:
sequence_length = 20

X_train_list = []
y_train_list = []

##########################
# TODO

# END TODO
##########################

X_train = np.asarray(X_train_list)
y_train = np.asarray(y_train_list)

print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)

# Training the language model

The language model will be learned by training an RNN with input `X_train` and output `y_train`: for each example, we give the network a sequence of `sequence_length` notes and ask it to predict the note that follows.

The network will have the following structure:
- an `LSTM` layer with $n_a$=256
- a `Dense` layer with 256 units (`torch.nn.Linear` in PyTorch)
- a Dropout layer with rate 0.3 (each neuron is "dropped out" with probability 0.3)
- a `Dense` layer with a `softmax` activation, which predicts the probability of each of the $n_x$ notes as output. Note that `torch.nn.CrossEntropyLoss`, used for training below, applies the log-softmax itself, so in PyTorch the model should return the raw scores of this last layer; the `softmax` is only needed to turn these scores into probabilities.
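The sketch below shows one possible reading of this structure in PyTorch. It is not the content of the solution file: the class name `NoteModelSketch`, the parameter names, and the absence of an activation after the first linear layer are assumptions. It returns logits (no softmax inside the model) and one prediction per time step, so that the last step can be selected with `[:, -1]` as in the training loop.

```python
import torch

class NoteModelSketch(torch.nn.Module):
    """Hypothetical name; one possible reading of the structure above."""

    def __init__(self, n_x=79, n_a=256, p_drop=0.3):
        super().__init__()
        # batch_first=True so that inputs have shape (batch, time, n_x)
        self.lstm = torch.nn.LSTM(n_x, n_a, batch_first=True)
        # no activation is specified in the text, so none is applied here
        self.dense = torch.nn.Linear(n_a, 256)
        self.dropout = torch.nn.Dropout(p_drop)
        self.out = torch.nn.Linear(256, n_x)

    def forward(self, X):
        h, _ = self.lstm(X)              # (batch, time, n_a): one output per step
        z = self.dropout(self.dense(h))  # (batch, time, 256)
        return self.out(z)               # logits, shape (batch, time, n_x)

sketch = NoteModelSketch()
logits = sketch(torch.zeros(4, 20, 79))
print(logits.shape)
```

Probabilities can be obtained when needed with `torch.softmax(logits, dim=-1)`.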

Solution in `solutions/02-1_create_model.py`.

In [None]:
# create the model
import torch


class BachSynth(torch.nn.Module):

    def __init__(self):
        ###################
        # TODO


        # END TODO
        ###################

    def forward(self, X):
        ###################
        # TODO


        # END TODO
        ###################


model = BachSynth()
print(model)


## Let's start the training

In [None]:

X_train = torch.tensor(X_train).float()
y_train = torch.tensor(y_train)

if not torch.cuda.is_available():
    print("Training without GPU...")
    n_epochs = 2  # the bigger the better !
    n_batches_per_epochs = 300  # the bigger the better !
else:
    print("Training with GPU...")
    n_epochs = 10  # the bigger the better !
    n_batches_per_epochs = 1000  # the bigger the better !
    
    # move data and model to GPU.
    X_train = X_train.cuda()
    y_train = y_train.cuda()
    model.cuda()

# Define loss and optimizer
loss = torch.nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters())

for e in range(n_epochs):
    for i in range(n_batches_per_epochs):
        idx = np.random.choice(len(X_train), size=64, replace=False)
        opt.zero_grad()
        y_pred = model(X_train[idx])[:, -1]
        l = loss(y_pred, y_train[idx])
        l.backward()
        opt.step()
        if i % 5 == 0:
            print(f"Epoch {e} - Iteration {i} - Loss = {l.item()}\r", end='', flush=True)

# Generating a new sequence from sampling the language model

To generate a new sequence from the language model, we simply give it as input a random sequence of length `sequence_length` and ask the trained network to predict the output.

The output of the network is a probability vector of dimension $n_x$, which gives the probability of each note being the next note of the melody given as input.

From this vector, we select the note with the maximum probability.

We then append this new note (its one-hot-encoded representation) to the end of the input sequence.
Finally, we remove the first element of the input sequence to keep its length constant (`sequence_length`).
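The rolling-window mechanics can be sketched with NumPy alone. Here `toy_predict` is a hypothetical stand-in for the trained network (it simply favours the note following the last one), and the toy sizes are assumptions made for illustration.

```python
import numpy as np

n_notes, window = 4, 3  # toy sizes (assumptions for illustration)

def toy_predict(pattern):
    """Hypothetical stand-in for the trained network: returns a
    probability vector that favours the note after the last one."""
    last = pattern[-1].argmax()
    probs = np.full(n_notes, 0.1)
    probs[(last + 1) % n_notes] = 0.7
    return probs

# Seed pattern: one-hot encoding of the notes 0, 1, 2
pattern = np.eye(n_notes)[[0, 1, 2]]
generated = []
for _ in range(5):
    probs = toy_predict(pattern)
    next_note = probs.argmax()            # greedy choice: most probable note
    generated.append(int(next_note))
    one_hot = np.eye(n_notes)[next_note]
    # append the new note and drop the oldest one to keep the length fixed
    pattern = np.vstack([pattern[1:], one_hot])

print(generated)
```

Because the most probable note is always selected, the generation is deterministic once the seed pattern is fixed.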

Instead of providing a random sequence as input, we randomly select one of the sequences used for training.
We denote it by `pattern`.



In [None]:
# --- select a random starting pattern
rng = np.random.RandomState(42)
start = rng.randint(0, len(X_train_list))  # the upper bound is exclusive
pattern = X_train[start][None, :, :].clone()
print(start)
print(pattern.shape)

In [None]:
pat = pattern.detach().cpu().numpy()[0]  # .cpu() is needed if the data is on the GPU
plt.figure(figsize=(12, 6))
plt.imshow(pat.T, aspect='auto')
plt.plot(pat.argmax(axis=-1))
plt.set_cmap('gray_r')
plt.grid(True)

In [None]:
pattern.shape

Solution in `solutions/02-2_generate_sequence.py`

In [None]:
T_y_generated = 200

# start from the seed pattern (.cpu() is needed if the data is on the GPU)
prediction_l = list(pattern.detach().cpu().numpy()[0])

# generate T_y_generated notes
for note_index in range(T_y_generated):
    #######################
    # TODO

    # END TODO
    #######################

prediction_l = np.array(prediction_l)
prediction_l.shape


### Display the generated sequence

In [None]:
plt.figure(figsize=(12, 6))
plt.imshow(prediction_l.T, aspect='auto')
plt.plot(prediction_l.argmax(axis=-1))
plt.set_cmap('gray_r')
plt.grid(True)

### Listen to the generated sequence

In [None]:
play_from_array(prediction_l[:30])

<div class="alert alert-success">

**Exercise**

Insert after the first LSTM:
  - a second `LSTM` layer with $n_a$=256
  - a Dropout layer with rate 0.3 (each neuron is "dropped out" with probability 0.3) between the two layers.

Note that since we stack two LSTM layers on top of each other (deep RNN), the sequence output by the first LSTM at each time `t` is used as input to the second LSTM.
</div>

**HINT:** _No need to code, just read the doc of the [LSTM class](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html)._

Solution is in `solutions/02-2_layers_lstm.py`

## Going Beyond

Here, we are kind of cheating to reduce the size of the problem.
Do you see why?

<div class="alert alert-info">

**Answer:**

We are not really using an RNN here, as we only use fixed-length sequences to train the model.
We are learning an autoregressive (AR) model such that:

$$
x_{t+k} = f_\theta(x_t, \dots, x_{t+k-1})
$$

Here, the memory is not shared between successive predictions.

To change this, one would need to train with varying-length sequences and use "_many-to-many_" prediction, but this is computationally heavy.
See `scripts/traning_bach.py` for an example of RNN working directly on sequences.
</div>
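To illustrate what sharing the memory means, the following sketch (toy sizes and random inputs are assumptions, unrelated to the trained model) processes a sequence in two chunks. Passing the LSTM state from the first chunk to the second reproduces the outputs obtained on the full sequence, whereas restarting from an empty state does not.

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
x = torch.randn(1, 8, 3)  # one toy sequence of 8 time steps

with torch.no_grad():
    full, _ = lstm(x)                    # whole sequence at once
    first, state = lstm(x[:, :4])        # first half, keep (h, c)
    second, _ = lstm(x[:, 4:], state)    # second half, memory carried over
    restarted, _ = lstm(x[:, 4:])        # second half, memory reset

same_with_state = torch.allclose(full[:, 4:], second, atol=1e-5)
same_without_state = torch.allclose(full[:, 4:], restarted, atol=1e-5)
print(same_with_state, same_without_state)
```

In the fixed-length setting used in this notebook, every prediction corresponds to the "restarted" case: the network only sees the last `sequence_length` notes.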



