# Melody Generation

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 06.06.2024
+ **Author:** B. Zönnchen

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A4_melody_generation/4_2_ml_rnn.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In the following we will create sheet music and sound. For these tasks, ``Python`` requires external programs that you should install if you are working locally:

1. [MuseScore](https://musescore.org/de) (for generating sheets)
2. [FluidSynth](https://www.fluidsynth.org/) (for generating sound)

If you are working on Google ``Colab``, you can run the following three cells to install these applications and to clone the course repository:

In [None]:
#@title install dependencies to play sound
%%capture
print('installing fluidsynth...')
!apt-get install fluidsynth > /dev/null
!cp /usr/share/sounds/sf2/FluidR3_GM.sf2 ./font.sf2
print('done!')

In [None]:
#@title install dependencies to show score in music notation
%%capture
print('installing musescore3...')
!apt-get install musescore3 > /dev/null
print('done!')

In [None]:
#@title clone git repository
%%capture
%rm -rf aica-assignments
!git clone https://github.com/aica-wavelab/aica-assignments.git
%cd aica-assignments/A4_melody_generation

Furthermore, for this notebook we need the following ``Python`` packages and modules. Execute the cell to install them:

In [None]:
%pip install music21
%pip install pyfluidsynth

%pip install matplotlib
%pip install seaborn

%pip install pandas
%pip install numpy
#%pip install tensorflow
%pip install tensorflow==2.16.1

%pip install otter-grader

In [None]:
import tensorflow as tf
import numpy as np
import music21 as m21

import zipfile
from files import load_midi_files
from pianoroll import stream_to_df, plot_df
from encoder import PianoRollEncoder, StringToIntEncoder, TERM_SYMBOL

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("4_2_ml_rnn.ipynb")

## 4.2 Recurrent Neural Network (RNN) / LSTM

RNNs allow us to predict the next element of a sequence from a finite input sequence of arbitrary length. How accurate this prediction is, is another question. An RNN works by writing into a kind of *memory*; it is therefore not memoryless! Let us assume we have a sequence of notes / events $\mathbf{x}_0$ = ``C4``, $\mathbf{x}_1$ = ``G4``, $\mathbf{x}_2$ = ``_``, ... We first feed ``C4`` into the RNN. It generates not only an output $\mathbf{y}_0$, that is, the predicted next note, but also a so-called hidden state $\mathbf{h}_0$, which acts like a memory. When we feed in the next note ``G4``, the hidden state $\mathbf{h}_0$ is part of the input as well, and it is updated to $\mathbf{h}_1$. The following figure shows an RNN unfolded in time.

<img src="figs/rnn-unfold.png" alt="" height="300">
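
To make the recurrence concrete, the following is a minimal NumPy sketch of a single vanilla RNN step applied to a toy sequence. The weight names (``W_xh``, ``W_hh``, ``W_hy``), the ``tanh`` activation, and the toy sizes are assumptions made for illustration; they are not taken from the model trained in this notebook.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # The new hidden state depends on the current input and the previous hidden state
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # The output is a probability distribution over the next event
    y_t = softmax(W_hy @ h_t + b_y)
    return y_t, h_t

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5, 3  # toy sizes, chosen arbitrarily

W_xh = rng.normal(size=(hidden_dim, vocab_size))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hy = rng.normal(size=(vocab_size, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(vocab_size)

# A toy sequence of three one-hot encoded events
sequence = np.eye(vocab_size)[np.array([0, 3, 1])]

h = np.zeros(hidden_dim)  # empty memory at the start
for x_t in sequence:
    y, h = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)

print(y)  # predicted distribution over the next event
```

The same weights are reused at every time step; only the hidden state ``h`` changes while the sequence is read.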

Since this can go on basically indefinitely, we can consider many notes to predict the next one. In the following, ``sequence_len`` determines the number of notes / events we consider during training. This means we need to construct sequences of this length from the data during data preparation. Although our trained RNN / LSTM can also process longer sequences, it has never seen such sequences during training, so the quality of its predictions is likely to decrease.

Furthermore, this time we will consider events that all have the same duration (one ``time_step``, measured in quarter notes) instead of notes. A note is therefore represented by an ``X-NoteOn`` event, where ``X`` indicates the pitch, followed by one ``NoteHold`` event for every further time step the note lasts. A rest starts with a ``Rest`` event and is held in the same way. For example:

```
65 _ _ _ 77 _ r _ _
```

This would mean that the piece begins with the MIDI note 65 (``65-NoteOn`` event). The note is held for 3 more time steps (4 in total), then a ``77-NoteOn`` event follows, which is held for one more time step, and finally, a rest follows for a total of 3 time steps. We already showed how to transform a ``Stream`` into such a *piano roll like* representation using the ``EventEncoder`` class.
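
The mapping from notes to events can be sketched in a few lines. ``notes_to_events`` is a hypothetical helper written only for this illustration (it is not part of the course modules); it assumes each note is given as a pair of MIDI pitch and duration in time steps, with ``None`` standing for a rest.

```python
def notes_to_events(notes):
    """Convert (pitch, steps) pairs into a list of event tokens.

    pitch is a MIDI number or None for a rest; steps is the duration
    in time steps (at least 1).
    """
    events = []
    for pitch, steps in notes:
        # The first time step starts the note (or the rest) ...
        events.append('r' if pitch is None else str(pitch))
        # ... and every further time step holds it
        events.extend(['_'] * (steps - 1))
    return events

events = notes_to_events([(65, 4), (77, 2), (None, 3)])
print(' '.join(events))  # 65 _ _ _ 77 _ r _ _
```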

An **LSTM** is an extension of the vanilla RNN which essentially makes it easier to learn longer sequences. Since we write into one hidden state vector, you can imagine that information gets washed away over time. A vanilla RNN has no ability to learn which kind of information might be more important. An LSTM, which stands for **long short-term memory network**, has some ability to learn what to put into its 'long term memory' and what information to put into its 'short term memory'.

Do not worry too much about the illustration of the LSTM below. Basically, there are matrices $W_f, W_g, W_i, W_o$ with **learnable** parameters that control what to forget and what to keep, based on the current element of the sequence $\mathbf{x}_t$ and the last hidden state $\mathbf{h}_{t-1}$.

<img src="figs/lstm-cell.png" alt="" height="400">

+ **Update Gate** (often called the *input gate*): Determines how much of the new information to add to the cell state $\mathbf{c}_t$ ('long term memory').
+ **Forget Gate:** Decides what information is no longer needed and removes it from the cell state $\mathbf{c}_t$, helping to prevent the accumulation of irrelevant information.
+ **Output Gate:** Controls the extent to which the value in the cell is used to compute the output activation of the block.
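
The interplay of the gates can be written down as a single LSTM step. The following NumPy sketch uses the standard LSTM equations; the dictionary layout of the weights, the zero biases, and the toy sizes are assumptions for illustration and do not reflect how ``Keras`` stores its parameters internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Each gate sees the previous hidden state and the current input
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W['f'] @ z + b['f'])  # forget gate
    i = sigmoid(W['i'] @ z + b['i'])  # update (input) gate
    g = np.tanh(W['g'] @ z + b['g'])  # candidate values for the cell state
    o = sigmoid(W['o'] @ z + b['o'])  # output gate
    c_t = f * c_prev + i * g          # new cell state ('long term memory')
    h_t = o * np.tanh(c_t)            # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 3  # toy sizes, chosen arbitrarily
W = {k: rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for k in 'figo'}
b = {k: np.zeros(hidden_dim) for k in 'figo'}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in np.eye(input_dim)[np.array([0, 3, 1])]:
    h, c = lstm_step(x_t, h, c, W, b)

print(h, c)
```

Note that the cell state ``c`` is only scaled by the forget gate and added to, which is what allows information to survive over many time steps.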

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.2.1**: Why would it be impossible to use the *piano roll like* representation for our simple neural network in ``4_1_ml_ffn_markov``?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 4.2.1 Data Preparation

This time we use not just one piece of music but roughly 1000 folk-song melodies. The ``.zip`` file ``data/deu_folk_songs.zip`` contains these melodies as MIDI files. They are our **training data** for the *deep learning* experiments later.

In [None]:
with zipfile.ZipFile('data/deu_folk_songs.zip', 'r') as zip_ref:
    zip_ref.extractall('data/deu_folk_songs/')

With the archive unzipped, you can use the ``load_midi_files`` function to load all the MIDI files. It returns a list of ``Stream`` objects. You can and should specify a ``time_step`` to filter out files that cannot be represented with the time step we want to use. We also transpose all pieces into the key of ``C major`` so that our model only has to learn one key!

In [None]:
# this can take some time!
time_step = 0.5
streams_folk_songs = load_midi_files('data/deu_folk_songs/', time_step=time_step, transpose_to_major=True)
print(f'loaded {len(streams_folk_songs)} files')

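To enable a *one-hot encoding*, we first encode each stream as a piano roll of string events and then map these events to integers ranging from $0$ to $m-1$, where $m$ is the number of distinct events. The second step can be done by using the ``StringToIntEncoder`` appropriately.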

In [None]:
piano_roll_encoder = PianoRollEncoder(time_step=time_step)
piano_rolls, _ = piano_roll_encoder.encode_streams(streams_folk_songs)
print(f'piano rolls as strings: {piano_rolls[:3]}')

string_to_int = StringToIntEncoder(piano_rolls)
piano_rolls_int = string_to_int.encode_sequences(piano_rolls)
vocab_size = len(string_to_int)

print(f'piano roll as int: {piano_rolls_int[0]}')
print(f'vocab_size: {vocab_size}')
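
Conceptually, mapping events to integers amounts to building a vocabulary. The sketch below shows the idea on two made-up piano rolls with plain dictionaries; it is an assumption about the general technique, not the actual implementation of ``StringToIntEncoder``.

```python
# Two toy piano rolls as lists of event tokens
toy_rolls = [['60', '_', '62', '_'], ['r', '60', '_', '_']]

# Collect every distinct token and give each one an index 0..m-1
vocab = sorted({token for roll in toy_rolls for token in roll})
token_to_int = {token: i for i, token in enumerate(vocab)}
int_to_token = {i: token for token, i in token_to_int.items()}

encoded = [[token_to_int[t] for t in roll] for roll in toy_rolls]
decoded = [[int_to_token[i] for i in roll] for roll in encoded]

print(token_to_int)
print(encoded)
```

Decoding with the inverse dictionary recovers the original tokens, which is what we need later to turn generated integers back into events.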

Next we have to generate our training dataset. This time, however, the input is a sequence of events and the output is the next event!
The sequence length is our first (hyper-)parameter. To enable the network to predict from shorter sequences, we pad with ``TERM_SYMBOL``.

In [None]:
sequence_len = 64 # this is a hyper-parameter!

xs = []
ys = []
i_term_symbol = string_to_int.encode(TERM_SYMBOL)
for piano_roll in piano_rolls_int:
  padded_piano_roll = [i_term_symbol]*sequence_len + piano_roll + [i_term_symbol]*sequence_len
  for i in range(len(padded_piano_roll)-sequence_len):
    xs.append(padded_piano_roll[i:(i+sequence_len)])
    ys.append(padded_piano_roll[i+sequence_len])

print(xs[:3])
print(ys[:2])

In [None]:
len(xs)
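
The windowing is easier to follow on a tiny example. The sketch below applies the same padding and slicing to a made-up encoded piano roll with a window length of 2; the names ``TERM``, ``toy_roll`` and ``toy_len`` are placeholders chosen for this illustration.

```python
TERM = -1                # stand-in index for the termination symbol
toy_roll = [5, 7, 7, 3]  # a made-up encoded piano roll
toy_len = 2              # a short window instead of sequence_len

padded = [TERM] * toy_len + toy_roll + [TERM] * toy_len
pairs = [(padded[i:i + toy_len], padded[i + toy_len])
         for i in range(len(padded) - toy_len)]

for x, y in pairs:
    print(x, '->', y)
```

Each piece contributes ``len(piano_roll) + sequence_len`` training pairs: the leading padding lets the network predict the first events from an (almost) empty context, and the trailing padding teaches it when to stop.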

Now we have to transform ``xs`` into a *one-hot encoding*:

In [None]:
X = tf.one_hot(xs, depth=vocab_size).numpy().astype('float32')
y = np.array(ys)

print(X.shape)
print(y.shape)

print(X[:3])
print(y[:2])
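
For intuition, the same one-hot encoding can be reproduced with plain NumPy on a toy input. The values below are made up; the sketch only illustrates the resulting shape ``(number of sequences, sequence length, vocabulary size)``.

```python
import numpy as np

toy_xs = [[0, 2, 1], [3, 0, 2]]  # two made-up sequences of length 3
toy_vocab_size = 4

# Row i of the identity matrix is the one-hot vector for index i
toy_X = np.eye(toy_vocab_size, dtype='float32')[np.array(toy_xs)]

print(toy_X.shape)  # (2, 3, 4)
print(toy_X[0])
```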

### 4.2.2 Model Definition

In [None]:
hidde_state_dimension = 16 # this is a hyper-parameter!
learning_rate = 0.01 # this is a hyper-parameter!
loss = 'sparse_categorical_crossentropy' # this is a hyper-parameter!
epochs = 50 # this is a hyper-parameter!
batch_size = 16 # this is a hyper-parameter!
save_model_path = f'models/model_{hidde_state_dimension}_{sequence_len}_{batch_size}.keras'

inputs = tf.keras.layers.Input(shape=(None, vocab_size))
x = inputs
x = tf.keras.layers.LSTM(hidde_state_dimension)(x)
x = tf.keras.layers.Dropout(0.2)(x)

output = tf.keras.layers.Dense(vocab_size, activation='softmax')(x)

model = tf.keras.Model(inputs, output)

# compile model
model.compile(
    loss=loss,
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    metrics=['accuracy'])

model.summary()

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.2.2**: Look at the model summary above. Can you reason about the number of learnable parameters? 

**Hint:** Look at the image of an LSTM we introduced in the beginning and consider that each of the ``4`` green boxes basically represents a layer of a neural network (a matrix and a bias term). Furthermore, the last number in *Output Shape* of the table displayed by ``model.summary()`` is the dimension of the output of that layer of the neural network!

</div>

_Type your answer here, replacing this text._

In [None]:
4*((vocab_size + hidde_state_dimension) * hidde_state_dimension + hidde_state_dimension)

In [None]:
hidde_state_dimension*vocab_size + vocab_size

<!-- END QUESTION -->
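
The two expressions above can be broken down further. The sketch below splits the LSTM count into the weights applied to the input, the weights applied to the previous hidden state, and the biases, each for all four gates at once. The sizes are made up, and the helper functions are hypothetical names introduced for this illustration.

```python
def lstm_param_count(input_dim, hidden_dim):
    kernel = input_dim * 4 * hidden_dim             # weights applied to x_t
    recurrent_kernel = hidden_dim * 4 * hidden_dim  # weights applied to h_{t-1}
    bias = 4 * hidden_dim                           # one bias vector per gate
    return kernel + recurrent_kernel + bias

def dense_param_count(input_dim, output_dim):
    # One weight per input/output pair plus one bias per output
    return input_dim * output_dim + output_dim

toy_vocab_size, toy_hidden = 40, 16  # made-up sizes
print(lstm_param_count(toy_vocab_size, toy_hidden))
print(dense_param_count(toy_hidden, toy_vocab_size))
```

The ``Dropout`` layer has no learnable parameters, so these two numbers add up to the total.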

### 4.2.3 Model Training

Because the training can take some time, we already trained models using different hyper-parameters. You can find them in the folder ``models/`` and load them instead of training.

In [None]:
#with tf.device('cpu:0'):

#this can take a while

# train the model
model.fit(X, y, epochs=epochs, batch_size=batch_size)

# save model
model.save(save_model_path)

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.2.3**: Find a good metaphor for the hyper-parameter ``hidde_state_dimension``. What might be the effect if we make it larger or smaller?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 4.2.4 (optional) Load a Model

In [None]:
# A large model with a "memory size", that is, a hidden state, of 256 dimensions
path_to_large_model = 'models/model_256_64_128.keras'

# A small model with a "memory size", that is, a hidden state, of 16 dimensions
path_to_small_model = 'models/model_16_64_16.keras'

model = tf.keras.models.load_model(path_to_large_model)

In [None]:
def generate_melody(model, sequence_len, seed, string_to_int: StringToIntEncoder, temperature: float=1.0, max_len=1_000) -> list[str]:

    # Pad with TERM_SYMBOL
    padded_seed = [TERM_SYMBOL]*sequence_len + seed
    
    # Transform padding + seed into numbers
    seed_int = string_to_int.encode_sequence(padded_seed)

    # Add the seed to the melody
    melody = seed[:]

    while True:
        seed_int = seed_int[-sequence_len:]
        # One-hot encode the seed
        onehot_seed = tf.one_hot([seed_int], depth=len(string_to_int)).numpy().astype('float32')
        
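        # Make prediction, note that we have a softmax-layer integrated into our model.
        # verbose=0 suppresses the progress bar that would otherwise be printed in every iteration.
        # We cast to float64 because float32 rounding can make np.random.choice
        # reject the probabilities for not summing to 1 closely enough.
        probabilities = model.predict(onehot_seed, verbose=0)[0].astype('float64')

        # The model already applied softmax; taking the logarithm recovers the logits
        # (up to an additive constant), which we then divide by the temperature
        probabilities = np.log(probabilities) / temperature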

        # Recompute softmax afterwards
        probabilities = np.exp(probabilities) / np.sum(np.exp(probabilities))

        choices = range(len(probabilities))
        symbol_int = np.random.choice(choices, p=probabilities)
        seed_int.append(symbol_int)
        
        symbol = string_to_int.decode(symbol_int)

        if symbol != TERM_SYMBOL:
            melody.append(symbol)
            if max_len <= len(melody):
                break
        else:
            break
    return melody
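
To experiment with the rescaling step in isolation, it can be applied to a made-up probability vector. ``apply_temperature`` is a hypothetical helper that mirrors the log / divide / softmax steps used in ``generate_melody``; the probabilities are invented for illustration.

```python
import numpy as np

def apply_temperature(probabilities, temperature):
    # Work in float64 so the result sums to 1 within NumPy's tolerance
    p = np.asarray(probabilities, dtype='float64')
    # Divide the log-probabilities by the temperature ...
    logits = np.log(p) / temperature
    logits -= logits.max()  # for numerical stability
    # ... and recompute the softmax
    scaled = np.exp(logits)
    return scaled / scaled.sum()

toy_probabilities = [0.7, 0.2, 0.1]  # made-up model output
for t in (0.5, 1.0, 2.0):
    print(t, apply_temperature(toy_probabilities, t))
```

Compare the printed distributions with the original one and try further values of ``t``.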

In [None]:
score_as_list = generate_melody(model, sequence_len=sequence_len, seed=['60', '_', '_'], string_to_int=string_to_int, temperature=1.0, max_len=100)
print(score_as_list)
score = piano_roll_encoder.decode_stream(score_as_list)
score.show('midi')

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.2.4**: Explain what the function ``generate_melody`` computes and how it works. Explain what the parameter ``temperature`` controls. You might want to generate multiple melodies with different ``temperature`` values and listen to the results.

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.2.5**: We used all the data for training, neglecting any validation. Explain and think about:

1. What is the disadvantage of not validating our model? **Hint:** Think of the concept of *overfitting*.
2. What would happen if the model overfits its training data? Is this always bad in our context, or might we want this to happen?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

