# Melody Generation

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 06.06.2024
+ **Author:** B. Zönnchen

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A4_melody_generation/4_1_ml_ffn_markov.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In the following we will create sheet music and sound. For those tasks ``Python`` requires external programs that you should install if you are working locally:

1. [Musescore](https://musescore.org/de) (for generating sheets)
2. [FluidSynth](https://www.fluidsynth.org/) (for generating sound)

If you are working on Google ``Colab``, you can run the following cells to install these applications and clone the course repository:

In [None]:
#@title install dependencies to play sound
%%capture
print('installing fluidsynth...')
!apt-get install fluidsynth > /dev/null
!cp /usr/share/sounds/sf2/FluidR3_GM.sf2 ./font.sf2
print('done!')

In [None]:
#@title install dependencies to show score in music notation
%%capture
print('installing musescore3...')
!apt-get install musescore3 > /dev/null
print('done!')

In [None]:
#@title clone git repository
%%capture
!rm -rf aica-assignments
!git clone https://github.com/aica-wavelab/aica-assignments.git
%cd aica-assignments/A4_melody_generation

Furthermore, this notebook needs the following ``Python`` packages and modules. Execute the cell to install them:

In [None]:
%pip install music21
%pip install pyfluidsynth

%pip install matplotlib
%pip install seaborn

%pip install pandas
%pip install numpy
%pip install tensorflow

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("4_1_ml_ffn_markov.ipynb")

In [None]:
import tensorflow as tf
import numpy as np
import music21 as m21
from music21.stream import Part, Score

import seaborn as sns

from pianoroll import stream_to_df, plot_df
from encoder import NoteToIntEncoder, TERM_SYMBOL

# 4 Deep Learning for Melody Generation

## 4.1 Learning the Markov Matrix

In the following, we will use our first ``tensorflow`` model, which consists of a single square matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$. This matrix contains all **learnable** parameters, meaning that the values of the elements of $\mathbf{W}$ change during *training*. The goal is that, given a *one-hot encoded* note/event $\mathbf{x} \in \mathbb{R}^{n}$, we can predict the next note/event by evaluating

$$\mathbf{x}^{\top} \mathbf{W}.$$

Since $\mathbf{x}$ is *one-hot encoded*, this operation just gives us the $j^{\text{th}}$ row of the matrix $\mathbf{W}$, where $j$ is the index of the current note/event. This row tells us which event/note is likely to come next. As we will see, the row is in fact treated as the natural logarithm of a discrete probability distribution, up to normalization. Computing $\exp(w_{ij})$ for each element of $\mathbf{W}$ and normalizing each row gives us probabilities similar to our Markov matrix $\mathbf{M}$.
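
The row-selection property can be checked on a toy example. The following is a minimal sketch using ``numpy``; the matrix size `n = 4`, the index `j = 2` and the entries of `W` are arbitrary choices for illustration and have nothing to do with the minuet.

```python
import numpy as np

n = 4
# Toy weight matrix with distinct entries so that rows are easy to recognize.
W = np.arange(n * n, dtype=float).reshape(n, n)

j = 2                 # index of the current event (arbitrary choice)
x = np.zeros(n)
x[j] = 1.0            # one-hot encoding of event j

row = x @ W           # the product x^T W
print(row)            # the entries of row j of W
print(np.array_equal(row, W[j]))
```

Because the product only selects a row, a one-hot input never mixes information from different rows of $\mathbf{W}$.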

Why is this the case? The reason can be found in the *loss function* of the model.
Suppose the model predicts a certain discrete probability distribution,
and let $p$ be the probability that the model assigns to the correct prediction (correct according to the training example provided to the model!).
We want this probability to be high, so the loss should be low when $p$ is high.
Therefore $-p$ is a reasonable loss for this single value.

Now assume we have multiple predictions, and let $p_1, \ldots, p_m$ be the probabilities of the correct labels. Then a reasonable loss would be the *negative likelihood*:

$$L = - \left(p_1 \cdot \ldots \cdot p_m \right).$$

However, we can make things easier by computing the *negative log-likelihood* instead. Because the logarithm is monotonically increasing, this does not change which parameters are optimal, and it replaces multiplication by addition:

$$-\ln\left(p_1 \cdot \ldots \cdot p_m\right) = -\left(\ln(p_1) + \ldots + \ln(p_m)\right) = -\ln(p_1) - \ldots - \ln(p_m).$$

So, to make things easier, we think of the entries in $\mathbf{W}$ as (unnormalized) *log-probabilities*!
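
The step from a product to a sum can be verified numerically. The sketch below uses made-up probabilities `p`; it also illustrates a practical reason for preferring the sum, namely that a product of many numbers below one underflows in floating-point arithmetic while the sum of logarithms stays well-behaved.

```python
import numpy as np

# Toy probabilities of the correct labels (made up for illustration).
p = np.array([0.9, 0.5, 0.8, 0.7])

neg_likelihood = -np.prod(p)
neg_log_likelihood = -np.sum(np.log(p))

# The logarithm of the product equals the sum of the logarithms.
print(np.isclose(np.log(np.prod(p)), np.sum(np.log(p))))

# With many factors the product underflows to zero, the sum of logs does not.
many = np.full(2000, 0.5)
product_of_many = np.prod(many)
sum_of_logs = -np.sum(np.log(many))
print(product_of_many, sum_of_logs)
```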

We use the same training data we used for learning a *Markov matrix* $\mathbf{M}$ and we will see that, after training, $\mathbf{M}$ and $\mathbf{W}$ are suspiciously similar! 

**Disclaimer:** This notebook intentionally consists of very low-level code. Take your time and ask the coaches about what is going on!

In [None]:
# Let us load our simple melody
bach_minuet = m21.converter.parse('data/Minuet_in_G.mid')
bach_minuet_melody = bach_minuet.parts[0]
bach_minuet_melody.show('midi')

We can display the melody in a *piano roll representation* to get a feel for it.

In [None]:
dataframe = stream_to_df(bach_minuet_melody)
plot_df(dataframe)

### 4.1.1 Data Preparation

Similar to our Markov example, we transform the ``Stream`` into a list of numbers ranging from $0$ to $n-1$.

In [None]:
# initiate our encoder
encoder = NoteToIntEncoder([bach_minuet_melody])

# encode the Stream into a sequence of numbers
enc_bach_minuet = encoder.encode_sequence(bach_minuet_melody)

# add terminal symbol to indicate the beginning and end of the piece
enc_bach_minuet = [encoder.encode(TERM_SYMBOL)] + enc_bach_minuet + [encoder.encode(TERM_SYMBOL)]

print(enc_bach_minuet)

Next we specify some (hyper-)parameters. Note that the learning rate is unusually large.

In [None]:
# Parameters
vocab_size = len(encoder) # number of distinct events known to the encoder
batch_size = 32
learning_rate = 1.0
epochs = 500

print(f'vocab_size: {vocab_size}')

Next we have to generate our training data set. Each row of the input matrix $\mathbf{X}$ represents one *one-hot encoded* note/event, while the (output/label) vector $\mathbf{y}$ holds the index of the next note/event and is not one-hot encoded.

In [None]:
# generate training data
X = []
y = []
for i in range(len(enc_bach_minuet)-1):
  current_event = enc_bach_minuet[i]
  next_event = enc_bach_minuet[i+1]
  X.append(current_event)
  y.append(next_event)

X = tf.one_hot(X, depth=vocab_size).numpy().astype('float32')
y = np.array(y)

print(X[:3])
print(y[:3])

The forward pass is calculated as follows: First, we multiply the two matrices 

$$\mathbf{C} = \mathbf{X} \cdot \mathbf{W},$$

with $\mathbf{W} \in \mathbb{R}^{n \times n}, \mathbf{X} \in \mathbb{R}^{m \times n}$ and $\mathbf{C} \in \mathbb{R}^{m \times n}.$

Then, we perform the so-called *softmax* operation **row-wise**:

$$p_{ki} = \frac{\exp(c_{ki})}{\sum_{j=1}^{n} \exp(c_{kj})} \text{ for all } k = 1, \ldots, m \text{ and } i = 1, \ldots, n.$$

In other words, we **normalize** each row after applying the exponential function component-wise. Each row of the resulting matrix $\mathbf{P}$ can thus be interpreted as a probability distribution!
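
A small ``numpy`` sketch of the row-wise softmax on a toy matrix may help. The values in `C` are arbitrary; the intermediate name `E` is introduced here only for readability.

```python
import numpy as np

# Toy matrix of scores: 2 rows (transitions), 3 columns (possible next events).
C = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])

E = np.exp(C)                                   # component-wise exponential
row_sums = np.sum(E, axis=1, keepdims=True)     # shape (2, 1), one sum per row
P = E / row_sums                                # broadcasting divides each row by its own sum

print(P)
print(P.sum(axis=1))                            # every row sums to 1
```

With ``keepdims=True`` the row sums keep the shape `(2, 1)`, so the division is applied row by row. A row of identical scores, like the second one, becomes the uniform distribution.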

For optimization using *gradient descent*, we need an appropriate *cost function*/*loss function* $L$.
To get it, we consider the probability that the model assigns to all the correct transitions together, i.e., the *likelihood*.
Let $p_1, \ldots, p_m$ be these probabilities (one per transition, i.e., one per row of $\mathbf{X}$), then

$$p_{1} \cdot \ldots \cdot p_m$$

is the *likelihood*.
However, so that we can add instead of multiply, we compute the *negative log-likelihood*:

$$-(\log(p_1) + \ldots + \log(p_m)).$$
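
As a minimal sketch, the negative log-likelihood can be computed with ``numpy`` by picking, for every row of a probability matrix, the entry that belongs to the correct label. The matrix `P` and the labels `y` below are made up for illustration.

```python
import numpy as np

# Toy predicted distributions: one row per transition.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
# Index of the correct next event for each row.
y = np.array([0, 1, 2])

# Select P[k, y[k]] for every row k.
p_correct = P[np.arange(len(y)), y]

nll = -np.sum(np.log(p_correct))       # negative log-likelihood
mean_nll = nll / len(y)                # averaged over the transitions
print(p_correct, nll, mean_nll)
```

The training loop further below averages over the transitions with ``tf.reduce_mean`` instead of summing; this only rescales the loss by the constant factor $1/m$.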

By calling ``tape.gradient(loss, [W])``, *backpropagation* (also known as the *backward pass*) is performed, and we can update our weights by

$$\mathbf{W} \leftarrow \mathbf{W} - \eta \cdot \nabla_\mathbf{W} L $$

In our case, we choose a very large *learning rate* $\eta$.
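
To make the update rule concrete, here is a sketch of a single gradient descent step in plain ``numpy`` on a hypothetical toy sequence. In the notebook the gradient is computed automatically by ``tensorflow``; the closed-form expression $\mathbf{X}^{\top}(\mathbf{P} - \mathbf{Y})/m$ used here is the standard gradient of the mean negative log-likelihood for a softmax model and is stated as an assumption, not derived in this text. The names `seq`, `mean_nll`, `Y` and `W_new` as well as the value `eta = 1.0` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
seq = np.array([0, 1, 2, 1, 0, 1, 2, 1, 0])   # toy encoded sequence (hypothetical)
X = np.eye(n)[seq[:-1]]                       # one-hot inputs, shape (m, n)
y = seq[1:]                                   # integer labels, shape (m,)
m = len(y)

def mean_nll(W):
    """Return the mean negative log-likelihood and the softmax matrix P."""
    E = np.exp(X @ W)
    P = E / E.sum(axis=1, keepdims=True)
    return -np.mean(np.log(P[np.arange(m), y])), P

W = rng.normal(size=(n, n))                   # random initial weights
eta = 1.0                                     # learning rate (toy value)

loss_before, P = mean_nll(W)
Y = np.eye(n)[y]                              # one-hot version of the labels
grad = X.T @ (P - Y) / m                      # assumed closed-form gradient w.r.t. W
W_new = W - eta * grad                        # the update W <- W - eta * grad
loss_after, _ = mean_nll(W_new)

print(loss_before, loss_after)
```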

### 4.1.2 Model Definition and Training

In [None]:
# Initialize weights
W = tf.Variable(tf.random.normal((vocab_size, vocab_size)), dtype=tf.float32)

# Training loop
# (these values override the epochs and learning_rate set in the parameter cell above)
epochs = 1000
learning_rate = 10.0

for k in range(epochs):
    with tf.GradientTape() as tape:
        # 1. Forward pass
        C = tf.matmul(X, W)
        C = tf.exp(C)
        P = C / tf.reduce_sum(C, axis=1, keepdims=True)
        
        # Compute loss
        loss = -tf.reduce_mean(tf.math.log(tf.gather(P, y, batch_dims=1)))
    
    if k % 100 == 0:
        print(f'Epoch {k}, Loss: {loss.numpy()}')
    
    # 2. Backward pass
    grads = tape.gradient(loss, [W])
    
    # 3. Update weights
    W.assign_sub(learning_rate * grads[0])

### 4.1.3 Result Analysis

Let us now compare the matrix $\mathbf{W}$ with our Markov matrix $\mathbf{M}$ from the previous section.
Let's recompute $\mathbf{M}$.

In [None]:
def markov_matrix(parts, encoder):
  m = len(encoder)

  # initiate the Markov matrix with zeros
  M = np.zeros((m,m))

  # count the transitions; a row represents the starting point and the column the end point
  for part in parts:
    predecessor = TERM_SYMBOL
    idx2 = 0
    for element in part.recurse():
      if isinstance(element, (m21.note.Rest, m21.note.Note)):
        idx1 = encoder.encode(predecessor)
        idx2 = encoder.encode(element)
        M[idx1][idx2] += 1.0
        predecessor = element
    # the piece ends here, so count the transition from the last event to TERM_SYMBOL
    M[idx2][encoder.encode(TERM_SYMBOL)] += 1.0

  # divide each element of a row by the sum of that row
  return M / M.sum(axis=1, keepdims=True)

In [None]:
M = markov_matrix([bach_minuet_melody], encoder)

Let us now transform $\mathbf{W}$ into a matrix representing probabilities and let us round the entries:

In [None]:
P = tf.exp(W)
P = P / tf.reduce_sum(P, axis=1, keepdims=True)
P = np.round(P, decimals=1)

Let us plot both matrices:

In [None]:
sns.heatmap(P, cmap="Blues")

In [None]:
sns.heatmap(M, cmap="Blues")
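
One way to see why the two heatmaps should look alike: the count-based matrix is the row-stochastic matrix with the smallest negative log-likelihood on the observed transitions, which is exactly the quantity that training minimizes. The sketch below checks this numerically on a hypothetical toy sequence (not the minuet) by comparing against random row-stochastic candidates; `seq`, `counts`, `nll` and `Q` are illustrative names.

```python
import numpy as np

seq = [0, 1, 2, 1, 0, 1, 1, 2, 0]      # toy encoded sequence (hypothetical)
n = 3

# Count transitions: row = current event, column = next event.
counts = np.zeros((n, n))
for a, b in zip(seq[:-1], seq[1:]):
    counts[a, b] += 1
M = counts / counts.sum(axis=1, keepdims=True)

def nll(Q):
    """Negative log-likelihood of the observed transitions under matrix Q."""
    mask = counts > 0                   # only observed transitions contribute
    return -np.sum(counts[mask] * np.log(Q[mask]))

rng = np.random.default_rng(0)
best = nll(M)
worse_or_equal = True
for _ in range(1000):
    Q = rng.random((n, n)) + 1e-3       # strictly positive random matrix
    Q /= Q.sum(axis=1, keepdims=True)   # normalize rows to probability distributions
    worse_or_equal = worse_or_equal and (nll(Q) >= best)

print(best, worse_or_equal)
```

Since a softmax never produces exact zeros, the learned probabilities can only approach the zero entries of $\mathbf{M}$, which is one reason the two matrices are similar rather than identical.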

### 4.1.4 Melody Generation

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.1.1**: The following function ``ffn_score(W, encoder, max_len)`` generates a new ``Score`` based on our computed matrix ``W``. It works similarly to ``markov_score(M, encoder, max_len)``.

Explain in your own words what is going on. Try to find out what each operation does. Try to play around with toy examples. For example, you could generate a ``numpy`` matrix to test what ``P = C / np.sum(C, axis=1, keepdims=True)`` computes.

</div>

In [None]:
def ffn_score(W, encoder, max_len=100):
  score = Score()
  part = Part()

  C = np.exp(W)
  P = C / np.sum(C, axis=1, keepdims=True)
  
  
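  m = len(encoder)
  # start from the terminal symbol, which marks the beginning of the piece
  j = encoder.encode(TERM_SYMBOL)
  for _ in range(max_len):
    j = np.random.choice(m, size=1, p=P[j])[0]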
    symbol = encoder.decode(j)
    if symbol == TERM_SYMBOL:
      break
    else: 
      part.append(symbol)
  score.insert(0, part)
  return score

ffn_score(W.numpy(), encoder).show('midi')

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

**Instruction 4.1.2**: As we saw, using just one matrix as the neural network (which is in fact the simplest network we can imagine), we basically learn the Markov matrix.

1. Is this method more or less computationally expensive compared to our computation of the Markov matrix?
2. What do we have to change to consider not only the last note but the last two notes to compute the next one?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(force_save=True, run_tests=True)