# “Learning about Machine Learning with CRIM”


**Abstract**

In this tutorial-essay I consider how machine learning, specifically dimensionality reduction and embedding methods, can be applied to the CRIM corpus. The guiding question is how style can be modeled quantitatively. Building on both music-theoretical concepts and machine learning techniques, I demonstrate that unsupervised clustering can serve, to some degree, as a proxy for stylistic similarity. The CRIM data set provides an ideal case study that also points to shortcomings of the computational methodology, which can only be resolved by a critical view that draws on musicological expertise and close reading of sources.


## Introduction: setting the scope

## Setup and obtaining the data

We begin by installing the CRIM intervals library. 

```{bash}
pip install --upgrade --force-reinstall git+https://github.com/HCDigitalScholarship/intervals.git@main 
```


Next we import all libraries and modules that we will need for our subsequent analyses.

In [1]:
import intervals as ci # crim intervals
# import music21 as m21
import pandas as pd # to work with tabular data

import re # regular expressions
import requests # to download files

import os, glob # file I/O
from tqdm import tqdm # status bar for loops

import matplotlib.pyplot as plt # plots, plots, plots

In [4]:
# us = m21.environment.UserSettings()
# us.getSettingsPath()
# us["musescoreDirectPNGPath"] = "/home/fmoss/.local/bin/mscore"

# import notebook
# notebook.nbextensions.check_nbextension('usability/sphinx-markdown', user=True)

# E = notebook.nbextensions.EnableNBExtensionApp()
# E.toggle_nbextension('usability/sphinx-markdown/main')

# c = m21.chord.Chord(["C4", "E4", "G4"])

# hexa = ci.analysis.neoRiemannian.completeHexatonic(c, simplifyEnharmonics=True)
# hexa

# for chord in hexa:
#     chord.duration=m21.duration.Duration(1.)

# s = m21.stream.Stream(hexa)

# s.show("text")

# s.show() # --> this doesn't work yet

We now access the CRIM corpus and download it to our working directory, so that we have to download it only once.
First we get a list of the URLs pointing to each piece in the corpus following the instructions [here](https://github.com/RichardFreedman/CRIM_JHUB/blob/main/Make-me-a-Corpus.ipynb) and [here](https://github.com/RichardFreedman/CRIM_JHUB/blob/main/CRIM_04b_Cadences_Corpus.ipynb).

In [2]:
raw_prefix = "https://raw.githubusercontent.com/CRIM-Project/CRIM-online/master/crim/static/mei/MEI_4.0/"
URL = "https://api.github.com/repos/CRIM-Project/CRIM-online/git/trees/990f5eb3ff1e9623711514d6609da4076257816c"
piece_json = requests.get(URL).json()
piece_list = [raw_prefix + p["path"] for p in piece_json["tree"]]

The variable `piece_list` now contains all URLs and names of files in the CRIM corpus. We can inspect the first 5 items: 

In [3]:
piece_list[:5]

['https://raw.githubusercontent.com/CRIM-Project/CRIM-online/master/crim/static/mei/MEI_4.0/CRIM_Mass_0001.mei',
 'https://raw.githubusercontent.com/CRIM-Project/CRIM-online/master/crim/static/mei/MEI_4.0/CRIM_Mass_0001_1.mei',
 'https://raw.githubusercontent.com/CRIM-Project/CRIM-online/master/crim/static/mei/MEI_4.0/CRIM_Mass_0001_2.mei',
 'https://raw.githubusercontent.com/CRIM-Project/CRIM-online/master/crim/static/mei/MEI_4.0/CRIM_Mass_0001_3.mei',
 'https://raw.githubusercontent.com/CRIM-Project/CRIM-online/master/crim/static/mei/MEI_4.0/CRIM_Mass_0001_4.mei']

We can count the entries with `len(piece_list)`:

In [4]:
len(piece_list)

307

There are 307 files in total. Downloading them takes some time, so we save every file in `piece_list` to a local directory once and read it from there afterwards.

First, we create a new directory `data/` but only if it does not already exist.

In [5]:
d = "data/"
if not os.path.exists(d):
    os.makedirs(d)

Next, we iterate over `piece_list`, request the file from the server and save it to that directory.

In [12]:
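for piece in tqdm(piece_list):
    filename = piece.split("/")[-1] # only the part after the last '/' is the filename
    path = os.path.join(d, filename)
    # check *before* opening: open(path, 'wb') would create an empty file,
    # after which the existence test is always true and nothing is downloaded
    if not os.path.exists(path):
        r = requests.get(piece)
        r.raise_for_status() # stop on HTTP errors instead of saving an error page
        with open(path, 'wb') as f:
            f.write(r.content)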

100%|███████████████████████████████████████| 307/307 [00:00<00:00, 2543.47it/s]


We create a new list `local_files` containing all local file paths and names.

In [13]:
local_files = glob.glob("data/*.mei")

So now we have a list of file names pointing to MEI files in our local `data/` directory. At a closer look you'll see that some of them end in something like `0001_1.mei` but a few others end in `0001.mei`. There is a pattern to this. The files without the trailing digit are 'wrappers' that bind all movements (indexed 1 through 9) of a particular mass (indexed 0001 through 9999) together. Since these wrappers do not contain any notes or cadences (those are stored in the MEI files of the respective movements), we'll filter them out. 

Fortunately, this is very easy since the filenames are chosen systematically. We only need to keep those files in the `local_files` list whose name ends in `_d.mei`, where `d` stands for any digit from 1 to 9; all other files are dropped.

In [14]:
local_files = [f for f in local_files if re.match(r".+_\d\.mei$", f)]

What happened here? We defined a pattern that the file names of individual movements follow, and kept only the file names that match it. This pattern is expressed as a **regular expression**: `r".+_\d\.mei$"`

Let's take it apart to understand how it works.
As you probably know, strings in Python are surrounded by either single or double quotation marks (`'` or `"`). The `r` prefixed to the expression marks it as a *raw string*: backslashes are passed on unchanged to the regular-expression engine instead of being interpreted as Python escape sequences.

Next, we see a period `.`. This symbol stands for "any character" in a regular expression. The following `+` means "one or more", so that the combination `.+` stands for a sequence of any characters of length at least 1. With this, we capture the part of the filename preceding the underscore `_`.

Since the pattern differs towards the end of the file names, we can also view it from the end: The `$` sign marks the end of the string, so that everything to its left has to come just before. Since we are dealing with MEI files, each file name ends with `.mei`, which is what we see before the `$`. Because an unescaped period would match any character, it is written as `\.` here so that it matches a literal period only.

Now the crucial part. The 'wrapper' files do not have an underscore followed by a single-digit integer. We can use this information and represent that integer with `\d`, which matches exactly one digit.

Consequently, filenames **not** following this pattern (not being matched by the regular expression) will not be taken into account. In English, we could read the list comprehension for `local_files` as: "make a list of filenames where each filename conforms to the pattern defined by the regular expression".
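
To see the filter at work on a handful of names, here is a minimal sketch. The file names are made up for illustration, and the pattern is the one discussed above with the period before `mei` escaped:

```python
import re

# the pattern discussed above, with the period before "mei" escaped as \.
pattern = re.compile(r".+_\d\.mei$")

# hypothetical file names, for illustration only
names = [
    "data/CRIM_Mass_0001.mei",    # wrapper: no movement index
    "data/CRIM_Mass_0001_1.mei",  # movement 1
    "data/CRIM_Mass_0012_5.mei",  # movement 5
]

# keep only the names that match the movement pattern
movements = [n for n in names if pattern.match(n)]
print(movements)
```

Only the two names with a trailing `_1` and `_5` survive; the wrapper is dropped.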

In [15]:
len(local_files)

220

Apparently, there are "only" 220 individual mass movements. 

## Transforming the data

Now that we have all files nicely stored in our local directory, it is finally time to access them. The CRIM intervals library (imported as `ci`, see above) provides a convenient way to do so: we create a `corpus` by passing a list of files to the `CorpusBase` object.

In [15]:
corpus = ci.CorpusBase(local_files)

ParseError: no element found: line 1, column 0 (<string>)
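
The message `no element found: line 1, column 0` is what Python's XML parser reports for an empty document. One plausible cause, which is an assumption and not verified against the library, is that some files in `data/` have a size of zero bytes. A minimal sketch for listing such files follows; the helper name `empty_files` is hypothetical:

```python
import glob
import os

def empty_files(directory, pattern="*.mei"):
    """Return the sorted paths of all zero-byte files matching `pattern`."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(p for p in paths if os.path.getsize(p) == 0)

# only inspect the local directory if it exists
if os.path.isdir("data"):
    print(empty_files("data"))
```

Any file listed this way can be deleted and downloaded again before the corpus is built.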

In [None]:
corpus.batch(func=ci.ImportedPiece.ngrams)[3]

## The vector-space model

### $n$-grams

### Term frequencies

### Document frequencies

### Term frequency - inverse document frequency (TF-IDF)

In [None]:
def count_all(df, normalize=False):
    s = pd.concat([df[col] for col in df.columns])
    return s.value_counts(normalize=normalize)

In [None]:
count_all(corpus.scores[2].getNoteRest(), normalize=False)
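
To see what `count_all` does without loading a score, here is a small sketch on a toy table. The column names and values are made up and only mimic a layout with one column per voice:

```python
import pandas as pd

def count_all(df, normalize=False):
    # stack all columns into one long Series and count each distinct value
    s = pd.concat([df[col] for col in df.columns])
    return s.value_counts(normalize=normalize)

# hypothetical two-voice excerpt, for illustration only
toy = pd.DataFrame({
    "Superius": ["G4", "A4", "G4", "Rest"],
    "Tenor": ["G3", "G3", "Rest", "G4"],
})

toy_counts = count_all(toy)
print(toy_counts)
```

With `normalize=True` the same call returns relative frequencies that sum to 1 instead of absolute counts.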

In [None]:
counts = pd.DataFrame([count_all(score.getNoteRest()) for score in corpus.scores]).reset_index(drop=True)
counts = counts.fillna(1)  # fill NaN values
del counts["Rest"]
counts = counts.iloc[:,:20]
counts

In [None]:
import numpy as np

dia = ["B-"] + list("FCGDAEB") + ["Rest"]

cmap = plt.get_cmap('tab20')
colors = cmap(np.linspace(0, 1, len(dia)))

c = [colors[dia.index(l)] for l in counts.idxmax(axis=1).apply(lambda x: x[:-1] if x[-1] != "t" else x).values]

In [None]:
counts = counts.div(counts.sum(axis=1), axis=0)
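
The division above turns absolute counts into relative frequencies, one row per piece. A minimal sketch on made-up numbers (the table `toy_counts` is hypothetical):

```python
import pandas as pd

# hypothetical absolute counts for two pieces, for illustration only
toy_counts = pd.DataFrame({"G4": [6, 1], "A4": [3, 1], "D4": [1, 2]})

# divide every row by its own sum, so that each piece's values add up to 1
toy_rel = toy_counts.div(toy_counts.sum(axis=1), axis=0)
print(toy_rel)
```

After this step, pieces of different length become comparable because only the proportions of the pitches remain.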

In [None]:
X = counts.values

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA()

In [None]:
X_ = pca.fit_transform(X)
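
Conceptually, PCA centers the data and projects it onto the directions of greatest variance. The following is a minimal NumPy sketch of that computation on made-up data. It is not scikit-learn's implementation, and the signs of individual components may differ from the result of `pca.fit_transform`:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 6 "pieces" described by 4 relative pitch frequencies
toy_X = rng.random((6, 4))
toy_X = toy_X / toy_X.sum(axis=1, keepdims=True)  # rows sum to 1, as for `counts`

centered = toy_X - toy_X.mean(axis=0)  # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

projected = centered @ Vt.T        # coordinates along the principal axes
explained = S**2 / np.sum(S**2)    # share of the total variance per component

print(projected[:, :2])  # the two columns one would plot
print(explained)
```

The first two columns of `projected` play the role of `X_[:,0]` and `X_[:,1]` in the scatter plot, and `explained` indicates how much of the variance those two axes retain.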

In [None]:
plt.scatter(X_[:,0], X_[:,1], alpha=.75, zorder=3, c=c)
plt.axhline(0, lw=.5, c="k")
plt.axvline(0, lw=.5, c="k")
plt.show()

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(n_components=2, metric="cosine", perplexity=25)

In [None]:
X__ = tsne.fit_transform(X)

In [None]:
plt.scatter(X__[:,0], X__[:,1], alpha=.75, zorder=3, c=c)
plt.show()

In [None]:
corpus.scores[3].analyses["note_list"][4].note

## Dimensionality reduction

### The idea

### Principal Components Analysis (PCA): a simple and popular method

### Uniform Manifold Approximation & Projection (UMAP): a complex and popular method

## On style

Meyer quote