# Appendix: Worked example comparing Humdrum and `music21`

## Motivating question

Some weeks ago, I received a call from the head librarian in the music library. "We're finally going digital," they told me, "and we're giving away most of our printed sheet music holdings. I know you and your friends are quite musical, so I wanted to give you first dibs on our string quartets." By the time I finally made it, the shelves were all but empty: first dibs, indeed. Probably should have paid those fines. The sounds of Mozart, Beethoven, and Bartók filled the corridors. Only the collected string quartets of Haydn remained. I grabbed as much music as I could and set out to form an impromptu quartet. Many of my friends are fine violinists, certainly better than I am. But most of the department's resident violists were busy exercising themselves on the , and I thought, if we started with Haydn quartets with more achievable viola parts, I could masquerade as a violist for the best part of an afternoon. We'll warm up on the easier movements, I thought, so I decided to rank each of the quartets' third movements by its difficulty for the violist, based on the proportion of notes that lie on the A string or above (>= A4). I was also faintly curious as to whether the difficulty of the viola parts thus measured (however imperfectly) was correlated with the approximate order of publication of the movements. Naturally, I took out my laptop...

### Preliminaries

## Humdrum

Conveniently, I have a folder full of `.krn` files, each corresponding to a movement of a string quartet by Haydn. 

In [None]:
%%bash
ls -C haydn/

Dividing the problem into two stages, we focus first on using Humdrum to draw conclusions about each movement. Once we have used Humdrum to measure the notional difficulty of an individual file, we will use features of the `bash` shell to aggregate this data for the whole corpus.

There are four spines in each file, each corresponding to one of the parts in the string quartet: Violin 1, Violin 2, Viola, and Cello. 

In [None]:
%%bash
cat haydn/op09n2-03.krn

The Humdrum tool `extract` can be used to extract spines based on specified criteria. An optional comment record can be used to list the instruments in a Humdrum score, but unfortunately it is absent from the files we are currently interested in.

Fortunately, as with most effective toolkits, there are multiple ways to achieve similar goals that we can exploit to work around the issue of less-than-perfect data. The following command extracts only the spines with a C3 clef from the file `op09n2-03.krn`:

In [None]:
%%bash
extract -i '*clefC3' haydn/op09n2-03.krn

The representation used by the `**kern` interpretation specifies pitch using upper- or lower-case alphabetic characters (A-G), optionally followed by an accidental. Though this representation is useful for input and proofreading, it is not suitable for comparing pitch height. Humdrum provides several tools for converting pitch from one representation to another. For example, the `semits` command converts pitch data into semitones above middle C (C4). Because Humdrum tools deal with input and output predictably in compliance with the design of UNIX, researchers can make use of the features of the `bash` shell to build data processing pipelines consisting of successive applications of command-line tools to the score files. This is achieved using the `|` (pipe) operator. In the following command, two Humdrum tools are connected using the pipe operator:

In [None]:
%%bash
semits -xt haydn/op09n2-03.krn | extract -i '*clefC3'

First, the `semits` command changes the representation of notes from `**kern` (alphabetic) to `**semits` (signed integers representing semitones above or below middle C). The output of this command, which still conforms to the Humdrum syntax, is passed "through" the pipe operator to the next command `extract`, which extracts only the viola part, as before. The option flags used with `semits` (`-xt`) ensure that duration information is discarded in the initial conversion.
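
To make the change of representation concrete, here is a minimal Python sketch of the same idea. It is not how `semits` is implemented; the names `kern_to_semits`, `STEP_OFFSETS`, and `NOTE_PATTERN` are invented for this illustration, and the parser handles only single notes with simple accidentals (no chords or other `**kern` signifiers).

```python
import re

# Semitone offsets of the seven natural pitch names above C.
STEP_OFFSETS = {'c': 0, 'd': 2, 'e': 4, 'f': 5, 'g': 7, 'a': 9, 'b': 11}

# A run of pitch letters (all lower-case or all upper-case), optionally
# followed by sharps (#), flats (-), or a natural sign (n).
NOTE_PATTERN = re.compile(r'([a-g]+|[A-G]+)([#n-]*)')


def kern_to_semits(token):
    """Return semitones above (+) or below (-) middle C, or None for a rest."""
    if 'r' in token:  # 'r' marks a rest in **kern
        return None
    match = NOTE_PATTERN.search(token)
    if match is None:
        return None
    letters, accidentals = match.groups()
    step = STEP_OFFSETS[letters[0].lower()]
    if letters[0].islower():
        # c = C4, cc = C5, ccc = C6, ...
        octave_shift = 12 * (len(letters) - 1)
    else:
        # C = C3, CC = C2, CCC = C1, ...
        octave_shift = -12 * len(letters)
    return octave_shift + step + accidentals.count('#') - accidentals.count('-')


print(kern_to_semits('4a'))   # 9: A4 lies nine semitones above middle C
print(kern_to_semits('4B-'))  # -2: B-flat 3 lies two semitones below middle C
```

Duration digits and beam markers are simply skipped by the pattern, which is roughly the effect of discarding non-pitch information.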

Now that the pitch information is represented numerically, we can use arithmetic comparison operators to categorize pitches into two discrete bins: 'HI', if they sit on the viola's A string or above; 'LO', if they sit below. Since A4 is nine semitones above middle C, the breakpoint in `**semits` representation is `9`. The Humdrum tool `recode` allows users to conditionally substitute the content of spines containing numerical data into spines containing arbitrary alphanumeric data, based on the results of comparison functions specified in a plain text file. 

The contents of this file (`recode_scheme.txt`) should therefore be as follows:

```
>=9     HI
<9      LO
else    OTHER
```
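
For readers more at home in Python, the same three rules can be sketched as a small function. This is only an illustration of the logic; `recode_semits` and its `threshold` parameter are names invented here and are not part of Humdrum.

```python
def recode_semits(value, threshold=9):
    """Map a semitone value to 'HI' or 'LO', following the scheme in recode_scheme.txt."""
    if value >= threshold:
        return 'HI'
    elif value < threshold:
        return 'LO'
    # Only reached by values that fail both comparisons, such as NaN.
    return 'OTHER'


print([recode_semits(v) for v in (-5, 8, 9, 16)])  # ['LO', 'LO', 'HI', 'HI']
```

The `OTHER` rule acts as a safety net: for ordinary integers one of the first two rules always applies.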

Let's create that file (using a [heredoc](https://tldp.org/LDP/abs/html/here-docs.html)).

In [None]:
%%bash
tee recode_scheme.txt <<EOF
>=9	HI
<9	LO
else	OTHER
EOF

The pipe operator can accept the output of one process and pass it to another, so we can drop the `recode` command into the middle of the above pipeline. `recode`'s `-i` option flag and its argument tell `recode` to perform the conditional substitution on spines containing the `**semits` interpretation only.

In [None]:
%%bash
semits -xt haydn/op09n2-03.krn | recode -f recode_scheme.txt -i '**semits' | extract -i '*clefC3'

The output of this command contains much information extraneous to our present research question, including rests, barline tokens, and metadata in the header and footer. The `rid -GLId` command removes global comments (`-G`, which include the file-level metadata), local comments (`-L`), interpretation records (`-I`), and null data records (`-d`).

In [None]:
%%bash
semits -xt haydn/op09n2-03.krn |
recode -f recode_scheme.txt -i '**semits' | 
extract -i '*clefC3' | 
rid -GLId

In [None]:
%%bash
semits -xt haydn/op09n2-03.krn |
recode -f recode_scheme.txt -i '**semits' | 
extract -i '*clefC3' | 
rid -GLId |
humsed '/=\|r/d'

The `humsed '/=\|r/d'` command removes barlines and rests by deleting every data record that contains `=` or `r`.

At this point, the data no longer conforms to the Humdrum syntax. It has been transformed repeatedly so that it contains a sequence of tokens.

Summarizing this kind of data is no longer the province of the Humdrum toolkit: we use tools common to most UNIX environments to aggregate the counts of high and low pitches. Appending `sort | uniq -c | sort -nr` to the pipeline results in a tally:

In [None]:
%%bash
semits -xt haydn/op09n2-03.krn | 
recode -f recode_scheme.txt -i '**semits' | 
extract -i '*clefC3' |
rid -GLId |
humsed '/=\|r/d' |
sort |
uniq -c | 
sort -nr
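
The `sort | uniq -c | sort -nr` idiom has a close counterpart in Python's standard library. The sketch below is an analogy only; the function name `tally` and the input tokens are placeholders invented for illustration, not output from the Haydn files.

```python
from collections import Counter


def tally(tokens):
    """Count each distinct token and list the counts from most to least frequent."""
    return Counter(tokens).most_common()


# Placeholder tokens, for illustration only.
print(tally(['LO', 'HI', 'LO', 'LO', 'HI']))  # [('LO', 3), ('HI', 2)]
```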

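Some final manipulations get the data into a comma-separated format: `awk` strips the leading whitespace from each line, `cut` keeps only the counts, `column` lays them out on a single line, and `tr` replaces the tabs with commas.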

In [None]:
%%bash
semits -xt haydn/op09n2-03.krn | 
recode -f recode_scheme.txt -i '**semits' | 
extract -i '*clefC3' |
rid -GLId |
humsed '/=\|r/d' |
sort |
uniq -c | 
awk '{$1=$1};1' |
cut -d ' ' -f1 |
column | 
tr '\t' ,

Having derived the desired statistics for a `.krn` file representing a single movement, we can make use of `bash` to repeat the calculation for every filename in a subcorpus. First we abstract the analysis pipeline into its own script file (`viola_difficulty.sh`). This script can be executed from the shell, and is given a single argument: the filename to be processed.

In [None]:
%%bash
tee viola_difficulty.sh <<'EOF'
#!/usr/bin/env bash
# Usage: ./viola_difficulty.sh FILE.krn
# The file to process is taken from the first argument ($1).
RESULT=`semits -xt "$1" | 
recode -f recode_scheme.txt -i '**semits' | 
extract -i '*clefC3' |
rid -GLId |
humsed '/=\|r/d' |
sort |
uniq -c | 
awk '{$1=$1};1' |
cut -d ' ' -f1 |
column | 
tr '\t' ,`
BASENAME=`basename "$1"`
echo "$BASENAME,$RESULT"
EOF

We make that `bash` script executable...

In [None]:
%%bash
chmod +x viola_difficulty.sh

Since Humdrum is a plain-text file format, we can use tools built into UNIX (or UNIX-like) environments to narrow down our list of files to those corresponding to only the movements we want. The tool `grep` uses pattern-matching techniques to search files for lines corresponding to a template expression. Kern files provide a metadata field (`OMV`) which contains the original movement number of the score represented by that file. The following command emits the filenames of the files in the `haydn/` directory that contain the metadata record `!!!OMV: 3`.

In [None]:
%%bash
grep -l '!!!OMV: 3' haydn/*.krn
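
For comparison, the effect of `grep -l` can be approximated in a few lines of Python. This is a rough sketch, not a replacement for `grep`; the function name `files_with_record` and its parameters are invented for this illustration.

```python
from pathlib import Path


def files_with_record(directory, record='!!!OMV: 3', pattern='*.krn'):
    """Return the paths of files in `directory` that have a line containing `record`."""
    matches = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding='utf-8', errors='replace')
        # Like grep, this is a substring test on each line.
        if any(record in line for line in text.splitlines()):
            matches.append(str(path))
    return matches
```

Because the test is a plain substring match, both this sketch and the `grep` command would also accept a hypothetical record such as `!!!OMV: 30`.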

Finally, we iterate over the list of files returned by `grep` using the `for` statement in another `bash` command.

In [None]:
%%bash
for file in `grep -l '!!!OMV: 3' haydn/*.krn`;
do
  ./viola_difficulty.sh $file;
done

## music21

In [None]:
import music21
import glob

In [None]:
my_score = music21.converter.parse('haydn/op09n2-03.krn')

`my_score` has a number of attributes which contain references to its elements. For example, the four distinct parts in the string quartet movement that `my_score` represents are referenced at `my_score.parts`.

In [None]:
for part in my_score.parts:
    print(part)

This object also contains file-level information about the score in the `my_score.metadata` attribute, yet another object with its own attributes (properties and methods). This information is extracted from the `.krn` file when the file is parsed, and may be programmatically accessed, as shown here, to build a subcorpus based on properties of the score metadata.

In [None]:
my_score.metadata.all()

We have already encountered one limitation of the encoded files we are working with: the instrumentation was not encoded in the original `.krn` file. Therefore, we have to infer which is the Viola part based on the clef(s) contained in each part. To do this we iterate through each one of the parts, using a `for` control-flow statement.

This snippet shows how a function is defined in Python using the `def` keyword. Relative indentation is meaningful for Python: the indentation level determines how lines are organized into blocks, which are delimited at their beginning by a colon (followed by an indent) and at their end by a dedent. Therefore, there are three code blocks in this snippet. 

(If you are unfamiliar with this syntax, try to identify them before continuing. First note that blocks may be nested. As a hint, the first block contains five lines, the second block contains three, and the third contains just one.)

In [None]:
def get_viola_part(score):
    for part in score.parts:
        part_clef = part.recurse().clef
        if type(part_clef) is music21.clef.AltoClef:
            return part
    return None

The first block is the body of the function `get_viola_part`, which is executed line by line when the function is called. The function `get_viola_part` takes one argument, referenced in the function body as `score`.

The second block starts with Python's `for` iteration syntax, and consists of actions repeated over the contents of the `score.parts` attribute. For each `part` in `score.parts`, the clef for that part is extracted. `part` refers to a `music21.stream.Part` object, which includes references to subordinate objects, such as clefs, barlines, notes, and rests. These subordinate objects are stored in a tree-like, hierarchical structure which must be traversed to extract the desired clef elements. This is the purpose of the `recurse()` method call, which returns an object that does have a `clef` property; its value is assigned to the `part_clef` variable.

Lastly, an `if` statement checks whether the clef is of the type specified by the `music21.clef.AltoClef` class, using the built-in `type()` function. If this condition is true, the code in the block (the third and final block) directly below the `if` statement is executed. This ability to inspect the type of an object at run time is a basic form of introspection, a powerful feature of Python.

If the `part_clef` extracted from the `part` is indeed an alto clef, iteration ends and a reference to `part` is returned. If no matching part is found, the `if` condition is never satisfied, so the `return part` statement is never reached. Iteration through `score.parts` ends, and the function continues to the next statement in the same block (same indentation level) as the `for` statement which began the iteration in the first instance. Accordingly, the function execution terminates and returns `None`. In our corpus this doesn't happen (all Haydn string quartet scores have an alto clef part somewhere), but it is useful to be able to handle the case where this assumption is not true.
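
The two ideas in this function, an exact type check and a fall-through `return None`, can be tried out without `music21`. In the sketch below every class and function name is a stand-in invented for illustration; these are not the `music21` classes.

```python
# Stand-in classes for illustration only; NOT the music21 clef classes.
class Clef:
    pass


class CClef(Clef):
    pass


class AltoClef(CClef):
    pass


class TenorClef(CClef):
    pass


def first_of_type(objects, wanted_type):
    """Return the first object whose type is exactly `wanted_type`, else None."""
    for obj in objects:
        if type(obj) is wanted_type:
            return obj
    return None  # reached only if the loop finishes without a match


clef = AltoClef()
print(type(clef) is AltoClef)   # True: exact type match
print(type(clef) is CClef)      # False: type() ignores parent classes
print(isinstance(clef, CClef))  # True: isinstance() also accepts subclasses
print(first_of_type([TenorClef()], AltoClef))  # None
```

The comparison with `isinstance()` shows the design choice implicit in `type(...) is ...`: only an exact match counts, which is what we want when looking for alto clefs specifically.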

In [None]:
VIOLA_A_STRING = music21.pitch.Pitch('A4')

def classify_pitches(score):
    observed = []
    viola_part = get_viola_part(score)
    for pitch in viola_part.pitches:
        if pitch >= VIOLA_A_STRING:
            observed.append('HI')        
        elif pitch < VIOLA_A_STRING:
            observed.append('LO')        
        else:
            observed.append('OTHER')
    return observed

This cell does two things. First, it instantiates a new `music21.pitch.Pitch` object with the pitch `A4` and assigns it to a new variable, `VIOLA_A_STRING`. Variable names in Python are case-sensitive, and a popular convention is to use upper-case names for variables which are not expected to be modified; `VIOLA_A_STRING` thus serves as a convenience for later use. Second, it defines the function `classify_pitches`, which begins by calling `get_viola_part` with its `score` argument and assigning the result to a new `viola_part` variable.

An empty `list` called `observed` is initialized, to which one of the following three strings is appended for each pitch: 'HI', if the pitch is equal to or higher than the A string; 'LO', if lower; and 'OTHER', otherwise. (Compare with the use of the `recode` tool from the Humdrum toolkit above.) This behavior is achieved by iterating through the items in the `viola_part.pitches` property, and testing for a series of conditions in the `if...elif...else` construct. The conditions are expressed using the comparison operators `>=` and `<` (greater than or equal to and less than, respectively).

> Compare with the use of the `semits` and the `recode` tools in the Humdrum example. In `music21`, the comparison of the two pitches is done in a single line. With Humdrum, the score must be transformed from one representation to another (from `**kern` to `**semits`) before it can be used with the `recode` tool.

This example shows how the conventional meanings of Python operators (for example, the meaning of `<` in the expression `3 < 4`) may be "overloaded" to capture the relationship between user-defined or third-party objects, such as `music21.pitch.Pitch`. Behind the scenes, `music21` compares the pitch space representations of the two pitches. This encourages readable code. `music21` could instead have adopted a hypothetical, more verbose style for this feature: `first_pitch.isGreaterThan(second_pitch)`. The two implementations would be functionally equivalent; choosing between them is a matter of programming style and interface design.
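
Operator overloading of this kind can be sketched with a toy class. This is not `music21`'s implementation; `ToyPitch` and its `semitones` attribute are invented for this illustration, and the ordering is simply delegated to an integer.

```python
from functools import total_ordering


@total_ordering  # derives <=, > and >= from __eq__ and __lt__
class ToyPitch:
    """A toy pitch identified by its semitone distance from middle C."""

    def __init__(self, name, semitones):
        self.name = name
        self.semitones = semitones

    def __eq__(self, other):
        if not isinstance(other, ToyPitch):
            return NotImplemented
        return self.semitones == other.semitones

    def __lt__(self, other):
        if not isinstance(other, ToyPitch):
            return NotImplemented
        return self.semitones < other.semitones


a4 = ToyPitch('A4', 9)
g4 = ToyPitch('G4', 7)
print(g4 < a4)   # True
print(a4 >= a4)  # True
```

Defining the special methods `__eq__` and `__lt__` is what lets the ordinary comparison operators work on these objects.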

Given a directory of `.krn` files (here, the folder `haydn/`), we can use Python's built-in `glob` module to prepare a list of the filenames of the individual files.

In [None]:
kern_files = glob.glob('haydn/*.krn')

We will then parse each of these files to create a new `music21` object for each file. This takes a little while. We've also limited this operation to the first 50 files in the directory, as this free `mybinder.org` instance is limited in its RAM capacity.

An earlier version of `music21` made the original filename of the converted file available in the `metadata` attribute, but now we have to store it explicitly in a `dict` object so that we can address it later on.

In [None]:
parsed_scores = [{ 'filename': filename, 
                   'parsed_score': music21.converter.parse(filename) } for filename in kern_files[:50]]

Then, we use a list comprehension to read the `movementNumber` from each parsed score's metadata, and use this to narrow down the list of parsed scores to those corresponding to third movements only.

In [None]:
third_mvt_scores = [s for s in parsed_scores if s['parsed_score'].metadata.movementNumber == '3']

Finally, that list is iterated through, and the `classify_pitches` function is called on each score. The numbers of pitches at or above and below the desired threshold are counted and printed along with the original filename for the movement score. Data in this format (comma-separated values) can be easily transferred to another tool for statistical analysis.

In [None]:
for score in third_mvt_scores:
    classified_pitches = classify_pitches(score['parsed_score'])
    filename = score['filename']
    num_hi = classified_pitches.count('HI')
    num_lo = classified_pitches.count('LO')    
    print(f"{filename},{num_hi},{num_lo}")
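
To return to the motivating question, the comma-separated lines can be turned into a ranking by the proportion of high notes. The sketch below is one possible follow-up step, not part of the original analysis; the function name `rank_by_difficulty` is invented, and the input rows are placeholders rather than real counts from the corpus.

```python
def rank_by_difficulty(csv_lines):
    """Sort 'filename,num_hi,num_lo' lines by the share of HI pitches, easiest first."""
    ranked = []
    for line in csv_lines:
        filename, num_hi, num_lo = line.strip().split(',')
        hi, lo = int(num_hi), int(num_lo)
        total = hi + lo
        # Guard against a part with no pitches at all.
        proportion = hi / total if total else 0.0
        ranked.append((filename, proportion))
    return sorted(ranked, key=lambda item: item[1])


# Placeholder rows for illustration only; these are NOT results from the corpus.
rows = ['movement_b.krn,30,70', 'movement_a.krn,10,90']
for filename, proportion in rank_by_difficulty(rows):
    print(filename, proportion)
```

Sorting in ascending order puts the movements with the smallest share of notes on or above the A string first, which suits the plan of warming up on the easier ones.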