# **Preprocessing**

* Tune all guitars to 7-string B-standard
* Lowest note in this tuning is 35 (B2)
* Highest note in this tuning is 88 (E7, assuming 24-fret guitar)
* This gives a dictionary of 55 note tokens
    * 54 pitches in [35, 88] inclusive + a rest note

*To encode note lengths as well, every note needs one token per length, 
which multiplies the dictionary size by the number of lengths…*

Dictionary of note lengths:  
    [32nd, 16th, 8th, quarter, half, whole,  
     dotted {16th, 8th, quarter, half, whole},  
     triplet {16th, 8th, quarter}, two whole]

This brings our total dictionary length up to…  

**15 note lengths x 55 playable notes = 825 total**

### Standard
Durations are in MIDI ticks, assuming 960 ticks per quarter note.  
32nd = 120  
16th = 240  
8th = 480  
Quarter = 960  
Half = 1920  
Whole = 3840  
Two whole = 7680  

### Dotted
16th = 360  
8th = 720  
Quarter = 1440  
Half = 2880  
Whole = 5760  

### Triplet
16th = 160  
8th = 320  
Quarter = 640
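
The tick values in the three tables above follow from simple ratios of a quarter note. As a minimal sketch, assuming the 960 ticks-per-quarter resolution implied by the table (the names `TICKS_PER_QUARTER`, `standard`, `durations` and `derived_lengths` are introduced here for illustration only), they can be derived rather than typed by hand:

```python
from fractions import Fraction

TICKS_PER_QUARTER = 960  # assumption: resolution implied by "Quarter = 960"

# Length of each undotted value, measured in quarter notes.
standard = {
    '32nd': Fraction(1, 8), '16th': Fraction(1, 4), '8th': Fraction(1, 2),
    'quarter': Fraction(1), 'half': Fraction(2), 'whole': Fraction(4),
    'two whole': Fraction(8),
}

def ticks(quarters):
    """Convert a length in quarter notes to integer MIDI ticks."""
    return int(quarters * TICKS_PER_QUARTER)

durations = {name: ticks(q) for name, q in standard.items()}

# A dotted note lasts 1.5x its undotted length.
for name in ['16th', '8th', 'quarter', 'half', 'whole']:
    durations['dotted ' + name] = ticks(standard[name] * Fraction(3, 2))

# A triplet note lasts 2/3 of its undotted length.
for name in ['16th', '8th', 'quarter']:
    durations['triplet ' + name] = ticks(standard[name] * Fraction(2, 3))

derived_lengths = sorted(durations.values())
print(derived_lengths)
```

The sorted result has 15 entries and should match the `note_lengths` list defined in the code cells further down.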

#### Other notes:
* MIDI track 0 appears to only contain the tempo + time signature information
* Rests are identifiable by a nonzero "time=x" in a "note_on" MIDI event: the delta time before the note starts is silence
* Pitches will be tokenized with 0 as the lowest pitch by subtracting MIN_PITCH
* We'll assign rests the maximum "pitch" value of 54 whereas all other notes will be [0, 53] == [35-35, 88-35]
* Each note duration is assigned an index on the interval [0, 14] and is retrieved via index_dict
    * The encoded pitch/duration combination is given by: X = (Pitch - MIN_PITCH) + index * N_NOTES
    * Then we can see min(X) = 0 for (Pitch, index) = (MIN_PITCH, 0) and max(X) = 824 for (Pitch, index) = (89 [rest note], 14)
* Sustained notes should be corrected in the tablature before being tokenized
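
The encoding formula in the notes above can be checked with a short, self-contained sketch. The helper names `encode`, `decode`, `REST_PITCH` and `N_LENGTHS` are hypothetical and are not part of the notebook. The rest is treated as the virtual pitch 89, as in the max(X) example.

```python
MIN_PITCH = 35
MAX_PITCH = 88
N_NOTES = MAX_PITCH - MIN_PITCH + 2   # 54 pitches + 1 rest = 55
REST_PITCH = MAX_PITCH + 1            # hypothetical name: virtual pitch 89 marks a rest
N_LENGTHS = 15                        # hypothetical name: number of duration indices

def encode(pitch, index):
    """X = (Pitch - MIN_PITCH) + index * N_NOTES."""
    return (pitch - MIN_PITCH) + index * N_NOTES

def decode(token):
    """Invert encode(): return (pitch, duration index)."""
    index, offset = divmod(token, N_NOTES)
    return offset + MIN_PITCH, index

# Enumerate every (pitch, index) pair, including the rest.
all_tokens = [encode(p, i)
              for i in range(N_LENGTHS)
              for p in range(MIN_PITCH, REST_PITCH + 1)]

print(min(all_tokens), max(all_tokens), len(set(all_tokens)))  # 0 824 825
print(decode(112))  # (37, 2)
print(decode(714))  # (89, 12)
```

Because every pitch offset is smaller than `N_NOTES`, `divmod` recovers both parts unambiguously. Tokens 112 and 714 are taken from the sample output of `tokenize` further down: the first is pitch 37 with duration index 2, the second is a rest with duration index 12.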

#### Ideas:
* During prediction, sample at a low temperature so the model settles into repeating "riffs", and occasionally raise the temperature (or introduce randomness some other way) to break out of one riff and into another...
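
One possible reading of this idea is sketched below; it is not the notebook's method. The sketch assumes a model that emits one logit per token, and since no model is defined at this point, random logits stand in for its output. The function names, the `low`/`high` temperatures and the fixed `period` are all hypothetical choices.

```python
import numpy as np

N_TOKENS = 825  # 15 note lengths x 55 notes, as derived above

def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()              # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def temperature_for_step(step, low=0.5, high=1.5, period=32):
    """Hypothetical schedule: mostly `low`, with a `high` spike every `period` steps."""
    return high if step % period == period - 1 else low

rng = np.random.default_rng(0)
generated = []
for step in range(64):
    logits = rng.normal(size=N_TOKENS)  # stand-in for a trained model's output
    generated.append(sample_with_temperature(logits, temperature_for_step(step), rng))
```

A low temperature concentrates probability on the most likely tokens, which favors repetition. The periodic spike flattens the distribution for a single step so a different continuation can be picked.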

In [1]:
import os
import re
import time
import numpy as np
import pandas as pd

from mido import MidiFile
from multiprocessing import pool
from matplotlib import pyplot as plt

In [2]:
MIN_PITCH = 35
MAX_PITCH = 88
INCLUDE_REST = True
N_NOTES = MAX_PITCH - MIN_PITCH + 2 if INCLUDE_REST else MAX_PITCH - MIN_PITCH + 1
REST_NOTE = N_NOTES - 1 if INCLUDE_REST else None

In [3]:
note_lengths = [120, 160, 240, 320, 360, 480, 640, 720, 960, 1440, 1920, 2880, 3840, 5760, 7680]

In [4]:
index_dict = dict(zip(note_lengths, range(len(note_lengths))))

In [5]:
index_dict

{120: 0,
 160: 1,
 240: 2,
 320: 3,
 360: 4,
 480: 5,
 640: 6,
 720: 7,
 960: 8,
 1440: 9,
 1920: 10,
 2880: 11,
 3840: 12,
 5760: 13,
 7680: 14}

In [6]:
midi_dir = './midi'
save_dir = './data'

In [7]:
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

In [8]:
files = [os.path.join(midi_dir, file) for file in os.listdir(midi_dir) if file.endswith('.mid')]

In [9]:
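def get_index(key):
    """Return the duration index for `key` (in ticks), snapping to the nearest known length."""
    if key in index_dict:
        return index_dict[key]

    # Not an exact match: fall back to the closest tabulated note length.
    keys = list(index_dict.keys())
    diff = np.absolute(np.array(keys) - key)
    return index_dict[keys[np.argmin(diff)]]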
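
To illustrate the snapping behavior of `get_index`, here is a self-contained sketch that should give the same results, since the indices in `index_dict` are simply positions in the sorted `note_lengths` list. The name `nearest_index` is hypothetical.

```python
import numpy as np

note_lengths = [120, 160, 240, 320, 360, 480, 640, 720, 960,
                1440, 1920, 2880, 3840, 5760, 7680]
lengths = np.array(note_lengths)

def nearest_index(ticks):
    """Index of the tabulated length closest to `ticks` (first match wins on ties)."""
    return int(np.argmin(np.abs(lengths - ticks)))

print(nearest_index(240))     # exact match: 2
print(nearest_index(250))     # snaps to 240: 2
print(nearest_index(140))     # tie between 120 and 160, the lower one wins: 0
print(nearest_index(100000))  # longer than anything in the table, clamps to 7680: 14
```

Note that any duration, however long, is mapped to some entry, so very long notes or rests are silently shortened to two whole notes.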

In [10]:
def tokenize(track):
    """Convert one MIDI track into a list of combined pitch/duration tokens."""
    last_pitch = np.inf  # lowest pitch seen since the last emitted note; np.inf means "none"
    notes = []

    for msg in track:
        if msg.type == 'note_on':
            # A nonzero delta time before a note_on is treated as a rest.
            if msg.time > 0:
                index = get_index(msg.time)
                notes.append(REST_NOTE + N_NOTES*index)

            # When several notes start together, keep only the lowest pitch.
            pitch = msg.note - MIN_PITCH
            if pitch < last_pitch:
                last_pitch = pitch

        elif msg.type == 'note_off':
            # The first note_off with a nonzero delta time gives the note's duration.
            if msg.time > 0 and last_pitch != np.inf:
                index = get_index(msg.time)
                notes.append(last_pitch + N_NOTES*index)
                last_pitch = np.inf

    return notes

In [11]:
for file in files:
    mid = MidiFile(file)
    
    for i, track in enumerate(mid.tracks[1:]):
        
        notes = tokenize(track)
        
        basename = os.path.splitext(os.path.basename(file))[0]
        filename = basename + " - {}".format(i)
        np.save(os.path.join(save_dir, filename), notes)

In [13]:
file = MidiFile(files[0])

In [14]:
track = file.tracks[1]

In [15]:
notes = tokenize(track)

In [16]:
notes

[112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 112,
 606,
 295,
 329,
 461,
 494,
 454,
 494,
 449,
 454,
 461,
 570,
 461,
 454,
 449,
 606,
 295,
 329,
 461,
 494,
 454,
 461,
 278,
 277,
 275,
 277,
 278,
 277,
 284,
 278,
 714,
 558,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 295,
 296,
 494,
 494,
 454,
 494,
 449,
 454,
 461,
 460,
 558,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 283,
 329,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 283,
 329,
 494,
 494,
 454,
 461,
 278,
 277,
 275,
 277,
 278,
 277,
 284,
 278,
 558,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 118,
 295,
 296,
 494,
 494,
 454,
 494,
 449,
 454,
 461,
 460,
 111,
 111,
 111,
 111,
 111,
 111,
 111,
 111,
 276,
 329,
 111,
 111,
 111,
 111,
 111,
 111,
 111,
 111,
 276,
 329,
 111,
 111,
 276,
 494,
 111,
 111