# **Preprocessing**

* Tune all guitars to 7-string B-standard
* Lowest note in this tuning is MIDI note 35 (B2, counting octaves so that MIDI note 0 is C0)
* Highest note in this tuning is MIDI note 88 (E7 in the same convention, assuming a 24-fret guitar)
* This gives a vocabulary of 55 note tokens
    * The 54 pitches in [35, 88] inclusive + a rest note

*To model variable note lengths, each note must also carry a duration, which multiplies the vocabulary size.*

Dictionary of note lengths:  
    [32nd, 16th, 8th, quarter, half, whole,  
     dotted {16th, 8th, quarter, half, whole},  
     triplet {16th, 8th, quarter}, two whole]

This brings the total vocabulary size to:

**15 note lengths × 55 note tokens = 825 total**

Durations below are in MIDI ticks, with 480 ticks per quarter note.

### Standard
32nd = 60  
16th = 120   
8th = 240   
Quarter = 480  
Half = 960    
Whole = 1920  
Two whole = 3840  

### Dotted
16th = 180  
8th = 360  
Quarter = 720  
Half = 1440  
Whole = 2880  

### Triplet
16th = 80  
8th = 160  
Quarter = 320
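
The tick values above all follow from 480 ticks per quarter note: a dotted note is 1.5 times its base length and a triplet note is 2/3 of it. A minimal sketch that rebuilds the table under that assumption (the names `TICKS_PER_QUARTER`, `standard`, `dotted`, and `triplet` are illustrative and do not appear in the notebook):

```python
# Assumption: 480 ticks per quarter note, as implied by "Quarter = 480".
TICKS_PER_QUARTER = 480
whole = 4 * TICKS_PER_QUARTER

# Plain note values, as fractions (or multiples) of a whole note.
standard = {
    "32nd": whole // 32,
    "16th": whole // 16,
    "8th": whole // 8,
    "quarter": whole // 4,
    "half": whole // 2,
    "whole": whole,
    "two whole": 2 * whole,
}

# Dotted = 1.5x the base length; triplet = 2/3 of the base length.
dotted = {name: standard[name] * 3 // 2
          for name in ("16th", "8th", "quarter", "half", "whole")}
triplet = {name: standard[name] * 2 // 3
           for name in ("16th", "8th", "quarter")}

# Sorted list of all 15 durations, in ticks.
note_lengths = sorted([*standard.values(), *dotted.values(), *triplet.values()])
print(len(note_lengths), note_lengths)
```

The sorted result has 15 entries, which is where the factor of 15 in the vocabulary size comes from.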

#### Other notes:
* MIDI files were generated via GuitarPro 6
    * MIDI files generated by other tablature software may use different encoding schemes
* Rests are identifiable by a nonzero "time=x" delta on a "note_on" MIDI event
* Pitches will be tokenized with 0 as the lowest pitch by subtracting min_pitch
* We'll assign rests the maximum "pitch" value of 54 whereas all other notes will be [0, 53] == [35-35, 88-35]
* Each note duration is assigned an index on the interval [0, 14] and is retrieved via index_dict
    * The encoded pitch/duration combination is given by: X = (pitch - min_pitch) + index * n_notes
    * Then min(X) = 0 for (pitch, index) = (min_pitch, 0), and max(X) = 824 for (pitch, index) = (89, 14), where 89 = max_pitch + 1 stands in for the rest note
* Incompletely filled measures seem to confound MIDI outputs
* Ensure all tabs are near the same tempo (e.g. 240 BPM) and are not tabbed in half or double time
    * Half or double time tabs would skew note duration distributions
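
The pitch/duration encoding described in the notes can be sketched as a pair of helpers. This is only an illustration: `encode` and `decode` are hypothetical names that do not exist in the notebook, and the pitch passed in is assumed to be already shifted by `min_pitch`.

```python
min_pitch = 35
max_pitch = 88
n_notes = max_pitch - min_pitch + 2  # 54 pitches + 1 rest = 55
rest_note = n_notes - 1              # the rest takes the top "pitch" slot, 54

def encode(pitch_offset, duration_index):
    """Combine a shifted pitch (0..54) and a duration index (0..14) into one token."""
    return pitch_offset + duration_index * n_notes

def decode(token):
    """Invert encode(): return (pitch_offset, duration_index)."""
    duration_index, pitch_offset = divmod(token, n_notes)
    return pitch_offset, duration_index

# MIDI note 40 held for a quarter note (duration index 8).
token = encode(40 - min_pitch, 8)
print(token, decode(token))

# Extremes of the vocabulary.
print(encode(0, 0), encode(rest_note, 14))
```

Because the pitch offset is always smaller than `n_notes`, `divmod` recovers both parts of the token without ambiguity.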

#### Ideas:
* During prediction, use a low temperature to find repeating "riffs" while occasionally increasing temperature or otherwise introducing some randomness in order to break out of a riff and into another section of the song...
* Train a second neural network to write drum parts based off of generated guitar riffs...
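
The first idea could be prototyped roughly as follows. This is a sketch under stated assumptions, not part of the notebook: it assumes the model produces one unnormalized score (logit) per token, and both `sample_with_temperature` and `temperature_schedule`, along with their default values, are hypothetical.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index from logits after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def temperature_schedule(step, base=0.5, spike=1.5, period=64, spike_len=4):
    """Hypothetical schedule: low temperature most of the time,
    raised for the last `spike_len` steps of every `period` steps."""
    return spike if step % period >= period - spike_len else base

# Toy logits standing in for one model output over a 3-token vocabulary.
rng = random.Random(0)
logits = [0.0, 10.0, 0.0]
picks = [sample_with_temperature(logits, temperature_schedule(t), rng)
         for t in range(8)]
print(picks)
```

A low temperature concentrates probability on the highest-scoring tokens, which favors repetition; the occasional spike flattens the distribution so that a different continuation can be chosen.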

In [1]:
import os
import re
import time
import numpy as np
import pandas as pd

from mido import MidiFile
from multiprocessing import pool
from matplotlib import pyplot as plt

In [2]:
min_pitch = 35
max_pitch = 88
include_rest = True

In [3]:
n_notes = max_pitch - min_pitch + 2 if include_rest else max_pitch - min_pitch + 1
rest_note = n_notes - 1 if include_rest else None

In [4]:
note_lengths = [60, 80, 120, 160, 180, 240, 320, 360, 480, 720, 960, 1440, 1920, 2880, 3840]

In [5]:
index_dict = dict(zip(note_lengths, range(len(note_lengths))))

In [6]:
index_dict

{60: 0,
 80: 1,
 120: 2,
 160: 3,
 180: 4,
 240: 5,
 320: 6,
 360: 7,
 480: 8,
 720: 9,
 960: 10,
 1440: 11,
 1920: 12,
 2880: 13,
 3840: 14}

In [7]:
midi_dir = './midi'
save_dir = './data'

In [8]:
os.makedirs(save_dir, exist_ok=True)

In [9]:
files = [os.path.join(midi_dir, file) for file in os.listdir(midi_dir) if file.endswith('.mid')]

In [10]:
def show_messages(file):
    mid = MidiFile(file)
    for i, track in enumerate(mid.tracks):
        print("=== Track {} ===".format(i))
        for msg in track:
            print(str(msg))

In [11]:
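def get_index(key):
    # Exact match: the duration is already one of the known note lengths.
    if key in index_dict:
        return index_dict[key]

    # Otherwise snap to the closest known note length
    # (np.argmin returns the first minimum, so ties go to the shorter length).
    keys = list(index_dict.keys())
    diff = np.absolute(np.array(keys) - key)
    return index_dict[keys[np.argmin(diff)]]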
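
To see how the snapping in `get_index` behaves, here is a NumPy-free stand-in. `nearest_index` is a hypothetical name used only for this illustration; it is meant to mirror the same nearest-length rule, including the tie-breaking toward the shorter duration.

```python
note_lengths = [60, 80, 120, 160, 180, 240, 320, 360,
                480, 720, 960, 1440, 1920, 2880, 3840]

def nearest_index(ticks):
    """Index of the known note length closest to `ticks`.
    min() keeps the first minimum, so ties resolve to the shorter length."""
    return min(range(len(note_lengths)),
               key=lambda i: abs(note_lengths[i] - ticks))

print(nearest_index(480))    # exact quarter note
print(nearest_index(475))    # slightly short quarter note, still snaps to 480
print(nearest_index(100))    # equidistant from 80 and 120, snaps to 80
print(nearest_index(10000))  # longer than anything known, snaps to 3840
```

Durations far outside the table are still forced into the nearest bucket, so unusually long or short events are worth checking before training.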

In [12]:
def tokenize(track):
    # Lowest pitch seen since the last emitted note; np.inf means no note is pending.
    last_pitch = np.inf
    notes = []

    for msg in track:
        if msg.type == 'note_on':
            # A nonzero delta time before a note_on means silence, so emit a rest.
            if msg.time > 0:
                index = get_index(msg.time)
                notes.append(rest_note + n_notes*index)

            # For chords, keep only the lowest pitch.
            pitch = msg.note - min_pitch
            if pitch < last_pitch:
                last_pitch = pitch

        elif msg.type == 'note_off':
            # A nonzero delta time on a note_off is taken as the pending note's duration.
            if msg.time > 0 and last_pitch != np.inf:
                index = get_index(msg.time)
                notes.append(last_pitch + n_notes*index)
                last_pitch = np.inf

    return notes
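
The tokenizer's behavior is easier to follow on a tiny hand-built example. The sketch below replays the same logic on plain `(type, note, delta_ticks)` tuples instead of real MIDI messages; the tuple format and the names `tokenize_events` and `nearest_index` are assumptions made for illustration.

```python
import math

min_pitch, max_pitch = 35, 88
n_notes = max_pitch - min_pitch + 2   # 55, including the rest
rest_note = n_notes - 1               # 54
note_lengths = [60, 80, 120, 160, 180, 240, 320, 360,
                480, 720, 960, 1440, 1920, 2880, 3840]

def nearest_index(ticks):
    """Index of the known note length closest to `ticks`."""
    return min(range(len(note_lengths)),
               key=lambda i: abs(note_lengths[i] - ticks))

def tokenize_events(events):
    """Tokenize (type, note, delta_ticks) tuples standing in for MIDI messages."""
    lowest = math.inf  # lowest pending pitch; inf means nothing is pending
    tokens = []
    for kind, note, delta in events:
        if kind == "note_on":
            if delta > 0:  # silence before this note becomes a rest token
                tokens.append(rest_note + n_notes * nearest_index(delta))
            lowest = min(lowest, note - min_pitch)
        elif kind == "note_off":
            if delta > 0 and lowest != math.inf:
                tokens.append(lowest + n_notes * nearest_index(delta))
                lowest = math.inf
    return tokens

events = [
    ("note_on", 40, 0), ("note_off", 40, 480),    # quarter note
    ("note_on", 47, 240), ("note_off", 47, 240),  # eighth rest, then eighth note
    ("note_on", 35, 0), ("note_on", 42, 0),       # two-note chord
    ("note_off", 35, 480), ("note_off", 42, 0),   # only the lowest pitch is kept
]
print(tokenize_events(events))  # [445, 329, 287, 440]
```

Note that the chord collapses to a single token for its lowest pitch, so the resulting sequence is monophonic.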

In [13]:
for file in files:
    mid = MidiFile(file)
    
    for i, track in enumerate(mid.tracks):
        
        notes = tokenize(track)
        
        basename = os.path.splitext(os.path.basename(file))[0]
        filename = basename + " - {}".format(i)
        np.save(os.path.join(save_dir, filename), notes)