<a href="https://colab.research.google.com/github/dubisx/EMCT_final/blob/main/1.%20midi_encode_to_REMI_token.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MIDI TOKENIZATION FOR MACHINE LEARNING

In [None]:
!pip install miditok
!pip install tokenizers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting miditok
  Downloading miditok-2.0.5-py3-none-any.whl (94 kB)
Collecting miditoolkit>=0.1.16 (from miditok)
  Downloading miditoolkit-0.1.16-py3-none-any.whl (20 kB)
Collecting mido>=1.1.16 (from miditoolkit>=0.1.16->miditok)
  Downloading mido-1.2.10-py2.py3-none-any.whl (51 kB)
Installing collected packages: mido, miditoolkit, miditok
Successfully installed miditok-2.0.5 miditoolkit-0.1.16 mido-1.2.10
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tokenizers
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [None]:
from miditok import REMI
from miditok.utils import get_midi_programs
from miditoolkit import MidiFile
from pathlib import Path
from miditok.constants import CHORD_MAPS


In [None]:
from google.colab import drive

# Mount the Google Drive that holds the dataset
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


Initialize the tokenizer:

In [None]:
# Our parameters
pitch_range = range(21, 109)
beat_res = {(0, 4): 8, (4, 12): 4}
nb_velocities = 32
additional_tokens = {'Chord': True, 'Rest': True, 'Tempo': True,
                     'rest_range': (2, 8),  # (beat divisor, max beats): rests from half a beat up to 8 beats
                     'nb_tempos': 32,  # number of tempo bins
                     'TimeSignature': False,
                     'tempo_range': (40, 250),  # (min, max)
                     'Program': False,
                     "chord_maps": CHORD_MAPS,
                     "chord_tokens_with_root_note": True,
                     "chord_unknown": False}
special_tokens = ["PAD", "BOS", "EOS"]

# Create the tokenizer
tokenizer = REMI(pitch_range, beat_res, nb_velocities, additional_tokens,
                 special_tokens=special_tokens)
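The `beat_res` dictionary maps a range of beats to the number of samples per beat used inside that range, so short durations get a finer grid than long ones. The sketch below enumerates the duration values that this setting implies. It illustrates the idea only and is an assumption about how the grid is laid out, not a copy of MidiTok's internals; `duration_grid` is a hypothetical helper.

```python
# Illustrative only: enumerate the durations (in beats) implied by beat_res.
# duration_grid is a hypothetical helper, not part of MidiTok.
beat_res = {(0, 4): 8, (4, 12): 4}


def duration_grid(beat_res):
    """Return the sorted durations (in beats) available under beat_res."""
    durations = set()
    for (start, end), res in beat_res.items():
        for beat in range(start, end):
            for sample in range(res):
                durations.add(beat + sample / res)
        durations.add(float(end))  # include the upper bound of the range
    durations.discard(0.0)  # a zero-length duration carries no information
    return sorted(durations)


grid = duration_grid(beat_res)
print(len(grid), grid[:3], grid[-3:])
```

Under these assumptions, durations below 4 beats move in steps of 1/8 beat and durations from 4 to 12 beats move in steps of 1/4 beat.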

Paths to the dataset

In [None]:
midi_paths = list(Path('/content/gdrive/MyDrive/ala/MIDI/').glob('**/*.mid'))
print(len(midi_paths))

2507


Validating the MIDI files: discard data that is not usable for machine learning


In [None]:
def midi_valid(midi) -> bool:
    if any(ts.numerator != 4 for ts in midi.time_signature_changes):
        return False  # time signature different from 4/*, 4 beats per bar
    if midi.max_tick < 10 * midi.ticks_per_beat:
        return False  # this MIDI is too short
    return True
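To see what the filter accepts and rejects without reading any file, it can be run on stand-in objects that expose only the three attributes it inspects. This is a minimal sketch: `fake_midi` is a hypothetical helper, and the value of 480 ticks per beat is an arbitrary assumption.

```python
from types import SimpleNamespace


def midi_valid(midi) -> bool:
    # Same logic as the cell above, repeated so the example is self-contained.
    if any(ts.numerator != 4 for ts in midi.time_signature_changes):
        return False
    if midi.max_tick < 10 * midi.ticks_per_beat:
        return False
    return True


def fake_midi(numerators, max_tick, ticks_per_beat=480):
    """Hypothetical stand-in exposing only the attributes midi_valid reads."""
    return SimpleNamespace(
        time_signature_changes=[SimpleNamespace(numerator=n) for n in numerators],
        max_tick=max_tick,
        ticks_per_beat=ticks_per_beat,
    )


results = {
    "4/4, 32 beats": midi_valid(fake_midi([4], max_tick=480 * 32)),
    "3/4, 32 beats": midi_valid(fake_midi([3], max_tick=480 * 32)),
    "4/4, 8 beats": midi_valid(fake_midi([4], max_tick=480 * 8)),
    "no time signature, 32 beats": midi_valid(fake_midi([], max_tick=480 * 32)),
}
for name, valid in results.items():
    print(f"{name}: {valid}")
```

Note that a file with no time signature events passes the first check, because `any()` over an empty list is `False`.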

Converting MIDI to .json tokens

In [None]:
# Perform data augmentation on 2 pitch octaves, 2 velocity and 1 duration values
data_augmentation_offsets = [2, 2, 1]
tokenizer.tokenize_midi_dataset(
    midi_paths,
    Path("/content/gdrive/MyDrive/ala/REMI/tokens_noBPE"),
    midi_valid,
    data_augmentation_offsets,
)

Tokenizing MIDIs (REMI/tokens_noBPE):   6%|▋         | 159/2507 [04:16<1:04:34,  1.65s/it]

Learning the vocabulary with Byte Pair Encoding (BPE)

In [None]:
tokenizer.learn_bpe(
    vocab_size=1000,
    tokens_paths=list(Path("/content/gdrive/MyDrive/ala/REMI/tokens_noBPE").glob('**/*.json')),
    out_dir=Path('/content/gdrive/MyDrive/ala/REMI/tokens_BPE'),
)
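For intuition about what this step learns, here is a toy version of Byte Pair Encoding on a list of integer token ids: the most frequent adjacent pair is repeatedly replaced by a new id until the target vocabulary size is reached. This is a plain-Python sketch of the general technique, not the implementation MidiTok uses; the helper names and the toy sequence are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_toy_bpe(seq, base_vocab_size, target_vocab_size):
    """Learn merges until the vocabulary reaches target_vocab_size."""
    merges = {}
    next_id = base_vocab_size
    while next_id < target_vocab_size:
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges[pair] = next_id
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return seq, merges


tokens = [0, 1, 2, 0, 1, 2, 0, 1, 3]  # toy sequence over a 4-token base vocabulary
compressed, merges = learn_toy_bpe(tokens, base_vocab_size=4, target_vocab_size=6)
print(compressed, merges)
```

The sequence gets shorter as the vocabulary grows, which is the trade-off that `vocab_size` controls in the cell above.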

Applying the learned BPE vocabulary to the tokenized dataset


In [None]:
tokenizer.apply_bpe_to_dataset(
    Path("/content/gdrive/MyDrive/ala/REMI/tokens_noBPE"),
    Path('/content/gdrive/MyDrive/ala/REMI/tokens_BPE')
)

