*Two_Comp_Data_Preparation.ipynb* <p style='text-align: right;'> <b> September 20th 2020 </b> </p>
<p style='text-align: right;'> <b> David Diston </b> </p>

# Preprocess Data Required to Build a Two-Composer Classifier

***Because this notebook reuses the same preprocessing code, the comments here are less detailed. If anything is unclear, refer back to `10-Composer_Model_Data_Preparation.ipynb`. Two Two-Composer Classification models were created (Debussy-Mozart and Prokofiev-Rachmaninoff), and the preprocessing code is identical for both. The only difference is which composer files are present in the initial pre-preprocessing directory `2Comp/`. To create a different Two-Composer Classifier, place the compositions by those two composers in the initial `2Comp/` directory and run this entire notebook. Provided the output directory structure is set up correctly, the notebook produces training, validation, and test datasets that can be used to train a new classification model.***
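The cells below write into one sub-folder per composer under each stage directory, and they assume those folders already exist. A minimal sketch of how that structure could be created up front is shown here. The stage names are taken from the paths used in the cells below; the helper `make_stage_dirs` and the composer folder names are hypothetical and should be adapted to your own setup.

```python
import os

# Stage directories referenced by the cells in this notebook
STAGE_DIRS = [
    '2Comp',                        # raw MIDI files, one folder per composer
    '2Comp_QUANT',                  # quantized MIDI files
    '2Comp_Array',                  # full-length arrays (training pool)
    '2Comp_Test_Set_Arrays',        # full-length arrays held out for testing
    '2Comp_Validation_Set_Arrays',  # full-length arrays held out for validation
    '2Comp_Data',                   # sampled training clips
    '2Comp_Test_Set_Data',          # sampled test clips
    '2Comp_Validation_Set_Data',    # sampled validation clips
]


def make_stage_dirs(root, composers):
    """Create root/<stage>/<composer>/ for every stage and composer.

    Hypothetical helper: it is not part of the original notebook.
    Existing folders are left untouched.
    """
    created = []
    for stage in STAGE_DIRS:
        for composer in composers:
            path = os.path.join(root, stage, composer)
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

For example, `make_stage_dirs('2Comp', ['Debussy', 'Mozart'])` would create `2Comp/2Comp_QUANT/Debussy/`, `2Comp/2Comp_QUANT/Mozart/`, and so on, assuming the composer folders are named that way.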

In [1]:
import numpy as np
import mido
from mido import MidiFile, MidiTrack, Message, MetaMessage
import os
import random
import shutil

#### Step One: Quantization

In [2]:
'''********************************************'''
quantization = 8
'''********************************************'''

# Create a count of all files processed
file_count = 0

# Iterate over each composer folder
for folder in os.listdir('2Comp/2Comp/'):

    # Iterate over each file
    for file in os.listdir(f'2Comp/2Comp/{folder}/'):

        # Load each file
        clip = MidiFile(f'2Comp/2Comp/{folder}/{file}')

        # Calculate the ticks per quantization
        ticks_per_quant = int(clip.ticks_per_beat / quantization)
        assert clip.ticks_per_beat % quantization == 0, 'ERROR: Quantization is too Fine.'

        cum_time = 0
        btw_note = 0

        # Find the first note message
        for note_msgs, msg in enumerate(clip.tracks[0]):
            if msg.type == 'note_on':
                break

        switch = False

        # Iterate over each message to bin it to the nearest quantization step
        for msg in clip.tracks[0][note_msgs:]:

            cum_time += msg.time

            if (cum_time % ticks_per_quant) <= (ticks_per_quant * 0.5):
                msg.time = 0
                switch = False

            elif (cum_time % ticks_per_quant) > (ticks_per_quant * 0.5) and switch == False:
                msg.time = int(ticks_per_quant)
                switch = True

            elif (msg.type == 'note_on'):
                msg.time = 0

        # Adjust the tempo based on the quantization adjustments
        for msg in clip.tracks[0][note_msgs:]:
            msg.time = int(msg.time * (quantization / 4))

        # Zero out any delta time shorter than one (scaled) quantization step
        for msg in clip.tracks[0][note_msgs:]:
            if msg.time < int((ticks_per_quant) * (quantization / 4) - 1):
                msg.time = int(0)

        # Save the quantized midi file
        name = f'Year--{folder}--' + str(file_count) + file
        clip.save(filename = f'2Comp/2Comp_QUANT/{folder}/{name}')

        file_count += 1    
        print(f'Processed {file_count} files.', end = '\r')
    
print('\nDone')

Processed 170 files.
Done
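To make the binning rule in the cell above easier to follow, here is a simplified sketch that applies the same threshold to an absolute tick position. It is an illustration, not the notebook's method: the cell rewrites MIDI delta times in place, whereas the hypothetical function `snap_to_grid` only shows which grid line a given tick would land on. The default of 480 ticks per beat is an assumption; real files may use other values.

```python
def snap_to_grid(abs_tick, ticks_per_beat=480, quantization=8):
    """Snap an absolute tick position to the quantization grid.

    Uses the same rule as the cell above: a remainder of at most half a
    grid step rounds down, anything larger rounds up.
    """
    if ticks_per_beat % quantization != 0:
        raise ValueError('Quantization is too fine for this file.')

    ticks_per_quant = ticks_per_beat // quantization
    remainder = abs_tick % ticks_per_quant

    if remainder <= ticks_per_quant * 0.5:
        return abs_tick - remainder                # round down to the grid line
    return abs_tick - remainder + ticks_per_quant  # round up to the next grid line
```

With 480 ticks per beat and `quantization = 8`, each grid step is 60 ticks, so a tick position of 70 snaps down to 60 and a position of 95 snaps up to 120.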


#### Step Two: Convert Midi Files to Arrays

In [3]:
# Here I am converting all quantized midi files into numpy arrays
file_count = 0

# Iterate over each composer folder in the directory
for folder in os.listdir('2Comp/2Comp_QUANT/'):

    # Iterate over each file
    for file in os.listdir(f'2Comp/2Comp_QUANT/{folder}'):

        # Import the file as a mido object
        clip = MidiFile(f'2Comp/2Comp_QUANT/{folder}/{file}')

        # Find the first note message
        for note_msgs, msg in enumerate(clip.tracks[0]):
            if msg.type == 'note_on':
                break

        # Instantiate my track list and time_step list
        track_list = []
        time_step = [0] * 88
        # Insert each note into the correct binned time_step, and append each time step to the track list
        for msg in clip.tracks[0][note_msgs:]:
            if (msg.type != 'note_on') and (msg.type != 'note_off') and (msg.time > 0):  
                track_list.append(time_step)
                time_step = [0] * 88
            elif (msg.type == 'note_on' or msg.type == 'note_off') and (msg.time > 0):
                track_list.append(time_step)
                time_step = [0] *88
                note = (msg.note - 21)
                time_step[note] = msg.velocity
            elif (msg.type == 'note_on' or msg.type == 'note_off') and (msg.time == 0):
                note = (msg.note - 21)
                time_step[note] = msg.velocity

        # Append any residual note messages as a final partial bin
        if sum(time_step) > 0:
            track_list.append(time_step)

        # Convert the track list to a numpy array and save
        track_array = np.array(track_list)
        name = file[: -4] + '.npy'
        np.save(f'2Comp/2Comp_Array/{folder}/{name}', track_array)

        file_count += 1    
        print(f'Processed {file_count} arrays.', end = '\r')
        
print('\nDone')

Processed 170 arrays.
Done
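The conversion above builds an 88-column "piano roll": each row is one time step and each column is one piano key (MIDI note 21 maps to column 0). The sketch below reproduces the core of that logic on plain tuples so it can be tried without any MIDI files. The function `events_to_roll` and the `(delta_time, note, velocity)` tuple format are assumptions for illustration; unlike the cell above, it handles note events only and ignores other message types.

```python
LOWEST_PIANO_NOTE = 21  # MIDI note number of the lowest of the 88 piano keys
NUM_KEYS = 88


def events_to_roll(events):
    """Convert (delta_time, note, velocity) tuples into a list of 88-wide rows.

    A positive delta time closes the current time step and opens a new one;
    a delta time of zero adds the note to the current time step.
    """
    roll = []
    time_step = [0] * NUM_KEYS

    for delta, note, velocity in events:
        if delta > 0:
            roll.append(time_step)
            time_step = [0] * NUM_KEYS
        time_step[note - LOWEST_PIANO_NOTE] = velocity

    # Keep any residual notes as a final partial time step
    if sum(time_step) > 0:
        roll.append(time_step)

    return roll
```

The result can be passed to `np.array(...)` to obtain an array of shape `(time_steps, 88)`, the same layout the notebook saves to `.npy` files.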


#### Step Three: Split out Validation and Test Sets

In [4]:
# Split out the validation and test sets prior to random 200 row clip sampling
files_moved = 0

for folder in os.listdir('2Comp/2Comp_Array/'):
    
    num_files = len(os.listdir(f'2Comp/2Comp_Array/{folder}/'))
    num_files_test = int(num_files / 10) # Used for selecting 10% of data
    
    for i in range (0, num_files_test):
        file = random.choice(os.listdir(f'2Comp/2Comp_Array/{folder}/'))
        shutil.move(f'2Comp/2Comp_Array/{folder}/{file}', f'2Comp/2Comp_Test_Set_Arrays/{folder}/{file}')
        
        files_moved += 1    
        print(f'Moved {files_moved} arrays into the test set.               ', end = '\r')
        
    for i in range (0, num_files_test):
        file = random.choice(os.listdir(f'2Comp/2Comp_Array/{folder}/'))
        shutil.move(f'2Comp/2Comp_Array/{folder}/{file}', f'2Comp/2Comp_Validation_Set_Arrays/{folder}/{file}')
        
        files_moved += 1    
        print(f'Moved {files_moved} arrays into the validation set.                 ', end = '\r')
        
print('\nDone')

Moved 1 arrays into the test set.
Moved 2 arrays into the test set.
Moved 3 arrays into the test set.
Moved 4 arrays into the test set.
Moved 5 arrays into the test set.
Moved 6 arrays into the test set.
Moved 7 arrays into the test set.
Moved 8 arrays into the validation set.
Moved 9 arrays into the validation set.
Moved 10 arrays into the validation set.
Moved 11 arrays into the validation set.
Moved 12 arrays into the validation set.
Moved 13 arrays into the validation set.
Moved 14 arrays into the validation set.
Moved 15 arrays into the test set.
Moved 16 arrays into the test set.
Moved 17 arrays into the test set.
Moved 18 arrays into the test set.
Moved 19 arrays into the test set.
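Because the split above draws files with `random.choice` and moves them on disk, each run produces a different split. If a repeatable split is needed, one alternative is to shuffle the file list with a seeded generator and slice it. This is a sketch of that alternative, not the approach used in this notebook; the function name `split_files` and its parameters are hypothetical.

```python
import random


def split_files(files, divisor=10, seed=None):
    """Split file names into (train, validation, test) lists.

    One `divisor`-th of the files (rounded down) goes to the test set and
    the same number to the validation set, mirroring `int(num_files / 10)`
    in the cell above. The remainder is the training pool.
    """
    rng = random.Random(seed)
    shuffled = sorted(files)  # sort first so the result depends only on the seed
    rng.shuffle(shuffled)

    n_holdout = len(shuffled) // divisor
    test = shuffled[:n_holdout]
    validation = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train, validation, test
```

The returned lists could then drive the `shutil.move` calls, so that the same seed always sends the same files to each set.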

#### Step Four: Sample the Training/Test/Validation Sets

In [5]:
# For this model, I chose to use 200-row clips to see if this would affect results
# To compensate, I will create a smaller dataset of 1000 clips per composer
count = 0

for folder in os.listdir('2Comp/2Comp_Array/'):

    for i in range(0, 1000):
        file = random.choice(os.listdir(f'2Comp/2Comp_Array/{folder}/'))
        name = ''
        # Load each array
        array = np.load(f'2Comp/2Comp_Array/{folder}/{file}')

        # I will randomly select a 200-row sample from each array for modeling
        length = (len(array) - 200)
        start = random.randint(0, length)
        end = start + 200
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'CLIP--' + str(count) + '--' + file
        np.save(f'2Comp/2Comp_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 2000 array clips.
Done
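The sampling cells in this step all share the same clip logic. One caveat: if an array has fewer than 200 rows, `length` becomes negative and `random.randint(0, length)` raises a `ValueError`. The sketch below wraps the clip selection in a hypothetical helper, `random_clip`, that makes this condition explicit. It is an illustration under that assumption, not code from the original notebook.

```python
import random


def random_clip(array, clip_len=200, rng=random):
    """Return a random contiguous slice of `clip_len` rows from `array`.

    Works on anything that supports len() and slicing, including lists
    and numpy arrays. Raises ValueError if the array is too short.
    """
    if len(array) < clip_len:
        raise ValueError(
            f'Array has {len(array)} rows, fewer than the clip length {clip_len}.'
        )

    start = rng.randint(0, len(array) - clip_len)  # randint is inclusive at both ends
    return array[start:start + clip_len]
```

Passing `rng=random.Random(seed)` makes the sampled clips repeatable, and catching the `ValueError` allows short pieces to be skipped rather than stopping the whole run.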


In [6]:
# I will create 100 test set clips per composer
count = 0

for folder in os.listdir('2Comp/2Comp_Test_Set_Arrays/'):

    for i in range(0, 100):
        file = random.choice(os.listdir(f'2Comp/2Comp_Test_Set_Arrays/{folder}/'))
        name = ''
        # Load each array
        array = np.load(f'2Comp/2Comp_Test_Set_Arrays/{folder}/{file}')

        # I will randomly select a 200-row sample from each array for modeling
        length = (len(array) - 200)
        start = random.randint(0, length)
        end = start + 200
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'TEST_CLIP--' + str(count) + '--' + file
        np.save(f'2Comp/2Comp_Test_Set_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 200 array clips.
Done


In [7]:
# I will create 100 validation set clips per composer
count = 0

for folder in os.listdir('2Comp/2Comp_Validation_Set_Arrays/'):

    for i in range(0, 100):
        file = random.choice(os.listdir(f'2Comp/2Comp_Validation_Set_Arrays/{folder}/'))
        name = ''
        # Load each array
        array = np.load(f'2Comp/2Comp_Validation_Set_Arrays/{folder}/{file}')

        # I will randomly select a 200-row sample from each array for modeling
        length = (len(array) - 200)
        start = random.randint(0, length)
        end = start + 200
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'VAL_CLIP--' + str(count) + '--' + file
        np.save(f'2Comp/2Comp_Validation_Set_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 200 array clips.
Done


Now that I have preprocessed my data for both the Debussy-Mozart model and the Prokofiev-Rachmaninoff model, I am ready to build and train these models.

<p style='text-align: right;'> <b> Next Step: </b> Build RNN for each Composer Pair - <em> Debussy_Mozart_Model.ipynb   &   Prokofiev_Rachmaninoff_Model.ipynb </em> </p>