*Composer_Year_Model_Data_Preparation.ipynb* <p style='text-align: right;'> <b> September 20th 2020 </b> </p>
<p style='text-align: right;'> <b> David Diston </b> </p>

# Preprocess Data Required to Build a Composer Century Classifier

***The data for this model needs the same preprocessing as the previous model. This is the exact data preprocessing code used for that model, with only directory and folder names changed, so the commenting here is less in-depth. If anything is unclear, please refer back to `10-Composer_Model_Data_Preparation.ipynb` for clarification.***

From external research, I have determined the birth and death dates of each of the top 10 composers. Outside of Python, I moved each composer's files into the matching folder within the new `Composer_Year` directory so that data preprocessing can begin. As before, this model uses a quantization value of 4, which bins each message to the nearest 16th note.
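To make the quantization value concrete, here is a minimal sketch of how it sets the bin width. The helper name `ticks_per_bin` and the resolution of 480 ticks per beat are assumptions for illustration only; the real value is read from each file's `ticks_per_beat`.

```python
def ticks_per_bin(ticks_per_beat: int, quantization: int = 4) -> int:
    """Return the width of one quantization bin in MIDI ticks.

    Hypothetical helper: a beat (quarter note) is split into
    `quantization` equal bins, so 4 gives 16th-note bins.
    """
    if ticks_per_beat % quantization != 0:
        raise ValueError('Quantization is too fine for this resolution.')
    return ticks_per_beat // quantization


# Assumed resolution of 480 ticks per beat, for illustration only
bin_width = ticks_per_bin(480, 4)
print(bin_width)  # 120 ticks per 16th note
```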

In [1]:
import numpy as np
import mido
from mido import MidiFile, MidiTrack, Message, MetaMessage
import os
import random
import shutil

#### Step One: Binning Music to Nearest 16th Note

In [2]:
'''********************************************'''
quantization = 4
'''********************************************'''

# Create a count of all files processed
file_count = 0

# I will iterate over all year folders in the directory
for folder in os.listdir('Composer_Year/Composer_Year/'):

    # For every year folder I will iterate over every file in the folder
    for file in os.listdir(f'Composer_Year/Composer_Year/{folder}/'):

        # I will import each file as a mido object
        clip = MidiFile(f'Composer_Year/Composer_Year/{folder}/{file}')

        # Find the number of time ticks per quantization bin
        ticks_per_quant = int(clip.ticks_per_beat / quantization)
        # This assertion prevents over-quantization
        assert clip.ticks_per_beat % quantization == 0, 'ERROR: Quantization is too Fine.'

        # Time and binning counting variables
        cum_time = 0
        btw_note = 0

        # Iterate over each message to find the first note message
        for note_msgs, msg in enumerate(clip.tracks[0]):
            if msg.type == 'note_on':
                break

        # The bin switching variable
        switch = False

        # Again iterate over each message starting with the first note message
        for msg in clip.tracks[0][note_msgs:]:

            # Add the message time delta to the current time
            cum_time += msg.time

            # Bin each message to the appropriate quantization bin
            if (cum_time % ticks_per_quant) <= (ticks_per_quant * 0.5):
                msg.time = 0
                switch = False

            elif (cum_time % ticks_per_quant) > (ticks_per_quant * 0.5) and switch == False:
                msg.time = int(ticks_per_quant)
                switch = True

            elif (msg.type == 'note_on'):
                msg.time = 0

        # Reset the tempo of the piece based on the quantization adjustment
        for msg in clip.tracks[0][note_msgs:]:
            msg.time = int(msg.time * (quantization / 4))

        # Ensure that all zero time deltas are integers
        for msg in clip.tracks[0][note_msgs:]:
            if msg.time < int((ticks_per_quant) * (quantization / 4) - 1):
                msg.time = int(0)

        # Save the new quantized midi file
        name = f'Year--{folder}--' + str(file_count) + file
        clip.save(filename = f'Composer_Year/Composer_Year_QUANT/{folder}/{name}')

        file_count += 1    
        print(f'Processed {file_count} files.', end = '\r')
    
print('\nDone')

Processed 1907 files.
Done
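The binning rule is easier to picture on absolute tick times than on time deltas. The sketch below is a simplification and not the loop used in the cell above: it snaps a list of hypothetical absolute times to the nearest multiple of the bin width, sending exact ties to the earlier bin in the same way as the `<=` test above. The function name `snap_to_grid` and the sample times are assumptions.

```python
def snap_to_grid(abs_ticks, bin_width):
    """Snap absolute tick times to the nearest multiple of bin_width.

    Ties (exactly half a bin) go to the earlier bin.
    """
    snapped = []
    for t in abs_ticks:
        remainder = t % bin_width
        if remainder <= bin_width / 2:
            snapped.append(t - remainder)              # round down
        else:
            snapped.append(t - remainder + bin_width)  # round up
    return snapped


# Hypothetical absolute times with a 120-tick (16th note) bin
times = [0, 50, 60, 61, 130, 240]
print(snap_to_grid(times, 120))  # [0, 0, 0, 120, 120, 240]
```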


#### Step Two: Converting Quantized Midi File to Arrays

In [3]:
# By the same process as before I will create an array from the quantized midi files
file_count = 0

# Iterate over each year folder
for folder in os.listdir('Composer_Year/Composer_Year_QUANT/'):

    # Iterate over each quantized midi file
    for file in os.listdir(f'Composer_Year/Composer_Year_QUANT/{folder}'):

        # Import the midi file as a mido object
        clip = MidiFile(f'Composer_Year/Composer_Year_QUANT/{folder}/{file}')

        # Find the first note message
        for note_msgs, msg in enumerate(clip.tracks[0]):
            if msg.type == 'note_on':
                break

        # Instantiate my track list and my time_step list
        track_list = []
        time_step = [0] * 88
        
        # For each message append the note velocity at the note index within each binned time step
        for msg in clip.tracks[0][note_msgs:]:
            if (msg.type != 'note_on') and (msg.type != 'note_off') and (msg.time > 0):  
                track_list.append(time_step)
                time_step = [0] * 88
            elif (msg.type == 'note_on' or msg.type == 'note_off') and (msg.time > 0):
                track_list.append(time_step)
                time_step = [0] * 88
                note = (msg.note - 21)
                time_step[note] = msg.velocity
            elif (msg.type == 'note_on' or msg.type == 'note_off') and (msg.time == 0):
                note = (msg.note - 21)
                time_step[note] = msg.velocity

        # Append any remaining partial bin
        if sum(time_step) > 0:
            track_list.append(time_step)

        # Convert the track list to an array and save
        track_array = np.array(track_list)
        name = file[: -4] + '.npy'
        np.save(f'Composer_Year/Composer_Year_Array/{folder}/{name}', track_array)

        file_count += 1    
        print(f'Processed {file_count} arrays.', end = '\r')
        
print('\nDone')

Processed 1907 arrays.
Done
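Each saved array has one row per binned time step and 88 columns, one per piano key. The column is the MIDI note number minus 21, so MIDI note 21 (A0) lands in column 0 and MIDI note 108 (C8) in column 87. The sketch below shows that mapping in isolation. The helper `note_to_column` and its range check are my own additions for illustration; the cell above indexes directly without a check.

```python
import numpy as np

LOWEST_PIANO_NOTE = 21  # MIDI number of A0, the lowest key on an 88-key piano
NUM_KEYS = 88


def note_to_column(midi_note: int) -> int:
    """Map a MIDI note number to its column in an 88-wide time step."""
    column = midi_note - LOWEST_PIANO_NOTE
    if not 0 <= column < NUM_KEYS:
        raise ValueError(f'MIDI note {midi_note} is outside the piano range.')
    return column


# One empty time step with middle C (MIDI 60) at an illustrative velocity of 80
time_step = np.zeros(NUM_KEYS, dtype=int)
time_step[note_to_column(60)] = 80
print(note_to_column(60))  # 39
```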


#### Step Three: Split out the Validation and Test Sets

In [4]:
# To prevent data leakage, I will split out the validation and test set files before any clips are sampled
files_moved = 0

for folder in os.listdir('Composer_Year/Composer_Year_Array/'):
    
    # Calculate 10% of the files in each year folder
    num_files = len(os.listdir(f'Composer_Year/Composer_Year_Array/{folder}/'))
    num_files_test = int(num_files / 10) # Used for selecting 10% of data
    
    # Randomly move 10% of the files from each year folder into the validation folder, then another 10% of the original count into the test folder
    for i in range (0, num_files_test):
        file = random.choice(os.listdir(f'Composer_Year/Composer_Year_Array/{folder}/'))
        shutil.move(f'Composer_Year/Composer_Year_Array/{folder}/{file}', f'Composer_Year/Composer_Year_Validation_Set_Arrays/{folder}/{file}')
        
        files_moved += 1    
        print(f'Moved {files_moved} arrays into the validation set.                 ', end = '\r')
        
    for i in range (0, num_files_test):
        file = random.choice(os.listdir(f'Composer_Year/Composer_Year_Array/{folder}/'))
        shutil.move(f'Composer_Year/Composer_Year_Array/{folder}/{file}', f'Composer_Year/Composer_Year_Test_Set_Arrays/{folder}/{file}')
        
        files_moved += 1    
        print(f'Moved {files_moved} arrays into the test set.                         ', end = '\r')
        
print('\nDone')

Moved 380 arrays into the test set.                         
Done
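The split sizes can be checked in memory before any files are moved. This is a minimal sketch under my own assumptions: it works on a list of hypothetical file names rather than the real folders, and it shuffles once and slices instead of calling `random.choice` repeatedly. The function name `split_names` and the seed are illustrative.

```python
import random


def split_names(names, holdout_divisor=10, seed=0):
    """Split file names into disjoint train, validation, and test lists.

    Validation and test each receive len(names) // holdout_divisor items,
    matching the int(num_files / 10) count used above.
    """
    rng = random.Random(seed)
    n_holdout = len(names) // holdout_divisor
    shuffled = list(names)
    rng.shuffle(shuffled)
    validation = shuffled[:n_holdout]
    test = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train, validation, test


# Hypothetical folder of 25 arrays
names = [f'piece_{i}.npy' for i in range(25)]
train, validation, test = split_names(names)
print(len(train), len(validation), len(test))  # 21 2 2
```

Because every name ends up in exactly one list, no source piece can contribute clips to more than one set.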


#### Step Four: Sampling

#### Training Set Sampling

In [5]:
# For each century folder in the training set, create 5,000 array clips sampled from the arrays in that folder
# This will be a total of 15,000 array clips for the three centuries combined
count = 0

for folder in os.listdir('Composer_Year/Composer_Year_Array/'):

    for i in range(0, 5000):
        file = random.choice(os.listdir(f'Composer_Year/Composer_Year_Array/{folder}/'))
        name = ''
        # Load each array
        array = np.load(f'Composer_Year/Composer_Year_Array/{folder}/{file}')

        # I will randomly select a clip of 100 time steps from each array for modeling
        length = (len(array) - 100)
        start = random.randint(0, length)
        end = start + 100
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'CLIP--' + str(count) + '--' + file
        np.save(f'Composer_Year/Composer_Year_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 15000 array clips.
Done
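The clip selection can be pulled out into a small function. This is a sketch rather than the code that produced the data: the name `sample_clip` is hypothetical, and the explicit length check is an addition. Without it, `random.randint` raises a `ValueError` of its own when an array has fewer than 100 time steps.

```python
import random

import numpy as np


def sample_clip(array, clip_length=100, rng=random):
    """Return a random window of clip_length consecutive time steps."""
    if len(array) < clip_length:
        raise ValueError(f'Array has only {len(array)} time steps.')
    # randint is inclusive at both ends, so the last valid start is allowed
    start = rng.randint(0, len(array) - clip_length)
    return array[start:start + clip_length]


# Hypothetical piece with 250 time steps and 88 keys
piece = np.zeros((250, 88), dtype=int)
clip = sample_clip(piece)
print(clip.shape)  # (100, 88)
```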


#### Test Set Sampling

In [6]:
# For each year I will create 500 test array clips
count = 0

for folder in os.listdir('Composer_Year/Composer_Year_Test_Set_Arrays/'):

    for i in range(0, 500):
        file = random.choice(os.listdir(f'Composer_Year/Composer_Year_Test_Set_Arrays/{folder}/'))
        name = ''
        # Load each array
        array = np.load(f'Composer_Year/Composer_Year_Test_Set_Arrays/{folder}/{file}')

        # I will randomly select a clip of 100 time steps from each array for modeling
        length = (len(array) - 100)
        start = random.randint(0, length)
        end = start + 100
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'TEST_CLIP--' + str(count) + '--' + file
        np.save(f'Composer_Year/Composer_Year_Test_Set_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 1500 array clips.
Done


#### Validation Set Sampling

In [7]:
# For each year I will create 500 validation array clips
count = 0

for folder in os.listdir('Composer_Year/Composer_Year_Validation_Set_Arrays/'):

    for i in range(0, 500):
        file = random.choice(os.listdir(f'Composer_Year/Composer_Year_Validation_Set_Arrays/{folder}/'))
        name = ''
        # Load each array
        array = np.load(f'Composer_Year/Composer_Year_Validation_Set_Arrays/{folder}/{file}')

        # I will randomly select a clip of 100 time steps from each array for modeling
        length = (len(array) - 100)
        start = random.randint(0, length)
        end = start + 100
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'VAL_CLIP--' + str(count) + '--' + file
        np.save(f'Composer_Year/Composer_Year_Validation_Set_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 1500 array clips.
Done
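Since every clip name ends with the name of the array it was cut from, the split can be checked for leakage from the clip names alone. The sketch below assumes the naming pattern used in the cells above (`CLIP--<count>--<source file>`); the example file names and the `1700` folder label are hypothetical.

```python
def source_of(clip_name: str) -> str:
    """Recover the source array name from a clip file name.

    Splits on the first two '--' separators only, because the source
    name itself also contains '--'.
    """
    return clip_name.split('--', 2)[2]


# Hypothetical clip names following the pattern used above
train_clips = ['CLIP--0--Year--1700--3example_a.npy',
               'CLIP--1--Year--1700--7example_b.npy']
test_clips = ['TEST_CLIP--0--Year--1700--9example_c.npy']

train_sources = {source_of(name) for name in train_clips}
test_sources = {source_of(name) for name in test_clips}

# An empty intersection means no piece appears in both sets
print(train_sources & test_sources)  # set()
```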


I have now completed all the data preprocessing required to build and train my new century classification model.

<p style='text-align: right;'> <b> Next Step: </b> Build RNN for Century Classification - <em> RNN_Model_Years.ipynb </em> </p>