*Array_Sampler.ipynb* <p style='text-align: right;'> <b> September 20th 2020 </b> </p>
<p style='text-align: right;'> <b> David Diston </b> </p>

# Randomly Sample Arrays to Build Dataset

***Here I will randomly sample each array I have created to make 100-row clips to be used in training, validation, and testing***

Each of the MIDI arrays I have created is anywhere from a couple hundred to a couple thousand rows long. This is a substantial amount of data to turn into a tensor and pass through a neural network. Additionally, the arrays have different lengths, which need to be standardized before training. To solve both of these problems, I will randomly draw 100-row samples from the collection of arrays I have created so far. To prevent data leakage into the validation and test sets, these sets have already been split out of the training set, and each will be sampled independently.
Based on the number of arrays I have created from the original MIDI files, and the average length of these arrays, I have determined that 20,000 array samples will be a representative selection of the original data. Of these, 16,000 will be used for training (8,000 for each class), and 2,000 each will be used for the validation and test sets (1,000 for each class in each set). While the class sizes for this model are equal, this sampling method will also serve to equalize class sizes for later models in this project.
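
As a minimal sketch of the sampling step described above, the window selection can be isolated in a small helper. The function name `sample_clip`, the `clip_len` parameter, and the demo array are hypothetical and only serve as illustration; the explicit length check is an added assumption, since the notebook relies on every array having at least 100 rows.

```python
import random

import numpy as np


def sample_clip(array, clip_len=100):
    """Return `clip_len` consecutive rows starting at a random row of `array`.

    Hypothetical helper for illustration only.
    """
    if len(array) < clip_len:
        # Assumption: arrays that are too short are rejected rather than padded
        raise ValueError(f'Array has {len(array)} rows, need at least {clip_len}.')

    # random.randint includes both endpoints, so the last valid start row
    # (len(array) - clip_len) can also be chosen
    start = random.randint(0, len(array) - clip_len)
    return array[start:start + clip_len]


# Demo on a made-up array of 250 rows and 4 columns
demo = np.arange(250 * 4).reshape(250, 4)
clip = sample_clip(demo)
print(clip.shape)  # (100, 4)
```

Because every clip has the same number of rows, clips drawn from arrays of different lengths can later be stacked into a single tensor.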

In [1]:
import numpy as np
import os
import random

#### Sampling the Training Set

In [2]:
count = 0

# I will iterate over both the human and computer folders
for folder in os.listdir('HumComp/HumComp_Array/'):

    # List this class's array files once, rather than on every iteration
    files = os.listdir(f'HumComp/HumComp_Array/{folder}/')

    # For each of the 8,000 clips in this class, I will randomly select one array file
    for i in range(8000):
        file = random.choice(files)
        # Load the selected array
        array = np.load(f'HumComp/HumComp_Array/{folder}/{file}')

        # I will randomly select a 100-row sample from the array for modeling
        # The latest valid starting row is 100 rows before the end of the array
        # (random.randint includes both endpoints, so start can equal length)
        length = (len(array) - 100)
        start = random.randint(0, length)
        end = start + 100
        # The array clip will be saved as the 100 rows from the start -> end row
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'CLIP--' + str(count) + '--' + file
        np.save(f'HumComp/HumComp_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 16000 array clips.
Done


#### Sampling the Test Set

In [3]:
count = 0

# I will iterate over both the human and computer folders
for folder in os.listdir('HumComp/HumComp_Test_Set_Arrays/'):

    # List this class's array files once, rather than on every iteration
    files = os.listdir(f'HumComp/HumComp_Test_Set_Arrays/{folder}/')

    # For each of the 1,000 clips in this class, I will randomly select one array file
    for i in range(1000):
        file = random.choice(files)
        # Load the selected array
        array = np.load(f'HumComp/HumComp_Test_Set_Arrays/{folder}/{file}')

        # I will randomly select a 100-row sample from the array for modeling
        # The latest valid starting row is 100 rows before the end of the array
        # (random.randint includes both endpoints, so start can equal length)
        length = (len(array) - 100)
        start = random.randint(0, length)
        end = start + 100
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'TEST_CLIP--' + str(count) + '--' + file
        np.save(f'HumComp/HumComp_Test_Set_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 2000 array clips.
Done


#### Sampling the Validation Set

In [4]:
count = 0

for folder in os.listdir('HumComp/HumComp_Validation_Set_Arrays/'):

    # List this class's array files once, rather than on every iteration
    files = os.listdir(f'HumComp/HumComp_Validation_Set_Arrays/{folder}/')

    # For each of the 1,000 clips in this class, I will randomly select one array file
    for i in range(1000):
        file = random.choice(files)
        # Load the selected array
        array = np.load(f'HumComp/HumComp_Validation_Set_Arrays/{folder}/{file}')

        # I will randomly select a 100-row sample from the array for modeling
        # The latest valid starting row is 100 rows before the end of the array
        # (random.randint includes both endpoints, so start can equal length)
        length = (len(array) - 100)
        start = random.randint(0, length)
        end = start + 100
        array = array[start: end]

        # I will create a new file name and save the array clip
        name = 'VAL_CLIP--' + str(count) + '--' + file
        np.save(f'HumComp/HumComp_Validation_Set_Data/{folder}/{name}', array)

        count += 1    
        print(f'Processed {count} array clips.', end = '\r')
        
print('\nDone')

Processed 2000 array clips.
Done
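
The three sampling cells above differ only in their source folder, destination folder, clip count, and file name prefix. One possible way to express that shared logic is a single parameterized function. This is only a sketch: the name `sample_clips` and its parameters are hypothetical, and two details are assumptions not present in the cells above, namely that the destination folders are created if missing and that class folders are visited in sorted order.

```python
import os
import random

import numpy as np


def sample_clips(src_root, dst_root, clips_per_class, prefix, clip_len=100):
    """Save `clips_per_class` random clips for every class folder in `src_root`.

    Hypothetical helper for illustration only. Returns the number of clips saved.
    """
    count = 0
    for folder in sorted(os.listdir(src_root)):
        src_dir = os.path.join(src_root, folder)
        if not os.path.isdir(src_dir):
            continue  # Skip stray files sitting next to the class folders

        dst_dir = os.path.join(dst_root, folder)
        os.makedirs(dst_dir, exist_ok=True)  # Assumption: create if missing

        # List the class's array files once
        files = os.listdir(src_dir)

        for _ in range(clips_per_class):
            file = random.choice(files)
            array = np.load(os.path.join(src_dir, file))

            # Pick a random start row so that the clip fits inside the array
            start = random.randint(0, len(array) - clip_len)
            clip = array[start:start + clip_len]

            # Same naming scheme as the cells above: prefix, counter, source file
            np.save(os.path.join(dst_dir, f'{prefix}--{count}--{file}'), clip)
            count += 1

    return count
```

With the folder names used in this notebook, the training cell would correspond to `sample_clips('HumComp/HumComp_Array/', 'HumComp/HumComp_Data/', 8000, 'CLIP')`, and the test and validation cells would use 1,000 clips per class with the `TEST_CLIP` and `VAL_CLIP` prefixes.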


Now that I have my training, validation, and test sets organized into coherent file structures, it is time to start modeling.
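
Before modeling, it may be worth confirming that each output folder holds the expected number of clips and that every clip has exactly 100 rows. The check below is a sketch rather than part of the original workflow: `check_clips` is a hypothetical name, and it assumes the clips were saved as `.npy` files into folders that were empty beforehand.

```python
import os

import numpy as np


def check_clips(root, expected_per_class, clip_len=100):
    """Verify the clip count and clip length for every class folder in `root`.

    Hypothetical helper for illustration only. Returns {class folder: clip count}.
    """
    counts = {}
    for folder in sorted(os.listdir(root)):
        class_dir = os.path.join(root, folder)
        if not os.path.isdir(class_dir):
            continue

        # Assumption: every clip was written by np.save and ends in .npy
        files = [f for f in os.listdir(class_dir) if f.endswith('.npy')]

        for f in files:
            clip = np.load(os.path.join(class_dir, f))
            if len(clip) != clip_len:
                raise ValueError(f'{folder}/{f} has {len(clip)} rows, expected {clip_len}.')

        if len(files) != expected_per_class:
            raise ValueError(f'{folder} holds {len(files)} clips, expected {expected_per_class}.')

        counts[folder] = len(files)

    return counts


# Only run against the notebook's folders when they exist on this machine
expected = {
    'HumComp/HumComp_Data/': 8000,
    'HumComp/HumComp_Test_Set_Data/': 1000,
    'HumComp/HumComp_Validation_Set_Data/': 1000,
}
for root, per_class in expected.items():
    if os.path.isdir(root):
        print(root, check_clips(root, per_class))
```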

<p style='text-align: right;'> <b> Next Step: </b> Build RNN Binary Classification Model - <em> RNN_Human_Detection.ipynb </em> </p>