# Split dataset

- [Link to Speech Commands Dataset description](https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html)

The dataset ships with lists of validation and test files (`validation_list.txt` and `testing_list.txt`), but no explicit training list.

Here we derive the complete training, validation and test sets from those lists for further evaluation.

In [2]:
import os
import glob

DATADIR = '../../dataset/speech_commands_v0.01'
DATASET_PREFIX = 'scd_'

# Get the whole list of files, as paths relative to DATADIR ('<class>/<file>.wav')
fullwavs = [f.replace(DATADIR+'/', '') for f in glob.glob(os.path.join(DATADIR, '*/*.wav'))]
# Drop files whose folder name begins with '_' (this removes _background_noise_)
wavs = [f for f in fullwavs if f[0] != '_']

# Load valid/test lists, then set the train list as (wavs - valid - test)
with open(os.path.join(DATADIR, 'validation_list.txt')) as f:
    validlist = f.read().splitlines()
with open(os.path.join(DATADIR, 'testing_list.txt')) as f:
    testlist = f.read().splitlines()
# A set gives constant-time membership tests instead of scanning both lists per file
heldout = set(validlist) | set(testlist)
trainlist = [f for f in wavs if f not in heldout]

In [3]:
print('Total number of files =', len(fullwavs))
print('Available speech files =', len(wavs))
print('Training files =', len(trainlist))
print('Validation files =', len(validlist))
print('Test files =', len(testlist))

Total number of files = 64727
Available speech files = 64721
Training files = 51088
Validation files = 6798
Test files = 6835
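
The counts above are consistent: 51088 + 6798 + 6835 = 64721, which matches the number of available speech files. Matching totals alone do not prove that the three lists are disjoint, so a stricter check could look like the sketch below. The helper name `check_partition` and the toy file names are hypothetical, introduced only for illustration; in the notebook the arguments would be `wavs`, `trainlist`, `validlist` and `testlist`.

```python
def check_partition(all_files, train, valid, test):
    """Return True if train/valid/test are pairwise disjoint and together cover all_files."""
    train_s, valid_s, test_s = set(train), set(valid), set(test)
    # No file may appear in more than one split
    disjoint = not (train_s & valid_s or train_s & test_s or valid_s & test_s)
    # Every available file must land in exactly one of the splits
    covers = (train_s | valid_s | test_s) == set(all_files)
    return disjoint and covers

# Toy data standing in for wavs, trainlist, validlist and testlist (made-up names)
toy_all = ['yes/a.wav', 'yes/b.wav', 'no/c.wav', 'no/d.wav']
ok = check_partition(toy_all, ['yes/a.wav', 'no/d.wav'], ['yes/b.wav'], ['no/c.wav'])
bad = check_partition(toy_all, ['yes/a.wav', 'yes/b.wav'], ['yes/b.wav'], ['no/c.wav'])
print(ok, bad)
```

In the toy data the second call fails because `yes/b.wav` sits in both the training and validation lists.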


In [4]:
clslist = list(set([f.split('/')[0] for f in testlist]))
print('%d classes are' % len(clslist), clslist)

30 classes are ['eight', 'one', 'tree', 'bird', 'two', 'seven', 'down', 'yes', 'off', 'nine', 'no', 'up', 'cat', 'left', 'bed', 'dog', 'marvin', 'happy', 'right', 'on', 'wow', 'three', 'sheila', 'stop', 'house', 'zero', 'five', 'four', 'go', 'six']


In [5]:
print('train_distribution', [len([f for f in trainlist if f.split('/')[0] == cls]) for cls in clslist])
print('valid_distribution', [len([f for f in validlist if f.split('/')[0] == cls]) for cls in clslist])
print('test_distribution', [len([f for f in testlist if f.split('/')[0] == cls]) for cls in clslist])

train_distribution [1852, 1892, 1374, 1411, 1873, 1875, 1842, 1860, 1839, 1875, 1853, 1843, 1399, 1839, 1340, 1396, 1424, 1373, 1852, 1864, 1414, 1841, 1372, 1885, 1427, 1866, 1844, 1839, 1861, 1863]
valid_distribution [243, 230, 166, 162, 236, 263, 264, 261, 256, 230, 270, 260, 168, 247, 197, 170, 160, 189, 256, 257, 166, 248, 176, 246, 173, 260, 242, 280, 260, 262]
test_distribution [257, 248, 193, 158, 264, 239, 253, 256, 262, 259, 252, 272, 166, 267, 176, 180, 162, 180, 259, 246, 165, 267, 186, 249, 150, 250, 271, 253, 251, 244]


The classes are imbalanced (training counts range from 1340 to 1892 files per class), but that is acceptable for now.
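
To put a number on the imbalance, the per-class counts can be gathered in one pass and summarised as the ratio of the largest class to the smallest. For the training counts printed above that ratio is 1892 / 1340, roughly 1.41. The following is a minimal sketch; `class_counts`, `imbalance_ratio` and the toy file names are hypothetical and not part of the notebook.

```python
from collections import Counter

def class_counts(file_list):
    """Count files per class, where the class is the folder part of '<class>/<file>.wav'."""
    return Counter(f.split('/')[0] for f in file_list)

def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())

# Toy list standing in for trainlist (made-up file names)
toy_list = ['yes/a.wav', 'yes/b.wav', 'yes/c.wav', 'no/d.wav', 'no/e.wav', 'bed/f.wav']
counts = class_counts(toy_list)
ratio = imbalance_ratio(counts)
print(dict(counts), ratio)
```

Using a `Counter` also avoids rescanning the whole list once per class, which is what the list comprehensions in the cell above do.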

In [6]:
with open(os.path.join('.', DATASET_PREFIX+'trainset.txt'), 'w') as f:
    f.write('\n'.join(trainlist)+'\n')

In [7]:
with open(os.path.join('.', DATASET_PREFIX+'validationset.txt'), 'w') as f:
    f.write('\n'.join(validlist)+'\n')

In [8]:
with open(os.path.join('.', DATASET_PREFIX+'testset.txt'), 'w') as f:
    f.write('\n'.join(testlist)+'\n')

In [9]:
with open(os.path.join('.', DATASET_PREFIX+'classes.txt'), 'w') as f:
    f.write('\n'.join(clslist)+'\n')
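
The four cells above repeat the same write pattern, so it could be factored into a small helper, paired with a reader that restores the list later. This is only a sketch: `write_list` and `read_list` are hypothetical names, and the example writes to a temporary directory rather than to the notebook's working directory.

```python
import os
import tempfile

def write_list(path, items):
    """Write one item per line, ending with a trailing newline."""
    with open(path, 'w') as f:
        f.write('\n'.join(items) + '\n')

def read_list(path):
    """Read a file written by write_list back into a list of strings."""
    with open(path) as f:
        return f.read().splitlines()

# Round-trip a toy list (made-up file names) through a temporary file
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'scd_demo.txt')
    items = ['yes/a.wav', 'no/b.wav']
    write_list(path, items)
    restored = read_list(path)
print(restored == items)
```

Reading with `splitlines()` mirrors how the notebook loads `validation_list.txt` and `testing_list.txt`, so the saved files can be consumed the same way.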

## Duplication (leakage) test

In [10]:
# The part of each file name before the first '_' identifies the recording source
train_ids = list(set([fn.split('/')[1].split('_')[0] for fn in trainlist]))
valid_ids = list(set([fn.split('/')[1].split('_')[0] for fn in validlist]))
test_ids = list(set([fn.split('/')[1].split('_')[0] for fn in testlist]))

In [11]:
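import itertools
import numpy as np
fsIDs = [train_ids, valid_ids, test_ids]  # indices: 0 = train, 1 = valid, 2 = test
for i, j in itertools.combinations(range(len(fsIDs)), 2):
    # Each ID list is already unique, so concatenating two lists and then
    # de-duplicating shrinks the result only if some ID appears in both.
    combined = np.append(fsIDs[i], fsIDs[j])
    unique = np.unique(combined)
    if len(combined) != len(unique):
        print('Duplicated in set %d and %d, for %d sources' % (i, j, len(combined) - len(unique)))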

No output, so no source ID appears in more than one set: there is no leakage between the splits.
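
The same test can be written with set intersections, which also reports which source IDs are shared rather than only how many. The sketch below is an alternative formulation, not the notebook's own code; `source_id`, `find_leaks` and the toy file names are hypothetical.

```python
import itertools

def source_id(path):
    """Extract the source ID: the file-name part before the first underscore."""
    return path.split('/')[1].split('_')[0]

def find_leaks(named_lists):
    """Map each pair of set names to the source IDs they share (non-empty pairs only)."""
    ids = {name: {source_id(p) for p in paths} for name, paths in named_lists.items()}
    leaks = {}
    for a, b in itertools.combinations(ids, 2):
        shared = ids[a] & ids[b]
        if shared:
            leaks[(a, b)] = shared
    return leaks

# Toy lists with made-up file names; source 'bbb' is deliberately in both train and test
toy = {
    'train': ['yes/aaa_nohash_0.wav', 'no/bbb_nohash_0.wav'],
    'valid': ['yes/ccc_nohash_0.wav'],
    'test': ['no/bbb_nohash_1.wav'],
}
leaks = find_leaks(toy)
print(leaks)
```

With the notebook's data, `find_leaks({'train': trainlist, 'valid': validlist, 'test': testlist})` would be expected to return an empty dict, matching the empty output above.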