# Split dataset

- [Link to Urban Sound 8k dataset description](https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html)

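The dataset is provided in 10 folders (`fold1` to `fold10`), and the `fold` column of the metadata records which folder each clip belongs to.

Here we define the training, validation, and test sets used for further evaluation.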

In [2]:
import os
import pandas as pd

# load original listing
DATADIR = '../../dataset/UrbanSound8K'
metadata = pd.read_csv(os.path.join(DATADIR, 'metadata', 'UrbanSound8K.csv'))
DATASET_PREFIX = 'usd_'

In [3]:
# check the number of samples per class, i.e. the class balance of each fold
classIDs = sorted(metadata['classID'].unique())
foldIDs = sorted(metadata['fold'].unique())
# combine both conditions into one boolean mask instead of chaining two indexers
samples_per_class = [[len(metadata[(metadata.fold == f) & (metadata.classID == i)]) for i in classIDs] for f in foldIDs]
samples_per_class

[[100, 36, 100, 100, 100, 96, 35, 120, 86, 100],
 [100, 42, 100, 100, 100, 100, 35, 120, 91, 100],
 [100, 43, 100, 100, 100, 107, 36, 120, 119, 100],
 [100, 59, 100, 100, 100, 107, 38, 120, 166, 100],
 [100, 98, 100, 100, 100, 107, 40, 120, 71, 100],
 [100, 28, 100, 100, 100, 107, 46, 68, 74, 100],
 [100, 28, 100, 100, 100, 106, 51, 76, 77, 100],
 [100, 30, 100, 100, 100, 88, 30, 78, 80, 100],
 [100, 32, 100, 100, 100, 89, 31, 82, 82, 100],
 [100, 33, 100, 100, 100, 93, 32, 96, 83, 100]]

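The classes are imbalanced: classIDs 1 (car_horn), 6 (gun_shot), and 8 (siren) have 429, 374, and 929 samples in total, while every other class has 1000. These 3 classes need more data.

Still, the class balance is roughly the same across all folders.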
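The per-fold class counts can also be tabulated in one call. The sketch below uses `pd.crosstab` on a small made-up frame; `toy` and its values are hypothetical and only reuse the `fold` and `classID` column names from the metadata.

```python
import pandas as pd

# Hypothetical stand-in for the metadata: only the two columns used here.
toy = pd.DataFrame({
    'fold':    [1, 1, 1, 2, 2, 2],
    'classID': [0, 0, 1, 0, 1, 1],
})

# Rows are folds, columns are class IDs, cells are sample counts.
counts = pd.crosstab(toy['fold'], toy['classID'])

# Share of each class inside every fold, to compare the balance across folds.
ratios = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(ratios.round(2))
```

Dividing by the row sums makes folds of different sizes comparable, which the raw counts alone do not.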

## Check for duplicated sources
When handling audio datasets, we need to be careful not to have clips from the same source recording in both the training and validation sets.

Here I check whether any source is shared between folders, that is, whether the same Freesound ID (fsID) appears in more than one folder.
As long as the split never separates two folders that share an fsID, it is safe.

In [4]:
import numpy as np
fsIDs = [metadata[metadata.fold == f].fsID.unique() for f in foldIDs]

In [5]:
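import itertools
# i and j are 0-based positions in fsIDs, so position 0 is the first entry of foldIDs
for i, j in itertools.combinations(range(len(fsIDs)), 2):
    _or = np.append(fsIDs[i], fsIDs[j])   # all source IDs of both folders, duplicates kept
    _and = np.unique(_or)                 # the same IDs with duplicates removed
    if len(_or) != len(_and):             # a shorter unique list means shared sources
        print('Duplicated in folder %d and %d, for %d sources' % (i, j, len(_or) - len(_and)))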

Duplicated in folder 0 and 3, for 2 sources
Duplicated in folder 0 and 7, for 1 sources
Duplicated in folder 0 and 8, for 1 sources
Duplicated in folder 1 and 6, for 1 sources


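This means that the two folders of any pair listed above should not end up on opposite sides of a split, for example one in training and the other in validation. The numbers printed are 0-based positions in `fsIDs`, not values of the `fold` column: since `fold` is numbered from 1 to 10, position 0 corresponds to fold 1 and position 3 to fold 4.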
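The same check can be written per source instead of per folder pair, which also reports which fsIDs are involved and in which folds they appear. This is a minimal sketch on hypothetical toy data (`toy`, `folds_per_source`, `shared`, and `where` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical data: source 103 appears in folds 1 and 3.
toy = pd.DataFrame({
    'fsID': [101, 101, 102, 103, 103, 104],
    'fold': [1,   1,   2,   1,   3,   3],
})

# Number of distinct folds each source appears in.
folds_per_source = toy.groupby('fsID')['fold'].nunique()

# Sources spread over more than one fold are the ones that can leak.
shared = folds_per_source[folds_per_source > 1].index.tolist()

# For each shared source, list the fold values it appears in.
where = {fs: sorted(toy.loc[toy['fsID'] == fs, 'fold'].unique().tolist())
         for fs in shared}
print(where)
```

Because the result is keyed by the values of the `fold` column, it avoids any confusion between list positions and fold numbers.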

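## Splitting the dataset into train/valid/test

To avoid the combinations above, use the following folder assignment. These values are compared directly with the `fold` column, which runs from 1 to 10, so the entry 0 selects no rows and fold 10 is not assigned to any set. The sample counts below (5969, 990, and 936) correspond to folds 1, 2, 3, 6, 7, 8, and 9 for training, fold 4 for validation, and fold 5 for test. Shift the positions reported by the duplicate check by one before comparing them with this assignment.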

In [6]:
trainfolders = [0, 1, 2, 3,   6, 7, 8, 9]
validfolders = [4]
testfolders = [5]
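Before writing the lists, it is worth checking that a chosen assignment really shares no source between sets. The helper below is a hedged sketch: `shared_sources` is a hypothetical function, and the `toy` frame is made up; it only assumes the `fsID` and `fold` column names used in the metadata.

```python
import pandas as pd

def shared_sources(df, folds_a, folds_b):
    """Return the fsIDs that occur in both groups of folds."""
    ids_a = set(df.loc[df['fold'].isin(folds_a), 'fsID'])
    ids_b = set(df.loc[df['fold'].isin(folds_b), 'fsID'])
    return ids_a & ids_b

# Hypothetical data: source 103 appears in folds 1 and 3.
toy = pd.DataFrame({
    'fsID': [101, 102, 103, 103, 104, 105],
    'fold': [1,   2,   1,   3,   3,   4],
})

train, valid, test = [1, 2], [3], [4]

# A non-empty result means the two sets share at least one source recording.
print(shared_sources(toy, train, valid))
print(shared_sources(toy, train, test))
```

Running the same helper on `metadata` with `trainfolders`, `validfolders`, and `testfolders` would show whether the assignment above is free of shared sources.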

In [7]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
traindata = pd.concat([metadata[metadata.fold == f] for f in trainfolders])
traindata.to_csv(os.path.join('.', DATASET_PREFIX + 'train_list.csv'))
traindata.count()

slice_file_name    5969
fsID               5969
start              5969
end                5969
salience           5969
fold               5969
classID            5969
class              5969
dtype: int64

In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
valdata = pd.concat([metadata[metadata.fold == f] for f in validfolders])
valdata.to_csv(os.path.join('.', DATASET_PREFIX + 'validation_list.csv'))
valdata.count()

slice_file_name    990
fsID               990
start              990
end                990
salience           990
fold               990
classID            990
class              990
dtype: int64

In [9]:
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
testdata = pd.concat([metadata[metadata.fold == f] for f in testfolders])
testdata.to_csv(os.path.join('.', DATASET_PREFIX + 'test_list.csv'))
testdata.count()

slice_file_name    936
fsID               936
start              936
end                936
salience           936
fold               936
classID            936
class              936
dtype: int64
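The three lists are written with the DataFrame index included, so reading them back needs `index_col=0` to recover the original row labels. The sketch below shows the round trip on a hypothetical two-row frame written to a temporary directory; the file name only mimics the ones used above.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with a non-trivial index, like a slice of the metadata.
toy = pd.DataFrame(
    {'slice_file_name': ['a.wav', 'b.wav'], 'fold': [4, 4]},
    index=[7, 9],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'usd_validation_list.csv')
    toy.to_csv(path)                           # the index becomes the first column
    naive = pd.read_csv(path)                  # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)  # index restored as row labels

print(list(naive.columns))
print(restored)
```

Passing `index=False` to `to_csv` is the alternative when the original row labels are not needed later.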