# IRMAS Splitting

This notebook creates a train/test split of the IRMAS training set.

We do this rather than use the original IRMAS testing set so that the training and testing data share the same distribution.
The original IRMAS testing set differs from the training set in a few ways, most notably in containing variable-duration and multi-labeled examples.

In [1]:
import h5py
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

In [2]:
data = h5py.File('embeddings.h5', mode='r')

In [10]:
file_names = list(data['irmas/openl3/keys'][()])
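
Depending on how the keys were stored and on the h5py version, string datasets may come back as `bytes` rather than `str`, which would then show up as `b'...'` in any CSV written from them. Whether that applies to this file is an assumption, not something the notebook confirms. A minimal sketch of a defensive decode, using a hypothetical helper `to_text` and made-up keys:

```python
def to_text(key):
    """Return key as str, decoding it first if it is a bytes object."""
    # to_text is a hypothetical helper, not part of h5py or pandas.
    return key.decode("utf-8") if isinstance(key, bytes) else str(key)


# Made-up keys standing in for the contents of the HDF5 dataset.
raw_keys = [b"001__example.wav", "002__example.wav"]
decoded_keys = [to_text(key) for key in raw_keys]
print(decoded_keys)
```

If the keys are already `str`, the helper leaves them unchanged.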

In [13]:
# Use the first three characters of each file name as its group label
prefixes = [name[:3] for name in file_names]

In [24]:
# Fixed seed for reproducibility; test_size is the fraction of groups held out
splitter = GroupShuffleSplit(n_splits=1, random_state=20220419, test_size=0.25)
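
With `GroupShuffleSplit`, a float `test_size` is the proportion of groups, not of individual examples, so the share of files that ends up in the test set can drift away from 25% when groups differ in size. The sketch below illustrates this on made-up data; the group labels and sizes are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical groups of unequal size: "a" has 7 items, the others 1 each.
demo_groups = np.array(["a"] * 7 + ["b", "c", "d"])
demo_items = np.arange(len(demo_groups))

demo_splitter = GroupShuffleSplit(n_splits=1, random_state=0, test_size=0.25)
_, demo_test = next(demo_splitter.split(demo_items, groups=demo_groups))

# One of the four groups is held out, whatever its size.
n_test_groups = len(set(demo_groups[demo_test]))
# The fraction of items held out depends on which group was drawn.
test_fraction = len(demo_test) / len(demo_items)
print(n_test_groups, test_fraction)
```

Comparing `len(test_ids) / len(file_names)` against 0.25 on the real data is a quick way to see how far the two notions diverge here.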

In [25]:
train_ids, test_ids = next(splitter.split(file_names, groups=prefixes))
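
The point of splitting by group is that no prefix appears in both subsets. A minimal sketch of that sanity check on toy data follows; the file names are hypothetical and only mimic the idea of a three-character prefix. The same two assertions could be run on `train_ids`, `test_ids`, and `prefixes` from the cells above.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical file names: 20 prefixes with 3 clips each.
toy_names = [f"{g:03d}__clip{i}.wav" for g in range(20) for i in range(3)]
toy_groups = [name[:3] for name in toy_names]

toy_splitter = GroupShuffleSplit(n_splits=1, random_state=0, test_size=0.25)
toy_train, toy_test = next(toy_splitter.split(toy_names, groups=toy_groups))

train_groups = {toy_groups[i] for i in toy_train}
test_groups = {toy_groups[i] for i in toy_test}

# No prefix should appear on both sides of the split.
assert train_groups.isdisjoint(test_groups)
# Every example should land in exactly one subset.
assert len(toy_train) + len(toy_test) == len(toy_names)
```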

In [27]:
all_files = pd.Series(file_names)

In [32]:
train_files = all_files[train_ids]

In [34]:
test_files = all_files[test_ids]

In [35]:
train_files

In [36]:
test_files

In [39]:
train_files.to_csv('irmas_train.csv', header=False, index=False)

In [40]:
test_files.to_csv('irmas_test.csv', header=False, index=False)
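
Because the CSV files are written without a header or index, they need `header=None` when read back, otherwise the first file name is consumed as a column name. A small round-trip sketch, using an in-memory buffer and made-up file names instead of the real output files:

```python
import io

import pandas as pd

# Made-up file names standing in for the real split.
toy_files = pd.Series(["001__a.wav", "002__b.wav", "003__c.wav"])

# Write without header or index, as in the cells above.
buffer = io.StringIO()
toy_files.to_csv(buffer, header=False, index=False)
buffer.seek(0)

# header=None keeps the first row as data; column 0 holds the file names.
restored = pd.read_csv(buffer, header=None)[0]
assert restored.tolist() == toy_files.tolist()
```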