# DataLoader for Meertens Tune Collections

In [1]:
import sys
sys.path.append('../src/')

In [2]:
#from dataloader import MTCDataLoader
%run ../src/dataloader.py

This notebook describes the class `DataLoader` and derived class `MTCDataLoader`.

A `DataLoader` object takes a `.jsonl` file (optionally gzipped) as its source: a text file with one JSON object per line. Each object contains metadata fields and several sequences of feature values. For example:
```
{'id': 'NLB178968_01',
 'type': 'vocal',
 'features': {'pitch40': [135, 141, 147,             [...] 158, 135],
              'scaledegree': [1, 2, 3, 4, 5, 1, 6,   [...] 2, 5, 1],
              'tonic': ['E', 'E', 'E', 'E', 'E',     [...] 'E', 'E'],
              'mode': ['major', 'major', 'major',    [...] 'major'],
              'pitch': ['E4', 'F#4', 'G#4', 'A4',    [...] 'B4', 'E4'],
              'midipitch': [64, 66, 68, 69, 71, 76,  [...] 71, 64],
              'diatonicinterval': [0, 1, 1, 1, 1, 3, [...] -6, 3, -4],
              'chromaticinterval': [0, 2, 2, 1, 2,   [...] -10, 5, -7],
              'contour3': ["=", "+", "+", "+", "+",  [...] "-", "+", "-"],
              'contour5': ["=", "+", "+", "+", "+",  [...] "--", "++", "--"],
              'duration': [0.125, 0.125, 0.125,      [...] 0.25, 0.5],
              'IOR': [1.0, 1.0, 1.0, 1.0, 2.0, 1.0,  [...] 1.0, 2.0],
              'beat': ['1', '1', '2', '2', '1', '2', [...] '2', '1'],
              'beat_fraction': ['0', '1/2', '0',     [...] '0', '0'],
              'beatstrength': [1.0, 0.25, 0.5, 0.25, [...] 1.0, 0.5, 1.0],
              'metriccontour': ['+', '-', '+', '-',  [...] '-', '+'],
              'imaweight': [0.810269, 0.068949,      [...] 0.843521],
              'imacontour': ['+', '-', '+', '-',     [...] '-', '+'],
              'timesignature': ["2/4", "2/4", "2/4", [...] "2/4", "2/4"],
              'phrasepos': [0.0, 0.071429, 0.142857, [...] 0.833333, 1.0]},
 'year': 1866,
 'tunefamily': '1302_0',
 'ann_bgcorpus': True}
```
In this example the metadata fields are `id`, `type`, `year`, `tunefamily`, and `ann_bgcorpus`. The named object `features` contains several sequences of feature values. For DataLoader to function properly, the only required object is `features`.
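
As an illustration of the file format only, the following sketch writes and reads a small gzipped `.jsonl` file with the Python standard library. The helper `read_jsonl`, the file name, and the two records are made up for this example and are not part of `DataLoader`. Note that the object shown above is printed as a Python dict; on disk each line is strict JSON (double quotes, `true` instead of `True`).

```python
import gzip
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one dict per non-empty line; gzip is chosen by file extension."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny made-up records, written to a temporary file for demonstration.
records = [
    {'id': 'demo_01', 'type': 'vocal', 'features': {'midipitch': [64, 66, 68]}},
    {'id': 'demo_02', 'features': {'midipitch': [60, 62]}},  # only 'features' present
]
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_sequences.jsonl.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    for rec in records:
        f.write(json.dumps(rec) + '\n')

loaded = list(read_jsonl(demo_path))
print([obj['id'] for obj in loaded])
```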

The `DataLoader` class provides various functionalities:
* Melody Filtering : select melodies according to given criteria
* Feature selection : keep subset of features
* Feature extraction : compute a new feature from existing features and add it to the object
* Train/test splitting : split the data into train and test sets while respecting groupings

Operations can be chained. All feature extractors, feature selectors and object filters return an iterator over the sequences. Each has an argument `seq_iter`. If `seq_iter` is `None` (the default), the `.jsonl` file is taken as data source and a new iterator is created. Otherwise the provided iterator is taken as data source.
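
The chaining pattern can be sketched with plain generators. `TinyLoader` and its methods `keep_type` and `keep_after` are hypothetical stand-ins written for this illustration; they only mimic the `seq_iter` convention described above and are not the real implementation.

```python
class TinyLoader:
    """Hypothetical stand-in that mimics the seq_iter convention."""

    def __init__(self, objects):
        self.objects = objects  # stands in for the .jsonl source

    def sequences(self):
        # A fresh iterator over the data source.
        return iter(self.objects)

    def keep_type(self, wanted, seq_iter=None):
        if seq_iter is None:  # no upstream iterator: start from the source
            seq_iter = self.sequences()
        for obj in seq_iter:
            if obj.get('type') == wanted:
                yield obj

    def keep_after(self, year, seq_iter=None):
        if seq_iter is None:
            seq_iter = self.sequences()
        for obj in seq_iter:
            if obj['year'] > year:
                yield obj

loader = TinyLoader([
    {'id': 'a', 'type': 'vocal', 'year': 1850},
    {'id': 'b', 'type': 'vocal', 'year': 1950},
    {'id': 'c', 'type': 'instrumental', 'year': 1960},
])

# The inner call reads from the source; the outer call reads from the inner one.
chained = loader.keep_after(1900, seq_iter=loader.keep_type('vocal'))
chained_ids = [obj['id'] for obj in chained]
print(chained_ids)
```

A generator like the ones in this sketch can be consumed only once, so wrap the result in `list(...)` if it is needed more than once.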

## Melody Filters

### Available filters

The following filters are registered in class `MTCDataLoader`:

* `vocal` : Only keep vocal melodies
* `instrumental` : Only keep instrumental melodies
* `ann_bgcorpus` : Only keep melodies unrelated to MTC-ANN (only applicable to MTC-FS-INST)
* `labeled` : Only keep melodies with a tune family label
* `unlabeled`: Only keep melodies without a tune family label
* `afteryear(year)` : Only keep melodies in sources dated later than `year` (`year` not included)
* `beforeyear(year)` : Only keep melodies in sources dated before `year` (`year` not included)
* `betweenyears(year1, year2)` : Only keep melodies in sources dated between `year1` and `year2` (both not included)
* `inOGL` : Only keep melodies that are part of Onder de Groene Linde
* `inList(id_list)` : Only keep melodies with given identifiers in `id_list`

Available as separate methods:

* `DataLoader.minClassSizeFilter(self, classfeature, mininum=0, seq_iter=None)` : Keeps only melodies in classes with >= `minimum` members.<br>
`classfeature` (string) : name of the feature to use for counting.
* `DataLoader.maxClassSizeFilter(self, classfeature, maximum=100, seq_iter=None)` : Keeps only melodies in classes with <= `maximum` members.<br>
`classfeature` (string) : name of the feature to use for counting.
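
The idea behind the two class-size filters can be sketched as follows. This is an assumption-laden illustration, not the library code: the functions `min_class_size` and `max_class_size` are hypothetical, and the sketch looks up `classfeature` as a top-level key of each object, as `tunefamily` is in the example object above.

```python
from collections import Counter

def min_class_size(objects, classfeature, minimum=0):
    """Keep objects whose class has at least `minimum` members."""
    objects = list(objects)  # counting needs a full pass before filtering
    counts = Counter(obj[classfeature] for obj in objects)
    return [obj for obj in objects if counts[obj[classfeature]] >= minimum]

def max_class_size(objects, classfeature, maximum=100):
    """Keep objects whose class has at most `maximum` members."""
    objects = list(objects)
    counts = Counter(obj[classfeature] for obj in objects)
    return [obj for obj in objects if counts[obj[classfeature]] <= maximum]

# Made-up data: family A has 3 members, B has 1, C has 2.
melodies = [{'id': f'm{i}', 'tunefamily': fam}
            for i, fam in enumerate(['A', 'A', 'A', 'B', 'C', 'C'])]

big = min_class_size(melodies, 'tunefamily', 2)
small = max_class_size(melodies, 'tunefamily', 2)
print([m['tunefamily'] for m in big])
print([m['tunefamily'] for m in small])
```

Both bounds are inclusive, so a class with exactly `minimum` (or `maximum`) members is kept.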

### How to: apply filter

In [3]:
dl = MTCDataLoader('../data/mtcann_sequences.jsonl.gz')
seq_iter = dl.applyFilter('vocal')

If a filter has arguments, these should be provided together with the filter name as a tuple.

In [4]:
seq_iter = dl.applyFilter( ('afteryear', 1950) )
seq_iter = dl.applyFilter( ('betweenyears', 1850, 1900) )

Keep only songs in tune families with at least 10 members:

In [5]:
seq_iter = dl.minClassSizeFilter('tunefamily', 10)

### How to: register a filter

Use method `DataLoader.registerFilter(self, name, o_filter)`
<br>
`o_filter` : function that takes an object and returns `False` if the object should be kept (and `True` if it should be dropped).

In [6]:
dl.registerFilter('vocal', lambda x: not ( x['type'] == 'vocal'))

Register a filter with arguments:

In [7]:
dl.registerFilter('afteryear', lambda y: lambda x: not ( x['year'] > y ))
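
To make the "return `False` to keep" convention concrete, here is a minimal sketch of how a registry of such functions might be applied. The `registry` dict and `apply_filter` function are hypothetical and only imitate the documented behaviour. The `vocal` and `afteryear` lambdas are copied from the cells above, while the `betweenyears` lambda is an assumption reconstructed from its description (both bounds excluded).

```python
# 'vocal' and 'afteryear' are taken from the registration examples;
# 'betweenyears' is a hypothetical reconstruction, not the library's code.
registry = {
    'vocal': lambda x: not (x['type'] == 'vocal'),
    'afteryear': lambda y: lambda x: not (x['year'] > y),
    'betweenyears': lambda y1, y2: lambda x: not (y1 < x['year'] < y2),
}

def apply_filter(spec, objects):
    """Apply a filter given by name or by a (name, arg1, ...) tuple."""
    if isinstance(spec, tuple):
        name, *args = spec
        reject = registry[name](*args)  # build the predicate from its arguments
    else:
        reject = registry[spec]
    # The predicate returns False for objects that should be kept.
    return [obj for obj in objects if not reject(obj)]

songs = [
    {'id': 's1', 'type': 'vocal', 'year': 1850},
    {'id': 's2', 'type': 'instrumental', 'year': 1875},
    {'id': 's3', 'type': 'vocal', 'year': 1900},
]

vocal_ids = [s['id'] for s in apply_filter('vocal', songs)]
after_ids = [s['id'] for s in apply_filter(('afteryear', 1850), songs)]
between_ids = [s['id'] for s in apply_filter(('betweenyears', 1850, 1900), songs)]
print(vocal_ids, after_ids, between_ids)
```

Because the year bounds are exclusive, the songs dated exactly 1850 and 1900 are dropped by the `betweenyears` example.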

## Feature Extractors

### Available Feature Extractors

In class `MTCDataLoader`:
* `full_beat_str` : concatenates `beat` and `beat_fraction`

The following Feature Extractor is available as a separate method:
<br>
`DataLoader.concatAllFeatures(self, name='concat', seq_iter=None)`<br>
`name` : name of the new feature<br>

### How to: apply a Feature Extractor

Use method `DataLoader.applyFeatureExtractor(self, name, seq_iter=None)`
<br>
`name` : name (string) of the extractor 

In [8]:
seq_iter = dl.applyFeatureExtractor('full_beat_str')
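
What a feature extractor does in general (derive a new sequence from existing ones and add it to `features`) can be sketched as below. The function `add_joined_beat`, the feature name `full_beat_demo`, and the space-separated output format are all assumptions made for this illustration; the exact output of the real extractor is not specified in this notebook.

```python
def add_joined_beat(objects, name='full_beat_demo'):
    """Add a feature that pairs each beat with its beat fraction (hypothetical format)."""
    for obj in objects:
        feats = dict(obj['features'])  # shallow copy; the input object is left untouched
        feats[name] = [f'{b} {frac}'
                       for b, frac in zip(feats['beat'], feats['beat_fraction'])]
        yield {**obj, 'features': feats}

song = {
    'id': 'demo',
    'features': {'beat': ['1', '1', '2'], 'beat_fraction': ['0', '1/2', '0']},
}

extended = next(add_joined_beat([song]))
print(extended['features']['full_beat_demo'])
```

The new sequence has one value per note, so it stays aligned with the other feature sequences of the same melody.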

## Feature Selector

For example, to retain only the features `midipitch` and `IOR`:

In [9]:
seq_iter = dl.selectFeatures(['midipitch', 'IOR'])
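
Conceptually, feature selection reduces the `features` object to the requested keys. The sketch below shows that idea with a hypothetical function `select_features`; it assumes that metadata fields are left in place, which this notebook does not state explicitly.

```python
def select_features(objects, names):
    """Yield copies of the objects whose 'features' hold only the given names."""
    for obj in objects:
        selected = dict(obj)  # shallow copy, metadata is carried over (assumption)
        selected['features'] = {n: obj['features'][n] for n in names}
        yield selected

song = {
    'id': 'demo',
    'year': 1866,
    'features': {
        'midipitch': [64, 66, 68],
        'IOR': [1.0, 1.0, 2.0],
        'duration': [0.125, 0.125, 0.25],
    },
}

reduced = next(select_features([song], ['midipitch', 'IOR']))
print(sorted(reduced['features']))
```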

## Generate test/train set

Use `DataLoader.train_test_split(self, groupby=None, test_size='default', train_size=None, random_state=None, seq_iter=None)`
<br>
`groupby` (string) : name of feature to use for group-level split<br>
`test_size`, `train_size`, `random_state` : see doc for sklearn.model_selection.train_test_split().
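
The purpose of `groupby` is to keep all members of a group on the same side of the split. The following standard-library sketch illustrates that idea only; `grouped_split` is a hypothetical function, it does not use scikit-learn, and it applies `test_size` to the number of groups, which may differ from how the real method interprets it.

```python
import random

def grouped_split(objects, groupby, test_size=0.1, random_state=None):
    """Split so that no group appears in both train and test (illustrative only)."""
    objects = list(objects)
    groups = sorted({obj[groupby] for obj in objects})
    rng = random.Random(random_state)  # seeded for reproducibility
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_size))  # fraction of groups, not objects
    test_groups = set(groups[:n_test])
    train = [o for o in objects if o[groupby] not in test_groups]
    test = [o for o in objects if o[groupby] in test_groups]
    return train, test

# Made-up data: 10 tune families with 3 melodies each.
objects = [{'id': f'{fam}_{i}', 'tunefamily': fam}
           for fam in 'ABCDEFGHIJ' for i in range(3)]

train, test = grouped_split(objects, 'tunefamily', test_size=0.2, random_state=0)
train_families = {o['tunefamily'] for o in train}
test_families = {o['tunefamily'] for o in test}
print(len(train), len(test), train_families & test_families)
```

Splitting at group level prevents melodies from the same tune family from leaking between the train and test sets.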

# Example Configurations

### Very basic feature configuration: pitch

objects: all songs in MTC-ANN
<br>
features: midipitch

In [10]:
dl = MTCDataLoader('../data/mtcann_sequences.jsonl.gz')
train, test = dl.train_test_split(test_size=0.1, seq_iter=dl.selectFeatures(['midipitch']))

### Very basic feature configuration: pitch and duration

objects: all songs in MTC-ANN
<br>
features: midipitch and duration

In [11]:
dl = MTCDataLoader('../data/mtcann_sequences.jsonl.gz')
train, test = dl.train_test_split(test_size=0.1, seq_iter=dl.selectFeatures(['midipitch', 'duration']))

### Basic feature configuration: intervals and inter-onset interval ratios

objects: all songs in MTC-ANN<br>
features: chromaticinterval and IOR

In [12]:
dl = MTCDataLoader('../data/mtcann_sequences.jsonl.gz')
train, test = dl.train_test_split(test_size=0.1, seq_iter=dl.selectFeatures(['chromaticinterval', 'IOR']))

### Advanced feature configuration

objects: all songs in MTC-ANN
<br>
features: scale degree, metric contour and beat position

In [13]:
dl = MTCDataLoader('../data/mtcann_sequences.jsonl.gz')
sel = dl.selectFeatures(['scaledegree','metriccontour','full_beat_str'],
                        seq_iter=dl.applyFeatureExtractor('full_beat_str'))

### Train with background-corpus, Test with MTC-ANN

In [14]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')
ann_dl = MTCDataLoader('../data/mtcann_sequences.jsonl.gz')
train = list( fs_dl.applyFilter('ann_bgcorpus') )
test = list( ann_dl.sequences() )

### Train and test with Onder de Groene Linde songs only

In [15]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')
train, test = fs_dl.train_test_split(test_size=0.1, seq_iter=fs_dl.applyFilter('inOGL',
                                                    seq_iter=fs_dl.applyFilter('labeled')))

If desired, the split can be made to respect the tune family groupings:

In [16]:
train, test = fs_dl.train_test_split(test_size=0.1, groupby='tunefamily', seq_iter=fs_dl.applyFilter('inOGL'))

### Train and test with 17th and 18th century fiddle music only

In [17]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')

sel_instr = fs_dl.applyFilter('instrumental')
sel_17th18th_c = fs_dl.applyFilter( ('betweenyears', 1600, 1800), seq_iter=sel_instr )
sel_labeled = fs_dl.applyFilter('labeled', seq_iter=sel_17th18th_c)

train, test = fs_dl.train_test_split(test_size=0.1, seq_iter=sel_labeled)

### Obtain all unlabeled 17th/18th century fiddle songs

In [18]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')

sel_instr = fs_dl.applyFilter('instrumental')
sel_17th18th_c = fs_dl.applyFilter( ('betweenyears', 1600, 1800), seq_iter=sel_instr )
sel_unlabeled = fs_dl.applyFilter('unlabeled', seq_iter=sel_17th18th_c)

### Use big tune families (>=20 melodies) for training

In [19]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')

sel_big = fs_dl.minClassSizeFilter('tunefamily', 20)

train, test = fs_dl.train_test_split(test_size=0.1, groupby='tunefamily', seq_iter=sel_big)

### Use small tune families (<=5 melodies) only

In [20]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')

sel_small = fs_dl.maxClassSizeFilter('tunefamily', 5)

### Use only melodies with given identifiers

In [21]:
fs_dl = MTCDataLoader('../data/mtcfsinst_sequences.jsonl.gz')
id_list = ['NLB125814_01','NLB125815_01','NLB125817_01','NLB125818_01','NLB125822_01','NLB125823_01']
sel_list = fs_dl.applyFilter( ('inList', id_list) )