In [1]:
# default_exp data

# Data preparation

> Downloading the data and developing the machinery for feeding it to our models

The `fastai` library provides the very flexible DataBlock API. This should generally be our main go-to tool when working with data in non-standard formats.

As fastai v2 is a new version of the library and we have little experience with it, we decided to first drop down to the mid-level API. This ensures we have full control over data processing and lets us learn how to write custom transforms (we will need this to run some of the experiments we have planned). Once we complete the deep dive into how the `fastai` library handles data, we hope to apply the lessons learned and move up to the higher-level DataBlock API.

Let's download the data.

In [2]:
!wget https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-48909-4/MediaObjects/41598_2019_48909_MOESM2_ESM.xlsx -O data/Dominicana.xlsx
!wget https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-48909-4/MediaObjects/41598_2019_48909_MOESM3_ESM.xlsx -O data/ETP.xlsx

--2020-03-06 15:23:40--  https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-48909-4/MediaObjects/41598_2019_48909_MOESM2_ESM.xlsx
Resolving static-content.springer.com (static-content.springer.com)... 151.101.192.95, 151.101.128.95, 151.101.64.95, ...
Connecting to static-content.springer.com (static-content.springer.com)|151.101.192.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 867711 (847K) [application/octet-stream]
Saving to: ‘data/Dominicana.xlsx’


2020-03-06 15:23:41 (3.71 MB/s) - ‘data/Dominicana.xlsx’ saved [867711/867711]

--2020-03-06 15:23:41--  https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-48909-4/MediaObjects/41598_2019_48909_MOESM3_ESM.xlsx
Resolving static-content.springer.com (static-content.springer.com)... 151.101.192.95, 151.101.128.95, 151.101.64.95, ...
Connecting to static-content.springer.com (static-content.springer.com)|151.101.192.95|:443... connected.
HTTP request sent, awaiting response... 

## First look at the data

In [3]:
#export
from fastai2.data.all import *

dominicana = pd.read_excel('data/Dominicana.xlsx')
etp = pd.read_excel('data/ETP.xlsx')

And this is what the data looks like. It contains the ICI information (independent variables) as well as labels, such as Coda type or Clan membership.

In [4]:
dominicana.head()

Unnamed: 0,codaNUM2018,Date,nClicks,Duration,ICI1,ICI2,ICI3,ICI4,ICI5,ICI6,ICI7,ICI8,ICI9,CodaType,Clan,Unit,UnitNum,IDN
0,1,2005-03-04 00:00:00,5,1.188,0.293,0.282,0.298,0.315,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
1,2,2005-03-04 00:00:00,5,1.125,0.287,0.265,0.299,0.274,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
2,3,2005-03-04 00:00:00,5,1.09,0.264,0.253,0.297,0.276,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
3,4,2005-03-04 00:00:00,5,1.09,0.269,0.265,0.271,0.285,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
4,5,2005-03-04 00:00:00,5,1.101,0.273,0.267,0.266,0.295,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0


In [5]:
etp.head()

Unnamed: 0,CodaNum,FullDateTime,ClanNum,ClanName,NumClicks,TotalDur,ICI1,ICI2,ICI3,ICI4,ICI5,ICI6,ICI7,ICI8,ICI9,ICI10,ICI11,Coda Type
0,1,02/23/1985 12:03:01,1.0,Regular,5,0.531,0.141,0.127,0.126,0.138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0
1,2,02/23/1985 12:03:03,1.0,Regular,5,0.526,0.132,0.128,0.124,0.143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0
2,3,02/23/1985 12:03:06,1.0,Regular,5,0.491,0.134,0.129,0.118,0.109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0
3,4,02/23/1985 12:03:07,1.0,Regular,8,1.411,0.204,0.194,0.218,0.202,0.207,0.206,0.18,0.0,0.0,0.0,0.0,899.0
4,5,02/23/1985 12:03:09,1.0,Regular,5,0.528,0.133,0.129,0.13,0.137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0


## Data for pretraining

Let's construct our dataset step by step. A `TfmdLists` object is able to read the rows of our DataFrame and treat them as items (`item` is the name for an example in `fastai` parlance).

In [6]:
tfmd_lists = TfmdLists(dominicana, [lambda x: x])

In [7]:
tfmd_lists[0]

codaNUM2018                      1
Date           2005-03-04 00:00:00
nClicks                          5
Duration                     1.188
ICI1                         0.293
ICI2                         0.282
ICI3                         0.298
ICI4                         0.315
ICI5                             0
ICI6                             0
ICI7                             0
ICI8                             0
ICI9                             0
CodaType                       5R3
Clan                           EC1
Unit                             A
UnitNum                          1
IDN                              0
Name: 0, dtype: object

In [8]:
len(tfmd_lists), dominicana.shape[0]

(8719, 8719)
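
Conceptually, a transformed list keeps the raw items and applies its functions whenever an item is requested. The sketch below illustrates that idea in plain Python. `LazyTransformedList` is a hypothetical name used only for illustration, and the real `TfmdLists` does considerably more (setup, splits, decoding).

```python
class LazyTransformedList:
    """Hypothetical stand-in: applies a list of functions, in order, on item access."""

    def __init__(self, items, tfms):
        self.items, self.tfms = list(items), list(tfms)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        for tfm in self.tfms:
            item = tfm(item)
        return item


# toy rows standing in for DataFrame rows
rows = [{'nClicks': 5, 'Duration': 1.188}, {'nClicks': 5, 'Duration': 1.125}]

identity = LazyTransformedList(rows, [lambda x: x])           # returns rows unchanged
durations = LazyTransformedList(rows, [lambda r: r['Duration']])  # extracts one field
```

With the identity function the items come back unchanged, which mirrors the check above that the length matches the number of rows.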

Looking good. Let's see if we can extract the ICI information from a single row and package it in a way that would be suitable for our model.

In [9]:
#export
def get_independent_vars(row, start_col=4, n_vals=9):
    vals = [v for v in row[start_col:(start_col+n_vals)].values if v != 0]
    # we want to manually pad the sequence
    # we believe that for a single direction RNN padding the sequence from the left should work better
    return np.pad(vals, (n_vals - len(vals), 0))

In [10]:
get_independent_vars(dominicana.iloc[0])

array([0.   , 0.   , 0.   , 0.   , 0.   , 0.293, 0.282, 0.298, 0.315])
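
To make the padding direction concrete, here is a small sketch contrasting left and right padding with `np.pad`, using the four non-zero ICIs of the first row. The variable names are illustrative only.

```python
import numpy as np

icis = [0.293, 0.282, 0.298, 0.315]   # non-zero ICIs of the first Dominica coda
n_vals = 9

# pad_width is (number of zeros before, number of zeros after)
left_padded = np.pad(icis, (n_vals - len(icis), 0))    # zeros first, signal last
right_padded = np.pad(icis, (0, n_vals - len(icis)))   # signal first, zeros last
```

With left padding, the last time steps a unidirectional RNN reads are real intervals, so its final hidden state is not computed after a run of padding zeros.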

This looks good. Can we use this in `TfmdLists`?

In [11]:
tfmd_lists = TfmdLists(dominicana, [get_independent_vars])

In [12]:
tfmd_lists[0]

array([0.   , 0.   , 0.   , 0.   , 0.   , 0.293, 0.282, 0.298, 0.315])

For the pretraining, we can go directly from this representation to the targets (the target being the last ICI).

In [13]:
#export
def independent_vars_to_targs(ary): return ary[-1]

In [14]:
tfmd_lists = TfmdLists(dominicana, [get_independent_vars, independent_vars_to_targs])

In [15]:
tfmd_lists[0]

0.315

We now need to make sure that the independent variables, our training data, don't contain the targets.

In [16]:
#export
def drop_last_value(ary): return ary[:-1]

We would like each example to be represented as a tuple of `(independent_variables, targets)`. In order to arrive at this representation, we can run two transformation pipelines in parallel.

One transformation pipeline will give us the independent variables:

```TfmdLists(dominicana, [get_independent_vars, drop_last_value])```

and the other will give us the dependent variable, our target:

```TfmdLists(dominicana, [get_independent_vars, independent_vars_to_targs])```

The fastai class that can wrap multiple transformation pipelines is called `Datasets`.
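
The idea of running two pipelines over the same row and pairing their results can be pictured in plain Python. This is a conceptual sketch only, not how `Datasets` is implemented; `compose` and `last_value` are hypothetical helpers, and the rows are toy tuples rather than DataFrame rows.

```python
def compose(*fns):
    """Return a function that applies `fns` from left to right."""
    def composed(x):
        for fn in fns:
            x = fn(x)
        return x
    return composed


# toy stand-ins operating on plain lists
def drop_last_value(seq): return seq[:-1]
def last_value(seq): return seq[-1]

x_pipeline = compose(list, drop_last_value)   # independent variables
y_pipeline = compose(list, last_value)        # target

rows = [(0.0, 0.293, 0.282, 0.298, 0.315), (0.0, 0.287, 0.265, 0.299, 0.274)]

# each example is a tuple of (independent_variables, target)
examples = [(x_pipeline(row), y_pipeline(row)) for row in rows]
```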

In [17]:
datasets = Datasets(dominicana, [[get_independent_vars, drop_last_value], [get_independent_vars, independent_vars_to_targs]]); datasets

(#8719) [(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.293, 0.282, 0.298]), 0.315),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.287, 0.265, 0.299]), 0.274),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.264, 0.253, 0.297]), 0.276),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.269, 0.265, 0.271]), 0.285),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.273, 0.267, 0.266]), 0.295),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.278, 0.27 , 0.269]), 0.271),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.269, 0.293, 0.287]), 0.289),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.279, 0.277, 0.28 ]), 0.294),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.284, 0.278, 0.28 ]), 0.29),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.297, 0.277, 0.275]), 0.283)...]

In [18]:
datasets[0]

(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.293, 0.282, 0.298]), 0.315)

This is looking good. We have the data for pretraining ready. But what about actual training? Here we will need labels transformed in a way suitable for our model to learn from.

## Data for training

We specify the first pipeline as follows:
```TfmdLists(dominicana, [get_independent_vars])```

For the second pipeline, however, we will need new functionality we have not developed yet. We would like to be able to specify a set of labels as targets (this could be clan membership or coda type, for instance).

In [19]:
#export
def get_target(row, col_name): return row[col_name]
get_clan_membership = partialler(get_target, col_name='Clan')

In [20]:
tfmd_lists = TfmdLists(dominicana, [get_clan_membership])

In [21]:
tfmd_lists[0]

'EC1'

In [22]:
dominicana.Clan.unique()

array(['EC1', 'EC2'], dtype=object)

We can now extract the clan name as a string, but this is not a representation we can train our model on. We need to go from string labels to a set of indexes.

In [23]:
Categorize(vocab=['EC1', 'EC2'])('EC1')

TensorCategory(0)

This does the trick!

Let's now pull all this into a `Datasets` object.

In [24]:
datasets = Datasets(dominicana, [[get_independent_vars], [get_clan_membership, Categorize]]); datasets

(#8719) [(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.293, 0.282, 0.298, 0.315]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.287, 0.265, 0.299, 0.274]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.264, 0.253, 0.297, 0.276]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.269, 0.265, 0.271, 0.285]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.273, 0.267, 0.266, 0.295]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.278, 0.27 , 0.269, 0.271]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.269, 0.293, 0.287, 0.289]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.279, 0.277, 0.28 , 0.294]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.284, 0.278, 0.28 , 0.29 ]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.297, 0.277, 0.275, 0.283]), TensorCategory(0))...]

The `Datasets` class can work with the `Categorize` transform to initialize it without us having to explicitly pass the vocab (it creates the vocab from the data we provide it).

In [25]:
datasets.tfms[1][-1].vocab

(#2) ['EC1','EC2']
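
Building a vocab from the data and mapping labels to indexes can be sketched in a few lines of plain Python. This is only an illustration of the idea under the assumption that the vocab is the sorted set of unique labels; it is not the code `Categorize` runs, and the label list is a toy example.

```python
labels = ['EC1', 'EC1', 'EC2', 'EC1']      # toy label column

vocab = sorted(set(labels))                # unique labels in a stable order
label_to_index = {label: i for i, label in enumerate(vocab)}

encoded = [label_to_index[label] for label in labels]   # what the model trains on
decoded = [vocab[i] for i in encoded]                   # back to readable labels
```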

We now have everything we need on the data side to reproduce the RNN experiments from the paper. Let us now see if we can use the DataBlock API to nicely package it all up.

## Using the DataBlock API

For the pretraining, we can use all the data we have across the two datasets (the Dominica and Eastern Tropical Pacific (ETP) datasets). Let's concatenate them together.

In [None]:
# export
pd.set_option('display.max_columns', None)
merged_datasets = pd.concat((etp, dominicana)).fillna(0)

In [135]:
merged_datasets.head()

Unnamed: 0,CodaNum,FullDateTime,ClanNum,ClanName,NumClicks,TotalDur,ICI1,ICI2,ICI3,ICI4,ICI5,ICI6,ICI7,ICI8,ICI9,ICI10,ICI11,Coda Type,codaNUM2018,Date,nClicks,Duration,CodaType,Clan,Unit,UnitNum,IDN
0,1.0,02/23/1985 12:03:01,1.0,Regular,5.0,0.531,0.141,0.127,0.126,0.138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0,0.0,0,0.0,0.0,0,0,0,0.0,0
1,2.0,02/23/1985 12:03:03,1.0,Regular,5.0,0.526,0.132,0.128,0.124,0.143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0,0.0,0,0.0,0.0,0,0,0,0.0,0
2,3.0,02/23/1985 12:03:06,1.0,Regular,5.0,0.491,0.134,0.129,0.118,0.109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0,0.0,0,0.0,0.0,0,0,0,0.0,0
3,4.0,02/23/1985 12:03:07,1.0,Regular,8.0,1.411,0.204,0.194,0.218,0.202,0.207,0.206,0.18,0.0,0.0,0.0,0.0,899.0,0.0,0,0.0,0.0,0,0,0,0.0,0
4,5.0,02/23/1985 12:03:09,1.0,Regular,5.0,0.528,0.133,0.129,0.13,0.137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,599.0,0.0,0,0.0,0.0,0,0,0,0.0,0


Now let us craft a DataBlock that will read in the data.

In [138]:
#export
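# note: start_col is positional; in merged_datasets position 5 is TotalDur (ICI1 is at position 6),
# so this window spans TotalDur through ICI10
get_ETP_independent_vars = partial(get_independent_vars, start_col=5, n_vals=11)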
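
Because `start_col` is a position rather than a name, it is worth checking which columns a given window covers. The sketch below does this with the column order printed by `merged_datasets.head()`. It assumes the leading `Unnamed: 0` in that printout is the row index rather than a data column, which is consistent with the reported shape of `etp`, `(16995, 18)`.

```python
# column order of merged_datasets, copied from the head() printout (index excluded)
columns = [
    'CodaNum', 'FullDateTime', 'ClanNum', 'ClanName', 'NumClicks', 'TotalDur',
    'ICI1', 'ICI2', 'ICI3', 'ICI4', 'ICI5', 'ICI6', 'ICI7', 'ICI8', 'ICI9',
    'ICI10', 'ICI11', 'Coda Type', 'codaNUM2018', 'Date', 'nClicks', 'Duration',
    'CodaType', 'Clan', 'Unit', 'UnitNum', 'IDN',
]
n_vals = 11

first_ici = columns.index('ICI1')                           # position of the first ICI column
window_from_5 = columns[5:5 + n_vals]                       # window used with start_col=5
window_from_ici1 = columns[first_ici:first_ici + n_vals]    # window anchored on ICI1
```

A window starting at position 5 begins with `TotalDur` and ends at `ICI10`, while one starting at the position of `ICI1` covers exactly `ICI1` to `ICI11`.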

In [139]:
#export
dblock_pretrain = DataBlock(
    get_x = (get_ETP_independent_vars, drop_last_value),
    get_y = (get_ETP_independent_vars, independent_vars_to_targs),
    # having a validation set is crucial for any task, including pretraining!
    splitter=TrainTestSplitter(test_size=0.1, random_state=42)
)

datasets_pretrain = dblock_pretrain.datasets(merged_datasets); datasets_pretrain

(#25714) [(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.531, 0.141, 0.127,
       0.126]), 0.138),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.526, 0.132, 0.128,
       0.124]), 0.143),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.491, 0.134, 0.129,
       0.118]), 0.109),(array([0.   , 0.   , 0.   , 1.411, 0.204, 0.194, 0.218, 0.202, 0.207,
       0.206]), 0.18),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.528, 0.133, 0.129,
       0.13 ]), 0.137),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.538, 0.138, 0.133,
       0.127]), 0.14),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.534, 0.131, 0.137,
       0.128]), 0.137),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.549, 0.148, 0.135,
       0.127]), 0.139),(array([0.   , 0.   , 0.   , 1.032, 0.148, 0.147, 0.139, 0.142, 0.149,
       0.144]), 0.162),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.709, 0.148, 0.133, 0.132,
       0.135]), 0.16)...]
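
The role of `test_size=0.1` and `random_state=42` can be pictured as a seeded shuffle followed by a hold-out. The sketch below uses only the standard library; `split_indices` is a hypothetical helper, and it does not reproduce the exact indices (or the exact rounding of the hold-out size) that `TrainTestSplitter` returns.

```python
import random


def split_indices(n_items, valid_frac=0.1, seed=42):
    """Shuffle indices with a fixed seed and hold out `valid_frac` of them."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)       # same seed, same shuffle on every run
    n_valid = int(n_items * valid_frac)
    return idxs[n_valid:], idxs[:n_valid]   # (train indices, validation indices)


train_idxs, valid_idxs = split_indices(25714)
```

Fixing the seed keeps the validation set identical across runs, so results from different experiments stay comparable.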

This is looking good. As for the training data, the situation is a bit more complex: we need to align how we create our datasets with the paper.

It seems that, due to a lack of data, whale identification was only evaluated on the train set. Since this gives us little insight into how the model would generalize to unseen data, let us not include this task in our analysis.

With regard to the "coda type classification" task, the paper reports training on 23 coda types from the Dominicana dataset and 43 coda types from the ETP dataset. The authors were kind enough to share their [code on GitHub](https://github.com/dgruber212/Sperm_Whale_Machine_Learning/blob/master/RNNClassifier.py), so we can align how we create our datasets with theirs.

In [140]:
dominicana.head()

Unnamed: 0,codaNUM2018,Date,nClicks,Duration,ICI1,ICI2,ICI3,ICI4,ICI5,ICI6,ICI7,ICI8,ICI9,CodaType,Clan,Unit,UnitNum,IDN
0,1,2005-03-04 00:00:00,5,1.188,0.293,0.282,0.298,0.315,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
1,2,2005-03-04 00:00:00,5,1.125,0.287,0.265,0.299,0.274,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
2,3,2005-03-04 00:00:00,5,1.09,0.264,0.253,0.297,0.276,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
3,4,2005-03-04 00:00:00,5,1.09,0.269,0.265,0.271,0.285,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0
4,5,2005-03-04 00:00:00,5,1.101,0.273,0.267,0.266,0.295,0.0,0.0,0.0,0.0,0.0,5R3,EC1,A,1,0


In [145]:
#export
mask = dominicana.CodaType.isin(['1-NOISE', '2-NOISE','3-NOISE','4-NOISE','5-NOISE','6-NOISE','7-NOISE','8-NOISE','9-NOISE','10-NOISE','10i','10R'])
dominicana_clean = dominicana[~mask]

In [144]:
dominicana_clean.shape

(8032, 18)

In [100]:
vc = etp['Coda Type'].value_counts()

In [111]:
vc.iloc[:2].sum()

7018

In [112]:
etp.shape

(16995, 18)

In [116]:
dominicana.CodaType.value_counts()

1+1+3       3589
5R1         1510
5R3          642
4R2          340
5R2          287
5-NOISE      280
4D           219
6i           188
8i           178
7D1          177
6-NOISE      172
7i           151
9i           137
4R1           84
2+3           77
1+32          72
6R            67
3D            61
8R            60
10i           55
4-NOISE       51
1+31          51
9R            43
7D2           35
10R           32
7R            23
3-NOISE       21
3R            21
8D            20
7-NOISE       19
8-NOISE       17
9-NOISE       16
10-NOISE       9
1-NOISE        9
2-NOISE        6
Name: CodaType, dtype: int64
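
These counts let us double-check the filter applied to `dominicana`. The sketch below copies the counts from the `value_counts()` output and recomputes how many rows and coda types remain once the NOISE types, `10i` and `10R` are excluded. The variable names are illustrative.

```python
# counts copied from dominicana.CodaType.value_counts()
coda_counts = {
    '1+1+3': 3589, '5R1': 1510, '5R3': 642, '4R2': 340, '5R2': 287,
    '5-NOISE': 280, '4D': 219, '6i': 188, '8i': 178, '7D1': 177,
    '6-NOISE': 172, '7i': 151, '9i': 137, '4R1': 84, '2+3': 77,
    '1+32': 72, '6R': 67, '3D': 61, '8R': 60, '10i': 55,
    '4-NOISE': 51, '1+31': 51, '9R': 43, '7D2': 35, '10R': 32,
    '7R': 23, '3-NOISE': 21, '3R': 21, '8D': 20, '7-NOISE': 19,
    '8-NOISE': 17, '9-NOISE': 16, '10-NOISE': 9, '1-NOISE': 9, '2-NOISE': 6,
}

# the same 12 types the mask lists: ten NOISE types plus 10i and 10R
excluded = {t for t in coda_counts if t.endswith('-NOISE')} | {'10i', '10R'}
kept = {t: n for t, n in coda_counts.items() if t not in excluded}

total_rows = sum(coda_counts.values())
kept_rows = sum(kept.values())
n_kept_types = len(kept)
```

The 8,032 remaining rows match `dominicana_clean.shape`, and the 23 remaining coda types match the number the paper reports for this dataset.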

In [106]:
len(vc)

52

In [91]:
dominicana.shape

(8719, 18)

We now have everything we need for pretraining.

In [82]:
dls = datasets_pretrain.dataloaders()

In [83]:
len(dls.train)

122

In [50]:
136 * 64

8704

In [84]:
len(dls.valid)

14
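
The loader lengths can be sanity-checked with a little arithmetic. The sketch below assumes a batch size of 64, a training loader that drops its last incomplete batch, a validation loader that keeps it, and a 10% hold-out rounded up. All of these are assumptions about the defaults rather than facts taken from this notebook, and the helper names are hypothetical.

```python
def n_batches(n_items, bs=64, drop_last=False):
    """Number of batches a loader yields for `n_items` items."""
    full, remainder = divmod(n_items, bs)
    return full if drop_last or remainder == 0 else full + 1


def expected_loader_lengths(n_rows, bs=64):
    n_valid = -(-n_rows // 10)      # ceil(0.1 * n_rows): assumed rounding of the hold-out
    n_train = n_rows - n_valid
    return n_batches(n_train, bs, drop_last=True), n_batches(n_valid, bs)


lengths_8719 = expected_loader_lengths(8719)     # Dominica dataset on its own
lengths_25714 = expected_loader_lengths(25714)   # merged dataset
```

Under these assumptions, 122 training and 14 validation batches correspond to 8,719 rows, the size of the Dominica dataset on its own. The merged dataset of 25,714 rows would give 361 and 41.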

In [29]:
dblock_train = DataBlock(
    get_x = get_independent_vars,
    get_y = (get_clan_membership, Categorize)
)

In [30]:
dblock_train.datasets(dominicana)

(#8719) [(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.293, 0.282, 0.298, 0.315]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.287, 0.265, 0.299, 0.274]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.264, 0.253, 0.297, 0.276]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.269, 0.265, 0.271, 0.285]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.273, 0.267, 0.266, 0.295]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.278, 0.27 , 0.269, 0.271]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.269, 0.293, 0.287, 0.289]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.279, 0.277, 0.28 , 0.294]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.284, 0.278, 0.28 , 0.29 ]), TensorCategory(0)),(array([0.   , 0.   , 0.   , 0.   , 0.   , 0.297, 0.277, 0.275, 0.283]), TensorCategory(0))...]
