# Data operations

This notebook shows how to load the plankton images and their classification labels, and how to join the two into a single labelled dataset.

**The [last cell of this notebook](#Quick-start) contains everything needed to load the labelled training data into an xarray, in a single notebook cell.**

## Import libraries

In [1]:
import matplotlib.pyplot as plt
import xarray as xr
import pandas as pd
import numpy as np

from scivision.io import load_dataset
from IPython.display import display, HTML

## Load the Intake catalog

As before, load the [Intake](https://intake.readthedocs.io/en/latest/index.html) catalog from the challenge repository containing [Scivision](https://github.com/alan-turing-institute/scivision) metadata:

In [2]:
cat = load_dataset('https://github.com/alan-turing-institute/plankton-dsg-challenge')

## Inspect the catalog entries

We explored the catalog in the previous notebook. It contains several data sources; the description of each one is shown below.

In [3]:
for data_source in cat:
    display(HTML(f"<h4>{data_source}</h4>"))
    display(HTML(cat[data_source].description))

<div class="alert alert-block alert-info">We will use the <tt>plankton_multiple</tt> entry to fetch all of the images, and the <tt>labels</tt> to fetch the labels for training.  The <tt>labels_holdout</tt> will be useful as a final holdout set for testing any models you may produce during the challenge.</div>

## Fetch the labels

The `labels` entry corresponds to an index file, imported as a `pandas.DataFrame`, which lists all of the plankton images. Each row gives an image's index, its filename, and its labels at three levels of classification: `label1` (zooplankton vs detritus), `label2` (noncopepod vs copepod) and `label3` (species).

In [4]:
labels = cat.labels().read()

In [5]:
labels.head()

| | index | filename | label1 | label2 | label3 |
|---|---|---|---|---|---|
| 0 | 1 | Pia1.2016-10-04.1801+N292_hc.tif | zooplankton | noncopepod | annelida_polychaeta |
| 1 | 2 | Pia1.2016-10-05.1229+N28_hc.tif | zooplankton | noncopepod | annelida_polychaeta |
| 2 | 3 | Pia1.2016-10-06.2118+N136_hc.tif | zooplankton | noncopepod | annelida_polychaeta |
| 3 | 4 | Pia1.2017-03-21.1136+N01644266_hc.tif | zooplankton | noncopepod | annelida_polychaeta |
| 4 | 5 | Pia1.2017-03-21.1136+N01646706_hc.tif | zooplankton | noncopepod | annelida_polychaeta |
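
A quick way to get a feel for the three label levels is to count how many images fall into each class. The sketch below uses a small hypothetical stand-in for `labels` (the filenames, the `species_*` names and the label values on the detritus row are placeholders, not real data); the same `value_counts` call should work on the real DataFrame.

```python
import pandas as pd

# Toy stand-in for the `labels` DataFrame. The rows are hypothetical;
# only the column names follow the real index file.
toy_labels = pd.DataFrame(
    {
        "filename": ["a.tif", "b.tif", "c.tif", "d.tif"],
        "label1": ["zooplankton", "zooplankton", "zooplankton", "detritus"],
        "label2": ["noncopepod", "copepod", "copepod", "detritus"],
        "label3": ["species_x", "species_y", "species_y", "detritus"],
    }
)

# Number of images per class at each level of the label hierarchy.
counts = {
    level: toy_labels[level].value_counts()
    for level in ["label1", "label2", "label3"]
}

for level, level_counts in counts.items():
    print(level_counts, end="\n\n")
```

Strongly unequal counts at any level would be worth knowing about before choosing a training strategy.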


## Fetch the complete image dataset

The `plankton_multiple` entry loads the full dataset. All images are stacked into a single `xarray.Dataset` object with fixed dimensions of 1040 px x 832 px, large enough to hold every image. Smaller images are padded with zeros.

In [None]:
ds_all = cat.plankton_multiple().to_dask()
ds_all.filename.load()

In [None]:
print(ds_all)
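
To illustrate what zero-padding to a fixed size means, here is a minimal NumPy sketch. It assumes a canvas of 832 rows by 1040 columns with three colour channels, and it assumes the padding is added at the bottom and right; neither the orientation nor the placement used by the catalog is confirmed here, and `pad_to_canvas` is a hypothetical helper.

```python
import numpy as np

# Assumed canvas size: 832 rows x 1040 columns (orientation is an assumption).
TARGET_HEIGHT, TARGET_WIDTH = 832, 1040


def pad_to_canvas(image, height=TARGET_HEIGHT, width=TARGET_WIDTH):
    """Zero-pad an (h, w, c) image on the bottom and right to (height, width, c)."""
    h, w = image.shape[:2]
    if h > height or w > width:
        raise ValueError("image is larger than the target canvas")
    # One (before, after) pair per axis; the channel axis is left unpadded.
    pad_widths = ((0, height - h), (0, width - w), (0, 0))
    return np.pad(image, pad_widths, mode="constant", constant_values=0)


# A small all-white dummy image standing in for a plankton image.
small = np.full((100, 150, 3), 255, dtype=np.uint8)
padded = pad_to_canvas(small)
print(padded.shape)  # (832, 1040, 3)
```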

Let's subset a single image. This can be done using the image index stored in `concat_dim`.

In [None]:
subset = ds_all.sel(concat_dim=2)

In [None]:
print(subset)
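
The difference between selecting by coordinate value, by position, and by filename can be sketched on a toy dataset. The dimension names `y`, `x` and `channel` below are assumptions made for illustration, as are the filenames; only `concat_dim`, `raster` and `filename` come from the dataset used in this notebook.

```python
import numpy as np
import xarray as xr

# Toy dataset: three blank 4 x 5 RGB "images" stacked along concat_dim.
toy = xr.Dataset(
    data_vars={
        "raster": (
            ("concat_dim", "y", "x", "channel"),
            np.zeros((3, 4, 5, 3), dtype=np.uint8),
        )
    },
    coords={
        "concat_dim": [0, 1, 2],
        "filename": ("concat_dim", ["a.tif", "b.tif", "c.tif"]),
    },
)

# Select by coordinate value: the image whose concat_dim label is 2.
by_label = toy.sel(concat_dim=2)

# Select by position: the last image along concat_dim.
by_position = toy.isel(concat_dim=-1)

# Select by filename: find the matching positions first, then index with them.
positions = np.flatnonzero(toy.filename.values == "b.tif")
by_name = toy.isel(concat_dim=positions)

print(by_label.filename.item(), by_position.filename.item())
print(by_name.filename.values)
```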

In [None]:
plt.imshow(subset['raster'].compute().values[:,:,:])
plt.title(subset.filename.compute().values)
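
Because smaller images are padded with zeros, a plotted image may show a black border. One possible way to remove it is to crop to the bounding box of the non-zero pixels, as in the sketch below. `crop_zero_padding` is a hypothetical helper, and it would also trim any genuinely black rows or columns at the edge of the original image.

```python
import numpy as np


def crop_zero_padding(image):
    """Crop an (h, w) or (h, w, c) array to the bounding box of its non-zero pixels."""
    # A pixel counts as non-zero if any of its channels is non-zero.
    nonzero = image.any(axis=-1) if image.ndim == 3 else image != 0
    rows = np.flatnonzero(nonzero.any(axis=1))
    cols = np.flatnonzero(nonzero.any(axis=0))
    if rows.size == 0:
        # The image is entirely zero, so there is nothing to keep.
        return image[:0, :0]
    return image[rows[0] : rows[-1] + 1, cols[0] : cols[-1] + 1]


# Dummy padded image: a 100 x 150 grey patch in the corner of a zero canvas.
canvas = np.zeros((832, 1040, 3), dtype=np.uint8)
canvas[:100, :150] = 200
cropped = crop_zero_padding(canvas)
print(cropped.shape)  # (100, 150, 3)
```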

## Assembling the labelled dataset

### Check for duplicate labels

In [None]:
for filename, label_grp in labels.groupby("filename"):
    if len(label_grp) > 1:
        display(label_grp.reset_index(drop=True))
        print()
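
The same check can be written without an explicit loop using `duplicated(keep=False)`, which flags every row whose filename occurs more than once. The sketch below runs on hypothetical toy rows and also reports which duplicated filenames carry conflicting `label3` values.

```python
import pandas as pd

# Toy labels with one filename listed twice under different species (hypothetical rows).
toy_labels = pd.DataFrame(
    {
        "filename": ["a.tif", "b.tif", "b.tif", "c.tif"],
        "label3": ["species_x", "species_y", "species_z", "species_x"],
    }
)

# keep=False marks every occurrence of a repeated filename, not just the later ones.
is_duplicated = toy_labels["filename"].duplicated(keep=False)
duplicates = toy_labels[is_duplicated].sort_values("filename")

# True for filenames whose duplicate rows disagree on the species label.
conflicting = duplicates.groupby("filename")["label3"].nunique() > 1

print(duplicates)
print(conflicting)
```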

### Join the images and labels

Put the labels into an xarray, keeping only the first row for any filename that appears more than once (the default behaviour of `drop_duplicates`). Set the filename as the (unique) index so it is ready to be merged (joined) with the image data:

In [None]:
labels_dedup = xr.Dataset.from_dataframe(
    labels
    .drop_duplicates(subset=["filename"])
    .set_index("filename")
    .sort_index()
)

print(labels_dedup)
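
Note that `drop_duplicates` keeps the first row of each duplicated filename by default. If ambiguous images should be excluded altogether, `keep=False` is an alternative. The toy rows below are hypothetical and only illustrate the difference between the two settings.

```python
import pandas as pd

# Hypothetical labels: "b.tif" appears twice with different species.
toy_labels = pd.DataFrame(
    {
        "filename": ["a.tif", "b.tif", "b.tif"],
        "label3": ["species_x", "species_y", "species_z"],
    }
)

# Default (keep="first"): one row per filename, the first occurrence wins.
keep_first = toy_labels.drop_duplicates(subset=["filename"])

# keep=False: every filename that occurs more than once is removed entirely.
drop_all = toy_labels.drop_duplicates(subset=["filename"], keep=False)

print(keep_first)
print(drop_all)
```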

xarray merges (joins) datasets along their shared dimensions. We temporarily make `filename` a dimension of the dataset in order to perform the merge, in place of `concat_dim`. (`concat_dim` is the integer-valued dimension that numbers the image files read into `ds_all`; it is **not** the same as `index` in `labels_dedup`.)

In [None]:
ds_labelled = (
    ds_all
    .swap_dims({"concat_dim": "filename"})
    .merge(labels_dedup, join="inner")
    .swap_dims({"filename": "concat_dim"})
)

print(ds_labelled)
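
The effect of the swap, merge, swap pattern is easier to see on a toy example. In the sketch below, the image dataset and the labels are small hypothetical stand-ins (the `y` and `x` dimension names and the filenames are assumptions). One image has no label and one label has no image, so the inner join keeps only the two filenames present on both sides.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy image dataset: three blank 2 x 2 images indexed by concat_dim.
toy_images = xr.Dataset(
    data_vars={
        "raster": (("concat_dim", "y", "x"), np.zeros((3, 2, 2), dtype=np.uint8))
    },
    coords={
        "concat_dim": [0, 1, 2],
        "filename": ("concat_dim", ["b.tif", "a.tif", "c.tif"]),
    },
)

# Toy labels indexed by filename; "z.tif" has no matching image.
toy_labels = xr.Dataset.from_dataframe(
    pd.DataFrame(
        {
            "filename": ["a.tif", "b.tif", "z.tif"],
            "label1": ["zooplankton", "detritus", "detritus"],
        }
    ).set_index("filename")
)

toy_labelled = (
    toy_images
    .swap_dims({"concat_dim": "filename"})   # index images by filename
    .merge(toy_labels, join="inner")         # keep filenames present in both
    .swap_dims({"filename": "concat_dim"})   # restore the integer index
)

print(toy_labelled)
```

After the merge, `label1` is indexed by `concat_dim` alongside `raster`, so a single `sel` call returns an image together with its label.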

## Quick start

The following cell contains everything needed to load the labelled training data into an xarray, named `ds_labelled` (independent of the rest of the notebook). It will take a few minutes to run.

In [None]:
import xarray as xr
from scivision.io import load_dataset

cat = load_dataset('https://github.com/alan-turing-institute/plankton-dsg-challenge')

ds_all = cat.plankton_multiple().to_dask()
labels = cat.labels().read()

labels_dedup = xr.Dataset.from_dataframe(
    labels
    .drop_duplicates(subset=["filename"])
    .set_index("filename")
    .sort_index()
)

ds_labelled = (
    ds_all
    .swap_dims({"concat_dim": "filename"})
    .merge(labels_dedup, join="inner")
    .swap_dims({"filename": "concat_dim"})
)