# Data operations

This notebook contains code for loading the plankton images and their classification labels.

**The [last cell of this notebook](#Quick-start) contains everything needed to load the labelled data into an `xarray` object, in a single cell.**

## Import libraries

In [None]:
import matplotlib.pyplot as plt
import xarray as xr
import pandas as pd
import numpy as np

from scivision.io import load_dataset
from IPython.display import display, HTML

## Load the Intake catalog

As before, load the [Intake](https://intake.readthedocs.io/en/latest/index.html) catalog from the challenge repository containing [Scivision](https://github.com/alan-turing-institute/scivision) metadata:

In [None]:
cat = load_dataset('https://github.com/alan-turing-institute/plankton-dsg-challenge')

## Inspect the catalog entries

We explored the catalog in the previous notebook. It contains several data sources: their descriptions are shown below.

In [None]:
for data_source in cat:
    display(HTML(f"<h4>{data_source}</h4>"))
    display(HTML(cat[data_source].description))

<div class="alert alert-block alert-info">We will use the <tt>plankton_multiple</tt> entry to fetch all of the images, and the <tt>labels</tt> entry to fetch the labels for training. The <tt>labels_holdout</tt> entry will be useful as a final holdout set for testing any models you may produce during the challenge.</div>

## Fetch the labels

The `labels` entry corresponds to an index file, imported as a `pandas.DataFrame`, which lists all of the plankton images. Each row gives an image's index, its filename, and its labels at three levels of classification: `label1` (zooplankton vs detritus), `label2` (noncopepod vs copepod) and `label3` (species).

In [None]:
labels = cat.labels().read()

In [None]:
labels.head()
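A quick way to get a feel for the three label levels is to count how many images fall into each class. The sketch below uses a small made-up table in place of the real index file, so it runs without downloading anything: the column names follow the description above, but the rows, filenames and class values in `toy_labels` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real index file; every row below is made up for illustration.
toy_labels = pd.DataFrame(
    {
        "index": [0, 1, 2, 3],
        "filename": ["a.tif", "b.tif", "c.tif", "d.tif"],
        "label1": ["zooplankton", "zooplankton", "detritus", "zooplankton"],
        "label2": ["copepod", "noncopepod", "detritus", "copepod"],
        "label3": ["species_a", "species_b", "detritus", "species_a"],
    }
)

# Count the images in each class, separately for every level of the hierarchy.
levels = ["label1", "label2", "label3"]
class_counts = {level: toy_labels[level].value_counts() for level in levels}

for level in levels:
    print(class_counts[level], end="\n\n")
```

Assuming the real `labels` DataFrame has the same column names, the same `value_counts()` call should work on it directly and gives an early indication of class imbalance.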

## Fetch the complete image dataset

The `plankton_multiple` entry loads the full image dataset. All images are stacked into a single `xarray.Dataset` object with fixed dimensions of 1040 px x 832 px, large enough to hold any of the images. Smaller images are padded with zeros.

In [None]:
ds_all = cat.plankton_multiple().to_dask()
ds_all.filename.load()

In [None]:
print(ds_all)
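To make the zero padding concrete, here is a minimal NumPy sketch of how a small image could be padded up to a fixed frame. It is not the code the catalog uses: the helper `pad_to_shape` is hypothetical, the padding is assumed to be added at the bottom and right edges, and the frame is assumed to be 832 rows by 1040 columns (the source does not say which of the two numbers is the height).

```python
import numpy as np


def pad_to_shape(image, target_shape):
    """Zero-pad a (height, width, channels) image to target_shape = (height, width)."""
    height, width = image.shape[:2]
    target_height, target_width = target_shape
    if height > target_height or width > target_width:
        raise ValueError("image is larger than the target shape")
    # Assumption: pad only after the existing pixels (bottom and right), never the channels.
    pad_width = (
        (0, target_height - height),
        (0, target_width - width),
        (0, 0),
    )
    return np.pad(image, pad_width, mode="constant", constant_values=0)


# A made-up 300 x 200 white image, padded to the assumed 832 x 1040 frame.
small = np.full((300, 200, 3), 255, dtype=np.uint8)
padded = pad_to_shape(small, (832, 1040))
print(padded.shape)
```

Because the padding is all zeros, a downstream model sees a black border around smaller images, which may be worth cropping or masking before training.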

Let's subset a single image. This can be done using the image index stored in `concat_dim`.

In [None]:
subset = ds_all.sel(concat_dim=2)

In [None]:
print(subset)

In [None]:
# Load the lazily evaluated raster into memory as a NumPy array before plotting
plt.imshow(subset['raster'].compute().values)
plt.title(str(subset.filename.compute().values))
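The same selection pattern can be tried on a tiny in-memory dataset. The sketch below builds a toy `xarray.Dataset` whose variable names mirror the ones used above (`raster`, `filename`, `concat_dim`); the dimension names `y`, `x` and `channel`, the array sizes, and the filenames are assumptions made for illustration. The toy index deliberately starts at 10 to show the difference between label-based `sel` and position-based `isel`.

```python
import numpy as np
import xarray as xr

# Toy dataset: three blank 4 x 5 RGB "images" stacked along concat_dim.
toy = xr.Dataset(
    {
        "raster": (
            ("concat_dim", "y", "x", "channel"),
            np.zeros((3, 4, 5, 3), dtype=np.uint8),
        ),
        "filename": (("concat_dim",), ["a.tif", "b.tif", "c.tif"]),
    },
    coords={"concat_dim": [10, 11, 12]},
)

# sel() looks up the coordinate label, isel() the integer position.
by_label = toy.sel(concat_dim=11)
by_position = toy.isel(concat_dim=1)

# Passing a list keeps concat_dim as a dimension and returns several images.
several = toy.sel(concat_dim=[10, 12])

print(by_label.filename.item(), by_position.filename.item())
print(dict(several.sizes))
```

Selecting with a list rather than a single value is a convenient way to pull out a small batch of images for inspection.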

## Assembling the labelled dataset: Joining the images and labels

### Work in progress...

## Quick start