# Explore the plankton dataset

## Import libraries

In [None]:
from scivision.io import load_dataset

import matplotlib.pyplot as plt

## Load the catalog

To load an [Intake](https://intake.readthedocs.io/en/latest/index.html) catalog from a repository containing [Scivision](https://github.com/alan-turing-institute/scivision) metadata:

In [None]:
cat = load_dataset('https://github.com/alan-turing-institute/plankton-dsg-challenge')

## Explore the catalog entries

Let's inspect the entries of the catalog. 

In [None]:
list(cat)

The catalog contains five data sources: `plankton_single`, `plankton_multiple`, `labels_raw`, `labels` and `labels_holout`. The first two load the image data, and the last three provide the classification labels. We'll explore them in the next sections.

## Fetch the CSV index file

The `labels` entry corresponds to an index file, imported as a `pandas.DataFrame`, which contains the list of all plankton images. Each image includes its index, filename, and labels according to three levels of classification: `label1` (zooplankton vs detritus), `label2` (noncopepod vs copepod) and `label3` (species).

In [None]:
labels = cat.labels().read()

In [None]:
type(labels)

In [None]:
labels

We can now explore the unique labels at each level of classification.

In [None]:
for label in ['label1','label2','label3']:
    print(f'Categories in {label}:', labels[label].unique().tolist())
    print('\n')
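Because the three label levels are hierarchical, it can also help to see which fine-grained categories sit under each coarser one. The sketch below uses a small synthetic table rather than the real `labels` DataFrame; the species names are placeholders, not values from the dataset.

```python
import pandas as pd

# Synthetic stand-in for the labels table (category values are illustrative only)
toy_labels = pd.DataFrame({
    'label1': ['zooplankton', 'zooplankton', 'zooplankton', 'detritus'],
    'label2': ['copepod', 'copepod', 'noncopepod', 'detritus'],
    'label3': ['species_a', 'species_b', 'species_c', 'detritus'],
})

# For each level-2 category, collect the level-3 categories found under it
nested = (
    toy_labels.groupby('label2')['label3']
    .unique()
    .apply(sorted)
    .to_dict()
)

for parent, children in nested.items():
    print(parent, '->', children)
```

The same `groupby` call should work on the real table as long as it has `label2` and `label3` columns, as described above.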

Similarly, let's explore the data imbalance by classification level.

In [None]:
for label in ['label1','label2','label3']:
    print(label)
    print(labels[label].value_counts())
    print('\n')
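Raw counts can be hard to compare across levels, so one option is to look at relative frequencies instead. This is a minimal sketch on a synthetic table (the rows are made up for illustration); `value_counts(normalize=True)` returns the share of rows per category.

```python
import pandas as pd

# Synthetic stand-in for the labels table (values are illustrative only)
toy_labels = pd.DataFrame({
    'label1': ['zooplankton', 'zooplankton', 'zooplankton', 'detritus'],
    'label2': ['copepod', 'copepod', 'noncopepod', 'detritus'],
})

# Fraction of rows per category at each classification level
proportions = {
    level: toy_labels[level].value_counts(normalize=True)
    for level in ['label1', 'label2']
}

for level, shares in proportions.items():
    print(level)
    print(shares.round(2))
    print('\n')
```

A strongly skewed distribution at any level is a hint that class weighting or resampling may be worth considering when training a classifier.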

## Fetch a single image entry

The `plankton_single` entry loads a single image. We can load any image listed in the table above by explicitly passing its filename to `plankton_single`, followed by a call to the `read()` method to load the data. Let's try with the first filename, `Pia1.2016-10-04.1801+N292_hc`.

In [None]:
ds_single = cat.plankton_single(id='Pia1.2016-10-04.1801+N292_hc').read()

In [None]:
type(ds_single)

Now, let's explore `ds_single`, which is imported as an `xarray.Dataset` object. According to [the Project Pythia Foundations resource](https://foundations.projectpythia.org/core/xarray/xarray.html), `xarray.Dataset` is a dictionary-like container that holds one or more `xarray.DataArray` objects. The DataArray is one of the basic building blocks of `xarray`. Xarray expands on the capabilities of NumPy arrays, providing a lot of streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, Xarray is focused on N-dimensional arrays of data (i.e. grids).

In [None]:
print(ds_single)

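The printed structure has three main sections:
* `Dimensions`: the names and lengths of the axes.
* `Coordinates`: the labelled values along those axes.
* `Data variables`: the named arrays held by the dataset (here, the image data).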
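To make these three sections concrete, here is a small sketch that builds a toy `xarray.Dataset` by hand and inspects each part. The dimension names `y`, `x` and `channel` and the tiny image size are assumptions chosen for illustration; only the variable name `raster` is taken from this notebook.

```python
import numpy as np
import xarray as xr

# Synthetic 4 x 6 RGB image standing in for a real plankton raster
image = np.zeros((4, 6, 3), dtype=np.uint8)

toy_ds = xr.Dataset(
    data_vars={'raster': (('y', 'x', 'channel'), image)},
    coords={'y': np.arange(4), 'x': np.arange(6), 'channel': np.arange(3)},
)

print(dict(toy_ds.sizes))      # dimension names and lengths
print(list(toy_ds.coords))     # coordinate names
print(list(toy_ds.data_vars))  # data variable names

# Dictionary-style access returns an xarray.DataArray
raster = toy_ds['raster']
print(type(raster).__name__, raster.shape)
```

Indexing the dataset with a variable name, as in `toy_ds['raster']`, is the same pattern used on `ds_single` in the plotting cell.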

We can visualise the contained image using `matplotlib`, as we usually do with a NumPy array.

In [None]:
plt.imshow(ds_single['raster'].compute().values)

Great - we have loaded a single image!

## Fetch the complete dataset entry

The `plankton_multiple` entry loads the full dataset. All images are stacked into a single `xarray.Dataset` object with fixed image dimensions of 1040 px x 832 px.

In [None]:
ds_all = cat.plankton_multiple().to_dask()

In [None]:
print(ds_all)

Let's subset a single image. This can be done using the image index stored in `concat_dim`.

In [None]:
subset = ds_all.sel(concat_dim=0)

In [None]:
print(subset)

In [None]:
# Select the image at index 2 and display it with its filename as the title
subset = ds_all.sel(concat_dim=2)
plt.imshow(subset['raster'].compute().values)
plt.title(subset.filename.compute().values)
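Selection along `concat_dim` also accepts a list of indices, which is handy for pulling out several images at once. The sketch below demonstrates this on a synthetic stacked dataset; the dimension names `y`, `x` and `channel`, the image size and the filenames are all assumptions for illustration, not the real contents of `ds_all`.

```python
import numpy as np
import xarray as xr

# Three synthetic 4 x 6 RGB images, each filled with a distinct constant value
n_images = 3
stack = np.zeros((n_images, 4, 6, 3), dtype=np.uint8)
for i in range(n_images):
    stack[i] = i * 10

toy_all = xr.Dataset(
    data_vars={
        'raster': (('concat_dim', 'y', 'x', 'channel'), stack),
        'filename': (('concat_dim',), np.array(['img_a', 'img_b', 'img_c'])),
    },
    coords={'concat_dim': np.arange(n_images)},
)

# A scalar index drops the concat_dim axis and returns one image
one = toy_all.sel(concat_dim=2)
print(one['raster'].shape)

# A list of indices keeps the concat_dim axis and returns a smaller stack
several = toy_all.sel(concat_dim=[0, 2])
print(several['filename'].values.tolist())
```

If the row index of the labels table lines up with `concat_dim` (worth verifying before relying on it), the list of indices could come from filtering that table by category.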