# The KantoData dataset

In [None]:
from pathlib import Path
from pykanto.dataset import KantoData
from pykanto.parameters import Parameters
from pykanto.utils.paths import ProjDirs, pykanto_data
from pykanto.utils.io import load_dataset
import numpy as np


DATASET_ID = "GREAT_TIT"
DIRS = pykanto_data(dataset=DATASET_ID)

params = Parameters(dereverb=False, verbose=False)
dataset = KantoData(
    DIRS,
    parameters=params,
    overwrite_dataset=True,
    overwrite_data=True
)

dataset.segment_into_units()

dataset.save_to_disk()
dataset = load_dataset(dataset.DIRS.DATASET, DIRS)
dataset.to_csv(dataset.DIRS.DATASET.parent)
dataset.write_to_json()


## Useful attributes

{py:class}`~pykanto.dataset.KantoData` datasets contain a series of attributes.
These are some of the ones you are most likely to access:

| Attribute | Description |
|-----------|-------------|
| `KantoData.data` | a dataframe containing information about each vocalization |
| `KantoData.files` | a list of files associated with the dataset |
| `KantoData.parameters` | a {py:class}`~pykanto.parameters.Parameters` instance containing the params used to generate the dataset |
| `KantoData.metadata` | a dictionary of metadata associated with the dataset |
| `KantoData.units` | a dataframe of single sound units in the dataset, created if `song_level=False` in the parameters after running `KantoData.cluster_ids()` |

## Some common operations with datasets

| Description | Code |
| --- | --- |
| Load an existing dataset | ```dataset = load_dataset(dataset.DIRS.DATASET, DIRS)``` |
| Save an existing dataset | ```dataset.save_to_disk()``` |
| Save a dataset to CSV | ```dataset.to_csv(dataset.DIRS.DATASET.parent)``` |
| Save new metadata to JSON files | ```dataset.write_to_json()``` |

You can get some basic information about the dataset by running:

In [None]:
dataset.sample_info()
dataset.data['ID'].value_counts()
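
As a rough illustration of what the `value_counts()` call above returns, here is a minimal sketch on a toy `ID` column. The ID labels are made up for the example and do not come from the dataset; only the pandas calls are real.

```python
import pandas as pd

# Toy stand-in for dataset.data['ID'] (hypothetical labels, not real data).
ids = pd.Series(["B32", "B32", "SW83", "B32", "SW83", "O78"], name="ID")

# Number of vocalisations per ID, most frequent first.
counts = ids.value_counts()
print(counts)

# The same tally expressed as proportions of the total.
proportions = ids.value_counts(normalize=True)
print(proportions)
```

The result is a pandas `Series` indexed by ID, so you can look up a single ID with `counts["B32"]` or keep only well-sampled IDs with a boolean mask such as `counts[counts >= 2]`.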

`KantoData.data` and `KantoData.units` are {py:class}`~pandas.DataFrame`
instances: I chose this format because it is very flexible and most users are
already familiar with it. You can query and modify them as you would any other
dataframe. For example, to see the first three rows and a subset of columns:

In [16]:
dataset.data[['date', 'recordist', 'unit_durations']].head(3)

| | date | recordist | unit_durations |
| --- | --- | --- | --- |
| 2021-B32-0415_05-11 | 2021-04-15 | Nilo Merino Recalde | [0.0986848072562358, 0.10448979591836727, 0.10... |
| 2021-B32-0415_05-15 | 2021-04-15 | Nilo Merino Recalde | [0.1102947845804989, 0.09868480725623585, 0.12... |
| 2021-B32-0415_05-21 | 2021-04-15 | Nilo Merino Recalde | [0.1219047619047619, 0.10448979591836738, 0.14... |
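
Because these are ordinary dataframes, the usual pandas idioms for deriving columns and filtering rows apply. The sketch below uses a toy dataframe rather than a real `KantoData` object: the column names `date`, `recordist` and `unit_durations` match those shown above, but the values, the index labels and the derived columns `n_units` and `mean_unit_duration` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for dataset.data: real column names, made-up values.
data = pd.DataFrame(
    {
        "date": ["2021-04-15", "2021-04-15", "2021-04-16"],
        "recordist": ["A", "A", "B"],
        "unit_durations": [[0.10, 0.11], [0.09, 0.12, 0.10], [0.12]],
    },
    index=["song_01", "song_02", "song_03"],
)

# Derived column: number of units in each vocalisation.
data["n_units"] = data["unit_durations"].apply(len)

# Derived column: mean unit duration (in seconds) per vocalisation.
data["mean_unit_duration"] = data["unit_durations"].apply(
    lambda durations: sum(durations) / len(durations)
)

# Boolean-mask query: keep vocalisations with at least two units.
multi_unit = data[data["n_units"] >= 2]

# Group-wise summary: number of vocalisations recorded on each date.
songs_per_date = data.groupby("date").size()

print(multi_unit[["date", "n_units", "mean_unit_duration"]])
print(songs_per_date)
```

Columns that hold lists, such as `unit_durations`, are handled row by row with `apply`, since each cell contains a whole sequence rather than a single number.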


You can also extract the length of each vocalisation (its last offset) and
calculate inter-onset intervals (IOIs):

In [None]:
last_offsets = dataset.data["offsets"].apply(lambda x: x[-1]).to_list()
iois = dataset.data.onsets.apply(
    lambda x: np.diff(x)
)

In [None]:
print("Vocalisation durations:", [f"{x:.2f}" for x in last_offsets[:5]])
# Use .iloc for positional access: the index holds vocalisation keys, not integers
print("IOIs:", [f"{x:.2f}" for x in iois.iloc[0][:5]])
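
To make the quantities above concrete, here is a minimal sketch on a single toy vocalisation. The onset and offset times are made up and are assumed to be in seconds; the silent-gap calculation is an extra illustration that does not appear in the cells above.

```python
import numpy as np

# Toy onset and offset times (seconds) for one vocalisation with four units.
# These values are invented for illustration; they are not from the dataset.
onsets = np.array([0.00, 0.25, 0.50, 0.80])
offsets = np.array([0.10, 0.35, 0.62, 0.90])

# Inter-onset intervals: time between the starts of consecutive units.
iois = np.diff(onsets)

# Unit durations: offset minus onset for each unit.
unit_durations = offsets - onsets

# Silent gaps: time from the end of one unit to the start of the next.
gaps = onsets[1:] - offsets[:-1]

print("IOIs:", np.round(iois, 2))
print("Unit durations:", np.round(unit_durations, 2))
print("Silent gaps:", np.round(gaps, 2))
```

Each IOI equals the duration of a unit plus the silent gap that follows it, which is a quick sanity check on onset and offset data.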