In [16]:
import os

swap in the path to your own clone of the repo

In [17]:
os.chdir('/Users/layne/Desktop/aiqc')

In [18]:
import aiqc

In [19]:
from aiqc import datum

---

making some dummy data for us to use as an example

In [5]:
df = datum.to_pandas('iris.tsv')

In [6]:
dataset = aiqc.Dataset.Tabular.from_pandas(
    dataframe = df
    , name = 'tab separated plants'
    , dtype = None # passed to pd.DataFrame(dtype) / inferred
    , column_names = None # passed to pd.DataFrame(columns)
)

In [7]:
label_column = 'species'

In [8]:
label = dataset.make_label(columns=[label_column])

In [9]:
featureset = dataset.make_featureset()

In [42]:
splitset = featureset.make_splitset(
    label_id=label.id
    , size_test=0.22 
    , size_validation=0.12
    , bin_count=3
)

---

now that we have a splitset, let's take a look at the `samples` attribute

because we specified a `size_validation` we have a validation entry in the samples dictionary

In [43]:
splitset.samples.keys()

dict_keys(['validation', 'train', 'test'])

here are the sample indices that belong to the validation split:

In [44]:
splitset.samples['validation']

[140, 138, 106, 103, 13, 55, 68, 37, 53, 2, 69, 57, 117, 114, 42, 61, 28, 12]
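as a rough sanity check, the list above has 18 entries, which is what `size_validation=0.12` would suggest. the arithmetic below is only a sketch: it assumes `iris.tsv` has the classic 150 rows and uses plain rounding, so the library's own rounding and stratification may differ by a sample or so.

```python
# Rough estimate of split sizes.
# Assumption: iris.tsv has the classic 150 rows; plain rounding is used here.
n_rows = 150
size_test = 0.22
size_validation = 0.12

n_test = round(n_rows * size_test)
n_validation = round(n_rows * size_validation)
n_train = n_rows - n_test - n_validation

print(n_validation, n_test, n_train)  # 18 33 99
```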

when it comes time to read the splitset into memory, these indices are used to fetch the raw data

the `to_*` methods, such as `to_numpy`, let you filter which splits, columns, and label/feature arrays are returned. this comes in handy for downstream processes that need to select specific data.

In [45]:
split_data = splitset.to_numpy(
    splits = ['train', 'test', 'validation']
    , include_label = True
    , include_featureset = True
    , feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
)

In [46]:
split_data.keys()

dict_keys(['validation', 'train', 'test'])

In [47]:
split_data['validation'].keys()

dict_keys(['features', 'labels'])
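to make the nesting above concrete, here is a minimal sketch of how sample indices could be turned into a `{split: {'features': ..., 'labels': ...}}` dictionary. `to_nested` and its parameters are hypothetical names for illustration, not the library's implementation, and it uses plain lists instead of numpy arrays. unlike the call above, the toy version leaves the label column out of the features.

```python
def to_nested(rows, label_position, samples, splits,
              include_label=True, include_featureset=True):
    """Hypothetical helper: build {split: {'features': ..., 'labels': ...}}."""
    result = {}
    for split in splits:
        # Use the stored sample indices to fetch the raw rows for this split.
        subset = [rows[i] for i in samples[split]]
        entry = {}
        if include_featureset:
            # Every column except the label column counts as a feature here.
            entry['features'] = [
                [value for pos, value in enumerate(row) if pos != label_position]
                for row in subset
            ]
        if include_label:
            entry['labels'] = [[row[label_position]] for row in subset]
        result[split] = entry
    return result


# Toy stand-in data: two feature columns followed by one label column.
rows = [[5.1, 3.5, 0], [4.9, 3.0, 0], [6.3, 3.3, 2], [5.8, 2.7, 2]]
samples = {'train': [0, 2], 'validation': [3], 'test': [1]}

nested = to_nested(rows, label_position=2, samples=samples,
                   splits=['validation', 'train'])
print(nested['validation'])  # {'features': [[5.8, 2.7]], 'labels': [[2]]}
```

only the requested splits appear in the result, so `'test'` is absent from `nested` here.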

example of drilling down into the raw data

In [94]:
split_data['validation']['labels'][:3]

array([[2],
       [2],
       [2]])

---

things get more complicated when cross-validation with folds is introduced

https://aiqc.readthedocs.io/en/latest/notebooks/api_low_level.html#5.-Optionally,-create-a-Foldset-for-cross-validation.

In [49]:
foldset = splitset.make_foldset(
    fold_count=3
    , bin_count=3
)

The `Foldset` has no `samples` attribute. The sample indices live in separate `Fold` objects, which are related to the `Foldset`.

In [54]:
list(foldset.folds)

[<Fold: 4>, <Fold: 5>, <Fold: 6>]

In [56]:
fold4 = foldset.folds[0]

In [68]:
fold4.fold_index

0

Each `Fold` has its own subsets:

In [59]:
fold4.samples.keys()

dict_keys(['folds_train_combined', 'fold_validation'])

In [62]:
fold4.samples['fold_validation']

[0, 4, 9, 13, 14, 16, 17, 19, 24, 26, 27, 32, 33, 40, 41, 42, 47, 49, 52, 53, 57, 60, 61, 63, 71, 72, 75, 76, 84, 85, 87, 95, 97]
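for intuition, each fold holds out one chunk of the samples as `fold_validation` and combines the remaining chunks into `folds_train_combined`. the sketch below shows that idea with a hypothetical `make_folds` helper. it is unstratified and deals indices round-robin, so it ignores `bin_count` and is not how the library assigns samples.

```python
def make_folds(indices, fold_count):
    """Hypothetical, unstratified k-fold partition of a list of sample indices."""
    # Deal the indices round-robin into fold_count chunks.
    chunks = [indices[i::fold_count] for i in range(fold_count)]
    folds = []
    for fold_index, validation in enumerate(chunks):
        # Everything outside this fold's chunk becomes its training data.
        train = [i for j, chunk in enumerate(chunks) if j != fold_index for i in chunk]
        folds.append({
            'fold_index': fold_index,
            'samples': {
                'folds_train_combined': sorted(train),
                'fold_validation': validation,
            },
        })
    return folds


folds = make_folds(list(range(9)), fold_count=3)
print(folds[0]['samples'])
# {'folds_train_combined': [1, 2, 4, 5, 7, 8], 'fold_validation': [0, 3, 6]}
```

every sample lands in exactly one `fold_validation`, so across the three folds each sample is held out once.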

But when it comes time to fetch the data for the folds, it is requested from the `Foldset`, which has filters of its own:

In [78]:
foldset_data = foldset.to_numpy(
    fold_index = None
    , fold_names = ['folds_train_combined', 'fold_validation']
    , include_label = True
    , include_featureset = True
)

The `fold_index` is the first level of the dictionary

In [79]:
foldset_data.keys()

dict_keys([0, 1, 2])

In [80]:
foldset_data[0].keys()

dict_keys(['folds_train_combined', 'fold_validation'])

In [81]:
foldset_data[0]['folds_train_combined'].keys()

dict_keys(['features', 'labels'])

finally, the raw data

In [95]:
foldset_data[0]['folds_train_combined']['labels'][:3]

array([[0],
       [0],
       [0]])

---

The question is: based on how the data in the `Batch` is set up, which splits/folds should be used to evaluate the model's performance during `job.run()`?

- Did the `Splitset` use `size_validation`?
  - then don't evaluate on the test split (use the validation split instead).

- Did the `Batch` use a `foldset_id`?
  - then which `Fold` is related to this `Job`?
  - then evaluate on that fold's `'fold_validation'` instead of the test split or the validation split.
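the questions above can be sketched as a small decision function. `pick_evaluation_target` and its arguments are hypothetical names, and falling back to the validation split when there is no foldset is an assumption read off the bullets, not confirmed library behavior.

```python
def pick_evaluation_target(has_validation, fold_index=None):
    """Hypothetical rule of thumb mirroring the questions above."""
    if fold_index is not None:
        # A foldset was used: evaluate on this job's own fold.
        return ('fold', fold_index, 'fold_validation')
    if has_validation:
        # Assumption: the validation split takes the place of the test split.
        return ('split', None, 'validation')
    # No validation split and no folds: only the test split is left.
    return ('split', None, 'test')


print(pick_evaluation_target(has_validation=True))
print(pick_evaluation_target(has_validation=True, fold_index=0))
print(pick_evaluation_target(has_validation=False))
```

the fold check comes first because, per the last bullet, a fold's `'fold_validation'` takes priority over both the test and validation splits.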

A `Batch` can also be created with a `repeat_count`, which determines how many duplicate runs there should be for each job.
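as a rough illustration of what `repeat_count` implies for the number of runs, the sketch below expands a list of jobs into (job, repeat) pairs. `expand_runs` and the job ids are made up for the example and say nothing about how the library schedules runs.

```python
def expand_runs(job_ids, repeat_count):
    """Hypothetical: pair every job with each of its repeat indices."""
    return [
        (job_id, repeat_index)
        for job_id in job_ids
        for repeat_index in range(repeat_count)
    ]


runs = expand_runs(job_ids=[1, 2], repeat_count=3)
print(len(runs))  # 6
print(runs[:3])   # [(1, 0), (1, 1), (1, 2)]
```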