## Data Curation Example: Newer Accounts

This notebook demonstrates how to use Gantry to transform the production data flowing into Gantry into training sets. The goal of datasets and curators is to let users tell Gantry _what_ they want, which requires judgment and domain expertise, and let Gantry take care of the _how_. That includes all the messiness of managing reliable jobs that transform production streams into versioned datasets.

In [None]:
%load_ext lab_black

from config import GantryConfig, DataStorageConfig
import gantry
import datetime
import random

In [None]:
gantry.init(api_key=GantryConfig.GANTRY_API_KEY)

assert gantry.ping()

In [7]:
from gantry.curators import BoundedRangeCurator

# Append a random suffix so repeated runs do not collide on the curator name.
new_accounts_curator_name = f"{GantryConfig.GANTRY_APP_NAME}-new-account-curator-{random.randrange(0, 10000001)}"

new_accounts_curator = BoundedRangeCurator(
    name=new_accounts_curator_name,
    application_name=GantryConfig.GANTRY_APP_NAME,
    limit=1000,
    bound_field="inputs.account_age_days",
    lower_bound=0,
    upper_bound=7,
    start_on=DataStorageConfig.MIN_DATE,
    curation_interval=datetime.timedelta(days=1),
)

new_accounts_curator.create()

Curator(name='casey-gec-test-new-account-curator-7346948', curated_dataset_name='casey-gec-test-new-account-curator-7346948', application_name='casey-gec-test', start_on=2022-03-30 00:00:00+00:00, curation_interval=1 day, 0:00:00, curate_past_intervals=True, selectors=[Selector(method=OrderedSampler(sample=<SamplingMethod.ordered: 'ordered'>, field='inputs.account_age_days', sort=<DruidDimensionOrderingDirections.ASCENDING: 'ascending'>), limit=1000, filters=[BoundsFilter(field='inputs.account_age_days', upper=7.0, inclusive_upper=None, lower=0.0, inclusive_lower=None)], tags=[])])

A few things to note here:
- We provide the application name to tie this curator to the associated application
- We use `GantryConfig` and `DataStorageConfig` to grab some configuration values, such as the earliest timestamp of the data. Passing that timestamp as `start_on` triggers Gantry to backfill all of the intervals between that date and now
- We use the `BoundedRangeCurator` because it makes it easy to specify a curator according to the bounds on a field, namely `account_age_days`
- We choose a somewhat arbitrary limit of 1000 records to represent our daily labeling budget
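
To make the selection logic concrete, here is a minimal pandas sketch of what one curation interval might do, based on the curator representation printed above (a bounds filter, an ascending ordered sampler, and a limit). The function `select_new_accounts` and the sample records are hypothetical, and treating both bounds as inclusive is an assumption, since the output shows `inclusive_upper=None` and `inclusive_lower=None`. This is not Gantry's implementation.

```python
import pandas as pd


def select_new_accounts(
    records: pd.DataFrame,
    limit: int = 1000,
    lower: float = 0,
    upper: float = 7,
) -> pd.DataFrame:
    """Approximate one curation interval: filter by bounds, sort ascending, cap at limit."""
    field = "inputs.account_age_days"
    # Assumption: both bounds are inclusive (Series.between is inclusive by default).
    in_range = records[records[field].between(lower, upper)]
    # Stable sort keeps the original order among records with equal account age.
    return in_range.sort_values(field, kind="stable").head(limit)


# Hypothetical records for a single day of production data.
day = pd.DataFrame(
    {
        "inputs.account_age_days": [12, 3, 0, 45, 7, 1],
        "inputs.text": ["a", "b", "c", "d", "e", "f"],
    }
)

selected = select_new_accounts(day, limit=3)
print(selected)
```

With a limit of 3, only the three youngest in-range accounts are kept; the real curator applies the same idea with a limit of 1000 per daily interval.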

This curator tells Gantry how to marshal your data into versions of a **dataset**.

The following code will do two things for us:
1. Grab the dataset that our curator is populating.
2. List the versions in the dataset. Each version represents a historical interval (in this case, one day) of data. Listing the versions makes the relationship between curators and datasets explicit: as curator jobs run, one per interval, they create versions of the dataset. Each version's message shows how much of the specified limit is being met, as you can see below in the format: `x records added from interval of size y`

In [9]:
dataset = new_accounts_curator.get_curated_dataset()
dataset.list_versions()

[{'version_id': 'eeecaae4-ef2a-4a6a-bc1d-751da986f478',
  'dataset': 'casey-gec-test-new-account-curator-7346948',
  'message': 'Added Gantry data from model casey-gec-test from start time 2022-04-30T00:00:00 to end time 2022-05-01T00:00:00 - 68 records added from interval of size 1479. Commit automatically created by curator.',
  'created_at': 'Mon, 06 Mar 2023 18:12:40 GMT',
  'created_by': '844c1123-7615-40e1-9eaa-d8f41c6f8572',
  'is_latest_version': True},
 {'version_id': 'd3130d50-8373-48c0-94ca-e36c6542c342',
  'dataset': 'casey-gec-test-new-account-curator-7346948',
  'message': 'Added Gantry data from model casey-gec-test from start time 2022-04-29T00:00:00 to end time 2022-04-30T00:00:00 - 161 records added from interval of size 1242. Commit automatically created by curator.',
  'created_at': 'Mon, 06 Mar 2023 18:12:39 GMT',
  'created_by': '844c1123-7615-40e1-9eaa-d8f41c6f8572',
  'is_latest_version': False},
 {'version_id': 'fd8b9c85-bd75-4370-9e5a-23e51bc69c9d',
  'dataset
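
The `x records added from interval of size y` messages can be turned into numbers with a small regular expression. The sketch below is one way to do it; the helper `parse_version_message` is hypothetical (not part of the Gantry SDK), and it assumes the message format stays as shown in the output above.

```python
import re

# Matches the "x records added from interval of size y" part of a version message.
PATTERN = re.compile(r"(\d+) records added from interval of size (\d+)")


def parse_version_message(message: str, limit: int = 1000):
    """Extract record counts from a curator commit message, or None if absent."""
    match = PATTERN.search(message)
    if match is None:
        return None
    added = int(match.group(1))
    interval_size = int(match.group(2))
    return {
        "added": added,
        "interval_size": interval_size,
        # Share of the daily limit (our labeling budget) that was filled.
        "fraction_of_limit": added / limit,
        # Share of the interval's records that fell inside the bounds.
        "fraction_of_interval": added / interval_size,
    }


# Message copied from the first version listed above.
message = (
    "Added Gantry data from model casey-gec-test from start time "
    "2022-04-30T00:00:00 to end time 2022-05-01T00:00:00 - 68 records added "
    "from interval of size 1479. Commit automatically created by curator."
)

stats = parse_version_message(message)
print(stats)
```

Applied to each entry's `'message'` field in the list returned by `dataset.list_versions()`, this gives a quick view of how close each day comes to the 1000-record limit.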

Now that we have seen how curators can make it easy to tell Gantry what data you want to gather, and let Gantry take care of gathering it, let's dive into datasets.

### Datasets
Gantry datasets are a lightweight container for your data with simple versioning semantics that aim to make it straightforward to build more robust ML pipelines. Versioning is at the file level, and centers on two operations: `push` and `pull`. Curators `push` data to Gantry Datasets. Users `pull` data from them to analyze, label, or train on. Users can also make local modifications to datasets, and `push` them back. All `push` operations create a new version, and each version is written to underlying S3 storage. You can read about datasets in more detail in the [Datasets guide](https://docs.gantry.io/docs/datasets).
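
As a mental model for these semantics, the toy in-memory class below mimics the rule that every `push` creates a new version and `pull` returns the latest one by default. `ToyDataset` and `ToyVersion` are hypothetical names used only for illustration; the sketch ignores S3 storage and is not how Gantry implements datasets.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyVersion:
    version_id: str
    message: str
    files: Dict[str, str]  # file name -> file contents


@dataclass
class ToyDataset:
    versions: List[ToyVersion] = field(default_factory=list)

    def push(self, files: Dict[str, str], message: str) -> ToyVersion:
        """Every push appends a new, immutable snapshot of the files."""
        version = ToyVersion(str(uuid.uuid4()), message, dict(files))
        self.versions.append(version)
        return version

    def pull(self, version_id: Optional[str] = None) -> ToyVersion:
        """Return the latest version, or a specific one when an id is given."""
        if not self.versions:
            raise LookupError("dataset has no versions yet")
        if version_id is None:
            return self.versions[-1]
        for version in self.versions:
            if version.version_id == version_id:
                return version
        raise KeyError(version_id)


toy = ToyDataset()
v1 = toy.push({"data.csv": "a,b\n1,2\n"}, "curator: day 1")
v2 = toy.push({"data.csv": "a,b\n1,2\n3,4\n"}, "user: added a row")

print(toy.pull().message)
print(toy.pull(v1.version_id).message)
```

Because earlier versions are never overwritten, a training run can pin a specific version id and stay reproducible while curators keep pushing new data.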

Let's pull the latest version of the dataset from our curator:

In [10]:
dataset.pull()

{'version_id': 'eeecaae4-ef2a-4a6a-bc1d-751da986f478',
 'dataset': 'casey-gec-test-new-account-curator-7346948',
 'message': 'Added Gantry data from model casey-gec-test from start time 2022-04-30T00:00:00 to end time 2022-05-01T00:00:00 - 68 records added from interval of size 1479. Commit automatically created by curator.',
 'created_at': 'Mon, 06 Mar 2023 18:12:40 GMT',
 'created_by': '844c1123-7615-40e1-9eaa-d8f41c6f8572',
 'is_latest_version': True}

Gantry datasets can be loaded directly into HuggingFace Datasets, which can in turn be converted into Pandas DataFrame objects. All the type mappings are handled by the integration:

In [11]:
hfds = dataset.get_huggingface_dataset()
df = hfds.to_pandas()

Downloading and preparing dataset caseygectestnewaccountcurator7346948/default to /Users/casey/.cache/huggingface/datasets/caseygectestnewaccountcurator7346948/default/1.0.0/20ef1ae868fe33bff85703eaa9b5d6d7125e6a7096b977002b71b90f98886dc5...


Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 409.36it/s]
Extracting data files: 100%|████████████| 1/1 [00:00<00:00, 41.50it/s]
                                                                  

Dataset caseygectestnewaccountcurator7346948 downloaded and prepared to /Users/casey/.cache/huggingface/datasets/caseygectestnewaccountcurator7346948/default/1.0.0/20ef1ae868fe33bff85703eaa9b5d6d7125e6a7096b977002b71b90f98886dc5. Subsequent calls will reuse this data.


In [12]:
df

| | __time | inputs.account_age_days | inputs.text | outputs.inference | feedback.correction_accepted |
|---|---|---|---|---|---|
| 0 | 2022-04-24 04:06:46+00:00 | 1 | “They say these are old abandoned hotel buildi... | “They say these are old abandoned hotel buildi... | False |
| 1 | 2022-04-24 22:11:50+00:00 | 1 | [CLOSED - Winner @britmdee] A New Home for a N... | [CLOSED - Winner @britmdee] A New Home for a N... | False |
| 2 | 2022-04-24 09:00:00+00:00 | 1 | Marvel Studios' Thor: Love and Thunder arrives... | Marvel Studios' Thor: Love and Thunder arrives... | False |
| 3 | 2022-04-24 06:13:52+00:00 | 1 | Best enjoyed with a cheer #socialiseresponsibl... | Best enjoyed with a cheer #socialiseresponsibl... | True |
| 4 | 2022-04-24 19:00:00+00:00 | 1 | A FLOWERING BUSH NATIVE TO MEXICO. LONG-REVERE... | A FLOWERING BUSH NATIVE TO MEXICO. LONG-REVERE... | False |
| ... | ... | ... | ... | ... | ... |
| 11136 | 2022-04-27 02:00:19+00:00 | 6 | 🎶 GINgle Bells, GINgle bells, Gin & Fun all th... | GINgle Bells, GINgle Bells, Gin & Fun all the... | True |
| 11137 | 2022-04-27 00:21:48+00:00 | 6 | Santé! Toast with something special this year!... | Santé! Toast with something special this year!... | True |
| 11138 | 2022-04-27 15:00:00+00:00 | 6 | We’ve gathered our friends, packed a basket an... | We’ve gathered our friends, packed a basket an... | True |
| 11139 | 2022-04-27 07:21:56+00:00 | 6 | Did you know our cinnamon rolls take over 4 da... | Did you know our cinnamon rolls take over 4 da... | True |


We can now analyze, train on, or label this dataset. If we want to do manual data manipulation, we can modify it and push a new version back to Gantry.
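
As a small example of the kind of analysis this enables, the sketch below computes the share of accepted corrections per account age. It uses only the nine rows visible in the output above as a stand-in frame (`df_sample` is a hypothetical name); on the real data you would run the same `groupby` on `df`.

```python
import pandas as pd

# Stand-in for `df`, built from the nine rows visible in the output above.
df_sample = pd.DataFrame(
    {
        "inputs.account_age_days": [1, 1, 1, 1, 1, 6, 6, 6, 6],
        "feedback.correction_accepted": [
            False, False, False, True, False, True, True, True, True,
        ],
    }
)

# The mean of a boolean column is the fraction of True values,
# i.e. the acceptance rate for each account age.
acceptance_by_age = df_sample.groupby("inputs.account_age_days")[
    "feedback.correction_accepted"
].mean()

print(acceptance_by_age)
```

Nine rows are far too few to draw conclusions from; the point is only to show the shape of the computation.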