## Data Curation Example: Newer Accounts

This notebook demonstrates how to use Gantry to transform the production data flowing into Gantry into training sets. The goal of datasets and curators is to let users tell Gantry _what_ they want, which requires judgment and domain expertise, and let Gantry take care of the _how_. That includes all the messiness of managing reliable jobs that transform production streams into versioned datasets.

In [None]:
from config import GantryConfig, DataStorageConfig
import gantry
import datetime
import random

In [None]:
gantry.init(api_key=GantryConfig.GANTRY_API_KEY)

assert gantry.ping()

In [None]:
from gantry.automations.curators import BoundedRangeCurator
from gantry.automations.triggers import IntervalTrigger
from gantry.automations import Automation
application = gantry.get_application(GantryConfig.GANTRY_APP_NAME)
# A random suffix keeps the curator name unique across reruns of this notebook.
new_accounts_curator_name = f"{GantryConfig.GANTRY_APP_NAME}-new-account-curator-{random.randrange(0, 10000001)}"
# Run once per day, starting from the earliest timestamp in the data.
interval_trigger = IntervalTrigger(start_on=DataStorageConfig.MIN_DATE, interval=datetime.timedelta(days=1))

new_accounts_curator = BoundedRangeCurator(
    application_name=application._name,
    name=new_accounts_curator_name,
    limit=1000,
    bound_field="inputs.account_age_days",
    lower_bound=0,
    upper_bound=7,
)
curator_automation = Automation(name="curator-automation", trigger=interval_trigger, action=new_accounts_curator)
application.add_automation(curator_automation)

A few things to note here:
- We provide the application name to tie this curator to the associated application
- We use `GantryConfig` and `DataStorageConfig` to grab some configuration values, such as the earliest timestamp of the data (`DataStorageConfig.MIN_DATE`). Passing it as `start_on` triggers Gantry to backfill all of the intervals between that date and now
- We use the `BoundedRangeCurator` because it makes it easy to specify a curator according to the bounds on a field, namely `account_age_days`
- We choose a somewhat arbitrary limit of 1000 records to represent our daily labeling budget
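
To make the backfill and the bounded selection concrete, here is a minimal plain-Python sketch of the idea. It is not Gantry's implementation: the helper names `daily_intervals` and `select_bounded` are hypothetical, and it assumes that the bounds are inclusive and that the first `limit` matches are kept, neither of which is confirmed by this notebook.

```python
import datetime


def daily_intervals(start, end, step=datetime.timedelta(days=1)):
    """Yield (interval_start, interval_end) pairs for each full step in [start, end)."""
    current = start
    while current + step <= end:
        yield current, current + step
        current += step


def select_bounded(records, field, lower, upper, limit):
    """Keep records whose `field` is within [lower, upper], capped at `limit`.

    Assumption: bounds are inclusive and the first `limit` matches are kept.
    """
    matches = [r for r in records if lower <= r[field] <= upper]
    return matches[:limit]


# Hypothetical dates standing in for DataStorageConfig.MIN_DATE and "now".
start = datetime.datetime(2023, 1, 1)
end = datetime.datetime(2023, 1, 4)
intervals = list(daily_intervals(start, end))

# Hypothetical records; only the field name comes from the curator above.
records = [{"inputs.account_age_days": age} for age in (0, 3, 7, 8, 30)]
picked = select_bounded(records, "inputs.account_age_days", 0, 7, limit=2)
```

Each interval corresponds to one curator job, and the `limit` caps how many matching records a single job contributes.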

This curator tells Gantry how to marshal your data into versions of a **dataset**.

The following code will do two things for us:
1. Grab the dataset that our curator is populating.
2. List the versions in the dataset. Each version represents a historical interval (in this case, one day) of data. Listing the versions makes the relationship between curators and datasets explicit: as curator jobs run, one for each interval, they create versions of a dataset. Each version carries metadata about what fraction of the specified limit is being met, as you can see below in the format: `x records added from interval of size y`

In [None]:
dataset = new_accounts_curator.get_curated_dataset()
dataset.list_versions()

If the output only says `initial commit`, the curator has not pulled any data yet. Once a job has run, the output should reference the records that were added. Wait a bit for the curator job to finish, then run the cell above again.
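
Instead of re-running the cell by hand, you could poll until the versions show up. The helper below is a generic sketch: `wait_for` is a hypothetical name, not part of the Gantry SDK, and the commented-out check at the end assumes that `list_versions()` returns a list whose first entry is the initial commit, which you should verify against your own output.

```python
import time


def wait_for(check, timeout_s=300.0, poll_s=10.0):
    """Call `check()` until it returns a truthy value or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(poll_s)


# Stand-in check that succeeds on the third call, to show the control flow.
attempts = iter([False, False, True])
result = wait_for(lambda: next(attempts), timeout_s=5, poll_s=0)

# With the real dataset, a check might look like this (assumption, see above):
# wait_for(lambda: len(dataset.list_versions()) > 1)
```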


Now that we have seen how curators can make it easy to tell Gantry what data you want to gather, and let Gantry take care of gathering it, let's dive into datasets.

### Datasets
Gantry datasets are a lightweight container for your data with simple versioning semantics that aim to make it straightforward to build more robust ML pipelines. Versioning is at the file level, and centers on two operations: `push` and `pull`. Curators `push` data to Gantry Datasets. Users `pull` data from them to analyze, label, or train on. Users can also make local modifications to datasets, and `push` them back. All `push` operations create a new version, and each version is written to underlying S3 storage. You can read about datasets in more detail in the [Datasets guide](https://docs.gantry.io/docs/datasets).
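
The `push` and `pull` semantics can be pictured with a toy in-memory model. This is only an illustration of "every push creates a new version": the class `ToyVersionedDataset` and its methods are hypothetical and say nothing about how Gantry stores versions in S3.

```python
class ToyVersionedDataset:
    """Toy model: an append-only list of file snapshots."""

    def __init__(self):
        # Start with an empty first version, mirroring an "initial commit".
        self._versions = [{"message": "initial commit", "files": {}}]

    def push(self, files, message):
        """Store a copy of `files` as a new version and return its index."""
        self._versions.append({"message": message, "files": dict(files)})
        return len(self._versions) - 1

    def pull(self, version=None):
        """Return a copy of the files at `version` (latest if omitted)."""
        index = len(self._versions) - 1 if version is None else version
        return dict(self._versions[index]["files"])

    def list_versions(self):
        return [v["message"] for v in self._versions]


toy = ToyVersionedDataset()
v1 = toy.push({"day1.csv": "a,b\n1,2\n"}, "curator: day 1")

# A user pulls, edits locally, and pushes the result back as a new version.
local = toy.pull()
local["day1.csv"] += "3,4\n"
v2 = toy.push(local, "manual relabel")
```

Earlier versions stay untouched, so `toy.pull(v1)` still returns the data as the curator first wrote it.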

Let's pull the latest version of the dataset from our curator:

In [None]:
dataset.pull()

Gantry datasets can be loaded directly into HuggingFace Datasets, which can in turn be converted into Pandas DataFrame objects. All the type mappings are handled by the integration:

In [None]:
hfds = dataset.get_huggingface_dataset()
df = hfds.to_pandas()

In [None]:
df

We can now analyze, train on, or label this dataset. If we want to do manual data manipulation, we can modify it and push a new version back to Gantry.
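
As a rough sketch of what such a manual manipulation might look like, the snippet below filters a DataFrame and adds an empty label column. `example_df` is a hypothetical stand-in for the `df` loaded above, and `inputs.title` and `label` are made-up column names; only `inputs.account_age_days` comes from the curator definition.

```python
import pandas as pd

# Hypothetical stand-in for the curated DataFrame; real columns depend on your application.
example_df = pd.DataFrame(
    {
        "inputs.account_age_days": [0, 2, 5, 7],
        "inputs.title": ["first", "second", "third", "fourth"],
    }
)

# Keep only accounts younger than three days and work on a copy.
subset = example_df[example_df["inputs.account_age_days"] < 3].copy()

# Add an empty column to be filled in during labeling.
subset["label"] = None
subset = subset.reset_index(drop=True)
```

After saving the modified data back into the local copy of the dataset, pushing it creates a new version, as described in the Datasets section. The Datasets guide covers the exact calls.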