## Introduction

This notebook is a five-minute tour of T4, using the `helium` Python package.

In [1]:
import helium as he



T4 lets you manipulate **files** on your local machine and **objects** in the S3 bucket backing T4. Objects are just files with some additional metadata.

To start off, we'll need some data. Here's a script we've built that downloads and cleans up a NOAA hurricane dataset known as HURDAT. It is typical of the clean-up scripts you'd run when doing data science:

In [None]:
%load hurdat/build.py

This script generates a history of Atlantic hurricanes in a `pandas` `DataFrame`:

In [4]:
atlantic_storms.head()

index,id,name,date,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,1851-06-25 00:00:00,,HU,28.0,-94.8,80,,,...,,,,,,,,,,
1,AL011851,UNNAMED,1851-06-25 06:00:00,,HU,28.0,-95.4,80,,,...,,,,,,,,,,
2,AL011851,UNNAMED,1851-06-25 12:00:00,,HU,28.0,-96.0,80,,,...,,,,,,,,,,
3,AL011851,UNNAMED,1851-06-25 18:00:00,,HU,28.1,-96.5,80,,,...,,,,,,,,,,
4,AL011851,UNNAMED,1851-06-25 21:00:00,L,HU,28.2,-96.8,80,,,...,,,,,,,,,,


## Read and write objects

`helium` lets you write in-memory Python objects like this one straight to T4 using `put`. `put` also accepts a `meta` argument for attaching metadata, which we'll use in this example to keep track of where this data came from:

In [5]:
he.put(atlantic_storms, "alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.parquet",
       meta={'source': 'https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt', 
             'ocean': 'atlantic'})

Later on you can retrieve the data (along with its metadata) using `get`:

In [6]:
atlantic_storms, meta = he.get("alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.parquet")

In [7]:
meta

{'source': 'https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt',
 'ocean': 'atlantic'}

`put` and `get` use reasonable defaults to read and write the data for you. In this example, that meant writing a `pandas` `DataFrame` into a `parquet` file.
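To make the object-plus-metadata round trip concrete, here is a toy in-memory model. It is not how `helium` is implemented: `ToyStore`, `StoredObject`, and the choice of JSON as the serialization format are hypothetical stand-ins, picked so the sketch runs without S3.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A serialized payload plus the user-supplied metadata kept with it."""
    data: bytes
    meta: dict = field(default_factory=dict)


class ToyStore:
    """Hypothetical in-memory stand-in for a bucket; not part of helium."""

    def __init__(self):
        self._objects = {}

    def put(self, obj, key, meta=None):
        # Serialize to bytes; JSON is an assumption made for this sketch.
        payload = json.dumps(obj).encode("utf-8")
        self._objects[key] = StoredObject(payload, dict(meta or {}))

    def get(self, key):
        # Return the deserialized payload together with a copy of its metadata.
        stored = self._objects[key]
        return json.loads(stored.data.decode("utf-8")), dict(stored.meta)


store = ToyStore()
store.put({"id": "AL011851", "status": "HU"}, "hurdat/example.json",
          meta={"ocean": "atlantic"})
record, meta = store.get("hurdat/example.json")
print(record, meta)
```

The point of the sketch is only the shape of the interface: the caller hands over an object and a dictionary of metadata, and gets both back from a single call.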

Alternatively, you can upload an existing local file to T4 using `put_file`:

In [8]:
fn = "~/Desktop/atlantic-storms.csv"
atlantic_storms.to_csv(fn)

In [9]:
%ls ~/Desktop | grep 'atlantic'

atlantic-storms.csv


In [10]:
he.put_file("/Users/alex/Desktop/atlantic-storms.csv", "alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.csv")





## Version objects

S3, which T4 is based on, manages data objects using versions. A **version** is a record of the state of an S3 object at a particular point in time.

Each version of an object is assigned a unique hash, reported as its `VersionId`. You can retrieve that hash in T4 using `ls`. For example, here are the first three versions of some files in our HURDAT project (output truncated for legibility):

In [18]:
he.ls("alpha-quilt-storage/~aleksey/hurdat")[1][:3]

[{'ETag': '"7d9faecef6a675b04246fda5d2747a7f"',
  'Size': 40,
  'StorageClass': 'STANDARD',
  'Key': '~aleksey/hurdat/',
  'VersionId': 'jwSyCWiv_zL5Lg.sOyN1RMMQCnGzk.0O',
  'IsLatest': False,
  'LastModified': datetime.datetime(2018, 10, 4, 21, 13, 14, tzinfo=tzutc()),
  'Owner': {'DisplayName': 'kmoore',
   'ID': '1e740c9f01d3eb40d580b51a943de9c75ba2af0c2f75e1ac7b021cd7afd1872a'}},
 {'ETag': '"7d9faecef6a675b04246fda5d2747a7f"',
  'Size': 40,
  'StorageClass': 'STANDARD',
  'Key': '~aleksey/hurdat/',
  'VersionId': 'HvmCd4AGwG4Og3mwGxQMfPDWiZhmtII3',
  'IsLatest': False,
  'LastModified': datetime.datetime(2018, 10, 4, 21, 11, 41, tzinfo=tzutc()),
  'Owner': {'DisplayName': 'kmoore',
   'ID': '1e740c9f01d3eb40d580b51a943de9c75ba2af0c2f75e1ac7b021cd7afd1872a'}},
 {'ETag': '"7d9faecef6a675b04246fda5d2747a7f"',
  'Size': 40,
  'StorageClass': 'STANDARD',
  'Key': '~aleksey/hurdat/',
  'VersionId': 'iUxtMYQh.Z3mInIaitovWpdEP3XY9DKb',
  'IsLatest': False,
  'LastModified': datetime.dateti

In the future, we will have other ways of accessing version information more directly.
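Until then, the listing can be filtered with plain Python. The sketch below assumes entries shaped like the dictionaries printed above (`Key`, `VersionId`, `LastModified`). `version_ids_newest_first` is a hypothetical helper, not part of `helium`, and the sample entries are trimmed copies of the first two records, listed oldest first to show the sort, with `timezone.utc` standing in for `tzutc()`.

```python
from datetime import datetime, timezone

# Trimmed copies of two entries from the listing above.
entries = [
    {"Key": "~aleksey/hurdat/",
     "VersionId": "HvmCd4AGwG4Og3mwGxQMfPDWiZhmtII3",
     "IsLatest": False,
     "LastModified": datetime(2018, 10, 4, 21, 11, 41, tzinfo=timezone.utc)},
    {"Key": "~aleksey/hurdat/",
     "VersionId": "jwSyCWiv_zL5Lg.sOyN1RMMQCnGzk.0O",
     "IsLatest": False,
     "LastModified": datetime(2018, 10, 4, 21, 13, 14, tzinfo=timezone.utc)},
]


def version_ids_newest_first(entries, key):
    """Return the VersionIds recorded for `key`, most recent first."""
    matching = [entry for entry in entries if entry["Key"] == key]
    matching.sort(key=lambda entry: entry["LastModified"], reverse=True)
    return [entry["VersionId"] for entry in matching]


ids = version_ids_newest_first(entries, "~aleksey/hurdat/")
print(ids)
```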

You can download an object as of a specific `VersionId` using the optional `version` keyword argument in `get` or `get_file`:

In [17]:
data, meta = he.get("alpha-quilt-storage/~aleksey/hurdat/atlantic-storms.parquet", 
                    version="mP4USSZF2mJSaKNvr7EjUldDQm3Sqb_b")

Note that you need to provide the full version hash for this to work.

## Snapshot your projects

<!-- In the future this section should treat versions, not snapshots. -->

Snapshots are the core abstraction in T4. A **snapshot** is a static view of a set of files in T4.

Creating a snapshot is easy. Just call `snapshot` on an S3 key:

In [24]:
he.snapshot("alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.parquet",
            message="First snap.")

'2d24be29ebb0c9cf785bcb8dfd390f6821b4ce57fb205790a2abd1567cf9c79b'

You can list snapshots of an S3 key using `list_snapshots`:

In [13]:
he.list_snapshots("alpha-quilt-storage/~aleksey/atlantic-storms-data.parquet")

path,hash,timestamp,message
~aleksey/atlantic-storms-data.parquet,de63c7c1bddb4426bb4bced7aedff30d4c9df3dc406e4b...,2018-10-08 21:38:10+00:00,First snap.
,435d7b954fe6dbd35cf51b311971fc49643d24af9f6f69...,2018-10-08 21:21:01+00:00,foo


Snapshots can be used to version anything with an S3 key, but they are most useful when versioning **data packages**: groups of files which together represent the data component of a specific project you are working on.

You can think of a data project as having three components: code, environment, and data. Versioning code is obvious: just use `git`. Similarly, sophisticated tools exist for versioning environments: `conda` and Docker, for example.

But what about your data? Data can balloon to many terabytes in size, becoming too large for `git` or Docker to manage. At the same time, in data science, small changes in data can often have a disproportionate impact on your analysis and throw off your models. In a [seminal paper](https://ai.google/research/pubs/pub43146) on data systems, Google referred to this as the CACE principle: "Changing Anything Changes Everything".

Clearly, data needs its own native versioning tool. T4 snapshots provide just that!

To demonstrate, let's start by cloning a simple project using our storms data.

In [None]:
!cd ~/Desktop; git clone https://github.com/ResidentMario/hurdat-example-repo

This project contains an `environment.yml` file defining our code environment, a `notebooks` folder containing some Jupyter notebooks, and a `data` folder containing inputs and outputs.

Our objective: smartly manage our `data`. With T4 snapshots, this is easy:

In [None]:
# Note: replace this path with one that works on your local machine.
he.put_file("/Users/alex/Desktop/hurdat-example-repo/data/", 
            "alpha-quilt-storage/aleksey/hurdat-example-repo/data/")

In [None]:
he.snapshot("alpha-quilt-storage/aleksey/hurdat-example-repo/data/", message="Snap.")

In [None]:
he.list_snapshots("alpha-quilt-storage/aleksey/hurdat-example-repo/data/")

Now, whenever we want to grab a file from a particular snapshot of this data project, we need only pass the snapshot's hash to the `snapshot` parameter of `get_file`:

In [None]:
# Note: replace this path with one that works on your local machine.
he.get_file("alpha-quilt-storage/aleksey/hurdat-example-repo/data/atlantic.csv", 
            "/Users/alex/Desktop/hurdat-example-repo/data/atlantic.csv",
            snapshot="cb06134062b8b8")

Check this hash into your `README.md` and enjoy your newfound project reproducibility!
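One way to make the pinned hash machine-readable, rather than only mentioning it in the README, is to keep it in a small file committed next to the code. This is a sketch under assumptions: the file name `data-snapshot.json` and the helpers `pin_snapshot` and `read_pinned_snapshot` are hypothetical and not part of `helium`.

```python
import json
import tempfile
from pathlib import Path


def pin_snapshot(path, key, snapshot_hash):
    """Record which snapshot of `key` this project depends on."""
    pin = {"key": key, "snapshot": snapshot_hash}
    Path(path).write_text(json.dumps(pin, indent=2))


def read_pinned_snapshot(path):
    """Read back the (key, snapshot hash) pair written by pin_snapshot."""
    pin = json.loads(Path(path).read_text())
    return pin["key"], pin["snapshot"]


# A temporary directory stands in for the project checkout.
with tempfile.TemporaryDirectory() as tmp:
    pin_file = Path(tmp) / "data-snapshot.json"
    pin_snapshot(pin_file,
                 "alpha-quilt-storage/aleksey/hurdat-example-repo/data/",
                 "cb06134062b8b8")
    key, snapshot_hash = read_pinned_snapshot(pin_file)

print(key, snapshot_hash)
# The hash read back could then be passed as the `snapshot` argument
# of `get_file`, as in the cell above.
```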

In summary, every data science product, be it an analysis, a model, or an exposition, relies on a collection of data file **versions**, which a data scientist can logically organize into one (or more) **snapshots**. These snapshots are **immutable**, and, in conjunction with version control on the project code and the project environment, enable reproducible, distributable data science.
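The idea of a snapshot as an immutable record of file versions can be illustrated with a toy model. T4's actual snapshot format and hashing scheme are not described in this notebook, so everything here is an assumption made for illustration: `make_snapshot`, the use of SHA-256, and the sample keys and version IDs are all hypothetical.

```python
import hashlib
import json
from types import MappingProxyType


def make_snapshot(versions, message):
    """Build a read-only record from a dict mapping object key -> version ID."""
    # Sort the pairs so the same set of versions always produces the same hash.
    canonical = json.dumps(sorted(versions.items())).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    return MappingProxyType({
        "hash": digest,
        "message": message,
        "versions": MappingProxyType(dict(versions)),
    })


snap_a = make_snapshot({"data/atlantic.csv": "v1", "data/notes.txt": "v7"}, "Snap.")
snap_b = make_snapshot({"data/notes.txt": "v7", "data/atlantic.csv": "v1"}, "Same versions")
snap_c = make_snapshot({"data/atlantic.csv": "v2", "data/notes.txt": "v7"}, "New version")

print(snap_a["hash"] == snap_b["hash"])  # True: identical versions, identical hash
print(snap_a["hash"] == snap_c["hash"])  # False: one file changed version
```

Because the hash is derived from the exact set of versions, a change to any one file yields a different snapshot identifier, which is what makes a recorded hash useful for reproducibility.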

## Addendum: clean up

In [26]:
# Clean up
!rm -rf ~/Desktop/hurdat-example-repo
!rm ~/Desktop/atlantic-storms.csv