# Quilt3 quick start
`quilt3` is a Python library that lets you create versioned datasets in S3 with a few simple commands.

A *quilt package* is a collection of files with an immutable version hash. Packages can be of any size and can contain any kind of data. To learn more, read the [Quilt mental model](https://docs.quiltdata.com/v/master/mentalmodel) in our docs.
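
To make the idea of an immutable version hash concrete, here is a minimal sketch using only the standard library. It is an illustration, not Quilt's actual hashing scheme: the `package_hash` helper and the in-memory file contents are hypothetical. The point is that the hash is derived from the contents, so any change yields a new version.

```python
import hashlib
from typing import Dict


def package_hash(files: Dict[str, bytes]) -> str:
    """Hash a mapping of logical path -> file contents (hypothetical helper)."""
    top = hashlib.sha256()
    # Sort by path so the result does not depend on insertion order.
    for path in sorted(files):
        entry_hash = hashlib.sha256(files[path]).hexdigest()
        top.update(f"{path}:{entry_hash}\n".encode("utf-8"))
    return top.hexdigest()


v1 = package_hash({"README.md": b"# Hurricanes\n", "data.csv": b"a,b\n1,2\n"})
v2 = package_hash({"README.md": b"# Hurricanes\n", "data.csv": b"a,b\n1,3\n"})

# A one-byte change in one file produces a different version hash.
print(v1 != v2)
```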

You can use `quilt3` on the command line or in Python.

There are three key commands in `quilt3`:
* `push` stores data in a *registry* (usually an S3 bucket)
* `browse` lets you interact with a remote data package without downloading all of the data
* `install` downloads the data to the host


## Pre-requisites
To run this notebook you'll need the following:
1. Python 3.6 or higher
1. An AWS account, and access to an S3 bucket
1. [AWS Credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) on your machine
1. Optional: a Python environment
1. `pandas`

## Installation
If you haven't already, switch to a clean directory and, optionally, activate a Python environment.

```sh
cd YOUR_CLEAN_DIR
. activate YOUR_ENV
pip install pandas
```

In [5]:
! pip install 'quilt3[pyarrow]'



## Install a data package
Let's get some data. We could also do this in Python, but for simple tasks I find the CLI more convenient.

In [4]:
! quilt3 install examples/hurdat2 --registry s3://quilt-example --dest data



The above command downloaded the latest version of the package `examples/hurdat2` to our local machine (it also created the `data` directory for us, as specified with `--dest`). The registry is simply an S3 bucket that hosts our data. Quilt can use any S3 bucket as a data repository.
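
For comparison, the same install can be done from Python. The sketch below assumes `quilt3` is installed and AWS credentials are configured; `install_hurdat2` is a hypothetical wrapper name, and the import is deferred so that defining the function has no side effects.

```python
def install_hurdat2(dest: str = "data"):
    """Download examples/hurdat2 into `dest` (hypothetical wrapper)."""
    import quilt3  # deferred: only needed when the function is called

    # Mirrors: quilt3 install examples/hurdat2 --registry s3://quilt-example --dest data
    return quilt3.Package.install(
        "examples/hurdat2",
        registry="s3://quilt-example",
        dest=dest,
    )
```

Calling `install_hurdat2()` should leave the same files in `data/` as the CLI command.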

In [4]:
!ls data

Atlantic-HURDAT2.parquet external
README.md                quilt_summarize.json


These are standard files that we can interact with. For example:

In [5]:
! wc -l data/README.md

      22 data/README.md


In [6]:
! ls -R

QuickStart.ipynb data             env

./data:
Atlantic-HURDAT2.parquet external
README.md                quilt_summarize.json

./data/external:
hurdat2-1851-2020-052921.txt hurdat2-format-nov2019.pdf

./env:
bin        etc        include    lib        pyvenv.cfg share

./env/bin:
Activate.ps1         jupyter-events       numpy-config
__pycache__          jupyter-execute      pip
activate             jupyter-kernel       pip3
activate.csh         jupyter-kernelspec   pip3.10
activate.fish        jupyter-lab          pybabel
f2py                 jupyter-labextension pygmentize
httpx                jupyter-labhub       pyjson5
ipython

In [1]:
import pandas as pd
import quilt3 as q3

pd.read_parquet("data/Atlantic-HURDAT2.parquet")

Unnamed: 0,YYYY-MM-DD,TimeUTC,Record identifier,Status of system,Latitude,Longitude,Max. sustained wind (knots),Min. pressure (millibars),34 kt wind max. NE (nautical miles),34 kt wind max. SE (nautical miles),...,50 kt wind max. NW (nautical miles),64 kt wind max. NE (nautical miles),64 kt wind max. SE (nautical miles),64 kt wind max. SW (nautical miles),64 kt wind max. NW (nautical miles),Basin,ATCF Cyclone Number,HYear,Name,Num. best track entries
0,18510625,0000,,HU,28.0N,94.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
1,18510625,0600,,HU,28.0N,95.4W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
2,18510625,1200,,HU,28.0N,96.0W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
3,18510625,1800,,HU,28.1N,96.5W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
4,18510625,2100,L,HU,28.2N,96.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21,20201117,1200,,HU,13.7N,84.7W,75,965,170,110,...,50,30,0,0,20,AL,31,2020,IOTA,26
22,20201117,1800,,TS,13.7N,85.7W,55,988,150,70,...,0,0,0,0,0,AL,31,2020,IOTA,26
23,20201118,0000,,TS,13.8N,86.7W,40,1000,140,0,...,0,0,0,0,0,AL,31,2020,IOTA,26
24,20201118,0600,,TS,13.8N,87.8W,35,1005,140,0,...,0,0,0,0,0,AL,31,2020,IOTA,26


## Browse a data package
Sometimes we want to install only part of a data package. We can do that with `quilt3 install data/package/subfolder`. Other times, we might want to just see what's in a package:

In [2]:
p = q3.Package.browse("examples/hurdat2", "s3://quilt-example")
p

Loading manifest: 100%|██████████| 7/7 [00:00<00:00, 6.82k/s]


(remote Package)
 └─.quiltignore
 └─Atlantic-HURDAT2.parquet
 └─README.md
 └─external/
   └─hurdat2-1851-2020-052921.txt
   └─hurdat2-format-nov2019.pdf
 └─quilt_summarize.json

Here you'll notice that we only downloaded the package *manifest*, not the full dataset. The manifest gives us metadata about what's inside the package, but contains only pointers to the *primary data*.
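
As a rough illustration of why browsing is cheap, the sketch below models a manifest as a short list of entries that carry a logical key, a pointer to where the bytes live, and a size, but never the bytes themselves. The field names, bucket paths, and sizes are made up for illustration and are not the real manifest of `examples/hurdat2`.

```python
import json

# Hypothetical manifest entries: metadata plus pointers, no file contents.
manifest = [
    {
        "logical_key": "README.md",
        "physical_key": "s3://example-bucket/pkg/README.md",
        "size": 1200,
    },
    {
        "logical_key": "Atlantic-HURDAT2.parquet",
        "physical_key": "s3://example-bucket/pkg/Atlantic-HURDAT2.parquet",
        "size": 6200000,
    },
]

manifest_bytes = len(json.dumps(manifest).encode("utf-8"))
data_bytes = sum(entry["size"] for entry in manifest)

# Listing the package needs only the manifest, which is tiny next to the data.
for entry in manifest:
    print(entry["logical_key"], "->", entry["physical_key"])
print(manifest_bytes < data_bytes)
```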

### Pulling data from S3 into memory
Suppose we don't want to download anything to disk, but just pull a Parquet file into memory:

In [3]:
p["Atlantic-HURDAT2.parquet"]()

Unnamed: 0,YYYY-MM-DD,TimeUTC,Record identifier,Status of system,Latitude,Longitude,Max. sustained wind (knots),Min. pressure (millibars),34 kt wind max. NE (nautical miles),34 kt wind max. SE (nautical miles),...,50 kt wind max. NW (nautical miles),64 kt wind max. NE (nautical miles),64 kt wind max. SE (nautical miles),64 kt wind max. SW (nautical miles),64 kt wind max. NW (nautical miles),Basin,ATCF Cyclone Number,HYear,Name,Num. best track entries
0,18510625,0000,,HU,28.0N,94.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
1,18510625,0600,,HU,28.0N,95.4W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
2,18510625,1200,,HU,28.0N,96.0W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
3,18510625,1800,,HU,28.1N,96.5W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
4,18510625,2100,L,HU,28.2N,96.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,AL,01,1851,UNNAMED,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21,20201117,1200,,HU,13.7N,84.7W,75,965,170,110,...,50,30,0,0,20,AL,31,2020,IOTA,26
22,20201117,1800,,TS,13.7N,85.7W,55,988,150,70,...,0,0,0,0,0,AL,31,2020,IOTA,26
23,20201118,0000,,TS,13.8N,86.7W,40,1000,140,0,...,0,0,0,0,0,AL,31,2020,IOTA,26
24,20201118,0600,,TS,13.8N,87.8W,35,1005,140,0,...,0,0,0,0,0,AL,31,2020,IOTA,26


The `()` above are just shorthand for `p["Atlantic-HURDAT2.parquet"].deserialize()`.
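
The pattern behind this shorthand is ordinary Python: an object that defines `__call__` can forward to another method. The toy class below is not Quilt's implementation; `ToyEntry` and its extension-based `deserialize` are hypothetical and exist only to show the delegation.

```python
import json


class ToyEntry:
    """Stand-in for a package entry holding raw bytes (hypothetical)."""

    def __init__(self, logical_key: str, raw: bytes):
        self.logical_key = logical_key
        self.raw = raw

    def deserialize(self):
        # Pick a decoder from the file extension; fall back to raw bytes.
        if self.logical_key.endswith(".json"):
            return json.loads(self.raw.decode("utf-8"))
        if self.logical_key.endswith(".md"):
            return self.raw.decode("utf-8")
        return self.raw

    def __call__(self):
        # entry() is shorthand for entry.deserialize()
        return self.deserialize()


entry = ToyEntry("quilt_summarize.json", b'["README.md"]')
print(entry() == entry.deserialize())
```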

We can also just pull bytes or strings from S3:

In [4]:
p["README.md"].get_as_string()

'# HURDAT2 - Atlantic Hurricane Database\n\nAtlantic hurricane database (HURDAT2) 1851-2020 (6.2MB download)\nThis dataset was provided on 10 June 2021 to include the 2020 best tracks.\n\nThis dataset\n([known as Atlantic HURDAT2](https://www.nhc.noaa.gov/data/hurdat/hurdat2-format-nov2019.pdf))\nhas a comma-delimited, text format with six-hourly information on the location, maximum winds, central pressure,\nand (beginning in 2004) size of all known tropical cyclones and subtropical\ncyclones.  The original HURDAT database has been retired.\n\nDetailed information regarding the\n[Atlantic Hurricane Database Re-analysis Project](http://www.aoml.noaa.gov/hrd/data_sub/re_anal.html)\nis available from the\n[Hurricane Research Division](http://www.aoml.noaa.gov/hrd/).\n\n\n## Source\nhttps://www.nhc.noaa.gov/data/\n\n### Reference\nLandsea, C. W. and J. L. Franklin, 2013: Atlantic Hurricane Database Uncertainty and Presentation of a New Database Format. Mon. Wea. Rev., 141, 3576-3592.\n'

## Documenting datasets as Quilt packages
Data only makes sense in the context of documentation, visualizations, and commit messages. The Quilt data catalog provides all of this for your data sets. Here's the landing page for [examples/hurdat2](https://open.quiltdata.com/b/quilt-example/packages/examples/hurdat2/) (check it out; datasets should be beautiful; Quilt helps).


As you've seen above, Quilt allows you to embed READMEs, PDFs, Vega visualizations and more. You can control how your data package looks by adding a `quilt_summarize.json` at the root. This is simply a list of files or Vega visualizations that you wish Quilt to display in the catalog.

In [5]:
! cat data/quilt_summarize.json

[
  "Atlantic-HURDAT2.parquet",
  "external/hurdat2-format-nov2019.pdf"
]
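
If you prefer to generate this file from code, a minimal sketch with the standard library follows. It writes to a temporary directory rather than your package, and the two entries simply mirror the file shown above.

```python
import json
import tempfile
from pathlib import Path

# Files to feature in the catalog, relative to the package root.
summary = [
    "Atlantic-HURDAT2.parquet",
    "external/hurdat2-format-nov2019.pdf",
]

# Write into a scratch directory; in practice this would be the package root.
root = Path(tempfile.mkdtemp())
target = root / "quilt_summarize.json"
target.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")

print(target.read_text(encoding="utf-8"))
```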


## Make your own data package with `push`
Ready to make your own data package? Fill out these variables so that you can push your own package.

In [12]:
bucket = "s3://YOUR_BUCKET"
pname = "FIRST/LAST" # keep that / in there! packages require a handle with both parts
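
Since the name needs both parts, it can help to check it before pushing. The helper below is a hypothetical convenience, not part of `quilt3`, and its pattern (letters, digits, `_`, `.`, or `-` on each side of a single `/`) is an assumption; Quilt's own validation may differ.

```python
import re

# Assumed shape: one namespace, one slash, one package name.
_NAME_RE = re.compile(r"[\w.-]+/[\w.-]+")


def looks_like_package_name(name: str) -> bool:
    """Return True if `name` has the NAMESPACE/NAME shape (hypothetical check)."""
    return _NAME_RE.fullmatch(name) is not None


print(looks_like_package_name("examples/hurdat2"))
print(looks_like_package_name("hurdat2"))
```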

In [13]:
p2 = q3.Package()
p2.set_dir(".", "data")

(local Package)
 └─.quiltignore
 └─Atlantic-HURDAT2.parquet
 └─README.md
 └─external/
   └─hurdat2-1851-2020-052921.txt
   └─hurdat2-format-nov2019.pdf
 └─quilt_summarize.json

Above we used `set_dir` to capture the contents of "data" into the root of the package. Now we can push it.

In [None]:
p2.push(pname, bucket, message="Testing quilt3 packages")

Boom. Now you've created your own immutable package. Anyone with access to the same S3 bucket can interact with this package as follows:

In [27]:
p3 = q3.Package.browse(pname, bucket)

Downloading manifest: 100%|██████████| 1.62k/1.62k [00:00<00:00, 1.89kB/s]
Loading manifest: 100%|██████████| 7/7 [00:00<00:00, 7.03k/s]
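
Because each push yields an immutable version, collaborators can also pin to an exact one instead of the latest. The sketch below assumes `quilt3` is installed and that you have recorded the hash of the version you want; `browse_pinned` is a hypothetical wrapper around `Package.browse` and its `top_hash` argument.

```python
def browse_pinned(name: str, registry: str, top_hash: str):
    """Open one exact version of a package (hypothetical wrapper)."""
    import quilt3  # deferred: only needed when the function is called

    # top_hash selects a specific immutable version instead of the latest.
    return quilt3.Package.browse(name, registry=registry, top_hash=top_hash)
```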


> **Note** It's not recommended to push a package that contains a notebook *from that same notebook*: the notebook file is still changing, may be incomplete on disk, and may have a different hash than the bytes pushed.

In [None]:
! quilt3 install $pname --registry $bucket --dest test-dir
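
To confirm the round trip, you can compare the freshly installed directory against the original. This sketch uses only the standard library and demonstrates the comparison on two throwaway directories; `same_tree` is a hypothetical helper, and in practice you would call it as `same_tree("data", "test-dir")`.

```python
import filecmp
import tempfile
from pathlib import Path


def same_tree(left: str, right: str) -> bool:
    """Return True if both directories hold the same files with equal bytes."""
    cmp = filecmp.dircmp(left, right)
    if cmp.left_only or cmp.right_only or cmp.common_funny:
        return False
    # shallow=False forces a byte-by-byte comparison of the common files.
    _, mismatch, errors = filecmp.cmpfiles(
        left, right, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    # Recurse into subdirectories present on both sides.
    return all(
        same_tree(str(Path(left) / d), str(Path(right) / d))
        for d in cmp.common_dirs
    )


# Build two identical scratch trees, then change one file in the second.
a, b = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
for root in (a, b):
    (root / "external").mkdir()
    (root / "README.md").write_text("hello\n")
    (root / "external" / "notes.txt").write_text("same\n")

identical_before = same_tree(str(a), str(b))
(b / "README.md").write_text("changed\n")
identical_after = same_tree(str(a), str(b))

print(identical_before, identical_after)
```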

That's it. **Go thou and make versioned datasets that are reproducible, discoverable, and trusted with `quilt3` 🤣. To learn more, visit [docs.quiltdata.com](https://docs.quiltdata.com).**