# Quilt for data versioning
We use Quilt to version our data. A data set is assembled on a local machine (ideally an EC2 instance, since we interact heavily with S3 and this gives faster uploads and downloads and lower bandwidth costs).

Currently, our workflow doesn't take full advantage of Quilt. Our main processing routine uses AWS Batch to run many brain images through our Neuromorphometry Reporting tool (re:THINQ). This is done with a custom-built wrapper around `awscli`, which provides command line tools to submit the brain images of subjects stored in our S3 infrastructure and to check their processing status (i.e. SUBMITTED, RUNNING, COMPLETED, FAILED).

## Initial upload of data
We use Quilt's CLI to create a Package and push it to S3:

```bash
local_data_dir="/home/ubuntu/data/LA5c"
quilt_bucket="s3://cmet-quilt"
code_version="1.0.0"
quilt_package_name="ltirrell/rend"

quilt3 push \
  "${quilt_package_name}" \
  --dir "${local_data_dir}" \
  --dest "${quilt_bucket}/${quilt_package_name}/data" \
  --registry "${quilt_bucket}" \
  --message "Initial commit of data"
```

## Process data

We run our processing on AWS Batch, which points to the data stored in this S3 bucket:

```bash
submit_subjects \
    --input_data "${quilt_bucket}/${quilt_package_name}/data" \
    --container_tag "${thinq_version}" \
    --output_bucket "${quilt_bucket}/${quilt_package_name}/results/${code_version}" \
    ...
```
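The input and output locations follow a fixed layout under the package prefix: data under `data`, results under `results/<code version>`. A minimal sketch of that layout, reusing the variable values from the upload step; the helper names `input_uri` and `results_uri` are hypothetical and not part of our wrapper:

```bash
#!/usr/bin/env bash
# Values copied from the upload step above.
quilt_bucket="s3://cmet-quilt"
quilt_package_name="ltirrell/rend"
code_version="1.0.0"

# Hypothetical helpers (not part of the wrapper): build the S3 prefixes used
# for the input data and for the versioned results. ${var:?} aborts with an
# error if a variable is unset or empty, instead of producing a broken URI.
input_uri() {
  printf '%s/%s/data\n' "${quilt_bucket:?}" "${quilt_package_name:?}"
}

results_uri() {
  printf '%s/%s/results/%s\n' \
    "${quilt_bucket:?}" "${quilt_package_name:?}" "${code_version:?}"
}

input_uri    # prints s3://cmet-quilt/ltirrell/rend/data
results_uri  # prints s3://cmet-quilt/ltirrell/rend/results/1.0.0
```

Keeping the code version in the results prefix means results from different versions of the code never overwrite each other.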

## Store results

The status of the individual submissions is checked:

```bash
check_status ...
```

and when all are completed, results are added to an updated version of the Quilt Package:

```bash
quilt3 push \
  "${quilt_package_name}" \
  --dir "${quilt_bucket}/${quilt_package_name}/results/${code_version}" \
  --dest "${quilt_bucket}/${quilt_package_name}/results/${code_version}" \
  --registry "${quilt_bucket}" \
  --message "Add results of data using version ${code_version} of code"
```
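The "push only when all are completed" rule can be automated with a small gate. The sketch below assumes the status report has one line per submission with the status as the last whitespace-separated field; that format is an assumption for illustration, not necessarily what `check_status` prints, and `all_completed` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads one "<subject> <STATUS>" line per submission on
# stdin (an assumed format) and succeeds only if there is at least one
# submission and every status is COMPLETED.
all_completed() {
  awk '
    NF == 0 { next }              # skip blank lines
    { total++; count[$NF]++ }     # status is assumed to be the last field
    END {
      printf "total=%d completed=%d failed=%d\n", total, count["COMPLETED"], count["FAILED"]
      exit ((total > 0 && count["COMPLETED"] == total) ? 0 : 1)
    }
  '
}

# Example with made-up subject IDs: one submission is still running,
# so the gate fails and nothing would be pushed.
if printf 'sub-01 COMPLETED\nsub-02 RUNNING\n' | all_completed; then
  echo "all submissions completed; safe to push results"
else
  echo "not all submissions completed; skipping push"
fi
```

In practice the output of the status check would be piped into `all_completed`, with the `quilt3 push` command placed in the success branch.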

## Future work

Our goal is to update our AWS Batch wrapper to take advantage of data stored and versioned with Quilt.
Using the `quilt3` Python library instead of the command line tool shown above would let us point to exact versions of input data more explicitly and succinctly in our workflows.
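Until the wrapper is ported, the CLI can already pin an exact package version: `quilt3 install` accepts a `--top-hash` option that selects one specific revision. A minimal sketch, where `install_pinned`, the `DRY_RUN` switch, the hash value, and the destination directory are all hypothetical placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical helper: install one exact revision of a Quilt package.
#   $1 = package name, $2 = registry, $3 = top hash, $4 = local destination
# With DRY_RUN=1 the command is printed instead of executed.
install_pinned() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "quilt3 install $1 --registry $2 --top-hash $3 --dest $4"
  else
    quilt3 install "$1" --registry "$2" --top-hash "$3" --dest "$4"
  fi
}

# "abc123" is a placeholder, not a real hash; use the top hash of the
# package revision that the processing run should read.
DRY_RUN=1
install_pinned "ltirrell/rend" "s3://cmet-quilt" "abc123" "./data"
```

Recording the top hash alongside the code version for each processing run would make it possible to reproduce exactly which inputs produced a given set of results.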