# Persisting Files with Quilt


* Your requirements today may evolve in the future
* Data Science @Home



# Install

In [1]:
import quilt

In [2]:
quilt.install("uciml/wine", force=True)

Downloading 6be6b1203f3d51df0b553a70e57b8a723cd405683958204f96d23d7cd6aea659 (1/4)...
Fragment already installed; skipping.
Downloading b0b11f401da13abd783a48c6cba0853b5a628c2eb4ed6812196d0f1aa1c5bf2e (2/4)...
Fragment already installed; skipping.
Downloading d0cfdf9e97162db6656d6cd1907fedc365148818d3a8c4fdf9b7efb5a2cbeb4c (3/4)...
Fragment already installed; skipping.
Downloading f1b84f2ef845e0bdebf13e14fa7a213e56de4f1baa40c5974dbd1ee51c5ae710 (4/4)...
Fragment already installed; skipping.


`force=True` suppresses the interactive yes/no prompt in the shell when the package already exists.

# Import, browse

In [3]:
from quilt.data.uciml import wine

## Packages are like miniature filesystems

In [4]:
wine

<PackageNode '/Users/karve/code/examples/quilt_packages/uciml/wine'>
raw/
tables/
README

## Groups are like directories

In [5]:
wine.tables

<GroupNode>

wine

## Leaf nodes contain data

In [6]:
wine.tables.wine

<DataNode>

## You can programmatically navigate nodes

In [7]:
wine._keys()

['README', 'raw', 'tables']

In [8]:
wine._data_keys()

['README']

In [9]:
wine._group_keys()

['raw', 'tables']
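
The three key methods above are enough to walk a whole package. The sketch below is one possible way to do it, not part of Quilt itself: `walk` and `FakeNode` are hypothetical names introduced here, and `FakeNode` only mimics `_keys()`, `_data_keys()` and `_group_keys()` so the example runs without an installed package. It assumes child nodes are reachable by attribute, as in `wine.tables.wine`.

```python
class FakeNode:
    """Hypothetical stand-in for a Quilt node; not part of the quilt library."""

    def __init__(self, groups=None, data=None):
        self._groups = dict(groups or {})
        self._data = list(data or [])
        # Expose child groups as attributes, as in wine.tables
        for name, child in self._groups.items():
            setattr(self, name, child)

    def _group_keys(self):
        return sorted(self._groups)

    def _data_keys(self):
        return sorted(self._data)

    def _keys(self):
        return sorted(self._data_keys() + self._group_keys())


def walk(node, prefix=""):
    """Yield the dotted path of every data node below `node`."""
    for name in node._data_keys():
        yield prefix + name
    for name in node._group_keys():
        yield from walk(getattr(node, name), prefix + name + ".")


# Mirror the layout printed above: README at the top level, raw/ and tables/ as groups.
# The contents of raw/ are not shown in this notebook, so it is left empty here.
fake_wine = FakeNode(
    groups={
        "raw": FakeNode(),
        "tables": FakeNode(data=["wine"]),
    },
    data=["README"],
)

paths = list(walk(fake_wine))
print(paths)
```

With a real package, passing `wine` instead of `fake_wine` should list every leaf in the same dotted form.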

## Call a `DataNode` with `()` to fetch its data from disk

In [9]:
wine.tables.wine()

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.640000,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.380000,1.05,3.40,1050
1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.680000,1.03,3.17,1185
1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.800000,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.320000,1.04,2.93,735
1,14.20,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.750000,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.50,2.52,0.30,1.98,5.250000,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.60,2.51,0.31,1.25,5.050000,1.06,3.58,1295
1,14.83,1.64,2.17,14.0,97,2.80,2.98,0.29,1.98,5.200000,1.08,2.85,1045
1,13.86,1.35,2.27,16.0,98,2.98,3.15,0.22,1.85,7.220000,1.01,3.55,1045


## Some `DataNode`s aren't data frames
Some nodes hold non-columnar data, such as a README. Calling such a node returns a path to the object (or fragment) on disk.

In [10]:
wine.README()

'/Users/karve/code/examples/quilt_packages/objs/b0b11f401da13abd783a48c6cba0853b5a628c2eb4ed6812196d0f1aa1c5bf2e'
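
Since the call returns an ordinary file path, the contents can be read with plain Python. This is a minimal sketch: `read_text` is a hypothetical helper, and a temporary file stands in for the path that `wine.README()` returns so the example runs on its own.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the contents of the file at `path` as a string."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Stand-in for the path returned by wine.README(): a temporary file created here
# so the sketch runs without an installed package.
with tempfile.NamedTemporaryFile(
    "w", suffix=".md", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("# Placeholder README\n")
    readme_path = tmp.name

text = read_text(readme_path)
print(text)

# Remove the stand-in file; a real package object should be left in place.
os.remove(readme_path)
```

With the package installed, `read_text(wine.README())` would be the equivalent call.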

# Why Quilt

* Data Repository: I want collaborators to be able to import my data sets easily
* Notebook Server: I want collaborators to rerun my notebooks on a notebook server without having to download and save code and data locally on their machines
* Preserve Data Types: I want to share validated data types with my collaborators (i.e., Python type annotations)
* ... It's fun ...

# Upload Data to Quilt Repository

## Start with an empty package

In [2]:
quilt.build("avare/homecredit")

## Put some data in it

In [24]:
import pandas as pd
from quilt.data.avare import homecredit

application_df = pd.read_csv('/Users/stewarta/Documents/DATA/Home Data/application_train.csv')
#description_df = pd.read_csv("/Users/stewarta/Documents/DATA/Home Data/docs/HomeCredit_columns_description.csv")

In [25]:
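homecredit._set(["data", "application"], application_df)
quilt.build("avare/homecredit", homecredit)  # persist the in-memory edit locally so the push has data to upload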


## Push the new package to the registry

In [26]:
quilt.login()

Launching a web browser...
If that didn't work, please visit the following URL: https://pkg.quiltdata.com/login

Enter the code from the webpage: eyJpZCI6ICJmZDhmNGZmZi0yNmE0LTQ5NzktYjcxZS1lZGYwNWQzMmM5ZDIiLCAiY29kZSI6ICI3MWYyYWFkMy0xZDAxLTRmYjgtYWM2Yi1iODZhNWI2ZjhmMmEifQ==


In [27]:
quilt.push("avare/homecredit", is_public=True)

Fetching upload URLs from the registry...
0.00B [00:00, ?B/s]
Uploading 0 fragments (0 bytes)...
Uploading package metadata...
Updating the 'latest' tag...

Push complete. avare/homecredit is live:
https://quiltdata.com/package/avare/homecredit


## Install

Visit the repository: https://quiltdata.com/package/avare/homecredit 


In [28]:
quilt.install("avare/homecredit", force=True)

Downloading package metadata...
Fragments already downloaded


`force=True` suppresses the interactive yes/no prompt in the shell when the package already exists.

## Import, Browse

In [29]:
from quilt.data.avare import homecredit

In [35]:
homecredit.data.application()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.000,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.000,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.000,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.000,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.000,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,Cash loans,M,N,Y,0,99000.000,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
6,100009,0,Cash loans,F,Y,Y,1,171000.000,1560726.0,41301.0,...,0,0,0,0,0.0,0.0,0.0,1.0,1.0,2.0
7,100010,0,Cash loans,M,Y,Y,0,360000.000,1530000.0,42075.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
8,100011,0,Cash loans,F,N,Y,0,112500.000,1019610.0,33826.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
9,100012,0,Revolving loans,M,N,Y,0,135000.000,405000.0,20250.0,...,0,0,0,0,,,,,,


### Working with JSON data
Unstructured and semi-structured data behave like files: calling the node returns a path on disk. Structured (columnar) data automatically deserialize as data frames.
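
Because a JSON node behaves like a file, one way to use it is to hand its path to the standard `json` module. This is a sketch under assumptions: `load_json` is a hypothetical helper, and the temporary file with made-up contents stands in for the path a JSON node would return.

```python
import json
import os
import tempfile


def load_json(path):
    """Parse the JSON file at `path` into Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in for the path a JSON node would return when called.
# The contents are illustrative only.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump({"name": "example", "values": [1, 2, 3]}, tmp)
    json_path = tmp.name

record = load_json(json_path)
print(record["values"])

# Remove the stand-in file; a real package object should be left in place.
os.remove(json_path)
```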

### Persisting Data Types

best practice ....Python type hints....

## Links:

There's more on editing packages [here, in the docs](https://docs.quiltdata.com/edit-a-package.html)