# Python API for NYU's DataMart

This notebook shows how to use the Python API for NYU's DataMart system, which implements the common DataMart interface (https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py). To install it, run `pip install datamart_nyu`.

For the augmentation, we use the medical malpractice example, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_medical_malpractice

In [1]:
from d3m import container
import datamart
import datamart_nyu
import datetime
from pathlib import Path

In [2]:
def print_results(results):
    """Print the score and name of each result, plus its join columns or its id."""
    if not results:
        return
    for result in results:
        metadata = result.get_json_metadata()
        print(result.score())
        print(metadata['metadata']['name'])
        if result.get_augment_hint():
            # The result can be joined with the supplied data: show the column pairs
            print("Left Columns: %s" % str(metadata['augmentation']['left_columns_names']))
            print("Right Columns: %s" % str(metadata['augmentation']['right_columns_names']))
        else:
            # No join information: show the dataset id instead
            print(result.id())
        print("-------------------")

First, we load the medical malpractice dataset, which serves as our supplied data.

In [3]:
# Adjust this path to wherever the D3M datasets are stored on your machine
medical_malpractice_file = (
    Path.home() / 'projects/d3m/datasets/seed_datasets_data_augmentation'
    / 'DA_medical_malpractice/DA_medical_malpractice_dataset/datasetDoc.json'
)
medical_malpractice = container.Dataset.load(medical_malpractice_file.as_uri())

In [4]:
medical_malpractice['learningData'].head()

Unnamed: 0,d3mIndex,SEQNO,LICNFELD,ORIGYEAR,WORKSTAT,ALGNNATR,ALEGATN1,PTTYPE,PRACTAGE,PFIDX
0,404537,514456,10,2004,AZ,20,306,I,30,32.737
1,404538,514457,10,2004,PA,1,200,B,50,42.09
2,404540,514460,651,2004,SD,100,316,O,50,58.926
3,404554,514475,430,2004,NJ,60,334,O,20,77.633
4,404556,514477,30,2004,NH,60,306,O,50,1.871


## Searching for datasets

Let's first instantiate our client:

In [5]:
client = datamart_nyu.RESTDatamart('https://auctus.vida-nyu.org/api/v1')

### Search using keywords

In [6]:
query = datamart.DatamartQuery(
    keywords=['practitioner', 'clinical', 'malpractice', 'practitioner data bank',
              'government', 'healthcare',
              'Department of health and human services'],  # keywords from problem definition
    variables=[]
)
cursor = client.search(query=query)

In [7]:
results = cursor.get_next_page()

In [8]:
print_results(results)

53.081444
NPDB1807
datamart.url.f14cc1c6fb235efaa907bef6fefd5efe
-------------------
9.676955
FY17 BID Trends Report Data
datamart.socrata.data-cityofnewyork-us.emuv-tx7t
-------------------
9.260225
FY16 BID Trends Report Data
datamart.socrata.data-cityofnewyork-us.43ab-v68i
-------------------
8.407788
Inventory of New York City Greenhouse Gas Emissions - City Government GHG Emissions Summary (2016)
datamart.socrata.data-cityofnewyork-us.jat2-irw9
-------------------
8.121032
FY18 BID Trends Report Data
datamart.socrata.data-cityofnewyork-us.m6ad-jy3s
-------------------
8.121032
Benefits and Programs API (Historical)
datamart.socrata.data-cityofnewyork-us.2j8u-wtju
-------------------
6.7994266
NYC Women's Resource Network Database
datamart.socrata.data-cityofnewyork-us.pqg4-dm6b
-------------------
6.1543655
Derelict Vehicles Dispositions - Complaints
datamart.socrata.data-cityofnewyork-us.pq5i-thsu
-------------------
5.764398
Property Exemption Detail
datamart.socrata.data-cityof
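
The cursor hands back results one page at a time, so collecting every match means calling `get_next_page()` repeatedly. The sketch below shows one way to do that. It assumes that `get_next_page()` returns `None` or an empty sequence once the results are exhausted, which is what the `if not results` guard in `print_results` suggests but which this notebook does not confirm. `iter_all_results`, `max_pages`, and `FakeCursor` are hypothetical names introduced here for illustration; they are not part of `datamart` or `datamart_nyu`.

```python
from typing import Any, Iterator


def iter_all_results(cursor: Any, max_pages: int = 10) -> Iterator[Any]:
    """Yield results page by page until the cursor runs out.

    Assumption: get_next_page() returns None or an empty sequence when
    no results remain. max_pages is a hypothetical safety cap.
    """
    for _ in range(max_pages):
        page = cursor.get_next_page()
        if not page:
            break
        yield from page


class FakeCursor:
    """Stand-in for a real search cursor, used only to exercise the helper."""

    def __init__(self, pages):
        self._pages = list(pages)

    def get_next_page(self):
        # Hand out the stored pages in order, then signal exhaustion with None
        return self._pages.pop(0) if self._pages else None


all_results = list(iter_all_results(FakeCursor([["a", "b"], ["c"]])))
print(all_results)  # ['a', 'b', 'c']
```

With a real client, the same helper could be called as `iter_all_results(client.search(query=query))`, provided the assumption about the end-of-results signal holds.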

### Search using data and keywords

In [9]:
query = datamart.DatamartQuery(
    keywords=['practitioner', 'clinical', 'malpractice', 'practitioner data bank',
              'government', 'healthcare',
              'Department of health and human services'],  # keywords from problem definition
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=medical_malpractice)

In [10]:
results = cursor.get_next_page()

In [11]:
print_results(results)

50.0
NPDB1807
Left Columns: [['ORIGYEAR']]
Right Columns: [['ORIGYEAR']]
-------------------
50.0
NPDB1807
Left Columns: [['ALGNNATR']]
Right Columns: [['ALGNNATR']]
-------------------
44.667477
NPDB1807
Left Columns: [['SEQNO']]
Right Columns: [['SEQNO']]
-------------------
41.181362
NPDB1807
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN1']]
-------------------
39.95984
NPDB1807
Left Columns: [['LICNFELD']]
Right Columns: [['LICNFELD']]
-------------------
30.434784
NPDB1807
Left Columns: [['PRACTAGE']]
Right Columns: [['PRACTAGE']]
-------------------
20.2995
NPDB1807
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN2']]
-------------------
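
Every result above refers to the same dataset (NPDB1807) and differs only in the suggested join columns, so picking a result by position is fragile. A minimal sketch of selecting a result by its join column instead is shown below. It relies only on the `['augmentation']['left_columns_names']` metadata layout printed above; `find_by_join_column` and `FakeResult` are hypothetical helpers introduced for illustration, not part of the DataMart API.

```python
def find_by_join_column(results, column):
    """Return the first result that suggests joining on the single column `column`."""
    for result in results:
        augmentation = result.get_json_metadata().get('augmentation', {})
        # left_columns_names is a list of column groups, e.g. [['SEQNO']]
        if [column] in augmentation.get('left_columns_names', []):
            return result
    return None


class FakeResult:
    """Stand-in for a search result, mimicking the metadata shown above."""

    def __init__(self, left, right):
        self._metadata = {
            'augmentation': {
                'left_columns_names': [[left]],
                'right_columns_names': [[right]],
            }
        }

    def get_json_metadata(self):
        return self._metadata


fake_results = [
    FakeResult('ORIGYEAR', 'ORIGYEAR'),
    FakeResult('ALGNNATR', 'ALGNNATR'),
    FakeResult('SEQNO', 'SEQNO'),
]
match = find_by_join_column(fake_results, 'SEQNO')
print(fake_results.index(match))  # 2
```

Against real search results, `find_by_join_column(results, 'SEQNO')` would return the same object as `results[2]` for the output shown above.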


## Downloading a dataset

Now let's materialize one of the DataMart datasets: the first result of the previous search (NPDB1807).

In [12]:
data = results[0].download(supplied_data=None)

In [13]:
data['learningData'].head()

Unnamed: 0,SEQNO,RECTYPE,REPTYPE,ORIGYEAR,WORKSTAT,WORKCTRY,HOMESTAT,HOMECTRY,LICNSTAT,LICNFELD,...,ACCRRPTS,NPMALRPT,NPLICRPT,NPCLPRPT,NPPSMRPT,NPDEARPT,NPEXCRPT,NPGARPT,NPCTMRPT,FUNDPYMT
0,1,A,301,1991,OK,,,,OK,10,...,0,0,2,0,0,0,0,0,0,
1,2,A,301,1991,OK,,,,OK,10,...,0,0,7,0,0,0,1,0,0,
2,4,A,301,1991,MA,,,,MA,15,...,0,1,1,0,0,0,2,0,0,
3,6,A,301,1990,OK,,,,OK,10,...,0,0,2,0,0,0,0,0,0,
4,8,A,301,1990,OK,,,,OK,10,...,0,0,7,0,1,0,0,0,0,


## Augmenting a dataset

Let's augment the supplied data with the third query result (`results[2]`), which suggests a join on `SEQNO`.

In [14]:
join_ = results[2].augment(supplied_data=medical_malpractice)

In [15]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,SEQNO,LICNFELD,ORIGYEAR,WORKSTAT,ALGNNATR,ALEGATN1,PTTYPE,PRACTAGE,PFIDX,...,ACCRRPTS,NPMALRPT,NPLICRPT,NPCLPRPT,NPPSMRPT,NPDEARPT,NPEXCRPT,NPGARPT,NPCTMRPT,FUNDPYMT
0,404537,514456,10,2004,AZ,20,306,I,30,32.736999999999995,...,0,2,2,0,0,0,0,0,0,0.0
1,404538,514457,10,2004,PA,1,200,B,50,42.09,...,0,5,0,0,0,0,0,0,0,0.0
2,404540,514460,651,2004,SD,100,316,O,50,58.926,...,0,1,0,0,0,0,0,0,0,0.0
3,404554,514475,430,2004,NJ,60,334,O,20,77.633,...,0,1,0,0,0,0,0,0,0,0.0
4,404556,514477,30,2004,NH,60,306,O,50,1.871,...,0,1,0,0,0,0,0,0,0,0.0
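
The augmented table keeps the supplied columns and appends the NPDB1807 columns for rows that share a `SEQNO`. Conceptually this resembles a join on that key. The pure-Python sketch below illustrates the idea under the assumption of left-join semantics (every supplied row is kept), which this notebook does not confirm; it is not how `augment` is implemented. `left_join` is a hypothetical helper, and the rows are a toy subset of the values shown above.

```python
def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key`, keeping every left row."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], row)  # keep the first right row per key
    joined = []
    for row in left_rows:
        match = index.get(row[key], {})
        # Only add right-hand columns that the left row does not already have
        extra = {k: v for k, v in match.items() if k not in row}
        joined.append({**row, **extra})
    return joined


# Toy rows: two supplied records, only one of which has a match on the right
supplied = [
    {'SEQNO': 514456, 'LICNFELD': 10},
    {'SEQNO': 514460, 'LICNFELD': 651},
]
npdb = [
    {'SEQNO': 514456, 'NPMALRPT': 2},
]

joined = left_join(supplied, npdb, 'SEQNO')
print(joined[0])  # {'SEQNO': 514456, 'LICNFELD': 10, 'NPMALRPT': 2}
print(joined[1])  # {'SEQNO': 514460, 'LICNFELD': 651}
```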
