## REST API for DataMart

This notebook shows how to use the REST API of the DataMart system. For the augmentation, we use the taxi demand example from MIT-LL, available here: https://gitlab.datadrivendiscovery.org/MIT-LL/phase_2/data_augmentation_track_seed/da_seed_ny_taxi_demand_prediction

In [1]:
from io import BytesIO
import json
import os
import pandas as pd
from pprint import pprint
import requests
import zipfile

Initially, we have the taxi demand data.

In [2]:
taxi_demand = pd.read_csv('data/ny_taxi_demand_prediction.csv')
taxi_demand.head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups
0,0,2018-01-01 00:00:00,67
1,1,2018-01-01 01:00:00,8
2,2,2018-01-01 02:00:00,0
3,3,2018-01-01 03:00:00,0
4,4,2018-01-01 04:00:00,7


### Searching for Datasets

Let's use DataMart to search for weather datasets that can be used to augment the taxi demand dataset.

In [3]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = 'data/ny_taxi_demand_prediction.csv'
query = {
    'dataset': {
        'about': 'weather'
    }
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [4]:
for result in query_results:
    print(result['metadata']['name'])
    print('Score: ', result['score'])
    aug_type = 'Union' if 'union_columns' in result else 'Join'
    aug = result['union_columns'] if 'union_columns' in result else result['join_columns']
    print('%s:' % aug_type, aug)
    print('--------')

Newyork Weather Data around Airport 2016-18
Score:  1.0
Join: [['tpep_pickup_datetime', 'DATE']]
--------


The first result has a join score of 1.0, pairing the column `tpep_pickup_datetime` (from the taxi demand dataset) with the column `DATE` (from the query result dataset).
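To make the structure of a search result more explicit, here is a minimal sketch that extracts the augmentation type and the column pairs from one result. The helper names `describe_augmentation` and `format_pairs` are hypothetical (they are not part of DataMart), and the sample result only mirrors the fields printed above. It assumes that union results use the same `[input column, candidate column]` pair layout as join results.

```python
def describe_augmentation(result):
    """Return (augmentation type, column pairs) for one search result."""
    # Same rule as the loop above: a result with 'union_columns' is a union,
    # anything else is treated as a join.
    if 'union_columns' in result:
        return 'Union', result['union_columns']
    return 'Join', result.get('join_columns', [])


def format_pairs(pairs):
    """Render [input column, candidate column] pairs as readable text."""
    return ', '.join('%s -> %s' % (left, right) for left, right in pairs)


# Hand-written stand-in for one entry of response.json()['results'].
sample_result = {
    'metadata': {'name': 'Newyork Weather Data around Airport 2016-18'},
    'score': 1.0,
    'join_columns': [['tpep_pickup_datetime', 'DATE']],
}

aug_type, pairs = describe_augmentation(sample_result)
summary = '%s: %s' % (aug_type, format_pairs(pairs))
print(summary)  # Join: tpep_pickup_datetime -> DATE
```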

We could also omit the query and search for every dataset that can be joined or unioned with the taxi demand dataset.

In [5]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = 'data/ny_taxi_demand_prediction.csv'

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [6]:
print('There are %d query results!\n' % len(query_results))
for result in query_results[:5]: # top-5
    print(result['metadata']['name'])
    print('Score: ', result['score'])
    aug_type = 'Union' if 'union_columns' in result else 'Join'
    aug = result['union_columns'] if 'union_columns' in result else result['join_columns']
    print('%s:' % aug_type, aug)
    print('--------')

There are 487 query results!

2001 Campaign Contributions
Score:  1.0
Join: [['tpep_pickup_datetime', 'DATE']]
--------
Invoices for Open Market Order (OMO) Charges
Score:  1.0
Join: [['tpep_pickup_datetime', 'DateTransferDoF']]
--------
Local Law 44 - Projects
Score:  1.0
Join: [['tpep_pickup_datetime', 'ProjectedCompletionDate']]
--------
Directory Of Competitive Bid
Score:  1.0
Join: [['tpep_pickup_datetime', 'BID DATE & TIME\xa0\xa0\xa0']]
--------
Appeals Filed In 2017
Score:  1.0
Join: [['tpep_pickup_datetime', 'Expiration']]
--------


This shows why it is important to also use the query schema to filter results: many datasets (487 in this case) can be joined or unioned with the input data.
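When a long result list has already been retrieved, it can also be narrowed on the client side. The sketch below is a plain keyword match on the dataset name, not a DataMart feature and not a substitute for a server-side query; `filter_by_name` is a hypothetical helper and the sample results are hand-written stand-ins for entries of `query_results`.

```python
def filter_by_name(results, keyword):
    """Keep results whose metadata name contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        result for result in results
        if keyword in result['metadata']['name'].lower()
    ]


# Hand-written stand-ins using dataset names that appear in this notebook.
sample_results = [
    {'metadata': {'name': 'Newyork Weather Data around Airport 2016-18'}, 'score': 1.0},
    {'metadata': {'name': '2001 Campaign Contributions'}, 'score': 1.0},
    {'metadata': {'name': 'Appeals Filed In 2017'}, 'score': 1.0},
]

weather_only = filter_by_name(sample_results, 'weather')
for result in weather_only:
    print(result['metadata']['name'])
```

A name match is crude: a relevant dataset whose name does not mention the keyword is silently dropped, which is one reason to prefer the query schema when possible.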

It is also possible to specify which column from the taxi demand data will be used for augmentation.

In [7]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = 'data/ny_taxi_demand_prediction.csv'
query = {
    'required_variables': [
        {
            'type': 'dataframe_columns',
            'names': ['tpep_pickup_datetime']
        }
    ]
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [8]:
print('There are %d query results!\n' % len(query_results))
for result in query_results[:5]: # top-5
    print(result['metadata']['name'])
    print('Score: ', result['score'])
    aug_type = 'Union' if 'union_columns' in result else 'Join'
    aug = result['union_columns'] if 'union_columns' in result else result['join_columns']
    print('%s:' % aug_type, aug)
    print('--------')

There are 332 query results!

2001 Campaign Contributions
Score:  1.0
Join: [['tpep_pickup_datetime', 'DATE']]
--------
Invoices for Open Market Order (OMO) Charges
Score:  1.0
Join: [['tpep_pickup_datetime', 'DateTransferDoF']]
--------
Local Law 44 - Projects
Score:  1.0
Join: [['tpep_pickup_datetime', 'ProjectedCompletionDate']]
--------
Directory Of Competitive Bid
Score:  1.0
Join: [['tpep_pickup_datetime', 'BID DATE & TIME\xa0\xa0\xa0']]
--------
Appeals Filed In 2017
Score:  1.0
Join: [['tpep_pickup_datetime', 'Expiration']]
--------
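The search cells above repeat the same payload construction, so it can be factored into small helpers. This is a sketch under stated assumptions: `build_query` and `build_search_files` are hypothetical names, and combining `dataset` and `required_variables` in one query is an assumption, since the notebook only uses them separately. The returned dictionary is meant to be passed as `files=` to `requests.post`, as in the cells above.

```python
import json
from io import BytesIO


def build_query(about=None, column_names=None):
    """Build a query dict from the two query forms used in this notebook."""
    query = {}
    if about is not None:
        query['dataset'] = {'about': about}
    if column_names:
        query['required_variables'] = [
            {'type': 'dataframe_columns', 'names': list(column_names)}
        ]
    return query


def build_search_files(data_fp, query=None):
    """Build the multipart payload; the 'query' part is left out when empty."""
    files = {'data': data_fp}
    if query:
        files['query'] = ('query.json', json.dumps(query), 'application/json')
    return files


# ASSUMPTION: both keys may appear in the same query.
query = build_query(about='weather', column_names=['tpep_pickup_datetime'])

# An in-memory file stands in for open('data/ny_taxi_demand_prediction.csv', 'rb').
data_fp = BytesIO(b'd3mIndex,tpep_pickup_datetime,num_pickups\n')
files = build_search_files(data_fp, query)
print(sorted(files))
```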


### Downloading a Dataset

Now let's materialize the weather dataset, in case the user wants to inspect the data before augmenting (or wants to perform the augmentation themselves).

In [9]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = 'data/ny_taxi_demand_prediction.csv'
query = {
    'dataset': {
        'about': 'weather'
    }
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [10]:
for result in query_results:
    print(result['id'])
    print(result['metadata']['name'])
    print('Score: ', result['score'])
    aug_type = 'Union' if 'union_columns' in result else 'Join'
    aug = result['union_columns'] if 'union_columns' in result else result['join_columns']
    print('%s:' % aug_type, aug)
    print('--------')

datamart.upload.7ccf4492c1da44ffbdc747e278fd65f4
Newyork Weather Data around Airport 2016-18
Score:  1.0
Join: [['tpep_pickup_datetime', 'DATE']]
--------


In [11]:
url = 'https://datamart.d3m.vida-nyu.org/download'
id_ = query_results[0]['id']
params = {'format': 'd3m'} # returns a .zip file with the data and its corresponding datasetDoc

response = requests.get(url + '/%s' % id_, params=params)
response.raise_for_status()

with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))

In [12]:
learning_data.head()

Unnamed: 0,DATE,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,2016-01-01 01:00:00,OVC:08 38,6.1,58.0,17.0,300,30.03
1,2016-01-01 02:00:00,OVC:08 38,6.1,56.0,16.0,320,30.03
2,2016-01-01 03:00:00,OVC:08 38,5.6,55.0,13.0,340,30.03
3,2016-01-01 04:00:00,OVC:08 36,5.6,55.0,13.0,300,30.03
4,2016-01-01 05:00:00,FEW:02 34 OVC:08 45,5.0,60.0,13.0,270,30.01
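With the weather data downloaded, a user could perform the join locally. The following is a minimal pandas sketch, not DataMart's own augmentation logic: `manual_join` is a hypothetical helper, the choice of a left join is an assumption, and the two small frames are illustrative stand-ins for `taxi_demand` and the downloaded `learning_data`.

```python
import pandas as pd


def manual_join(left, right, left_col, right_col):
    """Left-join two frames on datetime columns that may have different names."""
    left = left.copy()
    right = right.copy()
    # Parse both keys so that the match is on timestamps, not on raw strings.
    left[left_col] = pd.to_datetime(left[left_col])
    right[right_col] = pd.to_datetime(right[right_col])
    merged = left.merge(right, how='left', left_on=left_col, right_on=right_col)
    # Drop the duplicated key coming from the right-hand table.
    if right_col != left_col:
        merged = merged.drop(columns=[right_col])
    return merged


# Illustrative stand-ins; the third taxi row has no matching weather row.
taxi = pd.DataFrame({
    'tpep_pickup_datetime': ['2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00'],
    'num_pickups': [67, 8, 0],
})
weather = pd.DataFrame({
    'DATE': ['2018-01-01 00:00:00', '2018-01-01 01:00:00'],
    'HOURLYDRYBULBTEMPC': [-12.2, -12.2],
})

joined = manual_join(taxi, weather, 'tpep_pickup_datetime', 'DATE')
print(joined)
```

A left join keeps every taxi row and leaves missing values where no weather record matches, which makes gaps in the weather data visible.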


### Augmenting a Dataset

Let's run the augmentation using the first query result.

In [13]:
url = 'https://datamart.d3m.vida-nyu.org/augment'
data = 'data/ny_taxi_demand_prediction.csv'
task = query_results[0]

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
response.raise_for_status()
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))

In [14]:
learning_data.head()

Unnamed: 0,tpep_pickup_datetime,num_pickups,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure,d3mIndex
0,2018-01-01 00:00:00,67,CLR:00,-12.2,61.0,14.0,330,30.29,0
1,2018-01-01 01:00:00,8,CLR:00,-12.2,61.0,17.0,320,30.3,1
2,2018-01-01 02:00:00,0,CLR:00,-12.2,61.0,17.0,320,30.3,2
3,2018-01-01 03:00:00,0,CLR:00,-12.8,64.0,11.0,310,30.32,3
4,2018-01-01 04:00:00,7,CLR:00,-12.8,61.0,15.0,300,30.31,4


And we have our augmented data! DataMart also produces a datasetDoc JSON object for the dataset.
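The datasetDoc can be used to look up column types programmatically. The sketch below is a hypothetical helper (`column_types` is not part of DataMart) that relies only on the `dataResources`, `columns`, `colName`, and `colType` keys visible in this notebook's output; the sample document is a hand-written excerpt.

```python
def column_types(dataset_doc):
    """Map each column name to its declared type across all data resources."""
    summary = {}
    for resource in dataset_doc.get('dataResources', []):
        for column in resource.get('columns', []):
            summary[column['colName']] = column['colType']
    return summary


# Hand-written excerpt with the same shape as the datasetDoc in this notebook.
sample_doc = {
    'dataResources': [
        {
            'columns': [
                {'colIndex': 0, 'colName': 'tpep_pickup_datetime',
                 'colType': 'dateTime', 'role': ['attribute']},
                {'colIndex': 1, 'colName': 'num_pickups',
                 'colType': 'integer', 'role': ['attribute']},
                {'colIndex': 2, 'colName': 'HOURLYSKYCONDITIONS',
                 'colType': 'string', 'role': ['attribute']},
            ]
        }
    ]
}

types = column_types(sample_doc)
for name, col_type in types.items():
    print('%s: %s' % (name, col_type))
```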

In [15]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '539534 B',
             'datasetID': '194ee008dc074f3b823fe7543bdfacdc',
             'datasetName': '194ee008dc074f3b823fe7543bdfacdc',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'tpep_pickup_datetime',
                                      'colType': 'dateTime',
                                      'role': ['attribute']},
                                    { 'colIndex': 1,
                                      'colName': 'num_pickups',
                                      'colType': 'integer',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'HOURLYSKYCONDITIONS',
                                      'colType': 'string',
      