## REST API for DataMart

This notebook shows how to use the REST API of the DataMart system.

For the augmentation, we use the taxi demand example, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_ny_taxi_demand

The documentation for the REST API is available here: https://vida-nyu.gitlab.io/-/datamart/datamart-api/-/jobs/233008593/artifacts/pages/rest_api.html

In [1]:
from d3m import container
from io import BytesIO
import json
import os
import pandas as pd
from pprint import pprint
from pathlib import Path
import requests
import zipfile

In [2]:
def print_results(results):
    """Print the name, score, and augmentation columns of each search result."""
    if not results:
        return
    for result in results:
        print(result['metadata']['name'])
        print('Score: ', result['score'])
        if 'augmentation' in result:
            augmentation = result['augmentation']
            print('Augmentation: %s' % augmentation['type'])
            # Each entry is a group of column indices; copy the groups into plain lists.
            left_columns = [list(group) for group in augmentation['left_columns']]
            print("Left Columns: %s" % str(left_columns))
            right_columns = [list(group) for group in augmentation['right_columns']]
            print("Right Columns: %s" % str(right_columns))
        print("-------------------")

We start by loading the taxi demand data.

In [3]:
# Change this path to match where the D3M datasets are stored on your machine
ny_taxi_demand_dir = str(Path.home()) + '/projects/d3m/datasets/seed_datasets_data_augmentation/' +\
                      'DA_ny_taxi_demand/DA_ny_taxi_demand_dataset/'
ny_taxi_demand_file = ny_taxi_demand_dir + 'datasetDoc.json'
ny_taxi_demand = container.Dataset.load('file://' + ny_taxi_demand_file)

In [4]:
ny_taxi_demand['learningData'].head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups
0,0,2018-04-19 22:00:00,731
1,1,2018-06-30 20:00:00,183
2,2,2018-06-02 10:00:00,384
3,3,2018-04-17 13:00:00,648
4,4,2018-01-04 01:00:00,3


### Searching for Datasets

Let's use DataMart to search for weather datasets that can be used to augment the taxi demand dataset.

In [14]:
URL = 'https://datamart.d3m.vida-nyu.org'

In [8]:
url = URL + '/search'
data = ny_taxi_demand_dir + 'tables/learningData.csv'
query = {
    'keywords': ['transportation', 'city data', 'taxi',
                 'yellow cab', 'pickup', 'LaGuardia airport',
                 'weather', 'weather conditions', 'new york', 'hourly']
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [9]:
print_results(query_results)

Newyork Weather Data around Airport 2016-18
Score:  50.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Medallion  Vehicles - Authorized
Score:  30.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[14]]
-------------------
ny_lga_weather_16_17_18
Score:  30.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Trade Waste Hauler Licensees
Score:  26.547314
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Trade Waste Hauler Licensees
Score:  26.547314
Augmentation: join
Left Columns: [[1]]
Right Columns: [[14]]
-------------------
Medallion  Vehicles - Authorized
Score:  24.126173
Augmentation: join
Left Columns: [[1]]
Right Columns: [[2]]
-------------------
Housing New York Units by Building
Score:  20.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[2]]
-------------------
Housing New York Units by Building
Score:  20.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[3
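
The `Left Columns` and `Right Columns` values printed above appear to be column positions: `[[1]]` on the left would point at `tpep_pickup_datetime` in the taxi demand table. The sketch below translates such index groups into names under that assumption. `column_names` is a hypothetical helper written for this notebook, not part of the DataMart API, and the sample CSV only reproduces the first rows shown earlier.

```python
import csv
import io


def column_names(header, index_groups):
    """Translate groups of column indices into groups of column names."""
    return [[header[index] for index in group] for group in index_groups]


# First lines of the taxi demand table, as shown earlier in this notebook
sample_csv = io.StringIO(
    "d3mIndex,tpep_pickup_datetime,num_pickups\n"
    "0,2018-04-19 22:00:00,731\n"
)
header = next(csv.reader(sample_csv))

# [[1]] is the left-column value reported for every result above
left_names = column_names(header, [[1]])
print(left_names)
```

In the notebook itself, the header could be read from `tables/learningData.csv` instead of the in-memory sample.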

We could also omit the query and search for all datasets that can be joined or unioned with the taxi demand data.

In [10]:
url = URL + '/search'
data = ny_taxi_demand_dir + 'tables/learningData.csv'

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [11]:
print_results(query_results)

DEP - Cryptosporidium And Giardia Data Set
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[1]]
-------------------
Asset Management Parks System (AMPS) - Work Orders
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[28]]
-------------------
Asset Management Parks System (AMPS) – Labor
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[2]]
-------------------
2005 - 2011 Graduation Outcomes - Borough - ELL
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[13]]
-------------------
Asset Management Parks System (AMPS) – Assets
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[11]]
-------------------
Asset Management Parks System (AMPS) – Assets
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[17]]
-------------------
Cash Assistance Youth Engagement
Score:  1.0
Augmentation: join
Left Columns: [[1]]
Right Columns: [[6]]
-------------------
Parking Violations Issued - Fiscal Year 

This shows the importance of also using the query schema to filter results, as there can be many datasets that can be joined or unioned with the input data.
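
Besides narrowing the search with a query, the returned list can be trimmed on the client side. The following is a minimal sketch that keeps only results above a score threshold, optionally restricted to one augmentation type. `filter_results` and the threshold value are illustrative assumptions, and the sample list is hand-written to mirror the fields that `print_results` reads (`metadata`, `score`, `augmentation`).

```python
def filter_results(results, min_score=0.0, aug_type=None):
    """Keep results scoring at least min_score, optionally of one augmentation type."""
    kept = []
    for result in results:
        if result.get('score', 0.0) < min_score:
            continue
        augmentation = result.get('augmentation')
        if aug_type is not None:
            if augmentation is None or augmentation.get('type') != aug_type:
                continue
        kept.append(result)
    return kept


# Hand-written sample that mirrors a few of the results printed above
sample_results = [
    {'metadata': {'name': 'Newyork Weather Data around Airport 2016-18'},
     'score': 50.0,
     'augmentation': {'type': 'join', 'left_columns': [[1]], 'right_columns': [[0]]}},
    {'metadata': {'name': 'ny_lga_weather_16_17_18'},
     'score': 30.0,
     'augmentation': {'type': 'join', 'left_columns': [[1]], 'right_columns': [[0]]}},
    {'metadata': {'name': 'DEP - Cryptosporidium And Giardia Data Set'},
     'score': 1.0,
     'augmentation': {'type': 'join', 'left_columns': [[1]], 'right_columns': [[1]]}},
]

# The threshold of 25.0 is arbitrary and chosen only for illustration
strong_results = filter_results(sample_results, min_score=25.0, aug_type='join')
strong_names = [result['metadata']['name'] for result in strong_results]
print(strong_names)
```

Applied to the real `query_results`, the same call would drop the low-scoring matches before they are printed or used for augmentation.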

### Downloading a Dataset

Now let's materialize the weather dataset, in case users want to inspect the data before augmenting (or want to perform the augmentation themselves).

In [12]:
url = URL + '/search'
data = ny_taxi_demand_dir + 'tables/learningData.csv'
query = {
    'keywords': ['transportation', 'city data', 'taxi',
                 'yellow cab', 'pickup', 'LaGuardia airport',
                 'weather', 'weather conditions', 'new york', 'hourly']
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [15]:
url = URL + '/download'
id_ = query_results[0]['id']
params = {'format': 'd3m'} # returns a .zip file with the data and its corresponding datasetDoc

response = requests.get(url + '/%s' % id_, params=params)
response.raise_for_status()

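# Read the table and its datasetDoc from the in-memory .zip file;
# the context manager closes the archive even if reading fails
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))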

In [16]:
learning_data.head()

Unnamed: 0,DATE,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,2016-01-01 01:00:00,OVC:08 38,6.1,58.0,17.0,300,30.03
1,2016-01-01 02:00:00,OVC:08 38,6.1,56.0,16.0,320,30.03
2,2016-01-01 03:00:00,OVC:08 38,5.6,55.0,13.0,340,30.03
3,2016-01-01 04:00:00,OVC:08 36,5.6,55.0,13.0,300,30.03
4,2016-01-01 05:00:00,FEW:02 34 OVC:08 45,5.0,60.0,13.0,270,30.01


### Augmenting a Dataset

Let's augment the taxi demand data using the first query result.

In [17]:
url = URL + '/augment'
data = ny_taxi_demand_dir + 'tables/learningData.csv'
task = query_results[0]

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
response.raise_for_status()
# Read the augmented table and its datasetDoc from the in-memory .zip file
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))

In [18]:
learning_data.head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,0,2018-04-19 22:00:00,731,FEW:02 42,5.0,53.0,16.0,310,29.97
1,1,2018-06-30 20:00:00,183,SCT:04 250,30.6,43.0,5.0,180,29.97
2,2,2018-06-02 10:00:00,384,FEW:02 40 FEW:02 150 SCT:04 200,28.3,61.0,6.0,70,29.7
3,3,2018-04-17 13:00:00,648,BKN:07 46 BKN:07 85,8.3,44.0,17.0,260,29.6
4,4,2018-01-04 01:00:00,3,OVC:08 32,-1.7,45.0,8.0,20,29.91


And we have our augmented data!
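
A quick sanity check on the result is to confirm that every original column survived and to list what the join added. This is a rough sketch using the column names shown in the tables above; `added_columns` and `missing_columns` are hypothetical helpers, not DataMart functions.

```python
def added_columns(before, after):
    """Columns present after augmentation but not before, in their new order."""
    return [column for column in after if column not in before]


def missing_columns(before, after):
    """Original columns that no longer appear after augmentation."""
    return [column for column in before if column not in after]


# Column names copied from the tables shown in this notebook
original_columns = ['d3mIndex', 'tpep_pickup_datetime', 'num_pickups']
augmented_columns = [
    'd3mIndex', 'tpep_pickup_datetime', 'num_pickups',
    'HOURLYSKYCONDITIONS', 'HOURLYDRYBULBTEMPC', 'HOURLYRelativeHumidity',
    'HOURLYWindSpeed', 'HOURLYWindDirection', 'HOURLYStationPressure',
]

new_columns = added_columns(original_columns, augmented_columns)
lost_columns = missing_columns(original_columns, augmented_columns)
print('Added:', new_columns)
print('Missing:', lost_columns)
```

In the notebook, `list(learning_data.columns)` would supply the augmented column names directly.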

DataMart also produces a datasetDoc JSON object for the dataset. However, note that this datasetDoc JSON **does not** preserve the information from the supplied data's datasetDoc JSON. You need to use the Python DataMart API for that: https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py
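
To get a compact view of the columns described in a datasetDoc, a small loop over `dataResources` is usually enough. The sketch below assumes the layout printed in this notebook (`dataResources`, `columns`, `colIndex`, `colName`, `colType`, `role`). `summarize_columns` is a hypothetical helper, and the sample document contains only the two columns that are fully visible in the printed output.

```python
def summarize_columns(dataset_doc):
    """Return (index, name, type, roles) tuples for every column in a datasetDoc."""
    rows = []
    for resource in dataset_doc.get('dataResources', []):
        for column in resource.get('columns', []):
            rows.append((
                column['colIndex'],
                column['colName'],
                column['colType'],
                column.get('role', []),
            ))
    return rows


# Reduced sample that follows the structure of the datasetDoc printed in this notebook
sample_doc = {
    'about': {'datasetSchemaVersion': '3.2.0'},
    'dataResources': [
        {'columns': [
            {'colIndex': 0, 'colName': 'd3mIndex',
             'colType': 'integer', 'role': ['index']},
            {'colIndex': 1, 'colName': 'tpep_pickup_datetime',
             'colType': 'dateTime', 'role': ['attribute']},
        ]},
    ],
}

column_summary = summarize_columns(sample_doc)
for index, name, col_type, roles in column_summary:
    print(index, name, col_type, roles)
```

Calling `summarize_columns(dataset_doc)` on the document returned by the augmentation request would list all of its columns in the same way.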

In [19]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '539534 B',
             'datasetID': 'deaa16abc405446cbeaa05952a1e08a3',
             'datasetName': 'deaa16abc405446cbeaa05952a1e08a3',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'd3mIndex',
                                      'colType': 'integer',
                                      'role': ['index']},
                                    { 'colIndex': 1,
                                      'colName': 'tpep_pickup_datetime',
                                      'colType': 'dateTime',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'num_pickups',
                                      'colType': 'integer',
                    