## REST API for DataMart

This notebook showcases how to use the REST API for the DataMart system.

For the augmentation, we use the FIFA 2018 Man of Match data, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_fifa2018_manofmatch

The documentation for the REST API is available here: https://vida-nyu.gitlab.io/-/datamart/datamart-api/-/jobs/233008593/artifacts/pages/rest_api.html

In [1]:
from d3m import container
from io import BytesIO
import json
import os
import pandas as pd
from pprint import pprint
from pathlib import Path
import requests
import zipfile

In [4]:
def print_results(results):
    """Print the name, score, and join columns of each search result."""
    if not results:
        return
    for result in results:
        print(result['metadata']['name'])
        print('Score: ', result['score'])
        if 'augmentation' in result:
            augmentation = result['augmentation']
            print('Augmentation: %s' % augmentation['type'])
            # Each entry is a group of column indices; copy into plain lists for printing
            left_columns = [list(columns) for columns in augmentation['left_columns']]
            print("Left Columns: %s" % str(left_columns))
            right_columns = [list(columns) for columns in augmentation['right_columns']]
            print("Right Columns: %s" % str(right_columns))
        print("-------------------")

We start by loading the supplied data.

In [2]:
# Adjust this path to match where the D3M datasets are stored on your machine
fifa_manofmatch_dir = str(Path.home()) + '/projects/d3m/datasets/seed_datasets_data_augmentation/' +\
                       'DA_fifa2018_manofmatch/DA_fifa2018_manofmatch_dataset/'
fifa_manofmatch_file = fifa_manofmatch_dir + 'datasetDoc.json'
fifa_manofmatch = container.Dataset.load('file://' + fifa_manofmatch_file)

In [3]:
fifa_manofmatch['learningData'].head()

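| | d3mIndex | GameID | Date | Team | Opponent | Ball Possession % | Off-Target | Blocked | Offsides | Saves | Pass Accuracy % | Passes | Distance Covered (Kms) | Yellow & Red | Man of the Match | 1st Goal | Round | PSO | Goals in PSO | Own goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 55 | 23-06-2018 | Mexico | Korea Republic | 59 | 6 | 2 | 0 | 5 | 89 | 485 | 97 | 0 | 1 | 26.0 | Group Stage | No | 0 | |
| 1 | 1 | 40 | 21-06-2018 | Denmark | Australia | 49 | 5 | 0 | 1 | 4 | 88 | 458 | 112 | 0 | 1 | 7.0 | Group Stage | No | 0 | |
| 2 | 2 | 19 | 17-06-2018 | Mexico | Germany | 40 | 6 | 2 | 2 | 9 | 82 | 281 | 106 | 0 | 0 | 35.0 | Group Stage | No | 0 | |
| 3 | 3 | 31 | 19-06-2018 | Senegal | Poland | 43 | 4 | 2 | 3 | 3 | 81 | 328 | 107 | 0 | 1 | 60.0 | Group Stage | No | 0 | |
| 4 | 4 | 98 | 30-06-2018 | Uruguay | Portugal | 39 | 2 | 1 | 0 | 4 | 69 | 269 | 106 | 0 | 1 | 7.0 | Round of 16 | No | 0 | |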


### Searching for Datasets

Let's use DataMart to search for datasets that can be used to augment the supplied one.

In [5]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = fifa_manofmatch_dir + 'tables/learningData.csv'

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [6]:
print_results(query_results)

FIFA 2018 game statistics data
Score:  1.0
Augmentation: join
Left Columns: [[13]]
Right Columns: [[8]]
-------------------
Housing New York Units by Building
Score:  1.0
Augmentation: join
Left Columns: [[2]]
Right Columns: [[19]]
-------------------
Recognized Shop Healthy Stores
Score:  1.0
Augmentation: join
Left Columns: [[2]]
Right Columns: [[1]]
-------------------
Contractor / Sub Contractor Change Order Report
Score:  1.0
Augmentation: join
Left Columns: [[2]]
Right Columns: [[10]]
-------------------
Cash Assistance Youth Engagement
Score:  1.0
Augmentation: join
Left Columns: [[2]]
Right Columns: [[6]]
-------------------
City Clerk eLobbyist Data
Score:  1.0
Augmentation: join
Left Columns: [[2]]
Right Columns: [[14]]
-------------------
2005 - 2011 Graduation Outcomes - Borough - ELL
Score:  1.0
Augmentation: join
Left Columns: [[2]]
Right Columns: [[21]]
-------------------
Street Construction Permits - Stipulations (Historical)
Score:  1.0
Augmentation: join
Left Columns

-------------------
DOHMH Childcare Center Inspections
Score:  0.9026548971684805
Augmentation: join
Left Columns: [[2]]
Right Columns: [[28]]
-------------------
Traffic Volume Counts (2014-2018)
Score:  0.8706896551724138
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Landmarks Violations
Score:  0.8672566779570188
Augmentation: join
Left Columns: [[2]]
Right Columns: [[9]]
-------------------
Entry Point LCR Monitoring Results
Score:  0.8584071231541535
Augmentation: join
Left Columns: [[2]]
Right Columns: [[1]]
-------------------
NYC Citywide Annualized Calendar Sales Update
Score:  0.8584071231541535
Augmentation: join
Left Columns: [[2]]
Right Columns: [[20]]
-------------------
Medallion Drivers – Passenger Assistance Trained
Score:  0.8584071231541535
Augmentation: join
Left Columns: [[2]]
Right Columns: [[5]]
-------------------
Vehicle Classification Counts (2014-2018)
Score:  0.853448275862069
Augmentation: join
Left Columns: [[1]]
Right Col

The first dataset looks relevant: it has a join score of 1.0 and it also contains FIFA 2018 data. Several unrelated datasets also score 1.0, so the score alone does not establish relevance.

It is also possible to specify which column will be used for augmentation.

In [12]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = fifa_manofmatch_dir + 'tables/learningData.csv'
query={
    'variables': [
        {
            'type': 'tabular_variable',
            'columns': [1],  # GameID
            'relationship': 'contains'
        }
    ]
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
response.raise_for_status()
gameID_results = response.json()['results']

In [13]:
print_results(gameID_results)

FIFA 2018 game statistics data
Score:  0.9827586206896551
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Traffic Volume Counts (2012-2013)
Score:  0.9568965517241379
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Inclusionary Housing Properties
Score:  0.9482758620689655
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Vehicle Classification Counts (2012-2013)
Score:  0.9310344827586207
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Traffic Volume Counts (2014-2018)
Score:  0.8706896551724138
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
Vehicle Classification Counts (2014-2018)
Score:  0.853448275862069
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-------------------
LCR South South Knowledge Exchange Activities
Score:  0.8448275862068966
Augmentation: join
Left Columns: [[1]]
Right Columns: [[0]]
-----------
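
The query above hard-codes column index 1 for `GameID`. As a rough sketch, the index can instead be looked up from the CSV header so that the query follows the column name. `build_column_query` is a hypothetical helper introduced here for illustration; the query layout simply mirrors the dictionary used in the cell above, and the sample header is a shortened stand-in for `learningData.csv`.

```python
import csv
import io
import json


def build_column_query(csv_file, column_names):
    """Build a search query restricted to the named columns of a CSV file."""
    header = next(csv.reader(csv_file))
    # list.index raises ValueError if a name is not in the header
    indices = [header.index(name) for name in column_names]
    return {
        'variables': [
            {
                'type': 'tabular_variable',
                'columns': indices,
                'relationship': 'contains',
            }
        ]
    }


# Shortened stand-in for tables/learningData.csv
sample_csv = io.StringIO('d3mIndex,GameID,Date,Team\n0,55,23-06-2018,Mexico\n')
query = build_column_query(sample_csv, ['GameID'])
query_json = json.dumps(query)
```

With the real file, the helper could be called on `open(data, newline='')` and the resulting `query_json` sent as the `query` part of the multipart request.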

### Augmenting a Dataset

Let's augment the supplied data with the first result of the previous query.

In [14]:
url = 'https://datamart.d3m.vida-nyu.org/augment'
data = fifa_manofmatch_dir + 'tables/learningData.csv'
task = gameID_results[0]

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
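response.raise_for_status()
# The response body is a ZIP archive holding the augmented table and its datasetDoc;
# the context manager closes the archive even if reading fails
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))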
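
The cell above assumes that the archive contains both `tables/learningData.csv` and `datasetDoc.json`. A small sketch of how the archive could be checked before reading is shown below. `unpack_augmented_zip` is a hypothetical helper, the UTF-8 encoding is an assumption, and the in-memory archive built here only stands in for `response.content`.

```python
import io
import json
import zipfile

# Member names read by the notebook cell above
EXPECTED_MEMBERS = ('tables/learningData.csv', 'datasetDoc.json')


def unpack_augmented_zip(payload):
    """Return (csv_text, dataset_doc) from ZIP bytes, failing early if a member is missing."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        names = archive.namelist()
        missing = [name for name in EXPECTED_MEMBERS if name not in names]
        if missing:
            raise ValueError('Missing archive members: %s' % ', '.join(missing))
        # Assumption: both members are UTF-8 encoded
        csv_text = archive.read('tables/learningData.csv').decode('utf-8')
        dataset_doc = json.loads(archive.read('datasetDoc.json').decode('utf-8'))
    return csv_text, dataset_doc


# Build a tiny in-memory archive as a stand-in for response.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('tables/learningData.csv', 'd3mIndex,GameID\n0,55\n')
    archive.writestr('datasetDoc.json', json.dumps({'about': {'license': 'unknown'}}))

csv_text, dataset_doc = unpack_augmented_zip(buffer.getvalue())
```

The returned CSV text can then be loaded with `pd.read_csv(io.StringIO(csv_text))` if a DataFrame is needed.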

In [15]:
learning_data.head()

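| | d3mIndex | GameID | Date | Team | Opponent | Ball Possession % | Off-Target | Blocked | Offsides | Saves | ... | Own goals | Goal Scored | Attempts | On-Target | Corners | Free Kicks | Fouls Committed | Yellow Card | Red | Own goal Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 55 | 2018-06-23 | Mexico | Korea Republic | 59 | 6 | 2 | 0 | 5 | ... | | 2 | 13 | 5 | 5 | 24 | 7 | 0 | 0 | |
| 1 | 1 | 40 | 2018-06-21 | Denmark | Australia | 49 | 5 | 0 | 1 | 4 | ... | | 1 | 10 | 5 | 3 | 5 | 7 | 2 | 0 | |
| 2 | 2 | 19 | 2018-06-17 | Mexico | Germany | 40 | 6 | 2 | 2 | 9 | ... | | 1 | 12 | 4 | 1 | 11 | 15 | 2 | 0 | |
| 3 | 3 | 31 | 2018-06-19 | Senegal | Poland | 43 | 4 | 2 | 3 | 3 | ... | | 2 | 8 | 2 | 3 | 11 | 15 | 2 | 0 | |
| 4 | 4 | 98 | 2018-06-30 | Uruguay | Portugal | 39 | 2 | 1 | 0 | 4 | ... | | 2 | 6 | 3 | 2 | 14 | 13 | 0 | 0 | |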
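
The preview shows columns such as `Goal Scored` and `Attempts` that are not in the supplied table. A quick way to list what the join added is to compare the two sets of column names, as in the sketch below. `added_columns` is a hypothetical helper, and the two lists are only a subset of the column names visible in the previews above.

```python
def added_columns(original_columns, augmented_columns):
    """Return the augmented column names that are absent from the original, in order."""
    original = set(original_columns)
    return [name for name in augmented_columns if name not in original]


# Subset of the column names shown in the two previews
supplied = ['d3mIndex', 'GameID', 'Date', 'Team', 'Opponent', 'Own goals']
augmented = supplied + ['Goal Scored', 'Attempts', 'On-Target', 'Corners']

new_columns = added_columns(supplied, augmented)
```

In the notebook, the same function could be called with `fifa_manofmatch['learningData'].columns` and `learning_data.columns`.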


And we have our augmented data! Its corresponding datasetDoc JSON object is presented below.

However, note that this datasetDoc JSON **does not** preserve the information from the supplied data's datasetDoc JSON. You need to use the Python DataMart API for that: https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py

In [16]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '59300 B',
             'datasetID': '521ea6d741e94a61bdbe9c69b7bb7a38',
             'datasetName': '521ea6d741e94a61bdbe9c69b7bb7a38',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'd3mIndex',
                                      'colType': 'integer',
                                      'role': ['index']},
                                    { 'colIndex': 1,
                                      'colName': 'GameID',
                                      'colType': 'integer',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'Date',
                                      'colType': 'dateTime',
                                      'rol