## REST API for DataMart

This notebook shows how to use the DataMart REST API to search for related datasets and to augment a dataset with one of them. As input, we use the FIFA 2018 World Cup game data from MIT-LL, available here: https://gitlab.datadrivendiscovery.org/MIT-LL/phase_2/data_augmentation_track_seed/da_seed_fifa_2018_manofmatch_prediction/

In [1]:
from io import BytesIO
import json
import os
import pandas as pd
from pprint import pprint
import requests
import zipfile

We start by loading the World Cup data.

In [2]:
world_cup_data = pd.read_csv('data/fifa2018_manofmatch.csv')
world_cup_data.head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,Passes,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals
0,0,55,23-06-2018,Mexico,Korea Republic,59,6,2,0,5,89,485,97,0,1,26.0,Group Stage,No,0,
1,1,40,21-06-2018,Denmark,Australia,49,5,0,1,4,88,458,112,0,1,7.0,Group Stage,No,0,
2,2,19,17-06-2018,Mexico,Germany,40,6,2,2,9,82,281,106,0,0,35.0,Group Stage,No,0,
3,3,31,19-06-2018,Senegal,Poland,43,4,2,3,3,81,328,107,0,1,60.0,Group Stage,No,0,
4,4,98,30-06-2018,Uruguay,Portugal,39,2,1,0,4,69,269,106,0,1,7.0,Round of 16,No,0,


In [3]:
print('Size: %d rows' % world_cup_data.shape[0])

Size: 128 rows


### Searching for Datasets

Let's use DataMart to search for datasets that can augment the World Cup data. We upload the CSV file to the `/search` endpoint and get back a list of candidate results.

In [4]:
url = 'https://datamart.d3m.vida-nyu.org/search'
data = 'data/fifa2018_manofmatch.csv'

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
        }
    )
response.raise_for_status()
query_results = response.json()['results']

In [6]:
print('There are %d query results!\n' % len(query_results))
for result in query_results[:5]: # top-5
    print(result['metadata']['name'])
    print('Score: ', result['score'])
    aug_type = 'Union' if 'union_columns' in result else 'Join'
    aug = result['union_columns'] if 'union_columns' in result else result['join_columns']
    print('%s:' % aug_type, aug)
    print('--------')

There are 363 query results!

FIFA 2018 game statistics data
Score:  1.0
Join: [['GameID', 'GameID'], ['Yellow & Red', 'Yellow Card']]
--------
Parking Violations Issued - Fiscal Year 2017
Score:  1.0
Join: [['Date', 'Issue Date']]
--------
2014 NYC Open Data Plan
Score:  1.0
Join: [['Date', 'Planned Release Date']]
--------
Historical DOB Permit Issuance
Score:  1.0
Join: [['Date', 'Job Start Date']]
--------
Wholesale Markets
Score:  1.0
Join: [['Date', 'EXPORT DATE']]
--------


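The first dataset seems relevant. All of the top five results have a score of 1.0, so the score alone does not single it out, but it is the only one that also contains FIFA data and that joins on `GameID` rather than on a date column.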
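
Since the score does not separate these results, it can help to filter on the join columns themselves. The sketch below keeps only results with at least one join pair whose query-side column is not `Date`. `sample_results` is a hand-made stand-in that mirrors the fields printed above (`metadata`, `score`, `join_columns`), and `has_non_date_key` is a hypothetical helper name, not part of DataMart.

```python
# Hand-made stand-in for `query_results`, mirroring the fields printed above.
sample_results = [
    {'metadata': {'name': 'FIFA 2018 game statistics data'}, 'score': 1.0,
     'join_columns': [['GameID', 'GameID'], ['Yellow & Red', 'Yellow Card']]},
    {'metadata': {'name': 'Parking Violations Issued - Fiscal Year 2017'}, 'score': 1.0,
     'join_columns': [['Date', 'Issue Date']]},
    {'metadata': {'name': 'Wholesale Markets'}, 'score': 1.0,
     'join_columns': [['Date', 'EXPORT DATE']]},
]


def has_non_date_key(result, date_columns=('Date',)):
    """Return True if any join pair uses a query-side column outside date_columns."""
    pairs = result.get('join_columns', [])
    return any(left not in date_columns for left, _right in pairs)


candidates = [r for r in sample_results if has_non_date_key(r)]
for r in candidates:
    print(r['metadata']['name'], r['join_columns'])
```

With the real `query_results`, the same list comprehension would narrow the 363 results to those that join on something other than the date.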

### Augmenting a Dataset

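Let's augment our data with the first query result. We send the original CSV file together with the chosen result (as the `task`) to the `/augment` endpoint, which returns a ZIP archive containing the augmented table and its `datasetDoc.json`.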

In [7]:
url = 'https://datamart.d3m.vida-nyu.org/augment'
data = 'data/fifa2018_manofmatch.csv'
task = query_results[0]

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
response.raise_for_status()
# The context manager closes the archive even if reading fails.
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))

In [8]:
learning_data.head()

Unnamed: 0,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,...,Own goals,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Red,Own goal Time,d3mIndex
0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,89,...,,2,13,5,5,24,7,0,,0
1,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,69,...,,2,6,3,2,14,13,0,,1
2,81,2018-06-27,Germany,Korea Republic,70,11,9,1,3,88,...,,0,26,6,9,16,7,0,,2
3,26,2018-06-18,Tunisia,England,41,3,2,2,5,82,...,,1,6,1,2,11,14,0,,3
4,94,2018-06-28,England,Belgium,48,7,5,3,3,88,...,,0,13,1,7,15,11,0,,4


In [9]:
print('Size: %d rows' % learning_data.shape[0])

Size: 25 rows


Note that the joined dataset has only **25 rows**, compared to the original **128 rows**. The augmentation information suggests why: we are performing a _multi-join_, i.e., a join on two pairs of key columns, and apparently only rows that match on both pairs are kept:

In [11]:
print(query_results[0]['join_columns'])

[['GameID', 'GameID'], ['Yellow & Red', 'Yellow Card']]


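The second pair does not look right: `Yellow & Red` and `Yellow Card` hold card counts, not identifiers, so matching on them most likely discards valid rows.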
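
To see how a second key pair can shrink the result, here is a toy inner join in plain Python. The rows are small illustrative stand-ins, and the assumption that DataMart keeps only rows matching on every pair is inferred from the row counts above, not confirmed by its documentation.

```python
def inner_join(left_rows, right_rows, key_pairs):
    """Inner-join two lists of dicts; rows match only if every key pair agrees."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            if all(left[lk] == right[rk] for lk, rk in key_pairs):
                # Left-hand values win on name clashes such as 'GameID'.
                joined.append({**right, **left})
    return joined


# Illustrative toy rows; only the column names are taken from the notebook.
games = [
    {'GameID': 55, 'Yellow & Red': 0},
    {'GameID': 40, 'Yellow & Red': 0},
    {'GameID': 19, 'Yellow & Red': 0},
]
stats = [
    {'GameID': 55, 'Yellow Card': 0, 'Attempts': 13},
    {'GameID': 40, 'Yellow Card': 2, 'Attempts': 10},
    {'GameID': 19, 'Yellow Card': 2, 'Attempts': 12},
]

both_keys = inner_join(games, stats, [('GameID', 'GameID'), ('Yellow & Red', 'Yellow Card')])
one_key = inner_join(games, stats, [('GameID', 'GameID')])
print(len(both_keys), len(one_key))
```

Joining on both pairs keeps only the game whose two card columns happen to agree, while joining on `GameID` alone keeps every game.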

But can we join on the `GameID` column alone? Of course! Let's first remove the second pair from the list.

In [12]:
query_results[0]['join_columns'].pop()
print(query_results[0]['join_columns'])

[['GameID', 'GameID']]
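
Note that `pop()` edits `query_results[0]` in place, so the original two-pair suggestion is lost for the rest of the session. If that matters, one alternative is to build a modified copy instead. This is a minimal sketch: `with_join_pairs` is a hypothetical helper name, and `result` is a hand-made stand-in for one search result.

```python
import copy


def with_join_pairs(result, keep):
    """Return a deep copy of a search result keeping only the join pairs in `keep`."""
    task = copy.deepcopy(result)
    task['join_columns'] = [pair for pair in task['join_columns'] if pair in keep]
    return task


# Hand-made stand-in for query_results[0].
result = {
    'score': 1.0,
    'join_columns': [['GameID', 'GameID'], ['Yellow & Red', 'Yellow Card']],
}
task = with_join_pairs(result, keep=[['GameID', 'GameID']])
print(task['join_columns'])    # only the GameID pair
print(result['join_columns'])  # the original still has both pairs
```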


Then we perform the augmentation again.

In [13]:
url = 'https://datamart.d3m.vida-nyu.org/augment'
data = 'data/fifa2018_manofmatch.csv'
task = query_results[0]

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(data, 'rb') as data_p:
    response = requests.post(
        url,
        files={
            'data': data_p,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
response.raise_for_status()
# The context manager closes the archive even if reading fails.
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))
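
The unpacking steps are identical in both augmentation calls, so they could be factored into a helper. The sketch below assumes only what the cells above rely on, namely that the ZIP contains `tables/learningData.csv` and `datasetDoc.json`. `read_augmented_zip` is a hypothetical name, and the archive is built in memory with made-up contents so the example runs without contacting the server. It uses the standard `csv` module to stay dependency-free; in this notebook you would pass the opened member to `pd.read_csv` as before.

```python
import csv
import io
import json
import zipfile


def read_augmented_zip(content):
    """Return (rows, dataset_doc) from the bytes of an augmentation ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(content), 'r') as archive:
        with archive.open('tables/learningData.csv') as table:
            # ZipFile.open yields bytes, so wrap it to decode text for csv.
            text = io.TextIOWrapper(table, encoding='utf-8', newline='')
            rows = list(csv.DictReader(text))
        with archive.open('datasetDoc.json') as doc:
            dataset_doc = json.load(doc)
    return rows, dataset_doc


# Build a tiny stand-in archive in memory (contents are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('tables/learningData.csv', 'd3mIndex,GameID\n0,55\n1,40\n')
    archive.writestr('datasetDoc.json', json.dumps({'about': {'license': 'unknown'}}))

rows, dataset_doc = read_augmented_zip(buffer.getvalue())
print(len(rows), rows[0]['GameID'], dataset_doc['about']['license'])
```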

In [14]:
learning_data.head()

Unnamed: 0,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,...,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time,d3mIndex
0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,89,...,2,13,5,5,24,7,0,0,,0
1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,88,...,1,10,5,3,5,7,2,0,,1
2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,82,...,1,12,4,1,11,15,2,0,,2
3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,81,...,2,8,2,3,11,15,2,0,,3
4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,69,...,2,6,3,2,14,13,0,0,,4


In [15]:
print('Size: %d rows' % learning_data.shape[0])

Size: 128 rows


And we have our final augmented data! Its corresponding datasetDoc JSON object is presented below.
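
The column metadata is nested under `dataResources`, which makes the full dump long, so a compact listing can be easier to scan. The sketch below runs on a hand-made fragment that copies the first entries of the output shown below. It assumes every column entry carries `colIndex`, `colName`, `colType` and `role`, as the visible ones do; `summarize_columns` is a hypothetical helper name.

```python
# Fragment copied from the first entries of the datasetDoc output.
doc_fragment = {
    'dataResources': [
        {'columns': [
            {'colIndex': 0, 'colName': 'GameID', 'colType': 'integer', 'role': ['attribute']},
            {'colIndex': 1, 'colName': 'Date', 'colType': 'dateTime', 'role': ['attribute']},
        ]},
    ],
}


def summarize_columns(doc):
    """Return one (index, name, type, roles) tuple per column across all resources."""
    summary = []
    for resource in doc.get('dataResources', []):
        for col in resource.get('columns', []):
            roles = ','.join(col['role'])
            summary.append((col['colIndex'], col['colName'], col['colType'], roles))
    return summary


for row in summarize_columns(doc_fragment):
    print(row)
```

Calling `summarize_columns(dataset_doc)` on the real document would give one line per column instead of the nested dump.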

In [16]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '59300 B',
             'datasetID': '8700aa115ffe4a52ac7f9d7a395bdca5',
             'datasetName': '8700aa115ffe4a52ac7f9d7a395bdca5',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'GameID',
                                      'colType': 'integer',
                                      'role': ['attribute']},
                                    { 'colIndex': 1,
                                      'colName': 'Date',
                                      'colType': 'dateTime',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'Team',
                                      'colType': 'string',
                                      'role