# Python API for NYU's DataMart

This notebook shows how to use the Python API for NYU's DataMart system, which implements the common DataMart interface (https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py). To install it, run `pip install datamart_nyu`.

For the augmentation examples, we use the FIFA 2018 Man of the Match data, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_fifa2018_manofmatch

In [1]:
from d3m import container
import datamart
import datamart_nyu
import datetime
from pathlib import Path

In [2]:
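def print_results(results):
    """Print the score, name, and join hint (or dataset ID) of each search result."""
    if not results:
        return
    for result in results:
        print(result.score())
        print(result.get_json_metadata()['metadata']['name'])
        # Show the suggested join columns when present; otherwise show the dataset ID
        if result.get_augment_hint():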
            print("Left Columns: %s" %
                  str(result.get_json_metadata()['augmentation']['left_columns_names']))
            print("Right Columns: %s" %
                  str(result.get_json_metadata()['augmentation']['right_columns_names']))
        else:
            print(result.id())
        print("-------------------")
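
If you would rather collect the same fields than print them, a minimal sketch might look like the following. It relies only on the four methods that `print_results` already calls; `summarize_results` and `FakeResult` are hypothetical names, and `FakeResult` is an offline stand-in, not a DataMart class.

```python
class FakeResult:
    """Hypothetical stand-in for a search result, used only to run this sketch offline."""

    def __init__(self, score, name, dataset_id, augmentation=None):
        self._score = score
        self._id = dataset_id
        self._metadata = {'metadata': {'name': name}}
        if augmentation is not None:
            self._metadata['augmentation'] = augmentation

    def score(self):
        return self._score

    def id(self):
        return self._id

    def get_json_metadata(self):
        return self._metadata

    def get_augment_hint(self):
        # Assumption: a falsy value means "no join suggestion"
        return self._metadata.get('augmentation')


def summarize_results(results):
    """Return one dict per result with the same fields that print_results prints."""
    rows = []
    for result in results or []:
        metadata = result.get_json_metadata()
        row = {'score': result.score(), 'name': metadata['metadata']['name']}
        if result.get_augment_hint():
            row['left_columns'] = metadata['augmentation']['left_columns_names']
            row['right_columns'] = metadata['augmentation']['right_columns_names']
        else:
            row['id'] = result.id()
        rows.append(row)
    return rows


demo_results = [
    FakeResult(57.2, 'FIFA 2018 game statistics data', 'demo-id'),
    FakeResult(1.0, 'Street Construction Permits', 'demo-id-2',
               {'left_columns_names': [['Date']], 'right_columns_names': [['ModifiedOn']]}),
]
summary = summarize_results(demo_results)
```

A list of dicts like `summary` is easy to sort, filter, or load into a DataFrame.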

First, we load the FIFA 2018 Man of the Match data, which serves as our supplied data.

In [3]:
# Adjust this path to match where the D3M datasets live on your machine
fifa_manofmatch_file = (Path.home() / 'projects/d3m/datasets/seed_datasets_data_augmentation'
                        / 'DA_fifa2018_manofmatch/DA_fifa2018_manofmatch_dataset/datasetDoc.json')
fifa_manofmatch = container.Dataset.load(fifa_manofmatch_file.as_uri())

In [4]:
fifa_manofmatch['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,Passes,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals
0,0,55,23-06-2018,Mexico,Korea Republic,59,6,2,0,5,89,485,97,0,1,26.0,Group Stage,No,0,
1,1,40,21-06-2018,Denmark,Australia,49,5,0,1,4,88,458,112,0,1,7.0,Group Stage,No,0,
2,2,19,17-06-2018,Mexico,Germany,40,6,2,2,9,82,281,106,0,0,35.0,Group Stage,No,0,
3,3,31,19-06-2018,Senegal,Poland,43,4,2,3,3,81,328,107,0,1,60.0,Group Stage,No,0,
4,4,98,30-06-2018,Uruguay,Portugal,39,2,1,0,4,69,269,106,0,1,7.0,Round of 16,No,0,
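
The `Date` column above is day-first (`23-06-2018`), while the augmented tables later in this notebook show ISO dates (`2018-06-23`). If you need to normalize the column yourself, for example before joining on dates by hand, a minimal standard-library sketch might look like this. `to_iso_date` is a hypothetical helper, not part of the DataMart API.

```python
import datetime


def to_iso_date(day_first):
    """Convert a DD-MM-YYYY string such as '23-06-2018' to ISO 8601 (YYYY-MM-DD)."""
    return datetime.datetime.strptime(day_first, '%d-%m-%Y').date().isoformat()


# Dates taken from the first three rows shown above
iso_dates = [to_iso_date(value) for value in ['23-06-2018', '21-06-2018', '17-06-2018']]
print(iso_dates)  # ['2018-06-23', '2018-06-21', '2018-06-17']
```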


## Searching for datasets

Let's first instantiate our client:

In [5]:
client = datamart_nyu.RESTDatamart('https://auctus.vida-nyu.org/api/v1')

### Search using keywords

In [6]:
query = datamart.DatamartQuery(
    keywords=['sports', 'soccer', 'FIFA 2018', 'statistics',
              'match data', 'man of the match'],  # keywords from problem definition
    variables=[]
)
cursor = client.search(query=query)

In [7]:
results = cursor.get_next_page()

In [8]:
print_results(results)

57.21765
FIFA 2018 game statistics data
datamart.upload.8733eed7d5844bc990d1153b6957cf90
-------------------
25.12227
Summer Sports Experience
datamart.socrata.data-cityofnewyork-us.xeg4-ic28
-------------------
10.722651
Citiwide Service Desk Statistics
datamart.socrata.data-cityofnewyork-us.vr2i-c3qq
-------------------
10.464287
Demographic Statistics By Zip Code
datamart.socrata.data-cityofnewyork-us.kku6-nxdu
-------------------
9.951888
2010 - 2011 School Attendance and Enrollment Statistics by District
datamart.socrata.data-cityofnewyork-us.7z8d-msnt
-------------------
7.4330053
Parks Special Events
datamart.socrata.data-cityofnewyork-us.6v4b-5gp4
-------------------
7.166958
"Kids In Motion" Playground Programming
datamart.socrata.data-cityofnewyork-us.8p6c-94pc
-------------------
6.8151016
Capital Commitment Plan
datamart.socrata.data-cityofnewyork-us.2cmn-uidm
-------------------
6.7193174
Capital Budget
datamart.socrata.data-cityofnewyork-us.46m8-77gv
-------------------
6
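
`get_next_page()` returns one page of results at a time. The sketch below collects several pages in one pass. It assumes the cursor signals exhaustion by returning `None` or an empty page, which is consistent with the empty-input guard in `print_results` but is not confirmed by this notebook. `iter_all_results` and `FakeCursor` are hypothetical names; `FakeCursor` exists only so the sketch runs offline.

```python
def iter_all_results(cursor, max_pages=10):
    """Yield results page by page until the cursor runs dry or max_pages is reached."""
    for _ in range(max_pages):
        page = cursor.get_next_page()
        if not page:  # assumption: None or an empty page means no more results
            break
        yield from page


class FakeCursor:
    """Hypothetical in-memory cursor standing in for the one returned by client.search()."""

    def __init__(self, pages):
        self._pages = list(pages)

    def get_next_page(self):
        return self._pages.pop(0) if self._pages else None


demo_cursor = FakeCursor([['result-1', 'result-2'], ['result-3']])
all_results = list(iter_all_results(demo_cursor))
print(all_results)  # ['result-1', 'result-2', 'result-3']
```

The `max_pages` cap keeps a broad query from pulling an unbounded number of pages.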

### Search using data

In [9]:
cursor = client.search_with_data(query=None, supplied_data=fifa_manofmatch)

In [10]:
results = cursor.get_next_page()

In [11]:
print_results(results)

1.0
Water Consumption And Cost (2013 - March 2019)
Left Columns: [['Date']]
Right Columns: [['Revenue Month']]
-------------------
1.0
Street Construction Permits
Left Columns: [['Date']]
Right Columns: [['ModifiedOn']]
-------------------
1.0
Housing New York Units by Building
Left Columns: [['Date']]
Right Columns: [['Project Start Date']]
-------------------
1.0
Housing Maintenance Code Complaints
Left Columns: [['Date']]
Right Columns: [['ReceivedDate']]
-------------------
1.0
Capital Projects
Left Columns: [['Date']]
Right Columns: [['Forecast Completion']]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['Date']]
Right Columns: [['Advanced Regents Num']]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['Date']]
Right Columns: [['Regents w/o Advanced Num']]
-------------------
1.0
HRA Facts
Left Columns: [['Date']]
Right Columns: [['Adult Protective Svs. Referrals Received']]
-------------------
1.0
200

### Search using data and keywords

In [12]:
query = datamart.DatamartQuery(
    keywords=['sports', 'soccer', 'FIFA 2018', 'statistics',
              'match data', 'man of the match'],  # keywords from problem definition
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=fifa_manofmatch)

In [13]:
results = cursor.get_next_page()

In [14]:
print_results(results)

58.96552
FIFA 2018 game statistics data
Left Columns: [['GameID']]
Right Columns: [['GameID']]
-------------------
54.545456
FIFA 2018 game statistics data
Left Columns: [['Off-Target']]
Right Columns: [['On-Target']]
-------------------
10.0
Inmate Admissions
Left Columns: [['Date']]
Right Columns: [['ADMITTED_DT']]
-------------------
10.0
Inmate Discharges
Left Columns: [['Date']]
Right Columns: [['ADMITTED_DT']]
-------------------
10.0
Inmate Discharges
Left Columns: [['Date']]
Right Columns: [['DISCHARGED_DT']]
-------------------
10.0
Cash Assistance Youth Engagement
Left Columns: [['Date']]
Right Columns: [['Total Head of Household, 21-24 Years Old']]
-------------------
10.0
Total SNAP Recipients
Left Columns: [['Date']]
Right Columns: [['Month']]
-------------------
10.0
Inmate Admissions
Left Columns: [['Date']]
Right Columns: [['DISCHARGED_DT']]
-------------------
10.0
2013-2014 School Quality Reports Results For Elementary, Middle and K-8 Schools
Left Columns: [['Date']]


## Downloading a dataset

Now let's materialize the first dataset, in case the user wants to inspect the data before augmenting, or wants to perform the augmentation themselves.

In [15]:
datamart_data = results[0].download(supplied_data=None)

In [16]:
datamart_data['learningData'].head()

Unnamed: 0,GameID,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,0,5,13,7,6,11,22,0,0,
1,1,0,6,0,2,25,10,0,0,
2,2,0,8,3,0,7,12,2,0,
3,3,1,14,4,5,13,6,0,0,
4,4,0,13,3,5,14,22,1,0,90.0


You can also pass a dataset as input so that DataMart returns only the portions of its dataset that join with the input data.

In [17]:
datamart_data = results[0].download(supplied_data=fifa_manofmatch)

In [18]:
datamart_data['learningData'].head()

Unnamed: 0,GameID,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,55,2,13,5,5,24,7,0,0,
1,40,1,10,5,3,5,7,2,0,
2,19,1,12,4,1,11,15,2,0,
3,31,2,8,2,3,11,15,2,0,
4,98,2,6,3,2,14,13,0,0,
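
Conceptually, passing `supplied_data` restricts the download to rows whose join key also appears in the supplied data. The sketch below illustrates that idea in plain Python on a few rows copied from the two outputs above. How DataMart performs the restriction is not shown in this notebook, so treat `restrict_to_join_keys` as a hypothetical illustration rather than its actual implementation.

```python
def restrict_to_join_keys(candidate_rows, supplied_rows, key):
    """Keep only the candidate rows whose `key` value occurs in the supplied rows."""
    supplied_keys = {row[key] for row in supplied_rows}
    return [row for row in candidate_rows if row[key] in supplied_keys]


# A few rows from the full download (GameID 0 and 1) and the restricted one (55 and 40)
datamart_rows = [
    {'GameID': 0, 'Goal Scored': 5, 'Attempts': 13},
    {'GameID': 1, 'Goal Scored': 0, 'Attempts': 6},
    {'GameID': 55, 'Goal Scored': 2, 'Attempts': 13},
    {'GameID': 40, 'Goal Scored': 1, 'Attempts': 10},
]
# Join keys from the first two rows of the supplied FIFA data
supplied_rows = [{'GameID': 55}, {'GameID': 40}]

joinable = restrict_to_join_keys(datamart_rows, supplied_rows, 'GameID')
print([row['GameID'] for row in joinable])  # [55, 40]
```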


## Augmenting a dataset

Let's augment the supplied data with the first query result.

In [19]:
join_ = results[0].augment(supplied_data=fifa_manofmatch)

In [20]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,...,Own goals,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,...,,2,13,5,5,24,7,0,0,
1,1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,...,,1,10,5,3,5,7,2,0,
2,2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,...,,1,12,4,1,11,15,2,0,
3,3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,...,,2,8,2,3,11,15,2,0,
4,4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,...,,2,6,3,2,14,13,0,0,
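
The augmented table above looks like the supplied data with the DataMart columns appended, matched on `GameID`. A plain-Python sketch of that kind of join is shown below. It assumes left-join semantics (every supplied row is kept, and unmatched rows get `None`), which this notebook does not confirm; `left_join` is a hypothetical helper, and the third supplied row is deliberately left without a match to show that case.

```python
def left_join(left_rows, right_rows, key):
    """Append the right-hand columns to each left row, matching on `key`."""
    right_by_key = {row[key]: row for row in right_rows}
    extra_columns = [name for name in (right_rows[0] if right_rows else {}) if name != key]
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        merged = dict(row)
        for name in extra_columns:
            merged[name] = match[name] if match else None
        joined.append(merged)
    return joined


# Values copied from the tables in this notebook
supplied = [
    {'GameID': 55, 'Team': 'Mexico', 'Opponent': 'Korea Republic'},
    {'GameID': 40, 'Team': 'Denmark', 'Opponent': 'Australia'},
    {'GameID': 19, 'Team': 'Mexico', 'Opponent': 'Germany'},
]
game_stats = [
    {'GameID': 55, 'Goal Scored': 2, 'Attempts': 13},
    {'GameID': 40, 'Goal Scored': 1, 'Attempts': 10},
]

augmented = left_join(supplied, game_stats, 'GameID')
for row in augmented:
    print(row)
```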


We can also choose which columns from the DataMart dataset to include in the augmentation.

In [21]:
join_ = results[0].augment(
    supplied_data=fifa_manofmatch,
    augment_columns=[datamart.DatasetColumn('0', 1), datamart.DatasetColumn('0', 2)]  # Goal Scored, Attempts
)

In [22]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,...,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals,Goal Scored,Attempts
0,0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,...,97,0,1,26.0,Group Stage,No,0,,2,13
1,1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,...,112,0,1,7.0,Group Stage,No,0,,1,10
2,2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,...,106,0,0,35.0,Group Stage,No,0,,1,12
3,3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,...,107,0,1,60.0,Group Stage,No,0,,2,8
4,4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,...,106,0,1,7.0,Round of 16,No,0,,2,6
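
The `augment_columns` argument above identifies columns by index rather than by name. To check which names a set of indices refers to, you can resolve them against the header of the downloaded DataMart table, as sketched below. This assumes the indices are zero-based and count `GameID` as column 0, which is consistent with the `# Goal Scored, Attempts` comment in the cell above; `column_names` is a hypothetical helper.

```python
# Header of the downloaded DataMart table, without the DataFrame index column
datamart_header = ['GameID', 'Goal Scored', 'Attempts', 'On-Target', 'Corners',
                   'Free Kicks', 'Fouls Committed', 'Yellow Card', 'Red', 'Own goal Time']


def column_names(header, indices):
    """Map zero-based column indices to their names."""
    return [header[index] for index in indices]


selected = column_names(datamart_header, [1, 2])
print(selected)  # ['Goal Scored', 'Attempts']
```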
