# Python API for NYU's DataMart

This notebook shows how to use the Python API for NYU's DataMart system, which implements the common DataMart interface (https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py). To install it, run `pip install datamart_nyu`.

For the augmentation, we use the FIFA 2018 Man of Match data, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_fifa2018_manofmatch

In [1]:
from d3m import container
import datamart
import datamart_nyu
import datetime
from pathlib import Path

In [2]:
def print_results(results):
    """Print the score, name, and join hint (or id) of each search result."""
    if not results:
        return
    for result in results:
        print(result.score())
        print(result.get_json_metadata()['metadata']['name'])
        hint = result.get_augment_hint()
        if hint:
            # Each hint entry is a group of columns, shown as (resource_id, column_index) pairs
            left_columns = [
                [(column.resource_id, column.column_index) for column in group]
                for group in hint.left_columns
            ]
            print("Left Columns: %s" % str(left_columns))
            right_columns = [
                [(column.resource_id, column.column_index) for column in group]
                for group in hint.right_columns
            ]
            print("Right Columns: %s" % str(right_columns))
        else:
            # No join hint available: fall back to the dataset identifier
            print(result.id())
        print("-------------------")

First, we load the FIFA 2018 Man of Match data, which serves as our supplied data.

In [3]:
# Change this path to match the location of the D3M datasets on your machine
fifa_manofmatch_file = (
    Path.home() / 'projects/d3m/datasets/seed_datasets_data_augmentation'
    / 'DA_fifa2018_manofmatch/DA_fifa2018_manofmatch_dataset/datasetDoc.json'
)
fifa_manofmatch = container.Dataset.load(fifa_manofmatch_file.as_uri())

In [4]:
fifa_manofmatch['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,Passes,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals
0,0,55,23-06-2018,Mexico,Korea Republic,59,6,2,0,5,89,485,97,0,1,26.0,Group Stage,No,0,
1,1,40,21-06-2018,Denmark,Australia,49,5,0,1,4,88,458,112,0,1,7.0,Group Stage,No,0,
2,2,19,17-06-2018,Mexico,Germany,40,6,2,2,9,82,281,106,0,0,35.0,Group Stage,No,0,
3,3,31,19-06-2018,Senegal,Poland,43,4,2,3,3,81,328,107,0,1,60.0,Group Stage,No,0,
4,4,98,30-06-2018,Uruguay,Portugal,39,2,1,0,4,69,269,106,0,1,7.0,Round of 16,No,0,


## Searching for datasets

Let's first instantiate our client:

In [5]:
client = datamart_nyu.RESTDatamart('https://datamart.d3m.vida-nyu.org')

### Search using keywords

In [6]:
query = datamart.DatamartQuery(
    keywords=['soccer', 'FIFA 2018', 'sports'],  # keywords from problem definition
    variables=[]
)
cursor = client.search(query=query)

In [7]:
results = cursor.get_next_page()

In [8]:
print_results(results)

40.331184
FIFA 2018 game statistics data
datamart.upload.8733eed7d5844bc990d1153b6957cf90
-------------------
25.035423
Summer Sports Experience
datamart.socrata.data-cityofnewyork-us.xeg4-ic28
-------------------
9.738021
2018 - 2019 School Locations
datamart.socrata.data-cityofnewyork-us.9ck8-hj3u
-------------------
9.2692175
Child Support Collections
datamart.socrata.data-cityofnewyork-us.8fhd-nzw8
-------------------
9.186156
2018 Open Data Plan: Future Releases
datamart.socrata.data-cityofnewyork-us.dzrn-z4d7
-------------------
8.834902
2017 - 2018 Avg Class Size Borough MSHS
datamart.socrata.data-cityofnewyork-us.am74-3pnv
-------------------
8.745911
Child Support Caseload
datamart.socrata.data-cityofnewyork-us.7rf2-3gxf
-------------------
8.650852
2018-2019 3K For All Demographic Snapshot
datamart.socrata.data-cityofnewyork-us.suzc-ps6g
-------------------
8.306905
Appeals Closed In 2018
datamart.socrata.data-cityofnewyork-us.uetw-jfrg
-------------------
7.9615107
2018 Open
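
The cursor returned by `client.search` hands out results one page at a time through `get_next_page()`. The following is a minimal sketch of how all pages could be collected in a loop. `FakeCursor` and `iter_results` are hypothetical helpers written for this illustration, not part of `datamart` or `datamart_nyu`, and the stop condition (an exhausted cursor returning `None` or an empty page) is an assumption, consistent with the `if not results` guard in `print_results`.

```python
from typing import Iterator, List, Optional, Sequence


class FakeCursor:
    """Hypothetical stand-in for a DataMart cursor; not part of datamart_nyu."""

    def __init__(self, pages: Sequence[List[str]]) -> None:
        self._pages = list(pages)

    def get_next_page(self) -> Optional[List[str]]:
        # Assumption: an exhausted cursor returns None
        return self._pages.pop(0) if self._pages else None


def iter_results(cursor) -> Iterator[str]:
    """Yield results from every page until the cursor is exhausted."""
    while True:
        page = cursor.get_next_page()
        if not page:  # None or empty page: nothing left to read
            break
        yield from page


cursor = FakeCursor([["result-a", "result-b"], ["result-c"]])
all_results = list(iter_results(cursor))
print(all_results)
```

With a real cursor, the same loop shape would apply as long as the assumption about the end-of-results signal holds.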

### Search using data

In [9]:
cursor = client.search_with_data(query=None, supplied_data=fifa_manofmatch)

In [10]:
results = cursor.get_next_page()

In [11]:
print_results(results)

1.0
FIFA 2018 game statistics data
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 8)]]
-------------------
1.0
Housing New York Units by Building
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 19)]]
-------------------
1.0
Recognized Shop Healthy Stores
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 1)]]
-------------------
1.0
Contractor / Sub Contractor Change Order Report
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 10)]]
-------------------
1.0
Cash Assistance Youth Engagement
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 6)]]
-------------------
1.0
City Clerk eLobbyist Data
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 14)]]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 21)]]
-------------------
1.0
Street Construction Permits - Stipulations (Historical)
Left Columns: [[('learningData', 2)]]
Right Columns: 
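
The hints printed above identify join columns as `(resource_id, column_index)` pairs, which are hard to read without the column names. Below is a small sketch that translates such pairs into names for the supplied data. `resolve_hint` is a hypothetical helper, and the 0-based indexing starting at `d3mIndex` is an assumption (it matches the later cell in this notebook that labels column 1 of `learningData` as `GameID`).

```python
# Column names of the supplied 'learningData' table, in order (from the preview above)
LEARNING_DATA_COLUMNS = [
    "d3mIndex", "GameID", "Date", "Team", "Opponent", "Ball Possession %",
    "Off-Target", "Blocked", "Offsides", "Saves", "Pass Accuracy %", "Passes",
    "Distance Covered (Kms)", "Yellow & Red", "Man of the Match", "1st Goal",
    "Round", "PSO", "Goals in PSO", "Own goals",
]


def resolve_hint(pairs, columns_by_resource):
    """Translate [[(resource_id, column_index), ...], ...] into column names."""
    return [
        [columns_by_resource[resource_id][index] for resource_id, index in group]
        for group in pairs
    ]


columns_by_resource = {"learningData": LEARNING_DATA_COLUMNS}
names_13 = resolve_hint([[("learningData", 13)]], columns_by_resource)
names_2 = resolve_hint([[("learningData", 2)]], columns_by_resource)
print(names_13, names_2)
```

Under that assumption, most of the suggested joins above use the `Date` column of the supplied data.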

### Search using data and keywords

In [12]:
query = datamart.DatamartQuery(
    keywords=['soccer', 'FIFA 2018', 'sports'],
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=fifa_manofmatch)

In [13]:
results = cursor.get_next_page()

In [14]:
print_results(results)

1.0
FIFA 2018 game statistics data
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 8)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2017
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 23)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2015
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 17)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2019
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 17)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2016
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 17)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2018
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 17)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2014 (August 2013 – June 2014)
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 17)]]
-------------------
1.0
Local Law 251 of 2017: NYC Open 

### Search using data, keywords, and data columns

In [15]:
query = datamart.DatamartQuery(
    keywords=['soccer', 'FIFA 2018', 'sports'],
    variables=[]
)
cursor = client.search_with_data_columns(
    query=query,
    supplied_data=fifa_manofmatch,
    data_constraints=[
        datamart.TabularVariable(
            [datamart.DatasetColumn('learningData', 1)],  # GameID
            datamart.ColumnRelationship.CONTAINS
        )
    ]
)

In [16]:
results = cursor.get_next_page()

In [17]:
print_results(results)

0.9827586206896551
FIFA 2018 game statistics data
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 0)]]
-------------------
0.8706896551724138
Traffic Volume Counts (2014-2018)
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 0)]]
-------------------
0.853448275862069
Vehicle Classification Counts (2014-2018)
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 0)]]
-------------------
0.7327586206896551
2017 - 2018 Schools NYPD Crime Data Report
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 0)]]
-------------------
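
The scores above are consistent with fractions of the form k/116 (for example, 114/116 ≈ 0.9828), which suggests that they may measure how much of the supplied column is contained in the candidate column. This is only a reading of the numbers, not something the API documents here. A minimal sketch of such a containment measure, using a hypothetical `containment` function and toy values:

```python
def containment(supplied_values, candidate_values):
    """Fraction of distinct supplied values that also appear in the candidate column."""
    supplied = set(supplied_values)
    if not supplied:
        return 0.0
    return len(supplied & set(candidate_values)) / len(supplied)


# Toy columns, invented for illustration only
supplied_game_ids = [55, 40, 19, 31, 98]
candidate_game_ids = [19, 31, 40, 55, 7, 8]

score = containment(supplied_game_ids, candidate_game_ids)
print(score)  # 4 of the 5 supplied ids are present -> 0.8
```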


## Downloading a dataset

Now let's materialize the first dataset, in case the user wants to inspect the data before augmenting it (or perform the augmentation themselves).

In [18]:
datamart_data = results[0].download(supplied_data=None)

In [19]:
datamart_data['learningData'].head()

Unnamed: 0,GameID,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,0,5,13,7,6,11,22,0,0,
1,1,0,6,0,2,25,10,0,0,
2,2,0,8,3,0,7,12,2,0,
3,3,1,14,4,5,13,6,0,0,
4,4,0,13,3,5,14,22,1,0,90.0


You can also give a dataset as input so that DataMart can try to return a dataset that joins well with it. Only portions of the DataMart dataset that join with the input data will be returned.

In [20]:
datamart_data = results[0].download(supplied_data=fifa_manofmatch)

In [21]:
datamart_data['learningData'].head()

Unnamed: 0,Red,GameID,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Own goal Time
0,0,0,5,13,7,6,11,22,0,
1,0,1,0,6,0,2,25,10,0,
2,0,2,0,8,3,0,7,12,2,
3,0,3,1,14,4,5,13,6,0,
4,0,4,0,13,3,5,14,22,1,90.0
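
Returning only the portions of a dataset that join with the input can be pictured as a semi-join: keep the candidate rows whose key occurs in the supplied data. The sketch below illustrates that idea on plain Python dictionaries. `semi_join` is a hypothetical helper and the rows are toy data; how DataMart actually selects the returned portion is not shown in this notebook.

```python
def semi_join(candidate_rows, supplied_rows, key):
    """Keep only candidate rows whose key value occurs in the supplied rows."""
    wanted = {row[key] for row in supplied_rows}
    return [row for row in candidate_rows if row[key] in wanted]


# Toy rows for illustration; GameID 99 is invented and has no match
supplied = [{"GameID": 55}, {"GameID": 40}]
candidate = [
    {"GameID": 40, "Attempts": 10},
    {"GameID": 55, "Attempts": 13},
    {"GameID": 99, "Attempts": 7},
]

kept = semi_join(candidate, supplied, "GameID")
print(kept)
```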


## Augmenting a dataset

Let's now augment our supplied data with the first query result.

In [22]:
join_ = results[0].augment(supplied_data=fifa_manofmatch)

In [23]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,...,Own goals,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,...,,2,13,5,5,24,7,0,0,
1,1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,...,,1,10,5,3,5,7,2,0,
2,2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,...,,1,12,4,1,11,15,2,0,
3,3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,...,,2,8,2,3,11,15,2,0,
4,4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,...,,2,6,3,2,14,13,0,0,


We can also choose which columns from the DataMart dataset to include in the augmentation.

In [24]:
join_ = results[0].augment(
    supplied_data=fifa_manofmatch,
    augment_columns=[datamart.DatasetColumn('0', 1), datamart.DatasetColumn('0', 2)]  # Goal Scored, Attempts
)

In [25]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,...,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals,Goal Scored,Attempts
0,0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,...,97,0,1,26.0,Group Stage,No,0,,2,13
1,1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,...,112,0,1,7.0,Group Stage,No,0,,1,10
2,2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,...,106,0,0,35.0,Group Stage,No,0,,1,12
3,3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,...,107,0,1,60.0,Group Stage,No,0,,2,8
4,4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,...,106,0,1,7.0,Round of 16,No,0,,2,6
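
Conceptually, the augmentation above behaves like a left join on the key column, optionally restricted to a subset of the candidate's columns. The following sketch reproduces that behavior on plain Python dictionaries. `left_join` is a hypothetical helper, not the DataMart implementation, and the rows are a small excerpt of the values shown in the outputs above.

```python
def left_join(supplied_rows, candidate_rows, key, columns=None):
    """Left-join candidate columns onto the supplied rows by key."""
    # Copy all non-key candidate columns unless a subset is requested
    if columns is None:
        columns = [name for name in candidate_rows[0] if name != key] if candidate_rows else []
    lookup = {row[key]: row for row in candidate_rows}
    joined = []
    for row in supplied_rows:
        match = lookup.get(row[key], {})  # unmatched rows get None for the new columns
        extra = {name: match.get(name) for name in columns}
        joined.append({**row, **extra})
    return joined


supplied = [
    {"GameID": 55, "Team": "Mexico"},
    {"GameID": 40, "Team": "Denmark"},
]
candidate = [
    {"GameID": 55, "Goal Scored": 2, "Attempts": 13, "Corners": 5},
    {"GameID": 40, "Goal Scored": 1, "Attempts": 10, "Corners": 3},
]

full = left_join(supplied, candidate, "GameID")
subset = left_join(supplied, candidate, "GameID", columns=["Goal Scored", "Attempts"])
print(subset[0])
```

Passing `columns` plays the same role here as `augment_columns` does in the call above: only the listed columns are appended to the supplied rows.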
