## Python API for DataMart

This notebook showcases how to use the Python API for the DataMart system. For the augmentation, we use the taxi demand example from MIT-LL, available here: https://gitlab.datadrivendiscovery.org/MIT-LL/phase_2/data_augmentation_track_seed/da_seed_ny_taxi_demand_prediction

The Python API is available on GitLab (https://gitlab.com/ViDA-NYU/datamart/datamart/tree/master/lib_client). To install it from source, run `python setup.py install` in that directory. Alternatively, the API is published on PyPI (https://pypi.org/project/datamart/) and can be installed with `pip install datamart`.

In [1]:
import datamart
from io import BytesIO
import pandas as pd
from pprint import pprint

We start by loading the taxi demand data.

In [2]:
taxi_demand = pd.read_csv('data/ny_taxi_demand_prediction.csv')
taxi_demand.head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups
0,0,2018-01-01 00:00:00,67
1,1,2018-01-01 01:00:00,8
2,2,2018-01-01 02:00:00,0
3,3,2018-01-01 03:00:00,0
4,4,2018-01-01 04:00:00,7


### Searching for Datasets

Let's use DataMart to search for weather datasets that can be used to augment the taxi demand dataset.

In [3]:
query_results = datamart.search(
    url='https://datamart.d3m.vida-nyu.org',
    data='data/ny_taxi_demand_prediction.csv',
    query={'dataset': {'about': 'weather'}},
    send_data=True)

In [4]:
for result in query_results:
    print(result)
    print(result.metadata['name'])
    print(result.get_augmentation_information())
    print('--------')

<Dataset 'datamart.upload.7ccf4492c1da44ffbdc747e278fd65f4' score=1.0 augmentation=join>
Newyork Weather Data around Airport 2016-18
{'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}
--------


The search returned a single dataset, with a score of 1.0 for a join between the columns `tpep_pickup_datetime` (from the taxi demand dataset) and `DATE` (from the query result dataset).
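
To make the structure of the augmentation information concrete, here is a minimal sketch that reads the dictionary printed above and lists each column pairing. The dictionary is copied from the output; the helper name `describe_augmentation` is hypothetical and not part of the DataMart API, and the sketch assumes that `union` entries use the same `[input_column, candidate_column]` pair shape as `join` entries.

```python
# Augmentation information copied from the output above.
augmentation_info = {'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}


def describe_augmentation(info):
    """Return one readable line per column pairing (hypothetical helper)."""
    lines = []
    for kind in ('join', 'union'):
        # Assumption: every entry is a pair [input_column, candidate_column].
        for input_column, candidate_column in info.get(kind, []):
            lines.append('%s: %s <-> %s' % (kind, input_column, candidate_column))
    return lines


descriptions = describe_augmentation(augmentation_info)
for line in descriptions:
    print(line)
```

The first column of each pair belongs to the input data and the second to the candidate dataset, which matches the order shown in the output.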

Alternatively, we can send a `pandas.DataFrame` object instead of the file itself ...

In [5]:
query_results = datamart.search(
    url='https://datamart.d3m.vida-nyu.org',
    data=taxi_demand,
    query={'dataset': {'about': 'weather'}})

... and we get the same results:

In [6]:
for result in query_results:
    print(result)
    print(result.metadata['name'])
    print(result.get_augmentation_information())
    print('--------')

<Dataset 'datamart.upload.7ccf4492c1da44ffbdc747e278fd65f4' score=1.0 augmentation=join>
Newyork Weather Data around Airport 2016-18
{'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}
--------


We could also omit the query and search for all datasets that can be joined or unioned with the taxi demand dataset.

In [7]:
query_results = datamart.search(
    url='https://datamart.d3m.vida-nyu.org',
    data='data/ny_taxi_demand_prediction.csv',
    send_data=True)

In [8]:
print('There are %d query results!\n' % len(query_results))
for result in query_results[:5]: # top-5
    print(result)
    print(result.metadata['name'])
    print(result.get_augmentation_information())
    print('--------')

There are 487 query results!

<Dataset 'datamart.socrata.data-cityofnewyork-us.735p-zed8' score=1.0 augmentation=join>
2001 Campaign Contributions
{'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}
--------
<Dataset 'datamart.socrata.data-cityofnewyork-us.emrz-5p35' score=1.0 augmentation=join>
Invoices for Open Market Order (OMO) Charges
{'union': [], 'join': [['tpep_pickup_datetime', 'DateTransferDoF']]}
--------
<Dataset 'datamart.socrata.data-cityofnewyork-us.ucdy-byxd' score=1.0 augmentation=join>
Local Law 44 - Projects
{'union': [], 'join': [['tpep_pickup_datetime', 'ProjectedCompletionDate']]}
--------
<Dataset 'datamart.socrata.data-cityofnewyork-us.2eq2-trdu' score=1.0 augmentation=join>
Directory Of Competitive Bid
{'union': [], 'join': [['tpep_pickup_datetime', 'BID DATE & TIME\xa0\xa0\xa0']]}
--------
<Dataset 'datamart.socrata.data-cityofnewyork-us.7xq6-k6zy' score=1.0 augmentation=join>
Appeals Filed In 2017
{'union': [], 'join': [['tpep_pickup_datetime', 'Expirat

This shows the importance of also using the query schema to filter results, as there can be many datasets that can be joined or unioned with the input data.
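
As a rough illustration of why the unfiltered results are hard to use, the sketch below checks the names of the top-5 results for a keyword. The names are copied from the output above; the helper `names_mentioning` is hypothetical, and this client-side check is only a stand-in for the server-side `query` argument, not a replacement for it.

```python
# Names of the top-5 results, copied from the output above.
top_result_names = [
    '2001 Campaign Contributions',
    'Invoices for Open Market Order (OMO) Charges',
    'Local Law 44 - Projects',
    'Directory Of Competitive Bid',
    'Appeals Filed In 2017',
]


def names_mentioning(names, keyword):
    """Return the names that contain the keyword, ignoring case."""
    keyword = keyword.lower()
    return [name for name in names if keyword in name.lower()]


weather_related = names_mentioning(top_result_names, 'weather')
print(weather_related)
```

None of the five top-ranked names mentions weather, even though every one of them scores 1.0 for a join on a date column.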

### Downloading a Dataset

Now let's materialize the weather dataset, so that the user can inspect the data before augmenting, or perform the augmentation themselves.

In [9]:
query_results = datamart.search(
    url='https://datamart.d3m.vida-nyu.org',
    data='data/ny_taxi_demand_prediction.csv',
    query={'dataset': {'about': 'weather'}},
    send_data=True)

In [10]:
for result in query_results:
    print(result)
    print(result.metadata['name'])
    print(result.get_augmentation_information())
    print('--------')

<Dataset 'datamart.upload.7ccf4492c1da44ffbdc747e278fd65f4' score=1.0 augmentation=join>
Newyork Weather Data around Airport 2016-18
{'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}
--------


In [11]:
weather_data_io = BytesIO()
query_results[0].download(destination=weather_data_io)
weather_data_io.seek(0)
weather_data = pd.read_csv(weather_data_io)
weather_data_io.close()
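
The in-memory buffer pattern used in this cell can be tried without the DataMart server. The sketch below is an assumption-laden stand-in: instead of calling `download`, it writes two CSV rows (copied from the weather output in this notebook) into a `BytesIO` buffer, then reads them back with pandas.

```python
from io import BytesIO

import pandas as pd

# Stand-in for the downloaded bytes: two rows copied from the weather data.
csv_bytes = (
    b'DATE,HOURLYDRYBULBTEMPC\n'
    b'2016-01-01 01:00:00,6.1\n'
    b'2016-01-01 02:00:00,6.1\n'
)

# The context manager closes the buffer even if parsing fails.
with BytesIO() as buffer:
    buffer.write(csv_bytes)
    # After writing, the position is at the end of the buffer;
    # rewind so that read_csv starts from the first byte.
    buffer.seek(0)
    sample = pd.read_csv(buffer, parse_dates=['DATE'])

print(sample.dtypes)
```

Without the `seek(0)` call, `read_csv` would start at the end of the buffer and find no data to parse.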



In [12]:
weather_data.head()

Unnamed: 0,DATE,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,2016-01-01 01:00:00,OVC:08 38,6.1,58.0,17.0,300,30.03
1,2016-01-01 02:00:00,OVC:08 38,6.1,56.0,16.0,320,30.03
2,2016-01-01 03:00:00,OVC:08 38,5.6,55.0,13.0,340,30.03
3,2016-01-01 04:00:00,OVC:08 36,5.6,55.0,13.0,300,30.03
4,2016-01-01 05:00:00,FEW:02 34 OVC:08 45,5.0,60.0,13.0,270,30.01
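
With the weather data available locally, a user could also perform the join by hand. The following is a minimal sketch using `pandas.DataFrame.merge` on small in-memory samples; the taxi rows are copied from the taxi demand output, and the weather rows use the 2018 temperature values that appear in the augmented output in the next section. It assumes an exact match on hourly timestamps, which may differ from how DataMart aligns the two columns internally.

```python
import pandas as pd

# Small samples; in the notebook these would be taxi_demand and weather_data.
taxi_sample = pd.DataFrame({
    'tpep_pickup_datetime': ['2018-01-01 00:00:00',
                             '2018-01-01 01:00:00',
                             '2018-01-01 02:00:00'],
    'num_pickups': [67, 8, 0],
})
weather_sample = pd.DataFrame({
    'DATE': ['2018-01-01 00:00:00',
             '2018-01-01 01:00:00',
             '2018-01-01 02:00:00'],
    'HOURLYDRYBULBTEMPC': [-12.2, -12.2, -12.2],
})

# Parse both key columns so they are compared as timestamps, not strings.
taxi_sample['tpep_pickup_datetime'] = pd.to_datetime(taxi_sample['tpep_pickup_datetime'])
weather_sample['DATE'] = pd.to_datetime(weather_sample['DATE'])

# A left join keeps every taxi row, even if no weather reading matches.
joined = taxi_sample.merge(
    weather_sample,
    how='left',
    left_on='tpep_pickup_datetime',
    right_on='DATE',
).drop(columns='DATE')

print(joined)
```

A left join is used here so that the number of taxi rows is preserved; hours without a weather reading would simply get missing values in the weather columns.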


### Augmenting a Dataset

Let's augment the taxi demand data with the first query result.

In [13]:
learning_data, dataset_doc = datamart.augment(
    data='data/ny_taxi_demand_prediction.csv',
    augment_data=query_results[0],
    send_data=True
)

In [14]:
learning_data.head()

Unnamed: 0,tpep_pickup_datetime,num_pickups,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure,d3mIndex
0,2018-01-01 00:00:00,67,CLR:00,-12.2,61.0,14.0,330,30.29,0
1,2018-01-01 01:00:00,8,CLR:00,-12.2,61.0,17.0,320,30.3,1
2,2018-01-01 02:00:00,0,CLR:00,-12.2,61.0,17.0,320,30.3,2
3,2018-01-01 03:00:00,0,CLR:00,-12.8,64.0,11.0,310,30.32,3
4,2018-01-01 04:00:00,7,CLR:00,-12.8,61.0,15.0,300,30.31,4


And we have our augmented data! DataMart also produces a datasetDoc JSON object for the dataset.

In [15]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '539534 B',
             'datasetID': '0590b256c2234fdc85e7dfb13e83fb7b',
             'datasetName': '0590b256c2234fdc85e7dfb13e83fb7b',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'tpep_pickup_datetime',
                                      'colType': 'dateTime',
                                      'role': ['attribute']},
                                    { 'colIndex': 1,
                                      'colName': 'num_pickups',
                                      'colType': 'integer',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'HOURLYSKYCONDITIONS',
                                      'colType': 'string',
      