## Python API for DataMart

This notebook showcases how to use the Python API for the DataMart system. For the augmentation, we use the taxi demand example from MIT-LL, available here: https://gitlab.datadrivendiscovery.org/MIT-LL/phase_2/data_augmentation_track_seed/da_seed_ny_taxi_demand_prediction

The Python API is available on GitLab (https://gitlab.com/ViDA-NYU/datamart/datamart/tree/master/lib_client). To install it from source, run `python setup.py install` in that directory. The API is also published on PyPI (https://pypi.org/project/datamart/), so it can be installed with `pip install datamart`.

In [21]:
import datamart
import pandas as pd
from pprint import pprint

Initially, we have the taxi demand data.

In [3]:
taxi_demand = pd.read_csv('data/ny_taxi_demand_prediction.csv')
taxi_demand.head()

| | d3mIndex | tpep_pickup_datetime | num_pickups |
|---|---|---|---|
| 0 | 0 | 2018-01-01 00:00:00 | 67 |
| 1 | 1 | 2018-01-01 01:00:00 | 8 |
| 2 | 2 | 2018-01-01 02:00:00 | 0 |
| 3 | 3 | 2018-01-01 03:00:00 | 0 |
| 4 | 4 | 2018-01-01 04:00:00 | 7 |
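
The `tpep_pickup_datetime` column is read as plain text by default. Since it holds timestamps, it can be useful to parse it when loading the file. The sketch below does this on an in-memory copy of the first rows shown above; the three-column CSV layout is an assumption, and the name `sample` is only for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the first rows of the taxi demand file
# (assumed layout; the real CSV is read from disk in the notebook).
csv_text = """d3mIndex,tpep_pickup_datetime,num_pickups
0,2018-01-01 00:00:00,67
1,2018-01-01 01:00:00,8
2,2018-01-01 02:00:00,0
"""

# parse_dates converts the column to datetime64 instead of leaving it as text.
sample = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['tpep_pickup_datetime'])

# Difference between consecutive timestamps; the first row has no predecessor.
steps = sample['tpep_pickup_datetime'].diff().dropna()
print(sample.dtypes)
print(steps.unique())
```

With the timestamps parsed, the spacing between rows can be checked directly, which for these rows is one hour.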


### Searching for Datasets

Let's use DataMart to search for weather datasets that can be used to augment the taxi demand data.

In [7]:
query_results = datamart.search(
    url='http://localhost:8002',
    data='data/ny_taxi_demand_prediction.csv',
    query={'dataset': {'about': 'weather'}},
    send_data=True)

In [20]:
for result in query_results:
    print(result)
    print(result.metadata['name'])
    print(result.metadata['filename'])
    print(result.get_augmentation_information())
    print('--------')

<Dataset 'datamart.upload.581f1d71d0dc4608811c0f662eec8e1d' score=1.0 augmentation=join>
Newyork Weather Data around Airport 2016-18
ny_lga_weather_data.csv
{'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}
--------


The first result has a score of 1.0 for a join between the columns `tpep_pickup_datetime` (from the taxi demand dataset) and `DATE` (from the query result dataset).
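
The augmentation information printed above is an ordinary dictionary, so the suggested column pairs can be unpacked with plain Python. The sketch below uses a literal copied from that output rather than a live call; the names `info` and `join_pairs` are only for illustration.

```python
# Literal copied from the printed output; in the notebook this value would
# come from result.get_augmentation_information().
info = {'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}

# Each entry under 'join' pairs a column of the input data with a column
# of the candidate dataset.
join_pairs = [tuple(pair) for pair in info['join']]

for input_col, candidate_col in join_pairs:
    print(f'{input_col} (input data) <-> {candidate_col} (candidate dataset)')

# An empty 'union' list means no union was suggested for this result.
print('union suggested:', bool(info['union']))
```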

Alternatively, we can send a `pandas.DataFrame` object instead of the file itself ...

In [26]:
query_results = datamart.search(
    url='http://localhost:8002',
    data=taxi_demand,
    query={'dataset': {'about': 'weather'}})

... and we get the same results:

In [27]:
for result in query_results:
    print(result)
    print(result.metadata['name'])
    print(result.metadata['filename'])
    print(result.get_augmentation_information())
    print('--------')

<Dataset 'datamart.upload.581f1d71d0dc4608811c0f662eec8e1d' score=1.0 augmentation=join>
Newyork Weather Data around Airport 2016-18
ny_lga_weather_data.csv
{'union': [], 'join': [['tpep_pickup_datetime', 'DATE']]}
--------


### Augmenting a Dataset

Let's augment the taxi demand data with the first query result.

In [23]:
learning_data, dataset_doc = datamart.augment(
    data='data/ny_taxi_demand_prediction.csv',
    augment_data=query_results[0],
    send_data=True
)

In [24]:
learning_data.head()

| | tpep_pickup_datetime | num_pickups | HOURLYSKYCONDITIONS | HOURLYDRYBULBTEMPC | HOURLYRelativeHumidity | HOURLYWindSpeed | HOURLYWindDirection | HOURLYStationPressure | d3mIndex |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 00:00:00 | 67 | CLR:00 | -12.2 | 61.0 | 14.0 | 330 | 30.29 | 0 |
| 1 | 2018-01-01 01:00:00 | 8 | CLR:00 | -12.2 | 61.0 | 17.0 | 320 | 30.3 | 1 |
| 2 | 2018-01-01 02:00:00 | 0 | CLR:00 | -12.2 | 61.0 | 17.0 | 320 | 30.3 | 2 |
| 3 | 2018-01-01 03:00:00 | 0 | CLR:00 | -12.8 | 64.0 | 11.0 | 310 | 30.32 | 3 |
| 4 | 2018-01-01 04:00:00 | 7 | CLR:00 | -12.8 | 61.0 | 15.0 | 300 | 30.31 | 4 |
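
Conceptually, the table above is the taxi data with weather columns attached wherever `tpep_pickup_datetime` matches `DATE`. The sketch below reproduces that idea with plain pandas on a few toy rows. It assumes a left join and is not a description of how DataMart performs the augmentation; the frame names `taxi` and `weather` are hypothetical, and the weather frame is deliberately missing one hour to show what an unmatched row looks like.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values only).
taxi = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(
        ['2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00']),
    'num_pickups': [67, 8, 0],
})
weather = pd.DataFrame({
    'DATE': pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00']),
    'HOURLYDRYBULBTEMPC': [-12.2, -12.2],
})

# Left join: keep every taxi row and attach weather where timestamps match.
joined = taxi.merge(weather, how='left',
                    left_on='tpep_pickup_datetime', right_on='DATE')

# The right-hand key duplicates the left-hand one, so drop it.
joined = joined.drop(columns='DATE')

# Columns gained through the join.
added = [c for c in joined.columns if c not in taxi.columns]
print(joined)
print(added)
```

Taxi rows without a matching weather timestamp keep their values and get `NaN` in the added columns.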


And we have our augmented data! DataMart also produces a datasetDoc JSON object for the dataset.
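
Because the datasetDoc is a nested dictionary, column metadata can be collected with a short comprehension. The sketch below works on a hand-trimmed fragment that follows the `dataResources` / `columns` nesting of the datasetDoc; only two columns are kept, and the names `doc_fragment` and `column_types` are only for illustration.

```python
# Hand-trimmed fragment with the datasetDoc nesting; the full document
# returned by DataMart contains more keys and more columns.
doc_fragment = {
    'dataResources': [
        {'columns': [
            {'colIndex': 0, 'colName': 'tpep_pickup_datetime',
             'colType': 'dateTime', 'role': ['attribute']},
            {'colIndex': 1, 'colName': 'num_pickups',
             'colType': 'integer', 'role': ['attribute']},
        ]},
    ],
}

# Map each column name to its declared type, across all data resources.
column_types = {
    col['colName']: col['colType']
    for resource in doc_fragment['dataResources']
    for col in resource.get('columns', [])
}
print(column_types)
```

The same comprehension could be applied to `dataset_doc` itself, assuming it has the same nesting.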

In [25]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '539534 B',
             'datasetID': '13196cf72800491db3226dbcf3d313ea',
             'datasetName': '13196cf72800491db3226dbcf3d313ea',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'tpep_pickup_datetime',
                                      'colType': 'dateTime',
                                      'role': ['attribute']},
                                    { 'colIndex': 1,
                                      'colName': 'num_pickups',
                                      'colType': 'integer',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'HOURLYSKYCONDITIONS',
                                      'colType': 'string',
      