## Data Augmentation Seed Problems: NYU DataMart

This notebook investigates whether, and how, the NYU DataMart finds the companion datasets for the data augmentation seed problems.

The NYU DataMart currently appears to have the expected companion dataset for 5 problems: `DA_fifa2018_manofmatch`, `DA_housing_burden`, `DA_medical_malpractice`, `DA_ny_taxi_demand`, and `DA_poverty_estimation`.

In [1]:
from d3m import container
import datamart
import datamart_nyu
from pathlib import Path
import time
import requests
import json

In [2]:
def print_results(results):
    """Print the score, name, and join columns (or id) of each search result."""
    print("\n-------------------")
    if not results:
        print("No results!")
        return
    for result in results:
        # Fetch the metadata once per result instead of once per field.
        metadata = result.get_json_metadata()
        print(result.score())
        print(metadata['metadata']['name'])
        if result.get_augment_hint():
            # Column pairs suggested for joining the supplied data with this result.
            print("Left Columns: %s" %
                  str(metadata['augmentation']['left_columns_names']))
            print("Right Columns: %s" %
                  str(metadata['augmentation']['right_columns_names']))
        else:
            print(result.id())
        print("-------------------")

In [3]:
client = datamart_nyu.RESTDatamart('https://auctus.vida-nyu.org/api/v1')

In [4]:
# home directory for data augmentation seed datasets
# change this accordingly
home_dir = str(Path.home()) + '/projects/d3m/datasets/seed_datasets_data_augmentation/'

### DA_college_debt

Loading the supplied data.

In [5]:
college_debt_file = home_dir + 'DA_college_debt/DA_college_debt_dataset/datasetDoc.json'
college_debt = container.Dataset.load('file://' + college_debt_file)

In [6]:
college_debt['learningData'].head()

Unnamed: 0,d3mIndex,UNITID,INSTNM,PCTFLOAN,CONTROL,STABBR,PCIP16,MD_EARN_WNE_P10,PPTUG_EF,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,SATMTMID,SATVRMID,SATWRMID,UGDS,PREDDEG,DEBT_EARNINGS_RATIO
0,0,12268508,San Joaquin Valley College-Rancho Cordova,,3,CA,,28300,,,,,,,,,,0,49
1,1,207564,Oklahoma State University Institute of Technology,0.475,1,OK,0.0,35300,0.2297,0.2953,0.0291,0.0647,0.0051,,,,2164.0,2,36
2,2,420024,Centura College-Chesapeake,0.8125,3,VA,0.0,21900,0.2315,0.2808,0.5665,0.0493,0.0,,,,203.0,2,127
3,3,164492,Anna Maria College,0.7465,2,MA,0.0,44800,0.2621,0.6518,0.1258,0.1022,0.0123,,,,1057.0,3,76
4,4,234085,Virginia Military Institute,0.4589,1,VA,0.0321,65700,0.0,0.7992,0.0607,0.0584,0.042,575.0,575.0,,1713.0,3,53


Without keywords.

In [7]:
cursor = client.search_with_data(query=None, supplied_data=college_debt)

In [8]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 6.1875 seconds

-------------------
No results!


With keywords.

In [9]:
query = datamart.DatamartQuery(
    keywords=['college scorecard', 'finance', 'debt', 'earnings'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=college_debt)

In [10]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 7.0491 seconds

-------------------
No results!


**Result:** Neither search returns any result, so our DataMart does not appear to hold a useful dataset for this problem.
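
Every search in this notebook wraps `cursor.get_next_page()` in the same timing code. The pattern could be factored into a small helper. The following is only a sketch: `timed_call` and `fake_get_next_page` are hypothetical names introduced here (neither is part of the `datamart` API), and the stand-in function exists so the example runs without a DataMart connection.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for cursor.get_next_page; it returns an empty page.
def fake_get_next_page():
    return []


results, duration = timed_call(fake_get_next_page)
print("Duration: %.4f seconds" % duration)
```

With a real cursor, the call would presumably become `timed_call(cursor.get_next_page)`, followed by `print_results(results)`.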

### DA_consumer_complaints

Loading the supplied data.

In [11]:
consumer_complaints_file = home_dir + 'DA_consumer_complaints/DA_consumer_complaints_dataset/datasetDoc.json'
consumer_complaints = container.Dataset.load('file://' + consumer_complaints_file)

In [12]:
consumer_complaints['learningData'].head()

Unnamed: 0,d3mIndex,Complaint ID,State,ZIP code,Company,Consumer complaint narrative,Date sent to company,relevance
0,0,2252221,CA,928XX,Ditech Financial LLC,In XX/XX/XXXX I fell behind on my mortgage. I ...,12/15/16,1
1,1,1554019,CA,90731,"BANK OF AMERICA, NATIONAL ASSOCIATION",,09/06/15,2
2,2,2379952,CA,951XX,PORTFOLIO RECOVERY ASSOCIATES INC,Repeatedly calling a family members phone numb...,03/13/17,2
3,3,1363639,CA,900XX,JPMORGAN CHASE & CO.,I closed my Chase mileage card last year. At t...,05/06/15,0
4,4,1644731,CA,94703,"CITIBANK, N.A.",,11/09/15,0


Without keywords.

In [13]:
cursor = client.search_with_data(query=None, supplied_data=consumer_complaints)

In [14]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 109.0335 seconds

-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['Date sent to company']]
Right Columns: [['Advanced Regents Num']]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['Date sent to company']]
Right Columns: [['Regents w/o Advanced Num']]
-------------------
1.0
HRA Facts
Left Columns: [['Date sent to company']]
Right Columns: [['Adult Protective Svs. Referrals Received']]
-------------------
1.0
2008-2009 VADIR INCIDENTS
Left Columns: [['Date sent to company']]
Right Columns: [['Enrollment']]
-------------------
1.0
LPC Permit Application Information
Left Columns: [['Date sent to company']]
Right Columns: [['received_date']]
-------------------
1.0
IBRD Statement Of Loans - Historical Data
Left Columns: [['Date sent to company']]
Right Columns: [['First Repayment Date']]
-------------------
1.0
IBRD Statement Of Loans - Historical Data
Left Columns: [['Date sent to company']]
Right C

With keywords.

In [15]:
query = datamart.DatamartQuery(
    keywords=['consumer', 'complaints', 'protect', 'unfair practices',
              'consumer financial protection bureau'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=consumer_complaints)

In [16]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 133.8705 seconds

-------------------
20.0
DOB Complaints Received
Left Columns: [['Date sent to company']]
Right Columns: [['Inspection Date']]
-------------------
19.779541
DOB Complaints Received
Left Columns: [['Date sent to company']]
Right Columns: [['Disposition Date']]
-------------------
19.25926
DOB Complaints Received
Left Columns: [['Date sent to company']]
Right Columns: [['Date Entered']]
-------------------
18.88889
Landmarks Complaints
Left Columns: [['Date sent to company']]
Right Columns: [['Date']]
-------------------
17.8847
DOHMH Indoor Environmental Complaints
Left Columns: [['Date sent to company']]
Right Columns: [['Date_Received']]
-------------------
13.650794
Metal Content of Consumer Products Tested by the NYC Health Department
Left Columns: [['Date sent to company']]
Right Columns: [['COLLECTION_DATE']]
-------------------
10.784834
Housing Maintenance Code Complaints
Left Columns: [['Date sent to company']]
Right Columns: [['ReceivedDate']]
-----

**Result:** Unfortunately, our DataMart does not appear to hold a useful dataset for this problem. The datasets returned by the search contain no consumer complaint data, only other kinds of complaints, and they are matched on date columns, so the joins make no semantic sense. This does not rule out that such an augmentation could still improve performance.
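
One crude way to screen results like these is to check whether the dataset name mentions every topic term. The sketch below uses plain dictionaries shaped like the `get_json_metadata()` output printed above, with dataset names copied from the keyword search results. `name_matches` is a hypothetical helper, and name matching is only a rough heuristic, not how the DataMart ranks results.

```python
def name_matches(metadata, terms):
    """Return True if every term occurs in the dataset name (case-insensitive)."""
    name = metadata['metadata']['name'].lower()
    return all(term.lower() in name for term in terms)


# Dataset names taken from the keyword search output above.
candidates = [
    {'metadata': {'name': 'DOB Complaints Received'}},
    {'metadata': {'name': 'Landmarks Complaints'}},
    {'metadata': {'name': 'DOHMH Indoor Environmental Complaints'}},
    {'metadata': {'name': 'Metal Content of Consumer Products Tested '
                          'by the NYC Health Department'}},
    {'metadata': {'name': 'Housing Maintenance Code Complaints'}},
]

# Keep only datasets whose name mentions both "consumer" and "complaint".
relevant = [c for c in candidates if name_matches(c, ['consumer', 'complaint'])]
print(relevant)
```

None of the listed names mentions both terms, which is consistent with the conclusion above.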

### DA_fifa2018_manofmatch

Loading the supplied data.

In [17]:
fifa2018_manofmatch_file = home_dir + 'DA_fifa2018_manofmatch/DA_fifa2018_manofmatch_dataset/datasetDoc.json'
fifa2018_manofmatch = container.Dataset.load('file://' + fifa2018_manofmatch_file)

In [18]:
fifa2018_manofmatch['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,Passes,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals
0,0,55,23-06-2018,Mexico,Korea Republic,59,6,2,0,5,89,485,97,0,1,26.0,Group Stage,No,0,
1,1,40,21-06-2018,Denmark,Australia,49,5,0,1,4,88,458,112,0,1,7.0,Group Stage,No,0,
2,2,19,17-06-2018,Mexico,Germany,40,6,2,2,9,82,281,106,0,0,35.0,Group Stage,No,0,
3,3,31,19-06-2018,Senegal,Poland,43,4,2,3,3,81,328,107,0,1,60.0,Group Stage,No,0,
4,4,98,30-06-2018,Uruguay,Portugal,39,2,1,0,4,69,269,106,0,1,7.0,Round of 16,No,0,


Without keywords.

In [19]:
cursor = client.search_with_data(query=None, supplied_data=fifa2018_manofmatch)

In [20]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 8.4872 seconds

-------------------
1.0
Water Consumption And Cost (2013 - March 2019)
Left Columns: [['Date']]
Right Columns: [['Revenue Month']]
-------------------
1.0
Street Construction Permits
Left Columns: [['Date']]
Right Columns: [['ModifiedOn']]
-------------------
1.0
Housing New York Units by Building
Left Columns: [['Date']]
Right Columns: [['Project Start Date']]
-------------------
1.0
Housing Maintenance Code Complaints
Left Columns: [['Date']]
Right Columns: [['ReceivedDate']]
-------------------
1.0
Capital Projects
Left Columns: [['Date']]
Right Columns: [['Forecast Completion']]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['Date']]
Right Columns: [['Advanced Regents Num']]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['Date']]
Right Columns: [['Regents w/o Advanced Num']]
-------------------
1.0
HRA Facts
Left Columns: [['Date']]
Right Columns: [['Adult Protective Svs. Ref

With keywords.

In [21]:
query = datamart.DatamartQuery(
    keywords=['sports', 'soccer', 'FIFA 2018', 'statistics',
              'match data', 'man of the match'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=fifa2018_manofmatch)

In [22]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 13.9716 seconds

-------------------
58.96552
FIFA 2018 game statistics data
Left Columns: [['GameID']]
Right Columns: [['GameID']]
-------------------
54.545456
FIFA 2018 game statistics data
Left Columns: [['Off-Target']]
Right Columns: [['On-Target']]
-------------------
10.0
Inmate Admissions
Left Columns: [['Date']]
Right Columns: [['ADMITTED_DT']]
-------------------
10.0
Inmate Discharges
Left Columns: [['Date']]
Right Columns: [['ADMITTED_DT']]
-------------------
10.0
Inmate Discharges
Left Columns: [['Date']]
Right Columns: [['DISCHARGED_DT']]
-------------------
10.0
Cash Assistance Youth Engagement
Left Columns: [['Date']]
Right Columns: [['Total Head of Household, 21-24 Years Old']]
-------------------
10.0
Total SNAP Recipients
Left Columns: [['Date']]
Right Columns: [['Month']]
-------------------
10.0
Inmate Admissions
Left Columns: [['Date']]
Right Columns: [['DISCHARGED_DT']]
-------------------
10.0
2013-2014 School Quality Reports Results For Elementary, M

In [23]:
join_ = results[0].augment(supplied_data=fifa2018_manofmatch)

In [24]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,...,Own goals,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,...,,2,13,5,5,24,7,0,0,
1,1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,...,,1,10,5,3,5,7,2,0,
2,2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,...,,1,12,4,1,11,15,2,0,
3,3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,...,,2,8,2,3,11,15,2,0,
4,4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,...,,2,6,3,2,14,13,0,0,


**Result:** We find the expected dataset!
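
The augmentation above uses `results[0]`, which works only while the expected dataset is ranked first. A more defensive option is to select the result by name. This is a sketch under assumptions: `FakeResult` is a hypothetical stand-in that mimics only the `score()` and `get_json_metadata()` calls used in this notebook, and `first_with_name` is a hypothetical helper.

```python
class FakeResult:
    """Hypothetical stand-in exposing the two methods used in this notebook."""

    def __init__(self, name, score):
        self._name = name
        self._score = score

    def score(self):
        return self._score

    def get_json_metadata(self):
        return {'metadata': {'name': self._name}}


def first_with_name(results, expected_name):
    """Return the first result whose dataset name equals expected_name, else None."""
    for result in results:
        if result.get_json_metadata()['metadata']['name'] == expected_name:
            return result
    return None


# Names and scores copied from the keyword search output above.
results = [
    FakeResult('FIFA 2018 game statistics data', 58.96552),
    FakeResult('Inmate Admissions', 10.0),
]
match = first_with_name(results, 'FIFA 2018 game statistics data')
print(match.score())
```

With real search results, `match.augment(supplied_data=fifa2018_manofmatch)` would then replace the call on `results[0]`.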

### DA_global_terrorism

In [25]:
global_terrorism_file = home_dir + 'DA_global_terrorism/DA_global_terrorism_dataset/datasetDoc.json'
global_terrorism = container.Dataset.load('file://' + global_terrorism_file)

In [26]:
global_terrorism['learningData'].head()

Unnamed: 0,d3mIndex,year,country,group,activity_level
0,0,1971,Netherlands,Popular Front for the Liberation of Palestine ...,5
1,1,1995,Paraguay,Strikers,5
2,2,1981,Israel,Palestine Liberation Organization (PLO),2
3,3,1990,Yugoslavia,Albanian Separatists,5
4,4,2006,Nigeria,Movement for the Actualization of the Sovereig...,2


Without keywords.

In [27]:
cursor = client.search_with_data(query=None, supplied_data=global_terrorism)

In [28]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 5.9535 seconds

-------------------
0.93181825
Quality of Goverance
Left Columns: [['year']]
Right Columns: [['year']]
-------------------
0.93181825
Quality of Governance Indicators
Left Columns: [['year']]
Right Columns: [['year']]
-------------------
0.90909094
DOHMH Tuberculosis Surveillance: Data from the Tuberculosis Control Annual Summary
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.84090906
Water Consumption In The New York City
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.8181818
NYC Independent Budget Office (IBO) Tax Revenue FY 1980 - FY 2014
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.6818182
Annual Overweight Load (AOL) Permits
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.6363636
Quality of Governance Abridged
Left Columns: [['year']]
Right Columns: [['year']]
-------------------
0.47727275
Home Care Caseload
Left Columns: [['year']]
Right Columns: [['

With keywords.

In [29]:
query = datamart.DatamartQuery(
    keywords=['global terrorism', 'terrorist events', 'domestic',
              'international', 'terrorism', 'national security',
              'conflict'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=global_terrorism)

In [30]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 7.2810 seconds

-------------------
3.8636365
MSME Country Indicators 2014
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.93181825
Quality of Goverance
Left Columns: [['year']]
Right Columns: [['year']]
-------------------
0.93181825
Quality of Governance Indicators
Left Columns: [['year']]
Right Columns: [['year']]
-------------------
0.90909094
DOHMH Tuberculosis Surveillance: Data from the Tuberculosis Control Annual Summary
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.84090906
Water Consumption In The New York City
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.8181818
NYC Independent Budget Office (IBO) Tax Revenue FY 1980 - FY 2014
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.6818182
Annual Overweight Load (AOL) Permits
Left Columns: [['year']]
Right Columns: [['Year']]
-------------------
0.6363636
Quality of Governance Abridged
Left Columns: [['year']]
Right Col

In [34]:
join_ = results[0].augment(supplied_data=global_terrorism)

In [35]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,year,country,group,activity_level,Source Code,Code,Economy,"GNI per Capita, Atlas Method",Population,...,Value Added Source,Micro Enterprises Sector Distribution: Manufacturing,Micro Enterprises Sector Distribution: Trade,Micro Enterprises Sector Distribution: Services,Micro Enterprises Sector Distribution: Agri/Other,SME Sector Distribution: Manufacturing,SME Sector Distribution: Trade,SME Sector Distribution: Services,SME Sector Distribution: Agri/Other,Comments
0,0,1971,Netherlands,Popular Front for the Liberation of Palestine ...,5,,,,,,...,,,,,,,,,,
1,1,1995,Paraguay,Strikers,5,2.0,IDN,Indonesia,,,...,,,,,,,,,,
2,2,1981,Israel,Palestine Liberation Organization (PLO),2,,,,,,...,,,,,,,,,,
3,3,1990,Yugoslavia,Albanian Separatists,5,,,,,,...,,,,,,,,,,
4,4,2006,Nigeria,Movement for the Actualization of the Sovereig...,2,3.0,LAO,Lao PDR,510.0,5895930.0,...,,,,,,,,,,


**Result:** Unclear. The first or second search result might be useful, but we are not sure.
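
One reason for doubt is visible in the joined table: the 1995 Paraguay row receives `Economy = Indonesia` and the 2006 Nigeria row receives `Economy = Lao PDR`, as one would expect when rows are matched on `year` alone. The sketch below illustrates the effect in plain Python. The companion years are assumed for the example, `left_join` is a hypothetical helper, and nothing here reflects how the DataMart actually performs the join.

```python
# Two supplied rows taken from the learningData preview above.
supplied = [
    {'year': 1995, 'country': 'Paraguay'},
    {'year': 2006, 'country': 'Nigeria'},
]
# Illustrative companion rows; the years are an assumption.
companion = [
    {'Year': 1995, 'Economy': 'Indonesia'},
    {'Year': 2006, 'Economy': 'Lao PDR'},
]


def left_join(left, right, keys):
    """Left join two lists of dicts; keys is a list of (left_col, right_col) pairs."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[rc] for _, rc in keys), []).append(row)
    joined = []
    for row in left:
        # Unmatched left rows are kept without any companion columns.
        for match in index.get(tuple(row[lc] for lc, _ in keys), [None]):
            joined.append({**row, **(match or {})})
    return joined


by_year = left_join(supplied, companion, [('year', 'Year')])
by_year_country = left_join(
    supplied, companion, [('year', 'Year'), ('country', 'Economy')])

print(by_year[0])          # Paraguay row paired with Indonesia
print(by_year_country[0])  # Paraguay row left unmatched
```

Matching on both year and country avoids the spurious pairing, at the cost of leaving rows unmatched when the companion dataset does not cover that country.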

### DA_housing_burden

In [36]:
housing_burden_file = home_dir + 'DA_housing_burden/DA_housing_burden_dataset/datasetDoc.json'
housing_burden = container.Dataset.load('file://' + housing_burden_file)

In [37]:
housing_burden['learningData'].head()

Unnamed: 0,d3mIndex,RT,SERIALNO,DIVISION,SPORDER,PUMA,REGION,ST,ADJINC,PWGTP,...,PWGTP72,PWGTP73,PWGTP74,PWGTP75,PWGTP76,PWGTP77,PWGTP78,PWGTP79,PWGTP80,BURDEN
0,0,P,2017001026578,9,5,7701,4,6,1011189,79,...,102,26,66,134,32,28,80,77,59,2
1,1,P,2017000186978,9,1,6502,4,6,1011189,75,...,133,70,23,75,21,129,67,23,19,4
2,2,P,2017001070961,9,2,7902,4,6,1011189,53,...,15,86,49,57,53,60,17,14,50,5
3,3,P,2017000672057,9,5,3717,4,6,1011189,64,...,23,48,142,21,51,64,52,65,60,5
4,4,P,2017000011657,9,3,5303,4,6,1011189,138,...,116,43,198,130,116,168,161,112,40,1


Without keywords.

In [41]:
cursor = client.search_with_data(query=None, supplied_data=housing_burden)

In [42]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 608.3932 seconds

-------------------
1.0
California Housing Data
Left Columns: [['DIVISION']]
Right Columns: [['DIVISION']]
-------------------
1.0
psam_h06
Left Columns: [['DIVISION']]
Right Columns: [['DIVISION']]
-------------------
1.0
NYCgov Poverty Measure Data (2014)
Left Columns: [['SPORDER']]
Right Columns: [['SPORDER']]
-------------------
1.0
NYCgov Poverty Measure Data (2010)
Left Columns: [['SPORDER']]
Right Columns: [['SPORDER']]
-------------------
1.0
NYCgov Poverty Measure Data (2011)
Left Columns: [['SPORDER']]
Right Columns: [['SPORDER']]
-------------------
1.0
California Housing Data
Left Columns: [['REGION']]
Right Columns: [['REGION']]
-------------------
1.0
2006-07 Class Size - School-level Detail
Left Columns: [['REGION']]
Right Columns: [['REGION']]
-------------------
1.0
psam_h06
Left Columns: [['REGION']]
Right Columns: [['REGION']]
-------------------
1.0
California Housing Data
Left Columns: [['ST']]
Right Columns: [['ST']]
-------------------

With keywords.

In [45]:
query = datamart.DatamartQuery(
    keywords=['American Community Survey', 'ACS',
              'Public Use Microdata Sample', 'PUMS',
              '2017', 'California', 'government',
              'economics', 'housing', 'population'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=housing_burden)

In [46]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 488.5060 seconds

-------------------
70.0
psam_h06
Left Columns: [['DIVISION']]
Right Columns: [['DIVISION']]
-------------------
70.0
psam_h06
Left Columns: [['REGION']]
Right Columns: [['REGION']]
-------------------
70.0
psam_h06
Left Columns: [['ST']]
Right Columns: [['ST']]
-------------------
70.0
psam_h06
Left Columns: [['ADJINC']]
Right Columns: [['ADJINC']]
-------------------
70.0
psam_h06
Left Columns: [['FER']]
Right Columns: [['FES']]
-------------------
70.0
psam_h06
Left Columns: [['GCR']]
Right Columns: [['ACR']]
-------------------
70.0
psam_h06
Left Columns: [['SEX']]
Right Columns: [['SMX']]
-------------------
70.0
psam_h06
Left Columns: [['NOP']]
Right Columns: [['NP']]
-------------------
70.0
psam_h06
Left Columns: [['FAGEP']]
Right Columns: [['FAGSP']]
-------------------
70.0
psam_h06
Left Columns: [['FCITP']]
Right Columns: [['FKITP']]
-------------------
70.0
psam_h06
Left Columns: [['FCITWP']]
Right Columns: [['FKITP']]
-------------------
70.0
ps

**Result:** We find the expected dataset!

### DA_medical_malpractice

In [48]:
medical_malpractice_file = home_dir + 'DA_medical_malpractice/DA_medical_malpractice_dataset/datasetDoc.json'
medical_malpractice = container.Dataset.load('file://' + medical_malpractice_file)

In [49]:
medical_malpractice['learningData'].head()

Unnamed: 0,d3mIndex,SEQNO,LICNFELD,ORIGYEAR,WORKSTAT,ALGNNATR,ALEGATN1,PTTYPE,PRACTAGE,PFIDX
0,404537,514456,10,2004,AZ,20,306,I,30,32.737
1,404538,514457,10,2004,PA,1,200,B,50,42.09
2,404540,514460,651,2004,SD,100,316,O,50,58.926
3,404554,514475,430,2004,NJ,60,334,O,20,77.633
4,404556,514477,30,2004,NH,60,306,O,50,1.871


Without keywords.

In [50]:
cursor = client.search_with_data(query=None, supplied_data=medical_malpractice)

In [51]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 14.6827 seconds

-------------------
1.0
NPDB1807
Left Columns: [['ORIGYEAR']]
Right Columns: [['ORIGYEAR']]
-------------------
1.0
NPDB1807
Left Columns: [['ALGNNATR']]
Right Columns: [['ALGNNATR']]
-------------------
0.8933495
NPDB1807
Left Columns: [['SEQNO']]
Right Columns: [['SEQNO']]
-------------------
0.82362723
NPDB1807
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN1']]
-------------------
0.7991968
NPDB1807
Left Columns: [['LICNFELD']]
Right Columns: [['LICNFELD']]
-------------------
0.6086957
NPDB1807
Left Columns: [['PRACTAGE']]
Right Columns: [['PRACTAGE']]
-------------------
0.40599
NPDB1807
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN2']]
-------------------


With keywords.

In [65]:
query = datamart.DatamartQuery(
    keywords=['practitioner', 'clinical', 'malpractice', 'practitioner data bank',
              'government', 'healthcare', 'Department of health and human services'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=medical_malpractice)

In [66]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 23.1732 seconds

-------------------
50.0
NPDB1807
Left Columns: [['ORIGYEAR']]
Right Columns: [['ORIGYEAR']]
-------------------
50.0
NPDB1807
Left Columns: [['ALGNNATR']]
Right Columns: [['ALGNNATR']]
-------------------
44.667477
NPDB1807
Left Columns: [['SEQNO']]
Right Columns: [['SEQNO']]
-------------------
41.181362
NPDB1807
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN1']]
-------------------
39.95984
NPDB1807
Left Columns: [['LICNFELD']]
Right Columns: [['LICNFELD']]
-------------------
30.434784
NPDB1807
Left Columns: [['PRACTAGE']]
Right Columns: [['PRACTAGE']]
-------------------
20.2995
NPDB1807
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN2']]
-------------------


**Result:** We find the expected dataset!

### DA_ny_taxi_demand

In [54]:
ny_taxi_demand_file = home_dir + 'DA_ny_taxi_demand/DA_ny_taxi_demand_dataset/datasetDoc.json'
ny_taxi_demand = container.Dataset.load('file://' + ny_taxi_demand_file)

In [55]:
ny_taxi_demand['learningData'].head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups
0,0,2018-04-19 22:00:00,731
1,1,2018-06-30 20:00:00,183
2,2,2018-06-02 10:00:00,384
3,3,2018-04-17 13:00:00,648
4,4,2018-01-04 01:00:00,3


Without keywords.

In [56]:
cursor = client.search_with_data(query=None, supplied_data=ny_taxi_demand)

In [57]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 4.0649 seconds

-------------------
1.0
Water Consumption And Cost (2013 - March 2019)
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Revenue Month']]
-------------------
1.0
Street Construction Permits
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['ModifiedOn']]
-------------------
1.0
Appeals Closed In 2017
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Expiration']]
-------------------
1.0
Housing New York Units by Building
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Project Start Date']]
-------------------
1.0
Housing Maintenance Code Complaints
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['ReceivedDate']]
-------------------
1.0
Capital Projects
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Forecast Completion']]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Advanced Regents Num']]
-------------------
1.0
2005 

With keywords.

In [60]:
query = datamart.DatamartQuery(
    keywords=['transportation', 'city data', 'taxi',
              'yellow cab', 'pickup', 'LaGuardia airport',
              'weather', 'weather conditions', 'new york',
              'hourly'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=ny_taxi_demand)

In [61]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 8.4898 seconds

-------------------
50.0
Newyork Weather Data around Airport 2016-18
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['DATE']]
-------------------
30.0
ny_lga_weather_16_17_18
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['DATE']]
-------------------
30.0
Medallion  Vehicles - Authorized
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Last Date Updated']]
-------------------
25.319695
Trade Waste Hauler Licensees
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['CREATED']]
-------------------
25.319695
Trade Waste Hauler Licensees
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['EXPORT DATE']]
-------------------
24.126173
Medallion  Vehicles - Authorized
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Expiration Date']]
-------------------
20.0
Housing New York Units by Building
Left Columns: [['tpep_pickup_datetime']]
Right Columns: [['Project Start Date']]
-------------------
20.0
Housing New Yor

In [63]:
join_ = results[0].augment(supplied_data=ny_taxi_demand)

In [64]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,0,2018-04-19 22:00:00,731,FEW:02 42,5.0,53.0,16.0,310,29.97
1,1,2018-06-30 20:00:00,183,SCT:04 250,30.6,43.0,5.0,180,29.97
2,2,2018-06-02 10:00:00,384,FEW:02 40 FEW:02 150 SCT:04 200,28.3,61.0,6.0,70,29.7
3,3,2018-04-17 13:00:00,648,BKN:07 46 BKN:07 85,8.3,44.0,17.0,260,29.6
4,4,2018-01-04 01:00:00,3,OVC:08 32,-1.7,45.0,8.0,20,29.91


**Result:** We find the expected dataset!

### DA_poverty_estimation

In [68]:
poverty_estimation_file = home_dir + 'DA_poverty_estimation/DA_poverty_estimation_dataset/datasetDoc.json'
poverty_estimation = container.Dataset.load('file://' + poverty_estimation_file)

In [69]:
poverty_estimation['learningData'].head()

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016
0,0,35005,NM,Chaves County,5,13974
1,1,13297,GA,Walton County,1,11385
2,2,13137,GA,Habersham County,6,6500
3,3,54017,WV,Doddridge County,9,1460
4,4,55055,WI,Jefferson County,4,7618


Without keywords.

In [70]:
cursor = client.search_with_data(query=None, supplied_data=poverty_estimation)

In [71]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 1.7506 seconds

-------------------
0.93730605
Zillow Median Listing Prices 2017
Left Columns: [['FIPS']]
Right Columns: [['FIPS']]
-------------------
0.9362234
FIPS Population
Left Columns: [['FIPS']]
Right Columns: [['FIPS']]
-------------------


With keywords.

In [72]:
query = datamart.DatamartQuery(
    keywords=['Government', 'Economics', 'Department of Agriculture',
              'USDA', 'economic research service', 'ERS', 'county-level',
              'socioeconomic indicators', 'poverty rate', 'education',
              'population', 'unemployment'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=poverty_estimation)

In [73]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 6.9680 seconds

-------------------
18.724468
FIPS Population
Left Columns: [['FIPS']]
Right Columns: [['FIPS']]
-------------------
0.93730605
Zillow Median Listing Prices 2017
Left Columns: [['FIPS']]
Right Columns: [['FIPS']]
-------------------


In [74]:
join_ = results[0].augment(supplied_data=poverty_estimation)

In [75]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,births_2017,deaths_2017,population_2017,population_2010,Unnamed: 5
0,0,35005,NM,Chaves County,5,13974,915.0,633.0,64866,65645,
1,1,13297,GA,Walton County,1,11385,1181.0,843.0,91600,83768,
2,2,13137,GA,Habersham County,6,6500,512.0,459.0,44567,43041,
3,3,54017,WV,Doddridge County,9,1460,82.0,75.0,8560,8202,
4,4,55055,WI,Jefferson County,4,7618,828.0,672.0,84832,83686,


**Result:** We find the expected dataset!