## Data Augmentation Seed Problems: NYU DataMart

This notebook investigates whether, and how, the NYU DataMart finds the companion datasets for the data augmentation seed problems.

In [1]:
from d3m import container
import datamart
import datamart_nyu
from pathlib import Path
import time

In [2]:
def format_columns(column_groups):
    """Convert augment-hint column groups to lists of (resource_id, column_index) tuples."""
    return [
        [(column.resource_id, column.column_index) for column in group]
        for group in column_groups
    ]


def print_results(results):
    """Print the score, name, and join columns (or the result id) of each search result."""
    print("\n-------------------")
    if not results:
        print("No results!")
        return
    for result in results:
        print(result.score())
        print(result.get_json_metadata()['metadata']['name'])
        hint = result.get_augment_hint()
        if hint:
            print("Left Columns: %s" % str(format_columns(hint.left_columns)))
            print("Right Columns: %s" % str(format_columns(hint.right_columns)))
        else:
            print(result.id())
        print("-------------------")

In [3]:
client = datamart_nyu.RESTDatamart('https://datamart.d3m.vida-nyu.org')

In [4]:
# home directory for data augmentation seed datasets
# change this accordingly
home_dir = str(Path.home()) + '/projects/d3m/datasets/seed_datasets_data_augmentation/'

### DA_college_debt

Loading the supplied data.

In [3]:
college_debt_file = home_dir + 'DA_college_debt/DA_college_debt_dataset/datasetDoc.json'
college_debt = container.Dataset.load('file://' + college_debt_file)

In [4]:
college_debt['learningData'].head()

Unnamed: 0,d3mIndex,UNITID,INSTNM,PCTFLOAN,CONTROL,STABBR,PCIP16,MD_EARN_WNE_P10,PPTUG_EF,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,SATMTMID,SATVRMID,SATWRMID,UGDS,PREDDEG,DEBT_EARNINGS_RATIO
0,0,12268508,San Joaquin Valley College-Rancho Cordova,,3,CA,,28300,,,,,,,,,,0,49
1,1,207564,Oklahoma State University Institute of Technology,0.475,1,OK,0.0,35300,0.2297,0.2953,0.0291,0.0647,0.0051,,,,2164.0,2,36
2,2,420024,Centura College-Chesapeake,0.8125,3,VA,0.0,21900,0.2315,0.2808,0.5665,0.0493,0.0,,,,203.0,2,127
3,3,164492,Anna Maria College,0.7465,2,MA,0.0,44800,0.2621,0.6518,0.1258,0.1022,0.0123,,,,1057.0,3,76
4,4,234085,Virginia Military Institute,0.4589,1,VA,0.0321,65700,0.0,0.7992,0.0607,0.0584,0.042,575.0,575.0,,1713.0,3,53


Without keywords.

In [13]:
cursor = client.search_with_data(query=None, supplied_data=college_debt)

In [14]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 128.7590 seconds

-------------------
1.0
Traffic Signal and All-Way Stop Study Requests
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 0)]]
-------------------
1.0
Traffic Volume Counts (2011-2012)
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 0)]]
-------------------
1.0
Energy Usage From DOE Buildings
Left Columns: [[('learningData', 18)]]
Right Columns: [[('0', 4)]]
-------------------
1.0
New York City Locations Providing Seasonal Flu Vaccinations
Left Columns: [[('learningData', 18)]]
Right Columns: [[('0', 1)]]
-------------------
0.8668730650154799
Love Your Local Business List
Left Columns: [[('learningData', 15)]]
Right Columns: [[('0', 0)]]
-------------------
0.857566765578635
Vehicle Classification Counts (2011-2012)
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 0)]]
-------------------
0.826625386996904
2017 - 2018 Schools NYPD Crime Data Report
Left Columns: [[('learningData', 15)]]
Right Columns: [[('0', 0)]]
----

With keywords.

In [25]:
query = datamart.DatamartQuery(
    keywords=['college scorecard', 'finance', 'debt', 'earnings'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=college_debt)

In [26]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 47.2278 seconds

-------------------
No results!


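**Result:** Without keywords, the top results (traffic studies, energy usage, flu vaccination locations) are unrelated to college debt, and with the problem keywords the search returns no results. Unfortunately, it doesn't look like there is any useful dataset in our DataMart to solve this problem.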

### DA_consumer_complaints

Loading the supplied data.

In [19]:
consumer_complaints_file = home_dir + 'DA_consumer_complaints/DA_consumer_complaints_dataset/datasetDoc.json'
consumer_complaints = container.Dataset.load('file://' + consumer_complaints_file)

In [20]:
consumer_complaints['learningData'].head()

Unnamed: 0,d3mIndex,Complaint ID,State,ZIP code,Company,Consumer complaint narrative,Date sent to company,relevance
0,0,2252221,CA,928XX,Ditech Financial LLC,In XX/XX/XXXX I fell behind on my mortgage. I ...,12/15/16,1
1,1,1554019,CA,90731,"BANK OF AMERICA, NATIONAL ASSOCIATION",,09/06/15,2
2,2,2379952,CA,951XX,PORTFOLIO RECOVERY ASSOCIATES INC,Repeatedly calling a family members phone numb...,03/13/17,2
3,3,1363639,CA,900XX,JPMORGAN CHASE & CO.,I closed my Chase mileage card last year. At t...,05/06/15,0
4,4,1644731,CA,94703,"CITIBANK, N.A.",,11/09/15,0


Without keywords.

In [29]:
cursor = client.search_with_data(query=None, supplied_data=consumer_complaints)

In [30]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 81.9748 seconds

-------------------
1.0
Traffic Signal and All-Way Stop Study Requests
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 30)]]
-------------------
1.0
Recognized Shop Healthy Stores
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 1)]]
-------------------
1.0
Contractor / Sub Contractor Change Order Report
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 5)]]
-------------------
1.0
Cash Assistance Youth Engagement
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 6)]]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 21)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2017
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 23)]]
-------------------
1.0
Parking Violations Issued - Fiscal Year 2015
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 17)]]
-------------------
1.0
FY17 BID Trends Re

With keywords.

In [31]:
query = datamart.DatamartQuery(
    keywords=['consumer', 'complaints', 'protect', 'unfair practices',
              'consumer financial protection bureau'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=consumer_complaints)

In [32]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 35.8385 seconds

-------------------
1.0
DOB Complaints Received
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 13)]]
-------------------
0.944444445294979
Landmarks Complaints
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 2)]]
-------------------
0.9351851861774755
Historical IDA Income Statements Data
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 4)]]
-------------------
0.9351851861774755
Historical IDA Balance Sheets Data
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 5)]]
-------------------
0.9307760151691404
Published Audit List
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 4)]]
-------------------
0.9109347507348401
IFC Investment Services Projects
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 15)]]
-------------------
0.8948963962952288
DOHMH Indoor Environmental Complaints
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 9)]]
-------------------
0.8946208180040006
Bur

**Result:** Unfortunately, it doesn't look like there is any useful dataset in our DataMart to solve this problem. The datasets returned by the search do not contain any consumer complaint data, only other types of complaints (such as `DOB Complaints Received` and `Landmarks Complaints`) that are semantically unrelated. This does not mean, however, that the augmentation would not improve performance.

### DA_fifa2018_manofmatch

Loading the supplied data.

In [4]:
fifa2018_manofmatch_file = home_dir + 'DA_fifa2018_manofmatch/DA_fifa2018_manofmatch_dataset/datasetDoc.json'
fifa2018_manofmatch = container.Dataset.load('file://' + fifa2018_manofmatch_file)

In [5]:
fifa2018_manofmatch['learningData'].head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,Passes,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals
0,0,55,23-06-2018,Mexico,Korea Republic,59,6,2,0,5,89,485,97,0,1,26.0,Group Stage,No,0,
1,1,40,21-06-2018,Denmark,Australia,49,5,0,1,4,88,458,112,0,1,7.0,Group Stage,No,0,
2,2,19,17-06-2018,Mexico,Germany,40,6,2,2,9,82,281,106,0,0,35.0,Group Stage,No,0,
3,3,31,19-06-2018,Senegal,Poland,43,4,2,3,3,81,328,107,0,1,60.0,Group Stage,No,0,
4,4,98,30-06-2018,Uruguay,Portugal,39,2,1,0,4,69,269,106,0,1,7.0,Round of 16,No,0,


Without keywords.

In [8]:
cursor = client.search_with_data(query=None, supplied_data=fifa2018_manofmatch)

In [9]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 376.7774 seconds

-------------------
1.0
FIFA 2018 game statistics data
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 8)]]
-------------------
1.0
Housing New York Units by Building
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 19)]]
-------------------
1.0
Recognized Shop Healthy Stores
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 1)]]
-------------------
1.0
Contractor / Sub Contractor Change Order Report
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 10)]]
-------------------
1.0
Cash Assistance Youth Engagement
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 6)]]
-------------------
1.0
City Clerk eLobbyist Data
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 14)]]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [[('learningData', 2)]]
Right Columns: [[('0', 21)]]
-------------------
1.0
Street Construction Permits - Stipulations (Historical)
Left 

With keywords.

In [12]:
query = datamart.DatamartQuery(
    keywords=['soccer', 'FIFA'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=fifa2018_manofmatch)

In [13]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 6.4550 seconds

-------------------
1.0
FIFA 2018 game statistics data
Left Columns: [[('learningData', 13)]]
Right Columns: [[('0', 8)]]
-------------------
0.9827586206896551
FIFA 2018 game statistics data
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 0)]]
-------------------
0.5681818181818182
FIFA 2018 game statistics data
Left Columns: [[('learningData', 6)]]
Right Columns: [[('0', 3)]]
-------------------


**Result:** If we use the exact keywords that MIT-LL provides, we find a lot of unrelated data. This is because one of the keywords is 'FIFA 2018', which matches any dataset that mentions '2018' in its title or description. The same goes for 'match data': anything that matches 'data' is included. Note, however, that the expected dataset still ranks first, although not with the correct join attribute.
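The over-matching described above can be reproduced with a toy model. The sketch below assumes that multi-word keywords are split on whitespace and that a dataset matches when any single token appears in its title. This is an illustrative assumption, not the DataMart's actual search implementation; `tokenize` and `matches_any_token` are hypothetical helpers, and the titles are taken from outputs shown in this notebook.

```python
def tokenize(text):
    """Lowercase a string and split it on whitespace."""
    return set(text.lower().split())


def matches_any_token(keywords, title):
    """Return True if any token of any keyword occurs in the title."""
    query_tokens = set()
    for keyword in keywords:
        query_tokens |= tokenize(keyword)
    return bool(query_tokens & tokenize(title))


# Dataset titles copied from search outputs in this notebook.
titles = [
    'FIFA 2018 game statistics data',
    '2017 - 2018 Schools NYPD Crime Data Report',
    'Traffic Volume Counts (2011-2012)',
]

# Multi-word keywords: the tokens '2018' and 'data' match unrelated titles.
broad = [t for t in titles if matches_any_token(['FIFA 2018', 'match data'], t)]
# Narrow keywords: only the FIFA dataset matches.
narrow = [t for t in titles if matches_any_token(['soccer', 'FIFA'], t)]

print(broad)
print(narrow)
```

Under this assumption, the broad query also picks up the school crime report (through '2018' and 'data'), while the narrow query keeps only the FIFA dataset.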

If we use only the short keywords 'soccer' and 'FIFA', as in the query above, we can find the right dataset. The potentially correct join is the second one (on the `GameID` column). The first one, on the `Yellow & Red` column, is not the expected join.
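To check which columns a join hint refers to, the column indices can be mapped back to names. The sketch below assumes that `column_index` is a 0-based position in `learningData` with `d3mIndex` at position 0, which is consistent with the `GameID` and `Yellow & Red` joins named above. `hint_column_names` is a hypothetical helper, and the column list is copied from the `head()` output of this section.

```python
# Column names of fifa2018_manofmatch['learningData'], copied from the head() output.
FIFA_COLUMNS = [
    'd3mIndex', 'GameID', 'Date', 'Team', 'Opponent', 'Ball Possession %',
    'Off-Target', 'Blocked', 'Offsides', 'Saves', 'Pass Accuracy %', 'Passes',
    'Distance Covered (Kms)', 'Yellow & Red', 'Man of the Match', '1st Goal',
    'Round', 'PSO', 'Goals in PSO', 'Own goals',
]


def hint_column_names(column_groups, column_names):
    """Translate [[(resource_id, column_index), ...], ...] into column names."""
    return [
        [column_names[index] for _resource_id, index in group]
        for group in column_groups
    ]


# Left columns of the three results returned by the keyword search.
for left_columns in ([[('learningData', 13)]],
                     [[('learningData', 1)]],
                     [[('learningData', 6)]]):
    print(left_columns, '->', hint_column_names(left_columns, FIFA_COLUMNS))
```

With the dataset loaded, the name list could presumably be taken from `list(fifa2018_manofmatch['learningData'].columns)` instead of being typed out.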

### DA_global_terrorism

In [None]:
global_terrorism_file = home_dir + 'DA_global_terrorism/DA_global_terrorism_dataset/datasetDoc.json'
global_terrorism = container.Dataset.load('file://' + global_terrorism_file)

In [None]:
global_terrorism['learningData'].head()

Without keywords.

In [None]:
cursor = client.search_with_data(query=None, supplied_data=global_terrorism)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

With keywords.

In [None]:
query = datamart.DatamartQuery(
    keywords=['global terrorism', 'terrorist events'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=global_terrorism)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

### DA_housing_burden

In [None]:
housing_burden_file = home_dir + 'DA_housing_burden/DA_housing_burden_dataset/datasetDoc.json'
housing_burden = container.Dataset.load('file://' + housing_burden_file)

In [None]:
housing_burden['learningData'].head()

Without keywords.

In [None]:
cursor = client.search_with_data(query=None, supplied_data=housing_burden)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

With keywords.

In [None]:
query = datamart.DatamartQuery(
    keywords=['American Community Survey', 'ACS',
              'Public Use Microdata Sample', 'PUMS'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=housing_burden)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

### DA_medical_malpractice

In [None]:
medical_malpractice_file = home_dir + 'DA_medical_malpractice/DA_medical_malpractice_dataset/datasetDoc.json'
medical_malpractice = container.Dataset.load('file://' + medical_malpractice_file)

In [None]:
medical_malpractice['learningData'].head()

Without keywords.

In [None]:
cursor = client.search_with_data(query=None, supplied_data=medical_malpractice)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

With keywords.

In [None]:
query = datamart.DatamartQuery(
    keywords=['practitioner', 'clinical', 'malpractice', 'practitioner data bank'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=medical_malpractice)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

### DA_ny_taxi_demand

In [14]:
ny_taxi_demand_file = home_dir + 'DA_ny_taxi_demand/DA_ny_taxi_demand_dataset/datasetDoc.json'
ny_taxi_demand = container.Dataset.load('file://' + ny_taxi_demand_file)

In [15]:
ny_taxi_demand['learningData'].head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups
0,0,2018-04-19 22:00:00,731
1,1,2018-06-30 20:00:00,183
2,2,2018-06-02 10:00:00,384
3,3,2018-04-17 13:00:00,648
4,4,2018-01-04 01:00:00,3


Without keywords.

In [16]:
cursor = client.search_with_data(query=None, supplied_data=ny_taxi_demand)

In [17]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 107.2252 seconds

-------------------
1.0
Housing New York Units by Building
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 19)]]
-------------------
1.0
Recognized Shop Healthy Stores
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 1)]]
-------------------
1.0
Bureau of Fire Prevention - Certificates of Fitness
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 4)]]
-------------------
1.0
Contractor / Sub Contractor Change Order Report
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 10)]]
-------------------
1.0
Cash Assistance Youth Engagement
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 6)]]
-------------------
1.0
Appeals Closed In 2017
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 8)]]
-------------------
1.0
City Clerk eLobbyist Data
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 14)]]
-------------------
1.0
2005 - 2011 Graduation Outcomes - Borough - ELL
Left Columns: [[('l

With keywords.

In [22]:
query = datamart.DatamartQuery(
    keywords=['weather conditions', 'new york', 'hourly', 'LaGuardia airport'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=ny_taxi_demand)

In [23]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

Duration: 29.3918 seconds

-------------------
1.0
Housing New York Units by Building
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 19)]]
-------------------
1.0
Low Income Housing Tax Credits Awarded by HPD: Project-Level (9% Awards)
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 13)]]
-------------------
1.0
DOHMH New York City Restaurant Inspection Results
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 8)]]
-------------------
1.0
NYC Wi-Fi Hotspot Locations
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 15)]]
-------------------
1.0
Housing New York Units by Project
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 3)]]
-------------------
1.0
Low Income Housing Tax Credits Awarded by HPD: Project-Level (4% Awards)
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 13)]]
-------------------
1.0
Use of ARRA Stimulus Funds
Left Columns: [[('learningData', 1)]]
Right Columns: [[('0', 26)]]
------------------

**Result:** The expected dataset is in the result set, but it does not rank higher because the keyword score is currently not taken into account in the ranking. If we use only the keyword 'weather', the expected dataset appears in first position. See below.

In [26]:
query = datamart.DatamartQuery(
    keywords=['weather'],  # reduced from the problemDoc.json keywords
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=ny_taxi_demand)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)
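The ranking issue noted in the result above (the keyword score is ignored) could in principle be addressed by blending the join score with a keyword score. The sketch below only illustrates that idea: `combined_score`, `rerank`, the `keyword_weight` parameter, and the candidate entries with their scores are all hypothetical and are not part of the DataMart API or of the results in this notebook.

```python
def combined_score(join_score, keyword_score, keyword_weight=0.5):
    """Weighted average of a join score and a keyword score, both in [0, 1]."""
    return (1 - keyword_weight) * join_score + keyword_weight * keyword_score


def rerank(candidates, keyword_weight=0.5):
    """Sort candidates by combined score, highest first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c['join_score'], c['keyword_score'],
                                     keyword_weight),
        reverse=True,
    )


# Invented candidates: both join perfectly, but only one matches the keywords well.
candidates = [
    {'name': 'hypothetical housing dataset', 'join_score': 1.0, 'keyword_score': 0.1},
    {'name': 'hypothetical weather dataset', 'join_score': 1.0, 'keyword_score': 0.9},
]

for candidate in rerank(candidates):
    print(candidate['name'])
```

When several candidates tie on the join score, as in the outputs above where many results score 1.0, even a small keyword weight is enough to break the tie in favor of the keyword match.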

### DA_poverty_estimation

In [None]:
poverty_estimation_file = home_dir + 'DA_poverty_estimation/DA_poverty_estimation_dataset/datasetDoc.json'
poverty_estimation = container.Dataset.load('file://' + poverty_estimation_file)

In [None]:
poverty_estimation['learningData'].head()

Without keywords.

In [None]:
cursor = client.search_with_data(query=None, supplied_data=poverty_estimation)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)

With keywords.

In [None]:
query = datamart.DatamartQuery(
    keywords=['USDA', 'economic research service', 'ERS',
              'county-level', 'socioeconomic indicators',
              'poverty rate', 'education', 'population',
              'unemployment'],  # from problemDoc.json
    variables=[]
)
cursor = client.search_with_data(query=query, supplied_data=poverty_estimation)

In [None]:
start = time.time()
results = cursor.get_next_page()
print("Duration: %.4f seconds" % (time.time() - start))
print_results(results)