# Field Selection

This notebook continues the thread of "mhkdr_apis_comparison" and takes up the question of field selection in the context of the Tethys / Tethys Engineering API results. The questions of interest follow.
1. Is there any reason to prefer one or the other [MHKDR API] data source when the fields are identical?
2. To what extent should the table structure be lossless when compared to the APIs?
    - That is, the Tethys and Tethys Engineering APIs seem to duplicate a lot of fields. Should I keep these duplicate fields?
       1. Are there any cases where these fields do differ?
    - Similarly, should I keep all information from the API, even when this information seems useless? Examples are the separate "@type" variable in MHKDR, or the unique attribution style within the MHKDR "Organization" field, which includes the author again alongside other unique information.
       1. Is there a situation in which these fields will help us achieve the goals of this package?
  
What are the goals of the package?
1. Enable automated download of the content
2. Enable cross-Knowledge Hub analysis of the Marine Energy Industry

### Setup

In [1]:
import requests
import json
import pandas as pd
#import re
import primrea

tethys_api = 'https://tethys.pnnl.gov/api/primre_export'
tethys_e_api = 'https://tethys-engineering.pnnl.gov/api/primre_export'
mhkdr_api_1 = 'https://mhkdr.openei.org/api?action=getSubmissionsForPRIMRE'
mhkdr_api_2 = 'https://mhkdr.openei.org/data.json'

In [2]:
tethys_response = requests.get(tethys_api)       # Note: The tethys api grabs content specifically related to marine energy, and there is another API for wind energy.
tethys_e_response = requests.get(tethys_e_api)
mhkdr_1_response = requests.get(mhkdr_api_1)
mhkdr_2_response = requests.get(mhkdr_api_2)

SSLError: HTTPSConnectionPool(host='mhkdr.openei.org', port=443): Max retries exceeded with url: /api?action=getSubmissionsForPRIMRE (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)')))
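
The error above reports that certificate verification failed because of a self-signed certificate in the chain. One possible cause (a guess about this environment, not something the traceback confirms) is a network that re-signs TLS traffic with its own certificate authority. A minimal sketch of a fetch helper that lets the caller point `requests` at a trusted CA bundle through the `verify` argument is shown below. The helper name `fetch_json` and its parameters are hypothetical and not part of `primrea`.

```python
import requests


def fetch_json(url, ca_bundle=None, timeout=30, session=None):
    """Fetch `url` and return the parsed JSON body.

    ca_bundle: optional path to a PEM file of trusted certificate authorities
    (hypothetical; supplied by whoever runs the notebook). When omitted,
    requests falls back to its default certificate verification.
    """
    if session is None:
        session = requests.Session()
    # `verify` accepts either True (default CA store) or a path to a CA bundle.
    verify = ca_bundle if ca_bundle else True
    response = session.get(url, timeout=timeout, verify=verify)
    # Fail loudly on HTTP errors instead of parsing an error page as JSON.
    response.raise_for_status()
    return response.json()
```

A call might look like `fetch_json(mhkdr_api_1, ca_bundle='/path/to/ca-bundle.pem')`, where the path is a placeholder. Keeping verification on and supplying the right CA bundle is generally safer than disabling verification.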

In [None]:
tethys_response_json = tethys_response.json()
tethys_e_response_json = tethys_e_response.json()
mhkdr_1_response_json = mhkdr_1_response.json()
mhkdr_2_response_json = mhkdr_2_response.json()

In [None]:
tethys_df = pd.DataFrame(tethys_response_json)
tethys_e_df = pd.DataFrame(tethys_e_response_json)
mhkdr_1_df = pd.DataFrame(mhkdr_1_response_json)
mhkdr_2_df = pd.DataFrame(mhkdr_2_response_json['dataset'])

### MHKDR Data

|API 1 only|both: (API 1) / (API 2)|API 2 only|
|:---:|:---:|:---:|
||(URI, landingPage, sourceURL) / (identifier, landingPage)||
|type|||
||(title) / (title)||
||(description) / (description)||
|author|||
|organization|||
|originationDate|||
||(spatial) / (spatial)?||
|technologyType|||
||(tags) / (keyword)||
|signatureProject|||
||(modifiedDate) / (modified)?||
|||@type|
|||accessLevel|
|||bureauCode|
|||license|
|||issued|
|||dataQuality|
|||projectTitle|
|||projectNumber|
|||publisher|
|||contactPoint|
|||programCode|
|||distribution|
|||DOI|

### Dev

#### 1.1 Should I prefer one or the other MHKDR data source when the fields are identical?

The first step is to find out whether there are cases in which the values of nominally identical fields differ. For this, I will create a function that takes two target columns, 'x' and 'y', and returns a dataframe containing only the rows whose values of 'x' and 'y' are NOT identical. The same function will be needed to test the duplicated columns in Tethys / Tethys Engineering, and should prove useful for rooting out redundancy across the board.

In [None]:
tethys_df.head(3)

After viewing the Tethys data again, I see the point. "landingPage" is different from "sourceURL" because Tethys is not the origin of the documents it lists. Tethys hosts the *connection*, not the original source of the documents. So, every document in Tethys should have a "sourceURL" that is different from the Tethys page.

Likewise, the "URI" is different from the "landingPage" because the URI includes the node number and is definitively unique. The URI redirects to the landingPage URL when visited, but the URI is not directly shown to users. This point is confusing in MHKDR because there is no difference between the URI and the landingPage (URL).

Despite the promising findings above, I must still write the function in order to test the redundancy of identical columns across the MHKDR APIs.

In [None]:
def identicals():
    '''
    The purpose of this function is to return a dataframe where, for each observation (row), the
    values of the two columns are not identical.
    '''
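
The stub above is left as written. A minimal sketch of how such a helper might look is given here, under a few assumptions: the name `find_mismatches` and its signature are hypothetical, two missing values are treated as identical, and cells are compared with plain Python equality so that list- or dict-valued cells also work.

```python
import pandas as pd


def find_mismatches(df, x, y):
    """Return the rows of `df` whose values in columns `x` and `y` differ.

    Rows where both values are missing (NaN/None) count as identical.
    """
    left, right = df[x], df[y]
    both_missing = left.isna() & right.isna()
    # Compare row by row with Python equality rather than a vectorized
    # comparison, so cells holding lists or dicts are handled too.
    differs = pd.Series(
        [a != b for a, b in zip(left, right)],
        index=df.index,
        dtype=bool,
    )
    return df[differs & ~both_missing]


# Toy data with made-up URLs; only the middle row differs.
toy = pd.DataFrame({
    'URI': ['https://example.org/node/1', 'https://example.org/node/2', None],
    'landingPage': ['https://example.org/node/1', 'https://example.org/page/2', None],
})
mismatches = find_mismatches(toy, 'URI', 'landingPage')
print(mismatches)
```

An empty result means the two columns are redundant for every row, which is the same conclusion `Series.equals` gives, but a non-empty result also shows which rows disagree.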
        

In [None]:
mhkdr_1_df.head(1)

In [None]:
mhkdr_2_df.head(1)

In [None]:
a = mhkdr_1_df['URI'].equals(mhkdr_1_df['landingPage'])
b = mhkdr_1_df['URI'].equals(mhkdr_1_df['sourceURL'])
c = mhkdr_1_df['landingPage'].equals(mhkdr_1_df['sourceURL'])
print(f'MHKDR API 1 : "URI" = "landingPage"       - {a}\nMHKDR API 1 : "URI" = "sourceURL"         - {b}\nMHKDR API 1 : "landingPage" = "sourceURL" - {c}')

Looking good!

As expected, the three columns of MHKDR API 1 ("URI", "landingPage", and "sourceURL") are identical.

In [None]:
a = mhkdr_2_df['identifier'].equals(mhkdr_2_df['landingPage'])
print(f'MHKDR API 2 : "identifier" = "landingPage"  - {a}')

In [None]:
a = mhkdr_1_df['title'].equals(mhkdr_2_df['title'])
b = mhkdr_1_df['description'].equals(mhkdr_2_df['description'])
c = mhkdr_1_df['spatial'].equals(mhkdr_2_df['spatial'])
d = mhkdr_1_df['tags'].equals(mhkdr_2_df['keyword'])
e = mhkdr_1_df['modifiedDate'].equals(mhkdr_2_df['modified'])
print(f'MHKDR API 1-2 : "title" = "title"              - {a}\nMHKDR API 1-2 : "description" = "description"  - {b}\nMHKDR API 1-2 : "spatial" = "spatial"          - {c}\nMHKDR API 1-2 : "tags" = "keyword"             - {d}\nMHKDR API 1-2 : "modifiedDate" = "modified"    - {e}')

With the results above, I will elect to keep the 3 redundant columns of data for MHKDR, so as to match the form of Tethys/Tethys Engineering, and store them all together.

The evidently mismatching columns must be investigated further to determine which version to include in the final structure.

In [None]:
mhkdr_1_df['title'][0]

In [None]:
mhkdr_2_df['title'][0]

They were mismatched because they were not sorted by id!
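
A toy illustration of why row order matters here: `Series.equals` compares values position by position, so two columns with the same content in a different order are reported as unequal. The dataframes below are invented for the example.

```python
import pandas as pd

# Same records, different row order (made-up data).
api_1 = pd.DataFrame({'id': [1, 2, 3], 'title': ['a', 'b', 'c']})
api_2 = pd.DataFrame({'id': [3, 1, 2], 'title': ['c', 'a', 'b']})

# Positional comparison of the raw columns.
unaligned = api_1['title'].equals(api_2['title'])

# Sort both tables by id and reset the index before comparing.
aligned_1 = api_1.sort_values('id').reset_index(drop=True)
aligned_2 = api_2.sort_values('id').reset_index(drop=True)
aligned = aligned_1['title'].equals(aligned_2['title'])

print(unaligned, aligned)  # False True
```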

Next, create the id column, append it, and then sort both dfs by id to compare apples to apples. Also, do a join to remove the 65 observations that are included in API 1 but not in API 2. For those observations we have no choice but to use the variables we have access to, which is itself a strong argument for using the API 1 variables in all cases. It is still useful to run these tests, because they verify that no data is lost by excluding the variables repeated across the two APIs.

In [None]:
# Build one id per record from its URL-like field, so both tables share a key.
find_entry_id = primrea.kh_table_gen.entry_based.find_entry_id
ids_1 = [find_entry_id(uri) for uri in mhkdr_1_df['URI']]
ids_2 = [find_entry_id(identifier) for identifier in mhkdr_2_df['identifier']]

In [None]:
mhkdr_1_df['id'] = ids_1
mhkdr_2_df['id'] = ids_2

In [None]:
mhkdr_12_df = mhkdr_1_df.merge(mhkdr_2_df, on='id')
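
The default merge is an inner join, so rows present in only one API are dropped silently. One way to audit what gets dropped is an outer merge with `indicator=True`, sketched here on invented data (the column values are placeholders, not MHKDR records).

```python
import pandas as pd

# Made-up tables: id 4 exists only in the first one.
api_1 = pd.DataFrame({'id': [1, 2, 3, 4], 'title': ['a', 'b', 'c', 'd']})
api_2 = pd.DataFrame({'id': [1, 2, 3], 'title': ['a', 'b', 'c']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
audit = api_1.merge(api_2, on='id', how='outer', indicator=True)

only_in_api_1 = audit.loc[audit['_merge'] == 'left_only', 'id'].tolist()
only_in_api_2 = audit.loc[audit['_merge'] == 'right_only', 'id'].tolist()

print(audit['_merge'].value_counts())
print(only_in_api_1, only_in_api_2)
```

Applied to the real tables, the length of the `left_only` list would be the number of observations that API 1 has and API 2 lacks.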

In [None]:
mhkdr_12_df.keys()

In [None]:
a = mhkdr_12_df['title_x'].equals(mhkdr_12_df['title_y'])
b = mhkdr_12_df['description_x'].equals(mhkdr_12_df['description_y'])
c = mhkdr_12_df['spatial_x'].equals(mhkdr_12_df['spatial_y'])
d = mhkdr_12_df['tags'].equals(mhkdr_12_df['keyword'])
e = mhkdr_12_df['modifiedDate'].equals(mhkdr_12_df['modified'])
print(f'MHKDR API 1-2 : "title" = "title"              - {a}\nMHKDR API 1-2 : "description" = "description"  - {b}\nMHKDR API 1-2 : "spatial" = "spatial"          - {c}\nMHKDR API 1-2 : "tags" = "keyword"             - {d}\nMHKDR API 1-2 : "modifiedDate" = "modified"    - {e}')

It is good to see that three variables are identical across the APIs:
1. title
2. description
3. tags

I prefer those attached to API 1 for the aforementioned reasons, so I will include these three in the final structure.

Next is the analysis of the mismatches: "spatial" and "modifiedDate".

#### Spatial

In [None]:
spatial_df = mhkdr_12_df[['spatial_x', 'spatial_y']]

In [None]:
spatial_df['spatial_x'][0]

In [None]:
spatial_df['spatial_y'][0]

As we can see, the same data is encoded with a different typing convention. As stated previously, I prefer the API 1 data because it matches the typing of the Tethys/Tethys Engineering data.

#### modifiedDate

In [None]:
mod_df = mhkdr_12_df[['modifiedDate', 'modified']]

In [None]:
mod_df['modifiedDate'][0]

In [None]:
mod_df['modified'][0]

As we can see, this is the same situation as 'spatial': the same information is encoded differently. Prefer the API 1 data for consistency with the Tethys/Tethys Engineering data.
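
To confirm that two differently formatted date columns carry the same information for every row, not just the first, both could be parsed before comparing. The sketch below uses invented date strings; the actual formats returned by the two APIs are not shown in this notebook, and treating timestamps without a zone as UTC is an assumption.

```python
import pandas as pd

# Made-up examples of one instant written in two styles.
toy = pd.DataFrame({
    'modifiedDate': ['2021-03-04 10:15:00', '2020-11-30 08:00:00'],
    'modified': ['2021-03-04T10:15:00Z', '2020-11-30T08:00:00Z'],
})

# utc=True yields timezone-aware values; zone-less strings are read as UTC.
left = pd.to_datetime(toy['modifiedDate'], utc=True)
right = pd.to_datetime(toy['modified'], utc=True)

same_instant = left == right
print(same_instant.all())
```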

#### type

In [None]:
mhkdr_12_df['type'][5]

In [None]:
mhkdr_12_df['@type'][5]

#### Other Variables

In [None]:
a = mhkdr_12_df['accessLevel']
a.drop_duplicates()

Remove accessLevel, no information gained.

In [None]:
a = mhkdr_12_df['bureauCode']
a.drop_duplicates()

Remove bureauCode, no information gained.
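
If a column holds lists or dicts in each cell (an assumption about the API 2 payload, since the cell contents are not shown here), `drop_duplicates` raises a `TypeError` because those values are unhashable. A minimal workaround is to deduplicate on a text form of each cell. The helper name `unique_values` and the toy values are hypothetical.

```python
import json

import pandas as pd


def unique_values(series):
    """Return the distinct values of a column, even for list or dict cells."""
    # json.dumps gives each cell a hashable text form; sort_keys makes the
    # key order of dicts irrelevant, default=str covers non-JSON types.
    as_text = series.map(
        lambda value: json.dumps(value, sort_keys=True, default=str)
    )
    # Keep the first occurrence of each distinct cell, in its original form.
    return series[~as_text.duplicated()]


# Toy column where every row carries the same one-element list.
toy = pd.Series([['code-a'], ['code-a'], ['code-a']], name='bureauCode')
distinct = unique_values(toy)
is_constant = len(distinct) == 1
print(distinct.tolist(), is_constant)
```

A column for which `is_constant` is true adds no information for analysis, which is the criterion used in the cells around this one.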

In [None]:
a = mhkdr_12_df['license']
a.drop_duplicates()

Possibly remove license - they are all the same so no info gained. Retain for now because this is important.

In [None]:
a = mhkdr_12_df['issued']
a.drop_duplicates()

issued - retain for now. It seems to add little, but this is one for discussion. What does this mean?

In [None]:
a = mhkdr_12_df['dataQuality']
a.drop_duplicates()

Possibly remove dataQuality - they are all the same so no info gained. Retain for now because this is important.

In [None]:
a = mhkdr_12_df['projectTitle']
a.drop_duplicates()

Keep. This is interesting because it appears related to the current funding strategy of DOE, and might group outputs in a meaningful way as they relate to funding.

In [None]:
a = mhkdr_12_df['projectNumber']
a.drop_duplicates()

Keep. Same reasons as 'projectTitle'. Very interesting, because the counts of distinct values in 'projectTitle' and 'projectNumber' differ, indicating that these are not equivalent measures.
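
To see where two such fields stop being one-to-one, the distinct values of one field can be counted within each value of the other. This is a sketch on invented project names and numbers; the helper name `values_per_key` is hypothetical.

```python
import pandas as pd


def values_per_key(df, key, value):
    """Count distinct `value` entries per `key`; keep keys with more than one."""
    counts = df.groupby(key)[value].nunique()
    return counts[counts > 1]


# Made-up data: one title is attached to two project numbers.
toy = pd.DataFrame({
    'projectTitle': ['Wave Device A', 'Wave Device A', 'Tidal Array B'],
    'projectNumber': ['N-001', 'N-002', 'N-003'],
})

titles_with_many_numbers = values_per_key(toy, 'projectTitle', 'projectNumber')
numbers_with_many_titles = values_per_key(toy, 'projectNumber', 'projectTitle')

print(titles_with_many_numbers)
print(numbers_with_many_titles)
```

If both results are empty, the two fields map one-to-one and one of them is redundant. Note that `groupby` skips rows whose key is missing, so missing values need a separate look.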

In [None]:
a = mhkdr_12_df['publisher']
a.drop_duplicates()

publisher - can probably be dropped. This is simply one of the names of the authors. It does not add unique info uncaptured elsewhere, and it confuses the grain.

In [None]:
a = mhkdr_12_df['contactPoint']
a.drop_duplicates()

In [None]:
mhkdr_12_df['contactPoint'][0]

In [None]:
a = mhkdr_12_df['programCode']
a.drop_duplicates()

In [None]:
a = mhkdr_12_df['DOI']
a.drop_duplicates()

### MHKDR Data - Revised

|API 1 only|both: (API 1) / (API 2)|API 2 only|
|:---:|:---:|:---:|
||(URI, landingPage, sourceURL) / ~(identifier, landingPage)~||
|type|||
||(title) / ~(title)~||
||(description) / ~(description)~||
|author|||
|organization|||
|originationDate|||
||(spatial) / ~(spatial)~||
|technologyType|||
||(tags) / ~(keyword)~||
|signatureProject|||
||(modifiedDate) / ~(modified)~||
|||~@type~|
|||~accessLevel~?|
|||~bureauCode~?|
|||license?|
|||issued??|
|||dataQuality?|
|||projectTitle|
|||projectNumber|
|||~publisher~|
|||contactPoint?|
|||~programCode~|
|||distribution|
|||DOI?|

@type, accessLevel, bureauCode, dataQuality: good to axe.

DOI, contactPoint, license: keep.

A list of the entries that do not have a DOI would be good for QA/QC.
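
A minimal sketch of how the list of entries without a DOI might be produced, assuming a missing DOI shows up as a null or as an empty string (how the API actually encodes a missing DOI is not shown in this notebook). The data below is invented.

```python
import pandas as pd

# Made-up records: one real-looking DOI, one null, one empty string.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'DOI': ['10.0000/example-1', None, ''],
})

# Normalize nulls and whitespace-only strings to '' before testing.
doi_text = toy['DOI'].fillna('').astype(str).str.strip()
missing_doi = toy[doi_text == '']

print(missing_doi['id'].tolist())
```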

keep issued, keep anything that is just a bool that
Check 1-1 on publisher; if the field adds anything, then keep it.

#### 4.0 "Distribution"

The aim of this section is to display the fields given by MHKDR API 2 in "Distribution", so that this data can be mapped to the Tethys/Tethys Engineering data for finalizing the table structure.

In [37]:
mhkdr_2_df['distribution'][0][0]

{'@type': 'dcat:Distribution',
 'description': 'A video recording of the data management and submission best practices training presentation and a recorded demo of data submission to the MHKDR',
 'title': 'MHKDR Data Management and Submission Best Practices Video',
 'accessURL': 'https://www.youtube.com/watch?v=QeqYr6W9HC0&list=PL4lfI7kmtVfqEk0SjiR9zPTpqVb2Mm9vO',
 'format': 'HTML',
 'mediaType': 'text/html'}

In [38]:
mhkdr_2_df['distribution'][0][0].keys()

dict_keys(['@type', 'description', 'title', 'accessURL', 'format', 'mediaType'])
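
Since each "distribution" cell is a list of dicts like the one above, one way to make the nested fields comparable with flat tables is to give every file its own row. The sketch below uses invented records with a subset of the keys shown above, and assumes every entry has at least one distribution item (an empty list would need handling first).

```python
import pandas as pd

# Made-up entries: the second one has two distribution items.
toy = pd.DataFrame({
    'id': [1, 2],
    'distribution': [
        [{'title': 'Video', 'accessURL': 'https://example.org/v', 'format': 'HTML'}],
        [{'title': 'Data', 'accessURL': 'https://example.org/d.csv', 'format': 'CSV'},
         {'title': 'Readme', 'accessURL': 'https://example.org/r.txt', 'format': 'TXT'}],
    ],
})

# One row per distribution item, keeping the parent entry id.
exploded = toy.explode('distribution').reset_index(drop=True)

# Turn each dict into columns and place them next to the id.
files = pd.concat(
    [exploded[['id']], pd.json_normalize(exploded['distribution'].tolist())],
    axis=1,
)
print(files)
```

The result is a long table keyed by entry id, which could be stored alongside the main MHKDR table rather than merged into it.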

Decision:

Keep the Tethys & Tethys Engineering architecture separate from MHKDR. There would be significant loss of content by merging the two, and the merge can still be done after the fact by a user if they so desire.