# Field Selection

The aim of this notebook is to continue the thread of "mhkdr_apis_comparison" and bring in the question of field selection within the context of Tethys / Tethys Engineering API results. The questions of interest follow. 
1. Is there any reason to prefer one or the other [MHKDR API] data source when the fields are identical?
2. To what extent should the table structure be lossless when compared to the APIs?
    - That is, the Tethys and Tethys Engineering APIs seem to duplicate a lot of fields. Should I keep these duplicate fields?
       1. Are there any cases where these fields do differ?
    - Similarly, should I keep all information from the API, even when this information seems useless? An example is the distinct "@type" variable in MHKDR, or the unique attribution style within the MHKDR "Organization" field, which includes the author again alongside other unique information.
       1. Is there a situation in which these fields will help us achieve the goals of this package?
  
What are the goals of the package?
1. Enable automated download of the content
2. Enable cross-Knowledge Hub analysis of the Marine Energy Industry

### Setup

In [1]:
import requests
import json
import pandas as pd
#import re
import primrea.core
from primrea import *

tethys_api = 'https://tethys.pnnl.gov/api/primre_export'
tethys_e_api = 'https://tethys-engineering.pnnl.gov/api/primre_export'
mhkdr_api_1 = 'https://mhkdr.openei.org/api?action=getSubmissionsForPRIMRE'
mhkdr_api_2 = 'https://mhkdr.openei.org/data.json'

In [2]:
# Note: the Tethys API returns content specifically related to marine energy; there is another API for wind energy.
tethys_response = requests.get(tethys_api)
tethys_e_response = requests.get(tethys_e_api)
mhkdr_1_response = requests.get(mhkdr_api_1)
mhkdr_2_response = requests.get(mhkdr_api_2)

In [3]:
tethys_response_json = tethys_response.json()
tethys_e_response_json = tethys_e_response.json()
mhkdr_1_response_json = mhkdr_1_response.json()
mhkdr_2_response_json = mhkdr_2_response.json()

In [4]:
tethys_df = pd.DataFrame(tethys_response_json)
tethys_e_df = pd.DataFrame(tethys_e_response_json)
mhkdr_1_df = pd.DataFrame(mhkdr_1_response_json)
mhkdr_2_df = pd.DataFrame(mhkdr_2_response_json['dataset'])

### MHKDR Data

|API 1 only|both (API 1) / (API 2)|API 2 only|
|:---:|:---:|:---:|
||(URI, landingPage, sourceURL) / (identifier, landingPage)||
|type|||
||(title) / (title)||
||(description) / (description)||
|author|||
|organization|||
|originationDate|||
||(spatial) / (spatial)?||
|technologyType|||
||(tags) / (keyword)||
|signatureProject|||
||(modifiedDate) / (modified)?||
|||@type|
|||accessLevel|
|||bureauCode|
|||license|
|||issued|
|||dataQuality|
|||projectTitle|
|||projectNumber|
|||publisher|
|||contactPoint|
|||programCode|
|||distribution|
|||DOI|

### Dev

#### 1.1 Should I prefer one or the other MHKDR data source when the fields are identical?

The first step is to understand whether there are cases in which the values of nominally identical fields differ. For this, I will create a function that takes two target columns, 'x' and 'y', and returns a dataframe containing only the rows where the values of 'x' and 'y' are NOT identical. This will also be necessary for testing identical columns in Tethys / Tethys Engineering, and should prove useful for rooting out redundancy across the board.

In [5]:
tethys_df.head(3)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[]
1,https://tethys.pnnl.gov/node/500,"[Document, Document/Report]",https://tethys.pnnl.gov/publications/port-fair...,http://bps.energy/projects,The Port Fairy Pilot Wave Energy Project Envir...,This Environmental Management Plan (EMP) detai...,[BioPower Systems],[BioPower Systems],2016-02-09,"{'coordinates': ['-38.398417000000', '142.1726...",[Wave],"[Environment, Environmental Impact Assessment]",2024-01-22 09:24:45,[],[https://tethys.pnnl.gov/sites/default/files/p...
2,https://tethys.pnnl.gov/node/501,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/baseline-...,https://www.sciencedirect.com/science/article/...,Baseline assessment of underwater noise in the...,The Ria Formosa is a sheltered large coastal l...,"[Soares, C., Pacheco, A., Zabel, F., González-...",[Marine Sensing and Acoustic Technologies (Mar...,2020-01-10,"{'coordinates': ['36.972554000000', '-7.870570...","[Current, Current/Tidal]","[Environment, Noise]",2024-01-22 09:24:45,[],[]


After viewing the Tethys data again, I see the point. "landingPage" differs from "sourceURL" because Tethys is not the origin of the documents it lists. Tethys hosts the *connection*, not the original source of the documents. So, every document in Tethys should have a "sourceURL" that is different from the Tethys page.

Likewise, the "URI" differs from the "landingPage" because the URI includes the node number and is definitively unique. Visiting the URI redirects to the landingPage URL, but the URI is not directly shown to users. This point is confusing in MHKDR because there is no difference between the URI and the landingPage (URL).

Despite the promising findings above, I must still write the function in order to test the redundancy of identical columns across the MHKDR APIs.

In [6]:
def identicals():
    '''
    The purpose of this function is to return a dataframe where, for each observation (row), the
    values of the two columns are not identical.
    '''
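
One way the stub above could be filled in is sketched below. This is a minimal sketch, not the package's implementation: the name `non_identical_rows`, its signature, and the toy `demo` frame are all assumptions made for illustration. It compares cells with plain Python equality so that list- and dict-valued cells (common in these API responses) can be compared, and it treats two missing values as equal.

```python
import pandas as pd

def non_identical_rows(df, x, y):
    """Return the rows of `df` where columns `x` and `y` hold different values."""
    def differs(a, b):
        # Two NaN floats count as equal; everything else uses plain Python
        # equality, which also works for list- and dict-valued cells.
        both_nan = isinstance(a, float) and isinstance(b, float) and a != a and b != b
        return not both_nan and a != b

    mask = [differs(a, b) for a, b in zip(df[x], df[y])]
    return df[mask]

# Toy data (hypothetical values) shaped like the API columns.
demo = pd.DataFrame({
    "URI": ["a", "b", "c"],
    "landingPage": ["a", "x", "c"],
    "tags": [["wave"], ["tidal"], []],
    "keyword": [["wave"], ["tidal", "current"], []],
})
mismatched = non_identical_rows(demo, "URI", "landingPage")
```

An empty result means the two columns are redundant for every row; a non-empty result shows exactly which rows to inspect.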
        

In [7]:
mhkdr_1_df.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,signatureProject,modifiedDate
0,https://mhkdr.openei.org/submissions/548,"[Dataset, Document/Report]",https://mhkdr.openei.org/submissions/548,https://mhkdr.openei.org/submissions/548,"CalWave - xWave Device, Non-Commercially Sensi...",CalWave has developed a submerged pressure dif...,"[Marcus Lehmann, Ryan Davidson]",[CalWave Power Technologies Inc.],2024-02-29 07:00:00,"{'boundingCoordinatesNE': [44.63067800397145, ...",[Wave],"[MHK, Marine, Hydrokinetic, energy, power, wav...",[],2024-04-25 20:40:00


In [8]:
mhkdr_2_df.head(1)

Unnamed: 0,@type,identifier,accessLevel,bureauCode,license,issued,dataQuality,title,description,keyword,...,projectTitle,projectNumber,modified,publisher,contactPoint,programCode,landingPage,distribution,spatial,DOI
0,dcat:Dataset,https://mhkdr.openei.org/submissions/1,public,[019:20],https://creativecommons.org/licenses/by/4.0/,2021-12-15T07:00:00Z,True,MHKDR Data Management and Best Practices for S...,Resources for MHKDR data submitters and curato...,"[MHK, Marine, Hydrokinetic, energy, power, dat...",...,Marine and Hydrokinetic Data Repository (MHKDR),35007,2022-05-26T18:08:40Z,"{'@type': 'org:Organization', 'name': 'RJ Scavo'}","{'@type': 'vcard:Contact', 'fn': 'MHKDR Help',...",[019:009],https://mhkdr.openei.org/submissions/1,"[{'@type': 'dcat:Distribution', 'description':...","{""type"":""Polygon"",""coordinates"":[[[-180,-83],[...",


In [9]:
a = mhkdr_1_df['URI'].equals(mhkdr_1_df['landingPage'])
b = mhkdr_1_df['URI'].equals(mhkdr_1_df['sourceURL'])
c = mhkdr_1_df['landingPage'].equals(mhkdr_1_df['sourceURL'])
print(f'MHKDR API 1 : "URI" = "landingPage"       - {a}\nMHKDR API 1 : "URI" = "sourceURL"         - {b}\nMHKDR API 1 : "landingPage" = "sourceURL" - {c}')

MHKDR API 1 : "URI" = "landingPage"       - True
MHKDR API 1 : "URI" = "sourceURL"         - True
MHKDR API 1 : "landingPage" = "sourceURL" - True


Looking good!

As expected, the three columns of MHKDR API 1 ("URI" "landingPage" and "sourceURL") are identical.

In [10]:
a = mhkdr_2_df['identifier'].equals(mhkdr_2_df['landingPage'])
print(f'MHKDR API 2 : "identifier" = "landingPage"  - {a}')

MHKDR API 2 : "identifier" = "landingPage"  - True


In [11]:
a = mhkdr_1_df['title'].equals(mhkdr_2_df['title'])
b = mhkdr_1_df['description'].equals(mhkdr_2_df['description'])
c = mhkdr_1_df['spatial'].equals(mhkdr_2_df['spatial'])
d = mhkdr_1_df['tags'].equals(mhkdr_2_df['keyword'])
e = mhkdr_1_df['modifiedDate'].equals(mhkdr_2_df['modified'])
print(f'MHKDR API 1-2 : "title" = "title"              - {a}\nMHKDR API 1-2 : "description" = "description"  - {b}\nMHKDR API 1-2 : "spatial" = "spatial"          - {c}\nMHKDR API 1-2 : "tags" = "keywords"            - {d}\nMHKDR API 1-2 : "modifiedDate" = "modified"    - {e}')

MHKDR API 1-2 : "title" = "title"              - False
MHKDR API 1-2 : "description" = "description"  - False
MHKDR API 1-2 : "spatial" = "spatial"          - False
MHKDR API 1-2 : "tags" = "keywords"            - False
MHKDR API 1-2 : "modifiedDate" = "modified"    - False


Given the results above, I will elect to keep the 3 redundant columns of data for MHKDR, so as to match the form of Tethys/Tethys Engineering, and store them all together.

The evidently mismatching columns must be investigated further to determine which to include in the final structure.

In [12]:
mhkdr_1_df['title'][0]

'CalWave - xWave Device, Non-Commercially Sensitive Project Report'

In [13]:
mhkdr_2_df['title'][0]

'MHKDR Data Management and Best Practices for Submitters and Curators'

They were mismatched because they were not sorted by id!

Next, create the id column, append it, and then sort both dfs by id to compare apples to apples. Also, do a join to remove the 65 observations that are included in API 1 but not API 2. For these observations we have no choice but to use the variables we have access to. Having written that, it is a strong argument for using the API 1 variables in all cases. It will still be useful to run these tests, because they will verify that no data is lost by excluding variables that the two APIs repeat.

In [14]:
# Extract the entry id from each record's URL so the two APIs can be joined on it.
ids_1 = [primrea.kh_table_gen.entry_based.find_entry_id(uri) for uri in mhkdr_1_df['URI']]
ids_2 = [primrea.kh_table_gen.entry_based.find_entry_id(identifier) for identifier in mhkdr_2_df['identifier']]

In [15]:
mhkdr_1_df['id'] = ids_1
mhkdr_2_df['id'] = ids_2

In [16]:
mhkdr_12_df = mhkdr_1_df.merge(mhkdr_2_df, on='id')
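
The inner merge above silently drops the entries that exist in only one API. To see which entries those are, a left merge with `indicator=True` can be used, as in the sketch below. `extract_trailing_id` is a hypothetical stand-in for `find_entry_id` (whose implementation is not shown here), and the third URL is a toy row added only so that one entry is missing from the second frame.

```python
import re
import pandas as pd

def extract_trailing_id(url):
    """Hypothetical stand-in for find_entry_id: trailing integer of a URL, or None."""
    match = re.search(r"(\d+)/?$", url)
    return int(match.group(1)) if match else None

# Toy frames; only the URL columns are needed to build the join key.
api_1 = pd.DataFrame({"URI": [
    "https://mhkdr.openei.org/submissions/548",
    "https://mhkdr.openei.org/submissions/1",
    "https://mhkdr.openei.org/submissions/7",   # toy row, absent from api_2
]})
api_2 = pd.DataFrame({"identifier": [
    "https://mhkdr.openei.org/submissions/1",
    "https://mhkdr.openei.org/submissions/548",
]})
api_1["id"] = api_1["URI"].map(extract_trailing_id)
api_2["id"] = api_2["identifier"].map(extract_trailing_id)

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
merged = api_1.merge(api_2, on="id", how="left", indicator=True)
only_in_api_1 = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
```

Filtering on `_merge == "both"` reproduces the inner merge, so the same frame can serve both the comparison and the bookkeeping of dropped entries.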

In [17]:
mhkdr_12_df.keys()

Index(['URI', 'type', 'landingPage_x', 'sourceURL', 'title_x', 'description_x',
       'author', 'organization', 'originationDate', 'spatial_x',
       'technologyType', 'tags', 'signatureProject', 'modifiedDate', 'id',
       '@type', 'identifier', 'accessLevel', 'bureauCode', 'license', 'issued',
       'dataQuality', 'title_y', 'description_y', 'keyword', 'projectLead',
       'projectTitle', 'projectNumber', 'modified', 'publisher',
       'contactPoint', 'programCode', 'landingPage_y', 'distribution',
       'spatial_y', 'DOI'],
      dtype='object')

In [18]:
a = mhkdr_12_df['title_x'].equals(mhkdr_12_df['title_y'])
b = mhkdr_12_df['description_x'].equals(mhkdr_12_df['description_y'])
c = mhkdr_12_df['spatial_x'].equals(mhkdr_12_df['spatial_y'])
d = mhkdr_12_df['tags'].equals(mhkdr_12_df['keyword'])
e = mhkdr_12_df['modifiedDate'].equals(mhkdr_12_df['modified'])
print(f'MHKDR API 1-2 : "title" = "title"              - {a}\nMHKDR API 1-2 : "description" = "description"  - {b}\nMHKDR API 1-2 : "spatial" = "spatial"          - {c}\nMHKDR API 1-2 : "tags" = "keywords"            - {d}\nMHKDR API 1-2 : "modifiedDate" = "modified"    - {e}')

MHKDR API 1-2 : "title" = "title"              - True
MHKDR API 1-2 : "description" = "description"  - True
MHKDR API 1-2 : "spatial" = "spatial"          - False
MHKDR API 1-2 : "tags" = "keywords"            - True
MHKDR API 1-2 : "modifiedDate" = "modified"    - False


It is good to see that three variables are identical across the APIs:
1. title
2. description
3. tags

I prefer those attached to API 1 for the aforementioned reasons, so I will include these three in the final structure.

Next is an analysis of the mismatches: "spatial" and "modifiedDate".

#### Spatial

In [19]:
spatial_df = mhkdr_12_df[['spatial_x', 'spatial_y']]

In [20]:
spatial_df['spatial_x'][0]

{'boundingCoordinatesNE': [44.63067800397145, -121.943046875],
 'boundingCoordinatesSW': [37.098651838142224, -125.91004687499999],
 'extent': 'boundingBox'}

In [21]:
spatial_df['spatial_y'][0]

'{"type":"Polygon","coordinates":[[[-125.91004687499999,37.098651838142224],[-121.943046875,37.098651838142224],[-121.943046875,44.63067800397145],[-125.91004687499999,44.63067800397145],[-125.91004687499999,37.098651838142224]]]}'

As we can see, the same data is encoded with a different typing convention: API 1 gives a dict of bounding-box corners, while API 2 gives a GeoJSON-style polygon string. I prefer the API 1 data, as previously stated, to match the typing of the Tethys/Tethys Engineering data.
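
To check this claim for every row rather than by eye, the two encodings can be converted to a common form. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the `[lat, lon]` corner order and the SW, SE, NE, NW, SW ring order are inferred only from the single row shown above. It handles only the 'boundingBox' extent.

```python
import json

def bbox_to_ring(spatial):
    """Convert an API 1 bounding box ([lat, lon] corners) to a closed [lon, lat] ring."""
    ne_lat, ne_lon = spatial["boundingCoordinatesNE"]
    sw_lat, sw_lon = spatial["boundingCoordinatesSW"]
    # Ring order inferred from the row above: SW, SE, NE, NW, back to SW.
    return [[sw_lon, sw_lat], [ne_lon, sw_lat], [ne_lon, ne_lat],
            [sw_lon, ne_lat], [sw_lon, sw_lat]]

def same_extent(spatial_1, spatial_2):
    """True when the API 2 polygon string matches the API 1 bounding box."""
    if not isinstance(spatial_1, dict) or spatial_1.get("extent") != "boundingBox":
        return False
    if not isinstance(spatial_2, str):
        return False
    polygon = json.loads(spatial_2)
    return polygon.get("type") == "Polygon" and polygon["coordinates"][0] == bbox_to_ring(spatial_1)

# The two values printed above for row 0.
spatial_x = {"boundingCoordinatesNE": [44.63067800397145, -121.943046875],
             "boundingCoordinatesSW": [37.098651838142224, -125.91004687499999],
             "extent": "boundingBox"}
spatial_y = ('{"type":"Polygon","coordinates":[[[-125.91004687499999,37.098651838142224],'
             '[-121.943046875,37.098651838142224],[-121.943046875,44.63067800397145],'
             '[-125.91004687499999,44.63067800397145],[-125.91004687499999,37.098651838142224]]]}')
row_0_matches = same_extent(spatial_x, spatial_y)
```

Rows for which `same_extent` returns False would be the ones worth inspecting before dropping the API 2 column.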

#### modifiedDate

In [22]:
mod_df = mhkdr_12_df[['modifiedDate', 'modified']]

In [23]:
mod_df['modifiedDate'][0]

'2024-04-25 20:40:00'

In [24]:
mod_df['modified'][0]

'2024-04-25T20:40:00Z'

As we can see, this is the same situation as 'spatial': the same information encoded differently. I prefer the API 1 data for consistency with the Tethys/Tethys Engineering data.
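
The equivalence can be confirmed by parsing both strings before comparing them. This is a minimal sketch; the helper names are hypothetical, and it assumes the API 1 timestamp is in UTC, which the trailing 'Z' states for API 2 but which is not confirmed for API 1.

```python
from datetime import datetime

def parse_api_1(timestamp):
    """Parse the API 1 style, e.g. '2024-04-25 20:40:00' (assumed to be UTC)."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")

def parse_api_2(timestamp):
    """Parse the API 2 style, e.g. '2024-04-25T20:40:00Z'."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")

# The two values printed above for row 0.
same_instant = parse_api_1("2024-04-25 20:40:00") == parse_api_2("2024-04-25T20:40:00Z")
```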

#### type

In [25]:
mhkdr_12_df['type'][5]

['Dataset', 'Dataset/Archive', 'Document/Report', 'Dataset/OnlineTool']

In [26]:
mhkdr_12_df['@type'][5]

'dcat:Dataset'

#### Other Variables

In [27]:
a = mhkdr_12_df['@type']
a.drop_duplicates()

0    dcat:Dataset
Name: @type, dtype: object

Remove @type, no information gained.
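
The same check can be run over every column at once instead of one cell per field. The helper below is a sketch under assumptions: `constant_columns` is a hypothetical name, and cells are compared through `repr` so that list- and dict-valued cells can be counted (dicts with the same content but a different key order would be counted as different).

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns whose cells all share a single value."""
    return [col for col in df.columns if df[col].map(repr).nunique() == 1]

# Toy frame with values in the style of the API 2 output above.
demo = pd.DataFrame({
    "@type": ["dcat:Dataset", "dcat:Dataset"],
    "bureauCode": [["019:20"], ["019:20"]],
    "projectNumber": ["EE0009952", "EE0007820"],
})
no_information = constant_columns(demo)
```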

In [28]:
a = mhkdr_12_df['accessLevel']
a.drop_duplicates()

0    public
Name: accessLevel, dtype: object

Remove accessLevel, no information gained.

In [29]:
a = mhkdr_12_df['bureauCode']
a.drop_duplicates()

0    [019:20]
Name: bureauCode, dtype: object

Remove bureauCode, no information gained.

In [30]:
a = mhkdr_12_df['license']
a.drop_duplicates()

0    https://creativecommons.org/licenses/by/4.0/
Name: license, dtype: object

Possibly remove license - they are all the same so no info gained. Retain for now because this is important.

In [31]:
a = mhkdr_12_df['issued']
a.drop_duplicates()

0      2024-02-29T07:00:00Z
1      2023-07-27T06:00:00Z
2      2024-02-27T07:00:00Z
3      2024-02-26T07:00:00Z
4      2021-11-01T06:00:00Z
               ...         
325    2014-06-30T06:00:00Z
326    2012-06-18T06:00:00Z
327    2015-03-20T06:00:00Z
328    2015-06-03T06:00:00Z
334    2021-12-15T07:00:00Z
Name: issued, Length: 224, dtype: object

issued - retain for now. It does not seem useful, but this is a point for discussion. What does this field mean?
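
One hypothesis worth testing is that 'issued' repeats 'originationDate' from API 1: in the first rows shown above, both read 2024-02-29 07:00:00. The sketch below shows how that could be checked across the merged table. The second row of the toy frame is hypothetical and deliberately mismatched; on the real data the frame would be `mhkdr_12_df`.

```python
import pandas as pd

demo = pd.DataFrame({
    # Row 0 uses the values shown above; row 1 is a hypothetical mismatch.
    "originationDate": ["2024-02-29 07:00:00", "2023-07-27 06:00:00"],
    "issued": ["2024-02-29T07:00:00Z", "2023-07-28T06:00:00Z"],
})

# utc=True makes both columns timezone-aware so they can be compared directly
# (this assumes the API 1 timestamps are in UTC).
origination = pd.to_datetime(demo["originationDate"], utc=True)
issued = pd.to_datetime(demo["issued"], utc=True)
differing = demo[origination != issued]
```

If `differing` is empty on the real data, 'issued' carries no information beyond 'originationDate'.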

In [32]:
a = mhkdr_12_df['dataQuality']
a.drop_duplicates()

0    True
Name: dataQuality, dtype: bool

Possibly remove dataQuality - they are all the same so no info gained. Retain for now because this is important.

In [33]:
a = mhkdr_12_df['projectTitle']
a.drop_duplicates()

0                         CalWave xWave Pilot at PacWave
1      Co-locating Wave Energy with an Integrated Mul...
2                           Advanced TidGen Power System
3      Testing Expertise and Access for Marine Energy...
4      Biofouling Analysis for Wave Energy Piston Design
                             ...                        
321                       Marine Renewable Energy Center
322    Performance Testing for Hydrokinetic Canal Eff...
326    Ocean Current, River and Tidal Hydrology for I...
328      Aquantis 2.5 MW Ocean Current Generation Device
334      Marine and Hydrokinetic Data Repository (MHKDR)
Name: projectTitle, Length: 92, dtype: object

Keep. This is interesting because it appears related to the current funding strategy of DOE, and might group outputs in a meaningful way as they relate to funding.

In [34]:
a = mhkdr_12_df['projectNumber']
a.drop_duplicates()

0               EE0009952
1      FY24 AOP 2.2.5.602
2               EE0007820
3               EE0008895
7      FY24 AOP 2.1.2.705
              ...        
322                171701
325         FY14 AOP 1321
326         FY12 AOP 1323
328             EE0003643
334                 35007
Name: projectNumber, Length: 81, dtype: object

Keep, for the same reasons as 'projectTitle'. Very interesting, because the unique counts of 'projectTitle' (92) and 'projectNumber' (81) differ, indicating that these are not equivalent measures.
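
To see where the two fields diverge, the titles can be counted per project number. This is only a sketch with hypothetical values; on the real data the frame would be `mhkdr_12_df`.

```python
import pandas as pd

# Hypothetical values: project number "P-1" is attached to two different titles.
demo = pd.DataFrame({
    "projectNumber": ["P-1", "P-1", "P-2"],
    "projectTitle": ["Title A", "Title B", "Title C"],
})

# Number of distinct titles recorded under each project number.
titles_per_number = demo.groupby("projectNumber")["projectTitle"].nunique()
numbers_with_several_titles = titles_per_number[titles_per_number > 1].index.tolist()
```

Swapping the two column names gives the reverse view: titles that are filed under more than one project number.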

In [35]:
a = mhkdr_12_df['publisher']
a.drop_duplicates()

0      {'@type': 'org:Organization', 'name': 'Ryan Da...
1      {'@type': 'org:Organization', 'name': 'Lysel G...
2      {'@type': 'org:Organization', 'name': 'Katie S...
3      {'@type': 'org:Organization', 'name': 'Joseph ...
4      {'@type': 'org:Organization', 'name': 'Linnea ...
                             ...                        
330    {'@type': 'org:Organization', 'name': 'Case Va...
331    {'@type': 'org:Organization', 'name': 'Stephen...
332    {'@type': 'org:Organization', 'name': 'David C...
333    {'@type': 'org:Organization', 'name': 'Tyler M...
334    {'@type': 'org:Organization', 'name': 'RJ Scavo'}
Name: publisher, Length: 154, dtype: object

publisher - can probably be dropped. This is simply one of the names of the authors. It does not add unique info uncaptured elsewhere, and it confuses the grain.

In [36]:
a = mhkdr_12_df['contactPoint']
a.drop_duplicates()

0      {'@type': 'vcard:Contact', 'fn': 'Marcus Lehma...
1      {'@type': 'vcard:Contact', 'fn': 'James McVey'...
2      {'@type': 'vcard:Contact', 'fn': 'Jarlath McEn...
3      {'@type': 'vcard:Contact', 'fn': 'James Turnbu...
4      {'@type': 'vcard:Contact', 'fn': 'Tyler Robert...
                             ...                        
314    {'@type': 'vcard:Contact', 'fn': 'Eric Nelson'...
321    {'@type': 'vcard:Contact', 'fn': 'Steven Lohre...
322    {'@type': 'vcard:Contact', 'fn': 'Budi Gunawan...
328    {'@type': 'vcard:Contact', 'fn': 'David Arthur...
334    {'@type': 'vcard:Contact', 'fn': 'MHKDR Help',...
Name: contactPoint, Length: 105, dtype: object

In [37]:
mhkdr_12_df['contactPoint'][0]

{'@type': 'vcard:Contact',
 'fn': 'Marcus Lehmann',
 'hasEmail': 'mailto:marcus@calwave.energy'}

In [38]:
a = mhkdr_12_df['programCode']
a.drop_duplicates()

0    [019:009]
Name: programCode, dtype: object

In [39]:
a = mhkdr_12_df['DOI']
a.drop_duplicates()

0                   NaN
1      10.15473/2331255
3      10.15473/2339895
4      10.15473/2315037
5      10.15473/2316040
             ...       
328    10.15473/1417350
330    10.15473/1417297
331    10.15473/1417292
332    10.15473/1415193
333    10.15473/1413995
Name: DOI, Length: 271, dtype: object
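
The NaN in the output above shows that some entries have no DOI. Listing them could be useful for QA/QC. A minimal sketch follows; the ids and the DOI string are hypothetical, and the empty-string check is an assumption in case a missing DOI is sometimes returned as '' rather than NaN.

```python
import pandas as pd

# Hypothetical ids and DOI value, for illustration only.
demo = pd.DataFrame({
    "id": [101, 102, 103],
    "DOI": [float("nan"), "10.0000/example", ""],
})

# Treat both NaN and an empty string as "no DOI".
has_no_doi = demo["DOI"].isna() | (demo["DOI"] == "")
ids_without_doi = demo.loc[has_no_doi, "id"].tolist()
```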

### MHKDR Data - Revised

|API 1 only|both (API 1) / (API 2)|API 2 only|
|:---:|:---:|:---:|
||(URI, landingPage, sourceURL) / ~(identifier, landingPage)~||
|type|||
||(title) / ~(title)~||
||(description) / ~(description)~||
|author|||
|organization|||
|originationDate|||
||(spatial) / ~(spatial)~||
|technologyType|||
||(tags) / ~(keyword)~||
|signatureProject|||
||(modifiedDate) / ~(modified)~||
|||~@type~|
|||~accessLevel~|
|||~bureauCode~|
|||license?|
|||issued?|
|||~dataQuality~|
|||projectTitle|
|||projectNumber|
|||~publisher~|
|||contactPoint?|
|||~programCode~|
|||distribution|
|||DOI?|

@type, accessLevel, bureauCode, and dataQuality are good to axe.

DOI, contactPoint, and license: keep.

A list of the entries that lack a DOI would be good for QA/QC.

keep issued, keep anything that is just a bool that
Check for a 1-1 correspondence between publisher and author; if the field adds anything, keep it.
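
The publisher check could be sketched as follows. The function name and the names in the example are hypothetical; the record shapes follow the 'publisher' dict and 'author' list shown earlier. The comparison is case-insensitive and ignores surrounding whitespace, which is an assumption about how loosely the names should match.

```python
def publisher_adds_information(publisher, authors):
    """True when the publisher name is non-empty and not already among the authors."""
    name = publisher.get("name", "") if isinstance(publisher, dict) else ""
    known = {a.strip().casefold() for a in authors} if isinstance(authors, list) else set()
    return bool(name.strip()) and name.strip().casefold() not in known

# Hypothetical names, for illustration only.
redundant = publisher_adds_information(
    {"@type": "org:Organization", "name": "A. Author"}, ["A. Author", "B. Writer"])
informative = publisher_adds_information(
    {"@type": "org:Organization", "name": "C. Curator"}, ["A. Author", "B. Writer"])
```

Applied row by row, the share of rows where this returns True indicates how much the field contributes beyond 'author'.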

#### 4.0 "Distribution"

The aim of this section is to display the fields given by MHKDR API 2 in "Distribution", so that this data can be mapped to the Tethys/Tethys Engineering data when finalizing the table structure.

In [40]:
mhkdr_2_df['distribution'][0][0]

{'@type': 'dcat:Distribution',
 'description': 'A video recording of the data management and submission best practices training presentation and a recorded demo of data submission to the MHKDR',
 'title': 'MHKDR Data Management and Submission Best Practices Video',
 'accessURL': 'https://www.youtube.com/watch?v=QeqYr6W9HC0&list=PL4lfI7kmtVfqEk0SjiR9zPTpqVb2Mm9vO',
 'format': 'HTML',
 'mediaType': 'text/html'}

In [41]:
mhkdr_2_df['distribution'][0][0].keys()

dict_keys(['@type', 'description', 'title', 'accessURL', 'format', 'mediaType'])
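
Because each entry holds a list of such dicts, the field may be easier to map once it is flattened into one row per distributed file. The sketch below shows one way to do that; `flatten_distribution` is a hypothetical helper, and the toy record is an abridged copy of the entry printed above.

```python
import pandas as pd

def flatten_distribution(df, id_col="identifier", dist_col="distribution"):
    """Return one row per distribution item, keyed by the parent entry's identifier."""
    rows = []
    for identifier, distributions in zip(df[id_col], df[dist_col]):
        if not isinstance(distributions, list):
            continue  # skip entries with a missing or malformed distribution
        for dist in distributions:
            rows.append({id_col: identifier, **dist})
    return pd.DataFrame(rows)

# Abridged copy of the first record shown above.
demo = pd.DataFrame({
    "identifier": ["https://mhkdr.openei.org/submissions/1"],
    "distribution": [[{
        "@type": "dcat:Distribution",
        "title": "MHKDR Data Management and Submission Best Practices Video",
        "format": "HTML",
        "mediaType": "text/html",
    }]],
})
files = flatten_distribution(demo)
```

The resulting long table keeps every key of every item, so the set of columns can be compared directly against the Tethys "attachment" field.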

Decision:

Keep the Tethys & Tethys Engineering architecture separate from MHKDR. Merging the two would lose significant content, and a user can still do the merge after the fact if they so desire.

In [49]:
list(mhkdr_1_df['signatureProject'])

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 [],
 ['TEAMER'],
 ['TEAMER'],
 [],
 ['WEC-Sim'],
 ['WEC-Sim'],
 [],
 [],
 [],
 ['TEAMER'],
 [],
 [],
 ['TEAMER'],
 [],
 [],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 ['TEAMER'],
 ['Wave Energy Prize'],
 ['Wave Energy Prize'],
 ['WEC-Sim'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 [],
 ['WEC-Sim'],
 [],
 ['WEC-Sim'],
 ['TEAMER'],
 ['LUPA'],
 ['TEAMER'],
 ['TEAMER'],
 [],
 ['TEAMER'],
 [],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 ['TEAMER'],
 [],
 ['TEAMER'],
 [],
 [],
 [],
 [],
 [],
 ['MHKiT'],
 ['LUPA'],
 ['TEAMER'],
 [],
 ['TEAMER'],
 ['TEAMER'],
 [],
 ['TEAMER'],
 [],
 [],
 ['TEAMER'],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 ['TEAMER'],
 ['TEAMER'],
 ['TEAMER'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 ['TEAMER'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['TEAMER'],
 [],
 [],
 ['TEAMER', 'WEC-Sim'],
 [],
 [],
 ['TEAMER', 'WEC-Sim'],
 ['Advanced WEC Dynamics and Controls'],
 []