# Core Tables

The purpose of this notebook is to develop the tables for MHKDR, Tethys, and Tethys Engineering that will contain the entry-level data for these knowledge hubs. I may need to divide this development further by knowledge hub, as the fields of Tethys and MHKDR will most likely overlap only partially.

For now, the aim of this notebook is to create the core tables for Tethys and Tethys Engineering (T/TE). These core tables will contain data at the "grain" of one T/TE entry. By the end of this notebook, the core tables for T/TE should not include any field that violates atomicity (each field should hold exactly one datum, never a list or nested structure). To accomplish this, the following columns need to be removed or modified:
 - type
 - author
 - organization
 - spatial
 - technologyType
 - tags
 - attachment
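
As a rough illustration of what "removing" a non-atomic column can look like, the sketch below moves a list-valued `tags` column into its own table keyed by `entry_id`. The toy data, the `entry_tags` name, and the choice of `DataFrame.explode` are assumptions made for illustration, not the package's actual implementation.

```python
import pandas as pd

# Toy stand-in for the raw export: 'tags' holds a list per entry (not atomic).
raw = pd.DataFrame({
    "entry_id": [499, 500, 501],
    "title": ["Entry A", "Entry B", "Entry C"],
    "tags": [["Environment", "Human Dimensions"], ["Environment"], []],
})

# Core table: one row per entry, atomic columns only.
core = raw[["entry_id", "title"]]

# Separate table: one row per (entry_id, tag) pair.
# explode() turns an empty list into a missing value, so those rows are dropped.
entry_tags = (
    raw[["entry_id", "tags"]]
    .explode("tags")
    .rename(columns={"tags": "tag"})
    .dropna(subset=["tag"])
    .reset_index(drop=True)
)

print(entry_tags)
```

The core table keeps one row per entry, while the relationship to tags lives in a second table that can be joined back on `entry_id`.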

In addition, I need to add an "entry_id" field that holds a unique id for each entry, based on the node number in the URI. Thus far I have used this number as the unique identifier for an entry, under the assumption that the website engine generates a new, unique node number every time a new page is created.
 - entry_id
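
For readers without the package at hand, here is a minimal sketch of how a node number could be pulled from a URI of the form `.../node/<number>`. The helper name `node_id_from_uri` is hypothetical; the notebook itself uses `primrea.kh_table_gen.entry_based.find_entry_id`, whose implementation is not shown here and may differ.

```python
import re


def node_id_from_uri(uri: str) -> int:
    """Hypothetical helper: return the trailing node number of a '/node/<n>' URI."""
    match = re.search(r"/node/(\d+)/?$", uri)
    if match is None:
        # Fail loudly rather than inventing an id for an unexpected URI shape.
        raise ValueError(f"No node number found in URI: {uri!r}")
    return int(match.group(1))


print(node_id_from_uri("https://tethys.pnnl.gov/node/499"))
```

Raising on an unexpected URI keeps a malformed row from silently receiving a wrong or missing id.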

Finally, the two date-time fields must be updated. On ingest from the API, both columns default to string type, so we need to convert them to date-time to make them usable.
 - originationDate
 - modifiedDate

### Setup

In [1]:
import pandas as pd
import numpy as np
import datetime

In [2]:
import primrea.core
from primrea import *
primre_data = primrea.core.primrea_data()

### Dev

In [7]:
tethys_df = primre_data.tethys_dataframe_raw
tethys_df.head(3)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],499,2017-09-29,2024-01-22 09:24:45
1,https://tethys.pnnl.gov/node/500,"[Document, Document/Report]",https://tethys.pnnl.gov/publications/port-fair...,http://bps.energy/projects,The Port Fairy Pilot Wave Energy Project Envir...,This Environmental Management Plan (EMP) detai...,[BioPower Systems],[BioPower Systems],2016-02-09,"{'coordinates': ['-38.398417000000', '142.1726...",[Wave],"[Environment, Environmental Impact Assessment]",2024-01-22 09:24:45,[],[https://tethys.pnnl.gov/sites/default/files/p...,500,2016-02-09,2024-01-22 09:24:45
2,https://tethys.pnnl.gov/node/501,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/baseline-...,https://www.sciencedirect.com/science/article/...,Baseline assessment of underwater noise in the...,The Ria Formosa is a sheltered large coastal l...,"[Soares, C., Pacheco, A., Zabel, F., González-...",[Marine Sensing and Acoustic Technologies (Mar...,2020-01-10,"{'coordinates': ['36.972554000000', '-7.870570...","[Current, Current/Tidal]","[Environment, Noise]",2024-01-22 09:24:45,[],[],501,2020-01-10,2024-01-22 09:24:45


In [8]:
orig_date = tethys_df['originationDate'][0]
orig_date_typ = type(orig_date)
mod_date = tethys_df['modifiedDate'][0]
mod_date_typ = type(mod_date)
print(f'originationDate field example : {orig_date}\noriginationDate field type    : {orig_date_typ}\nmodifiedDate field example : {mod_date}\nmodifiedDate field type    : {mod_date_typ}')

originationDate field example : 2017-09-29
originationDate field type    : <class 'str'>
modifiedDate field example : 2024-01-22 09:24:45
modifiedDate field type    : <class 'str'>


Above, we can see that the type of both date-time columns is incorrect (`str`), and we will need to convert them to allow for selecting rows based on their date.
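
To make the motivation concrete, the following sketch converts a string column with `pd.to_datetime` and then selects rows by date. The three-row dataframe is toy data modelled on the sample rows above, not the real export.

```python
import pandas as pd

# Toy data: dates arrive as plain strings, as they do on ingest from the API.
df = pd.DataFrame({
    "entry_id": [499, 500, 501],
    "originationDate": ["2017-09-29", "2016-02-09", "2020-01-10"],
})

# Convert the string column to a proper datetime64 column.
df["originationDate"] = pd.to_datetime(df["originationDate"])

# Date-based selection and the .dt accessor now work as expected.
recent = df[df["originationDate"] >= pd.Timestamp("2017-01-01")]
by_year = df["originationDate"].dt.year

print(recent)
```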

In [9]:
# Copy so that the columns added below do not modify tethys_df in place
tethys_df_test = tethys_df.copy()

In [10]:
tethys_df_test.keys()

Index(['URI', 'type', 'landingPage', 'sourceURL', 'title', 'description',
       'author', 'organization', 'originationDate', 'spatial',
       'technologyType', 'tags', 'modifiedDate', 'signatureProject',
       'attachment', 'entry_id', 'originationDate2', 'modifiedDate2'],
      dtype='object')

Remove

 - type
 - author
 - organization
 - spatial
 - technologyType
 - tags
 - attachment

Remaining
 - URI
 - landingPage
 - sourceURL
 - title
 - description
 - originationDate
 - modifiedDate
 - signatureProject

Add
 - entry_id

In [11]:
tethys_df_len = len(tethys_df_test)
entry_ids = list()
for i in range(0, tethys_df_len):
    entry_id = primrea.kh_table_gen.entry_based.find_entry_id(tethys_df_test['URI'][i])
    entry_ids.append(entry_id)

In [12]:
tethys_df_test['entry_id'] = entry_ids

In [13]:
tethys_df_test['originationDate2'] = pd.to_datetime(tethys_df_test['originationDate'])
tethys_df_test['modifiedDate2'] = pd.to_datetime(tethys_df_test['modifiedDate'])

In [14]:
tethys_df_test = tethys_df_test[['entry_id', 'originationDate2', 'modifiedDate2', 'URI', 'landingPage', 'sourceURL', 'title', 'description', 'signatureProject']]
tethys_df_test.head(1)

Unnamed: 0,entry_id,originationDate2,modifiedDate2,URI,landingPage,sourceURL,title,description,signatureProject
0,499,2017-09-29,2024-01-22 09:24:45,https://tethys.pnnl.gov/node/499,https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...",[]


### Clean

In [15]:
def construct_core_table(tethyss_df):
    '''
    This function creates a normalized table for the entity "entry." The primary key of the resulting table is
    "entry_id," and it may be used to connect entries represented in this table to associated entities, such
    as authors, organizations, or tags.
    '''
    # Work on a copy so the caller's dataframe is not modified in place
    tethyss_df = tethyss_df.copy()

    # Constructing the entry_id data
    tethyss_df_len = len(tethyss_df)
    entry_ids = list()
    for i in range(0, tethyss_df_len):
        entry_id = primrea.kh_table_gen.entry_based.find_entry_id(tethyss_df['URI'][i])
        entry_ids.append(entry_id)

    # Add entry_id column
    tethyss_df['entry_id'] = entry_ids

    # Correct datatype of the datetime columns
    tethyss_df['originationDate2'] = pd.to_datetime(tethyss_df['originationDate'], errors='coerce')
    tethyss_df['modifiedDate2'] = pd.to_datetime(tethyss_df['modifiedDate'])

    # Slice working df to final col list
    tethyss_df_final = tethyss_df[['entry_id', 'originationDate2', 'modifiedDate2', 'URI', 'landingPage', 'sourceURL', 'title', 'description', 'signatureProject']]
    tethyss_df_final = tethyss_df_final.rename(columns={'originationDate2': 'originationDate', 'modifiedDate2': 'modifiedDate'})
    
    return tethyss_df_final


### Test

In [17]:
tethys_df2 = primre_data.tethys_dataframe_raw
tethys_df2.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],499,2017-09-29,2024-01-22 09:24:45


In [18]:
tethys_core = construct_core_table(tethys_df2)
tethys_core.head(1)

Unnamed: 0,entry_id,originationDate,modifiedDate,URI,landingPage,sourceURL,title,description,signatureProject
0,499,2017-09-29,2024-01-22 09:24:45,https://tethys.pnnl.gov/node/499,https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...",[]


In [21]:
tethys_e_df = primre_data.tethys_e_dataframe_raw
tethys_e_df.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2
0,https://tethys-engineering.pnnl.gov/node/4,"[Document, Document/Journal Article]",https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Analytical linear modelization of a buckled un...,This paper presents an analytical linear model...,"[Träsch, M., Déporte, A., Delacroix, S., Germa...",[],2019-01-01,[],"[Current, Current/Tidal]","[Engineering, Performance, Modeling]",2022-07-26 02:02:47,[],[],4,2019-01-01,2022-07-26 02:02:47


In [22]:
tethys_e_core = construct_core_table(tethys_e_df)
tethys_e_core.head(1)

Unnamed: 0,entry_id,originationDate,modifiedDate,URI,landingPage,sourceURL,title,description,signatureProject
0,4,2019-01-01,2022-07-26 02:02:47,https://tethys-engineering.pnnl.gov/node/4,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Analytical linear modelization of a buckled un...,This paper presents an analytical linear model...,[]


In [23]:
tethys_e_core.sort_values('originationDate', ascending=False)

Unnamed: 0,entry_id,originationDate,modifiedDate,URI,landingPage,sourceURL,title,description,signatureProject
8267,22528,2024-11-01,2024-06-29 06:59:27,https://tethys-engineering.pnnl.gov/node/22528,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,A self-powered smart wave energy converter for...,Self-powered smart buoys are widely used in su...,[]
8253,22503,2024-10-01,2024-06-15 07:36:18,https://tethys-engineering.pnnl.gov/node/22503,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Theoretical and experimental transverse vibrat...,"Recently, tidal energy has gained attention as...",[]
8273,22537,2024-10-01,2024-07-06 07:58:05,https://tethys-engineering.pnnl.gov/node/22537,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,System identification and centralised causal i...,"Similar to offshore wind turbines, multiple po...",[]
8260,22519,2024-10-01,2024-06-22 07:50:02,https://tethys-engineering.pnnl.gov/node/22519,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Numerical and experimental analysis of the pow...,A single-degree-of-freedom Wave Energy Convert...,[]
8268,22529,2024-09-30,2024-06-29 07:07:33,https://tethys-engineering.pnnl.gov/node/22529,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Effect of wind conditions on the performance o...,Local atmospheric conditions surrounding an of...,[]
...,...,...,...,...,...,...,...,...,...
1706,1952,1978-02-01,2022-04-04 01:23:24,https://tethys-engineering.pnnl.gov/node/1952,https://tethys-engineering.pnnl.gov/publicatio...,https://link.springer.com/article/10.1007/BF02...,Ocean thermal energy conversion material requi...,This paper summarizes the general concepts and...,[]
8271,22532,1978-01-01,2024-07-01 05:41:22,https://tethys-engineering.pnnl.gov/node/22532,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Technical and economic feasibility of Ocean Th...,Ocean Thermal Energy Conversion (OTEC) plants ...,[]
390,419,1975-08-07,2022-04-04 01:23:24,https://tethys-engineering.pnnl.gov/node/419,https://tethys-engineering.pnnl.gov/publicatio...,https://www.nature.com/articles/256478a0,A resonant point absorber of ocean-wave power,Various large scale systems have been proposed...,[]
651,692,1974-05-30,2022-04-04 01:23:24,https://tethys-engineering.pnnl.gov/node/692,https://tethys-engineering.pnnl.gov/publicatio...,https://royalsocietypublishing.org/doi/10.1098...,Energy in the 1980s - Hydro (including tidal) ...,"This paper reviews, on a world-wide basis, the...",[]


I had to come back and edit the function because an error was thrown when running it as part of the package. Evidently one of the dates was entered incorrectly, so the conversion failed; with `errors='coerce'`, the unparseable value becomes an error value (NaT) instead of a proper date-time value. There appears to be only one such value: it is isolated as the last row of the sorted dataframe, and the second-to-last item does not display an error (NaT) value.
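
A more direct way to count unparseable dates than reading the tail of a sorted table is to count the NaT values after coercion. The sketch below uses a toy series with one deliberately bad value to show the behaviour of `errors='coerce'`; the real offending value in the export is not known here.

```python
import pandas as pd

# Toy series with one malformed date.
dates = pd.Series(["2019-01-01", "1974-05-30", "not a date"])

# With errors='coerce', values that cannot be parsed become NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

# Count the failures and recover the original strings that caused them.
n_bad = int(parsed.isna().sum())
bad_inputs = dates[parsed.isna()]

# sort_values() places missing values last by default, in either sort direction.
sorted_desc = parsed.sort_values(ascending=False)

print(n_bad)
print(bad_inputs)
```

Keeping the original strings alongside the mask makes it easy to report the bad record upstream.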

### Debug

This section, added 7/5/24, is here to help me debug and understand why the Tethys and Tethys Engineering "core" tables have duplicates.

These tables should have one row per entry only.

**Minimal Reproducible Example**

In [24]:
# Tethys
tethys_df = primrea.core.api_to_df('https://tethys.pnnl.gov/api/primre_export')

In [25]:
tethys_df['URI'].is_unique

False

In [26]:
tethys_df['URI'].value_counts()

https://tethys.pnnl.gov/node/1838447    3
https://tethys.pnnl.gov/node/1630709    2
https://tethys.pnnl.gov/node/1618730    2
https://tethys.pnnl.gov/node/1560806    2
https://tethys.pnnl.gov/node/1284879    2
                                       ..
https://tethys.pnnl.gov/node/3659       1
https://tethys.pnnl.gov/node/3661       1
https://tethys.pnnl.gov/node/3662       1
https://tethys.pnnl.gov/node/3665       1
https://tethys.pnnl.gov/node/2079485    1
Name: URI, Length: 4248, dtype: int64

In [27]:
# Tethys Engineering
tethys_e_df = primrea.core.api_to_df('https://tethys-engineering.pnnl.gov/api/primre_export')

In [28]:
tethys_e_df['URI'].is_unique

False

In [29]:
tethys_e_df['URI'].value_counts()

https://tethys-engineering.pnnl.gov/node/1265     8
https://tethys-engineering.pnnl.gov/node/10830    8
https://tethys-engineering.pnnl.gov/node/20746    7
https://tethys-engineering.pnnl.gov/node/10658    6
https://tethys-engineering.pnnl.gov/node/10835    6
                                                 ..
https://tethys-engineering.pnnl.gov/node/3025     1
https://tethys-engineering.pnnl.gov/node/3024     1
https://tethys-engineering.pnnl.gov/node/3023     1
https://tethys-engineering.pnnl.gov/node/3022     1
https://tethys-engineering.pnnl.gov/node/18796    1
Name: URI, Length: 8177, dtype: int64

#### Testing if data between two observations referencing the same T/TE Entry differ

In [30]:
tethys_e_df[tethys_e_df['URI']=='https://tethys-engineering.pnnl.gov/node/10830']

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment
5430,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['52.823234000000', '-9.661800...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5431,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['52.656947000000', '-9.936458...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5432,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['53.219737000000', '-10.91424...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5433,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['53.054970000000', '-10.93621...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5434,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['55.348408000000', '-8.892757...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5435,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['51.104095000000', '-9.694759...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5436,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['51.495633000000', '-7.827083...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]
5437,https://tethys-engineering.pnnl.gov/node/10830,"[Document, Document/Conference Paper]",https://tethys-engineering.pnnl.gov/publicatio...,https://ewtec.org/proceedings/,Sea & Swell Spectra,The Hydraulics &amp; Maritime Research Centre ...,"[Holmes, B., Barrett, S.]",[University College Cork],2007-09-11,"{'coordinates': ['53.278898000000', '-5.344173...",[Wave],"[Engineering, Condition Monitoring, Instrument...",2022-01-06 01:46:15,[],[]


As we can see, the rows previously considered "duplicate" entries are not true duplicates. They refer to the same entry, and therefore share the URI and other fields, but the "spatial" field is unique to each observation.
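
Rather than comparing the rows by eye, one could ask pandas which columns take more than one value within a URI. The sketch below is an assumption-laden illustration on toy data (placeholder `example.org` URIs): because dict-valued cells are unhashable, it compares string representations instead.

```python
import pandas as pd

# Toy rows: one entry repeated three times with different 'spatial' values.
rows = pd.DataFrame({
    "URI": ["https://example.org/node/1"] * 3 + ["https://example.org/node/2"],
    "title": ["Entry A"] * 3 + ["Entry B"],
    "spatial": [
        {"coordinates": ["52.8", "-9.6"]},
        {"coordinates": ["52.6", "-9.9"]},
        {"coordinates": ["53.2", "-10.9"]},
        {"coordinates": ["36.9", "-7.8"]},
    ],
})

# Dicts are unhashable, so compare the string form of every cell instead.
as_text = rows.astype(str)

# Number of distinct values per column within each URI group.
distinct = as_text.groupby("URI").nunique()

# Columns that take more than one value for at least one URI.
varying = distinct.columns[(distinct > 1).any()].tolist()

print(varying)
```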

**Fixing**

In [46]:
# Tethys
tethys_core_df = primre_data.tethys_core
tethys_core_df.head(1)

Unnamed: 0,entry_id,originationDate,modifiedDate,URI,landingPage,sourceURL,title,description,signatureProject
0,499,2017-09-29,2024-01-22 09:24:45,https://tethys.pnnl.gov/node/499,https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...",[]


In [43]:
tcd_nodupe = tethys_core_df.drop_duplicates(subset='entry_id')

In [45]:
print(f'Observations before de-duplication : {len(tethys_core_df)}\nObservations after  de-duplication : {len(tcd_nodupe)}')

Observations before de-duplication : 4255
Observations after  de-duplication : 4248


In [47]:
# Tethys Engineering
tethys_e_core_df = primre_data.tethys_e_core
tethys_e_core_df.head(1)

Unnamed: 0,entry_id,originationDate,modifiedDate,URI,landingPage,sourceURL,title,description,signatureProject
0,4,2019-01-01,2022-07-26 02:02:47,https://tethys-engineering.pnnl.gov/node/4,https://tethys-engineering.pnnl.gov/publicatio...,https://www.sciencedirect.com/science/article/...,Analytical linear modelization of a buckled un...,This paper presents an analytical linear model...,[]


In [48]:
tecd_nodupe = tethys_e_core_df.drop_duplicates(subset='entry_id')

In [50]:
print(f'Observations before de-duplication : {len(tethys_e_core_df)}\nObservations after  de-duplication : {len(tecd_nodupe)}')

Observations before de-duplication : 8281
Observations after  de-duplication : 8177
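
The de-duplication above can be sanity-checked by comparing the resulting length with the number of distinct ids, which is the same comparison as matching 8177 rows against the 8177 unique URIs reported by `value_counts()`. A small sketch on toy data:

```python
import pandas as pd

# Toy core table in which some entries appear more than once.
core = pd.DataFrame({
    "entry_id": [1265, 1265, 10830, 10830, 10830, 4],
    "title": ["A", "A", "B", "B", "B", "C"],
})

# Keep the first row seen for each entry_id (the default, keep='first').
deduped = core.drop_duplicates(subset="entry_id").reset_index(drop=True)

n_removed = len(core) - len(deduped)

# After de-duplication there should be exactly one row per distinct id.
assert len(deduped) == core["entry_id"].nunique()

print(n_removed)
```

This is only safe for the core table because the columns that differ between repeated rows (such as `spatial`) have already been removed from it.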


**Testing**

In [1]:
import primrea.core

In [2]:
primre_data2 = primrea.core.primrea_data()

In [3]:
a = len(primre_data2.tethys_core)
b = len(primre_data2.tethys_e_core)
print(f'Observations Tethys after edit   : {a}\nObservations Tethys E after edit : {b}')

Observations Tethys after edit   : 4248
Observations Tethys E after edit : 8178


I need to restart the kernel to test, which may wipe previous outputs. Once the package is updated, some of the earlier results will be inaccessible.

Also, once the API is fixed, this edit should be unnecessary, and it may be preferable to remove it for the sake of future error identification.
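
One possible middle ground, sketched here as an assumption rather than as what the package does, is to keep the de-duplication but emit a warning whenever it actually drops rows, so that repeated entries in the API output remain visible. The helper name `drop_duplicate_entries` is hypothetical.

```python
import warnings

import pandas as pd


def drop_duplicate_entries(df: pd.DataFrame, key: str = "entry_id") -> pd.DataFrame:
    """Hypothetical helper: de-duplicate on `key`, warning when rows are dropped."""
    n_dupes = int(df.duplicated(subset=key).sum())
    if n_dupes:
        # Surface the problem instead of hiding it behind a silent drop.
        warnings.warn(f"Dropping {n_dupes} rows with a repeated {key!r}")
    return df.drop_duplicates(subset=key).reset_index(drop=True)


toy = pd.DataFrame({"entry_id": [1, 1, 2], "title": ["A", "A", "B"]})
clean = drop_duplicate_entries(toy)
print(len(clean))
```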

Oddly, the numbers for Tethys Engineering do not match: 8178 observations after the edit, versus 8177 after de-duplication above. In the test below I verified that the observations are unique. I then checked whether one entry might simply have been added to the API since the last run and kernel clear; on commenting out the changed code and rerunning the functions, that is what had happened. Hayley emailed us earlier today about content entry, and someone seems to have gone in and added a document while I was doing this development. Carry on.

In [4]:
primre_data2.tethys_e_core['entry_id'].is_unique

True