# Core Tables

The purpose of this notebook is to develop the tables for MHKDR, Tethys, and Tethys Engineering that will contain the entry-level data for these knowledge hubs. I may need to divide this development further by knowledge hub, as the fields of Tethys and MHKDR will most likely overlap only partially.

For now, the aim of this notebook is to create the core tables for Tethys and Tethys Engineering (T/TE). These core tables will contain data at the "grain" of one T/TE entry. By the end of this notebook, the core tables for T/TE should not include any field that violates atomicity (each field should hold a single datum, not a collection of data). To accomplish this, the following columns need to be removed or modified:
 - type
 - author
 - organization
 - spatial
 - technologyType
 - tags
 - attachment
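
As a rough illustration of why these columns break atomicity and where their data could go, here is a minimal sketch that moves a list-valued column into its own table with one value per row. The toy dataframe and the name `entry_tags` are assumptions for the example, not part of the primrea package.

```python
import pandas as pd

# Toy data standing in for the Tethys dataframe (values are illustrative only)
entries = pd.DataFrame({
    "entry_id": [499, 500],
    "tags": [
        ["Environment", "Human Dimensions"],
        ["Environment", "Environmental Impact Assessment"],
    ],
})

# One row per (entry_id, tag) pair, so every cell holds a single value
entry_tags = (
    entries[["entry_id", "tags"]]
    .explode("tags")
    .rename(columns={"tags": "tag"})
    .reset_index(drop=True)
)
print(entry_tags)
```

The core table can then drop `tags` entirely, and `entry_id` links each tag row back to its entry.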

In addition, I need to add an "entry_id" field that holds a unique id for each entry, based on the node number in the URI. Thus far I have used this number as a unique identifier for the entry, under the assumption that the website engine generates a new, unique node number every time a new page is generated.
 - entry_id
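
To make the idea concrete, the sketch below pulls the node number out of a URI with a regular expression. `node_id_from_uri` is a hypothetical helper written for this example; it is not the `find_entry_id` function from primrea used later, whose implementation and return type may differ.

```python
import re


def node_id_from_uri(uri):
    """Hypothetical helper: return the trailing node number of a URI as an int, or None."""
    match = re.search(r"/node/(\d+)/?$", uri)
    return int(match.group(1)) if match else None


print(node_id_from_uri("https://tethys.pnnl.gov/node/499"))
```

Returning `None` for URIs without a node number makes entries that break the assumption easy to spot.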

Finally, the two date-time fields must be updated. On ingest from the API, these two columns default to strings, so we need to convert them to date-time to make them usable.
 - originationDate
 - modifiedDate
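
A minimal sketch of that conversion on toy data, assuming the strings follow the formats seen in the API output (`YYYY-MM-DD` and `YYYY-MM-DD HH:MM:SS`):

```python
import pandas as pd

# Toy data mirroring the two string formats seen in the API output
dates = pd.DataFrame({
    "originationDate": ["2017-09-29", "2016-02-09"],
    "modifiedDate": ["2024-01-22 09:24:45", "2024-01-22 09:24:45"],
})

# Convert each string column to a datetime column
for col in ["originationDate", "modifiedDate"]:
    dates[col] = pd.to_datetime(dates[col])

print(dates.dtypes)
```

If some rows turn out to hold malformed strings, passing `errors="coerce"` to `pd.to_datetime` converts them to `NaT` instead of raising an error.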

### Setup

In [1]:
import pandas as pd
import numpy as np
import datetime

In [2]:
import primrea.core
from primrea import *
primre_data = primrea.core.primrea_data()

### Dev

In [3]:
tethys_df = primre_data.tethys_dataframe
tethys_df.head(3)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[]
1,https://tethys.pnnl.gov/node/500,"[Document, Document/Report]",https://tethys.pnnl.gov/publications/port-fair...,http://bps.energy/projects,The Port Fairy Pilot Wave Energy Project Envir...,This Environmental Management Plan (EMP) detai...,[BioPower Systems],[BioPower Systems],2016-02-09,"{'coordinates': ['-38.398417000000', '142.1726...",[Wave],"[Environment, Environmental Impact Assessment]",2024-01-22 09:24:45,[],[https://tethys.pnnl.gov/sites/default/files/p...
2,https://tethys.pnnl.gov/node/501,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/baseline-...,https://www.sciencedirect.com/science/article/...,Baseline assessment of underwater noise in the...,The Ria Formosa is a sheltered large coastal l...,"[Soares, C., Pacheco, A., Zabel, F., González-...",[Marine Sensing and Acoustic Technologies (Mar...,2020-01-10,"{'coordinates': ['36.972554000000', '-7.870570...","[Current, Current/Tidal]","[Environment, Noise]",2024-01-22 09:24:45,[],[]


In [4]:
orig_date = tethys_df['originationDate'][0]
orig_date_typ = type(orig_date)
mod_date = tethys_df['modifiedDate'][0]
mod_date_typ = type(mod_date)
print(f'originationDate field example : {orig_date}\noriginationDate field type    : {orig_date_typ}\nmodifiedDate field example : {mod_date}\nmodifiedDate field type    : {mod_date_typ}')

originationDate field example : 2017-09-29
originationDate field type    : <class 'str'>
modifiedDate field example : 2024-01-22 09:24:45
modifiedDate field type    : <class 'str'>


Above, we can see that the types of the date-time columns are incorrect (both are strings), and we will need to change them to allow for selecting rows based on their date.
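
For reference, this is the kind of date-based selection the conversion enables. It is a sketch on toy data; the cutoff date is arbitrary.

```python
import pandas as pd

# Toy data with an already-converted datetime column
df = pd.DataFrame({
    "entry_id": [499, 500, 501],
    "originationDate": pd.to_datetime(["2017-09-29", "2016-02-09", "2020-01-10"]),
})

# Boolean mask: entries originating on or after 1 January 2017
cutoff = pd.Timestamp("2017-01-01")
recent = df[df["originationDate"] >= cutoff]
print(recent)
```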

In [5]:
tethys_df_test = tethys_df

In [6]:
tethys_df_test.keys()

Index(['URI', 'type', 'landingPage', 'sourceURL', 'title', 'description',
       'author', 'organization', 'originationDate', 'spatial',
       'technologyType', 'tags', 'modifiedDate', 'signatureProject',
       'attachment'],
      dtype='object')

Remove
 - type
 - author
 - organization
 - spatial
 - technologyType
 - tags
 - attachment

Remaining
 - URI
 - landingPage
 - sourceURL
 - title
 - description
 - originationDate
 - modifiedDate
 - signatureProject

Add
 - entry_id

In [7]:
tethys_df_len = len(tethys_df_test)
entry_ids = list()
for i in range(0, tethys_df_len):
    entry_id = primrea.kh_table_gen.entry_based.find_entry_id(tethys_df_test['URI'][i])
    entry_ids.append(entry_id)

In [8]:
tethys_df_test['entry_id'] = entry_ids

In [9]:
tethys_df_test['originationDate2'] = pd.to_datetime(tethys_df_test['originationDate'])
tethys_df_test['modifiedDate2'] = pd.to_datetime(tethys_df_test['modifiedDate'])

In [10]:
tethys_df_test = tethys_df_test[['entry_id', 'originationDate2', 'modifiedDate2', 'URI', 'landingPage', 'sourceURL', 'title', 'description', 'signatureProject']]
tethys_df_test.head(1)

Unnamed: 0,entry_id,originationDate2,modifiedDate2,URI,landingPage,sourceURL,title,description,signatureProject
0,499,2017-09-29,2024-01-22 09:24:45,https://tethys.pnnl.gov/node/499,https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...",[]


### Clean

In [11]:
def construct_core_table(tethyss_df):
    '''
    This function creates a normalized table for the entity "entry." The primary key of the resulting table is
    "entry_id," and it may be used to connect entries represented in this table to associated entities, such
    as authors, organizations, or tags.
    '''
    # Work on a copy so the caller's dataframe is not modified
    tethyss_df = tethyss_df.copy()

    # Construct the entry_id data from the node number in each URI
    entry_ids = [primrea.kh_table_gen.entry_based.find_entry_id(uri) for uri in tethyss_df['URI']]

    # Add entry_id column
    tethyss_df['entry_id'] = entry_ids

    # Correct datatype of the datetime columns
    tethyss_df['originationDate2'] = pd.to_datetime(tethyss_df['originationDate'])
    tethyss_df['modifiedDate2'] = pd.to_datetime(tethyss_df['modifiedDate'])

    # Slice working df to final col list, then restore the original date column names
    tethyss_df_final = tethyss_df[['entry_id', 'originationDate2', 'modifiedDate2', 'URI', 'landingPage', 'sourceURL', 'title', 'description', 'signatureProject']]
    tethyss_df_final = tethyss_df_final.rename(columns={'originationDate2': 'originationDate', 'modifiedDate2': 'modifiedDate'})

    return tethyss_df_final


### Test

In [12]:
tethys_df2 = primre_data.tethys_dataframe
tethys_df2.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],499,2017-09-29,2024-01-22 09:24:45
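
Note that the output above already contains `entry_id`, `originationDate2`, and `modifiedDate2`. This is consistent with `tethys_df_test = tethys_df` in cell 5 binding a second name to the same dataframe rather than copying it, so the columns added in the Dev cells ended up on the source dataframe as well (assuming `primre_data.tethys_dataframe` returns the same object each time). A minimal sketch of the difference, on toy data:

```python
import pandas as pd

original = pd.DataFrame({"URI": ["https://tethys.pnnl.gov/node/499"]})

alias = original              # same object, no data is copied
alias["entry_id"] = [499]     # also adds the column to `original`

independent = original.copy()   # separate object
independent["extra"] = ["x"]    # `original` is unaffected

print(list(original.columns))
```

Using `.copy()` when creating a working dataframe keeps the source data unchanged between experiments.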


In [13]:
tethys_core = construct_core_table(tethys_df2)
tethys_core.head(1)

Unnamed: 0,entry_id,originationDate,modifiedDate,URI,landingPage,sourceURL,title,description,signatureProject
0,499,2017-09-29,2024-01-22 09:24:45,https://tethys.pnnl.gov/node/499,https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...",[]
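
Since `entry_id` is meant to serve as the primary key, it may be worth checking that it is unique and never missing before relying on it. The sketch below runs such a check on a toy table; the same two assertions could be applied to `tethys_core`.

```python
import pandas as pd

# Toy stand-in for the core table
core = pd.DataFrame({
    "entry_id": [499, 500, 501],
    "title": ["a", "b", "c"],
})

# Rows sharing an entry_id with another row (empty if the key is unique)
duplicated = core[core["entry_id"].duplicated(keep=False)]

assert core["entry_id"].is_unique, "entry_id values are not unique"
assert core["entry_id"].notna().all(), "entry_id contains missing values"
```

A failed assertion here would mean the node-number assumption does not hold for some entries.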
