# Tethys / Tethys Engineering QA/QC 1

The purpose of this notebook is to generate a list of Tethys and Tethys Engineering entries that have attached documents, where the name of the attached file includes a space. 

This is likely the first in a series of notebooks dedicated to bringing value to the PRIMRE team through quality control of various knowledge hubs. I expect that there will be at least a few of these notebooks dedicated to Tethys / Tethys Engineering.

### Setup

In [1]:
import re
import primrea.core
import numpy as np
import pandas as pd

### Dev

In [2]:
primre_data = primrea.core.primrea_data()

It appears that the 'attachment' field is quite convenient. All of the documents appear to be saved in the same place on the server, so the URL of each download is identical except for the file name. After a few random picks, I arrived at a URL that has spaces in it (index 3000, below). As we can see, this URL has not yet been percent-encoded by a browser, so we do not need to decode it. I think there will be some minor regex to select the URLs that contain spaces, but this will be significantly easier than selecting on '%20', or whatever the browser's encoding of a space is.

In [3]:
primre_data.tethys_dataframe_raw['attachment'][3000][0][57:]

'ICES Report 344.pdf'
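
Had the field been percent-encoded, the same check would still be straightforward. The sketch below is an illustration only, using the standard library's `urllib.parse`; the helper name `has_space` is hypothetical and is not part of primrea.

```python
from urllib.parse import quote, unquote

# The raw attachment URL as it appears in the 'attachment' field
raw_url = 'https://tethys.pnnl.gov/sites/default/files/publications/ICES Report 344.pdf'

# What a percent-encoded version of the same URL would look like
encoded_url = quote(raw_url, safe=':/')

def has_space(url):
    """Return True if the URL contains a space once any percent-encoding is decoded."""
    return ' ' in unquote(url)

print(encoded_url)
print(has_space(raw_url), has_space(encoded_url))
```

Decoding first means the same test works whether or not the field is ever served in encoded form.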

The regex below determines whether there is a space in the string.

In [4]:
p = re.compile(' +')
p

re.compile(r' +', re.UNICODE)

In [5]:
a = 'BPS_EMP_092016.pdf'
b = 'ICES Report 344.pdf'

print(re.search(p, a))
print(re.search(p, b))

None
<re.Match object; span=(4, 5), match=' '>


In [6]:
print(p.search(a))
print(p.search(b))

None
<re.Match object; span=(4, 5), match=' '>


In [7]:
tethys_df = primre_data.tethys_dataframe_raw
num_t_entries = len(tethys_df)

match_lst = list()
for i in range(0, num_t_entries):
    i_url = tethys_df['attachment'][i]
    if len(i_url) < 1:
        match_lst.append(0)
    elif len(i_url) > 0:
        # i_url is a non-empty list at this point, so test its first URL
        match = re.search(p, i_url[0])
        if match is None:
            match_lst.append(1)
        else:
            #elif match != None:
            match_lst.append(2)
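
The 0/1/2 coding used in the loop above can also be written as a small function, which makes it easier to test on a handful of hand-picked values. This is a minimal sketch; `classify_attachment` and `SPACE_PATTERN` are hypothetical names, and the sample list is made up from the two file names already seen in this notebook.

```python
import re

SPACE_PATTERN = re.compile(' +')

def classify_attachment(urls):
    """Return 0 (no attachment), 1 (no space found) or 2 (space found).

    Like the loop above, only the first URL in the list is checked.
    """
    if not urls:
        return 0
    return 1 if SPACE_PATTERN.search(urls[0]) is None else 2

sample = [
    [],
    ['https://tethys.pnnl.gov/sites/default/files/publications/BPS_EMP_092016.pdf'],
    ['https://tethys.pnnl.gov/sites/default/files/publications/ICES Report 344.pdf'],
]
codes = [classify_attachment(urls) for urls in sample]
print(codes)
```

Assuming every cell of the 'attachment' column holds a list, the same function could be applied with `tethys_df['attachment'].apply(classify_attachment)` instead of an index-based loop.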
        

In [8]:
# assign this information as a column in the original dataframe
tethys_df['match'] = match_lst
tethys_df.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2,match
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],499,2017-09-29,2024-01-22 09:24:45,0


In [9]:
tethys_no_attch = tethys_df[tethys_df['match']==0]
tethys_gd_attch = tethys_df[tethys_df['match']==1]
tethys_bd_attch = tethys_df[tethys_df['match']==2]
print(f'Number of obs no attch : {len(tethys_no_attch)}\nNumber of obs gd attch : {len(tethys_gd_attch)}\nNumber of obs bd attch : {len(tethys_bd_attch)}')
print(f'Number of observations in Tethys :  {num_t_entries}\nNumber of observations of matches : {len(tethys_no_attch) + len(tethys_gd_attch) + len(tethys_bd_attch)}')


Number of obs no attch : 2173
Number of obs gd attch : 2069
Number of obs bd attch : 13
Number of observations in Tethys :  4255
Number of observations of matches : 4255


In [10]:
tethys_bd_attch['URI']

2370       https://tethys.pnnl.gov/node/5744
2581       https://tethys.pnnl.gov/node/6531
2582       https://tethys.pnnl.gov/node/6533
2686       https://tethys.pnnl.gov/node/7195
2713       https://tethys.pnnl.gov/node/8211
2854      https://tethys.pnnl.gov/node/77442
2855      https://tethys.pnnl.gov/node/77636
2856      https://tethys.pnnl.gov/node/77637
2944     https://tethys.pnnl.gov/node/112518
3000     https://tethys.pnnl.gov/node/121849
3023     https://tethys.pnnl.gov/node/154322
3770    https://tethys.pnnl.gov/node/1760762
3893    https://tethys.pnnl.gov/node/2072546
Name: URI, dtype: object

### Clean

The aim of this section is to collect the report generated in "Dev" into one function with discrete dependencies. This is preparation for creating a deliverable notebook in the PRIMRE shared Google Drive, which J described as the final destination of this work.

Note: because this will "live" in the shared PRIMRE drive, and will be used before final deployment of the primrea package, it cannot rely on any architecture built by primrea. This means that all api calls, and dependencies to make these calls, must be included in the following code.

In [11]:
import requests
import pandas as pd
import re

In [12]:
def tethyss_qaqc_1_attachment_names(api_str):
    '''
    This function takes in an API URL from Tethys or Tethys Engineering and prints a report
    on the number of entries in the Knowledge Hub that have an associated attachment, how many
    of these entries have problematic attachment file names, and a list of these
    problematic entries.
    '''
    # Return DF from API
    api_response = requests.get(api_str)
    api_response_json = api_response.json()
    api_df = pd.DataFrame(api_response_json)
    
    p = re.compile(' +')
    
    num_ts_entries = len(api_df)
    
    # 0 = no attachment, 1 = attachment without a space, 2 = attachment with a space
    match_lst = list()
    for i in range(0, num_ts_entries):
        i_url = api_df['attachment'][i]
        if len(i_url) < 1:
            match_lst.append(0)
        else:
            # i_url is a non-empty list at this point, so test its first URL
            match = re.search(p, i_url[0])
            if match is None:
                match_lst.append(1)
            else:
                match_lst.append(2)
    
    # assign this information as a column in the original dataframe
    api_df['match'] = match_lst
    
    tethyss_no_attch = api_df[api_df['match']==0]
    tethyss_gd_attch = api_df[api_df['match']==1]
    tethyss_bd_attch = api_df[api_df['match']==2]
    print(f'Number of obs no attch : {len(tethyss_no_attch)}\nNumber of obs gd attch : {len(tethyss_gd_attch)}\nNumber of obs bd attch : {len(tethyss_bd_attch)}')
    print(f'Number of observations in Tethys :  {num_ts_entries}\nNumber of observations of matches : {len(tethyss_no_attch) + len(tethyss_gd_attch) + len(tethyss_bd_attch)}')
    
    print(tethyss_bd_attch['URI'])

In [13]:
tethys_m_api = 'https://tethys.pnnl.gov/api/primre_export'
tethys_w_api = 'https://tethys.pnnl.gov/api/tethys-wind-document-export'
tethys_e_api = 'https://tethys-engineering.pnnl.gov/api/primre_export'

In [14]:
tethyss_qaqc_1_attachment_names(tethys_m_api)

Number of obs no attch : 2173
Number of obs gd attch : 2069
Number of obs bd attch : 13
Number of observations in Tethys :  4255
Number of observations of matches : 4255
2370       https://tethys.pnnl.gov/node/5744
2581       https://tethys.pnnl.gov/node/6531
2582       https://tethys.pnnl.gov/node/6533
2686       https://tethys.pnnl.gov/node/7195
2713       https://tethys.pnnl.gov/node/8211
2854      https://tethys.pnnl.gov/node/77442
2855      https://tethys.pnnl.gov/node/77636
2856      https://tethys.pnnl.gov/node/77637
2944     https://tethys.pnnl.gov/node/112518
3000     https://tethys.pnnl.gov/node/121849
3023     https://tethys.pnnl.gov/node/154322
3770    https://tethys.pnnl.gov/node/1760762
3893    https://tethys.pnnl.gov/node/2072546
Name: URI, dtype: object


In [15]:
#tethyss_qaqc_1_attachment_names(tethys_w_api)

In [16]:
tethyss_qaqc_1_attachment_names(tethys_e_api)

Number of obs no attch : 5963
Number of obs gd attch : 2285
Number of obs bd attch : 33
Number of observations in Tethys :  8281
Number of observations of matches : 8281
124       https://tethys-engineering.pnnl.gov/node/140
128       https://tethys-engineering.pnnl.gov/node/144
214       https://tethys-engineering.pnnl.gov/node/236
989      https://tethys-engineering.pnnl.gov/node/1039
1430     https://tethys-engineering.pnnl.gov/node/1519
1438     https://tethys-engineering.pnnl.gov/node/1528
1814     https://tethys-engineering.pnnl.gov/node/2065
1886     https://tethys-engineering.pnnl.gov/node/2139
1893     https://tethys-engineering.pnnl.gov/node/2147
1897     https://tethys-engineering.pnnl.gov/node/2151
1900     https://tethys-engineering.pnnl.gov/node/2154
2067     https://tethys-engineering.pnnl.gov/node/2328
2883     https://tethys-engineering.pnnl.gov/node/3194
3266     https://tethys-engineering.pnnl.gov/node/3596
3343     https://tethys-engineering.pnnl.gov/node/3676
3348 

In [17]:
api_response = requests.get(tethys_w_api)
api_response_json = api_response.json()
api_df = pd.DataFrame(api_response_json)

In [18]:
api_df.head(3)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,affiliation,sponsoringOrganization,originationDate,spatial,technologyType,stressor,receptor,modifiedDate
0,https://tethys.pnnl.gov/node/2079493,Document/Report,https://tethys.pnnl.gov/publications/setting-e...,https://publications.jrc.ec.europa.eu/reposito...,Setting EU Threshold Values for impulsive unde...,The purpose of the present document is to give...,"[Sigray, P., Andersson, M., André, M., Azzelli...",[European Commission],[],2023-05-22,,"[Wind Energy, Fixed Offshore Wind]",[Noise],"[Ecosystem Processes, Marine Mammals]",2024-07-09 01:41:56
1,https://tethys.pnnl.gov/node/2079486,Document/Report,https://tethys.pnnl.gov/publications/wind-powe...,https://www.naturvardsverket.se/publikationer/...,Wind power infrastructure and perceived value ...,This project focuses on developing a measureme...,"[Prince, S., Chekalina, T., Peters, A.]","[Linnaeus University, Mid Sweden University]",[Swedish Environmental Protection Agency (EPA)...,2024-07-01,,[Wind Energy],[],"[Human Dimensions, Recreation &amp; Tourism, V...",2024-07-03 17:09:36
2,https://tethys.pnnl.gov/node/2079483,Document/Journal Article,https://tethys.pnnl.gov/publications/regulatio...,https://www.researchgate.net/publication/38118...,Regulations for Bat Protection in Mexico&#039;...,Wind energy development has expanded the faste...,"[Uribe, M., Aguilera, J., Villada-Canela, M., ...",[Universidad Autónoma de Baja California (Auto...,[],2024-06-05,POINT (-102.61554 24.38818),"[Wind Energy, Land-Based Wind]",[],"[Bats, Human Dimensions, Legal &amp; Policy]",2024-07-02 23:50:27


### All characters

For added certainty that the results above account for all entries with possibly broken links, I will also build the set of unique characters used across the attachment file names in Tethys, so that the QAQC specialists can visually confirm that no other characters will cause problems. If this analysis turns up other potentially problematic characters, they can be added to the regex in the prior analysis to make the approach comprehensive.

In [19]:
all_chars = ''
url_locs = list()
for i in range(0, num_t_entries):
    i_url = tethys_df['attachment'][i]
    if len(i_url) > 0:
        # Only the first attachment of each entry is sampled here
        all_chars = all_chars + i_url[0][57:]
        url_locs.append(i_url[0][0:57])

In [20]:
a = ''

In [21]:
a = a + 'b'

In [22]:
a = a + 'bsdadeeedd'

In [23]:
a


'bbsdadeeedd'

In [24]:
list(set(a))

['e', 's', 'd', 'b', 'a']

In [25]:
set(url_locs)

{'https://tethys.pnnl.gov/sites/default/files/DataTransfera',
 'https://tethys.pnnl.gov/sites/default/files/Short-Science',
 'https://tethys.pnnl.gov/sites/default/files/publications/',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ada',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ben',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Cha',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Col',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ele',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ent',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Env',
 'https://tethys.pnnl.gov/sites/default/files/summaries/MRE',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Mar',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ris',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Soc',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Tet',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Und',
 'https:

In [26]:
all_chars[:500]

'BPS_EMP_092016.pdfMcInnes_et_al_2018.pdfEMEC-AK_EnvironmentalMonitoringReport.pdfMcIntyre-2016-EMF-Sturgeon.pdfEMEC_2019.PDFmarine_institute-spiddal-environmental_report.pdfOneaetal2019.pdfPoweringTheBlueEconomy_73355-v2.pdfWhiting-et-al-2019.pdfenergies-2019.pdfJohnson_Pride_2010.pdfSnyderetal2019.pdfSmith_et_al-2019-Ecology_and_Evolution.pdfLusseau_et_al_2012.pdfAshley_et_al_2014.pdfemblingetal.pdfLepperetal.pdfNERC_9.pdfNERC_2016.pdfNERC_2019.pdfFreeman_et_al_2013.pdfBruch_et_al_1994.pdfCarr2'

In [27]:
set(list(all_chars))

{' ',
 "'",
 '(',
 ')',
 '-',
 '.',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 'Ç',
 'Ü',
 'é',
 'í',
 'ć',
 '“',
 '”'}
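
To make the visual check a little easier, the characters can be counted and filtered against a set of characters considered safe. The sketch below is only illustrative: the contents of `ALLOWED` are an assumption, `unusual_characters` is a hypothetical helper, and the third sample name is invented.

```python
from collections import Counter

# Characters treated as safe in a file name; this set is an assumption for illustration
ALLOWED = set('abcdefghijklmnopqrstuvwxyz'
              'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              '0123456789-_.')

def unusual_characters(file_names):
    """Count every character that falls outside ALLOWED across all file names."""
    counts = Counter(ch for name in file_names for ch in name)
    return {ch: n for ch, n in counts.items() if ch not in ALLOWED}

# The last name is invented to exercise accented letters and parentheses
sample_names = ['BPS_EMP_092016.pdf', 'ICES Report 344.pdf', 'R\u00e9sum\u00e9 (final).pdf']
print(unusual_characters(sample_names))
```

Reporting counts alongside the characters shows at a glance whether a character is a one-off or a recurring pattern.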

### New Feature: Accounting for Special Characters

For this section of the notebook, I will change the logic of the previously developed reporting function to include special characters besides spaces. To do this, I will compose a regex that matches on the presence of any character other than alphanumeric characters, hyphen (-) and underscore (_).

Unlike the prior method, where a match was determined by connection with a particular character (space), this new method will match based on the existence of a character outside a particular set (a-z, A-Z, 0-9, -, _).

In [28]:
primre_data.tethys_dataframe_raw['attachment'][3000][0][57:]

'ICES Report 344.pdf'

The regex below determines whether the string contains any character outside the allowed set.

In [29]:
p = re.compile('[^a-zA-Z0-9_-]')    # hyphen added so that "-" is treated as an allowed character
p

re.compile(r'[^a-zA-Z0-9_-]', re.UNICODE)

Because this pattern flags any character outside the allowed set, the "." that separates the file name from its extension will match every time (see the first result in the next cell). The extension, including its period, therefore has to be removed from each file name before testing; otherwise every file is a false positive. Removing it is also what lets us flag a "." inside the file name itself, which is very bad practice, without the extension period getting in the way.

In [30]:
a = 'BPS_EMP_092016.pdf'
b = 'ICES Report 344'

print(re.search(p, a))
print(re.search(p, b))

<re.Match object; span=(14, 15), match='.'>
<re.Match object; span=(4, 5), match=' '>


#### Additional Slicing!

In [31]:
a = primre_data.tethys_dataframe_raw['attachment'][3000][0][57:]
a

'ICES Report 344.pdf'

In [32]:
a = a[:-4]
a

'ICES Report 344'

In [33]:
b = primre_data.tethys_dataframe_raw['attachment'][1][0][57:]
b = b[:-4]
b

'BPS_EMP_092016'

In [34]:
print(re.search(p, a))
print(re.search(p, b))

<re.Match object; span=(4, 5), match=' '>
None
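
The `[:-4]` slice above assumes a three-character extension. A hedged alternative is to drop whatever follows the last period before testing. In this sketch `flagged_characters` and `DISALLOWED` are hypothetical names, the third file name is illustrative, and the pattern keeps the hyphen in the allowed set, as described in the text.

```python
import re

# Anything other than letters, digits, underscore or hyphen is flagged
DISALLOWED = re.compile(r'[^a-zA-Z0-9_-]')

def flagged_characters(file_name):
    """Drop the text after the last '.', then list the disallowed characters left."""
    stem, dot, _extension = file_name.rpartition('.')
    if not dot:          # no '.' at all, so the whole name is the stem
        stem = file_name
    return DISALLOWED.findall(stem)

for name in ['BPS_EMP_092016.pdf', 'ICES Report 344.pdf', 'MARINET-D4.17.pdf']:
    print(name, flagged_characters(name))
```

Only the final period is treated as the extension separator, so a period earlier in the name is still reported.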


Beyond the slicing, I need a way to handle file locations other than "publications" that does not rely entirely on slicing. Otherwise I will not be able to handle the 8 or so special cases.

#### Handling File Locations

To handle the different file locations of the attachments, I will need to break the URL into its parts and inspect them before isolating the file name, as explored below.

I will also need to change how the regex is used. Instead of a fixed slice (everything after index 57), the logic needs to become "everything after the last /". Likewise, trimming the end of the string should move from slicing to regex, because some file extensions are 4 characters long rather than 3. The regex also has to account for the possibility that an errant "." is included in the attachment name itself.
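
One way the "everything after the last /" and variable-length-extension ideas could be expressed in a single pattern is sketched below. This is an assumption-laden illustration, not the approach the notebook settles on: `URL_PARTS` and `split_attachment_url` are hypothetical names, and treating 1 to 5 trailing alphanumeric characters as an extension is an arbitrary choice.

```python
import re

# directory: everything up to and including the last '/'
# stem:      the file name without its extension
# ext:       the text after the final '.', if it looks like an extension
URL_PARTS = re.compile(
    r'^(?P<directory>.*/)(?P<stem>.*?)(?:\.(?P<ext>[A-Za-z0-9]{1,5}))?$'
)

def split_attachment_url(url):
    """Return (directory, stem, extension) for an attachment URL."""
    match = URL_PARTS.match(url)
    if match is None:    # no '/' in the string at all
        return '', url, ''
    return match.group('directory'), match.group('stem'), match.group('ext') or ''

url = 'https://tethys.pnnl.gov/sites/default/files/publications/ICES Report 344.pdf'
print(split_attachment_url(url))
```

A limitation worth noting: a name such as `report.v2` with no true extension would have `v2` read as its extension, so the observed extensions still need a human check.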

In [35]:
primre_data.tethys_dataframe_raw['attachment']

0                                                      []
1       [https://tethys.pnnl.gov/sites/default/files/p...
2                                                      []
3                                                      []
4       [https://tethys.pnnl.gov/sites/default/files/p...
                              ...                        
4250    [https://tethys.pnnl.gov/sites/default/files/p...
4251    [https://tethys.pnnl.gov/sites/default/files/p...
4252    [https://tethys.pnnl.gov/sites/default/files/p...
4253    [https://tethys.pnnl.gov/sites/default/files/p...
4254    [https://tethys.pnnl.gov/sites/default/files/p...
Name: attachment, Length: 4255, dtype: object

In [36]:
test_string = primre_data.tethys_dataframe_raw['attachment'][3000][0]
test_string

'https://tethys.pnnl.gov/sites/default/files/publications/ICES Report 344.pdf'

In [37]:
# From the docs
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [38]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [39]:
test_string_lst = re.split(r'/|\.', test_string)
test_string_lst

['https:',
 '',
 'tethys',
 'pnnl',
 'gov',
 'sites',
 'default',
 'files',
 'publications',
 'ICES Report 344',
 'pdf']

In [40]:
tethys_df = primre_data.tethys_dataframe_raw
tethys_df_len = len(tethys_df)

locations = list()
for i in range(0, tethys_df_len):
    if len(primre_data.tethys_dataframe_raw['attachment'][i]) > 0:  # Checking that there is an attachment for the entry
        
        test_string = primre_data.tethys_dataframe_raw['attachment'][i][0]    
        test_string_lst = re.split(r'/|\.', test_string)
    
        location = list()
        test_string_lst_len = len(test_string_lst)
        ticker = 0                                       # Tracks the index of the loop in relation to the dir
        for j in range(0, test_string_lst_len-1):        # By looping one less than len, we exclude file extension
            if test_string_lst[j] == 'files':
                ticker += 1
                
            if ticker > 0:
                location.append(test_string_lst[j])

            if (ticker == 1) & (test_string_lst_len < 11): # If the location is 'files' only, this will preserve df structure. All sub-lists will have 3 items.
                location.append('')
    
        locations.append(location)
        


    

In [41]:
#a = pd.DataFrame(locations, columns=['1', '2', '3', '4', '5', '6'])
a = pd.DataFrame(locations)
a

Unnamed: 0,0,1,2,3,4,5
0,files,publications,BPS_EMP_092016,,,
1,files,publications,McInnes_et_al_2018,,,
2,files,publications,EMEC-AK_EnvironmentalMonitoringReport,,,
3,files,publications,McIntyre-2016-EMF-Sturgeon,,,
4,files,publications,EMEC_2019,,,
...,...,...,...,...,...,...
2077,files,publications,Marine-Renewable-Report-final,,,
2078,files,publications,Moradi-Ilinca-2024,,,
2079,files,publications,Country_Specific_Guidance_Document-Mexico_final_6,24,24,
2080,files,publications,Bianchietals2024,,,


In [42]:
b = a[a[3]=='19']

In [43]:
b[5].notnull()

2075    False
Name: 5, dtype: bool

In [44]:
b[4].unique()

array(['24'], dtype=object)

In [45]:
b[b[3].notnull()]

Unnamed: 0,0,1,2,3,4,5
2075,files,publications,Country_Specific_Guidance_Document-Portugal_fi...,19,24,


In [46]:
b

Unnamed: 0,0,1,2,3,4,5
2075,files,publications,Country_Specific_Guidance_Document-Portugal_fi...,19,24,


In [47]:
a.loc[a[3].notnull()]

Unnamed: 0,0,1,2,3,4,5
164,files,publications,Elliot-et-al-2019,pdf,,
264,files,publications,D3,1_Inventory_of_Environmental_Impact_Monitoring...,,
266,files,publications,D2,1_Catalogue_of_Wave_Energy_Test_Centres,,
280,files,publications,MARINET-D4,17,,
281,files,publications,MARINET-D4,13,,
...,...,...,...,...,...,...
2013,files,publications,IMEJPaper_12,14,2023_final,
2043,files,publications,Deliverable-7,4-Education-and-Public-Engagement-Framework-fo...,,
2073,files,publications,Country_Specific_Guidance_Document-Australia_f...,6,24,
2075,files,publications,Country_Specific_Guidance_Document-Portugal_fi...,19,24,


In [48]:
nulls_lst = list(a[5].notnull())

for i in nulls_lst:
    if i == True:
        print(i)

True


### Changing Gears

As we can see from the above code, my second plan - to use "/" and "." as delimiters to isolate the attachment file name - is a poor direction because it makes the faulty assumption that these characters are not included in the attached file's name. From the code block directly preceding this section, we can see that one entry has an attachment whose file name has 3 or more periods!

After encountering these issues with using delimiters, especially with the period, I find it necessary to change the approach of this analysis. During my meeting with Jonathan on 7/5/24, we agreed that I will need to treat the file-location paths as a fixed (static) list, and possibly run a second analysis/QAQC test to validate that all actual file locations of the attachments are accounted for by this test. If unexpected or new file locations appear, that second test could help QAQC diagnose the problem and improve the process.

#### Roadmap

What must be done? In what order?

1. Enumerate supported file locations, file extensions, characters
2. Create working regex logic, and associated function(s), to correctly isolate faulty entries
3. Apply the test to Tethys, Tethys Wind, and Tethys Engineering

**1. Enumerate supported locations, extensions, characters**

In [49]:
set(url_locs)

{'https://tethys.pnnl.gov/sites/default/files/DataTransfera',
 'https://tethys.pnnl.gov/sites/default/files/Short-Science',
 'https://tethys.pnnl.gov/sites/default/files/publications/',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ada',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ben',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Cha',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Col',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ele',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ent',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Env',
 'https://tethys.pnnl.gov/sites/default/files/summaries/MRE',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Mar',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ris',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Soc',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Tet',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Und',
 'https:

As we can see, most of the entries are located in "publications," followed by "summaries," and only two are located directly in "files."

In [50]:
# locations:
locs = ['https://tethys.pnnl.gov/sites/default/files/publications/',
        'https://tethys.pnnl.gov/sites/default/files/summaries/',
        'https://tethys.pnnl.gov/sites/default/files/']
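
Because the bare "files/" location is a prefix of the other two, any lookup against this list has to try the longer locations first. The following is a minimal sketch of that idea; `strip_location` and `KNOWN_LOCATIONS` are hypothetical names, and `Example.pdf` is an invented file name.

```python
KNOWN_LOCATIONS = [
    'https://tethys.pnnl.gov/sites/default/files/publications/',
    'https://tethys.pnnl.gov/sites/default/files/summaries/',
    'https://tethys.pnnl.gov/sites/default/files/',
]

def strip_location(url, locations=KNOWN_LOCATIONS):
    """Return (location, file_name), or (None, url) if no known location matches."""
    # Try the longest prefixes first so '.../files/' does not shadow its sub-folders
    for location in sorted(locations, key=len, reverse=True):
        if url.startswith(location):
            return location, url[len(location):]
    return None, url

print(strip_location('https://tethys.pnnl.gov/sites/default/files/publications/ICES Report 344.pdf'))
print(strip_location('https://tethys.pnnl.gov/sites/default/files/Example.pdf'))
```

A file stored in a sub-folder that is not listed would fall through to the bare "files/" location and keep a "/" in its name, so it would still be caught by a later character test rather than silently passed.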

In [51]:
# Extensions:
attachments_lst = tethys_df['attachment']
attachment_extensions = list()

counter = 0
for i in attachments_lst:
    if len(i) > 0:
        counter+=1
        for j in i:
            attachment_field_lst = re.split(r'\.', j)
            attachment_extensions.append(attachment_field_lst[-1])

In [52]:
print(f'Num entries with 1+ attachment     : {counter}\nNum observed attachment extensions : {len(attachment_extensions)}')

Num entries with 1+ attachment     : 2082
Num observed attachment extensions : 2200


In [53]:
all_file_extensions = set(attachment_extensions)
all_file_extensions

{'PDF', 'pdf'}

In [54]:
a = re.split(r'\.', 'file.pdf')[-1]
b = re.split(r'\.', 'filepdf')[-1]
print(f'a result : {a}\nb result : {b}')

a result : pdf
b result : filepdf


Per the code above, we can have confidence that the variable "all_file_extensions" captures all file extensions present in the field "attachment" returned by the Tethys API, because, in the event that there is a file that does not have a proper file extension, its full file name will be returned in the "attachment_extensions" list, and displayed by the "set" transformation. Verification of this behavior is shown in the cell directly above. 

We can also have confidence in this functionality because we can see that it takes into account all attachments, even when an entry has 2+ attachments associated with it. Verification of this behavior is shown 3 cells prior, where we can see that the number of observed attachment extensions exceeds the number of entries with attachments. This verifies that none of the files attached to a Tethys entry is of a type other than "pdf" or "PDF".
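
The same check can be sketched with a `Counter`, which also shows how often each extension occurs. Note one detail when working with full URLs rather than bare file names: if a file has no extension, the text after the last period starts inside the host name, so the "extension" reported is a long path fragment, which is still easy to spot. `observed_extensions` is a hypothetical name and the sample URLs are invented.

```python
from collections import Counter

def observed_extensions(attachment_lists):
    """Count the text after the last '.' of every attachment URL, across all entries."""
    counts = Counter()
    for urls in attachment_lists:      # one list of URLs per entry
        for url in urls:               # an entry may have several attachments
            counts[url.rsplit('.', 1)[-1]] += 1
    return counts

base = 'https://tethys.pnnl.gov/sites/default/files/publications/'
sample = [[], [base + 'one.pdf', base + 'two.PDF'], [base + 'filepdf']]
counts = observed_extensions(sample)
print(counts)
```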

In [55]:
# Characters

**2. Create working regex logic, and associated function(s), to correctly isolate faulty entries**

Note that the data is presented as a list of lists - there is a list of entries, and within each entry a list of attachments. We want to return the **Entry** associated with a bad file name. We will leave it up to the operator/editor to figure out which file name is bad, as it should be very obvious.

Note also that the API returns duplicates because of an error with the return of spatial coordinates. For an entry associated with 2+ coordinate entries, the API will return that entry the same number of times as that entry is associated with unique locations.

1. Loop through the entries
2. Loop through the attachments associated with each entry
3. Check each attachment's URL
   1. Loop through the file locations to find a match
   2. Loop through the extensions to find a match
   3. Isolate the file name by slicing, based on the matches from (1.) and (2.)
   4. Check if there is a match with a character other than alphanumeric, "_" and "-"
   5. If there is a match, add the entry id/URI to a list
4. De-duplicate the list (to catch any duplicate API responses that may have also had attachments that failed the test)
5. Return the list of entries to be reviewed
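
The steps above can be sketched as follows. This is a draft under stated assumptions rather than the final tester: every name here is hypothetical, the records mimic the 'URI' and 'attachment' fields of the API response with placeholder URIs, and sending an attachment with an unrecognized location or extension to review is a design choice made only for this illustration.

```python
import re

LOCATIONS = [
    'https://tethys.pnnl.gov/sites/default/files/publications/',
    'https://tethys.pnnl.gov/sites/default/files/summaries/',
    'https://tethys.pnnl.gov/sites/default/files/',
]
EXTENSIONS = ['.pdf', '.PDF']
DISALLOWED = re.compile(r'[^a-zA-Z0-9_-]')

def file_name_is_bad(url):
    """Isolate the file name with the known locations and extensions, then test it."""
    location = next((loc for loc in sorted(LOCATIONS, key=len, reverse=True)
                     if url.startswith(loc)), None)
    extension = next((ext for ext in EXTENSIONS if url.endswith(ext)), None)
    if location is None or extension is None:
        return True      # unrecognized location or extension: send to review
    name = url[len(location):len(url) - len(extension)]
    return DISALLOWED.search(name) is not None

def entries_to_review(records):
    """Return the de-duplicated URIs of entries with at least one bad attachment."""
    flagged = [rec['URI'] for rec in records
               if any(file_name_is_bad(url) for url in rec['attachment'])]
    return list(dict.fromkeys(flagged))   # drops duplicates, keeps first-seen order

pub = LOCATIONS[0]
sample_records = [
    {'URI': 'entry-1', 'attachment': []},
    {'URI': 'entry-2', 'attachment': [pub + 'BPS_EMP_092016.pdf']},
    {'URI': 'entry-3', 'attachment': [pub + 'BPS_EMP_092016.pdf', pub + 'ICES Report 344.pdf']},
    {'URI': 'entry-3', 'attachment': [pub + 'BPS_EMP_092016.pdf', pub + 'ICES Report 344.pdf']},
]
print(entries_to_review(sample_records))
```

Checking every attachment with `any(...)` means an entry is reported even when only its second or third file is the problem.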

I will also need to somehow output:
 - Locations being used by the tester
 - Locations observed in the data
 - Extensions being used by the tester
 - Extensions observed in the data

So that the user can ensure that there are no edge cases excluded from the analysis.
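
A possible shape for that output is a small report that puts the configured values next to the observed ones and lists the difference. The sketch below is hypothetical throughout (`coverage_report`, its keys, and the sample URLs are all invented) and derives the observed location as everything up to the last "/".

```python
def coverage_report(attachment_lists, locations, extensions):
    """Compare the locations/extensions the tester handles with those in the data."""
    observed_locations, observed_extensions = set(), set()
    for urls in attachment_lists:
        for url in urls:
            directory, _, file_name = url.rpartition('/')
            observed_locations.add(directory + '/')
            _, dot, extension = file_name.rpartition('.')
            observed_extensions.add(dot + extension if dot else '(none)')
    return {
        'locations_used': sorted(locations),
        'locations_observed': sorted(observed_locations),
        'locations_not_handled': sorted(observed_locations - set(locations)),
        'extensions_used': sorted(extensions),
        'extensions_observed': sorted(observed_extensions),
        'extensions_not_handled': sorted(observed_extensions - set(extensions)),
    }

files = 'https://tethys.pnnl.gov/sites/default/files/'
sample = [
    [],
    [files + 'publications/BPS_EMP_092016.pdf'],
    [files + 'summaries/Example.PDF', files + 'other/Example.docx'],
]
report = coverage_report(
    sample,
    locations=[files + 'publications/', files + 'summaries/'],
    extensions=['.pdf'],
)
for key, value in report.items():
    print(key, value)
```

Empty "not handled" lists would tell the human in the loop that the tester's configuration covers everything currently in the data.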

I need to find a way to integrate these two aims via software engineering before I can start to really dig in with the code. Otherwise I will be constructing a system that is not comprehensive, and therefore fails the mission of measuring the broken file names, and underreports revision need.

**software engineering brainstorm**

funct1 
 - somehow list the locations observed
 - somehow list the extensions observed

Display results of funct1 so that the human-in-the-loop can see if there are any process-improvements to be made, or if the system is functioning as desired

funct2 
 - Takes in the results of funct1
 - Uses the paths reported by funct1, and the 

       

In [62]:
tethys_df = primre_data.tethys_dataframe_raw
tethys_df.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2,match
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],499,2017-09-29,2024-01-22 09:24:45,0


In [64]:
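def locations_found():
    '''
    Return the set of 57-character URL prefixes observed in the first attachment
    of each Tethys entry. Relies on the notebook-level tethys_df and num_t_entries.
    '''
    url_locs = list()
    for i in range(0, num_t_entries):
        i_url = tethys_df['attachment'][i]
        if len(i_url) > 0:
            url_locs.append(i_url[0][0:57])

    return set(url_locs)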

    


In [65]:
def extensions_found():
    '''
    Return the set of file extensions observed across every attachment of every
    Tethys entry. Relies on the notebook-level tethys_df.
    '''
    attachments_lst = tethys_df['attachment']
    attachment_extensions = list()
    
    for i in attachments_lst:
        for j in i:
            attachment_field_lst = re.split(r'\.', j)
            attachment_extensions.append(attachment_field_lst[-1])

    return set(attachment_extensions)

In [56]:
locs[0]

'https://tethys.pnnl.gov/sites/default/files/publications/'

**3. Apply the test to Tethys, Tethys Wind, and Tethys Engineering**

In [57]:
tethys_df = primre_data.tethys_dataframe_raw
num_t_entries = len(tethys_df)

match_lst = list()
for i in range(0, num_t_entries):
    i_url = tethys_df['attachment'][i]
    if len(i_url) < 1:
        match_lst.append(0)
    elif len(i_url) > 0:
        # i_url is a non-empty list at this point, so test its first URL
        match = re.search(p, i_url[0])
        if match is None:
            match_lst.append(1)
        else:
            #elif match != None:
            match_lst.append(2)
        

In [58]:
# assign this information as a column in the original dataframe
tethys_df['match'] = match_lst
tethys_df.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,entry_id,originationDate2,modifiedDate2,match
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],499,2017-09-29,2024-01-22 09:24:45,0
