# Tethys / Tethys Engineering QA/QC 1

The purpose of this notebook is to generate a list of Tethys and Tethys Engineering entries that have attached documents, where the name of the attached file includes a space. 

This is likely the first in a series of notebooks dedicated to bringing value to the PRIMRE team through quality control of various knowledge hubs. I expect that there will be at least a few of these notebooks dedicated to Tethys / Tethys Engineering.

### Setup

In [1]:
import re
import primrea.core

### Dev

In [2]:
primre_data = primrea.core.primrea_data()

The 'attachment' field is quite convenient for this check. The documents appear to be saved in the same place on the server, so the URL of each download is identical except for the file name. After a few random picks, I arrived at a URL that has spaces in it (index 3000, below). As we can see, the URL is stored as it was before any browser encoding, so we do not need to decode it. A small regex is enough to select the names that contain spaces, which is significantly easier than searching for '%20', the percent-encoding a browser uses for a space.
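To make the encoding point concrete, here is a minimal sketch using the standard library's `urllib.parse`. It only illustrates how a browser-style encoding of the file name would look; the variable names are illustrative and are not part of the notebook's own workflow.

```python
from urllib.parse import quote, unquote

raw_name = 'ICES Report 344.pdf'

# quote() percent-encodes characters that are not safe in a URL path;
# a space becomes %20.
encoded_name = quote(raw_name)
print(encoded_name)

# unquote() reverses the encoding and recovers the stored form of the name.
decoded_name = unquote(encoded_name)
print(decoded_name)

# On the stored (unencoded) form, a plain substring test detects the space.
print(' ' in decoded_name)
```

Because the 'attachment' values already hold the unencoded form, the space can be searched for directly.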

In [3]:
primre_data.tethys_dataframe['attachment'][3000][0][57:]

'ICES Report 344.pdf'

The regex below determines whether a string contains a space:

In [4]:
p = re.compile(' +')
p

re.compile(r' +', re.UNICODE)

In [5]:
a = 'BPS_EMP_092016.pdf'
b = 'ICES Report 344.pdf'

print(re.search(p, a))
print(re.search(p, b))

None
<re.Match object; span=(4, 5), match=' '>


In [6]:
print(p.search(a))
print(p.search(b))

None
<re.Match object; span=(4, 5), match=' '>


In [7]:
tethys_df = primre_data.tethys_dataframe
num_t_entries = len(tethys_df)

# Code each entry: 0 = no attachment, 1 = attachment without a space, 2 = attachment with a space
match_lst = list()
for i in range(0, num_t_entries):
    i_url = tethys_df['attachment'][i]    # list of attachment URLs (may be empty)
    if len(i_url) < 1:
        match_lst.append(0)
    else:
        # The list is not empty here, so index it; only the first attachment URL is checked
        match = p.search(i_url[0])
        if match is None:
            match_lst.append(1)
        else:
            match_lst.append(2)
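The same three-way coding can also be written as a small function, which makes the rule easier to test in isolation. This is a sketch under assumptions: `classify_attachment`, `SPACE_PATTERN`, and the sample URLs are illustrative names and values, and, like the loop above, it inspects only the first attachment of each entry.

```python
import re

SPACE_PATTERN = re.compile(' +')

def classify_attachment(urls):
    """Return 0 for no attachment, 1 if the first URL has no space, 2 if it has one."""
    if not urls:
        return 0
    return 2 if SPACE_PATTERN.search(urls[0]) else 1

# Illustrative inputs shaped like the 'attachment' field (a list of URL strings).
sample_attachments = [
    [],
    ['https://tethys.pnnl.gov/sites/default/files/publications/BPS_EMP_092016.pdf'],
    ['https://tethys.pnnl.gov/sites/default/files/publications/ICES Report 344.pdf'],
]

codes = [classify_attachment(urls) for urls in sample_attachments]
print(codes)
```

If the column holds lists as shown earlier, something like `tethys_df['attachment'].apply(classify_attachment)` should produce the same codes without an explicit index loop.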
        

In [8]:
# assign this information as a column in the original dataframe
tethys_df['match'] = match_lst
tethys_df.head(1)

Unnamed: 0,URI,type,landingPage,sourceURL,title,description,author,organization,originationDate,spatial,technologyType,tags,modifiedDate,signatureProject,attachment,match
0,https://tethys.pnnl.gov/node/499,"[Document, Document/Journal Article]",https://tethys.pnnl.gov/publications/marine-re...,https://www.mdpi.com/1996-1073/10/10/1512/htm,Marine Renewable Energy in the Mediterranean S...,"In this work, an extended overview of the mari...","[Soukissian, T., Denaxa, D., Karathanasi, F., ...","[Hellenic Centre for Marine Research (HCMR), N...",2017-09-29,[],[],"[Environment, Human Dimensions]",2024-01-22 09:24:45,[],[],0


In [9]:
tethys_no_attch = tethys_df[tethys_df['match']==0]
tethys_gd_attch = tethys_df[tethys_df['match']==1]
tethys_bd_attch = tethys_df[tethys_df['match']==2]
print(f'Number of obs no attch : {len(tethys_no_attch)}\nNumber of obs gd attch : {len(tethys_gd_attch)}\nNumber of obs bd attch : {len(tethys_bd_attch)}')
print(f'Number of observations in Tethys :  {num_t_entries}\nNumber of observations of matches : {len(tethys_no_attch) + len(tethys_gd_attch) + len(tethys_bd_attch)}')


Number of obs no attch : 2172
Number of obs gd attch : 2064
Number of obs bd attch : 13
Number of observations in Tethys :  4249
Number of observations of matches : 4249


In [10]:
tethys_bd_attch['URI']

2370       https://tethys.pnnl.gov/node/5744
2581       https://tethys.pnnl.gov/node/6531
2582       https://tethys.pnnl.gov/node/6533
2686       https://tethys.pnnl.gov/node/7195
2713       https://tethys.pnnl.gov/node/8211
2854      https://tethys.pnnl.gov/node/77442
2855      https://tethys.pnnl.gov/node/77636
2856      https://tethys.pnnl.gov/node/77637
2944     https://tethys.pnnl.gov/node/112518
3000     https://tethys.pnnl.gov/node/121849
3023     https://tethys.pnnl.gov/node/154322
3770    https://tethys.pnnl.gov/node/1760762
3893    https://tethys.pnnl.gov/node/2072546
Name: URI, dtype: object

### All characters

For added certainty that the results above account for all entries with possibly broken links, I will also list every unique character used in the attachment file names throughout Tethys, so that the QA/QC specialists can visually confirm that no other characters will cause problems. If this analysis finds other potentially problematic characters in the attachment names, they can be added to the regex in the prior analysis to make the check comprehensive.
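One way the regex could be extended is to flag anything outside an allow-list rather than enumerating problem characters one by one. The sketch below assumes, purely for illustration, that ASCII letters, digits, `.`, `_`, and `-` are the safe set; the actual list of problematic characters is for the QA/QC specialists to decide, and `UNSAFE_CHAR` and `unsafe_chars` are hypothetical names.

```python
import re

# Assumption: only ASCII letters, digits, '.', '_' and '-' are treated as safe.
UNSAFE_CHAR = re.compile(r'[^A-Za-z0-9._-]')

def unsafe_chars(file_name):
    """Return the sorted, de-duplicated characters in file_name outside the safe set."""
    return sorted(set(UNSAFE_CHAR.findall(file_name)))

print(unsafe_chars('BPS_EMP_092016.pdf'))
print(unsafe_chars('ICES Report 344.pdf'))
```

An allow-list errs on the side of over-reporting, which seems appropriate for a review list that a person will check by hand.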

In [42]:
all_chars = ''
url_locs = list()
for i in range(0, num_t_entries):
    i_url = tethys_df['attachment'][i]
    if len(i_url) > 0:
        # Only the first attachment URL is used; the slices assume a 57-character prefix
        all_chars = all_chars + i_url[0][57:]    # file-name portion
        url_locs.append(i_url[0][0:57])          # prefix portion

In [14]:
a = ''

In [18]:
a = a + 'b'

In [19]:
a = a + 'bsdadeeedd'

In [20]:
a


'bbsdadeeedd'

In [21]:
list(set(a))

['s', 'e', 'a', 'd', 'b']

In [46]:
set(url_locs)

{'https://tethys.pnnl.gov/sites/default/files/DataTransfera',
 'https://tethys.pnnl.gov/sites/default/files/Short-Science',
 'https://tethys.pnnl.gov/sites/default/files/publications/',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ada',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ben',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Cha',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Col',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ele',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ent',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Env',
 'https://tethys.pnnl.gov/sites/default/files/summaries/MRE',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Mar',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Ris',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Soc',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Tet',
 'https://tethys.pnnl.gov/sites/default/files/summaries/Und',
 'https:
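The set above contains more than one distinct prefix, which suggests that the fixed offset of 57 does not always land on the start of the file name. A minimal alternative, assuming the file name is whatever follows the last `/` in the URL, is sketched below. `attachment_file_name` and the example URL are hypothetical and are not taken from the data.

```python
def attachment_file_name(url):
    """Return the part of the URL after the last '/' (assumed to be the file name)."""
    return url.rsplit('/', 1)[-1]

# Hypothetical URL under the 'summaries/' directory, for illustration only.
example_url = 'https://tethys.pnnl.gov/sites/default/files/summaries/Example Summary.pdf'

# The fixed offset cuts into the file name when the directory is shorter than 'publications/'.
print(example_url[57:])

# Splitting on the last '/' does not depend on the directory length.
print(attachment_file_name(example_url))
```

Splitting on the last slash would keep the character inventory limited to file names even when attachments live in different directories.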

In [48]:
all_chars[:500]

'BPS_EMP_092016.pdfMcInnes_et_al_2018.pdfEMEC-AK_EnvironmentalMonitoringReport.pdfMcIntyre-2016-EMF-Sturgeon.pdfEMEC_2019.PDFmarine_institute-spiddal-environmental_report.pdfOneaetal2019.pdfPoweringTheBlueEconomy_73355-v2.pdfWhiting-et-al-2019.pdfenergies-2019.pdfJohnson_Pride_2010.pdfSnyderetal2019.pdfSmith_et_al-2019-Ecology_and_Evolution.pdfLusseau_et_al_2012.pdfAshley_et_al_2014.pdfemblingetal.pdfLepperetal.pdfNERC_9.pdfNERC_2016.pdfNERC_2019.pdfFreeman_et_al_2013.pdfBruch_et_al_1994.pdfCarr2'

In [49]:
set(list(all_chars))

{' ',
 "'",
 '(',
 ')',
 '-',
 '.',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 'Ç',
 'Ü',
 'é',
 'í',
 'ć',
 '“',
 '”'}
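Some of the characters above are hard to tell apart by eye, for example the curly quotation marks versus a straight quote. As a reviewing aid, the sketch below prints the code point and Unicode name of each non-alphanumeric or non-ASCII character from the set. The `observed_chars` literal is copied by hand from the output above, and `describe` is a hypothetical helper.

```python
import unicodedata

# Copied from the set above: everything except ASCII letters, digits, '-', '.' and '_'.
observed_chars = {' ', "'", '(', ')', 'Ç', 'Ü', 'é', 'í', 'ć', '“', '”'}

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f'U+{ord(ch):04X} {unicodedata.name(ch, "UNKNOWN")}'

for ch in sorted(observed_chars):
    print(repr(ch), describe(ch))
```

The names make it straightforward to decide which characters belong in the problem list.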