In [None]:
#| default_exp handlers.ospar

# OSPAR 

> This data pipeline, known as a "handler" in Marisco terminology, is designed to clean, standardize, and encode [OSPAR data](https://odims.ospar.org/en/) into `NetCDF` format. The handler processes raw OSPAR data, applying various transformations and lookups to align it with `MARIS` data standards.

Key functions of this handler:

- **Cleans** and **normalizes** raw OSPAR data
- **Applies standardized nomenclature** and units
- **Encodes the processed data** into `NetCDF` format compatible with MARIS requirements

This handler is a crucial component in the Marisco data processing workflow, ensuring OSPAR data is properly integrated into the MARIS database.

:::{.callout-tip}

For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)](https://github.com/franckalbinet/marisco/tree/main/install_configure_guide) for detailed information.

:::

This notebook is intended as an instance of [Literate Programming](https://www.wikiwand.com/en/articles/Literate_programming): a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case `marisco/handlers/ospar.py`), the code snippet is added to the module using `#| export`, as provided by the [nbdev](https://nbdev.fast.ai/getting_started.html) library.

In [None]:
#| hide
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
#| export
import pandas as pd 
import numpy as np
import fastcore.all as fc 
from fastcore.basics import patch
from typing import  Dict, Callable 
import re
from owslib.wfs import WebFeatureService
from io import StringIO
from pathlib import Path 
import time
from urllib.parse import quote


from marisco.utils import (
    Remapper, 
    get_unique_across_dfs,
    NA
)

from marisco.callbacks import (
    Callback, 
    Transformer, 
    EncodeTimeCB, 
    LowerStripNameCB, 
    SanitizeLonLatCB, 
    CompareDfsAndTfmCB, 
    RemapCB
)

from marisco.metadata import (
    GlobAttrsFeeder, 
    BboxCB, 
    DepthRangeCB, 
    TimeRangeCB, 
    ZoteroCB, 
    KeyValuePairCB
)

from marisco.configs import (
    nuc_lut_path, 
    cfg, 
    species_lut_path, 
    bodyparts_lut_path, 
    detection_limit_lut_path, 
    get_lut, 
    cache_path
)

from marisco.encoders import (
    NetCDFEncoder, 
)

from marisco.handlers.data_format_transformation import (
    decode, 
)

from marisco.utils import (
    ExtractNetcdfContents,
)

import warnings
warnings.filterwarnings('ignore')

In [None]:
#| eval: false
from IPython.display import display, Markdown

## Configuration and File Paths

The handler requires several configuration parameters:
1. **src_dir**: Path to the maris-crawlers folder containing the OSPAR data in CSV format.
2. **fname_out_nc**: Output path and filename for the NetCDF file (relative paths are supported).
3. **zotero_key**: Key for retrieving dataset attributes from [Zotero](https://www.zotero.org/).

:::{.callout-tip}

**FEEDBACK FOR NEXT VERSION**: Update src_dir to use Franck's repository.
:::

In [None]:
#| exports
src_dir = 'https://raw.githubusercontent.com/niallmurphy93/maris-crawlers/refs/heads/main/data/processed/OSPAR'
fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
zotero_key ='LQRA4MMK' # OSPAR MORS zotero key

## Load data

[OSPAR data](https://odims.ospar.org/en/submissions/) is provided as a zipped Microsoft Access database. To facilitate easier access and integration, we process this dataset and convert it into `.csv` files. These processed files are then made available in the [maris-crawlers repository](https://github.com/franckalbinet/maris-crawlers/tree/main/data/processed/OSPAR) on GitHub.
Once converted, the dataset is in a format that is readily compatible with the [marisco](https://github.com/franckalbinet/marisco) data pipeline, ensuring seamless data handling and analysis.

In [None]:
#| exports
default_smp_types = {  
    'Biota': 'BIOTA', 
    'Seawater': 'SEAWATER', 
}

In [None]:
#| exports
def read_csv(file_name, dir=src_dir):
    file_path = f'{dir}/{file_name}'
    return pd.read_csv(file_path)

In [None]:
#| exports
def load_data(src_url: str, 
              smp_types: dict = default_smp_types, 
              use_cache: bool = False,
              save_to_cache: bool = False,
              verbose: bool = False) -> Dict[str, pd.DataFrame]:
    "Load OSPAR data and return the data in a dictionary of dataframes with the dictionary key as the sample type."
    
    def safe_file_path(url: str) -> str:
        """Safely encode spaces in a URL."""
        return url.replace(" ", "%20")

    def get_file_path(dir_path: str, file_prefix: str) -> str:
        """Construct the full file path based on directory and file prefix."""
        file_path = f"{dir_path}/{file_prefix} data.csv"
        return safe_file_path(file_path) if not use_cache else file_path

    def load_and_process_csv(file_path: str) -> pd.DataFrame:
        """Load a CSV file and process it."""
        if use_cache and not Path(file_path).exists():
            if verbose:
                print(f"{file_path} not found in cache.")
            return pd.DataFrame()

        if verbose:
            start_time = time.time()

        try:
            df = pd.read_csv(file_path)
            df.columns = df.columns.str.lower()
            if verbose:
                print(f"Data loaded from {file_path} in {time.time() - start_time:.2f} seconds.")
            return df
        except Exception as e:
            if verbose:
                print(f"Failed to load {file_path}: {e}")
            return pd.DataFrame()

    def save_to_cache_dir(df: pd.DataFrame, file_prefix: str):
        """Save the DataFrame to the cache directory."""
        cache_dir = cache_path()
        cache_file_path = f"{cache_dir}/{file_prefix} data.csv"
        df.to_csv(cache_file_path, index=False)
        if verbose:
            print(f"Data saved to cache at {cache_file_path}")

    data = {}
    for file_prefix, smp_type in smp_types.items():
        dir_path = cache_path() if use_cache else src_url
        file_path = get_file_path(dir_path, file_prefix)
        df = load_and_process_csv(file_path)

        if save_to_cache and not df.empty:
            save_to_cache_dir(df, file_prefix)

        data[smp_type] = df

    return data

In [None]:
#| eval: false
load_data(src_dir, save_to_cache=True, verbose=True)

Data loaded from https://raw.githubusercontent.com/niallmurphy93/maris-crawlers/refs/heads/main/data/processed/OSPAR/Biota%20data.csv in 0.55 seconds.
Data saved to cache at /home/niallmurphy93/.marisco/cache/Biota data.csv
Data loaded from https://raw.githubusercontent.com/niallmurphy93/maris-crawlers/refs/heads/main/data/processed/OSPAR/Seawater%20data.csv in 0.56 seconds.
Data saved to cache at /home/niallmurphy93/.marisco/cache/Seawater data.csv


{'BIOTA':           id contracting party  rsc sub-division             station id  \
 0          1           Belgium                 8  Kloosterzande-Schelde   
 1          2           Belgium                 8  Kloosterzande-Schelde   
 2          3           Belgium                 8  Kloosterzande-Schelde   
 3          4           Belgium                 8  Kloosterzande-Schelde   
 4          5           Belgium                 8  Kloosterzande-Schelde   
 ...      ...               ...               ...                    ...   
 15946  98058            Sweden                12         Ringhals (R22)   
 15947  98059            Sweden                12         Ringhals (R23)   
 15948  98060            Sweden                11                    SW7   
 15949  98061            Sweden                11                   SW6a   
 15950  98062            Sweden                12         Ringhals (R25)   
 
       sample id  latd  latm  lats latdir  longd  ...      sampling date  \
 

## Nuclide Name Normalization

:::{.callout-tip}

**FEEDBACK FOR NEXT VERSION**: In the lookup at nuc_lut_path, do we need nc_name? We used nc_name when we were pivoting the table from long to wide format. Should we remove it? 

:::

We standardize the nuclide names in the OSPAR dataset so that they align with the names in the MARISCO lookup table.
The lookup process uses three key columns:

- `nuclide_id`: a unique identifier for each nuclide.
- `nuclide`: the standardized name of the nuclide, following our conventions.
- `nc_name`: the corresponding name used in NetCDF files.

The structure and contents of the lookup table are shown below:

In [None]:
#| eval: false
nuc_lut_df = pd.read_excel(nuc_lut_path())
nuc_lut_df.head()

Unnamed: 0,nuclide_id,nuclide,atomicnb,massnb,nusymbol,half_life,hl_unit,nc_name
0,-1,NOT APPLICABLE,,,,,,NOT APPLICABLE
1,0,NOT AVAILABLE,0.0,0.0,0,0.0,-,NOT AVAILABLE
2,1,TRITIUM,1.0,3.0,3H,12.35,Y,h3
3,2,BERYLLIUM,4.0,7.0,7Be,53.3,D,be7
4,3,CARBON,6.0,14.0,14C,5730.0,Y,c14


The nuclide data is provided in the `nuclide` column. However, as shown below, the nuclide names are not standardized.


In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True, verbose=True)
df = get_unique_across_dfs(dfs, 'nuclide', as_df=True)
df.T

Data loaded from /home/niallmurphy93/.marisco/cache/Biota data.csv in 0.05 seconds.
Data loaded from /home/niallmurphy93/.marisco/cache/Seawater data.csv in 0.04 seconds.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
index,0,1,2,3,4,5.0,6,7,8,9,10,11,12,13,14,15,16,17
value,137Cs,238Pu,137Cs,"239, 240 Pu",99Tc,,99Tc,CS-137,3H,210Po,99Tc,226Ra,210Pb,210Po,228Ra,Cs-137,"239,240Pu",241Am
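
To make the idea of "unique values across dataframes" concrete, here is a minimal sketch using plain pandas. `unique_across` is a hypothetical stand-in written for illustration; it is not the implementation of `get_unique_across_dfs` (for instance, it drops missing values, whereas the table above keeps a `NaN` entry), and the toy data is made up.

```python
import pandas as pd

def unique_across(dfs: dict, col_name: str) -> list:
    """Hypothetical stand-in: collect the distinct values of `col_name` over all dataframes."""
    values = set()
    for df in dfs.values():
        if col_name in df.columns:
            values.update(df[col_name].dropna().unique())
    return sorted(values)

toy = {
    'BIOTA': pd.DataFrame({'nuclide': ['137Cs', 'Cs-137', '99Tc ']}),
    'SEAWATER': pd.DataFrame({'nuclide': ['137Cs', '3H', None]}),
}
print(unique_across(toy, 'nuclide'))
```

Note that values differing only in case, punctuation, or surrounding whitespace (such as `'99Tc '` with a trailing space) count as distinct entries, which is why a normalization step is needed before matching.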


### Lower & strip nuclide names

To simplify the data, we use the `LowerStripNameCB` callback. For each dataframe in the dictionary of dataframes, `LowerStripNameCB` simplifies the nuclide name by converting it to lowercase and stripping any leading or trailing whitespace.

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True, verbose=True)
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='nuclide')])
dfs_output=tfm()
for key, df in dfs_output.items():
    print(f'{key} nuclides: ')
    print(df['nuclide'].unique())

Data loaded from /home/niallmurphy93/.marisco/cache/Biota data.csv in 0.05 seconds.
Data loaded from /home/niallmurphy93/.marisco/cache/Seawater data.csv in 0.04 seconds.
BIOTA nuclides: 
['137cs' '226ra' '228ra' '239,240pu' '99tc' '210po' '210pb' '3h' 'cs-137'
 '238pu' '239, 240 pu' '241am']
SEAWATER nuclides: 
['137cs' '239,240pu' '226ra' '228ra' '99tc' '3h' '210po' '210pb' nan]
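
For readers unfamiliar with the underlying string operations, the effect can be reproduced with pandas string methods. This is a sketch of the idea on made-up values, assuming the callback behaves like `str.lower` followed by `str.strip`; the actual `LowerStripNameCB` implementation is not shown in this notebook.

```python
import pandas as pd

raw = pd.Series(['137Cs', ' Cs-137', 'CS-137 ', '239, 240 Pu'])

# Lowercase first, then remove leading/trailing whitespace
cleaned = raw.str.lower().str.strip()
print(cleaned.tolist())
print(cleaned.nunique(), 'distinct names after cleaning')
```

Case and whitespace variants collapse into a single spelling, but different notations (`137cs` versus `cs-137`) remain distinct and still need remapping.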


### Remap nuclide names to MARIS data formats

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `nuclide` column has inconsistent naming. E.g:

- `Cs-137`,  `137Cs` or `CS-137`
- `239, 240 pu` or `239,240 pu`
- `ra-226` and `226ra` 

See below:

:::

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='nuclide', as_df=True).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
index,0,1,2,3,4,5.0,6,7,8,9,10,11,12,13,14,15,16,17
value,137Cs,238Pu,137Cs,"239, 240 Pu",99Tc,,99Tc,CS-137,3H,210Po,99Tc,226Ra,210Pb,210Po,228Ra,Cs-137,"239,240Pu",241Am


Below, we map nuclide names used by OSPAR to the MARIS standard nuclide names. 

Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

1. **Inspect** data provider nomenclature;
2. **Match** automatically against MARIS nomenclature (using a fuzzy matching algorithm); 
3. **Fix** potential mismatches; 
4. **Apply** the lookup table to the dataframe.

We will refer to this process as **IMFA** (**I**nspect, **M**atch, **F**ix, **A**pply).
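
The **Match** step can be pictured as a nearest-string search. The following is a minimal sketch, assuming the score is a Levenshtein edit distance in which 0 means an exact match. `levenshtein` and `best_match` are hypothetical helpers written for illustration, not the `Remapper` implementation, and `maris_names` is a toy subset rather than the full lookup table.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters are equal)
            ))
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: list) -> tuple:
    """Return the (candidate, score) pair with the smallest edit distance."""
    return min(((c, levenshtein(name, c)) for c in candidates), key=lambda t: t[1])

maris_names = ['tc99', 'tu', 'pu240', 'cs137']  # toy subset for illustration
print(best_match('99tc', maris_names))  # a reordered name may match the wrong entry
print(best_match('tc99', maris_names))  # an exact match scores 0
```

Under this assumption, a purely automatic match can pick a wrong candidate when the provider writes the mass number before the element symbol, which is what the **Fix** step is for.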

Let's now create an instance of a [fuzzy matching algorithm](https://www.wikiwand.com/en/articles/Approximate_string_matching) `Remapper`. This instance will match the nuclide names of the OSPAR dataset to the MARIS standard nuclide names.

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_output, col_name='nuclide', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_ospar.pkl')

Let's try to match OSPAR nuclide names to MARIS standard nuclide names as automatically as possible. The `match_score` column allows us to assess the results:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)

Processing:   0%|          | 0/13 [00:00<?, ?it/s]

Processing: 100%|██████████| 13/13 [00:00<00:00, -124.33it/s]

1 entries matched the criteria, while 12 entries had a match score of 1 or higher.





Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"239, 240 pu",pu240,"239, 240 pu",8
"239,240pu",pu240,"239,240pu",6
226ra,u234,226ra,4
137cs,i133,137cs,4
241am,pu241,241am,4
228ra,u235,228ra,4
210po,ru106,210po,4
210pb,ru106,210pb,4
99tc,tu,99tc,3
238pu,u238,238pu,3


We can now manually review the unmatched nuclide names and construct a dictionary to map them to the MARIS standard.

In [None]:
#| export
fixes_nuclide_names = {
    '99tc': 'tc99',
    '238pu': 'pu238',
    '226ra': 'ra226',
    'ra-226': 'ra226',
    'ra-228': 'ra228',    
    '210pb': 'pb210',
    '241am': 'am241',
    '228ra': 'ra228',
    '137cs': 'cs137',
    '210po': 'po210',
    '239,240pu': 'pu239_240_tot',
    '239, 240 pu': 'pu239_240_tot',
    'cs-137': 'cs137',
    '3h': 'h3'
    }

The dictionary `fixes_nuclide_names` applies manual corrections to the nuclide names before the remapping process. 
The `generate_lookup_table` function has an `overwrite` parameter (default `True`); when it is `True`, a pickle file cache of the lookup table is created. We can now test the remapping process:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1)), 0)

Processing:   0%|          | 0/13 [00:00<?, ?it/s]

Processing: 100%|██████████| 13/13 [00:00<00:00, 43.39it/s]


To view all remapped nuclides, we can set the match score threshold to 0, which returns every entry.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
remapper.select_match(match_score_threshold=0, verbose=True).T

Processing:   0%|          | 0/13 [00:00<?, ?it/s]

Processing: 100%|██████████| 13/13 [00:00<00:00, 43.26it/s]

0 entries matched the criteria, while 13 entries had a match score of 0 or higher.





source_key,99tc,226ra,cs-137,"239, 240 pu",137cs,241am,228ra,210po,NaN,"239,240pu",3h,210pb,238pu
matched_maris_name,tc99,ra226,cs137,pu239_240_tot,cs137,am241,ra228,po210,Unknown,pu239_240_tot,h3,pb210,pu238
source_name,99tc,226ra,cs-137,"239, 240 pu",137cs,241am,228ra,210po,,"239,240pu",3h,210pb,238pu
match_score,0,0,0,0,0,0,0,0,0,0,0,0,0


The nuclide names are now remapped correctly. Next, we create a callback, `RemapNuclideNameCB`, that remaps the nuclide names in the dataframes to their `nuclide_id` values. 

Note that we pass `overwrite=False` to `generate_lookup_table` so that the cached version of the lookup table is used.

In [None]:
#| export
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
                                   maris_lut_fn=nuc_lut_path,
                                   maris_col_id='nuclide_id',
                                   maris_col_name='nc_name',
                                   provider_col_to_match='value',
                                   provider_col_key='value',
                                   fname_cache='nuclides_ospar.pkl').generate_lookup_table(fixes=fixes_nuclide_names, 
                                                                                            as_df=False, overwrite=False)

In [None]:
#| export
class RemapNuclideNameCB(Callback):
    "Remap data provider nuclide names to standardized MARIS nuclide names."
    def __init__(self, 
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 col_name: str # Column name to remap
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name=self.col_name, as_df=True)
        #lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}    
        lut = {k: v.matched_id for k, v in self.fn_lut(df_uniques).items()}    
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k][self.col_name].replace(lut)

Let's see it in action, along with the `LowerStripNameCB` callback:

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide')
                            ])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'Unique values for {key} NUCLIDE column: ', dfs_out[key]['NUCLIDE'].unique())

Unique values for BIOTA NUCLIDE column:  [33 53 54 77 15 47 41  1 67 72]
Unique values for SEAWATER NUCLIDE column:  [33 77 53 54 15  1 47 41 -1]
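
The remapping itself boils down to a dictionary-based `replace` on the column. Here is a minimal sketch on toy data; the lookup below is hypothetical, with only the id `1` for tritium (`h3`) taken from the lookup table shown earlier and the other id invented for illustration.

```python
import pandas as pd

# Hypothetical lookup: provider name -> nuclide_id.
# Only '3h' -> 1 follows the lookup table shown earlier; 101 is a made-up id.
lut = {'3h': 1, '137cs': 101, 'cs-137': 101}

df = pd.DataFrame({'nuclide': ['3h', '137cs', 'cs-137', '3h']})
df['NUCLIDE'] = df['nuclide'].replace(lut)
print(df)
```

One property of `replace` worth keeping in mind is that values absent from the dictionary are left unchanged rather than set to `NaN`, so an incomplete lookup table would leave provider strings in the `NUCLIDE` column.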


## Standardize Time

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: 'NaN' values found for `sampling date` column in the `SEAWATER` dataset.

:::

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)

for key in dfs.keys():
    if dfs[key]['sampling date'].isnull().sum() > 0:
        print(f"NaN values found for 'sampling date' in {key} dataset. A total of {dfs[key]['sampling date'].isnull().sum()} NaN values found.")
        print(f'Example:')
        with pd.option_context('display.max_columns', None):
            display(dfs[key][dfs[key]['sampling date'].isnull()].head(2))
    else:
        print(f"No NaN values found for 'sampling date' in {key} dataset.")

No NaN values found for 'sampling date' in BIOTA dataset.
NaN values found for 'sampling date' in SEAWATER dataset. A total of 10 NaN values found.
Example:


Unnamed: 0,id,contracting party,rsc sub-division,station id,sample id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample type,sampling depth,sampling date,nuclide,value type,activity or mda,uncertainty,unit,data provider,measurement comment,sample comment,reference comment
14776,97948,Sweden,11.0,SW7,1,58,36.0,12.0,N,11,14.0,42.0,E,WATER,1.0,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,
14780,97952,Sweden,12.0,Ringhals (R35),7,57,14.0,5.0,N,11,56.0,8.0,E,WATER,1.0,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,


We create a callback that parses the time column of each dataframe (format `%m/%d/%y %H:%M:%S`) and drops rows with missing or unparseable dates:

In [None]:
#| export
time_cols = {'BIOTA': 'sampling date', 'SEAWATER': 'sampling date'}
time_format = '%m/%d/%y %H:%M:%S'

In [None]:
#| export
class ParseTimeCB(Callback):
    "Parse the time format in the dataframe and check for inconsistencies."
    
    def __init__(self, 
                 col_src: dict=time_cols, # Mapping of group name to its source time column
                 col_dst: str='TIME', # Destination column for the parsed time
                 format: str=time_format # Time format
                 ):
        fc.store_attr()
    
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            src_col = self.col_src.get(grp)
            
            if src_col not in df.columns:
                print(f"Column '{src_col}' not found in {grp} dataset.")
                continue  
            # Parse time and handle errors
            df[self.col_dst] = pd.to_datetime(df[src_col], format=self.format, errors='coerce')
            
            # Drop rows where parsing failed (NaT values in TIME column)
            invalid_rows = df[df[self.col_dst].isna()]
            
            if not invalid_rows.empty:     
                print(f"{len(invalid_rows)} invalid rows found in group '{grp}' during time parsing callback.")
                df.dropna(subset=[self.col_dst], inplace=True)

        return tfm
        

Apply the transformer with the `ParseTimeCB` callback.

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    ParseTimeCB(),
    CompareDfsAndTfmCB(dfs)])
tfm()


display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

display(Markdown("<b> Example of parsed time column:</b>"))
with pd.option_context('display.max_rows', None):
    display(tfm.dfs['SEAWATER']['TIME'].head(2))

10 invalid rows found in group 'SEAWATER' during time parsing callback.


<b> Row Count Comparison Before and After Transformation:</b>

Unnamed: 0,BIOTA,SEAWATER
Number of rows in original dataframes (dfs):,15951,19193
Number of rows in transformed dataframes (tfm.dfs):,15951,19183
Number of rows removed (tfm.dfs_removed):,0,10


<b> Example of parsed time column:</b>

0   2010-01-27
1   2010-01-27
Name: TIME, dtype: datetime64[ns]
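
The parse-then-drop logic can be tried in isolation on a few made-up values. This sketch uses the same `pd.to_datetime` call and time format as `ParseTimeCB`; the input rows are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({'sampling date': ['01/27/10 00:00:00', None, 'not a date']})

# errors='coerce' turns missing or unparseable entries into NaT instead of raising
raw['TIME'] = pd.to_datetime(raw['sampling date'], format='%m/%d/%y %H:%M:%S', errors='coerce')

n_invalid = raw['TIME'].isna().sum()
parsed = raw.dropna(subset=['TIME'])
print(f'{n_invalid} invalid rows dropped')
print(parsed['TIME'].tolist())
```

Because coercion maps both missing and malformed dates to `NaT`, the reported count of invalid rows does not distinguish between the two cases.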

The NetCDF time format requires the time to be encoded as the number of milliseconds since a time of origin. In our case the time of origin is `1970-01-01`, as indicated in the `CONFIGS['units']['time']` dictionary of `configs.ipynb`.

`EncodeTimeCB` converts the OSPAR `time` format to the MARIS NetCDF `time` format.

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))
                            

10 invalid rows found in group 'SEAWATER' during time parsing callback.


<b> Row Count Comparison Before and After Transformation:</b>

Unnamed: 0,BIOTA,SEAWATER
Number of rows in original dataframes (dfs):,15951,19193
Number of rows in transformed dataframes (tfm.dfs):,15951,19183
Number of rows removed (tfm.dfs_removed):,0,10
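
To illustrate what "encoded since a time of origin" means, here is a small sketch that converts timestamps to an integer offset from `1970-01-01`. It takes the millisecond unit stated above at face value; the unit actually applied is whatever the `['units']['time']` configuration entry specifies, and the `EncodeTimeCB` implementation itself is not shown in this notebook.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(['2010-01-27', '1970-01-02']))
origin = pd.Timestamp('1970-01-01')

# Integer number of milliseconds elapsed since the origin
ms_since_origin = (times - origin) // pd.Timedelta(milliseconds=1)
print(ms_since_origin.tolist())
```

Changing the `pd.Timedelta` divisor (for example to `seconds=1`) is all that is needed to express the same offset in another unit.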


## Sanitize value

We copy the column containing the measurement values into a single standardized column, `VALUE`, and drop rows where the value is missing.

In [None]:
#| exports
value_cols = {'BIOTA': 'activity or mda', 'SEAWATER': 'activity or mda'}

In [None]:
#| export
class SanitizeValueCB(Callback):
    "Sanitize value by removing blank entries and populating `value` column."
    def __init__(self, 
                 value_col: dict = value_cols # Column name to sanitize
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            # Drop rows where the measurement value is missing (NaN in the value column)
            invalid_rows = df[df[self.value_col.get(grp)].isna()]
            if not invalid_rows.empty:     
                print(f"{len(invalid_rows)} invalid rows found in group '{grp}' during sanitize value callback.")
                df.dropna(subset=[self.value_col.get(grp)], inplace=True)
                
            df['VALUE'] = df[self.value_col.get(grp)]

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            CompareDfsAndTfmCB(dfs)])

tfm()

display(Markdown("<b> Example of VALUE column:</b>"))
with pd.option_context('display.max_rows', None):
    display(tfm.dfs['SEAWATER'][['VALUE']].head())

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

display(Markdown("<b> Example of removed data:</b>"))
with pd.option_context('display.max_columns', None):
    display(tfm.dfs_removed['SEAWATER'].head(2))

10 invalid rows found in group 'SEAWATER' during sanitize value callback.


<b> Example of VALUE column:</b>

Unnamed: 0,VALUE
0,0.2
1,0.27
2,0.26
3,0.25
4,0.2


<b> Row Count Comparison Before and After Transformation:</b>

Unnamed: 0,BIOTA,SEAWATER
Number of rows in original dataframes (dfs):,15951,19193
Number of rows in transformed dataframes (tfm.dfs):,15951,19183
Number of rows removed (tfm.dfs_removed):,0,10


<b> Example of removed data:</b>

Unnamed: 0,id,contracting party,rsc sub-division,station id,sample id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample type,sampling depth,sampling date,nuclide,value type,activity or mda,uncertainty,unit,data provider,measurement comment,sample comment,reference comment
14776,97948,Sweden,11.0,SW7,1,58,36.0,12.0,N,11,14.0,42.0,E,WATER,1.0,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,
14780,97952,Sweden,12.0,Ringhals (R35),7,57,14.0,5.0,N,11,56.0,8.0,E,WATER,1.0,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,
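
The bookkeeping behind the row count comparison can be sketched with plain pandas: drop the rows with a missing value, then recover the removed rows by comparing indexes. This is an illustration on toy data only; how `CompareDfsAndTfmCB` computes `dfs_removed` is not shown in this notebook and may differ.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'activity or mda': [0.2, None, 0.27]})

kept = df.dropna(subset=['activity or mda']).copy()
removed = df.loc[df.index.difference(kept.index)]  # rows present before but not after
kept['VALUE'] = kept['activity or mda']

print(len(df), len(kept), len(removed))
```

Keeping the removed rows around, rather than only counting them, makes it straightforward to report them back to the data provider.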


## Normalize uncertainty

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: This applies to the 'CSV' datasets from the WFS. We have noticed that some entries in the `uncertainty` column use a comma (`,`) as a decimal separator. Please consider standardizing these entries to use a period (`.`) as the decimal separator. 

:::

For each sample type in the OSPAR dataset, the reported uncertainty is given as an expanded uncertainty with a coverage factor `k=2`. For further details, refer to the [OSPAR reporting guidelines](https://mcc.jrc.ec.europa.eu/documents/OSPAR/Guidelines_forestimationof_a_%20measurefor_uncertainty_in_OSPARmonitoring.pdf).

**Note**: For MARIS, the OSPAR uncertainty values are normalized to standard uncertainty with a coverage factor `k=1`.

`NormalizeUncCB` callback normalizes the uncertainty using the following `lambda` function:

In [None]:
#| export
unc_exp2stan = lambda df, unc_col: df[unc_col] / 2

In [None]:
#| exports
unc_cols = {'BIOTA': 'uncertainty', 'SEAWATER': 'uncertainty'}

In [None]:
#| export
class NormalizeUncCB(Callback):
    """Normalize uncertainty values in DataFrames."""
    def __init__(self, 
                 col_unc: dict = unc_cols, # Column name to normalize
                 fn_convert_unc: Callable=unc_exp2stan, # Function correcting coverage factor
                 ): 
        fc.store_attr()

    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            self._convert_commas_to_periods(df, self.col_unc.get(grp))
            self._convert_to_float(df, self.col_unc.get(grp))
            self._apply_conversion_function(df, self.col_unc.get(grp))

    def _convert_commas_to_periods(self, df, col_unc):
        """Convert commas to periods in the uncertainty column."""
        df[col_unc] = df[col_unc].astype(str).str.replace(',', '.')

    def _convert_to_float(self, df, col_unc):
        """Convert uncertainty column to float, handling errors by setting them to NaN."""
        df[col_unc] = pd.to_numeric(df[col_unc], errors='coerce')

    def _apply_conversion_function(self, df, col_unc):
        """Apply the conversion function to normalize the uncertainty values."""
        df['UNC'] = self.fn_convert_unc(df, col_unc)

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
        SanitizeValueCB(),               
        NormalizeUncCB()
    ])
tfm()

display(Markdown("<b> Example of VALUE and UNC columns:</b>"))  
for grp in ['SEAWATER', 'BIOTA']:
    print(f'\n{grp}:')
    print(tfm.dfs[grp][['VALUE', 'UNC']])

10 invalid rows found in group 'SEAWATER' during sanitize value callback.


<b> Example of VALUE and UNC columns:</b>


SEAWATER:
          VALUE           UNC
0      0.200000           NaN
1      0.270000           NaN
2      0.260000           NaN
3      0.250000           NaN
4      0.200000           NaN
...         ...           ...
19183  0.000005  2.600000e-07
19184  6.152000  3.076000e-01
19185  0.005390  1.078000e-03
19186  0.001420  2.840000e-04
19187  6.078000  3.039000e-01

[19183 rows x 2 columns]

BIOTA:
          VALUE       UNC
0      0.326416       NaN
1      0.442704       NaN
2      0.412989       NaN
3      0.202768       NaN
4      0.652833       NaN
...         ...       ...
15946  0.384000  0.012096
15947  0.456000  0.012084
15948  0.122000  0.031000
15949  0.310000       NaN
15950  0.306000  0.007191

[15951 rows x 2 columns]
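
The three steps performed by `NormalizeUncCB` (comma to period, conversion to float, division by the coverage factor) can be followed on a handful of made-up entries. The sketch below mirrors the calls used in the class on a standalone `Series`.

```python
import pandas as pd

unc = pd.Series(['0,3276', '0.5', 'n/a', None])

# 1. comma -> period, 2. to float (failures become NaN), 3. halve (k=2 -> k=1)
as_text = unc.astype(str).str.replace(',', '.', regex=False)
as_float = pd.to_numeric(as_text, errors='coerce')
standard_unc = as_float / 2
print(standard_unc.tolist())
```

Entries that cannot be interpreted as numbers, including missing values turned into text by `astype(str)`, end up as `NaN` and are kept rather than dropped.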


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `SEAWATER` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, these entries are worth highlighting.

:::

To show situations where the uncertainty is much greater than the value, we calculate the relative uncertainty (normalized uncertainty divided by value, expressed as a percentage) for each dataset. 

In [None]:
#| eval: false
for grp in ['SEAWATER', 'BIOTA']:
    # Relative uncertainty as a percentage: UNC / VALUE * 100
    tfm.dfs[grp]['relative_uncertainty'] = tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'] * 100

Now we will return all rows where the relative uncertainty is greater than 100% for the seawater dataset.

In [None]:
#| eval: false
threshold = 100
grp='SEAWATER'
cols_to_show=['id', 'contracting party', 'nuclide', 'value type', 'activity or mda', 'uncertainty', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

display(Markdown(f"<b> Example of data with relative uncertainty greater than {threshold}%:</b>"))
with pd.option_context('display.max_rows', None):
    display(df.head())


Number of rows where relative uncertainty is greater than 100%: 
 81 



<b> Example of data with relative uncertainty greater than 100%:</b>

Unnamed: 0,id,contracting party,nuclide,value type,activity or mda,uncertainty,unit,relative_uncertainty
969,11075,United Kingdom,137Cs,=,0.0028,0.3276,Bq/l,5850.0
971,11077,United Kingdom,137Cs,=,0.0029,0.3364,Bq/l,5800.0
973,11079,United Kingdom,137Cs,=,0.0025,0.3325,Bq/l,6650.0
975,11081,United Kingdom,137Cs,=,0.0025,0.345,Bq/l,6900.0
977,11083,United Kingdom,137Cs,=,0.0038,0.3344,Bq/l,4400.0
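
As a worked check of the first row above (id 11075): the reported expanded uncertainty is first halved to obtain the standard uncertainty (`k=1`), and that value is then divided by the activity. The numbers are taken from the table; the variable names are chosen here for illustration.

```python
expanded_unc = 0.3276  # reported 'uncertainty' (k=2) for id 11075
value = 0.0028         # reported 'activity or mda'

standard_unc = expanded_unc / 2  # normalize to k=1
relative_unc_pct = standard_unc / value * 100
print(round(relative_unc_pct, 1))
```

The result agrees with the `relative_uncertainty` of 5850.0 shown in the table, confirming that the percentage is computed from the normalized `UNC` column and not from the raw `uncertainty` column.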


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `BIOTA` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, these entries are worth highlighting.

:::

Now we return all rows where the relative uncertainty is greater than 100% for the biota dataset.

In [None]:
#| eval: false
threshold = 100
grp='BIOTA' 
cols_to_show=['id', 'contracting party', 'nuclide', 'value type', 'activity or mda', 'uncertainty', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

display(Markdown(f"<b> Example of data with relative uncertainty greater than {threshold}%:</b>"))
with pd.option_context('display.max_rows', None):
    display(df.head())


Number of rows where relative uncertainty is greater than 100%: 
 37 



<b> Example of data with relative uncertainty greater than 100%:</b>

Unnamed: 0,id,contracting party,nuclide,value type,activity or mda,uncertainty,unit,relative_uncertainty
2338,23895,Belgium,226Ra,=,1.4,118.0,Bq/kg f.w.,4214.285714
2693,29984,Belgium,137Cs,=,0.169,27.0,Bq/kg f.w.,7988.16568
3027,35011,Belgium,137Cs,=,0.1619,66.0,Bq/kg f.w.,20382.95244
4442,49221,Sweden,137Cs,=,0.295,2.74,Bq/kg f.w.,464.40678
4447,49226,Sweden,137Cs,=,0.327,1.468,Bq/kg f.w.,224.464832


## Remap units

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: It would be easier to work with the units if they were standardized. The units are not consistent across the dataset, for instance `BQ/L`, `Bq/l` and `Bq/L` are used interchangeably.

:::


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Unit` column contains `NaN` values for the `SEAWATER` dataset, as shown below.
:::


In [None]:
#| eval: false
df=dfs['SEAWATER'][dfs['SEAWATER']['unit'].isnull()]
print(f'Number of rows with NaN in unit column: \n {df.shape[0]} \n')
display(Markdown(f"<b> Example of data with NaN in unit column:</b>"))
with pd.option_context('display.max_columns', None):
    display(df.head())

Number of rows with NaN in unit column: 
 8 



<b> Example of data with NaN in unit column:</b>

Unnamed: 0,id,contracting party,rsc sub-division,station id,sample id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample type,sampling depth,sampling date,nuclide,value type,activity or mda,uncertainty,unit,data provider,measurement comment,sample comment,reference comment
16161,120369,Ireland,1.0,Salthill,,53,15.0,40.0,N,9,4.0,15.0,W,,,,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,
16162,120370,Ireland,1.0,Woodstown,,52,11.0,55.0,N,6,58.0,47.0,W,,,,,,,,,,,,
16586,120363,Ireland,4.0,N1,,53,25.0,0.0,N,6,1.0,0.0,W,,,,,,,,,,2021 data,The Irish Navy attempted a few times to collec...,
19188,120364,Ireland,4.0,N2,,53,36.0,0.0,N,5,56.0,0.0,W,,,,,,,,,,2021 data,The Irish Navy attempted a few times to collec...,
19189,120365,Ireland,4.0,N3,,53,44.0,0.0,N,5,25.0,0.0,W,,,,,,,,,,2021 data,The Irish Navy attempted a few times to collec...,


Let's inspect the unique units used by OSPAR:

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='unit', as_df=True)

Unnamed: 0,index,value
0,0,Bq/L
1,1,BQ/L
2,2,
3,3,Bq/kg f.w.
4,4,Bq/l
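
The idea behind collecting unique values across several dataframes can be sketched as follows. This is a simplified stand-in written for illustration, not the implementation of `get_unique_across_dfs` in `marisco.utils`. The helper name `unique_across` and the toy data are assumptions. Unlike the output above, this sketch drops missing values.

```python
import pandas as pd

def unique_across(dfs: dict, col_name: str) -> list:
    """Collect the sorted unique non-null values of `col_name` over all dataframes."""
    values = set()
    for df in dfs.values():
        if col_name in df.columns:  # Skip groups that lack the column
            values.update(df[col_name].dropna().unique())
    return sorted(values)

# Toy data (hypothetical), mimicking the inconsistent unit spellings
toy_dfs = {
    'SEAWATER': pd.DataFrame({'unit': ['Bq/l', 'BQ/L', None, 'Bq/L']}),
    'BIOTA': pd.DataFrame({'unit': ['Bq/kg f.w.', 'Bq/kg f.w.']}),
}
units = unique_across(toy_dfs, 'unit')
print(units)
```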


We define unit renaming rules for the OSPAR dataset:

In [None]:
#| export
# Define unit names renaming rules
renaming_unit_rules = {'Bq/l': 1, #'Bq/m3'
                       'Bq/L': 1,
                       'BQ/L': 1,
                       'Bq/kg f.w.': 5, # Bq/kgw
                       } 

Now we will create a callback `RemapUnitCB` to remap the units in the dataframes. For the `SEAWATER` dataset we will set a default unit of `Bq/l`. 

In [None]:
#| export
default_units = {'SEAWATER': 'Bq/l',
                 'BIOTA': 'Bq/kg f.w.'}

In [None]:
#| export
class RemapUnitCB(Callback):
    """Callback to update DataFrame 'UNIT' columns based on a lookup table."""

    def __init__(self,
                 lut: Dict[str, int],  # Provider unit name -> MARIS unit id
                 default_units: Dict[str, str] = default_units,
                 verbose: bool = False
                 ):
        fc.store_attr()  # Store all constructor arguments as attributes

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            # Apply default units to SEAWATER dataset
            if grp == 'SEAWATER':
                self._apply_default_units(df, unit=self.default_units.get(grp))
            self._print_na_units(df)
            self._update_units(df)

    def _apply_default_units(self, df: pd.DataFrame , unit = None):
        df.loc[df['unit'].isnull(), 'unit'] = unit

    def _print_na_units(self, df: pd.DataFrame):
        na_count = df['unit'].isnull().sum()
        if na_count > 0 and self.verbose:
            print(f"Number of rows with NaN in 'unit' column: {na_count}")

    def _update_units(self, df: pd.DataFrame):
        df['UNIT'] = df['unit'].apply(lambda x: self.lut.get(x, 'Unknown'))
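
To make the two steps of the callback concrete (fill missing units with a default, then look up the MARIS unit id), here is a minimal sketch on a toy dataframe. The data are invented for illustration, and the unit `mBq/l` is a hypothetical value used only to show the `'Unknown'` fallback.

```python
import pandas as pd

# Same renaming rules as above: provider unit name -> MARIS unit id
lut = {'Bq/l': 1, 'Bq/L': 1, 'BQ/L': 1, 'Bq/kg f.w.': 5}

# Toy data (hypothetical): one missing unit and one unit absent from the rules
df = pd.DataFrame({'unit': ['Bq/l', 'BQ/L', None, 'mBq/l']})

# Step 1: fill missing units with the default for the group
df.loc[df['unit'].isnull(), 'unit'] = 'Bq/l'

# Step 2: look up the MARIS id, falling back to 'Unknown' for unmapped units
df['UNIT'] = df['unit'].apply(lambda x: lut.get(x, 'Unknown'))
print(df)
```

Because unmapped units become the string `'Unknown'` rather than a number, inspecting `df['UNIT'].unique()` after the transformation is a quick way to spot units missing from the rules.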

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SanitizeValueCB(), # Remove blank value entries (also removes NaN values in Unit column) 
                            RemapUnitCB(renaming_unit_rules, verbose=True),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

print('Unique Unit values:')
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['UNIT'].unique()}")

10 invalid rows found in group 'SEAWATER' during sanitize value callback.


<b> Row Count Comparison Before and After Transformation:</b>

Unnamed: 0,BIOTA,SEAWATER
Number of rows in original dataframes (dfs):,15951,19193
Number of rows in transformed dataframes (tfm.dfs):,15951,19183
Number of rows removed (tfm.dfs_removed):,0,10


Unique Unit values:
BIOTA: [5]
SEAWATER: [1]


## Remap detection limit

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Value type` column contains many `nan` values. For the CSV data source the `value_type` column contains many entries with a value of `0`.

:::

In [None]:
#| eval: false
# Count the number of NaN entries in the 'value type' column for 'SEAWATER'
na_count_seawater = dfs['SEAWATER']['value type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'SEAWATER': {na_count_seawater}")

# Count the number of NaN entries in the 'value type' column for 'BIOTA'
na_count_biota = dfs['BIOTA']['value type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'BIOTA': {na_count_biota}")


Number of NaN 'Value type' entries in 'SEAWATER': 64
Number of NaN 'Value type' entries in 'BIOTA': 23


In the OSPAR dataset, the detection limit is indicated by `<` in the `Value type` column. When the `Value type` is `<`, the `Activity or MDA` column contains the detection limit value. Conversely, when the `Value type` is `=`, the `Activity or MDA` column contains the measurement value.

Let's examine the `Value type` column entries in the OSPAR dataset:

In [None]:
#| eval: false
for grp in dfs.keys():
    print(f'{grp}:')
    print(tfm.dfs[grp]['value type'].unique())


BIOTA:
['<' '=' nan]
SEAWATER:
['<' '=' nan]


Detection limits are encoded as follows in MARIS:

In [None]:
#| eval: false
pd.read_excel(detection_limit_lut_path())

Unnamed: 0,id,name,name_sanitized
0,-1,Not applicable,Not applicable
1,0,Not Available,Not available
2,1,=,Detected value
3,2,<,Detection limit
4,3,ND,Not detected
5,4,DE,Derived


We create a lambda function to retrieve the lookup table.

In [None]:
#| export
lut_dl = lambda: pd.read_excel(detection_limit_lut_path(), usecols=['name','id']).set_index('name').to_dict()['id']

In [None]:
#| eval: false
lut_dl()

{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}
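
The `set_index(...).to_dict()` idiom used in `lut_dl` can be tried without the Excel file. The sketch below builds a small in-memory copy of the lookup table (a subset of the rows shown above) and derives the same kind of `name -> id` dictionary. The variable names are illustrative.

```python
import pandas as pd

# In-memory stand-in for the MARIS detection limit lookup table (subset of rows)
lut_df = pd.DataFrame({
    'id': [-1, 0, 1, 2],
    'name': ['Not applicable', 'Not Available', '=', '<'],
    'name_sanitized': ['Not applicable', 'Not available', 'Detected value', 'Detection limit'],
})

# Index by name, convert to a nested dict {column: {index: value}}, keep the 'id' mapping
lut = lut_df[['name', 'id']].set_index('name').to_dict()['id']
print(lut)

# An equivalent, arguably more direct, construction
lut_alt = dict(zip(lut_df['name'], lut_df['id']))
```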

We define the columns of interest in both the `SEAWATER` and `BIOTA` dataframes for the detection limit column.

In [None]:
#| export
coi_dl = {'SEAWATER' : {'DL' : 'value type'},
          'BIOTA':  {'DL' : 'value type'}
          }

We create a callback `RemapDetectionLimitCB` to remap the detection limit values to MARIS format using the lookup table. Since the dataset contains 'nan' entries for the detection limit column, we will create a condition to set the detection limit to '=' when the value and uncertainty columns are present and the current detection limit value is not in the lookup keys.

In [None]:
#| export
class RemapDetectionLimitCB(Callback):
    """Remap detection limit values to MARIS format using a lookup table."""

    def __init__(self, coi: dict, fn_lut: Callable):
        """Initialize with column configuration and a function to get the lookup table."""
        fc.store_attr()        

    def __call__(self, tfm: Transformer):
        """Apply the remapping of detection limits across all dataframes"""
        lut = self.fn_lut()  # Retrieve the lookup table
        for grp, df in tfm.dfs.items():
            df['DL'] = df[self.coi[grp]['DL']]
            self._set_detection_limits(df, lut)

    def _set_detection_limits(self, df: pd.DataFrame, lut: dict):
        """Set detection limits based on value and uncertainty columns using specified conditions."""
        # Condition to set '=' when value and uncertainty are present and the current detection limit is not in the lookup keys
        condition_eq = (df['VALUE'].notna() & df['UNC'].notna() & ~df['DL'].isin(lut.keys()))
        df.loc[condition_eq, 'DL'] = '='

        # Set 'Not Available' for unmatched detection limits
        df.loc[~df['DL'].isin(lut.keys()), 'DL'] = 'Not Available'

        # Map existing detection limits using the lookup table
        df['DL'] = df['DL'].map(lut)
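
The three steps of `_set_detection_limits` can be followed on a toy dataframe. The values below are invented for illustration; the lookup dictionary is the one returned by `lut_dl()` above.

```python
import numpy as np
import pandas as pd

lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

# Toy data (hypothetical): rows 2 and 3 have no detection limit information
df = pd.DataFrame({
    'VALUE': [1.2, 0.5, 3.0, np.nan],
    'UNC':   [0.1, np.nan, 0.3, np.nan],
    'DL':    ['=', '<', np.nan, np.nan],
})

# Step 1: a value with an uncertainty but no recognised DL is treated as a measurement
has_value_and_unc = df['VALUE'].notna() & df['UNC'].notna()
unknown_dl = ~df['DL'].isin(list(lut))
df.loc[has_value_and_unc & unknown_dl, 'DL'] = '='

# Step 2: anything still unrecognised becomes 'Not Available'
df.loc[~df['DL'].isin(list(lut)), 'DL'] = 'Not Available'

# Step 3: map names to MARIS ids
df['DL'] = df['DL'].map(lut)
print(df['DL'].tolist())
```

Row 2 is promoted to `=` (id 1) because both value and uncertainty are present, while row 3 falls through to `Not Available` (id 0).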

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            NormalizeUncCB(),                  
                            RemapUnitCB(renaming_unit_rules, verbose=True),
                            RemapDetectionLimitCB(coi_dl, lut_dl)])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['DL'].unique()}")

10 invalid rows found in group 'SEAWATER' during sanitize value callback.
BIOTA: [2 1]
SEAWATER: [2 1]


## Remap Biota species

The OSPAR dataset contains biota species information in the `Species` column of the biota dataframe. To ensure consistency with MARIS standards, we need to remap these species names. We'll use the same approach as the one we employed for standardizing nuclide names.


We first inspect unique `Species` values used by OSPAR:

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='species', as_df=True).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41.0,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166
value,FUCUS SERRATUS,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,PORPHYRA UMBILICALIS,Trachurus trachurus,GALEUS MELASTOMUS,Gadus morhua,Mytilus Edulis,Pecten maximus,DICENTRARCHUS (MORONE) LABRAX,Clupea harengus,Boreogadus Saida,Homarus gammarus,Lophius piscatorius,Platichthys flesus,Hippoglossoides platessoides,BUCCINUM UNDATUM,DIPTURUS BATIS,Gadiculus argenteus,RAJA DIPTURUS BATIS,Mytilus edulis,Cerastoderma (Cardium) Edule,Argentina sphyraena,Unknown,Fucus Vesiculosus,MERLANGUIS MERLANGUIS,Lumpenus lampretaeformis,PALMARIA PALMATA,Eutrigla gurnardus,PECTINIDAE,Flatfish,SEBASTES MARINUS,Sebastes vivipares,Limanda limanda,Buccinum undatum,Dicentrarchus labrax,Clupea harengus,SALMO SALAR,FUCUS spp,Salmo salar,Pollachius pollachius,SCOPHTHALMUS RHOMBUS,,OSTREA EDULIS,Sebastes Mentella,ASCOPHYLLUM NODOSUM,Cyclopterus lumpus,CRASSOSTREA GIGAS,MERLANGIUS MERLANGUS,Glyptocephalus cynoglossus,Fucus sp.,Pollachius virens,Cerastoderma edule,SPRATTUS SPRATTUS,RHODYMENIA spp,Raja montagui,"Mixture of green, red and brown algae",Fucus vesiculosus,Reinhardtius hippoglossoides,FUCUS SPIRALIS,PLATICHTHYS FLESUS,Nephrops norvegicus,Sprattus sprattus,Merlangius merlangus,MELANOGRAMMUS AEGLEFINUS,Argentina silus,Sepia spp.,MOLVA MOLVA,Coryphaenoides rupestris,Phoca vitulina,Trisopterus minutus,Gadus sp.,Gadus Morhua,ANARHICHAS LUPUS,Gadiculus argenteus thori,Anguilla anguilla,HIPPOGLOSSOIDES PLATESSOIDES,Anarhichas denticulatus,Pleuronectiformes [order],Clupea Harengus,Ostrea Edulis,Fucus distichus,Thunnus thynnus,Sebastes mentella,LITTORINA LITTOREA,FUCUS VESICULOSUS,PATELLA,PLUERONECTES PLATESSA,LIMANDA LIMANDA,Sebastes viviparus,Penaeus vannamei,ETMOPTERUS SPINAX,Trisopterus esmarkii,Galeus melastomus,Merluccius merluccius,Thunnus sp.,HIPPOGLOSSUS HIPPOGLOSSUS,GADUS MORHUA,Melanogrammus aeglefinus,MYTILUS EDULIS,Micromesistius poutassou,Molva molva,Capros aper,SEBASTES MENTELLA,Crassostrea gigas,MERLUCCIUS MERLUCCIUS,Tapes sp.,Merlangius Merlangus,Boreogadus saida,NUCELLA 
LAPILLUS,Anarhichas lupus,Ascophyllum nodosum,Anarhichas minor,Gadus morhua,EUTRIGLA GURNARDUS,BROSME BROSME,Rhodymenia spp.,CHIMAERA MONSTROSA,Sebastes norvegicus,REINHARDTIUS HIPPOGLOSSOIDES,Littorina littorea,Lycodes vahlii,unknown,PELVETIA CANALICULATA,FUCUS SPP.,GLYPTOCEPHALUS CYNOGLOSSUS,MOLVA DYPTERYGIA,Brosme brosme,CERASTODERMA (CARDIUM) EDULE,Melanogrammus aeglefinus,TRACHURUS TRACHURUS,Trisopterus esmarki,PATELLA VULGATA,SCOMBER SCOMBRUS,Pleuronectes platessa,Hippoglossus hippoglossus,Microstomus kitt,PLEURONECTES PLATESSA,Scomber scombrus,Sebastes marinus,CLUPEA HARENGUS,PECTEN MAXIMUS,Hyperoplus lanceolatus,ASCOPHYLLUN NODOSUM,Gaidropsarus argenteus,Mallotus villosus,Dasyatis pastinaca,OSILINUS LINEATUS,RAJIDAE/BATOIDEA,CYCLOPTERUS LUMPUS,Ostrea edulis,Fucus serratus,Phycis blennoides,Squalus acanthias,Patella sp.,MICROMESISTIUS POUTASSOU,Modiolus modiolus,Solea solea (S.vulgaris),POLLACHIUS VIRENS,MERLUCCIUS MERLUCCIUS,SOLEA SOLEA (S.VULGARIS),Pelvetia canaliculata,LAMINARIA DIGITATA,Pleuronectes platessa,MONODONTA LINEATA,Limanda Limanda,Sardina pilchardus,BOREOGADUS SAIDA


We try to remap the `species` column to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='species_ospar.pkl')

In this step, we generate a lookup table using the `remapper` object. The lookup table maps data provider entries to MARIS entries using fuzzy matching. After generating the table, we select the matches whose score is at least 1, that is, the matches that required at least one character correction and therefore deserve a visual check.

- **`generate_lookup_table(as_df=True)`**: This method generates the lookup table and returns it as a DataFrame. It uses fuzzy matching to align entries from the data provider with those in the MARIS lookup table.
- **`select_match(match_score_threshold=1)`**: This method filters the generated lookup table to include only those matches with a score greater than or equal to the specified threshold. A threshold of 1 therefore hides perfect matches (score 0) and shows only the imperfect ones.
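
To give an intuition for what a match score means, the sketch below computes a Levenshtein edit distance on lower-cased names and picks the closest candidate. This is an assumption about how the score behaves, not a description of the `Remapper` internals, which are not shown here. The functions `levenshtein` and `best_match` and the short candidate list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: list) -> tuple:
    """Return (score, candidate) for the closest candidate, ignoring case."""
    return min((levenshtein(name.lower(), c.lower()), c) for c in candidates)

# Hypothetical excerpt of standardized names
maris_names = ['Gadus morhua', 'Ascophyllum nodosum', 'Mytilus edulis']

for provider_name in ['Gadus Morhua', 'ASCOPHYLLUN NODOSUM']:
    score, match = best_match(provider_name, maris_names)
    flag = 'needs review' if score >= 1 else 'perfect match'
    print(f'{provider_name!r} -> {match!r} (score {score}, {flag})')
```

Under this reading, a score of 0 is an exact match up to letter case, and each additional point corresponds to one character that had to be changed.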

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/167 [00:00<?, ?it/s]

Processing:  16%|█▌        | 26/167 [00:15<08:43,  3.71s/it]

Below, we fix the entries that are not properly matched by the `Remapper` object:

In [None]:
#| export
fixes_biota_species = {
    'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': NA,  # Mix of species, no direct mapping
    'Mixture of green, red and brown algae': NA,  # Mix of species, no direct mapping
    'Solea solea (S.vulgaris)': 'Solea solea',
    'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
    'RAJIDAE/BATOIDEA': NA, #Mix of species, no direct mapping
    'PALMARIA PALMATA': NA,  # Not defined
    'Unknown': NA,
    'unknown': NA,
    'Flatfish': NA,
    'Gadus sp.': NA,  # Not defined
}

We now attempt remapping again, incorporating the `fixes_biota_species` dictionary:

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_species)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/167 [00:00<?, ?it/s]

Processing:  44%|████▎     | 73/167 [00:21<00:32,  2.88it/s]

On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='species_ospar.pkl').generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of a `SPECIES` column to our `BIOTA` dataframe, containing standardized species IDs.


In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([ 377,  129,   96,    0,  192,   99,   50,  378,  270,  379,  380,
        381,  382,  383,  384,  385,  244,  386,  387,  388,  389,  390,
        391,  392,  393,  394,  395,  396,  274,  397,  398,  243,  399,
        400,  401,  402,  403,  404,  405,  406,  407,  191,  139,  408,
        410,  412,  413,  272,  414,  415,  416,  417,  418,  419,  420,
        421,  422,  423,  424,  425,  426,  427,  428,  411,  429,  430,
        431,  432,  433,  434,  435,  436,  437,  438,  439,  440,  441,
        442,  443,  444,  294, 1684, 1610, 1609, 1605, 1608,   23, 1606,
        234,  556, 1701, 1752,  158,  223])

## Enhance Species Data Using Biological group column
The `Biological group` column in the OSPAR dataset provides valuable insights related to species. We will leverage this information to enrich the `species` column. To achieve this, we will employ the generic `RemapCB` callback to create an `enhanced_species` column. Subsequently, this `enhanced_species` column will be used to further enrich the `species` column.

First we inspect the unique values in the `biological group` column.

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='biological group', as_df=True)

Unnamed: 0,index,value
0,0,FISH
1,1,SEAWEED
2,2,seaweed
3,3,Seaweeds
4,4,Fish
5,5,Molluscs
6,6,MOLLUSCS
7,7,Seaweed
8,8,fish
9,9,molluscs


We will remap the `biological group` column's data to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological group', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='enhance_species_ospar.pkl')

As before, we generate the lookup table and select the matches with a score of at least 1, that is, the matches requiring at least one character change.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/10 [00:00<?, ?it/s]

Processing: 100%|██████████| 10/10 [00:01<00:00,  6.60it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FISH,Fucus,FISH,4
Fish,Fucus,Fish,4
fish,Fucus,fish,4
Seaweeds,Seaweed,Seaweeds,1
Molluscs,Mollusca,Molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1
molluscs,Mollusca,molluscs,1


We can see that some of the entries require manual corrections.

In [None]:
#| export
fixes_enhanced_biota_species = {
    'fish': 'Pisces',
    'FISH': 'Pisces',
    'Fish': 'Pisces'    
}


Now we will apply the manual corrections to the lookup table and generate the lookup table again.

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_enhanced_biota_species)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/10 [00:00<?, ?it/s]

Processing: 100%|██████████| 10/10 [00:01<00:00,  9.19it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Seaweeds,Seaweed,Seaweeds,1
Molluscs,Mollusca,Molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1
molluscs,Mollusca,molluscs,1


On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota_enhanced = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological group', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='enhance_species_ospar.pkl').generate_lookup_table(fixes=fixes_enhanced_biota_species, as_df=False, overwrite=False)

Now let's see the species that are not matched by the `LookupBiogroupCB` callback.

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of an `enhanced_species` column to our `BIOTA` dataframe, containing standardized species IDs.

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['enhanced_species'].unique()

array([ 873, 1059,  712])

Now that we have the `enhanced_species` column, we can use it to enrich the `SPECIES` column. We will use the enhanced species column in the absence of a species match if the enhanced species column is valid. 

In [None]:
#| export
class EnhanceSpeciesCB(Callback):
    """Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met."""

    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        self._enhance_species(tfm.dfs['BIOTA'])

    def _enhance_species(self, df: pd.DataFrame):
        df['SPECIES'] = df.apply(
            lambda row: row['enhanced_species'] if row['SPECIES'] in [-1, 0] and pd.notnull(row['enhanced_species']) else row['SPECIES'],
            axis=1
        )
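
The rule implemented by `EnhanceSpeciesCB` can be checked on a toy dataframe. The sketch below uses a vectorized boolean mask instead of the row-wise `apply`. This is an alternative formulation for illustration, not the code used by the handler, and the ids are taken from the outputs above purely as example values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows): 0 and -1 stand for unmatched species
df = pd.DataFrame({
    'SPECIES':          [377, 0, -1, 0],
    'enhanced_species': [712, 712, 1059, np.nan],
})

# Rows where the species is unmatched and a valid enhanced species is available
needs_enhancing = df['SPECIES'].isin([-1, 0]) & df['enhanced_species'].notna()

# Overwrite only those rows; all other rows keep their original species id
df.loc[needs_enhancing, 'SPECIES'] = df.loc[needs_enhancing, 'enhanced_species'].astype(int)
print(df['SPECIES'].tolist())
```

The last row stays at 0 because no enhanced species is available for it.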

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),
    EnhanceSpeciesCB()
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([ 377,  129,   96,  712,  192,   99,   50,  378,  270,  379,  380,
        381,  382,  383,  384,  385,  244,  386,  387,  388,  389,  390,
        391,  392,  393,  394,  395,  396,  274,  397,  398,  243,  399,
        400,  401,  402,  403,  404,  405,  406,  407, 1059,  191,  139,
        408,  410,  412,  413,  272,  414,  415,  416,  417,  418,  419,
        420,  421,  422,  423,  424,  425,  426,  427,  428,  411,  429,
        430,  431,  432,  433,  434,  435,  436,  437,  438,  439,  440,
        441,  442,  443,  444,  294, 1684, 1610, 1609, 1605, 1608,   23,
       1606,  234,  556, 1701, 1752,  158,  223])

All entries are matched for the `SPECIES` column.

## Remap Biota tissues

The OSPAR dataset includes entries where the `Body Part` is labeled as `whole`. However, the MARIS data standard requires a more specific distinction in the `body_part` field, differentiating between `Whole animal` and `Whole plant`. Fortunately, the OSPAR data provides a `Biological group` field that allows us to make this distinction.

To address this discrepancy and ensure compatibility with MARIS standards, we will:

1. Create a temporary column `body_part_temp` that combines information from both `Body Part` and `Biological group`.
2. Use this temporary column to perform the lookup using our `Remapper` object.

Let's create the temporary column, `body_part_temp`, that combines `Body Part` and `Biological group`.

In [None]:
#| export
class AddBodypartTempCB(Callback):
    "Add a temporary column with the body part and biological group combined."    
    def __call__(self, tfm):
        tfm.dfs['BIOTA']['body_part_temp'] = (
            tfm.dfs['BIOTA']['body part'] + ' ' + 
            tfm.dfs['BIOTA']['biological group']
            ).str.strip().str.lower()                                 
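
The string operations in `AddBodypartTempCB` can be tried on a toy dataframe. The data below are invented. In particular, the trailing space in `'WHOLE '` is an assumption about one plausible way to end up with a double space in the combined value. The final whitespace normalization is an optional extra shown for comparison and is not part of the handler.

```python
import pandas as pd

# Toy data (hypothetical), including a body part with a trailing space
df = pd.DataFrame({
    'body part':        ['Whole animal', 'WHOLE ', 'Muscle'],
    'biological group': ['Molluscs', 'Fish', 'FISH'],
})

# Same operations as the callback: concatenate, strip the ends, lower-case
combined = (df['body part'] + ' ' + df['biological group']).str.strip().str.lower()
print(combined.tolist())

# Optional: collapse runs of whitespace so near-duplicates share a single key
normalized = combined.str.replace(r'\s+', ' ', regex=True)
print(normalized.tolist())
```

Note that `str.strip()` only trims the ends of the combined string, so inner double spaces survive unless they are collapsed explicitly.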

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            ])
dfs_test = tfm()
dfs_test['BIOTA']['body_part_temp'].unique()


array(['whole animal molluscs', 'whole plant seaweed', 'whole fish fish',
       'flesh without bones fish', 'whole animal fish', 'muscle fish',
       'head fish', 'soft parts molluscs', 'growing tips seaweed',
       'soft parts fish', 'unknown fish', 'flesh without bone fish',
       'flesh fish', 'flesh with scales fish', 'liver fish',
       'flesh without bones seaweed', 'whole  fish',
       'flesh without bones molluscs', 'whole  seaweed',
       'whole plant seaweeds', 'whole fish', 'whole without head fish',
       'mix of muscle and whole fish without liver fish',
       'whole fisk fish', 'muscle  fish', 'cod medallion fish',
       'tail and claws fish'], dtype=object)

To align the ``body_part_temp`` column with the ``bodypar`` column in the MARIS nomenclature, we utilize a Remapper object. Since the OSPAR dataset does not include a predefined lookup table for the ``body_part`` column, we first create a lookup table by extracting unique values from the ``body_part_temp`` column.

In [None]:
#| eval: false
get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True).head()

Unnamed: 0,index,value
0,0,liver fish
1,1,head fish
2,2,flesh without bone fish
3,3,whole animal molluscs
4,4,whole fish


We try to remap the `body_part_temp` column to the `bodypar` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='tissues_ospar.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=0, verbose=True).T)

Processing:   0%|          | 0/27 [00:00<?, ?it/s]

Processing: 100%|██████████| 27/27 [00:00<00:00, 87.11it/s]

0 entries matched the criteria, while 27 entries had a match score of 0 or higher.





source_key,mix of muscle and whole fish without liver fish,whole without head fish,tail and claws fish,cod medallion fish,unknown fish,soft parts molluscs,flesh without bones molluscs,whole animal molluscs,whole plant seaweeds,whole fisk fish,whole fish fish,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,whole seaweed,flesh fish,muscle fish,head fish,muscle fish.1,whole animal fish,flesh without bones fish,whole fish,soft parts fish,whole fish.1,flesh with scales fish,liver fish,flesh without bone fish
matched_maris_name,Flesh without bones,Flesh without bones,Stomach and intestine,Old leaf,Growing tips,Soft parts,Flesh without bones,Whole animal,Whole plant,Whole animal,Whole animal,Flesh without bones,Growing tips,Whole plant,Whole plant,Shells,Muscle,Head,Muscle,Whole animal,Flesh without bones,Whole animal,Soft parts,Whole animal,Flesh with scales,Liver,Flesh without bones
source_name,mix of muscle and whole fish without liver fish,whole without head fish,tail and claws fish,cod medallion fish,unknown fish,soft parts molluscs,flesh without bones molluscs,whole animal molluscs,whole plant seaweeds,whole fisk fish,whole fish fish,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,whole seaweed,flesh fish,muscle fish,head fish,muscle fish,whole animal fish,flesh without bones fish,whole fish,soft parts fish,whole fish,flesh with scales fish,liver fish,flesh without bone fish
match_score,31,13,13,13,9,9,9,9,9,9,9,8,8,8,7,7,6,5,5,5,5,5,5,5,5,5,4


Many of the lookup entries are sufficient for our needs. However, for values that don't find a suitable match, we can use the `fixes_biota_tissues` dictionary to apply manual corrections. First we will create the dictionary.

In [None]:
#| export
fixes_biota_tissues = {
    'whole seaweed' : 'Whole plant',
    'flesh fish': 'Flesh with bones', # We assume it as the category 'Flesh with bones' also exists
    'unknown fish' : NA,
    'cod medallion fish' : NA, # TO BE DETERMINED
    'mix of muscle and whole fish without liver fish' : NA, # TO BE DETERMINED
    'whole without head fish' : NA, # TO BE DETERMINED
    'flesh without bones seaweed' : NA, # TO BE DETERMINED
    'tail and claws fish' : NA # TO BE DETERMINED
}

Now we will generate the lookup table and apply the manual corrections of the ``fixes_biota_tissues`` dictionary.


In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_tissues)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing: 100%|██████████| 27/27 [00:00<00:00, 52.33it/s]

1 entries matched the criteria, while 26 entries had a match score of 1 or higher.





source_key,flesh without bones molluscs,whole plant seaweeds,whole animal molluscs,whole fisk fish,whole fish fish,soft parts molluscs,whole plant seaweed,growing tips seaweed,whole seaweed,muscle fish,liver fish,flesh without bones fish,whole animal fish,muscle fish.1,head fish,soft parts fish,flesh with scales fish,whole fish,whole fish.1,flesh without bone fish,unknown fish,mix of muscle and whole fish without liver fish,whole without head fish,cod medallion fish,tail and claws fish,flesh without bones seaweed
matched_maris_name,Flesh without bones,Whole plant,Whole animal,Whole animal,Whole animal,Soft parts,Whole plant,Growing tips,Whole plant,Muscle,Liver,Flesh without bones,Whole animal,Muscle,Head,Soft parts,Flesh with scales,Whole animal,Whole animal,Flesh without bones,(Not available),(Not available),(Not available),(Not available),(Not available),(Not available)
source_name,flesh without bones molluscs,whole plant seaweeds,whole animal molluscs,whole fisk fish,whole fish fish,soft parts molluscs,whole plant seaweed,growing tips seaweed,whole seaweed,muscle fish,liver fish,flesh without bones fish,whole animal fish,muscle fish,head fish,soft parts fish,flesh with scales fish,whole fish,whole fish,flesh without bone fish,unknown fish,mix of muscle and whole fish without liver fish,whole without head fish,cod medallion fish,tail and claws fish,flesh without bones seaweed
match_score,9,9,9,9,9,9,8,8,7,6,5,5,5,5,5,5,5,5,5,4,2,2,2,2,2,2


At this stage, most entries have been matched to the MARIS nomenclature; the few that remain unmatched are marked as not available. We can now proceed with the final remapping:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.

In [None]:
#| export
lut_bodyparts = lambda: Remapper(provider_lut_df=get_unique_across_dfs(tfm.dfs, col_name='body_part_temp', as_df=True),
                               maris_lut_fn=bodyparts_lut_path,
                               maris_col_id='bodypar_id',
                               maris_col_name='bodypar',
                               provider_col_to_match='value',
                               provider_col_key='value',
                               fname_cache='tissues_ospar.pkl'
                               ).generate_lookup_table(fixes=fixes_biota_tissues, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This adds a `BODY_PART` column to the `BIOTA` dataframe, containing standardized body part IDs.

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA')
                            ])
tfm()
tfm.dfs['BIOTA']['BODY_PART'].unique()

array([ 1, 40, 52, 34, 13, 19, 56,  0,  4, 60, 25])
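
The pattern above passes a callable (`lut_bodyparts`) rather than a ready-made table, so the lookup is only built when the callback needs it. A minimal, self-contained sketch of that idea follows. `make_lut`, `remap` and the numeric IDs are hypothetical stand-ins for illustration; they are not the `Remapper`/`RemapCB` implementation, nor real MARIS IDs.

```python
from typing import Callable, Dict, List

def make_lut() -> Dict[str, int]:
    "Build the lookup table on demand (stands in for an expensive matching step)."
    # Hypothetical IDs for illustration only; real IDs come from the MARIS lookup table.
    return {'muscle fish': 1, 'liver fish': 2, 'unknown fish': 0}

def remap(values: List[str], fn_lut: Callable[[], Dict[str, int]], default: int = -1) -> List[int]:
    "Call `fn_lut` once, then translate every value; unknown keys map to `default`."
    lut = fn_lut()
    return [lut.get(v, default) for v in values]

codes = remap(['muscle fish', 'liver fish', 'cod medallion fish'], fn_lut=make_lut)
print(codes)
```

Deferring the table construction keeps module import cheap and lets the lookup be cached or rebuilt independently of the callback that consumes it.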

## Remap biogroup

The MARIS species lookup table contains a ``biogroup_id`` column that associates each species with its corresponding ``biogroup``. We will leverage this relationship to create a ``BIO_GROUP`` column in the ``BIOTA`` DataFrame.

In [None]:
#| export
lut_biogroup_from_biota = lambda: get_lut(src_dir=species_lut_path().parent, fname=species_lut_path().name, 
                               key='species_id', value='biogroup_id')

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[ 
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),
    EnhanceSpeciesCB(),
    RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
    ])

print(tfm()['BIOTA']['BIO_GROUP'].unique())


[14 11  4 13 12  2  5]
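
Conceptually, this step is a key/value mapping from `species_id` to `biogroup_id`. The sketch below uses a made-up excerpt of the species table; the IDs are hypothetical, and `get_lut` may well be implemented differently.

```python
import pandas as pd

# Hypothetical excerpt of a species lookup table (IDs are made up for illustration)
species_lut = pd.DataFrame({
    'species_id': [10, 20, 30],
    'biogroup_id': [4, 4, 11],
})

# Key/value lookup: species_id -> biogroup_id
lut = dict(zip(species_lut['species_id'], species_lut['biogroup_id']))

# Derive the biogroup from an already remapped SPECIES column
biota = pd.DataFrame({'SPECIES': [20, 30, 10, 30]})
biota['BIO_GROUP'] = biota['SPECIES'].map(lut)
print(biota)
```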


## Add Laboratory ID (REVIEW)

:::{.callout-tip}

**FEEDBACK FOR NEXT VERSION**: Addition of the laboratory ID column requires the lookup table to be sanitized. 

:::

Let's use the `get_unique_across_dfs` utility function to review the unique values of the `data provider` column (the laboratories) in the OSPAR dataset:

In [None]:
#| eval: false
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='data provider', as_df=True).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26.0,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
value,SSM,"Federal Maritime and Hydrographic Agency, Hamburg",IRSN : LS3E/RSMASS,Johann Heinrich von Thuenen Institute (vTI),Institut de Radioprotection et Sûreté Nucléair...,Norwegian Radioaton Protection Authority,Instiute of Marine Research,Institute of Marine Research/Norwegian Radiati...,Institut de Radioprotection et Sûreté Nucléair...,Institut de Radioprotection et Sûreté Nucléair...,Institut de Radioprotection et Sûreté Nucléair...,Radiological Protection Institute of Ireland,Norwegian Radiaton Protection Authority,Insititute for Marine Research,Institut de Radioprotection et Sûreté Nucléair...,BEIS (formerly DECC),Norwegian Radiation and Nuclear Safety Authority,Institut de Radioprotection et Sûreté Nucléair...,Johann Heinrich von Thünen Institute (vTI),Risø-DTU,Insitute of Marine Research,Rijkswaterstaat Centre for Water Management,Intitute for Marine Research,Insititute for Energy Technology,Institut de Radioprotection et Sûreté Nucléair...,"Institute for Energy Technology, Kjeller, Norway",,Institut de Radioprotection et Sûreté Nucléair...,Endeavour 10/2004,DTU ENV,Institut de Radioprotection et Sûreté Nucléair...,FSA-Food Standards Agency,Institut de Radioprotection et Sûreté Nucléair...,Nuclear Safety Council,Institut de Radioprotection et Sûreté Nucléair...,IRSN : OPRI-LVRE/MN,"Defra-Department for Environment, Food and Rur...",Institut de Radioprotection et Sûreté Nucléair...,Icelandic Radiation Safety Authority,SL-Sellafield Ltd,Institut de Radioprotection et Sûreté Nucléair...,NorwegiaN Radiation Protection Authority,Nuclear Energy Research centre,IRSN-LVRE,IFE,IRSN : OPRI-LVRE,Norwegian Radiation Protection Authority,"DTU Nutech, DK",EA-Environment Agency,EA - Environment Agency,Institute for Marine Research/Norweigian Radia...,The Norwegian Food Control Authority,Institut de Radioprotection et Sûreté Nucléair...,Radiological Protection Instiute of Ireland,IRSN : LRC/LS3E/RSMASS,Institute for Marine 
Research,Institute for energy technology,Institute for Energy Technology,IRSN : LVRE,Rijkswaterstaat Laboratory CIV,IRSN : LERFA,Institute for marine research,Institute of Energy Technology,Institut de Radioprotection et Sûreté Nucléair...,IRSN-LRC,IRSN : OPRI/MN,Institut de Radioprotection et Sûreté Nucléair...,IMR,Institute for Energy technology,Institut de Radioprotection et Sûreté Nucléair...,Norweigian Radiation Protection Authority,Scientific Institute of Public Health,Institut de Radioprotection et Sûreté Nucléair...,NRPA,Institute of Marine Research,Institut de Radioprotection et Sûreté Nucléair...,Institut de Radioprotection et Sûreté Nucléair...,NIEA-Northern Ireland Environment Agency,Johann Heinrich von ThŸnen Institute (vTI),IRSN : LVRE/MN,SCK•CEN,Institut de Radioprotection et Sûreté Nucléair...,Institut de Radioprotection et Sûreté Nucléair...,IRSN : LS3E/Marine Nationale,Institut de Radioprotection et Sûreté Nucléair...,DTU SUS,IRSN : OPRI/DDASS,BEIS,IFE/NRPA,Corystes 14/2004,IRSN : OPRI,IRSN : LS3E,DTU Nutech,Institut de Radioprotection et Sûreté Nucléair...,Swedish Radiation Safety Authority,Institut de Radioprotection et Sûreté Nucléair...,IRSN : LVRE/RSMASS,Institut de Radioprotection et Sûreté Nucléair...,SEPA-Scottish Environment Protection Agency,Environmental Protection Agency,SCKCEN,IRSN : LRC


The `LAB` information could be included with a little work. 
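
Part of that work is collapsing spelling variants of the same laboratory. The sketch below shows one possible first pass using only case, whitespace and hyphen normalization. `normalize_lab` is a hypothetical helper, not part of marisco, and the names are taken from the listing above.

```python
import re
from collections import defaultdict

def normalize_lab(name: str) -> str:
    "Lowercase, trim, collapse whitespace and unify spacing around hyphens."
    name = re.sub(r'\s+', ' ', name.strip().lower())
    return re.sub(r'\s*-\s*', '-', name)

# Spelling variants taken from the listing above
raw = [
    'Institute for energy technology',
    'Institute for Energy Technology',
    'Institute for Energy technology',
    'EA-Environment Agency',
    'EA - Environment Agency',
    'Institute for Marine \nResearch',
    'Institute for marine research',
]

# Group the raw names by their normalized form
groups = defaultdict(list)
for name in raw:
    groups[normalize_lab(name)].append(name)

for key, variants in groups.items():
    print(f'{key!r}: {len(variants)} variant(s)')
```

Misspellings such as `Instiute` or abbreviations such as `IMR` would still need a manual fixes table, similar to the one used for body parts.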

## Add Sample ID (REVIEW)

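The OSPAR dataset includes an `ID` column (named `id` in the loaded dataframes). The callback below copies it to an `ID` column and reports any non-numeric values.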

In [None]:
#| export
class AddSampleIdCB(Callback):
    "Create a SMP_ID column from the ID column"
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            if 'id' in df.columns:
                df['ID'] = df['id']                
                # Check that the ID is an integer or float.
                if not pd.api.types.is_numeric_dtype(df['ID']):
                    print(f"Non-numeric values detected in 'ID' column of dataframe '{grp}':")
                    print(f"Data type: {df['ID'].dtype}")
                    print("Unique values:", df['ID'].unique())

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            AddSampleIdCB(),
                            CompareDfsAndTfmCB(dfs)

                            ])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['ID'].unique()}")

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
    

BIOTA: [    1     2     3 ... 98060 98061 98062]
SEAWATER: [     1      2      3 ... 120366 120367 120368]
                                                    BIOTA  SEAWATER
Number of rows in original dataframes (dfs):        15951     19193
Number of rows in transformed dataframes (tfm.d...  15951     19193
Number of rows removed (tfm.dfs_removed):               0         0 



## Add depth

The OSPAR dataset includes a sampling depth column (`Sampling depth`) for the `SEAWATER` group. In this section, we create a callback that stores it as `SMP_DEPTH` in the MARIS dataset.

In [None]:
#| export
class AddDepthCB(Callback):
    "Ensure depth values are floats and add 'SMP_DEPTH' columns."
    def __call__(self, tfm: Transformer):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                if 'sampling depth' in df.columns:
                    df['SMP_DEPTH'] = df['sampling depth'].astype(float)

In [None]:
#| eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
    AddDepthCB()
    ])
tfm()
for grp in tfm.dfs.keys():  
    if 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())

SEAWATER:        SMP_DEPTH
0            3.0
80           2.0
81          21.0
85          31.0
87          32.0
...          ...
16022       71.0
16023       66.0
16025       81.0
16385     1660.0
16389     1500.0

[134 rows x 1 columns]
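
`astype(float)` raises an error as soon as one value cannot be parsed. If the raw depths ever contained free-text entries (an assumption, since the output above shows none), a more lenient variant could coerce them to `NaN` instead. A small sketch with made-up values:

```python
import pandas as pd

# Made-up raw values, including one free-text entry and one missing value
df = pd.DataFrame({'sampling depth': ['3', '21.5', 'surface', None]})

# Values that cannot be parsed become NaN instead of raising an error
df['SMP_DEPTH'] = pd.to_numeric(df['sampling depth'], errors='coerce')
print(df)
```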


## Standardize Coordinates

The OSPAR dataset offers coordinates in degrees, minutes, and seconds (DMS). The following callback is designed to convert DMS to decimal degrees. 

In [None]:
#| export
class ConvertLonLatCB(Callback):
    """Convert Coordinates to decimal degrees (DDD.DDDDD°)."""
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            df['LAT'] = self._convert_latitude(df)
            df['LON'] = self._convert_longitude(df)

    def _convert_latitude(self, df: pd.DataFrame) -> np.ndarray:
        return np.where(
            df['latdir'].isin(['S']),
            self._dms_to_decimal(df['latd'], df['latm'], df['lats']) * -1,
            self._dms_to_decimal(df['latd'], df['latm'], df['lats'])
        )

    def _convert_longitude(self, df: pd.DataFrame) -> np.ndarray:
        return np.where(
            df['longdir'].isin(['W']),
            self._dms_to_decimal(df['longd'], df['longm'], df['longs']) * -1,
            self._dms_to_decimal(df['longd'], df['longm'], df['longs'])
        )

    def _dms_to_decimal(self, degrees: pd.Series, minutes: pd.Series, seconds: pd.Series) -> pd.Series:
        return degrees + minutes / 60 + seconds / 3600


In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB()
                            ])
tfm()

with pd.option_context('display.max_columns', None):
    display(tfm.dfs['SEAWATER'][['LAT','latd', 'latm', 'lats', 'LON', 'latdir', 'longd', 'longm','longs', 'longdir']])

Unnamed: 0,LAT,latd,latm,lats,LON,latdir,longd,longm,longs,longdir
0,51.375278,51,22.0,31.0,3.188056,N,3,11.0,17.0,E
1,51.223611,51,13.0,25.0,2.859444,N,2,51.0,34.0,E
2,51.184444,51,11.0,4.0,2.713611,N,2,42.0,49.0,E
3,51.420278,51,25.0,13.0,3.262222,N,3,15.0,44.0,E
4,51.416111,51,24.0,58.0,2.809722,N,2,48.0,35.0,E
...,...,...,...,...,...,...,...,...,...,...
19188,53.600000,53,36.0,0.0,-5.933333,N,5,56.0,0.0,W
19189,53.733333,53,44.0,0.0,-5.416667,N,5,25.0,0.0,W
19190,53.650000,53,39.0,0.0,-5.233333,N,5,14.0,0.0,W
19191,53.883333,53,53.0,0.0,-5.550000,N,5,33.0,0.0,W
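
The conversion follows `decimal = degrees + minutes / 60 + seconds / 3600`, negated for the southern and western hemispheres. As a worked check on two rows of the table above, here is a scalar sketch; `dms_to_decimal` is a hypothetical standalone helper that mirrors the callback's arithmetic:

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, direction: str) -> float:
    "Convert degrees/minutes/seconds to signed decimal degrees (S and W are negative)."
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if direction in ('S', 'W') else decimal

lat = dms_to_decimal(51, 22, 31, 'N')   # latitude of row 0
lon = dms_to_decimal(5, 56, 0, 'W')     # longitude of row 19188
print(round(lat, 6), round(lon, 6))
```

Rounded to six decimals, this gives 51.375278 and -5.933333, matching the `LAT` and `LON` values shown for those rows.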


The `SanitizeLonLatCB` callback drops a row when both longitude and latitude equal 0, or when the coordinates are unrealistic. It also converts `,` decimal separators in longitude and latitude to `.`.

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()

display(Markdown("<b> Row Count Comparison Before and After Transformation:</b>"))
with pd.option_context('display.max_rows', None):
    display(pd.DataFrame.from_dict(tfm.compare_stats))

with pd.option_context('display.max_columns', None):
    display(tfm.dfs['SEAWATER'][['LAT','LON']])

<b> Row Count Comparison Before and After Transformation:</b>

Unnamed: 0,BIOTA,SEAWATER
Number of rows in original dataframes (dfs):,15951,19193
Number of rows in transformed dataframes (tfm.dfs):,15951,19193
Number of rows removed (tfm.dfs_removed):,0,0


Unnamed: 0,LAT,LON
0,51.375278,3.188056
1,51.223611,2.859444
2,51.184444,2.713611
3,51.420278,3.262222
4,51.416111,2.809722
...,...,...
19188,53.600000,-5.933333
19189,53.733333,-5.416667
19190,53.650000,-5.233333
19191,53.883333,-5.550000
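
No rows are removed here because all OSPAR coordinates are valid. To illustrate the sanitization rules on data that does need cleaning, here is a simplified stand-in. It is not marisco's `SanitizeLonLatCB` implementation; the validity ranges (|LAT| ≤ 90, |LON| ≤ 180) are an assumption about what "unrealistic" means, and the sample values are made up.

```python
import pandas as pd

def sanitize_lon_lat(df: pd.DataFrame) -> pd.DataFrame:
    "Illustrative stand-in for coordinate sanitization."
    out = df.copy()
    for col in ('LON', 'LAT'):
        # Accept both '12,5' and '12.5' style decimals
        out[col] = out[col].astype(str).str.replace(',', '.', regex=False).astype(float)
    both_zero = (out['LON'] == 0) & (out['LAT'] == 0)
    # Assumed validity ranges: |LAT| <= 90 and |LON| <= 180
    in_range = out['LAT'].between(-90, 90) & out['LON'].between(-180, 180)
    return out[~both_zero & in_range]

df = pd.DataFrame({
    'LON': ['3,188056', '0', '200.0', '-5.55'],
    'LAT': ['51.375278', '0', '10.0', '53,883333'],
})
clean = sanitize_lon_lat(df)
print(clean)
```

The (0, 0) row and the row with a longitude of 200 are dropped; the two remaining rows have their comma separators converted.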


## Review all callbacks

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

10 invalid rows found in group 'SEAWATER' during time parsing callback.
                                                    BIOTA  SEAWATER
Number of rows in original dataframes (dfs):        15951     19193
Number of rows in transformed dataframes (tfm.d...  15951     19183
Number of rows removed (tfm.dfs_removed):               0        10 
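
The pipeline above follows a simple pattern: callbacks run in the order given, each one modifies the dataframes held by the transformer, and row counts are compared before and after. The sketch below is a deliberately simplified stand-in with hypothetical classes (`MiniTransformer`, `DropMissingValueCB`, `UppercaseUnitCB`). It is not marisco's `Transformer`, and recording each callback's docstring as a log entry is an assumption about how the logs could be built.

```python
from typing import Dict, List
import pandas as pd

class DropMissingValueCB:
    "Drop rows without a value."
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            tfm.dfs[grp] = df.dropna(subset=['value'])

class UppercaseUnitCB:
    "Uppercase the unit column."
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            tfm.dfs[grp] = df.assign(unit=df['unit'].str.upper())

class MiniTransformer:
    "Run callbacks in order on copies of the input frames and log their docstrings."
    def __init__(self, dfs: Dict[str, pd.DataFrame], cbs: List):
        self.dfs = {k: v.copy() for k, v in dfs.items()}
        self.cbs, self.logs = cbs, []

    def __call__(self) -> Dict[str, pd.DataFrame]:
        for cb in self.cbs:
            cb(self)
            self.logs.append(cb.__doc__)
        return self.dfs

dfs = {'SEAWATER': pd.DataFrame({'value': [0.2, None, 0.25], 'unit': ['bq/l'] * 3})}
tfm = MiniTransformer(dfs, cbs=[DropMissingValueCB(), UppercaseUnitCB()])
tfm()

# Compare row counts between the untouched input and the transformed frames
removed = {grp: len(dfs[grp]) - len(tfm.dfs[grp]) for grp in dfs}
print(removed, tfm.logs)
```

Because the callbacks work on copies, the original `dfs` stay available for the before/after comparison.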



### Example change logs

Review the change logs that are later stored in the NetCDF file as the `publisher_postprocess_logs` attribute:

In [None]:
#|eval: false
dfs = load_data(src_dir, use_cache=True)
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs

10 invalid rows found in group 'SEAWATER' during time parsing callback.


["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column.",
 'Remap data provider nuclide names to standardized MARIS nuclide names.',
 'Parse the time format in the dataframe and check for inconsistencies.',
 'Encode time as seconds since epoch.',
 'Sanitize value by removing blank entries and populating `value` column.',
 'Normalize uncertainty values in DataFrames.',
 "Callback to update DataFrame 'UNIT' columns based on a lookup table.",
 'Remap detection limit values to MARIS format using a lookup table.',
 "Remap values from 'species' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'biological group' to 'enhanced_species' for groups: BIOTA.",
 "Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met.",
 'Add a temporary column with the body part and biological group combined.',
 "Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA.",
 'Create a SMP_ID column from the ID column',
 "Ensure depth

## Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


In [None]:
#| export
def get_attrs(
    tfm: Transformer, # Transformer object
    zotero_key: str, # Zotero dataset record key
    kw: list = kw # List of keywords
    ) -> dict: # Global attributes
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()

In [None]:
#|eval: false
get_attrs(tfm, zotero_key=zotero_key, kw=kw)

{'geospatial_lat_min': '49.43222222222222',
 'geospatial_lat_max': '81.26805555555555',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))',
 'geospatial_vertical_max': '1850.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1995-01-01T00:00:00',
 'time_coverage_end': '2022-12-31T00:00:00',
 'id': 'LQRA4MMK',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science >
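
Most of these attributes are plain summaries of the transformed data, stored as strings. The sketch below shows roughly how range attributes and the keyword string could be derived. `range_attrs` is a hypothetical helper and the sample values are illustrative; it does not reproduce the `GlobAttrsFeeder` callbacks.

```python
from typing import Dict, List

def range_attrs(lons: List[float], lats: List[float],
                depths: List[float], kw: List[str]) -> Dict[str, str]:
    "Summarize coordinates, depths and keywords as string-valued global attributes."
    return {
        'geospatial_lat_min': str(min(lats)),
        'geospatial_lat_max': str(max(lats)),
        'geospatial_lon_min': str(min(lons)),
        'geospatial_lon_max': str(max(lons)),
        'geospatial_vertical_min': str(min(depths)),
        'geospatial_vertical_max': str(max(depths)),
        'keywords': ', '.join(kw),
    }

attrs = range_attrs(
    lons=[3.19, -5.93], lats=[51.38, 53.6], depths=[3.0, 21.0],
    kw=['oceanography', 'Earth Science > Oceans > Water Quality > Ocean Contaminants'],
)
print(attrs)
```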

## Encoding NetCDF

In [None]:
#| export
def encode(
    fname_out_nc: str, # Output file name
    **kwargs # Additional arguments
    ) -> None:
    "Encode data to NetCDF."
    dfs = load_data(src_dir, use_cache=True)
    tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological group', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                           )
    encoder.encode()

In [None]:
#|eval: false
encode(fname_out_nc, verbose=False)

10 invalid rows found in group 'SEAWATER' during time parsing callback.


## NetCDF Review

First, let's review the global attributes of the NetCDF file:

In [None]:
#| eval: false
contents = ExtractNetcdfContents(fname_out_nc)
print(contents.global_attrs)

{'id': 'LQRA4MMK', 'title': 'OSPAR Environmental Monitoring of Radioactive Substances', 'summary': '', 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)', 'history': 'TBD', 'keywords_vocabulary': 'GCMD Science Keywords', 'keywords_vocabulary_url': 'ht

Review the `publisher_postprocess_logs` attribute:

In [None]:
#| eval: false
print(contents.global_attrs['publisher_postprocess_logs'])

Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological group' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Create a SMP_ID column from the ID column, Ensure depth values are floats and add 'SMP_DEPTH' colum

Now let's review the enums of the groups in the NetCDF file:

In [None]:
#| eval: false
print(contents.enum_dicts)

{'BIOTA': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', '

Let's review the data of the NetCDF file:

In [None]:
#| eval: false
dfs = contents.dfs
dfs

{'BIOTA':              LON        LAT        TIME  NUCLIDE     VALUE  UNIT       UNC  \
 0       4.031111  51.393333  1267574400       33  0.326416     5       NaN   
 1       4.031111  51.393333  1276473600       33  0.442704     5       NaN   
 2       4.031111  51.393333  1285545600       33  0.412989     5       NaN   
 3       4.031111  51.393333  1291766400       33  0.202768     5       NaN   
 4       4.031111  51.393333  1267574400       53  0.652833     5       NaN   
 ...          ...        ...         ...      ...       ...   ...       ...   
 15946  12.087778  57.252499  1660003200       33  0.384000     5  0.012096   
 15947  12.107500  57.306389  1663891200       33  0.456000     5  0.012084   
 15948  11.245000  58.603333  1667779200       33  0.122000     5  0.031000   
 15949  11.905278  57.302502  1663632000       33  0.310000     5       NaN   
 15950  12.076667  57.335278  1662076800       33  0.306000     5  0.007191   
 
        DL  SPECIES  BODY_PART  
 0      

Let's review the biota data:

In [None]:
#| eval: false
nc_dfs_biota=dfs['BIOTA']
nc_dfs_biota

Unnamed: 0,LON,LAT,TIME,NUCLIDE,VALUE,UNIT,UNC,DL,SPECIES,BODY_PART
0,4.031111,51.393333,1267574400,33,0.326416,5,,2,377,1
1,4.031111,51.393333,1276473600,33,0.442704,5,,2,377,1
2,4.031111,51.393333,1285545600,33,0.412989,5,,2,377,1
3,4.031111,51.393333,1291766400,33,0.202768,5,,2,377,1
4,4.031111,51.393333,1267574400,53,0.652833,5,,2,377,1
...,...,...,...,...,...,...,...,...,...,...
15946,12.087778,57.252499,1660003200,33,0.384000,5,0.012096,1,272,52
15947,12.107500,57.306389,1663891200,33,0.456000,5,0.012084,1,272,52
15948,11.245000,58.603333,1667779200,33,0.122000,5,0.031000,1,129,19
15949,11.905278,57.302502,1663632000,33,0.310000,5,,2,129,19


Let's review the seawater data:

In [None]:
#| eval: false
nc_dfs_seawater=dfs['SEAWATER']
nc_dfs_seawater

Unnamed: 0,LON,LAT,SMP_DEPTH,TIME,NUCLIDE,VALUE,UNIT,UNC,DL
0,3.188056,51.375278,3.0,1264550400,33,0.200000,1,,2
1,2.859444,51.223610,3.0,1264550400,33,0.270000,1,,2
2,2.713611,51.184444,3.0,1264550400,33,0.260000,1,,2
3,3.262222,51.420277,3.0,1264550400,33,0.250000,1,,2
4,2.809722,51.416111,3.0,1264464000,33,0.200000,1,,2
...,...,...,...,...,...,...,...,...,...
19178,4.615278,52.831944,1.0,1573649640,77,0.000005,1,2.600000e-07,1
19179,3.565556,51.411945,1.0,1575977820,1,6.152000,1,3.076000e-01,1
19180,3.565556,51.411945,1.0,1575977820,53,0.005390,1,1.078000e-03,1
19181,3.565556,51.411945,1.0,1575977820,54,0.001420,1,2.840000e-04,1
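
The integer codes in these tables can be translated back to labels with the enum dictionaries printed earlier, and `TIME` can be turned back into a date. The sketch below uses a short excerpt of the nuclide enum and assumes that `TIME` counts seconds since the Unix epoch (1970-01-01 UTC); `decode_row` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Tuple

# Excerpt of the nuclide enum printed above (label -> code, stored as strings)
nuclide_enum = {'h3': '1', 'cs137': '33', 'ra226': '53', 'ra228': '54'}

# Invert the enum to map integer codes back to labels
code_to_label = {int(code): label for label, code in nuclide_enum.items()}

def decode_row(nuclide_code: int, time_s: int) -> Tuple[str, str]:
    "Return (nuclide label, ISO date) for one record."
    when = datetime.fromtimestamp(time_s, tz=timezone.utc)
    return code_to_label[nuclide_code], when.strftime('%Y-%m-%d')

print(decode_row(33, 1267574400))   # BIOTA row 0
print(decode_row(53, 1267574400))   # BIOTA row 4
```

Under that assumption, both rows decode to 2010-03-03, for `cs137` and `ra226` respectively.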


## Data Format Conversion 

The MARIS data processing workflow involves two key steps:

1. **NetCDF to Standardized CSV Compatible with OpenRefine Pipeline**
   - Convert standardized NetCDF files to CSV formats compatible with OpenRefine using the `NetCDFDecoder`.
   - Preserve data integrity and variable relationships.
   - Maintain standardized nomenclature and units.

2. **Database Integration**
   - Process the converted CSV files using OpenRefine.
   - Apply data cleaning and standardization rules.
   - Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the `NetCDFDecoder` class.

In [None]:
#|eval: false
decode(fname_in=fname_out_nc, verbose=True)

Saved BIOTA to ../../_data/output/191-OSPAR-2024_BIOTA.csv
Saved SEAWATER to ../../_data/output/191-OSPAR-2024_SEAWATER.csv
