In [None]:
#| default_exp handlers.ospar

# OSPAR 

> This data pipeline, known as a "handler" in Marisco terminology, is designed to clean, standardize, and encode [OSPAR data](https://odims.ospar.org/en/) into `NetCDF` format. The handler processes raw OSPAR data, applying various transformations and lookups to align it with `MARIS` data standards.

Key functions of this handler:

- **Cleans** and **normalizes** raw OSPAR data
- **Applies standardized nomenclature** and units
- **Encodes the processed data** into `NetCDF` format compatible with MARIS requirements

This handler is a crucial component in the Marisco data processing workflow, ensuring OSPAR data is properly integrated into the MARIS database.

:::{.callout-tip}

For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)](https://github.com/franckalbinet/marisco/tree/main/install_configure_guide) for detailed information.

:::

The present notebook aims to be an instance of [Literate Programming](https://www.wikiwand.com/en/articles/Literate_programming): a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case `marisco/handlers/ospar.py`), the code snippet is added to the module using `#| export`, as provided by the wonderful [nbdev](https://nbdev.fast.ai/getting_started.html) library.

In [None]:
#| hide
%load_ext autoreload
%autoreload 2



In [None]:
#| export
import pandas as pd 
import numpy as np
import fastcore.all as fc 
from fastcore.basics import patch
from typing import  Dict, Callable 
import re
from owslib.wfs import WebFeatureService
from io import StringIO

from marisco.utils import (
    Remapper, 
    get_unique_across_dfs,
    NA
)

from marisco.callbacks import (
    Callback, 
    Transformer, 
    EncodeTimeCB, 
    LowerStripNameCB, 
    SanitizeLonLatCB, 
    CompareDfsAndTfmCB, 
    RemapCB
)

from marisco.metadata import (
    GlobAttrsFeeder, 
    BboxCB, 
    DepthRangeCB, 
    TimeRangeCB, 
    ZoteroCB, 
    KeyValuePairCB
)

from marisco.configs import (
    nuc_lut_path, 
    cfg, 
    species_lut_path, 
    bodyparts_lut_path, 
    detection_limit_lut_path, 
    get_lut, 
)

from marisco.encoders import (
    NetCDFEncoder, 
)

from marisco.handlers.data_format_transformation import (
    decode, 
)

from marisco.utils import (
    ExtractNetcdfContents,
)

import warnings
warnings.filterwarnings('ignore')

## Configuration and File Paths

The handler requires several configuration parameters:

1. **fname_out_nc**: Output path and filename for NetCDF file (relative paths supported) 
2. **zotero_key**: Key for retrieving dataset attributes from [Zotero](https://www.zotero.org/)
3. **ref_id**: Reference ID in the MARIS [Zotero library](https://www.zotero.org/groups/2432820/maris/library)

In [None]:
#| export
fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
zotero_key ='LQRA4MMK' # OSPAR MORS zotero key

## OSPAR Data Access and Processing

OSPAR data can be accessed through the [ODIMS OSPAR platform](https://odims.ospar.org/en/search/), which hosts the data and provides access via a [Web Feature Service (WFS)](https://odims.ospar.org/geoserver/odims/wfs/?service=WFS&request=GetCapabilities). The WFS interface enables efficient querying and retrieval of geospatial data.

### `OsparWfsProcessor`: A Tool for OSPAR Data Retrieval

The `OsparWfsProcessor` is a utility designed to interact seamlessly with the OSPAR WFS. It supports specific search parameters tailored to different data types:

- **`ospar_biota`**: Retrieves biological data.
- **`ospar_seawater`**: Retrieves seawater data.

### Workflow

When executed, the processor performs the following steps:

1. Connects to the OSPAR WFS using the specified search parameters.
2. Retrieves the requested data.
3. Organizes the data into a structured format for ease of analysis.

### Output

The processor returns the results as a dictionary of pandas DataFrames, structured as follows:

- **Key: `BIOTA`**  
  Contains biological data retrieved via the `ospar_biota` parameter.
  
- **Key: `SEAWATER`**  
  Contains seawater data retrieved via the `ospar_seawater` parameter.

This design ensures that OSPAR data is both accessible and conveniently structured for further analysis.


:::{.callout-tip}

**Feedback to Data Provider.**

Please note that we are assuming that new versions of data supersede all previous versions. Files are stored on the WFS service with the following naming convention:

- **Prefix**: All filenames start with `odims:ospar_`, indicating that the data originates from the OSPAR dataset managed by the ODIMS platform.

- **Data Type**: Following the prefix, the filename specifies the type of data:
  - `biota` - Indicates biological data.
  - `seawater` - Indicates seawater-related data.

- **Date and Version**:
  - **Year**: The year of the dataset is represented by four digits (e.g., `2023`).
  - **Month**: The month of the dataset is represented by two digits (e.g., `04` for April).
  - **Version**: The version of the dataset is represented by three digits, where higher numbers indicate more recent versions (e.g., `001`).

- **Separators**: Underscores (`_`) are used as separators to distinctly divide different parts of the filename.

Consider the filename `odims:ospar_biota_2023_01_001`. This indicates a file containing biota data from January 2023, version 001. Under the current implementation, this data would be replaced by the file `odims:ospar_biota_2023_01_002` (i.e., version 002).

:::
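The naming convention above can be sketched in plain Python. This is a minimal illustration rather than the handler's implementation: the helper names `parse_feature_name` and `latest_versions` and the sample feature names are hypothetical, and only the regular expression mirrors the pattern used by `OsparWfsProcessor`.

```python
import re

# Pattern mirroring the convention: odims:ospar_<type>_<year>_<month>_<version>
FEATURE_RE = re.compile(r'^odims:ospar_(biota|seawater)_(\d{4})_(\d{2})_(\d{3})$')

def parse_feature_name(name):
    "Return (type, year, month, version), or None if the name does not follow the convention."
    m = FEATURE_RE.match(name)
    if m is None:
        return None
    dtype, year, month, version = m.groups()
    return dtype, int(year), int(month), int(version)

def latest_versions(names):
    "Keep only the highest version for each (type, year, month) combination."
    latest = {}
    for name in names:
        parsed = parse_feature_name(name)
        if parsed is None:
            continue  # names outside the convention are ignored
        dtype, year, month, version = parsed
        key = (dtype, year, month)
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, name)
    return sorted(name for _, name in latest.values())

# Hypothetical feature names, for illustration only
names = [
    'odims:ospar_biota_2023_01_001',
    'odims:ospar_biota_2023_01_002',
    'odims:ospar_seawater_2023_01_001',
    'odims:something_else',
]
selected = latest_versions(names)
print(selected)
```

Under the assumption that a newer version supersedes older ones, only version `002` of the January 2023 biota feature is retained.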


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The 2022 OSPAR Biota data is unavailable on the WFS. The file `ospar_biota_2022_01_001.csv` contains Seawater data (i.e. Sample_type is 'Water'). See https://odims.ospar.org/en/submissions/ospar_biota_2022_01/. 
For this reason, the `BIOTA` dataset does not contain any data for the year 2022.
:::

In [None]:
#| export
class OsparWfsProcessor:
    "Processor for OSPAR Web Feature Service operations, managing feature filtering and data fetching."
    def __init__(self, url, search_params=None, version='2.0.0'):
        "Initialize with URL, version, and search parameters."
        fc.store_attr()
        self.wfs = WebFeatureService(url=self.url, version=self.version)
        self.features_dfs = {}
        self.dfs = {}

    def __call__(self):
        "Process, fetch and filter OSPAR data"
        self.filter_features()
        self.check_feature_pattern()
        self.extract_version_from_feature_name()
        self.filter_latest_versions()
        self.fetch_and_combine_csv()

        return self.dfs

In [None]:
#| export
@patch
def filter_features(self: OsparWfsProcessor):
    "Filter features based on search parameters."
    available_features = list(self.wfs.contents.keys())
    for group, value in self.search_params.items():
        filtered_features = [ftype for ftype in available_features if value in ftype]
        self.features_dfs[group] = pd.DataFrame([{'feature': ftype} for ftype in filtered_features])


In [None]:
#| export
@patch
def check_feature_pattern(self: OsparWfsProcessor):
    """
    Check and retain features conforming to a specific pattern, printing unmatched features.
    """
    pattern = re.compile(r'^odims:ospar_(biota|seawater)_(\d{4})_(\d{2})_(\d{3})$')
    unmatched_features = []
    for group, df in list(self.features_dfs.items()):
        # Apply the pattern and find unmatched features
        matched_features = df['feature'].apply(lambda x: bool(pattern.match(x)))
        unmatched = df[~matched_features]['feature']
        unmatched_features.extend(unmatched.tolist())
        # Filter the DataFrame to only include matched features
        self.features_dfs[group] = df[matched_features]

    if unmatched_features:
        print("Unmatched features:", unmatched_features)

In [None]:
#| export
@patch
def extract_version_from_feature_name(self: OsparWfsProcessor):
    "Extract version from feature name."
    for group, df in list(self.features_dfs.items()):
        df['source'] = df['feature'].apply(lambda x: x.split('_')[0])
        df['type'] = df['feature'].apply(lambda x: x.split('_')[1])
        df['year'] = df['feature'].apply(lambda x: x.split('_')[2])
        df['month'] = df['feature'].apply(lambda x: x.split('_')[3])
        df['version'] = df['feature'].apply(lambda x: x.split('_')[4])

In [None]:
#| export
@patch
def filter_latest_versions(self: OsparWfsProcessor):
    "Filter each DataFrame to include only the latest version of each feature"
    for group, df in list(self.features_dfs.items()):
        df[['year', 'month', 'version']] = df[['year', 'month', 'version']].astype(int)
        
        if group == 'BIOTA':
            # Removing biota data for the year 2022 as the data is unavailable on the WFS.
            df = df[df['year'] != 2022]
            
        idx = df.groupby(['source', 'type', 'year', 'month'])['version'].idxmax()
        self.features_dfs[group] = df.loc[idx]

**Note**: Some features contain data for years other than the one given in the feature name. `fetch_and_combine_csv` prints a message identifying these features.


In [None]:
#| export
@patch
def fetch_and_combine_csv(self: OsparWfsProcessor):
    """
    Fetch CSV data for each feature from the WFS and combine it into a single DataFrame for each sample type.
    """
    for group, df in self.features_dfs.items():
        combined_df = pd.DataFrame()  # Initialize an empty DataFrame to hold combined data
        for _, row in df.iterrows():  # Iterate over each row in the DataFrame
            feature = row['feature']
            year = row['year']
            try:
                print(f"Fetching data for feature: {feature}, year: {year}")
                response = self.wfs.getfeature(typename=feature, outputFormat='csv')
                csv_data = StringIO(response.read().decode('utf-8'))  # Decode the response content
                df_csv = pd.read_csv(csv_data)  # Load CSV data into a DataFrame
                
                # Standardize column names to lowercase
                df_csv.columns = df_csv.columns.str.lower()
                
                # Check if the data includes additional years
                if not df_csv['year'].eq(year).all():
                    # A Series has no `difference` method; select the non-matching years instead
                    additional_years = df_csv.loc[df_csv['year'] != year, 'year'].dropna().unique().tolist()
                    print(f"{feature} includes data for additional years: {additional_years}")
                
                # Append the fetched data to the combined DataFrame
                combined_df = pd.concat([combined_df, df_csv], ignore_index=True)
            
            except Exception as e:
                print(f"Failed to fetch data for feature '{feature}': {e}")
        
        # Store the combined DataFrame for the group
        self.dfs[group] = combined_df
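The additional-years check can be tried without network access on an in-memory CSV. This is a minimal sketch: the CSV payload and the helper `extra_years` are hypothetical stand-ins for a WFS response and are not part of the handler.

```python
from io import StringIO
import pandas as pd

# Hypothetical CSV payload standing in for a WFS response
csv_text = """Year,Nuclide,Activity_o
1996,137Cs,0.04
1996,99Tc,0.01
1995,137Cs,0.05
"""

def extra_years(df, expected_year, col='year'):
    "Return the sorted years found in `df[col]` that differ from `expected_year`."
    years = pd.to_numeric(df[col], errors='coerce').dropna().unique()
    return sorted(int(y) for y in years if int(y) != expected_year)

df_csv = pd.read_csv(StringIO(csv_text))
df_csv.columns = df_csv.columns.str.lower()  # same normalization as in the handler
found = extra_years(df_csv, expected_year=1996)
print(found)
```

Here the feature is assumed to be labelled 1996, so the 1995 row is reported as belonging to an additional year.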


In [None]:
#|eval: false
wfs_processor=OsparWfsProcessor(url= 'https://odims.ospar.org/geoserver/odims/wfs', search_params={'BIOTA': 'ospar_biota', 'SEAWATER': 'ospar_seawater'})
dfs = wfs_processor()

Fetching data for feature: odims:ospar_biota_1995_01_003, year: 1995
Fetching data for feature: odims:ospar_biota_1996_01_003, year: 1996
Failed to fetch data for feature 'odims:ospar_biota_1996_01_003': 'Series' object has no attribute 'difference'
Fetching data for feature: odims:ospar_biota_1997_01_003, year: 1997
Fetching data for feature: odims:ospar_biota_1998_01_003, year: 1998
Fetching data for feature: odims:ospar_biota_1999_01_003, year: 1999
Fetching data for feature: odims:ospar_biota_2000_01_003, year: 2000
Fetching data for feature: odims:ospar_biota_2001_01_003, year: 2001
Fetching data for feature: odims:ospar_biota_2002_01_003, year: 2002
Fetching data for feature: odims:ospar_biota_2003_01_003, year: 2003
Fetching data for feature: odims:ospar_biota_2004_01_003, year: 2004
Fetching data for feature: odims:ospar_biota_2005_01_003, year: 2005
Fetching data for feature: odims:ospar_biota_2006_01_003, year: 2006
Fetching data for feature: odims:ospar_biota_2007_01_003, ye

Display the head of the `SEAWATER` dataframe with all columns.

In [None]:
#|eval: false
# Show all columns
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(dfs['SEAWATER'].head())

Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,sampling_d,sampling_1,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year,f1,reference_
0,ospar_seawater_1995_01_003.1,POINT (56.16666666666666 11.78333333333333),45552.0,Denmark,12,HesselÃ¸,H95-22,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.040141,6823919,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
1,ospar_seawater_1995_01_003.2,POINT (56.16666666666666 11.78333333333333),45553.0,Denmark,12,HesselÃ¸,H95-23,56,10,0.0,N,11,47,0.0,E,Water,24.0,1995-05-01T00:00:00,137Cs,0,0.037117,7794675,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
2,ospar_seawater_1995_01_003.3,POINT (56.16666666666666 11.78333333333333),45554.0,Denmark,12,HesselÃ¸,H95-56,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-11-01T00:00:00,137Cs,0,0.04345,56485,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
3,ospar_seawater_1995_01_003.4,POINT (56.16666666666666 11.78333333333333),45555.0,Denmark,12,HesselÃ¸,H95-57,56,10,0.0,N,11,47,0.0,E,Water,25.0,1995-11-01T00:00:00,137Cs,0,0.04608,50688,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
4,ospar_seawater_1995_01_003.5,POINT (56.11666666666667 11.16666666666667),45556.0,Denmark,12,Kattegat SW,H95-20,56,7,0.0,N,11,10,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.05033,75495,Bq/l,RisÃ¸-DTU,,,,56.116667,11.166667,1995.0,,


Display the head of the `BIOTA` dataframe with all columns.

In [None]:
#|eval: false
# Show all columns
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(dfs['BIOTA'].head())

Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
0,ospar_biota_1995_01_003.1,POINT (55.96666666666667 11.58333333333333),38847,Denmark,12,Klint,950089,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-04-05T00:00:00,137Cs,0,2.0217,626727,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
1,ospar_biota_1995_01_003.2,POINT (55.96666666666667 11.58333333333333),38848,Denmark,12,Klint,950229,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-07-07T00:00:00,137Cs,0,2.34446,468892,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
2,ospar_biota_1995_01_003.3,POINT (55.96666666666667 11.58333333333333),38849,Denmark,12,Klint,950360,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus serratus,Whole plant,1995-09-19T00:00:00,137Cs,0,2.62356,4197696,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
3,ospar_biota_1995_01_003.4,POINT (55.96666666666667 11.58333333333333),38850,Denmark,12,Klint,950359,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-09-19T00:00:00,137Cs,0,2.7807,305877,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
4,ospar_biota_1995_01_003.5,POINT (55.96666666666667 11.58333333333333),38851,Denmark,12,Klint,950489,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-12-21T00:00:00,137Cs,0,1.51102,1662122,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995


## Nuclide Name Normalization

The MARISCO package standardizes the nuclide names in the DataFrames to match the MARIS standard nuclide names specified in a lookup table. 

The lookup process uses the following three columns:
- **`nuclide_id`**: A unique identifier for each nuclide.
- **`nuclide`**: The standard nuclide name.
- **`nc_name`**: The corresponding name used in NetCDF files.

Let’s inspect the lookup table:


In [None]:
#| eval: false
nuc_lut_df = pd.read_excel(nuc_lut_path())
nuc_lut_df.head()

Unnamed: 0,nuclide_id,nuclide,atomicnb,massnb,nusymbol,half_life,hl_unit,nc_name
0,-1,NOT APPLICABLE,,,,,,NOT APPLICABLE
1,0,NOT AVAILABLE,0.0,0.0,0,0.0,-,NOT AVAILABLE
2,1,TRITIUM,1.0,3.0,3H,12.35,Y,h3
3,2,BERYLLIUM,4.0,7.0,7Be,53.3,D,be7
4,3,CARBON,6.0,14.0,14C,5730.0,Y,c14


The nuclide data is provided in the `nuclide` column. However, as shown below, the nuclide names are not standardized.


In [None]:
#| eval: false
dfs = wfs_processor()
df = get_unique_across_dfs(dfs, 'nuclide', as_df=True)
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
value,241Am,137Cs,210Pb,226Ra,228Ra,"239, 240 Pu",RA-226,238Pu,CS-137,99Tc,Cs-137,RA-228,"239,240Pu",210Po,3H


### Lower & strip nuclide names

To simplify the data, we use the `LowerStripNameCB` callback. For each dataframe in the dictionary of dataframes, `LowerStripNameCB` simplifies the nuclide name by converting it to lowercase and stripping any leading or trailing whitespace.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='nuclide')])
dfs_output=tfm()
for key, df in dfs_output.items():
    print(f'{key} nuclides: ')
    print(df['nuclide'].unique())

BIOTA nuclides: 
['137cs' '99tc' '239,240pu' '210po' '210pb' '226ra' '228ra' 'cs-137' '3h'
 '238pu' '239, 240 pu' '241am']
SEAWATER nuclides: 
['137cs' '239,240pu' '3h' '99tc' '226ra' '228ra' '210po' '210pb' 'ra-226'
 'ra-228']
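For readers unfamiliar with the operation, the same normalization can be reproduced with plain pandas string methods. This sketch assumes `LowerStripNameCB` behaves roughly like the two calls below; it does not show the callback's actual code, and the sample values are made up.

```python
import pandas as pd

# Hypothetical raw nuclide labels with mixed case and stray whitespace
raw = pd.Series(['Cs-137', ' CS-137 ', '137Cs', 'RA-226'])

# Strip surrounding whitespace, then lowercase
clean = raw.str.strip().str.lower()
print(clean.tolist())
```

Lowercasing and stripping reduce the number of distinct spellings, but variants such as `cs-137` and `137cs` still differ and must be handled by the remapping step.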


### Remap nuclide names to MARIS data formats

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `nuclide` column has inconsistent naming. For example:

- `Cs-137`,  `137Cs` or `CS-137`
- `239, 240 pu` or `239,240 pu`
- `ra-226` and `226ra` 

See below:

:::

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='nuclide', as_df=True).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
value,241Am,137Cs,210Pb,226Ra,228Ra,"239, 240 Pu",RA-226,238Pu,CS-137,99Tc,Cs-137,RA-228,"239,240Pu",210Po,3H


Below, we map nuclide names used by OSPAR to the MARIS standard nuclide names. 

Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

1. **Inspect** data provider nomenclature;
2. **Match** automatically against MARIS nomenclature (using a fuzzy matching algorithm); 
3. **Fix** potential mismatches; 
4. **Apply** the lookup table to the dataframe.

We will refer to this process as **IMFA** (**I**nspect, **M**atch, **F**ix, **A**pply).
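The IMFA pattern can be sketched with the standard library alone. The example below is an illustration under stated assumptions: it uses `difflib.SequenceMatcher` similarity ratios (1.0 means identical), which is not the scoring used by `Remapper`, and the function names, threshold and sample names are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(name, candidates):
    "Return the candidate most similar to `name` and its similarity ratio."
    scored = [(SequenceMatcher(None, name, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match, score

def imfa(provider_names, maris_names, fixes, threshold=0.9):
    "Match names automatically, apply manual fixes, and list names needing review."
    lut, to_review = {}, []
    for name in provider_names:
        if name in fixes:            # Fix: manual corrections take precedence
            lut[name] = fixes[name]
            continue
        match, score = best_match(name, maris_names)  # Match
        if score >= threshold:
            lut[name] = match
        else:
            to_review.append(name)   # Inspect: left for manual review
    return lut, to_review

maris_names = ['cs137', 'ra226', 'pu238', 'h3', 'tc99']  # hypothetical subset
fixes = {'3h': 'h3'}
lut, to_review = imfa(['cs137', '3h', '137cs'], maris_names, fixes)
print(lut, to_review)

# Apply: translate provider names once every name has an entry in `lut`
applied = [lut.get(n, n) for n in ['cs137', '3h']]
```

Names that fall below the threshold, such as `137cs` here, are the ones that would be added to a fixes dictionary before applying the lookup table.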

Let's now create an instance of a [fuzzy matching algorithm](https://www.wikiwand.com/en/articles/Approximate_string_matching) `Remapper`. This instance will match the nuclide names of the OSPAR dataset to the MARIS standard nuclide names.

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_output, col_name='nuclide', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_ospar.pkl')

Let's try to match OSPAR nuclide names to MARIS standard nuclide names as automatically as possible. The `match_score` column allows us to assess the results:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 28.73it/s]

0 entries matched the criteria, while 14 entries had a match score of 1 or higher.





Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"239, 240 pu",pu240,"239, 240 pu",8
"239,240pu",pu240,"239,240pu",6
228ra,u235,228ra,4
241am,pu241,241am,4
226ra,u234,226ra,4
137cs,i133,137cs,4
210po,ru106,210po,4
210pb,ru106,210pb,4
238pu,u238,238pu,3
99tc,tu,99tc,3


We can now manually review the unmatched nuclide names and construct a dictionary to map them to the MARIS standard.

In [None]:
#| export
fixes_nuclide_names = {
    '99tc': 'tc99',
    '238pu': 'pu238',
    '226ra': 'ra226',
    'ra-226': 'ra226',
    'ra-228': 'ra228',    
    '210pb': 'pb210',
    '241am': 'am241',
    '228ra': 'ra228',
    '137cs': 'cs137',
    '210po': 'po210',
    '239,240pu': 'pu239_240_tot',
    '239, 240 pu': 'pu239_240_tot',
    'cs-137': 'cs137',
    '3h': 'h3'
    }

The dictionary `fixes_nuclide_names` applies manual corrections to the nuclide names before the remapping process. 
The `generate_lookup_table` function has an `overwrite` parameter (default `True`) which, when set to `True`, creates a pickle file cache of the lookup table. We can now test the remapping process:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1)), 0)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 41.86it/s]


If we would like to view all remapped nuclides, we can set the match score threshold to 0, which returns all nuclides.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
remapper.select_match(match_score_threshold=0, verbose=True).T

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 41.41it/s]

0 entries matched the criteria, while 14 entries had a match score of 0 or higher.





source_key,228ra,cs-137,238pu,ra-226,241am,"239, 240 pu","239,240pu",3h,226ra,137cs,210po,210pb,99tc,ra-228
matched_maris_name,ra228,cs137,pu238,ra226,am241,pu239_240_tot,pu239_240_tot,h3,ra226,cs137,po210,pb210,tc99,ra228
source_name,228ra,cs-137,238pu,ra-226,241am,"239, 240 pu","239,240pu",3h,226ra,137cs,210po,210pb,99tc,ra-228
match_score,0,0,0,0,0,0,0,0,0,0,0,0,0,0


We can now see that the nuclide names have been remapped correctly. We now create a callback `RemapNuclideNameCB` to remap the nuclide names in the dataframes. We remap to use the `nuclide_id` values. 

Note that we pass `overwrite=False` to `generate_lookup_table` so that the cached version is now used.

In [None]:
#| export
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
                                   maris_lut_fn=nuc_lut_path,
                                   maris_col_id='nuclide_id',
                                   maris_col_name='nc_name',
                                   provider_col_to_match='value',
                                   provider_col_key='value',
                                   fname_cache='nuclides_ospar.pkl').generate_lookup_table(fixes=fixes_nuclide_names, 
                                                                                            as_df=False, overwrite=False)

In [None]:
#| export
class RemapNuclideNameCB(Callback):
    "Remap data provider nuclide names to standardized MARIS nuclide names."
    def __init__(self, 
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 col_name: str # Column name to remap
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name=self.col_name, as_df=True)
        #lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}    
        lut = {k: v.matched_id for k, v in self.fn_lut(df_uniques).items()}    
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k][self.col_name].replace(lut)

Let's see it in action, along with the `LowerStripNameCB` callback:

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide')
                            ])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'{key} NUCLIDE unique: ', dfs_out[key]['NUCLIDE'].unique())

BIOTA NUCLIDE unique:  [33 15 77 47 41 53 54  1 67 72]
SEAWATER NUCLIDE unique:  [33 77  1 15 53 54 47 41]


## Standardize Time

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: There are inconsistencies in the column names used for time. The `SEAWATER` and `BIOTA` datasets use different column names for time. `SEAWATER` uses the column name `sampling_1` and `BIOTA` uses the column name `sampling_d`.

:::

In [None]:
#| eval: false
dfs = wfs_processor()
with pd.option_context('display.max_columns', None):
    display(dfs['SEAWATER'].head(2))
print('Number of NaN values in sampling_1 for SEAWATER: ', dfs['SEAWATER']['sampling_1'].isnull().sum())

with pd.option_context('display.max_columns', None):
    display(dfs['BIOTA'].head(2))

print('Number of NaN values in sampling_d for BIOTA: ', dfs['BIOTA']['sampling_d'].isnull().sum())

Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,sampling_d,sampling_1,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year,f1,reference_
0,ospar_seawater_1995_01_003.1,POINT (56.16666666666666 11.78333333333333),45552.0,Denmark,12,HesselÃ¸,H95-22,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.040141,6823919,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
1,ospar_seawater_1995_01_003.2,POINT (56.16666666666666 11.78333333333333),45553.0,Denmark,12,HesselÃ¸,H95-23,56,10,0.0,N,11,47,0.0,E,Water,24.0,1995-05-01T00:00:00,137Cs,0,0.037117,7794675,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,


Number of NaN values in sampling_1 for SEAWATER:  0


Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
0,ospar_biota_1995_01_003.1,POINT (55.96666666666667 11.58333333333333),38847,Denmark,12,Klint,950089,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-04-05T00:00:00,137Cs,0,2.0217,626727,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
1,ospar_biota_1995_01_003.2,POINT (55.96666666666667 11.58333333333333),38848,Denmark,12,Klint,950229,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-07-07T00:00:00,137Cs,0,2.34446,468892,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995


Number of NaN values in sampling_d for BIOTA:  0


Create a callback that parses the time columns in the dictionary of dataframes (ISO 8601 format, i.e. `%Y-%m-%dT%H:%M:%S`) and drops rows with missing or unparsable dates:

In [None]:
#| export
class ParseTimeCB(Callback):
    "Parse the time format in the dataframe and check for inconsistencies."
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                # Check that the 'sampling_1' column exists
                if 'sampling_1' in df.columns:
                    # Parse the ISO 8601 timestamps in 'sampling_1'; unparsable values become NaT
                    df['TIME'] = pd.to_datetime(df['sampling_1'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
            if grp == 'BIOTA':
                # Check that the 'sampling_d' column exists
                if 'sampling_d' in df.columns:
                    # Parse the ISO 8601 timestamps in 'sampling_d'; unparsable values become NaT
                    df['TIME'] = pd.to_datetime(df['sampling_d'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
            # Drop rows where TIME is still NaN after processing
            df.dropna(subset=['TIME'], inplace=True)
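The effect of `errors='coerce'` followed by `dropna` can be seen on a small, made-up dataframe. This sketch uses only pandas and invented values; it is not taken from the OSPAR data.

```python
import pandas as pd

# Hypothetical sample: one valid timestamp, one malformed string, one missing value
df = pd.DataFrame({'sampling_d': ['1995-04-05T00:00:00', 'not a date', None]})

# Values that do not match the format are coerced to NaT instead of raising
df['TIME'] = pd.to_datetime(df['sampling_d'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
n_unparsed = int(df['TIME'].isna().sum())

# Rows without a usable timestamp are then dropped
df = df.dropna(subset=['TIME'])
print(n_unparsed, len(df))
```

Counting the `NaT` values before dropping them is a cheap way to notice when a provider changes its date format.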

Apply the transformer with the `ParseTimeCB` callback, then print the `TIME` data for `SEAWATER`.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    ParseTimeCB(),
    CompareDfsAndTfmCB(dfs)])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['SEAWATER']['TIME'])

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

0       1995-05-01
1       1995-05-01
2       1995-11-01
3       1995-11-01
4       1995-05-01
           ...    
19014   2022-08-28
19015   2022-08-29
19016   2022-08-29
19017   2022-08-29
19018   2022-07-01
Name: TIME, Length: 19019, dtype: datetime64[ns]


The NetCDF time format requires the time to be encoded as the number of seconds since a time of origin. In our case the time of origin is `1970-01-01`, as indicated in the `CONFIGS['units']['time']` dictionary in `configs.ipynb`.

`EncodeTimeCB` converts the OSPAR `time` format to the MARIS NetCDF `time` format.
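As a rough illustration of this kind of encoding, the sketch below converts timestamps to integer seconds since `1970-01-01` with pandas and decodes them again. It assumes the unit is seconds, as the callback's log message suggests; the authoritative unit is the one defined in the configuration, and this is not the code of `EncodeTimeCB`.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(['1970-01-02', '1995-05-01']))
epoch = pd.Timestamp('1970-01-01')

# Encode: whole seconds elapsed since the epoch
seconds = (times - epoch) // pd.Timedelta(seconds=1)
print(seconds.tolist())

# Decode: add the elapsed seconds back to the epoch
decoded = epoch + pd.to_timedelta(seconds, unit='s')
```

The round trip is lossless here because the sample timestamps have no sub-second component.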

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.logs)
                            

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

['Parse the time format in the dataframe and check for inconsistencies.', 'Encode time as seconds since epoch.', 'Create a dataframe of dropped data. Data included in the `dfs` not in the `tfm`.']


## Sanitize value

We copy the measurement values (column `activity_o`) into a single `VALUE` column and drop rows where the value is missing.

In [None]:
#| export
class SanitizeValueCB(Callback):
    "Sanitize value by removing blank entries and populating `value` column."
    def __init__(self, 
                 value_col: str='activity_o' # Column name to sanitize
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df.dropna(subset=[self.value_col], inplace=True)
            df['VALUE'] = df[self.value_col]
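The same two steps can be reproduced on a made-up dataframe with plain pandas. The values below are invented; only the column names follow the callback above.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value
df = pd.DataFrame({'activity_o': [0.04, np.nan, 2.02]})

# Drop rows without a measurement, then expose the values under the standard name
clean = df.dropna(subset=['activity_o']).copy()
clean['VALUE'] = clean['activity_o']
print(clean)
```

Note that the original index labels are preserved, so dropped rows can still be traced back to the source dataframe.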

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            CompareDfsAndTfmCB(dfs)])

tfm()

print('Example of VALUE column:')
print(tfm.dfs['SEAWATER'][['VALUE']].head())
print('\nComparison stats:')
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

Example of VALUE column:
      VALUE
0  0.040141
1  0.037117
2  0.043450
3  0.046080
4  0.050330

Comparison stats:
                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 



## Normalize uncertainty

:::{.callout-tip}

**Feedback to Data Provider**: We have noticed that some entries in the `uncertaint` column use a comma (`,`) as a decimal separator. Please consider standardizing these entries to use a period (`.`) as the decimal separator. 

:::

For each sample type in the OSPAR dataset, the reported uncertainty is given as an expanded uncertainty with a coverage factor `𝑘=2`. For further details, refer to the [OSPAR reporting guidelines](https://mcc.jrc.ec.europa.eu/documents/OSPAR/Guidelines_forestimationof_a_%20measurefor_uncertainty_in_OSPARmonitoring.pdf).

**Note**: For MARIS, the OSPAR uncertainty values are normalized to standard uncertainty with a coverage factor 𝑘=1.

`NormalizeUncCB` callback normalizes the uncertainty using the following `lambda` function:

In [None]:
#| export
unc_exp2stan = lambda df, unc_col: df[unc_col] / 2

In [None]:
#| export
class NormalizeUncCB(Callback):
    """Normalize uncertainty values in DataFrames."""
    def __init__(self, 
                 col_unc: str='uncertaint', # Column name to normalize
                 fn_convert_unc: Callable=unc_exp2stan, # Function correcting coverage factor
                 ): 
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            self._convert_commas_to_periods(df)
            self._convert_to_float(df)
            self._apply_conversion_function(df)

    def _convert_commas_to_periods(self, df):
        """Convert commas to periods in the uncertainty column."""
        df[self.col_unc] = df[self.col_unc].astype(str).str.replace(',', '.')

    def _convert_to_float(self, df):
        """Convert uncertainty column to float, handling errors by setting them to NaN."""
        df[self.col_unc] = pd.to_numeric(df[self.col_unc], errors='coerce')

    def _apply_conversion_function(self, df):
        """Apply the conversion function to normalize the uncertainty values."""
        df['UNC'] = self.fn_convert_unc(df, self.col_unc)
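
As a minimal sketch of the three steps performed by `NormalizeUncCB`, the snippet below applies them to a few hypothetical strings (a comma decimal, a period decimal and an unparsable entry). The input values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw entries: comma decimal, period decimal, unparsable text
raw = pd.DataFrame({'uncertaint': ['0,12', '0.07', 'n/a']})

# 1. Replace commas with periods (plain string replacement, no regex)
as_text = raw['uncertaint'].astype(str).str.replace(',', '.', regex=False)

# 2. Convert to float; entries that cannot be parsed become NaN
as_float = pd.to_numeric(as_text, errors='coerce')

# 3. Convert expanded uncertainty (k=2) to standard uncertainty (k=1)
raw['UNC'] = as_float / 2

print(raw)
```

The unparsable entry ends up as `NaN` in `UNC` rather than raising an error, which is the behaviour requested with `errors='coerce'`.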

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
        SanitizeValueCB(),               
        NormalizeUncCB()
    ])
tfm()

for grp in ['SEAWATER', 'BIOTA']:
    print(f'\n{grp}:')
    print(tfm.dfs[grp][['VALUE', 'UNC']].head())


SEAWATER:
      VALUE       UNC
0  0.040141  0.000341
1  0.037117  0.000390
2  0.043450  0.000282
3  0.046080  0.000253
4  0.050330  0.000377

BIOTA:
     VALUE       UNC
0  2.02170  0.031336
1  2.34446  0.023445
2  2.62356  0.020988
3  2.78070  0.015294
4  1.51102  0.008311


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `SEAWATER` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

To show situations where the uncertainty is much greater than the value we will calculate the 'relative uncertainty' for the seawater dataset. 

In [None]:
#| eval: false
grp='SEAWATER'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'])
    # Multiply by 100 to convert to percentage
    * 100
)

Now we will return all rows where the relative uncertainty is greater than 100% for the seawater dataset.

In [None]:
#| eval: false
threshold = 100
cols_to_show=['id', 'contractin', 'nuclide', 'value_type', 'activity_o', 'uncertaint', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

print(f'Example:')
with pd.option_context('display.max_rows', None):
    display(df.head())


Number of rows where relative uncertainty is greater than 100%: 
 81 

Example:


Unnamed: 0,id,contractin,nuclide,value_type,activity_o,uncertaint,unit,relative_uncertainty
1599,55488.0,United Kingdom,3H,0,11.1091,97164.0,Bq/l,437317.154405
2518,37532.0,Germany,99Tc,0,0.00223,0.12,Bq/l,2690.58296
2534,37548.0,Germany,99Tc,0,0.00063,0.07,Bq/l,5555.555556
2535,37549.0,Germany,99Tc,0,0.00092,0.09,Bq/l,4891.304348
2536,37550.0,Germany,99Tc,0,0.00055,0.07,Bq/l,6363.636364
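
As a sanity check, the first row of the table can be reproduced by hand. The sketch below only uses the two numbers shown in that row; the variable names are illustrative.

```python
# Values copied from the first row of the table above
activity = 11.1091       # `activity_o`, in Bq/l
expanded_unc = 97164.0   # `uncertaint`, reported with k=2

standard_unc = expanded_unc / 2               # k=2 -> k=1
relative_unc = standard_unc / activity * 100  # in percent

print(f'{relative_unc:.2f} %')
```

The result agrees with the `relative_uncertainty` reported for that row (about 437317 %), which confirms that the percentage is computed from the halved uncertainty.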


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `BIOTA` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

We calculate the relative uncertainty for the biota dataset in the same way.

In [None]:
#| eval: false
grp='BIOTA'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'])
    # Multiply by 100 to convert to percentage
    * 100
)

Now we will return all rows where the relative uncertainty is greater than 100% for the biota dataset.

In [None]:
#| eval: false
threshold = 100
cols_to_show=['id', 'contractin', 'nuclide', 'value_type', 'activity_o', 'uncertaint', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

print(f'Example:')
with pd.option_context('display.max_rows', None):
    display(df.head())


Number of rows where relative uncertainty is greater than 100%: 
 38 

Example:


Unnamed: 0,id,contractin,nuclide,value_type,activity_o,uncertaint,unit,relative_uncertainty
492,35011,Belgium,137Cs,0,0.1619,66.0,Bq/kg f.w.,20382.95244
756,49226,Sweden,137Cs,0,0.327,1.468,Bq/kg f.w.,224.464832
1039,35011,Belgium,137Cs,0,0.1619,66.0,Bq/kg f.w.,20382.95244
1303,49226,Sweden,137Cs,0,0.327,1.468,Bq/kg f.w.,224.464832
1838,49230,Sweden,137Cs,0,0.275,1.982,Bq/kg f.w.,360.363636


## Remap units

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: It would be easier to work with the units if they were standardized. The units are not consistent across the dataset, for instance `BQ/L`, `Bq/l` and `Bq/L` are used interchangeably.

:::


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Unit` column contains `NaN` values for the `SEAWATER` dataset, as shown below.
:::


In [None]:
#| eval: false
df=dfs['SEAWATER'][dfs['SEAWATER']['unit'].isnull()].drop(columns=['measuremen','sample_com','reference'])
print(f'Number of rows with NaN in unit column: \n {df.shape[0]} \n')
print(f'Example:')
with pd.option_context('display.max_rows', None):
    display(df.head())

Number of rows with NaN in unit column: 
 2656 

Example:


Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,...,value_type,activity_o,uncertaint,unit,data_provi,latdd,longdd,year,f1,reference_
543,ospar_seawater_1995_01_003.544,POINT (52.30138888888889 4.301111111111111),92319.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.8,48,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
544,ospar_seawater_1995_01_003.545,POINT (52.30138888888889 4.301111111111111),92320.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.4,44,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
545,ospar_seawater_1995_01_003.546,POINT (52.30138888888889 4.301111111111111),92321.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.0,4,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
546,ospar_seawater_1995_01_003.547,POINT (52.30138888888889 4.301111111111111),92322.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,3.6,36,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
547,ospar_seawater_1995_01_003.548,POINT (52.30138888888889 4.301111111111111),92323.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.3,43,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,


Let's inspect the unique units used by OSPAR:

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='unit', as_df=True)

Unnamed: 0,index,value
0,0,Bq/l
1,1,Bq/kg f.w.
2,2,BQ/L
3,3,
4,4,Bq/L


We will define unit renaming rules for the OSPAR dataset:

In [None]:
#| export
# Define unit names renaming rules
renaming_unit_rules = {'Bq/l': 1, #'Bq/m3'
                       'Bq/L': 1,
                       'BQ/L': 1,
                       'Bq/kg f.w.': 5, # Bq/kgw
                       } 

Now we will create a callback `RemapUnitCB` to remap the units in the dataframes. For the `SEAWATER` dataset we will set a default unit of `Bq/l`. 

In [None]:
#| export
class RemapUnitCB(Callback):
    """Callback to update DataFrame 'UNIT' columns based on a lookup table."""

    def __init__(self, lut: Dict[str, int]):  # Maps provider unit names to MARIS unit IDs
        fc.store_attr('lut')  # Store the lookup table as an attribute

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                self._apply_default_units(df, unit='Bq/l')
            self._print_na_units(df)
            self._update_units(df)

    def _apply_default_units(self, df: pd.DataFrame , unit = None):
        df.loc[df['unit'].isnull(), 'unit'] = unit

    def _print_na_units(self, df: pd.DataFrame):
        na_count = df['unit'].isnull().sum()
        if na_count > 0:
            print(f"Number of rows with NaN in 'unit' column: {na_count}")

    def _update_units(self, df: pd.DataFrame):
        df['UNIT'] = df['unit'].apply(lambda x: self.lut.get(x, 'Unknown'))
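
The lookup performed in `_update_units` is a plain dictionary `get` with a fallback. A minimal sketch, reusing the rules defined above and adding one hypothetical unit (`mBq/l`) that is absent from the rules:

```python
# Same rules as `renaming_unit_rules` above
renaming_unit_rules = {'Bq/l': 1, 'Bq/L': 1, 'BQ/L': 1, 'Bq/kg f.w.': 5}

# The last entry is hypothetical and deliberately missing from the rules
units = ['Bq/l', 'BQ/L', 'Bq/kg f.w.', 'mBq/l']

# Units without a rule fall back to the string 'Unknown'
mapped = [renaming_unit_rules.get(u, 'Unknown') for u in units]
print(mapped)
```

Because unmatched units are labelled `'Unknown'` instead of raising an error, it is worth checking the unique values of `UNIT` after the callback has run, as done in the next cell.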

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(), # Remove blank value entries (also removes NaN values in Unit column) 
                            RemapUnitCB(renaming_unit_rules),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('Unit column values:')
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['UNIT'].unique()}")

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

Unit column values:
BIOTA: [5]
SEAWATER: [1]


## Remap detection limit

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Value type` column contains many `nan` values and many entries with a value of `0`.

:::

In [None]:
#| eval: false
# Count the number of NaN entries in the 'value_type' column for 'SEAWATER'
na_count_seawater = dfs['SEAWATER']['value_type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'SEAWATER': {na_count_seawater}")

# Count the number of NaN entries in the 'value_type' column for 'BIOTA'
na_count_biota = dfs['BIOTA']['value_type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'BIOTA': {na_count_biota}")

# Count the number of entries in the 'value_type' column where the value is 0 for 'SEAWATER'
zero_count_seawater = dfs['SEAWATER']['value_type'].value_counts()[0]
print(f"Number of 'value_type' entries where the value is '0' in 'SEAWATER': {zero_count_seawater}")

# Count the number of entries in the 'value_type' column where the value is 0 for 'BIOTA'
zero_count_biota = dfs['BIOTA']['value_type'].value_counts()[0]
print(f"Number of 'value_type' entries where the value is '0' in 'BIOTA': {zero_count_biota}")    


Number of NaN 'Value type' entries in 'SEAWATER': 54
Number of NaN 'Value type' entries in 'BIOTA': 23
Number of 'value_type' entries where the value is '0' in 'SEAWATER': 14032
Number of 'value_type' entries where the value is '0' in 'BIOTA': 10631


In the OSPAR dataset, the detection limit is indicated by `<` in the `Value type` column. When the `Value type` is `<`, the `Activity or MDA` column contains the detection limit value. Conversely, when the `Value type` is `=`, the `Activity or MDA` column contains the measurement value.

Let's examine the `Value type` column entries in the OSPAR dataset:

In [None]:
#| eval: false
for grp in dfs.keys():
    print(f'{grp}:')
    print(tfm.dfs[grp]['value_type'].unique())


BIOTA:
['0' '<' nan]
SEAWATER:
['0' '<' '=' nan]


Detection limits are encoded as follows in MARIS:

In [None]:
#| eval: false
pd.read_excel(detection_limit_lut_path())

Unnamed: 0,id,name,name_sanitized
0,-1,Not applicable,Not applicable
1,0,Not Available,Not available
2,1,=,Detected value
3,2,<,Detection limit
4,3,ND,Not detected
5,4,DE,Derived


We create a lambda function to retrieve the lookup table.

In [None]:
#| export
lut_dl = lambda: pd.read_excel(detection_limit_lut_path(), usecols=['name','id']).set_index('name').to_dict()['id']
lut_dl()

{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

We define the columns of interest in both the `SEAWATER` and `BIOTA` dataframes for the detection limit column.

In [None]:
#| export
coi_dl = {'SEAWATER' : {'DL' : 'value_type'},
          'BIOTA':  {'DL' : 'value_type'}
          }

We create a callback `RemapDetectionLimitCB` to remap the detection limit values to MARIS format using the lookup table. Since the dataset contains both `'0'` and `nan` entries in the detection limit column, we set the detection limit to `=` when both the value and the uncertainty are present and the current detection limit entry is not one of the lookup keys. Any entry that is still unmatched after this step is set to `Not Available`.

In [None]:
#| export
class RemapDetectionLimitCB(Callback):
    """Remap detection limit values to MARIS format using a lookup table."""

    def __init__(self, coi: dict, fn_lut: Callable):
        """Initialize with column configuration and a function to get the lookup table."""
        fc.store_attr()        

    def __call__(self, tfm: Transformer):
        """Apply the remapping of detection limits across all dataframes"""
        lut = self.fn_lut()  # Retrieve the lookup table
        for grp, df in tfm.dfs.items():
            df['DL'] = df[self.coi[grp]['DL']]
            self._set_detection_limits(df, lut)

    def _set_detection_limits(self, df: pd.DataFrame, lut: dict):
        """Set detection limits based on value and uncertainty columns using specified conditions."""
        # Condition to set '=' when value and uncertainty are present and the current detection limit is not in the lookup keys
        condition_eq = (df['VALUE'].notna() & df['UNC'].notna() & ~df['DL'].isin(lut.keys()))
        df.loc[condition_eq, 'DL'] = '='

        # Set 'Not Available' for unmatched detection limits
        df.loc[~df['DL'].isin(lut.keys()), 'DL'] = 'Not Available'

        # Map existing detection limits using the lookup table
        df['DL'] = df['DL'].map(lut)
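
The decision rules in `_set_detection_limits` can be traced on a small example. The four rows below are hypothetical and chosen to cover each branch; the lookup dictionary is the one returned by `lut_dl()` above.

```python
import numpy as np
import pandas as pd

# Lookup table as returned by `lut_dl()`
lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

# Hypothetical rows: '0' with value and uncertainty, '<', missing, '='
df = pd.DataFrame({
    'VALUE': [1.2, 0.5, 0.8, 2.0],
    'UNC':   [0.1, np.nan, np.nan, 0.2],
    'DL':    ['0', '<', np.nan, '='],
})

# Rule 1: unknown code but value and uncertainty present -> '='
known = df['DL'].isin(list(lut))
has_both = df['VALUE'].notna() & df['UNC'].notna()
df.loc[has_both & ~known, 'DL'] = '='

# Rule 2: anything still unknown -> 'Not Available'
df.loc[~df['DL'].isin(list(lut)), 'DL'] = 'Not Available'

# Rule 3: map names to MARIS IDs
df['DL'] = df['DL'].map(lut)
print(df['DL'].tolist())
```

The `'0'` row is treated as a detected value (ID 1) because it carries both a value and an uncertainty, whereas the row with a missing code and no uncertainty becomes `Not Available` (ID 0).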

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            NormalizeUncCB(),                  
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl)])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['DL'].unique()}")

BIOTA: [1 2 0]
SEAWATER: [1 2 0]


## Remap Biota species

The OSPAR dataset contains biota species information in the `Species` column of the biota dataframe. To ensure consistency with MARIS standards, we need to remap these species names. We'll use the same approach as the one we employed for standardizing nuclide names.


We first inspect unique `Species` values used by OSPAR:

In [None]:
#| eval: false
dfs = wfs_processor()
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='species', as_df=True).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50.0,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155
value,Penaeus vannamei,OSTREA EDULIS,Pelvetia canaliculata,SOLEA SOLEA (S.VULGARIS),Capros aper,SEBASTES MARINUS,PORPHYRA UMBILICALIS,Micromesistius poutassou,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,Sprattus sprattus,Platichthys flesus,Molva molva,ETMOPTERUS SPINAX,Hippoglossus hippoglossus,Boreogadus Saida,unknown,Lophius piscatorius,Unknown,Sebastes Mentella,Ostrea edulis,Raja montagui,Gadus Morhua,SCOPHTHALMUS RHOMBUS,DICENTRARCHUS (MORONE) LABRAX,HIPPOGLOSSUS HIPPOGLOSSUS,MELANOGRAMMUS AEGLEFINUS,Pollachius pollachius,Rhodymenia spp.,CHIMAERA MONSTROSA,Trachurus trachurus,TRACHURUS TRACHURUS,PALMARIA PALMATA,LITTORINA LITTOREA,Ostrea Edulis,Tapes sp.,Limanda Limanda,Pecten maximus,Melanogrammus aeglefinus,CERASTODERMA (CARDIUM) EDULE,CRASSOSTREA GIGAS,Gaidropsarus argenteus,Homarus gammarus,Gadus sp.,Flatfish,LIMANDA LIMANDA,Scomber scombrus,PECTEN MAXIMUS,FUCUS VESICULOSUS,HIPPOGLOSSOIDES PLATESSOIDES,MICROMESISTIUS POUTASSOU,,RHODYMENIA spp,Coryphaenoides rupestris,PATELLA VULGATA,Merluccius merluccius,SEBASTES MENTELLA,Dasyatis pastinaca,Littorina littorea,Argentina silus,PATELLA,Sebastes norvegicus,Anarhichas minor,BROSME BROSME,BUCCINUM UNDATUM,Phoca vitulina,SPRATTUS SPRATTUS,Merlangius merlangus,Mytilus edulis,Gadiculus argenteus thori,Pleuronectiformes [order],Trisopterus esmarkii,Argentina sphyraena,Anarhichas lupus,GLYPTOCEPHALUS CYNOGLOSSUS,Thunnus thynnus,Clupea harengus,GADUS MORHUA,Patella sp.,Eutrigla gurnardus,Lumpenus lampretaeformis,MERLUCCIUS MERLUCCIUS,FUCUS SERRATUS,GALEUS MELASTOMUS,Sardina pilchardus,FUCUS spp,FUCUS SPIRALIS,ASCOPHYLLUM NODOSUM,Microstomus kitt,Cyclopterus lumpus,PLEURONECTES PLATESSA,Sebastes vivipares,PLUERONECTES PLATESSA,MOLVA MOLVA,PLATICHTHYS FLESUS,Anarhichas denticulatus,Salmo salar,Crassostrea gigas,MERLANGIUS MERLANGUS,Hyperoplus lanceolatus,Merlangius Merlangus,Solea solea (S.vulgaris),Brosme brosme,ASCOPHYLLUN NODOSUM,Ascophyllum nodosum,Fucus sp.,LAMINARIA DIGITATA,Sebastes viviparus,PELVETIA 
CANALICULATA,Reinhardtius hippoglossoides,Gadus morhua,Lycodes vahlii,Buccinum undatum,CLUPEA HARENGUS,Phycis blennoides,Clupea Harengus,Thunnus sp.,Anguilla anguilla,SCOMBER SCOMBRUS,Sepia spp.,RAJA DIPTURUS BATIS,REINHARDTIUS HIPPOGLOSSOIDES,BOREOGADUS SAIDA,Limanda limanda,Mytilus Edulis,Boreogadus saida,NUCELLA LAPILLUS,Modiolus modiolus,Fucus distichus,OSILINUS LINEATUS,Fucus vesiculosus,Squalus acanthias,RAJIDAE/BATOIDEA,Hippoglossoides platessoides,Sebastes mentella,Pleuronectes platessa,Galeus melastomus,SALMO SALAR,EUTRIGLA GURNARDUS,CYCLOPTERUS LUMPUS,POLLACHIUS VIRENS,Pollachius virens,Sebastes marinus,Cerastoderma edule,Fucus serratus,"Mixture of green, red and brown algae",ANARHICHAS LUPUS,PECTINIDAE,MYTILUS EDULIS,Trisopterus minutus,Cerastoderma (Cardium) Edule,Mallotus villosus,Glyptocephalus cynoglossus,Fucus Vesiculosus,MOLVA DYPTERYGIA,MONODONTA LINEATA,FUCUS SPP.


We try to remap the `Species` column to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='species_ospar.pkl')

In this step, we generate a lookup table using the `remapper` object. The lookup table maps data provider entries to MARIS entries using fuzzy matching. After generating the table, we select the matches with a score of 1 or higher, that is, the entries that needed at least one character correction and therefore deserve a manual check.

- **`generate_lookup_table(as_df=True)`**: This method generates the lookup table and returns it as a DataFrame. It uses fuzzy matching to align entries from the data provider with those in the MARIS lookup table.
- **`select_match(match_score_threshold=1)`**: This method filters the generated lookup table to include only those matches with a score greater than or equal to the specified threshold. With a threshold of 1, perfect matches are left out and only the imperfect matches are displayed.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 93.99it/s]

0 entries matched the criteria, while 23 entries had a match score of 1 or higher.





source_key,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,soft parts molluscs,flesh without bones molluscs,whole animal molluscs,whole fisk fish,unknown fish,whole fish fish,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,flesh fish,whole seaweed,head fish,whole animal fish,muscle fish,whole fish,flesh without bones fish,soft parts fish,liver fish,flesh with scales fish,flesh without bone fish
matched_maris_name,Flesh without bones,Old leaf,Flesh without bones,Soft parts,Flesh without bones,Whole animal,Whole animal,Growing tips,Whole animal,Flesh without bones,Growing tips,Whole plant,Shells,Whole plant,Head,Whole animal,Muscle,Whole animal,Flesh without bones,Soft parts,Liver,Flesh with scales,Flesh without bones
source_name,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,soft parts molluscs,flesh without bones molluscs,whole animal molluscs,whole fisk fish,unknown fish,whole fish fish,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,flesh fish,whole seaweed,head fish,whole animal fish,muscle fish,whole fish,flesh without bones fish,soft parts fish,liver fish,flesh with scales fish,flesh without bone fish
match_score,31,13,13,9,9,9,9,9,9,8,8,8,7,6,5,5,5,5,5,5,5,5,4


Below, we fix the entries that are not properly matched by the `Remapper` object:

In [None]:
#| export
fixes_biota_species = {
    'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': NA,  # Mix of species, no direct mapping
    'Mixture of green, red and brown algae': NA,  # Mix of species, no direct mapping
    'Solea solea (S.vulgaris)': 'Solea solea',
    'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
    'RAJIDAE/BATOIDEA': NA, #Mix of species, no direct mapping
    'PALMARIA PALMATA': NA,  # Not defined
    'Unknown': NA,
    'unknown': NA,
    'Flatfish': NA,
    'Gadus sp.': NA,  # Not defined
}

We now attempt remapping again, incorporating the `fixes_biota_species` dictionary:

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_species)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 97.71it/s]

0 entries matched the criteria, while 23 entries had a match score of 1 or higher.





source_key,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,soft parts molluscs,flesh without bones molluscs,whole animal molluscs,whole fisk fish,unknown fish,whole fish fish,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,flesh fish,whole seaweed,head fish,whole animal fish,muscle fish,whole fish,flesh without bones fish,soft parts fish,liver fish,flesh with scales fish,flesh without bone fish
matched_maris_name,Flesh without bones,Old leaf,Flesh without bones,Soft parts,Flesh without bones,Whole animal,Whole animal,Growing tips,Whole animal,Flesh without bones,Growing tips,Whole plant,Shells,Whole plant,Head,Whole animal,Muscle,Whole animal,Flesh without bones,Soft parts,Liver,Flesh with scales,Flesh without bones
source_name,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,soft parts molluscs,flesh without bones molluscs,whole animal molluscs,whole fisk fish,unknown fish,whole fish fish,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,flesh fish,whole seaweed,head fish,whole animal fish,muscle fish,whole fish,flesh without bones fish,soft parts fish,liver fish,flesh with scales fish,flesh without bone fish
match_score,31,13,13,9,9,9,9,9,9,8,8,8,7,6,5,5,5,5,5,5,5,5,4


On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='species_ospar.pkl').generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of a `SPECIES` column to our `BIOTA` dataframe, containing standardized species IDs.


In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([  96,  392,   50,   99,  192,  244,  378,  139,  379,    0,  413,
        129,  274,  391,  394,  396,  417,  397,  270,  401,  380,  412,
        410,  272,  414,  395,  243,  418,  411,  407,  402,  191,  426,
        393,  429,  430,  384,  381,  403,  399,  398,  408,  389,  386,
        404,  405,  385,  415,  416,  400,  406,  427,  377,  382,  383,
        387,  388,  390, 1684,  425,  428,  419, 1609,  420,  421,  422,
        423,  424,  431,  294,  440,  432,  433,  434,  435,  436,  437,
        438,  439,  441,  442, 1605,  443,  444, 1610, 1608,   23, 1606,
        234,  556, 1701, 1752])

## Enhance Species Data Using Biological group column

The `Biological group` column in the OSPAR dataset provides valuable insights related to species. We will leverage this information to enrich the `SPECIES` column. To achieve this, we will employ the generic `RemapCB` callback to create an `enhanced_species` column. Subsequently, this `enhanced_species` column will be used to further enrich the `SPECIES` column.

First we inspect the unique values in the `Biological group` column.

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='biological', as_df=True)

Unnamed: 0,index,value
0,0,Seaweed
1,1,SEAWEED
2,2,seaweed
3,3,FISH
4,4,fish
5,5,Molluscs
6,6,Fish
7,7,MOLLUSCS
8,8,molluscs


We will remap the `Biological group` column's data to the `SPECIES` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='enhance_species_ospar.pkl')

As before, we generate the lookup table and select the matches with a score of 1 or higher, so that every entry requiring at least one character change is shown.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  6.47it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FISH,Fucus,FISH,4
fish,Fucus,fish,4
Fish,Fucus,Fish,4
Molluscs,Mollusca,Molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1
molluscs,Mollusca,molluscs,1


We can see that some of the entries require manual corrections.

In [None]:
#| export
fixes_enhanced_biota_species = {
    'fish': 'Pisces',
    'FISH': 'Pisces',
    'Fish': 'Pisces'    
}


Now we will apply the manual corrections to the lookup table and generate the lookup table again.

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_enhanced_biota_species)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  7.08it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Molluscs,Mollusca,Molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1
molluscs,Mollusca,molluscs,1


On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota_enhanced = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='enhance_species_ospar.pkl').generate_lookup_table(fixes=fixes_enhanced_biota_species, as_df=False, overwrite=False)

Now let's see the species that are not matched by the `LookupBiogroupCB` callback.

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of an `enhanced_species` column to our `BIOTA` dataframe, containing standardized species IDs.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['enhanced_species'].unique()

array([1059,  712,  873])

Now that we have the `enhanced_species` column, we can use it to enrich the `SPECIES` column. Whenever `SPECIES` has no match (an ID of `-1` or `0`) and `enhanced_species` is not null, we replace `SPECIES` with the `enhanced_species` value.

In [None]:
#| export
class EnhanceSpeciesCB(Callback):
    """Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met."""

    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        self._enhance_species(tfm.dfs['BIOTA'])

    def _enhance_species(self, df: pd.DataFrame):
        df['SPECIES'] = df.apply(
            lambda row: row['enhanced_species'] if row['SPECIES'] in [-1, 0] and pd.notnull(row['enhanced_species']) else row['SPECIES'],
            axis=1
        )
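
For large dataframes, the same rule can also be expressed without a row-wise `apply`. The following is a sketch of an equivalent vectorized form, not the exported implementation; the pairing of IDs in the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a matched species, two unmatched ones with a fallback,
# and an unmatched one without any fallback
df = pd.DataFrame({
    'SPECIES':          [96, 0, -1, 0],
    'enhanced_species': [712, 1059, 873, np.nan],
})

# Use the fallback only where SPECIES is unmatched and a fallback exists
use_fallback = df['SPECIES'].isin([-1, 0]) & df['enhanced_species'].notna()
df['SPECIES'] = np.where(use_fallback, df['enhanced_species'], df['SPECIES']).astype(int)

print(df['SPECIES'].tolist())
```

The matched species is left untouched, and the last row keeps its `0` because no biological group information is available to fall back on.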

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),
    EnhanceSpeciesCB()
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([  96,  392,   50,   99,  192,  244,  378,  139,  379, 1059,  413,
        129,  274,  391,  394,  396,  417,  397,  270,  401,  712,  380,
        412,  410,  272,  414,  395,  243,  418,  411,  407,  402,  191,
        426,  393,  429,  430,  384,  381,  403,  399,  398,  408,  389,
        386,  404,  405,  385,  415,  416,  400,  406,  427,  377,  382,
        383,  387,  388,  390, 1684,  425,  428,  419, 1609,  420,  421,
        422,  423,  424,  431,  294,  440,  432,  433,  434,  435,  436,
        437,  438,  439,  441,  442, 1605,  443,  444, 1610, 1608,   23,
       1606,  234,  556, 1701, 1752])

All entries are matched for the `SPECIES` column.

## Remap Biota tissues

The OSPAR dataset includes entries where the `Body Part` is labeled as `whole`. However, the MARIS data standard requires a more specific distinction in the `body_part` field, differentiating between `Whole animal` and `Whole plant`. Fortunately, the OSPAR data provides a `Biological group` field that allows us to make this distinction.

To address this discrepancy and ensure compatibility with MARIS standards, we will:

1. Create a temporary column `body_part_temp` that combines information from both `Body Part` and `Biological group`.
2. Use this temporary column to perform the lookup using our `Remapper` object.

Let's create the temporary column, `body_part_temp`, that combines `Body Part` and `Biological group`.

In [None]:
#| export
class AddBodypartTempCB(Callback):
    "Add a temporary column with the body part and biological group combined."    
    def __call__(self, tfm):
        tfm.dfs['BIOTA']['body_part_temp'] = (
            tfm.dfs['BIOTA']['body_part'] + ' ' + 
            tfm.dfs['BIOTA']['biological']
            ).str.strip().str.lower()                                 

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            ])
dfs_test = tfm()
dfs_test['BIOTA']['body_part_temp'].unique()


array(['whole plant seaweed', 'flesh without bones fish',
       'soft parts molluscs', 'growing tips seaweed', 'flesh fish',
       'liver fish', 'whole animal molluscs', 'whole fish fish',
       'soft parts fish', 'muscle fish', 'flesh with scales fish',
       'whole animal fish', 'flesh without bone fish', 'head fish',
       'unknown fish', 'flesh without bones seaweed', 'whole fish',
       'flesh without bones molluscs', 'whole seaweed',
       'whole without head fish',
       'mix of muscle and whole fish without liver fish',
       'whole fisk fish', 'cod medallion fish'], dtype=object)
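To illustrate why the combined column helps, the sketch below applies the same concatenation to three made-up rows. The ambiguous `WHOLE` label becomes distinguishable once the biological group is appended.

```python
import pandas as pd

# Toy rows (hypothetical); column names follow the OSPAR BIOTA table.
biota = pd.DataFrame({
    'body_part': ['WHOLE', 'WHOLE', 'LIVER'],
    'biological': ['Seaweed', 'Fish', 'Fish'],
})

# Same expression as in AddBodypartTempCB.
biota['body_part_temp'] = (
    biota['body_part'] + ' ' + biota['biological']
).str.strip().str.lower()

print(biota['body_part_temp'].tolist())
# ['whole seaweed', 'whole fish', 'liver fish']
```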

To align the ``body_part_temp`` column with the ``bodypar`` column in the MARIS nomenclature, we utilize a Remapper object. Since the OSPAR dataset does not include a predefined lookup table for the ``body_part`` column, we first create a lookup table by extracting unique values from the ``body_part_temp`` column.

In [None]:
get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True).head()

Unnamed: 0,index,value
0,0,flesh without bone fish
1,1,whole fish fish
2,2,whole seaweed
3,3,mix of muscle and whole fish without liver fish
4,4,flesh with scales fish


We try to remap the `body_part_temp` column to the `bodypar` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='tissues_ospar.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=0, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 98.83it/s] 

0 entries matched the criteria, while 23 entries had a match score of 0 or higher.





source_key,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,whole animal molluscs,whole fish fish,unknown fish,whole fisk fish,flesh without bones molluscs,soft parts molluscs,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,flesh fish,whole seaweed,whole animal fish,muscle fish,head fish,whole fish,flesh with scales fish,flesh without bones fish,soft parts fish,liver fish,flesh without bone fish
matched_maris_name,Flesh without bones,Old leaf,Flesh without bones,Whole animal,Whole animal,Growing tips,Whole animal,Flesh without bones,Soft parts,Flesh without bones,Growing tips,Whole plant,Shells,Whole plant,Whole animal,Muscle,Head,Whole animal,Flesh with scales,Flesh without bones,Soft parts,Liver,Flesh without bones
source_name,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,whole animal molluscs,whole fish fish,unknown fish,whole fisk fish,flesh without bones molluscs,soft parts molluscs,flesh without bones seaweed,growing tips seaweed,whole plant seaweed,flesh fish,whole seaweed,whole animal fish,muscle fish,head fish,whole fish,flesh with scales fish,flesh without bones fish,soft parts fish,liver fish,flesh without bone fish
match_score,31,13,13,9,9,9,9,9,9,8,8,8,7,6,5,5,5,5,5,5,5,5,4


Many of the automatic matches are adequate for our needs. For values without a suitable match, we apply manual corrections through the `fixes_biota_tissues` dictionary, which we create first.

In [None]:
#| export
fixes_biota_tissues = {
    'whole seaweed' : 'Whole plant',
    'flesh fish': 'Flesh with bones', # Assumed mapping, as the category 'Flesh with bones' also exists
    'unknown fish' : NA,
    'cod medallion fish' : NA, # TO BE DETERMINED
    'mix of muscle and whole fish without liver fish' : NA, # TO BE DETERMINED
    'whole without head fish' : NA, # TO BE DETERMINED
    'flesh without bones seaweed' : NA, # TO BE DETERMINED
    'tail and claws fish' : NA # TO BE DETERMINED
}

Now we generate the lookup table and apply the manual corrections from the ``fixes_biota_tissues`` dictionary.


In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_tissues)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing: 100%|██████████| 23/23 [00:00<00:00, 99.87it/s] 

2 entries matched the criteria, while 21 entries had a match score of 1 or higher.





source_key,whole animal molluscs,whole fisk fish,flesh without bones molluscs,soft parts molluscs,whole fish fish,growing tips seaweed,whole plant seaweed,muscle fish,whole animal fish,whole fish,flesh with scales fish,head fish,flesh without bones fish,soft parts fish,liver fish,flesh without bone fish,whole without head fish,cod medallion fish,mix of muscle and whole fish without liver fish,unknown fish,flesh without bones seaweed
matched_maris_name,Whole animal,Whole animal,Flesh without bones,Soft parts,Whole animal,Growing tips,Whole plant,Muscle,Whole animal,Whole animal,Flesh with scales,Head,Flesh without bones,Soft parts,Liver,Flesh without bones,(Not available),(Not available),(Not available),(Not available),(Not available)
source_name,whole animal molluscs,whole fisk fish,flesh without bones molluscs,soft parts molluscs,whole fish fish,growing tips seaweed,whole plant seaweed,muscle fish,whole animal fish,whole fish,flesh with scales fish,head fish,flesh without bones fish,soft parts fish,liver fish,flesh without bone fish,whole without head fish,cod medallion fish,mix of muscle and whole fish without liver fish,unknown fish,flesh without bones seaweed
match_score,9,9,9,9,9,8,8,5,5,5,5,5,5,5,5,4,2,2,2,2,2
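
The interplay between automatic matching and manual fixes can be sketched with the standard library alone. The example below is not the `Remapper` implementation: it swaps in `difflib.get_close_matches` as a stand-in fuzzy matcher, uses a small hypothetical subset of MARIS body part names, and lets a `fixes` dictionary override whatever the matcher would propose.

```python
from difflib import get_close_matches

# Hypothetical subset of MARIS body part names.
maris_names = ['Whole animal', 'Whole plant', 'Liver', 'Muscle', 'Flesh with bones']

# Manual corrections take precedence over fuzzy matches.
fixes = {'whole seaweed': 'Whole plant', 'unknown fish': '(Not available)'}

def remap(source: str) -> str:
    "Return the manual fix if present, else the closest name, else '(Not available)'."
    if source in fixes:
        return fixes[source]
    lowered = {name.lower(): name for name in maris_names}
    match = get_close_matches(source, list(lowered), n=1, cutoff=0.0)
    return lowered[match[0]] if match else '(Not available)'

for src in ['whole seaweed', 'unknown fish', 'liver fish']:
    print(src, '->', remap(src))
```

Checking the fixes before the matcher means a manual correction always wins, even when the fuzzy match looks plausible.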


At this stage, most entries have been matched to the MARIS nomenclature, and the remaining ones are marked as not available. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.

In [None]:
#| export
lut_bodyparts = lambda: Remapper(provider_lut_df=get_unique_across_dfs(tfm.dfs, col_name='body_part_temp', as_df=True),
                               maris_lut_fn=bodyparts_lut_path,
                               maris_col_id='bodypar_id',
                               maris_col_name='bodypar',
                               provider_col_to_match='value',
                               provider_col_key='value',
                               fname_cache='tissues_ospar.pkl'
                               ).generate_lookup_table(fixes=fixes_biota_tissues, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This adds a `BODY_PART` column to the `BIOTA` DataFrame, containing standardized body part IDs.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA')
                            ])
tfm()
tfm.dfs['BIOTA']['BODY_PART'].unique()

array([40, 52, 19, 56,  4, 25,  1, 34, 60, 13,  0])

## Remap biogroup

The MARIS species lookup table contains a ``biogroup_id`` column that associates each species with its corresponding ``biogroup``. We will leverage this relationship to create a ``BIO_GROUP`` column in the ``BIOTA`` DataFrame.

In [None]:
#| export
lut_biogroup_from_biota = lambda: get_lut(src_dir=species_lut_path().parent, fname=species_lut_path().name, 
                               key='species_id', value='biogroup_id')

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[ 
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),
    EnhanceSpeciesCB(),
    RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
    ])

print(tfm()['BIOTA']['BIO_GROUP'].unique())


[11  4 13 14 12  2  5]
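
The species-to-biogroup step is essentially a dictionary lookup. A minimal sketch with a made-up lookup table (the real one is read from the MARIS species file via `get_lut`, and the IDs below are purely illustrative):

```python
import pandas as pd

# Hypothetical species_id -> biogroup_id lookup table.
lut = {1: 10, 2: 20, 3: 10}

biota = pd.DataFrame({'SPECIES': [1, 2, 3, 1]})

# Map each species ID to its biogroup ID; IDs missing from the table would become NaN.
biota['BIO_GROUP'] = biota['SPECIES'].map(lut)

print(biota['BIO_GROUP'].tolist())  # [10, 20, 10, 10]
```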


## Add Laboratory ID (REVIEW)

:::{.callout-tip}

**FEEDBACK FOR NEXT VERSION**: Addition of the laboratory ID column requires the lookup table to be sanitized. 

:::

Let's use the `get_unique_across_dfs` utility function to review the unique laboratory IDs in the OSPAR dataset:

In [None]:
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='data_provi', as_df=True).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40.0,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104
value,SSM,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,EA - Environment Agency,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Rijkswaterstaat Centre for Water Management,"Defra-Department for Environment, Food and Rur...",DTU SUS,SCKâ¢CEN,Institute of Marine Research/Norwegian Radiati...,Endeavour 10/2004,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institute for energy technology,IRSN : LS3E,Nuclear Energy Research centre,DTU ENV,IFE,NorwegiaN Radiation Protection Authority,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,EA-Environment Agency,Norweigian Radiation Protection Authority,Insititute for Marine Research,IRSN : LVRE/MN,IFE/NRPA,Nuclear Safety Council,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institute of Marine Research,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,IRSN-LRC,IRSN : OPRI/MN,Institute for Energy technology,Intitute for Marine Research,Icelandic Radiation Safety Authority,IRSN : OPRI-LVRE/MN,IRSN : OPRI/DDASS,Norwegian Radiaton Protection Authority,SEPA-Scottish Environment Protection Agency,Institute for marine research,IRSN : LRC,Insititute for Energy Technology,,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,IMR,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institute for Marine Research/Norweigian Radia...,SL-Sellafield Ltd,Norwegian Radiation and Nuclear Safety Authority,Scientific Institute of Public Health,"Federal Maritime and Hydrographic Agency, Hamburg",Johann Heinrich von ThÃÂ¸nen Institute (vTI),IRSN-LVRE,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,IRSN : OPRI-LVRE,Rijkswaterstaat Laboratory CIV,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,RisÃÂ¸-DTU,Swedish Radiation Safety Authority,IRSN : LVRE,IRSN : LS3E/Marine Nationale,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de 
Radioprotection et SÃ»retÃ© NuclÃ©...,NIEA-Northern Ireland Environment Agency,DTU Nutech,Norwegian Radioaton Protection Authority,Johann Heinrich von Thuenen Institute (vTI),Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,IRSN : LVRE/RSMASS,FSA-Food Standards Agency,IRSN : LERFA,Norwegian Radiation Protection Authority,Insitute of Marine Research,IRSN : LRC/LS3E/RSMASS,Instiute of Marine Research,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,IRSN : OPRI,SCKÃ¢ÂÂ¢CEN,NRPA,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institute for Energy Technology,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,"DTU Nutech, DK",Institute for Marine Research,BEIS,BEIS (formerly DECC),Institute of Energy Technology,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Radiological Protection Instiute of Ireland,Corystes 14/2004,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,IRSN : LS3E/RSMASS,Environmental Protection Agency,Johann Heinrich von ThÃÂ¼nen Institute (vTI),Radiological Protection Institute of Ireland,RisÃ¸-DTU,SCKCEN,"Institute for Energy Technology, Kjeller, Norway"


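The `LAB` information could be included with some additional work: the provider names above contain spelling variants (e.g. `Norweigian Radiation Protection Authority`, `Insititute for Marine Research`) and character-encoding errors that would first need to be harmonized.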
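
Undoing the character-encoding damage visible above (for example `SÃ»retÃ©` for `Sûreté`) would be one part of such a cleanup. One possible approach, shown here as a sketch rather than a vetted cleaning step, is to reverse the common "UTF-8 read as Latin-1" mix-up, repeating the round trip for names that were mangled more than once. The function name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text: str, max_rounds: int = 3) -> str:
    "Undo UTF-8 text that was wrongly decoded as Latin-1, possibly several times."
    for _ in range(max_rounds):
        try:
            repaired = text.encode('latin-1').decode('utf-8')
        except UnicodeError:
            break  # Not (or no longer) mojibake: keep the current text.
        if repaired == text:
            break
        text = repaired
    return text

# 'Sûreté' mangled once, written with escapes to keep the source ASCII-only.
mangled = 'S\u00c3\u00bbret\u00c3\u00a9'
repaired = fix_mojibake(mangled)
print(repaired == 'S\u00fbret\u00e9')  # True
print(fix_mojibake('IFE'))             # IFE
```

Names that are already clean are returned unchanged, because re-decoding them either fails or is a no-op. Truncated values (ending in `...`) and genuine spelling variants would still need a manual lookup.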

## Add Sample ID (REVIEW)

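**Note**: the source files may contain data from different years, so the same `id` can appear more than once. The cell below lists the `BIOTA` rows that share an `id`: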

In [None]:
with pd.option_context('display.max_columns', None):
    display(dfs['BIOTA'][dfs['BIOTA'].duplicated('id', keep=False)].sort_values('id'))


Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
457,ospar_biota_1996_01_003.1,POINT (51.23333333333333 2.914722222222222),34977,Belgium,8,Ostend,276,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-10T00:00:00,"239,240Pu",0,0.086,00146,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
1004,ospar_biota_1997_01_003.1,POINT (51.23333333333333 2.914722222222222),34977,Belgium,8,Ostend,276,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-10T00:00:00,"239,240Pu",0,0.086,00146,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
458,ospar_biota_1996_01_003.2,POINT (51.23333333333333 2.914722222222222),34978,Belgium,8,Ostend,407,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-21T00:00:00,"239,240Pu",0,0.039,000936,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
1005,ospar_biota_1997_01_003.2,POINT (51.23333333333333 2.914722222222222),34978,Belgium,8,Ostend,407,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-21T00:00:00,"239,240Pu",0,0.039,000936,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
459,ospar_biota_1996_01_003.3,POINT (51.23333333333333 2.914722222222222),34979,Belgium,8,Ostend,439,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-05-05T00:00:00,"239,240Pu",0,0.014,000546,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1001,ospar_biota_1996_01_003.545,POINT (54.48888888888889 -3.606944444444445),91594,United Kingdom,6,Sellafield,1997000616,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-02-04T00:00:00,137Cs,0,5.993,00489999987,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W,,54.488889,-3.606944,1997
1549,ospar_biota_1997_01_003.546,POINT (54.48888888888889 -3.606944444444445),91595,United Kingdom,6,Sellafield,1997002288,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-04-22T00:00:00,137Cs,0,4.701,00309999995,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W,,54.488889,-3.606944,1997
1002,ospar_biota_1996_01_003.546,POINT (54.48888888888889 -3.606944444444445),91595,United Kingdom,6,Sellafield,1997002288,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-04-22T00:00:00,137Cs,0,4.701,00309999995,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W,,54.488889,-3.606944,1997
1003,ospar_biota_1996_01_003.547,POINT (54.48888888888889 -3.606944444444445),91634,United Kingdom,6,Sellafield,1997006904,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-09-02T00:00:00,"239,240Pu",0,5.280,00813119933,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W. Annual bulk of 4 samples - represen...,,54.488889,-3.606944,1997
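
The duplicated ids listed above can be quantified by counting rows per `id` and keeping the ids that occur more than once. The sketch below uses made-up rows that mimic this situation, where the same record is served under two different `fid` values. Whether such rows should actually be dropped remains a curation decision.

```python
import pandas as pd

# Toy rows (hypothetical): id 34977 appears under two different fid values.
biota = pd.DataFrame({
    'fid': ['ospar_biota_1996_01_003.1', 'ospar_biota_1997_01_003.1', 'ospar_biota_1997_01_003.9'],
    'id': [34977, 34977, 40000],
    'activity_o': [0.086, 0.086, 1.2],
})

# Number of rows per id, restricted to ids seen more than once.
counts = biota.groupby('id').size()
dup_ids = counts[counts > 1].index.tolist()
print(dup_ids)  # [34977]

# Rows identical in everything except `fid` can be collapsed to one.
cols = [c for c in biota.columns if c != 'fid']
deduped = biota.drop_duplicates(subset=cols)
print(len(deduped))  # 2
```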


In [None]:
result = dfs['BIOTA'].groupby('id', as_index=False)
result

<pandas.core.groupby.generic.DataFrameGroupBy object>

The OSPAR dataset includes an `id` column, which we will use to create the `SMP_ID` column.

In [None]:
#| export
class AddSampleIdCB(Callback):
    "Create a SMP_ID column from the ID column"
    def __call__(self, tfm):
        for df in tfm.dfs.values():
            if 'id' in df.columns:
                df['SMP_ID'] = df['id']

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            AddSampleIdCB(),
                            CompareDfsAndTfmCB(dfs)

                            ])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['SMP_ID'].unique()}")

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
    

BIOTA: [38847 38848 38849 ... 96862 96863 96864]
SEAWATER: [ 45552.  45553.  45554. ... 121649. 121650.     nan]
                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 



In [None]:
dfs['SEAWATER']['id']

0        45552.0
1        45553.0
2        45554.0
3        45555.0
4        45556.0
          ...   
19014        NaN
19015        NaN
19016        NaN
19017        NaN
19018        NaN
Name: id, Length: 19019, dtype: float64

## Add depth

The OSPAR dataset includes a sampling depth column (`Sampling depth`, available as `sampling_d` in the `SEAWATER` DataFrame). In this section, we create a callback that stores this depth in the `SMP_DEPTH` column of the MARIS dataset.

In [None]:
#| export
class AddDepthCB(Callback):
    "Ensure depth values are floats and add 'SMP_DEPTH' columns."
    def __call__(self, tfm: Transformer):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                if 'sampling_d' in df.columns:
                    df['SMP_DEPTH'] = df['sampling_d'].astype(float)

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    AddDepthCB()
    ])
tfm()
for grp in tfm.dfs.keys():  
    if 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())

SEAWATER:        SMP_DEPTH
0            2.0
1           24.0
3           25.0
5           32.0
7           36.0
...          ...
18500       89.0
18748        0.5
18751     1665.0
18759      276.4
18760      372.6

[130 rows x 1 columns]


## Standardize Coordinates

The OSPAR dataset offers coordinates in degrees, minutes, and seconds (DMS). The following callback is designed to convert DMS to decimal degrees. 

In [None]:
#| export
class ConvertLonLatCB(Callback):
    """Convert Coordinates to decimal degrees (DDD.DDDDD°)."""
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            df['LAT'] = self._convert_latitude(df)
            df['LON'] = self._convert_longitude(df)

    def _convert_latitude(self, df: pd.DataFrame) -> pd.Series:
        return np.where(
            df['latdir'].isin(['S']),
            self._dms_to_decimal(df['latd'], df['latm'], df['lats']) * -1,
            self._dms_to_decimal(df['latd'], df['latm'], df['lats'])
        )

    def _convert_longitude(self, df: pd.DataFrame) -> pd.Series:
        return np.where(
            df['longdir'].isin(['W']),
            self._dms_to_decimal(df['longd'], df['longm'], df['longs']) * -1,
            self._dms_to_decimal(df['longd'], df['longm'], df['longs'])
        )

    def _dms_to_decimal(self, degrees: pd.Series, minutes: pd.Series, seconds: pd.Series) -> pd.Series:
        return degrees + minutes / 60 + seconds / 3600


In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB()
                            ])
tfm()
tfm.dfs['SEAWATER'][['LAT','latd', 'latm', 'lats', 'LON', 'latdir', 'longd', 'longm','longs', 'longdir']]

Unnamed: 0,LAT,latd,latm,lats,LON,latdir,longd,longm,longs,longdir
0,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
1,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
2,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
3,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
4,56.116667,56,7,0.0,11.166667,N,11,10,0.0,E
...,...,...,...,...,...,...,...,...,...,...
19014,54.916333,54,54,58.8,-0.280167,N,0,16,48.6,W
19015,53.912500,53,54,45.0,0.918167,N,0,55,5.4,E
19016,53.930667,53,55,50.4,1.275333,N,1,16,31.2,E
19017,54.508833,54,30,31.8,2.716500,N,2,42,59.4,E
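
As a worked check of the conversion, take the row with index 19014 above: 54°54′58.8″ N, 0°16′48.6″ W. The sketch below repeats the same arithmetic outside the callback; `dms_to_decimal` is a hypothetical standalone helper, not part of the handler.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, direction: str) -> float:
    "Convert degrees/minutes/seconds to signed decimal degrees (S and W are negative)."
    value = degrees + minutes / 60 + seconds / 3600
    return -value if direction in ('S', 'W') else value

lat = dms_to_decimal(54, 54, 58.8, 'N')
lon = dms_to_decimal(0, 16, 48.6, 'W')
print(round(lat, 6), round(lon, 6))  # 54.916333 -0.280167
```

Both values agree with the `LAT` and `LON` columns produced by `ConvertLonLatCB` for that row.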


The `SanitizeLonLatCB` callback drops rows where both longitude and latitude equal 0, or where the coordinates are unrealistic. It also converts a `,` decimal separator in longitude and latitude values to `.`.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['BIOTA'][['LAT','LON']])

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

             LAT        LON
0      55.966667  11.583333
1      55.966667  11.583333
2      55.966667  11.583333
3      55.966667  11.583333
4      55.966667  11.583333
...          ...        ...
15257  58.452500  -5.041667
15258  58.452500  -5.041667
15259  54.872778  -3.594444
15260  54.872778  -3.594444
15261  54.872778  -3.594444

[15262 rows x 2 columns]
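
`SanitizeLonLatCB` is imported from `marisco.callbacks`, so its code is not shown in this notebook. The kind of filtering it performs can nevertheless be sketched on toy data. This is an illustration under our own assumptions (valid ranges of ±90 for latitude and ±180 for longitude), not the library's implementation.

```python
import pandas as pd

# Toy coordinates (hypothetical): one (0, 0) row and one out-of-range row.
df = pd.DataFrame({
    'LAT': [55.966667, 0.0, 95.0, 54.872778],
    'LON': [11.583333, 0.0, 10.0, -3.594444],
})

both_zero = (df['LAT'] == 0) & (df['LON'] == 0)
out_of_range = (df['LAT'].abs() > 90) | (df['LON'].abs() > 180)

clean = df[~(both_zero | out_of_range)].reset_index(drop=True)
print(len(df) - len(clean))  # 2
```

In the OSPAR run above no rows are removed, which suggests that all converted coordinates already fall within plausible bounds.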


## Review all callbacks

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 



### Example change logs

Review the change logs for the NetCDF encoding.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs

["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column.",
 'Remap data provider nuclide names to standardized MARIS nuclide names.',
 'Parse the time format in the dataframe and check for inconsistencies.',
 'Encode time as seconds since epoch.',
 'Sanitize value by removing blank entries and populating `value` column.',
 'Normalize uncertainty values in DataFrames.',
 "Callback to update DataFrame 'UNIT' columns based on a lookup table.",
 'Remap detection limit values to MARIS format using a lookup table.',
 "Remap values from 'species' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'biological' to 'enhanced_species' for groups: BIOTA.",
 "Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met.",
 'Add a temporary column with the body part and biological group combined.',
 "Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA.",
 'Create a SMP_ID column from the ID column',
 "Ensure depth value

## Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


In [None]:
#| export
def get_attrs(
    tfm: Transformer, # Transformer object
    zotero_key: str, # Zotero dataset record key
    kw: list = kw # List of keywords
    ) -> dict: # Global attributes
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()

In [None]:
#|eval: false
get_attrs(tfm, zotero_key=zotero_key, kw=kw)

{'geospatial_lat_min': '49.43222222222222',
 'geospatial_lat_max': '81.26805555555555',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))',
 'geospatial_vertical_max': '1850.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1995-01-01T00:00:00',
 'time_coverage_end': '2022-12-25T00:00:00',
 'id': 'LQRA4MMK',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science >

### Encoding NetCDF

In [None]:
#| export
def encode(
    fname_out_nc: str, # Output file name
    **kwargs # Additional arguments
    ) -> None:
    "Encode data to NetCDF."
    dfs = wfs_processor()
    tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                           )
    encoder.encode()

In [None]:
#|eval: false
encode(fname_out_nc, verbose=True)

--------------------------------------------------------------------------------
Creating enums for the following columns:
['BODY_PART', 'NUCLIDE', 'SPECIES', 'UNIT', 'DL']
Creating enum for body_part_t with values {'Not applicable': -1, 'Not available': 0, 'Whole animal': 1, 'Whole animal eviscerated': 2, 'Whole animal eviscerated without head': 3, 'Flesh with bones': 4, 'Blood': 5, 'Skeleton': 6, 'Bones': 7, 'Exoskeleton': 8, 'Endoskeleton': 9, 'Shells': 10, 'Molt': 11, 'Skin': 12, 'Head': 13, 'Tooth': 14, 'Otolith': 15, 'Fins': 16, 'Faecal pellet': 17, 'Byssus': 18, 'Soft parts': 19, 'Viscera': 20, 'Stomach': 21, 'Hepatopancreas': 22, 'Digestive gland': 23, 'Pyloric caeca': 24, 'Liver': 25, 'Intestine': 26, 'Kidney': 27, 'Spleen': 28, 'Brain': 29, 'Eye': 30, 'Fat': 31, 'Heart': 32, 'Branchial heart': 33, 'Muscle': 34, 'Mantle': 35, 'Gills': 36, 'Gonad': 37, 'Ovary': 38, 'Testes': 39, 'Whole plant': 40, 'Flower': 41, 'Leaf': 42, 'Old leaf': 43, 'Young leaf': 44, 'Leaf upper part': 45

## NetCDF Review

First, let's review the global attributes of the NetCDF file:

In [None]:
#| eval: false
contents = ExtractNetcdfContents(fname_out_nc)
print(contents.global_attrs)

{'id': 'LQRA4MMK', 'title': 'OSPAR Environmental Monitoring of Radioactive Substances', 'summary': '', 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)', 'history': 'TBD', 'keywords_vocabulary': 'GCMD Science Keywords', 'keywords_vocabulary_url': 'ht

Next, review the `publisher_postprocess_logs` attribute, which lists the description of each transformation applied by the callbacks:

In [None]:
#| eval: false
print(contents.global_attrs['publisher_postprocess_logs'])

Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Create a SMP_ID column from the ID column, Ensure depth values are floats and add 'SMP_DEPTH' columns., C
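
The log is stored as a single string, which is hard to scan. As a minimal sketch, the snippet below splits it into one step per line. It assumes that entries are joined with `, ` and that each entry ends with a period, which holds for most but not all of the entries printed above (an entry without a trailing period stays merged with the next one). The `logs` string here is a short excerpt copied from the output, not the full attribute.

```python
import re

# Illustrative excerpt of the log string printed above (not the full log).
logs = (
    "Remap data provider nuclide names to standardized MARIS nuclide names., "
    "Encode time as seconds since epoch., "
    "Convert 'nuclide' column values to lowercase, strip spaces, "
    "and store in 'nuclide' column."
)

# Assumption: a new entry starts after a period followed by ", ".
# Commas inside a sentence are not preceded by a period, so they are kept.
steps = [s.strip() for s in re.split(r"(?<=\.),\s+", logs) if s.strip()]

for i, step in enumerate(steps, start=1):
    print(f"{i:2d}. {step}")
```

With the real file, `logs` would be `contents.global_attrs['publisher_postprocess_logs']`.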

Now let's review the enums of the groups in the NetCDF file:

In [None]:
#| eval: false
print(contents.enum_dicts)

{'BIOTA': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', '
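
The enum dictionaries map labels to codes stored as strings, while the data columns hold integer codes. The sketch below inverts such a dictionary to translate codes back into readable labels. The `nuclide_enum` dictionary is a small hand-copied subset of the output above, and the `NUCLIDE_NAME` column is a hypothetical name used only for illustration.

```python
import pandas as pd

# Illustrative subset of contents.enum_dicts['BIOTA']['nuclide'] shown above.
nuclide_enum = {'NOT AVAILABLE': '0', 'tc99': '15', 'cs137': '33'}

# Invert label -> code into code -> label, casting the string codes to int.
code_to_name = {int(code): name for name, code in nuclide_enum.items()}

# A few NUCLIDE codes as they appear in the decoded data.
df = pd.DataFrame({'NUCLIDE': [33, 33, 15]})
df['NUCLIDE_NAME'] = df['NUCLIDE'].map(code_to_name)  # hypothetical column name
print(df)
```

Codes missing from the dictionary map to `NaN`, which makes gaps in a lookup easy to spot.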

Let's review the data of the NetCDF file:

In [None]:
#| eval: false
dfs = contents.dfs
dfs

{'BIOTA':              LON        LAT        TIME  SMP_ID  NUCLIDE     VALUE  UNIT  \
 0      11.583333  55.966667   797040000   38847       33   2.02170     5   
 1      11.583333  55.966667   805075200   38848       33   2.34446     5   
 2      11.583333  55.966667   811468800   38849       33   2.62356     5   
 3      11.583333  55.966667   811468800   38850       33   2.78070     5   
 4      11.583333  55.966667   819504000   38851       33   1.51102     5   
 ...          ...        ...         ...     ...      ...       ...   ...   
 15257  -5.041667  58.452499  1619395200   96860       15   9.50000     5   
 15258  -5.041667  58.452499  1627344000   96861       15  10.00000     5   
 15259  -3.594445  54.872776  1623974400   96862       33   0.96400     5   
 15260  -3.594445  54.872776  1639353600   96863       33   1.48000     5   
 15261  -3.594445  54.872776  1640908800   96864       15  13.20000     5   
 
             UNC  DL  SPECIES  BODY_PART  
 0      0.031336   1  

Let's review the biota data:

In [None]:
#| eval: false
nc_dfs_biota=dfs['BIOTA']
nc_dfs_biota

Unnamed: 0,LON,LAT,TIME,SMP_ID,NUCLIDE,VALUE,UNIT,UNC,DL,SPECIES,BODY_PART
0,11.583333,55.966667,797040000,38847,33,2.02170,5,0.031336,1,96,40
1,11.583333,55.966667,805075200,38848,33,2.34446,5,0.023445,1,96,40
2,11.583333,55.966667,811468800,38849,33,2.62356,5,0.020988,1,392,40
3,11.583333,55.966667,811468800,38850,33,2.78070,5,0.015294,1,96,40
4,11.583333,55.966667,819504000,38851,33,1.51102,5,0.008311,1,96,40
...,...,...,...,...,...,...,...,...,...,...,...
15257,-5.041667,58.452499,1619395200,96860,15,9.50000,5,1.750000,1,96,56
15258,-5.041667,58.452499,1627344000,96861,15,10.00000,5,1.850000,1,96,56
15259,-3.594445,54.872776,1623974400,96862,33,0.96400,5,0.145000,1,394,19
15260,-3.594445,54.872776,1639353600,96863,33,1.48000,5,0.215000,1,394,19
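
The `TIME` column is hard to read as raw integers. According to the post-processing log, time is encoded as seconds since epoch, so a quick readability check is to convert it back to timestamps. This is a minimal sketch that assumes the epoch is 1970-01-01 UTC; the values are the first three `TIME` entries from the table above, and `DATE` is a hypothetical column name.

```python
import pandas as pd

# First TIME values of the BIOTA group, copied from the table above.
df = pd.DataFrame({'TIME': [797040000, 805075200, 811468800]})

# Assumption: seconds elapsed since 1970-01-01 00:00:00 UTC.
df['DATE'] = pd.to_datetime(df['TIME'], unit='s', utc=True)
print(df)
```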


Let's review the seawater data:

In [None]:
#| eval: false
nc_dfs_seawater=dfs['SEAWATER']
nc_dfs_seawater

Unnamed: 0,LON,LAT,SMP_DEPTH,TIME,SMP_ID,NUCLIDE,VALUE,UNIT,UNC,DL
0,11.783334,56.166668,2.0,799286400,45552,33,0.040141,1,0.000341,1
1,11.783334,56.166668,24.0,799286400,45553,33,0.037117,1,0.000390,1
2,11.783334,56.166668,2.0,815184000,45554,33,0.043450,1,0.000282,1
3,11.783334,56.166668,25.0,815184000,45555,33,0.046080,1,0.000253,1
4,11.166667,56.116665,2.0,799286400,45556,33,0.050330,1,0.000377,1
...,...,...,...,...,...,...,...,...,...,...
19014,-0.280167,54.916332,3.0,1661644800,9223372036854775808,33,0.002261,1,0.000298,1
19015,0.918167,53.912498,3.0,1661731200,9223372036854775808,33,0.002130,1,0.000304,1
19016,1.275333,53.930668,3.0,1661731200,9223372036854775808,33,0.002210,1,0.000298,1
19017,2.716500,54.508835,3.0,1661731200,9223372036854775808,33,0.002270,1,0.000285,1
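
The last rows above all show `SMP_ID` equal to 9223372036854775808, which is exactly `2**63` and therefore unlikely to be a genuine sample identifier. The sketch below counts such rows so they can be inspected further. It does not make any claim about the cause; the sample values are copied from the table, and the unsigned 64-bit dtype is an assumption made so that the value fits.

```python
import numpy as np
import pandas as pd

# A few SMP_ID values copied from the SEAWATER table above.
# Assumption: uint64 is used here only so that 2**63 can be represented.
smp_id = pd.Series(
    np.array([45552, 45553, 9223372036854775808, 9223372036854775808],
             dtype=np.uint64),
    name='SMP_ID',
)

suspect_value = np.uint64(2**63)
mask = smp_id == suspect_value

print(f"{int(mask.sum())} of {len(smp_id)} rows have SMP_ID == 2**63")
```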


## Data Format Conversion 

The MARIS data processing workflow involves two key steps:

1. **NetCDF to Standardized CSV Compatible with OpenRefine Pipeline**
   - Convert standardized NetCDF files to CSV formats compatible with OpenRefine using the `NetCDFDecoder`.
   - Preserve data integrity and variable relationships.
   - Maintain standardized nomenclature and units.

2. **Database Integration**
   - Process the converted CSV files using OpenRefine.
   - Apply data cleaning and standardization rules.
   - Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the `NetCDFDecoder` class.

In [None]:
#| eval: false
decode(fname_in=fname_out_nc, verbose=True)

Saved BIOTA to ../../_data/output/191-OSPAR-2024_BIOTA.csv
Saved SEAWATER to ../../_data/output/191-OSPAR-2024_SEAWATER.csv
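
Before handing the files to OpenRefine, it can be useful to confirm that they were written and check their shape. The sketch below uses a hypothetical helper, `load_decoded`, which is not part of Marisco, to read whichever of the listed CSV files exist. The paths are the ones reported in the output above.

```python
from pathlib import Path

import pandas as pd


def load_decoded(paths):
    """Read each existing CSV into a DataFrame keyed by file stem (hypothetical helper)."""
    return {Path(p).stem: pd.read_csv(p) for p in paths if Path(p).exists()}


# Paths reported by decode() above; missing files are silently skipped.
paths = [
    '../../_data/output/191-OSPAR-2024_BIOTA.csv',
    '../../_data/output/191-OSPAR-2024_SEAWATER.csv',
]

decoded = load_decoded(paths)
for name, df in decoded.items():
    print(name, df.shape)
```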
