In [None]:
#| default_exp handlers.ospar

# OSPAR 

> This data pipeline, known as a "handler" in Marisco terminology, is designed to clean, standardize, and encode [OSPAR data](https://odims.ospar.org/en/) into `NetCDF` format. The handler processes raw OSPAR data, applying various transformations and lookups to align it with `MARIS` data standards.

Key functions of this handler:

- **Cleans** and **normalizes** raw OSPAR data
- **Applies standardized nomenclature** and units
- **Encodes the processed data** into `NetCDF` format compatible with MARIS requirements

This handler is a crucial component in the Marisco data processing workflow, ensuring OSPAR data is properly integrated into the MARIS database.

:::{.callout-tip}

For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)](https://github.com/franckalbinet/marisco/tree/main/install_configure_guide) for detailed information.

:::

The present notebook aims to be an instance of [Literate Programming](https://www.wikiwand.com/en/articles/Literate_programming): a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case `marisco/handlers/ospar.py`), the code snippet is added to the module using `#| export`, as provided by the [nbdev](https://nbdev.fast.ai/getting_started.html) library.

In [None]:
#| hide
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
#| export
import pandas as pd 
import numpy as np
import fastcore.all as fc 
from fastcore.basics import patch
from typing import  Dict, Callable 
import re
from owslib.wfs import WebFeatureService
from io import StringIO

from marisco.utils import (
    Remapper, 
    get_unique_across_dfs,
    NA
)

from marisco.callbacks import (
    Callback, 
    Transformer, 
    EncodeTimeCB, 
    LowerStripNameCB, 
    SanitizeLonLatCB, 
    CompareDfsAndTfmCB, 
    RemapCB
)

from marisco.metadata import (
    GlobAttrsFeeder, 
    BboxCB, 
    DepthRangeCB, 
    TimeRangeCB, 
    ZoteroCB, 
    KeyValuePairCB
)

from marisco.configs import (
    nuc_lut_path, 
    cfg, 
    species_lut_path, 
    bodyparts_lut_path, 
    detection_limit_lut_path, 
    get_lut, 
)

from marisco.encoders import (
    NetCDFEncoder, 
)

from marisco.handlers.data_format_transformation import (
    decode, 
)

from marisco.utils import (
    ExtractNetcdfContents,
)

import warnings
warnings.filterwarnings('ignore')

## Configuration and File Paths

The handler requires several configuration parameters:

1. **fname_out_nc**: Output path and filename for NetCDF file (relative paths supported) 
2. **zotero_key**: Key for retrieving dataset attributes from [Zotero](https://www.zotero.org/)
3. **ref_id**: Reference ID in the MARIS [Zotero library](https://www.zotero.org/groups/2432820/maris/library)

In [None]:
#| export
fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
zotero_key ='LQRA4MMK' # OSPAR MORS zotero key

## OSPAR Data Access and Processing

OSPAR data can be accessed through the [ODIMS OSPAR platform](https://odims.ospar.org/en/search/), which hosts the data and provides access via a [Web Feature Service (WFS)](https://odims.ospar.org/geoserver/odims/wfs/?service=WFS&request=GetCapabilities). The WFS interface enables efficient querying and retrieval of geospatial data.

### `OsparWfsProcessor`: A Tool for OSPAR Data Retrieval

The `OsparWfsProcessor` is a utility designed to interact seamlessly with the OSPAR WFS. It supports specific search parameters tailored to different data types:

- **`ospar_biota`**: Retrieves biological data.
- **`ospar_seawater`**: Retrieves seawater data.

### Workflow

When executed, the processor performs the following steps:

1. Connects to the OSPAR WFS using the specified search parameters.
2. Retrieves the requested data.
3. Organizes the data into a structured format for ease of analysis.

### Output

The processor returns the results as a dictionary of pandas DataFrames, structured as follows:

- **Key: `BIOTA`**  
  Contains biological data retrieved via the `ospar_biota` parameter.
  
- **Key: `SEAWATER`**  
  Contains seawater data retrieved via the `ospar_seawater` parameter.

This design ensures that OSPAR data is both accessible and conveniently structured for further analysis.
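As a rough illustration of how search parameters map to output keys, the sketch below groups feature names by substring. The feature names are hypothetical and no WFS request is made; it only mirrors the filtering idea described above.

```python
# Hypothetical feature names, used only for illustration (no WFS call is made).
available_features = [
    'odims:ospar_biota_2021_01_001',
    'odims:ospar_seawater_2021_01_001',
    'odims:ospar_seawater_2021_01_002',
    'odims:some_other_layer',
]
search_params = {'BIOTA': 'ospar_biota', 'SEAWATER': 'ospar_seawater'}

# For each group, keep the features whose name contains the search string.
features_by_group = {
    group: [name for name in available_features if value in name]
    for group, value in search_params.items()
}
print(features_by_group)
```

Features that match no search string, such as the last entry above, are simply left out of every group.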


:::{.callout-tip}

**Feedback to Data Provider.**

Please note that we are assuming that new versions of data supersede all previous versions. Files are stored on the WFS service with the following naming convention:

- **Prefix**: All filenames start with `odims:ospar_`, indicating that the data originates from the OSPAR dataset hosted on the ODIMS platform.

- **Data Type**: Following the prefix, the filename specifies the type of data:
  - `biota` - Indicates biological data.
  - `seawater` - Indicates seawater-related data.

- **Date and Version**:
  - **Year**: The year of the dataset is represented by four digits (e.g., `2023`).
  - **Month**: The month of the dataset is represented by two digits (e.g., `04` for April).
  - **Version**: The version of the dataset is represented by three digits, where higher numbers indicate more recent versions (e.g., `001`).

- **Separators**: Underscores (`_`) are used as separators to distinctly divide different parts of the filename.

Consider the filename `odims:ospar_biota_2023_01_001`. This indicates a file containing biota data from January 2023, version 001. Under the current implementation, this data would be replaced by the file `odims:ospar_biota_2023_01_002` (i.e., version 002).

:::
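The naming convention and the "latest version wins" assumption can be sketched as follows. The feature names are hypothetical; the regular expression is the one used by `check_feature_pattern` further down in this notebook.

```python
import re
import pandas as pd

pattern = re.compile(r'^odims:ospar_(biota|seawater)_(\d{4})_(\d{2})_(\d{3})$')

# Hypothetical feature names used only for illustration.
names = [
    'odims:ospar_biota_2023_01_001',
    'odims:ospar_biota_2023_01_002',
    'odims:ospar_seawater_2023_01_001',
]

rows = []
for name in names:
    m = pattern.match(name)
    if m:  # skip names that do not follow the convention
        data_type, year, month, version = m.groups()
        rows.append({'feature': name, 'type': data_type,
                     'year': int(year), 'month': int(month), 'version': int(version)})

features = pd.DataFrame(rows)
# For each (type, year, month), keep the row with the highest version number.
latest = features.loc[features.groupby(['type', 'year', 'month'])['version'].idxmax()]
print(latest['feature'].tolist())
```

Here version `001` of the biota file is dropped in favour of version `002`, while the single seawater file is kept.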


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The 2022 OSPAR Biota data is unavailable on the WFS. The file `ospar_biota_2022_01_001.csv` contains Seawater data (i.e. Sample_type is 'Water'). See https://odims.ospar.org/en/submissions/ospar_biota_2022_01/. 
For this reason, the `BIOTA` dataset does not contain any data for the year 2022.
:::

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The 2022 OSPAR Seawater csv data retrieved from the WFS does not contain a `year` column. Data for all other years contains a `year` column.
:::
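A missing `year` column can be reconstructed from the sampling date. The snippet below is a minimal sketch on a toy frame, assuming the seawater date column is `sampling_1` as in the other extracts; the dates are illustrative.

```python
import pandas as pd

# Toy frame standing in for a seawater extract without a `year` column.
df = pd.DataFrame({'sampling_1': ['2022-03-15T00:00:00', '2022-11-02T00:00:00']})

# Derive `year` from the sampling date only when the column is absent.
if 'year' not in df.columns:
    df['year'] = pd.to_datetime(df['sampling_1']).dt.year

print(df['year'].tolist())
```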

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The 2022 OSPAR Seawater csv data retrieved from the WFS does not contain an `id` column. Data for all other years contains an `id` column. In the absence of an `id`, we include all 2022 data in the `SEAWATER` dataset.
:::

In [None]:
#| export
class OsparWfsProcessor:
    "Processor for OSPAR Web Feature Service operations, managing feature filtering and data fetching."
    def __init__(self, url, search_params=None, version='2.0.0', verbose=False):
        "Initialize with URL, version, and search parameters."
        fc.store_attr()
        self.wfs = WebFeatureService(url=self.url, version=self.version)
        self.features_dfs = {}
        self.dfs = {}
        self.duplicates_dfs = {}

    def __call__(self):
        "Process, fetch and filter OSPAR data"
        self.filter_features()
        self.check_feature_pattern()
        self.extract_version_from_feature_name()
        self.filter_latest_versions()
        self.fetch_and_combine_csv()
        if self.verbose:
            self.display_year_ranges()
        return self.dfs

In [None]:
#| export
@patch
def filter_features(self: OsparWfsProcessor):
    "Filter features based on search parameters."
    available_features = list(self.wfs.contents.keys())
    for group, value in self.search_params.items():
        filtered_features = [ftype for ftype in available_features if value in ftype]
        self.features_dfs[group] = pd.DataFrame([{'feature': ftype} for ftype in filtered_features])


In [None]:
#| export
@patch
def check_feature_pattern(self: OsparWfsProcessor):
    """
    Check and retain features conforming to a specific pattern, printing unmatched features.
    """
    pattern = re.compile(r'^odims:ospar_(biota|seawater)_(\d{4})_(\d{2})_(\d{3})$')
    unmatched_features = []
    for group, df in list(self.features_dfs.items()):
        # Apply the pattern and find unmatched features
        matched_features = df['feature'].apply(lambda x: bool(pattern.match(x)))
        unmatched = df[~matched_features]['feature']
        unmatched_features.extend(unmatched.tolist())
        # Filter the DataFrame to only include matched features
        self.features_dfs[group] = df[matched_features]

    if unmatched_features:
        print("Unmatched features:", unmatched_features)

In [None]:
#| export
@patch
def extract_version_from_feature_name(self: OsparWfsProcessor):
    "Extract version from feature."
    for group, df in list(self.features_dfs.items()):
        df['source'] = df['feature'].apply(lambda x: x.split('_')[0])
        df['type'] = df['feature'].apply(lambda x: x.split('_')[1])
        df['year'] = df['feature'].apply(lambda x: x.split('_')[2])
        df['month'] = df['feature'].apply(lambda x: x.split('_')[3])
        df['version'] = df['feature'].apply(lambda x: x.split('_')[4])

In [None]:
#| export
@patch
def filter_latest_versions(self: OsparWfsProcessor):
    "Filter to include only the latest version of each feature"
    for group, df in list(self.features_dfs.items()):
        df[['year', 'month', 'version']] = df[['year', 'month', 'version']].astype(int)
        
        if group == 'BIOTA':
            # Removing biota data for the year 2022 as the data is unavailable on the WFS.
            df = df[(df['year'] != 2022) & (df['type'] == 'biota')]            
        
        idx = df.groupby(['source', 'type', 'year', 'month'])['version'].idxmax()
        self.features_dfs[group] = df.loc[idx]

In [None]:
#| export
@patch
def drop_duplicates(self: OsparWfsProcessor, df, group, index_col='id'):
    """
    Drop duplicate rows based on the index provided, keeping the last entry.
    Additionally, track and report all duplicate entries.
    """

    if index_col in df.columns:
        # Set the index but do not modify the original DataFrame yet
        indexed_df = df.set_index(index_col )

        # Ensure 'year' column is present for sorting
        if 'year' in indexed_df.columns:
            indexed_df.sort_values(by='year', ascending=True, inplace=True)

        # Warn about NaN values in the index column, then drop those rows.
        if self.verbose and df[index_col].isnull().sum() > 0:
            print(f"Warning: {group} contains {df[index_col].isnull().sum()} NaN values in the {index_col} column.")
        indexed_df = indexed_df[indexed_df.index.notna()]
                
        # Identify all duplicates to keep track of what is removed and kept
        duplicates = indexed_df[indexed_df.index.duplicated(keep=False)]
        # Sort duplicates on the index 
        duplicates.sort_index(inplace=True)
        # Drop duplicates, keeping the last entry
        cleaned_df = indexed_df[~indexed_df.index.duplicated(keep='last')]
        cleaned_df.reset_index(drop=False, inplace=True)
        # Store the duplicates so they can be reviewed later
        if not duplicates.empty:
            self.duplicates_dfs[group] = duplicates
            if self.verbose:
                print(f"Duplicates identified using '{index_col}' as the index in {group}. Review the 'duplicates_dfs' attribute for more details.")
        else:
            if self.verbose:
                print("No duplicates found to remove.")
        return cleaned_df
    else:
        if self.verbose:
            print(f"Warning: '{index_col}' column not found. Using default index.")
        return df

In [None]:
#| export
@patch
def fetch_and_combine_csv(self: OsparWfsProcessor):
    """
    Fetch CSV data for each feature from the WFS and combine it into a single DataFrame for each sample type.
    This method also handles the 'year' column by extracting it from date columns if not present.
    """

    def fetch_data(row):
        feature = row['feature']
        year = row['year']
        data_type = row['type']  # This can be removed when the data is made consistent (i.e., 'year' column added).
        response = self.wfs.getfeature(typename=feature, outputFormat='csv')
        csv_data = StringIO(response.read().decode('utf-8'))
        df_csv = pd.read_csv(csv_data)
        df_csv.columns = df_csv.columns.str.lower()  # Standardize column names to lowercase

        # Extract 'year' from date columns if not present. # TODO: remove adding the column when the data is made consistent (i.e., 'year' column added).
        if 'year' not in df_csv.columns:
            if self.verbose:
                print(f"Warning: {feature} does not contain a 'year' column, adding it from date column.")
            date_column = 'sampling_d' if data_type == 'biota' else 'sampling_1'
            df_csv['year'] = pd.to_datetime(df_csv[date_column]).dt.year

        # Validate the 'year' column against the expected year
        if not df_csv['year'].eq(year).all():
            years = df_csv['year'].unique()
            if self.verbose: 
                print(f"Warning: {feature} contains data for invalid year. This file contains data for years: {list(years)}")

        return df_csv

    for group, df in self.features_dfs.items():
        # Apply fetch_data function to each row in the features DataFrame and combine the results in a data DataFrame.
        data_frames = df.apply(fetch_data, axis=1).tolist()
        
        combined_df = pd.concat(data_frames, ignore_index=True)

        # drop duplicates using the `id` column as the index.
        combined_df = self.drop_duplicates(combined_df, group, 'id')

        self.dfs[group] = combined_df

In [None]:
#| export
@patch
def display_year_ranges(self: OsparWfsProcessor):
    """
    Display the range of years for the data retrieved from the WFS for 'BIOTA' and 'SEAWATER'.
    """
    # Extract the DataFrames for 'BIOTA' and 'SEAWATER'
    biota_df = self.dfs.get('BIOTA', pd.DataFrame()).copy()
    seawater_df = self.dfs.get('SEAWATER', pd.DataFrame()).copy()

    # Function to process each DataFrame
    def process_df(df, date_column):
        if date_column in df.columns:
            df[date_column] = pd.to_datetime(df[date_column])
            df['year'] = df[date_column].dt.year
            min_year = df['year'].min()
            max_year = df['year'].max()
            all_years = set(range(min_year, max_year + 1))
            missing_years = all_years - set(df['year'].unique())
            return min_year, max_year, sorted(missing_years)
        return None, None, []

    # Process each DataFrame
    biota_min, biota_max, biota_missing = process_df(biota_df, 'sampling_d')
    seawater_min, seawater_max, seawater_missing = process_df(seawater_df, 'sampling_1')

    # Print the results
    if biota_min and biota_max:
        biota_message = f"OSPAR 'BIOTA' data retrieved for years from {biota_min} to {biota_max}"
        if biota_missing:
            biota_message += f" with the exclusion of {biota_missing}"
        print(biota_message)
    else:
        print("'BIOTA' data is not available or lacks the specified date column.")

    if seawater_min and seawater_max:
        seawater_message = f"OSPAR 'SEAWATER' data retrieved for years from {seawater_min} to {seawater_max}"
        if seawater_missing:
            seawater_message += f" with the exclusion of {seawater_missing}"
        print(seawater_message)
    else:
        print("'SEAWATER' data is not available or lacks the specified date column.")

In [None]:
#|eval: false
verbose=True
wfs_processor=OsparWfsProcessor(url= 'https://odims.ospar.org/geoserver/odims/wfs', search_params={'BIOTA': 'ospar_biota', 'SEAWATER': 'ospar_seawater'}, verbose=verbose)
dfs = wfs_processor()
wfs_processor.verbose=False # set verbose to False to suppress the same feedback throughout the handler.  

Duplicates identified using 'id' as the index in BIOTA. Review the 'duplicates_dfs' attribute for more details.
No duplicates found to remove.
OSPAR 'BIOTA' data retrieved for years from 1995 to 2021 with the exclusion of [1996]
OSPAR 'SEAWATER' data retrieved for years from 1995 to 2021


Review the duplicates in the `BIOTA` dataframe. These duplicates are removed with the `drop_duplicates` method and the latest entry is kept.

In [None]:
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(wfs_processor.duplicates_dfs['BIOTA'])

id,fid,the_geom,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
34977,ospar_biota_1997_01_003.1,POINT (51.23333333333333 2.914722222222222),Belgium,8,Ostend,276,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-10T00:00:00,"239,240Pu",0,0.086,00146,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
34977,ospar_biota_1996_01_003.1,POINT (51.23333333333333 2.914722222222222),Belgium,8,Ostend,276,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-10T00:00:00,"239,240Pu",0,0.086,00146,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
34978,ospar_biota_1996_01_003.2,POINT (51.23333333333333 2.914722222222222),Belgium,8,Ostend,407,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-21T00:00:00,"239,240Pu",0,0.039,000936,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
34978,ospar_biota_1997_01_003.2,POINT (51.23333333333333 2.914722222222222),Belgium,8,Ostend,407,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-04-21T00:00:00,"239,240Pu",0,0.039,000936,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
34979,ospar_biota_1997_01_003.3,POINT (51.23333333333333 2.914722222222222),Belgium,8,Ostend,439,51,14,0.0,N,2,54,53.0,E,BIOT,Molluscs,Mytilus edulis,WHOLE ANIMAL,1997-05-05T00:00:00,"239,240Pu",0,0.014,000546,Bq/kg f.w.,Scientific Institute of Public Health,,,,51.233333,2.914722,1997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91594,ospar_biota_1997_01_003.545,POINT (54.48888888888889 -3.606944444444445),United Kingdom,6,Sellafield,1997000616,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-02-04T00:00:00,137Cs,0,5.993,00489999987,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W,,54.488889,-3.606944,1997
91595,ospar_biota_1997_01_003.546,POINT (54.48888888888889 -3.606944444444445),United Kingdom,6,Sellafield,1997002288,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-04-22T00:00:00,137Cs,0,4.701,00309999995,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W,,54.488889,-3.606944,1997
91595,ospar_biota_1996_01_003.546,POINT (54.48888888888889 -3.606944444444445),United Kingdom,6,Sellafield,1997002288,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-04-22T00:00:00,137Cs,0,4.701,00309999995,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W,,54.488889,-3.606944,1997
91634,ospar_biota_1997_01_003.547,POINT (54.48888888888889 -3.606944444444445),United Kingdom,6,Sellafield,1997006904,54,29,20.0,N,3,36,25.0,W,BIOT,Seaweed,FUCUS VESICULOSUS,GROWING TIPS,1997-09-02T00:00:00,"239,240Pu",0,5.280,00813119933,Bq/kg f.w.,FSA-Food Standards Agency,,St Bees W. Annual bulk of 4 samples - representative sampling date.,,54.488889,-3.606944,1997
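The difference between "all rows involved in a duplicate" and "the rows that survive" can be seen on a toy frame. This is a minimal sketch with illustrative values; it uses the same pandas calls as `drop_duplicates` above.

```python
import pandas as pd

# Toy data: `id` 2 appears twice (values are illustrative only).
df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'fid': ['a.1', 'b.1', 'c.1', 'd.1'],
    'year': [1995, 1996, 1997, 1998],
})

indexed = df.set_index('id').sort_values('year')
# keep=False flags every row that shares its id with another row.
duplicates = indexed[indexed.index.duplicated(keep=False)]
# keep='last' retains only the final occurrence of each id.
cleaned = indexed[~indexed.index.duplicated(keep='last')].reset_index()

print(duplicates['fid'].tolist())
print(cleaned['fid'].tolist())
```

Because the rows are sorted by `year` first, the entry kept for a duplicated `id` is the one from the most recent year.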


Display the head of the `SEAWATER` dataframe with all columns.

In [None]:
#|eval: false
# Show all columns
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(dfs['SEAWATER'].head())

Unnamed: 0,id,fid,the_geom,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,sampling_d,sampling_1,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year,f1,reference_
0,45552.0,ospar_seawater_1995_01_003.1,POINT (56.16666666666666 11.78333333333333),Denmark,12,HesselÃ¸,H95-22,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.040141,6823919,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995,,
1,67787.0,ospar_seawater_1995_01_003.430,POINT (63.65 -15.9),Iceland,15,STO,SJ95B1,63,39,0.0,N,15,54,0.0,W,WATER,0.0,1995-02-15T00:00:00,137Cs,0,0.003,45,Bq/l,Icelandic Radiation Safety Authority,15% uncertainty assumed,AW,,63.65,-15.9,1995,,
2,67788.0,ospar_seawater_1995_01_003.431,POINT (63.65 -15.9),Iceland,15,STO,SJ95B2,63,39,0.0,N,15,54,0.0,W,WATER,140.0,1995-02-15T00:00:00,137Cs,0,0.0031,465,Bq/l,Icelandic Radiation Safety Authority,15% uncertainty assumed,AW,,63.65,-15.9,1995,,
3,67789.0,ospar_seawater_1995_01_003.432,POINT (64.33 -25),Iceland,15,FX6,SJ5EFAX6,64,19,48.0,N,25,0,0.0,W,WATER,0.0,1995-05-15T00:00:00,137Cs,0,0.0026,39,Bq/l,Icelandic Radiation Safety Authority,15% uncertainty assumed,AW,,64.33,-25.0,1995,,
4,67790.0,ospar_seawater_1995_01_003.433,POINT (64.33 -27.97),Iceland,15,FX9,SJ5EFAX9,64,19,48.0,N,27,58,12.0,W,WATER,0.0,1995-05-15T00:00:00,137Cs,0,0.0024,36,Bq/l,Icelandic Radiation Safety Authority,15% uncertainty assumed,AW,,64.33,-27.97,1995,,


Display the head of the `BIOTA` dataframe with all columns.

In [None]:
#|eval: false
# Show all columns
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(dfs['BIOTA'].head())

Unnamed: 0,id,fid,the_geom,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
0,38847,ospar_biota_1995_01_003.1,POINT (55.96666666666667 11.58333333333333),Denmark,12,Klint,950089,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-04-05T00:00:00,137Cs,0,2.0217,626727,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
1,54128,ospar_biota_1995_01_003.311,POINT (50.75 0.5),United Kingdom,3,Dungeness,1995005618,50,45,0.0,N,0,30,0.0,E,BIOT,Fish,GADUS MORHUA,FLESH WITHOUT BONES,1995-07-14T00:00:00,137Cs,0,0.341,28,Bq/kg f.w.,FSA-Food Standards Agency,,Dungeness PLZ,,50.75,0.5,1995
2,54127,ospar_biota_1995_01_003.310,POINT (50.75 0.5),United Kingdom,3,Dungeness,1995005034,50,45,0.0,N,0,30,0.0,E,BIOT,Fish,GADUS MORHUA,FLESH WITHOUT BONES,1995-05-22T00:00:00,137Cs,0,0.225,45,Bq/kg f.w.,FSA-Food Standards Agency,,Dungeness PLZ,,50.75,0.5,1995
3,54126,ospar_biota_1995_01_003.309,POINT (51.08416666666667 1.203055555555556),United Kingdom,10,Dungeness,1995005617,51,5,3.0,N,1,12,11.0,E,BIOT,Seaweed,Fucus vesiculosus,GROWING TIPS,1995-07-18T00:00:00,137Cs,0,0.155,31,Bq/kg f.w.,FSA-Food Standards Agency,,Copt Point,,51.084167,1.203056,1995
4,54125,ospar_biota_1995_01_003.308,POINT (51.08416666666667 1.203055555555556),United Kingdom,10,Dungeness,1995001537,51,5,3.0,N,1,12,11.0,E,BIOT,Seaweed,Fucus vesiculosus,GROWING TIPS,1995-02-20T00:00:00,137Cs,0,0.137,37,Bq/kg f.w.,FSA-Food Standards Agency,,Copt Point,,51.084167,1.203056,1995


## Nuclide Name Normalization

The MARISCO package standardizes the nuclide names in the DataFrames to match the MARIS standard nuclide names specified in a lookup table. 

The lookup process uses the following three columns:
- **`nuclide_id`**: A unique identifier for each nuclide.
- **`nuclide`**: The standard nuclide name.
- **`nc_name`**: The corresponding name used in NetCDF files.

Let’s inspect the lookup table:


In [None]:
#| eval: false
nuc_lut_df = pd.read_excel(nuc_lut_path())
nuc_lut_df.head()

Unnamed: 0,nuclide_id,nuclide,atomicnb,massnb,nusymbol,half_life,hl_unit,nc_name
0,-1,NOT APPLICABLE,,,,,,NOT APPLICABLE
1,0,NOT AVAILABLE,0.0,0.0,0,0.0,-,NOT AVAILABLE
2,1,TRITIUM,1.0,3.0,3H,12.35,Y,h3
3,2,BERYLLIUM,4.0,7.0,7Be,53.3,D,be7
4,3,CARBON,6.0,14.0,14C,5730.0,Y,c14


The nuclide data is provided in the `nuclide` column. However, as shown below, the nuclide names are not standardized.


In [None]:
#| eval: false
dfs = wfs_processor()
df = get_unique_across_dfs(dfs, 'nuclide', as_df=True)
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
value,3H,Cs-137,226Ra,137Cs,210Pb,238Pu,RA-228,"239,240Pu",RA-226,210Po,241Am,CS-137,99Tc,"239, 240 Pu",228Ra


### Lower & strip nuclide names

To simplify the data, we use the `LowerStripNameCB` callback. For each dataframe in the dictionary of dataframes, `LowerStripNameCB` simplifies the nuclide name by converting it to lowercase and stripping any leading or trailing whitespace.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='nuclide')])
dfs_output=tfm()
for key, df in dfs_output.items():
    print(f'{key} nuclides: ')
    print(df['nuclide'].unique())

BIOTA nuclides: 
['137cs' '239,240pu' '210po' '99tc' '226ra' '210pb' '228ra' 'cs-137' '3h'
 '241am' '239, 240 pu' '238pu']
SEAWATER nuclides: 
['137cs' '3h' '99tc' '239,240pu' '226ra' '228ra' '210po' '210pb' 'ra-228'
 'ra-226']
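For readers unfamiliar with the callback, the same effect can be approximated in plain pandas. This sketch only mirrors the behaviour described above; it is not the implementation of `LowerStripNameCB`, and the padded values are made up for illustration.

```python
import pandas as pd

# Illustrative raw values, some with stray whitespace.
raw = pd.Series(['Cs-137', ' 137Cs', 'CS-137 ', '239, 240 Pu'])

# Lowercase first, then strip leading and trailing whitespace.
clean = raw.str.lower().str.strip()
print(clean.unique().tolist())
```

Note that lowercasing and stripping collapse `Cs-137` and `CS-137` into one value, but `cs-137` and `137cs` remain distinct and still need remapping.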


### Remap nuclide names to MARIS data formats

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `nuclide` column has inconsistent naming. For example:

- `Cs-137`, `137Cs` or `CS-137`
- `239,240Pu` or `239, 240 Pu`
- `RA-226` or `226Ra`

See below:

:::

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='nuclide', as_df=True).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
value,3H,Cs-137,226Ra,137Cs,210Pb,238Pu,RA-228,"239,240Pu",RA-226,210Po,241Am,CS-137,99Tc,"239, 240 Pu",228Ra


Below, we map nuclide names used by OSPAR to the MARIS standard nuclide names. 

Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

1. **Inspect** data provider nomenclature;
2. **Match** automatically against MARIS nomenclature (using a fuzzy matching algorithm); 
3. **Fix** potential mismatches; 
4. **Apply** the lookup table to the dataframe.

We will refer to this process as **IMFA** (**I**nspect, **M**atch, **F**ix, **A**pply).
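The IMFA pattern can be sketched without any Marisco code. The example below assumes an edit-distance (Levenshtein) score where 0 means an exact match; the match scores reported in this notebook are consistent with that, but `Remapper`'s internals are not shown here, so treat the scoring function and the variable names as assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    "Edit distance between `a` and `b` (insertions, deletions, substitutions)."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

provider_names = ['137cs', '99tc', 'cs137']  # Inspect (toy subset)
maris_names = ['cs137', 'tc99', 'h3']        # toy subset of `nc_name` values

# Match: closest MARIS name and its distance for every provider name.
lut = {}
for name in provider_names:
    best = min(maris_names, key=lambda m: levenshtein(name, m))
    lut[name] = (best, levenshtein(name, best))

# Fix: override entries with manually reviewed corrections (score reset to 0).
fixes = {'137cs': 'cs137', '99tc': 'tc99'}
lut.update({k: (v, 0) for k, v in fixes.items()})

# Apply: keep only the matched name for use as a replacement table.
remap = {k: v[0] for k, v in lut.items()}
print(remap)
```

The manual fix step matters because a purely distance-based match can pick a wrong but superficially similar name.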

Let's now create an instance of a [fuzzy matching algorithm](https://www.wikiwand.com/en/articles/Approximate_string_matching) `Remapper`. This instance will match the nuclide names of the OSPAR dataset to the MARIS standard nuclide names.

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_output, col_name='nuclide', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_ospar.pkl')

Let's try to match OSPAR nuclide names to MARIS standard nuclide names as automatically as possible. The `match_score` column lets us assess the results:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 44.50it/s]

0 entries matched the criteria, while 14 entries had a match score of 1 or higher.





source_key,matched_maris_name,source_name,match_score
"239, 240 pu",pu240,"239, 240 pu",8
"239,240pu",pu240,"239,240pu",6
226ra,u234,226ra,4
228ra,u235,228ra,4
137cs,i133,137cs,4
210pb,ru106,210pb,4
210po,ru106,210po,4
241am,pu241,241am,4
238pu,u238,238pu,3
99tc,tu,99tc,3


We can now manually review the unmatched nuclide names and construct a dictionary to map them to the MARIS standard.

In [None]:
#| export
fixes_nuclide_names = {
    '99tc': 'tc99',
    '238pu': 'pu238',
    '226ra': 'ra226',
    'ra-226': 'ra226',
    'ra-228': 'ra228',    
    '210pb': 'pb210',
    '241am': 'am241',
    '228ra': 'ra228',
    '137cs': 'cs137',
    '210po': 'po210',
    '239,240pu': 'pu239_240_tot',
    '239, 240 pu': 'pu239_240_tot',
    'cs-137': 'cs137',
    '3h': 'h3'
    }

The dictionary `fixes_nuclide_names` applies manual corrections to the nuclide names before the remapping process. 
The `generate_lookup_table` function has an `overwrite` parameter (default is `True`), which, when set to `True`, creates a pickle file cache of the lookup table. We can now test the remapping process:
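The caching behaviour described above follows a common "build or load" pattern. The sketch below is an assumption about that pattern, not the actual `Remapper` code; `load_or_build` and `build_lut` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(cache_path: Path, build, overwrite: bool = True):
    "Hypothetical helper: rebuild and cache when `overwrite` is True or no cache exists."
    if overwrite or not cache_path.exists():
        result = build()
        cache_path.write_bytes(pickle.dumps(result))
        return result
    return pickle.loads(cache_path.read_bytes())

calls = []
def build_lut():
    calls.append(1)  # count how often the table is rebuilt
    return {'137cs': 'cs137'}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'nuclides_demo.pkl'
    first = load_or_build(cache, build_lut, overwrite=True)    # builds and caches
    second = load_or_build(cache, build_lut, overwrite=False)  # reads the cache

print(first == second, len(calls))
```

With `overwrite=False` the expensive matching step is skipped as long as a cache file is present.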

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1)), 0)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 43.08it/s]


To view all remapped nuclides, we can set the match score threshold to 0, which returns every entry.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
remapper.select_match(match_score_threshold=0, verbose=True).T

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 45.27it/s]


0 entries matched the criteria, while 14 entries had a match score of 0 or higher.


source_key,226ra,"239, 240 pu",228ra,ra-226,238pu,137cs,210pb,ra-228,210po,cs-137,241am,99tc,3h,"239,240pu"
matched_maris_name,ra226,pu239_240_tot,ra228,ra226,pu238,cs137,pb210,ra228,po210,cs137,am241,tc99,h3,pu239_240_tot
source_name,226ra,"239, 240 pu",228ra,ra-226,238pu,137cs,210pb,ra-228,210po,cs-137,241am,99tc,3h,"239,240pu"
match_score,0,0,0,0,0,0,0,0,0,0,0,0,0,0


The nuclide names are now remapped correctly. Next, we create a callback, `RemapNuclideNameCB`, that remaps the nuclide names in the dataframes to their `nuclide_id` values.

Note that we pass `overwrite=False` to `generate_lookup_table` so that the cached lookup table is used.

In [None]:
#| export
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
                                   maris_lut_fn=nuc_lut_path,
                                   maris_col_id='nuclide_id',
                                   maris_col_name='nc_name',
                                   provider_col_to_match='value',
                                   provider_col_key='value',
                                   fname_cache='nuclides_ospar.pkl').generate_lookup_table(fixes=fixes_nuclide_names, 
                                                                                            as_df=False, overwrite=False)

In [None]:
#| export
class RemapNuclideNameCB(Callback):
    "Remap data provider nuclide names to standardized MARIS nuclide names."
    def __init__(self, 
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 col_name: str # Column name to remap
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name=self.col_name, as_df=True)
        #lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}    
        lut = {k: v.matched_id for k, v in self.fn_lut(df_uniques).items()}    
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k][self.col_name].replace(lut)

Let's see it in action, along with the `LowerStripNameCB` callback:

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide')
                            ])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'{key} NUCLIDE unique: ', dfs_out[key]['NUCLIDE'].unique())

BIOTA NUCLIDE unique:  [33 77 47 15 53 41 54  1 72 67]
SEAWATER NUCLIDE unique:  [33  1 15 77 53 54 47 41]
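
As a rough illustration of what this remapping amounts to, here is a minimal standard-library sketch: clean the provider name, then look it up in a dictionary. The helper name `remap_nuclide` and the IDs in `NUCLIDE_LUT` are hypothetical placeholders chosen for illustration; the real lookup table is produced by `Remapper` from the MARIS nomenclature.

```python
# Hypothetical lookup table: cleaned provider name -> MARIS nuclide_id.
# The IDs below are placeholders, not the real MARIS identifiers.
NUCLIDE_LUT = {"137cs": 101, "cs-137": 101, "3h": 102, "239,240pu": 103}

def remap_nuclide(raw_name, lut=NUCLIDE_LUT):
    """Lower-case and strip the provider name, then look up its ID."""
    cleaned = raw_name.strip().lower()
    # Names without an entry are returned unchanged (cleaned form).
    return lut.get(cleaned, cleaned)

raw = ["137Cs", " Cs-137 ", "3H", "239,240Pu", "unknown-x"]
remapped = [remap_nuclide(name) for name in raw]
print(remapped)
```

Returning unmatched names unchanged makes leftovers easy to spot: any non-integer value in the result is a name that still needs a fix.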


## Standardize Time

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: There are inconsistencies in the column names used for time. The `SEAWATER` and `BIOTA` datasets use different column names for time. `SEAWATER` uses the column name `sampling_1` and `BIOTA` uses the column name `sampling_d`.

:::

In [None]:
#| eval: false
dfs = wfs_processor()
with pd.option_context('display.max_columns', None):
    display(dfs['SEAWATER'].head(2))
print('Number of NaN values in sampling_1 for SEAWATER: ', dfs['SEAWATER']['sampling_1'].isnull().sum())

with pd.option_context('display.max_columns', None):
    display(dfs['BIOTA'].head(2))

print('Number of NaN values in sampling_d for BIOTA: ', dfs['BIOTA']['sampling_d'].isnull().sum())

Unnamed: 0,id,fid,the_geom,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,sampling_d,sampling_1,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year,f1,reference_
0,45552.0,ospar_seawater_1995_01_003.1,POINT (56.16666666666666 11.78333333333333),Denmark,12,Hesselø,H95-22,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.040141,6823919,Bq/l,Risø-DTU,,,,56.166667,11.783333,1995,,
1,67787.0,ospar_seawater_1995_01_003.430,POINT (63.65 -15.9),Iceland,15,STO,SJ95B1,63,39,0.0,N,15,54,0.0,W,WATER,0.0,1995-02-15T00:00:00,137Cs,0,0.003,45,Bq/l,Icelandic Radiation Safety Authority,15% uncertainty assumed,AW,,63.65,-15.9,1995,,


Number of NaN values in sampling_1 for SEAWATER:  0


Unnamed: 0,id,fid,the_geom,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
0,38847,ospar_biota_1995_01_003.1,POINT (55.96666666666667 11.58333333333333),Denmark,12,Klint,950089,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-04-05T00:00:00,137Cs,0,2.0217,626727,Bq/kg f.w.,Risø-DTU,,,,55.966667,11.583333,1995
1,54128,ospar_biota_1995_01_003.311,POINT (50.75 0.5),United Kingdom,3,Dungeness,1995005618,50,45,0.0,N,0,30,0.0,E,BIOT,Fish,GADUS MORHUA,FLESH WITHOUT BONES,1995-07-14T00:00:00,137Cs,0,0.341,28,Bq/kg f.w.,FSA-Food Standards Agency,,Dungeness PLZ,,50.75,0.5,1995


Number of NaN values in sampling_d for BIOTA:  0


Create a callback that parses the sampling time (provided as `%Y-%m-%dT%H:%M:%S`) of each dataframe into a `TIME` column and drops the rows whose date cannot be parsed:

In [None]:
#| export
class ParseTimeCB(Callback):
    "Parse the time format in the dataframe and check for inconsistencies."
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                # SEAWATER stores the sampling date in 'sampling_1'
                if 'sampling_1' in df.columns:
                    # Parse the timestamp; unparseable values become NaT
                    df['TIME'] = pd.to_datetime(df['sampling_1'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
            if grp == 'BIOTA':
                # BIOTA stores the sampling date in 'sampling_d'
                if 'sampling_d' in df.columns:
                    # Parse the timestamp; unparseable values become NaT
                    df['TIME'] = pd.to_datetime(df['sampling_d'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
            # Drop rows whose TIME could not be parsed (NaT)
            df.dropna(subset=['TIME'], inplace=True)

Apply the transformer with the `ParseTimeCB` callback, then print the `TIME` column of the `SEAWATER` dataframe.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    ParseTimeCB(),
    CompareDfsAndTfmCB(dfs)])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['SEAWATER']['TIME'])

                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 

0       1995-05-01
1       1995-02-15
2       1995-02-15
3       1995-05-15
4       1995-05-15
           ...    
18303   2021-02-01
18304   2021-07-19
18305   2021-10-14
18306   2021-01-13
18307   2021-02-17
Name: TIME, Length: 18308, dtype: datetime64[ns]
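
The parsing logic can be sketched without pandas to make the "coerce, then drop" behaviour explicit. This is only an illustration on plain dictionaries, not the handler's implementation; the helper names `parse_time` and `parse_rows` are hypothetical, while the column names and the format string come from the text above.

```python
from datetime import datetime

# Column holding the sampling date in each group (from the text above).
TIME_COLS = {"SEAWATER": "sampling_1", "BIOTA": "sampling_d"}
TIME_FMT = "%Y-%m-%dT%H:%M:%S"

def parse_time(value, fmt=TIME_FMT):
    """Return a datetime, or None when the value cannot be parsed."""
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return None

def parse_rows(grp, rows):
    """Add a TIME key to each row and drop rows whose date did not parse."""
    col = TIME_COLS[grp]
    parsed = [{**row, "TIME": parse_time(row.get(col))} for row in rows]
    return [row for row in parsed if row["TIME"] is not None]

rows = [
    {"sampling_1": "1995-05-01T00:00:00"},
    {"sampling_1": "not a date"},
    {"sampling_1": None},
]
clean = parse_rows("SEAWATER", rows)
print(len(clean), clean[0]["TIME"])
```

Keeping the group-to-column mapping in one dictionary avoids repeating the parsing branch for each sample type.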


The NetCDF time format requires time to be encoded as the number of seconds since a time of origin. In our case the time of origin is `1970-01-01`, as indicated in the `CONFIGS['units']['time']` dictionary of `configs.ipynb`.

`EncodeTimeCB` converts the OSPAR `TIME` column to the MARIS NetCDF time format.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.logs)
                            

                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 

['Parse the time format in the dataframe and check for inconsistencies.', 'Encode time as seconds since epoch.', 'Create a dataframe of dropped data. Data included in the `dfs` not in the `tfm`.']
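
The log message of `EncodeTimeCB` ("Encode time as seconds since epoch") suggests an encoding along the following lines. This is a standard-library sketch under that assumption, not the callback's actual code; `to_seconds_since_epoch` is a hypothetical helper name.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)  # time of origin stated in the text above

def to_seconds_since_epoch(dt, epoch=EPOCH):
    """Encode a naive datetime as whole seconds elapsed since `epoch`."""
    return int((dt - epoch).total_seconds())

first_sample = datetime(1995, 5, 1)  # first SEAWATER date shown earlier
encoded = to_seconds_since_epoch(first_sample)
print(encoded)  # 799286400
```

The date 1995-05-01 lies 9251 days after the origin, and 9251 × 86400 = 799286400 seconds.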


## Sanitize value

We copy the measurement values (the `activity_o` column by default) into a single `VALUE` column and drop the rows where the value is missing.

In [None]:
#| export
class SanitizeValueCB(Callback):
    "Sanitize value by removing blank entries and populating `value` column."
    def __init__(self, 
                 value_col: str='activity_o' # Column name to sanitize
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df.dropna(subset=[self.value_col], inplace=True)
            df['VALUE'] = df[self.value_col]

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            CompareDfsAndTfmCB(dfs)])

tfm()

print('Example of VALUE column:')
print(tfm.dfs['SEAWATER'][['VALUE']].head())
print('\nComparison stats:')
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

Example of VALUE column:
      VALUE
0  0.040141
1  0.003000
2  0.003100
3  0.002600
4  0.002400

Comparison stats:
                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 



## Normalize uncertainty

:::{.callout-tip}

**Feedback to Data Provider**: We have noticed that some entries in the `uncertaint` column use a comma (`,`) as a decimal separator. Please consider standardizing these entries to use a period (`.`) as the decimal separator. 

:::

For each sample type in the OSPAR dataset, the reported uncertainty is given as an expanded uncertainty with a coverage factor `𝑘=2`. For further details, refer to the [OSPAR reporting guidelines](https://mcc.jrc.ec.europa.eu/documents/OSPAR/Guidelines_forestimationof_a_%20measurefor_uncertainty_in_OSPARmonitoring.pdf).

**Note**: For MARIS, the OSPAR uncertainty values are normalized to standard uncertainty with a coverage factor `𝑘=1`.

`NormalizeUncCB` callback normalizes the uncertainty using the following `lambda` function:

In [None]:
#| export
unc_exp2stan = lambda df, unc_col: df[unc_col] / 2

In [None]:
#| export
class NormalizeUncCB(Callback):
    """Normalize uncertainty values in DataFrames."""
    def __init__(self, 
                 col_unc: str='uncertaint', # Column name to normalize
                 fn_convert_unc: Callable=unc_exp2stan, # Function correcting coverage factor
                 ): 
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            self._convert_commas_to_periods(df)
            self._convert_to_float(df)
            self._apply_conversion_function(df)

    def _convert_commas_to_periods(self, df):
        """Convert commas to periods in the uncertainty column."""
        df[self.col_unc] = df[self.col_unc].astype(str).str.replace(',', '.')

    def _convert_to_float(self, df):
        """Convert uncertainty column to float, handling errors by setting them to NaN."""
        df[self.col_unc] = pd.to_numeric(df[self.col_unc], errors='coerce')

    def _apply_conversion_function(self, df):
        """Apply the conversion function to normalize the uncertainty values."""
        df['UNC'] = self.fn_convert_unc(df, self.col_unc)

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
        SanitizeValueCB(),               
        NormalizeUncCB()
    ])
tfm()

for grp in ['SEAWATER', 'BIOTA']:
    print(f'\n{grp}:')
    print(tfm.dfs[grp][['VALUE', 'UNC']].head())


SEAWATER:
      VALUE       UNC
0  0.040141  0.000341
1  0.003000  0.000225
2  0.003100  0.000233
3  0.002600  0.000195
4  0.002400  0.000180

BIOTA:
    VALUE       UNC
0  2.0217  0.031336
1  0.3410  0.014000
2  0.2250  0.022500
3  0.1550  0.015500
4  0.1370  0.018500
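
The three normalization steps (comma to period, conversion to float, division by the coverage factor) can be mimicked on single values as follows. This is a minimal sketch rather than the callback itself; `normalize_unc` is a hypothetical helper, and `k=2` reflects the expanded uncertainty described above.

```python
import math

def normalize_unc(raw, k=2):
    """Convert an expanded uncertainty (coverage factor k) to a standard one."""
    try:
        # Accept both "0,09" and "0.09" as decimal notations.
        expanded = float(str(raw).replace(",", "."))
    except ValueError:
        return math.nan  # unparseable entries become NaN
    return expanded / k

examples = ["0,09", "0.12", 1.468, "n/a"]
standard = [normalize_unc(u) for u in examples]
print(standard)
```

Returning NaN for unparseable entries mirrors the `errors='coerce'` choice in `_convert_to_float`: bad entries stay visible as missing values instead of raising.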


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `SEAWATER` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

To show situations where the uncertainty is much greater than the value we will calculate the 'relative uncertainty' for the seawater dataset. 

In [None]:
#| eval: false
grp='SEAWATER'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'])
    # Multiply by 100 to convert to percentage
    * 100
)

Now we will return all rows where the relative uncertainty is greater than 100% for the seawater dataset.

In [None]:
#| eval: false
threshold = 100
cols_to_show=['id', 'contractin', 'nuclide', 'value_type', 'activity_o', 'uncertaint', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

print(f'Example:')
with pd.option_context('display.max_rows', None):
    display(df.head())


Number of rows where relative uncertainty is greater than 100%: 
 74 

Example:


Unnamed: 0,id,contractin,nuclide,value_type,activity_o,uncertaint,unit,relative_uncertainty
1229,55488.0,United Kingdom,3H,0,11.1091,97164.0,Bq/l,437317.154405
2799,37549.0,Germany,99Tc,0,0.00092,0.09,Bq/l,4891.304348
2800,37550.0,Germany,99Tc,0,0.00055,0.07,Bq/l,6363.636364
2801,37551.0,Germany,99Tc,0,0.00059,0.12,Bq/l,10169.491525
2856,37548.0,Germany,99Tc,0,0.00063,0.07,Bq/l,5555.555556


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `BIOTA` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

Include the relative uncertainty for the biota dataset. 

In [None]:
#| eval: false
grp='BIOTA'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'])
    # Multiply by 100 to convert to percentage
    * 100
)

Return all rows where the relative uncertainty is greater than 100% for the biota dataset.

In [None]:
#| eval: false
threshold = 100
cols_to_show=['id', 'contractin', 'nuclide', 'value_type', 'activity_o', 'uncertaint', 'unit', 'relative_uncertainty']
df=tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]

print(f'Number of rows where relative uncertainty is greater than {threshold}%: \n {df.shape[0]} \n')

print(f'Example:')
with pd.option_context('display.max_rows', None):
    display(df.head())


Number of rows where relative uncertainty is greater than 100%: 
 36 

Example:


Unnamed: 0,id,contractin,nuclide,value_type,activity_o,uncertaint,unit,relative_uncertainty
597,35011,Belgium,137Cs,0,0.1619,66.0,Bq/kg f.w.,20382.95244
926,49226,Sweden,137Cs,0,0.327,1.468,Bq/kg f.w.,224.464832
1089,49230,Sweden,137Cs,0,0.275,1.982,Bq/kg f.w.,360.363636
1102,49232,Sweden,137Cs,0,0.309,2.16,Bq/kg f.w.,349.514563
2030,49239,Sweden,137Cs,0,0.200202,1.094,Bq/kg f.w.,273.224044


## Remap units

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: It would be easier to work with the units if they were standardized. The units are not consistent across the dataset, for instance `BQ/L`, `Bq/l` and `Bq/L` are used interchangeably.

:::


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Unit` column contains `NaN` values for the `SEAWATER` dataset, as shown below.
:::


In [None]:
#| eval: false
df=dfs['SEAWATER'][dfs['SEAWATER']['unit'].isnull()].drop(columns=['measuremen','sample_com','reference'])
print(f'Number of rows with NaN in unit column: \n {df.shape[0]} \n')
print(f'Example:')
with pd.option_context('display.max_rows', None):
    display(df.head())

Number of rows with NaN in unit column: 
 2656 

Example:


Unnamed: 0,id,fid,the_geom,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,...,value_type,activity_o,uncertaint,unit,data_provi,latdd,longdd,year,f1,reference_
164,92371.0,ospar_seawater_1995_01_003.596,POINT (51.41194444444444 3.565555555555556),Netherlands,8,VLISSGBISSVH,,51,24,43.0,...,0,0.017,102,,Rijkswaterstaat Centre for Water Management,51.411944,3.565556,1995,,
165,92372.0,ospar_seawater_1995_01_003.597,POINT (51.41194444444444 3.565555555555556),Netherlands,8,VLISSGBISSVH,,51,24,43.0,...,0,0.008,48,,Rijkswaterstaat Centre for Water Management,51.411944,3.565556,1995,,
166,92373.0,ospar_seawater_1995_01_003.598,POINT (51.41194444444444 3.565555555555556),Netherlands,8,VLISSGBISSVH,,51,24,43.0,...,0,0.032,192,,Rijkswaterstaat Centre for Water Management,51.411944,3.565556,1995,,
167,92374.0,ospar_seawater_1995_01_003.599,POINT (51.41194444444444 3.565555555555556),Netherlands,8,VLISSGBISSVH,,51,24,43.0,...,0,0.017,102,,Rijkswaterstaat Centre for Water Management,51.411944,3.565556,1995,,
168,92375.0,ospar_seawater_1995_01_003.600,POINT (51.41194444444444 3.565555555555556),Netherlands,8,VLISSGBISSVH,,51,24,43.0,...,0,0.013,78,,Rijkswaterstaat Centre for Water Management,51.411944,3.565556,1995,,


Let's inspect the unique units used by OSPAR:

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='unit', as_df=True)

Unnamed: 0,index,value
0,0,
1,1,Bq/kg f.w.
2,2,Bq/l
3,3,Bq/L
4,4,BQ/L


We will define unit renaming rules for OSPAR dataset:

In [None]:
#| export
# Define unit names renaming rules
renaming_unit_rules = {'Bq/l': 1, #'Bq/m3'
                       'Bq/L': 1,
                       'BQ/L': 1,
                       'Bq/kg f.w.': 5, # Bq/kgw
                       } 

Now we will create a callback `RemapUnitCB` to remap the units in the dataframes. For the `SEAWATER` dataset we will set a default unit of `Bq/l`. 

In [None]:
#| export
class RemapUnitCB(Callback):
    """Callback to update DataFrame 'UNIT' columns based on a lookup table."""

    def __init__(self, lut: Dict[str, int]):  # Maps provider unit names to MARIS unit IDs
        fc.store_attr('lut')  # Store the lookup table as an attribute

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                self._apply_default_units(df, unit='Bq/l')
            self._print_na_units(df)
            self._update_units(df)

    def _apply_default_units(self, df: pd.DataFrame , unit = None):
        df.loc[df['unit'].isnull(), 'unit'] = unit

    def _print_na_units(self, df: pd.DataFrame):
        na_count = df['unit'].isnull().sum()
        if na_count > 0:
            print(f"Number of rows with NaN in 'unit' column: {na_count}")

    def _update_units(self, df: pd.DataFrame):
        df['UNIT'] = df['unit'].apply(lambda x: self.lut.get(x, 'Unknown'))

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(), # Drop rows with a missing measurement value
                            RemapUnitCB(renaming_unit_rules),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('Unit column values:')
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['UNIT'].unique()}")

                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 

Unit column values:
BIOTA: [5]
SEAWATER: [1]
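
The unit logic boils down to "fill the missing seawater units, then look up the ID". A minimal sketch on single values, assuming missing units are represented by `None` (the handler works on pandas columns and NaN instead); `remap_unit` is a hypothetical helper name.

```python
# Same renaming rules as `renaming_unit_rules` in the text.
UNIT_LUT = {"Bq/l": 1, "Bq/L": 1, "BQ/L": 1, "Bq/kg f.w.": 5}

def remap_unit(unit, grp, lut=UNIT_LUT, default_seawater="Bq/l"):
    """Return the MARIS unit ID, filling missing SEAWATER units first."""
    if unit is None and grp == "SEAWATER":
        unit = default_seawater  # default applied only to seawater rows
    # Anything still unmatched is flagged rather than silently dropped.
    return lut.get(unit, "Unknown")

print(remap_unit(None, "SEAWATER"))
print(remap_unit("BQ/L", "SEAWATER"))
print(remap_unit("Bq/kg f.w.", "BIOTA"))
print(remap_unit(None, "BIOTA"))
```

A missing unit in `BIOTA` is deliberately left as `'Unknown'`, since no default is defined for that group.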


## Remap detection limit

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Value type` column contains many `nan` values and many entries with a value of `0`.

:::

In [None]:
#| eval: false
# Count the number of NaN entries in the 'value_type' column for 'SEAWATER'
na_count_seawater = dfs['SEAWATER']['value_type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'SEAWATER': {na_count_seawater}")

# Count the number of NaN entries in the 'value_type' column for 'BIOTA'
na_count_biota = dfs['BIOTA']['value_type'].isnull().sum()
print(f"Number of NaN 'Value type' entries in 'BIOTA': {na_count_biota}")

# Count the number of entries in the 'value_type' column where the value is 0 for 'SEAWATER'
zero_count_seawater = (dfs['SEAWATER']['value_type'] == '0').sum()
print(f"Number of 'value_type' entries where the value is '0' in 'SEAWATER': {zero_count_seawater}")

# Count the number of entries in the 'value_type' column where the value is 0 for 'BIOTA'
zero_count_biota = (dfs['BIOTA']['value_type'] == '0').sum()
print(f"Number of 'value_type' entries where the value is '0' in 'BIOTA': {zero_count_biota}")    


Number of NaN 'Value type' entries in 'SEAWATER': 54
Number of NaN 'Value type' entries in 'BIOTA': 23
Number of 'value_type' entries where the value is '0' in 'SEAWATER': 13549
Number of 'value_type' entries where the value is '0' in 'BIOTA': 10189


In the OSPAR dataset, the detection limit is indicated by `<` in the `Value type` column. When the `Value type` is `<`, the `Activity or MDA` column contains the detection limit value. Conversely, when the `Value type` is `=`, the `Activity or MDA` column contains the measurement value.

Let’s examine the Value type column entries in the OSPAR dataset:

In [None]:
#| eval: false
for grp in dfs.keys():
    print(f'{grp}:')
    print(tfm.dfs[grp]['value_type'].unique())


BIOTA:
['0' '<' nan]
SEAWATER:
['0' '<' '=' nan]


Detection limits are encoded as follows in MARIS:

In [None]:
#| eval: false
pd.read_excel(detection_limit_lut_path())

Unnamed: 0,id,name,name_sanitized
0,-1,Not applicable,Not applicable
1,0,Not Available,Not available
2,1,=,Detected value
3,2,<,Detection limit
4,3,ND,Not detected
5,4,DE,Derived


We create a lambda function to retrieve the lookup table.

In [None]:
#| export
lut_dl = lambda: pd.read_excel(detection_limit_lut_path(), usecols=['name','id']).set_index('name').to_dict()['id']

In [None]:
#| eval: false
lut_dl()

{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

We define the columns of interest in both the `SEAWATER` and `BIOTA` dataframes for the detection limit column.

In [None]:
#| export
coi_dl = {'SEAWATER' : {'DL' : 'value_type'},
          'BIOTA':  {'DL' : 'value_type'}
          }

We create a callback `RemapDetectionLimitCB` to remap the detection limit values to MARIS format using the lookup table. Since the dataset contains both `'0'` and `nan` entries in the detection limit column, we set the detection limit to `=` when the value and uncertainty are both present and the current detection limit entry is not among the lookup keys. Any entry that still has no match is set to `Not Available`.

In [None]:
#| export
class RemapDetectionLimitCB(Callback):
    """Remap detection limit values to MARIS format using a lookup table."""

    def __init__(self, coi: dict, fn_lut: Callable):
        """Initialize with column configuration and a function to get the lookup table."""
        fc.store_attr()        

    def __call__(self, tfm: Transformer):
        """Apply the remapping of detection limits across all dataframes"""
        lut = self.fn_lut()  # Retrieve the lookup table
        for grp, df in tfm.dfs.items():
            df['DL'] = df[self.coi[grp]['DL']]
            self._set_detection_limits(df, lut)

    def _set_detection_limits(self, df: pd.DataFrame, lut: dict):
        """Set detection limits based on value and uncertainty columns using specified conditions."""
        # Condition to set '=' when value and uncertainty are present and the current detection limit is not in the lookup keys
        condition_eq = (df['VALUE'].notna() & df['UNC'].notna() & ~df['DL'].isin(lut.keys()))
        df.loc[condition_eq, 'DL'] = '='

        # Set 'Not Available' for unmatched detection limits
        df.loc[~df['DL'].isin(lut.keys()), 'DL'] = 'Not Available'

        # Map existing detection limits using the lookup table
        df['DL'] = df['DL'].map(lut)

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            NormalizeUncCB(),                  
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl)])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['DL'].unique()}")

BIOTA: [1 2 0]
SEAWATER: [1 2 0]
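
The decision rule applied to each row can be written as a small function. This sketch assumes missing values are represented by `None` (the callback itself works on pandas columns with NaN); `remap_dl` is a hypothetical helper, and the lookup table reproduces the output of `lut_dl()` shown above.

```python
# Lookup table as returned by `lut_dl()` in the text.
DL_LUT = {"Not applicable": -1, "Not Available": 0, "=": 1, "<": 2, "ND": 3, "DE": 4}

def remap_dl(value_type, value, unc, lut=DL_LUT):
    """Map a provider `value_type` entry to a MARIS detection-limit ID."""
    if value_type not in lut:
        # Entries such as '0' or a missing value are not valid keys:
        # treat them as detected values when both value and uncertainty
        # exist, otherwise mark them as not available.
        has_both = value is not None and unc is not None
        value_type = "=" if has_both else "Not Available"
    return lut[value_type]

print(remap_dl("0", 0.040141, 0.000341))  # value and uncertainty present
print(remap_dl("<", 0.1, None))           # explicit detection limit
print(remap_dl(None, 0.1, None))          # nothing to infer from
```

Valid provider entries (`<`, `=`) always take precedence; the inference from value and uncertainty is used only as a fallback.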


## Remap Biota species

The OSPAR dataset contains biota species information in the `Species` column of the biota dataframe. To ensure consistency with MARIS standards, we need to remap these species names. We'll use the same approach as the one we employed for standardizing nuclide names.


We first inspect unique `Species` values used by OSPAR:

In [None]:
#| eval: false
dfs = wfs_processor()
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='species', as_df=True).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71.0,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155
value,Flatfish,GLYPTOCEPHALUS CYNOGLOSSUS,SCOMBER SCOMBRUS,Cerastoderma (Cardium) Edule,Unknown,Mytilus Edulis,FUCUS SPIRALIS,FUCUS VESICULOSUS,Ostrea Edulis,Modiolus modiolus,Melanogrammus aeglefinus,Brosme brosme,ASCOPHYLLUM NODOSUM,Merlangius merlangus,"Mixture of green, red and brown algae",Ascophyllum nodosum,Patella sp.,Merluccius merluccius,Reinhardtius hippoglossoides,Trisopterus minutus,MELANOGRAMMUS AEGLEFINUS,PLUERONECTES PLATESSA,Thunnus thynnus,SOLEA SOLEA (S.VULGARIS),RHODYMENIA spp,MERLANGIUS MERLANGUS,Boreogadus saida,BOREOGADUS SAIDA,ASCOPHYLLUN NODOSUM,Merlangius Merlangus,ETMOPTERUS SPINAX,CYCLOPTERUS LUMPUS,FUCUS SPP.,Limanda Limanda,Sprattus sprattus,Hippoglossoides platessoides,LITTORINA LITTOREA,PLATICHTHYS FLESUS,PORPHYRA UMBILICALIS,Scomber scombrus,Cyclopterus lumpus,CLUPEA HARENGUS,Sebastes Mentella,Tapes sp.,Squalus acanthias,Clupea Harengus,HIPPOGLOSSOIDES PLATESSOIDES,Micromesistius poutassou,SEBASTES MARINUS,Coryphaenoides rupestris,Glyptocephalus cynoglossus,Fucus Vesiculosus,Lophius piscatorius,Boreogadus Saida,Ostrea edulis,HIPPOGLOSSUS HIPPOGLOSSUS,Galeus melastomus,Pollachius virens,Limanda limanda,Penaeus vannamei,Fucus vesiculosus,SALMO SALAR,OSILINUS LINEATUS,FUCUS SERRATUS,MYTILUS EDULIS,Anarhichas lupus,FUCUS spp,Argentina sphyraena,Sebastes vivipares,Sepia spp.,Anguilla anguilla,,Gaidropsarus argenteus,Pelvetia canaliculata,Sebastes norvegicus,Dasyatis pastinaca,Pleuronectiformes [order],MICROMESISTIUS POUTASSOU,Trachurus trachurus,Littorina littorea,Rhodymenia spp.,REINHARDTIUS HIPPOGLOSSOIDES,Cerastoderma edule,Molva molva,LIMANDA LIMANDA,Raja montagui,Gadus Morhua,MERLUCCIUS MERLUCCIUS,Phoca vitulina,Homarus gammarus,Trisopterus esmarkii,Sebastes mentella,Solea solea (S.vulgaris),Pleuronectes platessa,Crassostrea gigas,Salmo salar,Hyperoplus lanceolatus,CHIMAERA MONSTROSA,EUTRIGLA GURNARDUS,PATELLA,Lycodes vahlii,Clupea harengus,RAJA DIPTURUS BATIS,PLEURONECTES PLATESSA,Microstomus kitt,SEBASTES MENTELLA,Thunnus sp.,Pecten maximus,RAJIDAE/BATOIDEA,Sebastes marinus,Capros aper,Lumpenus lampretaeformis,CRASSOSTREA GIGAS,LAMINARIA DIGITATA,GADUS MORHUA,PATELLA VULGATA,Platichthys flesus,Hippoglossus hippoglossus,ANARHICHAS LUPUS,PECTINIDAE,OSTREA EDULIS,GALEUS MELASTOMUS,unknown,Gadus morhua,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,PELVETIA CANALICULATA,Mytilus edulis,Fucus serratus,BROSME BROSME,Gadiculus argenteus thori,MOLVA MOLVA,Pollachius pollachius,MONODONTA LINEATA,DICENTRARCHUS (MORONE) LABRAX,Anarhichas minor,Argentina silus,CERASTODERMA (CARDIUM) EDULE,POLLACHIUS VIRENS,Sebastes viviparus,SCOPHTHALMUS RHOMBUS,SPRATTUS SPRATTUS,MOLVA DYPTERYGIA,NUCELLA LAPILLUS,PALMARIA PALMATA,Phycis blennoides,Gadus sp.,TRACHURUS TRACHURUS,Fucus sp.,BUCCINUM UNDATUM,Sardina pilchardus,PECTEN MAXIMUS,Fucus distichus,Buccinum undatum,Eutrigla gurnardus,Anarhichas denticulatus,Mallotus villosus


We try to remap the `Species` column to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='species_ospar.pkl')

In this step, we generate a lookup table using the `remapper` object. The lookup table maps data provider entries to MARIS entries using fuzzy matching. After generating the table, we select the matches whose score meets a specified threshold (here, greater than or equal to 1), so that only the entries requiring at least one character correction are shown.

- **`generate_lookup_table(as_df=True)`**: This method generates the lookup table and returns it as a DataFrame. It uses fuzzy matching to align entries from the data provider with those in the MARIS lookup table.
- **`select_match(match_score_threshold=1)`**: This method filters the generated lookup table to include only those matches with a score greater than or equal to the specified threshold. Since a score of 0 denotes an exact match, a threshold of 1 returns only the imperfect matches that need a manual review.

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/156 [00:00<?, ?it/s]

Processing: 100%|██████████| 156/156 [00:21<00:00,  7.37it/s]

127 entries matched the criteria, while 29 entries had a match score of 1 or higher.





source_key,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,"Mixture of green, red and brown algae",Solea solea (S.vulgaris),SOLEA SOLEA (S.VULGARIS),Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,RAJIDAE/BATOIDEA,Pleuronectiformes [order],PALMARIA PALMATA,MONODONTA LINEATA,Flatfish,Rhodymenia spp.,FUCUS SPP.,Sepia spp.,unknown,Unknown,RAJA DIPTURUS BATIS,Gadus sp.,Fucus sp.,Thunnus sp.,FUCUS spp,Tapes sp.,RHODYMENIA spp,Patella sp.,PLUERONECTES PLATESSA,Gaidropsarus argenteus,Sebastes vivipares,ASCOPHYLLUN NODOSUM
matched_maris_name,Lomentaria catenata,Mercenaria mercenaria,Loligo vulgaris,Loligo vulgaris,Cerastoderma edule,Cerastoderma edule,Dicentrarchus labrax,Batoidea,Pleuronectiformes,Alaria marginata,Monodonta labio,Lambia,Rhodymenia,Fucus,Sepia,Undaria,Undaria,Dipturus batis,Penaeus sp.,Fucus,Thunnus,Fucus,Tapes,Rhodymenia,Patella,Pleuronectes platessa,Gaidropsarus argentatus,Sebastes viviparus,Ascophyllum nodosum
source_name,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,"Mixture of green, red and brown algae",Solea solea (S.vulgaris),SOLEA SOLEA (S.VULGARIS),Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,RAJIDAE/BATOIDEA,Pleuronectiformes [order],PALMARIA PALMATA,MONODONTA LINEATA,Flatfish,Rhodymenia spp.,FUCUS SPP.,Sepia spp.,unknown,Unknown,RAJA DIPTURUS BATIS,Gadus sp.,Fucus sp.,Thunnus sp.,FUCUS spp,Tapes sp.,RHODYMENIA spp,Patella sp.,PLUERONECTES PLATESSA,Gaidropsarus argenteus,Sebastes vivipares,ASCOPHYLLUN NODOSUM
match_score,31,26,12,12,10,10,9,8,8,7,6,5,5,5,5,5,5,5,4,4,4,4,4,4,4,2,2,1,1
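
The scores in this table are consistent with a case-insensitive edit distance, i.e. the number of single-character insertions, deletions or substitutions needed to turn the provider name into the MARIS name. The scoring used inside `Remapper` is not shown in this notebook, so the following is only a sketch under that assumption; `edit_distance` and `match_score` are hypothetical helpers.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def match_score(source, candidate):
    """Case-insensitive edit distance; 0 means an exact match."""
    return edit_distance(source.lower(), candidate.lower())

pairs = [
    ("Sebastes vivipares", "Sebastes viviparus"),
    ("PLUERONECTES PLATESSA", "Pleuronectes platessa"),
    ("Gadus morhua", "Gadus morhua"),
]
scores = [match_score(s, c) for s, c in pairs]
print(scores)  # [1, 2, 0]
```

Low scores usually point to typos in the provider data, whereas high scores tend to signal that no sensible counterpart exists and a manual fix is needed.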


Below, we fix the entries that are not properly matched by the `Remapper` object:

In [None]:
#| export
fixes_biota_species = {
    'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': NA,  # Mix of species, no direct mapping
    'Mixture of green, red and brown algae': NA,  # Mix of species, no direct mapping
    'Solea solea (S.vulgaris)': 'Solea solea',
    'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
    'RAJIDAE/BATOIDEA': NA, #Mix of species, no direct mapping
    'PALMARIA PALMATA': NA,  # Not defined
    'Unknown': NA,
    'unknown': NA,
    'Flatfish': NA,
    'Gadus sp.': NA,  # Not defined
}
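
Conceptually, such a dictionary acts as a set of manual overrides layered on top of the automatic matches. How `Remapper` applies `fixes` internally is not shown in this notebook, so the sketch below only illustrates the idea with plain dictionaries; the value of `NA` is a hypothetical stand-in for the sentinel imported from `marisco.utils`.

```python
NA = "Not available"  # hypothetical stand-in for the `NA` sentinel

# Automatic (fuzzy) matches, two of which are clearly wrong.
auto_matches = {
    "Solea solea (S.vulgaris)": "Loligo vulgaris",
    "Flatfish": "Lambia",
    "Fucus sp.": "Fucus",
}

# Manual corrections keyed by the provider's spelling.
fixes = {
    "Solea solea (S.vulgaris)": "Solea solea",
    "Flatfish": NA,
}

# Later keys win, so every manual fix replaces the automatic match.
lut = {**auto_matches, **fixes}
print(lut)
```

Entries without a manual fix, such as `Fucus sp.`, keep their automatic match.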

We now attempt remapping again, incorporating the `fixes_biota_species` dictionary:

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_species)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/156 [00:00<?, ?it/s]

Processing: 100%|██████████| 156/156 [00:20<00:00,  7.59it/s]

137 entries matched the criteria, while 19 entries had a match score of 1 or higher.





source_key,Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,Pleuronectiformes [order],MONODONTA LINEATA,RAJA DIPTURUS BATIS,Rhodymenia spp.,Sepia spp.,FUCUS SPP.,Patella sp.,FUCUS spp,Tapes sp.,Thunnus sp.,RHODYMENIA spp,Fucus sp.,Gaidropsarus argenteus,PLUERONECTES PLATESSA,Sebastes vivipares,ASCOPHYLLUN NODOSUM
matched_maris_name,Cerastoderma edule,Cerastoderma edule,Dicentrarchus labrax,Pleuronectiformes,Monodonta labio,Dipturus batis,Rhodymenia,Sepia,Fucus,Patella,Fucus,Tapes,Thunnus,Rhodymenia,Fucus,Gaidropsarus argentatus,Pleuronectes platessa,Sebastes viviparus,Ascophyllum nodosum
source_name,Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,Pleuronectiformes [order],MONODONTA LINEATA,RAJA DIPTURUS BATIS,Rhodymenia spp.,Sepia spp.,FUCUS SPP.,Patella sp.,FUCUS spp,Tapes sp.,Thunnus sp.,RHODYMENIA spp,Fucus sp.,Gaidropsarus argenteus,PLUERONECTES PLATESSA,Sebastes vivipares,ASCOPHYLLUN NODOSUM
match_score,10,10,9,8,6,5,5,5,5,4,4,4,4,4,4,2,2,1,1


On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='species_ospar.pkl').generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)
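
Wrapping the `Remapper` call in a lambda presumably defers the costly lookup-table generation until the callback actually needs it. The minimal sketch below illustrates that pattern with stand-in names (`build_lut` and `LazyRemap` are hypothetical and not part of marisco):

```python
calls = []

def build_lut():
    # Stand-in for an expensive Remapper(...).generate_lookup_table(...) call.
    calls.append('built')
    return {'FUCUS SPP.': 'Fucus', 'Unknown': None}

class LazyRemap:
    "Hypothetical callback that stores a zero-argument function instead of a table."
    def __init__(self, fn_lut):
        self.fn_lut = fn_lut          # nothing is computed here

    def __call__(self, values):
        lut = self.fn_lut()           # the table is built only now
        return [lut.get(v) for v in values]

cb = LazyRemap(fn_lut=lambda: build_lut())
built_before_call = len(calls)        # creating the callback built nothing
remapped = cb(['FUCUS SPP.', 'Unknown'])
built_after_call = len(calls)         # the table was built on first use
```

Passing a function rather than a ready-made table also means the table can be built from whatever data is available when the transformation runs.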

Putting it all together, we now apply the `RemapCB` to our data. This adds a `SPECIES` column to our `BIOTA` dataframe, containing standardized species IDs.


In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([  96,   99,  391,  192,  129,  274,  394,  396,  392,  270,  397,
        244,   50,  401,  417,  378,  139,    0,  379,  413,  410,  412,
        380,  414,  272,  395,  243,  418,  411,  402,  407,  426,  191,
        429,  393,  430,  384,  381,  403,  399,  415,  416,  405,  398,
        404,  389,  386,  385,  408,  406,  427,  400,  390, 1684,  377,
        388,  387,  382,  383,  428,  419, 1609,  425,  424,  420,  421,
        422,  423,  431,  294,  440,  432,  433,  442,  441, 1605,  439,
        438,  437,  434,  435,  436,  444,  443, 1610, 1606, 1608,   23,
        556,  234, 1752, 1701])

## Enhance Species Data Using Biological group column
The `Biological group` column in the OSPAR dataset provides valuable insights related to species. We will leverage this information to enrich the `species` column. To achieve this, we will employ the generic `RemapCB` callback to create an `enhanced_species` column. Subsequently, this `enhanced_species` column will be used to further enrich the `species` column.

First we inspect the unique values in the `Biological group` column.

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='biological', as_df=True)

Unnamed: 0,index,value
0,0,Molluscs
1,1,SEAWEED
2,2,FISH
3,3,seaweed
4,4,Seaweed
5,5,Fish
6,6,molluscs
7,7,fish
8,8,MOLLUSCS


We will remap the `Biological group` column data to the `SPECIES` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='enhance_species_ospar.pkl')

As before, we generate the lookup table and select the matches whose score meets a specified threshold (here, 1 or higher), so that every match requiring at least one character change is shown.

In [None]:
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  7.06it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FISH,Fucus,FISH,4
Fish,Fucus,Fish,4
fish,Fucus,fish,4
Molluscs,Mollusca,Molluscs,1
molluscs,Mollusca,molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1
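
The `match_score` values above are consistent with a case-insensitive edit (Levenshtein) distance between the source name and the closest MARIS name. That reading is an assumption about the `Remapper` internals; the pure-Python sketch below (`levenshtein` and `best_match` are illustrative helpers, not marisco functions) reproduces the scores of 1 and 4 shown in the table:

```python
def levenshtein(a: str, b: str) -> int:
    "Number of single-character insertions, deletions or substitutions turning a into b."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: list) -> tuple:
    "Return the candidate with the smallest case-insensitive distance, and that distance."
    scored = [(levenshtein(name.lower(), c.lower()), c) for c in candidates]
    score, match = min(scored)
    return match, score

candidates = ['Fucus', 'Mollusca']   # two names taken from the output above
molluscs = best_match('MOLLUSCS', candidates)
fish = best_match('FISH', candidates)
```

Under this reading, `FISH` lands on `Fucus` only because it is the nearest string, which shows that a small distance does not guarantee a meaningful match.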


Some of the entries require manual correction: the `fish` variants are wrongly matched to the seaweed genus `Fucus`.

In [None]:
fixes_enhanced_biota_species = {
    'fish': 'Pisces',
    'FISH': 'Pisces',
    'Fish': 'Pisces'    
}


We now regenerate the lookup table, this time applying the manual corrections:

In [None]:
remapper.generate_lookup_table(fixes=fixes_enhanced_biota_species)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  6.63it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Molluscs,Mollusca,Molluscs,1
molluscs,Mollusca,molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1


On visual inspection, the remaining imperfectly matched entries appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota_enhanced = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='enhance_species_ospar.pkl').generate_lookup_table(fixes=fixes_enhanced_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This adds an `enhanced_species` column to our `BIOTA` dataframe, containing standardized species IDs.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['enhanced_species'].unique()

array([1059,  712,  873])

Now that we have the `enhanced_species` column, we can use it to enrich the `SPECIES` column: wherever `SPECIES` has no match (a value of `-1` or `0`) and `enhanced_species` is not null, we fall back on the `enhanced_species` value.

In [None]:
#| export
class EnhanceSpeciesCB(Callback):
    """Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met."""

    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        self._enhance_species(tfm.dfs['BIOTA'])

    def _enhance_species(self, df: pd.DataFrame):
        df['SPECIES'] = df.apply(
            lambda row: row['enhanced_species'] if row['SPECIES'] in [-1, 0] and pd.notnull(row['enhanced_species']) else row['SPECIES'],
            axis=1
        )

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),
    EnhanceSpeciesCB()
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([  96,   99,  391,  192,  129,  274,  394,  396,  392,  270,  397,
        244,   50,  401,  417,  378,  139, 1059,  379,  413,  712,  410,
        412,  380,  414,  272,  395,  243,  418,  411,  402,  407,  426,
        191,  429,  393,  430,  384,  381,  403,  399,  415,  416,  405,
        398,  404,  389,  386,  385,  408,  406,  427,  400,  390, 1684,
        377,  388,  387,  382,  383,  428,  419, 1609,  425,  424,  420,
        421,  422,  423,  431,  294,  440,  432,  433,  442,  441, 1605,
        439,  438,  437,  434,  435,  436,  444,  443, 1610, 1606, 1608,
         23,  556,  234, 1752, 1701])

All entries are matched for the `SPECIES` column.
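
The row-wise `apply` in `EnhanceSpeciesCB` can also be expressed as a vectorized operation. The sketch below runs on a made-up toy dataframe and is offered as an equivalent alternative formulation, not as the handler's actual code:

```python
import pandas as pd

# Toy data: the IDs are illustrative only.
df = pd.DataFrame({
    'SPECIES': [96, 0, -1, 0],
    'enhanced_species': [1059, 1059, 712, None],
})

# Rows with no species match (-1 or 0) and a usable fallback value.
use_fallback = df['SPECIES'].isin([-1, 0]) & df['enhanced_species'].notna()

# Replace only those rows; all others keep their original SPECIES value.
df['SPECIES'] = df['SPECIES'].mask(use_fallback, df['enhanced_species']).astype('int64')
result = df['SPECIES'].tolist()
```

The last row keeps its `0` because no fallback value is available for it.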

## Remap Biota tissues

The OSPAR dataset includes entries where the `Body Part` is labeled as `whole`. However, the MARIS data standard requires a more specific distinction in the `body_part` field, differentiating between `Whole animal` and `Whole plant`. Fortunately, the OSPAR data provides a `Biological group` field that allows us to make this distinction.

To address this discrepancy and ensure compatibility with MARIS standards, we will:

1. Create a temporary column `body_part_temp` that combines information from both `Body Part` and `Biological group`.
2. Use this temporary column to perform the lookup using our `Remapper` object.
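
The idea behind these two steps can be sketched in plain Python. The helper name `body_part_key` and the two-entry lookup are illustrative assumptions; in the handler the lookup is produced by the `Remapper`:

```python
def body_part_key(body_part: str, biological_group: str) -> str:
    "Combine both fields into one lowercase lookup key."
    return f"{body_part} {biological_group}".strip().lower()

# Illustrative two-entry lookup showing how the combined key disambiguates 'whole'.
toy_lut = {'whole seaweed': 'Whole plant', 'whole fish': 'Whole animal'}

plant = toy_lut[body_part_key('Whole', 'SEAWEED')]
animal = toy_lut[body_part_key('WHOLE', 'Fish')]
```

The same `whole` label therefore resolves to a different MARIS body part depending on the biological group.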

Let's create the temporary column, `body_part_temp`, that combines `Body Part` and `Biological group`.

In [None]:
#| export
class AddBodypartTempCB(Callback):
    "Add a temporary column with the body part and biological group combined."    
    def __call__(self, tfm):
        tfm.dfs['BIOTA']['body_part_temp'] = (
            tfm.dfs['BIOTA']['body_part'] + ' ' + 
            tfm.dfs['BIOTA']['biological']
            ).str.strip().str.lower()                                 

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            ])
dfs_test = tfm()
dfs_test['BIOTA']['body_part_temp'].unique()


array(['whole plant seaweed', 'flesh without bones fish',
       'growing tips seaweed', 'soft parts molluscs', 'liver fish',
       'flesh fish', 'whole fish fish', 'whole animal molluscs',
       'muscle fish', 'soft parts fish', 'flesh with scales fish',
       'flesh without bone fish', 'whole animal fish', 'head fish',
       'unknown fish', 'whole fish', 'flesh without bones seaweed',
       'cod medallion fish', 'whole without head fish', 'whole fisk fish',
       'mix of muscle and whole fish without liver fish',
       'flesh without bones molluscs', 'whole seaweed'], dtype=object)

To align the ``body_part_temp`` column with the ``bodypar`` column in the MARIS nomenclature, we utilize a Remapper object. Since the OSPAR dataset does not include a predefined lookup table for the ``body_part`` column, we first create a lookup table by extracting unique values from the ``body_part_temp`` column.

In [None]:
get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True).head()

Unnamed: 0,index,value
0,0,flesh with scales fish
1,1,mix of muscle and whole fish without liver fish
2,2,whole plant seaweed
3,3,whole animal molluscs
4,4,whole seaweed


We try to remap the `body_part_temp` column to the `bodypar` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='tissues_ospar.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=0, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 94.83it/s]

0 entries matched the criteria, while 23 entries had a match score of 0 or higher.





source_key,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,whole animal molluscs,flesh without bones molluscs,whole fisk fish,whole fish fish,unknown fish,soft parts molluscs,whole plant seaweed,flesh without bones seaweed,growing tips seaweed,flesh fish,whole seaweed,whole fish,flesh with scales fish,whole animal fish,soft parts fish,muscle fish,flesh without bones fish,liver fish,head fish,flesh without bone fish
matched_maris_name,Flesh without bones,Old leaf,Flesh without bones,Whole animal,Flesh without bones,Whole animal,Whole animal,Growing tips,Soft parts,Whole plant,Flesh without bones,Growing tips,Shells,Whole plant,Whole animal,Flesh with scales,Whole animal,Soft parts,Muscle,Flesh without bones,Liver,Head,Flesh without bones
source_name,mix of muscle and whole fish without liver fish,cod medallion fish,whole without head fish,whole animal molluscs,flesh without bones molluscs,whole fisk fish,whole fish fish,unknown fish,soft parts molluscs,whole plant seaweed,flesh without bones seaweed,growing tips seaweed,flesh fish,whole seaweed,whole fish,flesh with scales fish,whole animal fish,soft parts fish,muscle fish,flesh without bones fish,liver fish,head fish,flesh without bone fish
match_score,31,13,13,9,9,9,9,9,9,8,8,8,7,6,5,5,5,5,5,5,5,5,4


Many of the lookup entries are sufficient for our needs. However, for values without a suitable match, we can use the `fixes_biota_tissues` dictionary to apply manual corrections. First we will create the dictionary.

In [None]:
#| export
fixes_biota_tissues = {
    'whole seaweed' : 'Whole plant',
    'flesh fish' : 'Flesh with bones', # Assumed, since the category 'Flesh with bones' also exists
    'unknown fish' : NA,
    'cod medallion fish' : NA, # TO BE DETERMINED
    'mix of muscle and whole fish without liver fish' : NA, # TO BE DETERMINED
    'whole without head fish' : NA, # TO BE DETERMINED
    'flesh without bones seaweed' : NA, # TO BE DETERMINED
    'tail and claws fish' : NA # TO BE DETERMINED
}

Now we will generate the lookup table and apply the manual corrections of the `fixes_biota_tissues` dictionary.


In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_tissues)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 87.95it/s]

2 entries matched the criteria, while 21 entries had a match score of 1 or higher.





source_key,whole animal molluscs,flesh without bones molluscs,whole fish fish,whole fisk fish,soft parts molluscs,whole plant seaweed,growing tips seaweed,flesh with scales fish,whole animal fish,whole fish,soft parts fish,head fish,muscle fish,flesh without bones fish,liver fish,flesh without bone fish,mix of muscle and whole fish without liver fish,whole without head fish,unknown fish,flesh without bones seaweed,cod medallion fish
matched_maris_name,Whole animal,Flesh without bones,Whole animal,Whole animal,Soft parts,Whole plant,Growing tips,Flesh with scales,Whole animal,Whole animal,Soft parts,Head,Muscle,Flesh without bones,Liver,Flesh without bones,(Not available),(Not available),(Not available),(Not available),(Not available)
source_name,whole animal molluscs,flesh without bones molluscs,whole fish fish,whole fisk fish,soft parts molluscs,whole plant seaweed,growing tips seaweed,flesh with scales fish,whole animal fish,whole fish,soft parts fish,head fish,muscle fish,flesh without bones fish,liver fish,flesh without bone fish,mix of muscle and whole fish without liver fish,whole without head fish,unknown fish,flesh without bones seaweed,cod medallion fish
match_score,9,9,9,9,9,8,8,5,5,5,5,5,5,5,5,4,2,2,2,2,2


At this stage, the majority of entries have been successfully matched to MARIS nomenclature. Entries that remain unmatched are marked as not available. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.

In [None]:
#| export
lut_bodyparts = lambda: Remapper(provider_lut_df=get_unique_across_dfs(tfm.dfs, col_name='body_part_temp', as_df=True),
                               maris_lut_fn=bodyparts_lut_path,
                               maris_col_id='bodypar_id',
                               maris_col_name='bodypar',
                               provider_col_to_match='value',
                               provider_col_key='value',
                               fname_cache='tissues_ospar.pkl'
                               ).generate_lookup_table(fixes=fixes_biota_tissues, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This adds a `BODY_PART` column to our `BIOTA` dataframe, containing standardized body part IDs.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA')
                            ])
tfm()
tfm.dfs['BIOTA']['BODY_PART'].unique()

array([40, 52, 56, 19, 25,  4,  1, 34, 60, 13,  0])

## Remap biogroup

The MARIS species lookup table contains a ``biogroup_id`` column that associates each species with its corresponding ``biogroup``. We will leverage this relationship to create a ``BIO_GROUP`` column in the ``BIOTA`` DataFrame.

In [None]:
#| export
lut_biogroup_from_biota = lambda: get_lut(src_dir=species_lut_path().parent, fname=species_lut_path().name, 
                               key='species_id', value='biogroup_id')

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[ 
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),
    EnhanceSpeciesCB(),
    RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
    ])

print(tfm()['BIOTA']['BIO_GROUP'].unique())


[11  4 13 14 12  2  5]
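
Conceptually, this step is a dictionary lookup from species ID to biogroup ID. The toy sketch below uses made-up IDs (the real pairs come from the MARIS species lookup table), and the `-1` sentinel for unknown IDs is an assumption made for illustration:

```python
import pandas as pd

# Made-up species_id -> biogroup_id pairs, for illustration only.
toy_lut = {10: 4, 11: 4, 20: 11}

species = pd.Series([10, 20, 11, 99], name='SPECIES')

# IDs missing from the lookup (99 here) become NaN, then a sentinel of -1.
bio_group = species.map(toy_lut).fillna(-1).astype(int)
result = bio_group.tolist()
```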


## Add Laboratory ID (REVIEW)

:::{.callout-tip}

**FEEDBACK FOR NEXT VERSION**: Addition of the laboratory ID column requires the lookup table to be sanitized. 

:::

Let's use the `get_unique_across_dfs` utility function to review the unique laboratory IDs in the OSPAR dataset:

In [None]:
with pd.option_context('display.max_columns', None):
    display(get_unique_across_dfs(dfs, col_name='data_provi', as_df=True).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104
index,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40.0,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104
value,IRSN-LRC,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,NIEA-Northern Ireland Environment Agency,Radiological Protection Instiute of Ireland,Rijkswaterstaat Centre for Water Management,Icelandic Radiation Safety Authority,Insititute for Marine Research,IMR,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,IRSN : OPRI-LVRE,SCKÃ¢ÂÂ¢CEN,DTU SUS,IRSN : LVRE,Radiological Protection Institute of Ireland,IFE/NRPA,Institute for Energy technology,Insitute of Marine Research,"Federal Maritime and Hydrographic Agency, Hamburg",Norwegian Radiation Protection Authority,SEPA-Scottish Environment Protection Agency,SCKâ¢CEN,Institute for Marine Research/Norweigian Radia...,Institute of Marine Research,IRSN : LRC/LS3E/RSMASS,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,IRSN : OPRI,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,DTU Nutech,Johann Heinrich von ThÃÂ¼nen Institute (vTI),Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Norweigian Radiation Protection Authority,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Norwegian Radiaton Protection Authority,IRSN : OPRI-LVRE/MN,SL-Sellafield Ltd,Institute of Marine Research/Norwegian Radiati...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,RisÃÂ¸-DTU,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,"DTU Nutech, DK",Environmental Protection Agency,IRSN : LS3E/Marine Nationale,Corystes 14/2004,EA - Environment Agency,IRSN : LERFA,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,EA-Environment Agency,NRPA,Insititute for Energy Technology,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,IRSN : LS3E/RSMASS,IRSN : LVRE/RSMASS,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,RisÃ¸-DTU,Institut de Radioprotection 
et SÃ»retÃ© NuclÃ©...,Nuclear Energy Research centre,Johann Heinrich von Thuenen Institute (vTI),Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institute for energy technology,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Norwegian Radioaton Protection Authority,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,SSM,Norwegian Radiation and Nuclear Safety Authority,Scientific Institute of Public Health,BEIS,IRSN : LS3E,Institute for Energy Technology,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Institut de Radioprotection et SÃ»retÃ© NuclÃ©...,DTU ENV,SCKCEN,Johann Heinrich von ThÃÂ¸nen Institute (vTI),Intitute for Marine Research,Instiute of Marine Research,Nuclear Safety Council,Institute of Energy Technology,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,"Defra-Department for Environment, Food and Rur...","Institute for Energy Technology, Kjeller, Norway",BEIS (formerly DECC),Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,Swedish Radiation Safety Authority,IRSN : LRC,IFE,FSA-Food Standards Agency,NorwegiaN Radiation Protection Authority,Institute for marine research,Institute for Marine Research,Endeavour 10/2004,IRSN : LVRE/MN,IRSN : OPRI/DDASS,IRSN : OPRI/MN,Institut de Radioprotection et SÃÂ»retÃÂ© Nu...,IRSN-LVRE,Rijkswaterstaat Laboratory CIV


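The output shows many spelling variants and encoding errors for the same laboratory, so the `LAB` information could be included only after these names have been sanitized.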
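
One possible starting point for sanitizing these names is sketched below. It assumes that part of the garbling comes from UTF-8 text decoded as Latin-1 (possibly more than once) and that variants differing only in case or spacing refer to the same laboratory. `fix_mojibake` and `normalize_lab` are hypothetical helpers, and typos such as `Insititute` would still need a manual mapping:

```python
def fix_mojibake(text: str, max_rounds: int = 3) -> str:
    "Undo UTF-8 bytes that were wrongly decoded as Latin-1, up to max_rounds times."
    for _ in range(max_rounds):
        try:
            repaired = text.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break                      # not (or no longer) mojibake
        if repaired == text:
            break                      # plain ASCII: nothing to repair
        text = repaired
    return text

def normalize_lab(name: str) -> str:
    "Build a comparison key: repaired encoding, collapsed spaces, case-insensitive."
    return ' '.join(fix_mojibake(name).split()).casefold()

# A few values taken from the output above.
labs = ['RisÃ¸-DTU', 'Institute for marine research',
        'Institute for Marine Research', 'IMR']
keys = sorted({normalize_lab(lab) for lab in labs})
```

Grouping the raw names by such a key would give a shorter list to review by hand before building a laboratory lookup table.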

## Add Sample ID (REVIEW)

The OSPAR dataset includes an `ID` column, which we will use to create the `SMP_ID` column.

In [None]:
#| export
class AddSampleIdCB(Callback):
    "Create a SMP_ID column from the ID column"
    def __call__(self, tfm):
        for df in tfm.dfs.values():
            if 'id' in df.columns:
                df['SMP_ID'] = df['id']

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            AddSampleIdCB(),
                            CompareDfsAndTfmCB(dfs)

                            ])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['SMP_ID'].unique()}")

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
    

BIOTA: [38847 54128 54127 ... 95773 95397 96864]
SEAWATER: [ 45552.  67787.  67788. ... 120876. 120877. 120878.]
                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 



## Add depth

The OSPAR dataset includes a column for the sampling depth (`Sampling depth`) for the `SEAWATER` dataset. In this section, we will create a callback to incorporate the sampling depth into the MARIS dataset as the `SMP_DEPTH` column.

In [None]:
class AddDepthCB(Callback):
    "Ensure depth values are floats and add 'SMP_DEPTH' columns."
    def __call__(self, tfm: Transformer):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                if 'sampling_d' in df.columns:
                    df['SMP_DEPTH'] = df['sampling_d'].astype(float)

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    AddDepthCB()
    ])
tfm()
for grp in tfm.dfs.keys():  
    if 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())

SEAWATER:        SMP_DEPTH
0            2.0
1            0.0
2          140.0
40         120.0
61         700.0
...          ...
17455       60.0
17460       70.0
17815      110.0
17817     1680.0
17833      163.0

[123 rows x 1 columns]


## Standardize Coordinates

The OSPAR dataset offers coordinates in degrees, minutes, and seconds (DMS). The following callback is designed to convert DMS to decimal degrees. 

In [None]:
#| export
class ConvertLonLatCB(Callback):
    """Convert Coordinates to decimal degrees (DDD.DDDDD°)."""
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            df['LAT'] = self._convert_latitude(df)
            df['LON'] = self._convert_longitude(df)

    def _convert_latitude(self, df: pd.DataFrame) -> pd.Series:
        return np.where(
            df['latdir'].isin(['S']),
            self._dms_to_decimal(df['latd'], df['latm'], df['lats']) * -1,
            self._dms_to_decimal(df['latd'], df['latm'], df['lats'])
        )

    def _convert_longitude(self, df: pd.DataFrame) -> pd.Series:
        return np.where(
            df['longdir'].isin(['W']),
            self._dms_to_decimal(df['longd'], df['longm'], df['longs']) * -1,
            self._dms_to_decimal(df['longd'], df['longm'], df['longs'])
        )

    def _dms_to_decimal(self, degrees: pd.Series, minutes: pd.Series, seconds: pd.Series) -> pd.Series:
        return degrees + minutes / 60 + seconds / 3600


In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB()
                            ])
tfm()
tfm.dfs['SEAWATER'][['LAT','latd', 'latm', 'lats', 'LON', 'latdir', 'longd', 'longm','longs', 'longdir']]

Unnamed: 0,LAT,latd,latm,lats,LON,latdir,longd,longm,longs,longdir
0,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
1,63.650000,63,39,0.0,-15.900000,N,15,54,0.0,W
2,63.650000,63,39,0.0,-15.900000,N,15,54,0.0,W
3,64.330000,64,19,48.0,-25.000000,N,25,0,0.0,W
4,64.330000,64,19,48.0,-27.970000,N,27,58,12.0,W
...,...,...,...,...,...,...,...,...,...,...
18303,51.411944,51,24,43.0,3.565556,N,3,33,56.0,E
18304,51.411944,51,24,43.0,3.565556,N,3,33,56.0,E
18305,51.411944,51,24,43.0,3.565556,N,3,33,56.0,E
18306,51.719444,51,43,10.0,3.493889,N,3,29,38.0,E
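
As a sanity check of the conversion, row 4 of the table above can be reproduced by hand. The helper below is a scalar restatement of `_dms_to_decimal` with the hemisphere sign folded in (`dms_to_decimal` is an illustrative name, not part of the handler):

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, direction: str) -> float:
    "Convert degrees, minutes, seconds plus a hemisphere letter to signed decimal degrees."
    sign = -1 if direction in ('S', 'W') else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)

# Row 4 of the table above: 64 deg 19 min 48 s N, 27 deg 58 min 12 s W
lat = dms_to_decimal(64, 19, 48.0, 'N')
lon = dms_to_decimal(27, 58, 12.0, 'W')
print(round(lat, 6), round(lon, 6))
```

The two values should agree with the `LAT` and `LON` entries of that row (64.33 and -27.97).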


The `SanitizeLonLatCB` callback drops a row when both longitude and latitude equal 0 or when the row contains unrealistic longitude or latitude values. It also converts a `,` decimal separator in longitude and latitude to `.`.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['BIOTA'][['LAT','LON']])

                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 

             LAT        LON
0      55.966667  11.583333
1      50.750000   0.500000
2      50.750000   0.500000
3      51.084167   1.203056
4      51.084167   1.203056
...          ...        ...
14710  56.000000   6.000000
14711  56.500000  12.000000
14712  56.000000   6.000000
14713  54.087778   7.850556
14714  54.872778  -3.594444

[14715 rows x 2 columns]
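
The sanitization rules can be sketched on a toy dataframe as follows. This is a simplified reimplementation under the assumption that "unrealistic" means outside ±90° latitude or ±180° longitude; marisco's `SanitizeLonLatCB` may differ in detail:

```python
import pandas as pd

def sanitize_lon_lat(df: pd.DataFrame) -> pd.DataFrame:
    "Hypothetical helper: normalize decimal separators, then drop invalid coordinates."
    out = df.copy()
    for col in ('LAT', 'LON'):
        # Accept both '55,97' and '55.97' style values.
        out[col] = out[col].astype(str).str.replace(',', '.', regex=False).astype(float)
    both_zero = (out['LAT'] == 0) & (out['LON'] == 0)
    out_of_range = (out['LAT'].abs() > 90) | (out['LON'].abs() > 180)
    return out[~(both_zero | out_of_range)]

toy = pd.DataFrame({
    'LAT': ['55,966667', '0', '95.0', '54.872778'],
    'LON': ['11,583333', '0', '10.0', '-3.594444'],
})
clean = sanitize_lon_lat(toy)
```

Only the first and last toy rows survive: the second is a (0, 0) placeholder and the third has an impossible latitude.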


## Review all callbacks

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                           BIOTA  SEAWATER
Number of rows in dfs      14715     18308
Number of rows in tfm.dfs  14715     18308
Number of rows removed         0         0 



### Example change logs

Review the change logs for the NetCDF encoding.
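
The log entries returned below largely coincide with the docstrings of the callbacks, which suggests that the transformer records one description per callback as it runs. A minimal sketch of that idea, using hypothetical `MiniTransformer` and `AddSampleId` classes rather than marisco's own:

```python
class AddSampleId:
    "Create a SMP_ID column from the ID column"
    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df['SMP_ID'] = df['id']

class MiniTransformer:
    "Hypothetical stand-in that runs callbacks in order and logs their docstrings."
    def __init__(self, dfs, cbs):
        self.dfs, self.cbs, self.logs = dfs, cbs, []

    def __call__(self):
        for cb in self.cbs:
            cb(self)                            # apply the transformation
            self.logs.append(type(cb).__doc__)  # record what was done
        return self.dfs

# Plain dicts stand in for dataframes to keep the sketch dependency-free.
tfm_sketch = MiniTransformer({'BIOTA': {'id': [38847, 54128]}}, cbs=[AddSampleId()])
out = tfm_sketch()
```

Keeping the description next to the code that performs the step means the change log stays in sync with the pipeline.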

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs

["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column.",
 'Remap data provider nuclide names to standardized MARIS nuclide names.',
 'Parse the time format in the dataframe and check for inconsistencies.',
 'Encode time as seconds since epoch.',
 'Sanitize value by removing blank entries and populating `value` column.',
 'Normalize uncertainty values in DataFrames.',
 "Callback to update DataFrame 'UNIT' columns based on a lookup table.",
 'Remap detection limit values to MARIS format using a lookup table.',
 "Remap values from 'species' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'biological' to 'enhanced_species' for groups: BIOTA.",
 "Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met.",
 'Add a temporary column with the body part and biological group combined.',
 "Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA.",
 'Create a SMP_ID column from the ID column',
 "Ensure depth value

## Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


In [None]:
#| export
def get_attrs(
    tfm: Transformer, # Transformer object
    zotero_key: str, # Zotero dataset record key
    kw: list = kw # List of keywords
    ) -> dict: # Global attributes
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()

In [None]:
#|eval: false
get_attrs(tfm, zotero_key=zotero_key, kw=kw)

{'geospatial_lat_min': '49.43222222222222',
 'geospatial_lat_max': '81.26805555555555',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))',
 'geospatial_vertical_max': '1850.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1995-01-01T00:00:00',
 'time_coverage_end': '2021-12-31T00:00:00',
 'id': 'LQRA4MMK',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science >
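
As a rough illustration of how bounding attributes like the ones above could be derived from sample coordinates and depths, here is a minimal sketch. It is not the implementation of `BboxCB` or `DepthRangeCB`: the helper names `bbox_attrs` and `vertical_attrs` are hypothetical, plain tuples stand in for DataFrames, and the polygon is assumed to follow the usual WKT ordering of `x y`, that is longitude then latitude.

```python
def bbox_attrs(points):
    """Build geospatial attributes from (lon, lat) tuples (hypothetical helper)."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    lon_min, lon_max = min(lons), max(lons)
    lat_min, lat_max = min(lats), max(lats)
    # Closed ring: the first vertex is repeated at the end, each vertex is "lon lat".
    ring = [(lon_min, lat_min), (lon_max, lat_min),
            (lon_max, lat_max), (lon_min, lat_max),
            (lon_min, lat_min)]
    wkt = 'POLYGON ((' + ', '.join(f'{x} {y}' for x, y in ring) + '))'
    return {
        'geospatial_lat_min': str(lat_min),
        'geospatial_lat_max': str(lat_max),
        'geospatial_lon_min': str(lon_min),
        'geospatial_lon_max': str(lon_max),
        'geospatial_bounds': wkt,
    }

def vertical_attrs(depths):
    """Build vertical range attributes from a list of depths (hypothetical helper)."""
    return {
        'geospatial_vertical_min': str(min(depths)),
        'geospatial_vertical_max': str(max(depths)),
    }

# A few coordinates and depths taken from the sample rows shown in this notebook.
points = [(11.583333, 55.966667), (0.5, 50.75), (-3.594445, 54.872776)]
attrs = {**bbox_attrs(points), **vertical_attrs([2.0, 0.0, 140.0])}
print(attrs['geospatial_bounds'])
```

All values are stored as strings, mirroring the attribute dictionary printed above.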

### Encoding NetCDF

In [None]:
#| export
def encode(
    fname_out_nc: str, # Output file name
    **kwargs # Additional arguments
    ) -> None:
    "Encode data to NetCDF."
    dfs = wfs_processor()
    tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                           )
    encoder.encode()

In [None]:
#|eval: false
encode(fname_out_nc, verbose=True)

--------------------------------------------------------------------------------
Creating enums for the following columns:
['NUCLIDE', 'BODY_PART', 'DL', 'UNIT', 'SPECIES']
Creating enum for nuclide_t with values {'NOT APPLICABLE': -1, 'NOT AVAILABLE': 0, 'h3': 1, 'be7': 2, 'c14': 3, 'k40': 4, 'cr51': 5, 'mn54': 6, 'co57': 7, 'co58': 8, 'co60': 9, 'zn65': 10, 'sr89': 11, 'sr90': 12, 'zr95': 13, 'nb95': 14, 'tc99': 15, 'ru103': 16, 'ru106': 17, 'rh106': 18, 'ag106m': 19, 'ag108': 20, 'ag108m': 21, 'ag110m': 22, 'sb124': 23, 'sb125': 24, 'te129m': 25, 'i129': 28, 'i131': 29, 'cs127': 30, 'cs134': 31, 'cs137': 33, 'ba140': 34, 'la140': 35, 'ce141': 36, 'ce144': 37, 'pm147': 38, 'eu154': 39, 'eu155': 40, 'pb210': 41, 'pb212': 42, 'pb214': 43, 'bi207': 44, 'bi211': 45, 'bi214': 46, 'po210': 47, 'rn220': 48, 'rn222': 49, 'ra223': 50, 'ra224': 51, 'ra225': 52, 'ra226': 53, 'ra228': 54, 'ac228': 55, 'th227': 56, 'th228': 57, 'th232': 59, 'th234': 60, 'pa234': 61, 'u234': 62, 'u235': 63, 'u238'

## NetCDF Review

First, let's review the global attributes of the NetCDF file:

In [None]:
#| eval: false
contents = ExtractNetcdfContents(fname_out_nc)
print(contents.global_attrs)

{'id': 'LQRA4MMK', 'title': 'OSPAR Environmental Monitoring of Radioactive Substances', 'summary': '', 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)', 'history': 'TBD', 'keywords_vocabulary': 'GCMD Science Keywords', 'keywords_vocabulary_url': 'ht

Review the `publisher_postprocess_logs` attribute:

In [None]:
#| eval: false
print(contents.global_attrs['publisher_postprocess_logs'])

Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Create a SMP_ID column from the ID column, Ensure depth values are floats and add 'SMP_DEPTH' columns., C
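
The string above is the list of callback descriptions held in `tfm.logs`, joined with `', '` in `get_attrs`. The pattern of running callbacks in order and recording a description of each can be sketched as follows. This is a simplified stand-in under stated assumptions, not marisco's `Transformer`: the class `MiniTransformer`, the two callbacks, the provider names such as `Cs-137`, and the use of lists of dicts instead of DataFrames are all hypothetical.

```python
from typing import Callable, Dict, List

class MiniTransformer:
    """Hypothetical callback pipeline that logs each callback's docstring."""
    def __init__(self, dfs: Dict[str, List[dict]], cbs: List[Callable]):
        self.dfs = dfs
        self.cbs = cbs
        self.logs: List[str] = []

    def __call__(self) -> Dict[str, List[dict]]:
        for cb in self.cbs:
            cb(self)  # each callback mutates self.dfs in place
            self.logs.append(cb.__doc__ or type(cb).__name__)
        return self.dfs

class LowerStripName:
    "Convert 'nuclide' values to lowercase and strip spaces."
    def __call__(self, tfm):
        for rows in tfm.dfs.values():
            for row in rows:
                row['nuclide'] = row['nuclide'].strip().lower()

class RemapNuclide:
    "Remap provider nuclide names to standardized names."
    def __init__(self, lut: Dict[str, str]):
        self.lut = lut

    def __call__(self, tfm):
        for rows in tfm.dfs.values():
            for row in rows:
                # Names missing from the lookup table are left unchanged.
                row['nuclide'] = self.lut.get(row['nuclide'], row['nuclide'])

# Hypothetical raw values, one row per group.
dfs = {'SEAWATER': [{'nuclide': ' Cs-137 '}], 'BIOTA': [{'nuclide': 'TC-99'}]}
tfm = MiniTransformer(dfs, cbs=[
    LowerStripName(),
    RemapNuclide({'cs-137': 'cs137', 'tc-99': 'tc99'}),
])
tfm()
joined_logs = ', '.join(tfm.logs)  # same join as used for the global attribute
print(joined_logs)
```

Because the logs are plain strings built from docstrings, the order of the callbacks in `cbs` is also the order of the entries in the stored attribute.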

Now let's review the enums of the groups in the NetCDF file:

In [None]:
#| eval: false
print(contents.enum_dicts)

{'BIOTA': {'nuclide': {'NOT APPLICABLE': '-1', 'NOT AVAILABLE': '0', 'h3': '1', 'be7': '2', 'c14': '3', 'k40': '4', 'cr51': '5', 'mn54': '6', 'co57': '7', 'co58': '8', 'co60': '9', 'zn65': '10', 'sr89': '11', 'sr90': '12', 'zr95': '13', 'nb95': '14', 'tc99': '15', 'ru103': '16', 'ru106': '17', 'rh106': '18', 'ag106m': '19', 'ag108': '20', 'ag108m': '21', 'ag110m': '22', 'sb124': '23', 'sb125': '24', 'te129m': '25', 'i129': '28', 'i131': '29', 'cs127': '30', 'cs134': '31', 'cs137': '33', 'ba140': '34', 'la140': '35', 'ce141': '36', 'ce144': '37', 'pm147': '38', 'eu154': '39', 'eu155': '40', 'pb210': '41', 'pb212': '42', 'pb214': '43', 'bi207': '44', 'bi211': '45', 'bi214': '46', 'po210': '47', 'rn220': '48', 'rn222': '49', 'ra223': '50', 'ra224': '51', 'ra225': '52', 'ra226': '53', 'ra228': '54', 'ac228': '55', 'th227': '56', 'th228': '57', 'th232': '59', 'th234': '60', 'pa234': '61', 'u234': '62', 'u235': '63', 'u238': '64', 'np237': '65', 'np239': '66', 'pu238': '67', 'pu239': '68', '
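
The enum dictionaries map labels to codes, with the codes stored as strings. To read encoded columns such as `NUCLIDE`, it can help to invert the mapping. The sketch below uses a small excerpt of the nuclide enum printed above; the helper `invert_enum` and the `'UNKNOWN'` fallback are illustrative assumptions, not part of marisco.

```python
# Excerpt of the nuclide enum shown above (label -> code, codes stored as strings).
nuclide_enum = {
    'NOT APPLICABLE': '-1',
    'NOT AVAILABLE': '0',
    'h3': '1',
    'tc99': '15',
    'cs137': '33',
}

def invert_enum(enum):
    """Return a code -> label mapping with integer keys (hypothetical helper)."""
    return {int(code): label for label, code in enum.items()}

code_to_label = invert_enum(nuclide_enum)

# Integer codes as they would appear in an encoded NUCLIDE column.
codes = [33, 15, 1]
labels = [code_to_label.get(code, 'UNKNOWN') for code in codes]
print(labels)  # ['cs137', 'tc99', 'h3']
```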

Let's review the data of the NetCDF file:

In [None]:
#| eval: false
dfs = contents.dfs
dfs

{'BIOTA':              LON        LAT        TIME  SMP_ID  NUCLIDE      VALUE  UNIT  \
 0      11.583333  55.966667   797040000   38847       33   2.021700     5   
 1       0.500000  50.750000   805680000   54128       33   0.341000     5   
 2       0.500000  50.750000   801100800   54127       33   0.225000     5   
 3       1.203056  51.084167   806025600   54126       33   0.155000     5   
 4       1.203056  51.084167   793238400   54125       33   0.137000     5   
 ...          ...        ...         ...     ...      ...        ...   ...   
 14710   6.000000  56.000000  1615420800   95771       33   0.148000     5   
 14711  12.000000  56.500000  1614124800   95772       33   0.312000     5   
 14712   6.000000  56.000000  1615420800   95773       33   0.112000     5   
 14713   7.850555  54.087776  1638316800   95397       33   0.071647     5   
 14714  -3.594445  54.872776  1640908800   96864       15  13.200000     5   
 
             UNC  DL  SPECIES  BODY_PART  
 0      0.
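
The `TIME` column holds seconds since the Unix epoch, as produced by `EncodeTimeCB`. A quick way to check a few values by eye is to convert them back to ISO timestamps. This sketch assumes the values are in UTC; the helper name `epoch_to_iso` is hypothetical.

```python
from datetime import datetime, timezone

def epoch_to_iso(seconds: int) -> str:
    """Format seconds since 1970-01-01 (UTC assumed) as an ISO 8601 string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%S')

# First and last TIME values from the BIOTA output above.
first, last = 797040000, 1640908800
print(epoch_to_iso(first))  # 1995-04-05T00:00:00
print(epoch_to_iso(last))   # 2021-12-31T00:00:00
```

The last value matches the `time_coverage_end` global attribute reported earlier.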

Let's review the biota data:

In [None]:
#| eval: false
nc_dfs_biota=dfs['BIOTA']
nc_dfs_biota

|  | LON | LAT | TIME | SMP_ID | NUCLIDE | VALUE | UNIT | UNC | DL | SPECIES | BODY_PART |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.583333 | 55.966667 | 797040000 | 38847 | 33 | 2.021700 | 5 | 0.031336 | 1 | 96 | 40 |
| 1 | 0.500000 | 50.750000 | 805680000 | 54128 | 33 | 0.341000 | 5 | 0.014000 | 1 | 99 | 52 |
| 2 | 0.500000 | 50.750000 | 801100800 | 54127 | 33 | 0.225000 | 5 | 0.022500 | 1 | 99 | 52 |
| 3 | 1.203056 | 51.084167 | 806025600 | 54126 | 33 | 0.155000 | 5 | 0.015500 | 1 | 96 | 56 |
| 4 | 1.203056 | 51.084167 | 793238400 | 54125 | 33 | 0.137000 | 5 | 0.018500 | 1 | 96 | 56 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14710 | 6.000000 | 56.000000 | 1615420800 | 95771 | 33 | 0.148000 | 5 | 0.005476 | 1 | 99 | 4 |
| 14711 | 12.000000 | 56.500000 | 1614124800 | 95772 | 33 | 0.312000 | 5 | 0.008580 | 1 | 192 | 4 |
| 14712 | 6.000000 | 56.000000 | 1615420800 | 95773 | 33 | 0.112000 | 5 | 0.003640 | 1 | 192 | 4 |
| 14713 | 7.850555 | 54.087776 | 1638316800 | 95397 | 33 | 0.071647 | 5 | 0.001660 | 1 | 139 | 1 |


Let's review the seawater data:

In [None]:
#| eval: false
nc_dfs_seawater=dfs['SEAWATER']
nc_dfs_seawater

|  | LON | LAT | SMP_DEPTH | TIME | SMP_ID | NUCLIDE | VALUE | UNIT | UNC | DL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.783334 | 56.166668 | 2.0 | 799286400 | 45552 | 33 | 0.040141 | 1 | 3.411960e-04 | 1 |
| 1 | -15.900000 | 63.650002 | 0.0 | 792806400 | 67787 | 33 | 0.003000 | 1 | 2.250000e-04 | 1 |
| 2 | -15.900000 | 63.650002 | 140.0 | 792806400 | 67788 | 33 | 0.003100 | 1 | 2.325000e-04 | 1 |
| 3 | -25.000000 | 64.330002 | 0.0 | 800496000 | 67789 | 33 | 0.002600 | 1 | 1.950000e-04 | 1 |
| 4 | -27.969999 | 64.330002 | 0.0 | 800496000 | 67790 | 33 | 0.002400 | 1 | 1.800000e-04 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18303 | 3.565556 | 51.411945 | 1.0 | 1612137600 | 120873 | 77 | 0.000035 | 1 | 1.730000e-06 | 1 |
| 18304 | 3.565556 | 51.411945 | 1.0 | 1626652800 | 120875 | 77 | 0.000008 | 1 | 3.950000e-07 | 1 |
| 18305 | 3.565556 | 51.411945 | 1.0 | 1634169600 | 120876 | 77 | 0.000002 | 1 | 9.000000e-08 | 1 |
| 18306 | 3.493889 | 51.719444 | 1.0 | 1610496000 | 120877 | 1 | 5.150000 | 1 | 2.575000e-01 | 1 |


## Data Format Conversion 

The MARIS data processing workflow involves two key steps:

1. **NetCDF to Standardized, OpenRefine-Compatible CSV**
   - Convert standardized NetCDF files to CSV formats compatible with OpenRefine using the `NetCDFDecoder`.
   - Preserve data integrity and variable relationships.
   - Maintain standardized nomenclature and units.

2. **Database Integration**
   - Process the converted CSV files using OpenRefine.
   - Apply data cleaning and standardization rules.
   - Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the `NetCDFDecoder` class.

In [None]:
#|eval: false
decode(fname_in=fname_out_nc, verbose=True)

Saved BIOTA to ../../_data/output/191-OSPAR-2024_BIOTA.csv
Saved SEAWATER to ../../_data/output/191-OSPAR-2024_SEAWATER.csv
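
The output above shows one CSV file per NetCDF group, named after the input file stem and the group. A minimal sketch of that per-group export pattern, using only the standard library, could look like this. It is an assumption-laden illustration rather than the `NetCDFDecoder` implementation: the helper `save_groups_to_csv` is hypothetical, each group is represented as a list of dicts instead of a DataFrame, and files are written to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

def save_groups_to_csv(groups, dest_dir, stem):
    """Write each group to '<stem>_<GROUP>.csv' and return the paths (hypothetical helper)."""
    paths = {}
    for name, rows in groups.items():
        path = Path(dest_dir) / f'{stem}_{name}.csv'
        with path.open('w', newline='') as f:
            # Column order follows the keys of the first row.
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        paths[name] = path
    return paths

# One sample row per group, taken from the tables shown earlier.
groups = {
    'BIOTA': [{'LON': 11.583333, 'LAT': 55.966667, 'NUCLIDE': 33}],
    'SEAWATER': [{'LON': 11.783334, 'LAT': 56.166668, 'SMP_DEPTH': 2.0, 'NUCLIDE': 33}],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_groups_to_csv(groups, tmp, stem='191-OSPAR-2024')
    names = sorted(p.name for p in paths.values())
    biota_header = paths['BIOTA'].read_text().splitlines()[0]

print(names)
print(biota_header)
```

Keeping one file per group preserves the different column sets of each sample type, for example `SMP_DEPTH` for seawater and `SPECIES` and `BODY_PART` for biota.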
