In [None]:
#| default_exp handlers.ospar

# OSPAR 

> This data pipeline, known as a "handler" in Marisco terminology, is designed to clean, standardize, and encode [OSPAR data](https://odims.ospar.org/en/) into `NetCDF` format. The handler processes raw OSPAR data, applying various transformations and lookups to align it with `MARIS` data standards.

Key functions of this handler:

- **Cleans** and **normalizes** raw OSPAR data
- **Applies standardized nomenclature** and units
- **Encodes the processed data** into `NetCDF` format compatible with MARIS requirements

This handler is a crucial component in the Marisco data processing workflow, ensuring OSPAR data is properly integrated into the MARIS database.

:::{.callout-tip}

For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)](https://github.com/franckalbinet/marisco/tree/main/install_configure_guide) for detailed information.

:::

The present notebook is intended to be an instance of [Literate Programming](https://www.wikiwand.com/en/articles/Literate_programming): a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case `marisco/handlers/ospar.py`), the code snippet is added to the module using `#| export`, as provided by the wonderful [nbdev](https://nbdev.readthedocs.io/en/latest/) library.

In [None]:
#| hide
%load_ext autoreload
%autoreload 2

In [None]:
#| export
import pandas as pd 
import numpy as np
#from functools import partial 
import fastcore.all as fc 
from fastcore.basics import patch, store_attr
from pathlib import Path 
#from dataclasses import asdict
from typing import List, Dict, Callable, Tuple, Any 
#from collections import OrderedDict, defaultdict
import re
#from functools import partial

from datetime import datetime
from owslib.wfs import WebFeatureService
from io import StringIO

from marisco.utils import (
    Remapper, 
    ddmm_to_dd,
    Match, 
    get_unique_across_dfs,
    NA,
    nc_to_dfs,
    get_netcdf_properties, 
    get_netcdf_group_properties,
    get_netcdf_variable_properties
)

from marisco.callbacks import (
    Callback, 
    Transformer, 
    EncodeTimeCB, 
    AddSampleTypeIdColumnCB,
    AddNuclideIdColumnCB, 
    LowerStripNameCB, 
    SanitizeLonLatCB, 
    CompareDfsAndTfmCB, 
    RemapCB
)

from marisco.metadata import (
    GlobAttrsFeeder, 
    BboxCB, 
    DepthRangeCB, 
    TimeRangeCB, 
    ZoteroCB, 
    KeyValuePairCB
)

from marisco.configs import (
    nuc_lut_path, 
    nc_tpl_path, 
    cfg, 
    species_lut_path, 
    sediments_lut_path, 
    bodyparts_lut_path, 
    detection_limit_lut_path, 
    filtered_lut_path, 
    get_lut, 
    unit_lut_path,
    prepmet_lut_path,
    sampmet_lut_path,
    counmet_lut_path, 
    lab_lut_path,
    NC_VARS
)

from marisco.encoders import (
    NetCDFEncoder, 
)

from marisco.handlers.data_format_transformation import (
    decode, 
)

import warnings
warnings.filterwarnings('ignore')



In [None]:
#| hide
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)  # Show full column width

## Configuration and File Paths

The handler requires several configuration parameters:

1. **fname_out_nc**: Output path and filename for NetCDF file (relative paths supported) 
2. **zotero_key**: Key for retrieving dataset attributes from [Zotero](https://www.zotero.org/)
3. **ref_id**: Reference ID in the MARIS [Zotero library](https://www.zotero.org/groups/2432820/maris/library)

In [None]:
#| export
fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
zotero_key ='LQRA4MMK' # OSPAR MORS zotero key
ref_id = 191 # OSPAR reference id as defined by MARIS

## Load data

OSPAR data is available through the [ODIMS OSPAR platform](https://odims.ospar.org/en/search/). The data is hosted and can be accessed using a [Web Feature Service (WFS)](https://odims.ospar.org/geoserver/odims/wfs/?service=WFS&request=GetCapabilities), which allows for efficient querying and retrieval of geospatial data.

The `OsparWfsProcessor` is designed to interact with this Web Feature Service. It utilizes specific search parameters:
- `ospar_biota` for biological data
- `ospar_seawater` for seawater data

Upon execution, the processor retrieves the relevant OSPAR data and organizes it into a structured format. The results are returned as a dictionary of DataFrames, where the keys are:
- `BIOTA` for biota data
- `SEAWATER` for seawater data

:::{.callout-tip}

**Feedback to Data Provider.**

Please note that we are assuming that new versions of data supersede all previous versions. Files are stored on the WFS service with the following naming convention:

- **Prefix**: All filenames start with `odims:ospar_`, indicating that the data originates from the OSPAR dataset managed by the ODIMS platform.

- **Data Type**: Following the prefix, the filename specifies the type of data:
  - `biota` - Indicates biological data.
  - `seawater` - Indicates seawater-related data.

- **Date and Version**:
  - The year of the dataset is represented by four digits (e.g., `2023`).
  - The month of the dataset is represented by two digits (e.g., `04` for April).
  - The version of the dataset is represented by three digits, where higher numbers indicate more recent versions (e.g., `001`).

- **Separators**: Underscores (`_`) are used as separators to distinctly divide different parts of the filename.

Consider the filename `odims:ospar_biota_2023_01_001`. This indicates a file containing biota data from January 2023, version 001. Under the current implementation, this data would be replaced by the file `odims:ospar_biota_2023_01_002` (i.e., version 002), as newer versions supersede older ones.

:::
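
The naming convention above can be made concrete with a small standalone sketch. It parses feature names with the same regular expression used by `check_feature_pattern` and keeps the highest version for each (type, year, month) combination. The helper name `latest_features` and the sample feature names are hypothetical and serve only as an illustration; the handler itself performs these steps with pandas in `OsparWfsProcessor`.

```python
import re

# Same pattern as the one used in `check_feature_pattern`
PATTERN = re.compile(r'^odims:ospar_(biota|seawater)_(\d{4})_(\d{2})_(\d{3})$')

def latest_features(names):
    "Keep the highest version per (type, year, month); skip names that do not match."
    latest = {}
    for name in names:
        m = PATTERN.match(name)
        if m is None:
            continue  # name does not follow the convention
        dtype = m.group(1)
        year, month, version = int(m.group(2)), int(m.group(3)), int(m.group(4))
        key = (dtype, year, month)
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, name)
    return sorted(name for _, name in latest.values())

# Hypothetical feature names, for illustration only
names = [
    'odims:ospar_biota_2023_01_001',
    'odims:ospar_biota_2023_01_002',
    'odims:ospar_seawater_2023_01_001',
    'odims:ospar_biota_summary',
]
print(latest_features(names))
```

Here version `002` of the January 2023 biota data replaces version `001`, and the name that does not follow the convention is ignored.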


In [None]:
#| export
class OsparWfsProcessor:
    "Processor for OSPAR Web Feature Service operations, managing feature filtering and data fetching."
    
    def __init__(self, url, search_params=None, version='2.0.0'):
        "Initialize with URL, version, and search parameters."
        fc.store_attr()
        self.wfs = WebFeatureService(url=self.url, version=self.version)
        self.features_dfs = {}
        self.dfs = {}

    def __call__(self):
        "Process, fetch and filter OSPAR data"
        self.filter_features()
        self.check_feature_pattern()
        self.extract_version_from_feature_name()
        self.filter_latest_versions()
        self.fetch_and_combine_csv()

        return self.dfs

In [None]:
#| export
@patch
def filter_features(self: OsparWfsProcessor):
    "Filter features based on search parameters."
    available_features = list(self.wfs.contents.keys())
    for group, value in self.search_params.items():
        filtered_features = [ftype for ftype in available_features if value in ftype]
        self.features_dfs[group] = pd.DataFrame([{'feature': ftype} for ftype in filtered_features])


In [None]:
#| export
@patch
def check_feature_pattern(self: OsparWfsProcessor):
    """
    Check and retain features conforming to a specific pattern, printing unmatched features.
    """
    pattern = re.compile(r'^odims:ospar_(biota|seawater)_(\d{4})_(\d{2})_(\d{3})$')
    unmatched_features = []
    for group, df in list(self.features_dfs.items()):
        # Apply the pattern and find unmatched features
        matched_features = df['feature'].apply(lambda x: bool(pattern.match(x)))
        unmatched = df[~matched_features]['feature']
        unmatched_features.extend(unmatched.tolist())
        # Filter the DataFrame to only include matched features
        self.features_dfs[group] = df[matched_features]

    if unmatched_features:
        print("Unmatched features:", unmatched_features)

In [None]:
#| export
@patch
def extract_version_from_feature_name(self: OsparWfsProcessor):
    "Extract version from feature name."
    for group, df in list(self.features_dfs.items()):
        df['source'] = df['feature'].apply(lambda x: x.split('_')[0])
        df['type'] = df['feature'].apply(lambda x: x.split('_')[1])
        df['year'] = df['feature'].apply(lambda x: x.split('_')[2])
        df['month'] = df['feature'].apply(lambda x: x.split('_')[3])
        df['version'] = df['feature'].apply(lambda x: x.split('_')[4])

In [None]:
#| export
@patch
def filter_latest_versions(self: OsparWfsProcessor):
    "Filter each DataFrame to include only the latest version of each feature"
    for group, df in list(self.features_dfs.items()):
        df[['year', 'month', 'version']] = df[['year', 'month', 'version']].astype(int)
        
        if group == 'BIOTA':
            # Remove biota data for the year 2022, as the data is unavailable on the WFS.
            df = df[df['year'] != 2022]
            
        idx = df.groupby(['source', 'type', 'year', 'month'])['version'].idxmax()
        self.features_dfs[group] = df.loc[idx]

In [None]:
#| export
@patch
def fetch_and_combine_csv(self: OsparWfsProcessor):
    "Fetch CSV data for each feature from the WFS and combine into a single DataFrame for each sample type."
    for group, df in list(self.features_dfs.items()):
        combined_df = pd.DataFrame()
        for feature in df['feature']:
            try:
                response = self.wfs.getfeature(typename=feature, outputFormat='csv')
                csv_data = StringIO(response.read().decode('utf-8'))
                df_csv = pd.read_csv(csv_data)
                df_csv.columns = df_csv.columns.str.lower()  # Convert column names to lowercase
                combined_df = pd.concat([combined_df, df_csv], ignore_index=True)
            except Exception as e:
                print(f"Failed to fetch data for {feature}: {e}")
        self.dfs[group] = combined_df

In [None]:
#|eval: false
wfs_processor=OsparWfsProcessor(url= 'https://odims.ospar.org/geoserver/odims/wfs', search_params={'BIOTA': 'ospar_biota', 'SEAWATER': 'ospar_seawater'})
dfs = wfs_processor()

Display the head of the `SEAWATER` dataframe with all columns.

In [None]:
#|eval: false
# Show all columns
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(dfs['SEAWATER'].head())

Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,sampling_d,sampling_1,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year,f1,reference_
0,ospar_seawater_1995_01_003.1,POINT (56.16666666666666 11.78333333333333),45552.0,Denmark,12,HesselÃ¸,H95-22,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.040141,6823919,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
1,ospar_seawater_1995_01_003.2,POINT (56.16666666666666 11.78333333333333),45553.0,Denmark,12,HesselÃ¸,H95-23,56,10,0.0,N,11,47,0.0,E,Water,24.0,1995-05-01T00:00:00,137Cs,0,0.037117,7794675,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
2,ospar_seawater_1995_01_003.3,POINT (56.16666666666666 11.78333333333333),45554.0,Denmark,12,HesselÃ¸,H95-56,56,10,0.0,N,11,47,0.0,E,Water,2.0,1995-11-01T00:00:00,137Cs,0,0.04345,56485,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
3,ospar_seawater_1995_01_003.4,POINT (56.16666666666666 11.78333333333333),45555.0,Denmark,12,HesselÃ¸,H95-57,56,10,0.0,N,11,47,0.0,E,Water,25.0,1995-11-01T00:00:00,137Cs,0,0.04608,50688,Bq/l,RisÃ¸-DTU,,,,56.166667,11.783333,1995.0,,
4,ospar_seawater_1995_01_003.5,POINT (56.11666666666667 11.16666666666667),45556.0,Denmark,12,Kattegat SW,H95-20,56,7,0.0,N,11,10,0.0,E,Water,2.0,1995-05-01T00:00:00,137Cs,0,0.05033,75495,Bq/l,RisÃ¸-DTU,,,,56.116667,11.166667,1995.0,,


Display the head of the `BIOTA` dataframe with all columns.

In [None]:
#|eval: false
# Show all columns
with pd.option_context('display.max_columns', None, 'display.max_colwidth', None):
    display(dfs['BIOTA'].head())

Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample_typ,biological,species,body_part,sampling_d,nuclide,value_type,activity_o,uncertaint,unit,data_provi,measuremen,sample_com,reference,latdd,longdd,year
0,ospar_biota_1995_01_003.1,POINT (55.96666666666667 11.58333333333333),38847,Denmark,12,Klint,950089,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-04-05T00:00:00,137Cs,0,2.0217,626727,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
1,ospar_biota_1995_01_003.2,POINT (55.96666666666667 11.58333333333333),38848,Denmark,12,Klint,950229,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-07-07T00:00:00,137Cs,0,2.34446,468892,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
2,ospar_biota_1995_01_003.3,POINT (55.96666666666667 11.58333333333333),38849,Denmark,12,Klint,950360,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus serratus,Whole plant,1995-09-19T00:00:00,137Cs,0,2.62356,4197696,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
3,ospar_biota_1995_01_003.4,POINT (55.96666666666667 11.58333333333333),38850,Denmark,12,Klint,950359,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-09-19T00:00:00,137Cs,0,2.7807,305877,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995
4,ospar_biota_1995_01_003.5,POINT (55.96666666666667 11.58333333333333),38851,Denmark,12,Klint,950489,55,58,0.0,N,11,35,0.0,E,BIOT,Seaweed,Fucus vesiculosus,Whole plant,1995-12-21T00:00:00,137Cs,0,1.51102,1662122,Bq/kg f.w.,RisÃÂ¸-DTU,,,,55.966667,11.583333,1995


## Nuclide Name Normalization

### Lower & strip nuclide names

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: Some nuclide names contain one or multiple trailing spaces.



:::

In [None]:
#| eval: false
dfs = wfs_processor()
df = get_unique_across_dfs(dfs, 'nuclide', as_df=True, include_nchars=True)
df['stripped_chars'] = df['value'].str.strip().str.replace(' ', '').str.len()
print(df[df['n_chars'] != df['stripped_chars']])

    index        value  n_chars  stripped_chars
13     13  239, 240 Pu       11               9


In [None]:
df

Unnamed: 0,index,value,n_chars,stripped_chars
0,0,210Po,5,5
1,1,238Pu,5,5
2,2,228Ra,5,5
3,3,226Ra,5,5
4,4,CS-137,6,6
5,5,RA-228,6,6
6,6,241Am,5,5
7,7,"239,240Pu",9,9
8,8,RA-226,6,6
9,9,137Cs,5,5


To fix this issue, we use the `LowerStripNameCB` callback. For each dataframe in the dictionary of dataframes, it corrects the nuclide name by converting it to lowercase and stripping any leading or trailing whitespace.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[LowerStripNameCB(col_src='nuclide', col_dst='nuclide')])
dfs_output=tfm()
for key, df in dfs_output.items():
    print(f'{key} nuclides: ')
    print(df['nuclide'].unique())

BIOTA nuclides: 
['137cs' '99tc' '239,240pu' '210po' '210pb' '226ra' '228ra' 'cs-137' '3h'
 '238pu' '239, 240 pu' '241am']
SEAWATER nuclides: 
['137cs' '239,240pu' '3h' '99tc' '226ra' '228ra' '210po' '210pb' 'ra-226'
 'ra-228']


### Remap nuclide names to MARIS data formats

Below, we map nuclide names used by OSPAR to the MARIS standard nuclide names.

Remapping data provider nomenclatures to MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

1. **Inspect** data provider nomenclature;
2. **Match** automatically against MARIS nomenclature (using a fuzzy matching algorithm); 
3. **Fix** potential mismatches; 
4. **Apply** the lookup table to the dataframe.

We will refer to this process as **IMFA** (**I**nspect, **M**atch, **F**ix, **A**pply).
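
The four IMFA steps can be sketched in a few lines of standalone Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for the fuzzy matcher and is not the algorithm used by `Remapper`, `best_match` is a hypothetical helper, and `maris_names` is a made-up excerpt of the MARIS nomenclature.

```python
from difflib import SequenceMatcher

def best_match(name, candidates):
    "Return (candidate, similarity) with the highest similarity ratio in [0, 1]."
    scored = [(c, SequenceMatcher(None, name, c).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

# Inspect: provider names as they appear after lowercasing and stripping
provider_names = ['137cs', 'cs-137', '3h']
# Hypothetical excerpt of the MARIS nomenclature (assumption)
maris_names = ['cs137', 'h3', 'tc99']

# Match: propose the closest MARIS name for each provider name
lut = {name: best_match(name, maris_names)[0] for name in provider_names}

# Fix: manual corrections take precedence over the automatic proposals
fixes = {'137cs': 'cs137', '3h': 'h3'}
lut.update(fixes)

# Apply: remap the values using the lookup table
values = ['137cs', '3h', 'cs-137', '137cs']
remapped = [lut[v] for v in values]
print(remapped)
```

Keeping the manual fixes in a separate dictionary, applied after the automatic match, makes every human decision explicit and easy to review.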

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `nuclide` column has inconsistent naming. E.g:

- `Cs-137`,  `137Cs` or `CS-137`
- `239, 240 pu` or `239,240 pu`
- `ra-226` and `226ra` 

See below:

:::

In [None]:
#| eval: false
get_unique_across_dfs(dfs, col_name='nuclide', as_df=True)

Unnamed: 0,index,value
0,0,210Po
1,1,238Pu
2,2,228Ra
3,3,226Ra
4,4,CS-137
5,5,RA-228
6,6,241Am
7,7,"239,240Pu"
8,8,RA-226
9,9,137Cs


Let's now create an instance of a [fuzzy matching algorithm](https://www.wikiwand.com/en/articles/Approximate_string_matching) `Remapper`. This instance will match the nuclide names of the OSPAR dataset to the MARIS standard nuclide names.

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_output, col_name='nuclide', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_ospar.pkl')

Let's try to match OSPAR nuclide names to MARIS standard nuclide names as automatically as possible. The `match_score` column allows us to assess the results:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1, verbose=True)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 29.99it/s]

0 entries matched the criteria, while 14 entries had a match score of 1 or higher.





Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"239, 240 pu",pu240,"239, 240 pu",8
"239,240pu",pu240,"239,240pu",6
228ra,u235,228ra,4
241am,pu241,241am,4
210pb,ru106,210pb,4
137cs,i133,137cs,4
226ra,u234,226ra,4
210po,ru106,210po,4
99tc,tu,99tc,3
238pu,u238,238pu,3


We can now manually inspect the unmatched nuclide names and create a table to correct them to the MARIS standard:

In [None]:
#| export
fixes_nuclide_names = {
    '99tc': 'tc99',
    '238pu': 'pu238',
    '226ra': 'ra226',
    'ra-226': 'ra226',
    'ra-228': 'ra228',    
    '210pb': 'pb210',
    '241am': 'am241',
    '228ra': 'ra228',
    '137cs': 'cs137',
    '210po': 'po210',
    '239,240pu': 'pu239_240_tot',
    '239, 240 pu': 'pu239_240_tot',
    'cs-137': 'cs137',
    '3h': 'h3'
    }

We now include the table `fixes_nuclide_names`, which applies manual corrections to the nuclide names before the remapping process. 
The `generate_lookup_table` function has an `overwrite` parameter (default is `True`), which, when set to `True`, creates a pickle file cache of the lookup table. We can now test the remapping process:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
fc.test_eq(len(remapper.select_match(match_score_threshold=1)), 0)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 35.26it/s]


If we want to view all the remapped nuclides, we can set the match score threshold to 0:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
remapper.select_match(match_score_threshold=0, verbose=True)

Processing:   0%|          | 0/14 [00:00<?, ?it/s]

Processing: 100%|██████████| 14/14 [00:00<00:00, 45.70it/s]

0 entries matched the criteria, while 14 entries had a match score of 0 or higher.





Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
cs-137,cs137,cs-137,0
228ra,ra228,228ra,0
ra-226,ra226,ra-226,0
"239,240pu",pu239_240_tot,"239,240pu",0
"239, 240 pu",pu239_240_tot,"239, 240 pu",0
241am,am241,241am,0
3h,h3,3h,0
ra-228,ra228,ra-228,0
99tc,tc99,99tc,0
210pb,pb210,210pb,0


Values are remapped correctly! We can now create a callback `RemapNuclideNameCB` to remap the nuclide names. Note that we pass `overwrite=False` to `generate_lookup_table` so that the cached lookup table is reused.

In [None]:
#| export
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
                                   maris_lut_fn=nuc_lut_path,
                                   maris_col_id='nuclide_id',
                                   maris_col_name='nc_name',
                                   provider_col_to_match='value',
                                   provider_col_key='value',
                                   fname_cache='nuclides_ospar.pkl').generate_lookup_table(fixes=fixes_nuclide_names, 
                                                                                            as_df=False, overwrite=False)

In [None]:
#| export
class RemapNuclideNameCB(Callback):
    "Remap data provider nuclide names to standardized MARIS nuclide names."
    def __init__(self, 
                 fn_lut: Callable, # Function that returns the lookup table dictionary
                 col_name: str # Column name to remap
                ):
        fc.store_attr()

    def __call__(self, tfm: Transformer):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name=self.col_name, as_df=True)
        #lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}    
        lut = {k: v.matched_id for k, v in self.fn_lut(df_uniques).items()}    
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k][self.col_name].replace(lut)

Let's see it in action, along with the `LowerStripNameCB` callback:

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide')
                            ])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'{key} NUCLIDE unique: ', dfs_out[key]['NUCLIDE'].unique())

BIOTA NUCLIDE unique:  [33 15 77 47 41 53 54  1 67 72]
SEAWATER NUCLIDE unique:  [33 77  1 15 53 54 47 41]


## Standardize Time

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: There is an inconsistency in the column names used for time between the `SEAWATER` and `BIOTA` datasets. `SEAWATER` uses the column name `sampling_1` and `BIOTA` uses the column name `sampling_d`.

:::

In [None]:
#| eval: false
dfs = wfs_processor()
print('Number of NaN values in sampling_1 for SEAWATER: ', dfs['SEAWATER']['sampling_1'].isnull().sum())
print('Number of NaN values in sampling_d for BIOTA: ', dfs['BIOTA']['sampling_d'].isnull().sum())

Number of NaN values in sampling_1 for SEAWATER:  0
Number of NaN values in sampling_d for BIOTA:  0


Create a callback that parses the timestamps in the dictionary of dataframes (formatted as `%Y-%m-%dT%H:%M:%S`, e.g. `1995-05-01T00:00:00`) into a `TIME` column and drops rows whose date cannot be parsed:

In [None]:
#| export
class ParseTimeCB(Callback):
    "Parse the time format in the dataframe and check for inconsistencies."
    def __call__(self, tfm):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                # Parse 'sampling_1' (SEAWATER timestamp column) if present
                if 'sampling_1' in df.columns:
                    df['TIME'] = pd.to_datetime(df['sampling_1'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
            if grp == 'BIOTA':
                # Parse 'sampling_d' (BIOTA timestamp column) if present
                if 'sampling_d' in df.columns:
                    df['TIME'] = pd.to_datetime(df['sampling_d'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
            # Drop rows where TIME is still NaN after processing
            df.dropna(subset=['TIME'], inplace=True)

Apply the transformer with the `ParseTimeCB` callback, then print the `TIME` column of the `SEAWATER` dataframe.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    ParseTimeCB(),
    CompareDfsAndTfmCB(dfs)])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['SEAWATER']['TIME'])

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

0       1995-05-01
1       1995-05-01
2       1995-11-01
3       1995-11-01
4       1995-05-01
           ...    
19014   2022-08-28
19015   2022-08-29
19016   2022-08-29
19017   2022-08-29
19018   2022-07-01
Name: TIME, Length: 19019, dtype: datetime64[ns]


The NetCDF time format requires the time to be encoded as the number of seconds since a time of origin. In our case the time of origin is `1970-01-01`, as indicated in the `CONFIGS['units']['time']` dictionary of `configs.ipynb`.

`EncodeTimeCB` converts the OSPAR `time` format to the MARIS NetCDF `time` format.
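
As a minimal sketch of what such an encoding amounts to, the snippet below turns an OSPAR-style timestamp into a count of seconds elapsed since `1970-01-01`, the unit reported in the `EncodeTimeCB` log message. The helper `to_epoch_seconds` is hypothetical, and treating the timestamps as UTC is an assumption made for this illustration; it is not the `EncodeTimeCB` implementation.

```python
from datetime import datetime, timezone

def to_epoch_seconds(timestamp):
    "Parse a '%Y-%m-%dT%H:%M:%S' timestamp and return seconds since 1970-01-01 (UTC assumed)."
    dt = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S')
    # Assumption: timestamps carry no timezone and are interpreted as UTC
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

print(to_epoch_seconds('1970-01-02T00:00:00'))  # one day after the origin
print(to_epoch_seconds('1995-05-01T00:00:00'))  # first SEAWATER sampling date shown above
```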

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.logs)
                            

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

['Parse the time format in the dataframe and check for inconsistencies.', 'Encode time as seconds since epoch.', 'Create a dataframe of dropped data. Data included in the `dfs` not in the `tfm`.']


## Sanitize value

We copy the measurement values (by default the `activity_o` column) into a single `VALUE` column and drop rows where the value is missing.

In [None]:
#| export
class SanitizeValueCB(Callback):
    "Sanitize value by removing blank entries and populating the `VALUE` column."
    def __init__(self, 
                 value_col: str='activity_o' # Column name to sanitize
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df.dropna(subset=[self.value_col], inplace=True)
            df['VALUE'] = df[self.value_col]

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            CompareDfsAndTfmCB(dfs)])

tfm()

print('Example of VALUE column:')
print(tfm.dfs['SEAWATER'][['VALUE']].head())
print('\nComparison stats:')
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

Example of VALUE column:
      VALUE
0  0.040141
1  0.037117
2  0.043450
3  0.046080
4  0.050330

Comparison stats:
                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 



## Normalize uncertainty

:::{.callout-tip}

**Feedback to Data Provider**: We have noticed that some entries in the `uncertaint` column use a comma (`,`) as a decimal separator. Please consider standardizing these entries to use a period (`.`) as the decimal separator. 

:::

For each sample type in the OSPAR dataset, the reported uncertainty is given as an expanded uncertainty with a coverage factor `𝑘=2`. For further details, refer to the [OSPAR reporting guidelines](https://mcc.jrc.ec.europa.eu/documents/OSPAR/Guidelines_forestimationof_a_%20measurefor_uncertainty_in_OSPARmonitoring.pdf).

**Note**: For MARIS, the OSPAR uncertainty values are normalized to standard uncertainty with a coverage factor `𝑘=1`.
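
The normalization boils down to two steps: read the reported value as a number (accepting a comma as decimal separator) and divide the expanded uncertainty by its coverage factor. The standalone sketch below illustrates this on single values. The helper `to_standard_uncertainty` is hypothetical, and returning `None` for unparseable entries is a simplification for this example (the pandas-based callback yields `NaN` instead).

```python
def to_standard_uncertainty(raw, k=2.0):
    "Convert a reported expanded uncertainty (coverage factor k) to standard uncertainty (k=1)."
    try:
        # Accept both '0,12' and '0.12' as decimal notations
        expanded = float(str(raw).replace(',', '.'))
    except ValueError:
        return None  # unparseable entries are treated as missing
    return expanded / k

print(to_standard_uncertainty('0,12'))
print(to_standard_uncertainty(1.468))
print(to_standard_uncertainty('n/a'))
```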

`NormalizeUncCB` callback normalizes the uncertainty using the following `lambda` function:

In [None]:
#| export
unc_exp2stan = lambda df, unc_col: df[unc_col] / 2

In [None]:
#| export
class NormalizeUncCB(Callback):
    """Normalize uncertainty values in DataFrames."""
    def __init__(self, 
                 col_unc: str='uncertaint', # Column name to normalize
                 fn_convert_unc: Callable=unc_exp2stan, # Function correcting coverage factor
                 ): 
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            self._convert_commas_to_periods(df)
            self._convert_to_float(df)
            self._apply_conversion_function(df)

    def _convert_commas_to_periods(self, df):
        """Convert commas to periods in the uncertainty column."""
        df[self.col_unc] = df[self.col_unc].astype(str).str.replace(',', '.')

    def _convert_to_float(self, df):
        """Convert uncertainty column to float, handling errors by setting them to NaN."""
        df[self.col_unc] = pd.to_numeric(df[self.col_unc], errors='coerce')

    def _apply_conversion_function(self, df):
        """Apply the conversion function to normalize the uncertainty values."""
        df['UNC'] = self.fn_convert_unc(df, self.col_unc)

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
        SanitizeValueCB(),               
        NormalizeUncCB()
    ])
tfm()

for grp in ['SEAWATER', 'BIOTA']:
    print(f'\n{grp}:')
    print(tfm.dfs[grp][['VALUE', 'UNC']].head())


SEAWATER:
      VALUE       UNC
0  0.040141  0.000341
1  0.037117  0.000390
2  0.043450  0.000282
3  0.046080  0.000253
4  0.050330  0.000377

BIOTA:
     VALUE       UNC
0  2.02170  0.031336
1  2.34446  0.023445
2  2.62356  0.020988
3  2.78070  0.015294
4  1.51102  0.008311


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `SEAWATER` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

To show situations where the uncertainty is much greater than the value we will calculate the 'relative uncertainty' for the seawater dataset. 

In [None]:
grp='SEAWATER'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'])
    # Multiply by 100 to convert to percentage
    * 100
)

Now we will return all rows where the relative uncertainty is greater than 100% for the seawater dataset.

In [None]:
dfs['SEAWATER'].columns

Index(['fid', 'the_geom', 'id', 'contractin', 'rsc_sub_di', 'station_id',
       'sample_id', 'latd', 'latm', 'lats', 'latdir', 'longd', 'longm',
       'longs', 'longdir', 'sample_typ', 'sampling_d', 'sampling_1', 'nuclide',
       'value_type', 'activity_o', 'uncertaint', 'unit', 'data_provi',
       'measuremen', 'sample_com', 'reference', 'latdd', 'longdd', 'year',
       'f1', 'reference_'],
      dtype='object')

In [None]:
threshold = 100
cols_to_show=['id', 'contractin', 'nuclide', 'value_type', 'activity_o', 'uncertaint', 'unit', 'relative_uncertainty']
tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold].head()


Unnamed: 0,id,contractin,nuclide,value_type,activity_o,uncertaint,unit,relative_uncertainty
1599,55488.0,United Kingdom,3H,0,11.1091,97164.0,Bq/l,437317.154405
2518,37532.0,Germany,99Tc,0,0.00223,0.12,Bq/l,2690.58296
2534,37548.0,Germany,99Tc,0,0.00063,0.07,Bq/l,5555.555556
2535,37549.0,Germany,99Tc,0,0.00092,0.09,Bq/l,4891.304348
2536,37550.0,Germany,99Tc,0,0.00055,0.07,Bq/l,6363.636364


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `BIOTA` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

Include the relative uncertainty for the biota dataset. 

In [None]:
grp='BIOTA'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['UNC'] / tfm.dfs[grp]['VALUE'])
    # Multiply by 100 to convert to percentage
    * 100
)

Return all rows where the relative uncertainty is greater than 100% for the biota dataset.

In [None]:
threshold = 100
cols_to_show=['id', 'contractin', 'nuclide', 'value_type', 'activity_o', 'uncertaint', 'unit', 'relative_uncertainty']
tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold].head()

Unnamed: 0,id,contractin,nuclide,value_type,activity_o,uncertaint,unit,relative_uncertainty
492,35011,Belgium,137Cs,0,0.1619,66.0,Bq/kg f.w.,20382.95244
756,49226,Sweden,137Cs,0,0.327,1.468,Bq/kg f.w.,224.464832
1039,35011,Belgium,137Cs,0,0.1619,66.0,Bq/kg f.w.,20382.95244
1303,49226,Sweden,137Cs,0,0.327,1.468,Bq/kg f.w.,224.464832
1838,49230,Sweden,137Cs,0,0.275,1.982,Bq/kg f.w.,360.363636
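
The same check can be reproduced on a small, made-up table. The sketch below is illustrative only: the helper `relative_uncertainty` and the toy values are assumptions, not part of the handler. It also masks zero activities before dividing, so that a zero value yields `NaN` rather than `inf`.

```python
import pandas as pd

# Toy stand-in for a sanitized dataframe; the numbers are invented for illustration.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'VALUE': [10.0, 0.002, 0.0, 5.0],
    'UNC': [1.0, 0.12, 0.5, float('nan')],
})

def relative_uncertainty(df, value_col='VALUE', unc_col='UNC'):
    """Return the uncertainty as a percentage of the value (hypothetical helper)."""
    # Replace zero values with NaN so the division cannot produce inf.
    value = df[value_col].where(df[value_col] != 0)
    return df[unc_col] / value * 100

toy['relative_uncertainty'] = relative_uncertainty(toy)

# Keep only the rows whose uncertainty exceeds the value itself (> 100 %).
flagged = toy[toy['relative_uncertainty'] > 100]
print(flagged[['id', 'VALUE', 'UNC', 'relative_uncertainty']])
```

Rows with a missing uncertainty or a zero value end up as `NaN` and are therefore never flagged by the `> threshold` comparison.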


## Remap units

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: It would be easier to work with the units if they were standardized. The units are not consistent across the dataset, for instance `BQ/L`, `Bq/l` and `Bq/L` are used interchangeably.

:::


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Unit` column contains `NaN` values for the `SEAWATER` dataset, as shown below.
:::


In [None]:
with pd.option_context('display.max_rows', None):
    display(dfs['SEAWATER'][dfs['SEAWATER']['unit'].isnull()].drop(columns=['measuremen','sample_com','reference']).head())

Unnamed: 0,fid,the_geom,id,contractin,rsc_sub_di,station_id,sample_id,latd,latm,lats,...,value_type,activity_o,uncertaint,unit,data_provi,latdd,longdd,year,f1,reference_
543,ospar_seawater_1995_01_003.544,POINT (52.30138888888889 4.301111111111111),92319.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.8,48,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
544,ospar_seawater_1995_01_003.545,POINT (52.30138888888889 4.301111111111111),92320.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.4,44,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
545,ospar_seawater_1995_01_003.546,POINT (52.30138888888889 4.301111111111111),92321.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.0,4,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
546,ospar_seawater_1995_01_003.547,POINT (52.30138888888889 4.301111111111111),92322.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,3.6,36,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,
547,ospar_seawater_1995_01_003.548,POINT (52.30138888888889 4.301111111111111),92323.0,Netherlands,8,NOORDWK10,,52,18,5.0,...,0,4.3,43,,Rijkswaterstaat Centre for Water Management,52.301389,4.301111,1995.0,,


Let's inspect the unique units used by OSPAR:

In [None]:
get_unique_across_dfs(dfs, col_name='unit', as_df=True)

Unnamed: 0,index,value
0,0,
1,1,Bq/l
2,2,Bq/kg f.w.
3,3,Bq/L
4,4,BQ/L


We define unit renaming rules for the OSPAR dataset:

In [None]:
#| export
# Define unit names renaming rules
renaming_unit_rules = {'Bq/l': 1, #'Bq/m3'
                       'Bq/L': 1,
                       'BQ/L': 1,
                       'Bq/kg f.w.': 5, # Bq/kgw
                       } 

In [None]:
#| export
class RemapUnitCB(Callback):
    """Callback to update DataFrame 'UNIT' columns based on a lookup table."""

    def __init__(self, lut: Dict[str, int]):
        fc.store_attr('lut')  # Store the lookup table as an attribute

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                self._apply_default_units(df, unit='Bq/l')
            self._print_na_units(df)
            self._update_units(df)

    def _apply_default_units(self, df: pd.DataFrame , unit = None):
        df.loc[df['unit'].isnull(), 'unit'] = unit

    def _print_na_units(self, df: pd.DataFrame):
        na_count = df['unit'].isnull().sum()
        if na_count > 0:
            print(f"Number of rows with NaN in 'unit' column: {na_count}")

    def _update_units(self, df: pd.DataFrame):
        df['UNIT'] = df['unit'].apply(lambda x: self.lut.get(x, 'Unknown'))
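
To make the callback's behaviour concrete, the sketch below replays its two steps (default fill, then lookup) on a toy dataframe. The rows, including the unmapped `mBq/l` entry, are invented for illustration; the rules dictionary is repeated so that the snippet runs on its own.

```python
import pandas as pd

# Same content as `renaming_unit_rules`, repeated to keep the snippet self-contained.
unit_rules = {'Bq/l': 1, 'Bq/L': 1, 'BQ/L': 1, 'Bq/kg f.w.': 5}

# Toy seawater rows: one missing unit and one spelling absent from the rules.
seawater = pd.DataFrame({'unit': ['Bq/l', None, 'BQ/L', 'mBq/l']})

# Step 1: missing units default to 'Bq/l' (applied to SEAWATER only in the callback).
seawater.loc[seawater['unit'].isnull(), 'unit'] = 'Bq/l'

# Step 2: look up the MARIS unit id; unmapped spellings fall back to 'Unknown'.
seawater['UNIT'] = seawater['unit'].apply(lambda x: unit_rules.get(x, 'Unknown'))
print(seawater)
```

Any spelling missing from the rules surfaces as `'Unknown'`, which is a convenient value to search for when validating the output.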

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(), # Remove blank value entries (also removes NaN values in Unit column) 
                            RemapUnitCB(renaming_unit_rules),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('Unit unique values:')
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['UNIT'].unique()}")

                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

Unit unique values:
BIOTA: [5]
SEAWATER: [1]


## Remap detection limit

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Value type` column contains many `nan` values and many entries with a value of `0`.
:::

In [None]:
# Count the NaN entries in the 'value_type' column of each dataframe
for grp in ['SEAWATER', 'BIOTA']:
    na_count = dfs[grp]['value_type'].isnull().sum()
    print(f"Number of NaN 'Value type' entries in '{grp}': {na_count}")

# Count the entries whose 'value_type' is the string '0'. An explicit comparison avoids
# `value_counts()[0]`, which relies on positional indexing of a string-labelled index.
for grp in ['SEAWATER', 'BIOTA']:
    zero_count = (dfs[grp]['value_type'] == '0').sum()
    print(f"Number of 'value_type' entries where the value is 0 in '{grp}': {zero_count}")


Number of NaN 'Value type' entries in 'SEAWATER': 54
Number of NaN 'Value type' entries in 'BIOTA': 23
Number of 'value_type' entries where the value is 0 in 'SEAWATER': 14032
Number of 'value_type' entries where the value is 0 in 'BIOTA': 10631


In the OSPAR dataset, a `<` in the `Value type` column flags a detection limit: the `Activity or MDA` column then holds the detection limit rather than a measurement. When `Value type` is `=`, the `Activity or MDA` column holds the measured value.


Let's review the `Value type` column values for the OSPAR dataset:

In [None]:
for grp in dfs.keys():
    print(f'{grp}:')
    print(tfm.dfs[grp]['value_type'].unique())


BIOTA:
['0' '<' nan]
SEAWATER:
['0' '<' '=' nan]


Detection limits are encoded as follows in MARIS:

In [None]:
#| eval: false
pd.read_excel(detection_limit_lut_path())

Unnamed: 0,id,name,name_sanitized
0,-1,Not applicable,Not applicable
1,0,Not Available,Not available
2,1,=,Detected value
3,2,<,Detection limit
4,3,ND,Not detected
5,4,DE,Derived


In [None]:
#| export
lut_dl = lambda: pd.read_excel(detection_limit_lut_path(), usecols=['name','id']).set_index('name').to_dict()['id']

In [None]:
#| export
coi_dl = {'SEAWATER' : {'DL' : 'value_type'},
          'BIOTA':  {'DL' : 'value_type'}
          }

In [None]:
#| export
class RemapDetectionLimitCB(Callback):
    """Remap detection limit values to MARIS format using a lookup table."""

    def __init__(self, coi: dict, fn_lut: Callable):
        """Initialize with column configuration and a function to get the lookup table."""
        self.coi = coi
        self.fn_lut = fn_lut

    def __call__(self, tfm: Transformer):
        """Apply the remapping of detection limits across all dataframes"""
        lut = self.fn_lut()  # Retrieve the lookup table
        for grp, df in tfm.dfs.items():
            df['DL'] = df[self.coi[grp]['DL']]
            self._set_detection_limits(df, lut)

    def _set_detection_limits(self, df: pd.DataFrame, lut: dict):
        """Set detection limits based on value and uncertainty columns using specified conditions."""
        # Condition to set '=' when value and uncertainty are present and the current detection limit is not in the lookup keys
        condition_eq = (df['VALUE'].notna() & df['UNC'].notna() & ~df['DL'].isin(lut.keys()))
        df.loc[condition_eq, 'DL'] = '='

        # Set 'Not Available' for unmatched detection limits
        df.loc[~df['DL'].isin(lut.keys()), 'DL'] = 'Not Available'

        # Map existing detection limits using the lookup table
        df['DL'] = df['DL'].map(lut)

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[SanitizeValueCB(),
                            NormalizeUncCB(),                  
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl)])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['DL'].unique()}")

BIOTA: [1 2 0]
SEAWATER: [1 2 0]
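
The decision logic of `RemapDetectionLimitCB` can be traced on a handful of invented rows. This is a minimal sketch: the lookup dictionary is typed in by hand from the MARIS table shown earlier, and the toy values are assumptions chosen to exercise each branch.

```python
import pandas as pd

# Hand-typed copy of the MARIS detection limit lookup (name -> id).
lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

nan = float('nan')
toy = pd.DataFrame({
    'value_type': ['<', '=', '0', None, '0'],
    'VALUE': [0.5, 1.2, 3.4, 2.0, 7.0],
    'UNC': [nan, 0.1, 0.3, nan, nan],
})

toy['DL'] = toy['value_type']

# Unrecognized flag but both value and uncertainty present: treat as a detected value.
known = toy['DL'].isin(list(lut))
has_value_and_unc = toy['VALUE'].notna() & toy['UNC'].notna()
toy.loc[~known & has_value_and_unc, 'DL'] = '='

# Anything still unrecognized becomes 'Not Available'.
toy.loc[~toy['DL'].isin(list(lut)), 'DL'] = 'Not Available'

# Finally translate the names into MARIS ids.
toy['DL'] = toy['DL'].map(lut)
print(toy)
```

The third row shows the effect of the first rule: a `'0'` flag with both a value and an uncertainty is encoded as `1` (detected value), whereas the same flag without an uncertainty ends up as `0` (not available).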


## Remap Biota species

The OSPAR dataset contains biota species information in the `Species` column of the biota dataframe. To ensure consistency with MARIS standards, we need to remap these species names. We'll use the same approach as the one we employed for standardizing nuclide names:


We first inspect unique `Species` values used by OSPAR:

In [None]:
dfs = wfs_processor()
get_unique_across_dfs(dfs, col_name='species', as_df=True)

Unnamed: 0,index,value
0,0,Hippoglossoides platessoides
1,1,Argentina silus
2,2,ETMOPTERUS SPINAX
3,3,MYTILUS EDULIS
4,4,Squalus acanthias
...,...,...
151,151,BUCCINUM UNDATUM
152,152,EUTRIGLA GURNARDUS
153,153,Trachurus trachurus
154,154,PLUERONECTES PLATESSA


We try to remap the `Species` column to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='species_ospar.pkl')

In this step, we generate a lookup table using the `remapper` object. The lookup table maps data provider entries to MARIS entries using fuzzy matching. We then display only the matches with a score of 1 or higher, i.e. those that required at least one character change and therefore deserve a visual check.

- **`generate_lookup_table(as_df=True)`**: This method generates the lookup table and returns it as a DataFrame. It uses fuzzy matching to align entries from the data provider with those in the MARIS lookup table.
- **`select_match(match_score_threshold=1)`**: This method filters the generated lookup table to include only those matches with a score greater than or equal to the specified threshold. With a threshold of 1, perfect matches are hidden and only the imperfect matches are listed for review.
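
To give an intuition for the match score, here is a minimal, dependency-free sketch. It assumes the score behaves like a Levenshtein edit distance computed on lower-cased names; the actual scoring inside `Remapper` is not shown in this notebook and may differ. The helpers `levenshtein` and `best_match` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(name, candidates):
    """Return (score, candidate) for the closest candidate, ignoring case."""
    return min((levenshtein(name.lower(), c.lower()), c) for c in candidates)

# A few names taken from the tables in this section.
maris_names = ['Pleuronectes platessa', 'Sebastes viviparus', 'Mytilus edulis']
provider_names = ['PLUERONECTES PLATESSA', 'Sebastes vivipares', 'MYTILUS EDULIS']

lookup = {name: best_match(name, maris_names) for name in provider_names}

# Equivalent of a threshold of 1: keep only the imperfect matches for review.
to_review = {name: match for name, match in lookup.items() if match[0] >= 1}
print(to_review)
```

Under this assumption a misspelling such as `PLUERONECTES` costs two edits, while a pure difference in letter case costs nothing.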

In [None]:
remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing: 100%|██████████| 156/156 [00:22<00:00,  7.08it/s]

127 entries matched the criteria, while 29 entries had a match score of 1 or higher.





source_key,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,"Mixture of green, red and brown algae",SOLEA SOLEA (S.VULGARIS),Solea solea (S.vulgaris),Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,RAJIDAE/BATOIDEA,Pleuronectiformes [order],PALMARIA PALMATA,MONODONTA LINEATA,Unknown,unknown,RAJA DIPTURUS BATIS,Flatfish,Sepia spp.,Rhodymenia spp.,FUCUS SPP.,Thunnus sp.,Gadus sp.,Patella sp.,Fucus sp.,Tapes sp.,RHODYMENIA spp,FUCUS spp,PLUERONECTES PLATESSA,Gaidropsarus argenteus,Sebastes vivipares,ASCOPHYLLUN NODOSUM
matched_maris_name,Lomentaria catenata,Mercenaria mercenaria,Loligo vulgaris,Loligo vulgaris,Cerastoderma edule,Cerastoderma edule,Dicentrarchus labrax,Batoidea,Pleuronectiformes,Alaria marginata,Monodonta labio,Undaria,Undaria,Dipturus batis,Lambia,Sepia,Rhodymenia,Fucus,Thunnus,Penaeus sp.,Patella,Fucus,Tapes,Rhodymenia,Fucus,Pleuronectes platessa,Gaidropsarus argentatus,Sebastes viviparus,Ascophyllum nodosum
source_name,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,"Mixture of green, red and brown algae",SOLEA SOLEA (S.VULGARIS),Solea solea (S.vulgaris),Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,RAJIDAE/BATOIDEA,Pleuronectiformes [order],PALMARIA PALMATA,MONODONTA LINEATA,Unknown,unknown,RAJA DIPTURUS BATIS,Flatfish,Sepia spp.,Rhodymenia spp.,FUCUS SPP.,Thunnus sp.,Gadus sp.,Patella sp.,Fucus sp.,Tapes sp.,RHODYMENIA spp,FUCUS spp,PLUERONECTES PLATESSA,Gaidropsarus argenteus,Sebastes vivipares,ASCOPHYLLUN NODOSUM
match_score,31,26,12,12,10,10,9,8,8,7,6,5,5,5,5,5,5,5,4,4,4,4,4,4,4,2,2,1,1


Below, we fix the entries that are not properly matched by the `Remapper` object:

In [None]:
#| export
fixes_biota_species = {
    'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': NA,  # Mix of species, no direct mapping
    'Mixture of green, red and brown algae': NA,  # Mix of species, no direct mapping
    'Solea solea (S.vulgaris)': 'Solea solea',
    'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
    'RAJIDAE/BATOIDEA': NA, #Mix of species, no direct mapping
    'PALMARIA PALMATA': NA,  # Not defined
    'Unknown': NA,
    'unknown': NA,
    'Flatfish': NA,
    'Gadus sp.': NA,  # Not defined
}
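
Conceptually, a manual fix takes precedence over whatever the fuzzy matcher proposed for the same key. The sketch below illustrates that precedence with plain dictionaries; it is not how `Remapper` is implemented, and the `NA` value used here is a stand-in for the constant imported by the handler.

```python
NA = '(Not available)'  # stand-in value, assumed for illustration only

# Fuzzy proposals for three provider names, copied from the table above.
fuzzy = {
    'Flatfish': 'Lambia',
    'Solea solea (S.vulgaris)': 'Loligo vulgaris',
    'Sepia spp.': 'Sepia',
}

# Manual corrections for the proposals judged wrong.
fixes = {
    'Flatfish': NA,
    'Solea solea (S.vulgaris)': 'Solea solea',
}

# Later keys win when dictionaries are merged, so the fixes override the proposals.
corrected = {**fuzzy, **fixes}
print(corrected)
```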

We now attempt remapping again, incorporating the `fixes_biota_species` dictionary:

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_species)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/156 [00:00<?, ?it/s]

Processing: 100%|██████████| 156/156 [00:21<00:00,  7.17it/s]

137 entries matched the criteria, while 19 entries had a match score of 1 or higher.





source_key,Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,Pleuronectiformes [order],MONODONTA LINEATA,Rhodymenia spp.,Sepia spp.,FUCUS SPP.,RAJA DIPTURUS BATIS,Tapes sp.,Thunnus sp.,Fucus sp.,Patella sp.,FUCUS spp,RHODYMENIA spp,Gaidropsarus argenteus,PLUERONECTES PLATESSA,Sebastes vivipares,ASCOPHYLLUN NODOSUM
matched_maris_name,Cerastoderma edule,Cerastoderma edule,Dicentrarchus labrax,Pleuronectiformes,Monodonta labio,Rhodymenia,Sepia,Fucus,Dipturus batis,Tapes,Thunnus,Fucus,Patella,Fucus,Rhodymenia,Gaidropsarus argentatus,Pleuronectes platessa,Sebastes viviparus,Ascophyllum nodosum
source_name,Cerastoderma (Cardium) Edule,CERASTODERMA (CARDIUM) EDULE,DICENTRARCHUS (MORONE) LABRAX,Pleuronectiformes [order],MONODONTA LINEATA,Rhodymenia spp.,Sepia spp.,FUCUS SPP.,RAJA DIPTURUS BATIS,Tapes sp.,Thunnus sp.,Fucus sp.,Patella sp.,FUCUS spp,RHODYMENIA spp,Gaidropsarus argenteus,PLUERONECTES PLATESSA,Sebastes vivipares,ASCOPHYLLUN NODOSUM
match_score,10,10,9,8,6,5,5,5,5,4,4,4,4,4,4,2,2,1,1


On visual inspection, the remaining imperfect matches appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='species', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='species_ospar.pkl').generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of a `SPECIES` column to our `BIOTA` dataframe, containing standardized species IDs.


In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([  96,  392,   50,   99,  192,  244,  378,  139,  379,    0,  413,
        129,  274,  391,  394,  396,  417,  397,  270,  401,  380,  412,
        410,  272,  414,  395,  243,  418,  411,  407,  402,  191,  426,
        393,  429,  430,  384,  381,  403,  399,  398,  408,  389,  386,
        404,  405,  385,  415,  416,  400,  406,  427,  377,  382,  383,
        387,  388,  390, 1684,  425,  428,  419, 1609,  420,  421,  422,
        423,  424,  431,  294,  440,  432,  433,  434,  435,  436,  437,
        438,  439,  441,  442, 1605,  443,  444, 1610, 1608,   23, 1606,
        234,  556, 1701, 1752])

## Enhance Species Data Using Biological group column
The `Biological group` column in the OSPAR dataset provides valuable insights related to species. We will leverage this information to enrich the `species` column. To achieve this, we will employ the generic `RemapCB` callback to create an `enhanced_species` column. Subsequently, this `enhanced_species` column will be used to further enrich the `species` column.

First we inspect the unique values in the `Biological group` column.

In [None]:
get_unique_across_dfs(dfs, col_name='biological', as_df=True)

Unnamed: 0,index,value
0,0,Molluscs
1,1,fish
2,2,SEAWEED
3,3,Fish
4,4,Seaweed
5,5,seaweed
6,6,molluscs
7,7,MOLLUSCS
8,8,FISH


We will remap the `Biological group` column's data to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='enhance_species_ospar.pkl')

As before, we generate the lookup table and display only the matches with a score of 1 or higher, i.e. those that required at least one character change.

In [None]:
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  7.64it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
fish,Fucus,fish,4
Fish,Fucus,Fish,4
FISH,Fucus,FISH,4
Molluscs,Mollusca,Molluscs,1
molluscs,Mollusca,molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1


We can see that some of the entries require manual corrections.

In [None]:
#| export
fixes_enhanced_biota_species = {
    'fish': 'Pisces',
    'FISH': 'Pisces',
    'Fish': 'Pisces'    
}
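
Because the three problematic entries differ only by letter case, the same dictionary could also be derived programmatically. This is an optional sketch, not part of the handler; the list of values is typed in from the unique `biological` values shown earlier.

```python
# Unique `biological` values reported earlier in this section.
biological_values = ['Molluscs', 'fish', 'SEAWEED', 'Fish', 'Seaweed',
                     'seaweed', 'molluscs', 'MOLLUSCS', 'FISH']

# Map every case variant of 'fish' to the MARIS entry 'Pisces'.
fixes_from_values = {
    value: 'Pisces'
    for value in biological_values
    if value.casefold() == 'fish'
}
print(fixes_from_values)
```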


Now we will apply the manual corrections to the lookup table and generate the lookup table again.

In [None]:
remapper.generate_lookup_table(fixes=fixes_enhanced_biota_species)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  6.45it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Molluscs,Mollusca,Molluscs,1
molluscs,Mollusca,molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1


On visual inspection, the remaining imperfect matches appear acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| export
lut_biota_enhanced = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='biological', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='enhance_species_ospar.pkl').generate_lookup_table(fixes=fixes_enhanced_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of an `enhanced_species` column to our `BIOTA` dataframe, containing standardized species IDs.

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA')    
    ])

tfm()['BIOTA']['enhanced_species'].unique()

array([1059,  712,  873])

Now that we have the `enhanced_species` column, we can use it to enrich the `SPECIES` column. Where no species match was found (`SPECIES` is `-1` or `0`) and `enhanced_species` is not null, we fall back to the `enhanced_species` value.

In [None]:
#| export
class EnhanceSpeciesCB(Callback):
    """Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met."""

    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        self._enhance_species(tfm.dfs['BIOTA'])

    def _enhance_species(self, df: pd.DataFrame):
        df['SPECIES'] = df.apply(
            lambda row: row['enhanced_species'] if row['SPECIES'] in [-1, 0] and pd.notnull(row['enhanced_species']) else row['SPECIES'],
            axis=1
        )
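
The fallback rule can also be written with a boolean mask instead of a row-wise `apply`. The sketch below is an alternative formulation on invented rows, not the exported implementation; it assumes that every remaining `SPECIES` value is non-null, so the final cast to integer is safe.

```python
import pandas as pd

# Toy BIOTA rows: ids are illustrative, None marks a missing enhanced value.
biota = pd.DataFrame({
    'SPECIES': [96, 0, -1, 0],
    'enhanced_species': [1059, 1059, 712, None],
})

# Rows with no species match (-1 or 0) that do have a usable fallback.
needs_fallback = biota['SPECIES'].isin([-1, 0]) & biota['enhanced_species'].notna()

# Replace only those rows, then restore an integer dtype.
biota['SPECIES'] = (
    biota['SPECIES']
    .mask(needs_fallback, biota['enhanced_species'])
    .astype('int64')
)
print(biota)
```

The last row keeps its `0` because no fallback is available for it.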

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),
    EnhanceSpeciesCB()
    ])

tfm()['BIOTA']['SPECIES'].unique()

array([  96,  392,   50,   99,  192,  244,  378,  139,  379, 1059,  413,
        129,  274,  391,  394,  396,  417,  397,  270,  401,  712,  380,
        412,  410,  272,  414,  395,  243,  418,  411,  407,  402,  191,
        426,  393,  429,  430,  384,  381,  403,  399,  398,  408,  389,
        386,  404,  405,  385,  415,  416,  400,  406,  427,  377,  382,
        383,  387,  388,  390, 1684,  425,  428,  419, 1609,  420,  421,
        422,  423,  424,  431,  294,  440,  432,  433,  434,  435,  436,
        437,  438,  439,  441,  442, 1605,  443,  444, 1610, 1608,   23,
       1606,  234,  556, 1701, 1752])

All entries are matched for the `SPECIES` column.

## Remap Biota tissues

The OSPAR dataset includes entries where the `Body Part` is labeled as `whole`. However, the MARIS data standard requires a more specific distinction in the `body_part` field, differentiating between `Whole animal` and `Whole plant`. Fortunately, the OSPAR data provides a `Biological group` field that allows us to make this distinction.

To address this discrepancy and ensure compatibility with MARIS standards, we will:

1. Create a temporary column `body_part_temp` that combines information from both `Body Part` and `Biological group`.
2. Use this temporary column to perform the lookup using our `Remapper` object.

Let's create the temporary column, `body_part_temp`, that combines `Body Part` and `Biological group`.

In [None]:
dfs['BIOTA'].columns

Index(['fid', 'the_geom', 'id', 'contractin', 'rsc_sub_di', 'station_id',
       'sample_id', 'latd', 'latm', 'lats', 'latdir', 'longd', 'longm',
       'longs', 'longdir', 'sample_typ', 'biological', 'species', 'body_part',
       'sampling_d', 'nuclide', 'value_type', 'activity_o', 'uncertaint',
       'unit', 'data_provi', 'measuremen', 'sample_com', 'reference', 'latdd',
       'longdd', 'year'],
      dtype='object')

In [None]:
#| export
class AddBodypartTempCB(Callback):
    "Add a temporary column with the body part and biological group combined."    
    def __call__(self, tfm):
        tfm.dfs['BIOTA']['body_part_temp'] = (
            tfm.dfs['BIOTA']['body_part'] + ' ' + 
            tfm.dfs['BIOTA']['biological']
            ).str.strip().str.lower()                                 

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            ])
dfs_test = tfm()
dfs_test['BIOTA']['body_part_temp'].unique()


array(['whole plant seaweed', 'flesh without bones fish',
       'soft parts molluscs', 'growing tips seaweed', 'flesh fish',
       'liver fish', 'whole animal molluscs', 'whole fish fish',
       'soft parts fish', 'muscle fish', 'flesh with scales fish',
       'whole animal fish', 'flesh without bone fish', 'head fish',
       'unknown fish', 'flesh without bones seaweed', 'whole fish',
       'flesh without bones molluscs', 'whole seaweed',
       'whole without head fish',
       'mix of muscle and whole fish without liver fish',
       'whole fisk fish', 'cod medallion fish'], dtype=object)
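
One detail of the concatenation is worth keeping in mind: if either column is missing for a row, the combined key is missing as well. The toy example below, with invented rows, illustrates this behaviour of string addition in pandas.

```python
import pandas as pd

# Toy BIOTA rows; the last one has no body part recorded.
biota = pd.DataFrame({
    'body_part': ['Whole', 'FLESH WITHOUT BONES', 'Soft parts', None],
    'biological': ['Seaweed', 'Fish', 'molluscs', 'Fish'],
})

# Same expression as in `AddBodypartTempCB`.
biota['body_part_temp'] = (
    biota['body_part'] + ' ' + biota['biological']
).str.strip().str.lower()

print(biota['body_part_temp'].tolist())
```

Rows without a combined key cannot be matched against the lookup table, so they are candidates for the "not available" category.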

To align the ``body_part_temp`` column with the ``bodypar`` column in the MARIS nomenclature, we utilize a Remapper object. Since the OSPAR dataset does not include a predefined lookup table for the ``body_part`` column, we first create a lookup table by extracting unique values from the ``body_part_temp`` column.

In [None]:
get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True).head()

Unnamed: 0,index,value
0,0,unknown fish
1,1,whole fisk fish
2,2,whole seaweed
3,3,soft parts fish
4,4,head fish


We try to remap the `body_part_temp` column to the `bodypar` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='tissues_ospar.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=0, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 98.80it/s]

0 entries matched the criteria, while 23 entries had a match score of 0 or higher.





source_key,mix of muscle and whole fish without liver fish,whole without head fish,cod medallion fish,unknown fish,whole fisk fish,whole fish fish,flesh without bones molluscs,whole animal molluscs,soft parts molluscs,growing tips seaweed,flesh without bones seaweed,whole plant seaweed,flesh fish,whole seaweed,flesh without bones fish,liver fish,whole animal fish,flesh with scales fish,head fish,soft parts fish,muscle fish,whole fish,flesh without bone fish
matched_maris_name,Flesh without bones,Flesh without bones,Old leaf,Growing tips,Whole animal,Whole animal,Flesh without bones,Whole animal,Soft parts,Growing tips,Flesh without bones,Whole plant,Shells,Whole plant,Flesh without bones,Liver,Whole animal,Flesh with scales,Head,Soft parts,Muscle,Whole animal,Flesh without bones
source_name,mix of muscle and whole fish without liver fish,whole without head fish,cod medallion fish,unknown fish,whole fisk fish,whole fish fish,flesh without bones molluscs,whole animal molluscs,soft parts molluscs,growing tips seaweed,flesh without bones seaweed,whole plant seaweed,flesh fish,whole seaweed,flesh without bones fish,liver fish,whole animal fish,flesh with scales fish,head fish,soft parts fish,muscle fish,whole fish,flesh without bone fish
match_score,31,13,13,9,9,9,9,9,9,8,8,8,7,6,5,5,5,5,5,5,5,5,4


Many of the lookup entries are sufficient for our needs. However, for values that don't find a match, we can use the `fixes_biota_tissues` dictionary to apply manual corrections. First we will create the dictionary.

In [None]:
#| export
fixes_biota_tissues = {
    'whole seaweed' : 'Whole plant',
    'flesh fish': 'Flesh with bones', # Assumed, since the category 'Flesh with bones' also exists
    'unknown fish' : NA,
    'cod medallion fish' : NA, # TO BE DETERMINED
    'mix of muscle and whole fish without liver fish' : NA, # TO BE DETERMINED
    'whole without head fish' : NA, # TO BE DETERMINED
    'flesh without bones seaweed' : NA, # TO BE DETERMINED
    'tail and claws fish' : NA # TO BE DETERMINED
}

Now we will generate the lookup table and apply the manual corrections of the `fixes_biota_tissues` dictionary.


In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_tissues)
with pd.option_context('display.max_columns', None):
    display(remapper.select_match(match_score_threshold=1, verbose=True).T)

Processing:   0%|          | 0/23 [00:00<?, ?it/s]

Processing: 100%|██████████| 23/23 [00:00<00:00, 94.86it/s]

2 entries matched the criteria, while 21 entries had a match score of 1 or higher.





source_key,flesh without bones molluscs,whole fisk fish,whole animal molluscs,soft parts molluscs,whole fish fish,growing tips seaweed,whole plant seaweed,flesh with scales fish,muscle fish,flesh without bones fish,whole fish,whole animal fish,head fish,soft parts fish,liver fish,flesh without bone fish,mix of muscle and whole fish without liver fish,unknown fish,cod medallion fish,whole without head fish,flesh without bones seaweed
matched_maris_name,Flesh without bones,Whole animal,Whole animal,Soft parts,Whole animal,Growing tips,Whole plant,Flesh with scales,Muscle,Flesh without bones,Whole animal,Whole animal,Head,Soft parts,Liver,Flesh without bones,(Not available),(Not available),(Not available),(Not available),(Not available)
source_name,flesh without bones molluscs,whole fisk fish,whole animal molluscs,soft parts molluscs,whole fish fish,growing tips seaweed,whole plant seaweed,flesh with scales fish,muscle fish,flesh without bones fish,whole fish,whole animal fish,head fish,soft parts fish,liver fish,flesh without bone fish,mix of muscle and whole fish without liver fish,unknown fish,cod medallion fish,whole without head fish,flesh without bones seaweed
match_score,9,9,9,9,9,8,8,5,5,5,5,5,5,5,5,4,2,2,2,2,2


At this stage, the majority of entries have been successfully matched to MARIS nomenclature. Entries that remain unmatched are marked as not available. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.

In [None]:
#| export
lut_bodyparts = lambda: Remapper(provider_lut_df=get_unique_across_dfs(tfm.dfs, col_name='body_part_temp', as_df=True),
                               maris_lut_fn=bodyparts_lut_path,
                               maris_col_id='bodypar_id',
                               maris_col_name='bodypar',
                               provider_col_to_match='value',
                               provider_col_key='value',
                               fname_cache='tissues_ospar.pkl'
                               ).generate_lookup_table(fixes=fixes_biota_tissues, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of a `BODY_PART` column to our `BIOTA` dataframe, containing standardized body part IDs.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[  
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA')
                            ])
tfm()
tfm.dfs['BIOTA']['BODY_PART'].unique()

array([40, 52, 19, 56,  4, 25,  1, 34, 60, 13,  0])

## Remap biogroup

The MARIS species lookup table includes a ``biogroup_id`` column that associates each species with its corresponding ``biogroup``. We will leverage this relationship to populate a ``BIO_GROUP`` column in the biota DataFrame.

In [None]:
#| export
lut_biogroup_from_biota = lambda: get_lut(src_dir=species_lut_path().parent, fname=species_lut_path().name, 
                               key='species_id', value='biogroup_id')

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[ 
    RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),
    EnhanceSpeciesCB(),
    RemapCB(fn_lut=lut_biogroup_from_biota, col_remap='BIO_GROUP', col_src='SPECIES', dest_grps='BIOTA')
    ])

print(tfm()['BIOTA']['BIO_GROUP'].unique())


[11  4 13 14 12  2  5]
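
The idea behind a key/value lookup such as `species_id` to `biogroup_id` can be sketched with a small toy table. This is only an illustration of the mapping step, not the `get_lut` implementation, and the IDs in the toy table are invented.

```python
import pandas as pd

# Toy stand-in for the species lookup table; the IDs are illustrative only.
species_lut = pd.DataFrame({'species_id': [10, 11, 12],
                            'biogroup_id': [4, 4, 11]})

# Build a plain dict keyed by species ID.
lut = dict(zip(species_lut['species_id'], species_lut['biogroup_id']))

biota = pd.DataFrame({'SPECIES': [12, 10, 11]})
biota['BIO_GROUP'] = biota['SPECIES'].map(lut)  # look up each species' biogroup
print(biota)
```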


## Add Laboratory ID (REVIEW)

See helcom.ipynb for details regarding the review of the laboratory ID column.

Let's review the columns available in the OSPAR `BIOTA` dataset, then the unique combinations of data provider and contracting party:

In [None]:
tfm.dfs['BIOTA'].columns

Index(['fid', 'the_geom', 'id', 'contractin', 'rsc_sub_di', 'station_id',
       'sample_id', 'latd', 'latm', 'lats', 'latdir', 'longd', 'longm',
       'longs', 'longdir', 'sample_typ', 'biological', 'species', 'body_part',
       'sampling_d', 'nuclide', 'value_type', 'activity_o', 'uncertaint',
       'unit', 'data_provi', 'measuremen', 'sample_com', 'reference', 'latdd',
       'longdd', 'year', 'SPECIES', 'enhanced_species', 'BIO_GROUP'],
      dtype='object')

In [None]:
tfm.dfs['BIOTA'][['data_provi','contractin']].drop_duplicates().head(5)

Unnamed: 0,data_provi,contractin
0,Risø-DTU,Denmark
13,Johann Heinrich von Thünen Institute (vTI),Germany
84,IFE/NRPA,Norway
87,IRSN : LERFA,France
123,IRSN : OPRI,France


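The `data_provi` column identifies the data provider, so a `LAB` column could be derived from it with a little additional work. It is not included at this stage.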
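
As one possible starting point, the unique provider/country pairs can be turned into a small table with a provisional identifier. The `provisional_lab_id` column below is a hypothetical placeholder and does not correspond to MARIS laboratory IDs, which would still need a proper lookup table.

```python
import pandas as pd

# A few provider/country pairs taken from the output above.
df = pd.DataFrame({
    'data_provi': ['IFE/NRPA', 'IRSN : LERFA', 'IRSN : OPRI', 'IFE/NRPA'],
    'contractin': ['Norway', 'France', 'France', 'Norway'],
})

# One row per unique (provider, country) pair, numbered from 1.
labs = (df[['data_provi', 'contractin']]
        .drop_duplicates()
        .reset_index(drop=True))
labs['provisional_lab_id'] = labs.index + 1

# Attach the provisional ID back onto every measurement row.
df_with_lab = df.merge(labs, on=['data_provi', 'contractin'], how='left')
print(df_with_lab)
```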

## Add Sample ID (REVIEW)

See helcom.ipynb for details regarding the review of the sample ID (i.e. `SMP_ID`) column.

The OSPAR dataset includes an `id` column, which we will use to create the `SMP_ID` column.

In [None]:
#| export
class AddSampleIdCB(Callback):
    "Create a SMP_ID column from the ID column"
    def __call__(self, tfm):
        for df in tfm.dfs.values():
            if 'id' in df.columns:
                df['SMP_ID'] = df['id']

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            AddSampleIdCB(),
                            CompareDfsAndTfmCB(dfs)

                            ])
tfm()
for grp in ['BIOTA', 'SEAWATER']:
    print(f"{grp}: {tfm.dfs[grp]['SMP_ID'].unique()}")

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
    

BIOTA: [    1     2     3 ... 98060 98061 98062]
SEAWATER: [     1      2      3 ... 120366 120367 120368]
                           SEAWATER  BIOTA
Number of rows in dfs         19193  15951
Number of rows in tfm.dfs     19193  15951
Number of rows removed            0      0 



In [None]:
dfs['SEAWATER']['id']

0        45552.0
1        45553.0
2        45554.0
3        45555.0
4        45556.0
          ...   
19014        NaN
19015        NaN
19016        NaN
19017        NaN
19018        NaN
Name: id, Length: 19019, dtype: float64
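
The output above shows that the tail of the `SEAWATER` `id` column is `NaN` and that the column is stored as `float64`. Copying it as-is carries those gaps into `SMP_ID`, which is consistent with the placeholder-like `smp_id` values seen later in the decoded NetCDF. A quick way to count the gaps and keep integer IDs is sketched below. Using pandas' nullable `Int64` dtype is a suggestion only, and whether the NetCDF encoder accepts it has not been verified here.

```python
import numpy as np
import pandas as pd

# Small stand-in for dfs['SEAWATER']['id'], including a missing value.
ids = pd.Series([45552.0, 45553.0, np.nan], name='id')

n_missing = int(ids.isna().sum())  # how many samples lack an ID
smp_id = ids.astype('Int64')       # nullable integers: NaN becomes <NA>

print(n_missing)
print(smp_id)
```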

## Add depth

The OSPAR `SEAWATER` dataset includes a sampling depth column (`sampling_d`). In this section, we create a callback that copies it, as floats, into the MARIS `SMP_DEPTH` column.

In [None]:
#| export
class AddDepthCB(Callback):
    "Ensure depth values are floats and add 'SMP_DEPTH' columns."
    def __call__(self, tfm: Transformer):
        for grp, df in tfm.dfs.items():
            if grp == 'SEAWATER':
                if 'sampling_d' in df.columns:
                    df['SMP_DEPTH'] = df['sampling_d'].astype(float)

In [None]:
#| eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
    AddDepthCB()
    ])
tfm()
for grp in tfm.dfs.keys():  
    if 'SMP_DEPTH' in tfm.dfs[grp].columns:
        print(f'{grp}:', tfm.dfs[grp][['SMP_DEPTH']].drop_duplicates())

SEAWATER:        SMP_DEPTH
0            2.0
1           24.0
3           25.0
5           32.0
7           36.0
...          ...
18500       89.0
18748        0.5
18751     1665.0
18759      276.4
18760      372.6

[130 rows x 1 columns]
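
`astype(float)` works here because every depth value is numeric, but it raises as soon as a non-numeric string appears. If that ever became a concern (an assumption; the output above shows only numeric depths), `pd.to_numeric` with `errors='coerce'` is a more forgiving alternative, as this small sketch shows:

```python
import pandas as pd

# Hypothetical raw depth values, including entries that are not numbers.
raw = pd.Series(['2.0', '24', 'surface', None])

# Unparseable entries become NaN instead of raising an exception.
depth = pd.to_numeric(raw, errors='coerce')
print(depth)
```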


## Standardize Coordinates

The OSPAR dataset offers coordinates in degrees, minutes, and seconds (DMS). The following callback is designed to convert DMS to decimal degrees. 

In [None]:
#| export
class ConvertLonLatCB(Callback):
    """Convert Coordinates to decimal degrees (DDD.DDDDD°)."""
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            df['LAT'] = self._convert_latitude(df)
            df['LON'] = self._convert_longitude(df)

    def _convert_latitude(self, df: pd.DataFrame) -> np.ndarray:
        return np.where(
            df['latdir'].isin(['S']),
            self._dms_to_decimal(df['latd'], df['latm'], df['lats']) * -1,
            self._dms_to_decimal(df['latd'], df['latm'], df['lats'])
        )

    def _convert_longitude(self, df: pd.DataFrame) -> np.ndarray:
        return np.where(
            df['longdir'].isin(['W']),
            self._dms_to_decimal(df['longd'], df['longm'], df['longs']) * -1,
            self._dms_to_decimal(df['longd'], df['longm'], df['longs'])
        )

    def _dms_to_decimal(self, degrees: pd.Series, minutes: pd.Series, seconds: pd.Series) -> pd.Series:
        return degrees + minutes / 60 + seconds / 3600
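
As a worked example of the conversion, the scalar sketch below applies the same formula (degrees + minutes/60 + seconds/3600, negated for `S` and `W`) to one coordinate pair from the `SEAWATER` data: 54°54'58.8"N, 0°16'48.6"W. The standalone `dms_to_decimal` function is written for illustration and is not part of the handler.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, direction: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative.
    return -value if direction in ('S', 'W') else value

lat = dms_to_decimal(54, 54, 58.8, 'N')
lon = dms_to_decimal(0, 16, 48.6, 'W')
print(round(lat, 6), round(lon, 6))
```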


In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB()
                            ])
tfm()
tfm.dfs['SEAWATER'][['LAT','latd', 'latm', 'lats', 'LON', 'latdir', 'longd', 'longm','longs', 'longdir']]

Unnamed: 0,LAT,latd,latm,lats,LON,latdir,longd,longm,longs,longdir
0,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
1,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
2,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
3,56.166667,56,10,0.0,11.783333,N,11,47,0.0,E
4,56.116667,56,7,0.0,11.166667,N,11,10,0.0,E
...,...,...,...,...,...,...,...,...,...,...
19014,54.916333,54,54,58.8,-0.280167,N,0,16,48.6,W
19015,53.912500,53,54,45.0,0.918167,N,0,55,5.4,E
19016,53.930667,53,55,50.4,1.275333,N,1,16,31.2,E
19017,54.508833,54,30,31.8,2.716500,N,2,42,59.4,E


`SanitizeLonLatCB` drops rows where both longitude and latitude equal 0 or where the coordinates are unrealistic. It also converts a `,` decimal separator to `.` in longitude and latitude values.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

print(tfm.dfs['BIOTA'][['LAT','LON']])


                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 

             LAT        LON
0      55.966667  11.583333
1      55.966667  11.583333
2      55.966667  11.583333
3      55.966667  11.583333
4      55.966667  11.583333
...          ...        ...
15257  58.452500  -5.041667
15258  58.452500  -5.041667
15259  54.872778  -3.594444
15260  54.872778  -3.594444
15261  54.872778  -3.594444

[15262 rows x 2 columns]
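
No rows are removed for OSPAR, but the kind of filtering a coordinate sanitizer performs can be sketched as follows. This is not the `SanitizeLonLatCB` implementation from `marisco`. The function name and the ±90/±180 limits are assumptions chosen for illustration.

```python
import pandas as pd

def sanitize_lon_lat(df: pd.DataFrame, lon_col: str = 'LON', lat_col: str = 'LAT') -> pd.DataFrame:
    """Drop rows at (0, 0) or outside the valid longitude/latitude ranges."""
    both_zero = (df[lon_col] == 0) & (df[lat_col] == 0)
    out_of_range = (df[lon_col].abs() > 180) | (df[lat_col].abs() > 90)
    return df[~(both_zero | out_of_range)].reset_index(drop=True)

df = pd.DataFrame({'LAT': [55.966667, 0.0, 95.0],
                   'LON': [11.583333, 0.0, 10.0]})
clean = sanitize_lon_lat(df)
print(clean)
```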


## Review all callbacks

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}
['0' '<' nan]
{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}
['0' '<' '=' nan]
                           BIOTA  SEAWATER
Number of rows in dfs      15262     19019
Number of rows in tfm.dfs  15262     19019
Number of rows removed         0         0 



### Example change logs

Review the change logs for the NetCDF encoding.

In [None]:
#|eval: false
dfs = wfs_processor()
tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs

["Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column.",
 'Remap data provider nuclide names to standardized MARIS nuclide names.',
 'Parse the time format in the dataframe and check for inconsistencies.',
 'Encode time as seconds since epoch.',
 'Sanitize value by removing blank entries and populating `value` column.',
 'Normalize uncertainty values in DataFrames.',
 "Callback to update DataFrame 'UNIT' columns based on a lookup table.",
 'Remap detection limit values to MARIS format using a lookup table.',
 "Remap values from 'species' to 'SPECIES' for groups: BIOTA.",
 "Remap values from 'biological' to 'enhanced_species' for groups: BIOTA.",
 "Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met.",
 'Add a temporary column with the body part and biological group combined.',
 "Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA.",
 'Create a SMP_ID column from the ID column',
 "Ensure depth value

## Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


In [None]:
#| export
def get_attrs(
    tfm: Transformer, # Transformer object
    zotero_key: str, # Zotero dataset record key
    kw: list = kw # List of keywords
    ) -> dict: # Global attributes
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()
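
Each callback passed to `GlobAttrsFeeder` contributes one or more key/value pairs to the global attributes. As a rough illustration of what a bounding-box step might compute, the sketch below derives the four `geospatial_*` extents from the `LAT` and `LON` columns across all groups. The `bbox_attrs` helper is hypothetical and is not the `BboxCB` implementation.

```python
import pandas as pd

def bbox_attrs(dfs: dict) -> dict:
    """Compute min/max latitude and longitude across all groups, as strings."""
    coords = pd.concat([df[['LAT', 'LON']] for df in dfs.values()], ignore_index=True)
    return {
        'geospatial_lat_min': str(float(coords['LAT'].min())),
        'geospatial_lat_max': str(float(coords['LAT'].max())),
        'geospatial_lon_min': str(float(coords['LON'].min())),
        'geospatial_lon_max': str(float(coords['LON'].max())),
    }

# Toy data standing in for tfm.dfs.
dfs_demo = {
    'BIOTA': pd.DataFrame({'LAT': [55.5, 58.25], 'LON': [11.5, -5.0]}),
    'SEAWATER': pd.DataFrame({'LAT': [54.75], 'LON': [-0.25]}),
}
attrs = bbox_attrs(dfs_demo)
print(attrs)
```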

In [None]:
#|eval: false
get_attrs(tfm, zotero_key=zotero_key, kw=kw)

{'geospatial_lat_min': '49.43222222222222',
 'geospatial_lat_max': '81.26805555555555',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))',
 'geospatial_vertical_max': '1850.0',
 'geospatial_vertical_min': '0.0',
 'time_coverage_start': '1995-01-01T00:00:00',
 'time_coverage_end': '2022-12-25T00:00:00',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Che

### Encoding NetCDF

In [None]:
#| export
def encode(
    fname_out_nc: str, # Output file name
    **kwargs # Additional arguments
    ) -> None:
    "Encode data to NetCDF."
    dfs = wfs_processor()
    tfm = Transformer(dfs, cbs=[
                            LowerStripNameCB(col_src='nuclide', col_dst='nuclide'),
                            RemapNuclideNameCB(lut_nuclides, col_name='nuclide'),
                            ParseTimeCB(),
                            EncodeTimeCB(),
                            SanitizeValueCB(),
                            NormalizeUncCB(),
                            RemapUnitCB(renaming_unit_rules),
                            RemapDetectionLimitCB(coi_dl, lut_dl),
                            RemapCB(fn_lut=lut_biota, col_remap='SPECIES', col_src='species', dest_grps='BIOTA'),    
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='biological', dest_grps='BIOTA'),    
                            EnhanceSpeciesCB(),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='BODY_PART', col_src='body_part_temp' , dest_grps='BIOTA'),
                            AddSampleIdCB(),
                            AddDepthCB(),    
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                           )
    encoder.encode()

In [None]:
#|eval: false
encode(fname_out_nc, verbose=True)

--------------------------------------------------------------------------------
Creating enums for the following columns:
['UNIT', 'SPECIES', 'BODY_PART', 'DL', 'NUCLIDE']
Creating enum for unit_t with values {'Not applicable': -1, 'NOT AVAILABLE': 0, 'Bq per m3': 1, 'Bq per m2': 2, 'Bq per kg': 3, 'Bq per kgd': 4, 'Bq per kgw': 5, 'kg per kg': 6, 'TU': 7, 'DELTA per mill': 8, 'atom per kg': 9, 'atom per kgd': 10, 'atom per kgw': 11, 'atom per l': 12, 'Bq per kgC': 13}.
Creating enum for species_t with values {'NOT AVAILABLE': 0, 'Aristeus antennatus': 1, 'Apostichopus': 2, 'Saccharina japonica var religiosa': 3, 'Siganus fuscescens': 4, 'Alpheus dentipes': 5, 'Hexagrammos agrammus': 6, 'Ditrema temminckii': 7, 'Parapristipoma trilineatum': 8, 'Scombrops boops': 9, 'Pseudopleuronectes schrenki': 10, 'Desmarestia ligulata': 11, 'Saccharina japonica': 12, 'Neodilsea yendoana': 13, 'Costaria costata': 14, 'Sargassum yezoense': 15, 'Acanthephyra pelagica': 16, 'Sargassum ringgoldianum': 1

## NetCDF Review

First, let's review the general properties of the NetCDF file:

In [None]:
#| eval: false
properties=get_netcdf_properties(fname_out_nc)
for key, val in properties.items():
    if isinstance(val, dict):
        print(f"{key}:")
        for sub_key, sub_val in val.items():
            print(f"  {sub_key}: {sub_val}")
    else:
        print(f"{key}: {val}")

file_size_bytes: 601164
file_format: NETCDF4
groups: ['biota', 'seawater']
global_attributes:
  id: TBD
  title: OSPAR Environmental Monitoring of Radioactive Substances
  summary: 
  keywords: oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes, Earth Science > Oceans > Water Quality > Ocean Contaminants, Earth Science > Biological Classification > Animals/Vertebrates > Fish, Earth Science > Biosphere > Ecosystems > Marine Ecosystems, Earth Science > Biological Classification > Animals/Invertebrates > Mollusks, Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans, Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)
  history: TBD
  key

Review the `publisher_postprocess_logs` attribute:

In [None]:
#| eval: false
print(properties['global_attributes']['publisher_postprocess_logs'])

Convert 'nuclide' column values to lowercase, strip spaces, and store in 'nuclide' column., Remap data provider nuclide names to standardized MARIS nuclide names., Parse the time format in the dataframe and check for inconsistencies., Encode time as seconds since epoch., Sanitize value by removing blank entries and populating `value` column., Normalize uncertainty values in DataFrames., Callback to update DataFrame 'UNIT' columns based on a lookup table., Remap detection limit values to MARIS format using a lookup table., Remap values from 'species' to 'SPECIES' for groups: BIOTA., Remap values from 'biological' to 'enhanced_species' for groups: BIOTA., Enhance the 'SPECIES' column using the 'enhanced_species' column if conditions are met., Add a temporary column with the body part and biological group combined., Remap values from 'body_part_temp' to 'BODY_PART' for groups: BIOTA., Create a SMP_ID column from the ID column, Ensure depth values are floats and add 'SMP_DEPTH' columns., C

Now let's review the properties of the groups in the NetCDF file:

In [None]:
#| eval: false
properties = get_netcdf_group_properties(fname_out_nc)

for key, val in properties.items():
    if isinstance(val, dict):
        print(f"{key}:")
        for sub_key, sub_val in val.items():
            print(f"  {sub_key}: {sub_val}")
    else:
        print(f"{key}: {val}")

biota:
  variables: ['lon', 'lat', 'time', 'smp_id', 'nuclide', 'value', 'unit', 'unc', 'dl', 'species', 'body_part']
  dimensions: {'id': 15262}
  attributes: {}
seawater:
  variables: ['lon', 'lat', 'smp_depth', 'time', 'smp_id', 'nuclide', 'value', 'unit', 'unc', 'dl']
  dimensions: {'id': 19019}
  attributes: {}


Let's review all variable attributes for the groups of the NetCDF file:

In [None]:
#| eval: false
df_var_prop=get_netcdf_variable_properties(fname_out_nc, as_df=True).T
df_var_prop

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
group,biota,biota,biota,biota,biota,biota,biota,biota,biota,biota,...,seawater,seawater,seawater,seawater,seawater,seawater,seawater,seawater,seawater,seawater
variable,lon,lat,time,smp_id,nuclide,value,unit,unc,dl,species,...,lon,lat,smp_depth,time,smp_id,nuclide,value,unit,unc,dl
dimensions_id,"('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)",...,"('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)","('id',)"
dimensions_size,"(15262,)","(15262,)","(15262,)","(15262,)","(15262,)","(15262,)","(15262,)","(15262,)","(15262,)","(15262,)",...,"(19019,)","(19019,)","(19019,)","(19019,)","(19019,)","(19019,)","(19019,)","(19019,)","(19019,)","(19019,)"
data_type,<f4,<f4,<u8,<u8,<i8,<f4,<i8,<f4,<i8,<i8,...,<f4,<f4,<f4,<u8,<u8,<i8,<f4,<i8,<f4,<i8
attr_long_name,Measurement longitude,Measurement latitude,Time of measurement,Data provider sample ID,Nuclide,Activity,Unit,Uncertainty,Detection limit,Species,...,Measurement longitude,Measurement latitude,Sample depth below seal level,Time of measurement,Data provider sample ID,Nuclide,Activity,Unit,Uncertainty,Detection limit
attr_standard_name,longitude,latitude,time,sample_id,nuclide,activity,unit,uncertainty,detection_limit,species,...,longitude,latitude,sample_depth_below_sea_floor,time,sample_id,nuclide,activity,unit,uncertainty,detection_limit
attr_units,degrees_east,degrees_north,seconds since 1970-01-01 00:00:00.0,,,,,,,,...,degrees_east,degrees_north,m,seconds since 1970-01-01 00:00:00.0,,,,,,
attr_time_origin,,,1970-01-01 00:00:00,,,,,,,,...,,,,1970-01-01 00:00:00,,,,,,
attr_time_zone,,,UTC,,,,,,,,...,,,,UTC,,,,,,
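
The `time` variable is stored as seconds since 1970-01-01 00:00:00 UTC, as the `attr_units` row indicates. The round trip between calendar dates and that encoding can be reproduced with plain pandas. The sketch below is independent of the `EncodeTimeCB` implementation and uses the first two BIOTA sampling dates as input.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['1995-04-05', '1995-07-07']))
epoch = pd.Timestamp('1970-01-01')

# Encode: whole seconds elapsed since the epoch.
seconds = ((dates - epoch) // pd.Timedelta(seconds=1)).astype('int64')

# Decode: back to timestamps.
decoded = pd.to_datetime(seconds, unit='s')

print(seconds.tolist())
print(decoded.dt.strftime('%Y-%m-%d').tolist())
```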


Let's convert the NetCDF file to a dictionary of DataFrames:

In [None]:
#| eval: false
dfs=nc_to_dfs(fname_out_nc)

Let's review the biota data:

In [None]:
#| eval: false
nc_dfs_biota=dfs['BIOTA']
nc_dfs_biota

Unnamed: 0,lon,lat,time,smp_id,nuclide,value,unit,unc,dl,species,body_part
0,11.583333,55.966667,1995-04-05,38847,33,2.02170,5,0.031336,1,96,40
1,11.583333,55.966667,1995-07-07,38848,33,2.34446,5,0.023445,1,96,40
2,11.583333,55.966667,1995-09-19,38849,33,2.62356,5,0.020988,1,392,40
3,11.583333,55.966667,1995-09-19,38850,33,2.78070,5,0.015294,1,96,40
4,11.583333,55.966667,1995-12-21,38851,33,1.51102,5,0.008311,1,96,40
...,...,...,...,...,...,...,...,...,...,...,...
15257,-5.041667,58.452499,2021-04-26,96860,15,9.50000,5,1.750000,1,96,56
15258,-5.041667,58.452499,2021-07-27,96861,15,10.00000,5,1.850000,1,96,56
15259,-3.594445,54.872776,2021-06-18,96862,33,0.96400,5,0.145000,1,394,19
15260,-3.594445,54.872776,2021-12-13,96863,33,1.48000,5,0.215000,1,394,19


Let's review the seawater data:

In [None]:
#| eval: false
nc_dfs_seawater=dfs['SEAWATER']
nc_dfs_seawater

Unnamed: 0,lon,lat,smp_depth,time,smp_id,nuclide,value,unit,unc,dl
0,11.783334,56.166668,2.0,1995-05-01,45552,33,0.040141,1,0.000341,1
1,11.783334,56.166668,24.0,1995-05-01,45553,33,0.037117,1,0.000390,1
2,11.783334,56.166668,2.0,1995-11-01,45554,33,0.043450,1,0.000282,1
3,11.783334,56.166668,25.0,1995-11-01,45555,33,0.046080,1,0.000253,1
4,11.166667,56.116665,2.0,1995-05-01,45556,33,0.050330,1,0.000377,1
...,...,...,...,...,...,...,...,...,...,...
19014,-0.280167,54.916332,3.0,2022-08-28,9223372036854775808,33,0.002261,1,0.000298,1
19015,0.918167,53.912498,3.0,2022-08-29,9223372036854775808,33,0.002130,1,0.000304,1
19016,1.275333,53.930668,3.0,2022-08-29,9223372036854775808,33,0.002210,1,0.000298,1
19017,2.716500,54.508835,3.0,2022-08-29,9223372036854775808,33,0.002270,1,0.000285,1


## Data Format Conversion 

The MARIS data processing workflow involves two key steps:

1. **NetCDF to Standardized CSV Compatible with OpenRefine Pipeline**
   - Convert standardized NetCDF files to CSV formats compatible with OpenRefine using the `NetCDFDecoder`.
   - Preserve data integrity and variable relationships.
   - Maintain standardized nomenclature and units.

2. **Database Integration**
   - Process the converted CSV files using OpenRefine.
   - Apply data cleaning and standardization rules.
   - Export validated data to the MARIS master database.

This section focuses on the first step: converting NetCDF files to a format suitable for OpenRefine processing using the `NetCDFDecoder` class.

In [None]:
#|eval: false
decode(fname_in=fname_out_nc, verbose=True)

{'BIOTA':              LON        LAT        TIME  SMP_ID  NUCLIDE     VALUE  UNIT  \
0      11.583333  55.966667   797040000   38847       33   2.02170     5   
1      11.583333  55.966667   805075200   38848       33   2.34446     5   
2      11.583333  55.966667   811468800   38849       33   2.62356     5   
3      11.583333  55.966667   811468800   38850       33   2.78070     5   
4      11.583333  55.966667   819504000   38851       33   1.51102     5   
...          ...        ...         ...     ...      ...       ...   ...   
15257  -5.041667  58.452499  1619395200   96860       15   9.50000     5   
15258  -5.041667  58.452499  1627344000   96861       15  10.00000     5   
15259  -3.594445  54.872776  1623974400   96862       33   0.96400     5   
15260  -3.594445  54.872776  1639353600   96863       33   1.48000     5   
15261  -3.594445  54.872776  1640908800   96864       15  13.20000     5   

            UNC  DL  SPECIES  BODY_PART  
0      0.031336   1       96       