In [68]:
#| default_exp handlers.ospar

# OSPAR 
> Data pipeline (handler) to convert OSPAR data ([source](https://odims.ospar.org/en/)) to `NetCDF` format or `Open Refine` format.  

## Processing OSPAR Environmental Monitoring Data

The OSPAR Environmental Monitoring [Data](https://odims.ospar.org/en/) is provided as a Microsoft Access database. `Mdbtools` (https://github.com/mdbtools/mdbtools) can be used to convert the tables of the Microsoft Access database to `.csv` files on Unix-like OS.

Example steps:
1. Download data.
2. Install mdbtools via VScode Terminal 

    ```
    sudo apt-get -y install mdbtools
    ````

3. Install unzip via VScode Terminal 

    ```
    sudo apt-get -y install unzip
    ````

4. In VS code terminal, navigate to the marisco data folder

    ```
    cd /home/marisco/downloads/marisco/_data/accdb/ospar
    ```

5. Unzip OSPAR_Env_Concentrations_20240206.zip

    ```
    unzip OSPAR_Env_Concentrations_20240206.zip
    ```

6. Run preprocess.sh to generate the required data files

    ```
    ./preprocess.sh OSPAR_Env_Concentrations_20240206.zip
    ````
7. Content of 'preprocess.sh' script.
    ```
    #!/bin/bash

    # Example of use: ./preprocess.sh OSPAR_Env_Concentrations_20240206.zip
    unzip $1
    dbname=$(ls *.accdb *.mdb)
    mkdir csv
    for table in $(mdb-tables -1 "$dbname"); do
        echo "Export table $table"
        mdb-export "$dbname" "$table" > "csv/$table.csv"
    done
    ```


***

## Understanding MARIS Data Formats (NetCDF and Open Refine).

> [!TIP]
>
>For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)]() for detailed information.

TODO : update link when pushed.

***

## Packages import

In [69]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [70]:
#| export
import pandas as pd # Python package that provides fast, flexible, and expressive data structures.
import numpy as np
from tqdm import tqdm # Python Progress Bar Library
from functools import partial # Function which Return a new partial object which when called will behave like func called with the positional arguments args and keyword arguments keywords
import fastcore.all as fc # package that brings fastcore functionality, see https://fastcore.fast.ai/.
from pathlib import Path # This module offers classes representing filesystem paths
from dataclasses import asdict
from typing import List, Dict, Callable,  Tuple
from math import modf
from collections import OrderedDict

from marisco.utils import (has_valid_varname, match_worms, match_maris_lut, Match)
from marisco.callbacks import (Callback, Transformer, EncodeTimeCB, SanitizeLonLatCB)
from marisco.metadata import (GlobAttrsFeeder, BboxCB, DepthRangeCB, TimeRangeCB, ZoteroCB, KeyValuePairCB)
from marisco.configs import (nuc_lut_path, nc_tpl_path, cfg, cache_path, cdl_cfg, Enums, lut_path,
                             species_lut_path, sediments_lut_path, bodyparts_lut_path, 
                             detection_limit_lut_path, filtered_lut_path, area_lut_path)
from marisco.serializers import NetCDFEncoder,  OpenRefineCsvEncoder
import warnings

In [71]:
warnings.filterwarnings('ignore')

***

## Configuration and File Paths

1. **fname_in** - is the path to the folder containing the OSPAR data in CSV format. The path can be defined as a relative path. 

2. **fname_out_nc** - is the path and filename for the NetCDF output.The path can be defined as a relative path. 

3. **fname_out_csv** - is the path and filename for the Open Refine csv output.The path can be defined as a relative path.

4. **Zotero key** - is used to retrieve attributes related to the dataset from [Zotero](https://www.zotero.org/). The MARIS datasets include a [library](https://maris.iaea.org/datasets) available on [Zotero](https://www.zotero.org/groups/2432820/maris/library). 

5. **ref_id** - refers to the location in archive of the Zotero library.


In [72]:
# | export
fname_in = '../../_data/accdb/ospar/csv'
fname_out_nc = '../../_data/output/ospar_19950103_2021214.nc'
fname_out_csv = '../../_data/output/100-HELCOM-MORS-2024.csv'
zotero_key ='LQRA4MMK'
ref_id = 191

***

## Utils

Load OSPAR data and return the data in a Python dictionary of dataframes with the dictionary key as the sample type.

In [73]:
#| export
def load_data(src_dir: str, smp_types: List[str] = ['SEA', 'SED', 'BIO']) -> Dict[str, pd.DataFrame]:
    """
    Load OSPAR data and return the data in a dictionary of dataframes with the dictionary key as the sample type.
    
    Args:
    src_dir (str): The directory where the source CSV files are located.
    smp_types (List[str]): A list of sample types to load. Defaults to ['SEA', 'SED', 'BIO'].
    
    Returns:
    Dict[str, pd.DataFrame]: A dictionary with sample types as keys and their corresponding dataframes as values.
    """   
    dfs = {}
    lut_smp_type = {'Seawater data': 'seawater', 'Biota data': 'biota'}
    for k, v in lut_smp_type.items():
        fname_meas = k + '.csv' # measurement (i.e. radioactivity) information and sample information     
        df = pd.read_csv(Path(src_dir)/fname_meas, encoding='unicode_escape')
        dfs[v] = df
    return dfs

***

## Load data

dfs includes a dictionary of tables (dataframes) that is created from the OSPAR dataset defined by fname_in. The data to be included in each dataframe is sorted by sample type. Each dictionary is defined with a key equal to the sample type. 

In [74]:
#|eval: false
dfs = load_data(fname_in)
dfs

{'seawater':            ID Contracting Party  RSC Sub-division   Station ID Sample ID  \
 0           1           Belgium               8.0  Belgica-W01    WNZ 01   
 1           2           Belgium               8.0  Belgica-W02    WNZ 02   
 2           3           Belgium               8.0  Belgica-W03    WNZ 03   
 3           4           Belgium               8.0  Belgica-W04    WNZ 04   
 4           5           Belgium               8.0  Belgica-W05    WNZ 05   
 ...       ...               ...               ...          ...       ...   
 18851  121646    United Kingdom              10.0       Rosyth   2100318   
 18852  121647    United Kingdom              10.0       Rosyth   2101399   
 18853  121648    United Kingdom               6.0        Wylfa    21-656   
 18854  121649    United Kingdom               6.0        Wylfa    21-657   
 18855  121650    United Kingdom               6.0        Wylfa    21-654   
 
        LatD  LatM  LatS LatDir  LongD  ...  Sampling date  Nu

List the keys for the dictionary of dataframes.  

In [75]:
#|eval: false
keys = dfs.keys()
keys

dict_keys(['seawater', 'biota'])

Show the structure of the 'seawater' dataframe. 

In [76]:
#|eval: false
dfs['seawater'].head()

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
0,1,Belgium,8.0,Belgica-W01,WNZ 01,51.0,22.0,31.0,N,3.0,...,27/01/2010,137Cs,<,0.2,,Bq/l,SCKCEN,,,
1,2,Belgium,8.0,Belgica-W02,WNZ 02,51.0,13.0,25.0,N,2.0,...,27/01/2010,137Cs,<,0.27,,Bq/l,SCKCEN,,,
2,3,Belgium,8.0,Belgica-W03,WNZ 03,51.0,11.0,4.0,N,2.0,...,27/01/2010,137Cs,<,0.26,,Bq/l,SCKCEN,,,
3,4,Belgium,8.0,Belgica-W04,WNZ 04,51.0,25.0,13.0,N,3.0,...,27/01/2010,137Cs,<,0.25,,Bq/l,SCKCEN,,,
4,5,Belgium,8.0,Belgica-W05,WNZ 05,51.0,24.0,58.0,N,2.0,...,26/01/2010,137Cs,<,0.2,,Bq/l,SCKCEN,,,


Show the structure of the `biota` dataframe. 

In [77]:
#|eval: false
dfs['biota'].head()

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
0,96793,United Kingdom,5,Hunterston,2200086,55,43,31.0,N,4,...,31/12/2021,"239,240Pu",=,0.351,0.066,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"PLZ. Annual bulk of 2 samples, representative ...",
1,96822,United Kingdom,6,Chapelcross,2200081,54,58,8.0,N,3,...,31/12/2021,99Tc,=,39.0,15.0,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,PLZ,
2,96823,United Kingdom,7,Dounreay,2200093,58,33,57.0,N,3,...,31/12/2021,"239,240Pu",=,0.0938,0.018,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"Sandside Bay. Annual bulk of 4 samples, repre...",
3,96824,United Kingdom,7,Dounreay,2200089,58,37,7.0,N,3,...,31/12/2021,"239,240Pu",=,1.54,0.31,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"Brims Ness. Annual bulk of 4 samples, represe...",
4,96857,United Kingdom,10,Torness,2100074,55,57,53.0,N,2,...,31/12/2021,99Tc,=,16.0,6.0,Bq/kg f.w.,SEPA-Scottish Environment Protection Agency,,"Thornton Loch. Annual bulk of 2 samples, repre...",


***

## Data transformation pipeline for NetCDF and Open Refine.

### Data transformation pipeline utils

``CompareDfsAndTfmCB`` compares the original dataframes to the transformed dataframe. A dictionary of dataframes, ``tfm.dfs_dropped``, is created to include the data present in the original dataset but absent from the transformed data. ``tfm.compare_stats`` provides a quick overview of the number of rows in both the original dataframes and the transformed dataframe.

In [78]:
#| export
class CompareDfsAndTfmCB(Callback):
    "Create a dataframe of dropped data. Data included in the `dfs` not in the `tfm`."
    
    def __init__(self, dfs: Dict[str, pd.DataFrame]):
        fc.store_attr()
    
    def __call__(self, tfm: Transformer) -> None:
        self._initialize_tfm_attributes(tfm)
        for grp in tfm.dfs.keys():
            dropped_df = self._get_dropped_data(grp, tfm)
            tfm.dfs_dropped[grp] = dropped_df
            tfm.compare_stats[grp] = self._compute_stats(grp, tfm)

    def _initialize_tfm_attributes(self, tfm: Transformer) -> None:
        """Initialize attributes in `tfm`."""
        tfm.dfs_dropped = {}
        tfm.compare_stats = {}

    def _get_dropped_data(self, grp: str, tfm: Transformer) -> pd.DataFrame:
        """
        Get the data that is present in `dfs` but not in `tfm.dfs`.
        
        Args:
        grp (str): The group key.
        tfm (Transformer): The transformation object containing `dfs`.
        
        Returns:
        pd.DataFrame: Dataframe with dropped rows.
        """
        index_diff = self.dfs[grp].index.difference(tfm.dfs[grp].index)
        return self.dfs[grp].loc[index_diff]
    
    def _compute_stats(self, grp: str, tfm: Transformer) -> Dict[str, int]:
        """
        Compute comparison statistics between `dfs` and `tfm.dfs`.
        
        Args:
        grp (str): The group key.
        tfm (Transformer): The transformation object containing `dfs`.
        
        Returns:
        Dict[str, int]: Dictionary with comparison statistics.
        """
        return {
            'Number of rows in dfs': len(self.dfs[grp].index),
            'Number of rows in tfm.dfs': len(tfm.dfs[grp].index),
            'Number of dropped rows': len(tfm.dfs_dropped[grp].index),
            'Number of rows in tfm.dfs + Number of dropped rows': len(tfm.dfs[grp].index) + len(tfm.dfs_dropped[grp].index)
        }


***

### Define Sample Type 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Included as netcdf.group*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``type``.*

In [79]:
type_lut = {
    'SEAWATER' : 1,
    'BIOTA' : 2,
}

In [80]:
# | export
class GetSampleTypeCB(Callback):
    """Set the 'Sample type' column in the DataFrames based on a lookup table."""
    
    def __init__(self, type_lut=None):
        """
        Initialize the GetSampleTypeCB callback.

        Args:
            type_lut (dict, optional): A lookup table to map group names to sample types.
        """
        fc.store_attr()

    def __call__(self, tfm):
        """
        Apply the sample type lookup to DataFrames in the transformer.

        Args:
            tfm (Transformer): The transformer object containing DataFrames.
        """
        for key in tfm.dfs.keys():
            df = tfm.dfs[key]
            
            # Determine the sample type
            sample_type = self._get_sample_type(key)
            
            # Set the 'Sample type' column
            df['samptype_id'] = sample_type

    def _get_sample_type(self, group_name):
        """
        Determine the sample type for a given group name using the lookup table.

        Args:
            group_name (str): The name of the group.

        Returns:
            str: The sample type.
        """
        
        # Return the sample type from the lookup table
        return self.type_lut[group_name.upper()]


Here we call a transformer, which applies the callback (e.g. `GetSampleTypeCB`) to the dictionary of dataframes, `dfs`.

In [107]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[GetSampleTypeCB(type_lut),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  15314
Number of dropped rows                                     0      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



***

### Normalize ``Nuclide`` names

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``nuclide``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``nuclide_id``.*

#### Lower & strip nuclide names

Creates a class `LowerStripRdnNameCB` that receives a dictionary of dataframes. For each dataframe in the dictionary of dataframes, it corrects the nuclide name by converting it lowercase, striping any leading or trailing whitespace(s) and ensuring the number comes before letters (e.g. 137cs).

In [82]:
#| export
class LowerStripRdnNameCB(Callback):
    """Convert nuclide names to lowercase and strip any trailing spaces."""

    def __call__(self, tfm):
        for key in tfm.dfs.keys():
            self._process_nuclide_column(tfm.dfs[key])

    def _process_nuclide_column(self, df):
        """Apply transformation to the 'Nuclide' column of the given DataFrame."""
        if 'Nuclide' in df.columns:
            df['NUCLIDE'] = df['Nuclide'].apply(self._transform_nuclide)
        else:
            print(f"Warning: 'Nuclide' column not found in DataFrame.")

    def _transform_nuclide(self, nuclide):
        """Convert nuclide name to lowercase and strip trailing spaces."""
        if isinstance(nuclide, str):
            return nuclide.lower().strip()
        return nuclide

Here we call a transformer, which applies the callback (e.g. `LowerStripRdnNameCB`) to the dictionary of dataframes, `dfs`. We then print the unique entries of the transformed `NUCLIDE` column for each dataframe included in the dictionary of dataframes, `dfs`.

In [83]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB()])
print('seawater nuclides: ')
print(tfm()['seawater']['NUCLIDE'].unique())
print('biota nuclides: ')
print(tfm()['biota']['NUCLIDE'].unique())

seawater nuclides: 
['137cs' '239,240pu' '226ra' '228ra' '99tc' '3h' '210po' '210pb' nan
 'ra-226' 'ra-228']
biota nuclides: 
['239,240pu' '99tc' '137cs' '226ra' '228ra' '238pu' '239, 240 pu' '241am'
 'cs-137' 'cs-134' '3h' '210pb' '210po']


#### Remap nuclide names to MARIS data formats

The `maris-template.nc` file, which  is created from the `cdl.toml` on installation of the Marisco package, provides details of the nuclides permitted in the  MARIS NetCDF file. Here we define a function  `get_unique_nuclides()` which creates a list of the unique nuclides from each dataframe in the dictionary of dataframes `dfs`. The function `has_valid_varname` checks that each nuclide in this list is included in the `maris-template.nc` (i.e. the `cdl.toml`). `has_valid_varname` returns all variables in the list that are not in the `maris-template.nc` or returns `True`. 


In [84]:
#| export
def get_unique_nuclides(dfs: Dict[str, pd.DataFrame]) -> List[str]:
    """
    Get a list of unique radionuclide types measured across samples.

    Args:
        dfs (Dict[str, pd.DataFrame]): A dictionary where keys are sample names and values are DataFrames.

    Returns:
        List[str]: A list of unique radionuclide types.
    """
    # Collect unique nuclide names from all DataFrames
    nuclides = set()
    for df in dfs.values():
        nuclides.update(df['NUCLIDE'].unique())

    return list(nuclides)

In [85]:
#|eval: false
# Check if these variable names are consistent with MARIS CDL
has_valid_varname(get_unique_nuclides(tfm.dfs), nc_tpl_path())

"228ra" variable name not found in MARIS CDL
"nan" variable name not found in MARIS CDL
"137cs" variable name not found in MARIS CDL
"226ra" variable name not found in MARIS CDL
"239,240pu" variable name not found in MARIS CDL
"3h" variable name not found in MARIS CDL
"ra-228" variable name not found in MARIS CDL
"238pu" variable name not found in MARIS CDL
"99tc" variable name not found in MARIS CDL
"ra-226" variable name not found in MARIS CDL
"210pb" variable name not found in MARIS CDL
"cs-134" variable name not found in MARIS CDL
"210po" variable name not found in MARIS CDL
"cs-137" variable name not found in MARIS CDL
"239, 240 pu" variable name not found in MARIS CDL
"241am" variable name not found in MARIS CDL


False

Many nuclide names are not listed in the `maris-template.nc`. Here we create a look up table, `varnames_lut_updates`, which will be used to correct the nuclide names in the dictionary of dataframes (i.e. dfs) that are not compatible with the `maris-template.nc`.

In [86]:
#| export
varnames_lut_updates = {
            "239, 240 pu" :  'pu239_240_tot',
            "cs-137" :  'cs137',
            "241am" : 'am241',
            "228ra" : 'ra228',
            "3h" : 'h3',
            "99tc" : 'tc99' ,
            "cs-134" : 'cs134',
            "210pb" : 'pb210',
            "239,240pu" : 'pu239_240_tot',
            "238pu" : 'pu238',
            "137cs" : 'cs137',
            "226ra" : 'ra226',
            "ra-228" : 'ra228',
            "ra-226" : 'ra226',
            "210po" : 'po210'}

Function `get_varnames_lut` returns a dictionary of nuclide names. This dictionary includes the `NUCLIDE` names from the dataframes in dfs, along with corrections specified in `varnames_lut_updates`.

In [87]:
#| export
def get_varnames_lut(
    dfs: Dict[str, pd.DataFrame], 
    lut: Dict[str, str] = varnames_lut_updates
) -> Dict[str, str]:
    """
    Generate a lookup table for radionuclide names, updating with provided mappings.

    Args:
        dfs (Dict[str, pd.DataFrame]): A dictionary where keys are sample names and values are DataFrames.
        lut (Dict[str, str], optional): A dictionary with additional mappings to update the lookup table.

    Returns:
        Dict[str, str]: A dictionary mapping radionuclide names to their corresponding names.
    """
    # Generate a base lookup table from unique nuclide names
    unique_nuclides = get_unique_nuclides(dfs)
    base_lut = {name: name for name in unique_nuclides}

    # Update the base lookup table with additional mappings
    base_lut.update(lut)
    
    return base_lut

The ``get_nuc_id_lut`` function creates a lookup table to map nuclide names to their IDs. In the MARIS Open Refine data format, each nuclide has a unique nuclide_id. This function reads an Excel file that lists nuclide names and their IDs, and then returns a dictionary. In this dictionary, the nuclide names are the keys, and their corresponding IDs are the values.

In [88]:
#| export
def get_nuc_id_lut():
    df = pd.read_excel(nuc_lut_path(), usecols=['nc_name','nuclide_id'])
    return df.set_index('nc_name').to_dict()['nuclide_id']

Create a callback that remaps the nuclide names in the dataframes to the updated names in `varnames_lut_updates`.

In [89]:
# | export
class RemapRdnNameCB(Callback):
    """Remap and standardize radionuclide names to MARIS radionuclide names and define nuclide ids."""
    
    def __init__(self, 
                 fn_lut: Callable[[Dict[str, pd.DataFrame]], Dict[str, str]] = partial(get_varnames_lut, lut=varnames_lut_updates),
                 nuc_id_lut: Callable[[], Dict[str, str]] = get_nuc_id_lut):
        """
        Initialize the RemapRdnNameCB with functions to generate lookup tables for radionuclide names 
        and nuclide IDs.

        Args:
            fn_lut (Callable, optional): A function that takes a dictionary of DataFrames and returns a lookup table 
                                         for remapping radionuclide names.
            nuc_id_lut (Callable, optional): A function that returns a lookup table for nuclide IDs.
        """
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        """Apply lookup tables to remap radionuclide names and obtain nuclide IDs in DataFrames.

        Args:
            tfm (Transformer): The transformer object containing DataFrames.
        """
        lut = self.fn_lut(tfm.dfs)
        nuc_id_lut = self.nuc_id_lut()
        
        for grp in tfm.dfs:
            df = tfm.dfs[grp]
            # Drop rows where 'NUCLIDE' is NaN
            df.dropna(subset=['NUCLIDE'], inplace=True)
            self._remap_nuclide_names(df, lut)
            self._apply_nuclide_ids(df, nuc_id_lut)
                        
            
    def _remap_nuclide_names(self, df: pd.DataFrame, lut: Dict[str, str]):
        """
        Remap radionuclide names in the 'NUCLIDE' column of the DataFrame using the provided lookup table.

        Args:
            df (pd.DataFrame): DataFrame containing the 'NUCLIDE' column.
            lut (Dict[str, str]): Lookup table for remapping radionuclide names.
        """
        if 'NUCLIDE' in df.columns:
            df['NUCLIDE'] = df['NUCLIDE'].replace(lut)
        else:
            print(f"No 'NUCLIDE' column found in DataFrame of group {df.name}")

    def _apply_nuclide_ids(self, df: pd.DataFrame, nuc_id_lut: Dict[str, str]):
        """
        Apply nuclide IDs to the 'NUCLIDE' column using the provided nuclide ID lookup table.

        Args:
            df (pd.DataFrame): DataFrame containing the 'NUCLIDE' column.
            nuc_id_lut (Dict[str, str]): Lookup table for nuclide IDs.
        """
        if 'NUCLIDE' in df.columns:
            df['nuclide_id'] = df['NUCLIDE'].map(nuc_id_lut)
        else:
            print(f"No 'NUCLIDE' column found in DataFrame of group {df.name}")


Apply the transformer for callbacks `LowerStripRdnNameCB` and `RemapRdnNameCB`. Then, print the unique nuclides for each dataframe in the dictionary dfs.

In [90]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('seawater nuclides: ')
print(tfm.dfs['seawater'][['NUCLIDE', 'nuclide_id']].drop_duplicates().reset_index(drop=True))
print('biota nuclides: ')
print(tfm.dfs['biota'][['NUCLIDE', 'nuclide_id']].drop_duplicates().reset_index(drop=True))

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18310  15314
Number of dropped rows                                   546      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

seawater nuclides: 
         NUCLIDE  nuclide_id
0          cs137          33
1  pu239_240_tot          77
2          ra226          53
3          ra228          54
4           tc99          15
5             h3           1
6          po210          47
7          pb210          41
biota nuclides: 
          NUCLIDE  nuclide_id
0   pu239_240_tot          77
1            tc99          15
2           cs137          33
3           ra226          53
4           ra228          54
5           pu238          67
6           am241          72
7           cs134          31
8              h3           1
9           pb210          41
10          po210          47


After applying correction to the nuclide names we check that all nuclide in the dictionary of dataframees are valid. Returns `True` if all are valid.

In [91]:
#|eval: false
has_valid_varname(get_unique_nuclides(tfm.dfs), nc_tpl_path())

True

Many entries of OSPAR Nuclide are NAN. 

In [92]:
dfs['seawater'][dfs['seawater']['Nuclide'].isna()]

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
16799,97147,,,,,,,,,,...,,,,,,,,,,
16800,97148,,,,,,,,,,...,,,,,,,,,,
16801,97149,,,,,,,,,,...,,,,,,,,,,
16802,97150,,,,,,,,,,...,,,,,,,,,,
16803,97151,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18474,120366,Ireland,4.0,N8,,53.0,39.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18475,120367,Ireland,4.0,N9,,53.0,53.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18476,120368,Ireland,4.0,N10,,53.0,52.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18477,120369,Ireland,1.0,Salthill,,53.0,15.0,40.0,N,9.0,...,,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,


***

### Standardize Time

#### Parse time

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: `time`.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: `begperiod` 

Create a callback that remaps the time format in the dictionary of dataframes (i.e. `%m/%d/%y %H:%M:%S`):

In [93]:
#| export
class ParseTimeCB(Callback):
    def __init__(self):
        fc.store_attr()
            
        
    def __call__(self, tfm):
        for grp in tfm.dfs.keys():
            df = tfm.dfs[grp]
            self._process_dates(df)
            self._define_beg_period(df)
            self._remove_nan(df)

    def _process_dates(self, df: pd.DataFrame):
        """
        Process and correct date and time information in the DataFrame.

        Args:
            df (pd.DataFrame): DataFrame containing the 'Sampling date' column.
        """
        if 'Sampling date' in df.columns:
            # Convert 'Sampling date' to datetime, ignoring errors to avoid NaNs
            df['time'] = pd.to_datetime(df['Sampling date'], format='%d/%m/%Y', errors='coerce')
        else:
            # Create 'time' column with NaT if 'Sampling date' doesn't exist
            df['time'] = pd.NaT                
                    
    def _define_beg_period(self, df: pd.DataFrame):
        """
        Create a standardized date representation for Open Refine.
        
        Args:
            df (pd.DataFrame): DataFrame containing the 'time' column.
        """
        df['begperiod'] = df['time']

    def _remove_nan(self, df: pd.DataFrame):
        """
        Remove rows with NaN entries in the 'time' column.
        
        Args:
            df (pd.DataFrame): DataFrame containing the 'time' column.
        """
        df.dropna(subset=['time'], inplace=True)



Apply the transformer for callbacks `ParseTimeCB`. Then, print the ``begperiod`` and `time` data for `seawater`.

In [94]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['seawater'][['begperiod','time']])

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18308  15314
Number of dropped rows                                   548      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

       begperiod       time
0     2010-01-27 2010-01-27
1     2010-01-27 2010-01-27
2     2010-01-27 2010-01-27
3     2010-01-27 2010-01-27
4     2010-01-26 2010-01-26
...          ...        ...
18851 2021-04-29 2021-04-29
18852 2021-12-10 2021-12-10
18853 2021-04-07 2021-04-07
18854 2021-04-07 2021-04-07
18855 2021-04-07 2021-04-07

[18308 rows x 2 columns]


***

In [108]:
#### Encode time (seconds since ...)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``time``*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: No encoding for Open Refine.* 

`EncodeTimeCB` converts the HELCOM `time` format to the MARIS NetCDF `time` format.

In [96]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
                            

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18308  15314
Number of dropped rows                                   548      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



***

### Sanitize value

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``value``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: ``activity``.*

In [97]:
# | export
class SanitizeValue(Callback):
    "Sanitize value by removing blank entries."

    def __init__(self):
        """
        Initialize the SanitizeValue callback.
        """
        fc.store_attr()

    def __call__(self, tfm):
        """
        Sanitize the DataFrames in the transformer by removing rows with blank values in specified columns.
        Args:
            tfm (Transformer): The transformer object containing DataFrames.
        """
        for grp in tfm.dfs.keys():
            self._sanitize_dataframe(tfm.dfs[grp], grp)


    def _sanitize_dataframe(self, df: pd.DataFrame, grp: str):
        """
        Remove rows where value column (i.e. 'Activity or MDA') is blank and remap to 'value' column.

        Args:
            df (pd.DataFrame): DataFrame to sanitize.
            grp (str): Group name to determine column names.
        """
        value_col = 'Activity or MDA'
        if value_col in df.columns:
            df.dropna(subset=[value_col], inplace=True)
            df['value'] = df[value_col]
            

In [109]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[SanitizeValue(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18308  15314
Number of dropped rows                                   548      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



***

### Normalize uncertainty

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``uncertainty``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: `Uncertainty`.*

NormalizeUncCB callback normalizes the uncertainty

In [99]:
tfm.dfs['seawater'].columns

Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Sampling depth', 'Sampling date',
       'Nuclide', 'Value type', 'Activity or MDA', 'Uncertainty', 'Unit',
       'Data provider', 'Measurement Comment', 'Sample Comment',
       'Reference Comment', 'value'],
      dtype='object')

In [100]:
class NormalizeUncCB(Callback):
    """Remap uncertainty."""
    
    def __init__(self):
        """
        Initialize the NormalizeUncCB.
        """
        # Automatically initialize attributes using fc.store_attr()
        fc.store_attr()
    
    def __call__(self, tfm: 'Transformer'):
        """
        Apply the conversion function to each DataFrame in the transformer.

        Args:
            tfm (Transformer): The transformer object containing DataFrames.
        """
        for grp in tfm.dfs.keys():
            df = tfm.dfs[grp]
            df['uncertainty'] = df['Uncertainty']

Apply the transformer for callback NormalizeUncCB(). Then, print the value (i.e. activity per unit ) and standard uncertainty for each sample type.

In [110]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[       
                            SanitizeValue(),               
                            NormalizeUncCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])


tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

print(tfm.dfs['seawater'][['value', 'uncertainty']][:5])
print(tfm.dfs['biota'][['value', 'uncertainty']][:5])


                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18308  15314
Number of dropped rows                                   548      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

   value  uncertainty
0   0.20          NaN
1   0.27          NaN
2   0.26          NaN
3   0.25          NaN
4   0.20          NaN
     value  uncertainty
0   0.3510        0.066
1  39.0000       15.000
2   0.0938        0.018
3   1.5400        0.310
4  16.0000        6.000


***

### Lookup transformations 

#### Lookup MARIS function 

`get_maris_lut` performs a lookup of data provided in `data_provider_lut` against the MARIS lookup (`maris_lut`) using a fuzzy matching algorithm based on Levenshtein distance. The `get_maris_lut` is used to correct the HELCOM data to a standard format for MARIS. 

In [None]:
#|export
def get_maris_lut(df_biota,
                  fname_cache,  # For instance 'species_ospar.pkl'
                  data_provider_name_col: str,  # Data provider lookup column name of interest
                  maris_lut: Callable,  # Function retrieving MARIS source lookup table
                  maris_id: str,  # Id of MARIS lookup table nomenclature item to match
                  maris_name: str,  # Name of MARIS lookup table nomenclature item to match
                  unmatched_fixes={},
                  as_dataframe=False,
                  overwrite=False):
    """
    Generate a lookup table mapping data provider names to MARIS radionuclide names.

    Args:
        df_biota (pd.DataFrame): DataFrame containing biota data.
        fname_cache (str): Cache file name for storing the lookup table.
        data_provider_name_col (str): Column name of interest in the data provider's dataset.
        maris_lut (Callable): Function to retrieve MARIS source lookup table.
        maris_id (str): Id of MARIS lookup table nomenclature item to match.
        maris_name (str): Name of MARIS lookup table nomenclature item to match.
        unmatched_fixes (dict): Dictionary of unmatched names and their corrections.
        as_dataframe (bool): Whether to return the lookup table as a DataFrame.
        overwrite (bool): Whether to overwrite the cache file if it exists.

    Returns:
        dict or pd.DataFrame: Lookup table mapping data provider names to MARIS radionuclide names.
    """
    fname_cache = Path(cache_path()) / fname_cache
    maris_lut_table = maris_lut()

    if overwrite or not fname_cache.exists():
        lut = _generate_lookup_table(df_biota, data_provider_name_col, maris_lut_table, maris_id, maris_name, unmatched_fixes)
        fc.save_pickle(fname_cache, lut)
    else:
        lut = fc.load_pickle(fname_cache)

    if as_dataframe:
        return _convert_lut_to_dataframe(lut)
    else:
        return lut

def _generate_lookup_table(df_biota, data_provider_name_col, maris_lut_table, maris_id, maris_name, unmatched_fixes):
    """
    Generate the lookup table from the provided data.

    Args:
        df_biota (pd.DataFrame): DataFrame containing biota data.
        data_provider_name_col (str): Column name of interest in the data provider's dataset.
        maris_lut_table (pd.DataFrame): MARIS source lookup table.
        maris_id (str): Id of MARIS lookup table nomenclature item to match.
        maris_name (str): Name of MARIS lookup table nomenclature item to match.
        unmatched_fixes (dict): Dictionary of unmatched names and their corrections.

    Returns:
        dict: Lookup table mapping data provider names to MARIS radionuclide names.
    """
    lut = {}
    unique_names = df_biota[data_provider_name_col].unique()
    for name in tqdm(unique_names, total=len(unique_names), desc="Generating lookup table"):
        corrected_name = unmatched_fixes.get(name, name)
        corrected_name = _sanitize_name(corrected_name)
        result = match_maris_lut(maris_lut_table, corrected_name, maris_id, maris_name)
        match = Match(result.iloc[0][maris_id], result.iloc[0][maris_name], name, result.iloc[0]['score'])
        lut[name] = match
    return lut

def _sanitize_name(name):
    """
    Ensure the name is a string and convert it to lowercase, stripping any trailing spaces.

    Args:
        name (any): The name to sanitize.

    Returns:
        str: The sanitized name.
    """
    if isinstance(name, str):
        return name.lower().strip()
    else:
        return str(name).lower().strip()

def _convert_lut_to_dataframe(lut):
    """
    Convert the lookup table dictionary to a sorted DataFrame.

    Args:
        lut (dict): Lookup table mapping data provider names to MARIS radionuclide names.

    Returns:
        pd.DataFrame: Sorted DataFrame of the lookup table.
    """
    df_lut = pd.DataFrame({k: asdict(v) for k, v in lut.items()}).transpose()
    df_lut.index.name = 'source_id'
    return df_lut.sort_values(by='match_score', ascending=False)


#### Lookup : Biota species

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``species``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: `Species`.*

In [None]:
#|export
# key equals name in dfs['biota']. 
# value equals replacement name to use in match_maris_lut (i.e. name_to_match)
unmatched_fixes_biota_species = {}

In [None]:
#|eval: false
species_lut_df = get_maris_lut(df_biota=tfm.dfs['biota'], 
                                fname_cache='species_ospar.pkl', 
                                data_provider_name_col='Species',
                                maris_lut=species_lut_path,
                                maris_id='species_id',
                                maris_name='species',
                                unmatched_fixes=unmatched_fixes_biota_species,
                                as_dataframe=True,
                                overwrite=True)

Generating lookup table:   0%|          | 0/156 [00:00<?, ?it/s]

Generating lookup table: 100%|██████████| 156/156 [00:27<00:00,  5.72it/s]


**TODO:** Mixed species ID (e.g.RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA ). Drop?

Show `maris_species_lut` where `match_type` is not a perfect match ( i.e. not equal 0).

In [None]:
#|eval: false
species_lut_df[species_lut_df['match_score'] > 1]

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,1426,Lomentaria catenata,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,31
"Mixture of green, red and brown algae",814,Mercenaria mercenaria,"Mixture of green, red and brown algae",26
Solea solea (S.vulgaris),161,Loligo vulgaris,Solea solea (S.vulgaris),12
SOLEA SOLEA (S.VULGARIS),161,Loligo vulgaris,SOLEA SOLEA (S.VULGARIS),12
CERASTODERMA (CARDIUM) EDULE,274,Cerastoderma edule,CERASTODERMA (CARDIUM) EDULE,10
Cerastoderma (Cardium) Edule,274,Cerastoderma edule,Cerastoderma (Cardium) Edule,10
MONODONTA LINEATA,1213,Ophiothrix lineata,MONODONTA LINEATA,9
NUCELLA LAPILLUS,363,Mugil cephalus,NUCELLA LAPILLUS,9
DICENTRARCHUS (MORONE) LABRAX,424,Dicentrarchus labrax,DICENTRARCHUS (MORONE) LABRAX,9
Pleuronectiformes [order],411,Pleuronectiformes,Pleuronectiformes [order],8


Match unmatched `biota_species`:

In [None]:
#|export
# LookupBiotaSpeciesCB filters 'Not available'. 
unmatched_fixes_biota_species = {'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': 'Not available', # mix
 'Mixture of green, red and brown algae': 'Not available', #mix 
 'Solea solea (S.vulgaris)': 'Solea solea',
 'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
 'CERASTODERMA (CARDIUM) EDULE': 'Cerastoderma edule',
 'Cerastoderma (Cardium) Edule': 'Cerastoderma edule',
 'MONODONTA LINEATA': 'Phorcus lineatus',
 'NUCELLA LAPILLUS': 'Not available', # Droped. In worms 'Nucella lapillus (Linnaeus, 1758)', 
 'DICENTRARCHUS (MORONE) LABRAX': 'Dicentrarchus labrax',
 'Pleuronectiformes [order]': 'Pleuronectiformes',
 'RAJIDAE/BATOIDEA': 'Not available', #mix 
 'PALMARIA PALMATA': 'Not available', # Dropped. In worms 'Palmaria palmata (Linnaeus) F.Weber & D.Mohr, 1805',
 'Sepia spp.': 'Sepia',
 'Rhodymenia spp.': 'Rhodymenia',
 'unknown': 'Not available',
 'RAJA DIPTURUS BATIS': 'Dipturus batis',
 'Unknown': 'Not available',
 'Flatfish': 'Not available',
 'FUCUS SPP.': 'FUCUS',
 'Patella sp.': 'Patella',
 'Gadus sp.': 'Gadus',
 'FUCUS spp': 'FUCUS',
 'Tapes sp.': 'Tapes',
 'Thunnus sp.': 'Thunnus',
 'RHODYMENIA spp': 'RHODYMENIA',
 'Fucus sp.': 'Fucus',
 'PECTINIDAE': 'Not available', # Droped. In worms as PECTINIDAE is a family.
 'PLUERONECTES PLATESSA': 'Pleuronectes platessa',
 'Gaidropsarus argenteus': 'Gaidropsarus argentatus'}

In [None]:
#|eval: false
'''
# Drop row in the dfs['biota] where the unmatched_fixes_biota_species value is 'Not available'. 
na_list = ['Not available']     
na_biota_species = [k for k,v in unmatched_fixes_biota_species.items() if v in na_list]
tfm.dfs['biota'] = tfm.dfs['biota'][~tfm.dfs['biota']['Species'].isin(na_biota_species)]
# drop nan values
tfm.dfs['biota']=tfm.dfs['biota'][tfm.dfs['biota']['Species'].notna()]
'''
species_lut_df = get_maris_lut(df_biota=tfm.dfs['biota'], 
                                fname_cache='species_ospar.pkl', 
                                data_provider_name_col='Species',
                                maris_lut=species_lut_path,
                                maris_id='species_id',
                                maris_name='species',
                                unmatched_fixes=unmatched_fixes_biota_species,
                                as_dataframe=True,
                                overwrite=True)

Generating lookup table:   0%|          | 0/156 [00:00<?, ?it/s]

Generating lookup table: 100%|██████████| 156/156 [00:27<00:00,  5.70it/s]


In [None]:
#|eval: false
species_lut_df[species_lut_df['match_score'] > 1]

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,719,Ensis,,4


In [112]:
#| export
class LookupBiotaSpeciesCB(Callback):
    """
    Biota species remapped to MARIS db.
    """
    def __init__(self, fn_lut, unmatched_fixes_biota_species):
        fc.store_attr()

    def __call__(self, tfm):
        # Get the lookup table
        lut = self.fn_lut(df_biota=tfm.dfs['biota'])
        biota_df = tfm.dfs['biota']
        
        # Process biota DataFrame
        biota_df = self._drop_nan_species(biota_df)
        biota_df = self._drop_unmatched_species(biota_df)
        biota_df = self._perform_lookup(biota_df, lut)
        
        # Update the transformer dataframe
        tfm.dfs['biota'] = biota_df

    def _drop_nan_species(self, df):
        """
        Drop rows where 'Species' are 'nan'.
        
        Args:
            df (pd.DataFrame): DataFrame containing biota data.
        
        Returns:
            pd.DataFrame: DataFrame with 'nan' species rows removed.
        """
        return df[df['Species'].notna()]

    def _drop_unmatched_species(self, df):
        """
        Drop rows where the 'Species' value matches entries in the unmatched_fixes_biota_species with 'Not available'.
        
        Args:
            df (pd.DataFrame): DataFrame containing biota data.
        
        Returns:
            pd.DataFrame: DataFrame with 'unmatched' species rows removed.
        """
        na_list = ['Not available']
        na_biota_species = {k for k, v in self.unmatched_fixes_biota_species.items() if v in na_list}
        
        return df[~df['Species'].isin(na_biota_species)]


    def _perform_lookup(self, df, lut):
        """
        Perform lookup to remap species.
        
        Args:
            df (pd.DataFrame): DataFrame containing biota data.
            lut (dict): Lookup table for species remapping.
        
        Returns:
            pd.DataFrame: DataFrame with remapped species.
        """
        df['species'] = df['Species'].apply(lambda x: lut.get(x, Match(-1, None, x, None)).matched_id)
        return df


In [113]:
#| export
get_maris_species = partial(get_maris_lut, 
                fname_cache='species_ospar.pkl', 
                data_provider_name_col='SCIENTIFIC NAME',
                maris_lut=species_lut_path,
                maris_id='species_id',
                maris_name='species',
                unmatched_fixes=unmatched_fixes_biota_species,
                as_dataframe=False,
                overwrite=False)

In [114]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')




                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  13194
Number of dropped rows                                     0   2120
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



***

##### Correct OSPAR `Body Part` labelled as `Whole`

The OSPAR data includes entries with the variable Body Part labelled as `whole`. The Maris data requires that the body `body_part` distinguishes between `Whole animal` and `Whole plant`. The OSPAR data defines the `Biological group` which allows for the Body Part labelled as whole to be defined as `Whole animal` and `Whole plant`. 

In [116]:
#| export
whole_animal_plant = {'whole' : ['Whole','WHOLE', 'WHOLE FISH', 'Whole fisk', 'Whole fish'],
                      'Whole animal' : ['Molluscs','Fish','FISH','molluscs','fish','MOLLUSCS'],
                      'Whole plant' : ['Seaweed','seaweed','SEAWEED'] }

In [119]:
#| export
class CorrectWholeBodyPartCB(Callback):
    """
    Update body parts labeled as 'whole' to either 'Whole animal' or 'Whole plant'.
    """
    
    def __init__(self, wap: Dict[str, List[str]] = whole_animal_plant):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        self.correct_whole_body_part(tfm.dfs['biota'])

    def correct_whole_body_part(self, df: pd.DataFrame):
        df['body_part'] = df['Body Part']   
        self.update_body_part(df, self.wap['whole'], self.wap['Whole animal'], 'Whole animal')
        self.update_body_part(df, self.wap['whole'], self.wap['Whole plant'], 'Whole plant')

    def update_body_part(self, df: pd.DataFrame, whole_list: List[str], group_list: List[str], new_value: str):
        mask = (df['body_part'].isin(whole_list)) & (df['Biological group'].isin(group_list))
        df.loc[mask, 'body_part'] = new_value


In [135]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota']['body_part'].unique())

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  13194
Number of dropped rows                                     0   2120
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

['SOFT PARTS' 'GROWING TIPS' 'Whole plant' 'Whole animal'
 'FLESH WITHOUT BONES' 'WHOLE ANIMAL' 'WHOLE PLANT' 'Soft Parts'
 'Whole without head' 'Cod medallion' 'Muscle'
 'Mix of muscle and whole fish without liver' 'Flesh' 'FLESH WITHOUT BONE'
 'UNKNOWN' 'FLESH' 'FLESH WITH SCALES' 'HEAD' 'Flesh without bones'
 'Soft parts' 'whole plant' 'LIVER' 'MUSCLE']


Get a dataframe of matched OSPAR biota tissues with Maris Bodyparts

In [121]:
#|export
unmatched_fixes_biota_tissues = {}

In [122]:
#|eval: false
tissues_lut_df = get_maris_lut(df_biota=tfm.dfs['biota'], 
                                fname_cache='tissues_ospar.pkl', 
                                data_provider_name_col='body_part',
                                maris_lut=bodyparts_lut_path,
                                maris_id='bodypar_id',
                                maris_name='bodypar',
                                unmatched_fixes=unmatched_fixes_biota_tissues,
                                as_dataframe=True,
                                overwrite=True)
tissues_lut_df

Generating lookup table:   0%|          | 0/23 [00:00<?, ?it/s]

Generating lookup table: 100%|██████████| 23/23 [00:00<00:00, 89.80it/s]


Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mix of muscle and whole fish without liver,52,Flesh without bones,Mix of muscle and whole fish without liver,27
Whole without head,52,Flesh without bones,Whole without head,10
Cod medallion,8,Exoskeleton,Cod medallion,9
UNKNOWN,12,Skin,UNKNOWN,5
FLESH,42,Leaf,FLESH,3
Flesh,42,Leaf,Flesh,3
FLESH WITHOUT BONE,52,Flesh without bones,FLESH WITHOUT BONE,1
LIVER,25,Liver,LIVER,0
whole plant,40,Whole plant,whole plant,0
Soft parts,19,Soft parts,Soft parts,0


List unmatched OSPAR tissues:

In [123]:
#|eval: false
tissues_lut_df[tissues_lut_df['match_score'] >= 1]['source_name'].tolist()

['Mix of muscle and whole fish without liver',
 'Whole without head',
 'Cod medallion',
 'UNKNOWN',
 'FLESH',
 'Flesh',
 'FLESH WITHOUT BONE']

Read Maris tissue lut to correct unmatched tissues:

In [124]:
#|eval: false
marisco_lut_df = pd.read_excel(bodyparts_lut_path())
marisco_lut_df

Unnamed: 0,bodypar_id,bodypar,bodycode,groupcode
0,-1,Not applicable,,
1,0,(Not available),0,0
2,1,Whole animal,WHOA,WHO
3,2,Whole animal eviscerated,WHOEV,WHO
4,3,Whole animal eviscerated without head,WHOHE,WHO
...,...,...,...,...
57,56,Growing tips,GTIP,PHAN
58,57,Upper parts of plants,UPPL,PHAN
59,58,Lower parts of plants,LWPL,PHAN
60,59,Shells/carapace,SHCA,SKEL


Create a dictionary of unmatched tissues to allow for  correctection

In [127]:
#|export
unmatched_fixes_biota_tissues = {
'Mix of muscle and whole fish without liver' : 'Not available', # Drop
 'Whole without head' : 'Whole animal eviscerated without head', # Drop? eviscerated? ,
 'Cod medallion' : 'Whole animal eviscerated without head',
 'FLESH' : 'Flesh without bones', # Drop? with or without bones?
 'Flesh' : 'Flesh without bones', # Drop? with or without bones?
 'UNKNOWN' : 'Not available',
 'FLESH WITHOUT BONE' : 'Flesh without bones'
}

In [128]:
#|eval: false
tissues_lut_df = get_maris_lut(df_biota=tfm.dfs['biota'], 
                                fname_cache='tissues_ospar.pkl', 
                                data_provider_name_col='body_part',
                                maris_lut=bodyparts_lut_path,
                                maris_id='bodypar_id',
                                maris_name='bodypar',
                                unmatched_fixes=unmatched_fixes_biota_tissues,
                                as_dataframe=True,
                                overwrite=True)
tissues_lut_df

Generating lookup table: 100%|██████████| 23/23 [00:00<00:00, 74.97it/s]


Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mix of muscle and whole fish without liver,0,(Not available),Mix of muscle and whole fish without liver,2
UNKNOWN,0,(Not available),UNKNOWN,2
Flesh,52,Flesh without bones,Flesh,0
LIVER,25,Liver,LIVER,0
whole plant,40,Whole plant,whole plant,0
Soft parts,19,Soft parts,Soft parts,0
Flesh without bones,52,Flesh without bones,Flesh without bones,0
HEAD,13,Head,HEAD,0
FLESH WITH SCALES,60,Flesh with scales,FLESH WITH SCALES,0
FLESH,52,Flesh without bones,FLESH,0


List unmatched OSPAR tissues:

In [129]:
#|eval: false
tissues_lut_df[tissues_lut_df['match_score'] >= 1]['source_name'].tolist()

['Mix of muscle and whole fish without liver', 'UNKNOWN']

In [130]:
#| export
class LookupBiotaBodyPartCB(Callback):
    """
    Update body part id based on MARIS dbo_bodypar.xlsx:
        - 3: 'Whole animal eviscerated without head',
        - 12: 'Viscera',
        - 8: 'Skin'
    """

    def __init__(self, fn_lut: Callable, unmatched_fixes_biota_tissues: Dict[str, str]):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        lut = self.fn_lut(df_biota=tfm.dfs['biota'])
        self.drop_nan_species(tfm.dfs['biota'])
        self.drop_unmatched(tfm.dfs['biota'])
        self.perform_lookup(tfm.dfs['biota'], lut)

    def drop_nan_species(self, df: pd.DataFrame):
        """
        Drop rows where 'body_part' is NaN.

        Args:
            df (pd.DataFrame): The DataFrame to process.
        """
        df.dropna(subset=['body_part'], inplace=True)

    def drop_unmatched(self, df: pd.DataFrame):
        """
        Drop rows where the 'body_part' is in the unmatched_fixes_biota_tissues list with value 'Not available'.

        Args:
            df (pd.DataFrame): The DataFrame to process.
        """
        na_list = ['Not available']
        na_biota_tissues = [k for k, v in self.unmatched_fixes_biota_tissues.items() if v in na_list]
        df.drop(df[df['body_part'].isin(na_biota_tissues)].index, inplace=True)

    def perform_lookup(self, df: pd.DataFrame, lut: Dict[str, 'Match']):
        """
        Perform lookup to update 'body_part' with matched IDs.

        Args:
            df (pd.DataFrame): The DataFrame to process.
            lut (Dict[str, Match]): The lookup table.
        """
        df['body_part'] = df['body_part'].apply(lambda x: lut[x].matched_id if x in lut else x)


In [133]:
#|eval: false
get_maris_bodypart=partial(get_maris_lut, 
                            fname_cache='tissues_ospar.pkl', 
                            data_provider_name_col='body_part',
                            maris_lut=bodyparts_lut_path,
                            maris_id='bodypar_id',
                            maris_name='bodypar',
                            unmatched_fixes=unmatched_fixes_biota_tissues,
                            as_dataframe=False,
                            overwrite=False)


In [136]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota'][['Body Part', 'body_part']][:5])

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  13188
Number of dropped rows                                     0   2126
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

      Body Part  body_part
0    SOFT PARTS         19
1  GROWING TIPS         56
2    SOFT PARTS         19
3    SOFT PARTS         19
4  GROWING TIPS         56


***

#### Lookup : Biogroup

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``bio_group``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: Biogroup is not included.*

`get_biogroup_lut` reads the file at `species_lut_path()` and from the contents of this file creates a dictionary linking `species_id` to `biogroup_id`.

In [138]:
#| export
def get_biogroup_lut(maris_lut: str) -> dict:
    """
    Retrieve a lookup table for biogroup ids from a MARIS lookup table.

    Args:
        maris_lut (str): Path to the MARIS lookup table (Excel file).

    Returns:
        dict: A dictionary mapping species_id to biogroup_id.
    """
    species = pd.read_excel(maris_lut)
    return species[['species_id', 'biogroup_id']].set_index('species_id').to_dict()['biogroup_id']


`LookupBiogroupCB` applies the corrected `biota` `bio group` data obtained from the `get_maris_lut` function to the `biota` dataframe in the dictionary of dataframes, `dfs`.

In [142]:
#| export
class LookupBiogroupCB(Callback):
    """
    Update biogroup id based on MARIS dbo_species.xlsx.
    """

    def __init__(self, fn_lut: Callable):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        lut = self.fn_lut()
        self.update_bio_group(tfm.dfs['biota'], lut)

    def update_bio_group(self, df: pd.DataFrame, lut: dict):
        """
        Update the 'bio_group' column in the DataFrame based on the lookup table.

        Args:
            df (pd.DataFrame): The DataFrame to process.
            lut (Dict[str, Any]): The lookup table for updating 'bio_group'.
        """
        df['bio_group'] = df['species'].apply(lambda x: lut.get(x, -1))


In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota'][['Body Part', 'body_part']][:5])

Apply the transformer for callbacks ``LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species)``,``CorrectWholeBodyPartCB()``, ``LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues)``,             ``LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path()))``,   ``CompareDfsAndTfmCB(dfs)`` . Then, print the `bio_group` for the `biota` dataframe.

In [143]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota']['bio_group'].unique())

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  13188
Number of dropped rows                                     0   2126
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

[13 11 14  4  2  5 12]


***

#### Lookup : Taxon Information

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Not included`*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Taxonname`` , ``TaxonRepName``, ``Taxonrank``*

`get_taxonname_lut` reads the file at `species_lut_path()` and from the contents of this file creates a dictionary linking `species_id` to `Taxonname`.

In [144]:
#| export
def get_taxon_info_lut(maris_lut: str) -> dict:
    """
    Retrieve a lookup table for Taxonname from a MARIS lookup table.

    Args:
        maris_lut (str): Path to the MARIS lookup table (Excel file).

    Returns:
        dict: A dictionary mapping species_id to biogroup_id.
    """
    species = pd.read_excel(maris_lut)
    return species[['species_id', 'Taxonname', 'Taxonrank','TaxonDB','TaxonDBID','TaxonDBURL']].set_index('species_id').to_dict()

# TODO include Commonname field after next MARIS data reconciling process.

In [145]:

# | export
class LookupTaxonInformationCB(Callback):
    """Update taxon names based on MARIS species LUT (dbo_species.xlsx)."""
    def __init__(self, fn_lut: Callable[[], dict]):
        """
        Initialize the LookupTaxonNameCB with a function to generate the lookup table.

        Args:
            fn_lut (Callable[[], dict]): Function that returns the lookup table dictionary.
        """
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        """
        Update the 'taxon_name' column in the DataFrame using the lookup table and print unmatched species IDs.

        Args:
            tfm (Transformer): The transformer object containing DataFrames.
        """
        lut = self.fn_lut()
        
        
        self._set_taxon_rep_name(tfm.dfs['biota'])
        tfm.dfs['biota']['Taxonname'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['Taxonname']))
        #df['Commonname'] = df['species'].apply(lambda x: self._get_name_by_species_id(x, lut['Commonname']))
        tfm.dfs['biota']['Taxonrank'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['Taxonrank']))
        tfm.dfs['biota']['TaxonDB'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['TaxonDB']))
        tfm.dfs['biota']['TaxonDBID'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['TaxonDBID']))
        tfm.dfs['biota']['TaxonDBURL'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['TaxonDBURL']))


    def _set_taxon_rep_name(self, df: pd.DataFrame):
        """
        Remap the 'TaxonRepName' column to the 'RUBIN' column values.

        Args:
            df (pd.DataFrame): The DataFrame to modify.
        """
        # Ensure both columns exist before attempting to remap
        if 'RUBIN' in df.columns:
            df['TaxonRepName'] = df['RUBIN']
        else:
            print("Warning: 'RUBIN' column not found in DataFrame.")
            
            

    def _get_name_by_species_id(self, species_id: str, lut: dict) -> str:
        """
        Get the  name from the lookup table and print species ID if the taxon name is not found.

        Args:
            species_id (str): The species ID from the DataFrame.
            lut (dict): The lookup table dictionary.

        Returns:
            str: The name from the lookup table.
        """
        name = lut.get(species_id, 'Unknown')  # Default to 'Unknown' if not found
        if name == 'Unknown':
            print(f"Unmatched species ID: {species_id} for {lut.keys()[0]}")
        return name


In [146]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota'][['Taxonname', 'Taxonrank','TaxonDB','TaxonDBID','TaxonDBURL']].drop_duplicates().head())

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  13188
Number of dropped rows                                     0   2126
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

               Taxonname Taxonrank   TaxonDB TaxonDBID  \
0     Littorina littorea   species  Wikidata    Q27935   
1      Fucus vesiculosus   species  Wikidata   Q754755   
15        Mytilus edulis   species  Wikidata    Q27855   
24       Clupea harengus   species  Wikidata  Q2396858   
28  Merlangius merlangus   species  Wikidata   Q273083   

                                TaxonDBURL  
0     https://www.wikidata.org/wiki/Q27935  
1    https://www.wikidata.org/wiki/Q754755  
15    https://www.wikidata.org/wiki/Q27855  
24  https://www.wikidata.org/wiki/Q2396858  
28   https://www.wikidata.org/wiki/Q273083  


***

#### Lookup : Units

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``unit``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Unit``.*

Create `renaming_unit_rules` to rename the units. 

In [147]:
#| export
# Define unit names renaming rules
renaming_unit_rules = {'Bq/l': 1, #'Bq/m3'
                       'Bq/L': 1,
                       'BQ/L': 1,
                       'Bq/kg f.w.': 5, # Bq/kgw
                       'Bq/kg.fw' : 5,
                       'Bq/kg fw' : 5,
                       'Bq/kg f.w' : 5 
                       } 

In [154]:
#| export
class LookupUnitCB(Callback):
    """
    Update the 'unit' column in DataFrames based on a lookup table.

    The class handles:
    - Assigning a default unit for NaN values in the 'Unit' column for specific groups.
    - Dropping rows with NaN values in the 'Unit' column.
    - Performing lookup to update the 'unit' column based on the provided lookup table.
    """

    def __init__(self, lut: dict = renaming_unit_rules):
        """
        Initialize the LookupUnitCB with a lookup table.

        Args:
            lut (dict): A dictionary used for lookup to update the 'unit' column.
        """
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        """
        Apply the callback to each DataFrame in the transformer.

        Args:
            tfm (Transformer): The transformer containing DataFrames to process.
        """
        for grp in tfm.dfs.keys():
            if grp == 'seawater':
                self._apply_units(tfm.dfs[grp])
            self._drop_na_units(tfm.dfs[grp])
            self._perform_lookup(tfm.dfs[grp])

    def _apply_units(self, df: pd.DataFrame):
        """
        Apply a default unit where the 'Unit' column is NaN.

        Args:
            df (pd.DataFrame): The DataFrame to process.
        """
        df.loc[df['Unit'].isnull(), 'Unit'] = 'Bq/l'

    def _drop_na_units(self, df: pd.DataFrame):
        """
        Drop rows where the 'Unit' column has NaN values.

        Args:
            df (pd.DataFrame): The DataFrame to process.
        """
        df.dropna(subset=['Unit'], inplace=True)

    def _perform_lookup(self, df: pd.DataFrame):
        """
        Perform lookup to update the 'unit' column based on the lookup table.

        Args:
            df (pd.DataFrame): The DataFrame to process.
        """
        df['unit'] = df['Unit'].apply(lambda x: self.lut.get(x, 'Unknown'))


In [155]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupUnitCB(renaming_unit_rules),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota']['unit'].dtypes)

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  15314
Number of dropped rows                                     0      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

int64


***

#### Lookup : Detection limit or Value type

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``detection_limit``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine foramt variable: ``Value type``.*

HEREHGERE HERE

**TODO:** Review OSPAR `">"`? See `tfm.dfs[grp]['Value type'].unique()`

In [160]:
#|eval: false
grp='biota'
tfm.dfs[grp]['Value type'].unique()

array(['=', '<', '>', nan], dtype=object)

In [161]:
# | export
class LookupDetectionLimitCB(Callback):
    """
    Remap activity value, activity uncertainty, and detection limit to MARIS format.

    This class performs the following operations:
    - Reads a lookup table from an Excel file.
    - Copies and processes the 'Value type' column.
    - Fills NaN values with 'Not Available'.
    - Drops rows where 'Value type' is not in the lookup table.
    - Performs a lookup to update the 'detection_limit' column based on the lookup table.
    """

    def __init__(self, lut_path: str):
        """
        Initialize the LookupDetectionLimitCB with a path to the lookup table.

        Args:
            lut_path (str): The path to the Excel file containing the lookup table.
        """
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        """
        Apply the callback to each DataFrame in the transformer.

        Args:
            tfm (Transformer): The transformer containing DataFrames to process.
        """
        lut = self._load_lookup_table()
        for grp in tfm.dfs.keys():
            df = tfm.dfs[grp]
            df = self._copy_and_fill_na(df)
            df = self._correct_greater_than(df)  # Ensure to correct 'Value type' if necessary
            df = self._drop_na_rows(df)
            self._perform_lookup(df, lut)
            tfm.dfs[grp] = df  # Update the DataFrame in the transformer

    def _load_lookup_table(self) -> dict:
        """
        Load the lookup table from the Excel file and create a mapping dictionary.

        Returns:
            dict: A dictionary mapping value types to detection limits.
        """
        df = pd.read_excel(self.lut_path)
        df = df.astype({'id': 'int'})
        return dict((v, k) for k, v in df.set_index('id')['name'].to_dict().items())

    def _correct_greater_than(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Correct the 'Value type' where it is '>' by changing it to '<'.

        Args:
            df (pd.DataFrame): The DataFrame to process.

        Returns:
            pd.DataFrame: The DataFrame with corrected 'Value type'.
        """
        df.loc[df['Value type'] == '>', 'Value type'] = '<'
        return df

    def _copy_and_fill_na(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Copy the 'Value type' column and fill NaN values with 'Not Available'.

        Args:
            df (pd.DataFrame): The DataFrame to process.

        Returns:
            pd.DataFrame: The DataFrame with updated 'val_type' column.
        """
        df['val_type'] = df['Value type']
        df['val_type'].fillna('Not Available', inplace=True)
        return df

    def _drop_na_rows(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Drop rows where the 'val_type' column has values not in the lookup table.

        Args:
            df (pd.DataFrame): The DataFrame to process.

        Returns:
            pd.DataFrame: The DataFrame with rows dropped where 'val_type' is not in the lookup table.
        """
        return df[df['val_type'].isin(self.lut.keys())]

    def _perform_lookup(self, df: pd.DataFrame, lut: dict):
        """
        Perform lookup to update the 'detection_limit' column based on the lookup table.

        Args:
            df (pd.DataFrame): The DataFrame to process.
            lut (dict): The lookup table dictionary.
        """
        df['detection_limit'] = df['val_type'].apply(lambda x: lut.get(x, 'Unknown'))


In [159]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
tfm.dfs['seawater'][['detection_limit','Value type']]

AttributeError: 'LookupDetectionLimitCB' object has no attribute 'lut'

### Lon, Lat 

In [None]:
# | export
class ConvertLonLatCB(Callback):
    "Convert Longitude and Latitude values to DDD.DDDDD°"
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm):
        for grp in tfm.dfs.keys():
            tfm.dfs[grp]['latitude'] = np.where(tfm.dfs[grp]['LatDir'].isin(['S']), ((tfm.dfs[grp]['LatD'] + tfm.dfs[grp]['LatM']/60 + tfm.dfs[grp]['LatS'] /(60*60))* (-1)), (tfm.dfs[grp]['LatD'] + tfm.dfs[grp]['LatM']/60 + tfm.dfs[grp]['LatS'] /(60*60)))
            tfm.dfs[grp]['longitude'] = np.where(tfm.dfs[grp]['LongDir'].isin(['W']), ((tfm.dfs[grp]['LongD'] + tfm.dfs[grp]['LongM']/60 + tfm.dfs[grp]['LongS'] /(60*60))* (-1)), (tfm.dfs[grp]['LongD'] + tfm.dfs[grp]['LongM']/60 + tfm.dfs[grp]['LongS'] /(60*60)))

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            ConvertLonLatCB()
                            ])
tfm()['seawater'][['latitude','LatD', 'LatM', 'LatS', 'longitude', 'LatDir', 'LongD', 'LongM','LongS', 'LongDir']]

Unnamed: 0,latitude,LatD,LatM,LatS,longitude,LatDir,LongD,LongM,LongS,LongDir
0,51.375278,51.0,22.0,31.0,3.188056,N,3.0,11.0,17.0,E
1,51.223611,51.0,13.0,25.0,2.859444,N,2.0,51.0,34.0,E
2,51.184444,51.0,11.0,4.0,2.713611,N,2.0,42.0,49.0,E
3,51.420278,51.0,25.0,13.0,3.262222,N,3.0,15.0,44.0,E
4,51.416111,51.0,24.0,58.0,2.809722,N,2.0,48.0,35.0,E
...,...,...,...,...,...,...,...,...,...,...
18851,56.011111,56.0,0.0,40.0,-3.406667,N,3.0,24.0,24.0,W
18852,56.011111,56.0,0.0,40.0,-3.406667,N,3.0,24.0,24.0,W
18853,53.413333,53.0,24.0,48.0,-3.870278,N,3.0,52.0,13.0,W
18854,53.569722,53.0,34.0,11.0,-3.769722,N,3.0,46.0,11.0,W


### Encode time (seconds since ...)

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            ConvertLonLatCB(),
                            EncodeTimeCB(cfg())
                            ])
tfm()['seawater']['time']

0        1264550400
1        1264550400
2        1264550400
3        1264550400
4        1264464000
            ...    
18851    1619654400
18852    1639094400
18853    1617753600
18854    1617753600
18855    1617753600
Name: time, Length: 15652, dtype: int64

### Compare DFS and TFM data

In [None]:
# | export
class CompareDfsAndTfm(Callback):
    "Create a dfs of dropped data. Data included in the DFS not in the TFM"
    def __init__(self, dfs_compare):
        fc.store_attr()

    def __call__(self, tfm):
        tfm.dfs_dropped={}
        tfm.compare_stats={}
        for grp in tfm.dfs.keys():
            dfs_all = self.dfs_compare[grp].merge(tfm.dfs[grp], on=self.dfs_compare[grp].columns.to_list(), how='left', indicator=True)
            tfm.dfs_dropped[grp]=dfs_all[dfs_all['_merge'] == 'left_only']  
            tfm.compare_stats[grp]= {'Number of rows dfs:' : len(self.dfs_compare[grp].index),
                                     'Number of rows tfm.dfs:' : len(tfm.dfs[grp].index),
                                     'Number of dropped rows:' : len(tfm.dfs_dropped[grp].index),
                                     'Number of rows tfm.dfs + Number of dropped rows:' : len(tfm.dfs[grp].index) + len(tfm.dfs_dropped[grp].index)
                                    }

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            ConvertLonLatCB(),
                            EncodeTimeCB(cfg()),
                            CompareDfsAndTfm(dfs)
                            ])
tfm()
tfm.compare_stats

{'seawater': {'Number of rows dfs:': 18856,
  'Number of rows tfm.dfs:': 15652,
  'Number of dropped rows:': 3204,
  'Number of rows tfm.dfs + Number of dropped rows:': 18856},
 'biota': {'Number of rows dfs:': 15314,
  'Number of rows tfm.dfs:': 13171,
  'Number of dropped rows:': 2143,
  'Number of rows tfm.dfs + Number of dropped rows:': 15314}}

In [None]:
#|eval: false
dfs_dropped_biota=tfm.dfs_dropped['biota']
dfs_dropped_seawater=tfm.dfs_dropped['seawater']

### Rename columns

In [None]:
#| export
# Define columns of interest by sample type
coi_grp = {'seawater': ['nuclide', 'Activity or MDA', 'Uncertainty','detection_limit','unit', 'time', 'Sampling depth',
                        'latitude', 'longitude', 'Sample ID'],
           'biota': ['nuclide', 'Activity or MDA', 'Uncertainty','detection_limit','unit', 'time', 'latitude', 'longitude', 'Sample ID',
                     'species', 'body_part', 'bio_group']}

In [None]:
#| export
def get_renaming_rules():
    vars = cdl_cfg()['vars']
    # Define column names renaming rules
    return {
        'Activity or MDA': 'value',
        'Uncertainty': vars['suffixes']['uncertainty']['name'],
        'Sampling depth': vars['defaults']['smp_depth']['name'],
        'latitude': vars['defaults']['lat']['name'],
        'longitude': vars['defaults']['lon']['name'],
        'unit': vars['suffixes']['unit']['name'],
        'detection_limit': vars['suffixes']['detection_limit']['name']
    }

In [None]:
#| export
class RenameColumnCB(Callback):
    def __init__(self,
                 coi,
                 fn_renaming_rules,
                ):
        fc.store_attr()

    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            # Select cols of interest
            tfm.dfs[k] = tfm.dfs[k].loc[:, self.coi[k]]

            # Rename cols
            tfm.dfs[k].rename(columns=self.fn_renaming_rules(), inplace=True)

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            ConvertLonLatCB(),
                            EncodeTimeCB(cfg()),
                            #CompareDfsAndTfm(dfs),
                            RenameColumnCB(coi_grp, get_renaming_rules)
                            ])                            
                            
tfm()['seawater']

Unnamed: 0,nuclide,value,_unc,_dl,_unit,time,smp_depth,lat,lon,Sample ID
0,cs137,0.20000,,3,1,1264550400,3.0,51.375278,3.188056,WNZ 01
1,cs137,0.27000,,3,1,1264550400,3.0,51.223611,2.859444,WNZ 02
2,cs137,0.26000,,3,1,1264550400,3.0,51.184444,2.713611,WNZ 03
3,cs137,0.25000,,3,1,1264550400,3.0,51.420278,3.262222,WNZ 04
4,cs137,0.20000,,3,1,1264464000,3.0,51.416111,2.809722,WNZ 05
...,...,...,...,...,...,...,...,...,...,...
18851,h3,1.00000,,3,1,1619654400,0.0,56.011111,-3.406667,2100318
18852,h3,1.05000,,3,1,1639094400,0.0,56.011111,-3.406667,2101399
18853,cs137,0.00431,0.000543,2,1,1617753600,0.0,53.413333,-3.870278,21-656
18854,cs137,0.00946,0.000253,2,1,1617753600,0.0,53.569722,-3.769722,21-657


### ReshapeLongToWide

In [None]:
#| export
class ReshapeLongToWide(Callback):
    "Convert data from long to wide with renamed columns."
    def __init__(self, columns='nuclide', values=['value']):
        fc.store_attr()
        # Retrieve all possible derived vars (e.g 'unc', 'dl', ...) from configs
        self.derived_cols = [value['name'] for value in cdl_cfg()['vars']['suffixes'].values()]
    
    def renamed_cols(self, cols):
        "Flatten columns name"
        return [inner if outer == "value" else f'{inner}{outer}'
                if inner else outer
                for outer, inner in cols]

    def pivot(self, df):
        # Among all possible 'derived cols' select the ones present in df
        derived_coi = [col for col in self.derived_cols if col in df.columns]
        
        df.reset_index(names='sample', inplace=True)
        
        idx = list(set(df.columns) - set([self.columns] + derived_coi + self.values))
        return df.pivot_table(index=idx,
                              columns=self.columns,
                              values=self.values + derived_coi,
                              fill_value=np.nan,
                              aggfunc=lambda x: x
                              ).reset_index()

    def __call__(self, tfm):
        for k in tfm.dfs.keys():
            tfm.dfs[k] = self.pivot(tfm.dfs[k])
            tfm.dfs[k].columns = self.renamed_cols(tfm.dfs[k].columns)

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            ConvertLonLatCB(),
                            EncodeTimeCB(cfg()),
                            #CompareDfsAndTfm(dfs),
                            RenameColumnCB(coi_grp, get_renaming_rules),
                            ReshapeLongToWide()
                            ])                            
                            
tfm()['biota']

Unnamed: 0,time,sample,lat,body_part,lon,bio_group,Sample ID,species,am241_dl,cs137_dl,...,am241,cs137,h3,pb210,po210,pu238,pu239_240_tot,ra226,ra228,tc99
0,789264000,15307,54.455000,19,-3.566111,13,1995001077,394,,2.0,...,,7.697,,,,,,,,
1,789264000,15308,54.455000,19,-3.566111,13,1995001082,394,,,...,,,,,9.58,,,,,
2,789264000,15309,54.455000,19,-3.566111,13,1995001077,394,,,...,,,,,,,,,,838.0
3,789350400,15300,54.348056,52,7.566667,4,14941,99,,2.0,...,,0.637,,,,,,,,
4,789350400,15301,54.289722,52,7.496667,4,14940,50,,2.0,...,,0.611,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7992,1640908800,1,54.968889,56,-3.240556,11,2200081,96,,,...,,,,,,,,,,39.0
7993,1640908800,2,58.565833,19,-3.791389,13,2200093,394,,,...,,,,,,,0.0938,,,
7994,1640908800,3,58.618611,19,-3.647778,13,2200089,394,,,...,,,,,,,1.5400,,,
7995,1640908800,4,55.964722,56,-2.398056,11,2100074,96,,,...,,,,,,,,,,16.0


### Sanitize coordinates

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                            CorrectWholeBodyPartCB(),
                            LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupUnitCB(renaming_unit_rules),
                            LookupDetectionLimitCB(detection_limit_lut_path()),
                            ConvertLonLatCB(),
                            EncodeTimeCB(cfg()),
                            #CompareDfsAndTfm(dfs),
                            RenameColumnCB(coi_grp, get_renaming_rules),
                            ReshapeLongToWide(),
                            SanitizeLonLatCB()
                            ])                            
                            
tfm()['seawater'].head()

Unnamed: 0,time,smp_depth,sample,lat,lon,Sample ID,cs137_dl,h3_dl,po210_dl,pu239_240_tot_dl,...,ra226_unit,ra228_unit,tc99_unit,cs137,h3,po210,pu239_240_tot,ra226,ra228,tc99
0,789955200,0.0,7295,54.866944,-5.8,1995001221,2.0,,,,...,,,,0.049846,,,,,,
1,790041600,0.0,7334,54.488889,-3.606944,1995001407,,2.0,,,...,,,,,5.26388,,,,,
2,791251200,0.0,7326,54.872778,-3.594444,1995001334,,,,2.0,...,,,,,,,0.00468,,,
3,791424000,0.0,5791,54.001111,8.1,1995060,2.0,,,,...,,,,0.0049,,,,,,
4,791769600,0.0,7322,54.488889,-3.606944,1995001336,2.0,,,,...,,,,0.197523,,,,,,


## NetCDF encoder

### Example change logs

In [None]:
#|eval: false
tfm.logs

['Drop NaN nuclide names, convert nuclide names to lowercase, strip separators (e.g. `-`,`,`) and any trailing space(s)',
 'Remap to MARIS radionuclide names.',
 '\n    Biota species remapped to MARIS db:\n\n    ',
 "\n    Update bodypart labeled as 'whole' to either 'Whole animal' or 'Whole plant'.\n    ",
 "\n    Update bodypart id based on MARIS dbo_bodypar.xlsx:\n        - 3: 'Whole animal eviscerated without head',\n        - 12: 'Viscera',\n        - 8: 'Skin'\n    ",
 '\n    Update biogroup id  based on MARIS dbo_species.xlsx\n    ',
 'Remamp activity value, activity uncertainty and detection limit to MARIS format.',
 'Convert Longitude and Latitude values to DDD.DDDDD°',
 'Encode time as `int` representing seconds since xxx',
 'Convert data from long to wide with renamed columns.',
 'Drop row when both longitude & latitude equal 0. Drop unrealistic longitude & latitude values. Convert longitude & latitude `,` separator to `.` separator.']

### Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


In [None]:
#| export
def get_attrs(tfm, zotero_key, kw=kw):
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(cfg()),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()

Attributes related to the dataset are retrieved from [zotero](https://www.zotero.org/) using a zotero_key. The [MARIS datasets](https://maris.iaea.org/datasets) include a library on [zotero](https://www.zotero.org/groups/2432820/maris/library):

In [None]:
#|eval: false
get_attrs(tfm, zotero_key='LQRA4MMK', kw=kw)

{'geospatial_lat_min': '40.0',
 'geospatial_lat_max': '79.40833333333335',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 40 36.181666666666665, 40 79.40833333333335, -58.23166666666667 79.40833333333335, -58.23166666666667 36.181666666666665))',
 'time_coverage_start': '1995-01-05T00:00:00',
 'time_coverage_end': '2021-12-31T00:00:00',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth Science > Oceans > Ocean Chemistry, Earth S

In [None]:
#| export
def enums_xtra(tfm, vars):
    "Retrieve a subset of the lengthy enum as 'species_t' for instance"
    enums = Enums(lut_src_dir=lut_path(), cdl_enums=cdl_cfg()['enums'])
    xtras = {}
    for var in vars:
        unique_vals = tfm.unique(var)
        if unique_vals.any():
            xtras[f'{var}_t'] = enums.filter(f'{var}_t', unique_vals)
    return xtras

### Encoding

In [None]:
#| export
def encode(fname_in, fname_out, nc_tpl_path, **kwargs):
    dfs = load_data(fname_in)
    tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(get_rdn_format),
                                RemapRdnNameCB(),
                                ParseTimeCB(),
                                LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                                CorrectWholeBodyPartCB(),
                                LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                                LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                                LookupUnitCB(renaming_unit_rules),
                                LookupDetectionLimitCB(detection_limit_lut_path()),
                                ConvertLonLatCB(),
                                EncodeTimeCB(cfg()),
                                #CompareDfsAndTfm(dfs),
                                RenameColumnCB(coi_grp, get_renaming_rules),
                                ReshapeLongToWide(),
                                SanitizeLonLatCB()
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            src_fname=nc_tpl_path,
                            dest_fname=fname_out, 
                            global_attrs=get_attrs(tfm, zotero_key='LQRA4MMK', kw=kw),
                            verbose=kwargs.get('verbose', False),
                            enums_xtra=enums_xtra(tfm, vars=['species', 'body_part'])
                           )
    encoder.encode()

In [None]:
#|eval: false
encode(fname_in, fname_out, nc_tpl_path(), verbose=False)

# Review data

In [None]:
#|eval: false
import xarray as xr
from netCDF4 import Dataset

In [None]:
#|eval: false
def netcdf4_to_df(fname_in):
    # read nc file
    netcdf4_data = Dataset(fname_in, "r")
    # Create dictionary of dataframes
    dfs={}
    for group in (netcdf4_data.groups.keys()):
        ds = xr.open_dataset(fname_in, group=group,  decode_times=False)
        dfs[group]=ds.to_dataframe()
    netcdf4_data.close()
    return(dfs)

In [None]:
#|eval: false
dfs = netcdf4_to_df(fname_out)
dfs_biota=dfs['biota']
dfs_seawater=dfs['seawater']

In [None]:
#|eval: false
dfs_biota.columns

Index(['sample', 'lon', 'lat', 'time', 'bio_group', 'species', 'body_part',
       'h3', 'h3_dl', 'h3_unit', 'tc99', 'tc99_unc', 'tc99_dl', 'tc99_unit',
       'cs137', 'cs137_unc', 'cs137_dl', 'cs137_unit', 'pb210', 'pb210_unc',
       'pb210_dl', 'pb210_unit', 'po210', 'po210_unc', 'po210_dl',
       'po210_unit', 'ra226', 'ra226_unc', 'ra226_dl', 'ra226_unit', 'ra228',
       'ra228_unc', 'ra228_dl', 'ra228_unit', 'pu238', 'pu238_unc', 'pu238_dl',
       'pu238_unit', 'am241', 'am241_unc', 'am241_dl', 'am241_unit',
       'pu239_240_tot', 'pu239_240_tot_unc', 'pu239_240_tot_dl',
       'pu239_240_tot_unit'],
      dtype='object')

In [None]:
#|eval: false
dfs_biota

Unnamed: 0_level_0,sample,lon,lat,time,bio_group,species,body_part,h3,h3_dl,h3_unit,...,pu238_dl,pu238_unit,am241,am241_unc,am241_dl,am241_unit,pu239_240_tot,pu239_240_tot_unc,pu239_240_tot_dl,pu239_240_tot_unit
biota,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,15307,-3.566111,54.455002,789264000,13,394,19,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1
1,15308,-3.566111,54.455002,789264000,13,394,19,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1
2,15309,-3.566111,54.455002,789264000,13,394,19,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1
3,15300,7.566667,54.348057,789350400,4,99,52,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1
4,15301,7.496666,54.289722,789350400,4,50,52,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7992,1,-3.240556,54.968887,1640908800,11,96,56,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1
7993,2,-3.791389,58.565834,1640908800,13,394,19,,-1,-1,...,-1,-1,,,-1,-1,0.0938,0.018,2,5
7994,3,-3.647778,58.618610,1640908800,13,394,19,,-1,-1,...,-1,-1,,,-1,-1,1.5400,0.310,2,5
7995,4,-2.398056,55.964722,1640908800,11,96,56,,-1,-1,...,-1,-1,,,-1,-1,,,-1,-1


In [None]:
#|eval: false
dfs_seawater

Unnamed: 0_level_0,sample,lon,lat,smp_depth,time,h3,h3_unc,h3_dl,h3_unit,tc99,...,ra226_dl,ra226_unit,ra228,ra228_unc,ra228_dl,ra228_unit,pu239_240_tot,pu239_240_tot_unc,pu239_240_tot_dl,pu239_240_tot_unit
seawater,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,7295,-5.800000,54.866943,0.0,789955200,,,-1,-1,,...,-1,-1,,,-1,-1,,,-1,-1
1,7334,-3.606945,54.488888,0.0,790041600,5.26388,0.99701,2,1,,...,-1,-1,,,-1,-1,,,-1,-1
2,7326,-3.594445,54.872776,0.0,791251200,,,-1,-1,,...,-1,-1,,,-1,-1,0.00468,0.000077,2,1
3,5791,8.100000,54.001110,0.0,791424000,,,-1,-1,,...,-1,-1,,,-1,-1,,,-1,-1
4,7322,-3.606945,54.488888,0.0,791769600,,,-1,-1,,...,-1,-1,,,-1,-1,,,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10085,18517,3.565556,51.411945,1.0,1638748800,,,-1,-1,,...,-1,-1,0.0165,0.0066,2,1,,,-1,-1
10086,18852,-3.406667,56.011112,0.0,1639094400,1.05000,,3,1,,...,-1,-1,,,-1,-1,,,-1,-1
10087,18789,-5.800000,54.866943,0.0,1639526400,,,-1,-1,,...,-1,-1,,,-1,-1,,,-1,-1
10088,18800,-5.800000,54.866943,0.0,1639526400,,,-1,-1,0.000351,...,-1,-1,,,-1,-1,,,-1,-1
