In [None]:
#| default_exp handlers.helcom

# HELCOM

> Data pipeline (handler) to convert HELCOM data ([source](https://helcom.fi/about-us)) to `NetCDF` format or `Open Refine` format.  

:::{.callout-tip}

For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)](https://github.com/franckalbinet/marisco/tree/main/install_configure_guide) for detailed information.

:::

## Processing HELCOM MORS Environment Data

<!-- ## HELCOM MORS Environment database -->

[HELCOM MORS data](https://helcom.fi/about-us) is provided as a Microsoft Access database. 
[`Mdbtools`](https://github.com/mdbtools/mdbtools) can be used to convert the tables of the Microsoft Access database to `.csv` files on Unix-like operating systems.

Example steps:
1. Download data (e.g. https://metadata.helcom.fi/geonetwork/srv/fin/catalog.search#/metadata/2fdd2d46-0329-40e3-bf96-cb08c7206a24). 
2. Install mdbtools from the VS Code terminal:

    ```
    sudo apt-get -y install mdbtools
    ```

3. Install unzip from the VS Code terminal:

    ```
    sudo apt-get -y install unzip
    ```

4. In the VS Code terminal, navigate to the marisco data folder:

    ```
    cd /home/marisco/downloads/marisco/_data/accdb/mors_19840101_20211231
    ```

5. Unzip MORS_ENVIRONMENT.zip 

    ```
    unzip MORS_ENVIRONMENT.zip 
    ```

6. Run `preprocess.sh` to generate the required data files:

    ```
    ./preprocess.sh MORS_ENVIRONMENT.zip
    ```

7. Content of the `preprocess.sh` script:
    ```
    #!/bin/bash

    # Example of use: ./preprocess.sh MORS_ENVIRONMENT.zip
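    # -o overwrites files already extracted (e.g. in step 5) without prompting
    unzip -o "$1"
    dbname=$(ls *.accdb)
    # -p avoids an error if the csv folder already exists
    mkdir -p csv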
    for table in $(mdb-tables -1 "$dbname"); do
        echo "Export table $table"
        mdb-export "$dbname" "$table" > "csv/$table.csv"
    done
    ```

## Packages import

In [None]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
#| export
import pandas as pd 
import numpy as np
from tqdm import tqdm 
from functools import partial 
import fastcore.all as fc 
from pathlib import Path 
from dataclasses import asdict
from typing import List, Dict, Callable, Tuple
from math import modf
from collections import OrderedDict

from marisco.utils import (has_valid_varname, match_worms, match_maris_lut, Match)
from marisco.callbacks import (Callback, Transformer, EncodeTimeCB, 
                               SanitizeLonLatCB, ReshapeLongToWide, CompareDfsAndTfmCB)
from marisco.metadata import (GlobAttrsFeeder, BboxCB, DepthRangeCB, 
                              TimeRangeCB, ZoteroCB, KeyValuePairCB)
from marisco.configs import (nuc_lut_path, nc_tpl_path, cfg, cache_path, 
                             cdl_cfg, Enums, lut_path, species_lut_path, 
                             sediments_lut_path, bodyparts_lut_path, 
                             detection_limit_lut_path, filtered_lut_path, area_lut_path)
from marisco.serializers import NetCDFEncoder,  OpenRefineCsvEncoder

import warnings
warnings.filterwarnings('ignore')

## Configuration and file paths

- **fname_in**: path to the folder containing the HELCOM data in CSV format. The path can be defined as a relative path. 

- **fname_out_nc**: path and filename for the NetCDF output. The path can be defined as a relative path. 

- **fname_out_csv**: path and filename for the Open Refine CSV output. The path can be defined as a relative path.

- **Zotero key**: used to retrieve attributes related to the dataset from [Zotero](https://www.zotero.org/). The MARIS datasets include a [library](https://maris.iaea.org/datasets) available on [Zotero](https://www.zotero.org/groups/2432820/maris/library). 

- **ref_id**: refers to the location in Archive of the Zotero library.


In [None]:
# | export
fname_in = '../../_data/accdb/mors/csv'
fname_out_nc = '../../_data/output/100-HELCOM-MORS-2024.nc'
fname_out_csv = '../../_data/output/100-HELCOM-MORS-2024.csv'
zotero_key ='26VMZZ2Q'
ref_id = 100

## Utils

Load HELCOM data and return the data in a Python dictionary of dataframes with the dictionary key as the sample type.

In [None]:
#| exports
default_smp_types = [('SEA', 'seawater'), ('SED', 'sediment'), ('BIO', 'biota')]

In [None]:
#| exports
def load_data(src_dir: str | Path, # The directory where the source CSV files are located
              smp_types: List = default_smp_types # A list of tuples, each containing the file prefix and the corresponding sample type name
             ) -> Dict[str, pd.DataFrame]: # A dictionary with sample types as keys and their corresponding dataframes as values
    "Load HELCOM data and return the data in a dictionary of dataframes with the dictionary key as the sample type."
    src_path = Path(src_dir)
    
    def load_and_merge(file_prefix: str) -> pd.DataFrame:
        try:
            df_meas = pd.read_csv(src_path / f'{file_prefix}02.csv')
            df_smp = pd.read_csv(src_path / f'{file_prefix}01.csv')
            return pd.merge(df_meas, df_smp, on='KEY', how='left')
        except FileNotFoundError as e:
            print(f"Error loading files for {file_prefix}: {e}")
            return pd.DataFrame()  # Return an empty DataFrame if files are not found
    
    return {smp_type: load_and_merge(file_prefix) for file_prefix, smp_type in smp_types}
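The left merge at the heart of `load_data` can be illustrated on toy data. This is only a sketch: the keys, station names and dates below are made up, and the only assumption carried over from the function above is that each measurement row (the `02` file) joins to one sample row (the `01` file) through `KEY`.

```python
import pandas as pd

# Toy sample table (one row per sample) and measurement table (one row per nuclide measured).
df_smp = pd.DataFrame({'KEY': ['W1', 'W2'],
                       'STATION': ['A', 'B'],
                       'DATE_OF_ENTRY': ['d1', 'd2']})
df_meas = pd.DataFrame({'KEY': ['W1', 'W1', 'W2'],
                        'NUCLIDE': ['CS137', 'SR90', 'CS137'],
                        'DATE_OF_ENTRY': ['d1', 'd1', 'd2']})

# Every measurement row is kept; sample metadata is repeated for each measurement.
merged = pd.merge(df_meas, df_smp, on='KEY', how='left')
print(merged.columns.tolist())
```

Columns present in both tables (here `DATE_OF_ENTRY`) receive the `_x` and `_y` suffixes, which is why `DATE_OF_ENTRY_x` and `DATE_OF_ENTRY_y` appear in the loaded dataframes.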

## Transformation pipeline


### Load data

`dfs` is a dictionary of dataframes created from the HELCOM dataset located at the path `fname_in`. Each dataframe holds the data for one sample type, and its dictionary key is the name of that sample type. 

In [None]:
#| eval: false
dfs = load_data(fname_in)
print(dfs.keys())
print(f"Seawater cols: {dfs['seawater'].columns}")
print(f"Sediment cols: {dfs['sediment'].columns}")
print(f"Biota cols: {dfs['biota'].columns}")

dict_keys(['seawater', 'sediment', 'biota'])
Seawater cols: Index(['KEY', 'NUCLIDE', 'METHOD', '< VALUE_Bq/m³', 'VALUE_Bq/m³', 'ERROR%_m³',
       'DATE_OF_ENTRY_x', 'COUNTRY', 'LABORATORY', 'SEQUENCE', 'DATE', 'YEAR',
       'MONTH', 'DAY', 'STATION', 'LATITUDE (ddmmmm)', 'LATITUDE (dddddd)',
       'LONGITUDE (ddmmmm)', 'LONGITUDE (dddddd)', 'TDEPTH', 'SDEPTH', 'SALIN',
       'TTEMP', 'FILT', 'MORS_SUBBASIN', 'HELCOM_SUBBASIN', 'DATE_OF_ENTRY_y'],
      dtype='object')
Sediment cols: Index(['KEY', 'NUCLIDE', 'METHOD', '< VALUE_Bq/kg', 'VALUE_Bq/kg', 'ERROR%_kg',
       '< VALUE_Bq/m²', 'VALUE_Bq/m²', 'ERROR%_m²', 'DATE_OF_ENTRY_x',
       'COUNTRY', 'LABORATORY', 'SEQUENCE', 'DATE', 'YEAR', 'MONTH', 'DAY',
       'STATION', 'LATITUDE (ddmmmm)', 'LATITUDE (dddddd)',
       'LONGITUDE (ddmmmm)', 'LONGITUDE (dddddd)', 'DEVICE', 'TDEPTH',
       'UPPSLI', 'LOWSLI', 'AREA', 'SEDI', 'OXIC', 'DW%', 'LOI%',
       'MORS_SUBBASIN', 'HELCOM_SUBBASIN', 'SUM_LINK', 'DATE_OF_ENTRY_y'],
      dty

Show the structure of the `seawater` dataframe:

In [None]:
#| eval: false
dfs['seawater'].head()

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/m³,VALUE_Bq/m³,ERROR%_m³,DATE_OF_ENTRY_x,COUNTRY,LABORATORY,SEQUENCE,...,LONGITUDE (ddmmmm),LONGITUDE (dddddd),TDEPTH,SDEPTH,SALIN,TTEMP,FILT,MORS_SUBBASIN,HELCOM_SUBBASIN,DATE_OF_ENTRY_y
0,WKRIL2012003,CS137,,,5.3,32.0,08/20/14 00:00:00,90.0,KRIL,2012003.0,...,29.2,29.3333,,0.0,,,,11.0,11.0,08/20/14 00:00:00
1,WKRIL2012004,CS137,,,19.9,20.0,08/20/14 00:00:00,90.0,KRIL,2012004.0,...,29.2,29.3333,,29.0,,,,11.0,11.0,08/20/14 00:00:00
2,WKRIL2012005,CS137,,,25.5,20.0,08/20/14 00:00:00,90.0,KRIL,2012005.0,...,23.09,23.15,,0.0,,,,11.0,3.0,08/20/14 00:00:00
3,WKRIL2012006,CS137,,,17.0,29.0,08/20/14 00:00:00,90.0,KRIL,2012006.0,...,27.59,27.9833,,0.0,,,,11.0,11.0,08/20/14 00:00:00
4,WKRIL2012007,CS137,,,22.2,18.0,08/20/14 00:00:00,90.0,KRIL,2012007.0,...,27.59,27.9833,,39.0,,,,11.0,11.0,08/20/14 00:00:00


Show the structure of the `biota` dataframe:

In [None]:
#| eval: false
dfs['biota'].head()

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/kg,VALUE_Bq/kg,BASIS,ERROR%,NUMBER,DATE_OF_ENTRY_x,COUNTRY,...,BIOTATYPE,TISSUE,NO,LENGTH,WEIGHT,DW%,LOI%,MORS_SUBBASIN,HELCOM_SUBBASIN,DATE_OF_ENTRY_y
0,BVTIG2012041,CS134,VTIG01,<,0.01014,W,,,02/27/14 00:00:00,6.0,...,F,5,16.0,45.7,948.0,18.453,92.9,2.0,16,02/27/14 00:00:00
1,BVTIG2012041,K40,VTIG01,,135.3,W,3.57,,02/27/14 00:00:00,6.0,...,F,5,16.0,45.7,948.0,18.453,92.9,2.0,16,02/27/14 00:00:00
2,BVTIG2012041,CO60,VTIG01,<,0.01398,W,,,02/27/14 00:00:00,6.0,...,F,5,16.0,45.7,948.0,18.453,92.9,2.0,16,02/27/14 00:00:00
3,BVTIG2012041,CS137,VTIG01,,4.338,W,3.48,,02/27/14 00:00:00,6.0,...,F,5,16.0,45.7,948.0,18.453,92.9,2.0,16,02/27/14 00:00:00
4,BVTIG2012040,CS134,VTIG01,<,0.009614,W,,,02/27/14 00:00:00,6.0,...,F,5,17.0,45.9,964.0,18.458,92.9,2.0,16,02/27/14 00:00:00


Show the structure of the `sediment` dataframe: 

In [None]:
#| eval: false
dfs['sediment'].head()

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/kg,VALUE_Bq/kg,ERROR%_kg,< VALUE_Bq/m²,VALUE_Bq/m²,ERROR%_m²,DATE_OF_ENTRY_x,...,LOWSLI,AREA,SEDI,OXIC,DW%,LOI%,MORS_SUBBASIN,HELCOM_SUBBASIN,SUM_LINK,DATE_OF_ENTRY_y
0,SKRIL2012048,RA226,,,35.0,26.0,,,,08/20/14 00:00:00,...,20.0,0.006,,,,,11.0,11.0,,08/20/14 00:00:00
1,SKRIL2012049,RA226,,,36.0,22.0,,,,08/20/14 00:00:00,...,27.0,0.006,,,,,11.0,11.0,,08/20/14 00:00:00
2,SKRIL2012050,RA226,,,38.0,24.0,,,,08/20/14 00:00:00,...,2.0,0.006,,,,,11.0,11.0,,08/20/14 00:00:00
3,SKRIL2012051,RA226,,,36.0,25.0,,,,08/20/14 00:00:00,...,4.0,0.006,,,,,11.0,11.0,,08/20/14 00:00:00
4,SKRIL2012052,RA226,,,30.0,23.0,,,,08/20/14 00:00:00,...,6.0,0.006,,,,,11.0,11.0,,08/20/14 00:00:00


### Define Sample Type 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Included as netcdf.group*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``type``.*

In [None]:
#| exports
type_lut = {
    'SEAWATER' : 1,
    'BIOTA' : 2,
    'SEDIMENT' : 3
}

In [None]:
#| exports
class GetSampleTypeCB(Callback):
    def __init__(self, type_lut: Dict[str, int]):
        "Set the sample type column in the DataFrames based on a lookup table."
        self.type_lut = type_lut

    def __call__(self, tfm):
        "Apply the sample type lookup to DataFrames in the transformer."
        for key, df in tfm.dfs.items():
            df['samptype_id'] = self._get_sample_type(key)

    def _get_sample_type(self, group_name: str) -> int:
        "Determine the sample type for a given group name using the lookup table."
        return self.type_lut.get(group_name.upper(), 0)  # Default to 0 if not found
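The lookup performed by `_get_sample_type` can be tried in isolation. The sketch below uses a hypothetical standalone helper, `sample_type_id`, and a hypothetical group name `plankton` to show the fallback to `0` for unknown groups.

```python
type_lut = {'SEAWATER': 1, 'BIOTA': 2, 'SEDIMENT': 3}

def sample_type_id(group_name: str, lut: dict = type_lut) -> int:
    "Case-insensitive lookup of the sample type ID; unknown groups map to 0."
    return lut.get(group_name.upper(), 0)

print(sample_type_id('seawater'), sample_type_id('plankton'))
```

Because the dictionary keys of `dfs` are lowercase while the lookup table uses uppercase keys, the `.upper()` call is what makes the two match.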

Here we call a transformer, which applies the callback (e.g. `GetSampleTypeCB`) to the dictionary of dataframes, `dfs`.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[GetSampleTypeCB(type_lut),
                            CompareDfsAndTfmCB(dfs)
                            ])

print(tfm()['seawater'].head())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

            KEY NUCLIDE METHOD < VALUE_Bq/m³  VALUE_Bq/m³  ERROR%_m³  \
0  WKRIL2012003   CS137    NaN           NaN          5.3       32.0   
1  WKRIL2012004   CS137    NaN           NaN         19.9       20.0   
2  WKRIL2012005   CS137    NaN           NaN         25.5       20.0   
3  WKRIL2012006   CS137    NaN           NaN         17.0       29.0   
4  WKRIL2012007   CS137    NaN           NaN         22.2       18.0   

     DATE_OF_ENTRY_x  COUNTRY LABORATORY   SEQUENCE  ... LONGITUDE (dddddd)  \
0  08/20/14 00:00:00     90.0       KRIL  2012003.0  ...            29.3333   
1  08/20/14 00:00:00     90.0       KRIL  2012004.0  ...            29.3333   
2  08/20/14 00:00:00     90.0       KRIL  2012005.0  ...            23.1500   
3  08/20/14 00:00:00     90.0       KRIL  2012006.0  ...            27.9833   
4  08/20/14 00:00:00     90.0       KRIL  2012007.0  ...            27.9833   

   TDEPTH  SDEPTH  SALIN TTEMP  FILT  MORS_SUBBASIN  HELCOM_SUBBASIN  \
0     NaN     0.0   

### Normalize nuclide names

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``nuclide``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``nuclide_id``.*

#### Lower & strip nuclide names

Create a callback function, `LowerStripRdnNameCB`, that receives a dictionary of dataframes. For each dataframe in the dictionary, it converts the contents of the `NUCLIDE` column to lowercase and removes any leading or trailing whitespace.

In [None]:
#| exports
class LowerStripRdnNameCB(Callback):
    "Convert nuclide names to lowercase and strip leading and trailing whitespace."
    def __call__(self, tfm):
        for key in tfm.dfs.keys():
            self._process_nuclide_column(tfm.dfs[key])

    def _process_nuclide_column(self, df):
        "Apply transformation to the 'NUCLIDE' column of the given DataFrame."
        df['NUCLIDE'] = df['NUCLIDE'].apply(self._transform_nuclide)

    def _transform_nuclide(self, nuclide):
        "Convert nuclide name to lowercase and strip leading and trailing whitespace."
        return nuclide.lower().strip()
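As an alternative to the element-wise `apply`, the same normalization could be written with the pandas `.str` accessor. This is a sketch on toy values, not the callback's implementation; unlike `nuclide.lower()`, it leaves missing values untouched instead of raising an error.

```python
import pandas as pd

raw = pd.Series([' CS137', 'K-40 ', None])  # toy values, including a missing entry
cleaned = raw.str.lower().str.strip()       # missing values stay missing
print(cleaned.tolist())
```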


Here we call a transformer, which applies the callback (e.g. `LowerStripRdnNameCB`) to the dictionary of dataframes, `dfs`. We then print the unique entries of the transformed `NUCLIDE` column for each dataframe included in the dictionary of dataframes, `dfs`.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB()])
print('seawater nuclides: ')
print(tfm()['seawater']['NUCLIDE'].unique())
print('biota nuclides: ')
print(tfm()['biota']['NUCLIDE'].unique())
print('sediment nuclides: ')
print(tfm()['sediment']['NUCLIDE'].unique())

seawater nuclides: 
['cs137' 'sr90' 'h3' 'cs134' 'pu238' 'pu239240' 'am241' 'cm242' 'cm244'
 'tc99' 'k40' 'ru103' 'sr89' 'sb125' 'nb95' 'ru106' 'zr95' 'ag110m'
 'cm243244' 'ba140' 'ce144' 'u234' 'u238' 'co60' 'pu239' 'pb210' 'po210'
 'np237' 'pu240' 'mn54']
biota nuclides: 
['cs134' 'k40' 'co60' 'cs137' 'sr90' 'ag108m' 'mn54' 'co58' 'ag110m'
 'zn65' 'sb125' 'pu239240' 'ru106' 'be7' 'ce144' 'pb210' 'po210' 'sb124'
 'sr89' 'zr95' 'te129m' 'ru103' 'nb95' 'ce141' 'la140' 'i131' 'ba140'
 'pu238' 'u235' 'bi214' 'pb214' 'pb212' 'tl208' 'ac228' 'ra223' 'eu155'
 'ra226' 'gd153' 'sn113' 'fe59' 'tc99' 'co57' 'sn117m' 'eu152' 'sc46'
 'rb86' 'ra224' 'th232' 'cs134137' 'am241' 'ra228' 'th228' 'k-40' 'cs138'
 'cs139' 'cs140' 'cs141' 'cs142' 'cs143' 'cs144' 'cs145' 'cs146']
sediment nuclides: 
['ra226' 'cs137' 'ra228' 'k40' 'sr90' 'cs134137' 'cs134' 'pu239240'
 'pu238' 'co60' 'ru103' 'ru106' 'sb125' 'ag110m' 'ce144' 'am241' 'be7'
 'th228' 'pb210' 'co58' 'mn54' 'zr95' 'ba140' 'po210' 'ra224' 'nb95'
 'p

#### Remap nuclide names to MARIS data formats

The `maris-template.nc` file, which is created from `cdl.toml` when the Marisco package is installed, lists the nuclides permitted in a MARIS NetCDF file. Here we define a function, `get_unique_nuclides()`, which returns a list of the unique nuclides found across all dataframes in `dfs`. The function `has_valid_varname` then checks that each nuclide in this list is defined in `maris-template.nc` (i.e. in `cdl.toml`). It prints every name that is not found and returns `False`, or returns `True` if all names are valid. 
 

In [None]:
dfs['seawater']

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/m³,VALUE_Bq/m³,ERROR%_m³,DATE_OF_ENTRY_x,COUNTRY,LABORATORY,SEQUENCE,...,LONGITUDE (ddmmmm),LONGITUDE (dddddd),TDEPTH,SDEPTH,SALIN,TTEMP,FILT,MORS_SUBBASIN,HELCOM_SUBBASIN,DATE_OF_ENTRY_y
0,WKRIL2012003,CS137,,,5.3,32.000000,08/20/14 00:00:00,90.0,KRIL,2012003.0,...,29.2000,29.3333,,0.0,,,,11.0,11.0,08/20/14 00:00:00
1,WKRIL2012004,CS137,,,19.9,20.000000,08/20/14 00:00:00,90.0,KRIL,2012004.0,...,29.2000,29.3333,,29.0,,,,11.0,11.0,08/20/14 00:00:00
2,WKRIL2012005,CS137,,,25.5,20.000000,08/20/14 00:00:00,90.0,KRIL,2012005.0,...,23.0900,23.1500,,0.0,,,,11.0,3.0,08/20/14 00:00:00
3,WKRIL2012006,CS137,,,17.0,29.000000,08/20/14 00:00:00,90.0,KRIL,2012006.0,...,27.5900,27.9833,,0.0,,,,11.0,11.0,08/20/14 00:00:00
4,WKRIL2012007,CS137,,,22.2,18.000000,08/20/14 00:00:00,90.0,KRIL,2012007.0,...,27.5900,27.9833,,39.0,,,,11.0,11.0,08/20/14 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21211,WSSSM2021005,H3,SSM45,,1030.0,93.203883,09/06/22 00:00:00,77.0,SSSM,202105.0,...,18.2143,18.3572,,1.0,,,N,1.0,8.0,09/06/22 00:00:00
21212,WSSSM2021006,H3,SSM45,,2240.0,43.303571,09/06/22 00:00:00,77.0,SSSM,202106.0,...,17.0000,17.0000,,1.0,,,N,10.0,10.0,09/06/22 00:00:00
21213,WSSSM2021007,H3,SSM45,,2060.0,47.087379,09/06/22 00:00:00,77.0,SSSM,202107.0,...,11.5671,11.9452,,1.0,,,N,12.0,12.0,09/06/22 00:00:00
21214,WSSSM2021008,H3,SSM45,,2300.0,43.478261,09/06/22 00:00:00,77.0,SSSM,202108.0,...,11.5671,11.9452,,1.0,,,N,12.0,12.0,09/06/22 00:00:00


In [None]:
#| export
def get_unique_nuclides(dfs: Dict[str, pd.DataFrame]) -> List[str]:
    "Get a list of unique radionuclide types measured across samples."
    nuclides = set()
    for df in dfs.values(): nuclides.update(df['NUCLIDE'].unique())
    return list(nuclides)

In [None]:
#| eval: false
# Check if these variable names are consistent with MARIS CDL
has_valid_varname(get_unique_nuclides(tfm.dfs), nc_tpl_path())

"cs139" variable name not found in MARIS CDL
"cs134137" variable name not found in MARIS CDL
"cs138" variable name not found in MARIS CDL
"cs143" variable name not found in MARIS CDL
"cs142" variable name not found in MARIS CDL
"cs144" variable name not found in MARIS CDL
"pu238240" variable name not found in MARIS CDL
"pu239240" variable name not found in MARIS CDL
"cs146" variable name not found in MARIS CDL
"cs141" variable name not found in MARIS CDL
"cm243244" variable name not found in MARIS CDL
"k-40" variable name not found in MARIS CDL
"cs145" variable name not found in MARIS CDL
"cs140" variable name not found in MARIS CDL


False

Many nuclide names are not listed in `maris-template.nc`. Here we create a lookup table, `varnames_lut_updates`, which is used to correct the nuclide names in the dictionary of dataframes (i.e. `dfs`) that are not compatible with `maris-template.nc`.

In [None]:
#| exports
varnames_lut_updates = {
    'k-40': 'k40',
    'cm243244': 'cm243_244_tot',
    'cs134137': 'cs134_137_tot',
    'pu239240': 'pu239_240_tot',
    'pu238240': 'pu238_240_tot',
    'cs138': 'cs137',
    'cs139': 'cs137',
    'cs140': 'cs137',
    'cs141': 'cs137',
    'cs142': 'cs137',
    'cs143': 'cs137',
    'cs144': 'cs137',
    'cs145': 'cs137',
    'cs146': 'cs137'}

Function `get_varnames_lut` returns a dictionary of nuclide names. This dictionary includes the `NUCLIDE` names from the dataframes in dfs, along with corrections specified in `varnames_lut_updates`.

In [None]:
#| exports
def get_varnames_lut(
    dfs:dict, # Data to transform
    lut:dict=varnames_lut_updates # Lut to fix not found nuclide names
) -> dict: 
    "Generate a lookup table for radionuclide names, updating with provided mappings."
    unique_nuclides = get_unique_nuclides(dfs)
    base_lut = {name: name for name in unique_nuclides}
    base_lut.update(lut)
    return base_lut
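The two-step construction in `get_varnames_lut` (identity mapping first, corrections second) can be sketched with plain dictionaries. The `observed` list is a small illustrative subset of the names printed earlier, and `fixes` is a subset of `varnames_lut_updates`.

```python
observed = ['cs137', 'k-40', 'pu239240']              # names as they appear after lowercasing
fixes = {'k-40': 'k40', 'pu239240': 'pu239_240_tot'}  # subset of varnames_lut_updates

lut = {name: name for name in observed}  # every name maps to itself by default
lut.update(fixes)                        # corrections override the identity entries
remapped = [lut[name] for name in observed]
print(remapped)
```

Names that need no correction pass through unchanged, so the resulting table can be applied to the whole `NUCLIDE` column in one step.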

The ``get_nuc_id_lut`` function creates a lookup table to map nuclide names to their IDs. In the MARIS Open Refine data format, each nuclide has a unique nuclide_id. This function reads an Excel file that lists nuclide names and their IDs, and then returns a dictionary. In this dictionary, the nuclide names are the keys, and their corresponding IDs are the values.

In [None]:
#| exports
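def get_nuc_id_lut() -> Dict[str, int]:
    "Read the MARIS nuclide lookup table and return a dictionary mapping nuclide names (`nc_name`) to `nuclide_id`."
    df = pd.read_excel(nuc_lut_path(), usecols=['nc_name','nuclide_id'])
    return df.set_index('nc_name').to_dict()['nuclide_id']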

Create a callback that remaps the nuclide names in the dataframes to the updated names in `varnames_lut_updates`.

In [None]:
# | exports
class RemapRdnNameCB(Callback):
    def __init__(self, 
                 fn_lut:Callable=partial(get_varnames_lut, lut=varnames_lut_updates), # Function remapping radionuclide names
                 nuc_id_lut:Callable=get_nuc_id_lut # Function that returns a lookup table for nuclide IDs
                ):
        "Remap and standardize radionuclide names to MARIS radionuclide names and define nuclide ids."
        fc.store_attr()

    def __call__(self, tfm):
        "Apply lookup tables to remap radionuclide names and obtain nuclide IDs in DataFrames."
        lut = self.fn_lut(tfm.dfs)
        nuc_id_lut = self.nuc_id_lut()
        
        for grp, df in tfm.dfs.items():
            self._remap_nuclide_names(df, lut, grp)
            self._apply_nuclide_ids(df, nuc_id_lut, grp)

    def _remap_nuclide_names(self, 
                             df:pd.DataFrame, # DataFrame containing the 'NUCLIDE' column
                             lut:Dict[str, str], # Lookup table for remapping radionuclide names
                             grp:str # Group name, used only in the warning message
                            ):
        "Remap radionuclide names in the 'NUCLIDE' column of the DataFrame using the provided lookup table."
        if 'NUCLIDE' in df.columns:
            df['NUCLIDE'] = df['NUCLIDE'].replace(lut)
        else:
            # A DataFrame has no `name` attribute, so the group name is passed in explicitly.
            print(f"No 'NUCLIDE' column found in DataFrame of group {grp}")

    def _apply_nuclide_ids(self, 
                           df:pd.DataFrame, # DataFrame containing the `NUCLIDE` column
                           nuc_id_lut:Dict[str, int], # Lookup table for nuclide IDs
                           grp:str # Group name, used only in the warning message
                          ):
        "Add a 'nuclide_id' column by mapping the 'NUCLIDE' column through the nuclide ID lookup table."
        if 'NUCLIDE' in df.columns:
            df['nuclide_id'] = df['NUCLIDE'].map(nuc_id_lut)
        else:
            print(f"No 'NUCLIDE' column found in DataFrame of group {grp}")
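The two operations used by `RemapRdnNameCB`, `Series.replace` for the names and `Series.map` for the IDs, behave differently for unknown entries, which the toy example below tries to make visible. The IDs for `cs137`, `k40` and `pu239_240_tot` are taken from the output shown in this notebook; `xx999` is a made-up name used only to show what happens to an unmatched nuclide.

```python
import pandas as pd

lut = {'k-40': 'k40', 'pu239240': 'pu239_240_tot'}
nuc_id_lut = {'cs137': 33, 'k40': 4, 'pu239_240_tot': 77}

df = pd.DataFrame({'NUCLIDE': ['cs137', 'k-40', 'pu239240', 'xx999']})
df['NUCLIDE'] = df['NUCLIDE'].replace(lut)        # unknown names are left as they are
df['nuclide_id'] = df['NUCLIDE'].map(nuc_id_lut)  # unknown names become NaN

# Names without an ID are easy to list afterwards.
unmatched = df.loc[df['nuclide_id'].isna(), 'NUCLIDE'].tolist()
print(unmatched)
```

Listing the rows where `nuclide_id` is missing is a quick way to spot names that still need an entry in `varnames_lut_updates`.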

Apply the transformer for callbacks `LowerStripRdnNameCB` and `RemapRdnNameCB`. Then, print the unique nuclides for each dataframe in the dictionary dfs.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            #CompareDfsAndTfmCB(dfs)
                            ])
tfm()

#print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print('seawater nuclides: ')
print(tfm.dfs['seawater'][['NUCLIDE', 'nuclide_id']].drop_duplicates().reset_index(drop=True))
print('biota nuclides: ')
print(tfm.dfs['biota'][['NUCLIDE', 'nuclide_id']].drop_duplicates().reset_index(drop=True))
print('sediment nuclides: ')
print(tfm.dfs['sediment'][['NUCLIDE', 'nuclide_id']].drop_duplicates().reset_index(drop=True))


seawater nuclides: 
          NUCLIDE  nuclide_id
0           cs137          33
1            sr90          12
2              h3           1
3           cs134          31
4           pu238          67
5   pu239_240_tot          77
6           am241          72
7           cm242          73
8           cm244          75
9            tc99          15
10            k40           4
11          ru103          16
12           sr89          11
13          sb125          24
14           nb95          14
15          ru106          17
16           zr95          13
17         ag110m          22
18  cm243_244_tot          80
19          ba140          34
20          ce144          37
21           u234          62
22           u238          64
23           co60           9
24          pu239          68
25          pb210          41
26          po210          47
27          np237          65
28          pu240          69
29           mn54           6
biota nuclides: 
          NUCLIDE  nuclide_id
0  

After correcting the nuclide names, we check that all nuclides in the dictionary of dataframes are valid. `has_valid_varname` returns `True` if all are valid.

In [None]:
#| eval: false
has_valid_varname(get_unique_nuclides(tfm.dfs), nc_tpl_path())

True

### Standardize Time

#### Parse time

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: `time`.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Open Refine format variables: `begperiod` 

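Create a callback that parses the `DATE` column (format `%m/%d/%y %H:%M:%S`) of each dataframe into a `time` column. Where `DATE` is missing, the date is built from the `YEAR`, `MONTH` and `DAY` columns instead: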

**Comment (FA)**: Can be simplified I think (TBC)

In [None]:
#| exports
class ParseTimeCB(Callback):
    def __init__(self): 
        fc.store_attr()
            
    def __call__(self, 
                 tfm # The transformer object containing DataFrames
                ):
        for grp in tfm.dfs.keys():
            df = tfm.dfs[grp]
            self._process_dates(df)
            self._define_beg_period(df)

    def _process_dates(self, 
                       df:pd.DataFrame # DataFrame containing the `DATE`, `YEAR`, `MONTH`, and `DAY` columns
                      ):
        "Process and correct date and time information in the DataFrame."
        df['time'] = pd.to_datetime(df['DATE'], format='%m/%d/%y %H:%M:%S')
        # if 'DATE' column is nan, get 'time' from 'YEAR','MONTH' and 'DAY' column. 
        # if 'DAY' or 'MONTH' is 0 then set it to 1. 
        df.loc[df["DAY"] == 0, "DAY"] = 1
        df.loc[df["MONTH"] == 0, "MONTH"] = 1
        
        # if 'DAY' and 'MONTH' is nan but YEAR is not nan then set 'DAY' and 'MONTH' both to 1. 
        condition = (df["DAY"].isna()) & (df["MONTH"].isna()) & (df["YEAR"].notna())
        df.loc[condition, "DAY"] = 1
        df.loc[condition, "MONTH"] = 1
        
        condition = df['DATE'].isna() # if 'DATE' is nan. 
        # With errors='coerce', invalid dates (e.g. a day that does not exist in the month) become NaT.
        df['time'] = np.where(condition,
                              pd.to_datetime(df[['YEAR', 'MONTH', 'DAY']], format='%y%m%d', errors='coerce'),
                              pd.to_datetime(df['DATE'], format='%m/%d/%y %H:%M:%S'))
        
    def _define_beg_period(self, 
                           df: pd.DataFrame # DataFrame containing the `time` column
                          ):
        "Create a standardized date representation for Open Refine."
        df['begperiod'] = df['time']
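The fallback rules applied by `_process_dates` can be restated for a single record using only the standard library. This is a sketch of the rules as read from the code above, not a replacement for the callback; `parse_sample_time` and `is_missing` are hypothetical helper names.

```python
from datetime import datetime
from math import isnan
from typing import Optional

DATE_FMT = '%m/%d/%y %H:%M:%S'

def is_missing(x) -> bool:
    "True for None or a float NaN."
    return x is None or (isinstance(x, float) and isnan(x))

def parse_sample_time(date, year, month, day) -> Optional[datetime]:
    "Parse `date`; if it is missing, fall back to year/month/day. Return None if no valid date exists."
    if not is_missing(date):
        return datetime.strptime(date, DATE_FMT)
    if is_missing(year):
        return None
    # Month and day both missing: default to 1 January of that year.
    if is_missing(month) and is_missing(day):
        month, day = 1, 1
    if is_missing(month) or is_missing(day):
        return None
    # A value of 0 stands for "unknown" and is replaced by 1.
    month, day = int(month) or 1, int(day) or 1
    try:
        return datetime(int(year), month, day)
    except ValueError:  # e.g. 31 February
        return None

print(parse_sample_time(None, 1986.0, 0, 0))
```

Records for which the function returns `None` correspond to the `NaT` entries that are later reported as invalid.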

Apply the transformer for callbacks `ParseTimeCB`. Then, print the ``begperiod`` and `time` data for `seawater`.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['seawater'][['begperiod','time']])

                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21216     39817  15827
Number of dropped rows                                     0         0      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 

       begperiod       time
0     2012-05-23 2012-05-23
1     2012-05-23 2012-05-23
2     2012-06-17 2012-06-17
3     2012-05-24 2012-05-24
4     2012-05-24 2012-05-24
...          ...        ...
21211 2021-10-15 2021-10-15
21212 2021-11-04 2021-11-04
21213 2021-10-15 2021-10-15
21214 2021-05-17 2021-05-17
21215 2021-05-13 2021-05-13

[21216 rows x 2 columns]


#### Encode time (seconds since ...)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``time``*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: No encoding for Open Refine.* 

`EncodeTimeCB` converts the HELCOM `time` format to the MARIS NetCDF `time` format.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
                            

8 of 21216 entries for `time` are invalid for seawater.
1 of 39817 entries for `time` are invalid for sediment.
                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21208     39816  15827
Number of dropped rows                                     8         1      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



In [None]:
tfm.dfs_dropped['seawater'][['YEAR', 'MONTH', 'DAY', 'DATE']]

Unnamed: 0,YEAR,MONTH,DAY,DATE
20556,,,,
20557,,,,
20558,,,,
20559,,,,
20560,,,,
20561,,,,
20562,,,,
20563,,,,
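The "seconds since" encoding named in this section's heading can be illustrated with the standard library. The reference date and units actually used by `EncodeTimeCB` come from `cfg()` and are not shown here, so the epoch of 1970-01-01 UTC below is an assumption, and `seconds_since` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed reference date

def seconds_since(ts: datetime, epoch: datetime = EPOCH) -> int:
    "Encode a timestamp as whole seconds elapsed since `epoch`."
    return int((ts - epoch).total_seconds())

ts = datetime(2012, 5, 23, tzinfo=timezone.utc)
encoded = seconds_since(ts)
decoded = EPOCH + timedelta(seconds=encoded)  # the encoding is reversible
print(encoded, decoded)
```

A missing timestamp has no such encoding, which is consistent with the rows without any date information being dropped above.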


### Sanitize value

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``value``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: ``activity``.*

In [None]:
#| exports
# Columns of interest
coi_val = {'seawater' : {'val': 'VALUE_Bq/m³'},
           'biota':  {'val': 'VALUE_Bq/kg'},
           'sediment': {'val': 'VALUE_Bq/kg'}}

**Comment (FA)**: Those lines can be simplified I think:
```
value_col = self.coi.get(grp, {}).get('val')
if value_col and value_col in df.columns:
```

In [None]:
# | exports
class SanitizeValue(Callback):
    def __init__(self, 
                 coi:dict # Dictionary containing column names for values based on group
                ):
        "Sanitize value by removing blank entries and ensuring the 'value' column is retained."
        fc.store_attr()

    def __call__(self, 
                 tfm # The transformer object containing DataFrames
                ):
        "Sanitize the DataFrames in the transformer by removing rows with blank values in specified columns."
        for grp in tfm.dfs.keys():
            self._sanitize_dataframe(tfm.dfs[grp], grp)

    def _sanitize_dataframe(self, 
                            df:pd.DataFrame, # DataFrame to sanitize
                            grp:str # Group name to determine column names
                           ):
        "Remove rows where specified value columns are blank and ensure the 'value' column is included."
        value_col = self.coi.get(grp, {}).get('val')
        if value_col and value_col in df.columns:
            df.dropna(subset=[value_col], inplace=True)
            # Ensure 'value' column is retained
            if 'value' not in df.columns:
                df['value'] = df[value_col]
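The effect of `_sanitize_dataframe` on a single dataframe can be sketched on toy data: rows without a measured value are removed, and the remaining values are copied to a `value` column. The keys and values below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'KEY': ['W1', 'W2', 'W3'],
                   'VALUE_Bq/m³': [5.3, None, 19.9]})
n_before = len(df)

# Drop rows with no measured value, then expose the values under a common column name.
df = df.dropna(subset=['VALUE_Bq/m³']).copy()
df['value'] = df['VALUE_Bq/m³']

n_dropped = n_before - len(df)
print(n_dropped, df['value'].tolist())
```

Using one common `value` column means later steps do not need to know that seawater is reported in Bq/m³ and biota or sediment in Bq/kg.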

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[SanitizeValue(coi_val),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21122     39532  15798
Number of dropped rows                                    94       285     29
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



### Normalize uncertainty

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``uncertainty``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: `Uncertainty`.*

Function `unc_rel2stan` converts uncertainty from relative uncertainty to standard uncertainty.

In [None]:
#| exports
def unc_rel2stan(
    df:pd.DataFrame, # DataFrame containing measurement and uncertainty columns
    meas_col:str, # Name of the column with measurement values
    unc_col:str # Name of the column with relative uncertainty values (percentages)
) -> pd.Series: # Series with calculated absolute uncertainties
    "Convert relative uncertainty to absolute uncertainty."
    # Vectorized equivalent of a row-wise apply: uncertainty (%) * value / 100
    return df[unc_col] * df[meas_col] / 100
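As a worked example of the conversion, the first seawater record shown earlier reports a value of 5.3 Bq/m³ with a relative error of 32 %. The sketch below applies the same formula to that single record; `rel_to_abs` is a hypothetical scalar helper, not part of the handler.

```python
from math import isclose

def rel_to_abs(value: float, rel_percent: float) -> float:
    "Absolute uncertainty from a value and its relative uncertainty in percent."
    return rel_percent * value / 100

u = rel_to_abs(5.3, 32.0)  # first seawater record: 5.3 Bq/m³ with 32 % relative error
print(round(u, 3))
```

The result is expressed in the same unit as the value itself. If either input is missing, the product is missing too, which is why some biota rows have no uncertainty.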

For each sample type in the HELCOM dataset, the uncertainty is given as a relative uncertainty (in percent). The column names for both the value and the uncertainty vary by sample type. The `coi_units_unc` list defines, for each sample type, a tuple holding the sample type name and the names of the value and uncertainty columns.

In [None]:
#| exports
# Columns of interest
coi_units_unc = [('seawater', 'VALUE_Bq/m³', 'ERROR%_m³'),
                 ('biota', 'VALUE_Bq/kg', 'ERROR%'),
                 ('sediment', 'VALUE_Bq/kg', 'ERROR%_kg')]

The `NormalizeUncCB` callback normalizes the uncertainty by converting it from relative uncertainty to standard uncertainty. 

In [None]:
#| exports
class NormalizeUncCB(Callback):
    def __init__(self, 
                 fn_convert_unc:Callable=unc_rel2stan, # Function converting relative uncertainty to absolute uncertainty
                 coi:List=coi_units_unc # List of columns of interest
                ):
        "Convert from relative error % to uncertainty of activity unit."
        fc.store_attr()
    
    def __call__(self, tfm):
        for grp, val, unc in self.coi:
            if grp in tfm.dfs:
                df = tfm.dfs[grp]
                df['uncertainty'] = self.fn_convert_unc(df, val, unc)

Apply the transformer with the `NormalizeUncCB()` callback. Then, print the value (i.e. activity per unit) and the standard uncertainty for each sample type.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                         
                            NormalizeUncCB(),
                            SanitizeValue(coi_val)])

# Run the callbacks once, then read the transformed dataframes
tfm()
print(tfm.dfs['seawater'][['value', 'uncertainty']][:5])
print(tfm.dfs['biota'][['value', 'uncertainty']][:5])
print(tfm.dfs['sediment'][['value', 'uncertainty']][:5])

   value  uncertainty
0    5.3        1.696
1   19.9        3.980
2   25.5        5.100
3   17.0        4.930
4   22.2        3.996
        value  uncertainty
0    0.010140          NaN
1  135.300000     4.830210
2    0.013980          NaN
3    4.338000     0.150962
4    0.009614          NaN
   value  uncertainty
0   35.0         9.10
1   36.0         7.92
2   38.0         9.12
3   36.0         9.00
4   30.0         6.90


### Lookup transformations 

#### Lookup MARIS function 

`get_maris_lut` matches each entry of the data provider lookup table (`data_provider_lut`) against the MARIS lookup table (`maris_lut`) using a fuzzy matching algorithm based on Levenshtein distance. The resulting lookup is used to map the HELCOM nomenclature to the standard MARIS format.
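
As an illustration of the matching criterion, the following sketch computes the Levenshtein distance in plain Python and picks the closest candidate. It is not the handler's matching code: the helper names `levenshtein` and `best_match`, the case-insensitive comparison and the three-entry lookup excerpt are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    "Number of single-character insertions, deletions or substitutions turning `a` into `b`."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def best_match(name: str, candidates: dict) -> tuple:
    "Return (id, candidate name, score) of the closest candidate, ignoring case."
    scored = ((cid, cname, levenshtein(name.lower(), cname.lower()))
              for cid, cname in candidates.items())
    return min(scored, key=lambda t: t[2])

# Small excerpt standing in for a MARIS species lookup (id -> name)
maris_species = {122: 'Macoma balthica', 271: 'Abramis brama', 285: 'Sander lucioperca'}

print(best_match('MACOMA BALTICA', maris_species))  # (122, 'Macoma balthica', 1)
print(best_match('ABRAMIS BRAMA', maris_species))   # (271, 'Abramis brama', 0)
```

A score of 0 is an exact match; small positive scores usually indicate spelling variants, while large scores deserve a manual check.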

In [None]:
#| exports
def get_maris_lut(fname_in, 
                  fname_cache, # For instance 'species_helcom.pkl'
                  data_provider_lut: str, # Data provider lookup table name
                  data_provider_id_col: str, # Data provider lookup column id of interest
                  data_provider_name_col: str, # Data provider lookup column name of interest
                  maris_lut: Callable, # Function retrieving MARIS source lookup table
                  maris_id: str, # Id of MARIS lookup table nomenclature item to match
                  maris_name: str, # Name of MARIS lookup table nomenclature item to match
                  unmatched_fixes: dict = {},
                  as_dataframe: bool = False,
                  overwrite: bool = False
                 ):
    "Try to match a look up table provided by the data provider with MARIS one."
    cache_file = cache_path() / fname_cache
    lut = {}
    maris_lut = maris_lut()
    df = pd.read_csv(Path(fname_in) / data_provider_lut)
    
    if overwrite or (not cache_file.exists()):
        for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing"):
            # Fix if unmatched
            has_to_be_fixed = row[data_provider_id_col] in unmatched_fixes            
            name_to_match = unmatched_fixes[row[data_provider_id_col]] if has_to_be_fixed else row[data_provider_name_col]

            # Match
            result = match_maris_lut(maris_lut, name_to_match, maris_id, maris_name)
            match = Match(result.iloc[0][maris_id], result.iloc[0][maris_name], 
                          row[data_provider_name_col], result.iloc[0]['score'])
            
            lut[row[data_provider_id_col]] = match
        
        fc.save_pickle(cache_file, lut)
    else:
        lut = fc.load_pickle(cache_file)

    if as_dataframe:
        df_lut = pd.DataFrame({k: asdict(v) for k, v in lut.items()}).transpose()
        df_lut.index.name = 'source_id'
        return df_lut.sort_values(by='match_score', ascending=False)
    else:
        return lut
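
The function above recomputes the lookup only when no cache file exists or when `overwrite=True`. A minimal, standalone sketch of that compute-or-load pattern follows, using the standard `pickle` module instead of the helpers used in the handler. The names `cached`, `expensive_lookup` and `species_demo.pkl` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def cached(cache_file: Path, compute, overwrite: bool = False):
    "Return the pickled result if present, otherwise compute it and store it."
    if overwrite or not cache_file.exists():
        result = compute()
        cache_file.write_bytes(pickle.dumps(result))
        return result
    return pickle.loads(cache_file.read_bytes())

calls = []

def expensive_lookup():
    calls.append(1)  # record each real computation
    return {'ABRA BRA': 271}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'species_demo.pkl'
    first = cached(cache_file, expensive_lookup)                  # computed and stored
    second = cached(cache_file, expensive_lookup)                 # served from the cache
    third = cached(cache_file, expensive_lookup, overwrite=True)  # recomputed

print(len(calls))  # 2
```

Because a cached lookup is reused silently, `overwrite=True` is needed after changing the fixes dictionary or the source lookup tables.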


#### Lookup : Biota species

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``species``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: `Species`.*

The HELCOM dataset includes a lookup table for biota species in the `RUBIN_NAME.csv` file.

In [None]:
#| eval: false
df_rubin = pd.read_csv(Path(fname_in) / 'RUBIN_NAME.csv')
df_rubin.head(5)

Unnamed: 0,RUBIN_ID,RUBIN,SCIENTIFIC NAME,ENGLISH NAME
0,11,ABRA BRA,ABRAMIS BRAMA,BREAM
1,12,ANGU ANG,ANGUILLA ANGUILLA,EEL
2,13,ARCT ISL,ARCTICA ISLANDICA,ISLAND CYPRINE
3,14,ASTE RUB,ASTERIAS RUBENS,COMMON STARFISH
4,15,CARD EDU,CARDIUM EDULE,COCKLE


Create `unmatched_fixes_biota_species` to provide the names to match for the RUBIN codes whose names in the HELCOM dataset would otherwise remain unmatched.

In [None]:
#| exports
unmatched_fixes_biota_species = {
    'CARD EDU': 'Cerastoderma edule',
    'LAMI SAC': 'Saccharina latissima',
    'PSET MAX': 'Scophthalmus maximus',
    'STIZ LUC': 'Sander luciopercas'}
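
The sketch below shows how such a dictionary of fixes can take precedence over the provider's own name before matching. It mirrors the selection logic of `get_maris_lut` in simplified form; the helper name `name_to_match` and the two-row excerpt are illustrative.

```python
# Excerpt of the data provider table (RUBIN code -> scientific name as provided)
provider_names = {'CARD EDU': 'CARDIUM EDULE', 'ABRA BRA': 'ABRAMIS BRAMA'}

# Manual fixes keyed by the provider id
fixes = {'CARD EDU': 'Cerastoderma edule'}

def name_to_match(source_id: str, source_name: str, fixes: dict) -> str:
    "Use the manual fix if one exists for this id, else the provider's own name."
    return fixes.get(source_id, source_name)

names = {sid: name_to_match(sid, name, fixes) for sid, name in provider_names.items()}
print(names)  # {'CARD EDU': 'Cerastoderma edule', 'ABRA BRA': 'ABRAMIS BRAMA'}
```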

In [None]:
#| eval: false
species_lut_df = get_maris_lut(fname_in, 
                               fname_cache='species_helcom.pkl', 
                               data_provider_lut='RUBIN_NAME.csv',
                               data_provider_id_col='RUBIN',
                               data_provider_name_col='SCIENTIFIC NAME',
                               maris_lut=species_lut_path,
                               maris_id='species_id',
                               maris_name='species',
                               unmatched_fixes=unmatched_fixes_biota_species,
                               as_dataframe=True,
                               overwrite=True)

Processing: 100%|██████████| 46/46 [00:06<00:00,  6.92it/s]


Display `species_lut_df`. The `match_score` represents the number of insertions, deletions, or substitutions needed to transform the HELCOM source name (`source_name`) into the MARIS name (`matched_maris_name`).

In [None]:
#| eval: false
species_lut_df.head()

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ENCH CIM,276,Echinodermata,ENCHINODERMATA CIM,5
MACO BAL,122,Macoma balthica,MACOMA BALTICA,1
STUC PEC,704,Stuckenia pectinata,STUCKENIA PECTINATE,1
STIZ LUC,285,Sander lucioperca,STIZOSTEDION LUCIOPERCA,1
ABRA BRA,271,Abramis brama,ABRAMIS BRAMA,0


Show the rows of `species_lut_df` where `match_score` indicates an imperfect match (i.e. is not equal to 0).

In [None]:
species_lut_df[species_lut_df['match_score'] >= 1]

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ENCH CIM,276,Echinodermata,ENCHINODERMATA CIM,5
MACO BAL,122,Macoma balthica,MACOMA BALTICA,1
STUC PEC,704,Stuckenia pectinata,STUCKENIA PECTINATE,1
STIZ LUC,285,Sander lucioperca,STIZOSTEDION LUCIOPERCA,1


`LookupBiotaSpeciesCB` applies the corrected `biota` `species` data obtained from the `get_maris_lut` function to the `biota` dataframe in the dictionary of dataframes, `dfs`.

In [None]:
#| exports
class LookupBiotaSpeciesCB(Callback):
    def __init__(self, 
                 fn_lut:Callable # Function that returns the lookup table dictionary
                ):
        "Biota species standardized to MARIS format."
        fc.store_attr()

    def __call__(self, tfm):
        "Remap biota species names in the DataFrame using the lookup table and print unmatched RUBIN values."
        lut = self.fn_lut()
        tfm.dfs['biota']['species'] = tfm.dfs['biota']['RUBIN'].apply(lambda x: self._get_species(x, lut))

    def _get_species(self, 
                     rubin_value:str, # The RUBIN value from the DataFrame
                     lut:dict # The lookup table dictionary
                    ):
        "Get the matched_id from the lookup table and print RUBIN if the matched_id is -1."
        match = lut.get(rubin_value.strip(), Match(-1, None, None, None))
        if match.matched_id == -1:
            self.print_unmatched_rubin(rubin_value)
        return match.matched_id

    def print_unmatched_rubin(self, 
                              rubin_value: str # The RUBIN value from the DataFrame
                             ):
        "Print the RUBIN value if the matched_id is -1."
        print(f"Unmatched RUBIN: {rubin_value}")
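
The core of the callback is a dictionary lookup with a sentinel for unknown codes. The sketch below reproduces that behaviour in isolation; the `Match` dataclass defined here is a hypothetical stand-in whose fields follow the columns of `species_lut_df`, and `species_id` is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Match:  # hypothetical stand-in for the Match class used by the handler
    matched_id: int
    matched_maris_name: Optional[str]
    source_name: Optional[str]
    match_score: Optional[int]

lut = {'ABRA BRA': Match(271, 'Abramis brama', 'ABRAMIS BRAMA', 0)}

def species_id(rubin: str, lut: dict) -> int:
    "Return the MARIS species id, or -1 when the RUBIN code is unknown."
    return lut.get(rubin.strip(), Match(-1, None, None, None)).matched_id

ids = [species_id(code, lut) for code in ['ABRA BRA', ' ABRA BRA  ', 'XXXX XXX']]
print(ids)  # [271, 271, -1]
```

Stripping whitespace before the lookup keeps padded codes from being reported as unmatched.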

`get_maris_species` defines a partial function of `get_maris_lut`, with predefined arguments for species lookup.

In [None]:
#| exports
get_maris_species = partial(get_maris_lut,
                            fname_in, fname_cache='species_helcom.pkl', 
                            data_provider_lut='RUBIN_NAME.csv',
                            data_provider_id_col='RUBIN',
                            data_provider_name_col='SCIENTIFIC NAME',
                            maris_lut=species_lut_path,
                            maris_id='species_id',
                            maris_name='species',
                            unmatched_fixes=unmatched_fixes_biota_species,
                            as_dataframe=False,
                            overwrite=False)
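
For readers unfamiliar with `functools.partial`, the following sketch shows how pre-filled arguments behave and how they can still be overridden at call time. `get_lut` is a hypothetical stand-in that only echoes its arguments, and the path and file names are placeholders.

```python
from functools import partial

def get_lut(fname_in, fname_cache, data_provider_lut, as_dataframe=False, overwrite=False):
    "Hypothetical stand-in for a lookup builder; it only echoes its arguments."
    return dict(fname_in=fname_in, fname_cache=fname_cache,
                data_provider_lut=data_provider_lut,
                as_dataframe=as_dataframe, overwrite=overwrite)

# Pre-fill the positional and keyword arguments once
get_species = partial(get_lut, 'some/data/dir',
                      fname_cache='species_demo.pkl',
                      data_provider_lut='RUBIN_NAME.csv')

default_call = get_species()                 # uses the pre-filled arguments
override_call = get_species(overwrite=True)  # keyword arguments can still be overridden

print(default_call['overwrite'], override_call['overwrite'])  # False True
```

This is what allows a callback to receive a zero-argument function and call it as `self.fn_lut()`.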

Apply the transformer for callback `LookupBiotaSpeciesCB(get_maris_species)`. Then, print the unique `species` for the `biota` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                     
                            LookupBiotaSpeciesCB(get_maris_species)
                            ])

#print(tfm()['biota'][['RUBIN', 'species']][:10])
print(tfm()['biota']['species'].unique())

[  99  243   50  139  270  192  191  284   84  269  122   96  287  279
  278  288  286  244  129  275  271  285  283  247  120   59  280  274
  273  290  289  272  277  276   21  282  110  281  245  704 1524  703
 1611  621   60]


#### Lookup : Biota tissues

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``body_part``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: `Body part`.*

The HELCOM dataset includes a lookup table for biota tissues in the `TISSUE.csv` file. Biota tissue is known as `body part` in the MARIS dataset.

In [None]:
#| eval: false
pd.read_csv('../../_data/accdb/mors/csv/TISSUE.csv').head()

Unnamed: 0,TISSUE,TISSUE_DESCRIPTION
0,1,WHOLE FISH
1,2,WHOLE FISH WITHOUT ENTRAILS
2,3,WHOLE FISH WITHOUT HEAD AND ENTRAILS
3,4,FLESH WITH BONES
4,5,FLESH WITHOUT BONES (FILETS)


Create `unmatched_fixes_biota_tissues` to correct entries in the HELCOM dataset. 

In [None]:
#| exports
unmatched_fixes_biota_tissues = {
    3: 'Whole animal eviscerated without head',
    12: 'Viscera',
    8: 'Skin'}

In [None]:
#| eval: false
tissues_lut_df = get_maris_lut(fname_in, 
                               fname_cache='tissues_helcom.pkl', 
                               data_provider_lut='TISSUE.csv',
                               data_provider_id_col='TISSUE',
                               data_provider_name_col='TISSUE_DESCRIPTION',
                               maris_lut=bodyparts_lut_path,
                               maris_id='bodypar_id',
                               maris_name='bodypar',
                               unmatched_fixes=unmatched_fixes_biota_tissues,
                               as_dataframe=True,
                               overwrite=True)

Processing: 100%|██████████| 29/29 [00:00<00:00, 141.12it/s]


In [None]:
tissues_lut_df.head()

Unnamed: 0_level_0,matched_id,matched_maris_name,source_name,match_score
source_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,52,Flesh without bones,WHOLE FISH WITHOUT ENTRAILS,13
5,52,Flesh without bones,FLESH WITHOUT BONES (FILETS),9
1,1,Whole animal,WHOLE FISH,5
15,53,Stomach and intestine,STOMACH + INTESTINE,3
41,1,Whole animal,WHOLE ANIMALS,1


`LookupBiotaBodyPartCB` applies the corrected `biota` `TISSUE` data obtained from the `get_maris_lut` function to the `biota` dataframe in the dictionary of dataframes, `dfs`.

In [None]:
#| exports
class LookupBiotaBodyPartCB(Callback):
    def __init__(self, 
                 fn_lut:Callable # Function that returns the lookup table dictionary
                ):
        "Update bodypart id based on MARIS body part LUT (dbo_bodypar.xlsx)."
        fc.store_attr()

    def __call__(self, tfm):
        "Remap biota body parts in the DataFrame using the lookup table and print unmatched TISSUE values."
        lut = self.fn_lut()
        tfm.dfs['biota']['body_part'] = tfm.dfs['biota']['TISSUE'].apply(lambda x: self._get_body_part(x, lut))

    def _get_body_part(self, 
                       tissue_value:str, # The TISSUE value from the DataFrame
                       lut:dict # The lookup table dictionary
                      ):
        "Get the matched_id from the lookup table and print TISSUE if the matched_id is -1."
        match = lut.get(tissue_value, Match(-1, None, None, None))
        if match.matched_id == -1: 
            self.print_unmatched_tissue(tissue_value)
        return match.matched_id

    def print_unmatched_tissue(self, 
                               tissue_value:str # The TISSUE value from the DataFrame
                              ):
        "Print the TISSUE value if the matched_id is -1."
        print(f"Unmatched TISSUE: {tissue_value}")

`get_maris_bodypart` defines a partial function of `get_maris_lut`, with predefined arguments for `TISSUE` (or `bodypar`) lookup.

In [None]:
#| exports
get_maris_bodypart = partial(get_maris_lut,
                             fname_in,
                             fname_cache='tissues_helcom.pkl', 
                             data_provider_lut='TISSUE.csv',
                             data_provider_id_col='TISSUE',
                             data_provider_name_col='TISSUE_DESCRIPTION',
                             maris_lut=bodyparts_lut_path,
                             maris_id='bodypar_id',
                             maris_name='bodypar',
                             unmatched_fixes=unmatched_fixes_biota_tissues)

Apply the transformer for callbacks `LookupBiotaSpeciesCB(get_maris_species)` and `LookupBiotaBodyPartCB(get_maris_bodypart)`. Then, print the `TISSUE` and `body_part` for the `biota` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                 
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart)
                            ])

print(tfm()['biota'][['TISSUE', 'body_part']][:5])

   TISSUE  body_part
0       5         52
1       5         52
2       5         52
3       5         52
4       5         52


#### Lookup : Biogroup

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``bio_group``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: Biogroup is not included.*

`get_biogroup_lut` reads the file at `species_lut_path()` and from the contents of this file creates a dictionary linking `species_id` to `biogroup_id`.

In [None]:
#| exports
def get_biogroup_lut(maris_lut:str # Path to the MARIS lookup table (Excel file)
                    ) -> dict: # A dictionary mapping species_id to biogroup_id
    "Retrieve a lookup table for biogroup ids from a MARIS lookup table."
    species = pd.read_excel(maris_lut)
    return species[['species_id', 'biogroup_id']].set_index('species_id').to_dict()['biogroup_id']
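
The returned dictionary is a flat mapping from `species_id` to `biogroup_id`. A plain-Python sketch of the same shape and of a lookup with a `-1` fallback is given below; the rows and ids are made up for illustration and are not actual MARIS values.

```python
# Illustrative rows; the ids are made up and are not actual MARIS values
species_rows = [
    {'species_id': 1001, 'species': 'Species A', 'biogroup_id': 4},
    {'species_id': 1002, 'species': 'Species B', 'biogroup_id': 2},
]

# Same shape as the dictionary returned by the function above
biogroup_lut = {row['species_id']: row['biogroup_id'] for row in species_rows}

def biogroup_id(species_id: int, lut: dict) -> int:
    "Return the biogroup id, or -1 when the species id is not in the lookup."
    return lut.get(species_id, -1)

result = [biogroup_id(s, biogroup_lut) for s in (1001, 1002, -1)]
print(result)  # [4, 2, -1]
```

A species id of `-1` (an unmatched species) therefore also yields a biogroup of `-1`.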

`LookupBiogroupCB` uses the lookup returned by `get_biogroup_lut` to add a `bio_group` column, derived from each row's `species` id, to the `biota` dataframe in the dictionary of dataframes, `dfs`.

In [None]:
#| exports
class LookupBiogroupCB(Callback):
    def __init__(self, 
                 fn_lut:Callable # Function that returns the lookup table dictionary
                ):
        "Update biogroup id based on MARIS species LUT (dbo_species.xlsx)."
        fc.store_attr()

    def __call__(self, tfm):
        "Update the 'bio_group' column in the DataFrame using the lookup table and print unmatched species values."
        lut = self.fn_lut()
        tfm.dfs['biota']['bio_group'] = tfm.dfs['biota']['species'].apply(lambda x: self._get_biogroup(x, lut))

    def _get_biogroup(self, 
                      species_value:str, # The species value from the DataFrame
                      lut: dict # The lookup table dictionary
                     ) -> int: # The biogroup id from the lookup table
        "Get the biogroup id from the lookup table and print species if the biogroup id is not found."
        biogroup_id = lut.get(species_value, -1)
        if biogroup_id == -1:
            self.print_unmatched_species(species_value)
        return biogroup_id

    def print_unmatched_species(self, 
                                species_value:str # The species value from the DataFrame
                               ):
        "Print the species value if the biogroup id is not found."
        print(f"Unmatched species: {species_value}")

Apply the transformer for callbacks `LookupBiotaSpeciesCB(get_maris_species)`, `LookupBiotaBodyPartCB(get_maris_bodypart)` and `LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path()))`. Then, print the unique `bio_group` values for the `biota` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                      
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path()))
                            ])

print(tfm()['biota']['bio_group'].unique())

[ 4  2 14 11  8  3]


#### Lookup : Taxon Information

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Not included.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Taxonname`` , ``TaxonRepName``, ``Taxonrank``*

`get_taxon_info_lut` reads the file at `species_lut_path()` and builds, for each of the columns `Taxonname`, `Taxonrank`, `TaxonDB`, `TaxonDBID` and `TaxonDBURL`, a dictionary linking `species_id` to the value in that column.
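
The shape of such a lookup is a dictionary of dictionaries, `{column: {species_id: value}}`, which is what pandas' `DataFrame.to_dict()` returns by default after `set_index`. The plain-Python sketch below builds the same shape from placeholder rows; the ids and names are illustrative, not actual MARIS entries.

```python
# Illustrative rows; values are placeholders, not actual MARIS entries
species_rows = [
    {'species_id': 1001, 'Taxonname': 'Taxon A', 'Taxonrank': 'species'},
    {'species_id': 1002, 'Taxonname': 'Taxon B', 'Taxonrank': 'genus'},
]
columns = ['Taxonname', 'Taxonrank']

# One inner dictionary per column, keyed by species_id
taxon_lut = {col: {row['species_id']: row[col] for row in species_rows}
             for col in columns}

name = taxon_lut['Taxonname'].get(1001, 'Unknown')
missing = taxon_lut['Taxonrank'].get(9999, 'Unknown')
print(name, missing)  # Taxon A Unknown
```

Selecting one inner dictionary, for example `taxon_lut['Taxonname']`, gives a simple id-to-value mapping that can be applied to a column of species ids.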

In [None]:
#| exports
def get_taxon_info_lut(
    maris_lut:str # Path to the MARIS lookup table (Excel file)
) -> dict: # A dictionary mapping each taxon column name to a {species_id: value} dictionary
    "Retrieve a lookup table for Taxonname from a MARIS lookup table."
    species = pd.read_excel(maris_lut)
    return species[['species_id', 'Taxonname', 'Taxonrank','TaxonDB','TaxonDBID','TaxonDBURL']].set_index('species_id').to_dict()

# TODO include Commonname field after next MARIS data reconciling process.

**Comment (FA)**: Above class should be simplified.

In [None]:
# | exports
class LookupTaxonInformationCB(Callback):
    def __init__(self, 
                 fn_lut:Callable # Function that returns the lookup table dictionary
                ):
        "Update taxon names based on MARIS species LUT (dbo_species.xlsx)."
        fc.store_attr()

    def __call__(self, tfm):
        "Update the 'taxon_name' column in the DataFrame using the lookup table and print unmatched species IDs."
        lut = self.fn_lut()
        self._set_taxon_rep_name(tfm.dfs['biota'])
        tfm.dfs['biota']['Taxonname'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['Taxonname']))
        #df['Commonname'] = df['species'].apply(lambda x: self._get_name_by_species_id(x, lut['Commonname']))
        tfm.dfs['biota']['Taxonrank'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['Taxonrank']))
        tfm.dfs['biota']['TaxonDB'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['TaxonDB']))
        tfm.dfs['biota']['TaxonDBID'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['TaxonDBID']))
        tfm.dfs['biota']['TaxonDBURL'] =  tfm.dfs['biota']['species'].apply(lambda x: self._get_name_by_species_id(x, lut['TaxonDBURL']))

    def _set_taxon_rep_name(self, 
                            df:pd.DataFrame # The DataFrame to modify
                           ):
        "Remap the `TaxonRepName` column to the `RUBIN` column values."
        # Ensure both columns exist before attempting to remap
        if 'RUBIN' in df.columns:
            df['TaxonRepName'] = df['RUBIN']
        else:
            print("Warning: 'RUBIN' column not found in DataFrame.")
            
    def _get_name_by_species_id(self, 
                                species_id:str, # The species ID from the DataFrame
                                lut: dict # The lookup table dictionary
                               ) -> str: # The name from the lookup table
        "Get the  name from the lookup table and print species ID if the taxon name is not found."
        name = lut.get(species_id, 'Unknown')  # Default to 'Unknown' if not found
        if name == 'Unknown':
            # `lut.keys()[0]` raises a TypeError in Python 3 (dict views are not subscriptable)
            print(f"Unmatched species ID: {species_id}")
        return name

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                      
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path()))
                            ])
tfm()
print(tfm.dfs['biota'][['Taxonname', 'Taxonrank','TaxonDB','TaxonDBID','TaxonDBURL']].drop_duplicates().head())

               Taxonname Taxonrank   TaxonDB TaxonDBID  \
0           Gadus morhua   species  Wikidata   Q199788   
40     Sprattus sprattus   species  Wikidata   Q506823   
44       Clupea harengus   species  Wikidata  Q2396858   
77  Merlangius merlangus   species  Wikidata   Q273083   
78       Limanda limanda   species  Wikidata  Q1135526   

                                TaxonDBURL  
0    https://www.wikidata.org/wiki/Q199788  
40   https://www.wikidata.org/wiki/Q506823  
44  https://www.wikidata.org/wiki/Q2396858  
77   https://www.wikidata.org/wiki/Q273083  
78  https://www.wikidata.org/wiki/Q1135526  


#### Lookup : Sediment types

The HELCOM dataset includes a lookup table for sediment types in the `SEDIMENT_TYPE.csv` file.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``sed_type``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: `Sediment type`.*

In [None]:
#| eval: false
df_sediment = pd.read_csv(Path(fname_in) / 'SEDIMENT_TYPE.csv')
df_sediment.head()

Unnamed: 0,SEDI,SEDIMENT TYPE,RECOMMENDED TO BE USED
0,-99,NO DATA,
1,0,GRAVEL,YES
2,1,SAND,YES
3,2,FINE SAND,NO
4,3,SILT,YES


Create `unmatched_fixes_sediments` to correct entries in the HELCOM dataset. 

In [None]:
#| exports
unmatched_fixes_sediments = {
    #np.nan: 'Not applicable',
    -99: '(Not available)'
}

In [None]:
#| eval: false
sediments_lut_df = get_maris_lut(
    fname_in, 
    fname_cache='sediments_helcom.pkl', 
    data_provider_lut='SEDIMENT_TYPE.csv',
    data_provider_id_col='SEDI',
    data_provider_name_col='SEDIMENT TYPE',
    maris_lut=sediments_lut_path,
    maris_id='sedtype_id',
    maris_name='sedtype',
    unmatched_fixes=unmatched_fixes_sediments,
    as_dataframe=True,
    overwrite=True)

Processing: 100%|██████████| 47/47 [00:00<00:00, 142.98it/s]


`get_maris_sediments` defines a partial function of `get_maris_lut`, with predefined arguments for `SEDI` (or `sedtype`) lookup.

In [None]:
#| exports
get_maris_sediments = partial(
    get_maris_lut,
    fname_in, 
    fname_cache='sediments_helcom.pkl', 
    data_provider_lut='SEDIMENT_TYPE.csv',
    data_provider_id_col='SEDI',
    data_provider_name_col='SEDIMENT TYPE',
    maris_lut=sediments_lut_path,
    maris_id='sedtype_id',
    maris_name='sedtype',
    unmatched_fixes=unmatched_fixes_sediments)

`LookupSedimentCB` applies the corrected `sediment` `SEDI` data obtained from the `get_maris_lut` function to the `sediment` dataframe in the dictionary of dataframes, `dfs`.

In [None]:
#| exports
def preprocess_sedi(df:pd.DataFrame, column_name:str='SEDI'):
    "Preprocess the 'SEDI' column in the DataFrame by handling missing values and specific replacements."
    if column_name in df.columns:
        df[column_name] = df[column_name].fillna(-99).astype('int')
        # Assign the result back: an in-place replace on a column selection may act on a copy
        df[column_name] = df[column_name].replace([56, 73], -99)
    return df
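
The effect of this preprocessing on individual values can be sketched without pandas. The helper `clean_sedi` below is illustrative and works on a plain list, where the handler works on a DataFrame column.

```python
import math

def clean_sedi(values, invalid=(56, 73), missing_code=-99):
    "Map missing values and the codes in `invalid` to `missing_code`; cast the rest to int."
    cleaned = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            cleaned.append(missing_code)   # missing value
        elif int(v) in invalid:
            cleaned.append(missing_code)   # code treated as "no data"
        else:
            cleaned.append(int(v))
    return cleaned

print(clean_sedi([float('nan'), 1.0, 56.0, 73.0, 3.0]))  # [-99, 1, -99, -99, 3]
```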

In [None]:
#| exports
class LookupSedimentCB(Callback):
    def __init__(self, 
                 fn_lut:Callable, # Function that returns the lookup table dictionary
                 preprocess_fn:Callable=preprocess_sedi # Function to preprocess the sediment DataFrame
                ):
        "Update sediment id based on MARIS sediment type LUT (dbo_sedtype.xlsx)."
        fc.store_attr()
        self.preprocess_fn = preprocess_fn

    def __call__(self, tfm):
        "Remap sediment types in the DataFrame using the lookup table and handle specific replacements."
        lut = self.fn_lut()
        
        # Set SedRepName
        tfm.dfs['sediment']['SedRepName']  = tfm.dfs['sediment']['SEDI'] 

        # Apply preprocessing to the 'SEDI' column
        tfm.dfs['sediment'] = self.preprocess_fn(tfm.dfs['sediment'])
        
        # Apply the lookup function
        tfm.dfs['sediment']['sed_type'] = tfm.dfs['sediment']['SEDI'].apply(lambda x: self._get_sediment_type(x, lut))

    def _get_sediment_type(self, 
                           sedi_value:int, # The `SEDI` value from the DataFrame
                           lut: dict # The lookup table dictionary
                          ): 
        "Get the matched_id from the lookup table and print SEDI if the matched_id is -1."
        match = lut.get(sedi_value, Match(-1, None, None, None))
        if match.matched_id == -1:
            self._print_unmatched_sedi(sedi_value)
        return match.matched_id

    def _print_unmatched_sedi(self, 
                              sedi_value:int # The `SEDI` value from the DataFrame
                             ):
        "Print the SEDI value if the matched_id is -1."
        print(f"Unmatched SEDI: {sedi_value}")

Apply the transformer with the `LookupSedimentCB(get_maris_sediments)` callback. Then, print `SedRepName`, `SEDI` and `sed_type` for the `sediment` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LookupSedimentCB(get_maris_sediments)])

tfm()
print(tfm.dfs['sediment'][['SedRepName', 'SEDI', 'sed_type']][:5])

   SedRepName  SEDI  sed_type
0         NaN   -99         0
1         NaN   -99         0
2         NaN   -99         0
3         NaN   -99         0
4         NaN   -99         0


#### Lookup : Units

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``unit``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Unit``.*

Create `renaming_unit_rules` to rename the units. 

In [None]:
#| exports
# Define unit names renaming rules
renaming_unit_rules = {
    'seawater': 1,  # 'Bq/m3'
    'sediment': 4,  # 'Bq/kgd' for sediment
    'biota': {
        'D': 4,  # 'Bq/kgd'
        'W': 5,  # 'Bq/kgw'
        'F': 5   # 'Bq/kgw' (assumed to be 'Fresh', so set to wet)
    }
}

`LookupUnitCB` defines a `unit` column in each dataframe based on the units provided in the value column (`VALUE_Bq/m³` or `VALUE_Bq/kg`) of the HELCOM dataset. For `biota`, the unit id additionally depends on the `BASIS` column.

In [None]:
#| export
class LookupUnitCB(Callback):
    def __init__(self, 
                 renaming_unit_rules:dict=renaming_unit_rules # Dictionary containing renaming rules for different unit categories
                ):
        "Set the 'unit' id column in the DataFrames based on a lookup table."
        fc.store_attr()

    def __call__(self, tfm):
        "Apply unit renaming rules to DataFrames within the transformer."
        for grp in tfm.dfs:
            rules = self.renaming_unit_rules.get(grp)
            if rules is not None:
                # If the group rules are a dictionary, apply the dictionary.
                if isinstance(rules, dict):
                    # Apply rules based on the 'BASIS' column
                    tfm.dfs[grp]['unit'] = tfm.dfs[grp]['BASIS'].apply(lambda x: rules.get(x, 0))
                else:
                    # Apply a single rule to the entire DataFrame
                    tfm.dfs[grp]['unit'] = rules
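
The rule resolution can be summarised with a small standalone sketch: a group maps either to a constant unit id or to a per-row lookup on `BASIS`, with `0` as the fallback for missing or unrecognised values. The helper `unit_ids` is illustrative and operates on plain lists.

```python
unit_rules = {'seawater': 1, 'sediment': 4, 'biota': {'D': 4, 'W': 5, 'F': 5}}

def unit_ids(group, basis_values, rules=unit_rules):
    "Return one unit id per row: a constant per group, or a per-row lookup on BASIS."
    rule = rules.get(group)
    if rule is None:
        return None  # no rule defined for this group
    if isinstance(rule, dict):
        # Per-row lookup; 0 when BASIS is missing or unrecognised
        return [rule.get(b, 0) for b in basis_values]
    return [rule] * len(basis_values)

print(unit_ids('biota', ['W', None, 'D', 'F']))  # [5, 0, 4, 5]
print(unit_ids('seawater', [None, None]))        # [1, 1]
```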

Apply the transformer with the `LookupUnitCB()` callback. Then, print the unique `unit` values for the `biota` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupUnitCB()])

print(tfm()['biota']['unit'].unique())

[5 0 4]


#### Lookup : Detection limit or Value type

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``detection_limit``.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Value type``.*

Create `coi_dl` to define the column names related to Value type for each dataset. 

In [None]:
#| exports
# Columns of interest
coi_dl = {'seawater' : { 'val' : 'VALUE_Bq/m³',
                        'unc' : 'ERROR%_m³',
                        'dl' : '< VALUE_Bq/m³'},
                 'biota':  {'val' : 'VALUE_Bq/kg',
                            'unc' : 'ERROR%',
                            'dl' : '< VALUE_Bq/kg'},
                 'sediment': { 'val' : 'VALUE_Bq/kg',
                              'unc' : 'ERROR%_kg',
                              'dl' : '< VALUE_Bq/kg'}}

`get_detectionlimit_lut` reads the file at `detection_limit_lut_path()` and from the contents of this file creates a dictionary linking `name` to `id`.

| id | name | name_sanitized |
| :-: | :-: | :-: |
|-1|Not applicable|Not applicable|
|0|Not Available|Not available|
|1|=|Detected value|
|2|<|Detection limit|
|3|ND|Not detected|
|4|DE|Derived|

In [None]:
#| exports
def get_detectionlimit_lut():
    df = pd.read_excel(detection_limit_lut_path(), usecols=['name','id'])
    return df.set_index('name').to_dict()['id']

`LookupDetectionLimitCB` creates a `detection_limit` column with values determined as follows:

1. Look up the value type (or detection limit) column (`< VALUE_Bq/m³` or `< VALUE_Bq/kg`) against the table returned by the function `get_detectionlimit_lut`.
2. If that column holds no recognised value (e.g. NaN) but both the activity value (`VALUE_Bq/m³` or `VALUE_Bq/kg`) and the relative uncertainty (`ERROR%_m³`, `ERROR%`, or `ERROR%_kg`) are provided, assign the ID `1` (i.e. "Detected value").
3. Set all remaining unrecognised values in the `detection_limit` column to `0` (i.e. "Not Available").
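
The three rules above can be sketched for a single row as follows. The lookup reuses the names and ids from the table above; the helpers `is_missing` and `detection_limit_id` are illustrative and not part of the handler.

```python
import math

# Lookup built from the table above (name -> id)
dl_lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

def is_missing(x) -> bool:
    "True for None or NaN."
    return x is None or (isinstance(x, float) and math.isnan(x))

def detection_limit_id(flag, value, rel_unc, lut=dl_lut) -> int:
    if flag in lut:                      # rule 1: recognised flag
        return lut[flag]
    if not is_missing(value) and not is_missing(rel_unc):
        return lut['=']                  # rule 2: value and uncertainty both present
    return lut['Not Available']          # rule 3: everything else

nan = float('nan')
print(detection_limit_id('<', 5.3, nan))   # 2
print(detection_limit_id(nan, 5.3, 32.0))  # 1
print(detection_limit_id(nan, 5.3, nan))   # 0
```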

In [None]:
# | exports
class LookupDetectionLimitCB(Callback):
    def __init__(self, 
                 coi:dict=coi_dl, # Configuration options for column names
                 fn_lut:Callable=get_detectionlimit_lut # Function that returns a lookup table
                ):
        "Remap value type to MARIS format."
        fc.store_attr()

    def __call__(self, tfm):
        "Remap detection limits in the DataFrames using the lookup table."
        lut = self.fn_lut()
        
        for grp in tfm.dfs:
            df = tfm.dfs[grp]
            self._update_detection_limit(df, grp, lut)
    
    def _update_detection_limit(self, 
                                df:pd.DataFrame, # The DataFrame to modify
                                grp:str, # The group name to get the column configuration
                                lut:dict # The lookup table dictionary
                               ):
        "Update detection limit column in the DataFrame based on lookup table and rules."
        detection_col = self.coi[grp]['dl']
        value_col = self.coi[grp]['val']
        uncertainty_col = self.coi[grp]['unc']
        
        # Copy detection limit column
        df['detection_limit'] = df[detection_col]
        
        # Fill values with '=' or 'Not Available'
        condition = ((df[value_col].notna()) & (df[uncertainty_col].notna()) &
                     (~df['detection_limit'].isin(lut.keys())))
        df.loc[condition, 'detection_limit'] = '='
        df.loc[~df['detection_limit'].isin(lut.keys()), 'detection_limit'] = 'Not Available'
        
        # Perform lookup
        df['detection_limit'] = df['detection_limit'].map(lut)

Apply the transformer for callback `LookupDetectionLimitCB`. Then, print the unique `detection_limit` for the `seawater` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            NormalizeUncCB(),
                            SanitizeValue(coi_val),                       
                            LookupUnitCB(),
                            LookupDetectionLimitCB()])

print(tfm()['seawater']['detection_limit'].unique())

[1 2 0]


### Include Sample Laboratory code

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Sample Laboratory code is not included.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``samplabcode``*

>  MARIS NetCDF format does not include Sample Laboratory code.

In [None]:
# | exports
class RemapDataProviderSampleIdCB(Callback):
    "Remap `KEY` column to `samplabcode` in each DataFrame."
    def __call__(self, tfm):
        for grp in tfm.dfs:
            self._remap_sample_id(tfm.dfs[grp])
    
    def _remap_sample_id(self, df:pd.DataFrame):
        df['samplabcode'] = df['KEY']

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemapDataProviderSampleIdCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

print(tfm()['seawater']['samplabcode'].unique())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')


['WKRIL2012003' 'WKRIL2012004' 'WKRIL2012005' ... 'WSSSM2021006'
 'WSSSM2021007' 'WSSSM2021008']
                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21216     39817  15827
Number of dropped rows                                     0         0      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



### Filtered

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: ``filtered``*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Filtered``*

`get_filtered_lut` reads the file at `filtered_lut_path()` and from the contents of this file creates a dictionary linking `name` to `id`.

In [None]:
#| exports
def get_filtered_lut() -> dict: # A dictionary mapping names to IDs
    "Retrieve a filtered lookup table from an Excel file."
    df = pd.read_excel(filtered_lut_path(), usecols=['name', 'id'])
    return df.set_index('name').to_dict()['id']

Create `renaming_rules` to map the HELCOM `FILT` codes to the names used in the MARIS lookup table.

In [None]:
#| exports
renaming_rules = {'N': 'No',
                  'n': 'No',
                  'F': 'Yes'}

`LookupFiltCB` converts the HELCOM `FILT` format to the MARIS `FILT` format.
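
The conversion is a two-step mapping: HELCOM code to name, then name to MARIS ID. A minimal sketch, assuming a hypothetical lookup table (the real one is read by `get_filtered_lut()`, and its IDs may differ):

```python
import pandas as pd

renaming_rules = {'N': 'No', 'n': 'No', 'F': 'Yes'}

# Hypothetical lookup table standing in for `get_filtered_lut()`.
lut = {'Not available': 0, 'Yes': 1, 'No': 2}

filt = pd.Series(['F', 'N', 'n', None, 'X'])

# Step 1: HELCOM code -> name; unknown or missing codes become 'Not available'.
names = filt.map(lambda x: renaming_rules.get(x, 'Not available'))

# Step 2: name -> MARIS ID; names absent from the lookup table fall back to 0.
filt_ids = names.map(lambda x: lut.get(x, 0)).tolist()
print(filt_ids)
```

Codes that are missing or not covered by `renaming_rules` (here `None` and `'X'`) both end up with the fallback ID.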

In [None]:
#| exports
class LookupFiltCB(Callback):
    def __init__(self,
                 rules=renaming_rules, # Dictionary mapping FILT codes to their corresponding names
                 fn_lut=get_filtered_lut # Function that returns the lookup table dictionary
                ):
        "Lookup FILT value."
        fc.store_attr()

    def __call__(self, tfm):
        "Update the FILT column in the DataFrames using the renaming rules and lookup table."
        lut = self.fn_lut()
        rules = self.rules
        
        for grp in tfm.dfs.keys():
            if "FILT" in tfm.dfs[grp].columns:
                self._update_filt_column(tfm.dfs[grp], rules, lut)

    def _update_filt_column(self, 
                            df:pd.DataFrame, # The DataFrame to modify
                            rules:dict, # Dictionary mapping `FILT` codes to their corresponding names
                            lut:dict # Dictionary for lookup values
                           ):
        "Update the FILT column based on renaming rules and lookup table."
        # Fill values that are not in the renaming rules with 'Not available'.
        df['FILT'] = df['FILT'].apply(lambda x: rules.get(x, 'Not available'))
        
        # Perform lookup
        df['FILT'] = df['FILT'].map(lambda x: lut.get(x, 0))

Apply the transformer for callback `LookupFiltCB()`. Then, print the unique `FILT` for the `seawater` dataframe.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupFiltCB()
                            ])

print(tfm()['seawater']['FILT'].unique())

[0 2 1]


### Measurement note

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Measurement note is not included.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``measurenote``*

The HELCOM dataset includes a lookup table, ``ANALYSIS_METHOD.csv``, which maps each `METHOD` code to its `DESCRIPTION`. It is used here to record the measurement method as described by HELCOM.
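
As a minimal sketch of this lookup, the example below builds the dictionary from an in-memory CSV and maps it onto a `METHOD` column. The method codes and descriptions are invented for illustration; only the `METHOD` and `DESCRIPTION` column names come from the HELCOM file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of ANALYSIS_METHOD.csv.
csv = io.StringIO(
    "METHOD,DESCRIPTION\n"
    "M01,Gamma spectrometry\n"
    "M02,Beta counting\n"
)
lut = pd.read_csv(csv).set_index('METHOD').to_dict()['DESCRIPTION']

df = pd.DataFrame({'METHOD': ['M01', 'UNKNOWN', 'M02']})

# Codes missing from the lookup table map to NaN.
df['measurenote'] = df['METHOD'].map(lut)
print(df)
```

Rows whose `METHOD` is absent from the lookup table keep a NaN `measurenote`, which is why `nan` appears among the unique values.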

In [None]:
#| exports
def get_helcom_method_desc():
    "Return a dictionary mapping HELCOM `METHOD` codes to their `DESCRIPTION`."
    df = pd.read_csv(Path(fname_in) / 'ANALYSIS_METHOD.csv')
    return df.set_index('METHOD').to_dict()['DESCRIPTION']

In [None]:
#| exports
class RecordMeasurementNoteCB(Callback):
    def __init__(self, 
                 fn_lut: Callable # Function that returns the lookup dictionary with `METHOD` as key and `DESCRIPTION` as value
                ):
        "Record measurement notes by adding a 'measurenote' column to DataFrames."
        fc.store_attr()

    def __call__(self, tfm):
        "Apply the lookup table to add 'measurenote' to DataFrames in the transformer."
        lut = self.fn_lut()
        for grp, df in tfm.dfs.items():
            if 'METHOD' in df.columns:
                self._add_measurementnote(df, lut)
            else:
                print(f"Warning: 'METHOD' column not found in DataFrame for group '{grp}'")

    def _add_measurementnote(self, 
                             df:pd.DataFrame, # DataFrame containing the `METHOD` column
                             lut:dict # Lookup table dictionary mapping `METHOD` to `DESCRIPTION`
                            ):
        "Map 'METHOD' values to `measurenote` using the provided lookup table."
        df['measurenote'] = df['METHOD'].map(lut)        

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RecordMeasurementNoteCB(get_helcom_method_desc),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21216     39817  15827
Number of dropped rows                                     0         0      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



In [None]:
tfm.dfs['seawater']['measurenote'].unique()

array([nan,
       'Radiochemical method Radiocaesium separation from seawater samples.134+137Cs was adsorbed on AMP mat,  dissolved with NaOH and after purification precipitated as chloroplatinate (Cs2PtCl6).Counting with low background anticoincidence beta counter.',
       'Radiochem. meth of Sr90. Precipation with oxalate and separation of calcium, barium, radium and ytrium couting with low background anticoincidence beta counter. 1982-1994',
       'For tritium liquid scintialtion counting, combined with electrolytic enrichment of analysed water samples, double distilled, before and after electrolysis in cells. Liquid Scintillation spectrometer LKB Wallac model 1410',
       'Pretreatment drying (sediment, biota samples) and ashing (biota samples)or vaporization to 1000 ml (sea water samples), measured by gamma-spectrometry using HPGe detectors sediment, biota, sea water /Cs-137, Cs-134, K-40',
       'Radiochemical method. acidified samples are pre-concentrated using NH4-Pmo sepa

### Include Station

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: Station ID is not included.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variable: ``Station``*

>  MARIS NetCDF format does not include Station ID.

In [None]:
#| exports
class RemapStationIdCB(Callback):
    def __init__(self):
        "Remap Station ID to MARIS format."
        fc.store_attr()

    def __call__(self, tfm:Transformer):
        "Iterate through all DataFrames in the transformer object and remap `STATION` to `station`."
        for grp in tfm.dfs.keys():
            self._remap_station_id(tfm.dfs[grp])

    def _remap_station_id(self, 
                          df:pd.DataFrame # The DataFrame to modify
                         ):
        "Remap `STATION` column to `station` in the given DataFrame."
        df['station'] = df['STATION']

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemapStationIdCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
#print(tfm.dfs['seawater']['station'].unique())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21216     39817  15827
Number of dropped rows                                     0         0      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



### Sediment slice position (top and bottom)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variables: Top and Bottom are not included.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: ``Top`` and ``Bottom``.*

>  MARIS NetCDF format does not include sediment slice top and bottom.

In [None]:
#| exports
class RemapSedSliceTopBottomCB(Callback):
    def __init__(self):
        "Remap Sediment slice top and bottom to MARIS format."
        fc.store_attr()

    def __call__(self, tfm:Transformer):
        "Iterate through all DataFrames in the transformer object and remap sediment slice top and bottom."
        if 'sediment' in tfm.dfs:
            self._remap_sediment_slice(tfm.dfs['sediment'])

    def _remap_sediment_slice(self, 
                              df:pd.DataFrame # The DataFrame to modify
                             ):
        "Remap `LOWSLI` column to `bottom` and `UPPSLI` column to `top` in the given DataFrame."
        df['bottom'] = df['LOWSLI']
        df['top'] = df['UPPSLI']

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemapSedSliceTopBottomCB()
                            ])
tfm()
print(tfm.dfs['sediment']['top'].head())


0    15.0
1    20.0
2     0.0
3     2.0
4     4.0
Name: top, dtype: float64


### Dry to wet ratio

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variable: DW% is not included.*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: ``Dry/wet ratio``.*

HELCOM Description:

**Sediment:**
1. DW%: DRY WEIGHT AS PERCENTAGE (%) OF FRESH WEIGHT.
2. VALUE_Bq/kg: Measured radioactivity concentration in Bq/kg dry wt. in scientific format (e.g. 123 = 1.23E+02, 0.076 = 7.6E-02)

**Biota:**
1. WEIGHT: Average weight (in g) of specimen in the sample
2. DW%: DRY WEIGHT AS PERCENTAGE (%) OF FRESH WEIGHT
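
The callback defined next copies `DW%` into `dry_wet_ratio` and treats a value of 0 as missing. A minimal sketch of that rule on made-up values:

```python
import pandas as pd

df = pd.DataFrame({'DW%': [18.453, 0.0, 25.0]})

# Copy DW% and blank out zeros, which are treated as missing rather than measured.
df['dry_wet_ratio'] = df['DW%'].mask(df['DW%'] == 0)
print(df)
```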

In [None]:
#| exports
class LookupDryWetRatio(Callback):
    def __init__(self):
        "Lookup dry-wet ratio and format for MARIS."
        fc.store_attr()

    def __call__(self, tfm:Transformer):
        "Iterate through all DataFrames in the transformer object and apply the dry-wet ratio lookup."
        for grp in tfm.dfs.keys():
            if 'DW%' in tfm.dfs[grp].columns:
                self._apply_dry_wet_ratio(tfm.dfs[grp])

    def _apply_dry_wet_ratio(self, df: pd.DataFrame):
        "Apply dry-wet ratio conversion and formatting to the given DataFrame."
        df['dry_wet_ratio'] = df['DW%']
        # Treat 'DW%' = 0 as missing (NaN). `np.nan` is used because the `np.NaN` alias was removed in NumPy 2.0.
        df.loc[df['dry_wet_ratio'] == 0, 'dry_wet_ratio'] = np.nan


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            LookupDryWetRatio(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
                    

print(tfm.dfs['biota']['dry_wet_ratio'].head())


                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21216     39817  15827
Number of dropped rows                                     0         0      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 

0    18.453
1    18.453
2    18.453
3    18.453
4    18.458
Name: dry_wet_ratio, dtype: float64


### Standardize Coordinates

#### Capture Coordinates

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variables: ``lon``  and ``lat``*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: ``Longitude`` and ``Latitude``.*

Use decimal degree coordinates if available; otherwise, convert from degree-minute format to decimal degrees.
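
As a worked example, a `ddmmmm` value of `54.17` reads as 54 degrees and 17 minutes, i.e. 54 + 17/60 ≈ 54.283333 decimal degrees. The sketch below applies that conversion only where the decimal-degree column is missing or zero. The column names `lat_d` and `lat_m` and the helper name are placeholders for illustration, not the HELCOM column names.

```python
from math import modf

import numpy as np
import pandas as pd

def ddmmmm_to_decimal(ddmmmm: float) -> float:
    "Convert `dd.mmmm` (degrees, then minutes after the point) to decimal degrees."
    frac, degs = modf(ddmmmm)   # fractional part first, then integer part
    minutes = frac * 100        # 0.17 -> 17 minutes
    return round(degs + minutes / 60, 6)

df = pd.DataFrame({
    'lat_d': [54.5, np.nan, 0.0],       # decimal degrees (may be missing or zero)
    'lat_m': [54.30, 54.17, 60.3022],   # degrees and minutes
})

# Fall back to the converted value only where decimal degrees are unusable.
missing = df['lat_d'].isna() | (df['lat_d'] == 0)
df['lat'] = np.where(missing, df['lat_m'].map(ddmmmm_to_decimal), df['lat_d'])
print(df)
```

The first row keeps its decimal-degree value unchanged; the other two are filled from the converted degree-minute column.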

In [None]:
#| exports
coi_coordinates = {
    'seawater': {
        'lon_d': 'LONGITUDE (dddddd)',
        'lat_d': 'LATITUDE (dddddd)',
        'lon_m': 'LONGITUDE (ddmmmm)',
        'lat_m': 'LATITUDE (ddmmmm)'
    },
    'biota': {
        'lon_d': 'LONGITUDE dddddd',
        'lat_d': 'LATITUDE dddddd',
        'lon_m': 'LONGITUDE ddmmmm',
        'lat_m': 'LATITUDE ddmmmm'
    },
    'sediment': {
        'lon_d': 'LONGITUDE (dddddd)',
        'lat_d': 'LATITUDE (dddddd)',
        'lon_m': 'LONGITUDE (ddmmmm)',
        'lat_m': 'LATITUDE (ddmmmm)'
    }
}

In [None]:
#| exports
def ddmmmm2dddddd(
    ddmmmm:float # Coordinates in `ddmmmm` format where `dd` are degrees and `mmmm` are minutes
    ) -> float: # Coordinates in `dddddd` format
    "Convert coordinates from `ddmmmm` (degrees and minutes) to `dddddd` (decimal degrees)."
    # Split into degrees and minutes
    mins, degs = modf(ddmmmm)
    # Convert minutes to decimal
    mins = mins * 100
    # Convert to 'dddddd' format
    return round(int(degs) + (mins / 60), 6)

In [None]:
#| exports
class FormatCoordinates(Callback):
    def __init__(self, 
                 coi:dict, # Column names mapping for coordinates
                 fn_convert_cor:Callable # Function to convert coordinates
                 ):
        "Format coordinates for MARIS. Converts coordinates from 'ddmmmm' to 'dddddd' format if needed."
        fc.store_attr()

    def __call__(self, tfm:Transformer):
        "Apply formatting to coordinates in the DataFrame."
        for grp in tfm.dfs.keys():
            self._format_coordinates(tfm.dfs[grp], grp)

    def _format_coordinates(self, 
                            df:pd.DataFrame, # DataFrame to modify
                            grp: str # Group name to determine column names
                            ):
        "Format coordinates in the DataFrame for a specific group."
        lon_col_d = self.coi[grp]['lon_d']
        lat_col_d = self.coi[grp]['lat_d']
        lon_col_m = self.coi[grp]['lon_m']
        lat_col_m = self.coi[grp]['lat_m']
        
        # Define condition where 'dddddd' format is not available or is zero
        condition = (
            (df[lon_col_d].isna() | (df[lon_col_d] == 0)) |
            (df[lat_col_d].isna() | (df[lat_col_d] == 0))
        )
        
        # Use the converted 'ddmmmm' value where 'dddddd' is missing or zero; otherwise keep 'dddddd'
        df['lon'] = np.where(
            condition,
            df[lon_col_m].apply(lambda x: self._safe_convert(x)),
            df[lon_col_d]
        )
        
        df['lat'] = np.where(
            condition,
            df[lat_col_m].apply(lambda x: self._safe_convert(x)),
            df[lat_col_d]
        )
        
        # Drop rows where coordinate columns contain NaN values
        df.dropna(subset=['lat', 'lon'], inplace=True)

    def _safe_convert(self, 
                      value:float # Coordinate value to convert
                      ):
        "Convert coordinate value safely, handling NaN values."
        if pd.isna(value):
            return value  # Return NaN if value is NaN
        try:
            return self.fn_convert_cor(value)
        except Exception as e:
            print(f"Error converting value {value}: {e}")
            return value  # Return original value if an error occurs


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                    
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota'][['lat','lon']])

                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21208     39816  15827
Number of dropped rows                                     8         1      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 

             lat        lon
0      54.283333  12.316667
1      54.283333  12.316667
2      54.283333  12.316667
3      54.283333  12.316667
4      54.283333  12.316667
...          ...        ...
15822  60.373333  18.395667
15823  60.373333  18.395667
15824  60.503333  18.366667
15825  60.503333  18.366667
15826  60.503333  18.366667

[15827 rows x 2 columns]


#### Sanitize coordinates

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*NetCDF format variables: ``lon``  and ``lat``*

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*Open Refine format variables: ``Longitude decimal`` and ``Latitude decimal``.*

`SanitizeLonLatCB` drops a row when both longitude and latitude equal 0 or when the row contains unrealistic longitude or latitude values. It also converts a `,` decimal separator in longitude and latitude to `.`.
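
The behaviour described above can be approximated as follows. This is an illustrative sketch, not the `SanitizeLonLatCB` implementation: the function name is hypothetical, and "unrealistic" is assumed here to mean outside [-180, 180] for longitude and [-90, 90] for latitude.

```python
import pandas as pd

def sanitize_lon_lat(df: pd.DataFrame) -> pd.DataFrame:
    "Hypothetical helper: normalise separators, then drop invalid coordinate rows."
    out = df.copy()
    for col in ('lon', 'lat'):
        # Replace a ',' decimal separator with '.' before casting to float.
        out[col] = out[col].astype(str).str.replace(',', '.', regex=False).astype(float)
    both_zero = (out['lon'] == 0) & (out['lat'] == 0)
    out_of_range = (out['lon'].abs() > 180) | (out['lat'].abs() > 90)
    return out[~(both_zero | out_of_range)]

raw = pd.DataFrame({
    'lon': ['12,5', '0', '200.0', '18.4'],
    'lat': ['54,3', '0', '10.0', '60.5'],
})
clean = sanitize_lon_lat(raw)
print(clean)
```

In this toy input, the second row is dropped because both coordinates are 0 and the third because its longitude is out of range.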

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['biota'][['lat','lon']])


                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21208     39816  15827
Number of dropped rows                                     8         1      0
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 

             lat        lon
0      54.283333  12.316667
1      54.283333  12.316667
2      54.283333  12.316667
3      54.283333  12.316667
4      54.283333  12.316667
...          ...        ...
15822  60.373333  18.395667
15823  60.373333  18.395667
15824  60.503333  18.366667
15825  60.503333  18.366667
15826  60.503333  18.366667

[15827 rows x 2 columns]


### Combine Callbacks and review DFS and TFM data

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            GetSampleTypeCB(type_lut),
                            LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg()),        
                            SanitizeValue(coi_val),                       
                            NormalizeUncCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                            LookupSedimentCB(get_maris_sediments),
                            LookupUnitCB(),
                            LookupDetectionLimitCB(),    
                            RemapDataProviderSampleIdCB(),
                            LookupFiltCB(),
                            RemapStationIdCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetRatio(),
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')


                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21114     39531  15798
Number of dropped rows                                   102       286     29
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



In [None]:
tfm.dfs['seawater'].columns

Index(['KEY', 'NUCLIDE', 'METHOD', '< VALUE_Bq/m³', 'VALUE_Bq/m³', 'ERROR%_m³',
       'DATE_OF_ENTRY_x', 'COUNTRY', 'LABORATORY', 'SEQUENCE', 'DATE', 'YEAR',
       'MONTH', 'DAY', 'STATION', 'LATITUDE (ddmmmm)', 'LATITUDE (dddddd)',
       'LONGITUDE (ddmmmm)', 'LONGITUDE (dddddd)', 'TDEPTH', 'SDEPTH', 'SALIN',
       'TTEMP', 'FILT', 'MORS_SUBBASIN', 'HELCOM_SUBBASIN', 'DATE_OF_ENTRY_y',
       'samptype_id', 'nuclide_id', 'time', 'begperiod', 'value',
       'uncertainty', 'unit', 'detection_limit', 'samplabcode', 'station',
       'lon', 'lat'],
      dtype='object')

In [None]:
seawater_dfs_dropped_review=tfm.dfs_dropped['seawater']
biota_dfs_dropped_review=tfm.dfs_dropped['biota']
sediment_dfs_dropped_review=tfm.dfs_dropped['sediment']

### Rename columns of interest for NetCDF or Open Refine

> Column names are standardized to the MARIS NetCDF format (i.e. PEP8).
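
The renaming rules are keyed by tuples of group names, so a rule set applies to every group listed in its key. A minimal sketch of how the rules for one group can be merged and applied, using a shortened, hypothetical rule set and toy data:

```python
from collections import OrderedDict

import pandas as pd

# Shortened, illustrative rule set; the real one is returned by `get_renaming_rules`.
rules = OrderedDict({
    ('seawater', 'biota', 'sediment'): {'lat': 'lat', 'lon': 'lon', 'NUCLIDE': 'nuclide'},
    ('seawater',): {'SALIN': '_sal'},
    ('sediment',): {'sed_type': 'sed_type'},
})

def rules_for(group: str) -> OrderedDict:
    "Merge every rule set whose key tuple contains `group`."
    merged = OrderedDict()
    for groups, mapping in rules.items():
        if group in groups:
            merged.update(mapping)
    return merged

df = pd.DataFrame({'KEY': ['a'], 'NUCLIDE': ['cs137'],
                   'lat': [54.3], 'lon': [12.3], 'SALIN': [7.1]})

group_rules = rules_for('seawater')
keep = [col for col in group_rules if col in df.columns]   # select, in rule order
renamed = df[keep].rename(columns=group_rules)             # then rename
columns = renamed.columns.tolist()
print(columns)
```

Columns without a rule (here `KEY`) are dropped, and the output order follows the order of the rules rather than the order of the input columns.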

In [None]:
#| exports
# TO BE REFACTORED
def get_renaming_rules(encoding_type='netcdf'):
    "Define columns of interest (keys) and renaming rules (values)."
    vars = cdl_cfg()['vars']
    if encoding_type == 'netcdf':
        return OrderedDict({
            ('seawater', 'biota', 'sediment'): {
                # DEFAULT
                'lat': vars['defaults']['lat']['name'],
                'lon': vars['defaults']['lon']['name'],
                'time': vars['defaults']['time']['name'],
                'NUCLIDE': 'nuclide',
                'detection_limit': vars['suffixes']['detection_limit']['name'],
                'unit': vars['suffixes']['unit']['name'],
                'value': 'value',
                'uncertainty': vars['suffixes']['uncertainty']['name'],
                'counting_method': vars['suffixes']['counting_method']['name'],
                'sampling_method': vars['suffixes']['sampling_method']['name'],
                'preparation_method': vars['suffixes']['preparation_method']['name']
            },
            ('seawater',): {
                # SEAWATER
                'SALIN': vars['suffixes']['salinity']['name'],
                'SDEPTH': vars['defaults']['smp_depth']['name'],
                #'FILT': vars['suffixes']['filtered']['name'], Need to fix
                'TTEMP': vars['suffixes']['temperature']['name'],
                'TDEPTH': vars['defaults']['tot_depth']['name'],

            },
            ('biota',): {
                # BIOTA
                'SDEPTH': vars['defaults']['smp_depth']['name'],
                'species': vars['bio']['species']['name'],
                'body_part': vars['bio']['body_part']['name'],
                'bio_group': vars['bio']['bio_group']['name']
            },
            ('sediment',): {
                # SEDIMENT
                'sed_type': vars['sed']['sed_type']['name'],
                'TDEPTH': vars['defaults']['tot_depth']['name'],
            }
        })
    
    elif encoding_type == 'openrefine':
        return OrderedDict({
            ('seawater', 'biota', 'sediment'): {
                # DEFAULT
                'samptype_id': 'samptype_id',
                'lat': 'latitude',
                'lon': 'longitude',
                'station': 'station',
                'begperiod': 'begperiod',
                'samplabcode': 'samplabcode',
                #'endperiod': 'endperiod',
                'nuclide_id': 'nuclide_id',
                'detection_limit': 'detection',
                'unit': 'unit_id',
                'value': 'activity',
                'uncertainty': 'uncertaint',
                #'vartype': 'vartype',
                #'rangelow': 'rangelow',
                #'rangeupp': 'rangeupp',
                #'rl_detection': 'rl_detection',
                #'ru_detection': 'ru_detection',
                #'freq': 'freq',
                'SDEPTH': 'sampdepth',
                #'samparea': 'samparea',
                'SALIN': 'salinity',
                'TTEMP': 'temperatur',
                'FILT': 'filtered',
                #'oxygen': 'oxygen',
                #'sampquality': 'sampquality',
                #'station': 'station',
                #'samplabcode': 'samplabcode',
                #'profile': 'profile',
                #'transect': 'transect',
                #'IODE_QualityFlag': 'IODE_QualityFlag',
                'TDEPTH': 'totdepth',
                #'counmet_id': 'counting_method',
                #'sampmet_id': 'sampling_method',
                #'prepmet_id': 'preparation_method',
                'sampnote': 'sampnote',
                'measurenote': 'measurenote'
            },
            ('seawater',) : {
                # SEAWATER
                #'volume': 'volume',
                #'filtpore': 'filtpore',
                #'acid': 'acid'
            },
            ('biota',) : {
                # BIOTA
                'species': 'species_id',
                'Taxonname': 'Taxonname',
                'TaxonRepName': 'TaxonRepName',
                #'Commonname': 'Commonname',
                'Taxonrank': 'Taxonrank',
                'TaxonDB': 'TaxonDB',
                'TaxonDBID': 'TaxonDBID',
                'TaxonDBURL': 'TaxonDBURL',
                'body_part': 'bodypar_id',
                #'drywt': 'drywt',
                #'wetwt': 'wetwt',
                'dry_wet_ratio': 'percentwt',
                #'drymet_id': 'drymet_id'
            },
            ('sediment',): {
                # SEDIMENT
                'sed_type': 'sedtype_id',
                #'sedtrap': 'sedtrap',
                'top': 'sliceup',
                'bottom': 'slicedown',
                'SedRepName': 'SedRepName',
                #'drywt': 'drywt',
                #'wetwt': 'wetwt',
                'dry_wet_ratio': 'percentwt',
                #'drymet_id': 'drymet_id'
                
            }
        })
    else:
        raise ValueError("Invalid encoding_type provided. Please use 'netcdf' or 'openrefine'.")

In [None]:
#| exports
class SelectAndRenameColumnCB(Callback):
    def __init__(self, 
                 fn_renaming_rules:Callable, # A function that returns an OrderedDict of renaming rules 
                 encoding_type:str='netcdf', # The encoding type (`netcdf` or `openrefine`) to determine which renaming rules to use
                 verbose:bool=False # Whether to print out renaming rules that were not applied
                 ):
        "Select and rename columns in a DataFrame based on renaming rules for a specified encoding type."
        fc.store_attr()

    def __call__(self, tfm:Transformer):
        "Apply column selection and renaming to DataFrames in the transformer, and identify unused rules."
        try:
            renaming_rules = self.fn_renaming_rules(self.encoding_type)
        except ValueError as e:
            print(f"Error fetching renaming rules: {e}")
            return

        for group in tfm.dfs.keys():
            # Get relevant renaming rules for the current group
            group_rules = self._get_group_rules(renaming_rules, group)

            if not group_rules:
                continue

            # Apply renaming rules and track keys not found in the DataFrame
            df = tfm.dfs[group]
            df, not_found_keys = self._apply_renaming(df, group_rules)
            tfm.dfs[group] = df
            
            # Print any renaming rules that were not used
            if not_found_keys and self.verbose:
                print(f"\nGroup '{group}' has the following renaming rules not applied:")
                for old_col in not_found_keys:
                    print(f"Key '{old_col}' from renaming rules was not found in the DataFrame.")

    def _get_group_rules(self, 
                         renaming_rules:OrderedDict, # Renaming rules
                         group:str # Group name to filter rules
                         ) -> OrderedDict: # Renaming rules applicable to the specified group
        "Retrieve and merge renaming rules for the specified group based on the encoding type."
        relevant_rules = [rules for key, rules in renaming_rules.items() if group in key]
        merged_rules = OrderedDict()
        for rules in relevant_rules:
            merged_rules.update(rules)
        return merged_rules

    def _apply_renaming(self, 
                        df:pd.DataFrame, # DataFrame to modify
                        rename_rules:OrderedDict # Renaming rules
                        ) -> tuple: # (Renamed and filtered df, Column names from renaming rules that were not found in the DataFrame)
        """
        Select columns based on renaming rules and apply renaming, only for existing columns
        while maintaining the order of the dictionary columns."""
        existing_columns = set(df.columns)
        valid_rules = OrderedDict((old_col, new_col) for old_col, new_col in rename_rules.items() if old_col in existing_columns)

        # Create a list to maintain the order of columns
        columns_to_keep = [col for col in rename_rules.keys() if col in existing_columns]
        columns_to_keep += [new_col for old_col, new_col in valid_rules.items() if new_col in df.columns]

        df = df[list(OrderedDict.fromkeys(columns_to_keep))]

        # Apply renaming (returns a new DataFrame instead of modifying a slice in place)
        df = df.rename(columns=valid_rules)

        # Determine which keys were not found
        not_found_keys = set(rename_rules.keys()) - existing_columns
        return df, not_found_keys


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            GetSampleTypeCB(type_lut),
                            LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg()),        
                            SanitizeValue(coi_val),                       
                            NormalizeUncCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                            LookupSedimentCB(get_maris_sediments),
                            LookupUnitCB(),
                            LookupDetectionLimitCB(),    
                            RemapDataProviderSampleIdCB(),
                            LookupFiltCB(),
                            RemapStationIdCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetRatio(),
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            ])

tfm()

print(tfm.dfs['seawater'].columns)
print(tfm.dfs['biota'].columns)
print(tfm.dfs['sediment'].columns)

Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       '_sal', 'smp_depth', '_temp', 'tot_depth'],
      dtype='object')
Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       'smp_depth', 'species', 'body_part', 'bio_group'],
      dtype='object')
Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       'sed_type', 'tot_depth'],
      dtype='object')


### Reshape: long to wide

Convert data from long to wide and rename columns to comply with NetCDF format.
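As a rough illustration of what a long-to-wide reshape does, the sketch below pivots a toy table with plain pandas. It is not the implementation of `ReshapeLongToWide`; the toy values and the choice of `pivot_table` with `aggfunc='first'` are assumptions made for the example only.

```python
import pandas as pd

# Toy long-format data: one row per (sample, nuclide) measurement.
# Values are made up and only mimic the column names printed above.
long_df = pd.DataFrame({
    'lat': [60.08, 60.08, 59.43],
    'lon': [29.33, 29.33, 23.15],
    'time': [1337731200, 1337731200, 1339891200],
    'nuclide': ['cs137', 'sr90', 'cs137'],
    'value': [5.3, 1.2, 25.5],
})

# Pivot so that each nuclide becomes its own column, keyed by sample location and time.
wide_df = (long_df
           .pivot_table(index=['lat', 'lon', 'time'], columns='nuclide',
                        values='value', aggfunc='first')
           .reset_index())
wide_df.columns.name = None  # drop the leftover 'nuclide' axis label

print(wide_df)
```

Samples that lack a given nuclide end up with `NaN` in that nuclide's column, which is the usual cost of the wide layout.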

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            GetSampleTypeCB(type_lut),
                            LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg()),        
                            SanitizeValue(coi_val),                       
                            NormalizeUncCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                            LookupSedimentCB(get_maris_sediments),
                            LookupUnitCB(),
                            LookupDetectionLimitCB(),    
                            RemapDataProviderSampleIdCB(),
                            RecordMeasurementNoteCB(get_helcom_method_desc),
                            LookupFiltCB(),
                            RemapStationIdCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetRatio(),
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            ReshapeLongToWide(), 
                            CompareDfsAndTfmCB(dfs)
                            ])
tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  sediment  biota
Number of rows in dfs                                  21216     39817  15827
Number of rows in tfm.dfs                              21114     39531  15798
Number of dropped rows                                   102       286     29
Number of rows in tfm.dfs + Number of dropped rows     21216     39817  15827 



In [None]:
# seawater_dfs_review=tfm.dfs['seawater']
# biota_dfs_review=tfm.dfs['biota']
# sediment_dfs_review=tfm.dfs['sediment']

## NetCDF encoder

### Example change logs

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[                         
                            GetSampleTypeCB(type_lut),
                            LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg()),        
                            SanitizeValue(coi_val),                       
                            NormalizeUncCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                            LookupSedimentCB(get_maris_sediments),
                            LookupUnitCB(),
                            LookupDetectionLimitCB(),    
                            RemapDataProviderSampleIdCB(),
                            RecordMeasurementNoteCB(get_helcom_method_desc),
                            LookupFiltCB(),
                            RemapStationIdCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetRatio(),
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            ReshapeLongToWide(), 
                            CompareDfsAndTfmCB(dfs)
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs

['Convert nuclide names to lowercase and strip any trailing spaces.',
 'Encode time as `int` representing seconds since xxx',
 'Remap `KEY` column to `samplabcode` in each DataFrame.',
 'Drop row when both longitude & latitude equal 0. Drop unrealistic longitude & latitude values. Convert longitude & latitude `,` separator to `.` separator.']

### Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry > Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']

In [None]:
#| exports
def get_attrs(tfm, zotero_key, kw=kw):
    "Retrieve all global attributes."
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(cfg()),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()

In [None]:
#| eval: false
get_attrs(tfm, zotero_key=zotero_key, kw=kw)

{'geospatial_lat_min': '31.17',
 'geospatial_lat_max': '65.75',
 'geospatial_lon_min': '9.6333',
 'geospatial_lon_max': '53.5',
 'geospatial_bounds': 'POLYGON ((9.6333 53.5, 31.17 53.5, 31.17 65.75, 9.6333 65.75, 9.6333 53.5))',
 'time_coverage_start': '1984-01-10T00:00:00',
 'time_coverage_end': '2021-12-15T00:00:00',
 'title': 'Environmental database - Helsinki Commission Monitoring of Radioactive Substances',
 'summary': 'MORS Environment database has been used to collate data resulting from monitoring of environmental radioactivity in the Baltic Sea based on HELCOM Recommendation 26/3.\n\nThe database is structured according to HELCOM Guidelines on Monitoring of Radioactive Substances (https://www.helcom.fi/wp-content/uploads/2019/08/Guidelines-for-Monitoring-of-Radioactive-Substances.pdf), which specifies reporting format, database structure, data types and obligatory parameters used for reporting data under Recommendation 26/3.\n\nThe database is updated and quality assured annua

In [None]:
#| exports
def enums_xtra(tfm, vars):
    "Retrieve a subset of the lengthy enum as `species_t` for instance."
    enums = Enums(lut_src_dir=lut_path(), cdl_enums=cdl_cfg()['enums'])
    xtras = {}
    for var in vars:
        unique_vals = tfm.unique(var)
        if unique_vals.any():
            xtras[f'{var}_t'] = enums.filter(f'{var}_t', unique_vals)
    return xtras

### Encoding NetCDF

In [None]:
#| exports
def encode(fname_in, fname_out_nc, nc_tpl_path, **kwargs):
    "Transform HELCOM data and encode it as NetCDF."
    dfs = load_data(fname_in)
    tfm = Transformer(dfs, cbs=[
                                GetSampleTypeCB(type_lut),
                                LowerStripRdnNameCB(),
                                RemapRdnNameCB(),
                                ParseTimeCB(),
                                EncodeTimeCB(cfg()),        
                                SanitizeValue(coi_val),                       
                                NormalizeUncCB(),
                                LookupBiotaSpeciesCB(get_maris_species),
                                LookupBiotaBodyPartCB(get_maris_bodypart),                          
                                LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                                LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                                LookupSedimentCB(get_maris_sediments),
                                LookupUnitCB(),
                                LookupDetectionLimitCB(),    
                                RemapDataProviderSampleIdCB(),
                                RecordMeasurementNoteCB(get_helcom_method_desc),
                                LookupFiltCB(),
                                RemapStationIdCB(),
                                RemapSedSliceTopBottomCB(),
                                LookupDryWetRatio(),
                                FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                                SanitizeLonLatCB(),
                                SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                                ReshapeLongToWide()
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            src_fname=nc_tpl_path,
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                            enums_xtra=enums_xtra(tfm, vars=['species', 'body_part'])
                           )
    encoder.encode()

In [None]:
#| eval: false
encode(fname_in, fname_out_nc, nc_tpl_path(), verbose=False)

## Open Refine Pipeline

### Rename columns for Open Refine

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            GetSampleTypeCB(type_lut),
                            LowerStripRdnNameCB(),
                            RemapRdnNameCB(),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg()),        
                            SanitizeValue(coi_val),                       
                            NormalizeUncCB(),
                            LookupBiotaSpeciesCB(get_maris_species),
                            LookupBiotaBodyPartCB(get_maris_bodypart),                          
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                            LookupSedimentCB(get_maris_sediments),
                            LookupUnitCB(),
                            LookupDetectionLimitCB(),    
                            RemapDataProviderSampleIdCB(),
                            RecordMeasurementNoteCB(get_helcom_method_desc),
                            LookupFiltCB(),
                            RemapStationIdCB(),
                            RemapSedSliceTopBottomCB(),
                            LookupDryWetRatio(),
                            FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='openrefine', verbose=True),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')


Group 'seawater' has the following renaming rules not applied:
Key 'sampnote' from renaming rules was not found in the DataFrame.

Group 'sediment' has the following renaming rules not applied:
Key 'FILT' from renaming rules was not found in the DataFrame.
Key 'TTEMP' from renaming rules was not found in the DataFrame.
Key 'SDEPTH' from renaming rules was not found in the DataFrame.
Key 'SALIN' from renaming rules was not found in the DataFrame.
Key 'sampnote' from renaming rules was not found in the DataFrame.

Group 'biota' has the following renaming rules not applied:
Key 'TDEPTH' from renaming rules was not found in the DataFrame.
Key 'FILT' from renaming rules was not found in the DataFrame.
Key 'TTEMP' from renaming rules was not found in the DataFrame.
Key 'SALIN' from renaming rules was not found in the DataFrame.
Key 'sampnote' from renaming rules was not found in the DataFrame.
                                                    seawater  sediment  biota
Number of rows in df

**Example of data included in `dfs_dropped`.**

Main reasons for rows to be dropped from `dfs`:
- No activity value reported (e.g. `VALUE_Bq/kg`).
- No time value reported.
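The idea of tracking dropped rows can be sketched with plain pandas by comparing the index of the raw and cleaned frames. This is only an illustration under assumed toy data and a hypothetical cleaning rule; it does not claim to reproduce how `CompareDfsAndTfmCB` works internally.

```python
import pandas as pd

# Toy stand-in for one raw group (e.g. sediment); values are made up.
raw = pd.DataFrame({
    'KEY': ['S1', 'S2', 'S3', 'S4'],
    'VALUE_Bq/kg': [334.25, None, 59.63, 12.0],
    'time': [1262304000, 1262304000, None, 1262390400],
})

# Hypothetical cleaning step: keep rows that have both an activity value and a time.
clean = raw.dropna(subset=['VALUE_Bq/kg', 'time'])

# Rows present in the raw frame but absent after cleaning.
dropped = raw.loc[raw.index.difference(clean.index)]

# Summary using the same labels as the comparison table printed earlier.
stats = {
    'Number of rows in dfs': len(raw),
    'Number of rows in tfm.dfs': len(clean),
    'Number of dropped rows': len(dropped),
}
print(stats)
print(dropped)
```

Because the comparison relies on the index, it only works as long as the transformation preserves the original row index.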

In [None]:
grp='sediment'
#grp='seawater'
#grp='biota'

tfm.dfs_dropped[grp]

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/kg,VALUE_Bq/kg,ERROR%_kg,< VALUE_Bq/m²,VALUE_Bq/m²,ERROR%_m²,DATE_OF_ENTRY_x,...,LOWSLI,AREA,SEDI,OXIC,DW%,LOI%,MORS_SUBBASIN,HELCOM_SUBBASIN,SUM_LINK,DATE_OF_ENTRY_y
11784,SLREB1998021,SR90,2,,,,,,,,...,12.0,0.02100,55.0,O,,,14.0,14.0,a,
11824,SLVDC1997023,CS137,1,,,,,,,,...,14.0,0.02100,55.0,O,,,9.0,9.0,a,
11832,SLVDC1997031,CS137,1,,,,,,,,...,14.0,0.02100,55.0,O,,,9.0,9.0,a,
11841,SLVDC1997040,CS137,1,,,,,,,,...,16.0,0.02100,55.0,O,,,9.0,9.0,a,
11849,SLVDC1998011,CS137,1,,,,,,,,...,16.0,0.02100,55.0,O,,,14.0,14.0,a,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39769,SSSSM2021030,CO60,SSSM43,<,,,<,,,09/06/22 00:00:00,...,2.0,0.01608,,,28.200000,15.0,12.0,12.0,,09/06/22 00:00:00
39774,SSSSM2021030,RA226,SSSM43,<,,,<,,,09/06/22 00:00:00,...,2.0,0.01608,,,28.200000,15.0,12.0,12.0,,09/06/22 00:00:00
39775,SSSSM2021030,RA223,SSSM43,<,,,<,,,09/06/22 00:00:00,...,2.0,0.01608,,,28.200000,15.0,12.0,12.0,,09/06/22 00:00:00
39777,SSSSM2021031,CS137,SSSM43,<,,,<,0.0,,09/06/22 00:00:00,...,2.0,0.01608,,,31.993243,,13.0,13.0,,09/06/22 00:00:00


## Open Refine encoder

In [None]:
#| exports
def encode_or(fname_in, fname_out_csv, ref_id, **kwargs):
    "Transform HELCOM data and encode it as Open Refine CSV."
    dfs = load_data(fname_in)
    tfm = Transformer(dfs, cbs=[
                                GetSampleTypeCB(type_lut),
                                LowerStripRdnNameCB(),
                                RemapRdnNameCB(),
                                ParseTimeCB(),
                                EncodeTimeCB(cfg()),        
                                SanitizeValue(coi_val),                       
                                NormalizeUncCB(),
                                LookupBiotaSpeciesCB(get_maris_species),
                                LookupBiotaBodyPartCB(get_maris_bodypart),                          
                                LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                                LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                                LookupSedimentCB(get_maris_sediments),
                                LookupUnitCB(),
                                LookupDetectionLimitCB(),    
                                RemapDataProviderSampleIdCB(),
                                RecordMeasurementNoteCB(get_helcom_method_desc),
                                LookupFiltCB(),
                                RemapStationIdCB(),
                                RemapSedSliceTopBottomCB(),
                                LookupDryWetRatio(),
                                FormatCoordinates(coi_coordinates, ddmmmm2dddddd),
                                SanitizeLonLatCB(),
                                SelectAndRenameColumnCB(get_renaming_rules, encoding_type='openrefine'),
                                CompareDfsAndTfmCB(dfs)
                                ])
    tfm()

    encoder = OpenRefineCsvEncoder(tfm.dfs, 
                                   dest_fname=fname_out_csv, 
                                   ref_id=ref_id,
                                   verbose=kwargs.get('verbose', False)
                                  )
    encoder.encode()

In [None]:
#| eval: false
encode_or(fname_in, fname_out_csv, ref_id, verbose=True)

In [None]:
tfm.dfs['seawater']

Unnamed: 0,samptype_id,latitude,longitude,station,begperiod,samplabcode,nuclide_id,detection,unit_id,activity,uncertaint,sampdepth,salinity,temperatur,filtered,totdepth,measurenote
0,1,60.0833,29.3333,RU10,2012-05-23,WKRIL2012003,33,1,1,5.3,1.696,0.0,,,0,,
1,1,60.0833,29.3333,RU10,2012-05-23,WKRIL2012004,33,1,1,19.9,3.980,29.0,,,0,,
2,1,59.4333,23.1500,RU11,2012-06-17,WKRIL2012005,33,1,1,25.5,5.100,0.0,,,0,,
3,1,60.2500,27.9833,RU19,2012-05-24,WKRIL2012006,33,1,1,17.0,4.930,0.0,,,0,,
4,1,60.2500,27.9833,RU19,2012-05-24,WKRIL2012007,33,1,1,22.2,3.996,39.0,,,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21210,1,58.6033,11.2450,SW7,2021-09-28,WSSSM2021003,1,1,1,2370.0,970.000,1.0,,,2,,
21211,1,60.5200,18.3572,SWF135,2021-10-15,WSSSM2021005,1,1,1,1030.0,960.000,1.0,,,2,,
21212,1,57.4217,17.0000,SWS36,2021-11-04,WSSSM2021006,1,1,1,2240.0,970.000,1.0,,,2,,
21213,1,57.2347,11.9452,SWR35,2021-10-15,WSSSM2021007,1,1,1,2060.0,970.000,1.0,,,2,,


In [None]:
fname_out_csv

'../../_data/output/100-HELCOM-MORS-2024.csv'

***

### Open Refine variables not included in HELCOM

| Field name      | Full name                | HELCOM     |
|-----------------|--------------------------|------------|
| sampquality     | Sample quality           | N          |
| lab_id          | Laboratory ID            | N          |
| profile_id      | Profile ID               | N          |
| transect_id     | Transect ID              | N          |
| endperiod       | End period               | N          |
| vartype         | Variable type            | N          |
| freq            | Frequency                | N          |
| rl_detection    | Range low detection      | N          |
| rangelow        | Range low                | N          |
| rangeupp        | Range upper              | N          |
| Commonname      | Common name              | N          |
| volume          | Volume                   | N          |
| filtpore        | Filter pore              | N          |
| acid            | Acidified                | N          |
| oxygen          | Oxygen                   | N          |
| samparea        | Sample area              | N          |
| drywt           | Dry weight               | N          |
| wetwt           | Wet weight               | N          |
| sampmet_id      | Sampling method ID       | N          |
| drymet_id       | Drying method ID         | N          |
| prepmet_id      | Preparation method ID    | N          |
| counmet_id      | Counting method ID       | N          |
| refnote         | Reference note           | N          |
| sampnote        | Sample note              | N          |
| gfe             | Good for export          | ?          |

***

## TODO

TODO: Should we use a single encoder for both NetCDF and Open Refine? If so, should we have a single `encode` function that accepts an `encoding_type` argument?
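One possible shape for such a unified entry point is a small dispatch table keyed by `encoding_type`. The sketch below is hypothetical: `encode_netcdf_stub`, `encode_openrefine_stub`, `ENCODERS` and `encode_any` are placeholder names invented for illustration, and the stubs merely stand in for the real `encode` and `encode_or` functions.

```python
from typing import Callable, Dict

# Placeholder encoders (hypothetical): in the handler these would wrap
# the existing `encode` and `encode_or` functions.
def encode_netcdf_stub(fname_in: str, fname_out: str, **kwargs) -> str:
    return f'netcdf:{fname_in}->{fname_out}'

def encode_openrefine_stub(fname_in: str, fname_out: str, **kwargs) -> str:
    return f'openrefine:{fname_in}->{fname_out}'

# Dispatch table mapping an encoding type to its encoder.
ENCODERS: Dict[str, Callable[..., str]] = {
    'netcdf': encode_netcdf_stub,
    'openrefine': encode_openrefine_stub,
}

def encode_any(fname_in: str, fname_out: str, encoding_type: str = 'netcdf', **kwargs) -> str:
    "Route to the encoder registered for `encoding_type`."
    try:
        encoder = ENCODERS[encoding_type]
    except KeyError:
        raise ValueError(f'Unknown encoding_type: {encoding_type!r}') from None
    return encoder(fname_in, fname_out, **kwargs)

print(encode_any('in.csv', 'out.nc'))
print(encode_any('in.csv', 'out.csv', encoding_type='openrefine'))
```

A dispatch table keeps the format-specific arguments (for example a NetCDF template path or an Open Refine reference id) inside `**kwargs`, so adding a new output format would not change the shared signature.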

***

TODO: Include FILT for NetCDF

***

TODO: Check sediment `DW%` values that are below 1%. Are they realistic? Also check the `DW%` values that are exactly 0%. Run the cells below before `SelectAndRenameColumnCB` is applied.

In [None]:

dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[LowerStripRdnNameCB()])
tfm()

{'seawater':                 KEY NUCLIDE METHOD < VALUE_Bq/m³  VALUE_Bq/m³  ERROR%_m³  \
 0      WKRIL2012003   cs137    NaN           NaN          5.3  32.000000   
 1      WKRIL2012004   cs137    NaN           NaN         19.9  20.000000   
 2      WKRIL2012005   cs137    NaN           NaN         25.5  20.000000   
 3      WKRIL2012006   cs137    NaN           NaN         17.0  29.000000   
 4      WKRIL2012007   cs137    NaN           NaN         22.2  18.000000   
 ...             ...     ...    ...           ...          ...        ...   
 21211  WSSSM2021005      h3  SSM45           NaN       1030.0  93.203883   
 21212  WSSSM2021006      h3  SSM45           NaN       2240.0  43.303571   
 21213  WSSSM2021007      h3  SSM45           NaN       2060.0  47.087379   
 21214  WSSSM2021008      h3  SSM45           NaN       2300.0  43.478261   
 21215  WSSSM2021004      h3  SSM45             <          NaN        NaN   
 
          DATE_OF_ENTRY_x  COUNTRY LABORATORY   SEQUENCE  ... 

In [None]:
grp='sediment'
check_data_sediment=tfm.dfs[grp][(tfm.dfs[grp]['DW%'] < 1) & (tfm.dfs[grp]['DW%'] > 0.001) ]
check_data_sediment

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/kg,VALUE_Bq/kg,ERROR%_kg,< VALUE_Bq/m²,VALUE_Bq/m²,ERROR%_m²,DATE_OF_ENTRY_x,...,LOWSLI,AREA,SEDI,OXIC,DW%,LOI%,MORS_SUBBASIN,HELCOM_SUBBASIN,SUM_LINK,DATE_OF_ENTRY_y
30938,SLVEA2010001,cs137,LVEA01,,334.25,1.57,,131.886,41179.0,,...,2.0,0.0151,5.0,O,0.115,0.9,14.0,14.0,,11/11/11 00:00:00
30939,SLVEA2010002,cs137,LVEA01,,343.58,1.49,,132.092,41179.0,,...,4.0,0.0151,5.0,A,0.159,0.8,14.0,14.0,,11/11/11 00:00:00
30940,SLVEA2010003,cs137,LVEA01,,334.69,1.56,,134.39,41179.0,,...,6.0,0.0151,5.0,A,0.189,0.8,14.0,14.0,,11/11/11 00:00:00
30941,SLVEA2010004,cs137,LVEA01,,348.5,1.56,,136.699,41179.0,,...,8.0,0.0151,5.0,A,0.194,0.8,14.0,14.0,,11/11/11 00:00:00
30942,SLVEA2010005,cs137,LVEA01,,258.67,1.73,,104.894,41179.0,,...,10.0,0.0151,5.0,A,0.195,0.8,14.0,14.0,,11/11/11 00:00:00
30943,SLVEA2010006,cs137,LVEA01,,182.02,2.05,,77.523,41179.0,,...,12.0,0.0151,5.0,A,0.221,0.8,14.0,14.0,,11/11/11 00:00:00
30944,SLVEA2010007,cs137,LVEA01,,116.34,2.79,,46.946,41179.0,,...,14.0,0.0151,5.0,A,0.238,0.8,14.0,14.0,,11/11/11 00:00:00
30945,SLVEA2010008,cs137,LVEA01,,94.07,2.61,,38.162,41179.0,,...,16.0,0.0151,5.0,A,0.234,0.8,14.0,14.0,,11/11/11 00:00:00
30946,SLVEA2010009,cs137,LVEA01,,69.7,3.12,,27.444,41179.0,,...,18.0,0.0151,5.0,A,0.242,0.8,14.0,14.0,,11/11/11 00:00:00
30947,SLVEA2010010,cs137,LVEA01,,59.63,3.4,,24.22,41179.0,,...,20.0,0.0151,5.0,A,0.257,0.7,14.0,14.0,,11/11/11 00:00:00


In [None]:
grp='sediment'
check_data_sediment=tfm.dfs[grp][(tfm.dfs[grp]['DW%'] == 0) ]
check_data_sediment

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/kg,VALUE_Bq/kg,ERROR%_kg,< VALUE_Bq/m²,VALUE_Bq/m²,ERROR%_m²,DATE_OF_ENTRY_x,...,LOWSLI,AREA,SEDI,OXIC,DW%,LOI%,MORS_SUBBASIN,HELCOM_SUBBASIN,SUM_LINK,DATE_OF_ENTRY_y
9824,SERPC1997001,cs134,,,3.80,20.0,,5.75,,,...,2.0,0.008,5.0,A,0.0,0.0,11.0,11.0,a,
9825,SERPC1997001,cs137,,,389.00,4.0,,589.00,,,...,2.0,0.008,5.0,A,0.0,0.0,11.0,11.0,a,
9826,SERPC1997002,cs134,,,4.78,13.0,,12.00,,,...,4.0,0.008,5.0,A,0.0,0.0,11.0,11.0,a,
9827,SERPC1997002,cs137,,,420.00,4.0,,1060.00,,,...,4.0,0.008,5.0,A,0.0,0.0,11.0,11.0,a,
9828,SERPC1997003,cs134,,,3.12,17.0,,12.00,,,...,6.0,0.008,5.0,A,0.0,0.0,11.0,11.0,a,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15257,SKRIL1999062,th228,1,,68.00,,,,,,...,15.0,0.006,0.0,O,0.0,0.0,11.0,11.0,a,
15258,SKRIL1999063,k40,1,,1210.00,,,,,,...,21.5,0.006,0.0,O,0.0,0.0,11.0,11.0,a,
15259,SKRIL1999063,ra226,KRIL01,,56.50,,,,,,...,21.5,0.006,0.0,O,0.0,0.0,11.0,11.0,a,
15260,SKRIL1999063,ra228,KRIL01,,72.20,,,,,,...,21.5,0.006,0.0,O,0.0,0.0,11.0,11.0,a,


In [None]:
grp='biota'
check_data_biota=tfm.dfs[grp][(tfm.dfs[grp]['DW%'] == 0) ]
check_data_biota

Unnamed: 0,KEY,NUCLIDE,METHOD,< VALUE_Bq/kg,VALUE_Bq/kg,BASIS,ERROR%,NUMBER,DATE_OF_ENTRY_x,COUNTRY,...,BIOTATYPE,TISSUE,NO,LENGTH,WEIGHT,DW%,LOI%,MORS_SUBBASIN,HELCOM_SUBBASIN,DATE_OF_ENTRY_y
5971,BERPC1997002,k40,,,116.0,W,3.0,,,91.0,...,F,5,0.0,0.0,0.0,0.0,0.0,11.0,11,
5972,BERPC1997002,cs137,,,12.6,W,4.0,,,91.0,...,F,5,0.0,0.0,0.0,0.0,0.0,11.0,11,
5973,BERPC1997002,cs134,,,0.14,W,18.0,,,91.0,...,F,5,0.0,0.0,0.0,0.0,0.0,11.0,11,
5974,BERPC1997001,k40,,,116.0,W,4.0,,,91.0,...,F,5,0.0,0.0,0.0,0.0,0.0,11.0,11,
5975,BERPC1997001,cs137,,,12.0,W,4.0,,,91.0,...,F,5,0.0,0.0,0.0,0.0,0.0,11.0,11,
5976,BERPC1997001,cs134,,,0.21,W,24.0,,,91.0,...,F,5,0.0,0.0,0.0,0.0,0.0,11.0,11,


***

TODO: Should we manually extract the 'Counting method ID' from `measurenote` (i.e. the HELCOM `METHOD`)?

***

TODO: Include the `CompareDfsAndTfmCB` callback in the Transformer callback system.

Description: I would like to include `CompareDfsAndTfmCB` in the Callback class. This callback helps identify and analyze data dropped during transformations, which aids debugging and helps ensure data integrity.


***

TODO: The description of the 'Sample area' variable states 'Sample surface area of sediment (cm2)'.
In the MARIS LUT we have a 'dbo_area.xlsx' LUT which includes the IHO sea areas. 
1) What does the variable 'Sample area' represent for Open Refine, and is it the same for NetCDF?
2) The HELCOM data reports the sediment activity concentration as both Bq per mass and Bq per area. Would you like to include both entries in MARIS? 


In [None]:
dfs['sediment'].columns

Index(['KEY', 'NUCLIDE', 'METHOD', '< VALUE_Bq/kg', 'VALUE_Bq/kg', 'ERROR%_kg',
       '< VALUE_Bq/m²', 'VALUE_Bq/m²', 'ERROR%_m²', 'DATE_OF_ENTRY_x',
       'COUNTRY', 'LABORATORY', 'SEQUENCE', 'DATE', 'YEAR', 'MONTH', 'DAY',
       'STATION', 'LATITUDE (ddmmmm)', 'LATITUDE (dddddd)',
       'LONGITUDE (ddmmmm)', 'LONGITUDE (dddddd)', 'DEVICE', 'TDEPTH',
       'UPPSLI', 'LOWSLI', 'AREA', 'SEDI', 'OXIC', 'DW%', 'LOI%',
       'MORS_SUBBASIN', 'HELCOM_SUBBASIN', 'SUM_LINK', 'DATE_OF_ENTRY_y'],
      dtype='object')

***