:::{.callout-tip}

For new MARIS users, please refer to [Understanding MARIS Data Formats (NetCDF and Open Refine)](https://github.com/franckalbinet/marisco/tree/main/install_configure_guide) for detailed information.

:::

The present notebook aims to be an instance of [Literate Programming](https://www.wikiwand.com/en/articles/Literate_programming): a narrative in which code snippets are interspersed with explanations. When a function or a class needs to be exported to a dedicated Python module (in our case `marisco/handlers/ospar.py`), the code snippet is added to the module using the `#| exports` directive provided by the [nbdev](https://nbdev.readthedocs.io/en/latest/) library.

In [None]:
#| hide
%load_ext autoreload
%autoreload 2



In [None]:
#| export
import pandas as pd 
import numpy as np
from functools import partial 
import fastcore.all as fc 
from pathlib import Path 
from dataclasses import asdict
from typing import List, Dict, Callable, Tuple, Any 
from collections import OrderedDict, defaultdict
import re

from marisco.utils import (
    has_valid_varname, 
    match_worms, 
    Remapper, 
    ddmm_to_dd,
    match_maris_lut, 
    Match, 
    get_unique_across_dfs
    )

from marisco.callbacks import (
    Callback, 
    Transformer, 
    RemoveAllNAValuesCB,
    EncodeTimeCB, 
    AddSampleTypeIdColumnCB,
    AddNuclideIdColumnCB, 
    LowerStripNameCB, 
    SanitizeLonLatCB, 
    ReshapeLongToWide, 
    CompareDfsAndTfmCB,
    RemapCB
    )

from marisco.metadata import (
    GlobAttrsFeeder, 
    BboxCB, 
    DepthRangeCB, 
    TimeRangeCB, 
    ZoteroCB, 
    KeyValuePairCB
    )

from marisco.configs import (
    nuc_lut_path, 
    nc_tpl_path, 
    cfg, 
    cache_path, 
    cdl_cfg, 
    Enums, 
    lut_path, 
    species_lut_path, 
    sediments_lut_path, 
    bodyparts_lut_path, 
    detection_limit_lut_path, 
    filtered_lut_path, 
    area_lut_path,
    get_lut,
    unit_lut_path
    )

from marisco.utils import NA
from marisco.serializers import NetCDFEncoder,  OpenRefineCsvEncoder

import warnings
warnings.filterwarnings('ignore')

In [None]:
#| hide
pd.set_option('display.max_rows', 100)


## Configuration and file paths

1. **fname_in** - the path to the folder containing the OSPAR data in CSV format. The path can be relative.

2. **fname_out_nc** - the path and filename of the NetCDF output. The path can be relative.

3. **fname_out_csv** - the path and filename of the Open Refine CSV output. The path can be relative.

4. **zotero_key** - the key used to retrieve attributes related to the dataset from [Zotero](https://www.zotero.org/). The MARIS datasets are catalogued in a [library](https://maris.iaea.org/datasets) available on [Zotero](https://www.zotero.org/groups/2432820/maris/library).

5. **ref_id** - the reference id of the dataset as defined by MARIS, i.e. its location in the archive of the Zotero library.


In [None]:
# | exports
fname_in = '../../_data/accdb/ospar/csv'
fname_out_nc = '../../_data/output/191-OSPAR-2024.nc'
fname_out_csv = '../../_data/output/191-OSPAR-2024.csv'
zotero_key ='LQRA4MMK' # OSPAR MORS zotero key
ref_id = 191 # OSPAR reference id as defined by MARIS

## Load data

[OSPAR Environmental Monitoring Data](https://odims.ospar.org/en/) is provided as a Microsoft Access database. [`Mdbtools`](https://github.com/mdbtools/mdbtools) can be used to convert the tables of the Microsoft Access database to `.csv` files on Unix-like OS.

**Example steps**:

1. [Download data](https://odims.ospar.org/en/)
2. Install `mdbtools` via the `VS Code` terminal (for instance):

    ```
    sudo apt-get -y install mdbtools
    ```

3. Install `unzip` via the `VS Code` terminal:

    ```
    sudo apt-get -y install unzip
    ```

4. In the `VS Code` terminal (for instance), navigate to the marisco data folder:

    ```
    cd /home/marisco/downloads/marisco/_data/accdb/ospar
    ```

5. Unzip `OSPAR_Env_Concentrations_20240206.zip`

    ```
    unzip OSPAR_Env_Concentrations_20240206.zip
    ```

6. Run `preprocess.sh` to generate the required data files:

    ```
    ./preprocess.sh OSPAR_Env_Concentrations_20240206.zip
    ```

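7. Content of the `preprocess.sh` script:

    ```
    #!/bin/bash

    # Example of use: ./preprocess.sh OSPAR_Env_Concentrations_20240206.zip
    unzip "$1"
    # Pick the first Access database found in the archive
    dbname=$(ls *.accdb *.mdb 2>/dev/null | head -n 1)
    mkdir -p csv
    # Read one table name per line so that names containing spaces stay intact
    mdb-tables -1 "$dbname" | while IFS= read -r table; do
        echo "Export table $table"
        mdb-export "$dbname" "$table" > "csv/$table.csv"
    done
    ```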

Once converted to `.csv` files, the data is ready to be loaded into a dictionary of dataframes.
    

The `load_data` function below reads the OSPAR CSV files and returns a Python dictionary of dataframes keyed by sample type.

In [None]:
#| exports
default_smp_types = {'Seawater data': 'seawater', 'Biota data': 'biota'}

In [None]:
#| exports
def load_data(src_dir:str, # Directory where the source CSV files are located
              lut:dict=default_smp_types # A dictionary with the file name as key and the sample type as value
              ) -> dict: # A dictionary with sample types as keys and their corresponding dataframes as values
    "Load `OSPAR` data and return the data in a dictionary of dataframes with the dictionary key as the sample type."
    return {
        sample_type: pd.read_csv(Path(src_dir) / f'{file_name}.csv', encoding='unicode_escape')
        for file_name, sample_type in lut.items()
    }

`dfs` is a dictionary of dataframes created from the OSPAR dataset located at `fname_in`. Each dataframe holds the data of one sample type, and its dictionary key is that sample type.
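
As a self-contained illustration of the returned structure, the sketch below writes two tiny CSV files to a temporary directory and loads them in the same way. The file contents are made up for the example, and `load_csvs` is a hypothetical stand-in for `load_data`; only the mapping from file name to sample type mirrors `default_smp_types`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Same mapping as `default_smp_types`: CSV file stem -> sample type.
smp_types = {'Seawater data': 'seawater', 'Biota data': 'biota'}

def load_csvs(src_dir, lut):
    "Read one CSV per entry of `lut` and key the dataframes by sample type."
    return {
        sample_type: pd.read_csv(Path(src_dir) / f'{file_name}.csv')
        for file_name, sample_type in lut.items()
    }

with tempfile.TemporaryDirectory() as tmp:
    # Made-up two-row files standing in for the exported OSPAR tables.
    for file_name in smp_types:
        pd.DataFrame({'ID': [1, 2], 'Nuclide': ['137Cs', '3H']}).to_csv(
            Path(tmp) / f'{file_name}.csv', index=False)
    demo_dfs = load_csvs(tmp, smp_types)

print(list(demo_dfs.keys()))
```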

In [None]:
#|eval: false
dfs = load_data(fname_in)
print('keys/sample types: ', dfs.keys())

for key in dfs.keys():
    print(f'{key} columns: ', dfs[key].columns)

keys/sample types:  dict_keys(['seawater', 'biota'])
seawater columns:  Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Sampling depth', 'Sampling date',
       'Nuclide', 'Value type', 'Activity or MDA', 'Uncertainty', 'Unit',
       'Data provider', 'Measurement Comment', 'Sample Comment',
       'Reference Comment'],
      dtype='object')
biota columns:  Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Biological group', 'Species',
       'Body Part', 'Sampling date', 'Nuclide', 'Value type',
       'Activity or MDA', 'Uncertainty', 'Unit', 'Data provider',
       'Measurement Comment', 'Sample Comment', 'Reference Comment'],
      dtype='object')


## Remove missing data

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Seawater` dataset contains 538 rows with all NA values as shown below.

:::

In [None]:
#| eval: false
dfs = load_data(fname_in)
for key in dfs.keys():
    cols_to_check = dfs[key].columns[1:]
    mask = dfs[key][cols_to_check].isnull().all(axis=1)
    print(f'{key}: {mask.sum()} rows with all NA values')

seawater: 538 rows with all NA values
biota: 0 rows with all NA values


In [None]:
#| exports
common_cols = [
    'Contracting Party', 'RSC Sub-division', 'Station ID', 'Sample ID',
    'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM', 'LongS', 'LongDir',
    'Sample type', 'Sampling date', 'Nuclide', 'Value type', 'Activity or MDA',
    'Uncertainty', 'Unit', 'Data provider', 'Measurement Comment',
    'Sample Comment', 'Reference Comment'
]

cols_to_check = {
    'seawater': common_cols + ['Sampling depth'],
    'biota': common_cols + ['Biological group', 'Species', 'Body Part']
}

Let's use the `RemoveAllNAValuesCB` callback to drop the rows in which every checked column is NA.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[RemoveAllNAValuesCB(cols_to_check)])

# Test that all NA values have been removed
fc.test_eq(tfm()['seawater'][cols_to_check['seawater']].isnull().all(axis=1).sum(), 0)
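
The callback pattern used throughout this notebook can be approximated with plain pandas. The sketch below is not the marisco implementation: `MiniTransformer` and `DropAllNaRows` are hypothetical stand-ins, the data is made up, and it assumes that a callback simply mutates the dictionary of dataframes held by the transformer.

```python
import numpy as np
import pandas as pd

class DropAllNaRows:
    "Hypothetical callback: drop rows in which every checked column is NA."
    def __init__(self, cols_to_check):
        self.cols_to_check = cols_to_check  # {sample type: [column names]}

    def __call__(self, tfm):
        for key, df in tfm.dfs.items():
            all_na = df[self.cols_to_check[key]].isnull().all(axis=1)
            tfm.dfs[key] = df[~all_na]

class MiniTransformer:
    "Hypothetical stand-in for `Transformer`: apply callbacks in order to copies of the dataframes."
    def __init__(self, dfs, cbs):
        self.dfs = {key: df.copy() for key, df in dfs.items()}
        self.cbs = cbs

    def __call__(self):
        for cb in self.cbs:
            cb(self)
        return self.dfs

# Made-up data: row ID 2 carries nothing but its ID.
raw = {'seawater': pd.DataFrame({
    'ID': [1, 2, 3],
    'Nuclide': ['137Cs', np.nan, '3H'],
    'Activity or MDA': [0.2, np.nan, np.nan],
})}
checked = {'seawater': ['Nuclide', 'Activity or MDA']}

cleaned = MiniTransformer(raw, cbs=[DropAllNaRows(checked)])()
print(cleaned['seawater']['ID'].tolist())
```

Working on copies keeps the input dictionary untouched, so the same `raw` data can be fed to several pipelines.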

## Add sample type column

The sample types (`seawater`, `biota`), as defined in `configs.ipynb`, are encoded as group names in the NetCDF output. In addition, the `AddSampleTypeIdColumnCB` callback writes the sample type id into each dataframe for legacy purposes (i.e. the Open Refine output).

For instance:

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[AddSampleTypeIdColumnCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

print(tfm()['seawater'][['ID', 'Station ID', 'samptype_id']].head())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

   ID   Station ID  samptype_id
0   1  Belgica-W01            1
1   2  Belgica-W02            1
2   3  Belgica-W03            1
3   4  Belgica-W04            1
4   5  Belgica-W05            1
                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  15314
Number of dropped rows                                     0      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



## Normalize nuclide names

### Remap nuclide names to MARIS data formats

We map nuclide names used by OSPAR to the MARIS standard nuclide names. 

Remapping data provider nomenclatures into MARIS standards is a recurrent operation and is done in a semi-automated manner according to the following pattern:

1. **Inspect** data provider nomenclature;
2. **Match** automatically against MARIS nomenclature (using a fuzzy matching algorithm);
3. **Fix** potential mismatches;
4. **Apply** the lookup table to the dataframe.

From now on, we will use this pattern to remap the OSPAR data provider nomenclatures into MARIS standards and, for the sake of brevity, call it **IMFA** (**I**nspect, **M**atch, **F**ix, **A**pply).
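
The four steps can be sketched in a few lines of plain Python. This is an illustration only, not the `Remapper` implementation: it assumes a case-insensitive Levenshtein edit distance as the match score, and the three-entry MARIS list is a made-up excerpt.

```python
def levenshtein(a: str, b: str) -> int:
    "Number of single-character insertions, deletions or substitutions turning `a` into `b`."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# 1. Inspect: names as reported by the data provider.
provider_names = ['Cs-137', '137Cs', 'h3']
# Hypothetical excerpt of the MARIS nomenclature.
maris_names = ['h3', 'cs137', 'ra226']

# 2. Match: closest MARIS name and its score (0 means a perfect match).
def best_match(name):
    return min(((m, levenshtein(name.lower(), m)) for m in maris_names),
               key=lambda pair: pair[1])

matches = {name: best_match(name) for name in provider_names}

# 3. Fix: override the matches that inspection shows to be wrong or doubtful.
fixes = {'137Cs': 'cs137'}
lut = {name: fixes.get(name, matched) for name, (matched, _) in matches.items()}

# 4. Apply: replace provider names with MARIS names.
remapped = [lut[name] for name in provider_names]
print(matches)
print(remapped)
```

Automatic matching handles small spelling variations, while reordered names such as `137Cs` are far from their target in edit distance and need an explicit fix.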

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Nuclide` column has inconsistent naming. E.g:

- `Cs-137`,  `137Cs` or `CS-137`
- `239, 240 pu` or `239,240 pu`
- `ra-226` and `226ra` 

See below:

:::

In [None]:
#| eval: false
dfs = load_data(fname_in)
get_unique_across_dfs(dfs, col_name='Nuclide', as_df=True)

Unnamed: 0,index,value
0,0,238Pu
1,1,"239, 240 Pu"
2,2,210Po
3,3,210Pb
4,4,
5,5,CS-137
6,6,Cs-134
7,7,CS-134
8,8,RA-226
9,9,"239,240Pu"


Let's now create an instance of a fuzzy matching algorithm `Remapper`:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='Nuclide', as_df=True),
                    maris_lut_fn=nuc_lut_path,
                    maris_col_id='nuclide_id',
                    maris_col_name='nc_name',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='nuclides_ospar.pkl')

We then try to match OSPAR nuclide names to MARIS nuclide names as automatically as possible. The `match_score` column lets us assess the results:

In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing: 100%|██████████| 18/18 [00:00<00:00, 43.74it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"239, 240 Pu",pu240,"239, 240 Pu",8
"239,240Pu",pu240,"239,240Pu",6
241Am,pu241,241Am,4
137Cs,h3,137Cs,4
226Ra,u235,226Ra,4
228Ra,u238,228Ra,4
210Pb,ag106m,210Pb,4
210Po,ag106m,210Po,4
99Tc,tu,99Tc,3
238Pu,u238,238Pu,3


We then manually inspect the remaining unmatched names and create a fixes table to map them to the correct MARIS standards:

In [None]:
#| exports
fixes_nuclide_names = {
    '226Ra': 'ra226',
    '228Ra': 'ra228',
    '239, 240 Pu': 'pu239_240_tot',
    'CS-134': 'cs134',
    '137Cs': 'cs137',
    'RA-226': 'ra226',
    '3H': 'h3',
    'RA-228': 'ra228',
    '238Pu': 'pu238',
    '241Am': 'am241',
    'CS-137': 'cs137',
    '210Po': 'po210',
    '210Pb': 'pb210',
    'Cs-137': 'cs137',
    '99Tc': 'tc99',
    'Cs-134': 'cs134',
    '239,240Pu': 'pu239_240_tot'
    }

Let's try to match again but this time we use the `fixes_nuclide_names` to map the nuclide names to the MARIS standards:


In [None]:
#| eval: false
remapper.generate_lookup_table(as_df=True, fixes=fixes_nuclide_names)
remapper.select_match(match_score_threshold=1)

Processing: 100%|██████████| 18/18 [00:00<00:00, 42.81it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


The table is now empty: every nuclide name is remapped correctly. We can now create a callback `RemapNuclideNameCB` to remap the nuclide names. Note that we pass `overwrite=False` to `generate_lookup_table` so that the cached lookup table is reused.

In [None]:
#| exports
# Create a lookup table for nuclide names
lut_nuclides = lambda df: Remapper(provider_lut_df=df,
                                   maris_lut_fn=nuc_lut_path,
                                   maris_col_id='nuclide_id',
                                   maris_col_name='nc_name',
                                   provider_col_to_match='value',
                                   provider_col_key='value',
                                   fname_cache='nuclides_ospar.pkl').generate_lookup_table(fixes=fixes_nuclide_names, 
                                                                                            as_df=False, overwrite=False)

In [None]:
#| exports
class RemapNuclideNameCB(Callback):
    def __init__(self, 
                 fn_lut:Callable # Function that returns the lookup table dictionary
                ):
        "Remap data provider nuclide names to MARIS nuclide names."
        fc.store_attr()

    def __call__(self, tfm):
        df_uniques = get_unique_across_dfs(tfm.dfs, col_name='Nuclide', as_df=True)
        lut = {k: v.matched_maris_name for k, v in self.fn_lut(df_uniques).items()}    
        for k in tfm.dfs.keys():
            tfm.dfs[k]['NUCLIDE'] = tfm.dfs[k]['Nuclide'].replace(lut)

Let's see it in action:

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[RemapNuclideNameCB(lut_nuclides)])
dfs_out = tfm()

# For instance
for key in dfs_out.keys():
    print(f'{key} NUCLIDE unique: ', dfs_out[key]['NUCLIDE'].unique())

seawater NUCLIDE unique:  ['cs137' 'pu239_240_tot' 'ra226' 'ra228' 'tc99' 'h3' 'po210' 'pb210'
 'Unknown']
biota NUCLIDE unique:  ['pu239_240_tot' 'tc99' 'cs137' 'ra226' 'ra228' 'pu238' 'am241' 'cs134'
 'h3' 'pb210' 'po210']


Let's apply the `RemoveAllNAValuesCB` and `RemapNuclideNameCB` callbacks together and print the resulting `seawater` dataframe.


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[RemoveAllNAValuesCB(cols_to_check), 
                            RemapNuclideNameCB(lut_nuclides)])
tfm()
print(tfm.dfs['seawater'])

           ID Contracting Party  RSC Sub-division   Station ID Sample ID  \
0           1           Belgium               8.0  Belgica-W01    WNZ 01   
1           2           Belgium               8.0  Belgica-W02    WNZ 02   
2           3           Belgium               8.0  Belgica-W03    WNZ 03   
3           4           Belgium               8.0  Belgica-W04    WNZ 04   
4           5           Belgium               8.0  Belgica-W05    WNZ 05   
...       ...               ...               ...          ...       ...   
18851  121646    United Kingdom              10.0       Rosyth   2100318   
18852  121647    United Kingdom              10.0       Rosyth   2101399   
18853  121648    United Kingdom               6.0        Wylfa    21-656   
18854  121649    United Kingdom               6.0        Wylfa    21-657   
18855  121650    United Kingdom               6.0        Wylfa    21-654   

       LatD  LatM  LatS LatDir  LongD  ...  Nuclide  Value type  \
0      51.0  22.0  3

:::{.callout-tip}

**DISCUSS**: The `Seawater` dataset contains rows where the nuclide is `nan` (remapped to `Unknown`), see below.

:::

Let's return the `seawater` entries with `Unknown` nuclides.

In [None]:
tfm.dfs['seawater'][tfm.dfs['seawater']['NUCLIDE'] == 'Unknown']

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment,NUCLIDE
18471,120363,Ireland,4.0,N1,,53.0,25.0,0.0,N,6.0,...,,,,,,,2021 data,The Irish Navy attempted a few times to collec...,,Unknown
18472,120364,Ireland,4.0,N2,,53.0,36.0,0.0,N,5.0,...,,,,,,,2021 data,,,Unknown
18473,120365,Ireland,4.0,N3,,53.0,44.0,0.0,N,5.0,...,,,,,,,2021 data,,,Unknown
18474,120366,Ireland,4.0,N8,,53.0,39.0,0.0,N,5.0,...,,,,,,,2021 data,,,Unknown
18475,120367,Ireland,4.0,N9,,53.0,53.0,0.0,N,5.0,...,,,,,,,2021 data,,,Unknown
18476,120368,Ireland,4.0,N10,,53.0,52.0,0.0,N,5.0,...,,,,,,,2021 data,,,Unknown
18477,120369,Ireland,1.0,Salthill,,53.0,15.0,40.0,N,9.0,...,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,,Unknown
18478,120370,Ireland,1.0,Woodstown,,52.0,11.0,55.0,N,6.0,...,,,,,,,,,,Unknown


Let's return the `biota` entries with `Unknown` nuclides.

In [None]:
tfm.dfs['biota'][tfm.dfs['biota']['NUCLIDE'] == 'Unknown']

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment,NUCLIDE


### Add Nuclide Id column

The `nuclide_id` column is added to the dataframe for legacy reasons (again Open Refine output).
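
A minimal pandas sketch of this step, assuming that the callback looks each name up in a name-to-id table. It is not the `AddNuclideIdColumnCB` implementation; the three ids are the ones visible in the output of the next cell.

```python
import pandas as pd

# Excerpt of a name-to-id lookup; ids taken from the notebook output.
nuclide_ids = {'pu239_240_tot': 77, 'tc99': 15, 'cs137': 33}

df = pd.DataFrame({'NUCLIDE': ['pu239_240_tot', 'tc99', 'cs137']})
# `map` looks up each name; names missing from the lookup would become NaN.
df['nuclide_id'] = df['NUCLIDE'].map(nuclide_ids)
print(df)
```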

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE')
                            ])
dfs_out = tfm()

# For instance
dfs_out['biota'][['NUCLIDE', 'nuclide_id']]

Unnamed: 0,NUCLIDE,nuclide_id
0,pu239_240_tot,77
1,tc99,15
2,pu239_240_tot,77
3,pu239_240_tot,77
4,tc99,15
...,...,...
15309,tc99,15
15310,pu239_240_tot,77
15311,cs137,33
15312,cs137,33


## Standardize Time

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Seawater` dataset contains 10 rows with `NaN` reported for the `Sampling date` column, as shown below.

:::

In [None]:
#| eval: false
dfs = load_data(fname_in)
dfs_test = Transformer(dfs, cbs=[RemoveAllNAValuesCB(cols_to_check)])()
dfs_test['seawater'][dfs_test['seawater']['Sampling date'].isnull()]


Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
17298,97948,Sweden,11.0,SW7,1.0,58.0,36.0,12.0,N,11.0,...,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,
17302,97952,Sweden,12.0,Ringhals (R35),7.0,57.0,14.0,5.0,N,11.0,...,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,
18471,120363,Ireland,4.0,N1,,53.0,25.0,0.0,N,6.0,...,,,,,,,,2021 data,The Irish Navy attempted a few times to collec...,
18472,120364,Ireland,4.0,N2,,53.0,36.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18473,120365,Ireland,4.0,N3,,53.0,44.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18474,120366,Ireland,4.0,N8,,53.0,39.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18475,120367,Ireland,4.0,N9,,53.0,53.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18476,120368,Ireland,4.0,N10,,53.0,52.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18477,120369,Ireland,1.0,Salthill,,53.0,15.0,40.0,N,9.0,...,,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,
18478,120370,Ireland,1.0,Woodstown,,52.0,11.0,55.0,N,6.0,...,,,,,,,,,,


Create a callback that parses the `Sampling date` column (formatted as `%d/%m/%Y`) into a `time` column and drops the rows whose date is missing or cannot be parsed:

In [None]:
#| exports
class ParseTimeCB(Callback):
    "Parse the time format in the DataFrame."
    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df['time'] = pd.to_datetime(df['Sampling date'], format='%d/%m/%Y', errors='coerce')
            df['begperiod'] = df['time']
            df.dropna(subset=['time'], inplace=True)
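
The behaviour of `errors='coerce'` is easiest to see on a small example. The sketch below uses made-up dates and mirrors the two steps of the callback: unparseable or missing dates become `NaT`, and the corresponding rows are then dropped.

```python
import pandas as pd

# Made-up sampling dates in day/month/year order; the last two cannot be parsed.
df = pd.DataFrame({'Sampling date': ['27/01/2010', '26/01/2010', None, 'n/a']})

# errors='coerce' turns missing or malformed dates into NaT instead of raising.
df['time'] = pd.to_datetime(df['Sampling date'], format='%d/%m/%Y', errors='coerce')
n_unparsed = int(df['time'].isna().sum())

# Rows without a valid date are then dropped, as in `ParseTimeCB`.
df = df.dropna(subset=['time'])
print(n_unparsed, len(df))
```

Coercing silently discards malformed dates, so it is worth counting the `NaT` values before dropping them.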

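Apply the transformer with the `ParseTimeCB` callback. In the statistics printed below, the 548 dropped `seawater` rows are the 538 all-NA rows plus the 10 rows without a sampling date.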

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    ParseTimeCB(),
    CompareDfsAndTfmCB(dfs)])

tfm()

print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')
print(tfm.dfs['seawater'][['begperiod','time']])

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18308  15314
Number of dropped rows                                   548      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

       begperiod       time
0     2010-01-27 2010-01-27
1     2010-01-27 2010-01-27
2     2010-01-27 2010-01-27
3     2010-01-27 2010-01-27
4     2010-01-26 2010-01-26
...          ...        ...
18851 2021-04-29 2021-04-29
18852 2021-12-10 2021-12-10
18853 2021-04-07 2021-04-07
18854 2021-04-07 2021-04-07
18855 2021-04-07 2021-04-07

[18308 rows x 2 columns]


The NetCDF time format requires the time to be encoded as the number of seconds since a time of origin. In our case the time of origin is `1970-01-01`, as indicated in the `CONFIGS['units']['time']` dictionary of `configs.ipynb`.

`EncodeTimeCB` converts the OSPAR `time` values to the MARIS NetCDF `time` format. The cell below prints the `begperiod` and `time` columns for `seawater`.
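
The encoded integers in the notebook output are consistent with seconds elapsed since `1970-01-01`. The sketch below reproduces that arithmetic with the standard library only. It is not the `EncodeTimeCB` implementation; `to_epoch_seconds` is a hypothetical helper that assumes midnight UTC timestamps.

```python
from datetime import datetime, timezone

def to_epoch_seconds(date_str: str) -> int:
    "Seconds between 1970-01-01T00:00:00Z and midnight UTC of `date_str` (YYYY-MM-DD)."
    dt = datetime.strptime(date_str, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_epoch_seconds('2010-01-27'))  # 1264550400
print(to_epoch_seconds('2010-01-26'))  # 1264464000
```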

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    ParseTimeCB(),
    EncodeTimeCB(cfg(), verbose = True)])

tfm()
tfm.dfs['seawater'][['begperiod','time']].head()

Unnamed: 0,begperiod,time
0,2010-01-27,1264550400
1,2010-01-27,1264550400
2,2010-01-27,1264550400
3,2010-01-27,1264550400
4,2010-01-26,1264464000


## Sanitize value

We copy the measurement values (column `Activity or MDA`) into a single `value` column and drop the rows where the value is missing.

In [None]:
# | exports
class SanitizeValue(Callback):
    "Sanitize value by removing blank entries and populating the `value` column."
    def __init__(self, 
                 value_col: str='Activity or MDA' # Column name to sanitize
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df.dropna(subset=[self.value_col], inplace=True)
            df['value'] = df[self.value_col]

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[SanitizeValue()])
tfm()['seawater'][['value']].head()

Unnamed: 0,value
0,0.2
1,0.27
2,0.26
3,0.25
4,0.2


## Normalize uncertainty

For each sample type in the OSPAR dataset, the reported uncertainty is given as an expanded uncertainty with a coverage factor `𝑘=2`. For further details, refer to the [OSPAR reporting guidelines](https://mcc.jrc.ec.europa.eu/documents/OSPAR/Guidelines_forestimationof_a_%20measurefor_uncertainty_in_OSPARmonitoring.pdf).

**Note**: Below, the OSPAR uncertainty values are normalized to standard uncertainty with a coverage factor `𝑘=1`.

`NormalizeUncCB` callback normalizes the uncertainty using the following `lambda` function:

In [None]:
#| exports
unc_exp2stan = lambda df, unc_col: df[unc_col] / 2
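
Dividing an expanded uncertainty by its coverage factor yields the standard uncertainty. The sketch below generalizes the lambda above to an arbitrary coverage factor; `expanded_to_standard` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def expanded_to_standard(df: pd.DataFrame, unc_col: str, k: float = 2.0) -> pd.Series:
    "Convert an expanded uncertainty (coverage factor `k`) to a standard uncertainty (k=1)."
    return df[unc_col] / k

# Made-up expanded uncertainties; the missing one stays missing.
df = pd.DataFrame({'Uncertainty': [0.066, 15.0, None]})
df['uncertainty'] = expanded_to_standard(df, 'Uncertainty')
print(df)
```

Missing uncertainties propagate as `NaN`, so no row is lost at this step.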

In [None]:
#| exports
class NormalizeUncCB(Callback):
    """Normalize uncertainty values in DataFrames."""
    def __init__(self, 
                 col_unc: str='Uncertainty', # Column name to normalize
                 fn_convert_unc: Callable=unc_exp2stan, # Function correcting coverage factor
                 ): 
        fc.store_attr()

    def __call__(self, tfm):
        for df in tfm.dfs.values():
            df['uncertainty'] = self.fn_convert_unc(df, self.col_unc)

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[       
                            SanitizeValue(),               
                            NormalizeUncCB()
                            ])
tfm()

for grp in ['seawater', 'biota']:
    print(f'\n{grp}:')
    print(tfm.dfs[grp][['value', 'uncertainty']].head())


seawater:
   value  uncertainty
0   0.20          NaN
1   0.27          NaN
2   0.26          NaN
3   0.25          NaN
4   0.20          NaN

biota:
     value  uncertainty
0   0.3510        0.033
1  39.0000        7.500
2   0.0938        0.009
3   1.5400        0.155
4  16.0000        3.000


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `Seawater` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

To show the cases where the uncertainty is much greater than the value, we calculate the relative uncertainty for the seawater dataset.

In [None]:
grp='seawater'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['uncertainty'] / tfm.dfs[grp]['value'])
    # Multiply by 100 to convert to percentage
    * 100
)

Now we will return all rows where the relative uncertainty is greater than 100% for the seawater dataset.

In [None]:
threshold = 100
cols_to_show=['ID','Contracting Party','Nuclide', 'Value type','Activity or MDA', 'Uncertainty', 'Unit', 'relative_uncertainty' ]
tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]


Unnamed: 0,ID,Contracting Party,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,relative_uncertainty
1158,11075,United Kingdom,137Cs,=,0.0028,0.3276,Bq/l,5850.0
1160,11077,United Kingdom,137Cs,=,0.0029,0.3364,Bq/l,5800.0
1162,11079,United Kingdom,137Cs,=,0.0025,0.3325,Bq/l,6650.0
1164,11081,United Kingdom,137Cs,=,0.0025,0.345,Bq/l,6900.0
1166,11083,United Kingdom,137Cs,=,0.0038,0.3344,Bq/l,4400.0
1168,11085,United Kingdom,137Cs,=,0.0035,0.322,Bq/l,4600.0
1170,11087,United Kingdom,137Cs,=,0.0035,0.3395,Bq/l,4850.0
1211,11128,United Kingdom,137Cs,=,0.0016,0.3456,Bq/l,10800.0
1213,11130,United Kingdom,137Cs,=,0.0016,0.3296,Bq/l,10300.0
1215,11132,United Kingdom,137Cs,=,0.003,0.33,Bq/l,5500.0
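
As a check on the first row of the table above (ID 11075), the arithmetic can be reproduced by hand: the reported uncertainty is first halved to obtain the standard uncertainty, then divided by the activity.

```python
# Values reported for seawater row ID 11075 in the table above.
activity = 0.0028        # 'Activity or MDA', in Bq/l
expanded_unc = 0.3276    # 'Uncertainty' as reported (k=2), in Bq/l

standard_unc = expanded_unc / 2                # normalize to k=1
relative_unc = standard_unc / activity * 100   # in percent
print(round(relative_unc, 1))
```

The result agrees with the `relative_uncertainty` of 5850% listed for that row.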


:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `biota` dataset contains rows where the uncertainty is much greater than the value. Although this is not impossible, we think it is worth highlighting these entries.

:::

Compute the relative uncertainty for the biota dataset.

In [None]:
grp='biota'
tfm.dfs[grp]['relative_uncertainty'] = (
    # Divide 'uncertainty' by 'value'
    (tfm.dfs[grp]['uncertainty'] / tfm.dfs[grp]['value'])
    # Multiply by 100 to convert to percentage
    * 100
)

Return all rows where the relative uncertainty is greater than 100% for the biota dataset.

In [None]:
threshold = 100
cols_to_show=['ID','Contracting Party','Nuclide', 'Value type','Activity or MDA', 'Uncertainty', 'Unit', 'relative_uncertainty' ]
tfm.dfs[grp][cols_to_show][tfm.dfs[grp]['relative_uncertainty'] > threshold]


Unnamed: 0,ID,Contracting Party,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,relative_uncertainty
1491,88591,Denmark,137Cs,=,0.024,0.1248,Bq/kg f.w.,260.0
3279,82675,United Kingdom,"239,240Pu",=,0.056,0.13,Bq/kg f.w.,116.071429
3430,82600,Sweden,137Cs,=,0.38,3.38,Bq/kg f.w.,444.736842
5934,49310,Sweden,137Cs,=,0.168608,0.704,Bq/kg f.w.,208.768267
6202,49307,Sweden,137Cs,=,0.157033,0.746,Bq/kg f.w.,237.529691
6605,49305,Sweden,137Cs,=,0.118002,0.554,Bq/kg f.w.,234.741784
6891,49300,Sweden,137Cs,=,0.153924,0.762,Bq/kg f.w.,247.524752
7238,49297,Sweden,137Cs,=,0.192765,0.71,Bq/kg f.w.,184.162063
7435,62016,France,137Cs,=,0.039809,0.12,Bq/kg f.w.,150.719717
7454,49296,Sweden,137Cs,=,0.174048,0.672,Bq/kg f.w.,193.050193


## Remap Biota species

The OSPAR dataset contains biota species information in the `Species` column of the biota dataframe. To ensure consistency with MARIS standards, we need to remap these species names. We'll use the same IMFA approach that we employed for standardizing nuclide names.


We first inspect unique `Species` values used by OSPAR:

In [None]:
dfs = load_data(fname_in)
get_unique_across_dfs(dfs, col_name='Species', as_df=True)

Unnamed: 0,index,value
0,0,HIPPOGLOSSUS HIPPOGLOSSUS
1,1,
2,2,Anguilla anguilla
3,3,Unknown
4,4,GADUS MORHUA
...,...,...
151,151,Flatfish
152,152,RAJIDAE/BATOIDEA
153,153,BOREOGADUS SAIDA
154,154,Hyperoplus lanceolatus


We try to remap the `Species` column to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='Species', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='species_ospar.pkl')

In this step, we generate a lookup table using the `remapper` object. The lookup table maps data provider entries to MARIS entries using fuzzy matching. After generating the table, we select the matches whose score meets a specified threshold (here, a score of at least 1), so that only the matches requiring one or more character changes are shown.

- **`generate_lookup_table(as_df=True)`**: This method generates the lookup table and returns it as a DataFrame. It uses fuzzy matching to align entries from the data provider with those in the MARIS lookup table.
- **`select_match(match_score_threshold=1)`**: This method filters the generated lookup table to include only those matches with a score greater than or equal to the specified threshold. A threshold of 1 therefore hides the perfect matches and leaves only the imperfect ones for review.
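
The effect of the threshold can be sketched with plain pandas. This is not the `select_match` implementation: `select_imperfect` is a hypothetical helper, the perfect match in the first row is made up, and the sketch assumes that a score of 0 denotes a perfect match.

```python
import pandas as pd

# Hypothetical lookup table; scores count character edits, 0 being a perfect match.
lut_df = pd.DataFrame(
    {'matched_maris_name': ['Gadus morhua', 'Gadus', 'Cerastoderma edule'],
     'match_score': [0, 4, 10]},
    index=pd.Index(['GADUS MORHUA', 'Gadus sp.', 'CERASTODERMA (CARDIUM) EDULE'],
                   name='source_key'))

def select_imperfect(df: pd.DataFrame, threshold: int = 1) -> pd.DataFrame:
    "Keep matches whose score is at least `threshold`, worst matches first."
    kept = df[df['match_score'] >= threshold]
    return kept.sort_values('match_score', ascending=False)

to_review = select_imperfect(lut_df)
print(to_review)
```

Only the two imperfect matches remain, which is the subset that needs manual inspection.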

In [None]:
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing: 100%|██████████| 156/156 [00:27<00:00,  5.72it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,Lomentaria catenata,RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA,31
"Mixture of green, red and brown algae",Mercenaria mercenaria,"Mixture of green, red and brown algae",26
Solea solea (S.vulgaris),Loligo vulgaris,Solea solea (S.vulgaris),12
SOLEA SOLEA (S.VULGARIS),Loligo vulgaris,SOLEA SOLEA (S.VULGARIS),12
CERASTODERMA (CARDIUM) EDULE,Cerastoderma edule,CERASTODERMA (CARDIUM) EDULE,10
Cerastoderma (Cardium) Edule,Cerastoderma edule,Cerastoderma (Cardium) Edule,10
MONODONTA LINEATA,Ophiothrix lineata,MONODONTA LINEATA,9
DICENTRARCHUS (MORONE) LABRAX,Dicentrarchus labrax,DICENTRARCHUS (MORONE) LABRAX,9
NUCELLA LAPILLUS,Mugil cephalus,NUCELLA LAPILLUS,9
RAJIDAE/BATOIDEA,Batoidea,RAJIDAE/BATOIDEA,8


Below, we fix the entries that are not properly matched by the `Remapper` object:

In [None]:
#|exports
fixes_biota_species = {
    'SOLEA SOLEA (S.VULGARIS)': 'Solea solea',
    'MONODONTA LINEATA': 'Phorcus lineatus',
    'NUCELLA LAPILLUS': NA, # Dropped. In Worms 'Nucella lapillus (Linnaeus, 1758)'.
    'unknown': NA,
    'PECTINIDAE': NA, # Dropped. In Worms as PECTINIDAE is a family.
    'RAJIDAE/BATOIDEA': NA,
    'Flatfish': NA,
    'Unknown': NA,
    'PALMARIA PALMATA': NA, # Dropped. In Worms 'Palmaria palmata (Linnaeus) F.Weber & D.Mohr, 1805',
    'Mixture of green, red and brown algae': NA,
    'RHODYMENIA PSEUDOPALAMATA & PALMARIA PALMATA': NA,
    'Solea solea (S.vulgaris)': 'Solea solea'
    }

We now attempt remapping again, incorporating the `fixes_biota_species` dictionary:

In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_species)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/156 [00:00<?, ?it/s]

Processing: 100%|██████████| 156/156 [00:30<00:00,  5.17it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cerastoderma (Cardium) Edule,Cerastoderma edule,Cerastoderma (Cardium) Edule,10
CERASTODERMA (CARDIUM) EDULE,Cerastoderma edule,CERASTODERMA (CARDIUM) EDULE,10
DICENTRARCHUS (MORONE) LABRAX,Dicentrarchus labrax,DICENTRARCHUS (MORONE) LABRAX,9
Pleuronectiformes [order],Pleuronectiformes,Pleuronectiformes [order],8
RAJA DIPTURUS BATIS,Dipturus batis,RAJA DIPTURUS BATIS,5
FUCUS SPP.,Fucus,FUCUS SPP.,5
Rhodymenia spp.,Rhodymenia,Rhodymenia spp.,5
Sepia spp.,Sepia,Sepia spp.,5
Thunnus sp.,Thunnus,Thunnus sp.,4
Gadus sp.,Gadus,Gadus sp.,4


Visual inspection suggests that the remaining imperfectly matched entries are acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| exports
lut_biota = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='Species', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='species_ospar.pkl').generate_lookup_table(fixes=fixes_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of a `species` column to our `biota` dataframe, containing standardized species IDs.


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota')    
    ])

tfm()['biota']['species'].unique()

array([ 394,   96,  129,   50,  139,  270,  395,   -1,   99,  377,  414,
       1608,  244,  192,   23,    0,  402,  407,  401,  274,  378, 1609,
        384,  386,  191,  382,  404,  405,  385,  388,  383,  379,  432,
        243,  392,  393,  413,  400,  425,  419,  399,  556,  272,  391,
        234,  431,  442,  396, 1606,  403,  412,  435, 1610,  381,  437,
        434,  444,  443,  389,  440,  441,  439,  427,  438, 1605,  436,
        426,  433,  390,  420,  417,  397,  421,  294, 1221,  422,  423,
        428,  424,  415, 1607,  387,  380,  406,  398,  416,  408,  409,
        418,  430,  429,  411,  410])

## Enhance Species Data Using Biological group column
The `Biological group` column in the OSPAR dataset provides valuable insights related to species. We will leverage this information to enrich the `species` column. To achieve this, we will employ the generic `RemapCB` callback to create an `enhanced_species` column. Subsequently, this `enhanced_species` column will be used to further enrich the `species` column.

First we inspect the unique values in the `Biological group` column.

In [None]:
get_unique_across_dfs(dfs, col_name='Biological group', as_df=True)

Unnamed: 0,index,value
0,0,seaweed
1,1,molluscs
2,2,Molluscs
3,3,SEAWEED
4,4,Seaweed
5,5,MOLLUSCS
6,6,Fish
7,7,FISH
8,8,fish


We will remap the `Biological group` column's data to the `species` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='Biological group', as_df=True),
                    maris_lut_fn=species_lut_path,
                    maris_col_id='species_id',
                    maris_col_name='species',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='enhance_species_ospar.pkl')

As before, we generate the lookup table and select the matches whose score meets the specified threshold (i.e., a score of at least 1), which means that matches requiring one or more character changes are shown.

In [None]:
remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  5.44it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fish,Fucus,Fish,4
FISH,Fucus,FISH,4
fish,Fucus,fish,4
molluscs,Mollusca,molluscs,1
Molluscs,Mollusca,Molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1


We can see that some of the entries require manual corrections.
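
To see why `Fish` lands on `Fucus` while `molluscs` lands on `Mollusca`, it helps to read the match score as an edit distance. The sketch below assumes a case-insensitive Levenshtein distance. This is an assumption that is consistent with the scores shown above, not a description of the `Remapper` internals, and `levenshtein` is a helper written only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or substitutions turning a into b."""
    a, b = a.lower(), b.lower()
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein('molluscs', 'Mollusca'))  # 1
print(levenshtein('Fish', 'Fucus'))         # 4
```

A distance of 4 for a four-letter word means that almost every character had to change, which is a strong hint that the match is spurious and needs a manual fix.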

In [None]:
fixes_enhanced_biota_species = {
    'fish': 'Pisces',
    'FISH': 'Pisces',
    'Fish': 'Pisces'    
    
}


Now we will apply the manual corrections to the lookup table and generate the lookup table again.

In [None]:
remapper.generate_lookup_table(fixes=fixes_enhanced_biota_species)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/9 [00:00<?, ?it/s]

Processing: 100%|██████████| 9/9 [00:01<00:00,  5.51it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
molluscs,Mollusca,molluscs,1
Molluscs,Mollusca,Molluscs,1
MOLLUSCS,Mollusca,MOLLUSCS,1


Visual inspection suggests that the remaining imperfectly matched entries are acceptable. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.


In [None]:
#| exports
lut_biota_enhanced = lambda: Remapper(provider_lut_df=get_unique_across_dfs(dfs, col_name='Biological group', as_df=True),
                             maris_lut_fn=species_lut_path,
                             maris_col_id='species_id',
                             maris_col_name='species',
                             provider_col_to_match='value',
                             provider_col_key='value',
                             fname_cache='enhance_species_ospar.pkl').generate_lookup_table(fixes=fixes_enhanced_biota_species, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of an `enhanced_species` column to our `biota` dataframe, containing the standardized MARIS species IDs derived from the `Biological group` column.

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota')    
    ])

tfm()['biota']['enhanced_species'].unique()

array([ 873, 1059,  712])

Now that we have the `enhanced_species` column, we can use it to enrich the `species` column. We will use the enhanced species column in the absence of a species match if the enhanced species column is valid. 

In [None]:
# | export
class EnhanceSpeciesCB(Callback):
    """Enhance the 'species' column using the 'enhanced_species' column if conditions are met."""

    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        self._enhance_species(tfm.dfs['biota'])

    def _enhance_species(self, df: pd.DataFrame):
        df['species'] = df.apply(
            lambda row: row['enhanced_species'] if row['species'] in [-1, 0] and pd.notnull(row['enhanced_species']) else row['species'],
            axis=1
        )

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
    EnhanceSpeciesCB()
    ])

tfm()['biota']['species'].unique()

array([ 394,   96,  129,   50,  139,  270,  395,  712,   99,  377,  414,
       1608,  244,  192,   23, 1059,  402,  407,  401,  274,  378, 1609,
        384,  386,  191,  382,  404,  405,  385,  388,  383,  379,  432,
        243,  392,  873,  393,  413,  400,  425,  419,  399,  556,  272,
        391,  234,  431,  442,  396, 1606,  403,  412,  435, 1610,  381,
        437,  434,  444,  443,  389,  440,  441,  439,  427,  438, 1605,
        436,  426,  433,  390,  420,  417,  397,  421,  294, 1221,  422,
        423,  428,  424,  415, 1607,  387,  380,  406,  398,  416,  408,
        409,  418,  430,  429,  411,  410])

All entries are matched for the `species` column.
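
The fallback rule used by `EnhanceSpeciesCB` can also be expressed without a row-wise `apply`. The following is a minimal sketch on toy data (the rows are illustrative, not taken from the OSPAR dataset), using `numpy.where` as an alternative formulation rather than the handler's actual code.

```python
import numpy as np
import pandas as pd

# Toy data: -1 and 0 stand for "no species match"; NaN means no enhanced match either.
df = pd.DataFrame({
    'species': [394, -1, 0, -1],
    'enhanced_species': [873.0, 712.0, 1059.0, np.nan],
})

# Fall back to enhanced_species only where species is unmatched AND a fallback exists.
needs_fallback = df['species'].isin([-1, 0]) & df['enhanced_species'].notna()
df['species'] = np.where(needs_fallback, df['enhanced_species'], df['species']).astype(int)

print(df['species'].tolist())  # [394, 712, 1059, -1]
```

The last row keeps `-1` because no valid fallback exists, which mirrors the `pd.notnull` guard in the callback.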

## Remap biogroup

The MARIS species lookup table includes a ``biogroup_id`` column that associates each species with its corresponding ``biogroup``. We will leverage this relationship to populate a ``bio_group`` column in the biota DataFrame.

In [None]:
#| export
def get_biogroup_lut(maris_lut: str) -> dict:
    """
    Retrieve a lookup table for biogroup ids from a MARIS lookup table.

    Args:
        maris_lut (str): Path to the MARIS lookup table (Excel file).

    Returns:
        dict: A dictionary mapping species_id to biogroup_id.
    """
    species = pd.read_excel(maris_lut)
    return species[['species_id', 'biogroup_id']].set_index('species_id').to_dict()['biogroup_id']


Now that we have defined a function to retrieve the biogroup associated with each species, we can create and apply the `LookupBiogroupCB` callback to the `biota` DataFrame.

In [None]:
#| export
class LookupBiogroupCB(Callback):
    """Update biogroup id based on MARIS dbo_species.xlsx."""

    def __init__(self, fn_lut: Callable):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        lut = self.fn_lut()
        self.update_bio_group(tfm.dfs['biota'], lut)

    def update_bio_group(self, df: pd.DataFrame, lut: dict):
        """
        Update the 'bio_group' column in the DataFrame based on the lookup table.

        Args:
            df (pd.DataFrame): The DataFrame to process.
            lut (Dict[str, Any]): The lookup table for updating 'bio_group'.
        """
        df['bio_group'] = df['species'].apply(lambda x: lut.get(x, -1))


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
    EnhanceSpeciesCB(),
    LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path()))
    ])

tfm()['biota']['bio_group'].unique()

array([13, 11, 14,  4,  2,  6,  5, 12])

A biogroup is assigned to all species.
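
The lookup itself is a plain dictionary mapping. The sketch below reproduces the idea on made-up rows (the IDs are illustrative and do not come from `dbo_species.xlsx`), using `Series.map` with a default of `-1` for species that are absent from the table.

```python
import pandas as pd

# Illustrative species table; the real one is read from the MARIS Excel file.
species = pd.DataFrame({'species_id': [10, 11, 12], 'biogroup_id': [4, 4, 11]})
lut = dict(zip(species['species_id'], species['biogroup_id']))

# 999 is deliberately missing from the lookup to show the default value.
biota = pd.DataFrame({'species': [11, 12, 999]})
biota['bio_group'] = biota['species'].map(lut).fillna(-1).astype(int)

print(biota)
```

Using `-1` as the default keeps unmatched species visible in the output instead of silently dropping them.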

Comments from Franck and Niall:

In [None]:
# TO BE DONE
# Species column contains nan
# Niall replace with Biological group where missing
# In our case, we will use remapped species, once 
# done we can use internal MARIS lookup to remap to biota group. 
# If species is missing, we can use the biological group to perform the lookup. 

# Thank you, I approached this task slightly differently. 
# I created a new column called `enhanced_species`, which utilizes information from the 'Biological group' column to match entries to the MARIS species nomenclature
# where possible. If the previously matched 'species' value is -1 or 0, the `enhanced_species` column is used, provided it contains a valid species entry.
# Subsequently, the biogroup lookup is performed.

## Remap Biota body Part

The OSPAR dataset includes entries where the `Body Part` is labeled as `whole`. However, the MARIS data standard requires a more specific distinction in the `body_part` field, differentiating between `Whole animal` and `Whole plant`. Fortunately, the OSPAR data provides a `Biological group` field that allows us to make this distinction.

To address this discrepancy and ensure compatibility with MARIS standards, we will:

1. Create a temporary column `body_part_temp` that combines information from both `Body Part` and `Biological group`.
2. Use this temporary column to perform the lookup using our `Remapper` object.

Let's create the temporary column, `body_part_temp`, that combines `Body Part` and `Biological group`.
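
Before defining the callback, a toy example shows how this string concatenation behaves in pandas, in particular when one of the two values is missing. The rows below are illustrative and are not taken from the OSPAR file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Body Part': ['WHOLE', 'WHOLE', 'SOFT PARTS', np.nan],
    'Biological group': ['Fish', 'Seaweed', 'Molluscs', 'Fish'],
})

# Element-wise concatenation; a missing value on either side yields a missing result.
df['body_part_temp'] = df['Body Part'] + ' ' + df['Biological group']

print(df['body_part_temp'].tolist())
```

The same `WHOLE` label now yields two distinct keys, `WHOLE Fish` and `WHOLE Seaweed`, which is what later allows them to be mapped to different MARIS body parts.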

In [None]:
#| exports
class AddBodypartTempCB(Callback):
    "Add a temporary column with the body part and biological group combined."    
    def __call__(self, tfm):
        tfm.dfs['biota']['body_part_temp'] = (
            tfm.dfs['biota']['Body Part'] + ' ' + 
            tfm.dfs['biota']['Biological group']
            )                                    

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[  
                            RemoveAllNAValuesCB(cols_to_check),     
                            AddBodypartTempCB(),
                            ])
dfs_test = tfm()
dfs_test['biota']['body_part_temp'].unique()

array(['SOFT PARTS Molluscs', 'GROWING TIPS Seaweed',
       'Whole plant Seaweed', 'WHOLE Fish', 'WHOLE ANIMAL Fish',
       'FLESH WITHOUT BONES Fish', 'WHOLE ANIMAL Molluscs',
       'WHOLE PLANT Seaweed', 'Soft Parts Molluscs',
       'FLESH WITHOUT BONES Molluscs', 'WHOLE Seaweed',
       'Whole without head FISH', 'Cod medallion FISH', 'Muscle FISH',
       'Whole animal Fish', 'Whole fisk FISH', 'Whole FISH',
       'Mix of muscle and whole fish without liver FISH', 'Flesh Fish',
       'WHOLE FISH Fish', 'Whole animal Molluscs', 'Muscle Fish',
       'Whole fish Fish', 'FLESH WITHOUT BONE Fish', 'UNKNOWN Fish',
       'WHOLE PLANT seaweed', 'WHOLE PLANT SEAWEED',
       'SOFT PARTS molluscs', 'FLESH WITHOUT BONES FISH',
       'WHOLE ANIMAL FISH', 'FLESH WITHOUT BONES fish', 'FLESH Fish',
       'FLESH WITHOUT BONES SEAWEED', 'FLESH WITH SCALES Fish',
       'FLESH WITHOUT BONE FISH', 'HEAD FISH', 'WHOLE FISH FISH',
       'Flesh without bones Fish', 'UNKNOWN FISH', 'Soft parts

To align the ``body_part_temp`` column with the ``bodypar`` column in the MARIS nomenclature, we utilize a Remapper object. Since the OSPAR dataset does not include a predefined lookup table for the ``body_part`` column, we first create a lookup table by extracting unique values from the ``body_part_temp`` column.

In [None]:
get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),

(    index                                            value
 0       0                         FLESH WITHOUT BONES FISH
 1       1                                       Flesh Fish
 2       2                         FLESH WITHOUT BONES fish
 3       3                           FLESH WITH SCALES Fish
 4       4                                     UNKNOWN Fish
 5       5                              WHOLE PLANT seaweed
 6       6                         FLESH WITHOUT BONES Fish
 7       7                              Soft parts Molluscs
 8       8                                  Whole fisk FISH
 9       9                                      Muscle Fish
 10     10                             GROWING TIPS Seaweed
 11     11                                  WHOLE FISH Fish
 12     12                                WHOLE ANIMAL FISH
 13     13                              SOFT PARTS Molluscs
 14     14                      FLESH WITHOUT BONES SEAWEED
 15     15                              

We try to remap the `body_part_temp` column to the `bodypar` column of the MARIS nomenclature, again using a `Remapper` object:

In [None]:
#| eval: false
remapper = Remapper(provider_lut_df=get_unique_across_dfs(dfs_test, col_name='body_part_temp', as_df=True),
                    maris_lut_fn=bodyparts_lut_path,
                    maris_col_id='bodypar_id',
                    maris_col_name='bodypar',
                    provider_col_to_match='value',
                    provider_col_key='value',
                    fname_cache='bodyparts_ospar.pkl'
                    )

remapper.generate_lookup_table(as_df=True)
remapper.select_match(match_score_threshold=0)

Processing:   0%|          | 0/46 [00:00<?, ?it/s]

Processing: 100%|██████████| 46/46 [00:00<00:00, 92.48it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Mix of muscle and whole fish without liver FISH,Flesh without bones,Mix of muscle and whole fish without liver FISH,31
Whole without head FISH,Flesh without bones,Whole without head FISH,13
Cod medallion FISH,Old leaf,Cod medallion FISH,13
SOFT PARTS MOLLUSCS,Soft parts,SOFT PARTS MOLLUSCS,9
Soft Parts Molluscs,Soft parts,Soft Parts Molluscs,9
UNKNOWN FISH,Growing tips,UNKNOWN FISH,9
SOFT PARTS molluscs,Soft parts,SOFT PARTS molluscs,9
SOFT PARTS Molluscs,Soft parts,SOFT PARTS Molluscs,9
WHOLE FISH Fish,Whole animal,WHOLE FISH Fish,9
FLESH WITHOUT BONES Molluscs,Flesh without bones,FLESH WITHOUT BONES Molluscs,9


Many of the lookup entries are sufficient for our needs. However, for values that don't find a match, we can use the `fixes_biota_bodyparts` dictionary to apply manual corrections. First we will create the dictionary.

In [None]:
#|exports
fixes_biota_bodyparts = {
    'WHOLE Seaweed' : 'Whole plant',
    'Flesh Fish': 'Flesh with bones', # We assume it as the category 'Flesh with bones' also exists
    'FLESH Fish' : 'Flesh with bones',
    'UNKNOWN Fish' : NA,
    'UNKNOWN FISH': NA,
    'Cod medallion FISH' : NA, # TO BE DETERMINED
    'Mix of muscle and whole fish without liver FISH' : NA, # TO BE DETERMINED
    'Whole without head FISH' : NA, # TO BE DETERMINED
    'FLESH WITHOUT BONES SEAWEED' : NA # TO BE DETERMINED
}

Now we will generate the lookup table and apply the manual corrections of the ``fixes_biota_bodyparts`` dictionary.


In [None]:
#| eval: false
remapper.generate_lookup_table(fixes=fixes_biota_bodyparts)
remapper.select_match(match_score_threshold=1)

Processing:   0%|          | 0/46 [00:00<?, ?it/s]

Processing: 100%|██████████| 46/46 [00:00<00:00, 86.68it/s]


Unnamed: 0_level_0,matched_maris_name,source_name,match_score
source_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
FLESH WITHOUT BONES Molluscs,Flesh without bones,FLESH WITHOUT BONES Molluscs,9
WHOLE FISH Fish,Whole animal,WHOLE FISH Fish,9
WHOLE FISH FISH,Whole animal,WHOLE FISH FISH,9
Soft Parts Molluscs,Soft parts,Soft Parts Molluscs,9
Whole animal Molluscs,Whole animal,Whole animal Molluscs,9
Whole fish Fish,Whole animal,Whole fish Fish,9
SOFT PARTS molluscs,Soft parts,SOFT PARTS molluscs,9
SOFT PARTS MOLLUSCS,Soft parts,SOFT PARTS MOLLUSCS,9
WHOLE ANIMAL Molluscs,Whole animal,WHOLE ANIMAL Molluscs,9
SOFT PARTS Molluscs,Soft parts,SOFT PARTS Molluscs,9


At this stage, the majority of entries have been successfully matched to the MARIS nomenclature. Entries that remain unmatched are marked as not available. We can now proceed with the final remapping process:

1. Create Remapper Lambda Function:

   We'll define a lambda function that instantiates a Remapper object and returns its corrected lookup table.

2. Apply RemapCB: 

   Using the generic `RemapCB` callback, we'll perform the actual remapping.

In [None]:
#| exports
lut_bodyparts = lambda: Remapper(provider_lut_df=get_unique_across_dfs(tfm.dfs, col_name='body_part_temp', as_df=True),
                               maris_lut_fn=bodyparts_lut_path,
                               maris_col_id='bodypar_id',
                               maris_col_name='bodypar',
                               provider_col_to_match='value',
                               provider_col_key='value',
                               fname_cache='bodyparts_ospar.pkl'
                               ).generate_lookup_table(fixes=fixes_biota_bodyparts, as_df=False, overwrite=False)

Putting it all together, we now apply the `RemapCB` to our data. This process results in the addition of a `body_part` column to our `biota` dataframe, containing standardized body part IDs.

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[  
                            RemoveAllNAValuesCB(cols_to_check),     
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota')
                            ])
tfm()
tfm.dfs['biota']['body_part'].unique()

array([19, 56, 40,  1, 52,  0, 34,  4, 60, 13, 25])

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The `biota` dataset includes one entry where the `Body Part` is `FLESH WITHOUT BONES` for the `Biological group` of `SEAWEED` (see below).

:::

In [None]:
dfs['biota'][['ID','Contracting Party','Sample ID','Biological group','Body Part', 'Measurement Comment', 'Sample Comment']][(tfm.dfs['biota']['Body Part'] == 'FLESH WITHOUT BONES') & (tfm.dfs['biota']['Biological group'] == 'SEAWEED')]

Unnamed: 0,ID,Contracting Party,Sample ID,Biological group,Body Part,Measurement Comment,Sample Comment
2660,87356,Iceland,THFAG17C,SEAWEED,FLESH WITHOUT BONES,,


## Remap Taxon Information
Currently, the details (`Taxonname`, `TaxonRepName`, `Taxonrank`) are used for importing into the MARIS master database, but they are not included in the NetCDF encoding. 

We first need to retrieve the taxon information from the `dbo_species.xlsx` file.
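
The lookup used in this section is a nested dictionary: one entry per taxon column, each mapping a `species_id` to a value. A minimal sketch with made-up IDs (the taxon names appear in this notebook, but the IDs `1` and `2` are purely illustrative) shows the shape produced by `DataFrame.to_dict()` and how `Series.map` consumes it.

```python
import pandas as pd

# Illustrative species table; the IDs are hypothetical.
species = pd.DataFrame({
    'species_id': [1, 2],
    'Taxonname': ['Fucus vesiculosus', 'Mytilus edulis'],
    'Taxonrank': ['species', 'species'],
})

# Shape: {'Taxonname': {1: ..., 2: ...}, 'Taxonrank': {1: ..., 2: ...}}
lut = species.set_index('species_id').to_dict()

biota = pd.DataFrame({'species': [2, 1, 99]})  # 99 is absent from the lookup
for col in ['Taxonname', 'Taxonrank']:
    biota[col] = biota['species'].map(lut[col]).fillna('Unknown')

# IDs are integers, so they are converted to str before being joined for display.
unmatched = biota.loc[biota['Taxonname'] == 'Unknown', 'species'].unique()
print(f"Unmatched species IDs: {', '.join(map(str, unmatched))}")
```

Converting the IDs with `map(str, ...)` matters because `str.join` only accepts strings.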

In [None]:
#| exports
# TODO: Include Commonname field after next MARIS data reconciling process.
def get_taxon_info_lut(
    maris_lut:str # Path to the MARIS lookup table (Excel file)
) -> dict: # A dictionary mapping each taxon column name to a {species_id: value} dictionary
    "Retrieve a lookup table for Taxonname from a MARIS lookup table."
    species = pd.read_excel(maris_lut)
    return species[['species_id', 'Taxonname', 'Taxonrank','TaxonDB','TaxonDBID','TaxonDBURL']].set_index('species_id').to_dict()

lut_taxon = lambda: get_taxon_info_lut(species_lut_path())

In [None]:
# | exports
class RemapTaxonInformationCB(Callback):
    "Update taxon information based on MARIS species LUT."
    def __init__(self, fn_lut: Callable):
        self.fn_lut = fn_lut

    def __call__(self, tfm: Transformer):
        lut = self.fn_lut()
        df = tfm.dfs['biota']
        
        df['TaxonRepName'] = df.get('RUBIN', 'Unknown')
        
        taxon_columns = ['Taxonname', 'Taxonrank', 'TaxonDB', 'TaxonDBID', 'TaxonDBURL']
        for col in taxon_columns:
            df[col] = df['species'].map(lut[col]).fillna('Unknown')
        
        unmatched = df[df['Taxonname'] == 'Unknown']['species'].unique()
        if len(unmatched) > 0:
            print(f"Unmatched species IDs: {', '.join(map(str, unmatched))}")  # IDs are ints; join needs str

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
    RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
    EnhanceSpeciesCB(),
    RemapTaxonInformationCB(lut_taxon)
    ])

tfm()['biota'][['Taxonname', 'Taxonrank','TaxonDB','TaxonDBID','TaxonDBURL']].drop_duplicates().head()


Unnamed: 0,Taxonname,Taxonrank,TaxonDB,TaxonDBID,TaxonDBURL
0,Littorina littorea,species,Wikidata,Q27935,https://www.wikidata.org/wiki/Q27935
1,Fucus vesiculosus,species,Wikidata,Q754755,https://www.wikidata.org/wiki/Q754755
15,Mytilus edulis,species,Wikidata,Q27855,https://www.wikidata.org/wiki/Q27855
24,Clupea harengus,species,Wikidata,Q2396858,https://www.wikidata.org/wiki/Q2396858
28,Merlangius merlangus,species,Wikidata,Q273083,https://www.wikidata.org/wiki/Q273083


## Remap units

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The dataset contains inconsistent unit names (e.g., 'Bq/l', 'Bq/L', 'BQ/L'). To enhance consistency and usability, it is advisable to standardize the unit names across all sample types.

Additionally, the seawater DataFrame contains `NaN` values in the 'Unit' column, which constitute 14.5% of the total entries.
:::

Seawater units include `nan` values. Let's look at the percentage of `nan` values in the seawater dataframe.

In [None]:
tfm.dfs['seawater']['Unit'].isnull().sum()/len(tfm.dfs['seawater'])*100

14.543072387815265

The biota units include no nan values.


In [None]:
tfm.dfs['biota']['Unit'].isnull().sum()/len(tfm.dfs['biota'])*100

0.0

We again use the **IMFA** (Inspect, Match, Fix, Apply) pattern to remap the OSPAR units.

Let's inspect the units in the `biota` and `seawater` dataframes.

In [None]:
for key, df in tfm.dfs.items():
    print(key, df['Unit'].unique())

seawater ['Bq/l' nan 'Bq/L' 'BQ/L']
biota ['Bq/kg f.w.' 'Bq/kg.fw' 'Bq/kg fw' 'Bq/kg f.w']


Create the `lut_units` lookup table to map each unit name to its MARIS unit id.

In [None]:
#| export
lut_units = {'Bq/l': 1, #'Bq/m3'
            'Bq/L': 1, # 'Bq/m3'
            'BQ/L': 1, # 'Bq/m3'
            'Bq/kg f.w.': 5, # Bq/kgw
            'Bq/kg.fw' : 5, # Bq/kgw
            'Bq/kg fw' : 5, # Bq/kgw
            'Bq/kg f.w' : 5  # Bq/kgw
            } 

Create a default unit dictionary.

In [None]:
default_units = {'seawater': 'Bq/l',
                 'biota': 'Bq/kgw'}

In [None]:
#| export
class RemapUnitCB(Callback):
    "Set the `unit` id column in the DataFrames based on a lookup table."
    
    def __init__(self, lut: dict = lut_units, default_units: dict = {}):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp in tfm.dfs.keys():
            # Apply default unit if specified for the group
            if grp in self.default_units:
                self._apply_default_units(tfm.dfs[grp], self.default_units[grp])
            self._perform_lookup(tfm.dfs[grp])

    def _apply_default_units(self, df: pd.DataFrame, default_unit: str):
        df.loc[df['Unit'].isnull(), 'Unit'] = default_unit

    def _perform_lookup(self, df: pd.DataFrame):
        df['unit'] = df['Unit'].apply(lambda x: self.lut.get(x, 'Unknown'))


In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapUnitCB(lut_units, default_units),  
    CompareDfsAndTfmCB(dfs)
    ])

tfm()

for grp, df in tfm.dfs.items():
    print(grp, df['unit'].unique())
    

seawater [1]
biota [5]
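
The two steps performed by `RemapUnitCB` (fill missing units with a per-group default, then look up the id) can be illustrated on a toy Series. This is a sketch under stated assumptions: the `mBq/l` entry is made up to show what happens to a unit that is absent from the lookup, and unknown units are reported separately here rather than labelled `'Unknown'` as in the callback.

```python
import numpy as np
import pandas as pd

lut = {'Bq/l': 1, 'Bq/L': 1, 'BQ/L': 1}  # subset of the seawater entries in lut_units

units = pd.Series(['Bq/l', np.nan, 'BQ/L', 'mBq/l'])  # 'mBq/l' is a hypothetical outlier

# Step 1: replace missing units with the default for the group.
filled = units.fillna('Bq/l')

# Step 2: look up the unit id; names missing from the lookup become NaN.
ids = filled.map(lut)

# Report the unit names that could not be mapped.
unknown = filled[ids.isna()].unique().tolist()
print(ids.tolist())
print(unknown)
```

Listing the unmapped names makes it easy to decide whether the lookup table needs a new entry.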


## Remap detection limit

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The dataset contains `Value type` of `>` and `nan`. Note this is less than 1% of the total entries.

:::

Let's look at the percentage of `nan` and `>` entries in the `Value type` column.


In [None]:
#| eval: false
for grp, df in tfm.dfs.items():
    nan_percentage = (df['Value type'].isnull().sum() / len(df)) * 100
    print(f"{grp} `nan`: {nan_percentage:.2f} %")
    greater_than_percentage = (df[df['Value type'] == '>']['Value type'].count() / len(df)) * 100
    print(f"{grp} `>`: {greater_than_percentage:.2f} %")


seawater `nan`: 0.35 %
seawater `>`: 0.00 %
biota `nan`: 0.15 %
biota `>`: 0.11 %


For OSPAR data, the detection limit is indicated in the `Value type` column. If the `Value type` is `<`, the value represents the detection limit, whereas if it is `=`, the value is the actual measurand. We will begin by examining the unique values present in the `Value type` column to ensure accurate data interpretation and processing.

In [None]:
#|eval: false
for grp, df in tfm.dfs.items():
    print(grp, df['Value type'].unique())

seawater ['<' '=' nan]
biota ['=' '<' '>' nan]


Similarly, in MARIS nomenclature, the detection limit is specified within a designated column. The encoding of detection limits in MARIS is structured as follows:

In [None]:
#| eval: false
pd.read_excel(detection_limit_lut_path())

Unnamed: 0,id,name,name_sanitized
0,-1,Not applicable,Not applicable
1,0,Not Available,Not available
2,1,=,Detected value
3,2,<,Detection limit
4,3,ND,Not detected
5,4,DE,Derived


We will create a lookup table to map the name to the id.

In [None]:
#| exports
lut_dl = lambda: pd.read_excel(detection_limit_lut_path(), usecols=['name','id']).set_index('name').to_dict()['id']
lut_dl()

{'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}
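
With such a mapping, remapping a `Value type` column reduces to a few chained pandas operations. The sketch below uses a toy Series and follows the same choices as the handler in this notebook (missing values become `Not Available`, `>` is treated as `<`). It is an illustration rather than the exported implementation.

```python
import numpy as np
import pandas as pd

lut = {'Not applicable': -1, 'Not Available': 0, '=': 1, '<': 2, 'ND': 3, 'DE': 4}

value_type = pd.Series(['<', '=', np.nan, '>'])

detection_limit = (
    value_type
    .fillna('Not Available')  # missing value type -> 'Not Available'
    .replace('>', '<')        # '>' has no MARIS code; handled like '<' as in the handler
    .map(lut)                 # name -> id
    .fillna(0)                # anything still unmapped -> 0 (Not Available)
    .astype(int)
)

print(detection_limit.tolist())  # [2, 1, 0, 2]
```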

In [None]:
# | export
class RemapDetectionLimitCB(Callback):
    "Remap value type to MARIS format."
    def __init__(self, lut_dl: Callable):  # callable returning the name -> id lookup dict
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        lut = self.lut_dl()
        for grp, df in tfm.dfs.items():
            df = self._process_dataframe(df, lut)
            tfm.dfs[grp] = df

    def _process_dataframe(self, df: pd.DataFrame, lut: dict) -> pd.DataFrame:
        df['detection_limit'] = df['Value type'].fillna('Not Available') # Fill nan with 'Not Available'
        df.loc[df['detection_limit'] == '>', 'detection_limit'] = '<' # Replace '>' with '<'
        df['detection_limit'] = df['detection_limit'].apply(lambda x: lut.get(x, 0)) # Map values to lookup table
        return df

In [None]:
#| eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapDetectionLimitCB(lut_dl), 
    CompareDfsAndTfmCB(dfs)

    ])

tfm()

for grp, df in tfm.dfs.items():
    print(grp, df['detection_limit'].unique())
    
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')


seawater [2 1 0]
biota [1 2 0]
                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18318  15314
Number of dropped rows                                   538      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



## Add Sample Laboratory code

:::{.callout-tip}

**DISCUSSION**: The Sample Laboratory code is currently stored in the MARIS master database but is not yet encoded as a NetCDF variable. The decision to include it in the NetCDF output is still to be determined (TBD).

:::

In [None]:
# | exports
class AddSampleLabCodeCB(Callback):
    "Remap data provider's ID column to `samplabcode` in each DataFrame."
    def __call__(self, tfm: Transformer):
        for grp in tfm.dfs:
            self._remap_sample_id(tfm.dfs[grp])
    
    def _remap_sample_id(self, df: pd.DataFrame):
        df['samplabcode'] = df['Sample ID']

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleLabCodeCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

print(tfm()['seawater']['samplabcode'].unique())
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

['WNZ 01' 'WNZ 02' 'WNZ 03' ... '21-656' '21-657' '21-654']
                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18318  15314
Number of dropped rows                                   538      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



## Add Station

*For MARIS master DB import only (not included in the NetCDF output).*

In [None]:
# | export
class RemapStationIdCB(Callback):
    """Remap Station ID to MARIS format."""

    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp in tfm.dfs.keys():
            self._remap_station_id(tfm.dfs[grp])

    def _remap_station_id(self, df: pd.DataFrame):
        df['station'] = df['Station ID'] + ', ' + df['Contracting Party']

In [None]:

#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemapStationIdCB(),
                            ])
tfm()
print(tfm.dfs['seawater'][['Station ID', 'Contracting Party', 'station']])


        Station ID Contracting Party                 station
0      Belgica-W01           Belgium    Belgica-W01, Belgium
1      Belgica-W02           Belgium    Belgica-W02, Belgium
2      Belgica-W03           Belgium    Belgica-W03, Belgium
3      Belgica-W04           Belgium    Belgica-W04, Belgium
4      Belgica-W05           Belgium    Belgica-W05, Belgium
...            ...               ...                     ...
18851       Rosyth    United Kingdom  Rosyth, United Kingdom
18852       Rosyth    United Kingdom  Rosyth, United Kingdom
18853        Wylfa    United Kingdom   Wylfa, United Kingdom
18854        Wylfa    United Kingdom   Wylfa, United Kingdom
18855        Wylfa    United Kingdom   Wylfa, United Kingdom

[18856 rows x 3 columns]
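
One property of this concatenation is worth keeping in mind: pandas propagates missing values through string addition, so a row lacking either `Station ID` or `Contracting Party` ends up with a missing `station`. A minimal sketch on made-up rows (not taken from the OSPAR data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Station ID': ['Belgica-W01', np.nan],
    'Contracting Party': ['Belgium', 'Ireland'],
})
# Same expression as in `_remap_station_id`
df['station'] = df['Station ID'] + ', ' + df['Contracting Party']
print(df['station'].tolist())
```

If a partially filled station label were preferred, the missing part would have to be filled explicitly (for example with `fillna('')`) before concatenating.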


## Add measurement note

In [None]:
# | export
class RecordMeasurementNoteCB(Callback):
    """Record measurement notes by adding a 'measurenote' column to DataFrames."""
    
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            if 'Measurement Comment' in df.columns:
                self._add_measurementnote(df)
            else:
                print(f"Warning: 'Measurement Comment' column not found in DataFrame for group '{grp}'")

    def _add_measurementnote(self, df: pd.DataFrame):
        df['measurenote'] = df['Measurement Comment']


In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RecordMeasurementNoteCB(),
                            ])

tfm()

# Name of the note column to inspect
note_col = 'measurenote'
# Ensure all entries in 'measurenote' are strings
tfm.dfs['seawater'][note_col] = tfm.dfs['seawater'][note_col].fillna('').astype(str)
tfm.dfs['biota'][note_col] = tfm.dfs['biota'][note_col].fillna('').astype(str)
# Combine and find unique values from both DataFrames
combined_unique_measurenotes = np.unique(
    np.concatenate([
        tfm.dfs['seawater'][note_col].unique(),
        tfm.dfs['biota'][note_col].unique()
    ])
)
# Print the combined unique values
print(combined_unique_measurenotes)

['' '10% uncertainty assumed' '10B07.XLS' '10B38,XLS' '10B45.XLS'
 '10B63.XLS' '10B70.XLS' '10B75.XLS' '10G14.XLS' '10G22.XLS' '10G32.XLS'
 '10G39.XLS' '11.10.-18.10.2014' '15% uncertainty assumed' '2021 data'
 '28.03.-05.04.2014' '5% uncertainty assumed'
 'Activity from Ra ratio on filter and uncertaintes of both filter Ra228 and Ra226 in seawater (precentages summed)'
 'Annual bulk of 2 samples - representative sampling date'
 'Annual bulk of 2 samples - representative sampling date. No sample ref number'
 'Annual bulk of 4 samples - representative sampling date'
 'Annual bulk of 4 samples - representative sampling date.'
 'Annual bulk of 4 samples - representative sampling date. No sample ref number.'
 'Assumed collection date'
 'Assumed collection date, no sample reference number'
 'Assumed collection date. No sample number'
 'Assumed collection date. No sample ref number.'
 'Average of 2 samples, representative sampling date.'
 'Bi-annual bulk of 2 samples - representative samplin

## Add Reference note

In [None]:
# | export
class RecordRefNoteCB(Callback):
    """Record reference notes by adding a 'refnote' column to DataFrames."""
    
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            if 'Reference Comment' in df.columns:
                self._add_refnote(df)
            else:
                print(f"Warning: 'Reference Comment' column not found in DataFrame for group '{grp}'")

    def _add_refnote(self, df: pd.DataFrame):
        df['refnote'] = df['Reference Comment']

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RecordRefNoteCB(),
                            ])
tfm()

# Name of the note column to inspect
note_col = 'refnote'
# Ensure all entries in 'refnote' are strings
tfm.dfs['seawater'][note_col] = tfm.dfs['seawater'][note_col].fillna('').astype(str)
tfm.dfs['biota'][note_col] = tfm.dfs['biota'][note_col].fillna('').astype(str)
# Combine and find unique values from both DataFrames
combined_unique_refnotes = np.unique(
    np.concatenate([
        tfm.dfs['seawater'][note_col].unique(),
        tfm.dfs['biota'][note_col].unique()
    ])
)
# Print the combined unique values
print(combined_unique_refnotes)


['' 'Assuming NRPA as data provider'
 'Data not used in the 5PE, as monitoring of this species ceased in 2013'
 'LRC 09G16' 'LRC 09G27' 'LRC 09G29' 'LRC 09G30'
 'provided via the Environment Agency']
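
The note columns are inspected with the same combine-and-deduplicate pattern each time. As a sketch only, that pattern could be factored into a small helper; `unique_notes` is a hypothetical name introduced here for illustration and is not part of `marisco`. Unlike the cells in this notebook, it leaves the DataFrames untouched.

```python
import numpy as np
import pandas as pd

def unique_notes(dfs: dict, col: str) -> np.ndarray:
    """Return the sorted unique values of `col` across all DataFrames (NaN shown as '')."""
    values = [
        df[col].fillna('').astype(str).to_numpy(dtype=object)
        for df in dfs.values()
        if col in df.columns  # skip groups that lack the column
    ]
    return np.unique(np.concatenate(values))

# Made-up data, for illustration only
demo = {
    'seawater': pd.DataFrame({'refnote': ['LRC 09G16', None]}),
    'biota': pd.DataFrame({'refnote': ['LRC 09G16', 'Assuming NRPA as data provider']}),
}
print(unique_notes(demo, 'refnote'))
```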


## Add Sample note

In [None]:
# | export
class RecordSampleNoteCB(Callback):
    """Record sample notes by adding a 'sampnote' column to DataFrames."""
    
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            if 'Sample Comment' in df.columns:
                self._add_samplenote(df)
            else:
                print(f"Warning: 'Sample Comment' column not found in DataFrame for group '{grp}'")

    def _add_samplenote(self, df: pd.DataFrame):
        df['sampnote'] = df['Sample Comment']


In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RecordSampleNoteCB(),
                            ])

tfm()

# Name of the note column to inspect
note_col = 'sampnote'
# Ensure all entries in 'sampnote' are strings
tfm.dfs['seawater'][note_col] = tfm.dfs['seawater'][note_col].fillna('').astype(str)
tfm.dfs['biota'][note_col] = tfm.dfs['biota'][note_col].fillna('').astype(str)
# Combine and find unique values from both DataFrames
combined_unique_sampnotes = np.unique(
    np.concatenate([
        tfm.dfs['seawater'][note_col].unique(),
        tfm.dfs['biota'][note_col].unique()
    ])
)
# Print the combined unique values
print(combined_unique_sampnotes)

['' '1 fish' '1,316 kg fw; 0,283 kg dw; 4,652 fw/dw'
 '1,612 kg fw; 0,545 kg dw; 2,961 fw/dw'
 '1,616 kg fw; 0,111 kg dw; 14,536 fw/dw'
 '1,616 kg fw; 0,382 kg dw; 4,224 fw/dw'
 '1,656 kg fw; 0,394 kg dw; 4,202 fw/dw'
 '1,683 kg fw; 0,529 kg dw; 3,183 fw/dw' '100 fish' '14 fish'
 '15,3369849964277' '18,1607194662025'
 '1st quarter 1996 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 1997 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 1998 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 1999 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 2000 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 2001 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 2002 (monthly samples pooled for one measurement) sampling date is inaccurate !'
 '1st quarter 20

## Standardize Coordinates

The OSPAR dataset offers coordinates in degrees, minutes, and seconds (DMS). The following callback is designed to convert DMS to decimal degrees. 

In [None]:
# | export
class ConvertLonLatCB(Callback):
    """Convert longitude and latitude values to decimal degrees (DDD.DDDDD°).

    Processes each DataFrame to convert latitude and longitude from degrees, minutes, and seconds
    (DMS) format with direction indicators to decimal degrees format."""
    def __init__(self):
        fc.store_attr()

    def __call__(self, tfm: 'Transformer'):
        for grp, df in tfm.dfs.items():
            df['lat'] = self._convert_latitude(df)
            df['lon'] = self._convert_longitude(df)

    def _convert_latitude(self, df: pd.DataFrame) -> pd.Series:
        return np.where(
            df['LatDir'].isin(['S']),
            self._dms_to_decimal(df['LatD'], df['LatM'], df['LatS']) * -1,
            self._dms_to_decimal(df['LatD'], df['LatM'], df['LatS'])
        )

    def _convert_longitude(self, df: pd.DataFrame) -> pd.Series:
        return np.where(
            df['LongDir'].isin(['W']),
            self._dms_to_decimal(df['LongD'], df['LongM'], df['LongS']) * -1,
            self._dms_to_decimal(df['LongD'], df['LongM'], df['LongS'])
        )

    def _dms_to_decimal(self, degrees: pd.Series, minutes: pd.Series, seconds: pd.Series) -> pd.Series:
        return degrees + minutes / 60 + seconds / 3600


In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB()
                            ])
tfm()
tfm.dfs['seawater'][['lat','LatD', 'LatM', 'LatS', 'lon', 'LatDir', 'LongD', 'LongM','LongS', 'LongDir']]

Unnamed: 0,lat,LatD,LatM,LatS,lon,LatDir,LongD,LongM,LongS,LongDir
0,51.375278,51.0,22.0,31.0,3.188056,N,3.0,11.0,17.0,E
1,51.223611,51.0,13.0,25.0,2.859444,N,2.0,51.0,34.0,E
2,51.184444,51.0,11.0,4.0,2.713611,N,2.0,42.0,49.0,E
3,51.420278,51.0,25.0,13.0,3.262222,N,3.0,15.0,44.0,E
4,51.416111,51.0,24.0,58.0,2.809722,N,2.0,48.0,35.0,E
...,...,...,...,...,...,...,...,...,...,...
18851,56.011111,56.0,0.0,40.0,-3.406667,N,3.0,24.0,24.0,W
18852,56.011111,56.0,0.0,40.0,-3.406667,N,3.0,24.0,24.0,W
18853,53.413333,53.0,24.0,48.0,-3.870278,N,3.0,52.0,13.0,W
18854,53.569722,53.0,34.0,11.0,-3.769722,N,3.0,46.0,11.0,W
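
The rows above can be checked by hand. The following plain-Python sketch reproduces row 0 and the longitude of row 18851; `dms_to_decimal` is a hypothetical stand-alone helper written for this check, not the callback's own method.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, direction: str) -> float:
    """Convert degrees, minutes, seconds and a hemisphere letter to decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative.
    return -decimal if direction in ('S', 'W') else decimal

# Row 0: 51° 22' 31" N, 3° 11' 17" E
lat = dms_to_decimal(51, 22, 31, 'N')
lon = dms_to_decimal(3, 11, 17, 'E')
print(round(lat, 6), round(lon, 6))

# Row 18851: 3° 24' 24" W becomes a negative longitude
lon_w = dms_to_decimal(3, 24, 24, 'W')
print(round(lon_w, 6))
```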


`SanitizeLonLatCB` drops a row when both longitude and latitude equal 0, or when the row contains unrealistic longitude or latitude values. It also converts a `,` decimal separator in longitude and latitude to `.`.
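
The internals of `SanitizeLonLatCB` are not shown in this notebook. The following is a minimal pandas sketch of the behaviour described above, under the assumption that "unrealistic" means a latitude outside [-90, 90] or a longitude outside [-180, 180]. `sanitize_lon_lat` is a hypothetical helper, not the `marisco` implementation.

```python
import pandas as pd

def sanitize_lon_lat(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: normalise decimal separators, then drop implausible rows."""
    out = df.copy()
    for col in ('lat', 'lon'):
        # Replace a decimal comma with a decimal point before converting to float.
        out[col] = out[col].astype(str).str.replace(',', '.', regex=False).astype(float)
    both_zero = (out['lat'] == 0) & (out['lon'] == 0)
    # Assumed plausibility bounds for geographic coordinates.
    out_of_range = (out['lat'].abs() > 90) | (out['lon'].abs() > 180)
    return out[~(both_zero | out_of_range)]

# Made-up rows: a decimal comma, a (0, 0) pair, an impossible latitude, a valid row
demo = pd.DataFrame({
    'lat': ['51,375278', '0', '95.0', '56.011111'],
    'lon': ['3,188056', '0', '10.0', '-3.406667'],
})
clean = sanitize_lon_lat(demo)
print(clean)
```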

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            ConvertLonLatCB(),
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

print(tfm.dfs['biota'][['lat','lon']])


                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18856  15314
Number of dropped rows                                     0      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 

             lat       lon
0      55.725278 -4.901944
1      54.968889 -3.240556
2      58.565833 -3.791389
3      58.618611 -3.647778
4      55.964722 -2.398056
...          ...       ...
15309  54.455000 -3.566111
15310  48.832778 -1.591389
15311  48.832778 -1.591389
15312  49.551667 -1.860000
15313  49.714444 -1.946111

[15314 rows x 2 columns]


## Review all callbacks

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units, default_units),  
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18308  15314
Number of dropped rows                                   548      0
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: The dataset contains entries not suitable for MARIS.

:::

In [None]:
seawater_dfs_dropped_review=tfm.dfs_dropped['seawater']
biota_dfs_dropped_review=tfm.dfs_dropped['biota']

Let's look at the dropped rows for the seawater group.

In [None]:
seawater_dfs_dropped_review

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
16799,97147,,,,,,,,,,...,,,,,,,,,,
16800,97148,,,,,,,,,,...,,,,,,,,,,
16801,97149,,,,,,,,,,...,,,,,,,,,,
16802,97150,,,,,,,,,,...,,,,,,,,,,
16803,97151,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18474,120366,Ireland,4.0,N8,,53.0,39.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18475,120367,Ireland,4.0,N9,,53.0,53.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18476,120368,Ireland,4.0,N10,,53.0,52.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18477,120369,Ireland,1.0,Salthill,,53.0,15.0,40.0,N,9.0,...,,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,


For the seawater group, let's separate the dropped entries by `Contracting Party` and review those that are not suitable for MARIS.

In [None]:
seawater_dfs_dropped_review['Contracting Party'].unique()

array([nan, 'Sweden', 'Ireland'], dtype=object)

Contributions to the seawater group from Sweden that are not suitable for MARIS.

In [None]:
seawater_dfs_dropped_review[seawater_dfs_dropped_review['Contracting Party'] == 'Sweden']

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
17298,97948,Sweden,11.0,SW7,1,58.0,36.0,12.0,N,11.0,...,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,
17302,97952,Sweden,12.0,Ringhals (R35),7,57.0,14.0,5.0,N,11.0,...,,3H,,,,Bq/l,Swedish Radiation Safety Authority,no 3H this year due to broken LSC,,


Contributions to the seawater group from Ireland that are not suitable for MARIS.


In [None]:
seawater_dfs_dropped_review[seawater_dfs_dropped_review['Contracting Party'] == 'Ireland']

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
18471,120363,Ireland,4.0,N1,,53.0,25.0,0.0,N,6.0,...,,,,,,,,2021 data,The Irish Navy attempted a few times to collec...,
18472,120364,Ireland,4.0,N2,,53.0,36.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18473,120365,Ireland,4.0,N3,,53.0,44.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18474,120366,Ireland,4.0,N8,,53.0,39.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18475,120367,Ireland,4.0,N9,,53.0,53.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18476,120368,Ireland,4.0,N10,,53.0,52.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18477,120369,Ireland,1.0,Salthill,,53.0,15.0,40.0,N,9.0,...,,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,
18478,120370,Ireland,1.0,Woodstown,,52.0,11.0,55.0,N,6.0,...,,,,,,,,,,


## Rename columns of interest for NetCDF or Open Refine

In [None]:
#| export
def get_common_rules(vars: dict, encoding_type: str) -> dict:
    "Get common renaming rules for NetCDF and OpenRefine."
    common = {
        'lat': 'latitude' if encoding_type == 'openrefine' else vars['defaults']['lat']['name'],
        'lon': 'longitude' if encoding_type == 'openrefine' else vars['defaults']['lon']['name'],
        'time': 'begperiod' if encoding_type == 'openrefine' else vars['defaults']['time']['name'],
        'NUCLIDE': 'nuclide_id' if encoding_type == 'openrefine' else 'nuclide',
        'detection_limit': 'detection' if encoding_type == 'openrefine' else vars['suffixes']['detection_limit']['name'],
        'unit': 'unit_id' if encoding_type == 'openrefine' else vars['suffixes']['unit']['name'],
        'value': 'activity' if encoding_type == 'openrefine' else 'value',
        'uncertainty': 'uncertaint' if encoding_type == 'openrefine' else vars['suffixes']['uncertainty']['name'],
    }
    
    if encoding_type == 'openrefine':
        common.update({
            'samptype_id': 'samptype_id',
            'station': 'station',
            'samplabcode': 'samplabcode',
            'measurenote': 'measurenote',
            'refnote': 'refnote'
        })
    
    return common

In [None]:
#| export
def get_specific_rules(vars: dict, encoding_type: str) -> dict:
    "Get specific renaming rules for NetCDF and OpenRefine."
    if encoding_type == 'netcdf':
        return {
            'seawater': {
                'Sampling depth': vars['defaults']['smp_depth']['name'],
            },
            'biota': {
                'species': vars['bio']['species']['name'],
                'body_part': vars['bio']['body_part']['name'],
                'bio_group': vars['bio']['bio_group']['name']
            }
        }
    elif encoding_type == 'openrefine':
        return {
            'seawater': {
                'Sampling depth': 'sampdepth',
            },
            'biota': {
                'species': 'species_id',
                'Taxonname': 'Taxonname',
                'TaxonRepName': 'TaxonRepName',
                'Taxonrank': 'Taxonrank',
                'TaxonDB': 'TaxonDB',
                'TaxonDBID': 'TaxonDBID',
                'TaxonDBURL': 'TaxonDBURL',
                'body_part': 'bodypar_id',
            }
        }

Transient rules are not essential for the transformation process, but they allow additional columns to be included in the processed data. This is useful for providing feedback to the data provider.

In [None]:
#| export
def get_transient_rules(vars: dict, encoding_type: str) -> dict:
    """Get transient renaming rules used temporarily during transformation for NetCDF."""
    if encoding_type == 'netcdf':
        return {
            'seawater': {
                'Sample ID': 'sample_id',
                'Contracting Party': 'contracting_party', 
                'ID': 'id'    
            },
            'biota': {
                'Sample ID': 'sample_id',       
                'Contracting Party': 'contracting_party',
                'ID': 'id'    
            }
        }
    else:
        return {}

In [None]:
#| export
def get_renaming_rules(encoding_type='netcdf' , transient_rules = False):
    vars = cdl_cfg()['vars']
    common = get_common_rules(vars, encoding_type)
    specific = get_specific_rules(vars, encoding_type)
    if transient_rules:
        transient = get_transient_rules(vars, encoding_type)
    else:
        transient = {}
    # Combine rules for seawater and biota
    seawater_rules = {**common, **specific.get('seawater', {}), **transient.get('seawater', {})}
    biota_rules = {**common, **specific.get('biota', {}), **transient.get('biota', {})}
    
    return OrderedDict({
        ('seawater',): seawater_rules,
        ('biota',): biota_rules
    })
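
`get_renaming_rules` merges the rule sets with dictionary unpacking, so when the same source column appears in more than one set, the rightmost set wins. A minimal sketch with made-up rule dictionaries (the keys and values below are illustrative, not the real CDL names):

```python
common = {'lat': 'lat', 'value': 'value'}
specific = {'Sampling depth': 'smp_depth', 'value': 'activity'}  # hypothetical clash on 'value'
transient = {'Sample ID': 'sample_id'}

# Later dictionaries override earlier ones; key order follows first insertion.
merged = {**common, **specific, **transient}
print(merged)
```

In the actual rule sets shown above the keys do not clash, so the merge order only affects the column order of the output.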

**Discussion**

1. Should we check the validity of the renaming rules here? Both 'nuclide' and 'value' are used as NetCDF column names but are hard-coded in the rules rather than read from the CDL. Should they be added to the CDL and retrieved from it?

In [None]:
#| export
class SelectAndRenameColumnCB(Callback):
    "Select and rename columns in a DataFrame based on renaming rules for a specified encoding type."
    
    def __init__(self, 
                 fn_renaming_rules: Callable, # A function that returns an OrderedDict of renaming rules 
                 encoding_type: str='netcdf', # The encoding type (`netcdf` or `openrefine`) to determine which renaming rules to use
                 verbose: bool=False # Whether to print out renaming rules that were not applied
                 ):
        fc.store_attr()

    def __call__(self, tfm):
        "Apply column selection and renaming to DataFrames in the transformer, and identify unused rules."
        try:
            renaming_rules = self.fn_renaming_rules(self.encoding_type)
        except ValueError as e:
            print(f"Error fetching renaming rules: {e}")
            return

        for group in tfm.dfs.keys():
            # Get relevant renaming rules for the current group
            group_rules = self._get_group_rules(renaming_rules, group)

            if not group_rules:
                continue

            # Apply renaming rules and track keys not found in the DataFrame
            df = tfm.dfs[group]
            df, not_found_keys = self._apply_renaming(df, group_rules)
            tfm.dfs[group] = df
            
            # Print any renaming rules that were not used
            if not_found_keys and self.verbose:
                print(f"\nGroup '{group}' has the following renaming rules not applied:")
                for old_col in not_found_keys:
                    print(f"Key '{old_col}' from renaming rules was not found in the DataFrame.")

    def _get_group_rules(self, renaming_rules, group):
        "Retrieve and merge renaming rules for the specified group based on the encoding type."

        relevant_rules = [rules for key, rules in renaming_rules.items() if group in key]
        merged_rules = OrderedDict()
        for rules in relevant_rules:
            merged_rules.update(rules)
        return merged_rules

    def _apply_renaming(self, df, rename_rules):
        existing_columns = set(df.columns)
        valid_rules = OrderedDict((old_col, new_col) for old_col, new_col in rename_rules.items() if old_col in existing_columns)

        # Create a list to maintain the order of columns
        columns_to_keep = [col for col in rename_rules.keys() if col in existing_columns]
        columns_to_keep += [new_col for old_col, new_col in valid_rules.items() if new_col in df.columns]

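        # Select the columns to keep, then rename them; `rename` returns a new
        # DataFrame, which avoids renaming a slice of the original in place
        df = df[list(OrderedDict.fromkeys(columns_to_keep))].rename(columns=valid_rules)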

        # Determine which keys were not found
        not_found_keys = set(rename_rules.keys()) - existing_columns
        return df, not_found_keys
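
The select-then-rename step can be illustrated on a toy DataFrame. This is a sketch under simplified assumptions (a single flat rule dictionary and made-up column names), not the callback itself.

```python
import pandas as pd

df = pd.DataFrame({
    'lat': [51.4], 'lon': [3.2], 'Sampling depth': [0.0], 'Unit': ['Bq/l'],
})
rules = {'lat': 'lat', 'lon': 'lon', 'Sampling depth': 'smp_depth', 'time': 'time'}

# Keep only rules whose source column exists, preserving rule order.
valid = {old: new for old, new in rules.items() if old in df.columns}
# Select first, then rename; `rename` returns a new DataFrame.
selected = df[list(valid)].rename(columns=valid)
# Rules that could not be applied are reported back to the caller.
not_found = set(rules) - set(df.columns)

print(list(selected.columns))
print(not_found)
```

Columns without a rule (`Unit` here) are dropped, and rules without a matching column (`time` here) are reported rather than raising an error.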


In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units),
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            ])

tfm()
print(tfm.dfs['seawater'].columns)
print(tfm.dfs['biota'].columns)

Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       'smp_depth'],
      dtype='object')
Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       'species', 'body_part', 'bio_group'],
      dtype='object')


## Unique Sample Identification

:::{.callout-tip}

**FEEDBACK TO DATA PROVIDER**: Many entries do not include a unique 'Sample ID'. Including a unique 'Sample ID' for each data entry would greatly ease this step.

:::

In [None]:
print(f"Seawater data with nan Sample ID: {len(dfs['seawater'][dfs['seawater']['Sample ID'].isnull()].index)}")
print(f"Biota data with nan Sample ID: {len(dfs['biota'][dfs['biota']['Sample ID'].isnull()].index)}")


Seawater data with nan Sample ID: 7296
Biota data with nan Sample ID: 5483


:::{.callout-tip}

**DISCUSSION**:

1. **Relocation of `UniqueSampleHandlerCB`**: Should we move `UniqueSampleHandlerCB` to `callbacks.ipynb` for better organization?


:::

Before transforming data from a long to a wide format, duplicate entries must be handled. Whether a sample is unique depends on the columns chosen to define it. The `UniqueSampleHandlerCB` callback is designed to define and explore the uniqueness of samples. It is intended to offer several options for handling duplicates (dropping them, averaging their values, or selecting the maximum or minimum value for each nuclide); the version below implements only the `drop` method. The callback also stores the replicated rows of each group in an additional attribute, `tfm.dfs_replicated`, which facilitates their exploration. The `group_by` attribute holds, for each group, the columns that define a unique sample.
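
The detection of replicates relies on two pandas operations: `duplicated(keep=False)` flags every row that shares its key columns with at least one other row, and `groupby(...).ngroup()` numbers each group of replicates. A minimal sketch on made-up data, assuming that `lat`, `lon`, `time` and `nuclide` define a unique sample:

```python
import pandas as pd

df = pd.DataFrame({
    'lat': [54.0, 54.0, 60.8, 60.8],
    'lon': [-6.1, -6.1, 3.5, 3.5],
    'time': [875664000, 875664000, 1563062400, 1563062400],
    'nuclide': ['cs137', 'cs137', 'cs137', 'po210'],
    'value': [0.0430, 0.0061, 0.0032, 0.0019],
})
keys = ['lat', 'lon', 'time', 'nuclide']

# Flag every member of a duplicated group, not only the later occurrences.
is_replicate = df.duplicated(subset=keys, keep=False)
replicates = df[is_replicate].copy()
replicates['replicated_group'] = replicates.groupby(keys).ngroup()

# The 'drop' method removes all members of each duplicated group.
unique_only = df.drop_duplicates(subset=keys, keep=False)
print(replicates)
print(unique_only)
```

With `keep=False`, no member of a replicated group survives the drop, which is a conservative choice: none of the conflicting values is silently preferred.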

**Discussion**: Again, should we include 'nuclide' in the default CDL?

In [None]:
#| export
class UniqueSampleHandlerCB(Callback):
    """
    Callback to process replicate rows in a DataFrame.
    """
    def __init__(self, encoding_type='netcdf', method='drop', verbose=False):
        self.encoding_type = encoding_type
        self.method = method
        self.verbose = verbose
        self.group_by = None

    def __call__(self, tfm: 'Transformer'):
        self.group_by = self.get_unique_columns(tfm.dfs)
        self._initialize_tfm_attributes(tfm)
        for grp, df in tfm.dfs.items():
            processed_df = self._process_replicates(df, grp)
            tfm.dfs_replicated[grp] = processed_df
            self._handle_replicates(tfm, grp, processed_df)

    def get_unique_columns(self, dfs):
        """
        Extracts unique columns for each DataFrame in dfs based on CDL configuration.
        """
        if self.encoding_type == 'netcdf':
            cdl = cdl_cfg()
            default_cdl = set(cdl['vars']['defaults'].keys())
            bio_cdl = set(cdl['vars']['bio'].keys())
            # include sed_cdl if/when UniqueSampleHandlerCB is moved to 'callbacks.ipynb'
        elif self.encoding_type == 'openrefine':
            print('openrefine not yet supported')
            return {}

        unique_column_group = {}
        for key, df in dfs.items():
            if df is not None:
                df_columns = set(df.columns)
                common_columns = df_columns.intersection(default_cdl.union(bio_cdl))
                unique_column_group[key] = list(common_columns) + ['nuclide']
        
        return unique_column_group

    def _initialize_tfm_attributes(self, tfm: Transformer) -> None:
        tfm.dfs_replicated = {}

    def _process_replicates(self, df, grp):
        """
        Process replicates by assigning a replicated_group number and filtering.
        """
        df_filled = df.fillna('unknown')
        df_filled['is_replicate'] = df_filled.duplicated(subset=self.group_by[grp], keep=False)
        # Copy the filtered rows so that adding a column below does not modify a slice
        replicates_only = df_filled[df_filled['is_replicate']].copy()
        replicates_only['replicated_group'] = replicates_only.groupby(self.group_by[grp]).ngroup()
        replicates_only = replicates_only.sort_values(by='replicated_group').reset_index(drop=True)
        replicates_only = replicates_only.replace('unknown', pd.NA)
        return replicates_only

    def _handle_replicates(self, tfm: 'Transformer', grp: str, processed_df: pd.DataFrame):
        """
        Handle replicates based on the specified method.
        """
        if self.method == 'drop':
            tfm.dfs[grp] = tfm.dfs[grp].drop_duplicates(subset=self.group_by[grp], keep=False)
            if self.verbose:
                print(f"Group '{grp}': Duplicates removed based on columns {self.group_by[grp]}.")

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units, default_units),  
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(partial(get_renaming_rules, transient_rules=True), encoding_type='netcdf'),
                            UniqueSampleHandlerCB(encoding_type='netcdf', method='drop', verbose=True),
                            CompareDfsAndTfmCB(dfs)
                            ])

tfm()
print(tfm.dfs['seawater'].columns)
print(tfm.dfs['biota'].columns)
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

Group 'seawater': Duplicates removed based on columns ['time', 'lon', 'smp_depth', 'lat', 'nuclide'].
Group 'biota': Duplicates removed based on columns ['bio_group', 'time', 'species', 'body_part', 'lat', 'lon', 'nuclide'].
Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       'smp_depth', 'sample_id', 'contracting_party', 'id'],
      dtype='object')
Index(['lat', 'lon', 'time', 'nuclide', '_dl', '_unit', 'value', '_unc',
       'species', 'body_part', 'bio_group', 'sample_id', 'contracting_party',
       'id'],
      dtype='object')
                                                    seawater  biota
Number of rows in dfs                                  18856  15314
Number of rows in tfm.dfs                              18015  14714
Number of dropped rows                                   841    600
Number of rows in tfm.dfs + Number of dropped rows     18856  15314 



A large number of entries are deemed replicates. Let's review the entries where samples are replicated, meaning that the values in the columns listed in `unique_column_group` are identical. First, let's review the seawater group:

In [None]:
replicated_sea=tfm.dfs_replicated['seawater']
replicated_sea

Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,smp_depth,sample_id,contracting_party,id,is_replicate,replicated_group
0,71.953333,-12.845000,805766400,cs137,1,1,0.0007,0.000083,0.0,1995433,Germany,47522,True,0
1,71.953333,-12.845000,805766400,cs137,2,1,0.0007,,0.0,1995432,Germany,47521,True,0
2,54.031389,-6.129722,875664000,cs137,1,1,0.0430,0.002687,0.0,,Ireland,50409,True,1
3,54.031389,-6.129722,875664000,cs137,1,1,0.0061,0.000381,0.0,,Ireland,50410,True,1
4,59.336111,2.492222,878342400,cs137,1,1,0.0029,0.000145,0.0,,Norway,71722,True,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288,60.760833,3.491111,1563062400,cs137,1,1,0.0032,0.000125,323.0,SWA2019-014,Norway,97551,True,98
289,60.760833,3.491111,1563062400,po210,1,1,0.0019,0.00005,323.0,SWA2019-013,Norway,97556,True,99
290,60.760833,3.491111,1563062400,po210,1,1,0.0010,0.00005,323.0,SWA2019-014,Norway,97557,True,99
291,60.760833,3.491111,1563062400,ra226,1,1,0.0017,0.0001,323.0,SWA2019-013,Norway,97561,True,100
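
The `is_replicate` and `replicated_group` columns shown above can be reproduced in spirit with plain pandas. The sketch below is a minimal illustration on made-up data, with an assumed list of key columns (`key_cols`); it is not the `UniqueSampleHandlerCB` implementation, only the underlying idea of flagging rows that share the same key values and numbering each set of replicates.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share every key column.
toy = pd.DataFrame({
    'lat': [54.0, 54.0, 60.7],
    'lon': [-6.1, -6.1, 3.5],
    'time': [875664000, 875664000, 1563062400],
    'smp_depth': [0.0, 0.0, 323.0],
    'nuclide': ['cs137', 'cs137', 'po210'],
    'value': [0.0430, 0.0061, 0.0019],
})
key_cols = ['lat', 'lon', 'time', 'smp_depth', 'nuclide']  # assumed key columns

# keep=False marks every member of a duplicated set, not only the later ones.
toy['is_replicate'] = toy.duplicated(subset=key_cols, keep=False)

# Number the replicated sets so that they can be reviewed group by group.
replicated = toy[toy['is_replicate']].copy()
replicated['replicated_group'] = replicated.groupby(key_cols).ngroup()
print(replicated)
```

Using `keep=False` keeps every member of a replicated set visible for review, rather than hiding the first occurrence.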


In [None]:
replicated_sea.to_csv('ospar_replicated_seawater.csv')

Let's see whether these are measurement results (1) or detection limits (2).

In [None]:
replicated_sea['_dl'].value_counts()

_dl
1    275
2     18
Name: count, dtype: int64

Let's return the replicated groups where both a measurement and a detection limit are reported.

In [None]:
replicated_sea.groupby('replicated_group').filter(lambda x: x['_dl'].nunique() > 1)


Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,smp_depth,sample_id,contracting_party,id,is_replicate,replicated_group
0,71.953333,-12.845,805766400,cs137,1,1,0.0007,8.3e-05,0.0,1995433.0,Germany,47522,True,0
1,71.953333,-12.845,805766400,cs137,2,1,0.0007,,0.0,1995432.0,Germany,47521,True,0
279,54.488889,-3.606944,1561939200,h3,1,1,11.85,0.29625,0.0,,United Kingdom,98470,True,94
280,54.488889,-3.606944,1561939200,h3,2,1,4.0,,0.0,,United Kingdom,118996,True,94


Let's see the nuclide breakdown for replicated entries.

In [None]:
replicated_sea['nuclide'].value_counts()

nuclide
cs137            152
pu239_240_tot     40
tc99              39
ra226             38
ra228             12
h3                10
po210              2
Name: count, dtype: int64

Let's see the replicated groups that contain more than two entries. We will create a small function, `filter_replicated_groups`, that keeps only the groups whose size is greater than `cnt`.

In [None]:
def filter_replicated_groups(df, cnt=2):
    "Return the rows of `df` whose `replicated_group` contains more than `cnt` entries."
    # Calculate the count for each 'replicated_group'
    group_counts = df.groupby('replicated_group').size()
    
    # Identify groups with counts greater than the specified minimum count
    groups_with_more_than_min_count = group_counts[group_counts > cnt].index
    
    # Filter the DataFrame to include only those rows
    filtered_df = df[df['replicated_group'].isin(groups_with_more_than_min_count)]
    
    return filtered_df
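
As a cross-check of the filtering logic, the same selection can be written with `groupby(...).transform('size')`, which broadcasts each group's size back onto its rows. This is a minimal sketch on a made-up DataFrame (the `toy` data is hypothetical), not a replacement for the function above.

```python
import pandas as pd

# Hypothetical replicated entries: group 0 has two rows, group 1 has three.
toy = pd.DataFrame({
    'replicated_group': [0, 0, 1, 1, 1],
    'value': [0.7, 0.7, 2.9, 9.8, 3.6],
})

# Broadcast each group's size onto its rows, then keep the larger groups.
group_size = toy.groupby('replicated_group')['value'].transform('size')
more_than_two = toy[group_size > 2]
print(more_than_two)
```

Only the three rows of group 1 remain, because the comparison is strictly greater than 2.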

In [None]:
filtered_replicates_sea = filter_replicated_groups(tfm.dfs_replicated['seawater'], cnt=2)
filtered_replicates_sea

Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,smp_depth,sample_id,contracting_party,id,is_replicate,replicated_group
4,59.336111,2.492222,878342400,cs137,1,1,0.002900,0.000145,0.0,,Norway,71722,True,2
5,59.336111,2.492222,878342400,cs137,1,1,0.009800,0.00049,0.0,,Norway,71729,True,2
6,59.336111,2.492222,878342400,cs137,1,1,0.003600,0.00018,0.0,,Norway,71728,True,2
7,59.336111,2.492222,878342400,cs137,1,1,0.006700,0.000335,0.0,,Norway,71730,True,2
8,59.336111,2.492222,878342400,cs137,1,1,0.005800,0.00029,0.0,,Norway,71726,True,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,67.137778,11.386667,1088640000,pu239_240_tot,1,1,0.000007,0.0,0.0,,Norway,74854,True,25
141,67.137778,11.386667,1088640000,pu239_240_tot,1,1,0.000004,0.0,0.0,,Norway,74853,True,25
150,59.336111,2.492222,1120176000,ra226,1,1,0.001300,0.000065,0.0,,Norway,75342,True,30
151,59.336111,2.492222,1120176000,ra226,1,1,0.000900,0.000045,0.0,,Norway,75341,True,30


How many replicated groups in the seawater group include more than two entries?

In [None]:
len(filtered_replicates_sea['replicated_group'].unique())

18

Next, we examine the entries marked as replicated. Specifically, we want to determine whether these entries, which share the same identifying columns (latitude, longitude, time and the other columns in `unique_column_group`), belong to samples that are also linked to other nuclides. This matters because, in a wide-format dataset, each nuclide must occupy a unique column, so all nuclides associated with the same sample are combined into a single row. If one nuclide is repeated for a sample, a decision is required on which of the repeated rows the other nuclide information for that sample should be assigned to.
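
To make the constraint concrete, the sketch below uses made-up data and a hypothetical `sample` column in place of the real identifying columns. It shows that `DataFrame.pivot` refuses to reshape when one nuclide is repeated for a sample, and that the reshape succeeds once the replicated pairs are removed. Dropping with `keep=False` is only one possible choice and may differ from what `UniqueSampleHandlerCB(method='drop')` does.

```python
import pandas as pd

# Hypothetical long-format data: cs137 is reported twice for sample 'A'.
toy = pd.DataFrame({
    'sample': ['A', 'A', 'A'],
    'nuclide': ['cs137', 'cs137', 'h3'],
    'value': [0.0029, 0.0098, 5.1],
})

# Two cs137 values compete for the same (sample, cs137) cell.
try:
    toy.pivot(index='sample', columns='nuclide', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Dropping every replicated (sample, nuclide) pair leaves only h3.
deduped = toy.drop_duplicates(subset=['sample', 'nuclide'], keep=False)
wide = deduped.pivot(index='sample', columns='nuclide', values='value')
print(pivot_failed, list(wide.columns))
```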

In [None]:
#| export
def get_unique_columns(dfs):
    """
    Extracts unique columns for each DataFrame in dfs based on CDL configuration.
    """
    cdl = cdl_cfg()
    default_cdl = set(cdl['vars']['defaults'].keys())
    bio_cdl = set(cdl['vars']['bio'].keys())
    
    unique_column_group = {}
    
    for key, df in dfs.items():
        if df is not None:
            df_columns = set(df.columns)
            common_columns = df_columns.intersection(default_cdl.union(bio_cdl))
            unique_column_group[key] = list(common_columns) + ['nuclide']
    
    return unique_column_group

In [None]:
unique_column_group=get_unique_columns(tfm.dfs)
unique_column_group

{'seawater': ['time', 'lon', 'smp_depth', 'lat', 'nuclide'],
 'biota': ['bio_group',
  'time',
  'species',
  'body_part',
  'lat',
  'lon',
  'nuclide']}

In [None]:
def find_entries_with_multiple_nuclides(tfm, unique_column_group, group):
    # Ensure the group is valid
    if group not in unique_column_group:
        raise ValueError(f"Invalid group: {group}. Must be one of {list(unique_column_group.keys())}.")

    # Get the unique columns for the specified group, excluding 'nuclide'
    unique_columns = [col for col in unique_column_group[group] if col != 'nuclide']
    
    # Step 1: Filter tfm.dfs_replicated[group] using the unique columns and keep the replicated_group
    replicated_data = tfm.dfs_replicated[group][unique_columns + ['replicated_group']].drop_duplicates()
    
    # Step 2: Search tfm.dfs[group] for these entries
    filtered_data = tfm.dfs[group].merge(replicated_data, on=unique_columns, how='inner')
    
    # Step 3: Determine if unique columns are used for more than one nuclide
    nuclide_counts = filtered_data.groupby(unique_columns)['nuclide'].nunique()
    
    # Filter groups with more than one nuclide
    multiple_nuclides = nuclide_counts[nuclide_counts > 1]
    
    # Step 4: Add entries to the replicated_group where another nuclide shares the same unique columns
    result = filtered_data[filtered_data[unique_columns].apply(tuple, axis=1).isin(multiple_nuclides.index)]
        
    # Sort by 'replicated_group'
    result_sorted = result.sort_values(by='replicated_group')
    
    return result_sorted

In [None]:
multiple_nuclides_replicated_seawater = find_entries_with_multiple_nuclides(tfm, unique_column_group, 'seawater')
print(len(multiple_nuclides_replicated_seawater.index))
multiple_nuclides_replicated_seawater

10


Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,smp_depth,sample_id,contracting_party,id,replicated_group
3,74.713056,28.126111,1088640000,cs137,1,1,0.0028,0.00014,0.0,,Norway,74815,28
4,74.713056,28.126111,1088640000,pu239_240_tot,1,1,5e-06,2.55e-07,0.0,,Norway,74856,28
5,58.243056,9.585,1120176000,tc99,1,1,0.0012,6e-05,0.0,,Norway,75282,31
6,58.243056,9.585,1120176000,pu239_240_tot,1,1,8e-06,3.85e-07,0.0,,Norway,75313,31
8,60.760833,3.491111,1563062400,pu239_240_tot,1,1,3e-06,2.25e-07,323.0,SWA2019-014,Norway,97565,98
11,60.760833,3.491111,1563062400,h3,2,1,5.1,,323.0,SWA2019-014,Norway,97568,98
9,60.760833,3.491111,1563062400,pu239_240_tot,1,1,3e-06,2.25e-07,323.0,SWA2019-014,Norway,97565,99
12,60.760833,3.491111,1563062400,h3,2,1,5.1,,323.0,SWA2019-014,Norway,97568,99
10,60.760833,3.491111,1563062400,pu239_240_tot,1,1,3e-06,2.25e-07,323.0,SWA2019-014,Norway,97565,100
13,60.760833,3.491111,1563062400,h3,2,1,5.1,,323.0,SWA2019-014,Norway,97568,100


As shown, many entries that include replicated samples for a specific nuclide also contain multiple other nuclides for the same sample. For the current analysis, I have removed only the duplicated entries where replication occurs. It is important to note that I have not removed all nuclide entries for a sample that includes replicated entries, but rather just the duplicates themselves.

Now, let's review the biota group.

In [None]:
replicated_biota=tfm.dfs_replicated['biota']
replicated_biota

Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,species,body_part,bio_group,sample_id,contracting_party,id,is_replicate,replicated_group
0,54.303333,7.495000,882662400,ra226,2,5,0.321,,99,4,4,15720,Germany,85454,True,0
1,54.303333,7.495000,882662400,ra226,2,5,0.161,,99,4,4,15719,Germany,85453,True,0
2,54.303056,7.494722,882662400,cs137,1,5,1.219,0.042665,99,52,4,15719,Germany,36615,True,1
3,54.303056,7.494722,882662400,cs137,1,5,0.505,0.01818,99,52,4,15720,Germany,36614,True,1
4,72.666667,36.250000,918345600,cs137,1,5,0.200,0.04,99,34,4,,Norway,70857,True,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,53.417222,6.876111,1633564800,pb210,2,5,1.000,,414,19,14,,Netherlands,96114,True,215
596,53.417222,6.876111,1633564800,pb210,2,5,1.000,,414,19,14,,Netherlands,96117,True,215
597,53.417222,6.876111,1633564800,ra226,2,5,1.800,,414,19,14,,Netherlands,96116,True,216
598,53.417222,6.876111,1633564800,ra226,2,5,1.800,,414,19,14,,Netherlands,96110,True,216


In [None]:
replicated_biota.to_csv('ospar_replicated_biota.csv')

Let's see whether these are measurement results (1) or detection limits (2).

In [None]:
replicated_biota['_dl'].value_counts()

_dl
1    329
2    271
Name: count, dtype: int64

Let's see the nuclide breakdown for replicated entries.

In [None]:
replicated_biota['nuclide'].value_counts()

nuclide
cs137            303
ra226             90
pb210             82
po210             51
tc99              36
pu239_240_tot     34
ra228              4
Name: count, dtype: int64

Let's use the `filter_replicated_groups` function to review the groups that contain more than two entries.

In [None]:
filtered_replicated_biota = filter_replicated_groups(tfm.dfs_replicated['biota'], cnt=2)
filtered_replicated_biota

Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,species,body_part,bio_group,sample_id,contracting_party,id,is_replicate,replicated_group
4,72.666667,36.250000,918345600,cs137,1,5,0.20,0.04,99,34,4,,Norway,70857,True,2
5,72.666667,36.250000,918345600,cs137,1,5,0.31,0.025,99,34,4,,Norway,70858,True,2
6,72.666667,36.250000,918345600,cs137,1,5,0.13,0.02,99,34,4,,Norway,70881,True,2
7,72.666667,36.250000,918345600,cs137,1,5,0.31,0.02,99,34,4,,Norway,70882,True,2
8,72.666667,36.250000,918345600,cs137,1,5,0.53,0.06,99,34,4,,Norway,70883,True,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,53.417222,6.876111,1633564800,pb210,2,5,1.00,,414,19,14,,Netherlands,96114,True,215
596,53.417222,6.876111,1633564800,pb210,2,5,1.00,,414,19,14,,Netherlands,96117,True,215
597,53.417222,6.876111,1633564800,ra226,2,5,1.80,,414,19,14,,Netherlands,96116,True,216
598,53.417222,6.876111,1633564800,ra226,2,5,1.80,,414,19,14,,Netherlands,96110,True,216


How many replicated groups in the biota group include more than two entries?


In [None]:
len(filtered_replicated_biota['replicated_group'].unique()) 

103

Again, we examine the entries marked as replicated to determine whether they, sharing the same identifying columns (latitude, longitude, time and other relevant attributes), are linked to more than one nuclide.

In [None]:
multiple_nuclides_replicated_biota = find_entries_with_multiple_nuclides(tfm, unique_column_group, 'biota')
print(len(multiple_nuclides_replicated_biota.index))
multiple_nuclides_replicated_biota.head(10)

10


Unnamed: 0,lat,lon,time,nuclide,_dl,_unit,value,_unc,species,body_part,bio_group,sample_id,contracting_party,id,replicated_group
1,43.634444,-3.578056,1576108800,cs137,1,5,0.04953,0.006313,442,52,4,,Spain,91655,74
3,43.634444,-3.578056,1576108800,pb210,1,5,0.08005,0.00867,442,52,4,,Spain,91658,74
5,43.634444,-3.578056,1576108800,ra228,2,5,0.1495,,442,52,4,,Spain,91666,74
7,43.634444,-3.578056,1576108800,pu239_240_tot,2,5,0.004328,,442,52,4,,Spain,91669,74
0,43.634444,-3.578056,1576108800,cs137,2,5,0.1086,,1059,40,11,,Spain,91654,102
2,43.634444,-3.578056,1576108800,pb210,1,5,3.789,0.092675,1059,40,11,,Spain,91657,102
4,43.634444,-3.578056,1576108800,ra228,1,5,0.3647,0.0424,1059,40,11,,Spain,91665,102
6,43.634444,-3.578056,1576108800,pu239_240_tot,1,5,0.01222,0.001374,1059,40,11,,Spain,91668,102
22,54.038611,-6.158056,1012521600,pu239_240_tot,1,5,0.156,0.00975,129,19,14,,Ireland,50372,116
23,54.038611,-6.158056,1012521600,tc99,1,5,14.2,0.8875,129,19,14,,Ireland,50434,116


Again, many entries with a replicated nuclide belong to samples that also include entries for other nuclides. Note that I have dropped all entries where replication occurs.

## Reshape: long to wide

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units, default_units),  
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            UniqueSampleHandlerCB(encoding_type='netcdf', method='drop', verbose=True),
                            ReshapeLongToWide(), 
                            ])

tfm()
print(tfm.dfs['seawater'].columns)
print(tfm.dfs['biota'].columns)

In [None]:
seawater_dfs_review=tfm.dfs['seawater']
biota_dfs_review=tfm.dfs['biota']

In [None]:
seawater_dfs_review.columns

Index(['smp_depth', 'time', 'lon', 'lat', 'cs137_dl', 'h3_dl', 'pb210_dl',
       'po210_dl', 'pu239_240_tot_dl', 'ra226_dl', 'ra228_dl', 'tc99_dl',
       'cs137_unc', 'h3_unc', 'pb210_unc', 'po210_unc', 'pu239_240_tot_unc',
       'ra226_unc', 'ra228_unc', 'tc99_unc', 'cs137_unit', 'h3_unit',
       'pb210_unit', 'po210_unit', 'pu239_240_tot_unit', 'ra226_unit',
       'ra228_unit', 'tc99_unit', 'cs137', 'h3', 'pb210', 'po210',
       'pu239_240_tot', 'ra226', 'ra228', 'tc99'],
      dtype='object')
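
The column names above follow a pattern: one value column per nuclide (e.g. `cs137`) plus suffixed columns such as `cs137_unc`. The sketch below reproduces that naming on made-up data with plain pandas. It is an illustration of the reshaping idea under assumed column names, not the `ReshapeLongToWide` implementation.

```python
import pandas as pd

# Hypothetical long-format data: one row per (sample, nuclide).
long_df = pd.DataFrame({
    'lat': [49.7, 49.7, 54.9],
    'lon': [-1.9, -1.9, -5.8],
    'time': [789091200, 789091200, 789955200],
    'nuclide': ['cs137', 'h3', 'cs137'],
    'value': [0.038, 20.0, 0.0498],
    '_unc': [0.002, 1.0, 0.003],
})
keys = ['lat', 'lon', 'time']  # assumed sample-identifying columns

# One row per sample, one column per (measure, nuclide) pair.
wide = long_df.pivot(index=keys, columns='nuclide', values=['value', '_unc'])

# Flatten the MultiIndex: ('value', 'cs137') -> 'cs137', ('_unc', 'cs137') -> 'cs137_unc'.
wide.columns = [nuc if measure == 'value' else f'{nuc}{measure}'
                for measure, nuc in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```

Nuclides that were not measured for a given sample end up as `NaN` in the wide table, which is why the reshaped output is sparse.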

In [None]:
seawater_dfs_review

org_index,smp_depth,time,lon,lat,cs137_dl,h3_dl,pb210_dl,po210_dl,pu239_240_tot_dl,ra226_dl,...,ra228_unit,tc99_unit,cs137,h3,pb210,po210,pu239_240_tot,ra226,ra228,tc99
0,0.0,789091200,-1.937500,49.691944,2.0,2.0,,,,,...,,,0.038000,20.00000,,,,,,
1,0.0,789091200,-1.591389,48.832778,2.0,2.0,,,,,...,,,0.023000,10.00000,,,,,,
2,0.0,789955200,-5.800000,54.866944,1.0,,,,,,...,,,0.049846,,,,,,,
3,0.0,790041600,-3.606944,54.488889,,1.0,,,,,...,,,,5.26388,,,,,,
4,0.0,790214400,-7.801389,43.882222,2.0,1.0,,,,,...,,,0.112000,0.19000,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10708,1683.0,1365292800,13.270000,73.720000,1.0,,,,,,...,,,0.000570,,,,,,,
10709,1685.0,1442966400,13.266667,73.724167,1.0,,,,1.0,,...,,,0.000544,,,,0.000015,,,
10710,1693.0,1219276800,13.270000,73.730000,2.0,,,,1.0,,...,,,0.002400,,,,0.000012,,,
10711,1694.0,1187136000,13.251667,73.717222,1.0,,,,,,...,,1.0,0.001200,,,,,,,0.00012


In [None]:
biota_dfs_review

org_index,bio_group,time,lon,lat,body_part,species,am241_dl,cs134_dl,cs137_dl,h3_dl,...,cs134,cs137,h3,pb210,po210,pu238,pu239_240_tot,ra226,ra228,tc99
0,2,1561939200,-6.158056,54.038611,19,23,,,,,...,,,,,,,0.015820,,,
1,2,1561939200,-6.158056,54.038611,19,1608,,,,,...,,,,,,,0.018000,,,
2,2,1583884800,-6.110000,54.020278,19,1608,,,1.0,,...,,0.070904,,,,,,,,
3,2,1592265600,-6.110000,54.020278,19,1608,,,1.0,,...,,0.231840,,,,,,,,
4,2,1593561600,-6.158056,54.038611,19,1608,,,,,...,,,,,,,,,,2.534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,14,1633478400,8.301944,54.760556,52,129,,2.0,1.0,,...,0.0097,0.014300,,,,,,,,
9720,14,1635120000,2.918056,51.238056,1,129,,,2.0,2.0,...,,0.343027,5.013468,,,,0.015832,0.606894,1.055467,
9721,14,1636588800,-6.158056,54.038611,19,414,,,2.0,,...,,0.080500,,,,,,,,
9722,14,1637539200,4.031111,51.393056,1,377,,,2.0,2.0,...,,0.231613,3.705809,,,,0.009265,0.509549,0.926452,


## NetCDF encoding

### Change logs

Review the change logs for the NetCDF encoding.

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units, default_units),
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            UniqueSampleHandlerCB(encoding_type='netcdf', method='drop', verbose=True),
                            ReshapeLongToWide()
                            ])

# Transform
tfm()
# Check transformation logs
tfm.logs

['Remove rows with all NA values.',
 'Parse the time format in the DataFrame.',
 'Encode time as `int` representing seconds since xxx.',
 'Sanitize value by removing blank entries and populating the `value` column.',
 'Normalize uncertainty values in DataFrames.',
 "Remap values from 'Species' to 'species' for groups: biota.",
 "Remap values from 'Biological group' to 'enhanced_species' for groups: biota.",
 "Enhance the 'species' column using the 'enhanced_species' column if conditions are met.",
 'Update biogroup id based on MARIS dbo_species.xlsx.',
 'Add a temporary column with the body part and biological group combined.',
 "Remap values from 'body_part_temp' to 'body_part' for groups: biota.",
 'Update taxon information based on MARIS species LUT.',
 'Set the `unit` id column in the DataFrames based on a lookup table.',
 'Remap value type to MARIS format.',
 "Remap data provider's ID column to `samplabcode` in each DataFrame.",
 'Remap Station ID to MARIS format.',
 "Record measu

***

### Feed global attributes

In [None]:
#| export
kw = ['oceanography', 'Earth Science > Oceans > Ocean Chemistry> Radionuclides',
      'Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure',
      'Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments',
      'Earth Science > Oceans > Ocean Chemistry, Earth Science > Oceans > Sea Ice > Isotopes',
      'Earth Science > Oceans > Water Quality > Ocean Contaminants',
      'Earth Science > Biological Classification > Animals/Vertebrates > Fish',
      'Earth Science > Biosphere > Ecosystems > Marine Ecosystems',
      'Earth Science > Biological Classification > Animals/Invertebrates > Mollusks',
      'Earth Science > Biological Classification > Animals/Invertebrates > Arthropods > Crustaceans',
      'Earth Science > Biological Classification > Plants > Macroalgae (Seaweeds)']


Should we extend the attrs to describe

In [None]:
#| export
def get_attrs(tfm, zotero_key, kw=kw):
    return GlobAttrsFeeder(tfm.dfs, cbs=[
        BboxCB(),
        DepthRangeCB(),
        TimeRangeCB(cfg()),
        ZoteroCB(zotero_key, cfg=cfg()),
        KeyValuePairCB('keywords', ', '.join(kw)),
        KeyValuePairCB('publisher_postprocess_logs', ', '.join(tfm.logs))
        ])()

In [None]:
#|eval: false
get_attrs(tfm, zotero_key=zotero_key, kw=kw)

{'geospatial_lat_min': '49.43222222222222',
 'geospatial_lat_max': '81.26805555555555',
 'geospatial_lon_min': '-58.23166666666667',
 'geospatial_lon_max': '36.181666666666665',
 'geospatial_bounds': 'POLYGON ((-58.23166666666667 36.181666666666665, 49.43222222222222 36.181666666666665, 49.43222222222222 81.26805555555555, -58.23166666666667 81.26805555555555, -58.23166666666667 36.181666666666665))',
 'time_coverage_start': '1995-01-01T00:00:00',
 'time_coverage_end': '2021-12-31T00:00:00',
 'title': 'OSPAR Environmental Monitoring of Radioactive Substances',
 'summary': '',
 'creator_name': '[{"creatorType": "author", "firstName": "", "lastName": "OSPAR Comission\'s Radioactive Substances Committee (RSC)"}]',
 'keywords': 'oceanography, Earth Science > Oceans > Ocean Chemistry> Radionuclides, Earth Science > Human Dimensions > Environmental Impacts > Nuclear Radiation Exposure, Earth Science > Oceans > Ocean Chemistry > Ocean Tracers, Earth Science > Oceans > Marine Sediments, Earth 

In [None]:
#| export
def enums_xtra(tfm, vars):
    "Retrieve a subset of the lengthy enum as 'species_t' for instance"
    enums = Enums(lut_src_dir=lut_path(), cdl_enums=cdl_cfg()['enums'])
    xtras = {}
    for var in vars:
        unique_vals = tfm.unique(var)
        if unique_vals.any():
            xtras[f'{var}_t'] = enums.filter(f'{var}_t', unique_vals)
    return xtras
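
The idea behind retrieving a subset of a lengthy enum can be sketched with a plain dictionary. The names, ids and the dictionary layout below are hypothetical and do not reflect how `Enums.filter` is implemented; the sketch only shows keeping the members that the data actually refers to.

```python
# Hypothetical full lookup of species names to ids (not the real MARIS enum).
species_enum = {'species_a': 99, 'species_b': 129, 'species_c': 414}

# Ids actually present in the (toy) transformed data.
ids_in_data = {99, 414}

# Keep only the enum members that the data refers to.
species_subset = {name: idx for name, idx in species_enum.items()
                  if idx in ids_in_data}
print(species_subset)
```

Restricting the enum in this way keeps the encoded file small when only a fraction of the lookup table is used.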

### Encoding NetCDF

In [None]:
#| export
def encode(fname_in, fname_out_nc, nc_tpl_path, **kwargs):
    dfs = load_data(fname_in)
    tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units, default_units),
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='netcdf'),
                            UniqueSampleHandlerCB(encoding_type='netcdf', method='drop', verbose=True),
                            ReshapeLongToWide()
                                ])
    tfm()
    encoder = NetCDFEncoder(tfm.dfs, 
                            src_fname=nc_tpl_path,
                            dest_fname=fname_out_nc, 
                            global_attrs=get_attrs(tfm, zotero_key=zotero_key, kw=kw),
                            verbose=kwargs.get('verbose', False),
                            enums_xtra=enums_xtra(tfm, vars=['species', 'body_part'])
                           )
    encoder.encode()

In [None]:
#|eval: false
encode(fname_in, fname_out_nc, nc_tpl_path(), verbose=True)

--------------------------------------------------------------------------------
Group: seawater, Variable: lon
--------------------------------------------------------------------------------
Group: seawater, Variable: lat
--------------------------------------------------------------------------------
Group: seawater, Variable: smp_depth
--------------------------------------------------------------------------------
Group: seawater, Variable: time
--------------------------------------------------------------------------------
Group: seawater, Variable: h3
--------------------------------------------------------------------------------
Group: seawater, Variable: h3_unc
--------------------------------------------------------------------------------
Group: seawater, Variable: h3_dl
--------------------------------------------------------------------------------
Group: seawater, Variable: h3_unit
--------------------------------------------------------------------------------
Group: s

## Open Refine Pipeline

### Rename columns for Open Refine

In [None]:
#|eval: false
dfs = load_data(fname_in)
tfm = Transformer(dfs, cbs=[
                            RemoveAllNAValuesCB(cols_to_check),
                            AddSampleTypeIdColumnCB(),
                            RemapNuclideNameCB(lut_nuclides),
                            AddNuclideIdColumnCB(col_value='NUCLIDE'),
                            ParseTimeCB(),
                            EncodeTimeCB(cfg(), verbose = True),
                            SanitizeValue(),
                            NormalizeUncCB(),
                            RemapCB(fn_lut=lut_biota, col_remap='species', col_src='Species', dest_grps='biota'),
                            RemapCB(fn_lut=lut_biota_enhanced, col_remap='enhanced_species', col_src='Biological group', dest_grps='biota'),
                            EnhanceSpeciesCB(),
                            LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                            AddBodypartTempCB(),
                            RemapCB(fn_lut=lut_bodyparts, col_remap='body_part', col_src='body_part_temp' , dest_grps='biota'),
                            RemapTaxonInformationCB(lut_taxon),
                            RemapUnitCB(lut_units, default_units),
                            RemapDetectionLimitCB(lut_dl), 
                            AddSampleLabCodeCB(),
                            RemapStationIdCB(),
                            RecordMeasurementNoteCB(),
                            RecordRefNoteCB(),
                            RecordSampleNoteCB(),   
                            ConvertLonLatCB(),                    
                            SanitizeLonLatCB(),
                            SelectAndRenameColumnCB(get_renaming_rules, encoding_type='openrefine', verbose=True),
                            UniqueSampleHandlerCB(group_by=unique_column_group, method='drop'), # The unique sample handler should be used for consistency with the netcdf encoding
                            CompareDfsAndTfmCB(dfs)
                            ])                        

tfm()
print(pd.DataFrame.from_dict(tfm.compare_stats) , '\n')

KeyError: Index(['time', 'lon', 'lat', 'smp_depth', 'nuclide'], dtype='object')

**Example of data included in `dfs_dropped`.**

Main reason for data to be dropped from `dfs`:
- No activity value reported (i.e. ``Activity or MDA``)

Reason why 6 biota values are dropped:
- The body part is not known (i.e. 'Mix of muscle and whole fish without liver' or 'UNKNOWN')
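
One way to think about dropped rows is as the index labels present in the raw data but absent after transformation. The sketch below illustrates this on made-up data, with `dropna` standing in for the transformation. It is an assumption about the general approach, not the `CompareDfsAndTfmCB` implementation.

```python
import pandas as pd

# Hypothetical raw data: the second row has no activity value.
raw = pd.DataFrame({
    'ID': [1, 2, 3],
    'Activity or MDA': [0.038, None, 5.26],
})

# Stand-in for a transformation that removes rows without an activity value.
transformed = raw.dropna(subset=['Activity or MDA'])

# Rows whose index label no longer appears after the transformation were dropped.
dropped = raw.loc[raw.index.difference(transformed.index)]
print(len(raw), len(transformed), len(dropped))
```

The row counts satisfy the same check as the comparison table earlier in this notebook: kept rows plus dropped rows equal the rows in the raw data.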

In [None]:
grp='seawater'
#grp='biota'
tfm.dfs_dropped[grp]

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
16799,97147,,,,,,,,,,...,,,,,,,,,,
16800,97148,,,,,,,,,,...,,,,,,,,,,
16801,97149,,,,,,,,,,...,,,,,,,,,,
16802,97150,,,,,,,,,,...,,,,,,,,,,
16803,97151,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18474,120366,Ireland,4.0,N8,,53.0,39.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18475,120367,Ireland,4.0,N9,,53.0,53.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18476,120368,Ireland,4.0,N10,,53.0,52.0,0.0,N,5.0,...,,,,,,,,2021 data,,
18477,120369,Ireland,1.0,Salthill,,53.0,15.0,40.0,N,9.0,...,,,,,,,,2021 data,Woodstown (County Waterford) and Salthill (Cou...,


## Open Refine encoder

In [None]:
#| export
def encode_or(fname_in, fname_out_csv, ref_id, **kwargs):
    dfs = load_data(fname_in)
    tfm = Transformer(dfs, cbs=[
                                GetSampleTypeCB(type_lut),
                                LowerStripRdnNameCB(),
                                RemapRdnNameCB(),
                                ParseTimeCB(),
                                EncodeTimeCB(cfg()),        
                                SanitizeValue(),                       
                                NormalizeUncCB(unc_exp2stan),
                                LookupBiotaSpeciesCB(get_maris_species, unmatched_fixes_biota_species),
                                CorrectWholeBodyPartCB(),
                                LookupBiotaBodyPartCB(get_maris_bodypart, unmatched_fixes_biota_tissues),
                                LookupBiogroupCB(partial(get_biogroup_lut, species_lut_path())),
                                LookupTaxonInformationCB(partial(get_taxon_info_lut, species_lut_path())),
                                LookupUnitCB(renaming_unit_rules),
                                LookupDetectionLimitCB(detection_limit_lut_path()),
                                RemapDataProviderSampleIdCB(),
                                RemapStationIdCB(),
                                RecordMeasurementNoteCB(),
                                RecordRefNoteCB(),
                                RecordSampleNoteCB(),   
                                ConvertLonLatCB(),                    
                                SanitizeLonLatCB(),
                                SelectAndRenameColumnCB(get_renaming_rules, encoding_type='openrefine', verbose=True),
                                CompareDfsAndTfmCB(dfs)
                                ])
    tfm()

    encoder = OpenRefineCsvEncoder(tfm.dfs, 
                                    dest_fname=fname_out_csv, 
                                    ref_id = ref_id,
                                    verbose = kwargs.get('verbose', False)  # honour the caller's `verbose` keyword
                                )
    encoder.encode()

In [None]:
#|eval: false
encode_or(fname_in, fname_out_csv, ref_id, verbose=True)

## EXTRA 

I had included `RemoveFilteredRowsCB` but later replaced it with the generic callback `RemoveAllNAValuesCB`. However, this type of callback might still be useful for removing rows based on a custom filter condition.

In [None]:
#| exports
class RemoveFilteredRowsCB(Callback):
    """ Remove rows from a dataframe based on a filter condition. """
    
    def __init__(self, filters:dict, verbose:bool=False):
        fc.store_attr()
    
    def __call__(self, tfm: 'Transformer'):
        for df_name, filter_condition in self.filters.items():
            self._process_dataframe(tfm, df_name, filter_condition)

    def _process_dataframe(self, tfm: 'Transformer', df_name: str, filter_condition: Callable):
        if df_name in tfm.dfs:
            df = tfm.dfs[df_name]
            initial_rows = len(df)
            df = self._apply_filter(df, filter_condition)
            removed_rows = initial_rows - len(df)
            self._log_removal(df_name, removed_rows)
            tfm.dfs[df_name] = df
        else:
            self._log_missing_dataframe(df_name)

    def _apply_filter(self, df: pd.DataFrame, filter_condition: Callable) -> pd.DataFrame:
        mask = filter_condition(df)
        return df[~mask]  # Keep rows that don't match the filter

    def _log_removal(self, df_name: str, removed_rows: int):
        if self.verbose:
            print(f"RemoveFilteredRowsCB: Removed {removed_rows} rows from '{df_name}'.")

    def _log_missing_dataframe(self, df_name: str):
        if self.verbose:
            print(f"RemoveFilteredRowsCB: Dataframe '{df_name}' not found in tfm.dfs.")

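The callback `RemoveFilteredRowsCB` removes rows based on a custom filter condition. Each filter is a callable that takes a dataframe and returns a boolean mask; rows where the mask is `True` are removed. For instance, we can remove rows with `NUCLIDE` labelled as `Unknown` as shown below.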

In [None]:
#| exports
nuclide_filters = {
    'seawater': lambda df: df['NUCLIDE'] == 'Unknown'
}
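
The filtering logic can be tried in isolation from the `Transformer` machinery. The sketch below reproduces the mask-and-negate step of `_apply_filter` as a standalone function on a toy dataframe. The names `apply_filter`, `toy` and `is_unknown_or_missing`, the `VALUE` column, and the data itself are hypothetical and only serve as illustration.

```python
import pandas as pd

def apply_filter(df: pd.DataFrame, filter_condition) -> pd.DataFrame:
    """Keep the rows for which `filter_condition` returns False."""
    mask = filter_condition(df)
    return df[~mask]

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'NUCLIDE': ['137Cs', 'Unknown', '3H', 'Unknown'],
    'VALUE': [0.2, 1.0, None, 3.0],
})

# Same condition as in `nuclide_filters`.
is_unknown = lambda df: df['NUCLIDE'] == 'Unknown'
kept = apply_filter(toy, is_unknown)
removed_rows = len(toy) - len(kept)
print(removed_rows)  # → 2

# Conditions can be combined with the element-wise operators `|` and `&`.
is_unknown_or_missing = lambda df: (df['NUCLIDE'] == 'Unknown') | df['VALUE'].isna()
kept_strict = apply_filter(toy, is_unknown_or_missing)
print(kept_strict['NUCLIDE'].tolist())  # → ['137Cs']
```

Because a filter is just a callable returning a boolean Series, any vectorised pandas expression can be used, provided the Series is aligned with the dataframe's index.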

In [None]:
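tfm = Transformer(dfs, cbs=[
    RemoveAllNAValuesCB(cols_to_check),
    RemapNuclideNameCB(lut_nuclides),
    # Apply the custom filter defined above; verbose=True prints the number of removed rows
    RemoveFilteredRowsCB(nuclide_filters, verbose=True)])
tfm()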


RemoveFilteredRowsCB: Removed 8 rows from 'seawater'.


{'seawater':            ID Contracting Party  RSC Sub-division   Station ID Sample ID  \
 0           1           Belgium               8.0  Belgica-W01    WNZ 01   
 1           2           Belgium               8.0  Belgica-W02    WNZ 02   
 2           3           Belgium               8.0  Belgica-W03    WNZ 03   
 3           4           Belgium               8.0  Belgica-W04    WNZ 04   
 4           5           Belgium               8.0  Belgica-W05    WNZ 05   
 ...       ...               ...               ...          ...       ...   
 18851  121646    United Kingdom              10.0       Rosyth   2100318   
 18852  121647    United Kingdom              10.0       Rosyth   2101399   
 18853  121648    United Kingdom               6.0        Wylfa    21-656   
 18854  121649    United Kingdom               6.0        Wylfa    21-657   
 18855  121650    United Kingdom               6.0        Wylfa    21-654   
 
        LatD  LatM  LatS LatDir  LongD  ...  Nuclide  Value ty