# Sample uniqueness and traceability

**Import notebook dependencies**

In [None]:
import pandas as pd

In [None]:
from IPython.display import display, Markdown

## Introduction
We aim to integrate a mechanism within MARISCO that lets us trace data back to the data provider. Traditionally (in the Open Refine CSV data format), this was achieved using the `ref id` and `samplelabcode` (where available). When `samplelabcode` is absent, a combination of latitude (`latitude`), longitude (`longitude`), time (`begperiod`), and other identifiers, such as sample type (`sedtype_id`), species (`species_id`), and nuclide (`nuclide_id`), is used to identify a measurement. This document explores the concept of measurement (data) uniqueness for each of our current data providers.

## Purpose of the Traceability Mechanism

1. **Traceability**: To trace each measurement entry back to its source, i.e., the data entry of the data provider.

2. **Uniqueness**: To maintain a unique record of data in our MARIS database. If data is updated at the source, we aim to identify these changes and update the MARIS database accordingly.


## Benefits of Using SampleLabCode

Traditionally, `SampleLabCode` has been used for traceability and uniqueness. However, `SampleLabCode` often encodes more than just a unique identifier. For instance, a laboratory might embed details such as the sample type, the project name, or the laboratory itself.

## Known Considerations
1. **Data Type in NetCDF**: It is preferable to use non-string data types in NetCDF to prevent data size inflation. However, some data providers use strings to identify samples.
2. **Unique Identifiers**: There may not be a single unique identifier for each sample. A combination of a unique value and the 'NUCLIDE' column might be necessary to uniquely identify a measurement (e.g. HELCOM).
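
One possible way to reconcile string identifiers with a preference for non-string NetCDF types is to store integer codes and keep a separate lookup table. The sketch below uses `pandas.factorize` on hypothetical identifier values; it illustrates the idea and is not a description of how MARISCO encodes identifiers.

```python
import pandas as pd

# Hypothetical string identifiers, as a data provider might supply them.
sample_ids = pd.Series(["DA 17531", "DA 17534", "DA 17531", "WNZ 01"])

# factorize returns one integer code per row plus the unique labels,
# in order of first appearance.
codes, labels = pd.factorize(sample_ids)

# Lookup table to translate integer codes back to the provider's strings.
lookup = dict(enumerate(labels))
```

The integer `codes` could be stored as a numeric variable, while `lookup` would need to be kept alongside the data so the original identifier remains recoverable.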


## Additional Considerations
- **Uniqueness Check**: Verify if the combination of HELCOM 'KEY' and 'NUCLIDE' provides a unique identifier.
- **Duplicate Samples**: Consult with data providers to define what constitutes a duplicate sample and determine the protocol for handling duplicates (e.g., whether to report them).
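
The uniqueness check listed above can be sketched as follows. The data here are toy values and the `METHOD` column and the `is_unique_key` helper are hypothetical; only the column names `KEY` and `NUCLIDE` come from the text.

```python
import pandas as pd

# Toy data: the last two rows share KEY and NUCLIDE but differ in a
# hypothetical METHOD column (e.g. different measurement techniques).
helcom = pd.DataFrame({
    "KEY": ["K1", "K1", "K2", "K2"],
    "NUCLIDE": ["CS137", "SR90", "CS137", "CS137"],
    "METHOD": ["A", "A", "A", "B"],
})


def is_unique_key(df, cols):
    """Return True if no two rows share the same values in `cols`."""
    return not df.duplicated(subset=cols).any()


key_nuclide_unique = is_unique_key(helcom, ["KEY", "NUCLIDE"])

# Rows that collide on the candidate key, kept for inspection.
clashes = helcom[helcom.duplicated(subset=["KEY", "NUCLIDE"], keep=False)]
```

Inspecting `clashes` rather than only the boolean result shows which additional column, if any, distinguishes the colliding rows.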


## Review unique identifier for each data entry
1. Create a traceable unique identifier for each data entry. This will be a combination of certain columns from the data provider.
    - **OSPAR**: ID or 
    - **HELCOM**: KEY + nuclide (Note: This combination might not be unique, e.g., reporting for different measurement techniques).
    - **GEOTRACES**:
    - **TEPCO**:

Let's review the uniqueness of the **OSPAR** data.

In [None]:
from marisco.handlers.ospar import load_data as ospar_load_data
from marisco.handlers.ospar import src_dir as ospar_src_dir

In [None]:
ospar_dfs = ospar_load_data(ospar_src_dir)

In [None]:
with pd.option_context('display.max_columns', None):
    display(ospar_dfs['BIOTA'].head(2))

Unnamed: 0,id,contracting party,rsc sub-division,station id,sample id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample type,biological group,species,body part,sampling date,nuclide,value type,activity or mda,uncertainty,unit,data provider,measurement comment,sample comment,reference comment
0,1,Belgium,8,Kloosterzande-Schelde,DA 17531,51,23.0,36.0,N,4,1.0,52.0,E,BIOT,Molluscs,Ostrea edulis,WHOLE ANIMAL,03/03/10 00:00:00,137Cs,<,0.326416,,Bq/kg f.w.,SCK•CEN,,,
1,2,Belgium,8,Kloosterzande-Schelde,DA 17534,51,23.0,36.0,N,4,1.0,52.0,E,BIOT,Molluscs,Ostrea edulis,WHOLE ANIMAL,06/14/10 00:00:00,137Cs,<,0.442704,,Bq/kg f.w.,SCK•CEN,,,


In [None]:
with pd.option_context('display.max_columns', None):
    display(ospar_dfs['SEAWATER'].head(2))

Unnamed: 0,id,contracting party,rsc sub-division,station id,sample id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample type,sampling depth,sampling date,nuclide,value type,activity or mda,uncertainty,unit,data provider,measurement comment,sample comment,reference comment
0,1,Belgium,8.0,Belgica-W01,WNZ 01,51,22.0,31.0,N,3,11.0,17.0,E,Water,3.0,01/27/10 00:00:00,137Cs,<,0.2,,Bq/l,SCK•CEN,,,
1,2,Belgium,8.0,Belgica-W02,WNZ 02,51,13.0,25.0,N,2,51.0,34.0,E,Water,3.0,01/27/10 00:00:00,137Cs,<,0.27,,Bq/l,SCK•CEN,,,


Columns that can be used to identify a unique measurement:

In the OSPAR datasets, data are split by sample type into Biota and Seawater datasets. Within each dataset, the `id` field uniquely identifies each measurement with an incrementing integer (starting at 1 in the rows shown above). However, the `id` is not unique across the sample type datasets: both Biota and Seawater contain `id` 1, for example. Therefore, once measurement data is processed and stored in the MARIS DB, the `id` alone is insufficient for unique identification. To accurately trace a measurement back to the OSPAR data, a combination of `ref_id`, `sample_type`, and `id` (if available) must be used.
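
A minimal sketch of that composite identifier, using toy stand-ins for the two datasets. The `REF_ID` value, the `source_key` column name, and the `:` separator are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two OSPAR datasets; `id` restarts in each one.
biota = pd.DataFrame({"id": [1, 2], "nuclide": ["137Cs", "137Cs"]})
seawater = pd.DataFrame({"id": [1, 2], "nuclide": ["137Cs", "137Cs"]})

REF_ID = 999  # hypothetical placeholder, not a real MARIS reference id

# Tag each dataset with its sample type before stacking them.
combined = pd.concat(
    [biota.assign(sample_type="BIOTA"), seawater.assign(sample_type="SEAWATER")],
    ignore_index=True,
).assign(ref_id=REF_ID)

# `id` alone collides across datasets; the three-column combination does not.
id_alone_unique = combined["id"].is_unique
composite_unique = not combined.duplicated(
    subset=["ref_id", "sample_type", "id"]
).any()

# Optional single-string form of the composite key.
combined["source_key"] = (
    combined["ref_id"].astype(str)
    + ":" + combined["sample_type"]
    + ":" + combined["id"].astype(str)
)
```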

Within each dataset, the `sample id` field lets laboratories submit the sample identifier they use internally. This field is free text and is not unique per measurement; it identifies the sample rather than an individual measurement. In some cases the value is `NaN`.

In [None]:
duplicates = ospar_dfs['SEAWATER'][ospar_dfs['SEAWATER'].duplicated('sample id', keep=False)].sort_values('sample id')
with pd.option_context('display.max_columns', None):
    display(duplicates.head(5))

Unnamed: 0,id,contracting party,rsc sub-division,station id,sample id,latd,latm,lats,latdir,longd,longm,longs,longdir,sample type,sampling depth,sampling date,nuclide,value type,activity or mda,uncertainty,unit,data provider,measurement comment,sample comment,reference comment
6586,56322,United Kingdom,6.0,Chapelcross,00-4018,54,52.0,22.0,N,3,35.0,40.0,W,Water,0.0,04/12/00 00:00:00,3H,=,6.72,1.52,Bq/l,SEPA-Scottish Environment Protection Agency,,Southerness,
6575,56311,United Kingdom,6.0,Chapelcross,00-4018,54,52.0,22.0,N,3,35.0,40.0,W,Water,0.0,04/12/00 00:00:00,"239,240Pu",=,0.018,0.0011,Bq/l,SEPA-Scottish Environment Protection Agency,,Southerness,
6576,56312,United Kingdom,6.0,Chapelcross,00-4018,54,52.0,22.0,N,3,35.0,40.0,W,Water,0.0,04/13/00 00:00:00,"239,240Pu",=,0.00546,0.00037,Bq/l,SEPA-Scottish Environment Protection Agency,,Southerness,
6587,56323,United Kingdom,6.0,Chapelcross,00-4031,54,52.0,22.0,N,3,35.0,40.0,W,Water,0.0,07/26/00 00:00:00,3H,=,7.2,1.3,Bq/l,SEPA-Scottish Environment Protection Agency,,Southerness,
6577,56313,United Kingdom,6.0,Chapelcross,00-4031,54,52.0,22.0,N,3,35.0,40.0,W,Water,0.0,07/26/00 00:00:00,"239,240Pu",=,0.001142,7e-05,Bq/l,SEPA-Scottish Environment Protection Agency,,Southerness,
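
To compare candidate identifiers on the rows displayed above, the sketch below rebuilds those five rows (a subset of columns, values copied from the output) and tests several column combinations. It only says something about these five rows, not about the full dataset.

```python
import pandas as pd

# Values copied from the five SEAWATER rows shown above.
rows = pd.DataFrame({
    "id": [56322, 56311, 56312, 56323, 56313],
    "sample id": ["00-4018", "00-4018", "00-4018", "00-4031", "00-4031"],
    "sampling date": [
        "04/12/00 00:00:00", "04/12/00 00:00:00", "04/13/00 00:00:00",
        "07/26/00 00:00:00", "07/26/00 00:00:00",
    ],
    "nuclide": ["3H", "239,240Pu", "239,240Pu", "3H", "239,240Pu"],
})

candidates = [
    ["sample id"],
    ["sample id", "nuclide"],
    ["sample id", "nuclide", "sampling date"],
    ["id"],
]

# Map each candidate key to True when no two rows share its values.
results = {
    " + ".join(cols): not rows.duplicated(subset=cols).any()
    for cols in candidates
}
```

On these rows, `sample id` combined with `nuclide` still collides, because sample `00-4018` has two `239,240Pu` entries with different sampling dates.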
