# Defining compound_idx

In many MARIS handlers, a pivot from long to wide format is required. For this pivot, a compound_idx is used.  
Currently, the compound_idx is determined by combining the following: 



Currently, duplication of the compound_idx is found in the following situations:

1) In GEOTRACES, when (e.g. for seawater) several nuclide measurements from a given rosette share the same lon, lat, time and smp_depth (at least that's what I understand);
2) In OSPAR sediment, when we have records where top and bottom are NaN for a given lon, lat, time. In that case our compound index would be (lon, lat, time, top, bottom);
3) In situations where a nuclide is measured for a sample using more than one method (e.g. Am241 is normally measured by alpha and gamma spectrometry);
4) In situations where both a rapid analysis and a detailed analysis are reported;
5) In a situation where a sample is collected and split into two or more sub-samples. For these sub-samples the compound index would be the same. Sometimes this type of sample is sent to several laboratories.
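To illustrate why a duplicated compound_idx matters for the long-to-wide pivot, here is a minimal sketch on toy data. The column names (`lon`, `lat`, `time`, `nuclide`, `value`) and all values are assumptions made for the example and are not taken from a MARIS handler.

```python
import pandas as pd

# Toy long-format data (hypothetical values): the last two rows share the
# same compound index (lon, lat, time) AND the same nuclide.
long_df = pd.DataFrame({
    'lon': [10.0, 10.0, 10.0],
    'lat': [55.0, 55.0, 55.0],
    'time': ['2021-10-07', '2021-10-07', '2021-10-07'],
    'nuclide': ['cs137', 'am241', 'am241'],
    'value': [0.2, 1.1, 1.3],
})
compound_idx = ['lon', 'lat', 'time']

# pivot() refuses to reshape when (index, columns) pairs are not unique.
try:
    long_df.pivot(index=compound_idx, columns='nuclide', values='value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f'pivot failed: {err}')

# pivot_table() succeeds, but it aggregates the duplicates (mean by default),
# so the two am241 values are merged into a single number.
wide_df = long_df.pivot_table(index=compound_idx, columns='nuclide', values='value')
print(wide_df)
```

With `pivot`, the duplicate is reported as an error. With `pivot_table`, it is averaged away, which may hide situations 3) to 5) above.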


In [None]:
import pandas as pd
from pathlib import Path 

In [None]:
from marisco.handlers.helcom import load_data as helcom_load_data

## Load OSPAR data

In [None]:
default_smp_types = {'Seawater data': 'seawater', 'Biota data': 'biota'}
def ospar_load_data(src_dir:str, # Directory where the source CSV files are located
              lut:dict=default_smp_types # A dictionary with the file name as key and the sample type as value
              ) -> dict: # A dictionary with sample types as keys and their corresponding dataframes as values
    "Load `OSPAR` data and return the data in a dictionary of dataframes with the dictionary key as the sample type."
    return {
        sample_type: pd.read_csv(Path(src_dir) / f'{file_name}.csv', encoding='unicode_escape')
        for file_name, sample_type in lut.items()
    }

In [None]:
ospar_fname_in = '../../_data/accdb/ospar/csv'

In [None]:
#| eval: false
ospar_dfs = ospar_load_data(ospar_fname_in)

print('keys/sample types: ', ospar_dfs.keys())

for key in ospar_dfs.keys():
    print(f'{key} columns: ', ospar_dfs[key].columns)

keys/sample types:  dict_keys(['seawater', 'biota'])
seawater columns:  Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Sampling depth', 'Sampling date',
       'Nuclide', 'Value type', 'Activity or MDA', 'Uncertainty', 'Unit',
       'Data provider', 'Measurement Comment', 'Sample Comment',
       'Reference Comment'],
      dtype='object')
biota columns:  Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Biological group', 'Species',
       'Body Part', 'Sampling date', 'Nuclide', 'Value type',
       'Activity or MDA', 'Uncertainty', 'Unit', 'Data provider',
       'Measurement Comment', 'Sample Comment', 'Reference Comment'],
      dtype='object')


Size of dataframe:

In [None]:
ospar_dfs['biota'].shape

(15314, 27)

In [None]:
ospar_dfs['seawater'].shape

(18856, 25)

# Review of the duplicate compound_idx for OSPAR data


## Duplicates of data

The OSPAR dataset includes a unique 'ID' for each row. Let's check that each row is unique.

Number of duplicated rows in the biota dataframe:

In [None]:
ospar_dfs['biota'].duplicated().sum()

0

Number of duplicated rows in the seawater dataframe:


In [None]:
ospar_dfs['seawater'].duplicated().sum()

0

There are no duplicated rows in the biota or seawater dataframes; each row is unique. 
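Note that `duplicated()` on the full dataframe also compares the 'ID' column, so it cannot flag any row as long as 'ID' is unique. The sketch below, on toy data with hypothetical values, shows a complementary check that drops 'ID' before looking for duplicates.

```python
import pandas as pd

# Toy frame (hypothetical values): 'ID' is unique, but the first two rows
# are otherwise identical.
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Sample ID': ['A', 'A', 'B'],
    'Nuclide': ['137Cs', '137Cs', '137Cs'],
    'Activity or MDA': [0.2, 0.2, 0.5],
})

# Full-row comparison: the unique 'ID' makes every row distinct.
n_dup_all = int(df.duplicated().sum())

# Comparison without 'ID': the repeated record becomes visible.
n_dup_no_id = int(df.drop(columns='ID').duplicated().sum())

print(df['ID'].is_unique, n_dup_all, n_dup_no_id)
```

The same two calls could be applied to `ospar_dfs['biota']` and `ospar_dfs['seawater']`. The result for the real data is not shown here.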

## Review of the compound index (BIOTA)

First, we focus on the biota dataframe. Let's look at the columns of the biota dataframe:

In [None]:
grp='biota'

In [None]:
ospar_dfs[grp].columns

Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Biological group', 'Species',
       'Body Part', 'Sampling date', 'Nuclide', 'Value type',
       'Activity or MDA', 'Uncertainty', 'Unit', 'Data provider',
       'Measurement Comment', 'Sample Comment', 'Reference Comment'],
      dtype='object')

Let's look at where the combination of 'Sample ID' and 'Nuclide' is duplicated.

In [None]:
ospar_dfs_sampleId_nuclide_df=ospar_dfs[grp][ospar_dfs[grp][['Sample ID','Nuclide']].duplicated(keep=False)]
print(ospar_dfs_sampleId_nuclide_df)

          ID Contracting Party  RSC Sub-division   Station ID Sample ID  LatD  \
4      96857    United Kingdom                10      Torness   2100074    55   
8      95781           Denmark                 9  Agger Tange  20210688    56   
9      95785           Denmark                 9  Hvide Sande  20210683    56   
88     96109       Netherlands                 9    BOCHTVWTM       NaN    53   
89     96110       Netherlands                 9    BOCHTVWTM       NaN    53   
...      ...               ...               ...          ...       ...   ...   
15283  49399           Ireland                 4  Bull Island       NaN    53   
15310  48606            France                 2    Granville       NaN    48   
15311  48634            France                 2    Granville       NaN    48   
15312  48650            France                 2     Dielette       NaN    49   
15313  48610            France                 2        Goury       NaN    49   

       LatM  LatS LatDir  L

How many rows have a NaN 'Sample ID'?

In [None]:
nan_sample_id_rows = ospar_dfs[grp][ospar_dfs[grp]['Sample ID'].isna()]
nan_sample_id_rows

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
88,96109,Netherlands,9,BOCHTVWTM,,53,25,2.0,N,6,...,07/10/2021,137Cs,<,0.200,,Bq/kg f.w.,Rijkswaterstaat Laboratory CIV,,,
89,96110,Netherlands,9,BOCHTVWTM,,53,25,2.0,N,6,...,07/10/2021,226Ra,<,1.800,,Bq/kg f.w.,Rijkswaterstaat Laboratory CIV,,,
90,96111,Netherlands,9,BOCHTVWTM,,53,25,2.0,N,6,...,07/10/2021,210Pb,<,1.000,,Bq/kg f.w.,Rijkswaterstaat Laboratory CIV,,,
91,96112,Netherlands,9,BOCHTVWTM,,53,25,2.0,N,6,...,07/10/2021,137Cs,<,0.200,,Bq/kg f.w.,Rijkswaterstaat Laboratory CIV,,,
92,96113,Netherlands,9,BOCHTVWTM,,53,25,2.0,N,6,...,07/10/2021,226Ra,<,1.800,,Bq/kg f.w.,Rijkswaterstaat Laboratory CIV,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15283,49399,Ireland,4,Bull Island,,53,21,7.0,N,6,...,15/01/1995,137Cs,=,1.380,0.072968,Bq/kg f.w.,Radiological Protection Institute of Ireland,,Representative sample date - mid month,
15310,48606,France,2,Granville,,48,49,58.0,N,1,...,03/01/1995,"239,240Pu",=,0.018,0.001530,Bq/kg f.w.,IRSN : OPRI,,,
15311,48634,France,2,Granville,,48,49,58.0,N,1,...,03/01/1995,137Cs,=,0.210,0.034650,Bq/kg f.w.,IRSN : OPRI,,,
15312,48650,France,2,Dielette,,49,33,6.0,N,1,...,03/01/1995,137Cs,=,0.560,0.035000,Bq/kg f.w.,IRSN : LERFA,,,


There are roughly 175 'Sample ID' values that are duplicated (and not NaN):
- Number of duplicated rows that are not NaN: 5833 - 5483 = 350
- Number of 'Sample ID' values with a duplicate, assuming each occurs exactly twice: 350 / 2 = 175
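The subtraction above is needed because pandas treats NaN values as equal to each other in `duplicated()`, so rows without a 'Sample ID' are flagged as duplicates of one another. A minimal sketch of the same counts computed directly, on toy data with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one real duplicate ('A') and two rows
# with a missing 'Sample ID'.
df = pd.DataFrame({
    'Sample ID': ['A', 'A', np.nan, np.nan, 'B'],
    'Nuclide': ['137Cs', '137Cs', '137Cs', '137Cs', '137Cs'],
})
key = ['Sample ID', 'Nuclide']

# keep=False flags every member of a duplicated group, NaN rows included.
dup_mask = df[key].duplicated(keep=False)

n_dup = int(dup_mask.sum())                                # all flagged rows
n_dup_nan = int((dup_mask & df['Sample ID'].isna()).sum()) # flagged rows with NaN 'Sample ID'
n_dup_known = n_dup - n_dup_nan                            # flagged rows with a real 'Sample ID'

# Number of distinct (Sample ID, Nuclide) keys that are duplicated,
# without assuming that each key occurs exactly twice.
n_keys = (df.loc[dup_mask]
            .dropna(subset=['Sample ID'])[key]
            .drop_duplicates()
            .shape[0])

print(n_dup, n_dup_nan, n_dup_known, n_keys)
```

Counting distinct keys directly avoids the division by two.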


Can we find a unique compound_idx for OSPAR Biota?

Rows where the lat/long and sampling date are the same:

In [None]:
compound_idx =['LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sampling date']
ospar_dfs[grp][compound_idx].duplicated().sum()

6555

Rows where the lat/long, sampling date, biological group, species and body part are the same:

In [None]:
compound_idx =['LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sampling date','Biological group', 'Species',
       'Body Part']
ospar_dfs[grp][compound_idx].duplicated().sum()

5454

A large share of the OSPAR data has the same position, time and biological information. Below, we include the 'Nuclide' and 'Sample ID' in the compound index. 

Rows where the lat/long, sampling date, biological group, species, body part, nuclide and sample ID are the same:

In [None]:
compound_idx = ['LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
                'LongS', 'LongDir', 'Sampling date', 'Biological group', 
                'Species', 'Body Part', 'Nuclide', 'Sample ID']

# Find duplicated rows
duplicated_rows = ospar_dfs[grp][ospar_dfs[grp][compound_idx].duplicated(keep=False)]

# Print the number of duplicated rows
print(f'Number of duplicated rows: {len(duplicated_rows)}')

# Group by the full compound index to find all identical rows
if not duplicated_rows.empty:
    print("\nGrouped identical rows with full compound index:")
    grouped_full = duplicated_rows.groupby(compound_idx)

    for compound_values, group in grouped_full:
        count = len(group)
        print("\nGroup:")
        print(group.to_string(index=False))  # Print the group without the index
        print(f"Count: {count}")

# Remove 'Sample ID' from the compound index
compound_idx_without_sample_id = [col for col in compound_idx if col != 'Sample ID']

# Group by the compound index without 'Sample ID'
if not duplicated_rows.empty:
    print("\nGrouped identical rows without 'Sample ID':")
    grouped_without_sample_id = duplicated_rows.groupby(compound_idx_without_sample_id)

    for compound_values, group in grouped_without_sample_id:
        count = len(group)
        print("\nGroup:")
        print(group.to_string(index=False))  # Print the group without the index
        print(f"Count: {count}")

Number of duplicated rows: 338

Grouped identical rows with full compound index:

Group:
   ID Contracting Party  RSC Sub-division Station ID Sample ID  LatD  LatM  LatS LatDir  LongD  LongM  LongS LongDir Sample type Biological group Species  Body Part Sampling date   Nuclide Value type  Activity or MDA  Uncertainty       Unit                                              Data provider Measurement Comment Sample Comment Reference Comment
83342            France                 2      Goury 201625023    49    42  52.0      N      1     56   46.0       W        BIOT         Molluscs PATELLA SOFT PARTS    07/06/2016 239,240Pu          =         0.004678      0.00016 Bq/kg f.w. Institut de Radioprotection et Sûreté Nucléaire : LRC/LMRE                 NaN            NaN               NaN
83344            France                 2      Goury 201625023    49    42  52.0      N      1     56   46.0       W        BIOT         Molluscs PATELLA SOFT PARTS    07/06/2016 239,240Pu          =      

There are 338 problematic rows. Some are exact duplicates (including the 'Activity or MDA' value), and some have been reported by more than one 'Data provider'. However, some entries have multiple 'Activity or MDA' values reported for the same sample and nuclide.
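The three kinds of problematic groups described above can be told apart programmatically. The following is a minimal sketch on toy data. The values, the shortened key and the `classify` helper are assumptions made for the example, not part of the OSPAR handler.

```python
import pandas as pd

# Toy frame (hypothetical values): three duplicated keys, one of each kind.
df = pd.DataFrame({
    'Sample ID': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Nuclide': ['137Cs'] * 6,
    'Activity or MDA': [0.2, 0.2, 0.2, 0.2, 0.1, 0.3],
    'Data provider': ['Lab 1', 'Lab 1', 'Lab 1', 'Lab 2', 'Lab 1', 'Lab 1'],
})
key = ['Sample ID', 'Nuclide']  # shortened compound index for the example

dups = df[df[key].duplicated(keep=False)]

# Per duplicated key: number of rows, distinct values and distinct providers.
summary = dups.groupby(key).agg(
    n_rows=('Activity or MDA', 'size'),
    n_values=('Activity or MDA', 'nunique'),
    n_providers=('Data provider', 'nunique'),
)

def classify(row):
    """Hypothetical helper: label a duplicated group by what differs in it."""
    if row['n_values'] > 1:
        return 'conflicting values'
    if row['n_providers'] > 1:
        return 'multiple providers'
    return 'exact repeat'  # same value and provider; other columns not checked

summary['kind'] = summary.apply(classify, axis=1)
print(summary)
```

Groups labelled 'exact repeat' could probably be dropped safely, whereas 'conflicting values' groups need an extra column in the compound index or a manual decision.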


In total, only a very small share (2.21%) of the OSPAR biota dataset is duplicated:

In [None]:
338/15314*100

2.2071307300509337

## Review of the compound index (SEAWATER)

Next, we focus on the seawater dataframe. Let's look at the columns of the seawater dataframe:

In [None]:
grp='seawater'

In [None]:
ospar_dfs[grp].columns

Index(['ID', 'Contracting Party', 'RSC Sub-division', 'Station ID',
       'Sample ID', 'LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sample type', 'Sampling depth', 'Sampling date',
       'Nuclide', 'Value type', 'Activity or MDA', 'Uncertainty', 'Unit',
       'Data provider', 'Measurement Comment', 'Sample Comment',
       'Reference Comment'],
      dtype='object')

Let's look at where the combination of 'Sample ID' and 'Nuclide' is duplicated.

In [None]:
ospar_dfs_sampleId_nuclide_df=ospar_dfs[grp][ospar_dfs[grp][['Sample ID','Nuclide']].duplicated(keep=False)]
print(ospar_dfs_sampleId_nuclide_df)

           ID Contracting Party  RSC Sub-division  Station ID Sample ID  LatD  \
411      2435            Norway              11.0         507       NaN  57.0   
412      2436            Norway               9.0         509       NaN  56.0   
413      2437            Norway              11.0       Tjøme       NaN  59.0   
414      2438            Norway              11.0         505       NaN  57.0   
415      2439            Norway              11.0         506       NaN  58.0   
...       ...               ...               ...         ...       ...   ...   
18806  121601    United Kingdom               6.0  Sellafield       NaN  54.0   
18807  121602    United Kingdom               6.0  Sellafield       NaN  54.0   
18848  121643    United Kingdom              10.0  Hartlepool       NaN  54.0   
18849  121644    United Kingdom              10.0    Sizewell       NaN  52.0   
18850  121645    United Kingdom              10.0    Sizewell       NaN  52.0   

       LatM  LatS LatDir  L

In [None]:
nan_sample_id_rows = ospar_dfs[grp][ospar_dfs[grp]['Sample ID'].isna()]
nan_sample_id_rows

Unnamed: 0,ID,Contracting Party,RSC Sub-division,Station ID,Sample ID,LatD,LatM,LatS,LatDir,LongD,...,Sampling date,Nuclide,Value type,Activity or MDA,Uncertainty,Unit,Data provider,Measurement Comment,Sample Comment,Reference Comment
411,2435,Norway,11.0,507,,57.0,50.0,51.0,N,8.0,...,08/07/2010,"239,240Pu",=,0.000007,0.000001,Bq/l,Norwegian Radiation Protection Authority,,,
412,2436,Norway,9.0,509,,56.0,59.0,58.0,N,7.0,...,08/07/2010,"239,240Pu",=,0.000005,0.000001,Bq/l,Norwegian Radiation Protection Authority,,,
413,2437,Norway,11.0,Tjøme,,59.0,3.0,30.0,N,10.0,...,11/11/2010,"239,240Pu",=,0.000004,0.000002,Bq/l,Norwegian Radiation Protection Authority,,,
414,2438,Norway,11.0,505,,57.0,58.0,52.0,N,8.0,...,07/07/2010,99Tc,=,0.000390,0.000025,Bq/l,Institute for Marine Research,,,
415,2439,Norway,11.0,506,,58.0,6.0,34.0,N,8.0,...,07/07/2010,99Tc,=,0.000380,0.000025,Bq/l,Institute for Marine Research,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18806,121601,United Kingdom,6.0,Sellafield,,54.0,29.0,20.0,N,3.0,...,01/07/2021,99Tc,<,0.045000,,Bq/l,EA-Environment Agency,,"St Bees W. Average of 2 samples, representativ...",
18807,121602,United Kingdom,6.0,Sellafield,,54.0,29.0,20.0,N,3.0,...,01/07/2021,137Cs,<,0.138000,,Bq/l,EA-Environment Agency,,"St Bees W. Average of 2 samples, representativ...",
18848,121643,United Kingdom,10.0,Hartlepool,,54.0,38.0,55.0,N,1.0,...,01/07/2021,3H,<,4.250000,,Bq/l,EA-Environment Agency,,"North Gare. Average of 2 samples, representati...",
18849,121644,United Kingdom,10.0,Sizewell,,52.0,9.0,17.0,N,1.0,...,01/07/2021,3H,<,3.750000,,Bq/l,EA-Environment Agency,,"Local beach. Average of 2 samples, representat...",


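There are 7477 rows where the combination of 'Sample ID' and 'Nuclide' is duplicated, including rows where 'Sample ID' is NaN. 
- Number of duplicated rows that are not NaN: 7477 - 7296 = 181
- Number of 'Sample ID' values with a duplicate, assuming each occurs exactly twice: 181 / 2 = 90.5

The non-integer result shows that the pairing assumption does not hold exactly, so 90.5 is only a rough estimate.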
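One possible explanation for a non-integer result such as 90.5 is that some keys occur three or more times. A minimal sketch, on toy data with hypothetical values, of how the distribution of group sizes could be inspected instead of dividing by two:

```python
import pandas as pd

# Toy frame (hypothetical values): 'A' occurs three times, 'B' twice, 'C' once.
df = pd.DataFrame({
    'Sample ID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Nuclide': ['3H', '3H', '3H', '3H', '3H', '3H'],
})
key = ['Sample ID', 'Nuclide']

# groupby drops NaN keys by default, so rows without a 'Sample ID'
# would be left out of this count.
group_sizes = df.groupby(key).size()
dup_group_sizes = group_sizes[group_sizes > 1]

n_dup_rows = int(dup_group_sizes.sum())  # rows belonging to a duplicated key
n_dup_keys = len(dup_group_sizes)        # number of duplicated keys

print(n_dup_rows / 2, n_dup_keys)
print(dup_group_sizes.value_counts().sort_index())
```

Here five duplicated rows belong to two keys, so dividing by two (2.5) does not give the number of keys.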


Can we find a unique compound_idx for OSPAR Seawater?

Rows where the lat/long and sampling date are the same:

In [None]:
compound_idx =['LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
       'LongS', 'LongDir', 'Sampling date']
ospar_dfs[grp][compound_idx].duplicated().sum()

8223

A large share of the OSPAR data has the same position and time. Below, we include the 'Nuclide' and 'Sample ID' in the compound index. 

Rows where the lat/long, sampling date, nuclide and sample ID are the same:

In [None]:
compound_idx = ['LatD', 'LatM', 'LatS', 'LatDir', 'LongD', 'LongM',
                'LongS', 'LongDir', 'Sampling date','Nuclide', 'Sample ID']

# Find duplicated rows
duplicated_rows = ospar_dfs[grp][ospar_dfs[grp][compound_idx].duplicated(keep=False)]

# Print the number of duplicated rows
print(f'Number of duplicated rows: {len(duplicated_rows)}')

# Group by the full compound index to find all identical rows
if not duplicated_rows.empty:
    print("\nGrouped identical rows with full compound index:")
    grouped_full = duplicated_rows.groupby(compound_idx)

    for compound_values, group in grouped_full:
        count = len(group)
        print("\nGroup:")
        print(group.to_string(index=False))  # Print the group without the index
        print(f"Count: {count}")

# Remove 'Sample ID' from the compound index
compound_idx_without_sample_id = [col for col in compound_idx if col != 'Sample ID']

# Group by the compound index without 'Sample ID'
if not duplicated_rows.empty:
    print("\nGrouped identical rows without 'Sample ID':")
    grouped_without_sample_id = duplicated_rows.groupby(compound_idx_without_sample_id)

    for compound_values, group in grouped_without_sample_id:
        count = len(group)
        print("\nGroup:")
        print(group.to_string(index=False))  # Print the group without the index
        print(f"Count: {count}")

Number of duplicated rows: 834

Grouped identical rows with full compound index:

Group:
   ID Contracting Party  RSC Sub-division Station ID Sample ID  LatD  LatM  LatS LatDir  LongD  LongM  LongS LongDir Sample type  Sampling depth Sampling date Nuclide Value type  Activity or MDA  Uncertainty Unit          Data provider Measurement Comment Sample Comment Reference Comment
71628            France               1.0 CONCARNEAU 201122066  47.0  47.0  33.0      N    3.0   50.0   54.0       W       WATER             0.0    30/08/2011      3H          =          0.00117      0.00008 Bq/l IRSN : LRC/LS3E/RSMASS                 NaN            NaN               NaN
71638            France               1.0 CONCARNEAU 201122066  47.0  47.0  33.0      N    3.0   50.0   54.0       W       WATER             0.0    30/08/2011      3H          =          0.15000      0.01110 Bq/l IRSN : LRC/LS3E/RSMASS                 NaN            NaN               NaN
Count: 2

Group:
   ID Contracting Party  RS

There are 834 problematic rows. Some are exact duplicates (including the 'Activity or MDA' value). However, some entries have multiple 'Activity or MDA' values reported for the same sample and nuclide.


In total, only a small share (4.42%) of the OSPAR seawater dataset is duplicated:

In [None]:
834/18856*100

4.422995333050488