The code used to build the data for this study is in `src/datas/build_data.py`. This notebook inspects the data to make sure they are correct and identical to the data of the baseline ([Lee et al., 2022](https://doi.org/10.1038/s41598-022-25377-x)).

In [1]:
import pandas as pd
import numpy as np
import glob
from src.utils import nwp_cali

import os
cwd = os.getcwd()

# XRF spectra

The spectra are extracted from `data/legacy/spe_dataset_20220629.csv`. We selected three cores (PS75-056-1, LV28-44-3-n, SO264-69-2) as the test set because they form the case study in the baseline. The baseline's methods describe these three cores as the case study, but core PS75-056-1 was accidentally excluded from the actual calculation of R² and the data distribution, whereas the confidence intervals of the baseline models were calculated from all three cores. This inconsistency needs to be corrected, so this study keeps all three cores as the test set for every calculation. The remaining cores form the training set, which is later randomly split into the (actual) training and validation sets at a 4:1 ratio.
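
The split described above can be sketched as follows. This is only an illustration on a toy table, not the code in `src/datas/build_data.py`: the function name `split_by_core`, the `seed` argument, and the row-level shuffling are assumptions made for the example.

```python
import numpy as np
import pandas as pd

TEST_CORES = ["PS75-056-1", "LV28-44-3-n", "SO264-69-2"]


def split_by_core(df, test_cores, val_fraction=0.2, seed=0):
    """Hold out whole cores as the test set, then split the remaining rows 4:1."""
    is_test = df["core"].isin(test_cores)
    test = df[is_test]
    rest = df[~is_test]
    # Shuffle the remaining rows and carve off the validation share.
    rng = np.random.default_rng(seed)
    shuffled = rest.iloc[rng.permutation(len(rest))]
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled.iloc[n_val:], shuffled.iloc[:n_val], test


# Toy table: 3 rows from a test core and 10 rows from two other cores.
toy = pd.DataFrame({
    "core": ["PS75-056-1"] * 3 + ["SO264-14-1"] * 6 + ["SO178-12-3"] * 4,
    "composite_depth_mm": range(13),
})
train, val, test = split_by_core(toy, TEST_CORES)
print(len(train), len(val), len(test))  # 8 2 3
```

Holding out whole cores keeps the test set independent of the training data, while the 4:1 split inside the remaining rows is random.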

## Test set

In [12]:
test_df = pd.read_csv(f"{cwd}/data/pretrain/test/info.csv")
spe_dir = glob.glob(f"{cwd}/data/pretrain/test/spe/*.csv")
test_df

Unnamed: 0,dirname,composite_id,cps,core,composite_depth_mm,section_depth_mm,filename,section
0,0.csv,PS75-056-1_00005,126814,PS75-056-1,5,5,PS75-056-1_0000 5.0mm 10s 10kV 150uA No-F...,0
1,1.csv,PS75-056-1_00010,170569,PS75-056-1,10,10,PS75-056-1_0000 10.0mm 10s 10kV 150uA No-F...,0
2,2.csv,PS75-056-1_00015,175200,PS75-056-1,15,15,PS75-056-1_0000 15.0mm 10s 10kV 150uA No-F...,0
3,3.csv,PS75-056-1_00020,175607,PS75-056-1,20,20,PS75-056-1_0000 20.0mm 10s 10kV 150uA No-F...,0
4,4.csv,PS75-056-1_00025,176768,PS75-056-1,25,25,PS75-056-1_0000 25.0mm 10s 10kV 150uA No-F...,0
...,...,...,...,...,...,...,...,...
4612,4612.csv,SO264-69-2_18460,109462,SO264-69-2,18460,910,SO264-69-1_017550 910.0mm 10s 10kV 150uA No...,18
4613,4613.csv,SO264-69-2_18470,108611,SO264-69-2,18470,920,SO264-69-1_017550 920.0mm 10s 10kV 150uA No...,18
4614,4614.csv,SO264-69-2_18480,111142,SO264-69-2,18480,930,SO264-69-1_017550 930.0mm 10s 10kV 150uA No...,18
4615,4615.csv,SO264-69-2_18490,109224,SO264-69-2,18490,940,SO264-69-1_017550 940.0mm 10s 10kV 150uA No...,18


In [13]:
print(f"data amount: {len(test_df)}")
print(f"spe amount: {len(spe_dir)}")
print(f"cores: {test_df.core.unique()}")

data amount: 4617
spe amount: 4617
cores: ['PS75-056-1' 'LV28-44-3-n' 'SO264-69-2']


## Training and validation sets
The annotation files under the `train` folder are `info.csv` and `val.csv` for the training and validation sets, respectively. All spectra are stored in the `spe` folder, and the model reads them according to the annotation files.
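
Since the annotation files point at spectra through the `dirname` column, one way to check the link is to compare those names with the files actually present in `spe`. The sketch below does this on a temporary toy folder; the helper `missing_spectra` is hypothetical and not part of the repository.

```python
import tempfile
from pathlib import Path

import pandas as pd


def missing_spectra(annotation_dfs, spe_folder):
    """Return annotated file names that have no matching file in spe_folder."""
    on_disk = {p.name for p in Path(spe_folder).glob("*.csv")}
    annotated = set()
    for df in annotation_dfs:
        annotated.update(df["dirname"])
    return sorted(annotated - on_disk)


# Toy layout: two spectra on disk, three names in the annotation tables.
with tempfile.TemporaryDirectory() as tmp:
    spe = Path(tmp) / "spe"
    spe.mkdir()
    for name in ["0.csv", "1.csv"]:
        (spe / name).write_text("0\n")
    info = pd.DataFrame({"dirname": ["0.csv", "1.csv"]})
    val = pd.DataFrame({"dirname": ["2.csv"]})
    missing = missing_spectra([info, val], spe)

print(missing)  # ['2.csv']
```

An empty list would mean that every annotated spectrum exists on disk, which is a stricter check than comparing the two counts.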

In [2]:
train_df = pd.read_csv(f"{cwd}/data/pretrain/train/info.csv")
validation_df = pd.read_csv(f"{cwd}/data/pretrain/train/val.csv")
spe_dir = glob.glob(f"{cwd}/data/pretrain/train/spe/*.csv")
train_df

Unnamed: 0,dirname,composite_id,cps,core,composite_depth_mm,section_depth_mm,filename,section
0,28340.csv,SO264-60-12_07230,79071,SO264-60-12,7230,640,SO264-60-12_06590 640.0mm 10s 10kV 150uA No...,7
1,3651.csv,SO264-14-1_08580,155074,SO264-14-1,8580,170,SO264-14-1_0841 170.0mm 10s 10kV 150uA No-F...,9
2,36667.csv,SO178-12-3_04250,79329,SO178-12-3,4250,250,SO178-12-3_0400 250.0mm 10s 10kV 150uA No-F...,4
3,10834.csv,SO264-24-3_04100,102379,SO264-24-3,4100,300,SO264-24-3_0380 300.0mm 10s 10kV 150uA No-F...,4
4,10867.csv,SO264-24-3_04430,101157,SO264-24-3,4430,630,SO264-24-3_0380 630.0mm 10s 10kV 150uA No-F...,4
...,...,...,...,...,...,...,...,...
44163,12341.csv,SO264-26-2_04740,149877,SO264-26-2,4740,320,SO264-26-2_04420 320.0mm 10s 10kV 150uA No-...,5
44164,27439.csv,SO264-56-2_10790,88201,SO264-56-2,10790,130,SO264-56-2_10660 130.0mm 10s 10kV 150uA No-...,11
44165,53782.csv,PS97-093-2_11070,55959,PS97-093-2,11070,630,PS97-93-2_1044 630.0mm 10s 10kV 150uA No-Fi...,11
44166,53453.csv,PS97-093-2_09425,59859,PS97-093-2,9425,985,PS97-93-2_0844 985.0mm 10s 10kV 150uA No-Fi...,9


In [3]:
# note: np.hstack does not deduplicate, so a core present in both sets is listed twice
cores = np.hstack([train_df.core.unique(), validation_df.core.unique()])

print("train and validation sets")
print(f"data amount: {len(train_df)+len(validation_df)}")
print(f"spe amount: {len(spe_dir)}")

print(f"core amount: {len(cores)}")
print(f"cores: {cores}")

train and validation sets
data amount: 55211
spe amount: 55211
core amount: 116
cores: ['SO264-60-12' 'SO264-14-1' 'SO178-12-3' 'SO264-24-3' 'PS97-093-2'
 'SO264-41-2' 'SO264-44-3' 'PS75-095-5' 'PS97-092-1' 'SO264-32-2'
 'LV29-114-3' 'PS75-093-1' 'PS97-084-1' 'SO264-64-1' 'SO264-22-2'
 'SO264-16-2' 'SO264-26-2' 'SO264-46-5' 'SO264-19-2' 'SO264-15-2'
 'SO264-53-2' 'SO264-54-2' 'SO264-45-2' 'PS97-089-1' 'PS97-083-2'
 'PS75-054-1' 'SO264-70-1' 'PS97-085-3' 'SO264-09-2' 'SO264-28-2'
 'SO264-66-2' 'PS75-083-1' 'LV28-44-3' 'SO264-56-2' 'PS97-079-2'
 'SO264-51-2' 'PS97-052-4' 'SO264-52-2' 'SO264-47-2' 'SO264-13-2'
 'PS75-093-1_TC' 'SO264-55-1' 'SO264-76-1' 'PS97-128-2' 'PS97-053-2'
 'SO264-62-2' 'SO264-49-2' 'PS97-078-1' 'SO264-44-2' 'PS97-027-2'
 'SO202-37-2_re' 'PS97-046-4' 'SO264-34-2' 'PS97-085-3_TC' 'PS97-080-1'
 'PS97-089-1_TC' 'PS97-084-1_TC' 'PS75-083-1_TC' 'SO264-32-2' 'SO178-12-3'
 'PS75-054-1' 'PS97-085-3_TC' 'PS97-084-1' 'SO264-51-2' 'SO264-49-2'
 'SO264-56-2' 'SO264-64-1' 'PS97-0
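
Some cores, for example `SO264-32-2`, appear twice in the printed list because they occur in both the training and the validation annotations. A minimal sketch of counting distinct cores instead, using toy frames rather than the real annotation files:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df and validation_df.
toy_train = pd.DataFrame({"core": ["SO264-32-2", "SO178-12-3", "PS75-054-1"]})
toy_val = pd.DataFrame({"core": ["SO264-32-2", "PS75-054-1"]})

# Plain concatenation keeps duplicates.
stacked = np.hstack([toy_train.core.unique(), toy_val.core.unique()])
# np.unique removes them; np.intersect1d lists the cores shared by both sets.
distinct = np.unique(stacked)
shared = np.intersect1d(toy_train.core.unique(), toy_val.core.unique())

print(len(stacked), len(distinct), len(shared))  # 5 3 2
```

Applied to the real frames, `len(distinct)` would give the number of distinct cores in the combined training and validation data.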

# Downstream tasks
The tasks are to predict CaCO3 and TOC content from XRF spectra. The test set uses the same test cores as the XRF spectra, which form the case study in the baseline. The paired XRF spectra and targets are extracted from `data/legacy/spe+bulk_dataset_20220629.csv`. The training and validation sets use only the cores listed in the "CHOSEN" sheet of `data/legacy/ML station list.xlsx`, not all of the remaining cores. These cores are the same as those in the training and test sets of the baseline. The data from these cores are then randomly split into training and validation sets at a 4:1 ratio.
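
Because the test cores must stay out of the training and validation data, a small leakage check can be useful. The following is a sketch on toy annotation tables; `leaked_cores` is a hypothetical helper, not a function from `src` or the baseline code.

```python
import pandas as pd

TEST_CORES = {"PS75-056-1", "LV28-44-3-n", "SO264-69-2"}


def leaked_cores(train_info, val_info, test_cores=TEST_CORES):
    """Return the test cores that also occur in the training or validation annotations."""
    used = set(train_info["core"]) | set(val_info["core"])
    return sorted(used & set(test_cores))


# Toy annotation tables with the same `core` column as info.csv and val.csv.
toy_train = pd.DataFrame({"core": ["SO264-64-1", "PS75-054-1"]})
toy_val = pd.DataFrame({"core": ["PS75-095-5"]})
print(leaked_cores(toy_train, toy_val))  # []
```

The comparison is an exact string match, so names such as `LV28-44-3` and `LV28-44-3-n` are treated as different cores.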

In [6]:
def read_files(target, subset, test: bool):
    """Load the annotation file and list the target and spectrum files of one subset."""
    info_df = pd.read_csv(f"{cwd}/data/finetune/{target}/{subset}/info.csv")
    target_dir = glob.glob(f"{cwd}/data/finetune/{target}/{subset}/target/*.csv")
    spe_dir = glob.glob(f"{cwd}/data/finetune/{target}/{subset}/spe/*.csv")

    print(info_df)

    if test:
        return info_df, target_dir, spe_dir
    # the train folder also holds the validation annotations (val.csv)
    validation_df = pd.read_csv(f"{cwd}/data/finetune/{target}/{subset}/val.csv")
    return info_df, validation_df, target_dir, spe_dir

## CaCO3
### Test set

In [7]:
target = "CaCO3%"
subset = "test"
info_df, target_dir, spe_dir = read_files(target, subset, test=True)

     dirname         core  mid_depth_mm
0      0.csv   PS75-056-1          15.0
1      1.csv   PS75-056-1          55.0
2      2.csv   PS75-056-1         155.0
3      3.csv   PS75-056-1         255.0
4      4.csv   PS75-056-1         355.0
..       ...          ...           ...
389  389.csv  LV28-44-3-n       10875.0
390  390.csv  LV28-44-3-n       10925.0
391  391.csv  LV28-44-3-n       10975.0
392  392.csv  LV28-44-3-n       11025.0
393  393.csv  LV28-44-3-n       11075.0

[394 rows x 3 columns]


In [8]:
# use the baseline code to subset the data
test_cores = ["PS75-056-1", "LV28-44-3-n", "SO264-69-2"]
prepare = nwp_cali.PrepareData(
    measurement=target, 
    data_dir=f"{cwd}/data/legacy/spe+bulk_dataset_20220629.csv", 
    select_dir=f"{cwd}/data/legacy/ML station list.xlsx")

# note: it's select_casestudy()
data_df = prepare.select_casestudy(case_cores=test_cores)
X, y = prepare.produce_Xy(data_df)

In [9]:
print(f"data amount in annotation file: {len(info_df)}")
print(f"actual data amount: {len(target_dir)}")
print(f"actual spe amount: {len(spe_dir)}")
print(f"cores: {info_df.core.unique()}")

print(f"data amount using baseline codes: {len(y)}")

data amount in annotation file: 394
actual data amount: 394
actual spe amount: 394
cores: ['PS75-056-1' 'SO264-69-2' 'LV28-44-3-n']
data amount using baseline codes: 394


The data amounts in the annotation file, the actual target folder, the actual spe folder, and the extraction by the baseline code are all the same.
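
The same comparison can be turned into an assertion so that a mismatch stops the notebook instead of relying on a visual check. A minimal sketch, where `check_counts` is a hypothetical helper and the numbers are the CaCO3 test-set counts printed above:

```python
def check_counts(**counts):
    """Raise AssertionError unless all named counts are equal; return the common value."""
    distinct = set(counts.values())
    assert len(distinct) == 1, f"count mismatch: {counts}"
    return distinct.pop()


n = check_counts(annotation=394, target_files=394, spe_files=394, baseline=394)
print(n)  # 394
```

In the notebook, the keyword values would be `len(info_df)`, `len(target_dir)`, `len(spe_dir)`, and `len(y)`.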

### Training and validation sets

In [10]:
target = "CaCO3%"
subset = "train"
info_df, validation_df, target_dir, spe_dir = read_files(target, subset, test=False)

       dirname        core  mid_depth_mm
0       34.csv  SO264-64-1       15015.0
1     1427.csv  PS75-054-1       15505.0
2     1857.csv  PS75-095-5       17605.0
3       39.csv  SO264-66-2         995.0
4      794.csv  PS97-085-3        1505.0
...        ...         ...           ...
1483   631.csv  PS97-083-2        1005.0
1484  1706.csv  PS75-095-5        2205.0
1485  1208.csv  PS97-093-2       11605.0
1486  1163.csv  PS97-093-2        7005.0
1487   196.csv  SO264-55-1        5045.0

[1488 rows x 3 columns]


In [11]:
# use the baseline code to subset the data
prepare = nwp_cali.PrepareData(
    measurement=target, 
    data_dir=f"{cwd}/data/legacy/spe+bulk_dataset_20220629.csv", 
    select_dir=f"{cwd}/data/legacy/ML station list.xlsx")

# note: it's select_data()
data_df = prepare.select_data()
X, y = prepare.produce_Xy(data_df)

In [12]:
print(f"data amount in annotation file: {len(info_df)+len(validation_df)}")
print(f"actual data amount: {len(target_dir)}")
print(f"actual spe amount: {len(spe_dir)}")
print(f"cores: {info_df.core.unique()}")

print(f"data amount using baseline codes: {len(y)}")

data amount in annotation file: 1860
actual data amount: 1860
actual spe amount: 1860
cores: ['SO264-64-1' 'PS75-054-1' 'PS75-095-5' 'SO264-66-2' 'PS97-085-3'
 'PS97-084-1' 'SO264-28-2' 'PS97-089-1' 'PS97-093-2' 'SO264-13-2'
 'PS75-083-1' 'PS97-052-4' 'SO264-55-1' 'PS75-093-1' 'PS97-092-1'
 'PS97-079-2' 'PS97-078-1' 'PS97-128-2' 'SO264-56-2' 'PS97-083-2'
 'LV29-114-3' 'PS97-080-1' 'SO264-15-2' 'PS97-053-2' 'PS97-027-2'
 'PS97-046-4']
data amount using baseline codes: 1860


The data amounts in the annotation file, the actual target folder, the actual spe folder, and the extraction by the baseline code are all the same.

## TOC
### Test set

In [13]:
target = "TOC%"
subset = "test"
info_df, target_dir, spe_dir = read_files(target, subset, test=True)

     dirname         core  mid_depth_mm
0      0.csv   PS75-056-1          15.0
1      1.csv   PS75-056-1          55.0
2      2.csv   PS75-056-1         155.0
3      3.csv   PS75-056-1         255.0
4      4.csv   PS75-056-1         355.0
..       ...          ...           ...
391  391.csv  LV28-44-3-n       10875.0
392  392.csv  LV28-44-3-n       10925.0
393  393.csv  LV28-44-3-n       10975.0
394  394.csv  LV28-44-3-n       11025.0
395  395.csv  LV28-44-3-n       11075.0

[396 rows x 3 columns]


In [14]:
# use the baseline code to subset the data
test_cores = ["PS75-056-1", "LV28-44-3-n", "SO264-69-2"]
prepare = nwp_cali.PrepareData(
    measurement=target, 
    data_dir=f"{cwd}/data/legacy/spe+bulk_dataset_20220629.csv", 
    select_dir=f"{cwd}/data/legacy/ML station list.xlsx")

# note: it's select_casestudy()
data_df = prepare.select_casestudy(case_cores=test_cores)
X, y = prepare.produce_Xy(data_df)

In [15]:
print(f"data amount in annotation file: {len(info_df)}")
print(f"actual data amount: {len(target_dir)}")
print(f"actual spe amount: {len(spe_dir)}")
print(f"cores: {info_df.core.unique()}")

print(f"data amount using baseline codes: {len(y)}")

data amount in annotation file: 396
actual data amount: 396
actual spe amount: 396
cores: ['PS75-056-1' 'SO264-69-2' 'LV28-44-3-n']
data amount using baseline codes: 396


The data amounts in the annotation file, the actual target folder, the actual spe folder, and the extraction by the baseline code are all the same.

### Training and validation sets

In [16]:
target = "TOC%"
subset = "train"
info_df, validation_df, target_dir, spe_dir = read_files(target, subset, test=False)

       dirname        core  mid_depth_mm
0      942.csv  PS97-085-3       13105.0
1     1581.csv  PS75-083-1        4805.0
2     1391.csv  PS75-054-1        8195.0
3      510.csv  PS97-052-4         305.0
4      714.csv  PS97-084-1        1005.0
...        ...         ...           ...
1568  1934.csv  PS75-095-5       14505.0
1569   914.csv  PS97-085-3       10305.0
1570  1804.csv  PS75-095-5        1305.0
1571   557.csv  PS97-078-1         305.0
1572  1935.csv  PS75-095-5       14605.0

[1573 rows x 3 columns]


In [17]:
# use the baseline code to subset the data
prepare = nwp_cali.PrepareData(
    measurement=target, 
    data_dir=f"{cwd}/data/legacy/spe+bulk_dataset_20220629.csv", 
    select_dir=f"{cwd}/data/legacy/ML station list.xlsx")

# note: it's select_data()
data_df = prepare.select_data()
X, y = prepare.produce_Xy(data_df)

In [18]:
print(f"data amount in annotation file: {len(info_df)+len(validation_df)}")
print(f"actual data amount: {len(target_dir)}")
print(f"actual spe amount: {len(spe_dir)}")
print(f"cores: {info_df.core.unique()}")

print(f"data amount using baseline codes: {len(y)}")

data amount in annotation file: 1967
actual data amount: 1967
actual spe amount: 1967
cores: ['PS97-085-3' 'PS75-083-1' 'PS75-054-1' 'PS97-052-4' 'PS97-084-1'
 'SO264-28-2' 'SO264-15-2' 'PS75-095-5' 'SO264-55-1' 'PS97-128-2'
 'PS97-078-1' 'PS75-093-1' 'PS97-083-2' 'SO264-13-2' 'PS97-092-1'
 'PS97-093-2' 'PS97-079-2' 'SO264-64-1' 'PS97-053-2' 'SO264-66-2'
 'LV29-114-3' 'PS97-080-1' 'PS97-089-1' 'SO264-56-2' 'SO178-12-3'
 'PS97-027-2' 'PS97-046-4']
data amount using baseline codes: 1967


The data amounts in the annotation file, the actual target folder, the actual spe folder, and the extraction by the baseline code are all the same.