# DIC/TA File Preparation

This NoteBook shows examples of a workflow that determines the DIC and TA content of samples analysed on a VINDTA 3C (at UEA), and assesses the quality of measurements based on results for the Certified Reference Materials (CRMs) run on each analysis day. 

For questions, contact Elise S. Droste (e.droste@uea.ac.uk)

# Introduction

In this NoteBook, we prepare the VINDTA summary file for post-analysis processing of dissolved inorganic carbon (DIC) and total alkalinity (TA) samples.

"Preparations" include:
- correcting filenaming mistakes during lab analyses, which may interfere with any automatic processing down the line (highly recommended)
- adding auxiliary data from other datafiles, such as salinity (minimum requirement) and ideally also nutrients and any other co-collected data you might need later on (e.g. in situ temperature, pressure, ... ) (highly recommended)

In this example, I'll correct filenaming mistakes made during lab analyses, insert bottle data from the CTD cast and nutrient data, and make some other corrections.

The final output of this NoteBook will be imported into the `DIC_CRMs_QC.ipynb` NoteBook, where the DIC content is determined. The output of that NoteBook subsequently feeds into the `TA_CRMs_QC.ipynb` NoteBook, where the TA content is determined. The output of that NoteBook in turn feeds into the `DICTA_samples_QC.ipynb` NoteBook, where the DIC and TA content of the samples is quality checked.

***
# Packages

In [3]:
import numpy as np
import pandas as pd
import warnings
import datetime
import matplotlib.pyplot as plt
from dateutil.parser import parse
from IPython.core.interactiveshell import InteractiveShell 

from pathlib import Path

warnings.filterwarnings('ignore') # suppresses all warning messages (note: this also hides deprecation warnings)
InteractiveShell.ast_node_interactivity = "all" # allows you to see multiple outputs from a single cell

***
# Import Data

Import the raw summary data from the VINDTA 3C.

In [5]:
datadir = "../rawdata/" # directory where raw data is stored
datafile = "testrawdata_LucySummary_20240221.csv" # filename 

read_kw = dict(header = 1, encoding = "latin-1") # required for Pandas to be able to read special symbols in the raw datafile 

vindta_data = pd.read_csv(Path(datadir) / datafile, **read_kw, na_values = -999) # read the datafile

vindta_data.columns = vindta_data.columns.str.replace('°','') # remove the degree symbol, this will make importing the file later easier 

***
# Aborted runs and filenaming corrections

If there are any filenaming corrections that need to be done, do them here. This includes removing aborted runs that contain no data. 

If the TA run was aborted but you have a valid DIC run, make sure not to remove the coulometer data! 

Steps: 
- Remove rows with no data due to aborted runs (and reindex)
- Correct filenames in summary csv files
- Also correct the associated station, cast, niskin, depth, replicate, duplicate numbers where relevant
- Check if all associated .dat files can be found
- Correct filenames in .dat filenames

<font color=orange> Make sure that the corresponding .dat file is consistent with the filename i.e. Sample Name in the DataFrame! This will be checked in the next section. </font>
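The first three steps above can be sketched on a toy DataFrame as follows. This is only a minimal illustration under assumptions: `dic_signal` and `ta_signal` are hypothetical stand-ins for whatever coulometer and titration columns your summary file contains, the helper names `drop_empty_runs` and `rename_sample` are made up for this example, and all values are invented. A row is only dropped when *all* measurement columns are empty, so a valid DIC run with an aborted TA run is kept.

```python
import pandas as pd

# Toy stand-in for vindta_data; all values are made up for illustration.
toy = pd.DataFrame({
    "Sample Name": ["0000_003_6_300_1_1", "0000_003_6_300_2_1", "8888_802_1208_0_1_1"],
    "Station": [0, 0, 8888],
    "Cast": [3, 3, 802],
    "dic_signal": [1.0, None, 2.0],  # hypothetical coulometer column
    "ta_signal": [None, None, 3.0],  # hypothetical titration column
})

def drop_empty_runs(df, measurement_cols):
    """Drop rows where *all* measurement columns are empty, then reindex."""
    keep = df[measurement_cols].notna().any(axis=1)
    return df.loc[keep].reset_index(drop=True)

def rename_sample(df, old_name, new_name, fields=("Station", "Cast")):
    """Correct a Sample Name and re-derive the leading numeric fields from it."""
    out = df.copy()
    mask = out["Sample Name"] == old_name
    out.loc[mask, "Sample Name"] = new_name
    # The first parts of the name are assumed to map onto `fields`, in order
    for col, part in zip(fields, new_name.split("_")):
        out.loc[mask, col] = int(part)
    return out

cleaned = drop_empty_runs(toy, ["dic_signal", "ta_signal"])
cleaned = rename_sample(cleaned, "8888_802_1208_0_1_1", "8888_208_1208_0_1_1")
print(cleaned)
```

Keeping each correction as an explicit function call leaves a readable record of what was changed and why, which is easier to audit than editing the csv file by hand.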

***
# .dat filenaming corrections

The .dat files contain the data for the TA determination. Check if all .dat files exist/can be found for each sample and CRM run. 

If they cannot be found, they need to be corrected manually in the specified directory where the .dat files are stored. 

<font color=orange> Do this only when a back up of the original files has been made elsewhere! </font>

In [9]:
import os

In [10]:
def check_datfiles(datdir, df):
    """Print the name of every expected .dat file that is missing from datdir."""
    # Loop through all sample and CRM runs (Station 9999 marks junk runs, which are ignored)
    for filename in df.loc[df["Station"] != 9999, "Sample Name"]:
        datFile = filename + ".dat"

        # Report the .dat file if it does not exist
        if not os.path.isfile(os.path.join(datdir, datFile)):
            print(datFile + " not found in specified directory.")

In [11]:
datdir = "../rawdata/"
check_datfiles(datdir = datdir, df = vindta_data)

8888_802_1208_0_1_1.dat not found in specified directory.
8888_802_1208_0_1_2.dat not found in specified directory.
0000_003_6_300_1_1.dat not found in specified directory.
0000_003_6_300_2_1.dat not found in specified directory.
0000_019_5_122_3_1.dat not found in specified directory.
0000_019_21_6_1_1.dat not found in specified directory.
7777_028_2_333_1_1.dat not found in specified directory.
7777_028_2_333_2_1.dat not found in specified directory.
7777_028_2_333_3_1.dat not found in specified directory.
7777_028_4_290_1_1.dat not found in specified directory.
7777_028_4_290_2_1.dat not found in specified directory.
7777_028_4_290_3_1.dat not found in specified directory.
7777_028_5_139_1_1.dat not found in specified directory.
7777_028_6_159_1_1.dat not found in specified directory.
7777_028_8_220_1_1.dat not found in specified directory.
7777_028_8_220_2_1.dat not found in specified directory.
7777_028_8_220_3_1.dat not found in specified directory.
7777_028_9_80_1_1.dat not foun

If any .dat files were not found, have they been corrected manually?


<font color=green> YES /  </font> <font color=red> NO </font> 


<font color = orange> Note: the output above shows that many .dat files were not found. This is because this test case uses datafiles that have already been corrected for filenaming mistakes. If you have corrected all filenaming mistakes, then all .dat files should be found, except perhaps those for some aborted runs. You can ignore those. </font>
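If you prefer to script the renaming of a .dat file rather than doing it by hand, a minimal sketch could look like the one below. The helper `rename_datfile` and the `backup` directory are hypothetical, not part of the original workflow. It copies the original file to a backup location before renaming, and refuses to overwrite an existing file. The demonstration runs in a temporary directory with a dummy file, so nothing in `../rawdata/` is touched.

```python
import shutil
import tempfile
from pathlib import Path

def rename_datfile(datdir, old_stem, new_stem, backupdir):
    """Copy the original .dat file to backupdir, then rename it in datdir."""
    datdir, backupdir = Path(datdir), Path(backupdir)
    src = datdir / f"{old_stem}.dat"
    dst = datdir / f"{new_stem}.dat"
    if not src.is_file():
        raise FileNotFoundError(f"{src} does not exist")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; refusing to overwrite")
    backupdir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, backupdir / src.name)  # keep an untouched copy
    src.rename(dst)
    return dst

# Demonstration in a throwaway directory with a dummy file
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "8888_802_1208_0_1_1.dat").write_text("dummy titration data")
    new_path = rename_datfile(tmp, "8888_802_1208_0_1_1",
                              "8888_208_1208_0_1_1", tmp / "backup")
    renamed_ok = new_path.is_file()
    backup_ok = (tmp / "backup" / "8888_802_1208_0_1_1.dat").is_file()

print(renamed_ok, backup_ok)
```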

***
# Pipette Volumes and Calibrations

Pipette volumes should have been calibrated before and after analysis. The summary file will (should) have the pipette volumes as calibrated just before analyses. Decide here whether volumes have changed over time, or justify using different volumes. 

When were pipette calibrations done? 


In [12]:
vindta_data["DIC pipette volume (ml)"].unique()
vindta_data["TA pipette vol (ml)"].unique()

array([21.8442])

array([96.7556])
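If the calibrations show that a pipette volume changed during the analysis period, one possible way to apply the new volume from a given date onwards is sketched below. The `analysis_date` column, the recalibration date, and the post-calibration volume are all invented for illustration; substitute the date column from your own summary file and your own calibration results.

```python
import pandas as pd

# Toy data; "analysis_date" is a hypothetical column name.
toy = pd.DataFrame({
    "analysis_date": pd.to_datetime(["2024-02-01", "2024-02-10", "2024-02-20"]),
    "DIC pipette volume (ml)": [21.8442, 21.8442, 21.8442],
})

# Hypothetical recalibrations: (first date the volume applies, volume in ml).
# The value 21.8500 is made up and is not a real calibration result.
recalibrations = [(pd.Timestamp("2024-02-15"), 21.8500)]

for start_date, volume in recalibrations:
    on_or_after = toy["analysis_date"] >= start_date
    toy.loc[on_or_after, "DIC pipette volume (ml)"] = volume

print(toy)
```

Listing the recalibrations in date order means later entries overwrite earlier ones, so each run ends up with the most recent calibration that precedes it.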

***
# Insert auxiliary data

The VINDTA software uses the following default values: Salinity = 35 and CTD Temp (C) = 0. The nutrient columns ('Phosphate (uMol/Kg)', 'Silicate (uMol/Kg)', 'Nitrate (uMol/Kg)') are given a concentration of 0 umol/kg (nitrate may instead be given -999/NaN; check this in your own file).

If these data were unknown at the time of analysis and therefore have not been inserted into the VINDTA software during the run in the lab, you can add them here. 

Here, replace the default values in these columns with measured values from auxiliary datafiles (e.g. CTD bottle files).

<font color=orange> Check for any missing data at the end </font>
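A minimal sketch of inserting auxiliary data and then checking for missing values is given below. It assumes that samples can be matched to the bottle file on station, cast, and Niskin number; the column name `Niskin` and all values are assumptions for illustration. Samples without a match are set to NaN rather than left at the software defaults, so that they show up in the final check instead of silently keeping a salinity of 35.

```python
import pandas as pd

keys = ["Station", "Cast", "Niskin"]  # "Niskin" is an assumed column name
aux_cols = ["Salinity", "CTD Temp (C)"]

# Toy VINDTA summary with the software defaults still in place
samples = pd.DataFrame({
    "Station": [3, 3, 19],
    "Cast": [1, 1, 2],
    "Niskin": [6, 7, 5],
    "Salinity": [35.0, 35.0, 35.0],
    "CTD Temp (C)": [0.0, 0.0, 0.0],
})

# Toy CTD bottle file (made-up values); the Station 19 bottle is missing
bottle = pd.DataFrame({
    "Station": [3, 3],
    "Cast": [1, 1],
    "Niskin": [6, 7],
    "Salinity": [34.1, 34.3],
    "CTD Temp (C)": [1.5, 0.8],
})

# Left merge keeps every sample; validate guards against duplicate bottle rows
merged = samples.merge(bottle, on=keys, how="left",
                       suffixes=("", "_aux"), validate="many_to_one")
for col in aux_cols:
    merged[col] = merged[col + "_aux"]  # NaN where no bottle data was found
    merged = merged.drop(columns=col + "_aux")

# Final check: which samples still lack auxiliary data?
missing = merged[merged[aux_cols].isna().any(axis=1)]
print(missing)
```

In a real file, CRM and junk runs will also have no bottle match; exclude them from the merge or from the final check as appropriate.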

***
# Certified data

According to the convention, CRMs are identified with "8888" as the Station value. Their Sample Name is constructed as follows:

> 8888_batch_CRMbottlenumber_repeat_replicate

The class that is used below will help link up the CRMs with the relevant certified data (nutrients, salinity). 

It will also be used to calculate the DIC concentrations. 

Remember, here we're only looking at the CRMs. Quality checking will be done throughout the lab analysis period, but will only be finalised at the end. 


Check which CRM batch was used. Certified information on all batches can be found here: 

https://www.ncei.noaa.gov/access/ocean-carbon-data-system/oceans/Dickson_CRM/batches.html

Record this information in a lookup file ('crm_batch_lookup.csv'), which will be used in the 'CalcDIC' Class (`DIC_CRMs_QC.ipynb`) to insert the certified information into the dataframe for further data processing. 


In [13]:
vindta_data[vindta_data["Station"] == 8888]["Cast"].unique()

array([802, 208, 204])

<font color=orange> Note: the output above shows that there are multiple batch numbers for CRMs in my raw datafile. These are in fact typos made in the software in the lab when setting up a run; all CRMs came from batch 208. Normally, this should have been corrected earlier in this notebook, but it nicely illustrates why it is important to correct filenaming mistakes: it avoids confusion and accidentally using the wrong certified information. </font>
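To catch such batch typos systematically, you could list the CRM runs whose batch number is not among the batches you actually used. The sketch below does this on a toy DataFrame; the sample names are invented, and `expected_batches` is a hypothetical variable that you would fill in from your lab notes.

```python
import pandas as pd

# Toy stand-in for vindta_data; sample names are made up for illustration.
toy = pd.DataFrame({
    "Sample Name": ["8888_802_1208_0_1_1", "8888_208_1209_0_1_1",
                    "8888_204_1210_0_1_1", "0000_003_6_300_1_1"],
    "Station": [8888, 8888, 8888, 0],
    "Cast": [802, 208, 204, 3],  # for CRMs, "Cast" holds the batch number
})

expected_batches = {208}  # batches actually used in the lab (assumption)

is_crm = toy["Station"] == 8888
suspect = toy[is_crm & ~toy["Cast"].isin(expected_batches)]
print(suspect[["Sample Name", "Cast"]])
```

Any rows listed in `suspect` should be checked against the lab notes and corrected in the filenaming section of this notebook.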

***
# Insert additional values (if applicable)

This will also be a good place to insert any additional data required in further processing, such as any temperature measurements that were done in the lab of the sample, in addition to the temperature measurements made by the VINDTA system itself. 

The QC done in the next notebooks (i.e. `SD035_DIC_CRMs_QC.ipynb`, `SD035_TA_CRMs_QC.ipynb`, `SD035_DICTA_samples_QC.ipynb`) can expose other issues or necessary corrections in the datafile that need to be fixed before the final DIC and TA contents can be accurately determined. Preparing the file can therefore be an iterative process. To streamline the process and make the workflow and QC assessments more transparent, it is often clearer to make any necessary formatting corrections in _this_ notebook/dataframe, instead of in any notebooks/dataframes further down the processing line. That way, all subsequent datafiles are consistent and all changes made to the data are collected in one place.

***
# Save the Files

Save the file, which will be imported into `DIC_CRMs_QC.ipynb` for DIC determination. 

In [72]:
prepped_datadir = Path("../output_data/fileprepped_rawdata/")
prepped_datafile = "summary_prepped_" + datetime.date.today().strftime("%Y%m%d") + ".csv"

# vindta_data.to_csv(prepped_datadir / prepped_datafile, index = False) # uncomment
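`to_csv` fails if the output directory does not exist yet. A small helper along the following lines creates the directory first and returns the path that was written; `save_prepped` is a hypothetical name, and the demonstration writes a toy DataFrame to a temporary directory rather than to `../output_data/`.

```python
import datetime
import tempfile
from pathlib import Path

import pandas as pd

def save_prepped(df, outdir, prefix="summary_prepped_"):
    """Write df to <outdir>/<prefix><YYYYMMDD>.csv, creating outdir if needed."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    outfile = outdir / f"{prefix}{datetime.date.today():%Y%m%d}.csv"
    df.to_csv(outfile, index=False)
    return outfile

# Demonstration with a toy DataFrame (made-up values) in a throwaway directory
toy = pd.DataFrame({"Sample Name": ["0000_003_6_300_1_1"], "Salinity": [34.1]})
with tempfile.TemporaryDirectory() as tmp:
    outfile = save_prepped(toy, Path(tmp) / "fileprepped_rawdata")
    reread = pd.read_csv(outfile)  # read back to confirm the file is usable

print(outfile.name)
```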


:)