# Water Sampling Processing Script
### Author: Andrew Reed, CGSN - WHOI

### Motivation
This script automates production of the bottle files in a consistent, easily parseable format that follows SeaBird's naming and processing scheme. I took this approach after attempting to parse the bottle (.btl) files output by SeaBird's SeaSoft V2 processing software. Unfortunately, SeaBird writes those files in a tab-delimited text format with inconsistent spacing and offsets between columns. This prevents a simple alignment of column names to column values and makes the column names hard to parse without knowing a priori where the problems will arise.

### Approach
I chose to use the rosette (.ros) files produced by SeaBird's SeaSoft V2 software as part of the initial conversion of their proprietary .cnv formatted data produced by their CTDs and rosettes. The .ros files name each column explicitly in the file header, and the columns of parameter values are consistently spaced. This allows a simple mapping of column name to column values based on position.
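As a minimal sketch of that mapping, the snippet below pulls the column position and short name out of a few header lines. The sample header text is illustrative rather than copied from a real cast, and `sample_header` and `column_positions` are hypothetical names used only here.

```python
# Illustrative header lines in the style of a .ros file (not from a real cast)
sample_header = """\
# nquan = 3
# name 0 = prDM: Pressure, Digiquartz [db]
# name 1 = t090C: Temperature [ITS-90, deg C]
# name 2 = nbf: Bottles Fired
# interval = seconds: 0.0416667
"""

column_positions = {}
for line in sample_header.splitlines():
    parts = line.split()
    # Column definitions look like: "# name <position> = <short name>: <full name>"
    if len(parts) > 4 and parts[1] == 'name':
        position = int(parts[2])
        short_name = parts[4].rstrip(':')
        column_positions[short_name] = position

print(column_positions)
```

Checking that the second token is exactly `name` is a slightly stricter test than a substring match, which could also fire on other header lines that happen to contain the word.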

I also use a secondary file that maps the parameter "short names" to the parameter "full names + units." This information comes from SeaBird's SeaSoft V2 manual. The processing method, taking the mean of all scans per bottle firing, likewise follows the procedure outlined in the SeaSoft manual.
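The per-bottle averaging can be sketched with a pandas `groupby`. The values below are made up for illustration; only the column short names (`nbf`, `prDM`, `t090C`) follow the SeaBird convention used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical scans: two bottle firings with three scans each
scans = pd.DataFrame({
    'nbf':   [1, 1, 1, 2, 2, 2],
    'prDM':  [3000.0, 3001.0, 3002.0, 2500.0, 2501.0, 2502.0],
    't090C': [1.30, 1.35, 1.40, 3.00, 3.05, 3.10],
})

# One row per bottle, each value being the mean over that bottle's scans
bottle_means = scans.groupby('nbf').mean()
print(bottle_means)
```

After grouping, the bottle number becomes the index of the result, which is why the index is renamed separately from the columns later on.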

### Usage
To use this software for your own processing, the following packages need to be installed:
* Pandas
* Numpy

You will also need to change the filepaths to the appropriate directory locations on your local machine. Once that is done, simply run the cells in order. The software writes the results to the directory where the rosette files are stored.

**Note**: The ".btl" files produced here are meant for ease-of-use in producing our water sample summary sheets. They will not work for continued processing to get derived variables using SeaBird's software. 

In [1]:
# Import packages used in this notebook
import os, sys
import pandas as pd
import numpy as np

In [2]:
basepath = 'C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/'

In [3]:
# Load the name mapping for the column names
sbe_name_map = pd.read_excel('C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Reference_Files/seabird_ctd_name_map.xlsx')
sbe_name_map

Unnamed: 0,Short Name,Full Name,Friendly Name,Units,Notes/Comments
0,accM,Acceleration [m/s^2],acc M,m/s^2,
1,accF,Acceleration [ft/s^2],acc F,ft/s^2,
2,altM,Altimeter [m],alt M,m,
3,altF,Altimeter [ft],alt F,ft,
4,avgsvCM,"Average Sound Velocity [Chen-Millero, m/s]",avgsv-C M,"Chen-Millero, m/s",
5,avgsvCF,"Average Sound Velocity [Chen-Millero, ft/s]",avgsv-C F,"Chen-Millero, ft/s",
6,avgsvDM,"Average Sound Velocity [Delgrosso, m/s]",avgsv-D M,"Delgrosso, m/s",
7,avgsvDF,"Average Sound Velocity [Delgrosso, ft/s]",avgsv-D F,"Delgrosso, ft/s",
8,avgsvWM,"Average Sound Velocity [Wilson, m/s]",avgsv-W M,"Wilson, m/s",
9,avgsvWF,"Average Sound Velocity [Wilson, ft/s]",avgsv-W F,"Wilson, ft/s",


In [4]:
# Load the cruise ID
with open('C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/Irminger/Irminger-5/Data/CRUISE_ID') as file:
    cruise_id = file.read().strip()
cruise_id

'ar30-03'

In [5]:
def parse_header(header):
    """
    Function to parse the header of a SeaBird rosette file (.ros).
    This takes the place of generating a bottle file (.btl) using
    SeaBird's SeaSoft software.
    
    Args:
        header - a text file containing the relevant header info
    Returns:
        header_dict - a dictionary containing a mapping of the
            column name to its position in the data file
        start_time - the time that the SBE system started recording
        scan_interval - the number of seconds per scan
    """
    # Initialize the header dictionary
    header_dict = {}
    start_time = []
    scan_interval = []
    for line in header.splitlines():
        if 'name' in line:
            header_index = line.split()[2]
            header_name = line.split()[4].replace(':','')
            header_dict.update({header_name:header_index})
        elif 'interval' in line:
            scan_interval = line.split()[-1]
        elif 'start_time' in line:
            start_time = line.split()[3:7] 
            
    # Return the relevant important data
    return header_dict, start_time, scan_interval

In [6]:
def parse_data(data,header_dict):
    """
    Parses the data from the rosette file based on the position of the
    column, using the column locations from the header.
    
    Args:
        data - a text file containing the data from the rosette file
        header_dict - a dictionary containing a mapping of column names to the
            column position
    Returns:
        data_dict - a dictionary of key:value pairs, where the key is the
            column position and the value is the list of column values
    """
    
    # Generate a dictionary for the data with mapping from the column dictionary
    data_dict = {x:[] for x in header_dict.values()}
    
    # Now parse the data
    for line in data.splitlines():
        for i,x in enumerate(line.split()):
            try:
                float(x)
                data_dict[str(i)].append(x)
            except (ValueError, KeyError):
                # Skip non-numeric tokens and positions not named in the header
                pass
    
    return data_dict
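To illustrate the positional parsing, here is a small self-contained sketch that assigns each whitespace-separated token in a data row to a column by its position. The two data rows and the `column_positions` mapping are hypothetical, and unlike `parse_data` this version keys the result by short name and stores floats directly.

```python
# Hypothetical mapping of short name -> column position, as read from a header
column_positions = {'prDM': 0, 't090C': 1, 'nbf': 2}

# Hypothetical data rows with consistent whitespace-separated columns
sample_data = """\
   3000.123     1.3518          1
   3000.456     1.3520          1
"""

# Invert the mapping so a token's position looks up its column name
names_by_position = {pos: name for name, pos in column_positions.items()}
columns = {name: [] for name in column_positions}

for line in sample_data.splitlines():
    for position, token in enumerate(line.split()):
        name = names_by_position.get(position)
        if name is not None:
            columns[name].append(float(token))

print(columns)
```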

In [7]:
def generate_btl_data(data_dict, header_dict, start_time, scan_interval):
    """
    Function to generate the equivalent bottle file (.btl).
    
    Args:
        data_dict - a dictionary of key:value pairs, where the key is the
            column position and the value is the list of column values
        header_dict - a dictionary containing a mapping of column names to the
            column position
        start_time - the time that the CTD cast started
        scan_interval - the number of seconds per ctd scan
    Returns:
        df - a pandas dataframe containing the data from the data dictionary 
            with the column names from the header dictionary and the datetime
            calculated from the start_time and the scan_interval
    """
    
    # Using the data and header dictionaries, map the data columns to the
    # appropriate column names
    result = {}
    for key,item in header_dict.items():
        values = data_dict.get(item)
        result.update({key:values})

    # Put the data into a dataframe and convert the data from strings to floats
    df = pd.DataFrame.from_dict(result)
    df = df.astype(float)
    
    # Group the scans by bottle fired (nbf) and take the mean of each group
    df = df.groupby(by='nbf').mean()

    # Convert the scan counts to seconds
    df['scan'] = df['scan'].apply(lambda x: x*float(scan_interval))

    # Add in the date time
    start_time = pd.to_datetime(' '.join(start_time))
    df['Datetime'] = df['scan'].apply(lambda x: start_time + pd.to_timedelta(x,unit='s'))

    return df
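The timestamp arithmetic can be checked by hand with a short sketch. The start time, scan number, and interval below are assumed values chosen for illustration (a 24 Hz instrument has an interval of about 0.0416667 s), not values read from a real cast.

```python
import pandas as pd

# Assumed header values, in the form parse_header would return them
start_time = ['Jun', '07', '2018', '08:19:02']
scan_interval = '0.0416667'   # seconds per scan

# Assumed mean scan number for one bottle firing
mean_scan = 2400.0

# Scan number x seconds per scan = seconds elapsed since the start of the cast
elapsed_seconds = mean_scan * float(scan_interval)

# Add the elapsed time to the cast start time
start = pd.to_datetime(' '.join(start_time))
bottle_time = start + pd.to_timedelta(elapsed_seconds, unit='s')

print(elapsed_seconds, bottle_time)
```

Here 2400 scans at roughly 1/24 s per scan gives about 100 seconds, so the bottle time falls about one minute and forty seconds after the start of the cast.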


In [8]:
def parse_cast_number(filename):
    """
    Parses the cast number out of the file name. It assumes that the 
    cast number is 3 numbers long and occurs right before the file
    extension.
    
    Args:
        filename - the name of the file to be parsed
    Returns:
        cast_num - the cast number of the file
    """
    
    # Locate the final '.' so dots earlier in the path are ignored
    index = filename.rindex('.')
    # From the index, count backwards until 3 digits have been found
    num = 0
    ind = 0
    while num < 3:
        ind = ind+1
        try:
            float(filename[index-ind])
            num = num+1
        except ValueError:
            pass
    # Now return the cast number
    cast_num = filename[index-ind:index]
    
    return cast_num
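An alternative sketch of the same idea uses a regular expression. This is not the method used above: it is stricter, requiring three consecutive digits immediately before the extension, and it returns `None` instead of looping when no such digits exist. The function name `cast_number_from_name` is hypothetical.

```python
import re

def cast_number_from_name(filename):
    """Return the three digits immediately before the file extension, or None."""
    # (\d{3}) captures the cast number; \.[^./\\]+$ matches the final extension
    match = re.search(r'(\d{3})\.[^./\\]+$', filename)
    return match.group(1) if match else None

print(cast_number_from_name('ar30-03003.ros'))
print(cast_number_from_name('notes.txt'))
```

The cast number is returned as a string, so leading zeros are preserved until the file is read back in and the column is converted to integers.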

In [9]:
def process_ros_files(filepath,sbe_name_map,cruise_id):
    """
    Parent function to parse and process SeaBird rosette (.ros)
    files, generate a pandas dataframe, and write bottle (.btl)
    files (as csvs). This takes the place of the bottle processing
    in SeaSoft V2 provided by SeaBird.
    
    Args:
        filepath - directory path to the location of the rosette files
        sbe_name_map - a pandas dataframe containing a mapping of the
            seabird short names to full names. Taken directly from the
            seabird manuals.
        cruise_id - string input of the cruise id
    Calls:
        parse_header
        parse_data
        generate_btl_data
        parse_cast_number
    Returns:
        .btl - writes a bottle file to the same directory location as the
            rosette file.
    """
    
    # First, open the file and read it in
    with open(filepath) as file:
        data = file.read()
        header, data = data.split('*END*')
        
    # Parse the header file
    header_dict, start_time, scan_interval = parse_header(header)
    
    # Parse the data based on the output from the header
    data_dict = parse_data(data, header_dict)
    
    # Create a pandas dataframe
    df = generate_btl_data(data_dict, header_dict, start_time, scan_interval)
    
    # Rename the column title using the sbe_name_mapping 
    for colname in list(df.columns.values):
        try:
            fullname = list(sbe_name_map[sbe_name_map['Short Name'] == colname]['Full Name'])[0]
            df.rename({colname:fullname},axis='columns',inplace=True)
        except:
            pass
    # Rename the index as well
    df.index.rename(list(sbe_name_map[sbe_name_map['Short Name'] == df.index.name]['Full Name'])[0],inplace=True)
    
    # Add in the cruise id
    df['Cruise ID'] = cruise_id
    
    # Parse and add in the cast number
    cast = parse_cast_number(filepath)
    df['Cast'] = cast
    
    # Generate the btl name file
    btl_path = filepath.replace('.ros','.btl')
    
    # Save to the same directory as the rosette files
    df.to_csv(btl_path)

In [10]:
# Iterating through this process generates .btl files from the rosette files.
# This is done in lieu of using the SeaBird software processing. I chose this route
# because inconsistent column spacing made parsing and matching column names to
# column values difficult. The rosette headers make the column relationship explicit.
filepath = 'C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/Irminger/Irminger-5/Data/CTD/'
for file in os.listdir(filepath):
    if file.endswith('.ros'):
        process_ros_files(filepath+file, sbe_name_map, cruise_id)
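File discovery can also be sketched with `pathlib`, matching on the exact `.ros` suffix so that names which merely contain ".ros" are skipped. The helper `find_ros_files` and the placeholder file names are hypothetical; the demonstration runs in a temporary directory rather than on real cruise data.

```python
import tempfile
from pathlib import Path

def find_ros_files(directory):
    """Return a sorted list of files in `directory` whose suffix is exactly '.ros'."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix == '.ros')

# Demonstrate with a throwaway directory and empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ['ar30-03001.ros', 'ar30-03002.ros',
                 'ar30-03001.rosette.txt', 'ar30-03001.btl']:
        (Path(tmp) / name).touch()
    found = [p.name for p in find_ros_files(tmp)]

print(found)
```

Sorting the results keeps the casts in a predictable order, which `os.listdir` does not guarantee.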
        
        

In [11]:
btl = pd.read_csv('C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/Irminger/Irminger-5/Data/CTD/ar30-03003.btl')

In [12]:
btl

Unnamed: 0,Bottles Fired,"Pressure, Digiquartz [db]","Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],"Conductivity, 2 [S/m]","Oxygen raw, SBE 43 [V]","Fluorescence, WET Labs ECO-AFL/FL [mg/m^3]","Turbidity, WET Labs ECO [NTU]","Salinity, Practical [PSU]",SPAR/Surface Irradiance,Scan Count,Flag,Datetime,Cruise ID,Cast
0,1.0,2998.288122,1.351835,1.353786,3.137522,3.137501,1.749627,-0.375871,0.4148,34.884186,463.341429,3693.461288,0.0,2018-06-07 09:20:36.461288100,ar30-03,3
1,2.0,2540.853082,3.005212,3.006747,3.268461,3.268443,1.779553,-0.381859,0.3552,34.925969,473.93551,4512.961944,0.0,2018-06-07 09:34:15.961943700,ar30-03,3
2,3.0,2028.834633,3.460271,3.46142,3.289795,3.289794,1.867,-0.377459,0.3588,34.939969,1941.326531,5582.629466,0.0,2018-06-07 09:52:05.629466100,ar30-03,3
3,4.0,1623.698571,3.514384,3.515059,3.27443,3.274429,2.008163,-0.383841,0.363,34.896569,818.540612,6365.380092,0.0,2018-06-07 10:05:08.380092300,ar30-03,3
4,5.0,1318.284878,3.374037,3.374486,3.246383,3.246395,2.144782,-0.385437,0.368769,34.863065,659.627755,7027.755622,0.0,2018-06-07 10:16:10.755622200,ar30-03,3
5,6.0,1014.043633,3.484706,3.485165,3.244387,3.244406,2.2166,-0.383049,0.372424,34.876933,733.280612,7690.672819,0.0,2018-06-07 10:27:13.672819200,ar30-03,3
6,7.0,813.864633,3.553361,3.553355,3.242332,3.242327,2.266237,-0.373473,0.3752,34.883367,693.19898,8390.965046,0.0,2018-06-07 10:38:53.965046100,ar30-03,3
7,8.0,609.717816,3.645769,3.646351,3.242135,3.242197,2.317,-0.383845,0.3826,34.890165,720.349592,8947.257158,0.0,2018-06-07 10:48:10.257157800,ar30-03,3
8,9.0,407.865592,3.738643,3.739006,3.241234,3.241287,2.369347,-0.366288,0.3925,34.888245,753.049388,9505.174271,0.0,2018-06-07 10:57:28.174270800,ar30-03,3
9,10.0,205.447082,4.048127,4.048265,3.259834,3.259878,2.409522,-0.349937,0.3904,34.889555,738.329796,10081.758065,0.0,2018-06-07 11:07:04.758065400,ar30-03,3
