# Bottle Processing
Author: Andrew Reed

### Motivation:
Independent verification of the suite of physical and chemical observations provided by OOI is critical for the observations to be of use in scientifically valid investigations. Consequently, CTD casts and Niskin water samples are collected during deployment and recovery of OOI platforms, vehicles, and instrumentation. The water samples are subsequently analyzed by independent labs for comparison with the OOI telemetered and recovered data.

However, the water sample data routinely collected and analyzed as part of the OOI program are not currently available in a standardized format that maps the different chemical analyses to the physical measurements taken at bottle closure. Our aim is to make these physical and chemical analyses of collected water samples available to the end-user in a standardized format for easy comprehension and use, while maintaining the source data files.

### Approach:
Generating a summary of the water sample analyses involves preprocessing and concatenating multiple data sources, and accurately matching samples with each other. To do this, I first preprocess the CTD casts to generate bottle (.btl) files using the SeaBird vendor software, following the SOP available on Alfresco.

Next, the bottle files are parsed using Python and the data columns are renamed following SeaBird's naming guide. This creates a series of individual cast summary (.sum) files. These files are then loaded into pandas dataframes, concatenated, and exported as a single CSV file containing all of the bottle data.

### Data Sources/Software:

* **sbe_name_map**: This is a spreadsheet which maps the short names generated by the SeaBird SBE DataProcessing Software to the associated full names. The name mapping originates from SeaBird's SBE DataProcessing support documentation.

* **Alfresco**: The Alfresco CMS for OOI at alfresco.oceanobservatories.org is the source of the CTD hex, xmlcon, and psa files necessary for generating the bottle files needed to create the sample summary sheet.

* **SBEDataProcessing-Win32**: SeaBird vendor software for processing the raw CTD files and generating the .btl files.


**========================================================================================================================**
Import packages which will be used in this notebook:

In [2]:
import os, sys, re
import pandas as pd
import numpy as np

Load the name mapping for the column names based on SeaBird's manual:

In [3]:
sbe_name_map = pd.read_excel('/media/andrew/OS/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Reference_Files/seabird_ctd_name_map.xlsx')

In [4]:
sbe_name_map.head()

Unnamed: 0,Short Name,Full Name,Friendly Name,Units,Notes/Comments
0,accM,Acceleration [m/s^2],acc M,m/s^2,
1,accF,Acceleration [ft/s^2],acc F,ft/s^2,
2,altM,Altimeter [m],alt M,m,
3,altF,Altimeter [ft],alt F,ft,
4,avgsvCM,"Average Sound Velocity [Chen-Millero, m/s]",avgsv-C M,"Chen-Millero, m/s",


**========================================================================================================================**
Declare the directory paths to where the relevant information is stored:

In [5]:
basepath = '/home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/'
array = 'Pioneer/'
cruise = 'Pioneer-04_AT-27_2015-04-28/'
water = 'Water Sampling/'
ctd = 'ctd/'
leg = 'Leg 1 (at27a)/'

In [6]:
sorted(os.listdir(basepath+array+cruise+leg+ctd));

In [29]:
sample_dir = basepath+array+cruise+leg+ctd
water_dir = basepath+array+cruise+water
salts_and_o2_path = water_dir+ 'Pioneer-04_AT-27A_Oxygen_Salinity_Sample_Data/'
log_path = water_dir+ 'Pioneer-04_AT-27A_CTD_Sampling_Log.xlsx'
nutrients_path = water_dir+ 'Pioneer-04_AT-27A_Nutrients_Sample_Data_2016-09-01_ver_1-00.xlsx'
dic_path = water_dir + 'Pioneer-04_AT-27_DIC_Sample_Data_2019-06-19_ver_1-00.xlsx'
chl_path = water_dir+ 'Pioneer-04_AT-27A_Chlorophyll_Sample_Data_2017-09-21_ver_1-00.xlsx'

In [30]:
# Parse the data for the start_time
def parse_header(header):
    """
    Parse the header of bottle (.btl) files to get critical information
    for the summary spreadsheet.
    
    Args:
        header - an object containing the header of the bottle file as a list of
            strings, split at the newline.
    Returns:
        hdr - a dictionary object containing the start_time, filename, latitude,
            longitude, and cruise id.
    """
    hdr = {}
    for line in header:
        if 'start_time' in line.lower():
            start_time = pd.to_datetime(re.split(r'= |\[',line)[1])
            hdr.update({'Start Time [UTC]':start_time.strftime('%Y-%m-%dT%H:%M:%SZ')})
        elif 'filename' in line.lower():
            hex_name = re.split('=',line)[1].strip()
            hdr.update({'Filename':hex_name})
        elif 'latitude' in line.lower():
            start_lat = re.split('=',line)[1].strip()
            hdr.update({'Start Latitude [degrees]':start_lat})
        elif 'longitude' in line.lower():
            start_lon = re.split('=',line)[1].strip()
            hdr.update({'Start Longitude [degrees]':start_lon})
        elif 'cruise id' in line.lower():
            cruise_id = re.split(':',line)[1].strip()
            hdr.update({'Cruise':cruise_id})
        else:
            pass
    
    return hdr
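
To illustrate what `parse_header` extracts, the sketch below applies the same splitting logic to a few header lines. The header lines are hypothetical, modeled on the values that appear in the summary tables in this notebook; real .btl headers may differ in wording and spacing. The helper name `parse_header_sketch` is also hypothetical, and it uses the standard library's `datetime.strptime` instead of `pd.to_datetime` only to keep the example self-contained.

```python
import re
from datetime import datetime

# Hypothetical header lines, modeled on a SeaBird .btl header
header = [
    r"* FileName = C:\data\ctd\at27a_004.hex",
    "* NMEA Latitude = 40 13.60 N",
    "* NMEA Longitude = 070 53.00 W",
    "** Cruise ID: AT27A",
    "# start_time = May 01 2015 03:34:43 [NMEA time, header]",
]

def parse_header_sketch(lines):
    """Simplified stand-in for parse_header, for illustration only."""
    hdr = {}
    for line in lines:
        lower = line.lower()
        if 'start_time' in lower:
            # Keep the text between '= ' and the opening '['
            raw = re.split(r'= |\[', line)[1].strip()
            start = datetime.strptime(raw, '%b %d %Y %H:%M:%S')
            hdr['Start Time [UTC]'] = start.strftime('%Y-%m-%dT%H:%M:%SZ')
        elif 'filename' in lower:
            hdr['Filename'] = line.split('=', 1)[1].strip()
        elif 'latitude' in lower:
            hdr['Start Latitude [degrees]'] = line.split('=', 1)[1].strip()
        elif 'longitude' in lower:
            hdr['Start Longitude [degrees]'] = line.split('=', 1)[1].strip()
        elif 'cruise id' in lower:
            hdr['Cruise'] = line.split(':', 1)[1].strip()
    return hdr

hdr = parse_header_sketch(header)
print(hdr)
```

The order of the `elif` branches matters here: the filename line contains a colon (`C:\...`), so it has to be matched before the `cruise id` branch would ever be considered.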

Get the path to the ctd-bottle data, load it, and parse it:

In [31]:
# Parse each bottle file and write a per-cast summary (.sum) file
files = [x for x in os.listdir(sample_dir) if '.btl' in x]
for filename in files:
    filepath = os.path.abspath(sample_dir+filename)
    
    # Load the raw content into memory
    with open(filepath) as file:
        content = file.readlines()
    content = [x.strip() for x in content]
    
    # Now parse the file content
    header = []
    columns = []
    data = []
    for line in content:
        if line.startswith('*') or line.startswith('#'):
            header.append(line)
        else:
            try:
                # Data lines start with a digit; column-name lines do not
                float(line[0])
                data.append(line)
            except (ValueError, IndexError):
                columns.append(line)
                
    # Parse the header
    hdr = parse_header(header)
    
    # Parse the column identifiers
    column_dict = {}
    for line in columns:
        for i,x in enumerate(line.split()):
            try:
                column_dict[i] = column_dict[i] + ' ' + x
            except KeyError:
                column_dict.update({i:x})
                
    #Parse the bottle data based on the column header locations
    data_dict = {x:[] for x in column_dict.keys()}

    for line in data:
        if line.endswith('(avg)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            for i,x in enumerate(values):
                data_dict[i].append(x)
        elif line.endswith('(sdev)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            data_dict[1].append(values[0])
        else:
            pass
    
    # Join the date and time for each measurement into a single item
    data_dict[1] = [' '.join(item) for item in zip(data_dict[1][::2],data_dict[1][1::2])]
    
    # With the parsed data and column names, match up the data and column
    # based on the location
    results = {}
    for key,item in column_dict.items():
        values = data_dict[key]
        results.update({item:values})
        
    # Put the results into a dataframe
    df = pd.DataFrame.from_dict(results)

    # Now add the parsed info from the header files into the dataframe
    for key,item in hdr.items():
        df[key] = item
        
    # Get the cast number from the three characters preceding the first '.' in the filename
    cast = filename[filename.index('.')-3:filename.index('.')]
    df['Cast'] = str(cast).zfill(3)
        
    # Generate a filename for the summary file
    outname = filename.split('.')[0] + '.sum'
    
    # Save the results
    df.to_csv(sample_dir+outname)
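
The loop above relies on each bottle occupying two lines in the .btl file: an `(avg)` line carrying the bottle position, the date, and the averaged values, followed by an `(sdev)` line whose first field is the time. The sketch below shows that pairing on a single bottle. The two sample lines are hypothetical and have far fewer columns than a real file, and `split_fields` is an illustrative helper, not part of the notebook. It also differs from the loop above in two ways: it splits on runs of two or more whitespace characters, and it strips the trailing `(avg)`/`(sdev)` tag, which in the table output later in this notebook remains attached to the last column.

```python
import re

# Hypothetical pair of data lines for one bottle (real files have more columns)
lines = [
    "1    May 01 2015    121.528    11.6626 (avg)",
    "04:43:02      0.012     0.0003 (sdev)",
]

def split_fields(line):
    """Drop the trailing (avg)/(sdev) tag, then split on 2+ whitespace characters."""
    line = re.sub(r'\s*\((avg|sdev)\)$', '', line.strip())
    return [field for field in re.split(r'\s{2,}', line) if field]

avg = split_fields(lines[0])
sdev = split_fields(lines[1])

record = {
    'Bottle Position': avg[0],
    # The date comes from the (avg) line and the time from the (sdev) line
    'Date Time': avg[1] + ' ' + sdev[0],
    'values': [float(v) for v in avg[2:]],
}
print(record)
```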

In [32]:
# Now, for each "summary" file, load it and concatenate into a single dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.DataFrame()
for file in os.listdir(sample_dir):
    if '.sum' in file:
        df = pd.concat([df, pd.read_csv(sample_dir+file)])

In [33]:
# Rename the column title using the sbe_name_mapping 
for colname in list(df.columns.values):
    try:
        fullname = list(sbe_name_map[sbe_name_map['Short Name'].apply(lambda x: str(x).lower() == colname.lower()) == True]['Full Name'])[0]
        df.rename({colname:fullname},axis='columns',inplace=True)
    except:
        pass
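
A possible alternative to the row-by-row lookup above is to build a lowercase dictionary from the name map once and pass a function to `DataFrame.rename`. This is only a sketch: the name map rows are copied from the preview shown earlier, while `bottle_df` and its values are hypothetical.

```python
import pandas as pd

# A few rows copied from the name map preview earlier in this notebook
name_map = pd.DataFrame({
    'Short Name': ['accM', 'altM', 'avgsvCM'],
    'Full Name': ['Acceleration [m/s^2]', 'Altimeter [m]',
                  'Average Sound Velocity [Chen-Millero, m/s]'],
})

# Hypothetical bottle dataframe whose columns use short names in mixed case
bottle_df = pd.DataFrame({'AltM': [3.2], 'accm': [0.01], 'Cast': [1]})

# Build a case-insensitive lookup once, then rename every column that has a match
lookup = {str(short).lower(): full
          for short, full in zip(name_map['Short Name'], name_map['Full Name'])}
renamed = bottle_df.rename(columns=lambda col: lookup.get(col.lower(), col))
print(list(renamed.columns))
```

Columns without an entry in the lookup, such as `Cast`, keep their original name, which mirrors the `try`/`except` behavior of the cell above.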

In [34]:
df

Unnamed: 0.1,Unnamed: 0,Bottle Position,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],...,"Oxygen raw, SBE 43 [V]","Oxygen, SBE 43 [ml/l]","Oxygen Saturation, Garcia & Gordon [ml/l]","Beam Attenuation, WET Labs C-Star [1/m]","Beam Transmission, WET Labs C-Star [%]",Filename,Start Latitude [degrees],Start Longitude [degrees],Start Time [UTC],Cast
0,0,1,May 01 2015 04:43:02,121.528,120.559,40.22666,-70.88333,11.6626,11.6615,4.015489,...,2.0794,4.8236,6.07835,0.2914,92.9735 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
1,1,2,May 01 2015 04:43:13,121.664,120.693,40.22666,-70.88333,11.6548,11.6590,4.014495,...,2.0802,4.8252,6.07944,0.2876,93.0634 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
2,2,3,May 01 2015 04:49:56,60.490,60.017,40.22666,-70.88332,4.8166,4.8170,3.192494,...,2.3271,6.6137,7.18163,0.1623,96.0242 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
3,3,4,May 01 2015 04:50:08,60.522,60.048,40.22666,-70.88332,4.8160,4.8165,3.192420,...,2.3277,6.6157,7.18175,0.1641,95.9806 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
4,4,5,May 01 2015 04:55:23,18.700,18.556,40.22666,-70.88334,7.8618,7.8400,3.454734,...,2.7211,7.4455,6.69247,0.2866,93.0849 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
5,5,6,May 01 2015 04:55:34,18.598,18.454,40.22666,-70.88334,7.7453,7.7703,3.443893,...,2.7245,7.4572,6.71035,0.2849,93.1246 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
6,6,7,May 01 2015 04:57:41,1.650,1.637,40.22666,-70.88332,8.5716,8.5714,3.522467,...,2.7450,7.3624,6.58429,0.2969,92.8472 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
7,7,8,May 01 2015 04:57:48,1.610,1.598,40.22666,-70.88332,8.5680,8.5668,3.522168,...,2.7446,7.3659,6.58480,0.2984,92.8121 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
8,8,9,May 01 2015 04:57:55,1.755,1.742,40.22666,-70.88332,8.5693,8.5666,3.522298,...,2.7427,7.3552,6.58460,0.2985,92.8102 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4
9,9,10,May 01 2015 04:57:58,1.555,1.543,40.22666,-70.88332,8.5706,8.5673,3.522441,...,2.7432,7.3602,6.58440,0.3037,92.6898 (avg),C:\data\ctd\at27a_004.hex,40 13.60 N,070 53.00 W,2015-05-01T03:34:43Z,4


In [35]:
df.sort_values(by=['Cast','Bottle Position'], inplace=True)
df.drop(columns='Unnamed: 0',inplace=True)
for colname in list(df.columns.values):
    df.rename({colname:'CTD ' + colname},axis='columns',inplace=True)
bottles = df

In [36]:
df.to_csv(sample_dir+'CTD_Summary.csv')

In [37]:
df.head()

Unnamed: 0,CTD Bottle Position,CTD Date Time,"CTD Pressure, Digiquartz [db]","CTD Depth [salt water, m]",CTD Latitude [deg],CTD Longitude [deg],"CTD Temperature [ITS-90, deg C]","CTD Temperature, 2 [ITS-90, deg C]",CTD Conductivity [S/m],"CTD Conductivity, 2 [S/m]",...,"CTD Oxygen raw, SBE 43 [V]","CTD Oxygen, SBE 43 [ml/l]","CTD Oxygen Saturation, Garcia & Gordon [ml/l]","CTD Beam Attenuation, WET Labs C-Star [1/m]","CTD Beam Transmission, WET Labs C-Star [%]",CTD Filename,CTD Start Latitude [degrees],CTD Start Longitude [degrees],CTD Start Time [UTC],CTD Cast
0,1,Apr 29 2015 16:56:01,440.439,436.601,39.94,-70.88337,7.2931,7.2972,3.585522,3.586001,...,1.6521,4.0457,6.70124,0.1523,96.2641 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1
1,2,Apr 29 2015 16:56:11,440.11,436.276,39.94,-70.88338,7.3209,7.3246,3.588149,3.588595,...,1.6517,4.0344,6.69697,0.1499,96.3207 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1
2,3,Apr 29 2015 17:17:11,150.299,149.094,39.94,-70.88336,12.5311,12.5338,4.128747,4.129101,...,2.1947,5.0995,5.95926,0.0739,98.1700 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1
3,4,Apr 29 2015 17:17:25,150.294,149.089,39.94,-70.88336,12.5283,12.5288,4.12842,4.128522,...,2.2012,5.1149,5.95962,0.0773,98.0859 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1
4,5,Apr 29 2015 17:26:08,15.383,15.265,39.94,-70.88336,10.7084,10.7023,3.819122,3.818586,...,2.7271,6.9297,6.24224,0.7043,83.8564 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1


**========================================================================================================================**
### Process the Discrete Salinity and Oxygen Data
Next, I process the discrete salinity and oxygen sample data so that it is consistently named and ready to be merged with the existing data sets.

In [38]:
def clean_sal_files(dirpath):

    # Run check if files are held in excel format or csvs
    csv_flag = any(files.endswith('.SAL') for files in os.listdir(dirpath))
    if csv_flag:
        for filename in os.listdir(dirpath):
            sample = []
            salinity = []
            if filename.endswith('.SAL'):
                with open(dirpath+filename) as file:
                    data = file.readlines()
                    for ind1,line in enumerate(data):
                        if ind1 == 0:
                            strs = data[0].replace('"','').split(',')
                            cruisename = strs[0]
                            station = strs[1]
                            cast = strs[2]
                            case = strs[8]
                        elif int(line.split()[0]) == 0:
                            pass
                        else:
                            strs = line.split()
                            sample.append(strs[0])
                            salinity.append(strs[2])
                
                    # Generate a pandas dataframe to populate data
                    data_dict = {'Cruise':cruisename,'Station':station,'Cast':cast,'Case':case,'Sample ID':sample,'Salinity [psu]':salinity}
                    df = pd.DataFrame.from_dict(data_dict)
                    df.to_csv(file.name.replace('.','')+'.csv')
            else:
                pass
    
    else:
        # If the files are already in excel spreadsheets, they've been cleaned into a
        # logical tabular format
        pass
    

def process_sal_files(dirpath):
    
    # Check if the files are excel files or not
    excel_flag = any(files.endswith('SAL.xlsx') for files in os.listdir(dirpath))
    # Initialize a dataframe for processing the salinity files
    df = pd.DataFrame()
    if excel_flag:
        for file in os.listdir(dirpath):
            if 'SAL.xlsx' in file:
                df = pd.concat([df, pd.read_excel(dirpath+file)])
        df.rename({'Sample':'Sample ID','Salinity':'Salinity [psu]','Niskin #':'Niskin','Case ID':'Case'}, 
                  axis='columns',inplace=True)
        df.dropna(inplace=True)
        df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
        df['Niskin'] = df['Niskin'].apply(lambda x: str( int(x)))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    else:
        for file in os.listdir(dirpath):
            if 'SAL.csv' in file:
                df = pd.concat([df, pd.read_csv(dirpath+file)])
        df.dropna(inplace=True)
        df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
        df.drop(columns=[x for x in list(df.columns.values) if 'unnamed' in x.lower()],inplace=True)

    # Save the processed summary file for salinity
    df.to_csv(dirpath+'SAL_Summary.csv')
    
    
def process_oxy_files(dirpath):
    df = pd.DataFrame()
    for filename in os.listdir(dirpath):
        if 'oxy' in filename.lower() and filename.endswith('.xlsx'):
            df = pd.concat([df, pd.read_excel(dirpath+filename)])
    # Rename and clean up the oxygen data to be uniform across data sets
    df.rename({'Niskin #':'Niskin','Sample#':'Sample ID','Oxy':'Oxygen [mL/L]','Unit':'Units'},
              axis='columns',inplace=True)
    df.dropna(inplace=True)
    df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
    #df['Niskin'] = df['Niskin'].apply(lambda x: str( int(x)))
    df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    df['Cruise'] = df['Cruise'].apply(lambda x: x.replace('O','0'))
    
    # Save the processed summary file for oxygen
    df.to_csv(dirpath+'OXY_Summary.csv')

**If the salinity and oxygen summary files have not been generated yet, run this cell; otherwise skip it.**

In [39]:
# Now process the salts and oxygen data
# Clean the salinity
clean_sal_files(salts_and_o2_path)
# Process the salinity files
process_sal_files(salts_and_o2_path)
# Process the oxygen files
process_oxy_files(salts_and_o2_path)

**====================================================================================================================**
Load the salinity and oxygen:

In [40]:
sal = pd.read_csv(salts_and_o2_path+'SAL_Summary.csv')
sal.drop(columns='Unnamed: 0', inplace=True)
for colname in list(sal.columns.values):
    sal.rename(columns={colname:'Sal ' + colname}, inplace=True)
sal['Sal Sample ID'] = sal['Sal Case'] + sal['Sal Sample ID'].map(str)

In [41]:
sal.head()

Unnamed: 0,Sal Cruise,Sal Station,Sal Cast,Sal Case,Sal Sample ID,Sal Salinity [psu]
0,SHORE,3,1,B,B1,35.1067
1,SHORE,3,1,B,B2,35.1055
2,SHORE,3,1,B,B3,35.7365
3,SHORE,3,1,B,B4,35.7363
4,SHORE,3,1,B,B5,35.6303


In [42]:
oxy = pd.read_csv(salts_and_o2_path+'OXY_Summary.csv')
oxy.drop(columns='Unnamed: 0', inplace=True)
for colname in list(oxy.columns.values):
    oxy.rename(columns={colname:'Oxy ' + colname}, inplace=True)
oxy['Oxy Sample ID'] = oxy['Oxy Case'] + oxy['Oxy Sample ID'].map(str)

In [43]:
oxy.head()

Unnamed: 0,Oxy Cruise,Oxy Station,Oxy Case,Oxy Sample ID,Oxy Oxygen [mL/L],Oxy Units
0,AT27-A,5,T,T9,5.117,mL/L
1,AT27-A,5,T,T10,5.117,mL/L
2,AT27-A,5,T,T11,5.333,mL/L
3,AT27-A,5,T,T12,5.349,mL/L
4,AT27-A,5,T,T13,7.16,mL/L


**========================================================================================================================**
### CTD Sampling Log
Load in the CTD sampling log summary sheet. The summary sheet needs to be manually created and the data cleaned before attempting to import. Additionally, ensure that there is only one header line and that it is at the top of the file.

In [71]:
sample_log = pd.read_excel(log_path,sheet_name='Summary',header=0)
sample_log = sample_log.sort_values(by=['Station-Cast #','Niskin #'])
for colname in list(sample_log.columns.values):
    sample_log.rename({colname:'Log ' + colname},axis='columns',inplace=True)
sample_log.head()

Unnamed: 0,Log Cruise ID,Log Station-Cast #,Log Target Station,Log Start Latitude,Log Start Longitude,Log Start Date,Log Start Time,Log Bottom Depth [m],Log Niskin #,Log Rosette Position,...,Log Oxygen Bottle #,Log Ph Bottle #,Log DIC/TA Bottle #,Log Salts Bottle #,Log Nitrate Bottle 1,Log Chlorophyll Brown Bottle #,Log Chlorophyll Filter Sample #,Log Chlorophyll Brown Bottle Volume,Log Chlorophyll LN Tube,Log Comments
0,AT-27A,1,OSPM,39 56.400 N,70 53.002 W,2015-04-29,15:40:00,445,1.0,1.0,...,T1,323.0,324.0,J1,1-1,1.0,01 / 01,1070.0,,
1,AT-27A,1,OSPM,39 56.400 N,70 53.002 W,2015-04-29,15:40:00,445,2.0,2.0,...,T2,325.0,326.0,J2,1-2,2.0,01 / 02,1070.0,,Duplicates
2,AT-27A,1,OSPM,39 56.400 N,70 53.002 W,2015-04-29,15:40:00,445,3.0,3.0,...,T3,327.0,328.0,J3,1-3,3.0,01 / 03,1070.0,,
3,AT-27A,1,OSPM,39 56.400 N,70 53.002 W,2015-04-29,15:40:00,445,4.0,4.0,...,T4,,,J4,,4.0,01 / 04,1070.0,,
4,AT-27A,1,OSPM,39 56.400 N,70 53.002 W,2015-04-29,15:40:00,445,5.0,5.0,...,T5,329.0,330.0,J5,1-4,5.0,01 / 05,1070.0,,"Chl max, O2 max"


In [72]:
def strip_x(x):
    if type(x) == str:
        x = x.replace('.','')
        return x
    else:
        return x

In [73]:
sample_log['Log Nitrate Bottle 1'] = sample_log['Log Nitrate Bottle 1'].apply(lambda x: strip_x(x))

In [74]:
# Reformat the sample_log start date and time as well as the date/time
sample_log['Log Start Date'] = sample_log['Log Start Date'].apply(lambda x: pd.to_datetime(x).strftime('%Y-%m-%d'))
sample_log['Log Start Time'] = sample_log['Log Start Time'].apply(lambda x: str(x))
sample_log['Log Start Time'] = sample_log['Log Start Date'] + 'T' + sample_log['Log Start Time'] + 'Z'

In [75]:
# Date and Time
sample_log['Log Date'] = pd.to_datetime(sample_log['Log Date']).apply(lambda x: x.strftime('%Y-%m-%d') if not pd.isnull(x) else '')
sample_log['Log Time'] = sample_log['Log Time'].apply(lambda x: x.strftime('%H:%M:%S') if not pd.isnull(x) else '')
sample_log['Log Time'] = sample_log['Log Date'] + 'T' + sample_log['Log Time'] + 'Z'
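
The two cells above build ISO 8601 timestamps by concatenating a date string, a `T`, a time string, and a `Z`. A minimal sketch of the same idea using only the standard library is shown below; `to_iso_utc` is a hypothetical helper and it assumes the logged times are already in UTC. One difference from the cells above is that a missing date or time yields an empty string, whereas concatenating two empty strings with `'T'` and `'Z'` produces the literal `'TZ'`.

```python
from datetime import date, time, datetime

def to_iso_utc(d, t):
    """Combine a date and a time into 'YYYY-MM-DDTHH:MM:SSZ', or '' if either is missing."""
    if d is None or t is None:
        return ''
    return datetime.combine(d, t).strftime('%Y-%m-%dT%H:%M:%SZ')

stamp = to_iso_utc(date(2015, 4, 29), time(15, 40, 0))
missing = to_iso_utc(date(2015, 4, 29), None)
print(stamp)
print(repr(missing))
```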

**========================================================================================================================**
### Merge the CTD-Bottle Data and Sample Log
The next step is to merge the CTD-Bottle data with the sample log using an outer merge based on the cast and Niskin/bottle position. The outer merge means that all data will be retained, so that we do not accidentally discard either data-only casts or casts not recorded on the sample logs.
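
To see what the outer merge retains, the sketch below merges two miniature, hypothetical tables and adds pandas' `indicator=True` option, which is not used in this notebook but can help when auditing which rows matched. The column names follow the ones used here; the values are made up.

```python
import pandas as pd

# Hypothetical miniature versions of the sample log and the bottle data
log = pd.DataFrame({'Log Station-Cast #': [1, 1, 2],
                    'Log Niskin #': [1, 2, 1],
                    'Log Salts Bottle #': ['J1', 'J2', 'J3']})
ctd = pd.DataFrame({'CTD Cast': [1, 1, 3],
                    'CTD Bottle Position': [1, 2, 1],
                    'CTD Depth [salt water, m]': [436.6, 436.3, 99.0]})

merged = log.merge(ctd, how='outer',
                   left_on=['Log Station-Cast #', 'Log Niskin #'],
                   right_on=['CTD Cast', 'CTD Bottle Position'],
                   indicator=True)

# '_merge' records where each row came from: 'both', 'left_only' or 'right_only'
print(merged[['Log Station-Cast #', 'CTD Cast', '_merge']])
```

Cast 2 exists only in the log and cast 3 only in the CTD data, yet both survive the merge, with NaN in the columns from the side that had no match.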

In [76]:
#summary = bottles.merge(sample_log, how='outer', right_on=['Station-Cast #','Niskin #'], left_on=['Cast','Bottle Position'])
#summary = bottles.merge(sample_log, how='outer', right_on=['Station-Cast #'], left_on=['Cast'])

In [77]:
summary = sample_log.merge(bottles, how='outer', left_on=['Log Station-Cast #','Log Niskin #'], right_on=['CTD Cast','CTD Bottle Position'])
summary

Unnamed: 0,Log Cruise ID,Log Station-Cast #,Log Target Station,Log Start Latitude,Log Start Longitude,Log Start Date,Log Start Time,Log Bottom Depth [m],Log Niskin #,Log Rosette Position,...,"CTD Oxygen raw, SBE 43 [V]","CTD Oxygen, SBE 43 [ml/l]","CTD Oxygen Saturation, Garcia & Gordon [ml/l]","CTD Beam Attenuation, WET Labs C-Star [1/m]","CTD Beam Transmission, WET Labs C-Star [%]",CTD Filename,CTD Start Latitude [degrees],CTD Start Longitude [degrees],CTD Start Time [UTC],CTD Cast
0,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,1.0,1.0,...,1.6521,4.0457,6.70124,0.1523,96.2641 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
1,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,2.0,2.0,...,1.6517,4.0344,6.69697,0.1499,96.3207 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
2,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,3.0,3.0,...,2.1947,5.0995,5.95926,0.0739,98.1700 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
3,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,4.0,4.0,...,2.2012,5.1149,5.95962,0.0773,98.0859 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
4,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,5.0,5.0,...,2.7271,6.9297,6.24224,0.7043,83.8564 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
5,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,6.0,6.0,...,2.7247,6.9191,6.24090,0.7083,83.7728 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
6,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,7.0,7.0,...,2.7625,7.1411,6.33987,0.7430,83.0488 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
7,AT-27A,1.0,OSPM,39 56.400 N,70 53.002 W,2015-04-29,2015-04-29T15:40:00Z,445.0,8.0,8.0,...,2.7627,7.1354,6.33561,0.7337,83.2421 (avg),C:\data\ctd\at27a_001.hex,39 56.40 N,070 53.00 W,2015-04-29T15:39:30Z,1.0
8,AT-27A,2.0,Glider,40 15 N,70 12 W,2015-04-30,2015-04-30T03:34:00Z,99.0,1.0,1.0,...,2.2672,5.9914,6.73805,0.2408,94.1569 (avg),C:\data\ctd\at27a_002.hex,40 15.48 N,070 12.91 W,2015-04-30T03:47:07Z,2.0
9,AT-27A,2.0,Glider,40 15 N,70 12 W,2015-04-30,2015-04-30T03:34:00Z,99.0,2.0,2.0,...,2.2674,5.9953,6.73865,0.2357,94.2788 (avg),C:\data\ctd\at27a_002.hex,40 15.48 N,070 12.91 W,2015-04-30T03:47:07Z,2.0


Fill in missing data based on the sample log info:

In [78]:
summary['CTD Start Latitude [degrees]'] = summary['CTD Start Latitude [degrees]'].fillna(value=summary['Log Start Latitude'])
summary['CTD Start Longitude [degrees]'] = summary['CTD Start Longitude [degrees]'].fillna(value=summary['Log Start Longitude'])
summary['CTD Start Time [UTC]'] = summary['CTD Start Time [UTC]'].fillna(value=summary['Log Start Time'])
summary['Log Station-Cast #'] = summary['Log Station-Cast #'].fillna(value=summary['CTD Cast'])
summary['CTD Bottle Position'] = summary['CTD Bottle Position'].fillna(value=summary['Log Niskin #']);
summary['CTD Date Time'] = summary['CTD Date Time'].fillna(value=summary['Log Time'])
summary['CTD Depth [salt water, m]'] = summary['CTD Depth [salt water, m]'].fillna(value=summary['Log Trip Depth'])
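
Passing a Series to `fillna`, as the cell above does, fills each missing value from the row with the same index label in the other column. A small sketch with hypothetical depths:

```python
import numpy as np
import pandas as pd

# Hypothetical values: the CTD depth is missing for the second bottle,
# and the logged trip depth is missing for the third
summary_demo = pd.DataFrame({
    'CTD Depth [salt water, m]': [436.6, np.nan, 15.3],
    'Log Trip Depth': [437.0, 150.0, np.nan],
})

# Only the gaps are filled; existing CTD values are left untouched
summary_demo['CTD Depth [salt water, m]'] = (
    summary_demo['CTD Depth [salt water, m]'].fillna(summary_demo['Log Trip Depth'])
)
print(summary_demo['CTD Depth [salt water, m]'].tolist())
```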

Eliminate redundant or non-useful columns from the existing dataframe:

In [79]:
summary.drop(columns=['Log Start Latitude','Log Start Longitude','Log Start Date','Log Start Time','CTD Cast',
                     'Log Niskin #','Log Rosette Position','Log Date','Log Time','Log Trip Depth'], inplace=True)

**====================================================================================================================**
Now, split rows which have multiple entries into their own individual rows/entries:

In [80]:
def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res
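
For the common case of a single column, pandas 0.25 and later ship a built-in `DataFrame.explode` that does the same job as the helper above. The sketch below assumes that multiple entries in a cell are separated by commas; the delimiter and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical log rows; the second row lists two nutrient bottles in one cell
demo = pd.DataFrame({
    'Log Niskin #': [1, 2],
    'Log Nitrate Bottle 1': ['1-1', '1-2,1-3'],
})

# Turn the delimited strings into lists, then give each list element its own row
demo['Log Nitrate Bottle 1'] = demo['Log Nitrate Bottle 1'].str.split(',')
exploded = demo.explode('Log Nitrate Bottle 1').reset_index(drop=True)
print(exploded)
```

The values in the other columns are repeated for each new row, so Niskin 2 appears twice in the result.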

**========================================================================================================================**
Merge the discrete salinity and oxygen data into the summary sheet. The merges below match on the station/cast number and the sample bottle ID recorded in the sampling log (Log Salts Bottle # against Sal Sample ID, and Log Oxygen Bottle # against Oxy Sample ID):

In [81]:
summary = summary.merge(sal, how='left', left_on=['Log Station-Cast #','Log Salts Bottle #'], right_on=['Sal Station','Sal Sample ID'] )
summary['Sal Salinity [psu]'] = summary['Sal Salinity [psu]'].fillna(value=summary['Log Salts Bottle #'])
#summary.rename(columns={'Sal Salinity [psu]': 'Discrete Salinity [psu]'}, inplace=True)

In [82]:
# Check that the values match up (NaN never compares equal, so rows with
# no salts bottle recorded are listed as well)
check = summary['Log Salts Bottle #'] == summary['Sal Sample ID']
if not check.all():
    print(summary[~check][['Log Station-Cast #','Log Salts Bottle #','Sal Station','Sal Sample ID']])

     Log Station-Cast # Log Salts Bottle #  Sal Station Sal Sample ID
56                  8.0                NaN          NaN           NaN
57                  1.0                NaN          NaN           NaN
58                  1.0                NaN          NaN           NaN
59                  1.0                NaN          NaN           NaN
60                  1.0                NaN          NaN           NaN
61                  1.0                NaN          NaN           NaN
62                  1.0                NaN          NaN           NaN
63                  1.0                NaN          NaN           NaN
64                  1.0                NaN          NaN           NaN
65                  1.0                NaN          NaN           NaN
66                  1.0                NaN          NaN           NaN
67                  1.0                NaN          NaN           NaN
68                  1.0                NaN          NaN           NaN
69                  
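
The rows listed above have NaN on both sides of the comparison. Because NaN never compares equal to anything, including another NaN, a plain `==` flags them as mismatches. If only genuine disagreements are of interest, one option is to exclude rows where both values are missing. The sketch below uses hypothetical bottle IDs:

```python
import numpy as np
import pandas as pd

# Hypothetical IDs: row 2 is missing on both sides, row 3 only on the lab side
demo = pd.DataFrame({
    'Log Salts Bottle #': ['J1', 'J2', np.nan, 'J4'],
    'Sal Sample ID': ['J1', 'J2', np.nan, np.nan],
})
left, right = demo['Log Salts Bottle #'], demo['Sal Sample ID']

# A plain == treats NaN as unequal to everything, including another NaN
naive_mismatch = ~(left == right)

# Count a row as a true mismatch only when at least one side has a value
both_missing = left.isna() & right.isna()
true_mismatch = naive_mismatch & ~both_missing

print(naive_mismatch.tolist())
print(true_mismatch.tolist())
```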

Drop the unnecessary or extraneous columns:

In [83]:
summary.drop(columns=['Log Salts Bottle #','Sal Cruise','Sal Station','Sal Case','Sal Sample ID'], inplace=True)

Oxygen data:

In [84]:
summary = summary.merge(oxy, how='left', left_on=['Log Station-Cast #','Log Oxygen Bottle #'], right_on=['Oxy Station','Oxy Sample ID'] )
summary['Oxy Oxygen [mL/L]'] =  summary['Oxy Oxygen [mL/L]'].fillna(value=summary['Log Oxygen Bottle #'])
#summary.rename(columns={'Oxygen [mL/L]':'Discrete Oxygen [mL/L]'}, inplace=True)

In [85]:
# Check that the values match up (NaN never compares equal, so rows with
# no oxygen bottle recorded are listed as well)
check = summary['Log Oxygen Bottle #'] == summary['Oxy Sample ID']
if not check.all():
    print(summary[~check][['Log Station-Cast #','Log Oxygen Bottle #','Oxy Sample ID']])


     Log Station-Cast # Log Oxygen Bottle # Oxy Sample ID
56                  8.0                 NaN           NaN
57                  1.0                 NaN           NaN
58                  1.0                 NaN           NaN
59                  1.0                 NaN           NaN
60                  1.0                 NaN           NaN
61                  1.0                 NaN           NaN
62                  1.0                 NaN           NaN
63                  1.0                 NaN           NaN
64                  1.0                 NaN           NaN
65                  1.0                 NaN           NaN
66                  1.0                 NaN           NaN
67                  1.0                 NaN           NaN
68                  1.0                 NaN           NaN
69                  1.0                 NaN           NaN
70                  1.0                 NaN           NaN
71                  1.0                 NaN           NaN
72            

In [87]:
summary.drop(columns=['Log Oxygen Bottle #','Oxy Cruise','Oxy Station','Oxy Case','Oxy Sample ID','Oxy Units',
                     ], inplace=True)

**========================================================================================================================**
### Nutrients Data
Load the nutrients data (if it exists) and merge with the summary sheet. If the nutrients data has not been returned yet, we fill in the relevant columns with the data from the sampling logs.

In [88]:
def clean_entry(x):
    # Remove spaces from string entries; leave NaN and numeric values untouched
    if isinstance(x, str):
        return x.replace(' ','')
    else:
        return x

In [89]:
summary['Log Nitrate Bottle 1'] = summary['Log Nitrate Bottle 1'].apply(lambda x: clean_entry(x))

In [90]:
try:
    nutrients = pd.read_excel(nutrients_path,header=0)
except (FileNotFoundError, IsADirectoryError):
    # No nutrients file yet: fall back to the sample IDs from the sampling log
    nutrients = pd.DataFrame(data=sample_log['Log Nitrate Bottle 1'])
    nutrients.rename(columns={'Log Nitrate Bottle 1':'Sample ID'}, inplace=True)
    columns = ['Sample ID','Cruise','Avg: Nitrate + Nitrite [µmol/L]','Avg: Ammonium [µmol/L]',
               'Avg: Phosphate [µmol/L]','Avg: Silicate [µmol/L]','Avg: Nitrite [µmol/L]','Avg: Nitrate [µmol/L]']
    for col in columns:
        if col not in nutrients.columns.values:
            nutrients[col] = nutrients['Sample ID']

In [91]:
nutrients.head()

Unnamed: 0,Sample ID,Avg: Nitrate+Nitrite [µmol/L],Avg: Ammonium [µmol/L],Avg: Phosphate [µmol/L],Avg: Silicate [µmol/L],Avg: Nitrite [µmol/L],Avg: Nitrate [µmol/L]
0,1-1,9.34579,0.838883,0.781959,6.305042,<0.04,9.34579
1,1-2,9.77051,0.863467,0.816794,6.541418,<0.04,9.77051
2,1-3,5.83786,0.831859,0.42719,2.20854,<0.04,5.83786
3,1-4,0.324218,0.5238,0.143924,0.04442,<0.04,0.324218
4,1-5,0.0773098,1.02251,0.110006,0.304346,<0.04,0.0773098


In [92]:
nutrients.rename(columns=lambda x: x.replace('Avg:', 'Nuts'), inplace=True)
nutrients.rename(columns={'Sample ID':'Nuts Sample ID'}, inplace=True)

Now we can merge into the summary sheet:

In [93]:
summary = summary.merge(nutrients, how='left', left_on=['Log Nitrate Bottle 1'], right_on=['Nuts Sample ID'])
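
A left merge keeps every bottle in the summary, so a sample ID that fails to match leaves NaN in the nutrient columns without raising an error. The sketch below shows one way to list such rows using the `indicator` option of `merge`. The two small dataframes are made-up stand-ins for `summary` and `nutrients`, not real cruise data.

```python
import pandas as pd

# Hypothetical miniature versions of the summary sheet and the nutrients table
summary_demo = pd.DataFrame({
    'Log Station-Cast #': [1, 1, 1, 1],
    'Log Nitrate Bottle 1': ['1-1', '1-2', None, '1-9'],
})
nutrients_demo = pd.DataFrame({
    'Nuts Sample ID': ['1-1', '1-2', '1-3'],
    'Nuts Nitrate [µmol/L]': [9.35, 9.77, 5.84],
})

# indicator=True adds a '_merge' column: 'both' for matched rows, 'left_only' otherwise
merged = summary_demo.merge(
    nutrients_demo, how='left',
    left_on='Log Nitrate Bottle 1', right_on='Nuts Sample ID',
    indicator=True,
)

# Bottles that were logged as sampled but have no lab result
logged = merged['Log Nitrate Bottle 1'].notna()
unmatched = merged[logged & (merged['_merge'] == 'left_only')]
print(unmatched[['Log Station-Cast #', 'Log Nitrate Bottle 1']])
```

Rows with no logged nutrient bottle are expected to be empty, so only logged IDs without a lab match are reported.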

In [94]:
summary.head()

Unnamed: 0,Log Cruise ID,Log Station-Cast #,Log Target Station,Log Bottom Depth [m],Log Ph Bottle #,Log DIC/TA Bottle #,Log Nitrate Bottle 1,Log Chlorophyll Brown Bottle #,Log Chlorophyll Filter Sample #,Log Chlorophyll Brown Bottle Volume,...,Sal Cast,Sal Salinity [psu],Oxy Oxygen [mL/L],Nuts Sample ID,Nuts Nitrate+Nitrite [µmol/L],Nuts Ammonium [µmol/L],Nuts Phosphate [µmol/L],Nuts Silicate [µmol/L],Nuts Nitrite [µmol/L],Nuts Nitrate [µmol/L]
0,AT-27A,1.0,OSPM,445.0,323.0,324.0,1-1,1.0,01 / 01,1070.0,...,1.0,35.1439,4.239,1-1,9.34579,0.838883,0.781959,6.305042,<0.04,9.34579
1,AT-27A,1.0,OSPM,445.0,325.0,326.0,1-2,2.0,01 / 02,1070.0,...,1.0,35.1219,4.248,1-2,9.77051,0.863467,0.816794,6.541418,<0.04,9.77051
2,AT-27A,1.0,OSPM,445.0,327.0,328.0,1-3,3.0,01 / 03,1070.0,...,1.0,35.6932,5.192,1-3,5.83786,0.831859,0.42719,2.20854,<0.04,5.83786
3,AT-27A,1.0,OSPM,445.0,,,,4.0,01 / 04,1070.0,...,1.0,35.6948,5.189,,,,,,,
4,AT-27A,1.0,OSPM,445.0,329.0,330.0,1-4,5.0,01 / 05,1070.0,...,1.0,34.4173,6.968,1-4,0.324218,0.5238,0.143924,0.04442,<0.04,0.324218


In [95]:
summary[['Log Nitrate Bottle 1','Nuts Sample ID']]

Unnamed: 0,Log Nitrate Bottle 1,Nuts Sample ID
0,1-1,1-1
1,1-2,1-2
2,1-3,1-3
3,,
4,1-4,1-4
5,1-5,1-5
6,1-6,1-6
7,,
8,,
9,,


In [96]:
summary.drop(columns=['Log Nitrate Bottle 1','Nuts Sample ID'], inplace=True)

**========================================================================================================================**
### Chlorophyll Data
If the Chlorophyll measurements have not been returned yet, we will generate a synthetic chlorophyll spreadsheet which substitutes the sample bottle numbers in place of the actual measurements. One complication is that the Chlorophyll sample # column title is not identical between cruises.

In [97]:
try:
    chl = pd.read_excel(chl_path)
except IsADirectoryError:
    # If there is no chlorophyll sheet yet, copy the bottle data from the sample log instead.
    # .copy() avoids renaming columns on a view of sample_log.
    chl = sample_log[['Log Station-Cast #','Log Chlorophyll Brown Bottle #','Log Chlorophyll Filter Sample #','Log Chlorophyll LN Tube']].copy()
    chl.rename(columns={
        'Log Chlorophyll Brown Bottle #': 'Brown Bottle #',
        'Log Chlorophyll Filter Sample #': 'Chl (ug/l)',
        'Log Chlorophyll LN Tube': 'Phaeo (ug/l)'
    }, inplace=True)

In [98]:
# Prefix every column with its source, then strip colons and newlines from the headers
chl.rename(columns=lambda x: ('Chloro ' + x).replace(':','').replace('\n',''), inplace=True)
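
Because header text differs between cruises (for example `Filter \nSample #` in one sheet and `Filter Sample #` in another), it can help to normalize the headers before prefixing them. This is a minimal sketch, assuming the differences are limited to colons, line breaks, and spacing. `normalize_header` is a hypothetical helper and not part of this notebook.

```python
def normalize_header(name):
    """Drop colons and collapse any run of whitespace (including newlines) to one space."""
    return ' '.join(str(name).replace(':', '').split())

# Example headers; the second and third differ only in where the line break falls
raw_headers = ['Cruise #:', 'Filter \nSample #', 'Filter\nSample #', 'Chl (ug/l)']
clean_headers = [normalize_header(h) for h in raw_headers]
print(clean_headers)
```

Collapsing whitespace, rather than deleting `\n` outright, also handles a header such as `Filter\nSample #`, which would otherwise become `FilterSample #`.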

Select the subset of the chlorophyll data that we will merge with the summary spreadsheet.

In [99]:
chl.columns

Index(['Chloro Cruise #', 'Chloro Date', 'Chloro Station Start Time (UTC)',
       'Chloro Station End Time (UTC)', 'Chloro Niskin Trip Time',
       'Chloro Lat', 'Chloro Lon', 'Chloro Station Depth',
       'Chloro Station-Cast #', 'Chloro Niskin #', 'Chloro Trip Depth',
       'Chloro Brown Bottle #', 'Chloro Replicate', 'Chloro Water Depth Rep',
       'Chloro Filter Sample #', 'Chloro Vol Filt', 'Chloro Filter Size',
       'Chloro Vol Extracted', 'Chloro Sample', 'Chloro 90% Acetone',
       'Chloro Dilution During Reading', 'Chloro Chl_Cal_Filename',
       'Chloro tau_Calibration', 'Chloro Fd_Calibration', 'Chloro Rb',
       'Chloro Ra', 'Chloro blank', 'Chloro Rb-blank', 'Chloro Ra-blank',
       'Chloro Chl (ug/l)', 'Chloro Phaeo (ug/l)', 'Chloro Cal_Date',
       'Chloro Fluorometer', 'Chloro Comments'],
      dtype='object')

In [100]:
chl = chl[['Chloro Cruise #','Chloro Station-Cast #','Chloro Niskin #','Chloro Brown Bottle #','Chloro Filter Sample #',
          'Chloro Chl (ug/l)','Chloro Phaeo (ug/l)','Chloro Comments']]

In [101]:
chl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 8 columns):
Chloro Cruise #           77 non-null object
Chloro Station-Cast #     77 non-null int64
Chloro Niskin #           77 non-null int64
Chloro Brown Bottle #     77 non-null int64
Chloro Filter Sample #    77 non-null object
Chloro Chl (ug/l)         77 non-null float64
Chloro Phaeo (ug/l)       77 non-null float64
Chloro Comments           26 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 4.9+ KB


In [102]:
summary.drop(columns=[x for x in list(summary.columns.values) if 'Chloro ' in x], inplace=True)

In [103]:
summary = summary.merge(chl, how='left', left_on=['Log Station-Cast #','Log Chlorophyll Filter Sample #'], right_on=['Chloro Station-Cast #','Chloro Filter Sample #'])
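
A merge on string keys such as `01 / 01` only matches when both sides are spelled identically, including spacing. The sketch below normalizes such keys before merging. Whether inconsistent spacing is the cause of unmatched rows for a given cruise is an assumption to check against the data, and `normalize_sample_id` is a hypothetical helper.

```python
import pandas as pd

def normalize_sample_id(value):
    """Remove all whitespace from IDs such as '01 / 01'; pass missing values through."""
    if pd.isna(value):
        return value
    return ''.join(str(value).split())

# Made-up IDs with inconsistent spacing on the log side and the lab side
log_ids = pd.Series(['01 / 01', '01 /02', None])
lab_ids = pd.Series(['01/01', ' 01 / 02 ', '01 / 03'])

log_keys = log_ids.apply(normalize_sample_id)
lab_keys = lab_ids.apply(normalize_sample_id)
print(log_keys.tolist())
print(lab_keys.tolist())
```

The normalized keys could be stored in temporary columns and used for `left_on` and `right_on`, leaving the original columns untouched.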

In [104]:
check = summary['Log Chlorophyll Filter Sample #'] == summary['Chloro Filter Sample #']
# Print the rows where the logged and lab filter sample numbers disagree (NaN never compares equal)
if not check.all():
    print(summary.loc[~check, ['Log Chlorophyll Filter Sample #','Chloro Filter Sample #']])

    Log Chlorophyll Filter Sample # Chloro Filter Sample #
0                           01 / 01                    NaN
1                           01 / 02                    NaN
2                           01 / 03                    NaN
3                           01 / 04                    NaN
4                           01 / 05                    NaN
5                           01 / 06                    NaN
6                           01 / 07                    NaN
7                           01 / 08                    NaN
8                           02 / 01                    NaN
9                           02 / 02                    NaN
10                          02 / 03                    NaN
11                          02 / 04                    NaN
12                          02 / 05                    NaN
13                          02 / 06                    NaN
14                          02 / 07                    NaN
15                          02 / 08                    N

In [None]:
summary.drop(columns=['Log Chlorophyll Brown Bottle #','Log Chlorophyll Filter Sample #',
                      'Log Chlorophyll Brown Bottle Volume','Log Chlorophyll LN Tube',
                     ], inplace = True)

In [None]:
#chl.dropna(subset=['Brown Bottle #'], inplace=True)

In [None]:
#summary = summary.merge(chl, how='outer', left_on=['Station-Cast #','Chlorophyll Brown Bottle #'], right_on=['Station-Cast #','Brown Bottle #'])

In [None]:
#summary.drop(columns=['Chlorophyll Brown Bottle #','Chlorophyll Filter Sample #','Chlorophyll LN Tube','Brown Bottle #',
#                     'Chlorophyll Brown Bottle Volume'], inplace = True)

**========================================================================================================================**
### Carbon-System Measurements
If the Carbon system measurements have not been returned yet, we will generate a synthetic DIC spreadsheet which substitutes the sample bottle numbers in place of the actual measurements.

In [None]:
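try:
    dic = pd.read_excel(dic_path,header=0)
except IsADirectoryError:
    # No carbon results yet: build a placeholder from the sample log bottle numbers.
    # The sample log columns carry the 'Log ' prefix; the cruise ID is kept for the later merge.
    dic = sample_log[['Log Cruise ID','Log Station-Cast #','Log Niskin #','Log Ph Bottle #','Log DIC/TA Bottle #']].copy()
    dic.rename(columns={
        'Log Cruise ID':'CRUISE_ID',
        'Log Station-Cast #':'CAST_NO',
        'Log Niskin #':'NISKIN_NO',
        'Log DIC/TA Bottle #':'DIC_UMOL_KG',
        'Log Ph Bottle #':'PH_TOT_MEA',
    }, inplace=True)
    columns = ['CAST_NO', 'NISKIN_NO','DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W']
    # Fill the missing DIC/TA columns with the DIC/TA bottle number and the pH columns with the pH bottle number
    for col in columns:
        if col not in dic.columns.values:
            if 'dic' in col.lower() or 'ta' in col.lower():
                dic[col] = dic['DIC_UMOL_KG']
            elif 'ph' in col.lower():
                dic[col] = dic['PH_TOT_MEA']
            else:
                dic[col] = np.nan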

In [None]:
dic.columns.values

In [None]:
#dic = dic[['CAST_NO', 'NISKIN_NO','DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       #'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W']]
#dic.rename(columns = {'DIC_UMOL_KG':'DIC [µmol/kg]','DIC_FLAG_W':'DIC Flag',
#               'TA_UMOL_KG':'Alkalinity [µmol/kg]',
 #              'TA_FLAG_W':'Alkalinity Flag',
  #             'PH_TOT_MEA':'pH [Total Scale]',
   #            'TMP_PH_DEG_C':'pH Analysis Temp [C]', 
    #           'PH_FLAG_W':'pH Flag'}, inplace=True)
# Add in the pCO2 columns, which we don't measure
dic['PCO2_UMOL_KG'] = np.nan
dic['PCO2_FLAG_W'] = np.nan
dic['TMP_PCO2_DEG_C'] = np.nan

dic.rename(columns=lambda x: 'CARBON ' + x, inplace=True)

In [None]:
dic

In [None]:
summary = summary.merge(dic, how='left', left_on=['Log Cruise ID','Log Station-Cast #','CTD Bottle Position'], right_on=['CARBON CRUISE_ID','CARBON CAST_NO','CARBON NISKIN_NO'])
#summary = summary.merge(dic, how='left', left_on='DIC/TA Bottle #', right_on='Discrete SAMPLE_ID')

In [None]:
summary[['Log Cruise ID','Log Station-Cast #','CTD Bottle Position','CARBON CRUISE_ID','CARBON CAST_NO','CARBON NISKIN_NO']]

In [None]:
summary.drop_duplicates(inplace=True)

**====================================================================================================================**
The next step is to select the desired columns from the full superset of data. We set up a list of the key columns we want for each parameter and use it to select those columns from the superset.

In [None]:
columns = ['Log Cruise ID', 'Log Station-Cast #', 'Log Target Station', 'CTD Start Latitude [degrees]', 
           'CTD Start Longitude [degrees]', 'CTD Start Time [UTC]', 'Log Bottom Depth [m]', 'CTD Filename',
           'CTD Bottle Position','CTD Date Time', 'CTD Pressure, Digiquartz [db]', 'CTD Depth [salt water, m]',
           'CTD Latitude [deg]', 'CTD Longitude [deg]', 'CTD Temperature [ITS-90, deg C]',
           'CTD Temperature, 2 [ITS-90, deg C]', 'CTD Conductivity [S/m]', 'CTD Conductivity, 2 [S/m]', 
           'CTD Salinity, Practical [PSU]', 'CTD Salinity, Practical, 2 [PSU]', 'CTD Oxygen, SBE 43 [ml/l]', 
           'CTD Oxygen Saturation, Garcia & Gordon [ml/l]', 'CTD Beam Attenuation, WET Labs C-Star [1/m]',
           'CTD Beam Transmission, WET Labs C-Star [%]', 'Oxy Oxygen [mL/L]', 'Chloro Chl (ug/l)',
           'Chloro Phaeo (ug/l)', 'Nuts Phosphate [µmol/L]', 'Nuts Silicate [µmol/L]',
           'Nuts Nitrate [µmol/L]', 'Nuts Nitrite [µmol/L]', 'Nuts Ammonium [µmol/L]',
           'Sal Salinity [psu]', 'CARBON TA_UMOL_KG', 'CARBON TA_FLAG_W', 'CARBON DIC_UMOL_KG', 'CARBON DIC_FLAG_W',
           'CARBON PCO2_UMOL_KG', 'CARBON TMP_PCO2_DEG_C', 'CARBON PCO2_FLAG_W', 'CARBON PH_TOT_MEA',
           'CARBON TMP_PH_DEG_C', 'CARBON PH_FLAG_W', 'Log Comments', 'Chloro Comments']
           

In [None]:
# Take a copy so the in-place renames below do not act on a view of summary
summary_sheet = summary[columns].copy()
summary_sheet.head()
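
Selecting with a list of names raises a `KeyError` if any requested column is absent, for example when one set of lab results has not been merged yet. The sketch below reports the missing names and uses `reindex` to build the subset anyway. The small dataframe and the `wanted` list are illustrative stand-ins for `summary` and `columns`.

```python
import pandas as pd

# Hypothetical superset with one bookkeeping column and no chlorophyll results
superset = pd.DataFrame({
    'Log Cruise ID': ['AT-27A', 'AT-27A'],
    'CTD Bottle Position': [1, 2],
    'Oxy Oxygen [mL/L]': [4.239, 4.248],
    'Oxy Case': ['A', 'A'],
})
wanted = ['Log Cruise ID', 'CTD Bottle Position', 'Oxy Oxygen [mL/L]', 'Chloro Chl (ug/l)']

# Report which requested columns are not present
missing = [col for col in wanted if col not in superset.columns]
print('Missing from superset:', missing)

# reindex keeps the requested order and fills absent columns with NaN
subset = superset.reindex(columns=wanted)
print(subset)
```

The NaN columns would later be replaced by the fill value along with every other missing entry.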

In [None]:
summary_sheet.rename(columns={'CTD Date Time':'CTD Bottle Closure'}, inplace=True)
summary_sheet.rename(columns={'Chloro Comments': 'Chl Comments'}, inplace=True)

Now, strip off the source name (i.e. Log/CTD/Sal/etc.) and replace with the appropriate name following the agreed-upon naming convention.

In [None]:
summary_sheet.rename(columns=lambda x: x.replace('Log ','').replace('CTD ','').replace('Nuts ','Discrete ').replace('Sal ','Discrete ').replace('Oxy ','Discrete ').replace('Chloro ','Discrete '),
                    inplace=True)
summary_sheet.rename(columns=lambda x: x.replace('CARBON ',''), inplace=True)
summary_sheet.rename(columns = {'DIC_UMOL_KG':'Discrete DIC [µmol/kg]','DIC_FLAG_W':'Discrete DIC Flag',
               'TA_UMOL_KG':'Discrete Alkalinity [µmol/kg]',
               'TA_FLAG_W':'Discrete Alkalinity Flag',
               'PH_TOT_MEA':'Discrete pH [Total Scale]',
               'TMP_PH_DEG_C':'Discrete pH Analysis Temp [C]', 
               'PH_FLAG_W':'Discrete pH Flag', 
               'PCO2_UMOL_KG':'Discrete pCO2 [µmol/kg]',
               'TMP_PCO2_DEG_C':'Discrete pCO2 Analysis Temp [C]',
               'PCO2_FLAG_W':'Discrete pCO2 Flag'}, inplace=True)
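
The chained `replace` calls above substitute a source tag wherever it occurs in a column name. An alternative sketch, shown below, only rewrites a tag when it is the leading prefix. `PREFIX_MAP` and `strip_source` are hypothetical names, and the mapping simply restates the substitutions used in this cell.

```python
# Source prefix -> replacement, restating the substitutions used in the notebook
PREFIX_MAP = {
    'Log ': '',
    'CTD ': '',
    'Nuts ': 'Discrete ',
    'Sal ': 'Discrete ',
    'Oxy ': 'Discrete ',
    'Chloro ': 'Discrete ',
    'CARBON ': '',
}

def strip_source(name):
    """Replace a leading source prefix; leave names without a known prefix unchanged."""
    for prefix, replacement in PREFIX_MAP.items():
        if name.startswith(prefix):
            return replacement + name[len(prefix):]
    return name

examples = ['Log Comments', 'CTD Bottle Position', 'Nuts Nitrate [µmol/L]', 'Chl Comments']
renamed = [strip_source(name) for name in examples]
print(renamed)
```

A function like this can be passed straight to `rename`, as in `summary_sheet.rename(columns=strip_source, inplace=True)`.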

In [None]:
summary_sheet.columns

In [None]:
summary_sheet.sort_values(by=['Cruise ID','Station-Cast #','Bottle Position'], inplace=True)

**========================================================================================================================**
Import the column order list, use fuzzy string matching to sort the data into that order, and save the result to a new Excel spreadsheet.

In [None]:
column_order = pd.read_excel(basepath+'column_order.xlsx')

In [None]:
column_order = tuple([x.replace('CTD','').strip() for x in column_order.columns.values])

In [None]:
column_order

In [None]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [None]:
results = {}
CTDsorted = pd.DataFrame()
for column in column_order:
    match = process.extractBests(column.replace('Discrete ','').replace('Calculated ',''),
                                 summary_sheet.columns.values, limit=2, score_cutoff=56, scorer=fuzz.ratio)
    if 'calculated' in column.lower():
        CTDsorted[column] = -9999999
    elif 'flag' in column.lower():
        if column not in ['Discrete DIC Flag','Discrete Alkalinity Flag','Discrete pCO2 Flag','Discrete pH Flag']:
            CTDsorted[column] = -9999999
        else:
            CTDsorted[column] = summary_sheet[column]
            results.update({column:match[0]})
    elif len(match) == 0:
        CTDsorted[column] = -9999999
    elif (match[0][0] not in [x[0] for x in results.values()]):
        CTDsorted[match[0][0]] = summary_sheet[match[0][0]]
        results.update({column:match[0]})
    elif len(match) == 1:
        CTDsorted[match[0][0]] = summary_sheet[match[0][0]]
        results.update({column:match[0]})
    else:
        CTDsorted[match[1][0]] = summary_sheet[match[1][0]]
        results.update({column:match[1]})
CTDsorted['Comments'] = summary_sheet['Comments']
CTDsorted['Chl Comments'] = summary_sheet['Chl Comments']
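
To see how a similarity cutoff behaves, the sketch below swaps in the standard library's `difflib` in place of `fuzzywuzzy`. It is not the matcher used above: `difflib` scores run from 0 to 1 rather than 0 to 100, and its ratio may differ from `fuzz.ratio`. The column names and the `best_matches` helper are illustrative.

```python
import difflib

def best_matches(target, candidates, limit=2, cutoff=0.56):
    """Return up to `limit` candidates whose similarity to `target` is at least `cutoff`, best first."""
    return difflib.get_close_matches(target, candidates, n=limit, cutoff=cutoff)

available = ['Station-Cast #', 'Bottle Position', 'Salinity, Practical [PSU]', 'Salinity [psu]']

niskin = best_matches('Niskin/Bottle Position', available)
salinity = best_matches('Salinity [psu]', available)
fluor = best_matches('Fluorescence [mg/m^3]', available)
print(niskin, salinity, fluor)
```

A target with no candidate above the cutoff returns an empty list, which corresponds to the `len(match) == 0` branch that fills the column with the fill value.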

In [None]:
CTDsorted.rename(columns = {'Cruise #:':'Cruise ID'}, inplace=True)
CTDsorted.sort_values(by=['Station-Cast #','Bottle Position'], inplace=True)

In [None]:
cruise_id = list(set(CTDsorted['Cruise ID'].dropna()))
CTDsorted['Cruise ID'] = CTDsorted['Cruise ID'].fillna(value=cruise_id[0])

In [None]:
cruise_name = cruise.replace('/','').split('_')[0]
# pd.datetime has been removed from pandas; take the current US/Eastern time and convert it to UTC
current_date = pd.Timestamp.now(tz='US/Eastern').tz_convert('UTC')
version = '1-02'

In [None]:
cruise_id, cruise_name

In [None]:
filename = '_'.join([cruise_name,cruise_id[0],'Discrete','Summary',current_date.strftime('%Y-%m-%d'),'ver',version,'.xlsx'])
filename
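
Joining `'.xlsx'` as one of the list elements puts an underscore in front of the extension (`..._ver_1-02_.xlsx`). A minimal sketch of a helper that joins the name parts first and appends the extension afterwards is shown below. `summary_filename` is a hypothetical function, and it uses the standard library for the UTC date.

```python
from datetime import datetime, timezone

def summary_filename(cruise_name, cruise_id, version, when=None, ext='.xlsx'):
    """Build '<cruise_name>_<cruise_id>_Discrete_Summary_<YYYY-MM-DD>_ver_<version><ext>'."""
    when = when or datetime.now(timezone.utc)
    stem = '_'.join([cruise_name, cruise_id, 'Discrete', 'Summary',
                     when.strftime('%Y-%m-%d'), 'ver', version])
    return stem + ext

# Fixed date so the example is reproducible
example = summary_filename('Pioneer-05', 'AT-31', '1-02',
                           when=datetime(2019, 7, 16, tzinfo=timezone.utc))
print(example)
```

Any later cell that reads the file back would need to use the same name, with or without the underscore before the extension.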

In [None]:
CTDsorted.fillna(value=-9999999,inplace=True)

In [None]:
CTDsorted.to_excel(basepath+array+cruise+filename)

**========================================================================================================================**


In [None]:
summary.to_csv(basepath+array+cruise+'Pioneer-05_AT-31_Discrete_Summary_2019-07-16_ver_1-02.csv')

In [None]:
#summary_file = '/home/andrew/Documents/OOI-CGSN/ooicgsn-water-sampling/Pioneer-07_AR-08_Discrete_Summary_2019-06-25_ver_1-01_.xlsx'
summary_file = basepath+array+cruise+'Pioneer-05_AT-31_Discrete_Summary_2019-07-16_ver_1-02_.xlsx'
summary_file

In [None]:
summary = pd.read_excel(summary_file)
summary.drop(columns='Unnamed: 0', inplace=True)
summary

In [None]:
cols = [x for x in summary.columns if 'flag' in x.lower()]

In [None]:
def reformat_numbers(x):
    if x == -9999999:
        return x
    else:
        x = str(x).zfill(16)
        return x

In [None]:
for col in cols:
    summary[col] = summary[col].apply(lambda x: reformat_numbers(x))
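
Flags read back from a spreadsheet may arrive as integers or floats, and `str(1000.0).zfill(16)` would keep the trailing `.0`. The sketch below converts whole-number floats to integers before padding. It assumes, as the `zfill(16)` call suggests, that a flag should be a 16-character string, and `format_flag` is a hypothetical variant of `reformat_numbers`.

```python
FILL_VALUE = -9999999

def format_flag(value, width=16):
    """Zero-pad a flag to `width` characters, leaving the fill value untouched."""
    if value == FILL_VALUE:
        return value
    # Whole-number floats such as 1000.0 become 1000 so the padded string has no '.0'
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).zfill(width)

padded = [format_flag(v) for v in [1000, 1000.0, '0000000000001000', FILL_VALUE]]
print(padded)
```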

In [None]:
summary['Discrete Alkalinity Flag']

In [None]:
filename

In [None]:
chl = pd.read_excel(chl_path+'Pioneer-03_Leg-1_KN-222_Chlorophyll_Sample_Data_2017-09-21_ver_1-00.xlsx')

In [None]:
chl

In [None]:
for ind in summary.index:
    # The Discrete Chl column currently holds the filter sample number
    chl_sample = summary.loc[ind, 'Discrete Chl (ug/l)']
    subset = chl[chl['Filter \nSample #'] == chl_sample]
    if len(subset) == 0:
        continue
    # Use the first matching row and assign with .loc to avoid chained assignment
    chloro = float(subset['Chl (ug/l)'].iloc[0])
    phaeo = float(subset['Phaeo (ug/l)'].iloc[0])
    summary.loc[ind, 'Discrete Chl (ug/l)'] = chloro
    summary.loc[ind, 'Discrete Phaeo (ug/l)'] = phaeo
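
The same lookup can be written without an explicit loop by mapping the sample numbers through a table indexed by filter sample number. This is a sketch under the assumption that each filter sample number identifies one row of the chlorophyll sheet; the dataframes and their values are made up.

```python
import pandas as pd

# Hypothetical lab sheet and a summary whose Chl column still holds sample numbers
chl_demo = pd.DataFrame({
    'Filter \nSample #': ['01 / 01', '01 / 02'],
    'Chl (ug/l)': [0.52, 1.31],
    'Phaeo (ug/l)': [0.20, 0.45],
})
summary_demo = pd.DataFrame({
    'Discrete Chl (ug/l)': ['01 / 01', '01 / 02', -9999999],
    'Discrete Phaeo (ug/l)': ['LN-1', 'LN-2', -9999999],
})

# One row per sample number, indexed for lookup
lookup = chl_demo.drop_duplicates('Filter \nSample #').set_index('Filter \nSample #')
sample_ids = summary_demo['Discrete Chl (ug/l)']

chl_values = sample_ids.map(lookup['Chl (ug/l)'])
phaeo_values = sample_ids.map(lookup['Phaeo (ug/l)'])
matched = chl_values.notna()

# Overwrite only the rows that found a lab result; other rows keep their current value
summary_demo['Discrete Phaeo (ug/l)'] = summary_demo['Discrete Phaeo (ug/l)'].mask(matched, phaeo_values)
summary_demo['Discrete Chl (ug/l)'] = sample_ids.mask(matched, chl_values)
print(summary_demo)
```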

In [None]:
summary['Discrete Phaeo (ug/l)']

In [None]:
chl['Filter \nSample #']

In [None]:
def change_cruise(x):
    # Map the lab's cruise IDs onto the IDs used in the summary sheet
    if x == 'AT27-A':
        return 'AT-27A'
    elif x == 'AT27-B':
        return 'AT-27B'
    return x

In [None]:
dic['CRUISE_ID'] = dic['CRUISE_ID'].apply(lambda x: change_cruise(x))
dic.columns

In [None]:
for ind in summary.index:
    cid = summary.loc[ind, 'Cruise ID']
    sta = summary.loc[ind, 'Station-Cast #']
    bot = summary.loc[ind, 'Bottle Position']
    # Get the subset of the dic data for this cruise, cast, and Niskin bottle
    subset = dic[(dic['CRUISE_ID'] == cid) & (dic['CAST_NO'] == sta) & (dic['NISKIN_ID'] == bot)]
    # If the subset is empty, continue; otherwise use the first matching row
    if len(subset) == 0:
        continue
    row = subset.iloc[0]
    # Now fill in the relevant data, assigning with .loc to avoid chained assignment
    # dic data
    summary.loc[ind, 'Discrete DIC [µmol/kg]'] = float(row['DIC_UMOL_KG'])
    summary.loc[ind, 'Discrete DIC Flag'] = float(row['DIC_FLAG_W'])
    # alkalinity
    summary.loc[ind, 'Discrete Alkalinity [µmol/kg]'] = float(row['TA_UMOL_KG'])
    summary.loc[ind, 'Discrete Alkalinity Flag'] = float(row['TA_FLAG_W'])
    # pH data
    summary.loc[ind, 'Discrete pH [Total Scale]'] = float(row['PH_TOT_MEA'])
    summary.loc[ind, 'Discrete pH Analysis Temp [C]'] = float(row['TMP_PH_DEG_C'])
    summary.loc[ind, 'Discrete pH Flag'] = float(row['PH_FLAG_W'])


In [None]:
summary['Discrete DIC [µmol/kg]']

In [None]:
summary.to_csv('/home/andrew/Documents/OOI-CGSN/ooicgsn-water-sampling/Pioneer-02_KN-217_Discrete_Summary_2019-07-11_ver_1-00.csv')

In [None]:
ctd_log_b = pd.read_excel(water_path+'Pioneer-11_AR-31B_CTD_Sampling_Log.xlsx',sheet_name='Summary')
ctd_log_b

In [None]:
ctd_log_c = pd.read_excel(water_path+'Pioneer-11_AR-31C_CTD_Sampling_Log.xlsx',sheet_name='Summary')
ctd_log_c['Cruise ID'] = 'AR-31C'

In [None]:
for ind in summary.index:
    cid = summary.loc[ind, 'Cruise ID']
    sta = summary.loc[ind, 'Station-Cast #']
    bot = summary.loc[ind, 'Bottle Position']
    nutnum = ctd_log_b[(ctd_log_b['Cruise ID'] == cid) & (ctd_log_b['Station-Cast #'] == sta) & (ctd_log_b['Niskin #'] == bot)]['Nitrate Bottle 1']
    nutnum = nutnum.to_list()
    # Skip bottles with no log entry or no nutrient sample recorded
    if len(nutnum) == 0 or pd.isna(nutnum[0]):
        continue
    # Store the nutrient bottle ID so it can be used as the merge key later
    summary.loc[ind, 'Discrete Nitrate [µmol/L]'] = str(nutnum[0])

In [None]:
# Repeat for the second ctd_log
for ind in summary.index:
    cid = summary.loc[ind, 'Cruise ID']
    sta = summary.loc[ind, 'Station-Cast #']
    bot = summary.loc[ind, 'Bottle Position']
    nutnum = ctd_log_c[(ctd_log_c['Cruise ID'] == cid) & (ctd_log_c['Station-Cast #'] == sta) & (ctd_log_c['Niskin #'] == bot)]['Nitrate Bottle 1']
    nutnum = nutnum.to_list()
    # Skip bottles with no log entry or no nutrient sample recorded
    if len(nutnum) == 0 or pd.isna(nutnum[0]):
        continue
    # Store the nutrient bottle ID so it can be used as the merge key later
    summary.loc[ind, 'Discrete Nitrate [µmol/L]'] = str(nutnum[0])
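
The two loops above differ only in which sampling log they search. One alternative is to concatenate the logs and do a single left merge on cruise, cast, and bottle position. The sketch below uses made-up dataframes standing in for `ctd_log_b`, `ctd_log_c`, and `summary`, and assumes each key combination appears at most once per log.

```python
import pandas as pd

# Hypothetical sampling logs for two legs and a summary holding fill values
ctd_log_b_demo = pd.DataFrame({
    'Cruise ID': ['AR-31B', 'AR-31B'], 'Station-Cast #': [1, 1],
    'Niskin #': [1, 2], 'Nitrate Bottle 1': ['1-1', None],
})
ctd_log_c_demo = pd.DataFrame({
    'Cruise ID': ['AR-31C'], 'Station-Cast #': [7],
    'Niskin #': [3], 'Nitrate Bottle 1': ['7-3'],
})
summary_demo = pd.DataFrame({
    'Cruise ID': ['AR-31B', 'AR-31B', 'AR-31C'],
    'Station-Cast #': [1, 1, 7],
    'Bottle Position': [1, 2, 3],
    'Discrete Nitrate [µmol/L]': [-9999999, -9999999, -9999999],
})

keys = ['Cruise ID', 'Station-Cast #', 'Bottle Position']

# Stack the logs, align the key name, and keep only bottles with a nutrient sample
logs = pd.concat([ctd_log_b_demo, ctd_log_c_demo], ignore_index=True)
logs = logs.rename(columns={'Niskin #': 'Bottle Position'})
logs = logs.dropna(subset=['Nitrate Bottle 1']).drop_duplicates(subset=keys)

merged = summary_demo.merge(logs[keys + ['Nitrate Bottle 1']], how='left', on=keys)
found = merged['Nitrate Bottle 1'].notna()

# Write the bottle ID only where the log had one; other rows keep the fill value
merged['Discrete Nitrate [µmol/L]'] = (
    merged['Discrete Nitrate [µmol/L]'].astype(object)
    .mask(found, merged['Nitrate Bottle 1'].astype(str))
)
merged = merged.drop(columns='Nitrate Bottle 1')
print(merged)
```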

In [None]:
def replace_cruise(x):
    if x == 'AR31-B':
        return 'AR-31B'
    elif x == 'AR31-C':
        return 'AR-31C'
    else:
        return x

In [None]:
nutrients['Cruise'] = nutrients['Cruise'].apply(lambda x: replace_cruise(x))

In [None]:
nutrients['Cruise']

In [None]:
summary_b = summary.merge(nutrients, how='left', left_on=['Cruise ID','Discrete Nitrate [µmol/L]'], right_on=['Cruise','Sample ID'])

In [None]:
summary_b['Discrete Nitrate [µmol/L]_y'].dropna()

In [None]:
summary_b.to_excel(basepath+array+cruise+summary_name)

In [None]:
nutrients_path = water_path+'Pioneer-08_AR-18_Nutrients_Sample_Data_2017-08-18_ver_1-00.xlsx'

In [None]:
nutrients = pd.read_excel(nutrients_path)

In [None]:
nutrients

In [None]:
summary = summary.merge(nutrients, how='left', left_on='Discrete Nitrate [µmol/L]', right_on='Sample ID')

In [None]:
summary.info()

In [None]:
# Copy the averaged lab values into the Discrete columns:
summary['Discrete Nitrate [µmol/L]'] = summary['Avg: Nitrate [µmol/L]']
summary['Discrete Nitrite [µmol/L]'] = summary['Avg: Nitrite [µmol/L]']
summary['Discrete Phosphate [µmol/L]'] = summary['Avg: Phosphate [µmol/L]']
summary['Discrete Ammonium [µmol/L]'] = summary['Avg: Ammonium [µmol/L]']
summary['Discrete Silicate [µmol/L]'] = summary['Avg: Silicate [µmol/L]']

In [None]:
nutrients.columns.values

In [None]:
summary.drop(columns=nutrients.columns.values, inplace=True)

In [None]:
summary.info()

In [None]:
summary.drop_duplicates(inplace=True)

In [None]:
summary.info()

In [None]:
cols = [x for x in summary.columns.values if 'flag' in x.lower()]
cols

In [None]:
summary.fillna(value=-9999999, inplace=True)

In [None]:
def fill_flags(x):
    
    if x==-9999999:
        return x
    else:
        x = str(x).zfill(16)
        return x

In [None]:
for c in cols:
    print(c)
    summary[c] = summary[c].apply(lambda x: fill_flags(x))

In [None]:
summary['Start Time [UTC]'].iloc[286][-20:]

In [None]:
summary

In [None]:
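def fix_start_time(x):
    # Keep only the last 20 characters of over-long strings; leave fill values and other non-strings untouched
    if isinstance(x, str) and len(x) > 20:
        return x[-20:]
    return x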

In [None]:
summary['Start Time [UTC]'] = summary['Start Time [UTC]'].apply(lambda x: fix_start_time(x))

In [None]:
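# Keep the first part of the cruise directory name as a string so it can be joined into the filename
cruise_name = cruise.replace('/','').split('_')[0]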
cruise_name

In [None]:
summary['Cruise ID'] = summary['Cruise ID'].fillna(value=cruise_id[0])

In [None]:
cruise_id = list(set(summary['Cruise ID'].dropna()))[0].split('-')[0]
# pd.datetime has been removed from pandas; take the current US/Eastern time and convert it to UTC
current_date = pd.Timestamp.now(tz='US/Eastern').tz_convert('UTC')
version = '1-01'

In [None]:
filename = '_'.join([cruise_name,cruise_id,'Discrete','Summary',current_date.strftime('%Y-%m-%d'),'ver',version])
filename = filename+'.csv'
filename

In [None]:
summary.to_csv(basepath+array+cruise+filename)

In [None]:
df

In [None]:
results = {}
CTDsorted = pd.DataFrame()
for column in column_order:
    match = process.extractBests(column.replace('Discrete ','').replace('Calculated ',''),
                                 df.columns.values, limit=2, score_cutoff=56, scorer=fuzz.ratio)
    if 'calculated' in column.lower():
        CTDsorted[column] = -9999999
    elif 'flag' in column.lower():
        if column not in ['Discrete DIC Flag','Discrete Alkalinity Flag','Discrete pCO2 Flag','Discrete pH Flag']:
            CTDsorted[column] = -9999999
        else:
            CTDsorted[column] = df[column]
            results.update({column:match[0]})
    elif len(match) == 0:
        CTDsorted[column] = -9999999
    elif (match[0][0] not in [x[0] for x in results.values()]):
        CTDsorted[match[0][0]] = df[match[0][0]]
        results.update({column:match[0]})
    elif len(match) == 1:
        CTDsorted[match[0][0]] = df[match[0][0]]
        results.update({column:match[0]})
    else:
        CTDsorted[match[1][0]] = df[match[1][0]]
        results.update({column:match[1]})
CTDsorted['Comments'] = df['Comments']

In [None]:
df

In [None]:
for i in df.columns.values:
    print(i)

In [None]:
summary_name_map = {}
for i,key in enumerate(column_order):
    print(key + ': ' + str(i))

In [None]:
ctd_name_map = {}
for col in df.columns.values:
    ctd_name_map.update({col: ''})
    

In [None]:
ctd_name_map = {
    'Bottle Position': 'Niskin/Bottle Position',
    'Date Time': 'Bottle Closure Time [UTC]',
    'Pressure, Digiquartz [db]': 'Pressure [db]',
    'Depth [salt water, m]': 'Depth [m]',
    'Latitude [deg]': 'Latitude [deg]',
    'Longitude [deg]': 'Longitude [deg]',
    'Temperature [ITS-90, deg C]': 'Temperature 1 [deg C]',
    'Temperature, 2 [ITS-90, deg C]': 'Temperature 2 [deg C]',
    'Conductivity [S/m]': 'Conductivity 1 [S/m]',
    'Conductivity, 2 [S/m]': 'Conductivity 2 [S/m]',
    'Salinity, Practical [PSU]': 'Salinity 1, uncorrected [psu]',
    'Salinity, Practical, 2 [PSU]': 'Salinity 2, uncorrected [psu]',
    'Oxygen raw, SBE 43 [V]': None,
    'Oxygen, SBE 43 [ml/l]': 'Oxygen, uncorrected [mL/L]',
    'Oxygen Saturation, Garcia & Gordon [ml/l]': 'Oxygen Saturation [mL/L]',
    'Beam Attenuation, WET Labs C-Star [1/m]': 'Beam Attenuation [1/m]',
    'Beam Transmission, WET Labs C-Star [%]': 'Beam Transmission [%]',
    'Filename': 'File',
    'Start Latitude [degrees]': 'Start Latitude [degrees]',
    'Start Longitude [degrees]': 'Start Longitude [degrees]',
    'Cruise': 'Cruise',
    'Start Time [UTC]': 'Start Time [UTC]',
    'Cast': 'Cast'
}

In [None]:
sample_log_map = {
    'Cruise ID': 'Cruise',
    'Station-Cast #': 'Station',
    'Target Asset': 'Target Asset',
    'Start Latitude': 'Start Latitude [degrees]',
    'Start Longitude': 'Start Longitude [degrees]',
    'Start Date': '',
    'Start Time': '',
    'Bottom Depth [m]': 'Bottom Depth at Start Position [m]',
    'Niskin #': 'Niskin/Bottle Position',
    'Rosette Position': 'Niskin/Bottle Position',
    'Date': '',
    'Time': '',
    'Trip Depth': 'Depth [m]',
    'Oxygen Bottle #': 'Discrete Oxygen [mL/L]',
    'Ph Bottle #': ['Discrete pH [Total Scale]'],
    'DIC/TA Bottle #': None,   # target column(s) not yet assigned
    'Salts Bottle #': None,    # target column(s) not yet assigned
    'Nitrate Bottle 1': None,  # target column(s) not yet assigned
}

In [None]:
column_order

In [None]:
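df2 = pd.DataFrame()
for key, new_name in ctd_name_map.items():
    # Skip source columns mapped to None, which have no counterpart in the summary sheet
    if new_name is None:
        continue
    df2[new_name] = df[key]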

In [None]:
df2

In [None]:
df2.to_excel(basepath+array+cruise+water+'Leg1summary.xlsx')
