# Bottle Processing
Author: Andrew Reed

### Motivation:
Independent verification of the suite of physical and chemical observations provided by OOI is critical if the observations are to be used in scientifically valid investigations. Consequently, CTD casts and Niskin water samples are collected during deployment and recovery of OOI platforms, vehicles, and instrumentation. The water samples are subsequently analyzed by independent labs for comparison with the OOI telemetered and recovered data.

However, the water sample data routinely collected and analyzed as part of the OOI program are not currently available in a standardized format that maps the different chemical analyses to the physical measurements taken at bottle closure. Our aim is to make these physical and chemical analyses available to the end user in a standardized format for easy comprehension and use, while preserving the source data files.

### Approach:
Generating a summary of the water sample analyses involves preprocessing and concatenating multiple data sources and accurately matching samples with each other. To do this, I first preprocess the CTD casts to generate bottle (.btl) files using the SeaBird vendor software, following the SOP available on Alfresco.

Next, the bottle files are parsed with Python, which creates a series of individual cast summary (.sum) files. These files are then loaded into pandas dataframes, appended to each other, renamed following SeaBird's naming guide, and exported as a single csv file containing all of the bottle data.

### Data Sources/Software:

* **sbe_name_map**: This is a spreadsheet which maps the short names generated by the SeaBird SBE DataProcessing Software to the associated full names. The name mapping originates from SeaBird's SBE DataProcessing support documentation.

* **Alfresco**: The Alfresco CMS for OOI at alfresco.oceanobservatories.org is the source of the CTD hex, xmlcon, and psa files needed to generate the bottle files from which the sample summary sheet is built.

* **SBEDataProcessing-Win32**: SeaBird vendor software for processing the raw CTD files and generating the .btl files.


**========================================================================================================================**
Import packages which will be used in this notebook:

In [1]:
import os, sys, re
import pandas as pd
import numpy as np

Load the name mapping for the column names based on SeaBird's manual:

In [2]:
sbe_name_map = pd.read_excel('/media/andrew/OS/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Reference_Files/seabird_ctd_name_map.xlsx')

In [3]:
sbe_name_map.head()

Unnamed: 0,Short Name,Full Name,Friendly Name,Units,Notes/Comments
0,accM,Acceleration [m/s^2],acc M,m/s^2,
1,accF,Acceleration [ft/s^2],acc F,ft/s^2,
2,altM,Altimeter [m],alt M,m,
3,altF,Altimeter [ft],alt F,ft,
4,avgsvCM,"Average Sound Velocity [Chen-Millero, m/s]",avgsv-C M,"Chen-Millero, m/s",


**========================================================================================================================**
Declare the directory paths to where the relevant information is stored:

In [142]:
basepath = '/home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/'
array = 'Pioneer/'
cruise = 'Pioneer-08_AR-18_2017-05-30/'
leg = 'Leg 3 (ar18c)/'
water = 'Water Sampling/'
ctd = 'ctd/'

In [143]:
bottle_path = basepath+array+cruise+leg+ctd
water_path = basepath+array+cruise+water
salts_and_o2_path = water_path+'Pioneer-08_AR-18C_2017-05-30_Oxygen_Salinity_Sample_Data/'
sample_log_path = water_path+'Pioneer-08_AR-18C_CTD_sampling_log.xlsx'
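
The string concatenation above only works if every component ends in a slash. As a minimal sketch, the same layout can be built with `os.path.join`, which supplies the separators itself. The directory names are copied from the cells above; `root` is a placeholder, not the real location of the data.

```python
import os

# Placeholder root; substitute the real location of the ship data.
root = os.path.join('data', 'Ship_data')
cruise_dir = os.path.join(root, 'Pioneer', 'Pioneer-08_AR-18_2017-05-30')

# os.path.join inserts the separator, so no component needs a trailing slash.
bottle_dir = os.path.join(cruise_dir, 'Leg 3 (ar18c)', 'ctd')
water_dir = os.path.join(cruise_dir, 'Water Sampling')
sample_log_file = os.path.join(water_dir, 'Pioneer-08_AR-18C_CTD_sampling_log.xlsx')
print(bottle_dir)
```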

In [144]:
# Parse the data for the start_time
def parse_header(header):
    """
    Parse the header of bottle (.btl) files to get critical information
    for the summary spreadsheet.
    
    Args:
        header - an object containing the header of the bottle file as a list of
            strings, split at the newline.
    Returns:
        hdr - a dictionary object containing the start_time, filename, latitude,
            longitude, and cruise id.
    """
    hdr = {}
    for line in header:
        if 'start_time' in line.lower():
            start_time = pd.to_datetime(re.split(r'= |\[', line)[1])
            hdr.update({'Start Time [UTC]':start_time.strftime('%Y-%m-%dT%H:%M:%SZ')})
        elif 'filename' in line.lower():
            hex_name = re.split('=',line)[1].strip()
            hdr.update({'Filename':hex_name})
        elif 'latitude' in line.lower():
            start_lat = re.split('=',line)[1].strip()
            hdr.update({'Start Latitude [degrees]':start_lat})
        elif 'longitude' in line.lower():
            start_lon = re.split('=',line)[1].strip()
            hdr.update({'Start Longitude [degrees]':start_lon})
        elif 'cruise id' in line.lower():
            cruise_id = re.split(':',line)[1].strip()
            hdr.update({'Cruise':cruise_id})
        else:
            pass
    
    return hdr
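
To make the start-time branch of `parse_header` concrete, here is a small sketch of the same split applied to a single line. The header line is a hypothetical example shaped like the `start_time` entry the function looks for; real headers may differ in detail.

```python
import re
import pandas as pd

# Hypothetical header line, shaped like the start_time entry parse_header() looks for.
line = '# start_time = Jun 15 2017 14:35:39 [NMEA time, header]'

# Split on "= " or "[" and keep the middle piece, which holds the timestamp.
raw_time = re.split(r'= |\[', line)[1].strip()
start_time = pd.to_datetime(raw_time)

# Format the timestamp the same way the summary sheet stores it.
iso_time = start_time.strftime('%Y-%m-%dT%H:%M:%SZ')
print(iso_time)
```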

Get the path to the ctd-bottle data, load it, and parse it:

In [145]:
os.listdir(bottle_path)

['ar18c_998.bl',
 'ar18c002.bl',
 'ar18c008.btl',
 'ar18c005.btl',
 'ar18c001.btl',
 'ar18c025.hex',
 'ar18c002.ros',
 'ar18c016.btl',
 'ar18c003.hdr',
 'ar18c014.ros',
 'AR18C018.XMLCON',
 'AR18C001.XMLCON',
 'ar18c014.hdr',
 'AR18C011.XMLCON',
 'ar18c009.sum',
 'ar18c008.ros',
 'ar18c010.hdr',
 'armstrong_383_2017.XMLCON',
 'ar18c018.hdr',
 'ar18c007.ros',
 'AR18C005.XMLCON',
 'ar18c002.hdr',
 'ar18c015.sum',
 'ar18c024.bl',
 'ar18c012.sum',
 'ar18c025.bl',
 'AR18C020.XMLCON',
 'AR18C006.XMLCON',
 'ar18c012.hex',
 'ar18c003.bl',
 'ar18c005.bl',
 'AR18C023.XMLCON',
 'ar18c004.bl',
 'CTD_Summary.csv',
 'ar18c011.ros',
 'ar18c001.bl',
 'ar18c016.hdr',
 'ar18c012.bl',
 'ar18c019.hex',
 'AR18C016.XMLCON',
 'ar18c020.hdr',
 'ar18c010.sum',
 'ar18c023.hex',
 'AR18C024.XMLCON',
 'ar18c_setup.psa',
 'ar18c018.bl',
 'ar18c013.hdr',
 'ar18c_998.hdr',
 'ar18c007.bl',
 'ar18c021.bl',
 'ar18c002.sum',
 'ar18c009.btl',
 'ar18c009.ros',
 'doc',
 'ar18c010.ros',
 'ar18c020.ros',
 'seasave_armstrong_2

In [146]:
# Parse each bottle (.btl) file and write a per-cast summary (.sum) file
files = [x for x in os.listdir(bottle_path) if x.endswith('.btl')]
for filename in files:
    filepath = os.path.abspath(bottle_path+filename)
    
    # Load the raw content into memory
    with open(filepath) as file:
        content = file.readlines()
    content = [x.strip() for x in content]
    
    # Now parse the file content
    header = []
    columns = []
    data = []
    for line in content:
        if line.startswith('*') or line.startswith('#'):
            header.append(line)
        else:
            try:
                float(line[0])
                data.append(line)
            except (ValueError, IndexError):
                columns.append(line)
                
    # Parse the header
    hdr = parse_header(header)
    
    # Parse the column identifiers
    column_dict = {}
    for line in columns:
        for i,x in enumerate(line.split()):
            try:
                column_dict[i] = column_dict[i] + ' ' + x
            except KeyError:
                column_dict.update({i:x})
                
    #Parse the bottle data based on the column header locations
    data_dict = {x:[] for x in column_dict.keys()}

    for line in data:
        if line.endswith('(avg)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            for i,x in enumerate(values):
                data_dict[i].append(x)
        elif line.endswith('(sdev)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            data_dict[1].append(values[0])
        else:
            pass
    
    # Join the date and time for each measurement into a single item
    data_dict[1] = [' '.join(item) for item in zip(data_dict[1][::2],data_dict[1][1::2])]
    
    # With the parsed data and column names, match up the data and column
    # based on the location
    results = {}
    for key,item in column_dict.items():
        values = data_dict[key]
        results.update({item:values})
        
    # Put the results into a dataframe
    df = pd.DataFrame.from_dict(results)

    # Now add the parsed info from the header files into the dataframe
    for key,item in hdr.items():
        df[key] = item
        
    # Get the cast number
    cast = filename[filename.index('.')-3:filename.index('.')]
    df['Cast'] = str(cast).zfill(3)
    
    # Generate a filename for the summary file
    outname = filename.split('.')[0] + '.sum'
    
    # Save the results
    df.to_csv(bottle_path+outname)
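
The parsing loop relies on each bottle record spanning two lines: an `(avg)` line that carries the bottle position, the date, and the averaged values, and an `(sdev)` line that begins with the time. The sketch below shows that pairing on a made-up record; the two lines and the `split_fields` helper are illustrative assumptions, not content from a real .btl file.

```python
import re

# Hypothetical two-line bottle record: the (avg) line carries the bottle
# position, the date, and the averaged values; the (sdev) line starts with the time.
record = [
    '1    Jun 15 2017     34.2113      5.183  (avg)',
    '14:35:39      0.0002      0.001  (sdev)',
]

def split_fields(line):
    """Split on tabs or runs of two or more spaces so 'Jun 15 2017' stays whole."""
    return [field for field in re.split(r'\t|\s{2,}', line.strip()) if field]

avg_fields = split_fields(record[0])
sdev_fields = split_fields(record[1])

# The date comes from the (avg) line and the time from the (sdev) line.
date_time = ' '.join([avg_fields[1], sdev_fields[0]])
print(date_time)
```

Splitting on two or more spaces is what keeps the single-spaced date together as one field.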
    


In [147]:
# Now, for each "summary" file, load it and concatenate the results.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
sum_files = [f for f in os.listdir(bottle_path) if f.endswith('.sum')]
df = pd.concat([pd.read_csv(bottle_path+f) for f in sum_files])

In [148]:
sbe_name_map['Short Name'].apply(lambda x: str(x).lower());

In [149]:
# Rename the column title using the sbe_name_mapping 
for colname in list(df.columns.values):
    try:
        fullname = list(sbe_name_map[sbe_name_map['Short Name'].apply(lambda x: str(x).lower() == colname.lower()) == True]['Full Name'])[0]
        df.rename({colname:fullname},axis='columns',inplace=True)
    except:
        pass
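
The rename loop above filters the whole name map once per column. One alternative, sketched here under the assumption that short names are unique once lower-cased, is to build a single lookup dictionary and pass a function to `rename`. The two-row `name_map` and the `bottle_df` columns are stand-ins for illustration.

```python
import pandas as pd

# Tiny stand-in for the name map; the rows mirror the spreadsheet preview above.
name_map = pd.DataFrame({
    'Short Name': ['accM', 'altM'],
    'Full Name': ['Acceleration [m/s^2]', 'Altimeter [m]'],
})

# Build one lower-cased lookup table instead of filtering the map per column.
lookup = dict(zip(name_map['Short Name'].str.lower(), name_map['Full Name']))

# Hypothetical bottle columns; 'Cast' has no entry in the map and is left alone.
bottle_df = pd.DataFrame({'AltM': [12.5], 'Cast': ['001']})
renamed = bottle_df.rename(columns=lambda col: lookup.get(col.lower(), col))
print(list(renamed.columns))
```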

In [150]:
df.sort_values(by=['Cast','Bottle Position'], inplace=True)
df.drop(columns='Unnamed: 0',inplace=True)
bottles = df

In [151]:
df.to_csv(bottle_path+'CTD_Summary.csv')

**========================================================================================================================**
### Process the Discrete Salinity and Oxygen Data
Next, I process the discrete salinity and oxygen sample data so that it is consistently named and ready to be merged with the existing data sets.

In [152]:
def clean_sal_files(dirpath):

    # Run check if files are held in excel format or csvs
    csv_flag = any(files.endswith('.SAL') for files in os.listdir(dirpath))
    if csv_flag:
        for filename in os.listdir(dirpath):
            sample = []
            salinity = []
            if filename.endswith('.SAL'):
                with open(dirpath+filename) as file:
                    data = file.readlines()
                    for ind1,line in enumerate(data):
                        if ind1 == 0:
                            strs = data[0].replace('"','').split(',')
                            cruisename = strs[0]
                            station = strs[1]
                            cast = strs[2]
                            case = strs[8]
                        elif int(line.split()[0]) == 0:
                            pass
                        else:
                            strs = line.split()
                            sample.append(strs[0])
                            salinity.append(strs[2])
                
                    # Generate a pandas dataframe to populate data
                    data_dict = {'Cruise':cruisename,'Station':station,'Cast':cast,'Case':case,'Sample ID':sample,'Salinity [psu]':salinity}
                    df = pd.DataFrame.from_dict(data_dict)
                    # Strip the dot from the file name only, not from the whole path
                    df.to_csv(dirpath+filename.replace('.','')+'.csv')
            else:
                pass
    
    else:
        # If the files are already in excel spreadsheets, they've been cleaned into a
        # logical tabular format
        pass
    

def process_sal_files(dirpath):
    
    # Check if the files are excel files or not
    excel_flag = any(files.endswith('SAL.xlsx') for files in os.listdir(dirpath))
    # Initialize a dataframe for processing the salinity files
    df = pd.DataFrame()
    if excel_flag:
        # Read every salinity workbook and stack them into one dataframe
        df = pd.concat([pd.read_excel(dirpath+file) for file in os.listdir(dirpath)
                        if 'SAL.xlsx' in file])
        df.rename({'Sample':'Sample ID','Salinity':'Salinity [psu]','Niskin #':'Niskin','Case ID':'Case'}, 
                  axis='columns',inplace=True)
        df.dropna(inplace=True)
        df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
        df['Niskin'] = df['Niskin'].apply(lambda x: str( int(x)))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    else:
        # Read every cleaned salinity csv and stack them into one dataframe
        df = pd.concat([pd.read_csv(dirpath+file) for file in os.listdir(dirpath)
                        if 'SAL.csv' in file])
        df.dropna(inplace=True)
        df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
        df.drop(columns=[x for x in list(df.columns.values) if 'unnamed' in x.lower()],inplace=True)

    # Save the processed summary file for salinity
    df.to_csv(dirpath+'SAL_Summary.csv')
    
    
def process_oxy_files(dirpath):
    # Read every oxygen workbook and stack them into one dataframe
    frames = [pd.read_excel(dirpath+filename) for filename in os.listdir(dirpath)
              if 'oxy' in filename.lower() and filename.endswith('.xlsx')]
    df = pd.concat(frames)
    # Rename and clean up the oxygen data to be uniform across data sets
    df.rename({'Niskin #':'Niskin','Sample#':'Sample ID','Oxy':'Oxygen [mL/L]','Unit':'Units'},
              axis='columns',inplace=True)
    df.dropna(inplace=True)
    df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
    df['Niskin'] = df['Niskin'].apply(lambda x: str( int(x)))
    df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    df['Cruise'] = df['Cruise'].apply(lambda x: x.replace('O','0'))
    
    # Save the processed summary file for oxygen
    df.to_csv(dirpath+'OXY_Summary.csv')
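
The `str(int(x)).zfill(3)` pattern in these functions exists because whole numbers read from Excel often arrive as floats. A short sketch of the same normalization with vectorized string methods follows; the `lab` dataframe is a made-up example.

```python
import pandas as pd

# Hypothetical lab sheet: Excel often returns whole numbers as floats.
lab = pd.DataFrame({'Station': [1.0, 4.0, 12.0], 'Niskin': [1.0, 2.0, 10.0]})

# Cast to int first to drop the ".0", then pad the station to three digits.
lab['Station'] = lab['Station'].astype(int).astype(str).str.zfill(3)
lab['Niskin'] = lab['Niskin'].astype(int).astype(str)
print(lab['Station'].tolist(), lab['Niskin'].tolist())
```

Note that `pd.read_csv` parses a value such as `001` back to the integer 1 unless a string `dtype` is requested, so the padding does not survive a round trip through the summary csv files.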

In [153]:
os.listdir(salts_and_o2_path)

['004.SAL',
 'SAL_Summary.csv',
 '001OXY.xlsx',
 'OXY_Summary.csv',
 '004OXY.xlsx',
 '003SAL.csv',
 '004SAL.csv',
 '001SAL.xlsx',
 '004SAL.xlsx',
 '002OXY.xlsx',
 '002SAL.xlsx',
 '002.SAL',
 '002SAL.csv',
 '003SAL.xlsx',
 '003OXY.xlsx',
 '001SAL.csv',
 '003.SAL',
 '001.SAL']

In [154]:
# Now process the salts and oxygen data
    # Clean the salinity
clean_sal_files(salts_and_o2_path)
    # Process the salinity files
process_sal_files(salts_and_o2_path)
    # Process the oxygen files
process_oxy_files(salts_and_o2_path)

In [155]:
sal = pd.read_csv(salts_and_o2_path+'SAL_Summary.csv')
sal.drop(columns='Unnamed: 0', inplace=True)

In [156]:
sal

Unnamed: 0,Cruise,Station,Niskin,Case,Sample ID,Salinity [psu],Unit
0,AR18-C,1,1,B,1,34.2113,psu
1,AR18-C,1,2,B,2,34.2105,psu
2,AR18-C,1,3,B,3,33.486,psu
3,AR18-C,1,4,B,4,33.4699,psu
4,AR18-C,1,5,B,5,32.4733,psu
5,AR18-C,1,6,B,6,32.4696,psu
6,AR18-C,1,7,B,7,32.3755,psu
7,AR18-C,1,8,B,8,32.39,psu
8,AR18-C,4,1,B,1,33.9083,psu
9,AR18-C,4,2,B,2,33.9087,psu


In [157]:
oxy = pd.read_csv(salts_and_o2_path+'OXY_Summary.csv')
oxy.drop(columns='Unnamed: 0', inplace=True)

In [158]:
oxy

Unnamed: 0,Cruise,Station,Niskin,Case,Sample ID,Oxygen [mL/L],Units
0,AR18-C,1,1,F,1,5.183,mL/L
1,AR18-C,1,2,F,2,5.181,mL/L
2,AR18-C,1,3,F,3,5.934,mL/L
3,AR18-C,1,4,F,4,5.958,mL/L
4,AR18-C,1,5,F,5,7.092,mL/L
5,AR18-C,1,6,F,6,7.148,mL/L
6,AR18-C,1,7,F,7,6.646,mL/L
7,AR18-C,1,8,F,8,6.676,mL/L
8,AT18-C,4,1,T,1,5.332,mL/L
9,AT18-C,4,2,T,2,5.329,mL/L


**========================================================================================================================**
### CTD Sampling Log
Load in the CTD sampling log summary sheet. The summary sheet needs to be manually created and the data cleaned before attempting to import. Additionally, ensure that there is only one header line and that it is at the top of the file.

In [159]:
os.listdir(water_path)

['Pioneer-08_AR-18_2017-05-30_Nutrients_Sample_Data_2017-08-18_ver_1-00.xlsx',
 'Pioneer-08_AR-18A_2017-05-30_Oxygen_Salinity_Sample_Data',
 'Pioneer-08_AR-18C_2017-05-30_Oxygen_Salinity_Sample_Data',
 'Pioneer-08_AR-18B_CTD_Sampling_Log.xlsx',
 'Pioneer-08_AR-18C_CTD_sampling_log.xlsx',
 'Pioneer-08_AR-18A_CTD_Sampling_Log.xlsx',
 'Pioneer-08_AR-18B_2017-05-30_Oxygen_Salinity_Sample_Data']

In [160]:
sample_log = pd.read_excel(sample_log_path,sheet_name='Summary',header=0)
sample_log.sort_values(by=['Station-Cast #','Niskin #'])

Unnamed: 0,Cruise ID,Station-Cast #,Target Asset,Start Latitude,Start Longitude,Start Date,Start Time,Bottom Depth [m],Date,Niskin #,...,Ph Bottle #,DIC/TA Bottle #,Salts Bottle #,Nitrate Bottle 1,Chlorophyll Brown Bottle #,Chlorophyll Filter Sample #,Unnamed: 19,Chlorophyll Brown Bottle Volume,Chlorophyll LN Tube,Comments
0,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,1.0,...,1029.0,1030.0,B1,1-1.,,,,,,
1,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,2.0,...,,1031.0,B2,1-2.,,,,,,Duplicate DIC/TA
2,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,3.0,...,,1032.0,B3,1-3.,1.0,01/01.,,539.0,,
3,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,4.0,...,,,B4,,2.0,01/02.,,539.0,,
4,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,5.0,...,,1033.0,B5,1-4.,3.0,01/03.,,539.0,,chl max
5,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,6.0,...,,,B6,1-5.,4.0,01/04.,,539.0,,"Duplicated O2, S, N, chl"
6,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,7.0,...,1034.0,1035.0,B7,1-6.,5.0,01/05.,,539.0,,
7,AR18-C,1,OSPM,40 21.800' N,70 52.999' W,2017-06-15,14:35:00,94,2017-06-15,8.0,...,1036.0,,B8,,6.0,01/06.,,539.0,,Duplicate pH
8,AR18-C,2,PMCO,40 05.904' N,70 52.998' W,2017-06-16,10:16:00,148,2017-06-16,1.0,...,1037.0,1038.0,S1,2-1.,,,,,,
9,AR18-C,2,PMCO,40 05.904' N,70 52.998' W,2017-06-16,10:16:00,148,2017-06-16,2.0,...,,,S2,,,,,,,


In [161]:
def strip_x(x):
    """Remove periods from string entries; leave non-string values (e.g. NaN) unchanged."""
    if isinstance(x, str):
        return x.replace('.','')
    return x

In [162]:
sample_log['Nitrate Bottle 1'] = sample_log['Nitrate Bottle 1'].apply(lambda x: strip_x(x))
sample_log['Start Date'] = sample_log['Start Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
sample_log['Start Time'] = sample_log['Start Time'].apply(lambda x: x.strftime('%H:%M:%S'))
sample_log['Start Time'] = sample_log['Start Date'] + 'T' + sample_log['Start Time'] + 'Z'
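
As a worked example of the timestamp construction in this cell, the sketch below builds the same ISO 8601 strings from a two-row log. The rows are hypothetical, and it assumes the date column holds timestamps and the time column holds `datetime.time` objects, which is what the `strftime` calls above imply.

```python
import datetime
import pandas as pd

# Hypothetical log rows with a timestamp date column and a datetime.time column.
log = pd.DataFrame({
    'Start Date': [pd.Timestamp('2017-06-15'), pd.Timestamp('2017-06-16')],
    'Start Time': [datetime.time(14, 35), datetime.time(10, 16)],
})

# Format each part as text, then join them into a single ISO 8601 UTC string.
date_text = log['Start Date'].dt.strftime('%Y-%m-%d')
time_text = log['Start Time'].apply(lambda t: t.strftime('%H:%M:%S'))
log['Start Time [UTC]'] = date_text + 'T' + time_text + 'Z'
print(log['Start Time [UTC]'].tolist())
```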

**========================================================================================================================**
### Merge the CTD-Bottle Data and Sample Log
The next step is to merge the CTD-Bottle data with the sample log using an outer merge on the cast and Niskin/bottle position. An outer merge retains every row from both tables, so that we do not accidentally discard either data-only casts or casts not recorded on the sample logs.

In [163]:
summary = bottles.merge(sample_log, how='outer', right_on=['Station-Cast #','Niskin #'], left_on=['Cast','Bottle Position'])
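
To see which rows an outer merge keeps from each side, pandas offers the `indicator` option. The sketch below uses two miniature, made-up tables (`bottles_demo` and `log_demo`) rather than the real data.

```python
import pandas as pd

# Hypothetical miniature inputs: cast 2 was logged but has no bottle file,
# and cast 3 has bottle data but never made it onto the log.
bottles_demo = pd.DataFrame({'Cast': [1, 1, 3], 'Bottle Position': [1, 2, 1],
                             'Depth [salt water, m]': [90.0, 60.0, 45.0]})
log_demo = pd.DataFrame({'Station-Cast #': [1, 1, 2], 'Niskin #': [1, 2, 1],
                         'Target Asset': ['OSPM', 'OSPM', 'PMCO']})

merged = bottles_demo.merge(
    log_demo, how='outer', indicator=True,
    left_on=['Cast', 'Bottle Position'], right_on=['Station-Cast #', 'Niskin #'])

# _merge is 'both', 'left_only' (bottle data only), or 'right_only' (log only).
counts = merged['_merge'].value_counts()
print(counts)
```

Rows flagged `left_only` or `right_only` are the ones whose missing fields need filling from the other source.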

Fill in missing data based on the sample log info:

In [164]:
summary['Start Latitude [degrees]'] = summary['Start Latitude [degrees]'].fillna(value=summary['Start Latitude'])
summary['Start Longitude [degrees]'] = summary['Start Longitude [degrees]'].fillna(value=summary['Start Longitude'])
# 'Start Time' already holds the full ISO timestamp built from the sample log above
summary['Start Time [UTC]'] = summary['Start Time [UTC]'].fillna(value=summary['Start Time'])
summary['Station-Cast #'] = summary['Station-Cast #'].fillna(value=summary['Cast'])
summary['Bottle Position'] = summary['Bottle Position'].fillna(value=summary['Niskin #']);
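
Passing a Series to `fillna` fills each gap with the value from the same row of that Series, because pandas aligns on the index. A minimal sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical merged rows: the second row came only from the sample log.
demo = pd.DataFrame({
    'Bottle Position': [1.0, np.nan],
    'Niskin #': [1.0, 2.0],
})

# fillna with a Series aligns on the row index, so each gap takes the value from its own row.
demo['Bottle Position'] = demo['Bottle Position'].fillna(demo['Niskin #'])
print(demo['Bottle Position'].tolist())
```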

Eliminate the redundant columns:

In [165]:
summary.drop(columns=['Start Latitude','Start Longitude','Start Date','Start Time','Cast',
                      'Niskin #','Date','Time','Trip Depth'], inplace=True)

**========================================================================================================================**
Merge the discrete salinity and oxygen data into the summary based on the cast and Niskin number. Do not use the sample bottle number: it is not stored in the processed discrete data we get back from the labs.

In [166]:
summary = summary.merge(sal, how='left', left_on=['Station-Cast #','Bottle Position'], right_on=['Station','Niskin'] )
summary['Salinity [psu]'] = summary['Salinity [psu]'].fillna(value=summary['Salts Bottle #'])
summary.rename(columns={'Salinity [psu]': 'Discrete Salinity [psu]'}, inplace=True)
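
A left merge silently multiplies rows when the lab data repeats a (cast, Niskin) pair. One optional guard, sketched here with made-up data, is the `validate` argument of `merge`, which raises an error if the keys on the right side are not unique.

```python
import pandas as pd

# Hypothetical data: the lab sheet lists cast 1, Niskin 1 twice.
summary_demo = pd.DataFrame({'Station-Cast #': [1, 1], 'Bottle Position': [1, 2]})
sal_demo = pd.DataFrame({'Station': [1, 1, 1], 'Niskin': [1, 1, 2],
                         'Salinity [psu]': [34.2113, 34.2105, 33.4860]})

try:
    summary_demo.merge(sal_demo, how='left', validate='many_to_one',
                       left_on=['Station-Cast #', 'Bottle Position'],
                       right_on=['Station', 'Niskin'])
    duplicate_keys = False
except pd.errors.MergeError:
    # Raised because (Station, Niskin) is not unique in the lab data.
    duplicate_keys = True
print(duplicate_keys)
```

Whether duplicates should stop the run or be resolved by hand is a judgment call; the check only makes them visible.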

Drop the unnecessary or extraneous columns:

In [167]:
summary.drop(columns=['Cruise','Station','Niskin','Case', 'Sample ID', 'Unit', 'Salts Bottle #'], inplace=True)

Oxygen data:

In [168]:
summary = summary.merge(oxy, how='left', left_on=['Station-Cast #','Bottle Position'], right_on=['Station','Niskin'] )
summary['Oxygen [mL/L]'] =  summary['Oxygen [mL/L]'].fillna(value=summary[' Oxygen Bottle #'])
summary.rename(columns={'Oxygen [mL/L]':'Discrete Oxygen [mL/L]'}, inplace=True)

In [169]:
summary.drop(columns=['Cruise','Station','Niskin','Case', 'Sample ID', 'Units', ' Oxygen Bottle #'], inplace=True)

**========================================================================================================================**
### Nutrients Data
Load the nutrients data (if it exists) and merge with the summary sheet. If the nutrients data has not been returned yet, we fill in the relevant columns with the data from the sampling logs.

In [170]:
nutrients_path = basepath+array+cruise+water

In [171]:
try:
    nutrients = pd.read_excel(nutrients_path,header=0)
    nutrients
except IsADirectoryError:
    nutrients = pd.DataFrame(data=sample_log['Nitrate Bottle 1'])
    nutrients.rename(columns={'Nitrate Bottle 1':'Sample ID'}, inplace=True)
    columns = ['Sample ID','Cruise','Avg: Nitrate + Nitrite [µmol/L]','Avg: Ammonium [µmol/L]',
               'Avg: Phosphate [µmol/L]','Avg: Silicate [µmol/L]','Avg: Nitrite [µmol/L]','Avg: Nitrate [µmol/L]']
    for col in columns:
        if col not in nutrients.columns.values:
            nutrients[col] = nutrients['Sample ID']

In [172]:
nutrients.dropna(inplace=True)

In [173]:
nutrients.rename(columns=lambda x: x.replace('Avg:', 'Discrete'), inplace=True)

In [174]:
summary = summary.merge(nutrients, how='left', left_on='Nitrate Bottle 1', right_on='Sample ID')

In [175]:
summary.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73 entries, 0 to 72
Data columns (total 44 columns):
Bottle Position                              52 non-null float64
Date Time                                    52 non-null object
Pressure, Digiquartz [db]                    52 non-null float64
Depth [salt water, m]                        52 non-null float64
Latitude [deg]                               52 non-null float64
Longitude [deg]                              52 non-null float64
Temperature [ITS-90, deg C]                  52 non-null float64
Temperature, 2 [ITS-90, deg C]               52 non-null float64
Conductivity [S/m]                           52 non-null float64
Conductivity, 2 [S/m]                        52 non-null float64
Salinity, Practical [PSU]                    52 non-null float64
Salinity, Practical, 2 [PSU]                 52 non-null float64
Oxygen raw, SBE 43 [V]                       52 non-null float64
Oxygen, SBE 43 [ml/l]                        52 non-n

In [176]:
summary.drop(columns=['Sample ID','Cruise','Nitrate Bottle 1'], inplace=True)

**========================================================================================================================**
### Chlorophyll Data
If the Chlorophyll measurements have not been returned yet, we will generate a synthetic chlorophyll spreadsheet which substitutes the sample bottle numbers in place of the actual measurements. One complication is that the Chlorophyll sample # column title is not identical between cruises.
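
One way to cope with column titles that vary between cruises is to look the column up by keywords instead of hard-coding its name. The sketch below is only an illustration: `find_column` is a hypothetical helper, and the second log's column title is invented.

```python
import pandas as pd

def find_column(df, *keywords):
    """Return the first column whose lower-cased name contains every keyword, else None."""
    for col in df.columns:
        name = str(col).lower()
        if all(word.lower() in name for word in keywords):
            return col
    return None

# Two hypothetical logs that title the same column differently.
log_a = pd.DataFrame(columns=['Niskin #', 'Chlorophyll Filter Sample #'])
log_b = pd.DataFrame(columns=['Niskin #', 'Chl Filter sample number'])

print(find_column(log_a, 'chl', 'filter'), '|', find_column(log_b, 'chl', 'filter'))
```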

In [177]:
chl_path = water_path+''

In [178]:
try:
    chl = pd.read_excel(chl_path)
    chl.head()
except IsADirectoryError:
    # If there is no chlorophyll sheet yet, need to copy the bottle data into the final sample log.
    # Take an explicit copy so the rename and dropna below act on an independent dataframe.
    chl = sample_log[['Station-Cast #','Chlorophyll Brown Bottle #','Chlorophyll Filter Sample #','Chlorophyll LN Tube']].copy()
    chl.rename(columns={
        'Chlorophyll Brown Bottle #': 'Brown Bottle #',
        'Chlorophyll Filter Sample #': 'Discrete Chl (ug/l)',
        'Chlorophyll LN Tube':'Discrete Phaeo (ug/l)'
    }, inplace=True)


In [179]:
chl.dropna(subset=['Brown Bottle #'], inplace=True)


In [180]:
summary = summary.merge(chl, how='left', left_on=['Station-Cast #','Chlorophyll Brown Bottle #'], right_on=['Station-Cast #','Brown Bottle #'])

In [181]:
summary.drop(columns=['Chlorophyll Brown Bottle #','Chlorophyll Filter Sample #','Chlorophyll LN Tube','Brown Bottle #',
                     'Chlorophyll Brown Bottle Volume'], inplace = True)

**========================================================================================================================**
### Carbon-System Measurements
If the Carbon system measurements have not been returned yet, we will generate a synthetic DIC spreadsheet which substitutes the sample bottle numbers in place of the actual measurements.

In [182]:
dic_path = water_path + ''

In [183]:
try:
    dic = pd.read_excel(dic_path,header=0)
    dic
except IsADirectoryError:
    # Take an explicit copy so the columns added below do not write into a slice of sample_log
    dic = sample_log[['Station-Cast #','Niskin #','Ph Bottle #','DIC/TA Bottle #']].copy()
    dic.rename(columns={
        'Station-Cast #':'CAST_NO',
        'Niskin #':'NISKIN_NO',
        'DIC/TA Bottle #':'DIC_UMOL_KG',
        'Ph Bottle #':'PH_TOT_MEA',
    }, inplace=True)
    columns = ['CAST_NO', 'NISKIN_NO','DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W']
    for col in columns:
        if col not in dic.columns.values:
            if 'dic' in col.lower() or 'ta' in col.lower():
                dic[col] = dic['DIC_UMOL_KG']
            elif 'ph' in col.lower():
                dic[col] = dic['PH_TOT_MEA']
            else:
                dic[col] = np.nan


In [184]:
dic = dic[['CAST_NO', 'NISKIN_NO','DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W']]
dic.rename(columns = {'DIC_UMOL_KG':'DIC [µmol/kg]',
               'DIC_FLAG_W':'DIC Flag',
               'TA_UMOL_KG':'Alkalinity [µmol/kg]',
               'TA_FLAG_W':'Alkalinity Flag',
               'PH_TOT_MEA':'pH [Total Scale]',
               'TMP_PH_DEG_C':'pH Analysis Temp [C]', 
              'PH_FLAG_W':'pH Flag'}, inplace=True)
# Add in the pCO2 columns, which we don't measure
dic['pCO2'] = np.nan
dic['pCO2 Flag'] = np.nan
dic['pCO2 Analysis Temp [C]'] = np.nan

dic.rename(columns=lambda x: 'Discrete ' + x, inplace=True)

In [185]:
summary = summary.merge(dic, how='left', left_on=['Station-Cast #','Bottle Position'], right_on=['Discrete CAST_NO','Discrete NISKIN_NO'])

In [186]:
summary.drop(columns=['Ph Bottle #','DIC/TA Bottle #','Discrete CAST_NO','Discrete NISKIN_NO'], inplace=True)

In [187]:
summary.rename(columns={'Date Time':'Bottle Closure'}, inplace=True)

In [188]:
summary.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73 entries, 0 to 72
Data columns (total 47 columns):
Bottle Position                              52 non-null float64
Bottle Closure                               52 non-null object
Pressure, Digiquartz [db]                    52 non-null float64
Depth [salt water, m]                        52 non-null float64
Latitude [deg]                               52 non-null float64
Longitude [deg]                              52 non-null float64
Temperature [ITS-90, deg C]                  52 non-null float64
Temperature, 2 [ITS-90, deg C]               52 non-null float64
Conductivity [S/m]                           52 non-null float64
Conductivity, 2 [S/m]                        52 non-null float64
Salinity, Practical [PSU]                    52 non-null float64
Salinity, Practical, 2 [PSU]                 52 non-null float64
Oxygen raw, SBE 43 [V]                       52 non-null float64
Oxygen, SBE 43 [ml/l]                        52 non-n

**========================================================================================================================**
Import the column order list, use fuzzy string matching to sort the columns, and save the data to a new Excel spreadsheet.

In [189]:
column_order = pd.read_excel(basepath+'column_order.xlsx')

In [190]:
column_order = tuple([x.replace('CTD','').strip() for x in column_order.columns.values])

In [191]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [192]:
results = {}
CTDsorted = pd.DataFrame()
for column in column_order:
    match = process.extractBests(column.replace('Discrete ','').replace('Calculated ',''),
                                 summary.columns.values, limit=2, score_cutoff=56, scorer=fuzz.ratio)
    if 'calculated' in column.lower():
        CTDsorted[column] = -9999999
    elif 'flag' in column.lower():
        if column not in ['Discrete DIC Flag','Discrete Alkalinity Flag','Discrete pCO2 Flag','Discrete pH Flag']:
            CTDsorted[column] = -9999999
        else:
            CTDsorted[column] = summary[column]
            results.update({column:match[0]})
    elif len(match) == 0:
        CTDsorted[column] = -9999999
    elif (match[0][0] not in [x[0] for x in results.values()]):
        CTDsorted[match[0][0]] = summary[match[0][0]]
        results.update({column:match[0]})
    elif len(match) == 1:
        CTDsorted[match[0][0]] = summary[match[0][0]]
        results.update({column:match[0]})
    else:
        CTDsorted[match[1][0]] = summary[match[1][0]]
        results.update({column:match[1]})
CTDsorted['Comments'] = summary['Comments']
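
The matching step can be illustrated without the fuzzywuzzy dependency. The sketch below swaps in the standard library's `difflib`, which scores similarity on a 0 to 1 scale, so the cutoff of 0.56 only roughly corresponds to the `score_cutoff=56` used above. The `available` list, `best_match` helper, and target names are assumptions for illustration.

```python
import difflib

# Columns actually present in the merged table (a small hypothetical subset).
available = ['Pressure, Digiquartz [db]', 'Depth [salt water, m]', 'Bottle Position']
FILL_VALUE = -9999999

def best_match(target, candidates, cutoff=0.56):
    """Return the closest candidate by SequenceMatcher ratio, or None below the cutoff."""
    matches = difflib.get_close_matches(target, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Targets with no sufficiently close match fall back to the fill value.
for wanted in ['Pressure [db]', 'Cast Flag']:
    found = best_match(wanted, available)
    print(wanted, '->', found if found is not None else FILL_VALUE)
```

Because fuzzy matches can pair the wrong columns, it is worth reviewing the resulting mapping by eye before writing the spreadsheet.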

In [193]:
cruise_id = list(set(CTDsorted['Cruise ID'].dropna()))
CTDsorted['Cruise ID'] = CTDsorted['Cruise ID'].fillna(value=cruise_id[0])

In [194]:
cruise_name = cruise.replace('/','')
# Current time in UTC; pd.datetime is no longer available in recent pandas releases
current_date = pd.Timestamp.now(tz='UTC')
version = '1-01'

In [195]:
cruise_id

['AR18-C']

In [196]:
filename = '_'.join([cruise_name,cruise_id[0],'Discrete','Summary',current_date.strftime('%Y-%m-%d'),'ver',version,'.xlsx'])
filename

'Pioneer-08_AR-18_2017-05-30_AR18-C_Discrete_Summary_2019-06-25_ver_1-01_.xlsx'

In [197]:
CTDsorted.drop_duplicates(inplace=True)

In [198]:
CTDsorted

Unnamed: 0,Cruise ID,Station-Cast #,Target Asset,Start Latitude [degrees],Start Longitude [degrees],Start Time [UTC],Cast,Cast Flag,Bottom Depth [m],Filename,...,Calculated Alkalinity [µmol/kg],Calculated DIC [µmol/kg],Calculated pCO2 [µatm],Calculated pH,Calculated CO2aq [µmol/kg],Calculated bicarb [µmol/kg],Calculated CO3 [µmol/kg],Calculated Omega-C,Calculated Omega-A,Comments
0,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,
1,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,Duplicate DIC/TA
2,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,
3,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,
4,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,chl max
5,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,"Duplicated O2, S, N, chl"
6,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,
7,AR18-C,1.0,OSPM,40 21.80 N,070 53.00 W,2017-06-15T14:35:39Z,-9999999,-9999999,94.0,D:\Data\ar18c001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,Duplicate pH
8,AR18-C,2.0,PMCO,40 05.90 N,070 53.00 W,2017-06-16T10:17:07Z,-9999999,-9999999,148.0,D:\Data\ar18c002.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,
9,AR18-C,2.0,PMCO,40 05.90 N,070 53.00 W,2017-06-16T10:17:07Z,-9999999,-9999999,148.0,D:\Data\ar18c002.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,


In [199]:
CTDsorted.to_excel(basepath+array+cruise+filename)

In [76]:
os.listdir(basepath+array+cruise)

['Pioneer-08_Leg-1_AR18-A_Discrete_Summary_2019-03-13_ver_1-00_.xlsx',
 'Leg 2 (ar18b)',
 'Pioneer-08_Leg_3_AR18-C_Discrete_Summary_2019-06-13_ver_1-01_.xlsx',
 'Pioneer-08_AR-18_2017-05-30_AR18-A_Discrete_Summary_2019-06-25_ver_1-01_.xlsx',
 'Pioneer-08_Leg_1_AR18-A_Discrete_Summary_2019-06-13_ver_1-01_.xlsx',
 'Pioneer-08_AR18-C_Discrete_Summary_2019-06-21_ver_1-01_.xlsx',
 'Leg 1 (ar18a)',
 'Leg 3 (ar18c)',
 'Pioneer-08_AR18_Discrete_Summary_2019-06-21_ver_1-01_.xlsx',
 'Pioneer-08_Leg_2_AR18-B_Discrete_Summary_2019-06-13_ver_1-01_.xlsx',
 'Pioneer-08_AR18-B_Discrete_Summary_2019-06-21_ver_1-01_.xlsx',
 'Pioneer-08_Leg-3_AR18-C_Discrete_Summary_2019-03-13_ver_1-00_.xlsx',
 'Water Sampling',
 'Pioneer-08_Leg-2_AR18-B_Discrete_Summary_2019-03-13_ver_1-00_.xlsx',
 'Pioneer-08_AR18-A_Discrete_Summary_2019-06-21_ver_1-01_.xlsx']
