# Bottle Processing
Author: Andrew Reed

### Motivation:
Independent verification of the suite of physical and chemical observations provided by OOI is critical for the observations to be of use in scientifically valid investigations. Consequently, CTD casts and Niskin water samples are collected during deployment and recovery of OOI platforms, vehicles, and instrumentation. The water samples are subsequently analyzed by independent labs for comparison with the OOI telemetered and recovered data.

However, the water sample data routinely collected and analyzed as part of the OOI program are not currently available in a standardized format that maps the different chemical analyses to the physical measurements taken at bottle closure. Our aim is to make these physical and chemical analyses of collected water samples available to the end-user in a standardized format for easy comprehension and use, while maintaining the source data files.

### Approach:
Generating a summary of the water sample analyses involves preprocessing and concatenating multiple data sources, and accurately matching samples with each other. To do this, I first preprocess the CTD casts to generate bottle (.btl) files using the SeaBird vendor software, following the SOP available on Alfresco.

Next, the bottle files are parsed with Python code and the columns are renamed following SeaBird's naming guide. This creates a series of individual cast summary (.sum) files. These files are then loaded into pandas dataframes, appended to each other, and exported as a single CSV file containing all of the bottle data.

### Data Sources/Software:

* **sbe_name_map**: This is a spreadsheet which maps the short names generated by the SeaBird SBE DataProcessing Software to the associated full names. The name mapping originates from SeaBird's SBE DataProcessing support documentation.

* **Alfresco**: The Alfresco CMS for OOI at alfresco.oceanobservatories.org is the source of the CTD hex, xmlcon, and psa files necessary for generating the bottle files needed to create the sample summary sheet.

* **SBEDataProcessing-Win32**: SeaBird vendor software for processing the raw CTD files and generating the .btl files.


**========================================================================================================================**
Import packages which will be used in this notebook:

In [1]:
import os, sys, re
import pandas as pd
import numpy as np

Load the name mapping for the column names based on SeaBird's manual:

In [2]:
sbe_name_map = pd.read_excel('/media/andrew/OS/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Reference_Files/seabird_ctd_name_map.xlsx')

In [3]:
sbe_name_map.head()

Unnamed: 0,Short Name,Full Name,Friendly Name,Units,Notes/Comments
0,accM,Acceleration [m/s^2],acc M,m/s^2,
1,accF,Acceleration [ft/s^2],acc F,ft/s^2,
2,altM,Altimeter [m],alt M,m,
3,altF,Altimeter [ft],alt F,ft,
4,avgsvCM,"Average Sound Velocity [Chen-Millero, m/s]",avgsv-C M,"Chen-Millero, m/s",


Specify the directories where the different data sets are stored locally:

In [218]:
basepath = '/home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/'
array = 'Pioneer/'
cruise = 'Pioneer-08_AR-18_2017-05-30/'
leg = 'Leg 1 (ar18a)/'
water_dir = 'Water Sampling/'
ctd_dir = 'ctd/'

In [219]:
#os.listdir(basepath+array+cruise+water_dir)
os.listdir(basepath+array+cruise)

['Pioneer-08_Leg-1_AR18-A_Discrete_Summary_2019-03-13_ver_1-00_.xlsx',
 'Leg 2 (ar18b)',
 'Pioneer-08_Leg_3_AR18-C_Discrete_Summary_2019-06-13_ver_1-01_.xlsx',
 'Pioneer-08_Leg_1_AR18-A_Discrete_Summary_2019-06-13_ver_1-01_.xlsx',
 'Pioneer-08_AR18-C_Discrete_Summary_2019-06-21_ver_1-01_.xlsx',
 'Leg 1 (ar18a)',
 'Leg 3 (ar18c)',
 'Pioneer-08_Leg_2_AR18-B_Discrete_Summary_2019-06-13_ver_1-01_.xlsx',
 'Pioneer-08_AR18-B_Discrete_Summary_2019-06-21_ver_1-01_.xlsx',
 'Pioneer-08_Leg-3_AR18-C_Discrete_Summary_2019-03-13_ver_1-00_.xlsx',
 'Water Sampling',
 'Pioneer-08_Leg-2_AR18-B_Discrete_Summary_2019-03-13_ver_1-00_.xlsx',
 'Pioneer-08_AR18-A_Discrete_Summary_2019-06-21_ver_1-01_.xlsx']

In [220]:
files = os.listdir(basepath+array+cruise+water_dir)
files


['Pioneer-08_AR-18_2017-05-30_Nutrients_Sample_Data_2017-08-18_ver_1-00.xlsx',
 'Pioneer-08_AR-18A_2017-05-30_Oxygen_Salinity_Sample_Data',
 'Pioneer-08_AR-18C_2017-05-30_Oxygen_Salinity_Sample_Data',
 'Pioneer-08_AR-18B_CTD_Sampling_Log.xlsx',
 'Pioneer-08_AR-18C_CTD_sampling_log.xlsx',
 'Pioneer-08_AR-18A_CTD_Sampling_Log.xlsx',
 'Pioneer-08_AR-18B_2017-05-30_Oxygen_Salinity_Sample_Data']

Create the full directory paths for the relevant data:

In [221]:
# Specify the local directory where the bottle (.btl) files are stored for a particular cruise
btlpath = basepath+array+cruise+leg+ctd_dir
summary_sheet_path = basepath+array+cruise+water_dir+'Pioneer-08_AR-18A_CTD_Sampling_Log.xlsx'
salts_and_o2_path = basepath+array+cruise+water_dir+'Pioneer-08_AR-18A_2017-05-30_Oxygen_Salinity_Sample_Data/'
nutrients_path = basepath+array+cruise+water_dir+'Pioneer-08_AR-18_2017-05-30_Nutrients_Sample_Data_2017-08-18_ver_1-00.xlsx'
chl_path = basepath+array+cruise+water_dir+''
dic_path = basepath+array+cruise+water_dir+''
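
The paths above are built by string concatenation, which only works while every directory name ends in a trailing slash. As a minimal sketch of an alternative, the same locations can be composed with `pathlib`. The directory names are copied from the cells above and are specific to the author's machine; `cruise_dir` is a name introduced here for illustration.

```python
from pathlib import PurePosixPath

# Directory names copied from the cells above; treat them as placeholders
# for wherever the cruise data live locally.
basepath = PurePosixPath('/home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data')
cruise_dir = basepath / 'Pioneer' / 'Pioneer-08_AR-18_2017-05-30'  # hypothetical helper name

# The "/" operator inserts separators, so no trailing slashes are needed
btlpath = cruise_dir / 'Leg 1 (ar18a)' / 'ctd'
summary_sheet_path = cruise_dir / 'Water Sampling' / 'Pioneer-08_AR-18A_CTD_Sampling_Log.xlsx'

print(btlpath)
print(summary_sheet_path.name)
```

With this form, a file inside the directory is addressed as `btlpath / filename` rather than `btlpath+filename`, so the rest of the notebook would need the same change before adopting it.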

In [222]:
os.listdir(salts_and_o2_path)

['009SAL.csv',
 'SAL_Summary.csv',
 'OXY_Summary.csv',
 '009OXY.xlsx',
 'ctd1_4658.gif',
 '009SAL.xlsx',
 '009.SAL']

In [223]:
# Parse the header for the start time and other cast metadata
def parse_header(header):
    """
    Parse the header of bottle (.btl) files to get critical information
    for the summary spreadsheet.
    
    Args:
        header - an object containing the header of the bottle file as a list of
            strings, split at the newline.
    Returns:
        hdr - a dictionary object containing the start_time, filename, latitude,
            longitude, and cruise id.
    """
    hdr = {}
    for line in header:
        if 'start_time' in line.lower():
            # Raw string so that "\[" is read as a regex escape, not a string escape
            start_time = pd.to_datetime(re.split(r'= |\[',line)[1])
            hdr.update({'Start Time [UTC]':start_time.strftime('%Y-%m-%dT%H:%M:%SZ')})
        elif 'filename' in line.lower():
            hex_name = re.split('=',line)[1].strip()
            hdr.update({'Filename':hex_name})
        elif 'latitude' in line.lower():
            start_lat = re.split('=',line)[1].strip()
            hdr.update({'Start Latitude [degrees]':start_lat})
        elif 'longitude' in line.lower():
            start_lon = re.split('=',line)[1].strip()
            hdr.update({'Start Longitude [degrees]':start_lon})
        elif 'cruise id' in line.lower():
            cruise_id = re.split(':',line)[1].strip()
            hdr.update({'Cruise':cruise_id})
        else:
            pass
    
    return hdr
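
`parse_header` stores the start position exactly as written in the header, for example `40 17.01 N`, while the bottle rows carry decimal degrees. The sketch below shows one way to convert between the two. It assumes the header format is always degrees, decimal minutes, and a hemisphere letter separated by whitespace, which matches the values shown in this notebook but is not confirmed for other cruises; `dm_to_decimal` is a hypothetical helper, not part of the original code.

```python
def dm_to_decimal(position):
    """Convert 'DD MM.mm H' (degrees, decimal minutes, hemisphere) to decimal degrees."""
    degrees, minutes, hemisphere = position.split()
    value = float(degrees) + float(minutes) / 60.0
    # Southern and western hemispheres are negative by convention
    return -value if hemisphere.upper() in ('S', 'W') else value


# Start positions as they appear for cast 5 in this notebook
lat = dm_to_decimal('40 17.01 N')
lon = dm_to_decimal('070 50.02 W')
print(lat, lon)
```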
        

In [224]:
# Parse each bottle (.btl) file and write a per-cast summary (.sum) file
files = [x for x in os.listdir(btlpath) if '.btl' in x]
for filename in files:
    filepath = os.path.abspath(btlpath+filename)
    
    # Load the raw content into memory
    with open(filepath) as file:
        content = file.readlines()
    content = [x.strip() for x in content]
    
    # Now parse the file content
    header = []
    columns = []
    data = []
    for line in content:
        if line.startswith('*') or line.startswith('#'):
            header.append(line)
        else:
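            # Data rows begin with a digit (the bottle position); anything else,
            # including blank lines, is treated as a column-header row
            try:
                float(line[0])
                data.append(line)
            except (ValueError, IndexError):
                columns.append(line)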
    
    # Parse the header
    hdr = parse_header(header)
    
    # Parse the column identifiers
    column_dict = {}
    for line in columns:
        for i,x in enumerate(line.split()):
            try:
                column_dict[i] = column_dict[i] + ' ' + x
            except KeyError:
                column_dict.update({i:x})
    
    # Parse the bottle data based on the column header locations
    data_dict = {x:[] for x in column_dict.keys()}

    for line in data:
        if line.endswith('(avg)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            for i,x in enumerate(values):
                data_dict[i].append(x)
        elif line.endswith('(sdev)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            data_dict[1].append(values[0])
        else:
            pass
            
    data_dict[1] = [' '.join(item) for item in zip(data_dict[1][::2],data_dict[1][1::2])]
    
    # With the parsed data and column names, match up the data and column
    # based on the location
    results = {}
    for key,item in column_dict.items():
        values = data_dict[key]
        results.update({item:values})
        
    # Put the results into a dataframe
    df = pd.DataFrame.from_dict(results)
        
    # Now add the parsed info from the header files into the dataframe
    for key,item in hdr.items():
        df[key] = item
        
    # Get the cast number
    cast = filename[filename.index('.')-3:filename.index('.')]
    df['Cast'] = str(cast).zfill(3)
    
    # Generate a filename for the summary file
    outname = filename.split('.')[0] + '.sum'
    
    # Save the results
    df.to_csv(btlpath+outname)
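
The column-identifier step in the loop above merges header rows by token position: the i-th word on a later row is appended to the i-th name from an earlier row. A small standalone sketch of that idea follows. The two header rows are hypothetical and only mimic the layout the loop expects; they are not copied from a real .btl file.

```python
# Two hypothetical header rows: the second row continues the names begun
# in the first, position by position.
columns = [
    'Bottle        Date      Sal00      PrDM',
    'Position      Time',
]

column_dict = {}
for line in columns:
    for i, token in enumerate(line.split()):
        if i in column_dict:
            # A later row extends the name already stored at this position
            column_dict[i] = column_dict[i] + ' ' + token
        else:
            column_dict[i] = token

print(column_dict)
```

This positional merge only lines up when the continuation words belong to the leading columns, as they do for `Bottle Position` and `Date Time` in the output shown later in the notebook.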

In [225]:
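# Now, for each "summary" file, load it and concatenate everything into one dataframe
# (DataFrame.append was removed in pandas 2.0, so the frames are collected for pd.concat)
frames = [pd.read_csv(btlpath+file) for file in os.listdir(btlpath) if '.sum' in file]
df = pd.concat(frames) if frames else pd.DataFrame()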

In [226]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 0
Data columns (total 23 columns):
Unnamed: 0                   20 non-null int64
Bottle Position              20 non-null int64
Date Time                    20 non-null object
PrDM                         20 non-null float64
DepSM                        20 non-null float64
Latitude                     20 non-null float64
Longitude                    20 non-null float64
T090C                        20 non-null float64
T190C                        20 non-null float64
C0S/m                        20 non-null float64
C1S/m                        20 non-null float64
Sal00                        20 non-null float64
Sal11                        20 non-null float64
Sbeox0V                      20 non-null float64
Sbeox0ML/L                   20 non-null float64
OxsolML/L                    20 non-null float64
CStarAt0                     20 non-null float64
CStarTr0                     20 non-null object
Filename              

In [227]:
sbe_name_map['Short Name'].apply(lambda x: str(x).lower());

In [228]:
# Rename the column title using the sbe_name_mapping 
for colname in list(df.columns.values):
    try:
        fullname = list(sbe_name_map[sbe_name_map['Short Name'].apply(lambda x: str(x).lower() == colname.lower()) == True]['Full Name'])[0]
        df.rename({colname:fullname},axis='columns',inplace=True)
    except:
        pass
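
The case-insensitive lookup in the loop above can also be written as a single dictionary built once from the name map. This is a sketch under assumptions: `name_map` is a three-row stand-in for `sbe_name_map`, and the lower-camel spelling of the short names is assumed. The full names are the ones that appear in this notebook's output.

```python
import pandas as pd

# Stand-in for sbe_name_map, limited to three pairs seen in this notebook.
# The exact spelling of the short names in the spreadsheet is an assumption.
name_map = pd.DataFrame({
    'Short Name': ['prDM', 'depSM', 't090C'],
    'Full Name': ['Pressure, Digiquartz [db]',
                  'Depth [salt water, m]',
                  'Temperature [ITS-90, deg C]'],
})

# Build the lowercase lookup once instead of filtering the map per column
lookup = {str(short).lower(): full
          for short, full in zip(name_map['Short Name'], name_map['Full Name'])}

df = pd.DataFrame({'PrDM': [112.045], 'DepSM': [111.153],
                   'T090C': [10.8062], 'Cast': ['005']})

# Columns without an entry in the lookup keep their original name
renamed = df.rename(columns=lambda c: lookup.get(c.lower(), c))
print(list(renamed.columns))
```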

In [229]:
df

Unnamed: 0.1,Unnamed: 0,Bottle Position,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],...,"Oxygen raw, SBE 43 [V]","Oxygen, SBE 43 [ml/l]","Oxygen Saturation, Garcia & Gordon [ml/l]","Beam Attenuation, WET Labs C-Star [1/m]","Beam Transmission, WET Labs C-Star [%]",Filename,Start Latitude [degrees],Start Longitude [degrees],Start Time [UTC],Cast
0,0,1,Jun 01 2017 05:46:35,112.045,111.153,40.28337,-70.83382,10.8062,10.8014,3.87169,...,1.9482,4.0613,6.21367,0.4328,89.7438 (avg),D:\Data\ar18a005.hex,40 17.01 N,070 50.02 W,2017-06-01T05:40:35Z,5
0,0,1,Jun 01 2017 04:37:31,86.69,86.005,40.37482,-70.8332,8.1993,8.2078,3.520541,...,2.0075,4.5028,6.62671,0.2527,93.8799 (avg),D:\Data\ar18a004.hex,40 22.49 N,070 49.99 W,2017-06-01T04:32:39Z,4
0,0,1,Jun 01 2017 20:54:01,465.958,461.87,39.92994,-70.89002,6.744,6.7581,3.532714,...,1.5773,3.4605,6.78747,0.1287,96.8339 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
1,1,2,Jun 01 2017 20:54:36,466.011,461.923,39.92995,-70.89002,6.7432,6.7139,3.533018,...,1.5772,3.4604,6.78741,0.1289,96.8284 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
2,2,3,Jun 01 2017 21:04:51,133.199,132.137,39.92996,-70.89,12.4248,12.4232,4.103608,...,1.7266,3.3113,5.97746,0.0915,97.7384 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
3,3,4,Jun 01 2017 21:05:03,133.282,132.219,39.92996,-70.89,12.4273,12.4255,4.103845,...,1.7279,3.3133,5.97715,0.0916,97.7362 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
4,4,5,Jun 01 2017 21:24:34,29.522,29.294,39.92996,-70.89,13.2262,13.2255,4.036749,...,2.3834,4.9946,5.92886,0.4439,89.4961 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
5,5,6,Jun 01 2017 21:24:43,29.891,29.66,39.92996,-70.89,13.2229,13.2238,4.034434,...,2.3877,5.0058,5.92996,0.4479,89.4062 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
6,6,7,Jun 01 2017 21:27:02,6.86,6.807,39.92996,-70.89001,14.6555,14.6574,4.147715,...,2.4682,5.0643,5.76765,0.5631,86.8693 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
7,7,8,Jun 01 2017 21:27:11,7.124,7.07,39.92996,-70.89001,14.6647,14.6679,4.148739,...,2.4679,5.0666,5.76654,0.5611,86.9131 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9


In [230]:
df['Bottle Position'] = df['Bottle Position'].apply(lambda x: str( int(x) ) )
df.drop(columns='Unnamed: 0',inplace=True)
df['Cast'] = df['Cast'].apply(lambda x: str(x).zfill(3) )

In [231]:
df

Unnamed: 0,Bottle Position,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],"Conductivity, 2 [S/m]",...,"Oxygen raw, SBE 43 [V]","Oxygen, SBE 43 [ml/l]","Oxygen Saturation, Garcia & Gordon [ml/l]","Beam Attenuation, WET Labs C-Star [1/m]","Beam Transmission, WET Labs C-Star [%]",Filename,Start Latitude [degrees],Start Longitude [degrees],Start Time [UTC],Cast
0,1,Jun 01 2017 05:46:35,112.045,111.153,40.28337,-70.83382,10.8062,10.8014,3.87169,3.870753,...,1.9482,4.0613,6.21367,0.4328,89.7438 (avg),D:\Data\ar18a005.hex,40 17.01 N,070 50.02 W,2017-06-01T05:40:35Z,5
0,1,Jun 01 2017 04:37:31,86.69,86.005,40.37482,-70.8332,8.1993,8.2078,3.520541,3.521091,...,2.0075,4.5028,6.62671,0.2527,93.8799 (avg),D:\Data\ar18a004.hex,40 22.49 N,070 49.99 W,2017-06-01T04:32:39Z,4
0,1,Jun 01 2017 20:54:01,465.958,461.87,39.92994,-70.89002,6.744,6.7581,3.532714,3.533962,...,1.5773,3.4605,6.78747,0.1287,96.8339 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
1,2,Jun 01 2017 20:54:36,466.011,461.923,39.92995,-70.89002,6.7432,6.7139,3.533018,3.529875,...,1.5772,3.4604,6.78741,0.1289,96.8284 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
2,3,Jun 01 2017 21:04:51,133.199,132.137,39.92996,-70.89,12.4248,12.4232,4.103608,4.1032,...,1.7266,3.3113,5.97746,0.0915,97.7384 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
3,4,Jun 01 2017 21:05:03,133.282,132.219,39.92996,-70.89,12.4273,12.4255,4.103845,4.103519,...,1.7279,3.3133,5.97715,0.0916,97.7362 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
4,5,Jun 01 2017 21:24:34,29.522,29.294,39.92996,-70.89,13.2262,13.2255,4.036749,4.03715,...,2.3834,4.9946,5.92886,0.4439,89.4961 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
5,6,Jun 01 2017 21:24:43,29.891,29.66,39.92996,-70.89,13.2229,13.2238,4.034434,4.034635,...,2.3877,5.0058,5.92996,0.4479,89.4062 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
6,7,Jun 01 2017 21:27:02,6.86,6.807,39.92996,-70.89001,14.6555,14.6574,4.147715,4.147665,...,2.4682,5.0643,5.76765,0.5631,86.8693 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
7,8,Jun 01 2017 21:27:11,7.124,7.07,39.92996,-70.89001,14.6647,14.6679,4.148739,4.148835,...,2.4679,5.0666,5.76654,0.5611,86.9131 (avg),D:\Data\ar18a009.hex,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9
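
In the table above, the last data column still carries the `(avg)` row marker from the bottle file, which is why that column loads as `object` rather than a float. One possible cleanup, shown as a sketch rather than as part of the original workflow, strips the marker and converts the remainder to numbers. The two values are copied from the table.

```python
import pandas as pd

# Two values copied from the last data column of the table above
beam = pd.Series(['89.7438 (avg)', '93.8799 (avg)'])

# Remove the trailing marker, then convert the remaining text to floats
cleaned = pd.to_numeric(beam.str.replace(r'\s*\(avg\)$', '', regex=True))
print(cleaned.tolist())
```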


In [232]:
df.to_csv(btlpath+'CTD_Summary.csv')

### Oxygen & Salinity 
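Now, we need to clean the discrete salinity and oxygen sample files and write a summary file for each, so that they can later be merged with the sample log.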

In [233]:
def clean_sal_files(dirpath):

    # Run check if files are held in excel format or csvs
    csv_flag = any(files.endswith('.SAL') for files in os.listdir(dirpath))
    if csv_flag:
        for filename in os.listdir(dirpath):
            sample = []
            salinity = []
            if filename.endswith('.SAL'):
                with open(dirpath+filename) as file:
                    data = file.readlines()
                    for ind1,line in enumerate(data):
                        if ind1 == 0:
                            strs = data[0].replace('"','').split(',')
                            cruisename = strs[0]
                            station = strs[1]
                            cast = strs[2]
                            case = strs[8]
                        elif int(line.split()[0]) == 0:
                            pass
                        else:
                            strs = line.split()
                            sample.append(strs[0])
                            salinity.append(strs[2])
                
                    # Generate a pandas dataframe to populate data
                    data_dict = {'Cruise':cruisename,'Station':station,'Cast':cast,'Case':case,'Sample ID':sample,'Salinity [psu]':salinity}
                    df = pd.DataFrame.from_dict(data_dict)
                    df.to_csv(file.name.replace('.','')+'.csv')
            else:
                pass
    
    else:
        # If the files are already in excel spreadsheets, they've been cleaned into a
        # logical tabular format
        pass
    

def process_sal_files(dirpath):
    
    # Check if the files are excel files or not
    excel_flag = any(files.endswith('SAL.xlsx') for files in os.listdir(dirpath))
    # Initialize a dataframe for processing the salinity files
    df = pd.DataFrame()
    if excel_flag:
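        # Read every salinity workbook, then combine (DataFrame.append was removed in pandas 2.0)
        frames = [pd.read_excel(dirpath+file) for file in os.listdir(dirpath) if 'SAL.xlsx' in file]
        df = pd.concat(frames)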
        df.rename({'Sample':'Sample ID','Salinity':'Salinity [psu]','Niskin #':'Niskin','Case ID':'Case'}, 
                  axis='columns',inplace=True)
        df.dropna(inplace=True)
        df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
        df['Niskin'] = df['Niskin'].apply(lambda x: str( int(x)))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    else:
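        # Read every cleaned salinity csv, then combine (DataFrame.append was removed in pandas 2.0)
        frames = [pd.read_csv(dirpath+file) for file in os.listdir(dirpath) if 'SAL.csv' in file]
        df = pd.concat(frames)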
        df.dropna(inplace=True)
        df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
        df.drop(columns=[x for x in list(df.columns.values) if 'unnamed' in x.lower()],inplace=True)

    # Save the processed summary file for salinity
    df.to_csv(dirpath+'SAL_Summary.csv')
    
    
def process_oxy_files(dirpath):
    # Read every oxygen workbook, then combine (DataFrame.append was removed in pandas 2.0)
    frames = [pd.read_excel(dirpath+filename) for filename in os.listdir(dirpath)
              if 'oxy' in filename.lower() and filename.endswith('.xlsx')]
    df = pd.concat(frames)
    # Rename and clean up the oxygen data to be uniform across data sets
    df.rename({'Niskin #':'Niskin','Sample#':'Sample ID','Oxy':'Oxygen [mL/L]','Unit':'Units'},
              axis='columns',inplace=True)
    df.dropna(inplace=True)
    df['Station'] = df['Station'].apply(lambda x: str( int(x)).zfill(3))
    df['Niskin'] = df['Niskin'].apply(lambda x: str( int(x)))
    df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    df['Cruise'] = df['Cruise'].apply(lambda x: x.replace('O','0'))
    
    # Save the processed summary file for oxygen
    df.to_csv(dirpath+'OXY_Summary.csv')

In [234]:
# Now process the salts and oxygen data
# Clean the salinity
clean_sal_files(salts_and_o2_path)
# Process the salinity files
process_sal_files(salts_and_o2_path)
# Process the oxygen files
process_oxy_files(salts_and_o2_path)

### CTD Sampling Log
Load in the CTD sampling log summary sheet. The summary sheet needs to be manually created and the data cleaned before attempting to import. Additionally, ensure that there is only one header line and that it is at the top of the file.

In [235]:
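# Clear any sample_log left over from an earlier run (a bare del fails on a fresh kernel)
if 'sample_log' in globals():
    del sample_log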

In [236]:
sample_log = pd.read_excel(summary_sheet_path,sheet_name='Summary',header=0)
sample_log.head()

Unnamed: 0,Cruise ID,Station-Cast #,Target Asset,Start Latitude,Start Longitude,Start Date,Start Time,Bottom Depth [m],Niskin #,Date,...,Oxygen Bottle #,Ph Bottle #,DIC/TA Bottle #,Salts Bottle #,Nitrate Bottle 1,Chlorophyll Brown Bottle #,Chlorophyll Filter Sample #,Chlorophyll Brown Bottle Volume,Chlorophyll LN Tube,Comments
0,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,1,2017-06-01,...,B1,245.0,246.0,S1,9-1.,,,,,
1,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,2,2017-06-01,...,B2,,247.0,S2,9-2.,,,,,Duplicate DIC/TA
2,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,3,2017-06-01,...,B3,,248.0,S3,9-3.,,,,,
3,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,4,2017-06-01,...,B4,,,S4,,,,,,
4,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,5,2017-06-01,...,B5,,249.0,S5,9-4.,1.0,09/01.,539.0,,chl. Max


Rename the Comments field:

In [237]:
sample_log.rename(columns={'Chlorophyll Comments':'Comments'},inplace=True)
sample_log.head()

Unnamed: 0,Cruise ID,Station-Cast #,Target Asset,Start Latitude,Start Longitude,Start Date,Start Time,Bottom Depth [m],Niskin #,Date,...,Oxygen Bottle #,Ph Bottle #,DIC/TA Bottle #,Salts Bottle #,Nitrate Bottle 1,Chlorophyll Brown Bottle #,Chlorophyll Filter Sample #,Chlorophyll Brown Bottle Volume,Chlorophyll LN Tube,Comments
0,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,1,2017-06-01,...,B1,245.0,246.0,S1,9-1.,,,,,
1,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,2,2017-06-01,...,B2,,247.0,S2,9-2.,,,,,Duplicate DIC/TA
2,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,3,2017-06-01,...,B3,,248.0,S3,9-3.,,,,,
3,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,4,2017-06-01,...,B4,,,S4,,,,,,
4,AR18-A,9,OSPM,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,469,5,2017-06-01,...,B5,,249.0,S5,9-4.,1.0,09/01.,539.0,,chl. Max


In [238]:
def strip_x(x):
    if type(x) == str:
        x = x.replace('.','')
        return x
    else:
        return x

In [239]:
sample_log['Nitrate Bottle 1'] = sample_log['Nitrate Bottle 1'].apply(lambda x: strip_x(x))

In [240]:
sample_log['Nitrate Bottle 1']

0    9-1
1    9-2
2    9-3
3    NaN
4    9-4
5    9-5
6    9-6
7    NaN
Name: Nitrate Bottle 1, dtype: object

### Nutrient & Chlorophyll Data

In [241]:
try:
    nutrients = pd.read_excel(nutrients_path,header=0)
    nutrients
except IsADirectoryError:
    nutrients = pd.DataFrame(data=sample_log['Nitrate Bottle 1'])
    nutrients.rename(columns={'Nitrate Bottle 1':'Sample ID'}, inplace=True)
    columns = ['Sample ID','Cruise','Avg: Nitrate + Nitrite [µmol/L]','Avg: Ammonium [µmol/L]',
               'Avg: Phosphate [µmol/L]','Avg: Silicate [µmol/L]','Avg: Nitrite [µmol/L]','Avg: Nitrate [µmol/L]']
    for col in columns:
        if col not in nutrients.columns.values:
            nutrients[col] = nutrients['Sample ID']

In [242]:
nutrients.rename(columns={'Unnamed: 0':'Sample ID'}, inplace=True)

In [243]:
nutrients.dropna(inplace=True)

In [244]:
nutrients

Unnamed: 0,Sample ID,Cruise,Avg: Nitrate + Nitrite [µmol/L],Avg: Ammonium [µmol/L],Avg: Phosphate [µmol/L],Avg: Silicate [µmol/L],Avg: Nitrite [µmol/L],Avg: Nitrate [µmol/L]
0,6-1,AR18-B,18.4056,0.6660,1.09619,11.3763,<0.04,18.3656
1,6-2,AR18-B,13.6015,0.8160,0.87714,8.70838,<0.04,13.5615
2,6-3,AR18-B,14.0576,1.0020,0.968409,7.49593,<0.04,14.0176
3,6-4,AR18-B,<0.04,1.6800,<0.009,<0.03,<0.04,<0.04
4,6-5,AR18-B,<0.04,1.1940,0.0134501,<0.03,<0.04,<0.04
5,6-6,AR18-B,<0.04,1.9900,<0.009,<0.03,<0.04,<0.04
6,7-1,AR18-B,20.2275,1.4500,1.15863,11.0833,<0.04,20.1875
7,7-2,AR18-B,17.8869,1.6170,1.05199,9.79746,<0.04,17.8469
8,7-3,AR18-B,10.5906,0.4570,0.582679,4.20051,<0.04,10.5506
9,7-4,AR18-B,<0.04,1.2330,0.0317039,<0.03,<0.04,<0.04
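
Several nutrient entries in the table above are reported as strings such as `<0.04`, which is presumably why most of these columns load as `object`. The sketch below shows one possible way to separate those below-detection entries from the numeric values. It is not part of the original workflow, and how such values should be treated in the final summary is left open.

```python
import pandas as pd

# Three entries from the "Avg: Nitrate + Nitrite" column shown above
values = pd.Series([18.4056, '<0.04', 13.6015])

as_text = values.astype(str)
# Flag entries reported as "less than" a detection limit
below_limit = as_text.str.startswith('<')
# Keep the numeric entries; flagged entries become NaN
numeric = pd.to_numeric(as_text.where(~below_limit), errors='coerce')

print(below_limit.tolist())
print(numeric.tolist())
```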


In [245]:
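# Clear any chl left over from an earlier run (a bare del fails on a fresh kernel)
if 'chl' in globals():
    del chl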

In [246]:
sample_log['Chlorophyll Filter Sample #']

0       NaN
1       NaN
2       NaN
3       NaN
4    09/01.
5    09/02.
6    09/03.
7    09/04.
Name: Chlorophyll Filter Sample #, dtype: object

In [247]:
sample_log.columns

Index(['Cruise ID', 'Station-Cast #', 'Target Asset', 'Start Latitude',
       'Start Longitude', 'Start Date', 'Start Time', 'Bottom Depth [m]',
       'Niskin #', 'Date', 'Time', 'Trip Depth', 'Potential Temp', 'Salinity',
       'Oxygen Bottle #', 'Ph Bottle #', 'DIC/TA Bottle #', 'Salts Bottle #',
       'Nitrate Bottle 1', 'Chlorophyll Brown Bottle #',
       'Chlorophyll Filter Sample #', 'Chlorophyll Brown Bottle Volume',
       'Chlorophyll LN Tube', 'Comments'],
      dtype='object')

In [248]:
try:
    chl = pd.read_excel(chl_path)
    chl.head()
except IsADirectoryError:
    # If there is no chlorophyll sheet yet, need to copy the bottle data into the final sample log
    # Take an explicit copy so the in-place rename below does not act on a slice of sample_log
    chl = sample_log[['Station-Cast #','Chlorophyll Brown Bottle #','Chlorophyll Filter Sample #','Chlorophyll LN Tube']].copy()
    chl.rename(columns={
        'Chlorophyll Brown Bottle #': 'Brown Bottle #',
        'Chlorophyll Filter Sample #': 'Chl (ug/l)',
        'Chlorophyll LN Tube':'Phaeo (ug/l)'
    }, inplace=True)

In [249]:
chl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
Station-Cast #    8 non-null int64
Brown Bottle #    4 non-null float64
Chl (ug/l)        4 non-null object
Phaeo (ug/l)      0 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 336.0+ bytes


In [250]:
chl

Unnamed: 0,Station-Cast #,Brown Bottle #,Chl (ug/l),Phaeo (ug/l)
0,9,,,
1,9,,,
2,9,,,
3,9,,,
4,9,1.0,09/01.,
5,9,2.0,09/02.,
6,9,3.0,09/03.,
7,9,4.0,09/04.,


In [251]:
# Load the Salinity and oxygen summaries
sal = pd.read_csv(salts_and_o2_path+'SAL_Summary.csv')
if 'case' in [x.lower() for x in sal.columns.values]:
    sal['Sample ID'] = sal['Case'] + sal['Sample ID'].apply(lambda x: str(x)) 
oxy = pd.read_csv(salts_and_o2_path+'OXY_Summary.csv')
if 'case' in [x.lower() for x in oxy.columns.values]:
    oxy['Sample ID'] = oxy['Case'] + oxy['Sample ID'].apply(lambda x: str(x)) 

**========================================================================================================================**
### Carbon-System Measurements

In [252]:
try:
    dic = pd.read_excel(dic_path,header=0)
    dic
except IsADirectoryError:
    # No carbon-system sheet yet: build a placeholder from the sample log.
    # Take an explicit copy so the rename and column assignments below
    # do not raise pandas' SettingWithCopyWarning.
    dic = sample_log[['Station-Cast #','Niskin #','Ph Bottle #','DIC/TA Bottle #']].copy()
    dic.rename(columns={
        'Station-Cast #':'CAST_NO',
        'Niskin #':'NISKIN_NO',
        'DIC/TA Bottle #':'DIC_UMOL_KG',
        'Ph Bottle #':'PH_TOT_MEA',
    }, inplace=True)
    columns = ['CAST_NO', 'NISKIN_NO','DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W']
    for col in columns:
        if col not in dic.columns.values:
            if 'dic' in col.lower() or 'ta' in col.lower():
                dic[col] = dic['DIC_UMOL_KG']
            elif 'ph' in col.lower():
                dic[col] = dic['PH_TOT_MEA']
            else:
                dic[col] = np.nan


In [253]:
dic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 9 columns):
CAST_NO         8 non-null int64
NISKIN_NO       8 non-null int64
PH_TOT_MEA      3 non-null float64
DIC_UMOL_KG     5 non-null float64
DIC_FLAG_W      5 non-null float64
TA_UMOL_KG      5 non-null float64
TA_FLAG_W       5 non-null float64
TMP_PH_DEG_C    3 non-null float64
PH_FLAG_W       3 non-null float64
dtypes: float64(7), int64(2)
memory usage: 656.0 bytes


**========================================================================================================================**
### Sample Log 
Next, we need to merge the sample log with the individual oxygen, salinity, nutrient, chlorophyll, and carbon sampling sheets.

In [254]:
sample_log.columns.values

array(['Cruise ID', 'Station-Cast #', 'Target Asset', 'Start Latitude',
       'Start Longitude', 'Start Date', 'Start Time', 'Bottom Depth [m]',
       'Niskin #', 'Date', 'Time', 'Trip Depth', 'Potential Temp',
       'Salinity', 'Oxygen Bottle #', 'Ph Bottle #', 'DIC/TA Bottle #',
       'Salts Bottle #', 'Nitrate Bottle 1', 'Chlorophyll Brown Bottle #',
       'Chlorophyll Filter Sample #', 'Chlorophyll Brown Bottle Volume',
       'Chlorophyll LN Tube', 'Comments'], dtype=object)

**========================================================================================================================**
Merge the **salinity** information with the sample_log based on cast # and salts sampling bottle:

In [255]:
# Merge the salinity values into the sample log, matching on cast number and salts bottle
sample_log = sample_log.merge(sal[['Station','Sample ID','Salinity [psu]']], how='left', left_on=['Station-Cast #','Salts Bottle #'], right_on=['Station','Sample ID'])

In [256]:
sample_log.rename({'Salinity [psu]':'Discrete Salinity [psu]'},axis='columns',inplace=True)
sample_log.drop(['Station','Sample ID'],axis='columns',inplace=True)
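
A left merge keeps every row of the sample log and fills in lab values only where both keys match. The sketch below reproduces that pattern on toy data and uses the `indicator` option of `DataFrame.merge` to list bottles that found no match. The salinity numbers are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for sample_log and the salinity summary; salinity values are made up
log = pd.DataFrame({'Station-Cast #': [9, 9, 9],
                    'Salts Bottle #': ['S1', 'S2', 'S3']})
sal = pd.DataFrame({'Station': [9, 9],
                    'Sample ID': ['S1', 'S2'],
                    'Salinity [psu]': [35.01, 34.98]})

merged = log.merge(sal, how='left',
                   left_on=['Station-Cast #', 'Salts Bottle #'],
                   right_on=['Station', 'Sample ID'],
                   indicator=True)

# Rows marked 'left_only' are bottles with no matching lab result
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['Salts Bottle #'].tolist())
```

Both sides of each key pair need the same dtype. Here the cast number is an integer on both sides, which is also what `pd.read_csv` would be expected to produce when it reads the zero-padded `Station` values back from `SAL_Summary.csv`.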

In [257]:
sample_log.rename(columns=lambda x: x.strip(),inplace=True)

**========================================================================================================================**
Next, merge the **oxygen** data into the sample log based on cast # and oxygen sampling bottle:

In [258]:
sample_log = sample_log.merge(oxy[['Station','Sample ID','Oxygen [mL/L]']], how='left', left_on=['Station-Cast #','Oxygen Bottle #'], right_on=['Station','Sample ID'])

In [259]:
sample_log.rename({'Oxygen [mL/L]':'Discrete Oxygen [mL/L]'},axis='columns',inplace=True)
sample_log.drop(['Station','Sample ID'],axis='columns',inplace=True)

In [260]:
sample_log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 26 columns):
Cruise ID                          8 non-null object
Station-Cast #                     8 non-null int64
Target Asset                       8 non-null object
Start Latitude                     8 non-null object
Start Longitude                    8 non-null object
Start Date                         8 non-null datetime64[ns]
Start Time                         8 non-null object
Bottom Depth [m]                   8 non-null int64
Niskin #                           8 non-null int64
Date                               8 non-null datetime64[ns]
Time                               8 non-null int64
Trip Depth                         8 non-null int64
Potential Temp                     0 non-null float64
Salinity                           0 non-null float64
Oxygen Bottle #                    8 non-null object
Ph Bottle #                        3 non-null float64
DIC/TA Bottle #                    5 

**========================================================================================================================**
Merge the **nutrients** data into the sample log:

In [261]:
nutrients.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 89
Data columns (total 8 columns):
Sample ID                          90 non-null object
Cruise                             90 non-null object
Avg: Nitrate + Nitrite [µmol/L]    90 non-null object
Avg: Ammonium [µmol/L]             90 non-null float64
Avg: Phosphate [µmol/L]            90 non-null object
Avg: Silicate [µmol/L]             90 non-null object
Avg: Nitrite [µmol/L]              90 non-null object
Avg: Nitrate [µmol/L]              90 non-null object
dtypes: float64(1), object(7)
memory usage: 6.3+ KB


In [262]:
nutrients.reset_index(inplace=True)

In [263]:
sample_log['Nitrate Bottle 1'] = sample_log['Nitrate Bottle 1'].apply(lambda x: x.strip() if isinstance(x, str) else x)
sample_log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 7
Data columns (total 26 columns):
Cruise ID                          8 non-null object
Station-Cast #                     8 non-null int64
Target Asset                       8 non-null object
Start Latitude                     8 non-null object
Start Longitude                    8 non-null object
Start Date                         8 non-null datetime64[ns]
Start Time                         8 non-null object
Bottom Depth [m]                   8 non-null int64
Niskin #                           8 non-null int64
Date                               8 non-null datetime64[ns]
Time                               8 non-null int64
Trip Depth                         8 non-null int64
Potential Temp                     0 non-null float64
Salinity                           0 non-null float64
Oxygen Bottle #                    8 non-null object
Ph Bottle #                        3 non-null float64
DIC/TA Bottle #                    5 

In [264]:
sample_log = sample_log.merge(nutrients, how='left', left_on=['Nitrate Bottle 1'], right_on=['Sample ID'])

In [265]:
sample_log.drop_duplicates(inplace=True)
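
A left merge only adds rows when a key value occurs more than once in the right-hand table, so a growing row count is worth checking. The sketch below uses made-up data to show the effect and how the `validate` argument of `merge` can turn it into an explicit error:

```python
import pandas as pd

left = pd.DataFrame({'Nitrate Bottle 1': ['1-1', '1-2']})
# '1-1' appears twice on the right, e.g. a bottle label reused on another leg (made-up data)
right = pd.DataFrame({'Sample ID': ['1-1', '1-1', '1-2'],
                      'Nitrate': [2.0, 7.5, 3.1]})

# Two left rows become three because '1-1' matches twice
grown = left.merge(right, how='left', left_on='Nitrate Bottle 1', right_on='Sample ID')

# validate= makes pandas raise instead of silently multiplying rows
try:
    left.merge(right, how='left', left_on='Nitrate Bottle 1', right_on='Sample ID',
               validate='many_to_one')
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True
```

Note that `drop_duplicates` only removes rows that are identical in every column, so it would not collapse the two `'1-1'` rows in this example.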

Rename the `Avg:` columns to `Discrete`, and drop the unneeded `Sample ID` column:

In [266]:
sample_log.rename(columns=lambda x: x.replace('Avg:', 'Discrete'), inplace=True)
sample_log.drop(['Sample ID'],axis='columns',inplace=True)

In [267]:
sample_log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 13
Data columns (total 34 columns):
Cruise ID                              14 non-null object
Station-Cast #                         14 non-null int64
Target Asset                           14 non-null object
Start Latitude                         14 non-null object
Start Longitude                        14 non-null object
Start Date                             14 non-null datetime64[ns]
Start Time                             14 non-null object
Bottom Depth [m]                       14 non-null int64
Niskin #                               14 non-null int64
Date                                   14 non-null datetime64[ns]
Time                                   14 non-null int64
Trip Depth                             14 non-null int64
Potential Temp                         0 non-null float64
Salinity                               0 non-null float64
Oxygen Bottle #                        14 non-null object
Ph Bottle #     

**========================================================================================================================**
Merge the **chlorophyll** data into the sampling sheet:

In [268]:
chl.columns.values

array(['Station-Cast #', 'Brown Bottle #', 'Chl (ug/l)', 'Phaeo (ug/l)'],
      dtype=object)

In [269]:
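# Copy the selection so the in-place renames below do not act on a view of chl
chl_df = chl[['Station-Cast #','Brown Bottle #','Chl (ug/l)','Phaeo (ug/l)']].copy()  # 'Comments' is not selected
chl_df.rename(columns={'Comments':'Chl Comments'}, inplace=True)  # no effect unless 'Comments' is selected above
chl_df.rename(columns=lambda x: 'Discrete ' + x, inplace=True)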
#chl_df.rename({'Discrete quality_flag':'Discrete Chl quality flag'},axis='columns',inplace=True)

In [270]:
chl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
Station-Cast #    8 non-null int64
Brown Bottle #    4 non-null float64
Chl (ug/l)        4 non-null object
Phaeo (ug/l)      0 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 336.0+ bytes


In [271]:
sample_log = sample_log.merge(chl_df, how='left', left_on=['Station-Cast #','Chlorophyll Brown Bottle #'], right_on=['Discrete Station-Cast #','Discrete Brown Bottle #'])
sample_log.drop(['Discrete Station-Cast #','Discrete Brown Bottle #'],axis='columns',inplace=True)
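
The log grows from 14 to 35 rows after this merge. One possible cause, given that `Brown Bottle #` is null in half of the `chl` rows, is that pandas matches null keys against each other when merging. The sketch below uses made-up frames and assumes that dropping right-hand rows with a null bottle number is acceptable:

```python
import numpy as np
import pandas as pd

log = pd.DataFrame({'Station-Cast #': [9, 9, 9],
                    'Chlorophyll Brown Bottle #': [1.0, np.nan, np.nan]})
chl_demo = pd.DataFrame({'Station-Cast #': [9, 9, 9],
                         'Brown Bottle #': [1.0, np.nan, np.nan],
                         'Chl (ug/l)': [0.42, np.nan, np.nan]})

keys_left = ['Station-Cast #', 'Chlorophyll Brown Bottle #']
keys_right = ['Station-Cast #', 'Brown Bottle #']

# Null keys match each other: each of the 2 null log rows pairs with both null chl rows
naive = log.merge(chl_demo, how='left', left_on=keys_left, right_on=keys_right)

# Dropping null bottle numbers on the right first keeps one row per log entry
clean = log.merge(chl_demo.dropna(subset=['Brown Bottle #']), how='left',
                  left_on=keys_left, right_on=keys_right)
```

Here `naive` has five rows while `clean` keeps the original three.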

In [272]:
sample_log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35 entries, 0 to 34
Data columns (total 36 columns):
Cruise ID                              35 non-null object
Station-Cast #                         35 non-null int64
Target Asset                           35 non-null object
Start Latitude                         35 non-null object
Start Longitude                        35 non-null object
Start Date                             35 non-null datetime64[ns]
Start Time                             35 non-null object
Bottom Depth [m]                       35 non-null int64
Niskin #                               35 non-null int64
Date                                   35 non-null datetime64[ns]
Time                                   35 non-null int64
Trip Depth                             35 non-null int64
Potential Temp                         0 non-null float64
Salinity                               0 non-null float64
Oxygen Bottle #                        35 non-null object
Ph Bottle #     

**========================================================================================================================**
Merge the **carbon** data into the sampling sheet:

In [273]:
# Copy the selection so the renames and new columns below do not act on a view of dic
dic_df = dic[['CAST_NO', 'NISKIN_NO','DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W']].copy()
dic_df.rename(columns = {'DIC_UMOL_KG':'DIC [µmol/kg]',
               'DIC_FLAG_W':'DIC Flag',
               'TA_UMOL_KG':'Alkalinity [µmol/kg]',
               'TA_FLAG_W':'Alkalinity Flag',
               'PH_TOT_MEA':'pH [Total Scale]',
               'TMP_PH_DEG_C':'pH Analysis Temp [C]', 
              'PH_FLAG_W':'pH Flag'}, inplace=True)
# Add in the pCO2 columns, which we don't measure
dic_df['pCO2'] = np.nan
dic_df['pCO2 Flag'] = np.nan
dic_df['pCO2 Analysis Temp [C]'] = np.nan

dic_df.rename(columns=lambda x: 'Discrete ' + x, inplace=True)
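
The select, rename, add-placeholder, and prefix steps can also be written as a single method chain. This is only a sketch on a made-up one-row frame (`dic_demo` and its values are hypothetical); it relies on `DataFrame.assign` and `DataFrame.add_prefix`, and leaves the source frame untouched:

```python
import numpy as np
import pandas as pd

dic_demo = pd.DataFrame({'CAST_NO': [9], 'NISKIN_NO': [2],
                         'DIC_UMOL_KG': [2100.5], 'DIC_FLAG_W': [2]})  # made-up values

subset = (dic_demo[['CAST_NO', 'NISKIN_NO', 'DIC_UMOL_KG', 'DIC_FLAG_W']]
          .copy()  # independent frame, safe to modify
          .rename(columns={'DIC_UMOL_KG': 'DIC [µmol/kg]', 'DIC_FLAG_W': 'DIC Flag'})
          .assign(pCO2=np.nan)  # placeholder for a quantity that is not measured
          .add_prefix('Discrete '))
```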

In [274]:
sample_log = sample_log.merge(dic_df, how='left', left_on=['Station-Cast #','Niskin #'], right_on=['Discrete CAST_NO','Discrete NISKIN_NO'])
sample_log.drop(['Discrete CAST_NO','Discrete NISKIN_NO'], axis='columns', inplace=True)

In [275]:
sample_log.drop_duplicates(inplace=True)

In [276]:
sample_log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 34
Data columns (total 46 columns):
Cruise ID                              14 non-null object
Station-Cast #                         14 non-null int64
Target Asset                           14 non-null object
Start Latitude                         14 non-null object
Start Longitude                        14 non-null object
Start Date                             14 non-null datetime64[ns]
Start Time                             14 non-null object
Bottom Depth [m]                       14 non-null int64
Niskin #                               14 non-null int64
Date                                   14 non-null datetime64[ns]
Time                                   14 non-null int64
Trip Depth                             14 non-null int64
Potential Temp                         0 non-null float64
Salinity                               0 non-null float64
Oxygen Bottle #                        14 non-null object
Ph Bottle #     

**========================================================================================================================**
### CTD Data
Now, we want to load the CTD bottle summary data and merge it with the water sampling data in the sample log.

In [277]:
CTD = pd.read_csv(basepath+array+cruise+leg+ctd_dir+'CTD_Summary.csv')

In [278]:
sample_log.rename({'Target Station':'Target Asset'},axis='columns',inplace=True)

In [279]:
sample_log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 34
Data columns (total 46 columns):
Cruise ID                              14 non-null object
Station-Cast #                         14 non-null int64
Target Asset                           14 non-null object
Start Latitude                         14 non-null object
Start Longitude                        14 non-null object
Start Date                             14 non-null datetime64[ns]
Start Time                             14 non-null object
Bottom Depth [m]                       14 non-null int64
Niskin #                               14 non-null int64
Date                                   14 non-null datetime64[ns]
Time                                   14 non-null int64
Trip Depth                             14 non-null int64
Potential Temp                         0 non-null float64
Salinity                               0 non-null float64
Oxygen Bottle #                        14 non-null object
Ph Bottle #     

Create a list of columns to merge from the water sampling log with the CTD bottle summary data:

In [280]:
# All of the discrete sample columns, plus the cast metadata needed for the merge
column_list = [name for name in sample_log.columns if 'Discrete' in name]
column_list.extend(['Station-Cast #', 'Start Latitude', 'Start Longitude',
                    'Start Date', 'Start Time', 'Niskin #', 'Target Asset',
                    'Bottom Depth [m]', 'Comments'])

Use the column list to pull out the discrete data from the sample log:

In [281]:
discrete_data = sample_log[column_list]

In [282]:
discrete_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 34
Data columns (total 29 columns):
Discrete Salinity [psu]                14 non-null float64
Discrete Oxygen [mL/L]                 14 non-null float64
Discrete Nitrate + Nitrite [µmol/L]    12 non-null object
Discrete Ammonium [µmol/L]             12 non-null float64
Discrete Phosphate [µmol/L]            12 non-null object
Discrete Silicate [µmol/L]             12 non-null object
Discrete Nitrite [µmol/L]              12 non-null object
Discrete Nitrate [µmol/L]              12 non-null object
Discrete Chl (ug/l)                    7 non-null object
Discrete Phaeo (ug/l)                  0 non-null float64
Discrete DIC [µmol/kg]                 10 non-null float64
Discrete DIC Flag                      10 non-null float64
Discrete Alkalinity [µmol/kg]          10 non-null float64
Discrete Alkalinity Flag               10 non-null float64
Discrete pH [Total Scale]              5 non-null float64
Discrete pH Analysis 

In [283]:
# Work on an explicit copy so the assignment does not trigger a SettingWithCopyWarning
discrete_data = discrete_data.copy()
discrete_data['Station-Cast #'] = discrete_data['Station-Cast #'].apply(lambda x: str(int(x)).zfill(3))


In [284]:
CTD['Cast'] = CTD['Cast'].apply(lambda x: str(x).zfill(3) )
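
Both frames are converted to zero-padded strings so that the cast keys compare equal however each file stored them. A small sketch of that normalization, using a hypothetical helper named `cast_key` that is not part of the notebook:

```python
def cast_key(value, width=3):
    """Return a cast number as a zero-padded string, e.g. 9, 9.0, or '9' -> '009'."""
    return str(int(float(value))).zfill(width)

# Integer, float, and text inputs all normalize to the same key format
keys = [cast_key(v) for v in (9, 9.0, '9', '012')]
```

Going through `int` matters for float inputs: `str(9.0).zfill(3)` gives `'9.0'`, which would not match `'009'`.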

Merge the discrete data into the CTD data based on the Cast and Niskin bottle number:

In [285]:
CTD = CTD.merge(discrete_data, how='left', left_on=['Cast','Bottle Position'], right_on=['Station-Cast #','Niskin #'])

In [286]:
CTD.columns.values

array(['Unnamed: 0', 'Bottle Position', 'Date Time',
       'Pressure, Digiquartz [db]', 'Depth [salt water, m]',
       'Latitude [deg]', 'Longitude [deg]', 'Temperature [ITS-90, deg C]',
       'Temperature, 2 [ITS-90, deg C]', 'Conductivity [S/m]',
       'Conductivity, 2 [S/m]', 'Salinity, Practical [PSU]',
       'Salinity, Practical, 2 [PSU]', 'Oxygen raw, SBE 43 [V]',
       'Oxygen, SBE 43 [ml/l]',
       'Oxygen Saturation, Garcia & Gordon [ml/l]',
       'Beam Attenuation, WET Labs C-Star [1/m]',
       'Beam Transmission, WET Labs C-Star [%]', 'Filename',
       'Start Latitude [degrees]', 'Start Longitude [degrees]',
       'Start Time [UTC]', 'Cast', 'Discrete Salinity [psu]',
       'Discrete Oxygen [mL/L]', 'Discrete Nitrate + Nitrite [µmol/L]',
       'Discrete Ammonium [µmol/L]', 'Discrete Phosphate [µmol/L]',
       'Discrete Silicate [µmol/L]', 'Discrete Nitrite [µmol/L]',
       'Discrete Nitrate [µmol/L]', 'Discrete Chl (ug/l)',
       'Discrete Phaeo (ug/l)', 'Dis

In [287]:
CTD.drop(labels=['Unnamed: 0','Niskin #'],axis='columns',inplace=True)

In [288]:
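# rename() maps index labels unless columns= is given; this has no effect if there is no 'Cast #' column
CTD.rename(columns={'Cast #':'Station'}, inplace=True)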

In [289]:
CTD

Unnamed: 0,Bottle Position,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],"Conductivity, 2 [S/m]",...,Discrete pCO2 Flag,Discrete pCO2 Analysis Temp [C],Station-Cast #,Start Latitude,Start Longitude,Start Date,Start Time,Target Asset,Bottom Depth [m],Comments
0,1,Jun 01 2017 05:46:35,112.045,111.153,40.28337,-70.83382,10.8062,10.8014,3.87169,3.870753,...,,,,,,NaT,,,,
1,1,Jun 01 2017 04:37:31,86.69,86.005,40.37482,-70.8332,8.1993,8.2078,3.520541,3.521091,...,,,,,,NaT,,,,
2,1,Jun 01 2017 20:54:01,465.958,461.87,39.92994,-70.89002,6.744,6.7581,3.532714,3.533962,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,
3,1,Jun 01 2017 20:54:01,465.958,461.87,39.92994,-70.89002,6.744,6.7581,3.532714,3.533962,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,
4,2,Jun 01 2017 20:54:36,466.011,461.923,39.92995,-70.89002,6.7432,6.7139,3.533018,3.529875,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,Duplicate DIC/TA
5,2,Jun 01 2017 20:54:36,466.011,461.923,39.92995,-70.89002,6.7432,6.7139,3.533018,3.529875,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,Duplicate DIC/TA
6,3,Jun 01 2017 21:04:51,133.199,132.137,39.92996,-70.89,12.4248,12.4232,4.103608,4.1032,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,
7,3,Jun 01 2017 21:04:51,133.199,132.137,39.92996,-70.89,12.4248,12.4232,4.103608,4.1032,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,
8,4,Jun 01 2017 21:05:03,133.282,132.219,39.92996,-70.89,12.4273,12.4255,4.103845,4.103519,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,
9,5,Jun 01 2017 21:24:34,29.522,29.294,39.92996,-70.89,13.2262,13.2255,4.036749,4.03715,...,,,9.0,39 55.797' N,70 53.400' W,2017-06-01,20:41:00,OSPM,469.0,chl. Max


In [290]:
CTD.fillna(-9999999, inplace=True)
CTD['Cruise ID'] = sample_log['Cruise ID'][0]
CTD['Bottle Closure Time [UTC]'] = CTD['Date Time'].apply(lambda x: pd.to_datetime(x).strftime('%Y-%m-%dT%H:%M:%SZ'))
CTD.drop(columns='Date Time', inplace=True)
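
The closure-time conversion can be illustrated with the standard library alone. This sketch assumes, as the notebook does, that the bottle-file times are already in UTC, so the `Z` suffix is simply appended; `to_iso_utc` is a hypothetical helper name:

```python
from datetime import datetime

def to_iso_utc(stamp):
    """Convert a bottle-file time such as 'Jun 01 2017 05:46:35' to ISO 8601 with a Z suffix."""
    parsed = datetime.strptime(stamp, '%b %d %Y %H:%M:%S')
    return parsed.strftime('%Y-%m-%dT%H:%M:%SZ')

iso = to_iso_utc('Jun 01 2017 05:46:35')  # '2017-06-01T05:46:35Z'
```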

Now, find the rows where the CTD start position or cast number is still the fill value, and fill in the Start Latitude, Start Longitude, and Station-Cast # from the other available columns:

In [291]:
CTD['Start Latitude [degrees]'] = CTD['Start Latitude'].where(CTD['Start Latitude [degrees]'] == -9999999, other=CTD['Start Latitude [degrees]'])
CTD['Start Longitude [degrees]'] = CTD['Start Longitude'].where(CTD['Start Longitude [degrees]'] == -9999999, other=CTD['Start Longitude [degrees]'])

In [292]:
CTD['Station-Cast #'] = CTD['Cast'].where(CTD['Station-Cast #'] == -9999999, other=CTD['Station-Cast #'])
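
`Series.where` keeps the calling Series wherever the condition is True and takes `other` everywhere else, which can read backwards at first glance. A minimal sketch on two made-up rows (`demo` is illustrative only):

```python
import pandas as pd

FILL = -9999999
demo = pd.DataFrame({'Start Latitude [degrees]': ['40 17.01 N', FILL],
                     'Start Latitude': [FILL, "39 55.797' N"]})  # made-up rows

missing = demo['Start Latitude [degrees]'] == FILL
# Where the CTD value is missing, take the sample-log value; otherwise keep the CTD value
demo['Start Latitude [degrees]'] = demo['Start Latitude'].where(
    missing, other=demo['Start Latitude [degrees]'])
```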

In [293]:
CTD.drop_duplicates(inplace=True)

In [294]:
CTD

Unnamed: 0,Bottle Position,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],"Conductivity, 2 [S/m]","Salinity, Practical [PSU]",...,Station-Cast #,Start Latitude,Start Longitude,Start Date,Start Time,Target Asset,Bottom Depth [m],Comments,Cruise ID,Bottle Closure Time [UTC]
0,1,112.045,111.153,40.28337,-70.83382,10.8062,10.8014,3.87169,3.870753,34.81,...,5,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999.0,-9999999,AR18-A,2017-06-01T05:46:35Z
1,1,86.69,86.005,40.37482,-70.8332,8.1993,8.2078,3.520541,3.521091,33.6865,...,4,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999.0,-9999999,AR18-A,2017-06-01T04:37:31Z
2,1,465.958,461.87,39.92994,-70.89002,6.744,6.7581,3.532714,3.533962,35.0749,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,-9999999,AR18-A,2017-06-01T20:54:01Z
3,1,465.958,461.87,39.92994,-70.89002,6.744,6.7581,3.532714,3.533962,35.0749,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,-9999999,AR18-A,2017-06-01T20:54:01Z
4,2,466.011,461.923,39.92995,-70.89002,6.7432,6.7139,3.533018,3.529875,35.0791,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,Duplicate DIC/TA,AR18-A,2017-06-01T20:54:36Z
5,2,466.011,461.923,39.92995,-70.89002,6.7432,6.7139,3.533018,3.529875,35.0791,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,Duplicate DIC/TA,AR18-A,2017-06-01T20:54:36Z
6,3,133.199,132.137,39.92996,-70.89,12.4248,12.4232,4.103608,4.1032,35.5496,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,-9999999,AR18-A,2017-06-01T21:04:51Z
7,3,133.199,132.137,39.92996,-70.89,12.4248,12.4232,4.103608,4.1032,35.5496,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,-9999999,AR18-A,2017-06-01T21:04:51Z
8,4,133.282,132.219,39.92996,-70.89,12.4273,12.4255,4.103845,4.103519,35.5495,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,-9999999,AR18-A,2017-06-01T21:05:03Z
9,5,29.522,29.294,39.92996,-70.89,13.2262,13.2255,4.036749,4.03715,34.2135,...,9,39 55.797' N,70 53.400' W,2017-06-01 00:00:00,20:41:00,OSPM,469.0,chl. Max,AR18-A,2017-06-01T21:24:34Z


Autogenerate the filename following the agreed-upon form:

In [295]:
cruise_name = 'Pioneer-08'
cruise_id = list(set(CTD['Cruise ID']))[0]
current_date = pd.Timestamp.now(tz='UTC')  # pd.datetime is deprecated; this gives the current time in UTC directly
version = '1-01'

In [296]:
filename = '_'.join([cruise_name,cruise_id,'Discrete','Summary',current_date.strftime('%Y-%m-%d'),'ver',version,'.xlsx'])
filename

'Pioneer-08_AR18-A_Discrete_Summary_2019-06-21_ver_1-01_.xlsx'
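
Because the extension is one of the joined parts, the name above ends in `_.xlsx`. If that underscore is not wanted, one option is to join only the stem and append the extension afterwards. The helper `summary_filename` below is hypothetical and only sketches that variant:

```python
from datetime import datetime, timezone

def summary_filename(cruise_name, cruise_id, version, when=None, ext='.xlsx'):
    """Build '<cruise>_<id>_Discrete_Summary_<date>_ver_<version><ext>' (hypothetical helper)."""
    when = when or datetime.now(timezone.utc)
    stem = '_'.join([cruise_name, cruise_id, 'Discrete', 'Summary',
                     when.strftime('%Y-%m-%d'), 'ver', version])
    return stem + ext

name = summary_filename('Pioneer-08', 'AR18-A', '1-01',
                        when=datetime(2019, 6, 21, tzinfo=timezone.utc))
```

Files that were already written with the trailing underscore keep their original names, so any later cell that reads them back must use those names as-is.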

**========================================================================================================================**
Import the column order list, use fuzzy string matching to put the columns in that order, and save the data to a new Excel spreadsheet.

In [297]:
column_order = pd.read_excel(basepath+'column_order.xlsx')

In [298]:
column_order = tuple([x.replace('CTD','').strip() for x in column_order.columns.values])

In [299]:
column_order

('Cruise',
 'Station',
 'Target Asset',
 'Start Latitude [degrees]',
 'Start Longitude [degrees]',
 'Start Time [UTC]',
 'Cast',
 'Cast Flag',
 'Bottom Depth at Start Position [m]',
 'File',
 'File Flag',
 'Niskin/Bottle Position',
 'Niskin Flag',
 'Bottle Closure Time [UTC]',
 'Pressure [db]',
 'Pressure Flag',
 'Depth [m]',
 'Latitude [deg]',
 'Longitude [deg]',
 'Temperature 1 [deg C]',
 'Temperature 1 Flag',
 'Temperature 2 [deg C]',
 'Temperature 2 Flag',
 'Conductivity 1 [S/m]',
 'Conductivity 1 Flag',
 'Conductivity 2 [S/m]',
 'Conductivity 2 Flag',
 'Salinity 1, uncorrected [psu]',
 'Salinity 2, uncorrected [psu]',
 'Oxygen, uncorrected [mL/L]',
 'Oxygen Flag',
 'Oxygen Saturation [mL/L]',
 'Fluorescence [mg/m^3]',
 'Fluorescence Flag',
 'Beam Attenuation [1/m]',
 'Beam Transmission [%]',
 'Transmissometer Flag',
 'pH',
 'pH Flag',
 'Discrete Oxygen [mL/L]',
 'Discrete Oxygen Flag',
 'Discrete Oxygen Duplicate Flag',
 'Discrete Chlorophyll [ug/L]',
 'Discrete Phaeopigment [ug/L

In [300]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [301]:
results = {}
CTDsorted = pd.DataFrame()
for column in column_order:
    match = process.extractBests(column.replace('Discrete ','').replace('Calculated ',''),
                                 CTD.columns.values, limit=2, score_cutoff=56, scorer=fuzz.ratio)
    if 'calculated' in column.lower():
        CTDsorted[column] = -9999999
    elif 'flag' in column.lower():
        if column not in ['Discrete DIC Flag','Discrete Alkalinity Flag','Discrete pCO2 Flag','Discrete pH Flag']:
            CTDsorted[column] = -9999999
        else:
            CTDsorted[column] = CTD[column]
            results.update({column:match[0]})
    elif len(match) == 0:
        CTDsorted[column] = -9999999
    elif (match[0][0] not in [x[0] for x in results.values()]):
        CTDsorted[match[0][0]] = CTD[match[0][0]]
        results.update({column:match[0]})
    elif len(match) == 1:
        CTDsorted[match[0][0]] = CTD[match[0][0]]
        results.update({column:match[0]})
    else:
        CTDsorted[match[1][0]] = CTD[match[1][0]]
        results.update({column:match[1]})
CTDsorted['Comments'] = CTD['Comments']
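
The core of the loop above is "find the CTD column whose name is most similar to the target name, or give up below a cutoff". The sketch below illustrates that idea with the standard-library `difflib` as a stand-in for `fuzzywuzzy`; the scores run from 0 to 1 rather than 0 to 100 and will not be identical to `fuzz.ratio`, and `best_match` is a hypothetical helper:

```python
from difflib import SequenceMatcher

def best_match(target, candidates, cutoff=0.56):
    """Return the candidate most similar to `target`, or None if none reaches `cutoff`."""
    scored = [(SequenceMatcher(None, target, c).ratio(), c) for c in candidates]
    score, name = max(scored)
    return name if score >= cutoff else None

ctd_columns = ['Pressure, Digiquartz [db]', 'Depth [salt water, m]', 'Latitude [deg]']
hit = best_match('Pressure [db]', ctd_columns)           # a close name is found
miss = best_match('Fluorescence [mg/m^3]', ctd_columns)  # nothing similar enough
```

A column with no acceptable match is the case the notebook fills with -9999999.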

In [302]:
CTDsorted

Unnamed: 0,Cruise ID,Station-Cast #,Target Asset,Start Latitude [degrees],Start Longitude [degrees],Start Time [UTC],Cast,Cast Flag,Bottom Depth [m],Filename,...,Calculated Alkalinity [µmol/kg],Calculated DIC [µmol/kg],Calculated pCO2 [µatm],Calculated pH,Calculated CO2aq [µmol/kg],Calculated bicarb [µmol/kg],Calculated CO3 [µmol/kg],Calculated Omega-C,Calculated Omega-A,Comments
0,AR18-A,5,-9999999,40 17.01 N,070 50.02 W,2017-06-01T05:40:35Z,5,-9999999,-9999999.0,D:\Data\ar18a005.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
1,AR18-A,4,-9999999,40 22.49 N,070 49.99 W,2017-06-01T04:32:39Z,4,-9999999,-9999999.0,D:\Data\ar18a004.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
2,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
3,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
4,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,Duplicate DIC/TA
5,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,Duplicate DIC/TA
6,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
7,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
8,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
9,AR18-A,9,OSPM,39 55.80 N,070 53.40 W,2017-06-01T20:41:30Z,9,-9999999,469.0,D:\Data\ar18a009.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,chl. Max


Now, check whether values were actually filled in for each of the resulting columns. Where they were not, we want to substitute in the sample bottle number, etc.

In [303]:
# Create a mask for the values that are just surface bottle closures
A = CTDsorted['Target Asset'] == -9999999
#B = CTDsorted['Station-Cast #'] == '008'
#mask = np.logical_or(A,B)

In [304]:
CTDsorted[A].sort_values(by=['Station-Cast #','Bottle Position'])

Unnamed: 0,Cruise ID,Station-Cast #,Target Asset,Start Latitude [degrees],Start Longitude [degrees],Start Time [UTC],Cast,Cast Flag,Bottom Depth [m],Filename,...,Calculated Alkalinity [µmol/kg],Calculated DIC [µmol/kg],Calculated pCO2 [µatm],Calculated pH,Calculated CO2aq [µmol/kg],Calculated bicarb [µmol/kg],Calculated CO3 [µmol/kg],Calculated Omega-C,Calculated Omega-A,Comments
16,AR18-A,1,-9999999,40 14.96 N,070 45.01 W,2017-05-31T07:56:13Z,1,-9999999,-9999999.0,D:\Data\ar18a001.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
25,AR18-A,2,-9999999,40 21.97 N,070 53.30 W,2017-05-31T09:30:51Z,2,-9999999,-9999999.0,D:\Data\ar18a002.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
18,AR18-A,3,-9999999,40 22.53 N,070 46.01 W,2017-06-01T03:36:26Z,3,-9999999,-9999999.0,D:\Data\ar18a003.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
1,AR18-A,4,-9999999,40 22.49 N,070 49.99 W,2017-06-01T04:32:39Z,4,-9999999,-9999999.0,D:\Data\ar18a004.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
0,AR18-A,5,-9999999,40 17.01 N,070 50.02 W,2017-06-01T05:40:35Z,5,-9999999,-9999999.0,D:\Data\ar18a005.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
19,AR18-A,6,-9999999,40 12.01 N,070 50.01 W,2017-06-01T06:44:51Z,6,-9999999,-9999999.0,D:\Data\ar18a006.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
23,AR18-A,7,-9999999,40 06.07 N,070 50.01 W,2017-06-01T08:01:31Z,7,-9999999,-9999999.0,D:\Data\ar18a007.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
17,AR18-A,8,-9999999,40 06.27 N,070 53.65 W,2017-06-01T08:53:41Z,8,-9999999,-9999999.0,D:\Data\ar18a008.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
21,AR18-A,10,-9999999,40 17.07 N,070 57.67 W,2017-06-02T03:34:15Z,10,-9999999,-9999999.0,D:\Data\ar18a010.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
24,AR18-A,11,-9999999,40 17.07 N,070 52.01 W,2017-06-02T04:43:29Z,11,-9999999,-9999999.0,D:\Data\ar18a011.hex,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999


In [305]:
CTDsorted.columns.values

array(['Cruise ID', 'Station-Cast #', 'Target Asset',
       'Start Latitude [degrees]', 'Start Longitude [degrees]',
       'Start Time [UTC]', 'Cast', 'Cast Flag', 'Bottom Depth [m]',
       'Filename', 'File Flag', 'Bottle Position', 'Niskin Flag',
       'Bottle Closure Time [UTC]', 'Pressure, Digiquartz [db]',
       'Pressure Flag', 'Depth [salt water, m]', 'Latitude [deg]',
       'Longitude [deg]', 'Temperature [ITS-90, deg C]',
       'Temperature 1 Flag', 'Temperature, 2 [ITS-90, deg C]',
       'Temperature 2 Flag', 'Conductivity [S/m]', 'Conductivity 1 Flag',
       'Conductivity, 2 [S/m]', 'Conductivity 2 Flag',
       'Salinity, Practical [PSU]', 'Salinity, Practical, 2 [PSU]',
       'Oxygen, SBE 43 [ml/l]', 'Oxygen Flag',
       'Oxygen Saturation, Garcia & Gordon [ml/l]',
       'Fluorescence [mg/m^3]', 'Fluorescence Flag',
       'Beam Attenuation, WET Labs C-Star [1/m]',
       'Beam Transmission, WET Labs C-Star [%]', 'Transmissometer Flag',
       'pH', 'pH Flag', 

In [306]:
np.unique(CTDsorted[A]['Station-Cast #'])

array(['001', '002', '003', '004', '005', '006', '007', '008', '010',
       '011', '012', '013'], dtype=object)

In [307]:
CTDsorted[A][CTDsorted[A]['Station-Cast #'] == '007']['Pressure, Digiquartz [db]']

23    136.904
Name: Pressure, Digiquartz [db], dtype: float64

In [308]:
# Now save the results
CTDsorted.to_excel(basepath+array+cruise+filename)

In [None]:
CTDsorted[CTDsorted['Target Asset'] == -9999999]

In [None]:
CTDsorted.columns

In [None]:
CTDsorted[CTDsorted['Target Asset'] == -9999999][['Discrete Phaeo (ug/l)', 'Discrete Fo/Fa Ratio',
       'Discrete Fluorescence Flag', 'Discrete Fluorescence Duplicate Flag']].iloc[50:100]

In [None]:
np.unique(CTDsorted[A]['Station-Cast #'])  # `mask` is never defined (its definition is commented out above), so use A

In [None]:
CTDsorted[A]['Station-Cast #']

In [None]:
np.unique(CTDsorted[CTDsorted['Target Asset'] != -9999999]['Station-Cast #'])

In [None]:
np.unique(CTD['Station-Cast #'])

In [None]:
len(CTDsorted[CTDsorted['Station-Cast #'] == '008'])

### Continued Processing

In [None]:
os.getcwd()

In [None]:
dirpath = '/'.join((os.getcwd(),'Pioneer','Pioneer-08_AR-18_2017-05-30'))

In [None]:
os.listdir(dirpath)

In [None]:
fpatha = '/'.join((dirpath,'Pioneer-08_Leg_1_AR18-A_Discrete_Summary_2019-06-13_ver_1-01_.xlsx'))
fpathb = '/'.join((dirpath,'Pioneer-08_Leg_2_AR18-B_Discrete_Summary_2019-06-13_ver_1-01_.xlsx'))
fpathc = '/'.join((dirpath,'Pioneer-08_Leg_3_AR18-C_Discrete_Summary_2019-06-13_ver_1-01_.xlsx'))

sa = pd.read_excel(fpatha)
sb = pd.read_excel(fpathb)
sc = pd.read_excel(fpathc)

In [None]:
summary = pd.concat([sa, sb, sc])

In [None]:
summary.sort_values(by=['Cruise ID','Station-Cast #','Bottle Position'], inplace=True)

In [None]:
summary.drop(columns='Unnamed: 0',inplace=True)
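
The `Unnamed: 0` column appears because each leg was saved with its index and then read back. The sketch below shows the same effect with an in-memory CSV (used here only so that no Excel engine is needed) on made-up frames; passing `index=False` when writing avoids the extra column:

```python
import io
import pandas as pd

leg_a = pd.DataFrame({'Cruise ID': ['AR18-A'], 'Bottle Position': [1]})
leg_b = pd.DataFrame({'Cruise ID': ['AR18-B'], 'Bottle Position': [1]})

# Writing with the default index=True stores the index as an unnamed first column
buffer = io.StringIO()
leg_a.to_csv(buffer)
buffer.seek(0)
reread = pd.read_csv(buffer)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([leg_a, leg_b], ignore_index=True)
```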

In [None]:
os.listdir(dirpath+'/Water Sampling')

In [None]:
nutrients = pd.read_excel('/'.join((dirpath,'Water Sampling','Pioneer-08_AR-18_2017-05-30_Nutrients_Sample_Data_2017-08-18_ver_1-00.xlsx')))

In [None]:
summary.head(10)

In [None]:
nutrients.head(10)

In [None]:
summary.columns.values

In [None]:
nutrients.columns.values

In [None]:
summary = summary.merge(nutrients, how='left', left_on=['Cruise ID','Station-Cast #','Bottle Position'], right_on=['CRUISE_ID','CAST_NO','NISKIN_NO'])

In [None]:
summary[summary['Discrete Nitrate [µmol/L]'] > 0 ]

In [None]:
summary.info()

In [None]:
summary['Discrete DIC [µmol/kg]'] = summary['DIC_UMOL_KG']
summary['Discrete DIC Flag'] = summary['DIC_FLAG_W']
summary['Discrete Alkalinity [µmol/kg]'] = summary['TA_UMOL_KG']
summary['Discrete Alkalinity Flag'] = summary['TA_FLAG_W']
summary['Discrete pH [Total Scale]'] = summary['PH_TOT_MEA']
summary['Discrete pH Analysis Temp [C]'] = summary['TMP_PH_DEG_C']
summary['Discrete pH Flag'] = summary['PH_FLAG_W']

In [None]:
summary.drop(columns=['EXPOCODE', 'CRUISE_ID', 'STATION_ID', 'CAST_NO', 'NISKIN_NO',
       'NISKIN_ID', 'YEAR_UTC', 'MONTH_UTC', 'DAY_UTC', 'TIME_UTC',
       'LONGITUDE_DEC', 'LATITUDE_DEC', 'DEPTH_METER', 'DEPTH_BTM_METER',
       'SALINITY_PSS78', 'DIC_UMOL_KG', 'DIC_FLAG_W', 'TA_UMOL_KG',
       'TA_FLAG_W', 'PH_TOT_MEA', 'TMP_PH_DEG_C', 'PH_FLAG_W'], inplace=True)

In [None]:
summary.info()

In [None]:
summary.fillna(value=-9999999,inplace=True)
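
Anyone computing on the exported summary needs to turn the -9999999 fill value back into missing data first, otherwise it dominates any statistic. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

FILL = -9999999
demo = pd.DataFrame({'Discrete Oxygen [mL/L]': [5.1, FILL, 4.9]})  # made-up values

naive_mean = demo['Discrete Oxygen [mL/L]'].mean()   # dominated by the fill value
masked = demo.replace(FILL, np.nan)                  # fill value back to missing
true_mean = masked['Discrete Oxygen [mL/L]'].mean()  # NaN is skipped by default
```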

In [None]:
summary.info()

In [None]:
summary.to_excel('/'.join((dirpath,'Water Sampling','Pioneer-06_AR-04_2016-05-12_DIC_Sample_Data_2019-06-21_ver_1-01.xlsx')))

In [None]:
df = pd.read_excel('/'.join((os.getcwd(),'Pioneer','Pioneer-03','Pioneer-02_KN 217_Discrete_Summary_2019-06-05_ver_1-01_.xlsx')))

In [None]:
df.info()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
filename = '/'.join((os.getcwd(),'Pioneer','Pioneer-02','Pioneer-02_KN 217_Discrete_Summary_2019-06-18_ver_1-01_.xlsx'))

In [None]:
df.to_excel(filename)

In [None]:
dirpath