# Bottle Processing
Author: Andrew Reed

### Motivation:
Independent verification of the suite of physical and chemical observations provided by OOI is critical for the observations to be of use in scientifically valid investigations. Consequently, CTD casts and Niskin water samples are collected during deployment and recovery of OOI platforms, vehicles, and instrumentation. The water samples are subsequently analyzed by independent labs for comparison with the OOI telemetered and recovered data.

However, the water sample data routinely collected and analyzed as part of the OOI program are not currently available in a standardized format that maps the different chemical analyses to the physical measurements taken at bottle closure. Our aim is to make these physical and chemical analyses of collected water samples available to the end user in a standardized format for easy comprehension and use, while preserving the source data files.

### Approach:
Generating a summary of the water sample analyses involves preprocessing and concatenating multiple data sources, and accurately matching samples with each other. To do this, I first preprocess the CTD casts to generate bottle (.btl) files using the SeaBird vendor software, following the SOP available on Alfresco.

Next, the bottle files are parsed with Python and the data columns are renamed following SeaBird's naming guide. This creates a series of individual cast summary (.sum) files. These files are then loaded into pandas dataframes, appended to each other, and exported as a single csv file containing all of the bottle data.
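
The final concatenation step can be sketched as follows. This is a minimal illustration, assuming each .sum file is a CSV written by pandas with its index in the first column; the helper name `combine_summaries` and the tiny demo files are hypothetical and not part of the notebook.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_summaries(directory, pattern="*.sum"):
    """Read every per-cast summary file in `directory` and stack them into one frame."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        # Nothing to combine; return an empty frame rather than raising
        return pd.DataFrame()
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up cast summaries written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Bottle Position": [1, 2], "Cast #": [1, 1]}).to_csv(
        os.path.join(tmp, "cast001.sum")
    )
    pd.DataFrame({"Bottle Position": [1], "Cast #": [2]}).to_csv(
        os.path.join(tmp, "cast002.sum")
    )
    combined = combine_summaries(tmp)
    combined.to_csv(os.path.join(tmp, "CTD_Summary.csv"), index=False)
```

Collecting the frames in a list and calling `pd.concat` once avoids repeatedly copying a growing dataframe.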

### Data Sources/Software:

* **sbe_name_map**: This is a spreadsheet which maps the short names generated by the SeaBird SBE DataProcessing Software to the associated full names. The name mapping originates from SeaBird's SBE DataProcessing support documentation.

* **Alfresco**: The Alfresco CMS for OOI at alfresco.oceanobservatories.org is the source of the CTD hex, xmlcon, and psa files necessary for generating the bottle files needed to create the sample summary sheet.

* **SBEDataProcessing-Win32**: SeaBird vendor software for processing the raw CTD files and generating the .btl files.


In [1]:
# Import packages used in this notebook
import os, sys, re
import pandas as pd
import numpy as np

In [2]:
# Load the name mapping for the column names
# Specify the local directory
# sbe_name_map = pd.read_excel('/media/andrew/OS/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Reference_Files/seabird_ctd_name_map.xlsx')
sbe_name_map = pd.read_excel('C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Reference_Files/seabird_ctd_name_map.xlsx')

In [3]:
sbe_name_map

Unnamed: 0,Short Name,Full Name,Friendly Name,Units,Notes/Comments
0,accM,Acceleration [m/s^2],acc M,m/s^2,
1,accF,Acceleration [ft/s^2],acc F,ft/s^2,
2,altM,Altimeter [m],alt M,m,
3,altF,Altimeter [ft],alt F,ft,
4,avgsvCM,"Average Sound Velocity [Chen-Millero, m/s]",avgsv-C M,"Chen-Millero, m/s",
5,avgsvCF,"Average Sound Velocity [Chen-Millero, ft/s]",avgsv-C F,"Chen-Millero, ft/s",
6,avgsvDM,"Average Sound Velocity [Delgrosso, m/s]",avgsv-D M,"Delgrosso, m/s",
7,avgsvDF,"Average Sound Velocity [Delgrosso, ft/s]",avgsv-D F,"Delgrosso, ft/s",
8,avgsvWM,"Average Sound Velocity [Wilson, m/s]",avgsv-W M,"Wilson, m/s",
9,avgsvWF,"Average Sound Velocity [Wilson, ft/s]",avgsv-W F,"Wilson, ft/s",


In [5]:
# basepath = '/media/andrew/OS/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/'
basepath = 'C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/'
array = 'Pioneer/'
cruise = 'Pioneer-02/'
water_dir = 'Water Sampling/'
ctd_dir = 'ctd/'

In [6]:
os.listdir(basepath+array+cruise+water_dir)

['CHLs KN-214.xls',
 'Pioneer II Nutrient Data 2014.xlsx',
 'Pioneer II Sampling Log.pdf',
 'Pioneer II_Spring2014_DIC,TA, pH data.xlsx',
 'Pioneer2_KN-217_sampling_log-1.xlsx',
 'Salts and O2']

In [7]:
# Specify the local directory where the bottle (.btl) files are stored for a particular cruise
# dirpath = 'C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/Irminger/Irminger-5/ctd/'
btlpath = basepath+array+cruise+ctd_dir
summary_sheet_path = basepath+array+cruise+water_dir+'Pioneer2_KN-217_sampling_log-1.xlsx'
salts_and_o2_path = basepath+array+cruise+water_dir+'Salts and O2/'
nutrients_path = basepath+array+cruise+water_dir+'Pioneer II Nutrient Data 2014.xlsx'
chl_path = basepath+array+cruise+water_dir+'CHLs KN-214.xls'

In [9]:
# Parse the data for the start_time
def parse_header(header):
    """
    Parse the header of bottle (.btl) files to get critical information
    for the summary spreadsheet.
    
    Args:
        header - an object containing the header of the bottle file as a list of
            strings, split at the newline.
    Returns:
        hdr - a dictionary object containing the start_time, filename, latitude,
            longitude, and cruise id.
    """
    hdr = {}
    for line in header:
        if 'start_time' in line.lower():
            start_time = pd.to_datetime(re.split(r'= |\[',line)[1])
            hdr.update({'Start Time':start_time.strftime('%Y-%m-%dT%H:%M:%SZ')})
        elif 'filename' in line.lower():
            hex_name = re.split('=',line)[1].strip()
            hdr.update({'Hex name':hex_name})
        elif 'latitude' in line.lower():
            start_lat = re.split('=',line)[1].strip()
            hdr.update({'Start Latitude':start_lat})
        elif 'longitude' in line.lower():
            start_lon = re.split('=',line)[1].strip()
            hdr.update({'Start Longitude':start_lon})
        elif 'cruise id' in line.lower():
            cruise_id = re.split(':',line)[1].strip()
            hdr.update({'Cruise ID':cruise_id})
        else:
            pass
    
    return hdr
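
As a rough illustration of the header parsing above, the sketch below runs the same splitting logic on three header lines. The lines are hypothetical, written in the style of a SeaBird .btl header using the position and start time shown later in this notebook. The explicit `format` string is an assumption about how the timestamp is written.

```python
import re

import pandas as pd

# Hypothetical header lines, written in the style of a SeaBird .btl header
header = [
    "* NMEA Latitude = 40 21.70 N",
    "* NMEA Longitude = 070 46.31 W",
    "# start_time = Apr 12 2014 18:06:29 [Instrument's time stamp, header]",
]

parsed = {}
for line in header:
    lowered = line.lower()
    if "start_time" in lowered:
        # Keep only the text between "= " and the opening bracket
        raw = re.split(r"= |\[", line)[1].strip()
        stamp = pd.to_datetime(raw, format="%b %d %Y %H:%M:%S")
        parsed["Start Time"] = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    elif "latitude" in lowered:
        parsed["Start Latitude"] = line.split("=")[1].strip()
    elif "longitude" in lowered:
        parsed["Start Longitude"] = line.split("=")[1].strip()
```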
        

In [15]:
# Parse each bottle (.btl) file and write a per-cast summary (.sum) file
files = [x for x in os.listdir(btlpath) if '.btl' in x]
for filename in files:
    filepath = os.path.abspath(btlpath+filename)
    
    # Load the raw content into memory
    with open(filepath) as file:
        content = file.readlines()
    content = [x.strip() for x in content]
    
    # Now parse the file content
    header = []
    columns = []
    data = []
    for line in content:
        if line.startswith('*') or line.startswith('#'):
            header.append(line)
        else:
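            try:
                float(line[0])
                data.append(line)
            except (ValueError, IndexError):
                # Lines that do not start with a digit (or are empty) hold the column names
                columns.append(line)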
    
    # Parse the header
    hdr = parse_header(header)
    
    # Parse the column identifiers
    column_dict = {}
    for line in columns:
        for i,x in enumerate(line.split()):
            try:
                column_dict[i] = column_dict[i] + ' ' + x
            except KeyError:
                column_dict.update({i:x})
    
    # Parse the bottle data based on the column header locations
    data_dict = {x:[] for x in column_dict.keys()}

    for line in data:
        if line.endswith('(avg)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            for i,x in enumerate(values):
                data_dict[i].append(x)
        elif line.endswith('(sdev)'):
            values = list(filter(None,re.split('  |\t', line) ) )
            data_dict[1].append(values[0])
        else:
            pass
            
    data_dict[1] = [' '.join(item) for item in zip(data_dict[1][::2],data_dict[1][1::2])]
    
    # With the parsed data and column names, match up the data and column
    # based on the location
    results = {}
    for key,item in column_dict.items():
        values = data_dict[key]
        results.update({item:values})
        
    # Put the results into a dataframe
    df = pd.DataFrame.from_dict(results)
        
    # Now add the parsed info from the header files into the dataframe
    for key,item in hdr.items():
        df[key] = item
        
    # Get the cast number
    cast = filename[filename.index('.')-3:filename.index('.')]
    df['Cast #'] = str(cast).zfill(3)
    
    # Generate a filename for the summary file
    outname = filename.split('.')[0] + '.sum'
    
    # Save the results
    df.to_csv(btlpath+outname)

In [16]:
filename

'kn217999.btl'
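
The cast number is taken from the last three characters of the file name before the extension. A small sketch of that step, assuming file names follow the `kn217001.btl` pattern seen here; the helper `cast_number` is hypothetical and uses `os.path.splitext` rather than the string indexing in the loop above.

```python
import os


def cast_number(filename):
    """Return the zero-padded, three-character cast number from a bottle file name."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    # The final three characters of the stem are assumed to be the cast number
    return stem[-3:].zfill(3)


first_cast = cast_number("kn217001.btl")
last_cast = cast_number("kn217999.btl")
```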

In [17]:
# Now, for each "summary" file, load and append to each other
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them
frames = [pd.read_csv(btlpath+file) for file in os.listdir(btlpath) if '.sum' in file]
df = pd.concat(frames)

In [18]:
df.head()

Unnamed: 0.1,Unnamed: 0,Bottle Position,Date Time,PrDM,DepSM,Latitude,Longitude,T090C,T190C,C0S/m,...,Sbeox0V,Sbeox0ML/L,OxsolML/L,CStarAt0,CStarTr0,Hex name,Start Latitude,Start Longitude,Start Time,Cast #
0,0,1,Apr 12 2014 18:14:00,90.718,90.001,40.36166,-70.77197,5.05,5.2542,3.19234,...,2.543,6.6363,7.1538,0.3267,92.1575 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
1,1,2,Apr 12 2014 18:14:09,90.823,90.105,40.36166,-70.77198,4.9394,5.187,3.185822,...,2.4942,6.4294,7.17096,0.3622,91.3633 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
2,2,3,Apr 12 2014 18:18:21,51.425,51.023,40.36166,-70.77198,4.9974,4.9989,3.178876,...,2.674,7.0568,7.16667,0.1511,96.2937 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
3,3,4,Apr 12 2014 18:18:32,51.69,51.286,40.36166,-70.77198,5.016,5.0114,3.17867,...,2.6783,7.0752,7.1645,0.1521,96.2681 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
4,4,5,Apr 12 2014 18:22:10,26.715,26.507,40.36166,-70.77199,5.5665,5.5637,3.220398,...,2.7415,7.1601,7.07405,0.2532,93.8761 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
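
Note that the last data column in the output above still carries the `(avg)` marker from the bottle file, so it would be read as text rather than as a number. One possible cleanup is sketched below, assuming the marker only ever appears as a trailing suffix; the two values are copied from the output above, and the column name `CStarTr0` is simply the last parsed column in this example.

```python
import pandas as pd

# Two values copied from the CStarTr0 column shown above
bottles = pd.DataFrame({"CStarTr0": ["92.1575 (avg)", "91.3633 (avg)"]})

# Drop the trailing "(avg)" marker, then convert the remaining text to floats
bottles["CStarTr0"] = (
    bottles["CStarTr0"]
    .str.replace(r"\s*\(avg\)$", "", regex=True)
    .astype(float)
)
```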


In [19]:
sbe_name_map['Short Name'].apply(lambda x: str(x).lower());

In [20]:
# Rename the column title using the sbe_name_mapping 
for colname in list(df.columns.values):
    try:
        fullname = list(sbe_name_map[sbe_name_map['Short Name'].apply(lambda x: str(x).lower() == colname.lower()) == True]['Full Name'])[0]
        df.rename({colname:fullname},axis='columns',inplace=True)
    except:
        pass
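
The renaming loop above rebuilds a boolean mask over the whole name map for every column. An alternative is to build a case-insensitive lookup dictionary once, sketched here under the assumption that short names are unique when lower-cased. The two name-map rows are taken from the table shown earlier; the `bottles` frame is a made-up stand-in for the bottle data.

```python
import pandas as pd

# Two rows from the name map shown earlier
name_map = pd.DataFrame(
    {
        "Short Name": ["accM", "altM"],
        "Full Name": ["Acceleration [m/s^2]", "Altimeter [m]"],
    }
)

# Build the lookup once, keyed by the lower-cased short name
lookup = {
    str(short).lower(): full
    for short, full in zip(name_map["Short Name"], name_map["Full Name"])
}

# Made-up bottle data; columns without a match keep their original name
bottles = pd.DataFrame({"AltM": [3.2], "Bottle Position": [1]})
renamed = bottles.rename(columns=lambda col: lookup.get(col.lower(), col))
```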

In [21]:
df.rename(columns={'Bottle Position':'Niskin #'},inplace=True)
df['Niskin #'] = df['Niskin #'].apply(lambda x: str( int(x) ) )
df.drop(columns='Unnamed: 0',inplace=True)
df['Cast #'] = df['Cast #'].apply(lambda x: str(x).zfill(3) )

In [22]:
df

Unnamed: 0,Niskin #,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],"Conductivity, 2 [S/m]",...,"Oxygen raw, SBE 43 [V]","Oxygen, SBE 43 [ml/l]","Oxygen Saturation, Garcia & Gordon [ml/l]","Beam Attenuation, WET Labs C-Star [1/m]","Beam Transmission, WET Labs C-Star [%]",Hex name,Start Latitude,Start Longitude,Start Time,Cast #
0,1,Apr 12 2014 18:14:00,90.718,90.001,40.36166,-70.77197,5.0500,5.2542,3.192340e+00,3.218524,...,2.5430,6.6363,7.15380,0.3267,92.1575 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
1,2,Apr 12 2014 18:14:09,90.823,90.105,40.36166,-70.77198,4.9394,5.1870,3.185822e+00,3.206726,...,2.4942,6.4294,7.17096,0.3622,91.3633 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
2,3,Apr 12 2014 18:18:21,51.425,51.023,40.36166,-70.77198,4.9974,4.9989,3.178876e+00,3.179077,...,2.6740,7.0568,7.16667,0.1511,96.2937 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
3,4,Apr 12 2014 18:18:32,51.690,51.286,40.36166,-70.77198,5.0160,5.0114,3.178670e+00,3.178866,...,2.6783,7.0752,7.16450,0.1521,96.2681 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
4,5,Apr 12 2014 18:22:10,26.715,26.507,40.36166,-70.77199,5.5665,5.5637,3.220398e+00,3.220331,...,2.7415,7.1601,7.07405,0.2532,93.8761 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
5,6,Apr 12 2014 18:22:21,26.808,26.600,40.36166,-70.77200,5.5634,5.5678,3.220174e+00,3.220768,...,2.7437,7.1647,7.07454,0.2327,94.3495 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
6,7,Apr 12 2014 18:25:16,6.714,6.662,40.36166,-70.77198,5.9768,5.9596,3.256445e+00,3.255135,...,2.7799,7.1846,7.00539,0.2558,93.8056 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
7,8,Apr 12 2014 18:25:27,6.760,6.708,40.36166,-70.77200,5.8953,5.8948,3.249481e+00,3.249591,...,2.7774,7.1908,7.01873,0.2532,93.8670 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,001
0,1,Apr 13 2014 18:00:29,145.688,144.520,40.09324,-70.87536,10.9919,10.9916,3.913031e+00,3.913154,...,2.4173,5.3981,6.18038,0.1763,95.6884 (avg),C:\Data\ctd\kn217002.hex,40 05.59 N,070 52.52 W,2014-04-13T17:54:46Z,002
1,2,Apr 13 2014 18:00:39,145.614,144.446,40.09324,-70.87536,10.9913,10.9912,3.912959e+00,3.913107,...,2.4177,5.3975,6.18046,0.1815,95.5631 (avg),C:\Data\ctd\kn217002.hex,40 05.59 N,070 52.52 W,2014-04-13T17:54:46Z,002


In [23]:
df.to_csv(btlpath+'CTD_Summary.csv')

### Oxygen & Salinity 
Now, we need to add the discrete salinity and oxygen measurements from the analyzed water samples.

In [24]:
def clean_sal_files(dirpath):
    
    # Run check if files are held in excel format or csvs
    csv_flag = any(files.endswith('.SAL') for files in os.listdir(dirpath))
    if csv_flag:
        # Only the raw .SAL files are parsed; they are read from the directory passed in
        for files in [x for x in os.listdir(dirpath) if x.endswith('.SAL')]:
            sample = []
            salinity = []
            with open(os.path.join(dirpath, files)) as file:
                data = file.readlines()
                for ind1,line in enumerate(data):
                    if ind1 == 0:
                        strs = data[0].replace('"','').split(',')
                        cruisename = strs[0]
                        station = strs[1]
                        cast = strs[2]
                        case = strs[8]
                    elif int(line.split()[0]) == 0:
                        pass
                    else:
                        strs = line.split()
                        sample.append(strs[0])
                        salinity.append(strs[2]) 
                # Generate a pandas dataframe to populate data
                data_dict = {'Cruise ID':cruisename,'Station #':station,'Cast #':cast,'Case':case,'Sample ID':sample,'Salinity [psu]':salinity}
                df = pd.DataFrame.from_dict(data_dict)
                df.to_csv(os.path.join(dirpath, files.replace('.','')+'.csv'))
    
    else:
        # If the files are already in excel spreadsheets, they've been cleaned into a
        # logical tabular format
        pass
    

def process_sal_files(dirpath):
    
    # Check if the files are excel files or not
    excel_flag = any(files.endswith('SAL.xlsx') for files in os.listdir(dirpath))
    # Initialize a dataframe for processing the salinity files
    df = pd.DataFrame()
    if excel_flag:
        for file in os.listdir(dirpath):
            if 'SAL.xlsx' in file:
                df = pd.concat([df, pd.read_excel(dirpath+file)])
        df.rename({'Cruise':'Cruise ID','Station':'Station #','Sample':'Sample ID','Salinity':'Salinity [psu]'},
          axis='columns',inplace=True)
        df.dropna(inplace=True)
        df['Station #'] = df['Station #'].apply(lambda x: str( int(x)).zfill(3))
        df['Niskin #'] = df['Niskin #'].apply(lambda x: str( int(x)))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    else:
        for file in os.listdir(dirpath):
            if 'SAL.csv' in file:
                df = pd.concat([df, pd.read_csv(dirpath+file)])
        df.dropna(inplace=True)
        df['Station #'] = df['Station #'].apply(lambda x: str( int(x)).zfill(3))
        df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
        df.drop(columns=[x for x in list(df.columns.values) if 'unnamed' in x.lower()],inplace=True)

    # Save the processed summary file for salinity
    df.to_csv(dirpath+'SAL_Summary.csv')
    
    
def process_oxy_files(dirpath):
    df = pd.DataFrame()
    for file in os.listdir(dirpath):
        if 'OXY' in file:
            df = pd.concat([df, pd.read_excel(dirpath+file)])
    # Rename and clean up the oxygen data to be uniform across data sets
    df.rename({'Cruise':'Cruise ID','Station':'Station #','Sample#':'Sample ID','Oxy':'Oxygen [mL/L]','Unit':'Units'},
          axis='columns',inplace=True)
    df.dropna(inplace=True)
    df['Station #'] = df['Station #'].apply(lambda x: str( int(x)).zfill(3))
    df['Niskin #'] = df['Niskin #'].apply(lambda x: str( int(x)))
    df['Sample ID'] = df['Sample ID'].apply(lambda x: str( int(x)))
    df['Cruise ID'] = df['Cruise ID'].apply(lambda x: x.replace('O','0'))
    
    # Save the processed summary file for oxygen
    df.to_csv(dirpath+'OXY_Summary.csv')

### CTD Sampling Log

In [25]:
os.listdir(basepath+array+cruise+water_dir)

['CHLs KN-214.xls',
 'Pioneer II Nutrient Data 2014.xlsx',
 'Pioneer II Sampling Log.pdf',
 'Pioneer II_Spring2014_DIC,TA, pH data.xlsx',
 'Pioneer2_KN-217_sampling_log-1.xlsx',
 'Salts and O2']

In [26]:
summary_sheet_path = basepath+array+cruise+water_dir+'Pioneer2_KN-217_sampling_log-1.xlsx'

In [27]:
sample_log = pd.read_excel(summary_sheet_path,sheet_name='Summary',header=0)

In [28]:
sample_log.head()

Unnamed: 0,Cruise ID,Target Asset,Station-Cast #,Niskin #,Rosette Position,Date,Time,Bottom Depth [m],Trip Depth,Oxygen Bottle #,Ph Bottle #,DIC/TA Bottle #,Salts Bottle #,Nitrate Bottle 1,Chlorophyll Brown Bottle #,Chlorophyll Filter Sample # Cast #/Depth/Bottle #/,Chlorophyll Brown Bottle Volume,Chlorophyll LN Tube,Comments
0,KN 217,PMUI,1,1,1,2014-04-12,,96,90.6,L1,68.0,69.0,A1,1-1,1,01/01,1057,1,
1,KN 217,PMUI,1,2,2,2014-04-12,,96,90.6,L2,76.0,77.0,A2,1-2,2,01/02,1056,1,
2,KN 217,PMUI,1,3,3,2014-04-12,,96,52.0,L3,70.0,71.0,A3,1-3,3,01/03,1056,1,
3,KN 217,PMUI,1,4,4,2014-04-12,,96,52.0,L4,,,A4,1-4,4,01/04,1056,1,
4,KN 217,PMUI,1,5,5,2014-04-12,,96,26.7,L5,72.0,73.0,A5,1-5,5,01/05,1057,1,


In [None]:
# # Rename the column headers
# sample_log.rename(columns=lambda x: 'Sample Log: ' + x.strip(), inplace=True)
# sample_log['Sample Log: Niskin #'] = sample_log['Sample Log: Niskin #'].apply(lambda x: int(x.replace('*','')) if type(x) == str else x )
# sample_log['Sample Log: Niskin #'] = sample_log['Sample Log: Niskin #'].apply(lambda x: x if np.isnan(x) else str( int(x) ) )
# sample_log['Sample Log: Cast #'] = sample_log['Sample Log: Station-Cast #'].apply(lambda x: str(x).zfill(3))

### Nutrient & Chlorophyll Data

In [29]:
os.listdir(basepath+array+cruise+water_dir)

['CHLs KN-214.xls',
 'Pioneer II Nutrient Data 2014.xlsx',
 'Pioneer II Sampling Log.pdf',
 'Pioneer II_Spring2014_DIC,TA, pH data.xlsx',
 'Pioneer2_KN-217_sampling_log-1.xlsx',
 'Salts and O2']

In [30]:
chlfile = 'CHLs KN-214.xls'
nutfile = 'Pioneer II Nutrient Data 2014.xlsx'

In [49]:
nutrients = pd.read_excel(basepath+array+cruise+water_dir+nutfile)

In [50]:
nutrients.head()

Unnamed: 0,Sample ID,Rep1: Nitrate+Nitrite [µM/L],Rep1: Ammonium [µM/L],Rep1: Phosphate [µM/L],Rep1: Silicate [µM/L],Rep2: Nitrate+Nitrite [µM/L],Rep2: Ammonium [µM/L],Rep2: Phosphate [µM/L],Rep2: Silicate [µM/L],Avg: Nitrate+Nitrite [µM/L],Avg: Ammonium [µM/L],Avg: Phosphate [µM/L],Avg: Silicate [µM/L],Avg: Nitrite [µM/L],Avg: Nitrate [µM/L]
0,1-1,2.051355,2.294434,0.689154,0.994681,2.04764,2.46074,0.696308,0.989391,2.049496,2.377587,0.692731,0.992036,<0.015,2.034496
1,1-2,2.021611,2.441819,0.726115,1.07299,1.98722,2.22871,0.715385,1.0169,2.004416,2.335263,0.72075,1.04494,<0.015,1.989416
2,1-3,0.680377,0.823566,0.374385,0.0380942,0.688742,0.719002,0.376769,<0.030,0.684559,0.771284,0.375577,0.0380942,<0.015,0.669559
3,1-4,0.736145,0.843483,0.4185,<0.030,0.665505,0.650289,0.385115,0.0317452,0.700825,0.746886,0.401808,0.0317452,<0.015,0.685825
4,1-5,0.130127,0.502903,0.326692,<0.030,0.118044,0.47004,0.308808,<0.030,0.124085,0.486472,0.31775,<0.030,<0.015,0.109085


In [33]:
chl = pd.read_excel(basepath+array+cruise+water_dir+chlfile)
chl.head()

Unnamed: 0,Cruise #:,Date,Station Start Time (UTC),Station End Time (UTC),Niskin Trip Time,Lat,Lon,Station Depth,Station-Cast #,Niskin #,...,blank,Rb-blank,Ra-blank,Chl (ug/l),Phaeo (ug/l),quality_flag,Cal_Date,Fluorometer,Comments,Unnamed: 35
0,KN-214,2013-11-21,,23:25:00,23:25:00,40° 7.996' N,70° 45.919' W,131,2,1,...,0.892833,15.757167,13.127167,0.048851,0.163883,2,2013-04-03,0,Phaeo > chl?; Not enough sample water filtered,Time is station end time. Bottle trip time not...
1,KN-214,2013-11-21,,23:25:00,23:25:00,40° 7.996' N,70° 45.919' W,131,2,1,...,0.892833,16.587167,13.297167,0.061111,0.154378,2,2013-04-03,0,Phaeo > chl?; Not enough sample water filtered,Time is station end time. Bottle trip time not...
2,KN-214,2013-11-21,,23:25:00,23:25:00,40° 7.996' N,70° 45.919' W,131,2,3,...,0.892833,14.427167,11.297167,0.057733,0.124068,2,2013-04-03,0,Phaeo > chl?; Not enough sample water filtered,Time is station end time. Bottle trip time not...
3,KN-214,2013-11-21,,23:25:00,23:25:00,40° 7.996' N,70° 45.919' W,131,2,3,...,0.892833,14.907167,11.827167,0.05701,0.133987,2,2013-04-03,0,Phaeo > chl?; Not enough sample water filtered,Time is station end time. Bottle trip time not...
4,KN-214,2013-11-21,,23:25:00,23:25:00,40° 7.996' N,70° 45.919' W,131,2,5,...,0.892833,259.207167,154.507167,1.937964,0.557167,1,2013-04-03,0,,Time is station end time. Bottle trip time not...


In [34]:
os.listdir(basepath+array+cruise+water_dir+'Salts and O2')

['001.SAL',
 '001SAL.csv',
 '002.SAL',
 '002SAL.csv',
 '003.SAL',
 '003SAL.csv',
 '004.SAL',
 '004SAL.csv',
 '005.SAL',
 '005SAL.csv',
 '006.SAL',
 '006SAL.csv',
 '007.SAL',
 '007SAL.csv',
 '008.SAL',
 '008SAL.csv',
 'KN217 all oxy.xlsx',
 'OXY_Summary.csv',
 'SAL_Summary.csv']

In [36]:
# Load the Salinity and oxygen summaries
sal = pd.read_csv(basepath+array+cruise+water_dir+'Salts and O2/SAL_Summary.csv')
if 'case' in [x.lower() for x in sal.columns.values]:
    sal['Sample ID'] = sal['Case'] + sal['Sample ID'].apply(lambda x: str(x)) 
oxy = pd.read_csv(basepath+array+cruise+water_dir+'Salts and O2/OXY_Summary.csv')
if 'case' in [x.lower() for x in oxy.columns.values]:
    oxy['Sample ID'] = oxy['Case'] + oxy['Sample ID'].apply(lambda x: str(x)) 

In [37]:
sal.head()

Unnamed: 0.1,Unnamed: 0,Cruise ID,Station #,Cast #,Case,Sample ID,Salinity [psu]
0,0,KN217,1,1,A,A1,33.0886
1,1,KN217,1,1,A,A2,33.0933
2,2,KN217,1,1,A,A3,33.035
3,3,KN217,1,1,A,A4,33.036
4,4,KN217,1,1,A,A5,32.9782


In [38]:
sample_log.head()

Unnamed: 0,Cruise ID,Target Asset,Station-Cast #,Niskin #,Rosette Position,Date,Time,Bottom Depth [m],Trip Depth,Oxygen Bottle #,Ph Bottle #,DIC/TA Bottle #,Salts Bottle #,Nitrate Bottle 1,Chlorophyll Brown Bottle #,Chlorophyll Filter Sample # Cast #/Depth/Bottle #/,Chlorophyll Brown Bottle Volume,Chlorophyll LN Tube,Comments
0,KN 217,PMUI,1,1,1,2014-04-12,,96,90.6,L1,68.0,69.0,A1,1-1,1,01/01,1057,1,
1,KN 217,PMUI,1,2,2,2014-04-12,,96,90.6,L2,76.0,77.0,A2,1-2,2,01/02,1056,1,
2,KN 217,PMUI,1,3,3,2014-04-12,,96,52.0,L3,70.0,71.0,A3,1-3,3,01/03,1056,1,
3,KN 217,PMUI,1,4,4,2014-04-12,,96,52.0,L4,,,A4,1-4,4,01/04,1056,1,
4,KN 217,PMUI,1,5,5,2014-04-12,,96,26.7,L5,72.0,73.0,A5,1-5,5,01/05,1057,1,


In [39]:
# Now merge the discrete salinity values into the sample log
sample_log = sample_log.merge(sal[['Station #','Sample ID','Salinity [psu]']], how='left', left_on=['Station-Cast #','Salts Bottle #'], right_on=['Station #','Sample ID'])

In [40]:
sample_log.rename({'Salinity [psu]':'Discrete Salinity [psu]'},axis='columns',inplace=True)
sample_log.drop(['Station #','Sample ID'],axis='columns',inplace=True)
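
A left merge silently leaves missing values wherever a sample has no match, so it can be worth checking which rows failed to pair up. The sketch below uses the `indicator` and `validate` options of `DataFrame.merge` on small made-up frames. The bottle label `B1` is hypothetical, and the check is a suggestion rather than part of the original workflow.

```python
import pandas as pd

# Made-up sample log: the third row has no matching salinity sample
log = pd.DataFrame(
    {"Station-Cast #": [1, 1, 2], "Salts Bottle #": ["A1", "A2", "B1"]}
)
sal = pd.DataFrame(
    {
        "Station #": [1, 1],
        "Sample ID": ["A1", "A2"],
        "Salinity [psu]": [33.0886, 33.0933],
    }
)

merged = log.merge(
    sal,
    how="left",
    left_on=["Station-Cast #", "Salts Bottle #"],
    right_on=["Station #", "Sample ID"],
    indicator=True,           # adds a "_merge" column describing each row's origin
    validate="many_to_one",   # raises if a salinity sample ID is duplicated
)

# Rows of the sample log that found no salinity measurement
unmatched = merged[merged["_merge"] == "left_only"]
```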

In [41]:
sample_log.rename(columns=lambda x: x.strip(),inplace=True)

In [42]:
oxy.head()

Unnamed: 0.1,Unnamed: 0,Cruise ID,Station #,Sample ID,Niskin #,Oxygen [mL/L],Units
0,2,KN217,1,L1,1,7.316,ml/L
1,3,KN217,1,L2,2,7.076,ml/L
2,4,KN217,1,L3,3,7.292,ml/L
3,5,KN217,1,L4,4,7.274,ml/L
4,6,KN217,1,L5,5,7.341,ml/L


In [43]:
sample_log = sample_log.merge(oxy[['Station #','Sample ID','Oxygen [mL/L]']], how='left', left_on=['Station-Cast #','Oxygen Bottle #'], right_on=['Station #','Sample ID'])

In [44]:
sample_log.rename({'Oxygen [mL/L]':'Discrete Oxygen [mL/L]'},axis='columns',inplace=True)
sample_log.drop(['Station #','Sample ID'],axis='columns',inplace=True)

In [45]:
nutrients.reset_index(inplace=True)

In [46]:
#nutrients.rename({'index':'Sample ID'},axis='columns',inplace=True)

In [51]:
nutrients.head()

Unnamed: 0,Sample ID,Rep1: Nitrate+Nitrite [µM/L],Rep1: Ammonium [µM/L],Rep1: Phosphate [µM/L],Rep1: Silicate [µM/L],Rep2: Nitrate+Nitrite [µM/L],Rep2: Ammonium [µM/L],Rep2: Phosphate [µM/L],Rep2: Silicate [µM/L],Avg: Nitrate+Nitrite [µM/L],Avg: Ammonium [µM/L],Avg: Phosphate [µM/L],Avg: Silicate [µM/L],Avg: Nitrite [µM/L],Avg: Nitrate [µM/L]
0,1-1,2.051355,2.294434,0.689154,0.994681,2.04764,2.46074,0.696308,0.989391,2.049496,2.377587,0.692731,0.992036,<0.015,2.034496
1,1-2,2.021611,2.441819,0.726115,1.07299,1.98722,2.22871,0.715385,1.0169,2.004416,2.335263,0.72075,1.04494,<0.015,1.989416
2,1-3,0.680377,0.823566,0.374385,0.0380942,0.688742,0.719002,0.376769,<0.030,0.684559,0.771284,0.375577,0.0380942,<0.015,0.669559
3,1-4,0.736145,0.843483,0.4185,<0.030,0.665505,0.650289,0.385115,0.0317452,0.700825,0.746886,0.401808,0.0317452,<0.015,0.685825
4,1-5,0.130127,0.502903,0.326692,<0.030,0.118044,0.47004,0.308808,<0.030,0.124085,0.486472,0.31775,<0.030,<0.015,0.109085
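
Several nutrient values are reported as below a detection limit (for example `<0.015`), which makes those columns text rather than numeric. One way to handle this, sketched here as an assumption rather than the notebook's method, is to keep a boolean flag for the below-limit entries and convert the rest to numbers. The value `0.021` is made up for illustration.

```python
import pandas as pd

# "<0.015" appears in the nutrient table above; "0.021" is a made-up value
nitrite = pd.Series(["<0.015", "0.021", "<0.015"], name="Avg: Nitrite [µM/L]")

# Flag entries reported as below the detection limit
below_limit = nitrite.astype(str).str.startswith("<")

# Convert to numbers; below-limit entries become NaN instead of raising
numeric = pd.to_numeric(nitrite, errors="coerce")
```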


In [52]:
sample_log = sample_log.merge(nutrients, how='left', left_on=['Nitrate Bottle 1'], right_on=['Sample ID'])

In [53]:
sample_log.rename(columns=lambda x: x.replace('Avg:', 'Discrete'), inplace=True)
sample_log.drop(['Sample ID'],axis='columns',inplace=True)

In [54]:
sample_log.columns.values

array(['Cruise ID', 'Target Asset', 'Station-Cast #', 'Niskin #',
       'Rosette Position', 'Date', 'Time', 'Bottom Depth [m]',
       'Trip Depth', 'Oxygen Bottle #', 'Ph Bottle #', 'DIC/TA Bottle #',
       'Salts Bottle #', 'Nitrate Bottle 1', 'Chlorophyll Brown Bottle #',
       'Chlorophyll Filter Sample # \nCast #/Depth/Bottle #/',
       'Chlorophyll Brown Bottle Volume', 'Chlorophyll LN Tube',
       'Comments', 'Discrete Salinity [psu]', 'Discrete Oxygen [mL/L]',
       'Rep1: Nitrate+Nitrite [µM/L]', 'Rep1: Ammonium [µM/L]',
       'Rep1: Phosphate [µM/L]', 'Rep1: Silicate [µM/L]',
       'Rep2: Nitrate+Nitrite [µM/L]', 'Rep2: Ammonium [µM/L]',
       'Rep2: Phosphate [µM/L]', 'Rep2: Silicate [µM/L]',
       'Discrete Nitrate+Nitrite [µM/L]', 'Discrete Ammonium [µM/L]',
       'Discrete Phosphate [µM/L]', 'Discrete Silicate [µM/L]',
       'Discrete Nitrite [µM/L]', 'Discrete Nitrate [µM/L]'], dtype=object)

In [55]:
# Now add the chlorophyll data
chl.columns.values

array(['Cruise #:', 'Date', 'Station \nStart Time (UTC)',
       'Station \nEnd Time (UTC)', 'Niskin Trip Time', 'Lat', 'Lon',
       'Station Depth', 'Station-Cast #', 'Niskin #', 'Trip \nDepth',
       'Brown Bottle #', 'Replicate', 'Water Depth Rep',
       'Filter \nSample #', 'Vol\nFilt', 'Filter\nSize', 'Vol Extracted',
       'Sample', '90% Acetone', 'Dilution During Reading',
       'Chl_Cal_Filename', 'tau_Calibration', 'Fd_Calibration', 'Rb',
       'Ra', 'blank', 'Rb-blank', 'Ra-blank', 'Chl (ug/l)',
       'Phaeo (ug/l)', 'quality_flag', 'Cal_Date', 'Fluorometer',
       'Comments', 'Unnamed: 35'], dtype=object)

In [56]:
# Take an explicit copy so the rename below does not act on a slice of chl
chl_df = chl[['Station-Cast #','Brown Bottle #','Chl (ug/l)','Phaeo (ug/l)']].copy()
chl_df.rename(columns=lambda x: 'Discrete ' + x, inplace=True)
#chl_df.rename({'Discrete quality_flag':'Discrete Chl quality flag'},axis='columns',inplace=True)


In [57]:
sample_log = sample_log.merge(chl_df, how='left', left_on=['Station-Cast #','Chlorophyll Brown Bottle #'], right_on=['Discrete Station-Cast #','Discrete Brown Bottle #'])
sample_log.drop(['Discrete Station-Cast #','Discrete Brown Bottle #'],axis='columns',inplace=True)

In [59]:
# Now load the CTD summary data
CTD = pd.read_csv(basepath+array+cruise+ctd_dir+'CTD_Summary.csv')

In [66]:
CTD

Unnamed: 0.1,Unnamed: 0,Niskin #,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],...,"Oxygen raw, SBE 43 [V]","Oxygen, SBE 43 [ml/l]","Oxygen Saturation, Garcia & Gordon [ml/l]","Beam Attenuation, WET Labs C-Star [1/m]","Beam Transmission, WET Labs C-Star [%]",Hex name,Start Latitude,Start Longitude,Start Time,Cast #
0,0,1,Apr 12 2014 18:14:00,90.718,90.001,40.36166,-70.77197,5.0500,5.2542,3.192340e+00,...,2.5430,6.6363,7.15380,0.3267,92.1575 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
1,1,2,Apr 12 2014 18:14:09,90.823,90.105,40.36166,-70.77198,4.9394,5.1870,3.185822e+00,...,2.4942,6.4294,7.17096,0.3622,91.3633 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
2,2,3,Apr 12 2014 18:18:21,51.425,51.023,40.36166,-70.77198,4.9974,4.9989,3.178876e+00,...,2.6740,7.0568,7.16667,0.1511,96.2937 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
3,3,4,Apr 12 2014 18:18:32,51.690,51.286,40.36166,-70.77198,5.0160,5.0114,3.178670e+00,...,2.6783,7.0752,7.16450,0.1521,96.2681 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
4,4,5,Apr 12 2014 18:22:10,26.715,26.507,40.36166,-70.77199,5.5665,5.5637,3.220398e+00,...,2.7415,7.1601,7.07405,0.2532,93.8761 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
5,5,6,Apr 12 2014 18:22:21,26.808,26.600,40.36166,-70.77200,5.5634,5.5678,3.220174e+00,...,2.7437,7.1647,7.07454,0.2327,94.3495 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
6,6,7,Apr 12 2014 18:25:16,6.714,6.662,40.36166,-70.77198,5.9768,5.9596,3.256445e+00,...,2.7799,7.1846,7.00539,0.2558,93.8056 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
7,7,8,Apr 12 2014 18:25:27,6.760,6.708,40.36166,-70.77200,5.8953,5.8948,3.249481e+00,...,2.7774,7.1908,7.01873,0.2532,93.8670 (avg),C:\Data\ctd\kn217001.hex,40 21.70 N,070 46.31 W,2014-04-12T18:06:29Z,1
8,0,1,Apr 13 2014 18:00:29,145.688,144.520,40.09324,-70.87536,10.9919,10.9916,3.913031e+00,...,2.4173,5.3981,6.18038,0.1763,95.6884 (avg),C:\Data\ctd\kn217002.hex,40 05.59 N,070 52.52 W,2014-04-13T17:54:46Z,2
9,1,2,Apr 13 2014 18:00:39,145.614,144.446,40.09324,-70.87536,10.9913,10.9912,3.912959e+00,...,2.4177,5.3975,6.18046,0.1815,95.5631 (avg),C:\Data\ctd\kn217002.hex,40 05.59 N,070 52.52 W,2014-04-13T17:54:46Z,2
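
The `Cast #` column was zero-padded to `001` before saving, but it reads back as `1` because `read_csv` infers an integer type. If the padded form is needed, the column can be read as text, as in this sketch with a small in-memory CSV standing in for CTD_Summary.csv.

```python
import io

import pandas as pd

# Small in-memory stand-in for the saved summary file
csv_text = "Cast #,Niskin #\n001,1\n002,1\n"

# Default type inference drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps the zero padding
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Cast #": str})
```

Whichever form is chosen, the merge keys on both sides need the same type for the rows to match.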


In [69]:
column_list = []
for name in list(sample_log.columns.values):
    if 'Discrete' in name:
        column_list.append(name)
column_list.append('Station-Cast #')
column_list.append('Niskin #')
column_list.append('Cruise ID')
column_list.append('Target Asset')
column_list.append('Bottom Depth [m]')

In [70]:
discrete_data = sample_log[column_list]

In [72]:
discrete_data.head()

Unnamed: 0,Discrete Salinity [psu],Discrete Oxygen [mL/L],Discrete Nitrate+Nitrite [µM/L],Discrete Ammonium [µM/L],Discrete Phosphate [µM/L],Discrete Silicate [µM/L],Discrete Nitrite [µM/L],Discrete Nitrate [µM/L],Discrete Chl (ug/l),Discrete Phaeo (ug/l),Station-Cast #,Niskin #,Cruise ID,Target Asset,Bottom Depth [m]
0,33.0886,7.316,2.049496,2.377587,0.692731,0.992036,<0.015,2.034496,1.619683,0.645979,1,1,KN 217,PMUI,96
1,33.0933,7.076,2.004416,2.335263,0.72075,1.04494,<0.015,1.989416,2.012234,0.805782,1,2,KN 217,PMUI,96
2,33.035,7.292,0.684559,0.771284,0.375577,0.0380942,<0.015,0.669559,0.378986,0.105017,1,3,KN 217,PMUI,96
3,33.036,7.274,0.700825,0.746886,0.401808,0.0317452,<0.015,0.685825,0.506317,0.135111,1,4,KN 217,PMUI,96
4,32.9782,7.341,0.124085,0.486472,0.31775,<0.030,<0.015,0.109085,0.679125,0.128612,1,5,KN 217,PMUI,96


In [73]:
CTD = CTD.merge(discrete_data, how='left', left_on=['Cast #','Niskin #'], right_on=['Station-Cast #','Niskin #'])

In [74]:
CTD.head()

Unnamed: 0.1,Unnamed: 0,Niskin #,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],...,Discrete Phosphate [µM/L],Discrete Silicate [µM/L],Discrete Nitrite [µM/L],Discrete Nitrate [µM/L],Discrete Chl (ug/l),Discrete Phaeo (ug/l),Station-Cast #,Cruise ID,Target Asset,Bottom Depth [m]
0,0,1,Apr 12 2014 18:14:00,90.718,90.001,40.36166,-70.77197,5.05,5.2542,3.19234,...,0.692731,0.992036,<0.015,2.034496,1.619683,0.645979,1.0,KN 217,PMUI,96.0
1,1,2,Apr 12 2014 18:14:09,90.823,90.105,40.36166,-70.77198,4.9394,5.187,3.185822,...,0.72075,1.04494,<0.015,1.989416,2.012234,0.805782,1.0,KN 217,PMUI,96.0
2,2,3,Apr 12 2014 18:18:21,51.425,51.023,40.36166,-70.77198,4.9974,4.9989,3.178876,...,0.375577,0.0380942,<0.015,0.669559,0.378986,0.105017,1.0,KN 217,PMUI,96.0
3,3,4,Apr 12 2014 18:18:32,51.69,51.286,40.36166,-70.77198,5.016,5.0114,3.17867,...,0.401808,0.0317452,<0.015,0.685825,0.506317,0.135111,1.0,KN 217,PMUI,96.0
4,4,5,Apr 12 2014 18:22:10,26.715,26.507,40.36166,-70.77199,5.5665,5.5637,3.220398,...,0.31775,<0.030,<0.015,0.109085,0.679125,0.128612,1.0,KN 217,PMUI,96.0


In [75]:
CTD.drop(['Unnamed: 0'],axis='columns',inplace=True)

In [76]:
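# Format the timestamps first, so the fill value below is never parsed as a date
CTD['Date Time'] = pd.to_datetime(CTD['Date Time']).dt.strftime('%Y-%m-%dT%H:%M:%SZ')
# Replace the remaining NaNs with the fill value
CTD.fillna(-9999999, inplace=True)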

In [77]:
CTD.head()

Unnamed: 0,Niskin #,Date Time,"Pressure, Digiquartz [db]","Depth [salt water, m]",Latitude [deg],Longitude [deg],"Temperature [ITS-90, deg C]","Temperature, 2 [ITS-90, deg C]",Conductivity [S/m],"Conductivity, 2 [S/m]",...,Discrete Phosphate [µM/L],Discrete Silicate [µM/L],Discrete Nitrite [µM/L],Discrete Nitrate [µM/L],Discrete Chl (ug/l),Discrete Phaeo (ug/l),Station-Cast #,Cruise ID,Target Asset,Bottom Depth [m]
0,1,2014-04-12T18:14:00Z,90.718,90.001,40.36166,-70.77197,5.05,5.2542,3.19234,3.218524,...,0.692731,0.992036,<0.015,2.034496,1.619683,0.645979,1.0,KN 217,PMUI,96.0
1,2,2014-04-12T18:14:09Z,90.823,90.105,40.36166,-70.77198,4.9394,5.187,3.185822,3.206726,...,0.72075,1.04494,<0.015,1.989416,2.012234,0.805782,1.0,KN 217,PMUI,96.0
2,3,2014-04-12T18:18:21Z,51.425,51.023,40.36166,-70.77198,4.9974,4.9989,3.178876,3.179077,...,0.375577,0.0380942,<0.015,0.669559,0.378986,0.105017,1.0,KN 217,PMUI,96.0
3,4,2014-04-12T18:18:32Z,51.69,51.286,40.36166,-70.77198,5.016,5.0114,3.17867,3.178866,...,0.401808,0.0317452,<0.015,0.685825,0.506317,0.135111,1.0,KN 217,PMUI,96.0
4,5,2014-04-12T18:22:10Z,26.715,26.507,40.36166,-70.77199,5.5665,5.5637,3.220398,3.220331,...,0.31775,<0.030,<0.015,0.109085,0.679125,0.128612,1.0,KN 217,PMUI,96.0


In [89]:
ind = summary_sheet_path.lower().find('sampling_log')
summary_name = ''.join([summary_sheet_path[:ind],'Sample_Summary.csv'])
summary_name

'C:/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox/Ship_data/Pioneer/Pioneer-02/Water Sampling/Pioneer2_KN-217_Sample_Summary.csv'
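
Because `str.find` returns -1 when the marker is missing, the slice above would then silently drop only the last character of the path. A minimal sketch of a guarded version follows; the helper name `build_summary_name` and the example file name are hypothetical, not taken from the original notebook.

```python
def build_summary_name(sheet_path, marker='sampling_log', suffix='Sample_Summary.csv'):
    """Replace everything from `marker` onward with `suffix` (case-insensitive match)."""
    ind = sheet_path.lower().find(marker.lower())
    if ind == -1:
        # Fail loudly instead of slicing with -1
        raise ValueError(f"'{marker}' not found in path: {sheet_path}")
    return sheet_path[:ind] + suffix


# Hypothetical input path, used only to illustrate the helper
example = 'Water Sampling/Pioneer2_KN-217_Sampling_Log.xlsx'
print(build_summary_name(example))
```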

In [90]:
CTD.to_csv(summary_name)

In [None]:
# Split comma-separated bottle numbers into one row per bottle; explode() keeps the original row index
nutrient_bottles = sample_log['Sample Log: Nitrate Bottle 1'].str.split(',').explode()
nutrient_bottles.name = 'Nitrate Bottle #'

In [None]:
nutrient_bottles

In [None]:
# Add the nutrient bottle number back into the sample log, and remove the excess '.'
sample_log = sample_log.join(nutrient_bottles)
sample_log['Nitrate Bottle #'] = sample_log['Nitrate Bottle #'].apply(
    lambda x: x.replace('.', '').replace(' ', '') if isinstance(x, str) else x)
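
The split, clean, and join steps above can be illustrated end to end on toy data. This is a sketch under assumptions: the frame `log` and its bottle labels are made up, and `Series.explode` (pandas 0.25 or later) is used for the split.

```python
import pandas as pd

# Made-up sample log: one row per Niskin, bottle IDs sometimes comma-separated
log = pd.DataFrame({
    'Niskin #': [1, 2, 3],
    'Sample Log: Nitrate Bottle 1': ['12-1., 12-2', '13-1', None],
})

# One row per bottle ID; the original row index is repeated for split entries
bottles = log['Sample Log: Nitrate Bottle 1'].str.split(',').explode()
bottles.name = 'Nitrate Bottle #'

# Strip stray periods and spaces, leaving missing values untouched
bottles = bottles.apply(
    lambda x: x.replace('.', '').replace(' ', '') if isinstance(x, str) else x)

# Joining on the index duplicates the log row for each of its bottles
expanded = log.join(bottles)
print(expanded[['Niskin #', 'Nitrate Bottle #']])
```

A Niskin that fed two nutrient bottles appears twice after the join, which is what lets the later merge on the bottle ID attach one nutrient result per row.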

In [None]:
# Now I can add the nutrient bottle data to the sample log before loading
sample_log = sample_log.merge(nutrients, how='left', left_on='Nitrate Bottle #', right_on='Sample ID')

In [None]:
sample_log.columns.values

In [None]:
sample_columns = ['Sample Log: Cast #','Sample Log: Niskin #','SAL: Salinity','OXY: Oxy','Avg: Nitrate+Nitrite [µmol/L]', 'Avg: Ammonium [µmol/L]',
    'Avg: Phosphate [µmol/L]', 'Avg: Silicate [µmol/L]', 'Avg: Nitrite [µmol/L]', 'Avg: Nitrate [µmol/L]']

In [None]:
result = df.merge(sample_log[sample_columns], how='left',
                  left_on=['Cast #','Niskin #'], right_on=['Sample Log: Cast #','Sample Log: Niskin #'])

In [None]:
result

In [None]:
result.columns.values

In [None]:
# Take the first non-missing unit string; this assumes each file reports a single unit
sal_unit = '[' + sal_df['SAL: Unit'].dropna().unique()[0] + ']'
oxy_unit = '[' + oxy_df['OXY: Unit'].dropna().unique()[0] + ']'
sal_unit, oxy_unit

In [None]:
result.drop(['Sample Log: Cast #','Sample Log: Niskin #'], axis=1, inplace=True)

In [None]:
result.rename(columns=lambda x: x.replace('SAL:','Bottle') + ' ' + sal_unit if 'SAL:' in x else x, inplace=True)
result.rename(columns=lambda x: x.replace('OXY:','Bottle') + ' ' + oxy_unit if 'OXY:' in x else x, inplace=True)
result.rename(columns=lambda x: x.replace('Avg:','Bottle') if 'Avg:' in x else x, inplace=True)
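
The three renames above can also be written as a single pass with one named function, which makes the mapping easier to check. A minimal sketch on an empty toy frame; the unit strings and the helper name `bottle_name` are assumptions for illustration, not values read from the salinity or oxygen files.

```python
import pandas as pd

# Assumed unit strings, for illustration only
sal_unit, oxy_unit = '[psu]', '[mL/L]'


def bottle_name(col):
    """Map a lab-specific column prefix to the common 'Bottle' prefix."""
    if 'SAL:' in col:
        return col.replace('SAL:', 'Bottle') + ' ' + sal_unit
    if 'OXY:' in col:
        return col.replace('OXY:', 'Bottle') + ' ' + oxy_unit
    if 'Avg:' in col:
        return col.replace('Avg:', 'Bottle')
    return col


toy = pd.DataFrame(columns=['Cast #', 'SAL: Salinity', 'OXY: Oxy', 'Avg: Nitrate [µmol/L]'])
renamed = toy.rename(columns=bottle_name)
print(list(renamed.columns))
```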

In [None]:
result

In [None]:
# Replace all of the NaNs with the fill value -9999999 and save to a csv
result.fillna(str(-9999999),inplace=True)

In [None]:
result

In [None]:
result.to_csv(salts_and_o2_path+'Irminger-3_Summary.csv')

In [None]:
os.getcwd()