# Exploratory Analysis for Metadata Review in OOI Asset Management System

### Motivation:
The Asset Management system for OOI is primarily housed on GitHub in a variety of csv files. Until now, the calibration coefficients stored in the csv files have been entered manually. While we have used a "human-in-the-loop" review approach to catch errors, some errors have slipped through (e.g. truncation of significant figures).

### Approach:
My goal is to develop an automated approach to catch possible errors which already exist within the asset management system. To accomplish this, I will compare the csv files loaded into the GitHub asset management system with the original vendor files as well as the QCT (quality control testing) documents, which capture the coefficients loaded onto the instrument at the time of receipt at WHOI from the vendor.

### Data Sources:
* **GitHub**: CSV files containing the calibration coefficients, organized into directories by sensor+class. The files are named "CGINS-(sensor+class)-(serial number)__(YYYYMMDD).csv", where YYYYMMDD is the calibration date.
* **Vault**: 
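
As a minimal sketch of how the file naming convention can be used, the snippet below splits a calibration file name into its UID and calibration date. The pattern is inferred from file names that appear in this notebook (e.g. `CGINS-CTDBPD-07209__20121025.csv`); the names `CAL_FILE_PATTERN` and `parse_cal_filename` are hypothetical and not part of the asset management tooling.

```python
import re
from datetime import datetime

# Pattern inferred from file names such as 'CGINS-CTDBPD-07209__20121025.csv'
CAL_FILE_PATTERN = re.compile(r'^(?P<uid>CGINS-[A-Z0-9]+-\d+)__(?P<date>\d{8})\.csv$')

def parse_cal_filename(filename):
    """Return (uid, calibration date) for a calibration csv name, or None."""
    match = CAL_FILE_PATTERN.match(filename)
    if match is None:
        return None
    cal_date = datetime.strptime(match.group('date'), '%Y%m%d').date()
    return match.group('uid'), cal_date

print(parse_cal_filename('CGINS-CTDBPD-07209__20121025.csv'))
```

Returning `None` for names that do not fit the pattern makes it easy to flag misnamed files instead of failing on them.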

In [1]:
# Import likely important packages, etc.
import sys, os, csv, re
from wcmatch import fnmatch
import datetime
import time
import xml.etree.ElementTree as et
from zipfile import ZipFile
import numpy as np
import pandas as pd
import xarray as xr

In [2]:
coefficient_name_map = {
            'TA0': 'CC_a0',
            'TA1': 'CC_a1',
            'TA2': 'CC_a2',
            'TA3': 'CC_a3',
            'CPCOR': 'CC_cpcor',
            'CTCOR': 'CC_ctcor',
            'CG': 'CC_g',
            'CH': 'CC_h',
            'CI': 'CC_i',
            'CJ': 'CC_j',
            'G': 'CC_g',
            'H': 'CC_h',
            'I': 'CC_i',
            'J': 'CC_j',
            'PA0': 'CC_pa0',
            'PA1': 'CC_pa1',
            'PA2': 'CC_pa2',
            'PTEMPA0': 'CC_ptempa0',
            'PTEMPA1': 'CC_ptempa1',
            'PTEMPA2': 'CC_ptempa2',
            'PTCA0': 'CC_ptca0',
            'PTCA1': 'CC_ptca1',
            'PTCA2': 'CC_ptca2',
            'PTCB0': 'CC_ptcb0',
            'PTCB1': 'CC_ptcb1',
            'PTCB2': 'CC_ptcb2',
            # additional types for series O
            'C1': 'CC_C1',
            'C2': 'CC_C2',
            'C3': 'CC_C3',
            'D1': 'CC_D1',
            'D2': 'CC_D2',
            'T1': 'CC_T1',
            'T2': 'CC_T2',
            'T3': 'CC_T3',
            'T4': 'CC_T4',
            'T5': 'CC_T5',
        }

o2_coefficients_map = {
            'A': 'CC_residual_temperature_correction_factor_a',
            'B': 'CC_residual_temperature_correction_factor_b',
            'C': 'CC_residual_temperature_correction_factor_c',
            'E': 'CC_residual_temperature_correction_factor_e',
            'SOC': 'CC_oxygen_signal_slope',
            'OFFSET': 'CC_frequency_offset'
        }
        

In [3]:
o2_coefficients_map;

In [4]:
coefficient_name_map;

### WHOI Asset Tracking Spreadsheet
First, I want to load and examine exactly what type of data is stored in the WHOI Asset Tracking Spreadsheet and what information it has that may be useful.

In [5]:
def whoi_asset_tracking(spreadsheet,sheet_name,instrument_class='All',whoi=True,series=None):
    """
    Loads all the individual sensors of a specific instrument class and
    series type. Currently applied only for WHOI deployed instruments.
    
    Args:
        spreadsheet - directory path and name of the excel spreadsheet with
            the WHOI asset tracking information.
        sheet_name - name of the sheet in the spreadsheet to load
        instrument_class - the type (i.e. CTDBP, CTDMO, PCO2W, etc). Defaults
            to 'All', which will load all of the instruments
        whoi - return only whoi instruments? Defaults to True.
        series - a specified class of the instrument to load. Defaults to None,
            which will load all of the series for a specified instrument class
    """
    
    all_sensors = pd.read_excel(spreadsheet,sheet_name=sheet_name,header=1)
    # Select a specific class of instruments
    if instrument_class == 'All':
        inst_class = all_sensors
    else:
        inst_class  = all_sensors[all_sensors['Instrument\nClass']==instrument_class]
    # Return only the whoi instruments?
    if whoi:
        whoi_insts = inst_class[inst_class['Deployment History'] != 'EA']
    else:
        whoi_insts = inst_class
    # Select a specific series of the instrument?
    if series is not None:
        instrument = whoi_insts[whoi_insts['Series'] == series]
    else:
        instrument = whoi_insts
 
    return instrument
    
    

In [6]:
#excel_spreadsheet = 'C:/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
excel_spreadsheet = '/media/andrew/OS/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
sheet_name = 'Sensors'

In [90]:
CTDBPD = whoi_asset_tracking(excel_spreadsheet,sheet_name,instrument_class='CTDBP',whoi=True,series='D')

In [91]:
CTDBPD

Unnamed: 0,Instrument Class,Series,Supplier Serial Number,WHOI #,OOI #,UID,Model,CGSN PN,Firmware Version,Supplier,...,QCT Testing,PreDeployment,Post Deployment,Refurbishment/ Repair,DO Number,Date Received,Deployment History,Current Deployment,Instrument Location on Current Deployment,Notes
53,CTDBP,D,16-50008,116258,A00700,CGINS-CTDBPD-50008,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00018\n3305-00102-00083\n3305-00102...,,,3305-00900-00085\n3305-00900-00178\n3305-00900...,WH-SC11-01-CTD-1007,2014-02-18 00:00:00,CP01CNSM-00002\nCP01CNSM-00003\nCP03ISSM-00004...,CP01CNSM-00010,MFN,
56,CTDBP,D,16-50058,116848,A01106,CGINS-CTDBPD-50058,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00038\n3305-00102-00088\n3305-00102...,,,3305-00900-00085\n3305-00900-00178\n3305-00900...,WH-SC11-01-CTD-1012,2014-10-01 00:00:00,CP3a Spare\nCP03ISSM-00002\nCP01CNSM-00005\nCP...,,,
72,CTDBP,D,16-50110,117191,A01335,CGINS-CTDBPD-50110,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00047\n3305-00102-00109\n3305-00102...,,,3305-00900-00135\n3305-00900-00373,WH-SC11-01-CTD-1016,2015-04-13 00:00:00,CP03ISSM-00003\nCP03ISSM-00005\nCP03ISSM-00007,Pioneer 11 Spare,MFN,
92,CTDBP,D,16P71174-7209,115128,A00083,CGINS-CTDBPD-07209,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00001\n3305-00102-00087\n3305-00102...,,,3305-00900-00085\n3305-00900-00359,WH-SC11-01-CTD-1002,2012-11-05 00:00:00,Pioneer 1 as spare\nCP03ISSM-00001\nCP01CNSM-0...,CP03ISSM-00009,MFN,Detached from Mooring
93,CTDBP,D,16P71879-7239,115220,A00132,CGINS-CTDBPD-07239,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00011\n3305-00102-00090\n3305-00102...,,,3305-00900-00013\n3305-00900-00306\n3305-00900...,WH-SC11-01-CTD-1005,1/8/2013\n4/2/2015,CP01CNSM-00001\nCP01CNSM-00007\nCP01CNSM-00009,,,


In [88]:
set(CTDBP['Series'])

{'C', 'D', 'E', 'F', 'P'}

### Checking instrument calibration values
After loading the **WHOI Asset Tracking Sheet**, we now have the following critical data for checking calibration information:
* Supplier Serial Number - this links back to the original **.cal**, **.xmlcon**, and vendor docs
* OOI UID - this is the link between the instrument and the OOINet
* QCT Document Number - this number links the instrument to the QCT screen capture of the calibration values loaded onto the instruments

### Process to load the **CSV** calibration file
In order to check the calibrations in asset management, I have to be able to load the asset management calibration csv files into a dataframe. 
* First, get all the unique CTDBPDs in Asset Management
* Next, parse the csv files in asset management to get the unique instrument serial numbers
* With the serial numbers, find the associated instrument calibration csvs
* For each calibration csv, load the data into a pandas dataframe
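
The last two steps can be sketched on in-memory data. This is only an illustration: the file names, values, and the `serial` and `notes` columns are made up, and the only columns the notebook itself relies on are `name` and `value`.

```python
import io
import pandas as pd

# Hypothetical calibration csvs for a single (made-up) UID, keyed by file name
files = {
    'CGINS-CTDBPD-00001__20150101.csv': "serial,name,value,notes\n1,CC_a0,0.00125,\n1,CC_a1,0.00027,\n",
    'CGINS-CTDBPD-00001__20160101.csv': "serial,name,value,notes\n1,CC_a0,0.00126,\n1,CC_a1,0.00028,\n",
}

frames = []
for fname, text in files.items():
    data = pd.read_csv(io.StringIO(text))
    # The calibration date is the YYYYMMDD part of the file name
    data['CAL DATE'] = pd.to_datetime(fname.split('__')[1].split('.')[0])
    frames.append(data)

# Stack the long-form tables, then pivot to one row per calibration date
long_form = pd.concat(frames, ignore_index=True)
wide = long_form.pivot(index='CAL DATE', columns='name', values='value')
print(wide)
```

The wide layout puts every coefficient in its own column, so two calibrations of the same instrument can be compared row against row.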

In [9]:
def load_asset_management(instrument,filepath):
    """
    Loads the calibration csv files from a local repository containing
    the asset management information.
    
    Args:
        instrument - a pandas dataframe with the asset tracking information
            for a specific instrument.
        filepath - the directory path pointing to where the csv files are
            stored.
    Raises:
        TypeError - if the instrument input is not a pandas dataframe
    Returns:
        csv_dict - a dictionary with keys of the UIDs from the instrument dataframe
            which correspond to lists of the relevant calibration csv files
            
    """
    
    # Check that the input is a pandas DataFrame
    if not isinstance(instrument, pd.DataFrame):
        raise TypeError('instrument must be a pandas DataFrame')
        
    uids = sorted(set(instrument['UID']))
    
    csv_dict = {}
    for uid in uids:
        # Get all the csvs from asset management for a particular UID
        csv_files = []
        for file in sorted(os.listdir(filepath)):
            if fnmatch.fnmatch(file,'*'+uid+'*'):
                csv_files.append(file)
        
        
        # Update the dictionary storing the asset management files for each UID
        if len(csv_files) > 0:
            csv_dict.update({uid:csv_files})
        else:
            pass
        
    return csv_dict
    

In [92]:
csv_dict = load_asset_management(CTDBPD,'../GitHub/OOI-Integration/asset-management/calibration/CTDBPD/')
csv_dict

{'CGINS-CTDBPD-07209': ['CGINS-CTDBPD-07209__20121025.csv',
  'CGINS-CTDBPD-07209__20151210.csv',
  'CGINS-CTDBPD-07209__20170718.csv',
  'CGINS-CTDBPD-07209__20180525.csv'],
 'CGINS-CTDBPD-07239': ['CGINS-CTDBPD-07239__20121207.csv',
  'CGINS-CTDBPD-07239__20150219.csv',
  'CGINS-CTDBPD-07239__20171202.csv'],
 'CGINS-CTDBPD-50008': ['CGINS-CTDBPD-50008__20140125.csv',
  'CGINS-CTDBPD-50008__20151212.csv',
  'CGINS-CTDBPD-50008__20161215.csv',
  'CGINS-CTDBPD-50008__20171212.csv'],
 'CGINS-CTDBPD-50058': ['CGINS-CTDBPD-50058__20140925.csv',
  'CGINS-CTDBPD-50058__20151211.csv',
  'CGINS-CTDBPD-50058__20161214.csv'],
 'CGINS-CTDBPD-50110': ['CGINS-CTDBPD-50110__20150329.csv',
  'CGINS-CTDBPD-50110__20160626.csv',
  'CGINS-CTDBPD-50110__20170711.csv',
  'CGINS-CTDBPD-50110__20180519.csv']}

In [93]:
# Now I need to load all of the csv files based on their UID
def load_csv_info(csv_dict,filepath):
    """
    Loads the calibration coefficient information contained in asset management
    
    Args:
        csv_dict - a dictionary which associates an instrument UID to the
            calibration csv files in asset management
        filepath - the path to the directory containing the calibration csv files
    Returns:
        csv_cals - a dictionary which associates an instrument UID to a pandas
            dataframe which contains the calibration coefficients. The dataframes
            are indexed by the date of calibration
    """
    
    # Load the calibration data into pandas dataframes, which are then placed into
    # a dictionary by the UID
    csv_cals = {}
    for uid in csv_dict:
        frames = []
        for file in csv_dict[uid]:
            data = pd.read_csv(os.path.join(filepath, file))
            date = file.split('__')[1].split('.')[0]
            data['CAL DATE'] = pd.to_datetime(date)
            frames.append(data)
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        csv_cals.update({uid: pd.concat(frames)})
        
    # Pivot so each row is a calibration date and each column a coefficient
    for uid in csv_cals:
        csv_cals[uid] = csv_cals[uid].pivot(index='CAL DATE', columns='name', values='value')
        
    return csv_cals



In [94]:
CSV = load_csv_info(csv_dict,'../GitHub/OOI-Integration/asset-management/calibration/CTDBPD/')
CSV

{'CGINS-CTDBPD-07209': name           CC_a0     CC_a1         CC_a2         CC_a3      CC_cpcor  \
 CAL DATE                                                                   
 2012-10-25  0.001246  0.000277 -1.298569e-06  1.889486e-07 -9.570000e-08   
 2015-12-10  0.001254  0.000274 -9.551421e-07  1.750128e-07 -9.570000e-08   
 2017-07-18  0.001252  0.000275 -1.025633e-06  1.771923e-07 -9.570000e-08   
 2018-05-25  0.001236  0.000280 -1.747581e-06  2.073228e-07 -9.570000e-08   
 
 name        CC_ctcor      CC_g      CC_h      CC_i      CC_j     ...      \
 CAL DATE                                                         ...       
 2012-10-25  0.000003 -0.953572  0.133625 -0.000378  0.000045     ...       
 2015-12-10  0.000003 -0.954144  0.133626 -0.000378  0.000045     ...       
 2017-07-18  0.000003 -0.955038  0.133779 -0.000397  0.000047     ...       
 2018-05-25  0.000003 -0.954057  0.133591 -0.000376  0.000045     ...       
 
 name              CC_pa2  CC_ptca0  CC_ptca1  CC_

In [95]:
CSV['CGINS-CTDBPD-50110']

CAL DATE,CC_a0,CC_a1,CC_a2,CC_a3,CC_cpcor,CC_ctcor,CC_g,CC_h,CC_i,CC_j,...,CC_pa2,CC_ptca0,CC_ptca1,CC_ptca2,CC_ptcb0,CC_ptcb1,CC_ptcb2,CC_ptempa0,CC_ptempa1,CC_ptempa2
2015-03-29,0.00129,0.000267,-6.133168e-07,1.588086e-07,-9.57e-08,3e-06,-0.974897,0.149154,-0.00012,3.1e-05,...,8.311135e-11,521059.7,-69.74867,0.104089,24.80888,-0.000225,0.0,164.5857,-64.11637,-2.787496
2016-06-26,0.001284,0.000269,-8.485468e-07,1.675527e-07,-9.57e-08,3e-06,-0.974269,0.1489,-4.6e-05,2.6e-05,...,8.431714e-11,521111.2,-71.22117,0.192376,24.80888,-0.000225,0.0,186.6791,-85.37835,2.296005
2017-07-11,0.001273,0.000273,-1.357612e-06,1.890172e-07,-9.57e-08,3e-06,-0.975956,0.149502,-0.000215,3.9e-05,...,8.360019e-11,521104.1,-69.71797,0.112801,24.80888,-0.000225,0.0,181.3828,-80.22888,1.072808
2018-05-19,0.001292,0.000266,-4.935913e-07,1.534591e-07,-9.57e-08,3e-06,-0.975438,0.149296,-0.000156,3.5e-05,...,8.384369e-11,520995.9,-77.39612,0.156681,24.80888,-0.000225,0.0,172.7524,-71.91609,-0.911798


Now we have successfully loaded the csv calibrations into a pandas dataframe that allows for easy comparison between calibrations based on the calibration date for each calibration coefficient.

### Load the QCT values
The next step is to take the capture files from the QCT and load them into a comparable pandas dataframe. This involves several steps:
* Get the QCT document numbers from the WHOI Asset Tracking Sheet for each individual instrument
* Find where the QCT documents are stored
* Load the QCT documents
* Parse the QCT documents
* Translate the parsed QCT values into a pandas dataframe
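
The parsing step can be sketched as follows. The capture text is a made-up fragment laid out the way the parser in this notebook expects (`name: value` or `name = value` lines, with the calibration date on the `conductivity` line); real capture files contain many more lines, and `parse_capture` is a hypothetical helper, not the function used below.

```python
import re
from datetime import datetime

# Illustrative capture text only; the layout is an assumption
sample_capture = """temperature:  10-dec-15
    TA0 = 1.253857e-03
    TA1 = 2.739732e-04
conductivity:  10-dec-15
    G = -9.541443e-01
    H = 1.336263e-01
"""

# Subset of the coefficient_name_map defined earlier in this notebook
name_map = {'TA0': 'CC_a0', 'TA1': 'CC_a1', 'G': 'CC_g', 'H': 'CC_h'}

def parse_capture(text, name_map):
    """Split each line on ': ' or ' =' and keep only the mapped coefficients."""
    raw = {}
    for line in text.splitlines():
        items = re.split(r': | =', line)
        if len(items) < 2:
            continue  # not a key/value line
        raw[items[0].strip()] = items[-1].strip()
    coefficients = {name_map[k]: float(v) for k, v in raw.items() if k in name_map}
    cal_date = datetime.strptime(raw['conductivity'], '%d-%b-%y').date()
    return coefficients, cal_date

coefficients, cal_date = parse_capture(sample_capture, name_map)
print(cal_date, coefficients)
```

Filtering through the name map discards everything in the capture that is not a calibration coefficient, which keeps this naive line splitting safe enough for a first pass.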

In [98]:
uids = sorted( list( set( CTDBPD['UID'])))

In [99]:
qct_dict = {}
for uid in uids:
    # Get the QCT Document numbers from the asset tracking sheet
    CTDBPD['UID_match'] = CTDBPD['UID'].apply(lambda x: True if uid in x else False)
    qct_series = CTDBPD[CTDBPD['UID_match'] == True]['QCT Testing']
    qct_series = list(qct_series.iloc[0].split('\n'))
    qct_dict.update({uid:qct_series})

In [100]:
qct_dict

{'CGINS-CTDBPD-07209': ['3305-00102-00001',
  '3305-00102-00087',
  '3305-00102-00134',
  '3305-00102-00188'],
 'CGINS-CTDBPD-07239': ['3305-00102-00011',
  '3305-00102-00090',
  '3305-00102-00128',
  '3305-00102-00157'],
 'CGINS-CTDBPD-50008': ['3305-00102-00018',
  '3305-00102-00083',
  '3305-00102-00125',
  '3305-00102-00173'],
 'CGINS-CTDBPD-50058': ['3305-00102-00038',
  '3305-00102-00088',
  '3305-00102-00126'],
 'CGINS-CTDBPD-50110': ['3305-00102-00047',
  '3305-00102-00109',
  '3305-00102-00139',
  '3305-00102-00192']}

In [101]:
#dirpath = 'C:/Users/areed/Documents/Project_Files/Records/Instrument_Records/cap_files/'
dirpath = '/media/andrew/OS/Users/areed/Documents/Project_Files/'

In [102]:
# Try building a function to do the file path generator
def generate_file_path(dirpath,filename,exclude=['_V','_Data_Workshop']):
    """
    Function which searches for the location of the given file and returns
    the full path to the file.
    
    Args:
        dirpath - parent directory path under which to search
        filename - the name of the file to search for
        exclude - optional list which allows for excluding certain
            directories from the search
    Returns:
        fpath - the file path to the filename from the current
            working directory.
    """
    for root, dirs, files in os.walk(dirpath):
        dirs[:] = [d for d in dirs if d not in exclude]
        for fname in files:
            if fnmatch.fnmatch(fname, [filename+'*.cap', filename+'*.txt', filename+'*.log']):
                fpath = os.path.join(root, fname)
                return fpath

In [103]:
# Now to develop an automated approach to load all the QCT documents, parse them
# into a dictionary, and convert the dictionary into a pandas dataframe
def load_qct_data(qct_dict,coefficient_name_map,dirpath='../../../Documents/Project_Files/'):
    qct = {}
    for uid in qct_dict:
        capture_data = {}
        for capfile in qct_dict[uid]:
            # First, find and return the path to the capture file which
            # matches the capture file identifier
            cappath = generate_file_path(dirpath, capfile)
            
            # Function to pull out the coefficients from the capture files. This is a naive implementation
            # and splits only on either a ":" or "=", it doesn't do any comprehension of the file
            if cappath is None:
                pass
            else:
                coeffs = {}
                with open(cappath) as filename:
                    data = filename.read()
                    for line in data.splitlines():
                        items = re.split(': | =',line)
                        key = items[0].strip()
                        value = items[-1].strip()
                        coeffs.update({key:value})
                    
                # The best way to do this is to use the CTD name mapping to only get the important values
                capture = {}
                # With the capture coefficients, now map it to the CTD coefficients
                for key in coeffs.keys():
                    if key in coefficient_name_map.keys():
                        capture[coefficient_name_map[key]] = coeffs[key]
            
                # Get the calibration date
                caldate = coeffs['conductivity']
            
                # Update the capture file to include the calibration date
                capture['CAL DATE'] = pd.to_datetime(caldate)
            
                # Now, update the parent dictionary
                capture_data.update({capfile:capture})
            
        df = pd.DataFrame.from_dict({i: capture_data[i] for i in capture_data.keys()}, orient='index')
        qct.update({uid:df})
        
    return qct   

In [104]:
qct = load_qct_data(qct_dict,coefficient_name_map)

In [105]:
# Reset the index to the calibration date
for uid in qct:
    qct[uid].set_index('CAL DATE', drop=True, inplace=True)

### Vendor Calibration values: .cal and .xmlcon
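This next step is to load the CTD .cal and .xmlcon files in order to compare the vendor-supplied coefficients with the values in the asset management csv files and the QCT captures.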
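
The `read_cal` function further down treats a .cal file as a series of `KEY=VALUE` lines. A minimal sketch of that idea on a made-up fragment is shown here; the sample text and the helper name `parse_cal_text` are assumptions for illustration, and only the keys (`INSTRUMENT_TYPE`, `SERIALNO`, `CCALDATE`, and the coefficient names) come from the code in this notebook.

```python
from datetime import datetime

# Illustrative .cal fragment; real files contain many more entries
sample_cal = """INSTRUMENT_TYPE=SEACATPLUS
SERIALNO=7209
CCALDATE=25-Oct-12
TA0=1.246218e-003
G=-9.535724e-001
"""

# Subset of the coefficient_name_map defined earlier in this notebook
name_map = {'TA0': 'CC_a0', 'G': 'CC_g'}

def parse_cal_text(text, name_map):
    """Return (coefficients, calibration date) from KEY=VALUE text."""
    coefficients, cal_date = {}, None
    for line in text.splitlines():
        if '=' not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split('=', 1))
        if key == 'CCALDATE':
            cal_date = datetime.strptime(value, '%d-%b-%y').date()
        elif key in name_map:
            coefficients[name_map[key]] = float(value)
    return coefficients, cal_date

cal_coefficients, cal_date = parse_cal_text(sample_cal, name_map)
print(cal_date, cal_coefficients)
```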

In [106]:
CTDBPD[CTDBPD['UID_match'] == True]

Unnamed: 0,Instrument Class,Series,Supplier Serial Number,WHOI #,OOI #,UID,Model,CGSN PN,Firmware Version,Supplier,...,PreDeployment,Post Deployment,Refurbishment/ Repair,DO Number,Date Received,Deployment History,Current Deployment,Instrument Location on Current Deployment,Notes,UID_match
72,CTDBP,D,16-50110,117191,A01335,CGINS-CTDBPD-50110,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,,,3305-00900-00135\n3305-00900-00373,WH-SC11-01-CTD-1016,2015-04-13 00:00:00,CP03ISSM-00003\nCP03ISSM-00005\nCP03ISSM-00007,Pioneer 11 Spare,MFN,,True


In [107]:
def get_serial_num(df):
    serial_num = list(df[df['UID_match'] == True]['Supplier\nSerial Number'])
    serial_num = serial_num[0].split('-')[1]
    return serial_num

In [126]:
CTDBPD['UID']

53    CGINS-CTDBPD-50008
56    CGINS-CTDBPD-50058
72    CGINS-CTDBPD-50110
92    CGINS-CTDBPD-07209
93    CGINS-CTDBPD-07239
Name: UID, dtype: object

In [127]:
serial_nums = {}
for uid in uids:
    CTDBPD['UID_match'] = CTDBPD['UID'].apply(lambda x: True if uid in x else False)
    serial_num = get_serial_num(CTDBPD)
    serial_nums.update({uid:serial_num})
    

In [128]:
serial_nums

{'CGINS-CTDBPD-07209': '7209',
 'CGINS-CTDBPD-07239': '7239',
 'CGINS-CTDBPD-50008': '50008',
 'CGINS-CTDBPD-50058': '50058',
 'CGINS-CTDBPD-50110': '50110'}

In [110]:
def read_cal(data, coefficient_name_map):
    """
    Reads in the calibration coefficients from the vendor supplied
    .cal file.
        
    Args:
        data - the text of a .cal file, already read and decoded
            into ASCII.
        coefficient_name_map - a mapping of the vendor coefficient names
            to the coefficient names used in asset management
    Returns:
        coefficients - a dictionary of coefficient names and their
            associated values from the cal file
        date - the conductivity calibration date as a YYYYMMDD string,
            or None if the file has no CCALDATE entry
    """
    coefficients = {}
    serial = ''
    date = None
    for line in data.splitlines():
        # Skip blank lines and lines that are not KEY=VALUE pairs
        if '=' not in line:
            continue
        key, value = line.replace(" ","").split('=', 1)

        if key == 'INSTRUMENT_TYPE' and value == 'SEACATPLUS':
            serial = '16-'

        if key == 'SERIALNO':
            serial = serial + value
    
        if key == 'CCALDATE':
            date = datetime.datetime.strptime(value, '%d-%b-%y').strftime('%Y%m%d')

        # Keep only the keys which map to an asset management coefficient
        name = coefficient_name_map.get(key)
        if name is not None:
            coefficients[name] = value
            
    return coefficients,date

In [111]:
def read_xml(data, coefficient_name_map, o2_coefficients_map):
    Tflag = False
    O2flag = False
    coefficients = {}
    date = None
        
    for child in data.iter():
        key = child.tag.upper()
        # Elements without text (e.g. empty tags) have child.text set to None
        value = child.text.upper() if child.text is not None else None
        
        # Do a couple of checks for type of CTD and flag for presence of
        # Oxygen sensor, Type (16+ vs 37)
        if key == 'OXYGENSENSOR':
            O2flag = True
        
        if key == 'CALIBRATIONDATE':
            if date is None and value is not None:
                date = datetime.datetime.strptime(value, '%d-%b-%y').strftime('%Y%m%d')
            
        # The temperature coefficients are tagged A0-A3 in the xmlcon file, so
        # prefix them with 'T' to match the keys in coefficient_name_map (TA0-TA3)
        if key == 'TEMPERATURESENSOR':
            Tflag = True
        elif 'SENSOR' in key and Tflag:
            Tflag = False
        
        if Tflag:
            key = 'T'+key
        
        # Find the mapping of the vendor coeff name -> UFrame coefficient name.
        # dict.get returns None instead of raising, so test for None explicitly
        # before falling back to the oxygen coefficient map
        name = coefficient_name_map.get(key)
        if name is None and O2flag:
            name = o2_coefficients_map.get(key)

        # Now, can update a dictionary to store key->value pairs of coefficients from the xmlcon file.
        # Unmapped tags are stored under the key None.
        coefficients.update({name:value})
        
    return coefficients,date

In [129]:
serial_nums

{'CGINS-CTDBPD-07209': '7209',
 'CGINS-CTDBPD-07239': '7239',
 'CGINS-CTDBPD-50008': '50008',
 'CGINS-CTDBPD-50058': '50058',
 'CGINS-CTDBPD-50110': '50110'}

In [130]:
vendor_files = {}
for uid,sn in serial_nums.items():
    files = []
    for file in os.listdir('../../Project_Files/Records/Instrument_Records/CTDBP/'):
        # Keep only the calibration zip files for this serial number
        if sn in file and 'Calibration_Files' in file:
            files.append(file)
    vendor_files.update({uid:files})

In [131]:
vendor_files

{'CGINS-CTDBPD-07209': ['CTDBP-D_SBE_16PlusV2_SN_16P71174-7209_Calibration_Files_2012-11-21.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71174-7209_Calibration_Files_2016-01-06.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71174-7209_Calibration_Files_2017-07-21.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71174-7209_Calibration_Files_2018-05-25.zip'],
 'CGINS-CTDBPD-07239': ['CTDBP-D_SBE_16PlusV2_SN_16P71879-7239_Calibration_Files_2012-12-28.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71879-7239_Calibration_Files_2013-01-24.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71879-7239_Calibration_Files_2015-02-23.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71879-7239_Calibration_Files_2017-12-02.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16P71879-7239_Calibration_Files_2019-01-03.zip'],
 'CGINS-CTDBPD-50008': ['CTDBP-D_SBE16PlusV2_SN_16-50008_Calibration_Files_2014-03-10.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16-50008_Calibration_Files_2016-01-06.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16-50008_Calibration_Files_2017-03-22.zip',
  'CTDBP-D_SBE_16PlusV2_SN_16-50008_

In [132]:
def load_cal_coeffs(files,filepath, coefficient_name_map, o2_coefficients_map):
    """
    Loads all of the calibration coefficients from the vendor cal files for
    a given CTD instrument class.
    
    Args:
        files - a list of zipfile names containing the vendor calibration files
        filepath - directory path to where the zipfiles are stored locally
        coefficient_name_map - a mapping of the calibration names in the vendor file
            to the calibration coeff names needed for OOINet
        o2_coefficients_map - mapping for CTDs containing an oxygen sensor
    Returns:
        cal_coeffs - a dictionary of the calibration coefficients with the respective
            values, nested in a dictionary sorted by calibration date
    """
    cal_coeffs = {}
    for file in files:
        fpath = filepath+file
        with ZipFile(fpath) as zfile:
            if any('.cal' in x for x in zfile.namelist()):
                findex, *ignore = [(i,x) for i,x in enumerate(zfile.namelist()) if '.cal' in x][0]
                filename = zfile.namelist()[findex]
                data = zfile.read(filename).decode('ASCII')
                coeffs, date = read_cal(data, coefficient_name_map)
                cal_coeffs.update({date:coeffs})
    return cal_coeffs

In [133]:
cal = {}
filepath = '../../Project_Files/Records/Instrument_Records/CTDBP/'
for uid,files in vendor_files.items():
    cal_coeffs = load_cal_coeffs(files,filepath,coefficient_name_map,o2_coefficients_map)
    cal_df = pd.DataFrame.from_dict({i: cal_coeffs[i] for i in cal_coeffs.keys()}, orient='index')
    cal_df.index = pd.to_datetime(cal_df.index)
    cal.update({uid:cal_df})

In [134]:
cal

{'CGINS-CTDBPD-07209':                     CC_a0          CC_a1           CC_a2          CC_a3  \
 2012-10-25  1.246218e-003  2.767838e-004  -1.298569e-006  1.889486e-007   
 2015-12-10  1.253857e-003  2.739732e-004  -9.551421e-007  1.750128e-007   
 2018-05-25  1.236308e-003  2.804391e-004  -1.747581e-006  2.073228e-007   
 
                       CC_g           CC_h            CC_i           CC_j  \
 2012-10-25  -9.535724e-001  1.336248e-001  -3.783545e-004  4.512857e-005   
 2015-12-10  -9.541443e-001  1.336263e-001  -3.821880e-004  4.517075e-005   
 2018-05-25  -9.540566e-001  1.335906e-001  -3.755135e-004  4.538584e-005   
 
                  CC_ctcor        CC_cpcor       ...               CC_pa2  \
 2012-10-25  3.250000e-006  -9.570000e-008       ...        5.361054e-012   
 2015-12-10  3.250000e-006  -9.570000e-008       ...        4.883463e-012   
 2018-05-25  3.250000e-006  -9.570000e-008       ...        5.713939e-012   
 
                  CC_ptca0       CC_ptca1        CC_

#### Repeat the above process with the .xmlcon file
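
Before running this on the vendor files, here is a small sketch of the renaming issue the xmlcon parser has to handle: the temperature coefficients are tagged `A0`, `A1`, ... and need a `T` prefix to match the keys in `coefficient_name_map`. The XML below is a simplified, assumed layout rather than a real vendor file, and `parse_xmlcon` is a hypothetical helper.

```python
import xml.etree.ElementTree as et

# Simplified, assumed xmlcon layout for illustration only
sample_xmlcon = """<SBE_InstrumentConfiguration>
  <Instrument>
    <SensorArray>
      <Sensor index="0">
        <TemperatureSensor>
          <CalibrationDate>25-Oct-12</CalibrationDate>
          <A0>1.24621840e-003</A0>
          <A1>2.76783777e-004</A1>
        </TemperatureSensor>
      </Sensor>
      <Sensor index="1">
        <ConductivitySensor>
          <G>-9.53572400e-001</G>
        </ConductivitySensor>
      </Sensor>
    </SensorArray>
  </Instrument>
</SBE_InstrumentConfiguration>"""

# Subset of the coefficient_name_map defined earlier in this notebook
name_map = {'TA0': 'CC_a0', 'TA1': 'CC_a1', 'G': 'CC_g'}

def parse_xmlcon(text, name_map):
    """Map xmlcon tags to coefficient names, prefixing temperature tags with 'T'."""
    root = et.fromstring(text)
    coefficients = {}
    for parent in root.iter():
        # Children of the temperature sensor get the 'T' prefix (A0 -> TA0)
        prefix = 'T' if parent.tag.upper() == 'TEMPERATURESENSOR' else ''
        for child in parent:
            name = name_map.get(prefix + child.tag.upper())
            if name is not None and child.text is not None:
                coefficients[name] = float(child.text)
    return coefficients

xml_coefficients = parse_xmlcon(sample_xmlcon, name_map)
print(xml_coefficients)
```

Deciding the prefix from the parent element, rather than from a running flag, avoids having to track where the temperature block ends.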

In [136]:
def load_xml_coeffs(files,filepath, coefficient_name_map, o2_coefficients_map):
    """
    Loads all of the calibration coefficients from the vendor cal files in xmlcon
    format for a given CTD instrument class.
    
    Args:
        files - a list of zipfile names containing the vendor calibration files
        filepath - directory path to where the zipfiles are stored locally
        coefficient_name_map - a mapping of the calibration names in the vendor file
            to the calibration coeff names needed for OOINet
        o2_coefficients_map - mapping for CTDs containing an oxygen sensor
    Returns:
        xml_coeffs - a dictionary of the calibration coefficients with the respective
            values, nested in a dictionary sorted by calibration date
    """
    xml_coeffs = {}
    for file in files:
        fpath = filepath+file
        with ZipFile(fpath) as zfile:
            if any('.xmlcon' in x for x in zfile.namelist()):
                findex, *ignore = [(i,x) for i,x in enumerate(zfile.namelist()) if '.xmlcon' in x][0]
                filename = zfile.namelist()[findex]
                data = et.parse(zfile.open(filename))
                coeffs, date = read_xml(data, coefficient_name_map, o2_coefficients_map)
                xml_coeffs.update({date:coeffs})
    return xml_coeffs

In [137]:
xml = {}
filepath = '../../Project_Files/Records/Instrument_Records/CTDBP/'
for uid,files in vendor_files.items():
    xml_coeffs = load_xml_coeffs(files,filepath,coefficient_name_map,o2_coefficients_map)
    xml_df = pd.DataFrame.from_dict({i: xml_coeffs[i] for i in xml_coeffs.keys()}, orient='index')
    xml_df.drop(columns=[None],axis=1,inplace=True)
    xml_df.index = pd.to_datetime(xml_df.index)
    xml.update({uid:xml_df})

In [138]:
xml

{'CGINS-CTDBPD-07209':                       CC_a0            CC_a1             CC_a2  \
 2012-10-25  1.24621840E-003  2.76783777E-004  -1.29856876E-006   
 2015-12-10  1.25385742E-003  2.73973191E-004  -9.55142116E-007   
 2017-07-18  1.25158892E-003  2.74682089E-004  -1.02563341E-006   
 2018-05-25  1.23630751E-003  2.80439081E-004  -1.74758064E-006   
 
                       CC_a3          CC_cpcor              CC_g  \
 2012-10-25  1.88948575E-007  -9.57000000E-008  -9.53572400E-001   
 2015-12-10  1.75012763E-007  -9.57000000E-008  -9.54144279E-001   
 2017-07-18  1.77192286E-007  -9.57000000E-008  -9.55037963E-001   
 2018-05-25  2.07322785E-007  -9.57000000E-008  -9.54056599E-001   
 
                        CC_h              CC_i             CC_j     CC_ctcor  \
 2012-10-25  1.33624800E-001  -3.78354500E-004  4.51285700E-005  3.2500E-006   
 2015-12-10  1.33626304E-001  -3.82187986E-004  4.51707476E-005  3.2500E-006   
 2017-07-18  1.33778797E-001  -3.96866503E-004  4.68614840E

### Comparisons
Now that I have .cal, .xmlcon, the qct capture files, and the csv files from asset management, I can begin comparison of the calibration coefficients between the different files. The goal is that the dates, values, and coefficients all match.
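
One possible way to automate the comparison is sketched below, assuming each source has already been loaded into a dataframe indexed by calibration date with one column per coefficient. The function name `compare_coefficients` and the relative tolerance `rtol=1e-6` are my assumptions, and the sample values are illustrative (a vendor value next to a copy truncated to four significant figures), not a reported finding.

```python
import numpy as np
import pandas as pd

def compare_coefficients(reference, candidate, rtol=1e-6):
    """List the (date, coefficient) pairs whose values differ by more than rtol."""
    # Vendor values may be strings such as '1.246218e-003'; coerce to floats
    ref = reference.apply(pd.to_numeric, errors='coerce')
    cand = candidate.apply(pd.to_numeric, errors='coerce')
    # Only compare calibration dates and coefficients present in both tables
    dates = ref.index.intersection(cand.index)
    names = ref.columns.intersection(cand.columns)
    rows = []
    for date in dates:
        for name in names:
            a, b = ref.at[date, name], cand.at[date, name]
            if not np.isclose(a, b, rtol=rtol, atol=0.0, equal_nan=True):
                rows.append({'CAL DATE': date, 'coefficient': name,
                             'reference': a, 'candidate': b})
    return pd.DataFrame(rows, columns=['CAL DATE', 'coefficient', 'reference', 'candidate'])

# Illustrative data: CC_a0 is truncated in the second table, CC_a1 is not
dates = pd.to_datetime(['2012-10-25'])
vendor = pd.DataFrame({'CC_a0': ['1.246218e-003'], 'CC_a1': ['2.767838e-004']}, index=dates)
entered = pd.DataFrame({'CC_a0': [0.001246], 'CC_a1': [0.0002767838]}, index=dates)

mismatches = compare_coefficients(vendor, entered)
print(mismatches)
```

A relative tolerance is used instead of an absolute one because the coefficients span many orders of magnitude. Dates or coefficients that appear in only one table are skipped by this sketch and would need a separate check.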

In [139]:
CSV

{'CGINS-CTDBPD-07209': name           CC_a0     CC_a1         CC_a2         CC_a3      CC_cpcor  \
 CAL DATE                                                                   
 2012-10-25  0.001246  0.000277 -1.298569e-06  1.889486e-07 -9.570000e-08   
 2015-12-10  0.001254  0.000274 -9.551421e-07  1.750128e-07 -9.570000e-08   
 2017-07-18  0.001252  0.000275 -1.025633e-06  1.771923e-07 -9.570000e-08   
 2018-05-25  0.001236  0.000280 -1.747581e-06  2.073228e-07 -9.570000e-08   
 
 name        CC_ctcor      CC_g      CC_h      CC_i      CC_j     ...      \
 CAL DATE                                                         ...       
 2012-10-25  0.000003 -0.953572  0.133625 -0.000378  0.000045     ...       
 2015-12-10  0.000003 -0.954144  0.133626 -0.000378  0.000045     ...       
 2017-07-18  0.000003 -0.955038  0.133779 -0.000397  0.000047     ...       
 2018-05-25  0.000003 -0.954057  0.133591 -0.000376  0.000045     ...       
 
 name              CC_pa2  CC_ptca0  CC_ptca1  CC_

In [140]:
qct

{'CGINS-CTDBPD-07209':                    CC_a0         CC_a1          CC_a2         CC_a3  \
 CAL DATE                                                              
 2015-12-10  1.253857e-03  2.739732e-04  -9.551422e-07  1.750128e-07   
 2017-07-18  1.251589e-03  2.746821e-04  -1.025633e-06  1.771923e-07   
 2018-05-25  1.236308e-03  2.804391e-04  -1.747581e-06  2.073228e-07   
 
                      CC_g          CC_h           CC_i          CC_j  \
 CAL DATE                                                               
 2015-12-10  -9.541443e-01  1.336263e-01  -3.821880e-04  4.517075e-05   
 2017-07-18  -9.550380e-01  1.337788e-01  -3.968665e-04  4.686148e-05   
 2018-05-25  -9.540566e-01  1.335906e-01  -3.755135e-04  4.538584e-05   
 
                  CC_cpcor      CC_ctcor      ...              CC_pa2  \
 CAL DATE                                     ...                       
 2015-12-10  -9.570000e-08  3.250000e-06      ...        4.883464e-12   
 2017-07-18  -9.570000e-08  3.

In [141]:
cal

{'CGINS-CTDBPD-07209':                     CC_a0          CC_a1           CC_a2          CC_a3  \
 2012-10-25  1.246218e-003  2.767838e-004  -1.298569e-006  1.889486e-007   
 2015-12-10  1.253857e-003  2.739732e-004  -9.551421e-007  1.750128e-007   
 2018-05-25  1.236308e-003  2.804391e-004  -1.747581e-006  2.073228e-007   
 
                       CC_g           CC_h            CC_i           CC_j  \
 2012-10-25  -9.535724e-001  1.336248e-001  -3.783545e-004  4.512857e-005   
 2015-12-10  -9.541443e-001  1.336263e-001  -3.821880e-004  4.517075e-005   
 2018-05-25  -9.540566e-001  1.335906e-001  -3.755135e-004  4.538584e-005   
 
                  CC_ctcor        CC_cpcor       ...               CC_pa2  \
 2012-10-25  3.250000e-006  -9.570000e-008       ...        5.361054e-012   
 2015-12-10  3.250000e-006  -9.570000e-008       ...        4.883463e-012   
 2018-05-25  3.250000e-006  -9.570000e-008       ...        5.713939e-012   
 
                  CC_ptca0       CC_ptca1        CC_

In [142]:
xml

{'CGINS-CTDBPD-07209':                       CC_a0            CC_a1             CC_a2  \
 2012-10-25  1.24621840E-003  2.76783777E-004  -1.29856876E-006   
 2015-12-10  1.25385742E-003  2.73973191E-004  -9.55142116E-007   
 2017-07-18  1.25158892E-003  2.74682089E-004  -1.02563341E-006   
 2018-05-25  1.23630751E-003  2.80439081E-004  -1.74758064E-006   
 
                       CC_a3          CC_cpcor              CC_g  \
 2012-10-25  1.88948575E-007  -9.57000000E-008  -9.53572400E-001   
 2015-12-10  1.75012763E-007  -9.57000000E-008  -9.54144279E-001   
 2017-07-18  1.77192286E-007  -9.57000000E-008  -9.55037963E-001   
 2018-05-25  2.07322785E-007  -9.57000000E-008  -9.54056599E-001   
 
                        CC_h              CC_i             CC_j     CC_ctcor  \
 2012-10-25  1.33624800E-001  -3.78354500E-004  4.51285700E-005  3.2500E-006   
 2015-12-10  1.33626304E-001  -3.82187986E-004  4.51707476E-005  3.2500E-006   
 2017-07-18  1.33778797E-001  -3.96866503E-004  4.68614840E

In [143]:
# First, I need to reindex all of the different dataframes such that they all have two indices:
# A dataset index and a datetime index, and set them to uniform name (for concatenation)
for uid in uids:
    CSV[uid]['Dataset'] = 'CSV'
    CSV[uid].set_index(['Dataset',CSV[uid].index],inplace=True)
    CSV[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
CSV

{'CGINS-CTDBPD-07209': name                   CC_a0     CC_a1         CC_a2         CC_a3  \
 Dataset Cal Date                                                     
 CSV     2012-10-25  0.001246  0.000277 -1.298569e-06  1.889486e-07   
         2015-12-10  0.001254  0.000274 -9.551421e-07  1.750128e-07   
         2017-07-18  0.001252  0.000275 -1.025633e-06  1.771923e-07   
         2018-05-25  0.001236  0.000280 -1.747581e-06  2.073228e-07   
 
 name                    CC_cpcor  CC_ctcor      CC_g      CC_h      CC_i  \
 Dataset Cal Date                                                           
 CSV     2012-10-25 -9.570000e-08  0.000003 -0.953572  0.133625 -0.000378   
         2015-12-10 -9.570000e-08  0.000003 -0.954144  0.133626 -0.000378   
         2017-07-18 -9.570000e-08  0.000003 -0.955038  0.133779 -0.000397   
         2018-05-25 -9.570000e-08  0.000003 -0.954057  0.133591 -0.000376   
 
 name                    CC_j     ...            CC_pa2  CC_ptca0  CC_ptca1  \
 Datase

In [144]:
for uid in uids:
    qct[uid]['Dataset'] = 'QCT'
    qct[uid].set_index(['Dataset',qct[uid].index],inplace=True)
    qct[uid].index.set_names(['Dataset','Cal Date'],inplace=True)


In [145]:
for uid in uids:
    cal[uid]['Dataset'] = 'CAL'
    cal[uid].set_index(['Dataset',cal[uid].index],inplace=True)
    cal[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
cal

{'CGINS-CTDBPD-07209':                             CC_a0          CC_a1           CC_a2  \
 Dataset Cal Date                                                   
 CAL     2012-10-25  1.246218e-003  2.767838e-004  -1.298569e-006   
         2015-12-10  1.253857e-003  2.739732e-004  -9.551421e-007   
         2018-05-25  1.236308e-003  2.804391e-004  -1.747581e-006   
 
                             CC_a3            CC_g           CC_h  \
 Dataset Cal Date                                                   
 CAL     2012-10-25  1.889486e-007  -9.535724e-001  1.336248e-001   
         2015-12-10  1.750128e-007  -9.541443e-001  1.336263e-001   
         2018-05-25  2.073228e-007  -9.540566e-001  1.335906e-001   
 
                               CC_i           CC_j       CC_ctcor  \
 Dataset Cal Date                                                   
 CAL     2012-10-25  -3.783545e-004  4.512857e-005  3.250000e-006   
         2015-12-10  -3.821880e-004  4.517075e-005  3.250000e-006   
        

In [146]:
for uid in uids:
    xml[uid]['Dataset'] = 'XML'
    xml[uid].set_index(['Dataset',xml[uid].index],inplace=True)
    xml[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
xml

{'CGINS-CTDBPD-07209':                               CC_a0            CC_a1             CC_a2  \
 Dataset Cal Date                                                         
 XML     2012-10-25  1.24621840E-003  2.76783777E-004  -1.29856876E-006   
         2015-12-10  1.25385742E-003  2.73973191E-004  -9.55142116E-007   
         2017-07-18  1.25158892E-003  2.74682089E-004  -1.02563341E-006   
         2018-05-25  1.23630751E-003  2.80439081E-004  -1.74758064E-006   
 
                               CC_a3          CC_cpcor              CC_g  \
 Dataset Cal Date                                                          
 XML     2012-10-25  1.88948575E-007  -9.57000000E-008  -9.53572400E-001   
         2015-12-10  1.75012763E-007  -9.57000000E-008  -9.54144279E-001   
         2017-07-18  1.77192286E-007  -9.57000000E-008  -9.55037963E-001   
         2018-05-25  2.07322785E-007  -9.57000000E-008  -9.54056599E-001   
 
                                CC_h              CC_i             C

All four possible sources of calibration coefficients for an instrument are now available: the calibration **CSV** loaded into asset management, the calibration coefficients loaded onto the instrument during check-in (**QCT**), the **.cal** file provided by the vendor, and the **XML** file provided by the vendor. 

The next step is to concatenate the four sources for each instrument into a single dataframe and to sort it by calibration date. This will allow the coefficients to be compared date by date.
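The reindexing and concatenation can also be done in one step with the `keys`/`names` behavior of `pd.concat`, which builds the `Dataset` level directly. A minimal sketch with made-up coefficient values (the toy frames are hypothetical, not the real calibration data):

```python
import pandas as pd

dates = pd.to_datetime(['2015-12-10', '2017-07-18'])
# Made-up values: floats as read from a csv, strings as parsed from a vendor file
csv_df = pd.DataFrame({'CC_a0': [1.0e-03, 2.0e-03]}, index=dates)
xml_df = pd.DataFrame({'CC_a0': ['1.0E-003', '2.0E-003']}, index=dates)

# The dict keys become the outer 'Dataset' level; the dates become 'Cal Date'
combined = pd.concat({'CSV': csv_df, 'XML': xml_df}, names=['Dataset', 'Cal Date'])

# Cast the string coefficients to float, then order the rows by calibration date.
# mergesort is stable, so sources keep their relative order within each date.
combined['CC_a0'] = combined['CC_a0'].astype(float)
combined = combined.reset_index(level='Cal Date').sort_values(by='Cal Date', kind='mergesort')
print(combined)
```

A stable sort matters here because the comparison functions later treat the first row for each date as the reference value.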

In [147]:
comparison = {}
for uid in uids:
    # Stack the four sources. Passing sort=False keeps the column order of the
    # first frame and avoids the pandas FutureWarning about sorting columns.
    comparison[uid] = pd.concat([CSV[uid], cal[uid], xml[uid], qct[uid]], sort=False)
    comparison[uid].reset_index(level='Cal Date',inplace=True)
    comparison[uid].sort_values(by='Cal Date',inplace=True)
comparison


{'CGINS-CTDBPD-07209':           Cal Date            CC_a0            CC_a1             CC_a2  \
 Dataset                                                                  
 CSV     2012-10-25       0.00124622      0.000276784      -1.29857e-06   
 CAL     2012-10-25    1.246218e-003    2.767838e-004    -1.298569e-006   
 XML     2012-10-25  1.24621840E-003  2.76783777E-004  -1.29856876E-006   
 CSV     2015-12-10       0.00125386      0.000273973      -9.55142e-07   
 CAL     2015-12-10    1.253857e-003    2.739732e-004    -9.551421e-007   
 XML     2015-12-10  1.25385742E-003  2.73973191E-004  -9.55142116E-007   
 QCT     2015-12-10     1.253857e-03     2.739732e-04     -9.551422e-07   
 CSV     2017-07-18       0.00125159      0.000274682      -1.02563e-06   
 XML     2017-07-18  1.25158892E-003  2.74682089E-004  -1.02563341E-006   
 QCT     2017-07-18     1.251589e-03     2.746821e-04     -1.025633e-06   
 CSV     2018-05-25       0.00123631      0.000280439      -1.74758e-06   
 CA

In [148]:
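def convert_type(x):
    # The vendor files store coefficients as strings (e.g. '1.24621840E-003'),
    # so cast strings to float and leave every other type untouched
    if isinstance(x, str):
        return float(x)
    return x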

In [149]:
for uid in uids:
    comparison[uid] = comparison[uid].applymap(convert_type)
comparison

{'CGINS-CTDBPD-07209':           Cal Date     CC_a0     CC_a1         CC_a2         CC_a3  \
 Dataset                                                              
 CSV     2012-10-25  0.001246  0.000277 -1.298569e-06  1.889486e-07   
 CAL     2012-10-25  0.001246  0.000277 -1.298569e-06  1.889486e-07   
 XML     2012-10-25  0.001246  0.000277 -1.298569e-06  1.889486e-07   
 CSV     2015-12-10  0.001254  0.000274 -9.551421e-07  1.750128e-07   
 CAL     2015-12-10  0.001254  0.000274 -9.551421e-07  1.750128e-07   
 XML     2015-12-10  0.001254  0.000274 -9.551421e-07  1.750128e-07   
 QCT     2015-12-10  0.001254  0.000274 -9.551422e-07  1.750128e-07   
 CSV     2017-07-18  0.001252  0.000275 -1.025633e-06  1.771923e-07   
 XML     2017-07-18  0.001252  0.000275 -1.025633e-06  1.771923e-07   
 QCT     2017-07-18  0.001252  0.000275 -1.025633e-06  1.771923e-07   
 CSV     2018-05-25  0.001236  0.000280 -1.747581e-06  2.073228e-07   
 CAL     2018-05-25  0.001236  0.000280 -1.747581e-06  

In [150]:
def all_the_same(elements):
    """
    This function checks which values in an array match the first value.
    
    Args:
        elements - an array of m values
    Returns:
        error - a list of length (m-1) of booleans, where each entry is True
            if the corresponding element (second through last) is close to
            the first element, as judged by np.isclose
    
    """
    # Nothing to compare in an empty input; an empty list makes all() return True
    if len(elements) < 1:
        return []
    el = iter(elements)
    first = next(el, None)
    error = [np.isclose(element,first) for element in el]
    return error
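One caveat with `np.isclose`: with its documented defaults (`rtol=1e-05`, `atol=1e-08`), two values are treated as equal when `|a - b| <= atol + rtol * |b|`. Several of these coefficients are comparable to or smaller than `1e-08` in magnitude, so the default `atol` alone can mask real differences, and truncation in the later significant figures can fall inside the default `rtol`. A minimal sketch of the effect, where the truncated value and the stricter tolerances are illustrative assumptions rather than project settings:

```python
import numpy as np

full = 1.75012763e-07      # a coefficient recorded with nine significant figures
truncated = 1.75e-07       # the same coefficient truncated to three figures

# Default tolerances: atol=1e-08 is far larger than the difference (~1.3e-11)
default_check = np.isclose(truncated, full)

# Relative-only comparison: drop atol and tighten rtol to expose the truncation
strict_check = np.isclose(truncated, full, rtol=1e-9, atol=0.0)

print(default_check, strict_check)
```

The default check reports the two values as equal, while the relative-only check reports them as different. A very tight `rtol` will also flag sources that simply record fewer digits (the QCT capture shows seven significant figures, the XML nine), so the tolerance needs to be chosen with the coarsest source in mind.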

In [151]:
def locate_cal_error(array):
    """
    This function locates which source file (e.g. xmlcon vs csv vs cal)
    has calibration values that are different from the others. It does
    NOT identify which is correct, only which is different.
    
    Args:
        array - A pandas Series, indexed by source, which contains the values
                for a specific calibration coefficient for a specific date
                from all of the calibration source files
    Returns:
        dataset - a list containing which calibration sources are different
                from the first source
        True - if all of the calibration values are the same
        False - if the first calibration value is different from all others
    """
    # Call the function to check if there are any differences between each of
    # the calibration values from the different sources
    error = all_the_same(array)
    # If they are all the same, return True
    if all(error):
        return True
    # If there is a mixture of True/False, find the False entries and return
    # their sources (offset by one because the first value is the reference)
    elif any(error):
        indices = [i+1 for i, j in enumerate(error) if not j]
        dataset = list(array.index[indices])
        return dataset
    # Last, if all are False, the first value differs from every other value
    else:
        return False
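Because the check above uses the first value as its reference, the result depends on which source happens to be listed first. One possible alternative, sketched here under assumptions, is to compare every source against the median of all sources so that the order no longer matters. The helper name `flag_outlier_sources`, the tolerance, and the values are hypothetical and only loosely modeled on the CC_i output above:

```python
import numpy as np
import pandas as pd

def flag_outlier_sources(values, rtol=1e-6):
    """Return the source labels whose value differs from the median of all sources."""
    data = values.to_numpy(dtype=float)
    median = np.median(data)
    # Relative-only comparison so very small coefficients are not masked by atol
    differs = ~np.isclose(data, median, rtol=rtol, atol=0.0)
    return list(values.index[differs])

# Illustrative values: three sources agree, the csv entry does not
cc_i = pd.Series([-3.783545e-04, -3.821880e-04, -3.821880e-04, -3.821880e-04],
                 index=['CSV', 'CAL', 'XML', 'QCT'])
print(flag_outlier_sources(cc_i))
```

Like the original function, this only reports which sources disagree, not which one is correct. With an even split (two sources against two) the median falls between the groups and every source is flagged, which is a reasonable prompt for manual review.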

In [152]:
# With all the functions set up, now go through all of the data
def search_for_errors(df):
    """
    This function is designed to search through a pandas dataframe
    which contains all of the calibration coefficients from all of
    the files, and check for differences.
    
    Args: 
        df - A dataframe which contains all of the calibration coefficients
        from the asset management csv, qct checkout, and the vendor
        files (.cal and .xmlcon)
    Returns:
        cal_errors - A nested dictionary containing the calibration date, the
        relevant calibration coefficient, and which file(s) have the
        differing calibration value.
    """
    
    cal_errors = {}
    for date in np.unique(df['Cal Date']):
        df2 = df[df['Cal Date'] == date]
        wrong_cals = {}
        for column in df2.columns.values:
            array = df2[column]
            # Skip the 'Cal Date' column itself
            if array.dtype == 'datetime64[ns]':
                continue
            error = locate_cal_error(array)
            if error is False:
                # The first source disagrees with all of the others
                wrong_cals.update({column:array.index[0]})
            elif error is True:
                pass
            else:
                # A list of the sources which disagree with the first source
                wrong_cals.update({column:error})
        
        if len(wrong_cals) < 1:
            cal_errors.update({str(date).split('T')[0]:'No Errors'})
        else:
            cal_errors.update({str(date).split('T')[0]:wrong_cals})
    
    return cal_errors

In [153]:
cal_errors = {}
for uid in uids:
    ce = search_for_errors(comparison[uid])
    cal_errors.update({uid:ce})
    

In [154]:
cal_errors

{'CGINS-CTDBPD-07209': {'2012-10-25': 'No Errors',
  '2015-12-10': {'CC_i': 'CSV', 'CC_j': 'CSV'},
  '2017-07-18': 'No Errors',
  '2018-05-25': 'No Errors'},
 'CGINS-CTDBPD-07239': {'2012-12-07': 'No Errors',
  '2015-02-19': 'No Errors',
  '2017-12-02': 'No Errors',
  '2019-01-03': 'No Errors'},
 'CGINS-CTDBPD-50008': {'2014-01-25': 'No Errors',
  '2015-12-12': 'No Errors',
  '2016-12-14': 'No Errors',
  '2016-12-15': 'No Errors',
  '2017-12-12': 'No Errors'},
 'CGINS-CTDBPD-50058': {'2014-09-25': 'No Errors',
  '2015-12-11': 'No Errors',
  '2016-12-14': 'No Errors'},
 'CGINS-CTDBPD-50110': {'2015-03-29': 'No Errors',
  '2016-06-26': 'No Errors',
  '2017-07-11': 'No Errors',
  '2018-05-19': 'No Errors'}}

In [155]:
pd.DataFrame.from_dict(cal_errors)

Unnamed: 0,CGINS-CTDBPD-07209,CGINS-CTDBPD-07239,CGINS-CTDBPD-50008,CGINS-CTDBPD-50058,CGINS-CTDBPD-50110
2012-10-25,No Errors,,,,
2012-12-07,,No Errors,,,
2014-01-25,,,No Errors,,
2014-09-25,,,,No Errors,
2015-02-19,,No Errors,,,
2015-03-29,,,,,No Errors
2015-12-10,"{'CC_i': 'CSV', 'CC_j': 'CSV'}",,,,
2015-12-11,,,,No Errors,
2015-12-12,,,No Errors,,
2016-06-26,,,,,No Errors


In [156]:
df2 = pd.DataFrame.from_dict(cal_errors, orient='index')

In [157]:
df2

Unnamed: 0,2012-10-25,2015-12-10,2017-07-18,2018-05-25,2012-12-07,2015-02-19,2017-12-02,2019-01-03,2014-01-25,2015-12-12,2016-12-14,2016-12-15,2017-12-12,2014-09-25,2015-12-11,2015-03-29,2016-06-26,2017-07-11,2018-05-19
CGINS-CTDBPD-07209,No Errors,"{'CC_i': 'CSV', 'CC_j': 'CSV'}",No Errors,No Errors,,,,,,,,,,,,,,,
CGINS-CTDBPD-07239,,,,,No Errors,No Errors,No Errors,No Errors,,,,,,,,,,,
CGINS-CTDBPD-50008,,,,,,,,,No Errors,No Errors,No Errors,No Errors,No Errors,,,,,,
CGINS-CTDBPD-50058,,,,,,,,,,,No Errors,,,No Errors,No Errors,,,,
CGINS-CTDBPD-50110,,,,,,,,,,,,,,,,No Errors,No Errors,No Errors,No Errors


In [81]:
os.getcwd()

'/media/andrew/OS/Users/areed/Documents/OOI-CGSN/QAQC_Sandbox'

In [83]:
df2.to_csv('CTDBPC_Errors.csv')

### Check which CTDBP-C Calibration files are not correctly named
In order to check the calibration values, the calibration csv files need to be correctly named. We can check this by comparing the deployment dates with the dates in the CTDBPC calibration file names. This requires loading the deployment csvs, parsing all of the calibration file names, flagging the file names whose date MATCHES a deployment date, and then revisiting those files in order to correct the name.
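The file-name check can be sketched as follows. It assumes calibration file names of the form `<UID>__<YYYYMMDD>.csv`, consistent with the `split('__')` parsing used in the cells of this section. The file names and deployment dates are made up for illustration, and `flag_suspect_files` is a hypothetical helper:

```python
def flag_suspect_files(file_names, deploy_dates):
    """Return the file names whose date stamp equals a deployment start date."""
    deploy_dates = set(deploy_dates)
    suspects = []
    for name in file_names:
        stem = name.rsplit('.', 1)[0]
        if '__' not in stem:
            continue  # skip files that do not follow the naming pattern
        date = stem.split('__')[1]
        if date in deploy_dates:
            suspects.append(name)
    return suspects

# Hypothetical inputs: two calibration files and one unrelated file
files = ['CGINS-CTDBPC-00001__20150101.csv',
         'CGINS-CTDBPC-00002__20160512.csv',
         'README.md']
suspects = flag_suspect_files(files, ['20160512', '20170101'])
print(suspects)
```

Skipping names without the `__` separator avoids an `IndexError` if the directory holds anything other than calibration files.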

In [None]:
# List the deployment csvs
# Parse for all WHOI CG Deployment Sheets based on 'CP' or CG
# Easier to exclude the non-CG files (names starting with 'RS' or 'CE', or containing 'MOAS')
deploy_csvs = []
for file in os.listdir('../GitHub/OOI-Integration/asset-management/deployment/'):
    if file.startswith(('RS', 'CE')) or 'MOAS' in file:
        continue
    deploy_csvs.append(file)
    print(file)

In [None]:
# Get the Deployment History from the WHOI Asset Tracking System
CTDBPC_Deploy = CTDBPC['Deployment History']

In [None]:
CTDBPC_Deploy

In [None]:
# Split the string at the newline to generate a list of deployments for each CTDBP-C
CTDBPC_Deploy = CTDBPC['Deployment History'].apply(lambda x: x.split('\n'))

In [None]:
CTDBPC_Deploy

In [None]:
# List out all the individual deployments
deploy_list = []
for sensor_deployments in CTDBPC_Deploy:
    for item in sensor_deployments:
        # Keep only entries that look like a deployment name
        if '-' in item:
            deploy_list.append(item)

In [None]:
# So I now have a list of the deployments all the CTDBP-Cs were used on.
# Now, parse the name of the array (the text before the first '-') from each deployment
array = list( set( [x.split('-')[0] for x in deploy_list] ) )
array

In [None]:
# With the list of array names, I can now parse the deployment file names to find
# the relevant deployment sheets which match where the CTDBP-Cs were deployed
deploy_csvs = []
for file in os.listdir('../GitHub/OOI-Integration/asset-management/deployment/'):
    if file.split('_')[0] in array:
        deploy_csvs.append(file)
deploy_csvs

In [None]:
# Using the identified deployment csvs, can now load the deployment csvs into
# a pandas dataframe (pd.concat replaces DataFrame.append, which newer pandas removed)
deploy_dir = '../GitHub/OOI-Integration/asset-management/deployment/'
deployments = pd.concat([pd.read_csv(deploy_dir + file) for file in deploy_csvs])
deployments.head()

In [None]:
# Get the CTDBPC sensor uids
sensor_uids = list( set( CTDBPC['UID'] ) )
sensor_uids

In [None]:
# Find in the deployment spreadsheets the matching entry for the CTDBP-Cs that I'm looking for
deployments['CTDBPC'] = deployments['sensor.uid'].isin(sensor_uids)
deployments = deployments[deployments['CTDBPC']]

In [None]:
deployments.head()

In [None]:
# Now, parse out the date string in the format of YYYYMMDD from the startDateTime
# in order to compare with the date in the calibration file names
deploy_dates = deployments['startDateTime'].apply(lambda x: x.replace('-','').split('T')[0])
deploy_dates = list(set(deploy_dates))
deploy_dates

In [None]:
cal_csvs = []
for file in os.listdir('../GitHub/OOI-Integration/asset-management/calibration/CTDBPC/'):
    # Pull the YYYYMMDD date stamp out of the file name
    date = file.split('__')[1].split('.')[0]
    if date in deploy_dates:
        # list.append works in place and returns None, so do not reassign
        cal_csvs.append(file)
print(cal_csvs)
        

Great! None of the CTDBP-C have calibration dates which match deployment dates. That is a good sign - it means that the dates in the calibration file name *should* match the calibration dates in the calibration info.

However, that is no guarantee that the date in the file name matches the date in the calibration data. This can be checked in a future step by comparing the date in the file name with the calibration date in the vendor docs, the QCT info, and the .cal and .xmlcon file info.
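A minimal sketch of that future check, assuming the calibration dates from the vendor or QCT sources are already available as `datetime.date` objects. The helper `filename_date_matches`, the file names, and the reference dates are hypothetical:

```python
from datetime import date, datetime

def filename_date_matches(file_name, reference_dates):
    """True if the YYYYMMDD stamp in the file name is one of the reference dates."""
    stamp = file_name.rsplit('.', 1)[0].split('__')[1]
    # strptime raises ValueError if the stamp is not a valid calendar date
    file_date = datetime.strptime(stamp, '%Y%m%d').date()
    return file_date in set(reference_dates)

# Hypothetical calibration dates taken from the vendor files for one instrument
vendor_dates = [date(2015, 12, 10), date(2017, 7, 18)]

ok = filename_date_matches('CGINS-CTDBPD-07209__20151210.csv', vendor_dates)
bad = filename_date_matches('CGINS-CTDBPD-07209__20151211.csv', vendor_dates)
print(ok, bad)
```

Parsing the stamp with `strptime` also catches malformed dates in file names, which a plain string comparison would silently accept.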

In [None]:
# Now, using the "deploy" csvs for each node in the various arrays,
# need to load into a large pandas dataframe for easy handling
import pandas as pd

# pd.concat replaces DataFrame.append, which newer pandas removed
deploy_dir = '../GitHub/OOI-Integration/asset-management/deployment/'
deployments = pd.concat([pd.read_csv(deploy_dir + file) for file in deploy_csvs])

In [None]:
deployments

In [None]:
# Get all the unique deployment dates from the deployment csvs and put into the form of 
# YYYYMMDD. 
deploy_dates = deployments['startDateTime'].apply(lambda x: x.split('T')[0].replace('-',''))

In [None]:
deploy_dates = list(set(deploy_dates))
deploy_dates[0:10]

In [None]:
len(deploy_dates)

In [None]:
check_files = []
for root, dirs, files in os.walk('../GitHub/OOI-Integration/asset-management/calibration/'):
    for name in files:
        if 'CGINS' in name:
            cal_date = name.split('__')[1].split('.')[0]
            if cal_date in deploy_dates:
                check_files.append(name)

In [None]:
# Okay, there are a potential 1364 files that we need to check on the
# calibration date in the file name, because the parsed date in the 
# file name matches a deployment date.
len(list(set(check_files)))

In [None]:
# Cool, now save the list to the local working directory, one unique file name per row
with open('calibration_files_to_check.csv','w',newline='') as csvfile:
    writer = csv.writer(csvfile)
    for name in sorted(set(check_files)):
        writer.writerow([name])