# Exploratory Analysis for Metadata Review in OOI Asset Management System

### Motivation:
The Asset Management system for OOI is primarily housed on GitHub in a variety of csv files. Until now, the calibration coefficients stored in the csv files have been entered manually. While we have used a "human-in-the-loop" review approach to catch errors, some errors have slipped through (e.g. truncation of significant figures).

### Approach:
My goal is to develop an automated approach to catch possible errors which already exist within the asset management system. To accomplish this, I will compare the csv files loaded into the GitHub asset management system with the original vendor files as well as the QCT (quality control testing) documents which capture the coefficients loaded onto the instrument at the time of reception at WHOI from the vendor.

### Data Sources:
* **GitHub**: CSV files containing the calibration coefficients. Directories are organized by sensor+class. The files are named "(CGINS)-(sensor+class)-(serial number)__(YYYYMMDD).csv", where YYYYMMDD is the calibration date.
* **Vault**: Version-controlled storage location of the vendor calibrations, in the Records/Instrument Records/Instrument directories. Within the relevant directory, calibration files are stored as either .cal, .xmlcon, .pdf, or within zipped directories.
* **Alfresco**: Version-controlled, web-accessed storage. The calibrations loaded onto the instrument during the initial check-in upon receipt (the QCT process) are stored here as either .cap or .txt files.
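
The file naming convention can be checked mechanically before any values are compared. Below is a minimal sketch, assuming the UID and the calibration date are joined by a double underscore (which is what the file-name splitting later in this notebook expects). `parse_cal_filename` and `CAL_NAME` are hypothetical names introduced only for illustration.

```python
import re
from datetime import datetime

# Assumed pattern: "<UID>__<YYYYMMDD>.csv", where the UID itself is
# "CGINS-<sensor+class>-<serial number>"
CAL_NAME = re.compile(
    r"^(CGINS-(?P<cls>[A-Z0-9]+)-(?P<serial>[^_]+))__(?P<date>\d{8})\.csv$"
)

def parse_cal_filename(fname):
    """Return (uid, sensor+class, serial, calibration date), or None if
    the file name does not follow the assumed convention."""
    match = CAL_NAME.match(fname)
    if match is None:
        return None
    cal_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group(1), match.group("cls"), match.group("serial"), cal_date

example = parse_cal_filename("CGINS-CTDBPC-50002__20140218.csv")
print(example)
```

A file name that returns `None` is a candidate for renaming before its contents are compared against the vendor documents.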

In [8]:
# Import likely important packages, etc.
import sys, os, csv, re
from wcmatch import fnmatch
import datetime
import time
import xml.etree.ElementTree as et
from zipfile import ZipFile
import numpy as np
import pandas as pd
import xarray as xr

In [2]:
coefficient_name_map = {
            'TA0': 'CC_a0',
            'TA1': 'CC_a1',
            'TA2': 'CC_a2',
            'TA3': 'CC_a3',
            'CPCOR': 'CC_cpcor',
            'CTCOR': 'CC_ctcor',
            'CG': 'CC_g',
            'CH': 'CC_h',
            'CI': 'CC_i',
            'CJ': 'CC_j',
            'G': 'CC_g',
            'H': 'CC_h',
            'I': 'CC_i',
            'J': 'CC_j',
            'PA0': 'CC_pa0',
            'PA1': 'CC_pa1',
            'PA2': 'CC_pa2',
            'PTEMPA0': 'CC_ptempa0',
            'PTEMPA1': 'CC_ptempa1',
            'PTEMPA2': 'CC_ptempa2',
            'PTCA0': 'CC_ptca0',
            'PTCA1': 'CC_ptca1',
            'PTCA2': 'CC_ptca2',
            'PTCB0': 'CC_ptcb0',
            'PTCB1': 'CC_ptcb1',
            'PTCB2': 'CC_ptcb2',
            # additional types for series O
            'C1': 'CC_C1',
            'C2': 'CC_C2',
            'C3': 'CC_C3',
            'D1': 'CC_D1',
            'D2': 'CC_D2',
            'T1': 'CC_T1',
            'T2': 'CC_T2',
            'T3': 'CC_T3',
            'T4': 'CC_T4',
            'T5': 'CC_T5',
        }

o2_coefficients_map = {
            'A': 'CC_residual_temperature_correction_factor_a',
            'B': 'CC_residual_temperature_correction_factor_b',
            'C': 'CC_residual_temperature_correction_factor_c',
            'E': 'CC_residual_temperature_correction_factor_e',
            'SOC': 'CC_oxygen_signal_slope',
            'OFFSET': 'CC_frequency_offset'
        }
        

### WHOI Asset Tracking Spreadsheet
First, I want to load and examine exactly what type of data is stored in the WHOI Asset Tracking Spreadsheet and what information it has that may be useful.

In [9]:
def whoi_asset_tracking(spreadsheet,sheet_name,instrument_class='All',whoi=True,series=None):
    """
    Loads all the individual sensors of a specific instrument class and
    series type. Currently applied only for WHOI deployed instruments.
    
    Args:
        spreadsheet - directory path and name of the excel spreadsheet with
            the WHOI asset tracking information.
        sheet_name - name of the sheet in the spreadsheet to load
        instrument_class - the type (i.e. CTDBP, CTDMO, PCO2W, etc). Defaults
            to 'All', which will load all of the instruments
        whoi - return only whoi instruments? Defaults to True.
        series - a specified class of the instrument to load. Defaults to None,
            which will load all of the series for a specified instrument class
    """
    
    all_sensors = pd.read_excel(spreadsheet,sheet_name=sheet_name,header=1)
    # Select a specific class of instruments
    if instrument_class == 'All':
        inst_class = all_sensors
    else:
        inst_class  = all_sensors[all_sensors['Instrument\nClass']==instrument_class]
    # Return only the whoi instruments?
    if whoi:
        whoi_insts = inst_class[inst_class['Deployment History'] != 'EA']
    else:
        whoi_insts = inst_class
    # Select a specific series of the instrument?
    if series is not None:
        instrument = whoi_insts[whoi_insts['Series'] == series]
    else:
        instrument = whoi_insts
 
    return instrument
    
    

In [10]:
#excel_spreadsheet = 'C:/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
excel_spreadsheet = '/media/andrew/OS/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
sheet_name = 'Sensors'

In [14]:
# What are all the different series of CTDs?
CTDBP = whoi_asset_tracking(excel_spreadsheet,sheet_name,instrument_class='CTDBP',whoi=True)
CTDBP

Unnamed: 0,Instrument Class,Series,Supplier Serial Number,WHOI #,OOI #,UID,Model,CGSN PN,Firmware Version,Supplier,...,QCT Testing,PreDeployment,Post Deployment,Refurbishment/ Repair,DO Number,Date Received,Deployment History,Current Deployment,Instrument Location on Current Deployment,Notes
49,CTDBP,F,16-50001,116098,A00635,CGINS-CTDBPF-50001,16PlusV2,1336-00001-00006,2.5.2,SeaBird,...,3305-00102-00016\n3305-00102-00091\n3305-00102...,,,3305-00900-00080\n3305-00900-00280,WH-SC11-01-CTD-1007,2014-01-23 00:00:00,GI01SUMO-00001\nGI01SUMO-00003\nGI01SUMO-00005,GI01SUMO-00005,NSIF,(NSIF)
50,CTDBP,C,16-50002,116255,A00697,CGINS-CTDBPC-50002,16PlusV2,1336-00001-00003,2.5.2,SeaBird,...,3305-00102-00019\n3305-00102-00071\n3305-00102...,,,3305-00900-00035\n3305-00900-00121\n3305-00900...,WH-SC11-01-CTD-1007,2014-02-18 00:00:00,CP04OSSM-00001\nCP01CNSM-00004\nCP01CNSM-00007...,,,On OSSM if it returns from vendor in time\nUpd...
51,CTDBP,C,16-50003,116256,A00698,CGINS-CTDBPC-50003,16PlusV2,1336-00001-00003,2.5.2,SeaBird,...,3305-00102-00020\n3305-00102-00082\n3305-00102...,,,3305-00900-00085\n3305-00900-00178\n3305-00900...,WH-SC11-01-CTD-1007,2014-02-18 00:00:00,CP03ISSM-00001\nCP04OSSM-00004\nCP04OSSM-00006...,,,Detached from Mooring ??
52,CTDBP,E,16-50004,116257,A00699,CGINS-CTDBPE-50004,16PlusV2,1336-00001-00005,2.5.2,SeaBird,...,3305-00102-00017\n3305-00102-00070\n3305-00102...,,,3305-00900-00035\n3305-00900-00360,WH-SC11-01-CTD-1005,2014-02-18 00:00:00,CP04OSSM-00001\nCP04OSSM-00005\nCP04OSSM-00007,CP04OSSM-00009,MFN,
53,CTDBP,D,16-50008,116258,A00700,CGINS-CTDBPD-50008,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00018\n3305-00102-00083\n3305-00102...,,,3305-00900-00085\n3305-00900-00178\n3305-00900...,WH-SC11-01-CTD-1007,2014-02-18 00:00:00,CP01CNSM-00002\nCP01CNSM-00003\nCP03ISSM-00004...,CP01CNSM-00010,MFN,
55,CTDBP,C,16-50056,116827,A01089,CGINS-CTDBPC-50056,16PlusV2,1336-00001-00003,2.5.2,SeaBird,...,3305-00102-00037\n3305-00102-00086\n3305-00102...,,,3305-00900-00085\n3305-00900-00285\n3305-00900...,WH-SC11-01-CTD-1012,2014-09-26 00:00:00,CP3a Spare\nCP03ISSM-00002\nCP04OSSM-00005\nCP...,,,
56,CTDBP,D,16-50058,116848,A01106,CGINS-CTDBPD-50058,16PlusV2,1336-00001-00004,2.5.2,SeaBird,...,3305-00102-00038\n3305-00102-00088\n3305-00102...,,,3305-00900-00085\n3305-00900-00178\n3305-00900...,WH-SC11-01-CTD-1012,2014-10-01 00:00:00,CP3a Spare\nCP03ISSM-00002\nCP01CNSM-00005\nCP...,,,
57,CTDBP,P,16-50059,116834,A01096,CGINS-CTDBPP-50059,16Plus-IM V2,1336-00001-00016,2.5.2,SeaBird,...,3305-00126-00001,,,,WH-SC11-01-CTD-1014,2014-09-29 00:00:00,GI01SUMO-00002,,,Lost on GI01SUMO-00002 (40 m depth)
58,CTDBP,F,16-50060,116830,A01092,CGINS-CTDBPF-50060,16PlusV2,1336-00001-00006,2.5.2,SeaBird,...,3305-00102-00039\n3305-00102-00092,,,3305-00900-00103,WH-SC11-01-CTD-1013,2014-09-29 00:00:00,GS01SUMO-00001\nGS01SUMO-00003,,,
59,CTDBP,F,16-50061,116831,A01093,CGINS-CTDBPF-50061,16PlusV2,1336-00001-00006,2.5.2,SeaBird,...,3305-00102-00040\n3305-00102-00113,,,3305-00900-00155,WH-SC11-01-CTD-1013,2014-09-29 00:00:00,GI01SUMO-00002,,,


In [22]:
list(CTDBP[CTDBP['Supplier\nSerial Number'] == '16-50002']['QCT Testing'])

['3305-00102-00019\n3305-00102-00071\n3305-00102-00094\n3305-00102-00127\n3305-00102-00131\n3305-00102-00155']

In [None]:
CTDBPP = whoi_asset_tracking(excel_spreadsheet,sheet_name,instrument_class='CTDBP',whoi=True,series='P')

In [None]:
CTDBPP

### Checking instrument calibration values
After loading the **WHOI Asset Tracking Sheet**, we now have the following critical data for checking calibration information:
* Supplier Serial Number - this links back to the original **.cal**, **.xmlcon**, and vendor docs
* OOI UID - this is the link between the instrument and the OOINet
* QCT Document Number - this number links the instrument to the QCT screen capture of the calibration values loaded onto the instruments
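
Each QCT cell in the tracking sheet holds several document numbers in one newline-separated string, as in the output for serial number `16-50002` above. Here is a minimal sketch of splitting such a cell into individual document numbers. `split_doc_numbers` is a hypothetical helper, and treating non-string cells (e.g. NaN for an empty cell) as "no documents" is an assumption.

```python
def split_doc_numbers(cell):
    """Split a newline-separated tracking-sheet cell into a list of
    document numbers. Non-string cells (e.g. NaN) give an empty list."""
    if not isinstance(cell, str):
        return []
    return [part.strip() for part in cell.splitlines() if part.strip()]

# The cell shown above for supplier serial number 16-50002
cell = ('3305-00102-00019\n3305-00102-00071\n3305-00102-00094\n'
        '3305-00102-00127\n3305-00102-00131\n3305-00102-00155')
docs = split_doc_numbers(cell)
print(docs)
```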

### Process to load the **CSV** calibration file
In order to check the calibrations in asset management, I have to be able to load the asset management calibration csv files into a dataframe.
* First, get all the unique CTDBPCs in Asset Management
* Next, parse the csv files in asset management to get the unique instrument serial numbers
* With the serial numbers, find the associated instrument calibration csvs
* For each calibration csv, load the data into a pandas dataframe
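
The last step reshapes each csv from one row per coefficient into one record per calibration date. Here is a dependency-free sketch of that reshaping, assuming the csv has `name` and `value` columns. The `serial` and `notes` columns and all sample values are made up for illustration, and `wide_from_long` is a hypothetical helper.

```python
import csv
import io

def wide_from_long(csv_text, cal_date):
    """Turn long-form rows (one coefficient per row) into a nested
    dictionary {calibration date: {coefficient name: value}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Values are kept as strings so the number of significant figures
    # written in the file is preserved for later comparison
    return {cal_date: {row["name"]: row["value"] for row in reader}}

# Made-up sample content; real files hold the full coefficient set
sample = (
    "serial,name,value,notes\n"
    "50002,CC_a0,1.234567e-03,\n"
    "50002,CC_a1,2.5e-04,\n"
)
table = wide_from_long(sample, "20140218")
print(table)
```

Keeping the values as text until the comparison step makes it possible to see truncation directly, since `1.2346e-03` and `1.234567e-03` differ as strings even when a loose numeric tolerance would accept them.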

In [1]:
def load_asset_management(instrument,filepath):
    """
    Loads the calibration csv files from a local repository containing
    the asset management information.
    
    Args:
        instrument - a pandas dataframe with the asset tracking information
            for a specific instrument.
        filepath - the directory path pointing to where the csv files are
            stored.
    Raises:
        TypeError - if the instrument input is not a pandas dataframe
    Returns:
        csv_dict - a dictionary with keys of the UIDs from the instrument dataframe
            which correspond to lists of the relevant calibration csv files
            
    """
    
    # Check that the input is a pandas DataFrame
    if not isinstance(instrument, pd.DataFrame):
        raise TypeError('instrument must be a pandas DataFrame')
        
    uids = sorted(set(instrument['UID']))
    
    csv_dict = {}
    for uid in uids:
        # Get all the csvs from asset management for this particular UID
        csv_files = [file for file in os.listdir(filepath)
                     if fnmatch.fnmatch(file,'*'+uid+'*')]
        
        # Only keep the UIDs which have at least one calibration csv
        if len(csv_files) > 0:
            csv_dict.update({uid:csv_files})
        
    return csv_dict
    


In [3]:
csv_dict = load_asset_management(CTDBPP,'../../GitHub/OOI-Integration/asset-management/calibration/CTDBPP/')
csv_dict

In [None]:
# Now I need to load the all of the csv files based on their UID
def load_csv_info(csv_dict,filepath):
    """
    Loads the calibration coefficient information contained in asset management
    
    Args:
        csv_dict - a dictionary which associates an instrument UID to the
            calibration csv files in asset management
        filepath - the path to the directory containing the calibration csv files
    Returns:
        csv_cals - a dictionary which associates an instrument UID to a pandas
            dataframe which contains the calibration coefficients. The dataframes
            are indexed by the date of calibration
    """
    
    # Load the calibration data into pandas dataframes, which are then placed into
    # a dictionary by the UID
    csv_cals = {}
    for uid in csv_dict:
        frames = []
        for file in csv_dict[uid]:
            data = pd.read_csv(os.path.join(filepath,file))
            # The calibration date follows the double underscore in the file name
            date = file.split('__')[1].split('.')[0]
            data['CAL DATE'] = pd.to_datetime(date)
            frames.append(data)
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        csv_cals.update({uid:pd.concat(frames)})
        
    # Pivot so that each row is a calibration date and each column a coefficient
    for uid in csv_cals:
        csv_cals[uid] = csv_cals[uid].pivot(index='CAL DATE', columns='name', values='value')
        
    return csv_cals



In [None]:
CSV = load_csv_info(csv_dict,'../../GitHub/OOI-Integration/asset-management/calibration/CTDBPP/')
CSV

Now we have successfully loaded the csv calibrations into a pandas dataframe that allows for easy comparison between calibrations based on the calibration date for each calibration coefficient.

### Load the QCT values
The next step is to take the capture files from the QCT and load them into a comparable pandas dataframe. This involves several steps:
* Get the QCT document numbers from the WHOI Asset Tracking Sheet for each individual instrument
* Find where the QCT documents are stored
* Load the QCT documents
* Parse the QCT documents
* Translate the parsed QCT values into a pandas dataframe
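
The parsing step can be sketched on a small piece of text before it is run against real capture files. The sample below is made up and only mimics the "name: date" and "NAME = value" line shapes that the parsing code in this section splits on. `parse_capture` is a hypothetical helper and real capture files may contain other line types.

```python
import re

# Made-up capture text, for illustration only
SAMPLE = """temperature:  08-feb-14
    TA0 = 1.251234e-03
    TA1 = 2.751234e-04
conductivity:  08-feb-14
    G = -1.001234e+00
"""

def parse_capture(text):
    """Collect key/value pairs from lines shaped like 'key: value'
    or 'key = value'; any other line is skipped."""
    pairs = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:=]+?)\s*[:=]\s*(\S.*?)\s*$", line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs

parsed = parse_capture(SAMPLE)
print(parsed)
```

Anchoring the pattern to the whole line means free-text lines without a separator are dropped, rather than ending up as keys.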

In [None]:
uids = sorted( list( set( CTDBPP['UID'])))

In [None]:
qct_dict = {}
for uid in uids:
    # Get the QCT Document numbers from the asset tracking sheet
    CTDBPP['UID_match'] = CTDBPP['UID'].apply(lambda x: True if uid in x else False)
    qct_series = CTDBPP[CTDBPP['UID_match'] == True]['QCT Testing']
    qct_series = list(qct_series.iloc[0].split('\n'))
    qct_dict.update({uid:qct_series})

In [None]:
qct_dict

In [None]:
#dirpath = 'C:/Users/areed/Documents/Project_Files/Records/Instrument_Records/cap_files/'
dirpath = '/media/andrew/OS/Users/areed/Documents/Project_Files/'

In [None]:
# Try building a function to do the file path generator
def generate_file_path(dirpath,filename,ext=('.cap','.txt','.log'),exclude=('_V','_Data_Workshop')):
    """
    Function which searches for the location of the given file and returns
    the full path to the file.
    
    Args:
        dirpath - parent directory path under which to search
        filename - the name of the file to search for
        ext - optional list of file extensions to accept. Defaults to
            the extensions used for the QCT capture files.
        exclude - optional list which allows for excluding certain
            directories from the search
    Returns:
        fpath - the path to the first matching file, built from dirpath,
            or None if no matching file is found.
    """
    patterns = [filename+'*'+e for e in ext]
    for root, dirs, files in os.walk(dirpath):
        # Prune the excluded directories in place so os.walk skips them
        dirs[:] = [d for d in dirs if d not in exclude]
        for fname in files:
            if fnmatch.fnmatch(fname, patterns):
                return os.path.join(root, fname)
    return None

In [None]:
# Now to develop an automated approach to load all the QCT documents, parse them
# into a dictionary, and convert the dictionary into a pandas dataframe
def load_qct_data(qct_dict,coefficient_name_map,dirpath='../../../Documents/Project_Files/'):
    qct = {}
    qct_missing = {}
    for uid in qct_dict:
        print(uid)
        capture_data = {}
        missing = []
        for capfile in qct_dict[uid]:
            # First, find and return the path to the capture file which
            # matches the capture file identifier
            cappath = generate_file_path(dirpath, capfile)
            
            # Pull out the coefficients from the capture file. This is a naive implementation
            # which splits each line on ": " or " =" without interpreting the file further
            if cappath is None:
                missing.append(capfile)
            else:
                coeffs = {}
                with open(cappath) as capture_file:
                    for line in capture_file.read().splitlines():
                        items = re.split(': | =',line)
                        key = items[0].strip()
                        value = items[-1].strip()
                        coeffs.update({key:value})
                    
                # The best way to do this is to use the CTD name mapping to only get the important values
                capture = {}
                # With the capture coefficients, now map it to the CTD coefficients
                for key in coeffs.keys():
                    if key in coefficient_name_map.keys():
                        capture[coefficient_name_map[key]] = coeffs[key]
            
                # Get the calibration date
                caldate = coeffs['conductivity']
            
                # Update the capture file to include the calibration date
                capture['CAL DATE'] = pd.to_datetime(caldate)
            
                # Now, update the parent dictionary
                capture_data.update({capfile:capture})
            
        df = pd.DataFrame.from_dict({i: capture_data[i] for i in capture_data.keys()}, orient='index')
        qct.update({uid:df})
        qct_missing.update({uid:missing})
        
    return qct, qct_missing   

In [None]:
qct, qct_missing = load_qct_data(qct_dict,coefficient_name_map,dirpath='../../../../Documents/Project_Files/')

In [None]:
qct

In [None]:
qct_missing

In [None]:
# Reset the index to the calibration date
for uid in qct:
    qct[uid].set_index('CAL DATE', drop=True, inplace=True)

In [None]:
qct

### Vendor Calibration values: .cal and .xmlcon
The next step is to load the CTD .cal and .xmlcon files in order to compare the vendor-supplied coefficients with the asset management csv and QCT values.
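
Reading a .cal file comes down to two operations: converting the calibration date and renaming the vendor keys. Here is a minimal sketch on made-up file content, using only three entries of the coefficient name map. `map_cal_text`, `NAME_MAP` and the sample values are illustrative assumptions, not vendor data.

```python
from datetime import datetime

# A small subset of the vendor -> asset management name mapping
NAME_MAP = {"TA0": "CC_a0", "TA1": "CC_a1", "CG": "CC_g"}

# Made-up .cal content in KEY=VALUE form
SAMPLE_CAL = """INSTRUMENT_TYPE=SEACATPLUS
SERIALNO=50002
CCALDATE=18-Feb-14
TA0=1.251234e-03
TA1=2.751234e-04
CG=-1.001234e+00
"""

def map_cal_text(text, name_map):
    """Return the renamed coefficients and the conductivity calibration
    date as a YYYYMMDD string (None if CCALDATE is absent)."""
    coefficients, cal_date = {}, None
    for line in text.splitlines():
        key, sep, value = line.replace(" ", "").partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        if key == "CCALDATE":
            cal_date = datetime.strptime(value, "%d-%b-%y").strftime("%Y%m%d")
        elif key in name_map:
            coefficients[name_map[key]] = value
    return coefficients, cal_date

coefficients, cal_date = map_cal_text(SAMPLE_CAL, NAME_MAP)
print(cal_date, coefficients)
```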

In [None]:
def get_serial_num(df):
    serial_num = list(df[df['UID_match'] == True]['Supplier\nSerial Number'])
    serial_num = serial_num[0].split('-')[1]
    return serial_num

In [None]:
serial_nums = {}
for uid in uids:
    CTDBPP['UID_match'] = CTDBPP['UID'].apply(lambda x: True if uid in x else False)
    serial_num = get_serial_num(CTDBPP)
    serial_nums.update({uid:serial_num})
    

In [None]:
serial_nums

In [None]:
def read_cal(data, coefficient_name_map):
    """
    Reads in the calibration coefficients from the vendor supplied
    .cal file.
        
    Args:
        data - the text of a .cal file, already read and decoded
            as ASCII.
        coefficient_name_map - a mapping of the vendor coefficient names
            to the coefficient names used in asset management
    Returns:
        coefficients - a dictionary of the mapped coefficient names and
            their values from the cal file
        date - the conductivity calibration date as a YYYYMMDD string,
            or None if the file has no CCALDATE entry
    """
    coefficients = {}
    date = None
    for line in data.splitlines():
        # Skip blank or malformed lines which are not KEY=VALUE pairs
        if '=' not in line:
            continue
        key, value = line.replace(" ","").split('=',1)

        if key == 'CCALDATE':
            date = datetime.datetime.strptime(value, '%d-%b-%y').strftime('%Y%m%d')

        name = coefficient_name_map.get(key)
        if name is not None:
            coefficients[name] = value
            
    return coefficients,date

In [None]:
def read_xml(data, coefficient_name_map, o2_coefficients_map):
    Tflag = False
    O2flag = False
    coefficients = {}
    date = None
        
    for child in data.iter():
        key = child.tag.upper()
        # Elements without any text have child.text set to None
        value = child.text.upper() if child.text is not None else None
        
        # Flag the presence of an oxygen sensor
        if key == 'OXYGENSENSOR':
            O2flag = True
        
        # Keep the first calibration date found in the file
        if key == 'CALIBRATIONDATE':
            if date is None and value is not None:
                date = datetime.datetime.strptime(value.strip(), '%d-%b-%y').strftime('%Y%m%d')
            
        # Tags inside the temperature sensor block are prefixed with 'T' so that
        # they match the temperature keys (e.g. 'TA0') in the coefficient name map
        if key == 'TEMPERATURESENSOR':
            Tflag = True
        elif 'SENSOR' in key and Tflag:
            Tflag = False
        
        if Tflag:
            key = 'T'+key
        
        # Find the mapping of the vendor coeff name -> UFrame coefficient name.
        # dict.get returns None instead of raising, so fall back to the oxygen
        # map explicitly when an oxygen sensor is present
        name = coefficient_name_map.get(key)
        if name is None and O2flag:
            name = o2_coefficients_map.get(key)

        # Store the coefficient. Unmapped tags all collect under the key None,
        # which is dropped when the dataframe is built
        coefficients.update({name:value})
        
    return coefficients,date

In [None]:
vendor_files = {}
for uid,sn in serial_nums.items():
    files = []
    for file in os.listdir('../../../../Documents/Project_Files/Records/Instrument_Records/CTDBP/'):
        # Keep only the calibration files which match the serial number
        if sn in file and 'Calibration_File' in file:
            files.append(file)
    vendor_files.update({uid:files})

In [None]:
vendor_files

In [None]:
def load_cal_coeffs(files, filepath, coefficient_name_map, o2_coefficients_map):
    """
    Loads all of the calibration coefficients from the vendor cal files for
    a given CTD instrument class.
    
    Args:
        files - a list of zipfile names containing the vendor calibration files
        filepath - directory path to where the zipfiles are stored locally
        coefficient_name_map - a mapping of the calibration names in the vendor file
            to the calibration coeff names needed for OOINet
        o2_coefficients_map - mapping for CTDs containing an oxygen sensor
    Returns:
        cal_coeffs - a dictionary of the calibration coefficients with the respective
            values, nested in a dictionary sorted by calibration date
    """
    cal_coeffs = {}
    missing = []
    for file in files:
        fpath = filepath+file
        # If it is a zipfile, unzip to memory, find
        if fpath.endswith('.zip'):
            with ZipFile(fpath) as zfile:
                fname = [name for name in zfile.namelist() if '.cal' in name]
                if len(fname) > 0:
                    data = zfile.read(fname[0]).decode('ASCII')
                    coeffs, date = read_cal(data, coefficient_name_map)
                    cal_coeffs.update({date:coeffs})
                else:
                    print(f"No vendor documents of type '.cal' found for file {file}.")
                    missing.append(file)
        elif fpath.endswith('.cal'):
            with open(fpath) as cfile:
                data = cfile.read()
                coeffs, date = read_cal(data, coefficient_name_map)
                cal_coeffs.update({date:coeffs})
        else:
            print(f"No vendor documents of type '.cal' found for file {file}.")
            missing.append(file)
    
    return cal_coeffs, missing

In [None]:
cal = {}
cal_missing = {}
filepath = '../../../../Documents/Project_Files/Records/Instrument_Records/CTDBP/'
for uid,files in vendor_files.items():
    cal_coeffs, missing = load_cal_coeffs(files,filepath,coefficient_name_map,o2_coefficients_map)
    cal_df = pd.DataFrame.from_dict({i: cal_coeffs[i] for i in cal_coeffs.keys()}, orient='index')
    cal_df.index = pd.to_datetime(cal_df.index)
    cal.update({uid:cal_df})
    cal_missing.update({uid:missing})

In [None]:
cal

In [None]:
cal_missing

#### Repeat the above process with the .xmlcon file
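
The main wrinkle in the .xmlcon files is that tags inside the temperature sensor block need a `T` prefix before they match the coefficient name map. Here is a minimal sketch of that renaming on a heavily simplified snippet. The `<Sensors>` wrapper, the element layout and all values are assumptions made for illustration; real .xmlcon files are nested more deeply. `vendor_keys` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up stand-in for an .xmlcon file
SNIPPET = """<Sensors>
  <TemperatureSensor>
    <CalibrationDate>18-Feb-14</CalibrationDate>
    <A0>1.251234e-03</A0>
    <A1>2.751234e-04</A1>
  </TemperatureSensor>
  <ConductivitySensor>
    <G>-1.001234e+00</G>
  </ConductivitySensor>
</Sensors>"""

def vendor_keys(root):
    """Return {vendor key: text}, prefixing tags found inside the
    temperature sensor block with 'T'."""
    pairs = {}
    for sensor in root:
        prefix = "T" if sensor.tag.upper() == "TEMPERATURESENSOR" else ""
        for child in sensor:
            text = (child.text or "").strip()
            if text:
                pairs[prefix + child.tag.upper()] = text
    return pairs

pairs = vendor_keys(ET.fromstring(SNIPPET))
print(pairs)
```

Walking each sensor block explicitly avoids carrying flags across elements, at the cost of assuming that every coefficient sits directly under its sensor element.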

In [None]:
def load_xml_coeffs(files,filepath, coefficient_name_map, o2_coefficients_map):
    """
    Loads all of the calibration coefficients from the vendor cal files in xmlcon
    format for a given CTD instrument class.
    
    Args:
        files - a list of zipfile names containing the vendor calibration files
        filepath - directory path to where the zipfiles are stored locally
        coefficient_name_map - a mapping of the calibration names in the vendor file
            to the calibration coeff names needed for OOINet
        o2_coefficients_map - mapping for CTDs containing an oxygen sensor
    Returns:
        cal_coeffs - a dictionary of the calibration coefficients with the respective
            values, nested in a dictionary sorted by calibration date
    """
    
    cal_coeffs = {}
    missing = []
    for file in files:
        fpath = filepath+file
        # If it is a zipfile, unzip to memory, find
        if fpath.endswith('.zip'):
            with ZipFile(fpath) as zfile:
                fname = [name for name in zfile.namelist() if '.xmlcon' in name]
                if len(fname) > 0:
                    data = et.parse(zfile.open(fname[0]))
                    coeffs, date = read_xml(data, coefficient_name_map, o2_coefficients_map)
                    cal_coeffs.update({date:coeffs})
                else:
                    print(f"No vendor documents of type '.xmlcon' found for file {file}.")
                    missing.append(file)
        elif fpath.endswith('.xmlcon'):
            with open(fpath) as xfile:
                data = et.parse(xfile)
                coeffs, date = read_xml(data, coefficient_name_map, o2_coefficients_map)
                cal_coeffs.update({date:coeffs})
        else:
            print(f"No vendor documents of type '.xmlcon' found for file {file}.")
            missing.append(file)
            
    return cal_coeffs, missing


In [None]:
xml = {}
xml_missing = {}
filepath = '../../../../Documents/Project_Files/Records/Instrument_Records/CTDBP/'
for uid,files in vendor_files.items():
    xml_coeffs, missing = load_xml_coeffs(files,filepath,coefficient_name_map,o2_coefficients_map)
    xml_df = pd.DataFrame.from_dict({i: xml_coeffs[i] for i in xml_coeffs.keys()}, orient='index')
    xml_df.drop(columns=[None],axis=1,inplace=True)
    xml_df.index = pd.to_datetime(xml_df.index)
    xml.update({uid:xml_df})
    xml_missing.update({uid:missing})

In [None]:
xml

In [None]:
xml_missing

### Comparisons
Now that I have .cal, .xmlcon, the qct capture files, and the csv files from asset management, I can begin comparison of the calibration coefficients between the different files. The goal is that the dates, values, and coefficients all match.
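
The choice of tolerance matters for this comparison. `np.isclose` with its default settings treats two values as equal when `|a - b| <= atol + rtol * |b|`, with `rtol=1e-05` and `atol=1e-08`, so differences between very small coefficients can pass unnoticed. Here is a minimal sketch of a stricter, purely relative check using the standard library. `differs` is a hypothetical helper, the sample values are made up, and the tolerance of `1e-9` is an assumption that would need tuning.

```python
import math

def differs(a, b, rel_tol=1e-9):
    """True if two coefficient values disagree beyond a relative
    tolerance. No absolute tolerance is used, so very small
    coefficients are still compared meaningfully."""
    return not math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=0.0)

# Made-up values: a vendor value and a copy truncated to five figures
vendor = "1.2345678e-03"
truncated = "1.2346e-03"

# Two clearly different coefficients of order 1e-9
tiny_a, tiny_b = "1.2e-09", "3.4e-09"

print(differs(vendor, truncated))  # truncation is flagged
print(differs(tiny_a, tiny_b))     # small magnitudes are still compared
print(differs(vendor, vendor))     # identical values pass
```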

In [None]:
CSV

In [None]:
qct

In [None]:
cal

In [None]:
xml

In [None]:
# First, I need to reindex all of the different dataframes such that they all have two indices:
# A dataset index and a datetime index, and set them to uniform name (for concatenation)
for uid in uids:
    try:
        CSV[uid]['Dataset'] = 'CSV'
        CSV[uid].set_index(['Dataset',CSV[uid].index],inplace=True)
        CSV[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
    except:
        pass
CSV

In [None]:
qct

In [None]:
for uid in uids:
    qct[uid]['Dataset'] = 'QCT'
    qct[uid].set_index(['Dataset',qct[uid].index],inplace=True)
    qct[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
qct

In [None]:
for uid in uids:
    cal[uid]['Dataset'] = 'CAL'
    cal[uid].set_index(['Dataset',cal[uid].index],inplace=True)
    cal[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
cal

In [None]:
for uid in uids:
    xml[uid]['Dataset'] = 'XML'
    xml[uid].set_index(['Dataset',xml[uid].index],inplace=True)
    xml[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
xml

All four possible sources of calibration coefficients available for an instrument are now loaded: the calibration **CSV** loaded into asset management, the calibration coefficients loaded onto the instrument during check-in (**QCT**), the **.cal** file provided by the vendor, and the **XML** (.xmlcon) file provided by the vendor.

The next step is to concatenate the different sources for each instrument into a single dataframe and to sort by calibration date. This allows the coefficients from each source to be compared by calibration date.
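
The same grouping can be pictured without pandas as a nested dictionary keyed by calibration date, then coefficient, then source. The sketch below is a plain-dictionary analogue of the concatenation, not the method used in this notebook. `group_by_date` is a hypothetical helper and the sample values are made up.

```python
from collections import defaultdict

def group_by_date(sources):
    """Regroup {source: {cal date: {coefficient: value}}} into
    {cal date: {coefficient: {source: value}}}."""
    grouped = defaultdict(lambda: defaultdict(dict))
    for source, by_date in sources.items():
        for cal_date, coeffs in by_date.items():
            for name, value in coeffs.items():
                grouped[cal_date][name][source] = value
    return grouped

# Made-up values for two sources on one calibration date
sources = {
    "CSV": {"20140218": {"CC_a0": "1.2346e-03"}},
    "CAL": {"20140218": {"CC_a0": "1.2345678e-03"}},
}
grouped = group_by_date(sources)
print(dict(grouped["20140218"]["CC_a0"]))
```

With the values for one coefficient and date side by side, any source that disagrees with the others is easy to pick out.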

In [None]:
comparison = {}
for uid in uids:
    comparison.update({uid:pd.concat([CSV.get(uid), cal.get(uid), xml.get(uid), qct.get(uid)])})
    comparison[uid].reset_index(level='Cal Date',inplace=True)
    comparison[uid].sort_values(by='Cal Date',inplace=True)
comparison

In [None]:
def convert_type(x):
    if type(x) is str:
        return float(x)
    else:
        return x

In [None]:
for uid in uids:
    comparison[uid] = comparison[uid].applymap(convert_type)
comparison

In [None]:
def all_the_same(elements):
    """
    This function checks which values in an array match the first value.
    
    Args:
        elements - an array of m values
    Returns:
        error - a list of length (m-1) of booleans, which are True where
            a value is close to the first value (within the default
            tolerances of np.isclose) and False where it differs. An
            empty list is returned if there is nothing to compare.
    """
    if len(elements) < 1:
        return []
    el = iter(elements)
    first = next(el, None)
    error = [np.isclose(element,first) for element in el]
    return error

In [None]:
def locate_cal_error(array):
    """
    This function locates which source file (e.g. xmlcon vs csv vs cal)
    have calibration values that are different from the others. It does
    NOT identify which is correct, only which is different.
    
    Args:
        array - A numpy array which contains the values for a specific
                calibration coefficient for a specific date from all of
                the calibration source files
    Returns:
        dataset - a list containing which calibration sources are different
                from the other files
        True - if all of the calibration values are the same
        False - if the first calibration value is different
    """
    # Call the function to check if there are any differences between each of
    # calibration values from the different sheets
    error = all_the_same(array)
    # If they are all the same, return True
    if all(error):
        return True
    # If there is a mixture of True/False, find the False entries and return
    # the sources they belong to
    elif any(error):
        indices = [i+1 for i, j in enumerate(error) if not j]
        dataset = list(array.index[indices])
        return dataset
    # Last, if all are False, the first value differs from all of the others
    else:
        return False

In [None]:
# With all the functions set up, now go through all of the data
def search_for_errors(df):
    """
    This function is designed to search through a pandas dataframe
    which contains all of the calibration coefficients from all of
    the files, and check for differences.
    
    Args: 
        df - A dataframe which contains all of the calibration coefficients
        from the asset management csv, qct checkout, and the vendor
        files (.cal and .xmlcon)
    Returns:
        cal_errors - A nested dictionary containing the calibration timestamp, the
        relevant calibration coefficient, and which file(s) have the
        erroneous calibration file.
    """
    
    cal_errors = {}
    for date in np.unique(df['Cal Date']):
        df2 = df[df['Cal Date'] == date]
        wrong_cals = {}
        for column in df2.columns.values:
            # sort_index returns a new Series, so keep the sorted result
            array = df2[column].sort_index()
            if array.dtype == 'datetime64[ns]':
                pass
            else:
                error = locate_cal_error(array)
                if error == False:
                    wrong_cals.update({column:array.index[0]})
                elif error == True:
                    pass
                else:
                    wrong_cals.update({column:error})
        
        if len(wrong_cals) < 1:
            cal_errors.update({str(date).split('T')[0]:'No Errors'})
        else:
            cal_errors.update({str(date).split('T')[0]:wrong_cals})
    
    return cal_errors

In [None]:
cal_errors = {}
for uid in uids:
    ce = search_for_errors(comparison[uid])
    cal_errors.update({uid:ce})
    

In [None]:
cal_errors

In [None]:
pd.DataFrame.from_dict(cal_errors)

In [None]:
df2 = pd.DataFrame.from_dict(cal_errors, orient='index')

In [None]:
df2

In [None]:
df2.to_csv('CTDBPP_Errors.csv')

In [None]:
# Generate a dataframe of the missing files
df_missing = pd.DataFrame(index=uids)

In [None]:
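# Convert the dict view to a list before assigning it as a column.
# This relies on the dict values being in the same order as uids.
df_missing['.CAL FILES'] = list(cal_missing.values())
df_missing

In [None]:
df_missing['.XML FILES'] = list(xml_missing.values())
df_missing

In [None]:
df_missing['.QCT FILES'] = list(qct_missing.values())
df_missing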

In [None]:
df_missing.to_csv('CTDBPP_Missing_Files.csv')

### Check which CTDBP-C Calibration files are not correctly named
To check the calibration values, the calibration csv files must first be correctly named. We can test this by comparing the deployment dates with the dates in the CTDBP-C calibration file names: a calibration file whose name carries a deployment date may have been named with the wrong date. This requires loading the deployment csvs, parsing all of the calibration file names, flagging the file names whose date matches a deployment date, and then revisiting those files to correct the name.
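
The flagging step described above can be sketched as follows. This is a minimal illustration rather than the code used in the cells below. The helper name `flag_suspect_names`, the file names, and the deployment start times are invented, and it assumes the date follows a double underscore in the file name.

```python
import os

def flag_suspect_names(cal_files, start_times):
    """Return calibration file names whose date matches a deployment date."""
    # Reduce 'YYYY-MM-DDTHH:MM:SS' start times to 'YYYYMMDD' strings
    deploy_dates = {t.split('T')[0].replace('-', '') for t in start_times}
    suspects = []
    for name in cal_files:
        stem = os.path.splitext(name)[0]
        # Skip names that do not follow the assumed '<prefix>__<YYYYMMDD>' pattern
        if '__' not in stem:
            continue
        if stem.split('__')[1] in deploy_dates:
            suspects.append(name)
    return suspects

# Hypothetical inputs, for illustration only
cal_files = ['CGINS-CTDBPF-00001__20140312.csv',
             'CGINS-CTDBPF-00002__20150507.csv',
             'README.md']
start_times = ['2015-05-07T17:37:00', '2016-04-04T14:00:00']

suspects = flag_suspect_names(cal_files, start_times)
print(suspects)
```

Using a set for the deployment dates keeps each membership test cheap when there are many calibration files to scan.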

In [None]:
# Load the deployment csvs
# Keep only the WHOI CG deployment sheets. It is easier to exclude the
# non-CG sheets: names starting with 'RS' or 'CE', or containing 'MOAS'
deploy_csvs = []
for file in os.listdir('../../GitHub/OOI-Integration/asset-management/deployment/'):
    if file.startswith(('RS', 'CE')) or 'MOAS' in file:
        continue
    deploy_csvs.append(file)
    print(file)

In [None]:
# Get the Deployment History from the WHOI Asset Tracking System
CTDBPF_Deploy = CTDBPF['Deployment History']

In [None]:
CTDBPF_Deploy

In [None]:
# Split the string at the newline to generate a list of deployments for each CTDBP-C
CTDBPF_Deploy = CTDBPF['Deployment History'].apply(lambda x: x.split('\n'))

In [None]:
CTDBPF_Deploy

In [None]:
# List out all the individual deployments
deploy_list = []
for history in CTDBPF_Deploy:
    for item in history:
        # Only entries containing a '-' are deployments
        if '-' in item:
            deploy_list.append(item)

In [None]:
deploy_list

In [None]:
# I now have a list of all the deployments the CTDBP-Cs were used on.
# Now, parse out the name of the array from each deployment entry
array = list( set( [x.split('-')[0] for x in deploy_list] ) )
array

In [None]:
# With the list of array names, I can now parse the deployment file names to find
# the relevant deployment sheets which match where the CTDBP-Cs were deployed
deploy_csvs = []
for file in os.listdir('../../GitHub/OOI-Integration/asset-management/deployment/'):
    if file.split('_')[0] in array:
        deploy_csvs.append(file)
deploy_csvs

In [None]:
# Using the identified deployment csvs, load them all into a single
# pandas dataframe. DataFrame.append was removed in pandas 2.0, so
# collect the individual frames and concatenate them once.
deploy_dir = '../../GitHub/OOI-Integration/asset-management/deployment/'
deployments = pd.concat(
    [pd.read_csv(deploy_dir + file) for file in deploy_csvs],
    ignore_index=True)
deployments.head()

In [None]:
# Get the CTDBPF sensor uids
sensor_uids = list( set( CTDBPF['UID'] ) )
sensor_uids

In [None]:
# Find the entries in the deployment spreadsheets which match the CTDBP-Cs that I'm looking for
deployments['CTDBPF'] = deployments['sensor.uid'].isin(sensor_uids)
deployments = deployments[deployments['CTDBPF']]

In [None]:
deployments.head()

In [None]:
# Now, parse out the date string in the format of YYYYMMDD from the startDateTime
# in order to compare with the date in the calibration file names
deploy_dates = deployments['startDateTime'].apply(lambda x: x.replace('-','').split('T')[0])
deploy_dates = list(set(deploy_dates))
deploy_dates

In [None]:
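# Collect the calibration csvs whose file-name date matches a deployment date
cal_csvs = []
for file in os.listdir('../../GitHub/OOI-Integration/asset-management/calibration/CTDBPF/'):
    date = file.split('__')[1].split('.')[0]
    print(date)
    if date in deploy_dates:
        # list.append works in place and returns None, so do not reassign
        cal_csvs.append(file)
print(cal_csvs)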
        

In [None]:
cal_csvs

Great! None of the CTDBP-C calibration files have a date in the file name which matches a deployment date. That is a good sign: it means that the dates in the calibration file names *should* match the calibration dates in the calibration info.

However, this does not guarantee that the date in the file name matches the date in the calibration data. That can be checked in a future step by comparing the date in the file name against the calibration date in the vendor docs, the QCT info, and the .cal and .xmlcon files.
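
One way that future check might look is sketched below. It is only an assumption about how the comparison could be organized. The helper `sources_disagreeing_with_name`, the source labels, and the dates are hypothetical, and each source's calibration date is assumed to have already been parsed into a `datetime.date`.

```python
import datetime
import os

def sources_disagreeing_with_name(filename, source_dates):
    """Return the sources whose calibration date differs from the file-name date."""
    stem = os.path.splitext(filename)[0]
    # Assumes the file name ends in '__YYYYMMDD' before the extension
    name_date = datetime.datetime.strptime(stem.split('__')[1], '%Y%m%d').date()
    return sorted(src for src, d in source_dates.items() if d != name_date)

# Hypothetical calibration dates parsed from each source
source_dates = {
    'CAL': datetime.date(2014, 3, 12),
    'XML': datetime.date(2014, 3, 12),
    'QCT': datetime.date(2014, 3, 21),
}

mismatched = sources_disagreeing_with_name(
    'CGINS-CTDBPF-00001__20140312.csv', source_dates)
print(mismatched)
```

An empty list would mean every source agrees with the date in the file name.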

In [None]:
# Now, using the "deploy" csvs for each node in the various arrays,
# load everything into one large pandas dataframe for easy handling.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
import pandas as pd

deploy_dir = '../GitHub/OOI-Integration/asset-management/deployment/'
deployments = pd.concat(
    [pd.read_csv(deploy_dir + file) for file in deploy_csvs],
    ignore_index=True)

In [None]:
deployments

In [None]:
# Get all the unique deployment dates from the deployment csvs and put into the form of 
# YYYYMMDD. 
deploy_dates = deployments['startDateTime'].apply(lambda x: x.split('T')[0].replace('-',''))

In [None]:
deploy_dates = list(set(deploy_dates))
deploy_dates[0:10]

In [None]:
len(deploy_dates)

In [None]:
check_files = []
for root, dirs, files in os.walk('../GitHub/OOI-Integration/asset-management/calibration/'):
    for name in files:
        if 'CGINS' in name:
            cal_date = name.split('__')[1].split('.')[0]
            if cal_date in deploy_dates:
                check_files.append(name)

In [None]:
# Okay, there are potentially 1364 files whose calibration date in the
# file name needs to be checked, because the date parsed from the
# file name matches a deployment date.
len(list(set(check_files)))

In [None]:
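# Cool, now save the list of files to the local working directory.
# newline='' stops the csv module from writing extra blank lines on Windows.
with open('calibration_files_to_check.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(check_files)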