# Exploratory Analysis for Metadata Review in OOI Asset Management System

### Motivation:
The Asset Management system for OOI is primarily housed on GitHub in a variety of csv files. Until now, the calibration coefficients stored in the csv files have been manually entered. While we have used a "human-in-the-loop" review approach to catch errors, some errors have slipped through (e.g. truncation of significant figures).

### Approach:
My goal is to develop an automated approach to catch errors that already exist within the asset management system. To accomplish this, I will compare the csv files in the GitHub asset management system against the original vendor files and the QCT (quality control testing) documents, which capture the coefficients loaded onto the instrument when it was received at WHOI from the vendor.

### Data Sources:
* **GitHub**: CSV files containing the calibration coefficients. Directories are organized by sensor+class. The files are named as "CGINS-(sensor+class)-(serial number)__(YYYYMMDD).csv", where YYYYMMDD is the calibration date.
* **Vault**: Version-controlled storage location of the vendor calibrations, in the Records/Instrument Records/Instrument directories. Within the relevant directory, calibration files are stored as either .cal, .xmlcon, .pdf, or within zipped directories.
* **Alfresco**: Version-controlled, web-accessed storage. The calibrations loaded onto the instrument during the initial check-in upon receipt (the QCT process) are stored here as either .cap or .txt files.
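As a small illustration of the GitHub naming convention, the following sketch splits a calibration file name into its parts. It assumes the double-underscore form seen in the repository listings later in this notebook (e.g. `CGINS-CTDBPF-50142__20150616.csv`); the helper name `parse_cal_filename` is hypothetical and is not part of `utils`.

```python
import re
from datetime import datetime

# Pattern assumed from file names such as 'CGINS-CTDBPF-50142__20150616.csv'
CAL_NAME = re.compile(r'^(CGINS)-([A-Z0-9]+)-(\d+)__(\d{8})\.csv$')

def parse_cal_filename(fname):
    """Return (uid, sensor_class, serial, cal_date), or None if no match."""
    match = CAL_NAME.match(fname)
    if match is None:
        return None
    prefix, sensor_class, serial, date = match.groups()
    uid = '-'.join((prefix, sensor_class, serial))
    return uid, sensor_class, serial, datetime.strptime(date, '%Y%m%d').date()

parsed = parse_cal_filename('CGINS-CTDBPF-50142__20150616.csv')
not_a_cal = parse_cal_filename('README.md')
```

Parsing the date with `strptime` has the side benefit of rejecting impossible dates in a file name, which is one of the simpler metadata errors to catch.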

In [1]:
# Import likely important packages, etc.
import sys, os, csv, re
from wcmatch import fnmatch
import datetime
import time
import xml.etree.ElementTree as et
from zipfile import ZipFile
import numpy as np
import pandas as pd
import xarray as xr
import shutil

Import self-written functions from utils package:

In [2]:
from utils import *

In [3]:
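def get_calibration_files(serial_nums,dirpath):
    """
    Find the vendor calibration files for each instrument.

    Args:
        serial_nums - dictionary mapping an instrument UID to its serial number
        dirpath - directory containing the vendor calibration files
    Returns:
        calibration_files - dictionary mapping each UID to a list of file
            names containing both the serial number and 'Calibration_File'
    """
    calibration_files = {}
    for uid,sn in serial_nums.items():
        files = []
        for file in os.listdir(dirpath):
            if sn in file and 'Calibration_File' in file:
                files.append(file)

        calibration_files.update({uid:files})

    return calibration_files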
        

In [27]:
# Search a directory tree for a file and build the path to it
def generate_file_path(dirpath,filename,ext=('.cap','.txt','.log'),exclude=('_V','_Data_Workshop')):
    """
    Function which searches for the location of the given file and returns
    the full path to the file.
    
    Args:
        dirpath - parent directory path under which to search
        filename - the name of the file to search for
        ext - optional list of file extensions to match after the filename
        exclude - optional list which allows for excluding certain
            directories from the search
    Returns:
        fpath - the path to the first matching file, built from dirpath,
            or None if no match is found
    """
    for root, dirs, files in os.walk(dirpath):
        # Prune excluded directories in place so os.walk does not descend into them
        dirs[:] = [d for d in dirs if d not in exclude]
        for fname in files:
            # wcmatch's fnmatch accepts a list of patterns
            if fnmatch.fnmatch(fname, [filename+'*'+x for x in ext]):
                fpath = os.path.join(root, fname)
                return fpath
    return None
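
The search above relies on `wcmatch`, whose `fnmatch.fnmatch` accepts a list of patterns. For readers without that package, a rough standard-library equivalent is sketched below. The name `find_first` and the throwaway directory tree are hypothetical; like the original, the function returns `None` when nothing matches.

```python
import fnmatch
import os
import tempfile

def find_first(dirpath, filename, ext=('.cap', '.txt', '.log'), exclude=('_V',)):
    """Return the path of the first file named filename*<ext>, else None."""
    patterns = [filename + '*' + x for x in ext]
    for root, dirs, files in os.walk(dirpath):
        # Prune excluded directories in place so os.walk skips them
        dirs[:] = [d for d in dirs if d not in exclude]
        for fname in sorted(files):
            if any(fnmatch.fnmatch(fname, p) for p in patterns):
                return os.path.join(root, fname)
    return None

# Small self-check against a temporary directory tree
with tempfile.TemporaryDirectory() as tmp:
    for sub in ('QCT', '_V'):
        os.makedirs(os.path.join(tmp, sub))
        open(os.path.join(tmp, sub, '3305-00102-00132-A.txt'), 'w').close()
    found = find_first(tmp, '3305-00102-00132')
    found_rel = os.path.relpath(found, tmp)
    missing = find_first(tmp, '3305-00102-00061')
```

Because a missing file yields `None` rather than an exception, callers need to check the return value before passing it on to a loader.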

### WHOI Asset Tracking Spreadsheet
First, I want to load and examine exactly what type of data is stored in the WHOI Asset Tracking Spreadsheet and what information it has that may be useful.

In [4]:
#excel_spreadsheet = 'C:/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
excel_spreadsheet = '/media/andrew/OS/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
sheet_name = 'Sensors'

In [5]:
# What are all the different series of CTDs?
CTDBP = whoi_asset_tracking(excel_spreadsheet,sheet_name,instrument_class='CTDBP',whoi=True,series='F')
CTDBP

| | Instrument Class | Series | Supplier Serial Number | WHOI # | OOI # | UID | Model | CGSN PN | Firmware Version | Supplier | ... | QCT Testing | PreDeployment | Post Deployment | Refurbishment/ Repair | DO Number | Date Received | Deployment History | Current Deployment | Instrument Location on Current Deployment | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | CTDBP | F | 16-50001 | 116098 | A00635 | CGINS-CTDBPF-50001 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00016\n3305-00102-00091\n3305-00102... | | | 3305-00900-00080\n3305-00900-00280 | WH-SC11-01-CTD-1007 | 2014-01-23 00:00:00 | GI01SUMO-00001\nGI01SUMO-00003\nGI01SUMO-00005 | GI01SUMO-00005 | NSIF | (NSIF) |
| 58 | CTDBP | F | 16-50060 | 116830 | A01092 | CGINS-CTDBPF-50060 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00039\n3305-00102-00092 | | | 3305-00900-00103 | WH-SC11-01-CTD-1013 | 2014-09-29 00:00:00 | GS01SUMO-00001\nGS01SUMO-00003 | | | |
| 59 | CTDBP | F | 16-50061 | 116831 | A01093 | CGINS-CTDBPF-50061 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00040\n3305-00102-00113 | | | 3305-00900-00155 | WH-SC11-01-CTD-1013 | 2014-09-29 00:00:00 | GI01SUMO-00002 | | | |
| 60 | CTDBP | F | 16-50062 | 116832 | A01094 | CGINS-CTDBPF-50062 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00041\n3305-00102-00093\n3305-00102... | | | 3305-00900-00097\n3305-00900-00329 | WH-SC11-01-CTD-1013 | 2014-09-29 00:00:00 | GA01SUMO-00001\nGA01SUMO-00003 | GS01SUMO-00004 | NSIF | |
| 61 | CTDBP | F | 16-50065 | 116833 | A01095 | CGINS-CTDBPF-50065 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00042\n3305-00102-00072 | | | 3305-00900-00050 | WH-SC11-01-CTD-1013 | 2014-09-29 00:00:00 | GI Spare | | | Battery voltage diminished to "!!!Low Battery!... |
| 74 | CTDBP | F | 16-50116 | 117285 | A01420 | CGINS-CTDBPF-50116 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00060\n3305-00102-00133 | | | 3305-00900-00203 | WH-SC11-01-CTD-1017 | 2015-05-14 00:00:00 | GA01SUMO-00002\nIrminger 5 Spare | GS 5 spare | | |
| 81 | CTDBP | F | 16-50142 | 117448 | A01573 | CGINS-CTDBPF-50142 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00061\n3305-00102-00132\n3305-00102... | | | 3305-00900-00194\n3305-00900-00395 | WH-SC11-01-CTD-1018 | 2015-07-01 00:00:00 | GS01SUMO-00002\nGI01SUMO-00004 | | NSIF | |
| 82 | CTDBP | F | 16-50143 | 117447 | A01572 | CGINS-CTDBPF-50143 | 16PlusV2 | 1336-00001-00006 | 2.5.2 | SeaBird | ... | 3305-00102-00062\n3305-00102-00187 | | | 3305-00900-00345 | WH-SC11-01-CTD-1018 | 2015-07-01 00:00:00 | GA/GS Spare | | | |


Get the unique identifiers (UID) of the instruments:

In [6]:
def ensure_dir(file_path):
    if not os.path.exists(file_path):
        os.makedirs(file_path)

In [7]:
uids = list(set(CTDBP['UID']))
uids

['CGINS-CTDBPF-50142',
 'CGINS-CTDBPF-50001',
 'CGINS-CTDBPF-50143',
 'CGINS-CTDBPF-50060',
 'CGINS-CTDBPF-50061',
 'CGINS-CTDBPF-50116',
 'CGINS-CTDBPF-50065',
 'CGINS-CTDBPF-50062']

Get the QCT file names for the UIDs:

In [8]:
qct_dict = {}
for uid in uids:
    # Get the QCT Document numbers from the asset tracking sheet
    CTDBP['UID_match'] = CTDBP['UID'].apply(lambda x: True if uid in x else False)
    qct_series = CTDBP[CTDBP['UID_match'] == True]['QCT Testing']
    qct_series = list(qct_series.iloc[0].split('\n'))
    qct_dict.update({uid:qct_series})

In [9]:
qct_dict

{'CGINS-CTDBPF-50142': ['3305-00102-00061',
  '3305-00102-00132',
  '3305-00102-00194'],
 'CGINS-CTDBPF-50001': ['3305-00102-00016',
  '3305-00102-00091',
  '3305-00102-00153'],
 'CGINS-CTDBPF-50143': ['3305-00102-00062', '3305-00102-00187'],
 'CGINS-CTDBPF-50060': ['3305-00102-00039', '3305-00102-00092'],
 'CGINS-CTDBPF-50061': ['3305-00102-00040', '3305-00102-00113'],
 'CGINS-CTDBPF-50116': ['3305-00102-00060', '3305-00102-00133'],
 'CGINS-CTDBPF-50065': ['3305-00102-00042', '3305-00102-00072'],
 'CGINS-CTDBPF-50062': ['3305-00102-00041',
  '3305-00102-00093',
  '3305-00102-00174']}
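
The loop that builds `qct_dict` adds a helper column to `CTDBP` on every pass and matches UIDs by substring. A lighter alternative is sketched here on a toy frame whose two rows are copied from the sheet output; it assumes each UID appears on exactly one row, and the names `sheet` and `qct_lookup` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the asset tracking sheet
sheet = pd.DataFrame({
    'UID': ['CGINS-CTDBPF-50143', 'CGINS-CTDBPF-50060'],
    'QCT Testing': ['3305-00102-00062\n3305-00102-00187',
                    '3305-00102-00039\n3305-00102-00092'],
})

# Pair each UID with its newline-delimited list of QCT document numbers
qct_lookup = {
    uid: docs.split('\n')
    for uid, docs in zip(sheet['UID'], sheet['QCT Testing'])
}
```

This leaves the source dataframe unmodified, which matters if the same sheet is reused for other instrument classes.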

In [10]:
qct_directory = '/media/andrew/OS/Users/areed/Documents/Project_Files/'
cal_directory = '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDBP/'
asset_management_directory = '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDBPF'

Load the asset management information:

In [11]:
csv_dict = load_asset_management(CTDBP, asset_management_directory)
csv_dict

{'CGINS-CTDBPF-50001': ['CGINS-CTDBPF-50001__20140116.csv',
  'CGINS-CTDBPF-50001__20151230.csv',
  'CGINS-CTDBPF-50001__20170923.csv'],
 'CGINS-CTDBPF-50060': ['CGINS-CTDBPF-50060__20150327.csv',
  'CGINS-CTDBPF-50060__20140920.csv'],
 'CGINS-CTDBPF-50061': ['CGINS-CTDBPF-50061__20140919.csv',
  'CGINS-CTDBPF-50061__20161021.csv'],
 'CGINS-CTDBPF-50062': ['CGINS-CTDBPF-50062__20140919.csv',
  'CGINS-CTDBPF-50062__20180414.csv',
  'CGINS-CTDBPF-50062__20160309.csv'],
 'CGINS-CTDBPF-50116': ['CGINS-CTDBPF-50116__20150428.csv',
  'CGINS-CTDBPF-50116__20170312.csv'],
 'CGINS-CTDBPF-50142': ['CGINS-CTDBPF-50142__20170312.csv',
  'CGINS-CTDBPF-50142__20150616.csv',
  'CGINS-CTDBPF-50142__20181005.csv'],
 'CGINS-CTDBPF-50143': ['CGINS-CTDBPF-50143__20150615.csv',
  'CGINS-CTDBPF-50143__20180502.csv']}

Get the calibration files:

In [12]:
serial_nums = get_serial_nums(CTDBP, uids)
serial_nums

{'CGINS-CTDBPF-50142': '50142',
 'CGINS-CTDBPF-50001': '50001',
 'CGINS-CTDBPF-50143': '50143',
 'CGINS-CTDBPF-50060': '50060',
 'CGINS-CTDBPF-50061': '50061',
 'CGINS-CTDBPF-50116': '50116',
 'CGINS-CTDBPF-50065': '50065',
 'CGINS-CTDBPF-50062': '50062'}

In [13]:
cal_dict = get_calibration_files(serial_nums, cal_directory)
cal_dict

{'CGINS-CTDBPF-50142': ['CTDBP-F_SBE_16PlusV2_SN_16-50142_Calibration_Files_2015-07-01.zip',
  'CTDBP-F_SBE_16PlusV2_SN_16-50142_Calibration_Files_2017-03-12.zip',
  'CTDBP-F_SBE_16PlusV2_SN_16-50142_Calibration_Files_2018-10-05.zip'],
 'CGINS-CTDBPF-50001': ['CTDBP-F_SBE_16PlusV2_SN_16-50001_Calibration_Files.zip',
  'CTDBP-F_SBE_16PlusV2_SN_16-50001_Calibration_Files_2016-03-31.zip',
  'CTDBP-F_SBE_16PlusV2_SN_16-50001_Calibration_Files_2017-09-29.zip'],
 'CGINS-CTDBPF-50143': ['CTDBP-F_SBE_16PlusV2_SN_16-50143_Calibration_Files_2015-07-01.zip',
  'CTDBP-F_SBE_16PlusV2_SN_16-50143_Calibration_Files_2018-05-02.zip'],
 'CGINS-CTDBPF-50060': ['CTDBP-F_SBE_16PlusV2_SN_16-50060_Calibration_Files_2014-10-15.zip',
  'CTDBP-F_SBE_16PlusV2_SN_16-50060_Calibration_Files_2016-05-02.zip'],
 'CGINS-CTDBPF-50061': ['CTDBP-F_SBE_16PlusV2_SN_16-50061_Calibration_Files_2014-10-15.zip',
  'CTDBP-F_SBE_16plusV2_SN_16-50061_Calibration_Files_2016-10-21.zip'],
 'CGINS-CTDBPF-50116': ['CTDBP-F_SBE_16PlusV

In [14]:
# Now, pick out the first UID
uid = uids[0]
uid

'CGINS-CTDBPF-50142'

In [15]:
# Initialize a CTD Calibration object
CTDcal = CTDCalibration(uid=uid)
CTDxml = CTDCalibration(uid=uid)
CTDqct = CTDCalibration(uid=uid)

In [16]:
# Purge the temp directory (ignore_errors avoids a failure if it does not exist yet)
shutil.rmtree(os.path.join(os.getcwd(),'temp'), ignore_errors=True)

In [28]:
# Put the csv files into a similar temp directory for local working
for file in csv_dict[uid]:
    csv_savepath = '/'.join((os.getcwd(),'temp','csv'))
    ensure_dir('/'.join((os.getcwd(),'temp','csv')))
    # Now save the csv into the temp directory
    shutil.copy('/'.join((asset_management_directory,file)), csv_savepath)

In [31]:
# Generate the save file path once, before looping over the QCT documents
qct_savepath = '/'.join((os.getcwd(),'temp','qct'))
ensure_dir(qct_savepath)
for file in qct_dict[uid]:
    # Generate the full file path
    qct_path = generate_file_path(qct_directory, file)
    # Initialize a CTD object
    CTD = CTDCalibration(uid=uid)
    # Load the QCT information; skip this document if it cannot be found or loaded
    try:
        CTD.load_qct(qct_path)
    except Exception:
        print(f'No QCT file found for: {file}')
        continue
    # Now save the qct info to a csv
    try:
        CTD.write_csv(qct_savepath)
    except Exception:
        print(f'Could not write QCT csv for: {file}')

No QCT file found for: 3305-00102-00061
Write CGINS-CTDBPF-50142__20170312.csv to /home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Metadata_Review/temp/qct? [y/n]: y
Write CGINS-CTDBPF-50142__20181005.csv to /home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Metadata_Review/temp/qct? [y/n]: y


In [34]:
qct_dict[uid]

['3305-00102-00061', '3305-00102-00132', '3305-00102-00194']

In [35]:
# Generate the save file path once, before looping over the calibration archives
cal_savepath = '/'.join((os.getcwd(),'temp','cal'))
ensure_dir(cal_savepath)
for file in cal_dict[uid]:
    # Generate the full file path
    cal_path = generate_file_path(cal_directory, file, ext=[''])
    # Initialize a CTD object
    CTD = CTDCalibration(uid=uid)
    # Load the .cal information
    CTD.load_cal(cal_path)
    # Now save the cal info to a csv
    try:
        CTD.write_csv(cal_savepath)
    except ValueError:
        print(f'No cal file found in {file}')
    

No cal file found in CTDBP-F_SBE_16PlusV2_SN_16-50142_Calibration_Files_2015-07-01.zip
No cal file found in CTDBP-F_SBE_16PlusV2_SN_16-50142_Calibration_Files_2017-03-12.zip
Write CGINS-CTDBPF-50142__20181005.csv to /home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Metadata_Review/temp/cal? [y/n]: y


In [36]:
# Generate the save file path once so it is defined even if the first load fails
xml_savepath = '/'.join((os.getcwd(),'temp','xml'))
ensure_dir(xml_savepath)
for file in cal_dict[uid]:
    # Generate the full file path
    cal_path = generate_file_path(cal_directory, file, ext=[''])
    # Initialize a CTD object
    CTD = CTDCalibration(uid=uid)
    # Load the .xmlcon information; archives without an xmlcon are reported below
    try:
        CTD.load_xml(cal_path)
    except Exception:
        pass
    # Now save the xml info to a csv
    try:
        CTD.write_csv(xml_savepath)
    except ValueError:
        print(f'No xml file found for {file}')

Write CGINS-CTDBPF-50142__20150616.csv to /home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Metadata_Review/temp/xml? [y/n]: y
Write CGINS-CTDBPF-50142__20170312.csv to /home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Metadata_Review/temp/xml? [y/n]: y
Write CGINS-CTDBPF-50142__20181005.csv to /home/andrew/Documents/OOI-CGSN/QAQC_Sandbox/Metadata_Review/temp/xml? [y/n]: y


### Checking instrument calibration values
After loading the **WHOI Asset Tracking Sheet**, we now have the following critical data for checking calibration information:
* Supplier Serial Number - this links back to the original **.cal**, **.xmlcon**, and vendor docs
* OOI UID - this is the link between the instrument and the OOINet
* QCT Document Number - this number links the instrument to the QCT screen capture of the calibration values loaded onto the instruments

### Process to load the **CSV** calibration file
In order to check the calibrations in asset management, I first have to load the asset management calibration csv files into a dataframe.
* First, get all the unique CTDBP instruments in Asset Management
* Next, parse the csv files in asset management to get the unique instrument serial numbers
* With the serial numbers, find the associated instrument calibration csvs
* For each calibration csv, load the data into a pandas dataframe

In [37]:
def get_file_date(x):
    x = str(x)
    ind1 = x.index('__')
    ind2 = x.index('.')
    return x[ind1+2:ind2]

In [38]:
# Build a dataframe of the asset management csv files, indexed by calibration date
csv_files = pd.DataFrame(sorted(csv_dict[uid]),columns=['csv'])
csv_files['cal date'] = csv_files['csv'].apply(lambda x: get_file_date(x))
csv_files.set_index('cal date',inplace=True)

In [39]:
# Build a dataframe of the csv files generated from the .cal files, indexed by calibration date
cal_files = pd.DataFrame(sorted(os.listdir('temp/cal')),columns=['cal'])
cal_files['cal date'] = cal_files['cal'].apply(lambda x: get_file_date(x))
cal_files.set_index('cal date',inplace=True)

In [40]:
# Build a dataframe of the csv files generated from the .xmlcon files, indexed by calibration date
xml_files = pd.DataFrame(sorted(os.listdir('temp/xml')),columns=['xml'])
xml_files['cal date'] = xml_files['xml'].apply(lambda x: get_file_date(x))
xml_files.set_index('cal date',inplace=True)

In [41]:
# Build a dataframe of the csv files generated from the QCT files, indexed by calibration date
qct_files = pd.DataFrame(sorted(os.listdir('temp/qct')),columns=['qct'])
qct_files['cal date'] = qct_files['qct'].apply(lambda x: get_file_date(x))
qct_files.set_index('cal date',inplace=True)

In [42]:
df_files = csv_files.join(cal_files,how='outer').join(xml_files,how='outer').join(qct_files,how='outer').fillna(value='-999')

In [43]:
df_files

| cal date | csv | cal | xml | qct |
|---|---|---|---|---|
| 20150616 | CGINS-CTDBPF-50142__20150616.csv | -999 | CGINS-CTDBPF-50142__20150616.csv | -999 |
| 20170312 | CGINS-CTDBPF-50142__20170312.csv | -999 | CGINS-CTDBPF-50142__20170312.csv | CGINS-CTDBPF-50142__20170312.csv |
| 20181005 | CGINS-CTDBPF-50142__20181005.csv | CGINS-CTDBPF-50142__20181005.csv | CGINS-CTDBPF-50142__20181005.csv | CGINS-CTDBPF-50142__20181005.csv |
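
The `-999` entries come from the outer joins: a calibration date present in one source but absent from another leaves a gap, which `fillna` then replaces with the sentinel. A minimal sketch of that mechanism, using shortened placeholder file names rather than the real ones:

```python
import pandas as pd

csv_side = pd.DataFrame(
    {'csv': ['a__20150616.csv', 'a__20170312.csv']},
    index=pd.Index(['20150616', '20170312'], name='cal date'))
qct_side = pd.DataFrame(
    {'qct': ['a__20170312.csv']},
    index=pd.Index(['20170312'], name='cal date'))

# The outer join keeps every calibration date; gaps become the '-999' sentinel
joined = csv_side.join(qct_side, how='outer').fillna('-999')
```

A date that appears in only one column is itself worth a look, since it may indicate a mistyped calibration date in a file name rather than a missing file.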


In [44]:
df_files['csv']

cal date
20150616    CGINS-CTDBPF-50142__20150616.csv
20170312    CGINS-CTDBPF-50142__20170312.csv
20181005    CGINS-CTDBPF-50142__20181005.csv
Name: csv, dtype: object

In [46]:
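def check_cal_coeffs(coeffs_dict):
    """
    Compare the calibration coefficients from each source against each other.

    Args:
        coeffs_dict - dictionary mapping a source name (e.g. 'csv', 'qct') to a
            dataframe of coefficients with a 'value' column
    Returns:
        mask - boolean array which is True where all compared sources agree
    """
    # Part 1: coeff by coeff comparison between each source of coefficients.
    # Each source is compared with the next one, wrapping around to the first.
    keys = list(coeffs_dict.keys())
    comparison = {}
    for i in range(len(keys)):
        names = (keys[i], keys[i - (len(keys)-1)])
        first = coeffs_dict.get(names[0])['value']
        second = coeffs_dict.get(names[1])['value']
        # Sources with a different number of coefficients are skipped
        if len(first) == len(second):
            comparison.update({names:np.isclose(first, second)})

    # Part 2: now do a logical_and comparison between the results from part 1
    keys = list(comparison.keys())
    i = 0
    mask = comparison.get(keys[i])
    while i < len(keys)-1:
        i = i + 1
        mask = np.logical_and(mask, comparison.get(keys[i]))
        # Progress output: index of the comparison just combined
        print(i)

    return mask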
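
One caveat with `np.isclose` as used in `check_cal_coeffs`: its defaults (`rtol=1e-05`, `atol=1e-08`) treat any two values smaller than about 1e-8 in magnitude as equal, and they tolerate differences past roughly the fifth significant figure. Since truncation of significant figures is exactly the error being hunted, a stricter comparison may be worth considering. The sketch below is one possible variant, not the notebook's method: `compare_sources` and its tolerance values are assumptions chosen for illustration, and the "csv" values are hypothetical truncations of the real coefficients. It also aligns sources by coefficient name rather than by position.

```python
from functools import reduce
from itertools import combinations

import numpy as np
import pandas as pd

def compare_sources(coeffs, rtol=1e-9, atol=0.0):
    """Return a boolean Series: True where every pair of sources agrees.

    coeffs maps a source name to a Series of values indexed by coefficient name.
    """
    # Align all sources on coefficient name; names missing anywhere are dropped
    table = pd.DataFrame(coeffs).dropna()
    masks = [
        np.isclose(table[a].to_numpy(), table[b].to_numpy(), rtol=rtol, atol=atol)
        for a, b in combinations(table.columns, 2)
    ]
    # Start from all-True so that a single source still yields a valid mask
    start = np.ones(len(table), dtype=bool)
    return pd.Series(reduce(np.logical_and, masks, start), index=table.index)

# Hypothetical truncated csv values against the full-precision vendor values
csv_vals = pd.Series({'CC_a0': 1.25074e-03, 'CC_pa2': -3.2e-12})
xml_vals = pd.Series({'CC_a0': 1.250743e-03, 'CC_pa2': -3.215211e-12})

default_check = np.isclose(csv_vals.to_numpy(), xml_vals.to_numpy())
strict_check = compare_sources({'csv': csv_vals, 'xml': xml_vals})
```

With the library defaults both truncated values pass as equal, while the stricter tolerances flag both. The right tolerance depends on how many significant figures the vendor actually reports, so the values above are a starting point rather than a recommendation.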

In [47]:
result = {}
for cal_date in df_files.index:
    # Part 1, load all of the csv files
    coeffs_dict = {}
    for source,fname in df_files.loc[cal_date].items():
        if fname != '-999':
            load_directory = '/'.join((os.getcwd(),'temp',source,fname))
            df_coeffs = pd.read_csv(load_directory)
            df_coeffs.set_index(keys='name',inplace=True)
            df_coeffs.sort_index(inplace=True)
            coeffs_dict.update({source:df_coeffs})
        else:
            pass
    
    # Part 2, now check the calibration coefficients
    mask = check_cal_coeffs(coeffs_dict)
    
    # Part 3: get the calibration coefficients which are wrong
    # and show them
    fname = df_files.loc[cal_date]['csv']
    if fname == '-999':
        incorrect = 'No csv file.'
    else:
        incorrect = coeffs_dict['csv'][mask == False]
    result.update({fname:incorrect})

1
1
2
1
2
3


In [48]:
result

{'CGINS-CTDBPF-50142__20150616.csv': Empty DataFrame
 Columns: [serial, value, notes]
 Index: [], 'CGINS-CTDBPF-50142__20170312.csv': Empty DataFrame
 Columns: [serial, value, notes]
 Index: [], 'CGINS-CTDBPF-50142__20181005.csv': Empty DataFrame
 Columns: [serial, value, notes]
 Index: []}

In [49]:
mask = check_cal_coeffs(coeffs_dict)
result = coeffs_dict['csv'][mask == False]
result.merge(coeffs_dict['qct'][mask == False])

1
2
3


Unnamed: 0,serial,value,notes


In [50]:
coeffs_dict

{'csv':               serial         value  notes
 name                                     
 CC_a0       16-50142  1.250743e-03    NaN
 CC_a1       16-50142  2.729980e-04    NaN
 CC_a2       16-50142 -8.672512e-07    NaN
 CC_a3       16-50142  1.722119e-07    NaN
 CC_cpcor    16-50142 -9.570000e-08    NaN
 CC_ctcor    16-50142  3.250000e-06    NaN
 CC_g        16-50142 -1.001339e+00    NaN
 CC_h        16-50142  1.549455e-01    NaN
 CC_i        16-50142 -1.633388e-04    NaN
 CC_j        16-50142  3.716101e-05    NaN
 CC_pa0      16-50142 -6.679744e-02    NaN
 CC_pa1      16-50142  4.836685e-04    NaN
 CC_pa2      16-50142 -3.215211e-12    NaN
 CC_ptca0    16-50142  5.245713e+05    NaN
 CC_ptca1    16-50142 -6.453360e+00    NaN
 CC_ptca2    16-50142  1.189820e-02    NaN
 CC_ptcb0    16-50142  2.524618e+01    NaN
 CC_ptcb1    16-50142 -5.764411e-04    NaN
 CC_ptcb2    16-50142  0.000000e+00    NaN
 CC_ptempa0  16-50142 -5.590677e+01    NaN
 CC_ptempa1  16-50142  5.460593e+01    NaN
 CC_

In [None]:
mask = check_cal_coeffs(coeffs_dict)
incorrect = coeffs_dict['csv'][mask == False].reset_index()
incorrect

In [None]:
file

In [None]:
results

In [None]:
fname

In [None]:
CSV = pd.read_csv('/'.join((asset_management_directory,df_files.loc[indices[0]].loc['csv'])))
CSV.set_index(keys='name',inplace=True)
CSV.sort_index(inplace=True)
CSV

In [None]:
QCT = pd.read_csv('/'.join((os.getcwd(),'temp','qct',df_files.loc[indices[0]].loc['qct'])))
QCT.set_index(keys='name',inplace=True)
QCT.sort_index(inplace=True)
QCT

In [None]:
XML = pd.read_csv('/'.join((os.getcwd(),'temp','xml',df_files.loc[indices[0]].loc['xml'])))
XML.set_index(keys='name',inplace=True)
XML.sort_index(inplace=True)
XML

In [None]:
csv_xml = np.isclose(CSV['value'],XML['value'])
csv_qct = np.isclose(CSV['value'],QCT['value'])

In [None]:
csv_xml

In [None]:
mask = (csv_xml | csv_qct)

In [None]:
mask

In [None]:
CSV[mask == False]

In [None]:
file

In [None]:
# Now I need to load all of the csv files based on their UID
def load_csv_info(csv_dict,filepath):
    """
    Loads the calibration coefficient information contained in asset management
    
    Args:
        csv_dict - a dictionary which associates an instrument UID to the
            calibration csv files in asset management
        filepath - the path to the directory containing the calibration csv files
    Returns:
        csv_cals - a dictionary which associates an instrument UID to a pandas
            dataframe which contains the calibration coefficients. The dataframes
            are indexed by the date of calibration
    """
    
    # Load the calibration data into pandas dataframes, which are then placed into
    # a dictionary by the UID
    csv_cals = {}
    for uid in csv_dict:
        cals = []
        for file in csv_dict[uid]:
            data = pd.read_csv(os.path.join(filepath,file))
            date = file.split('__')[1].split('.')[0]
            data['CAL DATE'] = pd.to_datetime(date)
            cals.append(data)
        # DataFrame.append was removed in pandas 2.0, so collect and concatenate
        csv_cals.update({uid:pd.concat(cals)})
        
    # Pivot the dataframe to be sorted based on calibration date
    for uid in csv_cals:
        csv_cals[uid] = csv_cals[uid].pivot(index='CAL DATE', columns='name', values='value')
        
    return csv_cals



Now we have successfully loaded the csv calibrations into a pandas dataframe that allows for easy comparison between calibrations based on the calibration date for each calibration coefficient.
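
To make the resulting layout concrete, here is a minimal sketch of the concatenate-then-pivot step on toy data. The coefficient names are real, but the second calibration's `CC_a0` value is made up purely to give the two rows different contents.

```python
import pandas as pd

frames = []
# (calibration date, CC_a0 value); the second value is a made-up placeholder
for date, a0 in (('20150616', 1.250743e-03), ('20170312', 1.251000e-03)):
    data = pd.DataFrame({'name': ['CC_a0', 'CC_a1'],
                         'value': [a0, 2.729980e-04]})
    data['CAL DATE'] = pd.to_datetime(date)
    frames.append(data)

# One row per calibration date, one column per coefficient
wide = pd.concat(frames, ignore_index=True).pivot(
    index='CAL DATE', columns='name', values='value')
```

In this wide form, a coefficient that changes between calibrations shows up as differing values down a single column.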

### Load the QCT values
The next step is to take the capture files from the QCT and load them into a comparable pandas dataframe. This involves several steps:
* Get the QCT document numbers from the WHOI Asset Tracking Sheet for each individual instrument
* Find where the QCT documents are stored
* Load the QCT documents
* Parse the QCT documents
* Translate the parsed QCT values into a pandas dataframe

In [None]:
uids = sorted( list( set( CTDBP['UID'])))

In [None]:
qct_dict = {}
for uid in uids:
    # Get the QCT Document numbers from the asset tracking sheet
    CTDBP['UID_match'] = CTDBP['UID'].apply(lambda x: True if uid in x else False)
    qct_series = CTDBP[CTDBP['UID_match'] == True]['QCT Testing']
    qct_series = list(qct_series.iloc[0].split('\n'))
    qct_dict.update({uid:qct_series})

In [None]:
qct_dict

In [None]:
qct_filepath

In [None]:
CTD = CTDCalibration(uid=uids[0])

In [None]:
CTD.coefficients

In [None]:
CTD.date

In [None]:
CTD.load_qct(qct_filepath)

In [None]:
CTD.coefficients

In [None]:
qct_filepath = generate_file_path(dirpath,qcts[0])
qct_filepath

In [None]:
CTD = CTDCalibration(uid=uids[0])

In [None]:
CTD.load_qct('/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/3305-00102-00019-A.txt')

In [None]:
CTD.serial

In [None]:
for root, dirs, files in os.walk(dirpath):
    dirs[:] = [d for d in dirs if d not in exclude]
    for fname in files:
        if fnmatch.fnmatch(fname, [])

In [None]:
# Now to develop an automated approach to load all the QCT documents, parse them
# into a dictionary, and convert the dictionary into a pandas dataframe
def load_qct_data(qct_dict,coefficient_name_map,dirpath='../../../Documents/Project_Files/'):
    qct = {}
    qct_missing = {}
    for uid in qct_dict:
        print(uid)
        capture_data = {}
        missing = []
        for capfile in qct_dict[uid]:
            # First, find and return the path to the capture file which
            # matches the capture file identifier
            cappath = generate_file_path(dirpath, capfile)
            
            # Pull the coefficients out of the capture file. This is a naive implementation
            # which splits each line on either ": " or " =" and does not otherwise interpret the file
            if cappath is None:
                missing.append(capfile)
            else:
                coeffs = {}
                with open(cappath) as filename:
                    data = filename.read()
                    for line in data.splitlines():
                        items = re.split(': | =',line)
                        key = items[0].strip()
                        value = items[-1].strip()
                        coeffs.update({key:value})
                    
                # The best way to do this is to use the CTD name mapping to only get the important values
                capture = {}
                # With the capture coefficients, now map it to the CTD coefficients
                for key in coeffs.keys():
                    if key in coefficient_name_map.keys():
                        capture[coefficient_name_map[key]] = coeffs[key]
            
                # Get the calibration date
                caldate = coeffs['conductivity']
            
                # Update the capture file to include the calibration date
                capture['CAL DATE'] = pd.to_datetime(caldate)
            
                # Now, update the parent dictionary
                capture_data.update({capfile:capture})
            
        df = pd.DataFrame.from_dict({i: capture_data[i] for i in capture_data.keys()}, orient='index')
        qct.update({uid:df})
        qct_missing.update({uid:missing})
        
    return qct, qct_missing   
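
The line-splitting step inside `load_qct_data` is easy to test in isolation. The sketch below pulls it out into a hypothetical helper, `parse_capture_line`, and runs it on two invented capture lines; the real capture file layout may differ.

```python
import re

def parse_capture_line(line):
    """Split 'key: value' or 'key = value' into a (key, value) pair."""
    items = re.split(r': | =', line)
    return items[0].strip(), items[-1].strip()

# Hypothetical capture lines, for illustration only
sample_lines = (
    'temperature: 12-jun-15',
    '    TA0 = 1.250743e-03',
)
pairs = dict(parse_capture_line(line) for line in sample_lines)

# A line with no separator comes back with the key equal to the value
no_separator = parse_capture_line('SBE 16plus')
```

The last case shows why the results are filtered through `coefficient_name_map` afterwards: lines without a separator still produce an entry, just not a meaningful one.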

In [None]:
qct, qct_missing = load_qct_data(qct_dict,coefficient_name_map,dirpath='../../../../Documents/Project_Files/')

In [None]:
qct

In [None]:
qct_missing

In [None]:
# Reset the index to the calibration date
for uid in qct:
    qct[uid].set_index('CAL DATE', drop=True, inplace=True)

In [None]:
qct

### Vendor Calibration values: .cal and .xmlcon
This next step is to load the CTD .cal and .xmlcon files in order to compare the vendor calibration values with the csv and QCT values.

In [None]:
serial_nums = get_serial_nums(CTDBP, uids)

In [None]:
serial_nums

In [None]:
vendor_files = {}
for uid,sn in serial_nums.items():
    files = []
    for file in os.listdir('../../../../Documents/Project_Files/Records/Instrument_Records/CTDBP/'):
        if sn in file and 'Calibration_File' in file:
            files.append(file)
    vendor_files.update({uid:files})

In [None]:
cal_dict = get_calibration_files(serial_nums,'/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDBP')

In [None]:
cal = {}
cal_missing = {}
filepath = '../../../../Documents/Project_Files/Records/Instrument_Records/CTDBP/'
for uid,files in vendor_files.items():
    cal_coeffs, missing = load_cal_coeffs(files,filepath,coefficient_name_map,o2_coefficients_map)
    cal_df = pd.DataFrame.from_dict({i: cal_coeffs[i] for i in cal_coeffs.keys()}, orient='index')
    cal_df.index = pd.to_datetime(cal_df.index)
    cal.update({uid:cal_df})
    cal_missing.update({uid:missing})

In [None]:
cal

In [None]:
cal_missing

#### Repeat the above process with the .xmlcon file

In [None]:
xml = {}
xml_missing = {}
filepath = '../../../../Documents/Project_Files/Records/Instrument_Records/CTDBP/'
for uid,files in vendor_files.items():
    xml_coeffs, missing = load_xml_coeffs(files,filepath,coefficient_name_map,o2_coefficients_map)
    xml_df = pd.DataFrame.from_dict({i: xml_coeffs[i] for i in xml_coeffs.keys()}, orient='index')
    xml_df.drop(columns=[None],axis=1,inplace=True)
    xml_df.index = pd.to_datetime(xml_df.index)
    xml.update({uid:xml_df})
    xml_missing.update({uid:missing})

In [None]:
xml

In [None]:
xml_missing

### Comparisons
Now that I have .cal, .xmlcon, the qct capture files, and the csv files from asset management, I can begin comparison of the calibration coefficients between the different files. The goal is that the dates, values, and coefficients all match.

In [None]:
CSV

In [None]:
qct

In [None]:
cal

In [None]:
xml

In [None]:
# First, I need to reindex all of the different dataframes such that they all have two indices:
# A dataset index and a datetime index, and set them to uniform name (for concatenation)
for uid in uids:
    try:
        CSV[uid]['Dataset'] = 'CSV'
        CSV[uid].set_index(['Dataset',CSV[uid].index],inplace=True)
        CSV[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
    except:
        pass
CSV

In [None]:
qct

In [None]:
for uid in uids:
    qct[uid]['Dataset'] = 'QCT'
    qct[uid].set_index(['Dataset',qct[uid].index],inplace=True)
    qct[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
qct

In [None]:
for uid in uids:
    cal[uid]['Dataset'] = 'CAL'
    cal[uid].set_index(['Dataset',cal[uid].index],inplace=True)
    cal[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
cal

In [None]:
for uid in uids:
    xml[uid]['Dataset'] = 'XML'
    xml[uid].set_index(['Dataset',xml[uid].index],inplace=True)
    xml[uid].index.set_names(['Dataset','Cal Date'],inplace=True)
xml

All four possible sources of calibration coefficients available for an instrument are now loaded and share a common index: the calibration **CSV** loaded into asset management, the calibration coefficients loaded onto the instrument during check-in (**QCT**), the **.cal** file provided by the vendor, and the **XML** file provided by the vendor.

The next step is to concatenate the different sources for each instrument into a single dataframe and to sort by calibration date. This will allow for comparison based on the date of the calibration.
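
As an aside, `pd.concat` can attach the source label while concatenating, which avoids reindexing each dataframe separately. This is a sketch on toy single-row frames, not the notebook's code; the frame names are illustrative.

```python
import pandas as pd

date = pd.to_datetime(['2015-06-16'])
csv_df = pd.DataFrame({'CC_a0': [1.250743e-03]}, index=date)
qct_df = pd.DataFrame({'CC_a0': [1.250743e-03]}, index=date)

# keys= adds the outer 'Dataset' level; names= labels both index levels
stacked = pd.concat([csv_df, qct_df], keys=['CSV', 'QCT'],
                    names=['Dataset', 'Cal Date'])

# Move the date out of the index and sort by it, leaving 'Dataset' as the index
by_date = stacked.reset_index(level='Cal Date').sort_values(by='Cal Date')
```

The result has the same shape as the per-source reindexing approach: one row per source and calibration date, labelled by dataset.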

In [None]:
comparison = {}
for uid in uids:
    comparison.update({uid:pd.concat([CSV.get(uid), cal.get(uid), xml.get(uid), qct.get(uid)])})
    comparison[uid].reset_index(level='Cal Date',inplace=True)
    comparison[uid].sort_values(by='Cal Date',inplace=True)
comparison
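
One detail worth knowing about the cell above: `dict.get` returns `None` for an instrument that is missing from a source, and `pd.concat` silently drops `None` entries (it raises a `ValueError` only if every entry is `None`). The toy example below illustrates this. The coefficient name and values are made up.

```python
import pandas as pd

date = pd.to_datetime('2015-01-01')
idx_names = ['Dataset', 'Cal Date']
csv_df = pd.DataFrame(
    {'CC_a0': [1.2345]},
    index=pd.MultiIndex.from_tuples([('CSV', date)], names=idx_names),
)
cal_df = pd.DataFrame(
    {'CC_a0': [1.23456789]},
    index=pd.MultiIndex.from_tuples([('CAL', date)], names=idx_names),
)
xml_df = None  # stands in for a source with no file for this instrument

# The None entry is dropped; only the two real sources are stacked
combined = pd.concat([csv_df, cal_df, xml_df])
combined = combined.reset_index(level='Cal Date').sort_values(by='Cal Date')
print(combined)
```

A missing source therefore does not show up as a difference in the comparison; it has to be tracked separately, as the missing-file tables later in this notebook do.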

In [None]:
def convert_type(x):
    # Convert string entries to floats; leave all other types unchanged
    if isinstance(x, str):
        return float(x)
    else:
        return x

In [None]:
for uid in uids:
    comparison[uid] = comparison[uid].applymap(convert_type)
comparison
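
`float(x)` raises a `ValueError` on any string that is not a number, so a single stray entry stops the conversion. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors='coerce'` on the non-numeric, non-datetime columns, turning unparseable entries into `NaN`. The helper name `coerce_numeric` and the toy data are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def coerce_numeric(df):
    """Convert string-like columns to numbers; unparseable entries become NaN."""
    out = df.copy()
    for col in out.columns:
        # Leave numeric and datetime columns (e.g. the calibration date) untouched
        if is_numeric_dtype(out[col]) or is_datetime64_any_dtype(out[col]):
            continue
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

toy_comparison = pd.DataFrame({
    'Cal Date': pd.to_datetime(['2015-01-01', '2015-01-01']),
    'CC_a0': ['1.5', 'n/a'],   # made-up coefficient stored as strings
    'CC_a1': [2.0, 3.0],
})
converted = coerce_numeric(toy_comparison)
print(converted.dtypes)
```

The trade-off is that a coerced `NaN` hides the original text, so it may be worth logging which entries failed to parse before relying on this.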

In [None]:
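def all_the_same(elements):
    """
    This function checks which values in an array match the first value.
    
    Args:
        elements - an array of m values
    Returns:
        error - a list of length (m-1) of booleans; each entry is True if
            the corresponding element (second through last) is close to
            the first element, and False otherwise
    """
    # An empty input has nothing to compare, so return an empty list
    if len(elements) < 1:
        return []
    el = iter(elements)
    first = next(el, None)
    # Compare every remaining element to the first, within floating-point tolerance
    error = [np.isclose(element,first) for element in el]
    return error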
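
`np.isclose` uses default tolerances of `rtol=1e-05` and `atol=1e-08`, which may be looser than wanted here: a coefficient rounded to seven significant figures still compares as close to the full value, and any two values much smaller than `1e-08` compare as close regardless of their ratio. The sketch below is one possible stricter variant. The function name and the tolerance `rtol=1e-9` are assumptions to tune, not values taken from this analysis.

```python
import numpy as np

def same_as_first(values, rtol=1e-9, atol=0.0):
    """Return booleans: is each of values[1:] close to values[0]?"""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        return np.array([], dtype=bool)
    # atol=0 makes the test purely relative, so small coefficients are not masked
    return np.isclose(values[1:], values[0], rtol=rtol, atol=atol)

# Rounded significant figures: compare the default check with the stricter one
default_truncated = np.isclose(1.234568, 1.23456789)
strict_truncated = same_as_first([1.23456789, 1.234568])

# Small coefficients that differ by a factor of two
default_small = np.isclose(2.0e-9, 1.0e-9)
strict_small = same_as_first([1.0e-9, 2.0e-9])

print(default_truncated, strict_truncated, default_small, strict_small)
```

With the default tolerances both pairs are reported as matching; with the purely relative tolerance both are reported as different.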

In [None]:
def locate_cal_error(array):
    """
    This function locates which source files (e.g. xmlcon vs csv vs cal)
    have calibration values that are different from the others. It does
    NOT identify which is correct, only which is different.
    
    Args:
        array - A pandas Series which contains the values for a specific
                calibration coefficient for a specific date from all of
                the calibration source files, indexed by source
    Returns:
        dataset - a list containing which calibration sources are different
                from the first source
        True - if all of the calibration values are the same
        False - if the first calibration value is different from all others
    """
    # Call the function to check if there are any differences between each of
    # the calibration values from the different sources
    error = all_the_same(array)
    # If they are all the same, return True
    if all(error):
        return True
    # If there is a mixture of True/False, find the False entries and return them
    elif any(error):
        # Offset by one because error starts at the second element of array
        indices = [i+1 for i, j in enumerate(error) if not j]
        dataset = list(array.index[indices])
        return dataset
    # Last, if all are False, the first value differs from all of the others
    else:
        return False
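
Because the check above compares everything to the first value, its result depends on which source happens to be listed first. As a complementary sketch (not the method used in this notebook), the sources can be grouped by mutually agreeing values, so that a lone outlier stands out no matter where it appears. The function name, tolerance, and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def group_by_agreement(series, rtol=1e-9):
    """Group source labels whose values agree within a relative tolerance."""
    groups = []  # list of (representative value, [source labels])
    for label, value in series.items():
        for rep, labels in groups:
            if np.isclose(value, rep, rtol=rtol, atol=0.0):
                labels.append(label)
                break
        else:
            # No existing group matched, so start a new one
            groups.append((value, [label]))
    return groups

toy_values = pd.Series(
    {'CSV': 1.2345, 'CAL': 1.23456789, 'XML': 1.23456789, 'QCT': 1.23456789}
)
toy_groups = group_by_agreement(toy_values)
print(toy_groups)
```

This still does not say which value is correct, but a single source sitting alone in its own group is a natural first place to look.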

In [None]:
# With all the functions set up, now go through all of the data
def search_for_errors(df):
    """
    This function is designed to search through a pandas dataframe
    which contains all of the calibration coefficients from all of
    the files, and check for differences.
    
    Args: 
        df - A dataframe which contains all of the calibration coefficients
        from the asset management csv, qct checkout, and the vendor
        files (.cal and .xmlcon)
    Returns:
        cal_errors - A nested dictionary containing the calibration timestamp, the
        relevant calibration coefficient, and which file(s) have a
        differing calibration value.
    """
    
    cal_errors = {}
    for date in np.unique(df['Cal Date']):
        df2 = df[df['Cal Date'] == date]
        wrong_cals = {}
        for column in df2.columns.values:
            # Sort by source so the reference (first) value is deterministic;
            # sort_index returns a new object, so the result must be assigned
            array = df2[column].sort_index()
            if array.dtype == 'datetime64[ns]':
                pass
            else:
                error = locate_cal_error(array)
                if error is False:
                    wrong_cals.update({column:array.index[0]})
                elif error is True:
                    pass
                else:
                    wrong_cals.update({column:error})
        
        if len(wrong_cals) < 1:
            cal_errors.update({str(date).split('T')[0]:'No Errors'})
        else:
            cal_errors.update({str(date).split('T')[0]:wrong_cals})
    
    return cal_errors

In [None]:
cal_errors = {}
for uid in uids:
    ce = search_for_errors(comparison[uid])
    cal_errors.update({uid:ce})
    

In [None]:
cal_errors

In [None]:
pd.DataFrame.from_dict(cal_errors)

In [None]:
df2 = pd.DataFrame.from_dict(cal_errors, orient='index')

In [None]:
df2

In [None]:
df2.to_csv('CTDBPP_Errors.csv')
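
Each cell of `df2` holds either the string `'No Errors'` or a dictionary, so the saved csv contains stringified dictionaries. A possible alternative is to flatten the nested results into one row per differing coefficient before saving. The sketch below assumes the `{uid: {date: {coefficient: sources}}}` structure produced above. The function name, uid, and coefficient names are made-up examples.

```python
import pandas as pd

def flatten_errors(nested):
    """Turn {uid: {date: 'No Errors' or {coefficient: sources}}} into rows."""
    rows = []
    for uid, by_date in nested.items():
        for date, result in by_date.items():
            if not isinstance(result, dict):
                continue  # 'No Errors' for this calibration date
            for coefficient, sources in result.items():
                rows.append({'uid': uid, 'Cal Date': date,
                             'Coefficient': coefficient, 'Differs': sources})
    return pd.DataFrame(rows, columns=['uid', 'Cal Date', 'Coefficient', 'Differs'])

toy_errors = {
    'UID-0001': {
        '2015-01-01': 'No Errors',
        '2016-03-04': {'CC_a0': ['CSV'], 'CC_a1': 'CAL'},
    },
}
toy_flat = flatten_errors(toy_errors)
print(toy_flat)
```

A long table like this can be filtered and sorted in a spreadsheet without parsing dictionary strings.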

In [None]:
# Generate a dataframe of the missing files
df_missing = pd.DataFrame(index=uids)

In [None]:
df_missing['.CAL FILES'] = cal_missing.values()
df_missing

In [None]:
df_missing['.XML FILES'] = xml_missing.values()
df_missing

In [None]:
df_missing['.QCT FILES'] = qct_missing.values()
df_missing

In [None]:
df_missing.to_csv('CTDBPP_Missing_Files.csv')
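
Assigning `cal_missing.values()` column by column relies on each dictionary listing the instruments in the same order as `uids`. A hedged alternative is to let pandas align on the keys, as sketched below. This assumes the `*_missing` objects are dictionaries keyed by uid; the uids and file statuses shown are invented.

```python
import pandas as pd

toy_uids = ['UID-0001', 'UID-0002']
toy_cal_missing = {'UID-0002': 'missing', 'UID-0001': 'found'}  # different order
toy_xml_missing = {'UID-0001': 'missing'}                       # one uid absent

# Each Series is reindexed to toy_uids, so values land on the matching row
toy_df_missing = pd.DataFrame(
    {
        '.CAL FILES': pd.Series(toy_cal_missing),
        '.XML FILES': pd.Series(toy_xml_missing),
    },
    index=toy_uids,
)
print(toy_df_missing)
```

Instruments absent from a dictionary appear as `NaN` rather than shifting the remaining values onto the wrong rows.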

### Check which CTDBP-C Calibration files are not correctly named
To check the calibration values, the calibration csv files must be correctly named. We can check this by comparing the deployment dates with the CTDBPC calibration dates in the file names. This requires loading the deployment csvs, parsing the date from each calibration file name, flagging the file names whose date matches a deployment date, and then revisiting those files to correct the name.
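
The file-name check described above can be sketched as follows. The example file names are hypothetical, and the sketch assumes the date follows a double underscore at the end of the name, as the parsing code in this notebook does.

```python
import re

# Eight digits after a double underscore, immediately before the .csv extension
CAL_NAME = re.compile(r'__(\d{8})\.csv$')

def cal_date_from_name(filename):
    """Return the YYYYMMDD string from a calibration file name, or None."""
    match = CAL_NAME.search(filename)
    return match.group(1) if match else None

def flag_suspect_names(filenames, deploy_dates):
    """Return the file names whose date equals a deployment date."""
    deploy_dates = set(deploy_dates)
    return [f for f in filenames if cal_date_from_name(f) in deploy_dates]

toy_files = [
    'CGINS-CTDBPF-00001__20150101.csv',
    'CGINS-CTDBPF-00001__20160304.csv',
    'README.md',
]
toy_flagged = flag_suspect_names(toy_files, ['20160304'])
print(toy_flagged)
```

Returning `None` for names without a date avoids the `IndexError` that `split('__')[1]` raises on such files.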

In [None]:
# Load the deployment csvs
# Keep only the WHOI CG deployment sheets. It is easier to exclude the
# non-CG sheets (RS, CE, and MOAS) than to match every CG prefix (e.g. 'CP')
deploy_csvs = []
for file in os.listdir('../../GitHub/OOI-Integration/asset-management/deployment/'):
    if file[0:2] == 'RS' or file[0:2] == 'CE':
        pass
    elif 'MOAS' in file:
        pass
    else:
        deploy_csvs.append(file)
        print(file)

In [None]:
# Get the Deployment History from the WHOI Asset Tracking System
CTDBPF_Deploy = CTDBPF['Deployment History']

In [None]:
CTDBPF_Deploy

In [None]:
# Split the string at the newline to generate a list of deployments for each CTDBP-C
CTDBPF_Deploy = CTDBPF['Deployment History'].apply(lambda x: x.split('\n'))

In [None]:
CTDBPF_Deploy

In [None]:
# List out all the individual deployments
deploy_list = []
for sensor_deployments in CTDBPF_Deploy:
    for item in sensor_deployments:
        # Valid deployment entries contain a hyphen; skip anything else
        if '-' in item:
            deploy_list.append(item)

In [None]:
deploy_list

In [None]:
# So I now have a list of the deployments that all the CTDBP-Cs were used on.
# Now, parse the name of the array from each deployment
array = list( set( [x.split('-')[0] for x in deploy_list] ) )
array

In [None]:
# With the list of array names, I can now parse the deployment file names to find
# the relevant deployment sheets which match where the CTDBP-Cs were deployed
deploy_csvs = []
for file in os.listdir('../../GitHub/OOI-Integration/asset-management/deployment/'):
    if file.split('_')[0] in array:
        deploy_csvs.append(file)
deploy_csvs

In [None]:
# Using the identified deployment csvs, load them into a single pandas dataframe.
# DataFrame.append was removed in pandas 2.0, so collect the frames and use pd.concat
deploy_dir = '../../GitHub/OOI-Integration/asset-management/deployment/'
deployments = pd.concat([pd.read_csv(deploy_dir+file) for file in deploy_csvs])
deployments.head()

In [None]:
# Get the CTDBPF sensor uids
sensor_uids = list( set( CTDBPF['UID'] ) )
sensor_uids

In [None]:
# Find in the deployment spreadsheets the matching entries for the CTDBP-Cs that I'm looking for
deployments['CTDBPF'] = deployments['sensor.uid'].isin(sensor_uids)
deployments = deployments[deployments['CTDBPF']]

In [None]:
deployments.head()

In [None]:
# Now, parse out the date string in the format of YYYYMMDD from the startDateTime
# in order to compare with the date in the calibration file names
deploy_dates = deployments['startDateTime'].apply(lambda x: x.replace('-','').split('T')[0])
deploy_dates = list(set(deploy_dates))
deploy_dates
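
An equivalent way to build the `YYYYMMDD` strings is to parse the timestamps with pandas, which also validates them. This is a sketch on made-up timestamps and assumes `startDateTime` holds ISO 8601 strings.

```python
import pandas as pd

toy_start = pd.Series([
    '2013-11-21T18:16:00',
    '2013-11-21T18:16:00',
    '2015-05-07T17:34:00',
    None,  # a row with no start time
])

# Drop missing entries, parse to datetimes, then format as YYYYMMDD strings
toy_dates = pd.to_datetime(toy_start.dropna()).dt.strftime('%Y%m%d')
toy_deploy_dates = sorted(set(toy_dates))
print(toy_deploy_dates)
```

Sorting the unique dates is optional, but it keeps the list in a stable order between runs.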

In [None]:
cal_csvs = []
for file in os.listdir('../../GitHub/OOI-Integration/asset-management/calibration/CTDBPF/'):
    date = file.split('__')[1].split('.')[0]
    print(date)
    if date in deploy_dates:
        # list.append works in place and returns None, so do not reassign
        cal_csvs.append(file)
print(cal_csvs)
        

In [None]:
cal_csvs

Great! None of the CTDBP-C have calibration dates which match deployment dates. That is a good sign - it means that the dates in the calibration file name *should* match the calibration dates in the calibration info.

However, that is no guarantee that the date in the file name matches the date in the calibration data. This can be checked in a future step by comparing the calibration date in the vendor docs, QCT info, and the .cal and .xmlcon file info.

In [None]:
# Now, using the "deploy" csvs for each node in the various arrays,
# load them into one large pandas dataframe for easy handling
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so collect the frames and use pd.concat
deploy_dir = '../GitHub/OOI-Integration/asset-management/deployment/'
deployments = pd.concat([pd.read_csv(deploy_dir+file) for file in deploy_csvs])

In [None]:
deployments

In [None]:
# Get all the unique deployment dates from the deployment csvs and put into the form of 
# YYYYMMDD. 
deploy_dates = deployments['startDateTime'].apply(lambda x: x.split('T')[0].replace('-',''))

In [None]:
deploy_dates = list(set(deploy_dates))
deploy_dates[0:10]

In [None]:
len(deploy_dates)

In [None]:
check_files = []
for root, dirs, files in os.walk('../GitHub/OOI-Integration/asset-management/calibration/'):
    for name in files:
        if 'CGINS' in name:
            cal_date = name.split('__')[1].split('.')[0]
            if cal_date in deploy_dates:
                check_files.append(name)

In [None]:
# Okay, there are potentially 1364 files whose calibration date in the
# file name needs to be checked, because the parsed date in the
# file name matches a deployment date.
len(list(set(check_files)))

In [None]:
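# Cool, now save the list of files to the local working directory
# (newline='' is what the csv module recommends when opening a file for writing)
with open('calibration_files_to_check.csv','w',newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(check_files)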
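
`writer.writerow(check_files)` writes every file name into a single row. If one name per row is easier to review, a sketch like the following could be used. The output file name and the toy list are placeholders.

```python
import csv
import os
import tempfile

toy_check_files = ['b__20160304.csv', 'a__20150101.csv', 'b__20160304.csv']

# Write to a temporary directory so the sketch does not touch real outputs
out_path = os.path.join(tempfile.mkdtemp(), 'toy_files_to_check.csv')
with open(out_path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['filename'])
    # Deduplicate and sort so the output is stable between runs
    writer.writerows([name] for name in sorted(set(toy_check_files)))

# Read the file back to confirm its layout
with open(out_path, newline='') as csvfile:
    toy_rows = list(csv.reader(csvfile))
print(toy_rows)
```

Deduplicating before writing also keeps the saved list consistent with the unique count reported earlier.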