# CTDMO Metadata Review

This notebook describes the process for reviewing the calibration coefficients for the CTDMO IM-37. The purpose is to check the calibration coefficients contained in the CSVs stored in the asset management repository on GitHub, which are the coefficients OOI-net uses to calculate data products, against the available sources of calibration information, in order to identify errors made when the calibration csvs were entered. This includes checking the following information:
1. Calibration date - this information is stored in the filename of the csv
2. Calibration source - identify all the possible sources of calibration information, and determine which file should supply the calibration info
3. Calibration coefficients - check the accuracy and precision of the numbers stored in the calibration csvs
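
As a rough illustration of the first check, the UID and calibration date can be pulled out of a csv filename with a regular expression instead of fixed string positions. This is only a sketch: the pattern assumes the `CGINS-CTDMOR-<5 digit serial>__<YYYYMMDD>.csv` naming seen in the asset management listing, and `parse_csv_name` is a hypothetical helper, not part of `utils`.

```python
import re
from datetime import datetime

# Assumed naming pattern, e.g. 'CGINS-CTDMOR-11483__20140910.csv'
CSV_NAME = re.compile(
    r'^(?P<uid>CGINS-CTDMOR-(?P<serial>\d{5}))__(?P<date>\d{8})\.csv$'
)

def parse_csv_name(filename):
    """Return (uid, serial, calibration date) or raise ValueError."""
    match = CSV_NAME.match(filename)
    if match is None:
        raise ValueError(f'Unexpected calibration csv name: {filename}')
    # strptime also rejects impossible dates such as 20141332
    cal_date = datetime.strptime(match.group('date'), '%Y%m%d').date()
    return match.group('uid'), match.group('serial'), cal_date

uid, serial, cal_date = parse_csv_name('CGINS-CTDMOR-11483__20140910.csv')
print(uid, serial, cal_date.isoformat())
```

A filename that does not fit the pattern raises immediately, which flags naming errors before any coefficients are compared.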

The CTDMO has 24 calibration coefficients to check. The possible calibration sources for the CTDMOs are vendor PDFs, vendor .cal files, and QCT check-ins. A complication is that the vendor documents are principally available only as PDFs that are copies of images, so Optical Character Recognition (OCR) is needed to parse them. Unfortunately, OCR frequently misinterprets certain character combinations, since it utilizes Levenshtein distance to do character matching.

Furthermore, using OCR to read PDFs requires significant preprocessing to create individual PDFs with uniform metadata and encoding. Without this preprocessing, the OCR does not generate uniformly spaced characters, which makes repeatable automated parsing impractical.

In [1]:
# Import likely important packages, etc.
import sys, os
import numpy as np
import pandas as pd
import shutil

In [2]:
from utils import *

**====================================================================================================================**
Define the directories where the QCT document files are stored as well as where the vendor documents are stored, where asset tracking is located, and where the calibration csvs are located.

In [222]:
qct_directory = '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results'
cal_directory = '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal'
asset_management_directory = '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR'

In [223]:
excel_spreadsheet = '/media/andrew/OS/Users/areed/Documents/Project_Files/Documentation/System/System Notebook/WHOI_Asset_Tracking.xlsx'
sheet_name = 'Sensors'

In [224]:
CTDMO = whoi_asset_tracking(excel_spreadsheet,sheet_name,instrument_class='CTDMO',whoi=True,series='R')
CTDMO.head(10)

Unnamed: 0,Instrument Class,Series,Supplier Serial Number,WHOI #,OOI #,UID,Model,CGSN PN,Firmware Version,Supplier,...,QCT Testing,PreDeployment,Post Deployment,Refurbishment/ Repair,DO Number,Date Received,Deployment History,Current Deployment,Instrument Location on Current Deployment,Notes
135,CTDMO,R,37-11483,116102,A00639,CGINS-CTDMOR-11483,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00086\n3305-00101-00292\n3305-00101...,,,3305-00900-00080\n3305-00900-00273,WH-SC11-01-CTD-1007,2014-01-23 00:00:00,GI01SUMO-00001\nGI01SUMO-00003\nGI01SUMO-00005,GI01SUMO-00005,Inductive Wire (750),5/4/14: Clamp needs to be drilled out\nClamps ...
136,CTDMO,R,37-11484,116103,A00640,CGINS-CTDMOR-11484,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00087\n3305-00101-00299\n3305-00101...,,,3305-00900-00080\n3305-00900-00273,WH-SC11-01-CTD-1007,2014-01-23 00:00:00,GI01SUMO-00001\nGI01SUMO-00003\nGI01SUMO-00005,GI01SUMO-00005,Inductive Wire (1000),5/4/14: Clamp needs to be drilled out\nClamps ...
137,CTDMO,R,37-11485,116104,A00641,CGINS-CTDMOR-11485,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00088\n3305-00101-00302\n3305-00101...,,,3305-00900-00080\n3305-00900-00273,WH-SC11-01-CTD-1007,2014-01-23 00:00:00,GI01SUMO-00001\nGI01SUMO-00003\nGI01SUMO-00005,GI01SUMO-00005,Inductive Wire (1500),5/4/14: Clamp needs to be drilled out\nClamps ...
138,CTDMO,R,37-11486,116105\n118365,A00642\nA02174,CGINS-CTDMOR-11486,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00089\n3305-00101-00489\n3305-00101...,,,3305-00900-00187\n3305-00900-00391,WH-SC11-01-CTD-1007,2014-01-23 00:00:00,GI Spare\nGS01SUMO-00002\nGI01SUMO-00004,,Inductive wire (1500),5/4/14: Clamp needs to be drilled out\nClamps ...
255,CTDMO,R,37-12565,116881\n118165,A01127\nA02071,CGINS-CTDMOR-12565,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00124\n3305-00101-00316,,,3305-00900-00103,WH-SC11-01-CTD-1013,2014-10-14 00:00:00,GS01SUMO-00001,GS01SUMO-00004,Riser (750m),"Clamp Drilled out to 7/16"""
256,CTDMO,R,37-12566,116882,A01126,CGINS-CTDMOR-12566,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00125 3305-00101-00311,,,3305-00900-00103,WH-SC11-01-CTD-1013,2014-10-14 00:00:00,GS01SUMO-00001\nGS01SUMO-00003,,,"Clamp Drilled out to 7/16"""
257,CTDMO,R,37-12567,116883,A01125,CGINS-CTDMOR-12567,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00126\n3305-00101-00313,,,3305-00900-00103,WH-SC11-01-CTD-1013,2014-10-14 00:00:00,GS01SUMO-00001,GS01SUMO-00004,Riser (1000m),"Clamp Drilled out to 7/16"""
258,CTDMO,R,37-12568,116884,A01128,CGINS-CTDMOR-12568,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00127 3305-00101-00353\n3305-00101-...,,,3305-00900-00097\n3305-00900-00333,WH-SC11-01-CTD-1013,2014-10-14 00:00:00,GA01SUMO-00001\nGA01SUMO-00003\nGS 5 spare,,,"Clamp Drilled out to 7/16"""
259,CTDMO,R,37-12569,116885,A01129,CGINS-CTDMOR-12569,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00128 3305-00101-00352\n3305-00101-...,,,3305-00900-00097\n3305-00900-00333,WH-SC11-01-CTD-1013,2014-10-14 00:00:00,GA01SUMO-00001\nGA01SUMO-00003,,,"Clamp Drilled out to 7/16"""
280,CTDMO,R,37-12647,116886,A01130,CGINS-CTDMOR-12647,37IM,1336-00001-00018,3.1,SeaBird,...,3305-00101-00129 3305-00101-00355\n3305-00101-...,,,3305-00900-00097\n3305-00900-00333,WH-SC11-01-CTD-1013,2014-10-14 00:00:00,GA01SUMO-00001\nGA01SUMO-00003\nGS 5 spare,,,"Clamp Drilled out to 7/16"""


In [225]:
for file in sorted(os.listdir(asset_management_directory)):
    sn = '37-' + file[13:18]
    cd = file[20:28]
    print('CTDMO-G  ' + sn + '  ' + file + '  ' + cd)

CTDMO-G  37-11483  CGINS-CTDMOR-11483__20140910.csv  20140910
CTDMO-G  37-11483  CGINS-CTDMOR-11483__20160710.csv  20160710
CTDMO-G  37-11483  CGINS-CTDMOR-11483__20170920.csv  20170920
CTDMO-G  37-11484  CGINS-CTDMOR-11484__20140910.csv  20140910
CTDMO-G  37-11484  CGINS-CTDMOR-11484__20160710.csv  20160710
CTDMO-G  37-11484  CGINS-CTDMOR-11484__20170912.csv  20170912
CTDMO-G  37-11485  CGINS-CTDMOR-11485__20140910.csv  20140910
CTDMO-G  37-11485  CGINS-CTDMOR-11485__20160710.csv  20160710
CTDMO-G  37-11485  CGINS-CTDMOR-11485__20170912.csv  20170912
CTDMO-G  37-11486  CGINS-CTDMOR-11486__20151214.csv  20151214
CTDMO-G  37-11486  CGINS-CTDMOR-11486__20170303.csv  20170303
CTDMO-G  37-11486  CGINS-CTDMOR-11486__20180911.csv  20180911
CTDMO-G  37-12565  CGINS-CTDMOR-12565__20150218.csv  20150218
CTDMO-G  37-12566  CGINS-CTDMOR-12566__20150218.csv  20150218
CTDMO-G  37-12566  CGINS-CTDMOR-12566__20161125.csv  20161125
CTDMO-G  37-12567  CGINS-CTDMOR-12567__20150218.csv  20150218
CTDMO-G 

**======================================================================================================================**

First, get all the unique CTDMO Instrument UIDs:

In [226]:
uids = sorted(list(set(CTDMO['UID'])))

Identify the QCT Testing documents associated with each individual instrument (the UID):

In [227]:
qct_dict = get_qct_files(CTDMO, qct_directory)
qct_dict

{'CGINS-CTDMOR-13629': ['3305-00101-00243', '3305-00101-00543'],
 'CGINS-CTDMOR-11483': ['3305-00101-00086',
  '3305-00101-00292',
  '3305-00101-00614'],
 'CGINS-CTDMOR-13376': ['3305-00101-00138',
  '3305-00101-00450',
  '3305-00101-00706'],
 'CGINS-CTDMOR-12649': ['3305-00101-00130'],
 'CGINS-CTDMOR-13628': ['3305-00101-00242', '3305-00101-00486'],
 'CGINS-CTDMOR-12565': ['3305-00101-00124', '3305-00101-00316'],
 'CGINS-CTDMOR-13375': ['3305-00101-00140', '3305-00101-00455'],
 'CGINS-CTDMOR-12567': ['3305-00101-00126', '3305-00101-00313'],
 'CGINS-CTDMOR-11486': ['3305-00101-00089',
  '3305-00101-00489',
  '3305-00101-00705'],
 'CGINS-CTDMOR-13651': ['3305-00101-00246'],
 'CGINS-CTDMOR-13377': ['3305-00101-00139',
  '3305-00101-00451',
  '3305-00101-00704'],
 'CGINS-CTDMOR-12569': ['3305-00101-00128 3305-00101-00352',
  '3305-00101-00653'],
 'CGINS-CTDMOR-12566': ['3305-00101-00125 3305-00101-00311'],
 'CGINS-CTDMOR-13630': ['3305-00101-00244', '3305-00101-00548'],
 'CGINS-CTDMOR-136
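
Some entries in the output above hold two document numbers in a single string (for example `'3305-00101-00128 3305-00101-00352'`), because the spreadsheet cell separates them with a space rather than a newline. A minimal sketch of splitting such cells on any whitespace follows. `split_document_numbers` is a hypothetical helper and not the implementation inside `get_qct_files`; the `NNNN-NNNNN-NNNNN` pattern is assumed from the document numbers shown in this notebook.

```python
import re

# Assumed document number layout, e.g. '3305-00101-00086'
DOC_NUMBER = re.compile(r'\d{4}-\d{5}-\d{5}')

def split_document_numbers(cell):
    """Split a spreadsheet cell into individual document numbers."""
    if not isinstance(cell, str):
        return []  # empty cells arrive as NaN or None
    # str.split() with no arguments breaks on spaces, tabs and newlines alike
    return [item for item in cell.split() if DOC_NUMBER.fullmatch(item)]

cell = '3305-00101-00128 3305-00101-00352\n3305-00101-00653'
print(split_document_numbers(cell))
```

Splitting this way would give one document number per list entry, so each can be resolved to its own file path.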

Identify the calibration csvs stored in asset management which correspond to a particular instrument:

In [228]:
csv_dict = load_asset_management(CTDMO, asset_management_directory)
csv_dict

{'CGINS-CTDMOR-11483': ['CGINS-CTDMOR-11483__20140910.csv',
  'CGINS-CTDMOR-11483__20160710.csv',
  'CGINS-CTDMOR-11483__20170920.csv'],
 'CGINS-CTDMOR-11484': ['CGINS-CTDMOR-11484__20170912.csv',
  'CGINS-CTDMOR-11484__20140910.csv',
  'CGINS-CTDMOR-11484__20160710.csv'],
 'CGINS-CTDMOR-11485': ['CGINS-CTDMOR-11485__20140910.csv',
  'CGINS-CTDMOR-11485__20170912.csv',
  'CGINS-CTDMOR-11485__20160710.csv'],
 'CGINS-CTDMOR-11486': ['CGINS-CTDMOR-11486__20170303.csv',
  'CGINS-CTDMOR-11486__20151214.csv',
  'CGINS-CTDMOR-11486__20180911.csv'],
 'CGINS-CTDMOR-12565': ['CGINS-CTDMOR-12565__20150218.csv'],
 'CGINS-CTDMOR-12566': ['CGINS-CTDMOR-12566__20150218.csv',
  'CGINS-CTDMOR-12566__20161125.csv'],
 'CGINS-CTDMOR-12567': ['CGINS-CTDMOR-12567__20150218.csv'],
 'CGINS-CTDMOR-12568': ['CGINS-CTDMOR-12568__20180329.csv',
  'CGINS-CTDMOR-12568__20140930.csv',
  'CGINS-CTDMOR-12568__20160317.csv'],
 'CGINS-CTDMOR-12569': ['CGINS-CTDMOR-12569__20140930.csv',
  'CGINS-CTDMOR-12569__20180411.cs

Get the serial numbers for each CTDMO, and use those serial numbers to search for and return all of the relevant vendor documents for a particular instrument:

In [229]:
serial_nums = get_serial_nums(CTDMO, uids)

In [230]:
cal_dict = get_calibration_files(serial_nums, cal_directory)
cal_dict

{'CGINS-CTDMOR-11483': ['CTDMO-R_SBE_37IM_37-11483_Calibration_Files_2016-03-31.zip',
  'CTDMO-R_SBE_37IM_SN_37-11483_Calibration_Files_2014-03-14.zip',
  'CTDMO-R_SBE_37IM_SN_37-11483_Calibration_Files_2017-09-13.zip'],
 'CGINS-CTDMOR-11484': ['CTDMO-R_SBE_37IM_37-11484_Calibration_Files_2016-03-31.zip',
  'CTDMO-R_SBE_37IM_SN_37-11484_Calibration_Files_2014-03-14.zip',
  'CTDMO-R_SBE_37IM_SN_37-11484_Calibration_Files_2017-09-20.zip'],
 'CGINS-CTDMOR-11485': ['CTDMO-R_SBE_37IM_37-11485_Calibration_Files_2016-03-31.zip',
  'CTDMO-R_SBE_37IM_SN_37-11485_Calibration_Files_2014-03-14.zip',
  'CTDMO-R_SBE_37IM_SN_37-11485_Calibration_Files_2017-09-12.zip'],
 'CGINS-CTDMOR-11486': ['CTDMO-R_SBE_37IM_SN_37-11486_Calibration_Files_2014-03-14.zip',
  'CTDMO-R_SBE_37IM_SN_37-11486_Calibration_Files_2017-03-03.zip',
  'CTDMO-R_SBE_37IM_SN_37-11486_Calibration_Files_2018-09-11.zip'],
 'CGINS-CTDMOR-12565': ['CTDMO-R_SBE_37IM_SN_37-12565_Calibration_Files_2014-10-16.zip',
  'CTDMO-R_SBE_37IM_SN_3
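
The vendor archive names carry a serial number and a date, but not in a fully consistent form (some include `_SN_`, some do not). The sketch below extracts both with one regular expression. `vendor_serial_and_date` is a hypothetical helper, and the date in the archive name is only a hint: in the listings above it does not always equal the calibration date used in the csv filename.

```python
import re

# Matches '..._37-11483_Calibration_Files_2014-03-14.zip' with or without '_SN_'
VENDOR_NAME = re.compile(
    r'(?P<serial>37-\d{5})_Calibration_Files_(?P<date>\d{4}-\d{2}-\d{2})'
)

def vendor_serial_and_date(filename):
    """Return (serial, YYYYMMDD) from a vendor archive name, or None."""
    match = VENDOR_NAME.search(filename)
    if match is None:
        return None
    return match.group('serial'), match.group('date').replace('-', '')

names = [
    'CTDMO-R_SBE_37IM_37-11483_Calibration_Files_2016-03-31.zip',
    'CTDMO-R_SBE_37IM_SN_37-11483_Calibration_Files_2014-03-14.zip',
]
for name in names:
    print(vendor_serial_and_date(name))
```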

**========================================================================================================================**
Print all of the CTDMO CSV files in order to retrieve all of the relevant files that need to be checked:

In [231]:
for uid in sorted(csv_dict.keys()):
    files = sorted(csv_dict[uid])
    sn = serial_nums[uid]
    for f in files:
        print('CTDMO-G' + '  ' + '37-' + sn + '  ' + f)

CTDMO-G  37-11483  CGINS-CTDMOR-11483__20140910.csv
CTDMO-G  37-11483  CGINS-CTDMOR-11483__20160710.csv
CTDMO-G  37-11483  CGINS-CTDMOR-11483__20170920.csv
CTDMO-G  37-11484  CGINS-CTDMOR-11484__20140910.csv
CTDMO-G  37-11484  CGINS-CTDMOR-11484__20160710.csv
CTDMO-G  37-11484  CGINS-CTDMOR-11484__20170912.csv
CTDMO-G  37-11485  CGINS-CTDMOR-11485__20140910.csv
CTDMO-G  37-11485  CGINS-CTDMOR-11485__20160710.csv
CTDMO-G  37-11485  CGINS-CTDMOR-11485__20170912.csv
CTDMO-G  37-11486  CGINS-CTDMOR-11486__20151214.csv
CTDMO-G  37-11486  CGINS-CTDMOR-11486__20170303.csv
CTDMO-G  37-11486  CGINS-CTDMOR-11486__20180911.csv
CTDMO-G  37-12565  CGINS-CTDMOR-12565__20150218.csv
CTDMO-G  37-12566  CGINS-CTDMOR-12566__20150218.csv
CTDMO-G  37-12566  CGINS-CTDMOR-12566__20161125.csv
CTDMO-G  37-12567  CGINS-CTDMOR-12567__20150218.csv
CTDMO-G  37-12568  CGINS-CTDMOR-12568__20140930.csv
CTDMO-G  37-12568  CGINS-CTDMOR-12568__20160317.csv
CTDMO-G  37-12568  CGINS-CTDMOR-12568__20180329.csv
CTDMO-G  37-

**========================================================================================================================**
With the individual files identified for the CTDMO Vendor documents, QCTs, and CSVs, we next get the full directory path to the files. This is necessary to load them:

CSV file paths:

In [232]:
csv_paths = {}
for uid in sorted(csv_dict.keys()):
    paths = []
    for file in csv_dict.get(uid):
        path = generate_file_path(asset_management_directory, file, ext=['.csv','.ext'])
        paths.append(path)
    csv_paths.update({uid: paths})

In [233]:
csv_paths

{'CGINS-CTDMOR-11483': ['/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11483__20140910.csv',
  '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11483__20160710.csv',
  '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11483__20170920.csv'],
 'CGINS-CTDMOR-11484': ['/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11484__20170912.csv',
  '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11484__20140910.csv',
  '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11484__20160710.csv'],
 'CGINS-CTDMOR-11485': ['/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibration/CTDMOR/CGINS-CTDMOR-11485__20140910.csv',
  '/home/andrew/Documents/OOI-CGSN/ooi-integration/asset-management/calibratio

CAL file paths:

In [234]:
# Retrieve and save the full directory path to the calibration files
cal_paths = {}
for uid in sorted(cal_dict.keys()):
    paths = []
    for file in cal_dict.get(uid):
        path = generate_file_path(cal_directory, file, ext=['.zip','.cap', '.txt', '.log'])
        paths.append(path)
    cal_paths.update({uid: paths})

In [235]:
cal_paths

{'CGINS-CTDMOR-11483': ['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-R_SBE_37IM_37-11483_Calibration_Files_2016-03-31.zip',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-R_SBE_37IM_SN_37-11483_Calibration_Files_2014-03-14.zip',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-R_SBE_37IM_SN_37-11483_Calibration_Files_2017-09-13.zip'],
 'CGINS-CTDMOR-11484': ['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-R_SBE_37IM_37-11484_Calibration_Files_2016-03-31.zip',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-R_SBE_37IM_SN_37-11484_Calibration_Files_2014-03-14.zip',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-R_SBE_37IM_SN_37-11484_Calibration_Files_

QCT file paths:

In [236]:
qct_paths = {}
for uid in sorted(qct_dict.keys()):
    paths = []
    for file in qct_dict.get(uid):
        path = generate_file_path(qct_directory, file)
        paths.append(path)
    qct_paths.update({uid: paths})

In [237]:
qct_paths

{'CGINS-CTDMOR-11483': ['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00086-A.cap',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00292-A.txt',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00614-A.log'],
 'CGINS-CTDMOR-11484': ['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00087-A.cap',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00299-A.txt',
  '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00613-A.log'],
 'CGINS-CTDMOR-11485': ['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00088-A.cap',
  '/media/andrew/OS/Users/areed/Docu
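
The three loops above all follow the same pattern: look in a directory for a file whose name contains a given string and ends in an allowed extension. The sketch below shows one way such a lookup could work. `find_file` is a hypothetical stand-in written for illustration; it is not the `generate_file_path` function from `utils`, whose actual behaviour may differ.

```python
import os
import tempfile

def find_file(directory, name, extensions=('.cap', '.txt', '.log')):
    """Return the first path under `directory` whose file name contains
    `name` and ends with one of `extensions`, or None if nothing matches."""
    for root, _dirs, files in os.walk(directory):
        for candidate in sorted(files):
            if name in candidate and candidate.lower().endswith(tuple(extensions)):
                return os.path.join(root, candidate)
    return None

# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, '3305-00101-00086-A.cap'), 'w').close()
    found = find_file(demo_dir, '3305-00101-00086')
    missing = find_file(demo_dir, '3305-00101-99999')
    print(os.path.basename(found), missing)
```

Returning `None` for a missing file lets later cells skip it with a simple `is not None` check.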

**========================================================================================================================**
# Processing and Parsing the Calibration Coefficients
With the associated vendor documents (cal files/vendor pdfs), QCT check-ins (qct files), and calibration csvs (csv files) identified, I want to be able to check the following:
* **(1)** The calibration date matches between the different documents
* **(2)** The file name agrees with the CTDMO UID and the calibration date
* **(3)** The calibration coefficients all agree between the different reference documents and calibration csvs
* **(4)** When a calibration coefficient is incorrect, where to find it and how to correct it
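
The coefficient check in (3) and (4) can be sketched as a comparison of two name-to-value dictionaries. The values are compared as floats, so `1.234567e-04` and `0.0001234567` count as equal, and the original strings are reported for any mismatch. `compare_coefficients` and the sample values are assumptions made for illustration, not measured coefficients.

```python
import math

def compare_coefficients(reference, candidate, rel_tol=1e-9):
    """Return {name: (reference value, candidate value)} for every mismatch."""
    mismatches = {}
    for name in sorted(set(reference) | set(candidate)):
        ref, cand = reference.get(name), candidate.get(name)
        if ref is None or cand is None:
            # Coefficient is missing from one of the two sources
            mismatches[name] = (ref, cand)
        elif not math.isclose(float(ref), float(cand), rel_tol=rel_tol, abs_tol=0.0):
            # Coefficient is present in both but numerically different
            mismatches[name] = (ref, cand)
    return mismatches

# Made-up values, for illustration only
vendor_values = {'CC_a0': '1.234567e-04', 'CC_a1': '2.500000e-04', 'CC_g': '-9.870000e-01'}
csv_values = {'CC_a0': '0.0001234567', 'CC_a1': '2.500001e-04'}
print(compare_coefficients(vendor_values, csv_values))
```

Keeping the original strings in the result makes it easier to see whether a mismatch is a typo, a dropped digit, or a missing entry.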

The first step is to define a CTDMO Calibration parsing object. This object contains the relevant attributes and the functions necessary to open, read, and parse the CTDMO calibration coefficients and date, and write the calibration info to a properly-named CSV file.

In [238]:
import re
import os
import string
import pandas as pd
import numpy as np
from wcmatch import fnmatch
from zipfile import ZipFile
import textract

class CTDMOCalibration():
    # Class that stores calibration values for CTDs.

    def __init__(self, uid):
        self.serial = ''
        self.uid = uid
        self.ctd_type = uid
        self.coefficients = {}
        self.date = {}

        # Name mapping for the MO-type CTDs (when reading from pdfs)
        self.mo_coefficient_name_map = {
            'ptcb1': 'CC_ptcb1',
            'pa2': 'CC_pa2',
            'a3': 'CC_a3',
            'pa0': 'CC_pa0',
            'wbotc': 'CC_wbotc',
            'ptcb0': 'CC_ptcb0',
            'g': 'CC_g',
            'ptempa1': 'CC_ptempa1',
            'ptcb2': 'CC_ptcb2',
            'a0': 'CC_a0',
            'h': 'CC_h',
            'ptca0': 'CC_ptca0',
            'a2': 'CC_a2',
            'cpcor': 'CC_cpcor',
            'pcor':'CC_cpcor',
            'i': 'CC_i',
            'ptempa0': 'CC_ptempa0',
            'prange': 'CC_p_range',
            'ctcor': 'CC_ctcor',
            'tcor':'CC_ctcor',
            'a1': 'CC_a1',
            'j': 'CC_j',
            'ptempa2': 'CC_ptempa2',
            'pa1': 'CC_pa1',
            'ptca1': 'CC_ptca1',
            'ptca2': 'CC_ptca2',
        }

    @property
    def uid(self):
        return self._uid

    @uid.setter
    def uid(self, d):
        r = re.compile('.{5}-.{6}-.{5}')
        if r.match(d) is not None:
            self.serial = d.split('-')[2]
            self._uid = d
        else:
            raise Exception(f"The instrument uid {d} is not a valid uid. Please check.")

    @property
    def ctd_type(self):
        return self._ctd_type

    @ctd_type.setter
    def ctd_type(self, d):
        if 'MO' in d:
            self._ctd_type = '37'
        elif 'BP' in d:
            self._ctd_type = '16'
        else:
            self._ctd_type = ''

            
    def mo_parse_pdf(self, filepath):
        """
        This function extracts the text from a given pdf file.
        Depending on if the text concerns calibration for 
        temperature/conductivity or pressure, it calls a further
        function to parse out the individual calibration coeffs.
    
        Args:
            filepath - the full directory path to the pdf file
                which is to be extracted and parsed.
        Calls:
            mo_parse_p(filepath)
            mo_parse_ts(text)
        Returns:
            self - a CTDMO calibration object with calibration
                coefficients parsed into the object calibration
                dictionary
        """
    
        text = textract.process(filepath, encoding='utf-8')
        text = text.decode('utf-8')
    
        if 'PRESSURE CALIBRATION DATA' in text:
            self.mo_parse_p(filepath)
    
        elif 'TEMPERATURE CALIBRATION DATA' in text or 'CONDUCTIVITY CALIBRATION DATA' in text:
            self.mo_parse_ts(text)
        
        else:
            pass
    

    def mo_parse_ts(self, text):
        """
        This function parses text from a pdf and loads the appropriate calibration
        coefficients for the temperature and conductivity sensors into the CTDMO 
        calibration object.
    
        Args:
            text - extracted text from a pdf page
        Returns:
            self - a CTDMO calibration object with either temperature or conductivity
                calibration values filled in the calibration coefficients dictionary
        Raises:
            Exception - if the serial number in the pdf text does not match the
                serial number parsed from the UID
        """
    
        keys = self.mo_coefficient_name_map.keys()
        for line in text.splitlines():
    
            if 'CALIBRATION DATE' in line:
                *ignore, cal_date = line.split(':')
                cal_date = pd.to_datetime(cal_date).strftime('%Y%m%d')
                self.date.update({len(self.date): cal_date})
        
            elif 'SERIAL NUMBER' in line:
                *ignore, serial_num = line.split(':')
                serial_num = serial_num.strip()
                if serial_num != self.serial:
                    raise Exception(f'Instrument serial number {serial_num} does not match UID serial num {self.serial}')
           
            elif '=' in line:
                key, *ignore, value = line.split()
                name = self.mo_coefficient_name_map.get(key.strip().lower())
                if name is not None:
                    self.coefficients.update({name: value.strip()})
            else:
                continue
            
            
    def mo_parse_p(self,filepath):
        """
        Function to parse the pressure calibration information from a pdf. To parse
        the pressure cal info requires re-extracting the text from the pdf file using
        tesseract-ocr rather than the basic pdftotext converter.
    
        Args:
            filepath - full directory path to the pdf file containing the pressure
                calibration info. This is the file which will be re-extracted.
        Returns:
            self - a CTDMO calibration object with pressure calibration values filled
                in the calibration coefficients dictionary
        """
    
        # Now, can reprocess using tesseract-ocr rather than pdftotext
        ptext = textract.process(filepath, method='tesseract', encoding='utf-8')
        ptext = ptext.replace(b'\xe2\x80\x94',b'-')
        ptext = ptext.decode('utf-8')
        keys = list(self.mo_coefficient_name_map.keys())
        
        # Get the calibration date:
        for line in ptext.splitlines():
            if 'CALIBRATION DATE' in line:
                items = line.split()
                ind = items.index('DATE:')
                cal_date = items[ind+1]
                cal_date = pd.to_datetime(cal_date).strftime('%Y%m%d')
                self.date.update({len(self.date):cal_date})
            
            if 'psia S/N' in line:
                items = line.split()
                ind = items.index('psia')
                prange = items[ind-1]
                name = self.mo_coefficient_name_map.get('prange')
                self.coefficients.update({name: prange})
    
            # Loop through each line looking for the lines which contain
            # calibration coefficients
            if '=' in line:
                # Tesseract-ocr misreads '0' as O, and 1 as IL
                line = line.replace('O','0').replace('IL','1').replace('=','').replace(',.','.').replace(',','.')
                line = line.replace('L','1').replace('@','0').replace('l','1').replace('--','-')
                if '11' in line and 'PA2' not in line:
                    line = line.replace('11','1')
                items = line.split()
                for n, k in enumerate(items):
                    if k.lower() in keys:
                        try:
                            float(items[n+1])
                            name = self.mo_coefficient_name_map.get(k.lower())
                            self.coefficients.update({name: items[n+1]})
                        except (ValueError, IndexError):
                            # Value is missing or is not a number; skip it
                            pass
        # Default CC_ptcb2 to zero only when it was not parsed from the pdf
        if 'CC_ptcb2' not in self.coefficients:
            self.coefficients.update({'CC_ptcb2': '0.000000e+000'})


    def mo_parse_cal(self, filepath):
        """
        Function to parse the .cal file for the CTDMO when a .cal file
        is available.
        """
    
        if not filepath.endswith('.cal'):
            raise Exception('Not a .cal filetype.')
    
        with open(filepath) as file:
            data = file.read()
        
        for line in data.splitlines():
            # Skip blank lines and anything that is not a key=value pair
            if '=' not in line:
                continue
            key, value = line.split('=', 1)
            key = key.strip()
            value = value.strip()
        
            if 'SERIALNO' in key:
                sn = value
                if self.serial != sn:
                    raise Exception(f'File serial number {sn} does not match UID {self.uid}')
                
            elif 'CALDATE' in key:
                cal_date = pd.to_datetime(value).strftime('%Y%m%d')
                self.date.update({len(self.date): cal_date})
            
            elif 'INSTRUMENT_TYPE' in key:
                ctd_type = value[-2:]
                if self.ctd_type != ctd_type:
                    raise Exception(f'CTD type {ctd_type} does not match uid {self.uid}.')
                
            else:
                if key.startswith('T'):
                    key = key.replace('T','')
                if key.startswith('C') and len(key)==2:
                    key = key.replace('C','')
                name = self.mo_coefficient_name_map.get(key.lower())
                if name is not None:
                    self.coefficients.update({name: value})
                    
        # Now we need to add in the range of the sensor
        name = self.mo_coefficient_name_map.get('prange')
        self.coefficients.update({name: '1450'})

                    
    def mo_parse_qct(self, filepath):
        """
        This function reads and parses the QCT file into
        the CTDMO calibration object.
    
        Args:
            filepath - full directory path and filename of
                the QCT file
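        Returns:
            self - a CTDMO calibration object with the calibration date and
                coefficients from the QCT file loaded into the object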
        """
        
        with open(filepath,errors='ignore') as file:
            data = file.read()

        data = data.replace('<',' ').replace('>',' ')
        keys = self.mo_coefficient_name_map.keys()

        for line in data.splitlines():
            items = line.split()
    
            # If the line is empty, go to next line
            if len(items) == 0:
                continue
    
            # Check the serial number from the instrument
            elif 'SERIAL NO' in line:
                ind = items.index('NO.')
                sn = items[ind+1]
                if sn != self.serial:
                    raise Exception(f'Serial number {sn} in QCT document does not match uid serial number {self.serial}')
        
            # Check if the line contains the calibration date
            elif 'CalDate' in line:
                cal_date = pd.to_datetime(items[1]).strftime('%Y%m%d')
                self.date.update({len(self.date): cal_date})
        
            # Get the coefficient names and values
            elif items[0].lower() in keys:
                name = self.mo_coefficient_name_map[items[0].lower()]
                self.coefficients.update({name: items[1]})
        
            else:
                pass
    
    
    def write_csv(self, outpath):
        """
        This function writes the correctly named csv file for the ctd to the
        specified directory.

        Args:
            outpath - directory path of where to write the csv file
        Raises:
            ValueError - raised if the CTD object's coefficient dictionary
                has not been populated
        Returns:
            self.to_csv - a csv of the calibration coefficients which is
                written to the specified directory from the outpath.
        """

        # Run a check that the coefficients have actually been loaded
        if len(self.coefficients) == 0:
            raise ValueError('No calibration coefficients have been loaded.')

        # Create a dataframe to write to the csv
        data = {'serial': [self.ctd_type + '-' + self.serial]*len(self.coefficients),
                'name': list(self.coefficients.keys()),
                'value': list(self.coefficients.values()),
                'notes': ['']*len(self.coefficients)
                }
        df = pd.DataFrame.from_dict(data)

        # Generate the csv name
        cal_date = max(self.date.values())
        csv_name = self.uid + '__' + cal_date + '.csv'

        # Write the dataframe to a csv file
        # check = input(f"Write {csv_name} to {outpath}? [y/n]: ")
        check = 'y'
        if check.lower().strip() == 'y':
            df.to_csv(os.path.join(outpath, csv_name), index=False)

**========================================================================================================================**
Below, I plan on going through each of the CTDMO UIDs and parsing the data into csvs. For source files which may contain multiple calibrations or calibration sources, I plan on extracting each of the calibrations to a temporary folder using the following structure:

    <local working directory>/<temp>/<source>/data/<calibration file>
    
The separate calibrations will be saved using the standard UFrame naming convention with the following directory structure:

    <local working directory>/<temp>/<source>/<calibration csv>
    
The csvs themselves will also be copied to the temporary folder. This allows the program to look in the same temp directory for every CTDMO check.
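
The directory layout described above could be created up front with a small helper. This is a sketch only: `make_temp_layout` is hypothetical, and the source names `csv`, `cal` and `qct` are assumed labels for the three kinds of input.

```python
import os
import tempfile

def make_temp_layout(working_dir, sources=('csv', 'cal', 'qct')):
    """Create <working_dir>/temp/<source>/data for each source and
    return a mapping of source name to its <source> directory."""
    layout = {}
    for source in sources:
        source_dir = os.path.join(working_dir, 'temp', source)
        # exist_ok avoids an error when the directory is already there
        os.makedirs(os.path.join(source_dir, 'data'), exist_ok=True)
        layout[source] = source_dir
    return layout

# Demonstrate inside a throwaway working directory
with tempfile.TemporaryDirectory() as working_dir:
    layout = make_temp_layout(working_dir)
    created = sorted(os.listdir(os.path.join(working_dir, 'temp')))
    print(created)
```

Raw extracted files would go under `data`, and the regenerated calibration csvs directly under each source directory.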

In [239]:
import shutil

In [240]:
uids

['CGINS-CTDMOR-11483',
 'CGINS-CTDMOR-11484',
 'CGINS-CTDMOR-11485',
 'CGINS-CTDMOR-11486',
 'CGINS-CTDMOR-12565',
 'CGINS-CTDMOR-12566',
 'CGINS-CTDMOR-12567',
 'CGINS-CTDMOR-12568',
 'CGINS-CTDMOR-12569',
 'CGINS-CTDMOR-12647',
 'CGINS-CTDMOR-12649',
 'CGINS-CTDMOR-13375',
 'CGINS-CTDMOR-13376',
 'CGINS-CTDMOR-13377',
 'CGINS-CTDMOR-13628',
 'CGINS-CTDMOR-13629',
 'CGINS-CTDMOR-13630',
 'CGINS-CTDMOR-13648',
 'CGINS-CTDMOR-13651',
 'CGINS-CTDMOR-13654']

**====================================================================================================================**
# START HERE

In [248]:
i = 0
uid = sorted(uids)[i]
uid

'CGINS-CTDMOR-11483'

In [268]:
i = i + 1
uid = sorted(uids)[i]
for cpath in sorted(cal_paths[uid]):
    print(cpath.split('/')[-1])
print()
for qpath in qct_paths[uid]:
    if qpath is not None:
        print(qpath.split('/')[-1].split('.')[0])

CTDMO-R_SBE_37IM_SN_37-13654_Calibration_Files_2015-07-01.pdf
CTDMO-R_SBE_37IM_SN_37-13654_Calibration_Files_2017-03-03.zip

3305-00101-00247-A
3305-00101-00487-A


In [61]:
temp_directory = '/'.join((os.getcwd(),'temp'))
# Start from a clean temp directory on every run
if os.path.exists(temp_directory):
    shutil.rmtree(temp_directory)
ensure_dir(temp_directory)

**=======================================================================================================================**
Copy the existing CTDMO asset management csvs to the local temp directory:

In [23]:
for filepath in csv_paths[uid]:
    savedir = '/'.join((temp_directory,'csv'))
    ensure_dir(savedir)
    savepath = '/'.join((savedir, filepath.split('/')[-1]))
    shutil.copyfile(filepath, savepath)

========================================================================================================================
### Parse and process the vendor documents
The next step is to read and parse the vendor documents. This is more difficult because, for CTDMOs, the vendor documents are retained mostly as pdf files. While the pdf files are parseable, the forms have changed over time: sometimes the T/S/P calibration pdfs are combined into a single file, and other times they are separated into individual files. Furthermore, the files are often zipped into a single folder. So, I have the following possible vendor documents:
* **(1)** A .cal file - this is the easiest to read and parse, in a similar format to the CTDBP .cal files
* **(2)** A combined pdf - this is the most difficult format. The pages, which each contain either the temperature, the conductivity, or the pressure calibration info, need to be separated.
* **(3)** Separate pdfs - this is a simpler pdf reading scheme, where I know a priori which particular "page" will contain relevant calibration info. 

There are a couple of different pdf readers that I can use:
1. PyPDF2
2. PDFMiner
3. Textract

In [24]:
import PyPDF2

PyPDF2 does not extract text from the CTDMO combined pdf document, and neither does a straightforward PDFMiner application. We will have to use OCR and textract to parse the pdf forms.

When parsing the pdf files, textract's built-in pdftotext method does the best job on the forms, particularly for the temperature and conductivity coefficients. The pressure calibration coefficients are not parsed as well, due to the positioning of the image.

This means I use two different methods to get the calibration coefficients, depending on what the calibration is for (T/S/P). For T and S, I use the built-in text extraction method. For the pressure, I use the tesseract OCR approach.
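A minimal sketch of that split, assuming the parsing goes through `textract.process` with its `method` argument. The helper names `choose_method` and `extract_text` are hypothetical and are not the parser used by the CTDMO calibration object.

```python
def choose_method(calibration):
    """Pick a textract method name for a calibration type ('T', 'S', or 'P')."""
    # Assumption: pressure pages need OCR, while temperature and conductivity
    # pages parse well enough with direct text extraction.
    return "tesseract" if calibration.upper() == "P" else "pdftotext"

def extract_text(pdf_path, calibration):
    """Return the text of a single-page calibration pdf as a str."""
    import textract  # imported lazily so choose_method works without it
    raw = textract.process(pdf_path, method=choose_method(calibration))
    return raw.decode("utf-8")   # textract returns bytes

print(choose_method("T"), choose_method("P"))
```

Keeping the method choice in its own function makes it easy to switch a calibration type to OCR later if the direct extraction proves unreliable.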

========================================================================================================================
### Preprocessing the Vendor Files
Automating the parsing of the CTDMO calibration coefficients from pdf files into csv files that can be read by Python requires a bit of preprocessing. In particular, the following steps are taken to prepare the files for parsing:
* **(1)** Copy or extract the vendor calibration files from the Vault location to a local temp directory
* **(2)** Iterate over the available pdfs, split multipage pdfs into single-page pdfs, and append `_page_<n>` to each file name
* **(3)** Once the pdfs have been split, they are ready to be parsed by the CTDMO object parsers.

In [25]:
# List the vendor calibration files for this UID
cal_paths[uid]

['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2012-11-13.pdf',
 '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2014-02-14.pdf',
 '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2016-01-13.pdf',
 '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Cal/CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2017-08-29.zip']

Copy the vendor pdf files to a local temporary directory:

In [26]:
for filepath in cal_paths[uid]:
    folder, *ignore = filepath.split('/')[-1].split('.')
    savedir = '/'.join((temp_directory,'data',folder))
    ensure_dir(savedir)
    
    if filepath.endswith('.zip'):
        with ZipFile(filepath,'r') as zfile:
            for file in zfile.namelist():
                zfile.extract(file, path=savedir)
    else:
        shutil.copy(filepath, savedir)

In [27]:
# Flatten every extracted folder (not only the last one copied above)
for folder in os.listdir('/'.join((temp_directory,'data'))):
    for file in os.listdir('/'.join((temp_directory,'data',folder))):
        if os.path.isdir('/'.join((temp_directory,'data',folder,file))):
            # Move the contents of any extracted subfolder up one level
            for subfile in os.listdir('/'.join((temp_directory,'data',folder,file))):
                src = '/'.join((temp_directory,'data',folder,file,subfile))
                dst = '/'.join((temp_directory,'data',folder,subfile))
                shutil.move(src,dst)
            shutil.rmtree('/'.join((temp_directory,'data',folder,file)))

In [28]:
folders = os.listdir('/'.join((os.getcwd(),'temp','data')))
rmfiles = []
for folder in folders:
    filepath = '/'.join((os.getcwd(),'temp','data',folder))
    
    # Folders that contain a .cal file do not need their pdfs split
    if any(file.endswith('.cal') for file in os.listdir(filepath)):
        continue

    files = [file for file in os.listdir(filepath) if 'SERVICE REPORT' not in file]
    try:
        for file in files:
            inputpath = '/'.join((filepath,file))
            inputpdf = PyPDF2.PdfFileReader(inputpath)

            # Write each page to its own pdf named <original name>_page_<n>.pdf
            # (use "page" rather than "i" so the UID counter i is not overwritten)
            for page in range(inputpdf.numPages):
                output = PyPDF2.PdfFileWriter()
                output.addPage(inputpdf.getPage(page))
                filename = '_'.join((os.path.splitext(inputpath)[0], 'page', str(page)))
                with open(filename+'.pdf', "wb") as outputStream:
                    output.write(outputStream)
    except Exception:
        # Remember every folder that could not be split so it can be removed
        rmfiles.append(filepath)
        print(f'Cannot reformat {filepath}')

for rmfile in rmfiles:
    shutil.rmtree(rmfile)

In [29]:
os.listdir(temp_directory+'/data')

['CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2014-02-14',
 'CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2017-08-29',
 'CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2016-01-13',
 'CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2012-11-13']

The next step is to iterate over the vendor calibration files and extract the calibration coefficients. For each folder, this is done by starting an instance of the CTDMO calibration object and checking whether any of the calibration data is stored as a .cal file. If there is no .cal file, loop over the other files looking for `_page_` files, which indicate that the pdf has been prepped.

In [30]:
datadir = os.path.abspath('/'.join((os.getcwd(),'temp','data')))
for folder in os.listdir(datadir):
    # Okay, now start generating calibration csvs
    ctdmo = CTDMOCalibration(uid)
    files = [file for file in os.listdir('/'.join((datadir,folder)))]
    if any([file for file in files if file.endswith('.cal')]):
        for file in files:
            if file.endswith('.cal'):
                ctdmo.mo_parse_cal('/'.join((datadir,folder,file)))
    else:
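        for file in files:
            if '_page_' in file:
                try:
                    ctdmo.mo_parse_pdf('/'.join((datadir,folder,file)))
                except Exception as err:
                    print(f'Parsing failed for {file}: {err}')
                    
    savedir = '/'.join((os.getcwd(),'temp','cal'))
    ensure_dir(savedir)
    try:
        ctdmo.write_csv(savedir)
    except Exception as err:
        # Report the failure instead of silently skipping the folder
        print(f'Could not write csv for {folder}: {err}')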

Check that the calibration object properly loaded all of the calibration coefficients, serial number, calibration date, etc., and wrote the appropriate csv file.

In [31]:
os.listdir(temp_directory+'/cal')

['CGINS-CTDMOG-10214__20170829.csv',
 'CGINS-CTDMOG-10214__20121113.csv',
 'CGINS-CTDMOG-10214__20140214.csv',
 'CGINS-CTDMOG-10214__20151110.csv']

**=======================================================================================================================**
Next, we need to parse the QCT files and check that they have been successfully saved to a csv file. There should be 24 coefficients. Similarly, check the instrument serial number, the calibration date (there may be more than one because the T, S, and P sensors can have separate calibration dates), and the type (for CTDMOs this should be 37).
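These checks can be sketched on a parsed csv loaded into pandas. The function `check_parsed_csv` is hypothetical. It assumes the `serial`, `name`, `value`, and `notes` columns seen in the outputs later in this notebook, and that CTDMO serial numbers start with `37-`. The values in the example table are made up.

```python
import io
import pandas as pd

def check_parsed_csv(df, n_coeffs=24):
    """Return a list of problems found in a parsed calibration table."""
    problems = []
    if len(df) != n_coeffs:
        problems.append(f"expected {n_coeffs} coefficients, found {len(df)}")
    if df["name"].duplicated().any():
        problems.append("duplicate coefficient names")
    if df["serial"].nunique() != 1:
        problems.append("expected exactly one serial number")
    elif not str(df["serial"].iloc[0]).startswith("37-"):
        problems.append("serial number does not start with 37-")
    if df["value"].isna().any():
        problems.append("missing coefficient values")
    return problems

# Tiny illustrative table (made-up values), so only the count check fails
example = pd.read_csv(io.StringIO(
    "serial,name,value,notes\n"
    "37-00000,CC_g,-1.0e+00,\n"
    "37-00000,CC_pa0,2.0e-01,\n"))
print(check_parsed_csv(example))
```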

In [32]:
for filepath in qct_paths[uid]:
    savedir = '/'.join((temp_directory,'qct'))
    ensure_dir(savedir)
    if filepath is not None:
        try:
            ctdmo = CTDMOCalibration(uid)
            ctdmo.mo_parse_qct(filepath)
            ctdmo.write_csv(savedir)
        except:
            print(f'Failed to parse {filepath}')
    else:
        pass

In [33]:
qct_paths[uid]

['/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00001-A.cap',
 '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00082-A.txt',
 '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00254-A.cap',
 '/media/andrew/OS/Users/areed/Documents/Project_Files/Records/Instrument_Records/CTDMO/CTDMO_Results/3305-00101-00569-A.log']

In [34]:
os.listdir('/'.join((temp_directory,'qct')))

['CGINS-CTDMOG-10214__20170829.csv',
 'CGINS-CTDMOG-10214__20121113.csv',
 'CGINS-CTDMOG-10214__20140214.csv',
 'CGINS-CTDMOG-10214__20151110.csv']

**========================================================================================================================**
### Compare results
Now, with the QCT files parsed into csvs that follow the UFrame format, I can load both the QCT and the calibration csvs into pandas dataframes, which allows element-by-element comparison in relatively few lines of code.

In [35]:
def get_file_date(x):
    x = str(x)
    ind1 = x.index('__')
    ind2 = x.index('.')
    return x[ind1+2:ind2]
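`get_file_date` raises a `ValueError` if a file name lacks the `__` separator. As a hedged alternative, the sketch below validates the whole name against an assumed pattern of `<UID>__<YYYYMMDD>.csv` and returns `None` for anything else. The names `CAL_NAME` and `split_cal_filename` are hypothetical.

```python
import re

# Assumed pattern: <UID>__<YYYYMMDD>.csv, e.g. CGINS-CTDMOG-10214__20121113.csv
CAL_NAME = re.compile(r"^(?P<uid>[A-Z0-9-]+)__(?P<date>\d{8})\.csv$")

def split_cal_filename(fname):
    """Return (uid, date) from a calibration csv name, or None if malformed."""
    match = CAL_NAME.match(str(fname))
    return (match.group("uid"), match.group("date")) if match else None

print(split_cal_filename("CGINS-CTDMOG-10214__20121113.csv"))
print(split_cal_filename("CGINS-CTDMOG-10214_20121113.csv"))  # single underscore
```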

Load the asset management csvs:

In [36]:
# Now we want to compare dataframe
csv_files = pd.DataFrame(sorted(os.listdir('temp/csv')),columns=['csv'])
csv_files['cal date'] = csv_files['csv'].apply(lambda x: get_file_date(x))
csv_files.set_index('cal date',inplace=True)
csv_files

                                       csv
cal date
20121113  CGINS-CTDMOG-10214__20121113.csv
20140214  CGINS-CTDMOG-10214__20140214.csv
20151110  CGINS-CTDMOG-10214__20151110.csv
20170829  CGINS-CTDMOG-10214__20170829.csv


Load the QCT csvs:

In [37]:
# Now we want to compare dataframe
qct_files = pd.DataFrame(sorted(os.listdir('temp/qct')),columns=['qct'])
qct_files['cal date'] = qct_files['qct'].apply(lambda x: get_file_date(x))
qct_files.set_index('cal date',inplace=True)
qct_files

                                       qct
cal date
20121113  CGINS-CTDMOG-10214__20121113.csv
20140214  CGINS-CTDMOG-10214__20140214.csv
20151110  CGINS-CTDMOG-10214__20151110.csv
20170829  CGINS-CTDMOG-10214__20170829.csv


Load the vendor calibration csvs:

In [38]:
cal_files = pd.DataFrame(sorted(os.listdir('temp/cal')),columns=['cal'])
cal_files['cal date'] = cal_files['cal'].apply(lambda x: get_file_date(x))
cal_files.set_index('cal date',inplace=True)
cal_files

                                       cal
cal date
20121113  CGINS-CTDMOG-10214__20121113.csv
20140214  CGINS-CTDMOG-10214__20140214.csv
20151110  CGINS-CTDMOG-10214__20151110.csv
20170829  CGINS-CTDMOG-10214__20170829.csv


Combine the dataframes into one in order to know which csv files to compare and check calibration dates.

In [2677]:
df_files = csv_files.join(qct_files,how='outer').join(cal_files,how='outer').fillna(value='-999')
df_files

                                       csv                               qct                               cal
cal date
20150618                              -999  CGINS-CTDMOQ-13619__20150618.csv  CGINS-CTDMOQ-13619__20150618.csv
20161125  CGINS-CTDMOQ-13619__20161125.csv                              -999                              -999


If the filename is wrong, the calibration coefficient checker cannot compare the results. Consequently, we rename the local copy of the misnamed file to the correct name, and then run the calibration coefficient checker.
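Misnamed files show up in the combined table as rows where the csv and the other sources do not line up. A small sketch of flagging those rows automatically, using an illustrative table shaped like `df_files` (the variable `example_files` is hypothetical):

```python
import pandas as pd

# Illustrative file table shaped like df_files, with '-999' marking a missing file
example_files = pd.DataFrame(
    {"csv": ["-999", "CGINS-CTDMOQ-13619__20161125.csv"],
     "qct": ["CGINS-CTDMOQ-13619__20150618.csv", "-999"],
     "cal": ["CGINS-CTDMOQ-13619__20150618.csv", "-999"]},
    index=pd.Index(["20150618", "20161125"], name="cal date"))

missing = example_files.eq("-999")

# Dates that have a QCT or vendor file but no asset management csv
no_csv = example_files.index[missing["csv"].to_numpy()].tolist()

# Dates where the asset management csv has nothing to be checked against
alone = ~missing["csv"] & missing[["qct", "cal"]].all(axis=1)
unverified = example_files.index[alone.to_numpy()].tolist()

print(no_csv, unverified)
```

A date in `no_csv` paired with a nearby date in `unverified` is a good hint that the csv carries the wrong calibration date in its file name.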

In [2678]:
d1 = str(20161125)
d2 = str(20150618)

In [2679]:
src = f'temp/csv/{uid}__{d1}.csv'
dst = f'temp/csv/{uid}__{d2}.csv'

In [2680]:
shutil.move(src,dst)

'temp/csv/CGINS-CTDMOQ-13619__20150618.csv'

In [2681]:
os.listdir('temp/csv')

['CGINS-CTDMOQ-13619__20150618.csv']

Reload the data so that all files are uniformly named:

In [2682]:
csv_files = pd.DataFrame(sorted(os.listdir('temp/csv')),columns=['csv'])
csv_files['cal date'] = csv_files['csv'].apply(lambda x: get_file_date(x))
csv_files.set_index('cal date',inplace=True)
csv_files

                                       csv
cal date
20150618  CGINS-CTDMOQ-13619__20150618.csv


In [2683]:
df_files = csv_files.join(qct_files,how='outer').join(cal_files,how='outer').fillna(value='-999')
df_files

                                       csv                               qct                               cal
cal date
20150618  CGINS-CTDMOQ-13619__20150618.csv  CGINS-CTDMOQ-13619__20150618.csv  CGINS-CTDMOQ-13619__20150618.csv


In [2684]:
caldates = df_files.index
for i in caldates:
    print(i)

20150618


In [39]:
for cpath in sorted(cal_paths[uid]):
    print(cpath.split('/')[-1])

CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2012-11-13.pdf
CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2014-02-14.pdf
CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2016-01-13.pdf
CTDMO-G_SBE_37IM_SN_37-10214_Calibration_Files_2017-08-29.zip


In [86]:
for qpath in qct_paths[uid]:
    if qpath is not None:
        print(qpath.split('/')[-1].split('.')[0])

3305-00101-00039-A
3305-00101-00271-A
3305-00101-00587-A


With uniformly named csv files, we can now directly compare different calibration coefficient sources for the CTDMO.

This table tells us that, for the csv CGINS-CTDMOG-11596__20150608.csv, I am missing a QCT document and vendor doc which could verify the calibration coefficients. Next, for the files I can compare, I want to go through and check each calibration coefficient.

**========================================================================================================================**
Okay, I want to check the following in the comparison between the CSV files contained in Asset Management, the QCT checkins, and the vendor docs:
1. Do the calibration coefficients match exactly?
2. Do the calibration coefficients match to within 0.001%?
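One caveat for the relative check: `np.isclose` tests `|a - b| <= atol + rtol * |b|`, and its default `atol` is `1e-8`. For coefficients whose magnitude is at or below that scale, the absolute term dominates and very different values can still pass. The sketch below uses illustrative numbers to show the effect, and shows that setting `atol=0` gives a purely relative 0.001% check.

```python
import numpy as np

a, b = 5.0e-09, 9.0e-09   # illustrative values that differ by a large fraction

# Default tolerances (rtol=1e-5, atol=1e-8): passes only because of atol
default_result = np.isclose(a, b)

# Purely relative check at 0.001%
relative_result = np.isclose(a, b, rtol=1e-5, atol=0.0)

# The CC_pa0 pair flagged in the outputs of this notebook also fails the relative check
pa0_result = np.isclose(0.217212, 0.217216, rtol=1e-5, atol=0.0)

print(default_result, relative_result, pa0_result)
```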

In [2687]:
files = sorted(os.listdir(temp_directory+'/csv'))
files

['CGINS-CTDMOQ-13619__20150618.csv']

In [2688]:
i = 0
fname = files[i]
dfcsv = pd.read_csv(temp_directory+'/csv/'+fname)
dfcsv.sort_values(by='name', inplace=True)
dfcsv.reset_index(inplace=True)
dfcsv.drop(columns='index',inplace=True)

In [2689]:
dfcal = pd.read_csv(temp_directory+'/cal/'+fname)
dfcal.sort_values(by='name', inplace=True)
dfcal.reset_index(inplace=True)
dfcal.drop(columns='index',inplace=True)

In [2690]:
dfqct = pd.read_csv(temp_directory+'/qct/'+fname)
dfqct.sort_values(by='name', inplace=True)
dfqct.reset_index(inplace=True)
dfqct.drop(columns='index', inplace=True)

In [2691]:
cal_close = np.isclose(dfcsv['value'], dfcal['value'])
cal_check = dfcsv == dfcal

In [2692]:
dfcsv[cal_check['value'] == False], dfcal[cal_check['value'] == False]

(      serial      name      value  notes
 11  37-13619    CC_pa0   0.217212    NaN
 17  37-13619  CC_ptcb0  24.938000    NaN,
       serial      name      value  notes
 11  37-13619    CC_pa0   0.217216    NaN
 17  37-13619  CC_ptcb0  24.938000    NaN)

In [2693]:
dfcsv[cal_close == False], dfcal[cal_close == False]

(      serial    name     value  notes
 11  37-13619  CC_pa0  0.217212    NaN,       serial    name     value  notes
 11  37-13619  CC_pa0  0.217216    NaN)

In [2694]:
qct_check = dfcsv == dfqct
qct_close = np.isclose(dfcsv['value'], dfqct['value'])

In [2695]:
dfcsv[qct_check['value'] == False], dfqct[qct_check['value'] == False]

(      serial        name         value  notes
 6   37-13619        CC_g -9.868872e-01    NaN
 17  37-13619    CC_ptcb0  2.493800e+01    NaN
 22  37-13619  CC_ptempa2 -7.650646e-07    NaN
 23  37-13619    CC_wbotc  5.057800e-07    NaN,
       serial        name         value  notes
 6   37-13619        CC_g -9.868871e-01    NaN
 17  37-13619    CC_ptcb0  2.493800e+01    NaN
 22  37-13619  CC_ptempa2 -7.650647e-07    NaN
 23  37-13619    CC_wbotc  5.057824e-07    NaN)

In [2696]:
set(dfqct[qct_check['value'] == False]['name'])


{'CC_g', 'CC_ptcb0', 'CC_ptempa2', 'CC_wbotc'}

In [2697]:
dfcsv[qct_close == False], dfqct[qct_close == False]

(Empty DataFrame
 Columns: [serial, name, value, notes]
 Index: [], Empty DataFrame
 Columns: [serial, name, value, notes]
 Index: [])

In [2578]:
def check_exact_coeffs(coeffs_dict):
    """Return a mask that is True where every source has exactly the same value."""
    
    # Part 1: coeff by coeff comparison between each source and the next one
    # (wrapping around, so the last source is compared with the first)
    keys = list(coeffs_dict.keys())
    comparison = {}
    for i in range(len(keys)):
        names = (keys[i], keys[i - (len(keys)-1)])
        first, second = (coeffs_dict.get(name)['value'] for name in names)
        # Only sources with the same number of coefficients can be compared
        if len(first) == len(second):
            comparison.update({names: np.equal(first, second)})
        
    # Part 2: now do a logical_and comparison between the results from part 1
    if not comparison:
        raise ValueError('No sources with matching numbers of coefficients to compare')
    results = list(comparison.values())
    mask = results[0]
    for result in results[1:]:
        mask = np.logical_and(mask, result)
       
    return mask 

In [1896]:
def check_relative_coeffs(coeffs_dict):
    """Return a mask that is True where every source agrees to within rtol=1e-5."""
    
    # Part 1: coeff by coeff comparison between each source and the next one
    # (wrapping around, so the last source is compared with the first)
    keys = list(coeffs_dict.keys())
    comparison = {}
    for i in range(len(keys)):
        names = (keys[i], keys[i - (len(keys)-1)])
        first, second = (coeffs_dict.get(name)['value'] for name in names)
        # Only sources with the same number of coefficients can be compared
        if len(first) == len(second):
            comparison.update({names: np.isclose(first, second, rtol=1e-5)})
        
    # Part 2: now do a logical_and comparison between the results from part 1
    if not comparison:
        raise ValueError('No sources with matching numbers of coefficients to compare')
    results = list(comparison.values())
    mask = results[0]
    for result in results[1:]:
        mask = np.logical_and(mask, result)
       
    return mask 
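The two checkers differ only in the comparison they apply. As a sketch of one way to fold them together, the hypothetical function `all_sources_agree` below compares every pair of sources with `itertools.combinations` and aligns them on coefficient name, so the row order does not matter. It assumes each dataframe is indexed by `name`, as in the loading loop that follows, and with `rtol=0, atol=0` it reduces to an exact comparison. The example values are taken from the outputs above.

```python
from itertools import combinations
import numpy as np
import pandas as pd

def all_sources_agree(coeffs_dict, rtol=1e-5, atol=0.0):
    """Boolean Series: True where every pair of sources agrees on a coefficient."""
    frames = list(coeffs_dict.values())
    mask = pd.Series(True, index=frames[0].index)
    for left, right in combinations(frames, 2):
        # Align on coefficient name; a name missing from one source becomes NaN
        lval, rval = left["value"].align(right["value"], join="outer")
        close = pd.Series(np.isclose(lval, rval, rtol=rtol, atol=atol), index=lval.index)
        mask = mask.reindex(close.index, fill_value=True) & close
    return mask

names = pd.Index(["CC_pa0", "CC_ptcb0"], name="name")
sources = {
    "csv": pd.DataFrame({"value": [0.217212, 24.938]}, index=names),
    "cal": pd.DataFrame({"value": [0.217216, 24.938]}, index=names),
    "qct": pd.DataFrame({"value": [0.217212, 24.938]}, index=names),
}
agree = all_sources_agree(sources)
print(agree)
```

Because `np.isclose` treats NaN as not close, a coefficient that is missing from any one source is flagged rather than silently skipped.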

In [None]:
exact_match = {}
for cal_date in df_files.index:
    # Part 1, load all of the csv files
    coeffs_dict = {}
    for source,fname in df_files.loc[cal_date].items():
        if fname != '-999':
            load_directory = '/'.join((os.getcwd(),'temp',source,fname))
            df_coeffs = pd.read_csv(load_directory)
            for i in list(set(df_coeffs['serial'])):
                print(source + '-' + fname + ': ' + str(i))
            df_coeffs.set_index(keys='name',inplace=True)
            df_coeffs.sort_index(inplace=True)
            coeffs_dict.update({source:df_coeffs})
        else:
            pass
    
    # Part 2, now check the calibration coefficients
    mask = check_exact_coeffs(coeffs_dict)
    
    # Part 3: get the calibration coefficients that are wrong
    # and show them
    fname = df_files.loc[cal_date]['csv']
    if fname == '-999':
        incorrect = 'No csv file.'
    else:
        incorrect = coeffs_dict['csv'][mask == False]
    exact_match.update({fname:incorrect})

In [None]:
relative_match = {}
for cal_date in df_files.index:
    # Part 1, load all of the csv files
    coeffs_dict = {}
    for source,fname in df_files.loc[cal_date].items():
        if fname != '-999':
            load_directory = '/'.join((os.getcwd(),'temp',source,fname))
            df_coeffs = pd.read_csv(load_directory)
            for i in list(set(df_coeffs['serial'])):
                print(source + '-' + fname + ': ' + str(i))
            df_coeffs.set_index(keys='name',inplace=True)
            df_coeffs.sort_index(inplace=True)
            coeffs_dict.update({source:df_coeffs})
        else:
            pass
    
    # Part 2, now check the calibration coefficients
    mask = check_relative_coeffs(coeffs_dict)
    
    # Part 3: get the calibration coefficients that are wrong
    # and show them
    fname = df_files.loc[cal_date]['csv']
    if fname == '-999':
        incorrect = 'No csv file.'
    else:
        incorrect = coeffs_dict['csv'][mask == False]
    relative_match.update({fname:incorrect})

In [None]:
os.listdir(temp_directory+'/csv')

In [None]:
for key in sorted(exact_match.keys()):
    if key != '-999':
        print(', '.join((ind for ind in exact_match[key].index.values)))

In [None]:
for key in sorted(relative_match.keys()):
    if key != '-999':
        print(', '.join((ind for ind in relative_match[key].index.values)))

In [None]:
qct

**========================================================================================================================**
Now we need to check that the calibration coefficients for each CTDMO csv have the same number of significant digits as are reported on the vendor PDFs. For the CTDMO, the vendor reports to six significant figures.
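When formatting with Python's `e` presentation type, the precision counts the digits after the decimal point, not the total number of significant figures. A short illustration with a made-up value:

```python
x = 0.217212345   # illustrative value

six_decimals = "{:.6e}".format(x)    # one leading digit + six decimals = 7 sig figs
five_decimals = "{:.5e}".format(x)   # one leading digit + five decimals = 6 sig figs

print(six_decimals)
print(five_decimals)
```

So `.6e` matches a vendor sheet that prints six digits after the decimal point, while `.5e` is the one to use if six significant figures in total are wanted.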

In [None]:
csv_paths

In [None]:
uid = uids[0]
uid

In [None]:
CSV = pd.read_csv(csv_paths[uid][0])
CSV

In [None]:
for val in CSV['value']:
    print("{:.6e}".format(val))

In [None]:
print("{:.2e}".format(0.00253))

In [None]:
import math

In [None]:
def to_precision(x,p):
    """
    Returns a string representation of x formatted with a precision of p,
    following the toPrecision method from javascript. This implementation
    is based on example code from www.randlet.com.
    
    Args:
        x - number to format to a specified precision
        p - the specified precision for the number x
    Returns:
    
    """
    
    # First check if x is a string
    if type(x) is not float:
        x = float(x)
        
    # Next, check if p is an int and if not, convert to int
    if type(p) is not int:
        p = int(p)
    
        
    if x == 0.:
        return "0." + "0"*(p-1)
    
    out = []
    
    if x < 0:
        out.append("-")
        x = -x
        
    e = int(math.log10(x))
    tens = math.pow(10, e - p + 1)
    n = math.floor(x / tens)
    
    if n < math.pow(10, p - 1):
        e = e - 1
        tens = math.pow(10, e - p + 1)
        n = math.floor(x / tens)
        
    if abs((n + 1.) * tens - x) <= abs(n * tens - x):
        n = n + 1
        
    if n >= math.pow(10, p):
        n = n / 10.
        e = e + 1
        
    m = "%.*g" % (p, n)
    
    if e < -2 or e >= p:
        out.append(m[0])
        if p > 1:
            out.append(".")
            out.extend(m[1:p])
        out.append('e')
        if e > 0:
            out.append("+")
        out.append(str(e))
    elif e == (p - 1):
        out.append(m)
    elif e >= 0:
        out.append(m[:e+1])
        if (e + 1) < len(m):
            out.append(".")
            out.extend(m[e+1:])
    else:
        out.append("0.")
        out.extend(["0"]*-(e+1))
        out.append(m)
        
    return "".join(out)
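To check how many significant figures a csv actually stores, the values have to be read as text, since `pd.read_csv` otherwise converts them to floats and any trailing zeros are lost. A sketch of counting the digits as written, using `decimal.Decimal` (the helper `count_sig_figs` and the example rows are illustrative):

```python
import io
from decimal import Decimal

import pandas as pd

def count_sig_figs(text):
    """Count the significant figures in a number as it is written."""
    # Decimal keeps trailing zeros and drops leading zeros, so the digit
    # tuple holds the significant figures of the written value.
    return len(Decimal(text.strip()).as_tuple().digits)

# dtype=str keeps the values exactly as written in the file
raw = pd.read_csv(
    io.StringIO("name,value\nCC_pa0,0.217212\nCC_ptcb0,24.938000\nCC_g,-9.868872e-01\n"),
    dtype=str)
raw["sig figs"] = raw["value"].apply(count_sig_figs)
print(raw)
```

Note that this counts trailing zeros of whole numbers as significant (`"100"` gives 3), which is ambiguous in general but not an issue for coefficients written in scientific notation.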