# Data Pipeline for C100 Fault Classification 
_December 17, 2019_

In [1]:
# used to get an idea of run time
from datetime import datetime
startTime = datetime.now()

The purpose of this notebook is to put together a rough implementation of a data pipeline for CEBAF C100 cavity and fault detection. Two questions need to be addressed:
<br>

  - *which cavity tripped first?* <br>
  - *which type of fault caused the trip?*

This implementation uses machine learning (as opposed to deep learning), and its main drawback is that feature extraction and selection are computationally expensive. Nevertheless, it provides the opportunity to implement a functional pipeline on a timescale that can be used for the summer 2019 running period.

## Which Cavity?

At the onset of a cavity trip, the waveform harvester will collect and write data to file. The data collected will be 17 RF signals for each of the 8 cavities in the offending C100 cryomodule. To make the feature engineering manageable (i.e. less computationally expensive) we do the following:

- keep only 4 of the 17 signals for each of the 8 cavities
    - based on analysis from subject matter experts, the most relevant signals are (GASK, GMES, CRFP, DETA2)
- for each signal use `tsfresh` to compute a subset of available features
    - specifically, compute only the `fft_coefficients` for each signal

### Reading Data

At present I am mimicking the file structure for the data as it appears in `M:\asd\asddata\FCCWaveforms\Spring 2018\rf`. Therefore the script to read in the data will need to be modified.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import glob
from pathlib import Path

Create a `dictionary` to use consistent nomenclature for the cryomodule name.

In [3]:
cav_dict = {'0L04': 'R04', '1L22': 'R1M', '1L23': 'R1N', '1L24': 'R1O', '1L25': 'R1P',
            '1L26': 'R1Q', '2L22': 'R2M', '2L23': 'R2N', '2L24': 'R2O', '2L25': 'R2P', '2L26': 'R2Q'}

cavity_df = pd.DataFrame()
module_path = Path('C:/Users/tennant/Desktop/rfw_tsf_extractor-Spring-2018/waveform-data/rf')
dir = Path('C:/Users/tennant/Desktop/rfw_tsf_extractor-Spring-2018/labeled-examples')

For developing the machine learning models, a list of `fault_*.txt` files was read in. Each file was generated by a subject matter expert and contains a list of trip events. For each event, the file gives the cavity label, the fault label, and the timestamp associated with the fault. As a test of the system, we read in a similar file containing only one event and ignore the labels.

In [4]:
filelist = ['example.txt']

The code below is ugly but the idea is that after reading in and parsing the timestamp from the file above, it searches the directories and collects the appropriate cavity signals and stores the labels (the cavity number which tripped). With the real-time system we will not have labels (all actions associated with `y` have been commented out).

In [5]:
#y = pd.Series([])

k = 0
for i in filelist:
    data_file_path = dir/i # IMPORTANT to add sub-directories to `module_path`

    log = pd.read_table(data_file_path, sep='\t')

    m, n = log.shape
    
    for j in range(0, m):
        k += 1
        date, time = log.time[j].split(" ", 1)
        
        # formatting the timestamp in the .txt file to match the format used in the filenames
        date_format = date.replace("/", "_")
        time_format = time.replace(":", "")

        list1 = [time_format, '.', '?']

        ct = os.path.join(module_path, log.zone[j], date_format, "".join(list1))
        
        # output of glob.glob() is a list; rather than convert to string and remove brackets and quotes, select first element
        dir1 = glob.glob(ct)
        dir2 = os.listdir(dir1[0]) # list of filenames in the directory
    
        module_df = pd.DataFrame()
        
        # only read in data if all 8 cavity files are present in directory
        if len(dir2) == 8:

            for m in range(0,8):
                f = os.path.join(dir1[0], dir2[m])
                df = pd.read_table(f, sep='\t')
                sLength = len(df['Time'])
                tStep = (df.Time[2] - df.Time[1]) # in milliseconds
                df['id'] = pd.Series(k, index=df.index)
                col = ['Time', 
                      f'{m+1}_IMES', f'{m+1}_QMES', f'{m+1}_GMES', f'{m+1}_PMES', f'{m+1}_IASK', f'{m+1}_QASK', 
                      f'{m+1}_GASK', f'{m+1}_PASK', f'{m+1}_CRFP', f'{m+1}_CRFPP', f'{m+1}_CRRP', f'{m+1}_CRRPP', 
                      f'{m+1}_GLDE', f'{m+1}_PLDE', f'{m+1}_DETA2_', f'{m+1}_CFQE2_', f'{m+1}_DFQES',
                      'id']
                df.columns=col
                module_df = pd.concat([module_df, df], axis=1, sort=False)
                #print("Path: {0:s}".format(f))
                print("j = {0:}, k = {1:}, input shape = {2:}, cavity shape = {3:}, time step = {4:3.2f} ms".format(j, k, df.shape, module_df.shape, tStep))
            module_df = module_df.loc[:,~module_df.columns.duplicated()]
            #y_tmp = pd.Series(log.cavity[j], index=[k])
            #y = y.append(y_tmp)
            
        else:
            print("Directory did not contain data files for all 8 cavities in the zone.")

        cavity_df = pd.concat([cavity_df, module_df])  # DataFrame.append was removed in pandas 2.0

j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 19), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 38), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 57), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 76), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 95), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 114), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 133), time step = 0.20 ms
j = 0, k = 1, input shape = (8192, 19), cavity shape = (8192, 152), time step = 0.20 ms
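
The timestamp-to-directory step in the cell above can be isolated into a small helper. The following is a minimal sketch, not part of the pipeline itself: `event_dir_pattern` is a hypothetical name, and the timestamp format (`2018/05/04 04:48:22`) is an assumption inferred from the directory names that appear in this notebook.

```python
from pathlib import Path


def event_dir_pattern(root, zone, timestamp):
    """Return a glob pattern for the directory holding one trip event.

    `timestamp` is assumed to look like "2018/05/04 04:48:22" (an assumption
    about the log format); event directories are assumed to be named
    <root>/<zone>/<YYYY_MM_DD>/<HHMMSS>.<digit>.
    """
    date, time = timestamp.split(" ", 1)
    date_part = date.replace("/", "_")   # 2018/05/04 -> 2018_05_04
    time_part = time.replace(":", "")    # 04:48:22   -> 044822
    # "?" matches the single trailing digit after the dot (e.g. 044822.5)
    return str(Path(root) / zone / date_part / f"{time_part}.?")


pattern = event_dir_pattern("waveform-data/rf", "1L24", "2018/05/04 04:48:22")
print(pattern)
```

Passing the resulting string to `glob.glob` returns a list, which is why the cell above selects the first element.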


To reduce computational burden, we select only the (GMES, GASK, CRFP, DETA2) signals from each cavity (i.e. `<cavID>_<signal>`).

In [6]:
sel_col = ["Time", "id", "1_GMES", "1_GASK", "1_CRFP", "1_DETA2_", "2_GMES", "2_GASK", "2_CRFP", "2_DETA2_", 
           "3_GMES", "3_GASK", "3_CRFP", "3_DETA2_", "4_GMES", "4_GASK", "4_CRFP", "4_DETA2_", 
           "5_GMES", "5_GASK", "5_CRFP", "5_DETA2_", "6_GMES", "6_GASK", "6_CRFP", "6_DETA2_", 
           "7_GMES", "7_GASK", "7_CRFP", "7_DETA2_", "8_GMES", "8_GASK", "8_CRFP", "8_DETA2_"]

In [7]:
cavity_df = cavity_df[sel_col]
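
The column list above follows a regular `<cavID>_<signal>` pattern, so it could also be generated rather than typed out. A minimal sketch, assuming the same four signals and eight cavities; the names `SIGNALS`, `N_CAVITIES`, and `sel_col_generated` are introduced here for illustration only.

```python
SIGNALS = ["GMES", "GASK", "CRFP", "DETA2_"]  # DETA2 still carries a trailing underscore at this stage
N_CAVITIES = 8

# "Time" and "id" are bookkeeping columns; the rest are <cavID>_<signal>
sel_col_generated = ["Time", "id"] + [
    f"{cav}_{sig}" for cav in range(1, N_CAVITIES + 1) for sig in SIGNALS
]

print(len(sel_col_generated))  # 2 bookkeeping columns + 8 cavities x 4 signals = 34
```

This reduces the 17 x 8 = 136 available signals to 4 x 8 = 32 before any features are computed.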

In [8]:
cavity_df.head()

Unnamed: 0,Time,id,1_GMES,1_GASK,1_CRFP,1_DETA2_,2_GMES,2_GASK,2_CRFP,2_DETA2_,...,6_CRFP,6_DETA2_,7_GMES,7_GASK,7_CRFP,7_DETA2_,8_GMES,8_GASK,8_CRFP,8_DETA2_
0,-102.4,1,0,0.861206,0.00185,159.994,12.0961,2.724,1.01394,-22.8955,...,1.93018,7.96509,15.4987,1.98059,1.44703,0.774536,13.5022,2.20428,0.966795,1.86218
1,-102.2,1,0,0.861206,0.001576,167.146,12.0926,2.6181,1.01266,-23.725,...,1.90401,6.3446,15.4992,2.18506,1.75664,2.38953,13.4987,2.20551,0.725554,-0.730591
2,-102.0,1,0,0.861206,0.001925,174.221,12.093,3.01453,1.16815,-23.0988,...,1.90122,7.9541,15.4996,2.03247,1.56444,0.856934,13.504,2.19269,0.870981,-0.928345
3,-101.8,1,0,0.861206,0.001875,177.874,12.0965,2.56287,1.07656,-23.5382,...,1.75819,6.97632,15.5001,1.94489,1.52809,1.30737,13.5004,2.25037,0.907622,-0.911865
4,-101.6,1,0,0.861206,0.001875,167.069,12.0987,2.65076,0.94003,-22.8021,...,1.97692,8.09143,15.5015,1.9989,1.61412,0.785522,13.499,2.47467,0.870981,-0.258179


In [9]:
cavity_df = cavity_df.rename(columns={"1_DETA2_":"1_DETA2", "2_DETA2_":"2_DETA2", "3_DETA2_":"3_DETA2", "4_DETA2_":"4_DETA2",
                          "5_DETA2_":"5_DETA2", "6_DETA2_":"6_DETA2", "7_DETA2_":"7_DETA2", "8_DETA2_":"8_DETA2"});

A few diagnostic checks on the data.

In [10]:
print("The input file has", cavity_df.shape[0], "rows and", cavity_df.shape[1], "columns")
print("The input file has", cavity_df.shape[0]/8192, "samples")

The input file has 8192 rows and 34 columns
The input file has 1.0 samples


### Feature Extraction

Import relevant `tsfresh` modules.

In [11]:
import tsfresh
from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.feature_extraction import ComprehensiveFCParameters, EfficientFCParameters, MinimalFCParameters
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import settings

In [12]:
print(tsfresh.__version__)

0.13.0


We define the custom set of features to compute for each signal. Based on previous analysis using the full set of available features, the `fft_coefficients` were found to have the greatest importance for determining which cavity faulted. Each signal will have 400 FFT features computed: the real part, imaginary part, absolute value, and angle of the first 100 coefficients.

In [13]:
fft_only = {'fft_coefficient': 
 [{'attr': 'real', 'coeff': 0},
  {'attr': 'real', 'coeff': 1},
  {'attr': 'real', 'coeff': 2},
  {'attr': 'real', 'coeff': 3},
  {'attr': 'real', 'coeff': 4},
  {'attr': 'real', 'coeff': 5},
  {'attr': 'real', 'coeff': 6},
  {'attr': 'real', 'coeff': 7},
  {'attr': 'real', 'coeff': 8},
  {'attr': 'real', 'coeff': 9},
  {'attr': 'real', 'coeff': 10},
  {'attr': 'real', 'coeff': 11},
  {'attr': 'real', 'coeff': 12},
  {'attr': 'real', 'coeff': 13},
  {'attr': 'real', 'coeff': 14},
  {'attr': 'real', 'coeff': 15},
  {'attr': 'real', 'coeff': 16},
  {'attr': 'real', 'coeff': 17},
  {'attr': 'real', 'coeff': 18},
  {'attr': 'real', 'coeff': 19},
  {'attr': 'real', 'coeff': 20},
  {'attr': 'real', 'coeff': 21},
  {'attr': 'real', 'coeff': 22},
  {'attr': 'real', 'coeff': 23},
  {'attr': 'real', 'coeff': 24},
  {'attr': 'real', 'coeff': 25},
  {'attr': 'real', 'coeff': 26},
  {'attr': 'real', 'coeff': 27},
  {'attr': 'real', 'coeff': 28},
  {'attr': 'real', 'coeff': 29},
  {'attr': 'real', 'coeff': 30},
  {'attr': 'real', 'coeff': 31},
  {'attr': 'real', 'coeff': 32},
  {'attr': 'real', 'coeff': 33},
  {'attr': 'real', 'coeff': 34},
  {'attr': 'real', 'coeff': 35},
  {'attr': 'real', 'coeff': 36},
  {'attr': 'real', 'coeff': 37},
  {'attr': 'real', 'coeff': 38},
  {'attr': 'real', 'coeff': 39},
  {'attr': 'real', 'coeff': 40},
  {'attr': 'real', 'coeff': 41},
  {'attr': 'real', 'coeff': 42},
  {'attr': 'real', 'coeff': 43},
  {'attr': 'real', 'coeff': 44},
  {'attr': 'real', 'coeff': 45},
  {'attr': 'real', 'coeff': 46},
  {'attr': 'real', 'coeff': 47},
  {'attr': 'real', 'coeff': 48},
  {'attr': 'real', 'coeff': 49},
  {'attr': 'real', 'coeff': 50},
  {'attr': 'real', 'coeff': 51},
  {'attr': 'real', 'coeff': 52},
  {'attr': 'real', 'coeff': 53},
  {'attr': 'real', 'coeff': 54},
  {'attr': 'real', 'coeff': 55},
  {'attr': 'real', 'coeff': 56},
  {'attr': 'real', 'coeff': 57},
  {'attr': 'real', 'coeff': 58},
  {'attr': 'real', 'coeff': 59},
  {'attr': 'real', 'coeff': 60},
  {'attr': 'real', 'coeff': 61},
  {'attr': 'real', 'coeff': 62},
  {'attr': 'real', 'coeff': 63},
  {'attr': 'real', 'coeff': 64},
  {'attr': 'real', 'coeff': 65},
  {'attr': 'real', 'coeff': 66},
  {'attr': 'real', 'coeff': 67},
  {'attr': 'real', 'coeff': 68},
  {'attr': 'real', 'coeff': 69},
  {'attr': 'real', 'coeff': 70},
  {'attr': 'real', 'coeff': 71},
  {'attr': 'real', 'coeff': 72},
  {'attr': 'real', 'coeff': 73},
  {'attr': 'real', 'coeff': 74},
  {'attr': 'real', 'coeff': 75},
  {'attr': 'real', 'coeff': 76},
  {'attr': 'real', 'coeff': 77},
  {'attr': 'real', 'coeff': 78},
  {'attr': 'real', 'coeff': 79},
  {'attr': 'real', 'coeff': 80},
  {'attr': 'real', 'coeff': 81},
  {'attr': 'real', 'coeff': 82},
  {'attr': 'real', 'coeff': 83},
  {'attr': 'real', 'coeff': 84},
  {'attr': 'real', 'coeff': 85},
  {'attr': 'real', 'coeff': 86},
  {'attr': 'real', 'coeff': 87},
  {'attr': 'real', 'coeff': 88},
  {'attr': 'real', 'coeff': 89},
  {'attr': 'real', 'coeff': 90},
  {'attr': 'real', 'coeff': 91},
  {'attr': 'real', 'coeff': 92},
  {'attr': 'real', 'coeff': 93},
  {'attr': 'real', 'coeff': 94},
  {'attr': 'real', 'coeff': 95},
  {'attr': 'real', 'coeff': 96},
  {'attr': 'real', 'coeff': 97},
  {'attr': 'real', 'coeff': 98},
  {'attr': 'real', 'coeff': 99},
  {'attr': 'imag', 'coeff': 0},
  {'attr': 'imag', 'coeff': 1},
  {'attr': 'imag', 'coeff': 2},
  {'attr': 'imag', 'coeff': 3},
  {'attr': 'imag', 'coeff': 4},
  {'attr': 'imag', 'coeff': 5},
  {'attr': 'imag', 'coeff': 6},
  {'attr': 'imag', 'coeff': 7},
  {'attr': 'imag', 'coeff': 8},
  {'attr': 'imag', 'coeff': 9},
  {'attr': 'imag', 'coeff': 10},
  {'attr': 'imag', 'coeff': 11},
  {'attr': 'imag', 'coeff': 12},
  {'attr': 'imag', 'coeff': 13},
  {'attr': 'imag', 'coeff': 14},
  {'attr': 'imag', 'coeff': 15},
  {'attr': 'imag', 'coeff': 16},
  {'attr': 'imag', 'coeff': 17},
  {'attr': 'imag', 'coeff': 18},
  {'attr': 'imag', 'coeff': 19},
  {'attr': 'imag', 'coeff': 20},
  {'attr': 'imag', 'coeff': 21},
  {'attr': 'imag', 'coeff': 22},
  {'attr': 'imag', 'coeff': 23},
  {'attr': 'imag', 'coeff': 24},
  {'attr': 'imag', 'coeff': 25},
  {'attr': 'imag', 'coeff': 26},
  {'attr': 'imag', 'coeff': 27},
  {'attr': 'imag', 'coeff': 28},
  {'attr': 'imag', 'coeff': 29},
  {'attr': 'imag', 'coeff': 30},
  {'attr': 'imag', 'coeff': 31},
  {'attr': 'imag', 'coeff': 32},
  {'attr': 'imag', 'coeff': 33},
  {'attr': 'imag', 'coeff': 34},
  {'attr': 'imag', 'coeff': 35},
  {'attr': 'imag', 'coeff': 36},
  {'attr': 'imag', 'coeff': 37},
  {'attr': 'imag', 'coeff': 38},
  {'attr': 'imag', 'coeff': 39},
  {'attr': 'imag', 'coeff': 40},
  {'attr': 'imag', 'coeff': 41},
  {'attr': 'imag', 'coeff': 42},
  {'attr': 'imag', 'coeff': 43},
  {'attr': 'imag', 'coeff': 44},
  {'attr': 'imag', 'coeff': 45},
  {'attr': 'imag', 'coeff': 46},
  {'attr': 'imag', 'coeff': 47},
  {'attr': 'imag', 'coeff': 48},
  {'attr': 'imag', 'coeff': 49},
  {'attr': 'imag', 'coeff': 50},
  {'attr': 'imag', 'coeff': 51},
  {'attr': 'imag', 'coeff': 52},
  {'attr': 'imag', 'coeff': 53},
  {'attr': 'imag', 'coeff': 54},
  {'attr': 'imag', 'coeff': 55},
  {'attr': 'imag', 'coeff': 56},
  {'attr': 'imag', 'coeff': 57},
  {'attr': 'imag', 'coeff': 58},
  {'attr': 'imag', 'coeff': 59},
  {'attr': 'imag', 'coeff': 60},
  {'attr': 'imag', 'coeff': 61},
  {'attr': 'imag', 'coeff': 62},
  {'attr': 'imag', 'coeff': 63},
  {'attr': 'imag', 'coeff': 64},
  {'attr': 'imag', 'coeff': 65},
  {'attr': 'imag', 'coeff': 66},
  {'attr': 'imag', 'coeff': 67},
  {'attr': 'imag', 'coeff': 68},
  {'attr': 'imag', 'coeff': 69},
  {'attr': 'imag', 'coeff': 70},
  {'attr': 'imag', 'coeff': 71},
  {'attr': 'imag', 'coeff': 72},
  {'attr': 'imag', 'coeff': 73},
  {'attr': 'imag', 'coeff': 74},
  {'attr': 'imag', 'coeff': 75},
  {'attr': 'imag', 'coeff': 76},
  {'attr': 'imag', 'coeff': 77},
  {'attr': 'imag', 'coeff': 78},
  {'attr': 'imag', 'coeff': 79},
  {'attr': 'imag', 'coeff': 80},
  {'attr': 'imag', 'coeff': 81},
  {'attr': 'imag', 'coeff': 82},
  {'attr': 'imag', 'coeff': 83},
  {'attr': 'imag', 'coeff': 84},
  {'attr': 'imag', 'coeff': 85},
  {'attr': 'imag', 'coeff': 86},
  {'attr': 'imag', 'coeff': 87},
  {'attr': 'imag', 'coeff': 88},
  {'attr': 'imag', 'coeff': 89},
  {'attr': 'imag', 'coeff': 90},
  {'attr': 'imag', 'coeff': 91},
  {'attr': 'imag', 'coeff': 92},
  {'attr': 'imag', 'coeff': 93},
  {'attr': 'imag', 'coeff': 94},
  {'attr': 'imag', 'coeff': 95},
  {'attr': 'imag', 'coeff': 96},
  {'attr': 'imag', 'coeff': 97},
  {'attr': 'imag', 'coeff': 98},
  {'attr': 'imag', 'coeff': 99},
  {'attr': 'abs', 'coeff': 0},
  {'attr': 'abs', 'coeff': 1},
  {'attr': 'abs', 'coeff': 2},
  {'attr': 'abs', 'coeff': 3},
  {'attr': 'abs', 'coeff': 4},
  {'attr': 'abs', 'coeff': 5},
  {'attr': 'abs', 'coeff': 6},
  {'attr': 'abs', 'coeff': 7},
  {'attr': 'abs', 'coeff': 8},
  {'attr': 'abs', 'coeff': 9},
  {'attr': 'abs', 'coeff': 10},
  {'attr': 'abs', 'coeff': 11},
  {'attr': 'abs', 'coeff': 12},
  {'attr': 'abs', 'coeff': 13},
  {'attr': 'abs', 'coeff': 14},
  {'attr': 'abs', 'coeff': 15},
  {'attr': 'abs', 'coeff': 16},
  {'attr': 'abs', 'coeff': 17},
  {'attr': 'abs', 'coeff': 18},
  {'attr': 'abs', 'coeff': 19},
  {'attr': 'abs', 'coeff': 20},
  {'attr': 'abs', 'coeff': 21},
  {'attr': 'abs', 'coeff': 22},
  {'attr': 'abs', 'coeff': 23},
  {'attr': 'abs', 'coeff': 24},
  {'attr': 'abs', 'coeff': 25},
  {'attr': 'abs', 'coeff': 26},
  {'attr': 'abs', 'coeff': 27},
  {'attr': 'abs', 'coeff': 28},
  {'attr': 'abs', 'coeff': 29},
  {'attr': 'abs', 'coeff': 30},
  {'attr': 'abs', 'coeff': 31},
  {'attr': 'abs', 'coeff': 32},
  {'attr': 'abs', 'coeff': 33},
  {'attr': 'abs', 'coeff': 34},
  {'attr': 'abs', 'coeff': 35},
  {'attr': 'abs', 'coeff': 36},
  {'attr': 'abs', 'coeff': 37},
  {'attr': 'abs', 'coeff': 38},
  {'attr': 'abs', 'coeff': 39},
  {'attr': 'abs', 'coeff': 40},
  {'attr': 'abs', 'coeff': 41},
  {'attr': 'abs', 'coeff': 42},
  {'attr': 'abs', 'coeff': 43},
  {'attr': 'abs', 'coeff': 44},
  {'attr': 'abs', 'coeff': 45},
  {'attr': 'abs', 'coeff': 46},
  {'attr': 'abs', 'coeff': 47},
  {'attr': 'abs', 'coeff': 48},
  {'attr': 'abs', 'coeff': 49},
  {'attr': 'abs', 'coeff': 50},
  {'attr': 'abs', 'coeff': 51},
  {'attr': 'abs', 'coeff': 52},
  {'attr': 'abs', 'coeff': 53},
  {'attr': 'abs', 'coeff': 54},
  {'attr': 'abs', 'coeff': 55},
  {'attr': 'abs', 'coeff': 56},
  {'attr': 'abs', 'coeff': 57},
  {'attr': 'abs', 'coeff': 58},
  {'attr': 'abs', 'coeff': 59},
  {'attr': 'abs', 'coeff': 60},
  {'attr': 'abs', 'coeff': 61},
  {'attr': 'abs', 'coeff': 62},
  {'attr': 'abs', 'coeff': 63},
  {'attr': 'abs', 'coeff': 64},
  {'attr': 'abs', 'coeff': 65},
  {'attr': 'abs', 'coeff': 66},
  {'attr': 'abs', 'coeff': 67},
  {'attr': 'abs', 'coeff': 68},
  {'attr': 'abs', 'coeff': 69},
  {'attr': 'abs', 'coeff': 70},
  {'attr': 'abs', 'coeff': 71},
  {'attr': 'abs', 'coeff': 72},
  {'attr': 'abs', 'coeff': 73},
  {'attr': 'abs', 'coeff': 74},
  {'attr': 'abs', 'coeff': 75},
  {'attr': 'abs', 'coeff': 76},
  {'attr': 'abs', 'coeff': 77},
  {'attr': 'abs', 'coeff': 78},
  {'attr': 'abs', 'coeff': 79},
  {'attr': 'abs', 'coeff': 80},
  {'attr': 'abs', 'coeff': 81},
  {'attr': 'abs', 'coeff': 82},
  {'attr': 'abs', 'coeff': 83},
  {'attr': 'abs', 'coeff': 84},
  {'attr': 'abs', 'coeff': 85},
  {'attr': 'abs', 'coeff': 86},
  {'attr': 'abs', 'coeff': 87},
  {'attr': 'abs', 'coeff': 88},
  {'attr': 'abs', 'coeff': 89},
  {'attr': 'abs', 'coeff': 90},
  {'attr': 'abs', 'coeff': 91},
  {'attr': 'abs', 'coeff': 92},
  {'attr': 'abs', 'coeff': 93},
  {'attr': 'abs', 'coeff': 94},
  {'attr': 'abs', 'coeff': 95},
  {'attr': 'abs', 'coeff': 96},
  {'attr': 'abs', 'coeff': 97},
  {'attr': 'abs', 'coeff': 98},
  {'attr': 'abs', 'coeff': 99},
  {'attr': 'angle', 'coeff': 0},
  {'attr': 'angle', 'coeff': 1},
  {'attr': 'angle', 'coeff': 2},
  {'attr': 'angle', 'coeff': 3},
  {'attr': 'angle', 'coeff': 4},
  {'attr': 'angle', 'coeff': 5},
  {'attr': 'angle', 'coeff': 6},
  {'attr': 'angle', 'coeff': 7},
  {'attr': 'angle', 'coeff': 8},
  {'attr': 'angle', 'coeff': 9},
  {'attr': 'angle', 'coeff': 10},
  {'attr': 'angle', 'coeff': 11},
  {'attr': 'angle', 'coeff': 12},
  {'attr': 'angle', 'coeff': 13},
  {'attr': 'angle', 'coeff': 14},
  {'attr': 'angle', 'coeff': 15},
  {'attr': 'angle', 'coeff': 16},
  {'attr': 'angle', 'coeff': 17},
  {'attr': 'angle', 'coeff': 18},
  {'attr': 'angle', 'coeff': 19},
  {'attr': 'angle', 'coeff': 20},
  {'attr': 'angle', 'coeff': 21},
  {'attr': 'angle', 'coeff': 22},
  {'attr': 'angle', 'coeff': 23},
  {'attr': 'angle', 'coeff': 24},
  {'attr': 'angle', 'coeff': 25},
  {'attr': 'angle', 'coeff': 26},
  {'attr': 'angle', 'coeff': 27},
  {'attr': 'angle', 'coeff': 28},
  {'attr': 'angle', 'coeff': 29},
  {'attr': 'angle', 'coeff': 30},
  {'attr': 'angle', 'coeff': 31},
  {'attr': 'angle', 'coeff': 32},
  {'attr': 'angle', 'coeff': 33},
  {'attr': 'angle', 'coeff': 34},
  {'attr': 'angle', 'coeff': 35},
  {'attr': 'angle', 'coeff': 36},
  {'attr': 'angle', 'coeff': 37},
  {'attr': 'angle', 'coeff': 38},
  {'attr': 'angle', 'coeff': 39},
  {'attr': 'angle', 'coeff': 40},
  {'attr': 'angle', 'coeff': 41},
  {'attr': 'angle', 'coeff': 42},
  {'attr': 'angle', 'coeff': 43},
  {'attr': 'angle', 'coeff': 44},
  {'attr': 'angle', 'coeff': 45},
  {'attr': 'angle', 'coeff': 46},
  {'attr': 'angle', 'coeff': 47},
  {'attr': 'angle', 'coeff': 48},
  {'attr': 'angle', 'coeff': 49},
  {'attr': 'angle', 'coeff': 50},
  {'attr': 'angle', 'coeff': 51},
  {'attr': 'angle', 'coeff': 52},
  {'attr': 'angle', 'coeff': 53},
  {'attr': 'angle', 'coeff': 54},
  {'attr': 'angle', 'coeff': 55},
  {'attr': 'angle', 'coeff': 56},
  {'attr': 'angle', 'coeff': 57},
  {'attr': 'angle', 'coeff': 58},
  {'attr': 'angle', 'coeff': 59},
  {'attr': 'angle', 'coeff': 60},
  {'attr': 'angle', 'coeff': 61},
  {'attr': 'angle', 'coeff': 62},
  {'attr': 'angle', 'coeff': 63},
  {'attr': 'angle', 'coeff': 64},
  {'attr': 'angle', 'coeff': 65},
  {'attr': 'angle', 'coeff': 66},
  {'attr': 'angle', 'coeff': 67},
  {'attr': 'angle', 'coeff': 68},
  {'attr': 'angle', 'coeff': 69},
  {'attr': 'angle', 'coeff': 70},
  {'attr': 'angle', 'coeff': 71},
  {'attr': 'angle', 'coeff': 72},
  {'attr': 'angle', 'coeff': 73},
  {'attr': 'angle', 'coeff': 74},
  {'attr': 'angle', 'coeff': 75},
  {'attr': 'angle', 'coeff': 76},
  {'attr': 'angle', 'coeff': 77},
  {'attr': 'angle', 'coeff': 78},
  {'attr': 'angle', 'coeff': 79},
  {'attr': 'angle', 'coeff': 80},
  {'attr': 'angle', 'coeff': 81},
  {'attr': 'angle', 'coeff': 82},
  {'attr': 'angle', 'coeff': 83},
  {'attr': 'angle', 'coeff': 84},
  {'attr': 'angle', 'coeff': 85},
  {'attr': 'angle', 'coeff': 86},
  {'attr': 'angle', 'coeff': 87},
  {'attr': 'angle', 'coeff': 88},
  {'attr': 'angle', 'coeff': 89},
  {'attr': 'angle', 'coeff': 90},
  {'attr': 'angle', 'coeff': 91},
  {'attr': 'angle', 'coeff': 92},
  {'attr': 'angle', 'coeff': 93},
  {'attr': 'angle', 'coeff': 94},
  {'attr': 'angle', 'coeff': 95},
  {'attr': 'angle', 'coeff': 96},
  {'attr': 'angle', 'coeff': 97},
  {'attr': 'angle', 'coeff': 98},
  {'attr': 'angle', 'coeff': 99}]
}
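
The same settings dictionary can be built programmatically, which makes the structure (4 attributes x 100 coefficients) easier to see and to change. A minimal sketch; `ATTRS`, `N_COEFFS`, and `fft_only_generated` are names introduced here for illustration.

```python
ATTRS = ("real", "imag", "abs", "angle")
N_COEFFS = 100

# One entry per (attribute, coefficient) pair, in the same order as the literal above
fft_only_generated = {
    "fft_coefficient": [
        {"attr": attr, "coeff": coeff}
        for attr in ATTRS
        for coeff in range(N_COEFFS)
    ]
}

n_per_signal = len(fft_only_generated["fft_coefficient"])  # 4 * 100
n_signals = 8 * 4                                          # cavities x signals kept
print(n_per_signal, n_per_signal * n_signals)              # 400 12800
```

The second number is the width of the feature matrix for one trip event.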

This step extracts features but performs no selection. `tsfresh` also has a selection procedure that keeps only the features it deems important, but it is fairly time-consuming, so we do not invoke it here.

In [14]:
extraction_settings = fft_only

%time X_cavity = extract_features(cavity_df.astype('float64'), column_id="id", column_sort="Time", \
                           impute_function=impute, default_fc_parameters=extraction_settings);

Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:01<00:00,  9.45it/s]


Wall time: 3.13 s


For a single trip event (4 signals/cavity for 8 cavities) `tsfresh` takes about 3 seconds on a CPU (3.13 s wall time in the run above) to compute 12,800 `fft_coefficients`.

Create `X_cavity_master` that will be used as inputs to the model (i.e. the feature matrix).

In [15]:
X_cavity_master = X_cavity

A few diagnostic checks.

In [16]:
print("Number of training examples: {}".format(X_cavity_master.shape[0]))
print("Number of features: {}".format(X_cavity_master.shape[1]))

Number of training examples: 1
Number of features: 12800


### Data Pre-Processing

Use `impute` from `tsfresh` to replace all `NaNs` and `infs` in the `DataFrame` with average/extreme values from the same columns: `NaN` becomes the column median, `-inf` the column minimum, and `+inf` the column maximum.

In [17]:
impute(X_cavity_master);
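
To make the replacement rule concrete, here is a minimal NumPy sketch of the same idea applied to a single column. It is an illustration of the rule, not the `tsfresh` implementation; `impute_column` is a hypothetical helper, and the all-zeros fallback for a column without finite values is an assumption of this sketch.

```python
import numpy as np


def impute_column(values):
    """Replace NaN with the median and -inf/+inf with the min/max of the finite entries."""
    col = np.asarray(values, dtype=float).copy()
    finite = col[np.isfinite(col)]
    if finite.size == 0:
        # Assumption for this sketch: a column with no finite values becomes all zeros
        return np.zeros_like(col)
    col[np.isnan(col)] = np.median(finite)
    col[np.isposinf(col)] = finite.max()
    col[np.isneginf(col)] = finite.min()
    return col


print(impute_column([1.0, np.nan, 3.0, np.inf, -np.inf, 8.0]))
```

With a single trip event there is only one row per column, so a non-finite feature has no finite neighbours to borrow from; this is one reason to check the feature matrix before prediction.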

Standardizing input data is crucial for neural networks, but less so for tree-based models. For completeness, however, we standardize the input data using the saved `mean` and `standard deviation` from the training data for the `RF_CAVITY_fft_only_data` model.

In [18]:
X_cavity_mean = np.load('RF_CAVITY_fft_only_data_mean.npy')
X_cavity_var  = np.load('RF_CAVITY_fft_only_data_var.npy')

In [19]:
X_cavity_master = (X_cavity_master - X_cavity_mean) / X_cavity_var
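
Standardization (z-scoring) subtracts the training mean and divides by the training standard deviation, which is the square root of the variance. The saved files above are named `*_var.npy`; whether they hold variances or standard deviations is not stated in this notebook, so it is worth confirming before relying on the scaling. The sketch below uses made-up numbers purely to show the relationship.

```python
import numpy as np

# Made-up training matrix: 4 examples x 2 features (illustrative values only)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])

train_mean = X_train.mean(axis=0)
train_var = X_train.var(axis=0)
train_std = np.sqrt(train_var)      # standard deviation = sqrt(variance)

# A new example is scaled with the *training* statistics, never its own
x_new = np.array([[2.5, 50.0]])
z_new = (x_new - train_mean) / train_std
print(z_new)
```

Dividing by the variance instead would still give a deterministic transform, but the result would only match the training-time scaling if the model had been trained the same way.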

Use `joblib` to import previously saved model for determining which cavity tripped. The model is based on a `RandomForestClassifier` using the following parameters found using `GridSearchCV`:
- `n_estimators` = X
- `max_depth` = X
- `min_samples_split` = X
- `max_features` = None

In [20]:
import joblib  # sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23

RF_cavity = joblib.load('RF_CAVITY_fft_only_data_12172019.sav')

In [21]:
cavityID = RF_cavity.predict(X_cavity_master)
cavityID_prob = RF_cavity.predict_proba(X_cavity_master)
cavityID_str = cavityID.astype(str)[0]
print("Cavity", cavityID_str, "was the first to go unstable")
ID_confidence = float(cavityID_prob[0].max() * 100)  # the predicted class is the highest-probability column
print("Confidence level is:", ID_confidence,"%")

Cavity 6 was the first to go unstable
Confidence level is: 99.0 %
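
A note on reading `predict_proba`: scikit-learn orders the probability columns by the classifier's `classes_` attribute, so a class label is not necessarily a valid column index. The sketch below shows the lookup with hypothetical cavity labels 1 to 8; the actual `classes_` of the saved model are not shown in this notebook, and `probability_of_label` is a name introduced here.

```python
def probability_of_label(classes, proba_row, label):
    """Return the probability assigned to `label`.

    `classes` plays the role of a fitted classifier's `classes_` attribute and
    `proba_row` one row of its `predict_proba` output; column i of the row
    belongs to classes[i].
    """
    column = list(classes).index(label)
    return proba_row[column]


# Hypothetical example: cavity labels 1..8, so label 6 lives in column 5
classes = [1, 2, 3, 4, 5, 6, 7, 8]
proba_row = [0.0, 0.0, 0.0, 0.0, 0.01, 0.99, 0.0, 0.0]

print(probability_of_label(classes, proba_row, 6))   # 0.99
print(proba_row[6])                                  # 0.0, the column for label 7
```

Because the predicted class is the one with the highest probability, taking the maximum of the row gives the same confidence without depending on how the labels are numbered.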


## Which Fault?

Once again, to make the feature engineering manageable (i.e. less computationally expensive) we do the following:

- for each of the 17 signals from the first faulted cavity (prediction from previous model) use `tsfresh` to compute a subset of available features
    - specifically, compute only the top contributing features, ranked by the `feature_importances_` of a previously trained model

### Reading Data

Pass `cavityID` - from above - into the script below to load all 17 signals from _only the cavity that tripped_.

In [22]:
fault_df = pd.DataFrame(columns=['Time', 'IMES', 'QMES', 'GMES', 'PMES', 'IASK', 'QASK', 'GASK',
                                  'PASK', 'CRFP', 'CRFPP', 'CRRP', 'CRRPP', 'GLDE', 'PLDE', 'DETA2', 'CFQE2', 'DFQES', 'id'])

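The script below matches waveform files on the first four digits of the timestamp (`time_format[:4]`). In some instances the timestamp changes enough from one cavity to the next that you may need to loosen the match to `time_format[:3]`.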

In [23]:
k = 0

for i in filelist:
    data_file_path = dir/i

    log = pd.read_table(data_file_path, sep='\t')

    m, n = log.shape

    for j in range(0, m):
        k += 1
        date, time = log.time[j].split(" ", 1)

        date_format = date.replace("/", "_")
        time_format = time.replace(":", "")

        list1 = [time_format, '.', '?']

        ct = os.path.join(module_path, log.zone[j], date_format, "".join(list1))

        list2 = (cav_dict[log.zone[j]], cavityID_str, 'WFSharv.',
                 date_format, '_', time_format[:4], '*', '.?.txt')

        filename = "".join(list2)
        #print(filename)

        for f in glob.glob(os.path.join(ct, filename)):
            df = pd.read_table(f, sep='\t')
            sLength = len(df['Time'])
            tStep = (df.Time[2] - df.Time[1]) # in milliseconds
            df['id'] = pd.Series(k, index=df.index)
            df.columns = fault_df.columns
            fault_df = pd.concat([fault_df, df])  # DataFrame.append was removed in pandas 2.0
            print("Path: {0:s}".format(f))
            print("j = {0:}, k = {1:}, input shape = {2:}, fault shape = {3:}, time step = {4:3.2f} ms".format(j, k, df.shape, fault_df.shape, tStep))

Path: C:\Users\tennant\Desktop\rfw_tsf_extractor-Spring-2018\waveform-data\rf\1L24\2018_05_04\044822.5\R1O6WFSharv.2018_05_04_044823.2.txt
j = 0, k = 1, input shape = (8192, 19), fault shape = (8192, 19), time step = 0.20 ms
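
The output above shows why the filename pattern is built from a timestamp prefix: the event directory is stamped `044822` but the file inside is stamped `044823`. The sketch below reproduces the pattern construction with `fnmatch`, which follows the same wildcard rules that `glob` applies to file names. The strings are copied from the output above; `file_pattern` is a hypothetical helper.

```python
from fnmatch import fnmatch

date_format = "2018_05_04"
time_format = "044822"          # second at which the event directory was stamped
actual_file = "R1O6WFSharv.2018_05_04_044823.2.txt"   # file stamped one second later


def file_pattern(prefix_len):
    # Same construction as the cell above, with a configurable prefix length
    return "".join(("R1O", "6", "WFSharv.", date_format, "_",
                    time_format[:prefix_len], "*", ".?.txt"))


for n in (6, 4, 3):
    print(n, file_pattern(n), fnmatch(actual_file, file_pattern(n)))
```

Matching on all six digits misses the file, while the four-digit and three-digit prefixes both find it. A shorter prefix is more tolerant of clock skew but could also match a second event recorded within the same time window.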


In [24]:
print("The input file has", fault_df.shape[0], "rows and", fault_df.shape[1], "columns")
print("The input file has", fault_df.shape[0]/8192, "samples")

The input file has 8192 rows and 19 columns
The input file has 1.0 samples


To reduce the computational load only the *top 50* features - based on previous analysis of a `RandomForestClassifier` - are computed and used as inputs.

In [25]:
top_features = ['GMES__fft_coefficient__coeff_16__attr_"real"',
 'PLDE__fft_coefficient__coeff_16__attr_"real"',
 'GMES__fft_coefficient__coeff_11__attr_"angle"',
 'PMES__index_mass_quantile__q_0.9',
 'CRRP__fft_coefficient__coeff_64__attr_"imag"',
 'CRRP__fft_coefficient__coeff_79__attr_"abs"',
 'CRFP__ar_coefficient__k_10__coeff_1',
 'GMES__fft_coefficient__coeff_63__attr_"imag"',
 'PMES__fft_coefficient__coeff_0__attr_"abs"',
 'CRRP__maximum',
 'PMES__index_mass_quantile__q_0.8',
 'GMES__fft_coefficient__coeff_30__attr_"imag"',
 'GMES__fft_coefficient__coeff_79__attr_"angle"',
 'IMES__change_quantiles__f_agg_"var"__isabs_True__qh_1.0__ql_0.8',
 'IMES__fft_coefficient__coeff_2__attr_"abs"',
 'PLDE__minimum',
 'PLDE__fft_coefficient__coeff_28__attr_"real"',
 'IMES__change_quantiles__f_agg_"var"__isabs_False__qh_1.0__ql_0.8',
 'CRRP__agg_linear_trend__f_agg_"min"__chunk_len_10__attr_"stderr"',
 'DFQES__fft_coefficient__coeff_52__attr_"abs"',
 'CRRP__approximate_entropy__m_2__r_0.5',
 'CRRP__fft_coefficient__coeff_29__attr_"abs"',
 'CRRP__fft_coefficient__coeff_48__attr_"imag"',
 'DFQES__ratio_value_number_to_time_series_length',
 'PMES__number_peaks__n_5',
 'PMES__energy_ratio_by_chunks__num_segments_10__segment_focus_5',
 'PLDE__fft_coefficient__coeff_49__attr_"angle"',
 'IMES__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"slope"',
 'CRRP__agg_linear_trend__f_agg_"var"__chunk_len_10__attr_"intercept"',
 'GASK__ar_coefficient__k_10__coeff_1',
 'CRRP__fft_coefficient__coeff_75__attr_"angle"',
 'IASK__fft_coefficient__coeff_45__attr_"abs"',
 'DETA2__number_peaks__n_3',
 'CRFPP__percentage_of_reoccurring_values_to_all_values',
 'CRRP__fft_coefficient__coeff_64__attr_"angle"',
 'GMES__fft_coefficient__coeff_28__attr_"real"',
 'CRRP__fft_coefficient__coeff_31__attr_"abs"',
 'CRRP__fft_coefficient__coeff_96__attr_"imag"',
 'GASK__ratio_beyond_r_sigma__r_10',
 'PLDE__fft_coefficient__coeff_33__attr_"real"',
 'CRRP__fft_coefficient__coeff_33__attr_"abs"',
 'CRRP__standard_deviation',
 'GMES__spkt_welch_density__coeff_5',
 'CRRP__fft_coefficient__coeff_76__attr_"abs"',
 'PLDE__fft_coefficient__coeff_17__attr_"abs"',
 'DFQES__fft_coefficient__coeff_51__attr_"abs"',
 'CRFP__time_reversal_asymmetry_statistic__lag_1',
 'DFQES__fft_coefficient__coeff_50__attr_"real"',
 'IMES__linear_trend__attr_"rvalue"',
 'PMES__energy_ratio_by_chunks__num_segments_10__segment_focus_9']

In [26]:
extraction_settings = settings.from_columns(top_features)

# per https://github.com/blue-yonder/tsfresh/issues/478
%time X_fault = extract_features(fault_df.astype("float64"), column_id="id", column_sort="Time", \
                           impute_function=impute, kind_to_fc_parameters=extraction_settings, default_fc_parameters={})

Feature Extraction: 100%|██████████████████████████████████████████████████████████████| 17/17 [00:08<00:00,  1.23s/it]


Wall time: 8.64 s
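
The feature names in `top_features` encode everything needed to recompute them: the signal, the feature calculator, and its parameters, joined by double underscores. The parser below is a simplified illustration of that naming convention, not the `tsfresh` implementation of `settings.from_columns`; `parse_feature_name` is a hypothetical helper.

```python
import ast


def parse_feature_name(column):
    """Split a tsfresh-style column name into (kind, calculator, params).

    Simplified illustration only: assumes '__' separates the parts and that
    each parameter is written as <name>_<value>.
    """
    kind, calculator, *param_parts = column.split("__")
    params = {}
    for part in param_parts:
        name, value = part.rsplit("_", 1)
        params[name] = ast.literal_eval(value)   # '16' -> 16, '"real"' -> 'real'
    return kind, calculator, params


print(parse_feature_name('GMES__fft_coefficient__coeff_16__attr_"real"'))
print(parse_feature_name('CRRP__maximum'))
```

Grouping the parsed entries by signal gives a per-signal settings dictionary, which is the shape expected by the `kind_to_fc_parameters` argument used above.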


In [27]:
X_fault_master = X_fault[top_features]
X_fault_master.shape

(1, 50)

This step is required because the `RF_FAULT_top50` model was trained on standardized data.

In [28]:
X_fault_mean = np.load('RF_FAULT_top50_mean.npy')
X_fault_var  = np.load('RF_FAULT_top50_var.npy')

In [29]:
X_fault_master = (X_fault_master - X_fault_mean) / X_fault_var

Load the `RF_FAULT_top50` model.

In [30]:
RF_fault = joblib.load('RF_FAULT_top50_12172019.sav')

Import the `preprocessing` module and rebuild the `LabelEncoder` from its saved classes, so that `inverse_transform` can map the numerical prediction back to a categorical label.

In [31]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.classes_ = np.load('le_fault_classes.npy')
le.classes_

array(['E_Quench', 'Microphonics', 'Quench', 'Single Cav Turn off'],
      dtype=object)
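
A fitted `LabelEncoder` stores its classes in sorted order and encodes each label as its position in that array, so decoding is a simple index lookup. The sketch below mimics that lookup in plain Python using the class names printed above; the local `inverse_transform` function is a stand-in written for illustration, not the scikit-learn method.

```python
classes = ['E_Quench', 'Microphonics', 'Quench', 'Single Cav Turn off']


def inverse_transform(codes):
    """Map integer class codes back to fault names (position in `classes`)."""
    return [classes[code] for code in codes]


print(inverse_transform([2]))        # ['Quench']
print(classes == sorted(classes))    # True: the stored order is alphabetical
```

This is also why the saved `le_fault_classes.npy` file must come from the same encoder used at training time: a different class order would silently relabel the predictions.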

In [32]:
cavityFault = RF_fault.predict(X_fault_master)
cavityFault_prob = RF_fault.predict_proba(X_fault_master)
cavityFault_name = le.inverse_transform(cavityFault)
cavityFault_name_str = cavityFault_name.astype(str)[0]

print("The trip was caused by a", cavityFault_name_str, "fault")
fault_confidence = float(cavityFault_prob[0].max() * 100)  # the predicted class is the highest-probability column
print("Confidence level is:", fault_confidence,"%")

The trip was caused by a Quench fault
Confidence level is: 99.06666666666666 %


## Summary

In [33]:
print("The trip event was initiated by a", cavityFault_name_str, "fault in cavity", cavityID_str, "of zone", log.zone[0])

The trip event was initiated by a Quench fault in cavity 6 of zone 1L24


In [34]:
print("Executing the notebook took:", datetime.now() - startTime, "(h:mm:ss)")

Executing the notebook took: 0:00:20.744687 (h:mm:ss)


In [35]:
#import tsfresh
#print(tsfresh.__version__)
#from platform import python_version
#print(python_version())
#print(pd.__version__)
#print(np.__version__)
#import sklearn
#sklearn.__version__

## Notes About the Models

A few notes about each of the models used in this version of the pipeline (June 8, 2019). Note that these models essentially serve as placeholders until (a) higher fidelity models can be trained, or (b) an adequate deep learning model is developed.

### Model: Which Cavity?

- trained on Spring 2018 data

### Model: Which Fault?

- trained on Spring 2018 data (i.e. the only fault types it knows are: Single Cavity Turn Off, Microphonics, Quench, E-Quench)