<h1>What I have</h1>

**From signal**
(std normalization can be applied to any of these):

- Original signal
- Denoised signal
- Noise (difference between signals)
- Mean of noiseless signal
- Mean of original signal

**From other data**:

- Metrics of signal
- Frequency domain of original signal
- Spectrogram of signal, where channels are (denoised, original)

<h1>What I should use</h1>

- Denoised signal, using wavelet transform (800_000)
    - **\[Res(ODE)-Conv\]+Pool x 2-4** -> **B-LSTM x 2-4** -> **Dense**<br>Put dropout and batch normalization in between.
- Mean of noiseless signal (800_000)
    - **\[Res(ODE)-Conv\]+Pool x 2-4** -> **B-LSTM x 2-4** -> **Dense**<br>Put dropout and batch normalization in between.
- Metrics of signal (19)
    - **Dense x 2-4**
    - **Gradient boosting**
- Spectrogram (129 x 3_571 x 2)
    - **\[Res(ODE)-Conv2D\]+Pool x 4-8** -> **Dense**<br>Put dropout and batch normalization in between.
- Frequency domain of a signal (1000)
    - **\[Res(ODE)-Conv\]+Pool x 2-4** -> **B-LSTM x 2-4** -> **Dense**<br>Put dropout and batch normalization in between.
    
<h1>Final Model</h1>

Input: outputs of all models stacked into one vector (predicted probabilities of all models, \[output of the last layer\] x 5)

- **Dense x 2-4**
- **Gradient boosting**

Output: final prediction

Optimize the decision threshold for the Matthews correlation coefficient.
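A minimal sketch of what tuning the decision threshold could look like, assuming the final model outputs one probability per signal. The helper names `mcc` and `best_threshold`, the grid of cut-offs, and the toy data are all illustrative assumptions, not part of this notebook.

```python
import numpy as np


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 (or boolean) arrays."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(y_true, proba, grid=None):
    """Return (threshold, score) with the highest MCC over a grid of cut-offs."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # hypothetical search grid
    scores = [mcc(y_true, proba >= t) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Toy data, made up for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.30, 0.80])
threshold, score = best_threshold(y_true, proba)
baseline = mcc(y_true, proba >= 0.5)
```

On this toy data the tuned cut-off scores higher than the default 0.5. In practice the threshold should be chosen on held-out predictions rather than on the training set.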

In [1]:
import pandas as pd
import numpy as np
import pywt as pw
from scipy import fftpack, signal, stats
import patsy
from statsmodels.robust import mad

import pyarrow.parquet as pq
from plotting import plot_phases, plot_single_func, plot_phases_func, plot_values, plot_values_loglog
import matplotlib.pyplot as plt
%matplotlib inline

from tqdm import tnrange, tqdm_notebook, tqdm
import multiprocessing as mp
import gc
import h5py

In [2]:
NUM_OF_MEASURES = 100
COLS_STR = [str(i) for i in range(3 * NUM_OF_MEASURES)]
COLS_INT = [i for i in range(3 * NUM_OF_MEASURES)]

TR_PQ = '../data/parquet/train.parquet'
TR_META = '../data/meta/metadata_train.csv'

TS_PQ = '../data/parquet/test.parquet'
TS_META = '../data/meta/metadata_test.csv'

TR_H5 = '../data/hdf5/train.hdf5'

In [3]:
train_meta = pd.read_csv(TR_META)
test_meta = pd.read_csv(TS_META)

# Functions

In [4]:
WAVELET_TYPE = 'db6'
WAVELET_LEVEL = 3

In [5]:
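def denoise_phase(phase):
    wavelet = pw.Wavelet(WAVELET_TYPE)
    # wavedec returns [approximation, coarsest detail, ..., finest detail]
    wc = pw.wavedec(phase, wavelet, level=WAVELET_LEVEL)
    # Estimate the noise level from the finest detail coefficients
    sigma = mad(wc[-1])
    # Universal threshold: sigma * sqrt(2 * ln(n))
    threshold = sigma * np.sqrt(2 * np.log(len(phase)))

    # Threshold the detail coefficients (pywt's default mode is 'soft');
    # the approximation coefficients wc[0] are kept unchanged
    wc_r = wc[:]
    wc_r[1:] = [pw.threshold(x, threshold) for x in wc[1:]]
    return pw.waverec(wc_r, wavelet)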

def std_normalize_phase(phase):
    return (phase - np.mean(phase)) / np.std(phase)

def denoise_normalize_phase(phase):
    return std_normalize_phase(denoise_phase(phase))

def minmax_normalize(phase):
    return (phase - np.min(phase)) / (np.max(phase) - np.min(phase))

def mean(phases):
    return np.mean(phases, axis=0)
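The thresholding step inside `denoise_phase` can be illustrated with plain NumPy. This is a sketch, not the notebook's implementation: the helpers `mad_sigma`, `universal_threshold`, and `soft_threshold` are hypothetical stand-ins for `statsmodels.robust.mad` and `pywt.threshold`, and the median centering in `mad_sigma` is an assumption (the library default may differ between versions).

```python
import numpy as np


def mad_sigma(detail):
    """Robust noise estimate: median absolute deviation, scaled so that it
    matches the standard deviation for Gaussian noise (assumed centering)."""
    detail = np.asarray(detail, dtype=float)
    return np.median(np.abs(detail - np.median(detail))) / 0.6745


def universal_threshold(sigma, n):
    """sigma * sqrt(2 * ln(n)) for a signal of n samples."""
    return sigma * np.sqrt(2.0 * np.log(n))


def soft_threshold(x, t):
    """Zero every coefficient with magnitude below t and shrink the rest by t."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


coeffs = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
shrunk = soft_threshold(coeffs, 1.0)

# For an 800_000-sample phase with unit noise level the threshold is about 5.2.
t_unit = universal_threshold(1.0, 800_000)
```

Because the threshold grows only with the square root of the logarithm of the length, it stays close to five noise standard deviations even for very long signals.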

In [6]:
def get_phases(df, measurement_id):
    p1 = df[str(measurement_id * 3)].values
    p2 = df[str(measurement_id * 3 + 1)].values
    p3 = df[str(measurement_id * 3 + 2)].values

    return p1, p2, p3

def triple_denoise(df, measurement_id):
    return np.asarray(
        [denoise_phase(phase) for phase in get_phases(df, measurement_id)])

def original_mean(df, measurement_id):
    return mean(get_phases(df, measurement_id))

def denoised_mean(df, measurement_id):
    return mean(triple_denoise(df, measurement_id))

def denoise_normalize(df, measurement_id):
    return np.asarray([
        std_normalize_phase(phase)
        for phase in triple_denoise(df, measurement_id)
    ])

def original_normalize(df, measurement_id):
    return np.asarray([
        std_normalize_phase(phase) for phase in get_phases(df, measurement_id)
    ])

In [7]:
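def metrics(phase, asdict=False):
    """Return 17 summary statistics of one phase: moments and extrema of the
    raw signal, plus statistics of its Welch power spectral density."""
    # welch() is called with its default fs=1.0, so `f` is in cycles per sample
    f, Pxx = signal.welch(phase)
    ix_mx = np.argmax(Pxx)
    ix_mn = np.argmin(Pxx)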

    d = {
        'mean_signal': np.mean(phase),
        'std_signal': np.std(phase),
        'kurtosis_signal': stats.kurtosis(phase),
        'skewness_signal': stats.skew(phase),
        
        'mean_amp': np.mean(Pxx),
        'std_amp': np.std(Pxx),
        'median_amp': np.median(Pxx),
        'kurtosis_amp': stats.kurtosis(Pxx),
        'skewness_amp': stats.skew(Pxx),

        'max_signal': np.max(phase),
        'min_signal': np.min(phase),
        
        'max_amp': Pxx[ix_mx],
        'min_amp': Pxx[ix_mn],
        
        'max_freq': f[ix_mx],
        'min_freq': f[ix_mn],
        
        'strong_amp': np.sum(Pxx > 2.5),
        'weak_amp': np.sum(Pxx < 0.4),
    }

    if asdict:
        return d
    else:
        return np.asarray(list(d.values()))

In [8]:
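def onehot_phase(phase):
    # Two-flag encoding of the three phases (note the flags are inverted):
    # phase 0 -> [0, 1], phase 1 -> [1, 0], phase 2 -> [1, 1]
    return [0 if phase == 0 else 1, 0 if phase == 1 else 1]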
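For reference, the codes that `onehot_phase` produces for the three phases can be tabulated directly. The function body is copied from the cell above; the `codes` dictionary is only for illustration.

```python
def onehot_phase(phase):
    # Same expression as in the notebook cell above.
    return [0 if phase == 0 else 1, 0 if phase == 1 else 1]


# Map each phase id to its two-flag code.
codes = {phase: onehot_phase(phase) for phase in (0, 1, 2)}
```

Each phase gets a distinct code, but a flag is 0 (not 1) when the phase matches, so this is an inverted variant of the usual one-hot convention.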

# Frequency domain

In [9]:
def get_freq(val, n, d):
    sig_fft = fftpack.fft(val, n=n)
    sample_freq = fftpack.fftfreq(n=n, d=d)
    pos_mask = np.where(sample_freq >= 0)

    freqs = sample_freq[pos_mask][1:]
    power = np.abs(sig_fft)[pos_mask][1:]

    return freqs, power

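def get_freq_dom(values, denoised=None, n=1000, d=(0.02 / 800000.)):
    # Note: fftpack.fft(..., n=size) truncates the input to its first `size`
    # samples, so only the first 2 * n + 2 samples contribute to the spectrum.
    size = n * 2 + 2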
    
    if denoised is None:
        denoised = denoise_phase(values)
    
    _, p1 = get_freq(values, size, d)
    _, p2 = get_freq(denoised, size, d)
    
    return np.reshape(np.asarray([p1, p2]).T, (n, 2))
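The choice `size = n * 2 + 2` in `get_freq_dom` can be checked with a small sketch. It uses `numpy.fft` in place of `scipy.fftpack` (an assumption made to keep the example dependency-free; both follow the same frequency-bin layout) and a random test signal that is purely illustrative.

```python
import numpy as np

n = 1000
size = n * 2 + 2
d = 0.02 / 800000.0  # sample spacing used in the notebook

# Keep the non-negative frequencies, then drop the zero-frequency bin.
sample_freq = np.fft.fftfreq(size, d=d)
positive = sample_freq[sample_freq >= 0][1:]

# Passing n=size to fft() on a longer signal only uses its first `size` samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
truncated_equal = np.allclose(np.fft.fft(x, n=size), np.fft.fft(x[:size]))
```

With an even transform length of 2002 there are 1001 non-negative bins, so dropping the zero bin leaves exactly `n = 1000` values per channel.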

# Spectrogram

In [10]:
def get_spectrogram(values, denoised=None, fs=1 / (2e-2 / 800000), uselog=True):
    if denoised is None:
        denoised = denoise_phase(values)

    _, _, Sx1 = signal.spectrogram(denoised, fs)
    _, _, Sx2 = signal.spectrogram(values, fs)

    # Stack the two spectrograms as channels: (freqs, segments, 2)
    ret = np.stack((Sx1, Sx2), axis=-1)

    if uselog:
        return np.log10(ret)
    else:
        return ret
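The spectrogram shape of 129 x 3_571 follows from the defaults of `scipy.signal.spectrogram` (segment length 256, overlap of one eighth of a segment). A sketch of that arithmetic, with the hypothetical helper `spectrogram_shape`:

```python
def spectrogram_shape(n_samples, nperseg=256, noverlap=None):
    """Number of (frequency bins, time segments) for a real-valued input."""
    if noverlap is None:
        noverlap = nperseg // 8          # default overlap: 32 samples
    step = nperseg - noverlap            # hop between segments: 224 samples
    n_freqs = nperseg // 2 + 1           # one-sided spectrum
    n_segments = (n_samples - noverlap) // step
    return n_freqs, n_segments


shape = spectrogram_shape(800_000)
```

Changing `nperseg` or `noverlap` in `get_spectrogram` would change this shape, so the HDF5 dataset shape would have to be updated to match.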

# Paperspace parallel

In [11]:
def parallel_apply_along_axis(func1d, axis, arr, *args, **kwargs):
    """
    Like numpy.apply_along_axis(), but takes advantage of multiple
    cores.
    """        
    # Effective axis where apply_along_axis() will be applied by each
    # worker (any non-zero axis number would work, so as to allow the use
    # of `np.array_split()`, which is only done on axis 0):
    effective_axis = 1 if axis == 0 else axis
    if effective_axis != axis:
        arr = arr.swapaxes(axis, effective_axis)

    # Chunks for the mapping (only a few chunks):
    chunks = [(func1d, effective_axis, sub_arr, args, kwargs)
              for sub_arr in np.array_split(arr, mp.cpu_count())]

    pool = mp.Pool(mp.cpu_count())
    individual_results = pool.map(unpacking_apply_along_axis, chunks)
    # Freeing the workers:
    pool.close()
    pool.join()

    return np.concatenate(individual_results)

def unpacking_apply_along_axis(tp):
    """
    Like numpy.apply_along_axis(), but with the arguments packed in a
    tuple instead.

    This function is useful with multiprocessing.Pool().map(): (1)
    map() only handles functions that take a single argument, and (2)
    this function can generally be imported from a module, as required
    by map().
    """
    func1d, axis, arr, args, kwargs = tp
    return np.apply_along_axis(func1d, axis, arr, *args, **kwargs)
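The idea behind `parallel_apply_along_axis` is split, apply, concatenate. The sketch below reproduces that pattern serially (no process pool, to keep it short) and checks that it matches a direct call. The function `row_energy` and the small array are illustrative assumptions.

```python
import numpy as np


def row_energy(row):
    """Toy per-row function standing in for denoising or metrics."""
    return np.sum(row ** 2)


arr = np.arange(24, dtype=float).reshape(6, 4)

# Split along axis 0 into 4 chunks (one per hypothetical worker).
chunks = np.array_split(arr, 4)
pieces = [np.apply_along_axis(row_energy, 1, chunk) for chunk in chunks]
combined = np.concatenate(pieces)

direct = np.apply_along_axis(row_energy, 1, arr)
```

`np.array_split` tolerates row counts that do not divide evenly, which is why it is used instead of `np.split`.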

In [12]:
def parallel_proc_h5(name, data, meta, batch=1000):
    steps = int(np.ceil(data.shape[0] / batch))

    dir_file = '../data/hdf5/' + name + '.hdf5'
    print('Creating hdf5 file in directory: {}'.format(dir_file))
    hd_file = h5py.File(dir_file, mode='a')

    print('Creating datasets: ', end='')
    denoised_ds = hd_file.create_dataset(
        'denoised',
        shape=data.shape,
        dtype=np.float32,
        chunks=True,
        compression="gzip",
        compression_opts=6)
    print('denoised, ', end='')
    
    metrics_ds = hd_file.create_dataset(
        'metrics', shape=(data.shape[0], 19), dtype=np.float32, chunks=True)
    print('metrics, ', end='')
    
    spectrogram_ds = hd_file.create_dataset(
        'spectrogram',
        shape=(data.shape[0], 129, 3571, 2),
        dtype=np.float32,
        chunks=True,
        compression="gzip",
        compression_opts=6)
    print('spectrogram, ', end='')
    
    freq_dom_ds = hd_file.create_dataset(
        'freq_dom',
        shape=(data.shape[0], 1000, 2),
        dtype=np.float32,
        chunks=True)
    print('freq_dom.')

    t = tnrange(steps)
    for i in t:
        start = i * batch
        finish = (i + 1) * batch if i + 1 != steps else data.shape[0]

        t.set_description('Denoising')
        denoised_ds[start:finish] = parallel_apply_along_axis(
            denoise_normalize_phase, 1, data[start:finish])

        t.set_description('Metrics')
        ds = parallel_apply_along_axis(metrics, 1, data[start:finish])
        oh = parallel_apply_along_axis(
            onehot_phase, 1, meta['phase'].values[start:finish].reshape(
                finish - start, 1))
        metrics_ds[start:finish] = np.concatenate([ds, oh], axis=1)

        t.set_description('Spectrogram')
        spectrogram_ds[start:finish] = parallel_apply_along_axis(
            get_spectrogram, 1, data[start:finish])

        t.set_description('Frequency')
        freq_dom_ds[start:finish] = parallel_apply_along_axis(
            get_freq_dom, 1, data[start:finish])

    # Flush pending writes and release the file handle
    hd_file.close()
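The batching in `parallel_proc_h5` boils down to computing `(start, finish)` row ranges, with the last batch clipped to the data length. A small sketch, with the hypothetical helper `batch_bounds`:

```python
import math


def batch_bounds(n_rows, batch=1000):
    """Return the (start, finish) row ranges processed by each step."""
    steps = math.ceil(n_rows / batch)
    return [(i * batch, min((i + 1) * batch, n_rows)) for i in range(steps)]


bounds = batch_bounds(2500, batch=1000)
```

Using `min(...)` for the upper bound gives the same ranges as the explicit last-step check in the loop above.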

# May help but not now

## Correlation

In [None]:
def cross_corr(df, measurement_id, mode='same'):  # mode: 'same' or 'full'
    p1, p2, p3 = triple_denoise(df, measurement_id)

    c12 = signal.correlate(p1, p2, mode=mode)
    c13 = signal.correlate(p1, p3, mode=mode)
    c23 = signal.correlate(p2, p3, mode=mode)

    return np.asarray([c12, c13, c23])

# Plotting

In [None]:
plot_phases(train, train_meta, 0, url=True)

In [None]:
plot_phases_func(
    train, train_meta, 0, func=triple_denoise, name='denoise', url=True)

In [None]:
denoise_phase(train['0'].values)

In [None]:
plot_single_func(
    train, train_meta, 0, func=original_mean, name='orig_mean', url=True)

In [None]:
plot_single_func(
    train, train_meta, 0, func=denoised_mean, name='wavelet_mean', url=True)

# Coding

In [16]:
from keras.layers import (Bidirectional, LSTM, Dense, TimeDistributed, Conv1D,
                          Input, Add, BatchNormalization, ReLU, Dropout, Flatten, Activation)
from keras.models import Model
from keras.callbacks import TensorBoard

from imblearn.under_sampling import RandomUnderSampler
from sklearn.utils import shuffle

# from sklearn.utils.class_weight import compute_class_weight

In [14]:
f = h5py.File(TR_H5, mode='r+')
print(f.keys())
denoised = f['denoised']

<KeysViewHDF5 ['denoised', 'freq_dom', 'metrics', 'spectrogram']>


In [15]:
import keras.backend as K


def matthews_correlation(y_true, y_pred):
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos

    y_pos = K.round(K.clip(y_true, 0, 1))
    y_neg = 1 - y_pos

    tp = K.sum(y_pos * y_pred_pos)
    tn = K.sum(y_neg * y_pred_neg)

    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)

    numerator = (tp * tn - fp * fn)
    denominator = K.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return numerator / (denominator + K.epsilon())
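For reference, `matthews_correlation` evaluates the standard Matthews correlation coefficient on predictions rounded at 0.5, with `K.epsilon()` added to the denominator to avoid division by zero:

```latex
\mathrm{MCC} =
  \frac{TP \cdot TN - FP \cdot FN}
       {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```

As a worked example with made-up counts TP = 2, TN = 2, FP = 1, FN = 1, the numerator is 2·2 − 1·1 = 3 and the denominator is √(3·3·3·3) = 9, so MCC = 1/3. When used as a Keras metric the value is computed per batch and averaged, so it can differ from the MCC computed over the whole dataset.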

In [23]:
def res_layer(input_data, filters, kernel_size, block_num=None):
    if block_num is None:
        raise ValueError('Block number is not defined')

    block = 'res_block' + str(block_num) + '_'
    inp_ = input_data

    out = Conv1D(
        filters, kernel_size, padding='same', name=block + 'conv1')(input_data)
    out = BatchNormalization(name=block + 'bn1')(out)
    out = ReLU(name=block + 'relu1')(out)

    #     out = Dropout(drop, name=block + 'drop')(out)

    out = Conv1D(
        filters, kernel_size, padding='same', name=block + 'conv2')(out)
    out = BatchNormalization(name=block + 'bn2')(out)

    out = Add(name=block + 'add')([out, inp_])
    out = ReLU(name=block + 'relu2')(out)

    return out

In [41]:
filters = [4, 8, 16]
kernel = [32, 32, 32]


def get_model(block, kernel, warm=False):
    input_model = Input(shape=(800000, 1), name='Input')

    out = Conv1D(4, 128, strides=16, name='in_conv')(input_model)
    out = BatchNormalization(name='in_bn')(out)
    out = ReLU(name='in_relu')(out)

#     out = res_layer(out, block[0], kernel[0], block_num=1)
    
    out = Conv1D(block[0], 64, strides=4, name='1_conv')(out)
    out = BatchNormalization(name='1_bn')(out)
    out = ReLU(name='1_relu')(out)

    out = res_layer(out, block[0], kernel[0], block_num=1)

    out = Conv1D(block[1], 64, strides=4, name='2_conv')(out)
    out = BatchNormalization(name='2_bn')(out)
    out = ReLU(name='2_relu')(out)

    out = res_layer(out, block[1], kernel[1], block_num=2)

    out = Conv1D(block[2], 64, strides=4, name='3_conv')(out)
    out = BatchNormalization(name='3_bn')(out)
    out = ReLU(name='3_relu')(out)

    out = res_layer(out, block[2], kernel[2], block_num=3)

    out = Conv1D(32, 32, strides=4, name='4_conv')(out)
    out = BatchNormalization(name='4_bn')(out)
    out = ReLU()(out)

#     b_lstm1 = Bidirectional(LSTM(32, return_sequences=True), merge_mode='sum', name='bidirectional1')(bnorm4)
    out = Bidirectional(LSTM(128, return_sequences=False), merge_mode='sum', name='bidirectional2')(out)

#     flat = Flatten()(out)

    out = Dense(1, name='dense_out')(out)
    output_model = Activation('sigmoid')(out)

    model = Model(inputs=input_model, outputs=output_model)
    if warm:
        model.summary()
    return model
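The sequence length after each strided convolution in `get_model` can be worked out by hand. The sketch below assumes Keras' default `'valid'` padding for those layers. The residual blocks use `padding='same'` with stride 1 and therefore keep the length unchanged. The helper `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


# (layer name, kernel size, stride) for the strided convolutions in get_model
layers = [
    ("in_conv", 128, 16),
    ("1_conv", 64, 4),
    ("2_conv", 64, 4),
    ("3_conv", 64, 4),
    ("4_conv", 32, 4),
]

length = 800_000
lengths = {}
for name, kernel_size, stride in layers:
    length = conv1d_out_len(length, kernel_size, stride)
    lengths[name] = length
```

The first value agrees with the `(None, 49993, 4)` output shape reported for `in_conv` in the model summary. Under these assumptions the bidirectional LSTM receives a sequence of 183 steps.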

In [18]:
under_sample = RandomUnderSampler()
indicies = under_sample.fit_resample(
    train_meta['signal_id'].values.reshape((-1, 1)),
    train_meta['target'].values.reshape((-1, 1)))
data = np.concatenate(indicies, axis=1)
data = data[data[:,0].argsort()]

indicies, y_train = data[: ,0].tolist(), data[:, 1]
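What the undersampling cell achieves can be sketched with NumPy alone. This is a stand-in for illustration, not how `RandomUnderSampler` is implemented. The helper `random_undersample`, the seed, and the toy labels are assumptions.

```python
import numpy as np


def random_undersample(ids, targets, seed=0):
    """Keep an equal number of randomly chosen rows from every class."""
    ids = np.asarray(ids)
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)

    classes, counts = np.unique(targets, return_counts=True)
    n_keep = counts.min()  # size of the rarest class

    keep = np.concatenate([
        rng.choice(np.flatnonzero(targets == c), size=n_keep, replace=False)
        for c in classes
    ])
    # Sort so the ids are increasing, as h5py requires for index lists.
    keep.sort()
    return ids[keep], targets[keep]


ids = np.arange(10)
targets = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])  # toy labels, 2 positives
kept_ids, kept_targets = random_undersample(ids, targets)
```

All positive signals are kept and an equal number of negatives is drawn at random, which is why the result is sorted again before indexing the HDF5 dataset.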

In [19]:
X_train, y_train = shuffle(denoised[indicies], y_train)

In [21]:
X_train, y_train = X_train.reshape((-1, X_train.shape[1], 1)), y_train.reshape((-1, 1))

In [42]:
model = get_model(filters, kernel, True)
m_tb = TensorBoard(log_dir='../logs/model_next', histogram_freq=0, write_graph=True)
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', matthews_correlation]
)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input (InputLayer)              (None, 800000, 1)    0                                            
__________________________________________________________________________________________________
in_conv (Conv1D)                (None, 49993, 4)     516         Input[0][0]                      
__________________________________________________________________________________________________
in_bn (BatchNormalization)      (None, 49993, 4)     16          in_conv[0][0]                    
__________________________________________________________________________________________________
in_relu (ReLU)                  (None, 49993, 4)     0           in_bn[0][0]                      
__________________________________________________________________________________________________
1_conv (Co

In [43]:
model.fit(
    X_train,
    y_train,
    epochs=50,
    batch_size=256,
    verbose=1,
    callbacks=[m_tb]
)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f0ac477b278>