## Variational Bayesian Independent Component Analysis (vbICA)

The following discussion is based on Gualandi et al. (2015). 

This method should be well suited to detecting transient signals in borehole strainmeter data. A great thesis (Choudrey, 2002) describes the development of the method and is suitable for any incoming grad student with a basic understanding of calculus. Fair warning: it is 250 pages, but it has a fun tone (at least in the parts I read). 

One assumption in vbICA, which is well suited to the classic Blind Source Separation (BSS) problem, is that the source signals are static. Clearly, this is not the case for fault zone processes affecting time series, but Gualandi et al. (2015) show that the assumption is a fine approximation for Mogi sources, seasonal signals, and post-seismic signals in GPS networks. In subsequent papers, they detect Slow Slip Events in the Cascades and Andes. My hope is to apply the same method to strainmeter data. 

vbICA uses a modelling approach (as opposed to a mapping approach, which something like FastICA uses) to explain signals in the data. In this approach, a contrast function of the parameters is maximized: either the likelihood or, if Bayesian, the posterior probability distribution function (pdf). Gualandi et al. modify the code of Choudrey (2002) following Chan et al. (2003) to account for missing data. 

The generative model for this approach is characterized by observed variables (i.e. the data), hidden variables, and hidden parameters. Hidden parameters and variables are unknown. 
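
As a rough sketch of what such a generative model looks like, the standard noisy linear ICA model can be written as below. The symbols ($\mathbf{x}_t$, $\mathbf{A}$, $\mathbf{s}_t$, $\boldsymbol{\varepsilon}_t$) are my own notation rather than something taken from Gualandi et al. (2015), and the labelling of which terms are hidden variables and which are hidden parameters is my reading of the description above.

$$
\mathbf{x}_t = \mathbf{A}\,\mathbf{s}_t + \boldsymbol{\varepsilon}_t, \qquad t = 1, \dots, T
$$

Here $\mathbf{x}_t$ is the observed data at epoch $t$ (e.g., one sample from each gauge), $\mathbf{s}_t$ holds the hidden source signals, $\mathbf{A}$ is the unknown mixing matrix, and $\boldsymbol{\varepsilon}_t$ is noise. In the Bayesian version, the quantity of interest is the posterior, which by Bayes' rule satisfies

$$
p(\mathbf{A}, \mathbf{s} \mid \mathbf{x}) \;\propto\; p(\mathbf{x} \mid \mathbf{A}, \mathbf{s})\, p(\mathbf{A})\, p(\mathbf{s}),
$$

so the priors $p(\mathbf{A})$ and $p(\mathbf{s})$ are where the prior choices discussed in this notebook enter.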

The results are highly dependent on the choice of priors. Priors enter the initial estimation of the weights used to characterize the generative model. Weights are composed of hidden/latent variables and hidden parameters. Hidden variables are identified with real world quantities. Hidden parameters, also termed hyper-parameters, are required for a working model. Specified priors include loosely constrained hyper-parameter values, with pdfs that describe the random variables and prior parameters. 

The Bayesian application of this method involves automatic relevance determination (ARD). Giving weak confidence to priors allows the data to guide the estimation of posterior parameters more heavily (as opposed to being governed by the priors). ARD essentially uses a precision value (the inverse of the variance) for each source signal to determine the optimal number of components needed to explain the data. Gualandi et al. (2015) label a signal as noise when the maximum variance is more than 10 times larger than that signal's variance. 
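
To make the variance-ratio idea concrete, here is a minimal sketch of one possible reading of that criterion: compare each component's variance with the largest one and keep only components within a factor of 10. The function name `flag_relevant_components`, the `max_ratio` argument, and the example variances are all hypothetical; this is not the vbICA code, only an illustration of the thresholding step.

```python
import numpy as np

def flag_relevant_components(variances, max_ratio=10.0):
    """Return a boolean mask of components to keep.

    Assumption: a component is treated as noise when the largest
    variance is more than `max_ratio` times its own variance.
    """
    variances = np.asarray(variances, dtype=float)
    return variances.max() / variances <= max_ratio

# Made-up variances for four candidate components
variances = [4.0, 1.5, 0.3, 0.05]
keep = flag_relevant_components(variances)
print(keep)
```

With these made-up numbers the first two components pass the threshold and the last two are flagged as noise.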



In [1]:
# Import python modules
from sklearn.decomposition import FastICA
from scipy import signal
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib widget
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 8, 6

import obspy
from obspy import UTCDateTime
from obspy.imaging import spectrogram

import ipywidgets as widgets
from ipywidgets import HBox, VBox, interact, Layout
style = {'description_width': 'initial'}
layout=Layout(width='30%', height='40px')

In [2]:
# Load data
# Assign station codes from selected file

dir = './DataFiles/Level2/'
sta_list = []
for file in os.listdir(dir):
    if file.endswith('regional_strain_and_corrections.txt'):
        sta_list.append(file[0:24])
        
# Set initial values   
file = sta_list[0]
network = file[0:2]
scode = file[3:7]
loc = file[8:10]
cha = file[11:13]

print('Pick the file you would like to analyze:')
sta_select = widgets.Dropdown(
            options=sta_list,
            value=sta_list[0],
            description='Station file:',
            )
display(sta_select)

# Change the station and network as the dropdown is changed
def the_ccodes(siteval):
    global scode, network, loc, cha, file
    file = siteval
    network = siteval[0:2]
    scode = siteval[3:7]
    loc = siteval[8:10]
    cha = siteval[11:13]
def on_cselect(change):
    the_ccodes(change.new)
sta_select.observe(on_cselect,names='value')

# Load the strains and corrections to a dataframe

# Make a dataframe with the file and assign start and end dates
fbutton = widgets.Button(description="Load files to a dataframe", button_style='danger',layout=layout)
foutput = widgets.Output()

# Initial start and end times (arbitrarily chosen)
start = UTCDateTime('2000-01-01 00:00:00.000')
end = UTCDateTime('2000-01-02 00:00:00.000')

def on_fbutton_clicked(b):
    with foutput:
        foutput.clear_output()
        global df, start, end
        df = pd.DataFrame([])
        for files in os.listdir(dir):
            if files.startswith(file):
                # Print file comments
                # The with-block closes the file automatically
                with open(dir+files,'r') as f:
                    for ln in f:
                        if ln.startswith('#'):
                            print('File comment: '+ ln[1:].strip())
                adf = pd.read_csv(dir+files,sep='\t',index_col=0,header=0,comment='#')
                df = pd.concat([df,adf],axis='columns')
        print('Wait for the dataframe column headers to print.')

        ind = []
        for i in range(0,len(df)):
            ind.append(UTCDateTime(df.index[i]))
        df.index = ind
        start = df.index[0]
        end = df.index[-1]
        print(df.columns)
fbutton.on_click(on_fbutton_clicked)


display(fbutton, foutput)

Pick the file you would like to analyze:


Dropdown(description='Station file:', options=('PB.B916.T0.RS.2019-06-01',), value='PB.B916.T0.RS.2019-06-01')

Button(button_style='danger', description='Load files to a dataframe', layout=Layout(height='40px', width='30%…

Output()

Following the steps of Gualandi et al. (2016), "Blind source separation problem in GPS time series":

1. Center the dataset, i.e., remove the mean from each time series.
   > CH: This is simple.
2. Check the correlation of the centered dataset.
   > CH: This can be done with numpy's Pearson correlation coefficient function (`np.corrcoef`).

   - (2a) If the correlation is greater than 0.67, go to step 3.
   - (2b) If the correlation is smaller than 0.67, go to step 4.
3. Detrend the time series.
   > CH: The scipy detrend tool (`scipy.signal.detrend`) is good for this.
4. Correct for the co-seismic offsets because of their nonindependence with the post-seismic signals.
5. Perform a vbICA with loose priors and select the number of components via a criterion based on the ARD method.
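
Step 4 has no tool noted against it in the list above, so here is a minimal sketch of one simple way an offset could be removed. It assumes the sample index of the event is already known and estimates the step as the difference between the medians in short windows on either side. The function name `remove_offset`, the `window` length, and the synthetic series are all hypothetical, and this is not necessarily how Gualandi et al. handle co-seismic offsets.

```python
import numpy as np

def remove_offset(series, event_index, window=50):
    """Estimate and remove a step at `event_index` (assumed known)."""
    series = np.asarray(series, dtype=float)
    before = series[max(0, event_index - window):event_index]
    after = series[event_index:event_index + window]
    # Step size estimated from the medians on either side of the event
    offset = np.median(after) - np.median(before)
    corrected = series.copy()
    corrected[event_index:] -= offset
    return corrected, offset

# Synthetic example: small noise plus a unit step at sample 200
rng = np.random.default_rng(1)
clean = 0.01 * rng.standard_normal(400)
stepped = clean.copy()
stepped[200:] += 1.0

corrected, offset = remove_offset(stepped, event_index=200)
print('Estimated offset:', offset)
```

Medians are used instead of means so that a few outliers near the event do not bias the estimate. A real post-seismic transient right after the event would contaminate the "after" window, which is the nonindependence that step 4 refers to.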

In [7]:
df.columns

Index(['ch0 [ms]', 'ch1 [ms]', 'ch2 [ms]', 'ch3 [ms]', 'baro_ch0', 'baro_ch1',
       'baro_ch2', 'baro_ch3', 'trend_ch0', 'trend_ch1', 'trend_ch2',
       'trend_ch3', 'tide_ch0', 'tide_ch1', 'tide_ch2', 'tide_ch3', 'gaugeEA',
       'gaugeED', 'gaugeES', 'tideEA', 'tideED', 'tideES', 'baroEA', 'baroED',
       'baroES', 'trendEA', 'trendED', 'trendES'],
      dtype='object')

### Synthetic test of FastICA

In [87]:
x = np.column_stack([df.tide_ch3.values+df.baro_ch3.values,df.tide_ch2.values+df.baro_ch2.values])
time = df.index

In [88]:
df.tide_ch3.values+df.baro_ch3.values

array([-0.00201263, -0.00147288, -0.00084746, ..., -0.01109687,
       -0.01054443, -0.00991671])

In [89]:
np.shape(x)

(4177, 2)

In [158]:
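# whiten='arbitrary-variance' matches the legacy whiten=True behaviour (needs scikit-learn >= 1.1)
ica = FastICA(n_components=2, algorithm='parallel', whiten='arbitrary-variance', fun='exp',max_iter=100000,tol=0.000001)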

In [159]:
recovered = ica.fit_transform(x)

In [160]:
recovered

array([[ 0.01393687,  0.0094695 ],
       [ 0.01432477,  0.01091599],
       [ 0.01456449,  0.01238958],
       ..., 
       [ 0.03142761,  0.00831115],
       [ 0.03184684,  0.0098131 ],
       [ 0.0321271 ,  0.0113304 ]])

In [161]:
plt.close()
plt.plot(time,recovered)
plt.show()

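
The cells above run FastICA on tide and barometric corrections from a real station file, so the true sources are not known. As a sanity check, here is a minimal, fully synthetic sketch where the sources are known in advance: two made-up signals (a sinusoid and a square wave) are mixed with an arbitrary matrix `A` and then recovered. The signals, the mixing matrix, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two known, non-Gaussian sources
s1 = np.sin(2 * t)               # sinusoid
s2 = np.sign(np.sin(3 * t))      # square wave
S = np.column_stack([s1, s2])
S += 0.02 * rng.standard_normal(S.shape)   # small additive noise

# Mix the sources with an arbitrary (made-up) mixing matrix
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])
X = S @ A.T

# Recover the sources; ICA cannot fix their order, sign, or scale
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

# Absolute correlation between each true source and each estimate
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
best_match = corr.max(axis=1)
print('Best |correlation| per true source:', best_match)
```

Because ICA only recovers sources up to permutation, sign, and scale, the comparison uses the absolute correlation against every estimated component and not a direct difference.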

In [6]:
# Demean the data and check the correlation coefficient against time

# Convert UTCDateTime to seconds (POSIX timestamps)
time = [UTCDateTime(t).timestamp for t in df.index]
ch = {}; abscorr = {}
# Loop variable is named col so it does not overwrite the global channel code cha
for col in ['ch0 [ms]','ch1 [ms]','ch2 [ms]','ch3 [ms]']:
    key = col[2:3]   # channel number, e.g. '0'
    name = col[0:3]  # channel name, e.g. 'ch0'
    demeaned = df[col].values - np.mean(df[col])
    ch[key] = demeaned
    corr = np.corrcoef(time,demeaned)[0,1]
    abscorr[key] = abs(corr)
    print('Correlation coefficient for '+name+': ',corr)
    # Detrend this channel if its correlation with time is large
    if abscorr[key] > 0.67:
        print('Correlation greater than 0.67! Detrending '+name)
        ch[key] = signal.detrend(demeaned)
        corr = np.corrcoef(time,ch[key])[0,1]
        abscorr[key] = abs(corr)
        print('New correlation coefficient for '+name+': ',corr)

Correlation coefficient for ch0:  -0.996171195614
Correlation greater than 0.67! Detrending ch0
New correlation coefficient for ch0:  -7.53254986555e-15
Correlation coefficient for ch1:  -0.986745304892
Correlation greater than 0.67! Detrending ch1
New correlation coefficient for ch1:  -3.59976043345e-16
Correlation coefficient for ch2:  -0.991781270901
Correlation greater than 0.67! Detrending ch2
New correlation coefficient for ch2:  -2.48582132924e-15
Correlation coefficient for ch3:  -0.817223078637
Correlation greater than 0.67! Detrending ch3
New correlation coefficient for ch3:  -5.19773453876e-16


In [48]:
plt.close()
plt.hist(df._ch0)
plt.show()


In [44]:
plt.close()
plt.plot(time,recovered)
#plt.plot(time,df.tide_ch0)
plt.show()


In [60]:
plt.close()
plt.plot(time,df.gaugeEA)
plt.plot(time,df.gaugeES)
plt.plot(time,df.gaugeED)
#plt.plot(time,df.tide_ch0)
#plt.plot(time,df.baro_ch0)
plt.show()

In [58]:
UTCDateTime(df.index[0]),UTCDateTime(df.index[-1])
df.columns

Index(['ch0 [ms]', 'ch1 [ms]', 'ch2 [ms]', 'ch3 [ms]', 'baro_ch0', 'baro_ch1',
       'baro_ch2', 'baro_ch3', 'trend_ch0', 'trend_ch1', 'trend_ch2',
       'trend_ch3', 'tide_ch0', 'tide_ch1', 'tide_ch2', 'tide_ch3', 'gaugeEA',
       'gaugeED', 'gaugeES', 'tideEA', 'tideED', 'tideES', 'baroEA', 'baroED',
       'baroES', 'trendEA', 'trendED', 'trendES'],
      dtype='object')