# Banpei
https://github.com/tsurubee/banpei

### Singular Spectrum Transformation (SST)

The input 'data' must be a one-dimensional array-like object containing a sequence of values.
The output 'results' is a NumPy array with the same length as the input data.
The graph below shows the change-point score calculated by SST for the periodic data.

The data used is stored at '/tests/test_data/periodic_wave.csv'. The cell below reads the CSV file with pandas.

SST processing can be accelerated with the Lanczos method, one of the Krylov subspace methods, by passing is_lanczos=True to detect, as in the second call below.

In [5]:
import banpei
import pandas as pd
import matplotlib.pyplot as plt
pd.options.plotting.backend = "plotly"

#df = pd.read_csv('./banpei_test/abnormal_frequency.csv')
#display(df.plot())

raw_data = pd.read_csv('./banpei_test/periodic_wave.csv')
data = raw_data['y']
#display(data.plot())

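# Create Banpei instance and compute the change-point score
model   = banpei.SST(w=50)
results = model.detect(data)

# Same detection accelerated with the Lanczos method (overwrites the result above)
results = model.detect(data, is_lanczos=True)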

# Transform results from np.array to a pd.Series named 'r' so it can be plotted:
results = pd.Series(results, name='r')

#data = np.array(data)
#print(data)
#print(results)

display(data.plot())
display(results.plot())

#plt.plot(data)
#plt.show()

#plt.plot(results)
#plt.show()

"""
# Plot Data against banpei result:
plt.plot(X, data, color='b', label='data')
plt.plot(X, results, color='r', label='results')

plt.title("Banpei Data & Results")
plt.legend()
  
plt.show()
"""

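For intuition about what the change-point score measures, here is a minimal NumPy sketch of an SST-style score. It is not Banpei's implementation: the names `hankel` and `sst_scores`, the defaults chosen for `k` and `L`, and the score definition `1 - s[0]` are assumptions made for illustration. The parameters follow the usual SST meaning (`w` window size, `m` number of basis vectors, `k` number of columns, `L` lag).

```python
import numpy as np

def hankel(x, start, w, k):
    """w-by-k matrix whose j-th column is the window x[start+j : start+j+w]."""
    return np.column_stack([x[start + j:start + j + w] for j in range(k)])

def sst_scores(x, w, m=2, k=None, L=None):
    """Illustrative SST-style change score (hypothetical helper, not banpei.SST)."""
    x = np.asarray(x, dtype=float)
    k = w // 2 if k is None else k   # assumed default
    L = k // 2 if L is None else L   # assumed default
    span = w + k - 1                 # samples covered by one Hankel matrix
    scores = np.zeros(len(x))
    for t in range(span, len(x) - L + 1):
        past = hankel(x, t - span, w, k)       # windows ending at sample t-1
        test = hankel(x, t - span + L, w, k)   # same windows shifted L samples forward
        U_past = np.linalg.svd(past, full_matrices=False)[0][:, :m]
        U_test = np.linalg.svd(test, full_matrices=False)[0][:, :m]
        # Largest singular value of U_past^T U_test is the cosine of the smallest
        # principal angle between the two subspaces; 1 minus it is near 0 when they agree.
        s = np.linalg.svd(U_past.T @ U_test, compute_uv=False)
        scores[t] = max(0.0, 1.0 - s[0])
    return scores

# Synthetic series whose period changes from 25 to 10 samples at index 300
t = np.arange(600)
x = np.where(t < 300, np.sin(2 * np.pi * t / 25), np.sin(2 * np.pi * t / 10))
scores = sst_scores(x, w=30)
print(int(np.argmax(scores)))
```

The score stays close to zero while the past and test windows span the same signal subspace and rises when the test windows start to include the new regime.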

## Test on sample KDD Data

In [2]:
BASE_PATH = '../data-sets/KDD-Cup/data/'

raw_data = pd.read_csv(BASE_PATH + '248_UCR_Anomaly_2000.txt', names=['series'])
data = raw_data['series']
T = data.shape[0]

# Series length T and the common window-length choice T/4
print(T)
print(T // 4)

display(data.plot())

8432
2108


In [6]:
import banpei
import pandas as pd
import matplotlib.pyplot as plt
pd.options.plotting.backend = "plotly"

BASE_PATH = '../data-sets/KDD-Cup/data/'

#raw_data = pd.read_csv(BASE_PATH + '248_UCR_Anomaly_2000.txt', names=['series'])
raw_data = pd.read_csv(BASE_PATH + '005_UCR_Anomaly_4000.txt', names=['series'])
data = raw_data['series']
T = data.shape[0]

display(data.plot())

# Create Banpei instance
model   = banpei.SST(w=380)
# SST parameters:
#   w : int, window size
#   m : int, number of basis vectors
#   k : int, number of columns for the trajectory and test matrices
#   L : int, lag time
#results = model.detect(data)

results_lanc = model.detect(data, is_lanczos=True)

# Transform results from np.array to pd.DataFrame:
#results = pd.DataFrame(results, columns = ['result'])
results_lanc = pd.DataFrame(results_lanc, columns = ['result'])
#results = results['result']

#data = np.array(data)
#print(data)
#print(results)

#display(data.plot())
#display(results.plot())
display(results_lanc.plot())

#plt.plot(data)
#plt.show()

#plt.plot(results)
#plt.show()

"""
# Plot Data against banpei result:
plt.plot(X, data, color='b', label='data')
plt.plot(X, results, color='r', label='results')

plt.title("Banpei Data & Results")
plt.legend()
  
plt.show()
"""

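The cell above relies on `is_lanczos=True`. As a rough illustration of the Krylov-subspace idea behind the Lanczos method (not Banpei's code), the sketch below reduces a symmetric matrix to a small tridiagonal matrix whose largest eigenvalue approximates the largest eigenvalue of the original. The function name `lanczos_tridiag`, the synthetic series, and the choice of 12 steps are assumptions for illustration.

```python
import numpy as np

def lanczos_tridiag(A, v0, steps):
    """Tridiagonal matrix from `steps` Lanczos iterations on a symmetric matrix A."""
    n = A.shape[0]
    Q = np.zeros((n, steps))
    alphas, betas = [], []
    q = v0 / np.linalg.norm(v0)
    for j in range(steps):
        Q[:, j] = q
        z = A @ q
        alphas.append(q @ z)
        # Full reorthogonalization against all Lanczos vectors so far (numerical stability)
        z = z - Q[:, :j + 1] @ (Q[:, :j + 1].T @ z)
        beta = np.linalg.norm(z)
        if j == steps - 1 or beta < 1e-12:
            break
        betas.append(beta)
        q = z / beta
    return np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)

# Symmetric test matrix C = H H^T built from a 50 x 50 Hankel matrix of a noisy sine
rng = np.random.default_rng(0)
t = np.arange(99)
x = np.sin(2 * np.pi * t / 20) + 0.1 * rng.standard_normal(t.size)
H = np.column_stack([x[j:j + 50] for j in range(50)])
C = H @ H.T

T_small = lanczos_tridiag(C, rng.standard_normal(50), steps=12)
ritz_max = np.linalg.eigvalsh(T_small)[-1]   # from the 12 x 12 matrix
exact_max = np.linalg.eigvalsh(C)[-1]        # from the full 50 x 50 matrix
print(ritz_max, exact_max)
```

Only the leading part of the spectrum is needed for SST, which is why working with a small tridiagonal matrix instead of a full decomposition can save time.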

### Theory of SSA separability
https://en.wikipedia.org/wiki/Singular_spectrum_analysis

The two main questions which the theory of SSA attempts to answer are: (a) what time series components can be separated by SSA, and (b) how to choose the window length $L$ and make proper grouping for extraction of a desirable component. Many theoretical results can be found in Golyandina et al. (2001, Ch. 1 and 6).

Trend (which is defined as a slowly varying component of the time series), periodic components and noise are asymptotically separable as $N \rightarrow \infty$. In practice $N$ is fixed and one is interested in approximate separability between time series components. A number of indicators of approximate separability can be used, see Golyandina et al. (2001, Ch. 1). The window length $L$ determines the resolution of the method: larger values of $L$ provide more refined decomposition into elementary components and therefore better separability. The window length $L$ determines the longest periodicity captured by SSA. Trends can be extracted by grouping of eigentriples with slowly varying eigenvectors. A sinusoid with frequency smaller than 0.5 produces two approximately equal eigenvalues and two sine-wave eigenvectors with the same frequencies and $\pi/2$-shifted phases.
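The last claim can be checked numerically. The sketch below builds the trajectory matrix of a pure sinusoid and inspects its singular values (the eigenvalues in the text are their squares). The series length, window length and period are arbitrary values chosen for illustration.

```python
import numpy as np

N, L = 400, 100                        # series length and window length (illustrative)
t = np.arange(N)
series = np.sin(2 * np.pi * t / 20)    # frequency 0.05 cycles per sample

# L x K trajectory matrix: column j is the window series[j : j + L]
K = N - L + 1
X = np.column_stack([series[j:j + L] for j in range(K)])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s[:4])                           # two dominant, nearly equal values; the rest are negligible
```

The first two columns of `U` are the corresponding sine-wave eigenvectors.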

Elsner and Tsonis [6] give some discussion and remark that choosing $L = T/4$ is common practice. It has been recommended that $L$ should be large enough but not larger than $T/2$ [1]. Large values of $L$ allow longer-period oscillations to be resolved, but choosing $L$ too large leaves too few observations from which to estimate the covariance matrix of the $L$ variables. Although considerable effort and various techniques have been devoted to selecting the optimal value of $L$, there is inadequate theoretical justification for choosing $L$.

$L$ = window length (the `w` argument of `banpei.SST`, not its lag parameter `L`)
$T$ = number of observations in the time series (written $N$ in the paragraph above)
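As a small sketch of this rule of thumb, the helper below returns $T/4$ by default and never exceeds $T/2$. The function name `suggest_window_length` and its `fraction` argument are hypothetical and only encode the two recommendations quoted above; they are not part of Banpei.

```python
def suggest_window_length(T, fraction=0.25):
    """Hypothetical helper: window length L = fraction * T, capped at T // 2."""
    if T < 2:
        raise ValueError("need at least two observations")
    L = int(T * fraction)
    return max(1, min(L, T // 2))

T = 8432                          # length of the series loaded earlier
L = suggest_window_length(T)
print(L)
```

The result is only a starting point; the text above notes that no choice of $L$ has a firm theoretical justification.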

## Demo
https://github.com/tsurubee/banpei

In [None]:
import banpei
import numpy as np
import pandas as pd
from bokeh.io import curdoc
from bokeh.models import ColumnDataSource, DatetimeTickFormatter
from bokeh.plotting import figure, show, output_file
from datetime import datetime
from math import radians
from pytz import timezone

raw_data = pd.read_csv('./banpei_test/abnormal_frequency.csv')
raw_data = np.array(raw_data['data'])
data = []
results = []


def get_new_data():
    """Move the next raw sample into `data`; return False when none are left."""
    global raw_data
    if len(raw_data) == 0:
        return False
    data.append(raw_data[0])
    raw_data = raw_data[1:]
    return True


def update_data():
    # Stop updating once the input series is exhausted
    if not get_new_data():
        return
    ret = model.stream_detect(data)
    results.append(ret)
    now = datetime.now(tz=timezone("Asia/Tokyo"))
    new_data = dict(x=[now], y=[data[-1]], ret=[results[-1]])
    source.stream(new_data, rollover=500)

# Create Data Source
source = ColumnDataSource(dict(x=[], y=[], ret=[]))


# Create Banpei instance
model = banpei.SST(w=30)

# Draw a graph
fig = figure(x_axis_type="datetime",
             x_axis_label="Datetime",
             plot_width=950,
             plot_height=650)
fig.title.text = "Realtime monitoring with Banpei"
fig.line(source=source, x='x', y='y', line_width=2, alpha=.85, color='blue', legend_label='observed data')
fig.line(source=source, x='x', y='ret', line_width=2, alpha=.85, color='red', legend_label='change-point score')
fig.circle(source=source, x='x', y='y', line_width=2, line_color='blue', color='blue')
fig.legend.location = "top_left"

# Configuration of the axis (renamed from `format` to avoid shadowing the builtin)
time_format = "%Y-%m-%d-%H-%M-%S"
fig.xaxis.formatter = DatetimeTickFormatter(
    seconds=[time_format],
    minsec =[time_format],
    minutes=[time_format],
    hourmin=[time_format],
    hours  =[time_format],
    days   =[time_format],
    months =[time_format],
    years  =[time_format]
)
fig.xaxis.major_label_orientation=radians(90)

# Configuration of the callback (only runs when the script is served by a Bokeh server)
curdoc().add_root(fig)
curdoc().add_periodic_callback(update_data, 1000) #ms

#output_file("fig.html")

#show(fig)

print(ColumnDataSource)