# Time Series Analysis

This section treats the dataset as a time series problem. The proposed methodology is to:
* Convert the label column from a continuous value type into a discrete one.
* Time-shift the provided datasets with varied 'lag' values.
* Combine all 3 matrices into a single agglomerated matrix of 77 + 162 + 1179 columns (1418). The duplicated 'SNAP_ID' columns are reduced to a single one, leaving 1416 columns.
* Slice the agglomerated matrix into features and labels.
* Split the data into train/validation/test sets.
* Perform feature selection on the agglomerated matrix, dropping redundant features and finding the optimum number of features (multivariate analysis through a wrapper approach).
* Feed the dataset into a number of machine learning models.
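
The wrapper-based feature selection step mentioned in the list can be sketched as follows. This is only an illustration under assumptions: the toy data, the decision tree estimator, the use of scikit-learn's `RFE` and the target of 3 features are all hypothetical choices, not the selection procedure used in this notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the merged feature matrix: 60 rows, 8 features (hypothetical data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 8))
# Hypothetical label that depends only on features 0 and 3.
y_toy = (X_toy[:, 0] + X_toy[:, 3] > 0).astype(int)

# Wrapper approach: repeatedly fit the model and drop the weakest feature
# (step=1) until only n_features_to_select features remain.
selector = RFE(estimator=DecisionTreeClassifier(random_state=0),
               n_features_to_select=3, step=1)
selector.fit(X_toy, y_toy)

X_selected = selector.transform(X_toy)
print(selector.support_)  # boolean mask of the retained columns
print(X_selected.shape)   # (60, 3)
```

When the optimum number of features is not known in advance, a cross-validated variant such as `RFECV` may be a better fit than a fixed `n_features_to_select`.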

Applicable links:
* https://machinelearningmastery.com/how-to-scale-data-for-long-short-term-memory-networks-in-python/
* https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
* https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
* https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/

In [155]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [156]:
tpcds='TPCDS1' # Schema upon which to operate test
debug_mode=True # Determines whether to plot graphs or not, useful for development purposes 
low_quartile_limit = 0 # Lower Quartile threshold to detect outliers
upper_quartile_limit = 1 # Upper Quartile threshold to detect outliers
lag=0 # Time Series shift / Lag Step. Each lag value equates to 1 minute
test_split=.3 # Denotes which Data Split to operate under when it comes to training / validation
y_label = 'CPU_TIME_DELTA' # Denotes which label to use for time series experiments
#
# Open Data
rep_hist_snapshot_path = 'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/' + tpcds + '/rep_hist_snapshot.csv'
rep_hist_sysmetric_summary_path = 'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/' + tpcds + '/rep_hist_sysmetric_summary.csv'
rep_hist_sysstat_path = 'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/' + tpcds + '/rep_hist_sysstat.csv'
#
rep_hist_snapshot_df = pd.read_csv(rep_hist_snapshot_path)
rep_hist_sysmetric_summary_df = pd.read_csv(rep_hist_sysmetric_summary_path)
rep_hist_sysstat_df = pd.read_csv(rep_hist_sysstat_path)
#
def prettify_header(headers):
    """
    Cleans header list from unwanted character strings
    """
    # Strip brackets, quotes and commas from each header name
    return [header.replace("(","").replace(")","").replace("'","").replace(",","") for header in headers]
#
rep_hist_snapshot_df.columns = prettify_header(rep_hist_snapshot_df.columns.values)
rep_hist_sysmetric_summary_df.columns = prettify_header(rep_hist_sysmetric_summary_df.columns.values)
rep_hist_sysstat_df.columns = prettify_header(rep_hist_sysstat_df.columns.values)
#
print('Header Lengths [Before Pivot]')
print('REP_HIST_SNAPSHOT: ' + str(len(rep_hist_snapshot_df.columns)))
print('REP_HIST_SYSMETRIC_SUMMARY: ' + str(len(rep_hist_sysmetric_summary_df.columns)))
print('REP_HIST_SYSSTAT: ' + str(len(rep_hist_sysstat_df.columns)))
#
# Table REP_HIST_SYSMETRIC_SUMMARY
rep_hist_sysmetric_summary_df = rep_hist_sysmetric_summary_df.pivot(index='SNAP_ID', columns='METRIC_NAME', values='AVERAGE')
rep_hist_sysmetric_summary_df.reset_index(inplace=True)
rep_hist_sysmetric_summary_df[['SNAP_ID']] = rep_hist_sysmetric_summary_df[['SNAP_ID']].astype(int)
rep_hist_sysmetric_summary_df.sort_values(by=['SNAP_ID'],inplace=True,ascending=True)
#
# Table REP_HIST_SYSSTAT
rep_hist_sysstat_df = rep_hist_sysstat_df.pivot(index='SNAP_ID', columns='STAT_NAME', values='VALUE')
rep_hist_sysstat_df.reset_index(inplace=True)
rep_hist_sysstat_df[['SNAP_ID']] = rep_hist_sysstat_df[['SNAP_ID']].astype(int)
rep_hist_sysstat_df.sort_values(by=['SNAP_ID'],inplace=True,ascending=True)
#
# Refreshing columns with pivoted columns
def convert_list_to_upper(col_list):
    """
    Takes a list of strings and converts each element to upper case
    """
    return [col.upper() for col in col_list]
#
rep_hist_sysmetric_summary_df.rename(str.upper, inplace=True, axis='columns')
rep_hist_sysstat_df.rename(str.upper, inplace=True, axis='columns')
#
# Group By Values by SNAP_ID , sum all metrics (for table REP_HIST_SNAPSHOT)
rep_hist_snapshot_df = rep_hist_snapshot_df.groupby(['SNAP_ID']).sum()
rep_hist_snapshot_df.reset_index(inplace=True)
#
print('\nHeader Lengths [After Pivot]')
print('REP_HIST_SNAPSHOT: ' + str(len(rep_hist_snapshot_df.columns)))
print('REP_HIST_SYSMETRIC_SUMMARY: ' + str(len(rep_hist_sysmetric_summary_df.columns)))
print('REP_HIST_SYSSTAT: ' + str(len(rep_hist_sysstat_df.columns)))
#
# DF Shape
print('\nDataframe shapes:\nTable [REP_HIST_SNAPSHOT] - ' + str(rep_hist_snapshot_df.shape))
print('Table [REP_HIST_SYSMETRIC_SUMMARY] - ' + str(rep_hist_sysmetric_summary_df.shape))
print('Table [REP_HIST_SYSSTAT] - ' + str(rep_hist_sysstat_df.shape))

Header Lengths [Before Pivot]
REP_HIST_SNAPSHOT: 88
REP_HIST_SYSMETRIC_SUMMARY: 26
REP_HIST_SYSSTAT: 16

Header Lengths [After Pivot]
REP_HIST_SNAPSHOT: 77
REP_HIST_SYSMETRIC_SUMMARY: 162
REP_HIST_SYSSTAT: 1179

Dataframe shapes:
Table [REP_HIST_SNAPSHOT] - (172, 77)
Table [REP_HIST_SYSMETRIC_SUMMARY] - (172, 162)
Table [REP_HIST_SYSSTAT] - (172, 1179)


### Merging Frames

This part merges the following pandas data frames into a single frame:
* REP_HIST_SNAPSHOT
* REP_HIST_SYSMETRIC_SUMMARY
* REP_HIST_SYSSTAT

In addition, this step isolates the label column from the remainder of the feature matrix.

In [157]:
df = pd.merge(rep_hist_snapshot_df, rep_hist_sysmetric_summary_df, on='SNAP_ID')
df = pd.merge(df, rep_hist_sysstat_df, on='SNAP_ID')
print(df.shape)
#
y_df = df[[y_label]]
X_df = df.drop(columns=[y_label])
print("Label [" + y_label + "] shape: " + str(y_df.shape))
print("Feature matrix shape: " + str(X_df.shape))

(172, 1416)
Label [CPU_TIME_DELTA] shape: (172, 1)
Feature matrix shape: (172, 1415)
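
The merged width follows from the column counts printed earlier: 77 + 162 + 1179 = 1418, minus the two repeated 'SNAP_ID' key columns, gives 1416. The sketch below reproduces that arithmetic on toy frames. The frame contents and the column names 'METRIC_A' and 'STAT_B' are made up for illustration, and `validate='one_to_one'` is an optional safeguard that the cell above does not use.

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical values).
left = pd.DataFrame({'SNAP_ID': [1, 2, 3], 'CPU_TIME_DELTA': [10, 20, 30]})
middle = pd.DataFrame({'SNAP_ID': [1, 2, 3], 'METRIC_A': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'SNAP_ID': [1, 2, 4], 'STAT_B': [7, 8, 9]})

# validate='one_to_one' raises if SNAP_ID is duplicated on either side.
merged = pd.merge(left, middle, on='SNAP_ID', validate='one_to_one')
merged = pd.merge(merged, right, on='SNAP_ID', validate='one_to_one')

# 2 + 2 + 2 columns, minus the two repeated SNAP_ID keys = 4 columns.
# SNAP_ID 3 and 4 are dropped because the default join type is 'inner'.
print(merged.shape)          # (2, 4)
print(list(merged.columns))  # ['SNAP_ID', 'CPU_TIME_DELTA', 'METRIC_A', 'STAT_B']
```

Because the default join is an inner join, any SNAP_ID missing from one table silently removes that row, so comparing row counts before and after the merge is a cheap sanity check.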


### Time Series Shifting

Shifts the datasets by N lag steps in order to frame the problem as a supervised learning dataset. Each lag step equates to 60 seconds (due to the design of the data capturing tool). For each lag step, one leading row lacks a complete history and holds NaN values after shifting; these rows are stripped only when 'dropnan' is set to True.

In [158]:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=False):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    # Wrap the input in a DataFrame so that lists and NumPy arrays also support shift()
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
#
shape = X_df.shape
X_df = series_to_supervised(data=X_df,n_in=lag)
print(type(X_df))
print('Before: ' + str(shape) + '\nAfter: ' + str(X_df.shape))
print(X_df.columns[:5])

<class 'pandas.core.frame.DataFrame'>
Before: (172, 1415)
After: (172, 1415)
Index(['var1(t)', 'var2(t)', 'var3(t)', 'var4(t)', 'var5(t)'], dtype='object')
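
With 'lag' set to 0 the shift leaves the row count unchanged, as the output above shows. The following sketch shows what a non-zero lag does on a toy frame. The column names 'f1', 'f2' and 'y' and the lag of 2 are hypothetical, and the index-based label realignment is one possible approach rather than a step taken in this notebook.

```python
import pandas as pd

# Toy feature and label frames sharing the same row index (hypothetical values).
features = pd.DataFrame({'f1': [1, 2, 3, 4, 5], 'f2': [10, 20, 30, 40, 50]})
labels = pd.DataFrame({'y': [100, 200, 300, 400, 500]})
toy_lag = 2  # hypothetical lag, i.e. 2 minutes of history per row

# Lagged copies (t-2, t-1) followed by the current observation (t).
frames = [features.shift(i).add_suffix('(t-%d)' % i) for i in range(toy_lag, 0, -1)]
frames.append(features.add_suffix('(t)'))
lagged = pd.concat(frames, axis=1)

# The first toy_lag rows have no full history (NaN), so they are dropped,
# and the labels are realigned on the surviving index.
lagged = lagged.dropna()
labels_aligned = labels.loc[lagged.index]

print(lagged.shape)                  # (3, 6)
print(labels_aligned['y'].tolist())  # [300, 400, 500]
```

If rows are dropped from the feature matrix, the label frame needs the same rows removed, otherwise features and labels no longer line up.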


### Continuous to Discrete Conversion

This section converts the continuous 'y_label' column into a discrete version. Values are binned into 10 categories, each covering 10% of the range between 0 and the column maximum:
* 0  - 10
* 11 - 20
* 21 - 30
* 31 - 40
* 41 - 50
* 51 - 60
* 61 - 70
* 71 - 80
* 81 - 90
* 91 - 100

In [159]:
def discretize_label(df=None, bin_total=10):
    """
    Converts a single-column pandas dataframe into a range of bins (converts data from continuous to discrete).
    Bin i covers the interval (interval * i, interval * (i+1)], measured from 0 up to the column maximum.
    """
    if df is None:
        raise ValueError('Dataframe was not specified!')
    if bin_total < 1:
        raise ValueError('Bin Amounts must be at least 1!')
    #
    # Operate on the dataframe passed in, not on the global y_df
    label_name = df.columns[0]
    label_max, label_min = df[label_name].max(), 0
    interval = (label_max - label_min) / bin_total
    print(label_name + ' min: ' + str(label_min))
    print(label_name + ' max: ' + str(label_max))
    discrete_bins = []
    for val in df[label_name].values:
        val = float(val)
        # Fallback bin, so that every row receives a bin and the row count is preserved:
        # values at or below the lower bound go to the first bin, anything else to the last
        bin_index = 0 if val <= label_min else bin_total - 1
        for i in range(bin_total):
            if (val > (interval * i)) and (val <= (interval * (i+1))):
                bin_index = i
                break
        discrete_bins.append(bin_index)
    return pd.DataFrame(data=discrete_bins,columns=df.columns)
#
print("Label shape before discretization: " + str(y_df.shape))
y_df = discretize_label(df=y_df, bin_total=10)
print("Label shape after discretization: " + str(y_df.shape))
print(y_df[y_label].unique())

Label shape before discretization: (172, 1)
CPU_TIME_DELTA min: 0
CPU_TIME_DELTA max: 1266827638
Label shape after discretization: (172, 1)
[0 1 8 3 5 2 7 6 9 4]
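
As a cross-check, the same equal-width binning can be sketched with `pd.cut`. This is an alternative formulation and not the function used above; the toy values are invented, and `include_lowest=True` is an assumption about how a value of exactly 0 should be treated (it is placed in the first bin).

```python
import numpy as np
import pandas as pd

# Toy label column (hypothetical values); the real column is CPU_TIME_DELTA.
toy = pd.DataFrame({'CPU_TIME_DELTA': [5, 100, 250, 999, 1000]})
bin_total = 10

# bin_total equal-width bins between 0 and the column maximum.
edges = np.linspace(0, toy['CPU_TIME_DELTA'].max(), bin_total + 1)

# labels=False returns the integer bin index instead of an interval label.
# Bins are right-closed, so 100 falls into bin 0 and 1000 into bin 9.
toy['bin'] = pd.cut(toy['CPU_TIME_DELTA'], bins=edges,
                    labels=False, include_lowest=True)

print(toy['bin'].tolist())  # [0, 0, 2, 9, 9]
```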


### Train / Validation Split

In [161]:
X_train, X_validate, y_train, y_validate = train_test_split(X_df, y_df, test_size=test_split)
print("X_train shape [" + str(X_train.shape) + "]")
print("X_validate shape [" + str(X_validate.shape) + "]")
print("y_train shape [" + str(y_train.shape) + "]")
print("y_validate shape [" + str(y_validate.shape) + "]")

X_train shape [(120, 1415)]
X_validate shape [(52, 1415)]
y_train shape [(120, 1)]
y_validate shape [(52, 1)]
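
Note that `train_test_split` shuffles rows by default, so the split above does not preserve the time order of the snapshots. A minimal sketch of an order-preserving three-way split is given below. The toy arrays and the 60/20/20 proportions are assumptions chosen for illustration, not the configuration used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data in time order: 10 observations, 2 features (hypothetical values).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# shuffle=False keeps the original order: the earliest rows are used for
# training and the most recent rows are held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.4, shuffle=False)

# Split the held-out rows again into validation and test halves.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=False)

print(y_train.tolist())  # [0, 1, 2, 3, 4, 5]
print(y_val.tolist())    # [6, 7]
print(y_test.tolist())   # [8, 9]
```

Keeping the order intact means the model is always validated on snapshots that come after the ones it was trained on, which is closer to how a forecast would be used.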
