# Schedule Top Consumer Profiling

This notebook deals with outlier detection in the SQL domain. The aim is to detect high consumers within a database system, flagging those SQL statements that stand out in terms of the computation time and resources required to execute them.

Due to the high dimensionality of the available data points, unsupervised machine learning techniques are applied to this problem, so as to isolate data anomalies and flag them as potential bottlenecks.

### Module Installation and Importing Libraries

In [1]:
# scipy
import scipy as sc
print('scipy: %s' % sc.__version__)
# numpy
import numpy as np
print('numpy: %s' % np.__version__)
# matplotlib
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot
# pandas
import pandas as pd
print('pandas: %s' % pd.__version__)
# scikit-learn
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score
from sklearn.ensemble import IsolationForest

scipy: 1.1.0
numpy: 1.15.2
pandas: 0.23.4


In [2]:
#
# Experiment Config
tpcds='TPCDS1' # Schema upon which to operate test
y_label = ['CPU_TIME_DELTA','OPTIMIZER_COST','EXECUTIONS_DELTA','ELAPSED_TIME_DELTA']
y_label2 = ['COST','CARDINALITY','BYTES','IO_COST','TEMP_SPACE','TIME']

### Read data from file into pandas dataframes

In [3]:
#
# Open Data
# rep_hist_snapshot_path = 'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/' + tpcds + '/rep_hist_snapshot.csv'
# rep_vsql_plan_path = 'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/' + tpcds + '/rep_vsql_plan.csv'
rep_hist_snapshot_path = 'D:/Projects/Datagenerated_ICS5200/Schedule/' + tpcds + '/rep_hist_snapshot.csv'
rep_vsql_plan_path = 'D:/Projects/Datagenerated_ICS5200/Schedule/' + tpcds + '/rep_vsql_plan.csv'
#
rep_hist_snapshot_df = pd.read_csv(rep_hist_snapshot_path)
rep_vsql_plan_df = pd.read_csv(rep_vsql_plan_path)
#
def prettify_header(headers):
    """
    Cleans header list from unwanted character strings
    """
    # Strip parentheses, quotes and commas from every header
    return [header.replace("(","").replace(")","").replace("'","").replace(",","") for header in headers]
#
rep_hist_snapshot_df.columns = prettify_header(rep_hist_snapshot_df.columns.values)
rep_vsql_plan_df.columns = prettify_header(rep_vsql_plan_df.columns.values)
print(rep_hist_snapshot_df.columns)
print('------------------------------------------')
print(rep_vsql_plan_df.columns)

FileNotFoundError: File b'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/TPCDS1/rep_hist_snapshot.csv' does not exist
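As an illustration of the header clean-up above, here is a minimal sketch. The raw header strings are an assumption (headers that look like stringified tuples); the real CSV headers are not shown in this notebook.

```python
def prettify_header(headers):
    """Strip parentheses, quotes and commas from each header."""
    cleaned = []
    for header in headers:
        for ch in "()',":
            header = header.replace(ch, "")
        cleaned.append(header)
    return cleaned

# Hypothetical raw headers, assumed to resemble stringified tuples
raw_headers = ["('SNAP_ID',)", "('PLAN_HASH_VALUE',)"]
clean_headers = prettify_header(raw_headers)
print(clean_headers)
```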

### Pivoting tables and changing matrix shapes

Brings both dataframes into a similar shape, in which each snapshot key (SNAP_ID for REP_HIST_SNAPSHOT, TIMESTAMP for REP_VSQL_PLAN) is paired with the summed metrics recorded for it.

In [None]:
print('Header Lengths [Before Pivot]')
print('REP_HIST_SNAPSHOT: ' + str(len(rep_hist_snapshot_df.columns)))
print('REP_VSQL_PLAN: ' + str(len(rep_vsql_plan_df.columns)))
#
# Group by SNAP_ID, PLAN_HASH_VALUE, DBID, INSTANCE_NUMBER and sum all metrics (for table REP_HIST_SNAPSHOT)
rep_hist_snapshot_df = rep_hist_snapshot_df.groupby(['SNAP_ID','PLAN_HASH_VALUE','DBID','INSTANCE_NUMBER']).sum()
rep_hist_snapshot_df.reset_index(inplace=True)
#
# Group by TIMESTAMP, SQL_ID, ID, DBID, CON_DBID and sum all metrics (for table REP_VSQL_PLAN)
rep_vsql_plan_df = rep_vsql_plan_df.groupby(['TIMESTAMP','SQL_ID','ID','DBID','CON_DBID']).sum()
rep_vsql_plan_df.reset_index(inplace=True)
#
print('\nHeader Lengths [After Pivot]')
print('REP_HIST_SNAPSHOT: ' + str(len(rep_hist_snapshot_df.columns)))
print('REP_VSQL_PLAN: ' + str(len(rep_vsql_plan_df.columns)) + "\n")
print(rep_hist_snapshot_df.columns)
print(rep_vsql_plan_df.columns)
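The group-and-sum step above can be sketched on a toy frame. The column values here are made up for illustration; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for REP_HIST_SNAPSHOT (values are invented)
toy = pd.DataFrame({
    'SNAP_ID': [1, 1, 2],
    'PLAN_HASH_VALUE': [10, 10, 10],
    'CPU_TIME_DELTA': [5.0, 7.0, 3.0],
})

# Rows sharing the same key are collapsed into one row of summed metrics
grouped = toy.groupby(['SNAP_ID', 'PLAN_HASH_VALUE']).sum().reset_index()
print(grouped)
```

The two rows for snapshot 1 collapse into one, so the frame shrinks from three rows to two while keeping the same columns.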

### Dealing with empty values

In [None]:
def get_na_columns(df, headers):
    """
    Return columns which contain NaN values
    """
    na_list = []
    for head in headers:
        if df[head].isnull().values.any():
            na_list.append(head)
    return na_list
#
print('N/A Columns\n')
print('\n REP_HIST_SNAPSHOT Features ' + str(len(rep_hist_snapshot_df.columns)) + ': ' + str(get_na_columns(df=rep_hist_snapshot_df,headers=rep_hist_snapshot_df.columns)) + "\n")
print('REP_VSQL_PLAN Features ' + str(len(rep_vsql_plan_df.columns)) + ': ' + str(get_na_columns(df=rep_vsql_plan_df,headers=rep_vsql_plan_df.columns)) + "\n")
#
def fill_na(df):
    """
    Replaces NaN values with 0s
    """
    return df.fillna(0)
#
# Populating NaN values with amount '0'
rep_hist_snapshot_df = fill_na(df=rep_hist_snapshot_df)
rep_vsql_plan_df = fill_na(df=rep_vsql_plan_df)
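A more compact way to list the columns that contain missing values is sketched below, under the assumption that a vectorised pandas expression is acceptable here. The toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (values are invented)
toy = pd.DataFrame({'COST': [1.0, np.nan, 3.0], 'BYTES': [10.0, 20.0, 30.0]})

# Boolean mask per column, used to index the column labels
na_columns = toy.columns[toy.isnull().any()].tolist()

# Same fill strategy as the notebook: missing values become 0
filled = toy.fillna(0)
print(na_columns)
print(filled)
```

Filling with 0 is the notebook's choice; it pulls column means towards zero, which may matter for the threshold-based validation later on.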

### Data Ordering

Sorting of datasets in order of:

* REP_HIST_SNAPSHOT - SNAP_ID
* REP_VSQL_PLAN - TIMESTAMP, SQL_ID, ID

In [None]:
rep_hist_snapshot_df.sort_values(by=['SNAP_ID'], ascending=True, inplace=True)
rep_vsql_plan_df.sort_values(by=['TIMESTAMP','SQL_ID','ID'], ascending=True, inplace=True)

In [None]:
# def encode(df, features):
#     encoder_dict={} # Used to keep track of respective encoders, in case it is required to decoded labels further down the line
#     for f in features:
#         for col in df.columns:
#             col = str(col)
#             if col.lower() == f.lower()
#                 le = preprocessing.LabelEncoder()
#                 df[col].values = le.fit_transform(df[col].values)
#                 encoder_dict[col] = le
#     return df, le
# #
# encoded_labels_hist_snapshot = []
# encoded_labels_vsql_plan = ['OPERATION',
#                             'OPTIONS']

### Floating point precision conversion

Each column is cast to 32-bit floating point, and columns that cannot be cast (for example, free-text columns) are dropped. All values are then rounded to 3 decimal places.

In [None]:
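for col in list(rep_hist_snapshot_df.columns):
    try:
        # astype returns a new Series, so the result has to be assigned back
        rep_hist_snapshot_df[col] = rep_hist_snapshot_df[col].astype('float32')
    except (ValueError, TypeError):
        rep_hist_snapshot_df.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')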
#
print('-------------------------------------------------------------')
#
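for col in list(rep_vsql_plan_df.columns):
    try:
        # astype returns a new Series, so the result has to be assigned back
        rep_vsql_plan_df[col] = rep_vsql_plan_df[col].astype('float32')
    except (ValueError, TypeError):
        rep_vsql_plan_df.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')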
#
rep_hist_snapshot_df = np.round(rep_hist_snapshot_df, 3) # rounds to 3 dp
rep_vsql_plan_df = np.round(rep_vsql_plan_df, 3) # rounds to 3 dp
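The cast-or-drop logic can be checked on a small example. This is a sketch on invented data: it assumes that numeric columns may arrive as strings and that non-numeric columns should be discarded, as in the cell above.

```python
import pandas as pd

# Toy frame: one numeric column stored as text, one genuinely textual column
toy = pd.DataFrame({'COST': ['1.23456', '2.5'], 'OPERATION': ['SELECT', 'SORT']})

for col in list(toy.columns):
    try:
        # astype does not modify in place; assign the converted Series back
        toy[col] = toy[col].astype('float32')
    except (ValueError, TypeError):
        # Column cannot be represented as a float, so it is dropped
        toy = toy.drop(columns=col)

toy = toy.round(3)
print(toy.dtypes)
print(toy)
```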

### Feature Selection

In this step, redundant features are dropped. A feature is considered redundant if it exhibits a standard deviation of 0 (meaning its value never changes). A fixed set of identifier columns is dropped as well.

In [None]:
print('Before')
print(rep_hist_snapshot_df.shape)
print(rep_vsql_plan_df.shape)
#
def drop_flatline_columns(df):
    columns = df.columns
    flatline_features = []
    for i in range(len(columns)):
        try:
            std = df[columns[i]].std()
            if std == 0:
                flatline_features.append(columns[i])
        except:
            pass
    #
    #print('Features which are considered flatline:\n')
    #for col in flatline_features:
    #    print(col)
    print('\nShape before changes: [' + str(df.shape) + ']')
    df = df.drop(columns=flatline_features)
    print('Shape after changes: [' + str(df.shape) + ']')
    print('Dropped a total [' + str(len(flatline_features)) + ']')
    return df
#
rep_hist_snapshot_df = drop_flatline_columns(df=rep_hist_snapshot_df)
rep_vsql_plan_df = drop_flatline_columns(df=rep_vsql_plan_df)
#
dropped_columns_rep_hist_snapshot = ['SNAP_ID',
                                       'PLAN_HASH_VALUE',
                                       'OPTIMIZER_ENV_HASH_VALUE',
                                       'LOADED_VERSIONS',
                                       'VERSION_COUNT',
                                       'PARSING_SCHEMA_ID',
                                       'PARSING_USER_ID']
dropped_columns_rep_vsql_plan = ['PLAN_HASH_VALUE',
                                 'ID',
                                 'OBJECT#',
                                 'PARENT_ID',
                                 'SEARCH_COLUMNS']
rep_hist_snapshot_df.drop(columns=dropped_columns_rep_hist_snapshot, inplace=True)
rep_vsql_plan_df.drop(columns=dropped_columns_rep_vsql_plan, inplace=True)
#
print('After')
print(rep_hist_snapshot_df.shape)
print(rep_vsql_plan_df.shape)
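The zero-variance filter can also be written without an explicit loop. The following is a sketch on an invented frame, assuming all columns are numeric at this point (which the float conversion step is meant to guarantee).

```python
import pandas as pd

# Toy frame: DBID never changes, COST does (values are invented)
toy = pd.DataFrame({'DBID': [7, 7, 7], 'COST': [1.0, 2.0, 9.0]})

# Per-column standard deviation; a value of 0 marks a flatline feature
stds = toy.std()
reduced = toy.loc[:, stds != 0]
print(reduced.columns.tolist())
```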

### Gauging Outliers (REP_HIST_SNAPSHOT)

The following labels are plotted so as to showcase the presence of outliers:
* CPU_TIME_DELTA
* OPTIMIZER_COST
* EXECUTIONS_DELTA
* ELAPSED_TIME_DELTA

In [None]:
#y_label = ['CPU_TIME_DELTA','OPTIMIZER_COST','EXECUTIONS_DELTA','ELAPSED_TIME_DELTA']
plt.rcParams['figure.figsize'] = [20, 15]
rep_hist_snapshot_df.plot.scatter(x='CPU_TIME_DELTA',
                                  y='OPTIMIZER_COST',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='CPU_TIME_DELTA',
                                  y='EXECUTIONS_DELTA',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='CPU_TIME_DELTA',
                                  y='ELAPSED_TIME_DELTA',
                                  c='DarkBlue')
plt.show()
print('--------------------------------------------------------')
rep_hist_snapshot_df.plot.scatter(x='OPTIMIZER_COST',
                                  y='CPU_TIME_DELTA',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='OPTIMIZER_COST',
                                  y='EXECUTIONS_DELTA',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='OPTIMIZER_COST',
                                  y='ELAPSED_TIME_DELTA',
                                  c='DarkBlue')
plt.show()
print('--------------------------------------------------------')
rep_hist_snapshot_df.plot.scatter(x='EXECUTIONS_DELTA',
                                  y='CPU_TIME_DELTA',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='EXECUTIONS_DELTA',
                                  y='OPTIMIZER_COST',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='EXECUTIONS_DELTA',
                                  y='ELAPSED_TIME_DELTA',
                                  c='DarkBlue')
plt.show()
print('--------------------------------------------------------')
rep_hist_snapshot_df.plot.scatter(x='ELAPSED_TIME_DELTA',
                                  y='CPU_TIME_DELTA',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='ELAPSED_TIME_DELTA',
                                  y='OPTIMIZER_COST',
                                  c='DarkBlue')
rep_hist_snapshot_df.plot.scatter(x='ELAPSED_TIME_DELTA',
                                  y='EXECUTIONS_DELTA',
                                  c='DarkBlue')
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = [20, 15]
plt.boxplot(rep_hist_snapshot_df['CPU_TIME_DELTA'].values)
plt.title('CPU_TIME_DELTA')
plt.show()
plt.boxplot(rep_hist_snapshot_df['OPTIMIZER_COST'].values)
plt.title('OPTIMIZER_COST')
plt.show()
plt.boxplot(rep_hist_snapshot_df['EXECUTIONS_DELTA'].values)
plt.title('EXECUTIONS_DELTA')
plt.show()
plt.boxplot(rep_hist_snapshot_df['ELAPSED_TIME_DELTA'].values)
plt.title('ELAPSED_TIME_DELTA')
plt.show()
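The points that a box plot draws beyond its whiskers can also be counted directly. The sketch below applies the 1.5 x IQR rule, which is matplotlib's default whisker setting, to an invented sample; it is an illustration rather than part of the notebook's pipeline.

```python
import numpy as np

# Invented sample with one obvious high consumer
values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the limits are what the box plot shows as fliers
fliers = values[(values < lower) | (values > upper)]
print(fliers)
```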

In [None]:
limit = 100
label = 'CPU_TIME_DELTA'
rep_hist_snapshot_df2 = rep_hist_snapshot_df.sort_values(by=label, ascending=False)
rep_hist_snapshot_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'OPTIMIZER_COST'
rep_hist_snapshot_df2 = rep_hist_snapshot_df.sort_values(by=label, ascending=False)
rep_hist_snapshot_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'EXECUTIONS_DELTA'
rep_hist_snapshot_df2 = rep_hist_snapshot_df.sort_values(by=label, ascending=False,)
rep_hist_snapshot_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'ELAPSED_TIME_DELTA'
rep_hist_snapshot_df2 = rep_hist_snapshot_df.sort_values(by=label, ascending=False)
rep_hist_snapshot_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
plt.show()

### Gauging Outliers (REP_VSQL_PLAN)

The following labels are plotted so as to showcase the presence of outliers:
* COST
* CARDINALITY
* BYTES
* CPU_COST
* IO_COST
* TEMP_SPACE
* TIME

In [None]:
limit = 100
label = 'COST'
rep_vsql_plan_df2 = rep_vsql_plan_df.sort_values(by=label, ascending=False)
rep_vsql_plan_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'CARDINALITY'
rep_vsql_plan_df2 = rep_vsql_plan_df.sort_values(by=label, ascending=False)
rep_vsql_plan_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'BYTES'
rep_vsql_plan_df2 = rep_vsql_plan_df.sort_values(by=label, ascending=False)
rep_vsql_plan_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'IO_COST'
rep_vsql_plan_df2 = rep_vsql_plan_df.sort_values(by=label, ascending=False)
rep_vsql_plan_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'TEMP_SPACE'
rep_vsql_plan_df2 = rep_vsql_plan_df.sort_values(by=label, ascending=False)
rep_vsql_plan_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
label = 'TIME'
rep_vsql_plan_df2 = rep_vsql_plan_df.sort_values(by=label, ascending=False)
rep_vsql_plan_df2[[label]][0:limit].plot(kind='bar', title =label, figsize=(20, 15), legend=True, fontsize=12)
plt.show()

In [None]:
# y_label2 = ['COST','CARDINALITY','BYTES','IO_COST','TEMP_SPACE','TIME']
plt.rcParams['figure.figsize'] = [20, 15]
rep_vsql_plan_df.plot.scatter(x='COST',
                              y='CARDINALITY',
                              c='DarkBlue')
rep_vsql_plan_df.plot.scatter(x='COST',
                              y='BYTES',
                              c='DarkBlue')
rep_vsql_plan_df.plot.scatter(x='COST',
                              y='IO_COST',
                              c='DarkBlue')
rep_vsql_plan_df.plot.scatter(x='COST',
                              y='TEMP_SPACE',
                              c='DarkBlue')
rep_vsql_plan_df.plot.scatter(x='COST',
                              y='TIME',
                              c='DarkBlue')
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = [20, 15]
plt.boxplot(rep_vsql_plan_df['COST'].values)
plt.title('COST')
plt.show()
plt.boxplot(rep_vsql_plan_df['CARDINALITY'].values)
plt.title('CARDINALITY')
plt.show()
plt.boxplot(rep_vsql_plan_df['BYTES'].values)
plt.title('BYTES')
plt.show()
plt.boxplot(rep_vsql_plan_df['IO_COST'].values)
plt.title('IO_COST')
plt.show()
plt.boxplot(rep_vsql_plan_df['TEMP_SPACE'].values)
plt.title('TEMP_SPACE')
plt.show()
plt.boxplot(rep_vsql_plan_df['TIME'].values)
plt.title('TIME')
plt.show()

### K-Means Clustering (K=2)

This section attempts to cluster the data so as to separate inliers from outliers. The initial attempt targets K=2, and the resulting centroids are plotted over the data to gauge how effective the clustering is.

In [None]:
def get_col_pos(df, target_label):
    """
    Iterates over column, and retrieves position of col in dataset
    """
    columns = df.columns
    index = -1
    for i in range(0,len(columns)):
        if columns[i].lower() == target_label.lower():
            index = i
            break
    return index
#
K = 2
kmeans_hist = KMeans(n_clusters=K, random_state=0).fit(rep_hist_snapshot_df.values)
print(kmeans_hist)
print(kmeans_hist.labels_)
unique, counts = np.unique(kmeans_hist.labels_, return_counts=True)
print('Unique: ' + str(unique))
print('Counts: ' + str(counts))
print(kmeans_hist.cluster_centers_)
#
plt.rcParams['figure.figsize'] = [20, 15]
##################################
plt.scatter(x=rep_hist_snapshot_df.values[:,get_col_pos(rep_hist_snapshot_df, 'ELAPSED_TIME_DELTA')],
            y=rep_hist_snapshot_df.values[:,get_col_pos(rep_hist_snapshot_df, 'CPU_TIME_DELTA')],
            c='b')
plt.scatter(x=kmeans_hist.cluster_centers_[:,get_col_pos(rep_hist_snapshot_df, 'ELAPSED_TIME_DELTA')],
            y=kmeans_hist.cluster_centers_[:,get_col_pos(rep_hist_snapshot_df, 'CPU_TIME_DELTA')],
            c='r')
plt.title('ELAPSED_TIME_DELTA vs CPU_TIME_DELTA')
plt.xlabel('ELAPSED_TIME_DELTA')
plt.ylabel('CPU_TIME_DELTA')
plt.show()
##################################
plt.scatter(x=rep_hist_snapshot_df.values[:,get_col_pos(rep_hist_snapshot_df, 'ELAPSED_TIME_DELTA')],
            y=rep_hist_snapshot_df.values[:,get_col_pos(rep_hist_snapshot_df, 'OPTIMIZER_COST')],
            c='b')
plt.scatter(x=kmeans_hist.cluster_centers_[:,get_col_pos(rep_hist_snapshot_df, 'ELAPSED_TIME_DELTA')],
            y=kmeans_hist.cluster_centers_[:,get_col_pos(rep_hist_snapshot_df, 'OPTIMIZER_COST')],
            c='r')
plt.title('ELAPSED_TIME_DELTA vs OPTIMIZER_COST')
plt.xlabel('ELAPSED_TIME_DELTA')
plt.ylabel('OPTIMIZER_COST')
plt.show()
##################################
plt.scatter(x=rep_hist_snapshot_df.values[:,get_col_pos(rep_hist_snapshot_df, 'ELAPSED_TIME_DELTA')],
            y=rep_hist_snapshot_df.values[:,get_col_pos(rep_hist_snapshot_df, 'EXECUTIONS_DELTA')],
            c='b')
plt.scatter(x=kmeans_hist.cluster_centers_[:,get_col_pos(rep_hist_snapshot_df, 'ELAPSED_TIME_DELTA')],
            y=kmeans_hist.cluster_centers_[:,get_col_pos(rep_hist_snapshot_df, 'EXECUTIONS_DELTA')],
            c='r')
plt.title('ELAPSED_TIME_DELTA vs EXECUTIONS_DELTA')
plt.xlabel('ELAPSED_TIME_DELTA')
plt.ylabel('EXECUTIONS_DELTA')
plt.show()

In [None]:
#
kmeans_vsql = KMeans(n_clusters=K, random_state=0).fit(rep_vsql_plan_df.values)
print(kmeans_vsql)
print(kmeans_vsql.labels_)
unique, counts = np.unique(kmeans_vsql.labels_, return_counts=True)
print('Unique: ' + str(unique))
print('Counts: ' + str(counts))
print(kmeans_vsql.cluster_centers_)
#
plt.rcParams['figure.figsize'] = [20, 15]
##################################
plt.scatter(x=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'CARDINALITY')],
            c='b')
plt.scatter(x=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'CARDINALITY')],
            c='r')
plt.title('COST vs CARDINALITY')
plt.xlabel('COST')
plt.ylabel('CARDINALITY')
plt.show()
##################################
plt.scatter(x=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'BYTES')],
            c='b')
plt.scatter(x=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'BYTES')],
            c='r')
plt.title('COST vs BYTES')
plt.xlabel('COST')
plt.ylabel('BYTES')
plt.show()
##################################
plt.scatter(x=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'IO_COST')],
            c='b')
plt.scatter(x=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'IO_COST')],
            c='r')
plt.title('COST vs IO_COST')
plt.xlabel('COST')
plt.ylabel('IO_COST')
plt.show()
##################################
plt.scatter(x=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'TEMP_SPACE')],
            c='b')
plt.scatter(x=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'TEMP_SPACE')],
            c='r')
plt.title('COST vs TEMP_SPACE')
plt.xlabel('COST')
plt.ylabel('TEMP_SPACE')
plt.show()
##################################
plt.scatter(x=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=rep_vsql_plan_df.values[:,get_col_pos(rep_vsql_plan_df, 'TIME')],
            c='b')
plt.scatter(x=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'COST')],
            y=kmeans_vsql.cluster_centers_[:,get_col_pos(rep_vsql_plan_df, 'TIME')],
            c='r')
plt.title('COST vs TIME')
plt.xlabel('COST')
plt.ylabel('TIME')
plt.show()

In [None]:
for i in range(len(kmeans_hist.labels_)):
    if kmeans_hist.labels_[i] == 0:
        print(rep_hist_snapshot_df.iloc[i])
        break
print('----------------------------------------')
for i in range(len(kmeans_hist.labels_)):
    if kmeans_hist.labels_[i] == 1:
        print(rep_hist_snapshot_df.iloc[i])
        break
print('----------------------------------------')
print('----------------------------------------')
print('----------------------------------------')
for i in range(len(kmeans_vsql.labels_)):
    if kmeans_vsql.labels_[i] == 0:
        print(rep_vsql_plan_df.iloc[i])
        break
print('----------------------------------------')
for i in range(len(kmeans_vsql.labels_)):
    if kmeans_vsql.labels_[i] == 1:
        print(rep_vsql_plan_df.iloc[i])
        break
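To make the K=2 idea concrete, the sketch below runs K-Means on synthetic data in which a handful of points sit far from the bulk. The data, and the convention of treating the smaller cluster as the outlier group, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 50 points near the origin, 3 points far away
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
outliers = rng.normal(loc=100.0, scale=1.0, size=(3, 2))
X = np.vstack([inliers, outliers])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels, counts = np.unique(model.labels_, return_counts=True)

# Assumption: the smaller of the two clusters holds the outliers
outlier_cluster = labels[np.argmin(counts)]
flagged = np.where(model.labels_ == outlier_cluster)[0]
print(counts, flagged)
```

On real data the separation is rarely this clean, and features measured on very different scales can dominate the distance computation, so the result should be read with care.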

### Clustering Validation

To verify the success of the clustering attempts, the labels produced by the clustering need to be compared
against what are assumed to be the actual labels. These 'actual' labels are derived from a threshold on the data matrix,
namely its mean plus three standard deviations: a data vector with any value above the threshold is labelled an outlier, and an inlier otherwise.

In [None]:
class ValidateKMeans:
    """
    Wrapper class for the KMeans algorithm, so as to validate the clustering it has achieved
    """
    #
    def __init__(self, X, k):
        self.X = X
        self.k = k
        self.model = KMeans(n_clusters=self.k, random_state=0, init='k-means++',n_jobs=2)
        self.model.fit(self.X)
        self.__y_labels = self.model.labels_
        self.scorings = [] # [expected outlier count, clustered outlier count]
        print(self.model)
    #        
    def __get_threshold_vector(self):
        mean = np.mean(self.X.values)
        std = np.std(self.X.values)
        std3 = np.multiply(std, 3)
        return np.add(mean, std3)
    #
    def __calculate_expected_labels(self):
        """
        Estimates label clustering by comparing them to a threshold mean value. These labels
        will be used to gauge a scoring for the unsupervised clustering achieved by the K-Means algorithm.
        """
        mean_vect = self.__get_threshold_vector()
        mean_labels = []
        for vector in self.X.values:
            if np.greater(vector, mean_vect).any():
                mean_labels.append(1)
            else:
                mean_labels.append(0)
        return mean_labels
    #
    def outlier_score_precision(self):
        if self.scorings is None or len(self.scorings) == 0:
            raise ValueError('Scorings list is empty!')
        elif len(self.scorings) > 2:
            raise ValueError('Scorings list length is greater than 2! Must be composed of the following structure [scoring1, scoring2]')
        return self.scorings[1]/self.scorings[0]
    #
    def label_centroids(self):
        centroids = self.model.cluster_centers_
        mean_vect = self.__get_threshold_vector()
        categorized_labels = [] # [[Self_Classified_Label,Centroid_Label],[Self_Classified_Label,Centroid_Label],...]
        for i in range(len(centroids)):
            if np.greater(centroids[i], mean_vect).any():
                categorized_labels.append([1,i])
            else:
                categorized_labels.append([0,i])
        return categorized_labels
    #
    def evaluate_clusters(self):
        y = self.__calculate_expected_labels()
        yhat = []
        labelled_centroids = self.label_centroids()
        print('Labeled Centroids: ' + str(labelled_centroids))
        #
        for label in self.__y_labels:
            for x, i in labelled_centroids:
                if label == i:
                    yhat.append(x)
                    break
        #
        print('Total Clusters [' + str(self.k) + ']\nDistribution:')
        unique, counts = np.unique(y, return_counts=True)
        print('Expected Label Distribution')
        for i in range(len(unique)):
            print('Label [' + str(unique[i]) + '] -> Count [' + str(counts[i]) + ']')
            if unique[i] == 1:
                self.scorings.append(counts[i])
        unique, counts = np.unique(yhat, return_counts=True)
        print('Clustered Label Distribution')
        for i in range(len(unique)):
            print('Label [' + str(unique[i]) + '] -> Count [' + str(counts[i]) + ']')
            if unique[i] == 1:
                self.scorings.append(counts[i])
        #
        print("\n----\nAccuracy: " + str(accuracy_score(y, yhat)))
        print("Precision: " + str(precision_score(y, yhat, average='micro')))
        print("Recall: " + str(recall_score(y, yhat, average='micro')))
        print("F-Score: " + str(f1_score(y, yhat, average='micro')))
        print("Outlier Score Precision [" + str(self.outlier_score_precision()) + "]\n----")
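The reference labelling used for validation can be sketched on its own. Note that the class above computes a single mean and standard deviation over the whole matrix; the variant below computes them per column instead, which is an assumption on my part and not the notebook's exact behaviour. The data is invented.

```python
import numpy as np

# 20 identical rows, with one extreme value planted in row 5
X = np.tile([1.0, 10.0], (20, 1))
X[5, 1] = 500.0

# Per-column threshold: mean plus three standard deviations
threshold = X.mean(axis=0) + 3 * X.std(axis=0)

# A row is labelled 1 (outlier) if any of its values exceeds the threshold
expected = (X > threshold).any(axis=1).astype(int)
print(threshold, expected)
```

A per-column threshold keeps a feature with a large numeric range from setting the bar for every other feature, which is why it may be worth considering.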

### Exhausting K

This experiment iterates over a range of K values, where each K denotes the number of clusters into which the data is grouped. Each cluster is in turn categorized into one of 2 groups:
* Inliers
* Outliers

Accuracy, precision, recall and F-score metrics are used to evaluate the effectiveness of each choice of K, with each experiment executed 3 times to account for random variation in centroid positioning (initial positioning is handled by K-Means++). Because the wrapper class fixes random_state=0, the repetitions only differ if that seed is varied. The clustered outlier counts are compared to a rough, hard-coded reference, which treats a point as an outlier if any of its values lies more than three standard deviations above the mean.

An additional metric (apart from those mentioned above) is used during the evaluation of this experiment. It focuses on the number of clustered outlier points, discounting inliers altogether.

In [None]:
def exhaust_k_possibilities(df):
    """
    Exhausts a number of K options for the input pandas dataframe.
    K is increased in steps of 2, so as to speed up the search.

    :param df: pandas DataFrame
    """
    k_experiment_scorings = [] # k, score
    for k in range(2, len(df.columns), 2):
        print('Experiment start -------------[' + str(k) + ']-------------')
        experiment_scorings = []
        for i in range(3):
            validInstance = ValidateKMeans(df, k)
            validInstance.evaluate_clusters()
            experiment_scorings.append(validInstance.outlier_score_precision())
        k_experiment_scorings.append([k, sum(experiment_scorings)/len(experiment_scorings)])
        print('Experiment end -------------[' + str(k) + ']-------------')
    #
    final_score, final_k = 0,0
    for k,score in k_experiment_scorings:
        if score > final_score:
            final_k = k
            final_score = score
    print('\n\nExperiment Conclusion: K[' + str(final_k) + '] - score[' + str(final_score) + ']')
#
print('Experiment: REP_HIST_SNAPSHOT K-MEANS GRID SEARCH')
exhaust_k_possibilities(df=rep_hist_snapshot_df)
print('Experiment: REP_VSQL_PLAN K-MEANS GRID SEARCH')
exhaust_k_possibilities(df=rep_vsql_plan_df)
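The selection step at the end of the search, averaging the repeated scores per K and keeping the highest, can be sketched in isolation. The helper name `best_k` and the scores below are hypothetical and only serve to illustrate the bookkeeping.

```python
def best_k(scores_by_k):
    """Average the repeated scores for each K and return the (K, score) pair with the highest mean."""
    averaged = [(k, sum(s) / len(s)) for k, s in scores_by_k.items() if s]
    return max(averaged, key=lambda pair: pair[1])

# Made-up outlier score ratios for three repetitions per K
scores_by_k = {2: [0.5, 0.5, 0.5], 4: [0.8, 1.0, 0.9], 6: [0.2, 0.4, 0.3]}
k, score = best_k(scores_by_k)
print(k, score)
```

Since the score is a ratio of clustered to expected outlier counts, a value above 1 would mean more points were flagged than expected, so picking the maximum is a simplification.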

### Isolation Forest Outlier Detection

This section moves past K-Means clustering prediction, and attempts to detect / flag outliers using the Isolation Forest ensemble algorithm.

The cells below compute the anomaly score of each sample using the IsolationForest algorithm.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

In [None]:
#
# REP_HIST_SNAPSHOT
iforest_rep_hist_snapshot = IsolationForest(n_estimators=100, max_samples=256, contamination=0.2, random_state=0)
iforest_rep_hist_snapshot.fit(rep_hist_snapshot_df.values)
print(iforest_rep_hist_snapshot)
scores = iforest_rep_hist_snapshot.decision_function(rep_hist_snapshot_df.values)
plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50);
plt.title('Isolation Forest Scorings')
plt.show()

In [None]:
#
# REP_VSQL_PLAN
iforest_rep_vsql_plan = IsolationForest(n_estimators=100, max_samples=256, contamination=0.2, random_state=0)
iforest_rep_vsql_plan.fit(rep_vsql_plan_df.values)
print(iforest_rep_vsql_plan)
scores = iforest_rep_vsql_plan.decision_function(rep_vsql_plan_df.values)
plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50);
plt.title('Isolation Forest Scorings')
plt.show()
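Beyond plotting the score histogram, the fitted forest can flag individual rows. The sketch below uses synthetic data with one planted anomaly; the data and the contamination value are assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points near the origin plus one planted anomaly
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[12.0, 12.0]]])

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)

# decision_function: lower (negative) scores indicate more anomalous samples
scores = forest.decision_function(X)

# predict: -1 marks an outlier, 1 marks an inlier
flags = forest.predict(X)
print(scores[-1], flags[-1])
```

The rows where predict returns -1 are the candidates that would be reported as potential bottlenecks.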