# Schedule TPC-DS100 Plan Comparison (Variant to Trace)

This experiment aims to quantify the statistical recommendation technique by comparing two query streams. The query streams are denoted as follows:

* Expected Stream - A sequence of baseline query plans, against which the comparison is made.
* Variation Stream - A sequence of upcoming query plans. Queries in this stream mirror those in the Expected Stream, with a number of exceptions. These exceptions are the query variants, each of which differs to some degree from the original query in the Expected Stream.

The query variants are listed below, and are therefore eligible to be flagged during the evaluation phase:

* Query 5
* Query 10
* Query 14
* Query 18
* Query 22
* Query 27
* Query 35
* Query 36
* Query 51
* Query 67
* Query 70
* Query 77
* Query 80
* Query 86

In [1]:
# pandas
import pandas as pd
print('pandas: %s' % pd.__version__)
# numpy
import numpy as np
print('numpy: %s' % np.__version__)
# matplotlib
import matplotlib.pyplot as plt
# sklearn
import sklearn as sk
from sklearn import preprocessing
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics import f1_score, accuracy_score
# AnyTree
from anytree import Node, RenderTree, PostOrderIter
from fuzzywuzzy import process

pandas: 0.24.1
numpy: 1.16.1


### Configuration Cell

Adjust the parameters in this cell to influence the outcome of the experiment.

In [2]:
# Experiment Config
tpcds='TPCDS100' # Schema upon which to operate test
test_split=.2
y_labels = ['COST',
            'CARDINALITY',
            'BYTES',
            #'CPU_COST',
            'IO_COST',
            'TEMP_SPACE',
            'TIME']
black_list = ['TIMESTAMP',
              'SQL_ID',
              'OPERATION',
              'OPTIONS',
              'OBJECT_NAME',
              'OBJECT_OWNER',
              'OBJECT_TYPE',
              'PARTITION_STOP',
              'PARTITION_START'] # Columns which will be ignored during type conversion, and later used for aggregation
nrows = 20000
variant_ids = (5, 10, 14, 18, 22, 27, 35, 36, 51, 67, 70, 77, 80, 86)

### Read data from file into pandas dataframes

In [3]:
# Root path
base_dir = 'C:/Users/gabriel.sammut/University/'
#base_dir = 'D:/Projects/'
root_dir = base_dir + 'Data_ICS5200/Schedule/' + tpcds
src_dir = base_dir + 'ICS5200/src/sql/Runtime/TPC-DS/' + tpcds + '/Variants/'

rep_vsql_plan_path = root_dir + '/rep_vsql_plan.csv'
#rep_vsql_plan_path = root_dir + '/rep_vsql_plan.csv'

dtype={'COST':'int64',
       'CARDINALITY':'int64',
       'BYTES':'int64',
       #'CPU_COST':'int64',
       'IO_COST':'int64',
       'TEMP_SPACE':'int64',
       'TIME':'int64',
       'OPERATION':'str',
       'OBJECT_NAME':'str'}
rep_vsql_plan_df = pd.read_csv(rep_vsql_plan_path, nrows=nrows, dtype=dtype)
print(rep_vsql_plan_df.head())

def prettify_header(headers):
    """
    Cleans header list from unwanted character strings
    """
    # Strip the tuple punctuation, e.g. "('SQL_ID',)" becomes "SQL_ID"
    unwanted = ["(", ")", "'", ","]
    header_list = []
    for header in headers:
        for char in unwanted:
            header = header.replace(char, "")
        header_list.append(header)
    return header_list

rep_vsql_plan_df.columns = prettify_header(rep_vsql_plan_df.columns.values)
print('------------------------------------------')
print(rep_vsql_plan_df.columns)

    ('DBID',)    ('SQL_ID',)  ('PLAN_HASH_VALUE',)  ('ID',)    ('OPERATION',)  \
0  2634225673  2j8td2wuthnfv            1917374110        0  SELECT STATEMENT   
1  2634225673  2j8td2wuthnfv            1917374110        1      TABLE ACCESS   
2  2634225673  2j8td2wuthnfv            1917374110        2              SORT   
3  2634225673  2j8td2wuthnfv            1917374110        3      TABLE ACCESS   
4  2634225673  9nf3gy0tv9p0u            3537130676        0  SELECT STATEMENT   

  ('OPTIONS',) ('OBJECT_NODE',)  ('OBJECT#',) ('OBJECT_OWNER',)  \
0          NaN              NaN           NaN               NaN   
1         FULL              NaN        8693.0               SYS   
2    AGGREGATE              NaN           NaN               NaN   
3         FULL              NaN        8693.0               SYS   
4          NaN              NaN           NaN               NaN   

  ('OBJECT_NAME',)  ... ('ACCESS_PREDICATES',) ('FILTER_PREDICATES',)  \
0              NaN  ...              



### Read outlier data from file into pandas dataframes and concatenate
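
The cell in this section reads one CSV file per variant query and per category (`hints`, `predicates`, `rownum`), then concatenates the results. As a minimal sketch, the same pattern can be written as a loop. The helper name `load_variant_plans` and the temporary demo files are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_variant_plans(src_dir, category, query_ids):
    """Read one CSV per query id and stack them into a single dataframe."""
    frames = []
    for query_id in query_ids:
        path = os.path.join(src_dir, category, 'output', 'query_%d.csv' % query_id)
        frames.append(pd.read_csv(path, dtype=str))
    return pd.concat(frames, sort=False)


# Demo: two tiny, made-up CSV files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, 'hints', 'output')
    os.makedirs(out_dir)
    for query_id in (5, 10):
        toy = pd.DataFrame({'PLAN_ID': [str(query_id)],
                            'OPERATION': ['SELECT STATEMENT']})
        toy.to_csv(os.path.join(out_dir, 'query_%d.csv' % query_id), index=False)
    demo_hints = load_variant_plans(tmp_dir, 'hints', (5, 10))

print(demo_hints.shape)
```

With the notebook's configuration, a call such as `load_variant_plans(src_dir, 'hints', variant_ids)` would play the role of `df_hints_outliers`, assuming the directory layout used in the cell.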

In [4]:
# CSV Outlier Paths
outlier_hints_q5_path = src_dir + 'hints/output/query_5.csv'
outlier_hints_q10_path = src_dir + 'hints/output/query_10.csv'
outlier_hints_q14_path = src_dir + 'hints/output/query_14.csv'
outlier_hints_q18_path = src_dir + 'hints/output/query_18.csv'
outlier_hints_q22_path = src_dir + 'hints/output/query_22.csv'
outlier_hints_q27_path = src_dir + 'hints/output/query_27.csv'
outlier_hints_q35_path = src_dir + 'hints/output/query_35.csv'
outlier_hints_q36_path = src_dir + 'hints/output/query_36.csv'
outlier_hints_q51_path = src_dir + 'hints/output/query_51.csv'
outlier_hints_q67_path = src_dir + 'hints/output/query_67.csv'
outlier_hints_q70_path = src_dir + 'hints/output/query_70.csv'
outlier_hints_q77_path = src_dir + 'hints/output/query_77.csv'
outlier_hints_q80_path = src_dir + 'hints/output/query_80.csv'
outlier_hints_q86_path = src_dir + 'hints/output/query_86.csv'

outlier_predicates_q5_path = src_dir + 'predicates/output/query_5.csv'
outlier_predicates_q10_path = src_dir + 'predicates/output/query_10.csv'
outlier_predicates_q14_path = src_dir + 'predicates/output/query_14.csv'
outlier_predicates_q18_path = src_dir + 'predicates/output/query_18.csv'
outlier_predicates_q22_path = src_dir + 'predicates/output/query_22.csv'
outlier_predicates_q27_path = src_dir + 'predicates/output/query_27.csv'
outlier_predicates_q35_path = src_dir + 'predicates/output/query_35.csv'
outlier_predicates_q36_path = src_dir + 'predicates/output/query_36.csv'
outlier_predicates_q51_path = src_dir + 'predicates/output/query_51.csv'
outlier_predicates_q67_path = src_dir + 'predicates/output/query_67.csv'
outlier_predicates_q70_path = src_dir + 'predicates/output/query_70.csv'
outlier_predicates_q77_path = src_dir + 'predicates/output/query_77.csv'
outlier_predicates_q80_path = src_dir + 'predicates/output/query_80.csv'
outlier_predicates_q86_path = src_dir + 'predicates/output/query_86.csv'

outlier_rownum_q5_path = src_dir + 'rownum/output/query_5.csv'
outlier_rownum_q10_path = src_dir + 'rownum/output/query_10.csv'
outlier_rownum_q14_path = src_dir + 'rownum/output/query_14.csv'
outlier_rownum_q18_path = src_dir + 'rownum/output/query_18.csv'
outlier_rownum_q22_path = src_dir + 'rownum/output/query_22.csv'
outlier_rownum_q27_path = src_dir + 'rownum/output/query_27.csv'
outlier_rownum_q35_path = src_dir + 'rownum/output/query_35.csv'
outlier_rownum_q36_path = src_dir + 'rownum/output/query_36.csv'
outlier_rownum_q51_path = src_dir + 'rownum/output/query_51.csv'
outlier_rownum_q67_path = src_dir + 'rownum/output/query_67.csv'
outlier_rownum_q70_path = src_dir + 'rownum/output/query_70.csv'
outlier_rownum_q77_path = src_dir + 'rownum/output/query_77.csv'
outlier_rownum_q80_path = src_dir + 'rownum/output/query_80.csv'
outlier_rownum_q86_path = src_dir + 'rownum/output/query_86.csv'

# Read CSV Paths
outlier_hints_q5_df = pd.read_csv(outlier_hints_q5_path,dtype=str)
outlier_hints_q10_df = pd.read_csv(outlier_hints_q10_path,dtype=str)
outlier_hints_q14_df = pd.read_csv(outlier_hints_q14_path,dtype=str)
outlier_hints_q18_df = pd.read_csv(outlier_hints_q18_path,dtype=str)
outlier_hints_q22_df = pd.read_csv(outlier_hints_q22_path,dtype=str)
outlier_hints_q27_df = pd.read_csv(outlier_hints_q27_path,dtype=str)
outlier_hints_q35_df = pd.read_csv(outlier_hints_q35_path,dtype=str)
outlier_hints_q36_df = pd.read_csv(outlier_hints_q36_path,dtype=str)
outlier_hints_q51_df = pd.read_csv(outlier_hints_q51_path,dtype=str)
outlier_hints_q67_df = pd.read_csv(outlier_hints_q67_path,dtype=str)
outlier_hints_q70_df = pd.read_csv(outlier_hints_q70_path,dtype=str)
outlier_hints_q77_df = pd.read_csv(outlier_hints_q77_path,dtype=str)
outlier_hints_q80_df = pd.read_csv(outlier_hints_q80_path,dtype=str)
outlier_hints_q86_df = pd.read_csv(outlier_hints_q86_path,dtype=str)

outlier_predicates_q5_df = pd.read_csv(outlier_predicates_q5_path,dtype=str)
outlier_predicates_q10_df = pd.read_csv(outlier_predicates_q10_path,dtype=str)
outlier_predicates_q14_df = pd.read_csv(outlier_predicates_q14_path,dtype=str)
outlier_predicates_q18_df = pd.read_csv(outlier_predicates_q18_path,dtype=str)
outlier_predicates_q22_df = pd.read_csv(outlier_predicates_q22_path,dtype=str)
outlier_predicates_q27_df = pd.read_csv(outlier_predicates_q27_path,dtype=str)
outlier_predicates_q35_df = pd.read_csv(outlier_predicates_q35_path,dtype=str)
outlier_predicates_q36_df = pd.read_csv(outlier_predicates_q36_path,dtype=str)
outlier_predicates_q51_df = pd.read_csv(outlier_predicates_q51_path,dtype=str)
outlier_predicates_q67_df = pd.read_csv(outlier_predicates_q67_path,dtype=str)
outlier_predicates_q70_df = pd.read_csv(outlier_predicates_q70_path,dtype=str)
outlier_predicates_q77_df = pd.read_csv(outlier_predicates_q77_path,dtype=str)
outlier_predicates_q80_df = pd.read_csv(outlier_predicates_q80_path,dtype=str)
outlier_predicates_q86_df = pd.read_csv(outlier_predicates_q86_path,dtype=str)

outlier_rownum_q5_df = pd.read_csv(outlier_rownum_q5_path,dtype=str)
outlier_rownum_q10_df = pd.read_csv(outlier_rownum_q10_path,dtype=str)
outlier_rownum_q14_df = pd.read_csv(outlier_rownum_q14_path,dtype=str)
outlier_rownum_q18_df = pd.read_csv(outlier_rownum_q18_path,dtype=str)
outlier_rownum_q22_df = pd.read_csv(outlier_rownum_q22_path,dtype=str)
outlier_rownum_q27_df = pd.read_csv(outlier_rownum_q27_path,dtype=str)
outlier_rownum_q35_df = pd.read_csv(outlier_rownum_q35_path,dtype=str)
outlier_rownum_q36_df = pd.read_csv(outlier_rownum_q36_path,dtype=str)
outlier_rownum_q51_df = pd.read_csv(outlier_rownum_q51_path,dtype=str)
outlier_rownum_q67_df = pd.read_csv(outlier_rownum_q67_path,dtype=str)
outlier_rownum_q70_df = pd.read_csv(outlier_rownum_q70_path,dtype=str)
outlier_rownum_q77_df = pd.read_csv(outlier_rownum_q77_path,dtype=str)
outlier_rownum_q80_df = pd.read_csv(outlier_rownum_q80_path,dtype=str)
outlier_rownum_q86_df = pd.read_csv(outlier_rownum_q86_path,dtype=str)

# Merge dataframes into a single pandas matrix
df_hints_outliers = pd.concat([outlier_hints_q5_df, outlier_hints_q10_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q14_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q18_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q22_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q27_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q35_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q36_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q51_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q67_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q70_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q77_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q80_df], sort=False)
df_hints_outliers = pd.concat([df_hints_outliers, outlier_hints_q86_df], sort=False)

df_predicate_outliers = pd.concat([outlier_predicates_q5_df, outlier_predicates_q10_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q14_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q18_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q22_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q27_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q35_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q36_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q51_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q67_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q70_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q77_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q80_df], sort=False)
df_predicate_outliers = pd.concat([df_predicate_outliers, outlier_predicates_q86_df], sort=False)

df_rownum_outliers = pd.concat([outlier_rownum_q5_df, outlier_rownum_q10_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q14_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q18_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q22_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q27_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q35_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q36_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q51_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q67_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q70_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q77_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q80_df], sort=False)
df_rownum_outliers = pd.concat([df_rownum_outliers, outlier_rownum_q86_df], sort=False)

print(df_hints_outliers.shape)
print(df_hints_outliers.head())
print('------------------------------------------')
print(df_predicate_outliers.shape)
print(df_predicate_outliers.head())
print('------------------------------------------')
print(df_rownum_outliers.shape)
print(df_rownum_outliers.head())

(461, 35)
  PLAN_ID            TIMESTAMP REMARKS         OPERATION          OPTIONS  \
0   12447  11/20/2018 09:56:46     NaN  SELECT STATEMENT              NaN   
1   12447  11/20/2018 09:56:46     NaN             COUNT          STOPKEY   
2   12447  11/20/2018 09:56:46     NaN              VIEW              NaN   
3   12447  11/20/2018 09:56:46     NaN              SORT  GROUP BY ROLLUP   
4   12447  11/20/2018 09:56:46     NaN              VIEW              NaN   

  OBJECT_NODE OBJECT_OWNER OBJECT_NAME                OBJECT_ALIAS  \
0         NaN          NaN         NaN                         NaN   
1         NaN          NaN         NaN                         NaN   
2         NaN     TPCDS100         NaN  from$_subquery$_018@SEL$11   
3         NaN          NaN         NaN                         NaN   
4         NaN     TPCDS100         NaN                    X@SEL$12   

  OBJECT_INSTANCE  ...                                          OTHER_XML  \
0             NaN  ...       

### Dealing with empty values

In [5]:
def get_na_columns(df, headers):
    """
    Return columns which contain at least one NaN value
    """
    na_list = []
    for head in headers:
        if df[head].isnull().values.any():
            na_list.append(head)
    return na_list

print('N/A Columns\n')
print('\nREP_VSQL_PLAN Features ' + str(len(rep_vsql_plan_df.columns)) + ': ' + str(get_na_columns(df=rep_vsql_plan_df,headers=rep_vsql_plan_df.columns)) + "\n")
print('\nDF_HINT_OUTLIERS Features ' + str(len(df_hints_outliers.columns)) + ': ' + str(get_na_columns(df=df_hints_outliers,headers=df_hints_outliers.columns)) + "\n")
print('\nDF_PREDICATE_OUTLIERS Features ' + str(len(df_predicate_outliers.columns)) + ': ' + str(get_na_columns(df=df_predicate_outliers,headers=df_predicate_outliers.columns)) + "\n")
print('\nDF_ROWNUM_OUTLIERS Features ' + str(len(df_rownum_outliers.columns)) + ': ' + str(get_na_columns(df=df_rownum_outliers,headers=df_rownum_outliers.columns)) + "\n")
#
def fill_na(df):
    """
    Replaces NaN values with 0s
    """
    return df.fillna(0)

# Populating NaN values with amount '0'
df = fill_na(df=rep_vsql_plan_df)
df_hints_outliers = fill_na(df=df_hints_outliers)
df_predicate_outliers = fill_na(df=df_predicate_outliers)
df_rownum_outliers = fill_na(df=df_rownum_outliers)

N/A Columns


REP_VSQL_PLAN Features 39: ['OPTIONS', 'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS', 'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG', 'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'OTHER', 'DISTRIBUTION', 'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'ACCESS_PREDICATES', 'FILTER_PREDICATES', 'PROJECTION', 'TIME', 'QBLOCK_NAME', 'REMARKS', 'OTHER_XML']


DF_HINT_OUTLIERS Features 35: ['REMARKS', 'OPTIONS', 'OBJECT_NODE', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS', 'OBJECT_INSTANCE', 'OBJECT_TYPE', 'OPTIMIZER', 'SEARCH_COLUMNS', 'PARENT_ID', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG', 'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'OTHER', 'OTHER_XML', 'DISTRIBUTION', 'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'ACCESS_PREDICATES', 'FILTER_PREDICATES', 'PROJECTION', 'TIME', 'QBLOCK_NAME']


DF_PREDICATE_OUTLIERS Features 35: ['REMARKS', 'OPTIONS', 'OBJECT_NODE', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS', '

### Type conversion

Each column not in the black list is converted to type int64. Columns which cannot be converted are dropped.
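
The conversion rule can be sketched on a toy column as follows. This is an illustrative variant rather than the notebook's implementation: `to_int64_or_max` is a hypothetical helper, and it parses values with Python's `int` and `float` instead of building a dataframe per value.

```python
import numpy as np
import pandas as pd

INT64_MIN = int(np.iinfo(np.int64).min)
INT64_MAX = int(np.iinfo(np.int64).max)


def to_int64_or_max(value):
    """Return value as an int if it fits in int64, otherwise INT64_MAX."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        try:
            # Handles strings such as '8693.0'
            number = int(float(value))
        except (TypeError, ValueError, OverflowError):
            # Non-numeric text, NaN or infinity
            return INT64_MAX
    if INT64_MIN <= number <= INT64_MAX:
        return number
    # Out-of-range values are capped, mirroring the "max int" fallback
    return INT64_MAX


raw = pd.Series(['0', '8693.0', 'FULL', '99999999999999999999'])
converted = raw.map(to_int64_or_max).astype('int64')
print(converted)
```

Non-numeric entries such as `'FULL'` end up as the maximum int64 value, which keeps the column numeric at the cost of introducing a sentinel that later steps need to be aware of.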

In [6]:
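def handle_numeric_overflows(x):
    """
    Accepts a single cell value, and checks whether it can be cast to int64. If the cast
    raises a ValueError, the value is replaced with the maximum int64 value; otherwise it
    is returned unchanged.
    """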
    try:
        #df = df.astype('int64')
        x1 = pd.DataFrame([x],dtype='int64')
    except ValueError:
        x = 9223372036854775807 # Max int size
    return x

for col in df.columns:
    try:
        if col in black_list:
            continue
        df[col] = df[col].apply(handle_numeric_overflows)
        df[col] = df[col].astype('int64')
    except Exception:
        df.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')

# print('-------------------------------------------------------------')

for col in df_hints_outliers.columns:
    try:
        if col in black_list:
            continue
        df_hints_outliers[col] = df_hints_outliers[col].astype('int64')
    except OverflowError:
        #
        # Handles numeric overflows by replacing such values with the maximum int64 value.
        df_hints_outliers[col] = df_hints_outliers[col].apply(handle_numeric_overflows)
        df_hints_outliers[col] = df_hints_outliers[col].astype('int64')
    except Exception as e:
        df_hints_outliers.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')

print('-------------------------------------------------------------')

for col in df_predicate_outliers.columns:
    try:
        if col in black_list:
            continue
        df_predicate_outliers[col] = df_predicate_outliers[col].astype('int64')
    except OverflowError:
        
        # Handles numeric overflows by replacing such values with the maximum int64 value.
        df_predicate_outliers[col] = df_predicate_outliers[col].apply(handle_numeric_overflows)
        df_predicate_outliers[col] = df_predicate_outliers[col].astype('int64')
    except Exception as e:
        df_predicate_outliers.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')       

print('-------------------------------------------------------------')

for col in df_rownum_outliers.columns:
    try:
        if col in black_list:
            continue
        df_rownum_outliers[col] = df_rownum_outliers[col].astype('int64')
    except OverflowError:
        #
        # Handles numeric overflows by replacing such values with the maximum int64 value.
        df_rownum_outliers[col] = df_rownum_outliers[col].apply(handle_numeric_overflows)
        df_rownum_outliers[col] = df_rownum_outliers[col].astype('int64')
    except Exception as e:
        df_rownum_outliers.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')    

print('-------------------------------------------------------------')
      
print(df.columns)
print(df_hints_outliers.columns)
print(df_predicate_outliers.columns)
print(df_rownum_outliers.columns)

Dropped column [OBJECT_ALIAS]
Dropped column [OPTIMIZER]
Dropped column [OTHER_XML]
Dropped column [CPU_COST]
Dropped column [ACCESS_PREDICATES]
Dropped column [FILTER_PREDICATES]
Dropped column [PROJECTION]
Dropped column [QBLOCK_NAME]
-------------------------------------------------------------
Dropped column [OBJECT_ALIAS]
Dropped column [OPTIMIZER]
Dropped column [OTHER_XML]
Dropped column [CPU_COST]
Dropped column [ACCESS_PREDICATES]
Dropped column [FILTER_PREDICATES]
Dropped column [PROJECTION]
Dropped column [QBLOCK_NAME]
-------------------------------------------------------------
Dropped column [OBJECT_ALIAS]
Dropped column [OPTIMIZER]
Dropped column [OTHER_XML]
Dropped column [CPU_COST]
Dropped column [ACCESS_PREDICATES]
Dropped column [FILTER_PREDICATES]
Dropped column [PROJECTION]
Dropped column [QBLOCK_NAME]
-------------------------------------------------------------
Index(['DBID', 'SQL_ID', 'PLAN_HASH_VALUE', 'ID', 'OPERATION', 'OPTIONS',
       'OBJECT_NODE', 'OBJECT

### Feature Elimination

In this step, redundant features are dropped. A feature is considered redundant if it exhibits a standard deviation of 0 (meaning its value never changes).
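
A minimal sketch of the zero-standard-deviation check on a toy dataframe. `flatline_columns` is a hypothetical helper; unlike the notebook's function, it only inspects numeric columns.

```python
import pandas as pd


def flatline_columns(frame, ignore=()):
    """Return the numeric columns of frame whose standard deviation is exactly 0."""
    numeric = frame.select_dtypes(include='number')
    numeric = numeric.drop(columns=[c for c in ignore if c in numeric.columns])
    stds = numeric.std()
    return stds[stds == 0].index.tolist()


toy = pd.DataFrame({'COST': [1, 2, 3],
                    'DBID': [7, 7, 7],
                    'OPERATION': ['SORT', 'VIEW', 'SORT']})
flat = flatline_columns(toy)
reduced = toy.drop(columns=flat)
print(flat)
print(reduced.columns.tolist())
```

Here `DBID` never changes, so it carries no information for a distance-based comparison and is removed.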

In [7]:
def drop_flatline_columns(df):
    columns = df.columns
    flatline_features = []
    for i in range(len(columns)):
        try:
            #
            if columns[i] in black_list:
                continue
            #
            std = df[columns[i]].std()
            if std == 0:
                flatline_features.append(columns[i])
        except:
            pass
    
    #print('Features which are considered flatline:\n')
    #for col in flatline_features:
    #    print(col)
    print('\nShape before changes: [' + str(df.shape) + ']')
    df = df.drop(columns=flatline_features)
    print('Shape after changes: [' + str(df.shape) + ']')
    print('Dropped a total [' + str(len(flatline_features)) + ']')
    return df

df = drop_flatline_columns(df=df)
df_hints_outliers = drop_flatline_columns(df=df_hints_outliers)
df_predicate_outliers = drop_flatline_columns(df=df_predicate_outliers)
df_rownum_outliers = drop_flatline_columns(df=df_rownum_outliers)

print('\nAfter flatline column drop:')
print(df.shape)
print(df.columns)

print('--------------------------------------------------------')
print('\nAfter outlier flatline column drop [df_hints_outliers]:')
print(df_hints_outliers.shape)
print(df_hints_outliers.columns)

print('--------------------------------------------------------')
print('\nAfter outlier flatline column drop [df_predicate_outliers]:')
print(df_predicate_outliers.shape)
print(df_predicate_outliers.columns)

print('--------------------------------------------------------')
print('\nAfter outlier flatline column drop [df_rownum_outliers]:')
print(df_rownum_outliers.shape)
print(df_rownum_outliers.columns)


Shape before changes: [(20000, 39)]
Shape after changes: [(20000, 31)]
Dropped a total [8]

Shape before changes: [(461, 27)]
Shape after changes: [(461, 21)]
Dropped a total [6]

Shape before changes: [(489, 27)]
Shape after changes: [(489, 21)]
Dropped a total [6]

Shape before changes: [(483, 27)]
Shape after changes: [(483, 21)]
Dropped a total [6]

After flatline column drop:
(20000, 31)
Index(['SQL_ID', 'PLAN_HASH_VALUE', 'ID', 'OPERATION', 'OPTIONS',
       'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS',
       'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'DEPTH', 'POSITION',
       'SEARCH_COLUMNS', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG',
       'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'DISTRIBUTION',
       'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'TIME', 'QBLOCK_NAME', 'TIMESTAMP',
       'OTHER_XML'],
      dtype='object')
--------------------------------------------------------

After outlier flatline column drop [df_hints_outliers]:
(461,

### Scaling columns

This section passes a number of data columns through a MinMax scaler. This is done to bring the columns onto a common scale, particularly before comparing column measurements using a Euclidean-based measure. The following columns are targeted:

* CARDINALITY
* BYTES
* IO_COST
* TEMP_SPACE
* TIME

PARTITION_START, PARTITION_STOP and CPU_COST are not part of `scaled_columns` in this section's code cell, and are therefore left unscaled.
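
With its default settings, a MinMax scaler maps each column to the range [0, 1] using x' = (x - min) / (max - min). A minimal pandas sketch of that formula on made-up values follows; `min_max_scale` is a hypothetical helper, and its handling of constant columns is an assumption of this sketch.

```python
import pandas as pd


def min_max_scale(frame):
    """Scale every column of frame to the range [0, 1]."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Avoid division by zero for constant columns (they end up as all zeros)
    col_range = col_range.replace(0, 1)
    return (frame - col_min) / col_range


toy = pd.DataFrame({'CARDINALITY': [0, 50, 100],
                    'TIME': [1, 1, 17]})
scaled = min_max_scale(toy)
print(scaled)
```

Because the scaling depends on each column's observed minimum and maximum, a single extreme value compresses all other values of that column towards 0.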

In [8]:
scaler = preprocessing.MinMaxScaler()
scaled_columns = ['CARDINALITY',
                'BYTES',
                #'CPU_COST',
                'IO_COST',
                'TEMP_SPACE',
                'TIME']
print(df['PARTITION_START'].iloc[0])
df[scaled_columns] = scaler.fit_transform(df[scaled_columns])
print(df['PARTITION_START'].iloc[0])
print("Minimal Vector Points: " + str(scaler.data_min_))
print("Maximal Vector Points: " + str(scaler.data_max_))

print('\nAfter scaled column transformation:')
print(df.shape)
print(df.columns)

print('--------------------------------------------------------')
print('\nAfter outlier scaled column transformation [df_hints_outliers]:')
print(df_hints_outliers.shape)
print(df_hints_outliers.columns)

print('--------------------------------------------------------')
print('\nAfter outlier scaled column transformation [df_predicate_outliers]:')
print(df_predicate_outliers.shape)
print(df_predicate_outliers.columns)

print('--------------------------------------------------------')
print('\nAfter outlier scaled column transformation [df_rownum_outliers]:')
print(df_rownum_outliers.shape)
print(df_rownum_outliers.columns)

0
0
Minimal Vector Points: [0. 0. 0. 0. 0.]
Maximal Vector Points: [3.9933000e+08 3.6287625e+10 4.1962300e+05 4.6007000e+07 1.7000000e+01]

After scaled column transformation:
(20000, 31)
Index(['SQL_ID', 'PLAN_HASH_VALUE', 'ID', 'OPERATION', 'OPTIONS',
       'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS',
       'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'DEPTH', 'POSITION',
       'SEARCH_COLUMNS', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG',
       'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'DISTRIBUTION',
       'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'TIME', 'QBLOCK_NAME', 'TIMESTAMP',
       'OTHER_XML'],
      dtype='object')
--------------------------------------------------------

After outlier scaled column transformation [df_hints_outliers]:
(461, 21)
Index(['PLAN_ID', 'TIMESTAMP', 'OPERATION', 'OPTIONS', 'OBJECT_OWNER',
       'OBJECT_NAME', 'OBJECT_INSTANCE', 'OBJECT_TYPE', 'SEARCH_COLUMNS', 'ID',
       'PARENT_ID', 'DEPTH', 'POSITION', 'COS

### Adding Grouping Column

An extra column is added to allow access plans to be isolated per instance
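
The idea can be sketched without an explicit loop: a new instance id is assigned each time the key changes between consecutive rows. `plan_instance_ids` is a hypothetical helper used only for this illustration, and it numbers instances from 0.

```python
import pandas as pd


def plan_instance_ids(keys):
    """Number consecutive runs of identical keys as 0, 1, 2, ..."""
    # True wherever the key differs from the previous row (always True on the first row)
    changed = keys.ne(keys.shift())
    return changed.cumsum() - 1


toy = pd.DataFrame({'SQL_ID': ['a', 'a', 'b', 'b', 'a']})
toy['PLAN_INSTANCE'] = plan_instance_ids(toy['SQL_ID'])
print(toy)
```

Note that a key which reappears later (the final `'a'` here) receives a fresh id, so repeated executions of the same statement are kept apart.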

In [9]:
# Adds a column per SQL_ID, PLAN_HASH_VALUE grouping, which can be used to group instances together
def add_grouping_column(df, column_identifier):
    """
    Receives a pandas dataframe, and adds a new column which allows dataframe to be aggregated per 
    SQL_ID, PLAN_HASH_VALUE combination.
    
    :param: df                - Pandas Dataframe
    :param: column_identifier - String denoting matrix column to group by
    
    :return: Pandas Dataframe, with added column    
    """
    print('Shape before transformation: ' + str(df.shape))
    new_grouping_col = []
    counter = 0
    last_sql_id = df[column_identifier].iloc[0] # Starts with first SQL_ID
    for index, row in df.iterrows():
        if column_identifier == 'SQL_ID':
            if last_sql_id != row.SQL_ID:
                last_sql_id = row.SQL_ID
                counter += 1
        elif column_identifier == 'PLAN_ID':
            if last_sql_id != row.PLAN_ID:
                last_sql_id = row.PLAN_ID
                counter += 1
        else:
            raise ValueError('Column does not exist!')
        new_grouping_col.append(counter)
    
    # Append list as new column
    new_col = pd.Series(new_grouping_col)
    df['PLAN_INSTANCE'] = new_col.values
    print('Shape after transformation: ' + str(df.shape))
    return df

df = add_grouping_column(df=df,column_identifier='SQL_ID')
df_hints_outliers = add_grouping_column(df=df_hints_outliers,column_identifier='PLAN_ID')
df_predicate_outliers = add_grouping_column(df=df_predicate_outliers,column_identifier='PLAN_ID')
df_rownum_outliers = add_grouping_column(df=df_rownum_outliers,column_identifier='PLAN_ID')

Shape before transformation: (20000, 31)
Shape after transformation: (20000, 32)
Shape before transformation: (461, 21)
Shape after transformation: (461, 22)
Shape before transformation: (489, 21)
Shape after transformation: (489, 22)
Shape before transformation: (483, 21)
Shape after transformation: (483, 22)


### Plan Matching Column

This column is used to match plans between inliers and outliers.
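
As a minimal sketch, the matching string for a plan can be built by joining its operations and object names in row order. `plan_signature` is a hypothetical helper and the rows are made up; the two joined strings are concatenated without a separator.

```python
import pandas as pd


def plan_signature(frame, key='SQL_ID'):
    """Return, for every row, the signature string of the plan it belongs to."""
    grouped = frame.groupby(key, sort=False)
    operations = grouped['OPERATION'].transform(lambda s: ' > '.join(s.astype(str)))
    objects = grouped['OBJECT_NAME'].transform(lambda s: ' > '.join(s.astype(str)))
    return operations + objects


toy = pd.DataFrame({'SQL_ID': ['a', 'a', 'b'],
                    'OPERATION': ['SELECT STATEMENT', 'TABLE ACCESS', 'SELECT STATEMENT'],
                    'OBJECT_NAME': [0, 'T1', 0]})
toy['SQL_MATCH'] = plan_signature(toy)
print(toy['SQL_MATCH'].tolist())
```

Two plans with identical signatures have the same sequence of operations over the same objects, which is what allows a variant plan to be paired with its baseline counterpart.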

In [10]:
def add_matching_column(df):
    
    # Create new empty column
    df = df.assign(SQL_MATCH='')
    
    # Retrieve unique listing of SQL_IDs and iterate over them
    unique_ids = pd.unique(df['SQL_ID'])
    
    for sql_id in unique_ids:
        mask = df['SQL_ID'] == sql_id
        # Concatenate the plan's operations and object names into one matching string
        sql_match = df.loc[mask, 'OPERATION'].str.cat(sep=' > ')
        sql_match2 = df.loc[mask, 'OBJECT_NAME'].astype(str).str.cat(sep=' > ')
        # Assign through a single .loc call to avoid chained indexing
        df.loc[mask, 'SQL_MATCH'] = sql_match + sql_match2
    
    return df

df = add_matching_column(df=df)
print(df.head())



          SQL_ID  PLAN_HASH_VALUE  ID         OPERATION    OPTIONS  \
0  2j8td2wuthnfv       1917374110   0  SELECT STATEMENT          0   
1  2j8td2wuthnfv       1917374110   1      TABLE ACCESS       FULL   
2  2j8td2wuthnfv       1917374110   2              SORT  AGGREGATE   
3  2j8td2wuthnfv       1917374110   3      TABLE ACCESS       FULL   
4  9nf3gy0tv9p0u       3537130676   0  SELECT STATEMENT          0   

   OBJECT_NODE  OBJECT# OBJECT_OWNER    OBJECT_NAME         OBJECT_ALIAS  ...  \
0            0      0.0            0              0                    0  ...   
1            0   8693.0          SYS  WRM$_SNAPSHOT  9223372036854775807  ...   
2            0      0.0            0              0                    0  ...   
3            0   8693.0          SYS  WRM$_SNAPSHOT  9223372036854775807  ...   
4            0      0.0            0              0                    0  ...   

  DISTRIBUTION  CPU_COST   IO_COST  TEMP_SPACE      TIME          QBLOCK_NAME  \
0          

### Tree Formatting

Constructs the tree plan structure
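
To illustrate the structure, the sketch below links plan rows through their (ID, PARENT_ID) pairs and walks the result in post-order, so that children are visited before their parent. It uses plain dictionaries as a simplified stand-in for the anytree nodes used in the notebook; the function names and the sample rows are assumptions.

```python
from collections import defaultdict


def build_children(rows):
    """Map each parent id to the list of its child ids, in row order."""
    children = defaultdict(list)
    for node_id, parent_id in rows:
        if node_id != 0:  # Row 0 is the root and has no parent
            children[parent_id].append(node_id)
    return children


def post_order(children, root=0):
    """Return node ids so that every child precedes its parent."""
    order = []
    for child in children.get(root, []):
        order.extend(post_order(children, child))
    order.append(root)
    return order


# Hypothetical plan: (ID, PARENT_ID) pairs
rows = [(0, None), (1, 0), (2, 1), (3, 1), (4, 0)]
children = build_children(rows)
visit_order = post_order(children)
print(visit_order)
```

Post-order matches how an access plan is evaluated: the leaf operations feed their parents, and the root statement comes last.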

In [11]:
class PlanTreeModeller:
    """
    This class simulates an access plan in the form of a tree structure
    """
    
    @staticmethod
    def __create_node(node_name, parent=None):
        """
        Builds a node which will be added to the tree. If the parent is 'None', it is assumed that this
        node will be used as the root/parent Node.
        
        :param: node_name - String specifying node name.
        :param: parent    - Parent node specifying parent node name.
        
        :return: anytree object
        """
        if node_name is None:
            raise ValueError('Node name was not specified!')
        
        if parent is None:
            node = Node(node_name)
        else:
            node = Node(node_name, parent=parent)
        
        return node
    
    @staticmethod
    def build_tree(df):
        """
        This method receives a pandas dataframe, and converts it into a searchable python tree
        
        :param: df - Pandas Dataframe, pertaining to input access plan
        
        :return: Dictionary object, consisting of node objects (which are linked in a tree fashion)
        """
        parent_node = None
        node_dict = {}
        for index, row in df.iterrows():
            
            # Build Node and add to parent
            row_id = int(row['ID'])
            parent_id = int(row['PARENT_ID'])
            
            if row_id == 0:
                node = PlanTreeModeller.__create_node(node_name=row_id)
            else:
                parent_node = node_dict[parent_id]
                node = PlanTreeModeller.__create_node(node_name=row_id, parent=parent_node)
            node_dict[row_id] = node
        
        return node_dict # Dictionary consisting of tree nodes
    
    @staticmethod
    def __retrieve_plan_details(df, node_name):
        """
        Accepts a dataframe, and the node_name. Retrieves features pertaining to the row id in the access plan
        
        :param: df - Dataframe consisting of access plan features
        :param: node_name - ID of the access plan row to retrieve from the parameter dataframe
        
        :return: Dictionary consisting of access plan attributes
        """
        operation = str(df[df['ID'] == node_name]['OPERATION'].iloc[0])
        options = str(df[df['ID'] == node_name]['OPTIONS'].iloc[0])
        object_name = str(df[df['ID'] == node_name]['OBJECT_NAME'].iloc[0])
        try:
            object_type = str(df[df['ID'] == node_name]['OBJECT_TYPE'].iloc[0])
        except KeyError: # This is required because variant query plans do not have this column.
            object_type = None
        cardinality = int(df[df['ID'] == node_name]['CARDINALITY'].iloc[0])
        bytess = int(df[df['ID'] == node_name]['BYTES'].iloc[0])
        partition_delta = int(df[df['ID'] == node_name]['PARTITION_STOP'].iloc[0]) - int(df[df['ID'] == node_name]['PARTITION_START'].iloc[0])
        #cpu_cost = int(df[df['ID'] == node_name]['CPU_COST'].iloc[0])
        io_cost = int(df[df['ID'] == node_name]['IO_COST'].iloc[0])
        temp_space = int(df[df['ID'] == node_name]['TEMP_SPACE'].iloc[0])
        time = int(df[df['ID'] == node_name]['TIME'].iloc[0]) 
        
        return {'OPERATION':operation,
                'OPTIONS':options,
                'OBJECT_NAME':object_name,
                'OBJECT_TYPE':object_type,
                'CARDINALITY':cardinality,
                'BYTES':bytess,
                'PARTITION_DELTA':partition_delta,
                #'CPU_COST':cpu_cost,
                'IO_COST':io_cost,
                'TEMP_SPACE':temp_space,
                'TIME':time}
    
    @staticmethod
    def __tree_node_euclidean(tree_dict1, tree_dict2):
        """
        This method calculates the euclidean distance between the numeric feature vectors of two plan nodes.
        
        :param: tree_dict1 - Dictionary denoting a single node within plan / tree 1
        :param: tree_dict2 - Dictionary denoting a single node within plan / tree 2
        
        :return: Float denoting the euclidean distance between the two nodes
        """
        tree_vector_1 = [tree_dict1['CARDINALITY'],
                         tree_dict1['BYTES'],
                         tree_dict1['PARTITION_DELTA'],
                         #tree_dict1['CPU_COST'],
                         tree_dict1['IO_COST'],
                         tree_dict1['TEMP_SPACE'],
                         tree_dict1['TIME']]
        
        tree_vector_2 = [tree_dict2['CARDINALITY'],
                         tree_dict2['BYTES'],
                         tree_dict2['PARTITION_DELTA'],
                         #tree_dict2['CPU_COST'],
                         tree_dict2['IO_COST'],
                         tree_dict2['TEMP_SPACE'],
                         tree_dict2['TIME']]
        
        euc_distance = euclidean_distances([tree_vector_1],[tree_vector_2])
        return euc_distance[0][0]
    
    @staticmethod
    def render_tree(tree, df):
        """
        Renders Tree by printing to screen
        
        :param: tree - AnyTree object, representing tree modelled access plan
        :param: df   - Pandas dataframe representing the access plan about to be rendered
        
        :return: None
        """
        for pre, fill, node in RenderTree(tree):
            
            access_plan_dict = PlanTreeModeller.__retrieve_plan_details(df=df,
                                                                        node_name = node.name)
            
            if access_plan_dict['OBJECT_NAME'] == '0':
                print("%s%s > %s" % (pre, node.name, access_plan_dict['OPERATION']))
            else:
                if access_plan_dict['OPTIONS'] == '0': 
                    print("%s%s > %s (%s)" % (pre, node.name, access_plan_dict['OPERATION'], access_plan_dict['OBJECT_NAME']))
                else:
                    print("%s%s > %s | %s (%s)" % (pre, node.name, access_plan_dict['OPERATION'], access_plan_dict['OPTIONS'], access_plan_dict['OBJECT_NAME']))
    
    @staticmethod
    def __postorder(tree):
        """
        Accepts a tree, and iterates in post order fashion (left,right,root)
        
        :param: tree - Dictionary consisting of AnyTree Nodes
        
        :return: List consisting of tree traversal order
        """
        post_order_traversal = [node.name for node in PostOrderIter(tree[0])]
        return post_order_traversal
    
    @staticmethod
    def tree_compare(tree1, tree2, df1, df2):
        """
        Accepts two trees of type 'AnyTree', along with respective dataframe denoting each respective access
        path.
        
        :param: tree1 - Dictionary consisting of 'AnyTree' nodes, belonging to tree 1
        :param: tree2 - Dictionary consisting of 'AnyTree' nodes, belonging to tree 2
        :param: df1   - Pandas dataframe consisting of access plan instructions opted for by tree 1
        :param: df2   - Pandas dataframe consisting of access plan instructions opted for by tree 2
        
        :return: None
        """
        
        # Retrieves traversal order for both trees
        post_order_traversal1 = PlanTreeModeller.__postorder(tree1)
        post_order_traversal2 = PlanTreeModeller.__postorder(tree2)
        
        # Iterates over traversal order, until a change is encountered
        max_range = max(len(post_order_traversal1),len(post_order_traversal2))
        delta_flag = True
        euclidean_measure = []
        for i in range(0,max_range):
            
            # This check avoids a list IndexError for scenarios where one plan is bigger than the other,
            # and consequently has more nodes to traverse than the other tree.
            if i >= len(post_order_traversal1) or i >= len(post_order_traversal2):
                break
            
            # Retrieve prior, current, and next nodes
            try:
                id_1_prev = post_order_traversal1[i-1]
                id_2_prev = post_order_traversal2[i-1]
            except IndexError:
                id_1_prev = None
                id_2_prev = None
            try:
                id_1 = post_order_traversal1[i]
                id_2 = post_order_traversal2[i]
            except IndexError:
                id_1 = None
                id_2 = None
            try:
                id_1_next = post_order_traversal1[i+1]
                id_2_next = post_order_traversal2[i+1]
            except IndexError:
                id_1_next = None
                id_2_next = None

            if id_1_prev is not None and id_2_prev is not None:
                pd_tree1_prev = PlanTreeModeller.__retrieve_plan_details(df=df1, node_name=id_1_prev)
                pd_tree2_prev = PlanTreeModeller.__retrieve_plan_details(df=df2, node_name=id_2_prev)
            if id_1 is not None and id_2 is not None:
                pd_tree1 = PlanTreeModeller.__retrieve_plan_details(df=df1, node_name=id_1)
                pd_tree2 = PlanTreeModeller.__retrieve_plan_details(df=df2, node_name=id_2)
            if id_1_next is not None and id_2_next is not None:
                pd_tree1_next = PlanTreeModeller.__retrieve_plan_details(df=df1, node_name=id_1_next)
                pd_tree2_next = PlanTreeModeller.__retrieve_plan_details(df=df2, node_name=id_2_next)
            
            if (pd_tree1['OPERATION'] != pd_tree2['OPERATION'] or pd_tree1['OBJECT_NAME'] != pd_tree2['OBJECT_NAME'] or pd_tree1['OPTIONS'] != pd_tree2['OPTIONS']) and delta_flag:
                print('Access Predicate Difference detected!')
                print('Tree 1 difference at node [' + str(id_1) + '] operator > ' + pd_tree1['OPERATION'] + '(' + pd_tree1['OPTIONS'] + ') on object [' + pd_tree1['OBJECT_NAME'] + ']')
                print('Tree 2 difference at node [' + str(id_2) + '] operator > ' + pd_tree2['OPERATION'] + '(' + pd_tree2['OPTIONS'] + ') on object [' + pd_tree2['OBJECT_NAME'] + ']')
                PlanTreeModeller.render_tree(tree=tree1[0], df=df1) # Tree renderer uses root node and traverses downwards
                PlanTreeModeller.render_tree(tree=tree2[0], df=df2) # Tree renderer uses root node and traverses downwards
                
                encountered_recommendations = []
                print('Stat Recommendation: ')
                display_counter = 1
                if pd_tree1['OBJECT_TYPE'] != '0' and pd_tree1['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + pd_tree1['OBJECT_TYPE'] + '] stats on [' + pd_tree1['OBJECT_NAME'] + ']')
                    encountered_recommendations.append(pd_tree1['OBJECT_NAME'])
                    display_counter += 1
                if pd_tree2['OBJECT_TYPE'] != '0' and pd_tree2['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + pd_tree2['OBJECT_TYPE'] + '] stats on [' + pd_tree2['OBJECT_NAME'] + ']')
                    encountered_recommendations.append(pd_tree2['OBJECT_NAME'])
                    display_counter += 1
#                 if pd_tree1_prev['OBJECT_TYPE'] != '0' and pd_tree1_prev['OBJECT_NAME'] not in encountered_recommendations:
#                     print(str(display_counter) + ') Collect [' + pd_tree1_prev['OBJECT_TYPE'] + '] stats on [' + pd_tree1_prev['OBJECT_NAME'] + ']')
#                     encountered_recommendations.append(pd_tree1_prev['OBJECT_NAME'])
#                     display_counter += 1
#                 if pd_tree2_prev['OBJECT_TYPE'] != '0' and pd_tree2_prev['OBJECT_NAME'] not in encountered_recommendations:
#                     print(str(display_counter) + ') Collect [' + pd_tree2_prev['OBJECT_TYPE'] + '] stats on [' + pd_tree2_prev['OBJECT_NAME'] + ']')
#                     encountered_recommendations.append(pd_tree2_prev['OBJECT_NAME'])
#                     display_counter += 1
                if pd_tree1_next['OBJECT_TYPE'] != '0' and pd_tree1_next['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + pd_tree1_next['OBJECT_TYPE'] + '] stats on [' + pd_tree1_next['OBJECT_NAME'] + ']')
                    encountered_recommendations.append(pd_tree1_next['OBJECT_NAME'])
                    display_counter += 1
                if pd_tree2_next['OBJECT_TYPE'] != '0' and pd_tree2_next['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + pd_tree2_next['OBJECT_TYPE'] + '] stats on [' + pd_tree2_next['OBJECT_NAME'] + ']')
                    encountered_recommendations.append(pd_tree2_next['OBJECT_NAME'])
                    display_counter += 1
                delta_flag = False
            
            # Calculate Node Euclidean Measure
            euclidean_vector = PlanTreeModeller.__tree_node_euclidean(tree_dict1=pd_tree1,
                                                                      tree_dict2=pd_tree2)
            euclidean_measure.append(euclidean_vector)
            
        if delta_flag is not False and sum(euclidean_measure) > 10000:
            print('Access Predicate Difference detected!')
            print('Plan structure was the same, but a big operator difference was detected with delta score [' + str(sum(euclidean_measure))  + ']')
            PlanTreeModeller.render_tree(tree=tree1[0], df=df1) # Tree renderer uses root node and traverses downwards
            PlanTreeModeller.render_tree(tree=tree2[0], df=df2) # Tree renderer uses root node and traverses downwards
        
        if delta_flag:
            print('No plan differences detected.')
        
        print('Total computed delta score [' + str(sum(euclidean_measure)) + ']')
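
The comparison logic can be sketched in plain Python as follows. This is a simplified illustration, not the class above: the plans are assumed to be lists of node dictionaries already in post-order, only two numeric features are used, and the names `node_distance`, `first_structural_difference` and `delta_score` are hypothetical.

```python
import math

KEYS = ("CARDINALITY", "BYTES")  # assumed subset of the numeric features


def node_distance(a, b, keys=KEYS):
    """Euclidean distance between the numeric features of two plan nodes."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def first_structural_difference(plan1, plan2):
    """Index of the first post-order position whose operator differs, else None."""
    for i, (n1, n2) in enumerate(zip(plan1, plan2)):
        sig1 = (n1["OPERATION"], n1["OPTIONS"], n1["OBJECT_NAME"])
        sig2 = (n2["OPERATION"], n2["OPTIONS"], n2["OBJECT_NAME"])
        if sig1 != sig2:
            return i
    return None


def delta_score(plan1, plan2):
    """Sum of per-node distances over the positions both plans share."""
    return sum(node_distance(a, b) for a, b in zip(plan1, plan2))


# Hypothetical two-node plans, listed in post-order.
plan_a = [
    {"OPERATION": "TABLE ACCESS", "OPTIONS": "FULL", "OBJECT_NAME": "ITEM",
     "CARDINALITY": 100, "BYTES": 400},
    {"OPERATION": "SORT", "OPTIONS": "AGGREGATE", "OBJECT_NAME": "0",
     "CARDINALITY": 1, "BYTES": 4},
]
plan_b = [
    {"OPERATION": "INDEX", "OPTIONS": "RANGE SCAN", "OBJECT_NAME": "ITEM_IDX",
     "CARDINALITY": 97, "BYTES": 396},
    {"OPERATION": "SORT", "OPTIONS": "AGGREGATE", "OBJECT_NAME": "0",
     "CARDINALITY": 1, "BYTES": 4},
]

print(first_structural_difference(plan_a, plan_b))
print(delta_score(plan_a, plan_b))
```

Because `zip` stops at the shorter plan, extra nodes in the longer plan do not contribute to the score, which matches the early `break` in `tree_compare`.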

### Building Testing Streams

This cell builds a total of 4 lists, composed as follows:

* Expected Stream, composed of SQL queries with which comparison will be made.
* Variant Stream, with intermingled hint outliers
* Variant Stream, with intermingled predicate outliers
* Variant Stream, with intermingled rownum outliers
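
The filtering applied to the Expected Stream can be sketched without pandas as shown below. The `rows` data and the helper `filter_sql_ids` are hypothetical; the sketch assumes, as the cell does, that a plan is kept only if it touches an object owned by the test schema and has exactly one root row (`ID` 0).

```python
from collections import defaultdict

SCHEMA = "TPCDS100"

# Hypothetical plan rows as (SQL_ID, ID, OBJECT_OWNER) tuples.
rows = [
    ("aaa", 0, "0"), ("aaa", 1, "TPCDS100"),   # kept
    ("bbb", 0, "0"), ("bbb", 1, "SYS"),        # dropped: not a TPC-DS plan
    ("ccc", 0, "0"), ("ccc", 1, "TPCDS100"),   # dropped: plan captured twice
    ("ccc", 0, "0"), ("ccc", 1, "TPCDS100"),
]


def filter_sql_ids(rows, schema):
    """Keep SQL_IDs that reference the schema and have exactly one root row."""
    owners = defaultdict(set)
    root_count = defaultdict(int)
    order = []  # preserves first-seen order of SQL_IDs
    for sql_id, node_id, owner in rows:
        if sql_id not in owners:
            order.append(sql_id)
        owners[sql_id].add(owner)
        if node_id == 0:
            root_count[sql_id] += 1
    return [s for s in order if schema in owners[s] and root_count[s] == 1]


print(filter_sql_ids(rows, SCHEMA))
```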

In [12]:
# Retrieve unique set of SQL_IDs
np_sql_id = pd.unique(df['SQL_ID'])

# Remove those which are not originating from TPC-DS
filtered_sql = []
for sql in np_sql_id:
    
    df_temp_plan = df[df['SQL_ID'] == sql]

    # This step ensures that only TPC-DS related queries are displayed
    tpc_check = df_temp_plan['OBJECT_OWNER'].tolist()
    if tpcds not in tpc_check:
        continue
        
    #
    # Discards plans with double entries - Due to the parallel nature of the throughput test for 
    # TPC-DS, multiple threads may execute the same query at the same time, resulting in sql access
    # plans with the same SQL_ID, same PLAN_HASH_VALUE, and same TIMESTAMP. Such occurrences are skipped.
    df_temp_count = df_temp_plan[df_temp_plan['ID'] == 0]
    if df_temp_count.shape[0] != 1:
        continue
        
    filtered_sql.append(sql)
np_sql_id = filtered_sql

print('ACTUAL:')
print(np_sql_id)
print(type(np_sql_id))
print(len(np_sql_id))
print('-'*100)

# Retrieve Unique set of PLAN_IDs for hint outliers
np_hint_outlier_plan_id = pd.unique(df_hints_outliers['PLAN_ID'])
print('HINT_VARIANTS:')
print(np_hint_outlier_plan_id)
print(type(np_hint_outlier_plan_id))
print('-'*100)

# Retrieve Unique set of PLAN_IDs for predicate outliers
np_predicate_outlier_plan_id = pd.unique(df_predicate_outliers['PLAN_ID'])
print('PREDICATE_VARIANTS:')
print(np_predicate_outlier_plan_id)
print(type(np_predicate_outlier_plan_id))
print('-'*100)

# Retrieve Unique set of PLAN_IDs for rownum outliers
np_rownum_outlier_plan_id = pd.unique(df_rownum_outliers['PLAN_ID'])
print('ROWNUM_VARIANTS:')
print(np_rownum_outlier_plan_id)
print(type(np_rownum_outlier_plan_id))
print('-'*100)

ACTUAL:
['9nf3gy0tv9p0u', 'dmarhxq3sjbay', 'atmzuqq2j04vf', '8h30qknj67qkd', 'c08uay6yqd6g6', 'cdnf103s6xdrq', '7709u7vc53hzp', '1v8msnbvxkyns', 'g1gk65zaj4v13', 'fguqxhgu1dsb0', '7jbz5k0dtf423', '8bkwvvpj53p99', '2r0jymb3zn4jf', '2xgw6vvusj8b5', 'b0v3ckntj8u2a', 'gc8fy2s1t1cu9', '20tqu460batd7', '0vu1tx383zny5', 'fcwqqyym0s6jt', '4w6s7g5fzs73j', '341gsjr61mshb', '6hxba954xkbr5', 'd8skjycj376g5', 'fxkcmts3gvwxq', 'cdhsvwqxkam8t', '2z07h80455ga1', 'c2z5yntnskd4a', '0tmf6pgnf5jnq', 'dhh64fnj09d5h', '246rprswfccwf', '893thpqvhsmtj', '13yty6ncn52g9', 'ax9nqy7g8gdjk', 'ctw35amk1n56t', '6zg8hz91awun3', '4a4gj8y2sg6za', '91qq5sbbw1wj2', '9wyaa29uhuujf', 'd88xndadpcsrd', 'bqusp3ck0v1tm', 'c6289n6x7q3ct', 'c4w987zzxa97v', '038sf3f71cmgz', '5vwb8shdzwy6f', '44wm17x9hs6ur', '7ny5n5kz9w8vr', '0n0fcwdb2n1d2', 'a9ps282b9czw9', 'cvj7vbpg7tczx', '5at0uqw2udhtj', 'ggy4j2s4hkn3j', '3fscxf8wh6kw8', '4qkmajxbvwfsj', 'b3ycr1ac6y7c8', 'b763ujgfhx022', 'bv8vvr33uwtck', 'ccr8dmjjxw13t', 'g5gubcbr1z1s1', '2gd3

## Stream Comparison with Hint Based Outliers

Compares the outlier queries with those in the inlier set. Because the SQL_ID differs between the two sets, the fuzzywuzzy library is used to link each outlier plan to its closest inlier plan through Levenshtein-based string matching on the plan's operations and object names.
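
The matching step can be illustrated with the standard library alone. The sketch below swaps in `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so its scores will not be the same as those from `process.extractOne`; the helper names `plan_signature` and `closest_plan` and the sample plans are hypothetical.

```python
import difflib


def plan_signature(operations, object_names):
    """Flatten a plan into one string, mirroring the concatenation used in the cell."""
    return " > ".join(operations) + " > ".join(object_names)


def closest_plan(query_sig, candidates):
    """Return the SQL_ID whose signature is most similar to query_sig."""
    return max(
        candidates,
        key=lambda sql_id: difflib.SequenceMatcher(
            None, query_sig, candidates[sql_id]
        ).ratio(),
    )


# Hypothetical inlier plans keyed by SQL_ID.
candidates = {
    "q1": plan_signature(
        ["SELECT STATEMENT", "SORT", "TABLE ACCESS"],
        ["0", "0", "ITEM"],
    ),
    "q2": plan_signature(
        ["SELECT STATEMENT", "HASH JOIN", "TABLE ACCESS", "TABLE ACCESS"],
        ["0", "0", "DATE_DIM", "STORE_SALES"],
    ),
}

# Hypothetical variant plan: same join shape as q2, but with an index access.
variant = plan_signature(
    ["SELECT STATEMENT", "HASH JOIN", "TABLE ACCESS", "INDEX"],
    ["0", "0", "DATE_DIM", "SS_SOLD_DATE_SK_INDEX"],
)

print(closest_plan(variant, candidates))
```

Returning the key directly avoids the second loop that maps the matched string back to its SQL_ID, and stays unambiguous when two inlier plans share an identical signature.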

In [13]:
counter = 0
for plan in np_hint_outlier_plan_id:
    #print(plan)
    print('\n\n---------------------------------------------------\nQuery variant [' + str(variant_ids[counter]) + '] with plan_id [' + str(plan) + ']')
    sql_plan = df_hints_outliers[df_hints_outliers['PLAN_ID'] == plan]['OPERATION']
    sql_match = sql_plan.str.cat(sep=' > ')
    sql_plan2 = df_hints_outliers[df_hints_outliers['PLAN_ID'] == plan]['OBJECT_NAME'].astype(str)
    sql_match2 = sql_plan2.str.cat(sep=' > ')
    sql_match = sql_match + sql_match2
    
    sql_id_list, inlier_plans = [], []
    for sql in np_sql_id:
        sql_plan2 = df[df['SQL_ID'] == sql]['OPERATION']
        sql_match2 = sql_plan2.str.cat(sep=' > ')
        sql_plan3 = df[df['SQL_ID'] == sql]['OBJECT_NAME'].astype(str)
        sql_match3 = sql_plan3.str.cat(sep=' > ')
        inlier_plans.append(sql_match2 + sql_match3)
        sql_id_list.append(sql)
    
    inlier_match = process.extractOne(sql_match, inlier_plans)
    #print(inlier_match)
    inlier_sql_id = None
    for i in range(len(inlier_plans)):

        if inlier_plans[i] == inlier_match[0]:
            inlier_sql_id = sql_id_list[i]
            break
    
    # Reads Inlier and Outlier plans into memory (Pandas Dataframes)
    df_inlier_plan = df[df['SQL_ID'] == inlier_sql_id]
    df_inlier_plan = df_inlier_plan.sort_values(by='ID', ascending=True)
    df_outlier_plan = df_hints_outliers[df_hints_outliers['PLAN_ID'] == plan]
    df_outlier_plan = df_outlier_plan.sort_values(by='ID', ascending=True)
    #print(df_inlier_plan['OBJECT_TYPE'])
    #print(df_outlier_plan['OBJECT_TYPE'])
    
    # Builds Trees
    inlier_tree = PlanTreeModeller.build_tree(df=df_inlier_plan)
    outlier_plan = PlanTreeModeller.build_tree(df=df_outlier_plan)
    
    # Compare Trees
    PlanTreeModeller.tree_compare(tree1=inlier_tree, 
                                  tree2=outlier_plan, 
                                  df1=df_inlier_plan, 
                                  df2=df_outlier_plan)
    
    counter += 1



---------------------------------------------------
Query variant [5] with plan_id [12447]
Access Predicate Difference detected!
Tree 1 difference at node [9] operator > INDEX(SAMPLE FAST FULL SCAN) on object [CR_RETURNING_HDEMO_SK_INDEX]
Tree 2 difference at node [10] operator > TABLE ACCESS(FULL) on object [DATE_DIM]
0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CR_RETURNING_HDEMO_SK_INDEX)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > UNION-ALL
                    ├── 6 > HASH
                    │   └── 7 > NESTED LOOPS
                    │       ├── 8 > NESTED LOOPS
            

    │               │   ├── 8 > SORT
    │               │   │   └── 9 > HASH JOIN
    │               │   │       ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
    │               │   │       └── 11 > HASH JOIN
    │               │   │           ├── 12 > TABLE ACCESS | FULL (ITEM)
    │               │   │           └── 13 > TABLE ACCESS | FULL (STORE_SALES)
    │               │   └── 14 > SORT
    │               │       └── 15 > HASH JOIN
    │               │           ├── 16 > TABLE ACCESS | FULL (ITEM)
    │               │           └── 17 > NESTED LOOPS
    │               │               ├── 18 > NESTED LOOPS
    │               │               │   ├── 19 > TABLE ACCESS | FULL (DATE_DIM)
    │               │               │   └── 20 > INDEX | RANGE SCAN (CS_SOLD_DATE_SK_INDEX)
    │               │               └── 21 > TABLE ACCESS | BY INDEX ROWID (CATALOG_SALES)
    │               └── 22 > SORT
    │                   └── 23 > HASH JOIN
    │                       ├── 24 > T

Total computed delta score [8396824228.827813]


---------------------------------------------------
Query variant [22] with plan_id [12451]
Access Predicate Difference detected!
Tree 1 difference at node [7] operator > TABLE ACCESS(FULL) on object [STORE]
Tree 2 difference at node [6] operator > TABLE ACCESS(FULL) on object [INVENTORY]
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > HASH
                └── 5 > COUNT
                    └── 6 > HASH JOIN
                        ├── 7 > TABLE ACCESS | FULL (STORE)
                        └── 8 > NESTED LOOPS
                            ├── 9 > NESTED LOOPS
                            │   ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
                            │   └── 11 > INDEX | RANGE SCAN (SS_SOLD_DATE_SK_INDEX)
                            └── 12 > TABLE ACCESS | BY INDEX ROWID (STORE_SALES)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > SORT
      

0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > WINDOW
                    └── 6 > VIEW (VW_FOJ_0)
                        └── 7 > HASH JOIN
                            ├── 8 > VIEW
                            │   └── 9 > WINDOW
                            │       └── 10 > SORT
                            │           └── 11 > HASH JOIN
                            │               ├── 12 > TABLE ACCESS | FULL (DATE_DIM)
                            │               └── 13 > TABLE ACCESS | FULL (STORE_SALES)
                            └── 14 > VIEW
                                └── 15 > WINDOW
                                    └── 16 > SORT
                                        └── 17 > NESTED LOOPS
                                            ├── 18 > NESTED LOOPS
                                            │   ├── 19 > TABLE ACCESS | FULL (DATE_DIM)
                                            │   └── 20 > INDE

                        │       └── 40 > HASH JOIN
                        │           ├── 41 > INDEX | FAST FULL SCAN (SYS_C0021223)
                        │           └── 42 > NESTED LOOPS
                        │               ├── 43 > NESTED LOOPS
                        │               │   ├── 44 > TABLE ACCESS | FULL (DATE_DIM)
                        │               │   └── 45 > INDEX | RANGE SCAN (WS_SOLD_DATE_SK_INDEX)
                        │               └── 46 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
                        └── 47 > VIEW
                            └── 48 > HASH
                                └── 49 > NESTED LOOPS
                                    ├── 50 > NESTED LOOPS
                                    │   ├── 51 > TABLE ACCESS | FULL (DATE_DIM)
                                    │   └── 52 > TABLE ACCESS | BY INDEX ROWID BATCHED (WEB_RETURNS)
                                    │       └── 53 > INDEX | RANGE SCAN (WR_RETURNED_DATE_SK_INDEX)
  

## Stream Comparison with Predicate Based Outliers

Compares the expected stream with the variation stream. Variations found here are composed of predicate changes to the original queries, made to purposely skew the plan.

In [14]:
counter = 0
for plan in np_predicate_outlier_plan_id:
    #print(plan)
    print('\n\n---------------------------------------------------\nQuery variant [' + str(variant_ids[counter]) + '] with plan_id [' + str(plan) + ']')
    sql_plan = df_predicate_outliers[df_predicate_outliers['PLAN_ID'] == plan]['OPERATION']
    sql_match = sql_plan.str.cat(sep=' > ')
    sql_plan2 = df_predicate_outliers[df_predicate_outliers['PLAN_ID'] == plan]['OBJECT_NAME'].astype(str)
    sql_match2 = sql_plan2.str.cat(sep=' > ')
    sql_match = sql_match + sql_match2
    
    sql_id_list, inlier_plans = [], []
    for sql in np_sql_id:
        sql_plan2 = df[df['SQL_ID'] == sql]['OPERATION']
        sql_match2 = sql_plan2.str.cat(sep=' > ')
        sql_plan3 = df[df['SQL_ID'] == sql]['OBJECT_NAME'].astype(str)
        sql_match3 = sql_plan3.str.cat(sep=' > ')
        inlier_plans.append(sql_match2 + sql_match3)
        sql_id_list.append(sql)
    
    inlier_match = process.extractOne(sql_match, inlier_plans)
    #print(inlier_match)
    inlier_sql_id = None
    for i in range(len(inlier_plans)):

        if inlier_plans[i] == inlier_match[0]:
            inlier_sql_id = sql_id_list[i]
            break
    
    # Reads Inlier and Outlier plans into memory (Pandas Dataframes)
    df_inlier_plan = df[df['SQL_ID'] == inlier_sql_id]
    df_inlier_plan = df_inlier_plan.sort_values(by='ID', ascending=True)
    df_outlier_plan = df_predicate_outliers[df_predicate_outliers['PLAN_ID'] == plan]
    df_outlier_plan = df_outlier_plan.sort_values(by='ID', ascending=True)
    
    # Builds Trees
    inlier_tree = PlanTreeModeller.build_tree(df=df_inlier_plan)
    outlier_plan = PlanTreeModeller.build_tree(df=df_outlier_plan)
    
    # Compare Trees
    PlanTreeModeller.tree_compare(tree1=inlier_tree, 
                                  tree2=outlier_plan, 
                                  df1=df_inlier_plan, 
                                  df2=df_outlier_plan)
    
    counter += 1



---------------------------------------------------
Query variant [5] with plan_id [12461]
Access Predicate Difference detected!
Tree 1 difference at node [9] operator > INDEX(SAMPLE FAST FULL SCAN) on object [CR_RETURNING_HDEMO_SK_INDEX]
Tree 2 difference at node [10] operator > TABLE ACCESS(FULL) on object [DATE_DIM]
0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CR_RETURNING_HDEMO_SK_INDEX)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > UNION-ALL
                    ├── 6 > HASH
                    │   └── 7 > NESTED LOOPS
                    │       ├── 8 > NESTED LOOPS
            

    │               │   ├── 8 > SORT
    │               │   │   └── 9 > HASH JOIN
    │               │   │       ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
    │               │   │       └── 11 > HASH JOIN
    │               │   │           ├── 12 > TABLE ACCESS | FULL (ITEM)
    │               │   │           └── 13 > TABLE ACCESS | FULL (STORE_SALES)
    │               │   └── 14 > SORT
    │               │       └── 15 > HASH JOIN
    │               │           ├── 16 > TABLE ACCESS | FULL (ITEM)
    │               │           └── 17 > NESTED LOOPS
    │               │               ├── 18 > NESTED LOOPS
    │               │               │   ├── 19 > TABLE ACCESS | FULL (DATE_DIM)
    │               │               │   └── 20 > INDEX | RANGE SCAN (CS_SOLD_DATE_SK_INDEX)
    │               │               └── 21 > TABLE ACCESS | BY INDEX ROWID (CATALOG_SALES)
    │               └── 22 > SORT
    │                   └── 23 > HASH JOIN
    │                       ├── 24 > T

Total computed delta score [3843711071.762744]


---------------------------------------------------
Query variant [22] with plan_id [12465]
Access Predicate Difference detected!
Tree 1 difference at node [2] operator > INDEX(FULL SCAN) on object [SYS_C0021203]
Tree 2 difference at node [6] operator > TABLE ACCESS(FULL) on object [ITEM]
0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > INDEX | FULL SCAN (SYS_C0021203)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > SORT
                └── 5 > HASH JOIN
                    ├── 6 > TABLE ACCESS | FULL (ITEM)
                    └── 7 > HASH JOIN
                        ├── 8 > TABLE ACCESS | FULL (DATE_DIM)
                        └── 9 > TABLE ACCESS | FULL (INVENTORY)
Stat Recommendation: 
1) Collect [INDEX (UNIQUE)] stats on [SYS_C0021203]
2) Collect [TABLE] stats on [ITEM]
3) Collect [TABLE] stats on [DATE_DIM]
Total computed delta score [6045890086.497116]


---------------------------------

        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > WINDOW
                    └── 6 > VIEW (VW_FOJ_0)
                        └── 7 > HASH JOIN
                            ├── 8 > VIEW
                            │   └── 9 > WINDOW
                            │       └── 10 > SORT
                            │           └── 11 > NESTED LOOPS
                            │               ├── 12 > NESTED LOOPS
                            │               │   ├── 13 > TABLE ACCESS | FULL (DATE_DIM)
                            │               │   └── 14 > INDEX | RANGE SCAN (WS_SOLD_DATE_SK_INDEX)
                            │               └── 15 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
                            └── 16 > VIEW
                                └── 17 > WINDOW
                                    └── 18 > SORT
                                        └── 19 > NESTED LOOPS
                                            ├── 20 > NESTED LOOPS
                    

                        │   └── 40 > HASH
                        │       └── 41 > HASH JOIN
                        │           ├── 42 > INDEX | FAST FULL SCAN (SYS_C0021223)
                        │           └── 43 > NESTED LOOPS
                        │               ├── 44 > NESTED LOOPS
                        │               │   ├── 45 > TABLE ACCESS | FULL (DATE_DIM)
                        │               │   └── 46 > INDEX | RANGE SCAN (WS_SOLD_DATE_SK_INDEX)
                        │               └── 47 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
                        └── 48 > VIEW
                            └── 49 > HASH
                                └── 50 > NESTED LOOPS
                                    ├── 51 > NESTED LOOPS
                                    │   ├── 52 > TABLE ACCESS | FULL (DATE_DIM)
                                    │   └── 53 > TABLE ACCESS | BY INDEX ROWID BATCHED (WEB_RETURNS)
                                    │       └── 54 > INDEX |

Total computed delta score [213100.513026967]


---------------------------------------------------
Query variant [86] with plan_id [12474]
Access Predicate Difference detected!
Tree 1 difference at node [7] operator > TABLE ACCESS(FULL) on object [STORE]
Tree 2 difference at node [7] operator > TABLE ACCESS(FULL) on object [ITEM]
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > HASH
                └── 5 > COUNT
                    └── 6 > HASH JOIN
                        ├── 7 > TABLE ACCESS | FULL (STORE)
                        └── 8 > NESTED LOOPS
                            ├── 9 > NESTED LOOPS
                            │   ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
                            │   └── 11 > INDEX | RANGE SCAN (SS_SOLD_DATE_SK_INDEX)
                            └── 12 > TABLE ACCESS | BY INDEX ROWID (STORE_SALES)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > WINDOW
          

## Stream Comparison with Rownum Based Outliers

Compares the expected stream with the variation stream. Variations found here are composed of rownum-based changes to the original queries, made to purposely skew the plan.

In [15]:
# Build one signature string (operations followed by object names) per baseline SQL_ID.
# These do not depend on the outlier plan, so they are computed once, outside the loop.
sql_id_list, inlier_plans = [], []
for sql in np_sql_id:
    df_sql = df[df['SQL_ID'] == sql]
    sql_match2 = df_sql['OPERATION'].str.cat(sep=' > ')
    sql_match3 = df_sql['OBJECT_NAME'].astype(str).str.cat(sep=' > ')
    inlier_plans.append(sql_match2 + sql_match3)
    sql_id_list.append(sql)

for counter, plan in enumerate(np_rownum_outlier_plan_id):
    print('\n\n---------------------------------------------------\nQuery variant [' + str(variant_ids[counter]) + '] with plan_id [' + str(plan) + ']')

    # Signature string of the outlier (variant) plan
    df_outlier_plan = df_rownum_outliers[df_rownum_outliers['PLAN_ID'] == plan]
    sql_match = df_outlier_plan['OPERATION'].str.cat(sep=' > ')
    sql_match += df_outlier_plan['OBJECT_NAME'].astype(str).str.cat(sep=' > ')

    # Fuzzy-match the outlier signature against the baseline signatures
    inlier_match = process.extractOne(sql_match, inlier_plans)

    # Recover the SQL_ID of the best-matching baseline plan
    inlier_sql_id = None
    for sql_id, inlier_plan in zip(sql_id_list, inlier_plans):
        if inlier_plan == inlier_match[0]:
            inlier_sql_id = sql_id
            break

    # Reads Inlier and Outlier plans into memory (Pandas Dataframes), ordered by plan step ID
    df_inlier_plan = df[df['SQL_ID'] == inlier_sql_id]
    df_inlier_plan = df_inlier_plan.sort_values(by='ID', ascending=True)
    df_outlier_plan = df_outlier_plan.sort_values(by='ID', ascending=True)

    # Builds Trees
    inlier_tree = PlanTreeModeller.build_tree(df=df_inlier_plan)
    outlier_plan = PlanTreeModeller.build_tree(df=df_outlier_plan)

    # Compare Trees
    PlanTreeModeller.tree_compare(tree1=inlier_tree,
                                  tree2=outlier_plan,
                                  df1=df_inlier_plan,
                                  df2=df_outlier_plan)



---------------------------------------------------
Query variant [5] with plan_id [12475]
Access Predicate Difference detected!
Tree 1 difference at node [9] operator > INDEX(SAMPLE FAST FULL SCAN) on object [CR_RETURNING_HDEMO_SK_INDEX]
Tree 2 difference at node [10] operator > TABLE ACCESS(FULL) on object [DATE_DIM]
0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CR_RETURNING_HDEMO_SK_INDEX)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > UNION-ALL
                    ├── 6 > HASH
                    │   └── 7 > NESTED LOOPS
                    │       ├── 8 > NESTED LOOPS
            

    │               │   │       ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
    │               │   │       └── 11 > HASH JOIN
    │               │   │           ├── 12 > TABLE ACCESS | FULL (ITEM)
    │               │   │           └── 13 > TABLE ACCESS | FULL (STORE_SALES)
    │               │   └── 14 > SORT
    │               │       └── 15 > HASH JOIN
    │               │           ├── 16 > TABLE ACCESS | FULL (ITEM)
    │               │           └── 17 > NESTED LOOPS
    │               │               ├── 18 > NESTED LOOPS
    │               │               │   ├── 19 > TABLE ACCESS | FULL (DATE_DIM)
    │               │               │   └── 20 > INDEX | RANGE SCAN (CS_SOLD_DATE_SK_INDEX)
    │               │               └── 21 > TABLE ACCESS | BY INDEX ROWID (CATALOG_SALES)
    │               └── 22 > SORT
    │                   └── 23 > HASH JOIN
    │                       ├── 24 > TABLE ACCESS | FULL (ITEM)
    │                       └── 25 > NESTED LOOPS
    │  

Total computed delta score [3336756189.7668915]


---------------------------------------------------
Query variant [22] with plan_id [12479]
Access Predicate Difference detected!
Tree 1 difference at node [2] operator > INDEX(FULL SCAN) on object [SYS_C0021203]
Tree 2 difference at node [6] operator > TABLE ACCESS(FULL) on object [ITEM]
0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > INDEX | FULL SCAN (SYS_C0021203)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > SORT
                └── 5 > HASH JOIN
                    ├── 6 > TABLE ACCESS | FULL (ITEM)
                    └── 7 > HASH JOIN
                        ├── 8 > TABLE ACCESS | FULL (DATE_DIM)
                        └── 9 > TABLE ACCESS | FULL (INVENTORY)
Stat Recommendation: 
1) Collect [INDEX (UNIQUE)] stats on [SYS_C0021203]
2) Collect [TABLE] stats on [ITEM]
3) Collect [TABLE] stats on [DATE_DIM]
Total computed delta score [6045886737.889441]


--------------------------------

                                └── 19 > VIEW
                                    └── 20 > WINDOW
                                        └── 21 > SORT
                                            └── 22 > COUNT
                                                └── 23 > NESTED LOOPS
                                                    ├── 24 > NESTED LOOPS
                                                    │   ├── 25 > TABLE ACCESS | BY INDEX ROWID (STORE_SALES)
                                                    │   │   └── 26 > INDEX | RANGE SCAN (SS_TICKET_NUMBER_INDEX)
                                                    │   └── 27 > INDEX | UNIQUE SCAN (SYS_C0021186)
                                                    └── 28 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > WINDOW
                    └── 6 > VIEW (VW_FOJ_0)
                        └── 7 > HASH JOIN
      

                    │               └── 23 > INDEX | UNIQUE SCAN (SYS_C0021206)
                    ├── 24 > MERGE JOIN
                    │   ├── 25 > VIEW
                    │   │   └── 26 > HASH
                    │   │       └── 27 > HASH JOIN
                    │   │           ├── 28 > TABLE ACCESS | FULL (DATE_DIM)
                    │   │           └── 29 > TABLE ACCESS | FULL (CATALOG_SALES)
                    │   └── 30 > BUFFER
                    │       └── 31 > VIEW
                    │           └── 32 > HASH
                    │               └── 33 > NESTED LOOPS
                    │                   ├── 34 > NESTED LOOPS
                    │                   │   ├── 35 > TABLE ACCESS | FULL (DATE_DIM)
                    │                   │   └── 36 > INDEX | RANGE SCAN (CR_RETURNED_DATE_SK_INDEX)
                    │                   └── 37 > TABLE ACCESS | BY INDEX ROWID (CATALOG_RETURNS)
                    └── 38 > HASH JOIN
                        

Access Predicate Difference detected!
Tree 1 difference at node [7] operator > TABLE ACCESS(FULL) on object [STORE]
Tree 2 difference at node [7] operator > TABLE ACCESS(FULL) on object [ITEM]
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > HASH
                └── 5 > COUNT
                    └── 6 > HASH JOIN
                        ├── 7 > TABLE ACCESS | FULL (STORE)
                        └── 8 > NESTED LOOPS
                            ├── 9 > NESTED LOOPS
                            │   ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
                            │   └── 11 > INDEX | RANGE SCAN (SS_SOLD_DATE_SK_INDEX)
                            └── 12 > TABLE ACCESS | BY INDEX ROWID (STORE_SALES)
0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > WINDOW
                └── 5 > SORT
                    └── 6 > HASH JOIN
                        ├── 7 > TABLE ACCESS | FULL (ITEM)
                        