# Schedule TPC-DS100 Plan Comparison (Variant to Variant)

This experiment is intended at quantifying the statistical recommendation technique, through comparison of two query streams. The query streams are denoted as follows:

* Expected Stream - Denotes a sequence of baseline query plans, against which comparison will be made.
* Variation Stream - Denotes a sequence of upcoming query plans. Queries found within the upcoming stream mirror those established in the Expected Stream, with a number of exceptions. These exceptions are considered as query variants, and contain a degree of change from the original queries taken from the prior stream.

Query variants are denoted below, and are therefore eligable to be flagged during the evaluation phase:

* Query 5  
* Query 10
* Query 14
* Query 18
* Query 22
* Query 27
* Query 35
* Query 36
* Query 51
* Query 67
* Query 70
* Query 77
* Query 80
* Query 86

This section is intended to establish a structured approach to access plan analysis. The inherit nature of an SQL (CBO) access plan is derived from an acyclic tree structure. Each node in the tree structure, denotes a data access operator, instructing the underlying engine how best to access data from the database. Representing access plan is particularly useful since this allows efficient traversal of the access plan, where in the most crucial data access operators span from the bottom up of the tree (SQL costs start from the children operators,  and culminate at the root operator). 

Past literature covers this modelling aspect to some degree:

***
Plan Selection based on Query Clustering - http://www.vldb.org/conf/2002/S06P02.pdf

Through a tool acroynmed as 'PLASTIC' (PLan Selection Through Incremental Clustering), optimizer access plan have been successfully modelled using acyclic tree. The authors, also go on to establish a similarity check technique, called SIMCHECK:

The SIMCHECK algorithm, whose pseudocode is shown in Figure 5, takes as input two query feature vectors and
outputs a boolean value indicating whether or not they are similar.   The algorithm operates  in  two  phases,  “Feature Vector Comparisons” and “Mapping Tables”.  In the first phase, the feature vectors are compared for equality on the number of tables, the sum of the table degrees, and the sum of  the  join  index  and  predicate  counts.   Only  if  there  is equality on all these structural features is the second phase invoked, otherwise the queries are deemed to be dis-similar. The  equality check is  done first  in  order to  identify dissimilar queries as early and as simply as possible.  For example, it is obvious that if the number of tables in the two
queries do not match, then their plans will also necessarily have to be different.  Such structural feature checks are used as an effective mechanism for stopping unproductive matching at an early stage.

<div style="width:image width px; font-size:80%; text-align:center;"><img src='Images/simcheck.png' alt="alternate text" width="width" height="height" style="padding-bottom:0.5em;" /><b>Simcheck Pseudocode</b></div>

In the Mapping Tables phase, we attempt to establish the closest possible one-to-one correspondence between the tables of the two queries. The tables are mapped to each other in order to check whether it is possible for the optimizer to use similar plans for accessing the mapped tables. The first step in this process is to determine the sets of compatible tables.  For every possible pair of compatible tables, SIMCHECK checks whether their original and (estimated) effective sizes are comparable through the use of a distance function.  If the outcome of the distance computations is less than a threshold value which is an algorithmic parameter, the queries are said to be similar. The notion of compatibility and the distance function are elucidated below.

__Table Compatibility__

We define two tables to be compatible if the degrees, join index counts and predicate counts are the same for both tables. The rationale for this notion of compatibility is explained below. Let  us  first  consider predicate counts. The predicate count for table in Figure 4(a) is  (2,1) since there are
two SARGable predicates and one non-SARGable predicate. Similarly, for table in Figure 4(b), the predicate count is (1, 2), and by our definition the tables are not compatible.  This makes intuitive sense when viewed in light
of the fact that if a predicate on a table is not SARGable, an optimizer cannot use an index to access that table. Thus, plans can change considerably even if the two queries differ on only a single table with respect to this criteria. A similar and stronger argument holds for join index counts. If indexes are available for a join predicate in one query and not in the other, it is very likely that the plans for the two queries will not match. This is because if both the attributes in a join predicate are indexed and the selectivities of the tables are high then it is possible to choose a plan involving an index join.  Similarly, if one of the attributes is indexed then the optimizer may choose to index on one table and fetch (table scan) on the other.
Note  that  even  if  the  join  index  counts  and  predicate counts for two queries match, the plans chosen by the optimizer may differ as there are other statistical factors such as the table sizes that affect plan choices. These factors are captured in the distance function discussed next.

__Query Distance Function__

After  compatible tables  are  identified,  SIMCHECK  tries to establish valid one-to-one mappings between the sets of compatible tables. These mappings are then compared using their original and estimated effective sizes, through a distance function dist(T1, T2), where T1 and T2 are the tables whose distance is to be computed.
***

In [1]:
# pandas
import pandas as pd
print('pandas: %s' % pd.__version__)
# numpy
import numpy as np
print('numpy: %s' % np.__version__)
# matplotlib
import matplotlib.pyplot as plt
# sklearn
import sklearn as sk
from sklearn import preprocessing
from sklearn.metrics.pairwise import euclidean_distances
#
# AnyTree
from anytree import Node, RenderTree, PostOrderIter

pandas: 0.24.1
numpy: 1.16.1


### Configuration Cell

Tweak parametric changes from this cell to influence outcome of experiment

In [2]:
# Experiment Config
tpcds='TPCDS100' # Schema upon which to operate test
test_split=.2
y_labels = ['COST',
            'CARDINALITY',
            'BYTES',
            # 'CPU_COST',
            'IO_COST',
            'TEMP_SPACE',
            'TIME']
black_list = ['TIMESTAMP',
              'SQL_ID',
              'IO_COST',
              'OPERATION',
              'OPTIONS',
              'OBJECT_NAME',
              'OBJECT_OWNER',
              'OBJECT_TYPE',
              'PARTITION_STOP',
              'PARTITION_START'] # Columns which will be ignored during type conversion, and later used for aggregation
nrows = 10000

### Read data from file into pandas dataframes

In [3]:
# Root path
base_dir = 'C:/Users/gabriel.sammut/University/'
#base_dir = 'D:/Projects/'
root_dir = base_dir + 'Data_ICS5200/Schedule/' + tpcds
src_dir = base_dir + 'ICS5200/src/sql/Runtime/TPC-DS/' + tpcds + '/Variants/'

rep_vsql_plan_path = root_dir + '/rep_vsql_plan.csv'
#rep_vsql_plan_path = root_dir + '/rep_vsql_plan.csv'

dtype={'COST':'int64',
       'CARDINALITY':'int64',
       'BYTES':'int64',
       # 'CPU_COST':'int64',
       'IO_COST':'int64',
       'TEMP_SPACE':'int64',
       'TIME':'int64',
       'OPERATION':'str',
       'OBJECT_NAME':'str'}
rep_vsql_plan_df = pd.read_csv(rep_vsql_plan_path, nrows=nrows, dtype=dtype)
print(rep_vsql_plan_df.head())
#
def prettify_header(headers):
    """
    Cleans header list from unwated character strings
    """
    header_list = []
    [header_list.append(header.replace("(","").replace(")","").replace("'","").replace(",","")) for header in headers]
    return header_list
#
rep_vsql_plan_df.columns = prettify_header(rep_vsql_plan_df.columns.values)
print('------------------------------------------')
print(rep_vsql_plan_df.columns)

    ('DBID',)    ('SQL_ID',)  ('PLAN_HASH_VALUE',)  ('ID',)    ('OPERATION',)  \
0  2634225673  2j8td2wuthnfv            1917374110        0  SELECT STATEMENT   
1  2634225673  2j8td2wuthnfv            1917374110        1      TABLE ACCESS   
2  2634225673  2j8td2wuthnfv            1917374110        2              SORT   
3  2634225673  2j8td2wuthnfv            1917374110        3      TABLE ACCESS   
4  2634225673  9nf3gy0tv9p0u            3537130676        0  SELECT STATEMENT   

  ('OPTIONS',) ('OBJECT_NODE',)  ('OBJECT#',) ('OBJECT_OWNER',)  \
0          NaN              NaN           NaN               NaN   
1         FULL              NaN        8693.0               SYS   
2    AGGREGATE              NaN           NaN               NaN   
3         FULL              NaN        8693.0               SYS   
4          NaN              NaN           NaN               NaN   

  ('OBJECT_NAME',)  ... ('ACCESS_PREDICATES',) ('FILTER_PREDICATES',)  \
0              NaN  ...              

### Read outlier data from file into pandas dataframes and concatenate

In [4]:
#
# CSV Outlier Paths
outlier_hints_q5_path = src_dir + 'hints/output/query_5.csv'
outlier_hints_q10_path = src_dir + 'hints/output/query_10.csv'
outlier_hints_q14_path = src_dir + 'hints/output/query_14.csv'
outlier_hints_q18_path = src_dir + 'hints/output/query_18.csv'
outlier_hints_q22_path = src_dir + 'hints/output/query_22.csv'
outlier_hints_q27_path = src_dir + 'hints/output/query_27.csv'
outlier_hints_q35_path = src_dir + 'hints/output/query_35.csv'
outlier_hints_q36_path = src_dir + 'hints/output/query_36.csv'
outlier_hints_q51_path = src_dir + 'hints/output/query_51.csv'
outlier_hints_q67_path = src_dir + 'hints/output/query_67.csv'
outlier_hints_q70_path = src_dir + 'hints/output/query_70.csv'
outlier_hints_q77_path = src_dir + 'hints/output/query_77.csv'
outlier_hints_q80_path = src_dir + 'hints/output/query_80.csv'
outlier_hints_q86_path = src_dir + 'hints/output/query_86.csv'
#
outlier_predicates_q5_path = src_dir + 'predicates/output/query_5.csv'
outlier_predicates_q10_path = src_dir + 'predicates/output/query_10.csv'
outlier_predicates_q14_path = src_dir + 'predicates/output/query_14.csv'
outlier_predicates_q18_path = src_dir + 'predicates/output/query_18.csv'
outlier_predicates_q22_path = src_dir + 'predicates/output/query_22.csv'
outlier_predicates_q27_path = src_dir + 'predicates/output/query_27.csv'
outlier_predicates_q35_path = src_dir + 'predicates/output/query_35.csv'
outlier_predicates_q36_path = src_dir + 'predicates/output/query_36.csv'
outlier_predicates_q51_path = src_dir + 'predicates/output/query_51.csv'
outlier_predicates_q67_path = src_dir + 'predicates/output/query_67.csv'
outlier_predicates_q70_path = src_dir + 'predicates/output/query_70.csv'
outlier_predicates_q77_path = src_dir + 'predicates/output/query_77.csv'
outlier_predicates_q80_path = src_dir + 'predicates/output/query_80.csv'
outlier_predicates_q86_path = src_dir + 'predicates/output/query_86.csv'
#
outlier_rownum_q5_path = src_dir + 'rownum/output/query_5.csv'
outlier_rownum_q10_path = src_dir + 'rownum/output/query_10.csv'
outlier_rownum_q14_path = src_dir + 'rownum/output/query_14.csv'
outlier_rownum_q18_path = src_dir + 'rownum/output/query_18.csv'
outlier_rownum_q22_path = src_dir + 'rownum/output/query_22.csv'
outlier_rownum_q27_path = src_dir + 'rownum/output/query_27.csv'
outlier_rownum_q35_path = src_dir + 'rownum/output/query_35.csv'
outlier_rownum_q36_path = src_dir + 'rownum/output/query_36.csv'
outlier_rownum_q51_path = src_dir + 'rownum/output/query_51.csv'
outlier_rownum_q67_path = src_dir + 'rownum/output/query_67.csv'
outlier_rownum_q70_path = src_dir + 'rownum/output/query_70.csv'
outlier_rownum_q77_path = src_dir + 'rownum/output/query_77.csv'
outlier_rownum_q80_path = src_dir + 'rownum/output/query_80.csv'
outlier_rownum_q86_path = src_dir + 'rownum/output/query_86.csv'
#
# Read CSV Paths
outlier_hints_q5_df = pd.read_csv(outlier_hints_q5_path,dtype=str)
outlier_hints_q10_df = pd.read_csv(outlier_hints_q10_path,dtype=str)
outlier_hints_q14_df = pd.read_csv(outlier_hints_q14_path,dtype=str)
outlier_hints_q18_df = pd.read_csv(outlier_hints_q18_path,dtype=str)
outlier_hints_q22_df = pd.read_csv(outlier_hints_q22_path,dtype=str)
outlier_hints_q27_df = pd.read_csv(outlier_hints_q27_path,dtype=str)
outlier_hints_q35_df = pd.read_csv(outlier_hints_q35_path,dtype=str)
outlier_hints_q36_df = pd.read_csv(outlier_hints_q36_path,dtype=str)
outlier_hints_q51_df = pd.read_csv(outlier_hints_q51_path,dtype=str)
outlier_hints_q67_df = pd.read_csv(outlier_hints_q67_path,dtype=str)
outlier_hints_q70_df = pd.read_csv(outlier_hints_q70_path,dtype=str)
outlier_hints_q77_df = pd.read_csv(outlier_hints_q77_path,dtype=str)
outlier_hints_q80_df = pd.read_csv(outlier_hints_q80_path,dtype=str)
outlier_hints_q86_df = pd.read_csv(outlier_hints_q86_path,dtype=str)
#
outlier_predicates_q5_df = pd.read_csv(outlier_predicates_q5_path,dtype=str)
outlier_predicates_q10_df = pd.read_csv(outlier_predicates_q10_path,dtype=str)
outlier_predicates_q14_df = pd.read_csv(outlier_predicates_q14_path,dtype=str)
outlier_predicates_q18_df = pd.read_csv(outlier_predicates_q18_path,dtype=str)
outlier_predicates_q22_df = pd.read_csv(outlier_predicates_q22_path,dtype=str)
outlier_predicates_q27_df = pd.read_csv(outlier_predicates_q27_path,dtype=str)
outlier_predicates_q35_df = pd.read_csv(outlier_predicates_q35_path,dtype=str)
outlier_predicates_q36_df = pd.read_csv(outlier_predicates_q36_path,dtype=str)
outlier_predicates_q51_df = pd.read_csv(outlier_predicates_q51_path,dtype=str)
outlier_predicates_q67_df = pd.read_csv(outlier_predicates_q67_path,dtype=str)
outlier_predicates_q70_df = pd.read_csv(outlier_predicates_q70_path,dtype=str)
outlier_predicates_q77_df = pd.read_csv(outlier_predicates_q77_path,dtype=str)
outlier_predicates_q80_df = pd.read_csv(outlier_predicates_q80_path,dtype=str)
outlier_predicates_q86_df = pd.read_csv(outlier_predicates_q86_path,dtype=str)
#
outlier_rownum_q5_df = pd.read_csv(outlier_rownum_q5_path,dtype=str)
outlier_rownum_q10_df = pd.read_csv(outlier_rownum_q10_path,dtype=str)
outlier_rownum_q14_df = pd.read_csv(outlier_rownum_q14_path,dtype=str)
outlier_rownum_q18_df = pd.read_csv(outlier_rownum_q18_path,dtype=str)
outlier_rownum_q22_df = pd.read_csv(outlier_rownum_q22_path,dtype=str)
outlier_rownum_q27_df = pd.read_csv(outlier_rownum_q27_path,dtype=str)
outlier_rownum_q35_df = pd.read_csv(outlier_rownum_q35_path,dtype=str)
outlier_rownum_q36_df = pd.read_csv(outlier_rownum_q36_path,dtype=str)
outlier_rownum_q51_df = pd.read_csv(outlier_rownum_q51_path,dtype=str)
outlier_rownum_q67_df = pd.read_csv(outlier_rownum_q67_path,dtype=str)
outlier_rownum_q70_df = pd.read_csv(outlier_rownum_q70_path,dtype=str)
outlier_rownum_q77_df = pd.read_csv(outlier_rownum_q77_path,dtype=str)
outlier_rownum_q80_df = pd.read_csv(outlier_rownum_q80_path,dtype=str)
outlier_rownum_q86_df = pd.read_csv(outlier_rownum_q86_path,dtype=str)
#
# Merge dataframes into a single pandas matrix
df_outliers = pd.concat([outlier_hints_q5_df, outlier_hints_q10_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q14_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q18_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q22_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q27_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q35_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q36_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q51_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q67_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q70_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q77_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q80_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_hints_q86_df], sort=False)
#
df_outliers = pd.concat([df_outliers, outlier_predicates_q5_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q10_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q14_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q18_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q22_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q27_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q35_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q36_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q51_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q67_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q70_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q77_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q80_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_predicates_q86_df], sort=False)
#
df_outliers = pd.concat([df_outliers, outlier_rownum_q5_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q10_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q14_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q18_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q22_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q27_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q35_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q36_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q51_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q67_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q70_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q77_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q80_df], sort=False)
df_outliers = pd.concat([df_outliers, outlier_rownum_q86_df], sort=False)   
#
print(df_outliers.shape)
print(df_outliers.head())
print('------------------------------------------')
print(df_outliers.columns)

(1433, 36)
  PLAN_ID            TIMESTAMP REMARKS         OPERATION          OPTIONS  \
0   12447  11/20/2018 09:56:46     NaN  SELECT STATEMENT              NaN   
1   12447  11/20/2018 09:56:46     NaN             COUNT          STOPKEY   
2   12447  11/20/2018 09:56:46     NaN              VIEW              NaN   
3   12447  11/20/2018 09:56:46     NaN              SORT  GROUP BY ROLLUP   
4   12447  11/20/2018 09:56:46     NaN              VIEW              NaN   

  OBJECT_NODE OBJECT_OWNER OBJECT_NAME                OBJECT_ALIAS  \
0         NaN          NaN         NaN                         NaN   
1         NaN          NaN         NaN                         NaN   
2         NaN     TPCDS100         NaN  from$_subquery$_018@SEL$11   
3         NaN          NaN         NaN                         NaN   
4         NaN     TPCDS100         NaN                    X@SEL$12   

  OBJECT_INSTANCE  ... DISTRIBUTION  CPU_COST  IO_COST TEMP_SPACE  \
0             NaN  ...          NaN 

### Dealing with empty values

In [5]:
def get_na_columns(df, headers):
    """
    Return columns which consist of NAN values
    """
    na_list = []
    for head in headers:
        if df[head].isnull().values.any():
            na_list.append(head)
    return na_list
#
print('N/A Columns\n')
print('\nREP_VSQL_PLAN Features ' + str(len(rep_vsql_plan_df.columns)) + ': ' + str(get_na_columns(df=rep_vsql_plan_df,headers=rep_vsql_plan_df.columns)) + "\n")
print('\nDF_OUTLIERS Features ' + str(len(df_outliers.columns)) + ': ' + str(get_na_columns(df=df_outliers,headers=df_outliers.columns)) + "\n")
#
def fill_na(df):
    """
    Replaces NA columns with 0s
    """
    return df.fillna(0)
#
# Populating NaN values with amount '0'
df = fill_na(df=rep_vsql_plan_df)
df_outliers = fill_na(df=df_outliers)

N/A Columns


REP_VSQL_PLAN Features 39: ['OPTIONS', 'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS', 'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG', 'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'OTHER', 'DISTRIBUTION', 'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'ACCESS_PREDICATES', 'FILTER_PREDICATES', 'PROJECTION', 'TIME', 'QBLOCK_NAME', 'REMARKS', 'OTHER_XML']


DF_OUTLIERS Features 36: ['PLAN_ID', 'REMARKS', 'OPERATION', 'OPTIONS', 'OBJECT_NODE', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS', 'OBJECT_INSTANCE', 'OBJECT_TYPE', 'OPTIMIZER', 'SEARCH_COLUMNS', 'ID', 'PARENT_ID', 'DEPTH', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG', 'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'OTHER', 'OTHER_XML', 'DISTRIBUTION', 'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'ACCESS_PREDICATES', 'FILTER_PREDICATES', 'PROJECTION', 'TIME', 'QBLOCK_NAME', 'Unnamed: 35']



### Type conversion

Each column is converted into a column of type values which are Integer64.

In [6]:
def handle_numeric_overflows(x):
    """
    Accepts a dataframe column, and 
    """
    try:
        #df = df.astype('int64')
        x1 = pd.DataFrame([x],dtype='int64')
    except ValueError:
        x = 9223372036854775807 # Max int size
    return x
#
for col in df.columns:
    try:
        if col in black_list:
            continue
        df[col] = df[col].apply(handle_numeric_overflows)
        df[col].astype('int64',inplace=True)
    except:
        df.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')
#
print('-------------------------------------------------------------')
#
for col in df_outliers.columns:
    try:
        if col in black_list:
            continue
        df_outliers[col] = df_outliers[col].astype('int64')
    except OverflowError:
        #
        # Handles numeric overflow conversions by replacing such values with max value inside the dataset.
        df_outliers[col] = df_outliers[col].apply(handle_numeric_overflows)
        df_outliers[col] = df_outliers[col].astype('int64')
    except Exception as e:
        df_outliers.drop(columns=col, inplace=True)
        print('Dropped column [' + col + ']')
print(df.columns)
print(df_outliers.columns)

-------------------------------------------------------------
Dropped column [REMARKS]
Dropped column [OBJECT_NODE]
Dropped column [OBJECT_ALIAS]
Dropped column [OBJECT_INSTANCE]
Dropped column [OPTIMIZER]
Dropped column [SEARCH_COLUMNS]
Dropped column [OTHER_XML]
Dropped column [DISTRIBUTION]
Dropped column [CPU_COST]
Dropped column [IO_COST]
Dropped column [ACCESS_PREDICATES]
Dropped column [FILTER_PREDICATES]
Dropped column [PROJECTION]
Dropped column [TIME]
Dropped column [QBLOCK_NAME]
Dropped column [Unnamed: 35]
Index(['DBID', 'SQL_ID', 'PLAN_HASH_VALUE', 'ID', 'OPERATION', 'OPTIONS',
       'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS',
       'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'DEPTH', 'POSITION',
       'SEARCH_COLUMNS', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG',
       'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'OTHER',
       'DISTRIBUTION', 'CPU_COST', 'IO_COST', 'TEMP_SPACE',
       'ACCESS_PREDICATES', 'FILTER_PREDICATES', 'PROJ

### Feature Selection

In this step, redundant features are dropped. Features are considered redundant if exhibit a standard devaition of 0 (meaning no change in value).

In [7]:
def drop_flatline_columns(df):
    columns = df.columns
    flatline_features = []
    for i in range(len(columns)):
        try:
            #
            if columns[i] in black_list:
                continue
            #
            std = df[columns[i]].std()
            if std == 0:
                flatline_features.append(columns[i])
        except:
            pass
    #
    #print('Features which are considered flatline:\n')
    #for col in flatline_features:
    #    print(col)
    print('\nShape before changes: [' + str(df.shape) + ']')
    df = df.drop(columns=flatline_features)
    print('Shape after changes: [' + str(df.shape) + ']')
    print('Dropped a total [' + str(len(flatline_features)) + ']')
    return df
#
df = drop_flatline_columns(df=df)
df_outliers = drop_flatline_columns(df=df_outliers)
#
print('\nAfter flatline column drop:')
print(df.shape)
print(df.columns)
#
print('--------------------------------------------------------')
print('\nAfter outlier flatline column drop:')
print(df_outliers.shape)
print(df_outliers.columns)


Shape before changes: [(10000, 39)]
Shape after changes: [(10000, 31)]
Dropped a total [8]

Shape before changes: [(1433, 20)]
Shape after changes: [(1433, 18)]
Dropped a total [2]

After flatline column drop:
(10000, 31)
Index(['SQL_ID', 'PLAN_HASH_VALUE', 'ID', 'OPERATION', 'OPTIONS',
       'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS',
       'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'DEPTH', 'POSITION',
       'SEARCH_COLUMNS', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG',
       'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'DISTRIBUTION',
       'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'TIME', 'QBLOCK_NAME', 'TIMESTAMP',
       'OTHER_XML'],
      dtype='object')
--------------------------------------------------------

After outlier flatline column drop:
(1433, 18)
Index(['PLAN_ID', 'TIMESTAMP', 'OPERATION', 'OPTIONS', 'OBJECT_OWNER',
       'OBJECT_NAME', 'OBJECT_TYPE', 'ID', 'PARENT_ID', 'DEPTH', 'POSITION',
       'COST', 'CARDINALITY', 'BYTES', '

### Scaling columns

This section attempts to process a number of data columns through a MinMax Scaler. This is done, to normalize data on a similar scaler, particularly before comparing column measurements using a euclidean based measure. The following columns will be targetted:

* CARDINALITY
* BYTES
* PARTITION_START
* PARTITION_STOP
* CPU_COST
* IO_COST
* TEMP_SPACE
* TIME

In [8]:
scaler = preprocessing.MinMaxScaler()
scaled_columns = ['CARDINALITY',
                'BYTES',
                # 'CPU_COST',
                'IO_COST',
                'TEMP_SPACE',
                'TIME']
print(df['PARTITION_START'].iloc[0])
df[scaled_columns] = scaler.fit_transform(df[scaled_columns])
print(df['PARTITION_START'].iloc[0])
print("Minimal Vector Points: " + str(scaler.data_min_))
print("Maximal Vector Points: " + str(scaler.data_max_))
#
print('\nAfter scaled column transformation:')
print(df.shape)
print(df.columns)
#
print('--------------------------------------------------------')
print('\nAfter outlier scaled column transformation:')
print(df_outliers.shape)
print(df_outliers.columns)

0
0
Minimal Vector Points: [0. 0. 0. 0. 0.]
Maximal Vector Points: [3.9933000e+08 3.6287625e+10 4.1962300e+05 2.4282000e+07 1.7000000e+01]

After scaled column transformation:
(10000, 31)
Index(['SQL_ID', 'PLAN_HASH_VALUE', 'ID', 'OPERATION', 'OPTIONS',
       'OBJECT_NODE', 'OBJECT#', 'OBJECT_OWNER', 'OBJECT_NAME', 'OBJECT_ALIAS',
       'OBJECT_TYPE', 'OPTIMIZER', 'PARENT_ID', 'DEPTH', 'POSITION',
       'SEARCH_COLUMNS', 'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG',
       'PARTITION_START', 'PARTITION_STOP', 'PARTITION_ID', 'DISTRIBUTION',
       'CPU_COST', 'IO_COST', 'TEMP_SPACE', 'TIME', 'QBLOCK_NAME', 'TIMESTAMP',
       'OTHER_XML'],
      dtype='object')
--------------------------------------------------------

After outlier scaled column transformation:
(1433, 18)
Index(['PLAN_ID', 'TIMESTAMP', 'OPERATION', 'OPTIONS', 'OBJECT_OWNER',
       'OBJECT_NAME', 'OBJECT_TYPE', 'ID', 'PARENT_ID', 'DEPTH', 'POSITION',
       'COST', 'CARDINALITY', 'BYTES', 'OTHER_TAG', 'PARTITION_STAR

### Adding Grouping Column

An extra column is added to allow access plans to be isolated per instance

In [9]:
#
# Adds a columns per SQL_ID, PLAN_HASH_VALUE grouping, which can be used to group instances together
def add_grouping_column(df, column_identifier):
    """
    Receives a pandas dataframe, and adds a new column which allows dataframe to be aggregated per 
    SQL_ID, PLAN_HASH_VALUE combination.
    
    :param: df                - Pandas Dataframe
    :param: column_identifier - String denoting matrix column to group by
    
    :return: Pandas Dataframe, with added column    
    """
    print('Shape before transformation: ' + str(df.shape))
    new_grouping_col = []
    counter = 0
    last_sql_id = df[column_identifier].iloc(0) # Starts with first SQL_ID
    for index, row in df.iterrows():
        if column_identifier == 'SQL_ID':
            if last_sql_id != row.SQL_ID:
                last_sql_id = row.SQL_ID
                counter += 1
        elif column_identifier == 'PLAN_ID':
            if last_sql_id != row.PLAN_ID:
                last_sql_id = row.PLAN_ID
                counter += 1
        else:
            raise ValueError('Column does not exist!')
        new_grouping_col.append(counter)
    #
    # Append list as new column
    new_col = pd.Series(new_grouping_col)
    df['PLAN_INSTANCE'] = new_col.values
    print('Shape after transformation: ' + str(df.shape))
    return df
#
df = add_grouping_column(df=df,column_identifier='SQL_ID')
df_outliers = add_grouping_column(df=df_outliers,column_identifier='PLAN_ID')

Shape before transformation: (10000, 31)
Shape after transformation: (10000, 32)
Shape before transformation: (1433, 18)
Shape after transformation: (1433, 19)


### Tree Formatting

Constructs the tree plan structure

In [10]:
class PlanTreeModeller:
    """
    This class simulates an access plan in the form of a tree structure
    """
    
    @staticmethod
    def __create_node(node_name, parent=None):
        """
        Builds a node which will be added to the tree. If the parent is 'None', it is assumed that this
        node will be used as the root/parent Node.
        
        :param: node_name - String specifying node name.
        :param: parent    - Parent node specifying parent node name.
        
        :return: anytree object
        """
        if node_name is None:
            raise ValueError('Node name was not specified!')
        
        if parent is None:
            node = Node(node_name)
        else:
            node = Node(node_name, parent=parent)
        
        return node
    
    @staticmethod
    def build_tree(df):
        """
        This method receives a pandas dataframe, and converts it into a searchable python tree
        
        :param: df - Pandas Dataframe, pertaining to input access plan
        
        :return: Dictionary object, consisting of node objects (which are linked in a tree fashion)
        """
        parent_node = None
        node_dict = {}
        for index, row in df.iterrows():
            
            # Build Node and add to parent
            row_id = int(row['ID'])
            parent_id = int(row['PARENT_ID'])
            
            if row_id == 0:
                node = PlanTreeModeller.__create_node(node_name=row_id)
            else:
                parent_node = node_dict[parent_id]
                node = PlanTreeModeller.__create_node(node_name=row_id, parent=parent_node)
            node_dict[row_id] = node
        
        return node_dict # Dictionary consisting of tree nodes
    
    @staticmethod
    def __retrieve_plan_details(df, node_name):
        """
        Accepts a dataframe, and the node_name. Retrieves features pertaining to the row id in the access plan
        
        :param: df - Dataframe consisting of access plan features
        :param: id - String id denoting which row to retrieve from the parameter dataframe
        
        :return: Dictionary consisting of access plan attributes
        """
        operation = str(df[df['ID'] == node_name]['OPERATION'].iloc[0])
        options = str(df[df['ID'] == node_name]['OPTIONS'].iloc[0])
        object_name = str(df[df['ID'] == node_name]['OBJECT_NAME'].iloc[0])
        try:
            object_type = str(df[df['ID'] == node_name]['OBJECT_TYPE'].iloc[0])
        except KeyError: # This is required because variant query plans do not have this node.
            object_type = None
        cardinality = int(df[df['ID'] == node_name]['CARDINALITY'].iloc[0])
        bytess = int(df[df['ID'] == node_name]['BYTES'].iloc[0])
        partition_delta = int(df[df['ID'] == node_name]['PARTITION_STOP'].iloc[0]) - int(df[df['ID'] == node_name]['PARTITION_START'].iloc[0])
        # cpu_cost = int(df[df['ID'] == node_name]['CPU_COST'].iloc[0])
        io_cost = int(df[df['ID'] == node_name]['IO_COST'].iloc[0])
        temp_space = int(df[df['ID'] == node_name]['TEMP_SPACE'].iloc[0])
        time = int(df[df['ID'] == node_name]['TIME'].iloc[0]) 
        
        return {'OPERATION':operation,
                'OPTIONS':options,
                'OBJECT_NAME':object_name,
                'OBJECT_TYPE':object_type,
                'CARDINALITY':cardinality,
                'BYTES':bytess,
                'PARTITION_DELTA':partition_delta,
                # 'CPU_COST':cpu_cost,
                'IO_COST':io_cost,
                'TEMP_SPACE':temp_space,
                'TIME':time}
    
    @staticmethod
    def __tree_node_euclidean(tree_dict1, tree_dict2):
        """
        This method calculates the eucldiean distance between two vectors.
        
        :param: tree_dict1 - Dictionary denoting a single node within plan / tree 1
        :param: tree_dict2 - Dictionary denoting a single node within plan / tree 2
        
        :return: List denoting euclidean distance
        """
        tree_vector_1 = [tree_dict1['CARDINALITY'],
                         tree_dict1['BYTES'],
                         tree_dict1['PARTITION_DELTA'],
                         #tree_dict1['CPU_COST'],
                         tree_dict1['IO_COST'],
                         tree_dict1['TEMP_SPACE'],
                         tree_dict1['TIME']]
        
        tree_vector_2 = [tree_dict2['CARDINALITY'],
                         tree_dict2['BYTES'],
                         tree_dict2['PARTITION_DELTA'],
                         #tree_dict2['CPU_COST'],
                         tree_dict2['IO_COST'],
                         tree_dict2['TEMP_SPACE'],
                         tree_dict2['TIME']]
        
        euc_distance = euclidean_distances([tree_vector_1],[tree_vector_2])
        return euc_distance[0][0]
    
    @staticmethod
    def render_tree(tree, df):
        """
        Renders Tree by printing to screen
        
        :param: tree - AnyTree object, representing tree modelled access plan
        :param: df   - Pandas dataframe representatnt of the access plan about to be rendered
        
        :return: None
        """
        for pre, fill, node in RenderTree(tree):
            
            access_plan_dict = PlanTreeModeller.__retrieve_plan_details(df=df,
                                                                        node_name = node.name)
            
            if access_plan_dict['OBJECT_NAME'] == '0':
                print("%s%s > %s" % (pre, node.name, access_plan_dict['OPERATION']))
            else:
                if access_plan_dict['OPTIONS'] == '0': 
                    print("%s%s > %s (%s)" % (pre, node.name, access_plan_dict['OPERATION'], access_plan_dict['OBJECT_NAME']))
                else:
                    print("%s%s > %s | %s (%s)" % (pre, node.name, access_plan_dict['OPERATION'], access_plan_dict['OPTIONS'], access_plan_dict['OBJECT_NAME']))
    
    @staticmethod
    def __postorder(tree):
        """
        Accepts a tree, and iterates in post order fashion (left,right,root)
        
        :param: tree - Dictionary consisting of AnyTree Nodes
        
        :return: List consisting of tree traversal order
        """
        post_order_traversal = [node.name for node in PostOrderIter(tree[0])]
        return post_order_traversal
    
    @staticmethod
    def tree_compare(tree1, tree2, df1, df2):
        """
        Accepts two trees of type 'AnyTree', along with respective dataframe denoting each respective access
        path.
        
        :param: tree1 - Dictionary consisting of 'AnyTree' nodes, belonging to tree 1
        :param: tree2 - Dictionary consisting of 'AnyTree' nodes, belonging to tree 2
        :param: df1   - Pandas dataframe consisting of access plan instructions opted for by tree 1
        :param: df2   - Pandas dataframe consisting of access plan instructions opted for by tree 2
        
        :return: None
        """
        
        # Retrieves traversal order for both trees
        post_order_traversal1 = PlanTreeModeller.__postorder(tree1)
        post_order_traversal2 = PlanTreeModeller.__postorder(tree2)
        
        # Iterates over traversal order, until a change is encountered
        max_range = max(len(post_order_traversal1),len(post_order_traversal2))
        delta_flag = True
        euclidean_measure = []
        for i in range(0,max_range):
            
            # This check avoids a list IndexError for scebarious when one plan is bigger than the others,
            # and consequently the number of node traversals is bigger than the other tree.
            if i >= len(post_order_traversal1) or i >= len(post_order_traversal2):
                break
            
            # Retrive prior, current, and next nodes
            try:
                id_1_prev = post_order_traversal1[i-1]
                id_2_prev = post_order_traversal2[i-1]
            except IndexError:
                id_1_prev = None
                id_2_prev = None
            try:
                id_1 = post_order_traversal1[i]
                id_2 = post_order_traversal2[i]
            except IndexError:
                id_1 = None
                id_2 = None
            try:
                id_1_next = post_order_traversal1[i+1]
                id_2_next = post_order_traversal2[i+1]
            except IndexError:
                id_1_next = None
                id_2_next = None

            if id_1_prev is not None and id_2_prev is not None:
                pd_tree1_prev = PlanTreeModeller.__retrieve_plan_details(df=df1, node_name=id_1_prev)
                pd_tree2_prev = PlanTreeModeller.__retrieve_plan_details(df=df2, node_name=id_2_prev)
            if id_1 is not None and id_2 is not None:
                pd_tree1 = PlanTreeModeller.__retrieve_plan_details(df=df1, node_name=id_1)
                pd_tree2 = PlanTreeModeller.__retrieve_plan_details(df=df2, node_name=id_2)
            if id_1_next is not None and id_2_next is not None:
                pd_tree1_next = PlanTreeModeller.__retrieve_plan_details(df=df1, node_name=id_1_next)
                pd_tree2_next = PlanTreeModeller.__retrieve_plan_details(df=df2, node_name=id_2_next)
            
            if (pd_tree1['OPERATION'] != pd_tree2['OPERATION'] or pd_tree1['OBJECT_NAME'] != pd_tree2['OBJECT_NAME'] or pd_tree1['OPTIONS'] != pd_tree2['OPTIONS']) and delta_flag:
                print('Access Predicate Difference detected!')
                print('Tree 1 difference at node [' + str(id_1) + '] operator > ' + str(pd_tree1['OPERATION']) + '(' + str(pd_tree1['OPTIONS']) + ') on object [' + pd_tree1['OBJECT_NAME'] + ']')
                print('Tree 2 difference at node [' + str(id_2) + '] operator > ' + str(pd_tree2['OPERATION']) + '(' + str(pd_tree2['OPTIONS']) + ') on object [' + pd_tree2['OBJECT_NAME'] + ']')
                PlanTreeModeller.render_tree(tree=tree1[0], df=df1) # Tree rendederer uses root node and traverses downwards
                PlanTreeModeller.render_tree(tree=tree2[0], df=df2) # Tree rendederer uses root node and traverses downwards
                
                encountered_recommendations = []
                print('Stat Recommendation: ')
                display_counter = 1
                if pd_tree1['OBJECT_TYPE'] != '0' and pd_tree1['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + str(pd_tree1['OBJECT_TYPE']) + '] stats on [' + str(pd_tree1['OBJECT_NAME']) + ']')
                    encountered_recommendations.append(pd_tree1['OBJECT_NAME'])
                    display_counter += 1
                if pd_tree2['OBJECT_TYPE'] != '0' and pd_tree2['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + str(pd_tree2['OBJECT_TYPE']) + '] stats on [' + str(pd_tree2['OBJECT_NAME']) + ']')
                    encountered_recommendations.append(pd_tree2['OBJECT_NAME'])
                    display_counter += 1
#                 if pd_tree1_prev['OBJECT_TYPE'] != '0' and pd_tree1_prev['OBJECT_NAME'] not in encountered_recommendations:
#                     print(str(display_counter) + ') Collect [' + pd_tree1_prev['OBJECT_TYPE'] + '] stats on [' + pd_tree1_prev['OBJECT_NAME'] + ']')
#                     encountered_recommendations.append(pd_tree1_prev['OBJECT_NAME'])
#                     display_counter += 1
#                 if pd_tree2_prev['OBJECT_TYPE'] != '0' and pd_tree2_prev['OBJECT_NAME'] not in encountered_recommendations:
#                     print(str(display_counter) + ') Collect [' + pd_tree2_prev['OBJECT_TYPE'] + '] stats on [' + pd_tree2_prev['OBJECT_NAME'] + ']')
#                     encountered_recommendations.append(pd_tree2_prev['OBJECT_NAME'])
#                     display_counter += 1
                if pd_tree1_next['OBJECT_TYPE'] != '0' and pd_tree1_next['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + str(pd_tree1_next['OBJECT_TYPE']) + '] stats on [' + str(pd_tree1_next['OBJECT_NAME']) + ']')
                    encountered_recommendations.append(pd_tree1_next['OBJECT_NAME'])
                    display_counter += 1
                if pd_tree2_next['OBJECT_TYPE'] != '0' and pd_tree2_next['OBJECT_NAME'] not in encountered_recommendations:
                    print(str(display_counter) + ') Collect [' + str(pd_tree2_next['OBJECT_TYPE']) + '] stats on [' + str(pd_tree2_next['OBJECT_NAME'])+ ']')
                    encountered_recommendations.append(pd_tree2_prev['OBJECT_NAME'])
                    display_counter += 1
                delta_flag = False
            
            # Calculate Node Euclidean Measure
            euclidean_vector = PlanTreeModeller.__tree_node_euclidean(tree_dict1=pd_tree1,
                                                                      tree_dict2=pd_tree2)
            euclidean_measure.append(euclidean_vector)
            
        if delta_flag is not False and sum(euclidean_measure) > 10000:
            print('Access Predicate Difference detected!')
            print('Plan structure was the same, but a big operator difference was detected with delta score [' + str(sum(euclidean_measure))  + ']')
            PlanTreeModeller.render_tree(tree=tree1[0], df=df1) # Tree rendederer uses root node and traverses downwards
            PlanTreeModeller.render_tree(tree=tree2[0], df=df2) # Tree rendederer uses root node and traverses downwards
        
        if delta_flag:
            print('No plan differences detected.')
        
        print('Total computed delta score [' + str(sum(euclidean_measure)) + ']')

### Captured REP_VSQL_PLANS plans

This section contains metrics pertaining to plans captured by the data capture tool

In [11]:
#
# Retrieve Unique set of PLAN_HASH_VALUES
np_sql_id, np_plan_hash_value, np_plan_instance = pd.unique(df['SQL_ID']),pd.unique(df['PLAN_HASH_VALUE']),pd.unique(df['PLAN_INSTANCE'])
print(np_sql_id)
print(type(np_sql_id))
print(np_plan_hash_value)
print(type(np_plan_hash_value))
print(np_plan_instance)
print(type(np_plan_instance))
print('-'*100)
#
# Iterate over each PLAN_HASH_VALUE, and retrieve PLAN subset                                                                                                                 
for plan_instance in np_plan_instance:
    #
    # Retrieve only a single instance of the plan (as annotated at beginning of experiment)
    df_temp_plan = df[df['PLAN_INSTANCE'] == plan_instance]
    #
    # This step ensures that only TPC-DS related queries are displayed
    tpc_check = df_temp_plan['OBJECT_OWNER'].tolist()
    if tpcds not in tpc_check:
        continue
    #
    # Discards plans with double entries - Due to the parallel nature of the throughput test for 
    # TPC-DS, multiple threads may execute the same query at the same time, resulting in sql access
    # plans with the same SQL_ID, same PLAN_HASH_VALUE, and same TIMESTAMP. Such occurances are skipped.
    df_temp_count = df_temp_plan[df_temp_plan['ID'] == 0]
    if df_temp_count.shape[0] != 1:
        continue
    #
    # Sorts by ID ascending - This clause may be redundant due to the natural order of the data capture tool
    df_temp_plan = df_temp_plan.sort_values(by='ID', ascending=True)
    #
    # Builds Tree
    tree = PlanTreeModeller.build_tree(df=df_temp_plan)
    #
    # Renders Tree
    print('SQL_ID [' + str(df_temp_plan['SQL_ID'].iloc[0]) + '] with PLAN_HASH_VALUE [' + str(df_temp_plan['PLAN_HASH_VALUE'].iloc[0]) + ']\n')
    PlanTreeModeller.render_tree(tree=tree[0], df=df_temp_plan) # Tree rendederer uses root node and traverses downwards
    print('-'*100)

['2j8td2wuthnfv' '9nf3gy0tv9p0u' 'dmarhxq3sjbay' 'atmzuqq2j04vf'
 '8h30qknj67qkd' 'c08uay6yqd6g6' 'cdnf103s6xdrq' '7709u7vc53hzp'
 '1v8msnbvxkyns' 'g1gk65zaj4v13' 'fguqxhgu1dsb0' '7jbz5k0dtf423'
 '2d3zgup3azkv9' '10pyxwav0mqs9' '2a1dk8mrn7130' '8bkwvvpj53p99'
 '2r0jymb3zn4jf' '2xgw6vvusj8b5' 'b0v3ckntj8u2a' 'gc8fy2s1t1cu9'
 '20tqu460batd7' '0vu1tx383zny5' 'fcwqqyym0s6jt' '4w6s7g5fzs73j'
 '341gsjr61mshb' '6hxba954xkbr5' 'd8skjycj376g5' 'fxkcmts3gvwxq'
 'cdhsvwqxkam8t' '2z07h80455ga1' 'c2z5yntnskd4a' '0tmf6pgnf5jnq'
 'dhh64fnj09d5h' '246rprswfccwf' '893thpqvhsmtj' '13yty6ncn52g9'
 'ax9nqy7g8gdjk' 'ctw35amk1n56t' '6zg8hz91awun3' '4a4gj8y2sg6za'
 '91qq5sbbw1wj2' '9wyaa29uhuujf' 'd88xndadpcsrd' 'c1hdntnnta3t0'
 '3qkhfbf2kyvhk' 'bqusp3ck0v1tm' 'c6289n6x7q3ct' 'c4w987zzxa97v'
 '038sf3f71cmgz' '5vwb8shdzwy6f' '44wm17x9hs6ur' '7ny5n5kz9w8vr'
 '0n0fcwdb2n1d2' 'a9ps282b9czw9' 'cvj7vbpg7tczx' '5at0uqw2udhtj'
 'ggy4j2s4hkn3j' '3fscxf8wh6kw8' 'af0v6h89jw9z2' 'fpu02mmzrujzw'
 '4qkmajxbvwfsj' 'b3ycr1a

                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CR_RETURNING_HDEMO_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [dmarhxq3sjbay] with PLAN_HASH_VALUE [1792294413]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CR_CALL_CENTER_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [atmzuqq2j04vf] with PLAN_HASH_VALUE [4122049279]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 

            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CS_BILL_HDEMO_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [20tqu460batd7] with PLAN_HASH_VALUE [2858535601]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (CS_SHIP_CDEMO_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [0vu1tx383zny5] with PLAN_HASH_VALUE [106

                        └── 7 > TABLE ACCESS | FULL (ITEM)
----------------------------------------------------------------------------------------------------
SQL_ID [ctw35amk1n56t] with PLAN_HASH_VALUE [2848590504]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10000)
            └── 4 > SORT
                └── 5 > OPTIMIZER STATISTICS GATHERING
                    └── 6 > PX BLOCK
                        └── 7 > TABLE ACCESS | FULL (STORE_RETURNS)
----------------------------------------------------------------------------------------------------
SQL_ID [6zg8hz91awun3] with PLAN_HASH_VALUE [3813252756]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                      

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10000)
            └── 4 > SORT
                └── 5 > OPTIMIZER STATISTICS GATHERING
                    └── 6 > PX BLOCK
                        └── 7 > TABLE ACCESS | FULL (S_INVENTORY)
----------------------------------------------------------------------------------------------------
SQL_ID [ggy4j2s4hkn3j] with PLAN_HASH_VALUE [1425259211]

0 > SELECT STATEMENT
└── 1 > WINDOW
    └── 2 > VIEW
        └── 3 > WINDOW
            └── 4 > PX COORDINATOR
                └── 5 > PX SEND | QC (RANDOM) (:TQ10001)
                    └── 6 > HASH
                        └── 7 > PX RECEIVE
                            └── 8 > PX SEND | HASH (:TQ10000)
                                └── 9 > HASH
                                    └── 10 > PX BLOCK
                                        └── 11 > TABLE ACCESS | SAMPLE BY ROWID RANGE (S_INVENTORY)
-------------------------------------------

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10000)
            └── 4 > SORT
                └── 5 > OPTIMIZER STATISTICS GATHERING
                    └── 6 > PX BLOCK
                        └── 7 > TABLE ACCESS | FULL (WEB_SALES)
----------------------------------------------------------------------------------------------------
SQL_ID [1mdg3uscympsp] with PLAN_HASH_VALUE [2718200651]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10000)
            └── 4 > SORT
                └── 5 > OPTIMIZER STATISTICS GATHERING
                    └── 6 > PX BLOCK
                        └── 7 > TABLE ACCESS | FULL (WEB_SALES)
----------------------------------------------------------------------------------------------------
SQL_ID [bsg09s5xtx55p] with PLAN_HASH_VALUE [1341767134]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > INLIST ITERATOR
        └── 3 > TABLE ACCESS | BY USER

                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (WS_WEB_PAGE_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [bh2r3y6jjnfy3] with PLAN_HASH_VALUE [1243077737]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDINATOR
        └── 3 > PX SEND | QC (RANDOM) (:TQ10001)
            └── 4 > SORT
                └── 5 > PX RECEIVE
                    └── 6 > PX SEND | HASH (:TQ10000)
                        └── 7 > SORT
                            └── 8 > PX BLOCK
                                └── 9 > INDEX | SAMPLE FAST FULL SCAN (WS_WEB_SITE_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [74fu7h6nfsfta] with PLAN_HASH_VALUE [500507011]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > PX COORDI

    │                           │   │       └── 39 > INDEX | UNIQUE SCAN (SYS_C0021186)
    │                           │   └── 40 > TABLE ACCESS | BY INDEX ROWID (ITEM)
    │                           │       └── 41 > INDEX | UNIQUE SCAN (SYS_C0021203)
    │                           └── 42 > TABLE ACCESS | BY INDEX ROWID (WEB_RETURNS)
    │                               └── 43 > INDEX | UNIQUE SCAN (SYS_C0021239)
    └── 44 > COUNT
        └── 45 > VIEW
            └── 46 > SORT
                └── 47 > COUNT
                    └── 48 > HASH JOIN
                        ├── 49 > VIEW
                        │   └── 50 > TABLE ACCESS | FULL (SYS_TEMP_0FDA0010E_141942F5)
                        └── 51 > VIEW
                            └── 52 > TABLE ACCESS | FULL (SYS_TEMP_0FDA0010E_141942F5)
----------------------------------------------------------------------------------------------------
SQL_ID [6p1s22js3rhk8] with PLAN_HASH_VALUE [2789789120]

0 > SELECT STATEMENT
└── 1 > COUNT


                └── 19 > VIEW
                    └── 20 > TABLE ACCESS | FULL (SYS_TEMP_0FD9FD661_141942F5)
----------------------------------------------------------------------------------------------------
SQL_ID [bn52fdgrz06r0] with PLAN_HASH_VALUE [3867543437]

0 > SELECT STATEMENT
└── 1 > TEMP TABLE TRANSFORMATION
    ├── 2 > LOAD AS SELECT
    │   └── 3 > VIEW
    │       └── 4 > HASH
    │           └── 5 > COUNT
    │               └── 6 > HASH JOIN
    │                   ├── 7 > TABLE ACCESS | FULL (WAREHOUSE)
    │                   └── 8 > NESTED LOOPS
    │                       ├── 9 > NESTED LOOPS
    │                       │   ├── 10 > TABLE ACCESS | FULL (DATE_DIM)
    │                       │   └── 11 > TABLE ACCESS | BY INDEX ROWID BATCHED (INVENTORY)
    │                       │       └── 12 > INDEX | RANGE SCAN (INV_DATE_SK_INDEX)
    │                       └── 13 > INDEX | UNIQUE SCAN (SYS_C0021203)
    └── 14 > SORT
        └── 15 > COUNT
            └── 16

                    │                   │   │           └── 67 > INDEX | RANGE SCAN (SYS_C0021203)
                    │                   │   └── 68 > TABLE ACCESS | BY INDEX ROWID (CATALOG_RETURNS)
                    │                   │       └── 69 > INDEX | UNIQUE SCAN (SYS_C0021236)
                    │                   └── 70 > TABLE ACCESS | BY INDEX ROWID BATCHED (CATALOG_RETURNS)
                    │                       └── 71 > INDEX | RANGE SCAN (CR_ORDER_NUMBER_INDEX)
                    └── 72 > COUNT
                        └── 73 > VIEW
                            └── 74 > SORT
                                └── 75 > COUNT
                                    └── 76 > HASH JOIN
                                        ├── 77 > NESTED LOOPS
                                        │   ├── 78 > NESTED LOOPS
                                        │   │   ├── 79 > STATISTICS COLLECTOR
                                        │   │   │   └── 80 > HASH JOIN
             

                    │   │   │                                   └── 37 > INDEX | RANGE SCAN (SS_STORE_SK_INDEX)
                    │   │   └── 38 > INDEX | UNIQUE SCAN (SYS_C0021203)
                    │   └── 39 > TABLE ACCESS | BY INDEX ROWID (ITEM)
                    └── 40 > TABLE ACCESS | FULL (ITEM)
----------------------------------------------------------------------------------------------------
SQL_ID [8jc2yh6dw1x4j] with PLAN_HASH_VALUE [3395427503]

0 > SELECT STATEMENT
└── 1 > RESULT CACHE (gnaw6sfhwd56c19knxd0sddk3q)
    └── 2 > SORT
        └── 3 > NESTED LOOPS
            ├── 4 > NESTED LOOPS
            │   ├── 5 > TABLE ACCESS | BY INDEX ROWID (ITEM)
            │   │   └── 6 > INDEX | RANGE SCAN (SYS_C0021203)
            │   └── 7 > INDEX | RANGE SCAN (SYS_C0021242)
            └── 8 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
----------------------------------------------------------------------------------------------------
SQL_ID [1sr0c55h3yzu7] with PLAN_HASH

                    │   └── 26 > INDEX | UNIQUE SCAN (SYS_C0021203)
                    └── 27 > TABLE ACCESS | BY INDEX ROWID (ITEM)
----------------------------------------------------------------------------------------------------
SQL_ID [ax5azugc569qy] with PLAN_HASH_VALUE [2866625911]

0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > WINDOW
                └── 5 > SORT
                    └── 6 > COUNT
                        └── 7 > HASH JOIN
                            ├── 8 > TABLE ACCESS | FULL (DATE_DIM)
                            └── 9 > NESTED LOOPS
                                ├── 10 > NESTED LOOPS
                                │   ├── 11 > TABLE ACCESS | FULL (ITEM)
                                │   └── 12 > INDEX | RANGE SCAN (WS_ITEM_SK_INDEX)
                                └── 13 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
----------------------------------------------------------------------------------------------

                │   │   │       └── 36 > TABLE ACCESS | FULL (HOUSEHOLD_DEMOGRAPHICS)
                │   │   └── 37 > INDEX | UNIQUE SCAN (SYS_C0021181)
                │   └── 38 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_ADDRESS)
                └── 39 > TABLE ACCESS | FULL (CUSTOMER_ADDRESS)
----------------------------------------------------------------------------------------------------
SQL_ID [0svydvqgstk96] with PLAN_HASH_VALUE [973204448]

0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > COUNT
                └── 5 > FILTER
                    ├── 6 > HASH JOIN
                    │   ├── 7 > NESTED LOOPS
                    │   │   ├── 8 > NESTED LOOPS
                    │   │   │   ├── 9 > STATISTICS COLLECTOR
                    │   │   │   │   └── 10 > HASH JOIN
                    │   │   │   │       ├── 11 > NESTED LOOPS
                    │   │   │   │       │   ├── 12 > STATISTICS COLLECTOR
                    │   │   │   

    │               │   │   │   │   │       └── 53 > INDEX | UNIQUE SCAN (SYS_C0021183)
    │               │   │   │   │   └── 54 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
    │               │   │   │   │       └── 55 > INDEX | UNIQUE SCAN (SYS_C0021186)
    │               │   │   │   └── 56 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
    │               │   │   │       └── 57 > INDEX | UNIQUE SCAN (SYS_C0021186)
    │               │   │   └── 58 > INDEX | UNIQUE SCAN (SYS_C0021200)
    │               │   └── 59 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_ADDRESS)
    │               │       └── 60 > INDEX | UNIQUE SCAN (SYS_C0021181)
    │               └── 61 > INDEX | UNIQUE SCAN (SYS_C0021218)
    └── 62 > SORT
        └── 63 > COUNT
            └── 64 > HASH JOIN
                ├── 65 > VIEW
                │   └── 66 > TABLE ACCESS | FULL (SYS_TEMP_0FDA00116_141942F5)
                └── 67 > VIEW
                    └── 68 > TABLE ACCESS | FULL (SYS_TEMP_0FDA00116_141942F5)


SQL_ID [8g6t2f58hwy9y] with PLAN_HASH_VALUE [3096856239]

0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > COUNT
                └── 5 > HASH JOIN
                    ├── 6 > HASH JOIN
                    │   ├── 7 > VIEW
                    │   │   └── 8 > HASH
                    │   │       └── 9 > COUNT
                    │   │           └── 10 > HASH JOIN
                    │   │               ├── 11 > VIEW (VW_NSO_2)
                    │   │               │   └── 12 > TABLE ACCESS | FULL (DATE_DIM)
                    │   │               │       └── 13 > TABLE ACCESS | BY INDEX ROWID BATCHED (DATE_DIM)
                    │   │               │           └── 14 > INDEX | RANGE SCAN (SYS_C0021186)
                    │   │               └── 15 > HASH JOIN
                    │   │                   ├── 16 > NESTED LOOPS
                    │   │                   │   ├── 17 > NESTED LOOPS
                    │   │                   │  

                            │           └── 31 > COUNT
                            │               └── 32 > NESTED LOOPS
                            │                   ├── 33 > NESTED LOOPS
                            │                   │   ├── 34 > NESTED LOOPS
                            │                   │   │   ├── 35 > NESTED LOOPS
                            │                   │   │   │   ├── 36 > HASH JOIN
                            │                   │   │   │   │   ├── 37 > TABLE ACCESS | BY INDEX ROWID BATCHED (ITEM)
                            │                   │   │   │   │   │   └── 38 > INDEX | RANGE SCAN (SYS_C0021203)
                            │                   │   │   │   │   └── 39 > TABLE ACCESS | BY INDEX ROWID BATCHED (ITEM)
                            │                   │   │   │   │       └── 40 > INDEX | RANGE SCAN (SYS_C0021203)
                            │                   │   │   │   └── 41 > TABLE ACCESS | BY INDEX ROWID BATCHED (CATALOG_SALE

└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > COUNT
                └── 5 > FILTER
                    ├── 6 > NESTED LOOPS
                    │   ├── 7 > NESTED LOOPS
                    │   │   ├── 8 > HASH JOIN
                    │   │   │   ├── 9 > TABLE ACCESS | FULL (WEB_SITE)
                    │   │   │   └── 10 > NESTED LOOPS
                    │   │   │       ├── 11 > NESTED LOOPS
                    │   │   │       │   ├── 12 > TABLE ACCESS | BY INDEX ROWID BATCHED (CUSTOMER_ADDRESS)
                    │   │   │       │   │   └── 13 > INDEX | RANGE SCAN (SYS_C0021181)
                    │   │   │       │   └── 14 > INDEX | RANGE SCAN (WS_SHIP_ADDR_SK_INDEX)
                    │   │   │       └── 15 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
                    │   │   └── 16 > INDEX | UNIQUE SCAN (SYS_C0021186)
                    │   └── 17 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
                    ├── 18 > COUNT
                    │  

                        │   │   │   │       │       │   ├── 16 > TABLE ACCESS | BY INDEX ROWID BATCHED (ITEM)
                        │   │   │   │       │       │   │   └── 17 > INDEX | RANGE SCAN (SYS_C0021203)
                        │   │   │   │       │       │   └── 18 > INDEX | RANGE SCAN (SYS_C0021248)
                        │   │   │   │       │       └── 19 > TABLE ACCESS | BY INDEX ROWID (STORE_SALES)
                        │   │   │   │       └── 20 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER)
                        │   │   │   │           └── 21 > INDEX | UNIQUE SCAN (SYS_C0021212)
                        │   │   │   └── 22 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_ADDRESS)
                        │   │   │       └── 23 > INDEX | UNIQUE SCAN (SYS_C0021181)
                        │   │   └── 24 > TABLE ACCESS | FULL (CUSTOMER_ADDRESS)
                        │   └── 25 > INDEX | UNIQUE SCAN (SYS_C0021206)
                        └── 26 > TABLE ACCESS | BY INDEX ROWID (STO

                    │   │                   │   │   │           └── 27 > INDEX | RANGE SCAN (SYS_C0021245)
                    │   │                   │   │   └── 28 > INDEX | UNIQUE SCAN (SYS_C0021186)
                    │   │                   │   └── 29 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
                    │   │                   └── 30 > TABLE ACCESS | FULL (DATE_DIM)
                    │   └── 31 > VIEW
                    │       └── 32 > HASH
                    │           └── 33 > COUNT
                    │               └── 34 > HASH JOIN
                    │                   ├── 35 > VIEW (VW_NSO_3)
                    │                   │   └── 36 > TABLE ACCESS | FULL (DATE_DIM)
                    │                   │       └── 37 > TABLE ACCESS | BY INDEX ROWID BATCHED (DATE_DIM)
                    │                   │           └── 38 > INDEX | RANGE SCAN (SYS_C0021186)
                    │                   └── 39 > HASH JOIN
                    │   

                            │                   │   │   │   └── 41 > TABLE ACCESS | BY INDEX ROWID BATCHED (CATALOG_SALES)
                            │                   │   │   │       └── 42 > INDEX | RANGE SCAN (CS_ITEM_SK_INDEX)
                            │                   │   │   └── 43 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
                            │                   │   │       └── 44 > INDEX | UNIQUE SCAN (SYS_C0021186)
                            │                   │   └── 45 > INDEX | UNIQUE SCAN (SYS_C0021181)
                            │                   └── 46 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_ADDRESS)
                            └── 47 > COUNT
                                └── 48 > VIEW
                                    └── 49 > SORT
                                        └── 50 > COUNT
                                            └── 51 > NESTED LOOPS
                                                ├── 52 > NESTED LOOPS
                        

            └── 4 > COUNT
                └── 5 > FILTER
                    ├── 6 > HASH JOIN
                    │   ├── 7 > NESTED LOOPS
                    │   │   ├── 8 > NESTED LOOPS
                    │   │   │   ├── 9 > STATISTICS COLLECTOR
                    │   │   │   │   └── 10 > HASH JOIN
                    │   │   │   │       ├── 11 > NESTED LOOPS
                    │   │   │   │       │   ├── 12 > STATISTICS COLLECTOR
                    │   │   │   │       │   │   └── 13 > NESTED LOOPS
                    │   │   │   │       │   │       ├── 14 > TABLE ACCESS | BY INDEX ROWID BATCHED (CUSTOMER_ADDRESS)
                    │   │   │   │       │   │       │   └── 15 > INDEX | RANGE SCAN (SYS_C0021181)
                    │   │   │   │       │   │       └── 16 > TABLE ACCESS | BY INDEX ROWID BATCHED (CATALOG_SALES)
                    │   │   │   │       │   │           └── 17 > INDEX | RANGE SCAN (CS_SHIP_ADDR_SK_INDEX)
                    │   │   │   │       │   └── 1

    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > VIEW
                └── 5 > WINDOW
                    └── 6 > SORT
                        └── 7 > COUNT
                            └── 8 > HASH JOIN
                                ├── 9 > TABLE ACCESS | FULL (DATE_DIM)
                                └── 10 > NESTED LOOPS
                                    ├── 11 > NESTED LOOPS
                                    │   ├── 12 > TABLE ACCESS | FULL (ITEM)
                                    │   └── 13 > TABLE ACCESS | BY INDEX ROWID BATCHED (STORE_SALES)
                                    │       └── 14 > INDEX | RANGE SCAN (SYS_C0021248)
                                    └── 15 > INDEX | UNIQUE SCAN (SYS_C0021206)
----------------------------------------------------------------------------------------------------
SQL_ID [amhu2ajd76u1t] with PLAN_HASH_VALUE [1541751973]

0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > COUNT
    

                    │   └── 21 > NESTED LOOPS
                    │       ├── 22 > NESTED LOOPS
                    │       │   ├── 23 > TABLE ACCESS | BY INDEX ROWID BATCHED (WEB_SALES)
                    │       │   │   └── 24 > INDEX | RANGE SCAN (WS_BILL_CUSTOMER_SK_INDEX)
                    │       │   └── 25 > INDEX | UNIQUE SCAN (SYS_C0021186)
                    │       └── 26 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
                    └── 27 > COUNT
                        └── 28 > NESTED LOOPS
                            ├── 29 > NESTED LOOPS
                            │   ├── 30 > TABLE ACCESS | BY INDEX ROWID BATCHED (CATALOG_SALES)
                            │   │   └── 31 > INDEX | RANGE SCAN (CS_SHIP_CUSTOMER_SK_INDEX)
                            │   └── 32 > INDEX | UNIQUE SCAN (SYS_C0021186)
                            └── 33 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
-----------------------------------------------------------------------------------------------

                    ├── 6 > NESTED LOOPS
                    │   ├── 7 > NESTED LOOPS
                    │   │   ├── 8 > TABLE ACCESS | FULL (DATE_DIM)
                    │   │   └── 9 > TABLE ACCESS | FULL (ITEM)
                    │   └── 10 > TABLE ACCESS | BY INDEX ROWID BATCHED (INVENTORY)
                    │       └── 11 > INDEX | RANGE SCAN (SYS_C0021233)
                    └── 12 > INDEX | RANGE SCAN (CS_ITEM_SK_INDEX)
----------------------------------------------------------------------------------------------------
SQL_ID [anxs5k6rhr8gg] with PLAN_HASH_VALUE [4164433995]

0 > SELECT STATEMENT
└── 1 > RESULT CACHE (3ay9qq6vkjgc819ncj062zc6z0)
    └── 2 > SORT
        └── 3 > TABLE ACCESS | SAMPLE (CUSTOMER_ADDRESS)
----------------------------------------------------------------------------------------------------
SQL_ID [7uv1w7n13jr30] with PLAN_HASH_VALUE [1255395737]

0 > SELECT STATEMENT
└── 1 > SORT
    └── 2 > INDEX | FULL SCAN (SYS_C0021186)
---------------------

                                                │   └── 69 > INDEX | UNIQUE SCAN (SYS_C0021181)
                                                └── 70 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_ADDRESS)
----------------------------------------------------------------------------------------------------
SQL_ID [6drjmvgnu749k] with PLAN_HASH_VALUE [2706596755]

0 > SELECT STATEMENT
└── 1 > TEMP TABLE TRANSFORMATION
    ├── 2 > LOAD AS SELECT
    │   └── 3 > HASH
    │       └── 4 > COUNT
    │           └── 5 > NESTED LOOPS
    │               ├── 6 > NESTED LOOPS
    │               │   ├── 7 > TABLE ACCESS | BY INDEX ROWID BATCHED (STORE_SALES)
    │               │   │   └── 8 > INDEX | RANGE SCAN (SS_SOLD_DATE_SK_INDEX)
    │               │   └── 9 > INDEX | UNIQUE SCAN (SYS_C0021186)
    │               └── 10 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
    └── 11 > COUNT
        └── 12 > VIEW
            └── 13 > SORT
                └── 14 > COUNT
                    └── 15 > HASH 

                        │   │   │       │   │       │   │       └── 35 > TABLE ACCESS | FULL (REASON)
                        │   │   │       │   │       │   └── 36 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_DEMOGRAPHICS)
                        │   │   │       │   │       │       └── 37 > INDEX | UNIQUE SCAN (SYS_C0021183)
                        │   │   │       │   │       └── 38 > TABLE ACCESS | FULL (CUSTOMER_DEMOGRAPHICS)
                        │   │   │       │   └── 39 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_DEMOGRAPHICS)
                        │   │   │       │       └── 40 > INDEX | UNIQUE SCAN (SYS_C0021183)
                        │   │   │       └── 41 > TABLE ACCESS | FULL (CUSTOMER_DEMOGRAPHICS)
                        │   │   └── 42 > INDEX | UNIQUE SCAN (SYS_C0021181)
                        │   └── 43 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER_ADDRESS)
                        └── 44 > TABLE ACCESS | FULL (CUSTOMER_ADDRESS)
--------------------------------------------

                        │   │   │       │   │       │   │       ├── 17 > NESTED LOOPS
                        │   │   │       │   │       │   │       │   ├── 18 > STATISTICS COLLECTOR
                        │   │   │       │   │       │   │       │   │   └── 19 > HASH JOIN
                        │   │   │       │   │       │   │       │   │       ├── 20 > TABLE ACCESS | FULL (REASON)
                        │   │   │       │   │       │   │       │   │       └── 21 > NESTED LOOPS
                        │   │   │       │   │       │   │       │   │           ├── 22 > NESTED LOOPS
                        │   │   │       │   │       │   │       │   │           │   ├── 23 > TABLE ACCESS | BY INDEX ROWID BATCHED (WEB_SALES)
                        │   │   │       │   │       │   │       │   │           │   │   └── 24 > INDEX | RANGE SCAN (WS_SOLD_DATE_SK_INDEX)
                        │   │   │       │   │       │   │       │   │           │   └── 25 > INDEX | UNIQUE SCAN (SYS_C0021239)


                └── 5 > VIEW (VW_FOJ_0)
                    └── 6 > HASH JOIN
                        ├── 7 > VIEW
                        │   └── 8 > HASH
                        │       └── 9 > COUNT
                        │           └── 10 > HASH JOIN
                        │               ├── 11 > TABLE ACCESS | BY INDEX ROWID BATCHED (DATE_DIM)
                        │               │   └── 12 > INDEX | RANGE SCAN (SYS_C0021186)
                        │               └── 13 > TABLE ACCESS | BY INDEX ROWID BATCHED (STORE_SALES)
                        │                   └── 14 > INDEX | RANGE SCAN (SS_SOLD_DATE_SK_INDEX)
                        └── 15 > VIEW
                            └── 16 > HASH
                                └── 17 > COUNT
                                    └── 18 > HASH JOIN
                                        ├── 19 > TABLE ACCESS | BY INDEX ROWID BATCHED (DATE_DIM)
                                        │   └── 20 > INDEX | RANGE SCAN (SYS_C002

                            └── 35 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
----------------------------------------------------------------------------------------------------
SQL_ID [dw5165u4p61np] with PLAN_HASH_VALUE [2446015885]

0 > SELECT STATEMENT
└── 1 > COUNT
    └── 2 > VIEW
        └── 3 > SORT
            └── 4 > COUNT
                └── 5 > VIEW
                    └── 6 > WINDOW
                        └── 7 > VIEW (VW_FOJ_0)
                            └── 8 > HASH JOIN
                                ├── 9 > VIEW
                                │   └── 10 > WINDOW
                                │       └── 11 > SORT
                                │           └── 12 > COUNT
                                │               └── 13 > NESTED LOOPS
                                │                   ├── 14 > NESTED LOOPS
                                │                   │   ├── 15 > TABLE ACCESS | BY INDEX ROWID (WEB_SALES)
                                │            

                        │       ├── 20 > NESTED LOOPS
                        │       │   ├── 21 > NESTED LOOPS
                        │       │   │   ├── 22 > TABLE ACCESS | BY INDEX ROWID BATCHED (CUSTOMER_ADDRESS)
                        │       │   │   │   └── 23 > INDEX | RANGE SCAN (SYS_C0021181)
                        │       │   │   └── 24 > INDEX | RANGE SCAN (C_CURRENT_ADDR_SK_INDEX)
                        │       │   └── 25 > TABLE ACCESS | BY INDEX ROWID (CUSTOMER)
                        │       └── 26 > VIEW
                        │           └── 27 > TABLE ACCESS | FULL (SYS_TEMP_0FDA002B3_141942F5)
                        └── 28 > VIEW (VW_SQ_1)
                            └── 29 > HASH
                                └── 30 > VIEW
                                    └── 31 > JOIN FILTER | USE (:BF0000)
                                        └── 32 > TABLE ACCESS | FULL (SYS_TEMP_0FDA002B3_141942F5)
------------------------------------------------------------------

                └── 5 > FILTER
                    ├── 6 > NESTED LOOPS
                    │   ├── 7 > NESTED LOOPS
                    │   │   ├── 8 > NESTED LOOPS
                    │   │   │   ├── 9 > TABLE ACCESS | BY INDEX ROWID BATCHED (CUSTOMER)
                    │   │   │   │   └── 10 > INDEX | RANGE SCAN (SYS_C0021212)
                    │   │   │   │       └── 11 > COUNT
                    │   │   │   │           └── 12 > NESTED LOOPS
                    │   │   │   │               ├── 13 > NESTED LOOPS
                    │   │   │   │               │   ├── 14 > TABLE ACCESS | BY INDEX ROWID BATCHED (STORE_SALES)
                    │   │   │   │               │   │   └── 15 > INDEX | RANGE SCAN (SS_CUSTOMER_SK_INDEX)
                    │   │   │   │               │   └── 16 > INDEX | UNIQUE SCAN (SYS_C0021186)
                    │   │   │   │               └── 17 > TABLE ACCESS | BY INDEX ROWID (DATE_DIM)
                    │   │   │   └── 18 > TABLE ACCESS | BY IN

│       └── 42 > INDEX | RANGE SCAN (SS_TICKET_NUMBER_INDEX)
├── 43 > SORT
│   └── 44 > TABLE ACCESS | BY INDEX ROWID BATCHED (STORE_SALES)
│       └── 45 > INDEX | RANGE SCAN (SS_TICKET_NUMBER_INDEX)
└── 46 > COUNT
    └── 47 > INDEX | UNIQUE SCAN (SYS_C0021198)
----------------------------------------------------------------------------------------------------
SQL_ID [a9jm365hccskt] with PLAN_HASH_VALUE [1703108607]

0 > SELECT STATEMENT
└── 1 > TEMP TABLE TRANSFORMATION
    ├── 2 > LOAD AS SELECT
    │   └── 3 > HASH
    │       └── 4 > COUNT
    │           └── 5 > HASH JOIN
    │               ├── 6 > TABLE ACCESS | BY INDEX ROWID BATCHED (DATE_DIM)
    │               │   └── 7 > INDEX | RANGE SCAN (SYS_C0021186)
    │               └── 8 > HASH JOIN
    │                   ├── 9 > TABLE ACCESS | BY INDEX ROWID BATCHED (STORE_SALES)
    │                   │   └── 10 > INDEX | RANGE SCAN (SS_SOLD_DATE_SK_INDEX)
    │                   └── 11 > TABLE ACCESS | FULL (CUSTOMER_ADDRES

SQL_ID [5vfj1jt9rga36] with PLAN_HASH_VALUE [3119983122]

0 > SELECT STATEMENT
└── 1 > TEMP TABLE TRANSFORMATION
    ├── 2 > LOAD AS SELECT
    │   └── 3 > UNION-ALL
    │       ├── 4 > HASH
    │       │   └── 5 > COUNT
    │       │       └── 6 > HASH JOIN
    │       │           ├── 7 > TABLE ACCESS | FULL (DATE_DIM)
    │       │           └── 8 > NESTED LOOPS
    │       │               ├── 9 > NESTED LOOPS
    │       │               │   ├── 10 > TABLE ACCESS | BY INDEX ROWID BATCHED (CUSTOMER)
    │       │               │   │   └── 11 > INDEX | RANGE SCAN (SYS_C0021212)
    │       │               │   └── 12 > INDEX | RANGE SCAN (SS_CUSTOMER_SK_INDEX)
    │       │               └── 13 > TABLE ACCESS | BY INDEX ROWID (STORE_SALES)
    │       └── 14 > HASH
    │           └── 15 > COUNT
    │               └── 16 > HASH JOIN
    │                   ├── 17 > NESTED LOOPS
    │                   │   ├── 18 > NESTED LOOPS
    │                   │   │   ├── 19 > TABLE ACCESS | BY 

### Captured Outlier Plans

This section contains metrics pertaining to outlier plans. There are three categories of captured outliers denoted below, each assigned a total of 14 queries

* Hint Enhanced Queries
* Predicate Enhanced Queries
* Rownum Stopkey Enhanced Queries

In [12]:
#
# Retrieve Unique set of PLAN_HASH_VALUES
np_outlier_plan_id, np_outlier_plan_instance = pd.unique(df_outliers['PLAN_ID']), pd.unique(df_outliers['PLAN_INSTANCE'])
print(np_outlier_plan_id)
print(type(np_outlier_plan_id))
print(np_outlier_plan_instance)
print(type(np_outlier_plan_instance))
print('-'*100)
#
# Iterate over each PLAN_HASH_VALUE, and retrieve PLAN subset                                                                                                                 
for plan_instance in np_outlier_plan_instance:
    #
    # Retrieve only a single instance of the plan (as annotated at beginning of experiment)
    df_temp_plan = df_outliers[df_outliers['PLAN_INSTANCE'] == plan_instance]
    #
    # This step ensures that only TPC-DS related queries are displayed
    tpc_check = df_temp_plan['OBJECT_OWNER'].tolist()
    if tpcds not in tpc_check:
        continue
    #
    # Discards plans with double entries - Due to the parallel nature of the throughput test for 
    # TPC-DS, multiple threads may execute the same query at the same time, resulting in sql access
    # plans with the same SQL_ID, same PLAN_HASH_VALUE, and same TIMESTAMP. Such occurances are skipped.
    df_temp_count = df_temp_plan[df_temp_plan['ID'] == 0]
    if df_temp_count.shape[0] != 1:
        continue
    #
    # Sorts by ID ascending - This clause may be redundant due to the natural order of the data capture tool
    df_temp_plan = df_temp_plan.sort_values(by='ID', ascending=True)
    #
    # Builds Tree
    tree = PlanTreeModeller.build_tree(df=df_temp_plan)
    #
    # Renders Tree
    print('PLAN_ID [' + str(df_temp_plan['PLAN_ID'].iloc[0]) + ']\n')
    PlanTreeModeller.render_tree(tree=tree[0], df=df_temp_plan) # Tree rendederer uses root node and traverses downwards
    print('-'*100)

[12447 12448 12449 12450 12451 12452 12453 12454 12455     0 12457 12458
 12459 12460 12461 12462 12463 12464 12465 12466 12467 12468 12469 12470
 12471 12472 12473 12474 12475 12476 12477 12478 12479 12480 12481 12482
 12483 12484 12485 12486 12487 12488]
<class 'numpy.ndarray'>
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42]
<class 'numpy.ndarray'>
----------------------------------------------------------------------------------------------------
PLAN_ID [12447]



KeyError: 'IO_COST'

### Access Plan / Tree Comparison (SAME PLAN COMPARISON)

This section tests / evaluates different plans being compared to one another. Two separate tests are carried out, as follows:

* Comparing the exact same outlier plans with each other. This test verifies that no unneccessary flagging is carried out by the implementation.
* Comparing the inlier plans with the respective TPC-DS outlier plan. This test ensures that access plans are appropriately flagged where inconsistencies are encountered.

In [None]:
# Comparing same exact plans                                                                                                                
for plan_instance in np_outlier_plan_instance:
    #
    # Retrieve only a single instance of the plan (as annotated at beginning of experiment)
    df_temp_plan = df_outliers[df_outliers['PLAN_INSTANCE'] == plan_instance]
    #
    # This step ensures that only TPC-DS related queries are displayed
    tpc_check = df_temp_plan['OBJECT_OWNER'].tolist()
    if tpcds not in tpc_check:
        continue
    #
    # Discards plans with double entries - Due to the parallel nature of the throughput test for 
    # TPC-DS, multiple threads may execute the same query at the same time, resulting in sql access
    # plans with the same SQL_ID, same PLAN_HASH_VALUE, and same TIMESTAMP. Such occurances are skipped.
    df_temp_count = df_temp_plan[df_temp_plan['ID'] == 0]
    if df_temp_count.shape[0] != 1:
        continue
    #
    # Sorts by ID ascending - This clause may be redundant due to the natural order of the data capture tool
    df_temp_plan = df_temp_plan.sort_values(by='ID', ascending=True)
    #
    # Builds Tree
    tree = PlanTreeModeller.build_tree(df=df_temp_plan)
    #
    # Renders Trees
    print('Tree 1 with PLAN_ID [' + str(df_temp_plan['PLAN_ID'].iloc[0]) + ']\n')
    #PlanTreeModeller.render_tree(tree=tree[0], df=df_temp_plan) # Tree rendederer uses root node and traverses downwards
    print('\nTree 2 with PLAN_ID [' + str(df_temp_plan['PLAN_ID'].iloc[0]) + ']\n')
    #PlanTreeModeller.render_tree(tree=tree[0], df=df_temp_plan) # Tree rendederer uses root node and traverses downwards
    #
    # Compares both plans
    print('\n')
    PlanTreeModeller.tree_compare(tree1=tree, 
                                  tree2=tree, 
                                  df1=df_temp_plan, 
                                  df2=df_temp_plan)
    print('-'*100)

### Access Plan / Tree Comparison (DIFFERENT PLAN COMPARISON)

This section tests / evaluates different plans being compared to one another. Two separate tests are carried out, as follows:

* Comparing the exact same outlier plans with each other. This test verifies that no unneccessary flagging is carried out by the implementation.
* Comparing the inlier plans with the respective TPC-DS outlier plan. This test ensures that access plans are appropriately flagged where inconsistencies are encountered.

In [None]:
outlier_category_quantity = int(len(np_outlier_plan_instance) / 3)
for i in range(outlier_category_quantity):
    #
    # Isolate type 1 outliers
    df_temp_plan1 = df_outliers[df_outliers['PLAN_ID'] == np_outlier_plan_id[i]]
    #
    # Sorts by ID ascending for type 1 outliers - This clause may be redundant due to the natural order of 
    # the data capture tool
    df_temp_plan1 = df_temp_plan1.sort_values(by='ID', ascending=True)
    #
    # Builds Tree 1
    tree1 = PlanTreeModeller.build_tree(df=df_temp_plan1)
    #
    # Isolate type 2 outliers
    comparison_index = int(i + outlier_category_quantity)
    df_temp_plan2 = df_outliers[df_outliers['PLAN_ID'] == (np_outlier_plan_id[comparison_index])]
    #
    # Sorts by ID ascending for type 2 outliers - This clause may be redundant due to the natural order of 
    # the data capture tool
    df_temp_plan2 = df_temp_plan2.sort_values(by='ID', ascending=True)
    #
    # Builds Tree 2
    tree2 = PlanTreeModeller.build_tree(df=df_temp_plan2)
    #
    # Renders Trees
    print('Tree 1 with PLAN_ID [' + str(df_temp_plan1['PLAN_ID'].iloc[0]) + ']\n')
    #PlanTreeModeller.render_tree(tree=tree1[0], df=df_temp_plan1) # Tree rendederer uses root node and traverses downwards
    print('\nTree 2 with PLAN_ID [' + str(df_temp_plan2['PLAN_ID'].iloc[0]) + ']\n')
    #PlanTreeModeller.render_tree(tree=tree2[0], df=df_temp_plan2) # Tree rendederer uses root node and traverses downwards
    #
    # Compares both plans
    print('\n')
    PlanTreeModeller.tree_compare(tree1=tree1, 
                                  tree2=tree2, 
                                  df1=df_temp_plan1, 
                                  df2=df_temp_plan2)
    print('-'*100)
    print('\n\n\n')