# Schedule TPC-DS 10 Descriptor

This notebook contains work pertaining to pattern learning and identification for a database workload schedule. It contains the captured trace generator logs, which were produced while the database workload was running. The logs span the workload trace from beginning to end, approximately 336 hours in total.

The TPC-DS (Transaction Processing Performance Council - Decision Support) benchmark is a decision support workload benchmark that models several aspects of a decision support system. It is composed of a typical decision support RDBMS schema, together with a number of queries and data maintenance procedures which interact with that schema. In particular, TPC-DS is representative of a System Under Test’s (SUT) performance as a general purpose decision support system. It models the decision support capabilities of a retail product supplier, wherein the schema houses vital product and business information, including but not limited to product, customer and order information. Furthermore, TPC-DS models the two most important components of a decision support system: user queries (conversion of operational facts into business intelligence), and data maintenance RDBMS activities (synchronization of the process management analysis with the underlying data source upon which it relies).

In general, the benchmark’s components are applicable to a broad range of implementation methods and system topologies, which allows for a technically comparable, vendor neutral approach. Similar to prior benchmarks offered by TPC, the TPC-DS workload is widely used by vendors to demonstrate and model the complex decision support logic exhibited by such workloads. Its controlled and repeatable nature allows a fair comparison between different vendors. TPC benchmarks are particularly useful when purchasing servers and software, when planning system design and architecture, and in research domains which require resource intensive and realistic workloads.

TPC-DS is built to test the upper bounds of hardware system performance, with a focus on CPU usage, I/O resource usage and memory utilization. The benchmark is also aimed at testing the upper bounds of the operating system and database software when performing the various complex tasks it defines. As with any decision support benchmark, TPC-DS exercises large volumes of underlying data and the generation of optimal access plans for complex query structures, amongst other things. The decision support workload modelled by the benchmark has the following characteristics:
* Performs a large volume of data lookups.
* Provides answers to real-world business scenarios.
* Executes various operational and complex requirements in the form of queries, akin to reporting, iterative OLAP and data mining.
* Is characterized by CPU intensive and I/O bound tasks.
* Periodically updates the underlying workload database through a suite of maintenance tasks provided by the benchmark.
* Integrates with ‘Big Data’ solutions, including RDBMS technologies, as well as Hadoop/Spark based systems.


<div style="width:image width px; font-size:80%; text-align:center;"><img src='Images/TPCDS Setup.jpg' alt="alternate text" width="width" height="height" style="padding-bottom:0.5em;" /><b>TPC-DS Architecture</b></div>

The TPC-DS workload decomposes workload activity into a number of categories, which together constitute the backbone of execution on the database. These categories are denoted below [82]:

* Database Load Test - The benchmark’s specification denotes this as the building of the database (inserting data from the dsdgen generated flat files). For the purposes of this experiment, this phase was replaced entirely with optimizer statistics gathering, performed schema wide on all TPC-DS objects.
* Power Test - The power test executes queries serially, submitted by an application driver through a single query stream.
* Throughput Test - The throughput test opens a total of twenty parallel query streams. Each stream corresponds to a single Power Test, with its own unique permutation as dictated by the benchmark.
* Data Maintenance Test - Serial execution of TPC-DS data maintenance tasks.
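
The difference between the single-stream Power Test and the multi-stream Throughput Test can be sketched as follows. This is a minimal illustration only: the benchmark prescribes its own per-stream permutations, so the seeded shuffle below is a stand-in, and the names `build_streams`, `NUM_QUERIES` and `NUM_STREAMS` are hypothetical. The figure of 99 queries is the size of the TPC-DS query set; the stream count is the one stated above.

```python
import random

NUM_QUERIES = 99   # size of the TPC-DS query set
NUM_STREAMS = 20   # parallel streams used by the Throughput Test in this experiment


def build_streams(num_streams=NUM_STREAMS, num_queries=NUM_QUERIES, seed=0):
    """Return one query ordering per stream; each is a permutation of 1..num_queries.

    Assumption: a seeded shuffle stands in for the permutations that the
    benchmark itself dictates for each stream.
    """
    streams = []
    for stream_id in range(num_streams):
        rng = random.Random(seed + stream_id)  # reproducible shuffle per stream
        order = list(range(1, num_queries + 1))
        rng.shuffle(order)
        streams.append(order)
    return streams


# A Power Test corresponds to one such stream run serially;
# a Throughput Test runs all of them concurrently.
streams = build_streams()
power_test_stream = streams[0]
```

Every stream contains the same queries, so the streams differ only in execution order.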

Each of the TPC-DS workloads (TPC-DS 1, TPC-DS 10, TPC-DS 100) follows the same pipelined order of execution, shown in the figure below:


<div style="width:image width px; font-size:80%; text-align:center;"><img src='Images/Trace Workflow.jpg' alt="alternate text" width="width" height="height" style="padding-bottom:0.5em;" /><b>Trace Workflow</b></div>
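
As a minimal sketch, the pipelined order can be written down as a list and a sequence of task names checked against it. The phase order used here is the one that appears in the parsed scheduler log printed later in this notebook; the names `CYCLE` and `follows_cycle` are hypothetical and not part of the original code.

```python
from itertools import cycle

# Phase order as it appears in the parsed scheduler log of this notebook.
CYCLE = [
    "GATHER_STATS",
    "POWER_TEST",
    "THROUGHPUT_TEST_1",
    "DATA_MAINTENANCE_1",
    "THROUGHPUT_TEST_2",
    "DATA_MAINTENANCE_2",
]


def follows_cycle(tasks, order=CYCLE):
    """Return True if `tasks` repeats `order` from its first phase.

    A partial final cycle is allowed, since a trace may be cut off mid-cycle.
    """
    return all(task == expected for task, expected in zip(tasks, cycle(order)))
```

For example, `follows_cycle(CYCLE + CYCLE[:2])` is `True`, while a sequence that starts with `POWER_TEST` is rejected.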

### Module Installation and Importing Libraries

In [5]:
# Module Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Configuration Cell

Adjust the parameters in this cell to influence the outcome of the experiment.

* tpcds - Schema on which the test operates.

In [6]:
tpcds='TPCDS10'

In [7]:
# Root path
#root_dir = 'C:/Users/gabriel.sammut/University/Data_ICS5200/Schedule/' + tpcds
root_dir = 'D:/Projects/Datagenerated_ICS5200/Schedule/' + tpcds

# Open Data
scheduler_log_path = root_dir + '/msg_log_tpcds10scheduler_20181130'
scheduler_log_file = open(scheduler_log_path,'r')
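
The parser defined in the next cell relies on fixed character positions. It takes the timestamp from the first 19 characters of a task line, the snapshot id from the digits among the last 6 characters of that line, and the duration from the digits after column 34 of the following line, stopping at the decimal point. The sketch below mirrors those rules in compact form. The two log lines are invented for illustration, since the real file's wording is not shown here, and the function names are hypothetical.

```python
# Hypothetical log lines; only the character positions matter for the rules below.
task_line = "2018-11-30 17:36:35 Starting GATHER_STATS with SNAP_ID 3249\n"
time_line = "2018-11-30 17:42:04 Elapsed time (s): 329.41\n"


def parse_timestamp(line):
    """Timestamp occupies the first 19 characters (YYYY-MM-DD HH:MM:SS)."""
    return line[:19]


def parse_snap_id(line):
    """Snapshot id: the digits found among the last 6 characters of the line."""
    return int("".join(ch for ch in line[-6:] if ch.isdigit()))


def parse_seconds(line):
    """Whole seconds: digits after column 34, up to the first decimal point."""
    digits = []
    for ch in line[35:]:
        if ch == ".":
            break
        if ch.isdigit():
            digits.append(ch)
    return int("".join(digits))


row = [parse_timestamp(task_line), parse_snap_id(task_line), parse_seconds(time_line)]
```

Because the rules are positional, a change in the log layout (for instance a snapshot id longer than five digits) would silently truncate values, which is worth keeping in mind when reusing the parser on other traces.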

In [8]:
class ParseLogs:
    """
    Parses Scheduler File
    """
    @staticmethod
    def parse_log_file(file):
        """
        Parses the scheduling log file and retrieves relevant info. Input is an open, readable file object.
        """
        task_list = ['THROUGHPUT_TEST_1',
                     'THROUGHPUT_TEST_2',
                     'DATA_MAINTENANCE_1',
                     'DATA_MAINTENANCE_2',
                     'GATHER_STATS',
                     'POWER_TEST']
        lines = []
        flag = False
        for line in file.readlines():
            
            line = str(line)
            line = line.replace(' ',' ')
            
            if 'Metrics successfully written to file' in line: # Skip this line
                continue
            
            if flag is True:
                seconds = ParseLogs.__parse_time(line)
                lines[len(lines)-1].append(seconds)
                flag = False
                continue
            
            for task in task_list:
                if task in line:
                    timestamp = ParseLogs.__parse_timestamp(line)
                    snap_id = ParseLogs.__parse_snap_id(line)
                    lines.append([timestamp,
                                  task,
                                  snap_id])
                    flag = True
                    break
        return lines
    
    @staticmethod
    def __parse_timestamp(data_line):
        """
        Parses timestamp from passed data_line
        """
        return data_line[0:19]
    
    @staticmethod
    def __parse_snap_id(data_line):
        """
        Parses log line and retrieves SNAP_ID
        """
        snap_id = ''
        for i in reversed(range(len(data_line))):
            if (len(data_line) - i) < 7:
                try:
                    snap_id += str(int(data_line[i]))
                except ValueError:
                    pass
            else:
                break
        snap_id = snap_id[::-1]
        return int(snap_id)
    
    @staticmethod
    def __parse_time(data_line):
        """
        Parses the time in whole seconds from the line that follows a task line (character offset determined by log file structure)
        """
        time_secs = ''
        for i in range(len(data_line)):
            if i > 34:
                try:
                    time_secs += str(int(data_line[i]))
                except ValueError:
                    # Non-digit character: stop at the decimal point, skip anything else
                    if data_line[i] == '.':
                        break
        return int(time_secs)

parsed_log_file = ParseLogs.parse_log_file(scheduler_log_file)
for line in parsed_log_file:
    print(line)
scheduler_log_file.close()

['2018-11-30 17:36:35', 'GATHER_STATS', 3249, 329]
['2018-11-30 18:08:02', 'POWER_TEST', 3279, 1887]
['2018-11-30 19:23:05', 'THROUGHPUT_TEST_1', 3313, 4502]
['2018-11-30 23:30:56', 'DATA_MAINTENANCE_1', 3546, 14871]
['2018-12-01 01:02:47', 'THROUGHPUT_TEST_2', 3592, 5510]
['2018-12-01 05:35:10', 'DATA_MAINTENANCE_2', 3853, 16342]
['2018-12-01 05:40:34', 'GATHER_STATS', 3857, 324]
['2018-12-01 06:39:02', 'POWER_TEST', 3913, 3508]
['2018-12-01 08:09:54', 'THROUGHPUT_TEST_1', 3959, 5451]
['2018-12-01 12:14:15', 'DATA_MAINTENANCE_1', 4193, 14661]
['2018-12-01 13:51:27', 'THROUGHPUT_TEST_2', 4235, 5831]
['2018-12-01 17:47:07', 'DATA_MAINTENANCE_2', 4463, 14140]
['2018-12-01 17:52:33', 'GATHER_STATS', 4468, 325]
['2018-12-01 18:28:31', 'POWER_TEST', 4503, 2158]
['2018-12-01 20:08:00', 'THROUGHPUT_TEST_1', 4538, 5968]
['2018-12-02 00:13:38', 'DATA_MAINTENANCE_1', 4769, 14738]
['2018-12-02 02:08:04', 'THROUGHPUT_TEST_2', 4818, 6865]
['2018-12-02 06:11:30', 'DATA_MAINTENANCE_2', 5048, 14606]
[