## Parallel workloads archive parser

The authors of this archive wrote an article describing the problems they faced -> http://www.cs.huji.ac.il/~feit/papers/PWA14JPDC.pdf

The authors' book on workload modeling -> http://www.cs.huji.ac.il/~feit/wlmod/wlmod.pdf

Parallel workloads archive link -> http://www.cs.huji.ac.il/labs/parallel/workload/index.html

Logs Format -> http://www.cs.huji.ac.il/labs/parallel/workload/swf.html

Logs -> http://www.cs.huji.ac.il/labs/parallel/workload/logs.html

Log parser in Perl -> http://www.cs.huji.ac.il/labs/parallel/workload/parse_swf.pl

## Standard workload format
- **Job Number** -- a counter field, starting from 1.

- **Submit Time** -- in seconds. The earliest time the log refers to is zero, and is usually the submittal time of the first job. The lines in the log are sorted by ascending submittal times. It makes sense for jobs to also be numbered in this order.

- **Wait Time** -- in seconds. The difference between the job's submit time and the time at which it actually began to run. Naturally, this is only relevant to real logs, not to models.

- **Run Time** -- in seconds. The wall clock time the job was running (end time minus start time). We decided to use "wait time" and "run time" instead of the equivalent "start time" and "end time" because they are directly attributable to the scheduler and application, and are more suitable for models where only the run time is relevant. <font color='red'> Note that when values are rounded to an integral number of seconds (as often happens in logs) a run time of 0 is possible and means the job ran for less than 0.5 seconds. On the other hand, it is permissible to use floating point values for time fields. </font>

- **Number of Allocated Processors** -- an integer. In most cases this is also the number of processors the job uses; if the job does not use all of them, we typically don't know about it.

- **Average CPU Time Used** -- both user and system, in seconds. This is the average over all processors of the CPU time used, and may therefore be smaller than the wall clock runtime. If a log contains the total CPU time used by all the processors, it is divided by the number of allocated processors to derive the average.

- **Used Memory** -- in kilobytes. This is again the average per processor.

- **Requested Number of Processors**.

- **Requested Time** -- This can be either runtime (measured in wallclock seconds), or average CPU time per processor (also in seconds) -- the exact meaning is determined by a header comment. In many logs this field is used for the user runtime estimate (or upper bound) used in backfilling. If a log contains a request for total CPU time, it is divided by the number of requested processors.

- **Requested Memory** -- again in kilobytes per processor.

- **Status** -- 1 if the job was completed, 0 if it failed, and 5 if cancelled. If information about checkpointing or swapping is included, other values are also possible. See the usage note -> (http://www.cs.huji.ac.il/labs/parallel/workload/swf.html). This field is meaningless for models, so would be -1.
    - The main usage of the status field is to note the job's status. This isn't as straightforward as it sounds. 
    - The simple case is jobs that complete normally, and have status 1.
    - The harder case is jobs that don't complete normally. This can happen for several reasons:
        - The job failed (e.g. segmentation fault). This is given status 0.
        - The job was cancelled by the user (like ^C in Unix). This is given status 5. Note that cancelled jobs may have positive runtimes and processors if cancelled after they started to run, or 0 or -1 if cancelled while waiting in the queue.
        - The job was killed by the system (e.g. because it exceeded its requested run time). This may be given different status values in different logs; it will typically be 0 or 5, but might also be 1.
        
        Note also that the distinction between failure / cancellation / killing is not necessarily accurate, as the distinction typically does not appear in the original logs.

        If a log contains information about checkpoints and swapping out of jobs, a job can have multiple lines in the log. In fact, we propose that the job information appear twice. First, there will be one line that summarizes the whole job: its submit time is the submit time of the job, its runtime is the sum of all partial runtimes, and its code is 0 or 1 according to the completion status of the whole job. In addition, there will be separate lines for each instance of partial execution between being swapped out. All these lines have the same job ID and appear consecutively in the log. Only the first has a submit time; the rest only have a wait time since the previous burst. The completion code for all these lines is 2, meaning "to be continued"; the completion code for the last such line is 3 or 4, corresponding to completion or being killed.

        It should be noted that such details are only useful for studying the behavior of the logged system, and are not a feature of the workload. Such studies should ignore lines with completion codes of 0 and 1, and only use lines with 2, 3, and 4. <font color="red">For workload studies, only the single-line summary of the job should be used, as identified by a code of 0 or 1. </font>

- **User ID** -- a natural number, between one and the number of different users.

- **Group ID** -- a natural number, between one and the number of different groups. Some systems control resource usage by groups rather than by individual users.

- **Executable (Application) Number** -- a natural number, between one and the number of different applications appearing in the workload. In some logs, this might represent a script file used to run jobs rather than the executable directly; this should be noted in a header comment.

- **Queue Number** -- a natural number, between one and the number of different queues in the system. The nature of the system's queues should be explained in a header comment. This field is where batch and interactive jobs should be differentiated: we suggest the convention of denoting interactive jobs by 0.

- **Partition Number** -- a natural number, between one and the number of different partitions in the system. The nature of the system's partitions should be explained in a header comment. For example, it is possible to use partition numbers to identify which machine in a cluster was used.

- **Preceding Job Number** -- this is the number of a previous job in the workload, such that the current job can only start after the termination of this preceding job. Together with the next field, this allows the workload to include feedback as described below.

- **Think Time from Preceding Job** -- this is the number of seconds that should elapse between the termination of the preceding job and the submittal of this one.
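
The field list above can be turned into a small parser. The following is a minimal sketch, not the archive's own tooling: the names `SWF_FIELDS`, `to_number`, `parse_swf_line`, and `is_workload_record` are hypothetical, and the filter assumes that any record whose status is not 2, 3, or 4 is a whole-job summary line (this keeps cancelled jobs with status 5 and model records with status -1).

```python
# Hypothetical Python identifiers for the 18 SWF fields, in the order listed above.
SWF_FIELDS = (
    "job_number", "submit_time", "wait_time", "run_time",
    "allocated_procs", "avg_cpu_time", "used_memory",
    "requested_procs", "requested_time", "requested_memory",
    "status", "user_id", "group_id", "executable",
    "queue", "partition", "preceding_job", "think_time",
)


def to_number(token):
    """Return an int when possible, else a float (time fields may be fractional)."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_swf_line(line):
    """Parse one whitespace-separated SWF data line into a dict keyed by SWF_FIELDS."""
    tokens = line.split()
    if len(tokens) != len(SWF_FIELDS):
        raise ValueError(f"expected {len(SWF_FIELDS)} fields, got {len(tokens)}")
    return dict(zip(SWF_FIELDS, map(to_number, tokens)))


def is_workload_record(job):
    """Assumption: status 2, 3, 4 marks partial-execution lines; everything else is a summary."""
    return job["status"] not in (2, 3, 4)


# Made-up example line: job 1, submitted at t=0, waited 10 s, ran 3600 s on 8 processors.
job = parse_swf_line("1 0 10 3600 8 3500.5 -1 8 7200 -1 1 3 1 -1 1 1 -1 -1")
print(job["run_time"], job["avg_cpu_time"], is_workload_record(job))
```

Unknown values are stored as -1 in the sample line, so code that aggregates over a field should check for -1 before using it.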


## Header Comments
- **Version**: Version number of the standard format the file uses. The format described here is version 2.
- **Computer**: Brand and model of computer
- **Installation**: Location of installation and machine name
- **Acknowledge**: Name of person(s) to acknowledge for creating/collecting the workload.
- **Information**: Web site or email that contains more information about the workload or installation.
- **Conversion**: Name and email of whoever converted the log to the standard format.
- **MaxJobs**: Integer, total number of jobs in this workload file.
- **MaxRecords**: Integer, total number of records in this workload file. If no checkpointing/swapping information is included, there is one record per job, and this is equal to MaxJobs. But with checkpointing/swapping there may be multiple records per job.
- **Preemption**: Enumerated, with four possible values. 'No' means that jobs run to completion, and are represented by a single line in the file. 'Yes' means that the execution of a job may be split into several parts, and each is represented by a separate line. 'Double' means that jobs may be split, and their information appears twice in the file: once as a one-line summary, and again as a sequence of lines representing the parts, as suggested above. 'TS' means time slicing is used, but no details are available.
- **UnixStartTime**: When the log starts, in Unix time (seconds since the epoch)
- **TimeZone**: DEPRECATED and replaced by TimeZoneString. A value to add to times given as seconds since the epoch. The sum can then be fed into gmtime (Greenwich time function) to get the correct date and hour of the day. The default is 0, and then gmtime can be used directly. Note: do not use localtime, as then the results will depend on the difference between your time zone and the installation time zone.
- **TimeZoneString**: Replaces the buggy and now deprecated TimeZone. TimeZoneString is a standard UNIX string indicating the time zone in which the log was generated; this is actually the name of a zoneinfo file, e.g. "Europe/Paris". All times within the SWF file are in this time zone. For more details see the usage note in the format description (http://www.cs.huji.ac.il/labs/parallel/workload/swf.html).
- **StartTime**: When the log starts, in human readable form, in this standard format: Tue Feb 21 18:44:15 IST 2006 (as printed by the UNIX 'date' utility).
- **EndTime**: When the log ends (the last termination), formatted like StartTime.
- **MaxNodes**: Integer, number of nodes in the computer. List the number of nodes in different partitions in parentheses if applicable.
- **MaxProcs**: Integer, number of processors in the computer. This is different from MaxNodes if each node is an SMP. List the number of processors in different partitions in parentheses if applicable.
- **MaxRuntime**: Integer, in seconds. This is the maximum that the system allowed, and may be larger than any specific job's runtime in the workload.
- **MaxMemory**: Integer, in kilobytes. Again, this is the maximum the system allowed.
- **AllowOveruse**: Boolean. 'Yes' if a job may use more than it requested for any resource, 'No' if it can't.
- **MaxQueues**: Integer, number of queues used.
- **Queues**: A verbal description of the system's queues. Should explain the queue number field (if it has known values). As a minimum it should be explained how to tell between a batch and interactive job.
- **Queue**: A description of a single queue in the following format: queue-number queue-name (optional-details). This should be repeated for all the queues.
- **MaxPartitions**: Integer, number of partitions used.
- **Partitions**: A verbal description of the system's partitions, to explain the partition number field. For example, partitions can be distinct parallel machines in a cluster, or sets of nodes with different attributes (memory configuration, number of CPUs, special attached devices), especially if this is known to the scheduler.
- **Partition**: Description of a single partition.
- **Note**: There may be several notes, describing special features of the log. For example: "The runtime is until the last node was freed; jobs may have freed some of their nodes earlier".
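
The header comments above can be collected before the job records are parsed. This is a minimal sketch under stated assumptions: header lines are taken to be comment lines that start with `;` and have the form `; Label: value`, and the helper names `parse_header` and `submit_datetime` are hypothetical. The time conversion adds a job's submit time to `UnixStartTime` and defaults to UTC.

```python
from datetime import datetime, timezone


def parse_header(lines):
    """Collect '; Label: value' comment lines into a dict mapping label -> list of values.

    Labels such as Queue, Partition and Note may repeat, so values are kept in lists.
    """
    header = {}
    for line in lines:
        if not line.startswith(";"):
            continue  # not a comment line, so not part of the header
        label, sep, value = line[1:].partition(":")
        if sep and label.strip():
            header.setdefault(label.strip(), []).append(value.strip())
    return header


def submit_datetime(unix_start_time, submit_time, tz=timezone.utc):
    """Convert a relative submit time (seconds from log start) to an aware datetime."""
    return datetime.fromtimestamp(unix_start_time + submit_time, tz=tz)


# Made-up header fragment for illustration only.
sample = [
    "; Version: 2",
    "; UnixStartTime: 0",
    "; Queue: 1 batch",
    "; Queue: 2 express",
]
header = parse_header(sample)
print(header["Queue"])
print(submit_datetime(int(header["UnixStartTime"][0]), 3600))
```

On Python 3.9 or later, `zoneinfo.ZoneInfo(header["TimeZoneString"][0])` could be passed as `tz` to get times in the installation's time zone, provided the time zone database is available on the machine.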

In [3]:
# Print names of all available logs
# TODO

import os

LOGS_DIR = "./logs"
LOGS = [
    "CIEMAT-Euler-2008-1.swf"
]


class Job:
    def __init__(self, description):
        params = description.split()
        
        #  0 - Job Number
        self.id = params[0]
        
        #  1 - Submit Time
        self.sub = params[1]

        #  2 - Wait Time
        self.wait = params[2]
        
        #  3 - Run Time
        self.t = params[3]
        
        #  4 - Number of Processors
        self.p = params[4]
        
        #  5 - Average CPU Time Used
        self.cpu = params[5]
        
        #  6 - Used Memory
        self.mem = params[6]
        
        #  7 - Requested Number of Processors
        self.preq = params[7]
        
        #  8 - Requested Time
        self.treq = params[8]
        
        #  9 - Requested Memory
        self.mreq = params[9]
        
        # 10 - Status (1=completed, 0=failed, 5=cancelled)
        self.status = params[10]
        
        # 11 - User ID
        self.u = params[11]
        
        # 12 - Group ID
        self.gr = params[12]
        
        # 13 - Executable (Application) Number
        self.app = params[13]
        
        # 14 - Queue Number
        self.q = params[14]
        
        # 15 - Partition Number
        self.part = params[15]
        
        # 16 - Preceding Job Number
        self.prec = params[16]
        
        # 17 - Think Time from Preceding Job
        self.think = params[17]


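class LogFile:
    def __init__(self, filename):
        self.filename = filename

        self.cnt_fmt  = 0
        self.cnt_t0   = 0
        self.cnt_p0   = 0
        self.cnt_stat = 0
        self.cnt_bad  = 0

        self.start = 0
        self.jobs  = 0
        self.procs = 0
        self.nodes = 0

        # Parsed Job records, filled by read()
        self.records = []

    def read(self):
        # Check file existence (assumes filename is relative to LOGS_DIR)
        path = os.path.join(LOGS_DIR, self.filename)
        if not os.path.isfile(path):
            raise FileNotFoundError(path)

        # TODO
        # parse header comments and update the counters
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and header comments, which start with ';'
                if not line or line.startswith(";"):
                    continue
                self.records.append(Job(line))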



In [2]:
# Print detail information about specific logs
# TODO

