# Data analysis of server(s) workload jobs

## Interpreting the data

In the Yahoo! trace, each record has the format `job_submission_time` `nr_of_tasks_in_job` `average_task_duration` `the_runtime_of_each_task`. This only allows us to model the "Job Arrival Process" \[1], i.e. the Arrival Rate, Inter-arrival Time, and Actual Runtime. Job Cancellation is not recorded and cannot be measured from these traces. We can also analyse the job modelling characteristics Bag of Tasks, Burstiness, and Periodicity \[1]. "Job Execution Process" \[1] details such as Job Size (# Cores), Memory Usage, and User Behaviour are not recorded in the trace, and thus cannot be analysed.
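To make the record layout concrete, here is a minimal sketch of parsing one trace line. It assumes that fields are whitespace-separated and that every value after the third field is an individual task runtime; neither detail is confirmed by the trace documentation. The `TraceJob` class, the `parse_trace_line` function, and the sample line are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceJob:
    """One job record (hypothetical container, not part of the trace tooling)."""
    job_submission_time: float
    nr_of_tasks_in_job: int
    average_task_duration: float
    task_runtimes: List[float]


def parse_trace_line(line: str) -> TraceJob:
    # ASSUMPTION: fields are separated by whitespace.
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    return TraceJob(
        job_submission_time=float(fields[0]),
        nr_of_tasks_in_job=int(fields[1]),
        average_task_duration=float(fields[2]),
        # ASSUMPTION: all remaining fields are per-task runtimes.
        task_runtimes=[float(value) for value in fields[3:]],
    )


# Made-up sample line, not taken from the real trace.
sample_job = parse_trace_line("7.5 3 2.0 1.0 2.0 3.0")
print(sample_job)
```

Keeping the per-task runtimes as a list preserves the information needed for Bag of Tasks analysis, which the three summary fields alone would lose.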

## Problems with the trace data

The trace does not specify the time unit of the `job_submission_time` and `the_runtime_of_each_task` fields. This unit must be determined to interpret the data accurately. According to the source for the Yahoo! workload trace \[2], the trace comes from a cluster of approximately 2000 machines at Yahoo! (YH trace), covers three weeks in late February 2009 and early March 2009, and contains around 30,000 jobs. The source also states that the running time is in task-seconds of map and reduce functions. Thus, for the `the_runtime_of_each_task` field, we can assume that the floating point number is in seconds. If we assume that `job_submission_time` is also in seconds, we would expect the last job submission time to be close to the number of seconds in 3 weeks, which is `60 x 60 x 24 x 7 x 3 = 1814400`. However, the `job_submission_time` of the last job record in the Yahoo! trace `YH.tr` is 181440, exactly one tenth of that value, so we can assume that the submission time is measured in tens of seconds.

Unfortunately, the same check fails for the Facebook trace `FB.tr`. The last job record has a submit time of 388171, which is vastly different from the number of seconds in 6 months, approximately `60 x 60 x 24 x 30 x 6 = 15552000`. The two values are also not related by any obvious regular factor, unlike in the Yahoo! trace. This issue of unknown time units must be resolved before analysing the Facebook trace, and potentially the other traces.
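The unit check for both traces can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each trace spans exactly its stated duration and that the last record falls at the very end of that period. The `seconds_per_tick` helper is a hypothetical name introduced for illustration.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def seconds_per_tick(expected_duration_seconds: float, last_submission_time: float) -> float:
    """Seconds that one unit of job_submission_time would have to represent."""
    return expected_duration_seconds / last_submission_time


# Yahoo! trace: three weeks, last job_submission_time is 181440.
yahoo_unit = seconds_per_tick(SECONDS_PER_DAY * 7 * 3, 181440)

# Facebook trace: six months approximated as 6 x 30 days, last submit time is 388171.
facebook_unit = seconds_per_tick(SECONDS_PER_DAY * 30 * 6, 388171)

print(f"Yahoo!:   {yahoo_unit:.3f} seconds per unit")
print(f"Facebook: {facebook_unit:.3f} seconds per unit")
```

The Yahoo! value comes out as exactly 10, while the Facebook value is roughly 40.06, which does not correspond to any natural time unit.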

*\[1] F. Ian, "WORKLOAD CHARACTERISATION FOR CLOUD RESOURCE MANAGEMENT," School of Engineering Macquarie University, 2020, pp. 4-7.*
 
*\[2] Y. Chen, A. Ganapathi, R. Griffith and R. Katz, "The Case for Evaluating MapReduce Performance Using Workload Suites," 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, 2011, pp. 390-399, doi: 10.1109/MASCOTS.2011.12.*

## Importing the data

The first step is to import the data from the `jobs.csv` file. This file is generated from the `*.tr` trace files using the preprocessing script `process-trace.sh`. 

The `pandas` library is used to import the CSV file as a DataFrame. Pandas is a data science tool used for exploring and manipulating data. The `numpy` library is also imported to be used in conjunction with `pandas` to manipulate the DataFrame.

In [15]:
import numpy as np
import pandas as pd

with open('preprocessing/jobs.csv') as file:
    data = pd.read_csv(file)

data.set_index('job_id', inplace=True)

## How the data looks

As we can see below, the `jobs.csv` data has four fields: `job_id`, `job_submission_time`, `nr_of_tasks_in_job`, and `average_task_duration`.

Starting from time 0, each record contains a timestamp for when a new job is submitted to the system. This gives us a list of jobs in the workload trace.

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24262 entries, 0 to 24261
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   job_submission_time    24262 non-null  float64
 1   nr_of_tasks_in_job     24262 non-null  int64  
 2   average_task_duration  24262 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 758.2 KB


In [17]:
data.head()

| job_id | job_submission_time | nr_of_tasks_in_job | average_task_duration |
|---|---|---|---|
| 0 | 7.527 | 51 | 15.948893 |
| 1 | 15.092 | 34 | 8.723592 |
| 2 | 22.65 | 24 | 22.139817 |
| 3 | 30.004 | 5 | 1.027298 |
| 4 | 37.425 | 3 | 19.411553 |


In [18]:
data.tail()

| job_id | job_submission_time | nr_of_tasks_in_job | average_task_duration |
|---|---|---|---|
| 24257 | 181410.0 | 81 | 27.690639 |
| 24258 | 181418.0 | 7 | 10.193665 |
| 24259 | 181425.0 | 26 | 27.222939 |
| 24260 | 181433.0 | 1 | 1085.223348 |
| 24261 | 181440.0 | 19 | 6.994636 |
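
Because each record carries a submission timestamp, inter-arrival times follow from the difference between consecutive `job_submission_time` values. The sketch below uses only the first five records from the `data.head()` output rather than the full `jobs.csv`, and the `inter_arrival_time` column name is an assumption made for illustration. The results are in the trace's own time unit.

```python
import pandas as pd

# First five records, copied from the data.head() output.
sample = pd.DataFrame(
    {"job_submission_time": [7.527, 15.092, 22.65, 30.004, 37.425]},
    index=pd.Index(range(5), name="job_id"),
)

# ASSUMPTION: records are already sorted by submission time.
# diff() subtracts the previous row, so the first job has no inter-arrival time (NaN).
sample["inter_arrival_time"] = sample["job_submission_time"].diff()

# Mean gap between arrivals; mean() skips the NaN in the first row.
mean_inter_arrival = sample["inter_arrival_time"].mean()

print(sample)
print(f"mean inter-arrival time: {mean_inter_arrival:.3f} trace time units")
```

The arrival rate is the reciprocal of the mean inter-arrival time, so it can only be expressed in jobs per second once the time unit of `job_submission_time` is settled.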


## Statistics

Below are some general statistical measures of the data.

In [19]:
data.describe()

| | job_submission_time | nr_of_tasks_in_job | average_task_duration |
|---|---|---|---|
| count | 24262.0 | 24262.0 | 24262.0 |
| mean | 90727.482251 | 39.91159 | 118.784488 |
| std | 52376.743597 | 153.444104 | 526.209199 |
| min | 7.527 | 1.0 | 0.004283 |
| 25% | 45372.075 | 6.0 | 6.456153 |
| 50% | 90727.0 | 15.0 | 15.619042 |
| 75% | 136084.0 | 31.0 | 34.323011 |
| max | 181440.0 | 5900.0 | 20512.714086 |
