<a href="https://colab.research.google.com/github/andreaaraldo/machine-learning-for-networks/blob/master/05.trees/05.trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd

from sklearn.model_selection import train_test_split

from collections import Counter

from imblearn.over_sampling import SMOTE

# Use case and dataset

We use the dataset by [Reyhane Askari Hemmat](https://github.com/ReyhaneAskari/SLA_violation_classification) (Université de Montréal), which was used in [He16]. It is built from the [Google Cloud Cluster Trace](https://github.com/google/cluster-data), a 29-day trace of activity in a Google Cloud cluster. The trace reports:

* Resources available on the machines
* Tasks submitted by users, along with the requested resources
* Actual resources used by tasks
* Events, like eviction of tasks (for lack of resources, failure of the machine, etc.)


Hemmat et al. [He16] pre-processed this trace:
* For each submitted task, they checked whether the task terminated correctly or was evicted.
* They created a CSV file with the task characteristics and a `violation` column indicating failure (1) or normal termination (0).
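
The labelling step described above can be sketched as follows. This is not the authors' script: the event records and the event names (`SUBMIT`, `SCHEDULE`, `FINISH`, `EVICT`) are hypothetical stand-ins, and the only assumption is that a task counts as a violation if it was ever evicted.

```python
# Hypothetical event records: (job_id, task_idx, event).
# The event names are illustrative, not the trace's actual encoding.
events = [
    (1, 0, "SUBMIT"), (1, 0, "SCHEDULE"), (1, 0, "FINISH"),
    (1, 1, "SUBMIT"), (1, 1, "SCHEDULE"), (1, 1, "EVICT"),
    (2, 0, "SUBMIT"), (2, 0, "SCHEDULE"), (2, 0, "FINISH"),
]


def label_violations(events):
    """Map each (job_id, task_idx) to 1 if the task was ever evicted, else 0."""
    labels = {}
    for job_id, task_idx, event in events:
        key = (job_id, task_idx)
        labels.setdefault(key, 0)   # a task is assumed fine until proven otherwise
        if event == "EVICT":
            labels[key] = 1
    return labels


labels = label_violations(events)
print(labels)
```

Each key of `labels` plays the role of one row of the CSV file, and the value plays the role of the `violation` column.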


### Goal
Predict a task failure, i.e., whether a task [will be evicted](https://github.com/ReyhaneAskari/SLA_violation_classification/blob/55bba2683dec43e739244b6b616294827a98f8e1/3_create_database/scripts/full_db_2.py#L33) before normal termination. 

In [1]:
!wget https://raw.githubusercontent.com/ReyhaneAskari/SLA_violation_classification/master/3_create_database/csvs/frull_db_2.csv

--2020-03-23 16:53:55--  https://raw.githubusercontent.com/ReyhaneAskari/SLA_violation_classification/master/3_create_database/csvs/frull_db_2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10633997 (10M) [text/plain]
Saving to: ‘frull_db_2.csv’


2020-03-23 16:53:56 (45.5 MB/s) - ‘frull_db_2.csv’ saved [10633997/10633997]



Unfortunately, [no GPU support](https://stackoverflow.com/a/41568439/2110769) is available for scikit-learn.

# Load dataset and preliminary operations


In [20]:
train_path = "frull_db_2.csv"
df = pd.read_csv(train_path)
df

| Unnamed: 0.1 | Unnamed: 0 | job_id | task_idx | sched_cls | priority | cpu_requested | mem_requested | disk | violation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3418314 | 0 | 3 | 9 | 0.12500 | 0.074460 | 0.000424 | 0 |
| 1 | 3 | 3418314 | 1 | 3 | 9 | 0.12500 | 0.074460 | 0.000424 | 0 |
| 2 | 45 | 3418368 | 0 | 3 | 9 | 0.03125 | 0.086910 | 0.000455 | 0 |
| 3 | 46 | 3418368 | 1 | 3 | 9 | 0.03125 | 0.086910 | 0.000455 | 0 |
| 4 | 47 | 3418368 | 2 | 3 | 9 | 0.03125 | 0.086910 | 0.000455 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 201195 | 450131 | 6251995937 | 196 | 0 | 0 | 0.06873 | 0.011930 | 0.000115 | 0 |
| 201196 | 450134 | 4392480606 | 180 | 2 | 0 | 0.06250 | 0.063350 | 0.000077 | 0 |
| 201197 | 450137 | 5285926325 | 0 | 0 | 9 | 0.06250 | 0.006218 | 0.000038 | 1 |
| 201198 | 450142 | 6183750753 | 60 | 1 | 0 | 0.12500 | 0.033390 | 0.000019 | 0 |


Column description:
* `job_id`: identifier of a job. Users submit jobs, and each job is a set of tasks
* `task_idx`: the index of a task within a job. A task is uniquely identified by `(job_id, task_idx)`
* `sched_cls`: From [Re11]: "3 representing a more latency-sensitive task (e.g., serving revenue-generating user requests) and 0 representing a non-production task (e.g., development, non-business-critical analyses, etc.)... more latency-sensitive tasks tend to have higher task priorities"
* `priority`
* `cpu_requested`: Maximum amount of CPU the task is permitted to use.
  * Unit of measurement: core-count / second.
  * The scale is relative to the CPU available in the most powerful machine of the cluster.
  * This is specified by the user at submission time
* `mem_requested`: Maximum amount of memory the task is permitted to use.
  * Unit of measurement: GB
  * The scale is relative to the memory available in the machine of the cluster with the largest memory.
  * This is specified by the user at submission time
* `disk`: Maximum amount of disk space the task is permitted to use, defined similarly to `mem_requested`
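
Because the resource columns are relative to the largest machine, a value can be read as a fraction of that machine's capacity. The sketch below illustrates the arithmetic. The capacities `MAX_CPU_CORES` and `MAX_MEM_GB` are made up for illustration and are not taken from the trace.

```python
# Hypothetical capacities of the largest machine (made up for illustration).
MAX_CPU_CORES = 32
MAX_MEM_GB = 128


def to_absolute(cpu_requested, mem_requested):
    """Convert relative requests into absolute amounts under the assumed capacities."""
    return cpu_requested * MAX_CPU_CORES, mem_requested * MAX_MEM_GB


# Values taken from the first row of the table above.
cores, mem_gb = to_absolute(0.125, 0.07446)
print(f"{cores} cores, {mem_gb:.2f} GB")
```

Under these assumed capacities, a `cpu_requested` of 0.125 corresponds to one eighth of the largest machine's CPU.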

Let's partition the dataset into a training set and a test set.

In [0]:
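# Features: every column except the label
X = df.drop(labels='violation', axis=1)
# Label: 1 if the task failed, 0 if it terminated normally
y = df['violation']

# Hold out 30% of the samples for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                        shuffle=True, random_state=4)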
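
When one class is rare, a plain random split can leave the training and test sets with slightly different class proportions. A possible variation, not used in this notebook, is a stratified split through the `stratify` parameter of `train_test_split`. The sketch below uses toy data (`X_toy`, `y_toy` are hypothetical names) with about 8% positives.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels: 80 positives out of 1000 samples (hypothetical data).
y_toy = [1] * 80 + [0] * 920
X_toy = [[i] for i in range(len(y_toy))]  # a single dummy feature

# stratify=y_toy keeps the share of each class equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=True, stratify=y_toy, random_state=4
)

train_share = Counter(y_tr)[1] / len(y_tr)
test_share = Counter(y_te)[1] / len(y_te)
print("Positive share in train:", train_share)
print("Positive share in test: ", test_share)
```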

Check for class imbalance and correct it.

In [22]:
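# Class counts in the training set before oversampling
print("Samples per class before SMOTE: ", Counter(y_train))

# Oversample the minority class, on the training set only.
# fit_resample replaces fit_sample, which recent imbalanced-learn releases removed.
smote = SMOTE()
X_train, y_train = smote.fit_resample(X_train, y_train)

print("Samples per class after SMOTE: ", Counter(y_train))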

Samples per class before SMOTE:  Counter({0: 129445, 1: 11395})
Samples per class after SMOTE:  Counter({0: 129445, 1: 129445})
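
SMOTE balances the classes by synthesizing new minority samples instead of duplicating existing ones: it picks a minority sample, picks one of its nearest minority neighbours, and places a new point at a random position on the segment between them. The sketch below illustrates that interpolation step in plain Python. It is a simplified illustration, not the imbalanced-learn implementation, and `smote_like_sample` is a hypothetical helper name.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
    gap = rng.random()                  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Four minority points on the corners of the unit square (toy data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

With `k=2` each corner's nearest neighbours are its two adjacent corners, so every synthetic point lands on an edge of the square, inside the region already occupied by the minority class.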


# Training a random forest
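
As a rough sketch of what this step could look like, the block below fits scikit-learn's `RandomForestClassifier` on synthetic stand-in data and reports per-class metrics. Everything here is an assumption for illustration: the data comes from `make_classification` rather than from the dataset above, the `_demo` variable names are hypothetical, and the hyperparameters are not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=4
)

# A forest of 100 trees; the fixed seed makes the run reproducible.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train_demo, y_train_demo)

y_pred_demo = forest.predict(X_test_demo)
print(classification_report(y_test_demo, y_pred_demo))
```

On imbalanced data, per-class precision and recall are usually more informative than overall accuracy, which is why the sketch prints `classification_report` rather than a single score.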

# References

[He16] Hemmat, R. A., & Hafid, A. (2016). SLA Violation Prediction In Cloud Computing: A Machine Learning Perspective. Retrieved from http://arxiv.org/abs/1611.10338

[Re11] Reiss, C., Wilkes, J., & Hellerstein, J. (2011). Google cluster-usage traces: format + schema. Google Inc., …, 1–14. https://doi.org/10.1007/978-3-540-69057-3_88