# Failures

Failures cause hosts to stop periodically. In OpenDC, failures can be simulated by providing a trace. 
This trace describes when failures occur, how long they last, and how intense they are (how many hosts are affected). 

In this demo, we will investigate the effect of failures.

#### Let's start by looking at one of the failure traces.

In [None]:
import pandas as pd

df_failure = pd.read_parquet("failure_traces/Facebook_user_reported.parquet")

df_failure
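
To make the three properties of a trace concrete, the sketch below builds a tiny trace by hand and summarises it. The column names `failure_interval`, `failure_duration` and `failure_intensity`, the millisecond unit, and the reading of intensity as a fraction of hosts are assumptions for illustration; check them against the columns printed by the cell above.

```python
import pandas as pd

# Hypothetical trace: column names, units (ms) and values are assumptions.
trace = pd.DataFrame({
    "failure_interval": [3_600_000, 7_200_000, 1_800_000],  # time between failures
    "failure_duration": [600_000, 300_000, 900_000],        # how long each failure lasts
    "failure_intensity": [0.5, 0.25, 1.0],                  # fraction of hosts affected
})

# Total time during which some failure is active
total_downtime = pd.to_timedelta(trace["failure_duration"].sum(), unit="ms")

# Average fraction of hosts affected per failure
mean_intensity = trace["failure_intensity"].mean()

# Downtime weighted by the fraction of hosts affected
weighted_ms = (trace["failure_duration"] * trace["failure_intensity"]).sum()
weighted_downtime = pd.to_timedelta(weighted_ms, unit="ms")

print(f"Total downtime:    {total_downtime}")
print(f"Mean intensity:    {mean_intensity:.2f}")
print(f"Weighted downtime: {weighted_downtime}")
```

The weighted downtime is a rough single-number measure of how disruptive a trace is, which is useful when comparing traces later on.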

## Experiment

Failures can be added to the simulation using the experiment file. 
First, we will run a workload with and without failures. 

For this we need to run two simulations. The first uses an experiment file similar to the one used in the previous demo. You can find the file [here](experiments/2.no_failures.json).

Next, we make an experiment with failures. To do this, we need to add "failureModels" to the experiment file. 
This results in the experiment file that can be found [here](experiments/2.Facebook_failures.json), and is shown below:

```json
{
    "name": "Facebook_failures",
    "outputFolder": "output/2.failures",
    "topologies": [
        {
            "pathToFile": "topologies/2.demo_failures/surfsara_small.json"
        }
    ],
    "workloads": [
        {
            "pathToFile": "workload_traces/2022-10-01_2022-10-02",
            "type": "ComputeWorkload"
        }
    ],
    "exportModels": [
        {
            "exportInterval": 300
        }
    ],
    "failureModels": [
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/Facebook_user_reported.parquet"
        }
    ]
}
```

This experiment adds the Facebook_user_reported failure trace to the simulation. 
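
The same change can also be made programmatically. The sketch below derives a failure experiment from a stripped-down base experiment by appending one entry to `failureModels`. The helper `with_failure_trace` and the reduced base dictionary are hypothetical and only serve as an illustration; the keys and values are taken from the file shown above.

```python
import copy
import json

# Reduced stand-in for the no-failure experiment; only the keys needed here.
base_experiment = {
    "name": "no_failures",
    "outputFolder": "output/2.failures",
}

def with_failure_trace(experiment, name, trace_path):
    """Return a copy of `experiment` with one trace-based failure model added."""
    result = copy.deepcopy(experiment)  # leave the base experiment untouched
    result["name"] = name
    result.setdefault("failureModels", []).append(
        {"type": "trace-based", "pathToFile": trace_path}
    )
    return result

failure_experiment = with_failure_trace(
    base_experiment,
    "Facebook_failures",
    "failure_traces/Facebook_user_reported.parquet",
)
print(json.dumps(failure_experiment, indent=4))
```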

#### Let's run both experiments

In [None]:
import subprocess

pathToScenario = "experiments/2.Facebook_failures.json"
subprocess.run(["OpenDCExperimentRunner/bin/OpenDCExperimentRunner", "--experiment-path", pathToScenario])
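
`subprocess.run` does not raise an error by default when the process exits with a non-zero code, so a failed simulation can go unnoticed until the output files are missing. A minimal sketch of a stricter wrapper is shown below. The helper `run_checked` is hypothetical, and a stand-in command is used so that the example runs without OpenDC installed.

```python
import subprocess
import sys

def run_checked(command):
    """Run `command` and raise CalledProcessError on a non-zero exit code."""
    completed = subprocess.run(command, check=True)
    return completed.returncode

# Stand-in command so the sketch runs anywhere. In this notebook, `command`
# would be the OpenDCExperimentRunner call from the cell above.
returncode = run_checked([sys.executable, "-c", "print('simulation finished')"])
print(f"Exit code: {returncode}")
```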

## Analysis

We can see that many tasks did not succeed.
When a task fails too many times, it is terminated.
Let's investigate the failures further.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


df_host = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/0/seed=0/host.parquet")
df_power = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/0/seed=0/powerSource.parquet")
df_task = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/0/seed=0/task.parquet")
df_service = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/0/seed=0/service.parquet")

df_host_fail = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/1/seed=0/host.parquet")
df_power_fail = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/1/seed=0/powerSource.parquet")
df_task_fail = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/1/seed=0/task.parquet")
df_service_fail = pd.read_parquet("output/2.failures/Facebook_failures/raw-output/1/seed=0/service.parquet")

In [None]:
tasks_terminated = df_service.iloc[-1].tasks_terminated
tasks_terminated_fail = df_service_fail.iloc[-1].tasks_terminated

print(f"In the normal simulation {tasks_terminated} tasks were terminated")
print(f"When adding failures {tasks_terminated_fail} tasks were terminated")

#### Let's compare the runtimes

In [None]:
runtime = pd.to_timedelta(df_service.timestamp.max() - df_service.timestamp.min(), unit="ms")
runtime_fail = pd.to_timedelta(df_service_fail.timestamp.max() - df_service_fail.timestamp.min(), unit="ms")

print(f"The workload took {runtime} without failures")
print(f"The workload took {runtime_fail} with failures")


##### Adding failures almost tripled the runtime!
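
The slowdown can be expressed as a single factor by dividing the two runtimes. The sketch below uses small synthetic stand-ins for the two `service.parquet` tables, so the numbers are illustrative and not the results of this experiment. It assumes, as the cell above does, that `timestamp` is in milliseconds.

```python
import pandas as pd

def runtime_ms(df_service):
    """Span between the first and last exported timestamp, in milliseconds."""
    return df_service["timestamp"].max() - df_service["timestamp"].min()

# Synthetic stand-ins for the two service tables (timestamps in ms)
baseline = pd.DataFrame({"timestamp": [0, 300_000, 600_000]})
with_failures = pd.DataFrame({"timestamp": [0, 300_000, 600_000, 900_000, 1_500_000]})

slowdown = runtime_ms(with_failures) / runtime_ms(baseline)
print(f"Slowdown factor: {slowdown:.2f}x")
```

With the real data, pass `df_service` and `df_service_fail` to `runtime_ms` instead of the synthetic frames.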

## Visualization

We can plot the active tasks over time to see what is happening.

In [None]:
import matplotlib.dates as mdates

timestamps = pd.to_datetime(df_service[:-1].timestamp_absolute, unit="ms")
timestamps_failures = pd.to_datetime(df_service_fail[:-1].timestamp_absolute, unit="ms")


fig, ax = plt.subplots(figsize=(10,5))
ax.plot(timestamps, df_service[:-1].tasks_active, label="no failures")
ax.plot(timestamps_failures, df_service_fail[:-1].tasks_active, label="failures")

plt.title("Tasks active during a workload")
plt.xlabel("Time")
plt.ylabel("Active tasks")
ax.xaxis.set_major_locator(plt.MaxNLocator(3))
myFmt = mdates.DateFormatter('%y-%m-%d %H:%M:%S')
ax.xaxis.set_major_formatter(myFmt)

plt.legend()
plt.show()

#### We can clearly see that the failures are creating idle periods
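
The idle periods can also be extracted from the data instead of read off the plot. The sketch below treats every sample with zero active tasks as idle and groups consecutive idle samples into periods. The zero threshold is an assumption (a period can look idle in the plot while a few tasks are still active), and the data is a synthetic stand-in for `service.parquet`.

```python
import pandas as pd

# Synthetic stand-in for service.parquet, one row per 300 s export interval
df = pd.DataFrame({
    "timestamp": [0, 300_000, 600_000, 900_000, 1_200_000, 1_500_000, 1_800_000],
    "tasks_active": [4, 0, 0, 3, 2, 0, 1],
})

idle = df["tasks_active"] == 0

# Every switch between idle and busy starts a new run; cumsum labels the runs
run_id = (idle != idle.shift()).cumsum()

# Keep only the idle rows and collapse each run into one row
idle_periods = (
    df[idle]
    .groupby(run_id[idle])["timestamp"]
    .agg(start="min", end="max", samples="count")
    .reset_index(drop=True)
)
print(idle_periods)
```

Applied to `df_service_fail`, the resulting periods could be compared against the failure trace to check that they line up with the injected failures.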

## Comparing Failure traces

In the failure_traces folder we find five failure traces gathered from different applications. 
Let's compare the effect of the different failure traces on the same workload. 
To do this, we have created a new experiment which you can find [here](experiments/2.all_failures.json).
Its content is shown below:

```json
{
    "name": "all_failures",
    "outputFolder": "output/2.failures",
    "topologies": [
        {
            "pathToFile": "topologies/2.demo_failures/surfsara_small.json"
        }
    ],
    "workloads": [
        {
            "pathToFile": "workload_traces/2022-10-01_2022-10-02",
            "type": "ComputeWorkload"
        }
    ],
    "exportModels": [
        {
            "exportInterval": 300
        }
    ],
    "failureModels": [
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/Facebook_user_reported.parquet"
        },
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/Instagram_user_reported.parquet"
        },
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/Netflix_user_reported.parquet"
        },
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/Whatsapp_user_reported.parquet"
        },
        {
            "type": "trace-based",
            "pathToFile": "failure_traces/YouTube_user_reported.parquet"
        }
    ]
}
```

Using a list of different failure traces instructs OpenDC to run multiple simulations, each with a different failure trace.

In [None]:
import subprocess

pathToScenario = "experiments/2.all_failures.json"
subprocess.run(["OpenDCExperimentRunner/bin/OpenDCExperimentRunner", "--experiment-path", pathToScenario])


## Visualization

Let's load the output data and compare the results.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df_host_facebook = pd.read_parquet("output/2.failures/all_failures/raw-output/0/seed=0/host.parquet")
df_power_facebook = pd.read_parquet("output/2.failures/all_failures/raw-output/0/seed=0/powerSource.parquet")
df_service_facebook = pd.read_parquet("output/2.failures/all_failures/raw-output/0/seed=0/service.parquet")
df_task_facebook = pd.read_parquet("output/2.failures/all_failures/raw-output/0/seed=0/task.parquet")

df_host_instagram = pd.read_parquet("output/2.failures/all_failures/raw-output/1/seed=0/host.parquet")
df_power_instagram = pd.read_parquet("output/2.failures/all_failures/raw-output/1/seed=0/powerSource.parquet")
df_service_instagram = pd.read_parquet("output/2.failures/all_failures/raw-output/1/seed=0/service.parquet")
df_task_instagram = pd.read_parquet("output/2.failures/all_failures/raw-output/1/seed=0/task.parquet")

df_host_netflix = pd.read_parquet("output/2.failures/all_failures/raw-output/2/seed=0/host.parquet")
df_power_netflix = pd.read_parquet("output/2.failures/all_failures/raw-output/2/seed=0/powerSource.parquet")
df_service_netflix = pd.read_parquet("output/2.failures/all_failures/raw-output/2/seed=0/service.parquet")
df_task_netflix = pd.read_parquet("output/2.failures/all_failures/raw-output/2/seed=0/task.parquet")

df_host_whatsapp = pd.read_parquet("output/2.failures/all_failures/raw-output/3/seed=0/host.parquet")
df_power_whatsapp = pd.read_parquet("output/2.failures/all_failures/raw-output/3/seed=0/powerSource.parquet")
df_service_whatsapp = pd.read_parquet("output/2.failures/all_failures/raw-output/3/seed=0/service.parquet")
df_task_whatsapp = pd.read_parquet("output/2.failures/all_failures/raw-output/3/seed=0/task.parquet")

df_host_youtube = pd.read_parquet("output/2.failures/all_failures/raw-output/4/seed=0/host.parquet")
df_power_youtube = pd.read_parquet("output/2.failures/all_failures/raw-output/4/seed=0/powerSource.parquet")
df_service_youtube = pd.read_parquet("output/2.failures/all_failures/raw-output/4/seed=0/service.parquet")
df_task_youtube = pd.read_parquet("output/2.failures/all_failures/raw-output/4/seed=0/task.parquet")
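
The twenty reads above follow one pattern, so they can be condensed into a loop. The sketch below is one possible way to do that. The helpers `output_path` and `load_all` are hypothetical, and the order of `TRACES` is assumed to match the order of the `failureModels` list in the experiment file, as the cell above implies.

```python
import pandas as pd

# Order assumed to match the "failureModels" list in the experiment file
TRACES = ["facebook", "instagram", "netflix", "whatsapp", "youtube"]
TABLES = ["host", "powerSource", "service", "task"]

def output_path(index, table, experiment="all_failures", seed=0):
    """Build the path to one output table of one simulation run."""
    return f"output/2.failures/{experiment}/raw-output/{index}/seed={seed}/{table}.parquet"

def load_all():
    """Return {trace: {table: DataFrame}} for every trace and table."""
    return {
        trace: {table: pd.read_parquet(output_path(i, table)) for table in TABLES}
        for i, trace in enumerate(TRACES)
    }

print(output_path(3, "service"))
```

After `results = load_all()`, the table loaded above as `df_service_netflix` would be available as `results["netflix"]["service"]`.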

In [None]:
fig, ax = plt.subplots(figsize=(20,5))

ax.plot(df_service_facebook.tasks_active, label="facebook")
ax.plot(df_service_instagram.tasks_active, label="instagram")
ax.plot(df_service_netflix.tasks_active, label="netflix")
ax.plot(df_service_whatsapp.tasks_active, label="whatsapp")
ax.plot(df_service_youtube.tasks_active, label="youtube")

ax.set_xlabel("Export step (row index)")
ax.set_ylabel("Active tasks")
ax.legend()

plt.show()


#### We can see a clear difference between when the failures occur, their intensity, and the overall effect of the failures.
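
To back the visual comparison with numbers, the per-trace results can be collected in one summary table. The sketch below uses the `timestamp`, `tasks_active` and `tasks_terminated` columns already used earlier in this notebook. The helper `summarize` is hypothetical, and the two frames are synthetic stand-ins, so the values shown are not results of this experiment.

```python
import pandas as pd

def summarize(service_frames):
    """Build one summary row per trace from its service table."""
    rows = {}
    for name, df in service_frames.items():
        rows[name] = {
            "runtime": pd.to_timedelta(
                df["timestamp"].max() - df["timestamp"].min(), unit="ms"
            ),
            "peak_tasks_active": df["tasks_active"].max(),
            "tasks_terminated": df["tasks_terminated"].iloc[-1],
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic stand-ins for two service tables (timestamps in ms)
frames = {
    "trace_a": pd.DataFrame({
        "timestamp": [0, 300_000, 600_000],
        "tasks_active": [5, 3, 0],
        "tasks_terminated": [0, 1, 2],
    }),
    "trace_b": pd.DataFrame({
        "timestamp": [0, 300_000, 600_000, 900_000],
        "tasks_active": [5, 0, 4, 0],
        "tasks_terminated": [0, 3, 3, 6],
    }),
}

summary = summarize(frames)
print(summary)
```

With the real data, the dictionary would map each trace name to its service table, for example `{"facebook": df_service_facebook, "instagram": df_service_instagram}`.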