# Getting Data

In several cases, we have access to historical data about the behavior of the complex system that we want to model. This was the case for [Google a few years ago](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42542.pdf) and [led to considerable energy savings later on](https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/).

In our case, unfortunately (?!?), historical data about zombie virus outbreaks are not available. This tends to happen when the scenario that we need to deal with is something that really should not happen, such as an expensive machine breaking or the downfall of humanity.

In such situations, **it is often useful to resort to a simulator**.

For this tutorial we will use [the excellent R package EpiModel](http://www.epimodel.org), which can simulate an infectious disease spreading over a dynamic network.

## Simulator Configuration

To give an intuition of how the simulator (configured for our needs) works:

* The nodes of the network will be our lab workers, who can be healthy or infected
* The arcs represent encounters, which arise and disappear randomly
* For every encounter, there may be several interactions
* Each interaction with an infected worker (e.g. an attack) may lead to infection
* At each step, some of the workers (healthy and infected) may die
* At each step, some of the infected may recover

In more detail, the simulator is controlled by six parameters:

* `edge.ratio`, representing the average fraction of the maximum number of edges that is present in the network at each step
* `act.rate`, i.e. the average number of interactions per edge
* `inf.prob`, i.e. the infection probability per interaction
* `ds.rate`, i.e. the mortality rate for healthy workers
* `di.rate`, i.e. the mortality rate for the infected
* `rec.rate`, i.e. the recovery rate

We will run the simulator for one week with a population of 500 individuals and a single initial infection case.

The main outputs are:

* The final number of healthy (susceptible or recovered) individuals (i.e. the survivors)
* The final number of infected

The process is stochastic, so different runs may lead to different results.
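To make these mechanics concrete, here is a toy sketch in base R. It is not EpiModel and not the code in `sim.R`: the function name `toy.outbreak`, the independent re-sampling of encounters at every step, the Poisson-distributed number of interactions, and the mapping of one week to 7 steps are all assumptions made purely for illustration.

```r
# Toy sketch in base R (NOT EpiModel): a simplified outbreak on a random network.
# Status codes: "s" = susceptible, "i" = infected, "r" = recovered, "d" = dead
toy.outbreak <- function(n = 500, steps = 7, edge.ratio = 0.004, act.rate = 3,
                         inf.prob = 0.7, ds.rate = 0.05, di.rate = 0,
                         rec.rate = 0) {
  status <- rep("s", n)
  status[1] <- "i"                       # single initial infection case
  pairs <- t(combn(n, 2))                # every possible encounter (undirected)
  for (step in seq_len(steps)) {
    # Encounters: each possible edge is present with probability edge.ratio
    edges <- pairs[runif(nrow(pairs)) < edge.ratio, , drop = FALSE]
    a <- status[edges[, 1]]
    b <- status[edges[, 2]]
    # Healthy endpoint of every susceptible-infected encounter
    at.risk <- c(edges[a == "s" & b == "i", 1], edges[a == "i" & b == "s", 2])
    # Several interactions per encounter, each infecting with prob. inf.prob
    acts <- rpois(length(at.risk), act.rate)
    caught <- runif(length(at.risk)) < 1 - (1 - inf.prob)^acts
    newly.infected <- unique(at.risk[caught])
    # Recoveries and deaths are drawn from the status at the start of the step
    infected <- which(status == "i")
    healthy <- which(status %in% c("s", "r"))
    recovered <- infected[runif(length(infected)) < rec.rate]
    dead <- c(healthy[runif(length(healthy)) < ds.rate],
              infected[runif(length(infected)) < di.rate])
    status[newly.infected] <- "i"
    status[recovered] <- "r"
    status[dead] <- "d"
  }
  c(survivors = sum(status %in% c("s", "r")), infected = sum(status == "i"))
}

# Two runs with different seeds will generally give different outcomes
set.seed(1); print(toy.outbreak())
set.seed(2); print(toy.outbreak())
```

The real simulator tracks how encounters form and dissolve over time, so its numbers will not match this sketch; the point is only to show where each parameter enters and why the outputs vary between runs.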

It is important to realize that **the simulator inputs are not the same as the decisions in our optimization problem**. They represent instead *features* that describe the target system.

## How Bad is It?

As default values we assume:

* `edge.ratio = 0.004`, moderately frequent encounters
* `act.rate = 3`, since our staff is not trained
* `inf.prob = 0.7`, again due to lack of training and the strength of the virus
* `ds.rate = 0.05`, since the zombies also eat brains
* `di.rate = 0.0`, no hope of taking them down unless we train & equip our staff
* `rec.rate = 0.0`, no hope for recovery unless we equip our staff with cure shots
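To get a feel for what `edge.ratio = 0.004` means with 500 workers, the short calculation below converts it into an expected number of encounters. It assumes the network is undirected and has no self-loops, so that the maximum number of edges is n(n-1)/2; the variable names are ours, not the simulator's.

```r
n <- 500                               # population size used in this tutorial
edge.ratio <- 0.004                    # default value from the list above
max.edges <- n * (n - 1) / 2           # assumes an undirected network, no self-loops
avg.edges <- edge.ratio * max.edges    # expected number of simultaneous encounters
avg.degree <- 2 * avg.edges / n        # average encounters per worker at each step
print(c(max.edges = max.edges, avg.edges = avg.edges, avg.degree = avg.degree))
```

Under these assumptions there are 124750 possible edges, so on average about 499 encounters are active at each step, i.e. roughly two per worker.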

Let's have a look at our odds:

In [None]:
# Load simulation function
source('sim.R')

# Default setup
sol = c(0.004, 0.700, 3.000, 0.000, 0.050, 0.000)

# Run the simulator
here.they.come(edge.ratio=sol[1], inf.prob=sol[2], act.rate=sol[3], rec.rate=sol[4],
     ds.rate=sol[5], di.rate=sol[6])

**We. Are. Doomed. :-(**

But then, where would the challenge be? ;-)

## Getting Data

Remember that our goal is to approximate the input-output relation of our system, i.e. to estimate how the edge ratio etc. correlate with the number of survivors and infected.

We can do this by running the simulator with different parameter configurations and collecting the results. The next cell shows how to do it:

In [None]:
source('gendata.R')

gen.data(stop.after=1, verbose=FALSE)

The function will consider several input configurations according to a factorial design: for each parameter it will test a limited range of values and run 30 simulations. The results are saved by default to a file called `results.csv`.
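As a rough illustration of what a factorial design looks like, the sketch below builds one with `expand.grid` from base R. The candidate values are hypothetical (the ranges actually used by `gen.data` are not shown here), and we assume that the 30 simulations are repeated for every configuration.

```r
# Hypothetical candidate values: NOT the ranges actually used by gen.data
candidates <- list(
  edge.ratio = c(0.002, 0.004),
  act.rate   = c(1, 3),
  inf.prob   = c(0.3, 0.7),
  ds.rate    = c(0.00, 0.05),
  di.rate    = c(0.0, 0.1),
  rec.rate   = c(0.0, 0.1)
)

# Full factorial design: one row per combination of parameter values
design <- expand.grid(candidates)

# Assumed: each configuration is simulated 30 times, since runs are stochastic
n.rep <- 30
runs <- design[rep(seq_len(nrow(design)), each = n.rep), ]
rownames(runs) <- NULL

print(nrow(design))   # number of configurations: 2^6 = 64
print(nrow(runs))     # number of simulation runs: 64 * 30 = 1920
```

Even with only two values per parameter the number of runs grows quickly, since it is multiplied by the number of levels of every additional parameter.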

Each simulation takes a few seconds, so the whole process can be very time-consuming. Thankfully, the final results are already available in the `za_data.csv` file, in the `shared` directory.
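A minimal sketch for inspecting the precomputed results follows. It assumes that the `shared` directory sits next to the notebook and that the file is a comma-separated table with a header row; the variable names `data.path` and `za.data` are ours.

```r
# Assumed relative location of the precomputed results
data.path <- file.path('shared', 'za_data.csv')

if (file.exists(data.path)) {
  za.data <- read.csv(data.path)   # assumes comma-separated values with a header
  str(za.data)                     # column names and types
  print(summary(za.data))          # value ranges for inputs and outputs
} else {
  message('File not found: ', data.path)
}
```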