# High-Performance Analysis of Binary Systems Datasets with Dask

Puggioni Dario, Salvador Alberto, Saran Gattorno Giancarlo, Volpi Gaia

<span style="color:red"> insert student ID numbers</span> 


# 1. Introduction 

`Write about the context of the processing task to be analyzed in this project (ie what is SEVN, what is data used for in astrophysics context, etc...)`

The aim of this project is to analyze a dataset using the _Dask_ library for Python. The computation is distributed across three virtual machines connected in a cluster. These virtual machines are provided by _CloudVeneto_, a cloud computing platform managed by the University of Padova.\
More specifically, we want to study the conditions that lead to the formation of binary black hole systems that merge via gravitational-wave emission.

#### SEVN
The data we are dealing with are produced by _SEVN_ (Stellar EVolution for N-body), a code for simulating the evolution of binary star systems.\
The evolution of a single star is uniquely determined by its initial mass and metallicity, a parameter that measures the abundance of elements heavier than hydrogen and helium in its atmosphere. \
The evolution of a binary system is determined by a series of processes that can take place between the two stars and that depend on their initial properties. Examples of such processes are wind mass transfer, supernova explosions, common envelope evolution and Roche-lobe overflow.


<div style="text-align: center;">
    <img src="immagini/sevn.jpg" alt="Image" width="400"/>
</div>


# 2. Datasets

`Write about the structure of the datasets here. Add diagrams/tables if useful`

From SEVN we retrieve the data describing the evolution of a given binary system. The dataset it returns consists of a fixed number of columns and a series of rows. Each row contains the values of the columns at a given time step. The time steps are not uniform: the evolution is computed with an adaptive time-step scheme, which gives finer time resolution where needed. The basic structure of a SEVN dataset is:

<span style="color:red"> insert the dataset head to show the structure</span> 

We have at our disposal two datasets with two different metallicities:
- Z=0.0014, the "low metallicity" dataset
- Z=0.02, the "high metallicity" dataset

Both share the structure discussed above.
We will perform the same analysis on both and compare the results to determine how metallicity affects them.

A given dataset contains multiple simulations of different binary systems, i.e., systems that start with different physical properties. Typically, the simulation of a single system continues until the software detects that it has reached a steady state (for example, the two stars have merged into a single one, or one of the stars has exploded as a supernova). After a simulation ends, another one starts with different initial conditions for the system.
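
The row layout described above can be illustrated with a minimal sketch in plain Python (no Dask required). It assumes, hypothetically, that each row carries an `ID` column identifying the simulated system and a `time` column; these column names and the toy values are illustrative only and are not taken from the actual SEVN output. The sketch groups consecutive rows by system and keeps the last time step of each simulation, i.e. its final state.

```python
import csv
import io
from itertools import groupby

# Toy stand-in for a SEVN output file. Column names and values are hypothetical.
# Note the uneven spacing in `time`, mimicking an adaptive time step.
TOY_CSV = """ID,time,mass_0,mass_1
0,0.0,30.0,20.0
0,2.5,28.1,19.7
0,2.6,12.3,19.6
1,0.0,45.0,10.0
1,4.0,40.2,9.9
"""

def final_states(text):
    """Return the last row (final time step) of every simulated system."""
    rows = csv.DictReader(io.StringIO(text))
    # Rows of one simulation are assumed contiguous, so groupby needs no sort.
    return {
        sys_id: list(group)[-1]
        for sys_id, group in groupby(rows, key=lambda row: row["ID"])
    }

finals = final_states(TOY_CSV)
for sys_id, row in finals.items():
    print(sys_id, row["time"], row["mass_0"], row["mass_1"])
```

On the full datasets the same "group by system, keep the last row" logic would presumably be expressed with a Dask collection rather than an in-memory dictionary; the sketch only fixes the idea.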

# 3. Cluster

`Describe briefly the setup of cluster and file system. Add diagrams if useful`

The cluster consists of three CloudVeneto virtual machines. Each machine has 8 GB of RAM and 4 CPU cores (**check**).
The cluster is created through SSH connections with RSA key authentication, allowing for passwordless communication. The three VMs are aliased as `scheduler` (IP 10.67.22.174), `worker1` (IP 10.67.22.36) and `worker2` (IP 10.67.22.251).

Dask deploys the scheduler and the workers over SSH through `dask.distributed.SSHCluster`, which is based on the `asyncssh` package; a `Client` is then connected to the resulting cluster.

In [None]:
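# This code is run on the scheduler VM

from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    # The first host runs the scheduler; each remaining entry starts a worker
    ["scheduler", "scheduler", "worker1", "worker2"],
    # Skip SSH host-key verification (acceptable only on a trusted private network)
    connect_options={"known_hosts": None},
    scheduler_options={"port": 8786, "dashboard_address": ":8787"},
)

# Connect a client to the cluster so that computations are submitted to it
client = Client(cluster)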

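In this configuration the `scheduler` virtual machine acts as both scheduler and worker, because its alias appears twice in the host list: the first entry starts the scheduler and every following entry starts a worker.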
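
As a minimal illustration of how the host list maps to roles, the following plain-Python sketch mirrors the `SSHCluster` convention that the first address hosts the scheduler and every remaining address hosts a worker. The helper `roles_from_hosts` is hypothetical (it is not part of Dask) and only reproduces the bookkeeping; it does not open any connection.

```python
from collections import defaultdict

def roles_from_hosts(hosts):
    """Map each host to the roles implied by an SSHCluster-style host list.

    Hypothetical helper, not part of Dask: the first entry is treated as the
    scheduler and every following entry as one worker.
    """
    roles = defaultdict(list)
    roles[hosts[0]].append("scheduler")
    for host in hosts[1:]:
        roles[host].append("worker")
    return dict(roles)

hosts = ["scheduler", "scheduler", "worker1", "worker2"]
cluster_roles = roles_from_hosts(hosts)
print(cluster_roles)
```

Listing a host more than once after the first position would, by the same rule, start more than one worker on that machine.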

```
Diagram of the dask network
```

## 3.1 Storage

The three VMs have 25 GB of storage each, so the dataset has to be stored elsewhere. We opted for a 200 GB volume from CloudVeneto. To share the data among all VMs we set up a distributed file system, in our case NFS. This is done by attaching the volume to one of the VMs (the NFS server), formatting it as Linux ext4, mounting it, and then editing the `/etc/exports` file, which specifies the IP addresses of the clients, their permissions and, if clients are allowed to write to the shared file system, the consistency level of writes. Since we have no need to modify the data during processing, we chose a configuration of the type:
```
Paste etc/exports
```
Since the NFS server is also a Dask worker, we did not place it on the `scheduler` VM: that machine already acts as scheduler and worker, and taking on a third role would bottleneck the cluster's performance.
The datasets are then downloaded onto the volume from the Google Drive folder using `gdown`.
```
Diagram of the NFS network
```
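
One way to check that the shared volume is visible from a machine is sketched below. This is not the procedure used in the project, only an illustration under stated assumptions: the function `describe_shared_dir` is a hypothetical helper, and the mount point `/mnt/nfs/data` in the comment is a placeholder rather than the real path.

```python
import os
import socket

def describe_shared_dir(path):
    """Report whether `path` is a readable directory here and what it holds."""
    readable = os.path.isdir(path) and os.access(path, os.R_OK)
    files = sorted(os.listdir(path)) if readable else []
    return {"host": socket.gethostname(), "readable": readable, "files": files}

# With a connected Dask client, the same function could be executed on every
# worker to confirm that they all see the same files, e.g.:
#     client.run(describe_shared_dir, "/mnt/nfs/data")  # hypothetical mount point
# Here it is simply run locally on the current working directory.
report = describe_shared_dir(os.getcwd())
print(report["host"], report["readable"], len(report["files"]))
```

`Client.run` executes a function on all workers outside the task scheduler and returns one result per worker, so differing file lists would point to a worker on which the NFS mount is missing.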

# 4. Data Processing 

## 4.1 Naive method

`Method used for LCP-B before knowing Dask. Report performance`

## 4.2 Optimized method (bags? foldby? delay? etc...)

`Optimize naive method as much as possible. Again report performance`

## 4.3 (optional) File System behavior

`Try to understand the way dask and NFS handle data transfer, processing, etc... Especially considering that the NFS server is assigned some processing tasks and the scheduler is assigned processing tasks.`

# 5. Benchmarking

`Benchmark the best algorithm by collecting statistics on the time performance at varying numbers of threads, workers, and partitions`