In [1]:
import os
import sys

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

from dask.distributed import Client
import dask.dataframe as dd


pd.options.display.max_rows = 10004
matplotlib.style.use("dark_background")

In [2]:
sys.path.insert(0, os.path.abspath('/opt/vssexclude/personal/kaggle/volcano/src/'))

%load_ext autoreload
%autoreload 2

## Description of the Data

The data used in this notebook comes from the Kaggle competition [INGV - Volcanic Eruption Prediction](https://www.kaggle.com/c/predict-volcanic-eruptions-ingv-oe).

We will explore the files under the `train` and `test` directories. Each file contains ten minutes of logs from ten different sensors arrayed around a volcano. There are 4432 data files under the `train` directory and 4521 files under the `test` directory, and each file consists of about 60K lines. On disk, the two directories together occupy 30 GB (15 GB + 15 GB).

## Challenges with Large Data

As data scientists, we encounter two major challenges when dealing with such a large volume of data:

1. Limited Processing Power: Python's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, and most Pandas operations are single-threaded. As a result, a typical Pandas or NumPy workflow keeps only one processor busy, even when multiple processors are available.

2. Limited Memory: The RAM of a workstation is often limited to 16 or 32 GB, so loading all the data files into memory at once is not feasible. Even disk space is typically limited to around 2 TB.

## Why Dask?
Dask is a framework designed to overcome these limitations:

1. Parallelization using Multiple Cores (available in a single computer or distributed across multiple computers)
2. Out-of-Core Computing: If the data is larger than the main memory (RAM), Dask does not load all of it into memory at once. It streams the data from disk as and when needed. If the data doesn't fit on the disk of a single computer, it can be distributed across multiple computers.
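
The out-of-core idea in point 2 can be illustrated without Dask. The following is a minimal sketch in plain Python, not Dask's actual implementation: it computes the mean of one column across many CSV files while holding only one row in memory at a time. The function name `streaming_column_mean` and the column name `sensor_1` used in the usage note are assumptions made for this illustration.

```python
import csv


def streaming_column_mean(csv_paths, column):
    """Mean of `column` across several CSV files, read one row at a time."""
    total = 0.0
    count = 0
    for path in csv_paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                value = row[column]
                if not value:  # skip missing or empty readings
                    continue
                total += float(value)
                count += 1
    # No valid values at all: report NaN rather than dividing by zero.
    return total / count if count else float("nan")
```

A call such as `streaming_column_mean(paths, "sensor_1")`, where `paths` is a list of CSV file paths, keeps memory use constant no matter how many files are passed in, because only the running total and count are retained. Dask applies the same principle to whole partitions of data rather than single rows, and can additionally process those partitions on several cores.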

Dask can scale to thousand-machine clusters to handle hundreds of terabytes of data. It also works efficiently on a single machine, enabling analysis of moderately large datasets (100GB+) on relatively low-power laptops.

#### How many Processors do I have?

In [3]:
os.cpu_count()

10

This is the number of logical processors, which is usually double the number of physical CPU cores because hyperthreading is enabled on most computers.

**Hyperthreading** presents each physical core to the operating system as two logical cores. On my Windows laptop, I have 6 physical cores but 12 logical processors. These 12 logical processors will not give a 12x improvement over a single physical core, though. Where two physical cores would give a 2x improvement, the two logical cores of a single hyperthreaded core generally give only around 1.25x to 1.3x.
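
The arithmetic can be made concrete with a rough estimate. This is only a back-of-the-envelope sketch: the function name `estimated_core_equivalents` is hypothetical, and the 1.25x to 1.3x gain is the rule of thumb quoted above, not a measured value.

```python
def estimated_core_equivalents(physical_cores, hyperthreading_gain=1.25):
    """Rough throughput of a hyperthreaded CPU, in units of one physical core.

    Each physical core is assumed to deliver `hyperthreading_gain` times the
    throughput it would have without hyperthreading.
    """
    return physical_cores * hyperthreading_gain


# 6 physical cores exposed as 12 logical processors
low = estimated_core_equivalents(6, 1.25)
high = estimated_core_equivalents(6, 1.3)
print(f"{low:.1f} to {high:.1f} core-equivalents")  # 7.5 to 7.8 core-equivalents
```

Under this estimate, the 12 logical processors behave more like 7.5 to 7.8 independent cores than like 12.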

## Dask Architecture

<img src="../images/dask_architechture_diagram.png" width="600" height="200" style="border-style: solid;">

#### What is a Client?

The Client connects users to a Dask cluster. After a Dask cluster is set up, we initialize a Client by pointing it to the address of a Scheduler:

```python
from dask.distributed import Client
client = Client("1.2.3.4:8786")
```



#### Start Dask Client

In [37]:
# client = Client(n_workers=10, memory_limit='2.5GB')
client = Client(n_workers=10)

Here we are creating a Client without specifying the scheduler (cluster) address. In this case, the Client creates a `LocalCluster` in the background and connects to that. Any computation will automatically use this `LocalCluster`.

The above code is effectively the same as the following:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=10)
client = Client(cluster)
```

A client can be closed using:

```python
client.close()
```
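
Alternatively, both `LocalCluster` and `Client` can be used as context managers, so they are shut down automatically even if the computation raises an exception. The sketch below is a toy example, not part of the original workflow: `square` and `sum_of_squares` are hypothetical names, and `processes=False` (thread-based workers) and `dashboard_address=None` (no dashboard) are choices made here only to keep the example lightweight.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    """Toy task standing in for real per-file work."""
    return x * x


def sum_of_squares(values, n_workers=2):
    """Square each value on a temporary local cluster and sum the results."""
    with LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,
        processes=False,         # workers run as threads in this process
        dashboard_address=None,  # skip starting the diagnostics dashboard
    ) as cluster:
        with Client(cluster) as client:
            futures = client.map(square, values)  # one task per value
            results = client.gather(futures)      # block until all finish
    # Leaving the `with` blocks closes the client, then the cluster.
    return sum(results)
```

Calling `sum_of_squares([1, 2, 3, 4])` returns 30, and no explicit `client.close()` is needed afterwards.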


In [39]:
client.close()

In [None]:
from dask.distributed import Client, LocalCluster

# Set up a local cluster with 4 single-threaded workers.
# memory_limit is applied per worker.
cluster = LocalCluster(n_workers=4,
                       threads_per_worker=1,
                       memory_limit='64GB')
client = Client(cluster)
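
Because `memory_limit` is applied to each worker, the total memory the cluster may use is the per-worker limit multiplied by the number of workers. A small sketch of that arithmetic follows. `total_memory_budget` is a hypothetical helper, while `parse_bytes` is Dask's own utility for converting strings such as `'64GB'` into a number of bytes.

```python
from dask.utils import parse_bytes


def total_memory_budget(n_workers, memory_limit):
    """Total bytes a cluster may use, given a per-worker memory limit."""
    return n_workers * parse_bytes(memory_limit)


budget = total_memory_budget(n_workers=4, memory_limit="64GB")
print(f"{budget / 1e9:.0f} GB")  # 256 GB
```

If the result exceeds the machine's physical RAM, the limits may not protect the machine from running out of memory, so it is worth comparing the two before starting the cluster.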