### Setup for Log Files

Create the directory where the SLURM worker jobs will write their log files.

In [1]:
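from pathlib import Path
import os

# Keep the worker logs in a dedicated directory under the user's home directory.
home = str(Path.home())
dasklogs = f"{home}/dask-test-logs"
# exist_ok=True makes this cell safe to re-run when the directory already exists.
os.makedirs(dasklogs, exist_ok=True)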

### Initialize the Slurm Cluster

`SLURMCluster` from dask-jobqueue takes the resources requested for each SLURM job as keyword arguments. [Other parameters](https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html) besides the ones below can also be specified.

In [2]:
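from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=4,                        # CPU cores per SLURM job
    memory="8GB",                   # memory per SLURM job
    processes=2,                    # worker processes started in each job
    queue="normal",                 # SLURM partition (#SBATCH -p)
    shebang="#!/usr/bin/env bash",
    local_directory="/tmp",         # scratch space for the workers
    death_timeout="15s",            # how long a worker waits for the scheduler before closing
    interface="ib0",                # network interface used for communication
    log_directory=dasklogs,         # where the jobs' .out and .err files are written
    project="boc",                  # SLURM account (#SBATCH -A)
)

# Connect a client to the cluster's scheduler.
client = Client(cluster)
client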

Perhaps you already have a cluster running?
Hosting the HTTP server on port 36272 instead


Connection method: Cluster object
Cluster type: SLURMCluster
Dashboard: http://10.55.50.17:36272/status

Dashboard: http://10.55.50.17:36272/status
Workers: 0
Total threads: 0
Total memory: 0 B

Comm: tcp://10.55.50.17:45360
Workers: 0
Dashboard: http://10.55.50.17:36272/status
Total threads: 0
Started: Just now
Total memory: 0 B


### Slurm Job Script

Displaying the job script that the SLURMCluster created above submits for each job.

In [3]:
print(cluster.job_script())

#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -e /home/asd/stha/dask-test-logs/dask-worker-%J.err
#SBATCH -o /home/asd/stha/dask-test-logs/dask-worker-%J.out
#SBATCH -p normal
#SBATCH -A boc
#SBATCH -n 1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH -t 00:30:00

/usr/local/tools/anaconda3/2021.05/bin/python -m distributed.cli.dask_worker tcp://10.55.50.17:45360 --nthreads 2 --nprocs 2 --memory-limit 3.73GiB --name dummy-name --nanny --death-timeout 15s --local-directory /tmp --interface ib0 --protocol tcp://
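
The `--nthreads 2` and `--memory-limit 3.73GiB` values in the job script follow from the `cores`, `memory`, and `processes` arguments given earlier. The arithmetic can be sketched as below, assuming that `"8GB"` is read as a decimal size (8 × 10^9 bytes) and that cores and memory are split evenly across the worker processes. The variable names are illustrative only.

```python
# Values passed to SLURMCluster in the notebook.
cores = 4            # CPU cores requested per SLURM job
memory_bytes = 8e9   # "8GB", assumed decimal: 8 * 10**9 bytes
processes = 2        # worker processes started inside each job

# Each worker process gets an equal share of the job's cores and memory.
threads_per_process = cores // processes
memory_per_process_gib = memory_bytes / processes / 2**30

print(threads_per_process)                 # 2
print(f"{memory_per_process_gib:.2f}")     # 3.73
```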



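### Scaling the cluster to 1 node

`cluster.scale(2)` requests two workers. Because each job runs two worker processes (`processes=2`), this corresponds to a single SLURM job.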

In [4]:
cluster.scale(2)
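
More generally, the number of jobs submitted for a given worker count can be approximated by a ceiling division. The helper `jobs_needed` below is hypothetical and not part of dask-jobqueue; it only illustrates the relationship.

```python
import math


def jobs_needed(n_workers: int, processes_per_job: int) -> int:
    """Hypothetical helper: SLURM jobs required to host n_workers workers."""
    return math.ceil(n_workers / processes_per_job)


print(jobs_needed(2, 2))  # 1
print(jobs_needed(5, 2))  # 3
```

This is consistent with the client output later in the notebook, where both workers run on the same host.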

### Reading multiple sources of data into multiple dataframes

Each of the three CSV files is read into its own Dask dataframe with the read_csv() function.

In [5]:
import dask.dataframe as dd

# 'Active' is read as float64 explicitly instead of relying on Dask's dtype inference.
df = dd.read_csv('data/010121.csv', dtype={'Active': 'float64'})
df2 = dd.read_csv('data/020121.csv', dtype={'Active': 'float64'})
df3 = dd.read_csv('data/030121.csv', dtype={'Active': 'float64'})

### Displaying the 3 dataframes

The head() function returns the first rows of a dataframe (five by default).

In [7]:
df.head()

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,Afghanistan,02/01/2021 5:22,33.93911,67.709953,51526,2191,41727,0.0,Afghanistan,0.0,4.252222
1,,Albania,02/01/2021 5:22,41.1533,20.1683,58316,1181,33634,23501.0,Albania,2026.409062,2.025173
2,,Algeria,02/01/2021 5:22,28.0339,1.6596,99897,2762,67395,29740.0,Algeria,227.809861,2.764848
3,,Andorra,02/01/2021 5:22,42.5063,1.5218,8117,84,7463,570.0,Andorra,10505.40348,1.034865
4,,Angola,02/01/2021 5:22,-11.2027,17.8739,17568,405,11146,6017.0,Angola,53.452981,2.305328


In [8]:
df2.head()

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,Afghanistan,02/02/2021 5:22,33.93911,67.709953,55059,2404,47723,4932.0,Afghanistan,141.436801,4.366225
1,,Albania,02/02/2021 5:22,41.1533,20.1683,78992,1393,47922,29677.0,Albania,2744.874557,1.76347
2,,Algeria,02/02/2021 5:22,28.0339,1.6596,107578,2894,73530,31154.0,Algeria,245.325978,2.690141
3,,Andorra,02/02/2021 5:22,42.5063,1.5218,9972,101,9206,665.0,Andorra,12906.2318,1.012836
4,,Angola,02/02/2021 5:22,-11.2027,17.8739,19829,466,18180,1183.0,Angola,60.332375,2.350093


In [9]:
df3.head()

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,Afghanistan,02/03/2021 5:23,33.93911,67.709953,55733,2444,49344,3945.0,Afghanistan,143.168187,4.385194
1,,Albania,02/03/2021 5:23,41.1533,20.1683,107931,1816,70413,35702.0,Albania,3750.469108,1.682556
2,,Algeria,02/03/2021 5:23,28.0339,1.6596,113255,2987,78234,32034.0,Algeria,258.272078,2.637411
3,,Andorra,02/03/2021 5:23,42.5063,1.5218,10889,110,10475,304.0,Andorra,14093.05636,1.010194
4,,Angola,02/03/2021 5:23,-11.2027,17.8739,20854,508,19400,946.0,Angola,63.451074,2.435984


### Mean Calculations

The mean() function calculates the mean of a selected dataframe column. Dask evaluates lazily, so the value is only calculated when compute() is called.

In [10]:
%%time

# Calculating the mean
print("Mean of confirmed global COVID-19 cases reported 01/01/2021:  " + str(df.Confirmed.mean().compute()))
print("Mean of reported global COVID-19 deaths reported 02/01/2021:  " + str(df2.Deaths.mean().compute()))
print("Mean of confirmed active global COVID-19 cases reported 03/01/2021:  " + str(df3.Active.mean().compute()))

Mean of confirmed global COVID-19 cases reported 01/01/2021:  21119.139307228917
Mean of reported global COVID-19 deaths reported 02/01/2021:  579.2197140707299
Mean of confirmed active global COVID-19 cases reported 03/01/2021:  11862.095859473024
CPU times: user 50.7 ms, sys: 5.11 ms, total: 55.8 ms
Wall time: 195 ms
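
A mean can be computed over partitioned data without gathering all rows in one place: each partition reports a sum and a count, and only those small results are combined. The sketch below shows that idea in plain Python. It is not Dask's actual implementation, and the two-partition split of the five `Confirmed` values from `df.head()` is made up for illustration.

```python
import math


def partition_stats(values):
    """Return the (sum, count) pair for one partition."""
    return sum(values), len(values)


def combined_mean(partitions):
    """Combine per-partition sums and counts into one overall mean."""
    stats = [partition_stats(p) for p in partitions]
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count


# Confirmed values of the first five rows of df, split into two toy partitions.
partitions = [[51526, 58316], [99897, 8117, 17568]]
print(combined_mean(partitions))  # 47084.8
```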


In [11]:
client

Connection method: Cluster object
Cluster type: SLURMCluster
Dashboard: http://10.55.50.17:36272/status

Dashboard: http://10.55.50.17:36272/status
Workers: 2
Total threads: 4
Total memory: 7.46 GiB

Comm: tcp://10.55.50.17:45360
Workers: 2
Dashboard: http://10.55.50.17:36272/status
Total threads: 4
Started: Just now
Total memory: 7.46 GiB

Comm: tcp://10.55.50.22:39377
Total threads: 2
Dashboard: http://10.55.50.22:41378/status
Memory: 3.73 GiB
Nanny: tcp://10.55.50.22:43271
Local directory: /tmp/dask-worker-space/worker-dbey0wt3

Comm: tcp://10.55.50.22:41940
Total threads: 2
Dashboard: http://10.55.50.22:44012/status
Memory: 3.73 GiB
Nanny: tcp://10.55.50.22:37163
Local directory: /tmp/dask-worker-space/worker-dwec7mir


### Merging the Data

The merge() function combines the dataframes into one. Calls to merge() can be chained with a period, once for each additional dataframe. Because no `on` argument is given, the join uses every column the dataframes have in common, and the default inner join keeps only rows whose values match in all of those columns.

In [12]:
%%time

result = df.merge(df2).merge(df3)
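
To see why a merge without an `on` argument returns so few rows, the plain-Python sketch below performs an inner join on every shared key. The function `inner_join_on_shared` and the two-column toy rows are hypothetical; only the Angola `Confirmed` values are taken from the tables in this notebook.

```python
def inner_join_on_shared(left, right):
    """Inner-join two lists of dicts on every key they have in common."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    joined = []
    for l_row in left:
        for r_row in right:
            # Keep the pair only if all shared columns hold equal values.
            if all(l_row[k] == r_row[k] for k in shared):
                joined.append({**l_row, **r_row})
    return joined


day1 = [
    {"Country_Region": "Angola", "Confirmed": 17568},
    {"Country_Region": "US", "Confirmed": 0},
]
day2 = [
    {"Country_Region": "Angola", "Confirmed": 19829},
    {"Country_Region": "US", "Confirmed": 0},
]
print(inner_join_on_shared(day1, day2))
# [{'Country_Region': 'US', 'Confirmed': 0}]
```

The Angola row drops out because its `Confirmed` value changed between the two days. Only rows that are identical in every shared column survive.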

In [13]:
client

Connection method: Cluster object
Cluster type: SLURMCluster
Dashboard: http://10.55.50.17:36272/status

Dashboard: http://10.55.50.17:36272/status
Workers: 2
Total threads: 4
Total memory: 7.46 GiB

Comm: tcp://10.55.50.17:45360
Workers: 2
Dashboard: http://10.55.50.17:36272/status
Total threads: 4
Started: Just now
Total memory: 7.46 GiB

Comm: tcp://10.55.50.22:39377
Total threads: 2
Dashboard: http://10.55.50.22:41378/status
Memory: 3.73 GiB
Nanny: tcp://10.55.50.22:43271
Local directory: /tmp/dask-worker-space/worker-dbey0wt3

Comm: tcp://10.55.50.22:41940
Total threads: 2
Dashboard: http://10.55.50.22:44012/status
Memory: 3.73 GiB
Nanny: tcp://10.55.50.22:37163
Local directory: /tmp/dask-worker-space/worker-dwec7mir


In [14]:
result.head()

Unnamed: 0,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,Diamond Princess,Canada,21/12/2020 13:27,,,0,1,0,,"Diamond Princess, Canada",,
1,Grand Princess,Canada,21/12/2020 13:27,,,13,0,13,0.0,"Grand Princess, Canada",,0.0
2,Alabama,US,21/12/2020 13:27,,,0,0,0,0.0,"Out of AL, Alabama, US",,
3,Alabama,US,21/12/2020 13:27,,,0,0,0,0.0,"Unassigned, Alabama, US",,
4,Diamond Princess,US,04/08/2020 2:27,,,49,0,0,49.0,"Diamond Princess, US",,0.0


The groupby() function groups the merged rows by the specified columns, and the sum() function totals the numeric columns within each group. reset_index() turns the group keys back into ordinary columns, and compute() returns the result.

In [15]:
%%time
result.groupby(['Province_State', 'Country_Region']).sum().reset_index().compute()

CPU times: user 38 ms, sys: 3.56 ms, total: 41.5 ms
Wall time: 164 ms


Unnamed: 0,Province_State,Country_Region,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incident_Rate,Case_Fatality_Ratio
0,Alabama,US,0.0,0.0,0,0,0,0.0,0.0,0.0
1,Diamond Princess,Canada,0.0,0.0,0,1,0,0.0,0.0,0.0
2,Diamond Princess,US,0.0,0.0,49,0,0,49.0,0.0,0.0
3,Grand Princess,Canada,0.0,0.0,13,0,13,0.0,0.0,0.0
4,Grand Princess,US,0.0,0.0,103,3,0,100.0,0.0,2.912621
5,Hawaii,US,0.0,0.0,0,0,0,0.0,0.0,0.0
6,Maine,US,0.0,0.0,0,0,0,0.0,0.0,0.0
7,Montana,US,0.0,0.0,0,0,0,0.0,0.0,0.0
8,Virginia,US,0.0,0.0,0,0,0,0.0,0.0,0.0
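
The grouping step can be illustrated in plain Python with a dictionary keyed by the group columns. This is a sketch of the concept, not of how Dask executes it. The rows are the `Province_State`, `Country_Region`, and `Confirmed` values from `result.head()`.

```python
from collections import defaultdict

# (Province_State, Country_Region, Confirmed) from the first five merged rows.
rows = [
    ("Diamond Princess", "Canada", 0),
    ("Grand Princess", "Canada", 13),
    ("Alabama", "US", 0),
    ("Alabama", "US", 0),
    ("Diamond Princess", "US", 49),
]

# Accumulate one running total per (Province_State, Country_Region) pair.
totals = defaultdict(int)
for province, country, confirmed in rows:
    totals[(province, country)] += confirmed

for key in sorted(totals):
    print(key, totals[key])
# ('Alabama', 'US') 0
# ('Diamond Princess', 'Canada') 0
# ('Diamond Princess', 'US') 49
# ('Grand Princess', 'Canada') 13
```

The two Alabama rows collapse into one group, which matches the first rows of the table above.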


In [16]:
client

Connection method: Cluster object
Cluster type: SLURMCluster
Dashboard: http://10.55.50.17:36272/status

Dashboard: http://10.55.50.17:36272/status
Workers: 2
Total threads: 4
Total memory: 7.46 GiB

Comm: tcp://10.55.50.17:45360
Workers: 2
Dashboard: http://10.55.50.17:36272/status
Total threads: 4
Started: Just now
Total memory: 7.46 GiB

Comm: tcp://10.55.50.22:39377
Total threads: 2
Dashboard: http://10.55.50.22:41378/status
Memory: 3.73 GiB
Nanny: tcp://10.55.50.22:43271
Local directory: /tmp/dask-worker-space/worker-dbey0wt3

Comm: tcp://10.55.50.22:41940
Total threads: 2
Dashboard: http://10.55.50.22:44012/status
Memory: 3.73 GiB
Nanny: tcp://10.55.50.22:37163
Local directory: /tmp/dask-worker-space/worker-dwec7mir
