# 09a. Demo analysis - local part

## Overview

In this notebook and its remote counterpart `09b`, you will learn how to:

 - Download a large quantity of CSV data for analysis.
 - Load the data using Dask on the cluster.
 - Convert the data to a more suitable format: Apache Parquet.
 - Load the data from Parquet.
 - Perform a simple data analysis.

## Install an SSH client

For this tutorial, you will need an SSH client to connect to the cluster. OpenSSH is most likely already available on Linux and Windows 10; on Windows, PuTTY will work too.

## Import idact

It's recommended to install *idact* with *pip*. Alternatively, make sure the dependencies are installed with `pip install -r requirements.txt`, and add *idact* to the Python path, for example:

In [1]:
import sys
sys.path.append('../')

We will use a wildcard import for convenience:

In [2]:
from idact import *
import bitmath

## Load the cluster

Let's load the environment and the cluster. Make sure to use your cluster name.

In [3]:
load_environment()
cluster = show_cluster("hpc")
cluster

Cluster(pro.cyfronet.pl, 22, plggarstka, auth=AuthMethod.PUBLIC_KEY, key='C:\\Users\\Maciej/.ssh\\id_rsa_6p', install_key=False, disable_sshd=False)

In [4]:
access_node = cluster.get_access_node()
access_node.connect()

## Find the data to analyze

Many open datasets are available online for free. In many cases you still need to pay for the bandwidth, especially if the dataset is larger than a few gigabytes. In some cases, especially when the data comes from government agencies, it is available completely free of charge.

I will use the New York City Taxi & Limousine Commission Trip Record Data (yellow) for the years 2010-2014, available [here](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).
The format changed slightly starting in 2015, so we won't worry about the newer data for now.

For the years we're interested in, there is one CSV file per month, so we have 12\*5=60 CSV files, with a total size of 143 GiB.
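As a quick sanity check of these figures, the following sketch recomputes the file count and derives the average file size. The average is only a rough estimate that assumes the quoted 143 GiB total; individual files will differ in size.

```python
# Back-of-the-envelope figures for the dataset described above.
years = range(2010, 2015)        # 2010..2014 inclusive
months_per_year = 12
total_size_gib = 143             # total size quoted in the text

file_count = len(years) * months_per_year
average_size_gib = total_size_gib / file_count

print(file_count)                   # 60
print(round(average_size_gib, 2))   # 2.38
```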

## Download the data

We will download the data straight to the cluster by logging in to a compute node through SSH.

Let's allocate the node. We will download two years at a time, that is 24 monthly files in parallel, so let's get 24 cores for an hour, though the download shouldn't take that long.

In [5]:
nodes = cluster.allocate_nodes(nodes=1,
                               cores=24,
                               memory_per_node=bitmath.GiB(120),
                               walltime=Walltime(hours=1),
                               native_args={
                                   '--account': 'intdata',
                                   '--partition': 'plgrid-testing'
                               })
nodes

2018-12-02 05:18:48 INFO: Installing key in '.ssh/authorized_keys.idact' for access to compute nodes.
2018-12-02 05:18:48 INFO: Creating the ssh directory.


Nodes([Node(NotAllocated)], SlurmAllocation(job_id=14399244))

In [6]:
nodes.wait()
nodes

2018-12-02 05:18:56 INFO: Still pending or configuring...


Nodes([Node(p0007:58019, 2018-12-02 05:18:54.856900+00:00)], SlurmAllocation(job_id=14399244))

Let's log in to the node by creating a tunnel:

In [7]:
tunnel = nodes[0].tunnel_ssh()
tunnel

ssh -i "C:\Users\Maciej/.ssh\id_rsa_6p" -p 58019 plggarstka@localhost

If you have OpenSSH, the command printed above should work. Otherwise, you need to copy the key path, host and port to PuTTY.
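If you need to enter the values into PuTTY by hand, a small sketch like the one below can pull them out of the printed command. The helper `parse_ssh_command` is hypothetical and not part of *idact*; it assumes the command has exactly the shape shown above.

```python
import re


def parse_ssh_command(command):
    """Extract key path, port, user and host from the printed SSH command.

    Hypothetical helper, not part of idact. Assumes the exact shape:
    ssh -i "KEY" -p PORT USER@HOST
    """
    pattern = r'ssh -i "([^"]+)" -p (\d+) ([^@\s]+)@(\S+)'
    match = re.fullmatch(pattern, command.strip())
    if match is None:
        raise ValueError(f"Unrecognized command format: {command!r}")
    key, port, user, host = match.groups()
    return {"key": key, "port": int(port), "user": user, "host": host}


# The command printed by the tunnel in this notebook run.
command = r'ssh -i "C:\Users\Maciej/.ssh\id_rsa_6p" -p 58019 plggarstka@localhost'
settings = parse_ssh_command(command)
for name, value in settings.items():
    print(f"{name}: {value}")
```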

Once on the node, let's pick a directory to download the data into. Depending on the cluster and available resources, you may have a team storage area for persistent data.

On my cluster, there is also temporary (30-day) personal storage, whose location is given by the environment variable `$SCRATCH`. I will use it for now.

Let's create a directory for the data:
```
cd $SCRATCH && mkdir taxi && cd taxi
```

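Then, download the data. I downloaded the CSV files using `wget` in batches of 24, one batch per group of commands below. The trailing `&` runs each download in the background, so the files in a batch are downloaded in parallel.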

```
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-01.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-03.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-04.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-05.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-06.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-07.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-08.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-09.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-10.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-11.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-12.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-01.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-02.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-03.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-04.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-05.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-06.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-07.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-08.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-09.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-10.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-11.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-12.csv &

wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-01.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-02.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-03.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-04.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-05.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-06.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-07.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-08.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-09.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-10.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-11.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2012-12.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-01.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-02.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-03.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-04.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-05.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-06.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-07.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-08.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-09.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-10.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-11.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2011-12.csv &

wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-01.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-02.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-04.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-05.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-06.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-07.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-08.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-09.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-10.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-11.csv &
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-12.csv &
```
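The list of commands above follows a regular pattern, so the URLs can also be generated rather than typed out. The sketch below only builds the URL strings and splits them into batches of 24; it does not download anything. The names `monthly_urls` and `batches` are made up for illustration, and the base URL is simply copied from the commands above.

```python
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data"


def monthly_urls(first_year, last_year):
    """Return one URL per month, newest year first, as in the listing above."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.csv"
        for year in range(last_year, first_year - 1, -1)
        for month in range(1, 13)
    ]


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


urls = monthly_urls(2010, 2014)
for batch in batches(urls, 24):
    first_name = batch[0].rsplit("/", 1)[-1]
    last_name = batch[-1].rsplit("/", 1)[-1]
    print(len(batch), first_name, "...", last_name)
```

With the years used here, this yields two batches of 24 files and a final batch of 12, matching the three groups of `wget` commands.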

## Install fastparquet on the cluster

We will need `fastparquet` on the cluster, so while you have access to the compute node, install
it in your Python environment, e.g.:
```
pip install fastparquet --user
```
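To confirm that the installation is visible to Python, you could run a check like the following on the cluster, in the same Python environment. This is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


print("fastparquet importable:", is_importable("fastparquet"))
```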

## Cancel the download node allocation

Once the downloads and the installation have finished, we no longer need the node. Let's close the SSH tunnel and cancel the allocation.

In [8]:
tunnel.close()
nodes.cancel()

2018-12-02 05:19:24 INFO: Cancelling job 14399244.


## Allocate nodes for conversion from CSV to Apache Parquet

Let's allocate a few nodes.

In [9]:
nodes = cluster.allocate_nodes(nodes=6,
                               cores=24,
                               memory_per_node=bitmath.GiB(120),
                               walltime=Walltime(hours=1),
                               native_args={
                                   '--account': 'intdata',
                                   '--partition': 'plgrid-testing'
                               })

2018-12-02 05:19:30 INFO: Installing key in '.ssh/authorized_keys.idact' for access to compute nodes.
2018-12-02 05:19:30 INFO: Creating the ssh directory.


In [11]:
nodes.wait()
nodes

2018-12-02 05:22:56 INFO: Still pending or configuring...
2018-12-02 05:23:00 INFO: Still pending or configuring...
2018-12-02 05:23:05 INFO: Still pending or configuring...
2018-12-02 05:23:09 INFO: Still pending or configuring...
2018-12-02 05:23:14 INFO: Still pending or configuring...
2018-12-02 05:23:18 INFO: Still pending or configuring...
2018-12-02 05:23:22 INFO: Still pending or configuring...
2018-12-02 05:23:26 INFO: Still pending or configuring...
2018-12-02 05:23:31 INFO: Still pending or configuring...
2018-12-02 05:23:35 INFO: Still pending or configuring...
2018-12-02 05:23:39 INFO: Still pending or configuring...
2018-12-02 05:23:45 INFO: Still pending or configuring...
2018-12-02 05:23:52 INFO: Still pending or configuring...
2018-12-02 05:23:58 INFO: Still pending or configuring...
2018-12-02 05:24:04 INFO: Still pending or configuring...
2018-12-02 05:24:11 INFO: Still pending or configuring...
2018-12-02 05:24:18 INFO: Still pending or configuring...
2018-12-02 05:

Nodes([Node(p0109:59529, 2018-12-02 05:26:06.298927+00:00),Node(p0110:60911, 2018-12-02 05:26:06.298927+00:00),Node(p0111:58340, 2018-12-02 05:26:06.298927+00:00),Node(p0112:39446, 2018-12-02 05:26:06.298927+00:00),Node(p0113:36112, 2018-12-02 05:26:06.298927+00:00),Node(p0114:52334, 2018-12-02 05:26:06.298927+00:00)], SlurmAllocation(job_id=14399245))

Deploy a Jupyter Notebook:

In [12]:
nb = nodes[0].deploy_notebook()
nb

JupyterDeployment(8080 -> Node(p0109:59529, 2018-12-02 05:26:06.298927+00:00)

Then, Dask:

In [13]:
dd = deploy_dask(nodes)
dd

2018-12-02 05:26:46 INFO: Deploying Dask on 6 nodes.
2018-12-02 05:26:46 INFO: Connecting to p0109:59529 (1/6).
2018-12-02 05:26:46 INFO: Connecting to p0110:60911 (2/6).
2018-12-02 05:26:48 INFO: Connecting to p0111:58340 (3/6).
2018-12-02 05:26:54 INFO: Connecting to p0112:39446 (4/6).
2018-12-02 05:27:01 INFO: Connecting to p0113:36112 (5/6).
2018-12-02 05:27:03 INFO: Connecting to p0114:52334 (6/6).
2018-12-02 05:27:05 INFO: Deploying scheduler on the first node: p0109.
2018-12-02 05:28:24 INFO: Retried and failed: config.retries[Retry.OPEN_TUNNEL].{count=3, seconds_between=5}
2018-12-02 05:28:24 ERROR: Failure: Adding last hop.
2018-12-02 05:28:50 INFO: Checking scheduler connectivity from p0109 (1/6).
2018-12-02 05:28:50 INFO: Checking scheduler connectivity from p0110 (2/6).
2018-12-02 05:28:50 INFO: Checking scheduler connectivity from p0111 (3/6).
2018-12-02 05:28:50 INFO: Checking scheduler connectivity from p0112 (4/6).
2018-12-02 05:28:50 INFO: Checking scheduler connectivi

DaskDeployment(scheduler=tcp://localhost:53769/tcp://172.20.64.109:46289, workers=6)

Push the nodes and Dask deployment, because we'll use them on the cluster:

In [14]:
cluster.clear_pushed_deployments()
cluster.push_deployment(nodes)
cluster.push_deployment(dd)

2018-12-02 05:30:52 INFO: Clearing deployments.
2018-12-02 05:30:55 INFO: Pushing deployment: Nodes([Node(p0109:59529, 2018-12-02 05:26:06.298927+00:00),Node(p0110:60911, 2018-12-02 05:26:06.298927+00:00),Node(p0111:58340, 2018-12-02 05:26:06.298927+00:00),Node(p0112:39446, 2018-12-02 05:26:06.298927+00:00),Node(p0113:36112, 2018-12-02 05:26:06.298927+00:00),Node(p0114:52334, 2018-12-02 05:26:06.298927+00:00)], SlurmAllocation(job_id=14399245))
2018-12-02 05:31:00 INFO: Pushing deployment: DaskDeployment(scheduler=tcp://localhost:53769/tcp://172.20.64.109:46289, workers=6)


## Open the Dask Dashboard

Get a Dask client. Its summary includes the address of the scheduler dashboard, which you can open in a browser:

In [15]:
client = dd.get_client()
client

Client: Scheduler: tcp://localhost:53769, Dashboard: http://localhost:33300/status
Cluster: Workers: 6, Cores: 144, Memory: 773.09 GB


There is nothing interesting there for now, but we will observe what happens when we load the data later.
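The totals reported by the client can be reproduced from the allocation parameters. The sketch below assumes the client reports memory in decimal gigabytes, while the allocation was requested in GiB; under that assumption the numbers agree with the output above.

```python
# Allocation parameters used for the conversion nodes in this notebook.
node_count = 6
cores_per_node = 24
memory_per_node_gib = 120

total_cores = node_count * cores_per_node

# Convert GiB (2**30 bytes) to GB (10**9 bytes); assumes decimal GB in the report.
total_memory_bytes = node_count * memory_per_node_gib * 2**30
total_memory_gb = total_memory_bytes / 10**9

print(total_cores)                  # 144
print(round(total_memory_gb, 2))    # 773.09
```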

You can also browse the worker dashboards, if you want:

In [16]:
dd.diagnostics.addresses
# dd.diagnostics.open_all()

['http://localhost:33300/status',
 'http://localhost:55762/main',
 'http://localhost:43330/main',
 'http://localhost:60271/main',
 'http://localhost:55581/main',
 'http://localhost:59782/main',
 'http://localhost:51605/main']

We don't need the client anymore here:

In [17]:
client.close()

## Copy notebook `09b` to the cluster

Drag and drop `09b-Demo_analysis_-_remote_part.ipynb` to the deployed notebook, and open it there.

In [18]:
nb.open_in_browser()

## Follow the instructions in notebook `09b`

Follow the instructions until you are referred back to this notebook.

## Cancel the allocation

It's important to cancel an allocation if you're done with it early, in order to minimize the CPU time you are charged for.

In [None]:
nodes.running()

In [None]:
nodes.cancel()

In [None]:
nodes.running()
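To get a feel for how much an early cancellation can save, here is a rough estimate in core-hours. It assumes that usage is accounted as cores multiplied by elapsed wall time, which may not match your cluster's actual accounting policy, and the 20-minute figure is purely hypothetical.

```python
from datetime import timedelta


def core_hours(node_count, cores_per_node, elapsed):
    """Core-hours used, assuming accounting of cores times elapsed wall time."""
    return node_count * cores_per_node * elapsed.total_seconds() / 3600


# Full walltime of the allocation above versus a hypothetical early cancel.
full = core_hours(6, 24, timedelta(hours=1))
early = core_hours(6, 24, timedelta(minutes=20))

print(full)           # 144.0
print(early)          # 48.0
print(full - early)   # 96.0
```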