# Dask Tutorial
### By Michelle Lam, Elke Windschitl, & Michael Zargari for EDS-217

This is a tutorial on how to use the Python library Dask. Dask is a tool for scaling Python data libraries such as NumPy, pandas, and scikit-learn. This means that libraries designed for data that fits in memory can be scaled up for use on big data. Dask runs on a single laptop as well as on clusters and cloud platforms, so users can start small and scale up to cloud computing.

Visit https://www.dask.org/ to learn more about Dask.

### Why use Dask?
Dask is useful when working with big data sets, which environmental data scientists may encounter frequently. But why do we need Dask?

*"Big data is data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them."* (Wikipedia)

Libraries such as NumPy and pandas are designed for data that fits in memory on a single machine, so they struggle with big data sets. Dask works with NumPy and pandas under the hood to run familiar functions on large sets of data. Dask is developed in coordination with community projects such as NumPy and pandas.
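
To make "under the hood" a little more concrete, here is a minimal sketch of a Dask array wrapping a NumPy array. The array size and chunk size are arbitrary choices for illustration, and the example is small enough to fit in memory so that the two results can be compared.

```python
import numpy as np
import dask.array as da

# A NumPy array lives entirely in memory.
x_np = np.arange(1_000_000)

# The Dask version splits the same data into 10 chunks of 100,000 elements.
x_da = da.from_array(x_np, chunks=100_000)

# Dask is lazy: this line only records the work to be done.
lazy_mean = x_da.mean()

# .compute() runs the recorded work, one chunk at a time, and combines the results.
result = lazy_mean.compute()

print(x_da.numblocks)                    # number of chunks along each dimension
print(np.isclose(result, x_np.mean()))   # same answer as plain NumPy
```

The function names mirror NumPy (`mean` here), which is why existing NumPy code often needs only small changes to run on Dask arrays.
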

<img src="./dask-screenshot.png" width = "600"/>

## Getting Started
To get started using Dask in Python, check whether it is already installed on your laptop. Dask is included with Anaconda, so it may have been installed when you downloaded Anaconda.

```
# To check for dask, try importing. 
import dask
```

If you do not have Dask, Python will let you know. If you get an error saying `No module named 'dask'`, you will need to install Dask using conda from the command line before you import it.

```
# To install dask, use conda in your PowerShell command line:
conda install dask
```
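
As a rough sketch, the same check can be done without triggering an error by asking Python whether it can locate the package. The helper name `is_installed` is made up for this illustration; `importlib.util.find_spec` is part of the Python standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report whether dask is available in the current environment.
if is_installed("dask"):
    print("dask is installed")
else:
    print("dask is missing: install it with conda before importing")
```
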
<div class="run">
    ▶️ <b> Run the cell(s) below. </b>
</div>

import dask

You will also need to install xarray to use large data sets with Dask.

```
# Install xarray (along with dask, netCDF4, and bottleneck) in your PowerShell command line:
conda install -c conda-forge xarray dask netCDF4 bottleneck
```
<div class="run">
    ▶️ <b> Run the cell(s) below. </b>
</div>

import xarray
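
To hint at how xarray and Dask fit together, the sketch below builds a small labeled array and splits it into Dask chunks. The data are random numbers generated for illustration only, and the variable name, dimension names, and chunk size are assumptions rather than part of any real data set.

```python
import numpy as np
import xarray as xr

# Synthetic data standing in for a year of daily values on a 10 x 10 grid.
rng = np.random.default_rng(0)
temps = xr.DataArray(
    rng.random((365, 10, 10)),
    dims=("time", "lat", "lon"),
    name="temperature",
)

# .chunk() converts the underlying NumPy array into a Dask array,
# here with 5 chunks of 73 days each along the time dimension.
chunked = temps.chunk({"time": 73})

# The mean is lazy until .compute() is called.
lazy_mean = chunked.mean(dim="time")
result = lazy_mean.compute()

print(chunked.chunks)   # chunk sizes along each dimension
print(result.shape)     # one mean value per grid cell
```
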

## Three Parts of the Dask Project
### 1. Dask Collections ("core-library")
- **High-level collections**: mimic NumPy arrays, Python lists, and pandas DataFrames, but can operate in parallel on datasets that don't fit into memory
    - Array
    - Bag
    - DataFrame
- **Low-level collections**: give you finer control in building custom parallel and distributed computations
    - Delayed
    - Futures
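
For contrast with the high-level collections, here is a minimal sketch of the low-level Delayed interface. The functions `square` and `add` are made up for this illustration; `dask.delayed` wraps ordinary Python functions so that calling them records a task instead of running it.

```python
import dask


@dask.delayed
def square(x):
    return x * x


@dask.delayed
def add(a, b):
    return a + b


# Calling delayed functions only builds a task graph; no work happens yet.
total = add(square(3), square(4))

# .compute() runs the graph. The two square() calls are independent,
# so Dask is free to run them in parallel.
value = total.compute()
print(value)  # 25
```
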

### 2. Dask Cluster
Dask uses a distributed scheduler, which exists in the context of a Dask cluster.

Structure of a Dask cluster:

<img src="./dask_cluster_img.png" width = "600"/>
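
The pieces of a cluster (scheduler, workers, and client) can be tried out on a single machine. The sketch below assumes the `distributed` package is installed (it comes with `conda install dask`) and uses arbitrary worker counts. A real deployment would typically point the client at a remote scheduler instead.

```python
from dask.distributed import Client, LocalCluster

# Start a scheduler and two workers on this machine.
# processes=False runs the workers as threads inside the current process,
# and dashboard_address=None turns off the diagnostic dashboard.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as cluster, Client(cluster) as client:
    # The client sends a task to the scheduler, which assigns it to a worker.
    future = client.submit(sum, [1, 2, 3])

    # .result() waits for the worker to finish and returns the value.
    result = future.result()

print(result)  # 6
```
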

### 3. Dask Ecosystem
The Dask ecosystem connects several additional open source projects that provide different mechanisms for deploying Dask clusters.

**This tutorial will focus on using the high-level collections of Array, Bag, and DataFrame.**
