<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask_horizontal_no_pad.svg"
     width="30%"
     alt="Dask logo" />


# Basic Dask Concepts & When to Use Dask? 


If you've heard of Dask before, maybe you have a sense of how to answer these questions. If you haven't heard of Dask before and want to know what it is and when/if you should use it, then you are in the right place! :)

Before we give a short overview and attempt to answer these questions, we strongly recommend that you check out the amazing documentation that the Dask community maintains.

- Documentation: https://docs.dask.org

Contribute to the project:

- GitHub: https://github.com/dask/dask

Engage with the community:

- Discourse: https://dask.discourse.group/

### What is Dask? 

Dask is a flexible library for parallel computing in Python that follows the syntax of the PyData ecosystem. If you are familiar with NumPy, pandas and scikit-learn, then think of Dask as their faster cousin. For example:

```python
# pandas: one file, evaluated immediately
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask: many files, evaluated only when .compute() is called
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
```

Since they are all family, Dask lets you scale your existing workflows with a small number of changes. Dask enables you to accelerate computations and to run those that don't fit in memory. It works on your laptop, but it also scales out to large clusters, and it provides a dashboard with great diagnostic tools.

<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask-overview.svg" 
     width="100%"
     alt="Dask overview" />
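
To make the idea of lazy, chunked computation a little more concrete, here is a minimal sketch using `dask.array`. The array shape and chunk sizes are arbitrary values chosen for illustration, not recommendations.

```python
import dask.array as da

# Describe a 2000 x 2000 array of ones, split into 500 x 500 chunks.
# Nothing is computed yet; Dask only records how to build each chunk.
x = da.ones((2000, 2000), chunks=(500, 500))

# Operations are lazy too: this builds a task graph, not a number.
lazy_total = (x + x.T).sum()

# .compute() executes the graph chunk by chunk and returns a concrete value.
total = lazy_total.compute()

print(x.numblocks)  # number of chunks along each axis
print(total)
```

Because each chunk is handled separately, the same pattern can apply to arrays that are larger than the available memory, as long as individual chunks fit.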

### Dask Concepts: Client, Scheduler and Workers 

- **Client**: The user-facing entry point for cluster users. In other words, the client lives where your Python code lives, and it communicates with the scheduler, passing along the tasks to be executed.
- **Scheduler**: The task manager. It sends the tasks to the workers.
- **Workers**: The ones that compute the tasks.

Note: The scheduler and the workers are on the same network. They could live on your laptop or on a separate cluster.

<img src="https://raw.githubusercontent.com/coiled/pydata-global-dask/master/images/dask-cluster.svg"
     width="75%"
     alt="Dask cluster">
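
The three roles described in this section can be sketched on a single machine with `dask.distributed`. This is an illustrative setup, not a recommended configuration: `square` and `run_demo` are hypothetical helper names, and the worker counts are arbitrary. Here `processes=False` keeps the scheduler and workers as threads inside the current process, and `dashboard_address=None` turns the dashboard off to keep the sketch small.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    """Toy task (hypothetical) that a worker will execute."""
    return x ** 2


def run_demo():
    # LocalCluster starts a scheduler and two workers on this machine.
    with LocalCluster(
        n_workers=2,
        threads_per_worker=1,
        processes=False,
        dashboard_address=None,
    ) as cluster:
        # The client connects to the scheduler and submits tasks to it.
        with Client(cluster) as client:
            # The scheduler assigns each square(i) task to a worker.
            futures = client.map(square, range(5))
            # Futures can be passed on as inputs to further tasks.
            total = client.submit(sum, futures)
            # .result() blocks until the workers finish, then fetches the value.
            return total.result()


result = run_demo()
print(result)
```

On a real cluster, the same client code would typically stay unchanged. Only the way the `Client` is pointed at the scheduler would differ.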

## When to use Dask?

Before trying Dask, a few questions can help you determine whether it is suitable for you. In the figures below:

**Bottom Left:** You don't need Dask.  
**Elsewhere:** Dask is fair game.

```python
from IPython.display import Image
Image(url="dask-side-1.png", width=500)
```

```python
Image(url="dask-side-2.png", width=500)
```

This also means:

## Don't use Dask if you don't need to!
Distributed computing brings a lot of additional complexity into the mix and will **incur overhead**. If your dataset and computations fit comfortably within your local resources, **this overhead may be larger than the performance gain** you'd get from using Dask. In that case, stick with non-distributed libraries like pandas, NumPy and scikit-learn.
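
One rough way to see this overhead is to time the same small computation with NumPy and with Dask. The sketch below assumes an array that fits easily in memory. `time_call` is a hypothetical helper, the sizes are arbitrary, and the timings depend on your machine, so treat the printed numbers as an illustration rather than a benchmark.

```python
import time

import dask.array as da
import numpy as np


def time_call(fn):
    """Run fn once and return (value, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    value = fn()
    return value, time.perf_counter() - start


# A small in-memory array: about 8 MB of float64 values.
data = np.arange(1_000_000, dtype="float64")

# Plain NumPy: a single call, no scheduling involved.
numpy_mean, numpy_seconds = time_call(lambda: data.mean())

# Dask: the same data split into 10 chunks, so graph construction and
# task scheduling are added on top of the arithmetic itself.
dask_data = da.from_array(data, chunks=100_000)
dask_mean, dask_seconds = time_call(lambda: dask_data.mean().compute())

print(f"NumPy: {numpy_seconds:.6f} s, Dask: {dask_seconds:.6f} s")
```

Both approaches return the same mean. Only the time spent getting there differs.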

## Dask Best Practices

Transitioning to distributed computing comes with a learning curve because it introduces several additional layers of complexity. That's why the Dask team has written Best Practices guides to make the process smoother. We will go over some of these topics, but we list the links here for future reference:

- Are you working with arrays? Check this [array best practices](https://docs.dask.org/en/latest/array-best-practices.html)
- Dealing with DataFrames? Check this [DataFrames best practices](https://docs.dask.org/en/latest/dataframe-best-practices.html)
- Are you trying to accelerate your code using `delayed`? Check this [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)
- For overall good practices check [Dask good practices](https://docs.dask.org/en/latest/best-practices.html)
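
For readers who have not met `delayed` yet, here is a minimal sketch of what it does: it wraps ordinary Python functions so that calling them builds a task graph instead of running them immediately. `inc` and `add` are hypothetical toy functions used only for illustration.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


# No work happens here: each call returns a lazy Delayed object.
lazy_sum = add(inc(1), inc(2))

# .compute() runs the graph; the two inc calls are independent of each
# other, so the scheduler is free to run them in parallel.
result = lazy_sum.compute()
print(result)
```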

## Extra learning material:

1. What is Dask? https://coiled.io/blog/what-is-dask/ 
2. Self-paced Dask tutorial: https://tutorial.dask.org/
3. Dask training by Coiled: [Scaling Python with Dask](https://coiled.io/course/scaling-python-with-dask/)