<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask_horizontal_no_pad.svg"
     width="30%"
     alt="Dask logo" />


# What is it and when to use it? 


If you've heard of Dask before, maybe you have a sense of how to answer these questions. If you haven't heard of Dask before and want to know what it is and when/if you should use it, then you are in the right place! :)

Before we give a short overview and attempt to answer these questions, we strongly recommend that you check out the amazing documentation the Dask community maintains.

- Documentation: https://docs.dask.org

Contribute to the project:

-  GitHub: https://github.com/dask/dask

Engage with the community:

- Slack: https://dask.slack.com/

### What is Dask? 

Dask is a flexible library for parallel computing in Python that follows the syntax of the PyData ecosystem. If you are familiar with NumPy, pandas and scikit-learn, then think of Dask as their faster cousin. For example:

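```python
# pandas: one CSV file, loaded eagerly into memory
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask: many CSV files, evaluated lazily until .compute() is called
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
```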

Since they are all family, Dask allows you to scale your existing workflows with only a few changes. Dask lets you accelerate computations and run those that don't fit in memory. It works on your laptop, and it also scales out to large clusters while providing a dashboard with great diagnostic tools.
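
The same idea applies to arrays. The following is a minimal sketch, assuming `numpy` and `dask` are installed; the array size and chunk size are arbitrary choices for illustration. It shows that a Dask array accepts NumPy-style expressions, splits the data into chunks, and only does the work when `.compute()` is called.

```python
import numpy as np
import dask.array as da

# A NumPy array small enough to check the answer against.
x_np = np.arange(1_000_000, dtype="float64")

# Wrap it in a Dask array split into 10 chunks of 100,000 elements
# (the chunk size is an arbitrary choice for this sketch).
x_da = da.from_array(x_np, chunks=100_000)

# Same syntax as NumPy, but this only builds a task graph.
lazy_mean = (x_da ** 2).mean()

# The chunks are processed, possibly in parallel, only here.
result = lazy_mean.compute()
expected = (x_np ** 2).mean()
```

For data that truly does not fit in memory you would build the Dask array from files on disk rather than from an in-memory NumPy array; the expression syntax stays the same.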

<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask-overview.svg" 
     width="75%"
     alt="Dask overview" />

### Dask jargon: Client, Scheduler and Workers 

- Client: The user-facing entry point for cluster users. In other words, the client lives where your Python code lives, and it communicates with the scheduler, passing along the tasks to be executed.
- Scheduler: The task manager; it sends the tasks to the workers.
- Workers: The ones that compute the tasks.

Note: The Scheduler and the Workers are on the same network. They could live on your laptop or on a separate cluster.

<img src="https://raw.githubusercontent.com/coiled/pydata-global-dask/master/images/dask-cluster.svg"
     width="75%"
     alt="Dask cluster">
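
To make the three roles concrete, here is a minimal sketch, assuming the `distributed` package is installed. It starts a scheduler and a single threaded worker inside the current process; the `square` function and the worker settings are illustrative assumptions, not a recommended configuration.

```python
from dask.distributed import Client

def square(x):
    """Hypothetical task used only for this illustration."""
    return x ** 2

# Creating a Client with no address starts a local scheduler and workers.
# processes=False keeps everything in threads of this process, and
# dashboard_address=None disables the dashboard for this small sketch.
client = Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,
)

# The client hands the tasks to the scheduler, which assigns them to workers.
futures = client.map(square, range(5))

# Gather brings the workers' results back to where the client lives.
results = client.gather(futures)
total = client.submit(sum, results).result()

client.close()
```

On a real cluster the scheduler and workers run elsewhere, and you would pass the scheduler's address to `Client` instead of letting it start a local one.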

## When to use Dask?

Before trying Dask, ask yourself a few questions to determine whether it might be suitable for you.

- Does your data fit in memory? 
    - Yes: Use pandas or numpy.  
    - No : Dask might be able to help. 
- Do your computations take forever?
    - Yes: Dask might be able to help. 
    - No : Awesome.
- Do you have embarrassingly parallelizable code?
    - Yes: Dask might be able to help.
    - Not sure: Take a look at these [examples](https://examples.dask.org/applications/embarrassingly-parallel.html).
    - No: I'm sorry, although there might still be some hope with Dask.
    
    
**Bottom Left:** You don't need Dask.    
**Elsewhere:** Dask is fair game.


<img src="https://raw.githubusercontent.com/dask/dask-ml/main/docs/source/images/dimensions_of_scale.svg"
     width="65%"
     alt="Dask zones">
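
As an illustration of the "embarrassingly parallel" case from the questions above, here is a minimal sketch using `dask.delayed`. The `simulate` function is a hypothetical stand-in for any task whose result depends only on its own input, which is what makes the tasks independent and easy to run in parallel.

```python
import dask
from dask import delayed

def simulate(seed):
    """Hypothetical independent task: uses nothing but its own input."""
    return sum((seed * k) % 7 for k in range(1000))

# Wrapping the calls with delayed records them as tasks without running them.
tasks = [delayed(simulate)(seed) for seed in range(8)]

# dask.compute runs the independent tasks and returns a tuple of results.
parallel_results = dask.compute(*tasks)

# The plain serial loop gives the same answer, one task at a time.
serial_results = tuple(simulate(seed) for seed in range(8))
```

Because no task needs the output of another, the scheduler is free to run them in any order and on any worker.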


**Disclaimers:**

1. When we say "Dask might be able to help", it is because you should first try to accelerate your code with NumPy and/or Numba and check the data types used in your DataFrames, and only then consider Dask. Even with Dask, we can't guarantee that things will be faster; it depends on the underlying code.

2. Even with large datasets, check at each stage whether you have reduced the data to a manageable size; at that point, going back to pandas or NumPy might be the best call.
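
As an example of the first disclaimer, checking the types in a DataFrame can shrink it considerably before Dask enters the picture. The sketch below uses plain pandas on made-up data; the column names and values are assumptions for illustration, and the `float32` conversion is only appropriate when the reduced precision is acceptable for your data.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "user_id": range(100_000),
    "country": ["US", "MX", "CO", "AR"] * 25_000,
    "value": [1.5] * 100_000,
})

# Memory footprint in bytes, including the Python strings.
before = df.memory_usage(deep=True).sum()

slim = df.assign(
    # Smallest unsigned integer type that holds every id.
    user_id=pd.to_numeric(df["user_id"], downcast="unsigned"),
    # Few distinct strings repeated many times: store them as categories.
    country=df["country"].astype("category"),
    # Half the bytes per value, at the cost of precision.
    value=df["value"].astype("float32"),
)

after = slim.memory_usage(deep=True).sum()
```

Comparing `before` and `after` tells you whether the data might fit in memory after all, in which case pandas alone may be enough.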

**Best practices:**

The learning curve for Dask can be a bit intimidating, so we want to point you to some best-practices links that will make the process smoother. We will go over some of these topics, but we leave the links here for future reference:

- Are you working with arrays? Check the [array best practices](https://docs.dask.org/en/latest/array-best-practices.html)
- Dealing with DataFrames? Check the [DataFrame best practices](https://docs.dask.org/en/latest/dataframe-best-practices.html)
- Are you trying to accelerate your code using `delayed`? Check the [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)
- For overall good practices, check the [Dask best practices](https://docs.dask.org/en/latest/best-practices.html)

## Why Dask? 

If you want to know why Dask might be a good option for you, we recommend the [Why Dask?](https://docs.dask.org/en/latest/why.html) page of the Dask documentation.

If you are already convinced that Dask is right for you, or you want to learn more about it, these are the topics we will cover in this mini-tutorial:

1. Dask Delayed: How to parallelize existing Python code and your custom algorithms. 
2. Schedulers: Single Machine vs Distributed, and the Dashboard.   
3. From pandas to Dask: How to manipulate bigger-than-memory DataFrames using Dask.  
4. Dask-ML: Scalable machine learning using Dask.

## Extra learning material:

1. Self-paced Dask-Tutorial: https://tutorial.dask.org/
2. Dask training by Coiled: [Scaling Python with Dask](https://coiled.io/course/scaling-python-with-dask/)