<img src="../images/dask_horizontal.svg" align="right" width="30%">

# Introducing Dask

Dask is a parallel computing library that scales existing Python libraries. This tutorial introduces Dask and, more generally, parallel data analysis.


## Learning Objectives 

- Describe components that make up Dask



## Dask Components 

Dask is composed of two main parts:

- **Dask Collections**
- **Dynamic Task Scheduling**

<img src="../images/Dask Overview (Light).png" width="80%">

1. High-level collection APIs:
  - **Dask Array**: Parallel NumPy Arrays
  - **Dask DataFrame**: Parallel Pandas DataFrames
  - **Dask Bag**: Parallel lists
  - **Dask ML**: Parallel Scikit-learn


2. Low-level collection APIs:
  - **Dask Delayed**: Lazy parallel objects
  - **Dask Futures**: Eager parallel objects


3. Task Scheduling
  - **Scheduler**: 
    - creates and manages directed acyclic graphs (DAGs)
    - distributes tasks to workers
    
    
    
<div class="admonition alert alert-info">
    <p class="admonition-title" style="font-weight:bold">Lazy evaluation vs eager evaluation</p>
    <ul>
        <li>Lazy evaluation: objects are evaluated only when their results are needed</li>
        <li>Eager evaluation: objects are evaluated immediately, regardless of whether their results are needed right away</li>
    </ul>
</div>
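
The difference between the two evaluation styles can be sketched with `dask.delayed`. This is a minimal illustration, assuming Dask is installed and the default local scheduler is used; the function `inc` and the `calls` list are hypothetical helpers introduced only to show when the function body actually runs.

```python
import dask

calls = []  # hypothetical helper: records every time `inc` actually runs


def inc(x):
    calls.append(x)
    return x + 1


# Eager: a plain Python call runs the function body immediately.
eager_result = inc(1)
calls_after_eager = len(calls)

# Lazy: wrapping the function with dask.delayed only records the call
# as a task; the function body has not run yet.
lazy_result = dask.delayed(inc)(10)
calls_after_lazy_build = len(calls)

# The work happens only when the result is explicitly requested.
value = lazy_result.compute()
calls_after_compute = len(calls)

print(calls_after_eager, calls_after_lazy_build, calls_after_compute)
```

The count of recorded calls stays the same after the lazy object is built and only increases once `compute()` is called.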
    


## Advantages of using Dask

- **Familiarity**: Dask collections such as Dask Array and Dask DataFrame provide APIs that closely follow those of NumPy and Pandas.
- **Responsive**: Dask is designed with interactive computing in mind. 
    - It provides rapid feedback and diagnostics to aid humans
- **Scale up and scale down**: It scales well from a single machine (laptop) to clusters (100s of machines)
    - This ease of transition between a single machine and moderately sized clusters lets users prototype their workflows on their local machines and seamlessly move to a cluster when needed.
    - This also gives users a lot of flexibility when choosing the best way to deploy and run their workflows.
- **Flexibility**: Dask supports interfacing with popular cluster resource managers such as PBS, SLURM, and Kubernetes with a minimal amount of effort
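
As a rough illustration of the familiarity and scaling points, the sketch below computes the same mean with NumPy and with Dask Array, then runs the Dask version on two different local schedulers. It assumes NumPy and Dask are installed; the array size and chunk size are arbitrary choices made for the example.

```python
import numpy as np
import dask.array as da

# NumPy: the whole array lives in memory and the mean is computed immediately.
np_data = np.arange(1_000_000, dtype="float64")
np_mean = np_data.mean()

# Dask Array: same method names, but the array is split into 10 chunks
# and the mean is only described as a task graph at this point.
da_data = da.from_array(np_data, chunks=100_000)
lazy_mean = da_data.mean()

# compute() runs the graph; the `scheduler` keyword selects a local scheduler.
threaded_mean = lazy_mean.compute(scheduler="threads")
serial_mean = lazy_mean.compute(scheduler="synchronous")

print(np_mean, threaded_mean, serial_mean)
```

The analysis code stays the same in both runs; only the scheduler argument changes.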

<img src="../images/Dask Cluster Manager (Light)(1).png" width="80%">

## Task Graphs

Dask represents distributed/parallel computations with task graphs, more specifically [directed acyclic graphs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

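Directed acyclic graphs are made up of nodes (tasks) connected by directed edges (dependencies). Every edge points in one direction, and following the edges never leads back to a node that was already visited, so the graph contains no loops.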
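
To make the idea concrete, here is a plain-Python sketch of a small task graph and one way to run it in dependency order. It is not Dask's scheduler: it uses only the standard-library `graphlib` module (Python 3.9+), and the task names and the functions `load`, `double`, and `total` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load():
    return [1, 2, 3, 4]


def double(values):
    return [2 * v for v in values]


def total(values):
    return sum(values)


# Each entry maps a task name to (function, names of the tasks it depends on).
graph = {
    "load": (load, []),
    "double": (double, ["load"]),
    "sum_raw": (total, ["load"]),
    "sum_doubled": (total, ["double"]),
}

# TopologicalSorter takes {node: predecessors} and raises CycleError
# if the dependencies contain a loop.
dependencies = {name: deps for name, (_, deps) in graph.items()}
order = list(TopologicalSorter(dependencies).static_order())

# Run every task once all of its inputs are available.
results = {}
for name in order:
    func, deps = graph[name]
    results[name] = func(*(results[d] for d in deps))

print(order)
print(results["sum_raw"], results["sum_doubled"])
```

Tasks with no dependency on each other, such as `double` and `sum_raw` here, could run at the same time, which is the kind of parallelism a scheduler can exploit.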

<img src="../images/dask-task-stream.gif">

---

## Resources and references

* Reference
    *  [Docs](https://dask.org/)
    *  [Examples](https://examples.dask.org/)
    *  [Code](https://github.com/dask/dask/)
    *  [Blog](https://blog.dask.org/)
*  Ask for help
    *   [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions
    *   [GitHub discussions](https://github.com/dask/dask/discussions) for general discussion and usage questions that are not bug reports
    *   [GitHub issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests
    
* Pieces of this notebook are adapted from the following sources
  * https://github.com/dask/dask-tutorial
  
  
 <div class="admonition alert alert-success">
    <p class="title" style="font-weight:bold">Next: <a href="./08-dask-delayed.ipynb">Parallelizing code with dask.delayed</a></p>
    
</div>