# Dask

In this tutorial on Dask, we explore what Dask is, its architecture, how it is used, and why it can be a better choice than other libraries for reading data from large files and creating DataFrames.

## What is Dask?

Dask is a flexible library for parallel computing in Python.

Dask is composed of two parts:

- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
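
To make the idea of task scheduling concrete, here is a toy sketch in plain Python. It follows the convention that Dask documents for task graphs: a dictionary mapping keys to either literal values or task tuples whose first element is a callable. The `toy_get` function is a hypothetical helper written for this tutorial; it is not Dask's scheduler, runs in a single thread, and skips everything a real scheduler does (parallelism, memory management, error handling).

```python
from operator import add, mul


def toy_get(graph, key, cache=None):
    """Evaluate `key` in a dict-based task graph (toy, single-threaded)."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]

    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # String arguments that name other keys are resolved recursively;
        # everything else is passed through as a literal.
        resolved = [
            toy_get(graph, arg, cache)
            if isinstance(arg, str) and arg in graph
            else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = value

    cache[key] = result
    return result


# Keys "x" and "y" hold data; "z" and "w" are tasks that depend on other keys.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),
    "w": (mul, "z", 10),
}

answer = toy_get(graph, "w")
print(answer)
```

Because the graph is only data, a scheduler is free to decide the order in which independent tasks run and where they run, which is what the collections described above rely on.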

Dask emphasizes the following virtues:

- Familiar: Provides parallelized NumPy array and Pandas DataFrame objects.
- Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
- Native: Enables distributed computing in pure Python with access to the PyData stack.
- Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms.
- Scales up: Runs resiliently on clusters with 1000s of cores.
- Scales down: Trivial to set up and run on a laptop in a single process.
- Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.
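
As a small illustration of the "Familiar" point, the sketch below builds a chunked Dask array and applies NumPy-style operations to it. It assumes Dask and NumPy are installed (for example through `dask[array]`); the array shape and chunk size are arbitrary values chosen for the example.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into a 4 x 4 grid of 250 x 250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))

# The expression mirrors NumPy syntax, but nothing is computed yet:
# Dask only records the tasks needed to produce the result.
column_sums = (x + x.T).sum(axis=0)

# compute() runs the task graph and returns a regular NumPy array.
result = column_sums.compute()
print(result.shape, result[0])
```

The same code would work on an array too large for memory, since each chunk is processed separately.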


## Architecture of Dask
![image.png](attachment:image.png)

## Scaling from a Laptop to a Cluster

Dask is convenient on a laptop. It installs trivially with conda or pip and extends the size of convenient datasets from “fits in memory” to “fits on disk”.


Dask can scale to a cluster of 100s of machines. It is resilient, elastic, data local, and low latency. For more information, see the documentation about the distributed scheduler.


This ease of transition from a single machine to a moderate cluster lets users start simple and grow when necessary.


## Installation of Dask

You can install Dask with conda, with pip, or by installing from source.

### Installing Dask with Conda

Dask is installed by default in Anaconda. You can update Dask using the conda command:

In [1]:
conda install dask

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\Rohan\anaconda3

  added / updated specs:
    - dask


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.5                |           py38_0         2.9 MB
    openssl-1.1.1h             |       he774522_0         4.8 MB
    ------------------------------------------------------------
                                           Total:         7.7 MB

The following packages will be UPDATED:

  conda                                        4.8.4-py38_0 --> 4.8.5-py38_0
  openssl                                 1.1.1g-he774522_1 --> 1.1.1h-he774522_0




This installs Dask and all common dependencies, including Pandas and NumPy. Dask packages are maintained both on the default channel and on conda-forge. Optionally, you can obtain a minimal Dask installation using the following command:

In [None]:
conda install dask-core

This installs the minimal set of dependencies required to run Dask, similar to (but not exactly the same as) python -m pip install dask below.

### Installing Dask with Pip

You can install everything required for most common uses of Dask (arrays, dataframes, …) with a single pip command. This installs both Dask and dependencies like NumPy, Pandas, and so on that are necessary for different workloads. This is often the right choice for Dask users:

In [2]:
python -m pip install "dask[complete]"    # Install everything

You can also install only the Dask library. Modules like dask.array, dask.dataframe, dask.delayed, or dask.distributed won’t work until you also install NumPy, Pandas, Toolz, or Tornado, respectively. 
This is common for downstream library maintainers:

In [None]:
python -m pip install dask                # Install only core parts of dask

Dask also provides other dependency sets for different subsets of functionality:

In [None]:
python -m pip install "dask[array]"       # Install requirements for dask array
python -m pip install "dask[bag]"         # Install requirements for dask bag
python -m pip install "dask[dataframe]"   # Install requirements for dask dataframe
python -m pip install "dask[delayed]"     # Install requirements for dask delayed
python -m pip install "dask[distributed]" # Install requirements for distributed dask

These options exist so that users of the lightweight core Dask scheduler aren’t required to download the more exotic dependencies of the collections (NumPy, Pandas, Tornado, etc.).
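
Since the different install options pull in different dependencies, it can help to check what is actually available in the current environment. The sketch below uses only the standard library; the module list is an assumption based on the packages named in this section, and `installed_modules` is a hypothetical helper, not part of Dask.

```python
from importlib.util import find_spec

# Top-level import names of the packages mentioned in this section.
OPTIONAL_MODULES = ["dask", "numpy", "pandas", "toolz", "tornado", "distributed"]


def installed_modules(names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = installed_modules(OPTIONAL_MODULES)
for name, available in status.items():
    print(f"{name:12s} {'installed' if available else 'missing'}")
```

A module reported as missing indicates that the matching dependency set (or the package itself) still needs to be installed before the corresponding Dask collection can be used.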

### Install from Source

To install Dask from source, clone the repository from GitHub:

In [None]:
git clone https://github.com/dask/dask.git
cd dask
python -m pip install .

### Test

Test Dask with py.test:

In [None]:
cd dask
py.test dask

Please be aware that installing Dask naively may not install all requirements by default. Please read the pip section above which discusses requirements. You may choose to install the dask[complete] version which includes all dependencies for all collections. Alternatively, you may choose to test only certain submodules depending on the libraries within your environment. For example, to test only Dask core and Dask array we would run tests as follows:

In [None]:
py.test dask/tests dask/array/tests

## Setup
This section describes various ways to set up Dask on different hardware, either locally on your own machine or on a distributed cluster. If you are just getting started, you can skip it: Dask does not require any setup if you only want to use it on a single computer.

Dask has two families of task schedulers:

- Single-machine scheduler: This scheduler provides basic features on a local process or thread pool. It was made first and is the default. It is simple and cheap to use, but it can only be used on a single machine and does not scale.
- Distributed scheduler: This scheduler is more sophisticated. It offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster.

If you import Dask, set up a computation, and then call compute, you will use the single-machine scheduler by default. To use the dask.distributed scheduler, you must set up a Client:


In [None]:
import dask.dataframe as dd
df = dd.read_csv(...)
df.x.sum().compute()  # This uses the single-machine scheduler by default

In [None]:
from dask.distributed import Client
client = Client(...)  # Connect to distributed cluster and override default
df.x.sum().compute()  # This now runs on the distributed system
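
Within the single-machine family, the scheduler can also be selected per call through the `scheduler` argument of `compute`. The sketch below is a minimal example using `dask.delayed`, assuming only the core Dask package is installed; the functions `inc` and `add` are made up for illustration.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


# Build a small lazy computation: (1 + 1) + (2 + 1).
total = add(inc(1), inc(2))

# Run the same graph on a local thread pool...
result_threads = total.compute(scheduler="threads")

# ...and on the synchronous scheduler, which runs tasks one at a time
# in the calling thread and is handy for debugging.
result_sync = total.compute(scheduler="synchronous")

print(result_threads, result_sync)
```

Both calls evaluate the same task graph; only the execution strategy differs.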