## Three Parts of the Dask Project
### 1. Dask Collections ("core-library")
- **High-level collections**: mimic NumPy arrays, Python lists, and pandas DataFrames, but can operate in parallel on datasets that don't fit into memory
    - Array
    - Bag
    - DataFrame
- **Low-level collections**: give you finer control in building custom parallel and distributed computations
    - Delayed
    - Futures
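
As a rough illustration of what "finer control" means for the low-level collections, here is a minimal sketch using `dask.delayed`. The helper functions `inc` and `add` are hypothetical and exist only for this example.

```python
from dask import delayed

# Hypothetical helper functions, used only for illustration
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Wrapping a call in delayed() records a task instead of running it
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# Nothing has executed yet; compute() runs the task graph
result = total.compute()
print(result)
```

Because `a` and `b` do not depend on each other, the scheduler is free to run them in parallel before combining them in `add`.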

### 2. Dask Cluster
Dask's distributed scheduler runs inside a Dask cluster, where it coordinates a set of workers and receives work from a client.

Structure of a Dask cluster:

<img src="./dask_cluster_img.png" width = "600"/>

citation: https://tutorial.dask.org/00_overview.html
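
To make the client, scheduler, and worker roles concrete, here is a minimal sketch that starts a small cluster on the local machine. It assumes the `distributed` package is installed, and the in-process, single-worker settings are chosen only to keep the example lightweight; they are not a recommended configuration.

```python
from dask.distributed import Client, LocalCluster

# Start a scheduler and one worker inside the current process (illustrative settings)
cluster = LocalCluster(n_workers=1, threads_per_worker=2, processes=False)

# The client connects to the scheduler and submits work to it
client = Client(cluster)

# submit() returns a Future immediately; result() blocks until the worker finishes
future = client.submit(sum, [1, 2, 3])
total = future.result()
print(total)

# Shut everything down when finished
client.close()
cluster.close()
```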

### 3. Dask Ecosystem
The Dask ecosystem connects several additional open source projects that provide different mechanisms for deploying Dask clusters.

**This tutorial will focus on using the high-level collection of DataFrame.**


In [11]:
import dask.dataframe as dd
import pandas as pd

url = 'https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-cap.256.10&entityid=53edaa7a0e083013d9bf20322db1780e'
birds_dask = dd.read_csv(url)

birds_dask.shape

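#create small subset of data (first 50 rows) in Dask
#.loc[:50] is label-based and inclusive, so it does not reliably give the first 50 rows;
#head(..., compute=False) takes the first 50 rows and keeps the result as a Dask DataFrame
small_birds_dask = birds_dask.head(50, compute=False)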

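#find the mean of a column in Dask
#the result is lazy, so this prints a dd.Scalar placeholder; call .compute() on it to get the number
print(birds_dask.bird_count.mean())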

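#add a column to the Dask DataFrame
birds_dask['bird_count_add'] = birds_dask.bird_count + 1
#info is referenced without parentheses, so this prints the bound method, whose repr shows the DataFrame structure
print(birds_dask.info)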

#applying functions - create a new column that multiplies the number of birds seen by 100
def seen_calc(seen):
    """ Multiplies the seen value by 100
    
    Parameters
    ----------
        seen : float
            Number of birds seen
       
    Returns
    -------
        seen_new : float
            Number of birds seen times 100
    """
    seen_new = (seen * 100)
    return seen_new

#assign the result of apply to the new column; meta declares the output name and dtype
birds_dask['seen_new'] = birds_dask.seen.apply(seen_calc, meta=('seen_new', 'int64'))
print(birds_dask.info)



dd.Scalar<series-..., dtype=float64>
<bound method DataFrame.info of Dask DataFrame Structure:
              survey_id site_id species_id distance bird_count    notes   seen  heard direction bird_count_add
npartitions=1                                                                                                 
                  int64  object     object   object      int64  float64  int64  int64    object          int64
                    ...     ...        ...      ...        ...      ...    ...    ...       ...            ...
Dask Name: assign, 4 tasks>
<bound method DataFrame.info of Dask DataFrame Structure:
              survey_id site_id species_id distance bird_count    notes   seen  heard direction bird_count_add seen_new
npartitions=1                                                                                                          
                  int64  object     object   object      int64  float64  int64  int64    object          int64    int64
               