# DAT300 - Large Scale Machine Learning with Dask

<img src="./images/Dask_logo.png" />

[Dask](https://dask.org/) - parallel computing with Python

## Motivation

<img src="./images/puppy.jpg" width=650/>

* This is the kind of dataset you have worked with so far in DAT200
* Structured datasets like **iris**, **Wisconsin breast cancer**, **Boston house pricing** for learning the fundamentals of machine learning 
* Small in size and ready for analysis

<img src="./images/wolf.jpg" width=450/>

* This is the kind of data you will meet in real life
* Potentially lots of problems in the way before you can even start modelling
* A frequent major challenge is the **size** of the data

* **Problem**: many data science tools don't scale, or don't scale easily
* **Need**: tools that enable machine learning on larger datasets using larger resources in a user-friendly way (high-level API)

## Why Dask?

* **Dask** to the rescue
    * **Parallelism**: conveniently distribute computation across many CPU cores to save time
    * **Large data user interfaces**: convenient handling of out-of-memory datasets
    * Do all this with a familiar, pure-Python API
    * Re-use legacy code with very few changes 
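
To make the two points above concrete, here is a minimal sketch using `dask.array`. The array shape and chunk size are arbitrary values chosen for illustration, not recommendations. The array is split into chunks, operations on it only build a task graph, and `compute()` executes that graph, working through the chunks in parallel where possible.

```python
import dask.array as da

# A 10000 x 500 array of random numbers, stored as ten chunks of
# 1000 x 500 each (shape and chunk size are illustrative assumptions).
x = da.random.random((10_000, 500), chunks=(1_000, 500))

# NumPy-style call; nothing is computed yet, Dask only records the tasks.
col_means = x.mean(axis=0)

# compute() executes the task graph and returns an ordinary NumPy array.
result = col_means.compute()

print(x.numblocks)    # number of chunks along each axis
print(result.shape)   # one mean per column
```

The code reads almost exactly like its NumPy equivalent, which is what makes re-using existing code practical.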

### Options for out-of-memory datasets

<img src="./images/Scale-up-ML-tasks_CPU-&_RAM.png" width=650/>

**Sampling**

* Is **all data** really needed for training?
* **Sanity check** of how many samples are needed by use of **learning curves** (as introduced in DAT200)
    * Start with 10% of data for training, then 20%, etc. and compare accuracy
    * The mlxtend package [implements](http://rasbt.github.io/mlxtend/user_guide/plotting/plot_learning_curves/) the function `plot_learning_curves`
    * The scikit-learn documentation has example code [implementing](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py) a function `plot_learning_curve`

<img src="./images/learning_curves.png" width=750/>
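
The sanity check described above can be sketched with scikit-learn's `learning_curve` function. The dataset (Wisconsin breast cancer), the logistic regression pipeline, and the 5-fold cross-validation are assumptions made for this illustration; any estimator and dataset could take their place.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Illustrative model choice: scaled features + logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train on 10%, 20%, ..., 100% of the training folds and record accuracy.
train_sizes, train_scores, val_scores = learning_curve(
    model,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Average over the cross-validation folds for each training set size.
mean_val = val_scores.mean(axis=1)
for n, acc in zip(train_sizes, mean_val):
    print(f"{n:4d} training samples -> validation accuracy {acc:.3f}")
```

If the validation accuracy flattens well before 100% of the data is used, training on a sample is likely to be sufficient.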

**Out-of-core algorithms and tools**

Limitations with scikit-learn

* uses **NumPy arrays** as the primary data structure
* most algorithms/estimators in scikit-learn **expect in-memory** data
* **exceptions**: algorithms/estimators that support **incremental learning** through the `.partial_fit` method. An overview of those is [here](https://scikit-learn.org/stable/modules/computing.html)
* Dask provides tools for working with large **out-of-memory datasets**
    * [Dask arrays](http://docs.dask.org/en/latest/array.html) (based on NumPy arrays)
    * [Dask dataframes](http://docs.dask.org/en/latest/dataframe.html) (based on Pandas dataframes)
    * Both work nicely with estimators that implement `.partial_fit` for training
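
As a rough sketch of incremental learning, the example below trains an `SGDClassifier` one batch at a time with `.partial_fit`. To keep it self-contained, the data is synthetic and the batches are plain NumPy slices; they stand in for the chunks of a larger dataset (for example the blocks of a Dask array) that would not fit in memory at once. The batch size and dataset parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a dataset that would normally be too large for memory.
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)
X_train, y_train = X[:8_000], y[:8_000]
X_test, y_test = X[8_000:], y[8_000:]

clf = SGDClassifier(random_state=0)

# partial_fit must be told the full set of labels, because a single
# batch is not guaranteed to contain every class.
classes = np.unique(y)

batch_size = 1_000  # illustrative value
for start in range(0, len(X_train), batch_size):
    stop = start + batch_size
    # Only this batch needs to be in memory during the update.
    clf.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)

accuracy = clf.score(X_test, y_test)
print(f"Test accuracy after one pass over the batches: {accuracy:.3f}")
```

The model is updated after every batch, so memory use is bounded by the batch size rather than by the size of the full dataset.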

**Other libraries**

* Dask can orchestrate workflows with other libraries such as
    * **XGBoost**
    * **TensorFlow**