# Techniques for Data Science with Big Datasets

<img src='images/data.jpg' width=400/>

## Well... that sounds awfully vague, doesn't it?

__Welcome to large-scale data engineering and data science in 2020__
* End-to-end, "single product" platforms are no longer the leading options
* In the open-source world, end-to-end may not even be possible for the near future

__What does this mean in concrete terms?__
* Focusing on OSS, Hadoop and Spark can no longer support our end-to-end needs
* We need -- and want -- to learn how to assemble a best-of-breed suite for data science from newer, simpler tools like
    * Dask
    * Ray
    * Horovod and others
* ... while still using key features of mature tools like
    * SparkSQL
    * Hive
    * Airflow and more
    
__As architects and practitioners, we need to choose and leverage a suite of tools selected for power and simplicity__

This class is designed to help you become confident
* making those tool choices
* communicating about them with your team
* migrating away from legacy systems to meet modern data science needs

This class is *not* designed to
* Go in depth on the APIs or internals of any specific tools (there's just not enough time)
* "Sell you" on any specific open-source project or product
    * We want to get comfortable discussing strengths and weaknesses, and then you can choose a solution that is right for you
    
*We'll spend more time on questions and answers than in most of my classes (which are heavier on code and internals)*

## Catching up to large-scale data science in 2020: what's changed?

A brief recap:
* 2016 - broad adoption across industries of R, PyData, Apache Spark
* 2017 - broad rise of deep learning
* 2018-2019 - decline of Hadoop/Spark for data science
* 2020 - new open tools and hybrid architectures

__Theme: best-of-breed__

https://www.oreilly.com/radar/why-best-of-breed-is-a-better-choice-than-all-in-one-platforms-for-data-science/

## Interactive Survey

* What size datasets do you typically work with?
* Where is (most of) your data stored?
* How do you get data out of your data lake?
* What tools do you typically use for
    * feature engineering
    * modeling

## The changing definition of large-scale data

__Compute power has grown, but datasets have not__

<img src='images/largest.jpg' width=650/>


*Source: https://www.kdnuggets.com/2020/07/poll-largest-dataset-analyzed-results.html*

The relationship between the largest ML datasets people actually use and the largest dataset that is tractable on a single node (no cluster) has changed dramatically
* Resulting in new definitions for small, medium, and big data
* Avoid "big data" tools and their taxes when you can

Some "medium data" approaches
* Downsample
* XGBoost external memory (out-of-core)
* TF/PyTorch data loaders
* sklearn + `partial_fit` (incrementalizable) algorithms
    * Simplify with Dask, though Dask is not strictly necessary
* Apache Arrow / PyArrow (https://arrow.apache.org/docs/python/memory.html)
* Honorable mention for feature engineering: Vaex

## Roadmap for the large-scale tooling journey

<img src='images/flow-base.png' width=800/>