# Machine Learning (and Feature Engineering) at Scale

## Feature Engineering at Scale

Some feature engineering operations are "embarrassingly parallel": each record can be processed independently of every other record, so any parallel processing scheduler (e.g., Spark, Dask) can apply them to the data. SparkML calls these `Transformers`.

Examples include:
* Normalizing text (e.g., lowercasing, removing stop words)
* Thresholding & Bucketing
* Polynomial Expansion
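
As a minimal sketch of why these operations parallelize so easily, the snippet below applies the three kinds of transform to each record without looking at any other record. The stop-word list, field names, and bucket boundaries are hypothetical, and the thread pool merely stands in for a scheduler such as Spark or Dask.

```python
from concurrent.futures import ThreadPoolExecutor

STOP_WORDS = {"the", "a", "an", "of"}  # hypothetical, tiny stop-word list


def normalize_text(text):
    """Lowercase and drop stop words; needs no knowledge of other records."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


def bucketize(value, splits):
    """Return the index of the bucket that `value` falls into."""
    return sum(value >= s for s in splits)


def transform_record(record):
    return {
        "text": normalize_text(record["text"]),
        "age_bucket": bucketize(record["age"], splits=[18, 65]),
        "age_squared": record["age"] ** 2,  # degree-2 polynomial term
    }


def transform_partition(partition):
    # Each partition is handled independently, so partitions can run anywhere.
    return [transform_record(r) for r in partition]


partitions = [
    [{"text": "The Quick Fox", "age": 12}],
    [{"text": "A Tale of Two Cities", "age": 70}],
]

with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(transform_partition, partitions))
```

Because `transform_record` carries no state learned from the data, the partitions can be processed in any order and on any worker.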
    
Others require at least *some* information gained from looking at all (or a big chunk) of the data. In data-parallel terminology, we'd say these rely on some kind of "reduce" operation to acquire state, which is then used for the per-record processing. SparkML calls these `Estimators` (although SparkML Estimators also include parallel algorithms -- see below).

Examples:
* Standard scaling (we need the mean and std dev from at least a suitable sample of the data)
* Category encoding, one-hot encoding (we need to find all the unique values of the field)
* Deskewing (the Box-Cox deskew requires a parameter, "lambda," which depends on the skewed data)
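
The "reduce, then per-record processing" pattern can be sketched for standard scaling as follows. Each partition produces a small summary, the summaries are merged, and the resulting mean and standard deviation are then applied record by record. The function names are hypothetical, and the population (not sample) standard deviation is an assumption made for brevity.

```python
import math
from functools import reduce


def partial_stats(partition):
    """Per-partition summary: (count, sum, sum of squares)."""
    return (len(partition), sum(partition), sum(x * x for x in partition))


def merge_stats(a, b):
    """Combine two summaries; associative, so it can serve as a reduce step."""
    return tuple(x + y for x, y in zip(a, b))


def fit_scaler(partitions):
    """'Fit' phase: one pass over all partitions to acquire state."""
    n, total, total_sq = reduce(merge_stats, (partial_stats(p) for p in partitions))
    mean = total / n
    variance = total_sq / n - mean ** 2  # population variance (an assumption)
    return mean, math.sqrt(variance)


def scale_partition(partition, mean, std):
    """'Transform' phase: embarrassingly parallel once mean/std are known."""
    return [(x - mean) / std for x in partition]


partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]
mean, std = fit_scaler(partitions)
scaled = [scale_partition(p, mean, std) for p in partitions]
```

The sum-of-squares formula keeps the sketch short but can lose precision on large or poorly scaled data; production tools generally use more numerically stable merges.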

These can require a bit of thought depending on what large-data processing system you're using. Most designed-for-big-data ML tools handle these issues in a sensible way, but it's good to check.

## Model Training at Scale

For most algorithms, __parallel, scale-out training requires a fundamentally different algorithm from the local/serial implementation__. So we usually cannot take these traditional implementations (e.g., most of scikit-learn or R) and just magically run them on a cluster via Spark/Dask/etc.

Many (most?) scale-out implementations rely on a parallel implementation of a numerical optimizer (a convex optimizer, in the classical cases) to find good parameter values for parametric models, e.g.
* Stochastic Gradient Descent (SGD) and related optimizers (Adagrad, Nesterov, etc.) in deep learning frameworks like TensorFlow
* Second-Order/Quasi-Newton methods for Generalized Linear Models in SparkML
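
To illustrate how an optimizer loop can be made data-parallel, here is a minimal sketch in which each partition computes a partial gradient, the partial results are summed, and a single update is applied. It uses plain full-batch gradient descent on a one-feature linear model rather than SGD or a quasi-Newton method, purely for brevity; the function names, learning rate, and data are all hypothetical.

```python
def partial_gradient(partition, w, b):
    """Sum of squared-error gradients over one partition, plus its row count."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return gw, gb, len(partition)


def train(partitions, lr=0.05, steps=2000):
    w = b = 0.0
    for _ in range(steps):
        # "Map": every partition computes its gradient independently.
        partials = [partial_gradient(p, w, b) for p in partitions]
        # "Reduce": sum the partial gradients and row counts.
        gw = sum(p[0] for p in partials)
        gb = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        # Single parameter update, then the new values go back to the workers.
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Toy data drawn from y = 2x + 1, split across two partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = train(partitions)
```

In a real cluster the list comprehension would be replaced by a distributed map, and the sums by an all-reduce or a driver-side aggregation.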

<img src='https://materials.s3.amazonaws.com/i/sgd.jpg' width=600 align=left>

Hints for parallelizing on multiple nodes with TensorFlow/PyTorch:
* You may want to use Spark/Dask first for feature engineering
* Use Horovod (https://github.com/horovod/horovod) for easy scaling
* Use Petastorm (https://github.com/uber/petastorm) to convert columnar big data (Parquet) to dense row-based data that these tools prefer to consume

So although we can't magically make our local implementations scale out, we can borrow the optimizers from these other systems -- like Spark, TensorFlow, or PyTorch -- to get scale-out training on any problem we can formulate as an optimization.

And for almost all common cases, existing tools can help. For example, XGBoost has parallel (and GPU-parallel) implementations that you can download and use directly.

Where we can't use these scale-out approaches, look for tricks like statistical approximations. For example, k-Nearest-Neighbors doesn't scale in the naive implementation but, e.g., SparkML includes an Approximate Nearest Neighbors implementation which does scale (relying on locality-sensitive hashing).
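
To give a feel for locality-sensitive hashing, the sketch below uses random-hyperplane hashing: vectors pointing in similar directions tend to land in the same bucket, so a query only needs to be compared against its own bucket instead of the whole dataset. This is a simplified illustration, not SparkML's implementation; the function names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (normal vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector ids by signature; each partition could build its own index."""
    buckets = defaultdict(list)
    for i, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(i)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Approximate neighbors: only the vectors sharing the query's bucket."""
    return buckets.get(signature(query, hyperplanes), [])


hyperplanes = make_hyperplanes(num_planes=8, dim=3)
vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
buckets = build_index(vectors, hyperplanes)
found = candidates([1.0, 2.0, 3.0], buckets, hyperplanes)
```

The result is approximate: true neighbors can occasionally fall into a different bucket, which practical implementations mitigate by using several hash tables.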

## Model Inference at Scale

You have a model which you're happy with and you want to perform inference (or scoring) on a large dataset.

This case is an easy (embarrassingly parallel) problem, which you can approach in whatever way you like. One key point: if you choose a portable format to represent your featurization pipeline and model (e.g., PMML, PFA, ONNX), then you can perform inference in a system that is totally separate from the training environment.

For example, you can:
* Train a neural net with TensorFlow, but then use Spark to perform bulk inference
* Use Spark to train a model, but then use, say, Kubernetes, Docker, and Microsoft's ONNX Runtime for a scalable inference service
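
The separation between training and inference can be sketched in a few lines. Here a JSON string stands in for a portable model format such as ONNX or PMML (an assumption made to keep the example dependency-free), and a thread pool stands in for the bulk-scoring system. The model parameters and function names are hypothetical.

```python
import json
import math
from concurrent.futures import ThreadPoolExecutor

# "Training side": export hypothetical logistic-regression parameters.
exported = json.dumps({"weights": [0.5, -0.25], "bias": 0.1})

# "Inference side": only the exported artifact is needed, not the trainer.
model = json.loads(exported)


def predict(features):
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(partition):
    # No state is shared between partitions, so they can be scored anywhere.
    return [predict(row) for row in partition]


partitions = [[[1.0, 2.0], [0.0, 0.0]], [[4.0, 0.4]]]

with ThreadPoolExecutor() as pool:
    scores = list(pool.map(score_partition, partitions))
```

Since scoring a row depends only on the row and the fixed model, adding workers scales throughput without any coordination between them.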