## Spark's ML Library

### Learning Objectives:
* Understanding Spark as an ecosystem
* Machine Learning on Spark
* The ML library vs MLlib vs spark-sklearn
* Main classes in ML
* Use cases

### Understanding Spark as an Ecosystem

* ETL
* ML
* Streaming
* In-memory database (SQL)

![sparkroadmap](images/ecosystem.png)

### Machine Learning on Spark

![sparkroadmap](images/our-spark-roadmap.png)

#### Motivation:

* When your data is colossal and needs a cluster
* Because dumb models on big data are (usually) better than smart models on small data
* When you need to retrain models regularly
* When you're using a Spark ecosystem (ETL, stream processing, in-memory database, etc.)
* A (very) active [bazaar](https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar)

#### Why Spark ML is fast
* In-memory processing
* Smart DAG scheduling
* TorrentBroadcast intermediate models
* Tree aggregation
* Hashing functions
* Probabilistic data structures
* Static vs dynamic typing (motivation for Scala over Python)
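
As a rough illustration of the tree aggregation bullet: instead of every partition sending its partial result straight to the driver, partial results are merged in rounds, so no single node has to combine all of them. The sketch below is plain Python, not Spark code; `tree_aggregate` and its `fan_in` parameter are hypothetical names chosen for this example.

```python
from functools import reduce
from operator import add


def tree_aggregate(partials, combine, fan_in=2):
    """Merge partial results level by level, `fan_in` at a time.

    Toy stand-in for the idea behind tree aggregation: each round shrinks
    the number of partial results, so no single node merges all of them.
    """
    if not partials:
        raise ValueError("need at least one partial result")
    level = list(partials)
    rounds = 0
    while len(level) > 1:
        # Group neighbours and merge each group into a single value.
        level = [reduce(combine, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        rounds += 1
    return level[0], rounds


# Eight "partitions", each holding a partial sum.
partial_sums = [1, 2, 3, 4, 5, 6, 7, 8]
total, rounds = tree_aggregate(partial_sums, add)
print(total, rounds)  # 36 3
```

With a fan-in of 2, eight partial results are merged in three rounds rather than in one eight-way merge on a single node.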

![torrent](images/torrent.png)
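
The hashing bullet above covers tricks such as feature hashing, which `HashingTF` relies on later in this notebook: each token is hashed to a column index, so no vocabulary has to be built or shipped around the cluster. The sketch below is plain Python under stated assumptions: `hash_features` is a hypothetical helper, and it uses `zlib.crc32` only because it is deterministic across runs, whereas Spark uses a different hash function (MurmurHash3).

```python
import zlib
from collections import Counter


def hash_features(tokens, num_features=16):
    """Map tokens to a sparse {index: count} vector via the hashing trick."""
    counts = Counter()
    for token in tokens:
        # Deterministic hash -> bucket index in [0, num_features).
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


vector = hash_features(["spark", "ml", "spark", "rocks"])
print(vector)
```

Distinct tokens can collide in the same bucket; a larger `num_features` makes collisions rarer at the cost of a wider vector.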

### The ML library vs MLlib vs spark-sklearn

![quote](images/quote.png)

* Spark is moving from RDDs towards DataFrames
* MLlib is Spark's machine learning library for RDDs
* ML operates on DataFrames instead
* ML is more flexible and versatile, thanks to DataFrames

![comparison](images/RDD_dataframe_comparison.png)

Source: *High Performance Spark* (early release)

#### Where are they now?
* MLlib is currently in maintenance mode
 - It is not being actively developed
* Spark encourages developers to develop for ML instead of MLlib

#### spark-sklearn [package](https://github.com/databricks/spark-sklearn)

* Scikit-learn integration package for Apache Spark
* Distributes multi-processed components of sklearn across a cluster
 - Works well for grid search
 - Easy conversions between Spark DataFrames and NumPy arrays

In [None]:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
# Use spark_sklearn's grid search instead of scikit-learn's:
from spark_sklearn import GridSearchCV

digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],  # recent scikit-learn versions require >= 2
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
# Assumes `sc` is an existing SparkContext; spark_sklearn takes it as the first argument.
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)
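
To see why grid search parallelises well, it helps to count the candidates. The sketch below is plain Python (no Spark or scikit-learn needed) and enumerates a grid of the same shape as the one above. Each combination is an independent model fit, which is what makes the search easy to spread across a cluster.

```python
from itertools import product

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}

# Every combination of values is one candidate model to fit.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]
print(len(candidates))  # 864
```

With k-fold cross-validation each candidate is fitted k times, so the number of independent fits is 864 × k.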

#### Order of Operations

1. ML
2. MLlib
3. spark-sklearn

### Main classes in ML

Three abstract classes in [Spark's ML library](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html):

1. `Transformers`: transform one dataset into another, normally by appending a new column to your DataFrame.
2. `Estimators`: the statistical models that we estimate. The ML library has a suite of classification, regression, and clustering algorithms.
3. `Pipelines`: offer an end-to-end transformation-estimation process with distinct stages. A pipeline ingests raw data as a DataFrame and performs the transformations needed to estimate a statistical model. It often consists of multiple transformers that feed into an estimator.
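
The relationship between the three classes can be sketched in a few lines of plain Python. This is a toy model of the contract, not Spark's implementation: rows are dicts rather than DataFrame rows, and `Lowercase`, `MeanLength`, `MeanLengthModel` and `ToyPipeline` are hypothetical names. The point is that an estimator's `fit` returns a fitted model that is itself a transformer, and that a pipeline fits its stages in order.

```python
class Lowercase:
    """Transformer: adds a new column derived from an existing one."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col].lower()} for r in rows]


class MeanLengthModel:
    """Fitted model: a transformer that carries learned state."""

    def __init__(self, input_col, mean_length):
        self.input_col, self.mean_length = input_col, mean_length

    def transform(self, rows):
        # Predict 1.0 when the text is longer than the mean seen in training.
        return [{**r, "prediction": float(len(r[self.input_col]) > self.mean_length)}
                for r in rows]


class MeanLength:
    """Estimator: fit() learns from data and returns a model."""

    def __init__(self, input_col):
        self.input_col = input_col

    def fit(self, rows):
        mean = sum(len(r[self.input_col]) for r in rows) / len(rows)
        return MeanLengthModel(self.input_col, mean)


class ToyPipeline:
    """Fits stages in order, feeding each stage's output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipeline(fitted)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"tweet": "Hi"}, {"tweet": "Spark ML is fun"}]
pipeline = ToyPipeline([Lowercase("tweet", "text"), MeanLength("text")])
model = pipeline.fit(train)
print(model.transform([{"tweet": "A much longer tweet than average"}]))
```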

#### Transformers

* Takes a DataFrame, returns a transformed DF
* Uses the `transform` method
* Note that this normally creates new columns

Examples: `ChiSqSelector`, `CountVectorizer`, `Normalizer`, `OneHotEncoder`, `Tokenizer`, `StopWordsRemover`

In [None]:
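from pyspark.ml.feature import RegexTokenizer

# Assumes `df` is a DataFrame with a string column named "tweet".
# Split on whitespace, commas, periods and double quotes (raw string avoids invalid-escape warnings).
tokenizer = RegexTokenizer(inputCol="tweet", outputCol="words", pattern=r'\s+|[,."]')
transformed_df = tokenizer.transform(df)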
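
To see what the `pattern` above does, the same regular expression can be tried in plain Python. This is an approximation that assumes `RegexTokenizer` is left at its defaults (the pattern matches the gaps between tokens, tokens are lowercased, and empty tokens are dropped). `tokenize` is a hypothetical helper, and Spark itself uses Java regular expressions, which behave the same way for this simple pattern.

```python
import re

# Same separators as the tokenizer above: whitespace, comma, period, double quote.
PATTERN = re.compile(r'\s+|[,."]')


def tokenize(text):
    """Lowercase, split on the separator pattern, and drop empty tokens."""
    return [token for token in PATTERN.split(text.lower()) if token]


print(tokenize('Hello, "Spark" world.'))  # ['hello', 'spark', 'world']
```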

#### Estimators

* Fit models to data
* Use the `fit` method, which returns a fitted model; the model's `transform` method makes predictions


Examples:

| Classification | Regression | Clustering | Recommendation | 
|----|-----|------|------|
| `LogisticRegression` | `LinearRegression` | `KMeans` | `ALS` | 
| `RandomForestClassifier` (multiclass) | `RandomForestRegressor` | `LDA`| |
| `GBTClassifier` | `GBTRegressor` | `GaussianMixture` | |
| `NaiveBayes` |  | | |

In [None]:
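from pyspark.ml.classification import RandomForestClassifier

# Assumes `transformed_df` and `test` have a "features" vector column and a numeric "lang" label.
forestizer = RandomForestClassifier(labelCol="lang", featuresCol="features", numTrees=10)
model = forestizer.fit(transformed_df)  # fit() returns a fitted model
y_hat = model.transform(test)           # transform() appends prediction columns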

#### Pipelines

* Quick assembly of ML pipelines
  - feature extraction -> dimensionality reduction -> model training
* Uses `fit` and `transform`
* Since transformers add columns, stages are chained by matching one stage's `outputCol` to the next stage's `inputCol`

In [None]:
from pyspark.ml.classification import RandomForestClassifier
import pyspark.ml.evaluation as ev
from pyspark.ml.feature import RegexTokenizer, HashingTF, IDF
from pyspark.ml import Pipeline

# Assumes `df` has a string "tweet" column and a numeric 0/1 "lang" label column.
tokenizer = RegexTokenizer(inputCol="tweet", outputCol="words", pattern=r'\s+|[,."]')
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=200)
idf = IDF(inputCol="rawFeatures", outputCol="features")
forestizer = RandomForestClassifier(labelCol="lang", featuresCol="features", numTrees=10)

# Each stage's outputCol feeds the next stage's inputCol.
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, forestizer])

tweets_train, tweets_test = df.randomSplit([0.7, 0.3], seed=123)
model = pipeline.fit(tweets_train)
test_model = model.transform(tweets_test)

evaluator = ev.BinaryClassificationEvaluator(rawPredictionCol='probability', labelCol='lang')
print('AUC for Random Forest:', evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'}))
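
The `areaUnderROC` metric reported above has a simple interpretation: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by brute force in plain Python. `auc` is a hypothetical helper meant for intuition on tiny inputs, not a description of how Spark computes the metric.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    # Count pairs where the positive outranks the negative; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.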

#### Other Fun Stuff

* Evaluation
* Cross-validation
* Grid search
* Train/Test Splits
* Linear Algebra
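
Cross-validation, listed above, boils down to rotating which slice of the data is held out. A minimal sketch in plain Python follows, with `k_fold_indices` as a hypothetical helper. It assigns rows to folds deterministically for readability; Spark's own `CrossValidator` works on DataFrames and assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    for fold in range(k):
        # Every k-th row, offset by the fold number, is held out.
        validation = indices[fold::k]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


for train, validation in k_fold_indices(6, 3):
    print(train, validation)
```

Each row lands in exactly one validation fold, so every row is used for both training and evaluation across the k rounds.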

### Use cases

* Recommender systems
* Fraud/Anomaly detection
* Deep learning
* Genomics
* Deep Learning CV and hyperparameter tuning

### Concluding Remarks

* Spark ML is a powerful tool for ML at scale
* It allows for data projects built in a Spark ecosystem
* It evolves fast.  Very fast.  Don't expect it to be future-proof
* Looking for an open source project?  Develop for ML!

### Other Resources

* [Spark ML Documents](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html)
* [Learning PySpark](https://www.amazon.com/Learning-PySpark-Tomasz-Drabas/dp/1786463709)
* [High Performance Spark (Currently early release)](http://shop.oreilly.com/product/0636920046967.do?cmp=af-strata-books-videos-product_cj_9781491943137_%25zp)
* [Lessons for Large-Scale Machine Learning (Somewhat Outdated)](http://go.databricks.com/large-scale-machine-learning-deployments-spark-databricks)