# Three frameworks in 59 minutes (more or less)

Let's take a deeper dive into each of these tools and get into some code and architecture.

For time reasons, and because we want to demo where each tool works well, we'll look at just a few key use cases.

## Apache Spark

__Data access__

In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

We can create a Spark dataframe from SQL. A Spark dataframe is really a query, not a dataset, so it's closer to a VIEW in the RDBMS world.

In [None]:
spark.sql('SELECT * FROM parquet.`data/california`')

Because the dataframe is only a query, nothing is read or processed until we explicitly ask for it, for example by calling `.show()`

In [None]:
spark.sql('SELECT * FROM parquet.`data/california`').show()

In [None]:
spark.sql('SELECT origin, AVG(delay) as delay FROM parquet.`data/california` GROUP BY origin HAVING count(1) > 500 ORDER BY delay DESC').show()

__Data manipulation__

In [None]:
df = spark.read.csv('data/diamonds.csv', inferSchema=True, header=True)

df

In [None]:
df.show()

We can manipulate Spark dataframes with the classic PySpark API

In [None]:
df.drop('_c0').withColumnRenamed('price', 'label')

In [None]:
import pyspark.sql.functions as fn

df.groupby(fn.ceil('carat')).mean('price').orderBy('ceil(carat)').show()

In recent versions of Spark (3.2 and later), we can also use the pandas API (although a number of caveats come with this approach)

In [None]:
import pyspark.pandas as ps

df.pandas_api()[:5]

In [None]:
psdf = df.pandas_api().drop(columns='_c0').rename(columns={'price':'label'})

psdf[:5]

In [None]:
ps.get_dummies(psdf)[:5]

__Architecture__

<img src="http://i.imgur.com/h621Rva.png" width="600px"></img>

---

## Dask


__Cluster creation and dashboards__

In [None]:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='1GB')

client = Client(cluster)

client

* Dashboard
* JupyterLab plugin

__Arrays__

Dask Array is a virtual, lazy large array composed of chunks, each of which will be a NumPy array (in the default configuration) when loaded

In [None]:
import dask.array as da

arr = da.random.random((200, 200), chunks=(50, 40))

arr
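To make the chunk layout concrete, here is a rough NumPy-only sketch of how a `(200, 200)` array with `chunks=(50, 40)` breaks into a grid of blocks. It illustrates the layout only, not how Dask stores or schedules anything, and the variable names are ours.

```python
import numpy as np

shape = (200, 200)
chunks = (50, 40)

# Blocks per axis (ceiling division would also cover uneven edge chunks)
blocks_per_axis = tuple(-(-s // c) for s, c in zip(shape, chunks))

full = np.random.random(shape)

# Slice the full array into a nested list of blocks, row of blocks by row of blocks
blocks = [
    [full[i:i + chunks[0], j:j + chunks[1]] for j in range(0, shape[1], chunks[1])]
    for i in range(0, shape[0], chunks[0])
]

# Stitching the blocks back together recovers the original array
reassembled = np.block(blocks)
```

Each of the 20 blocks is an ordinary `(50, 40)` NumPy array, which is the unit of work Dask hands to a worker.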

Dask Array aims to implement most of the NumPy API, so we use that API for most operations

In [None]:
arr @ arr.T

Because the data structure is virtual, we need to tell Dask explicitly what we want to `.compute()` or write out (e.g., via `.to_zarr()`)

In [None]:
(arr @ arr.T).compute()

In [None]:
da.linalg.svd((arr @ arr.T).rechunk((200, 20))) # chunks passed as one tuple; returns (u, s, v)

In [None]:
da.linalg.svd((arr @ arr.T).rechunk((200, 20)))[1].compute() # singular values
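A quick sanity check on what those singular values should look like: `arr @ arr.T` is $A A^T$, and $A = U \Sigma V^T$ implies $A A^T = U \Sigma^2 U^T$, so its singular values are the squares of the singular values of `arr`. The sketch below checks this with plain NumPy on a small seeded matrix; the size and seed are arbitrary choices, not part of the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((20, 20))

# Singular values only, returned in descending order
s_a = np.linalg.svd(a, compute_uv=False)
s_gram = np.linalg.svd(a @ a.T, compute_uv=False)

# Singular values of A A^T should equal the squared singular values of A
matches = np.allclose(s_gram, s_a ** 2)
```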

__Architecture__

<img src='images/dask.svg'>

__Parallelizing Python__

Dask has two different APIs for parallelizing Python code. Here we'll look at `delayed`.

In [None]:
from dask import delayed
import numpy as np

@delayed
def get_data(i):
    return np.array([i, i+1, i+2])

get_data(7)

A Delayed is a proxy object (it can "handle" most normal operations/messages and internally records them into a compute graph).

In [None]:
get_data(7).compute()

As the root of a compute graph, a Delayed can also be told to compute (`.compute()`), cache (`.persist()`), or explain itself (`.dask` or `.visualize()`)

In [None]:
some_numbers = get_data(7)

some_more = get_data(100)

total = np.sum(some_numbers) + np.sum(some_more)

total

In [None]:
total.visualize()

In [None]:
total.compute()
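The record-now, run-later behavior above can be imitated in a few lines of plain Python. This is a toy sketch to build intuition, not how Dask is implemented: the `Lazy` class and `lazy` decorator are hypothetical names, and there is no scheduler or parallelism here.

```python
import operator

class Lazy:
    """Toy proxy: records a function and its inputs instead of running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __add__(self, other):
        # Adding two proxies just records another node in the graph
        return Lazy(operator.add, self, other)

    def compute(self):
        # Evaluate inputs first (recursively), then apply this node's function
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)

def lazy(func):
    # Wrap a function so that calling it builds a Lazy node
    return lambda *args: Lazy(func, *args)

calls = []  # records when the wrapped function really runs

@lazy
def get_values(i):
    calls.append(i)
    return [i, i + 1, i + 2]

total = lazy(sum)(get_values(7)) + lazy(sum)(get_values(100))
calls_before = list(calls)  # still empty: nothing has executed yet

result = total.compute()    # walks the recorded graph and runs it
```

Dask's real Delayed objects add what this toy leaves out: a task graph with named keys, deduplication of shared work, and a scheduler that can run independent branches in parallel.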

In [None]:
client.close()
cluster.close()

---

## Ray

In [None]:
import ray

ray.init(num_cpus=4)

__Data access__

Ray accesses data from storage or from other systems via Ray Data

In [None]:
dataset = ray.data.read_csv('data/breast_cancer.csv')

dataset.take(1)

__Prep and model training__

Ray Data is also capable of some data manipulation ("last-mile data prep")

In [None]:
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

Training is done through a standardized `Trainer` interface that allows for tree-based, deep-learning, or other distributed training use cases

In [None]:
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

scale = ScalingConfig(num_workers=2, use_gpu=False)

trainer = XGBoostTrainer(
    scaling_config = scale, label_column="target", num_boost_round=20,
    
    params={ "objective": "binary:logistic", "eval_metric": ["logloss", "error"] }, # XGBoost params
    
    datasets={"train": train_dataset, "valid": valid_dataset},
)

In [None]:
result = trainer.fit()
print(result.metrics)

__Architecture__

https://docs.ray.io/en/latest/cluster/key-concepts.html#key-concepts

<img src='images/ray-cluster.svg' width=700 />

__Prediction__

Batch prediction has a dedicated API (a separate API is used for fast/small prediction, which we'll see when we demo Ray Serve)

In [None]:
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

batch_predictor = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor)

demo_records = valid_dataset.drop_columns(['target'])

batch_predictor.predict(demo_records).to_pandas()

__Reinforcement Learning__

RL is a key use case with extensive design and support in Ray. It's outside our scope for today but definitely check it out.

<video src='images/cpv1.mp4' controls='true' autoplay='true' loop='true' width=500/>

See the latest examples at https://docs.ray.io/en/latest/rllib/index.html

In [None]:
ray.shutdown()