# Developing Distributed Models with Spark

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/](https://feng.li/)

<h1>Outline<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-move-code-to-data-philosophy" data-toc-modified-id="The-move-code-to-data-philosophy-1">The <em>move-code-to-data</em> philosophy</a></span></li><li><span><a href="#What-do-we-have-with-Spark?" data-toc-modified-id="What-do-we-have-with-Spark?-2">What do we have with Spark?</a></span></li><li><span><a href="#What-do-we-(statisticians)-miss-with-distributed-platforms?" data-toc-modified-id="What-do-we-(statisticians)-miss-with-distributed-platforms?-3">What do we (<em>statisticians</em>) miss with distributed platforms?</a></span></li><li><span><a href="#Why-is-it-difficult-to-develop-statistical-models-on-distributed-systems?" data-toc-modified-id="Why-is-it-difficult-to-develop-statistical-models-on-distributed-systems?-4">Why is it difficult to develop statistical models on distributed systems?</a></span></li><li><span><a href="#Spark-APIs-for-statisticians-to-develop-distributed-models" data-toc-modified-id="Spark-APIs-for-statisticians-to-develop-distributed-models-5">Spark APIs for statisticians to develop distributed models</a></span><ul class="toc-item"><li><span><a href="#UDFs-for-DataFrames-based-API" data-toc-modified-id="UDFs-for-DataFrames-based-API-5.1">UDFs for DataFrames-based API</a></span></li><li><span><a href="#RDD-API-with-linear-algebra-support" data-toc-modified-id="RDD-API-with-linear-algebra-support-5.2">RDD API with linear algebra support</a></span><ul class="toc-item"><li><span><a href="#Linear-algebra-and-optimization" data-toc-modified-id="Linear-algebra-and-optimization-5.2.1">Linear algebra and optimization</a></span></li><li><span><a href="#Random-variable-generator-and-distribution" data-toc-modified-id="Random-variable-generator-and-distribution-5.2.2">Random variable generator and distribution</a></span></li></ul></li></ul></li><li><span><a href="#Real-projects-on-Spark" data-toc-modified-id="Real-projects-on-Spark-6">Real projects on Spark</a></span><ul 
class="toc-item"><li><span><a href="#DLSA:-Least-squares-approximation-for-a-distributed-system" data-toc-modified-id="DLSA:-Least-squares-approximation-for-a-distributed-system-6.1">DLSA: Least squares approximation for a distributed system</a></span></li><li><span><a href="#Distributed-quantile-regression-by-pilot-sampling-and-one-step-updating." data-toc-modified-id="Distributed-quantile-regression-by-pilot-sampling-and-one-step-updating.-6.2">Distributed quantile regression by pilot sampling and one-step updating.</a></span></li><li><span><a href="#Distributed-ARIMA-models-for-ultra-long-time-series" data-toc-modified-id="Distributed-ARIMA-models-for-ultra-long-time-series-6.3">Distributed ARIMA models for ultra-long time series</a></span></li></ul></li><li><span><a href="#Take-home-message" data-toc-modified-id="Take-home-message-7">Take home message</a></span></li></ul></div>

## The _move-code-to-data_ philosophy

- Traditional supercomputing requires repeated transmission of data between clients and servers. This works fine for computationally intensive work, but for data-intensive processing the data become too large to be moved around easily.


- A distributed system focuses on **moving code to data**.

- The clients send only the programs to be executed, and these programs are usually small.

- More importantly, data are broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides.

- The whole process is known as **MapReduce**.
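The idea can be illustrated with a minimal plain-Python sketch. It is not Spark code: the `partitions` list stands in for data already spread across workers, and the names `map_partition` and `combine` are hypothetical. Only a small function goes to each partition, and only a small summary comes back.

```python
from functools import reduce

# Hypothetical data, already split into partitions (one per worker).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

def map_partition(chunk):
    """Runs where the data lives and returns a small summary (sum, count)."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Merges two summaries; associative, so the merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])

# Map step: only the function travels to each partition.
summaries = [map_partition(chunk) for chunk in partitions]

# Reduce step: only the small summaries travel back.
total, count = reduce(combine, summaries)
global_mean = total / count
print(global_mean)
```

The raw observations never leave their partition; the master sees three `(sum, count)` pairs regardless of how large each partition is.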

![MapReduce](figures/spark.png)

![Spark-GitHub](figures/spark-github.png)

![ms](figures/ms.png)

## What do we have with Spark?

![Spark-ML](figures/spark-ml.png)

## What do we (_statisticians_) miss with distributed platforms?

- Interpretable statistical models such as **GLM** and **Time Series Forecasting Models**.

- Efficient Bayesian inference tools such as __MCMC__, __Gibbs sampling__ and __Variational Inference__.

- Distributed statistical visualization tools comparable to `ggplot2`, `seaborn` and `plotly`.

- ...

## Why is it difficult to develop statistical models on distributed systems?


-- _Especially for statisticians_


- __No unified solutions__ for deploying conventional statistical methods on distributed computing platforms.

- __Steep learning curve__ for using distributed computing.

- Difficulty balancing __estimator efficiency and communication cost__.

- __Unrealistic model assumptions__, e.g., requiring the data to be randomly distributed across workers.

## Spark APIs for statisticians to develop distributed models

### UDFs for DataFrames-based API

- User-Defined Functions (UDFs) are a feature of Spark that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task.

- The vectorized (pandas) UDF API is available in Spark (>= 2.3).

- UDFs can be written in PySpark (pandas UDFs require Apache `Arrow`) and in Scala.
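A grouped UDF follows a split-apply-combine pattern: the rows are split by a key, a user function is applied to each group, and the results are collected. The sketch below mimics that pattern in plain Python so that it runs without a cluster. The helper `grouped_apply`, the function `fit_line` and the toy rows are all hypothetical; in PySpark (>= 3.0) the corresponding step would be written with `groupBy(...).applyInPandas(...)`.

```python
from collections import defaultdict

def fit_line(rows):
    """User-defined function: OLS fit of y on x for one group of rows."""
    n = len(rows)
    mean_x = sum(r["x"] for r in rows) / n
    mean_y = sum(r["y"] for r in rows) / n
    sxx = sum((r["x"] - mean_x) ** 2 for r in rows)
    sxy = sum((r["x"] - mean_x) * (r["y"] - mean_y) for r in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x, "n": n}

def grouped_apply(rows, key, func):
    """Plain-Python stand-in for a grouped UDF: split, apply, collect."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: func(g) for k, g in groups.items()}

# Toy data: two groups with different linear relationships.
rows = [
    {"group": "a", "x": 0.0, "y": 1.0},
    {"group": "a", "x": 1.0, "y": 3.0},
    {"group": "a", "x": 2.0, "y": 5.0},
    {"group": "b", "x": 0.0, "y": 0.0},
    {"group": "b", "x": 2.0, "y": -2.0},
]
fits = grouped_apply(rows, "group", fit_line)
print(fits)
```

The statistical code lives entirely in `fit_line`, which is ordinary single-machine code; the framework is responsible only for routing each group to it.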

### RDD API with linear algebra support


- MLlib uses linear algebra packages [`Breeze`](http://www.scalanlp.org/), [`dev.ludovic.netlib`](https://github.com/luhenry/netlib), and [`netlib-java`](https://github.com/fommil/netlib-java) for optimized numerical processing.

- Only available in Scala. 

- Steep learning curve.

#### Linear algebra and optimization

- __`ml.linalg.`__ `Matrix()`, `DenseMatrix()`, `SparseMatrix()`

- __`mllib.linalg.`__ `SingularValueDecomposition()`, `QRDecomposition()`

- __`mllib.linalg.distributed.`__ `BlockMatrix()`, `CoordinateMatrix()`, `IndexedRow()`, `IndexedRowMatrix()`, `RowMatrix()`

- __`mllib.optimization.`__ `LBFGS()`, `GradientDescent()`
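To see why row-distributed matrices suit statistical work, note that the Gram matrix $X^\top X$ is a sum of per-partition Gram matrices, so each worker can compute its own piece. The sketch below shows this in plain Python with hypothetical helper names (`local_gram`, `add_matrices`); it illustrates the idea only and is not Spark's implementation. In Spark, `RowMatrix` offers `computeGramianMatrix()` for the same quantity.

```python
from functools import reduce

def local_gram(rows):
    """X_k^T X_k for the rows stored on one partition."""
    p = len(rows[0])
    gram = [[0.0] * p for _ in range(p)]
    for row in rows:
        for i in range(p):
            for j in range(p):
                gram[i][j] += row[i] * row[j]
    return gram

def add_matrices(a, b):
    """Element-wise sum of two p-by-p matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# A 3-by-2 matrix whose rows are split over two hypothetical partitions.
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]

# Each partition contributes a p-by-p summary; the master only adds them up.
gram = reduce(add_matrices, (local_gram(part) for part in partitions))
print(gram)
```

The communication cost depends on the number of columns $p$, not on the number of rows, which is what makes this decomposition attractive when $N$ is huge and $p$ is moderate.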

#### Random variable generator and distribution

- __`mllib.random.`__ `GammaGenerator()`, `LogNormalGenerator()`, `PoissonGenerator()`, `StandardNormalGenerator()`, `UniformGenerator()`, `WeibullGenerator()`, `ExponentialGenerator()`

- __`mllib.stat.distribution.`__ `MultivariateGaussian()`

## Real projects on Spark


Code available at https://github.com/feng-li/dstats

### DLSA: Least squares approximation for a distributed system

in _Journal of Computational and Graphical Statistics, 2021_ (with Xuening Zhu & Hansheng Wang) https://doi.org/10.1080/10618600.2021.1923517


- We estimate the parameter $\theta$ on each worker separately, using only that worker's local data. This can be done efficiently with standard statistical estimation methods (e.g., maximum likelihood estimation).

- Each worker passes the local estimator of $\theta$ and its asymptotic
  covariance estimate to the master.

- A weighted least squares-type objective function can then be constructed on the master. This can be viewed as a local quadratic approximation of the global log-likelihood function.
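The three steps above can be sketched for a scalar parameter. The sketch assumes one standard form of the combined estimator, $\tilde\theta = \big(\sum_k \alpha_k \hat\Sigma_k^{-1}\big)^{-1} \sum_k \alpha_k \hat\Sigma_k^{-1} \hat\theta_k$ with $\alpha_k = n_k/N$, which minimizes the weighted least squares objective $\sum_k \alpha_k (\theta - \hat\theta_k)^\top \hat\Sigma_k^{-1} (\theta - \hat\theta_k)$. The simulated data, the worker sizes and the function names are hypothetical, and the local model is simply a normal mean.

```python
import random
import statistics

random.seed(0)
true_theta = 2.0

# Hypothetical workers holding unequal amounts of data.
worker_data = [
    [random.gauss(true_theta, 1.0) for _ in range(n)]
    for n in (500, 1500, 3000)
]

def local_fit(sample):
    """On a worker: local estimate and estimated asymptotic variance."""
    theta_hat = statistics.fmean(sample)
    # Estimated variance of sqrt(n) * (theta_hat - theta) for a sample mean.
    sigma2_hat = statistics.variance(sample)
    return theta_hat, sigma2_hat, len(sample)

def combine(local_results):
    """On the master: minimize sum_k alpha_k * (theta - theta_k)^2 / sigma2_k."""
    total_n = sum(n for _, _, n in local_results)
    weights = [(n / total_n) / s2 for _, s2, n in local_results]
    numerator = sum(w * th for w, (th, _, _) in zip(weights, local_results))
    return numerator / sum(weights)

# Each worker sends three numbers; the master never sees the raw data.
local_results = [local_fit(sample) for sample in worker_data]
theta_tilde = combine(local_results)
print(theta_tilde)
```

Only one round of communication is needed, and each worker transmits a fixed-size summary regardless of its sample size.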

**Efficiency and cost effectiveness**

- We use a standard industrial-level Spark-on-YARN cluster on Alibaba Cloud, consisting of one master node and two worker nodes. Each node has 64 virtual cores, 64 GB of RAM and two 80 GB local SSD drives (cost: 300 RMB per day).

- We find that $26.2$ minutes are needed for DLSA.

- The traditional MLE takes more than $15$ hours and obtains an inferior result (cost: 187 RMB).

- That is a saving of about 97% in computing time (DLSA costs only about 6 RMB).
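The cost figures follow from prorating the daily rental by running time. The short calculation below reproduces them, assuming the cost scales linearly with time and treating "more than 15 hours" as exactly 15 hours (a lower bound). The variable names are illustrative.

```python
DAILY_COST_RMB = 300.0        # cluster rental per day, from the text
MINUTES_PER_DAY = 24 * 60

def cost_rmb(minutes):
    """Prorated rental cost for a job running the given number of minutes."""
    return DAILY_COST_RMB * minutes / MINUTES_PER_DAY

dlsa_minutes = 26.2
mle_minutes = 15 * 60         # lower bound for "more than 15 hours"

dlsa_cost = cost_rmb(dlsa_minutes)
mle_cost = cost_rmb(mle_minutes)
saving = 1 - dlsa_minutes / mle_minutes

print(f"DLSA: {dlsa_cost:.2f} RMB, MLE: {mle_cost:.2f} RMB, saving: {saving:.1%}")
```

This gives 187.5 RMB for the MLE and roughly 5.5 RMB for DLSA, which the slide reports as 187 RMB and 6 RMB, and a time saving of about 97%.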

### Distributed quantile regression by pilot sampling and one-step updating

in _Journal of Business and Economic Statistics, 2021_ (with Rui Pan, Tunan Ren, Baishan Guo, Guodong Li & Hansheng Wang) https://doi.org/10.1080/07350015.2021.1961789


- We draw a random sample of size $n$ from the distributed system, where $n$ is much smaller than the whole sample size $N$.

- Thereafter, a standard quantile regression estimator can be obtained on the master, which is referred to as the _pilot estimator_.

- To further enhance the statistical efficiency, we propose a one-step Newton-Raphson type algorithm to update the pilot estimator.
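A heavily simplified sketch of the pilot-then-update idea is given below for an intercept-only model, i.e., estimating a single quantile with no covariates. Everything specific here is an assumption made for illustration: the simulated data, the pilot size, the rule-of-thumb bandwidth `h` and the Gaussian kernel density estimate. It is not the estimator of the paper, only a toy instance of a one-step Newton-type update $q_1 = q_0 + (\tau - \hat F(q_0)) / \hat f(q_0)$.

```python
import math
import random
import statistics

random.seed(1)
tau = 0.5

# Hypothetical cluster: 4 workers with 5000 standard-normal observations each.
workers = [[random.gauss(0.0, 1.0) for _ in range(5000)] for _ in range(4)]
total_n = sum(len(w) for w in workers)

# Step 1: pilot sample of size n << N, gathered on the master.
pilot = sorted(x for w in workers for x in random.sample(w, 50))
q_pilot = pilot[int(tau * len(pilot))]          # crude sample quantile
# Rule-of-thumb bandwidth (an assumption, not the paper's choice).
h = 1.06 * statistics.stdev(pilot) * len(pilot) ** (-0.2)

def local_summary(data, q, h):
    """On a worker: count of x <= q and Gaussian-kernel sum at q."""
    below = sum(1 for x in data if x <= q)
    kernel = sum(math.exp(-0.5 * ((x - q) / h) ** 2) for x in data)
    return below, kernel

# Step 2: one round of communication, then a single Newton-type step.
summaries = [local_summary(w, q_pilot, h) for w in workers]
cdf_hat = sum(b for b, _ in summaries) / total_n
density_hat = sum(k for _, k in summaries) / (total_n * h * math.sqrt(2 * math.pi))
q_one_step = q_pilot + (tau - cdf_hat) / density_hat

print(q_pilot, q_one_step)
```

The pilot estimate uses only 200 observations, while the update uses all 20000 through two numbers per worker, so the full-sample information is recovered with a single extra pass.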

### Distributed ARIMA models for ultra-long time series

in [arXiv:2007.09577](https://arxiv.org/abs/2007.09577) (with Xiaoqian Wang, Yanfei Kang and Rob J Hyndman)


- We develop a novel distributed forecasting framework to tackle challenges associated with forecasting ultra-long time series. 

- The proposed model combination approach facilitates distributed time series forecasting by combining the local estimators of time series models delivered from worker nodes and minimizing a global loss function. 

- In this way, instead of unrealistically assuming the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only on the DGP of subseries spanning shorter time periods.
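The split-fit-combine idea can be illustrated with a toy AR(1) model in plain Python. This is a simplified sketch under stated assumptions rather than the framework of the paper: the simulated series, the number of subseries and the quadratic global loss $\sum_k w_k (\phi - \hat\phi_k)^2$ with weights proportional to the local precision are all choices made for illustration.

```python
import random

random.seed(2)
phi_true = 0.6

# Hypothetical ultra-long series simulated from an AR(1) process.
series = [0.0]
for _ in range(20000):
    series.append(phi_true * series[-1] + random.gauss(0.0, 1.0))

def split(series, n_chunks):
    """Cut the series into contiguous subseries, one per worker."""
    size = len(series) // n_chunks
    return [series[i * size:(i + 1) * size] for i in range(n_chunks)]

def local_ar1(sub):
    """On a worker: least-squares AR(1) coefficient and its precision weight."""
    sxx = sum(x * x for x in sub[:-1])
    sxy = sum(x * y for x, y in zip(sub[:-1], sub[1:]))
    # The variance of the estimate is roughly proportional to 1 / sxx.
    return sxy / sxx, sxx

def combine(local_results):
    """On the master: minimize sum_k w_k * (phi - phi_k)^2 over phi."""
    total_weight = sum(w for _, w in local_results)
    return sum(phi * w for phi, w in local_results) / total_weight

local_results = [local_ar1(sub) for sub in split(series, 10)]
phi_combined = combine(local_results)
print(phi_combined)
```

Each subseries is modeled on its own, so the local coefficients are free to differ; the combination step only assumes that a common summary is useful for forecasting.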


![DARIMA](figures/darima.png)

## Take home message

- Distributed modeling, computing and visualization are the future of statistics. 

- Spark is not the only software for distributed statistical computing, but it is the easiest one.