## Distributed Model Training

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: June 5, 2024

---

### SOURCES: 

- [Distributed Machine Learning Frameworks and its Benefits](https://www.xenonstack.com/blog/distributed-ml-framework#:~:text=In%20distributed%20machine%20learning%2C%20model,and%20training%20each%20split%20separately.)

- [Distributed Training with Azure](https://learn.microsoft.com/en-us/azure/machine-learning/concept-distributed-training?view=azureml-api-2/)

- [Distributed Training: Guide for Data Scientists](https://neptune.ai/blog/distributed-training)

Need to Review:

- [Distributed model training II: Parameter Server and AllReduce](http://www.juyang.co/distributed-model-training-ii-parameter-server-and-allreduce/)

- [Distributed Machine Learning and the Parameter Server](https://www.cs.cornell.edu/courses/cs4787/2019sp/notes/lecture22.pdf)



### OBJECTIVES

- Explain uses of distributed model training

### CONCEPTS

- Data parallelism and model parallelism
- Asynchronous training vs Synchronous training 
- Parameter server algorithm
- All-reduce algorithm

---

### I. Why use Distributed Model Training?


For massive training sets, it may not be possible to train a model on a single machine. 

This is often the case for deep learning models, where the training data or the model itself can exceed the memory of a single machine.

In *distributed training*, the workload to train a model is split up and shared among worker nodes. 

The concepts, benefits, and challenges are similar to what we've learned earlier.

The work can be parallelized to speed up training.

This process introduces benefits but also complexity.

### II. Data Parallelism and Model Parallelism

Two main types of distributed training: *data parallelism* and *model parallelism*.

#### Data Parallelism

This follows the approach used by Spark

Data is divided into partitions

Number of partitions = total number of available nodes

Model is copied in each worker node

Each node operates on its subset of data



<img src="./data_parallelism.png" width=600>

Each node must:

- Independently compute errors between training sample predictions and labels
- Update its model based on errors
- Communicate all of its changes to the other nodes to update their corresponding models

Worker nodes need to synchronize gradients at end of batch computation to ensure they're training a consistent model.
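The data-parallel steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the one-parameter model `y = w * x`, the helper names `gradient` and `partition`, and the toy data are all assumptions made for the example. It shows that when shards are equal in size, averaging the per-worker gradients gives the same result as computing the gradient on the full dataset.

```python
# Data-parallel gradient sketch for a 1-D linear model y = w * x (illustrative only).

def gradient(w, shard):
    """Mean-squared-error gradient dL/dw over one shard of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def partition(data, num_workers):
    """Split data into num_workers equal-size shards (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 toy samples from y = 3x
w = 0.0                                            # every worker starts from the same weight
shards = partition(data, num_workers=4)

# Each worker computes a gradient on its own shard, using its own model copy.
local_grads = [gradient(w, shard) for shard in shards]

# Synchronization step: average the local gradients across workers.
avg_grad = sum(local_grads) / len(local_grads)

# With equal-size shards, the average matches the full-data gradient.
full_grad = gradient(w, data)
print(avg_grad, full_grad)
```

If the shards had unequal sizes, the plain average would need to be weighted by shard size to match the full-data gradient.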

#### Model Parallelism

Model is segmented into different parts that run concurrently in different nodes

Each model part runs on same data

Scalability depends on degree of task parallelization of algorithm

Worker nodes need to synchronize shared parameters, usually once for each forward or backward-propagation step. 

More complex to implement than data parallelism
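As a rough sketch of model parallelism, the example below splits a tiny two-layer network by layer across two hypothetical nodes. The `Stage` class, the weights, and the layer-wise split are assumptions chosen for illustration; real systems may segment a model in other ways. The point is that each node holds only its own part of the model, and intermediate activations must be passed between nodes.

```python
# Model-parallel sketch: a two-layer network split across two hypothetical nodes.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def relu(vector):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in vector]

class Stage:
    """One model segment; in a real cluster each Stage would live on its own node."""

    def __init__(self, weights, activation=None):
        self.weights = weights
        self.activation = activation

    def forward(self, inputs):
        outputs = matvec(self.weights, inputs)
        return self.activation(outputs) if self.activation else outputs

# Node A holds only layer 1; node B holds only layer 2.
node_a = Stage([[1.0, -1.0], [0.5, 0.5]], activation=relu)
node_b = Stage([[2.0, 1.0]])

x = [3.0, 1.0]
hidden = node_a.forward(x)       # computed on node A, then sent over the network
output = node_b.forward(hidden)  # computed on node B from node A's activations
print(hidden, output)
```

In this layer-wise split, node B cannot start until node A has produced its activations, which is one reason scalability depends on how much of the algorithm can actually run in parallel.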

### III. Synchronous Training

Consider the data parallelism case:

- Data is divided into partitions
- Each partition is sent to a worker
- Each worker has a full replica of the model and trains it on its own partition

**Forward Pass**

In synchronous training, the forward pass begins at the same time on every worker

Each worker computes a different output and loss, since it sees a different partition of the data

**Backward Pass**

Each worker performs the backward pass to compute its local gradients

Gradients differ across workers, as they are computed on different subsets of data

Each worker waits until all the others have finished computing their gradients

The workers then communicate with each other and aggregate their gradients using the *all-reduce algorithm* (below)

The aggregated gradients are copied to all workers, and each worker uses them to update its local weights

Because every worker applies the same update, all workers have the same weights at the start of each step
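A synchronous training loop can be sketched as follows. Workers are simulated sequentially in one process, and the model `y = w * x`, the function names, the learning rate, and the toy shards are all assumptions made for the example. It illustrates how waiting for every gradient and applying one averaged update keeps all replicas identical.

```python
# Synchronous data-parallel SGD sketch (workers simulated sequentially in one process).

def local_gradient(w, shard):
    """Mean-squared-error gradient dL/dw for y = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def synchronous_sgd(shards, steps, lr):
    weights = [0.0] * len(shards)  # one model replica per worker, same initial value
    for _ in range(steps):
        # Each worker runs its forward and backward pass on its own shard.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Barrier + aggregation: nothing proceeds until every gradient is in.
        avg = sum(grads) / len(grads)
        # Every worker applies the same averaged gradient to its replica.
        weights = [w - lr * avg for w in weights]
    return weights

# Two workers, each with its own partition of toy data drawn from y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
final_weights = synchronous_sgd(shards, steps=200, lr=0.01)
print(final_weights)
```

Both replicas end with the same weight, close to the true slope of 2, because they start from the same value and receive identical updates at every step.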

<img src="./cosine_sim.png" width=300>

#### Synchronous Algorithm Example: All-Reduce Algorithm

The all-reduce algorithm keeps the model weights synchronized across all machines in a cluster. Each machine computes the model's gradient on its portion of the training data. The all-reduce operation then combines the gradients from all machines (typically by summing or averaging) and delivers the combined result to every machine, which uses it to update its copy of the model weights.
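One common way to implement all-reduce is the ring variant, sketched below as a single-process simulation. The notes above do not specify a variant, so the ring layout is an assumption, as are the function name `ring_allreduce` and the simplification that each worker's gradient has exactly one element per worker. Each worker only ever sends to its right-hand neighbor, first accumulating partial sums and then circulating the finished sums.

```python
# Ring all-reduce sketch (single-process simulation of n workers arranged in a ring).

def ring_allreduce(vectors):
    """Return a per-worker list in which every worker holds the element-wise sum."""
    n = len(vectors)
    if any(len(v) != n for v in vectors):
        raise ValueError("this sketch assumes one chunk (element) per worker")
    data = [list(v) for v in vectors]  # each worker's local buffer

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first, since real workers send simultaneously.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2 (all-gather): circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data

local_grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(local_grads)
print(reduced)  # every worker ends with [111, 222, 333]
```

To average rather than sum, each worker would divide its result by the number of workers before applying the update.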

<img src="./cosine_sim.png" width=300>

### IV. Asynchronous Training

Asynchronous training can be more efficient than synchronous training since there is no waiting. 

This is especially helpful when there is variation in the computing power across workers.

In asynchronous training, workers operate independently, so no worker has to wait for any other worker in the cluster. One way to achieve this is with a parameter server.
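A back-of-the-envelope timing model shows why uneven computing power matters. The per-batch times below are hypothetical, and the model ignores communication cost and any effect on convergence; it only compares how many mini-batches the cluster processes per second under each scheme.

```python
# Illustrative timing model (hypothetical numbers): seconds per mini-batch for each worker.
step_times = [1.0, 1.0, 1.0, 4.0]  # the last worker is a straggler

# Synchronous: every step lasts as long as the slowest worker.
sync_step_time = max(step_times)
sync_throughput = len(step_times) / sync_step_time   # mini-batches per second

# Asynchronous: each worker proceeds at its own pace.
async_throughput = sum(1.0 / t for t in step_times)  # mini-batches per second

print(sync_throughput, async_throughput)  # 1.0 vs 3.25
```

Under these assumed numbers, one slow worker holds the synchronous cluster to 1 mini-batch per second, while the asynchronous cluster processes 3.25.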

<img src="./vector_space.png" width=300>

#### Asynchronous Algorithm Example: Parameter Server

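In this approach, a copy of the model is stored on each worker node, and a centralized *parameter server* manages the shared weights and biases.

Each worker pulls the current parameters from the server, computes gradients on its partition, and pushes them back to the server without waiting for the other workers.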
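The parameter server pattern can be sketched with threads standing in for worker nodes. The `ParameterServer` class, its `pull` and `push` methods, the one-parameter model `y = w * x`, and the toy data are all hypothetical names and values chosen for this example. It shows workers pulling the current weight, computing a gradient, and pushing it back with no barrier between them.

```python
# Asynchronous parameter server sketch (threads stand in for worker nodes).
import threading

class ParameterServer:
    """Holds the shared weight and applies gradient updates as they arrive."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr
        self.updates = 0
        self._lock = threading.Lock()  # protects the shared weight

    def pull(self):
        with self._lock:
            return self.w

    def push(self, grad):
        with self._lock:
            self.w -= self.lr * grad
            self.updates += 1

def worker(server, shard, steps):
    for _ in range(steps):
        w = server.pull()  # may already be stale by the time the push happens
        grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        server.push(grad)  # no waiting on other workers

server = ParameterServer(w=0.0, lr=0.01)
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # toy data from y = 2x
threads = [threading.Thread(target=worker, args=(server, s, 300)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.updates, server.w)
```

The exact final weight depends on how the threads interleave, which reflects the nondeterminism of asynchronous training; in this small example it still ends close to the true slope of 2.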


<img src="./parameter_server.png" width=300>

### V. Challenges

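Ensuring that the model converges during distributed training is a significant challenge.

In asynchronous training, for example, a worker may compute gradients from weights that other workers have since updated (stale gradients).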

<img src="./rag.png" width=600>

### VI. Implementation

**Elephas** is a Keras add-on that allows you to use Spark to execute distributed deep learning models at scale.

**Amazon SageMaker** offers 

**Horovod** is an open-source distributed training framework developed at Uber for TensorFlow, Keras, PyTorch, and MXNet. It supports distributed training across multiple GPUs and nodes.


---

### Conclusions

Add details

---