<a href="https://colab.research.google.com/github/Weikang01/Do_something_with_tensorflow_and_keras/blob/master/parallel_computing_with_tf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

These notes are based on the video https://www.youtube.com/watch?v=rj-hjS5L8Bw
#**TensorFlow Strategies for Parallel Computing**#
TensorFlow's parallel computing strategies all live in the tensorflow.distribute module
```python
import tensorflow.distribute
```
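<b>TensorFlow</b> provides six parallel computing strategies; choose one according to the available hardware:
* **Mirrored Strategy**
<p>Suited to a single server with multiple GPUs.</p>
<p>Mirrored Strategy resembles the MapReduce programming model: every GPU holds a complete copy of the model parameters, each GPU computes a stochastic gradient on its own batch of data, the gradients from all GPUs are summed, and the sum is used to update the model parameters.</p>
<p>Mirrored Strategy performs synchronous stochastic gradient descent, so each update has to wait until every GPU has finished its computation.</p>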
* **TPU Strategy**
* **Multi-Worker Mirrored Strategy**
* **Central Storage Strategy**
* **Parameter Server Strategy**
* **One Device Strategy**
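
The synchronous update described for Mirrored Strategy can be illustrated with a toy, single-process simulation. This is only a sketch in plain Python, not how TensorFlow implements it: the model is a one-parameter fit `y = w * x`, each inner list stands in for the batch on one GPU, and the names `replica_gradient` and `sync_sgd_step` are invented for this example.

```python
def replica_gradient(w, batch):
    """Gradient of the summed loss 0.5 * (w*x - y)**2 over one replica's batch."""
    return sum((w * x - y) * x for x, y in batch)


def sync_sgd_step(w, batches, lr):
    """One synchronous step: every replica computes a gradient on its own batch,
    the gradients are summed, and the same update is applied to every copy of w."""
    grads = [replica_gradient(w, batch) for batch in batches]  # one per simulated GPU
    total = sum(grads)                                         # the all-reduce (sum)
    return w - lr * total                                      # identical on all replicas


# Toy data generated from y = 3x, split across two simulated GPUs.
batches = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, batches, lr=0.01)
print(round(w, 4))  # converges towards the true slope 3
```

Because the per-replica gradients are summed before the update, the result is the same as computing the gradient on the concatenated batch with a single device.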
#**Parallel Training CNN on MNIST**#
Train a CNN in parallel to classify the MNIST handwritten-digit dataset.
* A GPU-enabled TensorFlow installation is required
```python
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
```
<p>/device:CPU:0</p>
<p>/device:GPU:0</p>
<p>/device:GPU:1</p>
<p>/device:GPU:2</p>
<p>/device:GPU:3</p>


In [0]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

In [0]:
import tensorflow as tf
import tensorflow.distribute as distribute
strategy = distribute.MirroredStrategy()  # synchronous data parallelism across all visible GPUs

In [0]:
def scale(image, label):
  # convert uint8 pixels to float32 and rescale them from [0, 255] to [0, 1]
  image = tf.cast(image, tf.float32)
  image /= 255
  return image, label

In [9]:
import tensorflow_datasets as data
datasets, info = data.load(name='mnist', with_info=True, as_supervised=True)
mnist_train = datasets['train'].map(scale).cache()
mnist_test = datasets['test'].map(scale)
m = strategy.num_replicas_in_sync  # number of replicas (devices) training in sync
print(m)

1


In [0]:
BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 128
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * m
data_train = mnist_train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
data_test = mnist_test.batch(BATCH_SIZE)
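
The dataset is batched with the global batch size, which is then split across the replicas, so each replica still processes `BATCH_SIZE_PER_REPLICA` examples per step while an epoch needs fewer steps. A small sketch of that arithmetic, assuming MNIST's 60,000 training images; the helper `steps_per_epoch` is a hypothetical name used only here:

```python
import math


def steps_per_epoch(num_replicas, per_replica_batch=128, num_examples=60000):
    """Number of global batches per epoch when the last partial batch is kept."""
    global_batch = per_replica_batch * num_replicas
    return math.ceil(num_examples / global_batch)


for n in (1, 2, 4):
    print(n, 128 * n, steps_per_epoch(n))
# 1 128 469
# 2 256 235
# 4 512 118
```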

In [0]:
import tensorflow.keras as keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

with strategy.scope():  # create the model inside the scope so its variables are mirrored on every replica
  model = Sequential(name='hello')
  model.add(Conv2D(32, 3, activation='relu', input_shape=(28,28,1)))
  model.add(MaxPooling2D())
  model.add(Conv2D(64, 3, activation='relu'))
  model.add(MaxPooling2D())
  model.add(Flatten())
  model.add(Dense(64, activation='relu'))
  model.add(Dense(10, activation='softmax'))

In [0]:
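with strategy.scope():
  # track accuracy so that model.evaluate returns both the loss and the accuracy
  model.compile(loss=keras.losses.sparse_categorical_crossentropy,
                optimizer=keras.optimizers.RMSprop(learning_rate=1E-3),
                metrics=['accuracy'])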

In [22]:
model.fit(data_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f0ff9c1cfd0>

In [0]:
eval_loss, eval_acc = model.evaluate(data_test)

##**How Mirrored Strategy Works**##
* Ring All-Reduce<br>
<b>The difference between Reduce and All-Reduce</b><br>
Reduce: the server aggregates the results it receives from the workers (e.g. reduce_sum, reduce_mean, ...)<br>
For example:<br>
$$output_{worker1}=7$$
$$output_{worker2}=3$$
$$output_{worker3}=4$$
$$output_{worker4}=1$$
$$Reduce(sum) = \sum^n_{i=1}output_{worker_i}=7+3+4+1=15$$<br>
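* In this case the server obtains the reduce result, but the individual workers do not
* All-Reduce: every node ends up with a copy of the reduce result
> E.g., all-reduce implemented as reduce + broadcast<br>
> E.g., all-reduce implemented with all-to-all communication (every worker sends its own value to all other nodes, with no server involved)<br>
> E.g., ring all-reduce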
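
To make the distinction concrete, here is a minimal sketch using the four worker outputs from the example. It models the reduce + broadcast variant only; `reduce_sum` and `all_reduce_sum` are illustrative helper names, not library calls.

```python
def reduce_sum(worker_outputs):
    """Reduce: only the server ends up with the aggregated value."""
    return sum(worker_outputs)


def all_reduce_sum(worker_outputs):
    """All-reduce via reduce + broadcast: every worker receives a copy of the sum."""
    total = reduce_sum(worker_outputs)        # step 1: reduce on the server
    return [total for _ in worker_outputs]    # step 2: broadcast to every worker


outputs = [7, 3, 4, 1]                        # the four worker outputs from the example
print(reduce_sum(outputs))                    # 15
print(all_reduce_sum(outputs))                # [15, 15, 15, 15]
```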
* **Naïve Ring All-Reduce Algorithm**<br>
Logic:<br>
$$GPU_1:g_1\longrightarrow GPU_2$$
$$GPU_2:(g_1+g_2)\longrightarrow GPU_3$$
$$...$$
$$GPU_{m-1}:(\sum^{m-1}_{i=1}g_i)\longrightarrow GPU_m$$
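After this pass, GPU(m) holds the full sum g = g_1 + ... + g_m. A second pass around the ring, starting from GPU(m), then delivers the complete gradient g to every GPU.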
$$GPU_m:g\Longrightarrow GPU_1$$
$$GPU_1:g\Longrightarrow GPU_2$$
$$...$$
$$GPU_{m-2}:g\Longrightarrow GPU_{m-1}$$
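The problems with this algorithm are:
* At any given moment most of the GPUs are idle
* Communication time: on the order of md/b, because only one GPU transmits at a time
 * m: number of GPUs
 * d: number of parameters
 * b: network bandwidth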
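
The naive scheme can be simulated in a few lines to count how much data travels sequentially. This is a sketch under simplifying assumptions: GPUs are numbered from 0, each gradient is a plain Python list, and `naive_ring_all_reduce` is a made-up helper. The exact count is 2(m-1)d values, which grows linearly with m as the md/b estimate suggests.

```python
def naive_ring_all_reduce(grads):
    """Simulate the naive ring, where only one GPU transmits at a time.

    grads[i] is the gradient vector held by GPU i.
    Returns (result_per_gpu, number_of_values_sent_one_after_another).
    """
    m, d = len(grads), len(grads[0])
    sent = 0
    # Pass 1: accumulate the running sum along the ring, GPU 0 -> 1 -> ... -> m-1.
    running = list(grads[0])
    for i in range(1, m):
        sent += d                                   # GPU i-1 sends d values to GPU i
        running = [r + g for r, g in zip(running, grads[i])]
    # Pass 2: the last GPU forwards the full sum g around the ring.
    result = [None] * m
    result[m - 1] = running
    for i in range(m - 1):
        sent += d                                   # one more hop carrying d values
        result[i] = list(running)
    return result, sent


grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # m = 4 GPUs, d = 3
result, sent = naive_ring_all_reduce(grads)
print(result[0], sent)                                     # [22, 26, 30] 18
```

Dividing `sent` by the bandwidth b gives the communication time of this simulated run.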
* **Ring All-Reduce Algorithm (optimized)**<br>


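1. Split each GPU's gradient into m chunks, where m is the number of GPUs (numbered 0 to m-1 here):
$$GPU_0: g_0=[a_0;b_0;c_0;...]$$
$$GPU_1: g_1=[a_1;b_1;c_1;...]$$
$$...$$
$$GPU_{m-1}: g_{m-1}=[a_{m-1};b_{m-1};c_{m-1};...]$$
2. All GPUs send at the same time and in the same direction: GPU i sends the i-th chunk of its gradient to GPU i+1, and the last GPU sends to GPU 0 (x denotes the last chunk):
$$GPU_0: a_0\longrightarrow GPU_1$$
$$GPU_1: b_1\longrightarrow GPU_2$$
$$...$$
$$GPU_{m-1}: x_{m-1}\longrightarrow GPU_0$$
3. GPU i+1 can now add the received chunk to its own copy of that chunk; for example, GPU 1 holds a_0 + a_1.
4. Repeating this, each GPU forwards the partial sum it has just updated. After m-1 steps every GPU holds one fully summed chunk, and m-1 further steps pass the finished chunks around the ring until every GPU has the whole of g. Because all links are busy at once and each message is only 1/m of the gradient, this is about m times as efficient as the naive method.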
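##<b>Summary of the algorithm's properties</b>
* No network link sits idle: all GPUs communicate at the same time
* Communication time: on the order of d/b, independent of the number of GPUs
 * d: number of parameters
 * b: network bandwidth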
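
The steps of the optimized ring can be checked with a small single-process simulation. This is a sketch of the usual two-phase formulation (scatter-reduce followed by all-gather), not TensorFlow's actual implementation; `ring_all_reduce` is a hypothetical helper, GPUs are numbered from 0, and each chunk is a single number so that d = m.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce on m GPUs.

    grads[i] is GPU i's gradient, already split into m chunks.
    Returns (result_per_gpu, chunks_sent_by_each_gpu).
    """
    m = len(grads)
    data = [list(g) for g in grads]          # data[i][c]: chunk c as held by GPU i
    sent_per_gpu = 0

    # Phase 1 (scatter-reduce): in step s, GPU i sends chunk (i - s) % m to GPU i+1,
    # which adds it to its own copy. All m transfers of a step happen at once, so
    # the messages are collected first and applied afterwards.
    for s in range(m - 1):
        msgs = [((i + 1) % m, (i - s) % m, data[i][(i - s) % m]) for i in range(m)]
        for dst, c, value in msgs:
            data[dst][c] += value
        sent_per_gpu += 1

    # Phase 2 (all-gather): GPU i now owns the finished chunk (i + 1) % m; the
    # finished chunks travel around the ring and overwrite the partial sums.
    for s in range(m - 1):
        msgs = [((i + 1) % m, (i + 1 - s) % m, data[i][(i + 1 - s) % m]) for i in range(m)]
        for dst, c, value in msgs:
            data[dst][c] = value
        sent_per_gpu += 1

    return data, sent_per_gpu


grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result, sent_per_gpu = ring_all_reduce(grads)
print(result[0], sent_per_gpu)               # [28, 32, 36, 40] 6
```

Each GPU sends 2(m-1) chunks of size d/m, so the time per GPU is roughly 2(m-1)/m · d/b, which stays close to a constant multiple of d/b however many GPUs join the ring.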


