# TensorFlow Script Mode - Using Horovod

Starting from TensorFlow version 1.12, you can use Horovod for distributed training.

For this example, we use the [TensorFlow Model Zoo](https://github.com/tensorflow/models) and the [TensorFlow Benchmarks repo](https://github.com/tensorflow/benchmarks) to train a VGG16 model with synthetic ImageNet data.

## Cloning the repositories

In [1]:
!git clone https://github.com/tensorflow/benchmarks.git -b cnn_tf_v1.12_compatible

Cloning into 'benchmarks'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 3335 (delta 6), reused 10 (delta 4), pack-reused 3315
Receiving objects: 100% (3335/3335), 1.87 MiB | 303.00 KiB/s, done.
Resolving deltas: 100% (2327/2327), done.


In [None]:
!git clone https://github.com/tensorflow/models.git -b v1.11

Cloning into 'models'...
remote: Enumerating objects: 82, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (57/57), done.


## Using the benchmark script
__benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py__ contains benchmark scripts for multiple models and datasets. You can pass ```--help``` to check the usage options:

In [1]:
!python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --help


For our specific example, we want to run **tf_cnn_benchmarks.py** with the following arguments:

```bash
python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=1000 --model vgg16 --batch_size 64 \
    --variable_update horovod --horovod_device gpu --use_fp16
```
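
To try other models or batch sizes, the same flags can be assembled programmatically. The following is a minimal sketch: `build_benchmark_command` is a hypothetical helper introduced here, not part of the benchmarks repository, and only the flags listed above are taken from this example.

```python
import shlex


def build_benchmark_command(script, flags):
    """Render a flag dictionary as a shell-safe command string.

    Values of True become bare switches (for example --use_fp16),
    False or None are skipped, and everything else is emitted as
    --name=value.
    """
    parts = ["python", script]
    for name, value in flags.items():
        if value is True:
            parts.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            parts.append(f"--{name}={value}")
    # Quote each token so the string can be pasted into a shell script.
    return " ".join(shlex.quote(part) for part in parts)


command = build_benchmark_command(
    "benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
    {
        "num_batches": 1000,
        "model": "vgg16",
        "batch_size": 64,
        "variable_update": "horovod",
        "horovod_device": "gpu",
        "use_fp16": True,
    },
)
print(command)
```

The resulting string uses the `--name=value` form for every flag that takes a value, which is equivalent to the space-separated form used above.
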
### Creating the launcher script
Let's create a shell script that calls **tf_cnn_benchmarks.py** with the correct user arguments:

In [None]:
%%writefile launcher.sh
#!/usr/bin/env bash
# Install the Python packages the benchmark script needs.
pip install requests py-cpuinfo

# Make the TensorFlow models repository importable, then run the benchmark.
PYTHONPATH="/opt/ml/code/models:$PYTHONPATH" \
python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_batches=1000 --model vgg16 --batch_size 64 \
    --variable_update horovod --horovod_device gpu --use_fp16 \
    --train_dir /opt/ml/model --eval_dir /opt/ml/model --benchmark_log_dir /opt/ml/model

The script above:
- installs the required Python packages **requests** and **py-cpuinfo**
- adds the **TensorFlow models** repository to the **PYTHONPATH**
- invokes the benchmarking script, saving training, evaluation, and benchmark event files under **/opt/ml/model**. This folder is uploaded to S3 at the end of training.

## Training in SageMaker using multiple GPUs

We are ready to train in SageMaker. Let's start by using one instance with multiple GPUs for training.
An ml.p3.16xlarge instance has 8 GPUs, which suits this use case. We need to create a **distributions**
dictionary with **mpi** enabled, passing the number of processes to run on each instance:

In [None]:
distributions={'mpi': {'enabled': True, 'processes_per_host': 8}}
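
Setting `processes_per_host` to the number of GPUs on the instance lets each Horovod process drive one GPU. The sketch below derives the dictionary from the instance type. The `GPUS_PER_INSTANCE` table and both helper functions are illustrative names introduced here, and the table lists only three P3 sizes. The batch size calculation assumes `--batch_size` applies to each process, so treat the result as an estimate.

```python
# GPU counts for a few P3 instance sizes (illustrative, not exhaustive).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def mpi_distribution(instance_type):
    """Build an MPI distribution dict with one process per GPU."""
    try:
        gpus = GPUS_PER_INSTANCE[instance_type]
    except KeyError:
        raise ValueError(f"Unknown GPU count for {instance_type!r}")
    return {"mpi": {"enabled": True, "processes_per_host": gpus}}


def global_batch_size(per_process_batch, instance_type, instance_count=1):
    """Estimate the effective batch size across all Horovod processes.

    Assumption: the per-process batch size is applied independently by
    every process, so the global batch is the product of the three terms.
    """
    return per_process_batch * GPUS_PER_INSTANCE[instance_type] * instance_count


print(mpi_distribution("ml.p3.16xlarge"))
print(global_batch_size(64, "ml.p3.16xlarge"))
```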

We need to pass **launcher.sh** as the script entry point and include both **benchmarks** and **models**
as dependencies.

In [None]:
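import os

# Assumption: both repositories were cloned into the notebook's working directory.
dir_path = os.getcwd()
entry_point = 'launcher.sh'

dependencies = [os.path.join(dir_path, 'benchmarks'),
                os.path.join(dir_path, 'models')]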

We can now use the ```TensorFlow``` estimator to start the training job.

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point=entry_point,
    role=sagemaker.get_execution_role(),
    dependencies=dependencies,
    train_instance_count=1,
    train_instance_type='ml.p3.16xlarge',
    framework_version='1.12',
    py_version='py3',
    script_mode=True,
    distributions=distributions
)

The `estimator.fit` call below starts training without any data channels. The benchmark script generates synthetic ImageNet data during training,
so no input data needs to be provided.

In [None]:
estimator.fit()
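
As noted earlier, the contents of **/opt/ml/model** are uploaded to S3 when the job ends. After `fit` returns, the SageMaker Python SDK exposes the archive location as `estimator.model_data`. The sketch below splits such a URI into a bucket and a key so the archive can be downloaded. The `split_s3_uri` helper and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical value; after a real job, use estimator.model_data instead.
example_uri = "s3://example-bucket/example-job/output/model.tar.gz"
bucket, key = split_s3_uri(example_uri)
print(bucket, key)
```

The bucket and key can then be passed to `boto3.client('s3').download_file(bucket, key, 'model.tar.gz')` to fetch the archive with the event files.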

### Configuring VPC for distributed training
Providing a VPC improves the network throughput of the training job and considerably increases the performance and stability of Horovod training jobs.

This can be done by supplying subnets and security groups to the estimator that launches the job.
We will use the default VPC configuration for this example.

In [None]:
import boto3

ec2 = boto3.client('ec2')

# Pick the account's default VPC; this raises IndexError if none exists.
default_vpc = [vpc['VpcId'] for vpc in ec2.describe_vpcs()['Vpcs'] if vpc['IsDefault']][0]

default_security_groups = [group['GroupId'] for group in ec2.describe_security_groups()['SecurityGroups']
                           if group['GroupName'] == 'default' and group['VpcId'] == default_vpc]

default_subnets = [subnet['SubnetId'] for subnet in ec2.describe_subnets()['Subnets']
                   if subnet['VpcId'] == default_vpc and subnet['DefaultForAz']]

print("Using default VPC:", default_vpc)
print("Using default security group:", default_security_groups)
print("Using default subnets:", default_subnets)
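
The lookups above assume the account has a default VPC; indexing with `[0]` raises `IndexError` otherwise. The sketch below applies the same filters to plain response dictionaries, so the selection logic can be checked without calling AWS. The function name `select_default_network` and all sample IDs are hypothetical.

```python
def select_default_network(vpcs, security_groups, subnets):
    """Return (vpc_id, security_group_ids, subnet_ids) for the default VPC.

    Each argument is the list found under 'Vpcs', 'SecurityGroups', or
    'Subnets' in the corresponding EC2 describe_* response.
    """
    default_vpcs = [vpc["VpcId"] for vpc in vpcs if vpc.get("IsDefault")]
    if not default_vpcs:
        raise RuntimeError(
            "No default VPC found; pass subnets and security groups explicitly."
        )
    vpc_id = default_vpcs[0]
    group_ids = [
        group["GroupId"]
        for group in security_groups
        if group["GroupName"] == "default" and group["VpcId"] == vpc_id
    ]
    subnet_ids = [
        subnet["SubnetId"]
        for subnet in subnets
        if subnet["VpcId"] == vpc_id and subnet.get("DefaultForAz")
    ]
    return vpc_id, group_ids, subnet_ids


# Made-up sample data shaped like the EC2 responses.
sample_vpcs = [
    {"VpcId": "vpc-aaa", "IsDefault": False},
    {"VpcId": "vpc-bbb", "IsDefault": True},
]
sample_groups = [
    {"GroupId": "sg-1", "GroupName": "default", "VpcId": "vpc-bbb"},
    {"GroupId": "sg-2", "GroupName": "custom", "VpcId": "vpc-bbb"},
    {"GroupId": "sg-3", "GroupName": "default", "VpcId": "vpc-aaa"},
]
sample_subnets = [
    {"SubnetId": "subnet-1", "VpcId": "vpc-bbb", "DefaultForAz": True},
    {"SubnetId": "subnet-2", "VpcId": "vpc-bbb", "DefaultForAz": False},
    {"SubnetId": "subnet-3", "VpcId": "vpc-aaa", "DefaultForAz": True},
]

result = select_default_network(sample_vpcs, sample_groups, sample_subnets)
print(result)
```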

We can use the VPC for distributed training:


In [None]:
estimator = TensorFlow(
    entry_point=entry_point,
    role=sagemaker.get_execution_role(),
    dependencies=dependencies,
    train_instance_count=2,
    train_instance_type='ml.p3.16xlarge',
    framework_version='1.12',
    py_version='py3',
    script_mode=True,
    distributions=distributions,
    # Use the default VPC's security groups and subnets found above.
    security_group_ids=default_security_groups,
    subnets=default_subnets
)

estimator.fit()