# F.A.Q / Troubleshooting - Distributed Training

```
# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#   Licensed under the Apache License, Version 2.0 (the "License").
#   You may not use this file except in compliance with the License.
#   A copy of the License is located at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   or in the "license" file accompanying this file. This file is distributed
#   on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
#   express or implied. See the License for the specific language governing
#   permissions and limitations under the License.
# ==============================================================================
```

This page lists frequent problems, and how to troubleshoot them, when running distributed training with Horovod and executing MXFusion code on GPU.

## ValueError while executing <tt>horovodrun</tt>

### Problem

On a machine where Horovod has just been installed, the following error may occur when executing code with <tt>horovodrun</tt> in the terminal:

<b>ValueError: Neither MPI nor Gloo support has been built. Try reinstalling Horovod ensuring that either MPI is installed (MPI) or CMake is installed (Gloo).</b>

### Steps to Reproduce

After installing <tt>Horovod</tt> with <b>pip install horovod==0.16.4</b>, execute an MXFusion distributed training script with <b>horovodrun -np {number_of_processors} -H localhost:4 python {python_script}</b>.

### Solution

Use <tt>mpirun</tt> instead of <tt>horovodrun</tt>. For example, in the terminal, type:

<b>mpirun -np {number_of_processors} -H localhost:4 python {python_script}</b>
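If the training job is launched from a Python wrapper rather than typed by hand, the same command can be assembled as an argument list. The following is a minimal sketch, not part of MXFusion: <tt>build_mpirun_command</tt> is a hypothetical helper, and <tt>train.py</tt> is a placeholder script name.

```python
import shlex
import shutil


def build_mpirun_command(num_processes, script, host="localhost", slots=4):
    """Return the mpirun argument list matching the command shown above."""
    return [
        "mpirun",
        "-np", str(num_processes),
        "-H", f"{host}:{slots}",
        "python", script,
    ]


if __name__ == "__main__":
    # "train.py" is a placeholder, not a script shipped with MXFusion.
    command = build_mpirun_command(4, "train.py")
    if shutil.which("mpirun") is None:
        print("mpirun was not found on PATH; an MPI installation is required.")
    # Print the command so it can be inspected or pasted into a terminal.
    print(shlex.join(command))
```

The list form can be passed directly to <tt>subprocess.run</tt>, which avoids shell quoting problems with script paths that contain spaces. <tt>shlex.join</tt> requires Python 3.8 or later.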

## Warning of <tt>CMA Support</tt> Not Available

### Problem

The first time MXFusion is executed with Horovod after each Ubuntu boot, Ubuntu's ptrace protection blocks CMA (Cross Memory Attach) support from being enabled, which prevents processes from sharing memory through it. The following warning is shown in the terminal:

<b>
Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.
</b>

### Steps to Reproduce

After Ubuntu boots, execute an MXFusion distributed training script with <b>mpirun -np {number_of_processors} -H localhost:4 python {python_script}</b>.

### Solution

Temporarily disable ptrace protection with the command below. For security, you may want to re-enable it by running the same command with <tt>echo 1</tt> once you have stopped using Horovod. Note that <tt>ptrace_scope</tt> is reset to 1 every time Ubuntu boots, so the command has to be repeated after a reboot. To disable ptrace protection, in the terminal, type:

<b>echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope</b>
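Because the setting reverts on reboot, it can be useful to check the current value before starting a run. The sketch below only reads the same file that the command above writes to. The helper names are hypothetical, and it assumes that 0 is the permissive value and any other value means protection is active.

```python
from pathlib import Path

PTRACE_SCOPE_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")


def parse_ptrace_scope(text):
    """Parse the file contents into an int, or return None if malformed."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_ptrace_scope(path=PTRACE_SCOPE_PATH):
    """Return the current ptrace_scope value, or None if it cannot be read."""
    try:
        return parse_ptrace_scope(Path(path).read_text())
    except OSError:
        return None


def ptrace_protection_enabled(scope):
    """Assumption: 0 means disabled; an unknown value (None) is not flagged."""
    return scope is not None and scope != 0


if __name__ == "__main__":
    scope = read_ptrace_scope()
    if ptrace_protection_enabled(scope):
        print(f"ptrace_scope is {scope}; the CMA warning is likely to appear.")
    else:
        print(f"ptrace_scope is {scope}.")
```

On systems without the Yama module the file does not exist, in which case <tt>read_ptrace_scope</tt> returns <tt>None</tt> rather than raising.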

## Segmentation fault : 11 with <tt>MXNet-cu100</tt>

### Problem

When executing MXFusion on GPU with <tt>MXNet-cu100</tt> installed, a <b>Segmentation fault : 11</b> error is thrown.

### Steps to Reproduce

Install <tt>MXNet-cu100</tt> with <b>pip install mxnet-cu100</b> on <tt>Deep Learning AMI (Ubuntu) Version 24.1 (ami-06f483a626f873983)</tt>. Run an MXFusion distributed training script with <b>mpirun -np {number_of_processors} -H localhost:4 python {python_script}</b>.

### Solution

Uninstall <tt>MXNet-cu100</tt> and install <tt>MXNet-cu100mkl</tt> instead. In the terminal, type:

<b>
    pip uninstall mxnet-cu100<br>
    pip install mxnet-cu100mkl
</b>
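To confirm which MXNet build is installed before or after swapping packages, a check along these lines may help. This is a sketch rather than an MXFusion utility: both helper names are hypothetical, it inspects only the package names reported by <tt>importlib.metadata</tt>, and it requires Python 3.8 or later.

```python
from importlib import metadata


def installed_mxnet_packages():
    """Return sorted, lower-cased names of installed packages starting with 'mxnet'."""
    names = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        # Guard against distributions with missing or incomplete metadata.
        name = meta.get("Name") if meta is not None else None
        if name and name.lower().startswith("mxnet"):
            names.add(name.lower())
    return sorted(names)


def should_switch_to_mkl(package_names):
    """True when the plain cu100 build, the case described above, is present."""
    return "mxnet-cu100" in package_names


if __name__ == "__main__":
    found = installed_mxnet_packages()
    print("Installed MXNet packages:", found or "none")
    if should_switch_to_mkl(found):
        print("mxnet-cu100 is installed; consider replacing it with mxnet-cu100mkl.")
```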

## Segmentation fault : 11 with the latest version of <tt>Horovod</tt>

### Problem

MXFusion currently does not support <tt>Horovod</tt> version 0.18 and above. With the latest version of <tt>Horovod</tt>, running <tt>MXFusion</tt> distributed training on CPU produces loss values and outputs that are inaccurate and inconsistent between processes. Running <tt>MXFusion</tt> distributed training on GPU throws a <b>Segmentation fault : 11</b> error.

### Steps to Reproduce

Install <tt>Horovod</tt> with <b>pip install horovod</b>. Run a distributed training script with <b>mpirun -np {number_of_processors} -H localhost:4 python {python_script}</b>.

### Solution

MXFusion currently supports only <tt>Horovod</tt> versions below 0.18. Install a supported version of <tt>Horovod</tt>, for example:

<b>pip install horovod==0.16.4</b>
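A training script can fail early with a clear message instead of a segmentation fault by checking the installed version first. The sketch below assumes that the "version 18" limit refers to Horovod release 0.18. The helper names are hypothetical, and the version parser is deliberately simple (numeric dotted components only).

```python
from importlib import metadata


def parse_version(version):
    """Turn '0.16.4' into (0, 16, 4); stop at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_supported_horovod(version, first_unsupported=(0, 18)):
    """Assumption: every release from 0.18 onwards is unsupported."""
    return parse_version(version) < first_unsupported


def installed_horovod_version():
    """Return the installed Horovod version string, or None if not installed."""
    try:
        return metadata.version("horovod")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_horovod_version()
    if version is None:
        print("Horovod is not installed.")
    elif not is_supported_horovod(version):
        print(f"Horovod {version} is installed; try pip install horovod==0.16.4.")
    else:
        print(f"Horovod {version} is within the supported range.")
```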

## Error with dtype='float64' on GPU

### Problem

When <tt>float64</tt> is set as the data type and the script is run on GPU, this error may occur:

<b>mxnet.base.MXNetError: src/ndarray/ndarray_function.cu:58: Check failed: to->type_flag_ == from.type_flag_ (1 vs. 0) : Source and target must have the same data type when copying across devices.</b>

### Steps to Reproduce

On a GPU machine, change the value of <tt>config.DEFAULT_DTYPE</tt> and the dtype of the NDArrays to <tt>'float64'</tt> in <b>distributed_bnn_test.py</b>, then run the test. The error occurs in <tt>test_BNN_regression</tt> and <tt>test_BNN_regression_minibatch</tt>. From the MXFusion source root folder, type in the terminal:

<b>
    cd testing/inference<br>
    mpirun -np 4 -H localhost:4 pytest -s distributed_bnn_test.py
</b>

### Solution

Set <tt>float32</tt> as the data type. GPUs also typically compute in <tt>float32</tt> faster than in <tt>float64</tt>.