# Run Keras on GPU

If you are running on the **TensorFlow** or **CNTK** backends, your code will automatically run on GPU if any available GPU is detected.  

If you are running on the **Theano** backend, you can use one of the following methods:  
**Method 1**: use Theano flags  

    THEANO_FLAGS=device=gpu,floatX=float32 python my_keras_script.py

The name 'gpu' might have to be changed depending on your device's identifier (e.g. gpu0, gpu1, etc.).  
**Method 2**: set up your .theanorc: Instructions  
**Method 3**: manually set `theano.config.device` and `theano.config.floatX` at the beginning of your code:  
```python
import theano
theano.config.device = 'gpu'
theano.config.floatX = 'float32'
```
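
If you would rather keep the configuration inside the script, the flags from Method 1 can also be placed in the `THEANO_FLAGS` environment variable before Theano is imported. The sketch below is only an illustration: `build_theano_flags` is a hypothetical helper, not part of Theano or Keras, and the default device name is an assumption that may need changing for your machine.

```python
import os


def build_theano_flags(device='gpu', floatX='float32'):
    """Return a THEANO_FLAGS string such as 'device=gpu,floatX=float32'."""
    # Theano expects comma-separated key=value pairs with no spaces.
    return 'device={},floatX={}'.format(device, floatX)


# This must run before the first `import theano` (or before importing Keras
# with the Theano backend), because Theano reads the variable at import time.
os.environ['THEANO_FLAGS'] = build_theano_flags()
```

Passing a different identifier, for example `build_theano_flags('gpu1')`, yields the same string with that device name.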

# Run a Keras model on multiple GPUs

I recommend doing so using the **TensorFlow** backend. There are two ways to run a single model on multiple GPUs: **data parallelism** and **device parallelism**.  
In most cases, what you need is most likely data parallelism.

### Data parallelism

Data parallelism consists of replicating the target model once on each device, and using each replica to process a different fraction of the input data. Keras has a built-in utility, *keras.utils.multi_gpu_model*, which can produce a data-parallel version of any model, and achieves quasi-linear speedup on up to 8 GPUs.  

For more information, see the documentation for multi_gpu_model. Here is a quick example:  
```python
from keras.utils import multi_gpu_model

# `model`, `x` and `y` are assumed to be defined earlier in your script.
# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
```
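
To make the batch arithmetic in the comments concrete, here is a minimal sketch in plain Python of how a batch can be divided into slices, one per GPU. `split_batch` is a hypothetical helper written for illustration only; it mirrors the idea behind data parallelism and is not a description of how `multi_gpu_model` is implemented internally.

```python
def split_batch(batch_size, n_gpus):
    """Return (start, end) index pairs, one slice of the batch per GPU."""
    # Assumption: if the batch does not divide evenly, the last slice
    # absorbs the remainder.
    step = batch_size // n_gpus
    slices = []
    for i in range(n_gpus):
        start = i * step
        end = batch_size if i == n_gpus - 1 else start + step
        slices.append((start, end))
    return slices


# A batch of 256 samples over 8 GPUs gives 8 slices of 32 samples each.
slices = split_batch(256, 8)
sizes = [end - start for start, end in slices]
```

Each replica would then run the same model on its own slice, and the results would be combined into a single batch again.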

### Device parallelism
Device parallelism consists of running different parts of the same model on different devices. It works best for models that have a parallel architecture, e.g. a model with two branches.  

This can be achieved by using TensorFlow device scopes.  
```python
import keras
import tensorflow as tf

# Model where a shared LSTM is used to encode two different sequences in parallel
input_a = keras.Input(shape=(140, 256))
input_b = keras.Input(shape=(140, 256))

shared_lstm = keras.layers.LSTM(64)

# Process the first sequence on one GPU
with tf.device('/gpu:0'):
    encoded_a = shared_lstm(input_a)
# Process the next sequence on another GPU
with tf.device('/gpu:1'):
    encoded_b = shared_lstm(input_b)

# Concatenate results on CPU
with tf.device('/cpu:0'):
    merged_vector = keras.layers.concatenate([encoded_a, encoded_b], axis=-1)
```