# GPUs
:label:`sec_use_gpu`

In :numref:`tab_intro_decade`, we discussed the rapid growth
of computation over the past two decades.
In a nutshell, GPU performance has increased
by a factor of 1000 every decade since 2000.
This offers great opportunities, but it also suggests
that there is significant demand for such performance.


In this section, we begin to discuss how to harness
this computational performance for your research.
We start with a single GPU and, at a later point,
turn to multiple GPUs and multiple servers (with multiple GPUs).

Specifically, we will discuss how
to use a single NVIDIA GPU for calculations.
First, make sure you have at least one NVIDIA GPU installed.
Then, download the [NVIDIA driver and CUDA](https://developer.nvidia.com/cuda-downloads)
and follow the prompts to set the appropriate path.
Once these preparations are complete,
the `nvidia-smi` command can be used
to (**view the graphics card information**).


In [1]:
!nvidia-smi

Sun Oct  3 18:24:57 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce 940MX       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P0    N/A /  N/A |    264MiB /  2004MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In PyTorch, every array has a device, which we often refer to as a context.
So far, by default, all variables
and associated computation
have been assigned to the CPU.
Typically, other contexts might be various GPUs.
Things can get even hairier when
we deploy jobs across multiple servers.
By assigning arrays to contexts intelligently,
we can minimize the time spent
transferring data between devices.
For example, when training neural networks on a server with a GPU,
we typically prefer for the model's parameters to live on the GPU.

Next, we need to confirm that
the GPU version of PyTorch is installed.
If a CPU version of PyTorch is already installed,
we need to uninstall it first.
For example, use the `pip uninstall torch` command,
then install the corresponding PyTorch version
according to your CUDA version.
The exact install command depends on your CUDA version and platform;
the selector on the [PyTorch installation page](https://pytorch.org/get-started/locally/)
generates the matching command.
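One quick way to check which build ended up installed is to ask PyTorch itself. The following is a minimal sketch that assumes only that `torch` can be imported; the helper name `describe_torch_build` is ours and not part of any library.

```python
import torch

def describe_torch_build():
    """Summarize the installed PyTorch build (hypothetical helper)."""
    return {
        'torch_version': torch.__version__,
        # None for CPU-only builds, otherwise the CUDA version string
        'built_with_cuda': torch.version.cuda,
        # True only if a usable GPU and driver are present at runtime
        'cuda_available': torch.cuda.is_available(),
        'num_gpus': torch.cuda.device_count(),
    }

info = describe_torch_build()
print(info)
```

If `built_with_cuda` is `None`, the installed package is most likely a CPU-only build, and reinstalling as described above would be the next step.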


To run the programs in this section,
you need at least two GPUs.
Note that this might be extravagant for most desktop computers
but it is easily available in the cloud, e.g.,
by using the AWS EC2 multi-GPU instances.
Almost all other sections do *not* require multiple GPUs.
Instead, this is simply to illustrate
how data flow between different devices.

## [**Computing Devices**]

We can specify devices, such as CPUs and GPUs,
for storage and calculation.
By default, tensors are created in main memory
and the CPU is used for computations on them.


In PyTorch, the CPU and GPU can be indicated by `torch.device('cpu')` and `torch.device('cuda')`.
It should be noted that the `cpu` device
means all physical CPUs and memory.
This means that PyTorch's calculations
will try to use all CPU cores.
However, a `cuda` device only represents one card
and the corresponding memory.
If there are multiple GPUs, we use `torch.device(f'cuda:{i}')`
to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0).
Also, `cuda` without an index refers to the current CUDA device, which is `cuda:0` by default.


In [5]:
import torch
from torch import nn

torch.device('cpu'), torch.device('cuda'), torch.device('cuda:1')

(device(type='cpu'), device(type='cuda'), device(type='cuda', index=1))

We can (**query the number of available GPUs.**)


In [6]:
torch.cuda.device_count()

1

Now we [**define two convenient functions that allow us
to run code even if the requested GPUs do not exist.**]


In [7]:
def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [
        torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
    return devices if devices else [torch.device('cpu')]

try_gpu(), try_gpu(10), try_all_gpus()

(device(type='cuda', index=0),
 device(type='cpu'),
 [device(type='cuda', index=0)])

## Tensors and GPUs

By default, tensors are created on the CPU.
We can [**query the device where the tensor is located.**]


In [8]:
x = torch.tensor([1, 2, 3])
x.device

device(type='cpu')

It is important to note that whenever we want
to operate on multiple terms,
they need to be on the same device.
For instance, if we sum two tensors,
we need to make sure that both arguments
live on the same device---otherwise the framework
would not know where to store the result
or even how to decide where to perform the computation.
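To make the same-device requirement concrete, here is a small sketch. It only attempts the mixed-device addition when a GPU is present, and the helper `add_on_same_device` is a hypothetical name introduced for illustration.

```python
import torch

def add_on_same_device(a, b):
    """Add two tensors after moving `b` to the device of `a` (hypothetical helper)."""
    return a + b.to(a.device)

a = torch.ones(3)  # created on the CPU by default
if torch.cuda.is_available():
    b = torch.ones(3, device='cuda')
    try:
        a + b  # operands live on different devices
    except RuntimeError as err:
        print('mixed-device add failed:', err)
else:
    b = torch.ones(3)  # no GPU: both operands already share the CPU

result = add_on_same_device(a, b)
print(result, result.device)
```

Moving `b` to the device of `a` is just one possible policy; in practice we usually move the smaller operand, or whichever one is not needed elsewhere.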

### Storage on the GPU

There are several ways to [**store a tensor on the GPU.**]
For example, we can specify a storage device when creating a tensor.
Next, we create the tensor variable `X` on the first GPU.
The tensor created on a GPU only consumes the memory of this GPU.
We can use the `nvidia-smi` command to view GPU memory usage.
In general, we need to make sure that we do not create data that exceed the GPU memory limit.


In [9]:
X = torch.ones(2, 3, device=try_gpu())
X

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
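Besides `nvidia-smi`, memory use can also be estimated from within Python. The sketch below assumes the default `float32` dtype and falls back to the CPU when no GPU is available; the variable names are ours.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
X = torch.ones(2, 3, device=device)

# Bytes occupied by the tensor's elements: 6 float32 values, 4 bytes each
nbytes = X.element_size() * X.nelement()
print(nbytes)

if device.type == 'cuda':
    # Bytes currently held by tensors on this GPU, as tracked by PyTorch's
    # allocator; typically larger than nbytes because blocks are rounded up
    print(torch.cuda.memory_allocated(device))
```

Multiplying the element size by the number of elements before allocating a large tensor is a cheap way to check whether it is likely to fit in GPU memory.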

Assuming that you have at least two GPUs, the following code will (**create a random tensor on the second GPU.**)


In [11]:
Y = torch.rand(2, 3, device=try_gpu(1))
Y

tensor([[0.9520, 0.9171, 0.4997],
        [0.7426, 0.1999, 0.9146]], device='cuda:1')

### Copying

[**If we want to compute `X + Y`,
we need to decide where to perform this operation.**]
For instance, as shown in :numref:`fig_copyto`,
we can transfer `X` to the second GPU
and perform the operation there.
*Do not* simply add `X` and `Y`,
since this will result in an exception.
The runtime engine would not know what to do:
it cannot find data on the same device and it fails.
Since `Y` lives on the second GPU,
we need to move `X` there before we can add the two.

![Copy data to perform an operation on the same device.](../img/copyto.svg)
:label:`fig_copyto`


In [8]:
Z = X.cuda(1)
print(X)
print(Z)

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:1')


Now that [**the data are on the same GPU
(both `Z` and `Y` are),
we can add them up.**]


In [9]:
Y + Z

tensor([[1.6254, 1.5894, 1.6790],
        [1.3811, 1.4399, 1.5367]], device='cuda:1')

Imagine that your variable `Z` already lives on your second GPU.
What happens if we still call `Z.cuda(1)`?
It will return `Z` instead of making a copy and allocating new memory.


In [10]:
Z.cuda(1) is Z

True
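The same behavior holds for the more general `Tensor.to` method, which accepts any device and therefore also works on machines without a GPU. A minimal sketch, assuming we fall back to the CPU when CUDA is unavailable:

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
X = torch.ones(2, 3)

Z = X.to(device)          # copies only if X is not already on `device`
same = Z.to(device) is Z  # already there, so the same object comes back
print(Z.device, same)
```

Writing code against a `device` variable rather than calling `.cuda()` directly keeps the same script runnable on CPU-only machines.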

### Side Notes

People use GPUs to do machine learning
because they expect them to be fast.
But transferring variables between devices is slow.
So we want you to be 100% certain
that you want to do something slow before we let you do it.
If the deep learning framework just did the copy automatically
without crashing then you might not realize
that you had written some slow code.

Also, transferring data between devices (CPU, GPUs, and other machines)
is something that is much slower than computation.
It also makes parallelization a lot more difficult,
since we have to wait for data to be sent (or rather to be received)
before we can proceed with more operations.
This is why copy operations should be taken with great care.
As a rule of thumb, many small operations
are much worse than one big operation.
Moreover, several operations at a time
are much better than many single operations interspersed in the code
unless you know what you are doing.
This is the case since such operations can block if one device
has to wait for the other before it can do something else.
It is a bit like ordering your coffee in a queue
rather than pre-ordering it by phone
and finding out that it is ready when you are.

Last, when we print tensors or convert tensors to the NumPy format,
if the data are not in main memory,
the framework will copy them to main memory first,
resulting in additional transmission overhead.
Even worse, the program is now subject to the dreaded global interpreter lock
that makes everything wait for Python to complete.
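One way to follow this advice is to keep a running log on the device and copy it to the host once. The sketch below uses a random stand-in for the loss, and the names `loss_log` and `num_steps` are illustrative assumptions rather than part of any API.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_steps = 100

# Preallocate the log on the same device as the computation
loss_log = torch.zeros(num_steps, device=device)
for step in range(num_steps):
    # Stand-in for a real minibatch loss: mean of squared random values
    loss = (torch.rand(10, device=device) ** 2).mean()
    loss_log[step] = loss  # stays on the device, no host copy

# One transfer at the end instead of one per step
mean_loss = loss_log.mean().item()
print(f'{mean_loss:.4f}')
```

Calling `.item()` or `print` inside the loop would instead force a device-to-host copy on every step.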


## [**Neural Networks and GPUs**]

Similarly, a neural network model can specify devices.
The following code puts the model parameters on the GPU.


In [11]:
net = nn.Sequential(nn.Linear(3, 1))
net = net.to(device=try_gpu())

We will see many more examples of
how to run models on GPUs in the following chapters,
simply since they will become somewhat more computationally intensive.

When the input is a tensor on the GPU, the model will calculate the result on the same GPU.


In [12]:
net(X)

tensor([[0.3583],
        [0.3583]], device='cuda:0', grad_fn=<AddmmBackward>)

Let us (**confirm that the model parameters are stored on the same GPU.**)


In [13]:
net[0].weight.data.device

device(type='cuda', index=0)
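A common pattern, sketched below under the assumption that all parameters live on a single device, is to read the device from the model and move each input there before the forward pass. The variable names are ours.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = nn.Sequential(nn.Linear(3, 1)).to(device)

# Use the first parameter to find out where the model lives
model_device = next(net.parameters()).device
all_same = all(p.device == model_device for p in net.parameters())

# Inputs created on the CPU must be moved before the forward pass
X_cpu = torch.ones(2, 3)
out = net(X_cpu.to(model_device))
print(all_same, out.shape, out.device)
```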

In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following chapters we will see several such examples.

## Summary

* We can specify devices for storage and calculation, such as the CPU or GPU.
  By default, data are created in main memory
  and the CPU is used for calculations.
* The deep learning framework requires all input data for calculation
  to be on the same device,
  be it CPU or the same GPU.
* You can lose significant performance by moving data without care.
  A typical mistake is as follows: computing the loss
  for every minibatch on the GPU and reporting it back
  to the user on the command line (or logging it in a NumPy `ndarray`)
  will trigger a global interpreter lock which stalls all GPUs.
  It is much better to allocate memory
  for logging inside the GPU and only move larger logs.

## Exercises

1. Try a larger computation task, such as the multiplication of large matrices,
   and see the difference in speed between the CPU and GPU.
   What about a task with a small amount of calculations?
1. How should we read and write model parameters on the GPU?
1. Measure the time it takes to compute 1000
   matrix-matrix multiplications of $100 \times 100$ matrices
   and log the Frobenius norm of the output matrix one result at a time
   vs. keeping a log on the GPU and transferring only the final result.
1. Measure how much time it takes to perform two matrix-matrix multiplications
   on two GPUs at the same time vs. in sequence
   on one GPU. Hint: you should see almost linear scaling.


In [22]:
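%%time
# Multiply two 1000 x 1000 matrices on the GPU.
# CUDA kernels launch asynchronously, so without torch.cuda.synchronize()
# the reported time may not include the full computation.
mat1 = torch.rand(size=(1000, 1000), device="cuda")
mat2 = torch.rand(size=(1000, 1000), device="cuda")
mat1 @ mat2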

CPU times: user 0 ns, sys: 2.35 ms, total: 2.35 ms
Wall time: 1.41 ms


tensor([[250.1902, 249.2633, 256.6523,  ..., 258.4780, 261.5490, 257.0576],
        [244.4353, 243.3054, 250.9145,  ..., 255.4516, 252.3988, 248.2755],
        [247.2702, 256.6495, 258.4785,  ..., 261.7487, 257.1197, 254.7344],
        ...,
        [237.3310, 236.7554, 246.6906,  ..., 250.7657, 252.0147, 247.2911],
        [244.2958, 241.1219, 247.7389,  ..., 249.3332, 248.9553, 245.8207],
        [241.9816, 244.5497, 245.8743,  ..., 248.5241, 252.5327, 248.8336]],
       device='cuda:0')

In [23]:
%%time
# The same multiplication on the CPU for comparison.
cmat1 = torch.rand(size=(1000, 1000))
cmat2 = torch.rand(size=(1000, 1000))
cmat1 @ cmat2

CPU times: user 49.6 ms, sys: 6.55 ms, total: 56.2 ms
Wall time: 40.7 ms


tensor([[262.3176, 265.5510, 263.1043,  ..., 253.2892, 256.5887, 263.5003],
        [245.5448, 250.3501, 243.5349,  ..., 241.3442, 242.2385, 249.7554],
        [265.3176, 256.9858, 257.5384,  ..., 255.3320, 258.6715, 260.5834],
        ...,
        [253.2216, 254.8988, 255.2290,  ..., 250.5921, 248.4778, 258.2626],
        [247.3179, 246.9469, 256.1393,  ..., 242.9561, 245.8574, 248.2369],
        [250.4972, 250.8333, 248.6693,  ..., 245.5199, 245.5322, 252.0666]])

In [29]:
%%time
# A small task on the GPU: multiply two 10 x 10 matrices.
mat1 = torch.rand(size=(10, 10), device="cuda")
mat2 = torch.rand(size=(10, 10), device="cuda")
mat1 @ mat2

CPU times: user 3.58 ms, sys: 0 ns, total: 3.58 ms
Wall time: 2.02 ms


tensor([[2.3601, 1.9744, 2.8437, 2.2071, 1.7932, 2.8040, 2.5621, 2.7248, 2.7566,
         2.5303],
        [3.3752, 2.8262, 4.2553, 3.0018, 2.8311, 4.1486, 3.8601, 3.9614, 3.4200,
         3.3799],
        [1.8647, 1.4047, 2.8949, 2.4029, 1.5081, 2.6371, 2.9830, 2.7959, 1.9864,
         1.8769],
        [1.8381, 1.7445, 2.5300, 2.6354, 1.2741, 2.5711, 2.3872, 2.3987, 2.4197,
         2.0141],
        [2.9224, 2.3075, 3.5317, 2.6155, 2.3895, 3.5359, 3.8099, 4.0587, 2.5569,
         2.8081],
        [1.9969, 1.9212, 1.8611, 1.9316, 1.6494, 2.0236, 1.8767, 2.5465, 1.9707,
         2.7303],
        [2.4641, 2.0773, 2.4537, 1.9613, 1.5819, 2.7824, 2.6518, 3.0149, 2.2901,
         2.1211],
        [1.4173, 1.2284, 2.1891, 1.6922, 1.4651, 2.0167, 2.1726, 1.9976, 1.5614,
         1.3093],
        [2.8042, 2.2416, 3.3750, 2.1541, 2.4416, 3.2690, 2.9968, 3.2527, 2.1858,
         2.7010],
        [2.7512, 1.9890, 2.6774, 2.5629, 1.4020, 2.3547, 3.3021, 3.2130, 2.6174,
         2.7176]], device='c

In [28]:
%%time
# The same small task on the CPU.
cmat1 = torch.rand(size=(10, 10))
cmat2 = torch.rand(size=(10, 10))
cmat1 @ cmat2

CPU times: user 721 µs, sys: 266 µs, total: 987 µs
Wall time: 669 µs


tensor([[2.0081, 2.8147, 2.9947, 1.6371, 1.1374, 2.1292, 2.2030, 2.4247, 2.4846,
         1.7819],
        [1.8138, 2.8805, 3.2652, 2.4198, 1.3734, 2.8765, 1.7320, 2.2273, 2.6091,
         2.5451],
        [2.0451, 3.3301, 3.5665, 2.5648, 1.6631, 3.0180, 2.1163, 2.5234, 2.9700,
         2.9185],
        [2.4842, 3.4720, 3.9635, 2.8895, 2.1661, 3.5853, 2.4080, 3.2623, 3.1985,
         2.6640],
        [1.5650, 2.2128, 2.4900, 1.8758, 1.1352, 2.2968, 1.2053, 1.8245, 1.8440,
         2.0261],
        [2.1567, 3.3778, 3.3430, 2.6012, 1.2878, 2.7936, 2.1602, 2.6063, 2.7456,
         2.2909],
        [1.7704, 2.1728, 2.2165, 1.8316, 1.1405, 2.2387, 1.5783, 1.7794, 2.1045,
         1.7769],
        [2.0810, 2.4070, 3.0029, 2.1485, 1.6362, 2.7940, 1.9092, 2.4360, 2.7122,
         2.2990],
        [1.7521, 2.1582, 2.2710, 1.7125, 0.8667, 1.8779, 1.3959, 1.9188, 1.5988,
         1.3072],
        [1.1284, 1.8470, 1.7871, 1.6762, 1.0466, 1.8841, 1.0116, 1.6809, 1.5857,
         1.4693]])

In [45]:
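%%time
# Scaling by 1/75 keeps repeated squaring from overflowing.
mat1 = torch.rand(size=(100, 100), device="cuda") / 75
for i in range(1000):
    mat1 = mat1 @ mat1
    frobenius = torch.norm(mat1)  # stays on the GPU; nothing is copied back
    # print(frobenius)  # uncomment to log every step (forces a GPU-to-CPU copy)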

CPU times: user 43 ms, sys: 5.82 ms, total: 48.9 ms
Wall time: 46.5 ms
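For the logging exercise, the two variants can be compared more directly. The following is only a sketch under a few assumptions: it multiplies two fixed matrices rather than squaring repeatedly, it synchronizes before reading the clock so that queued GPU work is included, and the helper `sync` and the variable names are ours.

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def sync():
    """Wait for queued GPU work so wall-clock timings are meaningful."""
    if device.type == 'cuda':
        torch.cuda.synchronize()

A = torch.rand(100, 100, device=device)
B = torch.rand(100, 100, device=device)
n = 1000

# Variant 1: copy each norm to the host as soon as it is computed
sync()
start = time.perf_counter()
log_eager = [torch.norm(A @ B).item() for _ in range(n)]
sync()
t_eager = time.perf_counter() - start

# Variant 2: keep the log on the device and transfer it once at the end
log_device = torch.zeros(n, device=device)
sync()
start = time.perf_counter()
for i in range(n):
    log_device[i] = torch.norm(A @ B)
log_lazy = log_device.tolist()
sync()
t_lazy = time.perf_counter() - start

print(f'per-step copy: {t_eager:.4f} s, single copy: {t_lazy:.4f} s')
```

The measured gap depends on the hardware and driver, so it is worth running the cell several times before drawing conclusions.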


[Discussions](https://discuss.d2l.ai/t/63)
