In [1]:
# before activating GPU
# !nvidia-smi

In PyTorch, every tensor has a device, which we often refer to as its context. So far, by default, all variables
and their associated computation have been assigned to the CPU. Typically, the other contexts are
various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning tensors to contexts intelligently, we can minimize the time spent transferring data between
devices. For example, when training neural networks on a server with a GPU, we typically prefer
the model's parameters to live on the GPU.
Next, we need to confirm that the GPU version of PyTorch is installed. If a CPU-only version of PyTorch
is already installed, we need to uninstall it first, for example with the pip uninstall torch command, and then install the PyTorch build that matches your CUDA version. The exact
install command depends on your CUDA version and platform; the install selector on the PyTorch
website generates it for you.


In [2]:
!nvidia-smi

Sun Aug 15 01:33:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
import torch
from torch import nn

We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors
are created in main memory and the CPU is used to compute with them.
In PyTorch, the CPU and GPU are indicated by torch.device('cpu') and torch.device('cuda').
Note that the cpu device means all physical CPUs and memory, so PyTorch's calculations will try
to use all CPU cores. However, a cuda device represents only one card and its corresponding
memory. If there are multiple GPUs, we use torch.device(f'cuda:{i}') to represent the i-th GPU
(i starts from 0). Also, cuda refers to the current GPU, which is cuda:0 by default.

In [4]:
torch.device('cpu'), torch.device('cuda'), torch.device('cuda:1')

(device(type='cpu'), device(type='cuda'), device(type='cuda', index=1))

In [5]:
torch.cuda.device_count()

1

Let's define functions that allow us to run the requested code even if the requested GPUs do not exist.

In [6]:
def try_gpu(i=0):
    """Return cuda:i if it exists, otherwise fall back to the CPU."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

In [7]:
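def try_all_gpus():
    """Return all available GPUs, or [cpu] if there is no GPU."""
    devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
    # Always return a list so that callers can iterate over the result
    return devices if devices else [torch.device('cpu')]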

In [8]:
try_gpu(1), try_all_gpus()

(device(type='cpu'), [device(type='cuda', index=0)])

By default, tensors are created on the CPU. We can query the device where the tensor is located.


In [9]:
x = torch.tensor([1,2,3])
x.device

device(type='cpu')

It is important to note that whenever we want to operate on multiple terms, they need to be on the
same device. For instance, if we sum two tensors, we need to make sure that both arguments live
on the same device—otherwise the framework would not know where to store the result or even
how to decide where to perform the computation.
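
As a minimal sketch of this rule, the snippet below moves one operand to the other's device before adding. The helper name `add_on_same_device` is hypothetical and introduced only for illustration; on a machine without a GPU both tensors simply stay on the CPU.

```python
import torch


def add_on_same_device(a, b):
    """Add two tensors after moving b to the device of a (hypothetical helper)."""
    # .to() returns b itself when it is already on the target device
    return a + b.to(a.device)


# Pick the first GPU if one is available, otherwise stay on the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
a = torch.ones(2, 3, device=device)
b = torch.ones(2, 3)  # created on the CPU by default
c = add_on_same_device(a, b)
print(c.device)
```

The explicit `.to(a.device)` keeps the transfer visible in the code, which matches the idea that cross-device copies should be deliberate.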

There are several ways to store a tensor on the GPU. For example, we can specify a storage device
when creating a tensor. Next, we create the tensor variable x on the first GPU. A tensor created
on a GPU only consumes the memory of that GPU. We can use the nvidia-smi command to view
GPU memory usage. In general, we need to make sure that we do not create data that exceeds the
GPU memory limit.

In [10]:
x = torch.ones(2, 3, device=try_gpu())

In [11]:
x

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')

In [12]:
y = torch.ones(2,3, device=try_gpu(1))

In [13]:
y # on cpu

tensor([[1., 1., 1.],
        [1., 1., 1.]])

In [14]:
z = x.cuda(0)

In [15]:
z

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')

z and x can be computed together since they are on the same device, provided their shapes are compatible.

In [16]:
x + z

tensor([[2., 2., 2.],
        [2., 2., 2.]], device='cuda:0')

Imagine that your variable z already lives on the first GPU. What happens if we still call
z.cuda(0)? It will return z itself instead of making a copy and allocating new memory.

In [17]:
z.cuda(0) is z

True

People use GPUs to do machine learning because they expect them to be fast. But transferring
variables between devices is slow. So we want you to be 100% certain that you want to do something slow before we let you do it. If the deep learning framework just did the copy automatically
without crashing, you might not realize that you had written some slow code.
Also, transferring data between devices (CPU, GPUs, and other machines) is
much slower than computation. It also makes parallelization a lot more difficult, since we have to
wait for data to be sent (or rather, to be received) before we can proceed with more operations. This
is why copy operations should be handled with great care. As a rule of thumb, many small operations
are much worse than one big operation. Moreover, several operations at a time are much better
than many single operations interspersed in the code, unless you know what you are doing. This
is because such operations can block if one device has to wait for the other before it can do
something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone
and finding out that it is ready when you are.
Last, when we print tensors or convert tensors to the NumPy format, if the data is not in main
memory, the framework will copy it to main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes
everything wait for Python to complete.
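
To make the rule of thumb concrete, here is a hedged sketch that logs the Frobenius norm of repeated matrix products in two ways: copying each norm to the host inside the loop, and keeping the norms on the device and transferring them once at the end. The function names, the step count, and the rescaling by the norm (used only to keep the values finite) are illustrative assumptions, not part of the original notebook.

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def log_norms_each_step(a, b, steps):
    """Copy every norm to the host immediately: one small transfer per step."""
    log = []
    for _ in range(steps):
        a = a @ b
        n = torch.linalg.norm(a)  # Frobenius norm for a 2-D tensor
        log.append(n.item())      # .item() forces a device-to-host copy
        a = a / n                 # rescale so that the values stay finite
    return log


def log_norms_on_device(a, b, steps):
    """Keep the log on the device and transfer it once at the end."""
    log = torch.empty(steps, device=a.device)
    for i in range(steps):
        a = a @ b
        n = torch.linalg.norm(a)
        log[i] = n                # stays on the same device as a
        a = a / n
    return log.cpu().tolist()     # a single transfer for the whole log


torch.manual_seed(0)
a = torch.randn(100, 100, device=device)
b = torch.randn(100, 100, device=device)
norms_stepwise = log_norms_each_step(a, b, steps=100)
norms_batched = log_norms_on_device(a, b, steps=100)
print(len(norms_stepwise), len(norms_batched))
```

Both functions compute the same values; only the number of device-to-host copies differs. When timing them on a GPU, call `torch.cuda.synchronize()` before reading the clock so that pending kernels are included in the measurement.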

In [18]:
# using gpu with neural network
net = nn.Sequential(nn.Linear(3,1))
net = net.to(device=try_gpu(0))

In [19]:
net(torch.randn(2,3).to(device=try_gpu(0)))

tensor([[-0.2368],
        [-0.2634]], device='cuda:0', grad_fn=<AddmmBackward>)

In [20]:
net[0].weight.device

device(type='cuda', index=0)

### Exercises

1. Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small amount of calculations?
2. How should we read and write model parameters on the GPU?

* by indexing right

3. Measure the time it takes to compute 1000 matrix-matrix multiplications of 100 × 100 matrices and log the Frobenius norm of the output matrix one result at a time vs. keeping a log on the GPU and transferring only the final result.

4. Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs
at the same time vs. in sequence on one GPU. Hint: you should see almost linear scaling.

* We have already measured the time on one GPU; since a second GPU is not available here, we skip this exercise.
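
For exercise 2, one possible approach is sketched below, under the assumption that reading and writing parameters means both accessing them in place and saving them to disk. `state_dict()` exposes the parameters, `torch.save` writes them, and `torch.load` with `map_location` places the loaded tensors on the chosen device. The file name `params.pt` is a placeholder.

```python
import os
import tempfile

import torch
from torch import nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
net = nn.Sequential(nn.Linear(3, 1)).to(device)

# Read a parameter directly; it lives on the same device as the model
print(net[0].weight.device)

# Write a parameter in place without moving it off the device
with torch.no_grad():
    net[0].weight.fill_(0.5)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'params.pt')  # placeholder file name
    torch.save(net.state_dict(), path)
    # map_location controls which device the loaded tensors land on
    state = torch.load(path, map_location=device)

# Load the saved parameters into a fresh model on the same device
clone = nn.Sequential(nn.Linear(3, 1)).to(device)
clone.load_state_dict(state)
print(clone[0].weight)
```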

In [21]:
# answer to question 1

z = torch.randn(1000, 1000)
x = torch.randn(1000, 1000)

In [22]:
%%time
z * x

CPU times: user 1.04 ms, sys: 2.78 ms, total: 3.81 ms
Wall time: 3.26 ms


tensor([[ 1.3992, -0.0378,  0.9720,  ...,  0.0576, -1.5187,  0.0060],
        [ 1.7183,  0.8595,  0.0661,  ..., -0.2778, -0.4570, -0.0922],
        [ 0.3082, -0.3432,  0.0066,  ..., -0.1596, -0.4619,  0.5993],
        ...,
        [ 0.5238,  0.5094,  0.7713,  ..., -0.2503, -0.0841,  0.3215],
        [ 0.1634,  0.2402, -0.0315,  ..., -0.1962, -0.2267,  1.2554],
        [-0.2990,  0.3895,  2.8810,  ..., -2.5421,  0.1877, -0.0838]])

In [23]:
z = torch.randn(1000, 1000).to(device=try_gpu())
x = torch.randn(1000, 1000).to(device=try_gpu())

In [24]:
%%time
z * x

CPU times: user 265 µs, sys: 67 µs, total: 332 µs
Wall time: 205 µs


tensor([[ 0.5064,  1.8411, -0.0675,  ..., -1.9003,  0.0494, -0.0685],
        [ 0.0449,  1.3982,  1.4708,  ...,  1.2273,  0.0252, -0.3968],
        [ 0.0116, -1.3838, -0.1443,  ...,  0.3758, -0.3587, -2.6917],
        ...,
        [ 0.5400, -0.0314,  0.0174,  ..., -0.4030, -0.5504,  0.1872],
        [ 0.1848, -1.2533,  0.0858,  ..., -0.6625,  1.6501, -1.8974],
        [-0.1401,  0.5153,  0.3659,  ...,  0.5048,  0.8870,  0.0622]],
       device='cuda:0')

In [25]:
%%time
z = torch.randn(1,1)
x = torch.randn(1,1)
z*x

CPU times: user 913 µs, sys: 0 ns, total: 913 µs
Wall time: 665 µs


tensor([[-0.3749]])

In [26]:
%%time
z = torch.randn(1,1).to(device=try_gpu())
x = torch.randn(1,1).to(device=try_gpu())
z*x

CPU times: user 614 µs, sys: 0 ns, total: 614 µs
Wall time: 401 µs


tensor([[0.2681]], device='cuda:0')
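
Two points are worth keeping in mind when reading these timings: `z * x` is an elementwise product rather than a matrix product, and CUDA kernels are launched asynchronously, so a wall-clock measurement can stop before the GPU has finished. The sketch below is one way to time `torch.matmul` with explicit synchronization; the helper name `time_matmul`, the matrix sizes, and the repeat count are illustrative assumptions.

```python
import time

import torch


def time_matmul(n, device, repeats=10):
    """Return the average seconds per n x n matrix product on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up run, excluded from the measurement
    if device.type == 'cuda':
        torch.cuda.synchronize(device)  # wait for pending kernels before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device.type == 'cuda':
        torch.cuda.synchronize(device)  # make sure all products have finished
    return (time.perf_counter() - start) / repeats


devices = [torch.device('cpu')]
if torch.cuda.is_available():
    devices.append(torch.device('cuda:0'))

# Compare a tiny task (n=1) with a larger one (n=500) on every device
timings = {(str(d), n): time_matmul(n, d) for d in devices for n in (1, 500)}
for key, seconds in timings.items():
    print(key, f'{seconds:.6f} s')
```

For very small inputs the fixed cost of launching a kernel tends to dominate, so the GPU is not necessarily faster there.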

In [29]:
# answer to question 3
z = torch.randn(100,100).to(device=try_gpu())
x = torch.randn(100,100).to(device=try_gpu())

y = z*x
y.shape

torch.Size([100, 100])

In [30]:
%%time
for i in range(1000):
    z = z*x

torch.norm(z)

CPU times: user 13.6 ms, sys: 826 µs, total: 14.5 ms
Wall time: 17.7 ms


tensor(inf, device='cuda:0')

In [33]:

z = torch.randn(100,100)
x = torch.randn(100,100)

y = z*x
y.shape

torch.Size([100, 100])

In [34]:
%%time
for i in range(1000):
    z = z*x

torch.norm(z)

CPU times: user 55.2 ms, sys: 659 µs, total: 55.8 ms
Wall time: 57 ms


tensor(inf)