# CUDA

### CUDA Semantics
[CUDA semantics notes](https://pytorch.org/docs/stable/notes/cuda.html)

`torch.cuda` sets up and runs CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate are created on that device by default. Once a tensor is allocated, you can operate on it irrespective of the selected device, and the results will always be placed on the same device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of `copy_()` and other copy-like methods such as `to()` and `cuda()`. Unless you enable peer-to-peer memory access, launching an operation on tensors spread across different devices raises an error.
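
A minimal sketch of these device rules. It falls back to the CPU when no GPU is present, and the multi-GPU part only runs when PyTorch sees at least two devices; the variable names are illustrative.

```python
import torch

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3, device=device)  # allocated on the chosen device
b = a * 2                            # the result lives on the same device as `a`
print(a.device, b.device)

if torch.cuda.device_count() > 1:
    # The context manager changes the selected device for new allocations...
    with torch.cuda.device(1):
        c = torch.ones(2, 3, device="cuda")  # created on cuda:1
        d = a + 1                            # ...but ops on `a` still produce a cuda:0 result
    print(c.device, d.device)
    # Copy-like methods are the permitted way to move data across GPUs.
    e = a.to(c.device)
    print(e.device)
```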


### Asynchronous Execution

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on CPU or other GPUs.

In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.
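
One place where asynchrony does become visible is timing. The sketch below, which assumes a hypothetical helper named `timed_matmul`, measures a matrix multiply with and without `torch.cuda.synchronize()`. On a GPU the unsynchronized figure mostly reflects the cost of enqueuing the kernel; on a CPU the two figures measure the same thing.

```python
import time
import torch

def timed_matmul(device, n=512, sync=True):
    """Return the seconds spent on one n x n matmul on `device` (illustrative helper)."""
    x = torch.rand(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # make sure creating `x` has finished
    start = time.perf_counter()
    y = x @ x                           # on a GPU this only enqueues the kernel
    if sync and device.type == "cuda":
        torch.cuda.synchronize(device)  # wait until the kernel has actually run
    return time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("without sync:", timed_matmul(device, sync=False))
print("with sync:   ", timed_matmul(device, sync=True))
```

The first GPU call may also include one-time library initialization, so repeat the measurement before drawing conclusions.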

### Memory Management

PyTorch uses a __caching memory allocator__ to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as used in `nvidia-smi`. You can use `memory_allocated()` and `max_memory_allocated()` to monitor memory occupied by tensors, and `memory_cached()` and `max_memory_cached()` to monitor memory managed by the caching allocator (newer PyTorch releases rename these to `memory_reserved()` and `max_memory_reserved()` and deprecate the old names). Calling `empty_cache()` __releases all unused cached memory from PyTorch__ so that it can be used by other GPU applications. However, GPU memory occupied by tensors is not freed, so `empty_cache()` cannot increase the amount of GPU memory available to PyTorch.
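
A small sketch of how the two kinds of counters diverge. It assumes a PyTorch version that provides `memory_reserved()` (the newer name for `memory_cached()`); `report` is a hypothetical helper, and nothing is measured when no GPU is available.

```python
import torch

def report(tag):
    """Print tensor-occupied vs. allocator-managed memory on the current GPU."""
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()  # newer name for memory_cached()
    print(f"{tag}: allocated={allocated} B, reserved={reserved} B")
    return allocated, reserved

if torch.cuda.is_available():
    x = torch.rand(1024, 1024, device="cuda")  # about 4 MiB of float32 data
    report("after alloc")
    del x                      # the block returns to PyTorch's cache, not to the driver
    report("after del")
    torch.cuda.empty_cache()   # hand unused cached blocks back to the driver
    report("after empty_cache")
else:
    print("No CUDA device available; nothing to report.")
```

After `del x` the allocated figure should drop while the reserved figure stays put, which is why `nvidia-smi` keeps showing the memory as used until `empty_cache()` is called.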

### Device-agnostic Code
Use `torch.cuda.is_available()` to check for a GPU, then set the device to `'cuda'` or `'cpu'` based on the result so that the same code runs on either.
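
A minimal sketch of this pattern; the `nn.Linear` model and the tensor shapes are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

# Pick the device once and reuse it everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)        # move the parameters to the chosen device
inputs = torch.rand(8, 4, device=device)  # create the inputs directly on that device
outputs = model(inputs)
print(outputs.shape, outputs.device)
```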

### Pinned memory buffers

Host-to-GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a `pin_memory()` method that returns a copy of the object with its data placed in a pinned region.

Also, __once you pin a tensor or storage, you can use asynchronous GPU copies__. Just pass an additional `non_blocking=True` argument to a `to()` or a `cuda()` call. This can be used to overlap data transfers with computation.

You can make the DataLoader return batches placed in pinned memory by passing `pin_memory=True` to its constructor.
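
The sketch below puts the three pieces together: `pin_memory()`, `non_blocking=True`, and a `DataLoader` with `pin_memory=True`. The random dataset is a stand-in, and pinning is only requested when a GPU is available, since pinned memory requires CUDA.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

cpu_tensor = torch.rand(256, 16)
if use_cuda:
    pinned = cpu_tensor.pin_memory()               # returns a page-locked copy
    on_gpu = pinned.to(device, non_blocking=True)  # copy may overlap with other work

# Placeholder dataset: 64 samples with 16 features and a binary label each.
dataset = TensorDataset(torch.rand(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

for features, labels in loader:
    # With pinned batches these copies can be issued asynchronously.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batch_mean = features.mean()  # work on the same device waits for the copy

print(features.shape, features.device)
```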


### Code

In [None]:
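import torch

# True when PyTorch can see a CUDA-capable GPU and driver.
print(torch.cuda.is_available())

# The cells below assume a GPU is present.
device = torch.device('cuda')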

### Checking GPU Memory
`torch.cuda.max_memory_allocated(device=None)`

Returns the maximum GPU memory occupied by tensors in bytes for a given device.

By default, this returns the peak allocated memory since the beginning of this program. `reset_max_memory_allocated()` can be used to reset the starting point in tracking this metric. For example, these two functions can measure the peak allocated memory usage of each iteration in a training loop.
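
A sketch of per-iteration peak tracking, under a few assumptions: the loop body is a toy forward and backward pass, `peak_bytes_per_step` is a hypothetical helper, and `reset_peak_memory_stats()` is used in place of `reset_max_memory_allocated()` because newer PyTorch releases recommend it (it resets all peak counters, not only the allocated one).

```python
import torch

def peak_bytes_per_step(steps=3, n=256):
    """Return the peak allocated bytes observed in each toy training step."""
    peaks = []
    if not torch.cuda.is_available():
        return peaks  # nothing to measure without a GPU
    w = torch.rand(n, n, device="cuda", requires_grad=True)
    for _ in range(steps):
        torch.cuda.reset_peak_memory_stats()  # start tracking from the current usage
        x = torch.rand(n, n, device="cuda")
        loss = (x @ w).sum()
        loss.backward()
        w.grad = None                         # drop the gradient before the next step
        peaks.append(torch.cuda.max_memory_allocated())
    return peaks

print(peak_bytes_per_step())
```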


In [3]:
print(torch.cuda.max_memory_allocated())

0


In [16]:
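# 300 * 300 * 300 float32 values = 108,000,000 bytes, rounded up by the allocator.
x = torch.rand(300, 300, 300, device=device)
# Resetting the peak sets it to the current allocation, so x is still counted.
torch.cuda.reset_max_memory_allocated()
print(torch.cuda.max_memory_allocated())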

108000256


In [18]:
# Returns the current GPU memory occupied by tensors, in bytes, for a given device.
torch.cuda.memory_allocated()

108000256

`torch.cuda.max_memory_cached(device=None)`

Returns the maximum GPU memory managed by the caching allocator in bytes for a given device.

By default, this returns the peak cached memory since the beginning of this program. `reset_max_memory_cached()` can be used to reset the starting point in tracking this metric. For example, these two functions can measure the peak cached memory amount of each iteration in a training loop.

In [17]:
torch.cuda.reset_max_memory_cached()
print(torch.cuda.max_memory_cached())

220200960


In [7]:
torch.cuda.memory_cached()

111149056