I want to learn how to estimate how big a model (depth, max_seq_len) I can create and still have it fit on my GPU.

In [41]:
import torch
import gc

### Toy example

Let me start by seeing if I can understand it for a simple model.

In [2]:
torch.cuda.is_available()

True

In [3]:
free, total = torch.cuda.mem_get_info()
free, total

(8249999360, 8354660352)

In [4]:
print(f"Free memory: {free / 1024**3:.2f} GiB")
print(f"Total memory: {total / 1024**3:.2f} GiB")

Free memory: 7.68 GiB
Total memory: 7.78 GiB


In [5]:
allocated = torch.cuda.memory_allocated()
allocated

0

In [6]:
free_before = free

In [7]:
model = torch.nn.Linear(in_features=10, out_features=10, bias=False, device="cuda", dtype=torch.float32)

In [8]:
# this should have 100 params, each 32 bits = 4 bytes, so I'm guessing it consumes this many bytes:
100 * 4

400

In [9]:
free, total = torch.cuda.mem_get_info()
free_before - free

4194304

In [10]:
torch.cuda.memory_allocated()

512

Is the 512 because it allocates in certain minimum units? Let's try a few things.

In [11]:
del model
gc.collect()
torch.cuda.empty_cache()
torch.cuda.memory_allocated()

0

In [12]:
# first confirm repeatable
model = torch.nn.Linear(in_features=10, out_features=10, bias=False, device="cuda", dtype=torch.float32)
torch.cuda.memory_allocated()

512

In [13]:
del model
gc.collect()
torch.cuda.empty_cache()
torch.cuda.memory_allocated()

0

In [14]:
# 11 x 11 x 4 = 484, so guessing we'll still see 512
model = torch.nn.Linear(in_features=11, out_features=11, bias=False, device="cuda", dtype=torch.float32)

In [15]:
torch.cuda.memory_allocated()

512

In [20]:
del model
gc.collect()
torch.cuda.empty_cache()
torch.cuda.memory_allocated()

0

In [21]:
# 12 x 12 x 4 = 576, which is more than 512, so guessing we'll now see 1024
model = torch.nn.Linear(in_features=12, out_features=12, bias=False, device="cuda", dtype=torch.float32)

In [22]:
torch.cuda.memory_allocated()

1024
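The three measurements so far are consistent with the allocator handing out memory in multiples of 512 bytes. Here is a minimal sketch of that guess, checked against the numbers above. `round_up` is a hypothetical helper of my own, not a PyTorch function, and the 512-byte block size is an assumption inferred from these small allocations (larger allocations may be rounded differently).

```python
def round_up(nbytes: int, block: int = 512) -> int:
    """Round nbytes up to the next multiple of block (assumed block size: 512)."""
    return -(-nbytes // block) * block

# (description, raw parameter bytes, torch.cuda.memory_allocated() seen in the notebook)
observations = [
    ("10 x 10 float32", 10 * 10 * 4, 512),
    ("11 x 11 float32", 11 * 11 * 4, 512),
    ("12 x 12 float32", 12 * 12 * 4, 1024),
]

for desc, raw, observed in observations:
    predicted = round_up(raw)
    print(f"{desc}: raw={raw}, predicted={predicted}, observed={observed}")
```

If the guess is right, `predicted` and `observed` agree on every row.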

What about the gradient? Does that space get allocated only as needed? Let's try.

In [25]:
x = torch.randn(12, device="cuda")

In [26]:
torch.cuda.memory_allocated()
# 1024 + 512 = 1536

1536

In [27]:
loss = model(x).sum()

In [28]:
torch.cuda.memory_allocated()

8521728

What's that all about? Is that memory allocated to do the forward pass? Maybe we should start even simpler and see what happens if we allocate two tensors and multiply them. But first, let's see what happens when we call backward().

In [31]:
loss.backward()

In [32]:
torch.cuda.memory_allocated()

17042432

In [37]:
model.weight.grad

tensor([[ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.0650,  0.6997],
        [ 0.6030, -1.4976,  0.3691, -1.1289,  0.7871, -1.1535,  0.2922,  0.8758,
         -0.8272, -0.8537,  0.

In [47]:
model.weight.untyped_storage().nbytes()

576

In [48]:
model.weight.grad.untyped_storage().nbytes()

576

In [49]:
x.untyped_storage().nbytes()

48

In [50]:
loss.untyped_storage().nbytes()

4
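To see how much of `memory_allocated()` the tensors themselves explain, here is a small ledger built only from the numbers recorded above. It assumes each storage is rounded up to a 512-byte block, and that the intermediate output of `model(x)` has already been freed once `.sum()` has produced `loss`. Both are my assumptions, not something the notebook verifies. `round_up` is a hypothetical helper.

```python
def round_up(nbytes: int, block: int = 512) -> int:
    """Round nbytes up to the next multiple of block (assumed block size: 512)."""
    return -(-nbytes // block) * block

# Storage sizes reported by untyped_storage().nbytes() in the notebook.
storage = {"weight": 576, "x": 48, "loss": 4, "weight.grad": 576}

after_forward = 8_521_728    # memory_allocated() after loss = model(x).sum()
after_backward = 17_042_432  # memory_allocated() after loss.backward()

# Tensors assumed live after the forward pass, then after backward (adds the gradient).
tensors_forward = sum(round_up(storage[k]) for k in ("weight", "x", "loss"))
tensors_backward = tensors_forward + round_up(storage["weight.grad"])

unexplained_forward = after_forward - tensors_forward
unexplained_backward = after_backward - tensors_backward

print(f"tensors after forward:  {tensors_forward:,}")
print(f"tensors after backward: {tensors_backward:,}")
print(f"unexplained after forward:  {unexplained_forward:,}")
print(f"unexplained after backward: {unexplained_backward:,}")
print("backward remainder is twice the forward one:",
      unexplained_backward == 2 * unexplained_forward)
```

Under these assumptions the tensors account for only a few KB, and the unexplained part after backward is exactly double the unexplained part after forward. That looks like one fixed-size allocation for the forward pass and a second one of the same size for the backward pass, rather than anything proportional to the model.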

Maybe it's not such a good idea to try to line things up at this level with `torch.cuda.memory_allocated()`, because torch (or something lower-level?) could be allocating caches, workspaces, etc. Let's see how things work with bigger numbers. But first, let's see what happens when allocating and multiplying raw tensors. My hope is that the numbers will match up there.

Also, could Jupyter be holding onto things?

In [1]:
## restart kernel ##
import torch
import gc

In [2]:
torch.cuda.memory_allocated()

0

In [3]:
m = torch.randn(12, 12, device="cuda", dtype=torch.float32)

In [4]:
torch.cuda.memory_allocated()

1024

In [5]:
m.untyped_storage().nbytes()

576

In [6]:
x = torch.randn(12, device="cuda", dtype=torch.float32)

In [7]:
torch.cuda.memory_allocated() # expecting 1024 + 512 = 1536

1536

In [8]:
y = m @ x

In [9]:
torch.cuda.memory_allocated() # expecting 1024 + 512 + 512 = 2048

8521728

No. Maybe as soon as you do matrix math it allocates memory for something?

In [11]:
torch.cuda.empty_cache()

In [12]:
torch.cuda.memory_allocated()

8521728

Does torch.no_grad() matter?

In [1]:
## restart kernel ##
import torch
import gc
torch.cuda.memory_allocated()

0

In [2]:
m = torch.randn(12, 12, device="cuda", dtype=torch.float32)
x = torch.randn(12, device="cuda", dtype=torch.float32)

In [3]:
torch.cuda.memory_allocated()

1536

In [4]:
with torch.no_grad():
    y = m @ x

In [5]:
torch.cuda.memory_allocated()

8521728

How about preallocating space for the result?

In [1]:
## restart kernel ##
import torch
import gc
torch.cuda.memory_allocated()

0

In [2]:
m = torch.randn(12, 12, device="cuda", dtype=torch.float32)
x = torch.randn(12, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)

In [3]:
torch.cuda.memory_allocated()

2048

In [4]:
foo = torch.matmul(m, x, out=y)
del foo
gc.collect()
torch.cuda.empty_cache()

In [5]:
torch.cuda.memory_allocated()

8521728

Not sure. Let's try much bigger numbers and see. Maybe the unaccounted-for memory is a fixed cost that becomes a rounding error at bigger sizes.

In [1]:
## restart kernel ##
import torch
import gc
torch.cuda.memory_allocated()

0

In [2]:
m = torch.randn(1000, 1000, device="cuda", dtype=torch.float32) # 4,000,000 bytes
x = torch.randn(1000, device="cuda", dtype=torch.float32) # 4000 bytes

In [3]:
torch.cuda.memory_allocated()

4004352

In [4]:
m.untyped_storage().nbytes(), x.untyped_storage().nbytes()

(4000000, 4000)

In [5]:
y = m @ x

In [7]:
y.untyped_storage().nbytes()

4000

In [6]:
torch.cuda.memory_allocated()

12528128

No. It jumps from ~4M to ~12M. Why?

In [9]:
torch.cuda.empty_cache()
torch.cuda.memory_allocated()

12528128
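Before going bigger, it helps to subtract what the tensors should cost from each matmul run above and look at what is left. The sketch below uses only numbers recorded in this notebook and assumes the result tensor is rounded up to a 512-byte block (`round_up` is a hypothetical helper). The last lines compare the leftover to a possible explanation: as far as I can tell from the PyTorch CUDA notes, the default cuBLAS workspace configuration is `:4096:2:16:8` (sizes in KiB), and that workspace is allocated through PyTorch's own allocator on the first matmul. I have not verified that this is what happens on this GPU, so treat it as a hypothesis.

```python
def round_up(nbytes: int, block: int = 512) -> int:
    """Round nbytes up to the next multiple of block (assumed block size: 512)."""
    return -(-nbytes // block) * block

# (label, allocated before matmul, allocated after, bytes newly allocated for the result)
runs = [
    ("12x12 @ 12", 1_536, 8_521_728, 48),
    ("12x12 @ 12, out=y preallocated", 2_048, 8_521_728, 0),
    ("1000x1000 @ 1000", 4_004_352, 12_528_128, 4_000),
]

overheads = []
for label, before, after, result_bytes in runs:
    overhead = after - before - round_up(result_bytes)
    overheads.append(overhead)
    print(f"{label}: unexplained = {overhead:,} bytes")

# Hypothesis: default cuBLAS workspace config ":4096:2:16:8" = 2 x 4096 KiB + 8 x 16 KiB.
cublas_default_workspace = (2 * 4096 + 8 * 16) * 1024
print(f"assumed cuBLAS workspace: {cublas_default_workspace:,} bytes")
print("matches every run:", all(o == cublas_default_workspace for o in overheads))
```

If the hypothesis holds, it would also explain why `torch.cuda.empty_cache()` changes nothing here: it only releases cached blocks that are no longer in use, and a workspace that is still held counts as allocated.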

What if the tensors are large enough that they and the result take up almost all the space?

In [1]:
## restart kernel ##
import torch
import gc
torch.cuda.memory_allocated()

0

In [2]:
free, total = torch.cuda.mem_get_info()
f"{free:,}"

'8,249,999,360'

In [3]:
import math
math.sqrt(free / 4)

45414.753549920315

In [4]:
m = torch.randn(40_000, 40_000, device="cuda", dtype=torch.float32)
x = torch.randn(40_000, device="cuda", dtype=torch.float32)

In [5]:
m.untyped_storage().nbytes() + x.untyped_storage().nbytes()

6400160000

In [6]:
f"{torch.cuda.memory_allocated():,}"

'6,400,668,160'

In [7]:
y = m @ x

In [8]:
y.untyped_storage().nbytes()

160000

In [9]:
f"{torch.cuda.memory_allocated():,}"

'6,409,348,096'

In [10]:
free, total = torch.cuda.mem_get_info()
f"{free:,}"

'1,813,839,872'

Well, that worked, and the "extra" increase after the multiplication was "only" ~9 MB:

In [11]:
torch.cuda.memory_allocated() - (m.untyped_storage().nbytes() + x.untyped_storage().nbytes() + y.untyped_storage().nbytes())

9028096
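That ~9 MB gap can be split into rounding plus a fixed part. The sketch below is a guess that happens to reproduce the number exactly. It assumes small tensors are rounded up to 512 bytes, that a very large allocation is rounded up to a multiple of 2 MiB, and that the fixed part is the same 8,519,680 bytes left over in the smaller matmul runs (12,528,128 - 4,004,352 - 4,096). The rounding rules are assumptions about the allocator, not something this notebook confirms, and `round_up` is a hypothetical helper.

```python
MIB = 1024 ** 2

def round_up(nbytes: int, block: int) -> int:
    """Round nbytes up to the next multiple of block."""
    return -(-nbytes // block) * block

m_bytes, x_bytes, y_bytes = 6_400_000_000, 160_000, 160_000  # untyped_storage().nbytes()
allocated_before_matmul = 6_400_668_160                      # with m and x on the GPU
observed_gap = 9_028_096                                     # result of the cell above

# Assumed rounding: 2 MiB granularity for the huge matrix, 512 bytes for the vectors.
m_block = round_up(m_bytes, 2 * MIB)
x_block = round_up(x_bytes, 512)
y_block = round_up(y_bytes, 512)

rounding_waste = (m_block - m_bytes) + (x_block - x_bytes) + (y_block - y_bytes)
fixed_overhead = 8_519_680  # leftover seen in the 12x12 and 1000x1000 runs

print(f"m + x predicted: {m_block + x_block:,} (observed {allocated_before_matmul:,})")
print(f"rounding waste:  {rounding_waste:,}")
print(f"rounding + fixed overhead: {rounding_waste + fixed_overhead:,} "
      f"(observed gap {observed_gap:,})")
```

So under these assumptions the gap is about 0.5 MB of rounding plus the same fixed ~8.5 MB as before, which fits the idea that the extra cost does not grow with tensor size.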

So maybe it's not worth worrying about exactly how this works. But what happens if we go back to the simple linear model with the same large number of params? Will the forward pass work, only for backward() to fail with an out-of-memory error because there is no room to store the gradient?

In [1]:
## restart kernel ##
import torch
import gc
torch.cuda.memory_allocated()

0

In [2]:
model = torch.nn.Linear(in_features=40_000, out_features=40_000, bias=False, device="cuda", dtype=torch.float32)

In [3]:
f"{torch.cuda.memory_allocated():,}"

'6,400,507,904'

In [4]:
x = torch.randn(40_000, device="cuda", dtype=torch.float32)

In [5]:
f"{torch.cuda.memory_allocated():,}"

'6,400,668,160'

In [6]:
loss = model(x).sum()

In [7]:
f"{torch.cuda.memory_allocated():,}"

'6,409,188,352'

In [8]:
loss.backward()

OutOfMemoryError: CUDA out of memory. Tried to allocate 5.96 GiB. GPU 0 has a total capacity of 7.78 GiB of which 1.68 GiB is free. Including non-PyTorch memory, this process has 6.10 GiB memory in use. Of the allocated memory 5.97 GiB is allocated by PyTorch, and 13.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
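The 5.96 GiB in the error message is the size of the gradient, which has the same shape and dtype as the weight matrix. A rough budget, sketched below, shows why this could never fit: weights plus gradient alone need about twice the weight size. The estimate of the largest square layer is an optimistic upper bound under my own simplification that only float32 weights and their gradients matter. It ignores activations, optimizer state, the fixed overhead seen earlier, and fragmentation.

```python
import math

GIB = 1024 ** 3
BYTES_PER_PARAM = 4            # float32
free_bytes = 8_249_999_360     # torch.cuda.mem_get_info() on a fresh kernel
n = 40_000                     # in_features == out_features, no bias

weight_bytes = n * n * BYTES_PER_PARAM
grad_bytes = weight_bytes      # the gradient matches the weight's shape and dtype
needed = weight_bytes + grad_bytes

print(f"weights:  {weight_bytes / GIB:.2f} GiB")
print(f"gradient: {grad_bytes / GIB:.2f} GiB")
print(f"needed:   {needed / GIB:.2f} GiB, free: {free_bytes / GIB:.2f} GiB")
print("fits:", needed <= free_bytes)

# Largest n such that n * n * 4 bytes * 2 (weights + gradient) <= free_bytes.
max_n = math.isqrt(free_bytes // (2 * BYTES_PER_PARAM))
print(f"upper bound on square layer size: n <= {max_n:,}")
```

In practice the usable size will be smaller than this bound, so it is a starting point for experiments rather than a guarantee.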