In [19]:
# ! uv pip install nbdistributed

Load the `nbdistributed` extension to start working with multiple GPUs.

In [6]:
%load_ext nbdistributed

The nbdistributed extension is already loaded. To reload it, use:
  %reload_ext nbdistributed


In [7]:
%dist_init --num-processes=2 --gpu-ids 0,1

Using GPU IDs: [0, 1]
Starting 2 distributed workers...
✓ Successfully started 2 workers
  Rank 0 -> GPU 0
  Rank 1 -> GPU 1
Available commands:
  %%distributed - Execute code on all ranks (explicit)
  %%rank [0,n] - Execute code on specific ranks
  %sync - Synchronize all ranks
  %dist_status - Show worker status
  %dist_mode - Toggle automatic distributed mode
  %dist_shutdown - Shutdown workers

🚀 Distributed mode active: All cells will now execute on workers automatically!
   Magic commands (%, %%) will still execute locally as normal.

🐍 Below are auto-imported and special variables auto-generated into the namespace to use
  `torch`
  `dist`: `torch.distributed` import alias
  `rank` (`int`): The local rank
  `world_size` (`int`): The global world size
  `gpu_id` (`int`): The specific GPU ID assigned to this worker
  `device` (`torch.device`): The current PyTorch device object (e.g. `cuda:1`)


In [8]:
%dist_status

Distributed cluster status (2 processes):
Rank 0: ✓ PID 130
  ├─ GPU: 0 (Tesla T4)
  └─ Status: Running

Rank 1: ✓ PID 131
  ├─ GPU: 1 (Tesla T4)
  └─ Status: Running



`%dist_status` reports the state of each worker: its rank, PID, assigned GPU, and whether it is running.

In [None]:
%%rank [0]
t = torch.tensor([1,2,3]).to(device)
t


🔹 Rank 0:
  tensor([1, 2, 3], device='cuda:0')


In [None]:
device


🔹 Rank 0:
  0

🔹 Rank 1:
  1


In [11]:
%%rank [1]
t = torch.tensor([1,2,3]).to(device)
t


🔹 Rank 1:
  tensor([1, 2, 3], device='cuda:1')


Each rank now holds its own tensor named `t`, one on `cuda:0` and one on `cuda:1`. Reusing the same variable name for different objects on different ranks is hard to follow in a notebook and could be an anti-pattern.
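A minimal sketch of why one name can refer to two separate objects, assuming each worker keeps its own namespace dictionary (the traceback further down shows the worker evaluating code against `self.namespace`). The `namespaces` dict and the `run_on_rank` helper are hypothetical, not part of `nbdistributed`, and plain lists stand in for tensors.

```python
# Hypothetical stand-in for two workers: one namespace dict per rank.
namespaces = {0: {"rank": 0}, 1: {"rank": 1}}


def run_on_rank(rank, code):
    """Execute `code` only in the namespace that belongs to `rank`."""
    exec(code, namespaces[rank])


# The same statement, sent to each rank separately, binds two independent objects.
run_on_rank(0, "t = [1, 2, 3]")
run_on_rank(1, "t = [1, 2, 3]")

# Mutating rank 1's `t` leaves rank 0's `t` untouched.
run_on_rank(1, "t.append(4)")

print(namespaces[0]["t"])  # [1, 2, 3]
print(namespaces[1]["t"])  # [1, 2, 3, 4]
```

The point of the sketch is only that the name `t` is looked up per worker, so nothing ties the two values together.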

In [12]:
%%rank [1]
b = torch.tensor([1,3,4]).to(device)
b


🔹 Rank 1:
  tensor([1, 3, 4], device='cuda:1')


In [13]:
t + b


🔹 Rank 1:
  tensor([2, 5, 7], device='cuda:1')

❌ Error on Rank 0: name 'b' is not defined
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/nbdistributed/worker.py", line 284, in _execute_code_streaming
    result = eval(compile(tree, '<string>', 'eval'), self.namespace)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
NameError: name 'b' is not defined



The cell fails on rank 0 because `b` was only defined on rank 1. Without a `%%rank` magic, `nbdistributed` runs the cell on every rank, and rank 0 has no `b` in its namespace.
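The partial failure can be reproduced without GPUs. The sketch below assumes per-rank namespace dictionaries that mirror the state at this point in the notebook (`t` on both ranks, `b` only on rank 1). The `run_on_all` helper is hypothetical, and lists stand in for tensors.

```python
# Hypothetical per-rank namespaces: `t` exists everywhere, `b` only on rank 1.
namespaces = {
    0: {"t": [1, 2, 3]},
    1: {"t": [1, 2, 3], "b": [1, 3, 4]},
}


def run_on_all(expr):
    """Evaluate `expr` in every rank's namespace, collecting results or errors."""
    results = {}
    for rank, ns in namespaces.items():
        try:
            results[rank] = eval(expr, ns)
        except NameError as err:
            results[rank] = f"Error: {err}"
    return results


# Lists do not add element-wise, so the sum is spelled out with zip.
results = run_on_all("[x + y for x, y in zip(t, b)]")

print(results[1])  # [2, 5, 7]
print(results[0])  # Error: name 'b' is not defined
```

As in the notebook output, the same expression succeeds on the rank that has both names and raises `NameError` on the rank that does not.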

In [14]:
%%rank [1]
t + b


🔹 Rank 1:
  tensor([2, 5, 7], device='cuda:1')


If we restrict the cell to rank 1 with `%%rank [1]`, both `t` and `b` exist in that worker's namespace and the addition succeeds.
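Another way to get the same effect, sketched here as an assumption rather than something this notebook demonstrates, is to leave the cell unrestricted and guard the work with the auto-injected `rank` variable. The simulation below uses hypothetical namespace dictionaries and lists in place of tensors. Whether this reads better than `%%rank` in practice is a judgment call.

```python
# Hypothetical per-rank namespaces, each carrying its own `rank` value.
namespaces = {
    0: {"rank": 0, "t": [1, 2, 3]},
    1: {"rank": 1, "t": [1, 2, 3], "b": [1, 3, 4]},
}

# The cell body is sent to every rank; the guard makes rank 0 skip the sum.
cell = """
if rank == 1:
    total = [x + y for x, y in zip(t, b)]
"""

for ns in namespaces.values():
    exec(cell, ns)

print(namespaces[1]["total"])    # [2, 5, 7]
print("total" in namespaces[0])  # False
```

Because rank 0 never evaluates the guarded branch, it never looks up `b` and no `NameError` is raised.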

In [15]:
%dist_shutdown

Shutting down distributed workers (nuclear option)...
Starting force shutdown...
Force shutdown completed
Distributed workers shutdown
📱 Normal cell execution restored


In [16]:
t

tensor([0, 0, 0])

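After shutdown, cells run in the local kernel again. `t` still resolves here, but its value, `tensor([0, 0, 0])` with no CUDA device, does not match the `[1, 2, 3]` tensors created on the workers, so it appears to come from the local kernel's own namespace rather than from rank 0. `b` was only defined on rank 1, so it is most likely unavailable.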

In [17]:
b

NameError: name 'b' is not defined

In [18]:
%dist_status

No distributed workers running

