### Check GPU availability

Let's make sure we have access to a GPU by running the `nvidia-smi` command. If no GPU is listed, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `T4 GPU`, and click `Save`.

In [1]:
!nvidia-smi

Fri Jan  9 01:51:07 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 5070 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   32C    P1             39W /  300W |     769MiB /  16303MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
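The same check can be done from Python, which is handy when a later cell should fail early if no GPU is present. The sketch below is only an illustration: `gpu_visible` is a hypothetical helper name, and it assumes that `nvidia-smi -L` (which lists the detected GPUs) is available whenever the NVIDIA driver is installed.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # the NVIDIA driver tools are not on PATH
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Each detected device is printed on a line that starts with "GPU <index>:".
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```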
                                                

### Install dependencies

This cell installs RF-DETR pinned to version 1.2.1 (the Nano, Small, and Medium checkpoints are available from 1.2.1 onward), along with Supervision for benchmarking and Roboflow for pulling datasets and uploading models to the Roboflow platform.

In [2]:
!pip install -q rfdetr==1.2.1 supervision==0.26.1 roboflow
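To confirm which versions ended up in the environment, one option is to query the installed package metadata. This is a small sketch, not part of the original notebook; `installed_version` is a hypothetical helper, and the distribution names are assumed to match the names used in the `pip install` command above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("rfdetr", "supervision", "roboflow"):
    print(f"{name}: {installed_version(name)}")
```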

## Download Dataset from Roboflow Universe

RF-DETR expects the dataset to be in COCO format. Divide your dataset into three subdirectories: `train`, `valid`, and `test`. Each subdirectory should contain its own `_annotations.coco.json` file that holds the annotations for that particular split, along with the corresponding image files. Below is an example of the directory structure:

```
dataset/
├── train/
│   ├── _annotations.coco.json
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ... (other image files)
├── valid/
│   ├── _annotations.coco.json
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ... (other image files)
└── test/
    ├── _annotations.coco.json
    ├── image1.jpg
    ├── image2.jpg
    └── ... (other image files)
```
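Before training, it can help to verify that a dataset folder actually follows this layout. The following is a minimal sketch under the assumption that the three split names and the annotation file name are exactly as shown above; `missing_dataset_parts` is a hypothetical helper, not an RF-DETR function.

```python
from pathlib import Path

EXPECTED_SPLITS = ("train", "valid", "test")
ANNOTATION_FILE = "_annotations.coco.json"


def missing_dataset_parts(dataset_dir):
    """Return a list of human-readable problems; an empty list means the layout looks right."""
    root = Path(dataset_dir)
    problems = []
    for split in EXPECTED_SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}/")
        elif not (split_dir / ANNOTATION_FILE).is_file():
            problems.append(f"missing file: {split}/{ANNOTATION_FILE}")
    return problems
```

Calling `missing_dataset_parts("dataset")` on a correctly structured folder should return an empty list. The check only looks at the directory structure, not at the contents of the annotation files.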

[Roboflow](https://roboflow.com/annotate) allows you to create object detection datasets from scratch or convert existing datasets from formats like YOLO, and then export them in COCO JSON format for training. You can also explore [Roboflow Universe](https://universe.roboflow.com/) to find pre-labeled datasets for a range of use cases.

In [3]:
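import os

from roboflow import Roboflow

# Read the API key from the environment instead of hardcoding it in the notebook.
# ROBOFLOW_API_KEY is an assumed variable name; set it before running this cell.
rf = Roboflow(api_key=os.environ["ROBOFLOW_API_KEY"])
project = rf.workspace("augmented-startups").project("playing-cards-ow27d")
version = project.version(4)
dataset = version.download("coco")  # export in COCO JSON format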
                

loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in Playing-Cards-4 to coco:: 100%|██████████| 2122319/2122319 [02:27<00:00, 14414.23it/s]





Extracting Dataset Version Zip to Playing-Cards-4 in coco:: 100%|██████████| 24241/24241 [00:03<00:00, 6587.80it/s]
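Once the download finishes, a quick look at one annotation file shows how many images, classes, and boxes a split contains. The sketch below assumes the standard COCO JSON keys (`images`, `categories`, `annotations`, and `category_id` on each annotation); `summarize_coco` is a hypothetical helper written for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_coco(annotation_path):
    """Count images, categories, and boxes per class in a COCO annotation file."""
    data = json.loads(Path(annotation_path).read_text(encoding="utf-8"))
    id_to_name = {c["id"]: c["name"] for c in data.get("categories", [])}
    boxes_per_class = Counter(
        id_to_name.get(a["category_id"], "unknown")
        for a in data.get("annotations", [])
    )
    return {
        "images": len(data.get("images", [])),
        "categories": len(id_to_name),
        "boxes_per_class": dict(boxes_per_class),
    }
```

For the dataset downloaded above, a call might look like `summarize_coco(Path(dataset.location) / "train" / "_annotations.coco.json")`.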


## Train RF-DETR on custom dataset

### Choose the right `batch_size`

Different GPUs have different amounts of VRAM (video memory), which limits how many samples they can process at once during training. Two settings let you adapt training to your hardware: `batch_size`, the number of samples processed in each forward and backward pass, and `grad_accum_steps`, the number of passes whose gradients are accumulated before the weights are updated. This technique, called gradient accumulation, lets the model simulate training with a larger batch than fits in memory. The key is to keep the product of the two settings equal to 16, our recommended total batch size. For example, on powerful GPUs like the A100, set `batch_size=16` and `grad_accum_steps=1`. On smaller GPUs like the T4, use `batch_size=4` and `grad_accum_steps=4`. The training cell in this section uses `batch_size=8` and `grad_accum_steps=2`.
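The arithmetic behind this recommendation can be sketched in a few lines. The first helper picks `grad_accum_steps` for a given `batch_size` so that the product stays at 16. The second is a toy, pure-Python illustration of why accumulation works: for a mean loss and equally sized micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient. Both function names are hypothetical, and the toy model is not RF-DETR's actual training loop.

```python
import math

TARGET_TOTAL_BATCH = 16  # recommended total batch size from the text above


def grad_accum_steps_for(batch_size, target=TARGET_TOTAL_BATCH):
    """Return the accumulation steps that keep batch_size * steps == target."""
    if batch_size <= 0 or target % batch_size != 0:
        raise ValueError(f"batch_size must be a positive divisor of {target}")
    return target // batch_size


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, batch_size):
    """Average the gradients of equally sized micro-batches."""
    steps = len(xs) // batch_size
    total = 0.0
    for i in range(steps):
        chunk = slice(i * batch_size, (i + 1) * batch_size)
        total += mse_grad(w, xs[chunk], ys[chunk]) / steps
    return total


for bs in (16, 8, 4, 2):
    print(f"batch_size={bs} -> grad_accum_steps={grad_accum_steps_for(bs)}")

xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]
print(math.isclose(accumulated_grad(0.5, xs, ys, 4), mse_grad(0.5, xs, ys)))
```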

In [4]:
from rfdetr import RFDETRNano

model = RFDETRNano()

model.train(dataset_dir=dataset.location, epochs=80, batch_size=8, grad_accum_steps=2)

rf-detr-nano.pth: 100%|██████████| 349M/349M [00:24<00:00, 15.1MiB/s]   


Using a different number of positional encodings than DINOv2, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Using patch size 16 instead of 14, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Loading pretrain weights


num_classes mismatch: model has 90 classes, but your dataset has 53 classes
reinitializing your detection head with 53 classes.


Unable to initialize TensorBoard. Logging is turned off for this session.  Run 'pip install tensorboard' to enable logging.
Not using distributed mode
git:
  sha: 62b4983f894cfd8dcd28148f3548ce74cc2d6f88, status: clean, branch: main

Namespace(num_classes=53, grad_accum_steps=2, amp=True, lr=0.0001, lr_encoder=0.00015, batch_size=8, weight_decay=0.0001, epochs=80, lr_drop=100, clip_max_norm=0.1, lr_vit_layer_decay=0.8, lr_component_decay=0.7, do_benchmark=False, dropout=0, drop_path=0.0, drop_mode='standard', drop_schedule='constant', cutoff_epoch=0, pretrained_encoder=None, pretrain_weights='rf-detr-nano.pth', pretrain_exclude_keys=None, pretrain_keys_modify_to_load=None, pretrained_distiller=None, encoder='dinov2_windowed_small', vit_encoder_num_layers=12, window_block_indexes=None, position_embedding='sine', out_feature_indexes=[3, 6, 9, 12], freeze_encoder=False, layer_norm=True, rms_norm=False, backbone_lora=False, force_no_pretrain=False, dec_layers=2, dim_feedforward=2048, hidde



Epoch: [0]  [   0/1325]  eta: 0:39:40  lr: 0.000100  class_error: 92.80  loss: 8.8440 (8.8440)  loss_ce: 0.6428 (0.6428)  loss_bbox: 0.7125 (0.7125)  loss_giou: 1.4183 (1.4183)  loss_ce_0: 0.6098 (0.6098)  loss_bbox_0: 0.8578 (0.8578)  loss_giou_0: 1.5182 (1.5182)  loss_ce_enc: 0.5793 (0.5793)  loss_bbox_enc: 0.9328 (0.9328)  loss_giou_enc: 1.5725 (1.5725)  loss_ce_unscaled: 0.6428 (0.6428)  class_error_unscaled: 92.8040 (92.8040)  loss_bbox_unscaled: 0.1425 (0.1425)  loss_giou_unscaled: 0.7091 (0.7091)  cardinality_error_unscaled: 3892.3750 (3892.3750)  loss_ce_0_unscaled: 0.6098 (0.6098)  loss_bbox_0_unscaled: 0.1716 (0.1716)  loss_giou_0_unscaled: 0.7591 (0.7591)  cardinality_error_0_unscaled: 3893.1250 (3893.1250)  loss_ce_enc_unscaled: 0.5793 (0.5793)  loss_bbox_enc_unscaled: 0.1866 (0.1866)  loss_giou_enc_unscaled: 0.7862 (0.7862)  cardinality_error_enc_unscaled: 3885.2500 (3885.2500)  time: 1.7964  data: 0.4087  max mem: 3528
Epoch: [0]  [  10/1325]  eta: 0:10:23  lr: 0.000100  

KeyboardInterrupt: 

Before benchmarking the model, we need to load the best saved checkpoint. To ensure it fits on the GPU, we first need to free up GPU memory. This involves deleting any remaining references to previously used objects, triggering Python’s garbage collector, and clearing the CUDA memory cache.

In [None]:
cleanup_gpu_memory(model, verbose=True)

[Before] Allocated: 146.49 MB | Reserved: 10432.00 MB
[After]  Allocated: 146.49 MB | Reserved: 316.00 MB
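The `cleanup_gpu_memory` helper called above is not defined in this section. As a rough idea of what the garbage-collection and cache-clearing steps might look like, here is a minimal sketch. It is an assumption, not the notebook's actual implementation; `release_cuda_cache` is a hypothetical name, and PyTorch is imported lazily so the function degrades to a no-op where it is not installed.

```python
import gc


def release_cuda_cache(verbose=False):
    """Collect garbage, then empty the CUDA cache if PyTorch with CUDA is present.

    Returns True if the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # free Python objects that are no longer referenced
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    before = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    after = torch.cuda.memory_reserved()
    if verbose:
        print(f"Reserved: {before / 2**20:.2f} MB -> {after / 2**20:.2f} MB")
    return True
```

Memory held by live tensors is not released this way, so drop your own references first (for example `del model`) and then call `release_cuda_cache(verbose=True)`.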


<div align="center">
  <p>
    Looking for more tutorials or have questions?
    Check out our <a href="https://github.com/roboflow/notebooks">GitHub repo</a> for more notebooks,
    or visit our <a href="https://discord.gg/GbfgXGJ8Bk">Discord</a>.
  </p>
  
  <p>
    <strong>If you found this helpful, please consider giving us a ⭐
    <a href="https://github.com/roboflow/notebooks">on GitHub</a>!</strong>
  </p>

</div>