# What is a 3D CNN for video?

A 3D CNN extends 2D convs into time. Instead of convolving over $(H, W)$, kernels slide over $(T, H, W)$ to learn **spatiotemporal** features.

If an input clip is $x \in \mathbb{R}^{B \times C_{\text{in}} \times T \times H \times W}$ and a kernel is $w \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_t \times K_h \times K_w}$, then with strides $(s_t, s_h, s_w)$ and zero padding $(p_t, p_h, p_w)$ the 3D convolution at output channel $o$ and output position $(t,i,j)$ is (per batch element, bias omitted):
$$
y_{o,t,i,j} \;=\; \sum_{c=1}^{C_{\text{in}}}\;\sum_{\tau=0}^{K_t-1}\;\sum_{m=0}^{K_h-1}\;\sum_{n=0}^{K_w-1}
w_{o,c,\tau,m,n}\;x_{c,\,t s_t + \tau - p_t,\, i s_h + m - p_h,\, j s_w + n - p_w}.
$$
Parameters: $C_{\text{out}}\cdot C_{\text{in}}\cdot K_t K_h K_w$.
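
As a quick check of the parameter count, the sketch below builds an `nn.Conv3d` and compares its weight count against $C_{\text{out}}\cdot C_{\text{in}}\cdot K_t K_h K_w$. The channel counts and clip size are illustrative assumptions, not values from any particular model. It also prints the output shape, which per axis follows $\lfloor (N + 2p - K)/s \rfloor + 1$.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions chosen for the example)
c_in, c_out = 3, 16
kt, kh, kw = 3, 3, 3

conv = nn.Conv3d(c_in, c_out, kernel_size=(kt, kh, kw),
                 stride=(1, 2, 2), padding=(1, 1, 1), bias=False)

n_params = sum(p.numel() for p in conv.parameters())
expected = c_out * c_in * kt * kh * kw   # C_out * C_in * K_t * K_h * K_w

x = torch.randn(2, c_in, 8, 32, 32)      # [B, C, T, H, W]
y = conv(x)

print(n_params, expected)                # 1296 1296
print(tuple(y.shape))                    # (2, 16, 8, 16, 16)
```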

# Common model families (mental map)

* **C3D**: Plain 3D convs everywhere (e.g., $3{\times}3{\times}3$). Simple, heavy.
* **I3D**: Inflate 2D kernels to 3D (copy/average weights along time); leverage 2D ImageNet pretraining.
* **(2+1)D / R(2+1)D**: Factor $3{\times}3{\times}3$ into $1{\times}3{\times}3$ (spatial) then $3{\times}1{\times}1$ (temporal). Fewer params + extra nonlinearity.
  $$
  \text{3D}(K_t,K_h,K_w) \approx \text{2D}(K_h,K_w) \;\to\; \text{1D}_t(K_t).
  $$
* **P3D**: Parallel or cascaded spatial/temporal branches (various topologies).
* **3D ResNets / R3D**: Residual networks with 3D blocks.
* **SlowFast**: Two pathways: Slow (low frame rate, more channels) and Fast (high frame rate, fewer channels), fused through lateral connections.
* **X3D**: Efficient compound scaling of width, depth, resolution, and time for mobile/edge.
* **TSM/Temporal-Shift (2D-heavy)**: Keep 2D convs; shift channels temporally to mix frames (near 2D cost).
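
The I3D inflation idea from the list above can be sketched as follows. `inflate_conv2d` is a hypothetical helper name, not a library function. It repeats a 2D kernel along time and divides by $K_t$, so that a clip made of identical frames gives the same response as the 2D layer on a single frame. Temporal padding is set to 0 here only to keep that comparison exact.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, kt: int = 3) -> nn.Conv3d:
    """Hypothetical helper: build a Conv3d whose weights are the 2D weights
    repeated kt times along time and scaled by 1/kt."""
    kh, kw = conv2d.kernel_size
    sh, sw = conv2d.stride
    ph, pw = conv2d.padding              # assumes integer/tuple padding, not "same"
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(kt, kh, kw), stride=(1, sh, sw),
                       padding=(0, ph, pw), bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                                  # [C_out, C_in, kh, kw]
        w3d = w2d.unsqueeze(2).repeat(1, 1, kt, 1, 1) / kt   # [C_out, C_in, kt, kh, kw]
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

torch.manual_seed(0)
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv3d = inflate_conv2d(conv2d, kt=3)

frame = torch.randn(1, 3, 16, 16)
clip = frame.unsqueeze(2).repeat(1, 1, 3, 1, 1)   # the same frame repeated 3 times
out2d = conv2d(frame)                             # [1, 8, 16, 16]
out3d = conv3d(clip)[:, :, 0]                     # temporal size is 1, so take index 0
print(torch.allclose(out2d, out3d, atol=1e-4))
```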

# Key design knobs

* **Temporal sampling**: choose clip length $T$ and stride $s_t$. Example: sample 8–32 frames at strides 2–8.
* **Temporal receptive field**: increase with $K_t$, temporal stride, or temporal dilation $\delta_t$.
* **Factorization**: (2+1)D often gives better accuracy/efficiency than full $3{\times}3{\times}3$.
* **Normalization**: 3D BN/SyncBN; LayerNorm is common when mixing with attention.
* **Pooling/heads**:

  * **Video classification**: global average pool over $(T,H,W)$ then MLP.
    $$
    \hat{y}=\mathrm{softmax}\left(W\,\mathrm{GAP}_{t,h,w}(F)+b\right).
    $$
  * **Temporal localization**: keep $T$; pool only spatially.
  * **Spatiotemporal detection/segmentation**: 3D FPN/decoder; keep spatial map; optionally reduce $T$ later.
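
To make the temporal receptive field knob above concrete, here is a small sketch that accumulates the receptive field over a stack of temporal layers. `temporal_receptive_field` is a hypothetical helper, and the layer lists are made-up examples rather than a real architecture.

```python
def temporal_receptive_field(layers):
    """layers: list of (kernel_t, stride_t, dilation_t), ordered input to output.
    Returns how many input frames influence one output time step."""
    rf, jump = 1, 1          # jump = input frames between adjacent output steps
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf

print(temporal_receptive_field([(3, 1, 1)] * 3))                    # 7
print(temporal_receptive_field([(3, 1, 1), (3, 2, 1), (3, 1, 1)]))  # 9
print(temporal_receptive_field([(3, 1, 1), (3, 1, 2)]))             # 7
```

A stride of 2 in the middle layer widens every later layer's reach, while dilation widens only the layer it is applied to.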

# Data pipeline (typical)

1. **Decode**: sample $T$ frames per clip (uniformly or with a fixed stride); at test time, sample multiple clips per video.
2. **Augment**: random crop, horizontal flip, color jitter, RandAugment (frame-wise or consistent across time).
3. **Normalize**: per-channel mean/std.
4. **Label granularity**: video-level vs frame-level vs tube-level depends on task.
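
The decode step can be sketched as an index computation that runs before any frames are read. `sample_clip_indices` is a hypothetical helper; clamping to the last frame for short videos is one possible convention (looping the video is another).

```python
import random

def sample_clip_indices(num_frames, clip_len, stride, train=True, rng=random):
    """Hypothetical helper: indices of `clip_len` frames spaced `stride` apart.
    Training picks a random start; evaluation centers the clip.
    Indices past the end are clamped to the last frame."""
    span = (clip_len - 1) * stride + 1          # frames covered by one clip
    max_start = max(num_frames - span, 0)
    start = rng.randint(0, max_start) if train else max_start // 2
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

print(sample_clip_indices(100, clip_len=8, stride=4, train=False))
# [35, 39, 43, 47, 51, 55, 59, 63]
print(sample_clip_indices(10, clip_len=8, stride=4, train=False))
# [0, 4, 8, 9, 9, 9, 9, 9]
```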

# Training recipes (solid defaults)

* Optimizer: AdamW or SGD+momentum.
* LR: cosine decay with warmup; batch-size scaled LR.
* Regularization: label smoothing, dropout/stochastic depth, mixup/cutmix (applied consistently over time).
* Pretraining: 2D ImageNet (for I3D) or Kinetics-400/600/700 checkpoints (if available).
* Inference: multi-view testing (e.g., 10 temporal clips × 3 spatial crops, averaged) for a small gain if latency permits.
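
A minimal sketch of the warmup plus cosine schedule from the list above, written as a plain function of the step index. The reference LR of 0.1 at batch size 256 and the step counts are assumptions for illustration, not recommended values.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Batch-size-scaled LR (linear scaling; the 0.1 @ 256 reference is an assumption)
batch_size = 64
base_lr = 0.1 * batch_size / 256        # 0.025

for step in (0, 99, 100, 550, 1000):
    print(step, round(lr_at_step(step, 1000, 100, base_lr), 6))
```

The same function can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to `base_lr` instead of the absolute value.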

# Efficient configs (4–6 GB VRAM)

* **R(2+1)D-18**: $T{=}8$ or $16$, $224^2$, batch 4–8 (accumulation if needed).
* **X3D-S/XS**: $T{=}13$–$16$, $160$–$224$.
* **TSM-ResNet-18 (2D)**: $T{=}8$–$16$ with temporal shift—very memory-friendly.
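
The temporal shift that makes TSM cheap can be sketched in a few lines. This version assumes a `[B, C, T, H, W]` layout to match the rest of these notes, and `shift_div=8` (shifting 1/8 of the channels in each direction) is an assumed setting. The shift has no parameters; only the 2D convs that follow it are learned.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x: [B, C, T, H, W]. The first C//shift_div channels read from the next
    frame, the next C//shift_div read from the previous frame, and the rest
    are left unchanged. Boundaries are zero-padded."""
    fold = x.shape[1] // shift_div
    out = torch.zeros_like(x)
    out[:, :fold, :-1] = x[:, :fold, 1:]                    # frame t sees t+1
    out[:, fold:2 * fold, 1:] = x[:, fold:2 * fold, :-1]    # frame t sees t-1
    out[:, 2 * fold:] = x[:, 2 * fold:]                     # untouched channels
    return out

# Toy input: every channel holds the frame index 0,1,2,3 along time
x = torch.arange(4.0).view(1, 1, 4, 1, 1).repeat(1, 8, 1, 1, 1)
y = temporal_shift(x, shift_div=8)
print(y[0, 0, :, 0, 0].tolist())   # [1.0, 2.0, 3.0, 0.0]
print(y[0, 1, :, 0, 0].tolist())   # [0.0, 0.0, 1.0, 2.0]
print(y[0, 2, :, 0, 0].tolist())   # [0.0, 1.0, 2.0, 3.0]
```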

# Minimal PyTorch: basic 3D block and tiny classifier

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Basic3DBlock(nn.Module):
    def __init__(self, cin, cout, kt=3, kh=3, kw=3, stride=(1,2,2)):
        super().__init__()
        # stride is (t,h,w); e.g. stride=(2,2,2) also downsamples in time
        if isinstance(stride, int):
            stride = (stride,) * 3  # allow stride=1 as shorthand for (1,1,1)
        self.conv1 = nn.Conv3d(cin, cout, kernel_size=(kt,kh,kw),
                               stride=stride, padding=(kt//2, kh//2, kw//2), bias=False)
        self.bn1 = nn.BatchNorm3d(cout)
        self.conv2 = nn.Conv3d(cout, cout, kernel_size=(kt,kh,kw),
                               stride=1, padding=(kt//2, kh//2, kw//2), bias=False)
        self.bn2 = nn.BatchNorm3d(cout)
        self.down = None
        # project the shortcut only when the output shape actually changes
        if tuple(stride) != (1,1,1) or cin != cout:
            self.down = nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm3d(cout),
            )
    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)), inplace=True)
        out = self.bn2(self.conv2(out))
        if self.down is not None:
            identity = self.down(identity)
        out = F.relu(out + identity, inplace=True)
        return out

class Tiny3DClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Conv3d(3, 32, kernel_size=(3,7,7), stride=(1,2,2), padding=(1,3,3), bias=False)
        self.bn = nn.BatchNorm3d(32)
        self.layer1 = Basic3DBlock(32, 64, stride=(1,2,2))
        self.layer2 = Basic3DBlock(64, 128, stride=(2,2,2))  # temporal downsample
        self.layer3 = Basic3DBlock(128, 256, stride=(1,2,2))
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):  # x: [B,3,T,H,W]
        x = F.relu(self.bn(self.stem(x)), inplace=True)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)     # [B,C,T',H',W']
        x = x.mean(dim=(2,3,4))  # GAP over (T,H,W)
        return self.head(x)
```

# Minimal PyTorch: (2+1)D factorized block

```python
import torch.nn as nn
import torch.nn.functional as F

class R2Plus1DBlock(nn.Module):
    def __init__(self, cin, cout, kt=3, kh=3, kw=3, stride=(1,1,1)):
        super().__init__()
        # Spatial 2D first
        self.spatial = nn.Conv3d(cin, cout, kernel_size=(1,kh,kw),
                                 stride=(1, stride[1], stride[2]),
                                 padding=(0, kh//2, kw//2), bias=False)
        self.bn1 = nn.BatchNorm3d(cout)
        # Temporal 1D next
        self.temporal = nn.Conv3d(cout, cout, kernel_size=(kt,1,1),
                                  stride=(stride[0],1,1),
                                  padding=(kt//2,0,0), bias=False)
        self.bn2 = nn.BatchNorm3d(cout)
        self.down = None
        if stride != (1,1,1) or cin != cout:
            self.down = nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm3d(cout),
            )
    def forward(self, x):
        identity = x
        x = F.relu(self.bn1(self.spatial(x)), inplace=True)
        x = self.bn2(self.temporal(x))
        if self.down is not None:
            identity = self.down(identity)
        return F.relu(x + identity, inplace=True)
```
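
To see how the factorization affects parameter count, the sketch below compares a full $3{\times}3{\times}3$ conv with a spatial-then-temporal pair. BatchNorm and ReLU are left out because only conv weights are being counted. The `mid` width is an assumption: the block above uses `mid = cout`, and a wider intermediate (144 in this example) brings the count back to the full 3D figure, so the saving depends on that choice.

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

def factorized(cin: int, cout: int, mid: int) -> nn.Sequential:
    """Spatial 1x3x3 conv to `mid` channels, then temporal 3x1x1 conv to `cout`."""
    return nn.Sequential(
        nn.Conv3d(cin, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
        nn.Conv3d(mid, cout, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
    )

cin = cout = 64
full3d = nn.Conv3d(cin, cout, kernel_size=3, padding=1, bias=False)

print(n_params(full3d))                       # 110592
print(n_params(factorized(cin, cout, 64)))    # 49152
print(n_params(factorized(cin, cout, 144)))   # 110592
```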

# When to prefer which

* Need **speed/low memory**: TSM or X3D-XS.
* Need **strong accuracy** on actions with motion cues: R(2+1)D / SlowFast / X3D-M.
* Need **long-range** context (≫32 frames): combine 3D CNN backbone with temporal pooling, dilations, or a lightweight temporal transformer head.

# Practical tips

* **Clip length vs stride**: for fast actions use shorter $T$ and smaller stride; for long actions sample sparse but longer clips.
* **Temporal augmentation**: random frame-rate jitter (vary stride) improves robustness.
* **Class imbalance**: use weighted loss or focal loss for detection/localization.
* **Evaluation protocol**: report top-1/top-5 for classification; mAP for detection; mIoU for segmentation.
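
For the classification metrics mentioned above, a minimal top-k accuracy sketch could look like this. `topk_accuracy` is a hypothetical helper, and the logits are toy values.

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, ks=(1, 5)):
    """logits: [B, num_classes], targets: [B]. Returns {k: accuracy in [0, 1]}."""
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)           # [B, maxk], best class first
    correct = pred.eq(targets.unsqueeze(1))      # [B, maxk] boolean hits
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}

logits = torch.tensor([[5., 4., 3., 2., 1., 0.],
                       [5., 4., 3., 2., 1., 0.]])
targets = torch.tensor([0, 3])
acc = topk_accuracy(logits, targets)
print(acc)   # {1: 0.5, 5: 1.0}
```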

# Quick sanity check (toy)

* Input: $B{=}2,\;C{=}3,\;T{=}8,\;H{=}112,\;W{=}112$.
* Start with $T{=}8$ clips during training; at test time average predictions across 3 temporal crops.
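
Averaging predictions over several temporal clips at test time can be sketched as below. `predict_multi_clip` and the stand-in model are hypothetical and only illustrate the mechanics; any classifier that maps `[B, 3, T, H, W]` to logits could be substituted.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_multi_clip(model, video, clip_len=8, num_clips=3):
    """video: [C, T_total, H, W] with T_total >= clip_len.
    Returns class probabilities averaged over evenly spaced temporal clips."""
    model.eval()
    t_total = video.shape[1]
    starts = torch.linspace(0, t_total - clip_len, num_clips).long().tolist()
    clips = torch.stack([video[:, s:s + clip_len] for s in starts])  # [num_clips, C, clip_len, H, W]
    probs = model(clips).softmax(dim=1)                              # [num_clips, num_classes]
    return probs.mean(dim=0)

class StandInClassifier(nn.Module):
    """Placeholder model for the demo: global average pool + linear layer."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(3, num_classes)

    def forward(self, x):                      # x: [B, 3, T, H, W]
        return self.fc(x.mean(dim=(2, 3, 4)))

video = torch.randn(3, 32, 112, 112)           # a 32-frame toy video
probs = predict_multi_clip(StandInClassifier(), video)
print(probs.shape)                             # torch.Size([10])
```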



## **Output of a 3D CNN**
The **output of a 3D CNN** depends on *what task* you’re solving and *how you handle the time dimension*.
Let’s break it down clearly using the notation:

$$
x \in \mathbb{R}^{B \times C_{in} \times T \times H \times W}
$$
where:

* $B$: batch size
* $C_{in}$: input channels (e.g. RGB → 3)
* $T$: number of frames
* $H, W$: spatial resolution

---

## 1. **Video Classification (most common)**

Goal: predict **one label per video clip**, e.g. “playing guitar.”

### Output shape

$$
y \in \mathbb{R}^{B \times \text{num\_classes}}
$$

### What happens internally

1. After the 3D convolutions, you get feature maps
   $F \in \mathbb{R}^{B \times C_{feat} \times T' \times H' \times W'}$
2. You **average pool over (T', H', W')** →
   $$
   f = \mathrm{GAP}_{T',H',W'}(F) \in \mathbb{R}^{B \times C_{feat}}
   $$
3. Then apply a **fully connected (linear) layer**:
   $$
   \hat{y} = \mathrm{softmax}(W f + b)
   $$
   which gives per-class probabilities.

### Example (from the code above)

```python
x = torch.randn(2, 3, 8, 112, 112)   # batch=2, RGB, 8 frames
model = Tiny3DClassifier(num_classes=10)
out = model(x)
print(out.shape)  # torch.Size([2, 10])
```

✅ So here the output is **one 10-class prediction per video clip**.

---

## 2. **Frame-level Prediction (per-frame label)**

For tasks like **temporal action segmentation** or **frame-wise emotion detection**,
you keep the temporal dimension and average only spatially.

### Output shape

$$
y \in \mathbb{R}^{B \times \text{num\_classes} \times T'}
$$

### Implementation

```python
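# Assumes `model` is the Tiny3DClassifier above and `video` is [B, 3, T, H, W]
feats = model.layer2(model.layer1(F.relu(model.bn(model.stem(video)))))
feats = model.layer3(feats)               # [B, C, T', H', W']
feats = feats.mean(dim=(3, 4))            # average over H,W → [B, C, T']
head = nn.Conv1d(256, 10, kernel_size=1)  # per-time-step linear projection (C=256, 10 classes)
out = head(feats)                         # [B, num_classes, T']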
```

✅ One label per frame (or short temporal window).

---

## 3. **Spatiotemporal Detection or Segmentation**

Here we want a **map** in time and space — e.g., localizing an action *and where it happens in the frame*.

### Output shape

$$
y \in \mathbb{R}^{B \times \text{num\_classes} \times T' \times H' \times W'}
$$
Each voxel has a class probability.

This is used in **action localization**, **video object segmentation**, or **3D medical imaging** (like CT/MRI).
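
A minimal sketch of such a dense head: a $1{\times}1{\times}1$ conv maps backbone features to per-voxel class logits, optionally upsampled back to the input resolution. `VoxelHead` is a hypothetical name, and the feature shape is an assumed example; real segmentation decoders are usually more elaborate (skip connections, FPN).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelHead(nn.Module):
    """Hypothetical dense head: per-voxel logits from backbone features."""
    def __init__(self, c_feat: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv3d(c_feat, num_classes, kernel_size=1)

    def forward(self, feats, out_size=None):
        logits = self.classifier(feats)            # [B, num_classes, T', H', W']
        if out_size is not None:                   # optionally upsample to (T, H, W)
            logits = F.interpolate(logits, size=out_size,
                                   mode="trilinear", align_corners=False)
        return logits

feats = torch.randn(2, 256, 4, 7, 7)               # assumed backbone output
head = VoxelHead(c_feat=256, num_classes=5)
coarse = head(feats)
full = head(feats, out_size=(8, 112, 112))
print(coarse.shape)   # torch.Size([2, 5, 4, 7, 7])
print(full.shape)     # torch.Size([2, 5, 8, 112, 112])
```

Applying `softmax(dim=1)` to the logits gives the per-voxel class probabilities.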

---

## 4. **Feature Extractor / Backbone Output**

If you remove the classifier head and just want **spatiotemporal features**,
the output is the **final 3D tensor** before pooling:
$$
F \in \mathbb{R}^{B \times C_{feat} \times T' \times H' \times W'}
$$
You can feed this into:

* a temporal transformer,
* an RNN,
* or another detection/segmentation head.

Example:

```python
stem = F.relu(model.bn(model.stem(video)))                  # video: [B, 3, 8, 112, 112]
features = model.layer3(model.layer2(model.layer1(stem)))  # torch.Size([B, 256, 4, 7, 7])
```
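
As one way to feed these features into a temporal transformer, the sketch below pools each time step spatially into a token and runs a small encoder over the $T'$ tokens. `TemporalTransformerHead` is a hypothetical module, its sizes are assumptions, and it omits positional encodings for brevity, which a real model would likely add.

```python
import torch
import torch.nn as nn

class TemporalTransformerHead(nn.Module):
    """Hypothetical head: spatial pooling, transformer over time, then classify."""
    def __init__(self, c_feat=256, num_classes=10, num_heads=4, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=c_feat, nhead=num_heads,
                                           dim_feedforward=2 * c_feat,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.fc = nn.Linear(c_feat, num_classes)

    def forward(self, feats):                  # feats: [B, C, T', H', W']
        tokens = feats.mean(dim=(3, 4))        # [B, C, T']
        tokens = tokens.transpose(1, 2)        # [B, T', C], one token per time step
        tokens = self.encoder(tokens)          # self-attention across time
        return self.fc(tokens.mean(dim=1))     # [B, num_classes]

feats = torch.randn(2, 256, 4, 7, 7)           # same shape as the backbone output above
logits = TemporalTransformerHead()(feats)
print(logits.shape)                            # torch.Size([2, 10])
```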

---

### Quick summary table

| Task                               | Output shape                 | Meaning                       |
| ---------------------------------- | ---------------------------- | ----------------------------- |
| **Video classification**           | [B, num_classes]             | One label per clip            |
| **Frame-level classification**     | [B, num_classes, T']         | Label per frame               |
| **Video segmentation / detection** | [B, num_classes, T', H', W'] | Label per pixel per frame     |
| **Feature extraction**             | [B, C_feat, T', H', W']      | Spatiotemporal feature tensor |

---


