In [None]:
import os
from pathlib import Path

import torch

# Local project imports
from dataloaders import get_dataloaders
from model import build_model
from transforms import get_video_transform
from predict import predict  # if predict.py is in the same directory

In [None]:
root = Path("../../../Data/Visual_split/")

print("Root exists:", root.exists())
print("Subdirectories:", [p.name for p in root.iterdir() if p.is_dir()])

train_dir = root / "train"
test_dir = root / "test"
print("\nTrain classes:", [p.name for p in train_dir.iterdir() if p.is_dir()])
print("\nTest classes:", [p.name for p in test_dir.iterdir() if p.is_dir()])


In [None]:
## `VideoDataset`: How ASL clips are loaded

The `VideoDataset` class (defined in `dataset.py` or together with `dataloaders.py`) is responsible for:

1. Discovering all classes:
   - It scans `root_dir` and collects all subfolders as class names.
   - It sorts them and builds `class_to_idx` (e.g., `{"HELLO": 0, "PLEASE": 1, ...}`).

2. Collecting all video samples:
   - For each class folder, it finds all `.mp4` files.
   - It stores a list of `(video_path, label_index)` pairs in `self.samples`.

3. Loading and sampling frames from a video:
   - Uses `torchvision.io.read_video` to read the video into a tensor of shape `(T, H, W, C)`.
   - Uniformly samples a fixed number of frames (`num_frames`, default 16).
   - If the video has fewer than `num_frames` frames, it repeats frames to reach `num_frames`.
   - Converts frames to float in `[0, 1]` and permutes to `(T, C, H, W)`.
   - Applies image transforms **frame-by-frame** (resize, normalize, etc.).

This turns each raw `.mp4` ASL clip into a tensor of shape `(T, C, H, W)`. Once clips are batched and permuted to `(B, C, T, H, W)`, they can be fed into a 3D CNN.
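
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `VideoDataset`: the helper names `discover_samples`, `sample_frame_indices`, and `to_model_frames` are hypothetical, and a synthetic `uint8` tensor stands in for the output of `torchvision.io.read_video` so the snippet runs without a video file.

```python
from pathlib import Path

import torch


def discover_samples(root_dir):
    """Return (class_to_idx, samples) for a folder-per-class layout."""
    root = Path(root_dir)
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).glob("*.mp4"))
    ]
    return class_to_idx, samples


def sample_frame_indices(total_frames, num_frames=16):
    """Pick num_frames evenly spaced indices; short clips repeat frames."""
    if total_frames < 1:
        raise ValueError("video has no frames")
    return torch.linspace(0, total_frames - 1, steps=num_frames).round().long()


def to_model_frames(video, num_frames=16):
    """(T, H, W, C) uint8 -> (num_frames, C, H, W) float in [0, 1]."""
    idx = sample_frame_indices(video.shape[0], num_frames)
    return video[idx].permute(0, 3, 1, 2).float() / 255.0


# Synthetic stand-in for a decoded clip: 40 frames of 64x48 RGB.
fake_video = torch.randint(0, 256, (40, 48, 64, 3), dtype=torch.uint8)
clip = to_model_frames(fake_video)
print(clip.shape)  # torch.Size([16, 3, 48, 64])
```

Because `linspace` is asked for more steps than there are frames when a clip is short, some indices repeat, which is one simple way to pad a short clip up to `num_frames`.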


In [None]:
## Code: `model.py`

```python
# model.py
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights


def build_model(num_classes, pretrained=True):
    """Return an R3D-18 whose final layer outputs `num_classes` logits."""
    # Load pretrained weights when requested; otherwise start from random init.
    weights = R3D_18_Weights.DEFAULT if pretrained else None
    model = r3d_18(weights=weights)

    # Replace the final FC layer so the output size matches the ASL classes.
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model
```


In [None]:
## `build_model`: 3D ResNet for ASL

- Uses `r3d_18`, a **3D ResNet-18** model from `torchvision.models.video`.
- If `pretrained=True`, it loads `R3D_18_Weights.DEFAULT`, weights pretrained on Kinetics-400, a large action recognition dataset.

Why this helps ASL:

- The pretrained model has already learned to detect **motion patterns** and **spatio-temporal features** in videos.
- By replacing the final fully-connected layer (`model.fc`) with a new linear layer with `num_classes` outputs, we adapt it to predict **ASL sign classes** instead of general actions.

So the model output is:

- A vector of length `num_classes` with **logits** (one per ASL class).
- During training, cross-entropy loss compares those logits with the correct ASL label and penalizes the model when the correct class receives low probability.
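
To make the relationship between logits, probabilities, and the loss concrete, here is a small sketch with made-up numbers. The logits and labels are hypothetical values for a batch of two clips and three classes; they are not outputs of the actual model.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 2 clips and 3 ASL classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])  # index of the correct class per clip

probs = F.softmax(logits, dim=1)        # each row sums to 1
loss = F.cross_entropy(logits, labels)  # mean of -log(prob of correct class)

# cross_entropy applies log-softmax internally, so it takes raw logits.
manual = -torch.log(probs[torch.arange(2), labels]).mean()
print(torch.allclose(loss, manual))  # True
print(probs.argmax(dim=1))           # tensor([0, 2])
```

This is why `build_model` returns raw logits without a softmax layer: the loss function handles the normalization during training.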


In [None]:
## `predict.py`: Inference on a New ASL Video

This script runs **inference** on a single `.mp4` ASL video.

### Loading the checkpoint

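```python
# `checkpoint` is the path to the saved file; `device` is a torch.device or string.
ckpt = torch.load(checkpoint, map_location=device)
class_names = ckpt["class_names"]
# pretrained=False: the checkpoint supplies every weight, so skip the download.
model = build_model(num_classes=len(class_names), pretrained=False)
model.load_state_dict(ckpt["model_state_dict"])
model.to(device).eval()  # inference mode: BatchNorm uses its running statistics
```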
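
The checkpoint layout implied by this snippet (a dict with `class_names` and `model_state_dict`) can be exercised end to end with a small round trip. This is only a sketch under stated assumptions: a tiny `nn.Linear` stands in for the 3D ResNet so it runs quickly, the label `"THANKS"` is an invented example, and the top-1 decoding at the end is one plausible way to turn logits into a predicted sign, not necessarily what `predict.py` does.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

class_names = ["HELLO", "PLEASE", "THANKS"]  # example labels (assumed)
model = nn.Linear(8, len(class_names))       # tiny stand-in for the 3D ResNet

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "video_classifier.pt"
    # Save the class names next to the weights so inference can rebuild the head.
    torch.save({"class_names": class_names,
                "model_state_dict": model.state_dict()}, path)

    ckpt = torch.load(path, map_location="cpu")
    restored = nn.Linear(8, len(ckpt["class_names"]))
    restored.load_state_dict(ckpt["model_state_dict"])

restored.eval()
with torch.no_grad():
    logits = restored(torch.randn(1, 8))   # fake features for one clip
    probs = torch.softmax(logits, dim=1)
    conf, idx = probs.max(dim=1)           # top-1 probability and class index

print(ckpt["class_names"][idx.item()], round(conf.item(), 3))
```

Storing `class_names` inside the checkpoint means the prediction script does not need access to the training folders to recover the label order.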


In [None]:
## Training & Transforms: How Everything Connects

### `get_video_transform` (frame-level transforms)

- Resizes each frame to `112 x 112`.
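- Normalizes each channel with the mean and standard deviation used by the Kinetics-pretrained video models, so inputs match what the pretrained weights expect.
- `RandomHorizontalFlip` is left commented out: mirroring a clip swaps the apparent signing hand, and for ASL that can change a sign’s meaning.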

This transform is used in **both**:

- `VideoDataset` (training and testing)
- `load_video_tensor` in `predict.py` (inference)

So ASL videos are always preprocessed in a consistent way.
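
For illustration, the resize-and-normalize step could look like the sketch below. It is written in plain PyTorch rather than with the project's `get_video_transform`, so the function name `preprocess_frames` is hypothetical, the mean and standard deviation are assumed to be the Kinetics-400 values that torchvision's video presets use, and `F.interpolate` only approximates whatever resize the real transform applies.

```python
import torch
import torch.nn.functional as F

# Assumed values: the Kinetics-400 statistics used by torchvision's video presets.
KINETICS_MEAN = torch.tensor([0.43216, 0.394666, 0.37645]).view(1, 3, 1, 1)
KINETICS_STD = torch.tensor([0.22803, 0.22145, 0.216989]).view(1, 3, 1, 1)


def preprocess_frames(frames, size=(112, 112)):
    """frames: (T, C, H, W) float in [0, 1] -> resized, normalized frames."""
    # Treat the T frames as a batch so every frame gets the same resize.
    frames = F.interpolate(frames, size=size, mode="bilinear", align_corners=False)
    return (frames - KINETICS_MEAN) / KINETICS_STD


frames = torch.rand(16, 3, 240, 320)
out = preprocess_frames(frames)
print(out.shape)  # torch.Size([16, 3, 112, 112])
```

Keeping this logic in a single function that both the dataset and the prediction script call is what guarantees that training and inference see identically preprocessed frames.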

### `train_one_epoch` and `eval_model`

Both functions:

- Permute videos from `(B, T, C, H, W)` to `(B, C, T, H, W)`:

  ```python
  videos = videos.permute(0, 2, 1, 3, 4).to(device)
  ```
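
To show where that permute sits in a full pass over the data, here is a minimal sketch of a combined train/eval loop. It is not the project's `train_one_epoch` or `eval_model`: the function name `run_epoch` is hypothetical, a tiny `Conv3d` network stands in for R3D-18 so the example runs quickly on CPU, and the single fake batch replaces a real dataloader.

```python
import torch
import torch.nn as nn


def run_epoch(model, loader, criterion, device, optimizer=None):
    """Train for one epoch if an optimizer is given, otherwise evaluate."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for videos, labels in loader:
            # (B, T, C, H, W) -> (B, C, T, H, W), the layout Conv3d expects.
            videos = videos.permute(0, 2, 1, 3, 4).to(device)
            labels = labels.to(device)
            logits = model(videos)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen


# Tiny stand-in for R3D-18 so the sketch runs quickly on CPU.
model = nn.Sequential(
    nn.Conv3d(3, 4, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(4, 2),
)
# One fake batch: 2 clips, 4 frames each, 3 channels, 16x16 pixels.
loader = [(torch.randn(2, 4, 3, 16, 16), torch.tensor([0, 1]))]
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train_loss, train_acc = run_epoch(model, loader, criterion, "cpu", optimizer)
val_loss, val_acc = run_epoch(model, loader, criterion, "cpu")
print(f"train loss {train_loss:.3f}, val loss {val_loss:.3f}")
```

Sharing one loop between training and evaluation keeps the permute and the metric bookkeeping in a single place; the only differences are the `train`/`eval` mode and whether gradients are computed.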


In [None]:
## Summary & Next Steps

In this notebook, we:

- Described the **ASL dataset structure**, where each class corresponds to a specific sign.
- Explained how `VideoDataset`:
  - Discovers ASL classes from folders
  - Loads `.mp4` clips
  - Samples a fixed number of frames
  - Applies frame-wise transforms for the 3D CNN
- Showed how `get_dataloaders` builds training and test dataloaders.
- Documented how the **3D ResNet (R3D-18)** model is adapted for ASL classification by replacing the final layer.
- Reviewed the training loop and how the model is saved to `video_classifier.pt`.
- Documented the prediction pipeline that classifies a new ASL video clip.

### Possible extensions

- Add more ASL signs and collect more training data.
- Experiment with:
  - Different numbers of frames (e.g., 8, 32)
  - Data augmentation (carefully, due to ASL semantics)
  - Different architectures (e.g., `mc3_18`, `r2plus1d_18`)
- Generate confusion matrices and per-class metrics to understand which ASL signs are harder for the model.
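
As a starting point for the last item, a confusion matrix and per-class recall can be computed directly from predicted and true label indices. The sketch below uses plain PyTorch; the function names and the example labels are hypothetical and not part of the project.

```python
import torch


def confusion_matrix(preds, labels, num_classes):
    """cm[i, j] counts clips whose true class is i and predicted class is j."""
    flat = labels * num_classes + preds
    counts = torch.bincount(flat, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    row_totals = cm.sum(dim=1).clamp(min=1)  # avoid dividing by zero
    return cm.diag().float() / row_totals


labels = torch.tensor([0, 0, 1, 2, 2])
preds = torch.tensor([0, 1, 1, 2, 0])
cm = confusion_matrix(preds, labels, num_classes=3)
print(cm.tolist())                    # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
print(per_class_recall(cm).tolist())  # [0.5, 1.0, 0.5]
```

Off-diagonal entries show which pairs of signs get confused with each other, which is often more informative than overall accuracy.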

