Goals: choose “top-K” most frequent classes for a strong baseline subset; build a balanced manifest; save class weights for training; export buckets for weighted sampler / focal loss / class-balanced loss.

### Cell A — Setup & load final ROI manifest

In [1]:
from pathlib import Path
import pandas as pd, numpy as np

root = Path("..").resolve()
data_dir = root / "data" / "wlasl_preprocessed"
man_roi = data_dir / "manifest_nslt2000_roi_final.csv"   # produced in previous step

df = pd.read_csv(man_roi)
assert {"video_id","path","gloss","label","split"}.issubset(df.columns)
print(f"Loaded {len(df)} samples | classes={df['gloss'].nunique()}")
df.head(2)


Loaded 11980 samples | classes=2000


Unnamed: 0,video_id,path,gloss,label,split
0,335,/home/falasoul/notebooks/USD/AAI-590/Capstone/...,abdomen,2,test
1,336,/home/falasoul/notebooks/USD/AAI-590/Capstone/...,abdomen,2,train


### Cell B — Sanity checks (existence, per-split counts, class sizes)

In [2]:
import os
from IPython.display import display  # explicit import, so the cell also runs outside Jupyter

# file existence
df["exists"] = df["path"].apply(lambda p: os.path.exists(p))
missing = df[~df["exists"]]
print(f"Missing files: {len(missing)}")
if len(missing): display(missing.head())

# split summary
split_counts = df["split"].value_counts().to_dict()
print("Split counts:", split_counts)

# class sizes overall and per split
by_gloss = df.groupby("gloss").size().sort_values(ascending=False)
print("Classes:", len(by_gloss))
display(by_gloss.head(10))


Missing files: 0
Split counts: {'train': 8313, 'val': 2253, 'test': 1414}
Classes: 2000


gloss
thin        16
cool        16
before      16
go          15
drink       15
help        14
who         14
computer    14
cousin      14
accident    13
dtype: int64

### Cell C — Choose top-K classes & minimum-per-class requirement

In [9]:
# parameters (tune as needed)
TOP_K     = 300          # e.g., 100, 200, 500
MIN_PER_C = 10           # only keep classes with >= this many total clips (across all splits)

eligible = by_gloss[by_gloss >= MIN_PER_C].index.tolist()
topk_gloss = by_gloss.loc[eligible].head(TOP_K).index.tolist()

df_top = df[df["gloss"].isin(topk_gloss)].copy()
print(f"Kept {len(topk_gloss)} classes, {len(df_top)} samples")


Kept 104 classes, 1160 samples
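
Note that TOP_K = 300 was requested but only 104 classes were kept: the MIN_PER_C threshold is applied first, and only 104 glosses have at least 10 clips, so TOP_K is not the binding limit here. The selection logic can be sketched without pandas as shown below. `select_top_k` and the toy data are hypothetical, used only for illustration. Ties are broken alphabetically here, which may differ from the order pandas produces in the cell above.

```python
from collections import Counter

def select_top_k(glosses, top_k, min_per_class):
    """Return up to top_k most frequent glosses that have >= min_per_class clips."""
    counts = Counter(glosses)
    # Keep only classes that meet the minimum, most frequent first
    # (ties broken alphabetically so the result is deterministic).
    eligible = sorted(
        (g for g, c in counts.items() if c >= min_per_class),
        key=lambda g: (-counts[g], g),
    )
    return eligible[:top_k]

# Toy data: "thin" has 3 clips, "go" has 2, "rare" has 1.
toy = ["thin"] * 3 + ["go"] * 2 + ["rare"]
selected = select_top_k(toy, top_k=300, min_per_class=2)
print(selected)  # ['thin', 'go']
```

Even with top_k=300, only two classes come back, because the minimum-count filter removes "rare" before the top-K cut is applied.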


### New “Cell C+” — Filter out classes without enough clips per split

In [15]:
# Optional sanity filter to drop underrepresented classes before balancing

min_per_split = {"train": 8, "val": 2, "test": 1}

ok_gloss = []
for gloss, gdf in df_top.groupby("gloss"):
    ok = True
    for split, min_count in min_per_split.items():
        if (gdf["split"] == split).sum() < min_count:
            ok = False
            break
    if ok:
        ok_gloss.append(gloss)

df_top = df_top[df_top["gloss"].isin(ok_gloss)].copy()
print(f"After per-split minima, kept {len(ok_gloss)} classes and {len(df_top)} samples")


After per-split minima, kept 24 classes and 315 samples
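
The per-split filter above can also be written without an explicit loop. The following is a minimal sketch of an equivalent check using `pd.crosstab`. The function name `glosses_meeting_minima` and the toy frame are hypothetical, and it assumes the manifest has the `gloss` and `split` columns asserted in Cell A.

```python
import pandas as pd

def glosses_meeting_minima(df, min_per_split):
    """Return glosses whose clip count meets the minimum in every split."""
    # Rows = gloss, columns = split, values = clip counts.
    counts = pd.crosstab(df["gloss"], df["split"])
    # Make sure every required split has a column, even if it has no clips.
    counts = counts.reindex(columns=list(min_per_split), fill_value=0)
    # Compare each column against its own minimum, then require all splits to pass.
    ok = counts.ge(pd.Series(min_per_split), axis="columns").all(axis=1)
    return sorted(ok[ok].index)

# Toy data: "a" has 8 train / 2 val / 1 test, "b" has 8 train / 2 val / 0 test.
toy = pd.DataFrame({
    "gloss": ["a"] * 11 + ["b"] * 10,
    "split": ["train"] * 8 + ["val"] * 2 + ["test"]
             + ["train"] * 8 + ["val"] * 2,
})
kept = glosses_meeting_minima(toy, {"train": 8, "val": 2, "test": 1})
print(kept)  # ['a']
```

The count table is also handy to display on its own when you want to see which split is causing a class to be dropped.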


You now have:

24 sign classes

315 total video clips

Each class has at least 8 train, 2 val, and 1 test clip, so every class is represented in all three splits (essential for meaningful training and validation).

The subset is compact but not yet balanced: per-class counts still vary above these minima, which Cell D evens out by capping. Once capped, it is well suited for:

Validating your preprocessing and training loop.

Confirming that the model learns meaningful patterns rather than memorizing a few examples.

Benchmarking FPS, batch sizes, and GPU utilization before scaling up.

### Cell D — Balance within each split (cap per-class, drop excess)

In [16]:
# target per-class caps per split (adjust per your dataset)
# heuristics: keep natural class proportion across splits but cap extremes
per_split_caps = {
    "train": 8,   # cap max clips/class in train
    "val":   2,   # cap max in val
    "test":  1,   # cap max in test
}

balanced_parts = []
for split, cap in per_split_caps.items():
    part = df_top[df_top["split"] == split].copy()
    part = (part.groupby("gloss", group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), cap), random_state=42)))
    balanced_parts.append(part)

df_bal = pd.concat(balanced_parts, ignore_index=True)
print(f"Balanced total: {len(df_bal)} | classes={df_bal['gloss'].nunique()}")


Balanced total: 264 | classes=24
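
Recent pandas versions emit a deprecation warning for `groupby(...).apply(...)` when the applied function also receives the grouping column. A possible warning-free alternative is to shuffle once and keep the first `cap` rows of each (split, gloss) group. This is a sketch, not the notebook's method: `cap_per_class` is a hypothetical helper, and it may pick different clips than the cell above, although the per-class counts are the same.

```python
import pandas as pd

def cap_per_class(df, caps, seed=42):
    """Keep at most caps[split] randomly chosen rows per (split, gloss) group."""
    # Shuffle once so that "first N rows per group" is a random sample.
    shuffled = df.sample(frac=1, random_state=seed)
    # Position of each row within its (split, gloss) group: 0, 1, 2, ...
    rank = shuffled.groupby(["split", "gloss"]).cumcount()
    # Per-row cap; splits missing from caps map to NaN and are dropped.
    limit = shuffled["split"].map(caps)
    return shuffled[rank < limit].sort_index().reset_index(drop=True)

# Toy data: "a" has 10 train / 3 val / 2 test, "b" has 5 train / 1 val / 1 test.
toy = pd.DataFrame({
    "gloss": ["a"] * 15 + ["b"] * 7,
    "split": ["train"] * 10 + ["val"] * 3 + ["test"] * 2
             + ["train"] * 5 + ["val"] + ["test"],
})
balanced = cap_per_class(toy, {"train": 8, "val": 2, "test": 1})
sizes = balanced.groupby(["split", "gloss"]).size().to_dict()
print(sizes)
```

Class "a" is cut down to the caps, while class "b" keeps all of its clips because it never exceeds them.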


### Cell E — Verify balance and report class distribution

In [17]:
def report(df_):
    print("Total:", len(df_))
    print("By split:", df_["split"].value_counts().to_dict())
    ctrain = df_[df_["split"]=="train"].groupby("gloss").size().describe()
    cval   = df_[df_["split"]=="val"].groupby("gloss").size().describe()
    ctest  = df_[df_["split"]=="test"].groupby("gloss").size().describe()
    print("\nTrain per-class:\n", ctrain)
    print("\nVal per-class:\n",   cval)
    print("\nTest per-class:\n",  ctest)

report(df_bal)


Total: 264
By split: {'train': 192, 'val': 48, 'test': 24}

Train per-class:
 count    24.0
mean      8.0
std       0.0
min       8.0
25%       8.0
50%       8.0
75%       8.0
max       8.0
dtype: float64

Val per-class:
 count    24.0
mean      2.0
std       0.0
min       2.0
25%       2.0
50%       2.0
75%       2.0
max       2.0
dtype: float64

Test per-class:
 count    24.0
mean      1.0
std       0.0
min       1.0
25%       1.0
50%       1.0
75%       1.0
max       1.0
dtype: float64
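
The `describe()` tables above have to be read by eye (std = 0 and min = max in every split). If you prefer a check that fails loudly, something along these lines could be used. `balance_violations` is a hypothetical helper that lists every (gloss, split) pair whose count differs from the expected cap.

```python
import pandas as pd

def balance_violations(df, caps):
    """Return (gloss, split, count) for every class whose count differs from the cap."""
    counts = (
        df.groupby(["gloss", "split"]).size()
          .unstack("split", fill_value=0)
          .reindex(columns=list(caps), fill_value=0)
    )
    violations = []
    for split, cap in caps.items():
        for gloss, n in counts[split].items():
            if n != cap:
                violations.append((gloss, split, int(n)))
    return violations

# Toy data: "a" is exactly at the caps, "b" has no test clip.
toy = pd.DataFrame({
    "gloss": ["a"] * 11 + ["b"] * 10,
    "split": ["train"] * 8 + ["val"] * 2 + ["test"]
             + ["train"] * 8 + ["val"] * 2,
})
problems = balance_violations(toy, {"train": 8, "val": 2, "test": 1})
print(problems)  # [('b', 'test', 0)]
```

In the notebook this could be called as `assert not balance_violations(df_bal, per_split_caps)` right after `report(df_bal)`.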


### Cell F — Create class index mapping (label remap) and persist

In [18]:
# Build contiguous class index (0..C-1) in alphabetical order (or by frequency)
classes = sorted(df_bal["gloss"].unique())
gloss_to_new = {g:i for i,g in enumerate(classes)}
df_bal["label_new"] = df_bal["gloss"].map(gloss_to_new)

# Save artifacts
out_dir = data_dir
man_balanced = out_dir / f"manifest_nslt2000_roi_top{len(classes)}_balanced.csv"
class_map_json = out_dir / f"class_index_top{len(classes)}.json"

df_bal.to_csv(man_balanced, index=False)
print("Saved manifest:", man_balanced)

import json
with open(class_map_json, "w") as f:
    json.dump({"classes": classes, "gloss_to_index": gloss_to_new}, f, indent=2)
print("Saved class map:", class_map_json)


Saved manifest: /home/falasoul/notebooks/USD/AAI-590/Capstone/AAI-590-G3-ASL/data/wlasl_preprocessed/manifest_nslt2000_roi_top24_balanced.csv
Saved class map: /home/falasoul/notebooks/USD/AAI-590/Capstone/AAI-590-G3-ASL/data/wlasl_preprocessed/class_index_top24.json
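
The training notebook will need to read this class map back. A minimal sketch of a loader that also checks that the two stored views agree is shown below. `load_class_map` is a hypothetical helper and the toy file name is made up for the example. The JSON layout (`classes` and `gloss_to_index`) matches what Cell F writes.

```python
import json
import tempfile
from pathlib import Path

def load_class_map(path):
    """Load the class map and verify that classes[i] has index i."""
    with open(path) as f:
        payload = json.load(f)
    classes = payload["classes"]
    gloss_to_index = payload["gloss_to_index"]
    # The list and the dict must describe the same contiguous 0..C-1 mapping.
    if any(gloss_to_index[g] != i for i, g in enumerate(classes)):
        raise ValueError("class map is inconsistent")
    return classes, gloss_to_index

# Round trip with a toy three-class map.
toy_classes = ["before", "cool", "thin"]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "class_index_top3.json"
    payload = {
        "classes": toy_classes,
        "gloss_to_index": {g: i for i, g in enumerate(toy_classes)},
    }
    path.write_text(json.dumps(payload))
    loaded_classes, loaded_map = load_class_map(path)

print(loaded_classes[loaded_map["cool"]])  # cool
```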


### Cell G — (Optional) Class weights for imbalanced loss

In [19]:
# If you want weights for cross-entropy: inverse frequency on train subset
train_counts = df_bal[df_bal["split"]=="train"]["gloss"].value_counts()
weights = {gloss_to_new[g]: float(1.0 / c) for g,c in train_counts.items()}
import json
weights_json = out_dir / f"class_weights_top{len(classes)}.json"
with open(weights_json, "w") as f:
    json.dump(weights, f, indent=2)
print("Saved class weights:", weights_json)


Saved class weights: /home/falasoul/notebooks/USD/AAI-590/Capstone/AAI-590-G3-ASL/data/wlasl_preprocessed/class_weights_top24.json
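
Two details are worth keeping in mind when these weights are consumed. First, `json.dump` stores the integer class indices as strings, so they must be converted back on load. Second, in this balanced manifest every class has exactly 8 train clips, so every weight is 1/8 and the weighting only rescales the loss. The weights start to matter once the filter is loosened. The sketch below loads the weights into an index-ordered list and normalizes them to mean 1. The normalization is a common convention assumed here, not something the notebook does, and `weights_from_json` is a hypothetical helper.

```python
import json

def weights_from_json(text, num_classes):
    """Parse saved class weights into a list ordered by class index, with mean 1."""
    raw = json.loads(text)
    # JSON object keys are always strings; convert them back to int indices.
    by_index = {int(k): float(v) for k, v in raw.items()}
    weights = [by_index[i] for i in range(num_classes)]
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]

# Same shape as the saved file for a balanced 3-class subset (8 train clips each).
saved = json.dumps({0: 1 / 8, 1: 1 / 8, 2: 1 / 8})
uniform = weights_from_json(saved, 3)
print(uniform)  # [1.0, 1.0, 1.0]
```

The resulting list can be turned into a tensor and passed as the per-class weight of a cross-entropy loss in whichever framework the training notebook uses.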


### Cell H — (Optional) Spot-check a few files exist and open

In [20]:
import cv2  # df.sample handles the random selection, so the random module is not needed

sampled = df_bal.sample(min(6, len(df_bal)), random_state=7)
ok = 0
for p in sampled["path"]:
    cap = cv2.VideoCapture(p)
    ret, _ = cap.read()
    cap.release()
    ok += int(ret)
print(f"Spot-check decode OK: {ok}/{len(sampled)}")


Spot-check decode OK: 6/6


What to do next

✅ Proceed to 06_train_baseline.ipynb with this 24-class dataset.

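Train a small backbone (e.g., R3D-18 or C3D) on the balanced manifest: 264 clips in total, of which 192 are for training (the 315-clip figure above is the count before capping).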

Evaluate convergence and confusion matrix.

If training looks stable and accuracy improves over epochs, then you can confidently:

Go back to this notebook (05_select_top_and_balance.ipynb)

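Loosen the filter, e.g.
min_per_split = {"train": 5, "val": 1, "test": 1}
and maybe TOP_K = 500, MIN_PER_C = 6. With the Cell D caps unchanged, classes will then hold between 5 and 8 train clips, so the class weights from Cell G start to matter.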

Regenerate a larger manifest (e.g., 100–300 classes)

Re-run training with a deeper backbone.
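
Before regenerating the manifest, it can help to preview how many classes a looser setting would keep. The sketch below combines the Cell C and Cell C+ filters in plain Python. `count_surviving_classes` and the toy rows are hypothetical. On the real manifest the rows could be built with `zip(df["gloss"], df["split"])`.

```python
from collections import Counter

def count_surviving_classes(rows, min_per_class, min_per_split, top_k):
    """rows: iterable of (gloss, split) pairs, one pair per clip."""
    rows = list(rows)
    total = Counter(g for g, _ in rows)
    # Step 1 (Cell C): top_k most frequent classes with enough clips overall.
    eligible = sorted(
        (g for g, c in total.items() if c >= min_per_class),
        key=lambda g: (-total[g], g),
    )[:top_k]
    # Step 2 (Cell C+): every split must meet its own minimum.
    per_split = Counter(rows)
    return sum(
        all(per_split[(g, s)] >= m for s, m in min_per_split.items())
        for g in eligible
    )

# Toy data: "a" = 8/2/1, "b" = 5/1/1, "c" = 3/0/0 (train/val/test).
toy_rows = (
    [("a", "train")] * 8 + [("a", "val")] * 2 + [("a", "test")]
    + [("b", "train")] * 5 + [("b", "val")] + [("b", "test")]
    + [("c", "train")] * 3
)
strict = count_surviving_classes(toy_rows, 10, {"train": 8, "val": 2, "test": 1}, 300)
loose = count_surviving_classes(toy_rows, 6, {"train": 5, "val": 1, "test": 1}, 500)
print(strict, loose)  # 1 2
```

Running this for a few candidate settings gives a rough idea of the class count without rewriting any files.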