ASL Recognition Model

American Sign Language (ASL) recognition system using MediaPipe landmarks and deep learning. The goal is a compact LSTM/Transformer model that translates sign language into text/speech in a mobile app.

πŸ“ Project Structure

asl_model/
├── configs/                     # Configuration files
│   ├── config.yaml              # Main configuration
│   └── label_map.json           # Label normalization rules
├── data/                        # Raw datasets
│   ├── kaggle_asl1/             # Kaggle ASL Alphabet dataset
│   ├── kaggle_asl2/             # Kaggle ASL Dataset
│   ├── kaggle_asl_combined/     # Combined Kaggle images (A-Z)
│   ├── microsoft_asl/           # MS-ASL videos and metadata
│   └── personal/                # Personal recordings (optional)
├── artifacts/                   # Generated artifacts
│   ├── landmarks/               # Raw MediaPipe landmarks [T, 543, 4]
│   ├── features/                # Preprocessed features [T, 75, 4]
│   ├── manifests/               # Dataset manifests (CSV)
│   ├── models/                  # Trained model checkpoints
│   │   └── lstm_attention_20251020_151032/  # Best model (86.3%)
│   └── logs/                    # TensorBoard training logs
├── src/                         # Core library code
│   ├── data/                    # Data processing modules
│   │   ├── __init__.py
│   │   └── dataloader.py        # PyTorch dataloader with windowing
│   ├── models/                  # Model architectures
│   │   ├── __init__.py
│   │   └── lstm_model.py        # BiLSTM + Attention
│   └── utils/                   # Utility functions
│       └── __init__.py
├── scripts/                     # Executable scripts
│   ├── 1_data_preparation/      # Download and organize data
│   │   ├── combine_kaggle_asl.py
│   │   ├── build_manifest.py
│   │   ├── assign_splits.py
│   │   └── msasl_*.py           # MS-ASL pipeline scripts
│   ├── 2_preprocessing/         # Extract and preprocess features
│   │   ├── extract_landmarks.py
│   │   ├── preprocess_features.py
│   │   └── filter_valid_features.py
│   ├── 3_training/              # Train models
│   │   └── train_baseline.py
│   └── 4_evaluation/            # Evaluate and visualize
│       ├── analyze_errors.py
│       ├── quick_stats.py
│       └── quick_viz.py
├── tests/                       # Test scripts
│   ├── test_dataloader_with_splits.py
│   ├── test_model.py
│   └── README.md
├── bash_scripts/                # Bash shell scripts
│   ├── download_kaggle_datasets.sh
│   ├── check_feature_validity.sh
│   ├── check_feature_validity_unfiltered.sh
│   ├── train_ensemble.sh
│   └── README.md
├── plans/                       # Project planning documents
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── TRAINING_GUIDE.md            # Training documentation
├── KAGGLE_DATASETS_INFO.md      # Kaggle dataset details
├── MSASL_PIPELINE.md            # MS-ASL pipeline guide
└── PROJECT_STRUCTURE.md         # Detailed structure docs

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/davidgit3000/asl_recognition.git
cd asl_recognition

2. Set Up the Environment

Mac/Linux:

# Create virtual environment
python3.11 -m venv .venv311
source .venv311/bin/activate

# Install dependencies
pip install -r requirements.txt

Windows:

# Create virtual environment
python -m venv .venv311
.venv311\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Download Datasets

Kaggle Datasets (required):

Option A - Using script (Mac/Linux):

# Setup Kaggle API credentials first (~/.kaggle/kaggle.json)
bash bash_scripts/download_kaggle_datasets.sh
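
The credentials file is the standard Kaggle API token (download it from your Kaggle account page via "Create New API Token"); it is a small JSON file of the form:

{
  "username": "your_kaggle_username",
  "key": "your_api_key"
}

Restrict its permissions (chmod 600 ~/.kaggle/kaggle.json) or the Kaggle CLI will warn that the key is readable by other users.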

Option B - Manual download:

  • Download ASL Alphabet → extract to data/kaggle_asl1/
  • Download ASL Dataset → extract to data/kaggle_asl2/

MS-ASL Videos (optional):

  • Download from the MS-ASL Dataset
  • Then use the automated scripts (see scripts/1_data_preparation/README.md)

4. Prepare Data

# Combine Kaggle datasets
python scripts/1_data_preparation/combine_kaggle_asl.py

# Build manifest and assign splits
python scripts/1_data_preparation/build_manifest.py
python scripts/1_data_preparation/assign_splits.py

5. Extract and Preprocess Features

# Extract MediaPipe landmarks
python scripts/2_preprocessing/extract_landmarks.py

# Preprocess features for training
python scripts/2_preprocessing/preprocess_features.py

6. Test the Dataloader

python tests/test_dataloader_with_splits.py
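
For reference, the dataloader yields fixed-length windows over each feature sequence. A rough sketch of typical usage (the class name, constructor arguments, and manifest path below are assumptions for illustration, not the exact dataloader.py API):

from torch.utils.data import DataLoader

from src.data.dataloader import ASLWindowDataset  # hypothetical name

# Hypothetical constructor; the real one may take different arguments.
dataset = ASLWindowDataset(
    manifest="artifacts/manifests/train.csv",  # assumed path
    window=32,                                 # frames per window
    stride=16,                                 # hop between windows
)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

features, labels = next(iter(loader))
# features: [B, 32, 75, 4] windows of preprocessed landmarks
# labels:   [B] integer class ids (0-25 for A-Z)
print(features.shape, labels.shape)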

📊 Dataset

Raw Data

  • Kaggle ASL Alphabet: ~78,000 images (A-Z letters)
  • Kaggle ASL Dataset: ~9,000 images (A-Z letters)
  • MS-ASL: ~190 videos (20 common words)
  • Total Raw: ~87,200 samples

After Processing & Filtering

  • Valid Features: 68,671 samples (86.1% extraction success rate)
  • Classes: 26 (A-Z letters only)
  • Splits: stratified 70% train (48,074), 15% val (10,278), 15% test (10,319); see the sketch below
  • Removed: 11,144 zero features + 188 MS-ASL samples (class imbalance)
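
The stratified split can be reproduced in two passes with scikit-learn; a minimal sketch assuming the manifest is a CSV with a label column (assign_splits.py may differ in details):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("artifacts/manifests/manifest.csv")  # assumed filename

# 70% train first, then split the remaining 30% in half into
# val and test, stratifying on the label at each step.
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["label"], random_state=42)
print(len(train_df), len(val_df), len(test_df))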

Key Statistics

  • Feature extraction success: 86.1% (dual-detection pipeline)
  • Class balance: 1,847-3,008 samples per class (well-balanced)
  • Most challenging classes: M, N (39.8% detection failure due to their closed-fist handshape)

🔧 Pipeline Overview

  1. Data Collection → Download Kaggle + MS-ASL datasets ✅
  2. Manifest Building → Create unified CSV with metadata ✅
  3. Landmark Extraction → Dual detection (MediaPipe Holistic + Hands fallback; sketch below) ✅
  4. Feature Preprocessing → Normalize, smooth, reduce to 75 landmarks ✅
  5. Quality Filtering → Remove zero features and low-count classes ✅
  6. Train/Val/Test Split → Stratified 70/15/15 split ✅
  7. Dataloader → PyTorch DataLoader with windowing ✅
  8. Model Training → BiLSTM + Attention (86.3% test accuracy) ✅
  9. Evaluation → Error analysis and confusion matrix 🔄
  10. Inference Pipeline → Real-time webcam demo 📋
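
The dual detection in step 3 runs MediaPipe Holistic first and retries with the standalone Hands model, which is often more sensitive on isolated hands. A simplified single-hand sketch (extract_landmarks.py also keeps pose and face landmarks):

import cv2
import mediapipe as mp

# In real code the models are created once and reused across frames.
holistic = mp.solutions.holistic.Holistic(static_image_mode=True)
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def extract_hand_landmarks(image_bgr):
    """Return hand landmarks via Holistic, falling back to Hands."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

    res = holistic.process(rgb)
    if res.left_hand_landmarks or res.right_hand_landmarks:
        return res.left_hand_landmarks or res.right_hand_landmarks

    res = hands.process(rgb)  # fallback pass
    if res.multi_hand_landmarks:
        return res.multi_hand_landmarks[0]

    return None  # counted as an extraction failure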

📦 Features

Data Processing

  • ✅ Dual-detection pipeline: MediaPipe Holistic + Hands fallback (86.1% success)
  • ✅ Temporal smoothing: Savitzky-Golay filter (window=5, polynomial=2)
  • ✅ Normalization: Centered on torso, scaled by shoulder width (see the sketch after this list)
  • ✅ Quality filtering: Remove zero features and low-count classes
  • ✅ Windowed sequences: 32 frames with configurable stride
  • ✅ Data augmentation: Rotation, scale, translation (training only)
  • ✅ Class balancing: Weighted loss for imbalanced data
  • ✅ Stratified splits: 70/15/15 train/val/test
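
A minimal sketch of the normalization and smoothing steps above, assuming features shaped [T, 75, 4] with (x, y, z, confidence) per landmark; the shoulder indices here are placeholders, not the real ones from preprocess_features.py:

import numpy as np
from scipy.signal import savgol_filter

L_SHOULDER, R_SHOULDER = 11, 12  # placeholder landmark indices

def preprocess(seq):
    """seq: [T, 75, 4] raw landmarks -> normalized, smoothed copy."""
    xyz = seq[..., :3].astype(np.float64)

    # Center on the torso (midpoint of the shoulders).
    torso = (xyz[:, L_SHOULDER] + xyz[:, R_SHOULDER]) / 2.0
    xyz -= torso[:, None, :]

    # Scale by shoulder width so distance to the camera cancels out.
    width = np.linalg.norm(xyz[:, L_SHOULDER] - xyz[:, R_SHOULDER], axis=-1)
    xyz /= np.maximum(width, 1e-6)[:, None, None]

    # Savitzky-Golay smoothing along time (window=5, polynomial=2);
    # single-frame image samples are left unsmoothed.
    if xyz.shape[0] >= 5:
        xyz = savgol_filter(xyz, window_length=5, polyorder=2, axis=0)

    out = seq.astype(np.float64)
    out[..., :3] = xyz
    return out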

Model Architecture

  • ✅ BiLSTM + Attention: 3 layers, 512 hidden units, 17M parameters (sketch below)
  • ✅ Regularization: Dropout (0.25), weight decay (1e-5), label smoothing (0.1)
  • ✅ Optimization: Adam optimizer with ReduceLROnPlateau scheduler
  • ✅ Early stopping: Patience=10 epochs
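
A hedged sketch of this configuration (the real implementation lives in src/models/lstm_model.py; the attention form and layer names here are assumptions, and the scheduler patience is a placeholder):

import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """3-layer BiLSTM with attention pooling over time steps."""

    def __init__(self, input_dim=75 * 4, hidden=512, num_layers=3,
                 num_classes=26, dropout=0.25):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True,
                            dropout=dropout)
        self.attn = nn.Linear(2 * hidden, 1)  # one score per time step
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                      # x: [B, T, input_dim]
        out, _ = self.lstm(x)                  # [B, T, 2 * hidden]
        w = torch.softmax(self.attn(out), dim=1)
        pooled = (w * out).sum(dim=1)          # attention-weighted average
        return self.head(pooled)               # [B, num_classes] logits

model = BiLSTMAttention()

# Training setup matching the bullets above.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=5)  # stepped on validation accuracy

With these sizes the LSTM stack alone comes to roughly 17M parameters, consistent with the figure above.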

Performance

  • ✅ Test Accuracy: 86.29% (22.7× better than random)
  • ✅ Validation Accuracy: 86.79%
  • ✅ Generalization: Val ≈ Test (no overfitting)
  • ✅ Training Time: ~6 hours (100 epochs on Apple M4)

πŸ“ Documentation

  • KAGGLE_DATASETS_INFO.md - Kaggle dataset details
  • MSASL_PIPELINE.md - MS-ASL download pipeline
  • scripts/README.md - Scripts documentation
  • scripts/*/README.md - Detailed docs for each stage

🎯 Current Status

✅ Completed

  1. ✅ Data collection and preprocessing (68,671 samples, 26 classes)
  2. ✅ Dual-detection landmark extraction (86.1% success rate)
  3. ✅ Feature engineering and quality filtering
  4. ✅ BiLSTM + Attention model (86.29% test accuracy)
  5. ✅ Training pipeline with early stopping and LR scheduling

🔄 In Progress

  1. 🔄 Error analysis and confusion matrix visualization
  2. 🔄 Per-class performance evaluation

📋 Next Steps

Phase 1: Model Exploration (Weeks 1-2)

  1. CNN-based models (for spatial feature extraction)

    • 2D CNN on landmark heatmaps
    • 3D CNN for spatiotemporal features
    • ResNet/EfficientNet backbones
  2. Hybrid models (combining spatial + temporal)

    • CNN + LSTM (extract spatial features, then temporal modeling)
    • CNN + Transformer (attention over CNN features)
    • Two-stream networks (appearance + motion)
  3. Transformer-based models

    • Vision Transformer (ViT) for landmark sequences
    • Temporal Transformer with positional encoding
    • BERT-style pre-training on ASL data
  4. Ensemble methods

    • Train 5 diverse models (different architectures/hyperparameters)
    • Probability averaging or voting
    • Expected: +2-4% accuracy boost
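
Probability averaging is the simplest of these; a sketch assuming each trained model maps the same input batch to [B, 26] logits:

import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Average softmax probabilities across models, then take argmax."""
    probs = torch.stack(
        [torch.softmax(m(x), dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1)  # [B] predicted class ids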

Phase 2: Deployment (Weeks 3-4)

  1. Real-time inference pipeline

    • Webcam integration with MediaPipe
    • Sliding window prediction (30 FPS)
    • Temporal smoothing for stable predictions (see the sketch after this list)
  2. Demo application

    • GUI with live video feed
    • Top-3 predictions with confidence scores
    • Recording capability for new samples
  3. Model optimization

    • ONNX export for cross-platform deployment
    • Quantization for faster inference
    • Mobile deployment (TensorFlow Lite)
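
For the temporal smoothing in item 1, one simple approach is majority voting over the last few window predictions; a minimal sketch (the demo may instead average probabilities, e.g. exponentially):

from collections import Counter, deque

class PredictionSmoother:
    """Emit a label only when it wins a clear majority of recent windows."""

    def __init__(self, history=10, min_votes=6):
        self.recent = deque(maxlen=history)
        self.min_votes = min_votes

    def update(self, label):
        self.recent.append(label)
        winner, votes = Counter(self.recent).most_common(1)[0]
        return winner if votes >= self.min_votes else None

# At ~30 FPS, a 10-window history reacts in roughly a third of a second.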

Phase 3: Advanced Features (Optional)

  1. Word-level recognition

    • Collect more MS-ASL data (500+ samples per word)
    • Sequence-to-sequence models
    • Sentence-level ASL translation
  2. Transfer learning

    • Fine-tune on personal signing style
    • Few-shot learning for new signs
  3. Multi-modal learning

    • Combine landmarks + raw video
    • Audio integration (for signed songs)

πŸ† Model Comparison (Planned)

| Model | Architecture | Params | Expected Acc | Training Time | Notes |
|-------|--------------|--------|--------------|---------------|-------|
| BiLSTM + Attention | 3-layer BiLSTM | 17M | 86.3% ✅ | 6h | Current best |
| CNN + LSTM | ResNet18 + 2-layer LSTM | ~15M | 87-89% | 8h | Spatial + temporal |
| 3D CNN | 3D ResNet | ~30M | 85-88% | 10h | End-to-end spatiotemporal |
| Transformer | 6-layer encoder | ~20M | 88-91% | 12h | Pure attention |
| Ensemble (5 models) | Mixed | ~80M | 89-92% | 30h | Best performance |

📈 Performance Targets

  • ✅ Baseline (random): 3.8% (1/26 classes)
  • ✅ Current (BiLSTM + Attention): 86.3%
  • 🎯 Target (ensemble): 90%+
  • 🏆 SOTA (published research): 92-95%

📄 License

Educational project for CS 4620.
