This project presents a comprehensive, modular framework for recognizing spoken digits (0-9) from audio recordings. It implements and rigorously compares two powerful, yet fundamentally different, approaches to sequence modeling: classical Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs) and modern Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks.
The entire system is implemented in Python, following best practices for code modularity, reproducibility, and clear separation of concerns. It includes complete pipelines for data preprocessing, feature extraction (MFCCs), hyperparameter tuning, and final evaluation, providing a thorough, end-to-end solution for a classic pattern recognition problem.
Recognizing spoken words, even isolated digits, is a foundational task in speech processing. The challenge is to build a system that can:
- Process Raw Audio into a meaningful, numerical representation.
- Model Temporal Dynamics of speech, as the sound of a digit evolves over time.
- Accurately Classify digits despite variations in pronunciation, speed, and speaker characteristics.
- Compare and Contrast a classical statistical approach (HMMs) with a deep learning approach (LSTMs) on the same standardized dataset.
This project utilizes the Free Spoken Digit Dataset (FSDD), a public collection of audio recordings for spoken digits.
- Content: 3,000 recordings of digits (0-9) from 6 different speakers.
- Format: 8 kHz `.wav` files.
- Attribution: The dataset is open-source and available on GitHub.
The project includes a script that automatically downloads and prepares this dataset, ensuring a smooth and reproducible setup process. A smaller, secondary dataset (digits/) is also included for the initial preparatory analysis.
The project is organized into a modular structure designed for clarity, experimentation, and maintainability.
- `run_01_preparatory_analysis.py`: Conducts the initial exploratory analysis on the small digit dataset, including feature visualization (histograms, heatmaps, PCA) and a baseline evaluation of classical ML classifiers (e.g., SVM, Random Forest).
- `run_02_pytorch_intro.py`: A self-contained script demonstrating the fundamentals of RNNs in PyTorch by training models to map sine waves to cosine waves.
- `run_03_hmm_classification.py`: The primary script for executing the end-to-end HMM pipeline, including hyperparameter tuning and final evaluation on the FSDD.
- `run_04_lstm_classification.py`: The primary script for executing the end-to-end LSTM pipeline, including model training, evaluation, and a comparison of performance with and without sequence packing.
- `data/`: The small, initial `digits` dataset.
- `outputs/`: All generated artifacts (plots, confusion matrices), organized by experiment.
- `src/`: The core logic of the project, structured as a Python library (`patrec`).
  - `preparatory/`: Modules for the initial data analysis.
  - `pytorch_intro/`: Modules for the self-contained PyTorch RNN demonstration.
  - `main_assignment/`: The core modules for the main classification task.
    - `common/`: Shared utilities for data preparation (FSDD) and plotting.
    - `hmm/`: The complete pipeline for GMM-HMM model creation, training, and evaluation.
    - `lstm/`: The complete pipeline for the PyTorch LSTM model, including the `Dataset` class, model architecture, and training logic.
This project directly compares a classical statistical model with a modern deep learning model.
The first model is a Gaussian Mixture Model - Hidden Markov Model (GMM-HMM), a cornerstone of traditional automatic speech recognition.
- Architecture: A separate HMM is trained for each digit (0-9). Each HMM is a left-to-right model, meaning sequences can only progress forward through states, capturing the temporal nature of speech.
- Emission Probabilities: The probability of observing a particular audio feature (an MFCC vector) in a given state is modeled by a Gaussian Mixture Model (GMM). This allows each state to represent complex, multi-modal feature distributions.
- Training: The models are trained using the Baum-Welch algorithm (a form of Expectation-Maximization) to learn the GMM parameters and transition probabilities.
- Inference: To classify a new recording, the system calculates the likelihood of the audio sequence being generated by each of the 10 HMMs. The digit corresponding to the HMM with the highest likelihood is chosen as the prediction.
- Tuning: A systematic grid search is performed to find the optimal number of HMM states and GMM components.
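The inference rule described above reduces to an argmax over per-model scores. Here is a library-agnostic sketch of that decision step; the scores below are hypothetical stand-ins for the log-likelihoods each digit's HMM would assign to one recording.

```python
import numpy as np

def classify(log_likelihoods):
    """Pick the digit whose HMM scored the sequence highest.

    log_likelihoods: array of 10 values, one per digit model, e.g. the
    result of scoring the same MFCC sequence against each trained HMM.
    """
    return int(np.argmax(log_likelihoods))

# Hypothetical scores for one recording: the model for digit 7 wins.
scores = np.array([-512.3, -498.1, -530.7, -505.9, -520.4,
                   -515.2, -509.8, -471.6, -525.0, -518.3])
print(classify(scores))  # → 7
```

Working in log space keeps the comparison numerically stable, since raw sequence likelihoods underflow quickly for long feature sequences.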
The second model is a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN) specifically designed to handle long-range dependencies in sequence data.
- Architecture: A single, powerful LSTM network is trained to classify all 10 digits. The architecture consists of multiple stacked, bidirectional LSTM layers followed by a final classification layer. Bidirectionality allows the model to process the audio sequence in both forward and backward directions, capturing richer contextual information.
- Efficiency: The implementation leverages `pack_padded_sequence`, a PyTorch utility that allows the RNN to skip computations over padded elements in a batch of variable-length sequences. This dramatically improves training speed without sacrificing accuracy.
- Training: The model is trained using the AdamW optimizer and Cross-Entropy Loss. Early stopping is employed to monitor validation loss and prevent overfitting, ensuring the best model is saved.
- Regularization: Dropout is applied between LSTM layers and before the final layer to improve generalization.
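The padding-and-packing step can be sketched as follows, assuming PyTorch is installed. The tensor sizes and hyperparameters (13 features, hidden size 32, two layers) are illustrative, not the project's actual configuration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three variable-length "MFCC" sequences, each of shape (time, n_features).
seqs = [torch.randn(t, 13) for t in (50, 30, 20)]
lengths = torch.tensor([s.shape[0] for s in seqs])

# Pad to a common length, then pack so the LSTM skips the padding.
padded = pad_sequence(seqs, batch_first=True)       # (3, 50, 13)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=True)  # sorted longest-first

lstm = torch.nn.LSTM(input_size=13, hidden_size=32,
                     num_layers=2, batch_first=True,
                     bidirectional=True, dropout=0.2)
out_packed, (h_n, c_n) = lstm(packed)
# h_n: (num_layers * 2 directions, batch, hidden); the last layer's two
# directions are typically concatenated to feed the classification head.
print(h_n.shape)  # torch.Size([4, 3, 32])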
- Python 3.10+
- Git (for automatic dataset cloning)
1. Clone the repository:

   ```bash
   git clone https://github.com/antonisbaro/digitron
   cd digitron
   ```

2. Set up a virtual environment (recommended):

   ```bash
   python -m venv .venv
   # On Windows: .venv\Scripts\activate
   # On macOS/Linux: source .venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
The project is run via Python scripts from the command line. It is recommended to run them in their numbered order.
1. Run the Preparatory Analysis:

   ```bash
   python run_01_preparatory_analysis.py
   ```

2. Run the PyTorch Introduction:

   ```bash
   python run_02_pytorch_intro.py
   ```

3. Run the HMM Experiment (this will automatically download the FSDD dataset):

   ```bash
   python run_03_hmm_classification.py
   ```

4. Run the LSTM Experiment (this will automatically download the FSDD dataset):

   ```bash
   python run_04_lstm_classification.py
   ```
All results, plots, and model checkpoints will be saved to the `outputs/` directory.
- Language: Python
- Core Libraries: Pandas, NumPy, Scikit-learn
- Deep Learning: PyTorch
- Statistical Modeling: Pomegranate
- Audio Processing: Librosa
- Plotting: Matplotlib, Seaborn
- Utilities: TQDM