# 🔊 Speaker Recognition using Classical Machine Learning
**Project:** VocalPrint – Who's Speaking?

This notebook implements speaker identification using classical machine learning algorithms with MFCC features extracted from Mozilla Common Voice data.

We will:
- Load a subset of Common Voice
- Select a few speakers
- Extract MFCC features using Librosa
- Train and evaluate a classifier (KNN here; SVM and others are left as next steps)
- Visualize performance using a confusion matrix

In [None]:
from datasets import load_dataset

# Load the first 1% of the English Common Voice train split
dataset = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="train[:1%]")
dataset = dataset.filter(lambda x: x['audio'] is not None and x['client_id'] is not None)

print(f"Loaded {len(dataset)} examples.")
dataset[0]


## 📥 Step 1: Load Audio Metadata

We read the `validated.tsv` file to extract speaker IDs and audio paths.


In [None]:
import pandas as pd

# Load metadata
metadata_path = "data/cv-corpus-13.0-2023-03-09/en/validated.tsv"
base_audio_path = "data/cv-corpus-13.0-2023-03-09/en/clips/"

# Load the TSV into a DataFrame
df = pd.read_csv(metadata_path, sep="\t")

# Preview the speaker ID, clip file name, and transcript columns
df[["client_id", "path", "sentence"]].head()
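
The `path` column in `validated.tsv` typically holds bare file names, so each one has to be joined with the clips directory before the audio can be loaded. Below is a minimal sketch of that join on an in-memory TSV. The rows in `SAMPLE_TSV` and the helper `resolve_clip_paths` are hypothetical and not part of the notebook.

```python
import csv
import io
from pathlib import Path

# Hypothetical stand-in for a few rows of validated.tsv
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "spk_a\tclip_0001.mp3\tHello there.\n"
    "spk_b\tclip_0002.mp3\tGood morning.\n"
)

def resolve_clip_paths(tsv_text, clips_dir):
    """Return (client_id, full_path, exists) for every metadata row."""
    clips_dir = Path(clips_dir)
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    resolved = []
    for row in rows:
        full_path = clips_dir / row["path"]
        resolved.append((row["client_id"], full_path, full_path.exists()))
    return resolved

resolved = resolve_clip_paths(SAMPLE_TSV, "data/cv-corpus-13.0-2023-03-09/en/clips/")
missing = [str(p) for _, p, exists in resolved if not exists]
print(f"{len(missing)} of {len(resolved)} clips not found on disk")
```

Checking that the files exist before feature extraction surfaces a mismatched corpus directory early, rather than as a `FileNotFoundError` in the middle of a loading loop.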


In [12]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# === 1. Define path to your MP3 file ===
file_path = "data/cv-corpus-21.0-delta-2025-03-14/en/clips/common_voice_en_41910500.mp3"

# === 2. Load the audio file ===
y, sr = librosa.load(file_path, sr=None)  # Load with original sampling rate

print(f"Loaded audio with shape: {y.shape}, Sample rate: {sr}")

# === 3. Extract MFCC features ===
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"MFCC shape: {mfcc.shape}")

# === 4. Visualize the MFCCs ===
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfcc, x_axis='time', sr=sr)
plt.colorbar()
plt.title("MFCC - Mel Frequency Cepstral Coefficients")
plt.tight_layout()
plt.show()


  y, sr = librosa.load(file_path, sr=None)  # Load with original sampling rate


FileNotFoundError: [Errno 2] No such file or directory: 'data/cv-corpus-21.0-delta-2025-03-14/en/clips/common_voice_en_41910500.mp3'

## 👤 Step 2: Filter Speakers

We select the top N speakers with at least 10 samples so that every class has enough examples to train and test on. A minimum count alone does not equalize class sizes.


In [None]:
from collections import Counter

MIN_SAMPLES = 10
N_SPEAKERS = 5

# Count clips per speaker, then keep the N most frequent speakers
# that have at least MIN_SAMPLES clips each
speaker_counts = Counter(dataset['client_id'])
selected_speakers = {
    s for s, c in speaker_counts.most_common(N_SPEAKERS) if c >= MIN_SAMPLES
}

filtered_dataset = dataset.filter(lambda x: x['client_id'] in selected_speakers)
print(f"Filtered to {len(filtered_dataset)} samples from {len(selected_speakers)} speakers.")
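
Requiring a minimum number of clips guarantees that every speaker has enough data, but the selected speakers can still differ widely in size. If stricter balance is wanted, one option is to cap the number of clips per speaker. The sketch below is an optional illustration on toy speaker IDs. `cap_per_speaker` and `max_per_speaker` are hypothetical names, not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def cap_per_speaker(speaker_ids, max_per_speaker, seed=42):
    """Return sorted indices keeping at most `max_per_speaker` clips per speaker."""
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(idx)

    rng = random.Random(seed)
    kept = []
    for indices in by_speaker.values():
        if len(indices) > max_per_speaker:
            # Randomly downsample speakers that exceed the cap
            indices = rng.sample(indices, max_per_speaker)
        kept.extend(indices)
    return sorted(kept)

# Toy speaker IDs: "a" has 6 clips, "b" has 3, "c" has 4
toy_ids = ["a"] * 6 + ["b"] * 3 + ["c"] * 4
kept = cap_per_speaker(toy_ids, max_per_speaker=3)
balanced_counts = Counter(toy_ids[i] for i in kept)
print(balanced_counts)
```

With a Hugging Face dataset, the returned indices could then be passed to `filtered_dataset.select(kept)`, assuming `speaker_ids` was taken from `filtered_dataset['client_id']`.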


## 🎼 Step 3: Extract MFCC Features

We use Librosa to extract 13 MFCCs per frame for each audio sample, then average them across time so that every clip is represented by a single 13-dimensional vector.


In [None]:
import librosa
import numpy as np
from tqdm import tqdm

def extract_mfcc(audio_array, sr, n_mfcc=13):
    """Return the time-averaged MFCC vector (length n_mfcc) for one clip."""
    mfcc = librosa.feature.mfcc(y=audio_array, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.mean(axis=1)

X, y = [], []

for example in tqdm(filtered_dataset):
    audio = example["audio"]["array"]
    sr = example["audio"]["sampling_rate"]
    label = example["client_id"]

    X.append(extract_mfcc(audio, sr))
    y.append(label)

# Stack into arrays: X is (n_samples, n_mfcc), y is (n_samples,)
X = np.array(X)
y = np.array(y)

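
Averaging across time is what turns clips of different lengths into feature vectors of the same size. The sketch below shows that pooling step in isolation, using random matrices as stand-ins for real MFCC output. `pool_mfcc` and its `with_std` option are hypothetical additions for illustration, and appending the per-coefficient standard deviation is one common variant rather than something the notebook does.

```python
import numpy as np

def pool_mfcc(mfcc, with_std=False):
    """Collapse an (n_mfcc, n_frames) matrix into one fixed-length vector."""
    mean = mfcc.mean(axis=1)
    if not with_std:
        return mean
    # Optional: also keep how much each coefficient varies over time
    return np.concatenate([mean, mfcc.std(axis=1)])

rng = np.random.default_rng(0)
short_clip = rng.normal(size=(13, 80))   # synthetic stand-in for a short clip's MFCCs
long_clip = rng.normal(size=(13, 250))   # same number of coefficients, more frames

print(pool_mfcc(short_clip).shape, pool_mfcc(long_clip).shape)
print(pool_mfcc(long_clip, with_std=True).shape)
```

Both clips end up as vectors of length 13 regardless of duration, which is what lets them be stacked into a single feature matrix for a classifier. The trade-off is that all temporal ordering within the clip is discarded.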


## 🤖 Step 4: Train a K-Nearest Neighbors (KNN) Classifier

We split the dataset and train a basic KNN classifier to identify speakers based on MFCC features.
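
To make the classification step concrete, here is a simplified NumPy sketch of what a KNN classifier does at prediction time: measure the distance from a query vector to every training vector, take the `k` closest, and return the majority label. It runs on toy 2-D points and is not scikit-learn's implementation. `knn_predict` and the `toy_*` arrays are made up for this example, and tie-breaking may differ from `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]

    preds = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Two well-separated toy "speakers" in a 2-D feature space
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array(["spk_a", "spk_a", "spk_a", "spk_b", "spk_b", "spk_b"])

toy_queries = np.array([[0.05, 0.1], [5.0, 5.1]])
toy_pred = knn_predict(toy_X, toy_y, toy_queries, k=3)
print(toy_pred)
```

Because the vote depends on raw Euclidean distance, MFCC dimensions with larger numeric ranges carry more weight. Standardizing the features before fitting is a common remedy worth trying.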


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Stratify so every speaker appears in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


## 📊 Step 5: Confusion Matrix

We visualize how well the classifier performs across all speakers.
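
As a reading aid, the sketch below builds a confusion matrix by hand on toy labels, using the same convention as scikit-learn's `confusion_matrix`: rows are true speakers and columns are predicted speakers. `confusion_counts` and the `toy_*` lists are hypothetical and only illustrate how the counts are laid out.

```python
import numpy as np

def confusion_counts(y_true, y_pred, labels):
    """Count matrix with true speakers on rows and predicted speakers on columns."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

toy_labels = ["spk_a", "spk_b", "spk_c"]
toy_true = ["spk_a", "spk_a", "spk_b", "spk_b", "spk_c", "spk_c"]
toy_predicted = ["spk_a", "spk_b", "spk_b", "spk_b", "spk_c", "spk_a"]

toy_cm = confusion_counts(toy_true, toy_predicted, toy_labels)
# Diagonal entries are correct predictions; dividing by row sums gives per-speaker recall
per_speaker_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
print(toy_cm)
print(per_speaker_recall)
```

Off-diagonal entries show which speakers are mistaken for which. A large count in row `spk_c`, column `spk_a`, for example, would mean clips from `spk_c` are often attributed to `spk_a`.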


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix - KNN Classifier")
plt.show()


## ✅ Summary & Next Steps

In this notebook, we:
- Loaded voice samples from Mozilla Common Voice
- Focused on 5 speakers with at least 10 samples each to keep the classification problem small
- Extracted MFCC features using Librosa
- Trained a KNN model and evaluated it with classification metrics and a confusion matrix

**Next:**
- Try SVM, Logistic Regression, and Random Forest
- Tune hyperparameters (e.g., number of MFCCs, K value in KNN)
- Compare with CNNs using Mel Spectrograms in the next notebook


In [6]:
mp3_path = Path("/home/ahmed-bashir/Documents/school/VoiceScope/data/cv-corpus-21.0-delta-2025-03-14-en/cv-corpus-21.0-delta-2025-03-14/en/clips/common_voice_en_41980499.mp3")

In [3]:
import numpy as np
import pandas as pd
from pathlib import Path

# Load the TSV metadata
base_dir = Path.cwd().parent
tsv_path = base_dir / "data" / "cv-corpus-21.0-delta-2025-03-14-en" / "cv-corpus-21.0-delta-2025-03-14" / "en" / "validated.tsv"
df = pd.read_csv(tsv_path, sep='\t')

# Load training data
train_data = np.load(base_dir / "audios" / "train.npz")
X_train = train_data['X']
y_train = train_data['y']

print("Training set:")
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("Unique speakers:", len(np.unique(y_train)))

# Load test data
test_data = np.load(base_dir / "audios" / "test.npz")
X_test = test_data['X']
y_test = test_data['y']

print("\nTest set:")
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
print("Unique speakers:", len(np.unique(y_test)))


Training set:
X_train shape: (179, 13)
y_train shape: (179,)
Unique speakers: 22

Test set:
X_test shape: (45, 13)
y_test shape: (45,)
Unique speakers: 16


In [4]:
import numpy as np

# Load the file
split_data = np.load(base_dir / "audios" / "train_test_split.npz")

# Extract components
X_train = split_data['X_train']
X_test = split_data['X_test']
y_train = split_data['y_train']
y_test = split_data['y_test']

# Display summary
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

print("Train speakers:", len(np.unique(y_train)))
print("Test speakers:", len(np.unique(y_test)))


X_train shape: (179, 13)
y_train shape: (179,)
X_test shape: (45, 13)
y_test shape: (45,)
Train speakers: 22
Test speakers: 16


