**Dataset Preparation**
Data Acquisition: Download the datasets: the 117,985 generated audio clips hosted on Zenodo, plus the LJSPEECH and JSUT datasets, which supply the reference recordings.

Pre-processing: Convert audio files into a uniform format if necessary (e.g., 16-bit PCM wav files). Extract Mel spectrograms from these audio files as they are a common feature used for training vocoders and detecting differences in generated audio.


In [5]:
import numpy as np
import tensorflow_datasets as tfds

def load_ljspeech_dataset():
    """
    Load the LJ Speech dataset and prepare it for feature extraction.

    Returns:
    - audios: List of float32 audio waveforms scaled to [-1, 1].
    - texts: List of corresponding normalized text transcriptions.
    """
    # Assumes the TFDS 'ljspeech' builder, whose examples store the waveform
    # under 'speech' (integer PCM samples) and the transcript under 'text_normalized'.
    dataset = tfds.load('ljspeech', split='train')
    audios = []
    texts = []

    for example in tfds.as_numpy(dataset):
        # Normalize int16-range samples to float32 in [-1, 1]
        audio = example['speech'].astype(np.float32) / 32768.0
        # tfds.as_numpy yields strings as bytes
        text = example['text_normalized'].decode('utf-8')
        audios.append(audio)
        texts.append(text)

    return audios, texts

In [7]:
audios, texts = load_ljspeech_dataset()

**Feature Extraction**
MFCC and LFCC Features: Extract Mel-frequency cepstral coefficients (MFCC) and linear-frequency cepstral coefficients (LFCC) from the audio files. Both summarize the short-term spectral envelope of the signal (MFCC with a mel-spaced filterbank, LFCC with a linearly spaced one) and serve as the input for the GMM classifier.

In [None]:
import numpy as np
import librosa
from scipy.fft import dct
from sklearn.preprocessing import StandardScaler
from python_speech_features import logfbank

def extract_lfcc(audio, sr=16000, n_filters=26, n_lfcc=20):
    """
    Extracts cepstral coefficients from an audio signal.

    Note: logfbank() applies a mel-spaced filterbank, so the result only
    approximates LFCC; a true LFCC needs linearly spaced filters.

    Parameters:
    - audio: The audio signal from which to extract features.
    - sr: The sample rate of the audio signal.
    - n_filters: The number of filters to use in the filterbank.
    - n_lfcc: The number of coefficients to keep (at most n_filters).

    Returns:
    - lfcc_features: An array of cepstral features averaged across time.
    """
    # FFT size: next power of two that covers the default 25 ms window
    nfft = int(2 ** np.ceil(np.log2(0.025 * sr)))
    # Compute log filterbank energies, shape (n_frames, n_filters).
    logfbank_features = logfbank(audio, samplerate=sr, nfilt=n_filters, nfft=nfft)

    # DCT-II along the filter axis; keep the first 'n_lfcc' coefficients
    lfcc_features = dct(logfbank_features, type=2, axis=1, norm='ortho')[:, :n_lfcc]
    return np.mean(lfcc_features, axis=0)

def extract_features(audio_path, sr=16000):
    """
    Wrapper function to load an audio file and extract relevant features.

    Parameters:
    - audio_path: Path to the audio file.
    - sr: Sample rate to use for loading the audio.

    Returns:
    - features: A numpy array containing extracted features.
    """
    audio, sr = librosa.load(audio_path, sr=sr)
    lfcc = extract_lfcc(audio, sr)
    return lfcc

def scale_features(features):
    """
    Standardizes each feature dimension to zero mean and unit variance.

    Parameters:
    - features: Numpy array of shape (n_samples, n_features).

    Returns:
    - scaled_features: Scaled array of the same shape.
    """
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(features)
    return scaled_features
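
Since `logfbank` from `python_speech_features` spaces its filters on the mel scale, a filterbank with linearly spaced triangular filters is needed for LFCC in the strict sense. The sketch below shows one way to build that with NumPy and SciPy only. The names `linear_filterbank` and `lfcc_linear`, and the frame parameters (`n_fft=1024`, `hop=256`), are assumptions for illustration rather than part of the original notebook.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import stft


def linear_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centres spaced linearly from 0 Hz to Nyquist."""
    edges = np.linspace(0.0, sr / 2.0, n_filters + 2)   # filter edges in Hz
    bins = np.fft.rfftfreq(n_fft, d=1.0 / sr)           # FFT bin frequencies
    fbank = np.zeros((n_filters, bins.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bins - lo) / (mid - lo)
        falling = (hi - bins) / (hi - mid)
        # Triangle: min of both slopes, clipped at zero outside [lo, hi]
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank


def lfcc_linear(audio, sr=22050, n_fft=1024, hop=256, n_filters=26, n_lfcc=20):
    """Time-averaged LFCC vector computed with a linear filterbank."""
    _, _, spec = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(spec) ** 2                            # (n_fft // 2 + 1, n_frames)
    energies = linear_filterbank(n_filters, n_fft, sr) @ power
    log_energies = np.log(energies + 1e-10)              # avoid log(0)
    cepstra = dct(log_energies, type=2, axis=0, norm="ortho")[:n_lfcc]
    return cepstra.mean(axis=1)


# Hypothetical input: one second of white noise standing in for a real clip
rng = np.random.default_rng(0)
noise = rng.standard_normal(22050)
fbank = linear_filterbank(26, 1024, 22050)
lfcc_vector = lfcc_linear(noise)  # shape: (20,)
```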


**Model Training**
Gaussian Mixture Model (GMM): For each dataset, train two GMMs, one on the real audio (the original LJSPEECH recordings) and one on the generated audio samples. This step involves:
    Calculating the log-likelihood ratio between the two models to classify samples.
    Using MFCC and LFCC features as inputs.
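
Written out, the likelihood-ratio score for a feature vector $x$ can be expressed as below. The symbols are notation introduced here for illustration: $\theta_{\text{real}}$ and $\theta_{\text{synthetic}}$ denote the parameters of the two trained GMMs, and $K$ is the number of mixture components.

```latex
\Lambda(x) = \log p(x \mid \theta_{\text{real}}) - \log p(x \mid \theta_{\text{synthetic}}),
\qquad
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x ;\, \mu_k, \Sigma_k\right)
```

A positive $\Lambda(x)$ means the real-audio model explains the sample better; a negative value favors the synthetic model.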
    
RawNet2 Training (Optional): As an alternative to GMM, train a CNN-GRU hybrid model known as RawNet2 which directly extracts features from raw audio to create embeddings for classification.


In [None]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_curve
from scipy.optimize import brentq
from scipy.interpolate import interp1d

def train_gmm(features, n_components=16):
    """
    Train a Gaussian Mixture Model (GMM) on the provided features.

    Parameters:
    - features: Feature matrix for training the GMM.
    - n_components: Number of Gaussian components in the GMM.

    Returns:
    - gmm: Trained GMM object.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200, random_state=0)
    gmm.fit(features)
    return gmm

def preprocess_features(audios, sr=22050, n_lfcc=20):
    """
    Preprocess audio data and extract LFCC features.

    Parameters:
    - audios: List of audio waveforms.
    - sr: Sample rate of the audio data.
    - n_lfcc: Number of LFCC coefficients to extract.

    Returns:
    - features: Numpy array of extracted LFCC features for all audio samples.
    """
    features = []
    for audio in audios:
        lfcc = extract_lfcc(audio, sr=sr, n_lfcc=n_lfcc)
        features.append(lfcc)
    return np.array(features)

def compute_eer(y_true, y_scores):
    """
    Compute the Equal Error Rate (EER).

    Parameters:
    - y_true: Ground truth binary labels.
    - y_scores: Predicted scores or probabilities.

    Returns:
    - eer: Equal Error Rate.
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)
    return eer

def classify_samples(gmm_real, gmm_synthetic, features):
    """
    Classify samples using trained GMMs based on the log-likelihood ratio.

    Parameters:
    - gmm_real: GMM trained on real audio features.
    - gmm_synthetic: GMM trained on synthetic audio features.
    - features: Feature matrix of samples to classify.

    Returns:
    - scores: Log-likelihood ratio scores for the samples.
    """
    log_likelihood_real = gmm_real.score_samples(features)
    log_likelihood_synthetic = gmm_synthetic.score_samples(features)
    scores = log_likelihood_real - log_likelihood_synthetic
    return scores

In [None]:
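from sklearn.model_selection import train_test_split

# features_real / features_synthetic are assumed to be the outputs of
# preprocess_features() for the real and generated clips.
# Labels: 0 = real, 1 = synthetic.
features = np.vstack((features_real, features_synthetic))
labels = np.concatenate((np.zeros(len(features_real)), np.ones(len(features_synthetic))))
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

# Train one GMM per class
gmm_real = train_gmm(X_train[y_train == 0], n_components=16)
gmm_synthetic = train_gmm(X_train[y_train == 1], n_components=16)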

**Classifier Design**
Binary Classification: Design classifiers to distinguish between human and AI-generated voices. Use the GMM-based approach to calculate the likelihood of a sample being real or synthetic.
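
As a toy illustration of the decision rule, the sketch below fits two GMMs on synthetic Gaussian data and labels a sample as synthetic when its log-likelihood ratio is negative. The data, the component count, and the zero threshold are assumptions chosen for the example; real features would come from the extraction step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-ins for per-clip feature vectors (20 dimensions each)
real_train = rng.normal(0.0, 1.0, size=(500, 20))
synth_train = rng.normal(1.5, 1.0, size=(500, 20))
real_test = rng.normal(0.0, 1.0, size=(100, 20))
synth_test = rng.normal(1.5, 1.0, size=(100, 20))

# One GMM per class
gmm_real = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm_real.fit(real_train)
gmm_synth = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm_synth.fit(synth_train)

test_features = np.vstack((real_test, synth_test))
is_synthetic = np.concatenate((np.zeros(100), np.ones(100)))  # 0 = real, 1 = synthetic

# Log-likelihood ratio: positive favors "real", negative favors "synthetic"
llr = gmm_real.score_samples(test_features) - gmm_synth.score_samples(test_features)
predicted_synthetic = (llr < 0).astype(int)
accuracy = (predicted_synthetic == is_synthetic).mean()
```

Thresholding at zero treats both classes as equally likely a priori; a different operating point can be chosen by moving the threshold.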

Evaluation Metrics: Utilize the Equal Error Rate (EER) as the primary metric for evaluating model performance.
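
The EER is the operating point at which the false positive rate (real clips flagged as synthetic) equals the false negative rate (synthetic clips missed). A minimal NumPy sketch that scans the observed scores as thresholds is shown below; the function name `eer_from_scores` and the small hand-made example are assumptions for illustration, and scores are taken to be higher for synthetic samples.

```python
import numpy as np


def eer_from_scores(is_synthetic, synthetic_scores):
    """Return (EER, threshold); higher score means 'more likely synthetic'."""
    is_synthetic = np.asarray(is_synthetic, dtype=bool)
    synthetic_scores = np.asarray(synthetic_scores, dtype=float)
    thresholds = np.unique(synthetic_scores)  # sorted candidate thresholds

    # False positive rate: real clips scored at or above the threshold
    fpr = np.array([(synthetic_scores[~is_synthetic] >= t).mean() for t in thresholds])
    # False negative rate: synthetic clips scored below the threshold
    fnr = np.array([(synthetic_scores[is_synthetic] < t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest
    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0, thresholds[idx]


labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
eer, threshold = eer_from_scores(labels, scores)
print(eer, threshold)  # 0.25 0.6
```

At the threshold 0.6, one of four real clips is flagged and one of four synthetic clips is missed, so both error rates are 25%.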

In [None]:
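# Classify and evaluate.
# classify_samples() returns log p(x|real) - log p(x|synthetic), so higher
# scores mean "real". roc_curve() treats label 1 (synthetic) as the positive
# class, so the scores are negated before computing the EER.
scores = classify_samples(gmm_real, gmm_synthetic, X_test)
eer = compute_eer(y_test, -scores)
print(f"Equal Error Rate (EER): {eer:.2%}")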

**Experiments and Evaluation**
Generalization Tests: Evaluate the classifier's performance on unseen data, such as JSUT and TTS datasets, to assess its ability to generalize.

Attribution Analysis: Implement attribution methods like BlurIG to understand which parts of the audio signal influence the prediction, focusing on specific features that distinguish between human and AI-generated voices.