# It's Yes or It's No

The goal of this project is to use a neural network to correctly classify a spoken "Yes" and a spoken "No". The theme is inspired by Angèle's song **Oui ou non**.

# I- Building the dataset

We start by importing some useful libraries and setting up the working environment.

The idea is then to extract relevant features from the audio files, turning them into numerical arrays that models can be trained on.

In [3]:
import os
import numpy as np
from scipy.io import wavfile
import librosa

yes_audio_dir = './YesNo/yes/'
no_audio_dir = './YesNo/no/'

We write a function to extract these features.

In [4]:
# Function to extract features from an audio file
def extract_features(audio_file, max_length=174):
    # Read the audio file
    sampling_rate, audio_data = wavfile.read(audio_file)
    
    # Ensure the audio is mono (if stereo, take the first channel)
    if len(audio_data.shape) > 1:
        audio_data = audio_data[:, 0]

    # Convert audio data to floating-point representation
    audio_data = audio_data.astype(float)

    # Extract MFCC features 
    mfcc = librosa.feature.mfcc(y=audio_data, sr=sampling_rate, n_mfcc=13)

    # Pad or truncate MFCCs to a fixed length
    if mfcc.shape[1] < max_length:
        pad_width = max_length - mfcc.shape[1]
        mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')
    else:
        mfcc = mfcc[:, :max_length]

    return mfcc
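
To see what the padding and truncation step does, here is a minimal sketch on synthetic arrays rather than real audio. The helper name `pad_or_truncate` and the toy clip lengths are assumptions made for illustration; the logic mirrors the last step of `extract_features`.

```python
import numpy as np

def pad_or_truncate(mfcc, max_length=174):
    """Force the time axis (axis 1) of an (n_mfcc, n_frames) array to max_length."""
    n_frames = mfcc.shape[1]
    if n_frames < max_length:
        # Append zero-valued frames on the right of the time axis only
        return np.pad(mfcc, ((0, 0), (0, max_length - n_frames)), mode='constant')
    # Keep the first max_length frames and drop the rest
    return mfcc[:, :max_length]

short_clip = np.ones((13, 100))  # hypothetical clip shorter than 174 frames
long_clip = np.ones((13, 300))   # hypothetical clip longer than 174 frames

padded = pad_or_truncate(short_clip)
truncated = pad_or_truncate(long_clip)
print(padded.shape, truncated.shape)
```

Every clip therefore ends up with the same shape, which is what lets `np.array(features)` build a single 3D array later on.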

All that remains is to apply the function to our files! I chose 13 MFCC coefficients per frame, somewhat arbitrarily.

In [5]:
# Create empty lists to store features and labels
features = []
labels = []

# Process "yes" audio files
for filename in os.listdir(yes_audio_dir):
    if filename.endswith('.wav'):
        audio_file = os.path.join(yes_audio_dir, filename)
        mfccs = extract_features(audio_file)
        features.append(mfccs)
        labels.append(1)  # 1 for "yes"

# Process "no" audio files
for filename in os.listdir(no_audio_dir):
    if filename.endswith('.wav'):
        audio_file = os.path.join(no_audio_dir, filename)
        mfccs = extract_features(audio_file)
        features.append(mfccs)
        labels.append(0)  # 0 for "no"

# Convert lists to numpy arrays
X = np.array(features)
y = np.array(labels)

The data is now labeled: 1 for "yes" and 0 for "no". Let's turn to the model.

# II- The model
We will use the `keras` library to access the models available in `TensorFlow`.

Before that, let's try training a model ourselves with a classic architecture.


In [7]:
X.shape

(7985, 13, 174)

In [12]:
import tensorflow as tf
from tensorflow.keras import layers, models

num_frames = 174
num_mfcc = 13
# Define the CNN model
model = models.Sequential()

model.add(layers.Conv1D(32, 3, activation='relu', input_shape=(num_frames, num_mfcc)))  # Adjust input shape
model.add(layers.MaxPooling1D(2))
model.add(layers.Conv1D(64, 3, activation='relu'))
model.add(layers.MaxPooling1D(2))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))  # Two output classes ("yes" and "no")

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # Use 'categorical_crossentropy' if one-hot encoding labels
              metrics=['accuracy'])

# Print the model summary
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_4 (Conv1D)           (None, 172, 32)           1280      
                                                                 
 max_pooling1d_4 (MaxPooling  (None, 86, 32)           0         
 1D)                                                             
                                                                 
 conv1d_5 (Conv1D)           (None, 84, 64)            6208      
                                                                 
 max_pooling1d_5 (MaxPooling  (None, 42, 64)           0         
 1D)                                                             
                                                                 
 flatten_2 (Flatten)         (None, 2688)              0         
                                                                 
 dense_4 (Dense)             (None, 128)              
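
The shapes and parameter counts printed in the summary can be checked by hand. The sketch below uses the usual formulas for a 1D convolution with "valid" padding and stride 1, and for non-overlapping max pooling. The helper names are made up for this illustration.

```python
def conv1d_out_len(length, kernel_size):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the input
    return length - kernel_size + 1

def pool_out_len(length, pool_size):
    # Non-overlapping pooling drops any incomplete trailing window
    return length // pool_size

def conv1d_params(in_channels, filters, kernel_size):
    # kernel_size * in_channels weights per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

after_conv1 = conv1d_out_len(174, 3)
after_pool1 = pool_out_len(after_conv1, 2)
after_conv2 = conv1d_out_len(after_pool1, 3)
after_pool2 = pool_out_len(after_conv2, 2)
flattened = after_pool2 * 64  # 64 filters in the second convolution

print(after_conv1, after_pool1, after_conv2, after_pool2, flattened)
print(conv1d_params(13, 32, 3), conv1d_params(32, 64, 3))
```

The values match the `Output Shape` and `Param #` columns for the two convolutional blocks and the flatten layer.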

Now we can prepare the data for training and validation.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Reshape the data to two dimensions (flatten along the second and third dimensions)
X = X.reshape(X.shape[0], -1)

# Standardize the features (mean=0, std=1)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Reshape X back to its original shape (if needed)
# X = X.reshape(X.shape[0], num_frames, num_mfcc)

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_valid shape:", X_valid.shape)

X_train shape: (6388, 2262)
X_valid shape: (1597, 2262)
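
The cell above fits the scaler on the whole dataset before splitting, so statistics from the validation clips leak into the preprocessing. A common alternative is to compute the mean and standard deviation on the training rows only and reuse them for validation. Here is a minimal NumPy sketch of that idea on random stand-in data; the array sizes and the `standardize` helper are assumptions, not the notebook's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 8))  # stand-in for flattened MFCCs

# Shuffle the row indices, then hold out 20% of the rows for validation
indices = rng.permutation(len(X_demo))
n_valid = int(0.2 * len(X_demo))
valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

# Statistics come from the training rows only
mean = X_demo[train_idx].mean(axis=0)
std = X_demo[train_idx].std(axis=0)
std[std == 0] = 1.0  # avoid dividing by zero for constant features

def standardize(rows):
    return (rows - mean) / std

X_train_demo = standardize(X_demo[train_idx])
X_valid_demo = standardize(X_demo[valid_idx])
print(X_train_demo.shape, X_valid_demo.shape)
```

With scikit-learn, the same effect is obtained by calling `fit_transform` on the training set and `transform` on the validation set.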


We can now start training!

# III- Training

In [10]:
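# Restore the (num_mfcc, num_frames) layout produced by extract_features,
# then move time to axis 1: Conv1D expects (batch, steps, channels)
X_train = X_train.reshape(X_train.shape[0], num_mfcc, num_frames).transpose(0, 2, 1)
X_valid = X_valid.reshape(X_valid.shape[0], num_mfcc, num_frames).transpose(0, 2, 1)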
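
Because `extract_features` returns arrays of shape `(num_mfcc, num_frames)` while `Conv1D` expects `(steps, channels)`, it is worth checking how a plain `reshape` differs from a transpose. The sketch below uses toy sizes (3 coefficients, 4 frames) instead of 13 and 174, purely for readability.

```python
import numpy as np

num_mfcc, num_frames = 3, 4  # toy sizes, assumed for illustration

# One sample laid out like the output of extract_features: (num_mfcc, num_frames)
sample = np.arange(num_mfcc * num_frames).reshape(num_mfcc, num_frames)
flat = sample.reshape(-1)  # what X.reshape(X.shape[0], -1) does to each sample

# Reshaping straight to (num_frames, num_mfcc) keeps memory order,
# so each row holds consecutive time values of a single coefficient
reshaped = flat.reshape(num_frames, num_mfcc)

# Restoring (num_mfcc, num_frames) first and then transposing
# gives one row per frame, with all coefficients of that frame
transposed = flat.reshape(num_mfcc, num_frames).T

print(reshaped[0])    # [0 1 2]: first three frames of coefficient 0
print(transposed[0])  # [0 4 8]: coefficients 0, 1, 2 of frame 0
```

Both arrays have shape `(num_frames, num_mfcc)`, but only the transposed one keeps each row aligned with a single time frame.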

In [15]:
# Train the model
history = model.fit(
    X_train, y_train,
    epochs=10,  
    batch_size=32,  # classic batch size
    validation_data=(X_valid, y_valid),
    verbose=2  # 2 = one line per epoch (1 = progress bar, 0 = silent)
)

Epoch 1/10
200/200 - 2s - loss: 0.0162 - accuracy: 0.9944 - val_loss: 0.1587 - val_accuracy: 0.9631 - 2s/epoch - 9ms/step
Epoch 2/10
200/200 - 1s - loss: 0.0096 - accuracy: 0.9966 - val_loss: 0.1606 - val_accuracy: 0.9637 - 1s/epoch - 7ms/step
Epoch 3/10
200/200 - 2s - loss: 0.0101 - accuracy: 0.9970 - val_loss: 0.1776 - val_accuracy: 0.9643 - 2s/epoch - 8ms/step
Epoch 4/10
200/200 - 1s - loss: 0.0123 - accuracy: 0.9950 - val_loss: 0.2003 - val_accuracy: 0.9562 - 1s/epoch - 7ms/step
Epoch 5/10
200/200 - 2s - loss: 0.0150 - accuracy: 0.9947 - val_loss: 0.1783 - val_accuracy: 0.9537 - 2s/epoch - 8ms/step
Epoch 6/10
200/200 - 2s - loss: 0.0050 - accuracy: 0.9984 - val_loss: 0.1929 - val_accuracy: 0.9662 - 2s/epoch - 8ms/step
Epoch 7/10
200/200 - 2s - loss: 0.0012 - accuracy: 1.0000 - val_loss: 0.2150 - val_accuracy: 0.9631 - 2s/epoch - 8ms/step
Epoch 8/10
200/200 - 2s - loss: 3.7216e-04 - accuracy: 1.0000 - val_loss: 0.2247 - val_accuracy: 0.9643 - 2s/epoch - 8ms/step
Epoch 9/10
200/200 -
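
Two things can be read off the log above: the `200/200` step counter and the trend of the validation loss. The short sketch below recomputes the step count from the training set size and looks for the epoch with the lowest `val_loss` among the eight complete lines shown.

```python
import math

n_train, batch_size = 6388, 32
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller

# val_loss values copied from epochs 1 to 8 of the log above
val_losses = [0.1587, 0.1606, 0.1776, 0.2003, 0.1783, 0.1929, 0.2150, 0.2247]
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1

print(steps_per_epoch, best_epoch)
```

In this log the validation loss is lowest at the first epoch and drifts upward while the training loss keeps falling, which usually points to overfitting.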

We can now evaluate the model's performance:

In [14]:
# Evaluate the model on the validation set
val_loss, val_accuracy = model.evaluate(X_valid, y_valid, verbose=0)
print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {val_accuracy*100:.2f}%")

Validation Loss: 0.1661
Validation Accuracy: 95.62%
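
Accuracy alone does not say which class gets misclassified. As a sketch, assuming class probabilities shaped like the output of the softmax layer (the values below are invented for illustration), predicted labels and a confusion matrix can be derived as follows.

```python
import numpy as np

# Hypothetical softmax outputs for five clips: column 0 = "no", column 1 = "yes"
probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.60, 0.40],
                  [0.30, 0.70],
                  [0.55, 0.45]])
y_true = np.array([0, 1, 1, 1, 0])  # hypothetical ground-truth labels

# The predicted class is the column with the highest probability
y_pred = probs.argmax(axis=1)
accuracy = (y_pred == y_true).mean()

# 2x2 confusion matrix: rows = true class, columns = predicted class
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

print(accuracy)
print(confusion)
```

On real data, `probs` would come from `model.predict(X_valid)` and `y_true` from `y_valid`.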


This is quite satisfactory: about 96% of the validation clips are classified correctly. We could go further and try data augmentation, but the results are already very good.
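
For reference, data augmentation on audio often means perturbing the waveform before feature extraction. The sketch below shows two simple transformations, additive noise and a random time shift, applied to a synthetic tone. The function names, the noise level and the 16 kHz sampling rate are all assumptions; nothing here was run on the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(signal, noise_level=0.005):
    """Add Gaussian noise scaled to the signal's peak amplitude."""
    noise = rng.normal(0.0, 1.0, size=signal.shape)
    return signal + noise_level * np.max(np.abs(signal)) * noise

def time_shift(signal, max_shift):
    """Shift the waveform by a random offset, filling the gap with silence."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(signal)
    if shift > 0:
        shifted[shift:] = signal[:-shift]
    elif shift < 0:
        shifted[:shift] = signal[-shift:]
    else:
        shifted[:] = signal
    return shifted

# Stand-in for one second of audio at 16 kHz (a 440 Hz tone, not real speech)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)

augmented = time_shift(add_noise(waveform), max_shift=1600)
print(augmented.shape)
```

Each augmented waveform would then go through the same MFCC extraction as the original clips, keeping the label unchanged.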

# IV- Going further

Several avenues can be explored:
- It would be interesting to train directly on the raw audio signal, without extracting features first.
- We can also try a brute-force approach and extract more features.
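
For the first avenue, the raw waveform would need to be scaled and brought to a fixed length before it can feed a `Conv1D` layer. The following is a minimal sketch under stated assumptions: 16-bit PCM input, a target of 16000 samples, and a hypothetical helper named `prepare_waveform`.

```python
import numpy as np

def prepare_waveform(audio_data, target_length=16000):
    """Scale 16-bit PCM samples to [-1, 1) and force a fixed number of samples."""
    samples = audio_data.astype(np.float32) / 32768.0
    if len(samples) < target_length:
        # Pad with silence at the end
        samples = np.pad(samples, (0, target_length - len(samples)), mode='constant')
    else:
        samples = samples[:target_length]
    # Conv1D expects (steps, channels), so add a single-channel axis
    return samples[:, np.newaxis]

# Synthetic 16-bit clip of 12000 samples standing in for a real recording
fake_clip = (np.sin(np.linspace(0, 100, 12000)) * 20000).astype(np.int16)
model_input = prepare_waveform(fake_clip)
print(model_input.shape)
```

The input shape of the first convolution would then be `(target_length, 1)` instead of `(num_frames, num_mfcc)`.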

Let's test the second approach, since it is very quick to set up.

In [26]:
# Function to extract features from an audio file
def extract_features20(audio_file, max_length=174):
    # Read the audio file
    sampling_rate, audio_data = wavfile.read(audio_file)
    
    # Ensure the audio is mono (if stereo, take the first channel)
    if len(audio_data.shape) > 1:
        audio_data = audio_data[:, 0]

    # Convert audio data to floating-point representation
    audio_data = audio_data.astype(float)

    # Extract MFCC features 
    mfcc = librosa.feature.mfcc(y=audio_data, sr=sampling_rate, n_mfcc=20)

    # Pad or truncate MFCCs to a fixed length
    if mfcc.shape[1] < max_length:
        pad_width = max_length - mfcc.shape[1]
        mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')
    else:
        mfcc = mfcc[:, :max_length]

    return mfcc

# Create empty lists to store features and labels
features = []
labels = []

# Process "yes" audio files
for filename in os.listdir(yes_audio_dir):
    if filename.endswith('.wav'):
        audio_file = os.path.join(yes_audio_dir, filename)
        mfccs = extract_features20(audio_file)
        features.append(mfccs)
        labels.append(1)  # 1 for "yes"

# Process "no" audio files
for filename in os.listdir(no_audio_dir):
    if filename.endswith('.wav'):
        audio_file = os.path.join(no_audio_dir, filename)
        mfccs = extract_features20(audio_file)
        features.append(mfccs)
        labels.append(0)  # 0 for "no"

# Convert lists to numpy arrays
X = np.array(features)
y = np.array(labels)

We simply rerun the previous code.

In [27]:
# Reshape the data to two dimensions (flatten along the second and third dimensions)
X = X.reshape(X.shape[0], -1)

# Standardize the features (mean=0, std=1)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Reshape X back to its original shape (if needed)
# X = X.reshape(X.shape[0], num_frames, num_mfcc)

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_valid shape:", X_valid.shape)

X_train shape: (6388, 3480)
X_valid shape: (1597, 3480)


In [28]:
num_frames = 174  # You can adjust this if needed
num_mfcc = 20    # Set to 20 for 20 MFCC features

# Restore the (num_mfcc, num_frames) layout produced by extract_features20,
# then move time to axis 1: Conv1D expects (batch, steps, channels)
X_train = X_train.reshape(X_train.shape[0], num_mfcc, num_frames).transpose(0, 2, 1)
X_valid = X_valid.reshape(X_valid.shape[0], num_mfcc, num_frames).transpose(0, 2, 1)

# Define the CNN model
model = models.Sequential()

model.add(layers.Conv1D(32, 3, activation='relu', input_shape=(num_frames, num_mfcc)))
model.add(layers.MaxPooling1D(2))
model.add(layers.Conv1D(64, 3, activation='relu'))
model.add(layers.MaxPooling1D(2))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))  # Two output classes ("yes" and "no")

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # Use 'categorical_crossentropy' if one-hot encoding labels
              metrics=['accuracy'])

# Print the model summary
model.summary()

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=10,  
    batch_size=32,  
    validation_data=(X_valid, y_valid),
    verbose=2  
)

# Evaluate the model on the validation set
val_loss, val_accuracy = model.evaluate(X_valid, y_valid, verbose=0)
print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {val_accuracy*100:.2f}%")


Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_10 (Conv1D)          (None, 172, 32)           1952      
                                                                 
 max_pooling1d_10 (MaxPoolin  (None, 86, 32)           0         
 g1D)                                                            
                                                                 
 conv1d_11 (Conv1D)          (None, 84, 64)            6208      
                                                                 
 max_pooling1d_11 (MaxPoolin  (None, 42, 64)           0         
 g1D)                                                            
                                                                 
 flatten_5 (Flatten)         (None, 2688)              0         
                                                                 
 dense_10 (Dense)            (None, 128)              

# Conclusion

With a convolutional neural network of very basic structure, features extracted from audio files are enough to reliably tell a spoken `Yes` from a spoken `No`. However, increasing the number of extracted features does not necessarily improve the model's performance, typically beyond a certain point.

Other avenues could still be explored, but this work is meant as an introduction to models for audio analysis.
