# Exploratory Analysis

In this section we take a first look at the **UrbanSound8K** dataset to understand the structure and contents of the downloaded data. Before continuing, make sure that the **setup.sh** script ran successfully and that the ***UrbanSound8K*** folder is located inside the `data` folder.
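
A quick way to confirm that precondition is to check for the files the notebook reads later. The sketch below is only illustrative: the helper name `find_missing_paths` is hypothetical, and it assumes the layout used elsewhere in this notebook (`metadata/UrbanSound8K.csv` plus `audio/fold1` through `audio/fold10`).

```python
from pathlib import Path


def find_missing_paths(base_path="./data/UrbanSound8K"):
    """Return the expected dataset paths that do not exist under base_path."""
    base = Path(base_path)
    expected = [base / "metadata" / "UrbanSound8K.csv"]
    # Audio paths are built later as audio/fold<N>/<file>; folds 1-10 are assumed.
    expected += [base / "audio" / f"fold{i}" for i in range(1, 11)]
    return [str(path) for path in expected if not path.exists()]


missing = find_missing_paths()
if missing:
    print("Missing paths (did setup.sh finish?):")
    for path in missing:
        print("  ", path)
else:
    print("Dataset layout looks complete.")
```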

The dataset contains metadata about audio files classified into different categories of urban sounds. Below we explore the main characteristics of the dataframe, including its columns, the number of records, the data types, and descriptive statistics.


In [None]:
# Importing necessary libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
import matplotlib.pyplot as plt  # For plotting and visualization
import librosa  # For audio processing
import librosa.display  # For waveform and spectrogram plots (waveshow, specshow)
from IPython.display import Audio  # For audio playback in Jupyter
import glob  # For file path pattern matching
from sklearn.preprocessing import LabelEncoder  # For encoding labels
import tensorflow as tf  # For deep learning models
import os  # For operating system interactions

# Configure matplotlib to display plots inline in the notebook
%matplotlib inline

In [None]:
# Check GPU availability
# This code aims to verify if there are devices available for processing,
# specifically GPUs, as this model is large and the data to process (audio) is also large.
# Using a GPU can significantly accelerate the training of the model.
print("Available devices:")
print(tf.config.list_physical_devices())

In [None]:
# Load the data
# Define the base path where the audio files and metadata are located
FILES_PATH = "./data/UrbanSound8K"  # Path to the audio files

# Read the CSV file containing the metadata of the UrbanSound8K dataset
# This file includes information such as file name, sound class, duration, etc.
dataframe_audios = pd.read_csv(
    f"{FILES_PATH}/metadata/UrbanSound8K.csv"
)  # UrbanSound8K provides this file upon downloading the dataset

# Display the first few rows of the DataFrame to inspect its content
dataframe_audios.head()

In [None]:
# Add a new column "audio_path" to the dataframe
# This column contains the full file path for each audio file in the dataset.
# The path is constructed using the base path (FILES_PATH), the fold number, and the slice file name.
dataframe_audios["audio_path"] = dataframe_audios.apply(
    lambda row: os.path.join(
        FILES_PATH, "audio", f"fold{row['fold']}", row["slice_file_name"]
    ),
    axis=1,
)

# Display the first few rows of the updated dataframe to verify the new column
dataframe_audios.head()


## Dataframe analysis

In [None]:
# Display basic information about the dataframe
# The `info()` method provides a concise summary of the dataframe, including the number of entries,
# column names, non-null counts, and data types. This helps to understand the structure of the dataset.
dataframe_audios.info()

# Generate descriptive statistics for the dataframe
# The `describe()` method computes summary statistics for numerical columns, such as count, mean,
# standard deviation, min, max, and percentiles. This is useful for understanding the distribution
# and range of the data.
dataframe_audios.describe()

In [None]:
# Display descriptive statistics for the dataframe
# The `describe()` method provides a summary of statistics for both numerical and categorical columns.
# By using `include="all"`, it includes all columns, regardless of their data type.
# This helps to understand the distribution, central tendency, and spread of the data,
# as well as unique values and frequency for categorical columns.
dataframe_audios.describe(include="all")

In [None]:
# Display unique classes and their counts
# The `unique()` method is used to list all unique categories in the "class" column.
# This provides an overview of the different sound classes present in the dataset.
print("Unique categories:", dataframe_audios["class"].unique())

# The `value_counts()` method counts the occurrences of each category in the "class" column.
# This helps to understand the distribution of samples across different sound classes.
print("Count per class:\n", dataframe_audios["class"].value_counts())

In [None]:
# Visualize the distribution of classes in the dataset
# The `value_counts()` method counts the occurrences of each unique value in the "class" column.
# This provides an overview of how many samples belong to each sound class.
# The `plot()` method is used to create a bar chart to visualize this distribution.

dataframe_audios["class"].value_counts().plot(kind="bar", figsize=(10, 5))

# Add a title and labels to the plot for better understanding
plt.title("Class Distribution")  # Title of the plot
plt.ylabel("Count")  # Label for the y-axis
plt.xlabel("Class")  # Label for the x-axis

# Display the plot
plt.show()

In [None]:
# Calculate the duration of each audio clip
# The "duration" column is computed as the difference between the "end" and "start" columns,
# which represent the start and end times of each audio clip in seconds.
dataframe_audios["duration"] = dataframe_audios["end"] - dataframe_audios["start"]

# Plot the distribution of audio durations
# A histogram is created to visualize the distribution of audio clip durations.
# The x-axis represents the duration in seconds, and the y-axis represents the frequency of clips.
dataframe_audios["duration"].plot(kind="hist", bins=50, figsize=(10, 5))

# Add a title and labels to the plot for better understanding
plt.title("Duration Distribution")  # Title of the plot
plt.xlabel("Duration (seconds)")  # Label for the x-axis

# Display the plot
plt.show()

### Testing audio loading and visualization

In this section we load one audio file from the dataset and visualize it using the audio processing tools in the `librosa` library. The `start` and `end` times in `dataframe_audios` locate each clip within its original recording; the files in the dataset are already cut to that interval, so the clip is loaded and plotted in full. This lets us examine the characteristics of the selected audio in more detail.

In [None]:
# Load and play an audio file to ensure the audio data is correctly loaded and audible.
# This step is crucial for verifying the integrity of the audio files in the dataset.

# Select an audio file (e.g., the first row)
audio_file_path = dataframe_audios["audio_path"].iloc[0]  # Adjust the path if necessary
audio_data, sample_rate = librosa.load(audio_file_path, sr=None)  # Load audio

# Play the audio
Audio(data=audio_data, rate=sample_rate)

In [None]:
# Visualize the waveform and spectrogram of the audio file
# This step helps us understand the time-domain and frequency-domain characteristics of the audio data.

# Plot the waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(audio_data, sr=sample_rate)
plt.title("Waveform of the Audio")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

# Plot the spectrogram
# The spectrogram provides a visual representation of the frequency content of the audio over time.
spectrogram = librosa.amplitude_to_db(np.abs(librosa.stft(audio_data)), ref=np.max)
plt.figure(figsize=(12, 4))
librosa.display.specshow(spectrogram, sr=sample_rate, x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
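
To make the decibel scaling in the spectrogram less opaque, here is a minimal NumPy sketch of the conversion. It mirrors what `librosa.amplitude_to_db(..., ref=np.max)` is understood to do with its default settings; the function name `amplitude_to_db_sketch` and the default values `amin=1e-5` and `top_db=80.0` are assumptions, not a copy of the library code.

```python
import numpy as np


def amplitude_to_db_sketch(magnitude, amin=1e-5, top_db=80.0):
    """Convert STFT magnitudes to decibels relative to the loudest bin."""
    magnitude = np.abs(magnitude)
    reference = max(amin, float(magnitude.max()))
    # 20 * log10 because the input is an amplitude, not a power.
    decibels = 20.0 * np.log10(np.maximum(magnitude, amin)) - 20.0 * np.log10(reference)
    # Clip everything quieter than top_db below the peak.
    return np.maximum(decibels, decibels.max() - top_db)


example = np.array([[1.0, 0.1], [0.01, 0.0]])  # Made-up magnitudes
print(amplitude_to_db_sketch(example))
```

The loudest bin maps to 0 dB, each tenfold drop in amplitude costs 20 dB, and silent bins are clipped at 80 dB below the peak.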

# Data preparation


In [None]:
# Encode class labels into numerical format
# This step is necessary to prepare the data for machine learning models, which require numerical inputs.
label_encoder = LabelEncoder()
dataframe_audios["encoded_label"] = label_encoder.fit_transform(
    dataframe_audios["class"]
)  # Add a new column 'encoded_label' with numerical IDs for each class
number_of_classes = len(
    dataframe_audios["class"].unique()
)  # Calculate the total number of unique sound classes
number_of_classes
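
As a rough illustration of what the encoding step produces, the sketch below assigns integer IDs following the sorted order of the distinct labels, which is how `LabelEncoder` is understood to behave. The helper `encode_labels` and the sample labels are hypothetical and only meant to show the mapping and its inverse.

```python
def encode_labels(labels):
    """Map each distinct label to an integer ID assigned in sorted order."""
    classes = sorted(set(labels))
    class_to_id = {name: index for index, name in enumerate(classes)}
    return [class_to_id[label] for label in labels], classes


sample_labels = ["siren", "dog_bark", "siren", "jackhammer"]
encoded, classes = encode_labels(sample_labels)
print(encoded)                         # Integer IDs
print(classes)                         # Position in this list is the ID
print([classes[i] for i in encoded])   # Decoding recovers the original labels
```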

In [None]:
# Split the dataset into training and testing sets in a stratified manner
# This ensures that the distribution of classes is preserved in both sets, which is crucial for balanced training and evaluation.

from sklearn.model_selection import train_test_split

# Features: paths to audio files
audio_file_paths = dataframe_audios["audio_path"]

# Target: encoded class labels
encoded_labels = dataframe_audios["encoded_label"]

# Perform the stratified split
train_audio_paths, test_audio_paths, train_labels, test_labels = train_test_split(
    audio_file_paths,
    encoded_labels,
    test_size=0.3,
    random_state=42,
    stratify=encoded_labels,
)
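
One way to confirm that stratification did what was intended is to compare the class shares of the two splits. The helpers below are a hypothetical sketch using toy labels; in the notebook they could be called as `max_proportion_gap(train_labels, test_labels)`.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(labels_a, labels_b):
    """Largest absolute difference in class share between two label sets."""
    share_a = class_proportions(labels_a)
    share_b = class_proportions(labels_b)
    all_labels = set(share_a) | set(share_b)
    return max(abs(share_a.get(l, 0.0) - share_b.get(l, 0.0)) for l in all_labels)


# Toy labels standing in for train_labels and test_labels (70/30 split).
toy_train = [0] * 70 + [1] * 35
toy_test = [0] * 30 + [1] * 15
print(max_proportion_gap(toy_train, toy_test))
```

A gap close to zero suggests that both splits have nearly the same class distribution.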


## Extracting audio features

In this module we focus on extracting relevant features from the audio files. Specifically, we use Mel-frequency cepstral coefficients (MFCCs) to represent each audio file as a numeric tensor. MFCCs are widely used in audio signal processing because they capture important information about the spectral content of a sound, which makes them well suited to audio classification and recognition tasks.

The goal is to implement functions that compute the MFCCs of each audio file and guarantee that all tensors have a fixed length, applying padding or truncation as needed. This keeps the data in a consistent format for use in deep learning models.
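
The fixed-length step can be sketched in isolation with NumPy. The helper name `pad_or_truncate` and the all-ones arrays are illustrative stand-ins for real MFCC matrices; 13 coefficients and 174 frames are the defaults used in this notebook.

```python
import numpy as np


def pad_or_truncate(features, target_length):
    """Return features with exactly target_length columns (time frames)."""
    num_frames = features.shape[1]
    if num_frames < target_length:
        # Append zero-valued frames at the end of the time axis.
        padding = ((0, 0), (0, target_length - num_frames))
        return np.pad(features, padding, mode="constant")
    # Keep only the first target_length frames.
    return features[:, :target_length]


short_clip = np.ones((13, 100))  # Stand-in for the MFCCs of a short clip
long_clip = np.ones((13, 300))   # Stand-in for the MFCCs of a long clip
print(pad_or_truncate(short_clip, 174).shape)
print(pad_or_truncate(long_clip, 174).shape)
```

Zero padding at the end keeps the original frames aligned at the start of the tensor, while truncation discards whatever comes after the first 174 frames.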

In [None]:
def extract_mfcc(audio_file_path, num_mfcc=13, max_padding_length=174):
    """
    Extract Mel-frequency cepstral coefficients (MFCC) from an audio file.

    This function ensures that all audio features have a fixed length by applying padding or truncation.
    This is essential for preparing the data for machine learning models, which require consistent input dimensions.

    Parameters:
    - audio_file_path (str): Path to the audio file.
    - num_mfcc (int): Number of MFCC features to extract.
    - max_padding_length (int): Maximum length for padding or truncation.

    Returns:
    - numpy.ndarray: MFCC features with fixed dimensions.
    """
    audio_signal, sample_rate = librosa.load(audio_file_path, sr=None)
    mfcc_features = librosa.feature.mfcc(
        y=audio_signal, sr=sample_rate, n_mfcc=num_mfcc
    )

    # Ensure fixed length by padding or truncating
    if mfcc_features.shape[1] < max_padding_length:
        mfcc_features = np.pad(
            mfcc_features,
            ((0, 0), (0, max_padding_length - mfcc_features.shape[1])),
            mode="constant",
        )
    else:
        mfcc_features = mfcc_features[:, :max_padding_length]

    return mfcc_features
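
Because the function loads each file at its native sample rate (`sr=None`), the number of frames per second depends on the file. The arithmetic below is a sketch that assumes librosa's default `hop_length=512` with centered framing; `expected_mfcc_frames` is a hypothetical helper, not part of the notebook.

```python
def expected_mfcc_frames(duration_seconds, sample_rate, hop_length=512):
    """Estimate the number of MFCC frames for a clip, assuming centered framing."""
    num_samples = int(duration_seconds * sample_rate)
    return 1 + num_samples // hop_length


for rate in (22050, 44100, 48000):
    frames = expected_mfcc_frames(4.0, rate)
    print(f"{rate} Hz -> {frames} frames for a 4 s clip")
```

Under these assumptions, 174 frames cover a full 4-second clip at 22050 Hz but only about half of it at 44100 Hz, so files with higher native rates would be truncated. Passing a fixed `sr` to `librosa.load` is one way to make the frame count comparable across files.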

In [None]:
# Apply MFCC extraction to the training and testing datasets
# This step is crucial to convert raw audio data into numerical features (MFCCs)
# that can be used as input for the CNN model.

X_train_mfcc_features = np.array([extract_mfcc(path) for path in train_audio_paths])
X_test_mfcc_features = np.array([extract_mfcc(path) for path in test_audio_paths])

# Add a channel dimension to the MFCC features
# This is necessary because the CNN model expects input data with a channel dimension.
X_train_mfcc_features = X_train_mfcc_features[
    ..., np.newaxis
]  # Shape: (samples, num_mfcc, max_padding_length, 1)
X_test_mfcc_features = X_test_mfcc_features[..., np.newaxis]

# CNN model

In this module we implement a convolutional neural network (CNN) to process the Mel-frequency cepstral coefficients (MFCCs) extracted from the audio files. CNNs are particularly effective for classifying spatially structured data, such as images or, in this case, spectral representations of audio.

The architecture includes convolutional layers to extract relevant features, normalization layers to stabilize training, and dense layers with regularization to reduce overfitting. Finally, the network uses an output layer with softmax activation to classify the audio clips into the different urban sound categories.
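
To show what the softmax output and an integer-label loss compute, here is a small NumPy sketch. It is an illustration under simple assumptions (made-up scores, three classes), not the Keras implementation, and the function names are chosen only for this example.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # For numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=-1, keepdims=True)


def sparse_categorical_crossentropy(probabilities, true_class_ids):
    """Mean negative log-probability assigned to the correct integer class."""
    rows = np.arange(len(true_class_ids))
    return float(-np.log(probabilities[rows, true_class_ids]).mean())


logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])  # Made-up scores, 3 classes
probabilities = softmax(logits)
print(probabilities.sum(axis=1))  # Each row sums to 1
print(sparse_categorical_crossentropy(probabilities, np.array([0, 2])))
```

The loss takes integer class IDs directly, which is why the encoded labels do not need to be one-hot encoded.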


In [None]:
from tensorflow.keras import layers, models
from tensorflow.keras import regularizers

# Define the CNN model for audio classification.
#
# Why we do this:
# - To classify audio samples into predefined categories using MFCC features.
# - BatchNormalization after each convolutional layer stabilizes and accelerates training.
# - L2 regularization on the dense layers reduces overfitting.

# Get the number of unique classes from the dataset
num_classes = number_of_classes

# Define the CNN model
audio_classification_model = models.Sequential(
    [
        # Input layer
        layers.Input(shape=(13, 174, 1)),  # Adjusted to match MFCC feature dimensions
        # Convolutional blocks
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        # Flatten + Dense layers
        layers.Flatten(),
        layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        # Output layer
        layers.Dense(
            num_classes, activation="softmax", kernel_regularizer=regularizers.l2(0.01)
        ),  # One neuron per class
    ]
)

# Compile the model
audio_classification_model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# Display the model summary
audio_classification_model.summary()
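
The shapes reported by the summary can be reproduced by hand. The sketch below assumes the Keras defaults for the layers above (no padding and stride 1 for the convolutions, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv2d_valid(shape, filters, kernel=3):
    """Output shape of a stride-1 convolution without padding."""
    height, width, _ = shape
    return (height - kernel + 1, width - kernel + 1, filters)


def max_pool(shape, pool=2):
    """Output shape of non-overlapping max pooling without padding."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


shape = (13, 174, 1)  # (num_mfcc, frames, channels), the model's input shape
for filters in (32, 64):
    shape = conv2d_valid(shape, filters)
    print("after conv:", shape)
    shape = max_pool(shape)
    print("after pool:", shape)

flattened_size = shape[0] * shape[1] * shape[2]
print("flattened:", flattened_size)
```

With only 13 MFCC rows, the height collapses to 1 after the second pooling layer, so a deeper stack with these settings would need padding or more coefficients.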

**MODEL TRAINING**

In this stage we train the audio classification model on the previously extracted Mel-frequency cepstral coefficients (MFCCs). Training can take considerable time depending on the processing capacity of the available GPU and the size of the dataset. Make sure you have the necessary resources before starting.

In [None]:
# Train the CNN model for audio classification
# Why we do this:
# - To optimize the model's weights using the training dataset.
# - To evaluate the model's performance on the validation dataset during training.
# - To monitor the learning process and adjust hyperparameters if necessary.

history = audio_classification_model.fit(
    X_train_mfcc_features,  # Training features (MFCCs)
    train_labels,  # Training labels (encoded)
    epochs=80,  # Number of training epochs (adjustable)
    batch_size=32,  # Batch size for gradient updates
    validation_data=(X_test_mfcc_features, test_labels),  # Validation data (the held-out test split)
)

In [None]:
# Learning curve
plt.figure(figsize=(12, 8))  # Use a larger figure
plt.plot(history.history["accuracy"], label="Train Accuracy")
plt.plot(history.history["val_accuracy"], label="Validation Accuracy")

# Annotate the plot with the final validation accuracy
final_val_accuracy = (
    history.history["val_accuracy"][-1] * 100
)  # Last val_accuracy value, as a percentage
plt.text(
    len(history.history["val_accuracy"]) - 1,  # x: last epoch
    history.history["val_accuracy"][-1],  # y: last val_accuracy value
    f"{final_val_accuracy:.2f}%",  # Label text
    fontsize=12,
    color="red",
    ha="right",
)

plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)  # Add a grid to make values easier to read
plt.show()

In [None]:
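# Evaluate the model on the test dataset
# Why we do this:
# - To measure the model's performance on data that was not used to update the weights.
#   (This split was also passed as validation_data during training, so it is not fully unseen.)
# - To calculate the test accuracy and loss for evaluation purposes.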
test_loss, test_accuracy = audio_classification_model.evaluate(
    X_test_mfcc_features, test_labels
)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

# Generate a confusion matrix to analyze model predictions
# Why we do this:
# - To visualize the performance of the model in terms of correctly and incorrectly classified samples.
# - To identify patterns of misclassification for further improvements.
from sklearn.metrics import confusion_matrix
import seaborn as sns

predicted_labels = np.argmax(
    audio_classification_model.predict(X_test_mfcc_features), axis=1
)
confusion_matrix_result = confusion_matrix(test_labels, predicted_labels)

plt.figure(figsize=(10, 8))
sns.heatmap(
    confusion_matrix_result,
    annot=True,
    fmt="d",
    xticklabels=label_encoder.classes_,
    yticklabels=label_encoder.classes_,
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
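
Beyond the heatmap, a confusion matrix can be summarized numerically. The sketch below uses made-up counts and assumes the scikit-learn convention of actual classes on rows and predicted classes on columns; the helper names are hypothetical. In the notebook the same functions could be applied to `confusion_matrix_result`.

```python
import numpy as np


def per_class_recall(confusion):
    """Fraction of each actual class (row) that was predicted correctly."""
    confusion = np.asarray(confusion, dtype=float)
    row_totals = confusion.sum(axis=1)
    # Avoid dividing by zero for classes with no samples.
    return np.divide(
        np.diag(confusion),
        row_totals,
        out=np.zeros_like(row_totals),
        where=row_totals > 0,
    )


def most_confused_pair(confusion):
    """Return (actual, predicted, count) for the largest off-diagonal entry."""
    off_diagonal = np.array(confusion, dtype=int)  # Copy, so the input is untouched
    np.fill_diagonal(off_diagonal, 0)
    actual, predicted = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
    return int(actual), int(predicted), int(off_diagonal[actual, predicted])


toy_confusion = [[8, 2, 0], [1, 9, 0], [3, 0, 7]]  # Made-up counts, 3 classes
print(per_class_recall(toy_confusion))
print(most_confused_pair(toy_confusion))
```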

# Saving and loading the trained model

This section shows how to save the trained model in the format recommended by Keras and how to load it later to make predictions or continue training. Saving the model lets you reuse it without retraining, which saves time and computational resources, and it keeps the model available for future analysis or deployment.

In [None]:
# Save the trained audio classification model
# Why we do this:
# - To persist the trained model for future use without retraining.
# - To ensure reproducibility and save computational resources.
audio_classification_model.save("audio_classification_model.keras")

# Load the saved audio classification model
# Why we do this:
# - To reuse the trained model for predictions or further training.
# - To validate that the saved model can be successfully restored.
loaded_audio_classification_model = tf.keras.models.load_model(
    "audio_classification_model.keras"
)


In [None]:
def predict_audio_class(audio_file_path):
    """
    Why we do this:
    - To predict the class of a given audio file using the trained CNN model.
    - To convert the MFCC features of the audio into a format compatible with the model.
    - To decode the predicted class index back into the original class label.
    """
    # Extract MFCC features and reshape for model input
    mfcc_features = extract_mfcc(audio_file_path)[np.newaxis, ..., np.newaxis]

    # Predict the class probabilities
    predictions = audio_classification_model.predict(mfcc_features)

    # Get the class index with the highest probability
    predicted_class_index = np.argmax(predictions)

    # Decode the class index to the original class label
    predicted_class_label = label_encoder.inverse_transform([predicted_class_index])[0]

    return predicted_class_label


# Example usage
example_audio_path = "./data/UrbanSound8K/audio/fold5/100032-3-0-0.wav"
print(f"Prediction: {predict_audio_class(example_audio_path)}")
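
A single predicted label hides how confident the model was. As a possible extension, the sketch below ranks the most probable classes from a probability vector shaped like the output of `model.predict`. The helper `top_k_classes` and the toy values are hypothetical; in the notebook, `label_encoder.classes_` could supply the class names.

```python
import numpy as np


def top_k_classes(probabilities, class_names, k=3):
    """Return the k most probable (class_name, probability) pairs, best first."""
    probabilities = np.asarray(probabilities).ravel()
    best_indices = np.argsort(probabilities)[::-1][:k]
    return [(class_names[i], float(probabilities[i])) for i in best_indices]


toy_names = ["air_conditioner", "dog_bark", "siren", "street_music"]
toy_probabilities = [[0.05, 0.60, 0.25, 0.10]]  # Shape (1, num_classes)
print(top_k_classes(toy_probabilities, toy_names, k=2))
```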

In [None]:
# TODO: review all the misclassifications the model made