# **2. Data augmentation**
---

## **Data loading**

In [27]:
import pandas as pd

df = pd.read_csv("full_multilabel_df.csv")
df.fillna(0, inplace=True)

In [28]:
index_out = df[df['database'] == 'ESC'].index.tolist()
# Drop indices from DataFrame
df.drop(index_out, inplace=True)
df.reset_index(drop=True, inplace=True)

In [29]:
df['database'].value_counts()

database
INTER1SP    6041
SMC         5129
EMS         2005
EW          1992
MESD         862
EmoFilm      359
Name: count, dtype: int64

In [30]:
df['new_emotion'].value_counts()

new_emotion
neutral      5581
disgust      2265
anger        1937
happiness    1910
sadness      1863
fear         1807
surprise     1025
Name: count, dtype: int64

## **Data split**

In [31]:
e_dbs = ['SMC', 'EMS', 'INTER1SP', 'EmoFilm']
m_dbs = ['EW', 'MESD']  # df['database'] uses the label 'EW' (presumably EmoWisconsin)

# Any database not listed in e_dbs is labelled 'mex'
df['accent'] = df['database'].apply(lambda x: 'spain' if x in e_dbs else 'mex')

# Optional: encode emotion, gender, and accent as dummy (0/1) variables
#y_emotion = pd.get_dummies(df["new_emotion"])  # columns: one per emotion class
#y_gender = pd.get_dummies(df["gender"])        # columns: female, male, child
#y_accent = pd.get_dummies(df["accent"])

# Gather the three targets in a single output DataFrame
y = df[["new_emotion", "gender", "accent"]]
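
The commented-out `pd.get_dummies` lines point at one-hot targets. A minimal sketch of that encoding on toy data follows; the `toy` frame and its values are hypothetical and only stand in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df["new_emotion"]
toy = pd.DataFrame({"new_emotion": ["fear", "neutral", "anger", "neutral"]})

# One column per class, 0/1 values; dtype=int keeps the output numeric
y_emotion = pd.get_dummies(toy["new_emotion"], dtype=int)
print(y_emotion)
```

Each row contains exactly one `1`, in the column of its class.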


In [32]:
from sklearn.model_selection import train_test_split
import numpy as np

# Get all available row positions
indices = np.arange(len(df))

# Create the 'set' column, initially empty
df["set"] = ""

# Split into train and test, stratified by emotion
train_idx, test_idx = train_test_split(
    indices, 
    test_size=0.2, 
    random_state=42, 
    shuffle=True, 
    stratify=df['new_emotion']
)

# Assign the split labels (positions match labels because the index was reset above)
df.loc[train_idx, "set"] = "train"
df.loc[test_idx, "set"] = "test"
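
One way to confirm that a stratified split kept the class proportions is to compare the per-split class shares. The sketch below uses a hypothetical helper, `class_share_by_set`, and toy data in place of `df`.

```python
import pandas as pd

def class_share_by_set(frame, label_col="new_emotion", set_col="set"):
    """Return class proportions with one column per split (columns sum to 1)."""
    return pd.crosstab(frame[label_col], frame[set_col], normalize="columns")

# Toy data: 20% 'fear' in both splits, as a stratified split would produce
toy = pd.DataFrame({
    "new_emotion": ["neutral"] * 8 + ["fear"] * 2 + ["neutral"] * 32 + ["fear"] * 8,
    "set": ["test"] * 10 + ["train"] * 40,
})
shares = class_share_by_set(toy)
print(shares)
```

With a correct stratification, the `train` and `test` columns should be nearly identical.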


## **Data augmentation**

In [33]:
df[df['set'] == 'train']['new_emotion'].value_counts()

new_emotion
neutral      4465
disgust      1812
anger        1549
happiness    1528
sadness      1490
fear         1446
surprise      820
Name: count, dtype: int64

### **Data augmentation test**

In [34]:
from IPython.display import Audio
from IPython import display
import librosa
import numpy as np

# Individual transformations
def noise(data, noise_factor=0.010):
    # Gaussian noise scaled to a random fraction of the maximum sample value
    noise_amp = np.random.uniform(0.001, noise_factor) * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data, rate=0.25):
    # Randomly slow down (1 - rate) or speed up (1 + rate)
    rate = np.random.choice([1 - rate, 1 + rate])
    return librosa.effects.time_stretch(data, rate=rate)

def pitch(data, sr, pitch_factor=0.7):
    # pitch_factor is the shift in semitones (n_steps)
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

# Main function: chains the three transformations
def probe_da(
    df,
    sr=16000,
    samples=5,
    noise_factor=0.015,
    stretch_rate=0.25,
    pitch_factor=0.7
):
    samples_idx = np.random.choice(len(df), samples)
    modified_audios = []

    for idx in samples_idx:
        sample_path = df["path"].iloc[idx]
        y, orig_sr = librosa.load(sample_path, sr=None, mono=True)

        # Resample to the target rate (mono is already enforced by librosa.load)
        if orig_sr != sr:
            y = librosa.resample(y, orig_sr=orig_sr, target_sr=sr)

        print("=== Original sample ===")
        print(df[["name", "new_emotion", "database"]].iloc[idx].to_markdown())
        display.display(Audio(y, rate=sr))

        # Apply the three chained modifications in the order noise, stretch, pitch
        y_aug = noise(y, noise_factor=noise_factor)
        y_aug = stretch(y_aug, rate=stretch_rate)
        y_aug = pitch(y_aug, sr=sr, pitch_factor=pitch_factor)

        # Keep the modified audio
        modified_audios.append((y_aug, sr))

        print("→ Modified audio (noise + stretch + pitch):")
        display.display(Audio(y_aug, rate=sr))

    return modified_audios


In [35]:
modified_audios = probe_da(
    df=df,
    samples=3,
    noise_factor=0.05,
    stretch_rate=0.005,
    pitch_factor=0.01
)

=== Original sample ===
|             | 9264     |
|:------------|:---------|
| name        | spf1f075 |
| new_emotion | fear     |
| database    | INTER1SP |


→ Modified audio (noise + stretch + pitch):


=== Original sample ===
|             | 8091              |
|:------------|:------------------|
| name        | 04f4d96c-fea6fa6c |
| new_emotion | disgust           |
| database    | SMC               |


→ Modified audio (noise + stretch + pitch):


=== Original sample ===
|             | 5242              |
|:------------|:------------------|
| name        | 56f5b5a8-e214f3e0 |
| new_emotion | neutral           |
| database    | SMC               |


→ Modified audio (noise + stretch + pitch):
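
To get a feel for how strong the stretch and pitch settings are, the arithmetic below converts them into relative changes. It is a rough sketch: `pitch_ratio` and `stretched_duration` are hypothetical helpers, assuming 12 semitones per octave (the `librosa.effects.pitch_shift` default) and that a stretch rate above 1 shortens the clip.

```python
def pitch_ratio(n_steps):
    """Frequency ratio for a shift of n_steps semitones (12 per octave)."""
    return 2.0 ** (n_steps / 12.0)

def stretched_duration(duration_s, rate):
    """Approximate duration after time-stretching by `rate` (rate > 1 is faster)."""
    return duration_s / rate

for n_steps in (0.7, 0.01):
    change = (pitch_ratio(n_steps) - 1.0) * 100
    print(f"pitch n_steps={n_steps}: frequency changes by {change:.3f}%")

for rate in (1 - 0.005, 1 + 0.005):
    print(f"stretch rate={rate}: a 2.0 s clip lasts {stretched_duration(2.0, rate):.3f} s")
```

With `stretch_rate=0.005` and `pitch_factor=0.01`, duration changes by about half a percent and frequency by well under a tenth of a percent, so the added noise is likely the most audible of the three effects.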


### **Data augmentation**

1. Per-class count

The number of samples per new_emotion in train_df is counted.

2. Selection and duplication

For each emotion with fewer than limit_per_emotion samples:

* Existing examples of that emotion are selected at random without replacement, so each original is augmented at most once.

* New augmented versions of those audios are generated.

3. Chained audio transformations

Each selected audio goes through, in order:

* noise() with noise_factor=0.05

* stretch() with rate=0.005

* pitch() with pitch_factor=0.01

4. Saving the augmented audio

The transformed audio is saved as: ```limit{N}samples/limit{N}samples_{name}_augmented.wav```


5. Building the final DataFrame

* train_df, the augmented rows, and test_df are concatenated.

* A new df_augmented with all examples is returned; the augmented column marks the synthetic ones.
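
The selection rule in steps 1 and 2 can be sketched without any audio. The helper `plan_augmentation` below is hypothetical; it assumes sample names are unique within a class, so the number of clips available for augmentation equals the class count.

```python
def plan_augmentation(train_counts, limit):
    """Return how many augmented clips each under-represented class receives."""
    plan = {}
    for emotion, count in train_counts.items():
        deficit = limit - count
        if deficit > 0:
            # Each original is used at most once, so the class can at most double
            plan[emotion] = min(deficit, count)
    return plan

# Train counts taken from the value_counts() output above
train_counts = {
    "neutral": 4465, "disgust": 1812, "anger": 1549, "happiness": 1528,
    "sadness": 1490, "fear": 1446, "surprise": 820,
}
plan = plan_augmentation(train_counts, limit=1528)
print(plan)
```

Because of the once-per-clip rule, a class with fewer than half of the limit cannot reach it in a single pass.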

In [36]:
import os
import pandas as pd
import librosa
import soundfile as sf
import numpy as np
from pathlib import Path

def noise(data, noise_factor=0.05):
    # Gaussian noise scaled to a random fraction of the maximum sample value
    noise_amp = np.random.uniform(0.001, noise_factor) * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data, rate=0.005):
    # Randomly slow down (1 - rate) or speed up (1 + rate)
    rate = np.random.choice([1 - rate, 1 + rate])
    return librosa.effects.time_stretch(data, rate=rate)

def pitch(data, sr, pitch_factor=0.01):
    # pitch_factor is the shift in semitones (n_steps)
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def augment_data_by_emotion(df, limit_per_emotion=1000, sr=16000):
    # Separate the training and test rows
    train_df = df[df["set"] == "train"].copy()
    test_df = df[df["set"] == "test"].copy()

    # Mark the originals
    train_df["augmented"] = False
    test_df["augmented"] = False

    # Create the output folder
    output_dir = Path(f"limit{limit_per_emotion}samples")
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=== Initial count per emotion (train) ===")
    initial_counts = train_df["new_emotion"].value_counts()
    print(initial_counts.to_string())

    augmented_rows = []
    already_augmented_names = set()

    for emotion, count in initial_counts.items():
        if count < limit_per_emotion:
            deficit = limit_per_emotion - count

            # Keep examples of this emotion that have not been augmented yet
            available_samples = train_df[
                (train_df["new_emotion"] == emotion) &
                (~train_df["name"].isin(already_augmented_names))
            ]

            # Each name can be augmented at most once
            num_to_augment = min(deficit, len(available_samples))

            # Skip the class if no un-augmented samples remain
            if num_to_augment == 0:
                continue

            selected = available_samples.sample(
                n=num_to_augment,
                replace=False,
                random_state=42
            ).copy()

            for _, row in selected.iterrows():
                orig_path = row["path"]
                name = row["name"]

                try:
                    # Load the audio as mono and resample to the target rate
                    y, orig_sr = librosa.load(orig_path, sr=None, mono=True)
                    if orig_sr != sr:
                        y = librosa.resample(y, orig_sr=orig_sr, target_sr=sr)

                    # Apply the chained transformations
                    y_aug = noise(y, noise_factor=0.05)
                    y_aug = stretch(y_aug, rate=0.005)
                    y_aug = pitch(y_aug, sr=sr, pitch_factor=0.01)

                    # Build the new name and path
                    new_name = f"{name}_augmented"
                    new_filename = output_dir / f"limit{limit_per_emotion}samples_{new_name}.wav"
                    sf.write(new_filename, y_aug, sr)

                    # Create the new record
                    new_row = row.copy()
                    new_row["path"] = str(new_filename)
                    new_row["augmented"] = True
                    new_row["name"] = new_name

                    augmented_rows.append(pd.DataFrame([new_row]))
                    already_augmented_names.add(name)

                except Exception as e:
                    print(f"Error with {orig_path}: {e}")

    # Combine the results
    if augmented_rows:
        augmented_df = pd.concat(augmented_rows, ignore_index=True)
    else:
        augmented_df = pd.DataFrame(columns=train_df.columns)

    df_augmented = pd.concat([train_df, augmented_df, test_df], ignore_index=True)

    print("\n=== Final augmentation ===")
    for emotion in sorted(initial_counts.index):
        original = initial_counts[emotion]
        added = (augmented_df["new_emotion"] == emotion).sum()
        total = original + added
        print(f"{emotion}: {original} + {added} → {total} samples")

    return df_augmented


In [37]:
LIMIT_DA = 1528

In [38]:
df_augmented = augment_data_by_emotion(df, limit_per_emotion=LIMIT_DA)


=== Initial count per emotion (train) ===
new_emotion
neutral      4465
disgust      1812
anger        1549
happiness    1528
sadness      1490
fear         1446
surprise      820

=== Final augmentation ===
anger: 1549 + 0 → 1549 samples
disgust: 1812 + 0 → 1812 samples
fear: 1446 + 82 → 1528 samples
happiness: 1528 + 0 → 1528 samples
neutral: 4465 + 0 → 4465 samples
sadness: 1490 + 38 → 1528 samples
surprise: 820 + 708 → 1528 samples


In [39]:
df_augmented[df_augmented['set'] == 'train']["new_emotion"].value_counts()

new_emotion
neutral      4465
disgust      1812
anger        1549
fear         1528
happiness    1528
sadness      1528
surprise     1528
Name: count, dtype: int64
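
Augmentation only raises the minority classes up to `LIMIT_DA`; the majority class is left untouched. A quick way to quantify the remaining imbalance is the ratio between the largest and smallest class, sketched here with a hypothetical helper `imbalance_ratio` and the train counts reported above.

```python
def imbalance_ratio(counts):
    """Ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())

before = {"neutral": 4465, "disgust": 1812, "anger": 1549, "happiness": 1528,
          "sadness": 1490, "fear": 1446, "surprise": 820}
after = {"neutral": 4465, "disgust": 1812, "anger": 1549, "happiness": 1528,
         "sadness": 1528, "fear": 1528, "surprise": 1528}

print(f"before: {imbalance_ratio(before):.2f}, after: {imbalance_ratio(after):.2f}")
```

The ratio drops from about 5.4 to about 2.9, so `neutral` is still almost three times larger than the smallest classes.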

## **Saving the DataFrame**

In [40]:
df_augmented.to_csv(f"df_augmented_{LIMIT_DA}_samples.csv", index=False)
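
Before relying on the saved file, it can be worth checking that augmented rows exist only in the training split and derive from training originals. The sketch below is one possible check; `check_augmented_rows` is a hypothetical helper and the toy frame only mimics the `name`, `set`, and `augmented` columns.

```python
import pandas as pd

def check_augmented_rows(frame):
    """Return True if augmented rows are train-only and derive from train originals."""
    aug = frame[frame["augmented"]]
    originals = frame[(frame["set"] == "train") & ~frame["augmented"]]
    # Recover the original name by removing the "_augmented" suffix
    base_names = aug["name"].str.replace(r"_augmented$", "", regex=True)
    in_train = (aug["set"] == "train").all()
    from_train = base_names.isin(set(originals["name"])).all()
    return bool(in_train and from_train)

toy = pd.DataFrame({
    "name": ["a", "b", "a_augmented", "c"],
    "set": ["train", "train", "train", "test"],
    "augmented": [False, False, True, False],
})
print(check_augmented_rows(toy))
```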

# **Extracting and saving features**
---

In [41]:
import torch
import torchaudio
from tqdm import tqdm
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import os
import numpy as np
import pandas as pd


def load_model(model_path, device="cuda"):
    # Load the processor and the model
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    model = Wav2Vec2Model.from_pretrained(model_path)
    
    model.to(device)
    model.eval()  # inference only
    
    return processor, model

def extract_features(df, processor=None, model=None, device='cuda', layer=6):
    """
    Extract mean-pooled features from one hidden layer of the model.
    Returns a dict that maps each row's "name" to a 1-D NumPy vector.
    """
    if model is None or processor is None:
        print("Error: a model and a processor are required to extract features.")
        return

    # Mean-pooled features, keyed by sample name
    features_dict = {}

    # Mean pooling over the time dimension: (batch, frames, dim) -> (batch, dim)
    def apply_mean_pooling(features):
        return torch.mean(features, dim=1)

    for index, row in tqdm(df.iterrows(), total=len(df), desc="Extracting features"):
        audio_path = row["path"]

        # Load the audio file; waveform has shape (channels, samples)
        waveform, sample_rate = torchaudio.load(audio_path)
        
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
            waveform = resampler(waveform)
           
        # Downmix to mono; for single-channel files this only drops the channel dimension
        waveform = waveform.mean(dim=0)

        # Preprocess the audio
        inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
        
        # Move the tensors to the device (GPU or CPU)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Make sure the input has shape (batch, samples)
        input_values = inputs['input_values'].view(inputs['input_values'].size(0), -1).to(device)
    
        # Run the model and keep the hidden states of every layer
        with torch.no_grad():
            with torch.amp.autocast(device_type=device, enabled=(device == 'cuda')):
                outputs = model(input_values, output_hidden_states=True)
                
                # Mean-pool the selected layer over time
                pooled_features = apply_mean_pooling(outputs.hidden_states[layer])

                # Drop the batch dimension and move to NumPy
                pooled_features = pooled_features.squeeze(0).cpu().numpy()

                # Store the pooled features
                features_dict[row["name"]] = pooled_features
                
    return features_dict


In [42]:
processor, model  = load_model("facebook/wav2vec2-large-xlsr-53-spanish")
feature_extractor = "w2v2_53esp"

Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53-spanish and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [43]:
len(df_augmented)

17216

In [44]:
features_dict = extract_features(df_augmented, processor, model)  # layer defaults to 6; pass layer=N to use another hidden layer

Extracting features:   0%|          | 0/17216 [00:00<?, ?it/s]

Extracting features: 100%|██████████| 17216/17216 [20:50<00:00, 13.77it/s]
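
The pooling step can be illustrated with NumPy alone. In this sketch a random array stands in for one hidden state of shape (batch, frames, hidden_size); the sizes 49 and 1024 are assumptions chosen to resemble roughly one second of 16 kHz audio through a large wav2vec 2.0 model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs.hidden_states[6]: (batch=1, frames=49, hidden_size=1024)
hidden = rng.normal(size=(1, 49, 1024))

# Average over the time axis, then drop the batch axis
pooled = hidden.mean(axis=1).squeeze(0)
print(pooled.shape)
```

Whatever the clip length, every audio file ends up as one fixed-size vector. In the Hugging Face output, index 0 of `hidden_states` is the embedding output, so index 6 corresponds to the sixth transformer layer.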


In [45]:
np.savez(f"df_augmented_{LIMIT_DA}_w2v2_53es_layer6.npz", **features_dict)
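
A saved `.npz` archive can later be read back and stacked into a feature matrix whose rows follow the DataFrame order. The sketch below uses a temporary file and two made-up clips; the names `clip_a` and `clip_b` and the 4-dimensional vectors are placeholders.

```python
import os
import tempfile

import numpy as np

# Placeholder features; real vectors would come from extract_features
features = {
    "clip_a": np.ones(4, dtype=np.float32),
    "clip_b": np.zeros(4, dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features_demo.npz")
    np.savez(path, **features)

    # Order of rows, e.g. taken from the "name" column of the DataFrame
    names = ["clip_b", "clip_a"]
    with np.load(path) as data:
        X = np.stack([data[name] for name in names])

print(X.shape)
```

Looking up each vector by name keeps the matrix aligned with the labels even if the archive stores the entries in a different order.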