# Notebook 1: Generate Augmented Audio Dataset

This notebook has a single, critical purpose: to create a large, static dataset of augmented audio files.

**Problem:** Training with on-the-fly augmentation is flexible, but it can be slow because the augmentations are recomputed on every training run.

**Solution:** We will pre-generate our augmented data. This notebook will read every audio file from our original `Audios para Treinamento/` directory, create several augmented versions of each, and save them as new `.wav` files in a new `Audios-Augmented/` directory.

The augmentation techniques will include:
- **Adding Noise**
- **Pitch Shifting**
- **Chunk Shuffling:** We slice the audio into chunks, shuffle them, and stitch them back together. This is designed to teach the model the difference between stationary noise (leaks), whose character should be largely unchanged by reordering, and transient sounds, whose timing is disrupted.
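
As a toy illustration of chunk shuffling, here is a minimal sketch on a ten-sample array rather than real audio. `toy_chunk_shuffle` and the seeded generator are illustrative names and are not part of this notebook's pipeline. The samples are reordered in blocks, but none are added or lost:

```python
import numpy as np

def toy_chunk_shuffle(signal, num_chunks, rng):
    """Split `signal` into `num_chunks` pieces, permute the pieces, and rejoin them."""
    chunks = np.array_split(signal, num_chunks)
    order = rng.permutation(len(chunks))  # random ordering of the chunk indices
    return np.concatenate([chunks[i] for i in order])

rng = np.random.default_rng(0)  # seeded only so the illustration is repeatable
signal = np.arange(10)
shuffled = toy_chunk_shuffle(signal, num_chunks=5, rng=rng)

# Same length and same set of samples; only the block order may differ.
print(len(shuffled) == len(signal))
print(sorted(shuffled.tolist()) == signal.tolist())
```

Within each chunk the original sample order is preserved, so short-term structure survives while long-range ordering does not.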

After running this notebook **once**, we will have a large, ready-to-use dataset for our modeling experiments, which will make the training process much faster and simpler.


## 1. Setup and Imports


In [1]:
%pip install librosa numpy tqdm soundfile


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





In [2]:
import os
import numpy as np
import librosa
from tqdm import tqdm
import random
import soundfile as sf

print("Imports complete.")


Imports complete.


## 2. Define File Paths and Augmentation Functions


In [3]:
# --- Configuration ---
SOURCE_DIR = 'Audios para Treinamento/'
DEST_DIR = 'Audios-Augmented/'
SAMPLE_RATE = 22050
AUGMENTATIONS_PER_FILE = 4 # How many new files to create for each original file

# --- Augmentation Functions ---
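def add_noise(audio, noise_factor=0.005):
    # Add zero-mean Gaussian noise scaled by noise_factor
    noise = np.random.randn(len(audio))
    return audio + noise_factor * noise

def pitch_shift(audio, sample_rate, n_steps=4):
    # Shift the pitch by n_steps semitones without changing the duration
    return librosa.effects.pitch_shift(y=audio, sr=sample_rate, n_steps=n_steps)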

def chunk_shuffle(audio, num_chunks=5):
    # Ensure there's audio to split
    if len(audio) == 0:
        return audio
    # Split the audio into N chunks
    chunks = np.array_split(audio, num_chunks)
    # Shuffle the list of chunks
    random.shuffle(chunks)
    # Concatenate them back together
    return np.concatenate(chunks)

# List of augmentation functions to choose from
augmentation_choices = [add_noise, pitch_shift, chunk_shuffle]

print("Configuration and functions defined.")


Configuration and functions defined.
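
Because the functions above use fixed defaults (`noise_factor=0.005`, `n_steps=4`), choosing `pitch_shift` twice for the same file produces two identical outputs. One possible way to add variety is to sample the parameters on each call. The sketch below is not part of the original pipeline: `sample_augmentation_params`, `add_noise_seeded`, the parameter ranges, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

def sample_augmentation_params(rng):
    """Draw one random parameter set for each augmentation (ranges are illustrative)."""
    return {
        "noise_factor": float(rng.uniform(0.001, 0.01)),  # noise amplitude scale
        "n_steps": int(rng.choice([-4, -2, 2, 4])),       # pitch shift in semitones
        "num_chunks": int(rng.integers(3, 9)),            # 3 to 8 chunks inclusive
    }

def add_noise_seeded(audio, noise_factor, rng):
    """Gaussian noise drawn from a seeded generator, returned in the input dtype."""
    noise = rng.standard_normal(len(audio))
    return (audio + noise_factor * noise).astype(audio.dtype)

rng = np.random.default_rng(42)  # a fixed seed makes the generated dataset reproducible
params = sample_augmentation_params(rng)
audio = np.zeros(22050, dtype=np.float32)  # one second of silence as a stand-in signal
noisy = add_noise_seeded(audio, params["noise_factor"], rng)
print(params, noisy.dtype)
```

The sampled values could then be passed to the existing functions, for example `pitch_shift(audio, sample_rate=sr, n_steps=params["n_steps"])`.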


## 3. Generate the Augmented Dataset

This is the main processing step. The code below will perform the following actions:
1.  It will create the main `Audios-Augmented/` directory.
2.  Inside, it will create two subdirectories: `Leak` and `NoLeak`.
3.  It will then iterate through every file in the original `Audios para Treinamento/` directory.
4.  For each file, it will save a copy of the original, loaded as mono and resampled to `SAMPLE_RATE`, in the corresponding new directory.
5.  It will then generate `AUGMENTATIONS_PER_FILE` new, randomly augmented versions and save them as new `.wav` files with descriptive names (e.g., `original_name_aug_add_noise_0.wav`).

**Warning:** This cell will take some time to run as it is loading, processing, and saving hundreds of new audio files. Run it once, and then you will have your complete dataset ready for training.
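
The label mapping and file naming described above can be sketched without touching any audio. The helper below, `plan_outputs`, is a hypothetical name. It mirrors the naming pattern used in the processing cell, but takes the chosen augmentation names as an argument instead of drawing them at random, and `rec01.wav` is a made-up file name.

```python
from pathlib import PurePosixPath

def plan_outputs(file_path, dest_dir, aug_names):
    """Return destination paths for one source file: the original first, then one per augmentation."""
    src = PurePosixPath(file_path)
    # Any source class folder starting with 'NoLeak' maps to NoLeak; everything else maps to Leak.
    label = "NoLeak" if src.parent.name.startswith("NoLeak") else "Leak"
    dest = PurePosixPath(dest_dir) / label
    outputs = [dest / src.name]
    for i, aug_name in enumerate(aug_names):
        outputs.append(dest / f"{src.stem}_aug_{aug_name}_{i}{src.suffix}")
    return [str(p) for p in outputs]

# 'rec01.wav' is a made-up file name used only for this example.
for path in plan_outputs(
    "Audios para Treinamento/Leak-Metal/rec01.wav",
    "Audios-Augmented/",
    ["add_noise", "chunk_shuffle"],
):
    print(path)
```

Since several source classes are merged into two folders, two source files with the same name in different class folders would map to the same destination and overwrite each other. If that can happen in your data, prefixing the output name with the source class name is one way to avoid it.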


In [4]:
# Create destination directories
os.makedirs(DEST_DIR, exist_ok=True)
os.makedirs(os.path.join(DEST_DIR, 'Leak'), exist_ok=True)
os.makedirs(os.path.join(DEST_DIR, 'NoLeak'), exist_ok=True)

# Get the original classes (Leak-Metal, NoLeak-NonMetal, etc.)
original_classes = [d for d in os.listdir(SOURCE_DIR) if os.path.isdir(os.path.join(SOURCE_DIR, d))]

# Get a flat list of all source files
all_source_files = []
for class_dir in original_classes:
    source_class_path = os.path.join(SOURCE_DIR, class_dir)
    files = [os.path.join(source_class_path, f) for f in os.listdir(source_class_path) if f.endswith('.wav')]
    all_source_files.extend(files)

# --- Main Processing Loop ---
print(f"Starting to process {len(all_source_files)} original files...")
for file_path in tqdm(all_source_files):
    try:
        # Determine the destination directory (Leak or NoLeak)
        base_filename = os.path.basename(file_path)
        class_dir_name = os.path.basename(os.path.dirname(file_path))
        binary_label = 'NoLeak' if class_dir_name.startswith('NoLeak') else 'Leak'
        dest_path = os.path.join(DEST_DIR, binary_label)

        # Load the original audio
        audio, sr = librosa.load(file_path, sr=SAMPLE_RATE)
        
        # 1. Save the original file
        original_dest_filename = os.path.join(dest_path, base_filename)
        sf.write(original_dest_filename, audio, sr)

        # 2. Create and save augmented versions
        for i in range(AUGMENTATIONS_PER_FILE):
            # Choose a random augmentation
            aug_func = random.choice(augmentation_choices)
            
            # Apply it
            if aug_func is pitch_shift: # pitch_shift needs sr
                augmented_audio = aug_func(audio, sample_rate=sr)
            else:
                augmented_audio = aug_func(audio)
            
            # Create a new filename for the augmented audio
            name, ext = os.path.splitext(base_filename)
            aug_filename = f"{name}_aug_{aug_func.__name__}_{i}{ext}"
            aug_dest_path = os.path.join(dest_path, aug_filename)
            
            # Save the new file
            sf.write(aug_dest_path, augmented_audio, sr)

    except Exception as e:
        print(f"Error processing file {file_path}: {e}")

print("\n---------------------------------")
print("Dataset generation complete!")
leak_files = len(os.listdir(os.path.join(DEST_DIR, 'Leak')))
noleak_files = len(os.listdir(os.path.join(DEST_DIR, 'NoLeak')))
print(f"Total files in '{os.path.join(DEST_DIR, 'Leak')}': {leak_files}")
print(f"Total files in '{os.path.join(DEST_DIR, 'NoLeak')}': {noleak_files}")
print(f"Total files in dataset (originals + augmented): {leak_files + noleak_files}")


Starting to process 60 original files...


100%|██████████| 60/60 [00:08<00:00,  7.34it/s]


---------------------------------
Dataset generation complete!
Total files in 'Audios-Augmented/Leak': 150
Total files in 'Audios-Augmented/NoLeak': 150
Total files in dataset (originals + augmented): 300
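
The totals above are consistent with 60 originals plus 4 augmentations each, i.e. 60 × (1 + 4) = 300 files, so the final figure includes the resaved originals. Below is a small sketch for separating the two kinds by file name. It assumes augmented files are exactly those containing `_aug_` (the marker used by the naming pattern in the processing cell). `count_by_kind` is an illustrative helper, and the listing is synthetic.

```python
def count_by_kind(filenames, marker="_aug_"):
    """Count how many names look like originals versus augmented copies."""
    augmented = sum(1 for name in filenames if marker in name)
    return {"original": len(filenames) - augmented, "augmented": augmented}

# Synthetic listing standing in for os.listdir(...) on one class folder:
# 30 originals, each with 4 augmented copies.
names = []
for k in range(30):
    names.append(f"rec{k:02d}.wav")
    names.extend(f"rec{k:02d}_aug_add_noise_{i}.wav" for i in range(4))

counts = count_by_kind(names)
print(counts)
```

An original whose own name happens to contain `_aug_` would be miscounted, so this check is only as reliable as the naming convention.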



