# Notebook 00: GTZAN Data Pre-processing and Augmentation

**Project:** Music Genre Classification on GTZAN  
**Authors:** Alessandro Potenza & Camilla Sed  
**Course:** Numerical Analysis for Machine Learning, Politecnico di Milano

---

## Objective

This notebook serves as the foundational step of our entire project. Its sole responsibility is to load the raw GTZAN audio files and transform them into a clean, augmented, and robust dataset suitable for training deep learning models.

The pipeline implemented here is designed specifically to **prevent data leakage**, a common flaw in music genre classification (MGC) research that leads to inflated and unreliable results.

### Key Steps Performed:

1.  **Load File Paths**: Scan the GTZAN directory to get a list of all audio files and their corresponding genre labels.
2.  **Strategic Data Splitting**: Split the _file paths_ into training (60%), validation (20%), and test (20%) sets before any segmentation. This is the **most critical step** for data integrity: because the split happens at the file level, segments of the same track can never end up in different sets.
3.  **Segmentation (Chunking)**: Augment the dataset by slicing each 30-second audio clip into 10 non-overlapping segments of roughly 3 seconds (2.97 s) each. This increases our dataset size tenfold.
4.  **Feature Extraction**: Convert each audio segment into a log-Mel spectrogram, turning the audio classification problem into an image classification task.
5.  **Standardization**: Normalize the spectrograms by fitting a `StandardScaler` **only on the training data** and then applying it to all sets.
6.  **Save Artifacts**: Save the final NumPy arrays (`X_train`, `y_train`, etc.) and the fitted `LabelEncoder` and `StandardScaler` objects for use in subsequent training notebooks.
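
The split-then-segment ordering in steps 2 and 3 can be illustrated with a minimal sketch. It assumes a hypothetical stand-in for GTZAN (10 genres with 100 made-up file names each) and a hypothetical helper `segment_ids`; only `train_test_split` and the 60/20/20 proportions come from this notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for GTZAN: 10 genres x 100 tracks (file names are made up).
genres = [f"genre{g}" for g in range(10)]
files = [f"{g}.{i:05d}.wav" for g in genres for i in range(100)]
labels = [name.split(".")[0] for name in files]

# Stage 1: hold out 20% of the *files* as the test set.
train_val_f, test_f, train_val_y, test_y = train_test_split(
    files, labels, test_size=0.2, random_state=42, stratify=labels
)
# Stage 2: 25% of the remaining 80% becomes validation (0.25 * 0.8 = 0.2).
train_f, val_f, train_y, val_y = train_test_split(
    train_val_f, train_val_y, test_size=0.25, random_state=42, stratify=train_val_y
)


def segment_ids(file_names, n_segments=10):
    """Tag each segment with its parent file, mimicking the chunking step."""
    return [(name, s) for name in file_names for s in range(n_segments)]


train_seg, val_seg, test_seg = map(segment_ids, (train_f, val_f, test_f))

# Segmentation happens after the split, so no parent file is shared between sets.
parents = [{name for name, _ in segs} for segs in (train_seg, val_seg, test_seg)]
print(len(train_seg), len(val_seg), len(test_seg))  # 6000 2000 2000
```

Had the segments been created first and then shuffled into splits, chunks of the same track would almost certainly land in both the training and test sets, which is the leakage this ordering avoids.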


---

## Cell 1: Setup, Imports, and Global Configuration

This cell handles the initial setup of our environment. We import all necessary libraries and define global constants for file paths and the random state to ensure our experiments are fully reproducible. An output directory for the processed data is also created.


In [3]:
# ===================================================================
# CELL 1: SETUP, IMPORTS, AND GLOBAL CONFIGURATION
# ===================================================================
# This cell handles the initial setup of our environment. We import general
# libraries and define global constants for file paths and reproducibility.

import os
import sys
import numpy as np
import librosa
import pickle
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
from typing import List, Tuple
from pathlib import Path
import json

# --- Define Project Paths ---
# Assumes the working directory is two levels below the project root, matching the '../..' used below.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '../..'))
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)

# Path to the raw data destination
GTZAN_ROOT_PATH = os.path.join(PROJECT_ROOT, 'data', 'gtzan')
# Path to the specific folder containing genre subdirectories
DATA_PATH = os.path.join(GTZAN_ROOT_PATH, 'genres')
# Path for saving processed numpy arrays
PROCESSED_DATA_PATH = os.path.join(PROJECT_ROOT, 'data', 'processed')

# --- Global Constants for Reproducibility ---
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# --- Create Output Directory ---
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

print("✅ Environment setup complete. Data paths configured.")

✅ Environment setup complete. Data paths configured.


---
## Cell 2: Kaggle Setup and Automatic Dataset Download
This cell handles the configuration of Kaggle credentials and automates the download of the GTZAN dataset.

### Logic:
1.  **Check Credentials**: It first verifies that the `kaggle/kaggle.json` file exists in the project root. If not, it halts execution with clear instructions for the user.
2.  **Set Environment**: It points the Kaggle API to our project-specific credential file.
3.  **Check for Data**: It then checks if the GTZAN `genres` directory already exists.
4.  **Download if Missing**: If the data is not found, it uses the Kaggle API to download and unzip the dataset automatically.

In [7]:
# ===================================================================
# CELL 2: KAGGLE SETUP AND DOWNLOAD (WITH PROGRESS BAR)
# ===================================================================
# This cell ensures the GTZAN dataset is available locally. It uses a
# two-step process to provide a real-time progress bar during extraction.

# --- 1. CONFIGURE KAGGLE API ENVIRONMENT ---
print("🔧 Configuring Kaggle API environment...")
KAGGLE_DIR = os.path.join(PROJECT_ROOT, 'kaggle')
KAGGLE_JSON_PATH = os.path.join(KAGGLE_DIR, 'kaggle.json')

# Check if credentials file exists
if not os.path.exists(KAGGLE_JSON_PATH):
    raise FileNotFoundError(
        f"❌ Kaggle API credentials not found at: {KAGGLE_JSON_PATH}\n"
        "Please follow the instructions in the README to set it up."
    )

os.environ['KAGGLE_CONFIG_DIR'] = KAGGLE_DIR
if os.name != 'nt':
    os.chmod(KAGGLE_JSON_PATH, 0o600)

# --- 2. RELOAD KAGGLE LIBRARY AND DOWNLOAD DATASET ---
try:
    import kaggle
    import zipfile
    import importlib
    importlib.reload(kaggle) # Force reload to read the new environment config
    print("✅ Kaggle credentials configured successfully.")
except Exception as e:
    # Chain the original exception so the full traceback is preserved
    raise ImportError(f"Could not import or reload 'kaggle' package. Error: {e}") from e

KAGGLE_DATASET_ID = 'carlthome/gtzan-genre-collection'
print("\n🎵 Checking for GTZAN dataset...")

if os.path.isdir(DATA_PATH) and len(os.listdir(DATA_PATH)) > 0:
    print(f"✅ Dataset already found at: {DATA_PATH}")
else:
    print("Dataset not found. Proceeding with download and extraction.")
    try:
        os.makedirs(GTZAN_ROOT_PATH, exist_ok=True)
        
        # --- STEP 1: DOWNLOAD THE ZIP FILE ONLY ---
        print("Downloading dataset from Kaggle (this may take a moment)...")
        kaggle.api.dataset_download_files(
            KAGGLE_DATASET_ID,
            path=GTZAN_ROOT_PATH,
            unzip=False, # Set to False to handle extraction manually
            quiet=True # We'll provide our own progress for extraction
        )
        print("✅ Download complete.")

        # --- STEP 2: EXTRACT WITH A TQDM PROGRESS BAR ---
        # Construct the path to the downloaded zip file
        zip_filename = KAGGLE_DATASET_ID.split('/')[1] + '.zip'
        zip_path = os.path.join(GTZAN_ROOT_PATH, zip_filename)
        
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            # Get the list of files to show progress against the number of files
            file_list = zip_ref.infolist()
            
            # Wrap the extraction in a tqdm loop
            for file in tqdm(file_list, desc="Extracting dataset"):
                zip_ref.extract(file, path=GTZAN_ROOT_PATH)
        
        # --- STEP 3: CLEANUP ---
        # Remove the zip file after successful extraction
        os.remove(zip_path)
        print("✅ Extraction complete and zip file removed.")
        
        if not os.path.isdir(DATA_PATH):
            raise FileNotFoundError(f"Extraction failed: 'genres' directory not found at {DATA_PATH}.")
            
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        raise

🔧 Configuring Kaggle API environment...
✅ Kaggle credentials configured successfully.

🎵 Checking for GTZAN dataset...
✅ Dataset already found at: /home/alepot55/Desktop/projects/naml_project/data/gtzan/genres


---

## Cell 3: The Data Loading and Processing Class

This cell defines the `GTZANDataLoader`, a comprehensive class that encapsulates all the logic for loading, processing, and augmenting the GTZAN dataset. By structuring our logic this way, the code becomes modular, reusable, and easy to understand.

### Key Methods:

- `load_all_filepaths()`: Scans the directory structure to create a master list of audio files and labels.
- `_process_audio_file()`: Loads a single audio file, slices it into 10 segments, and computes the log-Mel spectrogram for each segment. This is our primary data augmentation strategy. Files that cannot be read (for example, corrupt ones) are skipped with a warning.
- `create_dataset_from_filepaths()`: Iterates through a list of file paths (e.g., the training set) and applies `_process_audio_file()` to each, compiling the final list of spectrograms and labels.
- `_unify_spectrogram_shapes()`: A static helper that ensures all spectrograms have a uniform length (128 frames by default) by padding or truncating them along the time axis. This is necessary because minor variations in file length can lead to spectrograms with slightly different time dimensions.
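
The segment duration of 2.97 s used in the code below ties in with the 128-frame target length. The following sketch works through the arithmetic, assuming librosa's default centered framing (`center=True`), under which a signal of `n` samples yields `1 + n // hop_length` frames. The helper `n_frames` is hypothetical and only restates that formula.

```python
def n_frames(n_samples: int, hop_length: int = 512) -> int:
    """Frame count of a centered STFT: one frame per full hop, plus one."""
    return 1 + n_samples // hop_length


sample_rate = 22050
for duration in (2.97, 3.0):
    n_samples = int(sample_rate * duration)
    print(duration, n_samples, n_frames(n_samples))
# 2.97 65488 128
# 3.0 66150 130
```

With 128 Mel bands and 128 frames, each 2.97-second segment maps to a square 128 × 128 spectrogram, so under these assumptions the pad-or-truncate helper only has to act on segments whose length deviates from the expected one.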


In [5]:
# ===================================================================
# CELL 3: THE DATA LOADER CLASS
# ===================================================================

class GTZANDataLoader:
    """
    Handles loading, segmentation, and feature extraction for the GTZAN dataset.
    Splitting is done on file paths before processing to prevent data leakage.
    """
    
    def __init__(self, data_path: str, sample_rate: int = 22050, n_mels: int = 128, hop_length: int = 512):
        """
        Initializes the loader with audio parameters and fits the LabelEncoder.
        The StandardScaler is initialized but remains unfitted until the training
        data is available.
        """
        self.data_path = data_path
        self.sample_rate = sample_rate
        self.n_mels = n_mels
        self.hop_length = hop_length
        
        # Discover genres from subdirectories
        self.genres = sorted([d for d in os.listdir(data_path) if os.path.isdir(os.path.join(data_path, d))])
        
        # Fit the LabelEncoder once on all possible genre names
        self.label_encoder = LabelEncoder().fit(self.genres)
        
        # Initialize an empty scaler; it will be fitted ONLY on the training data later
        self.scaler = StandardScaler()
        
    def load_all_filepaths(self) -> Tuple[List[str], List[str]]:
        """Scans the data directory and returns a list of all file paths and their text labels."""
        filepaths, labels = [], []
        for genre in self.genres:
            genre_path = os.path.join(self.data_path, genre)
            for filename in os.listdir(genre_path):
                if filename.endswith(('.wav', '.au')):
                    filepaths.append(os.path.join(genre_path, filename))
                    labels.append(genre)
        return filepaths, labels
    
    def _process_audio_file(self, file_path: str, n_segments: int = 10, segment_duration: float = 2.97) -> List[np.ndarray]:
        """
        Loads and processes a single audio file, augmenting it into multiple segments.
        Returns a list of log-Mel spectrograms for that file.
        """
        try:
            signal, _ = librosa.load(file_path, sr=self.sample_rate)
            samples_per_segment = int(self.sample_rate * segment_duration)
            
            spectrograms = []
            for s in range(n_segments):
                start = s * samples_per_segment
                end = start + samples_per_segment
                if end <= len(signal):
                    segment_signal = signal[start:end]
                    mel_spec = librosa.feature.melspectrogram(
                        y=segment_signal, sr=self.sample_rate, n_mels=self.n_mels, hop_length=self.hop_length
                    )
                    log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
                    spectrograms.append(log_mel_spec)
            return spectrograms
        except Exception as e:
            print(f"Warning: Could not process file {os.path.basename(file_path)}. Error: {e}")
            return []
            
    def create_dataset_from_filepaths(self, filepaths: List[str], text_labels: List[str], n_segments: int = 10) -> Tuple[List[np.ndarray], np.ndarray]:
        """Creates a dataset of spectrograms and encoded labels from a list of files."""
        X_list, y_list = [], []
        encoded_labels = self.label_encoder.transform(text_labels)
        
        for i, file_path in enumerate(tqdm(filepaths, desc=f"Processing {len(filepaths)} files")):
            spectrograms = self._process_audio_file(file_path, n_segments=n_segments)
            for spec in spectrograms:
                X_list.append(spec)
                y_list.append(encoded_labels[i])
                
        return X_list, np.array(y_list)

    @staticmethod
    def _unify_spectrogram_shapes(spec_list: List[np.ndarray], target_len: int = 128) -> np.ndarray:
        """Pads or truncates spectrograms to a uniform target length."""
        adjusted_list = []
        for spec in spec_list:
            if spec.shape[1] > target_len:
                adjusted_list.append(spec[:, :target_len])
            else:
                padding_needed = target_len - spec.shape[1]
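                # Pad with the quietest dB value; zero-padding would read as peak energy (0 dB)
                adjusted_list.append(np.pad(spec, ((0, 0), (0, padding_needed)), mode='constant', constant_values=spec.min()))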
        return np.array(adjusted_list)

print("✅ GTZANDataLoader class defined.")

✅ GTZANDataLoader class defined.


---

## Cell 4: Execution of the Full Pipeline

This is the main execution cell where we orchestrate the entire data preparation process from start to finish.

### The sequence of operations is critical for ensuring no data leakage:

1.  **Load and Split Paths**: We instantiate our `GTZANDataLoader` and immediately split the raw file paths into training, validation, and test sets. This ensures that all segments from a single original song will reside in only one data split.
2.  **Process Datasets**: We process each set of files (train, val, test) separately to create our augmented spectrogram datasets.
3.  **Unify Shape**: We enforce a uniform shape across all generated spectrograms.
4.  **Normalize Data**: We correctly fit the `StandardScaler` **only on the training data** and then apply this learned transformation to the validation and test sets.
5.  **Add Channel Dimension**: We add a final "channel" dimension to the spectrograms, transforming their shape from `(N, H, W)` to `(N, H, W, 1)`, which is the standard input format for 2D Convolutional Neural Networks in Keras/TensorFlow.
6.  **Save Artifacts**: Finally, we save all processed arrays and the fitted `scaler` and `label_encoder` objects to disk. These artifacts are now ready to be loaded directly by our training and analysis notebooks.
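
Steps 4 and 5 can be sketched on toy data. The arrays below are hypothetical stand-ins for the real spectrograms (tiny shapes, random values in a dB-like range), and `flatten` is an illustrative helper. The point is that the scaler's statistics come from the training array alone and are then reused unchanged on the other splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical toy "spectrograms" with shape (N, H, W).
X_train = rng.uniform(-80.0, 0.0, size=(12, 4, 5))
X_val = rng.uniform(-80.0, 0.0, size=(4, 4, 5))


def flatten(x: np.ndarray) -> np.ndarray:
    """StandardScaler expects 2D input: one row per sample, one column per pixel."""
    return x.reshape(len(x), -1)


scaler = StandardScaler()
# Fit on the training data only, then reuse the learned mean and scale.
X_train_std = scaler.fit_transform(flatten(X_train)).reshape(X_train.shape)
X_val_std = scaler.transform(flatten(X_val)).reshape(X_val.shape)

# Add the trailing channel axis: (N, H, W) -> (N, H, W, 1).
X_train_std = X_train_std[..., np.newaxis]
X_val_std = X_val_std[..., np.newaxis]
print(X_train_std.shape, X_val_std.shape)  # (12, 4, 5, 1) (4, 4, 5, 1)
```

After this transformation every pixel position has zero mean and unit variance across the training set. The validation set generally will not, which is expected: it is scaled with statistics it did not contribute to.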


In [6]:
# ===================================================================
# CELL 4: EXECUTION AND SAVING OF PROCESSED DATA
# ===================================================================

print("Starting data preparation pipeline...")

# 1. Load file paths and split at the file level to prevent data leakage.
# Final split: 60% train, 20% validation, 20% test.
data_loader = GTZANDataLoader(data_path=DATA_PATH)
filepaths, text_labels = data_loader.load_all_filepaths()

# First split: 80% for train/val, 20% for test
train_val_files, test_files, train_val_labels, test_labels = train_test_split(
    filepaths, text_labels, test_size=0.2, random_state=RANDOM_STATE, stratify=text_labels
)
# Second split: create train and validation from the 80% pool
train_files, val_files, train_labels, val_labels = train_test_split(
    train_val_files, train_val_labels, test_size=0.25, random_state=RANDOM_STATE, stratify=train_val_labels # 0.25 * 0.8 = 0.2
)
print(f"Data split into: {len(train_files)} train, {len(val_files)} validation, {len(test_files)} test files.")

# 2. Process each data split into spectrograms
X_train_list, y_train = data_loader.create_dataset_from_filepaths(train_files, train_labels)
X_val_list, y_val = data_loader.create_dataset_from_filepaths(val_files, val_labels)
X_test_list, y_test = data_loader.create_dataset_from_filepaths(test_files, test_labels)
print("\n✅ Spectrogram extraction complete.")

# 3. Unify spectrogram shapes to a consistent length
print("Unifying spectrogram shapes...")
X_train = GTZANDataLoader._unify_spectrogram_shapes(X_train_list)
X_val = GTZANDataLoader._unify_spectrogram_shapes(X_val_list)
X_test = GTZANDataLoader._unify_spectrogram_shapes(X_test_list)

# 4. Normalize the data (Fit on train ONLY)
# The scaler expects 2D data, so we reshape, fit/transform, then reshape back.
print("Normalizing data (fitting scaler only on training data)...")
scaler = data_loader.scaler
X_train_shape = X_train.shape
X_train = scaler.fit_transform(X_train.reshape(-1, X_train_shape[1] * X_train_shape[2])).reshape(X_train_shape)

X_val_shape = X_val.shape
X_val = scaler.transform(X_val.reshape(-1, X_val_shape[1] * X_val_shape[2])).reshape(X_val_shape)

X_test_shape = X_test.shape
X_test = scaler.transform(X_test.reshape(-1, X_test_shape[1] * X_test_shape[2])).reshape(X_test_shape)

# 5. Add channel dimension for CNN input: (N, H, W) -> (N, H, W, 1)
X_train, X_val, X_test = X_train[..., np.newaxis], X_val[..., np.newaxis], X_test[..., np.newaxis]
print("✅ Normalization and channel formatting complete.")

# --- 6. SAVE ALL ARTIFACTS ---
print("\n💾 Saving processed data and artifacts...")
np.save(os.path.join(PROCESSED_DATA_PATH, 'X_train.npy'), X_train)
np.save(os.path.join(PROCESSED_DATA_PATH, 'y_train.npy'), y_train)
np.save(os.path.join(PROCESSED_DATA_PATH, 'X_val.npy'), X_val)
np.save(os.path.join(PROCESSED_DATA_PATH, 'y_val.npy'), y_val)
np.save(os.path.join(PROCESSED_DATA_PATH, 'X_test.npy'), X_test)
np.save(os.path.join(PROCESSED_DATA_PATH, 'y_test.npy'), y_test)

with open(os.path.join(PROCESSED_DATA_PATH, 'scaler.pkl'), 'wb') as f:
    pickle.dump(scaler, f)
with open(os.path.join(PROCESSED_DATA_PATH, 'label_encoder.pkl'), 'wb') as f:
    pickle.dump(data_loader.label_encoder, f)

print("\n🎉 Pipeline complete. Data is ready for training in the subsequent notebooks.")
print("\n📊 Summary of Saved Data:")
print(f"   - Training Set:   X={X_train.shape}, y={y_train.shape}")
print(f"   - Validation Set: X={X_val.shape}, y={y_val.shape}")
print(f"   - Test Set:       X={X_test.shape}, y={y_test.shape}")

Starting data preparation pipeline...
Data split into: 600 train, 200 validation, 200 test files.


✅ Spectrogram extraction complete.
Unifying spectrogram shapes...
Normalizing data (fitting scaler only on training data)...
✅ Normalization and channel formatting complete.

💾 Saving processed data and artifacts...

🎉 Pipeline complete. Data is ready for training in the subsequent notebooks.

📊 Summary of Saved Data:
   - Training Set:   X=(6000, 128, 128, 1), y=(6000,)
   - Validation Set: X=(2000, 128, 128, 1), y=(2000,)
   - Test Set:       X=(2000, 128, 128, 1), y=(2000,)
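
A later notebook could reload these artifacts along the following lines. This is a minimal round-trip sketch, not the project's actual loading code: it writes tiny hypothetical data to a temporary directory standing in for `PROCESSED_DATA_PATH`, using the same file names as the save step above, and reads it back.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import LabelEncoder

with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp)  # stand-in for PROCESSED_DATA_PATH

    # Save side (mirrors Cell 4, with tiny hypothetical data).
    encoder = LabelEncoder().fit(["blues", "jazz", "rock"])
    y_test = encoder.transform(["rock", "blues", "jazz", "rock"])
    np.save(processed / "y_test.npy", y_test)
    with open(processed / "label_encoder.pkl", "wb") as f:
        pickle.dump(encoder, f)

    # Load side (what a training notebook might do).
    y_loaded = np.load(processed / "y_test.npy")
    with open(processed / "label_encoder.pkl", "rb") as f:
        encoder_loaded = pickle.load(f)

# Map the integer labels back to genre names.
genre_names = encoder_loaded.inverse_transform(y_loaded)
print(genre_names.tolist())  # ['rock', 'blues', 'jazz', 'rock']
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so it helps to keep the environment consistent across notebooks.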
