
Dataset: MIREX Emotion Dataset from Kaggle
Modality: Audio (MP3/WAV files)
Model: PANNs (Pre-trained Audio Neural Networks)

Paper: "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition"

# 1. INSTALLATION AND LIBRARY IMPORTS

In [None]:
print("Installing required packages...")
!pip install -q kagglehub
!pip install -q panns-inference  # Much simpler!
!pip install -q torch torchvision torchaudio
!pip install -q librosa soundfile
!pip install -q scikit-learn pandas numpy

print("✓ Installation complete!")

import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
import librosa
import soundfile as sf
from panns_inference import AudioTagging  # PANNs made easy!
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
def set_seed(seed=42):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'\n✓ Using device: {device}')

Installing required packages...
✓ Installation complete!

✓ Using device: cuda


# 2. DOWNLOAD DATASET

In [None]:
import kagglehub

print("\n" + "="*80)
print("DOWNLOADING MIREX DATASET")
print("="*80)

# Download dataset
path = kagglehub.dataset_download("imsparsh/multimodal-mirex-emotion-dataset")
print(f"✓ Path to dataset files: {path}")

# Check audio directory
audio_dir = os.path.join(path, 'dataset', 'Audio')
print(f"\n✓ Audio directory: {audio_dir}")
print(f"✓ Audio directory exists: {os.path.exists(audio_dir)}")

if os.path.exists(audio_dir):
    audio_files = [f for f in os.listdir(audio_dir) if f.endswith('.wav') or f.endswith('.mp3')]
    print(f"✓ Found {len(audio_files)} audio files")
    if len(audio_files) > 0:
        print(f"  Sample files: {audio_files[:5]}")
else:
    print("⚠️ Audio directory not found! Checking alternatives...")
    # Try alternative paths
    for alt_dir in ['Audio', 'audio', 'WAV', 'wav']:
        alt_path = os.path.join(path, 'dataset', alt_dir)
        if os.path.exists(alt_path):
            audio_dir = alt_path
            print(f"✓ Found audio at: {audio_dir}")
            break


DOWNLOADING MIREX DATASET
Using Colab cache for faster access to the 'multimodal-mirex-emotion-dataset' dataset.
✓ Path to dataset files: /kaggle/input/multimodal-mirex-emotion-dataset

✓ Audio directory: /kaggle/input/multimodal-mirex-emotion-dataset/dataset/Audio
✓ Audio directory exists: True
✓ Found 903 audio files
  Sample files: ['326.mp3', '149.mp3', '898.mp3', '011.mp3', '434.mp3']
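
The fallback loop in the cell above only tries a few fixed folder names under `dataset/`. As a minimal sketch of a more general check, the hypothetical helper `find_audio_dir` below walks the download root and returns the folder holding the most `.wav`/`.mp3` files. The helper name and the "most files wins" rule are assumptions made for illustration; they are not part of kagglehub or the dataset.

```python
import os

AUDIO_EXTENSIONS = ('.wav', '.mp3')

def find_audio_dir(root):
    """Return (directory, file_count) for the folder under `root` that
    contains the most audio files, or (None, 0) if none are found."""
    best_dir, best_count = None, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # Count audio files directly inside this folder (case-insensitive)
        count = sum(1 for name in filenames
                    if name.lower().endswith(AUDIO_EXTENSIONS))
        if count > best_count:
            best_dir, best_count = dirpath, count
    return best_dir, best_count
```

In this notebook it could be called as `audio_dir, n_files = find_audio_dir(path)` right after the download.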


# 3. LOAD PRE-TRAINED PANNS (USING panns-inference)

In [None]:
print("\n" + "="*80)
print("LOADING PRE-TRAINED PANNS MODEL")
print("="*80)

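# Initialize the PANNs audio-tagging model (weights are downloaded automatically).
# panns-inference appears to expect the device as a string ('cuda' or 'cpu');
# passing a torch.device object made it fall back to CPU, which is the
# "Using CPU." line in the logged output of this cell.
at = AudioTagging(checkpoint_path=None, device=device.type)

print("✓ PANNs model loaded successfully!")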
print("  Model: Cnn14")
print("  Trained on: AudioSet (2M+ audio clips)")
print("  Embedding dimension: 2048")
print("  Checkpoint auto-downloaded by panns-inference")


LOADING PRE-TRAINED PANNS MODEL
Checkpoint path: /root/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
✓ PANNs model loaded successfully!
  Model: Cnn14
  Trained on: AudioSet (2M+ audio clips)
  Embedding dimension: 2048
  Checkpoint auto-downloaded by panns-inference


# 4. AUDIO FEATURE EXTRACTION (USING panns-inference)

In [None]:
def extract_audio_embedding(audio_path, at_model, sr=32000, duration=10):
    """
    Extract PANNs embedding from audio file using panns-inference

    Args:
        audio_path: path to audio file
        at_model: AudioTagging model from panns-inference
        sr: sample rate
        duration: audio duration in seconds

    Returns:
        embedding: (2048,) PANNs embedding
    """
    try:
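        # Load audio resampled to `sr`, keeping only the first `duration` seconds
        # (the second return value is the sample rate after resampling, i.e. `sr`)
        audio, _ = librosa.load(audio_path, sr=sr, duration=duration)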

        # Pad if too short
        target_length = sr * duration
        if len(audio) < target_length:
            audio = np.pad(audio, (0, target_length - len(audio)), mode='constant')
        else:
            audio = audio[:target_length]

        # Inference with PANNs (returns clipwise_output and embedding)
        (clipwise_output, embedding) = at_model.inference(audio[None, :])

        # embedding shape: (1, 2048)
        embedding = embedding[0]  # (2048,)

        return embedding

    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None
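
The padding and truncation step inside `extract_audio_embedding` can be isolated and checked on synthetic arrays. This is a minimal sketch: `pad_or_truncate` is a hypothetical helper name, and it assumes a mono 1-D waveform such as the one `librosa.load` returns by default.

```python
import numpy as np

def pad_or_truncate(audio, target_length):
    """Zero-pad at the end, or cut, so the result has exactly target_length samples."""
    if len(audio) < target_length:
        return np.pad(audio, (0, target_length - len(audio)), mode='constant')
    return audio[:target_length]

sr, duration = 32000, 10          # same values the notebook passes
target = sr * duration            # 320000 samples

short_clip = np.ones(sr * 4, dtype=np.float32)    # synthetic 4 s clip
long_clip = np.ones(sr * 15, dtype=np.float32)    # synthetic 15 s clip

padded = pad_or_truncate(short_clip, target)      # 4 s of signal + 6 s of zeros
cut = pad_or_truncate(long_clip, target)          # first 10 s only
batch = padded[None, :]                           # (batch, samples) layout used for inference

print(padded.shape, cut.shape, batch.shape)
```

Note that clips longer than 10 seconds contribute only their opening segment, so the embedding describes the start of each song rather than the whole track.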

# 5. LOAD CLUSTER LABELS

In [None]:
def load_cluster_labels(dataset_path):
    """
    Load cluster labels from clusters.txt
    """
    clusters_path = os.path.join(dataset_path, 'dataset', 'clusters.txt')
    cluster_labels = []

    print("\n" + "="*80)
    print("LOADING CLUSTER LABELS")
    print("="*80)

    if os.path.exists(clusters_path):
        with open(clusters_path, 'r', encoding='utf-8', errors='ignore') as f:
            lines = f.readlines()
            cluster_labels = [line.strip() for line in lines if line.strip()]

        unique_clusters = sorted(set(cluster_labels))
        print(f"✓ Loaded {len(cluster_labels)} cluster labels")
        print(f"✓ Unique clusters: {unique_clusters}")

        from collections import Counter
        cluster_counts = Counter(cluster_labels)
        print(f"\nCluster distribution:")
        for cluster, count in sorted(cluster_counts.items()):
            print(f"  {cluster}: {count} songs")
    else:
        print("❌ clusters.txt not found!")
        return []

    return cluster_labels

cluster_labels = load_cluster_labels(path)

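# Create song_id to cluster mapping.
# NOTE: each label is written under both a 0-based and a 1-based key, and the
# 0-based write of the next iteration overwrites the 1-based write of the
# previous one. The final map is therefore effectively 0-based
# (ID k -> cluster_labels[k]) plus one extra key for the last ID. If the audio
# files are numbered from 001, every label is shifted by one position, so the
# file numbering should be verified against the dataset.
song_cluster_map = {}
for idx in range(len(cluster_labels)):
    song_id_0 = str(idx).zfill(3)
    song_id_1 = str(idx + 1).zfill(3)
    song_cluster_map[song_id_0] = cluster_labels[idx]
    song_cluster_map[song_id_1] = cluster_labels[idx]

print(f"\n✓ Created mappings for {len(song_cluster_map)} song IDs")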


LOADING CLUSTER LABELS
✓ Loaded 903 cluster labels
✓ Unique clusters: ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']

Cluster distribution:
  Cluster 1: 170 songs
  Cluster 2: 164 songs
  Cluster 3: 215 songs
  Cluster 4: 191 songs
  Cluster 5: 163 songs

✓ Created mappings for 904 song IDs
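
The log reports 904 mappings for 903 labels. That follows from the loop writing each label under two keys, with later iterations overwriting earlier ones. The sketch below reproduces that behavior on toy labels and contrasts it with a strictly 1-based map. It assumes files are named `001`, `002`, ... in the same order as the lines of `clusters.txt`; that ordering is an assumption and should be checked against the dataset. `build_dual_key_map` and `build_one_based_map` are hypothetical names.

```python
def build_dual_key_map(labels):
    """Same logic as the notebook loop: write each label under idx and idx + 1."""
    mapping = {}
    for idx, label in enumerate(labels):
        mapping[str(idx).zfill(3)] = label
        mapping[str(idx + 1).zfill(3)] = label
    return mapping

def build_one_based_map(labels):
    """Assumed alternative: line i of clusters.txt (counting from 1) labels file i."""
    return {str(i).zfill(3): label for i, label in enumerate(labels, start=1)}

toy_labels = ['Cluster 1', 'Cluster 2', 'Cluster 3']
dual = build_dual_key_map(toy_labels)
one_based = build_one_based_map(toy_labels)

# File IDs whose label differs between the two schemes
mismatched = sorted(k for k in one_based if dual.get(k) != one_based[k])
print(len(dual), mismatched)
```

Under the dual-key loop, file `001` receives the label from line 2, and only the last file keeps its own label. If the files are indeed 1-based, this would also be consistent with the Cluster 1 count dropping from 170 in `clusters.txt` to 169 after matching.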


# 6. AUDIO PREPROCESSING & FEATURE EXTRACTION

In [None]:
print("\n" + "="*80)
print("LOADING AUDIO DATA & EXTRACTING EMBEDDINGS")
print("="*80)

# Find audio files
audio_files_found = [f for f in os.listdir(audio_dir) if f.endswith('.wav') or f.endswith('.mp3')]
print(f"Processing {len(audio_files_found)} audio files...")
print("Extracting PANNs embeddings (this may take a while)...\n")

audio_data_list = []
matched = 0
failed = 0
no_cluster = 0

for idx, audio_file in enumerate(audio_files_found):
    if idx % 20 == 0 and idx > 0:
        print(f"  Progress: {idx}/{len(audio_files_found)} files processed...")

    # Extract song ID
    song_id = audio_file.replace('.wav', '').replace('.mp3', '')
    song_id_clean = ''.join(filter(str.isdigit, song_id))
    if song_id_clean:
        song_id = song_id_clean.zfill(3)

    # Check cluster mapping
    if song_id not in song_cluster_map:
        no_cluster += 1
        if no_cluster <= 3:
            print(f"  ⚠️ No cluster for: {audio_file} (ID: {song_id})")
        continue

    # Extract PANNs embedding
    audio_path = os.path.join(audio_dir, audio_file)
    embedding = extract_audio_embedding(audio_path, at, sr=32000, duration=10)

    if embedding is not None:
        audio_data_list.append({
            'song_id': song_id,
            'embedding': embedding,
            'cluster': song_cluster_map[song_id]
        })
        matched += 1

        if matched <= 3:
            print(f"  ✓ Loaded: {audio_file} → {song_id} → {song_cluster_map[song_id]} (emb: {embedding.shape})")
    else:
        failed += 1

print(f"\n{'='*80}")
print(f"AUDIO LOADING SUMMARY:")
print(f"{'='*80}")
print(f"✓ Successfully loaded: {matched} audio files")
print(f"⚠️ No cluster mapping: {no_cluster} files")
print(f"❌ Failed to process: {failed} files")

# Create DataFrame
if len(audio_data_list) > 0:
    df = pd.DataFrame(audio_data_list)
    print(f"\n✓ Dataset shape: {df.shape}")
    print(f"\nCluster distribution:")
    print(df['cluster'].value_counts())
else:
    raise ValueError("No audio data loaded!")


LOADING AUDIO DATA & EXTRACTING EMBEDDINGS
Processing 903 audio files...
Extracting PANNs embeddings (this may take a while)...

  ✓ Loaded: 326.mp3 → 326 → Cluster 2 (emb: (2048,))
  ✓ Loaded: 149.mp3 → 149 → Cluster 1 (emb: (2048,))
  ✓ Loaded: 898.mp3 → 898 → Cluster 5 (emb: (2048,))
  Progress: 20/903 files processed...
  Progress: 40/903 files processed...
  Progress: 60/903 files processed...
  Progress: 80/903 files processed...
  Progress: 100/903 files processed...
  Progress: 120/903 files processed...
  Progress: 140/903 files processed...
  Progress: 160/903 files processed...
  Progress: 180/903 files processed...
  Progress: 200/903 files processed...
  Progress: 220/903 files processed...
  Progress: 240/903 files processed...
  Progress: 260/903 files processed...
  Progress: 280/903 files processed...
  Progress: 300/903 files processed...
  Progress: 320/903 files processed...
  Progress: 340/903 files processed...
  Progress: 360/903 files processe

# 7. DATASET CLASS (SIMPLIFIED)

In [None]:
class AudioDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data.iloc[idx]

        # Embedding already extracted by panns-inference
        embedding = torch.FloatTensor(item['embedding'])
        label = item['label']

        return {
            'embedding': embedding,
            'label': torch.tensor(label, dtype=torch.long)
        }

# 8. EMOTION CLASSIFIER (SIMPLIFIED)

In [None]:
class PANNsEmotionClassifier(nn.Module):
    """
    Simple classifier on top of pre-extracted PANNs embeddings
    """
    def __init__(self, num_classes, input_dim=2048, dropout=0.5):
        super(PANNsEmotionClassifier, self).__init__()

        # Classifier head (embeddings already extracted)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(input_dim, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(dropout * 0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, embedding):
        # Embeddings already computed, just classify
        logits = self.classifier(embedding)
        return logits
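
For a sense of scale, the number of trainable parameters in this head can be worked out by hand from the layer sizes. The sketch below does the arithmetic in plain Python, assuming `num_classes=5` as in this notebook; `linear_params` is a hypothetical helper. Dropout and ReLU layers have no parameters, so they are omitted.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

num_classes = 5
layer_params = {
    'Linear(2048, 512)': linear_params(2048, 512),
    'BatchNorm1d(512)': 2 * 512,   # learnable scale and shift
    'Linear(512, 128)': linear_params(512, 128),
    'Linear(128, 5)': linear_params(128, num_classes),
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:>18}: {count:>9,}")
print(f"{'Total':>18}: {total:>9,}")
```

That comes to roughly 1.1 million trainable parameters, against a few hundred training clips per fold, which makes the dropout layers and the train/validation gap check in section 11 worthwhile.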


# 9. LABEL ENCODING

In [None]:
print("\n" + "="*80)
print("ENCODING LABELS")
print("="*80)

label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['cluster'])

print(f"✓ Cluster classes: {label_encoder.classes_}")
print(f"✓ Number of clusters: {len(label_encoder.classes_)}")

print("\nClass distribution:")
for cluster, count in df['cluster'].value_counts().items():
    encoded = df[df['cluster'] == cluster]['label'].iloc[0]
    print(f"  {encoded}: {cluster} - {count} samples")

num_classes = len(label_encoder.classes_)

# Calculate class weights
y_labels = df['label'].values
class_weights = compute_class_weight('balanced', classes=np.unique(y_labels), y=y_labels)
class_weights = torch.FloatTensor(class_weights).to(device)
print(f"\n✓ Class weights: {class_weights.cpu().numpy()}")


ENCODING LABELS
✓ Cluster classes: ['Cluster 1' 'Cluster 2' 'Cluster 3' 'Cluster 4' 'Cluster 5']
✓ Number of clusters: 5

Class distribution:
  2: Cluster 3 - 215 samples
  3: Cluster 4 - 191 samples
  0: Cluster 1 - 169 samples
  1: Cluster 2 - 164 samples
  4: Cluster 5 - 164 samples

✓ Class weights: [1.068639  1.1012195 0.84      0.9455497 1.1012195]
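
The weights printed above follow scikit-learn's `balanced` heuristic, `n_samples / (n_classes * count_c)`. A minimal check in plain Python, using the class counts from the log above, reproduces them:

```python
# Class counts as printed in the "Class distribution" log of this section
counts = {
    'Cluster 1': 169,
    'Cluster 2': 164,
    'Cluster 3': 215,
    'Cluster 4': 191,
    'Cluster 5': 164,
}

n_samples = sum(counts.values())   # 903
n_classes = len(counts)            # 5

# Rarer classes get weights above 1, more frequent classes below 1
weights = {name: n_samples / (n_classes * n) for name, n in counts.items()}

for name, weight in weights.items():
    print(f"{name}: {weight:.7f}")
```

Because the five clusters are close in size, all weights stay within about 16% of 1, so the weighted loss differs only mildly from the unweighted one.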


# 10. TRAINING & EVALUATION FUNCTIONS (SIMPLIFIED)

In [None]:
def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    predictions = []
    true_labels = []

    for batch in dataloader:
        embedding = batch['embedding'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()

        # Forward pass
        logits = model(embedding)
        loss = criterion(logits, labels)

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()

        # Predictions
        preds = torch.argmax(logits, dim=1)
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(true_labels, predictions)

    return avg_loss, accuracy

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    predictions = []
    true_labels = []

    with torch.no_grad():
        for batch in dataloader:
            embedding = batch['embedding'].to(device)
            labels = batch['label'].to(device)

            # Forward pass
            logits = model(embedding)
            loss = criterion(logits, labels)

            total_loss += loss.item()

            # Predictions
            preds = torch.argmax(logits, dim=1)
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        true_labels, predictions, average='weighted', zero_division=0
    )

    return avg_loss, accuracy, precision, recall, f1, predictions, true_labels
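
With `average='weighted'`, each class's recall is weighted by its support, so the weighted recall reduces to accuracy: the sum over classes of (n_c / N) * (TP_c / n_c) equals (sum of TP_c) / N. This is why the recall and accuracy columns coincide in the fold summary. A minimal pure-Python sketch on made-up labels follows; `weighted_recall` is a hypothetical helper that mirrors the definition rather than calling scikit-learn.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights support_c / N."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Toy labels, invented for illustration
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(f"accuracy={acc:.4f}  weighted recall={rec:.4f}")
```

Weighted precision and F1 do not collapse this way, so they remain the more informative columns to compare across folds.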

# 11. 5-FOLD CROSS VALIDATION

In [None]:
# Hyperparameters
BATCH_SIZE = 16
LEARNING_RATE = 1e-4
NUM_EPOCHS = 30
N_FOLDS = 5
WEIGHT_DECAY = 0.01
EARLY_STOPPING_PATIENCE = 7
DROPOUT = 0.5

print("\n" + "="*80)
print("HYPERPARAMETERS")
print("="*80)
print(f"Batch size: {BATCH_SIZE}")
print(f"Learning rate: {LEARNING_RATE}")
print(f"Epochs: {NUM_EPOCHS}")
print(f"Dropout: {DROPOUT}")
print(f"PANNs: Pre-extracted embeddings (2048-dim)")
print(f"Total samples: {len(df)}")

# Prepare data
X = df.index.values
y = df['label'].values

# 5-Fold Cross Validation
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

print("\n" + "="*80)
print("STARTING 5-FOLD CROSS VALIDATION")
print("="*80)

fold_results = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"\n{'='*80}")
    print(f"FOLD {fold + 1}/{N_FOLDS}")
    print(f"{'='*80}")

    # Split data
    train_data = df.iloc[train_idx].reset_index(drop=True)
    val_data = df.iloc[val_idx].reset_index(drop=True)

    print(f"Train size: {len(train_data)}, Val size: {len(val_data)}")

    # Create datasets
    train_dataset = AudioDataset(train_data)
    val_dataset = AudioDataset(val_data)

    # Create dataloaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

    # Initialize model
    model = PANNsEmotionClassifier(
        num_classes=num_classes,
        input_dim=2048,
        dropout=DROPOUT
    )
    model = model.to(device)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

    # Scheduler
    from torch.optim.lr_scheduler import ReduceLROnPlateau
    scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=3)

    # Training loop
    best_val_f1 = 0
    patience_counter = 0

    for epoch in range(NUM_EPOCHS):
        print(f"\nEpoch {epoch + 1}/{NUM_EPOCHS}")

        # Train
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)

        # Validate
        val_loss, val_acc, val_precision, val_recall, val_f1, _, _ = evaluate(
            model, val_loader, criterion, device
        )

        # Update scheduler
        scheduler.step(val_f1)

        # Overfitting gap
        overfit_gap = train_acc - val_acc

        print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
        print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}, Val F1: {val_f1:.4f}")
        print(f"Overfitting Gap: {overfit_gap:.4f}")

        if overfit_gap > 0.2:
            print(f"  ⚠️ Overfitting detected!")

        # Save best model
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            torch.save(model.state_dict(), f'best_panns_model_fold{fold+1}.pt')
            patience_counter = 0
            print(f"  ✓ New best F1: {best_val_f1:.4f}")
        else:
            patience_counter += 1
            print(f"  No improvement ({patience_counter}/{EARLY_STOPPING_PATIENCE})")

            if patience_counter >= EARLY_STOPPING_PATIENCE:
                print(f"  Early stopping triggered!")
                break

    # Load best model
    model.load_state_dict(torch.load(f'best_panns_model_fold{fold+1}.pt'))
    val_loss, val_acc, val_precision, val_recall, val_f1, predictions, true_labels = evaluate(
        model, val_loader, criterion, device
    )

    print(f"\n{'='*80}")
    print(f"FOLD {fold + 1} FINAL RESULTS:")
    print(f"{'='*80}")
    print(f"Accuracy:  {val_acc:.4f}")
    print(f"Precision: {val_precision:.4f}")
    print(f"Recall:    {val_recall:.4f}")
    print(f"F1-Score:  {val_f1:.4f}")

    # Store results
    fold_results.append({
        'fold': fold + 1,
        'accuracy': val_acc,
        'precision': val_precision,
        'recall': val_recall,
        'f1': val_f1
    })

    # Classification report
    print("\nClassification Report:")
    print(classification_report(
        true_labels, predictions,
        target_names=label_encoder.classes_,
        digits=4,
        zero_division=0
    ))


HYPERPARAMETERS
Batch size: 16
Learning rate: 0.0001
Epochs: 30
Dropout: 0.5
PANNs: Pre-extracted embeddings (2048-dim)
Total samples: 903

STARTING 5-FOLD CROSS VALIDATION

FOLD 1/5
Train size: 722, Val size: 181

Epoch 1/30
Train Loss: 1.6213, Train Acc: 0.2175
Val Loss: 1.5661, Val Acc: 0.3702, Val F1: 0.3112
Overfitting Gap: -0.1527
  ✓ New best F1: 0.3112

Epoch 2/30
Train Loss: 1.5574, Train Acc: 0.3061
Val Loss: 1.5219, Val Acc: 0.4088, Val F1: 0.3462
Overfitting Gap: -0.1027
  ✓ New best F1: 0.3462

Epoch 3/30
Train Loss: 1.5273, Train Acc: 0.3421
Val Loss: 1.4883, Val Acc: 0.4254, Val F1: 0.3819
Overfitting Gap: -0.0833
  ✓ New best F1: 0.3819

Epoch 4/30
Train Loss: 1.4910, Train Acc: 0.3324
Val Loss: 1.4644, Val Acc: 0.4365, Val F1: 0.3986
Overfitting Gap: -0.1041
  ✓ New best F1: 0.3986

Epoch 5/30
Train Loss: 1.4607, Train Acc: 0.3753
Val Loss: 1.4395, Val Acc: 0.4420, Val F1: 0.4007
Overfitting Gap: -0.0666
  ✓ New best F1: 0.4007

Epoch 6/30
Train Loss: 1.4201
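
The patience logic in the training loop can be exercised without a model. The sketch below replays a sequence of validation F1 scores through the same rule: reset the counter on a strict improvement, stop after `patience` epochs without one. The scores are invented for illustration and are not results from this notebook; `run_early_stopping` is a hypothetical helper.

```python
def run_early_stopping(val_f1_per_epoch, patience=7):
    """Return (best_f1, best_epoch, stop_epoch) using 1-based epoch numbers."""
    best_f1, best_epoch, counter = 0.0, None, 0
    stop_epoch = len(val_f1_per_epoch)   # default: ran through every epoch
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best_f1:
            # Strict improvement: remember it and reset the patience counter
            best_f1, best_epoch, counter = f1, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                stop_epoch = epoch
                break
    return best_f1, best_epoch, stop_epoch

# Made-up scores: improvement stalls after epoch 4
fake_scores = [0.31, 0.35, 0.38, 0.40, 0.39, 0.40, 0.38, 0.37]
best_f1, best_epoch, stop_epoch = run_early_stopping(fake_scores, patience=3)
print(best_f1, best_epoch, stop_epoch)
```

Note that a tie (epoch 6 matching the best score) counts as no improvement under this rule, exactly as in the `val_f1 > best_val_f1` comparison of the training loop.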

# 12. FINAL RESULTS

In [None]:
print("\n" + "="*80)
print("5-FOLD CROSS VALIDATION SUMMARY")
print("="*80)

results_df = pd.DataFrame(fold_results)
print("\nResults per fold:")
print(results_df.to_string(index=False))

print("\n" + "="*80)
print("AVERAGE PERFORMANCE ACROSS ALL FOLDS:")
print("="*80)
avg_acc = results_df['accuracy'].mean()
avg_f1 = results_df['f1'].mean()
print(f"Accuracy:  {avg_acc:.4f} ± {results_df['accuracy'].std():.4f}")
print(f"Precision: {results_df['precision'].mean():.4f} ± {results_df['precision'].std():.4f}")
print(f"Recall:    {results_df['recall'].mean():.4f} ± {results_df['recall'].std():.4f}")
print(f"F1-Score:  {avg_f1:.4f} ± {results_df['f1'].std():.4f}")

# Save results
results_df.to_csv('panns_audio_cv_results.csv', index=False)
print("\n✓ Results saved to 'panns_audio_cv_results.csv'")

print("\n" + "="*80)
print("✅ PANNS AUDIO CLASSIFICATION COMPLETE!")
print("="*80)

print(f"\n📊 PERFORMANCE SUMMARY:")
print(f"  Audio (PANNs): {avg_f1:.1%}")
print(f"  Dataset size: {len(df)} samples")

print("\n💡 COMPARISON:")
print(f"  Lyrics (BERT): ~45-55% F1")
print(f"  MIDI (Orpheus): ~23% F1 (limited data)")
print(f"  Audio (PANNs): ~{avg_f1:.0%} F1")

if avg_f1 > 0.50:
    print("\n🎉 Audio modality performs BEST!")
    print("This confirms audio is the strongest single modality.")
elif avg_f1 > 0.40:
    print("\n✓ Audio performs well!")
    print("Comparable to lyrics modality.")
else:
    print("\n⚠️ Audio performance below expectation")
    print("May need more data or different preprocessing.")

print("\n📈 NEXT STEPS:")
print("  1. ✓ Lyrics modality (BERT) - DONE")
print("  2. ✓ MIDI modality (Orpheus) - DONE")
print("  3. ✓ Audio modality (PANNs) - DONE")
print("  4. ⏳ Multimodal Fusion - NEXT")
print("="*80)


5-FOLD CROSS VALIDATION SUMMARY

Results per fold:
 fold  accuracy  precision   recall       f1
    1  0.486188   0.469039 0.486188 0.454786
    2  0.441989   0.438838 0.441989 0.410194
    3  0.469613   0.469585 0.469613 0.443444
    4  0.461111   0.456505 0.461111 0.435834
    5  0.450000   0.439288 0.450000 0.396616

AVERAGE PERFORMANCE ACROSS ALL FOLDS:
Accuracy:  0.4618 ± 0.0172
Precision: 0.4547 ± 0.0152
Recall:    0.4618 ± 0.0172
F1-Score:  0.4282 ± 0.0241

✓ Results saved to 'panns_audio_cv_results.csv'

✅ PANNS AUDIO CLASSIFICATION COMPLETE!

📊 PERFORMANCE SUMMARY:
  Audio (PANNs): 42.8%
  Dataset size: 903 samples

💡 COMPARISON:
  Lyrics (BERT): ~45-55% F1
  MIDI (Orpheus): ~23% F1 (limited data)
  Audio (PANNs): ~43% F1

✓ Audio performs well!
Comparable to lyrics modality.

📈 NEXT STEPS:
  1. ✓ Lyrics modality (BERT) - DONE
  2. ✓ MIDI modality (Orpheus) - DONE
  3. ✓ Audio modality (PANNs) - DONE
  4. ⏳ Multimodal Fusion - NEXT
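
The spread values in the cross-validation summary come from pandas' `Series.std()`, which defaults to the sample standard deviation (`ddof=1`). A minimal check with the standard library, using the per-fold F1 scores from the results table above:

```python
import statistics

# Per-fold F1 scores copied from the "Results per fold" table
fold_f1 = [0.454786, 0.410194, 0.443444, 0.435834, 0.396616]

mean_f1 = statistics.mean(fold_f1)
sample_std = statistics.stdev(fold_f1)        # divides by n - 1, like pandas
population_std = statistics.pstdev(fold_f1)   # divides by n, like numpy's default

print(f"F1: {mean_f1:.4f} +/- {sample_std:.4f} (sample std)")
print(f"F1: {mean_f1:.4f} +/- {population_std:.4f} (population std)")
```

With only five folds the two conventions differ noticeably, so it helps to state which one a reported spread uses when comparing against the lyrics and MIDI results.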
