<a href="https://colab.research.google.com/github/cildiroyku/TransferLearningforAnimalSounds/blob/main/DLProjectSetupandTemplateforExperimentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer Learning Notebook

Instructions:

* This notebook depends on the project's GitHub repo, which includes a config file, the path to YAMNet, and embedding transformations for PyTorch. **Run Part 1 completely before starting to work on this notebook.**

* Part 2 is for demonstration purposes only on one sample audio clip. It was used to investigate the inner workings of YAMNet and understand how the embeddings need to be extracted and structured, to be able to process them. **Part 2 is optional to run.**

* Part 3 sets up a pipeline to extract embeddings. **Part 3 should be run before proceeding.**


## Part 1: Cloning the repo from GitHub, getting dependencies and model weights, and connecting to the GPU

No need to modify this section. If a GPU is not available, the model will run on the CPU.


In [1]:
!git clone https://github.com/cildiroyku/TransferLearningforAnimalSounds.git
%cd TransferLearningforAnimalSounds
!ls -lh yamnet.pth

Cloning into 'TransferLearningforAnimalSounds'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 29 (delta 4), reused 12 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (29/29), 13.33 MiB | 16.92 MiB/s, done.
Resolving deltas: 100% (4/4), done.
/content/TransferLearningforAnimalSounds
-rw-r--r-- 1 root root 15M Nov 10 14:58 yamnet.pth


Imports and configuration load

In [2]:
import yaml

# Load the project configuration (the context manager closes the file afterwards)
with open("configs/base.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["data"]["sample_rate"])

16000


In [3]:
!python test_config.py

Config loaded successfully!
Project name: transfer_learning_for_animal_sounds
Sample rate: 16000
Device preference: cuda

Verified folder: data/gtzan/
Verified folder: data/esc50/
Verified folder: outputs/embeddings/

GPU not detected - using CPU


In [44]:
!git clone https://github.com/w-hc/torch_audioset.git


Cloning into 'torch_audioset'...
remote: Enumerating objects: 269, done.
remote: Counting objects: ... [progress output truncated]

In [45]:
!ls torch_audioset


LICENSE    README.md  test	  tools		  visualization.py
notebooks  setup.py   tf_2_torch  torch_audioset


In [46]:
!find torch_audioset -maxdepth 2 -type f -name "*yamnet*.py"


torch_audioset/tf_2_torch/store_yamnet_pred_metadata_as_yml.py
torch_audioset/tf_2_torch/convert_yamnet.py


In [47]:
!realpath torch_audioset/tf_2_torch/convert_yamnet.py

/content/TransferLearningforAnimalSounds/torch_audioset/torch_audioset/torch_audioset/tf_2_torch/convert_yamnet.py


In [48]:
!find torch_audioset -type f -name "params.py"

torch_audioset/torch_audioset/params.py


In [49]:
!find torch_audioset -type f -name "model.py"


torch_audioset/torch_audioset/yamnet/model.py
torch_audioset/torch_audioset/vggish/model.py


In [50]:
import sys
sys.path.append("/content/TransferLearningforAnimalSounds/torch_audioset")

from torch_audioset.yamnet.model import yamnet as YamNet

In [54]:
%cd /content/TransferLearningforAnimalSounds


/content/TransferLearningforAnimalSounds


In [55]:
!ls configs


base.yaml


In [57]:
import torch, yaml

device = "cuda" if torch.cuda.is_available() else "cpu"

with open("configs/base.yaml") as f:
    cfg = yaml.safe_load(f)

model = YamNet().to(device)
checkpoint = torch.load("yamnet.pth", map_location=device)
model.load_state_dict(checkpoint)
model.eval()

print("YAMNet model ready on", device)


YAMNet model ready on cpu


## Part 2: Testing YAMNet on a real sample audio clip from the ESC-50 dataset

**What this section does**

* End-to-end flow from raw audio → embedding
* Fixing TensorFlow → PyTorch compatibility for embedding sizes
* Creating a custom function to extract YAMNet embeddings up to layer14
* Matching the embedding to the correct feature dimension

### Downloading a test sound to check for sample rate and waveform

In [137]:
import torch
import torchaudio
import torchaudio.transforms as T
import numpy as np

# Example environmental sound (dog bark)
url = "https://github.com/karoldvl/ESC-50/raw/master/audio/1-100032-A-0.wav"
waveform, sr = torchaudio.load(url)

print("Original sample rate:", sr, "| waveform shape:", waveform.shape)


Original sample rate: 44100 | waveform shape: torch.Size([1, 220500])


### Pre-processing the audio file imported above

In [138]:
target_sr = cfg["data"]["sample_rate"]

# Resample if needed
if sr != target_sr:
    waveform = torchaudio.functional.resample(waveform, sr, target_sr)

# Convert to mono and normalize
waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform / waveform.abs().max()

print("Processed waveform shape:", waveform.shape)  # [1, N]

Processed waveform shape: torch.Size([1, 80000])
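
The processed length follows directly from the two sample rates. The sketch below is only a sanity check, assuming the resampler returns roughly `num_samples * target_sr / orig_sr` samples, rounded up. The helper name `resampled_length` is hypothetical and not part of the project.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Expected number of samples after resampling, rounded up (integer ceiling)."""
    return -(-num_samples * target_sr // orig_sr)


# The ESC-50 clip above: 220500 samples at 44.1 kHz is 5 seconds of audio
duration_s = 220500 / 44100
expected = resampled_length(220500, orig_sr=44100, target_sr=16000)
print(duration_s, expected)  # 5.0 80000
```

A 5-second clip at 16 kHz therefore has 80000 samples, which matches the shape printed above.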


In [139]:
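# Parameters approximating YAMNet's front end (64 mel bins, 125-7500 Hz, 10 ms hop).
# Note: the reference YAMNet uses a 25 ms analysis window and log(mel + 0.001),
# so these features are close to, but not identical to, the original ones.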
mel_transform = T.MelSpectrogram(
    sample_rate=target_sr,
    n_fft=1024,
    hop_length=160,
    n_mels=64,
    f_min=125,
    f_max=7500,
)
db_transform = T.AmplitudeToDB(stype="power")

# Convert
mel_spec = mel_transform(waveform)
mel_spec_db = db_transform(mel_spec)

print("Mel-spectrogram shape:", mel_spec_db.shape)  # [1, 64, T]

Mel-spectrogram shape: torch.Size([1, 64, 501])
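
The time dimension of 501 can be checked by hand. The sketch below assumes a centered STFT (the input is padded so that frames are centered on multiples of the hop), in which case the frame count is `1 + num_samples // hop_length`. The helper name `num_stft_frames` is hypothetical.

```python
def num_stft_frames(num_samples: int, hop_length: int, n_fft: int = 1024,
                    center: bool = True) -> int:
    """Number of STFT frames for a signal of `num_samples` samples."""
    if center:
        # With centering, the padded signal yields one frame per hop plus one
        return 1 + num_samples // hop_length
    # Without centering, only windows that fit entirely inside the signal count
    return 1 + (num_samples - n_fft) // hop_length


frames = num_stft_frames(80000, hop_length=160)
print(frames)  # 501
```

With a hop of 160 samples at 16 kHz, each frame advances by 10 ms, so a 5-second clip gives about 500 frames.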


In [140]:
# YAMNet expects [frames, mel_bins], not batched [1, 64, T]
mel_spec_db = mel_spec_db.transpose(1, 2)  # → [1, T, 64]
mel_spec_db = mel_spec_db.squeeze(0)       # → [T, 64]

print("Final mel input shape for YAMNet:", mel_spec_db.shape)

Final mel input shape for YAMNet: torch.Size([501, 64])


In [141]:
mel_spec_db = mel_spec_db.unsqueeze(0).unsqueeze(0)
print("Final mel input shape for YAMNet:", mel_spec_db.shape)

Final mel input shape for YAMNet: torch.Size([1, 1, 501, 64])


In [132]:
print(model)

YAMNet(
  (layer1): Conv(
    (fused): CONV_BN_RELU(
      (conv): Conv2d_tf(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=SAME, bias=False)
      (bn): BatchNorm2d(32, eps=0.0001, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU()
    )
  )
  (layer2): SeparableConv(
    (depthwise_conv): CONV_BN_RELU(
      (conv): Conv2d_tf(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=SAME, groups=32, bias=False)
      (bn): BatchNorm2d(32, eps=0.0001, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU()
    )
    (pointwise_conv): CONV_BN_RELU(
      (conv): Conv2d_tf(32, 64, kernel_size=(1, 1), stride=(1, 1), padding=SAME, bias=False)
      (bn): BatchNorm2d(64, eps=0.0001, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU()
    )
  )
  (layer3): SeparableConv(
    (depthwise_conv): CONV_BN_RELU(
      (conv): Conv2d_tf(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=SAME, groups=64, bias=False)
      (bn): BatchNorm2d(

In [142]:
def extract_yamnet_embedding(model, mel_features):
    """
    Runs the YAMNet backbone up to layer14 and returns a 1024-D embedding.
    Input shape: [1, 1, frames, 64]
    """
    # Keep the input on the same device as the model (CPU or GPU)
    device = next(model.parameters()).device
    x = mel_features.to(device)

    with torch.no_grad():
        # layer1 ... layer14 are the convolutional backbone blocks
        for i in range(1, 15):
            x = getattr(model, f"layer{i}")(x)

        # Global average pooling across spatial dims
        x = torch.mean(x, dim=[2, 3])  # [batch, 1024]
    return x.squeeze(0).cpu()


### Passing through YAMNet to extract embeddings

In [144]:
clip_embedding = extract_yamnet_embedding(model, mel_spec_db)
print("Embedding shape:", clip_embedding.shape)


Embedding shape: torch.Size([1024])


## Part 3: Making a reusable pipeline to extract YAMNet embeddings

In [151]:
import torchaudio
import torchaudio.transforms as T
import torch

def yamnet_extract_from_file(model, file_path, cfg):
    """
    Full pipeline: loads audio, converts to log-mel, extracts 1024-D embedding.
    Returns a PyTorch tensor [1024].
    """
    target_sr = cfg["data"]["sample_rate"]

    # --- Load and resample ---
    waveform, sr = torchaudio.load(file_path)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
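    waveform = waveform.mean(dim=0, keepdim=True)  # convert to mono
    # Peak-normalize; the clamp avoids division by zero on silent clips
    waveform = waveform / waveform.abs().max().clamp(min=1e-8)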

    # --- Mel spectrogram ---
    mel_transform = T.MelSpectrogram(
        sample_rate=target_sr,
        n_fft=1024,
        hop_length=160,
        n_mels=64,
        f_min=125,
        f_max=7500,
    )
    db_transform = T.AmplitudeToDB(stype="power")
    mel_spec = mel_transform(waveform)
    mel_spec_db = db_transform(mel_spec)

    # --- Prepare for YAMNet: [1, 64, T] -> [1, 1, T, 64], on the model's device ---
    device = next(model.parameters()).device
    mel_spec_db = mel_spec_db.transpose(1, 2).unsqueeze(0).to(device)

    # --- Forward through YAMNet backbone (layer1 ... layer14) ---
    with torch.no_grad():
        x = mel_spec_db
        for i in range(1, 15):
            x = getattr(model, f"layer{i}")(x)
        x = torch.mean(x, dim=[2, 3])  # [batch, 1024]

    # Return on the CPU so callers can convert to NumPy directly
    return x.squeeze(0).cpu()

In [152]:
url = "https://github.com/karoldvl/ESC-50/raw/master/audio/1-100032-A-0.wav"
waveform, sr = torchaudio.load(url)

In [153]:
#Sample usage:

path = "/content/TransferLearningforAnimalSounds/data/esc50/1-100032-A-0.wav"
torch.save(waveform, "sample_waveform.pt")
emb = yamnet_extract_from_file(model, url, cfg)

print(emb.shape)  # torch.Size([1024])

torch.Size([1024])


## Part 4: Loading the Data for Extraction

Make sure you point the extraction function below to where you have the audio files from ESC-50 saved on your Google Drive.
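
Before copying or extracting, it may help to confirm that the folder exists and actually contains audio. The following is a minimal sketch using only the standard library. The helper `count_wav_files` is hypothetical, and the demo runs on a temporary directory with placeholder file names rather than on the real Drive path.

```python
import tempfile
from pathlib import Path


def count_wav_files(folder):
    """Return the number of .wav files directly inside `folder`; raise if it is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Audio folder not found: {folder}")
    return sum(1 for _ in folder.glob("*.wav"))


# Demo on a throwaway directory (stand-in for the ESC-50 audio folder)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1-100032-A-0.wav", "1-100038-A-14.wav", "notes.txt"):
        (Path(tmp) / name).touch()
    n_found = count_wav_files(tmp)

print(n_found)  # 2
```

In the notebook, the same check could be pointed at the mounted Drive folder; a `FileNotFoundError` here usually means the path does not match the Drive layout.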

In [180]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [181]:
!cp -r "/content/drive/MyDrive/DeepLearning/ESC-50-master/audio" /content/audio

cp: cannot stat '/content/drive/MyDrive/DeepLearning/ESC-50-master/audio': No such file or directory


## Part 5: Creating a Pipeline That Processes All Audio Clips in Batch

Building a function that:

* walks through a dataset folder (e.g. data/esc50),

* extracts one 1024-D embedding per audio file,

* saves each embedding to a structured directory (outputs/embeddings/{dataset_name}/),

* writes a summary .csv or .npy file for downstream training.
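
For the label part of this plan, note that ESC-50 file names follow the pattern `{fold}-{clip_id}-{take}-{target}.wav`, so the class index is the last field and the first field is the cross-validation fold. A minimal parsing sketch, where `Esc50Name` and `parse_esc50_name` are hypothetical names used only for illustration:

```python
import os
from typing import NamedTuple


class Esc50Name(NamedTuple):
    fold: int      # cross-validation fold (1-5)
    clip_id: str   # ID of the original source clip
    take: str      # letter distinguishing fragments of the same source clip
    target: int    # class index (0-49)


def parse_esc50_name(file_path: str) -> Esc50Name:
    """Split an ESC-50 file name into its four dash-separated fields."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    fold, clip_id, take, target = stem.split("-")
    return Esc50Name(int(fold), clip_id, take, int(target))


info = parse_esc50_name("/content/audio/1-100032-A-0.wav")
print(info.fold, info.target)  # 1 0
```

Keeping the fold alongside the target also makes it possible to reuse the dataset's predefined folds later, instead of a random split.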

In [184]:
import os
import numpy as np
from tqdm import tqdm
import glob

def extract_embeddings_from_folder(model, folder_path, cfg, save_dir):
    os.makedirs(save_dir, exist_ok=True)
    embeddings, labels = [], []

    # Collect every .wav file directly inside folder_path (sorted for a reproducible order)
    wav_files = sorted(glob.glob(os.path.join(folder_path, "*.wav")))
    print(f"🔍 Found {len(wav_files)} WAV files in: {folder_path}")

    if len(wav_files) == 0:
        print("⚠️ No .wav files found: check your folder path!")
        return

    for file_path in tqdm(wav_files):
        emb = yamnet_extract_from_file(model, file_path, cfg)
        embeddings.append(emb.cpu().numpy())
        # ESC-50 names are {fold}-{clip_id}-{take}-{target}.wav: the class is the
        # last field (the first field is the fold, not the label)
        stem = os.path.splitext(os.path.basename(file_path))[0]
        labels.append(stem.split("-")[-1])

    embeddings = np.stack(embeddings)
    np.save(os.path.join(save_dir, "embeddings.npy"), embeddings)
    np.save(os.path.join(save_dir, "labels.npy"), np.array(labels))
    print(f"✅ Saved embeddings: {embeddings.shape} | Labels: {len(labels)}")

In [185]:
extract_embeddings_from_folder(
    model,
    "/content/drive/MyDrive/DeepLearning/ESC-50-master/audio",
    cfg,
    "outputs/embeddings/esc50"
)

🔍 Found 2000 WAV files in: /content/drive/MyDrive/DeepLearning/ESC-50-master/audio


100%|██████████| 2000/2000 [01:56<00:00, 17.16it/s]

✅ Saved embeddings: (2000, 1024) | Labels: 2000





## Part 6: Loading embeddings as predictors and labels



In [192]:
import numpy as np

X = np.load("outputs/embeddings/esc50/embeddings.npy")
y = np.load("outputs/embeddings/esc50/labels.npy")

print(X.shape, y.shape)

(2000, 1024) (2000,)


In [193]:
#Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

## Part 7: Training a Classifier - Logistic Regression

### Normalizing the features

In [194]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Training and testing the model
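
This step is not filled in yet. One possible shape for it is sketched below with scikit-learn's `LogisticRegression`. To stay self-contained, the sketch uses synthetic Gaussian clusters as a stand-in for the YAMNet embeddings; all `*_demo` names, the cluster sizes, and `max_iter=1000` are assumptions, not settings from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embeddings: 5 well-separated classes in 32 dimensions
rng = np.random.default_rng(0)
n_classes, n_per_class, dim = 5, 40, 32
centers = rng.normal(scale=3.0, size=(n_classes, dim))
X_demo = np.concatenate([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y_demo = np.repeat(np.arange(n_classes), n_per_class)

# Same recipe as in the notebook: stratified split, then scaling fitted on train only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Multiclass logistic regression on the scaled features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr_s, y_tr)

y_pred = clf.predict(X_te_s)
test_acc = accuracy_score(y_te, y_pred)
print(f"Test accuracy: {test_acc:.3f}")
```

In the notebook itself, the synthetic arrays would be replaced by `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` from the cells above.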

## Part 8: Visualizing the embedding space

The purpose here is to see whether the sound categories form visible clusters in the embedding space.
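
One simple way to approach this is to project the embeddings to two dimensions with PCA and color each point by its class. The sketch below assumes scikit-learn and matplotlib are available and uses synthetic clusters in place of the real embeddings. The `*_demo` names and the output file name `embedding_pca.png` are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line in Colab to plot inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the embeddings: 5 classes, 40 points each, 32 dimensions
rng = np.random.default_rng(0)
n_classes, n_per_class, dim = 5, 40, 32
centers = rng.normal(scale=3.0, size=(n_classes, dim))
X_demo = np.concatenate([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y_demo = np.repeat(np.arange(n_classes), n_per_class)

# Linear projection onto the two directions of largest variance
coords = PCA(n_components=2).fit_transform(X_demo)

# One scatter call per class so that each class gets its own color and legend entry
fig, ax = plt.subplots(figsize=(6, 5))
for label in np.unique(y_demo):
    mask = y_demo == label
    ax.scatter(coords[mask, 0], coords[mask, 1], s=12, label=f"class {label}")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Embeddings projected to 2-D with PCA")
ax.legend()
fig.savefig("embedding_pca.png", dpi=150)
```

PCA is linear and may not separate all 50 ESC-50 classes cleanly in two dimensions; a nonlinear method such as t-SNE could be swapped in for `PCA` with the same plotting code.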

## Part 9: Transfer Learning / Fine-Tuning

## Part 10: