# Bird Call Classification

The goal of this exercise is to classify audio samples of two different bird calls using a CNN. But how can we classify audio with CNNs, which are generally used for images?

Raw audio is like hearing a conversation in an alien language—wavy lines that don’t make much sense to the eye or the ear. What if you could see the sound in a way that reveals its hidden patterns?

That's where mel-spectrograms come in. They are like a picture of sound that reveals its important features: they turn a chaotic mess of sound into a neat, colorful grid that shows how energy is distributed across different frequencies over time. Because the mel scale reflects how we humans naturally perceive pitch, mel-spectrograms make it much easier for a Convolutional Neural Network (CNN) to spot differences in things like speech, music, or even a bird tweeting.

So, why do we transform audio into mel-spectrograms? Because CNNs are visual creatures—they thrive on images! By converting audio into something visual, like a spectrogram, we let our CNN use its superpowers to analyze and classify those sounds with ease.
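
To make the "how humans hear" part concrete, here is a minimal sketch of the mel scale itself. It assumes the widely used HTK-style conversion formula; the function names are hypothetical and are not part of the svlearn package. It shows that bands equally spaced in mels are narrow at low frequencies and wide at high frequencies.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float, f_max: float) -> list:
    """Return n_bands + 2 edge frequencies in Hz, equally spaced on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_bands + 2)]


edges = mel_band_edges(n_bands=8, f_min=0.0, f_max=8000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

print("band edges (Hz):", [round(e) for e in edges])
print("band widths (Hz):", [round(w) for w in widths])
```

The widths grow with frequency, so a mel-spectrogram spends more of its rows on the low-frequency range, where our hearing resolves pitch more finely.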

In [1]:
# %run supportvectors-common.ipynb

In [None]:
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

import joblib
import IPython

from svlearn.config.configuration import ConfigurationMixin
from svlearn.train.simple_trainer import train_simple_network
from svlearn.bird_call.birdcall_cnn import BirdCallCNN
from svlearn.bird_call.spectrogram_dataset import SpectrogramDataset
from svlearn.bird_call.preprocess import Preprocessor
from svlearn.train.train_utils import split_dataset
from svlearn.common.utils import ensure_directory

from sklearn.metrics import accuracy_score

In [None]:
config = ConfigurationMixin().load_config()
data_dir = config['bird-call-classification']['data']
results_dir = config['bird-call-classification']['results']

## Preprocessing the Audio files

- **Step 1: Loading and Resampling Audio Files**
We begin by loading all the raw audio files from the specified directory. Since audio files can have different sample rates, we resample each one to a common sample rate to ensure consistency across the dataset.

- **Step 2: Trimming or Padding Audio Files**
Next, we standardize the duration of the audio files. Files longer than the target duration are trimmed, and those shorter are padded with silence to match the desired length. This ensures that all audio inputs have the same length, which is essential for uniform processing.

- **Step 3: Converting to Mel-Spectrograms**
Once the audio files are prepared, we convert each one into a mel-spectrogram. This transformation allows us to represent the audio in a format that a CNN can work with, as the network processes images, and mel-spectrograms provide a visual representation of sound.

- **Step 4: Saving Mel-Spectrograms as Numpy Arrays**
After converting the audio to mel-spectrograms, we save these spectrograms as numpy arrays in a separate processed directory. This step organizes the data into a format that is easy to load and use during model training.

- **Step 5: Creating an Index File**
Finally, we create an index file that maps each raw audio file to its corresponding processed spectrogram. This index file also includes the labels for each audio sample. It will be used by our custom dataset loader to associate each input with its label during the training process.
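
The steps above can be sketched on synthetic data. This is only an illustration under stated assumptions, not the actual Preprocessor implementation: the sample rate, the 5-second target duration, the file names, and the CSV index layout are all hypothetical, and the mel-spectrogram conversion of Step 3 is skipped so that the example needs only NumPy.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np


def fix_length(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Trim a 1-D signal, or right-pad it with zeros (silence), to target_len samples."""
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)))


sample_rate = 22050            # assumed common sample rate
target_len = 5 * sample_rate   # assumed 5-second clips

# Two fake clips standing in for already-resampled audio (Step 1).
clips = {
    "short.wav": ("common_cuckoo", np.ones(2 * sample_rate, dtype=np.float32)),
    "long.wav": ("song_sparrow", np.ones(8 * sample_rate, dtype=np.float32)),
}

with tempfile.TemporaryDirectory() as tmp:
    processed_dir = Path(tmp)
    rows = []
    for name, (label, clip) in clips.items():
        fixed = fix_length(clip, target_len)             # Step 2: trim or pad
        out_path = processed_dir / (Path(name).stem + ".npy")
        np.save(out_path, fixed)                         # Step 4 (a real pipeline saves the spectrogram)
        rows.append({"audio": name, "processed": out_path.name, "label": label})

    # Step 5: index file mapping raw audio -> processed array -> label.
    with open(processed_dir / "index.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["audio", "processed", "label"])
        writer.writeheader()
        writer.writerows(rows)

    index_text = (processed_dir / "index.csv").read_text()

print(index_text)
```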


In [None]:
to_preprocess = True # if running for the first time set this to True

if to_preprocess:
    preprocessor = Preprocessor()
    preprocessor.fit_transform(data_dir)
    joblib.dump(preprocessor, f"{results_dir}/preprocessor.joblib")

else:
    preprocessor = joblib.load(f"{results_dir}/preprocessor.joblib")

## Prepare for Training

### Create a custom dataset
Use the processed mel-spectrograms and the index file to create a custom dataset. This dataset will load the spectrograms along with their associated labels. Once the dataset is ready, split it into two sets: one for training the model and another for evaluating its performance. This ensures that the model is trained on one subset of data and tested on unseen data for validation.

In [5]:
dataset = SpectrogramDataset(data_dir)
train_dataset , eval_dataset = split_dataset(dataset, eval_split=0.2)
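
The internals of split_dataset are not shown here. As a rough sketch of what an 80/20 split can look like, the example below uses PyTorch's random_split on a toy dataset; the toy tensors and the fixed seed are assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 10 fake "spectrograms" of shape (1, 8, 8) with binary labels.
features = torch.randn(10, 1, 8, 8)
labels = torch.tensor([0, 1] * 5)
toy_dataset = TensorDataset(features, labels)

eval_split = 0.2
n_eval = int(len(toy_dataset) * eval_split)
n_train = len(toy_dataset) - n_eval

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(42)
train_subset, eval_subset = random_split(
    toy_dataset, [n_train, n_eval], generator=generator
)

print(len(train_subset), len(eval_subset))
```

Fixing the seed is a design choice: it keeps the evaluation set identical between runs, so scores from different experiments remain comparable.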

### Sample Bird Calls

#### Common Cuckoo

Let's take samples from our dataset and listen to the cuckoo and the sparrow. 

In [None]:
idx = 2

audio_path = dataset.get_audio_path(idx)
IPython.display.Audio(audio_path)

Next, let's view what our CNN model would see by plotting the mel-spectrogram. When we compare the two spectrograms, do we see any visual difference?

In [None]:
dataset.plot_spectrogram(idx , title="Mel-Spectrogram (Common Cuckoo)")

#### Song Sparrow

In [None]:
idx = 604

audio_path = dataset.get_audio_path(idx)

IPython.display.Audio(audio_path)

In [None]:
dataset.plot_spectrogram(idx, title="Mel-Spectrogram (Song Sparrow)")

We can observe that the sparrow's call occupies much higher frequencies than the cuckoo's. We can also see the distinct, strong pattern that the cuckoo call creates in the spectrogram.

### Instantiate the CNN Model
Let's set up a Convolutional Neural Network (CNN) designed to classify the audio samples into two target classes. This involves defining the network architecture, specifying the input size to match the shape of the mel-spectrograms, and setting the output layer to have two neurons, one for each class.

In [None]:
model = BirdCallCNN(2)
optimizer = AdamW(model.parameters(), lr=0.001)

dataset = SpectrogramDataset(data_dir)
train_dataset , eval_dataset = split_dataset(dataset, eval_split=0.2)

print(train_dataset[0][0].shape)

train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(dataset=eval_dataset, batch_size=32, shuffle=False)  # evaluation data does not need shuffling
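
The architecture of BirdCallCNN is not shown in this notebook. To illustrate the idea of a single-channel spectrogram going in and two class scores coming out, here is a small hypothetical network (TinySpectrogramCNN is an invented name, not the real model). The input shape (1, 128, 431) is an assumption taken from the comment in the ResNet18 section below.

```python
import torch
from torch import nn


class TinySpectrogramCNN(nn.Module):
    """Hypothetical stand-in for a small spectrogram classifier (not BirdCallCNN)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the frequency and time axes to 1x1
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                   # (batch, 16, 1, 1)
        return self.classifier(x.flatten(1))   # (batch, num_classes) logits


toy_model = TinySpectrogramCNN(num_classes=2)
toy_batch = torch.randn(4, 1, 128, 431)  # assumed (batch, channel, mel bins, time frames)
toy_logits = toy_model(toy_batch)
print(toy_logits.shape)
```

The adaptive pooling layer makes this sketch independent of the exact spectrogram width, which is convenient if the clip duration changes.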



We reuse the same train_simple_network function from the previous exercise.

In [None]:
ensure_directory(results_dir)
result = train_simple_network(
                        model=model,
                        optimizer=optimizer,
                        loss_func=nn.CrossEntropyLoss(),
                        train_loader=train_loader,
                        test_loader=val_loader,
                        epochs=5,
                        score_funcs={'accuracy': accuracy_score},
                        classify=True,
                        checkpoint_file=f"{results_dir}/cnn-model.pt")

In [None]:
result

## Testing our Trained Model

Let's test our model with some audio files that the model has not yet seen. 

In [None]:
import IPython
import os

test_dir = f"{data_dir}/test"
audio_files = os.listdir(test_dir)


file_path = os.path.join(test_dir , audio_files[0])

IPython.display.Audio(file_path)


### Preprocessing for inference
We use the preprocessor that we fitted during training to transform the audio file into a spectrogram and to get its label.

In [None]:
spectrogram , label = preprocessor.transform_audio(file_path , 'common_cuckoo');

Next let's load the best model we have from the checkpoint files we saved while training.

In [None]:
with open(f"{results_dir}/cnn-model.pt", 'rb') as f:
    checkpoint = torch.load(f)

We load the model's learned parameters from the saved checkpoint's model_state_dict. We then set the model to evaluation mode, so that dropout is disabled and batch normalization uses its running statistics instead of per-batch statistics.

In [16]:
model.load_state_dict(checkpoint['model_state_dict'])
model.eval();
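
A small standalone sketch of what evaluation mode changes, using a toy layer stack rather than the trained model (the layer sizes are arbitrary assumptions). In training mode, dropout draws a fresh random mask on every call; in evaluation mode, the output is deterministic.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_layers = nn.Sequential(nn.BatchNorm1d(4), nn.Dropout(p=0.5))
x = torch.randn(8, 4)

toy_layers.train()
train_a, train_b = toy_layers(x), toy_layers(x)    # dropout masks differ between calls

toy_layers.eval()
with torch.no_grad():
    eval_a, eval_b = toy_layers(x), toy_layers(x)  # running statistics, no dropout

print("train outputs identical:", torch.equal(train_a, train_b))
print("eval outputs identical:", torch.equal(eval_a, eval_b))
```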

Let's pass the spectrogram to the model and see its prediction.

In [None]:

with torch.no_grad():  # inference only, so gradients are not needed
    y_hat = model(torch.tensor(spectrogram).unsqueeze(0))

prediction = preprocessor.labels_to_names[int(y_hat.argmax())]
prediction


Sounds good! Try the same for Song Sparrow!
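
The argmax only tells us which class won. To see how confident the model is, the raw outputs (logits) can be turned into probabilities with a softmax. The sketch below uses made-up logits and a hypothetical index-to-name mapping; the real mapping lives in the fitted preprocessor and its class order may differ.

```python
import torch

# Hypothetical logits for a single clip, shape (1, num_classes).
example_logits = torch.tensor([[2.0, -1.0]])
# Hypothetical mapping; the real one comes from the fitted preprocessor.
example_names = {0: "common_cuckoo", 1: "song_sparrow"}

probs = torch.softmax(example_logits, dim=1).squeeze(0)
for idx, p in enumerate(probs.tolist()):
    print(f"{example_names[idx]}: {p:.3f}")

predicted_name = example_names[int(probs.argmax())]
print("prediction:", predicted_name)
```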

## Finetuning Pretrained Models

### ResNet18

In [None]:
import torchvision.models as models

# Load the pretrained ResNet18 model
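# `pretrained=True` is deprecated in torchvision 0.13+; the `weights` argument replaces it
resnet18 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)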

# Replace the first convolution layer so it accepts a single-channel (grayscale) input.
# The original conv1 expects 3-channel images such as (3, 224, 224); our spectrograms are (1, 128, 431).
# Note that the new layer starts from randomly initialized weights.
resnet18.conv1 = nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

# Replace the final fully connected (FC) layer so that it outputs one score per class.
# We have two bird-call classes here; adjust num_classes for your own dataset.
num_classes = 2
resnet18.fc = nn.Linear(resnet18.fc.in_features, num_classes)

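# Only the new FC layer's parameters are given to the optimizer,
# so all other layers (including the new conv1) are not updated during training.
optimizer = torch.optim.Adam(resnet18.fc.parameters(), lr=0.001)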
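
Replacing conv1 with a fresh layer discards its pretrained weights. One commonly used alternative, which this notebook does not use, is to initialize the single-channel kernel as the sum of the pretrained RGB kernels. The sketch below demonstrates the idea on a stand-in layer with the same shape as ResNet18's conv1, so it runs without downloading any weights; the variable names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained 3-channel first layer (same shape as ResNet18's conv1).
rgb_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Single-channel replacement whose kernel is the sum of the three RGB kernels.
gray_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    gray_conv.weight.copy_(rgb_conv.weight.sum(dim=1, keepdim=True))

gray = torch.randn(2, 1, 128, 431)  # assumed spectrogram batch shape
with torch.no_grad():
    out_gray = gray_conv(gray)
    out_rgb = rgb_conv(gray.repeat(1, 3, 1, 1))  # same image copied to R, G and B

print(torch.allclose(out_gray, out_rgb, atol=1e-4))
```

With this initialization, the single-channel layer responds to a spectrogram as the original layer would respond to the same spectrogram copied into all three color channels. Whether that helps on this dataset would need to be checked experimentally.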

In [None]:
result = train_simple_network(
                        model=resnet18,
                        optimizer=optimizer,
                        loss_func=nn.CrossEntropyLoss(),
                        train_loader=train_loader,
                        test_loader=val_loader,
                        epochs=10,
                        score_funcs={'accuracy': accuracy_score},
                        classify=True,
                        checkpoint_file=f"{results_dir}/resnet-model-01.pt")

In [None]:
result

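We observe that our simpler BirdCallCNN seems to perform much better than the fine-tuned ResNet18 in this case. Note that in this setup only the final FC layer is optimized, while the replaced conv1 layer keeps its random initial weights, which may contribute to the gap.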