<a href="https://colab.research.google.com/github/aniket-alt/Clustering_Assignment/blob/main/Task_(i)Audio_Clustering_with_ImageBind.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task (i): Audio Clustering with ImageBind Embeddings

Clustering audio is difficult because raw waveforms are noisy and high-dimensional. In this task, we use ImageBind's audio encoder, which converts each clip into a mel spectrogram and encodes that time-frequency representation into a fixed-length embedding. Whether the input is the mechanical rhythm of a car engine or the bark of a dog, ImageBind maps it into a vector space where acoustically similar sounds are expected to sit close together. Clustering those embeddings shows that the same model used for images can also organize non-visual data.
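To make "similar sounds sit close together" concrete, here is a minimal sketch of pairwise cosine similarity. The helper name `cosine_similarity_matrix` and the toy 3-dimensional vectors are illustrative assumptions, not ImageBind outputs; real embeddings have far more dimensions.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Hypothetical toy vectors standing in for audio embeddings (not real model output).
toy = np.array([
    [0.9, 0.1, 0.0],  # "bark"-like
    [0.8, 0.2, 0.1],  # "bark"-like
    [0.1, 0.9, 0.3],  # "engine"-like
    [0.0, 0.8, 0.4],  # "engine"-like
])

sim = cosine_similarity_matrix(toy)
print(np.round(sim, 2))
```

In this toy setup, the two "bark"-like rows score much higher with each other than with either "engine"-like row. That is the structure a clustering algorithm relies on.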

In [3]:
# --- STEP 1: INSTALLATION & ENVIRONMENT SETUP ---
import os
%cd /content
!rm -rf ImageBind_Audio
!git clone https://github.com/facebookresearch/ImageBind.git ImageBind_Audio
%cd ImageBind_Audio
!pip install pytorchvideo timm fvcore -q
!pip install . --no-deps -q

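# Compatibility shim: pytorchvideo imports torchvision.transforms.functional_tensor,
# which was removed in recent torchvision releases. Alias it to transforms.functional.
import sys
import torchvision
import torchvision.transforms.functional as F
torchvision.transforms.functional_tensor = F
# Register the alias with the import system so `import ...functional_tensor` resolves.
sys.modules.setdefault("torchvision.transforms.functional_tensor", F)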

import torch
import numpy as np
from sklearn.cluster import KMeans
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# --- STEP 2: LOAD MODEL ---
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# --- STEP 3: PREPARE AUDIO DATA ---
os.makedirs("task_i_audio", exist_ok=True)
# Download one bark clip and one engine clip from the ImageBind repository assets
!wget -O task_i_audio/bark1.wav https://raw.githubusercontent.com/facebookresearch/ImageBind/main/.assets/dog_audio.wav -q
!wget -O task_i_audio/engine1.wav https://raw.githubusercontent.com/facebookresearch/ImageBind/main/.assets/car_audio.wav -q
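# Each file is listed twice, so the paired embeddings are effectively identical and
# the two-cluster split is trivial; swap in distinct recordings for a real test.
audio_paths = [
    "task_i_audio/bark1.wav",
    "task_i_audio/bark1.wav",
    "task_i_audio/engine1.wav",
    "task_i_audio/engine1.wav",
]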

# --- STEP 4: EXTRACT EMBEDDINGS & CLUSTER ---
with torch.no_grad():
    inputs = {ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device)}
    embeddings = model(inputs)
    aud_emb = embeddings[ModalityType.AUDIO].cpu().numpy()

# Cluster into 2 groups (Animal Sounds vs Mechanical Sounds)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(aud_emb)

# --- STEP 5: RESULTS ---
names = ["Bark Audio A", "Bark Audio B", "Engine Audio A", "Engine Audio B"]
print("\n--- Audio Clustering Results (Task I) ---")
for i, name in enumerate(names):
    print(f"{name} -> assigned to Cluster {labels[i]}")

/content
Cloning into 'ImageBind_Audio'...
remote: Enumerating objects: 187, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 187 (delta 84), reused 54 (delta 53), pack-reused 67 (from 3)
Receiving objects: 100% (187/187), 2.65 MiB | 8.01 MiB/s, done.
Resolving deltas: 100% (92/92), done.
/content/ImageBind_Audio
  Preparing metadata (setup.py) ... done
  Building wheel for imagebind (setup.py) ... done
Downloading imagebind weights to .checkpoints/imagebind_huge.pth ...


100%|██████████| 4.47G/4.47G [01:02<00:00, 76.6MB/s]



--- Audio Clustering Results (Task I) ---
Bark Audio A -> assigned to Cluster 0
Bark Audio B -> assigned to Cluster 0
Engine Audio A -> assigned to Cluster 1
Engine Audio B -> assigned to Cluster 1
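
K-means cluster IDs are arbitrary: a different seed could just as well label the barks as Cluster 1 and the engines as Cluster 0. A minimal sketch of a check that ignores the ID numbering and compares only the grouping is shown below; `same_partition` is a hypothetical helper, not part of scikit-learn or ImageBind.

```python
import numpy as np

def same_partition(labels_a, labels_b) -> bool:
    """True if both labelings group the items identically, ignoring cluster IDs."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # Co-membership matrices: entry (i, j) is True when items i and j share a cluster.
    return bool(np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :]))

# Labels as printed above, and the intended grouping (bark, bark, engine, engine)
# written with the IDs swapped to show that the numbering does not matter.
predicted = [0, 0, 1, 1]
intended = [1, 1, 0, 0]
print(same_partition(predicted, intended))
```

Comparing co-membership rather than raw label values keeps the check stable across reruns with different `random_state` values.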
