# AudioCLIP Demo

Authored by [Andrey Guzhov](https://github.com/AndreyGuzhov)

This notebook covers common use cases of AudioCLIP and provides the typical workflow.
Below, you will find information on:

0. [Binary Assets](#Downloading-Binary-Assets)
1. [Required imports](#Imports-&-Constants)
2. [Model Instantiation](#Model-Instantiation)
3. [Data Transformation](#Audio-&-Image-Transforms)
4. Data Loading
    * [Audio](#Audio-Loading)
    * [Images](#Image-Loading)
5. [Preparation of the Input](#Input-Preparation)
6. [Acquisition of the Output](#Obtaining-Embeddings)
7. [Normalization of Embeddings](#Normalization-of-Embeddings)
8. [Calculation of Logit Scales](#Obtaining-Logit-Scales)
9. [Computation of Similarities](#Computing-Similarities)
10. Performing Tasks
    1. [Classification](#Classification)
        1. [Audio](#Audio)
        2. [Images](#Images)
    2. [Querying](#Querying)
        1. [Audio by Text](#Audio-by-Text)
        2. [Images by Text](#Images-by-Text)
        3. [Audio by Images](#Audio-by-Images)
        4. [Images by Audio](#Images-by-Audio)

## Imports & Constants

In [20]:
import os
import sys
import glob

import librosa
import librosa.display

import simplejpeg
import numpy as np

import torch
import torchvision as tv

import matplotlib.pyplot as plt

from PIL import Image
from IPython.display import Audio, display

sys.path.append(os.path.abspath(f'{os.getcwd()}/..')) # .ipynb 파일이 위치한 경로

from model import AudioCLIP
from utils.transforms import ToTensor1D


torch.set_grad_enabled(False)

MODEL_FILENAME = 'AudioCLIP-Full-Training.pt'
# derived from ESResNeXt
SAMPLE_RATE = 44100
# derived from CLIP
IMAGE_SIZE = 224
IMAGE_MEAN = 0.48145466, 0.4578275, 0.40821073
IMAGE_STD = 0.26862954, 0.26130258, 0.27577711

labels_path = '/home/broiron/Desktop/AudioCLIP/data/label/labels_8.txt' # test를 위한 label
with open(labels_path, 'r') as file:
    labels = [line.strip() for line in file]
# LABELS = ['cat', 'thunderstorm', 'coughing', 'alarm clock', 'car horn'] # audio 개수와 일치하지 않아도 되는건가?, 학습 시에는 맞춰줘야 함(train /val split 때문에)
# print(type(label))

device = "cuda" if torch.cuda.is_available() else "cpu"

## Model Instantiation

In [21]:
aclp = AudioCLIP(pretrained=f'../assets/{MODEL_FILENAME}')

## Audio & Image Transforms

In [22]:
audio_transforms = ToTensor1D()

## Audio Loading
Audio samples are drawn from the [ESC-50](https://github.com/karolpiczak/ESC-50) dataset.

In [23]:
paths_to_audio = glob.glob('../data/test_2/*.wav') # audio 파일이 위치한 경로

audio = list()
for path_to_audio in paths_to_audio:
    track, _ = librosa.load(path_to_audio, sr=SAMPLE_RATE, dtype=np.float32)

    # compute spectrograms using trained audio-head (fbsp-layer of ESResNeXt)
    # thus, the actual time-frequency representation will be visualized
    spec = aclp.audio.spectrogram(torch.from_numpy(track.reshape(1, 1, -1)))
    spec = np.ascontiguousarray(spec.numpy()).view(np.complex64)
    pow_spec = 10 * np.log10(np.abs(spec) ** 2 + 1e-18).squeeze()

    audio.append((track, pow_spec))

print(paths_to_audio) # 어떤 순서로 loading 되는지는 잘 모르겠음
print(len(paths_to_audio))

['../data/test_2/toilet flushing_102985_segment_3.wav', '../data/test_2/people sobbing_107583_segment_3.wav', '../data/test_2/air conditioning noise_389_segment_0.wav', '../data/test_2/railroad car, train wagon_73571_segment_2.wav', '../data/test_2/people whispering_13617_segment_2.wav', '../data/test_2/pumping water_37963_segment_4.wav', '../data/test_2/children shouting_31622_segment_1.wav', '../data/test_2/race car, auto racing_48786_segment_2.wav']
8


## Input Preparation

In [24]:
# AudioCLIP handles raw audio on input, so the input shape is [batch x channels x duration]
audio = torch.stack([audio_transforms(track.reshape(1, -1)) for track, _ in audio])
# text
text = [[label] for label in labels]
print(text)

[['air conditioning noise'], ['children shouting'], ['people sobbing'], ['people whispering'], ['pumping water'], ['race car, auto racing'], ['railroad car, train wagon'], ['toilet flushing']]


## Obtaining Embeddings
For the sake of clarity, all three modalities are processed separately.

In [25]:
# for text encoding 
from model.model_final import FrozenCLIPTextEmbedder, Mapping_Model # text encoder
text_encoder = FrozenCLIPTextEmbedder(version='RN50', device=device)

#text_features = torch.stack([text_encoder.encode([label]).to(device).float() for label in LABELS])
#text_embeddings = torch.stack([text_encoder.encode([label]).to(device).float() for label in labels_batch])

In [26]:
# AudioCLIP's output: Tuple[Tuple[Features, Logits], Loss]
# Features = Tuple[AudioFeatures, ImageFeatures, TextFeatures]
# Logits = Tuple[AudioImageLogits, AudioTextLogits, ImageTextLogits]

((audio_features, _, _), _), _ = aclp(audio=audio) # audio embedding 
print(audio_features.shape)
# ((_, _, text_features), _), _ = aclp(text=text) # text embedding
text_embeddings = torch.stack(text_encoder.encode([text]).to(device).float())
for i in text_embeddings: 
    print(i.shape)

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 2048, 1, 1])

## Normalization of Embeddings
The AudioCLIP's output is normalized using L<sub>2</sub>-norm

In [None]:
audio_features = audio_features / torch.linalg.norm(audio_features, dim=-1, keepdim=True)
print(f'nomalized audio embedding shape: {audio_features.shape}') # normalization 한다고 해서 차원이 달라지진 않음
text_features = text_features / torch.linalg.norm(text_features, dim=-1, keepdim=True)
print(text_features)

nomalized audio embedding shape: torch.Size([10, 1024])
tensor([[-0.0054,  0.0168, -0.0171,  ..., -0.0010, -0.0028,  0.0119],
        [ 0.0124, -0.0051, -0.0022,  ..., -0.0160,  0.0204, -0.0063],
        [-0.0062, -0.0080, -0.0636,  ..., -0.0152,  0.0046, -0.0099],
        [-0.0233, -0.0438, -0.0098,  ..., -0.0057, -0.0173,  0.0046],
        [ 0.0153, -0.0070, -0.0421,  ...,  0.0153, -0.0002, -0.0064]])


## Obtaining Logit Scales
Outputs of the text-, image- and audio-heads are made consistent using dedicated scaling terms for each pair of modalities.
The scaling factors are clamped between 1.0 and 100.0.

In [None]:
# audio와 text만 고려
scale_audio_text = torch.clamp(aclp.logit_scale_at.exp(), min=1.0, max=100.0)
# print(f'scaled embedding: {scale_audio_text}')
print(scale_audio_text)

tensor(100.)


## Computing Similarities
Similarities between different representations of a same concept are computed using [scaled](#Obtaining-Logit-Scales) dot product (cosine similarity).

In [None]:
# audio와 text 간의 similality 계산
logits_audio_text = scale_audio_text * audio_features @ text_features.T
print(logits_audio_text)
print(logits_audio_text.dim())

tensor([[ 3.7929, -0.5249,  3.7583,  1.0434,  6.3707],
        [ 7.0164,  0.4689, -0.3424, -0.4053,  3.7883],
        [ 9.2904, -1.1262, -5.1571, -4.6368,  0.7858],
        [ 1.6267,  1.0276,  2.8426,  0.4084,  5.9047],
        [-0.6302,  0.2372,  6.4231,  1.0683,  0.6146],
        [ 1.9733,  7.2482,  1.7210,  0.0743,  2.7171],
        [ 2.4561,  2.8463,  2.7906,  1.4245,  5.3515],
        [ 3.1987,  2.7979,  0.6837,  8.2739,  3.6359],
        [ 2.7705,  3.0641,  0.2023,  8.5307,  3.8250],
        [ 1.7586,  7.9345, -1.5157, -1.7790,  1.9068]])
2


## Classification
This task is a specific case of a more general one, which is [querying](#Querying).
However, this setup is mentioned as a standalone because it demonstrates clearly how to perform usual classification (including [zero-shot inference](https://github.com/openai/CLIP#zero-shot-prediction)) using AudioCLIP.

### Audio

In [None]:
print('\t\tFilename, Audio\t\t\tTextual Label (Confidence)', end='\n\n')

# calculate model confidence
confidence = logits_audio_text.softmax(dim=1)
# print(f'confidence score: {confidence}')

for audio_idx in range(len(paths_to_audio)):
    # acquire Top-3 most similar results
    conf_values, ids = confidence[audio_idx].topk(3) # ids의 type은 tensor

    # format output strings
    query = f'{os.path.basename(paths_to_audio[audio_idx]):>30s} ->\t\t'
    results = ', '.join([f'{LABELS[i]:>15s} ({v:06.2%})' for v, i in zip(conf_values, ids)])

    # print(conf_values) # 결과마다 다 다른 tensor 값이 나와야 함. -> 제대로 된 결과

    print(query + results)

    # classification 결과로 봤을 때, data의 개수와 label의 개수가 일치하지 않아도 되는 것인가?
    # 즉, 중복을 제거하고 각각 하나의 label만 존재해도 되는듯 싶음.

		Filename, Audio			Textual Label (Confidence)

coughing_1-58792-A-24_segment_1.wav ->		       car horn (86.57%),             cat (06.57%),        coughing (06.35%)
 cat_3-95694-A-5_segment_0.wav ->		            cat (95.94%),        car horn (03.80%),    thunderstorm (00.14%)
 cat_3-95694-A-5_segment_1.wav ->		            cat (99.98%),        car horn (00.02%),    thunderstorm (00.00%)
car_horn_1-24074-A-43_segment_1.wav ->		       car horn (93.25%),        coughing (04.36%),             cat (01.29%)
coughing_1-58792-A-24_segment_0.wav ->		       coughing (98.95%),     alarm clock (00.47%),        car horn (00.30%)
thunder_3-144891-B-19_segment_1.wav ->		   thunderstorm (97.98%),        car horn (01.06%),             cat (00.50%)
car_horn_1-24074-A-43_segment_0.wav ->		       car horn (81.05%),    thunderstorm (06.62%),        coughing (06.26%)
alarm_clock_3-120526-B-37_segment_0.wav ->		    alarm clock (97.98%),        car horn (00.95%),             cat (00.61%)
alarm_clock_3-120526-B

## Querying

### Audio by Text

In [None]:
print('\t\tTextual Label\t\tFilename, Audio (Confidence)', end='\n\n')

# calculate model confidence
confidence = logits_audio_text.softmax(dim=0)
for label_idx in range(len(LABELS)):
    # acquire Top-2 most similar results
    conf_values, ids = confidence[:, label_idx].topk(2)

    # format output strings
    query = f'{LABELS[label_idx]:>25s} ->\t\t'
    results = ', '.join([f'{os.path.basename(paths_to_audio[i]):>30s} ({v:06.2%})' for v, i in zip(conf_values, ids)])

    print(query + results)

		Textual Label		Filename, Audio (Confidence)

                      cat ->		 cat_3-95694-A-5_segment_1.wav (89.80%),  cat_3-95694-A-5_segment_0.wav (09.24%)
             thunderstorm ->		thunder_3-144891-B-19_segment_0.wav (65.55%), thunder_3-144891-B-19_segment_1.wav (33.00%)
                 coughing ->		coughing_1-58792-A-24_segment_0.wav (87.74%), coughing_1-58792-A-24_segment_1.wav (06.11%)
              alarm clock ->		alarm_clock_3-120526-B-37_segment_1.wav (56.30%), alarm_clock_3-120526-B-37_segment_0.wav (43.55%)
                 car horn ->		coughing_1-58792-A-24_segment_1.wav (44.41%), car_horn_1-24074-A-43_segment_1.wav (27.87%)
