# JSALT 2025 - Introduction to Large Audio Language Models

**Laboratory session: AuGI - Towards audio general intelligence**

June 20th, 2025

## Introduction and Objectives

In this notebook, we will explore Large Audio Language Models (LALMs) and how to use them for inference and audio question answering.

In particular, we will use the recently published AudioFlamingo 2 (AF2) model, whose audio encoder is based on CLAP, and the MMAU dataset.

📖 **Some papers**
* [CLAP](https://arxiv.org/pdf/2211.06687)
* [AudioFlamingo 2](https://arxiv.org/pdf/2503.03983)
* [MMAU dataset](https://openreview.net/pdf?id=TeVAZXr3yv)


## Outline

Through this notebook, we will:

1. Set up the Python environment
2. Explore the CLAP audio encoder and its embeddings
3. Use AudioFlamingo 2 (AF2) for inference
4. Explore the MMAU dataset
5. Prepare some audio question answering (AQA) examples to test the model further



## Materials

The materials needed for this session are in the following GitHub repository:

https://github.com/ferugit/JSALT-LALMs-tutorial.git



# 1. Set up the Python environment



In order to prepare the libraries needed, we will first clone the code from the GitHub repository and install the required packages:

In [None]:
# Remote
!git clone https://github.com/ferugit/JSALT-LALMs-tutorial.git
%cd JSALT-LALMs-tutorial

In [None]:
!pip install -r requirements.txt
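
After the installation, it can be useful to confirm that the main libraries used in this notebook are importable before going further. The following is a minimal sketch using only the standard library; `check_packages` is a hypothetical helper, and the module list is taken from the imports that appear later in this notebook.

```python
import importlib.util

def check_packages(names):
    """Map each top-level module name to True if the import system can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Modules imported later in this notebook
status = check_packages(["torch", "torchaudio", "laion_clap", "umap", "yaml"])
for name, found in status.items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Any module reported as missing probably means the `pip install` step did not finish correctly.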

# 2. Explore the CLAP audio encoder and its embeddings

## 👏 CLAP (Contrastive Language-Audio Pretraining)

CLAP is a dual encoder model that learns joint representations of audio and language through contrastive learning, similar to CLIP (Contrastive Language–Image Pretraining).



### 🛠️ Architecture

1. **Audio Encoder**
   - A CNN- or transformer-based model (e.g. PANN or HTSAT).
   - Takes log-mel spectrograms as input and produces an embedding vector.
   - Trained to capture the semantic content of audio.

2. **Text Encoder**
   - A transformer (e.g. CLIP, BERT or RoBERTa).
   - Encodes natural language prompts or captions into a dense representation.

3. **Projection Layers**
   - Both audio and text embeddings are projected into the same latent space.

4. **Contrastive Loss (InfoNCE)**
   - During training, matched audio-text pairs are pulled together in embedding space, and mismatched ones are pushed apart.
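
To make the contrastive objective concrete, here is a minimal pure-Python sketch of a symmetric InfoNCE loss on toy 2D embeddings. The function name `symmetric_info_nce`, the toy vectors and the fixed temperature of 0.07 are assumptions made for illustration; this is not the LAION-CLAP training code, which operates on batched tensors.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def symmetric_info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where audio i is paired with text i."""
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    n = len(audio)
    # logits[i][j]: cosine similarity of audio i and text j, scaled by temperature
    logits = [[sum(a * t for a, t in zip(audio[i], text[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean of -log softmax(row)[i], where the correct class of row i is i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    audio_to_text = cross_entropy_diag(logits)
    text_to_audio = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (audio_to_text + text_to_audio)

audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_info_nce(audio_batch, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_info_nce(audio_batch, [[0.1, 0.9], [0.9, 0.1]])
print(f"matched pairs loss:  {matched:.4f}")
print(f"shuffled pairs loss: {shuffled:.4f}")
```

The loss is low when each audio embedding is closest to its own caption and high when the captions are swapped, which is the pull-together and push-apart behaviour described in item 4.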

<img src="https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/audioclip-arch.png" width="800"/>



To explore the CLAP model, we will first import some Python libraries and download the trained model:

In [None]:
import os
import sys
import yaml
import umap
import IPython

import torch
import torchaudio
import numpy as np

import laion_clap

import matplotlib.pyplot as plt

In [None]:
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()


This model can be used for _zero-shot classification_:

<img src="https://raw.githubusercontent.com/ferugit/JSALT-LALMs-tutorial/master/assets/zero-shot_classification.png" alt="Zero-shot classification with CLAP" width="800"/>

For that, we can use the audio clips included in the _assets_ directory of the repository.

You can listen to some of them:

In [None]:
cat_filename = "assets/cat.wav"
dog_filename = "assets/dog_barking.wav"
another_dog_filename = "assets/dog.wav"
breaking_filename = "assets/breaking.wav"
cough_filename = "assets/cough.wav"
music_filename = "assets/dance_matisse_musiclm.wav"

In [None]:
IPython.display.Audio(cat_filename)

In [None]:
IPython.display.Audio(dog_filename)

Then, you can obtain the embeddings from the CLAP audio encoder like this:

In [None]:
# Get audio embeddings from audio files
audio_file = [cat_filename, dog_filename]
with torch.no_grad():
    audio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=True)
print(audio_embed[:,-20:])
print(audio_embed.shape)

And, in a similar way, you can get the text embeddings from a given description using the CLAP model:

In [None]:
# Get text embeddings from texts
text_data = ["This is a sound of a dog", "This is a sound of a cat"]
with torch.no_grad():
    text_embed = model.get_text_embedding(text_data, use_tensor=True)
print(text_embed[:,-20:])
print(text_embed.shape)

This way, you can compute a similarity matrix between the audio and text embeddings (rows are audio clips, columns are text prompts):

In [None]:
similarity = audio_embed @ text_embed.t()
print("Similarity matrix:\n", similarity)
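
Each row of this matrix scores one audio clip against every text prompt, so zero-shot classification amounts to picking the best-scoring prompt per row. The sketch below shows that step on a made-up similarity matrix; `zero_shot_classify`, the toy values and the `logit_scale` of 100 are illustrative assumptions, not part of the `laion_clap` API.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(similarity_rows, class_names, logit_scale=100.0):
    """Return (predicted class, probability) for each row of similarities."""
    predictions = []
    for row in similarity_rows:
        probs = softmax([logit_scale * s for s in row])
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append((class_names[best], probs[best]))
    return predictions

# Made-up similarities: rows are audio clips, columns are the prompts ["dog", "cat"]
toy_similarity = [
    [0.10, 0.45],  # a cat clip
    [0.50, 0.05],  # a dog clip
]
predictions = zero_shot_classify(toy_similarity, ["dog", "cat"])
for name, prob in predictions:
    print(f"{name}: {prob:.3f}")
```

In this notebook, `similarity.tolist()` could be passed in place of the toy matrix, with the class names listed in the same order as `text_data`.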

In order to visualize the embeddings in a 2D space, we can project them and see how close they are depending on the audio and text chosen:

In [None]:
# Concatenate embeddings
embeddings = torch.cat([audio_embed, text_embed], dim=0).cpu().numpy()
labels = ['audio_cat', 'audio_dog', 'text_dog', 'text_cat']

# Solve UMAP 2D projection
reducer = umap.UMAP(n_neighbors=2, random_state=1)
embeddings_2d = reducer.fit_transform(embeddings)

# Plot embedding distances
plt.figure(figsize=(8, 6))

for i, label in enumerate(labels):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], label=label)
    plt.text(embeddings_2d[i, 0]+0.01, embeddings_2d[i, 1]+0.01, label)

# Draw lines between audio and text pairs to show distances
plt.plot([embeddings_2d[0, 0], embeddings_2d[3, 0]], [embeddings_2d[0, 1], embeddings_2d[3, 1]], 'r--', label='cat distance')
plt.plot([embeddings_2d[1, 0], embeddings_2d[2, 0]], [embeddings_2d[1, 1], embeddings_2d[2, 1]], 'b--', label='dog distance')

plt.legend()
plt.title('2D Visualization of Audio and Text Embeddings with Distances')
plt.xlabel('UMAP-1')
plt.ylabel('UMAP-2')
plt.show()

Here are some examples of cosine similarity and cosine distance for different audio clips:

In [None]:
# Cosine Similarity
cos_sim = torch.nn.CosineSimilarity(dim=0)

In [None]:
# Calculate cosine distance
audio_file = [dog_filename]
with torch.no_grad():
    audio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=True)

with torch.no_grad():
    text_embed = model.get_text_embedding("This is a dog barking", use_tensor=True)

similarity = cos_sim(audio_embed[-1], text_embed[-1])
distance = 1 - similarity
print("Cosine Distance:", distance.item())

In [None]:
IPython.display.Audio(cough_filename)

In [None]:
# Calculate cosine distance
audio_file = [cough_filename]
with torch.no_grad():
    audio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=True)

with torch.no_grad():
    text_embed = model.get_text_embedding("This is a dog barking", use_tensor=True)

similarity = cos_sim(audio_embed[-1], text_embed[-1])
distance = 1 - similarity
print("Cosine Distance:", distance.item())
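
As a reference for reading these numbers, the following sketch computes cosine distance by hand on toy vectors. The helper names are hypothetical; the cells above use `torch.nn.CosineSimilarity` instead.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # unrelated vectors
opposite = cosine_distance([1.0, 0.0], [-2.0, 0.0])       # opposite vectors
print(same_direction, orthogonal, opposite)
```

Cosine distance ranges from 0 (same direction) to 2 (opposite direction) and ignores vector length, so a lower value for the dog clip than for the cough clip would indicate that the caption matches the dog audio better.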

# 3. Use AudioFlamingo 2 (AF2) for inference


## 🦩 AudioFlamingo 2

AudioFlamingo 2 (AF2) is a state-of-the-art Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities.
This model is capable of handling tasks like:
- **Audio Captioning**: "Describe what you hear."
- **Sound Event Detection**: "Is there a siren in this audio?"
- **Question Answering with Audio**: "What is the mood of the second speaker?"


> **Note**: This notebook is designed for inference only.

### 🛠️ Architecture

1. **AF-CLAP**: CLAP-based audio encoder with sliding window feature extraction

2. **Transformation Layers**: Expand and project audio features to a dimension compatible with the LLM.

3. **Frozen Language Model (LLM)**: The decoder-only causal LLM (Qwen2.5-3B).

4. **Gated Cross-Attention**: XATTN-Dense layers for audio conditioning.
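
The gating idea in item 4 can be sketched in a few lines. This follows the gated cross-attention formulation of the original Flamingo model, `y = x + tanh(gate) * Attention(x, audio)`; that AF2 uses exactly this form is an assumption here, and the learned projections, multiple heads and feed-forward layers are omitted. All function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, audio_feats):
    """Single-head attention: text tokens (queries) attend to audio features (keys and values)."""
    dim = len(text_states[0])
    attended = []
    for query in text_states:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in audio_feats]
        weights = softmax(scores)
        attended.append([sum(w * feat[d] for w, feat in zip(weights, audio_feats))
                         for d in range(dim)])
    return attended

def gated_cross_attention(text_states, audio_feats, gate):
    """Residual update: text + tanh(gate) * cross_attention(text, audio)."""
    scale = math.tanh(gate)
    attended = cross_attention(text_states, audio_feats)
    return [[t + scale * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text_states, attended)]

text_states = [[1.0, 0.0], [0.0, 1.0]]                # two text token states
audio_feats = [[0.5, 0.5], [2.0, -1.0], [0.0, 1.0]]   # three audio feature vectors
closed = gated_cross_attention(text_states, audio_feats, gate=0.0)
opened = gated_cross_attention(text_states, audio_feats, gate=1.0)
print("gate = 0 leaves text unchanged:", closed == text_states)
print("gate = 1 output:", opened)
```

With the gate at zero, the layer is an identity on the text states, which is why such layers can be added to a frozen LLM without disturbing it at the start of training.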


<img src="https://raw.githubusercontent.com/NVIDIA/audio-flamingo/main/assets/af2_arch.png" width="1500"/>
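
Sliding window feature extraction means that a long clip is cut into fixed-length windows, and each window is encoded separately by the audio encoder. The sketch below only computes the window boundaries; the function name and the window length and overlap values are illustrative assumptions, not the settings of the AF2 config file.

```python
def sliding_windows(duration_s, window_s=10.0, overlap_s=0.0, max_windows=None):
    """Return (start, end) times in seconds that cover a clip of length duration_s."""
    if window_s <= overlap_s:
        raise ValueError("window_s must be larger than overlap_s")
    hop = window_s - overlap_s
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break  # this window already reaches the end of the clip
        start += hop
    if max_windows is not None:
        windows = windows[:max_windows]  # truncate very long clips
    return windows

no_overlap = sliding_windows(25.0)
half_overlap = sliding_windows(25.0, window_s=10.0, overlap_s=5.0)
print(no_overlap)
print(half_overlap)
```

The last window is shorter than the others when the clip length is not a multiple of the hop; in practice that window would typically be padded before encoding.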

To use it for inference, we first need to download the Qwen model from HuggingFace (the 0.5B version in this case) and the AF2 checkpoint. The AF2 download script takes your HF token as its argument:

In [None]:
# Download Qwen2.5-0.5B model
!./download_hf_model.sh

In [None]:
# Download AF2 model: CLAP encoder, Audio Transformer and XATTN
!./download_af2.sh "YOUR_HF_TOKEN_HERE"

In [None]:
!ls models/

Now you can run the inference over one file (check the script to run the inference as well as the config file):

In [None]:
!cat run_af2_single_inference.sh

In [None]:
!cat src/audio_flamingo_2/config/inference.yaml

In [None]:
!./run_af2_single_inference.sh

You can listen to the audio used for inference in this case and check if the model outputs a good description.

In [None]:
IPython.display.Audio(music_filename)

# 4. Explore the MMAU dataset

## 🗂️ MMAU: Massive Multi-Task Audio Understanding and Reasoning Benchmark

<img src="https://raw.githubusercontent.com/Sakshi113/MMAU/main/mmau_intro.jpg" width="800"/>


**MMAU** (Massive Multi-Task Audio Understanding and Reasoning Benchmark) is a benchmark dataset, used in the AudioFlamingo 2 evaluation, that tests audio understanding models on tasks requiring expert-level knowledge and complex reasoning.

- **10k curated audio clips** paired with human-annotated natural language questions and answers.
- **3 major audio domains**: speech, sounds and music.
- **27 diverse tasks**: 16 reasoning and 11 information extraction tasks.

- **test_mini set**: 1000 questions. Reflects the task distribution of the main test set and is intended for hyperparameter tuning.

### 🧠 Task Examples

<img src="https://raw.githubusercontent.com/Sakshi113/MMAU/main/mmau_hero.jpg" width="1000"/>

With the following code, you can compute some statistics about the composition of the dataset and explore it:

In [None]:
import json
import pandas as pd
import numpy as np

import scienceplots
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use(['science', 'nature', 'bright', 'no-latex'])
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
plt.rc('axes', titlesize=10)

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
test_mini_file = 'mmau-test-mini.json'

with open(test_mini_file, 'r') as f:
    test_mini_json = json.load(f)

test_mini_df = pd.DataFrame(test_mini_json)

In [None]:
test_mini_df

In [None]:
test_mini_df.describe()


In [None]:
test_mini_df['category'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
test_mini_df["task"].value_counts().plot.pie(
    autopct='%1.1f%%',
    labeldistance=None,
    startangle=20,
    pctdistance=1.18,
    textprops={'fontsize': 10}
)
plt.title("Task")
plt.ylabel("")
plt.legend(loc=4, ncol=1, frameon=True, framealpha=1.0, bbox_to_anchor=(1.2, 0.4), prop={'size': 10})
plt.show()

In [None]:
plt.figure(figsize=(10,5))
test_mini_df["category"].value_counts().plot.pie(
    autopct='%1.1f%%',
    labeldistance=None,
    startangle=20,
    pctdistance=1.18,
    textprops={'fontsize': 10}
)
plt.title("Category")
plt.ylabel("")
plt.legend(loc=4, ncol=1, frameon=True, framealpha=1.0, bbox_to_anchor=(1.55, 0.4), prop={'size': 10})
plt.show()

In [None]:
# Generate 14 faded colors using seaborn's color palette and add transparency
colors = sns.color_palette("colorblind", 14)  # Use any palette: husl, pastel, muted, etc.
faded_colors = [(r, g, b, 0.6) for r, g, b in colors]  # Add alpha = 0.6 for faded effect

plt.figure(figsize=(10, 5))
test_mini_df["dataset"].value_counts().plot.pie(
    colors=faded_colors,
    autopct='%1.1f%%',
    labeldistance=None,
    startangle=20,
    pctdistance=1.18,
    textprops={'fontsize': 10}
)
plt.title("Dataset")
plt.ylabel("")
plt.legend(loc=4, ncol=1, frameon=True, framealpha=1.0,
           bbox_to_anchor=(1.55, 0.1), prop={'size': 10})
plt.show()

In [None]:
len(test_mini_df['sub-category'].unique())

In [None]:
plt.figure(figsize=(10,5))
test_mini_df['sub-category'].value_counts().plot(kind='barh')
plt.ylabel("")
plt.title('Sub-categories in MMAU test-mini')

In [None]:
test_mini_df["num_choices"] = test_mini_df["choices"].apply(len)

In [None]:
plt.figure(figsize=(10,5))
test_mini_df["num_choices"].value_counts().plot.pie(
    autopct='%1.1f%%',
    labeldistance=None,
    startangle=20,
    pctdistance=1.18,
    textprops={'fontsize': 10}
)
plt.title("Number of choices")
plt.ylabel("")
plt.legend(loc=4, ncol=1, frameon=True, framealpha=1.0, bbox_to_anchor=(1.2, 0.4), prop={'size': 10})
plt.show()

In [None]:
test_mini_df[test_mini_df["num_choices"] == 2]

In [None]:
test_mini_df[test_mini_df["sub-category"] == "Phonemic Stress Pattern Analysis"].sample(1)['question'].item()

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
manifest_file = 'manifest_mmau-test-mini.json'
with open(manifest_file, 'r') as f:
    manifest = json.load(f)

# Transpose so that each entry of manifest['data'] becomes a row
len_df = pd.DataFrame(manifest['data']).T
len_df

In [None]:
plt.figure(figsize=(10,5))
len_df['duration'].plot.kde()
plt.title("Length distribution")
plt.ylabel("")
#plt.legend(loc=4, ncol=1, frameon=True, framealpha=1.0, bbox_to_anchor=(1.2, 0.4), prop={'size': 10})
plt.show()

# 5. Prepare some AQA to test the model further

To run inference on other audio files and try your own questions, first load the model:

In [None]:
import json

import src.audio_flamingo_2.factory as factory
from src.audio_flamingo_2.inference_utils import read_audio, load_audio, predict, get_num_windows
from src.audio_flamingo_2.utils import Dict2Class, float32_to_int16, int16_to_float32, get_autocast, get_cast_dtype
from safetensors.torch import load_file

In [None]:
# Load the config file
with open("src/audio_flamingo_2/config/inference.yaml") as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

data_config = config['data_config']
model_config = config['model_config']
clap_config = config['clap_config']
model_args = Dict2Class(config['train_config'])

# Cast the model to the appropriate dtype
autocast = get_autocast(
    model_args.precision, cache_enabled=(not model_args.fsdp)
)
cast_dtype = get_cast_dtype(model_args.precision)

# Get the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the model and tokenizer
model, tokenizer = factory.create_model_and_transforms(
    **model_config,
    clap_config=clap_config,
    use_local_files=True,
    gradient_checkpointing=False,
    freeze_lm_embeddings=True,
    device=device,
)

print("Model and tokenizer created successfully.")

print("Loading trained weights...")

# CLAP, the tokenizer and the LLM already come with pretrained weights.
# The XATTN and audio Transformer layers do not, so we load their trained weights below.
model = model.to(device)
model.eval()

# Load the pretrained weights
ckpt_path = config['inference_config']['pretrained_path']
metadata_path = os.path.join(ckpt_path, "safe_ckpt/metadata.json")

# Load metadata
with open(metadata_path, "r") as f:
    metadata = json.load(f)

# Reconstruct the full state_dict
state_dict = {}

# Load each SafeTensors chunk
for chunk_name in metadata:
    chunk_path = f"safe_ckpt/{chunk_name}.safetensors"
    chunk_tensors = load_file(os.path.join(ckpt_path, chunk_path))

    # Merge tensors into state_dict
    state_dict.update(chunk_tensors)

missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)

print("Missing keys:", missing_keys)
print("Unexpected keys:", unexpected_keys)
print("Model loaded successfully.")

Finally, you can run decoding to get the model's answer:

In [None]:
decoding = {
    "do_sample": False,  # Set to True for sampling, False for greedy/beam search
    "temperature": 0.0,
    "num_beams": 1,
    "top_k": 30,
    "top_p": 0.95,
    "num_return_sequences": 1,
}


question = "YOUR_QUESTION_HERE"  # example: "What is the gender of the person?"

# Perform inference
result = predict(
    cough_filename,
    question,
    clap_config,
    inference_kwargs=decoding,
    cast_dtype=cast_dtype,
    device=device,
    tokenizer=tokenizer,
    model=model
)
print("Inference completed.\n\n")
print("*" * 50)
print("Prompt:", question)
print("Audio path:", cough_filename)
print("Inference result:", result)
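
MMAU questions are multiple choice, so one way to test the model further is to put the options into the question and check which option the answer mentions. The sketch below is one possible approach: the prompt template, the helper names and the toy records are assumptions, not the official MMAU evaluation script.

```python
from collections import defaultdict

def build_mcq_prompt(question, choices):
    """Append lettered options to the question (hypothetical template)."""
    options = " ".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    return f"{question} Choose one of the following options: {options}"

def match_choice(model_output, choices):
    """Return the only choice whose text appears in the output, else None.

    Naive substring matching: it can misfire when one choice contains another
    (e.g. "man" inside "woman"), so treat it as a starting point.
    """
    text = model_output.lower()
    hits = [c for c in choices if c.lower() in text]
    return hits[0] if len(hits) == 1 else None

def accuracy_by_category(records):
    """records: dicts with 'category', 'choices', 'answer' and 'prediction' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if match_choice(r["prediction"], r["choices"]) == r["answer"]:
            correct[r["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy records with made-up predictions, only to exercise the helpers
toy_records = [
    {"category": "sound", "choices": ["cough", "laughter"], "answer": "cough",
     "prediction": "The person produces a cough."},
    {"category": "sound", "choices": ["dog", "cat"], "answer": "dog",
     "prediction": "A cat is meowing."},
    {"category": "music", "choices": ["piano", "guitar"], "answer": "piano",
     "prediction": "It could be a piano or a guitar."},
]
prompt = build_mcq_prompt("What sound is heard?", ["cough", "laughter"])
scores = accuracy_by_category(toy_records)
print(prompt)
print(scores)
```

In this notebook, `prompt` could be passed as the `question` argument of `predict`, and the returned text stored as the `prediction` of each record.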