# Label Enhancing for VLM-Generated Descriptions
This Jupyter Notebook uses Llama 3.2, by *Meta*, to enhance the labels.
It generates three variations of each original VLM-generated description.

In [1]:
import pandas as pd
import time
from ollama import chat
from ollama import ChatResponse

### Cleaning CSVs
The AudioSet and AudioCaps description CSVs are loaded, de-duplicated, and re-indexed so that the LLM responses can later be written back row by row.

In [None]:
audioset_molmo_desc_file = 'audioset_description.csv'
audioset_description_df = pd.read_csv(audioset_molmo_desc_file)
audioset_description_df = audioset_description_df.drop_duplicates()
audioset_description_df.reset_index(drop=True, inplace=True)
audioset_description_df.shape

(7584, 2)

In [3]:
audiocaps_molmo_desc_file = 'audiocap-descriptionNEW.csv'
audiocaps_description_df = pd.read_csv(audiocaps_molmo_desc_file)
audiocaps_description_df = audiocaps_description_df.drop_duplicates()
audiocaps_description_df.reset_index(drop=True, inplace=True)
audiocaps_description_df.shape

(1614, 2)

In [None]:
# Extract the YouTube ID of each datapoint into a separate leading 'YTID' column.
# The regex captures the bracketed ID that precedes '_cut_', e.g. '[3Kb4RHaZpxo]_cut_0.jpeg'.
if 'YTID' not in audioset_description_df.columns:
    audioset_description_df.insert(
        0,
        'YTID',
        audioset_description_df['Image Name'].str.extract(r'\[([\w-]+)\]_cut_', expand=False)
    )
if 'YTID' not in audiocaps_description_df.columns:
    audiocaps_description_df.insert(
        0,
        'YTID',
        audiocaps_description_df['Image Name'].str.extract(r'\[([\w-]+)\]_cut_', expand=False)
    )
display(audioset_description_df)
display(audiocaps_description_df)

Unnamed: 0,YTID,Image Name,Description
0,_qodustNrME,STREET FIGHTER VS ASURA？ Asura's Wrath DLC T...,Picture a dramatic battle scene unfolding in ...
1,ieZVo7W3BQ4,"VI Truck Show Ciudad de Torrelavega, desfile d...",Imagine the distant rumble of a truck's engin...
2,FXlPUGUw9UU,"Insonnia Musica Rilassante x il sonno Dormire,...",Imagine a tranquil nighttime scene by a seren...
3,wMXiOt2HHUw,iRobot Roomba 560 limpiando en casa [wMXiOt2HH...,"Imagine a serene, sun-drenched living room ba..."
4,C4YMjmJ7tt4,Bhagwant Mann - Full Speed - Part - 2 WwWKOOKD...,"Imagine the distant hum of traffic, punctuate..."
...,...,...,...
7579,2v1rSA4FqlM,Human powered washing machine [2v1rSA4FqlM]_cu...,"The image, if translated into sound, might ev..."
7580,ZbJD8pWYsk0,Liza Soberano Slap Enrique Gil [ZbJD8pWYsk0]_c...,"Imagine a lively, intimate setting where two ..."
7581,3Kb4RHaZpxo,Sound Effect - Telephone [3Kb4RHaZpxo]_cut_0.jpeg,"Picture a dimly lit room, shrouded in an air ..."
7582,3oqo61gK5Co,National Cash Register Class 1900 motorized [...,"Imagine a bustling, vintage recording studio ..."


Unnamed: 0,YTID,Image Name,Description
0,1VSLSGXlG1s,65888_Metal Gear Solid 4 Big Boss Emblem Act I...,Imagine the distant rumble of heavy machinery...
1,Y2KCoO8C8R8,21313_【電子警笛】E233系6000番台 [Y2K...,Imagine the gentle hum of an electric train a...
2,7HUQ_NtyY3k,19744_There's a fish at the door [7HUQ_NtyY3k]...,"Imagine a quiet, contemplative moment capture..."
3,vJrc42EJIyg,21489_Totally Free Sound Effects #25 - Faucet ...,Imagine a bustling kitchen filled with the si...
4,jqtqV0BQ8GU,81484_Spore Creature Creator - Mud skipper ⧸...,"Imagine a whimsical, cartoon-like forest scen..."
...,...,...,...
1609,we8NP0EKyZQ,83637_25 lb Little Giant Power Hammer Demo [we...,Imagine the rhythmic clanking of metal on met...
1610,3Hza-oEdi7E,13489_Red Staffordshire Bull Terrier & Cat sta...,Imagine a serene winter scene viewed through ...
1611,45FElpwPRnc,8033_Bath Sink Overflow ： Part 7—artisan s...,"Imagine the soft, rhythmic sound of water bei..."
1612,3Hz5urV9T_o,13489_1930 model a horn [3Hz5urV9T_o]_cut_0.jpeg,Imagine the distant rumble of a powerful engi...
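
The `YTID` extraction above relies on a single regular expression. A minimal sketch of what that pattern captures, using only the standard library; `extract_ytid` is a hypothetical helper name that does not appear in the notebook:

```python
import re

# Same pattern as in the cell above: a bracketed ID immediately followed by '_cut_'.
YTID_PATTERN = re.compile(r'\[([\w-]+)\]_cut_')

def extract_ytid(image_name):
    """Return the bracketed YouTube ID, or None if the name does not match (hypothetical helper)."""
    match = YTID_PATTERN.search(image_name)
    return match.group(1) if match else None

print(extract_ytid('Sound Effect - Telephone [3Kb4RHaZpxo]_cut_0.jpeg'))  # 3Kb4RHaZpxo
print(extract_ytid('no brackets here.jpeg'))                              # None
```

IDs may contain underscores and hyphens (for example `_qodustNrME` or `3Hza-oEdi7E`), which is why the character class is `[\w-]`. With pandas' `str.extract`, rows that do not match end up as `NaN` rather than `None`.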


### Running Llama 3.2
We call Llama 3.2 through the chat API of the ollama library.
The three prompt variations given to the LLM are also listed below.

In [6]:
from ollama import chat
from ollama import ChatResponse

def llama3_2(prompt, description):
    """Send `prompt` as the system message and `description` as the user message to Llama 3.2."""
    if pd.isna(description):  # Handle missing values
        return None
    response: ChatResponse = chat(model='llama3.2', messages=[
        {'role': 'system', 'content': prompt},
        {'role': 'user', 'content': description},
    ])
    return response.message.content
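
The request structure can be inspected without a running Ollama server by passing the chat function in as a parameter. This is only a sketch: `build_messages`, `generate_variation`, `is_missing`, and `fake_chat` are hypothetical names, and `fake_chat` merely imitates the `response.message.content` shape used above.

```python
import math
from types import SimpleNamespace

def is_missing(value):
    """True for None or a float NaN (how pandas represents a missing description)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_messages(prompt, description):
    """Build the two-message payload: the prompt as system role, the description as user role."""
    return [
        {'role': 'system', 'content': prompt},
        {'role': 'user', 'content': description},
    ]

def generate_variation(prompt, description, chat_fn, model='llama3.2'):
    """Call chat_fn and return the reply text, or None when the description is missing."""
    if is_missing(description):
        return None
    response = chat_fn(model=model, messages=build_messages(prompt, description))
    return response.message.content

def fake_chat(model, messages):
    """Stand-in for a chat call: echoes the user message so the flow can be checked offline."""
    reply = f"[{model}] {messages[1]['content']}"
    return SimpleNamespace(message=SimpleNamespace(content=reply))

print(generate_variation('Rephrase the label', 'distant rumble of a truck', fake_chat))
```

Passing `chat` from `ollama` as `chat_fn` should correspond to the call made in `llama3_2`, assuming the server is running and the model has been pulled.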

In [None]:
# The same variations will be used for both sets to be consistent
variation_concise_label = "Use the words in the label provided that are associated with audio to only output descriptive three-word descriptions of those audio words keeping their relevance super close to the context"
variation_hypothetical_mood_label = "Analyzing the given label, create a hypothetical mood, that is super relevant to the content in the text, with words in the label that are associated to auditory sensations"
variation_rephrase_label = "Rephrase the contents in the label into something new and descriptive while maintaining extreme relevance to the content inside of the label, using the words that are associated with sounds"
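
Because the same three prompts are applied to every description, they can also be kept in one mapping from output column to prompt. A minimal sketch under that assumption; `apply_variations` and `demo_generate` are hypothetical, and the short prompt strings are placeholders for the full prompts defined above:

```python
def apply_variations(description, variations, generate):
    """Return {column_name: generated_text} for a single description.

    `variations` maps an output column name to its system prompt;
    `generate(prompt, description)` is any callable that returns text.
    """
    return {column: generate(prompt, description)
            for column, prompt in variations.items()}

# Placeholder prompts standing in for the three full prompt strings.
variations = {
    'Variation_Concise_Label': 'concise prompt',
    'Variation_Hypothetical_Mood_Label': 'mood prompt',
    'Variation_Rephrase_Label': 'rephrase prompt',
}

def demo_generate(prompt, description):
    """Stand-in for the LLM call so the sketch runs offline."""
    return f"{prompt} -> {description}"

result = apply_variations('hum of traffic', variations, demo_generate)
for column, text in result.items():
    print(column, '|', text)
```

Keeping the column names and prompts together means adding a fourth variation would only require one new dictionary entry.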

In [None]:
import os
# The output CSV with the three variations is rewritten after every 50 processed datapoints
output_folder = 'checkpoints_for_variation'
audioset_output_file = os.path.join(output_folder, 'audioset_descriptions_vars.csv')
audiocaps_output_file = os.path.join(output_folder, 'audiocaps_descriptions_vars.csv')

os.makedirs(output_folder, exist_ok=True)

checkpoint_interval = 50
# Note: this counter is shared by both loops below and is not reset between datasets
processed_rows = 0

In [15]:
# For updated AudioSet Descriptions Generated from Molmo
start = time.time()
for i in range(len(audioset_description_df)):
    # Skip all processed datapoints
    if 'Variation_Concise_Label' in audioset_description_df.columns and pd.notna(audioset_description_df.at[i, 'Variation_Concise_Label']):
        continue
    
    # Three Variations
    description = audioset_description_df.at[i, 'Description']
    audioset_description_df.at[i, 'Variation_Concise_Label'] = llama3_2(variation_concise_label, description)
    audioset_description_df.at[i, 'Variation_Hypothetical_Mood_Label'] = llama3_2(variation_hypothetical_mood_label, description)
    audioset_description_df.at[i, 'Variation_Rephrase_Label'] = llama3_2(variation_rephrase_label, description)
    processed_rows += 1

    # Update csv after every 50 points
    if processed_rows % checkpoint_interval == 0:
        print(f"Saving checkpoint after {processed_rows} processed rows...")
        audioset_description_df.to_csv(audioset_output_file, index=False)
        print(f"Saved to {audioset_output_file}.")

# Save the complete CSV once the loop has finished
audioset_description_df.to_csv(audioset_output_file, index=False)
elapsed = time.time() - start
print(f"Time taken: {elapsed:.2f} seconds")

Saving checkpoint after 50 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 100 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 150 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 200 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 250 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 300 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 350 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 400 processed rows...
Saved to checkpoints_for_variation/audioset_descriptions_vars.csv.
Saving checkpoint after 450 processed rows...
Saved to checkpoints_for_variation/audioset_descrip
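
The skip-and-checkpoint pattern used in this loop can be sketched independently of pandas and Ollama. The version below is an illustration, not the notebook's implementation: `process_with_checkpoints`, `generate`, and `save` are hypothetical names, rows are plain dictionaries, and the counter is local to the function so that each dataset would start counting from zero.

```python
def process_with_checkpoints(rows, generate, save, interval=50):
    """Fill in missing 'variation' values and call save(rows) every `interval` new rows.

    rows: list of dicts with a 'description' key and an optional 'variation' key.
    Returns the number of rows processed in this run.
    """
    processed = 0  # local counter, independent of any other dataset
    for row in rows:
        if row.get('variation') is not None:
            continue  # already filled in by an earlier run
        row['variation'] = generate(row['description'])
        processed += 1
        if processed % interval == 0:
            save(rows)  # periodic checkpoint
    save(rows)  # final save covers rows processed after the last checkpoint
    return processed

# Offline demonstration with stand-in callables.
rows = [
    {'description': 'a', 'variation': 'A'},  # pretend this one was done earlier
    {'description': 'b'},
    {'description': 'c'},
    {'description': 'd'},
    {'description': 'e'},
]
save_calls = []
count = process_with_checkpoints(
    rows,
    generate=str.upper,
    save=lambda r: save_calls.append(sum('variation' in x for x in r)),
    interval=2,
)
print(count, len(save_calls))  # 4 3
```

Resuming after an interruption only works if the rows passed in already carry the previously saved values, for example by loading the checkpoint file instead of the original CSV before the loop starts.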

In [16]:
# For updated AudioCaps Descriptions Generated from Molmo for Validation
start = time.time()
for i in range(len(audiocaps_description_df)):
    # Skip all processed datapoints
    if 'Variation_Concise_Label' in audiocaps_description_df.columns and pd.notna(audiocaps_description_df.at[i, 'Variation_Concise_Label']):
        continue
    
    # Three Variations
    description = audiocaps_description_df.at[i, 'Description']
    audiocaps_description_df.at[i, 'Variation_Concise_Label'] = llama3_2(variation_concise_label, description)
    audiocaps_description_df.at[i, 'Variation_Hypothetical_Mood_Label'] = llama3_2(variation_hypothetical_mood_label, description)
    audiocaps_description_df.at[i, 'Variation_Rephrase_Label'] = llama3_2(variation_rephrase_label, description)
    processed_rows += 1

    # Update csv after every 50 points (checkpoint_interval)
    if processed_rows % checkpoint_interval == 0:
        print(f"Saving checkpoint after {processed_rows} processed rows...")
        audiocaps_description_df.to_csv(audiocaps_output_file, index=False)
        print(f"Saved to {audiocaps_output_file}.")

# Save the complete CSV once the loop has finished
audiocaps_description_df.to_csv(audiocaps_output_file, index=False)
elapsed = time.time() - start
# The timer covers all three variations, not only 'Variation_Concise_Label'
print(f"Time taken: {elapsed:.2f} seconds")

Saving checkpoint after 4400 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4450 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4500 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4550 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4600 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4650 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4700 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4750 processed rows...
Saved to checkpoints_for_variation/audiocaps_descriptions_vars.csv.
Saving checkpoint after 4800 processed rows...
Saved to checkpoints_for_variatio

In [17]:
display(audioset_description_df)
display(audiocaps_description_df)

Unnamed: 0,YTID,Image Name,Description,Variation_Concise_Label,Variation_Hypothetical_Mood_Label,Variation_Rephrase_Label
0,_qodustNrME,STREET FIGHTER VS ASURA？ Asura's Wrath DLC T...,Picture a dramatic battle scene unfolding in ...,Here are three-word descriptions for each audi...,**Dramatic Battle Scene**\n\nAudiophile's Mood...,"**""Echoes of Conflict: A Symphony of Steel and..."
1,ieZVo7W3BQ4,"VI Truck Show Ciudad de Torrelavega, desfile d...",Imagine the distant rumble of a truck's engin...,Here are the descriptive three-word descriptio...,Here's a hypothetical mood label with words as...,The label begins with a low-frequency vibratio...
2,FXlPUGUw9UU,"Insonnia Musica Rilassante x il sonno Dormire,...",Imagine a tranquil nighttime scene by a seren...,Here are three-word descriptive outputs for ea...,Here's a hypothetical mood label associated wi...,Here's a rephrased version of the content usin...
3,wMXiOt2HHUw,iRobot Roomba 560 limpiando en casa [wMXiOt2HH...,"Imagine a serene, sun-drenched living room ba...",Here are three-word descriptions for audio-rel...,Here's a hypothetical label that captures the ...,Here's a rephrased version of the label conten...
4,C4YMjmJ7tt4,Bhagwant Mann - Full Speed - Part - 2 WwWKOOKD...,"Imagine the distant hum of traffic, punctuate...",Here are the descriptive three-word descriptio...,"""Hush, harmony""\n\nThis label captures the ser...",The soothing serenade of urban life unfolds li...
...,...,...,...,...,...,...
7579,2v1rSA4FqlM,Human powered washing machine [2v1rSA4FqlM]_cu...,"The image, if translated into sound, might ev...",Here are the audio-related words with three-wo...,Here is the revised label with a hypothetical ...,"""Vibrate with creativity,"" the image whispers,..."
7580,ZbJD8pWYsk0,Liza Soberano Slap Enrique Gil [ZbJD8pWYsk0]_c...,"Imagine a lively, intimate setting where two ...",Here are three-word descriptions for the audio...,The hypothetical label for this scenario could...,"""Snap into focus: the vibrant cadence of human..."
7581,3Kb4RHaZpxo,Sound Effect - Telephone [3Kb4RHaZpxo]_cut_0.jpeg,"Picture a dimly lit room, shrouded in an air ...",Here are three-word descriptions for some audi...,**Mood:** High-Stakes Action Thriller\n\n**Lab...,"Imagine being enveloped by an ominous silence,..."
7582,3oqo61gK5Co,National Cash Register Class 1900 motorized [...,"Imagine a bustling, vintage recording studio ...",Here are three-word descriptions for audio-rel...,"""Electric Dreams and Vintage Vibes""\n\nImagine...","""Capturing the Vibrations: A Symphony of Sound..."


Unnamed: 0,YTID,Image Name,Description,Variation_Concise_Label,Variation_Hypothetical_Mood_Label,Variation_Rephrase_Label
0,1VSLSGXlG1s,65888_Metal Gear Solid 4 Big Boss Emblem Act I...,Imagine the distant rumble of heavy machinery...,Here are the descriptive three-word descriptio...,"**Mood: ""Crisis Point""**\n\nThe label for this...","Imagine a stark, pulsing hum, like the steady ..."
1,Y2KCoO8C8R8,21313_【電子警笛】E233系6000番台 [Y2K...,Imagine the gentle hum of an electric train a...,Here are three-word descriptions for each audi...,Here is a hypothetical mood label associated w...,"""Embark on a journey through the symphony of m..."
2,7HUQ_NtyY3k,19744_There's a fish at the door [7HUQ_NtyY3k]...,"Imagine a quiet, contemplative moment capture...",Here are three-word descriptions for each audi...,"""Whispers in Twilight""\n\n(Soft piano melody p...","""Echoes in Silence""\n\nThe whispers of the pas..."
3,vJrc42EJIyg,21489_Totally Free Sound Effects #25 - Faucet ...,Imagine a bustling kitchen filled with the si...,Here are three-word descriptions for some of t...,Here's a hypothetical label that matches this ...,"""Harmony of the Senses: A Symphony of Flavors""..."
4,jqtqV0BQ8GU,81484_Spore Creature Creator - Mud skipper ⧸...,"Imagine a whimsical, cartoon-like forest scen...",Here are three-word audio descriptions:\n\n1. ...,**Whimsical Woodland Soundscape**\n\n**Label: ...,Imagine being enveloped in an enchanting sound...
...,...,...,...,...,...,...
1609,we8NP0EKyZQ,83637_25 lb Little Giant Power Hammer Demo [we...,Imagine the rhythmic clanking of metal on met...,Here are three-word descriptions associated wi...,Here's a hypothetical mood label with words as...,Imagine the symphony of sonic sensations as a ...
1610,3Hza-oEdi7E,13489_Red Staffordshire Bull Terrier & Cat sta...,Imagine a serene winter scene viewed through ...,Here are three-word descriptions for the audio...,"Based on the serene winter scene described, I ...",Here's a rephrased version of the content in a...
1611,45FElpwPRnc,8033_Bath Sink Overflow ： Part 7—artisan s...,"Imagine the soft, rhythmic sound of water bei...",Here are the descriptive three-word descriptio...,"**Label: ""Whispers from the Clay Studio""**\n\n...",**Ethereal Cadence: A Symphony of Clay**\n\nIm...
1612,3Hz5urV9T_o,13489_1930 model a horn [3Hz5urV9T_o]_cut_0.jpeg,Imagine the distant rumble of a powerful engi...,Here are three-word descriptions for each audi...,Here's a hypothetical label that captures the ...,"**""Ignition""**\n\nThe soft purr of a vintage e..."
