<a href="https://colab.research.google.com/github/davidMadueke/VidToMusicTags/blob/main/VidToMusicTags_LLAVA_Next_Video%2BLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook prototypes an AI workflow that classifies videos into natural-language tags compatible with the tag sets used by large music publishing libraries.

The main problem with current video-to-music-tag workflows is the cost of training a vision-based model to perform multi-class classification on video frames. Such a model is expensive to train and test, and it is inflexible: choosing a different set of tags after the fact means retraining.

<div align="center">
	<img src="https://mermaid.ink/img/pako:eNpVkU1rAjEQhv9KSC8WtIi95VDQXQoerKVre8l6GJLZNbhJJB8Ucf3vjUlbMKfMzPO-M5lcqLASKaPdYL_FAVwgu7o1JJ0lX5tTDORLSbR7xpjK4Wz2MjagTwNK8upAox_JarKJQ1DaShjIskcTHovHKuPZgTRRa3DnkVSTagDvVacEBGXNnWL9tuCb6JUgO-j9XVsyvp8lmJBqNQQQN5PUvSrCKiM138Zw47O6VOrZU6psP3cLzitrujSNEUgaYR3uf5nqj3nmPI9DPhC8Ncr0BSlY7lljR8pQPjh7RPbQzed0SjU6DUqmZV6ygoYDamwpS1cJ7tjS1lwTBzHY5mwEZcFFnFJnY3-grIPBpyieJASsFfRpuf9ZlCpYtyl_JW6P6On1BzY2koU?type=png">
</div>

This project addresses these issues with a two-stage large language model (LLM) architecture. Frames are sampled from the input video and fed into a vision model trained specifically on videos, which generates a short, concise summary of what is happening.

This video summary is then passed to a second LLM stage, which generates a JSON payload containing the list of tags it assigns to the summary, along with a confidence score for each tag. The payload also includes a short paragraph explaining the reasoning behind the choices and a list of key information found in the summary.

The main benefit of this approach is its flexibility in the choice of music tags. Tags can be freely added or removed, and written in natural language, without retraining a model, because the classification agent's reasoning capabilities are used to map a summary onto the tags. In addition, further semantic context can be attached to each tag to give the agent additional meaning, which may lead to more accurate results at the cost of additional prompt tokens.


-----------

This implementation uses the [LLaVA-Next-Video](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-34B-hf) vision model (the code below loads the 7B checkpoint) and [OpenAI GPT-4o](https://openai.com/index/hello-gpt-4o/) for the two agents respectively. Future work could investigate other models (closed or open source) and their effect on classification quality. Although not emphasised in this implementation, it would also be valuable to construct reliable success criteria for evaluating the performance of this pipeline.

Note that running this notebook requires an [OpenAI API Key](https://openai.com/index/openai-api/) and a [Hugging Face Token](https://huggingface.co/settings/profile).


## Prepare the video input

In [None]:
# We need av to be able to read the video files
!pip install -q av

In [None]:
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Taken from LLaVA_Next_Video HF repo https://huggingface.co/llava-hf/LLaVA-NeXT-Video-34B-hf

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [None]:
from google.colab import files
uploaded = files.upload()
vid_path = next(iter(uploaded))

Saving boxer.mp4 to boxer.mp4


In [None]:
vid_container = av.open(vid_path)

### Sample a set number of frames from the video

We sample a fixed number of frames (8 by default) from the video to feed into the visual language model.

In [None]:
NUM_SAMPLES = 8
total_frames = vid_container.streams.video[0].frames
frame_indices = np.arange(0, total_frames, total_frames / NUM_SAMPLES).astype(int)
vid_clips = read_video_pyav(vid_container, frame_indices)
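
The sampling step above relies on the container reporting a frame count. As a hedged sketch (the helper name `sample_frame_indices` is hypothetical and not part of the original notebook), the same evenly spaced sampling can be written with integer arithmetic and a guard for streams whose frame count is reported as 0, which PyAV does when the count is unknown. It also caps the number of samples for very short clips.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int = 8) -> List[int]:
    """Return up to `num_samples` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0:
        # Assumption: the caller treats an unknown frame count as an error
        raise ValueError("total_frames must be positive")
    # Never ask for more frames than the video contains
    num_samples = min(num_samples, total_frames)
    # Integer arithmetic avoids floating-point drift in the step size
    return [(i * total_frames) // num_samples for i in range(num_samples)]


print(sample_frame_indices(80, 8))  # [0, 10, 20, 30, 40, 50, 60, 70]
```

The returned list can be passed to `read_video_pyav` in place of `frame_indices`, since that function only indexes into it and tests membership.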


# Video Captioning using LM

TODOS:
- <input type="checkbox"/> Reconfigure model prediction function calling to accept different Hugging Face Vision Transformers

- <input type="checkbox"/> Experiment with different prompt templates

## Install and Import Necessary Dependencies

As recommended by the LLaVA-Next-Video Hugging Face repo, we load the model and its corresponding processor from the Hugging Face Hub and quantise the model to 4 bits using the BitsAndBytes library.
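
To give a rough sense of why 4-bit quantisation matters on a Colab GPU, the back-of-the-envelope sketch below estimates the size of the weights alone. It assumes a nominal 7 billion parameters and ignores the layers that stay unquantised, activations, and the generation cache, so the real footprint will be higher. The function name is hypothetical.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 7e9 parameters for the 7B checkpoint
fp16_gb = estimate_weight_memory_gb(7e9, 16)
int4_gb = estimate_weight_memory_gb(7e9, 4)
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink from roughly 14 GB to roughly 3.5 GB.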

In [None]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install -q -U transformers


## Load the Multimodal Model


In [None]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantisation_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
    )

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantisation_config,
    device_map="auto",
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Prepare the prompt for the visual model

In the prompt, you can refer to the video using the special `<video>` token (or `<image>` for images). To indicate which text comes from the human and which from the model, use `USER` and `ASSISTANT` respectively (note: this holds only for this checkpoint). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, the prompt must always end with `ASSISTANT:`.
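
For illustration only, the format above can be assembled by hand as in the following sketch. The function name `build_manual_prompt` is hypothetical, and the layout applies to this checkpoint only.

```python
def build_manual_prompt(user_text: str) -> str:
    """Assemble the `USER: <video>\\n<prompt> ASSISTANT:` string by hand."""
    return f"USER: <video>\n{user_text} ASSISTANT:"


manual_prompt = build_manual_prompt(
    "What is happening in this video? Be concise with your answer"
)
print(repr(manual_prompt))
```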


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone because each checkpoint expects its own prompt format, depending on the backbone language model. Luckily, we can use `apply_chat_template` to make this easier.

Chat templates are special templates written in Jinja and added to the model's config. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, build a list of messages with `role` and `content` keys and pass it to the `apply_chat_template()` method. The result is a prompt string ready to pass to the processor. When using chat templates as input for model generation, it is also a good idea to set `add_generation_prompt=True` to add a generation prompt. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
user_prompt = "What is happening in this video? Be concise with your answer"
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": user_prompt},
              {"type": "video"},
              ],
      },
]

# Note we add add_generation_prompt as per the hf Transformers Tutorial
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

In [None]:
prompt

'USER: <video>\nWhat is happening in this video? Be concise with your answer ASSISTANT:'

## Feed all of the inputs into the VQA model

In [None]:
inputs = processor(
    prompt,
    videos=vid_clips,
    padding=True,
    return_tensors="pt",
).to(model.device)

generate_kwargs = {
    "max_new_tokens": 500,
    "do_sample": True,
    "top_p": 0.9,
}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)




In [None]:
generated_text

['USER: \nWhat is happening in this video? Be concise with your answer ASSISTANT: In this video, a young man is seen wearing gloves, training for a boxing match. He practices punches and footwork while being watched by a trainer. The video captures the intense preparation and focus of the fighter before he enters the ring.']

## Extract the prompt response into another variable

In [None]:
generated_response = generated_text[0].split('ASSISTANT:')[-1].strip()
generated_response

'In this video, a young man is seen wearing gloves, training for a boxing match. He practices punches and footwork while being watched by a trainer. The video captures the intense preparation and focus of the fighter before he enters the ring.'

# LLM Classification

## Install and Import Necessary Dependencies

In [None]:
!pip install -q openai
!pip install -q instructor


In [None]:
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI
from enum import Enum
from typing import List

## Create Instructor (Pydantic Based) Data model for classification tags

In [None]:
# Read the OpenAI API key from Colab secrets
from google.colab import userdata
OPENAI_KEY = userdata.get('openai_api_key')

In [None]:
# Use the Instructor patched API version of OpenAI's API
client = instructor.patch(OpenAI(api_key=OPENAI_KEY))

## Use Instructor-Patched OpenAI API to execute the Classification prompt

Here we are creating our own custom pydantic data model using the [Instructor](https://github.com/jxnl/instructor) library. The key features of this library are:

- **Response Models**: Specify Pydantic models to define the structure of your LLM outputs
- **Retry Management**: Easily configure the number of retry attempts for your requests
- **Validation**: Ensure LLM responses conform to your expectations with Pydantic validation
- **Streaming Support**: Work with Lists and Partial responses effortlessly
- **Flexible Backends**: Seamlessly integrate with various LLM providers beyond OpenAI

Follow the format below to define the set of music tags used for classification.


In [None]:
# Enum of the music tags the classifier is allowed to choose from
class MusicTags(str, Enum):
    HAPPY = "Happy"
    SAD = "Sad"
    ANGRY = "Angry"
    FEARFUL = "Fearful"
    NATURE = "Nature"
    WORLD = "World"
    LOVE = "Love"
    FUNNY = "Funny"

class TagWithConfidence(BaseModel):
    tag: MusicTags
    confidence: float = Field(ge=0, le=1, description="Confidence score for the classification")

In [None]:
class VideoClassification(BaseModel):
    # Note: the field is named `tag` but holds a list of tags with confidences
    tag: List[TagWithConfidence]
    key_information: List[str] = Field(
        description="List of key points extracted from the video summary")
    reasoning: str = Field(
        description="A brief explanation of the reasoning behind the classifications")

In [None]:
SYSTEM_PROMPT = """
You are an AI assistant for a large music publishing library company that is providing Film Directors with background music.
Your role is to analyze summaries of video clips and provide structured tags that help us search our database for the music tracks that best fit each video clip.
Business Context:
- We have several music record labels that produce thousands of high-quality tracks, which are then stored in a database.
- This database categorises each track using the tags given
- Quick and accurate classification is crucial for finding the best track to use for videos.
Your tasks:
1. Categorize the video summary into the most appropriate tags. There can be more than one tag chosen
2. Provide a confidence score for each of your tags.
3. Extract key information that would be helpful for our company.
4. Provide a brief explanation of the reasoning behind your classification.
Remember:
- Be objective and base your analysis solely on the information provided in the summary.
- If you're unsure about any tags, reflect that in your confidence score for each tag.
- For 'key_information', extract specific details like characters, and the actions they are taking.
- The 'reasoning' should be a brief, concise explanation of your reasoning.
Analyze the following summary of a video and provide the requested information in the specified format.
"""
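
The introduction mentions that extra semantic context can be attached to each tag. One possible way to do this, sketched below, is to render a short glossary and append it to the system prompt. The descriptions, `TAG_DESCRIPTIONS`, and `build_tag_glossary` are hypothetical and written only for illustration; the enum is a trimmed copy so that the block runs on its own.

```python
from enum import Enum
from typing import Dict


class MusicTags(str, Enum):
    # Trimmed copy of the notebook's enum, repeated so this block is self-contained
    HAPPY = "Happy"
    ANGRY = "Angry"
    NATURE = "Nature"


# Hypothetical descriptions, not taken from any real tag database
TAG_DESCRIPTIONS: Dict[MusicTags, str] = {
    MusicTags.HAPPY: "Upbeat, celebratory or light-hearted scenes",
    MusicTags.ANGRY: "Aggressive, confrontational or high-intensity scenes",
    MusicTags.NATURE: "Outdoor scenery, wildlife or landscapes",
}


def build_tag_glossary(descriptions: Dict[MusicTags, str]) -> str:
    """Render one '- Tag: description' line per tag, under a header line."""
    lines = ["Tag definitions:"]
    for tag, description in descriptions.items():
        lines.append(f"- {tag.value}: {description}")
    return "\n".join(lines)


glossary = build_tag_glossary(TAG_DESCRIPTIONS)
print(glossary)
```

The glossary could then be concatenated onto `SYSTEM_PROMPT` before calling the classifier. Each description adds prompt tokens, which is the cost noted in the introduction.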

In [None]:
def classify_video(vid_summary: str) -> VideoClassification:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_model=VideoClassification,
        temperature=0,
        max_retries=3,
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {"role": "user", "content": vid_summary}
        ]
    )
    return response

In [None]:
test_summary = "The video shows a squirrel walking around on the ground, looking for food. It eventually finds and eats some food, then carries it away. The squirrel is frequently observed in its natural environment."

In [None]:
result1 = classify_video(generated_response)

print(result1.model_dump_json(indent=2))

{
  "tag": [
    {
      "tag": "Angry",
      "confidence": 0.7
    },
    {
      "tag": "Fearful",
      "confidence": 0.6
    }
  ],
  "key_information": [
    "young man training for a boxing match",
    "wearing gloves",
    "practicing punches and footwork",
    "watched by a trainer",
    "intense preparation and focus"
  ],
  "reasoning": "The video depicts a young man intensely training for a boxing match, which suggests a high level of determination and possibly anger or aggression. The focus and preparation also imply a sense of fear or anxiety about the upcoming match."
}
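
Downstream code will typically want only the tags the model is reasonably sure about. The sketch below parses a payload of the shape shown above and keeps tags at or above a threshold, sorted by confidence. The threshold value and the `select_tags` helper are assumptions chosen for illustration, not part of the original workflow.

```python
import json
from typing import List

# Trimmed copy of the payload printed above
payload = """
{
  "tag": [
    {"tag": "Angry", "confidence": 0.7},
    {"tag": "Fearful", "confidence": 0.6}
  ]
}
"""


def select_tags(result: dict, min_confidence: float = 0.65) -> List[str]:
    """Return tag names with confidence >= min_confidence, highest first."""
    kept = [t for t in result["tag"] if t["confidence"] >= min_confidence]
    kept.sort(key=lambda t: t["confidence"], reverse=True)
    return [t["tag"] for t in kept]


parsed = json.loads(payload)
print(select_tags(parsed))  # ['Angry']
```

With the notebook's own result object, the same dictionary can be obtained via `json.loads(result1.model_dump_json())`.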
