# Running LLaVa-NeXT-Video: a large multi-modal model on Google Colab

LLaVa-NeXT-Video is a new Large Vision-Language Model that enables interaction with videos and images. The model is based on a previous series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states.

LLaVA-NeXT shows surprisingly strong performance in understanding video content thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as multiple images. This technique generalizes naturally to videos, because a video can be treated as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, with supervised fine-tuning (SFT) on top of LLaVA-Next on video data, achieves better video understanding capabilities and is the current SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/pdf/2405.21075).
- LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), shows a further performance boost.

- Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video
- Project page: https://github.com/LLaVA-VL/LLaVA-NeXT



In [None]:
import os 
print(os.getenv("CONDA_DEFAULT_ENV"))

In [None]:
import av
import torch
import numpy as np
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

### Load the model

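Next, we load the model and its corresponding processor from the hub.

A 4-bit quantization config is defined in the cell above, but it is not used here: the `quantization_config` argument in the cell below is commented out, so the model is loaded without quantization. Uncomment that line to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.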

In [None]:
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", use_fast=True)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    #quantization_config=quantization_config,
    device_map='auto'
)
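
To get a feel for why 4-bit loading matters, here is a rough, weight-only memory estimate. It is a back-of-the-envelope sketch, not a measurement: the parameter count is an assumption taken from the "7B" in the checkpoint name, `weight_memory_gb` is a hypothetical helper, and the figures ignore activations, the KV cache, and any quantization overhead.

```python
# Rough weight-only memory estimate. The parameter count is an ASSUMPTION
# based on the "7B" in the checkpoint name, not a measured value.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "4-bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7e9  # assumed, see the note above
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions the weights alone shrink by roughly a factor of four when going from half precision to 4 bits, which is usually what decides whether the model fits on a single consumer GPU.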

## Preparing the video and image inputs

To read the video, we'll use PyAV (`av`) and uniformly sample 8 frames. You can try sampling more frames if the video is long. The model was trained with 32 frames, but it can ingest more as long as the input stays within the maximum sequence length of the LLM backbone.

In [None]:
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

### Set Video Path

In [None]:
# Path to a local video file (replace it with the path to your own video)
video_path = '/home/aritrad/MSR-Project/samples/black-screen.mp4'

container = av.open(video_path)

# sample uniformly 8 frames from the video (we can sample more for longer videos)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
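
The uniform sampling step can also be written as a small pure-Python helper with a guard for edge cases. This is a minimal sketch: `sample_frame_indices` is a hypothetical name, not part of PyAV or Transformers. The guard is there because, to my knowledge, some containers report a frame count of 0 when the count is unknown, which would make the step size zero.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick up to `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0:
        # Assumption: a non-positive count means the container did not report one.
        raise ValueError("total_frames must be positive")
    # Never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(sample_frame_indices(100, 8))
```

For a video with at least `num_frames` frames, this yields the same indices as the `np.arange(...).astype(int)` expression used in the cell above, and the resulting list can be passed to `read_video_pyav` in the same way.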

## Prepare a prompt and generate

In the prompt, you can refer to a video or an image using the special `<video>` or `<image>` token. To indicate which text comes from a human and which from the model, use `USER` and `ASSISTANT` respectively (note: this holds only for this checkpoint). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, you always need to end your prompt with `ASSISTANT:`.


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone, since each checkpoint expects its own prompt format depending on the backbone language model. Luckily, we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in Jinja and added to the model's config. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, simply build a list of messages with `role` and `content` keys, and pass it to the `apply_chat_template()` method. The output is a prompt string that is ready to be passed to the processor. When using chat templates as input for model generation, it's also a good idea to set `add_generation_prompt=True` to add a generation prompt. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
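
To make the idea concrete, the sketch below mimics in plain Python what a template for the `USER: <video>\n<prompt> ASSISTANT:` format described above would produce. It is an illustration only: `render_prompt` is a hypothetical stand-in, not the checkpoint's actual Jinja template, and the real output of `apply_chat_template` may differ in whitespace or other details.

```python
def render_prompt(messages, add_generation_prompt=True):
    """Hypothetical stand-in for a chat template using the USER/ASSISTANT format."""
    parts = []
    for message in messages:
        role = message["role"].upper()
        content = message["content"]
        # Media placeholders come first, followed by the text instruction.
        media = "".join(
            f"<{item['type']}>\n" for item in content if item["type"] in ("video", "image")
        )
        text = "".join(item["text"] for item in content if item["type"] == "text")
        parts.append(f"{role}: {media}{text}")
    prompt = " ".join(parts)
    if add_generation_prompt:
        # The generation prompt tells the model that it is its turn to answer.
        prompt += " ASSISTANT:"
    return prompt


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What happens in the video?"},
            {"type": "video"},
        ],
    }
]
print(repr(render_prompt(messages)))
```

The point of the sketch is that the message list carries only roles and content, while the role markers and the trailing generation prompt come from the template, so the same messages can be reused with a checkpoint that expects a different format.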

In [None]:
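def create_chat_message(prompt):
    '''
    Build a single-turn chat message holding a text prompt and one video placeholder.

    Args:
        prompt (str): The text instruction for the model.

    Returns:
        List[dict]: Messages in the format expected by `apply_chat_template`.
    '''
    message = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video"},
            ],
        },
    ]
    return message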

### Question 1: 4 min sample

In [None]:
question = 'When spinlocks in operating system should be used as per the video'
message = create_chat_message(question)

In [None]:
prompt = processor.apply_chat_template(message, add_generation_prompt=True)
inputs = processor(prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
generate_kwargs = {"max_new_tokens": 256, "do_sample": True, "top_p": 0.9}

In [None]:
%%time
output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text[0].split('ASSISTANT:')[1].strip())
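
Indexing `split('ASSISTANT:')[1]` raises an `IndexError` if the marker is missing from the decoded text, and in a multi-turn conversation it returns the first reply rather than the last. A slightly more defensive variant could look like the sketch below; `extract_answer` is a hypothetical helper, and the marker is the one used by this checkpoint.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last `marker`, or the whole string if it is absent."""
    # rpartition splits on the LAST occurrence; if the marker is not found,
    # it returns ('', '', decoded), so the full text is kept.
    _, _, answer = decoded.rpartition(marker)
    return answer.strip()


print(extract_answer("USER: <video>\nWhat is shown? ASSISTANT: A black screen."))
```

In the cell above, this would be used as `extract_answer(generated_text[0])`.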

### Question 2: 12 min sample

In [None]:
question = 'When spinlocks should be used'
message = create_chat_message(question)

In [None]:
prompt = processor.apply_chat_template(message, add_generation_prompt=True)
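# NOTE: `clip_karate` is never defined in this notebook. Load the second video
# with `av.open` and `read_video_pyav`, as done for `clip`, before running this cell.
inputs = processor(prompt, videos=clip_karate, padding=True, return_tensors="pt").to(model.device)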
generate_kwargs = {"max_new_tokens": 256, "do_sample": True, "top_p": 0.9}

In [None]:
%%time
output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
generated_text[0].split('ASSISTANT:')[1].strip()