<a href="https://colab.research.google.com/github/aaronjyang/transformers-testing/blob/main/LLaVa_NeXT_Video_demo_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running LLaVa-NeXT-Video: a large multi-modal model on Google Colab

LLaVa-NeXT-Video is a new Large Vision-Language Model that enables interaction with videos and images. The model is based on a previous series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states.


<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">


LLaVA-NeXT performs surprisingly well at understanding video content thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as multiple images. This generalizes naturally to videos, because a video can be considered a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, with supervised fine-tuning (SFT) on top of LLaVA-Next on video data, achieves better video understanding capabilities and is currently SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/pdf/2405.21075).
- LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), shows a further performance boost.

Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video
project page: https://github.com/LLaVA-VL/LLaVA-NeXT



First we need to install the latest `transformers` from `main`, as the model has just been added. We'll also install `bitsandbytes` to load the model with fewer bits per weight for [memory efficiency](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [2]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-gan8xuep
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-gan8xuep
  Resolved https://github.com/huggingface/transformers.git to commit fc269f77da72d4c65b2e71e6d4896cd16c6f1e76
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
# we need av to be able to read the video
!pip install -q av
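
After the installs above, it can help to confirm which versions actually ended up in the environment, since the `transformers` build comes from `main` and changes over time. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages installed in the cells above
for name in ("transformers", "accelerate", "bitsandbytes", "av"):
    print(f"{name}: {installed_version(name)}")
```

A value of `None` for any of these suggests the corresponding `pip install` did not succeed or that the runtime needs a restart.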

## Load the model

Next, we load a model and corresponding processor from the hub.

We will specify a quantization config to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.

In [4]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
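
To get a feel for why 4-bit loading helps, here is a rough back-of-the-envelope estimate of weight storage. It is a sketch under stated assumptions: the parameter count is taken as roughly 7 billion (from the "7B" in the checkpoint name), and the estimate ignores layers that stay in higher precision, quantization metadata, activations, and the KV cache.

```python
def approx_weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumption: about 7 billion parameters

fp16_gb = approx_weight_memory_gb(num_params, 16)
int4_gb = approx_weight_memory_gb(num_params, 4)

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{int4_gb:.1f} GB")
```

The actual footprint on the GPU will be higher than the 4-bit figure, but the ratio shows why quantization makes the model fit on smaller Colab GPUs.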

In [27]:
model


LlavaNextVideoForConditionalGeneration(
  (vision_tower): CLIPVisionModel(
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(577, 1024)
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-23): 24 x CLIPEncoderLayer(
            (self_attn): CLIPSdpaAttention(
              (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (v_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (q_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (out_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            )
            (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
       

## Preparing the video and image inputs

In order to read the video we'll use `av` and sample 8 frames. You can try to sample more frames if the video is long. The model was trained with 32 frames, but can ingest more as long as the input stays within the LLM backbone's maximum sequence length.
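
The uniform sampling described above can be sketched as a small helper. This is an illustrative variant, not the notebook's own code: the function name `sample_frame_indices` is hypothetical, and it uses `np.linspace` with `endpoint=False` so that it always returns exactly the requested number of indices.

```python
import numpy as np


def sample_frame_indices(total_frames, num_frames=8):
    """Return evenly spaced frame indices in the range [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never ask for more frames than the video contains
    num_frames = min(num_frames, total_frames)
    return np.linspace(0, total_frames, num_frames, endpoint=False).astype(int)


print(sample_frame_indices(80, 8))
print(sample_frame_indices(5, 8))
```

The resulting array can be passed as the `indices` argument of a frame reader such as `read_video_pyav`.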

In [6]:
# prompt: write code that takes a filepath to an image and converts it to a numpy array

from PIL import Image
import numpy as np

def image_to_numpy(image_path):
  """
  Converts an image at the given filepath to a NumPy array.

  Args:
    image_path: The path to the image file.

  Returns:
    A NumPy array representing the image, or None if the image cannot be opened.
  """
  try:
    img = Image.open(image_path)
    return np.array(img)
  except FileNotFoundError:
    print(f"Error: Image file not found at {image_path}")
    return None
  except Exception as e:
    print(f"Error opening or converting image: {e}")
    return None

In [7]:

# Frame files are numbered 02, 04, ..., 50 with zero-padding to two digits
img_paths = [f"/content/car_crash/C_000001_{i:02d}.jpg" for i in range(2, 51, 2)]
img_paths

['/content/car_crash/C_000001_02.jpg',
 '/content/car_crash/C_000001_04.jpg',
 '/content/car_crash/C_000001_06.jpg',
 '/content/car_crash/C_000001_08.jpg',
 '/content/car_crash/C_000001_10.jpg',
 '/content/car_crash/C_000001_12.jpg',
 '/content/car_crash/C_000001_14.jpg',
 '/content/car_crash/C_000001_16.jpg',
 '/content/car_crash/C_000001_18.jpg',
 '/content/car_crash/C_000001_20.jpg',
 '/content/car_crash/C_000001_22.jpg',
 '/content/car_crash/C_000001_24.jpg',
 '/content/car_crash/C_000001_26.jpg',
 '/content/car_crash/C_000001_28.jpg',
 '/content/car_crash/C_000001_30.jpg',
 '/content/car_crash/C_000001_32.jpg',
 '/content/car_crash/C_000001_34.jpg',
 '/content/car_crash/C_000001_36.jpg',
 '/content/car_crash/C_000001_38.jpg',
 '/content/car_crash/C_000001_40.jpg',
 '/content/car_crash/C_000001_42.jpg',
 '/content/car_crash/C_000001_44.jpg',
 '/content/car_crash/C_000001_46.jpg',
 '/content/car_crash/C_000001_48.jpg',
 '/content/car_crash/C_000001_50.jpg']

In [8]:
crash = np.array([image_to_numpy(img) for img in img_paths])

Error: Image file not found at /content/car_crash/C_000001_02.jpg
Error: Image file not found at /content/car_crash/C_000001_04.jpg
Error: Image file not found at /content/car_crash/C_000001_06.jpg
Error: Image file not found at /content/car_crash/C_000001_08.jpg
Error: Image file not found at /content/car_crash/C_000001_10.jpg
Error: Image file not found at /content/car_crash/C_000001_12.jpg
Error: Image file not found at /content/car_crash/C_000001_14.jpg
Error: Image file not found at /content/car_crash/C_000001_16.jpg
Error: Image file not found at /content/car_crash/C_000001_18.jpg
Error: Image file not found at /content/car_crash/C_000001_20.jpg
Error: Image file not found at /content/car_crash/C_000001_22.jpg
Error: Image file not found at /content/car_crash/C_000001_24.jpg
Error: Image file not found at /content/car_crash/C_000001_26.jpg
Error: Image file not found at /content/car_crash/C_000001_28.jpg
Error: Image file not found at /content/car_crash/C_000001_30.jpg
Error: Ima

In [9]:
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [10]:
videos = ["000010", "000016", "000049", "000055", "000061", "000073", "000074",
          "000090", "000161", "000164", "000166", "000171", "000435", "000500",
          "000539", "000618"]
videos = ["/content/usable_videos/" + item + ".mp4" for item in videos]


In [11]:
containers = [av.open(video) for video in videos]
# Each clip has 50 frames (see the shape printed below), so the valid indices are 0..49
indices = np.arange(50)
clips = np.asarray([read_video_pyav(container, indices) for container in containers])


In [12]:
np.shape(clips)

(16, 50, 720, 1280, 3)

In [17]:
from huggingface_hub import hf_hub_download

# Download video from the hub
video_path_1 = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
video_path_2 = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="karate.mp4", repo_type="dataset")

container = av.open(video_path_1)

# sample uniformly 8 frames from the video (we can sample more for longer videos)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip_baby = read_video_pyav(container, indices)


container = av.open(video_path_2)

# sample uniformly 8 frames from the video (we can sample more for longer videos)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip_karate = read_video_pyav(container, indices)

sample_demo_1.mp4:   0%|          | 0.00/1.55M [00:00<?, ?B/s]

karate.mp4:   0%|          | 0.00/60.7M [00:00<?, ?B/s]

In [11]:
crash.shape

(25,)

In [13]:
from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
video = clips[1]

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())

## Prepare a prompt and generate

In the prompt, you can refer to a video or an image using the special `<video>` or `<image>` token. To indicate which text comes from a human vs. the model, one uses `USER` and `ASSISTANT` respectively (note: this is true only for this checkpoint). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, you always need to end your prompt with `ASSISTANT:`.


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone, since each checkpoint expects its own prompt format depending on the backbone language model. Luckily we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in Jinja and added to the model's config. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, build a list of messages with `role` and `content` keys and pass it to the `apply_chat_template()` method. The result is a prompt string that is ready to be passed to the processor. When using chat templates as input for model generation, it's also a good idea to pass `add_generation_prompt=True` to add a generation prompt. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
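
To make the prompt format concrete, here is a plain-Python stand-in that mimics the `USER: ... ASSISTANT:` layout for this checkpoint. It is only an illustration of the string being built: `render_prompt` is a hypothetical helper, not the actual Jinja template shipped with the model, and other checkpoints use different formats.

```python
def render_prompt(conversation, add_generation_prompt=True):
    """Render messages into the 'USER: ... ASSISTANT: ...' layout (illustrative only)."""
    parts = []
    for message in conversation:
        role = message["role"].upper()
        content = message["content"]
        if isinstance(content, str):
            visuals, text = "", content
        else:
            # Visual placeholders come first, each followed by a newline, then the text
            visuals = "".join(
                f"<{item['type']}>\n" for item in content if item["type"] in ("image", "video")
            )
            text = " ".join(item["text"] for item in content if item["type"] == "text")
        parts.append(f"{role}: {visuals}{text}")
    prompt = " ".join(parts)
    if add_generation_prompt:
        prompt += " ASSISTANT:"
    return prompt


example = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a car crash in this video?"},
            {"type": "video"},
        ],
    },
]
print(repr(render_prompt(example)))
```

In practice `processor.apply_chat_template` should be preferred, because it always reflects the template stored with the checkpoint.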

In [104]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Is there a car crash in this video?"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)


In [67]:
type(conversation[0])

dict

In [105]:
# As you can see we got the USER: ASSISTANT: format prompt
prompt

'USER: <video>\nIs there a car crash in this video? ASSISTANT:'

In [113]:
# we still need to call the processor to tokenize the prompt and get pixel_values for videos
inputs = processor([prompt], videos=[clips[0]], padding=True, return_tensors="pt").to(model.device)

In [114]:
inputs

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [107]:
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [108]:
print(generated_text)

['USER: \nIs there a car crash in this video? ASSISTANT: There does not appear to be a car crash in the video you provided. Instead, it seems to show a bustling city street with vehicles traveling along, some with their lights on. However, without specific information about any event, it is difficult to determine if there was a crash or not. The video captures the movement of vehicles on a city street.']


In [109]:
new_conversation = conversation.copy()

In [100]:
generated_text[0][60-7+1:]

"The video shows a car accident involving several vehicles in a city environment at night. While the video does depict an accident scene with multiple cars and one appears to be in the process of rolling over, which is consistent with a rollover vehicle collision. Rollover accidents can cause significant damage and pose a high risk of injury or fatality to the occupants. The exact nature of the accident, whether it's a collision or another cause of the rollover, cannot be determined"

In [111]:
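# The slice skips the echoed prompt (54 characters, up to and including "ASSISTANT: ") and keeps only the reply
new_conversation.append({"role": "assistant", "content": generated_text[0][60-7+1:]})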

In [112]:
new_conversation

[{'role': 'user',
  'content': [{'type': 'text', 'text': 'Is there a car crash in this video?'},
   {'type': 'video'}]},
 {'role': 'assistant',
  'content': 'There does not appear to be a car crash in the video you provided. Instead, it seems to show a bustling city street with vehicles traveling along, some with their lights on. However, without specific information about any event, it is difficult to determine if there was a crash or not. The video captures the movement of vehicles on a city street.'}]
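
The hard-coded slice offset used above depends on the exact length of the prompt, so it breaks as soon as the question changes. A less brittle sketch, assuming the decoded text contains the `ASSISTANT:` marker as it does for this checkpoint, splits on the last occurrence of that marker. The helper name `extract_reply` and the shortened example string are for illustration only.

```python
def extract_reply(decoded, marker="ASSISTANT:"):
    """Return the text after the last occurrence of the marker, stripped of whitespace."""
    _, found, reply = decoded.rpartition(marker)
    # If the marker is absent, fall back to the whole string
    return reply.strip() if found else decoded.strip()


# Shortened version of a decoded output from this notebook
decoded = "USER: \nIs there a car crash in this video? ASSISTANT: There does not appear to be a car crash."
print(extract_reply(decoded))
```

Because it uses the last marker, this also works for multi-turn prompts, as long as the reply itself does not contain the marker text.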

In [61]:
questions = ["Is there a car crash?", "Is there broken glass?"]

In [74]:
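# generated_text is a list (from batch_decode); storing it whole as "text" leads to the TypeError below, so prefer generated_text[0]
conversation.append({"role": "assistant", "content": [{"type": "text", "text": generated_text}]})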

In [75]:
conversation

[{'role': 'user',
  'content': [{'type': 'text', 'text': 'Is there a car crash in this video?'},
   {'type': 'video'}]},
 {'role': 'assistant',
  'content': [{'type': 'text',
    'text': ["USER: \nIs there a car crash in this video? ASSISTANT: Based on the video description provided, it's not entirely clear whether there is a car crash or not. The video shows a blurry view of a street with vehicles and bright lights at night, including cars and trucks, but due to the lack of clarity in the footage, it's difficult to definitively determine if a car crash has occurred. The context provided is not sufficient to confirm or deny the presence of a car crash."]}]}]

In [76]:
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)


TypeError: can only concatenate list (not "str") to list
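
The conversation printed above stores a whole list under the assistant's `"text"` key, which is most likely what the chat template fails on, since it expects a plain string there. One way to avoid this is a small helper that validates the text before appending a turn. This is a sketch: `add_turn` is a hypothetical helper, not part of `transformers`.

```python
def add_turn(conversation, role, text):
    """Return a new conversation with one text-only turn appended."""
    if not isinstance(text, str):
        raise TypeError(f"text must be a str, got {type(text).__name__}")
    new_message = {"role": role, "content": [{"type": "text", "text": text}]}
    # Build a new list so the original conversation is left untouched
    return conversation + [new_message]


conversation_example = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a car crash in this video?"},
            {"type": "video"},
        ],
    },
]
extended = add_turn(conversation_example, "assistant", "No crash is visible.")
extended = add_turn(extended, "user", "Is there broken glass?")
print(len(extended))
```

With decoded model output, the reply string (for example `generated_text[0]` with the prompt removed) would be passed as `text`, never the list returned by `batch_decode`.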

### Generate from images and image+video data

To generate from images, change the special token to `<image>` or indicate an "image" modality in the chat template. That's it! Let's see how it works.

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this image?"},
              {"type": "image"},
              ],
      },
]

conversation_2_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What color is the sign?"},
              {"type": "image"},
              ],
      },
]

prompt_image = processor.apply_chat_template(conversation_image, add_generation_prompt=True)
prompt_2_image = processor.apply_chat_template(conversation_2_image, add_generation_prompt=True)

In [None]:
prompt_image

'USER: <image>\nWhat do you see in this image? ASSISTANT:'

In [None]:
inputs = processor([prompt_image, prompt_2_image], images=[image_snowman, image_stop], padding=True, return_tensors="pt").to(model.device)

In [None]:
generate_kwargs = {"max_new_tokens": 50, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [None]:
print(generated_text)

["USER: \nWhat do you see in this image? ASSISTANT: In the image, I see an animated depiction of a snowman. The snowman appears to be sitting and gazing into the distance, seemingly contemplative. It's dressed in a scarf and hat, and there are two", 'USER: \nWhat color is the sign? ASSISTANT: The sign in the image is red with white text.']


We can feed images and videos in one go instead of running separate generations for images and videos. We can also interleave images with videos inside one prompt, although the model did not see such examples during training.

When processing, make sure to pass images/videos in the same order as they appear in the prompts, from the first prompt to the last. You can pass all visual data as a flattened list as shown below; only the order matters.
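
The ordering rule can be sketched with a small helper that collects per-prompt inputs into the flat lists the processor call expects. The helper `flatten_visuals` and the dictionary layout are assumptions made for illustration, and strings stand in for the actual images and clips.

```python
def flatten_visuals(samples):
    """Collect prompts, images and videos into flat lists, preserving prompt order."""
    prompts, images, videos = [], [], []
    for sample in samples:
        prompts.append(sample["prompt"])
        images.extend(sample.get("images", []))
        videos.extend(sample.get("videos", []))
    return prompts, images, videos


# Placeholder strings stand in for real prompts, PIL images and frame arrays
samples = [
    {"prompt": "video prompt", "videos": ["clip_baby"]},
    {"prompt": "first image prompt", "images": ["image_snowman"]},
    {"prompt": "second image prompt", "images": ["image_stop"]},
]
prompts, images, videos = flatten_visuals(samples)
print(prompts, images, videos)
```

The three lists would then be passed as the text, `images`, and `videos` arguments of the processor, respectively.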





In [None]:
inputs = processor([prompt, prompt_image, prompt_2_image], images=[image_snowman, image_stop], videos=[clip_baby], padding=True, return_tensors="pt").to(model.device)

In [None]:
generate_kwargs = {"max_new_tokens": 40, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text)

In [30]:
# For multi-turn conversations, just keep stacking up messages in the chat template
conversation_multiturn = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
      {
          "role": "assistant",
          "content": [
              {"type": "text", "text": "I see a baby reading a book."},
              ],
      },
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Why is it funny?"},
              ],
      },
]

prompt_multiturn = processor.apply_chat_template(conversation_multiturn, add_generation_prompt=True)
print(prompt_multiturn)

USER: <video>
What do you see in this video? ASSISTANT: I see a baby reading a book. USER: Why is it funny? ASSISTANT:
