https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVA-NeXT-Video/Fine_tune_LLaVa_NeXT_Video_with_HFTrainer.ipynb

## Prerequisites
Before we start, make sure you have the following:

- Access to GPUs (preferably 80GB or more since videos require high sequence lengths).
- Familiarity with Hugging Face’s Transformers library.
- The necessary packages, pre-installed by running the cell below.

You only need to install one of the video decoders, whichever you plan to use. Helper functions for reading videos with each of the three libraries are provided below; the default is decord, which I found to be 8-10x faster.

In [2]:
!pip install -U -q transformers accelerate bitsandbytes peft datasets
!pip install -q av decord opencv-python

## Fine-tune LLaVa-NeXT-Video on a video captioning dataset
In this notebook, we are going to fine-tune the LLaVa-NeXT-Video model on the ShareGPTVideo dataset, which is a video captioning dataset. Note that video datasets usually require a lot of disk space to download the videos, so we'll avoid keeping the videos on disk and discard them after processing the inputs.

LLaVa-NeXT-Video is a new Large Vision-Language Model that enables interaction with videos and images. The model is based on a previous series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states.


<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">


LLaVA-NeXT has surprisingly strong performance in understanding video content thanks to the AnyRes technique that it uses. AnyRes represents a high-resolution image as multiple images. This technique generalizes naturally to videos, because a video can be considered a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, with supervised fine-tuning (SFT) on top of LLaVA-Next on video data, achieves better video understanding capabilities and is the second best-performing model among open-source models on [VideoMME bench](https://arxiv.org/pdf/2405.21075)
- LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), shows further performance boost.


In this notebook we'll use the [LLaVa-NeXT-Video-7b-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) checkpoint.

- LLaVA-Next-Video [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video)
- LLaVA-Next-Video [checkpoints on the hub](https://huggingface.co/collections/llava-hf/llava-next-video-6666a9173a64c7052930f153)
- LLaVA-Next-Video [project page](https://github.com/LLaVA-VL/LLaVA-NeXT)

## Define variables
We'll first set some variables that are useful throughout this notebook and do all the necessary imports.

In [11]:
import os
import av
import fsspec
import shutil
import numpy as np

from transformers import Trainer, TrainingArguments, Seq2SeqTrainingArguments, DataCollatorForLanguageModeling
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from huggingface_hub import snapshot_download, hf_hub_download, HfFileSystem
from datasets import load_dataset, concatenate_datasets


MAX_LENGTH = 256
BATCH_SIZE = 4
NUM_FRAMES = 8 # more frames -> more VRAM needed

USE_LORA = False
USE_QLORA = True

In [12]:
DATASET_PATH = "/share/data/drive_2/sharegpt4video" # path where to save the dataset
OUTPUT_DIR = "/share/users/shehan/workspace_pointing_lmm/MolmoVideo/sharegpt4" # path where to save the checkpoints

MODEL_ID = "llava-hf/LLaVa-NeXT-Video-7b-hf"

REPO_ID = "RaushanTurganbay/LLaVa-NeXT-Video-demo" # Change to your hf-hub repo

## Prepare dataset

We will start by downloading and processing the dataset. Downloading all the videos from ShareGPTVideo might require more than 900 GB of disk space for the zip files alone (unzipping them requires even more), so we'll download a mini-batch of videos into a temp directory and delete it after we're done. Note that this will still require some disk space to hold the temp directory and to cache the processed dataset.

The datasets in the hub are usually a [DatasetDict](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict) where keys are data splits and values are `Dataset` objects. We can inspect the dataset layout by simply printing it, and see that ShareGPTVideo consists of a video path and a set of captions for each video.

In [3]:
dataset = load_dataset("ShareGPT4Video/ShareGPT4Video")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['video_id', 'video_path', 'timestamp', 'keyframe', 'captions', 'zip_folder'],
        num_rows: 40178
    })
})

In [5]:
dataset['train'][0]

{'video_id': '02bd228227979a5663a155011f5e3740853f729d68bc12f8fefc75eeb7630379',
 'video_path': 'pixabay/02bd228227979a5663a155011f5e3740853f729d68bc12f8fefc75eeb7630379.mp4',
 'timestamp': ['00:00:00.000', '00:00:04.000'],
 'keyframe': [0.0, 2.0, 4.0],
 'captions': [{'idx': '1',
   'content': "The frame displays a person with a complexion that appears fair, as indicated by the visible shoulders and upper chest. The individual is clad in a dark tank top, the straps of which are slender, overlying the person's shoulders, suggesting a casual or athletic attire. There is a conspicuous contrast between the subject's light skin tone and the tank top's dark fabric, as well as against the background, which is predominantly dark. The background offers no discernible detail, thereby isolating the subject as the focal point. The lighting in the frame seems to be directed from the front, casting subtle shadows and highlighting the musculature and contours of the visible skin. The posture of the p

Below we have three video reader functions. We'll use decord here as it's faster than PyAV. I am leaving PyAV as an option in case you encounter errors with decord (especially for Windows users or in a decord-gpu kernel), since decord is no longer maintained.

In [6]:
import cv2

def read_video_opencv(video_path, num_frames=NUM_FRAMES):
    '''
    Decode the video with open-cv decoder.

    Args:
        video_path (str): Path to the video file.
        num_frames (int): Number of frames to sample uniformly. Defaults to NUM_FRAMES

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    video = cv2.VideoCapture(video_path)
    fps = int(video.get(cv2.CAP_PROP_FPS))
    total_num_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.arange(0, total_num_frames, total_num_frames / num_frames).astype(int)
    frames = process_video_cv2(video, indices, total_num_frames)
    return np.stack(frames)


def process_video_cv2(video: cv2.VideoCapture, indices: np.ndarray, length: int):
    index = 0
    frames = []
    wanted = set(indices.tolist())  # set lookup is faster than `in` on an array
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break  # end of stream or decode error: stop instead of looping forever
        if index in wanted:
            # OpenCV decodes to BGR; convert to RGB to match the decord and PyAV readers
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
        if index >= length:
            break

    video.release()
    return frames

In [7]:
from decord import VideoReader, gpu, cpu

def read_video_decord(video_path, num_frames=NUM_FRAMES):
    '''
    Decode the video with Decord decoder.

    Args:
        video_path (str): Path to the video file.
        num_frames (int): Number of frames to sample uniformly. Defaults to NUM_FRAMES

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    vr = VideoReader(uri=video_path, ctx=cpu(0)) # you need to install from source to use gpu ctx
    indices = np.arange(0, len(vr), len(vr) / num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()
    return frames

In [8]:
def read_video_pyav(video_path, num_frames=NUM_FRAMES):
    '''
    Decode the video with PyAV decoder.

    Args:
        video_path (str): Path to the video file.
        num_frames (int): Number of frames to sample uniformly. Defaults to NUM_FRAMES

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    container = av.open(video_path)

    # sample uniformly "num_frames" frames from the video
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / num_frames).astype(int)

    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
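
All three readers above pick frames with the same expression, `np.arange(0, total, total / num_frames).astype(int)`. As a small illustration of what that produces (the helper name `uniform_frame_indices` is only for this sketch and is not part of the notebook):

```python
import numpy as np

def uniform_frame_indices(total_frames, num_frames):
    """Return evenly spaced frame indices, using the same formula as the readers above."""
    return np.arange(0, total_frames, total_frames / num_frames).astype(int)

# A 100-frame clip sampled at 8 frames: one index every 12.5 frames, truncated to int
indices = uniform_frame_indices(100, 8)
print(indices.tolist())  # [0, 12, 25, 37, 50, 62, 75, 87]

# A clip shorter than num_frames simply repeats frames
short_indices = uniform_frame_indices(4, 8)
print(short_indices.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Note that very short clips yield repeated indices rather than an error, so every sample still ends up with `num_frames` frames.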

## Custom Dataset

In the next step, we'll define the functions needed to prepare our data for fine-tuning the LLaVa-NeXT-Video model. We define a `collate_fn` function that converts dataset samples into the format required for training and evaluation by preparing a prompt and building an array from each video.

NOTE: LLaVa-NeXT-Video accepts videos in one of the following formats:

- an array or tensor of shape: (batch-size, frames, channel, height, width) where batch-size is an optional dimension
- a list of arrays/tensors of shape: (frames, channel, height, width)
- a nested list of video frames, where each frame is an image as PIL Image/array/tensor

Here we're going to use the processor to turn the (video, target token sequence) pair into the format that the model expects (`pixel_values_videos`, `input_ids`, etc.). NOTE: We do not batch the data yet, so that we can batch dynamically during training and evaluation. Dynamic batching pads the data only to the longest sequence within each batch.

We also limit the length of the text tokens (input_ids) to a max length due to memory constraints. Feel free to increase it if your target token sequences are longer (I'd recommend plotting the distribution of token lengths in your dataset to determine a good value).
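
One way to turn a token-length distribution into a concrete limit is to pick a high percentile, so that only a small share of samples gets truncated. The sketch below is illustrative only: `suggest_max_length` is a hypothetical helper, and the lengths are made-up numbers rather than measurements from ShareGPTVideo.

```python
import numpy as np

def suggest_max_length(token_lengths, percentile=95):
    """Return the smallest integer length that covers `percentile` percent of the samples."""
    lengths = np.asarray(token_lengths)
    return int(np.ceil(np.percentile(lengths, percentile)))

# Made-up token counts; in practice you would collect them from your own data,
# e.g. len(processor.tokenizer(caption)["input_ids"]) for each caption
toy_lengths = [120, 180, 200, 210, 230, 240, 250, 300, 320, 900]

print("95th percentile:", suggest_max_length(toy_lengths))
print("max:", suggest_max_length(toy_lengths, percentile=100))
```

A single very long outlier (900 here) pulls the maximum far above the typical length, which is why a percentile is often a more economical choice than the maximum.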

The formatting of the input_ids is super important: we need to respect a so-called chat template. LLaVa-NeXT-Video processor has a special `apply_chat_template` which will help you to use the correct format by simply providing the text/images as input. You can also have a multi-turn conversation, and the template converter will take care of the formatting for you.
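
As a rough sketch of what a multi-turn conversation looks like before it is handed to the chat template, the hypothetical helper below builds the same list-of-dicts structure used in this notebook. The example questions and answers are invented for illustration.

```python
def build_conversation(turns, include_video=True):
    """Build a chat-template conversation from (user_text, assistant_text) pairs.

    The video placeholder is attached to the first user turn only.
    """
    conversation = []
    for i, (user_text, assistant_text) in enumerate(turns):
        user_content = [{"type": "text", "text": user_text}]
        if include_video and i == 0:
            user_content.append({"type": "video"})
        conversation.append({"role": "user", "content": user_content})
        conversation.append(
            {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]}
        )
    return conversation

conversation = build_conversation([
    ("Provide a detailed caption for this video.", "A child paints at an easel."),
    ("What is the child wearing?", "An orange knitted jumper."),
])
print([message["role"] for message in conversation])
# ['user', 'assistant', 'user', 'assistant']

# The result can then be formatted the same way as in the rest of the notebook:
# prompt = processor.apply_chat_template(conversation, add_generation_prompt=False)
```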

In [9]:
# We collate to save everything in tensor format to speed-up dataloading process
# Saving the whole video clip (array) along with caption (string) will slow down iteration
# because unprocessed video clip will take up more memory due to higher resolution
# The processed video on the other hand is always 336x336 in size and fixed frame count per clip
# see: https://discuss.huggingface.co/t/slow-iteration-speed-with-and-without-keep-in-memory-true/33587

def collate_fn(example, path):
    video_file = example["video_path"].split("/")[-1]
    video_clip = read_video_decord(f'{path}/{video_file}') # change to the video decoder you want

    # we'll take the overall video caption, not per-scene caption for each frame
    captions_all = [caption for caption in example['captions'] if caption['idx'] == '-1']
    caption = captions_all[0]['content']

    # Let's use chat template to format the prompt correctly
    conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Provide a detailed caption for this video."},
                    {"type": "video"},
                    ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": caption},
                     ],
            },
        ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=False)

    batch = processor(
        text=prompt,
        videos=video_clip,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt"
    )

    return batch
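
`collate_fn` assumes that every sample has a caption entry with `idx == '-1'` (the whole-video caption) and would raise an `IndexError` otherwise. A minimal sketch of a more defensive lookup, assuming the caption layout shown in the dataset sample above; `select_video_caption` is a hypothetical helper and the captions below are toy data:

```python
def select_video_caption(captions, overall_idx="-1"):
    """Return the whole-video caption, or None if the sample has no such entry."""
    for caption in captions:
        if caption["idx"] == overall_idx:
            return caption["content"]
    return None

toy_captions = [
    {"idx": "1", "content": "Caption for the first keyframe."},
    {"idx": "2", "content": "Caption for the second keyframe."},
    {"idx": "-1", "content": "Caption for the whole video."},
]

print(select_video_caption(toy_captions))      # Caption for the whole video.
print(select_video_caption(toy_captions[:2]))  # None
```

Samples for which this returns `None` could be filtered out before calling `map`.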

In [10]:
# And we also need to load the processor for collate_fn
processor = AutoProcessor.from_pretrained(MODEL_ID, use_fast=False)
processor.tokenizer.padding_side = "right" # during training, one always uses padding on the right

RuntimeError: Failed to import transformers.models.llava_next_video.processing_llava_next_video because of the following error (look up to see its traceback):
No module named 'transformers.models.llava_next_video.processing_llava_next_video'

In case you don't have much disk space, run the cell below and skip the one after it. It downloads and processes each zip file separately and deletes the videos after processing.

If you want to download everything and keep it on disk in the cache, skip the cell below and run the one after it.

Additionally, you can take a look at how to do streaming from HF Hub [here](https://colab.research.google.com/drive/1suYlqG6gyjeslUcXbWAn6EsiFqJXqKns?usp=sharing), in case you do not want to download anything.

In [12]:
# Download iteratively and delete after done

datasets_combined = []
fs = HfFileSystem()
directory = f"{DATASET_PATH}/temp_dir"

# zip_folders = {"mixit", "bdd100k", "ego4d", "pexels", "pixabay"} #TODO: Uncomment this line
zip_folders = {"pexels"}


for zip_folder in zip_folders:
    print(f"Processing folder: {zip_folder}...")
    zip_files = fs.ls(f"datasets/ShareGPT4Video/ShareGPT4Video/zip_folder/{zip_folder}", detail=False)
    for zip_file in zip_files:
        zip_file = zip_file.split("/")[-1]
        path = hf_hub_download(
            repo_id='ShareGPT4Video/ShareGPT4Video',
            repo_type="dataset",
            filename=f"zip_folder/{zip_folder}/{zip_file}",
            local_dir=f"{DATASET_PATH}/{zip_folder}",  # save in dataset_dir to avoid caching
            cache_dir=DATASET_PATH,
        )
        subdataset_name = zip_file.split("_")[0]

        if os.path.exists(directory): # create temp dir, remove if it's not the first zip-file being processed
            shutil.rmtree(directory)
        os.makedirs(directory)

        if path.endswith(".zip"):
            shutil.unpack_archive(path, directory)

            # get small part of dataset with curr downloaded video files only
            curr_video_files = os.listdir(directory)
            small_dataset = dataset.filter(lambda example: example["video_path"].split("/")[-1] in curr_video_files)

            small_dataset = small_dataset.map(
                collate_fn,
                batched=False, # false to read video one-by-one
                fn_kwargs={"path": directory},
                num_proc=2, # set num_proc higher to faster process
                remove_columns=["captions", "keyframe", "timestamp", "video_id", "video_path"],
                writer_batch_size=400, # reduce writer_batch_size to have batches smaller than 2GB
            )
            datasets_combined.append(small_dataset['train']) # ShareGPTVideo only has train set, so let's save only that
        os.remove(path) # remove this zip-file after we're done

Processing folder: pexels...


pexels_videos_1.zip:   0%|          | 0.00/14.3G [00:00<?, ?B/s]

Filter:   0%|          | 0/40178 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/196 [00:00<?, ? examples/s]

pexels_videos_10.zip:   0%|          | 0.00/14.3G [00:00<?, ?B/s]

Filter:   0%|          | 0/40178 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/195 [00:00<?, ? examples/s]

pexels_videos_11.zip:   0%|          | 0.00/10.6G [00:00<?, ?B/s]

Filter:   0%|          | 0/40178 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/199 [00:00<?, ? examples/s]

pexels_videos_12.zip:   0%|          | 0.00/15.6G [00:00<?, ?B/s]

KeyboardInterrupt: 
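
The download, unpack, process, delete pattern used above can also be written with `tempfile.TemporaryDirectory`, which removes the unpacked files automatically even if processing fails. This is a sketch with the standard library only; `process_archive` is a hypothetical helper and the archive is a toy stand-in for the real video zips.

```python
import os
import shutil
import tempfile
import zipfile

def process_archive(zip_path, process_file):
    """Unpack zip_path into a temporary directory, apply process_file to each file, then clean up."""
    results = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        shutil.unpack_archive(zip_path, tmp_dir)
        for root, _, files in os.walk(tmp_dir):
            for name in sorted(files):
                results.append(process_file(os.path.join(root, name)))
    # tmp_dir and everything unpacked into it are deleted at this point
    return results

# Toy demo: a small archive with two fake "videos"
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "toy_videos.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("a.mp4", b"1234")
    zf.writestr("b.mp4", b"12")

sizes = process_archive(zip_path, os.path.getsize)
print(sizes)  # [4, 2]

shutil.rmtree(work_dir)  # remove the downloaded archive as well
```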

In [None]:
# Download in one go

# NOTE: the archives are named like "pexels_videos_1.zip", so we match every zip file
videos_path = snapshot_download(repo_id='ShareGPT4Video/ShareGPT4Video', repo_type="dataset", allow_patterns="*.zip")

# uncomment if you want to cache in specific folder
# videos_path = snapshot_download(repo_id='ShareGPT4Video/ShareGPT4Video', repo_type="dataset", cache_dir="PATH WHERE TO CACHE")

# Now unzip each file and process
datasets_combined = []
directory = f"{DATASET_PATH}/videos_ShareGPT"
zip_folders = {"ego4d", "mixit", "pexels", "pixabay", "bdd100k"}

for zip_folder in zip_folders:
    # the archives live under "zip_folder/<name>" in the repo (same layout as in the cell above)
    zip_dir = f"{videos_path}/zip_folder/{zip_folder}"
    for zip_file in os.listdir(zip_dir):
        zip_file_path = f"{zip_dir}/{zip_file}"
        shutil.unpack_archive(zip_file_path, f"{directory}/{zip_folder}")

    small_dataset = dataset.filter(lambda example: example["video_path"].startswith(zip_folder))

    # set num_proc higher for faster processing
    small_dataset = small_dataset.map(collate_fn, batched=False, fn_kwargs={"path": f"{directory}/{zip_folder}"}, num_proc=8)
    datasets_combined.append(small_dataset['train']) # ShareGPTVideo only has train set, so let's save only that

In [None]:
# Concatenate the datasets we have, shuffle, and split into train/test sets
dataset_processed = concatenate_datasets(datasets_combined)
dataset_processed = dataset_processed.shuffle(seed=42)
dataset = dataset_processed.train_test_split(test_size=0.2)

In [None]:
train_dataset, test_dataset = dataset['train'].with_format("torch"), dataset['test'].with_format("torch")

In [None]:
# For demo purposes only a small portion of the dataset was downloaded
# The whole dataset has 40k videos
train_dataset, test_dataset

(Dataset({
     features: ['input_ids', 'attention_mask', 'pixel_values_videos'],
     num_rows: 2851
 }),
 Dataset({
     features: ['input_ids', 'attention_mask', 'pixel_values_videos'],
     num_rows: 713
 }))

## Create Collator for Training

Now we can create our collator, which we'll pass to the trainer for dynamic padding of the inputs. The collator below is basically the same as `DataCollatorWithPadding` from transformers, with the difference that it also creates the labels and adds `pixel_values_videos` to the batch.

Labels are created for the model by simply copying the inputs to the LLM (input_ids), but with padding tokens replaced by the ignore index of the loss function. This ensures that the model doesn't need to learn to predict padding tokens (used to batch examples together).

Why are the labels a copy of the model inputs, you may ask? The model internally shifts the labels by one position, so that the output at each position is trained to predict the next token.

In [13]:
class LlavaNextVideoDataCollatorWithPadding:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, features):
        padded_inputs = self.processor.tokenizer.pad(
            {
                "input_ids": [feat['input_ids'][0] for feat in features], # each element is one batch only so we slice [0]
                "attention_mask": [feat['attention_mask'][0] for feat in features],
            },
            padding=True,
            return_tensors="pt",
        )

        labels = padded_inputs["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        padded_inputs["labels"] = labels
        padded_inputs["pixel_values_videos"] = torch.cat([feat['pixel_values_videos'] for feat in features], dim=0)

        return padded_inputs
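
To make the padding mask and the next-token objective concrete, here is a toy sketch in NumPy. The token ids and `PAD_ID` are made up; the collator above applies the same masking to torch tensors, and the shift at the end mirrors what happens inside the model's loss computation.

```python
import numpy as np

PAD_ID = 0           # made-up pad token id for this toy example
IGNORE_INDEX = -100  # index ignored by the cross-entropy loss

# Two right-padded sequences, as the tokenizer would return them
input_ids = np.array([[5, 6, 7, 8],
                      [9, 10, PAD_ID, PAD_ID]])

# Labels are a copy of the inputs with padding masked out
labels = input_ids.copy()
labels[labels == PAD_ID] = IGNORE_INDEX

# Next-token prediction: the output at position t is scored against the token at t + 1
context = input_ids[:, :-1]
targets = labels[:, 1:]
print(targets.tolist())  # [[6, 7, 8], [10, -100, -100]]
```

Positions whose target is `-100` contribute nothing to the loss, so the model is never trained to predict padding.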

#### Let's see one of the video clips we sampled


In [None]:
example = train_dataset[0]

In [None]:
from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# convert to image from processed tensors
clip = example["pixel_values_videos"][0] * 255
clip = clip.permute(0, 2, 3, 1).clamp(0, 255)

# np array with shape (frames, height, width, channels)
video = np.array(clip).astype(np.uint8)

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())
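
Multiplying the processed tensor by 255 gives a quick preview, but the colors can look off if the processor normalized the frames with a per-channel mean and standard deviation. A minimal sketch of undoing that step, assuming the normalization was `(x / 255 - mean) / std`; `denormalize_clip` is a hypothetical helper, and the mean/std values below are toy numbers (the real ones would come from the processor's configuration).

```python
import numpy as np

def denormalize_clip(clip, mean, std):
    """Undo (x / 255 - mean) / std for a clip of shape (frames, channels, height, width).

    Returns uint8 frames of shape (frames, height, width, channels).
    """
    mean = np.asarray(mean, dtype=np.float32).reshape(1, -1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(1, -1, 1, 1)
    pixels = (clip * std + mean) * 255.0
    frames = np.clip(np.rint(pixels), 0, 255).astype(np.uint8)
    return frames.transpose(0, 2, 3, 1)

# Round-trip check on random frames with toy normalization statistics
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)
toy_mean, toy_std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]

normalized = ((original.astype(np.float32) / 255.0 - toy_mean) / toy_std).transpose(0, 3, 1, 2)
restored = denormalize_clip(normalized, toy_mean, toy_std)
print(np.array_equal(restored, original))  # True
```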

In [None]:
# and the caption associated with the video clip
processor.batch_decode(example["input_ids"])

['<s> USER:  <video> \nProvide a detailed caption for this video. ASSISTANT: The video captures a sequence of moments within an indoor space designed for art activities, featuring a child engaged in painting or drawing at an easel. Wearing an orange knitted jumper adorned with decorative elements, the child seems deeply involved in their artistic process from the beginning to the end of the captured frames. An adult is present throughout, standing closely behind the child, offering guidance or supervision but without direct intervention, as indicated by the proximity of the adult’s hand to the child’s. The setting includes abundant art supplies and evidence of creative work, such as paint splatters and various artworks, which, along with the consistent lighting and ambiance, affirms the space’s purpose for art and painting activities.\n\nThroughout the video, there is a clear focus on the child’s interaction with the paper on the easel, indicating a progression in the painting activity

## Streaming Dataset

In [14]:
dataset = load_dataset("ShareGPT4Video/ShareGPT4Video")

# processor = LlavaNextVideoProcessor.from_pretrained(MODEL_ID)

# And we also need to load the processor for collate_fn
processor = AutoProcessor.from_pretrained(MODEL_ID, use_fast=False)
processor.tokenizer.padding_side = "right" # during training, one always uses padding on the right


In [15]:
from decord import VideoReader, gpu, cpu

def read_video_decord(video_path, num_frames=NUM_FRAMES):
    '''
    Decode the video with Decord decoder.

    Args:
        video_path (str): Path to the video file.
        num_frames (int): Number of frames to sample uniformly. Defaults to NUM_FRAMES

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    vr = VideoReader(uri=video_path, ctx=cpu(0)) # you need to install from source to use gpu ctx
    indices = np.arange(0, len(vr), len(vr) / num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()
    return frames

In [16]:
class StreamerDataset(Dataset):
    """
    This dataset handles the loading of video data and corresponding captions for streaming purposes.
    It retrieves video data from a zip file stored remotely in HuggingFace Hub and processes it into tensors
    suitable for model input.
    """
    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, idx):
        curr_element = self.dataset[idx]
        zip_file_shard = curr_element['zip_folder']
        video_path = curr_element['video_path']
        file_name = curr_element["video_path"].split("/")[-1]
        zip_folder = curr_element["video_path"].split("/")[0]
        with fsspec.open(
            f"zip://{file_name}::hf://datasets/ShareGPT4Video/ShareGPT4Video/zip_folder/{zip_folder}/{zip_file_shard}"
        ) as f:
            video_clip = read_video_decord(f)

        # take caption for the whole video and not per-scene
        video_caption = [caption['content'] for caption in curr_element["captions"] if caption['idx'] == '-1'][0]
        
        # print(video_caption)
        # Let's use chat template to format the prompt correctly
        conversation = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Provide a detailed caption for this video."},
                        {"type": "video"},
                        ],
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": video_caption},
                            ],
                },
            ]

        prompt = processor.apply_chat_template(conversation, add_generation_prompt=False)
        
        inputs = processor(
            text=prompt,
            videos=video_clip,
            truncation=True,
            max_length=MAX_LENGTH,
            return_tensors="pt"
        )
        #processor(video_caption, videos=video_clip, return_tensors="pt")
        return inputs

    def __len__(self):
        return len(self.dataset)
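
The URL passed to `fsspec.open` above uses fsspec's chaining syntax: the part before `::` names a file inside the archive, and the part after it says where the archive itself lives. The hypothetical helper below only assembles that string, so the layout is easier to see; the video path in the example is made up.

```python
def build_zip_url(video_path, zip_file_shard, repo_id="ShareGPT4Video/ShareGPT4Video"):
    """Compose a chained fsspec URL pointing at one video inside a remote zip shard."""
    file_name = video_path.split("/")[-1]   # file inside the archive
    zip_folder = video_path.split("/")[0]   # sub-dataset folder, e.g. "pexels"
    return f"zip://{file_name}::hf://datasets/{repo_id}/zip_folder/{zip_folder}/{zip_file_shard}"

url = build_zip_url("pexels/abc.mp4", "pexels_videos_1.zip")
print(url)
# zip://abc.mp4::hf://datasets/ShareGPT4Video/ShareGPT4Video/zip_folder/pexels/pexels_videos_1.zip
```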

In [17]:
streamer = StreamerDataset(dataset["train"])
streamer[0]

{'input_ids': tensor([[    1,  3148,  1001, 29901, 29871, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 3

In [None]:
# def collate_fn(examples):
#     """
#     Collate function that dynamically pads inputs to the max length within the batch and creates labels for the language modeling task.
#     In contrast to DataCollatorForLanguageModeling, this collator adds pixel values to the batch for multimodal LLMs.
#     """
#     padded_inputs = processor.tokenizer.pad(
#         {
#             "input_ids": [feat['input_ids'][0] for feat in examples], # each element is one batch only so we slice [0]
#             "attention_mask": [feat['attention_mask'][0] for feat in examples],
#         },
#         padding=True,
#         return_tensors="pt",
#     )

#     labels = padded_inputs["input_ids"].clone()
#     labels[labels == processor.tokenizer.pad_token_id] = -100 # feel free to mask labels the way that suits your use-case (e.g. mask all user-turns)
#     padded_inputs["labels"] = labels
#     padded_inputs["pixel_values_videos"] = torch.cat([feat['pixel_values_videos'] for feat in examples], dim=0)

#     return padded_inputs


In [None]:
streamer = StreamerDataset(dataset["train"])
# loader = DataLoader(streamer, batch_size=8, collate_fn=collate_fn)

In [21]:
batch_items = streamer[0]

In [22]:
batch_items.keys()

dict_keys(['input_ids', 'attention_mask', 'pixel_values_videos'])

In [23]:
example = batch_items

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# convert to image from processed tensors
clip = example["pixel_values_videos"][0] * 255
clip = clip.permute(0, 2, 3, 1).clamp(0, 255)

# np array with shape (frames, height, width, channels)
video = np.array(clip).astype(np.uint8)

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())


In [47]:
# and the caption associated with the video clip
processor.batch_decode(example["input_ids"])

['<s> USER: <video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><video><

## Load model
Next, we're going to load the LLaVa-NeXT-Video model from the hub. This is a model with about 7 billion trainable parameters (it combines a LLaMa-7B language model with a relatively low-parameter vision encoder). Note that the checkpoint we load has already undergone supervised fine-tuning (SFT) on a VideoChat instruction dataset, so we can benefit from that fine-tuning.

## Full fine-tuning, LoRa and Q-LoRa
As this model has 7 billion trainable parameters, that's going to have quite an impact on the amount of memory used. For reference, fine-tuning a model using the AdamW optimizer (which is often used to optimize neural networks) with mixed precision requires about 18 bytes of GPU RAM per parameter. So in this case, we would need 18 x 7 billion bytes = 126 GB of GPU RAM if we want to update all the parameters of the model! That's huge, and infeasible for most people.
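
The arithmetic above, written out as a small sketch. The 18-bytes-per-parameter figure is the rule of thumb quoted in the text, not a measured value, and `full_finetune_memory_gb` is a name made up for this example.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=18):
    """Rough GPU memory estimate (in GB) for full fine-tuning with AdamW and mixed precision."""
    return num_params * bytes_per_param / 1e9

print(full_finetune_memory_gb(7e9))  # 126.0
```

The estimate ignores activations, which grow with batch size and sequence length, so the real requirement for video inputs is higher still.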

Luckily, some clever people came up with the LoRa method (LoRa is short for low-rank adaptation). It allows us to freeze the existing weights and only train a couple of adapter layers on top of the base model. Hugging Face offers the separate [PEFT library](https://huggingface.co/docs/peft/main/en/index) for easy use of LoRa, along with other Parameter-Efficient Fine-Tuning methods (that's where the name PEFT comes from).
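
To give a feel for why the adapters are cheap, here is a NumPy sketch of the low-rank idea: a frozen weight `W` of shape `(d, k)` is augmented with a trainable product `B @ A` of rank `r`. The shapes, the rank, and the scaling factor are illustrative assumptions, and this is not how PEFT is implemented internally.

```python
import numpy as np

def lora_param_counts(d, k, r):
    """Return (parameters in the full weight, parameters in the rank-r adapter)."""
    return d * k, r * (d + k)

full, adapter = lora_param_counts(4096, 4096, 8)
print(full, adapter, adapter / full)  # 16777216 65536 0.00390625

# Tiny forward pass: output of the frozen layer plus the low-rank update
rng = np.random.default_rng(0)
d, k, r, scale = 6, 5, 2, 1.0
W = rng.standard_normal((d, k))         # frozen base weight
A = rng.standard_normal((r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init so training starts from the base model
x = rng.standard_normal(k)

base_out = W @ x
lora_out = W @ x + scale * (B @ (A @ x))
print(np.allclose(base_out, lora_out))  # True, because B starts at zero
```

With rank 8 on a 4096 x 4096 matrix, the adapter holds well under 1% of the parameters of the weight it modifies.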

Moreover, one can not only freeze the existing base model but also quantize it (which means shrinking down its size). A neural network's parameters are typically saved in either float32 (32 bits, or 4 bytes, per parameter value) or float16 (16 bits, or 2 bytes, also called half precision). However, with some clever algorithms one can shrink each parameter to just 8 or 4 bits (half a byte!) without a significant effect on final performance. Read all about it [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
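
As a quick back-of-the-envelope check of what those bit widths mean for a 7-billion-parameter model, the sketch below computes the storage needed for the weights alone. It ignores the small overhead of quantization constants as well as activations and optimizer state.

```python
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "4-bit": 4}

def weight_size_gb(num_params, bits):
    """Storage needed for the weights alone, in GB."""
    return num_params * bits / 8 / 1e9

for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_size_gb(7e9, bits):.1f} GB")
# float32: 28.0 GB
# float16: 14.0 GB
# int8: 7.0 GB
# 4-bit: 3.5 GB
```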

This means that we're going to shrink the size of the base LLaVa-NeXT-Video-7B model considerably using 4-bit quantization, and then only train a couple of adapter layers on top using LoRa (in float16). This idea of combining LoRa with quantization is called Q-LoRa and is the most memory-friendly version.

Of course, if you have the memory available, feel free to use full fine-tuning or LoRa without quantization! In case of full fine-tuning, the code snippet below instantiates the model with Flash Attention which considerably speeds up computations.

There exist many forms of quantization; here we leverage the BitsAndBytes integration.

In [13]:
## Load model
# Three options for training, from the lowest to the highest precision:
# QLoRA: the base model is loaded with 4-bit quantization, which reduces memory usage while largely maintaining performance.
# Standard LoRA: the base model is loaded in float16 and only the LoRA adapters are trained.
# Full fine-tuning: no memory optimizations are applied. In that case Flash Attention is used to speed up training, if the hardware supports it.

if USE_QLORA or USE_LORA:
    # bnb_config stays None for plain LoRA, so the model is loaded without quantization
    bnb_config = None
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    model = LlavaNextVideoForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        quantization_config=bnb_config,
        device_map="auto",
    )
else:
    # for full fine-tuning, we can speed up the model using Flash Attention
    # only available on certain devices, see https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features
    model = LlavaNextVideoForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )

## Apply PEFT
After loading the base model, we're going to add LoRa adapter layers. We're going to only train these adapter layers (the base model is kept frozen).

What differs from other models is the set of layers to which we add adapters (in PEFT this is called target_modules). This typically depends on the model architecture.

We define a function that collects the names of all linear layers (nn.Linear) in the model, skipping any module whose path contains multi_modal_projector or vision_model. These names are then passed as target_modules, with the aim of mostly adapting the language model part of LLaVa-NeXT-Video for our use case.


In [14]:
def find_all_linear_names(model):
    cls = torch.nn.Linear
    lora_module_names = set()
    multimodal_keywords = ['multi_modal_projector', 'vision_model']
    for name, module in model.named_modules():
        if any(mm_keyword in name for mm_keyword in multimodal_keywords):
            continue
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names: # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)


lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=find_all_linear_names(model),
    init_lora_weights="gaussian",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
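
One thing to watch for: the model printout that follows shows lora.Linear4bit wrappers inside the CLIP vision tower, even though find_all_linear_names skips vision modules when collecting names. A likely explanation is that only the last path component (for example k_proj) is kept, and that a list of target names is matched against the end of every module path. The pure-Python sketch below illustrates this under that assumption; the module paths are shortened, hypothetical examples and both helpers are stand-ins, not PEFT functions.

```python
def collect_target_names(module_names, skip_keywords):
    """Mimic find_all_linear_names on a plain list of dotted module paths."""
    targets = set()
    for name in module_names:
        if any(keyword in name for keyword in skip_keywords):
            continue  # skipped while collecting names
        targets.add(name.split(".")[-1])  # keep only the last path component
    return targets


def matches_by_suffix(module_name, targets):
    # Assumption: each target name is compared with the end of the module path
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


# Hypothetical, shortened module paths for illustration
modules = [
    "language_model.model.layers.0.self_attn.k_proj",
    "language_model.model.layers.0.mlp.gate_proj",
    "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj",
    "multi_modal_projector.linear_1",
]

targets = collect_target_names(modules, ["multi_modal_projector", "vision_model"])
adapted = [m for m in modules if matches_by_suffix(m, targets)]

print(sorted(targets))
print(adapted)
```

In this sketch the vision tower's k_proj is adapted too, because it shares its short name with a language model layer. If only the language model should be adapted, one option worth checking in the PEFT documentation is to pass target_modules as a single regex string that anchors on the language model prefix.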

In [15]:
model

PeftModel(
  (base_model): LoraModel(
    (model): LlavaNextVideoForConditionalGeneration(
      (vision_tower): CLIPVisionModel(
        (vision_model): CLIPVisionTransformer(
          (embeddings): CLIPVisionEmbeddings(
            (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
            (position_embedding): Embedding(577, 1024)
          )
          (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder): CLIPEncoder(
            (layers): ModuleList(
              (0-23): 24 x CLIPEncoderLayer(
                (self_attn): CLIPSdpaAttention(
                  (k_proj): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1024,

## Define HF Trainer

To streamline training and evaluation of the LLaVa-NeXT-Video model, we use the Hugging Face Trainer class, which abstracts away much of the boilerplate code and provides a structured framework for model training. In this section, we define the TrainingArguments and instantiate the Trainer class, which encapsulates the model, training loop, validation loop, and optimizer configuration.

In [16]:
args = TrainingArguments(

    # args related to training
    output_dir = OUTPUT_DIR,
    eval_strategy = 'steps',
    eval_steps=20,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    gradient_accumulation_steps = 8,
    learning_rate = 2e-05,
    max_steps = 100, # adjust this depending on your dataset size
    lr_scheduler_type = 'cosine',
    warmup_ratio = 0.1,

    # args related to eval/save
    logging_steps = 20,
    save_strategy = 'steps',
    save_steps=20,
    save_total_limit = 1,
    fp16 = True, # we have the model train and eval with fp16 precision
    fp16_full_eval = True,
    optim = 'adamw_bnb_8bit', # adam in lower-bits to save memory, consider changing to 'adamw_torch' if model is not converging
    report_to = "wandb", # install wandb to use this
    hub_model_id = REPO_ID,
    push_to_hub = True, # we'll push the model to the hub every time a checkpoint is saved

    # a model that was wrapped for QLoRA training with peft will not have arguments listed in its signature
    # so we need to pass label names explicitly to calculate val loss
    label_names=["labels"],
    dataloader_num_workers=4, # let's get more workers since iterating on video datasets might be slower in general
)
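
To make the schedule settings concrete, the sketch below computes the learning rate at a given step for max_steps=100, warmup_ratio=0.1 and a cosine scheduler. It assumes the usual shape of this schedule (linear warmup from zero, then a half-cosine decay to zero) and is a standalone approximation, not the Trainer's own code. The helper names and the batch size of 1 in the usage lines are illustrative assumptions.

```python
import math


def lr_at_step(step, base_lr=2e-5, max_steps=100, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    # Number of samples that contribute to each optimizer update
    return per_device_batch * grad_accum_steps * n_devices


for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at_step(step):.2e}")

# With a per-device batch size of 1 (an assumption) and 8 accumulation steps
print(effective_batch_size(per_device_batch=1, grad_accum_steps=8))
```

With these settings the learning rate peaks at step 10 and then decays, and each optimizer update averages gradients over 8 times the per-device batch.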

In [None]:
# trainer = Trainer(
#     model = model,
#     tokenizer = processor,
#     data_collator = LlavaNextVideoDataCollatorWithPadding(processor=processor),
#     train_dataset = train_dataset,
#     eval_dataset = test_dataset,
#     args=args,
# )

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
CODECARBON : No CPU tracking mode found. Falling back on CPU constant mode.
CODECARBON : Failed to match CPU TDP constant. Falling back on a global constant.


In [22]:
args = TrainingArguments(

    # args related to training
    output_dir = OUTPUT_DIR,
    eval_strategy = 'no',
    eval_steps=20,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    gradient_accumulation_steps = 8,
    learning_rate = 2e-05,
    max_steps = 100, # adjust this depending on your dataset size
    lr_scheduler_type = 'cosine',
    warmup_ratio = 0.1,

    # args related to eval/save
    logging_steps = 20,
    save_strategy = 'steps',
    save_steps=20,
    save_total_limit = 1,
    fp16 = True, # we have the model train and eval with fp16 precision
    fp16_full_eval = True,
    optim = 'adamw_bnb_8bit', # adam in lower-bits to save memory, consider changing to 'adamw_torch' if model is not converging
    report_to = "wandb", # install wandb to use this
    hub_model_id = REPO_ID,
    push_to_hub = True, # we'll push the model to the hub every time a checkpoint is saved

    # a model that was wrapped for QLoRA training with peft will not have arguments listed in its signature
    # so we need to pass label names explicitly to calculate val loss
    label_names=["labels"],
    dataloader_num_workers=4, # let's get more workers since iterating on video datasets might be slower in general
)


trainer = Trainer(
    model = model,
    tokenizer = processor,
    # data_collator = collate_fn,
    data_collator = LlavaNextVideoDataCollatorWithPadding(processor=processor),
    train_dataset=streamer,
    # train_dataset = train_dataset,
    # eval_dataset = test_dataset,
    args=args,
)


In [23]:
trainer.train()


ValueError: Video features and video tokens do not match: tokens: 0, features 4608
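
The error above reports that the text input contained 0 video placeholder tokens while the vision tower produced 4608 video features. One way to catch this before starting a training run is to count the placeholder tokens in a collated batch. The sketch below works on plain lists of token ids; the function name is hypothetical and the token id 32000 is only a stand-in, since the real id would have to be looked up from the processor's tokenizer.

```python
def count_video_tokens(input_ids, video_token_id):
    """Return how many video placeholder tokens each sequence in a batch contains."""
    return [sum(1 for tok in seq if tok == video_token_id) for seq in input_ids]


# Hypothetical values for illustration only
VIDEO_TOKEN_ID = 32000
batch_input_ids = [
    [1, 3148, 32000, 32000, 29871],  # contains two placeholders
    [1, 3148, 29871],                # contains none, would trigger the mismatch
]

counts = count_video_tokens(batch_input_ids, VIDEO_TOKEN_ID)
missing = [i for i, c in enumerate(counts) if c == 0]
print(counts, missing)
```

Any index listed in missing points to a sample whose prompt lost its video placeholder, for example through truncation or a prompt built without the video entry.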

In [None]:
trainer.model.push_to_hub(REPO_ID) # let's push to hub the last ckpt

CommitInfo(commit_url='https://huggingface.co/RaushanTurganbay/LLaVa-NeXT-Video-demo/commit/b81b02b023fd4ba5f2392da889205b0d50e99c41', commit_message='Upload model', commit_description='', oid='b81b02b023fd4ba5f2392da889205b0d50e99c41', pr_url=None, pr_revision=None, pr_num=None)

## Inference with tuned model

In [None]:
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

You are using a model of type llava_next to instantiate a model of type llava_next_video. This is not supported for all configurations of models and can yield errors.

In [None]:
example = test_dataset[0]

# convert the processed tensors back to an image
clip = example["pixel_values_videos"][0] * 255
clip = clip.permute(0, 2, 3, 1).clamp(0, 255)

# np array with shape (frames, height, width, channels)
video = np.array(clip).astype(np.uint8)

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())

In [None]:
processor.batch_decode(example["input_ids"])

["<s> USER:  <video> \nProvide a detailed caption for this video. ASSISTANT: The video captures a serene and contemplative sequence featuring an individual seated on a wicker chair beside a window, within an indoor setting highlighted by a white wall and decorative framed pictures. Dressed casually in a red top and rolled-up blue jeans, revealing bare feet, the person begins with a relaxed posture, hugging a white cushion to their chest, their chin resting on it while bathed in the soft, natural light of daytime, setting a mood of introspection.\n\nAs the video progresses, the individual shifts slightly in their chair., displaying a subtle change in emotional state or focus; their head tilts forward, eyes cast downward, and the grip on the cushion tightens a bit, which might suggest a deepening of their reflection or a shift in their feelings.\n\nFurther into the video, the person's facial expression changes as they close their eyes, possibly indicating a moment of deeper reflection or

In [None]:
def run_inference(video_clip, model):
    # Let's use chat template to format the prompt correctly, this time without the caption
    conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Provide a detailed caption for this video."},
                    {"type": "video"},
                    ],
            },
        ]

    # Set add_generation_prompt to add the "ASSISTANT: " at the end
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    batch = processor(
        text=prompt,
        videos=None, # we have a processed video, passing it again to processor causes errors
        return_tensors="pt"
    ).to(model.device)
    video_clip = video_clip.to(model.device)

    out = model.generate(**batch, pixel_values_videos=video_clip, max_length=MAX_LENGTH, do_sample=True)
    generated_text = processor.batch_decode(out, skip_special_tokens=True)
    return generated_text
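
Note that the decoded output contains the prompt as well as the generated caption, as the outputs further down show. If only the caption is needed, a small helper can cut everything up to the last "ASSISTANT:" marker. This is a minimal sketch that assumes the prompt format shown in this notebook; extract_response is a hypothetical name.

```python
def extract_response(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, answer = generated_text.rpartition(marker)
    return answer.strip() if found else generated_text.strip()


sample = "USER:  <video> \nProvide a detailed caption for this video. ASSISTANT: A person sits by a window."
print(extract_response(sample))
```

It can be applied to each string returned by run_inference, for example `[extract_response(t) for t in run_inference(example["pixel_values_videos"], model)]`.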

In [None]:
run_inference(example["pixel_values_videos"], model)

['USER:  <video> \nProvide a detailed caption for this video. ASSISTANT: This is a video capturing a serene and introspective moment of an individual sitting next to a woven basket, facing slightly to the left of the frame, suggesting a focused gaze towards the distance. The person is elegantly dressed in a red blouse and faded green jeans, evoking a simple and understated aesthetic. A white wool material is wrapped around the legs of the chair, potentially indicating an interest in outdoor or vintage textures. The background is muted, providing a calm and unobtrusive setting that does not distract from the individual. Throughout the video, there is a gradual transition of light, with the shadows becoming less pronounced and the lighting more even and warm, suggesting either a change in time of day, a movement in the light source, or an indoor setting transitioning to an outdoor one. The person shifts slightly into a more natural pose, with their right foot raised, indicating a moment 

#### For the sake of comparison, let's load the old model and compare the generations

We can see that the tuned model has started to pick up the style of the ShareGPTVideo dataset, whose captions are longer and more detailed. The tuned model describes each scene more fully and pays attention to changes that happen as the video evolves (e.g. "there is a gradual transition of light").

In [None]:
old_model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

You are using a model of type llava_next to instantiate a model of type llava_next_video. This is not supported for all configurations of models and can yield errors.


In [None]:
run_inference(example["pixel_values_videos"], old_model)

["USER:  <video> \nProvide a detailed caption for this video. ASSISTANT: In this cozy scene, a young woman finds solace while sitting comfortably in a wicker chair. She's snuggled up in casual clothing, her legs bent and her foot resting on her other leg. Her gaze is directed off-camera, suggesting she may be lost in thought or perhaps engrossed in a book she's holding in her hands. The room is softly lit, casting a gentle glow over the scene and enhancing the overall sense of relaxation."]