## Prerequisites
Before we start, make sure you have the following:

- Access to a GPU (preferably an A100, since videos produce long token sequences).
- Familiarity with Hugging Face’s Transformers library.
- The necessary packages, installed by running the cell below.

In [None]:
!pip install -U -q transformers accelerate bitsandbytes peft datasets
!pip install -q av
!pip install -q lightning
!pip install pyarrow==15.0.0
!pip install wandb

# restart notebooks here

## Fine-tune InternVL2-1B on the MVBench dataset

In this notebook, you will fine-tune the [InternVL](https://huggingface.co/OpenGVLab/InternVL2-1B) model on the [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench) dataset, which comprises various video-related tasks. Note that MVBench is quite small and was not designed for fine-tuning, so you first need to split it into training and testing parts.

The goal for the model in this notebook is to answer multiple-choice questions about a video. The questions can be related to temporal aspects of the video, pose prediction and so on.

Sources:

* InternVL [documentation](https://internvl.readthedocs.io/en/latest/internvl2.0/introduction.html)
* InternVL [checkpoint on the hub](https://huggingface.co/OpenGVLab/InternVL2-1B)

## Define variables

We'll first set some variables that are useful throughout this notebook and do all the necessary imports.

In [None]:
import os
import av
import re
import bisect
import shutil
import numpy as np
from nltk import edit_distance

from transformers import AutoProcessor
from transformers import BitsAndBytesConfig, VideoLlavaForConditionalGeneration
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from huggingface_hub import snapshot_download, hf_hub_download
from datasets import load_dataset, concatenate_datasets

import lightning as L
from lightning.pytorch.callbacks import Callback, EarlyStopping


MAX_LENGTH = 160
MODEL_ID = "OpenGVLab/InternVL2-1B"
REPO_ID = "your-hf-login/VideoLLava-demo" # Change to your hf-hub repo

os.environ["WANDB_API_KEY"] = "your-key" # Change to your W&B profile if you need it
os.environ["WANDB_MODE"] = "online"

from huggingface_hub import login
access_token = "your-hf-token" # Change to your HF access token
login(access_token)

USE_LORA = False
USE_QLORA = True

# MVBench benchmark

[MVBench on HF Datasets](https://huggingface.co/datasets/OpenGVLab/MVBench)

![MVbench1.png](https://huggingface.co/datasets/OpenGVLab/MVBench/resolve/main/assert/generation.png)

It consists of 20 temporal tasks; an example of each is shown below.

![MVbench-structure.png](https://huggingface.co/datasets/OpenGVLab/MVBench/resolve/main/assert/task_example.png)


Here we have a nice viewer for each task:

[Dataset viewer](https://huggingface.co/datasets/OpenGVLab/MVBench/viewer/action_sequence)



We will start by downloading and processing the dataset. Even though MVBench is a small dataset, it still requires **around 1000B to store the videos**, so make sure you have enough free space.

First, we will use the mapping below to locate the datasets, because each task is a separate subset in its own folder. Then we need a few helper functions to download the videos and process them to fit the model's format (8 frames per video).
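To make the "8 frames per video" sampling concrete, here is a minimal sketch of how evenly spaced frame indices can be picked inside a start-end interval. The helper name `sample_frame_indices` and the synthetic timestamps are assumptions for illustration; unlike the video reader used in this notebook, it works on float timestamps in seconds and does not decode any video.

```python
import bisect


def sample_frame_indices(timestamps, start, end, n_frames=8):
    """Pick n_frames evenly spaced frame indices inside [start, end] (seconds).

    `timestamps` must be sorted and n_frames must be at least 2.
    """
    # Locate the first and last frame of the requested interval.
    first = bisect.bisect_left(timestamps, start)
    last = min(bisect.bisect_left(timestamps, end), len(timestamps) - 1)
    # Spread n_frames indices uniformly between first and last (inclusive).
    step = (last - first) / (n_frames - 1)
    return [int(first + i * step) for i in range(n_frames)]


# A hypothetical 10-second clip decoded at 10 frames per second.
timestamps = [i / 10 for i in range(100)]
indices = sample_frame_indices(timestamps, start=2.0, end=9.0)
print(indices)
```

Increasing `n_frames` gives the model more temporal detail at the cost of a longer input sequence and more GPU memory.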

In [None]:
data_list = {
    "Action Sequence": ("action_sequence.json", "star/Charades_v1_480/", "video", True), # has start & end
    "Action Prediction": ("action_prediction.json", "star/Charades_v1_480/", "video", True), # has start & end
    "Action Antonym": ("action_antonym.json", "ssv2_video/", "video", False),
    "Fine-grained Action": ("fine_grained_action.json", "Moments_in_Time_Raw/videos/", "video", False),
    "Unexpected Action": ("unexpected_action.json", "FunQA_test/test/", "video", False),
    "Object Existence": ("object_existence.json", "clevrer/video_validation/", "video", False),
    "Object Interaction": ("object_interaction.json", "star/Charades_v1_480/", "video", True), # has start & end
    "Object Shuffle": ("object_shuffle.json", "perception/videos/", "video", False),
    "Moving Direction": ("moving_direction.json", "clevrer/video_validation/", "video", False),
    "Action Localization": ("action_localization.json", "sta/sta_video/", "video", True),  # has start & end
    "Scene Transition": ("scene_transition.json", "scene_qa/video/", "video", False),
    "Action Count": ("action_count.json", "perception/videos/", "video", False),
    "Moving Count": ("moving_count.json", "clevrer/video_validation/", "video", False),
    "Moving Attribute": ("moving_attribute.json", "clevrer/video_validation/", "video", False),
    "State Change": ("state_change.json", "perception/videos/", "video", False),
    "Fine-grained Pose": ("fine_grained_pose.json", "nturgbd/", "video", False),
    "Character Order": ("character_order.json", "perception/videos/", "video", False),
    "Egocentric Navigation": ("egocentric_navigation.json", "vlnqa/", "video", False),
    "Episodic Reasoning": ("episodic_reasoning.json", "tvqa/frames_fps3_hq/", "frame", True),  # has start & end, read frame
    "Counterfactual Inference": ("counterfactual_inference.json", "clevrer/video_validation/", "video", False),
}

data_dir = "dataset"
os.makedirs(data_dir, exist_ok=True)

In [None]:
def read_video_pyav(video_path, start, end, n_frames=8):
    """
    Reads a video for the given start-end timestamp interval
    and uniformly samples n_frames frames from it.
    """
    container = av.open(video_path)
    video = container.streams.get(0)[0]

    av_timestamps = [
        int(packet.pts * video.time_base) for packet in container.demux(video) if packet.pts is not None
    ]

    av_timestamps.sort()
    start_id = bisect.bisect_left(av_timestamps, start)
    end_id = bisect.bisect_left(av_timestamps, end)

    # in case it is a very short video, lets take a longer duration and sample
    if end_id  - start_id < 10:
        end_id += 10
        start_id -= 10

    end_id = min(len(av_timestamps) - 1, end_id)
    start_id = max(1, start_id)

    # We sample n_frames frames for tuning following the original paper
    # But we can increase the number of frames for longer videos and check out if it helps performance
    # Change the below "n_frames" to any number of frames you want, and note that more frames -> more computational resources needed
    indices = np.linspace(start_id, end_id, n_frames).astype(int)

    frames = []
    container.seek(0)
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_id:
            break
        if i >= start_id and i in indices:
            frames.append(frame)
    assert len(frames) == n_frames, f"Got {len(frames)} frames but should be {n_frames}. Check the indices: {indices};, start_id: {start_id}, end_id: {end_id}. Len of video is {len(av_timestamps)} frames."
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [None]:
def collate_read_video(example, path):
    # Some datasets have a start-end interval, so we try to get it if exists.
    # Otherwise just set a very large end timestamp
    clip = read_video_pyav(f'{path}/{example["video"]}', example.get("start", 1), example.get("end", 1e+10))
    example["clip"] = clip
    return example

In [None]:
# Download the videos from datasets repo and unzip.
# Make sure you have enough free space before downloading and unzipping

# videos = snapshot_download(repo_id="OpenGVLab/MVBench", allow_patterns="*", repo_type="dataset")
# for zip_file in os.listdir(f"{videos}/video"):
#     if zip_file.endswith(".zip"):
#         shutil.unpack_archive(f"{videos}/video/{zip_file}", f"{videos}/videos_unzipped/")

Create the following directory structure:

```
dataset/
    /json
        task_name.json
    /video
        /task_name_prefix (optional)
            /task_name
                video_0.mp4
                video_1.mp4
                video_2.mp4
                ...
                video_100.mp4
```
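Since some annotation entries may point to videos that are not present on disk, it can help to separate usable examples from missing ones before decoding anything. The sketch below assumes each example is a dict with a `"video"` key holding a path relative to the task's video folder; the helper name `split_by_availability` is hypothetical.

```python
import os
import tempfile


def split_by_availability(examples, video_root):
    """Separate examples whose video exists on disk from those whose video is missing."""
    present, missing = [], []
    for example in examples:
        path = os.path.join(video_root, example["video"])
        (present if os.path.exists(path) else missing).append(example)
    return present, missing


# Tiny demonstration with a temporary folder that contains only one of two videos.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "video_0.mp4"), "wb").close()
    examples = [{"video": "video_0.mp4"}, {"video": "video_1.mp4"}]
    present, missing = split_by_availability(examples, root)

print(len(present), "usable,", len(missing), "missing")
```

Because `os.path.exists` also returns true for directories, the same check works for tasks stored as folders of frames.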





In [None]:
# YOUR CODE HERE

In [None]:
# Some tasks in MVBench are missing video files - keep it in mind!
# YOUR CODE HERE

In [None]:
# Load videos and split them into frames
# YOUR CODE HERE

In [None]:
# Load model's processor
# YOUR CODE HERE

## Custom Dataset Class

In the next step, you need **to define a custom dataset** class and the functions that prepare the data for fine-tuning the model. The VideoQADataset class extends the [PyTorch Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) class to load and process MVBench. It converts dataset samples into the format required for training and evaluation by building a prompt and turning each video into an array of frames.

Next, you need **to define collate functions** to handle the batching of data during training and evaluation. These functions ensure that the input data is properly formatted and padded.

Here, use the processor to turn each (video, target token sequence) pair into the format that the model expects (pixel_values, input_ids, etc.). Use dynamic padding: the ground-truth sequences in a batch have varying lengths, so pad each batch only up to its longest sequence.

You can also limit the length of the text tokens (input_ids) to a maximum length because of memory constraints. Feel free to raise the limit if your target token sequences are longer (I'd recommend plotting the distribution of token lengths in your dataset to determine a suitable value).
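One way to choose the maximum length is to look at the token-length statistics and pick a value that covers most examples. The sketch below is only an illustration: `suggest_max_length` is a hypothetical helper and the token counts are made up; in practice you would collect the lengths by tokenizing your own prompts and answers.

```python
import math


def suggest_max_length(token_lengths, coverage=0.95):
    """Return the smallest length that covers the given fraction of examples."""
    ordered = sorted(token_lengths)
    # Nearest-rank percentile: position of the example at the requested coverage.
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


lengths = [42, 57, 61, 64, 70, 73, 80, 88, 95, 150]  # made-up token counts
print("mean length:", sum(lengths) / len(lengths))
print("length covering 90% of examples:", suggest_max_length(lengths, 0.9))
```

A single very long example (150 tokens here) pulls the maximum far above what most examples need, which is why a percentile is often a more economical choice than the maximum.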

The formatting of the input_ids is super important: you need to respect a so-called [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating).

Labels are created for the model by simply copying the inputs to the LLM (input_ids), but with padding tokens replaced by the ignore index of the loss function. This ensures that the model doesn't need to learn to predict padding tokens (used to batch examples together).

Why are the labels a copy of the model inputs, you may ask? The model will internally shift the labels one position to the right so that the model will learn to predict the next token. This can be seen here.
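The label construction can be pictured with plain Python lists. This is a minimal sketch, not the collate function itself: the padding id `PAD_ID = 0` and the token ids are made up, while `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_ID = 0           # hypothetical padding token id


def make_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs and replace padding tokens by the ignore index."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in input_ids]


def next_token_pairs(row_inputs, row_labels):
    """Pair the input at position t with the target at position t + 1, skipping ignored targets."""
    return [
        (row_inputs[t], row_labels[t + 1])
        for t in range(len(row_labels) - 1)
        if row_labels[t + 1] != IGNORE_INDEX
    ]


batch = [[5, 6, 7, 8], [9, 10, 0, 0]]  # second row is padded
labels = make_labels(batch)
print(labels)
print(next_token_pairs(batch[1], labels[1]))
```

On real tensors the same masking is usually written as `labels = input_ids.clone()` followed by `labels[labels == pad_token_id] = -100`.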

The collate function for evaluation is different, since there you only need to feed the prompt to the model, as we'll use the `generate()` method to autoregressively generate a completion.

In [None]:
class VideoQADataset(Dataset):
    """
    PyTorch Dataset for VideoQADataset.
    This class takes a HuggingFace Dataset as input.
    """

    def __init__(
        self,
        dataset: str,
    ):
        super().__init__()
        # YOUR CODE HERE

    def __len__(self) -> int:
        # YOUR CODE HERE

    def __getitem__(self, idx: int):

        # YOUR CODE HERE

        return prompt, clip

In [None]:
def train_collate_fn(examples):

    # YOUR CODE HERE

    return input_ids, attention_mask, pixel_values_videos, labels


def eval_collate_fn(examples):

    # YOUR CODE HERE

    return input_ids, attention_mask, pixel_values_videos, answer_choice

## Shuffling and Splitting the Dataset
You need to shuffle the dataset and then split it into training and test sets. This ensures that the model is trained and evaluated on a diverse and representative sample of the data.
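The idea behind a reproducible shuffle-and-split can be sketched on plain indices. The helper name `shuffle_and_split`, the 20% test fraction and the seed are assumptions chosen for illustration.

```python
import random


def shuffle_and_split(n_examples, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split reproducible
    n_test = int(round(n_examples * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_and_split(100)
print(len(train_idx), len(test_idx))
```

With a Hugging Face `Dataset` object, the built-in `train_test_split(test_size=..., seed=...)` method performs the same operation in one call and returns a `DatasetDict` with `"train"` and `"test"` keys.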


In [None]:
# YOUR CODE HERE

In [None]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML


example = dataset['train'][0]
clip = example["clip"]
# np array with shape (frames, height, width, channels)
video = np.array(clip)


# Visualize your data
# YOUR CODE HERE

In [None]:
fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
                               interval=100)
HTML(anim.to_html5_video())

And now wrap it in the PyTorch Dataset class and print one example as a sanity check.

In [None]:
train_dataset = VideoQADataset(dataset["train"])
eval_dataset = VideoQADataset(dataset["test"])

In [None]:
prompt, clip = train_dataset[0]

# Model

## Load model
Next, load your InternVL model from the hub. This is a model with about 1 billion trainable parameters (it combines a **Qwen2 0.5B language model** with a relatively low-parameter **InternViT vision encoder**). Note that the checkpoint we load here has already undergone supervised fine-tuning (SFT) on an instruction dataset, so we can benefit from that earlier fine-tuning.

## Full fine-tuning, LoRa and Q-LoRa

**Select the fine-tuning method.**

For reference, fine-tuning a model using the AdamW optimizer (which is often used to optimize neural networks) with mixed precision requires about 18 bytes of GPU RAM per parameter. So in this case, we would need 18 x 1 billion bytes = 18 GB of GPU RAM if we want to update all the parameters of the model. Not so huge, right? But with a PEFT approach it can be even less.
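The 18 bytes per parameter can be broken down as follows. This is a rough sketch based on the commonly quoted accounting for mixed-precision AdamW training (for example in the Hugging Face performance guide); it ignores activations and temporary buffers, so real usage is higher. The function name is hypothetical.

```python
# Approximate bytes of GPU memory needed per trainable parameter.
BYTES_PER_PARAM = {
    "weights (fp32 master copy + fp16 working copy)": 6,
    "AdamW optimizer states (two fp32 moments)": 8,
    "gradients (fp32)": 4,
}


def full_finetune_memory_gb(n_params):
    """Estimate GPU memory in GB for full fine-tuning, excluding activations."""
    total_bytes = n_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9


print(full_finetune_memory_gb(1_000_000_000), "GB")
```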

Some clever people came up with the LoRa method (LoRa is short for low-rank adaptation). It lets you freeze the existing weights and train only a couple of adapter layers on top of the base model. Hugging Face offers the separate [PEFT library](https://huggingface.co/docs/peft/main/en/index) for easy use of LoRa, along with other Parameter-Efficient Fine-Tuning methods.

Moreover, one can not only freeze the existing base model but also quantize it (which means shrinking down its size). A neural network's parameters are typically saved in either float32 (which means 32 bits or 4 bytes are used to store each parameter value) or float16 (which means 16 bits or 2 bytes, also called half precision). However, with some clever algorithms one can shrink each parameter to just 8 or 4 bits (half a byte!), without significant effect on final performance. Read all about it here: https://huggingface.co/blog/4bit-transformers-bitsandbytes.

This means that we're going to shrink the size of the base 1B model considerably using 4-bit quantization, and then only train a couple of adapter layers on top using LoRa (in float16). This idea of combining LoRa with quantization is called Q-LoRa and is the most memory friendly version.
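To see how much the base model shrinks, the storage needed for the weights alone can be computed for each precision. This is a back-of-the-envelope sketch: the helper name is hypothetical, and it ignores the small overhead of quantization constants.

```python
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "4-bit": 4}


def weight_size_gb(n_params, bits):
    """Storage for the weights alone, in GB (8 bits per byte)."""
    return n_params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_size_gb(1_000_000_000, bits)} GB")
```

For a 1B-parameter model, going from float16 to 4-bit cuts the weight storage by a factor of four.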

There exist many forms of quantization, here we leverage the [BitsAndBytes integration](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig).
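As an illustration of what a Q-LoRa setup might look like, the fragment below builds a 4-bit quantization config and a LoRa config. All values (NF4 quantization, rank, alpha, dropout, and the list of target modules) are illustrative assumptions rather than tuned choices for this model; adjust them to your hardware and task.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base model (values are illustrative).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Low-rank adapters attached to the attention projections (assumed module names).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

The quantization config is typically passed as `quantization_config=...` when loading the model with `from_pretrained`, and the LoRa config is passed to `get_peft_model`.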

In [None]:
## Load model
# Three options for training, from the lowest precision training to the highest precision training:
# QLoRA: model uses 4-bit quantization, which helps in reducing memory usage while maintaining performance.
# Standard LoRA:  model is loaded with standard LoRA adaptations.
# Full Fine-Tuning: no memory optimization is done. In that case Flash Attention is used to speed up training, if the hardware supports it.

# YOUR CODE HERE

In [None]:
def find_all_linear_names(model):
    # Only for LoRA or QLoRA

    cls = torch.nn.Linear
    lora_module_names = set()
    multimodal_keywords = ['multi_modal_projector', 'vision_model']
    for name, module in model.named_modules():
        if any(mm_keyword in name for mm_keyword in multimodal_keywords):
            continue
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names: # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

# If you selected LoRA or QLoRA, choose which modules to adapt
# YOUR CODE HERE

# Then create LoraConfig and run prepare_model_for_kbit_training(...)
# and finally: model = get_peft_model(model, ...)
# YOUR CODE HERE

['up_proj', 'down_proj', 'q_proj', 'o_proj', 'k_proj', 'gate_proj', 'v_proj']

## Define PyTorch Lightning Module for InternVL
To streamline the training and evaluation of the InternVL model, you can use [LightningModule](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html), which abstracts away much of the boilerplate code and provides a structured framework for model training. In this section, you need to define the InternVLModelPLModule, a custom PyTorch Lightning module that encapsulates the model, training loop, validation loop, and optimizer configuration.

### InternVLModelPLModule Class

The InternVLModelPLModule class inherits from LightningModule and includes methods for training, validation, and optimizer configuration. This setup ensures a clean and efficient training process.

Basically, PyTorch Lightning will take care of all device placements (.to(device)) for us, as well as the backward pass, putting the model in training mode, etc.

Notice the difference between a training step and an evaluation step:

- a training step only consists of a forward pass, in which we compute the cross-entropy loss between the model's next token predictions and the ground truth (in parallel for all tokens, this technique is known as "teacher forcing"). The backward pass is handled by PyTorch Lightning.
- an evaluation step consists of making the model autoregressively complete the prompt using the generate() method. After that, you compute an evaluation metric between the predicted sequences and the ground truth ones. This allows you to see how the model is improving over the course of training. The metric we use here is accuracy of answering the question.
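The accuracy metric mentioned above can be sketched as follows. The sketch assumes that the answer options are formatted like `(A) ...` and that the ground truth is stored as a single letter; both the format and the helper names are assumptions, so adapt the pattern to the prompt you actually use.

```python
import re


def extract_choice(text):
    """Return the first option letter such as 'A' from text like '(A) sitting', or None."""
    match = re.search(r"\(([A-Z])\)", text)
    return match.group(1) if match else None


def accuracy(predictions, answers):
    """Fraction of generated texts whose extracted letter matches the ground truth."""
    correct = sum(extract_choice(p) == a for p, a in zip(predictions, answers))
    return correct / len(answers)


predictions = ["(A) sitting down", "The answer is (C).", "I am not sure"]
answers = ["A", "B", "C"]
print(accuracy(predictions, answers))
```

A generation from which no letter can be extracted counts as wrong, which keeps the metric strict when the model drifts away from the expected answer format.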

Besides that, you define the optimizer to use (AdamW is a good default choice) and the data loaders, which use the collate functions defined above to batch together items of the PyTorch datasets. Do note that AdamW is a pretty heavy optimizer in terms of memory requirements, but as we're training with QLoRa we only need to store optimizer states for the adapter layers. For full fine-tuning, one could take a look at more memory friendly optimizers such as 8-bit Adam.

In [None]:
class InternVLModelPLModule(L.LightningModule):
    def __init__(self, config, processor, model):
        super().__init__()
        # YOUR CODE HERE

    def training_step(self, batch, batch_idx):
        # YOUR CODE HERE

    def validation_step(self, batch, batch_idx, dataset_idx=0):
        # YOUR CODE HERE

    def configure_optimizers(self):
        # YOUR CODE HERE

    def train_dataloader(self):
        return DataLoader(train_dataset, collate_fn=train_collate_fn,
                          batch_size=self.batch_size, shuffle=True, num_workers=4)

    def val_dataloader(self):
        return DataLoader(eval_dataset, collate_fn=eval_collate_fn,
                          batch_size=self.batch_size, shuffle=False, num_workers=4)

Then instantiate it (based on a config dictionary which defines all hyperparameters for training).

The batch size was determined based on the compute available.

Note that you can play around with the hyperparameters; I just use reasonable defaults here: 2 epochs, a learning rate of 1e-4, gradient accumulation over 8 batches, and mixed precision for training (more memory friendly). One could extend this with things like gradient checkpointing.

I recommend [this guide](https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_one) which goes over all tips and tricks regarding maximizing fine-tuning performance on consumer hardware.

In [None]:
config = {"max_epochs": 2,
          # "val_check_interval": 0.2, # how many times we want to validate during an epoch
          "check_val_every_n_epoch": 1,
          "gradient_clip_val": 1.0,
          "accumulate_grad_batches": 8,
          "lr": 1e-4,
          "batch_size": 1,
          "num_nodes": 1,
          "warmup_steps": 50,
}
# Instantiate your module here
# YOUR CODE HERE

## Define callbacks
Optionally, Lightning lets you define so-called [callbacks](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html), which are arbitrary pieces of code that can be executed during training.

It is a good idea to use Lightning's EarlyStopping callback, which automatically stops training once the evaluation metric (accuracy in our case) hasn't improved for 3 epochs.
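The patience rule can be pictured with a few lines of plain Python. This is a simplified model of the idea for a metric that should increase, not Lightning's actual implementation, and the function name `should_stop` is hypothetical.

```python
def should_stop(history, patience=3):
    """True when the last `patience` scores all fail to beat the best earlier score."""
    if len(history) <= patience:
        return False  # not enough validation checks yet
    best_before = max(history[:-patience])
    return all(score <= best_before for score in history[-patience:])


print(should_stop([0.50, 0.60, 0.58, 0.60, 0.59]))  # three checks without a new best
print(should_stop([0.50, 0.60, 0.58, 0.61]))        # the latest check set a new best
```

Patience is counted in validation checks, so it corresponds to epochs only when validation runs once per epoch.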

In [None]:
from huggingface_hub import HfApi
from lightning.pytorch.loggers import WandbLogger, TensorBoardLogger

logger = TensorBoardLogger("tb_logs", name="VideoLLava-demo")

api = HfApi()

class PushToHubCallback(Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        print(f"Pushing model to the hub, epoch {trainer.current_epoch}")
        pl_module.model.push_to_hub(REPO_ID,
                                    commit_message=f"Training in progress, epoch {trainer.current_epoch}")

    def on_train_end(self, trainer, pl_module):
        print(f"Pushing model to the hub after training")
        pl_module.processor.push_to_hub(REPO_ID,
                                    commit_message=f"Training done")
        pl_module.model.push_to_hub(REPO_ID,
                                    commit_message=f"Training done")

early_stop_callback = EarlyStopping(monitor="your-metric-here",
                                    patience=3, verbose=False, mode="max")

## Train!
The Trainer class supports many more flags than the ones you will need here. See the [docs](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.trainer.trainer.Trainer.html#lightning.pytorch.trainer.trainer.Trainer).
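One possible way to keep the Trainer arguments in sync with the config dictionary is to build them in a small helper. The helper name `trainer_kwargs` is hypothetical, and the choices of `accelerator="auto"`, a single device and `"16-mixed"` precision are assumptions; the keys themselves are regular `Trainer` arguments.

```python
def trainer_kwargs(config):
    """Translate the notebook's config dictionary into Trainer keyword arguments."""
    return {
        "accelerator": "auto",    # assumption: let Lightning pick GPU or CPU
        "devices": 1,             # assumption: single-device training
        "precision": "16-mixed",  # assumption: mixed precision to save memory
        "max_epochs": config["max_epochs"],
        "check_val_every_n_epoch": config["check_val_every_n_epoch"],
        "gradient_clip_val": config["gradient_clip_val"],
        "accumulate_grad_batches": config["accumulate_grad_batches"],
        "num_nodes": config["num_nodes"],
    }


example_config = {
    "max_epochs": 2,
    "check_val_every_n_epoch": 1,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
}
kwargs = trainer_kwargs(example_config)
print(kwargs)
```

The result could then be used as `L.Trainer(**kwargs, logger=logger, callbacks=[early_stop_callback])`.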

In [None]:
trainer = L.Trainer(
        # YOUR CODE HERE
)

In [None]:
trainer.fit(model_module)

In [None]:
# %load_ext tensorboard
# %reload_ext tensorboard
# %tensorboard --logdir .

## Inference

Let's see if the model has learned something. First, load the model from the hub.

In [None]:
from transformers import AutoProcessor, BitsAndBytesConfig, VideoLlavaForConditionalGeneration
import torch

# YOUR CODE HERE

Take one example from the validation set and plot the first two of its frames to see what is happening in the video.

In [None]:
from matplotlib import pyplot as plt
from PIL import Image

prompt, clip = eval_dataset[2]
fig, axarr = plt.subplots(1, 2, figsize = (10, 10))
fig.tight_layout()

for i in range(2):
    curr_frame = Image.fromarray(np.uint8(clip[i]))
    axarr[i].imshow(curr_frame)
    axarr[i].get_xaxis().set_visible(False)
    axarr[i].get_yaxis().set_visible(False)
    axarr[i].set_aspect('equal')

plt.subplots_adjust(wspace=None, hspace=None)
plt.axis('off')
plt.show()

Next you need to prepare the video for the model, along with the text prompt we used during training. You need to apply the chat template to make sure the format is respected.

Notice that this is exactly the same as what you did for the evaluation data collate function.

In [None]:
# YOUR CODE HERE

Next you should let the model autoregressively generate tokens using the [generate()](https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/text_generation#transformers.GenerationMixin.generate) method, which is recommended for use at inference time. This method feeds each predicted token back into the model as conditioning for each next time step.

Do note that there are various ways of decoding text, here we use greedy decoding which is the default. There are various fancier methods such as beam search and top-k sampling. Refer to [this amazing blog post](https://huggingface.co/blog/how-to-generate) for all details.
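To illustrate what "feeding each predicted token back into the model" means, here is a toy greedy decoding loop. The scoring function `toy_scores` is a made-up stand-in for a real model, and `greedy_decode` is a hypothetical helper; `generate()` does considerably more (batching, caching, stopping criteria).

```python
def greedy_decode(next_token_scores, prompt_ids, eos_id, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until EOS or the token budget."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)              # the prediction becomes part of the input
        if next_id == eos_id:
            break
    return ids


def toy_scores(ids):
    """Toy 'model' over a 5-token vocabulary: always prefers (last token + 1); token 4 is EOS."""
    scores = [0.0] * 5
    scores[min(ids[-1] + 1, 4)] = 1.0
    return scores


generated = greedy_decode(toy_scores, prompt_ids=[0], eos_id=4)
print(generated)
```

Beam search and sampling differ only in how the next token is chosen from the scores; the feedback loop stays the same.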

In [None]:
# Generate token IDs
# YOUR CODE HERE

# Decode back into text
# YOUR CODE HERE