# Problem Statement 1:

Create a data filtering pipeline that removes low-quality text-video pairs from the dataset and thus improves video generation quality at a lower model training cost.

# Our Approach

* Filter out low-FPS text-video pairs, then regenerate the metafile with only the high-FPS pairs.
* Batch download YouTube videos using the updated metafile, then split them into clips using cut_videos_mlda.py.
* Recaption video clips.
* Use CLIP as an evaluation metric to decide whether a better dataset has been created, one which can improve video generation quality and reduce model training cost.

# Requirements and Imports

In [None]:
%pip install pytube

In [None]:
import json
import os

# Generation of new metafile
* We chose 30 FPS as the minimum frame rate for a high-quality video.
* This reduces the number of unique videos in the dataset from 18750 to 2773, a 6.76-fold reduction. Assuming training time scales roughly linearly with dataset size, model training time should fall by a similar factor.
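The reduction factor quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the two video counts given in the text; the variable names are ours.

```python
# Video counts quoted in the bullet points above
videos_before = 18750
videos_after = 2773

# Ratio of the original dataset size to the filtered dataset size
reduction_factor = videos_before / videos_after

# Share of the original videos that survive the FPS filter
fraction_kept = videos_after / videos_before

print(f"Reduction factor: {reduction_factor:.2f}x")
print(f"Fraction of videos kept: {fraction_kept:.1%}")
```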

In [None]:
cwd = os.getcwd()

# Build the path with os.path.join so that it works on any operating system
filename_old = os.path.join(cwd, 'metafiles', 'long_mlda_data.json')

with open(filename_old, 'r') as f:
    data = json.load(f)

In [None]:
# Filter videos with fps >= 30
filtered_videos = {}
for video_id, video_info in data.items():
    for clip_name, clip_info in video_info['clip'].items():
        if clip_info['fps'] >= 30:
            if video_id not in filtered_videos:
                filtered_videos[video_id] = {}
            filtered_videos[video_id][clip_name] = clip_info

# Convert filtered_videos to a JSON string
filtered_json_data = json.dumps(filtered_videos, indent=2)

# Write the filtered JSON data to a new metafile
output_file_path = os.path.join(cwd, 'metafiles', 'filtered_data.json')
with open(output_file_path, "w") as output_file:
    output_file.write(filtered_json_data)

In [None]:
with open(output_file_path, 'r') as f:
    data2 = json.load(f)

print(len(data2))
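To make the filtering step easier to follow, here is a minimal sketch on a toy metafile. The function name `filter_by_fps`, the toy entries, and the example URLs are hypothetical; the layout (a `url` field plus a `clip` dictionary whose entries carry an `fps` value) is inferred from how the cells in this notebook access the data. Unlike the filtering cell above, this variant keeps each video's other fields (such as `url`) next to the surviving clips.

```python
def filter_by_fps(metadata, min_fps=30):
    """Keep only clips with fps >= min_fps; drop videos with no clips left."""
    filtered = {}
    for video_id, video_info in metadata.items():
        kept_clips = {
            clip_name: clip_info
            for clip_name, clip_info in video_info["clip"].items()
            if clip_info["fps"] >= min_fps
        }
        if kept_clips:
            # Copy the other fields (e.g. 'url') and replace the clip dictionary
            filtered[video_id] = {**video_info, "clip": kept_clips}
    return filtered


# Hypothetical toy metafile with the assumed structure
toy_metadata = {
    "vid_a": {
        "url": "https://example.com/a",
        "clip": {"vid_a.0.mp4": {"fps": 30.0}, "vid_a.1.mp4": {"fps": 24.0}},
    },
    "vid_b": {
        "url": "https://example.com/b",
        "clip": {"vid_b.0.mp4": {"fps": 25.0}},
    },
}

toy_filtered = filter_by_fps(toy_metadata)
print(toy_filtered)
```

Only `vid_a` survives, with its single 30 FPS clip, because every clip of `vid_b` falls below the threshold.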

## Batch Downloading of Videos

* We created a list of YouTube links from the metadata file, then proceeded to download a small number of videos using pytube.
* We then split the videos using cut_videos_mlda_old.

In [None]:
info_list = []

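# Convert the JSON metadata into a list for easier handling.
# Only videos that survived the FPS filter (the keys of data2) are kept;
# the full entry is read from data because it still holds the 'url' field.
info_list = [data[video_id] for video_id in data2]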

print(info_list[0])

In [None]:
#create list of youtube links
youtube_links_list = []
youtube_id_list = []

for i in range(len(info_list)):
    youtube_links_list.append(info_list[i]['url'])


for i in range(len(info_list)):
    youtube_id_list.append(list(info_list[i]['clip'].keys())[0].split('.')[0])
    
print(youtube_id_list)

In [None]:
from pytube import YouTube

number_of_videos = 10

# Skip any video that fails to download (e.g. age-restricted videos) and report the error
for i in range(min(number_of_videos, len(youtube_links_list))):
    try:
        print(f"Downloading video {i+1}")
        YouTube(youtube_links_list[i]).streams.first().download('download_videos', filename=f"{youtube_id_list[i]}.mp4")
    except Exception as e:
        print(f"Skipping video {i+1}: {e}")
        continue

# Our overall approach for generating captions

* To generate captions, we use LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder to an LLM so that it can answer text prompts about images.

* We access a LLaVA model hosted on Replicate through the Replicate API, using AutoGen's LLaVAAgent as the interface. The agent provides methods for sending messages, receiving replies, and managing conversations.

* The pretrained LLaVA model that we used for captioning combines a pre-trained CLIP ViT-L/14 visual encoder and the large language model Vicuna through a simple projection matrix, and has been fine-tuned in a two-stage procedure.

* Since LLaVA can only generate captions from images, we take frames from each video clip and run them through LLaVA to generate the caption. **Due to our limited time, we take only 1 frame from each video, which we assume to be representative of the entire video.**

* Future improvements could include sampling multiple frames at regular intervals for each clip, generating a caption for each frame, then taking advantage of LLaVA's context window to combine the contents of all frames into a more accurate video caption.
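As a rough illustration of the multi-frame idea, the sketch below computes evenly spaced timestamps at which frames could be extracted, skipping the first second of the clip. The function `sample_timestamps` and its parameters are hypothetical and not part of our pipeline; the clip duration would have to come from the video metadata.

```python
def sample_timestamps(duration_s, n_frames, skip_start_s=1.0):
    """Return n_frames evenly spaced timestamps (in seconds) strictly inside
    the interval (skip_start_s, duration_s)."""
    usable = duration_s - skip_start_s
    if n_frames <= 0 or usable <= 0:
        return []
    # Divide the usable span into n_frames + 1 equal gaps
    step = usable / (n_frames + 1)
    return [round(skip_start_s + step * (i + 1), 3) for i in range(n_frames)]


# Example: an 11-second clip sampled at 4 points
timestamps = sample_timestamps(duration_s=11.0, n_frames=4)
print(timestamps)
```

Each timestamp could then be passed to the frame extraction step, and the resulting per-frame captions combined in a single follow-up prompt.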

Details for LLaVA model: https://llava-vl.github.io/

Replicate API implementation: https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb

## LLaVA Requirements and Imports


In [None]:
!pip install replicate
!pip install "pyautogen[lmm]>=0.2.3"
!pip install pillow

In [None]:
import json
import os
import random
import time
from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union

import matplotlib.pyplot as plt
import requests
from PIL import Image
from termcolor import colored

import autogen
from autogen import Agent, AssistantAgent, ConversableAgent, UserProxyAgent
from autogen.agentchat.contrib.llava_agent import LLaVAAgent, llava_call

A Replicate API key must be generated to use Replicate. This can be done by making an account at https://replicate.com/account/api-tokens and generating an API token on the website.


In [None]:
# Set the API key for the Replicate API. Paste your own token here and never commit a real token.
os.environ["REPLICATE_API_TOKEN"] = "<YOUR_REPLICATE_API_TOKEN>"
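Rather than writing the token into the notebook, it can be exported in the shell before Jupyter is started and read back at run time. The helper below is a minimal sketch of that idea; the function name `get_replicate_token` is our own, and only the `REPLICATE_API_TOKEN` variable name comes from the cell above.

```python
import os


def get_replicate_token(env=None):
    """Return the Replicate token from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; export it before starting the notebook."
        )
    return token


# Demonstration with a stand-in mapping instead of the real environment
demo_token = get_replicate_token({"REPLICATE_API_TOKEN": "r8_example_token"})
print(len(demo_token))
```

Failing early with an explicit message avoids a harder-to-read authentication error later, when the first caption request is sent.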


## LLaVA Setup

* LLaVA can be used either locally or remotely.
* We have chosen a remote setup using Replicate to host LLaVA, as we faced many compatibility issues with the local setup.

In [None]:
LLAVA_MODE = "remote"
assert LLAVA_MODE in ["local", "remote"]

if LLAVA_MODE == "remote":
    import replicate

    llava_config_list = [
        {
            "model": "-",  # The model name doesn't matter here right now.
            "api_key": "None",  # already setup using os.environ["REPLICATE_API_TOKEN"]
            "base_url": "yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
        }
    ]

## Generation of Images from Video Clips


* Our video clips are stored in a folder in our current working directory. The folder name may change from time to time, thus we have chosen to iterate through all mp4 files in the current working directory to get the clip names.

In [None]:
images_from_video_directory = os.getcwd()
filenames = []

def get_mp4_files_recursive(folder_path):
    mp4_files = []
    # Walk through the directory tree
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.mp4'):
                mp4_files.append(os.path.join(root, file))
    return mp4_files

filenames = get_mp4_files_recursive(images_from_video_directory)

* The code block below shows how a frame is generated for each video clip.
* FFmpeg reads each input video and writes one output frame per video to the folder "test_images" in the current working directory.
* The first second of each video clip is skipped, as it may contain a transition between scenes. A frame taken there could show irrelevant content (e.g. the transition animation) and invalidate the generated caption.

In [None]:
#Creates images in the folder test_images

# Define the folder path
folder_path = "test_images"
abs_path_list = []

# Create the folder if it doesn't exist
os.makedirs(folder_path, exist_ok=True)

# Loop through the filenames
for file in filenames:
    output_filename = os.path.join(folder_path, os.path.basename(file) + "_output.png")
    abs_path_list.append(os.path.abspath(output_filename))
    # Seek past the first second, then write exactly one frame (-y overwrites an existing image)
    !ffmpeg -y -ss 00:00:01 -i "$file" -frames:v 1 "$output_filename"

print(abs_path_list)
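Outside a notebook, the `!ffmpeg` shell escape is not available. The sketch below shows one way the same extraction could be run from plain Python with `subprocess`, requesting a single frame explicitly with `-frames:v 1`. The helper names are hypothetical, and `extract_frame` assumes that `ffmpeg` is installed and on the PATH.

```python
import subprocess


def build_frame_command(video_path, output_path, timestamp="00:00:01"):
    """Build an ffmpeg command that seeks to `timestamp` and writes one frame."""
    return [
        "ffmpeg",
        "-y",               # overwrite the output image if it already exists
        "-ss", timestamp,   # seek past the start of the clip
        "-i", video_path,
        "-frames:v", "1",   # stop after writing a single video frame
        output_path,
    ]


def extract_frame(video_path, output_path, timestamp="00:00:01"):
    """Run ffmpeg and return True if it exited successfully."""
    command = build_frame_command(video_path, output_path, timestamp)
    completed = subprocess.run(command, capture_output=True)
    return completed.returncode == 0


# Only build and print the command here; no video file is needed for this demo
command = build_frame_command("clip.mp4", "test_images/clip.mp4_output.png")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.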

## Producing Captions using Replicate and Storage of Image-Caption pairs

* To generate captions, we need to prompt LLaVA and receive a caption in response.
* Before this, however, we first have to initialise instances of a user (user_proxy) and LLaVA (image_agent):

In [None]:
image_agent = LLaVAAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": llava_config_list, "temperature": 0.5, "max_new_tokens": 1000},
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "groupchat",
        "use_docker": False,
    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.
    human_input_mode="NEVER",  # Try between ALWAYS or NEVER
    max_consecutive_auto_reply=0,
)

* Now, prompts can be sent to the LLaVA agent. Each image file is first read and base64-encoded, so that unreadable files fail early; note that the prompt itself references the image by file path through the `<img ...>` tag rather than passing the encoded string. A try-except block skips any image that fails.
* At the same time, we want to store the generated captions in sequence with the image files. This is done by calling the last_message method of image_agent (an instance of the LLaVAAgent class) and storing the generated caption in the obtained_caption variable.
* Finally, image file paths and captions are stored as key-value pairs in a dictionary for further processing.

In [None]:
import base64

#storage of text-image pairs

image_text_pairs = {}

for abs_path in abs_path_list:
    image_path = abs_path
    try:
        # Read the image as binary data
        with open(image_path, "rb") as img_file:
            img_data = img_file.read()

        # Encode the binary image data to base64
        encoded_image = base64.b64encode(img_data).decode("utf-8")

        # Ask the question with an image
        user_proxy.initiate_chat(
            image_agent,
            message="""Caption this image as accurately as possible. Include all details, such as the environment, objects, what may happen in the past or near future, the predicted and current actions happening in the scene. Summarise the description in as much detail as possible within 40 words. sets of keywords in commas can be used to shorten the description instead of outputing full sentences
        <img {image_path}>.""".format(image_path=image_path),
        )

        obtained_caption = image_agent.last_message()['content'][0]['text']
        

        #caption list obtained
        image_text_pairs[abs_path] = [obtained_caption]
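    except Exception as e:
        # Skip images that cannot be read or captioned, but report why
        print(f"Skipping {abs_path}: {e}")
        continue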


### Explanation of Prompt Statement

Our prompt statement is as follows:

**Caption this image as accurately as possible. Include all details, such as the environment, objects, what may happen in the past or near future, the predicted and current actions happening in the scene. Summarise the description in as much detail as possible within 40 words. sets of keywords in commas can be used to shorten the description instead of outputing full sentences.**

The prompt highlights major details such as the environment and past and future events, so that the caption covers these important aspects. Allowing comma-separated keywords keeps the captions detailed yet short. The 40-word limit is there to keep each caption within the context length of CLIP (77 tokens for the text encoder), since passing an overly long caption to CLIP raises an error.
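LLaVA does not always respect the requested length, so a simple guard can be applied before a caption is scored. The sketch below truncates a caption to a maximum number of words. The helper `enforce_word_limit` is hypothetical and not part of our pipeline, and a word count is only a proxy for CLIP's token count, since one word can map to several tokens.

```python
def enforce_word_limit(caption, max_words=40):
    """Return the caption cut down to at most max_words whitespace-separated words."""
    words = caption.split()
    return " ".join(words[:max_words])


# Demonstration with a synthetic 50-word caption
long_caption = " ".join(f"word{i}" for i in range(50))
limited_caption = enforce_word_limit(long_caption)
print(len(limited_caption.split()))
```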


## CLIP Score Computation

* Now that we have the image file paths and captions, we can compute the CLIP score for each image using the openai-clip package.

* Reference: https://unimatrixz.com/blog/latent-space-clip-score/

## CLIP Requirements and Imports

In [None]:
!pip install openai-clip

In [None]:
import torch
import clip
from PIL import Image



* The get_clip_score function below loads a pre-trained CLIP model and an image, preprocesses the image, tokenizes the input text, generates embeddings for both the image and text, normalizes the features, calculates the cosine similarity between them, and returns the CLIP score as a measure of similarity between the image and text.

In [None]:
def get_clip_score(image_path, text):
    # Load the pre-trained CLIP model and the image

    model, preprocess = clip.load('ViT-B/32')

    image = Image.open(image_path)

    # Preprocess the image and tokenize the text
    image_input = preprocess(image).unsqueeze(0)
    text_input = clip.tokenize([text])
    
    # Move the inputs to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_input = image_input.to(device)
    text_input = text_input.to(device)
    model = model.to(device)
    
    # Generate embeddings for the image and text
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)
    
    # Normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    
    # Calculate the cosine similarity to get the CLIP score
    clip_score = torch.matmul(image_features, text_features.T).item()
    
    return clip_score
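The last two steps of get_clip_score (normalising both feature vectors and taking their dot product) amount to a cosine similarity. The sketch below reproduces that calculation in plain Python on two made-up 2-dimensional vectors, which stand in for the much larger CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for an image embedding and a text embedding
image_vec = [3.0, 4.0]
text_vec = [4.0, 3.0]

score = cosine_similarity(image_vec, text_vec)
print(round(score, 4))
```

The score lies between -1 and 1, and a higher value means the caption embedding points in a direction closer to the image embedding.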

* The code block below gets the CLIP score for every image-text pair and appends it to the corresponding entry of the image_text_pairs dictionary.
* The resulting format of image_text_pairs is: {image_file_path1: [obtained_caption, clip_score]}

In [None]:
for raw_image_path,caption in image_text_pairs.items():
    image_text_pairs[raw_image_path].append(get_clip_score(raw_image_path,caption[0]))

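* The code block below calculates the average CLIP score of all frames/videos.
* Based on the leaderboard linked below, we estimate that our custom-prompted LLaVA model would place around 13th. This comparison is only indicative, since the leaderboard ranks image classification models rather than caption CLIP scores.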

Image Classification in the Wild leaderboard: https://eval.ai/web/challenges/challenge-page/1832/leaderboard/4298

In [None]:
counter = 0

for key,value in image_text_pairs.items():
    print(value[1])
    counter+=value[1]

print(f'Average clip score is:{counter/len(image_text_pairs)}')
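The same average can be computed a little more defensively, so that an empty dictionary (for example, when every caption request failed) does not cause a division by zero. This is a sketch on toy data; the file names, captions, and scores below are made up and only mirror the {path: [caption, score]} layout described above.

```python
from statistics import mean

# Hypothetical entries in the same layout as image_text_pairs
toy_pairs = {
    "a.mp4_output.png": ["a dog running on grass", 0.31],
    "b.mp4_output.png": ["a city street at night", 0.27],
}

# Collect the CLIP score (second element) of every entry
scores = [value[1] for value in toy_pairs.values()]

# Fall back to NaN instead of raising when there are no scores
average_score = mean(scores) if scores else float("nan")
print(f"Average CLIP score is: {average_score:.3f}")
```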


* Now, the CLIP scores can be sorted to return the best text-video pairs.
* Before sorting, the next cell recovers the original file names of the videos, so that they can be retrieved and fed into the video generation model as stated in the problem statement.

In [None]:
selection_list = {}

for key, value in image_text_pairs.items():
    # Drop the directory (portable across operating systems), then the "_output.png" suffix
    file_name = os.path.basename(key)
    selection_list_key = file_name[:file_name.rfind("_")]
    selection_list[selection_list_key] = value

print(selection_list)

## Getting list of best video-text pairs

In [None]:
# Sort the dictionary items by CLIP score in descending order
sorted_selection = sorted(selection_list.items(), key=lambda x: x[1][1], reverse=True)

print(sorted_selection)

# Keep the (file name, caption) pairs of the top 10000 items (all items if there are fewer)
top_files = [(item[0], item[1][0]) for item in sorted_selection[:10000]]

# Store the selected pairs in my_list
my_list = []
my_list.extend(top_files)

# print(my_list)
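When only a small number of pairs is needed, the top entries can also be selected without sorting the whole dictionary. The sketch below uses `heapq.nlargest` from the standard library on toy data; the helper `top_k_pairs`, the file names, and the scores are hypothetical and only mirror the {file name: [caption, score]} layout of selection_list.

```python
import heapq


def top_k_pairs(selection, k):
    """Return the (file name, caption) pairs of the k highest-scoring entries."""
    best = heapq.nlargest(k, selection.items(), key=lambda item: item[1][1])
    return [(name, value[0]) for name, value in best]


# Hypothetical entries in the same layout as selection_list
toy_selection = {
    "a.mp4": ["caption a", 0.31],
    "b.mp4": ["caption b", 0.27],
    "c.mp4": ["caption c", 0.35],
}

top_two = top_k_pairs(toy_selection, k=2)
print(top_two)
```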


# Conclusion

* We reduced the size of the original dataset by a factor of 6.76 by filtering out low-FPS pairs.
* Our regenerated captions have an average CLIP score of about 0.318, which is better than quite a few models on the referenced leaderboard.
* Thus we expect that our pipeline can improve video generation quality at a lower model training cost.