<a href="https://colab.research.google.com/github/coderinf/OPTIMIZATION-OF-VLM/blob/main/VLMOT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

THE TRANSFORMERS MODULE IS NEEDED TO DOWNLOAD MODELS FROM HUGGING FACE.


In [2]:
pip install transformers torch



GET OPENAI-CLIP MODEL FROM TRANSFORMERS

In [4]:
import torch
from transformers import CLIPProcessor, CLIPModel

# Define the model name
MODEL_NAME = "openai/clip-vit-base-patch32"

# Load the pre-trained CLIP model and processor
processor = CLIPProcessor.from_pretrained(MODEL_NAME)
model = CLIPModel.from_pretrained(MODEL_NAME)

# Set the model to evaluation mode
model.eval()

print(f"Successfully loaded CLIP model: {MODEL_NAME}")
print(f"Model is set to evaluation mode: {not model.training}")

CLIPModel LOAD REPORT from: openai/clip-vit-base-patch32
Key                                  | Status     |  | 
-------------------------------------+------------+--+-
text_model.embeddings.position_ids   | UNEXPECTED |  | 
vision_model.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Successfully loaded CLIP model: openai/clip-vit-base-patch32
Model is set to evaluation mode: True


LOAD THE REQUIRED VIDEO FILE AND EXTRACT ITS FRAMES SO THE MODEL CAN GET THE SEMANTICS OF EVENTS

In [6]:
import cv2
import os

# Define the path to the video file
video_path = '/content/4791196-hd_1920_1080_30fps.mp4'

# Check if the video file exists
if not os.path.exists(video_path):
    raise FileNotFoundError(f"Video file not found at: {video_path}")

# Open the video file
cap = cv2.VideoCapture(video_path)

# Check if video opened successfully
if not cap.isOpened():
    raise IOError("Error: Could not open video file.")

frames = []
frame_count = 0

# Read frames from the video
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Convert frame from BGR to RGB (CLIP expects RGB images)
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frames.append(frame_rgb)
    frame_count += 1

# Release the video capture object
cap.release()

print(f"Successfully extracted {frame_count} frames from {video_path}")
print(f"First frame shape: {frames[0].shape if frames else 'No frames extracted'}")

Successfully extracted 195 frames from /content/4791196-hd_1920_1080_30fps.mp4
First frame shape: (1080, 1920, 3)
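
Holding every decoded frame in a list is fine for a short clip, but the cost grows quickly: the output above reports 195 frames of shape (1080, 1920, 3), and each uint8 RGB frame of that shape takes about 5.9 MiB. The sketch below works through that arithmetic and shows one possible way to pick every N-th frame index instead. The helper names `frame_bytes` and `sample_indices` and the stride of 5 are illustrative assumptions, not part of the original notebook.

```python
# Rough memory estimate for the frames list, plus a hypothetical stride sampler.
# Assumes uint8 RGB frames (1 byte per channel value), as decoded by cv2.

def frame_bytes(height, width, channels=3, bytes_per_value=1):
    """Bytes needed to hold one decoded frame."""
    return height * width * channels * bytes_per_value


def sample_indices(num_frames, stride):
    """Indices of every `stride`-th frame, starting at frame 0."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))


num_frames = 195  # from the extraction output above
per_frame = frame_bytes(1080, 1920)
total_mib = num_frames * per_frame / (1024 * 1024)
kept = sample_indices(num_frames, stride=5)  # stride=5 is an arbitrary choice

print(f"One frame: {per_frame} bytes")
print(f"All {num_frames} frames: {total_mib:.1f} MiB")
print(f"Keeping every 5th frame leaves {len(kept)} frames")
```

Whether skipping frames is acceptable depends on how short the events of interest are; a stride that is too large could miss them entirely.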


DEFINE LABELS FOR THE REQUIRED EVENTS AND PASS THE VIDEO FRAMES TO THE MODEL


In [9]:
import torch
import numpy as np

# 1. Define semantic event labels
event_labels = [
    'a photo of a person walking',
    'a photo of a vehicle stopping',
    'a photo of a crowded scene'
]

# 2. Tokenize the text labels and 3. Generate text embeddings
with torch.no_grad():
    text_inputs = processor(text=event_labels, return_tensors="pt", padding=True)
    # Move inputs to the same device as the model if it's on GPU
    if torch.cuda.is_available():
        text_inputs = {k: v.to('cuda') for k, v in text_inputs.items()}
        model.to('cuda')
    text_features = model.get_text_features(**text_inputs)
    # Ensure text_features is a tensor, if it's an object, get the pooler_output
    if hasattr(text_features, 'pooler_output'):
        text_features = text_features.pooler_output
    text_features /= text_features.norm(dim=-1, keepdim=True)

# 4. Initialize an empty list to store detected events or scores
detected_events = []

print(f"Processing {len(frames)} frames...")
# 5. Iterate through each frame
for i, frame_rgb in enumerate(frames):
    # a. Preprocess the frame
    image_input = processor(images=frame_rgb, return_tensors="pt", padding=True).pixel_values

    # Move image input to the same device as the model
    if torch.cuda.is_available():
        image_input = image_input.to('cuda')

    # b. Generate image embeddings
    with torch.no_grad():
        image_features = model.get_image_features(image_input)
        # Ensure image_features is a tensor, if it's an object, get the pooler_output
        if hasattr(image_features, 'pooler_output'):
            image_features = image_features.pooler_output
        image_features /= image_features.norm(dim=-1, keepdim=True)

    # c. Calculate the cosine similarity
    similarity = torch.nn.functional.cosine_similarity(image_features, text_features)

    # d. Determine the most probable event
    best_match_idx = similarity.argmax().item()
    best_match_score = similarity[best_match_idx].item()
    predicted_event = event_labels[best_match_idx]

    # e. Store the frame index, the detected event, and its similarity score.
    detected_events.append({
        'frame_index': i,
        'predicted_event': predicted_event,
        'similarity_score': best_match_score
    })

# 6. Print or display the detected events for each frame (e.g., first 10 and last 10)
print("\n--- Detected Events (First 10) ---")
for event in detected_events[:10]:
    print(f"Frame {event['frame_index']}: {event['predicted_event']} (Score: {event['similarity_score']:.4f})")

if len(detected_events) > 10:
    print("\n...")
    print("\n--- Detected Events (Last 10) ---")
    for event in detected_events[-10:]:
        print(f"Frame {event['frame_index']}: {event['predicted_event']} (Score: {event['similarity_score']:.4f})")

print(f"\nCompleted event detection for {len(detected_events)} frames.")

Processing 195 frames...

--- Detected Events (First 10) ---
Frame 0: a photo of a crowded scene (Score: 0.2159)
Frame 1: a photo of a crowded scene (Score: 0.2197)
Frame 2: a photo of a crowded scene (Score: 0.2286)
Frame 3: a photo of a crowded scene (Score: 0.2278)
Frame 4: a photo of a crowded scene (Score: 0.2314)
Frame 5: a photo of a crowded scene (Score: 0.2288)
Frame 6: a photo of a crowded scene (Score: 0.2278)
Frame 7: a photo of a crowded scene (Score: 0.2266)
Frame 8: a photo of a crowded scene (Score: 0.2262)
Frame 9: a photo of a vehicle stopping (Score: 0.2249)

...

--- Detected Events (Last 10) ---
Frame 185: a photo of a person walking (Score: 0.2395)
Frame 186: a photo of a person walking (Score: 0.2360)
Frame 187: a photo of a person walking (Score: 0.2292)
Frame 188: a photo of a person walking (Score: 0.2312)
Frame 189: a photo of a person walking (Score: 0.2319)
Frame 190: a photo of a person walking (Score: 0.2377)
Frame 191: a photo of a person walking (Score:
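
Per-frame labels are easier to read as events once consecutive frames with the same label are merged into segments. The following is a minimal sketch of that idea, assuming a list of dictionaries shaped like `detected_events`. The helper `group_into_segments`, the sample data, and the 30 fps value (inferred from the video filename) are assumptions for illustration.

```python
from itertools import groupby


def group_into_segments(events, fps=30.0):
    """Merge runs of consecutive frames that share a predicted label."""
    segments = []
    for label, run in groupby(events, key=lambda e: e["predicted_event"]):
        run = list(run)
        start, end = run[0]["frame_index"], run[-1]["frame_index"]
        segments.append({
            "label": label,
            "start_frame": start,
            "end_frame": end,
            "start_s": start / fps,
            "end_s": (end + 1) / fps,  # end of the last frame in the run
            "mean_score": sum(e["similarity_score"] for e in run) / len(run),
        })
    return segments


# Hypothetical sample in the same shape as detected_events (not real results).
sample_events = [
    {"frame_index": 0, "predicted_event": "a photo of a crowded scene", "similarity_score": 0.22},
    {"frame_index": 1, "predicted_event": "a photo of a crowded scene", "similarity_score": 0.23},
    {"frame_index": 2, "predicted_event": "a photo of a vehicle stopping", "similarity_score": 0.22},
    {"frame_index": 3, "predicted_event": "a photo of a person walking", "similarity_score": 0.24},
    {"frame_index": 4, "predicted_event": "a photo of a person walking", "similarity_score": 0.23},
]

segments = group_into_segments(sample_events)
for seg in segments:
    print(f"{seg['label']}: frames {seg['start_frame']}-{seg['end_frame']} "
          f"({seg['start_s']:.2f}s to {seg['end_s']:.2f}s), "
          f"mean score {seg['mean_score']:.3f}")
```

Single-frame segments, such as frame 2 in the sample, may be noise rather than real events; a minimum-duration filter is one way to drop them.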

CALCULATE THE BENCHMARKS FOR THE BASELINE (UNQUANTIZED) VLM.

In [11]:
import time
import torch
import numpy as np

# Ensure event_labels, processor, model, text_features, and frames are available from previous steps

# 1. Record the start time (the timed region covers preprocessing as well as the forward pass)
start_time = time.time()

# Re-initialize detected_events list (optional, if you want to keep the output clean for this step)
detected_events_performance = []

print(f"Benchmarking inference speed for {len(frames)} frames...")

# 2. Iterate through each frame
for i, frame_rgb in enumerate(frames):
    # a. Preprocess the frame
    image_input = processor(images=frame_rgb, return_tensors="pt", padding=True).pixel_values

    # Move image input to the same device as the model
    if torch.cuda.is_available():
        image_input = image_input.to('cuda')

    # b. Generate image embeddings
    with torch.no_grad():
        image_features = model.get_image_features(image_input)
        # Ensure image_features is a tensor, if it's an object, get the pooler_output
        if hasattr(image_features, 'pooler_output'):
            image_features = image_features.pooler_output
        image_features /= image_features.norm(dim=-1, keepdim=True)

    # c. Calculate the cosine similarity
    similarity = torch.nn.functional.cosine_similarity(image_features, text_features)

    # d. Determine the most probable event (keeping this for consistency, though not strictly needed for benchmarking)
    best_match_idx = similarity.argmax().item()
    best_match_score = similarity[best_match_idx].item()
    predicted_event = event_labels[best_match_idx]

    # e. Store the frame index, the detected event, and its similarity score. (optional for performance metrics)
    detected_events_performance.append({
        'frame_index': i,
        'predicted_event': predicted_event,
        'similarity_score': best_match_score
    })

# 3. Record the end time
end_time = time.time()

# 4. Calculate the total inference time
total_inference_time = end_time - start_time

# 5. Calculate the average inference time per frame
average_time_per_frame = total_inference_time / len(frames)

# 6. Calculate the Frames Per Second (FPS)
fps = len(frames) / total_inference_time

# 7. Print the results
print("\n--- Baseline Model Performance Metrics ---")
print(f"Total frames processed: {len(frames)}")
print(f"Total inference time: {total_inference_time:.4f} seconds")
print(f"Average inference time per frame: {average_time_per_frame:.4f} seconds")
print(f"Frames Per Second (FPS): {fps:.2f}")


Benchmarking inference speed for 195 frames...

--- Baseline Model Performance Metrics ---
Total frames processed: 195
Total inference time: 46.9261 seconds
Average inference time per frame: 0.2406 seconds
Frames Per Second (FPS): 4.16
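
The loop above times preprocessing, the forward pass, and the bookkeeping together, using a single `time.time()` interval. One possible refinement, sketched below, is a small helper that runs a few untimed warm-up calls and then measures with `time.perf_counter()`. The helper name `benchmark`, its parameters, and the stand-in workload are assumptions for illustration; in the notebook the callable would wrap the per-frame CLIP call.

```python
import time


def benchmark(fn, items, warmup=3):
    """Time fn(item) over items after a few untimed warm-up calls."""
    items = list(items)
    if not items:
        raise ValueError("items must not be empty")
    for item in items[:warmup]:
        fn(item)  # warm-up: results and timings are discarded
    start = time.perf_counter()
    for item in items:
        fn(item)
    total = time.perf_counter() - start
    return {
        "total_frames": len(items),
        "total_inference_time": total,
        "average_time_per_frame": total / len(items),
        "fps": len(items) / total if total > 0 else float("inf"),
    }


# Stand-in workload (hypothetical); replace with the real per-frame function.
def fake_inference(frame_id):
    return sum(i * i for i in range(2000)) + frame_id


metrics = benchmark(fake_inference, range(50))
print(f"Frames: {metrics['total_frames']}")
print(f"Average time per frame: {metrics['average_time_per_frame']:.6f} s")
print(f"FPS: {metrics['fps']:.2f}")
```

If the model runs on a GPU, a `torch.cuda.synchronize()` call before each clock read is usually needed, because CUDA work is launched asynchronously.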


APPLY INT8 DYNAMIC QUANTIZATION TO THE VLM

In [13]:
import torch
from torch.quantization import quantize_dynamic
import os

# Ensure the model is on CPU for dynamic quantization
# PyTorch's eager-mode dynamic quantization targets CPU backends, so the quantized model runs on CPU
model.cpu()

print(f"Applying dynamic quantization to model of type {type(model)}...")

# 1. Apply dynamic quantization to the loaded model
# Quantize only linear layers, which are common in CLIP's text and vision transformers
quantized_model = quantize_dynamic(model,
                                   {
                                       torch.nn.Linear
                                   },
                                   dtype=torch.qint8,
                                   inplace=False)

print("Dynamic quantization applied successfully.")

# 2. Save the quantized_model
output_model_path = 'quantized_clip_model.pt'
torch.save(quantized_model.state_dict(), output_model_path)

print(f"Quantized model saved to {output_model_path}")

# Compare model sizes
# To get the size of the original model, save its state_dict to a temporary file
original_model_temp_path = 'original_clip_model_temp.pt'
torch.save(model.state_dict(), original_model_temp_path)
original_model_size_mb = os.path.getsize(original_model_temp_path) / (1024 * 1024)
os.remove(original_model_temp_path) # Clean up the temporary file

quantized_model_size_mb = os.path.getsize(output_model_path) / (1024 * 1024)

print(f"Original model size: {original_model_size_mb:.2f} MB")
print(f"Quantized model size: {quantized_model_size_mb:.2f} MB")

Applying dynamic quantization to model of type <class 'transformers.models.clip.modeling_clip.CLIPModel'>...


For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  quantized_model = quantize_dynamic(model,


Dynamic quantization applied successfully.
Quantized model saved to quantized_clip_model.pt
Original model size: 577.21 MB
Quantized model size: 224.46 MB
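
The size drop is smaller than the 4x that a full float32-to-int8 conversion would suggest, because only `torch.nn.Linear` weights were quantized. As a rough back-of-the-envelope estimate, if a fraction f of the original bytes shrinks to a quarter of its size, the new size is S(1 - 0.75f). The sketch below solves for f from the two sizes printed above. It ignores the small overhead of scales, zero points, and serialization, so the result should be treated as approximate.

```python
original_mb = 577.21   # from the output above
quantized_mb = 224.46  # from the output above

compression_ratio = original_mb / quantized_mb
size_reduction_pct = (original_mb - quantized_mb) / original_mb * 100

# Model: quantized = original * (1 - 0.75 * f), where f is the fraction of
# bytes that went from 4-byte float32 to 1-byte int8. Solve for f.
quantized_fraction = (1 - quantized_mb / original_mb) / 0.75

print(f"Compression ratio: {compression_ratio:.2f}x")
print(f"Size reduction: {size_reduction_pct:.2f}%")
print(f"Estimated share of bytes in quantized Linear weights: {quantized_fraction:.1%}")
```

Under this simple model, roughly four fifths of the original bytes sat in linear-layer weights; the rest (embeddings, layer norms, biases, and similar) stayed in float32.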


LOAD THE OPTIMIZED MODEL


In [15]:
import torch
from transformers import CLIPModel, CLIPProcessor
from torch.quantization import quantize_dynamic

# Define the model name used for loading the original model structure
MODEL_NAME = "openai/clip-vit-base-patch32"

# Re-initialize the original model (on CPU as it was quantized on CPU)
# This is necessary to load the state_dict into the correct model architecture
processor = CLIPProcessor.from_pretrained(MODEL_NAME)
original_model_for_quantization = CLIPModel.from_pretrained(MODEL_NAME)
original_model_for_quantization.cpu()

# Re-apply quantization to get the quantized model object structure
# We need to recreate the quantized model structure to load its state_dict
quantized_model_loaded = quantize_dynamic(original_model_for_quantization,
                                           {torch.nn.Linear},
                                           dtype=torch.qint8,
                                           inplace=False)

# Load the state_dict of the saved quantized model
output_model_path = 'quantized_clip_model.pt'
quantized_model_loaded.load_state_dict(torch.load(output_model_path))

# Set the loaded quantized model to evaluation mode
quantized_model_loaded.eval()

print(f"Successfully loaded quantized model from {output_model_path}")
print(f"Quantized model is set to evaluation mode: {not quantized_model_loaded.training}")

# Make the loaded quantized model globally available for subsequent benchmarking
model_quantized = quantized_model_loaded

CLIPModel LOAD REPORT from: openai/clip-vit-base-patch32
Key                                  | Status     |  | 
-------------------------------------+------------+--+-
text_model.embeddings.position_ids   | UNEXPECTED |  | 
vision_model.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.

Successfully loaded quantized model from quantized_clip_model.pt
Quantized model is set to evaluation mode: True


PASS THE SAME VIDEO AS INPUT TO THE OPTIMIZED MODEL

In [21]:
import time
import torch
import numpy as np

# Ensure event_labels, processor, frames are available from previous steps
# Make sure `model_quantized` is defined from the previous step

# Re-generate text features using the *quantized* model's text encoder
# This is crucial as quantization affects the model's internal representation
print("Re-generating text features using the quantized model's text encoder...")
with torch.no_grad():
    text_inputs_quantized = processor(text=event_labels, return_tensors="pt", padding=True)
    # Ensure text inputs are on CPU as the quantized model is on CPU
    text_features_quantized = model_quantized.get_text_features(**text_inputs_quantized)
    if hasattr(text_features_quantized, 'pooler_output'):
        text_features_quantized = text_features_quantized.pooler_output
    text_features_quantized /= text_features_quantized.norm(dim=-1, keepdim=True)
print("Text features for quantized model re-generated.")

# 1. Record the start time
start_time_quantized = time.time()

# Re-initialize detected_events list for quantized model performance (optional)
detected_events_performance_quantized = []

print(f"Benchmarking inference speed for quantized model on {len(frames)} frames...")

# 2. Iterate through each frame
for i, frame_rgb in enumerate(frames):
    # a. Preprocess the frame
    image_input = processor(images=frame_rgb, return_tensors="pt", padding=True).pixel_values
    # Ensure image input is on CPU for the quantized model
    # image_input = image_input.to('cpu') # It is already on CPU by default from processor

    # b. Generate image embeddings using the quantized model
    with torch.no_grad():
        image_features_quantized = model_quantized.get_image_features(image_input)
        # Ensure image_features is a tensor, if it's an object, get the pooler_output
        if hasattr(image_features_quantized, 'pooler_output'):
            image_features_quantized = image_features_quantized.pooler_output
        image_features_quantized /= image_features_quantized.norm(dim=-1, keepdim=True)

    # c. Calculate the cosine similarity
    similarity_quantized = torch.nn.functional.cosine_similarity(image_features_quantized, text_features_quantized)

    # d. Determine the most probable event (keeping for consistency)
    best_match_idx_quantized = similarity_quantized.argmax().item()
    best_match_score_quantized = similarity_quantized[best_match_idx_quantized].item()
    predicted_event_quantized = event_labels[best_match_idx_quantized]


    # e. Store the frame index, the detected event, and its similarity score (optional)
    detected_events_performance_quantized.append({
        'frame_index': i,
        'predicted_event': predicted_event_quantized,
        'similarity_score': best_match_score_quantized
    })

# 3. Record the end time
end_time_quantized = time.time()

# 4. Calculate the total inference time
total_inference_time_quantized = end_time_quantized - start_time_quantized

# 5. Calculate the average inference time per frame
average_time_per_frame_quantized = total_inference_time_quantized / len(frames)

# 6. Calculate the Frames Per Second (FPS)
fps_quantized = len(frames) / total_inference_time_quantized

# 7. Print the results
print("\n--- Quantized Model Performance Metrics ---")
print(f"Total frames processed: {len(frames)}")
print(f"Total inference time: {total_inference_time_quantized:.4f} seconds")
print(f"Average inference time per frame: {average_time_per_frame_quantized:.4f} seconds")
print(f"Frames Per Second (FPS): {fps_quantized:.2f}")

# Store metrics for later comparison table
metrics_quantized = {
    "total_frames": len(frames),
    "total_inference_time": total_inference_time_quantized,
    "average_time_per_frame": average_time_per_frame_quantized,
    "fps": fps_quantized
}

Re-generating text features using the quantized model's text encoder...
Text features for quantized model re-generated.
Benchmarking inference speed for quantized model on 195 frames...

--- Quantized Model Performance Metrics ---
Total frames processed: 195
Total inference time: 32.1149 seconds
Average inference time per frame: 0.1647 seconds
Frames Per Second (FPS): 6.07
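
Speed is only half of the comparison: it may also be worth checking how often the quantized model picks the same label as the baseline. A minimal sketch follows, assuming two lists shaped like `detected_events_performance` and `detected_events_performance_quantized`. The helper name `prediction_agreement` and the sample data are hypothetical, not measured results.

```python
def prediction_agreement(baseline_events, quantized_events):
    """Compare per-frame labels and scores from two runs over the same frames."""
    if len(baseline_events) != len(quantized_events):
        raise ValueError("both runs must cover the same number of frames")
    if not baseline_events:
        raise ValueError("no events to compare")
    matches = 0
    score_gaps = []
    disagreements = []
    for base, quant in zip(baseline_events, quantized_events):
        if base["predicted_event"] == quant["predicted_event"]:
            matches += 1
        else:
            disagreements.append(base["frame_index"])
        score_gaps.append(abs(base["similarity_score"] - quant["similarity_score"]))
    return {
        "agreement": matches / len(baseline_events),
        "mean_abs_score_gap": sum(score_gaps) / len(score_gaps),
        "disagreeing_frames": disagreements,
    }


# Hypothetical data in the notebook's event format (not real results).
baseline_sample = [
    {"frame_index": 0, "predicted_event": "a photo of a crowded scene", "similarity_score": 0.22},
    {"frame_index": 1, "predicted_event": "a photo of a crowded scene", "similarity_score": 0.23},
    {"frame_index": 2, "predicted_event": "a photo of a vehicle stopping", "similarity_score": 0.22},
    {"frame_index": 3, "predicted_event": "a photo of a person walking", "similarity_score": 0.24},
]
quantized_sample = [
    {"frame_index": 0, "predicted_event": "a photo of a crowded scene", "similarity_score": 0.21},
    {"frame_index": 1, "predicted_event": "a photo of a crowded scene", "similarity_score": 0.23},
    {"frame_index": 2, "predicted_event": "a photo of a person walking", "similarity_score": 0.23},
    {"frame_index": 3, "predicted_event": "a photo of a person walking", "similarity_score": 0.24},
]

report = prediction_agreement(baseline_sample, quantized_sample)
print(f"Label agreement: {report['agreement']:.0%}")
print(f"Mean absolute score gap: {report['mean_abs_score_gap']:.4f}")
print(f"Frames that disagree: {report['disagreeing_frames']}")
```

Agreement with the baseline is not the same as accuracy, since neither run is compared against ground-truth labels here.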


FORMAT THE COMPARISON TABLE

In [17]:
import pandas as pd

# 1. Collect baseline model performance metrics
# These variables come from the baseline benchmarking cell
baseline_total_inference_time = total_inference_time
baseline_average_time_per_frame = average_time_per_frame
baseline_fps = fps

# 2. Collect optimized (quantized) model performance metrics
# These values come from the quantized benchmarking cell
quantized_total_inference_time = metrics_quantized["total_inference_time"]
quantized_average_time_per_frame = metrics_quantized["average_time_per_frame"]
quantized_fps = metrics_quantized["fps"]

# Model sizes (original_model_size_mb, quantized_model_size_mb) are reused
# as-is from the quantization cell

# 3. Create a pandas DataFrame for the comparison table
comparison_data = {
    'Metric': [
        'Total Frames Processed',
        'Total Inference Time (s)',
        'Average Time per Frame (s)',
        'Frames Per Second (FPS)',
        'Model Size (MB)'
    ],
    'Baseline Model': [
        len(frames),
        f'{baseline_total_inference_time:.4f}',
        f'{baseline_average_time_per_frame:.4f}',
        f'{baseline_fps:.2f}',
        f'{original_model_size_mb:.2f}'
    ],
    'Optimized Model': [
        len(frames),
        f'{quantized_total_inference_time:.4f}',
        f'{quantized_average_time_per_frame:.4f}',
        f'{quantized_fps:.2f}',
        f'{quantized_model_size_mb:.2f}'
    ]
}

comparison_df = pd.DataFrame(comparison_data)

# Calculate speedup metrics
inference_time_reduction = (baseline_total_inference_time - quantized_total_inference_time) / baseline_total_inference_time * 100
fps_speedup_factor = quantized_fps / baseline_fps
model_size_reduction = (original_model_size_mb - quantized_model_size_mb) / original_model_size_mb * 100

print("--- Performance Comparison Table ---")
# DataFrame.to_markdown relies on the optional 'tabulate' package
print(comparison_df.to_markdown(index=False))

print(f"\nObserved Speedup (FPS factor): {fps_speedup_factor:.2f}x")
print(f"Inference Time Reduction: {inference_time_reduction:.2f}%")
print(f"Model Size Reduction: {model_size_reduction:.2f}%")


--- Performance Comparison Table ---
| Metric                     |   Baseline Model |   Optimized Model |
|:---------------------------|-----------------:|------------------:|
| Total Frames Processed     |         195      |          195      |
| Total Inference Time (s)   |          46.9261 |           30.2969 |
| Average Time per Frame (s) |           0.2406 |            0.1554 |
| Frames Per Second (FPS)    |           4.16   |            6.44   |
| Model Size (MB)            |         577.21   |          224.46   |

Observed Speedup (FPS factor): 1.55x
Inference Time Reduction: 35.44%
Model Size Reduction: 61.11%
