<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/JEPA2_DEMO_JUN2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies. transformers is installed from source (main branch),
# so any existing transformers install is removed first to get a clean state.
!pip uninstall transformers -y
!pip install --upgrade pip -q
!pip install av -q  # PyAV, used for video decoding
!pip install git+https://github.com/huggingface/transformers.git -q

# After the installs above finish, restart the kernel,
# then run the cells below.


In [6]:
from transformers import AutoVideoProcessor, AutoModel
import torch
import av # PyAV for video loading

hf_repo = "facebook/vjepa2-vitg-fpc64-256"

# Load the model and processor
model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# --- Example video input (replace with frames from a real video file) ---
# In practice you would decode frames with a library such as PyAV or OpenCV.
# Here a dummy clip of shape (num_frames, channels, height, width) stands in.
# uint8 values in [0, 255] mimic decoded RGB frames; the processor is assumed
# to rescale them to [0, 1] itself, so floats from torch.rand would be scaled twice.
num_frames = 16
height, width = 256, 256
dummy_video = torch.randint(0, 256, (num_frames, 3, height, width), dtype=torch.uint8)

# Preprocess the video
# The processor will handle resizing, normalization, and converting to the correct format
inputs = processor(videos=list(dummy_video), return_tensors="pt")

# Pass the preprocessed video through the model
with torch.no_grad(): # Disable gradient calculation for inference
    outputs = model(**inputs)

# 'last_hidden_state' holds one feature vector per video patch token,
# with shape (batch_size, num_tokens, hidden_size).
video_features = outputs.last_hidden_state
print(f"Shape of extracted video features: {video_features.shape}")

# These features can then feed downstream tasks (e.g., classification, anomaly detection).

Shape of extracted video features: torch.Size([1, 2048, 1408])
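
The sequence length of 2048 in the output above can be reproduced with a little arithmetic. The sketch below assumes the encoder groups frames into tubelets of 2 and cuts each frame into 16x16-pixel patches. Both values are assumptions that happen to match the printed shape; they are not read from the model in this notebook, and `count_video_tokens` is a hypothetical helper. The last dimension, 1408, is the size of each token's feature vector.

```python
def count_video_tokens(num_frames, height, width, tubelet_size=2, patch_size=16):
    """Number of patch tokens for one clip, assuming non-overlapping tubelets.

    tubelet_size and patch_size are assumed values, not read from the model config.
    """
    temporal_steps = num_frames // tubelet_size
    spatial_patches = (height // patch_size) * (width // patch_size)
    return temporal_steps * spatial_patches

# 16 frames at 256x256: 8 temporal steps x (16 x 16) spatial patches
print(count_video_tokens(16, 256, 256))  # 2048
```

If the assumptions hold, feeding more frames grows the token count linearly, which is worth keeping in mind for memory use.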


In [3]:
from transformers import AutoVideoProcessor, AutoModel
import torch

hf_repo = "facebook/vjepa2-vitg-fpc64-256"
model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# Dummy clip: 16 random uint8 RGB frames, shape (num_frames, channels, height, width)
num_frames = 16
height, width = 256, 256
dummy_video = torch.randint(0, 256, (num_frames, 3, height, width), dtype=torch.uint8)

# Preprocess
inputs = processor(videos=list(dummy_video), return_tensors="pt")

# Get model outputs - these might include predicted features for masked regions
with torch.no_grad():
    outputs = model(**inputs)

# The output object exposes several fields; print them to see what this
# checkpoint returns (the recorded output below lists 'last_hidden_state',
# 'masked_hidden_state', and 'predictor_output').
print(outputs.keys())
# For downstream tasks, 'last_hidden_state' is usually the most useful output.

odict_keys(['last_hidden_state', 'masked_hidden_state', 'predictor_output'])
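
For downstream tasks such as classification, a common first step is to collapse the per-token features into one vector per clip. The sketch below mean-pools over the token dimension and compares two clips by cosine similarity. It uses random tensors with the same shape as `last_hidden_state` as stand-ins for real model outputs. Mean pooling is only one simple choice, not something this notebook or V-JEPA 2 prescribes, and `pool_clip_features` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def pool_clip_features(last_hidden_state):
    """Mean-pool (batch, num_tokens, hidden) features into (batch, hidden)."""
    return last_hidden_state.mean(dim=1)

# Stand-ins for outputs.last_hidden_state from two different clips.
torch.manual_seed(0)
features_a = torch.rand(1, 2048, 1408)
features_b = torch.rand(1, 2048, 1408)

clip_a = pool_clip_features(features_a)
clip_b = pool_clip_features(features_b)

# One similarity score per batch element.
similarity = F.cosine_similarity(clip_a, clip_b, dim=-1)

print(clip_a.shape)      # torch.Size([1, 1408])
print(similarity.shape)  # torch.Size([1])
```

With real features, the pooled vectors could be passed to a small classifier trained on labeled clips.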


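The next cell reads a real video and keeps every k-th decoded frame, where k is the frame count divided by the target number of frames. Because that division is rounded down, the stride can leave the tail of the video unsampled. One alternative, shown here as a sketch rather than the notebook's method, is to pick indices spread from the first frame to the last. `evenly_spaced_indices` is a hypothetical helper that uses only the standard library.

```python
def evenly_spaced_indices(total_frames, num_samples):
    """Return up to num_samples frame indices spread from first to last frame."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    if total_frames <= num_samples:
        # Not enough frames to subsample: keep them all.
        return list(range(total_frames))
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

indices = evenly_spaced_indices(983, 16)
print(indices[0], indices[-1], len(indices))  # 0 982 16
```

For a 983-frame video, a fixed stride of 61 stops at frame 915, while the indices above reach frame 982. During decoding, a frame would be kept when its position is in `set(indices)`.
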
In [5]:
from transformers import AutoVideoProcessor, AutoModel
import torch
import av # PyAV library for video loading
import numpy as np
import os # To check if the file exists

# --- 1. Define the model and processor ---
hf_repo = "facebook/vjepa2-vitg-fpc64-256"
model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# --- 2. Specify the path to your video file ---
video_path = '/content/airplane-landing.mp4'

# --- Check if the video file exists ---
if not os.path.exists(video_path):
    print(f"Error: The video file '{video_path}' was not found.")
    print("Please ensure the video is uploaded to your Colab environment or the path is correct.")
else:
    # --- 3. Load and process the video ---
    frames = []
    try:
        # Open the container in a context manager so the file handle is
        # released even if decoding fails partway through.
        with av.open(video_path) as container:
            # Sample frames at a fixed stride, aiming for 16 frames. The "fpc64" in
            # the checkpoint name appears to indicate 64 frames per clip; 16 keeps
            # this demo light. Some containers report 0 for `frames`; max(1, ...)
            # then falls back to taking the first 16 consecutive frames.
            total_frames_in_video = container.streams.video[0].frames
            num_frames_to_sample = 16
            sampling_interval = max(1, total_frames_in_video // num_frames_to_sample)

            print(f"Loading video from: {video_path}")
            print(f"Total frames in video: {total_frames_in_video}")
            print(f"Sampling interval: {sampling_interval} frames")

            for i, frame in enumerate(container.decode(video=0)):
                if len(frames) >= num_frames_to_sample:
                    break
                if i % sampling_interval == 0:
                    img = frame.to_rgb().to_ndarray() # Convert to NumPy array (H, W, C)
                    frames.append(img)

        if not frames:
            # Stop here: there is nothing to pass to the processor.
            raise ValueError(f"No frames could be loaded from '{video_path}'. Check video integrity.")
        if len(frames) < num_frames_to_sample:
            print(f"Warning: Only {len(frames)} frames loaded. Model might expect {num_frames_to_sample}.")

        # The processor expects a list of NumPy arrays (H, W, C)
        inputs = processor(videos=frames, return_tensors="pt")

        # --- 4. Pass the processed video through the model ---
        print(f"Extracting features from {len(frames)} frames...")
        with torch.no_grad():
            outputs = model(**inputs)

        video_features = outputs.last_hidden_state
        print(f"Successfully extracted video features with shape: {video_features.shape}")

        print("\n--- Next Steps for Description ---")
        print("These 'video_features' are the model's numerical understanding of your video.")
        print("To get a human-readable description, you would need:")
        print("1.  **A Video Classification Model:** Train a classifier on a dataset of videos (or their V-JEPA features) labeled with categories (e.g., 'airplane landing', 'airplane takeoff', 'airport ground operations'). This classifier would then predict the most likely category for your video.")
        print("2.  **A Video Captioning Model:** Train a more advanced model that takes V-JEPA features as input and generates a descriptive sentence (e.g., 'A commercial airplane descends onto a runway and touches down.').")
        print("\nWithout such a pre-trained downstream model, I cannot provide a textual description directly from these numerical features.")

    except av.FFmpegError as e:
        print(f"Error loading video with PyAV: {e}")
        print("This might indicate an issue with the video file itself or FFmpeg installation.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        print("Ensure 'av' library is installed (`pip install av`) and the video format is supported.")

Loading video from: /content/airplane-landing.mp4
Total frames in video: 983
Sampling interval: 61 frames
Extracting features from 16 frames...
Successfully extracted video features with shape: torch.Size([1, 2048, 1408])

--- Next Steps for Description ---
These 'video_features' are the model's numerical understanding of your video.
To get a human-readable description, you would need:
1.  **A Video Classification Model:** Train a classifier on a dataset of videos (or their V-JEPA features) labeled with categories (e.g., 'airplane landing', 'airplane takeoff', 'airport ground operations'). This classifier would then predict the most likely category for your video.
2.  **A Video Captioning Model:** Train a more advanced model that takes V-JEPA features as input and generates a descriptive sentence (e.g., 'A commercial airplane descends onto a runway and touches down.').

Without such a pre-trained downstream model, I cannot provide a textual description directly from these numerical fea