## Step 1: Load and Explore the Preprocessed Dataset
In this step, we will:
1. Load the preprocessed dataset (`preprocessed_dataset.csv`) created in Notebook 1.
2. Explore the structure of the dataset to ensure it is ready for fine-tuning.

In [1]:
import pandas as pd

# Load the preprocessed dataset
preprocessed_data = pd.read_csv("preprocessed_dataset.csv")

# Display the first few rows to verify the structure
preprocessed_data.head()

Unnamed: 0,video_id,category_1,category_2,task_description,captions
0,nVbIUDjzWY4,Cars & Other Vehicles,Motorcycles,Paint a Motorcycle,"{'start': [13.64, 15.86, 20.6, 23.96, 26.36, 2..."
1,rwmt7Cbuvfs,Cars & Other Vehicles,Motorcycles,Paint a Motorcycle,"{'start': [1.8, 6.32, 7.32, 10.86, 13.28, 15.6..."
2,HnTLh99gcxY,Cars & Other Vehicles,Motorcycles,Paint a Motorcycle,"{'start': [0.03, 2.37, 4.29, 6.69, 8.42, 8.67,..."
3,RAidUDTPZ-k,Cars & Other Vehicles,Motorcycles,Paint a Motorcycle,"{'start': [0.06, 1.38, 3.03, 5.13, 7.44, 8.73,..."
4,tYQoPHwNkho,Cars & Other Vehicles,Motorcycles,Paint a Motorcycle,"{'start': [0.0, 6.93, 8.94, 11.07, 12.71, 15.2..."


## Step 2: Prepare Text Data for Fine-Tuning
In this step, we will:
1. Extract the `captions` as input data (source text) and `task_description` as target summaries.
2. Process the text data to ensure it is formatted correctly for the fine-tuning step.

In [2]:
# Extract source (captions) and target (task_description) text
source_text = preprocessed_data["captions"]
target_text = preprocessed_data["task_description"]

# Display a sample of the source and target text for verification
for i in range(3):  # Display the first 3 examples
    print(f"Source (Captions): {source_text[i]}")
    print(f"Target (Task Description): {target_text[i]}")
    print("-" * 50)

Source (Captions): {'start': [13.64, 15.86, 20.6, 23.96, 26.36, 29.36, 32.0, 35.33, 37.69, 40.67, 42.65, 44.57, 47.08, 48.89, 51.19, 53.96, 57.22, 59.23, 60.58, 62.78, 64.15, 66.35, 68.21000000000001, 69.83, 71.81, 73.82, 76.28, 77.78, 79.64, 81.2, 82.97, 86.06, 87.95, 89.3, 91.58, 93.11, 96.95, 99.41, 102.4, 104.65, 106.73, 110.63, 112.58, 114.83, 118.31, 120.47, 123.8, 125.54, 127.7, 130.0, 131.84, 132.62, 134.45, 136.06, 138.53, 140.29, 143.93, 145.97, 147.12, 149.48, 153.48, 155.67000000000004, 159.59, 161.57999999999996, 165.15, 165.66, 167.91, 170.16, 171.62, 174.98, 177.93, 179.73, 181.11, 183.54, 186.56, 188.43, 189.93, 191.26, 193.98, 195.75, 198.29, 202.29, 204.42, 205.68, 206.22, 208.59, 211.47, 213.03, 215.63, 216.93, 218.04, 219.56, 222.54, 225.9, 227.66, 230.48, 232.16, 233.73, 236.43, 238.79, 242.43, 243.93, 246.32, 248.25, 250.07, 252.35, 254.43, 257.45, 259.56, 262.65, 264.63, 266.37, 270.38, 272.52, 274.47, 279.68, 282.71, 286.49, 289.61, 294.08, 296.81, 298.67, 301.6

## Step 3: Process Captions into Plain Text
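The function below assumes each entry in the `captions` column is already a dictionary, so it does not call `eval()`. After `pd.read_csv`, however, the column holds strings; indexing a string with `"text"` raises `TypeError`, the `except` branch runs, and every result is an empty string, as this cell's output shows. The intended processing is to: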
1. Access the `text` key in the dictionaries stored in the `captions` column.
2. Concatenate all captions into a single string for each video.
3. Handle any missing or malformed captions gracefully.

## Simplified Processing of Captions
The function treats each `captions` entry as a dictionary and will:
1. Access the `text` key directly.
2. Concatenate the strings in the `text` list into a single string for each video.
3. Store the result in a new column, `processed_captions`.

In [3]:
# Simplified function to process captions into plain text
def process_captions(caption_data):
    try:
        # Extract the 'text' field and join all captions into a single string
        return " ".join(caption_data["text"])
    except (TypeError, KeyError):
        # Handle missing or malformed captions
        return ""

# Apply the function directly to the captions column
preprocessed_data["processed_captions"] = preprocessed_data["captions"].apply(process_captions)

# Display a sample of the processed captions
for i in range(3):  # Display the first 3 examples
    print(f"Processed Captions: {preprocessed_data['processed_captions'][i]}")
    print(f"Target (Task Description): {target_text[i]}")
    print("-" * 50)

Processed Captions: 
Target (Task Description): Paint a Motorcycle
--------------------------------------------------
Processed Captions: 
Target (Task Description): Paint a Motorcycle
--------------------------------------------------
Processed Captions: 
Target (Task Description): Paint a Motorcycle
--------------------------------------------------
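
The empty `Processed Captions` lines above are consistent with the captions having been read back from CSV as strings rather than dictionaries: indexing a string with `"text"` raises `TypeError`, so the function falls through to `return ""`. A minimal sketch of one possible fix follows. It assumes each cell holds a Python-literal dictionary with a `text` list; the helper name `parse_and_join_captions` and the sample cell are hypothetical, not taken from the dataset.

```python
import ast

def parse_and_join_captions(caption_data):
    """Return all caption text as one string, or "" if the cell is unusable."""
    # Cells read back from CSV are strings; parse them into a dict first.
    if isinstance(caption_data, str):
        try:
            caption_data = ast.literal_eval(caption_data)
        except (ValueError, SyntaxError):
            return ""
    try:
        return " ".join(str(piece) for piece in caption_data["text"])
    except (TypeError, KeyError):
        # Missing values (e.g. NaN) or dicts without a 'text' key
        return ""

# Made-up cell in the same layout as the 'captions' column
raw_cell = "{'start': [0.0, 2.5], 'text': ['sand the tank', 'apply primer']}"
print(parse_and_join_captions(raw_cell))  # sand the tank apply primer
```

It could be applied with `preprocessed_data["captions"].apply(parse_and_join_captions)`. `ast.literal_eval` accepts only Python literals, which makes it a safer choice than `eval()` for data loaded from a file.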


## Step 4: Split Data into Training, Validation, and Testing Sets

**What We Are Doing:**

**Purpose:** Divide the preprocessed data into three subsets to evaluate the model's performance during and after training.

**Steps:**
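1. Subsample the dataset (`frac=0.0001`, about 0.01% of the rows) so that video downloads and feature extraction stay manageable.
2. Split the reduced data into training + validation (90%) and testing (10%) subsets.
3. Further divide the training + validation subset so that training is about 80% and validation about 10% of the reduced data (`test_size=0.111` of the remaining 90%).
4. Shuffle the data to ensure examples are well-distributed.
5. Set a random seed for reproducibility of results.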
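
The second split uses a fraction of the remaining rows, not of the whole dataset, which is where a value like 0.111 (0.1 / 0.9) comes from. The sketch below works through that arithmetic. It assumes the held-out share is rounded up at each stage, which matches the sizes this notebook reports for 124 rows; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_frac=0.1, val_frac=0.1):
    """Sizes of (train, val, test) for a two-stage split."""
    # Stage 1: hold out the test share of the whole dataset (rounded up).
    n_test = math.ceil(test_frac * n_rows)
    n_train_val = n_rows - n_test
    # Stage 2: express the validation share relative to the rows that remain.
    val_frac_of_rest = val_frac / (1.0 - test_frac)  # 0.1 / 0.9 is about 0.111
    n_val = math.ceil(val_frac_of_rest * n_train_val)
    return n_train_val - n_val, n_val, n_test

print(split_sizes(124))  # (98, 13, 13)
```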

In [4]:
# Step 4: Split Data into Training, Validation, and Testing Sets with Dataset Reduction
from sklearn.model_selection import train_test_split

# Display the original dataset size
print(f"Original dataset size: {len(preprocessed_data)}")

# Reduce the dataset to 0.01% of the original size (frac=0.0001) for faster processing
reduced_data = preprocessed_data.sample(frac=0.0001, random_state=42)

# Display the reduced dataset size
print(f"Reduced dataset size: {len(reduced_data)}")

# Optional: Inspect the distribution of any categories (if available)
if 'category' in reduced_data.columns:  # Adjust column name if needed
    print("Distribution of categories in reduced dataset:")
    print(reduced_data['category'].value_counts())

# Split the reduced dataset into train+validation (90%) and test (10%)
train_val_data, test_data = train_test_split(
    reduced_data, test_size=0.1, random_state=42, shuffle=True
)

# Further split train+validation so training is ~80% and validation ~10% of the reduced data (0.111 ≈ 0.1 / 0.9)
train_data, val_data = train_test_split(
    train_val_data, test_size=0.111, random_state=42, shuffle=True
)

# Display the sizes of the datasets
print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")
print(f"Testing set size: {len(test_data)}")

# Print unique values in category_1 and category_2 columns
print("Unique values in category_1:")
print(reduced_data['category_1'].unique())

print("\nUnique values in category_2:")
print(reduced_data['category_2'].unique())

# Display a few rows from each dataset to verify
print("\nSample training data:")
print(train_data.head())

print("\nSample validation data:")
print(val_data.head())

print("\nSample testing data:")
print(test_data.head())

Original dataset size: 1238867
Reduced dataset size: 124
Training set size: 98
Validation set size: 13
Testing set size: 13
Unique values in category_1:
['Food and Entertaining' 'Hobbies and Crafts' 'Pets and Animals'
 'Personal Care and Style' 'Home and Garden'
 'Education and Communications' 'Cars & Other Vehicles' 'Health'
 'Holidays and Traditions' 'Arts and Entertainment' 'Sports and Fitness'
 'Computers and Electronics']

Unique values in category_2:
['Recipes' 'Tricks and Pranks' 'Dogs' 'Drinks' 'Grooming'
 'Food Preparation' 'Gardening' 'Housekeeping' 'Speaking'
 'Care and Use of Cooking Equipment' 'Driving Techniques' 'Crafts'
 'Alternative Health' 'Tools' 'Parties' 'Easter' 'Subjects'
 'Landscaping and Outdoor Building' 'Herbs and Spices' 'Music'
 'Home Improvements and Repairs' 'Collecting' 'Individual Sports'
 "Mother's Day" 'Cars' 'Crustaceans' 'Toys' 'Motorcycles' 'Vehicle Sports'
 'Holiday Cooking' 'Bicycles' 'TV and Home Audio' nan 'Woodworking'
 'Emotional Health']

Sa

## Step 5: Download Videos from YouTube

This step will:

1. Use the video IDs in the training, validation, and testing datasets.
2. Download videos using **yt-dlp**.
3. Save the videos in a structured directory (e.g., `videos/train/`, `videos/val/`, `videos/test/`).

In [15]:
from yt_dlp import YoutubeDL
import os
import warnings
import logging

# Suppress warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Create a logger to suppress yt-dlp logs
logging.basicConfig(level=logging.CRITICAL)

# Define the output directories for videos
video_dirs = {
    "train": "videos/train/",
    "val": "videos/val/",
    "test": "videos/test/"
}

# Create directories if they don't exist
for dir_path in video_dirs.values():
    os.makedirs(dir_path, exist_ok=True)

# Function to download a video using yt-dlp
def download_video(video_id, output_dir):
    url = f"https://www.youtube.com/watch?v={video_id}"
    ydl_opts = {
        "format": "best",  # Best available quality
        "outtmpl": f"{output_dir}/{video_id}.mp4",  # Output filename
        "quiet": True,  # Suppress yt-dlp logs
        "no_warnings": True,  # Suppress yt-dlp warnings
        "logger": logging.getLogger()  # Suppress yt-dlp messages
    }
    with YoutubeDL(ydl_opts) as ydl:
        try:
            ydl.download([url])
            print(".", end="", flush=True)  # Print a dot for success
        except Exception:
            print("x", end="", flush=True)  # Print an 'x' for failure

# Download videos for each dataset
print("Downloading training videos...")
for video_id in train_data["video_id"]:
    download_video(video_id, video_dirs["train"])

print("\nDownloading validation videos...")
for video_id in val_data["video_id"]:
    download_video(video_id, video_dirs["val"])

print("\nDownloading testing videos...")
for video_id in test_data["video_id"]:
    download_video(video_id, video_dirs["test"])

print("\n\nVideo downloads complete.")  # Print a new line after downloads

Downloading training videos...
.................................x.x....x......x...x.x.....xx............x.......x.....x..........
Downloading validation videos...
......x.xx..x
Downloading testing videos...
.x.......x...

Video downloads complete.
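
Each `x` in the progress output marks a failed download, so some video IDs in every split have no file on disk. Before feature extraction it can help to list them explicitly. The following is a small sketch, assuming files are named `<video_id>.mp4` as in the download function; `find_missing_videos` and the placeholder IDs are hypothetical.

```python
import os
import tempfile

def find_missing_videos(video_ids, video_dir, ext=".mp4"):
    """Return the IDs that have no downloaded file in video_dir."""
    return [vid for vid in video_ids
            if not os.path.isfile(os.path.join(video_dir, f"{vid}{ext}"))]

# Self-contained demo with placeholder IDs and an empty stand-in file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "aaa.mp4"), "w").close()
    missing = find_missing_videos(["aaa", "bbb"], demo_dir)

print(missing)  # ['bbb']
```

In the notebook this could be called as `find_missing_videos(train_data["video_id"], video_dirs["train"])`, and likewise for the other splits.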


## Step 6: Preprocess Videos for Feature Extraction

This step will:

1. Extract frames or embeddings from the videos.
2. Save the extracted features in structured directories for training, validation, and testing datasets.
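3. Use OpenCV to read frames and a pre-trained ResNet-50 from torchvision to compute one vector per sampled frame (the model's 1000-dimensional ImageNet classifier output). Other pre-trained models, such as CLIP, could be substituted.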
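
Frame extraction at a fixed rate amounts to keeping every k-th frame, where k is the video's frames per second divided by the desired sampling rate. The sketch below shows only that index arithmetic, with no video I/O. The function name is hypothetical, and the `max(1, ...)` guard is an addition that avoids a zero interval when the FPS cannot be read.

```python
def sampled_frame_indices(total_frames, fps, frames_per_second=1):
    """Indices of the frames kept when sampling a video at a fixed rate."""
    # Keep every `interval`-th frame; never let the interval drop to zero.
    interval = max(1, int(fps) // frames_per_second)
    return list(range(0, total_frames, interval))

indices = sampled_frame_indices(total_frames=100, fps=30, frames_per_second=1)
print(indices)  # [0, 30, 60, 90]
```

Because the FPS is truncated to an integer, a 29.97 fps video is sampled every 29 frames, so the effective rate is slightly above one frame per second.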


In [18]:
import os
import cv2
import torch
from torchvision import transforms
from PIL import Image
from tqdm import tqdm
import numpy as np

# Define directories for the videos and features
video_dirs = {
    "train": "videos/train/",
    "val": "videos/val/",
    "test": "videos/test/",
}

feature_dirs = {
    "train": "features/train/",
    "val": "features/val/",
    "test": "features/test/",
}

# Create feature directories if they don't exist
for dir_path in feature_dirs.values():
    os.makedirs(dir_path, exist_ok=True)

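# Pre-trained ResNet-50; with its classification head left in place it outputs 1000 ImageNet logits per frame
from torchvision.models import resnet50, ResNet50_Weights

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Same weights as the deprecated `pretrained=True` (requires torchvision >= 0.13)
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model = model.eval().to(device)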

# Transform to preprocess video frames
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_frames(video_path, frame_rate=1):
    """
    Extract frames from a video file at the given frame rate.
    """
    frames = []
    cap = cv2.VideoCapture(video_path)
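    fps = int(cap.get(cv2.CAP_PROP_FPS))
    # Guard against a zero interval (unreadable FPS or frame_rate > fps), which would break the modulo below
    frame_interval = max(1, fps // frame_rate)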

    while cap.isOpened():
        frame_id = int(cap.get(cv2.CAP_PROP_POS_FRAMES))
        ret, frame = cap.read()
        if not ret:
            break
        if frame_id % frame_interval == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB
            frames.append(Image.fromarray(frame))
    cap.release()
    return frames

def extract_features(frames, model, transform, device):
    """
    Extract features for a list of frames using a pre-trained model.
    """
    features = []
    for frame in frames:
        frame_tensor = transform(frame).unsqueeze(0).to(device)
        with torch.no_grad():
            feature = model(frame_tensor).cpu().numpy()
        features.append(feature)
    return features

# Process videos for each dataset
for split, video_dir in video_dirs.items():
    print(f"Processing {split} videos...")
    for video_file in tqdm(os.listdir(video_dir)):
        video_path = os.path.join(video_dir, video_file)
        if not video_file.endswith(".mp4"):
            continue

        # Extract frames and features
        frames = extract_frames(video_path)
        features = extract_features(frames, model, transform, device)

        # Save features as a numpy file
        feature_file = os.path.join(feature_dirs[split], f"{os.path.splitext(video_file)[0]}.npy")
        np.save(feature_file, features)

print("Feature extraction complete!")

Processing train videos...


100%|██████████| 87/87 [05:18<00:00,  3.66s/it]


Processing val videos...


100%|██████████| 9/9 [00:36<00:00,  4.08s/it]


Processing test videos...


100%|██████████| 11/11 [00:51<00:00,  4.70s/it]

Feature extraction complete!





## Step 7: Visualize Extracted Features
This step will:

1. Load a few `.npy` files from the structured directories (e.g., `features/train/`, `features/val/`, `features/test/`).
2. Visualize or print out the contents and structure of the feature arrays to ensure that the extraction process worked as expected.
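
The saved arrays have three dimensions because each frame yields a `(1, 1000)` model output and `np.save` stacks the Python list of those outputs into a single array. A small sketch with zero-filled stand-ins instead of real model outputs illustrates the resulting layout:

```python
import os
import tempfile
import numpy as np

# Three stand-in "frames", each shaped like one model output: (1, 1000).
per_frame = [np.zeros((1, 1000), dtype=np.float32) for _ in range(3)]

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "demo.npy")
    np.save(path, per_frame)  # the list is stacked into one array
    loaded = np.load(path)

print(loaded.shape)                  # (3, 1, 1000)
print(loaded.squeeze(axis=1).shape)  # (3, 1000)
```

The first axis is the number of sampled frames, so it varies from video to video, while the last axis is the per-frame vector length.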


In [5]:
import numpy as np
import os

# Define the paths to the feature directories
feature_dirs = {
    "train": "features/train/",
    "val": "features/val/",
    "test": "features/test/"
}

# Function to visualize a few `.npy` files
def inspect_features(feature_dir, num_samples=3):
    print(f"Inspecting features in {feature_dir}...\n")
    files = os.listdir(feature_dir)
    npy_files = [file for file in files if file.endswith(".npy")]
    
    # Display information for a few `.npy` files
    for i, npy_file in enumerate(npy_files[:num_samples]):
        feature_path = os.path.join(feature_dir, npy_file)
        features = np.load(feature_path)
        print(f"File: {npy_file}")
        print(f"Shape: {features.shape}")
        print(f"Sample Data: {features[:5]}")  # Print the vectors for the first 5 frames
        print("-" * 50)

# Inspect features in each dataset
for dataset, feature_dir in feature_dirs.items():
    if os.path.exists(feature_dir):
        inspect_features(feature_dir)
    else:
        print(f"Feature directory '{feature_dir}' does not exist.\n")

Inspecting features in features/train/...

File: S0rZQ8d-z_c.npy
Shape: (97, 1, 1000)
Sample Data: [[[-0.52196467  1.6278666   0.21541211 ... -2.037949    2.278176
    3.215495  ]]

 [[-0.5318004   3.0022192   1.0437046  ... -2.660032    1.0309241
    3.0269563 ]]

 [[-0.35514843  3.2112522  -1.7236078  ... -2.6421385  -0.50810766
    0.46100932]]

 [[-2.37383     2.0584176  -0.9910927  ... -2.2873552   0.19965419
    1.1317472 ]]

 [[-2.1626418   1.8039418  -0.5193502  ... -1.9179364   0.19929169
    0.6916184 ]]]
--------------------------------------------------
File: WEA1XkYkPO4.npy
Shape: (817, 1, 1000)
Sample Data: [[[-5.8233482e-01  6.8116599e-01 -2.9223970e-01 ... -1.6159886e+00
   -1.6310718e-05  1.5598352e+00]]

 [[-7.9267472e-03  1.0051288e-01 -1.4209548e+00 ... -2.5470219e+00
    2.2099760e+00  3.0678093e+00]]

 [[ 1.2105324e+00  9.4390994e-01 -1.0709611e+00 ... -2.0749018e+00
    3.6800654e+00  3.3206577e+00]]

 [[ 5.5020666e-01  6.6695499e-01 -1.4132812e+00 ... -2.6674950

## Step 8: Prepare Feature Data for Model Training

This step will:

1. Load the extracted features from the `features/train/`, `features/val/`, and `features/test/` directories.
2. Pair the features with their corresponding labels from the training, validation, and testing datasets.
3. Prepare the data for input into a machine learning or deep learning model by:
   - Creating feature-label pairs.
   - Organizing the data into a format compatible with training frameworks like PyTorch or TensorFlow.
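
Since every video has a different number of frames, batching requires bringing each feature array to a common length. The sketch below pads or truncates along the frame axis and also returns a mask that marks the real frames. The mask is an addition for illustration and is not used elsewhere in this notebook; `pad_or_truncate` is a hypothetical helper.

```python
import numpy as np

def pad_or_truncate(features, max_length):
    """Return (fixed_length_features, mask) for a (num_frames, dim) array."""
    num_frames, dim = features.shape
    kept = min(num_frames, max_length)
    fixed = np.zeros((max_length, dim), dtype=features.dtype)
    fixed[:kept] = features[:kept]
    # mask is 1.0 for real frames and 0.0 for zero padding
    mask = np.zeros(max_length, dtype=np.float32)
    mask[:kept] = 1.0
    return fixed, mask

demo = np.ones((3, 4), dtype=np.float32)  # 3 frames of 4-dimensional toy features
fixed, mask = pad_or_truncate(demo, max_length=5)
print(fixed.shape, mask.tolist())  # (5, 4) [1.0, 1.0, 1.0, 0.0, 0.0]
```

A mask like this lets later pooling or attention steps ignore the padded positions instead of treating zeros as real frames.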


In [6]:
from sklearn.preprocessing import LabelEncoder
import numpy as np
import os
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

# Combine all unique labels from train, val, and test datasets
all_labels = pd.concat([
    train_data["task_description"],
    val_data["task_description"],
    test_data["task_description"]
]).unique()

# Fit the LabelEncoder on all possible labels
label_encoder = LabelEncoder()
label_encoder.fit(all_labels)

# Debugging: Check label encoding
print("All Unique Labels Encoded:", label_encoder.classes_)

# Encode labels for each dataset
train_data["task_description_encoded"] = label_encoder.transform(train_data["task_description"])
val_data["task_description_encoded"] = label_encoder.transform(val_data["task_description"])
test_data["task_description_encoded"] = label_encoder.transform(test_data["task_description"])

# Debugging: Check encoded labels
print("\nEncoded Labels in Training Set:", train_data["task_description_encoded"].unique())
print("Encoded Labels in Validation Set:", val_data["task_description_encoded"].unique())
print("Encoded Labels in Testing Set:", test_data["task_description_encoded"].unique())


# Dataset class with enhanced handling and proper dimension normalization
class VideoFeatureDataset(Dataset):
    def __init__(self, feature_dir, label_data, max_length=1000):
        self.feature_dir = feature_dir
        self.label_data = label_data
        self.max_length = max_length

        # Filter rows with missing or invalid features
        initial_count = len(label_data)
        self.valid_data = self.label_data[self.label_data["video_id"].apply(self._is_valid_feature)].reset_index(drop=True)
        filtered_count = initial_count - len(self.valid_data)

        if len(self.valid_data) == 0:
            print(f"All {initial_count} samples were invalid. Please check your .npy files.")
            raise ValueError("No valid samples found in the dataset.")
        elif filtered_count > 0:
            print(f"{filtered_count} samples were excluded due to missing or invalid .npy files.")

    def _is_valid_feature(self, video_id):
        feature_path = os.path.join(self.feature_dir, f"{video_id}.npy")
        if not os.path.exists(feature_path):
            print(f"[{self.feature_dir}] Missing file: {feature_path}")
            return False
        try:
            feature = np.load(feature_path)
            # Ensure proper dimensions: (N, 1, 1000) -> (N, 1000)
            if feature.ndim == 3 and feature.shape[1] == 1:
                feature = feature.squeeze(axis=1)
            # Validate final dimensions
            if feature.ndim != 2 or feature.shape[1] != 1000:
                print(f"[{self.feature_dir}] Invalid dimensions for {video_id}: {feature.shape}")
                return False
        except Exception as e:
            print(f"[{self.feature_dir}] Error loading file {video_id}: {e}")
            return False
        return True

    def __len__(self):
        return len(self.valid_data)

    def __getitem__(self, idx):
        video_id = self.valid_data.iloc[idx]["video_id"]
        label = self.valid_data.iloc[idx]["task_description_encoded"]

        # Load the feature from .npy file
        feature_path = os.path.join(self.feature_dir, f"{video_id}.npy")
        feature = np.load(feature_path)

        # Normalize dimensions if needed
        if feature.ndim == 3 and feature.shape[1] == 1:
            feature = feature.squeeze(axis=1)  # Convert (N, 1, 1000) -> (N, 1000)

        # Pad or truncate the feature to max_length
        if feature.shape[0] > self.max_length:
            feature = feature[:self.max_length, :]
        elif feature.shape[0] < self.max_length:
            pad_length = self.max_length - feature.shape[0]
            feature = np.pad(feature, ((0, pad_length), (0, 0)), mode='constant')

        # Ensure the final feature shape matches (max_length, 1000)
        assert feature.shape == (self.max_length, 1000), f"Feature shape mismatch: {feature.shape}"

        return {
            "feature": torch.tensor(feature, dtype=torch.float32),
            "label": torch.tensor(label, dtype=torch.long)
        }


# Reinitialize datasets and DataLoaders
try:
    train_dataset = VideoFeatureDataset("features/train/", train_data, max_length=1000)
    val_dataset = VideoFeatureDataset("features/val/", val_data, max_length=1000)
    test_dataset = VideoFeatureDataset("features/test/", test_data, max_length=1000)

    print(f"Training set size: {len(train_dataset)}")
    print(f"Validation set size: {len(val_dataset)}")
    print(f"Testing set size: {len(test_dataset)}")

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

    # Debugging: Inspect a batch from the train DataLoader
    for batch_idx, batch in enumerate(train_loader):
        print(f"Batch {batch_idx} Feature Shape: {batch['feature'].shape}")
        print(f"Batch {batch_idx} Label Shape: {batch['label'].shape}")
        print(f"Sample Labels: {batch['label'][:5].numpy()}")
        break  # Process only the first batch to locate issues

except ValueError as e:
    print(e)


# Debugging: Verify all features and their shapes
for video_id in train_data["video_id"]:
    feature_path = os.path.join("features/train", f"{video_id}.npy")
    if os.path.exists(feature_path):
        feature = np.load(feature_path)
        print(f"Video ID: {video_id}, Feature Shape: {feature.shape}")

All Unique Labels Encoded: ['Adjust to Driving a Car on the Right Side of the Road'
 'Ask a Guy Out over Text' 'Bake a Ring Into a Cake or Other Food'
 'Braze Aluminum' 'Bread Chicken' 'Build Shoe Insoles' 'Can Corn'
 'Care for Lawn Tools' 'Change a Ukulele String'
 'Change the Batteries in a Buzz Lightyear Action Figure' 'Clean Gold'
 'Clean a Dishwasher with Vinegar' 'Clean a Man Cave' 'Cook Egg Whites'
 'Cook Honey Glazed Parsnips' 'Cook a Chuck Roast'
 'Create Window Valances from Cardboard Boxes'
 'Create a Hermit Crab Habitat' 'Declutter an Entryway' 'Dig Post Holes'
 'Drift a Car' "Dye Men's Hair" 'Filter Fry Oil for Reuse'
 'Fix Sticky Drawers' 'Fix the Crotch Hole in Your Jeans'
 'Freeze Tomatillos' 'Freeze Your Smoothie Greens' 'Fringe a Shirt'
 'Get Children Involved With Science' 'Get Rid of Moles in Your Garden'
 'Get Rid of a Beehive' 'Grill Turkey' 'Groom a Longhair Dachshund'
 'Grow Gerbera Daisies' 'Grow Herbs in Pots' 'Grow Kohlrabi'
 'Grow Redwoods from Seed' 'Grow V

## Step 9: Fine-Tune Multimodal Model

Now that we have extracted and prepared the features, we will:
1. Combine the visual and textual features (this notebook extracts no audio features).
2. Fine-tune a multimodal model to predict task descriptions using these features.

### Workflow:
- Align visual features with textual captions and task descriptions.
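- Train a fusion model; the code below combines a linear projection of the ResNet features with a pre-trained BERT text encoder, rather than a full multimodal architecture like CLIP or VideoBERT.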
- Evaluate the model's performance on validation and testing datasets.

In [7]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.cuda.amp import GradScaler, autocast

# Define Multimodal Model
class MultimodalModel(nn.Module):
    def __init__(self, visual_dim, text_dim, hidden_dim, output_dim):
        super(MultimodalModel, self).__init__()
        # Visual embedding layer
        self.visual_fc = nn.Linear(visual_dim, hidden_dim)
        # Text embedding layer (using pre-trained BERT)
        self.text_model = BertModel.from_pretrained('bert-base-uncased')
        self.text_fc = nn.Linear(text_dim, hidden_dim)
        # Fusion and output layers
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, visual_features, input_ids, attention_mask):
        # Process visual features
        visual_out = self.visual_fc(visual_features)
        # Process text features
        text_out = self.text_model(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        text_out = self.text_fc(text_out)
        # Concatenate and predict
        combined = torch.cat((visual_out, text_out), dim=1)
        output = self.fc(combined)
        return output

# Load tokenizer for text processing
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Prepare text data for fine-tuning
def tokenize_texts(texts, max_length=128):
    tokenized = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )
    return tokenized['input_ids'], tokenized['attention_mask']

# Example batch preparation
batch = next(iter(train_loader))  # Fetch a batch from the train DataLoader
visual_features = batch['feature']  # Visual features from the dataset
labels = batch['label']  # Task description labels

# Extract captions for the current batch using label indices
batch_labels = batch['label'].tolist()  # Convert tensor to list
batch_captions = []

for label in batch_labels:
    matching_row = train_data[train_data['task_description_encoded'] == label]
    if not matching_row.empty:
        batch_captions.append(matching_row['processed_captions'].values[0])
    else:
        batch_captions.append("")  # Fallback if no match is found

# Tokenize the text captions for the batch
input_ids, attention_mask = tokenize_texts(batch_captions)

# Initialize the model
visual_dim = 1000  # Dimension of visual features
text_dim = 768    # BERT embedding dimension
hidden_dim = 512  # Hidden dimension for fusion
output_dim = len(label_encoder.classes_)  # Number of task descriptions
model = MultimodalModel(visual_dim, text_dim, hidden_dim, output_dim)

# Move model to multi-GPU setup if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
model = model.to(device)

# Move batch data to GPU
visual_features = visual_features.to(device)
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
labels = labels.to(device)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# Mixed precision training setup
scaler = GradScaler()

# Adjust batch size dynamically if memory errors occur
def safe_train_step(model, visual_features, input_ids, attention_mask, labels, criterion, optimizer, scaler):
    try:
        model.train()
        optimizer.zero_grad()

        # Mixed precision training
        with autocast():
            outputs = model(visual_features, input_ids, attention_mask)
            loss = criterion(outputs, labels)

        # Scale loss and backpropagate
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Clear unused memory
        torch.cuda.empty_cache()

        print(f"Training loss for the batch: {loss.item()}")
        return True  # Training successful
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("Out of memory error encountered. Reducing batch size or model complexity may help.")
            torch.cuda.empty_cache()
            return False  # Training failed
        else:
            raise  # Re-raise unexpected errors

# Attempt to train with dynamic batch adjustment
success = safe_train_step(model, visual_features, input_ids, attention_mask, labels, criterion, optimizer, scaler)

if not success:
    # If out-of-memory, reduce batch size and retry
    reduced_batch_size = len(batch['feature']) // 2
    if reduced_batch_size > 0:
        print(f"Retrying with reduced batch size: {reduced_batch_size}")
        train_loader = DataLoader(train_dataset, batch_size=reduced_batch_size, shuffle=True)
        batch = next(iter(train_loader))  # Fetch new batch with reduced size
        visual_features = batch['feature'].to(device)
        labels = batch['label'].to(device)
        batch_labels = batch['label'].tolist()
        batch_captions = [
            train_data[train_data['task_description_encoded'] == label]['processed_captions'].values[0]
            if not train_data[train_data['task_description_encoded'] == label].empty
            else ""
            for label in batch_labels
        ]
        input_ids, attention_mask = tokenize_texts(batch_captions)
        safe_train_step(model, visual_features, input_ids, attention_mask, labels, criterion, optimizer, scaler)

Using 4 GPUs!


  scaler = GradScaler()
  with autocast():


RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data1/reu/.local/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 84, in _worker
    output = module(*input, **kwargs)
  File "/data1/reu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data1/reu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tmp/ipykernel_2571734/2556762115.py", line 27, in forward
    combined = torch.cat((visual_out, text_out), dim=1)
RuntimeError: Tensors must have same number of dimensions: got 3 and 2
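
The traceback points at `torch.cat` receiving inputs with 3 and 2 dimensions. That is consistent with the shapes in this notebook: each batch of visual features is `(batch, 1000, 1000)` (frames by feature size), so `visual_fc` produces `(batch, 1000, hidden_dim)`, while the BERT pooled output passed through `text_fc` is `(batch, hidden_dim)`. One possible fix, offered here as an assumption and not as the notebook's method, is to pool over the frame axis before concatenating, for example `visual_out.mean(dim=1)` in PyTorch. The sketch below reproduces the shape logic with NumPy arrays standing in for tensors and toy sizes in place of the real ones.

```python
import numpy as np

batch, frames, hidden = 2, 5, 8  # toy sizes; the notebook uses 1000 frames and hidden_dim=512

visual_out = np.random.rand(batch, frames, hidden)  # per-frame projections: 3 dimensions
text_out = np.random.rand(batch, hidden)            # one vector per example: 2 dimensions

# Concatenating these directly fails because the number of dimensions differs.
try:
    np.concatenate((visual_out, text_out), axis=1)
    concat_failed = False
except ValueError:
    concat_failed = True

# Average over the frame axis so both inputs are (batch, hidden).
visual_pooled = visual_out.mean(axis=1)
combined = np.concatenate((visual_pooled, text_out), axis=1)
print(concat_failed, combined.shape)  # True (2, 16)
```

A plain mean also averages in the zero-padded frames added during dataset preparation; weighting by a mask of real frames would avoid that.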
