<a href="https://colab.research.google.com/github/ashwin-yedte/visual-intelligence-travel-finance/blob/main/VL_Encoding_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**VL ENCODING FRAMEWORK** - WITH COMPREHENSIVE IMAGE PREPROCESSING
Visual-Language Encoding Infrastructure for Indian Travel Destinations

COMPLETE FEATURE SET:
1. CLIP tensor extraction bug fix
2. Comprehensive image preprocessing (EXIF, RGB, aspect ratio)
3. Image validation and quality checks
4. Prompt validation (normalization, token limits)
5. Quality reports and statistics
6. Error recovery and logging

PREPROCESSING PIPELINES:
- IMAGE: EXIF orientation → RGB conversion → Aspect ratio → Padding → Validation
- PROMPT: Token validation → Duplicate detection → Quality metrics

OBJECTIVES:
1. Pre-compute CLIP embeddings for all landmark images
2. Extract semantic prompts using the 2,200-prompt CLIP library
3. Create searchable database for similarity matching
4. Enable two-stage matching (visual + semantic)


WORKFLOW:
- Stage 1: Pre-compute prompt embeddings (one-time, 2,200 prompts)
- Stage 2: Process each image (extract CLIP embedding + prompts)
- Stage 3: Aggregate per destination
- Stage 4: Create search indices
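
The workflow above can be illustrated with a small NumPy sketch. It is one plausible reading of "two-stage matching", not the notebook's implementation: the helper names (`aggregate_destination`, `two_stage_search`), the mean-pooling aggregation, the Jaccard re-ranking, and the toy 4-dimensional vectors are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def aggregate_destination(image_embeddings):
    """Stage 3 (assumed): average the unit image embeddings, then re-normalize."""
    return l2_normalize(l2_normalize(image_embeddings).mean(axis=0))

def jaccard(a, b):
    """Overlap between two prompt sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def two_stage_search(query_vec, query_prompts, dest_vecs, dest_prompts, top_k=2):
    """Visual stage: shortlist by cosine similarity. Semantic stage: re-rank by prompt overlap."""
    names = list(dest_vecs)
    matrix = np.stack([dest_vecs[n] for n in names])
    visual = matrix @ l2_normalize(query_vec)          # dot product of unit vectors = cosine
    shortlist = [names[i] for i in np.argsort(-visual)[:top_k]]
    return sorted(shortlist, key=lambda n: jaccard(query_prompts, dest_prompts[n]), reverse=True)

# Toy data: hypothetical destinations with made-up embeddings and prompts
dest_vecs = {
    "goa": aggregate_destination([[1, 0, 0, 0], [0.9, 0.1, 0, 0]]),
    "jaipur": aggregate_destination([[0, 1, 0, 0]]),
    "leh": aggregate_destination([[0, 0, 1, 0]]),
}
dest_prompts = {
    "goa": ["sandy beach", "palm trees"],
    "jaipur": ["pink sandstone fort"],
    "leh": ["snowy mountain pass"],
}
ranked = two_stage_search([0.8, 0.6, 0, 0], ["pink sandstone fort"], dest_vecs, dest_prompts)
print(ranked)
```

In this toy run the visual stage shortlists the two closest destinations, and the semantic stage moves the one sharing the query's prompt to the top.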

================================================================

INSTALL PACKAGES

================================================================

In [34]:
!pip install -q transformers torch pillow tqdm
import warnings
warnings.filterwarnings('ignore')

print("="*80)
print("PACKAGES INSTALLED")
print("="*80)

PACKAGES INSTALLED


================================================================

IMPORT PACKAGES

================================================================

In [35]:
from google.colab import drive
import os
import json
import numpy as np
from PIL import Image, ImageOps
import torch
from transformers import CLIPModel, CLIPProcessor
from tqdm import tqdm
from datetime import datetime
import pickle
import warnings
from typing import Tuple, Dict, Any, List

warnings.filterwarnings('ignore')



================================================================

SETUP AND MOUNT GOOGLE DRIVE

================================================================


In [36]:
print("="*80)
print("VISUAL LANGUAGE ENCODING FRAMEWORK")
print("="*80)

drive.mount('/content/drive')

BASE_PATH = '/content/drive/MyDrive/visual-intelligence-travel-finance'

LANDMARKS_PATH = f'{BASE_PATH}/data/landmarks'
METADATA_PATH = f'{LANDMARKS_PATH}/metadata.json'
PROMPT_LIBRARY_PATH = f'{BASE_PATH}/data/prompt_library/clip_prompts_india_themes_semantic.json'

VL_ENCODING_PATH = f'{BASE_PATH}/data/vl_encoding'
EMBEDDINGS_PATH = f'{VL_ENCODING_PATH}/embeddings'
PROMPTS_PATH = f'{VL_ENCODING_PATH}/prompts'
REPORTS_PATH = f'{VL_ENCODING_PATH}/reports'

os.makedirs(EMBEDDINGS_PATH, exist_ok=True)
os.makedirs(f'{PROMPTS_PATH}/image_prompts', exist_ok=True)
os.makedirs(REPORTS_PATH, exist_ok=True)

print("Directories created")
print("="*80)




VISUAL LANGUAGE ENCODING FRAMEWORK
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Directories created


================================================================

STEP 1: PROMPT VALIDATION

================================================================

    
    Validates and analyzes text prompts for CLIP processing.
    Features:
    - Token length validation (CLIP limit: 77 tokens)    
    - Quality metrics and reporting
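
The pipeline overview also lists normalization and duplicate detection, which the `PromptValidator` class shown here does not perform. A minimal standard-library sketch of how such a check might look follows; the function names `normalize_prompt` and `find_duplicate_prompts` and the toy library are hypothetical, and the normalization rule (lowercase, collapsed whitespace) is an assumption.

```python
import re
from collections import defaultdict

def normalize_prompt(prompt: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", prompt.strip().lower())

def find_duplicate_prompts(prompt_library: dict) -> dict:
    """Map each normalized prompt that occurs more than once to its prompt IDs."""
    seen = defaultdict(list)
    for theme, categories in prompt_library.items():
        for category, prompts in categories.items():
            for idx, prompt in enumerate(prompts):
                # Same ID scheme as validate_prompt_library: theme/category/index
                seen[normalize_prompt(prompt)].append(f"{theme}/{category}/{idx}")
    return {text: ids for text, ids in seen.items() if len(ids) > 1}

# Toy library in the themes -> categories -> prompts layout
toy_library = {
    "Beach": {"LandscapeType": ["a sandy beach at sunset", "A  sandy beach at sunset "]},
    "Heritage": {"Architecture": ["a sandstone fort on a hill"]},
}
duplicates = find_duplicate_prompts(toy_library)
print(duplicates)
```

Duplicates matter here because two identical prompts yield identical embeddings and would be counted twice in any prompt-based score.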
    

In [37]:
print("\n" + "="*80)
print("PROMPT VALIDATION MODULE")
print("="*80)

class PromptValidator:
    def __init__(self, processor: CLIPProcessor):
        """
        Initialize validator with CLIP processor.

        Args:
            processor: CLIPProcessor for tokenization
        """
        self.processor = processor
        self.validation_log = []
        print("PromptValidator initialized")

    def validate_prompt(self, prompt: str, prompt_id: str = None) -> Dict[str, Any]:
        """
        Validate a single prompt.

        Args:
            prompt: Text prompt to validate
            prompt_id: Optional identifier for logging

        Returns:
            Dictionary with validation results
        """
        issues = []
        warnings = []

        if not prompt or not prompt.strip():
            issues.append("Empty prompt")
            return {
                'valid': False,
                'issues': issues,
                'warnings': warnings,
                'prompt': prompt,
                'prompt_id': prompt_id
            }

        prompt_clean = prompt.strip()

        char_length = len(prompt_clean)
        if char_length < 5:
            warnings.append(f"Very short prompt ({char_length} chars)")
        if char_length > 150:
            warnings.append(f"Long prompt ({char_length} chars)")

        try:
            tokens = self.processor.tokenizer(prompt_clean, truncation=False)
            token_count = len(tokens['input_ids'])

            if token_count > 77:
                issues.append(f"Exceeds CLIP token limit: {token_count}/77 tokens")
            elif token_count > 60:
                warnings.append(f"Near token limit: {token_count}/77 tokens")

        except Exception as e:
            issues.append(f"Tokenization failed: {str(e)}")
            token_count = -1

        problem_chars = ['@', '#', '$', '%', '^', '*', '|', '\\']
        found_chars = [c for c in problem_chars if c in prompt_clean]
        if found_chars:
            warnings.append(f"Contains special characters: {found_chars}")

        result = {
            'valid': len(issues) == 0,
            'issues': issues,
            'warnings': warnings,
            'prompt': prompt_clean,
            'prompt_id': prompt_id,
            'char_length': char_length,
            'token_count': token_count if token_count != -1 else None
        }

        self.validation_log.append(result)
        return result

    def validate_prompt_library(self, prompt_library: Dict) -> Dict[str, Any]:
        """
        Validate entire prompt library.

        Args:
            prompt_library: Dictionary of themes -> categories -> prompts

        Returns:
            Validation report with statistics
        """
        print("Validating prompt library...")

        report = {
            'total_prompts': 0,
            'valid_prompts': 0,
            'prompts_with_issues': 0,
            'prompts_with_warnings': 0,
            'issues_found': [],
            'warnings_found': [],
            'token_stats': {
                'min': float('inf'),
                'max': 0,
                'mean': 0,
                'over_limit': 0
            }
        }

        token_counts = []

        for theme, categories in prompt_library.items():
            for category, prompts in categories.items():
                for idx, prompt in enumerate(prompts):
                    prompt_id = f"{theme}/{category}/{idx}"

                    result = self.validate_prompt(prompt, prompt_id)
                    report['total_prompts'] += 1

                    if result['valid']:
                        report['valid_prompts'] += 1
                    else:
                        report['prompts_with_issues'] += 1
                        report['issues_found'].extend([
                            {'prompt_id': prompt_id, 'issue': issue}
                            for issue in result['issues']
                        ])

                    if result['warnings']:
                        report['prompts_with_warnings'] += 1
                        report['warnings_found'].extend([
                            {'prompt_id': prompt_id, 'warning': warning}
                            for warning in result['warnings']
                        ])

                    if result['token_count'] is not None:
                        token_counts.append(result['token_count'])
                        if result['token_count'] > 77:
                            report['token_stats']['over_limit'] += 1

        if token_counts:
            report['token_stats']['min'] = int(min(token_counts))
            report['token_stats']['max'] = int(max(token_counts))
            report['token_stats']['mean'] = float(np.mean(token_counts))

        return report

    def generate_report(self, validation_results: Dict, output_file: str = None) -> str:
        """
        Generate human-readable validation report.

        Args:
            validation_results: Results from validate_prompt_library()
            output_file: Optional path to save report

        Returns:
            Report text
        """
        report_lines = []
        report_lines.append("=" * 80)
        report_lines.append("PROMPT LIBRARY VALIDATION REPORT")
        report_lines.append("=" * 80)
        report_lines.append("")

        report_lines.append("SUMMARY")
        report_lines.append("-" * 80)
        report_lines.append(f"Total prompts: {validation_results['total_prompts']}")
        report_lines.append(f"Valid prompts: {validation_results['valid_prompts']}")
        report_lines.append(f"Prompts with issues: {validation_results['prompts_with_issues']}")
        report_lines.append(f"Prompts with warnings: {validation_results['prompts_with_warnings']}")
        report_lines.append("")

        report_lines.append("TOKEN STATISTICS")
        report_lines.append("-" * 80)
        stats = validation_results['token_stats']
        report_lines.append(f"Min tokens: {stats['min']}")
        report_lines.append(f"Max tokens: {stats['max']}")
        report_lines.append(f"Mean tokens: {stats['mean']:.1f}")
        report_lines.append(f"Prompts over limit (77): {stats['over_limit']}")
        report_lines.append("")

        if validation_results['issues_found']:
            report_lines.append("CRITICAL ISSUES")
            report_lines.append("-" * 80)
            for item in validation_results['issues_found']:
                report_lines.append(f"  [{item['prompt_id']}] {item['issue']}")
            report_lines.append("")
        else:
            report_lines.append("No critical issues found")
            report_lines.append("")

        if validation_results['warnings_found']:
            report_lines.append("WARNINGS")
            report_lines.append("-" * 80)
            for item in validation_results['warnings_found'][:10]:
                report_lines.append(f"  [{item['prompt_id']}] {item['warning']}")
            if len(validation_results['warnings_found']) > 10:
                report_lines.append(f"  ... and {len(validation_results['warnings_found']) - 10} more")
            report_lines.append("")

        report_lines.append("=" * 80)

        report_text = "\n".join(report_lines)

        if output_file:
            with open(output_file, 'w') as f:
                f.write(report_text)
            print(f"Report saved to: {output_file}")

        return report_text


print("PromptValidator class ready")
print("="*80)


PROMPT VALIDATION MODULE
PromptValidator class ready


================================================================

STEP 2: IMAGE PREPROCESSING

================================================================

    Comprehensive image preprocessing for CLIP model.
    
    Handles:
    - EXIF orientation correction
    - RGB conversion from any color mode
    - Aspect ratio preservation
    - High-quality resampling
    - White padding for consistent dimensions
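
The aspect-ratio and padding arithmetic can be checked without loading any image. The sketch below mirrors the resize logic of `_resize_with_padding` in plain Python; `letterbox_geometry` is a hypothetical helper name, and the clamp to one pixel is an added safeguard rather than part of the original method.

```python
def letterbox_geometry(width, height, target=(224, 224)):
    """Return the resized (width, height) and the paste offset on the padded canvas."""
    img_ratio = width / height
    target_ratio = target[0] / target[1]

    if img_ratio > target_ratio:
        # Image is wider than the target: fit the width, shrink the height
        new_w = target[0]
        new_h = max(1, int(new_w / img_ratio))
    else:
        # Image is taller than (or as square as) the target: fit the height
        new_h = target[1]
        new_w = max(1, int(new_h * img_ratio))

    # Center the resized image; the remaining area is the white padding
    offset = ((target[0] - new_w) // 2, (target[1] - new_h) // 2)
    return (new_w, new_h), offset

landscape = letterbox_geometry(400, 200)
portrait = letterbox_geometry(200, 400)
square = letterbox_geometry(300, 300)
print(landscape, portrait, square)
```

A 400×200 landscape image is resized to 224×112 and pasted 56 pixels from the top, so the padding is split evenly above and below.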


In [38]:
print("\n" + "="*80)
print("IMAGE PREPROCESSING")
print("="*80)

class ImagePreprocessor:
    """Comprehensive image preprocessing for CLIP."""

    def __init__(self, target_size: Tuple[int, int] = (224, 224)):
        self.target_size = target_size
        print(f"ImagePreprocessor initialized (target: {target_size})")

    def preprocess_image(self, image_path: str) -> Image.Image:
        """Load and preprocess image."""
        try:
            img = Image.open(image_path)
            img = self._fix_orientation(img)

            if img.mode != 'RGB':
                if img.mode == 'RGBA':
                    background = Image.new('RGB', img.size, (255, 255, 255))
                    background.paste(img, mask=img.split()[3])
                    img = background
                else:
                    img = img.convert('RGB')

            img = self._resize_with_padding(img, self.target_size)
            return img

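        except Exception as e:
            # Chain the original exception so the traceback keeps the root cause
            raise RuntimeError(f"Failed to preprocess {image_path}: {e}") from e

    def _fix_orientation(self, img: Image.Image) -> Image.Image:
        """Fix EXIF orientation."""
        try:
            img = ImageOps.exif_transpose(img)
        except Exception:
            # Missing or malformed EXIF data: keep the image as loaded
            pass
        return img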

    def _resize_with_padding(self, img: Image.Image, target_size: Tuple[int, int]) -> Image.Image:
        """Resize with aspect ratio preservation."""
        img_ratio = img.width / img.height
        target_ratio = target_size[0] / target_size[1]

        if img_ratio > target_ratio:
            new_width = target_size[0]
            # Clamp to 1 px so extreme aspect ratios never produce a zero-sized resize
            new_height = max(1, int(new_width / img_ratio))
        else:
            new_height = target_size[1]
            new_width = max(1, int(new_height * img_ratio))

        img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)

        canvas = Image.new('RGB', target_size, (255, 255, 255))
        offset_x = (target_size[0] - new_width) // 2
        offset_y = (target_size[1] - new_height) // 2
        canvas.paste(img, (offset_x, offset_y))

        return canvas

    def validate_image(self, image_path: str, max_size_mb: float = 10.0) -> Dict[str, Any]:
        """Validate image."""
        try:
            if not os.path.exists(image_path):
                return {'valid': False, 'error': 'File not found'}

            size_mb = os.path.getsize(image_path) / (1024 * 1024)
            if size_mb > max_size_mb:
                return {'valid': False, 'error': f'File too large: {size_mb:.2f}MB'}

            # Context manager closes the file handle even if reading metadata fails
            with Image.open(image_path) as img:
                img_format = img.format
                dimensions = img.size

            return {'valid': True, 'size_mb': size_mb, 'format': img_format, 'dimensions': dimensions}

        except Exception as e:
            return {'valid': False, 'error': str(e)}


preprocessor = ImagePreprocessor(target_size=(224, 224))
print("="*80)


IMAGE PREPROCESSING
ImagePreprocessor initialized (target: (224, 224))


================================================================

STEP 3: LOAD CLIP MODEL

================================================================

In [39]:
print("\n" + "="*80)
print("LOADING CLIP MODEL")
print("="*80)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

model_name = "openai/clip-vit-base-patch32"
print(f"Model: {model_name}")
print("Loading...")

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

model.to(device)
model.eval()

sample_text = ["test"]
inputs = processor(text=sample_text, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    test_output = model.get_text_features(**inputs)
    if torch.is_tensor(test_output):
        embedding_dim = test_output.shape[-1]
    else:
        # Non-tensor output object: read the projection size from the model config
        embedding_dim = model.config.projection_dim

print(f"\nModel loaded")
print(f"  Parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")
print(f"  Embedding dimension: {embedding_dim}")
if device == "cuda":
    print(f"  GPU: {torch.cuda.get_device_name(0)}")

print("="*80)


LOADING CLIP MODEL
Device: cpu
Model: openai/clip-vit-base-patch32
Loading...


Loading weights:   0%|          | 0/398 [00:00<?, ?it/s]

CLIPModel LOAD REPORT from: openai/clip-vit-base-patch32
Key                                  | Status     |  | 
-------------------------------------+------------+--+-
text_model.embeddings.position_ids   | UNEXPECTED |  | 
vision_model.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



Model loaded
  Parameters: 151.3M
  Embedding dimension: 512


================================================================

STEP 4: LOAD IMAGE METADATA AND PROMPT LIBRARY

================================================================

In [60]:
print("\n" + "="*80)
print("LOADING DATA")
print("="*80)

with open(METADATA_PATH, 'r') as f:
    metadata = json.load(f)

print(f"Metadata: {metadata['total_images']} images, {metadata['total_destinations']} destinations")

# Initialize pipeline_status if it doesn't exist
if 'pipeline_status' not in metadata:
    metadata['pipeline_status'] = {
        'embeddings_computed': False,
        'prompts_extracted': False,
        'prompts_validated': False
    }
    print("Initialized pipeline_status in metadata")

with open(PROMPT_LIBRARY_PATH, 'r') as f:
    prompt_library = json.load(f)

total_prompts = sum(sum(len(prompts) for prompts in categories.values())
                    for categories in prompt_library.values())
print(f"Prompt library: {total_prompts} prompts")

print("="*80)


LOADING DATA
Metadata: 215 images, 47 destinations
Initialized pipeline_status in metadata
Prompt library: 2200 prompts


================================================================

STEP 5: VALIDATE PROMPT LIBRARY

================================================================

In [41]:
print("\n" + "="*80)
print("VALIDATING PROMPT LIBRARY")
print("="*80)

validator = PromptValidator(processor)
validation_results = validator.validate_prompt_library(prompt_library)

validation_report = validator.generate_report(
    validation_results,
    output_file=f'{REPORTS_PATH}/prompt_validation_report.txt'
)

print("\n" + validation_report)

if validation_results['prompts_with_issues'] > 0:
    print(f"\nWARNING: {validation_results['prompts_with_issues']} prompts have critical issues!")
    print("Review the validation report before proceeding.")
else:
    print("\nAll prompts passed validation!")

print("="*80)



VALIDATING PROMPT LIBRARY
PromptValidator initialized
Validating prompt library...
Report saved to: /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/reports/prompt_validation_report.txt

PROMPT LIBRARY VALIDATION REPORT

SUMMARY
--------------------------------------------------------------------------------
Total prompts: 2200
Valid prompts: 2200
Prompts with issues: 0

TOKEN STATISTICS
--------------------------------------------------------------------------------
Min tokens: 17
Max tokens: 28
Mean tokens: 21.3
Prompts over limit (77): 0

No critical issues found


All prompts passed validation!


================================================================

EMBEDDING EXTRACTION FUNCTIONS

================================================================

In [52]:
print("\n" + "="*80)
print("DEFINING EXTRACTION FUNCTIONS")
print("="*80)

def extract_clip_features(outputs):
    """Universal fix for extracting projected tensors from CLIP outputs."""
    # Prioritize 'image_embeds' or 'text_embeds' which are the *projected* CLIP features (512-dim for base-patch32)
    if hasattr(outputs, 'image_embeds') and torch.is_tensor(outputs.image_embeds):
        return outputs.image_embeds
    if hasattr(outputs, 'text_embeds') and torch.is_tensor(outputs.text_embeds):
        return outputs.text_embeds

    # If the output itself is already a tensor, return it directly.
    if torch.is_tensor(outputs):
        return outputs

    # Fall back to 'pooler_output' for BaseModelOutputWithPooling results.
    # Caveat: for outputs of the bare text/vision towers this field is the unprojected
    # pooled state, so check that its last dimension matches the expected embedding size.
    if hasattr(outputs, 'pooler_output') and torch.is_tensor(outputs.pooler_output):
        return outputs.pooler_output

    # Report only the type: formatting the whole object would dump its tensors into the log.
    raise TypeError(f"Could not extract 512-dim projected CLIP features from model output type: {type(outputs)}. "
                    f"Expected direct tensor or object with 'image_embeds'/'text_embeds'/'pooler_output' tensor attribute.")


def extract_image_embedding(image_path, model, processor, preprocessor, device):
    """Extract CLIP embedding from image with preprocessing."""
    try:
        validation = preprocessor.validate_image(image_path)
        if not validation['valid']:
            print(f"Skipping {image_path}: {validation['error']}")
            return None

        img = preprocessor.preprocess_image(image_path)

        inputs = processor(images=img, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model.get_image_features(**inputs)
            image_features = extract_clip_features(outputs)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

        return image_features.cpu().numpy()[0]

    except Exception as e:
        print(f"\nERROR in extract_image_embedding for {image_path}: {e}")
        return None


def extract_text_embeddings_batch(texts, model, processor, device):
    """Extract CLIP embeddings for batch of text prompts."""
    try:
        inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model.get_text_features(**inputs)
            text_features = extract_clip_features(outputs)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        return text_features.cpu().numpy()

    except Exception as e:
        print(f"\nERROR in extract_text_embeddings_batch: {e}")
        return None


print("Extraction functions defined")
print("="*80)



DEFINING EXTRACTION FUNCTIONS
Extraction functions defined
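
Both extraction functions L2-normalize their outputs, so the dot product of an image embedding and a prompt embedding equals their cosine similarity. The following sketch shows how that property could be used to pick the best-matching prompts for one image; `top_k_prompts` is a hypothetical helper, and the random 8-dimensional vectors stand in for real 512-dimensional CLIP embeddings.

```python
import numpy as np

def top_k_prompts(image_embedding, prompt_matrix, prompt_texts, k=3):
    """Rank prompts by cosine similarity to a unit-length image embedding."""
    scores = prompt_matrix @ image_embedding      # (num_prompts,) cosine similarities
    order = np.argsort(-scores)[:k]               # indices of the k highest scores
    return [(prompt_texts[i], float(scores[i])) for i in order]

# Stand-in data: 5 random unit vectors instead of real CLIP prompt embeddings
rng = np.random.default_rng(0)
prompt_matrix = rng.normal(size=(5, 8))
prompt_matrix /= np.linalg.norm(prompt_matrix, axis=1, keepdims=True)
texts = [f"prompt {i}" for i in range(5)]

# Pretend the image embedding coincides with prompt 2
image_embedding = prompt_matrix[2]
matches = top_k_prompts(image_embedding, prompt_matrix, texts, k=2)
print(matches)
```

Because every vector has unit length, no score can exceed 1.0, and the identical vector is always ranked first.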


================================================================

STEP 6: PRE-COMPUTE PROMPT EMBEDDINGS


    Pre-compute embeddings for all 2,200 prompts
    
    FIXED: Properly extract tensor from CLIP output object
    
    Returns:
        Dictionary with structure:
        {
            'Beach': {
                'LandscapeType': [
                    {'text': 'prompt text', 'embedding': <np.ndarray of shape (512,)>},
                    ...
                ],
                ...
            },
            ...
        }


================================================================
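
For similarity search, the nested structure described above is often easier to use once flattened into a single matrix plus a parallel index. The sketch below assumes the documented layout (theme, then category, then a list of `{'text', 'embedding'}` entries); `flatten_prompt_embeddings` is a hypothetical helper and the 4-dimensional vectors are toy values.

```python
import numpy as np

def flatten_prompt_embeddings(prompt_embeddings):
    """Turn theme -> category -> entries into (matrix, index) for vectorized search."""
    rows, index = [], []
    for theme, categories in prompt_embeddings.items():
        for category, entries in categories.items():
            for entry in entries:
                rows.append(entry['embedding'])
                index.append((theme, category, entry['text']))
    # An empty library (for example, after every batch failed) yields an empty matrix
    matrix = np.stack(rows) if rows else np.empty((0, 0))
    return matrix, index

# Toy structure in the documented layout
toy_embeddings = {
    'Beach': {
        'LandscapeType': [
            {'text': 'a sandy beach', 'embedding': np.array([1.0, 0.0, 0.0, 0.0])},
            {'text': 'a rocky coast', 'embedding': np.array([0.0, 1.0, 0.0, 0.0])},
        ],
    },
    'Heritage': {
        'Architecture': [
            {'text': 'a hilltop fort', 'embedding': np.array([0.0, 0.0, 1.0, 0.0])},
        ],
    },
}
matrix, index = flatten_prompt_embeddings(toy_embeddings)
print(matrix.shape, index[0])
```

Checking `matrix.shape[0]` against the expected prompt count is also a quick way to notice that some batches produced no embeddings before the result is cached.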

In [53]:
print("\n" + "="*80)
print("STAGE 6: PRE-COMPUTING PROMPT EMBEDDINGS")
print("="*80)

def precompute_prompt_embeddings(prompt_library, model, processor, device):
    """Pre-compute embeddings for all prompts."""

    prompt_embeddings = {}
    total_prompts = sum(sum(len(p) for p in cats.values()) for cats in prompt_library.values())

    print(f"Total prompts: {total_prompts}")

    with tqdm(total=total_prompts, desc="Computing prompt embeddings") as pbar:
        for theme, categories in prompt_library.items():
            prompt_embeddings[theme] = {}

            for category, prompts in categories.items():
                prompt_embeddings[theme][category] = []

                batch_size = 32
                for i in range(0, len(prompts), batch_size):
                    batch = prompts[i:i+batch_size]

                    embeddings = extract_text_embeddings_batch(batch, model, processor, device)

                    if embeddings is not None:
                        for j, prompt_text in enumerate(batch):
                            prompt_embeddings[theme][category].append({
                                'text': prompt_text,
                                'embedding': embeddings[j]
                            })

                    pbar.update(len(batch))

    return prompt_embeddings


prompt_embeddings_file = f'{EMBEDDINGS_PATH}/prompt_embeddings.pkl'

# Force re-computation to ensure consistency after fix
force_recompute_prompts = True

if os.path.exists(prompt_embeddings_file) and not force_recompute_prompts:
    print("Found cached prompt embeddings...")

    with open(prompt_embeddings_file, 'rb') as f:
        cached_embeddings = pickle.load(f)

    # Take the dimension from the first non-empty category; None if the cache holds no embeddings
    cached_dim = next(
        (prompts[0]['embedding'].shape[0]
         for cats in cached_embeddings.values()
         for prompts in cats.values() if prompts),
        None
    )

    print(f"  Cached dimension: {cached_dim}")
    print(f"  Current model dimension: {embedding_dim}")

    if cached_dim == embedding_dim:
        print("  Dimensions match - using cache")
        prompt_embeddings = cached_embeddings
    else:
        print("  Dimension mismatch - recomputing...")
        os.remove(prompt_embeddings_file)
        prompt_embeddings = precompute_prompt_embeddings(prompt_library, model, processor, device)

        with open(prompt_embeddings_file, 'wb') as f:
            pickle.dump(prompt_embeddings, f)
        print(f"  Saved new embeddings")
else:
    if os.path.exists(prompt_embeddings_file) and force_recompute_prompts:
        print("Forcing recomputation of prompt embeddings, ignoring cached file.")
        os.remove(prompt_embeddings_file) # Ensure old cache is removed
    else:
        print("Computing prompt embeddings...")

    prompt_embeddings = precompute_prompt_embeddings(prompt_library, model, processor, device)

    with open(prompt_embeddings_file, 'wb') as f:
        pickle.dump(prompt_embeddings, f)
    print(f"Saved to: {prompt_embeddings_file}")

print("="*80)



STAGE 6: PRE-COMPUTING PROMPT EMBEDDINGS
Forcing recomputation of prompt embeddings, ignoring cached file.
Total prompts: 2200


Computing prompt embeddings:   1%|▏         | 32/2200 [00:00<01:01, 35.08it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0360,  0.1090, -0.1890,  ...,  1.4989,  1.6122, -1.4643],
         [ 1.2186, -1.3370, -0.2337,  ...,  1.6266,  1.2361,  0.0886],
         ...,
         [-0.4441, -1.5782,  0.3321,  ..., -0.8385, -0.6500,  0.6907],
         [-0.5024, -1.6044,  0.3171,  ..., -0.8625, -0.6403,  0.7204],
         [-0.5450, -1.6367,  0.2889,  ..., -0.8662, -0.6201,  0.6481]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0360,  0.1090, -0.1890,  ...,  1.4989,  1.6122, -1.4643],
         [ 1.2186, -1.3370, -0.2337,  ...,  1.6266,  1.2361,  0.0886],
         ...,
         [ 0.6174, -2.0593, -0.4355,  ..., -1.2615, -0.1844, -0.3498],
         [ 0.6373, -2.13

Computing prompt embeddings:   2%|▏         | 50/2200 [00:01<01:09, 30.86it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0360,  0.1090, -0.1890,  ...,  1.4989,  1.6122, -1.4643],
         [ 1.2186, -1.3370, -0.2337,  ...,  1.6266,  1.2361,  0.0886],
         ...,
         [-0.3695, -0.9486,  0.7089,  ..., -0.9234, -0.6332, -0.5198],
         [-0.3746, -1.0124,  0.7639,  ..., -0.9738, -0.5678, -0.5414],
         [-0.3233, -1.0649,  0.7650,  ..., -0.9656, -0.5298, -0.5655]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0360,  0.1090, -0.1890,  ...,  1.4989,  1.6122, -1.4643],
         [ 1.2186, -1.3370, -0.2337,  ...,  1.6266,  1.2361,  0.0886],
         ...,
         [-0.5916, -1.7089, -0.4374,  ..., -0.1074, -1.2703,  0.8683],
         [-0.5777, -1.73

Computing prompt embeddings:   4%|▎         | 82/2200 [00:02<01:17, 27.27it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0477, -0.7871,  0.6439,  ...,  1.1886,  1.0673, -1.4245],
         [ 0.2355, -0.2617, -1.1725,  ...,  1.0947, -0.0462, -0.6166],
         ...,
         [-0.3455, -3.1602, -1.0322,  ..., -0.0973,  0.5425,  0.8564],
         [-0.3520, -3.2030, -1.0399,  ..., -0.0775,  0.5295,  0.8622],
         [-0.3708, -3.2391, -1.0705,  ..., -0.0441,  0.5117,  0.8752]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0477, -0.7871,  0.6439,  ...,  1.1886,  1.0673, -1.4245],
         [ 0.2355, -0.2617, -1.1725,  ...,  1.0947, -0.0462, -0.6166],
         ...,
         [ 0.3107, -1.9692,  0.3447,  ...,  0.0167, -0.7823,  1.1896],
         [ 0.3273, -2.03

Computing prompt embeddings:   5%|▍         | 100/2200 [00:03<01:12, 29.10it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0477, -0.7871,  0.6439,  ...,  1.1886,  1.0673, -1.4245],
         [ 0.2355, -0.2617, -1.1725,  ...,  1.0947, -0.0462, -0.6166],
         ...,
         [-0.3732, -0.2008,  0.8579,  ..., -0.9932, -0.2084,  0.1555],
         [-0.0082, -0.2267, -1.9381,  ..., -0.4676,  0.3422,  0.6717],
         [-0.4353, -1.9286, -1.0317,  ...,  0.4018, -0.5963,  0.7295]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0477, -0.7871,  0.6439,  ...,  1.1886,  1.0673, -1.4245],
         [ 0.2355, -0.2617, -1.1725,  ...,  1.0947, -0.0462, -0.6166],
         ...,
         [-0.7677, -3.1414, -0.4368,  ...,  0.6606, -0.3982,  0.9034],
         [-0.7803, -3.16

Computing prompt embeddings:   6%|▌         | 132/2200 [00:04<01:06, 31.23it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.2282, -0.5560,  0.4948,  ..., -0.4390, -0.5153, -0.6272],
         [ 0.3905,  0.7269, -1.3145,  ..., -0.3579,  0.7757,  0.8788],
         ...,
         [ 1.1902, -2.1037, -0.5028,  ..., -1.3349, -0.3068, -0.3447],
         [ 1.2632, -2.1119, -0.5749,  ..., -1.3250, -0.2761, -0.2804],
         [ 1.3431, -2.0993, -0.6671,  ..., -1.3646, -0.2310, -0.1998]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.2282, -0.5560,  0.4948,  ..., -0.4390, -0.5153, -0.6272],
         [ 0.3905,  0.7269, -1.3145,  ..., -0.3579,  0.7757,  0.8788],
         ...,
         [ 0.1030, -1.8745,  0.0106,  ..., -0.5656, -0.4443,  0.4988],
         [ 0.1683, -1.90

Computing prompt embeddings:   7%|▋         | 150/2200 [00:04<01:04, 32.01it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:   8%|▊         | 182/2200 [00:05<01:01, 33.02it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:   9%|▉         | 200/2200 [00:06<01:00, 33.19it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  11%|█         | 232/2200 [00:07<00:58, 33.68it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  11%|█▏        | 250/2200 [00:07<00:57, 34.13it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  13%|█▎        | 282/2200 [00:08<01:01, 31.04it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  14%|█▎        | 300/2200 [00:09<01:10, 26.82it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  15%|█▌        | 332/2200 [00:11<01:16, 24.33it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  16%|█▌        | 350/2200 [00:12<01:18, 23.69it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  17%|█▋        | 382/2200 [00:13<01:09, 26.12it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  18%|█▊        | 400/2200 [00:13<01:05, 27.32it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  20%|█▉        | 432/2200 [00:14<01:00, 29.11it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  20%|██        | 450/2200 [00:15<00:58, 30.00it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  22%|██▏       | 482/2200 [00:16<00:55, 31.04it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  23%|██▎       | 500/2200 [00:16<00:54, 31.14it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  24%|██▍       | 532/2200 [00:17<00:51, 32.19it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  25%|██▌       | 550/2200 [00:18<00:50, 32.46it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  26%|██▋       | 582/2200 [00:19<00:48, 33.15it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  27%|██▋       | 600/2200 [00:19<00:48, 33.18it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  29%|██▊       | 632/2200 [00:20<00:46, 33.60it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  30%|██▉       | 650/2200 [00:21<00:46, 33.64it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  31%|███       | 682/2200 [00:22<00:44, 33.94it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  32%|███▏      | 700/2200 [00:22<00:47, 31.87it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  33%|███▎      | 732/2200 [00:24<00:53, 27.28it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  34%|███▍      | 750/2200 [00:25<00:56, 25.87it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  36%|███▌      | 782/2200 [00:26<00:57, 24.66it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  36%|███▋      | 800/2200 [00:27<00:53, 26.24it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  38%|███▊      | 832/2200 [00:28<00:47, 28.84it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  39%|███▊      | 850/2200 [00:28<00:45, 29.58it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  40%|████      | 882/2200 [00:29<00:43, 30.39it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor(...), ...) [tensor dump truncated]

Computing prompt embeddings:  41%|████      | 900/2200 [00:30<00:42, 30.42it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,

Computing prompt embeddings:  42%|████▏     | 932/2200 [00:31<00:40, 31.30it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  43%|████▎     | 950/2200 [00:31<00:39, 31.56it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  45%|████▍     | 982/2200 [00:32<00:38, 31.87it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  45%|████▌     | 1000/2200 [00:33<00:38, 31.58it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  47%|████▋     | 1032/2200 [00:34<00:36, 32.30it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  48%|████▊     | 1050/2200 [00:34<00:36, 31.65it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  49%|████▉     | 1082/2200 [00:35<00:35, 31.69it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  50%|█████     | 1100/2200 [00:36<00:34, 31.59it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,

Computing prompt embeddings:  51%|█████▏    | 1132/2200 [00:37<00:40, 26.53it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  52%|█████▏    | 1150/2200 [00:38<00:42, 24.92it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  54%|█████▎    | 1182/2200 [00:40<00:45, 22.59it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,

Computing prompt embeddings:  55%|█████▍    | 1200/2200 [00:41<00:41, 24.26it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  56%|█████▌    | 1232/2200 [00:42<00:36, 26.73it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  57%|█████▋    | 1250/2200 [00:42<00:34, 27.90it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  58%|█████▊    | 1282/2200 [00:43<00:31, 29.51it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,

Computing prompt embeddings:  59%|█████▉    | 1300/2200 [00:44<00:29, 30.13it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  61%|██████    | 1332/2200 [00:45<00:28, 30.58it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  61%|██████▏   | 1350/2200 [00:45<00:27, 30.77it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  63%|██████▎   | 1382/2200 [00:46<00:27, 30.16it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  64%|██████▎   | 1400/2200 [00:47<00:26, 30.06it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  65%|██████▌   | 1432/2200 [00:48<00:24, 31.02it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  66%|██████▌   | 1450/2200 [00:48<00:23, 31.50it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  67%|██████▋   | 1482/2200 [00:49<00:22, 31.45it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  68%|██████▊   | 1500/2200 [00:50<00:23, 30.27it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,

Computing prompt embeddings:  70%|██████▉   | 1532/2200 [00:52<00:26, 25.62it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  70%|███████   | 1550/2200 [00:53<00:26, 24.24it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  72%|███████▏  | 1582/2200 [00:54<00:26, 22.94it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  73%|███████▎  | 1600/2200 [00:55<00:24, 24.69it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,

Computing prompt embeddings:  74%|███████▍  | 1632/2200 [00:56<00:20, 27.39it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],

Computing prompt embeddings:  75%|███████▌  | 1650/2200 [00:56<00:19, 28.31it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0847, -0.8969, -0.4552,  ...,  1.5363, -1.0048, -0.0074],
         [ 0.2841, -0.5439, -0.8084,  ...,  0.8332,  0.3113, -1.2060],
         ...,
         [ 0.7100, -1.0987, -0.9496,  ..., -1.1932, -1.2235, -0.5402],
         [ 0.6398, -1.1545, -0.9933,  ..., -1.2289, -1.1208, -0.5840],
         [ 0.7197, -1.0050, -0.9625,  ..., -1.0715, -1.2894, -0.5107]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0847, -0.8969, -0.4552,  ...,  1.5363, -1.0048, -0.0074],
         [ 0.2841, -0.5439, -0.8084,  ...,  0.8332,  0.3113, -1.2060],
         ...,
         [-0.0128, -0.3433, -1.8826,  ..., -0.8559,  0.5401, -0.0745],
         [ 0.8441, -0.91

Computing prompt embeddings:  76%|███████▋  | 1682/2200 [00:57<00:16, 30.48it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.0314,  0.9256,  0.1843,  ..., -0.3776, -0.1604, -2.3416],
         ...,
         [ 1.6652,  0.3334,  0.8063,  ..., -0.2830,  0.6244, -0.4659],
         [ 0.6432, -0.1327,  0.2943,  ..., -0.3306,  0.1877, -1.7355],
         [-0.3386, -0.5711, -0.3936,  ..., -1.3317, -0.6988, -1.5056]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.0314,  0.9256,  0.1843,  ..., -0.3776, -0.1604, -2.3416],
         ...,
         [ 0.8919, -1.9637, -0.3577,  ..., -1.1824, -0.5728, -0.4561],
         [ 0.9139, -2.00

Computing prompt embeddings:  77%|███████▋  | 1700/2200 [00:58<00:15, 31.53it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.0314,  0.9256,  0.1843,  ..., -0.3776, -0.1604, -2.3416],
         ...,
         [ 0.5569, -0.0350,  0.0895,  ..., -1.0994,  0.2788, -1.6848],
         [-0.8489, -0.9900, -0.5202,  ..., -1.7200, -0.6444, -1.0229],
         [-0.8214, -0.9966, -0.5190,  ..., -1.7624, -0.6567, -0.9555]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.0314,  0.9256,  0.1843,  ..., -0.3776, -0.1604, -2.3416],
         ...,
         [-0.7043, -0.7670,  1.3822,  ..., -0.7553, -0.8663, -0.8560],
         [-0.6017, -0.73

Computing prompt embeddings:  79%|███████▊  | 1732/2200 [00:59<00:14, 31.98it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,
           5.9064e-01,  1.0130e-01],
         [ 1.2678e+00, -6.0781e-01, -2.4018e-01,  ...,  1.0790e-01,
           4.6981e-01, -2.2191e-01],
         [ 8.1927e-01, -8.7636e-01, -1.5251e+00,  ...,  8.9930e-01,
           6.2081e-03, -9.3004e-01],
         ...,
         [ 2.8168e-01, -1.6657e+00, -6.8604e-01,  ..., -1.6088e+00,
           3.6790e-01, -1.7301e+00],
         [ 3.2204e-01, -1.6559e+00, -6.6841e-01,  ..., -1.6394e+00,
           3.6573e-01, -1.7126e+00],
         [ 3.2437e-01, -1.6986e+00, -6.6634e-01,  ..., -1.6384e+00,
           3.7159e-01, -1.7022e+00]],

        [[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,
           5.9064e-01,  1.0130e-01],
         [ 1.

Computing prompt embeddings:  80%|███████▉  | 1750/2200 [00:59<00:13, 32.32it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.2678, -0.6078, -0.2402,  ...,  0.1079,  0.4698, -0.2219],
         [ 0.8193, -0.8764, -1.5251,  ...,  0.8993,  0.0062, -0.9300],
         ...,
         [ 0.7361, -2.0583, -0.6119,  ..., -0.7940, -0.2511, -0.9582],
         [ 0.7276, -2.0546, -0.6447,  ..., -0.6873, -0.2259, -1.2087],
         [ 0.7225, -2.0058, -0.6173,  ..., -0.6550, -0.3067, -1.2917]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.2678, -0.6078, -0.2402,  ...,  0.1079,  0.4698, -0.2219],
         [ 0.8193, -0.8764, -1.5251,  ...,  0.8993,  0.0062, -0.9300],
         ...,
         [ 1.0702, -1.8572, -0.3989,  ..., -0.2402, -0.2145, -0.8511],
         [ 1.0419, -1.87

Computing prompt embeddings:  81%|████████  | 1782/2200 [01:00<00:13, 31.98it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,
           5.9064e-01,  1.0130e-01],
         [ 6.9227e-01,  1.2069e+00,  3.5633e-02,  ...,  2.0983e+00,
           6.6442e-01, -5.5420e-02],
         [ 1.2158e+00,  6.2794e-01, -2.1838e-03,  ...,  5.6605e-01,
           5.2701e-01, -1.0969e+00],
         ...,
         [ 9.9213e-01, -8.6976e-01,  4.4251e-01,  ..., -1.0245e-01,
          -4.8361e-01,  3.2438e-01],
         [ 9.6815e-01, -9.9290e-01,  3.4714e-01,  ..., -1.3623e-01,
          -4.6137e-01,  3.5273e-01],
         [ 8.6267e-01, -9.4242e-01,  5.0228e-01,  ..., -1.1020e-01,
          -4.7571e-01,  3.7070e-01]],

        [[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,
           5.9064e-01,  1.0130e-01],
         [ 6.

Computing prompt embeddings:  82%|████████▏ | 1800/2200 [01:01<00:12, 31.55it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.6923,  1.2069,  0.0356,  ...,  2.0983,  0.6644, -0.0554],
         [ 1.2158,  0.6279, -0.0022,  ...,  0.5661,  0.5270, -1.0969],
         ...,
         [ 0.4979, -1.0340, -0.0201,  ..., -1.1700, -0.1325, -0.7678],
         [ 0.4593, -1.1835, -0.1515,  ..., -1.1975, -0.0931, -0.8848],
         [ 0.4708, -0.9312,  0.0539,  ..., -0.9197, -0.1011, -0.9420]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.6923,  1.2069,  0.0356,  ...,  2.0983,  0.6644, -0.0554],
         [ 1.2158,  0.6279, -0.0022,  ...,  0.5661,  0.5270, -1.0969],
         ...,
         [ 0.7040, -1.3128, -1.2585,  ..., -1.3688, -0.1210, -0.4301],
         [ 0.6718, -1.40

Computing prompt embeddings:  83%|████████▎ | 1832/2200 [01:02<00:11, 30.99it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.3610,  0.4322,  0.6196,  ..., -0.1652,  0.0227, -0.3818],
         [ 0.6888, -0.5762,  0.5923,  ..., -0.7005,  0.7067, -0.3802],
         ...,
         [ 0.0050, -1.9931, -0.0232,  ..., -0.9499, -1.0344, -0.6698],
         [-0.0068, -2.0556, -0.0872,  ..., -0.9927, -1.0058, -0.6572],
         [ 0.7097, -1.7188, -0.8050,  ..., -1.3222, -0.6832, -1.0988]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.3610,  0.4322,  0.6196,  ..., -0.1652,  0.0227, -0.3818],
         [ 0.6888, -0.5762,  0.5923,  ..., -0.7005,  0.7067, -0.3802],
         ...,
         [ 1.4836, -1.9659,  0.3008,  ..., -0.6218,  0.5495, -0.2075],
         [ 1.4692, -2.04

Computing prompt embeddings:  84%|████████▍ | 1850/2200 [01:02<00:11, 30.70it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,
           5.9064e-01,  1.0130e-01],
         [ 1.3610e+00,  4.3219e-01,  6.1964e-01,  ..., -1.6517e-01,
           2.2667e-02, -3.8185e-01],
         [ 6.8880e-01, -5.7623e-01,  5.9234e-01,  ..., -7.0045e-01,
           7.0666e-01, -3.8021e-01],
         ...,
         [ 2.2487e+00, -5.5129e-01,  5.0486e-02,  ...,  6.2823e-03,
           1.0854e-01, -1.5604e+00],
         [ 4.4277e-01, -2.2335e+00,  4.5850e-01,  ..., -2.0375e-02,
          -7.5513e-01, -9.6144e-01],
         [ 5.3281e-01, -2.1501e+00,  5.9833e-01,  ...,  6.8094e-02,
          -8.2506e-01, -9.1804e-01]],

        [[ 3.3929e-01,  1.1646e-01,  1.0195e-01,  ...,  2.4677e-01,
           5.9064e-01,  1.0130e-01],
         [ 1.

Computing prompt embeddings:  86%|████████▌ | 1882/2200 [01:03<00:09, 32.23it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0229,  0.7761, -0.5828,  ...,  0.3987,  0.0855, -2.7930],
         [ 0.4628,  0.4362, -0.9171,  ..., -0.8236, -0.1523, -1.6996],
         ...,
         [ 1.6235, -0.5571, -1.5847,  ..., -1.5304, -1.9858,  0.5846],
         [ 1.5913, -0.6244, -1.6696,  ..., -1.6478, -2.0068,  0.6165],
         [ 1.6034, -0.6423, -1.5777,  ..., -1.6502, -2.0665,  0.5597]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0229,  0.7761, -0.5828,  ...,  0.3987,  0.0855, -2.7930],
         [ 0.4628,  0.4362, -0.9171,  ..., -0.8236, -0.1523, -1.6996],
         ...,
         [ 0.1830, -0.5629, -0.8591,  ..., -1.3316, -1.5131,  0.1536],
         [ 0.2530, -0.55

Computing prompt embeddings:  86%|████████▋ | 1900/2200 [01:04<00:09, 32.66it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0229,  0.7761, -0.5828,  ...,  0.3987,  0.0855, -2.7930],
         [ 0.4628,  0.4362, -0.9171,  ..., -0.8236, -0.1523, -1.6996],
         ...,
         [ 0.3981, -0.6803, -1.2505,  ..., -1.5109, -1.2690,  0.4197],
         [ 0.3590, -0.7421, -1.3171,  ..., -1.5502, -1.2870,  0.4132],
         [ 0.4071, -0.8266, -1.2976,  ..., -1.6540, -1.2295,  0.3917]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.0229,  0.7761, -0.5828,  ...,  0.3987,  0.0855, -2.7930],
         [ 0.4628,  0.4362, -0.9171,  ..., -0.8236, -0.1523, -1.6996],
         ...,
         [ 2.7080, -0.9487, -1.3069,  ..., -1.6788,  0.9882, -1.3173],
         [ 2.1296, -1.47

Computing prompt embeddings:  88%|████████▊ | 1932/2200 [01:05<00:09, 27.89it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.7472,  0.2165, -1.1722,  ...,  0.5190,  0.7756,  0.5897],
         [ 0.3487, -1.0074, -1.1265,  ...,  0.4423,  0.5125, -0.9018],
         ...,
         [ 0.3708, -1.4959, -0.3764,  ..., -0.4966, -0.2160, -0.9509],
         [ 0.4860, -1.2674, -0.4402,  ..., -0.5029, -0.4862, -0.8960],
         [ 0.5958, -1.1882, -0.4349,  ..., -0.5079, -0.4573, -0.8761]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.7472,  0.2165, -1.1722,  ...,  0.5190,  0.7756,  0.5897],
         [ 0.3487, -1.0074, -1.1265,  ...,  0.4423,  0.5125, -0.9018],
         ...,
         [ 0.5356, -0.0726, -2.1947,  ..., -0.3985,  1.1106, -0.1138],
         [ 1.1248, -1.31

Computing prompt embeddings:  89%|████████▊ | 1950/2200 [01:06<00:09, 25.80it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.7472,  0.2165, -1.1722,  ...,  0.5190,  0.7756,  0.5897],
         [ 0.3487, -1.0074, -1.1265,  ...,  0.4423,  0.5125, -0.9018],
         ...,
         [ 0.8812, -1.0340, -0.1750,  ..., -1.0767, -0.3017, -0.7546],
         [ 0.8895, -1.0009, -0.0714,  ..., -0.9134, -0.3591, -0.8546],
         [ 0.8721, -0.9076, -0.0725,  ..., -0.9001, -0.4489, -0.7814]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.7472,  0.2165, -1.1722,  ...,  0.5190,  0.7756,  0.5897],
         [ 0.3487, -1.0074, -1.1265,  ...,  0.4423,  0.5125, -0.9018],
         ...,
         [ 0.6017, -1.3030, -0.6377,  ..., -0.5619, -0.4766, -0.5614],
         [ 0.8096, -1.30

Computing prompt embeddings:  90%|█████████ | 1982/2200 [01:08<00:09, 22.30it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.8906, -0.6504, -0.5107,  ..., -0.2545,  2.5144, -0.6639],
         ...,
         [-1.7129, -0.7325, -0.1671,  ..., -1.6266,  0.1073, -0.2796],
         [-1.6863, -0.7434, -0.2053,  ..., -1.6496,  0.1163, -0.2023],
         [-1.6652, -0.7601, -0.3611,  ..., -1.7401,  0.2077, -0.0940]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.8906, -0.6504, -0.5107,  ..., -0.2545,  2.5144, -0.6639],
         ...,
         [-0.5861, -1.7439, -0.6997,  ..., -1.1297,  1.4060, -0.0747],
         [ 0.3830, -1.47

Computing prompt embeddings:  91%|█████████ | 2000/2200 [01:09<00:08, 23.16it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.8906, -0.6504, -0.5107,  ..., -0.2545,  2.5144, -0.6639],
         ...,
         [-0.9501, -1.8068, -0.8673,  ..., -1.7051,  0.0275,  0.6473],
         [-0.8901, -1.8208, -0.9263,  ..., -1.7249,  0.0720,  0.6981],
         [-1.0465, -1.8589, -0.8968,  ..., -1.7190,  0.1634,  0.5947]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [-0.8906, -0.6504, -0.5107,  ..., -0.2545,  2.5144, -0.6639],
         ...,
         [-0.5241, -1.6234,  0.0256,  ..., -1.4430,  1.0234,  0.2965],
         [-0.5259, -1.64

Computing prompt embeddings:  92%|█████████▏| 2032/2200 [01:10<00:06, 25.60it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.1923, -0.5925, -0.5524,  ...,  0.5221,  1.2134, -2.4093],
         ...,
         [ 0.0465, -0.6197, -0.3030,  ..., -0.5173,  0.0930, -0.4389],
         [ 0.0493, -0.5710, -0.2306,  ..., -0.5557,  0.0637, -0.3737],
         [ 0.0526, -0.5464, -0.1811,  ..., -0.5736,  0.1022, -0.3490]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.1923, -0.5925, -0.5524,  ...,  0.5221,  1.2134, -2.4093],
         ...,
         [ 1.2791, -0.0892,  0.2364,  ...,  0.6451, -0.0192, -0.9490],
         [ 0.2770, -0.33

Computing prompt embeddings:  93%|█████████▎| 2050/2200 [01:10<00:05, 26.58it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.1923, -0.5925, -0.5524,  ...,  0.5221,  1.2134, -2.4093],
         ...,
         [ 1.0684, -1.6354, -0.8859,  ..., -1.4340,  0.0584, -0.6617],
         [ 1.0753, -1.6654, -0.9628,  ..., -1.4508,  0.0481, -0.6688],
         [ 1.1218, -1.7153, -1.0136,  ..., -1.4238,  0.0680, -0.6399]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.1923, -0.5925, -0.5524,  ...,  0.5221,  1.2134, -2.4093],
         ...,
         [ 0.6798, -1.8578, -0.3116,  ...,  0.0973,  0.0394, -0.4506],
         [ 0.7099, -1.89

Computing prompt embeddings:  95%|█████████▍| 2082/2200 [01:11<00:04, 28.52it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.8301, -0.1343,  0.1393,  ...,  0.2781, -0.4305,  0.1571],
         [ 2.1219,  0.7889,  0.3013,  ..., -0.7244, -0.4014,  0.2146],
         ...,
         [ 0.0418, -1.4929, -1.2053,  ..., -0.6176,  0.1074,  0.1382],
         [ 0.0515, -1.5225, -1.1776,  ..., -0.6374,  0.1230,  0.1335],
         [ 0.0211, -1.5431, -1.1611,  ..., -0.6446,  0.1405,  0.1289]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.8301, -0.1343,  0.1393,  ...,  0.2781, -0.4305,  0.1571],
         [ 2.1219,  0.7889,  0.3013,  ..., -0.7244, -0.4014,  0.2146],
         ...,
         [ 0.2878, -0.3946,  0.2197,  ..., -0.6652,  0.4391,  0.1772],
         [ 0.2901, -0.41

Computing prompt embeddings:  95%|█████████▌| 2100/2200 [01:12<00:03, 29.59it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.8301, -0.1343,  0.1393,  ...,  0.2781, -0.4305,  0.1571],
         [ 2.1219,  0.7889,  0.3013,  ..., -0.7244, -0.4014,  0.2146],
         ...,
         [ 2.2975,  0.7758,  1.5735,  ...,  0.0467,  0.0950,  0.2443],
         [ 0.9644,  0.8368,  1.2140,  ..., -0.7687, -0.8099, -0.3026],
         [-0.0049, -0.0125, -0.3437,  ..., -1.1448, -1.1352, -0.1932]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.8301, -0.1343,  0.1393,  ...,  0.2781, -0.4305,  0.1571],
         [ 2.1219,  0.7889,  0.3013,  ..., -0.7244, -0.4014,  0.2146],
         ...,
         [ 0.6608,  1.0550,  0.8369,  ..., -0.9909, -0.1092,  0.2129],
         [-0.4805, -0.46

Computing prompt embeddings:  97%|█████████▋| 2132/2200 [01:13<00:02, 31.16it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.9884,  0.7219,  0.2865,  ..., -0.2374,  0.0594, -0.4277],
         ...,
         [ 2.1098, -0.3841, -0.6272,  ..., -0.4970,  1.0817, -0.8276],
         [ 1.1883, -1.7605, -0.5133,  ..., -1.4833,  0.0296,  0.0193],
         [ 1.2181, -1.8109, -0.6003,  ..., -1.5008,  0.0434,  0.0692]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.9884,  0.7219,  0.2865,  ..., -0.2374,  0.0594, -0.4277],
         ...,
         [ 1.4038, -2.0780, -0.4433,  ..., -1.0512,  0.4866,  0.2853],
         [ 1.4135, -2.11

Computing prompt embeddings:  98%|█████████▊| 2150/2200 [01:13<00:01, 31.44it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.9884,  0.7219,  0.2865,  ..., -0.2374,  0.0594, -0.4277],
         ...,
         [ 1.4363, -2.0461, -0.1567,  ..., -1.5876,  0.4781,  0.1282],
         [ 1.4324, -2.0398, -0.1372,  ..., -1.4345,  0.5652,  0.1195],
         [ 1.3569, -1.9678, -0.1568,  ..., -1.4598,  0.3698,  0.3304]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 0.8184, -0.9818, -0.3273,  ...,  0.2208,  0.5668, -1.2833],
         [ 0.9884,  0.7219,  0.2865,  ..., -0.2374,  0.0594, -0.4277],
         ...,
         [ 0.2404, -1.0340, -0.7526,  ..., -2.0526,  0.5163, -0.2596],
         [ 0.2922, -1.07

Computing prompt embeddings:  99%|█████████▉| 2182/2200 [01:14<00:00, 31.90it/s]


ERROR in extract_text_embeddings_batch: Could not extract 512-dim projected CLIP features from model output type: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>. Object: BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.8301, -0.1343,  0.1393,  ...,  0.2781, -0.4305,  0.1571],
         [ 2.1219,  0.7889,  0.3013,  ..., -0.7244, -0.4014,  0.2146],
         ...,
         [ 1.8881, -0.8357, -0.4620,  ..., -2.0723,  0.1130, -0.4143],
         [ 1.9292, -0.8542, -0.4834,  ..., -2.1034,  0.1048, -0.4125],
         [ 1.8755, -0.9341, -0.3880,  ..., -2.2393,  0.1913, -0.3201]],

        [[ 0.3393,  0.1165,  0.1020,  ...,  0.2468,  0.5906,  0.1013],
         [ 1.8301, -0.1343,  0.1393,  ...,  0.2781, -0.4305,  0.1571],
         [ 2.1219,  0.7889,  0.3013,  ..., -0.7244, -0.4014,  0.2146],
         ...,
         [ 1.9824, -0.8528,  0.0480,  ..., -1.4589, -0.1277, -0.1661],
         [ 2.0045, -0.89

Computing prompt embeddings: 100%|██████████| 2200/2200 [01:15<00:00, 29.25it/s]





================================================================

STEP 7: IMAGE EMBEDDING EXTRACTION (WITH PREPROCESSING)


    Extract CLIP embedding with comprehensive preprocessing.
    
    Pipeline:
    1. Validate image
    2. Preprocess (EXIF, RGB conversion, aspect ratio)
    3. CLIP processor (normalization, tensor conversion)
    4. Extract embedding
    5. L2 normalization
    
    FIXED: Properly handles CLIP output tensors


================================================================
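The L2 normalization in step 5 of the pipeline is what allows the later cells to score prompts with a plain `np.dot`: for unit-length vectors the dot product equals the cosine similarity. The following is a minimal sketch of that property, not the notebook's own implementation. The helper name `l2_normalize`, the `eps` guard, and the random 512-dim vectors are assumptions made for illustration, and it assumes the prompt embeddings are normalized the same way.

```python
import numpy as np


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; eps guards against an all-zero vector."""
    vec = np.asarray(vec, dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / max(norm, eps)


# Hypothetical stand-ins for a 512-dim image embedding and prompt embedding.
rng = np.random.default_rng(0)
image_vec = l2_normalize(rng.normal(size=512))
prompt_vec = l2_normalize(rng.normal(size=512))

# For unit vectors, the dot product and the cosine similarity coincide.
dot = float(np.dot(image_vec, prompt_vec))
cosine = float(
    np.dot(image_vec, prompt_vec)
    / (np.linalg.norm(image_vec) * np.linalg.norm(prompt_vec))
)
print(f"dot={dot:.4f}  cosine={cosine:.4f}")
```

Normalizing once at extraction time avoids recomputing the norms for every image and prompt pair during matching.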

In [50]:
def extract_prompts_for_image(image_embedding, theme, prompt_embeddings, top_k=2):
    """Extract top prompts using theme-first approach."""

    if theme not in prompt_embeddings:
        return {}

    theme_prompts = prompt_embeddings[theme]
    extracted_prompts = {}

    for category, prompts in theme_prompts.items():
        scores = []

        for prompt_data in prompts:
            score = np.dot(image_embedding, prompt_data['embedding'])
            scores.append((prompt_data['text'], float(score)))

        top_prompts = sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
        filtered = [(text, score) for text, score in top_prompts if score > 0.40]

        if filtered:
            extracted_prompts[category] = [
                {'text': text, 'score': score} for text, score in filtered
            ]

    return extracted_prompts



def process_all_images(metadata, landmarks_path, model, processor, preprocessor, device):
    """Process all images with preprocessing."""

    image_embeddings = {}
    image_metadata_list = {}

    total_images = metadata['total_images']
    print(f"Processing {total_images} images...")

    stats = {'processed': 0, 'failed': 0}

    with tqdm(total=total_images, desc="Extracting embeddings") as pbar:
        for theme in metadata['themes']:
            theme_name = theme['theme_name']

            for state in theme['states']:
                state_name = state['state_name']

                for destination in state['destinations']:
                    dest_id = destination['destination_id']
                    dest_folder = destination['folder']

                    for img_filename in destination['images']:
                        # splitext strips only the trailing extension, whatever its type or case
                        image_id = f"{dest_id}_{os.path.splitext(img_filename)[0]}"
                        image_path = os.path.join(landmarks_path, dest_folder, img_filename)

                        embedding = extract_image_embedding(image_path, model, processor, preprocessor, device)

                        if embedding is not None:
                            image_embeddings[image_id] = embedding

                            image_metadata_list[image_id] = {
                                'image_path': image_path,
                                'filename': img_filename,
                                'destination_id': dest_id,
                                'destination_name': destination['destination_name'],
                                'theme': theme_name,
                                'state': state_name,
                                'folder': dest_folder
                            }
                            stats['processed'] += 1
                        else:
                            stats['failed'] += 1

                        pbar.update(1)

    print(f"\nProcessed: {stats['processed']}")
    if stats['failed'] > 0:
        print(f"Failed: {stats['failed']}")

    return image_embeddings, image_metadata_list, stats


image_embeddings, image_metadata_dict, validation_stats = process_all_images(
    metadata, LANDMARKS_PATH, model, processor, preprocessor, device
)

print("="*80)


Processing 215 images...


Extracting embeddings: 100%|██████████| 215/215 [00:53<00:00,  4.02it/s]


Processed: 215
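The per-prompt loop in `extract_prompts_for_image` can also be written as a single matrix-vector product per category. The sketch below assumes each prompt entry has the same `{'text', 'embedding'}` shape as in the cell above and that all embeddings are unit-normalized. The function name `top_prompts_for_category`, the `min_score` parameter (the cell hard-codes 0.40), and the toy 3-dim vectors are hypothetical.

```python
import numpy as np


def top_prompts_for_category(image_embedding, prompts, top_k=2, min_score=0.40):
    """Score one category's prompts against an image and keep the best matches.

    `prompts` is a list of {'text': str, 'embedding': array} entries.
    """
    if not prompts:
        return []
    matrix = np.stack([p['embedding'] for p in prompts])  # (num_prompts, dim)
    scores = matrix @ image_embedding                      # (num_prompts,)
    order = np.argsort(scores)[::-1][:top_k]               # best scores first
    return [
        {'text': prompts[i]['text'], 'score': float(scores[i])}
        for i in order
        if scores[i] > min_score
    ]


# Toy unit vectors standing in for real CLIP embeddings.
image = np.array([1.0, 0.0, 0.0])
toy_prompts = [
    {'text': 'a hilltop fort', 'embedding': np.array([0.9, 0.435889894, 0.0])},
    {'text': 'a sandy beach', 'embedding': np.array([0.0, 1.0, 0.0])},
    {'text': 'a stone temple', 'embedding': np.array([0.6, 0.0, 0.8])},
]
result = top_prompts_for_category(image, toy_prompts)
print(result)
```

With 215 images the loop version is already fast, so this mainly matters if the image set or prompt library grows.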





===========================================================================================================

STEP 8: PROMPT EXTRACTION FOR ALL IMAGES

===========================================================================================================

In [54]:
def process_all_prompts(image_embeddings, image_metadata_dict, prompt_embeddings):
    """Extract prompts for all images."""

    all_image_prompts = {}

    print(f"Extracting prompts for {len(image_embeddings)} images...")

    with tqdm(total=len(image_embeddings), desc="Extracting prompts") as pbar:
        for image_id, embedding in image_embeddings.items():
            # Named image_meta so it does not shadow the notebook-level `metadata`
            image_meta = image_metadata_dict[image_id]
            theme = image_meta['theme']

            prompts = extract_prompts_for_image(embedding, theme, prompt_embeddings, top_k=2)

            all_image_prompts[image_id] = {
                'image_id': image_id,
                'theme': theme,
                'destination_id': image_meta['destination_id'],
                'extracted_prompts': prompts,
                'total_prompts_extracted': sum(len(p) for p in prompts.values())
            }

            pbar.update(1)

    print(f"Extracted prompts for {len(all_image_prompts)} images")
    return all_image_prompts


all_image_prompts = process_all_prompts(image_embeddings, image_metadata_dict, prompt_embeddings)

print("="*80)


Extracting prompts for 215 images...


Extracting prompts: 100%|██████████| 215/215 [00:00<00:00, 51796.40it/s]

Extracted prompts for 215 images
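Because `extract_prompts_for_image` returns an empty dict when the theme is missing from `prompt_embeddings` or when no score clears the threshold, the count printed above includes images that ended up with no prompts at all. A quick coverage check along the following lines could make that visible. This is a sketch only: the function name `summarize_prompt_coverage` and the toy records are hypothetical, and it relies solely on the `theme` and `total_prompts_extracted` fields built in `process_all_prompts`.

```python
from collections import Counter


def summarize_prompt_coverage(all_image_prompts):
    """Count, per theme, how many images did and did not receive any prompts."""
    with_prompts = Counter()
    without_prompts = Counter()
    for record in all_image_prompts.values():
        if record['total_prompts_extracted'] > 0:
            with_prompts[record['theme']] += 1
        else:
            without_prompts[record['theme']] += 1
    return {
        'with_prompts': dict(with_prompts),
        'without_prompts': dict(without_prompts),
    }


# Toy records shaped like the entries of all_image_prompts.
toy_records = {
    'd1_img1': {'theme': 'heritage', 'total_prompts_extracted': 3},
    'd1_img2': {'theme': 'heritage', 'total_prompts_extracted': 0},
    'd2_img1': {'theme': 'beach', 'total_prompts_extracted': 0},
}
summary = summarize_prompt_coverage(toy_records)
print(summary)
```

A theme that appears only under `without_prompts` suggests either missing prompt embeddings for that theme or a threshold that is too strict for it.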





===========================================================================================================

STEP 9: AGGREGATE PER DESTINATION

===========================================================================================================
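Two small calculations drive the aggregation in this step: the mean of a destination's unit embeddings is re-normalized (an average of unit vectors is generally shorter than unit length), and each prompt is ranked by 0.6 × average score + 0.4 × frequency across the destination's images. The sketch below works both through on toy numbers. The helper names `mean_unit_embedding` and `weighted_prompt_score`, the zero-norm guard, and the sample values are illustrative assumptions; the 0.6 and 0.4 weights come from the cell that follows.

```python
import numpy as np


def mean_unit_embedding(embeddings):
    """Average a list of embeddings and rescale the result to unit length."""
    avg = np.mean(np.asarray(embeddings, dtype=np.float64), axis=0)
    norm = np.linalg.norm(avg)
    return avg / norm if norm > 0 else avg


def weighted_prompt_score(scores, num_images, score_weight=0.6, freq_weight=0.4):
    """Blend a prompt's mean similarity with how often it appears per image."""
    avg_score = float(np.mean(scores))
    frequency = len(scores) / num_images
    return score_weight * avg_score + freq_weight * frequency


# Two orthogonal unit vectors: their plain mean has norm sqrt(0.5), not 1.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
raw_mean_norm = float(np.linalg.norm(np.mean([a, b], axis=0)))
dest_embedding = mean_unit_embedding([a, b])

# A prompt that matched 2 of a destination's 4 images, with scores 0.50 and 0.44:
# 0.6 * 0.47 + 0.4 * 0.5 = 0.482
score = weighted_prompt_score([0.50, 0.44], num_images=4)
print(f"raw mean norm={raw_mean_norm:.4f}  weighted score={score:.3f}")
```

The frequency term rewards prompts that recur across a destination's images, so a single high-scoring outlier image does not dominate the destination profile.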

In [56]:
print("\n" + "="*80)
print("STEP 9: AGGREGATING BY DESTINATION")
print("="*80)

def aggregate_destination_embeddings(image_embeddings, image_metadata_dict):
    """Aggregate embeddings per destination."""

    destination_embeddings = {}
    destination_image_embeddings = {}

    for image_id, embedding in image_embeddings.items():
        dest_id = image_metadata_dict[image_id]['destination_id']

        if dest_id not in destination_image_embeddings:
            destination_image_embeddings[dest_id] = []

        destination_image_embeddings[dest_id].append(embedding)

    for dest_id, embeddings in destination_image_embeddings.items():
        embeddings_array = np.array(embeddings)
        avg_embedding = np.mean(embeddings_array, axis=0)
        avg_embedding = avg_embedding / np.linalg.norm(avg_embedding)

        destination_embeddings[dest_id] = {
            'average_embedding': avg_embedding,
            'individual_embeddings': embeddings_array,
            'num_images': len(embeddings)
        }

    print(f"Aggregated {len(destination_embeddings)} destinations")
    return destination_embeddings


def aggregate_destination_prompts(all_image_prompts):
    """Aggregate prompts with weighted scoring."""

    destination_prompts = {}
    dest_groups = {}

    for image_id, data in all_image_prompts.items():
        dest_id = data['destination_id']
        if dest_id not in dest_groups:
            dest_groups[dest_id] = []
        dest_groups[dest_id].append(data)

    for dest_id, images_data in dest_groups.items():
        num_images = len(images_data)
        category_prompts = {}

        for img_data in images_data:
            for category, prompts in img_data['extracted_prompts'].items():
                if category not in category_prompts:
                    category_prompts[category] = []
                for prompt in prompts:
                    category_prompts[category].append(prompt)

        aggregated = {}

        for category, prompts in category_prompts.items():
            prompt_stats = {}

            for prompt in prompts:
                text = prompt['text']
                score = prompt['score']

                if text not in prompt_stats:
                    prompt_stats[text] = {'scores': [], 'count': 0}

                prompt_stats[text]['scores'].append(score)
                prompt_stats[text]['count'] += 1

            weighted_prompts = []
            for text, stats in prompt_stats.items():
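                # Mean similarity over the images where this prompt was extracted
                avg_score = np.mean(stats['scores'])
                # Fraction of this destination's images in which the prompt appears
                frequency = stats['count'] / num_images
                # Blend similarity (60%) with frequency (40%)
                weighted_score = (avg_score * 0.6) + (frequency * 0.4)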

                weighted_prompts.append({
                    'text': text,
                    'avg_score': float(avg_score),
                    'frequency': float(frequency),
                    'weighted_score': float(weighted_score),
                    'appearances': stats['count']
                })

            # Keep the two highest-weighted prompts per category
            top_prompts = sorted(weighted_prompts, key=lambda x: x['weighted_score'], reverse=True)[:2]

            if top_prompts:
                aggregated[category] = top_prompts

        destination_prompts[dest_id] = {
            'destination_id': dest_id,
            'num_images': num_images,
            'aggregated_prompts': aggregated,
            'dominant_characteristics': {
                cat: prompts[0]['text'] for cat, prompts in aggregated.items() if prompts
            }
        }

    print(f"Aggregated prompts for {len(destination_prompts)} destinations")
    return destination_prompts


destination_embeddings = aggregate_destination_embeddings(image_embeddings, image_metadata_dict)
destination_prompts = aggregate_destination_prompts(all_image_prompts)

print("="*80)


STAGE 9: AGGREGATING BY DESTINATION
Aggregated 47 destinations
Aggregated prompts for 47 destinations
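
The prompt ranking above blends two signals: the mean similarity score of a prompt over the images where it was extracted (weight 0.6) and the fraction of the destination's images in which it appears (weight 0.4). A small worked sketch with invented scores shows how a frequent prompt can outrank a rarer one with a higher similarity. The numbers and the helper name `weighted_prompt_score` are hypothetical; only the formula comes from the cell above.

```python
def weighted_prompt_score(scores, num_images, w_score=0.6, w_freq=0.4):
    """Blend mean similarity with appearance frequency, as in Stage 9."""
    avg_score = sum(scores) / len(scores)
    frequency = len(scores) / num_images
    return w_score * avg_score + w_freq * frequency

num_images = 5  # images for one hypothetical destination

# Prompt seen in 1 of 5 images with a high similarity score.
rare = weighted_prompt_score([0.30], num_images)
# Prompt seen in 4 of 5 images with lower similarity scores.
common = weighted_prompt_score([0.22, 0.24, 0.26, 0.24], num_images)

print(f"rare:   {rare:.3f}")    # 0.6*0.30 + 0.4*0.2 = 0.260
print(f"common: {common:.3f}")  # 0.6*0.24 + 0.4*0.8 = 0.464
```

If the stored scores are raw cosine similarities, which for CLIP tend to fall in a narrow band, the frequency term spans a much wider range (0 to 1) and can dominate the ranking. Rescaling the scores per category before blending is one option worth considering.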


==========================================================================================================
STEP 10: SAVE ALL DATA

===========================================================================================================

In [57]:
print("\n" + "="*80)
print("STAGE 10: SAVING DATA")
print("="*80)

print("Saving embeddings...")

# Fix the ordering once: row i of embeddings_array belongs to image_ids[i]
image_ids = list(image_embeddings.keys())
embeddings_array = np.array([image_embeddings[img_id] for img_id in image_ids])

destination_ids = list(destination_embeddings.keys())
dest_avg_embs = np.array([destination_embeddings[d_id]['average_embedding'] for d_id in destination_ids])

np.savez(
    f'{EMBEDDINGS_PATH}/all_embeddings.npz',
    image_ids=image_ids,
    image_embeddings=embeddings_array,
    destination_ids=destination_ids,
    destination_embeddings=dest_avg_embs
)

print(f"all_embeddings.npz ({embeddings_array.shape})")

embedding_index = {
    'image_index': {img_id: idx for idx, img_id in enumerate(image_ids)},
    'destination_index': {d_id: idx for idx, d_id in enumerate(destination_ids)},
    'metadata': {
        'created_date': datetime.now().isoformat(),
        'total_images': len(image_ids),
        'total_destinations': len(destination_ids),
        'embedding_dim': embeddings_array.shape[1],
        'model_name': model_name,
        'preprocessing_enabled': True,
        'validation_stats': validation_stats,
        'prompt_validation': validation_results
    }
}

with open(f'{EMBEDDINGS_PATH}/embedding_index.json', 'w') as f:
    json.dump(embedding_index, f, indent=2)

print("embedding_index.json")

with open(f'{PROMPTS_PATH}/image_prompts.json', 'w') as f:
    json.dump(all_image_prompts, f, indent=2)

print("image_prompts.json")

with open(f'{PROMPTS_PATH}/destination_prompts.json', 'w') as f:
    json.dump(destination_prompts, f, indent=2)

print("destination_prompts.json")

with open(f'{EMBEDDINGS_PATH}/destination_embeddings_detailed.pkl', 'wb') as f:
    pickle.dump(destination_embeddings, f)

print("destination_embeddings_detailed.pkl")

print("="*80)


STAGE 10: SAVING DATA
Saving embeddings...
all_embeddings.npz ((215, 768))
embedding_index.json
image_prompts.json
destination_prompts.json
destination_embeddings_detailed.pkl
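
The saved files can be consumed without rerunning the pipeline. The sketch below writes a tiny synthetic archive using the same destination keys as `all_embeddings.npz`, reloads it, and ranks destinations by cosine similarity to a query vector. The IDs, the 4-dimensional vectors, and the helper name `top_k_destinations` are placeholders; a real query would be a CLIP embedding with the same dimension as the stored ones.

```python
import os
import tempfile

import numpy as np

def top_k_destinations(query, dest_ids, dest_embs, k=3):
    """Rank destinations by cosine similarity to a query vector."""
    query = query / np.linalg.norm(query)
    # Stored destination vectors are unit length, so dot product = cosine.
    sims = dest_embs @ query
    order = np.argsort(-sims)[:k]
    return [(str(dest_ids[i]), float(sims[i])) for i in order]

# Build a synthetic archive with the same keys as the notebook's file.
rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 4))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

path = os.path.join(tempfile.mkdtemp(), "all_embeddings.npz")
np.savez(path,
         destination_ids=["dest_a", "dest_b", "dest_c"],
         destination_embeddings=embs)

with np.load(path) as data:
    dest_ids = data["destination_ids"]
    dest_embs = data["destination_embeddings"]

# Querying with a stored vector should return that destination first.
results = top_k_destinations(embs[1], dest_ids, dest_embs, k=2)
print(results[0][0])  # dest_b
```

Because the ID arrays are saved as plain string arrays, `np.load` can read them with its default `allow_pickle=False` setting.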


==========================================================================================================
STEP 11: UPDATE METADATA

===========================================================================================================

In [61]:
print("\n" + "="*80)
print("STAGE 11: UPDATING METADATA")
print("="*80)

metadata['pipeline_status']['embeddings_computed'] = True
metadata['pipeline_status']['prompts_extracted'] = True
metadata['pipeline_status']['prompts_validated'] = True
metadata['last_updated'] = datetime.now().isoformat()
metadata['vl_encoding_version'] = '4.1'

for theme in metadata['themes']:
    for state in theme['states']:
        for destination in state['destinations']:
            dest_id = destination['destination_id']

            if dest_id in destination_embeddings:
                destination['embeddings_computed'] = True
                destination['prompts_extracted'] = True

                destination['embedding_references'] = {
                    'embeddings_file': 'vl_encoding/embeddings/all_embeddings.npz',
                    'destination_index': embedding_index['destination_index'][dest_id]
                }

                if dest_id in destination_prompts:
                    destination['dominant_prompts'] = destination_prompts[dest_id]['dominant_characteristics']

with open(METADATA_PATH, 'w') as f:
    json.dump(metadata, f, indent=2)

print("Updated metadata.json")
print("="*80)


STAGE 11: UPDATING METADATA
Updated metadata.json
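
The update loop above walks a three-level hierarchy: themes contain states, which contain destinations. When the same traversal is needed elsewhere, a generator keeps the nesting in one place. This sketch assumes only the keys visible in the cell above (`themes`, `states`, `destinations`, `destination_id`); the sample record and the name `iter_destinations` are hypothetical.

```python
def iter_destinations(metadata):
    """Yield every destination dict in the themes > states > destinations tree."""
    for theme in metadata.get("themes", []):
        for state in theme.get("states", []):
            yield from state.get("destinations", [])

# Hypothetical miniature metadata with the same nesting as metadata.json.
sample = {
    "themes": [
        {"states": [
            {"destinations": [{"destination_id": "d1"}, {"destination_id": "d2"}]},
            {"destinations": [{"destination_id": "d3"}]},
        ]},
        {"states": []},
    ]
}

computed = {"d1", "d3"}  # stand-in for the keys of destination_embeddings

for dest in iter_destinations(sample):
    # Mutating the yielded dict updates the original structure in place.
    # Unlike the notebook cell, this sketch also records False explicitly.
    dest["embeddings_computed"] = dest["destination_id"] in computed

missing = [d["destination_id"] for d in iter_destinations(sample)
           if not d["embeddings_computed"]]
print(missing)  # ['d2']
```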


==========================================================================================================
STEP 12: FINAL SUMMARY

===========================================================================================================

In [59]:
print("\n" + "="*80)
print("SUCCESS! VL ENCODING COMPLETE")
print("="*80)

print(f"\nImages processed: {len(image_embeddings)}")
print(f"Destinations: {len(destination_embeddings)}")
print(f"Embedding dimension: {embeddings_array.shape[1]}")
print(f"Model: {model_name}")

print(f"\nPrompt Validation:")
print(f"  Total prompts: {validation_results['total_prompts']}")
print(f"  Valid prompts: {validation_results['valid_prompts']}")
print(f"  Issues found: {validation_results['prompts_with_issues']}")

print("\nOutput Files:")
print(f"  {EMBEDDINGS_PATH}/all_embeddings.npz")
print(f"  {EMBEDDINGS_PATH}/embedding_index.json")
print(f"  {EMBEDDINGS_PATH}/destination_embeddings_detailed.pkl")
print(f"  {PROMPTS_PATH}/image_prompts.json")
print(f"  {PROMPTS_PATH}/destination_prompts.json")
print(f"  {REPORTS_PATH}/prompt_validation_report.txt")

print("\nCategory-Based Prompts Available:")
categories = set()
for img_prompts in all_image_prompts.values():
    categories.update(img_prompts['extracted_prompts'].keys())
print(f"  Categories extracted: {', '.join(sorted(categories))}")

print("\nReady for Category-Aware Matching!")
print("="*80)


SUCCESS! VL ENCODING COMPLETE

Images processed: 215
Destinations: 47
Embedding dimension: 768
Model: openai/clip-vit-base-patch32

Prompt Validation:
  Total prompts: 2200
  Valid prompts: 2200
  Issues found: 0

Output Files:
  /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/embeddings/all_embeddings.npz
  /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/embeddings/embedding_index.json
  /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/embeddings/destination_embeddings_detailed.pkl
  /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/prompts/image_prompts.json
  /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/prompts/destination_prompts.json
  /content/drive/MyDrive/visual-intelligence-travel-finance/data/vl_encoding/reports/prompt_validation_report.txt

Category-Based Prompts Available:
  Categories extracted: 

Ready for Category-Aware Matching!
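
The `Categories extracted:` line in the output above is empty, which is what the summary prints when no image's `extracted_prompts` dictionary contains a category. A small check along these lines could surface that condition before the success banner is shown. It assumes the `all_image_prompts` structure used above; the sample data and the name `summarize_prompt_coverage` are illustrative.

```python
def summarize_prompt_coverage(all_image_prompts):
    """Count images with and without extracted prompts, and list categories."""
    categories = set()
    empty_ids = []
    for image_id, data in all_image_prompts.items():
        extracted = data.get("extracted_prompts", {})
        # An image counts as empty if it has no categories or only empty lists.
        if not any(extracted.values()):
            empty_ids.append(image_id)
        categories.update(cat for cat, prompts in extracted.items() if prompts)
    return {
        "total_images": len(all_image_prompts),
        "images_without_prompts": len(empty_ids),
        "categories": sorted(categories),
    }

# Hypothetical data: one image with prompts, one without.
sample = {
    "img_001": {"extracted_prompts": {
        "architecture": [{"text": "a stone temple", "score": 0.27}]}},
    "img_002": {"extracted_prompts": {}},
}

report = summarize_prompt_coverage(sample)
print(report)
if report["images_without_prompts"] == report["total_images"]:
    print("Warning: no prompts were extracted for any image")
```

If every image turns out to be empty, the extraction step's similarity threshold or category mapping would be the first places to inspect.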
