<a href="https://colab.research.google.com/github/ashwin-yedte/visual-intelligence-travel-finance/blob/main/notebooks/Step_2_Theme_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

================================================================================
STEP 2: THEME EXTRACTION AND AGGREGATION
================================================================================
Project: Visual Intelligence for Travel and Finance Optimization

Course: AIMLCZG628T

Author: Ashwin Kumar Y (2023AC05628)

Institution: BITS Pilani

Version: 1.0.0

Date: January 2026


Dependencies: Requires Step 1 output (step1_analysis_results.json)
================================================================================

In [11]:

print("="*80)
print("STEP 2: THEME EXTRACTION AND AGGREGATION")
print("="*80)
print("\nProject: Visual Intelligence for Travel and Finance Optimization")
print("Author: Ashwin Kumar Y (2023AC05628)")
print("Version: 1.0.0")
print("\n" + "="*80)
print("SETUP: Installing required packages")
print("="*80)

# Install dependencies (minimal for Step 2)
print("Installing numpy...")
!pip install -q numpy

print("Installing matplotlib...")
!pip install -q matplotlib

print("\n" + "="*80)
print("SETUP COMPLETE: All packages installed")
print("="*80)

STEP 2: THEME EXTRACTION AND AGGREGATION

Project: Visual Intelligence for Travel and Finance Optimization
Author: Ashwin Kumar Y (2023AC05628)
Version: 1.0.0

SETUP: Installing required packages
Installing numpy...
Installing matplotlib...

SETUP COMPLETE: All packages installed


**Import Libraries**

In [12]:

print("="*80)
print("IMPORTS: Loading required libraries")
print("="*80)

# Core Python libraries
import json
import os
from typing import Dict, List, Tuple, Any
from collections import Counter, defaultdict
from datetime import datetime

# Numerical computing
import numpy as np

# Visualization
import matplotlib.pyplot as plt

# Google Colab specific
from google.colab import files

print("\nLibrary versions:")
print(f"  NumPy: {np.__version__}")

print("\n" + "="*80)
print("IMPORTS COMPLETE: All libraries loaded successfully")
print("="*80)

IMPORTS: Loading required libraries

Library versions:
  NumPy: 2.0.2

IMPORTS COMPLETE: All libraries loaded successfully


**Configuration**

In [13]:

print("="*80)
print("CONFIGURATION: Setting up system parameters")
print("="*80)

class Config:
    """
    Configuration for Step 2: Theme Extraction
    """

    # Input/Output files
    INPUT_JSON_FILE = "step1_analysis_results.json"
    OUTPUT_JSON_FILE = "step2_theme_extraction_results.json"
    OUTPUT_SUMMARY_FILE = "step2_theme_summary.txt"
    OUTPUT_VISUALIZATION_FILE = "step2_theme_visualization.png"

    # Theme extraction thresholds
    HIGH_CONSISTENCY_THRESHOLD = 0.70  # 70% of images
    MEDIUM_CONSISTENCY_THRESHOLD = 0.40  # 40% of images
    LOW_CONSISTENCY_THRESHOLD = 0.20  # 20% of images

    # Score thresholds
    HIGH_SCORE_THRESHOLD = 0.75
    MEDIUM_SCORE_THRESHOLD = 0.60
    LOW_SCORE_THRESHOLD = 0.45

    # Outlier detection
    OUTLIER_STD_MULTIPLIER = 2.0  # Number of std deviations for outlier
    MIN_IMAGES_FOR_OUTLIER_DETECTION = 3

    # Theme selection
    MAX_PRIMARY_THEMES = 2
    MAX_SECONDARY_THEMES = 3

    # Visualization
    VISUALIZATION_DPI = 150

    # Metadata
    VERSION = "1.0.0"
    STEP_NAME = "Step 2: Theme Extraction and Aggregation"
    AUTHOR = "Ashwin Kumar Y (2023AC05628)"
    PROJECT = "Visual Intelligence for Travel and Finance Optimization"

    @classmethod
    def display_config(cls):
        """Display current configuration"""
        print("\nCurrent Configuration:")
        print(f"  Input file: {cls.INPUT_JSON_FILE}")
        print(f"  High consistency threshold: {cls.HIGH_CONSISTENCY_THRESHOLD*100}%")
        print(f"  Medium consistency threshold: {cls.MEDIUM_CONSISTENCY_THRESHOLD*100}%")
        print(f"  Max primary themes: {cls.MAX_PRIMARY_THEMES}")
        print(f"  Max secondary themes: {cls.MAX_SECONDARY_THEMES}")

# Display configuration
Config.display_config()

print("\n" + "="*80)
print("CONFIGURATION COMPLETE: System parameters set")
print("="*80)

CONFIGURATION: Setting up system parameters

Current Configuration:
  Input file: step1_analysis_results.json
  High consistency threshold: 70.0%
  Medium consistency threshold: 40.0%
  Max primary themes: 2
  Max secondary themes: 3

CONFIGURATION COMPLETE: System parameters set
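
The consistency thresholds in `Config` are fractions of the image set. As a rough sketch of how they partition a batch, the snippet below maps "theme appears in k of n images" to a band. The `consistency_band` helper and the `"none"` label are hypothetical and not part of this notebook; the notebook's own `_compute_consistency_level` only uses the high and medium cut-offs.

```python
# Threshold values copied from Config; the helper itself is illustrative only.
HIGH_CONSISTENCY_THRESHOLD = 0.70
MEDIUM_CONSISTENCY_THRESHOLD = 0.40
LOW_CONSISTENCY_THRESHOLD = 0.20


def consistency_band(appears_in: int, total_images: int) -> str:
    """Map 'theme appears in k of n images' to a band name."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    fraction = appears_in / total_images
    if fraction >= HIGH_CONSISTENCY_THRESHOLD:
        return "high"
    if fraction >= MEDIUM_CONSISTENCY_THRESHOLD:
        return "medium"
    if fraction >= LOW_CONSISTENCY_THRESHOLD:
        return "low"
    return "none"  # hypothetical label for themes below every threshold


# Show the band for every possible count in a 5-image batch
for k in range(6):
    print(f"{k}/5 images -> {consistency_band(k, 5)}")
```

With five images, a theme has to clear the score threshold in at least four of them to reach the high band, and in at least two to reach the medium band.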


================================================================================
LOAD STEP 1 ANALYSIS RESULTS
================================================================================
Purpose: Load and validate the JSON output from Step 1.
================================================================================

In [14]:
print("="*80)
print("DATA LOADING: Importing Step 1 analysis results")
print("="*80)

# Check if file exists in current directory
if not os.path.exists(Config.INPUT_JSON_FILE):
    print(f"\nERROR: Input file '{Config.INPUT_JSON_FILE}' not found")
    print("\nPlease upload the file from Step 1:")
    print("  1. You should have downloaded 'step1_analysis_results.json' from Step 1")
    print("  2. Click the 'Choose Files' button below to upload it")
    print("\n" + "="*80)

    # Trigger file upload
    print("Waiting for file upload...")
    uploaded = files.upload()

    if Config.INPUT_JSON_FILE in uploaded:
        print(f"\nFile '{Config.INPUT_JSON_FILE}' uploaded successfully!")

        # files.upload() normally saves the file already; writing it again is a harmless safeguard
        with open(Config.INPUT_JSON_FILE, 'wb') as f:
            f.write(uploaded[Config.INPUT_JSON_FILE])
    else:
        print(f"\nERROR: Expected file '{Config.INPUT_JSON_FILE}' but got:")
        for filename in uploaded.keys():
            print(f"  - {filename}")
        print("\nPlease rename your file to 'step1_analysis_results.json' and try again.")
        step1_data = None

# Load the JSON file
if os.path.exists(Config.INPUT_JSON_FILE):
    print(f"\nLoading data from: {Config.INPUT_JSON_FILE}")

    try:
        with open(Config.INPUT_JSON_FILE, 'r', encoding='utf-8') as f:
            step1_data = json.load(f)

        # Validate structure
        print("\nValidating data structure...")

        required_keys = ['success', 'num_images', 'per_image_analysis']
        missing_keys = [key for key in required_keys if key not in step1_data]

        if missing_keys:
            print(f"ERROR: Missing required keys: {missing_keys}")
            step1_data = None
        elif not step1_data.get('success', False):
            print(f"ERROR: Step 1 analysis was not successful")
            print(f"Reason: {step1_data.get('error', 'Unknown error')}")
            step1_data = None
        else:
            # Display summary
            print("\n" + "="*80)
            print("DATA LOADED SUCCESSFULLY")
            print("="*80)

            print(f"\nFile information:")
            file_size = os.path.getsize(Config.INPUT_JSON_FILE)
            print(f"  File size: {file_size/1024:.2f} KB")
            print(f"  Total images: {step1_data['num_images']}")

            # Count successful analyses
            successful_count = sum(
                1 for analysis in step1_data['per_image_analysis'].values()
                if analysis.get('success', False)
            )
            failed_count = step1_data['num_images'] - successful_count

            print(f"  Successful analyses: {successful_count}")
            print(f"  Failed analyses: {failed_count}")

            # Check for batch statistics
            has_batch_stats = 'batch_statistics' in step1_data and bool(step1_data['batch_statistics'])
            print(f"  Batch statistics available: {'Yes' if has_batch_stats else 'No'}")

            # Display image list
            print(f"\nImages in dataset:")
            for img_id, analysis in step1_data['per_image_analysis'].items():
                status = "SUCCESS" if analysis.get('success', False) else "FAILED"
                # str(... or 'N/A') guards against a missing, None, or non-string confidence value
                confidence = str(analysis.get('confidence') or 'N/A').upper() if analysis.get('success') else 'N/A'
                print(f"  - {img_id}: {status} (Confidence: {confidence})")

            # Display metadata if available
            if 'metadata' in step1_data:
                print(f"\nStep 1 metadata:")
                metadata = step1_data['metadata']
                print(f"  Generated: {metadata.get('generated_at', 'Unknown')}")
                print(f"  Version: {metadata.get('version', 'Unknown')}")
                if 'configuration' in metadata:
                    config = metadata['configuration']
                    print(f"  CLIP model: {config.get('clip_model', 'Unknown')}")
                    print(f"  Device used: {config.get('device_used', 'Unknown')}")

            print("\n" + "="*80)
            print("Data ready for theme extraction")
            print("="*80)

    except json.JSONDecodeError as e:
        print(f"\nERROR: Invalid JSON file")
        print(f"Details: {str(e)}")
        step1_data = None
    except Exception as e:
        print(f"\nERROR: Failed to load file")
        print(f"Details: {str(e)}")
        step1_data = None
else:
    step1_data = None

# Set flag for next cells
if step1_data is not None:
    data_loaded = True
    print("\nProceed to next cell for theme extraction.")
else:
    data_loaded = False
    print("\nPlease fix the errors above before proceeding.")

print("\n" + "="*80)


DATA LOADING: Importing Step 1 analysis results

Loading data from: step1_analysis_results.json

Validating data structure...

DATA LOADED SUCCESSFULLY

File information:
  File size: 29.99 KB
  Total images: 5
  Successful analyses: 5
  Failed analyses: 0
  Batch statistics available: Yes

Images in dataset:
  - Beach1: SUCCESS (Confidence: LOW)
  - Beach2: SUCCESS (Confidence: LOW)
  - Beach3: SUCCESS (Confidence: LOW)
  - Beach4: SUCCESS (Confidence: LOW)
  - Beach5: SUCCESS (Confidence: LOW)

Step 1 metadata:
  Generated: 2026-01-03T16:08:18.620134
  Version: 1.0.0
  CLIP model: openai/clip-vit-base-patch32
  Device used: cpu

Data ready for theme extraction

Proceed to next cell for theme extraction.
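
For reference, the loading cell only relies on a handful of keys in the Step 1 file. The sketch below builds a minimal dictionary with that shape and runs the same two checks (required keys, count of successful analyses). The key names are the ones this notebook reads; the image names, scores, and the per-image `error` text are made-up placeholders, and `missing_keys` / `count_successful` are hypothetical helpers, so treat it as an approximation of the Step 1 format rather than its definition.

```python
import json

REQUIRED_KEYS = ("success", "num_images", "per_image_analysis")

# Hypothetical sample: image names, scores, and the error text are made up.
sample_step1 = {
    "success": True,
    "num_images": 2,
    "per_image_analysis": {
        "Beach1": {
            "success": True,
            "confidence": "low",
            "scene_scores": {"tropical beach": 0.31, "rocky coastline": 0.22},
        },
        "Beach2": {"success": False, "error": "hypothetical failure"},
    },
    "batch_statistics": {},
    "metadata": {
        "generated_at": "2026-01-03T16:08:18",
        "version": "1.0.0",
        "configuration": {
            "clip_model": "openai/clip-vit-base-patch32",
            "device_used": "cpu",
        },
    },
}


def missing_keys(data: dict) -> list:
    """Return the required top-level keys that are absent."""
    return [key for key in REQUIRED_KEYS if key not in data]


def count_successful(data: dict) -> int:
    """Count per-image entries whose 'success' flag is true."""
    return sum(
        1 for analysis in data["per_image_analysis"].values()
        if analysis.get("success", False)
    )


# Round-trip through JSON text to mimic reading the file from disk
loaded = json.loads(json.dumps(sample_step1))
print("Missing keys:", missing_keys(loaded))
print("Successful analyses:", count_successful(loaded))
```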



================================================================================
THEME EXTRACTOR CLASS
================================================================================
Purpose: Core class for extracting and aggregating themes from Step 1 results.
         Handles consistency analysis, outlier detection, and theme ranking.
         
================================================================================

In [15]:
print("="*80)
print("THEME EXTRACTOR: Initializing theme extraction engine")
print("="*80)

class ThemeExtractor:
    """
    Extract and aggregate themes from multiple image analyses.

    This class takes the output from Step 1 and performs sophisticated
    theme extraction by:
    - Analyzing theme consistency across images
    - Detecting and handling outliers
    - Ranking themes by reliability
    - Classifying themes into primary and secondary
    - Computing confidence metrics

    The goal is to identify the user's true preferences by finding themes
    that consistently appear across multiple images, rather than themes
    that only appear in one image (which might be outliers).

    Attributes:
        step1_data (Dict): Complete output from Step 1
        num_images (int): Number of images analyzed
        successful_analyses (List): List of successful image analyses

    Example:
        >>> extractor = ThemeExtractor(step1_data)
        >>> themes = extractor.extract_themes()
        >>> print(themes['primary_themes'])
    """

    def __init__(self, step1_data: Dict[str, Any]):
        """
        Initialize theme extractor with Step 1 results.

        Args:
            step1_data: Dictionary containing Step 1 analysis results

        Raises:
            ValueError: If step1_data is invalid or empty
        """
        if not step1_data or not step1_data.get('success', False):
            raise ValueError("Invalid or unsuccessful Step 1 data provided")

        self.step1_data = step1_data
        self.num_images = step1_data['num_images']

        # Extract successful analyses only
        self.successful_analyses = [
            analysis for analysis in step1_data['per_image_analysis'].values()
            if analysis.get('success', False)
        ]

        if not self.successful_analyses:
            raise ValueError("No successful analyses found in Step 1 data")

        print(f"\nTheme extractor initialized:")
        print(f"  Total images: {self.num_images}")
        print(f"  Successful analyses: {len(self.successful_analyses)}")

    def extract_themes(self) -> Dict[str, Any]:
        """
        Main method to extract themes from Step 1 results.

        This orchestrates the complete theme extraction pipeline:
        1. Collect all scene scores across images
        2. Compute theme statistics (mean, std, consistency)
        3. Detect outliers
        4. Rank themes by reliability
        5. Select primary and secondary themes
        6. Compute the overall consistency level
        7. Compute overall confidence
        8. Generate a summary

        Returns:
            Dictionary containing:
            {
                'primary_themes': List[Dict],      # Up to MAX_PRIMARY_THEMES dominant themes (may be empty)
                'secondary_themes': List[Dict],    # Up to MAX_SECONDARY_THEMES supporting themes
                'all_theme_statistics': Dict,      # Complete stats for all themes
                'outliers_detected': List[Dict],   # Per-theme scores flagged as outliers
                'consistency_level': str,          # 'high', 'medium', 'low'
                'confidence': str,                 # Overall confidence level
                'summary': Dict,                   # Human-readable summary
                'num_images_analyzed': int         # Number of successful analyses used
            }

        Example:
            >>> themes = extractor.extract_themes()
            >>> for theme in themes['primary_themes']:
            ...     print(f"{theme['prompt']}: {theme['consistency']*100}%")
        """
        print("\n" + "="*80)
        print("THEME EXTRACTION PIPELINE")
        print("="*80)

        # Step 1: Collect scene scores
        print("\nStep 1: Collecting scene scores across all images...")
        all_scene_scores = self._collect_scene_scores()
        print(f"  Collected scores for {len(all_scene_scores)} unique themes")

        # Step 2: Compute statistics
        print("\nStep 2: Computing theme statistics...")
        theme_statistics = self._compute_theme_statistics(all_scene_scores)
        print(f"  Computed statistics for {len(theme_statistics)} themes")

        # Step 3: Detect outliers
        print("\nStep 3: Detecting outliers...")
        outliers = self._detect_outliers(theme_statistics)
        if outliers:
            print(f"  Found {len(outliers)} potential outlier(s)")
        else:
            print(f"  No outliers detected")

        # Step 4: Rank themes
        print("\nStep 4: Ranking themes by consistency and score...")
        ranked_themes = self._rank_themes(theme_statistics)
        print(f"  Themes ranked by reliability")

        # Step 5: Select primary and secondary themes
        print("\nStep 5: Selecting primary and secondary themes...")
        primary_themes = self._select_primary_themes(ranked_themes)
        secondary_themes = self._select_secondary_themes(ranked_themes, primary_themes)
        print(f"  Primary themes: {len(primary_themes)}")
        print(f"  Secondary themes: {len(secondary_themes)}")

        # Step 6: Compute consistency level
        print("\nStep 6: Computing overall consistency level...")
        consistency_level = self._compute_consistency_level(primary_themes)
        print(f"  Consistency level: {consistency_level.upper()}")

        # Step 7: Compute confidence
        print("\nStep 7: Computing overall confidence...")
        confidence = self._compute_overall_confidence(primary_themes, consistency_level)
        print(f"  Confidence: {confidence.upper()}")

        # Step 8: Generate summary
        print("\nStep 8: Generating summary...")
        summary = self._generate_summary(primary_themes, secondary_themes, consistency_level, confidence)

        print("\n" + "="*80)
        print("THEME EXTRACTION COMPLETE")
        print("="*80)

        # Return complete results
        return {
            'primary_themes': primary_themes,
            'secondary_themes': secondary_themes,
            'all_theme_statistics': theme_statistics,
            'outliers_detected': outliers,
            'consistency_level': consistency_level,
            'confidence': confidence,
            'summary': summary,
            'num_images_analyzed': len(self.successful_analyses)
        }

    def _collect_scene_scores(self) -> Dict[str, List[float]]:
        """
        Collect all scene scores from successful analyses.

        Creates a mapping from each theme/prompt to its scores across
        all images. This allows us to see how consistently each theme
        appears.

        Returns:
            Dictionary mapping theme to list of scores
            Example: {
                "tropical beach": [0.85, 0.82, 0.88],
                "rocky coastline": [0.32, 0.28, 0.35]
            }
        """
        all_scores = defaultdict(list)

        for analysis in self.successful_analyses:
            scene_scores = analysis.get('scene_scores', {})
            for prompt, score in scene_scores.items():
                all_scores[prompt].append(float(score))

        return dict(all_scores)

    def _compute_theme_statistics(self, all_scene_scores: Dict[str, List[float]]) -> Dict[str, Dict]:
        """
        Compute comprehensive statistics for each theme.

        For each theme, computes:
        - Mean and median score across images
        - Standard deviation (population std, a measure of variability)
        - Min and max scores
        - Appearance count (number of images scoring above
          Config.MEDIUM_SCORE_THRESHOLD)
        - Consistency (appearance count divided by the number of
          successfully analyzed images)

        Args:
            all_scene_scores: Dictionary mapping themes to score lists

        Returns:
            Dictionary with detailed statistics for each theme
        """
        statistics = {}
        threshold = Config.MEDIUM_SCORE_THRESHOLD

        for prompt, scores in all_scene_scores.items():
            scores_array = np.array(scores)

            # Count appearances above threshold
            appears_count = int(np.sum(scores_array > threshold))
            consistency = appears_count / len(self.successful_analyses)

            statistics[prompt] = {
                'mean_score': float(np.mean(scores_array)),
                'std_score': float(np.std(scores_array)),
                'min_score': float(np.min(scores_array)),
                'max_score': float(np.max(scores_array)),
                'median_score': float(np.median(scores_array)),
                'consistency': consistency,
                'appears_in_images': appears_count,
                'total_images': len(self.successful_analyses),
                'scores': scores  # Keep original scores for outlier detection
            }

        return statistics

    def _detect_outliers(self, theme_statistics: Dict[str, Dict]) -> List[Dict]:
        """
        Detect outlier scores within each theme.

        An outlier is a single image's score for a theme that lies far
        from that theme's mean, for example a very high score in one
        image while the others are low. Themes whose scores barely vary
        (std < 0.05) are skipped.

        Uses statistical method: values beyond mean ± (multiplier * std)

        Args:
            theme_statistics: Dictionary with theme statistics

        Returns:
            List of detected outliers with details
        """
        outliers = []

        # Only detect outliers if we have enough images
        if len(self.successful_analyses) < Config.MIN_IMAGES_FOR_OUTLIER_DETECTION:
            return outliers

        for prompt, stats in theme_statistics.items():
            scores = stats['scores']
            mean = stats['mean_score']
            std = stats['std_score']

            # Skip if std is too small (consistent theme)
            if std < 0.05:
                continue

            # Check for outlier scores
            threshold_high = mean + (Config.OUTLIER_STD_MULTIPLIER * std)
            threshold_low = mean - (Config.OUTLIER_STD_MULTIPLIER * std)

            for idx, score in enumerate(scores):
                if score > threshold_high or score < threshold_low:
                    outliers.append({
                        'prompt': prompt,
                        'image_index': idx,  # position in this theme's score list
                        'score': float(score),
                        'mean': mean,
                        'std': std,
                        'deviation': abs(score - mean) / std if std > 0 else 0,
                        'type': 'high' if score > threshold_high else 'low'
                    })

        return outliers

    def _rank_themes(self, theme_statistics: Dict[str, Dict]) -> List[Tuple[str, Dict]]:
        """
        Rank themes by reliability.

        Ranking criteria (in order of importance):
        1. Consistency (how often theme appears across images)
        2. Mean score (average similarity)
        3. Low std deviation (stable scores = more reliable)

        Args:
            theme_statistics: Dictionary with theme statistics

        Returns:
            List of (prompt, statistics) tuples sorted by reliability
        """
        # Create ranking score for each theme
        ranked = []

        for prompt, stats in theme_statistics.items():
            # Composite ranking score
            # Consistency is most important (weight: 0.5)
            # Mean score is secondary (weight: 0.3)
            # Low std is bonus (weight: 0.2, inverted)
            ranking_score = (
                stats['consistency'] * 0.5 +
                stats['mean_score'] * 0.3 +
                (1.0 - min(stats['std_score'], 0.3) / 0.3) * 0.2
            )

            ranked.append((prompt, stats, ranking_score))

        # Sort by ranking score (descending)
        ranked.sort(key=lambda x: x[2], reverse=True)

        # Return (prompt, stats) tuples without ranking score
        return [(prompt, stats) for prompt, stats, _ in ranked]

    def _select_primary_themes(self, ranked_themes: List[Tuple[str, Dict]]) -> List[Dict]:
        """
        Select primary (dominant) themes.

        Primary themes must have:
        - Consistency >= Config.MEDIUM_CONSISTENCY_THRESHOLD
        - Mean score >= Config.MEDIUM_SCORE_THRESHOLD

        At most Config.MAX_PRIMARY_THEMES themes are returned, taken in
        ranked order.

        Args:
            ranked_themes: List of themes sorted by reliability

        Returns:
            List of primary theme dictionaries
        """
        primary = []

        for prompt, stats in ranked_themes:
            # Check if qualifies as primary theme
            if (stats['consistency'] >= Config.MEDIUM_CONSISTENCY_THRESHOLD and
                stats['mean_score'] >= Config.MEDIUM_SCORE_THRESHOLD):

                primary.append({
                    'prompt': prompt,
                    'consistency': stats['consistency'],
                    'mean_score': stats['mean_score'],
                    'std_score': stats['std_score'],
                    'appears_in_images': stats['appears_in_images'],
                    'total_images': stats['total_images'],
                    'confidence_level': self._classify_theme_confidence(stats)
                })

                # Limit to max primary themes
                if len(primary) >= Config.MAX_PRIMARY_THEMES:
                    break

        return primary

    def _select_secondary_themes(self, ranked_themes: List[Tuple[str, Dict]],
                                 primary_themes: List[Dict]) -> List[Dict]:
        """
        Select secondary (supporting) themes.

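        Secondary themes:
        - Are not already selected as primary
        - Have consistency >= Config.LOW_CONSISTENCY_THRESHOLD or
          mean score >= Config.MEDIUM_SCORE_THRESHOLD
        - Provide additional context (at most Config.MAX_SECONDARY_THEMES)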

        Args:
            ranked_themes: List of themes sorted by reliability
            primary_themes: Already selected primary themes

        Returns:
            List of secondary theme dictionaries
        """
        primary_prompts = {theme['prompt'] for theme in primary_themes}
        secondary = []

        for prompt, stats in ranked_themes:
            # Skip if already primary
            if prompt in primary_prompts:
                continue

            # Check if qualifies as secondary
            if (stats['consistency'] >= Config.LOW_CONSISTENCY_THRESHOLD or
                stats['mean_score'] >= Config.MEDIUM_SCORE_THRESHOLD):

                secondary.append({
                    'prompt': prompt,
                    'consistency': stats['consistency'],
                    'mean_score': stats['mean_score'],
                    'std_score': stats['std_score'],
                    'appears_in_images': stats['appears_in_images'],
                    'total_images': stats['total_images'],
                    'confidence_level': self._classify_theme_confidence(stats)
                })

                # Limit to max secondary themes
                if len(secondary) >= Config.MAX_SECONDARY_THEMES:
                    break

        return secondary

    def _classify_theme_confidence(self, stats: Dict) -> str:
        """
        Classify confidence level for a single theme.

        Args:
            stats: Theme statistics dictionary

        Returns:
            Confidence level: 'high', 'medium', or 'low'
        """
        if (stats['consistency'] >= Config.HIGH_CONSISTENCY_THRESHOLD and
            stats['mean_score'] >= Config.HIGH_SCORE_THRESHOLD and
            stats['std_score'] < 0.1):
            return 'high'
        elif (stats['consistency'] >= Config.MEDIUM_CONSISTENCY_THRESHOLD and
              stats['mean_score'] >= Config.MEDIUM_SCORE_THRESHOLD):
            return 'medium'
        else:
            return 'low'

    def _compute_consistency_level(self, primary_themes: List[Dict]) -> str:
        """
        Compute overall consistency level across primary themes.

        Args:
            primary_themes: List of primary theme dictionaries

        Returns:
            Overall consistency: 'high', 'medium', or 'low'
        """
        if not primary_themes:
            return 'low'

        # Average consistency of primary themes
        avg_consistency = np.mean([theme['consistency'] for theme in primary_themes])

        if avg_consistency >= Config.HIGH_CONSISTENCY_THRESHOLD:
            return 'high'
        elif avg_consistency >= Config.MEDIUM_CONSISTENCY_THRESHOLD:
            return 'medium'
        else:
            return 'low'

    def _compute_overall_confidence(self, primary_themes: List[Dict],
                                   consistency_level: str) -> str:
        """
        Compute overall confidence in theme extraction.

        Args:
            primary_themes: List of primary themes
            consistency_level: Overall consistency level

        Returns:
            Overall confidence: 'high', 'medium', or 'low'
        """
        if not primary_themes:
            return 'low'

        # Count high-confidence primary themes
        high_conf_count = sum(1 for theme in primary_themes
                             if theme['confidence_level'] == 'high')

        if high_conf_count >= 1 and consistency_level == 'high':
            return 'high'
        # High consistency without a high-confidence theme still counts as medium
        elif high_conf_count >= 1 or consistency_level in ('high', 'medium'):
            return 'medium'
        else:
            return 'low'

    def _generate_summary(self, primary_themes: List[Dict],
                         secondary_themes: List[Dict],
                         consistency_level: str,
                         confidence: str) -> Dict:
        """
        Generate human-readable summary of theme extraction.

        Args:
            primary_themes: List of primary themes
            secondary_themes: List of secondary themes
            consistency_level: Overall consistency
            confidence: Overall confidence

        Returns:
            Dictionary with summary information
        """
        return {
            'num_primary_themes': len(primary_themes),
            'num_secondary_themes': len(secondary_themes),
            'consistency_level': consistency_level,
            'confidence': confidence,
            'interpretation': self._interpret_results(
                primary_themes,
                consistency_level,
                confidence
            )
        }

    def _interpret_results(self, primary_themes: List[Dict],
                          consistency_level: str,
                          confidence: str) -> str:
        """
        Generate interpretation text for results.

        Args:
            primary_themes: List of primary themes
            consistency_level: Overall consistency
            confidence: Overall confidence

        Returns:
            Human-readable interpretation string
        """
        if not primary_themes:
            return "No clear themes detected. Images may be too diverse or unclear."

        if confidence == 'high' and consistency_level == 'high':
            return "Strong, consistent themes detected across all images. High confidence in user preferences."
        elif confidence == 'high' or consistency_level == 'high':
            return "Clear themes detected with good consistency. Moderate to high confidence in preferences."
        elif confidence == 'medium' or consistency_level == 'medium':
            return "Moderate theme consistency detected. Some variation across images."
        else:
            return "Weak or mixed themes detected. Images show diverse preferences or unclear patterns."


print("\n" + "="*80)
print("THEME EXTRACTOR COMPLETE: Ready to extract themes")
print("="*80)


THEME EXTRACTOR: Initializing theme extraction engine

THEME EXTRACTOR COMPLETE: Ready to extract themes
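
To make the composite ranking in `_rank_themes` concrete, here is a small standalone sketch that recomputes its three components for the two illustrative score lists from the `_collect_scene_scores` docstring. The `ranking_components` helper is hypothetical; it mirrors the 0.5 / 0.3 / 0.2 weights used in the class and assumes every image contributed a score for the theme.

```python
import numpy as np

SCORE_THRESHOLD = 0.60  # same value as Config.MEDIUM_SCORE_THRESHOLD


def ranking_components(scores):
    """Return (consistency, mean, std_bonus, ranking_score) for one theme.

    Assumes every image contributed a score for this theme, so the
    number of scores equals the number of analyzed images.
    """
    arr = np.asarray(scores, dtype=float)
    consistency = float(np.sum(arr > SCORE_THRESHOLD)) / len(arr)
    mean = float(arr.mean())
    std = float(arr.std())  # population std, the np.std default
    std_bonus = 1.0 - min(std, 0.3) / 0.3  # 1.0 for perfectly stable scores
    score = 0.5 * consistency + 0.3 * mean + 0.2 * std_bonus
    return consistency, mean, std_bonus, score


# Illustrative scores taken from the _collect_scene_scores docstring
examples = {
    "tropical beach": [0.85, 0.82, 0.88],
    "rocky coastline": [0.32, 0.28, 0.35],
}

for prompt, scores in examples.items():
    consistency, mean, std_bonus, score = ranking_components(scores)
    print(f"{prompt}: consistency={consistency:.2f}, mean={mean:.3f}, "
          f"std_bonus={std_bonus:.3f}, ranking_score={score:.3f}")
```

One property worth noting: a theme that never clears the score threshold still earns most of the 0.2 stability weight when its scores are uniformly low, so the ranking score alone does not imply relevance. The consistency and mean-score checks in theme selection do that filtering.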


================================================================================
RUN THEME EXTRACTION
================================================================================
Purpose: Execute theme extraction pipeline on Step 1 data.
================================================================================

In [16]:
"""
================================================================================
CELL 6: RUN THEME EXTRACTION
================================================================================
Purpose: Execute theme extraction pipeline on Step 1 data.
================================================================================
"""

print("="*80)
print("EXECUTING THEME EXTRACTION")
print("="*80)

# Verify data is loaded
if not data_loaded or step1_data is None:
    print("\nERROR: Step 1 data not loaded")
    print("Please run Cell 4 first to load the data.")
    print("="*80)
    theme_results = None
else:
    try:
        # Initialize theme extractor
        print("\nInitializing theme extractor...")
        extractor = ThemeExtractor(step1_data)

        # Run theme extraction
        print("\nStarting theme extraction pipeline...")
        theme_results = extractor.extract_themes()

        print("\n" + "="*80)
        print("THEME EXTRACTION SUCCESSFUL")
        print("="*80)

        # Display high-level summary
        print("\nExtraction Summary:")
        print(f"  Images analyzed: {theme_results['num_images_analyzed']}")
        print(f"  Primary themes found: {theme_results['summary']['num_primary_themes']}")
        print(f"  Secondary themes found: {theme_results['summary']['num_secondary_themes']}")
        print(f"  Consistency level: {theme_results['consistency_level'].upper()}")
        print(f"  Overall confidence: {theme_results['confidence'].upper()}")

        # Display interpretation
        print(f"\nInterpretation:")
        print(f"  {theme_results['summary']['interpretation']}")

        # Display outliers if any
        if theme_results['outliers_detected']:
            print(f"\nOutliers detected: {len(theme_results['outliers_detected'])}")
            print(f"  (These are themes that appear inconsistently)")
        else:
            print(f"\nNo outliers detected - themes are consistent")

        print("\n" + "="*80)
        print("Results stored in variable 'theme_results'")
        print("="*80)

    except ValueError as e:
        print("\nERROR: Invalid data")
        print(f"Details: {e}")
        theme_results = None
    except Exception as e:
        # Catch-all so the notebook keeps running; the trace is printed for debugging
        print("\nERROR: Theme extraction failed")
        print(f"Details: {e}")
        import traceback
        print("\nFull error trace:")
        print(traceback.format_exc())
        theme_results = None

print("\n" + "="*80)


EXECUTING THEME EXTRACTION

Initializing theme extractor...

Theme extractor initialized:
  Total images: 5
  Successful analyses: 5

Starting theme extraction pipeline...

THEME EXTRACTION PIPELINE

Step 1: Collecting scene scores across all images...
  Collected scores for 27 unique themes

Step 2: Computing theme statistics...
  Computed statistics for 27 themes

Step 3: Detecting outliers...
  No outliers detected

Step 4: Ranking themes by consistency and score...
  Themes ranked by reliability

Step 5: Selecting primary and secondary themes...
  Primary themes: 0
  Secondary themes: 0

Step 6: Computing overall consistency level...
  Consistency level: LOW

Step 7: Computing overall confidence...
  Confidence: LOW

Step 8: Generating summary...

THEME EXTRACTION COMPLETE

THEME EXTRACTION SUCCESSFUL

Extraction Summary:
  Images analyzed: 5
  Primary themes found: 0
  Secondary themes found: 0
  Consistency level: LOW
  Overall confidence: LOW

Interpretation:
  No clear themes detected. Images may be too diverse or unclear.
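
The pipeline log above shows per-theme scores being collected and summarized before any theme is selected. A minimal sketch of that kind of aggregation follows, assuming each image contributes a dict mapping prompt to score. The function name `summarize_theme_scores`, the `presence_threshold` value, and the sample data are hypothetical and are not taken from `ThemeExtractor`. Inspecting per-theme numbers like these is one way to investigate a run that ends with zero primary themes, for example when every score sits below the selection cutoff.

```python
from collections import defaultdict
from typing import Dict, List

import numpy as np


def summarize_theme_scores(
    per_image_scores: List[Dict[str, float]],
    presence_threshold: float = 0.1,  # hypothetical cutoff, not ThemeExtractor's
) -> Dict[str, Dict[str, float]]:
    """Aggregate prompt scores across images into per-theme statistics."""
    total_images = len(per_image_scores)

    # Collect every score seen for each prompt
    collected = defaultdict(list)
    for scores in per_image_scores:
        for prompt, score in scores.items():
            collected[prompt].append(score)

    stats = {}
    for prompt, values in collected.items():
        arr = np.asarray(values, dtype=float)
        # A theme "appears" in an image when its score reaches the threshold
        appears_in = int((arr >= presence_threshold).sum())
        stats[prompt] = {
            "mean_score": float(arr.mean()),
            "std_score": float(arr.std()),
            "appears_in_images": appears_in,
            "consistency": appears_in / total_images,
        }
    return stats


# Hypothetical scores for three images
sample = [
    {"sandy beach": 0.30, "rocky coast": 0.05},
    {"sandy beach": 0.20, "rocky coast": 0.02},
    {"sandy beach": 0.04, "rocky coast": 0.03},
]
theme_stats = summarize_theme_scores(sample)
for prompt, s in sorted(theme_stats.items(), key=lambda kv: -kv[1]["mean_score"]):
    print(f"{prompt}: mean={s['mean_score']:.3f}, consistency={s['consistency']:.2f}")
```

With a fixed threshold, a prompt whose scores are uniformly small gets a consistency of zero no matter how stable it is, so comparing the threshold against the observed score range is a reasonable first check.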

Display Detailed **Results**

In [17]:
"""
================================================================================
CELL 7: DISPLAY DETAILED THEME EXTRACTION RESULTS
================================================================================
Purpose: Present comprehensive theme analysis in human-readable format.
================================================================================
"""

print("="*80)
print("DETAILED THEME EXTRACTION RESULTS")
print("="*80)

# Verify results exist
if 'theme_results' not in locals() or theme_results is None:
    print("\nERROR: No theme results found")
    print("Please run Cell 6 first to perform theme extraction.")
    print("="*80)
else:
    # Section 1: Primary Themes
    print("\n" + "="*80)
    print("SECTION 1: PRIMARY THEMES")
    print("="*80)
    print("\nThese are the dominant, most consistent themes across your images.")

    if theme_results['primary_themes']:
        for i, theme in enumerate(theme_results['primary_themes'], 1):
            print(f"\n{'-'*80}")
            print(f"PRIMARY THEME {i}")
            print(f"{'-'*80}")
            print(f"\nTheme: {theme['prompt']}")
            print(f"\nStatistics:")

            # Consistency bar
            consistency_pct = theme['consistency'] * 100
            consistency_bar_length = int(theme['consistency'] * 40)
            consistency_bar = "█" * consistency_bar_length + "░" * (40 - consistency_bar_length)
            print(f"  Consistency: {consistency_bar} {consistency_pct:.1f}%")
            print(f"               (Appears in {theme['appears_in_images']}/{theme['total_images']} images)")

            # Score bar
            score_bar_length = int(theme['mean_score'] * 40)
            score_bar = "█" * score_bar_length + "░" * (40 - score_bar_length)
            print(f"  Mean Score:  {score_bar} {theme['mean_score']:.3f}")

            # Standard deviation, bucketed into a qualitative label
            std_indicator = (
                "Low" if theme['std_score'] < 0.05
                else "Medium" if theme['std_score'] < 0.1
                else "High"
            )
            print(f"  Variability: {theme['std_score']:.3f} ({std_indicator})")

            # Confidence
            print(f"  Confidence:  {theme['confidence_level'].upper()}")

            # Interpretation
            print(f"\nInterpretation:")
            if theme['consistency'] >= 0.9:
                print(f"  This theme appears in nearly all images, indicating a very")
                print(f"  strong and consistent user preference.")
            elif theme['consistency'] >= 0.7:
                print(f"  This theme appears in most images, indicating a strong")
                print(f"  user preference with high consistency.")
            elif theme['consistency'] >= 0.5:
                print(f"  This theme appears in more than half the images, suggesting")
                print(f"  a moderate but clear user preference.")
            else:
                print(f"  This theme appears in some images, indicating a possible")
                print(f"  user interest but with less consistency.")
    else:
        print("\nNo primary themes identified.")
        print("This suggests images are too diverse or unclear.")

    # Section 2: Secondary Themes
    print("\n\n" + "="*80)
    print("SECTION 2: SECONDARY THEMES")
    print("="*80)
    print("\nThese provide additional context and supporting characteristics.")

    if theme_results['secondary_themes']:
        for i, theme in enumerate(theme_results['secondary_themes'], 1):
            print(f"\n{'-'*80}")
            print(f"SECONDARY THEME {i}")
            print(f"{'-'*80}")
            print(f"\nTheme: {theme['prompt']}")

            # Compact statistics
            consistency_pct = theme['consistency'] * 100
            print(f"  Consistency: {consistency_pct:.1f}% ({theme['appears_in_images']}/{theme['total_images']} images)")
            print(f"  Mean Score: {theme['mean_score']:.3f}")
            print(f"  Confidence: {theme['confidence_level'].upper()}")
    else:
        print("\nNo secondary themes identified.")

    # Section 3: Outlier Analysis
    if theme_results['outliers_detected']:
        print("\n\n" + "="*80)
        print("SECTION 3: OUTLIER ANALYSIS")
        print("="*80)
        print("\nOutliers are themes that appear inconsistently across images.")
        print("They may indicate:")
        print("  - One image is different from others")
        print("  - Mixed user preferences")
        print("  - Misclassification by the model")

        # Group outliers by type
        high_outliers = [o for o in theme_results['outliers_detected'] if o['type'] == 'high']
        low_outliers = [o for o in theme_results['outliers_detected'] if o['type'] == 'low']

        if high_outliers:
            print(f"\nHigh Score Outliers ({len(high_outliers)}):")
            print("(Theme scored much higher in one image than average)")
            for outlier in high_outliers[:5]:  # Show the first 5
                print(f"\n  Theme: {outlier['prompt'][:60]}...")
                print(f"  Image index: {outlier['image_index']}")
                print(f"  Score: {outlier['score']:.3f} (Average: {outlier['mean']:.3f})")
                print(f"  Deviation: {outlier['deviation']:.2f} standard deviations")

        if low_outliers:
            print(f"\nLow Score Outliers ({len(low_outliers)}):")
            print("(Theme scored much lower in one image than average)")
            for outlier in low_outliers[:5]:  # Show the first 5
                print(f"\n  Theme: {outlier['prompt'][:60]}...")
                print(f"  Image index: {outlier['image_index']}")
                print(f"  Score: {outlier['score']:.3f} (Average: {outlier['mean']:.3f})")
                print(f"  Deviation: {outlier['deviation']:.2f} standard deviations")

    # Section 4: Overall Assessment
    print("\n\n" + "="*80)
    print("SECTION 4: OVERALL ASSESSMENT")
    print("="*80)

    print(f"\nImages Analyzed: {theme_results['num_images_analyzed']}")
    print(f"Primary Themes: {theme_results['summary']['num_primary_themes']}")
    print(f"Secondary Themes: {theme_results['summary']['num_secondary_themes']}")

    print(f"\nConsistency Level: {theme_results['consistency_level'].upper()}")
    consistency_desc = {
        'high': "Themes are very consistent across images. Strong signal.",
        'medium': "Themes show moderate consistency. Some variation present.",
        'low': "Themes are inconsistent. Images may be diverse."
    }
    print(f"  {consistency_desc.get(theme_results['consistency_level'], 'Unknown')}")

    print(f"\nOverall Confidence: {theme_results['confidence'].upper()}")
    confidence_desc = {
        'high': "High confidence in identified themes. Reliable for next steps.",
        'medium': "Moderate confidence. Results are reasonable but verify.",
        'low': "Low confidence. Results may not be reliable."
    }
    print(f"  {confidence_desc.get(theme_results['confidence'], 'Unknown')}")

    print(f"\nInterpretation:")
    print(f"  {theme_results['summary']['interpretation']}")

    # Section 5: Recommendations for Step 3
    print("\n\n" + "="*80)
    print("SECTION 5: RECOMMENDATIONS FOR STEP 3")
    print("="*80)
    print("\nBased on the extracted themes, here are recommendations for")
    print("geo-location identification (Step 3):")

    if theme_results['confidence'] == 'high':
        print("\n✓ Recommendation: PROCEED WITH CONFIDENCE")
        print("  - Primary themes are clear and consistent")
        print("  - Use these themes for geo-location matching")
        print("  - Both landmark and theme-based matching should work well")
    elif theme_results['confidence'] == 'medium':
        print("\n⚠ Recommendation: PROCEED WITH CAUTION")
        print("  - Themes are moderately consistent")
        print("  - Rely more on landmark recognition if available")
        print("  - Use themes as supporting evidence")
    else:
        print("\n⚠ Recommendation: REVIEW IMAGES")
        print("  - Themes are weak or inconsistent")
        print("  - Consider uploading more focused images")
        print("  - Landmark recognition will be more reliable than theme matching")

    # Check for specific regional themes
    print("\nRegional Theme Detection:")
    regional_themes_found = []
    for theme in theme_results['primary_themes'] + theme_results['secondary_themes']:
        prompt_lower = theme['prompt'].lower()
        if 'goa' in prompt_lower:
            regional_themes_found.append(('Goa', theme['consistency']))
        elif 'kerala' in prompt_lower or 'backwater' in prompt_lower:
            regional_themes_found.append(('Kerala', theme['consistency']))
        elif 'andaman' in prompt_lower:
            regional_themes_found.append(('Andaman', theme['consistency']))
        elif 'konkan' in prompt_lower:
            regional_themes_found.append(('Konkan', theme['consistency']))
        elif 'tamil nadu' in prompt_lower:
            regional_themes_found.append(('Tamil Nadu', theme['consistency']))

    if regional_themes_found:
        print("  Regional indicators detected:")
        for region, consistency in regional_themes_found:
            print(f"    - {region} ({consistency*100:.0f}% consistency)")
        print("  These can help narrow down geo-location in Step 3")
    else:
        print("  No specific regional indicators detected")
        print("  Step 3 will rely on landmark recognition and generic themes")

    # Theme characteristics summary
    print("\nTheme Characteristics Summary:")
    all_themes = theme_results['primary_themes'] + theme_results['secondary_themes']

    # Extract keywords from themes
    keywords = defaultdict(int)
    for theme in all_themes:
        prompt_lower = theme['prompt'].lower()

        # Water/Beach keywords
        if any(word in prompt_lower for word in ['beach', 'sand', 'shore', 'coast']):
            keywords['beach'] += 1
        if any(word in prompt_lower for word in ['water', 'sea', 'ocean']):
            keywords['water'] += 1

        # Vegetation
        if any(word in prompt_lower for word in ['palm', 'tree', 'coconut', 'vegetation']):
            keywords['vegetation'] += 1

        # Activities
        if any(word in prompt_lower for word in ['fishing', 'boat']):
            keywords['fishing'] += 1
        if any(word in prompt_lower for word in ['sport', 'activity']):
            keywords['activities'] += 1

        # Characteristics
        if any(word in prompt_lower for word in ['pristine', 'clear', 'turquoise']):
            keywords['pristine'] += 1
        if any(word in prompt_lower for word in ['rocky', 'cliff']):
            keywords['rocky'] += 1
        if 'tropical' in prompt_lower:
            keywords['tropical'] += 1
        if any(word in prompt_lower for word in ['secluded', 'offbeat']):
            keywords['secluded'] += 1

    if keywords:
        print("  Key characteristics detected:")
        sorted_keywords = sorted(keywords.items(), key=lambda x: x[1], reverse=True)
        for keyword, count in sorted_keywords[:5]:
            print(f"    - {keyword.capitalize()}: {count} theme(s)")

    print("\n" + "="*80)

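The consistency and score bars in Cell 7 are built inline with `int(value * 40)`, which assumes the value lies in [0, 1]. One way to factor this out is a small helper that clamps its input, sketched here. `render_bar` is a hypothetical name and is not part of the notebook.

```python
def render_bar(value: float, width: int = 40) -> str:
    """Return a fixed-width text bar for a value expected in [0, 1]."""
    clamped = min(max(value, 0.0), 1.0)  # guard against out-of-range scores
    filled = int(clamped * width)
    return "█" * filled + "░" * (width - filled)


# Usage mirroring the two bars printed for each primary theme
print(f"  Consistency: {render_bar(0.8)} {0.8 * 100:.1f}%")
print(f"  Mean Score:  {render_bar(0.25)} {0.25:.3f}")
```

Clamping keeps the bar at exactly `width` characters even if a score falls slightly outside the expected range.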

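The keyword checks in Cell 7 use substring tests (`word in prompt_lower`), so `'sand'` also matches "thousand" and `'clear'` also matches "unclear". A possible alternative is whole-word matching with a regular expression, sketched below. The `KEYWORD_GROUPS` table and the `count_keyword_groups` helper are hypothetical names, and the groups are an abbreviated copy of those in the cell.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Abbreviated, illustrative copy of the keyword groups used in Cell 7
KEYWORD_GROUPS: Dict[str, List[str]] = {
    "beach": ["beach", "sand", "shore", "coast"],
    "water": ["water", "sea", "ocean"],
    "pristine": ["pristine", "clear", "turquoise"],
}

# One compiled whole-word pattern per group
_PATTERNS = {
    group: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for group, words in KEYWORD_GROUPS.items()
}


def count_keyword_groups(prompts: Iterable[str]) -> Counter:
    """Count how many prompts mention each keyword group as a whole word."""
    counts = Counter()
    for prompt in prompts:
        text = prompt.lower()
        for group, pattern in _PATTERNS.items():
            if pattern.search(text):
                counts[group] += 1
    return counts


prompts = [
    "a beach with clear turquoise water",
    "a thousand unclear search results",  # substring tests would match this too
]
keyword_counts = count_keyword_groups(prompts)
print(keyword_counts)
```

Whole-word matching is stricter than the original checks: "beach" no longer matches "beaches" unless the plural is listed explicitly, so the keyword lists may need extending if this approach is adopted.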
DETAILED THEME EXTRACTION RESULTS

SECTION 1: PRIMARY THEMES

These are the dominant, most consistent themes across your images.

No primary themes identified.
This suggests images are too diverse or unclear.


SECTION 2: SECONDARY THEMES

These provide additional context and supporting characteristics.

No secondary themes identified.


SECTION 4: OVERALL ASSESSMENT

Images Analyzed: 5
Primary Themes: 0
Secondary Themes: 0

Consistency Level: LOW
  Themes are inconsistent. Images may be diverse.

Overall Confidence: LOW
  Low confidence. Results may not be reliable.

Interpretation:
  No clear themes detected. Images may be too diverse or unclear.


SECTION 5: RECOMMENDATIONS FOR STEP 3

Based on the extracted themes, here are recommendations for
geo-location identification (Step 3):

⚠ Recommendation: REVIEW IMAGES
  - Themes are weak or inconsistent
  - Consider uploading more focused images
  - Landmark recognition will be more reliable than theme matching

Regional Theme Detection