## Adaptive Tech Support for Older Adults

This automated system uses prompt chaining to generate step-by-step tech support guides with screenshots for older adults, based on their queries. It retrieves relevant YouTube tutorials through the YouTube Data API, extracts key UI elements with OpenCV, and builds visual guides with the OS-ATLAS action model. The goal is to make tech support more accessible and effective for older adults by turning video tutorial content into an image-based format.

### 1. Installation of Required Libraries

This cell installs the necessary Python libraries for the project, including packages for Google API access, video processing, image manipulation, machine learning, and deep learning models.

In [None]:
!pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client
!pip install opencv-python-headless yt-dlp
!pip install scikit-image
!pip install qwen_vl_utils
!pip install transformers
!pip install torch torchvision

### 2. YouTube Video Search and Selection

This cell performs the following tasks:
- **Searches for YouTube videos** based on a user-provided query using the YouTube Data API.
- **Cleans the query** to make it folder-friendly (replaces each run of spaces and special characters with an underscore).
- **Fetches video details** including title, URL, and view count.
- **Selects the video with the highest number of views** from the search results.
- **Prompts the user** for a search query, then displays the most popular video with its title, URL, and view count.
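
The query cleaning and view-based selection described above can be tried offline. The following is a minimal sketch that uses hand-made sample data in place of API responses; `pick_most_viewed` and the sample dictionaries are hypothetical names used for illustration, and only the field names (`id`, `title`, `views`) follow the cell below.

```python
import re

def clean_query(query):
    """Replace each run of non-word characters with an underscore."""
    return re.sub(r'\W+', '_', query.strip())

def pick_most_viewed(videos, views_by_id):
    """Return the video dict with the highest view count, or None if the list is empty.

    videos: list of {"id", "title"} dicts (hypothetical sample shape)
    views_by_id: mapping of video ID -> view count; missing IDs count as 0
    """
    if not videos:
        return None
    for video in videos:
        video["views"] = views_by_id.get(video["id"], 0)
    return max(videos, key=lambda v: v["views"])

# Made-up sample data, not real API output
sample_videos = [
    {"id": "aaa", "title": "Tutorial A"},
    {"id": "bbb", "title": "Tutorial B"},
    {"id": "ccc", "title": "Tutorial C"},
]
sample_views = {"aaa": 120, "ccc": 4500}  # "bbb" has no statistics entry

folder_name = clean_query("How do I send a text message?")
best = pick_most_viewed(sample_videos, sample_views)
print(folder_name)
print(best["title"], best["views"])
```

Note that trailing punctuation such as a question mark also becomes an underscore, so folder names can end in `_`.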

In [None]:
import requests
import re

# YouTube API key
YOUTUBE_API_KEY = "" # Add your API Key here

def clean_query(query):
    """Convert query to a folder-friendly format (runs of non-word characters become underscores)."""
    return re.sub(r'\W+', '_', query.strip())

def search_youtube_video(query, max_results=5):
    search_url = "https://www.googleapis.com/youtube/v3/search"
    search_params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "key": YOUTUBE_API_KEY
    }
    response = requests.get(search_url, params=search_params).json()

    # Extract video IDs
    video_results = []
    video_ids = []
    for item in response.get("items", []):
        video_id = item["id"]["videoId"]
        title = item["snippet"]["title"]
        video_url = f"https://www.youtube.com/watch?v={video_id}"
        video_ids.append(video_id)
        video_results.append({"id": video_id, "title": title, "url": video_url})
    if not video_results:
        return None
    
    # Fetch video statistics (views) using a separate API call
    stats_url = "https://www.googleapis.com/youtube/v3/videos"
    stats_params = {
        "part": "statistics",
        "id": ",".join(video_ids),
        "key": YOUTUBE_API_KEY
    }
    stats_response = requests.get(stats_url, params=stats_params).json()
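    # Map view counts to results by video ID; the statistics response may contain
    # fewer items than requested, so matching by position is not reliable
    views_by_id = {
        item["id"]: int(item.get("statistics", {}).get("viewCount", 0))
        for item in stats_response.get("items", [])
    }
    for video in video_results:
        video["views"] = views_by_id.get(video["id"], 0)
    # Select the video with the highest views
    best_video = max(video_results, key=lambda x: x["views"])
    return best_video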

# Dynamic user input
query = input("Enter your search query: ")
query_folder = clean_query(query)  # Ensure query is folder-safe
best_video = search_youtube_video(query)

if best_video:
    print(f"Selected Video: {best_video['title']} - {best_video['url']} ({best_video['views']} views)")
else:
    print("No videos found.")

### 3. Video Download and Folder Setup

This cell performs the following tasks:
- **Clears cached data** from yt-dlp to ensure fresh downloads.
- **Sets up folders** for storing the video and screenshots related to the search query:
  - Deletes any existing folders and creates new ones.
- **Downloads the selected YouTube video**:
  - The video is saved in the newly created folder for the search query.
  - The `yt-dlp` tool is configured to download the best available quality.
  - A cookie file (if available) is used for the download.
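
As a rough sketch of the folder reset and download options listed above, the snippet below rebuilds the per-query folders inside a temporary directory and assembles an options dictionary without starting a download. `reset_query_folders` and `build_ydl_opts` are hypothetical helper names; the option keys (`format`, `outtmpl`, `cookiefile`) are existing yt-dlp options, and attaching the cookie file only when it exists is an assumption about the intended "if available" behavior.

```python
import os
import shutil
import tempfile

def reset_query_folders(root, query_folder, base_dirs=("videos", "screenshots")):
    """Delete any old per-query folders under root and recreate them empty."""
    created = []
    for base in base_dirs:
        folder_path = os.path.join(root, base, query_folder)
        if os.path.exists(folder_path):
            shutil.rmtree(folder_path)
        os.makedirs(folder_path)
        created.append(folder_path)
    return created

def build_ydl_opts(video_folder, cookie_path="cookies.txt"):
    """Assemble yt-dlp options; the cookie file is only attached if it exists."""
    opts = {
        "format": "best",
        "outtmpl": os.path.join(video_folder, "%(id)s.%(ext)s"),
    }
    if os.path.exists(cookie_path):
        opts["cookiefile"] = cookie_path
    return opts

with tempfile.TemporaryDirectory() as root:
    video_folder, screenshot_folder = reset_query_folders(root, "demo_query")
    # Leave a stale file behind, then reset again to show it is removed
    stale = os.path.join(video_folder, "old.mp4")
    open(stale, "w").close()
    reset_query_folders(root, "demo_query")
    stale_removed = not os.path.exists(stale)
    opts = build_ydl_opts(video_folder, cookie_path=os.path.join(root, "cookies.txt"))
    print(stale_removed, sorted(opts))
```

Working inside a temporary directory keeps the demonstration from touching real `videos` or `screenshots` folders.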

In [None]:
import yt_dlp
import os
import shutil

yt_dlp.YoutubeDL().cache.remove()
# if os.path.exists('cookies.txt'):
#     os.remove('cookies.txt')

def setup_folders(query_folder):
    """Delete old data & create new folders for the query."""
    base_dirs = ["videos", "screenshots"]
    for base in base_dirs:
        folder_path = os.path.join(base, query_folder)
        if os.path.exists(folder_path):
            shutil.rmtree(folder_path)  # Delete old folder
        os.makedirs(folder_path)  # Create fresh folder

def download_video(video_url, output_folder="videos"):
    query_video_folder = os.path.join(output_folder, query_folder)
    os.makedirs(query_video_folder, exist_ok=True)  # Ensure folder exists
    ydl_opts = {
        "format": "best",
        "outtmpl": f"{query_video_folder}/%(id)s.%(ext)s",
        "cookiefile": "cookies.txt",
        "source_address": "0.0.0.0",  # Python API equivalent of the --force-ipv4 CLI flag
        # "compat_opts": ["legacy-server"]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.cache.remove()
        ydl.download([video_url])

# Setup folders for new query
setup_folders(query_folder)
# Use the video with the highest views (skip the download if the search found nothing)
if best_video:
    download_video(best_video["url"])

### 4. Phone Screen Detection and Screenshot Extraction

This cell performs the following tasks:
- **Detects the phone screen** in video frames using edge detection and contour approximation:
  - Converts frames to grayscale and uses Canny edge detection.
  - Finds the largest contour, approximates it to a polygon, and identifies it as the phone screen if it is a quadrilateral.
- **Extracts relevant screenshots** from the video:
  - Screenshots are taken at intervals, and only frames where the phone screen is detected are processed.
  - Uses Structural Similarity Index (SSIM) to filter out similar frames, saving only frames with noticeable UI changes.
  - Screenshots are saved in the specified folder for the query.
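
To illustrate the similarity filter, here is a simplified sketch that keeps a frame only when its similarity to the last kept frame drops below a threshold. It uses a single-window (global) SSIM computed with NumPy, which is a simplification of the sliding-window `structural_similarity` from scikit-image used in the cell below, so the scores will not match exactly. `global_ssim`, `filter_changed_frames`, and the synthetic frames are illustrative assumptions.

```python
import numpy as np

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over the whole image (simplified; skimage uses local windows)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def filter_changed_frames(frames, similarity_threshold=0.98):
    """Return indices of frames that differ noticeably from the last kept frame."""
    kept = []
    last_kept = None
    for index, frame in enumerate(frames):
        if last_kept is None or global_ssim(last_kept, frame) < similarity_threshold:
            kept.append(index)
            last_kept = frame
    return kept

# Synthetic grayscale "frames": two identical screens, then one with a bright panel
rng = np.random.default_rng(0)
screen_a = rng.integers(0, 256, size=(64, 36)).astype(np.uint8)
screen_b = screen_a.copy()
screen_b[16:48, :] = 255  # simulate a dialog covering half the screen

kept_indices = filter_changed_frames([screen_a, screen_a.copy(), screen_b])
print(kept_indices)
```

Comparing against the last kept frame, rather than the immediately preceding one, means slow gradual changes still trigger a new screenshot eventually.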

In [None]:
import cv2
import os
import numpy as np
from skimage.metrics import structural_similarity as ssim

def detect_phone_screen(frame):
    """Detects the largest rectangle (phone screen) in the frame using edge detection and contour approximation."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # No contours found
    # Find the largest rectangle
    max_contour = max(contours, key=cv2.contourArea)
    epsilon = 0.02 * cv2.arcLength(max_contour, True)
    approx = cv2.approxPolyDP(max_contour, epsilon, True)
    if len(approx) == 4:  # If it's a quadrilateral, assume it's the phone screen
        return approx.reshape(4, 2)
    return None

def extract_relevant_frames(video_path, output_folder="screenshots", frame_interval=1, similarity_threshold=0.98):
    query_screenshot_folder = os.path.join(output_folder, query_folder)
    os.makedirs(query_screenshot_folder, exist_ok=True)  # Ensure folder exists
    cap = cv2.VideoCapture(video_path)
    frame_count = 0
    screenshot_count = 0
    last_saved_frame = None
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Skip frames to reduce processing time
        if frame_count % frame_interval == 0:
            phone_screen = detect_phone_screen(frame)
            if phone_screen is not None:  # Only process frames where the phone screen is detected
                if last_saved_frame is None:
                    save_path = f"{query_screenshot_folder}/frame_{screenshot_count}.jpg"
                    cv2.imwrite(save_path, frame)
                    last_saved_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                    screenshot_count += 1
                else:
                    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                    similarity = ssim(last_saved_frame, gray_frame)
                    if similarity < similarity_threshold:  # Only save if UI has changed
                        save_path = f"{query_screenshot_folder}/frame_{screenshot_count}.jpg"
                        cv2.imwrite(save_path, frame)
                        last_saved_frame = gray_frame
                        screenshot_count += 1
        frame_count += 1
    cap.release()
    print(f"Extracted {screenshot_count} relevant UI screenshots in '{query_screenshot_folder}'.")
# Run extraction with filtering
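# Locate the downloaded file by video ID, since "best" does not guarantee an .mp4 container
video_dir = os.path.join("videos", query_folder)
video_file = next(
    (os.path.join(video_dir, name) for name in sorted(os.listdir(video_dir))
     if name.startswith(best_video["id"])),
    None
)
if video_file:
    extract_relevant_frames(video_file)
else:
    print(f"No downloaded video found in '{video_dir}'.")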

### 5. Phone Screen Detection and Cropping

This cell contains functions for detecting and extracting the screen region from images, using methods like adaptive thresholding, Canny edge detection, and contour analysis. It includes:
- `detect_phone_screen(image)`: Detects the phone's screen area by analyzing edges, contours, and aspect ratios.
- `crop_screen_with_perspective(image, screen_contour)`: Crops the detected screen area using a perspective transformation.
- `order_points(pts)`: Orders the contour points to ensure correct perspective transformation.
- `visualize_detection(image, screen_contour, filename=None, output_dir=None)`: Visualizes the detected screen contour for debugging.
- `extract_ui_screenshots(input_folder="screenshots", output_folder="ui-screens")`: Extracts and saves only the phone screen regions from screenshots.

These functions help isolate the phone screen from images and ensure correct transformations for further processing.
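
The size and aspect-ratio checks that decide whether a contour is a plausible phone can be isolated into a small predicate. This is a sketch using the thresholds from the cell below (bounding box at least 1/20 of the frame area, height-to-width ratio between 1.5 and 2.5); `is_phone_like` is a hypothetical helper, and the sample boxes are made up.

```python
def is_phone_like(box_w, box_h, frame_w, frame_h,
                  min_area_fraction=1 / 20, min_ratio=1.5, max_ratio=2.5):
    """Return True if a bounding box is large enough and portrait-shaped like a phone."""
    if box_w <= 0 or box_h <= 0:
        return False
    if box_w * box_h < frame_w * frame_h * min_area_fraction:
        return False  # too small to be the phone
    aspect_ratio = box_h / box_w
    return min_ratio < aspect_ratio < max_ratio

# Made-up bounding boxes (width, height) in a 1280x720 frame
frame_w, frame_h = 1280, 720
candidates = {
    "phone": (300, 600),       # ratio 2.0, large enough
    "thumbnail": (60, 120),    # ratio 2.0, but too small
    "banner": (1000, 200),     # large, but landscape
}
results = {name: is_phone_like(w, h, frame_w, frame_h) for name, (w, h) in candidates.items()}
print(results)
```

Only the box that passes both checks is treated as a phone candidate; the other two fail on size and on orientation respectively.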

In [None]:
import cv2
import os
import numpy as np

def detect_phone_screen(image):
    """Detects the phone in the image, focusing on the likely phone region."""
    # Create a copy of the image
    original = image.copy()
    height, width = image.shape[:2]
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    
    # First, try to use the fact that phones often have bright screens
    # that stand out from the background
    
    # 1. Try adaptive thresholding to find the screen
    thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                                  cv2.THRESH_BINARY_INV, 11, 2)
    
    # 2. Also try Canny edge detection
    canny = cv2.Canny(blurred, 50, 150)
    
    # Combine both methods
    edges = cv2.bitwise_or(thresh, canny)
    
    # Dilate to connect edges
    kernel = np.ones((3, 3), np.uint8)
    dilated = cv2.dilate(edges, kernel, iterations=1)
    
    # Find contours
    contours, _ = cv2.findContours(dilated.copy(), cv2.RETR_EXTERNAL, 
                                  cv2.CHAIN_APPROX_SIMPLE)
    
    # Sort contours by area in descending order
    contours = sorted(contours, key=cv2.contourArea, reverse=True)
    
    # Additional filter for phone-like aspect ratio
    phone_contours = []
    
    for contour in contours[:10]:  # Only check the 10 largest contours
        # Approximate the contour
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        
        # Get bounding rectangle
        x, y, w, h = cv2.boundingRect(approx)
        
        # Skip if the contour is too small
        if w * h < (width * height) / 20:
            continue
            
        # Check if it's at least somewhat rectangular (4-ish points)
        if len(approx) >= 4 and len(approx) <= 8:
            # Check if it has phone-like aspect ratio (height > width)
            aspect_ratio = h / w if w > 0 else 0
            
            # Phones are typically taller than they are wide (aspect ratio > 1.5)
            if aspect_ratio > 1.5 and aspect_ratio < 2.5:
                phone_contours.append(approx)
    
    # If we found potential phone contours, use the largest one
    if phone_contours:
        largest_phone_contour = max(phone_contours, key=cv2.contourArea)
        # Get the bounding rectangle
        x, y, w, h = cv2.boundingRect(largest_phone_contour)
        
        # Create a 4-point contour from the rectangle for perspective transform
        screen_contour = np.array([
            [x, y],           # Top-left
            [x + w, y],       # Top-right
            [x + w, y + h],   # Bottom-right
            [x, y + h]        # Bottom-left
        ], dtype=np.float32)
        
        return screen_contour
    
    # If no phone contour found, look for a rectangle with phone-like proportions
    # in the center half of the image (where phones usually are in tutorial videos)
    
    # Define the center region (middle 50% of the image)
    center_x = width // 4
    center_y = height // 4
    center_w = width // 2
    center_h = height // 2
    
    # Create a mask for the center region
    mask = np.zeros_like(gray)
    mask[center_y:center_y+center_h, center_x:center_x+center_w] = 255
    
    # Apply the mask to the edge image
    masked_edges = cv2.bitwise_and(dilated, mask)
    
    # Find contours in the center region
    center_contours, _ = cv2.findContours(masked_edges, cv2.RETR_EXTERNAL, 
                                         cv2.CHAIN_APPROX_SIMPLE)
    
    # Sort by area
    center_contours = sorted(center_contours, key=cv2.contourArea, reverse=True)
    
    for contour in center_contours[:5]:
        # Approximate the contour
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        
        # Get bounding rectangle
        x, y, w, h = cv2.boundingRect(approx)
        
        # Skip small contours
        if w * h < (width * height) / 30:
            continue
            
        # Check aspect ratio (phone-like)
        aspect_ratio = h / w if w > 0 else 0
        if aspect_ratio > 1.5 and aspect_ratio < 2.5:
            # Create a 4-point contour
            screen_contour = np.array([
                [x, y],           # Top-left
                [x + w, y],       # Top-right
                [x + w, y + h],   # Bottom-right
                [x, y + h]        # Bottom-left
            ], dtype=np.float32)
            
            return screen_contour
    
    # If we still haven't found anything, try one more approach:
    # Find the most central large rectangle in the image
    
    # Use Hough Line transform to find straight lines
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100, minLineLength=100, maxLineGap=10)
    
    if lines is not None:
        # Draw the lines on a blank image
        line_image = np.zeros_like(gray)
        for line in lines:
            x1, y1, x2, y2 = line[0]
            cv2.line(line_image, (x1, y1), (x2, y2), 255, 2)
        
        # Find contours in the line image
        line_contours, _ = cv2.findContours(line_image, cv2.RETR_EXTERNAL, 
                                           cv2.CHAIN_APPROX_SIMPLE)
        
        # Find the contour closest to the center of the image
        image_center = np.array([width/2, height/2])
        best_distance = float('inf')
        best_contour = None
        
        for contour in line_contours:
            if cv2.contourArea(contour) < (width * height) / 20:
                continue
                
            # Find center of contour
            M = cv2.moments(contour)
            if M["m00"] != 0:
                cx = int(M["m10"] / M["m00"])
                cy = int(M["m01"] / M["m00"])
                contour_center = np.array([cx, cy])
                
                # Calculate distance to image center
                distance = np.linalg.norm(contour_center - image_center)
                
                # Get bounding rect to check aspect ratio
                x, y, w, h = cv2.boundingRect(contour)
                aspect_ratio = h / w if w > 0 else 0
                
                # If it's phone-like and closer to center than previous best
                if aspect_ratio > 1.5 and aspect_ratio < 2.5 and distance < best_distance:
                    best_distance = distance
                    best_contour = contour
        
        if best_contour is not None:
            # Get the bounding rectangle
            x, y, w, h = cv2.boundingRect(best_contour)
            
            # Create a 4-point contour
            screen_contour = np.array([
                [x, y],           # Top-left
                [x + w, y],       # Top-right
                [x + w, y + h],   # Bottom-right
                [x, y + h]        # Bottom-left
            ], dtype=np.float32)
            
            return screen_contour
    
    # If all else fails, just take the center portion of the image 
    # with a typical phone aspect ratio (this is a fallback)
    center_x = width // 3
    center_width = width // 3
    
    # Calculate height based on typical phone aspect ratio (1:2)
    center_height = center_width * 2
    center_y = (height - center_height) // 2
    
    # Create a rectangle
    screen_contour = np.array([
        [center_x, center_y],                       # Top-left
        [center_x + center_width, center_y],        # Top-right
        [center_x + center_width, center_y + center_height],  # Bottom-right
        [center_x, center_y + center_height]        # Bottom-left
    ], dtype=np.float32)
    
    return screen_contour

# def crop_screen(image, screen_contour):
#     """Simply crops the region inside the screen contour (no perspective correction)."""
#     if screen_contour is None or len(screen_contour) != 4:
#         return None
    
#     # Get the bounding rectangle for the contour
#     x, y, w, h = cv2.boundingRect(screen_contour)
    
#     # Crop the image using the bounding rectangle
#     cropped = image[y:y+h, x:x+w]
    
#     return cropped

def crop_screen_with_perspective(image, screen_contour):
    """Crops the region inside the screen contour using a perspective transform."""
    if screen_contour is None or len(screen_contour) != 4:
        return None

    # Order points in a consistent order (top-left, top-right, bottom-right, bottom-left)
    ordered_points = order_points(screen_contour)
    
    # Define the dimensions for the perspective transform
    width = 360  # Desired width of the output screen crop
    height = 640  # Desired height of the output screen crop
    dst_points = np.array([
        [0, 0],
        [width - 1, 0],
        [width - 1, height - 1],
        [0, height - 1]
    ], dtype=np.float32)

    # Get the perspective transform matrix
    matrix = cv2.getPerspectiveTransform(ordered_points, dst_points)

    # Apply perspective warp
    warped = cv2.warpPerspective(image, matrix, (width, height))
    return warped

def order_points(pts):
    """Orders points in top-left, top-right, bottom-right, bottom-left order."""
    # Sort by y-coordinate (top to bottom)
    sorted_by_y = pts[np.argsort(pts[:, 1])]
    
    # Get top and bottom points
    top_points = sorted_by_y[:2]
    bottom_points = sorted_by_y[2:]
    
    # Sort top points by x-coordinate (left to right)
    top_left, top_right = top_points[np.argsort(top_points[:, 0])]
    
    # Sort bottom points by x-coordinate (left to right)
    bottom_left, bottom_right = bottom_points[np.argsort(bottom_points[:, 0])]
    
    # Return points in order: top-left, top-right, bottom-right, bottom-left
    return np.array([top_left, top_right, bottom_right, bottom_left], dtype=np.float32)

def visualize_detection(image, screen_contour, filename=None, output_dir=None):
    """Visualizes the detected screen contour for debugging."""
    vis = image.copy()
    
    if screen_contour is not None:
        # Draw the contour
        cv2.drawContours(vis, [screen_contour.astype(int)], -1, (0, 255, 0), 2)
        
        # Draw the ordered corners
        rect = order_points(screen_contour)
        colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]  # BGR for corners
        labels = ["TL", "TR", "BR", "BL"]  # Corner labels
        
        for i, (x, y) in enumerate(rect.astype(int)):
            cv2.circle(vis, (x, y), 8, colors[i], -1)
            cv2.putText(vis, labels[i], (x - 10, y - 10), 
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, colors[i], 2)
    
    if filename and output_dir:
        os.makedirs(output_dir, exist_ok=True)
        save_path = os.path.join(output_dir, f"debug_{filename}")
        cv2.imwrite(save_path, vis)
    
    return vis

def extract_ui_screenshots(input_folder="screenshots", output_folder="ui-screens"):
    """Extracts only the screen region from screenshots and saves the raw cropped area."""
    query_ui_folder = os.path.join(output_folder, query_folder)
    debug_folder = os.path.join(output_folder, "debug")
    
    os.makedirs(query_ui_folder, exist_ok=True)
    os.makedirs(debug_folder, exist_ok=True)
    
    screenshot_files = sorted(os.listdir(os.path.join(input_folder, query_folder)))
    
    successful = 0
    failed = 0
    
    for screenshot in screenshot_files:
        img_path = os.path.join(input_folder, query_folder, screenshot)
        img = cv2.imread(img_path)
        
        if img is None:
            # print(f"Could not read image: {img_path}")
            failed += 1
            continue
        
        # Detect phone screen
        screen_contour = detect_phone_screen(img)
        
        # Create visualization for debugging
        visualize_detection(img, screen_contour, screenshot, debug_folder)
        
        if screen_contour is not None:
            # Crop the detected region and warp it to a fixed 360x640 output
            cropped_screen = crop_screen_with_perspective(img, screen_contour)
            
            if cropped_screen is not None and cropped_screen.size > 0:
                save_path = os.path.join(query_ui_folder, screenshot)
                cv2.imwrite(save_path, cropped_screen)
                # print(f"Successfully processed: {screenshot}")
                successful += 1
            else:
                # print(f"Failed to process: {screenshot} - Invalid crop")
                failed += 1
        else:
            print(f"✗ Failed to detect screen in: {screenshot}")
            failed += 1
    
    print("Extraction complete!")
    print(f"Successfully processed: {successful} images")
    print(f"Failed to process: {failed} images")
    print(f"Extracted UI screens saved in '{query_ui_folder}'")
    print(f"Debug visualizations saved in '{debug_folder}'")

# Run UI screen extraction
extract_ui_screenshots()

### 6. OS-Atlas Action Model for UI Bounding Box Generation

This cell uses the OS-Atlas action model to turn the cropped UI screenshots into step-by-step instructions for a user query, such as a technical support task. It tracks the action history across steps and draws bounding boxes around candidate UI elements for each step. It covers:

- Bounding box generation
- Interaction keyword detection (e.g., button, text, icon, etc.)
- Step-by-step instructions generation for UI actions
- Action history tracking
- GPU memory management
- Uses the `Qwen2VLForConditionalGeneration` model from Hugging Face

The cell writes an `instructions.txt` file and one annotated image per step under `results/<query_folder>/`, giving a more refined view of the UI interactions.
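
To make the parsing step concrete, the sketch below splits a model response into (thoughts, actions) pairs and checks which interaction keywords each action mentions. The sample response is invented for illustration and is not real model output; `parse_steps` and `matched_keywords` are hypothetical helpers, and the case-insensitive matching is an assumption about how the model capitalizes its section labels.

```python
import re

INTERACTION_KEYWORDS = ['button', 'text', 'icon', 'menu', 'checkbox', 'field', 'label']

# Each step is a "Thoughts:" section followed by an "Actions:" section
STEP_PATTERN = re.compile(
    r'thoughts:\s*(.*?)\s*actions:\s*(.*?)\s*(?=thoughts:|$)',
    re.DOTALL | re.IGNORECASE,
)

def parse_steps(output_text):
    """Split a response into a list of (thoughts, actions) tuples."""
    return [(t.strip(), a.strip()) for t, a in STEP_PATTERN.findall(output_text)]

def matched_keywords(action_text, keywords=INTERACTION_KEYWORDS):
    """Return the interaction keywords mentioned in an action string."""
    lowered = action_text.lower()
    return [kw for kw in keywords if kw in lowered]

# Invented sample response, for illustration only
sample_output = """Thoughts: The Messages app must be opened first.
Actions: OPEN_APP [Messages]
Thoughts: A new conversation is started from the compose icon.
Actions: CLICK [compose icon]
"""

steps = parse_steps(sample_output)
for number, (thoughts, actions) in enumerate(steps, 1):
    print(number, actions, matched_keywords(actions))
```

A step whose action mentions no keyword returns an empty list, which is the case where the cell below falls back to the generic contour-based detector.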

In [None]:
import torch
import os
import cv2
import re
import numpy as np
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import gc

# Clean up GPU memory
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

# Define interaction keywords (list of terms related to UI actions)
interaction_keywords = ['button', 'text', 'icon', 'menu', 'checkbox', 'field', 'label']

def os_atlas_enhanced_bounding_boxes(query_folder, user_query):
    """
    Enhanced version of OS-Atlas with more precise bounding box generation and action history tracking.
    
    Args:
        query_folder: Folder name inside the ui-screens directory containing the cropped UI images
        user_query: The specific technical support task to analyze
    """
    # Initialize action history
    action_history = []

    # Setup paths
    ui_screens_folder = f"ui-screens/{query_folder}"
    output_folder = f"results/{query_folder}"
    bbox_folder = os.path.join(output_folder, "bounding_boxes")
    
    os.makedirs(output_folder, exist_ok=True)
    os.makedirs(bbox_folder, exist_ok=True)
    
    # Load model and processor
    print("Loading model...")
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "OS-Copilot/OS-Atlas-Pro-7B", 
        torch_dtype=torch.float16
    ).to("cuda")
    
    processor = AutoProcessor.from_pretrained(
        "OS-Copilot/OS-Atlas-Pro-7B",
        size={"shortest_edge": 224, "longest_edge": 224}
    )
    
    # Load the UI screenshots
    print(f"Loading screenshots from {ui_screens_folder}...")
    frame_files = sorted(
        [f for f in os.listdir(ui_screens_folder) if f.startswith("frame_")],
        key=lambda x: int(x.split('_')[1].split('.')[0])
    )
    
    if not frame_files:
        print(f"No frames found in {ui_screens_folder}")
        return
    
    # Prepare a more focused system prompt    
    sys_prompt = f"""
    You are operating in Executable Language Grounding mode as an expert mobile UI assistant.
    Your task is to break down the query "{user_query}" into clear, executable steps tied to specific UI elements.
    
    Please generate at least 5 clear steps for the user query {user_query} with detailed UI actions in the format:
    1. **CLICK [button name]**: Click on the button to perform an action.
    2. **TYPE [text input]**: Type the provided text in the field.
    3. **SCROLL [direction]**: Scroll in the specified direction.
    4. **WAIT**: Wait for the UI to respond.
    5. **OPEN_APP [app_name]**: Open the specified app.

    Do not leave the list incomplete.

    Ensure that each action is specific and describes the UI element involved.
    In most cases, task instructions are high-level and abstract. Carefully read the instruction and action history, then perform reasoning to determine the most appropriate next action. Ensure you strictly generate two sections: Thoughts and Actions.
    
    Thoughts: Clearly outline your reasoning process for current step.
    Actions: Specify the actual UI actions you will take based on your reasoning. You should follow action format above when generating. 
    """
    
    # Load images and keep track of paths
    frames = []
    frame_paths = []
    for frame_file in frame_files:
        frame_path = os.path.join(ui_screens_folder, frame_file)
        frame_paths.append(frame_path)
        image = Image.open(frame_path).convert('RGB')
        frames.append(image)
    
    # Prepare the message with images
    message_content = [
        {"type": "text", "text": sys_prompt},
        {"type": "text", "text": f"Specific task: {user_query}"}
    ]
    
    # Add all frames as images
    for image in frames:
        message_content.append({"type": "image", "image": image})
    
    messages = [{"role": "user", "content": message_content}]
    
    # Process with the model
    print("Generating response...")
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
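    # Pass the images along with the text; otherwise the model never receives the screenshots
    inputs = processor(text=[text], images=frames, padding=True, return_tensors="pt")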
    
    # Move inputs to GPU
    inputs = {k: v.to("cuda") if hasattr(v, 'to') else v for k, v in inputs.items()}
    
    # Generate output with more focused token generation
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
    
    # Decode only the newly generated tokens so the prompt is not parsed as model output
    prompt_length = inputs["input_ids"].shape[1]
    output_text = processor.batch_decode(
        generated_ids[:, prompt_length:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )[0].strip()
    
    # Extract just the step-by-step instructions
    instructions_match = re.search(r'Step 1:.*', output_text, re.DOTALL)
    if instructions_match:
        cleaned_instructions = instructions_match.group(0)
    else:
        # Fallback if no steps found
        cleaned_instructions = re.sub(r'Task to analyze:.*?\n', '', output_text)
    
    # Save the response
    with open(os.path.join(output_folder, "instructions.txt"), "w") as f:
        f.write(cleaned_instructions)
    
    # Enhanced bounding box generation
    print("Creating enhanced bounding box images...")
    
    # Extract step information; match the labels case-insensitively because
    # the prompt asks for capitalized "Thoughts:" and "Actions:" sections
    steps = re.findall(r'thoughts:\s*(.*?)\s*actions:\s*(.*?)\s*(?=(thoughts:|$))', output_text, re.DOTALL | re.IGNORECASE)
    
    # For each step, extract thoughts and actions
    for step_num, step in enumerate(steps, 1):
        thoughts, actions, _ = step # Unpack the thoughts, actions, and discard the last match
        try:
            print(f"Step {step_num} thoughts: {thoughts}")
            print(f"Step {step_num} actions: {actions}")
            
            # Assuming each step has associated thoughts and actions, append them to action history
            action_history.append(f"Thoughts: {thoughts.strip()}")
            action_history.append(f"Actions: {actions.strip()}")
    
            # Use the frame for this step; PIL images are RGB, while OpenCV expects BGR
            img = cv2.cvtColor(np.array(frames[step_num - 1]), cv2.COLOR_RGB2BGR)
            ui_elements = detect_enhanced_ui_elements(img, actions, interaction_keywords)
            if not ui_elements:
                ui_elements = detect_ui_elements(img)
            
            # Create a copy of the original image for bounding box drawing
            bbox_img = img.copy()
            
            # Mark the most relevant elements
            for i, (x, y, w, h) in enumerate(ui_elements[:3]):
                cv2.rectangle(bbox_img, (x, y), (x+w, y+h), (0, 255, 0), 2)
                cv2.putText(bbox_img, 
                            f"Step {step_num} Element", 
                            (x, y-10), 
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    
            # Add step text overlay
            add_step_instruction(bbox_img, step_num, actions, bbox_img.shape[0])
    
            # Save the bounding box image
            bbox_path = os.path.join(bbox_folder, f"step{step_num}_bbox.jpg")
            cv2.imwrite(bbox_path, bbox_img)
            
        except Exception as e:
            print(f"Error processing step {step_num}: {e}")
    
    # Print all actions and thoughts pairs at the end
    print("\nAction and Thoughts History:")
    for i in range(0, len(action_history), 2):
        print(f"{action_history[i]}")
        print(f"{action_history[i + 1]}")
    
    print(f"Enhanced bounding box images saved to {bbox_folder}")
    print("\nStep-by-step instructions:")
    print(cleaned_instructions)
    
    # Return instructions and action history
    return cleaned_instructions, frames, frame_paths, action_history

def detect_enhanced_ui_elements(image, step_text, keywords):
    """
    Detect UI elements based on step text and keywords.
    
    Args:
        image (numpy.ndarray): Input image
        step_text (str): Step description
        keywords (list): Interaction keywords
    
    Returns:
        list: List of detected UI element bounding boxes
    """
    # Convert image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    height, width = image.shape[:2]
    
    # Detect relevant keywords in the step
    relevant_keywords = [kw for kw in keywords if kw.lower() in step_text.lower()]
    
    # If no relevant keywords, return empty list
    if not relevant_keywords:
        return []
    
    # Adaptive thresholding
    thresh = cv2.adaptiveThreshold(
        gray, 255, 
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
        cv2.THRESH_BINARY_INV, 11, 2
    )
    
    # Find contours
    contours, _ = cv2.findContours(
        thresh, 
        cv2.RETR_EXTERNAL, 
        cv2.CHAIN_APPROX_SIMPLE
    )
    
    # Filter contours based on size and aspect ratio
    elements = []
    min_area = (width * height) * 0.001  # Minimum area
    max_area = (width * height) * 0.2    # Maximum area
    
    for contour in contours:
        area = cv2.contourArea(contour)
        if min_area < area < max_area:
            x, y, w, h = cv2.boundingRect(contour)
            
            # Filter by aspect ratio
            aspect_ratio = float(w) / h if h > 0 else 0
            if 0.2 < aspect_ratio < 5:
                # Additional checks based on keywords
                if any(keyword in ['button', 'icon', 'menu', 'option'] for keyword in relevant_keywords):
                    # Check for rectangular shapes more typical of UI elements
                    if 0.5 < aspect_ratio < 3:
                        elements.append((x, y, w, h))
                else:
                    elements.append((x, y, w, h))
    
    return elements

def detect_ui_elements(image):
    """
    Detect UI elements in the image using contour size and aspect ratio only.
    Used as a fallback when no interaction keyword matches the step text.
    """
    height, width = image.shape[:2]
    elements = []
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Apply adaptive thresholding
    thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
    
    # Find contours
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Filter contours by size to find potential UI elements
    min_area = (width * height) * 0.001  # Minimum area threshold
    max_area = (width * height) * 0.1    # Maximum area threshold
    
    for contour in contours:
        area = cv2.contourArea(contour)
        if min_area < area < max_area:
            x, y, w, h = cv2.boundingRect(contour)
            
            # Check if it's likely a UI element based on aspect ratio
            aspect_ratio = float(w) / h if h > 0 else 0
            if 0.2 < aspect_ratio < 5:  # Common UI element aspect ratios
                elements.append((x, y, w, h))
    
    return elements

def add_step_instruction(image, step_num, step_text, height):
    """Add the step instruction as an overlay at the bottom of the image."""
    # Create semi-transparent overlay for text background
    overlay = image.copy()
    cv2.rectangle(overlay, (0, height-80), (image.shape[1], height), (0, 0, 0), -1)
    cv2.addWeighted(overlay, 0.7, image, 0.3, 0, image)
    
    # Clean and truncate the step text
    clean_text = re.sub(r'\([^)]*\)', '', step_text).strip()  # Remove coordinate info
    truncated_text = clean_text[:80] + "..." if len(clean_text) > 80 else clean_text
    
    # Add text
    cv2.putText(image, 
               f"Step {step_num}: {truncated_text}", 
               (10, height-40),
               cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1)

# Example usage
if __name__ == "__main__":
    os_atlas_enhanced_bounding_boxes(query_folder, query)