## Adaptive Tech Support for Older Adults

This automated system uses prompt chaining to generate step-by-step tech support guides with screenshots for older adults from a user query. It retrieves relevant YouTube tutorials via the YouTube Data API, extracts key UI frames using OpenCV, and builds visual guides with the OS-ATLAS action model. The goal is to make tech support more accessible and effective for older adults by converting video tutorial content into an image-based format.

### 1. Installation of Required Libraries

This cell installs the necessary Python libraries for the project, including packages for Google API access, video processing, image manipulation, machine learning, and deep learning models.

In [None]:
!pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client
!pip install opencv-python-headless yt-dlp
!pip install scikit-image
!pip install qwen_vl_utils
!pip install transformers
!pip install torch torchvision

### 2. YouTube Video Search and Selection

This cell performs the following tasks:
- **Searches for YouTube videos** matching a user-provided query using the YouTube Data API.
- **Cleans the query** to make it folder-friendly (replaces runs of spaces and special characters with underscores).
- **Fetches video details**, including title, URL, duration, and view count.
- **Filters out long videos**, keeping only those no longer than `max_duration_seconds` (120 seconds by default).
- **Selects the video with the highest number of views** from the remaining results.
- **Displays the result**: after the user enters a search query, the most popular video is printed with its title, URL, duration, and view count.
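
The duration filter depends on parsing the ISO 8601 duration string that the YouTube Data API returns in `contentDetails.duration`. The sketch below shows one way to convert such a string to seconds, including the hour and day components. The helper name `iso8601_duration_to_seconds` is hypothetical and not part of the notebook.

```python
import re

# Hypothetical helper (not part of the notebook): matches durations such as
# "PT1M30S", "PT1H2M3S", or "P1DT2H".
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def iso8601_duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a total number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(value) if value else 0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])

print(iso8601_duration_to_seconds("PT1M30S"))   # 90
print(iso8601_duration_to_seconds("PT1H2M3S"))  # 3723
```

Counting the hour component matters for the filter: a video lasting one hour and 30 seconds must not be treated as a 30-second clip.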

In [None]:
import requests
import re

# YouTube API key
YOUTUBE_API_KEY = "" # Add your API Key here

def clean_query(query):
    """Convert query to a folder-friendly format (remove spaces & special characters)."""
    return re.sub(r'\W+', '_', query.strip())

def search_youtube_video(query, max_results=5, max_duration_seconds=120):
    search_url = "https://www.googleapis.com/youtube/v3/search"
    search_params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "key": YOUTUBE_API_KEY,
        "relevanceLanguage": "en",  # Prefer results most relevant to English
        "order": "relevance"  # Sort by relevance
    }
    response = requests.get(search_url, params=search_params).json()
    
    # Extract video IDs
    video_results = []
    video_ids = []
    for item in response.get("items", []):
        video_id = item["id"]["videoId"]
        title = item["snippet"]["title"]
        video_url = f"https://www.youtube.com/watch?v={video_id}"
        video_ids.append(video_id)
        video_results.append({"id": video_id, "title": title, "url": video_url})
    
    if not video_results:
        return None
    
    # Fetch video content details (for duration) and statistics (for views)
    details_url = "https://www.googleapis.com/youtube/v3/videos"
    details_params = {
        "part": "contentDetails,statistics",
        "id": ",".join(video_ids),
        "key": YOUTUBE_API_KEY
    }
    details_response = requests.get(details_url, params=details_params).json()
    
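    # Index search results by video ID: the videos endpoint may omit or reorder
    # items, so positional indexing into video_results is not reliable
    results_by_id = {video["id"]: video for video in video_results}

    # Process videos with duration and views information
    valid_videos = []
    for item in details_response.get("items", []):
        video = results_by_id.get(item["id"])
        if video is None:
            continue

        # Parse duration (in ISO 8601 format, e.g., PT1M30S for 1 min 30 sec)
        duration_str = item["contentDetails"]["duration"]
        # Extract hours, minutes and seconds (hours must be counted, otherwise
        # a video of 1 h 0 min 30 s would pass as a 30-second clip)
        hours_match = re.search(r'(\d+)H', duration_str)
        minutes_match = re.search(r'(\d+)M', duration_str)
        seconds_match = re.search(r'(\d+)S', duration_str)

        hours = int(hours_match.group(1)) if hours_match else 0
        minutes = int(minutes_match.group(1)) if minutes_match else 0
        seconds = int(seconds_match.group(1)) if seconds_match else 0

        # Calculate total duration in seconds
        total_seconds = hours * 3600 + minutes * 60 + seconds

        # Add to valid videos if under max_duration_seconds
        if total_seconds <= max_duration_seconds:
            video["duration_seconds"] = total_seconds
            video["views"] = int(item["statistics"].get("viewCount", 0))
            valid_videos.append(video)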
    
    if not valid_videos:
        return None
    
    # Among the videos that pass the duration filter, pick the one with the
    # most views (the API's relevance order is not used for the final choice)
    best_video = max(valid_videos, key=lambda x: x["views"])
    
    return best_video

# Dynamic user input
query = input("Enter your search query: ")
query_folder = clean_query(query)  # Ensure query is folder-safe
best_video = search_youtube_video(query)

if best_video:
    print(f"Selected Video: {best_video['title']} - {best_video['url']}")
    print(f"Duration: {best_video['duration_seconds']} seconds | Views: {best_video['views']}")
else:
    print("No videos under 2 minutes found.")

### 3. Video Download and Folder Setup

This cell performs the following tasks:
- **Clears cached data** from yt-dlp to ensure fresh downloads.
- **Sets up folders** for storing the video and screenshots related to the search query:
  - Deletes any existing folders and creates new ones.
- **Downloads the selected YouTube video**:
  - The video is saved in the newly created folder for the search query.
  - The `yt-dlp` tool is configured to download the best available quality.
  - A cookie file (if available) is used for the download.
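
The folder reset and the "cookie file if available" behaviour described above can be sketched without touching the network. The helper names `reset_folder` and `build_ydl_opts` are hypothetical; the option keys `format`, `outtmpl`, and `cookiefile` are the ones the notebook passes to `yt_dlp.YoutubeDL`.

```python
import os
import shutil
import tempfile

def reset_folder(path):
    """Delete `path` if it exists, then recreate it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path

def build_ydl_opts(video_folder, cookie_path="cookies.txt"):
    """Assemble a yt-dlp options dict, adding the cookie file only if present."""
    opts = {
        "format": "best",
        "outtmpl": os.path.join(video_folder, "%(id)s.%(ext)s"),
    }
    if os.path.exists(cookie_path):
        opts["cookiefile"] = cookie_path
    return opts

# Demonstrate in a temporary directory so no real data is deleted
with tempfile.TemporaryDirectory() as base:
    folder = reset_folder(os.path.join(base, "videos", "demo_query"))
    opts = build_ydl_opts(folder, cookie_path=os.path.join(base, "cookies.txt"))
    print(sorted(opts))  # ['format', 'outtmpl']
```

Building the options in a separate function keeps the download call small and makes the cookie handling easy to check in isolation.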

In [None]:
import yt_dlp
import os
import shutil

yt_dlp.YoutubeDL().cache.remove()
# if os.path.exists('cookies.txt'):
#     os.remove('cookies.txt')

def setup_folders(query_folder):
    """Delete old data & create new folders for the query."""
    base_dirs = ["videos", "screenshots"]
    for base in base_dirs:
        folder_path = os.path.join(base, query_folder)
        if os.path.exists(folder_path):
            shutil.rmtree(folder_path)  # Delete old folder
        os.makedirs(folder_path)  # Create fresh folder

def download_video(video_url, output_folder="videos"):
    query_video_folder = os.path.join(output_folder, query_folder)
    os.makedirs(query_video_folder, exist_ok=True)  # Ensure folder exists
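    ydl_opts = {
        "format": "best",
        "outtmpl": f"{query_video_folder}/%(id)s.%(ext)s",
        "source_address": "0.0.0.0",  # Python-API equivalent of the CLI flag --force-ipv4
        # "compat_opts": ["legacy-server"]
    }
    # Use the cookie file only if one is available
    if os.path.exists("cookies.txt"):
        ydl_opts["cookiefile"] = "cookies.txt"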
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.cache.remove()
        ydl.download([video_url])

# Setup folders for new query
setup_folders(query_folder)
# Use the video selected previously
download_video(best_video["url"])

### 4. Phone Screen Detection and Frame Extraction

This cell performs the following tasks:
- **Detects the phone screen** in video frames using edge detection and contour approximation:
  - Converts frames to grayscale and uses Canny edge detection.
  - Finds the largest contour, approximates it to a polygon, and identifies it as the phone screen if it is a quadrilateral.
- **Extracts relevant frames** from the video:
  - Frames are taken at intervals, and only the frames where the phone screen is detected are processed.
  - Uses Structural Similarity Index (SSIM) to filter out similar frames, saving only frames with noticeable UI changes.
  - Screenshots are saved in the specified folder for the query.
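
The "keep a frame only when it differs from the last saved one" logic can be illustrated on synthetic frames. This is a minimal sketch that uses only the mean pixel-difference criterion, not SSIM, so it is a simplification of what the cell does. The names `pixel_change_fraction` and `select_changed_frames` are hypothetical.

```python
import numpy as np

def pixel_change_fraction(prev_gray, curr_gray):
    """Mean absolute difference between two grayscale frames, scaled to 0-1."""
    diff = np.abs(prev_gray.astype(np.int16) - curr_gray.astype(np.int16))
    return float(diff.mean()) / 255.0

def select_changed_frames(gray_frames, min_change=0.02):
    """Return indices of frames that differ enough from the last kept frame."""
    kept = []
    last = None
    for index, frame in enumerate(gray_frames):
        # The first frame is always kept; later ones must exceed the threshold
        if last is None or pixel_change_fraction(last, frame) > min_change:
            kept.append(index)
            last = frame
    return kept

blank = np.zeros((4, 4), dtype=np.uint8)
bright = np.full((4, 4), 255, dtype=np.uint8)
print(select_changed_frames([blank, blank.copy(), bright, bright.copy()]))  # [0, 2]
```

Comparing against the last saved frame, rather than the previous frame, prevents slow transitions from slipping through as a series of small changes.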

In [None]:
import cv2
import os
import numpy as np
from skimage.metrics import structural_similarity as ssim

def detect_phone_screen(frame):
    """Detects the largest rectangle (phone screen) in the frame using edge detection and contour approximation."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    if not contours:
        return None  # No contours found
    
    # Find the largest rectangle
    max_contour = max(contours, key=cv2.contourArea)
    epsilon = 0.02 * cv2.arcLength(max_contour, True)
    approx = cv2.approxPolyDP(max_contour, epsilon, True)
    
    if len(approx) == 4:  # If it's a quadrilateral, assume it's the phone screen
        return approx.reshape(4, 2)
    
    return None

def extract_relevant_frames(video_path, output_folder="screenshots", frame_interval=1, similarity_threshold=0.85, min_pixel_change_threshold=0.02):
    """
    Extract frames with significant UI changes from a video.
    
    :param video_path: Path to the input video file
    :param output_folder: Base output folder for screenshots
    :param frame_interval: Process every nth frame
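    :param similarity_threshold: Frames whose SSIM against the last saved frame falls below this value count as changed (a higher threshold saves more frames)
    :param min_pixel_change_threshold: Minimum mean absolute pixel difference, as a fraction (0-1) of the full intensity range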
    """

    # Create query-specific screenshot folder
    query_screenshot_folder = os.path.join(output_folder, query_folder)
    os.makedirs(query_screenshot_folder, exist_ok=True)  # Ensure folder exists
    
    cap = cv2.VideoCapture(video_path)
    frame_count = 0
    screenshot_count = 0
    last_saved_frame = None
    
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        
        # Skip frames to reduce processing time
        if frame_count % frame_interval == 0:
            phone_screen = detect_phone_screen(frame)
            
            if phone_screen is not None:  # Only process frames where the phone screen is detected
                gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                
                if last_saved_frame is None:
                    # Save first frame
                    save_path = os.path.join(query_screenshot_folder, f"frame_{screenshot_count}.jpg")
                    cv2.imwrite(save_path, frame)
                    last_saved_frame = gray_frame
                    screenshot_count += 1
                else:
                    # Compare with last saved frame
                    similarity = ssim(last_saved_frame, gray_frame)
                    
                    # Calculate pixel change percentage
                    pixel_diff = np.sum(np.abs(last_saved_frame.astype(int) - gray_frame.astype(int))) / (gray_frame.shape[0] * gray_frame.shape[1] * 255)
                    
                    # Save frame if significant UI change detected
                    if (similarity < similarity_threshold) or (pixel_diff > min_pixel_change_threshold):
                        save_path = os.path.join(query_screenshot_folder, f"frame_{screenshot_count}.jpg")
                        cv2.imwrite(save_path, frame)
                        last_saved_frame = gray_frame
                        screenshot_count += 1
        
        frame_count += 1
    
    cap.release()
    print(f"Extracted {screenshot_count} relevant UI screenshots in '{query_screenshot_folder}'.")
    return screenshot_count

# Example usage
video_file = f"videos/{query_folder}/{best_video['id']}.mp4"
extract_relevant_frames(video_file)

### 5. UI Screen Detection and Cropping

This cell contains functions for detecting and extracting the screen region from images, using methods like adaptive thresholding, Canny edge detection, and contour analysis. It includes:
- `detect_phone_screen(image)`: Detects the phone's screen area by analyzing edges, contours, and aspect ratios.
- `crop_screen_with_perspective(image, screen_contour)`: Crops the detected screen area using a perspective transformation.
- `order_points(pts)`: Orders the contour points to ensure correct perspective transformation.
- `visualize_detection(image, screen_contour, filename=None, output_dir=None)`: Visualizes the detected screen contour for debugging.
- `extract_ui_screenshots(input_folder="screenshots", output_folder="ui-screens")`: Extracts and saves only the phone screen regions from screenshots.

These functions help isolate the phone screen from images and ensure correct transformations for further processing.
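
The size and aspect-ratio test that the detection function applies to each candidate bounding rectangle can be isolated as a small predicate. This is a sketch under the thresholds used in the cell (at least 1/20 of the frame area, height-to-width ratio strictly between 1.5 and 2.5). The function name `is_phone_like` is hypothetical.

```python
def is_phone_like(w, h, frame_w, frame_h,
                  min_area_fraction=1 / 20, min_ratio=1.5, max_ratio=2.5):
    """Return True if a w x h box is large and tall enough to be a phone screen."""
    if w <= 0 or h <= 0:
        return False
    # Reject boxes that cover too little of the frame
    if w * h < frame_w * frame_h * min_area_fraction:
        return False
    # Phones in portrait orientation are taller than they are wide
    return min_ratio < h / w < max_ratio

print(is_phone_like(300, 600, 1280, 720))  # True: large, ratio 2.0
print(is_phone_like(600, 300, 1280, 720))  # False: wider than tall
print(is_phone_like(30, 60, 1280, 720))    # False: too small
```

These bounds assume the phone is shown in portrait orientation, so landscape recordings would need different ratio limits.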

In [None]:
import cv2
import os
import numpy as np

def detect_phone_screen(image):
    """Detects the phone in the image, focusing on the likely phone region.

    Note: this redefines the simpler detect_phone_screen from the frame-extraction cell.
    """
    height, width = image.shape[:2]
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    
    # 1. Try adaptive thresholding to find the screen
    thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
    
    # 2. Also try Canny edge detection
    canny = cv2.Canny(blurred, 50, 150)
    
    # Combine both methods
    edges = cv2.bitwise_or(thresh, canny)
    
    # Dilate to connect edges
    kernel = np.ones((3, 3), np.uint8)
    dilated = cv2.dilate(edges, kernel, iterations=1)
    
    # Find contours
    contours, _ = cv2.findContours(dilated.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Sort contours by area in descending order
    contours = sorted(contours, key=cv2.contourArea, reverse=True)
    
    # Additional filter for phone-like aspect ratio
    phone_contours = []
    
    for contour in contours[:10]:  # Only check the 10 largest contours
        # Approximate the contour
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        
        # Get bounding rectangle
        x, y, w, h = cv2.boundingRect(approx)
        
        # Skip if the contour is too small
        if w * h < (width * height) / 20:
            continue
            
        # Check if it's at least somewhat rectangular (4-ish points)
        if len(approx) >= 4 and len(approx) <= 8:
            # Check if it has phone-like aspect ratio (height > width)
            aspect_ratio = h / w if w > 0 else 0
            
            # Phones are typically taller than they are wide (aspect ratio > 1.5)
            if aspect_ratio > 1.5 and aspect_ratio < 2.5:
                phone_contours.append(approx)
    
    # If we found potential phone contours, use the largest one
    if phone_contours:
        largest_phone_contour = max(phone_contours, key=cv2.contourArea)
        # Get the bounding rectangle
        x, y, w, h = cv2.boundingRect(largest_phone_contour)
        
        # Create a 4-point contour from the rectangle for perspective transform
        screen_contour = np.array([
            [x, y],           # Top-left
            [x + w, y],       # Top-right
            [x + w, y + h],   # Bottom-right
            [x, y + h]        # Bottom-left
        ], dtype=np.float32)
        
        return screen_contour
    
    # If no phone contour found, look for a rectangle with phone-like proportions
    # in the center half of the image (where phones usually are in tutorial videos)
    
    # Define the center region (middle 50% of the image)
    center_x = width // 4
    center_y = height // 4
    center_w = width // 2
    center_h = height // 2
    
    # Create a mask for the center region
    mask = np.zeros_like(gray)
    mask[center_y:center_y+center_h, center_x:center_x+center_w] = 255
    
    # Apply the mask to the edge image
    masked_edges = cv2.bitwise_and(dilated, mask)
    
    # Find contours in the center region
    center_contours, _ = cv2.findContours(masked_edges, cv2.RETR_EXTERNAL, 
                                         cv2.CHAIN_APPROX_SIMPLE)
    
    # Sort by area
    center_contours = sorted(center_contours, key=cv2.contourArea, reverse=True)
    
    for contour in center_contours[:5]:
        # Approximate the contour
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        
        # Get bounding rectangle
        x, y, w, h = cv2.boundingRect(approx)
        
        # Skip small contours
        if w * h < (width * height) / 30:
            continue
            
        # Check aspect ratio (phone-like)
        aspect_ratio = h / w if w > 0 else 0
        if aspect_ratio > 1.5 and aspect_ratio < 2.5:
            # Create a 4-point contour
            screen_contour = np.array([
                [x, y],           # Top-left
                [x + w, y],       # Top-right
                [x + w, y + h],   # Bottom-right
                [x, y + h]        # Bottom-left
            ], dtype=np.float32)
            
            return screen_contour
    
    # If we still haven't found anything, try one more approach:
    # Find the most central large rectangle in the image
    
    # Use Hough Line transform to find straight lines
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100, minLineLength=100, maxLineGap=10)
    
    if lines is not None:
        # Draw the lines on a blank image
        line_image = np.zeros_like(gray)
        for line in lines:
            x1, y1, x2, y2 = line[0]
            cv2.line(line_image, (x1, y1), (x2, y2), 255, 2)
        
        # Find contours in the line image
        line_contours, _ = cv2.findContours(line_image, cv2.RETR_EXTERNAL, 
                                           cv2.CHAIN_APPROX_SIMPLE)
        
        # Find the contour closest to the center of the image
        image_center = np.array([width/2, height/2])
        best_distance = float('inf')
        best_contour = None
        
        for contour in line_contours:
            if cv2.contourArea(contour) < (width * height) / 20:
                continue
                
            # Find center of contour
            M = cv2.moments(contour)
            if M["m00"] != 0:
                cx = int(M["m10"] / M["m00"])
                cy = int(M["m01"] / M["m00"])
                contour_center = np.array([cx, cy])
                
                # Calculate distance to image center
                distance = np.linalg.norm(contour_center - image_center)
                
                # Get bounding rect to check aspect ratio
                x, y, w, h = cv2.boundingRect(contour)
                aspect_ratio = h / w if w > 0 else 0
                
                # If it's phone-like and closer to center than previous best
                if aspect_ratio > 1.5 and aspect_ratio < 2.5 and distance < best_distance:
                    best_distance = distance
                    best_contour = contour
        
        if best_contour is not None:
            # Get the bounding rectangle
            x, y, w, h = cv2.boundingRect(best_contour)
            
            # Create a 4-point contour
            screen_contour = np.array([
                [x, y],           # Top-left
                [x + w, y],       # Top-right
                [x + w, y + h],   # Bottom-right
                [x, y + h]        # Bottom-left
            ], dtype=np.float32)
            
            return screen_contour
    
    # If all else fails, just take the center portion of the image 
    # with a typical phone aspect ratio (this is a fallback)
    center_x = width // 3
    center_width = width // 3
    
    # Calculate height based on typical phone aspect ratio (1:2), capped at the
    # image height so the rectangle stays inside landscape frames
    center_height = min(center_width * 2, height)
    center_y = (height - center_height) // 2
    
    # Create a rectangle
    screen_contour = np.array([
        [center_x, center_y],                       # Top-left
        [center_x + center_width, center_y],        # Top-right
        [center_x + center_width, center_y + center_height],  # Bottom-right
        [center_x, center_y + center_height]        # Bottom-left
    ], dtype=np.float32)
    
    return screen_contour

def crop_screen_with_perspective(image, screen_contour):
    """Crops the region inside the screen contour using a perspective transform without resizing."""
    if screen_contour is None or len(screen_contour) != 4:
        return None

    # Order points in a consistent order (top-left, top-right, bottom-right, bottom-left)
    ordered_points = order_points(screen_contour)

    # Compute the width and height dynamically
    width = int(np.linalg.norm(ordered_points[1] - ordered_points[0]))  # Distance between top-left & top-right
    height = int(np.linalg.norm(ordered_points[2] - ordered_points[1]))  # Distance between top-right & bottom-right

    # Define the destination points dynamically based on the detected screen dimensions
    dst_points = np.array([
        [0, 0],
        [width - 1, 0],
        [width - 1, height - 1],
        [0, height - 1]
    ], dtype=np.float32)

    # Compute the perspective transform matrix
    matrix = cv2.getPerspectiveTransform(ordered_points, dst_points)

    # Warp the perspective to get the cropped screen
    cropped = cv2.warpPerspective(image, matrix, (width, height))

    return cropped

def order_points(pts):
    """Orders points in top-left, top-right, bottom-right, bottom-left order."""
    # Sort by y-coordinate (top to bottom)
    sorted_by_y = pts[np.argsort(pts[:, 1])]
    
    # Get top and bottom points
    top_points = sorted_by_y[:2]
    bottom_points = sorted_by_y[2:]
    
    # Sort top points by x-coordinate (left to right)
    top_left, top_right = top_points[np.argsort(top_points[:, 0])]
    
    # Sort bottom points by x-coordinate (left to right)
    bottom_left, bottom_right = bottom_points[np.argsort(bottom_points[:, 0])]
    
    # Return points in order: top-left, top-right, bottom-right, bottom-left
    return np.array([top_left, top_right, bottom_right, bottom_left], dtype=np.float32)

def visualize_detection(image, screen_contour, filename=None, output_dir=None):
    """Visualizes the detected screen contour for debugging."""
    vis = image.copy()
    
    if screen_contour is not None:
        # Draw the contour
        cv2.drawContours(vis, [screen_contour.astype(int)], -1, (0, 255, 0), 2)
        
        # Draw the ordered corners
        rect = order_points(screen_contour)
        colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]  # BGR for corners
        labels = ["TL", "TR", "BR", "BL"]  # Corner labels
        
        for i, (x, y) in enumerate(rect.astype(int)):
            cv2.circle(vis, (x, y), 8, colors[i], -1)
            cv2.putText(vis, labels[i], (x - 10, y - 10), 
            cv2.FONT_HERSHEY_SIMPLEX, 0.8, colors[i], 2)
    
    if filename and output_dir:
        os.makedirs(output_dir, exist_ok=True)
        save_path = os.path.join(output_dir, f"debug_{filename}")
        cv2.imwrite(save_path, vis)
    
    return vis

def extract_ui_screenshots(input_folder="screenshots", output_folder="ui-screens"):
    """Extracts only the screen region from screenshots and saves the raw cropped area."""
    query_ui_folder = os.path.join(output_folder, query_folder)
    debug_folder = os.path.join(output_folder, "debug")
    
    os.makedirs(query_ui_folder, exist_ok=True)
    os.makedirs(debug_folder, exist_ok=True)
    
    screenshot_files = sorted(os.listdir(os.path.join(input_folder, query_folder)))
    
    successful = 0
    failed = 0
    
    for screenshot in screenshot_files:
        img_path = os.path.join(input_folder, query_folder, screenshot)
        img = cv2.imread(img_path)
        
        if img is None:
            # print(f"Could not read image: {img_path}")
            failed += 1
            continue
        
        # Detect phone screen
        screen_contour = detect_phone_screen(img)
        
        # Create visualization for debugging
        visualize_detection(img, screen_contour, screenshot, debug_folder)
        
        if screen_contour is not None:
            # Crop the region inside the green box (the contour is an axis-aligned
            # rectangle, so the perspective transform acts as a plain crop)
            cropped_screen = crop_screen_with_perspective(img, screen_contour)
            
            if cropped_screen is not None and cropped_screen.size > 0:
                save_path = os.path.join(query_ui_folder, screenshot)
                cv2.imwrite(save_path, cropped_screen)
                # print(f"Successfully processed: {screenshot}")
                successful += 1
            else:
                # print(f"Failed to process: {screenshot} - Invalid crop")
                failed += 1
        else:
            print(f"✗ Failed to detect screen in: {screenshot}")
            failed += 1
    
    print("Extraction complete!")
    print(f"Successfully processed: {successful} images")
    print(f"Failed to process: {failed} images")
    print(f"Extracted UI screens saved in '{query_ui_folder}'")
    print(f"Debug visualizations saved in '{debug_folder}'")

# Run UI screen extraction
extract_ui_screenshots()

### 6. OS-Atlas Action Model for UI Bounding Box Generation

This cell runs the cropped UI screenshots through the OS-ATLAS action model to generate step-by-step instructions for a user query, such as a technical support task. For each screenshot it passes the action history to the model and draws a bounding box around the UI element that the step refers to. It covers:

- Bounding box generation
- Interaction keyword detection (e.g., button, text, icon, etc.)
- Step-by-step instructions generation for UI actions
- Action history tracking
- GPU memory management
- Uses the `Qwen2VLForConditionalGeneration` model from Hugging Face

The cell appends the generated instructions to `os_atlas_instructions.txt` and prepares an `os_atlas_steps/` folder for the annotated screenshots, in which each bounding box highlights the UI element to interact with.
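
Turning a model action such as `CLICK <point>[x, y]</point>` into a pixel position involves two steps: extracting the numbers and rescaling them to the image size. The sketch below assumes the model reports coordinates on a 0-1000 normalized scale, which is the assumption the notebook's own conversion makes. The name `parse_click` is hypothetical.

```python
import re

# Accepts "[x, y]" and "[[x, y]]", with optional whitespace around the numbers
POINT_RE = re.compile(r"\[\[?\s*(\d+)\s*,\s*(\d+)\s*\]?\]")

def parse_click(action, image_width, image_height, scale=1000):
    """Return (x, y) in pixels for a CLICK action, or None if no point is found.

    Assumes coordinates are normalized to the range 0..scale (an assumption).
    """
    match = POINT_RE.search(action)
    if match is None:
        return None
    x, y = (int(value) for value in match.groups())
    # Rescale, then clamp so the point always lies inside the image
    px = min(image_width - 1, x * image_width // scale)
    py = min(image_height - 1, y * image_height // scale)
    return px, py

print(parse_click("CLICK <point>[500, 250]</point>", 1080, 2000))  # (540, 500)
print(parse_click("PRESS_BACK", 1080, 2000))                       # None
```

If the model returns absolute pixel values instead, the rescaling step should be skipped, so it is worth confirming the scale on a few frames before trusting the boxes.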

In [None]:
# OS-ATLAS - ACTION MODEL

import os
import re
import cv2
import torch
import gc
import json
import numpy as np
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Clean up GPU memory
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

# Load the model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Pro-7B", torch_dtype="auto"
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "OS-Copilot/OS-Atlas-Pro-7B",
    size=None
)

# Define the system prompt for Technology Support mode for older adults
sys_prompt = """  
You are in Technology Support mode for older adults. Your role is to assist users over 60 with technology issues by providing step-by-step executable actions.  

Guidelines:  
1. Step-by-Step Instructions:  
    - Provide clear, short steps that are easy to follow.  
    - Every action must be directly executable without assumptions.  
    - Example: Instead of "Go to settings," specify each step to navigate there.
    - Use the images for the step generation:
        - Exclude or skip over intro, outro and unclear images.
        - Do not use images that do not have interactive UI elements.
        - Strict guidelines: ALWAYS generate coordinates relative to the image's size and resolution:
            - The coordinates need to be within the image bounds.
            - The coordinates need to be correct and accurate location for the UI element mentioned in each step (in 'thought').
            - The coordinates should be in absolute pixel values instead of normalized values. Image resolution is height x width, taken from `image.shape`.

2. Strict Action Format:  
    - Each step must have:  
        - Thought: Explains the reason for the next action.  
        - Action: Specifies what to do in a predefined format.

3. No Follow-Up Questions:  
    - Do not ask for clarification.  
    - Use only given screenshots and action history.

Action Formats:  
1. CLICK: Click on a position. Format: CLICK <point>[x, y]</point>  
2. TYPE: Enter text. Format: TYPE [input text]  
3. SCROLL: Scroll in a direction. Format: SCROLL [UP/DOWN/LEFT/RIGHT]  
4. OPEN_APP: Open an app. Format: OPEN_APP [app_name]  
5. PRESS_BACK: Go to the previous screen. Format: PRESS_BACK  
6. PRESS_HOME: Return to the home screen. Format: PRESS_HOME  
7. ENTER: Press enter. Format: ENTER  
8. WAIT: Pause for loading. Format: WAIT  
9. COMPLETE: Task finished. Format: COMPLETE  

Example Response:  
Query: "How do I take a screenshot on my iPhone?"

Thought: Open the screen you want to capture.
Action: OPEN_APP [Gallery]

Thought: Press the correct button combination to take a screenshot.
Action: PRESS [Power + Volume Up]

Thought: The screenshot is captured and saved.
Action: COMPLETE  
"""

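def extract_coordinates(action, image_width, image_height):
    """
    Extracts coordinates from the action and converts them to absolute pixel
    values based on the image dimensions.

    The model output is assumed to use a 0-1000 normalized scale. Note that the
    system prompt asks for absolute pixel values, so verify which scale the
    model actually returns.
    """
    # Patterns to try for coordinate extraction; whitespace after the comma is
    # allowed so that the "[x, y]" format from the system prompt also matches
    patterns = [
        r'<point>\[\[(\d+),\s*(\d+)\]\]</point>',  # Original pattern with XML-like tags
        r'\[\[(\d+),\s*(\d+)\]\]',                 # Nested list format
        r'<point>\[(\d+),\s*(\d+)\]</point>',      # Alternative XML-like tag
        r'\[(\d+),\s*(\d+)\]'                      # Simple list format
    ]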
    
    for pattern in patterns:
        match = re.search(pattern, action)
        if match:
            x, y = map(int, match.groups())
            
            # Convert normalized coordinates to absolute pixel values
            abs_x = int(x * image_width / 1000)  # Convert normalized x to pixels
            abs_y = int(y * image_height / 1000)  # Convert normalized y to pixels

            print(f"Extracted coordinates (normalized): ({x}, {y})")
            print(f"Converted to absolute pixel values: ({abs_x}, {abs_y})")
            return abs_x, abs_y
    
    # If no coordinates found, print the full action for debugging
    print(f"No coordinates found in action: {action}")
    return None

def draw_bounding_box(image, coordinates, step_number, action_type):
    """
    Draws a bounding box dynamically around the UI element based on a calculated region.
    """
    if coordinates is None:
        return image
    
    x, y = coordinates
    
    # Validate coordinates are within image bounds
    height, width = image.shape[:2]
    if x < 0 or x >= width or y < 0 or y >= height:
        print(f"Warning: Coordinates ({x}, {y}) are outside image bounds ({width}x{height})")
        return image
    
    # Create a copy of the image to draw on
    img_with_box = image.copy()
    
    # Define a dynamic bounding box size based on a percentage of image dimensions
    box_size_x = int(width * 0.2)  # 20% of image width for bounding box size
    box_size_y = int(height * 0.2)  # 20% of image height for bounding box size
    
    # Apply a margin around the coordinates for better bounding box positioning
    margin = 0.05  # 5% margin to expand the bounding box
    expanded_box_size_x = int(box_size_x * (1 + margin))
    expanded_box_size_y = int(box_size_y * (1 + margin))
    
    # Calculate box coordinates with expanded size
    top_left = (
        max(0, x - expanded_box_size_x // 2), 
        max(0, y - expanded_box_size_y // 2)
    )
    bottom_right = (
        min(width, x + expanded_box_size_x // 2), 
        min(height, y + expanded_box_size_y // 2)
    )
    
    # Choose color based on action type
    if 'CLICK' in action_type.upper():
        color = (0, 255, 0)  # Green for CLICK
    elif 'TYPE' in action_type.upper():
        color = (255, 0, 0)  # Blue for TYPE
    elif 'SCROLL' in action_type.upper():
        color = (0, 165, 255)  # Orange for SCROLL (OpenCV uses BGR order)
    else:
        color = (0, 0, 255)  # Red for other actions
    
    # Draw the rectangle (bounding box)
    cv2.rectangle(img_with_box, top_left, bottom_right, color, thickness=2)
    
    # Add step number text with better text placement
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = min(width, height) / 500  # Dynamic font scaling
    font_thickness = 1
    
    text = f"Step {step_number}"
    text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]
    text_pos = (
        max(0, top_left[0]), 
        max(text_size[1] + 10, top_left[1] - 10)
    )
    
    # Add text with a semi-transparent background for better readability
    overlay = img_with_box.copy()
    cv2.rectangle(
        overlay, 
        (text_pos[0], text_pos[1] - text_size[1] - 10),
        (text_pos[0] + text_size[0] + 10, text_pos[1] + 10), 
        (255, 255, 255), 
        -1
    )
    cv2.addWeighted(overlay, 0.5, img_with_box, 0.5, 0, img_with_box)
    
    # Draw text
    cv2.putText(
        img_with_box,
        text,
        text_pos,
        font,
        font_scale,
        color,
        font_thickness
    )
    
    return img_with_box


def log_to_file(message):
    # Append the message to os_atlas_instructions.txt
    with open('os_atlas_instructions.txt', 'a') as file:
        file.write(message + '\n')


# Initialize action history
action_history = []

# Create output folder for boxed images
output_folder = f'./os_atlas_steps/{query_folder}'
os.makedirs(output_folder, exist_ok=True)

# Get all files starting with 'frame_'
frame_files = [f for f in os.listdir(f'./ui-screens/{query_folder}') if f.startswith('frame_')]

# Sort the files numerically (by extracting the number from the filename)
frame_files.sort(key=lambda f: int(f.split('_')[1].split('.')[0]))  # Split based on 'frame_' and take the number

step_number = 1

# Loop through the frames and process each one
for frame_file in frame_files:
    # Print the current frame being processed
    print(f"Processing frame: {frame_file}")

    # Prepare the messages for the current image, using the image path
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": sys_prompt},
                {
                    "type": "image",
                    "image": f'./ui-screens/{query_folder}/{frame_file}',
                },
                {"type": "text", "text": f"Task instruction: '{query}'\nHistory: {action_history or 'null'}"}
            ],
        }
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Generate output for the current frame
    generated_ids = model.generate(**inputs, max_new_tokens=128)

    # Post-process the output
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
    )

    # Initialize variables
    last_step = None
    
    # Extract thought and action
    thought, action = None, None
    capturing_thought, capturing_action = False, False  # Flags to track multi-line capturing
    temp_thought, temp_action = [], []  # Lists to accumulate multi-line text
    
    for line in output_text[0].splitlines():
        line = line.strip()
        lowered = line.lower()
    
        # A header may stand alone ("actions:") or carry text on the same line ("Action: CLICK ...")
        if lowered.startswith("thought:"):
            capturing_thought, capturing_action = True, False
            line = line[len("thought:"):].strip()
        elif lowered.startswith(("actions:", "action:")):
            capturing_thought, capturing_action = False, True
            line = line.split(":", 1)[1].strip()
    
        if not line:
            continue  # Nothing to capture on this line
    
        # Capture multi-line thoughts and actions
        if capturing_thought:
            temp_thought.append(line)
        elif capturing_action:
            temp_action.append(line.replace("<|im_end|>", "").strip())
    
    # Join lines to reconstruct the full thought and action
    thought = " ".join(temp_thought).strip() if temp_thought else None
    action = " ".join(temp_action).strip() if temp_action else None
    
    # Ensure both thought and action exist and check uniqueness
    if thought and action:
        current_step = (thought, action)
    
        # Normalize the action for comparison
        normalized_action = action.strip().lower()
    
        # Check if the action has already been performed (in action history)
        if normalized_action not in [a.strip().lower() for a in action_history]:
            message = f"Step {step_number}:"
            print(message)
            log_to_file(message)
            message = f"Thought: {thought}"
            print(message)
            log_to_file(message)
            message = f"Action: {action}"
            print(message)
            log_to_file(message)
            
            # Read the image for bounding box
            img_path = f'./ui-screens/{query_folder}/{frame_file}'
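            img = cv2.imread(img_path)
            if img is None:
                # cv2.imread returns None instead of raising when the file cannot be read
                print(f"Error: Unable to read image {img_path}")
                continue

            img_height, img_width = img.shape[:2]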
            
            # Check if the action has coordinates
            coordinates = extract_coordinates(action, img_width, img_height)

            # Create a step-specific folder and copy the frame
            step_folder = os.path.join(output_folder, f'step_{step_number}')
            os.makedirs(step_folder, exist_ok=True)
            
            # Draw bounding box for actions with coordinates
            if coordinates:
                img_with_box = draw_bounding_box(img, coordinates, step_number, action)
                # Save the image with bounding box
                output_img_path = os.path.join(step_folder, frame_file)
                cv2.imwrite(output_img_path, img_with_box)
            else:
                # If no coordinates, just copy the original image
                output_img_path = os.path.join(step_folder, frame_file)
                cv2.imwrite(output_img_path, img)

            # Save step details as JSON
            step_details = {
                "step_number": step_number,
                "frame": frame_file,
                "thought": thought,
                "action": action,
                "coordinates": coordinates
            }

            with open(os.path.join(step_folder, 'step_details.json'), 'w') as f:
                json.dump(step_details, f, indent=4)
            
            action_history.append(action)  # Add only action to the history
            last_step = current_step  # Update last step
            
            print(f"Action History: {action_history}\n")
            step_number = step_number + 1
    else:
        print("Thought or action missing for this step.")
        
        # Copy the original image to the output folder
        img_path = f'./ui-screens/{query_folder}/{frame_file}'
        img = cv2.imread(img_path)
        output_path = os.path.join(output_folder, frame_file)
        cv2.imwrite(output_path, img)

        step_number = step_number + 1

print("–––––– END ––––––")
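
The parsing loop in the cell above can also be factored into a small helper so it is easier to test in isolation. The following is a minimal sketch, not the notebook's own implementation: the name `parse_thought_action` and the regular expression are assumptions, and it accepts both a standalone header line (`actions:`) and a header with text on the same line (`Action: CLICK ...`).

```python
import re

# Hypothetical helper; mirrors the thought/action parsing loop in the cell above.
# Matches "thought:", "action:" or "actions:" at the start of a line (any case).
HEADER_RE = re.compile(r"^(thought|actions?)\s*:\s*(.*)$", re.IGNORECASE)


def parse_thought_action(raw_output):
    """Split raw model text into (thought, action); either may be None."""
    parts = {"thought": [], "action": []}
    current = None  # Section currently being captured

    # Drop the end-of-turn token before splitting into lines
    for line in raw_output.replace("<|im_end|>", "").splitlines():
        line = line.strip()
        if not line:
            continue

        header = HEADER_RE.match(line)
        if header:
            current = "thought" if header.group(1).lower() == "thought" else "action"
            line = header.group(2).strip()  # Text that follows the header, if any
            if not line:
                continue

        if current is not None:
            parts[current].append(line)

    thought = " ".join(parts["thought"]) or None
    action = " ".join(parts["action"]) or None
    return thought, action


if __name__ == "__main__":
    sample = "thought:\nOpen Settings\nactions:\nCLICK <point>[[500,300]]</point><|im_end|>"
    print(parse_thought_action(sample))
```

Text that appears before the first header is ignored, so output with no recognizable header yields `(None, None)`.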

In [None]:
!pip install --upgrade openai

In [None]:
# GPT-4o model

import os
import re
import cv2
import json
import base64
from openai import OpenAI
import numpy as np
import shutil
import difflib

os.environ['OPENAI_API_KEY'] = "" # Add your API Key here

# Initialize OpenAI client
client = OpenAI()

# Define the system prompt for Technology Support mode for older adults
sys_prompt = """  
You are in Technology Support mode for older adults. Your role is to assist users over 60 with technology issues by providing step-by-step executable actions.  

Guidelines:  
1. Provide clear, minimal steps to complete the task.
2. Eliminate redundant or similar steps.
3. Focus on unique, actionable instructions.
4. Each step should be distinctly different from previous steps. Only include steps that have important instructions to achieve the goal.
5. Use the images for the step generation:
    - Exclude or skip over intro, outro and unclear images.
    - Do not use images that do not have interactive UI elements.
    - Strict guidelines: ALWAYS generate coordinates relative to the image's size and resolution:
        - The coordinates need to be within the image bounds.
        - The coordinates need to be correct and accurate location for the UI element mentioned in each step (in 'thought').
        - The coordinates should be in absolute pixel values instead of normalized values. Image resolution is: heightxwidth taken from the `image.shape`.

6. Include:
   - Thought: Direct, concise, crucial reasoning for the action
   - Action: Simple, specific, executable instruction
   - Coordinates: Precise click, tap or any action location of the UI element in the image (relative to the image size), if applicable
7. The response must be a JSON object.
8. Once the goal is achieved, i.e., after the action is COMPLETE, end the step generation process.

Response format (strict JSON):
{
    "thought": "Unique reasoning for action given in the image and previous step",
    "action": "Specific executable action associated with the elements in the image",
    "coordinates": [[x, y]]
}

IMPORTANT RESPONSE REQUIREMENTS:
- ALWAYS respond in VALID JSON format
- NEVER return plain text
- If no actionable step is possible, return:
{
    "thought": "No unique step can be derived from this image",
    "action": "SKIP",
    "coordinates": []
}

For each step, provide:
- thought: Concise reasoning for the action
- action: Specific UI interaction (e.g., "CLICK", "OPEN_APP", "TAP_BUTTON")
- coordinates: Exact click coordinates [x, y] or empty list []

Example response:
{
    "thought": "Open Settings to modify Control Center",
    "action": "OPEN_APP",
    "coordinates": [180, 50]
}
"""

def encode_image(image_path):
    """
    Encode an image file to base64 for API transmission
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def is_similar_text(text1, text2, threshold=0.8):
    """
    Check if two texts are similar using difflib
    """
    matcher = difflib.SequenceMatcher(None, text1, text2)
    return matcher.ratio() > threshold

def extract_json(text):
    """
    Extract JSON from text by removing markdown code block markers
    """
    # Remove ```json and ``` markers
    text = text.replace('```json', '').replace('```', '').strip()
    return text

def draw_bounding_boxes(image_path, coordinates):
    """
    Draw bounding boxes around elements in the image based on the given coordinates,
    ensuring they remain within image bounds.
    """
    image = cv2.imread(image_path)
    if image is None:
        print(f"Error: Unable to read image {image_path}")
        return None

    img_height, img_width, _ = image.shape  # Get image dimensions

    # Define bounding box properties dynamically based on image size
    box_size = min(img_width, img_height) // 15  # 1/15 (about 6.7%) of the smaller dimension
    thickness = max(1, min(img_width, img_height) // 250)  # Keep thickness proportional
    color = (0, 255, 0)  # Green

    # Ensure coordinates is a list of lists (the model may return [x, y] or [[x, y], ...])
    if isinstance(coordinates, list) and coordinates and all(isinstance(coord, (int, float)) for coord in coordinates):
        coordinates = [coordinates]  # Convert single coordinate to a list of lists

    for coord in coordinates:
        if isinstance(coord, list) and len(coord) == 2:
            x, y = int(coord[0]), int(coord[1])  # cv2.rectangle expects integer pixel values

            # Ensure the bounding box is within image boundaries
            top_left_x = max(0, x - box_size // 2)
            top_left_y = max(0, y - box_size // 2)
            bottom_right_x = min(img_width, x + box_size // 2)
            bottom_right_y = min(img_height, y + box_size // 2)

            # Draw the rectangle
            cv2.rectangle(image, (top_left_x, top_left_y), (bottom_right_x, bottom_right_y), color, thickness)

    return image

def log_to_file(message):
    # Append the message to gpt4o_instructions.txt
    with open('gpt4o_instructions.txt', 'a') as file:
        file.write(message + '\n')

def process_ui_screens(query_folder, query):
    """
    Process UI screen images with efficient step generation
    """
    # Create output folders
    screenshots_path = f'./ui-screens/{query_folder}'
    steps_output_folder = f'./gpt4o_steps/{query_folder}'
    os.makedirs(steps_output_folder, exist_ok=True)

    # Get all files starting with 'frame_'
    frame_files = [f for f in os.listdir(screenshots_path) if f.startswith('frame_')]
    frame_files.sort(key=lambda f: int(f.split('_')[1].split('.')[0]))

    # Track processed steps and previous responses
    previous_thoughts = []
    previous_actions = []
    step_number = 1
    previous_response_id = None
    complete_found = False

    for frame_file in frame_files:
        # Skip frames with index below 2, which are assumed to be intro screens
        frame_index = int(frame_file.split('_')[1].split('.')[0])
        if frame_index < 2:
            message = f"Skipping {frame_file} as it appears to be an intro screen"
            print(message)
            # log_to_file(message)
            continue

        # If we already found a COMPLETE action, break out of the loop
        if complete_found:
            break

        # Prepare the image path
        img_path = os.path.join(screenshots_path, frame_file)
        
        try:
            # Make API call to GPT-4o
            response = client.responses.create(
                model="gpt-4o",
                input=[
                    {"role": "system", "content": sys_prompt},
                    {"role": "user", "content": [
                        {"type": "input_text", "text": f"Task instruction: '{query}, Provide a unique step that hasn't been mentioned before.'"},
                        {"type": "input_image", "image_url": f"data:image/jpeg;base64,{encode_image(img_path)}"}
                    ]}
                ],
                previous_response_id=previous_response_id,
            )

            # Parse the response
            output_text = response.output_text

            # Update `previous_response_id` for the next turn
            previous_response_id = response.id

            try:
                # Extract and parse JSON
                cleaned_json = extract_json(output_text)
                response_data = json.loads(cleaned_json)
            except json.JSONDecodeError as json_err:
                print(f"JSON Parsing Error: {json_err}")
                print(f"Problematic JSON: {output_text}")
                continue

            # Skip if action is SKIP
            if response_data.get('action') == 'SKIP':
                message = "Skipping frame due to no unique step"
                print(message)
                # log_to_file(message)
                continue

            # Extract details
            thought = response_data.get('thought', '')
            action = response_data.get('action', '')
            coordinates = response_data.get('coordinates', [])

            # Print step details
            message = f"Step {step_number}:"
            print(message)
            log_to_file(message)
            message = f"Frame: {frame_file}"
            print(message)
            log_to_file(message)
            message = f"Thought: {thought}"
            print(message)
            log_to_file(message)
            message = f"Action: {action}"
            print(message)
            log_to_file(message)
            message = f"Coordinates: {coordinates}\n"
            print(message)
            log_to_file(message)

            # Create a step-specific folder and copy the frame
            step_folder = os.path.join(steps_output_folder, f'step_{step_number}_frame_{frame_index}')
            os.makedirs(step_folder, exist_ok=True)

            # If coordinates exist, draw bounding boxes
            if coordinates:
                modified_image = draw_bounding_boxes(img_path, coordinates)
                if modified_image is not None:
                    output_img_path = os.path.join(step_folder, frame_file)
                    cv2.imwrite(output_img_path, modified_image)
                else:
                    shutil.copy(img_path, os.path.join(step_folder, frame_file))
            else:
                shutil.copy(img_path, os.path.join(step_folder, frame_file))

            # Save step details as JSON
            step_details = {
                "step_number": step_number,
                "frame": frame_file,
                "thought": thought,
                "action": action,
                "coordinates": coordinates
            }
            
            with open(os.path.join(step_folder, 'step_details.json'), 'w') as f:
                json.dump(step_details, f, indent=4)

            # Check if action is COMPLETE - if so, set the flag to stop after this iteration
            if action == 'COMPLETE':
                message = "Task completed. Exiting step generation process."
                print(message)
                log_to_file(message)
                complete_found = True

            # Increment step number
            step_number += 1

            # Track previous thoughts and actions (only if not COMPLETE)
            if action != 'COMPLETE':
                previous_thoughts.append(thought)
                previous_actions.append(action)

        except Exception as e:
            message = f"Error processing {frame_file}: {e}"
            print(message)
            # log_to_file(message)
            # Skip this frame and move to next
            continue

    print("–––––––––– END ––––––––––")

if __name__ == "__main__":
    process_ui_screens(query_folder, query)
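
The cell above defines `is_similar_text` and collects `previous_thoughts` and `previous_actions`, but the loop does not appear to compare new steps against them. One possible way to use that history is sketched below. The helper name `is_duplicate_step` is hypothetical, and the 0.8 threshold is simply carried over from `is_similar_text`; it is an assumption that this threshold suits real model output.

```python
import difflib


def is_duplicate_step(thought, previous_thoughts, threshold=0.8):
    """Return True if `thought` closely matches any earlier thought.

    Hypothetical helper: compares case-insensitively with
    difflib.SequenceMatcher, as is_similar_text does in the cell above.
    """
    candidate = thought.strip().lower()
    for earlier in previous_thoughts:
        ratio = difflib.SequenceMatcher(
            None, candidate, earlier.strip().lower()
        ).ratio()
        if ratio > threshold:
            return True
    return False


if __name__ == "__main__":
    history = ["Open the Settings app."]
    print(is_duplicate_step("Open the Settings app", history))
    print(is_duplicate_step("Tap Control Center", history))
```

Inside `process_ui_screens`, a check such as `if is_duplicate_step(thought, previous_thoughts): continue` placed before the step is logged would skip near-identical steps without changing how unique steps are numbered.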

In [None]:
# OS-ATLAS - ACTION MODEL - WITH BOUNDING BOXES

import os
import re
import cv2
import torch
import gc
import json
import numpy as np
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Clean up GPU memory
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

# Load the model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B", torch_dtype="auto"
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B",
    size=None
)

# Define the system prompt for Technology Support mode for older adults
sys_prompt = """  
You are in Technology Support mode for older adults. Your role is to assist users over 60 with technology issues by providing step-by-step executable actions.  

Guidelines:  
1. Step-by-Step Instructions:  
    - Provide clear, short steps that are easy to follow.  
    - Every action must be directly executable without assumptions.  
    - Example: Instead of "Go to settings," specify each step to navigate there.
    - Use the images for the step generation:
        - Exclude or skip over intro, outro and unclear images.
        - Do not use images that do not have interactive UI elements.
        - Strict guidelines: ALWAYS generate coordinates relative to the image's size and resolution:
            - The coordinates need to be within the image bounds.
            - The coordinates need to be correct and accurate location for the UI element mentioned in each step (in 'thought').
            - The coordinates should be in absolute pixel values instead of normalized values. Image resolution is: heightxwidth taken from the `image.shape`.

2. Strict Action Format:  
    - Each step must have:  
        - Thought: Explains the reason for the next action.  
        - Action: Specifies what to do in a predefined format.

3. No Follow-Up Questions:  
    - Do not ask for clarification.  
    - Use only given screenshots and action history.

Action Formats:  
1. CLICK: Click on a position. Format: CLICK <point>[x, y]</point>  
2. TYPE: Enter text. Format: TYPE [input text]  
3. SCROLL: Scroll in a direction. Format: SCROLL [UP/DOWN/LEFT/RIGHT]  
4. OPEN_APP: Open an app. Format: OPEN_APP [app_name]  
5. PRESS_BACK: Go to the previous screen. Format: PRESS_BACK  
6. PRESS_HOME: Return to the home screen. Format: PRESS_HOME  
7. ENTER: Press enter. Format: ENTER  
8. WAIT: Pause for loading. Format: WAIT  
9. COMPLETE: Task finished. Format: COMPLETE  

Example Response:  
Query: "How do I take a screenshot on my iPhone?"

Thought: Open the screen you want to capture.
Action: OPEN_APP [Gallery]

Thought: Press the correct button combination to take a screenshot.
Action: PRESS [Power + Volume Up]

Thought: The screenshot is captured and saved.
Action: COMPLETE  
"""

def extract_coordinates(action, image_width, image_height):
    """
    Extracts normalized coordinates from the action and converts them
    to absolute pixel values based on the image dimensions.
    """
    # Patterns to try for coordinate extraction
    # Allow optional whitespace after the comma, since the prompt shows "[x, y]"
    patterns = [
        r'<point>\[\[(\d+),\s*(\d+)\]\]</point>',  # Original pattern with XML-like tags
        r'\[\[(\d+),\s*(\d+)\]\]',                 # Nested list format
        r'<point>\[(\d+),\s*(\d+)\]</point>',      # Alternative XML-like tag
        r'\[(\d+),\s*(\d+)\]'                      # Simple list format
    ]
    
    for pattern in patterns:
        match = re.search(pattern, action)
        if match:
            x, y = map(int, match.groups())
            
            # Convert normalized coordinates to absolute pixel values
            abs_x = int(x * image_width / 1000)  # Convert normalized x to pixels
            abs_y = int(y * image_height / 1000)  # Convert normalized y to pixels

            print(f"Extracted coordinates (normalized): ({x}, {y})")
            print(f"Converted to absolute pixel values: ({abs_x}, {abs_y})")
            return abs_x, abs_y
    
    # If no coordinates found, print the full action for debugging
    print(f"No coordinates found in action: {action}")
    return None

def draw_bounding_box(image, coordinates, step_number, action_type):
    """
    Draws a bounding box dynamically around the UI element based on a calculated region.
    """
    if coordinates is None:
        return image
    
    x, y = coordinates
    
    # Validate coordinates are within image bounds
    height, width = image.shape[:2]
    if x < 0 or x >= width or y < 0 or y >= height:
        print(f"Warning: Coordinates ({x}, {y}) are outside image bounds ({width}x{height})")
        return image
    
    # Create a copy of the image to draw on
    img_with_box = image.copy()
    
    # Define a dynamic bounding box size based on a percentage of image dimensions
    box_size_x = int(width * 0.2)  # 20% of image width for bounding box size
    box_size_y = int(height * 0.2)  # 20% of image height for bounding box size
    
    # Apply a margin around the coordinates for better bounding box positioning
    margin = 0.05  # 5% margin to expand the bounding box
    expanded_box_size_x = int(box_size_x * (1 + margin))
    expanded_box_size_y = int(box_size_y * (1 + margin))
    
    # Calculate box coordinates with expanded size
    top_left = (
        max(0, x - expanded_box_size_x // 2), 
        max(0, y - expanded_box_size_y // 2)
    )
    bottom_right = (
        min(width, x + expanded_box_size_x // 2), 
        min(height, y + expanded_box_size_y // 2)
    )
    
    # Choose color based on action type
    if 'CLICK' in action_type.upper():
        color = (0, 255, 0)  # Green for CLICK
    elif 'TYPE' in action_type.upper():
        color = (255, 0, 0)  # Blue for TYPE
    elif 'SCROLL' in action_type.upper():
        color = (0, 165, 255)  # Orange for SCROLL (OpenCV colors are in BGR order)
    else:
        color = (0, 0, 255)  # Red for other actions
    
    # Draw the rectangle (bounding box)
    cv2.rectangle(img_with_box, top_left, bottom_right, color, thickness=2)
    
    # Add step number text with better text placement
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = min(width, height) / 500  # Dynamic font scaling
    font_thickness = 1
    
    text = f"Step {step_number}"
    text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]
    text_pos = (
        max(0, top_left[0]), 
        max(text_size[1] + 10, top_left[1] - 10)
    )
    
    # Add text with a semi-transparent background for better readability
    overlay = img_with_box.copy()
    cv2.rectangle(
        overlay, 
        (text_pos[0], text_pos[1] - text_size[1] - 10),
        (text_pos[0] + text_size[0] + 10, text_pos[1] + 10), 
        (255, 255, 255), 
        -1
    )
    cv2.addWeighted(overlay, 0.5, img_with_box, 0.5, 0, img_with_box)
    
    # Draw text
    cv2.putText(
        img_with_box,
        text,
        text_pos,
        font,
        font_scale,
        color,
        font_thickness
    )
    
    return img_with_box


def log_to_file(message):
    # Append the message to os_atlas_base_instructions.txt
    with open('os_atlas_base_instructions.txt', 'a') as file:
        file.write(message + '\n')


# Initialize action history
action_history = []

# Define the query folder
query_folder = "How_to_restore_the_flashlight_icon_in_the_Control_Center_on_an_iPhone_"
# Define the task instruction
query = "How to restore the flashlight icon in the Control Center on an iPhone?"

# Create output folder for boxed images
output_folder = f'./os_atlas_base_steps/{query_folder}'
os.makedirs(output_folder, exist_ok=True)

# Get all files starting with 'frame_'
frame_files = [f for f in os.listdir(f'./ui-screens/{query_folder}') if f.startswith('frame_')]

# Sort the files numerically (by extracting the number from the filename)
frame_files.sort(key=lambda f: int(f.split('_')[1].split('.')[0]))  # Split based on 'frame_' and take the number

step_number = 1

# Loop through the frames and process each one
for frame_file in frame_files:
    # Print the current frame being processed
    print(f"Processing frame: {frame_file}")

    # Prepare the messages for the current image, using the image path
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": sys_prompt},
                {
                    "type": "image",
                    "image": f'./ui-screens/{query_folder}/{frame_file}',
                },
                {"type": "text", "text": f"Task instruction: '{query}'\nHistory: {action_history or 'null'}"}
            ],
        }
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Generate output for the current frame
    generated_ids = model.generate(**inputs, max_new_tokens=128)

    # Post-process the output
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
    )

    print(f"OT: {output_text}")

    # Initialize variables
    last_step = None
    
    # Extract thought and action
    thought, action = None, None
    capturing_thought, capturing_action = False, False  # Flags to track multi-line capturing
    temp_thought, temp_action = [], []  # Lists to accumulate multi-line text
    
    for line in output_text[0].splitlines():
        line = line.strip()
        lowered = line.lower()
    
        # A header may stand alone ("actions:") or carry text on the same line ("Action: CLICK ...")
        if lowered.startswith("thought:"):
            capturing_thought, capturing_action = True, False
            line = line[len("thought:"):].strip()
        elif lowered.startswith(("actions:", "action:")):
            capturing_thought, capturing_action = False, True
            line = line.split(":", 1)[1].strip()
    
        if not line:
            continue  # Nothing to capture on this line
    
        # Capture multi-line thoughts and actions
        if capturing_thought:
            temp_thought.append(line)
        elif capturing_action:
            temp_action.append(line.replace("<|im_end|>", "").strip())
    
    # Join lines to reconstruct the full thought and action
    thought = " ".join(temp_thought).strip() if temp_thought else None
    action = " ".join(temp_action).strip() if temp_action else None
    
    # Ensure both thought and action exist and check uniqueness
    if thought and action:
        current_step = (thought, action)
    
        # Normalize the action for comparison
        normalized_action = action.strip().lower()
    
        # Check if the action has already been performed (in action history)
        if normalized_action not in [a.strip().lower() for a in action_history]:
            message = f"Step {step_number}:"
            print(message)
            log_to_file(message)
            message = f"Thought: {thought}"
            print(message)
            log_to_file(message)
            message = f"Action: {action}"
            print(message)
            log_to_file(message)
            
            # Read the image for bounding box
            img_path = f'./ui-screens/{query_folder}/{frame_file}'
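            img = cv2.imread(img_path)
            if img is None:
                # cv2.imread returns None instead of raising when the file cannot be read
                print(f"Error: Unable to read image {img_path}")
                continue

            img_height, img_width = img.shape[:2]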
            
            # Check if the action has coordinates
            coordinates = extract_coordinates(action, img_width, img_height)

            # Create a step-specific folder and copy the frame
            step_folder = os.path.join(output_folder, f'step_{step_number}')
            os.makedirs(step_folder, exist_ok=True)
            
            # Draw bounding box for actions with coordinates
            if coordinates:
                img_with_box = draw_bounding_box(img, coordinates, step_number, action)
                # Save the image with bounding box
                output_img_path = os.path.join(step_folder, frame_file)
                cv2.imwrite(output_img_path, img_with_box)
            else:
                # If no coordinates, just copy the original image
                output_img_path = os.path.join(step_folder, frame_file)
                cv2.imwrite(output_img_path, img)

            # Save step details as JSON
            step_details = {
                "step_number": step_number,
                "frame": frame_file,
                "thought": thought,
                "action": action,
                "coordinates": coordinates
            }

            with open(os.path.join(step_folder, 'step_details.json'), 'w') as f:
                json.dump(step_details, f, indent=4)
            
            action_history.append(action)  # Add only action to the history
            last_step = current_step  # Update last step
            
            print(f"Action History: {action_history}\n")
            step_number = step_number + 1
    else:
        print("Thought or action missing for this step.")
        
        # Copy the original image to the output folder
        img_path = f'./ui-screens/{query_folder}/{frame_file}'
        img = cv2.imread(img_path)
        output_path = os.path.join(output_folder, frame_file)
        cv2.imwrite(output_path, img)

        step_number = step_number + 1

print("–––––– END ––––––")
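
`extract_coordinates` in the cell above treats the model's point as lying on a 0 to 1000 grid and scales it to pixels. A value of exactly 1000 then maps to `image_width` or `image_height`, which `draw_bounding_box` rejects as out of bounds. The sketch below shows one way to clamp the result to valid pixel indices. The function name `to_pixels` and the `grid` parameter are hypothetical, and the 0 to 1000 convention is taken from the division by 1000 in the cell above rather than verified against the model's documentation.

```python
def to_pixels(norm_x, norm_y, image_width, image_height, grid=1000):
    """Scale grid coordinates to pixels and clamp them inside the image.

    Hypothetical helper: `grid` is the assumed upper bound of the model's
    coordinate range (1000 in the cell above).
    """
    px = int(norm_x * image_width / grid)
    py = int(norm_y * image_height / grid)

    # Valid pixel indices run from 0 to width - 1 and 0 to height - 1
    px = min(max(px, 0), image_width - 1)
    py = min(max(py, 0), image_height - 1)
    return px, py


if __name__ == "__main__":
    print(to_pixels(500, 250, 1280, 720))    # (640, 180)
    print(to_pixels(1000, 1000, 1280, 720))  # (1279, 719)
```

Clamping keeps a box on screen for edge values, at the cost of hiding cases where the model returns coordinates far outside the expected range, so logging the unclamped values may still be worthwhile.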