# Project — Part 1: Single Image

Abhinav Kumar
12/7/2025

In [None]:
import cv2
import pytesseract
from PIL import Image
import numpy as np

IMAGE_PATH = "/workspaces/eng-ai-agents/project/data/efficientvit_page1.png"


In [20]:
def ocr_page(image_path: str) -> str:
    """Run Tesseract OCR on a full paper page screenshot."""
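    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None instead of raising when the file cannot be read.
        raise FileNotFoundError(f"Could not read image: {image_path}")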

    scale = 1.5
    img = cv2.resize(
        img,
        (int(img.shape[1] * scale), int(img.shape[0] * scale)),
        interpolation=cv2.INTER_CUBIC,
    )

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY)

    text = pytesseract.image_to_string(thresh, lang="eng")
    return text

page_text = ocr_page(IMAGE_PATH)
print(page_text[:1500])


arXiv:2205.14756v6 [cs.CV] 6 Feb 2024

Efficient ViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai!, Junyan Li?, Muyan Hu’, Chuang Gant, Song Han!
IMIT. ?Zhejiang University, “Tsinghua University, MIT-IBM Watson Al Lab

hetps:/ (github

Abstract

High-resolwion dense prediction enables many appeal
ing real-world applications, such as computational pho-
lugraphy, autonomous driving. ete. However the vast
computational cost makes deploying state-of-the-art high-
resolution dense prediction models on hundware devices dif-
ficult. This work presents EfficientVIT, a new family of high-
resolution vision models with novel mulsi-scale linear atten-
Jion, Untike prioe high-resolution dease prediction models
that rely on heavy softmax attention, handware-inefficient
large-temel convolution, or complicated topology struc
lure to obtain good performances, our multi-scale linear
attention achieves the global receptive field and multi-seate
leaming (two desirable feutu
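
The raw output above still contains words split by end-of-line hyphens (for example `pho-` / `lugraphy`) and hard line wraps. A minimal cleanup sketch follows, assuming a hypothetical helper named `clean_ocr_text` that is not part of the original notebook. It only repairs line structure. It cannot fix misrecognized characters, and it will also merge genuine compounds that happen to break at a hyphen (`high-` / `resolution` becomes `highresolution`), so treat it as a rough pre-processing step rather than a reliable fix.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Rejoin hyphen-split words and unwrap single line breaks.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # "pho-\ntography" -> "photography": drop a hyphen that ends a line
    # when the next line starts with a lowercase letter.
    text = re.sub(r"(\w)-\n([a-z])", r"\1\2", text)
    # Replace single newlines inside a paragraph with spaces,
    # but keep blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs left over from the unwrapping.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "computational pho-\ntography, autonomous\ndriving\n\nAbstract"
print(clean_ocr_text(sample))
```

If used, the cleaned string could be passed to the prompt cells in place of `page_text`; whether that improves the model's answers is untested here.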

In [None]:
import requests
import textwrap

OLLAMA_BASE_URL = "http://host.docker.internal:11434"
MODEL_NAME = "qwen2.5:latest"

def call_qwen(system_prompt: str, user_prompt: str) -> str:
    """
    Call local Qwen via Ollama's /api/generate endpoint.
    """
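    # Build the prompt directly: textwrap.dedent becomes a no-op once a
    # multi-line user_prompt with unindented lines has been interpolated.
    prompt = (
        f"System: {system_prompt}\n\n"
        f"User: {user_prompt}\n\n"
        "Assistant:"
    )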

    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
    }

    # Set an explicit timeout so a stalled Ollama server cannot hang the notebook.
    resp = requests.post(f"{OLLAMA_BASE_URL}/api/generate", json=payload, timeout=300)
    if not resp.ok:
        print("Ollama error status:", resp.status_code)
        print("Ollama error body:", resp.text[:1000])
        resp.raise_for_status()

    data = resp.json()
    return data.get("response", "").strip()
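
As an alternative to writing `System:` and `User:` markers into a single prompt string, Ollama's `/api/generate` endpoint also accepts a separate `system` field and an `options` object for sampling parameters. The sketch below only builds the request body, so it runs without a server. `build_generate_payload` and its `temperature` default are hypothetical names and values chosen for illustration, not something the notebook defines.

```python
from typing import Any


def build_generate_payload(
    model: str,
    system_prompt: str,
    user_prompt: str,
    temperature: float = 0.0,
) -> dict[str, Any]:
    """Build a JSON body for Ollama's /api/generate (illustrative sketch)."""
    return {
        "model": model,
        # The system message travels in its own field instead of the prompt text.
        "system": system_prompt,
        "prompt": user_prompt,
        # Ask for one complete JSON response rather than a token stream.
        "stream": False,
        # Assumed default: a low temperature for more repeatable answers.
        "options": {"temperature": temperature},
    }


payload = build_generate_payload(
    "qwen2.5:latest",
    "You are a concise assistant.",
    "Reply with the single word OK.",
)
print(sorted(payload))
```

The resulting dictionary could be passed as `json=payload` to the same `requests.post` call used in `call_qwen`. The model's chat template then decides how the system message is formatted, so answers may differ from those shown in this notebook.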


In [27]:
test_answer = call_qwen(
    "You are a concise assistant.",
    "Reply with the single word OK."
)
print("Qwen test:", repr(test_answer))


Qwen test: 'OK'


In [28]:
system_prompt = (
    "You are an AI tutor explaining computer vision research papers "
    "to a graduate student. Use simple, precise language."
)

user_question = "Summarize the main idea of this page in 4–5 sentences."

user_prompt = textwrap.dedent(f"""
Here is OCR text from the first page of an AI paper (EfficientViT).
The text may contain noise from OCR; ignore formatting issues and focus
on the key ideas.

---- OCR TEXT START ----
{page_text}
---- OCR TEXT END ----

Question: {user_question}
""")

summary_answer = call_qwen(system_prompt, user_prompt)
print(summary_answer)


The paper introduces EfficientViT, a novel vision model designed for high-resolution dense prediction. It uses multi-scale linear attention to achieve both global receptive fields and multi-scale learning with lightweight operations, addressing hardware efficiency. This approach outperforms existing models like SegFormer and SegNeXt on various tasks, including semantic segmentation and super-resolution, while providing significant speedups. The core innovation lies in replacing inefficient softmax attention with ReLU linear attention, which maintains performance gains without sacrificing speed on different hardware platforms.


In [29]:
def explain_highlighted_text(page_text: str, highlighted_text: str) -> str:
    """Tutorial-style explanation of a highlighted snippet."""
    system_prompt = (
        "You are an AI tutor helping a graduate student understand an AI "
        "research paper. Explain things clearly with intuition and a simple example."
    )

    user_prompt = textwrap.dedent(f"""
Here is noisy OCR text from a page of a research paper:

---- PAGE TEXT ----
{page_text}
---- END PAGE TEXT ----

The user highlighted this passage:

---- HIGHLIGHTED PASSAGE ----
{highlighted_text}
---- END HIGHLIGHTED PASSAGE ----

Explain what this highlighted passage means.
1. Restate it in simple terms.
2. Give the intuition and why it matters.
3. Give a small concrete example if it helps.

Keep the answer within 2–4 short paragraphs.
""")

    return call_qwen(system_prompt, user_prompt)

fake_highlight = "High-resolution dense prediction enables many real-world applications such as computational photography and autonomous driving."

highlight_explanation = explain_highlighted_text(page_text, fake_highlight)
print(highlight_explanation)


### Explanation of the Highlighted Passage

The highlighted passage states that high-resolution dense prediction (HRDP) is crucial for numerous real-world applications, including computational photography and autonomous driving. This means that being able to predict or analyze images at a very fine level—i.e., with a lot of detail—is essential for these practical uses.

### Simple Restatement and Intuition

In simpler terms, high-resolution dense prediction allows systems to understand detailed visual information accurately, which is vital for technologies like computational photography (where every pixel counts) and autonomous driving (where precise scene understanding can save lives). The intuition here lies in the fact that higher resolution images provide more granular data, enabling better decision-making processes. For instance, in autonomous driving, a high-resolution image might help detect pedestrians or obstacles from further distances, enhancing safety.

### Concrete Example