Skip to content

feat: add text content support for scrape#1041

Merged
amhsirak merged 2 commits intodevelopfrom
scrape-text
Apr 20, 2026
Merged

feat: add text content support for scrape#1041
amhsirak merged 2 commits intodevelopfrom
scrape-text

Conversation

@RohitR311
Copy link
Copy Markdown
Collaborator

@RohitR311 RohitR311 commented Apr 20, 2026

What this PR does?

Adds the text content format for scrape robot.

Summary by CodeRabbit

  • New Features
    • Added "Text" as a selectable output format for scrape robots to extract plain page text.
    • Run results now include a Text Content view with a preformatted display and a "Download Text" button to save as .txt.
    • Text extraction is subject to the same timeout protection as other scrape output formats to maintain reliable execution.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 20, 2026

Walkthrough

Adds plain-text extraction to scrape runs: new convertPageToText function, text added to scrape output formats, scrape handlers (API, worker, scheduler) race text conversion with SCRAPE_TIMEOUT and store results, and frontend UI/display updated to select and download text output.

Changes

Cohort / File(s) Summary
Text Extraction Function
server/src/markdownify/scrape.ts
New exported convertPageToText(url, page) that navigates and extracts `document.body.innerText
Output Format Configuration
server/src/constants/output-formats.ts
Added 'text' to SCRAPE_OUTPUT_FORMAT_OPTIONS so scrape flows may request text output.
Scrape Execution Handlers
server/src/api/record.ts, server/src/pgboss-worker.ts, server/src/workflow-management/scheduler/index.ts
Imported convertPageToText, added conditional text conversion when formats includes 'text', wrapped conversion in Promise.race with SCRAPE_TIMEOUT, and store non-empty results in serializableOutput.text as [{ content: text }]; failures are logged/warned without failing the full conversion.
Frontend: Robot Create UI
src/components/robot/pages/RobotCreate.tsx
Added text option to scrape output-format multi-select (label mapping and checkbox menu item).
Frontend: Run Output View
src/components/run/RunContent.tsx
Added textContent state, unified extractFromOutput to read text from multiple output shapes, added "Text Content" accordion and a "Download Text" (output.txt, text/plain) button.

Sequence Diagram(s)

sequenceDiagram
    participant UI as User / UI
    participant API as API (record.ts)
    participant Scheduler as Scheduler
    participant Worker as Worker / pgboss
    participant Scraper as Playwright / markdownify
    participant Store as Storage

    UI->>API: Create/run scrape (formats includes "text")
    API->>Scheduler: Enqueue/run job
    Scheduler->>Worker: Dispatch run
    Worker->>Scraper: convertPageToText(url, page)
    Scraper-->>Worker: text (or timeout/error)
    Worker->>Store: persist serializableOutput.text = [{ content: text }]
    Worker-->>Scheduler: job complete
    Scheduler-->>API: update run status
    API-->>UI: return run results (includes text)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Type: Enhancement, Scope: UI/UX, Scope: Infra

Suggested reviewers

  • amhsirak

Poem

🐰 I hopped through pages, soft and light,

I pulled the words and trimmed them right.
Timeout kept me swift and true,
Now plain text hops back to you —
Downloaded neat, a rabbit’s delight. 📄✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding text content support to the scrape functionality across backend and frontend components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch scrape-text

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
server/src/api/record.ts (3)

345-380: ⚠️ Potential issue | 🟠 Major

Expose text output in formatted API run responses.

serializableOutput.text is persisted, but formatRunResponse only copies markdown/html into run.data; GET/POST API responses will omit text content.

🐛 Proposed fix
         data: {
             textData: {},
             listData: {},
             crawlData: {},
             searchData: {},
             markdown: '',
-            html: ''
+            html: '',
+            text: ''
         },
@@
     if (output.html && Array.isArray(output.html)) {
         formattedRun.data.html = output.html[0]?.content || '';
     }
+
+    if (output.text && Array.isArray(output.text)) {
+        formattedRun.data.text = output.text[0]?.content || '';
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/api/record.ts` around lines 345 - 380, formatRunResponse (the code
creating formattedRun from run.serializableOutput) ignores
serializableOutput.text so persisted text is never surfaced; update the mapping
logic to copy serializableOutput.text into formattedRun.data (e.g., set
formattedRun.data.text or formattedRun.data.textData depending on the intended
field), similar to how markdown/html are handled, and guard with the same type
checks used for other outputs (check run.serializableOutput.text exists and is
an object/string as appropriate before assigning).

710-712: ⚠️ Potential issue | 🟠 Major

Allow text in API format overrides.

Line 710 filters caller-provided requestedFormats against a list that still excludes text, so { "formats": ["text"] } silently runs with no requested output.

🐛 Proposed fix
-                formats = requestedFormats.filter((f): f is 'markdown' | 'html' | 'screenshot-visible' | 'screenshot-fullpage' =>
-                    ['markdown', 'html', 'screenshot-visible', 'screenshot-fullpage'].includes(f)
+                formats = requestedFormats.filter((f): f is 'markdown' | 'html' | 'text' | 'screenshot-visible' | 'screenshot-fullpage' =>
+                    ['markdown', 'html', 'text', 'screenshot-visible', 'screenshot-fullpage'].includes(f)
                 );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/api/record.ts` around lines 710 - 712, The filter currently
narrows requestedFormats into formats but omits the 'text' format; update the
predicate in the requestedFormats.filter call so the allowed list includes
'text' and the type guard becomes (f): f is 'markdown' | 'html' | 'text' |
'screenshot-visible' | 'screenshot-fullpage' =>
['markdown','html','text','screenshot-visible','screenshot-fullpage'].includes(f)
so that formats correctly contains 'text' entries from requestedFormats.

868-871: ⚠️ Potential issue | 🟠 Major

Include text in scrape completion webhooks.

The new text output is saved to the run, but webhook payloads still only include markdown/html/screenshots, so webhook consumers cannot receive the requested text format.

🐛 Proposed fix
                 if (serializableOutput.markdown) webhookPayload.markdown = markdown;
                 if (serializableOutput.html) webhookPayload.html = html;
+                if (serializableOutput.text) webhookPayload.text = text;
                 if (uploadedBinaryOutput['screenshot-visible']) webhookPayload.screenshot_visible = uploadedBinaryOutput['screenshot-visible'];
                 if (uploadedBinaryOutput['screenshot-fullpage']) webhookPayload.screenshot_fullpage = uploadedBinaryOutput['screenshot-fullpage'];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/api/record.ts` around lines 868 - 871, Webhook payloads omit the
new text output even though it's saved to the run; update the block that builds
webhookPayload (around serializableOutput and uploadedBinaryOutput) to include
the text field when present by checking serializableOutput.text (or the local
variable text if defined) and setting webhookPayload.text =
serializableOutput.text (or webhookPayload.text = text), analogous to how
markdown/html are added; touch the same function that references
serializableOutput, webhookPayload, and uploadedBinaryOutput to ensure consumers
receive the requested text format.
server/src/pgboss-worker.ts (1)

353-356: ⚠️ Potential issue | 🟠 Major

Include text in PgBoss scrape webhooks.

The worker stores serializableOutput.text, but the completion webhook only forwards markdown/html/screenshots, leaving text subscribers without the requested output.

🐛 Proposed fix
             if (formats.includes('markdown')) webhookPayload.markdown = markdown;
             if (formats.includes('html')) webhookPayload.html = html;
+            if (serializableOutput.text) webhookPayload.text = text;
             if (uploadedBinaryOutput['screenshot-visible']) webhookPayload.screenshot_visible = uploadedBinaryOutput['screenshot-visible'];
             if (uploadedBinaryOutput['screenshot-fullpage']) webhookPayload.screenshot_fullpage = uploadedBinaryOutput['screenshot-fullpage'];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/pgboss-worker.ts` around lines 353 - 356, The webhook payload
omits serializableOutput.text so subscribers requesting "text" never receive it;
update the PgBoss worker where webhookPayload is assembled (look for
variables/formats, webhookPayload, uploadedBinaryOutput, serializableOutput) to
include the plain text output when present — e.g., if formats.includes('text')
(or if serializableOutput.text exists) set webhookPayload.text =
serializableOutput.text so text subscribers receive the stored
serializableOutput.text.
server/src/workflow-management/scheduler/index.ts (1)

386-389: ⚠️ Potential issue | 🟠 Major

Include text in scheduled scrape webhooks.

Scheduled scrape runs now persist serializableOutput.text, but run_completed webhooks still drop the text payload.

🐛 Proposed fix
         if (formats.includes('markdown')) webhookPayload.markdown = markdown;
         if (formats.includes('html')) webhookPayload.html = html;
+        if (serializableOutput.text) webhookPayload.text = text;
         if (uploadedBinaryOutput['screenshot-visible']) webhookPayload.screenshot_visible = uploadedBinaryOutput['screenshot-visible'];
         if (uploadedBinaryOutput['screenshot-fullpage']) webhookPayload.screenshot_fullpage = uploadedBinaryOutput['screenshot-fullpage'];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/workflow-management/scheduler/index.ts` around lines 386 - 389,
The webhook payload builder is missing the stored text output so scheduled
run_completed webhooks drop serializableOutput.text; update the payload in the
same block that sets markdown/html/uploadedBinaryOutput by adding a conditional
that checks formats.includes('text') and sets webhookPayload.text =
serializableOutput.text (use the existing variables formats, webhookPayload, and
serializableOutput to locate the code inside the function that currently assigns
markdown, html, and uploadedBinaryOutput fields).
🧹 Nitpick comments (2)
src/components/robot/pages/RobotCreate.tsx (1)

763-813: Prefer the shared output-format constants for the Scrape selector.

The new text option is wired correctly, but this block duplicates OUTPUT_FORMAT_LABELS/menu entries while the component already imports the shared constants. Mapping OUTPUT_FORMAT_OPTIONS here avoids future drift.

♻️ Proposed refactor
-                        const OUTPUT_FORMAT_LABELS: Record<string, string> = {
-                          markdown: 'Markdown',
-                          html: 'HTML',
-                          text: 'Text Content',
-                          'screenshot-visible': 'Screenshot (Visible)',
-                          'screenshot-fullpage': 'Screenshot (Full Page)',
-                        };
-
                         const labels = selected.map(
                           (value) => OUTPUT_FORMAT_LABELS[value] ?? value
                         );
@@
-                      <MenuItem value="markdown">
-                        <Checkbox checked={outputFormats.includes('markdown')} />
-                        Markdown
-                      </MenuItem>
-                      <MenuItem value="html">
-                        <Checkbox checked={outputFormats.includes('html')} />
-                        HTML
-                      </MenuItem>
-                      <MenuItem value="text">
-                        <Checkbox checked={outputFormats.includes('text')} />
-                        Text Content
-                      </MenuItem>
-                      <MenuItem value="screenshot-visible">
-                        <Checkbox checked={outputFormats.includes('screenshot-visible')} />
-                        Screenshot - Visible Viewport
-                      </MenuItem>
-                      <MenuItem value="screenshot-fullpage">
-                        <Checkbox checked={outputFormats.includes('screenshot-fullpage')} />
-                        Screenshot - Full Page
-                      </MenuItem>
+                      {OUTPUT_FORMAT_OPTIONS.map((format) => (
+                        <MenuItem key={format} value={format}>
+                          <Checkbox checked={outputFormats.includes(format)} />
+                          {OUTPUT_FORMAT_LABELS[format]}
+                        </MenuItem>
+                      ))}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/components/robot/pages/RobotCreate.tsx` around lines 763 - 813, The block
duplicates OUTPUT_FORMAT_LABELS and hardcoded MenuItem entries; replace the
local OUTPUT_FORMAT_LABELS and the hardcoded MenuItem list with the shared
constants by using the imported OUTPUT_FORMAT_OPTIONS (or the module-provided
mapping) to derive labels and to iterate to render MenuItem/Checkbox options;
update the label lookup in the selected->labels mapping to use
OUTPUT_FORMAT_OPTIONS (or its label property) and render MenuItem components
dynamically from OUTPUT_FORMAT_OPTIONS so the "text" option and any future
options stay in sync with the shared source.
src/components/run/RunContent.tsx (1)

1180-1190: Reuse the existing downloadText helper.

This inline blob/download code duplicates the helper already defined in this component.

♻️ Proposed refactor
-                        onClick={() => {
-                          const blob = new Blob([textContent], { type: 'text/plain;charset=utf-8;' });
-                          const url = URL.createObjectURL(blob);
-                          const a = document.createElement('a');
-                          a.href = url;
-                          a.download = 'output.txt';
-                          document.body.appendChild(a);
-                          a.click();
-                          document.body.removeChild(a);
-                          setTimeout(() => URL.revokeObjectURL(url), 100);
-                        }}
+                        onClick={() => downloadText(textContent, 'output.txt')}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/components/run/RunContent.tsx` around lines 1180 - 1190, The onClick
handler in the RunContent component duplicates blob/download logic; replace that
inline blob creation and manual anchor logic with a call to the existing
downloadText helper (use downloadText(textContent, 'output.txt') or the helper's
actual signature) in the onClick for the download button, removing the manual
URL.createObjectURL / revoke and DOM anchor creation so the component reuses the
centralized downloadText function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@server/src/api/record.ts`:
- Around line 345-380: formatRunResponse (the code creating formattedRun from
run.serializableOutput) ignores serializableOutput.text so persisted text is
never surfaced; update the mapping logic to copy serializableOutput.text into
formattedRun.data (e.g., set formattedRun.data.text or
formattedRun.data.textData depending on the intended field), similar to how
markdown/html are handled, and guard with the same type checks used for other
outputs (check run.serializableOutput.text exists and is an object/string as
appropriate before assigning).
- Around line 710-712: The filter currently narrows requestedFormats into
formats but omits the 'text' format; update the predicate in the
requestedFormats.filter call so the allowed list includes 'text' and the type
guard becomes (f): f is 'markdown' | 'html' | 'text' | 'screenshot-visible' |
'screenshot-fullpage' =>
['markdown','html','text','screenshot-visible','screenshot-fullpage'].includes(f)
so that formats correctly contains 'text' entries from requestedFormats.
- Around line 868-871: Webhook payloads omit the new text output even though
it's saved to the run; update the block that builds webhookPayload (around
serializableOutput and uploadedBinaryOutput) to include the text field when
present by checking serializableOutput.text (or the local variable text if
defined) and setting webhookPayload.text = serializableOutput.text (or
webhookPayload.text = text), analogous to how markdown/html are added; touch the
same function that references serializableOutput, webhookPayload, and
uploadedBinaryOutput to ensure consumers receive the requested text format.

In `@server/src/pgboss-worker.ts`:
- Around line 353-356: The webhook payload omits serializableOutput.text so
subscribers requesting "text" never receive it; update the PgBoss worker where
webhookPayload is assembled (look for variables/formats, webhookPayload,
uploadedBinaryOutput, serializableOutput) to include the plain text output when
present — e.g., if formats.includes('text') (or if serializableOutput.text
exists) set webhookPayload.text = serializableOutput.text so text subscribers
receive the stored serializableOutput.text.

In `@server/src/workflow-management/scheduler/index.ts`:
- Around line 386-389: The webhook payload builder is missing the stored text
output so scheduled run_completed webhooks drop serializableOutput.text; update
the payload in the same block that sets markdown/html/uploadedBinaryOutput by
adding a conditional that checks formats.includes('text') and sets
webhookPayload.text = serializableOutput.text (use the existing variables
formats, webhookPayload, and serializableOutput to locate the code inside the
function that currently assigns markdown, html, and uploadedBinaryOutput
fields).

---

Nitpick comments:
In `@src/components/robot/pages/RobotCreate.tsx`:
- Around line 763-813: The block duplicates OUTPUT_FORMAT_LABELS and hardcoded
MenuItem entries; replace the local OUTPUT_FORMAT_LABELS and the hardcoded
MenuItem list with the shared constants by using the imported
OUTPUT_FORMAT_OPTIONS (or the module-provided mapping) to derive labels and to
iterate to render MenuItem/Checkbox options; update the label lookup in the
selected->labels mapping to use OUTPUT_FORMAT_OPTIONS (or its label property)
and render MenuItem components dynamically from OUTPUT_FORMAT_OPTIONS so the
"text" option and any future options stay in sync with the shared source.

In `@src/components/run/RunContent.tsx`:
- Around line 1180-1190: The onClick handler in the RunContent component
duplicates blob/download logic; replace that inline blob creation and manual
anchor logic with a call to the existing downloadText helper (use
downloadText(textContent, 'output.txt') or the helper's actual signature) in the
onClick for the download button, removing the manual URL.createObjectURL /
revoke and DOM anchor creation so the component reuses the centralized
downloadText function.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e095aabc-ed60-437c-8641-fcd62b456074

📥 Commits

Reviewing files that changed from the base of the PR and between ca1f2ff and 61a78a2.

📒 Files selected for processing (7)
  • server/src/api/record.ts
  • server/src/constants/output-formats.ts
  • server/src/markdownify/scrape.ts
  • server/src/pgboss-worker.ts
  • server/src/workflow-management/scheduler/index.ts
  • src/components/robot/pages/RobotCreate.tsx
  • src/components/run/RunContent.tsx

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/src/markdownify/scrape.ts`:
- Around line 248-265: convertPageToText is missing a stability wait which can
yield incomplete text on JS-heavy pages; after the existing await
gotoWithFallback(page, url) call in convertPageToText, insert await
waitForStability(page) (the same helper used by convertPageToMarkdown and
convertPageToHTML) before the page.evaluate(...) text extraction so dynamic/lazy
content can settle; keep existing try/catch and logger behavior unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7cd62977-0d20-4653-9c83-8a494ad60bd6

📥 Commits

Reviewing files that changed from the base of the PR and between 61a78a2 and c86331b.

📒 Files selected for processing (4)
  • server/src/api/record.ts
  • server/src/markdownify/scrape.ts
  • server/src/pgboss-worker.ts
  • server/src/workflow-management/scheduler/index.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • server/src/workflow-management/scheduler/index.ts
  • server/src/pgboss-worker.ts
  • server/src/api/record.ts

Comment on lines +248 to +265
export async function convertPageToText(url: string, page: Page): Promise<string> {
try {
logger.log('info', `[Scrape] Using existing page instance for text conversion of ${url}`);

await gotoWithFallback(page, url);

const text = await page.evaluate(() => {
const body = document.body;
if (!body) return '';
return body.innerText || body.textContent || '';
});

return text ? text.trim() : '';
} catch (error: any) {
logger.error(`[Scrape] Error during text conversion: ${error.message}`);
throw error;
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing waitForStability(page) call before text extraction.

Both convertPageToMarkdown (line 59) and convertPageToHTML (line 165) call waitForStability(page) after gotoWithFallback to allow dynamic content to settle before extracting. convertPageToText skips this step, so text scraped from JS-heavy / lazy-loaded pages may be incomplete or empty — especially since document.body.innerText depends on rendered layout. Given this helper is invoked under a Promise.race with SCRAPE_TIMEOUT in the callers, results could be inconsistent vs. the other formats on the same page.

🔧 Proposed fix
     await gotoWithFallback(page, url);
+    await waitForStability(page);

     const text = await page.evaluate(() => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/markdownify/scrape.ts` around lines 248 - 265, convertPageToText
is missing a stability wait which can yield incomplete text on JS-heavy pages;
after the existing await gotoWithFallback(page, url) call in convertPageToText,
insert await waitForStability(page) (the same helper used by
convertPageToMarkdown and convertPageToHTML) before the page.evaluate(...) text
extraction so dynamic/lazy content can settle; keep existing try/catch and
logger behavior unchanged.

@amhsirak amhsirak merged commit f83c379 into develop Apr 20, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants