Conversation
WalkthroughAdds plain-text extraction to scrape runs: new Changes
Sequence Diagram(s)sequenceDiagram
participant UI as User / UI
participant API as API (record.ts)
participant Scheduler as Scheduler
participant Worker as Worker / pgboss
participant Scraper as Playwright / markdownify
participant Store as Storage
UI->>API: Create/run scrape (formats includes "text")
API->>Scheduler: Enqueue/run job
Scheduler->>Worker: Dispatch run
Worker->>Scraper: convertPageToText(url, page)
Scraper-->>Worker: text (or timeout/error)
Worker->>Store: persist serializableOutput.text = [{ content: text }]
Worker-->>Scheduler: job complete
Scheduler-->>API: update run status
API-->>UI: return run results (includes text)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
server/src/api/record.ts (3)
345-380:⚠️ Potential issue | 🟠 MajorExpose text output in formatted API run responses.
serializableOutput.textis persisted, butformatRunResponseonly copies markdown/html intorun.data; GET/POST API responses will omit text content.🐛 Proposed fix
data: { textData: {}, listData: {}, crawlData: {}, searchData: {}, markdown: '', - html: '' + html: '', + text: '' }, @@ if (output.html && Array.isArray(output.html)) { formattedRun.data.html = output.html[0]?.content || ''; } + + if (output.text && Array.isArray(output.text)) { + formattedRun.data.text = output.text[0]?.content || ''; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/src/api/record.ts` around lines 345 - 380, formatRunResponse (the code creating formattedRun from run.serializableOutput) ignores serializableOutput.text so persisted text is never surfaced; update the mapping logic to copy serializableOutput.text into formattedRun.data (e.g., set formattedRun.data.text or formattedRun.data.textData depending on the intended field), similar to how markdown/html are handled, and guard with the same type checks used for other outputs (check run.serializableOutput.text exists and is an object/string as appropriate before assigning).
710-712:⚠️ Potential issue | 🟠 MajorAllow
textin API format overrides.Line 710 filters caller-provided
requestedFormatsagainst a list that still excludestext, so{ "formats": ["text"] }silently runs with no requested output.🐛 Proposed fix
- formats = requestedFormats.filter((f): f is 'markdown' | 'html' | 'screenshot-visible' | 'screenshot-fullpage' => - ['markdown', 'html', 'screenshot-visible', 'screenshot-fullpage'].includes(f) + formats = requestedFormats.filter((f): f is 'markdown' | 'html' | 'text' | 'screenshot-visible' | 'screenshot-fullpage' => + ['markdown', 'html', 'text', 'screenshot-visible', 'screenshot-fullpage'].includes(f) );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/src/api/record.ts` around lines 710 - 712, The filter currently narrows requestedFormats into formats but omits the 'text' format; update the predicate in the requestedFormats.filter call so the allowed list includes 'text' and the type guard becomes (f): f is 'markdown' | 'html' | 'text' | 'screenshot-visible' | 'screenshot-fullpage' => ['markdown','html','text','screenshot-visible','screenshot-fullpage'].includes(f) so that formats correctly contains 'text' entries from requestedFormats.
868-871:⚠️ Potential issue | 🟠 MajorInclude text in scrape completion webhooks.
The new text output is saved to the run, but webhook payloads still only include markdown/html/screenshots, so webhook consumers cannot receive the requested
textformat.🐛 Proposed fix
if (serializableOutput.markdown) webhookPayload.markdown = markdown; if (serializableOutput.html) webhookPayload.html = html; + if (serializableOutput.text) webhookPayload.text = text; if (uploadedBinaryOutput['screenshot-visible']) webhookPayload.screenshot_visible = uploadedBinaryOutput['screenshot-visible']; if (uploadedBinaryOutput['screenshot-fullpage']) webhookPayload.screenshot_fullpage = uploadedBinaryOutput['screenshot-fullpage'];🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/src/api/record.ts` around lines 868 - 871, Webhook payloads omit the new text output even though it's saved to the run; update the block that builds webhookPayload (around serializableOutput and uploadedBinaryOutput) to include the text field when present by checking serializableOutput.text (or the local variable text if defined) and setting webhookPayload.text = serializableOutput.text (or webhookPayload.text = text), analogous to how markdown/html are added; touch the same function that references serializableOutput, webhookPayload, and uploadedBinaryOutput to ensure consumers receive the requested text format.server/src/pgboss-worker.ts (1)
353-356:⚠️ Potential issue | 🟠 MajorInclude text in PgBoss scrape webhooks.
The worker stores
serializableOutput.text, but the completion webhook only forwards markdown/html/screenshots, leavingtextsubscribers without the requested output.🐛 Proposed fix
if (formats.includes('markdown')) webhookPayload.markdown = markdown; if (formats.includes('html')) webhookPayload.html = html; + if (serializableOutput.text) webhookPayload.text = text; if (uploadedBinaryOutput['screenshot-visible']) webhookPayload.screenshot_visible = uploadedBinaryOutput['screenshot-visible']; if (uploadedBinaryOutput['screenshot-fullpage']) webhookPayload.screenshot_fullpage = uploadedBinaryOutput['screenshot-fullpage'];🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/src/pgboss-worker.ts` around lines 353 - 356, The webhook payload omits serializableOutput.text so subscribers requesting "text" never receive it; update the PgBoss worker where webhookPayload is assembled (look for variables/formats, webhookPayload, uploadedBinaryOutput, serializableOutput) to include the plain text output when present — e.g., if formats.includes('text') (or if serializableOutput.text exists) set webhookPayload.text = serializableOutput.text so text subscribers receive the stored serializableOutput.text.server/src/workflow-management/scheduler/index.ts (1)
386-389:⚠️ Potential issue | 🟠 MajorInclude text in scheduled scrape webhooks.
Scheduled scrape runs now persist
serializableOutput.text, butrun_completedwebhooks still drop the text payload.🐛 Proposed fix
if (formats.includes('markdown')) webhookPayload.markdown = markdown; if (formats.includes('html')) webhookPayload.html = html; + if (serializableOutput.text) webhookPayload.text = text; if (uploadedBinaryOutput['screenshot-visible']) webhookPayload.screenshot_visible = uploadedBinaryOutput['screenshot-visible']; if (uploadedBinaryOutput['screenshot-fullpage']) webhookPayload.screenshot_fullpage = uploadedBinaryOutput['screenshot-fullpage'];🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/src/workflow-management/scheduler/index.ts` around lines 386 - 389, The webhook payload builder is missing the stored text output so scheduled run_completed webhooks drop serializableOutput.text; update the payload in the same block that sets markdown/html/uploadedBinaryOutput by adding a conditional that checks formats.includes('text') and sets webhookPayload.text = serializableOutput.text (use the existing variables formats, webhookPayload, and serializableOutput to locate the code inside the function that currently assigns markdown, html, and uploadedBinaryOutput fields).
🧹 Nitpick comments (2)
src/components/robot/pages/RobotCreate.tsx (1)
763-813: Prefer the shared output-format constants for the Scrape selector.The new
textoption is wired correctly, but this block duplicatesOUTPUT_FORMAT_LABELS/menu entries while the component already imports the shared constants. MappingOUTPUT_FORMAT_OPTIONShere avoids future drift.♻️ Proposed refactor
- const OUTPUT_FORMAT_LABELS: Record<string, string> = { - markdown: 'Markdown', - html: 'HTML', - text: 'Text Content', - 'screenshot-visible': 'Screenshot (Visible)', - 'screenshot-fullpage': 'Screenshot (Full Page)', - }; - const labels = selected.map( (value) => OUTPUT_FORMAT_LABELS[value] ?? value ); @@ - <MenuItem value="markdown"> - <Checkbox checked={outputFormats.includes('markdown')} /> - Markdown - </MenuItem> - <MenuItem value="html"> - <Checkbox checked={outputFormats.includes('html')} /> - HTML - </MenuItem> - <MenuItem value="text"> - <Checkbox checked={outputFormats.includes('text')} /> - Text Content - </MenuItem> - <MenuItem value="screenshot-visible"> - <Checkbox checked={outputFormats.includes('screenshot-visible')} /> - Screenshot - Visible Viewport - </MenuItem> - <MenuItem value="screenshot-fullpage"> - <Checkbox checked={outputFormats.includes('screenshot-fullpage')} /> - Screenshot - Full Page - </MenuItem> + {OUTPUT_FORMAT_OPTIONS.map((format) => ( + <MenuItem key={format} value={format}> + <Checkbox checked={outputFormats.includes(format)} /> + {OUTPUT_FORMAT_LABELS[format]} + </MenuItem> + ))}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/components/robot/pages/RobotCreate.tsx` around lines 763 - 813, The block duplicates OUTPUT_FORMAT_LABELS and hardcoded MenuItem entries; replace the local OUTPUT_FORMAT_LABELS and the hardcoded MenuItem list with the shared constants by using the imported OUTPUT_FORMAT_OPTIONS (or the module-provided mapping) to derive labels and to iterate to render MenuItem/Checkbox options; update the label lookup in the selected->labels mapping to use OUTPUT_FORMAT_OPTIONS (or its label property) and render MenuItem components dynamically from OUTPUT_FORMAT_OPTIONS so the "text" option and any future options stay in sync with the shared source.src/components/run/RunContent.tsx (1)
1180-1190: Reuse the existingdownloadTexthelper.This inline blob/download code duplicates the helper already defined in this component.
♻️ Proposed refactor
- onClick={() => { - const blob = new Blob([textContent], { type: 'text/plain;charset=utf-8;' }); - const url = URL.createObjectURL(blob); - const a = document.createElement('a'); - a.href = url; - a.download = 'output.txt'; - document.body.appendChild(a); - a.click(); - document.body.removeChild(a); - setTimeout(() => URL.revokeObjectURL(url), 100); - }} + onClick={() => downloadText(textContent, 'output.txt')}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/components/run/RunContent.tsx` around lines 1180 - 1190, The onClick handler in the RunContent component duplicates blob/download logic; replace that inline blob creation and manual anchor logic with a call to the existing downloadText helper (use downloadText(textContent, 'output.txt') or the helper's actual signature) in the onClick for the download button, removing the manual URL.createObjectURL / revoke and DOM anchor creation so the component reuses the centralized downloadText function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@server/src/api/record.ts`:
- Around line 345-380: formatRunResponse (the code creating formattedRun from
run.serializableOutput) ignores serializableOutput.text so persisted text is
never surfaced; update the mapping logic to copy serializableOutput.text into
formattedRun.data (e.g., set formattedRun.data.text or
formattedRun.data.textData depending on the intended field), similar to how
markdown/html are handled, and guard with the same type checks used for other
outputs (check run.serializableOutput.text exists and is an object/string as
appropriate before assigning).
- Around line 710-712: The filter currently narrows requestedFormats into
formats but omits the 'text' format; update the predicate in the
requestedFormats.filter call so the allowed list includes 'text' and the type
guard becomes (f): f is 'markdown' | 'html' | 'text' | 'screenshot-visible' |
'screenshot-fullpage' =>
['markdown','html','text','screenshot-visible','screenshot-fullpage'].includes(f)
so that formats correctly contains 'text' entries from requestedFormats.
- Around line 868-871: Webhook payloads omit the new text output even though
it's saved to the run; update the block that builds webhookPayload (around
serializableOutput and uploadedBinaryOutput) to include the text field when
present by checking serializableOutput.text (or the local variable text if
defined) and setting webhookPayload.text = serializableOutput.text (or
webhookPayload.text = text), analogous to how markdown/html are added; touch the
same function that references serializableOutput, webhookPayload, and
uploadedBinaryOutput to ensure consumers receive the requested text format.
In `@server/src/pgboss-worker.ts`:
- Around line 353-356: The webhook payload omits serializableOutput.text so
subscribers requesting "text" never receive it; update the PgBoss worker where
webhookPayload is assembled (look for variables/formats, webhookPayload,
uploadedBinaryOutput, serializableOutput) to include the plain text output when
present — e.g., if formats.includes('text') (or if serializableOutput.text
exists) set webhookPayload.text = serializableOutput.text so text subscribers
receive the stored serializableOutput.text.
In `@server/src/workflow-management/scheduler/index.ts`:
- Around line 386-389: The webhook payload builder is missing the stored text
output so scheduled run_completed webhooks drop serializableOutput.text; update
the payload in the same block that sets markdown/html/uploadedBinaryOutput by
adding a conditional that checks formats.includes('text') and sets
webhookPayload.text = serializableOutput.text (use the existing variables
formats, webhookPayload, and serializableOutput to locate the code inside the
function that currently assigns markdown, html, and uploadedBinaryOutput
fields).
---
Nitpick comments:
In `@src/components/robot/pages/RobotCreate.tsx`:
- Around line 763-813: The block duplicates OUTPUT_FORMAT_LABELS and hardcoded
MenuItem entries; replace the local OUTPUT_FORMAT_LABELS and the hardcoded
MenuItem list with the shared constants by using the imported
OUTPUT_FORMAT_OPTIONS (or the module-provided mapping) to derive labels and to
iterate to render MenuItem/Checkbox options; update the label lookup in the
selected->labels mapping to use OUTPUT_FORMAT_OPTIONS (or its label property)
and render MenuItem components dynamically from OUTPUT_FORMAT_OPTIONS so the
"text" option and any future options stay in sync with the shared source.
In `@src/components/run/RunContent.tsx`:
- Around line 1180-1190: The onClick handler in the RunContent component
duplicates blob/download logic; replace that inline blob creation and manual
anchor logic with a call to the existing downloadText helper (use
downloadText(textContent, 'output.txt') or the helper's actual signature) in the
onClick for the download button, removing the manual URL.createObjectURL /
revoke and DOM anchor creation so the component reuses the centralized
downloadText function.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e095aabc-ed60-437c-8641-fcd62b456074
📒 Files selected for processing (7)
server/src/api/record.tsserver/src/constants/output-formats.tsserver/src/markdownify/scrape.tsserver/src/pgboss-worker.tsserver/src/workflow-management/scheduler/index.tssrc/components/robot/pages/RobotCreate.tsxsrc/components/run/RunContent.tsx
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/src/markdownify/scrape.ts`:
- Around line 248-265: convertPageToText is missing a stability wait which can
yield incomplete text on JS-heavy pages; after the existing await
gotoWithFallback(page, url) call in convertPageToText, insert await
waitForStability(page) (the same helper used by convertPageToMarkdown and
convertPageToHTML) before the page.evaluate(...) text extraction so dynamic/lazy
content can settle; keep existing try/catch and logger behavior unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7cd62977-0d20-4653-9c83-8a494ad60bd6
📒 Files selected for processing (4)
server/src/api/record.tsserver/src/markdownify/scrape.tsserver/src/pgboss-worker.tsserver/src/workflow-management/scheduler/index.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- server/src/workflow-management/scheduler/index.ts
- server/src/pgboss-worker.ts
- server/src/api/record.ts
| export async function convertPageToText(url: string, page: Page): Promise<string> { | ||
| try { | ||
| logger.log('info', `[Scrape] Using existing page instance for text conversion of ${url}`); | ||
|
|
||
| await gotoWithFallback(page, url); | ||
|
|
||
| const text = await page.evaluate(() => { | ||
| const body = document.body; | ||
| if (!body) return ''; | ||
| return body.innerText || body.textContent || ''; | ||
| }); | ||
|
|
||
| return text ? text.trim() : ''; | ||
| } catch (error: any) { | ||
| logger.error(`[Scrape] Error during text conversion: ${error.message}`); | ||
| throw error; | ||
| } | ||
| } |
There was a problem hiding this comment.
Missing waitForStability(page) call before text extraction.
Both convertPageToMarkdown (line 59) and convertPageToHTML (line 165) call waitForStability(page) after gotoWithFallback to allow dynamic content to settle before extracting. convertPageToText skips this step, so text scraped from JS-heavy / lazy-loaded pages may be incomplete or empty — especially since document.body.innerText depends on rendered layout. Given this helper is invoked under a Promise.race with SCRAPE_TIMEOUT in the callers, results could be inconsistent vs. the other formats on the same page.
🔧 Proposed fix
await gotoWithFallback(page, url);
+ await waitForStability(page);
const text = await page.evaluate(() => {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/src/markdownify/scrape.ts` around lines 248 - 265, convertPageToText
is missing a stability wait which can yield incomplete text on JS-heavy pages;
after the existing await gotoWithFallback(page, url) call in convertPageToText,
insert await waitForStability(page) (the same helper used by
convertPageToMarkdown and convertPageToHTML) before the page.evaluate(...) text
extraction so dynamic/lazy content can settle; keep existing try/catch and
logger behavior unchanged.
What this PR does?
Adds the text content format for scrape robot.
Summary by CodeRabbit