chore(infra): add separate browser service #906
Conversation
Walkthrough
Extracts the browser runtime into a standalone browser service, adds Docker/Compose artifacts for it, removes the Playwright browser install from the backend image, and updates server code to connect remotely via WebSocket/CDP while switching imports to playwright-core.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Server as Main Server
    participant Conn as browserConnection
    participant BrowserSvc as Browser Service
    participant CDP as Chromium/CDP

    Server->>Conn: checkBrowserServiceHealth()
    Conn->>BrowserSvc: GET /health
    BrowserSvc-->>Conn: { status, wsEndpoint }
    Conn-->>Server: health OK + wsEndpoint

    Server->>Conn: connectToRemoteBrowser()
    Conn->>BrowserSvc: open WebSocket to wsEndpoint
    activate BrowserSvc
    BrowserSvc->>CDP: start/attach CDP session
    CDP-->>BrowserSvc: session ready
    BrowserSvc-->>Conn: WS connected / browser handle
    deactivate BrowserSvc
    Conn-->>Server: Browser instance (via playwright-core)

    loop Page operations
        Server->>CDP: navigate / interact / capture
        CDP-->>Server: events / responses
    end

    Server->>Conn: close connection
    Conn->>BrowserSvc: close WS
```
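For orientation, here is a minimal sketch of the flow in the diagram above. It assumes a Node 18+ runtime with global fetch and the BROWSER_WS_HOST / BROWSER_HEALTH_PORT variables introduced in this PR; the /health payload shape is taken from the diagram, not from the final implementation.

```typescript
import { chromium, Browser } from "playwright-core";

// Sketch only: mirrors the diagram's health-check-then-connect sequence.
async function connectViaHealthCheck(): Promise<Browser> {
  const host = process.env.BROWSER_WS_HOST || "localhost";
  const healthPort = process.env.BROWSER_HEALTH_PORT || "3002";

  // 1. Ask the browser service whether it is ready and where its CDP endpoint lives.
  const res = await fetch(`http://${host}:${healthPort}/health`);
  if (!res.ok) throw new Error(`Browser service unhealthy: HTTP ${res.status}`);
  const { status, wsEndpoint } = (await res.json()) as { status: string; wsEndpoint: string };
  if (status !== "healthy" || !wsEndpoint) throw new Error("Browser service not ready");

  // 2. Open the WebSocket/CDP connection; page operations then go through this handle.
  return chromium.connect(wsEndpoint);
}
```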
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
server/src/markdownify/scrape.ts (2)
8-57: Missing error handling - browser may not be closed on failure.

If any operation fails after connecting to the browser (e.g., page.goto, page.evaluate), the browser connection will not be closed, potentially leaking resources on the remote browser service.

```diff
 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
-
-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
-
-  await page.addInitScript(() => {
-    // ... cleanup script
-  });
-
-  // Re-extract HTML after cleanup
-  const cleanedHtml = await page.evaluate(() => {
-    return document.documentElement.outerHTML;
-  });
-
-  await browser.close();
-
-  // Convert cleaned HTML → Markdown
-  const markdown = await parseMarkdown(cleanedHtml, url);
-  return markdown;
+  try {
+    const page = await browser.newPage();
+
+    await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+
+    await page.addInitScript(() => {
+      const selectors = [
+        "script", "style", "link[rel='stylesheet']", "noscript",
+        "meta", "svg", "img", "picture", "source", "video",
+        "audio", "iframe", "object", "embed"
+      ];
+      selectors.forEach(sel => {
+        document.querySelectorAll(sel).forEach(e => e.remove());
+      });
+      const all = document.querySelectorAll("*");
+      all.forEach(el => {
+        [...el.attributes].forEach(attr => {
+          if (attr.name.startsWith("on")) {
+            el.removeAttribute(attr.name);
+          }
+        });
+      });
+    });
+
+    const cleanedHtml = await page.evaluate(() => {
+      return document.documentElement.outerHTML;
+    });
+
+    const markdown = await parseMarkdown(cleanedHtml, url);
+    return markdown;
+  } finally {
+    await browser.close();
+  }
 }
```
63-111: Same issue: missing try-finally for browser cleanup in convertPageToHTML.

Apply the same try-finally pattern to ensure the browser is closed even when errors occur.

```diff
 export async function convertPageToHTML(url: string): Promise<string> {
   const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
-  // ... operations ...
-  await browser.close();
-  return cleanedHtml;
+  try {
+    const page = await browser.newPage();
+    // ... operations ...
+    return cleanedHtml;
+  } finally {
+    await browser.close();
+  }
 }
```

docker-compose.yml (1)
56-58: Backend should depend on the browser service.

The backend now requires the remote browser service for browser operations. Add browser to the depends_on list to ensure proper startup order and health verification.

```diff
     depends_on:
       - postgres
       - minio
+      - browser
```

server/src/routes/proxy.ts (1)
79-95: Proxy configuration is built but not applied - test is non-functional.

The proxyOptions object is constructed (lines 79-85) but never passed to the browser connection. The remote browser connects without proxy settings, so this endpoint no longer actually tests the user's proxy configuration.

The connectToRemoteBrowser() function doesn't support per-connection proxy options since the remote browser is a shared instance. Consider whether:
- The test endpoint should create a new context with proxy settings on the remote browser
- Or document that proxy testing requires a different approach with remote browsers

```diff
-  const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
+  const browser = await connectToRemoteBrowser();
+  const context = await browser.newContext({
+    proxy: proxyOptions.server ? proxyOptions : undefined
+  });
+  const page = await context.newPage();
   await page.goto('https://example.com');
+  await context.close();
   await browser.close();
```

Note: Verify that closing the browser here doesn't affect other connections to the shared remote browser service.
🧹 Nitpick comments (7)
ENVEXAMPLE (1)
42-45: Consider adding documentation for local development setup.

The BROWSER_WS_HOST=browser value assumes Docker service naming. For developers running locally without Docker, add a comment indicating the appropriate value (e.g., localhost).

```diff
 # WebSocket port for browser CDP connections
-BROWSER_WS_PORT=3001
-BROWSER_HEALTH_PORT=3002
-BROWSER_WS_HOST=browser
+BROWSER_WS_PORT=3001      # WebSocket port for browser CDP connections
+BROWSER_HEALTH_PORT=3002  # Health check endpoint port for browser service
+BROWSER_WS_HOST=browser   # Browser service host (use 'localhost' for local development without Docker)
```

maxun-core/src/interpret.ts (2)
147-147: Type cast explanation would improve maintainability.

The as any cast is necessary because @cliqz/adblocker-playwright types are built against the full playwright package. Consider adding a brief comment explaining this type incompatibility for future maintainers.

```diff
-    await this.blocker.enableBlockingInPage(page as any);
+    // Cast needed: @cliqz/adblocker-playwright expects playwright types, not playwright-core
+    await this.blocker.enableBlockingInPage(page as any);
```

157-157: Same type cast pattern as above.

Consider adding a similar comment here for consistency, or extract the cast logic to a helper if this pattern repeats elsewhere.
server/src/browser-management/browserConnection.ts (1)
18-41: Align retry logic and runtime assumptions for the browser service connection

Current behavior:
- getBrowserServiceEndpoint() is called once before the retry loop; if /health is temporarily unavailable or returns non-JSON, connectToRemoteBrowser fails immediately and the retry settings never apply.
- The retry loop only covers chromium.connect(wsEndpoint, { timeout }).
- Both getBrowserServiceEndpoint and checkBrowserServiceHealth rely on a global fetch being present at runtime.

Consider tightening this up by:
- Moving the /health call inside the retry loop, or adding a small retry wrapper around getBrowserServiceEndpoint, so transient startup issues in the browser service are handled the same way as transient CDP failures.
- Optionally sharing a small helper for constructing healthEndpoint to avoid divergence between getBrowserServiceEndpoint and checkBrowserServiceHealth.
- Ensuring the server runtime is Node 18+ (or that a fetch polyfill is loaded) so these calls are reliable in production.

Please confirm:
- That the Node version used to run this server exposes a global fetch, and
- Whether you want the /health lookup to participate in the same retry budget as the CDP connection. If yes, we can sketch a small refactor to unify that behavior; a minimal sketch of the retry-wrapper idea follows below.

Also applies to: 49-85, 91-110
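A minimal sketch of that unified retry, assuming the existing getBrowserServiceEndpoint helper and the retry values mentioned in this review (3 attempts, 2 s delay, 30 s timeout); this is illustrative only, not the implementation in the PR:

```typescript
import { chromium, Browser } from "playwright-core";

// Hypothetical stand-in for the existing helper that reads /health and returns the wsEndpoint.
declare function getBrowserServiceEndpoint(): Promise<string>;

// Sketch: put the /health lookup and chromium.connect() inside the same retry budget.
async function connectWithUnifiedRetry(maxRetries = 3, retryDelayMs = 2000): Promise<Browser> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      // Re-resolve the endpoint on every attempt so transient /health failures are retried too.
      const wsEndpoint = await getBrowserServiceEndpoint();
      return await chromium.connect(wsEndpoint, { timeout: 30_000 });
    } catch (error: any) {
      lastError = error;
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
  }
  throw new Error(`Remote connection failed after ${maxRetries} attempts: ${lastError?.message}`);
}
```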
server/src/workflow-management/scheduler/index.ts (1)
13-13: Scheduler's Playwright-core migration is fine; ensure destroyRemoteBrowser is idempotent

- Importing Page from "playwright-core" matches the rest of the migration and is only used as a type for the interpreter callback, so behavior is unchanged.
- In executeRun:
  - For "scrape" robots, destroyRemoteBrowser(plainRun.browserId, userId) is called inside the try/catch branch.
  - For regular robots, destroyRemoteBrowser is called near the end of the main path.
- In readyForRunHandler, on failure or thrown errors you again call destroyRemoteBrowser(browserId, userId).

If destroyRemoteBrowser is not explicitly idempotent, some failure cases (notably the scrape path when conversion fails) may trigger it twice for the same browserId. It's probably harmless, but worth confirming the implementation handles "already cleaned up" cases gracefully and just no-ops.

Please check the implementation of destroyRemoteBrowser and confirm it:
- Safely handles being called multiple times for the same browserId, and
- Logs at an appropriate level (e.g. not as hard errors) if the browser has already been cleaned up. If you want, we can adjust the scheduler to centralize the cleanup in one place; a sketch of an idempotent guard follows below.

Also applies to: 329-331, 368-371, 586-603
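For illustration, a minimal sketch of making that cleanup safe to call twice; the registry and logging shown here are assumptions about the surrounding code, not the actual implementation:

```typescript
// Sketch: no-op (and log at debug level) when the browser was already cleaned up.
const activeBrowsers = new Map<string, { close: () => Promise<void> }>(); // assumed registry

async function destroyRemoteBrowserIdempotent(browserId: string, userId: string): Promise<void> {
  const browser = activeBrowsers.get(browserId);
  if (!browser) {
    // Already cleaned up by an earlier call path (e.g. the scrape branch) - not an error.
    console.debug(`Browser ${browserId} for user ${userId} already destroyed; skipping`);
    return;
  }
  activeBrowsers.delete(browserId); // remove first so a concurrent second call no-ops
  await browser.close();
}
```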
browser/package.json (1)
11-15: Consider pinning all Playwright-related dependency versions for consistency.

The playwright version is pinned to 1.57.0, but playwright-extra and puppeteer-extra-plugin-stealth use caret ranges. This could lead to compatibility issues if playwright-extra updates its internal Playwright dependency to a different version.

```diff
   "dependencies": {
     "playwright": "1.57.0",
-    "playwright-extra": "^4.3.6",
-    "puppeteer-extra-plugin-stealth": "^2.11.2"
+    "playwright-extra": "4.3.6",
+    "puppeteer-extra-plugin-stealth": "2.11.2"
   },
```
11-13: Add validation for port environment variables.

If BROWSER_WS_PORT or BROWSER_HEALTH_PORT contain invalid values, parseInt will return NaN, causing silent failures.

```diff
-const BROWSER_WS_PORT = parseInt(process.env.BROWSER_WS_PORT || '3001', 10);
-const BROWSER_HEALTH_PORT = parseInt(process.env.BROWSER_HEALTH_PORT || '3002', 10);
+const BROWSER_WS_PORT = parseInt(process.env.BROWSER_WS_PORT || '3001', 10);
+const BROWSER_HEALTH_PORT = parseInt(process.env.BROWSER_HEALTH_PORT || '3002', 10);
+
+if (isNaN(BROWSER_WS_PORT) || isNaN(BROWSER_HEALTH_PORT)) {
+  console.error('Invalid port configuration');
+  process.exit(1);
+}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- Dockerfile.backend (0 hunks)
- ENVEXAMPLE (1 hunks)
- browser/.dockerignore (1 hunks)
- browser/Dockerfile (1 hunks)
- browser/package.json (1 hunks)
- browser/server.ts (1 hunks)
- browser/tsconfig.json (1 hunks)
- docker-compose.yml (1 hunks)
- maxun-core/package.json (1 hunks)
- maxun-core/src/interpret.ts (3 hunks)
- maxun-core/src/types/workflow.ts (1 hunks)
- package.json (2 hunks)
- server/src/api/record.ts (0 hunks)
- server/src/browser-management/browserConnection.ts (1 hunks)
- server/src/browser-management/classes/RemoteBrowser.ts (4 hunks)
- server/src/markdownify/scrape.ts (2 hunks)
- server/src/pgboss-worker.ts (1 hunks)
- server/src/routes/proxy.ts (2 hunks)
- server/src/routes/record.ts (0 hunks)
- server/src/routes/storage.ts (0 hunks)
- server/src/types/index.ts (1 hunks)
- server/src/workflow-management/classes/Generator.ts (1 hunks)
- server/src/workflow-management/classes/Interpreter.ts (1 hunks)
- server/src/workflow-management/scheduler/index.ts (1 hunks)
- server/src/workflow-management/selector.ts (1 hunks)
💤 Files with no reviewable changes (4)
- Dockerfile.backend
- server/src/api/record.ts
- server/src/routes/record.ts
- server/src/routes/storage.ts
🧰 Additional context used
🧬 Code graph analysis (3)
server/src/markdownify/scrape.ts (1)
server/src/browser-management/browserConnection.ts (1)
connectToRemoteBrowser(49-85)
server/src/routes/proxy.ts (1)
server/src/browser-management/browserConnection.ts (1)
connectToRemoteBrowser(49-85)
server/src/browser-management/classes/RemoteBrowser.ts (1)
server/src/browser-management/browserConnection.ts (1)
connectToRemoteBrowser(49-85)
🔇 Additional comments (17)
maxun-core/src/types/workflow.ts (1)
1-1: Import migration to playwright-core looks correct.

The Page type from playwright-core has the same interface as playwright, so this change maintains type compatibility while aligning with the remote browser service architecture.

maxun-core/src/interpret.ts (1)
2-2: Import migration to playwright-core is consistent with the PR objectives.

server/src/pgboss-worker.ts (1)

16-16: Import migration is consistent with the codebase-wide change to playwright-core.

server/src/workflow-management/classes/Interpreter.ts (1)

4-4: Import migration to playwright-core is consistent with the PR's browser service architecture.

server/src/types/index.ts (1)

1-1: Type-only Playwright import migration looks good

Switching BrowserType/LaunchOptions to playwright-core keeps this file type-only and aligned with the rest of the migration. No functional impact here.

Please double-check other remaining imports in the repo to ensure everything consistently pulls types from playwright-core, not playwright.

package.json (1)
63-63: Playwright-core root dependency is consistent; keep versions in sync

Adding "playwright-core": "1.57.0" at the root matches the core package version and aligns with the import changes across the server. Just be sure the browser service image uses a compatible Playwright version so there are no protocol mismatches.

Please confirm that:
- maxun-core/package.json and the browser service package.json also pin playwright-core (or playwright) to a compatible 1.57.x range, and
- no remaining code imports from "playwright" instead of "playwright-core".

maxun-core/package.json (1)
33-39: Core package dependency updates look coherent

Adding @types/node, playwright-core and turndown here aligns with the new browser architecture and markdown/HTML conversion needs. Versioning is consistent with the root manifest.

Please verify against your publish pipeline that:
- maxun-core is built against Node 18+ (to match the @types/node target), and
- consumers of maxun-core don't themselves hard-depend on "playwright" in a way that conflicts with this new playwright-core dependency.

server/src/workflow-management/selector.ts (1)
1-1: Page type import migration is consistent with the new Playwright-core setup

Using Page from "playwright-core" here matches the rest of the migration and keeps this file type-only with respect to Playwright. The runtime still depends on the actual Page instances supplied by your remote browser layer.

Please ensure the interpreter/remote browser classes that pass Page into these functions are also typed against playwright-core to avoid having multiple incompatible Page types in play.

server/src/workflow-management/classes/Generator.ts (1)
5-5: Generator now correctly depends on Page from Playwright-core

Updating the Page type import here keeps the generator aligned with the rest of the Playwright-core migration and the remote browser architecture. All method signatures remain the same.

Please verify that the RemoteBrowser/interpreter layer which calls into WorkflowGenerator also uses Page from "playwright-core", so TypeScript doesn't end up with two structurally-identical but distinct Page types.

browser/.dockerignore (1)
1-9: Re-evaluate ignoring *.ts, tsconfig.json, and dist in the browser build context

For a TypeScript-based browser service, this combination requires verification:
- *.ts and tsconfig.json are excluded from the Docker build context.
- dist is also excluded.

If the Dockerfile runs build commands like npm run build or tsc inside the container, the image will have neither source files nor compiled output. Conversely, if the build happens outside Docker, excluding the compiled output directory may be incorrect.

To resolve this:
- Verify whether browser/Dockerfile invokes build commands during image construction.
- Confirm where the compiled JavaScript should reside (dist, build, or another directory).
- Adjust .dockerignore so that the necessary files for your build strategy are available in the Docker build context.
browser/Dockerfile (1)
1-30: LGTM! Good Dockerfile structure with proper layer caching.

The Dockerfile correctly uses the matching Playwright base image version (v1.57.0), employs npm ci for reproducible builds, and properly sequences the build steps to optimize Docker layer caching (copying package files before source code).
1-24: LGTM!

The TypeScript configuration is appropriate for the browser service with strict mode enabled and proper module resolution settings.
docker-compose.yml (1)
77-111: LGTM! Well-configured browser service with appropriate resource limits and healthcheck.

The configuration properly includes memory limits, shared memory size for Chromium, and necessary security options. The healthcheck is correctly configured to verify the service is ready.
browser/server.ts (1)
24-37: Security-sensitive browser flags warrant review.

Several flags have security implications:
- --disable-web-security: Disables same-origin policy
- --ignore-certificate-errors: Accepts invalid/self-signed certificates
- --no-sandbox: Required in containers but reduces isolation

These are common for automation but could expose scraped content to security risks. Ensure these tradeoffs are acceptable for your use case.
server/src/browser-management/classes/RemoteBrowser.ts (3)
565-575: LGTM! Good migration to remote browser connection with retry logic.

The initialization properly connects to the remote browser service with retry handling and maintains the existing context setup logic.

665-675: Good fallback handling for ad blocker initialization.

The try-catch ensures the CDP session is established even if the ad blocker setup fails, preventing initialization failures from blocking the entire browser setup.

The any casts on lines 667 and 669 work around type incompatibility between playwright-core's Page and the adblocker library's expected types. This is acceptable given the library compatibility constraints.

685-692: No action needed - browser.close() behavior is correct for remote connections.

Playwright's browser.close() method behaves differently for local vs. remote connections. When connecting to a remote browser via chromium.connect() (as done in connectToRemoteBrowser()), calling browser.close() only disconnects that client's session; it does not terminate the shared remote browser instance. The code correctly handles cleanup by disconnecting the client connection and nullifying the reference.

Likely an incorrect or invalid review comment.
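For reference, a small hedged sketch of that launch-vs-connect distinction; whether the service keeps running after a client disconnect ultimately depends on how the browser service launches Chromium, so treat this as illustrative:

```typescript
import { chromium } from "playwright-core";

declare const wsEndpoint: string; // assumed: obtained from the browser service's /health response

async function demonstrateCloseSemantics(): Promise<void> {
  // Each caller gets its own client connection to the shared browser service.
  const browser = await chromium.connect(wsEndpoint);
  const page = await browser.newPage();
  await page.goto("https://example.com");

  // For a browser obtained via connect() (rather than launch()), close() tears down
  // this client's contexts and disconnects; the remote browser process keeps serving
  // other clients.
  await browser.close();
}
```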
```typescript
// Graceful shutdown
async function shutdown(): Promise<void> {
  console.log('Shutting down browser service...');
  if (browserServer) {
    try {
      await browserServer.close();
      console.log('Browser server closed');
    } catch (error) {
      console.error('Error closing browser server:', error);
    }
  }
  process.exit(0);
}
```
Health server is not closed during shutdown.
The healthServer HTTP server is created but not tracked or closed in the shutdown() function. This could cause the process to hang during graceful shutdown.
```diff
+let healthServer: http.Server | null = null;
+
 async function start(): Promise<void> {
   // ... existing code ...
-  const healthServer = http.createServer((req, res) => {
+  healthServer = http.createServer((req, res) => {
     // ... handlers ...
   });
   // ...
 }

 async function shutdown(): Promise<void> {
   console.log('Shutting down browser service...');
+  if (healthServer) {
+    healthServer.close();
+    console.log('Health server closed');
+  }
   if (browserServer) {
     // ... existing code ...
   }
   process.exit(0);
 }
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In browser/server.ts around lines 74 to 86, the shutdown() function closes
browserServer but never closes the healthServer HTTP server created elsewhere,
which can leave the process hanging; update shutdown() to check if healthServer
exists, call await healthServer.close() inside the try/catch (log success and
errors similarly to browserServer), and ensure any references are cleaned up
(e.g., set healthServer = undefined) before calling process.exit(0); also make
sure healthServer is in module scope so shutdown() can access it.
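A minimal sketch of what that could look like, assuming healthServer is hoisted to module scope as the prompt describes; this is illustrative, not the committed fix:

```typescript
import http from 'http';
import type { BrowserServer } from 'playwright-core';

let browserServer: BrowserServer | null = null;
let healthServer: http.Server | null = null;

async function shutdown(): Promise<void> {
  console.log('Shutting down browser service...');
  if (healthServer) {
    try {
      // http.Server.close() stops accepting new connections; existing ones finish first.
      await new Promise<void>((resolve, reject) =>
        healthServer!.close(err => (err ? reject(err) : resolve()))
      );
      console.log('Health server closed');
    } catch (error) {
      console.error('Error closing health server:', error);
    }
    healthServer = null;
  }
  if (browserServer) {
    try {
      await browserServer.close();
      console.log('Browser server closed');
    } catch (error) {
      console.error('Error closing browser server:', error);
    }
    browserServer = null;
  }
  process.exit(0);
}
```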
Actionable comments posted: 2
🧹 Nitpick comments (1)
server/src/browser-management/browserConnection.ts (1)
136-156: Consider extracting health endpoint URL construction to reduce duplication.

The health endpoint URL construction (lines 138-140) duplicates lines 19-21 in getBrowserServiceEndpoint. Additionally, the same response.ok check before response.json() should be added here for consistency.

```diff
+function getHealthEndpoint(): string {
+  const healthPort = process.env.BROWSER_HEALTH_PORT || '3002';
+  const healthHost = process.env.BROWSER_WS_HOST || 'localhost';
+  return `http://${healthHost}:${healthPort}/health`;
+}
```

Then use getHealthEndpoint() in both functions and add the response status check:

```diff
 export async function checkBrowserServiceHealth(): Promise<boolean> {
   try {
-    const healthPort = process.env.BROWSER_HEALTH_PORT || '3002';
-    const healthHost = process.env.BROWSER_WS_HOST || 'localhost';
-    const healthEndpoint = `http://${healthHost}:${healthPort}/health`;
+    const healthEndpoint = getHealthEndpoint();
     const response = await fetch(healthEndpoint);
+    if (!response.ok) {
+      logger.warn(`Health check returned status ${response.status}`);
+      return false;
+    }
     const data = await response.json();
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- server/src/browser-management/browserConnection.ts (1 hunks)
🔇 Additional comments (2)
server/src/browser-management/browserConnection.ts (2)
1-12: LGTM!

Imports and configuration constants are appropriate. The retry configuration with 3 attempts, 2-second delay, and 30-second timeout provides reasonable resilience for network-based browser connections.

54-66: Security flags are aggressive - ensure this is intentional.

Several flags have security implications:
- --disable-web-security: Disables same-origin policy
- --ignore-certificate-errors: Bypasses SSL validation
- --no-sandbox: Required in containers but reduces isolation

These are common for scraping/automation in controlled environments. If this service processes untrusted URLs or content, consider whether all flags are necessary, or add a comment documenting why they're required.
```typescript
try {
  logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
  const response = await fetch(healthEndpoint);
  const data = await response.json();

  if (data.status === 'healthy' && data.wsEndpoint) {
    logger.debug(`Got WebSocket endpoint: ${data.wsEndpoint}`);
    return data.wsEndpoint;
  }
```
Add response.ok check before parsing JSON.
If the health endpoint returns a non-2xx status (e.g., 500, 404), calling response.json() may fail with a confusing parse error or succeed with unexpected error payload. Check the response status first for clearer error handling.
```diff
   try {
     logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
     const response = await fetch(healthEndpoint);
+
+    if (!response.ok) {
+      throw new Error(`Health endpoint returned status ${response.status}`);
+    }
+
     const data = await response.json();

     if (data.status === 'healthy' && data.wsEndpoint) {
```
if (data.status === 'healthy' && data.wsEndpoint) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
try {
  logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
  const response = await fetch(healthEndpoint);

  if (!response.ok) {
    throw new Error(`Health endpoint returned status ${response.status}`);
  }

  const data = await response.json();

  if (data.status === 'healthy' && data.wsEndpoint) {
    logger.debug(`Got WebSocket endpoint: ${data.wsEndpoint}`);
    return data.wsEndpoint;
  }
```
🤖 Prompt for AI Agents
In server/src/browser-management/browserConnection.ts around lines 23 to 31, the
code calls response.json() without verifying the HTTP status; add a response.ok
check immediately after awaiting fetch and before parsing JSON, and handle
non-2xx responses by logging/throwing a clear error (include status and
statusText and optionally response.text()) instead of attempting to parse a
likely error payload; keep the existing healthy/wsEndpoint checks for successful
responses.
```typescript
for (let attempt = 1; attempt <= maxRetries; attempt++) {
  try {
    logger.debug(`Connection attempt ${attempt}/${maxRetries}`);

    const browser = await chromium.connect(wsEndpoint, {
      timeout: CONNECTION_CONFIG.connectionTimeout,
    });

    logger.info('Successfully connected to browser service');
    return browser;
  } catch (error: any) {
    logger.warn(
      `Connection attempt ${attempt}/${maxRetries} failed: ${error.message}`
    );

    if (attempt === maxRetries) {
      logger.error(
        `Failed to connect to browser service after ${maxRetries} attempts`
      );
      throw new Error(`Remote connection failed: ${error.message}`);
    }

    logger.debug(`Waiting ${CONNECTION_CONFIG.retryDelay}ms before retry...`);
    await new Promise(resolve => setTimeout(resolve, CONNECTION_CONFIG.retryDelay));
  }
}

throw new Error('Failed to connect to browser service');
```
Remove unreachable code at line 123.
The throw statement at line 123 is unreachable. The for-loop either returns on success (line 105) or throws when attempt === maxRetries (line 115). This line can never execute.
```diff
       logger.debug(`Waiting ${CONNECTION_CONFIG.retryDelay}ms before retry...`);
       await new Promise(resolve => setTimeout(resolve, CONNECTION_CONFIG.retryDelay));
     }
   }
-
-  throw new Error('Failed to connect to browser service');
 } catch (error: any) {
   logger.warn(`Browser service connection failed: ${error.message}`);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
for (let attempt = 1; attempt <= maxRetries; attempt++) {
  try {
    logger.debug(`Connection attempt ${attempt}/${maxRetries}`);

    const browser = await chromium.connect(wsEndpoint, {
      timeout: CONNECTION_CONFIG.connectionTimeout,
    });

    logger.info('Successfully connected to browser service');
    return browser;
  } catch (error: any) {
    logger.warn(
      `Connection attempt ${attempt}/${maxRetries} failed: ${error.message}`
    );

    if (attempt === maxRetries) {
      logger.error(
        `Failed to connect to browser service after ${maxRetries} attempts`
      );
      throw new Error(`Remote connection failed: ${error.message}`);
    }

    logger.debug(`Waiting ${CONNECTION_CONFIG.retryDelay}ms before retry...`);
    await new Promise(resolve => setTimeout(resolve, CONNECTION_CONFIG.retryDelay));
  }
}
```
🤖 Prompt for AI Agents
In server/src/browser-management/browserConnection.ts around lines 96 to 123,
remove the unreachable final `throw new Error('Failed to connect to browser
service');` at line 123 — the for-loop either returns on success or throws when
attempt === maxRetries, so that trailing throw can be deleted; ensure no other
callers rely on that external throw (they'll still get the error thrown inside
the loop) and keep the rest of the retry logic intact.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/shared/types.ts (1)
2-2: Switch to import type for type-only imports

Since Locator is only used in type positions (ScreenshotSettings.mask?: Locator[]), use a type-only import:

```diff
-import { Locator } from "playwright-core";
+import type { Locator } from "playwright-core";
```

This clarifies intent and prevents accidental runtime dependencies, even under stricter TypeScript or bundler configurations.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- server/src/browser-management/inputHandlers.ts (1 hunks)
- src/shared/types.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- server/src/browser-management/inputHandlers.ts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/markdownify/scrape.ts (1)
5-18: Remove unused helper or integrate it into the conversion functions.

The gotoWithFallback function is defined but never called. Both convertPageToMarkdown and convertPageToHTML use page.goto directly without the fallback logic. Either remove this function or replace the direct page.goto calls with gotoWithFallback to benefit from the retry mechanism.
♻️ Duplicate comments (2)
server/src/markdownify/scrape.ts (2)
67-72: Critical: Undefined variable shouldCloseBrowser will cause runtime error.

The code references shouldCloseBrowser, which is not defined anywhere in the file. This will throw a ReferenceError at runtime.

Note: This code should be removed as part of the resource cleanup fix suggested above (lines 26-30).

126-131: Critical: Undefined variable shouldCloseBrowser will cause runtime error.

The code references shouldCloseBrowser, which is not defined anywhere in the file. This will throw a ReferenceError at runtime.

Note: This code should be removed as part of the resource cleanup fix suggested above (lines 85-89).
🧹 Nitpick comments (3)
package.json (1)
58-58: Clarify the addition of pkce-challenge dependency.

The addition of pkce-challenge appears unrelated to the stated PR objective of adding a separate browser service for CDP connections. If this is part of OAuth/authentication work, consider documenting it in the PR description.

server/src/markdownify/scrape.ts (2)

20-25: Update JSDoc to reflect signature change.

The JSDoc comment still mentions the existingPage parameter that was removed. Update the documentation to reflect the current signature.

Apply this diff:

```diff
 /**
  * Fetches a webpage, strips scripts/styles/images/etc,
  * returns clean Markdown using parser.
  * @param url - The URL to convert
- * @param existingPage - Optional existing Playwright page instance to reuse
  */
```

79-84: Update JSDoc to reflect signature change.

The JSDoc comment still mentions the existingPage parameter that was removed. Update the documentation to reflect the current signature.

Apply this diff:

```diff
 /**
  * Fetches a webpage, strips scripts/styles/images/etc,
  * returns clean HTML.
  * @param url - The URL to convert
- * @param existingPage - Optional existing Playwright page instance to reuse
  */
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- maxun-core/src/interpret.ts (3 hunks)
- package.json (2 hunks)
- server/src/api/record.ts (0 hunks)
- server/src/markdownify/scrape.ts (3 hunks)
- server/src/pgboss-worker.ts (1 hunks)
- server/src/workflow-management/scheduler/index.ts (1 hunks)
💤 Files with no reviewable changes (1)
- server/src/api/record.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- server/src/pgboss-worker.ts
- maxun-core/src/interpret.ts
- server/src/workflow-management/scheduler/index.ts
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/markdownify/scrape.ts (1)
server/src/browser-management/browserConnection.ts (1)
connectToRemoteBrowser(89-130)
🔇 Additional comments (2)
server/src/markdownify/scrape.ts (1)
1-1: LGTM!

The import change correctly aligns with the remote browser service architecture.
package.json (1)
59-59: playwright-core version 1.57.0 is valid and published. The version exists on npm and is available for use. The version jump from playwright ^1.48.2 to playwright-core 1.57.0 is a significant upgrade but the target version is stable and legitimate.
```typescript
export async function convertPageToMarkdown(url: string): Promise<string> {
  const browser = await connectToRemoteBrowser();
  const page = await browser.newPage();

  await gotoWithFallback(page, url);
  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
```
Critical: Browser and page resources are never closed, causing resource leak.
The function creates a browser connection and page but never closes them. This will lead to resource exhaustion over time.
Apply this diff to ensure proper cleanup:
export async function convertPageToMarkdown(url: string): Promise<string> {
const browser = await connectToRemoteBrowser();
- const page = await browser.newPage();
+ let page;
+ try {
+ page = await browser.newPage();
- await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+ await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
- const cleanedHtml = await page.evaluate(() => {
- const selectors = [
- "script",
- "style",
- "link[rel='stylesheet']",
- "noscript",
- "meta",
- "svg",
- "img",
- "picture",
- "source",
- "video",
- "audio",
- "iframe",
- "object",
- "embed"
- ];
-
- selectors.forEach(sel => {
- document.querySelectorAll(sel).forEach(e => e.remove());
- });
-
- // Remove inline event handlers (onclick, onload…)
- const all = document.querySelectorAll("*");
- all.forEach(el => {
- [...el.attributes].forEach(attr => {
- if (attr.name.startsWith("on")) {
- el.removeAttribute(attr.name);
- }
- });
- });
-
- return document.documentElement.outerHTML;
- });
+ const cleanedHtml = await page.evaluate(() => {
+ const selectors = [
+ "script",
+ "style",
+ "link[rel='stylesheet']",
+ "noscript",
+ "meta",
+ "svg",
+ "img",
+ "picture",
+ "source",
+ "video",
+ "audio",
+ "iframe",
+ "object",
+ "embed"
+ ];
+
+ selectors.forEach(sel => {
+ document.querySelectorAll(sel).forEach(e => e.remove());
+ });
+
+ // Remove inline event handlers (onclick, onload…)
+ const all = document.querySelectorAll("*");
+ all.forEach(el => {
+ [...el.attributes].forEach(attr => {
+ if (attr.name.startsWith("on")) {
+ el.removeAttribute(attr.name);
+ }
+ });
+ });
+
+ return document.documentElement.outerHTML;
+ });
- if (shouldCloseBrowser && browser) {
- logger.log('info', `[Scrape] Closing browser instance created for markdown conversion`);
- await browser.close();
- } else {
- logger.log('info', `[Scrape] Keeping existing browser instance open after markdown conversion`);
- }
+ // Convert cleaned HTML → Markdown
+ const markdown = await parseMarkdown(cleanedHtml, url);
+ return markdown;
+ } finally {
+ if (page) await page.close();
+ if (browser) await browser.close();
+ }
+}Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 26 to 30, the function opens a
remote browser and a page but never closes them, leaking resources; wrap the
navigation and processing in a try/finally (or try/catch/finally) and in finally
await closing the page if it was created (await page.close()) and the browser if
it was created (await browser.close()), guarding each close with existence
checks and individual try/catch to avoid masking original errors and ensure
cleanup always runs.
```typescript
export async function convertPageToHTML(url: string): Promise<string> {
  const browser = await connectToRemoteBrowser();
  const page = await browser.newPage();

  await gotoWithFallback(page, url);
  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
```
Critical: Browser and page resources are never closed, causing resource leak.
The function creates a browser connection and page but never closes them. This will lead to resource exhaustion over time.
Apply this diff to ensure proper cleanup:
export async function convertPageToHTML(url: string): Promise<string> {
const browser = await connectToRemoteBrowser();
- const page = await browser.newPage();
+ let page;
+ try {
+ page = await browser.newPage();
- await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+ await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
- const cleanedHtml = await page.evaluate(() => {
- const selectors = [
- "script",
- "style",
- "link[rel='stylesheet']",
- "noscript",
- "meta",
- "svg",
- "img",
- "picture",
- "source",
- "video",
- "audio",
- "iframe",
- "object",
- "embed"
- ];
-
- selectors.forEach(sel => {
- document.querySelectorAll(sel).forEach(e => e.remove());
- });
-
- // Remove inline event handlers (onclick, onload…)
- const all = document.querySelectorAll("*");
- all.forEach(el => {
- [...el.attributes].forEach(attr => {
- if (attr.name.startsWith("on")) {
- el.removeAttribute(attr.name);
- }
- });
- });
-
- return document.documentElement.outerHTML;
- });
+ const cleanedHtml = await page.evaluate(() => {
+ const selectors = [
+ "script",
+ "style",
+ "link[rel='stylesheet']",
+ "noscript",
+ "meta",
+ "svg",
+ "img",
+ "picture",
+ "source",
+ "video",
+ "audio",
+ "iframe",
+ "object",
+ "embed"
+ ];
+
+ selectors.forEach(sel => {
+ document.querySelectorAll(sel).forEach(e => e.remove());
+ });
+
+ // Remove inline event handlers (onclick, onload…)
+ const all = document.querySelectorAll("*");
+ all.forEach(el => {
+ [...el.attributes].forEach(attr => {
+ if (attr.name.startsWith("on")) {
+ el.removeAttribute(attr.name);
+ }
+ });
+ });
+
+ return document.documentElement.outerHTML;
+ });
- if (shouldCloseBrowser && browser) {
- logger.log('info', `[Scrape] Closing browser instance created for HTML conversion`);
- await browser.close();
- } else {
- logger.log('info', `[Scrape] Keeping existing browser instance open after HTML conversion`);
- }
+ return cleanedHtml;
+ } finally {
+ if (page) await page.close();
+ if (browser) await browser.close();
+ }
+}Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 85 to 89, the function opens a
browser and page but never closes them; wrap the page navigation and processing
in a try/finally (or try/catch/finally) and ensure you call await page.close()
in the finally block and await browser.close() (or browser.disconnect() if using
a shared remote browser) after page cleanup; check for null/undefined before
closing and rethrow or return appropriately so resources are always released
even on errors.
What does this PR do?
Adds a separate browser service for connecting over CDP.
Summary by CodeRabbit
New Features
Infrastructure Updates
Refactor