Conversation
@RohitR311 RohitR311 commented Nov 30, 2025

What does this PR do?

Adds a separate browser service for connecting over CDP.

Summary by CodeRabbit

  • New Features

    • Added a dedicated remote browser service exposing WebSocket and health endpoints.
  • Infrastructure Updates

    • Added a containerized browser service to compose with ports, healthcheck, and resource constraints.
    • Backend build images no longer install browser binaries during image build.
  • Refactor

    • Switched to remote browser connections with retry/fallback and removed stealth-plugin integration.


coderabbitai bot commented Nov 30, 2025

Walkthrough

Extracts browser runtime into a standalone browser service, adds Docker/compose artifacts for it, removes Playwright browser install from the backend image, and updates server code to connect remotely via WebSocket/CDP while switching imports to playwright-core.

Changes

Cohort / File(s) — Summary

  • Browser service artifacts (browser/Dockerfile, browser/package.json, browser/server.ts, browser/tsconfig.json, browser/.dockerignore): New TypeScript browser service using the Playwright base image; installs deps, builds, exposes WS and health ports, serves health and WebSocket/CDP endpoints; includes a .dockerignore.
  • Compose & backend image (docker-compose.yml, Dockerfile.backend, .env.example): Adds the browser service to docker-compose (build, ports, healthcheck, resources, security); removes npx playwright install --with-deps chromium from the backend Dockerfile; appends BROWSER_WS_PORT/BROWSER_HEALTH_PORT/BROWSER_WS_HOST env vars.
  • Remote connection module (server/src/browser-management/browserConnection.ts): New module to fetch the health/ws endpoint, connect to remote Playwright via WebSocket with retry/timeout, and fall back to local launch; exports connectToRemoteBrowser() and checkBrowserServiceHealth().
  • Replace local launches with remote (server/src/browser-management/classes/RemoteBrowser.ts, server/src/markdownify/scrape.ts, server/src/routes/proxy.ts, server/src/routes/record.ts, server/src/routes/storage.ts, server/src/api/record.ts): Replaces in-process chromium.launch() and stealth-plugin usage with connectToRemoteBrowser(); retains page logic, adds CDP fallback handling, and removes stealth plugin initializations.
  • Dependency & package updates (package.json, maxun-core/package.json): Replace playwright and stealth plugins with playwright-core; add/adjust dependencies (e.g., @types/node, turndown, pkce-challenge).
  • Playwright import/type updates (maxun-core/src/interpret.ts, maxun-core/src/types/workflow.ts, server/src/pgboss-worker.ts, server/src/types/index.ts, server/src/workflow-management/..., server/src/browser-management/inputHandlers.ts, src/shared/types.ts, ...): Bulk change of imports from 'playwright' to 'playwright-core' across many modules, plus a few targeted casts to relax type expectations.
  • Signature changes (server/src/markdownify/scrape.ts): convertPageToMarkdown(url, existingPage?) -> convertPageToMarkdown(url) and convertPageToHTML(url, existingPage?) -> convertPageToHTML(url); both now always use remote browser pages.

Sequence Diagram(s)

sequenceDiagram
    participant Server as Main Server
    participant Conn as browserConnection
    participant BrowserSvc as Browser Service
    participant CDP as Chromium/CDP

    Server->>Conn: checkBrowserServiceHealth()
    Conn->>BrowserSvc: GET /health
    BrowserSvc-->>Conn: { status, wsEndpoint }
    Conn-->>Server: health OK + wsEndpoint

    Server->>Conn: connectToRemoteBrowser()
    Conn->>BrowserSvc: open WebSocket to wsEndpoint
    activate BrowserSvc
    BrowserSvc->>CDP: start/attach CDP session
    CDP-->>BrowserSvc: session ready
    BrowserSvc-->>Conn: WS connected / browser handle
    deactivate BrowserSvc
    Conn-->>Server: Browser instance (via playwright-core)

    loop Page operations
      Server->>CDP: navigate / interact / capture
      CDP-->>Server: events / responses
    end

    Server->>Conn: close connection
    Conn->>BrowserSvc: close WS
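The health-check half of this handshake can be sketched as a small helper (a sketch, not code from the PR: the name resolveWsEndpoint is hypothetical, the `{ status, wsEndpoint }` response shape is taken from the diagram, and the fetch function is injected so the sketch stays self-contained):

```typescript
// Sketch: resolve the browser service's WebSocket endpoint from /health.
// The fetch function is injected so this can be exercised without a network.
type HealthResponse = { status: string; wsEndpoint?: string };
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<HealthResponse> }>;

async function resolveWsEndpoint(
    fetchFn: FetchLike,
    host = 'localhost',
    port = '3002',
): Promise<string> {
    const healthEndpoint = `http://${host}:${port}/health`;
    const response = await fetchFn(healthEndpoint);
    if (!response.ok) {
        throw new Error(`Health endpoint returned a non-OK status for ${healthEndpoint}`);
    }
    const data = await response.json();
    if (data.status === 'healthy' && data.wsEndpoint) {
        return data.wsEndpoint;
    }
    throw new Error('Browser service is not healthy or did not report a wsEndpoint');
}
```

Once this resolves, chromium.connect(wsEndpoint) from playwright-core would take over, as in the diagram.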

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Files warranting close attention:
    • server/src/browser-management/browserConnection.ts — retry, timeouts, fallback correctness, logging.
    • server/src/browser-management/classes/RemoteBrowser.ts — lifecycle, adblock/CDP fallback, type casts.
    • browser/server.ts — launch flags, ports, health endpoint, graceful shutdown.
    • docker-compose.yml & Dockerfile.backend — resource/security settings and env consistency.
    • Call sites where function signatures changed (server/src/markdownify/scrape.ts) and broader import swaps to playwright-core.

Suggested labels

Type: Enhancement, Scope: Infra, Scope: Dependencies

Suggested reviewers

  • amhsirak

Poem

🐰 I hopped a socket, found a stream,
Left local boots to chase a dream.
WebSockets hum and CDP sings,
Remote chromium spreads its wings.
I nibble logs and dance on strings.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 77.78%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately describes the main change: introducing a separate browser service as an infrastructure improvement.


@amhsirak amhsirak changed the title feat: add separate browser service chore(infra): add separate browser service Nov 30, 2025
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
server/src/markdownify/scrape.ts (2)

8-57: Missing error handling - browser may not be closed on failure.

If any operation fails after connecting to the browser (e.g., page.goto, page.evaluate), the browser connection will not be closed, potentially leaking resources on the remote browser service.

 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
-
-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
-
-  await page.addInitScript(() => {
-    // ... cleanup script
-  });
-
-  // Re-extract HTML after cleanup
-  const cleanedHtml = await page.evaluate(() => {
-    return document.documentElement.outerHTML;
-  });
-
-  await browser.close();
-
-  // Convert cleaned HTML → Markdown
-  const markdown = await parseMarkdown(cleanedHtml, url);
-  return markdown;
+  try {
+    const page = await browser.newPage();
+
+    await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+
+    await page.addInitScript(() => {
+      const selectors = [
+        "script", "style", "link[rel='stylesheet']", "noscript",
+        "meta", "svg", "img", "picture", "source", "video",
+        "audio", "iframe", "object", "embed"
+      ];
+      selectors.forEach(sel => {
+        document.querySelectorAll(sel).forEach(e => e.remove());
+      });
+      const all = document.querySelectorAll("*");
+      all.forEach(el => {
+        [...el.attributes].forEach(attr => {
+          if (attr.name.startsWith("on")) {
+            el.removeAttribute(attr.name);
+          }
+        });
+      });
+    });
+
+    const cleanedHtml = await page.evaluate(() => {
+      return document.documentElement.outerHTML;
+    });
+
+    const markdown = await parseMarkdown(cleanedHtml, url);
+    return markdown;
+  } finally {
+    await browser.close();
+  }
 }

63-111: Same issue: missing try-finally for browser cleanup in convertPageToHTML.

Apply the same try-finally pattern to ensure the browser is closed even when errors occur.

 export async function convertPageToHTML(url: string): Promise<string> {
   const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
-  // ... operations ...
-  await browser.close();
-  return cleanedHtml;
+  try {
+    const page = await browser.newPage();
+    // ... operations ...
+    return cleanedHtml;
+  } finally {
+    await browser.close();
+  }
 }
docker-compose.yml (1)

56-58: Backend should depend on the browser service.

The backend now requires the remote browser service for browser operations. Add browser to the depends_on list to ensure proper startup order and health verification.

     depends_on:
       - postgres
       - minio
+      - browser
server/src/routes/proxy.ts (1)

79-95: Proxy configuration is built but not applied - test is non-functional.

The proxyOptions object is constructed (lines 79-85) but never passed to the browser connection. The remote browser connects without proxy settings, so this endpoint no longer actually tests the user's proxy configuration.

The connectToRemoteBrowser() function doesn't support per-connection proxy options since the remote browser is a shared instance. Consider whether:

  1. The test endpoint should create a new context with proxy settings on the remote browser
  2. Or document that proxy testing requires a different approach with remote browsers
-        const browser = await connectToRemoteBrowser();
-        const page = await browser.newPage();
+        const browser = await connectToRemoteBrowser();
+        const context = await browser.newContext({
+            proxy: proxyOptions.server ? proxyOptions : undefined
+        });
+        const page = await context.newPage();
         await page.goto('https://example.com');
+        await context.close();
         await browser.close();

Note: Verify that closing the browser here doesn't affect other connections to the shared remote browser service.

🧹 Nitpick comments (7)
ENVEXAMPLE (1)

42-45: Consider adding documentation for local development setup.

The BROWSER_WS_HOST=browser value assumes Docker service naming. For developers running locally without Docker, add a comment indicating the appropriate value (e.g., localhost).

 # WebSocket port for browser CDP connections
-BROWSER_WS_PORT=3001
-BROWSER_HEALTH_PORT=3002
-BROWSER_WS_HOST=browser
+BROWSER_WS_PORT=3001                    # WebSocket port for browser CDP connections
+BROWSER_HEALTH_PORT=3002                # Health check endpoint port for browser service
+BROWSER_WS_HOST=browser                 # Browser service host (use 'localhost' for local development without Docker)
maxun-core/src/interpret.ts (2)

147-147: Type cast explanation would improve maintainability.

The as any cast is necessary because @cliqz/adblocker-playwright types are built against the full playwright package. Consider adding a brief comment explaining this type incompatibility for future maintainers.

-        await this.blocker.enableBlockingInPage(page as any);
+        // Cast needed: @cliqz/adblocker-playwright expects playwright types, not playwright-core
+        await this.blocker.enableBlockingInPage(page as any);

157-157: Same type cast pattern as above.

Consider adding a similar comment here for consistency, or extract the cast logic to a helper if this pattern repeats elsewhere.

server/src/browser-management/browserConnection.ts (1)

18-41: Align retry logic and runtime assumptions for the browser service connection

Current behavior:

  • getBrowserServiceEndpoint() is called once before the retry loop; if /health is temporarily unavailable or returns non-JSON, connectToRemoteBrowser fails immediately and the retry settings never apply.
  • The retry loop only covers chromium.connect(wsEndpoint, { timeout }).
  • Both getBrowserServiceEndpoint and checkBrowserServiceHealth rely on a global fetch being present at runtime.

Consider tightening this up by:

  • Moving the /health call inside the retry loop or adding a small retry wrapper around getBrowserServiceEndpoint, so transient startup issues in the browser service are handled the same way as transient CDP failures.
  • Optionally sharing a small helper for constructing healthEndpoint to avoid divergence between getBrowserServiceEndpoint and checkBrowserServiceHealth.
  • Ensuring the server runtime is Node 18+ (or that a fetch polyfill is loaded) so these calls are reliable in production.

Please confirm:

  • That the Node version used to run this server exposes a global fetch, and
  • Whether you want the /health lookup to participate in the same retry budget as the CDP connection. If yes, we can sketch a small refactor to unify that behavior.

Also applies to: 49-85, 91-110
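One way to let the /health lookup share the same retry budget as the CDP connect is a generic wrapper, roughly along these lines (a sketch under the suggestion above; withRetry and its parameters are hypothetical, not part of the PR):

```typescript
// Sketch: retry an async operation with a fixed delay between attempts.
async function withRetry<T>(
    op: () => Promise<T>,
    maxRetries = 3,
    retryDelayMs = 2000,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            return await op();
        } catch (error) {
            lastError = error;
            if (attempt < maxRetries) {
                await new Promise(resolve => setTimeout(resolve, retryDelayMs));
            }
        }
    }
    throw lastError;
}

// Both steps could then participate in the same budget, e.g.:
// const wsEndpoint = await withRetry(() => getBrowserServiceEndpoint());
// const browser = await withRetry(() => chromium.connect(wsEndpoint, { timeout }));
```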

server/src/workflow-management/scheduler/index.ts (1)

13-13: Scheduler’s Playwright-core migration is fine; ensure destroyRemoteBrowser is idempotent

  • Importing Page from "playwright-core" matches the rest of the migration and is only used as a type for the interpreter callback, so behavior is unchanged.
  • In executeRun:
    • For “scrape” robots, destroyRemoteBrowser(plainRun.browserId, userId) is called inside the try/catch branch.
    • For regular robots, destroyRemoteBrowser is called near the end of the main path.
  • In readyForRunHandler, on failure or thrown errors you again call destroyRemoteBrowser(browserId, userId).

If destroyRemoteBrowser is not explicitly idempotent, some failure cases (notably the scrape path when conversion fails) may trigger it twice for the same browserId. It’s probably harmless, but worth confirming the implementation handles “already cleaned up” cases gracefully and just no-ops.

Please check the implementation of destroyRemoteBrowser and confirm it:

  • Safely handles being called multiple times for the same browserId, and
  • Logs at an appropriate level (e.g. not as hard errors) if the browser has already been cleaned up. If you want, we can adjust the scheduler to centralize the cleanup in one place.

Also applies to: 329-331, 368-371, 586-603
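An idempotency guard along these lines would make double cleanup a harmless no-op (a sketch only; destroyRemoteBrowser's real signature and internals live in the PR, and the tracking set here is hypothetical):

```typescript
// Sketch: make cleanup idempotent by tracking browser IDs already destroyed.
const destroyedBrowserIds = new Set<string>();

async function destroyRemoteBrowserOnce(
    browserId: string,
    destroy: (id: string) => Promise<void>,
): Promise<boolean> {
    if (destroyedBrowserIds.has(browserId)) {
        // Already cleaned up: no-op instead of erroring on the second call.
        return false;
    }
    destroyedBrowserIds.add(browserId);
    await destroy(browserId);
    return true;
}
```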

browser/package.json (1)

11-15: Consider pinning all Playwright-related dependency versions for consistency.

The playwright version is pinned to 1.57.0, but playwright-extra and puppeteer-extra-plugin-stealth use caret ranges. This could lead to compatibility issues if playwright-extra updates its internal Playwright dependency to a different version.

     "dependencies": {
         "playwright": "1.57.0",
-        "playwright-extra": "^4.3.6",
-        "puppeteer-extra-plugin-stealth": "^2.11.2"
+        "playwright-extra": "4.3.6",
+        "puppeteer-extra-plugin-stealth": "2.11.2"
     },
browser/server.ts (1)

11-13: Add validation for port environment variables.

If BROWSER_WS_PORT or BROWSER_HEALTH_PORT contain invalid values, parseInt will return NaN, causing silent failures.

-const BROWSER_WS_PORT = parseInt(process.env.BROWSER_WS_PORT || '3001', 10);
-const BROWSER_HEALTH_PORT = parseInt(process.env.BROWSER_HEALTH_PORT || '3002', 10);
+const BROWSER_WS_PORT = parseInt(process.env.BROWSER_WS_PORT || '3001', 10);
+const BROWSER_HEALTH_PORT = parseInt(process.env.BROWSER_HEALTH_PORT || '3002', 10);
+
+if (isNaN(BROWSER_WS_PORT) || isNaN(BROWSER_HEALTH_PORT)) {
+    console.error('Invalid port configuration');
+    process.exit(1);
+}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad8df66 and cf19a72.

📒 Files selected for processing (25)
  • Dockerfile.backend (0 hunks)
  • ENVEXAMPLE (1 hunks)
  • browser/.dockerignore (1 hunks)
  • browser/Dockerfile (1 hunks)
  • browser/package.json (1 hunks)
  • browser/server.ts (1 hunks)
  • browser/tsconfig.json (1 hunks)
  • docker-compose.yml (1 hunks)
  • maxun-core/package.json (1 hunks)
  • maxun-core/src/interpret.ts (3 hunks)
  • maxun-core/src/types/workflow.ts (1 hunks)
  • package.json (2 hunks)
  • server/src/api/record.ts (0 hunks)
  • server/src/browser-management/browserConnection.ts (1 hunks)
  • server/src/browser-management/classes/RemoteBrowser.ts (4 hunks)
  • server/src/markdownify/scrape.ts (2 hunks)
  • server/src/pgboss-worker.ts (1 hunks)
  • server/src/routes/proxy.ts (2 hunks)
  • server/src/routes/record.ts (0 hunks)
  • server/src/routes/storage.ts (0 hunks)
  • server/src/types/index.ts (1 hunks)
  • server/src/workflow-management/classes/Generator.ts (1 hunks)
  • server/src/workflow-management/classes/Interpreter.ts (1 hunks)
  • server/src/workflow-management/scheduler/index.ts (1 hunks)
  • server/src/workflow-management/selector.ts (1 hunks)
💤 Files with no reviewable changes (4)
  • Dockerfile.backend
  • server/src/api/record.ts
  • server/src/routes/record.ts
  • server/src/routes/storage.ts
🧰 Additional context used
🧬 Code graph analysis (3)
server/src/markdownify/scrape.ts (1)
server/src/browser-management/browserConnection.ts (1)
  • connectToRemoteBrowser (49-85)
server/src/routes/proxy.ts (1)
server/src/browser-management/browserConnection.ts (1)
  • connectToRemoteBrowser (49-85)
server/src/browser-management/classes/RemoteBrowser.ts (1)
server/src/browser-management/browserConnection.ts (1)
  • connectToRemoteBrowser (49-85)
🔇 Additional comments (17)
maxun-core/src/types/workflow.ts (1)

1-1: Import migration to playwright-core looks correct.

The Page type from playwright-core has the same interface as playwright, so this change maintains type compatibility while aligning with the remote browser service architecture.

maxun-core/src/interpret.ts (1)

2-2: Import migration to playwright-core is consistent with the PR objectives.

server/src/pgboss-worker.ts (1)

16-16: Import migration is consistent with the codebase-wide change to playwright-core.

server/src/workflow-management/classes/Interpreter.ts (1)

4-4: Import migration to playwright-core is consistent with the PR's browser service architecture.

server/src/types/index.ts (1)

1-1: Type-only Playwright import migration looks good

Switching BrowserType/LaunchOptions to playwright-core keeps this file type-only and aligned with the rest of the migration. No functional impact here.

Please double-check other remaining imports in the repo to ensure everything consistently pulls types from playwright-core, not playwright.

package.json (1)

63-63: Playwright-core root dependency is consistent; keep versions in sync

Adding "playwright-core": "1.57.0" at the root matches the core package version and aligns with the import changes across the server. Just be sure the browser service image uses a compatible Playwright version so there are no protocol mismatches.

Please confirm that:

  • maxun-core/package.json and the browser service package.json also pin playwright-core (or playwright) to a compatible 1.57.x range, and
  • no remaining code imports from "playwright" instead of "playwright-core".
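A minimal sanity check on the version strings could look like this (a hypothetical helper, assuming compatibility is judged at major.minor granularity after stripping range prefixes; it does not replace verifying against Playwright's actual compatibility rules):

```typescript
// Sketch: check that two manifest version specs agree at major.minor,
// stripping range prefixes like ^ and ~ first.
function sameMinorLine(a: string, b: string): boolean {
    const norm = (v: string) => v.replace(/^[\^~]/, '').split('.').slice(0, 2).join('.');
    return norm(a) === norm(b);
}
```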
maxun-core/package.json (1)

33-39: Core package dependency updates look coherent

Adding @types/node, playwright-core and turndown here aligns with the new browser architecture and markdown/HTML conversion needs. Versioning is consistent with the root manifest.

Please verify against your publish pipeline that:

  • maxun-core is built against Node 18+ (to match the @types/node target), and
  • consumers of maxun-core don’t themselves hard-depend on "playwright" in a way that conflicts with this new playwright-core dependency.
server/src/workflow-management/selector.ts (1)

1-1: Page type import migration is consistent with the new Playwright-core setup

Using Page from "playwright-core" here matches the rest of the migration and keeps this file type-only with respect to Playwright. The runtime still depends on the actual Page instances supplied by your remote browser layer.

Please ensure the interpreter/remote browser classes that pass Page into these functions are also typed against playwright-core to avoid having multiple incompatible Page types in play.

server/src/workflow-management/classes/Generator.ts (1)

5-5: Generator now correctly depends on Page from Playwright-core

Updating the Page type import here keeps the generator aligned with the rest of the Playwright-core migration and the remote browser architecture. All method signatures remain the same.

Please verify that the RemoteBrowser/interpreter layer which calls into WorkflowGenerator also uses Page from "playwright-core", so TypeScript doesn’t end up with two structurally-identical but distinct Page types.

browser/.dockerignore (1)

1-9: Re-evaluate ignoring *.ts, tsconfig.json, and dist in the browser build context

For a TypeScript-based browser service, this combination requires verification:

  • *.ts and tsconfig.json are excluded from the Docker build context.
  • dist is also excluded.

If the Dockerfile runs build commands like npm run build or tsc inside the container, the image will have neither source files nor compiled output. Conversely, if the build happens outside Docker, excluding the compiled output directory may be incorrect.

To resolve this:

  • Verify whether browser/Dockerfile invokes build commands during image construction.
  • Confirm where the compiled JavaScript should reside (dist, build, or another directory).
  • Adjust .dockerignore so that the necessary files for your build strategy are available in the Docker build context.
browser/Dockerfile (1)

1-30: LGTM! Good Dockerfile structure with proper layer caching.

The Dockerfile correctly uses the matching Playwright base image version (v1.57.0), employs npm ci for reproducible builds, and properly sequences the build steps to optimize Docker layer caching (copying package files before source code).

browser/tsconfig.json (1)

1-24: LGTM!

The TypeScript configuration is appropriate for the browser service with strict mode enabled and proper module resolution settings.

docker-compose.yml (1)

77-111: LGTM! Well-configured browser service with appropriate resource limits and healthcheck.

The configuration properly includes memory limits, shared memory size for Chromium, and necessary security options. The healthcheck is correctly configured to verify the service is ready.

browser/server.ts (1)

24-37: Security-sensitive browser flags warrant review.

Several flags have security implications:

  • --disable-web-security: Disables same-origin policy
  • --ignore-certificate-errors: Accepts invalid/self-signed certificates
  • --no-sandbox: Required in containers but reduces isolation

These are common for automation but could expose scraped content to security risks. Ensure these tradeoffs are acceptable for your use case.

server/src/browser-management/classes/RemoteBrowser.ts (3)

565-575: LGTM! Good migration to remote browser connection with retry logic.

The initialization properly connects to the remote browser service with retry handling and maintains the existing context setup logic.


665-675: Good fallback handling for ad blocker initialization.

The try-catch ensures the CDP session is established even if the ad blocker setup fails, preventing initialization failures from blocking the entire browser setup.

The any casts on lines 667 and 669 work around type incompatibility between playwright-core's Page and the adblocker library's expected types. This is acceptable given the library compatibility constraints.


685-692: No action needed - browser.close() behavior is correct for remote connections.

Playwright's browser.close() method behaves differently for local vs. remote connections. When connecting to a remote browser via chromium.connect() (as done in connectToRemoteBrowser()), calling browser.close() only disconnects that client's session—it does not terminate the shared remote browser instance. The code correctly handles cleanup by disconnecting the client connection and nullifying the reference.

Likely an incorrect or invalid review comment.

Comment on lines +74 to +86
// Graceful shutdown
async function shutdown(): Promise<void> {
    console.log('Shutting down browser service...');
    if (browserServer) {
        try {
            await browserServer.close();
            console.log('Browser server closed');
        } catch (error) {
            console.error('Error closing browser server:', error);
        }
    }
    process.exit(0);
}

⚠️ Potential issue | 🟡 Minor

Health server is not closed during shutdown.

The healthServer HTTP server is created but not tracked or closed in the shutdown() function. This could cause the process to hang during graceful shutdown.

+let healthServer: http.Server | null = null;
+
 async function start(): Promise<void> {
     // ... existing code ...
-        const healthServer = http.createServer((req, res) => {
+        healthServer = http.createServer((req, res) => {
             // ... handlers ...
         });
     // ...
 }

 async function shutdown(): Promise<void> {
     console.log('Shutting down browser service...');
+    if (healthServer) {
+        healthServer.close();
+        console.log('Health server closed');
+    }
     if (browserServer) {
         // ... existing code ...
     }
     process.exit(0);
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In browser/server.ts around lines 74 to 86, the shutdown() function closes
browserServer but never closes the healthServer HTTP server created elsewhere,
which can leave the process hanging; update shutdown() to check if healthServer
exists, call await healthServer.close() inside the try/catch (log success and
errors similarly to browserServer), and ensure any references are cleaned up
(e.g., set healthServer = undefined) before calling process.exit(0); also make
sure healthServer is in module scope so shutdown() can access it.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
server/src/browser-management/browserConnection.ts (1)

136-156: Consider extracting health endpoint URL construction to reduce duplication.

The health endpoint URL construction (lines 138-140) duplicates lines 19-21 in getBrowserServiceEndpoint. Additionally, the same response.ok check before response.json() should be added here for consistency.

+function getHealthEndpoint(): string {
+    const healthPort = process.env.BROWSER_HEALTH_PORT || '3002';
+    const healthHost = process.env.BROWSER_WS_HOST || 'localhost';
+    return `http://${healthHost}:${healthPort}/health`;
+}

Then use getHealthEndpoint() in both functions and add the response status check:

 export async function checkBrowserServiceHealth(): Promise<boolean> {
     try {
-        const healthPort = process.env.BROWSER_HEALTH_PORT || '3002';
-        const healthHost = process.env.BROWSER_WS_HOST || 'localhost';
-        const healthEndpoint = `http://${healthHost}:${healthPort}/health`;
+        const healthEndpoint = getHealthEndpoint();

         const response = await fetch(healthEndpoint);
+        if (!response.ok) {
+            logger.warn(`Health check returned status ${response.status}`);
+            return false;
+        }
         const data = await response.json();
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf19a72 and 3018d7d.

📒 Files selected for processing (1)
  • server/src/browser-management/browserConnection.ts (1 hunks)
🔇 Additional comments (2)
server/src/browser-management/browserConnection.ts (2)

1-12: LGTM!

Imports and configuration constants are appropriate. The retry configuration with 3 attempts, 2-second delay, and 30-second timeout provides reasonable resilience for network-based browser connections.


54-66: Security flags are aggressive—ensure this is intentional.

Several flags have security implications:

  • --disable-web-security: Disables same-origin policy
  • --ignore-certificate-errors: Bypasses SSL validation
  • --no-sandbox: Required in containers but reduces isolation

These are common for scraping/automation in controlled environments. If this service processes untrusted URLs or content, consider whether all flags are necessary, or add a comment documenting why they're required.

Comment on lines +23 to +31
    try {
        logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
        const response = await fetch(healthEndpoint);
        const data = await response.json();

        if (data.status === 'healthy' && data.wsEndpoint) {
            logger.debug(`Got WebSocket endpoint: ${data.wsEndpoint}`);
            return data.wsEndpoint;
        }

⚠️ Potential issue | 🟡 Minor

Add response.ok check before parsing JSON.

If the health endpoint returns a non-2xx status (e.g., 500, 404), calling response.json() may fail with a confusing parse error or succeed with unexpected error payload. Check the response status first for clearer error handling.

     try {
         logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
         const response = await fetch(healthEndpoint);
+        
+        if (!response.ok) {
+            throw new Error(`Health endpoint returned status ${response.status}`);
+        }
+        
         const data = await response.json();

         if (data.status === 'healthy' && data.wsEndpoint) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

    try {
        logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
        const response = await fetch(healthEndpoint);
        const data = await response.json();
        if (data.status === 'healthy' && data.wsEndpoint) {
            logger.debug(`Got WebSocket endpoint: ${data.wsEndpoint}`);
            return data.wsEndpoint;
        }

After:

    try {
        logger.debug(`Fetching WebSocket endpoint from: ${healthEndpoint}`);
        const response = await fetch(healthEndpoint);
        if (!response.ok) {
            throw new Error(`Health endpoint returned status ${response.status}`);
        }
        const data = await response.json();
        if (data.status === 'healthy' && data.wsEndpoint) {
            logger.debug(`Got WebSocket endpoint: ${data.wsEndpoint}`);
            return data.wsEndpoint;
        }
🤖 Prompt for AI Agents
In server/src/browser-management/browserConnection.ts around lines 23 to 31, the
code calls response.json() without verifying the HTTP status; add a response.ok
check immediately after awaiting fetch and before parsing JSON, and handle
non-2xx responses by logging/throwing a clear error (include status and
statusText and optionally response.text()) instead of attempting to parse a
likely error payload; keep the existing healthy/wsEndpoint checks for successful
responses.

Comment on lines +96 to +123
for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
        logger.debug(`Connection attempt ${attempt}/${maxRetries}`);

        const browser = await chromium.connect(wsEndpoint, {
            timeout: CONNECTION_CONFIG.connectionTimeout,
        });

        logger.info('Successfully connected to browser service');
        return browser;
    } catch (error: any) {
        logger.warn(
            `Connection attempt ${attempt}/${maxRetries} failed: ${error.message}`
        );

        if (attempt === maxRetries) {
            logger.error(
                `Failed to connect to browser service after ${maxRetries} attempts`
            );
            throw new Error(`Remote connection failed: ${error.message}`);
        }

        logger.debug(`Waiting ${CONNECTION_CONFIG.retryDelay}ms before retry...`);
        await new Promise(resolve => setTimeout(resolve, CONNECTION_CONFIG.retryDelay));
    }
}

throw new Error('Failed to connect to browser service');

⚠️ Potential issue | 🟡 Minor

Remove unreachable code at line 123.

The throw statement at line 123 is unreachable. The for-loop either returns on success (line 105) or throws when attempt === maxRetries (line 115). This line can never execute.

                 logger.debug(`Waiting ${CONNECTION_CONFIG.retryDelay}ms before retry...`);
                 await new Promise(resolve => setTimeout(resolve, CONNECTION_CONFIG.retryDelay));
             }
         }
-
-        throw new Error('Failed to connect to browser service');
     } catch (error: any) {
         logger.warn(`Browser service connection failed: ${error.message}`);
📝 Committable suggestion
Suggested change

```diff
         for (let attempt = 1; attempt <= maxRetries; attempt++) {
             try {
                 logger.debug(`Connection attempt ${attempt}/${maxRetries}`);

                 const browser = await chromium.connect(wsEndpoint, {
                     timeout: CONNECTION_CONFIG.connectionTimeout,
                 });

                 logger.info('Successfully connected to browser service');
                 return browser;
             } catch (error: any) {
                 logger.warn(
                     `Connection attempt ${attempt}/${maxRetries} failed: ${error.message}`
                 );

                 if (attempt === maxRetries) {
                     logger.error(
                         `Failed to connect to browser service after ${maxRetries} attempts`
                     );
                     throw new Error(`Remote connection failed: ${error.message}`);
                 }

                 logger.debug(`Waiting ${CONNECTION_CONFIG.retryDelay}ms before retry...`);
                 await new Promise(resolve => setTimeout(resolve, CONNECTION_CONFIG.retryDelay));
             }
         }
-
-        throw new Error('Failed to connect to browser service');
```
🤖 Prompt for AI Agents
In server/src/browser-management/browserConnection.ts around lines 96 to 123,
remove the unreachable final `throw new Error('Failed to connect to browser
service');` at line 123 — the for-loop either returns on success or throws when
attempt === maxRetries, so that trailing throw can be deleted; ensure no other
callers rely on that external throw (they'll still get the error thrown inside
the loop) and keep the rest of the retry logic intact.
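
Beyond deleting the dead line, the loop can be shaped so a trailing fallback throw is never needed in the first place. A minimal sketch — the `withRetry` helper is illustrative, not code from this PR:

```typescript
// Illustrative retry helper: the last failure is stored and rethrown from a
// single, reachable throw after the loop, so there is no dead code to forget.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  retryDelayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
  }
  throw lastError; // reachable: runs exactly when every attempt failed
}
```

With this shape, `connectToRemoteBrowser` would wrap the single `chromium.connect` call and keep the retry/backoff policy in one place.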

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/shared/types.ts (1)

2-2: Switch to import type for type-only imports

Since Locator is only used in type positions (ScreenshotSettings.mask?: Locator[]), use a type-only import:

-import { Locator } from "playwright-core";
+import type { Locator } from "playwright-core";

This clarifies intent and prevents accidental runtime dependencies, even under stricter TypeScript or bundler configurations.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3018d7d and 7f97401.

📒 Files selected for processing (2)
  • server/src/browser-management/inputHandlers.ts (1 hunks)
  • src/shared/types.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • server/src/browser-management/inputHandlers.ts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/markdownify/scrape.ts (1)

5-18: Remove unused helper or integrate it into the conversion functions.

The gotoWithFallback function is defined but never called. Both convertPageToMarkdown and convertPageToHTML use page.goto directly without the fallback logic. Either remove this function or replace the direct page.goto calls with gotoWithFallback to benefit from the retry mechanism.

♻️ Duplicate comments (2)
server/src/markdownify/scrape.ts (2)

67-72: Critical: Undefined variable shouldCloseBrowser will cause runtime error.

The code references shouldCloseBrowser which is not defined anywhere in the file. This will throw a ReferenceError at runtime.

Note: This code should be removed as part of the resource cleanup fix suggested above (lines 26-30).


126-131: Critical: Undefined variable shouldCloseBrowser will cause runtime error.

The code references shouldCloseBrowser which is not defined anywhere in the file. This will throw a ReferenceError at runtime.

Note: This code should be removed as part of the resource cleanup fix suggested above (lines 85-89).

🧹 Nitpick comments (3)
package.json (1)

58-58: Clarify the addition of pkce-challenge dependency.

The addition of pkce-challenge appears unrelated to the stated PR objective of adding a separate browser service for CDP connections. If this is part of OAuth/authentication work, consider documenting it in the PR description.

server/src/markdownify/scrape.ts (2)

20-25: Update JSDoc to reflect signature change.

The JSDoc comment still mentions the existingPage parameter that was removed. Update the documentation to reflect the current signature.

Apply this diff:

 /**
  * Fetches a webpage, strips scripts/styles/images/etc,
  * returns clean Markdown using parser.
  * @param url - The URL to convert
- * @param existingPage - Optional existing Playwright page instance to reuse
  */

79-84: Update JSDoc to reflect signature change.

The JSDoc comment still mentions the existingPage parameter that was removed. Update the documentation to reflect the current signature.

Apply this diff:

 /**
  * Fetches a webpage, strips scripts/styles/images/etc,
  * returns clean HTML.
  * @param url - The URL to convert
- * @param existingPage - Optional existing Playwright page instance to reuse
  */
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f97401 and 8c653f1.

📒 Files selected for processing (6)
  • maxun-core/src/interpret.ts (3 hunks)
  • package.json (2 hunks)
  • server/src/api/record.ts (0 hunks)
  • server/src/markdownify/scrape.ts (3 hunks)
  • server/src/pgboss-worker.ts (1 hunks)
  • server/src/workflow-management/scheduler/index.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • server/src/api/record.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • server/src/pgboss-worker.ts
  • maxun-core/src/interpret.ts
  • server/src/workflow-management/scheduler/index.ts
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/markdownify/scrape.ts (1)
server/src/browser-management/browserConnection.ts (1)
  • connectToRemoteBrowser (89-130)
🔇 Additional comments (2)
server/src/markdownify/scrape.ts (1)

1-1: LGTM!

The import change correctly aligns with the remote browser service architecture.

package.json (1)

59-59: playwright-core version 1.57.0 is valid and published. The version exists on npm and is available for use. The version jump from playwright ^1.48.2 to playwright-core 1.57.0 is a significant upgrade but the target version is stable and legitimate.

Comment on lines +26 to +30
```ts
export async function convertPageToMarkdown(url: string): Promise<string> {
  const browser = await connectToRemoteBrowser();
  const page = await browser.newPage();

  await gotoWithFallback(page, url);
  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
```

⚠️ Potential issue | 🔴 Critical

Critical: Browser and page resources are never closed, causing resource leak.

The function creates a browser connection and page but never closes them. This will lead to resource exhaustion over time.

Apply this diff to ensure proper cleanup:

 export async function convertPageToMarkdown(url: string): Promise<string> {
   const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
+  let page;
+  try {
+    page = await browser.newPage();

-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+    await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

-  const cleanedHtml = await page.evaluate(() => {
-    const selectors = [
-      "script",
-      "style",
-      "link[rel='stylesheet']",
-      "noscript",
-      "meta",
-      "svg",
-      "img",
-      "picture",
-      "source",
-      "video",
-      "audio",
-      "iframe",
-      "object",
-      "embed"
-    ];
-
-    selectors.forEach(sel => {
-      document.querySelectorAll(sel).forEach(e => e.remove());
-    });
-
-    // Remove inline event handlers (onclick, onload…)
-    const all = document.querySelectorAll("*");
-    all.forEach(el => {
-      [...el.attributes].forEach(attr => {
-        if (attr.name.startsWith("on")) {
-          el.removeAttribute(attr.name);
-        }
-      });
-    });
-
-    return document.documentElement.outerHTML;
-  });
+    const cleanedHtml = await page.evaluate(() => {
+      const selectors = [
+        "script",
+        "style",
+        "link[rel='stylesheet']",
+        "noscript",
+        "meta",
+        "svg",
+        "img",
+        "picture",
+        "source",
+        "video",
+        "audio",
+        "iframe",
+        "object",
+        "embed"
+      ];
+
+      selectors.forEach(sel => {
+        document.querySelectorAll(sel).forEach(e => e.remove());
+      });
+
+      // Remove inline event handlers (onclick, onload…)
+      const all = document.querySelectorAll("*");
+      all.forEach(el => {
+        [...el.attributes].forEach(attr => {
+          if (attr.name.startsWith("on")) {
+            el.removeAttribute(attr.name);
+          }
+        });
+      });
+
+      return document.documentElement.outerHTML;
+    });

-  if (shouldCloseBrowser && browser) {
-    logger.log('info', `[Scrape] Closing browser instance created for markdown conversion`);
-    await browser.close();
-  } else {
-    logger.log('info', `[Scrape] Keeping existing browser instance open after markdown conversion`);
-  }
+    // Convert cleaned HTML → Markdown
+    const markdown = await parseMarkdown(cleanedHtml, url);
+    return markdown;
+  } finally {
+    if (page) await page.close();
+    if (browser) await browser.close();
+  }
+}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 26 to 30, the function opens a
remote browser and a page but never closes them, leaking resources; wrap the
navigation and processing in a try/finally (or try/catch/finally) and in finally
await closing the page if it was created (await page.close()) and the browser if
it was created (await browser.close()), guarding each close with existence
checks and individual try/catch to avoid masking original errors and ensure
cleanup always runs.

Comment on lines +85 to +89
```ts
export async function convertPageToHTML(url: string): Promise<string> {
  const browser = await connectToRemoteBrowser();
  const page = await browser.newPage();

  await gotoWithFallback(page, url);
  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
```

⚠️ Potential issue | 🔴 Critical

Critical: Browser and page resources are never closed, causing resource leak.

The function creates a browser connection and page but never closes them. This will lead to resource exhaustion over time.

Apply this diff to ensure proper cleanup:

 export async function convertPageToHTML(url: string): Promise<string> {
   const browser = await connectToRemoteBrowser();
-  const page = await browser.newPage();
+  let page;
+  try {
+    page = await browser.newPage();

-  await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });
+    await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

-  const cleanedHtml = await page.evaluate(() => {
-    const selectors = [
-      "script",
-      "style",
-      "link[rel='stylesheet']",
-      "noscript",
-      "meta",
-      "svg",
-      "img",
-      "picture",
-      "source",
-      "video",
-      "audio",
-      "iframe",
-      "object",
-      "embed"
-    ];
-
-    selectors.forEach(sel => {
-      document.querySelectorAll(sel).forEach(e => e.remove());
-    });
-
-    // Remove inline event handlers (onclick, onload…)
-    const all = document.querySelectorAll("*");
-    all.forEach(el => {
-      [...el.attributes].forEach(attr => {
-        if (attr.name.startsWith("on")) {
-          el.removeAttribute(attr.name);
-        }
-      });
-    });
-
-    return document.documentElement.outerHTML;
-  });
+    const cleanedHtml = await page.evaluate(() => {
+      const selectors = [
+        "script",
+        "style",
+        "link[rel='stylesheet']",
+        "noscript",
+        "meta",
+        "svg",
+        "img",
+        "picture",
+        "source",
+        "video",
+        "audio",
+        "iframe",
+        "object",
+        "embed"
+      ];
+
+      selectors.forEach(sel => {
+        document.querySelectorAll(sel).forEach(e => e.remove());
+      });
+
+      // Remove inline event handlers (onclick, onload…)
+      const all = document.querySelectorAll("*");
+      all.forEach(el => {
+        [...el.attributes].forEach(attr => {
+          if (attr.name.startsWith("on")) {
+            el.removeAttribute(attr.name);
+          }
+        });
+      });
+
+      return document.documentElement.outerHTML;
+    });

-  if (shouldCloseBrowser && browser) {
-    logger.log('info', `[Scrape] Closing browser instance created for HTML conversion`);
-    await browser.close();
-  } else {
-    logger.log('info', `[Scrape] Keeping existing browser instance open after HTML conversion`);
-  }
+    return cleanedHtml;
+  } finally {
+    if (page) await page.close();
+    if (browser) await browser.close();
+  }
+}

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In server/src/markdownify/scrape.ts around lines 85 to 89, the function opens a
browser and page but never closes them; wrap the page navigation and processing
in a try/finally (or try/catch/finally) and ensure you call await page.close()
in the finally block and await browser.close() (or browser.disconnect() if using
a shared remote browser) after page cleanup; check for null/undefined before
closing and rethrow or return appropriately so resources are always released
even on errors.
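
Both cleanup comments describe the same pattern, which can be factored into one helper so every caller gets guaranteed, non-masking cleanup. A minimal sketch with hypothetical `Browser`/`Page` interfaces standing in for Playwright's types (the `withPage` name is illustrative, not part of this PR):

```typescript
// Hypothetical shapes standing in for Playwright's Browser/Page.
interface Page { close(): Promise<void>; }
interface Browser { newPage(): Promise<Page>; close(): Promise<void>; }

// Runs `work` against a fresh page, then always closes page and browser.
// Each close is guarded individually so a failing close never masks the
// original error thrown by `work`.
async function withPage<T>(
  browser: Browser,
  work: (page: Page) => Promise<T>,
): Promise<T> {
  let page: Page | undefined;
  try {
    page = await browser.newPage();
    return await work(page);
  } finally {
    if (page) {
      try { await page.close(); } catch { /* ignore close failures */ }
    }
    try { await browser.close(); } catch { /* ignore close failures */ }
  }
}
```

With this helper, `convertPageToMarkdown` and `convertPageToHTML` reduce to one `withPage(browser, ...)` call each, and a failing `close()` can no longer hide the error that actually broke the scrape.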

@amhsirak amhsirak merged commit a7b14cf into develop Nov 30, 2025
1 check passed
