feat: browserPerProxy browser launch option #2418

Merged: 5 commits, Apr 11, 2024

@barjin commented Apr 10, 2024

Fixes the performance issues with the new proxy handling in browser crawlers, as reported by @AndreyBykov's team.

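The trade-off is weaker proxy-based anti-blocking, though. Consider the following snippet, where newUrlFunction generates a fresh proxy URL on every call, yet all five responses report the same IP: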

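  import { ProxyConfiguration, PuppeteerCrawler } from 'crawlee';

  // Return a proxy URL with a random four-digit session ID on every call.
  const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: async () => {
      return `http://session-${Math.random().toString().slice(2,6)}:password@proxy.apify.com:8000`;
    }
  });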

  const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: async ({ response, proxyInfo }) => {
      console.log((await response?.json()).ip);
    },
    headless: false,
    // launchContext.browserPerProxy is `false` by default
  });
  
  await crawler.run([
    'https://api.ipify.org/?format=json&q=qnom',
    'https://api.ipify.org/?format=json&q=bugt',
    'https://api.ipify.org/?format=json&q=qfju',
    'https://api.ipify.org/?format=json&q=utbb',
    'https://api.ipify.org/?format=json&q=ekqu',
  ]);
INFO  System info {"apifyVersion":"3.1.16","apifyClientVersion":"2.9.3","crawleeVersion":"3.9.0","osType":"Linux","nodeVersion":"v20.2.0"}
INFO  PuppeteerCrawler: Starting the crawler.
139.28.120.90
139.28.120.90
139.28.120.90
139.28.120.90
139.28.120.90
INFO  PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PuppeteerCrawler: Final request statistics: {"requestsFinished":5,"requestsFailed":0,"retryHistogram":[5],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1189,"requestsFinishedPerMinute":86,"requestsFailedPerMinute":0,"requestTotalDurationMillis":5946,"requestsTotal":5,"crawlerRuntimeMillis":3489}
INFO  PuppeteerCrawler: Finished! Total 5 requests: 5 succeeded, 0 failed. {"terminal":true}

real    0m6,358s
user    0m6,097s
sys     0m0,929s

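With browserPerProxy enabled, the same snippet takes roughly twice as long (about 11.6 s versus 6.4 s of wall-clock time), but the result is correct: each request reports a different IP.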

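  import { ProxyConfiguration, PuppeteerCrawler } from 'crawlee';

  // Return a proxy URL with a random four-digit session ID on every call.
  const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: async () => {
      return `http://session-${Math.random().toString().slice(2,6)}:password@proxy.apify.com:8000`;
    }
  });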

  const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: async ({ response, proxyInfo }) => {
      console.log((await response?.json()).ip);
    },
    headless: false,
+    launchContext: {
+      browserPerProxy: true,
+    }
  });
  
  await crawler.run([
    'https://api.ipify.org/?format=json&q=qnom',
    'https://api.ipify.org/?format=json&q=bugt',
    'https://api.ipify.org/?format=json&q=qfju',
    'https://api.ipify.org/?format=json&q=utbb',
    'https://api.ipify.org/?format=json&q=ekqu',
  ]);
INFO  PuppeteerCrawler: Starting the crawler.
119.13.197.92
43.228.238.111
107.175.80.114
104.165.1.67
192.3.93.50
INFO  PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PuppeteerCrawler: Final request statistics: {"requestsFinished":5,"requestsFailed":0,"retryHistogram":[5],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":2263,"requestsFinishedPerMinute":34,"requestsFailedPerMinute":0,"requestTotalDurationMillis":11317,"requestsTotal":5,"crawlerRuntimeMillis":8765}
INFO  PuppeteerCrawler: Finished! Total 5 requests: 5 succeeded, 0 failed. {"terminal":true}

real    0m11,610s
user    0m12,990s
sys     0m3,295s
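
The difference between the two runs can be modelled with a small standalone sketch. It is not Crawlee's implementation: SketchBrowser, SketchPool and effectiveProxies are hypothetical names, proxy.example is a placeholder host, and the model assumes that a browser's proxy is fixed when the browser is launched and that, without browserPerProxy, later requests simply reuse an already open browser.

```typescript
// Simplified model only. SketchBrowser, SketchPool and effectiveProxies are
// hypothetical names and do not mirror Crawlee's browser pool internals.
interface SketchBrowser {
    id: number;
    proxyUrl: string; // assumed to be fixed when the browser is launched
}

class SketchPool {
    private readonly browsers: SketchBrowser[] = [];
    private readonly browserPerProxy: boolean;

    constructor(browserPerProxy: boolean) {
        this.browserPerProxy = browserPerProxy;
    }

    // Number of browsers launched so far.
    get launchCount(): number {
        return this.browsers.length;
    }

    // Returns the browser that serves a request which was handed `proxyUrl`.
    getBrowser(proxyUrl: string): SketchBrowser {
        const reusable = this.browserPerProxy
            ? this.browsers.find((b) => b.proxyUrl === proxyUrl) // reuse only on an exact proxy match
            : this.browsers[0]; // reuse whatever browser is already open
        if (reusable) return reusable;

        const browser: SketchBrowser = { id: this.browsers.length + 1, proxyUrl };
        this.browsers.push(browser);
        return browser;
    }
}

// Maps each requested proxy URL to the proxy the serving browser actually uses.
function effectiveProxies(
    browserPerProxy: boolean,
    requested: string[],
): { used: string[]; launches: number } {
    const pool = new SketchPool(browserPerProxy);
    const used = requested.map((url) => pool.getBrowser(url).proxyUrl);
    return { used, launches: pool.launchCount };
}

// Five distinct proxy URLs, as a newUrlFunction might produce them.
const requested = ['1111', '2222', '3333', '4444', '5555'].map(
    (session) => `http://session-${session}:password@proxy.example:8000`,
);

const shared = effectiveProxies(false, requested);
const perProxy = effectiveProxies(true, requested);

console.log(new Set(shared.used).size, shared.launches); // prints "1 1"
console.log(new Set(perProxy.used).size, perProxy.launches); // prints "5 5"
```

In this model the shared browser serves all five requests through the first proxy, while the per-proxy pool launches five browsers. Those extra launches are a plausible source of the longer runtime measured above, although the sketch does not measure anything itself.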

@barjin added the bug (Something isn't working.) and adhoc (Ad-hoc unplanned task added during the sprint.) labels on Apr 10, 2024
@barjin requested a review from B4nan on April 10, 2024 17:01
@barjin self-assigned this on Apr 10, 2024
@github-actions bot added this to the 87th sprint - Tooling team milestone on Apr 10, 2024
@github-actions bot added the t-tooling (Issues with this label are in the ownership of the tooling team.) label on Apr 10, 2024
@barjin commented Apr 10, 2024

Using proxy tiers now enables browserPerProxy by default.

Without browserPerProxy, newUrlFunction works (or doesn't work) exactly as well as it did before. Enabling browserPerProxy fixes the newUrlFunction behaviour so that it matches what anyone would expect.
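
As a rough illustration of "enabled by default with proxy tiers", the effective setting could be resolved as in the sketch below. The helper resolveBrowserPerProxy and the SketchLaunchOptions type are hypothetical and not part of Crawlee, and the rule that an explicitly set value takes precedence over the default is an assumption made for this example rather than something stated in this PR.

```typescript
// Hypothetical helper, not part of Crawlee: models a default that depends on
// whether proxy tiers are in use.
interface SketchLaunchOptions {
    browserPerProxy?: boolean; // undefined means "not set by the user"
}

function resolveBrowserPerProxy(options: SketchLaunchOptions, usesProxyTiers: boolean): boolean {
    // Assumption: an explicit user value wins; otherwise tiers turn the option on.
    return options.browserPerProxy ?? usesProxyTiers;
}

console.log(resolveBrowserPerProxy({}, true)); // prints "true"
console.log(resolveBrowserPerProxy({}, false)); // prints "false"
console.log(resolveBrowserPerProxy({ browserPerProxy: false }, true)); // prints "false"
```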

@github-actions bot added the tested (Temporary label used only programmatically for some analytics.) label on Apr 10, 2024
@B4nan (Member) left a comment

e2e tests are happy, let's merge and try to test this a bit more in the wild?

packages/browser-pool/src/browser-pool.ts (resolved)
packages/browser-pool/src/browser-pool.ts (outdated, resolved)
@B4nan merged commit df57b29 into master on Apr 11, 2024
8 checks passed
@B4nan deleted the feat/browser-per-proxy branch on April 11, 2024 06:52