
How can I limit the amount of crawling per URL? #765

Closed
michael1026 opened this issue Aug 18, 2020 · 12 comments
Labels
question Further information is requested.

Comments

@michael1026

michael1026 commented Aug 18, 2020

I've been essentially using a combination of the recursive crawl example and the crawl all links example.

I'm crawling thousands of URLs using the Puppeteer crawler and outputting the URL I'm currently on. The problem is that if I use maxRequestsPerCrawl: 10, it outputs about 10 URLs and then stops the program entirely. I'd like to do only a small amount of crawling on each URL and then stop crawling that URL. Is there any way I can limit it like this?

Thanks!

@metalwarrior665
Member

metalwarrior665 commented Aug 18, 2020

maxRequestsPerCrawl won't help you here. I would just mark the initial request that you add with userData.label: 'START'. In handlePageFunction, you add if (request.userData.label === 'START') and handle the enqueueing case there. Then, instead of using the enqueueLinks utility, I would find those links "manually" with selectors and slice them to 10. Then enqueue them via requestQueue.addRequest with a different label, so those requests skip the enqueueing phase and just extract the data.

There might be a simpler way via some utility, but if you want to learn how to use Apify, I would suggest going the more manual way.
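
For illustration, a minimal sketch of that manual approach with the 2020-era Apify SDK; the start URL, the 'a' selector, the DETAIL label and the limit of 10 are assumptions for the example, not part of the answer:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // Mark the start request so only it goes through the enqueueing phase.
    await requestQueue.addRequest({
        url: 'https://example.com',
        userData: { label: 'START' },
    });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            if (request.userData.label === 'START') {
                // Find links "manually" with a selector and keep only the first 10.
                const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
                for (const url of links.slice(0, 10)) {
                    // A different label means these requests skip enqueueing and only extract data.
                    await requestQueue.addRequest({ url, userData: { label: 'DETAIL' } });
                }
            } else {
                console.log('Extracting data from', request.url);
            }
        },
    });

    await crawler.run();
});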

@metalwarrior665 metalwarrior665 added the question label Aug 18, 2020
@michael1026
Author

That's really smart. Thank you. Let me see how that works. I'll get back to this.

@metalwarrior665
Member

Actually, you can keep using enqueueLinks, it has a limit parameter. https://sdk.apify.com/docs/api/utils#utilsenqueuelinksoptions
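
A rough sketch of how that could look inside handlePageFunction; the selector, the limit of 10 and the transformRequestFunction labeling are assumptions, not part of the comment:

// Inside handlePageFunction, for the start page only:
if (request.userData.label === 'START') {
    await Apify.utils.enqueueLinks({
        page,
        requestQueue,
        selector: 'a',
        limit: 10, // enqueue at most 10 of the matched links
        transformRequestFunction: (req) => {
            req.userData = { label: 'DETAIL' }; // these pages only extract data
            return req;
        },
    });
}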

@michael1026
Author

I used a combination of both of these: limit: 10 on enqueueLinks for the first pass, then I added the links from the second pass to the queue, and the rest I just logged. Seems to be working well! Thank you.

@mnmkng
Member

mnmkng commented Sep 3, 2020

@metalwarrior665 Pls comment why you reopened this.

@metalwarrior665
Member

Sorry. What is the strategy for useful questions that might repeat? Is it expected that people will search through closed issues, or do we incorporate this into some examples? I feel like it is a bit wasteful to answer this separately for each user.

@mnmkng
Member

mnmkng commented Sep 3, 2020

I think it's expected that people will search through closed issues too. Open issues typically represent something that's not possible or implemented yet. When you google for solutions to problems, you often end up on closed issues, rather than open ones. We could create an FAQ section in the docs or something, with very short Q/A type examples and keep adding those there, but that's a story for another day :)

@mnmkng mnmkng closed this as completed Sep 3, 2020
@teadrinker2015

Can we add a request limit option scoped per run() instead of per instance? It would be very helpful for testing and development. Either manipulating userData or initializing multiple crawler instances is tedious work.

@mnmkng
Member

mnmkng commented Mar 20, 2023

@teadrinker2015 Could you please provide an example of how this should work? I'm having a hard time imagining what exactly the issue is that you need solved.

@teadrinker2015

teadrinker2015 commented Mar 20, 2023

@mnmkng This is my use case.

import { PlaywrightCrawler } from 'crawlee'

const host = 'https://www.google.com'

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3, // This is instance scoped, or say a PerInstance or PerCrawler limitation.
    // maxRequestsPerRun: 3, // I suggest adding this option, applying to each single call of 'crawler.run()'

    async requestHandler({ request, page }) {
        console.log(request.url)
        const nextPageAnchor = page.locator('#pnnext')
        const nextPageUrl = host + (await nextPageAnchor.getAttribute('href'))
        crawler.addRequests([nextPageUrl])
    }
})

await crawler.run([host + '/search?q=crawler']) // This will consume all 3 available tickets and prevent any following requests.
await crawler.run([host + '/search?q=scraper']) // I expect this to be executed, and consume another 3.

I know the crawler.run() method is just a wrapper around crawler.addRequests(), but I still hope we can make this wrapper more useful.

@mnmkng
Member

mnmkng commented Mar 21, 2023

Thanks, I get it now. There are some questions to be answered before we can add this, like what should we do with the crawler's internal state between the runs? Discard or keep? @B4nan could someone quickly check, and if it's easy, could we add it?

@B4nan
Member

B4nan commented Mar 21, 2023

I know the crawler.run() method is just a wrapper around crawler.addRequests(),

It is not just a wrapper around addRequests, it also starts the scaling and stats, purges storages, etc. Adding requests is only the very last (and optional) step. If you want to have a long-running crawler, you could use keepAlive: true, then the crawler won't stop when there are no requests in the queue and will wait for more - you can add those via crawler.addRequests. And note that you should await that addRequests call, it's async.

I don't think we want to add more options; this is more about allowing the run method to be called multiple times - that itself is simply not supported, and the behavior is imho undefined.

I am not against it, I was actually thinking about the same yesterday, it could be enough to just reset the stats object and maybe some other internal flags and caches. In case the crawler is still running, I would just throw. Will take a closer look.
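
For context, a hedged sketch of the keepAlive pattern in Crawlee; the URLs and the commented-out teardown() call are illustrative assumptions:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Keep the crawler alive even when the request queue runs empty,
    // so more requests can be added later.
    keepAlive: true,
    async requestHandler({ request }) {
        console.log(request.url);
    },
});

// Don't await run() here; with keepAlive it keeps running until teardown().
const finished = crawler.run();

// addRequests() is async - remember to await it.
await crawler.addRequests(['https://www.google.com/search?q=crawler']);
// ...later, from anywhere in the app:
await crawler.addRequests(['https://www.google.com/search?q=scraper']);

// When no more work is expected:
// await crawler.teardown();
// await finished;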

B4nan added a commit that referenced this issue Apr 25, 2023
Often users are trying to reuse a single crawler instance, which wasn't
possible due to the internal state not being reset in the `run` method
completely. One example of this was the use inside express request
handler.

This PR adds support for doing so, by resetting the crawler stats and
default request queue. If the crawler is already running, an error is
thrown, suggesting the use of `crawler.addRequests()` instead.

Related: #765
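
As a hedged illustration of the Express use case mentioned in the commit message (the endpoint path, port and the PlaywrightCrawler choice are assumptions):

import express from 'express';
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        await Dataset.pushData({ url: request.url, title: await page.title() });
    },
});

const app = express();
app.use(express.json());

// The same crawler instance is reused for every incoming request. With the change
// above, run() resets the stats and the default request queue between calls, and
// throws if the crawler is already running.
app.post('/crawl', async (req, res) => {
    try {
        const stats = await crawler.run([req.body.url]);
        res.json(stats);
    } catch (err) {
        res.status(409).json({ error: String(err) });
    }
});

app.listen(3000);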