
How can I limit the amount of crawling per URL? #765

Closed
michael1026 opened this issue Aug 18, 2020 · 12 comments
Labels
question Further information is requested.

Comments

@michael1026

michael1026 commented Aug 18, 2020

I've been essentially using a combination of the recursive crawl example and the crawl all links example.

I'm crawling thousands of URLs using the Puppeteer crawler and outputting the URL I'm currently on. The problem is that if I use maxRequestsPerCrawl: 10, it outputs about 10 URLs and then stops the program entirely. I'd like to do only a small amount of crawling on each URL and then stop crawling that URL. Is there any way I can limit it like this?

Thanks!

@metalwarrior665
Member

metalwarrior665 commented Aug 18, 2020

maxRequestsPerCrawl won't help you here. I would just mark the initial request that you add with userData.label: 'START'. In handlePageFunction, you add if (request.userData.label === 'START') and handle the enqueueing case there. Then, instead of using the enqueueLinks utility, I would find those links "manually" with selectors and slice them to 10. Then enqueue them via requestQueue.addRequest with a different label, so those requests skip the enqueueing phase and just extract the data.

There might be a simpler way via some utility, but if you want to learn how to use Apify, I would suggest going the more manual way.
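
For illustration, a minimal sketch of that manual approach with the 2020-era Apify SDK; the start URL, the 'a' selector, the DETAIL label and the limit of 10 are assumptions for the example, not part of the answer:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // Mark the start request so only it goes through the enqueueing phase.
    await requestQueue.addRequest({
        url: 'https://example.com',
        userData: { label: 'START' },
    });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            if (request.userData.label === 'START') {
                // Find links "manually" with a selector and keep only the first 10.
                const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
                for (const url of links.slice(0, 10)) {
                    // A different label means these requests skip enqueueing and only extract data.
                    await requestQueue.addRequest({ url, userData: { label: 'DETAIL' } });
                }
            } else {
                console.log('Extracting data from', request.url);
            }
        },
    });

    await crawler.run();
});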

@metalwarrior665 metalwarrior665 added the question label Aug 18, 2020
@michael1026
Author

That's really smart. Thank you. Let me see how that works. I'll get back to this.

@metalwarrior665
Member

Actually, you can keep using enqueueLinks, it has a limit parameter. https://sdk.apify.com/docs/api/utils#utilsenqueuelinksoptions
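
A rough sketch of how that could look inside handlePageFunction; the selector, the limit of 10 and the transformRequestFunction labeling are assumptions, not part of the comment:

// Inside handlePageFunction, for the start page only:
if (request.userData.label === 'START') {
    await Apify.utils.enqueueLinks({
        page,
        requestQueue,
        selector: 'a',
        limit: 10, // enqueue at most 10 of the matched links
        transformRequestFunction: (req) => {
            req.userData = { label: 'DETAIL' }; // these pages only extract data
            return req;
        },
    });
}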

@michael1026
Author

I used a combination of both of these: limit: 10 on enqueueLinks for the first pass, then I added the links from the second pass to the queue, and the rest I just logged. Seems to be working well! Thank you.

@mnmkng
Member

mnmkng commented Sep 3, 2020

@metalwarrior665 Pls comment why you reopened this.

@metalwarrior665
Member

Sorry. What is the strategy for useful questions that might repeat? Is it expected that people will search through closed issues, or do we incorporate this into some examples? I feel like it is a bit wasteful to answer this separately for each user.

@mnmkng
Member

mnmkng commented Sep 3, 2020

I think it's expected that people will search through closed issues too. Open issues typically represent something that's not possible or implemented yet. When you google for solutions to problems, you often end up on closed issues, rather than open ones. We could create an FAQ section in the docs or something, with very short Q/A type examples and keep adding those there, but that's a story for another day :)

@mnmkng mnmkng closed this as completed Sep 3, 2020
@teadrinker2015

Can we add a request limit option scoped per run() instead of per instance? It would be very helpful for testing and development. Either manipulating userData or initializing multiple crawler instances is tedious work.

@mnmkng
Member

mnmkng commented Mar 20, 2023

@teadrinker2015 Could you please provide an example of how this should work? I'm having a hard time imagining what exactly the issue is that you need solved.

@teadrinker2015

teadrinker2015 commented Mar 20, 2023

@mnmkng This is my use case.

import { PlaywrightCrawler } from 'crawlee'

const host = 'https://www.google.com'

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3, // This is instance scoped, or say a PerInstance or PerCrawler limitation.
    // maxRequestsPerRun: 3, // I suggest adding this option, applying to each single call of 'crawler.run()'

    async requestHandler({ request, page }) {
        console.log(request.url)
        const nextPageAnchor = page.locator('#pnnext')
        const nextPageUrl = host + (await nextPageAnchor.getAttribute('href'))
        crawler.addRequests([nextPageUrl])
    }
})

await crawler.run([host + '/search?q=crawler']) // This will consume all 3 available tickets and prevent any following requests.
await crawler.run([host + '/search?q=scraper']) // I expect this to be executed, and consume another 3.

I know the crawler.run() method is just a wrapper around crawler.addRequests(), but I still hope we can make this wrapper more useful.

@mnmkng
Member

mnmkng commented Mar 21, 2023

Thanks, I get it now. There are some questions to be answered before we can add this, like what should we do with the crawler's internal state between the runs? Discard or keep? @B4nan could someone quickly check, and if it's easy, could we add it?

@B4nan
Member

B4nan commented Mar 21, 2023

I know the crawler.run() method is just a wrapper around crawler.addRequests(),

It is not just a wrapper around addRequests, it also starts the scaling and stats, purges storages, etc. Adding requests is only the very last (and optional) step. If you want to have a long-running crawler, you could use keepAlive: true, then the crawler won't stop when there are no requests in the queue and will wait for more - you can add those via crawler.addRequests. And note that you should await that addRequests call, it's async.

I don't think we want to add more options; this is more about allowing the run method to be called multiple times - that itself is simply not supported, and the behavior is imho undefined.

I am not against it, I was actually thinking about the same yesterday, it could be enough to just reset the stats object and maybe some other internal flags and caches. In case the crawler is still running, I would just throw. Will take a closer look.
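
For context, a hedged sketch of the keepAlive pattern in Crawlee; the URLs and the commented-out teardown() call are illustrative assumptions:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Keep the crawler alive even when the request queue runs empty,
    // so more requests can be added later.
    keepAlive: true,
    async requestHandler({ request }) {
        console.log(request.url);
    },
});

// Don't await run() here; with keepAlive it keeps running until teardown().
const finished = crawler.run();

// addRequests() is async - remember to await it.
await crawler.addRequests(['https://www.google.com/search?q=crawler']);
// ...later, from anywhere in the app:
await crawler.addRequests(['https://www.google.com/search?q=scraper']);

// When no more work is expected:
// await crawler.teardown();
// await finished;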

B4nan added a commit that referenced this issue Apr 25, 2023
Often users are trying to reuse a single crawler instance, which wasn't
possible due to the internal state not being reset in the `run` method
completely. One example of this was the use inside express request
handler.

This PR adds support for doing so, by resetting the crawler stats and
default request queue. If the crawler is already running, an error is
thrown, suggesting the use of `crawler.addRequests()` instead.

Related: #765
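
As a hedged illustration of the Express use case mentioned in the commit message (the endpoint path, port and the PlaywrightCrawler choice are assumptions):

import express from 'express';
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        await Dataset.pushData({ url: request.url, title: await page.title() });
    },
});

const app = express();
app.use(express.json());

// The same crawler instance is reused for every incoming request. With the change
// above, run() resets the stats and the default request queue between calls, and
// throws if the crawler is already running.
app.post('/crawl', async (req, res) => {
    try {
        const stats = await crawler.run([req.body.url]);
        res.json(stats);
    } catch (err) {
        res.status(409).json({ error: String(err) });
    }
});

app.listen(3000);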