How can I limit the amount of crawling per URL? #765
Comments
There might be a simpler way via some utility, but if you want to learn how to use Apify, I would suggest going the more manual way.
That's really smart. Thank you. Let me see how that works. I'll get back to this.
Actually, you can keep using …
I used a combination of both of these. I used …
@metalwarrior665 Please comment on why you reopened this.
Sorry. What is the strategy with useful questions that might repeat? Is it expected that people will search through closed issues, or do we incorporate this into some examples? I feel like it is a bit wasteful to answer this separately for each user.
I think it's expected that people will search through closed issues too. Open issues typically represent something that's not possible or implemented yet. When you google for solutions to problems, you often end up on closed issues rather than open ones. We could create an FAQ section in the docs or something, with very short Q/A type examples, and keep adding those there, but that's a story for another day :)
Can we add a request limit option per run() instead of per instance? It is very helpful for testing and development. Either manipulating userData or initializing multiple crawler instances is tedious work.
@teadrinker2015 Could you please provide an example of how this should work? I'm having a hard time imagining what exactly the issue is that you need solved.
@mnmkng This is my use case:

```js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'

const host = 'https://www.google.com'

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3, // This is instance scoped, or say a per-instance or per-crawler limitation.
    // maxRequestsPerRun: 3, // I suggest adding this option, responsible for each single call of 'crawler.run()'.
    async requestHandler({ request, page }) {
        console.log(request.url)
        const nextPageAnchor = page.locator('#pnnext')
        const nextPageUrl = host + (await nextPageAnchor.getAttribute('href'))
        await crawler.addRequests([nextPageUrl])
    },
})

await crawler.run([host + '/search?q=crawler']) // This will consume all 3 available tickets and prevent any following requests.
await crawler.run([host + '/search?q=scraper']) // I expect this to be executed and consume another 3. I know the …
```
Thanks, I get it now. There are some questions to be answered before we can add this, such as what we should do with the crawler's internal state between runs: discard it or keep it? @B4nan could someone quickly check, and if it's easy, could we add it?
It is not just a wrapper around … I don't think we want to add more options; this is more about allowing to run the … I am not against it, I was actually thinking about the same yesterday. It could be enough to just reset the stats object and maybe some other internal flags and caches. In case the crawler is still running, I would just throw. Will take a closer look.
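Until something like that lands, a workaround that can be enough for the use case above is to construct a fresh crawler for each query, so that `maxRequestsPerCrawl` effectively applies per run. A minimal sketch, assuming the queries differ between runs (the shared default request queue deduplicates URLs, so identical URLs would not be reprocessed):

```js
import { PlaywrightCrawler } from 'crawlee'

const host = 'https://www.google.com'

// Build a new crawler per query so the maxRequestsPerCrawl budget starts fresh each time.
const makeCrawler = () => new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ request, page, crawler }) {
        console.log(request.url)
        const nextPageHref = await page.locator('#pnnext').getAttribute('href')
        if (nextPageHref) await crawler.addRequests([host + nextPageHref])
    },
})

await makeCrawler().run([host + '/search?q=crawler'])
await makeCrawler().run([host + '/search?q=scraper'])
```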
Often users are trying to reuse a single crawler instance, which wasn't possible because the internal state was not completely reset in the `run` method. One example of this was use inside an Express request handler. This PR adds support for doing so by resetting the crawler stats and the default request queue. If the crawler is already running, an error is thrown, suggesting the use of `crawler.addRequests()` instead. Related: #765
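A minimal sketch of the pattern this enables, assuming `run()` now resets the stats and the default request queue between calls; the crawler class and the Express wiring are only illustrative:

```js
import express from 'express'
import { CheerioCrawler } from 'crawlee'

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`)
    },
})

const app = express()

app.get('/crawl', async (req, res) => {
    // The same crawler instance is reused for every incoming HTTP request.
    // A concurrent call would throw while a run is in progress, so a real
    // setup would serialize calls or fall back to crawler.addRequests().
    await crawler.run([String(req.query.url)])
    res.send('done')
})

app.listen(3000)
```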
I've been essentially using a combination of the recursive crawl example and the crawl all links example.
I'm crawling thousands of URLs with the Puppeteer crawler and outputting the URL I'm currently on. The problem is that if I use `maxRequestsPerCrawl: 10`, it outputs about 10 URLs and then stops the program entirely. I'd like to do just a small amount of crawling on each URL, then stop crawling that URL. Is there any way I can limit it like this? Thanks!
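One way to get this kind of per-URL limit, sketched here as an assumption rather than something stated in the thread: carry a depth counter in `request.userData` and only enqueue links while it is below a limit, so the crawl around each start URL stops on its own instead of the whole run hitting `maxRequestsPerCrawl`. The limit, the `userData` shape, and the start URL list are illustrative:

```js
import { PuppeteerCrawler } from 'crawlee'

const MAX_DEPTH_PER_START_URL = 2 // hypothetical per-start-URL limit

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(request.url)
        const depth = request.userData.depth ?? 0
        // Only follow links while this branch of the crawl is under its own limit.
        if (depth < MAX_DEPTH_PER_START_URL) {
            await enqueueLinks({ userData: { depth: depth + 1 } })
        }
    },
})

// Placeholder for the thousands of start URLs mentioned above.
const startUrls = ['https://example.com']
await crawler.run(startUrls.map((url) => ({ url, userData: { depth: 0 } })))
```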