-
For example, I initialized the crawler with a fixed set of options.
Now I want to change the maxRequestRetries, or change the proxy config dynamically based on my request URL. Is that possible?
-
Not explicitly, the property is protected, but if you don't mind suppressing the type error:
// @ts-ignore
crawler.maxRequestRetries = 10;
What you can do publicly is override the crawler default with the request-based https://crawlee.dev/api/core/interface/RequestOptions#maxRetries
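The precedence between the crawler-wide default and a request-level override can be sketched without Crawlee itself. Note that effectiveRetries is a hypothetical helper, not a Crawlee API; it only illustrates that a maxRetries set on RequestOptions takes priority over the crawler's maxRequestRetries:

```javascript
// Minimal sketch (not Crawlee itself) of how a per-request maxRetries
// override takes precedence over a crawler-wide default.
const CRAWLER_DEFAULT_MAX_RETRIES = 3;

// Hypothetical helper: returns the retry budget a request would get.
function effectiveRetries(request) {
  // A request-level maxRetries wins; otherwise fall back to the default.
  return request.maxRetries ?? CRAWLER_DEFAULT_MAX_RETRIES;
}

console.log(effectiveRetries({ url: 'http://example.com' })); // 3
console.log(effectiveRetries({ url: 'http://flaky.com', maxRetries: 10 })); // 10
```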
-
Changing the proxy config based on the request URL would be difficult to do. I believe your best option would be to have multiple crawlers (separate crawler instances), each with its own proxy configuration.
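A hedged sketch of the multiple-crawler idea: route each URL into one of several buckets, one per proxy setup. routeRequest and the bucket names are hypothetical, purely to illustrate the dispatch; in a real setup each bucket would be fed to its own crawler instance (e.g. via that crawler's addRequests()):

```javascript
// Sketch only: dispatch URLs to separate buckets, one per crawler/proxy setup.
const buckets = { proxied: [], direct: [] };

function routeRequest(url) {
  // Hypothetical rule: only crawlee.dev goes through the proxied crawler.
  const bucket = new URL(url).hostname === 'crawlee.dev' ? 'proxied' : 'direct';
  buckets[bucket].push(url);
  return bucket;
}

routeRequest('http://crawlee.dev/docs');
routeRequest('http://example.com');
console.log(buckets.proxied.length, buckets.direct.length); // 1 1
```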
-
@B4nan @janbuchar In my use case, I have created multiple crawler instances, and using env variables I figure out which crawler instance should process my requests. However, the problem is that I then have to boot many servers and keep them all running, each looking at a different request queue and proxy config. I need a way to have a preNavigation hook where I check what my request is and, based on that, change my crawler to either use a proxy or not, change retry properties, etc. I think the problem stems from the fact that I need a server running with keepAlive: true to listen to requests in my v2 queue. Is there a better way?
-
Hello @harshmaur ! What you're describing should be easily possible with basic Crawlee classes and methods:
1) Dynamic proxy based on URL
new ProxyConfiguration({
newUrlFunction: (sessionId, options) => {
switch (options?.request?.url) {
case 'http://apify.com':
return 'http://localhost:8000';
case 'http://crawlee.dev':
return 'http://otherproxy.com';
default:
return null // returning null means that the request will be made without a proxy
}
},
});
2) Variable maxRetries per request
...
requestHandler: async ({ request }) => {
if(request.url.startsWith('http://example.com')) {
request.maxRetries = 2
} else {
request.maxRetries = 5
}
}
...
Combined, you can have a crawler that looks like this:
const crawler = new PlaywrightCrawler({
proxyConfiguration: new ProxyConfiguration({
newUrlFunction: (_, options) => {
switch (options?.request?.url) {
case 'http://crawlee.dev':
return 'http://otherproxy.com';
default:
return null // returning null means that the request will be made without a proxy
}
},
}),
maxRequestRetries: 5, // default value for the non-overridden requests
requestHandler: async ({ request }) => {
if(request.url.startsWith('http://example.com')) {
request.maxRetries = 2
} else {
request.maxRetries = 5
}
// something useful here
},
});
-
@barjin the newUrlFunction forces me to return a URL; I basically want to use a proxy list instead. A proxy list allows me to keep using the same proxy URL and lets me retire a session so that I can move to the next URL automatically. With a newUrlFunction, I need to figure out myself which URL to use from my list of proxies and ensure that the same URL keeps being used for sticky sessions.
-
You can work around this by having multiple ProxyConfiguration objects and delegating the newUrlFunction calls to them:
const oneProxyList = new ProxyConfiguration({
proxyUrls: ['http://proxy1.com', 'http://proxy2.com'],
});
const anotherProxyList = new ProxyConfiguration({
proxyUrls: ['http://proxy4.com', 'http://proxy5.com'],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration: new ProxyConfiguration({
newUrlFunction: async (sessionId, options) => {
switch (options?.request?.url) {
case 'http://crawlee.dev':
return await oneProxyList.newUrl(sessionId) ?? null;
case 'http://another.domain':
return await anotherProxyList.newUrl(sessionId) ?? null;
default:
return null
}
},
    }),
});
If you pass the sessionId through to the nested newUrl() calls, as in the snippet above, sessions stay sticky: the same session keeps getting the same proxy URL until you retire it.
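The sticky-session behavior this relies on can be sketched in isolation. StickyProxyList below is a stand-in, not Crawlee's ProxyConfiguration: it rotates round-robin for new sessions but always returns the same URL for a known sessionId, which is the behavior you get by forwarding sessionId to newUrl():

```javascript
// Stand-in for Crawlee's proxy rotation: round-robin across the list,
// but sticky per sessionId, so a session keeps its proxy until retired.
class StickyProxyList {
  constructor(proxyUrls) {
    this.proxyUrls = proxyUrls;
    this.nextIndex = 0;
    this.sessions = new Map(); // sessionId -> proxy URL
  }

  newUrl(sessionId) {
    if (sessionId && this.sessions.has(sessionId)) {
      return this.sessions.get(sessionId); // sticky: same session, same proxy
    }
    const url = this.proxyUrls[this.nextIndex % this.proxyUrls.length];
    this.nextIndex += 1;
    if (sessionId) this.sessions.set(sessionId, url);
    return url;
  }

  retire(sessionId) {
    // Retiring the session lets it pick up a fresh proxy on the next call.
    this.sessions.delete(sessionId);
  }
}

const list = new StickyProxyList(['http://proxy1.com', 'http://proxy2.com']);
console.log(list.newUrl('s1')); // http://proxy1.com
console.log(list.newUrl('s1')); // http://proxy1.com (sticky)
console.log(list.newUrl('s2')); // http://proxy2.com
```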