-
For example, I initialized the crawler with a fixed set of options.
Now I want to change the maxRequestRetries, or change the proxy config dynamically based on my request URL. Is that possible?
-
Not explicitly, the property is protected, but if you don't mind suppressing the type error:
// @ts-ignore
crawler.maxRequestRetries = 10;
What you can do publicly is override the crawler default with the request-based https://crawlee.dev/api/core/interface/RequestOptions#maxRetries
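The precedence between the crawler-wide default and a request-level override can be sketched without Crawlee itself. Note that effectiveRetries is a hypothetical helper, not a Crawlee API; it only illustrates that a maxRetries set on RequestOptions takes priority over the crawler's maxRequestRetries:

```javascript
// Minimal sketch (not Crawlee itself) of how a per-request maxRetries
// override takes precedence over a crawler-wide default.
const CRAWLER_DEFAULT_MAX_RETRIES = 3;

// Hypothetical helper: returns the retry budget a request would get.
function effectiveRetries(request) {
  // A request-level maxRetries wins; otherwise fall back to the default.
  return request.maxRetries ?? CRAWLER_DEFAULT_MAX_RETRIES;
}

console.log(effectiveRetries({ url: 'http://example.com' })); // 3
console.log(effectiveRetries({ url: 'http://flaky.com', maxRetries: 10 })); // 10
```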
-
Changing the proxy config based on the request URL would be difficult to do. I believe your best option would be to have multiple crawlers (separate crawler instances), each with its own proxy configuration.
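A hedged sketch of the multiple-crawler idea: route each URL into one of several buckets, one per proxy setup. routeRequest and the bucket names are hypothetical, purely to illustrate the dispatch; in a real setup each bucket would be fed to its own crawler instance (e.g. via that crawler's addRequests()):

```javascript
// Sketch only: dispatch URLs to separate buckets, one per crawler/proxy setup.
const buckets = { proxied: [], direct: [] };

function routeRequest(url) {
  // Hypothetical rule: only crawlee.dev goes through the proxied crawler.
  const bucket = new URL(url).hostname === 'crawlee.dev' ? 'proxied' : 'direct';
  buckets[bucket].push(url);
  return bucket;
}

routeRequest('http://crawlee.dev/docs');
routeRequest('http://example.com');
console.log(buckets.proxied.length, buckets.direct.length); // 1 1
```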
-
@B4nan @janbuchar In my use case, I have created multiple crawler instances, and using env variables I figure out which crawler instance should process my requests. However, the problem is that I then have to boot many servers and keep them all running, each looking at a different request queue and proxy config. I need a way to have a preNavigation hook where I check what my request is and, based on that, change my crawler to either use a proxy or not, change retry properties, etc. I think the problem stems from the fact that I need a server running with keepAlive: true to listen to requests in my v2 queue. Is there a better way?
-
Hello @harshmaur ! What you're describing should be easily possible with basic Crawlee classes and methods:
1) Dynamic proxy based on URL
new ProxyConfiguration({
newUrlFunction: (sessionId, options) => {
switch (options?.request?.url) {
case 'http://apify.com':
return 'http://localhost:8000';
case 'http://crawlee.dev':
return 'http://otherproxy.com';
default:
return null // returning null means that the request will be made without a proxy
}
},
});
2) Variable maxRetries per request
...
requestHandler: async ({ request }) => {
if(request.url.startsWith('http://example.com')) {
request.maxRetries = 2
} else {
request.maxRetries = 5
}
}
...
Combined, you can have a crawler that looks like this:
const crawler = new PlaywrightCrawler({
proxyConfiguration: new ProxyConfiguration({
newUrlFunction: (_, options) => {
switch (options?.request?.url) {
case 'http://crawlee.dev':
return 'http://otherproxy.com';
default:
return null // returning null means that the request will be made without a proxy
}
},
}),
maxRequestRetries: 5, // default value for the non-overridden requests
requestHandler: async ({ request }) => {
if(request.url.startsWith('http://example.com')) {
request.maxRetries = 2
} else {
request.maxRetries = 5
}
// something useful here
},
});
-
@barjin the newUrlFunction forces me to return a URL; I basically want to use a proxy list instead. A proxy list allows me to keep using the same proxy URL and lets me retire a session so that I can move to the next URL automatically. With a newUrlFunction, I need to figure out myself which URL to use from my list of proxies and ensure that the same URL keeps being used for sticky sessions.
-
You can work around this by having multiple ProxyConfiguration objects and delegating the newUrlFunction calls to them:
const oneProxyList = new ProxyConfiguration({
proxyUrls: ['http://proxy1.com', 'http://proxy2.com'],
});
const anotherProxyList = new ProxyConfiguration({
proxyUrls: ['http://proxy4.com', 'http://proxy5.com'],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration: new ProxyConfiguration({
newUrlFunction: async (sessionId, options) => {
switch (options?.request?.url) {
case 'http://crawlee.dev':
return await oneProxyList.newUrl(sessionId) ?? null;
case 'http://another.domain':
return await anotherProxyList.newUrl(sessionId) ?? null;
default:
return null
}
},
    }),
});
If you pass the sessionId through to the nested newUrl() calls, as in the snippet above, sessions stay sticky: the same session keeps getting the same proxy URL until you retire it.
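The sticky-session behavior this relies on can be sketched in isolation. StickyProxyList below is a stand-in, not Crawlee's ProxyConfiguration: it rotates round-robin for new sessions but always returns the same URL for a known sessionId, which is the behavior you get by forwarding sessionId to newUrl():

```javascript
// Stand-in for Crawlee's proxy rotation: round-robin across the list,
// but sticky per sessionId, so a session keeps its proxy until retired.
class StickyProxyList {
  constructor(proxyUrls) {
    this.proxyUrls = proxyUrls;
    this.nextIndex = 0;
    this.sessions = new Map(); // sessionId -> proxy URL
  }

  newUrl(sessionId) {
    if (sessionId && this.sessions.has(sessionId)) {
      return this.sessions.get(sessionId); // sticky: same session, same proxy
    }
    const url = this.proxyUrls[this.nextIndex % this.proxyUrls.length];
    this.nextIndex += 1;
    if (sessionId) this.sessions.set(sessionId, url);
    return url;
  }

  retire(sessionId) {
    // Retiring the session lets it pick up a fresh proxy on the next call.
    this.sessions.delete(sessionId);
  }
}

const list = new StickyProxyList(['http://proxy1.com', 'http://proxy2.com']);
console.log(list.newUrl('s1')); // http://proxy1.com
console.log(list.newUrl('s1')); // http://proxy1.com (sticky)
console.log(list.newUrl('s2')); // http://proxy2.com
```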