Make the back end crawler configurable #6495

Merged
merged 10 commits into contao:5.x from feature/backend-crawl-config on Nov 9, 2023

Conversation

@leofeyer (Member) commented Nov 8, 2023

Implements #3809

As discussed, all parameters can now be configured in the container configuration and the maximum depth can also be selected in a drop-down menu.
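
As an illustration, such a configuration could look roughly like this. The `contao.crawl.concurrency` key is the name mentioned later in this discussion; the `max_depth` key and the values are assumptions rather than the final names:

```yaml
# config/config.yaml (sketch only, key names may differ from the merged implementation)
contao:
    crawl:
        # maximum number of concurrent requests the crawler may send
        concurrency: 10
        # maximum depth that can be selected in the back end drop-down menu
        max_depth: 10
```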

@leofeyer added this to the 5.3 milestone Nov 8, 2023
@leofeyer self-assigned this Nov 8, 2023
@leofeyer requested review from Toflar and a team November 8, 2023 10:05
@leofeyer linked an issue Nov 8, 2023 that may be closed by this pull request

@Toflar (Member) commented Nov 8, 2023

Thanks for working on this! Remarks:

I think part of the reason this was not tackled until now is that making only the max depth configurable isn't really going to fix the entire problem. The most important factor is concurrency, because that is what affects speed. And if people still need to adjust the configuration, they likely won't do it (as they don't even know this is possible). So effectively, nobody will change anything and it will still be slow for people in 5.3.

I think what would be way smarter would be two configuration options:

  • max_allowed_max_depth
  • max_allowed_concurrency

The minimum can be hardcoded, as it is imho always 3 for max depth (lower never makes sense) and 1 for concurrency, for obvious reasons.

These limits should also be respected by the CrawlCommand. They should be hardcoded, system-wide restrictions so that I, as the responsible admin, can limit users (both back end and CLI users) to values that make sense and that the server is capable of processing.
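
For example, a crawl from the command line might then look like this (a sketch: the `--concurrency` and `--max-depth` option names follow the existing `contao:crawl` command, and capping values at the configured system-wide maximum is the behaviour proposed here, not something implemented yet):

```sh
# ask for 8 concurrent requests and a maximum depth of 5;
# values above the configured system-wide limits would be rejected or capped (assumed behaviour)
php vendor/bin/contao-console contao:crawl --concurrency=8 --max-depth=5
```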

So I would also like to have the concurrency selectable in the back end. You may ask why anybody would ever want less concurrency than the configured maximum? That's a reasonable question. I just think there are a ton of users out there who use Contao as is and would never reconfigure things in config.yaml. If we use, say, 10 as the default max_allowed_concurrency and they try that and it doesn't work, I bet loads of users will then try 8, and if that works, they'll just remember that they never have to go higher than 8. They wouldn't go to config.yaml and reconfigure it, though.

Hence, imho two drop-downs would be super nice, also indicating to users that there is actually a possibility to change this (and if people want more than 10, they will ask and will find that there's an option in config.yaml).

The best case for UX would also include remembering the last used options in the back end session, so I don't have to reconfigure them every time I log back in.

@leofeyer (Member, Author) commented Nov 8, 2023

We already discussed this and decided against adding a select menu for the concurrency (see #3809 (comment)). Should we discuss it again?

@Toflar (Member) commented Nov 8, 2023

We don't have to. I'm just looking for a way to inform users about the concurrency that is being used. Maybe a simple hint in the back end while the crawler is running would be enough? What do you think about always showing this:

> Your website is currently being crawled with a maximum depth of %d and %d concurrent requests. Depending on the settings and the size of your website, this may take a while. If your server can serve more than %d concurrent requests, you can ask your system administrator to increase the contao.crawl.concurrency setting in the system configuration. Moreover, make sure you analyze the debug log to see exactly which URLs are crawled. For example, you can ask Contao to skip links from being crawled using the data-skip-search-index HTML attribute. Please refer to the docs for more information.
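
For context, the attribute mentioned in the text goes directly on a link in the template; a minimal sketch (the link target is made up, and using it as a bare attribute without a value is an assumption):

```html
<!-- the crawler will skip this link, e.g. a pagination or filter URL -->
<a href="/news?page=2" data-skip-search-index>Next page</a>
```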

@fritzmg (Contributor) commented Nov 8, 2023

Judging from the situations in the community, I think crawling is mostly slow because too many URLs are crawled unnecessarily, since pagination and filter query parameter URLs increase the number of URLs exponentially (when using an unbounded/high max depth), not because the concurrency setting is too low.

That being said, I am not against adding the concurrency setting (or info about it) to the back end.

@Toflar (Member) commented Nov 8, 2023

Yes, I know, which is why my text included a note about how you can skip those :)

@leofeyer (Member, Author) commented Nov 8, 2023

You mean something like this?

I couldn't find the data-skip-search-index attribute in the docs, though.

@Toflar (Member) commented Nov 8, 2023

I would output the text only when the crawler is running - otherwise it bloats the maintenance screen :)

> I couldn't find the data-skip-search-index attribute in the docs, though.

Would need to be added, yes.

@leofeyer (Member, Author) commented Nov 8, 2023

Like this then?

It will only be shown while the crawler is running.

@Toflar (Member) commented Nov 9, 2023

Yeah, with the real values of course :D And I think I would leave the hint about the docs in. I will extend them.

@leofeyer (Member, Author) commented Nov 9, 2023

Added in bdfc1b2. We can add the docs hint at any time once the docs have been updated.

@Toflar (Member) commented Nov 9, 2023

You can already add it; I'm writing the docs as we speak. We can only merge this once the docs PR is merged :)

@Toflar (Member) commented Nov 9, 2023

Here we go with the docs: contao/docs#1260

Toflar previously approved these changes Nov 9, 2023
@leofeyer merged commit 1595cbc into contao:5.x Nov 9, 2023
16 checks passed
@leofeyer deleted the feature/backend-crawl-config branch November 9, 2023 12:32
Development

Successfully merging this pull request may close these issues.

Configuring the crawler in the back end
3 participants