Make the back end crawler configurable #6495

Merged
merged 10 commits into contao:5.x from feature/backend-crawl-config on Nov 9, 2023

Conversation

@leofeyer (Member) commented Nov 8, 2023

Implements #3809

As discussed, all parameters can now be configured in the container configuration and the maximum depth can also be selected in a drop-down menu.
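
As an illustration, such a configuration could look roughly like this. The `contao.crawl.concurrency` key is the name mentioned later in this discussion; the `max_depth` key and the values are assumptions rather than the final names:

```yaml
# config/config.yaml (sketch only, key names may differ from the merged implementation)
contao:
    crawl:
        # maximum number of concurrent requests the crawler may send
        concurrency: 10
        # maximum depth that can be selected in the back end drop-down menu
        max_depth: 10
```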

@leofeyer added this to the 5.3 milestone Nov 8, 2023
@leofeyer self-assigned this Nov 8, 2023
@leofeyer requested review from Toflar and a team November 8, 2023 10:05
@leofeyer linked an issue Nov 8, 2023 that may be closed by this pull request

@Toflar (Member) commented Nov 8, 2023

Thanks for working on this! Remarks:

I think part of the reason this was not tackled until now is that making only the max depth configurable isn't really going to fix the entire problem. The most important factor is concurrency, because that is what affects speed. And if people still need to adjust the configuration, they likely won't do it (as they don't even know this is possible). So effectively, nobody will change anything and it will still be slow for people in 5.3.

I think what would be way smarter would be two configuration options:

  • max_allowed_max_depth
  • max_allowed_concurrency

The minimum can be hardcoded, as it is imho always 3 for max depth (lower never makes sense) and 1 for concurrency, for obvious reasons.

These limits should also be respected by the CrawlCommand. They should be hardcoded, system-wide restrictions so that I, as the responsible admin, can limit users (both back end and CLI users) to values that make sense and that the server is capable of processing.
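
For example, a crawl from the command line might then look like this (a sketch: the `--concurrency` and `--max-depth` option names follow the existing `contao:crawl` command, and capping values at the configured system-wide maximum is the behaviour proposed here, not something implemented yet):

```sh
# ask for 8 concurrent requests and a maximum depth of 5;
# values above the configured system-wide limits would be rejected or capped (assumed behaviour)
php vendor/bin/contao-console contao:crawl --concurrency=8 --max-depth=5
```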

So I would also like to have the concurrency selectable in the back end. You may ask why anybody would ever want less concurrency than the configured maximum? That's a reasonable question. I just think there are a ton of users out there who use Contao as is and would never reconfigure things in config.yaml. If we use, say, 10 as the default max_allowed_concurrency and they try that and it doesn't work, I bet loads of users will then try 8, and if that works, they'll just remember that they never have to go higher than 8. They wouldn't go to config.yaml and reconfigure it, though.

Hence, imho two drop-downs would be super nice, also indicating to users that there is actually a possibility to change this (and if people want more than 10, they will ask and will find that there's an option in config.yaml).

The best case for UX would also include remembering the last used options in the back end session, so I don't have to reconfigure them every time I log back in.

@leofeyer (Member, Author) commented Nov 8, 2023

We already discussed this and decided against adding a select menu for the concurrency (see #3809 (comment)). Should we discuss it again?

@Toflar (Member) commented Nov 8, 2023

We don't have to. I'm just looking for a way to inform users about the concurrency that is being used. Maybe a simple hint in the back end while the crawler is running would be enough? What do you think about always showing this:

> Your website is currently being crawled with a maximum depth of %d and %d concurrent requests. Depending on the settings and the size of your website, this may take a while. If your server can serve more than %d concurrent requests, you can ask your system administrator to increase the contao.crawl.concurrency setting in the system configuration. Moreover, make sure you analyze the debug log to see exactly which URLs are crawled. For example, you can ask Contao to skip links from being crawled using the data-skip-search-index HTML attribute. Please refer to the docs for more information.
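
For context, the attribute mentioned in the text goes directly on a link in the template; a minimal sketch (the link target is made up, and using it as a bare attribute without a value is an assumption):

```html
<!-- the crawler will skip this link, e.g. a pagination or filter URL -->
<a href="/news?page=2" data-skip-search-index>Next page</a>
```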

@fritzmg (Contributor) commented Nov 8, 2023

Judging from the situations in the community, I think crawling is mostly slow because too many URLs are crawled unnecessarily, since pagination and filter query parameter URLs increase the number of URLs exponentially (when using an unbounded/high max depth), not because the concurrency setting is too low.

That being said, I am not against adding the concurrency setting (or info about it) to the back end.

@Toflar (Member) commented Nov 8, 2023

Yes, I know, which is why my text included a note about how you can skip those :)

@leofeyer (Member, Author) commented Nov 8, 2023

You mean something like this?

I couldn't find the data-skip-search-index attribute in the docs, though.

@Toflar (Member) commented Nov 8, 2023

I would output the text only when the crawler is running - otherwise it bloats the maintenance screen :)

> I couldn't find the data-skip-search-index attribute in the docs, though.

Would need to be added, yes.

@leofeyer (Member, Author) commented Nov 8, 2023

Like this then?

It will only be shown while the crawler is running.

@Toflar (Member) commented Nov 9, 2023

Yeah, with the real values of course :D And I think I would leave the hint about the docs in. I will extend them.

@leofeyer (Member, Author) commented Nov 9, 2023

Added in bdfc1b2. We can add the docs hint at any time once the docs have been updated.

@Toflar (Member) commented Nov 9, 2023

You can already add it; I'm writing the docs as we speak. We can only merge this once the docs PR is merged :)

@Toflar (Member) commented Nov 9, 2023

Here we go with the docs: contao/docs#1260

Toflar previously approved these changes Nov 9, 2023
@leofeyer merged commit 1595cbc into contao:5.x Nov 9, 2023
16 checks passed
@leofeyer deleted the feature/backend-crawl-config branch November 9, 2023 12:32
Development

Successfully merging this pull request may close these issues.

Configuring the crawler in the back end
3 participants