
Add a back end maintenance task for the crawler #1057

Merged 19 commits on Jan 9, 2020
Conversation

@Toflar (Member) commented Dec 4, 2019

This is a first draft and I really need some help from here on I guess :)

This is how it looks at the moment:

[Screenshot: 2019-12-04 at 18:03:03]

To-dos:

  • Bring back the FE member authentication. We need to improve this in a future version to have stateless authentication but for now, it should at least continue to work as before.
  • Styling. It's all just some inline styles and not really pretty. I'm just really lacking the skills here.
  • What configuration options should we provide? All of them according to the crawl command? Does it even make sense to let the users decide on concurrency etc.?
  • Provide subscriber-specific logs (for level info+)

@Toflar Toflar added the feature label Dec 4, 2019
@Toflar Toflar added this to the 4.9 milestone Dec 4, 2019
@Toflar Toflar requested a review from leofeyer December 4, 2019 17:09
@Toflar Toflar self-assigned this Dec 4, 2019
@@ -11,6 +11,7 @@
<meta name="generator" content="Contao Open Source CMS">
<meta name="viewport" content="width=device-width,initial-scale=1.0,shrink-to-fit=no">
<meta name="referrer" content="origin">
<meta name="robots" content="noindex, nofollow">
@leofeyer (Member):
Why is it necessary to add this? The /contao route is exempt in the robots.txt file and no crawler will ever be able to see this template without being logged into the back end.

@Toflar (Member, author):

You are logged into the back end if you authenticate with a front end user. There's no possibility to authenticate as a FE user statelessly at the moment.
And /contao is apparently not exempted by the robots.txt, otherwise Escargot would not crawl it.

@leofeyer (Member) commented Jan 8, 2020:

> You are logged into the back end if you authenticate with a front end user.

What? That would be a major security hole!

@Toflar (Member, author):

You misunderstood. I mean when you rebuild the search index from the back end, you are authenticated as a back end user and front end user at the same time. That's why.

@leofeyer (Member):

I see.

> And /contao is apparently not exempted by the robots.txt, otherwise Escargot would not crawl.

You have to delete the web/robots.txt file so the request triggers the new robots.txt route. It seems we forgot to remove the file when we merged #717. 🙈
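For context, an exemption like the one discussed here would be a plain robots.txt rule. A hypothetical fragment (not the actual rules served by Contao's dynamic robots.txt route) disallowing the back end route could look like this:

```
# Hypothetical fragment: keep crawlers out of the back end route
User-agent: *
Disallow: /contao/
```

Note that a static web/robots.txt file takes precedence over the dynamic route, which is exactly the shadowing problem described above.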

@Toflar (Member, author):

No, it was never there: https://github.com/contao/contao/pull/717/files#diff-a793700f1c87e1a3a0d01553e0c01f39
But it would make sense, yes. I'll check.

@Toflar (Member, author):

Aaah, I see what you mean now. You're right. I had to remove the file and it's all good now.
However, I don't think deleting it is a good idea. The migration would need to copy the contents into the root pages so a possible previous configuration doesn't get lost.

@Toflar (Member, author):

And I would still like to keep the meta tags because they aren't wrong :)

@leofeyer (Member) left a review comment:

I have made the necessary adjustments in aef6723. [Screenshots omitted.]

There are still some things that need to be fixed though:

  1. The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.

  2. A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond just made the process longer without improving the result.

  3. The success and warning messages are in English and there is no way to translate them.

  4. I am not yet happy with the "download debug log" button. If the warning message was translatable, we could integrate the link into the message, which I would prefer.

  5. I still think that we do not need meta robots tags in the back end.

@leofeyer commented Jan 9, 2020

Another thing that I am not happy with is using sys_get_temp_dir(). We have had multiple issues with it in the past (open_basedir, write permissions etc.), therefore we are only using it in unit tests.

Can we create the files in var/ or another folder within the installation instead?

@Toflar commented Jan 9, 2020

We need a temp dir that is cleaned up automatically, and sys_get_temp_dir() is the only one we have. Files created in var/ would never be removed and would clutter the system.

@Toflar commented Jan 9, 2020

Also, sys_get_temp_dir() was created exactly for this use case. It has to be configured correctly; that's just a hard requirement. Even Symfony contains loads of `sys_get_temp_dir()` calls, and other bundles likely do too.
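As an illustration (a minimal sketch, not code from this PR), the pattern being defended here is creating a short-lived file in the system temp dir and removing it when the run finishes. It assumes the temp dir is configured correctly (open_basedir, write permissions):

```php
<?php
// Sketch: write a crawler debug log to the system temp dir and clean it
// up afterwards. tempnam() creates a uniquely named file for us.
$path = tempnam(sys_get_temp_dir(), 'contao-crawl-');

file_put_contents($path, "crawl debug log\n");
$contents = file_get_contents($path);

// Delete the file once the crawl run is finished so nothing lingers.
unlink($path);
```

If the file were created in var/ instead, nothing would ever delete it, which is the clutter concern raised above.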

@Toflar commented Jan 9, 2020

> 1. The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.

Needs to be fixed in Escargot: terminal42/escargot#9
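For background on the error: a crawled URI can exceed the bounded length of the `uri` column (the column name comes from the error message). A common mitigation, sketched here as an assumption and not necessarily what Escargot ended up doing, is to key rows on a fixed-length hash of the URI:

```php
<?php
// Sketch: an arbitrarily long URI overflows a bounded VARCHAR column,
// while a fixed-length hash always fits (e.g. a CHAR(40) column).
$uri = 'https://example.com/' . str_repeat('segment/', 500); // ~4000 chars

$uriHash = sha1($uri); // always 40 hex characters
```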

> 2. A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond just made the process longer without improving the result.

I disagree. In my tests it actually misses quite a lot with a depth of only 3. But we might configure 8 or so. Would that be okay?

> 3. The success and warning messages are in English and there is no way to translate them.

Fixed in 68e636a.

> 4. I am not yet happy with the "download debug log" button. If the warning message was translatable, we could integrate the link into the message, which I would prefer.

Imho that log should go to the top right or so, because it's a general log file and right now it looks as if it belonged to a certain subscriber.

> 5. I still think that we do not need meta robots tags in the back end.

Removed in 68e636a.

@leofeyer leofeyer merged commit 2aa01f5 into master Jan 9, 2020
@leofeyer leofeyer deleted the feature/crawl-backend branch January 9, 2020 16:26
@leofeyer commented Jan 9, 2020

Thanks a lot @Toflar.

@Toflar commented Jan 9, 2020

WOOOOHOOOO 🎉 🎉 🎉 🎉

@leofeyer leofeyer changed the title Back end implementation for the crawler Add a back end maintenance task for the crawler Jan 10, 2020
@xchs (Contributor) commented Jan 21, 2020

I am currently testing this in Contao 4.9.x-dev: should the crawler also index the "Website root" page (Found on level: 0)?

@Toflar commented Jan 21, 2020

Sure, but it depends on many factors. The debug log will tell you.
