Add a back end maintenance task for the crawler #1057
```diff
@@ -11,6 +11,7 @@
 <meta name="generator" content="Contao Open Source CMS">
 <meta name="viewport" content="width=device-width,initial-scale=1.0,shrink-to-fit=no">
 <meta name="referrer" content="origin">
+<meta name="robots" content="noindex, nofollow">
```
Why is it necessary to add this? The `/contao` route is exempt in the `robots.txt` file, and no crawler will ever be able to see this template without being logged into the back end.
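For reference, such an exemption would look roughly like this in a `robots.txt` file (a hypothetical excerpt; the directives Contao actually generates may differ):

```
# Hypothetical robots.txt sketch, not the rules Contao emits
User-agent: *
Disallow: /contao/
```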
You are logged into the back end if you authenticate with a front end user. There's no possibility to authenticate as a FE user statelessly at the moment.

And `/contao` is not exempt by the `robots.txt` apparently, otherwise Escargot would not crawl it.
> You are logged into the back end if you authenticate with a front end user.

What? That would be a major security hole!
You misunderstood. I mean when you rebuild the search index from the back end, you are authenticated as a back end user and front end user at the same time. That's why.
I see.

> And `/contao` is not exempt by the `robots.txt` apparently, otherwise Escargot would not crawl it.

You have to delete the `web/robots.txt` file so the request triggers the new `robots.txt` route. It seems we forgot to remove the file when we merged #717. 🙈
No, it was never there: https://github.com/contao/contao/pull/717/files#diff-a793700f1c87e1a3a0d01553e0c01f39
But it would make sense, yes. I'll check.
Aaah, I see what you mean now. You're right. I had to remove the file and it's all good now.
However, I don't think deleting it is a good idea. The migration would need to copy the contents into the root pages so a possible previous configuration doesn't get lost.
And I would still like to keep the meta tags because they aren't wrong :)
I have made the necessary adjustments in aef6723. Here are some screenshots:
There are still some things which need to be fixed though:

- The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.
- A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond that just made the process longer without improving the result.
- The success and warning messages are in English and there is no way to translate them.
- I am not yet happy with the "download debug log" button. If the warning message were translatable, we could integrate the link into the message, which I would prefer.
- I still think that we do not need meta robots tags in the back end.
Another thing that I am not happy with is using … Can we create the files in …?

We need a temp dir that's cleaned up, and …

Also …
Needs to be fixed in Escargot: terminal42/escargot#9
I disagree. In my tests it actually misses quite a lot with a depth of only 3. But we might configure 8 or so. Would that be okay?
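To illustrate the trade-off being discussed: if every page links to even a handful of previously unseen pages, the number of URIs reachable within a given depth grows geometrically, so a moderate depth already covers very large sites. A back-of-the-envelope sketch (illustrative only, not Escargot code):

```python
def reachable_upper_bound(branching: int, depth: int) -> int:
    """Upper bound on distinct URIs reachable within `depth` clicks,
    assuming every page links to `branching` previously unseen pages."""
    return sum(branching ** d for d in range(depth + 1))


# With a branching factor of 5, depth 3 bounds coverage at 156 pages,
# while depth 8 already bounds it at 488,281 pages:
print(reachable_upper_bound(5, 3))  # 156
print(reachable_upper_bound(5, 8))  # 488281
```

Real sites revisit the same URIs from many pages, so actual coverage saturates well below this bound, which is why depths much beyond 8 mostly add crawl time rather than new pages.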
Fixed in 68e636a.
IMHO that log should go to the top right or so, because it's a general log file and right now it looks as if it belonged to a specific subscriber.
Removed in 68e636a.

Thanks a lot @Toflar.

WOOOOHOOOO 🎉 🎉 🎉 🎉

I am currently testing this in Contao 4.9.x-dev: should the crawler also index the "Website root" page (found on level 0)?

Sure, but it depends on many factors. The debug log will tell.
This is a first draft and I really need some help from here on, I guess :)

This is how it looks at the moment:

To-dos: