
Add a back end maintenance task for the crawler #1057

Merged 19 commits on Jan 9, 2020
Conversation

@Toflar (Member) commented Dec 4, 2019

This is a first draft and I really need some help from here on I guess :)

This is how it looks at the moment:

[Screenshot: 2019-12-04 at 18:03:03]

To-dos:

  • Bring back the FE member authentication. We need to improve this in a future version to have stateless authentication but for now, it should at least continue to work as before.
  • Styling. It's all just some inline styles and not really pretty. I'm just really lacking the skills here.
  • What configuration options should we provide? All of them according to the crawl command? Does it even make sense to let the users decide on concurrency etc.?
  • Provide subscriber-specific logs (for level info+)

@Toflar Toflar added the feature label Dec 4, 2019
@Toflar Toflar added this to the 4.9 milestone Dec 4, 2019
@Toflar Toflar requested a review from leofeyer December 4, 2019 17:09
@Toflar Toflar self-assigned this Dec 4, 2019
@@ -11,6 +11,7 @@
<meta name="generator" content="Contao Open Source CMS">
<meta name="viewport" content="width=device-width,initial-scale=1.0,shrink-to-fit=no">
<meta name="referrer" content="origin">
<meta name="robots" content="noindex, nofollow">
@leofeyer (Member):
Why is it necessary to add this? The /contao route is exempt in the robots.txt file and no crawler will ever be able to see this template without being logged into the back end.

@Toflar (Member, author):

You are logged into the back end if you authenticate with a front end user. There's no possibility to authenticate as a FE user statelessly at the moment.
And /contao is apparently not exempted by the robots.txt, otherwise Escargot would not crawl it.

@leofeyer (Member) commented Jan 8, 2020:

> You are logged into the back end if you authenticate with a front end user.

What? That would be a major security hole!

@Toflar (Member, author):

You misunderstood. I mean when you rebuild the search index from the back end, you are authenticated as a back end user and front end user at the same time. That's why.

@leofeyer (Member):

I see.

> And /contao is apparently not exempted by the robots.txt, otherwise Escargot would not crawl.

You have to delete the web/robots.txt file so the request triggers the new robots.txt route. It seems we forgot to remove the file when we merged #717. 🙈
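For context, an exemption like the one discussed here would be a plain robots.txt rule. A hypothetical fragment (not the actual rules served by Contao's dynamic robots.txt route) disallowing the back end route could look like this:

```
# Hypothetical fragment: keep crawlers out of the back end route
User-agent: *
Disallow: /contao/
```

Note that a static web/robots.txt file takes precedence over the dynamic route, which is exactly the shadowing problem described above.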

@Toflar (Member, author):

No, it was never there: https://github.com/contao/contao/pull/717/files#diff-a793700f1c87e1a3a0d01553e0c01f39
But it would make sense, yes. I'll check.

@Toflar (Member, author):

Aaah, I see what you mean now. You're right. I had to remove the file and it's all good now.
However, I don't think deleting it is a good idea. The migration would need to copy the contents into the root pages so a possible previous configuration doesn't get lost.

@Toflar (Member, author):

And I would still like to keep the meta tags because they aren't wrong :)

@leofeyer (Member) left a review comment:

I have made the necessary adjustments in aef6723. [Screenshots omitted.]

There are still some things that need to be fixed though:

  1. The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.

  2. A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond just made the process longer without improving the result.

  3. The success and warning messages are in English and there is no way to translate them.

  4. I am not yet happy with the "download debug log" button. If the warning message was translatable, we could integrate the link into the message, which I would prefer.

  5. I still think that we do not need meta robots tags in the back end.

@leofeyer commented Jan 9, 2020

Another thing that I am not happy with is using sys_get_temp_dir(). We have had multiple issues with it in the past (open_basedir, write permissions etc.), therefore we are only using it in unit tests.

Can we create the files in var/ or another folder within the installation instead?

@Toflar commented Jan 9, 2020

We need a temp dir that is cleaned up automatically, and sys_get_temp_dir() is the only one we have. Files created in var/ would never be removed and would clutter the system.

@Toflar commented Jan 9, 2020

Also, sys_get_temp_dir() was created exactly for this use case. It has to be configured correctly; that's just a hard requirement. Even Symfony contains loads of `sys_get_temp_dir()` calls, and other bundles likely do too.
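As an illustration (a minimal sketch, not code from this PR), the pattern being defended here is creating a short-lived file in the system temp dir and removing it when the run finishes. It assumes the temp dir is configured correctly (open_basedir, write permissions):

```php
<?php
// Sketch: write a crawler debug log to the system temp dir and clean it
// up afterwards. tempnam() creates a uniquely named file for us.
$path = tempnam(sys_get_temp_dir(), 'contao-crawl-');

file_put_contents($path, "crawl debug log\n");
$contents = file_get_contents($path);

// Delete the file once the crawl run is finished so nothing lingers.
unlink($path);
```

If the file were created in var/ instead, nothing would ever delete it, which is the clutter concern raised above.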

@Toflar commented Jan 9, 2020

> 1. The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.

Needs to be fixed in Escargot: terminal42/escargot#9
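For background on the error: a crawled URI can exceed the bounded length of the `uri` column (the column name comes from the error message). A common mitigation, sketched here as an assumption and not necessarily what Escargot ended up doing, is to key rows on a fixed-length hash of the URI:

```php
<?php
// Sketch: an arbitrarily long URI overflows a bounded VARCHAR column,
// while a fixed-length hash always fits (e.g. a CHAR(40) column).
$uri = 'https://example.com/' . str_repeat('segment/', 500); // ~4000 chars

$uriHash = sha1($uri); // always 40 hex characters
```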

> 2. A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond just made the process longer without improving the result.

I disagree. In my tests it actually misses quite a lot with a depth of only 3. But we might configure 8 or so. Would that be okay?

> 3. The success and warning messages are in English and there is no way to translate them.

Fixed in 68e636a.

> 4. I am not yet happy with the "download debug log" button. If the warning message was translatable, we could integrate the link into the message, which I would prefer.

Imho that log should go to the top right or so, because it's a general log file and right now it looks as if it belonged to a certain subscriber.

> 5. I still think that we do not need meta robots tags in the back end.

Removed in 68e636a.

@leofeyer leofeyer merged commit 2aa01f5 into master Jan 9, 2020
@leofeyer leofeyer deleted the feature/crawl-backend branch January 9, 2020 16:26
@leofeyer commented Jan 9, 2020

Thanks a lot @Toflar.

@Toflar commented Jan 9, 2020

WOOOOHOOOO 🎉 🎉 🎉 🎉

@leofeyer leofeyer changed the title Back end implementation for the crawler Add a back end maintenance task for the crawler Jan 10, 2020
@xchs (Contributor) commented Jan 21, 2020

I am currently testing this in Contao 4.9.x-dev: should the crawler also index the "Website root" page (Found on level: 0)?

@Toflar commented Jan 21, 2020

Sure, but it depends on many factors. The debug log will tell you.
