Add a back end maintenance task for the crawler #1057

Merged · 19 commits merged into master from feature/crawl-backend on Jan 9, 2020

Conversation

Toflar (Member) commented Dec 4, 2019:

This is a first draft and I really need some help from here on I guess :)

This is how it looks at the moment:

[Screenshot: the crawler maintenance task in the back end (2019-12-04)]

ToDos:

  • Bring back the FE member authentication. We need to improve this in a future version to have stateless authentication but for now, it should at least continue to work as before.
  • Styling. It's all just some inline styles and not really pretty. I'm just really lacking the skills here.
  • What configuration options should we provide? All of them according to the crawl command? Does it even make sense to let the users decide on concurrency etc.?
  • Provide subscriber-specific logs (for level info+)
@Toflar Toflar added the feature label Dec 4, 2019
@Toflar Toflar added this to the 4.9 milestone Dec 4, 2019
@Toflar Toflar requested a review from leofeyer Dec 4, 2019
@Toflar Toflar self-assigned this Dec 4, 2019
@Toflar Toflar force-pushed the feature/crawl-backend branch from bcd86df to 2f0d8b9 Dec 5, 2019
@Toflar Toflar force-pushed the feature/crawl-backend branch from b14f650 to b80708f Dec 11, 2019
@Toflar Toflar force-pushed the feature/crawl-backend branch from b80708f to 54f7a08 Dec 11, 2019
```diff
@@ -11,6 +11,7 @@
 <meta name="generator" content="Contao Open Source CMS">
 <meta name="viewport" content="width=device-width,initial-scale=1.0,shrink-to-fit=no">
 <meta name="referrer" content="origin">
+<meta name="robots" content="noindex, nofollow">
```

leofeyer (Member) commented Jan 8, 2020:

Why is it necessary to add this? The /contao route is exempted in the robots.txt file, and no crawler will ever be able to see this template without being logged into the back end.

Toflar (Author, Member) commented Jan 8, 2020:

You are logged into the back end if you authenticate with a front end user. There's no possibility to authenticate as a FE user statelessly at the moment.
And /contao is apparently not exempted by the robots.txt, otherwise Escargot would not crawl it.

leofeyer (Member) commented Jan 8, 2020:

> You are logged into the back end if you authenticate with a front end user.

What? That would be a major security hole!

Toflar (Author, Member) commented Jan 8, 2020:

You misunderstood. I mean that when you rebuild the search index from the back end, you are authenticated as a back end user and a front end user at the same time. That's why.

leofeyer (Member) commented Jan 8, 2020:

I see.

> And /contao is apparently not exempted by the robots.txt, otherwise Escargot would not crawl it.

You have to delete the web/robots.txt file so the request triggers the new robots.txt route. It seems we forgot to remove the file when we merged #717. 🙈

Toflar (Author, Member) commented Jan 8, 2020:

No, it was never there: https://github.com/contao/contao/pull/717/files#diff-a793700f1c87e1a3a0d01553e0c01f39
But it would make sense, yes. I'll check.

Toflar (Author, Member) commented Jan 8, 2020:

Aaah, I see what you mean now. You're right. I had to remove the file, and it all works now.
However, I don't think simply deleting the file is a good idea: the migration would need to copy its contents into the root pages so that a previously existing configuration doesn't get lost.
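
For illustration, a minimal sketch of that migration idea, based on Contao 4.9's migration framework. The class name, the injected web directory, the tl_page column name and the DBAL call are assumptions, not the shipped code:

```php
<?php

use Contao\CoreBundle\Migration\AbstractMigration;
use Contao\CoreBundle\Migration\MigrationResult;
use Doctrine\DBAL\Connection;

class RobotsTxtFileMigration extends AbstractMigration
{
    /** @var Connection */
    private $connection;

    /** @var string */
    private $webDir;

    public function __construct(Connection $connection, string $webDir)
    {
        $this->connection = $connection;
        $this->webDir = $webDir;
    }

    public function shouldRun(): bool
    {
        // Only run while a physical robots.txt still shadows the dynamic route
        return is_file($this->webDir.'/robots.txt');
    }

    public function run(): MigrationResult
    {
        $content = file_get_contents($this->webDir.'/robots.txt');

        // "robotsTxt" is an assumed column name: copy the old rules into
        // every root page so the previous configuration is not lost
        $this->connection->executeUpdate(
            "UPDATE tl_page SET robotsTxt = :content WHERE type = 'root'",
            ['content' => $content]
        );

        unlink($this->webDir.'/robots.txt');

        return $this->createResult(true);
    }
}
```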

Toflar (Author, Member) commented Jan 8, 2020:

And I would still like to keep the meta tags because they aren't wrong :)

leofeyer and others added 4 commits Jan 8, 2020
leofeyer (Member) left a review comment:

I have made the necessary adjustments in aef6723. [Screenshots omitted.]

There are still some things which need to be fixed though:

  1. The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.

  2. A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond just made the process longer without improving the result.

  3. The success and warning messages are in English and there is no way to translate them.

  4. I am not yet happy with the "download debug log" button. If the warning message was translatable, we could integrate the link into the message, which I would prefer.

  5. I still think that we do not need meta robots tags in the back end.

leofeyer (Member) commented Jan 9, 2020:

Another thing that I am not happy with is using `sys_get_temp_dir()`. We have had multiple issues with it in the past (open_basedir, write permissions etc.); therefore, we only use it in unit tests.

Can we create the files in var/ or another folder within the installation instead?

Toflar (Author, Member) commented Jan 9, 2020:

We need a temp dir that is cleaned up automatically, and `sys_get_temp_dir()` is the only one we have. Files created in var/ would never be removed and would clutter the system.

Toflar (Author, Member) commented Jan 9, 2020:

Also, `sys_get_temp_dir()` was created exactly for this use case. It has to be configured correctly; that's just a hard requirement. Even Symfony contains loads of `sys_get_temp_dir()` calls, and other bundles likely do too.
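
As a minimal sketch of that use case (the names are illustrative, not the actual crawler code), a subscriber log file in the system temp dir could look like this:

```php
<?php

// Minimal sketch, not the actual Contao implementation: write a
// subscriber log to the system temp dir and remove it when done.
$logFile = tempnam(sys_get_temp_dir(), 'crawl-');

if (false === $logFile) {
    throw new \RuntimeException('Could not create a temporary log file.');
}

file_put_contents($logFile, "Crawl started\n", FILE_APPEND);

// ... run the crawl and append subscriber-specific log lines ...

// Explicit cleanup; even if this step is skipped, the system temp dir is
// cleaned up periodically, which var/ would not be.
unlink($logFile);
```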

Toflar (Author, Member) commented Jan 9, 2020:

> 1. The broken link checker never finishes because of an error: `SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'uri' at row 1`.

Needs to be fixed in Escargot: terminal42/escargot#9
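
For context, a hedged sketch of the kind of guard that avoids the failing INSERT (hypothetical helper and column size, not the actual fix shipped in Escargot):

```php
<?php

// Hypothetical guard, not the actual fix from terminal42/escargot#9:
// skip URIs that would not fit into the database column instead of
// letting the INSERT fail with SQLSTATE[22001].
const MAX_URI_LENGTH = 2048; // assumed size of the "uri" column

function shouldPersistUri(string $uri): bool
{
    // The real fix might widen the column or store a hash instead;
    // this sketch simply refuses to persist over-long URIs.
    return \strlen($uri) <= MAX_URI_LENGTH;
}
```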

> 2. A max depth of 32 seems pretty high to me. The broken link checker worked best at a max depth of 3 – anything beyond just made the process longer without improving the result.

I disagree. In my tests it actually misses quite a lot with a depth of only 3. But we might configure 8 or so. Would that be okay?
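
For reference, limiting the depth on the command line would then look something like this (assuming the contao:crawl command exposes the setting as a --max-depth option):

```
php vendor/bin/contao-console contao:crawl --max-depth=8
```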

> 3. The success and warning messages are in English and there is no way to translate them.

Fixed in 68e636a.
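
As an illustration of the approach (the key name is an assumption; the actual labels live in the contao_default domain, see b2cec2c), a subscriber can resolve its summary through the Symfony translator:

```php
<?php

use Symfony\Contracts\Translation\TranslatorInterface;

// Illustrative sketch: build a translatable summary instead of a
// hard-coded English string. 'CRAWL.brokenLinkChecker.summary' is an
// assumed key in the contao_default translation domain.
function buildSummary(TranslatorInterface $translator, int $ok, int $broken): string
{
    return $translator->trans(
        'CRAWL.brokenLinkChecker.summary',
        ['%ok%' => $ok, '%broken%' => $broken],
        'contao_default'
    );
}
```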

> 4. I am not yet happy with the "download debug log" button. If the warning message was translatable, we could integrate the link into the message, which I would prefer.

IMHO, that log link should go to the top right or so, because it's a general log file, and right now it looks as if it belonged to a certain subscriber.

> 5. I still think that we do not need meta robots tags in the back end.

Removed in 68e636a.

Toflar and others added 3 commits Jan 9, 2020
@leofeyer leofeyer merged commit 2aa01f5 into master Jan 9, 2020
9 checks passed: Coverage · Coding Style · PHP 7.2 · PHP 7.3 · PHP 7.4 · Prefer Lowest · Bundles · Windows · codecov/project 89.73% (+0.62%) compared to 3ccffd8
@leofeyer leofeyer deleted the feature/crawl-backend branch Jan 9, 2020
leofeyer (Member) commented Jan 9, 2020:

Thanks a lot @Toflar.

Toflar (Author, Member) commented Jan 9, 2020:

WOOOOHOOOO 🎉 🎉 🎉 🎉

@leofeyer leofeyer changed the title Back end implementation for the crawler Add a back end maintenance task for the crawler Jan 10, 2020
Tastaturberuf pushed a commit to Tastaturberuf/contao that referenced this pull request Jan 13, 2020
Commits
-------

e287095 First draft for the back end implementation
2f0d8b9 Implemented subscriber specific log files
7a5685d CS
32208f6 CrawlCommand CS
87fa4cf Discard back end configuration
5ca44b7 Re-implemented FE member auth as previously implemented
54f7a08 CS
9088730 Fix issue with timeout because of session deadlock
3ccffd8 Make sure the back end is never followed
ef8158c Merge branch 'master' into feature/crawl-backend
08be7b0 Fix the coding style
3bdd8dc Converted DCA to Doctrine entity and also properly named it
9a5a9e3 Fixed labels
aef6723 Adjust the back end implementation
68e636a Translated summary and warnings of all subscribers for the back end
76d33c5 Removed robots meta tags
45ce63e CS
c968781 Move the debug log link next to the progress bar
b2cec2c Move the CRAWL. labels into the default.xlf file and fix some minor CS issues