
Review Robots' caching logic, fixes #713 #717


sebastian-nagel (Contributor)

  • cache a redirected robots.txt for the target host only if the robots.txt is at the root (URL path = `/robots.txt`)
  • always cache the rules of a redirection for the source host (a sketch of this decision follows below)
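
As a rough illustration, here is a minimal sketch of that decision in Java; the class, method, and key-helper names are hypothetical, not the actual RobotRulesParser code:

```java
import java.net.URL;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RobotsCachingSketch {

    // parsed rules cached per host; Object stands in for the rules type
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // key format as it appears in the logs below: protocol:host:port
    static String cacheKey(URL u) {
        int port = (u.getPort() != -1) ? u.getPort() : u.getDefaultPort();
        return u.getProtocol() + ":" + u.getHost() + ":" + port;
    }

    void cacheFetchedRules(URL sourceRobotsUrl, URL finalRobotsUrl, Object rules) {
        // always cache the rules for the source host
        cache.put(cacheKey(sourceRobotsUrl), rules);

        // cache for the target host only if the redirect ended at the
        // root robots.txt of that host (URL path exactly "/robots.txt")
        String targetKey = cacheKey(finalRobotsUrl);
        if (!targetKey.equals(cacheKey(sourceRobotsUrl))
                && "/robots.txt".equals(finalRobotsUrl.getPath())) {
            cache.put(targetKey, rules);
        }
    }
}
```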

First, check the example from #573, where a redirected robots.txt must not be cached for the target host:

2019-05-03 12:42:50.707 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Cache miss http:wyomingtheband.com:80 for http://wyomingtheband.com/
2019-05-03 12:42:50.957 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Robots for http://wyomingtheband.com/ redirected to https://www.facebook.com/wyomingtheband/robots.txt (not cached for target host because not at root)
2019-05-03 12:42:51.352 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Caching robots for http://wyomingtheband.com/ under key http:wyomingtheband.com:80 in cache success
2019-05-03 12:42:51.510 c.d.s.b.FetcherBolt FetcherThread #33 [INFO] [Fetcher #4] Fetched http://wyomingtheband.com/ with status 302 in msec 152
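
In terms of the hypothetical sketch above, the root check fails for the Facebook target:

```java
// inside a method declaring `throws MalformedURLException`
URL target = new URL("https://www.facebook.com/wyomingtheband/robots.txt");
// target.getPath() is "/wyomingtheband/robots.txt", not "/robots.txt",
// so the rules are cached only under the source key http:wyomingtheband.com:80
```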

The URL http://wyomingtheband.com/ redirects to https://www.facebook.com/wyomingtheband/: the redirect is followed but the target is forbidden by https://www.facebook.com/robots.txt:

2019-05-03 12:43:10.443 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Cache miss https:www.facebook.com:443 for https://www.facebook.com/wyomingtheband/
2019-05-03 12:43:10.552 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Caching robots for https://www.facebook.com/wyomingtheband/ under key https:www.facebook.com:443 in cache success
2019-05-03 12:43:10.552 c.d.s.b.FetcherBolt FetcherThread #33 [INFO] Denied by robots.txt: https://www.facebook.com/wyomingtheband/

And an example where the redirected robots.txt is also cached for the target host because it is at the root (https://pfau-schinken.de/robots.txt redirects to https://www.pfau-schinken.de/robots.txt); note that there is no cache miss for the second fetch:

2019-05-03 13:13:51.443 c.d.s.p.RobotRulesParser FetcherThread #19 [DEBUG] Cache miss https:pfau-schinken.de:443 for https://pfau-schinken.de/
2019-05-03 13:13:51.695 c.d.s.p.RobotRulesParser FetcherThread #19 [DEBUG] Caching robots for https://pfau-schinken.de/ under key https:pfau-schinken.de:443 in cache success
2019-05-03 13:13:51.695 c.d.s.p.RobotRulesParser FetcherThread #19 [DEBUG] Caching robots for https://www.pfau-schinken.de/robots.txt under key https:www.pfau-schinken.de:443 in cache success
2019-05-03 13:13:51.709 c.d.s.b.FetcherBolt FetcherThread #19 [INFO] [Fetcher #4] Fetched https://pfau-schinken.de/ with status 301 in msec 13
2019-05-03 13:14:31.572 c.d.s.b.FetcherBolt FetcherThread #43 [INFO] [Fetcher #4] Fetched https://www.pfau-schinken.de/ with status 200 in msec 111
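
Here the root check of the sketch above passes, so the rules land under both host keys:

```java
// inside a method declaring `throws MalformedURLException`
URL target = new URL("https://www.pfau-schinken.de/robots.txt");
// target.getPath() is exactly "/robots.txt", so the rules are cached under
// both https:pfau-schinken.de:443 and https:www.pfau-schinken.de:443,
// and the later fetch of https://www.pfau-schinken.de/ is a cache hit
```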

Note that there is no cache sharing if two robots.txt files redirect to the same target that is not at the root level; we would need a cache keyed by URL to achieve this, as sketched below.
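
A cache keyed by URL could, as a purely hypothetical sketch, sit next to the per-host cache and deduplicate such fetches; `fetchAndParse` below is an assumed helper, not an existing API:

```java
// secondary cache keyed by the robots.txt URL itself, so two hosts
// redirecting to the same non-root robots.txt would share one entry
private final Map<String, Object> cacheByUrl = new ConcurrentHashMap<>();

Object rulesForRobotsUrl(URL finalRobotsUrl) {
    return cacheByUrl.computeIfAbsent(finalRobotsUrl.toExternalForm(),
            key -> fetchAndParse(finalRobotsUrl)); // fetchAndParse: hypothetical
}
```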

jnioche self-assigned this May 6, 2019
jnioche added the core label May 6, 2019
jnioche added this to the 1.14 milestone May 6, 2019
jnioche merged commit 9a3d940 into apache:master May 7, 2019
jnioche (Contributor) commented May 7, 2019

thanks @sebastian-nagel

sebastian-nagel deleted the sc-713-caching-redirected-robots-rules branch May 11, 2019 14:20