
Review Robots' caching logic, fixes #713 #717


sebastian-nagel (Contributor)

  • cache a redirected robots.txt for the target host only if the robots.txt is at the root (URL path = `/robots.txt`)
  • always cache the rules of a redirection for the source host (a sketch of this decision follows below)
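
As a rough illustration, here is a minimal sketch of that decision in Java; the class, method, and key-helper names are hypothetical, not the actual RobotRulesParser code:

```java
import java.net.URL;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RobotsCachingSketch {

    // parsed rules cached per host; Object stands in for the rules type
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // key format as it appears in the logs below: protocol:host:port
    static String cacheKey(URL u) {
        int port = (u.getPort() != -1) ? u.getPort() : u.getDefaultPort();
        return u.getProtocol() + ":" + u.getHost() + ":" + port;
    }

    void cacheFetchedRules(URL sourceRobotsUrl, URL finalRobotsUrl, Object rules) {
        // always cache the rules for the source host
        cache.put(cacheKey(sourceRobotsUrl), rules);

        // cache for the target host only if the redirect ended at the
        // root robots.txt of that host (URL path exactly "/robots.txt")
        String targetKey = cacheKey(finalRobotsUrl);
        if (!targetKey.equals(cacheKey(sourceRobotsUrl))
                && "/robots.txt".equals(finalRobotsUrl.getPath())) {
            cache.put(targetKey, rules);
        }
    }
}
```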

First, check the example from #573, where a redirected robots.txt must not be cached for the target host:

2019-05-03 12:42:50.707 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Cache miss http:wyomingtheband.com:80 for http://wyomingtheband.com/
2019-05-03 12:42:50.957 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Robots for http://wyomingtheband.com/ redirected to https://www.facebook.com/wyomingtheband/robots.txt (not cached for target host because not at root)
2019-05-03 12:42:51.352 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Caching robots for http://wyomingtheband.com/ under key http:wyomingtheband.com:80 in cache success
2019-05-03 12:42:51.510 c.d.s.b.FetcherBolt FetcherThread #33 [INFO] [Fetcher #4] Fetched http://wyomingtheband.com/ with status 302 in msec 152
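
In terms of the hypothetical sketch above, the root check fails for the Facebook target:

```java
// inside a method declaring `throws MalformedURLException`
URL target = new URL("https://www.facebook.com/wyomingtheband/robots.txt");
// target.getPath() is "/wyomingtheband/robots.txt", not "/robots.txt",
// so the rules are cached only under the source key http:wyomingtheband.com:80
```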

The URL http://wyomingtheband.com/ redirects to https://www.facebook.com/wyomingtheband/: the redirect is followed but the target is forbidden by https://www.facebook.com/robots.txt:

2019-05-03 12:43:10.443 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Cache miss https:www.facebook.com:443 for https://www.facebook.com/wyomingtheband/
2019-05-03 12:43:10.552 c.d.s.p.RobotRulesParser FetcherThread #33 [DEBUG] Caching robots for https://www.facebook.com/wyomingtheband/ under key https:www.facebook.com:443 in cache success
2019-05-03 12:43:10.552 c.d.s.b.FetcherBolt FetcherThread #33 [INFO] Denied by robots.txt: https://www.facebook.com/wyomingtheband/

And an example where the redirected robots.txt is also cached for the target host because it is at the root (https://pfau-schinken.de/robots.txt redirects to https://www.pfau-schinken.de/robots.txt); note that there is no cache miss for the second fetch:

2019-05-03 13:13:51.443 c.d.s.p.RobotRulesParser FetcherThread #19 [DEBUG] Cache miss https:pfau-schinken.de:443 for https://pfau-schinken.de/
2019-05-03 13:13:51.695 c.d.s.p.RobotRulesParser FetcherThread #19 [DEBUG] Caching robots for https://pfau-schinken.de/ under key https:pfau-schinken.de:443 in cache success
2019-05-03 13:13:51.695 c.d.s.p.RobotRulesParser FetcherThread #19 [DEBUG] Caching robots for https://www.pfau-schinken.de/robots.txt under key https:www.pfau-schinken.de:443 in cache success
2019-05-03 13:13:51.709 c.d.s.b.FetcherBolt FetcherThread #19 [INFO] [Fetcher #4] Fetched https://pfau-schinken.de/ with status 301 in msec 13
2019-05-03 13:14:31.572 c.d.s.b.FetcherBolt FetcherThread #43 [INFO] [Fetcher #4] Fetched https://www.pfau-schinken.de/ with status 200 in msec 111
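
Here the root check of the sketch above passes, so the rules land under both host keys:

```java
// inside a method declaring `throws MalformedURLException`
URL target = new URL("https://www.pfau-schinken.de/robots.txt");
// target.getPath() is exactly "/robots.txt", so the rules are cached under
// both https:pfau-schinken.de:443 and https:www.pfau-schinken.de:443,
// and the later fetch of https://www.pfau-schinken.de/ is a cache hit
```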

Note that there is no cache sharing if two robots.txt files redirect to the same target that is not at the root level; we would need a cache keyed by URL to achieve this, as sketched below.
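
A cache keyed by URL could, as a purely hypothetical sketch, sit next to the per-host cache and deduplicate such fetches; `fetchAndParse` below is an assumed helper, not an existing API:

```java
// secondary cache keyed by the robots.txt URL itself, so two hosts
// redirecting to the same non-root robots.txt would share one entry
private final Map<String, Object> cacheByUrl = new ConcurrentHashMap<>();

Object rulesForRobotsUrl(URL finalRobotsUrl) {
    return cacheByUrl.computeIfAbsent(finalRobotsUrl.toExternalForm(),
            key -> fetchAndParse(finalRobotsUrl)); // fetchAndParse: hypothetical
}
```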

jnioche self-assigned this May 6, 2019
jnioche added the core label May 6, 2019
jnioche added this to the 1.14 milestone May 6, 2019
jnioche merged commit 9a3d940 into apache:master May 7, 2019
jnioche (Contributor) commented May 7, 2019

thanks @sebastian-nagel

sebastian-nagel deleted the sc-713-caching-redirected-robots-rules branch May 11, 2019 14:20