Crawler also shows broken links on subpages of banner destination URL #4213
Comments
I can confirm the issue (in 4.9 as well as 4.13). Reproduction:
Essentially, all you need is for the crawler to encounter an internal URL that responds with a redirect to an external URL. With the aforementioned setup, the crawler finds an internal URL in the HTML output, but in reality this internal page redirects to an external page. The crawler then goes on to parse that external page and check the links it contains, as if it were an internal page.
[1] It is important that the external redirect page is hidden from the menu. Otherwise the crawler will encounter the URL of the external page first and then skip it when it later processes the redirected URL of the internal page.
Description
-----------
Fixes #4213

See #4213 (comment) for a detailed explanation of the problem. This PR fixes this by comparing the host of the actual URL of the response with the base URI collection and returning a negative decision in `needsContent`, so that any internal URL that was redirected to an external URL is not getting processed.

Commits
-------
2466907 do not process redirected URLs outside base domains
9c034f1 reference issue
2a7072c report ok on redirected URLs
c1d25a5 use Uri class
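The host comparison described in the PR can be illustrated with a minimal Python sketch. This is not the actual Contao or crawler code (which is PHP); the function names `normalized_host` and `needs_content`, the parameters, and the example URLs are all assumptions made for illustration. The idea is only that the decision is based on the final URL of the response after redirects, not on the URL that was originally requested.

```python
from typing import Iterable
from urllib.parse import urlsplit


def normalized_host(url: str) -> str:
    """Return the lower-cased host of a URL, or an empty string if it has none."""
    return (urlsplit(url).hostname or "").lower()


def needs_content(effective_url: str, base_urls: Iterable[str]) -> bool:
    """Hypothetical stand-in for the `needsContent` decision.

    `effective_url` is assumed to be the URL the response actually came from
    (after following redirects). `base_urls` is assumed to be the collection
    of base URIs the crawl was started with.
    """
    base_hosts = {normalized_host(url) for url in base_urls}
    # Negative decision: the response ended up on a host outside the base
    # collection, so its body should not be parsed for further links.
    return normalized_host(effective_url) in base_hosts


# Hypothetical example: an internal banner URL that redirected externally.
base = ["https://example.com/"]
print(needs_content("https://example.com/about", base))
print(needs_content("https://ads.example.org/landing", base))
```

With a check like this, the redirecting internal URL itself can still be reported as reachable, while the links on the external landing page are never queued.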
Affected version(s)
4.13.0
Description
When crawling my links, the Broken Link Checker log shows not only the broken links on my own website, but also all the broken links on the pages that the banner target URLs in my Bugbuster banner module point to.
In the forum, Spooky said: "Possibly the 301 forwarding of the banner URLs is the problem. Since the banner URL could change at any time, maybe 302 or 307 would be better here (you could suggest that in the extension on GitHub)."
Additionally, you could suggest to Contao on GitHub that the crawler (or broken link checker) should not analyze the links on a page reached via a 301 redirect if the redirect target is no longer on the same domain.
Since I don't understand enough of the subtleties myself, I'll post the link to this thread here so that you can understand what we discussed in detail. I have already written the above-mentioned message to Bugbuster.
https://community.contao.org/de/showthread.php?82581-Crawler-also-shows-broken-links-on-third-party-pages-of-landing-pages-of-the-banner-module-&p=555552&posted=1#post555552
Maybe you can take care of it. Thanks very much,
grashalm