Crawler also shows broken links on subpages of banner destination URL #4213

Closed
grashalm4u opened this issue Feb 24, 2022 · 1 comment · Fixed by #4218

@grashalm4u

Affected version(s)

4.13.0

Description

When crawling my links, the Broken Link Checker log shows not only the broken links on my own home page, but also all the broken links on the pages that the banner target URLs in my Bugbuster banner module point to.
In the forum, Spooky said: "Possibly the 301 forwarding of the banner URLs is the problem. Since the banner URL could change at any time, maybe 302 or 307 would be better here (you could suggest that in the extension on GitHub)."

Additionally, on GitHub at Contao, you could suggest that the crawler (or broken link checker) should not analyze the other links on a page reached via a 301 redirect if the 301 target page is no longer on the same domain.

Since I don't understand enough of the subtleties myself, I'll post the link to this thread here so that you can understand what we discussed in detail. I have already written the above-mentioned message to Bugbuster.

https://community.contao.org/de/showthread.php?82581-Crawler-also-shows-broken-links-on-third-party-pages-of-landing-pages-of-the-banner-module-&p=555552&posted=1#post555552

Maybe you can take care of it. Thanks very much,

grashalm

@fritzmg
Contributor

fritzmg commented Feb 26, 2022

I can confirm the issue (in 4.9 as well as 4.13). Reproduction:

  1. Create a new page of type external and set the target to https://contao.org/en/ for example (i.e. some external page that responds with 200 and contains links). The redirect type does not matter (can be 301 or 302).
  2. Enable Hide in navigation for that external redirect page. [1]
  3. Create a new page of type internal and set the target to the external redirect page.
  4. Create a navigation module and insert that into the layout.

Basically, all you need is for the crawler to encounter an internal URL that responds with a redirect to an external URL.

With the aforementioned setup, the crawler will encounter an internal URL in the HTML output, but in reality this internal page redirects to an external page. The crawler will then continue to parse that external page and check the links it contains, as if it were an internal page.

Crawler Debug Log
Time,Source,URI,"Found on URI","Found on level",Tags,Message
"2022-02-26 11:01:58.583620","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",http://c413.local/sitemap.xml,http://c413.local/robots.txt,2,is-sitemap,"Did not index because the response did not contain a ""text/html"" Content-Type header."
"2022-02-26 11:01:58.752354","Terminal42\Escargot\Escargot",http://c413.local/en/website-en,,0,,"Skipped further response processing because crawler got redirected to an URI that's already been crawled."
"2022-02-26 11:01:58.999483","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",http://c413.local/en/,http://c413.local/sitemap.xml,3,,"Forwarded to the search indexer. Was indexed successfully."
"2022-02-26 11:01:59.416824","Terminal42\Escargot\Escargot",http://c413.local/,http://c413.local/en/,4,,"Skipped further response processing because crawler got redirected to an URI that's already been crawled."
"2022-02-26 11:01:59.508564","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",http://c413.local/en/external,http://c413.local/en/,4,,"Forwarded to the search indexer. Did not index because of the following reason: Ignored because canonical URI ""https://contao.org/en/"" does not match document URI."
"2022-02-26 11:01:59.771957","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:01:59.773304","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:01:59.773946","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/features.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:01:59.774531","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/case-studies.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:01:59.775118","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/news.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:01:59.775700","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/events.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:01:59.776276","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/team.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:00.976205","Terminal42\Escargot\Subscriber\RobotsSubscriber",https://demo.contao.org/contao/login,http://c413.local/en/external,5,disallowed-robots-txt,"Added the ""disallowed-robots-txt"" tag because of the robots.txt content."
"2022-02-26 11:02:00.976292","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://demo.contao.org/contao/login,http://c413.local/en/external,5,disallowed-robots-txt,"Do not request because the URI was disallowed to be followed by either rel=""nofollow"" or robots.txt hints."
"2022-02-26 11:02:00.977429","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/download.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:00.978048","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/media.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:00.979584","Terminal42\Escargot\Escargot",https://contao.org/,http://c413.local/en/external,5,,"Skipped further response processing because crawler got redirected to an URI that's already been crawled."
"2022-02-26 11:02:01.999692","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/release-plan.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:02.001087","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/contao-partners.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:02.001715","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/contao-partner-map.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:02.002343","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/service-description.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:02.636497","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://partners.contao.org/en,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:02.637220","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/support.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:02.813225","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://docs.contao.org/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.018878","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://github.com/contao/contao/issues,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.019616","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/security-advisories.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.020235","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/network.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.711290","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/de/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.712461","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/es/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.713063","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/fr/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:03.934410","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://extensions.contao.org/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.055918","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://symfony.com/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.056634","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://github.com/contao/contao,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.057821","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://github.com/contao/contao/security/policy,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.058479","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/news/contao-4_13_0.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.059094","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/news/contao-two-month-review-november-and-december-2021.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.059819","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/news/composer-and-contao-for-the-rest-of-the-world.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.677948","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/sitemap.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.678715","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/privacy-notice.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.679333","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://contao.org/en/legal-notice.html,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.679930","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://github.com/contao,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:04.882131","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://twitter.com/contaocms,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:05.097943","Terminal42\Escargot\Subscriber\RobotsSubscriber",https://www.facebook.com/contao,http://c413.local/en/external,5,disallowed-robots-txt,"Added the ""disallowed-robots-txt"" tag because of the robots.txt content."
"2022-02-26 11:02:05.098032","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://www.facebook.com/contao,http://c413.local/en/external,5,disallowed-robots-txt,"Do not request because the URI was disallowed to be followed by either rel=""nofollow"" or robots.txt hints."
"2022-02-26 11:02:05.201507","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://www.youtube.com/user/contaocms,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:05.529624","Contao\CoreBundle\Crawl\Escargot\Subscriber\SearchIndexSubscriber",https://www.pinterest.de/contaocms/,http://c413.local/en/external,5,,"Did not index because it was not part of the base URI collection."
"2022-02-26 11:02:05.803827","Terminal42\Escargot\Escargot",---,---,---,---,"Finished crawling! Sent 43 request(s)."

[1] It is important that the external redirect page is hidden from the menu. Otherwise the crawler would encounter the URL of the external page first and would then skip it when processing the redirected URL of the internal page later on.
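Why hiding the page matters can be illustrated with a small sketch. This is only a simplified model of the "redirected to an URI that's already been crawled" behavior seen in the log, not Escargot's actual implementation; the function name `should_process` and the `already_crawled` set are hypothetical.

```python
def should_process(requested_uri: str, effective_uri: str, already_crawled: set) -> bool:
    """Simplified model: skip a response if a redirect landed on a known URI."""
    redirected = requested_uri != effective_uri
    if redirected and effective_uri in already_crawled:
        # Mirrors the log message "Skipped further response processing ..."
        return False
    already_crawled.add(effective_uri)
    return True


# Case A: the external page is visible in the navigation, so its URL is seen first.
seen_a = set()
visible_first = should_process("https://contao.org/en/", "https://contao.org/en/", seen_a)
visible_redirect = should_process("http://c413.local/en/external", "https://contao.org/en/", seen_a)

# Case B: the external page is hidden, so only the internal redirect is encountered.
seen_b = set()
hidden_redirect = should_process("http://c413.local/en/external", "https://contao.org/en/", seen_b)

print(visible_first, visible_redirect, hidden_redirect)
```

In case A the redirected response is skipped, which masks the problem. In case B it is processed, which is the situation the reproduction steps aim for.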

@fritzmg fritzmg added this to the 4.9 milestone Feb 26, 2022
@fritzmg fritzmg linked a pull request Feb 27, 2022 that will close this issue
leofeyer pushed a commit that referenced this issue Mar 2, 2022
Description
-----------

Fixes #4213

See #4213 (comment) for a detailed explanation of the problem.

This PR fixes this by comparing the host of the actual URL of the response with the base URI collection and returning a negative decision in `needsContent`, so that any internal URL that was redirected to an external URL is not processed.

Commits
-------

2466907 do not process redirected URLs outside base domains
9c034f1 reference issue
2a7072c report ok on redirected URLs
c1d25a5 use Uri class
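The host comparison described in the PR description above can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the PHP code from the PR: the function `needs_content` only borrows its name from `needsContent`, its signature is an assumption, and `base_hosts` is a hypothetical stand-in for the base URI collection.

```python
from urllib.parse import urlsplit


def needs_content(effective_url: str, base_hosts: set) -> bool:
    """Return True only if the final (post-redirect) URL is on a base host.

    effective_url: URL the response actually came from, after redirects.
    base_hosts:    hosts the crawler treats as internal (assumed input).
    """
    host = (urlsplit(effective_url).hostname or "").lower()
    return host in base_hosts


base_hosts = {"c413.local"}

# The internal URL /en/external was redirected to contao.org: do not parse it.
redirected_external = needs_content("https://contao.org/en/", base_hosts)

# A response that stayed on the internal host is still processed.
stayed_internal = needs_content("http://c413.local/en/", base_hosts)

print(redirected_external, stayed_internal)
```

The key design point is that the decision uses the URL of the response rather than the URL that was originally requested, so a redirect that leaves the base hosts stops further link extraction.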
leofeyer pushed a commit to contao/core-bundle that referenced this issue Mar 2, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2024