
Crawler ignores rel="nofollow" and robots.txt (broken links checker shows thousands of broken links) #2813

Closed
markusmilkereit opened this issue Feb 27, 2021 · 16 comments

Comments

@markusmilkereit

Affected version(s)
Contao 4.9

Description
The social sharing links have rel="nofollow" out of the box, and just to make sure, I also added Disallow: /_contao/ to the robots.txt. Still, the crawler follows the links and marks the share links as broken. This leads to the following result right now:

Checked 1909 link(s) successfully. 2163 were broken.

The website has about 520 URLs, including news and events.
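For reference, a robots.txt rule along those lines might look like this (the User-agent line is an assumption for illustration; only the Disallow rule is quoted above):

```
# Sketch of the robots.txt rule described above
User-agent: *
Disallow: /_contao/
```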

@fritzmg
Contributor

fritzmg commented Feb 28, 2021

You can add the data-skip-broken-link-checker attribute to those links.
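A minimal sketch of what that could look like in a template, assuming the attribute only needs to be present on the link; the href and link text are made up for illustration:

```html
<!-- Hypothetical share link; the data-skip-broken-link-checker attribute
     tells the broken link checker to skip this URL -->
<a href="https://www.linkedin.com/shareArticle?url=https://example.com/news/example"
   rel="nofollow"
   data-skip-broken-link-checker>Share on LinkedIn</a>
```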

@Toflar
Copy link
Member

Toflar commented Feb 28, 2021

This is the desired behaviour. The broken link checker does not care about rel="nofollow". That's only relevant for the crawler. If you don't want to have those links checked, you can do as @fritzmg suggested.

@Toflar Toflar closed this as completed Feb 28, 2021
@markusmilkereit
Author

@Toflar if this is the desired behaviour for the broken link checker, then the social media share links show badly broken behaviour with it. That might still be a bug in the software as delivered.

@fritzmg that solves the problem very nicely, thanks :)

@ausi
Member

ausi commented Mar 1, 2021

> and marks the share-links as broken

Why are they marked as broken? They should be valid links that return a 200 response, IMO.

@fritzmg
Contributor

fritzmg commented Mar 1, 2021

Yeah, you would need to analyse the reason. I suspect that the respective sites may be blocking the crawler, either due to the number of requests or because of the User-Agent.

@bytehead
Member

bytehead commented Mar 1, 2021

LinkedIn, for example, returns a 999 on crawl; that's why @Toflar added the skip-broken-link-checker attribute.

@m-vo
Member

m-vo commented Mar 1, 2021

> LinkedIn for example returns a 999 on crawl

Oh nice, let's break the web. 🤦

@bytehead
Member

bytehead commented Mar 1, 2021

LinkedIn 🤷‍♂️

@fritzmg
Contributor

fritzmg commented Mar 1, 2021

Maybe the broken link checker should only check for 4xx status codes? Though that would reduce the quality of the results.

@bytehead
Member

bytehead commented Mar 1, 2021

5xx should be checked as well, no?

@m-vo
Member

m-vo commented Mar 1, 2021

Yeah, you would likely want to know if something is 503 forever, for instance.

@bytehead
Member

bytehead commented Mar 1, 2021

Or just a 500 :)

@Toflar
Member

Toflar commented Mar 1, 2021

There's no point in discussing that. We won't implement workarounds for companies that think it's a good idea to invent their own HTTP status codes. They've been a standard for what, the better part of 30 years?
There's a general solution for ignoring links you don't want to have checked. Just use it and we're all good.

@bytehead
Member

bytehead commented Mar 1, 2021

Just ignore everything above 599: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status 😎

@ausi
Member

ausi commented Mar 1, 2021

If the response code is not 2xx or 3xx, the link is by definition a broken link. Ignoring any non-standard status codes would not make sense, I think. If LinkedIn returns 999 for a valid URL, they need to fix that.
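Expressed as code, a minimal sketch of that rule; the helper name is made up for illustration and is not part of Contao:

```php
<?php

// Hypothetical helper illustrating the rule above: anything outside the
// 2xx (success) and 3xx (redirect) ranges counts as a broken link,
// including non-standard codes such as LinkedIn's 999.
function isBrokenStatusCode(int $statusCode): bool
{
    return $statusCode < 200 || $statusCode >= 400;
}

var_dump(isBrokenStatusCode(200)); // bool(false)
var_dump(isBrokenStatusCode(301)); // bool(false)
var_dump(isBrokenStatusCode(404)); // bool(true)
var_dump(isBrokenStatusCode(999)); // bool(true)
```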

@fritzmg
Contributor

fritzmg commented Mar 1, 2021

LinkedIn may be using the "unofficial" 999 status code for "Request denied", indicating that the server denies the crawler's request for whatever reason, for instance because too many requests have been made or because the User-Agent was identified as a bot.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2024