
Crawler ignores rel="nofollow" and robots.txt (broken links checker shows thousands of broken links) #2813

Closed
markusmilkereit opened this issue Feb 27, 2021 · 16 comments

Comments

@markusmilkereit

Affected version(s)
Contao 4.9

Description
The social sharing links have rel="nofollow" out of the box, and just to make sure, I also added Disallow: /_contao/ to the robots.txt. Still, the crawler follows the links and marks the share links as broken. This leads to the following result right now:

Checked 1909 link(s) successfully. 2163 were broken.

The website has about 520 URLs, including news and events.
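For reference, a robots.txt rule along those lines might look like this (the User-agent line is an assumption for illustration; only the Disallow rule is quoted above):

```
# Sketch of the robots.txt rule described above
User-agent: *
Disallow: /_contao/
```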

@fritzmg
Contributor

fritzmg commented Feb 28, 2021

You can add the data-skip-broken-link-checker attribute to those links.
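A minimal sketch of what that could look like in a template, assuming the attribute only needs to be present on the link; the href and link text are made up for illustration:

```html
<!-- Hypothetical share link; the data-skip-broken-link-checker attribute
     tells the broken link checker to skip this URL -->
<a href="https://www.linkedin.com/shareArticle?url=https://example.com/news/example"
   rel="nofollow"
   data-skip-broken-link-checker>Share on LinkedIn</a>
```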

@Toflar
Copy link
Member

Toflar commented Feb 28, 2021

This is the desired behaviour. The broken link checker does not care about rel="nofollow". That's only relevant for the crawler. If you don't want to have those links checked, you can do as @fritzmg suggested.

@Toflar Toflar closed this as completed Feb 28, 2021
@markusmilkereit
Author

@Toflar if this is the desired behaviour for the broken link checker, then the social media share links show badly broken behaviour with it. That might still be a bug in the software as delivered.

@fritzmg that solves the problem very nicely, thanks :)

@ausi
Member

ausi commented Mar 1, 2021

> and marks the share-links as broken

Why are they marked as broken? They should be valid links that return a 200 response, IMO.

@fritzmg
Contributor

fritzmg commented Mar 1, 2021

Yeah, you would need to analyse the reason. I suspect that the respective sites may be blocking the crawler, either due to the number of requests or because of the User-Agent.

@bytehead
Member

bytehead commented Mar 1, 2021

LinkedIn, for example, returns a 999 on crawl; that's why @Toflar added the skip-broken-link-checker attribute.

@m-vo
Member

m-vo commented Mar 1, 2021

> LinkedIn for example returns a 999 on crawl

Oh nice, let's break the web. 🤦

@bytehead
Member

bytehead commented Mar 1, 2021

LinkedIn 🤷‍♂️

@fritzmg
Contributor

fritzmg commented Mar 1, 2021

Maybe the broken link checker should only check for 4xx status codes? Though that would reduce the quality of the results.

@bytehead
Member

bytehead commented Mar 1, 2021

5xx should be checked as well, no?

@m-vo
Member

m-vo commented Mar 1, 2021

Yeah, you would likely want to know if something is 503 forever, for instance.

@bytehead
Member

bytehead commented Mar 1, 2021

Or just a 500 :)

@Toflar
Member

Toflar commented Mar 1, 2021

There's no point in discussing that. We won't implement workarounds for companies that think it's a good idea to invent their own HTTP status codes. They've been a standard for what, the better part of 30 years?
There's a general solution for ignoring links you don't want to have checked. Just use it and we're all good.

@bytehead
Member

bytehead commented Mar 1, 2021

Just ignore everything above 599: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status 😎

@ausi
Member

ausi commented Mar 1, 2021

If the response code is not 2xx or 3xx, the link is by definition a broken link. Ignoring any non-standard status codes would not make sense, I think. If LinkedIn returns 999 for a valid URL, they need to fix that.
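Expressed as code, a minimal sketch of that rule; the helper name is made up for illustration and is not part of Contao:

```php
<?php

// Hypothetical helper illustrating the rule above: anything outside the
// 2xx (success) and 3xx (redirect) ranges counts as a broken link,
// including non-standard codes such as LinkedIn's 999.
function isBrokenStatusCode(int $statusCode): bool
{
    return $statusCode < 200 || $statusCode >= 400;
}

var_dump(isBrokenStatusCode(200)); // bool(false)
var_dump(isBrokenStatusCode(301)); // bool(false)
var_dump(isBrokenStatusCode(404)); // bool(true)
var_dump(isBrokenStatusCode(999)); // bool(true)
```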

@fritzmg
Contributor

fritzmg commented Mar 1, 2021

LinkedIn may be using the "unofficial" 999 status code for "Request denied", indicating that the server denies the crawler's request for whatever reason, for instance because too many requests have been made or because the User-Agent was identified as a bot.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2024