Revise robots and search engine/crawler dealing #598
Comments
Why would one want bots to crawl themes and JS? Also, the modules folder should not be crawled by default, in my opinion. If a site admin wants a particular module to be listed, at his own peril, he can always add it himself. |
Google runs JavaScript and loads CSS to see how the page will be rendered. It uses this to decide if a page is "mobile friendly". For example, the space around buttons. |
Exactly why I put this in. Try Google web console (webmaster tools) for example. |
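To make the point about CSS and JS concrete, here is a minimal sketch that checks a candidate rule set with Python's standard `urllib.robotparser`. The rules, folder names, and example.com URLs are assumptions made up for illustration; they are not the contents of robots-example.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep a data folder private, leave theme assets crawlable.
RULES = """\
User-agent: *
Disallow: /data/
Allow: /themes/
""".splitlines()


def allowed(user_agent, url, rules=RULES):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rules from memory, no network access
    return parser.can_fetch(user_agent, url)


# Assumed URLs, for illustration only.
css_ok = allowed("Googlebot", "https://example.com/themes/demo/style.css")
data_ok = allowed("Googlebot", "https://example.com/data/tree.ged")
print(css_ok, data_ok)
```

Running a check like this before deploying a changed robots.txt shows whether the stylesheets and scripts a crawler needs for rendering are still reachable.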
Historically, there were scripts in many different folders, at many different levels. We have prevented search engines from indexing pages that [...] For modern, well-behaved search engines, these can be fixed with markup. For badly behaved robots (and robots which ignore robots.txt), we have the "site-access-rules". Perhaps it is time to review all this, and think differently.
|
Nowadays, Google and others will crawl your site with both agents. If the results look different, they will probably downrank your site. |
I just took a look at what Google sees on my site. The whole right navbar is missing, so the individuals are barely linked to each other. Only the few pages I linked to from my homepage are indexed, despite the fact that I am using a sitemap as well. If there are "bad crawlers" that cause CPU usage, they could be blocked more efficiently in a .htaccess file. The software shouldn't need to worry about that anyway; it should just manage and serve content to whoever requests it. /imho I'm open to other opinions: are there any drawbacks to this approach? |
I believe that Google will now find the same information replicated on several different pages of your site and will thus move your site further down the listing. At least that was the reasoning for blocking certain pages (such as the calendar, etc.). |
Sure. The calendar mainly has links to the individuals (which, for me, are now identified with my json-ld-plugin). I'd rather set a higher priority for individual.php in the sitemap than block the calendar. I understand this reasoning very well, but since search engines have become a lot more intelligent in the past few years, I'm not sure it still applies. Also, they are looking for CONTENT, and I'm not sure there is much of it at all: just a few dates, places and images, but no paragraphs full of text. I'll have my website crawled and see what happens. |
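The sitemap idea above could look roughly like the following sketch, which writes `<priority>` values from the sitemaps.org protocol using only the standard library. The URLs, the `pid` query parameter, and the priority numbers are assumptions for illustration; search engines treat priority as a hint at most.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, priority) pairs and return it as a string."""
    ET.register_namespace("", SITEMAP_NS)  # emit a default xmlns, no prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, priority in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: individuals rank above the calendar instead of blocking it.
xml_text = build_sitemap([
    ("https://example.com/individual.php?pid=I1", 0.8),
    ("https://example.com/calendar.php", 0.2),
])
print(xml_text)
```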
This is, in part, why ancestry.com has recently moved to templating data presentation. "On {$child_birthdate} {$name}'s {$child_type} was born in {$child_birth_place}..." rather than the raw, minimalist data. However, Google ranks even higher pages containing structured data; I don't know about other search engines. Unfortunately, they do not yet support ged. They do support event data structures which might be used? |
They also apparently parse, but might not use in rank, the Person schema. |
This is why I created… https://github.com/bmhm/webtrees-jsonld ;-) |
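For readers unfamiliar with the Person schema mentioned above, here is a small sketch of what such structured data can look like when built in Python. It is not taken from the webtrees-jsonld plugin; the function name and the sample person are made up, and only the schema.org properties `name`, `birthDate` and `birthPlace` are used.

```python
import json


def person_jsonld(name, birth_date=None, birth_place=None):
    """Return a schema.org Person description as a JSON-LD string."""
    doc = {"@context": "https://schema.org", "@type": "Person", "name": name}
    if birth_date:
        doc["birthDate"] = birth_date  # ISO 8601 date string
    if birth_place:
        doc["birthPlace"] = {"@type": "Place", "name": birth_place}
    return json.dumps(doc, indent=2)


# Made-up sample data, for illustration only.
sample = person_jsonld("Jane Doe", "1901-02-03", "Springfield")
print(sample)
```

The resulting string would typically be embedded in the page inside a `<script type="application/ld+json">` element.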
archive.org won't crawl my page, and I found it (now?) uses a different User Agent String:
archive.org_bot
Also, for Google and others to crawl and rate the site correctly, the robots-example.txt should include:
There might be more updates needed. Low priority, though.
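One way a changed user agent string can lock a crawler out is sketched below, again with `urllib.robotparser`. The rule set and the agent name `legacy-archiver` are hypothetical and only stand in for whatever older name a robots.txt might list; only `archive.org_bot` comes from the report above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named archiver is allowed, everyone else is blocked.
RULES = """\
User-agent: legacy-archiver
Disallow:

User-agent: *
Disallow: /
""".splitlines()


def can_crawl(user_agent, url="https://example.com/index.php"):
    """Check whether the hypothetical RULES allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(user_agent, url)


old_agent = can_crawl("legacy-archiver")   # matches its own group
new_agent = can_crawl("archive.org_bot")   # falls through to the "*" group
print(old_agent, new_agent)
```

Under these assumed rules the new agent name matches no named group and is caught by the catch-all block, which is why the named groups need updating when a crawler renames itself.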