Revise robots and search engine/crawler dealing #598

bmarwell · 2015-06-05T11:27:47Z

archive.org won't crawl my page, and I found it (now?) uses a different User Agent String: archive.org_bot

Also, for Google and others to crawl and rate the site correctly, the robots-example.txt should include:

Allow: /js/*
Allow: /themes/*
Allow: /modules_v*/*

There might be more updates needed. Low priority, though.
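For anyone who wants to check a candidate robots.txt before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the test paths are illustrative assumptions, not the shipped robots-example.txt. As far as I know, the stdlib parser matches rules by plain prefix and does not expand `*` inside paths, so the sketch writes `/js/` rather than `/js/*` and leaves out the `/modules_v*/*` pattern.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a whitelist that also lets crawlers fetch the
# assets needed for rendering. Trailing-slash prefixes are used because the
# stdlib parser does prefix matching only.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Allow: /js/
Allow: /themes/
Allow: /index.php
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "/js/app.js"),
        ("Googlebot", "/admin.php"),
        ("archive.org_bot", "/admin.php"),
    ]
    for agent, path in checks:
        print(agent, path, is_allowed(ROBOTS_TXT, agent, path))
```

Real crawlers may interpret wildcards and rule precedence differently from this parser, so treat the result as a rough sanity check only.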


vytux-com · 2015-06-06T12:06:40Z

Why would one want bots to crawl themes and js?

Also, the modules folder should not be crawled by default, in my opinion. If a site admin wants a particular module to be listed, at his peril, he can always add it himself.

fisharebest · 2015-06-06T12:35:41Z

Google runs JavaScript and loads CSS to see how the page will be rendered.

It uses this to decide if a page is "mobile friendly". For example, the space around buttons.

bmarwell · 2015-06-06T13:46:45Z

Exactly why I put this in. Try Google web console (webmaster tools) for example.

fisharebest · 2015-06-07T17:37:49Z

Historically, there were scripts in many different folders, at many different levels.
This is the reason for using the "whitelist" approach, rather than a more traditional blacklist.
But, now that the code has been restructured, this is no longer a problem.

We have prevented search-engines from indexing pages that
(a) create alternate views of the same data (e.g. charts)
(b) generate infinite pages (e.g. calendar)
(c) use many resources to generate

For modern, well-behaved, search engines, these can be fixed with markup.
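To make "fixed with markup" concrete, here is a rough sketch of the two tags I would expect to be involved: a robots meta tag and a canonical link. The helper names and the URL are hypothetical, and which pages get which tag is a judgment call, not something decided in this thread.

```python
from html import escape


def robots_meta(index: bool, follow: bool) -> str:
    """Build a robots meta tag, e.g. 'noindex, follow' for calendar pages."""
    content = ", ".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    return f'<meta name="robots" content="{content}">'


def canonical_link(url: str) -> str:
    """Point an alternate view (e.g. a chart) at the page it duplicates."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(robots_meta(index=False, follow=True))
    print(canonical_link("https://example.org/individual.php?pid=I1&ged=demo"))
```

Note that a crawler has to fetch the page to see these tags, so they help with duplicate or endless content more than with pages that are expensive to generate.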

For badly behaved robots (and robots which ignore robots.txt), we have the "site-access-rules".
The site-access-rules are IPv4 only. They treat new/unrecognised browsers as robots. This is no longer a reasonable assumption, as new browsers are created frequently.
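As an illustration of what address-family-neutral rules could look like, here is a small sketch using Python's standard `ipaddress` module. The rule table, the `access_for` helper, and the default of "allow" for unmatched clients are all assumptions for the example; this is not how the existing site-access-rules are implemented.

```python
import ipaddress

# Hypothetical rule table mixing IPv4 and IPv6 ranges (documentation prefixes).
RULES = [
    (ipaddress.ip_network("192.0.2.0/24"), "deny"),
    (ipaddress.ip_network("2001:db8::/32"), "deny"),
]


def access_for(ip: str, default: str = "allow") -> str:
    """Return the action of the first matching rule, else the default."""
    addr = ipaddress.ip_address(ip)
    for network, action in RULES:
        # Compare only within the same address family.
        if addr.version == network.version and addr in network:
            return action
    # Unmatched clients get the default instead of being treated as robots.
    return default


if __name__ == "__main__":
    for ip in ("192.0.2.10", "2001:db8::1", "198.51.100.7"):
        print(ip, access_for(ip))
```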

Perhaps it is time to review all this, and think differently.

On-click handlers can prevent robots from following links, and AJAX loaders can prevent them from loading pages.

bmarwell · 2015-06-07T19:23:33Z

Nowadays, Google and others will crawl your site with both agents. If the results look different, they will probably downrate your site.

bmarwell · 2015-06-08T14:22:59Z

I just took a look at what Google saw on my site. The whole right navbar is missing, so the individuals are barely linked to each other. Only the few pages I linked to from my homepage are indexed, despite the fact that I am using a sitemap as well.
It seems there is a penalty after all. I updated the robots.txt and the permissions for user agents to allow everything but admin*.php. Let's see what happens.

If there are "bad crawlers" and they cause CPU usage, they could also be blocked more efficiently in a .htaccess file. Software shouldn't need to worry about it anyway, just manage and serve content to whoever requests it. /imho

I'm open to other opinions: are there any drawbacks to this approach?

vytux-com · 2015-06-09T10:34:27Z

I believe that Google will now find the same information replicated on several different pages of your site and will thus move your site further down the listing.

At least that was the reason why certain pages were blocked (such as the calendar, etc.)

bmarwell · 2015-06-09T11:12:53Z

Sure. The calendar mainly has links to the individuals (which, for me, are now identified with my json-ld-plugin). I'd rather set a higher value for individual.php in the sitemap than block the calendar.
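A sketch of what weighting pages in the sitemap could look like, using the `<priority>` element from the sitemaps.org protocol. The URLs, query parameters, and priority values are made up for illustration, and search engines are free to ignore the hint.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """entries: iterable of (url, priority) pairs, priority in 0.0..1.0."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # '&' is escaped on output
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical URLs: individuals weighted above the calendar.
    print(build_sitemap([
        ("https://example.org/individual.php?pid=I1&ged=demo", 0.8),
        ("https://example.org/calendar.php?ged=demo", 0.2),
    ]))
```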

I do understand this reason very well, but since search engines got a lot more intelligent in the past few years, I'm not sure this applies anymore.

Also, they are looking for CONTENT. I'm not sure there is a lot of it at all. Just a few dates, places and images, but no paragraphs full of text. I'll have my website crawled and see what will happen.

Amgine0 · 2015-06-09T18:47:27Z

This is, in part, why ancestry.com has recently moved to templating data presentation. "On {$child_birthdate} {$name}'s {$child_type} was born in {$child_birth_place}..." rather than the raw, minimalist data. However, Google ranks even higher pages containing structured data; I don't know about other search engines. Unfortunately, they do not yet support ged. They do support event data structures which might be used?
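A toy version of that kind of templating, using Python's `string.Template`. The placeholder names follow the example sentence above; the sample values are invented, and this is only a guess at the general idea, not ancestry.com's actual implementation.

```python
from string import Template

# Placeholder names taken from the example sentence in this comment.
BIRTH_SENTENCE = Template(
    "On $child_birthdate, ${name}'s $child_type was born in $child_birth_place."
)


def render_birth(facts: dict) -> str:
    """Turn raw facts into a readable sentence; raises KeyError if one is missing."""
    return BIRTH_SENTENCE.substitute(facts)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    print(render_birth({
        "child_birthdate": "12 May 1901",
        "name": "John Doe",
        "child_type": "daughter",
        "child_birth_place": "Hamburg",
    }))
```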

Amgine0 · 2015-06-09T18:51:21Z

They also apparently parse, but might not use in rank, the Person schema.
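For reference, a minimal sketch of what Person markup could look like when emitted as JSON-LD. The property names (`name`, `birthDate`, `birthPlace`) come from schema.org; the helper and the sample values are hypothetical, and which fields a search engine actually uses is not something I can confirm.

```python
import json


def person_jsonld(name, birth_date=None, birth_place=None):
    """Return a <script> element carrying a schema.org Person as JSON-LD."""
    data = {"@context": "https://schema.org", "@type": "Person", "name": name}
    if birth_date:
        data["birthDate"] = birth_date  # ISO 8601 date string
    if birth_place:
        data["birthPlace"] = {"@type": "Place", "name": birth_place}
    payload = json.dumps(data, ensure_ascii=False)
    # Keep a literal "</script>" in the data from closing the element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    print(person_jsonld("John Doe", "1870-03-04", "Hamburg"))
```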

bmarwell · 2015-06-09T19:09:46Z

This is why I created… https://github.com/bmhm/webtrees-jsonld ;-)

bmarwell changed the title ~~Update robots-example.txt~~ Revise robots and search engine/crawler dealing Jun 8, 2015

fisharebest closed this as completed in b9c82a0 Jun 18, 2015

vonarezen mentioned this issue Mar 12, 2021

Source citation no longer a link #3763

Closed
