Revise robots and search engine/crawler dealing #598

Closed
bmarwell opened this issue Jun 5, 2015 · 11 comments

Comments

@bmarwell
Contributor

bmarwell commented Jun 5, 2015

archive.org won't crawl my page, and I found it (now?) uses a different User Agent String: archive.org_bot

Also, for Google and others to crawl and rate the site correctly, the robots-example.txt should include:

Allow: /js/*
Allow: /themes/*
Allow: /modules_v*/*

There might be more updates needed. Low priority, though.
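To illustrate what the `Allow` lines above would do, here is a minimal sketch of Google-style rule matching, where `*` is a wildcard and the longest matching pattern wins. The catch-all `Disallow: /` rule, the function names, and the sample paths are assumptions for illustration only; other crawlers may resolve conflicting rules differently.

```python
from __future__ import annotations

import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' wildcard, optional '$' anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest matching pattern wins; on a tie, 'allow' wins. No match means allowed."""
    best = None  # (pattern length, allowed?)
    for kind, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# Assumed whitelist-style file: block everything, then open up the asset folders.
RULES = [
    ("disallow", "/"),          # assumption, not taken from robots-example.txt
    ("allow", "/js/*"),
    ("allow", "/themes/*"),
    ("allow", "/modules_v*/*"),
]

print(is_allowed("/js/app.js", RULES))   # asset path under /js/
print(is_allowed("/admin.php", RULES))   # only the catch-all rule matches
```

Under these assumptions, a trailing `*` behaves the same as a plain prefix, so `Allow: /js/` would be equivalent for crawlers that do not understand wildcards.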

@vytux-com
Contributor

Why would one want bots to crawl themes and js?

Also, the modules folder should not be crawled by default, in my opinion. If a site admin wants a particular module to be listed, at his peril, he can always add it himself.

@fisharebest
Owner

Google runs JavaScript and loads CSS to see how the page will be rendered.

It uses this to decide if a page is "mobile friendly". For example, the space around buttons.

@bmarwell
Contributor Author

bmarwell commented Jun 6, 2015

That is exactly why I put this in. Try Google Search Console (formerly Webmaster Tools), for example.

@fisharebest
Owner

Historically, there were scripts in many different folders, at many different levels.
This is the reason for using the "whitelist" approach, rather than a more traditional blacklist.
But, now that the code has been restructured, this is no longer a problem.

We have prevented search-engines from indexing pages that
(a) create alternate views of the same data (e.g. charts)
(b) generate infinite pages (e.g. calendar)
(c) use many resources to generate

For modern, well-behaved, search engines, these can be fixed with markup.

For badly behaved robots (and robots which ignore robots.txt), we have the "site-access-rules".
The site-access-rules are IPv4 only. They treat new/unrecognised browsers as robots. This is no longer a reasonable assumption, as new browsers are created frequently.
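As a rough illustration of rules that are not tied to IPv4, here is a minimal sketch using Python's standard `ipaddress` module. The rule list, the `action_for` name, and the documentation-reserved address ranges are all hypothetical; this is not how the site-access-rules are actually stored or evaluated.

```python
import ipaddress

# Hypothetical rules as (network, action) pairs, using documentation-only ranges.
RULES = [
    (ipaddress.ip_network("192.0.2.0/24"), "deny"),    # IPv4 example range
    (ipaddress.ip_network("2001:db8::/32"), "deny"),   # IPv6 example range
]


def action_for(ip_text: str, default: str = "allow") -> str:
    """Return the action of the first rule whose network contains the address."""
    ip = ipaddress.ip_address(ip_text)
    for network, action in RULES:
        # Only compare addresses and networks of the same IP version.
        if ip.version == network.version and ip in network:
            return action
    return default


print(action_for("192.0.2.55"))
print(action_for("2001:db8::1"))
print(action_for("198.51.100.7"))
```

The default of "allow" for unmatched addresses is a deliberate choice in this sketch: it avoids treating every unknown visitor as a robot.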

Perhaps it is time to review all this, and think differently.

  • on-click handlers can prevent robots from following links
  • ajax loaders can prevent them from loading pages.

@bmarwell
Contributor Author

bmarwell commented Jun 7, 2015

Nowadays, Google and others will crawl your site with both agents. If the results look different, they will probably downrank your site.

bmarwell changed the title from "Update robots-example.txt" to "Revise robots and search engine/crawler dealing" on Jun 8, 2015
@bmarwell
Contributor Author

bmarwell commented Jun 8, 2015

I just took a look at what Google saw on my site. The whole right navbar is missing, so the individuals are barely linked to each other. Only the few pages I linked to from my homepage are indexed, even though I am using a sitemap as well.
It seems there is a penalty after all. I updated the robots.txt and the permissions for user agents to allow everything but admin*.php. Let's see what happens.

If there are "bad crawlers" and they cause CPU usage, they could also be blocked more efficiently in a .htaccess file. Software shouldn't need to worry about it anyway, just manage and serve content to whoever requests it. /imho

I'm open to other opinions: are there any drawbacks to this approach?

@vytux-com
Contributor

I believe that Google will now find the same information replicated on several different pages of your site and will thus move your site further down the listing.

At least that was the reasoning for blocking certain pages (such as the calendar, etc.).

@bmarwell
Contributor Author

bmarwell commented Jun 9, 2015

Sure. The calendar mainly has links to the individuals (which, on my site, are now identified with my json-ld-plugin). I'd rather set a higher priority for individual.php in the sitemap than block the calendar.
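A minimal sketch of that idea, using the `<priority>` element from the sitemaps.org protocol. The `example.org` URLs, the `pid` query parameter, the `calendar.php` file name, and the priority values are placeholders, not taken from any real sitemap.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap from (url, priority) pairs; priority ranges from 0.0 to 1.0."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs: individual pages get a higher priority than the calendar.
sitemap_xml = build_sitemap([
    ("https://example.org/individual.php?pid=I1", 0.8),
    ("https://example.org/calendar.php", 0.2),
])
print(sitemap_xml)
```

Note that the priority is only a hint relative to other pages on the same site; a search engine is free to ignore it.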

I do understand this reason very well, but since search engines got a lot more intelligent in the past few years, I'm not sure this applies anymore.

Also, they are looking for CONTENT. I'm not sure there is a lot of it at all: just a few dates, places and images, but no paragraphs full of text. I'll have my website crawled and see what happens.

@Amgine0

Amgine0 commented Jun 9, 2015

This is, in part, why ancestry.com has recently moved to templating data presentation: "On {$child_birthdate} {$name}'s {$child_type} was born in {$child_birth_place}..." rather than the raw, minimalist data. However, Google ranks pages containing structured data even higher; I don't know about other search engines. Unfortunately, they do not yet support ged. They do support event data structures, which might be usable?
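A minimal sketch of such sentence templating, using the placeholder names from the quoted template. The sample record is invented for illustration, and the function name is hypothetical.

```python
SENTENCE = "On {child_birthdate}, {name}'s {child_type} was born in {child_birth_place}."


def describe_birth(fact: dict) -> str:
    """Fill the sentence template from a dict of raw fact fields."""
    return SENTENCE.format(**fact)


# Invented sample record, for illustration only.
sample = {
    "child_birthdate": "12 March 1880",
    "name": "Jane Doe",
    "child_type": "daughter",
    "child_birth_place": "Springfield",
}
print(describe_birth(sample))
```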

@Amgine0

Amgine0 commented Jun 9, 2015

They also apparently parse, but might not use in rank, the Person schema.
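For reference, a minimal sketch of what Person markup could look like as JSON-LD, using the schema.org `Person` properties `name`, `birthDate` and `birthPlace`. The function name and the sample values are made up, and this is not meant to describe any existing plugin's output.

```python
import json


def person_jsonld(name, birth_date=None, birth_place=None):
    """Serialise a minimal schema.org Person as a JSON-LD string."""
    data = {"@context": "https://schema.org", "@type": "Person", "name": name}
    if birth_date:
        data["birthDate"] = birth_date  # ISO 8601 date, e.g. "1880-03-12"
    if birth_place:
        data["birthPlace"] = {"@type": "Place", "name": birth_place}
    return json.dumps(data, indent=2)


# Made-up sample values, for illustration only.
markup = person_jsonld("Jane Doe", "1880-03-12", "Springfield")
print('<script type="application/ld+json">')
print(markup)
print("</script>")
```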

@bmarwell
Contributor Author

bmarwell commented Jun 9, 2015

This is why I created… https://github.com/bmhm/webtrees-jsonld ;-)


4 participants