Crawlable removed? #169

Open
polyclick opened this issue Mar 12, 2013 · 5 comments

Comments

@polyclick

Why was the crawlable feature removed in 1.6? Any further explanation would be awesome. ;)

@asual
Owner

asual commented Mar 14, 2013

I don't consider the crawlable mechanism a good long-term investment. The new History API removes the need for hashes completely and provides a much better base for search engine indexing. Here are sample queries that show how the content can be indexed:
https://www.google.com/search?q=site:asual.com+%22About+content.%22
http://search.yahoo.com/search?p=site%3Aasual.com+%22About+content.%22

@benjaminhorner

Hi. I am not too sure what your explanation demonstrates…
As far as I can see, Google is still recommending the #! method (https://developers.google.com/webmasters/ajax-crawling/docs/getting-started).
What is the alternative you are talking about?
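
For context, the #! method in the linked docs has the crawler rewrite a hashbang URL into an `_escaped_fragment_` query parameter and request that URL from the server instead. A rough sketch of the mapping, as I understand the scheme (the function name and the example URLs are made up for illustration, and the escaping is simplified):

```javascript
// Sketch of how a crawler following the #! scheme would rewrite a URL.
// toEscapedFragmentUrl is a hypothetical helper, not part of any library.
function toEscapedFragmentUrl(url) {
  const index = url.indexOf('#!');
  if (index === -1) {
    return url; // no hashbang, nothing to rewrite
  }
  const base = url.slice(0, index);
  const fragment = url.slice(index + 2);
  // Escape control characters, space, '#', '%', '&', '+', and non-ASCII characters.
  const escaped = fragment.replace(
    /[\x00-\x20#%&+\x7f-\u{10ffff}]/gu,
    (ch) => encodeURIComponent(ch)
  );
  // Append to an existing query string if there is one.
  const separator = base.includes('?') ? '&' : '?';
  return base + separator + '_escaped_fragment_=' + escaped;
}

console.log(toEscapedFragmentUrl('http://example.com/#!/about'));
// prints "http://example.com/?_escaped_fragment_=/about"
```

The server is then expected to answer the rewritten URL with an HTML snapshot of what the hashbang URL would show in a browser.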

@dbeja

dbeja commented Jun 12, 2013

If, for example, a user visiting the site with IE9 (there are still many) pastes the URL into Facebook to share it, the site is crawled incorrectly, because the URL doesn't have the hashbang for old browsers.

@trusktr

trusktr commented Aug 2, 2013

@asual As far as I know, the Google search bot indexes only the initial content it receives for the page; it doesn't wait for JavaScript to execute or for the script to load AJAX content. So in the example you linked, the initial content is "Home content", which is what should be indexed. How is the search bot able to index the "About content"?

Specifically, what needs to be done to make this work? I'm sure if you tell people to just "use the History API", it won't work for everyone.

If your About tab is indeed being indexed, then maybe the search bot is actually executing JavaScript and waiting some number of seconds before indexing the final generated content. I highly doubt Google would make their servers do this for every single page on the internet, though; it would require an extremely high use of CPU and other resources on Google's end. That would explain why they delegate this to developers to implement on their own servers, and why Google suggests using a headless browser to create snapshots.

Why would Google suggest that we use a headless browser if Google is using a headless browser already? Something's up here. How is your linked example not just a special (accidental?) case?

Maybe it is true: Google has decided to parse and execute the JavaScript on every single page of the internet, but I seriously doubt it. You'd need a really REALLY enormous computer to do that, probably more powerful than any computer we have today. There are billions of websites in existence, most of them using JavaScript. How can Google possibly index billions of sites (including executing all of their scripts)?

Let's think about it. Retrieving static HTML and indexing the content without executing JavaScript is one thing, but retrieving HTML and then executing JavaScript must be on average 10 to 10000 times slower depending on the site, and prone to crashes, lockups, endless loops, and hacks that try to take advantage of Google.

Is Google really doing this???

Just curious. Food for thought. :)

@trusktr

trusktr commented Aug 2, 2013

Ok, I thought about this a little more:

Google will be able to index your content when you use the History API if and only if the server sends identical content when that URL is visited directly in the browser. It won't work if you rely on reading the URL and then using AJAX to fetch and show content based on it.

Example: your site uses history.pushState and loads content via AJAX. If, when a user visits a URL you saved to the history, the server sends back the page in the same form as it appeared when it was generated by AJAX, then the History API works fine. But if your app relies solely on AJAX to generate content (that is, no matter which URL you visit, the server always sends exactly the same initial HTML, and the app then retrieves data based on the URL and displays it with AJAX), then Google will not be able to index the site.

So simply telling someone to use the History API isn't the entire solution. The AJAX site still needs to be able to send an initial HTML snapshot of the page for each URL, which is a big task for someone who designed their site to retrieve data as JSON rather than HTML. Those users will have to make heavy modifications to ensure the app sends HTML on each initial request.
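
To make that concrete, here is a minimal sketch of a server that uses one renderer for both cases, so a direct visit to a URL returns the same content the AJAX call would have inserted. Everything here is an assumption for illustration: the `pages` store, the function names, and the use of the `X-Requested-With: XMLHttpRequest` header (which jQuery sends on same-origin AJAX requests) to tell the two kinds of request apart.

```javascript
// Hypothetical content store; a real app would use templates or a database.
const pages = new Map([
  ['/', { title: 'Home', body: 'Home content.' }],
  ['/about', { title: 'About', body: 'About content.' }],
]);

// Renders only the inner content. Used for AJAX responses AND embedded in full pages,
// so both paths always produce the same markup for a given URL.
function renderFragment(page) {
  return `<h1>${page.title}</h1><p>${page.body}</p>`;
}

// Wraps the fragment in a complete document for direct visits (and crawlers).
function renderFullPage(page) {
  return '<!DOCTYPE html><html><head><title>' + page.title + '</title></head>' +
    '<body><div id="content">' + renderFragment(page) + '</div></body></html>';
}

// Decides what to send for a request path and its (lower-cased) headers.
function respond(path, headers = {}) {
  const page = pages.get(path);
  if (!page) {
    return { status: 404, body: 'Not found' };
  }
  const isAjax = headers['x-requested-with'] === 'XMLHttpRequest';
  return { status: 200, body: isAjax ? renderFragment(page) : renderFullPage(page) };
}
```

With this shape, a crawler requesting `/about` directly gets "About content." in the initial HTML, while the in-page navigation can still fetch just the fragment and call history.pushState.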

A solution for apps relying on JSON would be to use a headless browser on the initial request to generate the HTML and send that; from that moment on, the app can continue to use JSON as normal.
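
A rough sketch of that idea, assuming you already have some way to drive a headless browser. The `renderSnapshot` function below is hypothetical (it stands in for whatever headless browser wrapper you use and should resolve to the generated HTML for a URL), and the in-memory cache is my own addition so that the expensive rendering happens only once per URL:

```javascript
// Wraps a hypothetical renderSnapshot(url) => Promise<string> with a per-URL cache.
function createSnapshotCache(renderSnapshot) {
  const cache = new Map(); // url -> Promise resolving to the HTML snapshot

  return function getSnapshot(url) {
    if (!cache.has(url)) {
      const pending = Promise.resolve()
        .then(() => renderSnapshot(url))
        .catch((err) => {
          cache.delete(url); // don't keep failed renders; allow a retry later
          throw err;
        });
      cache.set(url, pending);
    }
    return cache.get(url);
  };
}

// Usage with a stand-in renderer (a real one would launch a headless browser).
const getSnapshot = createSnapshotCache(async (url) => `<p>snapshot of ${url}</p>`);
getSnapshot('/about').then((html) => console.log(html));
```

The server would call `getSnapshot` only for initial page requests and keep answering the app's JSON requests as before. A real setup would also need to invalidate cached snapshots when the underlying data changes.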
