Phantom-Snatch is an out of the box installation for an HTML snapshot service based on PhantomJS. It implements Google's and Bing's contract of indexing Ajax-based webpages in a generalized and scalable way.
The provisioning has been tested with AWS Opsworks using Ubuntu 12.04 LTS with Chef 11.4 and Ruby 1.9.3. There is no reason why the provisioning should not also work with newer versions.
- If you don't have an AWS account
- sign up (it's for free)
- sign into AWS console
- choose the OpsWorks service from the menu
- Create a new stack with the following options
- Default operating system: Ubuntu 12.04 LTS
- Chef version: 11.4
- Use custom Chef cookbooks: yes
- Repository type: Git
- Repository URL: git://github.com/bitsuppliers/phantom-snatch
- Add a new layer of type "Custom" to the stack
- Edit the created layer using the following options:
- Chef Recipes
- Setup: phantomjs
- Automatically Assign IP Addresses
- Public IP addresses: yes
- Elastic IP addresses: yes
- Add an instance
- instance size: micro
- Verify your installation
- wait until instance status is "online" (pick a coffee)
- copy your instance's IP address into a browser
- you know that the installation is complete when the server responds the following:
Requests must contain headers "X-Forwarded-Host" and "X-Forwarded-Proto" to evaluate base url
- Provide a vanilla Ubuntu
- Install Git and Chef Solo
- Clone phantom-snatch
git clone https://github.com/bitsuppliers/phantom-snatch
- Initialize submodules within phantom-snatch
cd phantom-snatch
git submodule init
- Run recipe "phantomjs" with Chef Solo
- Verify your installation
- wait until chef run is complete
- copy your instance's IP adress or hostname into a browser
- you know that the installation is complete when the server responds the following:
Requests must contain headers "X-Forwarded-Host" and "X-Forwarded-Proto" to evaluate base url
- to prevent (miss-)use by third parties, you may want to setup an IP-restriction or firewall rules for your intance(s)
- to provide redundancy or scalability, you may want to spawn multiple instances and add a load balancer
- to not rely on IPs, you may want to assign a hostname for your instance(s) (e.g. snapshots.example.com)
If you want to take a snapshot of
https://my.domain/#!/crawlable
then you can now call
http://phantom.snatch.hostname/?_escaped_fragment_=/crawlable
with headers
X-Forwarded-Proto: https
and X-Forwarded-Host: my.domain
Rendering Ajax-Pages has one major pitfall: Because of Ajax's asynchronous behaviour, it is hard for the renderer to say when the page is ready for a snapshot. The most robust and performant way to solve this, is to let the page actively signal that it's ready.
You can signal Phantom-Snatch that a page is ready by setting the attribute "data-status" of the "body" tag to "ready":
<body data-status="ready">
As soon as Phantom-Snatch gets this signal, it will take the snapshot and write its response. If PhantomJS never sees a signal, it will take a snapshot after 5 seconds and return whatever it sees then. Because performance has an impact on page rankings, it's highly recommended to signal Phantom-Snatch page-readyness instead of relying on a timeout.
You can make handling snapshots transparent if you use nginx to serve content for "my.domain".
location / {
# Your current configuration for root
proxy_set_header X-Forwarded-Proto https; # should be http or https
proxy_set_header X-Forwarded-Host $host; # may also be a fixed value
if ($args ~ "_escaped_fragment_=(.*)") {
proxy_pass http://phantom.snatch.hostname; # Address of your phantom-snatch instance
}
}
}
Now, nginx will forward every request containing an "_escaped_fragment_" query parameter to your snapshot server and return the result to the crawler.
The equivalent is possible with Apache (just missing a tested example).
The equivalent is possible with Varnish (just missing a tested example).
Tests have shown a pretty good performance for medium-complex pages. On an AWS micro instance (1 Core, 768 MB RAM) the avarage return time varied between 1-2 seconds. Even better results should be possible with instances dedicated to computing.
There is not enough experience yet with larger amounts of load to make final conclusions. Experience so far suggests that you can easily render pages in real time without any need to take screenshots in advance.
This said, it may still be advisable to cache already rendered snapshots on the proxy. Nginx, Apache and Varnish can easily be configured to add caching to their reverse proxy capabilities. This cache can then be prewarmed on deployment or on data changes by just calling the same URL as the crawler would.
Phantom-Snatch is the outcome of a concrete project with a concrete setting. Join us to make Phantom-Snatch useful and helpful for as many people as possible by making hard-coded settings configurable and by modifying chef recipes to support more platforms.
How can you contribute? Fork Phantom-Snatch and make all changes needed to make it work in your environment. Try to stay backward compatible so that we can merge your fork afterwards.
Moreover, feedback like "this worked nice on Ubuntu 13.x" is very welcome.