scraping williamhill website returns rubbish #251

Closed
hughht5 opened this Issue Jan 4, 2012 · 3 comments


@hughht5
hughht5 commented Jan 4, 2012

The simple script below returns a bunch of rubbish. It works for most websites, but not William Hill:

var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football page and dump its HTML
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  browser.wait(function () {
    console.log(browser.html());
  });
});

Run with node. Output:
S�����J����ꪙRUݒ�kf�6�����Efr2�Riz��������^�����0�X�
��{�^�a�yp��p������Ή��`��(���S]-����'N�8q�����/���?�ݻ���u;�݇�ׯ�Eiٲ>��-���3�ۗG��Ee�,���mF���MI��Q�۲������ڊ��ZG��O�J�^S��Cg���JO�緹�Oݎ����P����ET�n;v������v���D�tvJn��J��8'��햷r�v:��m��J��Z�nh�]�� �����Z����.{�Z���Ӳl�B'�.¶�D�$n�/��u"���z������Ni��"Nj��\00_I\00\��S��O�E8{"�m;��h���,o�����Q�y��;��a[��������c���q�D�띊?����/|?:�;���Z!�}���/�wے�h�<����������%�������A�K=-a��~'
(actual output is much longer)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

@keichii
keichii commented Jan 10, 2012

I think the output is gzip-compressed; you need to specify which encodings you accept in the request headers.

@hughht5
hughht5 commented Jan 10, 2012

Thanks a lot, that makes perfect sense. As I'm a noob to this, though, could you offer an example of how to decompress the gzipped response? I don't know how to edit the headers.

Thanks again,
Hugh

@assaf
Owner
assaf commented May 28, 2012

Zombie will now send an Accept-Encoding header to indicate that it does not support gzip.

@assaf assaf closed this May 28, 2012