Running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, why doesn't it work? #526

GenomeW · 2016-08-29T07:41:11Z

Code as follows:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'Content-Type':'application/x-www-form-urlencoded',
            'Accept':'*/*',
            'Accept-Encoding':'gzip, deflate',
            'Accept-Language':'zh-CN,zh;q=0.8',
            'Cache-Control':'max-age=0',
            'Connection':'keep-alive',
            'Content-Length':'295',
            'X-Requested-With': 'XMLHttpRequest',
            'Cookie':'__utmt=1; __utma=89664858.1557390068.1472454301.1472628080.1472628080.6; __utmb=89664858.3.10.1472628080; __utmc=89664858; __utmz=89664858.1472628080.5.5.utmcsr=sogou|utmccn=(organic)|utmcmd=organic|utmctr=phytozome',
            'Host':'phytozome.jgi.doe.gov',
            'Origin':'https://phytozome.jgi.doe.gov',
            'Referer':'https://phytozome.jgi.doe.gov/pz/portal.html',
            'User-Agent':'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
            'X-GWT-Module-Base':'https://phytozome.jgi.doe.gov/pz/phytoweb/',
            'X-GWT-Permutation':'80DA602CF8FBCB99E9D79278AD2DA616',
        }
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://phytozome.jgi.doe.gov/pz/portal.html#!results?search=0&crown=1&star=1&method=2296&searchText=AUX/IAA&offset=0', callback=self.detail_page, fetch_type='js')

    def index_page(self, response):
        for each in response.doc('*').items():
            self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')

    @config(priority=2)
    def detail_page(self, response):
        self.index_page(response)
        for each in response.doc('*').items():
            self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')
        return {
            "url": response.url,
            "content": response.doc("*").text()
        }

It can only fetch the CSS.


cryptocxeq · 2016-09-04T03:31:22Z

Have you set up PhantomJS?
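A quick way to check, as a minimal sketch: as far as I know, pyspider started in `all` mode only enables JavaScript rendering when a `phantomjs` executable can be found on the PATH, so setting fetch_type='js' in the handler is not sufficient on its own. The helper name `phantomjs_path` below is made up for illustration; it only uses the standard library.

```python
import shutil


def phantomjs_path(executable="phantomjs"):
    """Return the full path of the given executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = phantomjs_path()
    if path is None:
        # Assumption: without the binary, fetch_type='js' tasks cannot be rendered.
        print("phantomjs not found on PATH")
    else:
        print("phantomjs found at", path)
```

If this prints "not found", install PhantomJS and restart pyspider before debugging the handler any further.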

GenomeW · 2016-09-04T15:48:26Z

I think I have: I am using fetch_type='js'.

binux · 2016-09-04T22:08:30Z

JavaScript rendering is very tricky to debug; sometimes it just doesn't work in PhantomJS.

GenomeW · 2016-09-05T12:04:23Z

Thank you for your attention, binux. How should I deal with this issue?

cryptocxeq · 2016-09-05T13:33:24Z

Have you followed this tutorial? http://docs.pyspider.org/en/latest/tutorial/Render-with-PhantomJS/

GenomeW · 2016-09-05T14:28:01Z

Thank you for your attention, cryptocxeq. I followed it, but the result is the same.

binux · 2016-09-05T14:45:27Z

You could try another approach, such as extracting the data from the XHR requests directly.
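One reason a non-rendered fetch of the portal URL returns little beyond the page shell: everything after the `#` is a URL fragment, which stays on the client and is not sent to the server. The search parameters in the crawled URL are therefore only seen by the page's own JavaScript, which then issues the XHR requests. A small sketch using only the standard library (the constant name `PORTAL_URL` is just for illustration):

```python
from urllib.parse import urldefrag

# The URL crawled in on_start of the handler above.
PORTAL_URL = (
    "https://phytozome.jgi.doe.gov/pz/portal.html"
    "#!results?search=0&crown=1&star=1&method=2296&searchText=AUX/IAA&offset=0"
)

# Split the URL into the part sent to the server and the client-side fragment.
server_url, fragment = urldefrag(PORTAL_URL)

print("requested from server:", server_url)
print("kept in the browser:  ", fragment)
```

The first line printed is the bare portal.html address, so the search itself has to be reproduced by calling the XHR endpoint that the browser's network tab shows.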

GenomeW · 2016-09-06T01:05:01Z

Ok, thanks. I'll try it.

GenomeW · 2016-09-07T07:56:27Z

How do I deal with the request payload?

binux · 2016-09-07T08:51:37Z

Do you mean the POST request payload? See http://docs.pyspider.org/en/latest/apis/self.crawl/#data

GenomeW · 2016-09-07T12:54:19Z

Ok, thanks.

GenomeW · 2016-09-08T01:02:55Z

How do I POST a request payload? In the Chromium browser, the request payload is '7|0|9|https://phytozome.jgi.doe.gov/pz/phytoweb/|586D31F87ED8E95E9B7E23B85764FF3C|org.jgi.phyto.client.service.KWSService|fetch|org.jgi.phyto.shared.CQuery/2500317377|1|0|1111111111111111111111111111111111111111111111111111|AUX/IAA|1|2|3|4|1|5|5|6|0|0|0|0|0|0|0|0|7|7|0|0|8|0|0|7|9|0|6|0|'. My attempt seems to have no effect. Code as follows:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'Accept':'*/*',
            'Accept-Encoding':'gzip, deflate, br',
            'Accept-Language':'zh-CN,zh;q=0.8',
            'Connection':'keep-alive',
            'Content-Type':'text/x-gwt-rpc; charset=UTF-8',
            'Cookie':'__utmt=1; __utma=89664858.688195566.1473219768.1473219768.1473230006.2; __utmb=89664858.18.10.1473230006; __utmc=89664858; __utmz=89664858.1473230006.2.2.utmcsr=sogou|utmccn=(organic)|utmcmd=organic|utmctr=phytozome',
            'Host':'phytozome.jgi.doe.gov',
            'Origin':'https://phytozome.jgi.doe.gov',
            'Referer':'https://phytozome.jgi.doe.gov/pz/portal.html',
            'User-Agent':'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/52.0.2743.116 Chrome/52.0.2743.116 Safari/537.36',
            'X-GWT-Module-Base':'https://phytozome.jgi.doe.gov/pz/phytoweb/',
            'X-GWT-Permutation':'80DA602CF8FBCB99E9D79278AD2DA616',
            'Content-Type':'application/x-www-form-urlencoded'
        }
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://phytozome.jgi.doe.gov/pz/portal.html#!results?search=0&crown=1&star=1&method=0&searchText=AUX/IAA&offset=0', fetch_type='js', callback=self.index_page, method='POST', data={'keywordsearch':'7|0|5|https://phytozome.jgi.doe.gov/pz/phytoweb/|42C87A8A4506C490536155607DC00F10|org.jgi.phyto.client.service.NodeOrgEntityService|retrieveNodeInfos|Z|1|2|3|4|1|5|0|'})

    def index_page(self, response):
        for each in response.doc('a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        self.index_page(response)
        return {
            "url": response.url
        }

binux · 2016-09-08T21:00:52Z

data='7|0|9|https://phytozome.jgi.doe.gov/pz/phytoweb/|586D31F87ED8E95E9B7E23B85764FF3C|org.jgi.phyto.client.service.KWSService|fetch|org.jgi.phyto.shared.CQuery/2500317377|1|0|1111111111111111111111111111111111111111111111111111|AUX/IAA|1|2|3|4|1|5|5|6|0|0|0|0|0|0|0|0|7|7|0|0|8|0|0|7|9|0|6|0|'
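To illustrate why the string form matters, here is a minimal sketch. It assumes, as the pyspider documentation for `data` describes, that a dict is form-encoded while a string is sent as the raw body. The helper `crawl_kwargs` is hypothetical and not part of pyspider, and the XHR endpoint URL is not given in this thread, so it appears only as a placeholder in a comment.

```python
from urllib.parse import urlencode

# Request payload copied from the browser, as quoted above.
GWT_PAYLOAD = (
    "7|0|9|https://phytozome.jgi.doe.gov/pz/phytoweb/"
    "|586D31F87ED8E95E9B7E23B85764FF3C"
    "|org.jgi.phyto.client.service.KWSService|fetch"
    "|org.jgi.phyto.shared.CQuery/2500317377|1|0"
    "|1111111111111111111111111111111111111111111111111111"
    "|AUX/IAA|1|2|3|4|1|5|5|6|0|0|0|0|0|0|0|0|7|7|0|0|8|0|0|7|9|0|6|0|"
)


def crawl_kwargs(payload):
    """Hypothetical helper: keyword arguments for self.crawl with a raw body."""
    return {
        "method": "POST",
        "data": payload,  # a str, so it is not form-encoded
        "headers": {"Content-Type": "text/x-gwt-rpc; charset=UTF-8"},
    }


# What a dict such as {'keywordsearch': ...} turns into once form-encoded:
form_body = urlencode({"keywordsearch": GWT_PAYLOAD})
print(form_body[:60])    # percent-encoded, with an extra "keywordsearch=" key
print(GWT_PAYLOAD[:60])  # the pipe-delimited body the browser actually sent

# Inside a handler (sketch only; RPC_URL is the XHR URL from the network tab):
#     self.crawl(RPC_URL, callback=self.index_page, **crawl_kwargs(GWT_PAYLOAD))
```

The form-encoded version no longer contains any `|` separators, which is likely why the server ignored it. The sketch also drops fetch_type='js', on the assumption that a direct XHR call needs no rendering.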

GenomeW · 2016-09-09T00:50:11Z

Thank you very much, binux.

GenomeW changed the title ~~Why running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, it doesn't work?~~ Running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, why doesn't it work? Aug 30, 2016

binux closed this as completed Dec 29, 2016
