Running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, why doesn't it work? #526

Closed
GenomeW opened this issue Aug 29, 2016 · 14 comments

Comments

@GenomeW

GenomeW commented Aug 29, 2016

The code is as follows:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {'headers': {
        'Content-Type':'application/x-www-form-urlencoded',
        'Accept':'*/*',
        'Accept-Encoding':'gzip, deflate',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Cache-Control':'max-age=0',
        'Connection':'keep-alive',
        'Content-Length':'295',
        'X-Requested-With': 'XMLHttpRequest',
        'Cookie':'__utmt=1; __utma=89664858.1557390068.1472454301.1472628080.1472628080.6; __utmb=89664858.3.10.1472628080; __utmc=89664858; __utmz=89664858.1472628080.5.5.utmcsr=sogou|utmccn=(organic)|utmcmd=organic|utmctr=phytozome',
        'Host':'phytozome.jgi.doe.gov',
        'Origin':'https://phytozome.jgi.doe.gov',
        'Referer':'https://phytozome.jgi.doe.gov/pz/portal.html',
        'User-Agent':'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
        'X-GWT-Module-Base':'https://phytozome.jgi.doe.gov/pz/phytoweb/',
        'X-GWT-Permutation':'80DA602CF8FBCB99E9D79278AD2DA616',
        }
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://phytozome.jgi.doe.gov/pz/portal.html#!results?search=0&crown=1&star=1&method=2296&searchText=AUX/IAA&offset=0',callback=self.detail_page,fetch_type='js')

    def index_page(self, response):
        for each in response.doc('*').items():
            self.crawl(each.attr.href, callback=self.detail_page,fetch_type='js')

    @config(priority=2)
    def detail_page(self, response):
        self.index_page(response)
        for each in response.doc('*').items():
            self.crawl(each.attr.href, callback=self.detail_page,fetch_type='js')
        return {
            "url": response.url,
            "content":response.doc("*").text()
        }

It only manages to fetch the CSS.

GenomeW changed the title from "Why running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, it doesn't work?" to "Running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, why doesn't it work?" on Aug 30, 2016

@cryptocxeq

Have you set up PhantomJS?

@GenomeW
Author

GenomeW commented Sep 4, 2016

I think I have: I am passing fetch_type='js'.

@binux
Owner

binux commented Sep 4, 2016

Debugging JavaScript rendering is very tricky; sometimes a page just doesn't work in PhantomJS.

@GenomeW
Author

GenomeW commented Sep 5, 2016

Thank you for your attention, binux. How should I deal with this issue?

@cryptocxeq

@GenomeW
Author

GenomeW commented Sep 5, 2016

Thank you for your attention, cryptocxeq. I followed that, but the result is the same.

@binux
Owner

binux commented Sep 5, 2016

You could try another approach, such as extracting the data directly from the XHR requests.
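
One reason the XHR route is worth trying: everything after `#` in the crawled URL is a fragment, which the client keeps to itself and does not send to the server. A plain fetch of that URL therefore returns only the portal.html shell, and the search results presumably arrive through separate XHR calls made by the page's JavaScript. The sketch below uses only the standard library to show which part of the URL reaches the server; the variable names are illustrative.

```python
from urllib.parse import urlsplit

# URL taken from on_start in the original post.
url = ('https://phytozome.jgi.doe.gov/pz/portal.html'
       '#!results?search=0&crown=1&star=1&method=2296'
       '&searchText=AUX/IAA&offset=0')

parts = urlsplit(url)

# What an HTTP client puts on the request line: path plus query, no fragment.
request_target = parts.path + ('?' + parts.query if parts.query else '')

print(request_target)   # the part the server sees
print(parts.fragment)   # the part only in-page JavaScript sees
```

Here the search parameters all sit in the fragment, so the server never receives them; they only matter to code running inside the page.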

@GenomeW
Author

GenomeW commented Sep 6, 2016

Ok, thanks. I'll try it.

@GenomeW
Author

GenomeW commented Sep 7, 2016

How do I deal with the request payload?

@binux
Owner

binux commented Sep 7, 2016

Do you mean the POST request payload? See http://docs.pyspider.org/en/latest/apis/self.crawl/#data
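
A minimal sketch of how the linked `data` parameter might be used for a raw payload. The pyspider documentation describes `data` as the request body and says form-encoding takes place when a dictionary is provided, so passing a plain string should send the body unchanged. The helper name `gwt_rpc_crawl_kwargs`, the constant `RPC_URL`, and the example payload are hypothetical; the real endpoint is whatever Request URL the browser's network panel shows for the XHR call, which this thread does not state.

```python
# Hypothetical placeholder: replace with the Request URL shown in the
# browser's network panel for the XHR call (not given in this thread).
RPC_URL = 'https://phytozome.jgi.doe.gov/pz/phytoweb/REPLACE_WITH_SERVICE_PATH'


def gwt_rpc_crawl_kwargs(payload):
    """Build keyword arguments for self.crawl() that POST a raw request body."""
    return {
        'method': 'POST',
        # A str, not a dict, so the body is not form-encoded.
        'data': payload,
        'headers': {
            # Header values copied from the requests quoted in this thread.
            'Content-Type': 'text/x-gwt-rpc; charset=UTF-8',
            'X-GWT-Module-Base': 'https://phytozome.jgi.doe.gov/pz/phytoweb/',
            'X-GWT-Permutation': '80DA602CF8FBCB99E9D79278AD2DA616',
        },
    }


# Made-up short payload, standing in for the string copied from the browser.
example_payload = '7|0|5|example|'
kwargs = gwt_rpc_crawl_kwargs(example_payload)

# Inside a pyspider handler this would be used roughly as:
#     self.crawl(RPC_URL, callback=self.index_page, **kwargs)
```

Since the XHR response is data rather than a page to render, `fetch_type='js'` is probably unnecessary for such a request.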

@GenomeW
Author

GenomeW commented Sep 7, 2016

Ok, thanks.

@GenomeW
Author

GenomeW commented Sep 8, 2016

How do I post the request payload? In the Chromium browser, the request payload is '7|0|9|https://phytozome.jgi.doe.gov/pz/phytoweb/|586D31F87ED8E95E9B7E23B85764FF3C|org.jgi.phyto.client.service.KWSService|fetch|org.jgi.phyto.shared.CQuery/2500317377|1|0|1111111111111111111111111111111111111111111111111111|AUX/IAA|1|2|3|4|1|5|5|6|0|0|0|0|0|0|0|0|7|7|0|0|8|0|0|7|9|0|6|0|'. My attempt seems to have no effect. The code is as follows:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'Accept':'*/*',
            'Accept-Encoding':'gzip, deflate, br',
            'Accept-Language':'zh-CN,zh;q=0.8',
            'Connection':'keep-alive',
            'Content-Type':'text/x-gwt-rpc; charset=UTF-8',
            'Cookie':'__utmt=1; __utma=89664858.688195566.1473219768.1473219768.1473230006.2; __utmb=89664858.18.10.1473230006; __utmc=89664858; __utmz=89664858.1473230006.2.2.utmcsr=sogou|utmccn=(organic)|utmcmd=organic|utmctr=phytozome',
            'Host':'phytozome.jgi.doe.gov',
            'Origin':'https://phytozome.jgi.doe.gov',
            'Referer':'https://phytozome.jgi.doe.gov/pz/portal.html',
            'User-Agent':'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/52.0.2743.116 Chrome/52.0.2743.116 Safari/537.36',
            'X-GWT-Module-Base':'https://phytozome.jgi.doe.gov/pz/phytoweb/',
            'X-GWT-Permutation':'80DA602CF8FBCB99E9D79278AD2DA616',
            'Content-Type':'application/x-www-form-urlencoded'
        }
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://phytozome.jgi.doe.gov/pz/portal.html#!results?search=0&crown=1&star=1&method=0&searchText=AUX/IAA&offset=0', fetch_type='js', callback=self.index_page, method='POST', data={'keywordsearch':'7|0|5|https://phytozome.jgi.doe.gov/pz/phytoweb/|42C87A8A4506C490536155607DC00F10|org.jgi.phyto.client.service.NodeOrgEntityService|retrieveNodeInfos|Z|1|2|3|4|1|5|0|'})

    def index_page(self, response):
        for each in response.doc('a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        self.index_page(response)
        return {
            "url": response.url
        }
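
Two details in the code above may help explain why the request seems to have no effect. First, `crawl_config` lists `Content-Type` twice, and in a Python dict literal the last value silently wins, so the `text/x-gwt-rpc` value is lost. Second, `data` is passed as a dict, which according to the pyspider documentation gets form-encoded, so the server would receive a `keywordsearch=...` form field with escaped pipes rather than the raw payload seen in the browser. The standard-library sketch below illustrates both effects; `payload_prefix` is just the first few fields of the payload, used for brevity.

```python
from urllib.parse import urlencode

# 1. Duplicate keys in a dict literal: the last value silently wins.
headers = {
    'Content-Type': 'text/x-gwt-rpc; charset=UTF-8',
    'Content-Type': 'application/x-www-form-urlencoded',
}
print(headers['Content-Type'])

# 2. Form-encoding a dict changes what is sent as the body.
payload_prefix = '7|0|5|'
form_body = urlencode({'keywordsearch': payload_prefix})
print(form_body)        # the body after form-encoding
print(payload_prefix)   # the raw text the browser sent
```

Note also that the crawl target is still the portal.html page rather than the endpoint that the browser's XHR call was sent to.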

@GenomeW
Author

GenomeW commented Sep 9, 2016

Thank you very much, binux.

@binux binux closed this as completed Dec 29, 2016
3 participants