
How to get a value when an HTML element contains a dynamically generated <script> tag #289

Closed
wanghaisheng opened this issue Sep 25, 2015 · 9 comments

@wanghaisheng

There is an HTML fragment like this one:

<span>阅读:<script src="http://iapi.iiyi.com/v1/news/click/?view=yes&amp;id=223925&amp;mid=128"></script>77</span>

If I use the following:

            "readNumber": response.doc('span').text(),

it will return "阅读:" without the "77"; what I want is the 77.

@binux
Owner

binux commented Sep 26, 2015

Can you reproduce it on demo.pyspider.org?

@wanghaisheng
Author

@binux you can see it here: http://demo.pyspider.org/debug/issue289

@binux
Owner

binux commented Sep 27, 2015

Did you save it?

@wanghaisheng
Author

Please try again.

@binux
Owner

binux commented Sep 27, 2015

The number is generated by JavaScript. You can make another HTTP call to the script address to get it, or use PhantomJS to render the page with JavaScript support.
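
A minimal sketch of the first option, making a separate HTTP request to the counter script's URL from the fragment above. The callback name read_number and the digit-regex parsing are assumptions about what the endpoint returns, not something confirmed in this thread:

from pyspider.libs.base_handler import *
import re


class Handler(BaseHandler):
    def on_start(self):
        # request the counter script's URL (the src of the <script> tag) directly
        self.crawl('http://iapi.iiyi.com/v1/news/click/?view=yes&id=223925&mid=128',
                   callback=self.read_number)

    def read_number(self, response):
        # the endpoint is assumed to return a small JavaScript snippet that
        # contains the count; pull the first run of digits out of it
        match = re.search(r'\d+', response.text)
        return {"readNumber": match.group(0) if match else None}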

@wanghaisheng
Author

@binux

            "readNumber": response.doc('DIV.b-mh-time>span:eq(1)>script.src').text(),

or

            "readNumber": response.doc('DIV.b-mh-time>span:eq(1)').text(),

Neither works for the following:

<span>阅读:<script src="http://iapi.iiyi.com/v1/news/click/?view=yes&amp;id=223925&amp;mid=128"></script>77</span>

@wanghaisheng
Author

Do you mean using PhantomJS to render, like this?

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-01-04 03:35:01
# Project: tutorial_pinterest

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    """
    This is a sample scrape script for: Level 3: Render with PhantomJS
    http://docs.pyspider.org/en/latest/tutorial/Render-with-PhantomJS/#running-javascript-on-page
    """

    def on_start(self):
        self.crawl('http://iapi.iiyi.com/v1/news/click/?view=yes&id=223925&mid=128',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "readNumber": response.doc('body').text().split('(')[1].split(')')[0],

        }

@binux
Owner

binux commented Sep 28, 2015

The number 77 doesn't exist without JavaScript. Enable PhantomJS for pages like http://www.iiyi.com/d-07-217261.html
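
In pyspider terms, enabling PhantomJS just means crawling the article page with fetch_type='js'. A minimal sketch, where the callback name detail_page is assumed and the selector is the one used earlier in this thread:

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # fetch_type='js' renders the page with PhantomJS so the counter
        # script runs and "77" becomes part of the DOM before parsing
        self.crawl('http://www.iiyi.com/d-07-217261.html',
                   fetch_type='js', callback=self.detail_page)

    def detail_page(self, response):
        # the span's text now includes the number written by the script
        return {"readNumber": response.doc('DIV.b-mh-time>span:eq(1)').text()}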

@wanghaisheng changed the title from "how to get value if a html tag contain script" to "how to get value if a html element contain dynamic generated <script> tag" on Sep 28, 2015
@wanghaisheng
Author

This one works perfectly:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-09-25 15:35:56
# Project: aiaiyi

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.iiyi.com/', callback=self.keshidaohang_page)

    @config(age=10 * 24 * 60 * 60)
    def keshidaohang_page(self, response):
        for each in response.doc('P.w_nav1>a').items():
            self.crawl(each.attr.href, callback=self.front_keshi_page)

    @config(age=24 * 60 * 60)
    def front_keshi_page(self, response):
        for each in response.doc('DIV.w_conl1:first>H3.clears>A').items():
            self.crawl(each.attr.href, callback=self.index_keshi_page)

    @config(age=24 * 60 * 60)
    def index_keshi_page(self, response):
        for each in response.doc('UL.pagelist>LI>A').items():
            title = response.url.split("-")
            total_pages = int(response.doc('UL.pagelist>LI:eq(10)>A').text().split('.')[3])

            for i in range(1, total_pages+1):
                i = repr(i)
                url = title[0]+'-'+title[1]+'-'+i+ '.html'

                self.crawl(url, callback=self.mouye_keshi_page)

    @config(age=24 * 60 * 60)
    def mouye_keshi_page(self, response):
        for each in response.doc('DIV.b-mh-list>ul>li>a').items():
            # the <script> tag inside the span is generated dynamically,
            # so these pages need to be fetched with PhantomJS enabled
            self.crawl(each.attr.href, fetch_type='js', callback=self.zixun_page)

    @config(priority=2)
    def zixun_page(self, response):
        return {
            "publishTime": response.doc('DIV.b-mh-time>span:first').text(),
            "readNumber": response.doc('DIV.b-mh-time>span:eq(1)').text().split(' ')[1],
            "source": response.doc('DIV.b-mh-time>span:eq(2)').text(),
            "editor": response.doc('DIV.b-mh-time>span:eq(3)').text(),
            "title": response.doc('title').text(),
            "content": response.doc('DIV.b-mhcon').text()

        }
