
How to get a value when an HTML element contains a dynamically generated <script> tag #289

Closed
wanghaisheng opened this issue Sep 25, 2015 · 9 comments

@wanghaisheng

There is an HTML fragment like this one:

<span>阅读:<script src="http://iapi.iiyi.com/v1/news/click/?view=yes&amp;id=223925&amp;mid=128"></script>77</span>

If I use the following:

            "readNumber": response.doc('span').text(),

it will return "阅读:" without the "77"; what I want is the 77.

@binux
Owner

binux commented Sep 26, 2015

Can you reproduce it on demo.pyspider.org?

@wanghaisheng
Author

@binux you can see it here: http://demo.pyspider.org/debug/issue289

@binux
Owner

binux commented Sep 27, 2015

Did you save it?

@wanghaisheng
Author

Please try again.

@binux
Owner

binux commented Sep 27, 2015

The number is generated by JavaScript. You can make another HTTP call to the script address to get it, or use PhantomJS to render the page with JavaScript support.
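
A minimal sketch of the first option, making a separate HTTP request to the counter script's URL from the fragment above. The callback name read_number and the digit-regex parsing are assumptions about what the endpoint returns, not something confirmed in this thread:

from pyspider.libs.base_handler import *
import re


class Handler(BaseHandler):
    def on_start(self):
        # request the counter script's URL (the src of the <script> tag) directly
        self.crawl('http://iapi.iiyi.com/v1/news/click/?view=yes&id=223925&mid=128',
                   callback=self.read_number)

    def read_number(self, response):
        # the endpoint is assumed to return a small JavaScript snippet that
        # contains the count; pull the first run of digits out of it
        match = re.search(r'\d+', response.text)
        return {"readNumber": match.group(0) if match else None}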

@wanghaisheng
Author

@binux

            "readNumber": response.doc('DIV.b-mh-time>span:eq(1)>script.src').text(),

or

            "readNumber": response.doc('DIV.b-mh-time>span:eq(1)').text(),

Neither works for the following:

<span>阅读:<script src="http://iapi.iiyi.com/v1/news/click/?view=yes&amp;id=223925&amp;mid=128"></script>77</span>

@wanghaisheng
Author

Do you mean using PhantomJS to render, like this?

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-01-04 03:35:01
# Project: tutorial_pinterest

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    """
    This is a sample scrape script for: Level 3: Render with PhantomJS
    http://docs.pyspider.org/en/latest/tutorial/Render-with-PhantomJS/#running-javascript-on-page
    """

    def on_start(self):
        self.crawl('http://iapi.iiyi.com/v1/news/click/?view=yes&id=223925&mid=128',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "readNumber": response.doc('body').text().split('(')[1].split(')')[0],

        }

@binux
Owner

binux commented Sep 28, 2015

The number 77 doesn't exist without JavaScript. Enable PhantomJS for pages like http://www.iiyi.com/d-07-217261.html
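
In pyspider terms, enabling PhantomJS just means crawling the article page with fetch_type='js'. A minimal sketch, where the callback name detail_page is assumed and the selector is the one used earlier in this thread:

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # fetch_type='js' renders the page with PhantomJS so the counter
        # script runs and "77" becomes part of the DOM before parsing
        self.crawl('http://www.iiyi.com/d-07-217261.html',
                   fetch_type='js', callback=self.detail_page)

    def detail_page(self, response):
        # the span's text now includes the number written by the script
        return {"readNumber": response.doc('DIV.b-mh-time>span:eq(1)').text()}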

@wanghaisheng changed the title from "how to get value if a html tag contain script" to "how to get value if a html element contain dynamic generated <script> tag" on Sep 28, 2015
@wanghaisheng
Author

This one works perfectly:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-09-25 15:35:56
# Project: aiaiyi

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.iiyi.com/', callback=self.keshidaohang_page)

    @config(age=10 * 24 * 60 * 60)
    def keshidaohang_page(self, response):
        for each in response.doc('P.w_nav1>a').items():
            self.crawl(each.attr.href, callback=self.front_keshi_page)

    @config(age=24 * 60 * 60)
    def front_keshi_page(self, response):
        for each in response.doc('DIV.w_conl1:first>H3.clears>A').items():
            self.crawl(each.attr.href, callback=self.index_keshi_page)

    @config(age=24 * 60 * 60)
    def index_keshi_page(self, response):
        for each in response.doc('UL.pagelist>LI>A').items():
            title = response.url.split("-")
            total_pages = int(response.doc('UL.pagelist>LI:eq(10)>A').text().split('.')[3])

            for i in range(1, total_pages+1):
                i = repr(i)
                url = title[0]+'-'+title[1]+'-'+i+ '.html'

                self.crawl(url, callback=self.mouye_keshi_page)

    @config(age=24 * 60 * 60)
    def mouye_keshi_page(self, response):
        for each in response.doc('DIV.b-mh-list>ul>li>a').items():
            # the <script> tag inside the span is generated dynamically,
            # so these pages need to be fetched with PhantomJS enabled
            self.crawl(each.attr.href, fetch_type='js', callback=self.zixun_page)

    @config(priority=2)
    def zixun_page(self, response):
        return {
            "publishTime": response.doc('DIV.b-mh-time>span:first').text(),
            "readNumber": response.doc('DIV.b-mh-time>span:eq(1)').text().split(' ')[1],
            "source": response.doc('DIV.b-mh-time>span:eq(2)').text(),
            "editor": response.doc('DIV.b-mh-time>span:eq(3)').text(),
            "title": response.doc('title').text(),
            "content": response.doc('DIV.b-mhcon').text()

        }
