Problem with values returned by xpath on the Selectable interface #860

Open
wuchgan opened this issue Jan 30, 2019 · 10 comments

Comments

wuchgan commented Jan 30, 2019

@code4craft Hello, I have a question.
I first fetch the nodes with List<Selectable> nodes = page.getHtml().xpath("").nodes();
then loop over them, take Selectable selectable = nodes.get(i); and call selectable.xpath("xxx"). In debug mode, stopped at a breakpoint, the value returned is correct, but when I run the program normally the value is wrong. Why is that?

Example URL: http://www.atobo.com.cn/Companys/s-p6-k174255/
Code:
List<Selectable> nodes = page.getHtml().xpath("//div[@class='product_contextlist bplist']/ul/li").nodes();
Selectable selectable = null;
for (int i = 0, len = nodes.size(); i < len; ++i) {
    selectable = nodes.get(i);
    // XPath evaluated on the individual list-item node
    System.out.println("selectable xpath result: " + selectable.xpath("//div/ul/li[@class='p_name']/div/ul/li[contains(@class,'pp_2web')]/a[4]/allText()").toString());
    // The same path evaluated on the whole page, addressing the (i + 1)-th list item
    System.out.println("page xpath result: " + page.getHtml().xpath("//div[@class='product_contextlist bplist']/ul/li[" + (i + 1) + "]/div/ul/li[@class='p_name']/div/ul/li[contains(@class,'pp_2web')]/a[4]/allText()").toString());
}

wuchgan commented Jan 30, 2019

@code4craft Could you take a look at this?

i-CNNN commented Mar 4, 2019

> @code4craft Could you take a look at this?

I ran your code and it did parse data. Your description of the problem is not very clear.

wuchgan commented Mar 4, 2019

Does selectable.xpath return a value when you are not in debug mode? page.xpath does return one.

wuchgan commented Mar 5, 2019

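@code4craft Data is parsed, but selectable.xpath and page.getHtml().xpath return different results, and in principle they should be the same. The value returned by selectable.xpath is wrong: it is the value of a[1], not a[4].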

wuchgan commented Mar 5, 2019

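> > @code4craft Could you take a look at this?
>
> I ran your code and it did parse data. Your description of the problem is not very clear.

Data is parsed, but selectable.xpath and page.getHtml().xpath return different results, and in principle they should be the same. The value returned by selectable.xpath is wrong: it is the value of a[1], not a[4].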

peter-wang-wsl commented Mar 14, 2019

@newbiero I have also run into values being taken from the wrong index.
@code4craft

package com.cu.pageprocessor;

import java.util.ArrayList;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.cu.bean.News;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Page processor for 通信世界网 (www.cww.net.cn).
 */
public class CWWNewsPageProcessor implements PageProcessor {
	
	private final static Logger logger = LoggerFactory.getLogger(CWWNewsPageProcessor.class);

	private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
	
	private String keyword = "5G";
	
	private String siteName = "通信世界网";
	
	private String urlPrefix = "http://";
	
	@Override
	public void process(Page page) {
		List<String> list = page.getHtml().xpath("//a/" + "@href").all();
		String str = page.getHtml().xpath("//a[3]/" + "@href").toString();
//		List<String> newsTitleList = page.getHtml().xpath("//a/text()").all();
		List<String> newsTitleList = page.getHtml().xpath("//a").all();
		if (null == newsTitleList || newsTitleList.isEmpty()) {
			logger.info("there is not news");
			return;
		}
		String newsTitle = null;
		String newsHref = null;
		List<News> newsList = new ArrayList<News>();
		for (int i = 0; i < newsTitleList.size(); i++) {
			newsTitle = newsTitleList.get(i);
			if (null != newsTitle && newsTitle.contains(keyword)) {
				// Parenthesize (i + 1): without the parentheses, concatenation appends i and then the digit 1.
				newsHref = page.getHtml().xpath("//a[" + (i + 1) + "]/@href").toString();
				if(null != newsHref && newsHref.startsWith(urlPrefix)) {
					News news = new News();
					news.setSiteName(siteName);
					news.setNewsTitle(newsTitle);
					news.setNewsUrl(newsHref);
					newsList.add(news);
				}
			}
		}
		
		if (!newsList.isEmpty()) {
			page.putField("news", newsList);
		}

	}

	@Override
	public Site getSite() {
		return site;
	}
	
    public static void main(String[] args) {

        Spider.create(new CWWNewsPageProcessor())
                // Start crawling from the 通信世界网 home page
                .addUrl("http://www.cww.net.cn/")
                .addPipeline(new ConsolePipeline())
                // Crawl with 5 threads
                .thread(5)
                // Start the spider
                .run();
    }

}

The value of str is not the value of list.get(2), but if I take a[2] instead, the result does match list.get(1).
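
A side note on building an indexed XPath by string concatenation inside a loop, as a processor like this one does: Java evaluates + from left to right, so once the left operand is a String, a following i + 1 is appended piece by piece instead of being added. The sketch below only illustrates that language rule; the class and method names are hypothetical and are not part of WebMagic.

```java
public class XPathIndexConcat {

    // Left-to-right concatenation: i is appended first, then the literal 1.
    public static String withoutParens(int i) {
        return "//a[" + i + 1 + "]/@href";
    }

    // Parentheses force the integer addition to happen before concatenation.
    public static String withParens(int i) {
        return "//a[" + (i + 1) + "]/@href";
    }

    public static void main(String[] args) {
        System.out.println(withoutParens(2)); // prints //a[21]/@href
        System.out.println(withParens(2));    // prints //a[3]/@href
    }
}
```

With i = 2 the unparenthesized form asks for index 21 rather than 3, which can look like values coming from the wrong position.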

wuchgan commented Mar 15, 2019

> //a[3]/" + "@href

Check your code.
The index a[3] you mention is not the same thing as list.get(3). Make sure you understand what a[3] does.

@peter-wang-wsl

> > //a[3]/" + "@href
>
> Check your code.
> The index a[3] you mention is not the same thing as list.get(3). Make sure you understand what a[3] does.

@newbiero I misspoke: I meant a[3] and list.get(2).

wuchgan commented Mar 15, 2019

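> > > //a[3]/" + "@href
> >
> > Check your code.
> > The index a[3] you mention is not the same thing as list.get(3). Make sure you understand what a[3] does.
>
> @newbiero I misspoke: I meant a[3] and list.get(2).

Those two are not related, so a difference between them is normal. I suggest reading up on XPath.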

peter-wang-wsl commented Mar 15, 2019

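> > > > //a[3]/" + "@href
> > >
> > > Check your code.
> > > The index a[3] you mention is not the same thing as list.get(3). Make sure you understand what a[3] does.
> >
> > @newbiero I misspoke: I meant a[3] and list.get(2).
>
> Those two are not related, so a difference between them is normal. I suggest reading up on XPath.

@newbiero Doesn't //a[3]/@href mean the href attribute of the third a tag? Please enlighten me.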
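
For readers with the same question: in standard XPath 1.0, //a[3] is short for /descendant-or-self::node()/child::a[3], so it selects every a that is the third a child of its own parent. It does not select the third a in the whole document; that is written (//a)[3]. The sketch below shows the difference using the JDK's javax.xml.xpath on a tiny made-up document. It does not use WebMagic, whose selectors are built on Xsoup, and whether the parenthesized form works there has not been checked here. The class name, the eval helper and the sample markup are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathIndexDemo {

    // Made-up document: three <a> elements spread over two parents.
    static final String DOC =
            "<root>"
          + "<div><a href='u1'/><a href='u2'/></div>"
          + "<div><a href='u3'/></div>"
          + "</root>";

    // Evaluates an expression against DOC and returns its string value
    // (the empty string when nothing matches).
    public static String eval(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(DOC)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // No <a> is the third <a> child of its own parent, so nothing matches.
        System.out.println("[" + eval("//a[3]/@href") + "]");   // prints []
        // The third <a> in document order across the whole document.
        System.out.println("[" + eval("(//a)[3]/@href") + "]"); // prints [u3]
        // Only the first <div> has a second <a> child.
        System.out.println("[" + eval("//a[2]/@href") + "]");   // prints [u2]
    }
}
```

This is consistent with the advice that a[3] and list.get(2) are unrelated: page.getHtml().xpath("//a/@href").all() collects matches across the whole page in document order, while the [3] predicate counts siblings under each parent.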
