A bug in annotation mode #433

Open
obhen233 opened this Issue Jan 4, 2017 · 2 comments

obhen233 commented Jan 4, 2017

import java.text.SimpleDateFormat;
import java.util.Date;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.AfterExtractor;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl("http://www.jokeji.cn/jokehtml/([\\w\\W]*)/\\d+.htm")
@HelpUrl("http://www.jokeji.cn/list_\\d+.htm")
@ExtractBy(value = "//span[@id='text110']/p",multi = true)
public class JokeModel implements AfterExtractor{

@ExtractBy(value = "//allText()")
private String joke;

private Date creat_time;

public static void main(String[] args) {
    OOSpider.create(Site.me().setSleepTime(1000)
            , new ConsolePageModelPipeline(), JokeModel.class)
            .addUrl("http://www.jokeji.cn/list.htm").thread(5).run();
}

@Override
public void afterProcess(Page page) {
	System.out.println("gegegegeeg");
	creat_time = new Date();
}
@Override
public String toString() {
	// Format the timestamp that afterProcess() recorded
	SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
	String time = sdf.format(creat_time);
	return "{\"joke\":\""+joke+"\",\"create_time\":\""+time+"\"}";
}

}

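The code is above. One bug: without multi = true, only the data under the first <p> is extracted. Shouldn't it be able to tell on its own, without the flag, whether there are multiple matches?

The other: @ExtractBy(value = "//allText()") returns no data. text, allText, and tinyText all return nothing.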

code4craft added this to the WebMagic-0.6.1 milestone Jan 8, 2017
@code4craft
Owner
  1. multi = true: this can be optimized.
  2. Syntax like //allText() is not supported. You have to select a node first and then apply allText() to it, for example //div/allText().
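To make the two points above concrete, here is a minimal sketch of "select the node first, then take all of its text", and of how a single-result lookup differs from a multi-result one. It deliberately uses only the JDK's standard `javax.xml.xpath` API rather than WebMagic's own selector, so it illustrates the idea and not WebMagic's implementation. The class name `NodeThenText`, the helper methods, and the `SAMPLE` markup are all hypothetical, and the markup is only a guess at the general shape of the page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeThenText {
    // Hypothetical sample markup (an assumption, not the real page source).
    static final String SAMPLE =
        "<div><span id='text110'>"
        + "<p>joke <b>one</b></p><p>joke two</p><p>joke three</p>"
        + "</span></div>";

    // Multi-result lookup: select every matching node first,
    // then read all of the text beneath each selected node.
    static List<String> allTexts(String xml, String nodeXPath) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                nodeXPath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants.
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or document: " + nodeXPath, e);
        }
    }

    // Single-result lookup: only the first matching node in document order.
    static String firstText(String xml, String nodeXPath) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(
                nodeXPath, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or document: " + nodeXPath, e);
        }
    }

    public static void main(String[] args) {
        String nodeXPath = "//span[@id='text110']/p";
        System.out.println(firstText(SAMPLE, nodeXPath)); // joke one
        System.out.println(allTexts(SAMPLE, nodeXPath));  // [joke one, joke two, joke three]
    }
}
```

The single-result call mirrors what happens without multi = true (only the first `<p>` comes back), while the node-set call mirrors multi = true. By analogy with the `//div/allText()` example above, the field expression in the model would presumably become something like `//p/allText()`; that is an untested assumption.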
@obhen233

`


text


text


text

` The format is like this. If I can't use //allText(), what should I use instead?