# WebCollector

WebCollector is an open-source Java crawler that provides simple interfaces for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!

### DEMO

This demo extracts all the questions asked at http://www.zhihu.com/ . You need to create a crawler class that extends BreadthCrawler.

    import java.io.IOException;
    import java.util.regex.Pattern;

    /* WebCollector classes; package paths follow the
       cn.edu.hfut.dmic.webcollector layout used by this project */
    import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
    import cn.edu.hfut.dmic.webcollector.model.Page;

    public class ZhihuCrawler extends BreadthCrawler {

        /**
         * This method is called when a page has been fetched and is
         * ready to be processed by your program.
         */
        @Override
        public void visit(Page page) {
            String questionRegex = "^http://www.zhihu.com/question/[0-9]+";
            if (Pattern.matches(questionRegex, page.url)) {
                System.out.println("processing " + page.url);

                /* extract the title of the page */
                String title = page.doc.title();
                System.out.println(title);

                /* extract the content of the question */
                String question = page.doc.select("div[id=zh-question-detail]").text();
                System.out.println(question);
            }
        }

        /**
         * start crawling
         */
        public static void main(String[] args) throws IOException {
            ZhihuCrawler crawler = new ZhihuCrawler();
            crawler.addSeed("http://www.zhihu.com/question/21003086");
            crawler.addRegex("http://www.zhihu.com/.*");
            /* start the crawler with depth=5 */
            crawler.start(5);
        }
    }
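
The crawl itself is configured in main: addSeed supplies the starting URL, addRegex restricts which URLs the crawler will process, and start(depth) runs the breadth-first crawl to the given depth (5 in the demo). The hypothetical QuestionOnlyCrawl class below is a sketch of an alternative entry point, using only those three calls from the demo and assuming addRegex filters the links the crawler follows; the tighter regex limits the crawl to question pages.

    import java.io.IOException;

    /* Sketch only: reuses the ZhihuCrawler class from the demo above. */
    public class QuestionOnlyCrawl {
        public static void main(String[] args) throws IOException {
            ZhihuCrawler crawler = new ZhihuCrawler();
            /* starting URL for the breadth-first crawl */
            crawler.addSeed("http://www.zhihu.com/question/21003086");
            /* only links matching this regex are followed during the crawl */
            crawler.addRegex("http://www.zhihu.com/question/[0-9]+");
            /* run the breadth-first crawl to depth 3 */
            crawler.start(3);
        }
    }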

As can be seen in the code above, there is one method that should be overridden:

  • visit(Page page): This method is called after the content of a URL has been downloaded successfully. You can easily get the URL and text of the downloaded page. If the Content-Type of the downloaded page is text/html, you can also get the document and HTML of the page. The document is a DOM tree parsed by jsoup. The HTML is a String decoded with the detected charset. Page is an instance of cn.edu.hfut.dmic.webcollector.model.Page (a sketch using these fields follows this list).
  • page.url is the URL of the downloaded page
  • page.content is the raw data of the page
  • page.doc is an instance of org.jsoup.nodes.Document
  • page.headers contains the response headers of the page
  • page.status shows the fetch status
  • page.fetchtime is the time at which the page was fetched, as generated by System.currentTimeMillis()
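
As a small illustration of these fields, the hypothetical PageInfoCrawler below logs the fetch metadata of every downloaded page and, when a DOM is available, prints the title and outgoing links with jsoup. It is only a sketch based on the fields listed above; the BreadthCrawler import path is assumed from the project's package layout.

    import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
    import cn.edu.hfut.dmic.webcollector.model.Page;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    /* Sketch only: illustrates the Page fields documented above. */
    public class PageInfoCrawler extends BreadthCrawler {

        @Override
        public void visit(Page page) {
            /* metadata available for every fetched page */
            System.out.println("url:       " + page.url);
            System.out.println("status:    " + page.status);
            System.out.println("fetchtime: " + page.fetchtime);
            System.out.println("headers:   " + page.headers);
            /* page.content holds the raw response body */

            /* page.doc is only available for text/html responses */
            Document doc = page.doc;
            if (doc != null) {
                System.out.println("title:     " + doc.title());
                /* outgoing links found by jsoup */
                for (Element link : doc.select("a[href]")) {
                    System.out.println("link:      " + link.attr("abs:href"));
                }
            }
        }
    }

Such a crawler would be started the same way as ZhihuCrawler in the demo, with addSeed, addRegex, and start.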

Chinese tutorial: https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md
