2.1 Crawling

Introduction

To retrieve a set of pages, we use Crawler4J. In particular, the crawler listed below crawls the IMDB domain and separates movie and actor pages into two different indices. We decided to run a separate crawling step before the actual REX pipeline starts in order to minimize the risk of interdependencies.

An example instantiation of such a crawler can be found here.

To store the data, we use a Lucene 4.x index; see CrawlIndex.java.
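
For illustration, here is a minimal sketch of what writing a crawled page into such an index could look like with plain Lucene 4.x (assuming Lucene 4.4; the field names `url` and `html` are assumptions for illustration, the actual schema is defined in CrawlIndex.java):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexSketch {
        // store one crawled page as a Lucene document holding its URL and HTML body
        public static void store(String url, String html) throws Exception {
            Directory dir = FSDirectory.open(new File("imdb-title-index"));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44,
                    new StandardAnalyzer(Version.LUCENE_44));
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                Document doc = new Document();
                // the URL is stored verbatim and not tokenized
                doc.add(new StringField("url", url, Field.Store.YES));
                // the page body is analyzed and stored for later extraction
                doc.add(new TextField("html", html, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }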

A sample index can be downloaded from here.

Parameters (no optional parameters!)

(1) You need to fill the map that routes URIs matching a certain regular expression to a certain Lucene index:

    Map<CrawlIndex, Set<String>> index2URLs = new HashMap<CrawlIndex, Set<String>>();
    index2URLs.put(new CrawlIndex("imdb-title-index"), Sets.newHashSet("http://www.imdb.com/title/tt([0-9])*/$"));
    index2URLs.put(new CrawlIndex("imdb-name-index"), Sets.newHashSet("http://www.imdb.com/name/nm([0-9])*/$"));

(2) You need to instantiate the CrawlerConfig provided by Crawler4J. Here you have to hand over the map of index mappings as well as a domain, e.g., http://www.imdb.com/. Afterwards, you need to initialize the URLCrawlerController with a string naming a temporary crawling directory and the CrawlerConfig.

    CrawlerConfig crawlIndexConfig = new CrawlerConfig("http://www.imdb.com/", index2URLs);
    URLCrawlerController crawlControl = new URLCrawlerController("crawlIMDB", crawlIndexConfig);

(3) You need to give the controller a set of seeds as starting points for the crawler. For example, you can generate them from synthetic IMDB identifiers:

    Random r = new Random();
    // IMDB title IDs are zero-padded to seven digits, e.g., tt0000123
    DecimalFormat df = new DecimalFormat("0000000");
    for (int i = 0; i < 50000; i++) {
        String id = df.format(r.nextInt(9999999));
        crawlControl.addSeed("http://www.imdb.com/title/tt" + id);
    }

(4) You can put the crawler together like this:

    // Requires: java.text.DecimalFormat, java.util.HashMap, java.util.Map,
    // java.util.Random, java.util.Set, and Guava's com.google.common.collect.Sets,
    // plus the REX classes CrawlIndex, CrawlerConfig, and URLCrawlerController.
    public static void main(String[] args) throws Exception {
        // map pages for movies and actors to separate indices
        Map<CrawlIndex, Set<String>> index2URLs = new HashMap<CrawlIndex, Set<String>>();
        index2URLs.put(new CrawlIndex("imdb-title-index"), Sets.newHashSet("http://www.imdb.com/title/tt([0-9])*/$"));
        index2URLs.put(new CrawlIndex("imdb-name-index"), Sets.newHashSet("http://www.imdb.com/name/nm([0-9])*/$"));

        // configure the crawler for the IMDB domain
        CrawlerConfig crawlIndexConfig = new CrawlerConfig("http://www.imdb.com/", index2URLs);
        URLCrawlerController crawlControl = new URLCrawlerController("crawlIMDB", crawlIndexConfig);

        // generate 50,000 random movie seed URLs and 50,000 random actor seed URLs;
        // IMDB IDs are zero-padded to seven digits
        Random r = new Random();
        DecimalFormat df = new DecimalFormat("0000000");
        for (int i = 0; i < 50000; i++) {
            crawlControl.addSeed("http://www.imdb.com/title/tt" + df.format(r.nextInt(9999999)));
        }
        for (int i = 0; i < 50000; i++) {
            crawlControl.addSeed("http://www.imdb.com/name/nm" + df.format(r.nextInt(9999999)));
        }

        // start crawling
        crawlControl.startCrawler();

        // give slow servers 30 seconds to respond before shutting down
        Thread.sleep(30 * 1000);

        crawlControl.shutdown();
        crawlControl.waitUntilFinish();
    }
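
Once the crawl has finished, you can sanity-check the result by opening one of the indices with plain Lucene 4.x and inspecting the stored documents (a sketch, again assuming Lucene 4.4 and a stored `url` field):

    import java.io.File;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.FSDirectory;

    public class CrawlCheck {
        public static void main(String[] args) throws Exception {
            // open the title index read-only and report what was crawled
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(new File("imdb-title-index")))) {
                System.out.println("pages crawled: " + reader.numDocs());
                if (reader.numDocs() > 0) {
                    System.out.println("first page: " + reader.document(0).get("url"));
                }
            }
        }
    }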