2.1 Crawling

Axel Ngonga edited this page Jul 7, 2014 · 1 revision

Introduction

In order to retrieve a set of pages we use Crawler4J. In particular, the crawler listed below crawls the IMDb domain and separates movie and actor pages into two different indices. We decided to run a separate crawling step before the actual REX pipeline starts in order to minimize the risk of interdependencies.

You can find an instantiation for a crawler here.

To store the data we use a Lucene 4.X index, see CrawlIndex.java.

A sample index can be downloaded from here.

Parameters (no optional parameters!)

(1) You need to fill the map that directs URIs matching a certain pattern to a certain Lucene index:

       Map<CrawlIndex, Set<String>> index2URLs = new HashMap<CrawlIndex, Set<String>>();
       index2URLs.put(new CrawlIndex("imdb-title-index"), Sets.newHashSet("http://www.imdb.com/title/tt([0-9])*/$"));
       index2URLs.put(new CrawlIndex("imdb-name-index"), Sets.newHashSet("http://www.imdb.com/name/nm([0-9])*/$"));
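The pattern strings in the map act as filters: only canonical movie and actor pages (numeric id followed by a trailing slash) are routed to an index. A minimal, self-contained sketch of how such a pattern behaves (the class and method names here are ours, only the pattern string is taken from the mapping above):

```java
import java.util.regex.Pattern;

// Demonstrates the effect of the title pattern used in the index mapping:
// it matches only URLs that end in "tt" + digits + "/", so sub-pages such
// as cast listings are filtered out.
public class UrlFilterDemo {
	static final Pattern TITLE = Pattern.compile("http://www.imdb.com/title/tt([0-9])*/$");

	public static boolean isTitlePage(String url) {
		// find() suffices because the pattern is anchored at the end ($)
		return TITLE.matcher(url).find();
	}

	public static void main(String[] args) {
		System.out.println(isTitlePage("http://www.imdb.com/title/tt0111161/"));            // true
		System.out.println(isTitlePage("http://www.imdb.com/title/tt0111161/fullcredits")); // false
	}
}
```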

(2) You need to instantiate the CrawlerConfig provided by Crawler4J. Here you have to hand over the map of index mappings as well as a domain, e.g., http://www.imdb.com/. Afterwards you need to initialize the URLCrawlerController with a string for a temporary crawling directory and the CrawlerConfig.

        CrawlerConfig crawlIndexConfig = new CrawlerConfig("http://www.imdb.com/", index2URLs);
        URLCrawlerController crawlControl = new URLCrawlerController("crawlIMDB", crawlIndexConfig);

(3) You need to give the controller a set of seeds as starting points for the crawler, e.g., by generating them from synthetic URIs:

		Random r = new Random();
		DecimalFormat df = new DecimalFormat("0000000");
		for (int i = 0; i < 50000; i++) {
			int x = r.nextInt(9999999);
			// use the zero-padded id; the raw int would yield invalid URLs
			crawlControl.addSeed("http://www.imdb.com/title/tt" + df.format(x));
		}
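The zero-padding matters because IMDb ids are seven digits wide (tt0012345, not tt12345). A small self-contained sketch of the seed construction (the class and method names are ours):

```java
import java.text.DecimalFormat;
import java.util.Random;

// Shows why DecimalFormat("0000000") is used when building seed URLs:
// the random int must be padded to seven digits, and the formatted string
// must actually be concatenated (calling format() and discarding the
// result would leave the URL unpadded).
public class SeedFormatDemo {
	static final DecimalFormat SEVEN_DIGITS = new DecimalFormat("0000000");

	public static String titleSeed(int id) {
		return "http://www.imdb.com/title/tt" + SEVEN_DIGITS.format(id);
	}

	public static void main(String[] args) {
		Random r = new Random();
		System.out.println(titleSeed(r.nextInt(9999999)));
		System.out.println(titleSeed(42)); // http://www.imdb.com/title/tt0000042
	}
}
```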

(4) You can put the crawler together like this:

	public static void main(String[] args) throws Exception {
		// map pages for movies and stars separately
		Map<CrawlIndex, Set<String>> index2URLs = new HashMap<CrawlIndex, Set<String>>();
		index2URLs.put(new CrawlIndex("imdb-title-index"), Sets.newHashSet("http://www.imdb.com/title/tt([0-9])*/$"));
		index2URLs.put(new CrawlIndex("imdb-name-index"), Sets.newHashSet("http://www.imdb.com/name/nm([0-9])*/$"));

		// configure crawler
		CrawlerConfig crawlIndexConfig = new CrawlerConfig("http://www.imdb.com/", index2URLs);
		URLCrawlerController crawlControl = new URLCrawlerController("crawlIMDB", crawlIndexConfig);

		// generate 50000 random, zero-padded seed URLs per index
		Random r = new Random();
		DecimalFormat df = new DecimalFormat("0000000");
		for (int i = 0; i < 50000; i++) {
			crawlControl.addSeed("http://www.imdb.com/title/tt" + df.format(r.nextInt(9999999)));
		}
		for (int i = 0; i < 50000; i++) {
			crawlControl.addSeed("http://www.imdb.com/name/nm" + df.format(r.nextInt(9999999)));
		}

		// start crawling
		crawlControl.startCrawler();

		// wait for 30 seconds in case a server is slow
		Thread.sleep(30 * 1000);

		crawlControl.shutdown();
		crawlControl.waitUntilFinish();
	}