
URL Collector

The URL collector is an application that crawls the Common Crawl corpus for URLs with the specified file extensions.

Architecture

The application suite is a distributed system designed to increase processing speed. It can be deployed to a single machine, but if the crawling speed needs to be increased, the Slave application can be deployed to multiple machines.

The url-collector suite is built from three applications.

URL Collector Master

Knows which WARC files (work units) should be crawled, stores this data in a database, and distributes the work to the Slave applications.

URL Collector Slave

Gets a work unit (effectively a Common Crawl WARC file) that should be parsed for URLs from the Master application, parses the URLs, then uploads the results to a warehouse.

URL Collector Merger

Merges the crawl results stored in a warehouse into one big file, removing the duplicate entries while doing so.

Installation

To start crawling, you need to have Java and MongoDB installed on your machine.

Installing Java

First, you need to download the Java 17 Runtime Environment. It’s available here. After the download is complete, run the installer and follow the directions it provides until the installation is complete.

Once it’s done, open a command line (type cmd into the Start menu’s search bar) and you will be able to use the java command. Try running java -version. You should get something similar to this:

java version "15" 2020-09-15
Java(TM) SE Runtime Environment (build 15+36-1562)
Java HotSpot(TM) 64-Bit Server VM (build 15+36-1562, mixed mode, sharing)

Installing MongoDB

Download MongoDB 5.0 from here. After the download is complete, run the installer and follow the directions it provides. If possible, install the MongoDB Compass tool as well, because you might need it later for administrative tasks.
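
Once the installation is complete, it is worth checking that the database server is actually running. A minimal check, assuming the mongosh shell was installed along with the server (it is not required by the URL Collector itself):

mongosh --eval "db.runCommand({ ping: 1 })"

If the server is up and listening on the default port, the command should print a document containing ok: 1.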

Running the Master Application

You can download the Master Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application with the following command:

java -jar url-collector-master-application-{release-number}.jar ...

In place of the … you should write the parameters, each prefixed with two dashes and with an equal sign between the parameter’s name and its value. For example:

java -jar url-collector-master-application-{release-number}.jar --example.parameter=value

Parameters

Table 1. Parameters
Parameter Description

database.host

The host location of the MongoDB database server. (Default value: localhost)

database.port

The port open for the MongoDB database server. (Default value: 27017)

database.uri

If present and not empty, it overrides the host and port parameters. Lets the user inject a MongoDB Connection String directly. Should be used to define credentials and other custom connection parameters. (Default value: "")

server.port

The port that the Master server should listen on. (Default value: 8080)
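
Putting it together, a minimal sketch of starting the Master against a local MongoDB instance (the release number 1.0.0 is only a placeholder):

java -jar url-collector-master-application-1.0.0.jar \
    --database.host=localhost \
    --database.port=27017 \
    --server.port=8080

If credentials or other custom connection options are needed, pass a full connection string via --database.uri instead (for example --database.uri=mongodb://user:password@localhost:27017, where the credentials are placeholders).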

Running the Slave Application

You can download the Slave Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application with the following command:

java -jar url-collector-slave-application-{release-number}.jar ...

You can start the Slave application on more than one machine if necessary, to improve the crawling speed.

In place of the … you should write the parameters, each prefixed with two dashes and with an equal sign between the parameter’s name and its value. For example:

java -jar url-collector-slave-application-{release-number}.jar --example.parameter=value

Parameters

Table 2. Parameters
Parameter Description

execution.parallelism-target

How many work units should be processed at the same time. (Default value: the number of processor cores multiplied by two)

warehouse.type

The type of the warehouse. Can be either 'local' or 'aws'.

warehouse.local.target-directory

The target directory where the crawled URLs should be saved. Only used if the warehouse.type is 'local'.

warehouse.aws.region

The region of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.bucket-name

The name of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.access-key

The access key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.secret-key

The secret key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

validation.types

The list of file extensions that we need to save the URLs for.

master.host

The host location of the Master Application. (Default value: localhost)

master.port

The port of the Master Application. (Default value: 8080)
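
As an illustration, a Slave that gets its work units from a Master running on another machine and saves its results to a local directory could be started like this (the release number, host, directory and extension list are placeholders; the comma-separated format for validation.types is an assumption):

java -jar url-collector-slave-application-1.0.0.jar \
    --master.host=192.168.0.10 \
    --master.port=8080 \
    --warehouse.type=local \
    --warehouse.local.target-directory=/data/url-collector/results \
    --validation.types=pdf,epub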

Starting a crawl

To start a crawl, the Master application’s crawl initialization endpoint should be called. The request should be a POST request to the /crawl endpoint on the Master, with a body that contains the crawlId of the Common Crawl dataset that should be processed.

For example:

curl --location --request POST 'http://185.191.228.214:8081/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
"crawlId": "CC-MAIN-2021-31"
}'
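
Note that the host and port in the example above belong to a specific deployment; replace them with the address of your own Master instance. With the default configuration on the local machine, the same request would be:

curl --location --request POST 'http://localhost:8080/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
    "crawlId": "CC-MAIN-2021-31"
}'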

Once it is done, the Slave application should automatically pick up the new work units in a matter of minutes.

Starting the Merger Application

You can download the Merger Application from our release page. Please take care to choose a version that is not marked as a "pre-release"!

After the download is complete, run the application with the following command:

java -jar url-collector-merger-application-{release-number}.jar ...

In place of the … you should write the parameters, each prefixed with two dashes and with an equal sign between the parameter’s name and its value. For example:

java -jar url-collector-merger-application-{release-number}.jar --example.parameter=value

Parameters

Table 3. Parameters
Parameter Description

database.host

The host location of the MongoDB database server. (Default value: localhost)

database.port

The port open for the MongoDB database server. (Default value: 27017)

database.uri

If present and not empty, it overrides the host and port parameters. Lets the user inject a MongoDB Connection String directly. Should be used to define credentials and other custom connection parameters. (Default value: "")

warehouse.type

The type of the warehouse. Can be either 'local' or 'aws'.

warehouse.local.target-directory

The target directory where the crawled URLs should be saved. Only used if the warehouse.type is 'local'.

warehouse.aws.region

The region of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.bucket-name

The name of the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.access-key

The access key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

warehouse.aws.secret-key

The secret key for the user that has access to the S3 bucket where the crawled URLs should be uploaded. Only used if the warehouse.type is 'aws'.

result.path

The directory where the result of the merge should be saved. The result file will be saved there with the filename 'result.ubds'.
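
For example, a Merger run that reads from a local warehouse directory and writes the merged file next to it could look like this (the release number and the paths are placeholders):

java -jar url-collector-merger-application-1.0.0.jar \
    --database.host=localhost \
    --database.port=27017 \
    --warehouse.type=local \
    --warehouse.local.target-directory=/data/url-collector/results \
    --result.path=/data/url-collector/merged

After the run finishes, the merged file will be available at /data/url-collector/merged/result.ubds.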

Peeking into the results

The individual result files are LZMA encoded. If you want to peek into them, first decompress the files (7-Zip can help with this on Windows). The URLs are stored in a JSON array.

The merged result file is NOT compressed. In it, the URLs are separated by newline characters.
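
Because the merged file is newline separated, ordinary command line tools are enough to inspect it. For example, to see the first few entries and to count the collected URLs:

head result.ubds
wc -l result.ubds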
