cc-webgraph

Tools to construct and process web graphs from Common Crawl data

Compiling and Packaging Java Tools

The Java tools are compiled and packaged with Maven. If Maven is installed, just run mvn package.

Construction and Ranking of Host- and Domain-Level Web Graphs

Host-Level Web Graph

The host-level web graph is built with the help of PySpark. The corresponding code is found in the project cc-pyspark, and instructions are found in the script build_hostgraph.sh.
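To make the term "host-level" concrete, here is a minimal sketch in plain Python rather than PySpark. It is not the cc-pyspark implementation: the function name host_edges, the input shape (pairs of source and target URLs), and the choice to drop links within the same host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def host_edges(links):
    """Collapse URL-level links into host-level edges.

    `links` is an iterable of (source_url, target_url) pairs
    (a hypothetical input shape, chosen for illustration).
    Returns a sorted list of unique (source_host, target_host) pairs.
    """
    edges = set()
    for source_url, target_url in links:
        src = urlsplit(source_url).hostname
        dst = urlsplit(target_url).hostname
        # Skip URLs without a host part, and links that stay on one host
        # (dropping same-host links is an assumption of this sketch).
        if src and dst and src != dst:
            edges.add((src, dst))
    return sorted(edges)
```

Each host then becomes one vertex, and any number of page-level links between two hosts collapse into a single edge.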

Domain-Level Web Graph

The domain-level web graph is distilled from the host-level graph by mapping host names to domain names. The ID mapping is kept in memory as an int array (if the graph has fewer than 2³¹ vertices). The Java tool that folds the host graph is best run from the script host2domaingraph.sh.
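The folding step can be sketched as follows. This is an illustrative Python sketch, not the Java tool's code. The helper naive_domain is a hypothetical stand-in that keeps the last two labels of a host name, whereas a real mapping would need a public suffix list. Removing duplicate edges and links within a single domain is likewise an assumption.

```python
from array import array


def naive_domain(host):
    """Hypothetical stand-in: keep the last two labels of the host name.

    This is wrong for suffixes such as "co.uk"; a real tool would
    consult a public suffix list instead.
    """
    return ".".join(host.split(".")[-2:])


def fold_host_graph(hosts, edges):
    """Fold a host-level graph into a domain-level graph.

    `hosts` is a list of host names whose index is the host ID.
    `edges` is an iterable of (from_host_id, to_host_id) pairs.
    Returns (domains, id_map, domain_edges), where id_map[host_id]
    is the domain ID of that host.
    """
    domain_ids = {}
    domains = []
    # Compact host-ID -> domain-ID mapping, analogous to an int array.
    id_map = array("i", [0]) * len(hosts)
    for host_id, host in enumerate(hosts):
        domain = naive_domain(host)
        if domain not in domain_ids:
            domain_ids[domain] = len(domains)
            domains.append(domain)
        id_map[host_id] = domain_ids[domain]

    domain_edges = set()
    for src, dst in edges:
        d_src, d_dst = id_map[src], id_map[dst]
        # Assumption of this sketch: drop links inside one domain.
        if d_src != d_dst:
            domain_edges.add((d_src, d_dst))
    return domains, id_map, sorted(domain_edges)
```

Because host IDs are dense integers, the mapping needs one array slot per host and no lookups by host name while the edges are rewritten.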

Processing Graphs using the Webgraph Framework

To analyze the graph structure and calculate rankings, you may further process the graphs using software from the Laboratory for Web Algorithmics (LAW) at the University of Milan, namely the webgraph framework and the LAW library.

A couple of scripts that help you install the webgraph framework and run the tools to build and process the graphs are provided in src/script/webgraph_ranking/. They are also used to prepare the Common Crawl web graph releases. The first script installs the webgraph and LAW software in the same directory where the scripts are located:

cd ./src/script/webgraph_ranking/
./install_webgraph.sh
cd ../../../

To process a webgraph and rank the nodes, you should first adapt the configuration to your graph and hardware setup:

vi ./src/script/webgraph_ranking/webgraph_config.sh

After running

./src/script/webgraph_ranking/process_webgraph.sh graph_name vertices.txt.gz edges.txt.gz output_dir

the directory output_dir/ should contain all generated files, e.g. graph_name.graph and graph_name-ranks.txt.gz.
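For a quick look at a gzip-compressed text output such as graph_name-ranks.txt.gz, a small helper like the one below may be handy. It is a generic sketch: the function name head_gz is hypothetical, UTF-8 encoding is assumed, and nothing is assumed about the columns in the file.

```python
import gzip
from itertools import islice


def head_gz(path, n=5):
    """Return the first n lines of a gzip-compressed text file.

    Lines are returned without their trailing newline. UTF-8 is an
    assumption of this sketch; adjust `encoding` if needed.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, head_gz("output_dir/graph_name-ranks.txt.gz") returns the first five lines, which is usually enough to check the layout before loading the whole file.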

The shell script is easily adapted to your needs. For further information, please refer to the LAW dataset tutorial and to the API docs of LAW and webgraph.