Tools to construct and process web graphs from Common Crawl data
Compiling and Packaging Java Tools
The Java tools are compiled and packaged by Maven. If Maven is installed, just run
mvn package. The Java tools can then be run via
java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <args>...
Note that the web graphs are usually multiple gigabytes in size and require a sufficiently large Java heap (Java option
-Xmx, e.g. -Xmx16g) for processing.
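As a quick sanity check that the heap option took effect, the configured maximum can be queried from inside the JVM; the class below is a minimal illustration, not part of the tools:

```java
public class HeapCheck {
    // Returns the JVM's maximum heap size in MiB, as configured via -Xmx.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // Run with e.g. "java -Xmx16g HeapCheck" and compare the printed value.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```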
Construction and Ranking of Host- and Domain-Level Web Graphs
Host-Level Web Graph
Domain-Level Web Graph
The domain-level web graph is distilled from the host-level graph by mapping host names to domain names. The ID mapping is kept in memory as an int array, or as one of fastutil's big arrays if the host-level graph has more vertices than a Java array can hold (2³¹ − 1). The Java tool to fold the host graph is best run from the script host2domaingraph.sh.
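The folding step can be sketched as follows. This is a simplified illustration, not the actual tool: hosts are reduced to a "domain" by keeping the last two name labels, whereas the real tool derives registered domains via the public suffix list, and the dense domain IDs are stored in a plain int array (the big-array variant is omitted):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostToDomain {
    // Naive domain extraction: keep the last two labels,
    // e.g. "news.example.com" -> "example.com".
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Assign dense integer IDs to domains in first-seen order and return
    // the host-ID -> domain-ID mapping as an int array. Edges of the host
    // graph can then be rewritten through this mapping.
    static int[] foldIds(List<String> hosts) {
        Map<String, Integer> domainId = new LinkedHashMap<>();
        int[] mapping = new int[hosts.size()];
        for (int i = 0; i < hosts.size(); i++) {
            String domain = domainOf(hosts.get(i));
            Integer id = domainId.get(domain);
            if (id == null) {
                id = domainId.size();
                domainId.put(domain, id);
            }
            mapping[i] = id;
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("www.example.com", "news.example.com", "example.org");
        System.out.println(Arrays.toString(foldIds(hosts))); // two hosts fold onto one domain
    }
}
```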
Processing Graphs using the WebGraph Framework
To analyze the graph structure and calculate rankings you may further process the graphs using software from the Laboratory for Web Algorithmics (LAW) at the University of Milano, namely the WebGraph framework and the LAW library.
A couple of scripts that help you run the WebGraph tools to build and process the graphs are provided in src/script/webgraph_ranking/. They are also used to prepare the Common Crawl web graph releases.
To process a web graph and rank the nodes, you should first adapt the configuration to your graph and hardware setup, then run:
./src/script/webgraph_ranking/process_webgraph.sh graph_name vertices.txt.gz edges.txt.gz output_dir
output_dir/ should contain all generated files, e.g.