Tools to construct and process web graphs from Common Crawl data
Compiling and Packaging Java Tools
The Java tools are compiled and packaged by Maven. If Maven is installed, just run mvn package.
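For example (assuming Maven is on the PATH; the exact jar name depends on the project version):

```shell
# build and package the tools; the packaged jar is written to the target/ directory
mvn clean package
```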
Construction and Ranking of Host- and Domain-Level Web Graphs
Host-Level Web Graph
Domain-Level Web Graph
The domain-level web graph is distilled from the host-level graph by mapping host names to domain names. The ID mapping is held in memory as an int array (provided the graph has fewer than 2³¹ vertices). The Java tool that folds the host graph is best run via the script host2domaingraph.sh.
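The folding step can be sketched as follows. This is a simplified illustration, not the actual tool: it derives the domain from the last two host labels, whereas the real implementation must use the public suffix list to find registered domains, and the class and data below are hypothetical.

```java
import java.util.*;

public class HostToDomainFolder {

    // Simplified domain extraction: keep the last two labels of the host name.
    // Assumption for illustration only; correct folding requires the public suffix list.
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) {
        // host vertices in sorted order; a host's ID is its array index
        String[] hosts = {"blog.example.com", "example.com", "www.example.org"};

        // assign domain IDs in order of first appearance and
        // keep the host-ID -> domain-ID mapping in an int array
        Map<String, Integer> domainIds = new LinkedHashMap<>();
        int[] hostToDomain = new int[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            hostToDomain[i] = domainIds.computeIfAbsent(domainOf(hosts[i]),
                                                        d -> domainIds.size());
        }

        // fold host edges to domain edges, dropping self-loops and duplicates
        int[][] hostEdges = {{0, 1}, {0, 2}, {1, 2}};
        Set<Long> domainEdges = new TreeSet<>();
        for (int[] e : hostEdges) {
            int from = hostToDomain[e[0]], to = hostToDomain[e[1]];
            if (from != to) {
                domainEdges.add(((long) from << 32) | to);
            }
        }
        for (long e : domainEdges) {
            System.out.println((e >> 32) + " -> " + (e & 0xffffffffL));
        }
    }
}
```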
Processing Graphs using the Webgraph Framework
To analyze the graph structure and compute rankings, you may further process the graphs using software from the Laboratory for Web Algorithmics (LAW) at the University of Milan, namely the WebGraph framework and the LAW library.
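The WebGraph library can also be used programmatically. The sketch below loads a graph in BVGraph format by its basename and prints a few basic statistics; it assumes the WebGraph jar is on the classpath, and "graph_name" is a placeholder basename, not a file shipped with this project.

```java
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

public class GraphStats {
    public static void main(String[] args) throws Exception {
        // load a graph stored in BVGraph format by its basename (placeholder)
        ImmutableGraph graph = ImmutableGraph.load("graph_name");
        System.out.println("nodes: " + graph.numNodes());

        // iterate over the successors (outgoing links) of node 0
        LazyIntIterator succ = graph.successors(0);
        for (int s; (s = succ.nextInt()) != -1; ) {
            System.out.println("0 -> " + s);
        }
    }
}
```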
A couple of scripts that help you install the WebGraph framework and run the tools to build and process the graphs are provided in src/script/webgraph_ranking/. They are also used to prepare the Common Crawl web graph releases. The first script installs the WebGraph and LAW software in the same directory as the scripts:
cd ./src/script/webgraph_ranking/
./install_webgraph.sh
cd ../../../
To process a web graph and rank its nodes, first adapt the configuration to your graph and hardware setup, then run:
./src/script/webgraph_ranking/process_webgraph.sh graph_name vertices.txt.gz edges.txt.gz output_dir
After a successful run, output_dir/ should contain all generated files.