cc-webgraph

Tools to construct and process web graphs from Common Crawl data

Compiling and Packaging Java Tools

The Java tools are compiled and packaged by Maven. If Maven is installed, just run mvn package. The Java tools can then be run via

java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <args>...

The assembly jar file requires Java 10 or later to run. It also includes the WebGraph and LAW packages required to compute PageRank and harmonic centrality.

Note that the webgraphs are usually multiple gigabytes in size and require a sufficiently large Java heap (Java option -Xmx) for processing.
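For example, to run a tool with a 16 GB heap (adjust the value to your hardware and the size of the graph):

java -Xmx16g -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <args>...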

Construction and Ranking of Host- and Domain-Level Web Graphs

Host-Level Web Graph

The host-level web graph is built with the help of PySpark; the corresponding code is found in the cc-pyspark project. Instructions are given in the script build_hostgraph.sh.

Domain-Level Web Graph

The domain-level web graph is distilled from the host-level graph by mapping host names to domain names. The ID mapping is kept in memory as an int array, or as one of FastUtil's big arrays if the host-level graph has more vertices than a Java array can hold (around 2³¹). The Java tool to fold the host graph is best run from the script host2domaingraph.sh.
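To illustrate this data structure choice (a minimal sketch, not the actual tool code; it assumes FastUtil on the classpath, and the vertex counts and IDs are made up):

import it.unimi.dsi.fastutil.ints.IntBigArrays;

public class IdMappingSketch {
    public static void main(String[] args) {
        long numVertices = 3_000_000_000L; // illustrative count above 2^31

        if (numVertices <= Integer.MAX_VALUE) {
            // the mapping fits into a plain Java int array
            int[] map = new int[(int) numVertices];
            map[42] = 7; // host ID 42 -> domain ID 7
        } else {
            // too large for a Java array: use FastUtil's big (two-dimensional)
            // array, indexed by long; allocating it needs a large heap (-Xmx)
            int[][] map = IntBigArrays.newBigArray(numVertices);
            IntBigArrays.set(map, 2_500_000_000L, 7);
            System.out.println(IntBigArrays.get(map, 2_500_000_000L));
        }
    }
}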

Processing Graphs using the WebGraph Framework

To analyze the graph structure and calculate rankings you may further process the graphs using software from the Laboratory for Web Algorithmics (LAW) at the University of Milano, namely the WebGraph framework and the LAW library.

A couple of scripts that help to run the WebGraph tools to build and process the graphs are provided in src/script/webgraph_ranking/. They're also used to prepare the Common Crawl web graph releases.

To process a webgraph and rank the nodes, you should first adapt the configuration to your graph and hardware setup:

vi ./src/script/webgraph_ranking/webgraph_config.sh
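The configuration mainly covers hardware-dependent settings such as memory and parallelism. A purely hypothetical illustration (the actual variable names are defined in webgraph_config.sh itself):

# hypothetical variable names - check webgraph_config.sh for the real ones
MEM=16g      # Java heap size (-Xmx) used by the WebGraph tools
THREADS=8    # number of parallel threads for the ranking steps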

After running

./src/script/webgraph_ranking/process_webgraph.sh graph_name vertices.txt.gz edges.txt.gz output_dir

the output_dir/ should contain all generated files, e.g. graph_name.graph and graph_name-ranks.txt.gz.

The shell script is easily adapted to your needs. Please refer to the LAW dataset tutorial and the API docs of LAW and WebGraph for further information.
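As a minimal sketch of the WebGraph API (assuming the assembly jar is on the classpath and a graph was already built under the basename output_dir/graph_name), a graph can be loaded and traversed like this:

import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

public class LoadGraphSketch {
    public static void main(String[] args) throws Exception {
        // loads graph_name.graph/.offsets/.properties written by the WebGraph tools
        ImmutableGraph graph = ImmutableGraph.load("output_dir/graph_name");
        System.out.println("nodes: " + graph.numNodes() + ", arcs: " + graph.numArcs());

        // iterate over the out-links (successors) of node 0
        LazyIntIterator successors = graph.successors(0);
        for (int s; (s = successors.nextInt()) != -1; ) {
            System.out.println("0 -> " + s);
        }
    }
}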

Exploring Webgraph Data Sets

The Common Crawl webgraph data sets are announced on the Common Crawl web site.

Instructions on how to explore the webgraphs are given in the cc-notebooks project.

Credits

Thanks to the authors of the WebGraph framework used to process the graphs and compute PageRank and harmonic centrality. See also Sebastiano Vigna's projects webgraph and webgraph-big.
