Mirror of Apache Giraph


Giraph : Large-scale graph processing on Hadoop

Web and online social graphs have been rapidly growing in size and
scale during the past decade.  In 2008, Google estimated that the
number of web pages reached over a trillion.  Online social networking
and email sites, including Yahoo!, Google, Microsoft, Facebook,
LinkedIn, and Twitter, have hundreds of millions of users and are
expected to grow much more in the future.  Processing these graphs
plays a big role in delivering relevant and personalized information
to users, such as results from a search engine or news in an online
social networking site.

Graph processing platforms to run large-scale algorithms (such as page
rank, shared connections, personalization-based popularity, etc.) have
become quite popular.  Some recent examples include Pregel and HaLoop.
For general-purpose big data computation, the map-reduce computing
model has been widely adopted, and the most widely deployed map-reduce
infrastructure is Apache Hadoop.  We have implemented a
graph-processing framework that is launched as a typical Hadoop job to
leverage existing Hadoop infrastructure, such as Amazon’s EC2.  Giraph
builds upon the graph-oriented nature of Pregel but additionally adds
fault-tolerance to the coordinator process with the use of ZooKeeper
as its centralized coordination service.

Giraph follows the bulk-synchronous parallel model applied to graphs,
in which vertices can send messages to other vertices during a given
superstep.  Checkpoints are initiated by the Giraph infrastructure at
user-defined intervals and are used for automatic application restarts
when any worker in the application fails.  Any worker in the
application can act as the application coordinator and one will
automatically take over if the current application coordinator fails.


Hadoop versions for use with Giraph:

Secure Hadoop versions:

- Apache Hadoop

  This is the default version used by Giraph: if you do not specify a
profile with the -P flag, maven will use this version. You may also
explicitly specify it with "mvn -Phadoop_0.20.203 <goals>".

- Apache Hadoop 1.0.2

  You may tell maven to use this version with "mvn -Phadoop_1.0 <goals>".

- Apache Hadoop 0.23.1

  You may tell maven to use this version with "mvn -Phadoop_0.23 <goals>".

- Apache Hadoop 3.0.0-SNAPSHOT

  You may tell maven to use this version with "mvn -Phadoop_trunk <goals>".

Unsecure Hadoop versions:

- Apache Hadoop 0.20.1, 0.20.2, 0.20.3

  You may tell maven to use 0.20.2 with "mvn -Phadoop_non_secure <goals>".

- Facebook Hadoop releases: https://github.com/facebook/hadoop-20, Master branch

  You may tell maven to use this version with "mvn -Phadoop_facebook <goals>"

-- Other versions reported working include:
---  Cloudera CDH3u0, CDH3u1

While we provide support for unsecure and Facebook versions of Hadoop
with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
respectively, we have been primarily focusing on secure Hadoop releases
at this time.
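
To check a change against several Hadoop versions, the profiles listed
above can be looped over.  The following is a minimal sketch, not part of
the Giraph build: the function name for_each_profile and the RUN switch
are hypothetical, and the profile names are the ones given in this
section.  By default it only prints the commands it would run.

```shell
#!/bin/sh
# Maven profiles named in this README (the default one is listed explicitly).
PROFILES="hadoop_0.20.203 hadoop_1.0 hadoop_0.23 hadoop_trunk hadoop_non_secure hadoop_facebook"

# Print "mvn -P<profile> <goals>" for each profile.
# With RUN=1 in the environment, execute the commands instead and
# stop at the first failing build.
for_each_profile() {
  for profile in $PROFILES; do
    if [ "${RUN:-0}" = "1" ]; then
      mvn "-P$profile" "$@" || return 1
    else
      echo "mvn -P$profile $*"
    fi
  done
}
```

For example, "for_each_profile clean verify" lists one mvn command per
profile, and "RUN=1 for_each_profile clean verify" actually runs them.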


Building and testing:

You will need the following:
- Java 1.6
- Maven 3 or higher. Giraph uses the munge plugin, which requires Maven 3,
  to support multiple versions of Hadoop. The web site plugin also
  requires Maven 3.

Use the maven commands with secure Hadoop to:
- compile (i.e. mvn compile)
- package (i.e. mvn package)
- test (i.e. mvn test)

For the non-secure versions of Hadoop, run the maven commands with the
additional argument '-Phadoop_non_secure'.
An example compilation command is 'mvn -Phadoop_non_secure compile'.

For the Facebook Hadoop release, run the maven commands with the
additional argument '-Phadoop_facebook'.
An example compilation command is 'mvn -Phadoop_facebook compile'.



Giraph is a multi-module maven project. The top level generates a POM that
carries information common to all the modules. Each module creates a jar with
the code contained in it.

The giraph/ module contains the main giraph code. If you only want to work on
the main code, you can do all your work inside this subdirectory.
Specifically, you would do something like:

  giraph-root/giraph/ $ mvn verify            # build from current state
  giraph-root/giraph/ $ mvn clean             # wipe out build files
  giraph-root/giraph/ $ mvn clean verify      # build from fresh state
  giraph-root/giraph/ $ mvn install           # install jar to local repository

The giraph-formats/ module contains hooks to read/write from various
formats (e.g. Accumulo, HBase, Hive). It depends on the giraph module. This
means if you make local changes to the giraph codebase you will first need to
install the giraph/ jar locally so that giraph-formats/ will pick it up.
In other words something like this:

  giraph-root/giraph/ $ mvn install
  giraph-root/giraph-formats $ mvn verify

To build everything at once you can issue the maven commands at the top level.
Note that we use "install" here so that any local changes to giraph/ that
giraph-formats/ needs are picked up: the giraph/ jar is installed locally
before giraph-formats/ is built.

  giraph-root/ $ mvn clean install


How to run the unittests on a local pseudo-distributed Hadoop instance:

As mentioned earlier, Giraph supports several versions of Hadoop.  In
this section, we describe how to run the Giraph unittests against a single
node instance of Apache Hadoop 0.20.203.

Download Apache Hadoop 0.20.203 (hadoop-
from a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/
and unpack it into a local directory

Follow the guide at
to set up a pseudo-distributed single-node Hadoop cluster.

Giraph’s code assumes that you can run at least 4 mappers at once.
Unfortunately, the default configuration allows only 2, so you need
to update conf/mapred-site.xml:
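
A minimal sketch of such an update, under stated assumptions: the
property names mapred.tasktracker.map.tasks.maximum and mapred.map.tasks
are the Hadoop 0.20.x/1.x settings for map slots per task tracker and
the default number of map tasks, and are not spelled out in this README.
The helper name print_mapper_slots is hypothetical.  It prints the
elements to paste inside the <configuration> element of
conf/mapred-site.xml.

```shell
#!/bin/sh
# Print <property> elements that allow the given number of concurrent
# mappers (default 4). Property names are assumed from Hadoop 0.20.x/1.x.
print_mapper_slots() {
  slots="${1:-4}"
  cat <<EOF
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>$slots</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>$slots</value>
</property>
EOF
}
```

Running "print_mapper_slots 4" prints both elements with the value 4.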



After preparing the local filesystem with:

rm -rf /tmp/hadoop-<username>
/path/to/hadoop/bin/hadoop namenode -format

you can start the local hadoop instance:


and finally run Giraph’s unittests:

mvn clean test -Dprop.mapred.job.tracker=localhost:9001

Now you can open a browser, point it to http://localhost:50030 and watch the
Giraph jobs from the unittests running on your local Hadoop instance!
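
The step that starts the local Hadoop instance can be sketched as
follows.  This assumes the stock bin/start-all.sh script that ships with
Hadoop 0.20.x; the function name start_local_hadoop is hypothetical, and
/path/to/hadoop is a placeholder for the directory where you unpacked
Hadoop.

```shell
#!/bin/sh
# Placeholder: point HADOOP_HOME at the unpacked Hadoop directory.
HADOOP_HOME="${HADOOP_HOME:-/path/to/hadoop}"

# Start the HDFS and map-reduce daemons of the local instance.
# Assumes bin/start-all.sh exists (it does in stock Hadoop 0.20.x);
# reports an error and returns 1 if it cannot be found.
start_local_hadoop() {
  if [ -x "$HADOOP_HOME/bin/start-all.sh" ]; then
    "$HADOOP_HOME/bin/start-all.sh"
  else
    echo "start-all.sh not found under $HADOOP_HOME/bin" >&2
    return 1
  fi
}
```

Call start_local_hadoop after formatting the namenode and before running
the mvn test command.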

Counter limit: In Hadoop onwards, there is a limit on the number of
counters one can use, which is set to 120 by default. This limit restricts the
number of iterations/supersteps possible in Giraph. This limit can be increased
by setting the parameter "mapreduce.job.counters.limit" in the job tracker's
config file, mapred-site.xml.
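
As a sketch of that change, the helper below prints a property element
for mapred-site.xml.  The helper name print_counter_limit is
hypothetical, and the value 1000 used in the example is only an
illustration; pick a limit that fits the number of supersteps you expect.

```shell
#!/bin/sh
# Print a <property> element that raises the counter limit.
# The parameter name comes from this README; the value is the caller's choice.
print_counter_limit() {
  limit="$1"
  cat <<EOF
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>$limit</value>
</property>
EOF
}
```

For example, "print_counter_limit 1000" prints the element with the value
1000, ready to paste inside <configuration> in mapred-site.xml.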