chemaar edited this page Mar 20, 2013 · 7 revisions

Naive Example of Lambda Architecture for counting words

This example applies the Lambda Architecture to counting words in real time, taking data from the Twitter Stream API (tweet4j API). It also generates an alert for a given word, passed as a parameter:

[Figure: Toy Lambda Architecture]
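The core idea (count every word as tweets arrive, and raise an alert when a chosen word shows up) can be sketched in a few lines. This is only an illustrative stand-in, not the project's code: the function name `count_words`, the tokenizer regex, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter


def count_words(tweets, alert_word):
    """Count words across tweets; collect the tweets that contain alert_word.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    alerts = []
    target = alert_word.lower()
    for tweet in tweets:
        # Assumed tokenizer: lowercase runs of letters, digits, '#', '@', '_' and "'".
        words = re.findall(r"[a-z0-9#@_']+", tweet.lower())
        counts.update(words)
        if target in words:
            alerts.append(tweet)
    return counts, alerts


# Made-up sample data standing in for the Twitter stream.
tweets = [
    "Counting words is fun",
    "words words everywhere",
    "An alert for the word lambda",
]
counts, alerts = count_words(tweets, "lambda")
print(counts["words"], alerts)
```

In the real example the stream is unbounded, so the counts live in an external store rather than in a local variable.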

You can run the example, but it has some drawbacks that should be fixed in future iterations:

  • Execution of Hadoop on demand (when new data arrives). How large must the input dataset be before the Hadoop process runs? How are the results and the batch views updated? (This topic is also known as "Incremental Batch Processing".)
  • Synchronization between real-time and batch views. When should a real-time view be removed?
  • Server layer: how can results be merged in a smart way? One idea is to use SPARQL federated queries (FedX)...
  • Which services or algorithms must be run on the data?
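To make the synchronization and merging questions concrete, here is one possible approach, sketched under stated assumptions rather than taken from the project: real-time events carry a timestamp, every batch run records the timestamp up to which it has absorbed data, and the server adds the batch count to the count of events newer than that cutoff. The names `prune_realtime`, `merge_views`, and `batch_cutoff` are hypothetical.

```python
from collections import Counter


def prune_realtime(events, batch_cutoff):
    """Drop real-time events that the last batch run has already absorbed."""
    return [(ts, word) for ts, word in events if ts > batch_cutoff]


def merge_views(batch_view, events):
    """Merged count per word = batch count + count of remaining real-time events."""
    merged = Counter(batch_view)
    merged.update(word for _, word in events)
    return dict(merged)


# Made-up data: the batch view covers everything up to timestamp 200.
batch_view = {"lambda": 10, "hadoop": 4}
events = [(100, "lambda"), (205, "lambda"), (210, "redis")]

fresh = prune_realtime(events, batch_cutoff=200)
merged = merge_views(batch_view, fresh)
print(merged)
```

Pruning by cutoff avoids counting the same tweet twice, once in the batch view and once in the real-time view. It does not answer how often the batch job should run, which remains an open question in the list above.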

Running the Batch Layer

  • Start the Hadoop instance: $HADOOP_HOME/bin/

  • Load the input data into HDFS, where "input" is the directory with the input files: $HADOOP_HOME/bin/hadoop fs -put input

E.g.: $HADOOP_HOME/bin/hadoop fs -put /home/chema/data/input/ input

  • Compile and package with Maven: mvn assembly:assembly

  • Run with Hadoop:

$HADOOP_HOME/bin/hadoop jar target/relatweet-0.1-SNAPSHOT-job.jar input output

  • Check the results:

$HADOOP_HOME/bin/hadoop fs -cat output/part-r-00000

  • Create batch views with SploutSQL:
      • bin/ qnode start
      • bin/ dnode start
      • Edit the hadoop script to ensure the Java library path is correct: exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -Djava.library.path=/home/chema/develop/tools/sqlite4java-282 -classpath "$CLASSPATH"
      • hadoop jar splout-hadoop-0.2.1-hadoop.jar simple-generate -i /home/chema/projects/seqos/prototypes/relatweet/output/part-r-00000 -o out-words -pby word -p 2 -s "word:string,count:int" --index "word" -t words -tb words
      • hadoop jar splout-hadoop-*-hadoop.jar deploy -q http://localhost:4412 -root out-words -ts words
      • Check http://localhost:4412
      • Remember to remove the out-words directory
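The part-r-00000 file checked above is the same one that simple-generate loads with the schema "word:string,count:int". As a quick sanity check before building the batch view, its contents can be parsed and ranked. The sketch below assumes each line has the form word, TAB, count, which is the default for Hadoop's text output but is not confirmed by this page. The helper names `parse_part_file` and `top_words` are hypothetical.

```python
def parse_part_file(lines):
    """Parse lines assumed to look like 'word<TAB>count' into a dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


def top_words(counts, n):
    """Return the n most frequent words; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample standing in for the output of 'hadoop fs -cat output/part-r-00000'.
sample = ["hadoop\t4\n", "lambda\t10\n", "redis\t1\n", "\n"]
counts = parse_part_file(sample)
print(top_words(counts, 2))
```

If the job writes a different separator, only the `rpartition` argument needs to change.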

Running the Server and Speed Layer

  1. Run redis-server ($ redis-server)
  2. Run rabbitmq-server (usually it is automatically started by the OS).
  3. Run node.js app ($ node src/main/webapp/app.js)
  4. Run the topology of the speed and server layers:

mvn compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=org.seerc.relate.relatweet.lambda.server.LambdaWordCounterTopology

Show results



  • Clean Redis with redis-cli:
      • FLUSHDB - Removes data from your connection's CURRENT database.
      • FLUSHALL - Removes data from ALL databases.