Naive Example of Lambda Architecture for counting words
This is an example of the Lambda Architecture for counting words in real time, taking data from the Twitter Stream API (tweet4j API). It also generates an alert for a certain word passed as a parameter.
You can run the example, but there are some drawbacks that should be fixed in future iterations:
- Execution of Hadoop on demand (when new data arrives). How large must the input dataset be before the Hadoop process runs? How are the results and the batch views updated? (This topic is also known as "Incremental Batch Processing".)
- Sync between real-time and batch views. When should a real-time view be removed?
- Server layer: how should results be merged in a smart way? One idea is SPARQL federated queries (FedX).
- Which services or algorithms should be run on the data?
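As a starting point for the merge question above, here is a minimal sketch of the simplest possible strategy: add the real-time counts on top of the batch counts. The class `ViewMerger` and its `merge` method are hypothetical names, not part of this project. The sketch assumes both views are plain word-to-count maps and that the real-time view only holds counts for data the batch layer has not processed yet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the relatweet code base.
public class ViewMerger {

    /**
     * Returns a new map holding the batch count plus the real-time count
     * for every word. Neither input map is modified.
     * Assumption: the real-time view covers only data not yet in the batch view.
     */
    public static Map<String, Long> merge(Map<String, Long> batchView,
                                          Map<String, Long> realtimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        for (Map.Entry<String, Long> entry : realtimeView.entrySet()) {
            // Sum the counts when the word exists in both views.
            merged.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("lambda", 10L);
        batch.put("hadoop", 4L);

        Map<String, Long> realtime = new HashMap<>();
        realtime.put("lambda", 2L);
        realtime.put("storm", 1L);

        // "lambda" is counted in both views, so its merged count is 12.
        System.out.println(merge(batch, realtime));
    }
}
```

Under the same assumption, one possible answer to the sync question is to drop a real-time entry once a batch run has absorbed the data behind it. Otherwise the same tweets would be counted twice.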
Running the Batch Layer
1-Start the Hadoop instance
2-Load the input data into HDFS. For instance, "input" is the directory with the input files
$HADOOP_HOME/bin/hadoop fs -put input
$HADOOP_HOME/bin/hadoop fs -put /home/chema/data/input/ input
3-Compile and package with Maven
4-Run with Hadoop
$HADOOP_HOME/bin/hadoop jar target/relatweet-0.1-SNAPSHOT-job.jar input output
5-Check the results
$HADOOP_HOME/bin/hadoop fs -cat output/part-r-00000
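To inspect the results programmatically, the reducer output can be read into a map. This is a minimal sketch that assumes each line of `part-r-00000` has the form `word<TAB>count` (Hadoop's default text output format). The actual format depends on how the job is configured, and `WordCountOutput` is a hypothetical name, not part of this project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the relatweet code base.
public class WordCountOutput {

    /**
     * Parses lines of the assumed form "word<TAB>count" into a map that
     * keeps the order of the input lines. Blank lines are skipped.
     */
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            counts.put(fields[0], Integer.parseInt(fields[1].trim()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines standing in for the content of part-r-00000.
        List<String> sample = Arrays.asList("hadoop\t4", "lambda\t10");
        System.out.println(parse(sample)); // prints {hadoop=4, lambda=10}
    }
}
```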
- Create Batch views with SploutSQL
bin/splout-service.sh qnode start
bin/splout-service.sh dnode start
- Edit the hadoop script (only to ensure the Java library path is correct):
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -Djava.library.path=/home/chema/develop/tools/sqlite4java-282 -classpath "$CLASSPATH"
hadoop jar splout-hadoop-0.2.1-hadoop.jar simple-generate -i /home/chema/projects/seqos/prototypes/relatweet/output/part-r-00000 -o out-words -pby word -p 2 -s "word:string,count:int" --index "word" -t words -tb words
hadoop jar splout-hadoop-*-hadoop.jar deploy -q http://localhost:4412 -root out-words -ts words
- Check http://localhost:4412
- Remember to remove the dir out-words
Running the Server and Speed Layer
- Run redis-server.
- Run rabbitmq-server (usually it is automatically started by the OS).
- Run the Node.js app:
$ node src/main/webapp/app.js
- Run the topology for the speed and server layers:
mvn compile exec:java -Dexec.classpathScope=compile -Dexec.mainClass=org.seerc.relate.relatweet.lambda.server.LambdaWordCounterTopology
- Clean Redis with redis-cli:
- FLUSHDB - Removes data from your connection's CURRENT database.
- FLUSHALL - Removes data from ALL databases.