Sentinel is a Java project for real-time stream mining on the Twitter Public Stream using SAMOA and Apache Storm. It is a distributed system that aims to make use of new distributed algorithms. Currently, Sentinel supports only real-time distributed classification. See the Tasks section for details on how to work with Sentinel.
##Components
This component implements SAMOA's InstanceStreamClass. It gets the stream from StreamAPIReader and performs sketching, filtering, and related preprocessing.
This component connects to the Twitter Public Stream API, reads instances, and keeps them in an adaptive sliding window.
Represents an MVC-style model: a set of core attributes with their setters and getters.
Processors normalize the text of tweets, for example by removing emoticons, URLs, and Twitter-specific characters.
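As an illustration of this kind of normalization, here is a minimal sketch. The class name `TweetNormalizer` and the regular expressions are assumptions made for this example and are not taken from Sentinel's processors; the emoticon pattern only covers a few common ASCII emoticons.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not Sentinel's actual processor) that strips noise from tweet text. */
final class TweetNormalizer {
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern MENTION = Pattern.compile("@\\w+");
    // Covers only simple ASCII emoticons such as :) ;-) =D :P
    private static final Pattern EMOTICON = Pattern.compile("[:;=]-?[)(DPp]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String normalize(String tweet) {
        // Remove URLs first so that "://" is never mistaken for an emoticon.
        String text = URL.matcher(tweet).replaceAll(" ");
        text = MENTION.matcher(text).replaceAll(" ");
        text = EMOTICON.matcher(text).replaceAll(" ");
        // Keep the hashtag word but drop the '#' marker.
        text = text.replace("#", "");
        // Collapse the gaps left by the removals.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Loving #Java :) http://t.co/abc @bob"));
    }
}
```

Removing URLs before emoticons is a deliberate ordering choice in this sketch; a real processor chain may order or split these steps differently.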
Sketching algorithms such as SpaceSavings keep a compact in-memory summary of the text, which makes real-time stream mining possible. They also enable online approaches to data stream mining, which are more adaptive than hold-out approaches such as batch analysis of stream data.
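To make the idea concrete, the following is a minimal sketch of the Space-Saving algorithm for tracking frequent items with a fixed number of counters. The class `SpaceSavingSketch` and its method names are hypothetical and do not reflect Sentinel's own implementation; eviction uses a linear scan for brevity rather than the linked structure usually used for constant-time updates.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical Space-Saving sketch: tracks at most `capacity` items. */
final class SpaceSavingSketch {
    private final int capacity;
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSavingSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    void offer(String item) {
        Long current = counts.get(item);
        if (current != null) {
            counts.put(item, current + 1);          // already monitored
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);                   // free counter available
        } else {
            // Evict the item with the smallest count; the newcomer takes
            // over that counter and inherits its count plus one.
            String victim = null;
            long min = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < min) {
                    min = e.getValue();
                    victim = e.getKey();
                }
            }
            counts.remove(victim);
            counts.put(item, min + 1);
        }
    }

    /** Estimated count; may overestimate, and is 0 for unmonitored items. */
    long estimate(String item) {
        return counts.getOrDefault(item, 0L);
    }

    int size() {
        return counts.size();
    }

    public static void main(String[] args) {
        SpaceSavingSketch sketch = new SpaceSavingSketch(2);
        for (String word : "a a b c".split(" ")) {
            sketch.offer(word);
        }
        System.out.println("a=" + sketch.estimate("a") + " c=" + sketch.estimate("c"));
    }
}
```

Memory stays bounded by `capacity` regardless of stream length, at the cost of possibly overestimating the counts of items that replaced an evicted one.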
This component transforms tweet texts into sparse feature vectors and keeps only frequent features in memory.
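A rough sketch of such a transformation is shown below. It assumes a precomputed map from frequent terms to feature indices (for example, derived from a sketch of term frequencies); the class `SparseVectorizer`, the whitespace tokenization, and the lower-casing are illustrative assumptions, not Sentinel's actual code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical vectorizer: maps text to (feature index -> term count). */
final class SparseVectorizer {
    static SortedMap<Integer, Integer> vectorize(String text, Map<String, Integer> frequentTermIndex) {
        SortedMap<Integer, Integer> vector = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            Integer index = frequentTermIndex.get(token);
            if (index != null) {
                // Only frequent terms get an entry; all others are dropped.
                vector.merge(index, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new HashMap<>();
        index.put("good", 0);
        index.put("movie", 1);
        System.out.println(vectorize("Good good movie tonight", index));
    }
}
```

Storing only the non-zero entries keeps each vector proportional to the number of frequent terms in a tweet rather than to the size of the vocabulary.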
This component uses a classification approach to detect the language of a tweet, based on the Language Detection Library for Java.
Sentinel is a module of a bigger project. In order to use Sentinel, you need to run it with Apache Storm and SAMOA. Read the information at https://github.com/ambodi/samoa.
Clone the SAMOA fork for Sentinel:
git clone https://github.com/ambodi/samoa
Clone Sentinel:
git clone https://github.com/ambodi/sentinel
Put Sentinel under samoa-api/src/main/java/com/yahoo/labs/samoa/sentinel
Add a twitter4j.properties file in the root of the project. More info is available in Twitter4J's documentation on generic properties.
mvn clean install
Local Cluster:
mvn package
Apache Storm Cluster:
mvn -Pstorm package
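For reference, the twitter4j.properties file from the setup steps typically carries the OAuth credentials of your Twitter application. The fragment below is only a sketch: the property keys are the standard Twitter4J ones, while every value is a placeholder you must replace with your own credentials.

```properties
# Placeholder values; replace them with the credentials of your own Twitter application.
debug=false
oauth.consumerKey=YOUR_CONSUMER_KEY
oauth.consumerSecret=YOUR_CONSUMER_SECRET
oauth.accessToken=YOUR_ACCESS_TOKEN
oauth.accessTokenSecret=YOUR_ACCESS_TOKEN_SECRET
```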
Using the Vertical Hoeffding Tree as a distributed parallel classification algorithm, you can perform sentiment analysis on the Twitter Public Stream with the Prequential Evaluation task.
To perform sentiment analysis on up to 1000000 tweets in real time, sampling the learner's performance every 100000 instances, with 4 parallel nodes in your local cluster, run
bin/samoa local target/SAMOA-Local-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4) -s com.yahoo.labs.samoa.sentinel.model.TwitterStreamInstance"
To run it on an Apache Storm cluster instead, run
bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4) -s com.yahoo.labs.samoa.sentinel.model.TwitterStreamInstance"
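Prequential evaluation, as used by the commands above, tests the learner on each incoming instance before training on it. The sketch below illustrates only that test-then-train loop, using a toy majority-class learner over labels; the classes `PrequentialSketch` and `MajorityClassLearner` are hypothetical and are unrelated to SAMOA's PrequentialEvaluation task or to the Vertical Hoeffding Tree.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the prequential (test-then-train) loop. */
final class PrequentialSketch {

    /** Toy learner: always predicts the most frequent label seen so far. */
    static final class MajorityClassLearner {
        private final Map<String, Integer> labelCounts = new HashMap<>();

        String predict() {
            String best = null;            // null until something was learned
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : labelCounts.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    best = e.getKey();
                }
            }
            return best;
        }

        void train(String label) {
            labelCounts.merge(label, 1, Integer::sum);
        }
    }

    /** Returns the fraction of instances predicted correctly before being learned. */
    static double accuracy(List<String> labels) {
        MajorityClassLearner learner = new MajorityClassLearner();
        int correct = 0;
        for (String label : labels) {
            if (label.equals(learner.predict())) {  // 1. test on the new instance
                correct++;
            }
            learner.train(label);                   // 2. then train on it
        }
        return labels.isEmpty() ? 0.0 : (double) correct / labels.size();
    }

    public static void main(String[] args) {
        System.out.println(accuracy(Arrays.asList("pos", "pos", "neg", "pos")));
    }
}
```

Because every instance is used for testing exactly once, before the learner has seen it, no separate hold-out set is needed.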
Put the following code under samoa-local/src/main/java/com/yahoo/labs/samoa/ (or samoa-storm/src/main/java/com/yahoo/labs/samoa/ for Storm):
    public static void main(String[] args) {
        PrequentialEvaluation pe = new PrequentialEvaluation();
        pe.setFactory(new SimpleComponentFactory());
        // File that the evaluation results are written to.
        pe.dumpFileOption.setValueViaCLIString("/tmp/dump.csv");
        // Stop after 50 instances and sample the performance every 5 instances.
        pe.instanceLimitOption.setValue(50);
        pe.sampleFrequencyOption.setValue(5);
        // Learner: Vertical Hoeffding Tree with -p 1.
        pe.learnerOption.setValueViaCLIString("classifiers.trees.VerticalHoeffdingTree -p 1");
        // Stream source: Sentinel's Twitter stream.
        pe.streamTrainOption.setValueViaCLIString(TwitterStreamInstance.class.getName());
        pe.init();
    }
Run mvn -X exec:java -Dexec.mainClass=com.yahoo.labs.samoa.app
This is preferred if you are developing and want to make use of debug mode.
The use and distribution terms for this software are covered by the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).