
Real-Time, Twitter sentiment analyzer engine

latest commit 28738d2066
cyhex authored February 27, 2014

Streamcrab

Streamcrab is a real-time Twitter sentiment analyzer.

This is the second version of the tool, rewritten completely from the previous version (still available in the legacy branch).

Demo: http://www.streamcrab.com

Changes from previous version

  • Supports MaxEnt and Bayes classifiers (defaults to MaxEnt)
  • Simplified tweets collection (see Collecting raw Tweets)
  • Simplified trainer (see Train classifier)
  • Built-in HTTP server and frontend based on gevent and Flask
  • Covered by unit tests
  • Utilization of multi-core systems
  • Scalable (in theory :)

Requirements

  • python 2.7
  • python2.7-dev
  • mongodb server

Debian-like systems:

apt-get install python2.7 python2.7-dev mongodb-server

Checkout

Check out the latest streamcrab code from GitHub:

git clone https://github.com/cyhex/streamcrab.git ./streamcrab
cd streamcrab

Configure

Copy smm/config.default.py to smm/config.py and edit smm/config.py according to your needs:

cp smm/config.default.py smm/config.py
nano smm/config.py

Installation & Setup

Download and install required libs and data

python setup.py develop
python toolbox/setup-app.py

Testing

Run the unit tests:

python -m unittest discover tests

Collecting raw Tweets

Training data is built on the assumption that tweets with happy emoticons :) have positive sentiment polarity and tweets with sad emoticons :( have negative sentiment polarity.

Whether this assumption is correct is outside the scope of this document.
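The labeling assumption above can be sketched in a few lines of Python. This is only an illustration, not the code used by toolbox/collect-tweets.py: the function name label_tweet and the extra emoticon variants (":-)", ":D", ":-(") are assumptions added for the example.

```python
# Illustrative sketch only; label_tweet and the emoticon lists are
# assumptions, not taken from the streamcrab sources.
HAPPY = (":)", ":-)", ":D")
SAD = (":(", ":-(")


def label_tweet(text):
    """Return 'positive', 'negative', or None if the tweet is ambiguous."""
    has_happy = any(emoticon in text for emoticon in HAPPY)
    has_sad = any(emoticon in text for emoticon in SAD)
    if has_happy == has_sad:
        # Neither kind of emoticon, or both: no usable training label.
        return None
    return "positive" if has_happy else "negative"


labels = [label_tweet(t) for t in ("great day :)", "missed my train :(")]
```

Tweets with no emoticon, or with both kinds, are dropped here rather than guessed at, which keeps the training labels as clean as the assumption allows.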

Collect 2000 'happy' tweets

python toolbox/collect-tweets.py happy 2000

Collect 2000 'sad' tweets

python toolbox/collect-tweets.py sad 2000

For more options, see:

python toolbox/collect-tweets.py --help

Train classifier

Create and save a new classifier trained on the collected tweets:

python toolbox/train-classifier.py maxEntTestCorpus 2000

For more options, see:

python toolbox/train-classifier.py --help
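To give a feel for what training on emoticon-labeled tweets means, here is a toy naive Bayes classifier with add-one smoothing in plain Python. It is a swapped-in teaching example, not the trainer behind toolbox/train-classifier.py; the names tokenize, train, and classify and the four sample tweets are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase, split on whitespace, and drop the emoticons used as labels.
    return [w for w in text.lower().split() if w not in (":)", ":(")]


def train(labeled_tweets):
    """Count, per label, how many tweets contain each word."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_tweets:
        label_counts[label] += 1
        word_counts[label].update(set(tokenize(text)))
    vocabulary = set(w for counts in word_counts.values() for w in counts)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total = float(sum(label_counts.values()))
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denominator = float(sum(word_counts[label].values()) + len(vocabulary))
        for word in set(tokenize(text)):
            if word in vocabulary:  # unseen words are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical sample data, labeled by emoticon.
model = train([
    ("what a great day :)", "positive"),
    ("love this song :)", "positive"),
    ("this is a bad day :(", "negative"),
    ("i hate mondays :(", "negative"),
])
```

With this model, classify(model, "bad mondays") picks the negative label and classify(model, "great song") picks the positive one.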

Start server stack

Open three shells and run one of the following commands in each:

python start-collector.py
python start-classifier.py
python start-server.py

Then open a browser at http://127.0.0.1:5000
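If three shells are inconvenient during development, the same three scripts could be launched from one small Python helper. This is a sketch under the assumption that it runs from the repository root; build_commands and start_stack are hypothetical names, and for real deployments see Production & deployment.

```python
import subprocess
import sys

# The three entry points listed above, started in the same order.
SCRIPTS = ("start-collector.py", "start-classifier.py", "start-server.py")


def build_commands(python=sys.executable):
    """Return one argv list per script, using the given interpreter."""
    return [[python, script] for script in SCRIPTS]


def start_stack():
    """Start each script as a child process and return the Popen handles."""
    return [subprocess.Popen(command) for command in build_commands()]


# Usage (not executed here): keep the handles so the stack can be stopped.
#   processes = start_stack()
#   ...
#   for process in processes:
#       process.terminate()
```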

Show stats

Show detailed info on collected Tweets and saved classifiers

python toolbox/show-classifiers.py

It's worth mentioning that Training data size is the size of the trained classifier after it has been serialized (pickled) with protocol=1; actual memory usage may vary.
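A serialized size like this can be measured with the standard pickle module. The sketch below uses a small dictionary as a stand-in for a trained classifier; pickled_size is a hypothetical helper name, not a function from the toolbox.

```python
import pickle


def pickled_size(obj, protocol=1):
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=protocol))


# Stand-in object; a real classifier would be far larger.
weights = {"bad==1": 0.074, "today==1": 0.027, "day==1": 0.008}
size = pickled_size(weights)
```

The pickled size only describes the byte stream on disk or in the database; the in-memory footprint of the unpickled object is generally different.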

Interactive shell

You can interact directly with a trained classifier and get verbose output on how the score is calculated. Replace maxEntTestCorpus with the desired classifier name; see Show stats to display the available classifiers.

python toolbox/shell-classifier.py maxEntTestCorpus

You should see:

exit: ctrl+c

Loaded maxEntTestCorpus
Classify:

Type something and hit enter:

Classify: today is a bad day for this nation

Classification: negative with 53.29%

Feature                                          negativ positiv
----------------------------------------------------------------
bad==1 (1)                                         0.074
today==1 (1)                                       0.027
day==1 (1)                                         0.008
bad==1 (1)                                                -0.178
nation==1 (1)                                              0.139
today==1 (1)                                              -0.035
day==1 (1)                                                -0.007
-----------------------------------------------------------------
TOTAL:                                             0.109  -0.081
PROBS:                                             0.533   0.467
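The TOTAL and PROBS rows can be reproduced by hand. The numbers in this sample are consistent with summing the feature weights per label and normalizing 2 raised to each total, which is how NLTK's MaxentClassifier treats base-2 log weights. That reading is inferred from the printed values, not from the streamcrab sources, so treat it as an assumption.

```python
# Feature weights copied from the sample output above.
negative_weights = {"bad": 0.074, "today": 0.027, "day": 0.008}
positive_weights = {"bad": -0.178, "nation": 0.139, "today": -0.035, "day": -0.007}


def probabilities(totals):
    """Normalize 2**total per label (assumes base-2 log weights)."""
    scores = dict((label, 2.0 ** total) for label, total in totals.items())
    norm = sum(scores.values())
    return dict((label, score / norm) for label, score in scores.items())


totals = {
    "negative": sum(negative_weights.values()),  # TOTAL column: 0.109
    "positive": sum(positive_weights.values()),  # TOTAL column: -0.081
}
probs = probabilities(totals)
```

Rounded to three decimals, probs gives 0.533 for negative and 0.467 for positive, matching the PROBS row and the reported 53.29%.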

For more options, see:

python toolbox/shell-classifier.py --help

Training and testing results

See: https://github.com/cyhex/streamcrab/blob/master/docs/acurracy_tests.md

Production & deployment

Run everything behind nginx >= 1.3.13 and automate process management with supervisord.

nginx supports WebSockets as of version 1.3.13, so you should probably use the latest stable version.

This is only one of many ways to deploy the app. The ex.conf folder contains sample config files for nginx and supervisord.

Links, Sources etc
