Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Browse files

Added a README.

  • Loading branch information...
commit 5455374cd8e56c306e2e81bcd7139fd49836e0eb 1 parent 8d18292
@dustin dustin authored
Showing with 40 additions and 0 deletions.
  1. +40 −0 README.markdown
40 README.markdown
@@ -0,0 +1,40 @@
+# Twitter Topword Tracker
+
+This tool follows a twitter stream and prints out the most frequently
+encountered words over the given reporting frequency.
+
+This is related to a concept we gave to a customer who wanted to solve
+this problem a while back. This is a significantly scaled down
+version, but it can exercise a server pretty hard by doing the
+following:
+
+# What it Does
+
+## Count Words
+
+Every tweet that comes in is split into the words that make it up
+and those words are counted within the tweet. Each word is then
+`incr`emented against the current time window.
+
+## Maintain Top Lists
+
+Also after each tweet, we maintain the top 1,000 words. In order to
+do this, we do a `cas` loop where we `get` the current list, do a
+giant `multi-get` of the union of all of the words currently in the
+list and all of the words we just found to find their current counts,
+then update the list with this value. The `cas` frequently fails, and
+will likely fail more with more `-workers`.
+
+## Reporting
+
+Based on the `-interval` parameter, a report will be sent to stdout
+showing the top of the top list. That's currently just a simple `get`
+and in-memory ops.
+
+# Doing More
+
+In general, twitter only provides a 1% stream of tweets. Using the
+`-multiplier` parameter, you can send every tweet through multiple
+times such that `-multipler=100` should approximately equal the
+traffic of all of twitter (though the words being processed would
+vary, etc...).
Please sign in to comment.
Something went wrong with that request. Please try again.