A pure Ruby implementation of the SpaceSaver algorithm for estimating the top K elements in a data stream
Ruby
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
lib
spec Add leaderboard reset function Feb 5, 2014
.document
.gitignore
.rspec
Gemfile
LICENSE.txt
README.md
Rakefile
VERSION
space-saver-redis.gemspec

README.md

space-saver-redis

This gem is a pure Ruby implementation of Metwally, Agrawal, and Abbadi's SpaceSaver algorithm for estimating the top K elements in a data stream. A Redis instance is used for storage.

Here's an example:

require 'redis'
require 'space-saver-redis'

# Estimate the top 10 most frequent items seen in a data stream
space_saver = SpaceSaver.new(Redis.new, 10)

urls_visited.each { |url| space_saver.increment("urls", url) }

After the above code executes, you can query space_saver.leaders("urls") to get an estimate of the top 10 most frequent URLs visited along with their estimated counts. The SpaceSaver instance uses only a Redis sorted set with at most K elements (10, in this case) at any time to make this estimation.

Obviously, since the data structure uses only a small, fixed amount of space, there are some data distributions that can cause the top K elements returned to be completely incorrect, but for a lot of data distributions the results are worth the savings in space. In particular, for a SpaceSaver instance initialized with parameter K that observes a data stream of N items, any item that occurs more than N/K times is guaranteed to be in the list of estimated leaders.

One way to cope with the error involved in this kind of estimation is to use a K bigger than you actually need and then truncate the number of leaders returned at query time. You can pass an additional parameter to the call to leaders to do this, for example space_saver.leaders("urls", 3) will return only the top 3 of the 10 estimated most frequent items.

Installation

gem install space-saver-redis