
Adding HyperLogLog::TimeSeriesCounter

The time series counter allows estimation of set cardinalities at
time intervals that start at arbitrary points in the past, e.g.
how many distinct additions to the set have happened in the past
week?

Moving the HyperLogLog implementation into HyperLogLog::Counter
since there are now two different counter implementations in this
gem.
1 parent 041dcf7 commit d86c5ffefd8d36b78f32151bb8acae3a66708bc4 @aaw committed Nov 17, 2012
Showing with 708 additions and 316 deletions.
  1. +1 −0 Gemfile
  2. +2 −0 Gemfile.lock
  3. +11 −0 HISTORY.md
  4. +124 −25 README.md
  5. +67 −0 lib/algorithm.rb
  6. +35 −0 lib/counter.rb
  7. +0 −93 lib/hyper_log_log.rb
  8. +3 −1 lib/hyperloglog-redis.rb
  9. +50 −0 lib/time_series_counter.rb
  10. +199 −197 spec/hyper_log_log_spec.rb
  11. +216 −0 spec/time_series_counter_spec.rb
1 Gemfile
@@ -7,4 +7,5 @@ group :development, :test do
gem 'jeweler', '~> 1.8.4'
gem 'rake', '~> 0.9.2.2'
gem 'rspec', '~> 2.11.0'
+ gem 'timecop', '~> 0.5.3'
end
2 Gemfile.lock
@@ -22,6 +22,7 @@ GEM
rspec-expectations (2.11.3)
diff-lcs (~> 1.1.3)
rspec-mocks (2.11.2)
+ timecop (0.5.3)
PLATFORMS
ruby
@@ -32,3 +33,4 @@ DEPENDENCIES
rake (~> 0.9.2.2)
redis (~> 3.0.1)
rspec (~> 2.11.0)
+ timecop (~> 0.5.3)
11 HISTORY.md
@@ -1,3 +1,14 @@
+## 2.0.0 (unreleased)
+
+* Changed the underlying storage from Redis hashes to bitstrings [simonkro](https://github.com/simonkro)
+ TODO: put example upgrade script here.
+
+* Moved main counter implementation from `HyperLogLog` to the class `HyperLogLog::Counter`
+
+* Added `HyperLogLog::TimeSeriesCounter`, a counter type that can estimate cardinalities
+ for all events from a particular point in the past until the present.
+
+
## 1.0.0 (10/26/2012)
* Changed the underlying storage from Redis sorted sets to Redis hashes. This
149 README.md
@@ -1,15 +1,14 @@
hyperloglog-redis
=================
-This gem is an implementation of the HyperLogLog algorithm for estimating
+This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating
cardinalities of sets observed via a stream of events. A [Redis](http://redis.io)
-instance is used for storing the counters. A simple example:
+instance is used for storing the counters. A minimal example:
require 'redis'
require 'hyperloglog-redis'
- redis = Redis.new
- counter = HyperLogLog.new(redis)
+ counter = HyperLogLog::Counter.new(Redis.new)
['john', 'paul', 'george', 'ringo', 'john', 'paul'].each do |beatle|
counter.add('beatles', beatle)
end
@@ -18,22 +17,26 @@ instance is used for storing the counters. A simple example:
Each HyperLogLog counter uses a small, fixed amount of space but can
estimate the cardinality of any set of up to around a billion values with
-relative error of about 1.04 / Math.sqrt(2 ** b), where b is a parameter
-passed to the HyperLogLog initializer that defaults to 10. With b = 10,
-each counter is represented by a Redis sorted set with 2 ** b = 1024 values
-(a few KB of space) and we get an expected relative error of 3%. Contrast this
-with the amount of space needed to compute set cardinality exactly, which is
-over 100 MB for a even a bit vector representing a set with a billion values.
-
-The basic idea of HyperLogLog (and its predecessors PCSA and LogLog) is to apply
-a good hash function to each value you see in the stream and record the longest
-run of zeros that you've seen as a prefix of any hashed value. If the hash
-function is good, you'd expect that its bits are statistically independent, so
-seeing a value that starts with exactly X zeros should happen with probability
-2 ** -(X + 1). So if you've seen a run of 5 zeros in one of your hash values,
+relative error of 1.04 / Math.sqrt(2 ** b) with high probability, where b is a
+parameter passed to the `HyperLogLog::Counter` initializer that defaults to 10.
+With b = 10, each counter is represented by a 1 KB string in Redis and we get
+an expected relative error of 3%. Contrast this with the amount of space needed
+to compute set cardinality exactly, which is over 100 MB for even a bit vector
+representing a set with a billion values.
+
+The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is
+to apply a good hash function to each value observed in the stream and record the longest
+run of zeros seen as a prefix of any hashed value. If the hash
+function is good, the bits in any hashed value should be close to statistically independent,
+so seeing a value that starts with exactly X zeros should happen with probability close to
+2 ** -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values,
you're likely to have around 2 ** 6 = 64 values in the underlying set. The actual
implementation and analysis are much more advanced than this, but that's the idea.
+This gem implements a few useful extensions to the basic HyperLogLog algorithm
+which allow you to estimate unions and intersections of counters as well as
+counts within specific time ranges. These extensions are described in detail below.
+
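The run-of-zeros intuition described above can be sketched in a few lines of plain Ruby. This is an illustrative toy, not the gem's implementation: it uses 32 bits of an MD5 digest where the gem uses MurmurHash, and it keeps only a single global maximum rather than the per-register averaging that makes HyperLogLog accurate.

```ruby
require 'digest'

# Position-of-first-one intuition: how many leading zeros does a 32-bit hash have?
def leading_zeros(hash32)
  return 32 if hash32.zero?
  31 - Math.log2(hash32).floor
end

# Crude estimator: a run of X leading zeros among the hashes suggests roughly
# 2 ** (X + 1) distinct values. Real HyperLogLog averages many such registers.
def crude_estimate(values)
  longest_run = values.map do |v|
    h = Digest::MD5.hexdigest(v.to_s)[0, 8].to_i(16) # take 32 hash bits
    leading_zeros(h)
  end.max
  2 ** (longest_run + 1)
end
```

A single maximum like this has huge variance; the actual algorithm splits the hash space into 2 ** b registers and combines their estimates, which is where the 1.04 / Math.sqrt(2 ** b) error bound comes from.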
The HyperLogLog algorithm is described and analyzed in the paper
["HyperLogLog: the analysis of a near-optimal cardinality estimation
algorithm"](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
@@ -49,31 +52,127 @@ You can also ask for an estimate of the union from multiple counters:
counter.add('wings', wings_member)
end
- puts "There are approximately #{counter.union('beatles', 'wings')} people who were in the Beatles or Wings"
+ puts "There are approximately #{counter.union(['beatles', 'wings'])} people who were in the Beatles or Wings"
The same relative error guarantee above applies to unions: a union of
-size N can be estimated to within N * (1.04 / Math.sqrt(2 ** b)) elements,
+size N can be estimated to within +/- N * (1.04 / Math.sqrt(2 ** b)) elements,
regardless of how many HyperLogLog counters that union spans. You can store
a unioned counter for querying or combining later with `union_store`:
- counter.union_store('all_beatles_and_wings_members', 'beatles', 'wings')
+ counter.union_store('all_beatles_and_wings_members', ['beatles', 'wings'])
puts "There are approximately #{counter.count('all_beatles_and_wings_members')} people who were in the Beatles or Wings"
Intersections can also be estimated:
- puts "There are approximately #{counter.intersection('beatles', 'wings')} people who were in both the Beatles and Wings"
+ puts "There are approximately #{counter.intersection(['beatles', 'wings'])} people who were in both the Beatles and Wings"
However, intersections of HyperLogLog counters are calculated indirectly via the
[inclusion/exclusion principle](http://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle)
as a sum of unions and there aren't good theoretical bounds on the error of that sum. In
practice, the estimates that come out of small intersections tend to follow the
-same relative error patterns, but beware using this type of estimation on large
-intersections, both because the errors can be much larger than those guaranteed
+same relative error patterns, but beware using this type of estimation on intersections
+of large numbers of sets, both because the errors can be much larger than those guaranteed
for unions and the complexity of computing intersections grows exponentially with
-the number of counters being intersected.
+the number of sets in the intersection.
+
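The inclusion/exclusion expansion behind intersection estimates can be checked against exact sets. The sketch below mirrors the gem's alternating sum over unions of every k-subset of counters, but swaps in exact Ruby `Set` unions where the gem would substitute HyperLogLog union estimates (names here are illustrative, not the gem's API):

```ruby
require 'set'

# |A ∩ B ∩ C| = |A| + |B| + |C| - |A∪B| - |A∪C| - |B∪C| + |A∪B∪C|
# Alternating sum over unions of every k-subset of the named sets.
def intersection_via_unions(names, union)
  (1..names.length).map do |k|
    names.combination(k).map do |group|
      (k.even? ? -1 : 1) * union.call(group)
    end.inject(0, :+)
  end.inject(0, :+)
end

sets = { a: Set.new(1..8), b: Set.new(5..12), c: Set.new(6..10) }
exact_union = ->(group) { group.map { |n| sets[n] }.reduce(:|).size }
intersection_via_unions(sets.keys, exact_union) # => 3, i.e. |{6, 7, 8}|
```

With estimated unions in place of exact ones, each of the 2 ** n - 1 terms carries its own error, which is why the error compounds and the work grows exponentially in the number of sets.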
+Set cardinality within a time interval
+======================================
+
+All examples up until now use `HyperLogLog::Counter`, which stores HyperLogLog
+counters as (2 ** b)-byte Redis strings. hyperloglog-redis also contains the counter implementation
+`HyperLogLog::TimeSeriesCounter`, which uses a little more space (Redis strings of up to
+4 * (32 - b) * (2 ** b) bytes) but allows you to estimate the cardinality of sets during
+certain time windows.
+
+Using `HyperLogLog::TimeSeriesCounter`, you can get estimates of the number of distinct
+elements added to a set in the past X seconds, for any value of X. A `HyperLogLog::TimeSeriesCounter`
+is initialized with the same arguments as a regular `Counter` but implements a
+superset of `HyperLogLog::Counter`'s interface. Namely, each of the methods `add`,
+`count`, `union`, `intersection`, and `union_store` takes an optional final time argument,
+either a Ruby `Time` or an integer representing seconds since the epoch.
+
+When passed a time argument t, `add` registers an addition to the set at time t. When no
+time is passed, the current system time is used. The methods `count`, `union`,
+`intersection`, and `union_store` all estimate set cardinality for the time interval
+consisting of all events that happened after time t when t is passed as a final argument.
+
+For example, to get the number of distinct user logins within the
+past week, we might call:
+
+ one_week = 60 * 60 * 24 * 7
+ logins_in_past_week = counter.count('user_logins', Time.now - one_week)
+
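The storage idea behind `HyperLogLog::TimeSeriesCounter` can be sketched without Redis: rather than keeping a single rho value per register, keep the latest time each (register, rho) pair was observed, and reconstruct ordinary registers for any cutoff t on demand. The class below is a toy model of that scheme; its names are illustrative and it is not the gem's API.

```ruby
# Toy model of the TimeSeriesCounter storage scheme: for each (bucket, rho)
# pair, remember the most recent time it was observed. Plain HyperLogLog
# registers for "events after time t" can then be rebuilt on demand.
class ToyTimeSeriesRegisters
  def initialize
    @seen = Hash.new(0) # [bucket, rho] => latest timestamp (0 = never seen)
  end

  def observe(bucket, rho, time)
    key = [bucket, rho]
    @seen[key] = time if time > @seen[key]
  end

  # Largest rho observed in each bucket strictly after `time`.
  def registers_since(time)
    registers = Hash.new(0)
    @seen.each do |(bucket, rho), t|
      registers[bucket] = rho if t > time && rho > registers[bucket]
    end
    registers
  end
end

ts = ToyTimeSeriesRegisters.new
ts.observe(3, 5, 100)       # bucket 3 saw rho = 5 at time 100
ts.observe(3, 2, 900)       # bucket 3 saw a smaller rho = 2 later, at time 900
ts.registers_since(0)[3]    # => 5 (all history)
ts.registers_since(500)[3]  # => 2 (only recent events)
```

Storing a timestamp per (register, rho) pair is what accounts for the larger 4 * (32 - b) * (2 ** b) byte bound on `HyperLogLog::TimeSeriesCounter` strings.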
+A note about relative errors
+============================
+
+With a parameter `b` in the range [4..16], HyperLogLog counters provide a relative
+error of 1.04/sqrt(2 ** b) with high probability. When unions, intersections, and
+time range queries are involved, it's sometimes not clear what the relative error
+is relative to, so here is some clarification:
+
+* For a union of counters, the relative error applies to the size of the union. Taking
+the union of counters is lossless in the sense that you end up with the same counter
+you would have arrived at had you observed the union of all of the individual events.
+
+* For an intersection of counters, there's no good theoretical bound on the relative
+error. In practice, and especially for intersections involving a small number of sets,
+the error you obtain tends to scale with the size of the union of the
+sets involved. For example, if you have two sets, each of cardinality 5000 and observe
+both sets through HyperLogLog counters with parameter b=10 (3% relative error), you can
+expect the intersection estimate to be within 10000 * 0.03 = 300 of the actual intersection
+size.
+
+* For time queries, the relative error applies to the size of the set within the time
+range you've queried. For example, given a set of cardinality 1,000,000 that has had
+100 distinct additions within the last 10 minutes, if you observe such a set with a
+HyperLogLog counter with parameter b=10 (3% relative error), you can expect the count
+returned from a query about the last 10 minutes to be within 3 of 100.
+
+Comparison to other approaches
+==============================
+
+When trying to optimize for space, two well-known alternatives to HyperLogLog exist:
+
+* Bit vectors: you provide some near-perfect hash function between keys in your domain
+and an interval of integers, then represent that interval of integers with bits.
+* Bloom filters with counters: use a [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)
+to keep track of items seen; on insert, when the Bloom filter tells you that the item
+seen is not in the set, increment the counter.
+
+Both bit vectors and Bloom filters can be augmented to hold timestamps for entries in the
+data structures and simulate counters for time ranges like `HyperLogLog::TimeSeriesCounter`.
+
+Bit vectors give exact counts, but the space complexity is linear with the size of
+the set, and you must either allocate a large bit vector upfront or cope with the complexity
+of dynamically resizing your bit vector as the set grows. Providing a manual mapping from
+members of your set to an interval of integers is sometimes a non-trivial task. Counts,
+unions, and intersections are all linear-time operations in the size of the universe of
+the set being represented.
+
+Bloom filters can be much more compact than bit vectors, but the actual count associated
+with a Bloom filter is an artifact of the construction of the data structure, so the cost
+of estimating a union or intersection is linear in the size of the Bloom filter. Getting
+high probability guarantees on the quality of the estimate of Bloom filter counts requires
+several "good" hash functions that have some degree of independence from each other; in
+practice, coming up with several independent implementations of good hash functions is
+difficult. Bloom filters require that all of their space be allocated upfront (re-hashing
+isn't possible without replaying all events), so in practice you need some estimate of
+how large the set will grow before allocating the filter.
+
+HyperLogLog counters take up less space than either of the above approaches and provide
+constant-time implementations (in the size of the sets being represented) of unions,
+intersections, and time range queries. A `HyperLogLog::Counter` with parameter b will
+be stored in a Redis string of length at most 2 ** b bytes, whereas a `HyperLogLog::TimeSeriesCounter` with parameter
+b will be stored in a Redis string of length at most 4 * (32 - b) * (2 ** b) bytes. For counters representing smaller sets,
+the size taken up by a `HyperLogLog::TimeSeriesCounter` can be significantly less. Here
+are some examples for specific values of b:
+
+* With b = 7, a `HyperLogLog::Counter` uses at most 128 bytes and a `HyperLogLog::TimeSeriesCounter` uses at most 13 KB while providing a relative error of 9%.
+* With b = 11, a `HyperLogLog::Counter` uses at most 2 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 168 KB while providing a relative error of 2%.
+* With b = 16, a `HyperLogLog::Counter` uses at most 64 KB and a `HyperLogLog::TimeSeriesCounter` uses at most 4 MB while providing a relative error of less than half a percent.
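The figures in the bullets above follow directly from the formulas already stated; a quick sanity check (the function name is for illustration only):

```ruby
# Storage bounds and error bound for a given parameter b, per the formulas above.
def counter_stats(b)
  {
    counter_bytes:     2 ** b,                   # HyperLogLog::Counter string
    time_series_bytes: 4 * (32 - b) * (2 ** b),  # HyperLogLog::TimeSeriesCounter string
    relative_error:    1.04 / Math.sqrt(2 ** b)
  }
end

stats = counter_stats(16)
stats[:counter_bytes]      # => 65536 (64 KB)
stats[:time_series_bytes]  # => 4194304 (4 MB)
```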
Installation
============
- gem install hyperloglog-redis
+ gem install hyperloglog-redis
67 lib/algorithm.rb
@@ -0,0 +1,67 @@
+require 'redis'
+require 'murmurhash3'
+
+module HyperLogLog
+ module Algorithm
+
+ def initialize(redis, b=10)
+ raise "Accuracy not supported. Please choose a value of b between 4 and 16" if b < 4 || b > 16
+ @redis = redis
+ @bits_in_hash = 32 - b
+ @m = (2 ** b).to_i
+ if @m == 16
+ @alpha = 0.673
+ elsif @m == 32
+ @alpha = 0.697
+ elsif @m == 64
+ @alpha = 0.709
+ else
+ @alpha = 0.7213/(1 + 1.079/@m)
+ end
+ end
+
+ def hash_info(value)
+ hash = MurmurHash3::V32.murmur3_32_str_hash(value)
+ [hash, hash % @m, rho(hash / @m)]
+ end
+
+ # Estimate the cardinality of the intersection of several sets. We do this by
+ # using the principle of inclusion and exclusion to represent the size of the
+ # intersection as the alternating sum of an exponential number of
+ # cardinalities of unions of smaller sets.
+ def intersection(counter_names, time=0)
+ icount = (1..counter_names.length).map do |k|
+ counter_names.combination(k).map do |group|
+ ((k % 2 == 0) ? -1 : 1) * union_helper(group, time)
+ end.inject(0, :+)
+ end.inject(0, :+)
+ [icount, 0].max
+ end
+
+ def union_helper(counter_names, time=0)
+ all_estimates = raw_union(counter_names, time).select{ |i| i > 0 }
+ estimate_sum = all_estimates.reduce(0.0){ |a, score| a + 2.0 ** -score }
+ estimate = @alpha * @m * @m / (estimate_sum + @m - all_estimates.length)
+ if estimate <= 2.5 * @m
+ if all_estimates.length == @m
+ estimate.round
+ else # Correction for small sets
+ (@m * Math.log(Float(@m)/(@m - all_estimates.length))).round
+ end
+ elsif estimate <= 2 ** 32 / 30.0
+ estimate.round
+ else # Correction for large sets
+ (-2**32 * Math.log(1 - estimate/(2.0**32))).round
+ end
+ end
+
+ # rho(i) is the position of the first 1 in the binary representation of i,
+ # reading from most significant to least significant bits. Some examples:
+ # rho(1...) = 1, rho(001...) = 3, rho(000...0) = @bits_in_hash + 1
+ def rho(i)
+ return @bits_in_hash + 1 if i == 0
+ @bits_in_hash - Math.log(i, 2).floor
+ end
+
+ end
+end
35 lib/counter.rb
@@ -0,0 +1,35 @@
+module HyperLogLog
+ class Counter
+ include Algorithm
+
+ def add(counter_name, value)
+ hash, function_name, new_value = hash_info(value)
+ existing_value = @redis.getrange(counter_name, function_name, function_name).unpack('C').first.to_i
+ @redis.setrange(counter_name, function_name, new_value.chr) if new_value > existing_value
+ end
+
+ # Estimate the cardinality of a single set
+ def count(counter_name)
+ union_helper([counter_name])
+ end
+
+ # Estimate the cardinality of the union of several sets
+ def union(counter_names)
+ union_helper(counter_names)
+ end
+
+ # Store the union of several sets in *destination* so that it can be used as
+ # a HyperLogLog counter later.
+ def union_store(destination, counter_names)
+ @redis.set(destination, raw_union(counter_names).inject('') {|a, e| a << e.chr})
+ end
+
+ def raw_union(counter_names, time=nil)
+ counters = @redis.mget(*counter_names).compact
+ return [] if counters.none?
+ return counters.first.each_byte if counters.one?
+ counters.map{|c| c.unpack("C#{@m}")}.transpose.map {|e| e.compact.max.to_i}
+ end
+
+ end
+end
93 lib/hyper_log_log.rb
@@ -1,93 +0,0 @@
-require 'redis'
-require 'murmurhash3'
-
-class HyperLogLog
- def initialize(redis, b=10)
- raise "Accuracy not supported. Please choose a value of b between 4 and 16" if b < 4 || b > 16
- @redis = redis
- @bits_in_hash = 32 - b
- @m = (2 ** b).to_i
- if @m == 16
- @alpha = 0.673
- elsif @m == 32
- @alpha = 0.697
- elsif @m == 64
- @alpha = 0.709
- else
- @alpha = 0.7213/(1 + 1.079/@m)
- end
- end
-
- def add(counter_name, value)
- hash = MurmurHash3::V32.murmur3_32_str_hash(value)
- function_name = hash % @m
- w = hash / @m
- existing_value = @redis.getrange(counter_name, function_name, function_name).unpack('C').first.to_i
- new_value = rho(w)
- @redis.setrange(counter_name, function_name, new_value.chr) if new_value > existing_value
- end
-
- # Estimate the cardinality of a single set
- def count(counter_name)
- union_helper([counter_name])
- end
-
- # Estimate the cardinality of the union of several sets
- def union(*counter_names)
- union_helper(counter_names)
- end
-
- # Store the union of several sets in *destination* so that it can be used as
- # a HyperLogLog counter later.
- def union_store(destination, *counter_names)
- @redis.set(destination, raw_union(counter_names).inject('') {|a, e| a << e.chr})
- end
-
- # Estimate the cardinality of the intersection of several sets. We do this by
- # using the principle of inclusion and exclusion to represent the size of the
- # intersection as the alternating sum of an exponential number of
- # cardinalities of unions of smaller sets.
- def intersection(*counter_names)
- icount = (1..counter_names.length).map do |k|
- counter_names.combination(k).map do |group|
- ((k % 2 == 0) ? -1 : 1) * union_helper(group)
- end.inject(0, :+)
- end.inject(0, :+)
- [icount, 0].max
- end
-
- def union_helper(counter_names)
- all_estimates = raw_union(counter_names).select{|i| i > 0}
- estimate_sum = all_estimates.reduce(0.0) {|a, score| a + 2.0 ** -score}
- estimate = @alpha * @m * @m * ((estimate_sum + @m - all_estimates.length) ** -1.0)
- if estimate <= 2.5 * @m
- if all_estimates.length == @m
- estimate.round
- else # Correction for small sets
- (@m * Math.log(Float(@m)/(@m - all_estimates.length))).round
- end
- elsif estimate <= 2 ** 32 / 30.0
- estimate.round
- else # Correction for large sets
- (-2**32 * Math.log(1 - estimate/(2.0**32))).round
- end
- end
-
- def raw_union(counter_names)
- counters = @redis.mget(*counter_names).compact
-
- return [] if counters.none?
- return counters.first.each_byte if counters.one?
-
- counters.map{|c| c.unpack("C#{@m}")}.transpose.map {|e| e.compact.max.to_i}
- end
-
- # rho(i) is the position of the first 1 in the binary representation of i,
- # reading from most significant to least significant bits. Some examples:
- # rho(1...) = 1, rho(001...) = 3, rho(000...0) = @bits_in_hash + 1
- def rho(i)
- return @bits_in_hash + 1 if i == 0
- @bits_in_hash - Math.log(i, 2).floor
- end
-
-end
4 lib/hyperloglog-redis.rb
@@ -1 +1,3 @@
-require "hyper_log_log"
+require "algorithm"
+require "counter"
+require "time_series_counter"
50 lib/time_series_counter.rb
@@ -0,0 +1,50 @@
+module HyperLogLog
+ class TimeSeriesCounter
+ include Algorithm
+
+ def add(counter_name, value, time=nil)
+ hash, function_name, new_value = hash_info(value)
+ index = 4 * (function_name + (new_value.to_i * @m))
+ if time.nil?
+ @redis.setrange(counter_name, index, [Time.now.to_i].pack('N'))
+ else
+ existing_time = @redis.getrange(counter_name, index, index + 3)
+ existing_val = existing_time.empty? ? 0 : existing_time.unpack('N').first
+ @redis.setrange(counter_name, index, [time.to_i].pack('N')) if time.to_i > existing_val
+ end
+ end
+
+ # Estimate the cardinality of a single set
+ def count(counter_name, time=0)
+ union_helper([counter_name], time)
+ end
+
+ # Estimate the cardinality of the union of several sets
+ def union(counter_names, time=0)
+ union_helper(counter_names, time)
+ end
+
+ # Store the union of several sets in *destination* so that it can be used as
+ # a HyperLogLog counter later.
+ def union_store(destination, counter_names, time=0)
+ raw_counters = @redis.mget(*counter_names).compact.map{ |c| c.unpack('N*').map{ |x| x > time ? x : 0 } }
+ max_length = raw_counters.map{ |c| c.length }.max
+ combined_counters = raw_counters.map{ |c| c.fill(0, c.length, max_length - c.length) }.transpose.map{ |e| e.max.to_i }
+ @redis.set(destination, combined_counters.pack('N*'))
+ end
+
+ def raw_union(counter_names, time=0)
+ raw_counters = @redis.mget(*counter_names).compact
+ return [] if raw_counters.none?
+ hyperloglog_counters = raw_counters.map do |counter|
+ slices = counter.unpack('N*').each_slice(@m).to_a
+ slices.last.fill(0, slices.last.length, slices.first.length - slices.last.length)
+ slices.transpose.map{ |x| x.rindex{ |c| c > time } || 0 }
+ end
+ return hyperloglog_counters.first if hyperloglog_counters.one?
+ max_length = hyperloglog_counters.map{ |c| c.length }.max
+ hyperloglog_counters.map{ |c| c.fill(0, c.length, max_length - c.length) }.transpose.map{ |e| e.max.to_i }
+ end
+
+ end
+end
396 spec/hyper_log_log_spec.rb
@@ -1,224 +1,226 @@
require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
describe HyperLogLog do
-
- it "doesn't change its count when it sees values that it's already seen" do
- redis = Redis.new
- counter = HyperLogLog.new(redis, 10)
- test_set = (1..100).map{ |x| x.to_s }
- test_set.each{ |value| counter.add("mycounter", value) }
- original_estimate = counter.count("mycounter")
- 5.times do
- test_set.each do |value|
- counter.add("mycounter", value)
- counter.count("mycounter").should == original_estimate
+
+ [HyperLogLog::Counter, HyperLogLog::TimeSeriesCounter].each do |counter_type|
+
+ it "doesn't change its count when it sees values that it's already seen" do
+ redis = Redis.new
+ counter = counter_type.new(redis, 10)
+ test_set = (1..100).map{ |x| x.to_s }
+ test_set.each{ |value| counter.add("mycounter", value) }
+ original_estimate = counter.count("mycounter")
+ 5.times do
+ test_set.each do |value|
+ counter.add("mycounter", value)
+ counter.count("mycounter").should == original_estimate
+ end
end
end
- end
- it "can maintain more than one logically distinct counter" do
- redis = Redis.new
- counter = HyperLogLog.new(redis, 10)
- other_estimate = counter.count("counter2")
- (1..100).each do |i|
- counter.add("counter1", i.to_s)
- counter.count("counter2").should == other_estimate
+ it "can maintain more than one logically distinct counter" do
+ redis = Redis.new
+ counter = counter_type.new(redis, 10)
+ other_estimate = counter.count("counter2")
+ (1..100).each do |i|
+ counter.add("counter1", i.to_s)
+ counter.count("counter2").should == other_estimate
+ end
+ other_estimate = counter.count("counter1")
+ (101..200).each do |i|
+ counter.add("counter2", i.to_s)
+ counter.count("counter1").should == other_estimate
+ end
+ other_estimate = counter.count("counter2")
+ (201..300).each do |i|
+ counter.add("counter1", i.to_s)
+ counter.count("counter2").should == other_estimate
+ end
+ counter.count("counter1").should > 100
+ counter.count("counter2").should > 50
+ counter.count("counter1").should > counter.count("counter2")
end
- other_estimate = counter.count("counter1")
- (101..200).each do |i|
- counter.add("counter2", i.to_s)
- counter.count("counter1").should == other_estimate
+
+ it "can exactly count small sets" do
+ redis = Redis.new
+ counter = counter_type.new(redis, 11)
+ 10.times { |i| counter.add("mycounter", i.to_s) }
+ counter.count("mycounter").should == 10
end
- other_estimate = counter.count("counter2")
- (201..300).each do |i|
- counter.add("counter1", i.to_s)
- counter.count("counter2").should == other_estimate
- end
- counter.count("counter1").should > 100
- counter.count("counter2").should > 50
- counter.count("counter1").should > counter.count("counter2")
- end
-
- it "can exactly count small sets" do
- redis = Redis.new
- counter = HyperLogLog.new(redis, 11)
- 10.times { |i| counter.add("mycounter", i.to_s) }
- counter.count("mycounter").should == 10
- end
-
- it "can exactly count small unions" do
- redis = Redis.new
- counter = HyperLogLog.new(redis, 11)
- (1..8).each { |i| counter.add("mycounter1", i.to_s) }
- (5..12).each { |i| counter.add("mycounter2", i.to_s) }
- counter.union("mycounter1", "mycounter2").should == 12
- end
-
- it "can exactly count small intersections" do
- redis = Redis.new
- counter = HyperLogLog.new(redis, 11)
- (1..8).each { |i| counter.add("mycounter1", i.to_s) }
- (5..12).each { |i| counter.add("mycounter2", i.to_s) }
- counter.intersection("mycounter1", "mycounter2").should == 4
- end
-
- it "can store unions for querying later" do
- redis = Redis.new
- counter = HyperLogLog.new(redis, 11)
- (1..10).each { |i| counter.add("mycounter1", i.to_s) }
- (5..15).each { |i| counter.add("mycounter2", i.to_s) }
- (15..25).each { |i| counter.add("mycounter3", i.to_s) }
- (20..50).each { |i| counter.add("mycounter4", i.to_s) }
- counter.union_store("aggregate_counter", "mycounter1", "mycounter2", "mycounter3", "mycounter4")
- counter.union("mycounter1", "mycounter2", "mycounter3", "mycounter4").should == counter.count("aggregate_counter")
- end
-
- # With parameter b, HyperLogLog should produce estimates that have
- # relative error of 1.04 / Math.sqrt(2 ** b). Of course, this analysis
- # is based on assumptions that aren't necessarily true in practice and
- # the observed relative error will depend on the distribution of data
- # we receive as well as the interaction of the murmur hash implementation
- # with that data. Keeping that in mind, the following spec makes sure
- # that in the process of adding 1000 values to a set, HyperLogLog only
- # gives bad estimates (more than twice the expected relative error) in
- # less than 1% of the cases and never gives very bad estimates (more than
- # three times the expected relative error.)
- #
- # It's fine to fudge these numbers a little if the implementation changes,
- # since you can clearly find a different set of values that make this test
- # fail even without changing the implementation. But it should serve as a
- # good indication that there aren't any logical errors in the HyperLogLog
- # implementation, since it exercises all of the cases in HyperLogLog's
- # count method except for the correction for very large set sizes.
-
- it "produces acceptable estimates for counts" do
- max_items = 1000
- redis = Redis.new
- (6..16).each do |b|
- counter = HyperLogLog.new(redis, b)
- redis.del('mycounter')
+
+ it "can exactly count small unions" do
+ redis = Redis.new
+ counter = counter_type.new(redis, 11)
+ (1..8).each { |i| counter.add("mycounter1", i.to_s) }
+ (5..12).each { |i| counter.add("mycounter2", i.to_s) }
+ counter.union(["mycounter1", "mycounter2"]).should == 12
+ end
+
+ it "can exactly count small intersections" do
+ redis = Redis.new
+ counter = counter_type.new(redis, 11)
+ (1..8).each { |i| counter.add("mycounter1", i.to_s) }
+ (5..12).each { |i| counter.add("mycounter2", i.to_s) }
+ counter.intersection(["mycounter1", "mycounter2"]).should == 4
+ end
+
+ it "can store unions for querying later" do
+ redis = Redis.new
+ counter = counter_type.new(redis, 11)
+ (1..10).each { |i| counter.add("mycounter1", i.to_s) }
+ (5..15).each { |i| counter.add("mycounter2", i.to_s) }
+ (15..25).each { |i| counter.add("mycounter3", i.to_s) }
+ (20..50).each { |i| counter.add("mycounter4", i.to_s) }
+ counter.union_store("aggregate_counter", ["mycounter1", "mycounter2", "mycounter3", "mycounter4"])
+ counter.union(["mycounter1", "mycounter2", "mycounter3", "mycounter4"]).should == counter.count("aggregate_counter")
+ end
+
+ # With parameter b, HyperLogLog should produce estimates that have
+ # relative error of 1.04 / Math.sqrt(2 ** b). Of course, this analysis
+ # is based on assumptions that aren't necessarily true in practice and
+ # the observed relative error will depend on the distribution of data
+ # we receive as well as the interaction of the murmur hash implementation
+ # with that data. Keeping that in mind, the following spec makes sure
+ # that in the process of adding 1000 values to a set, HyperLogLog only
+ # gives bad estimates (more than twice the expected relative error) in
+ # less than 1% of the cases and never gives very bad estimates (more than
+ # three times the expected relative error.)
+ #
+ # It's fine to fudge these numbers a little if the implementation changes,
+ # since you can clearly find a different set of values that make this test
+ # fail even without changing the implementation. But it should serve as a
+ # good indication that there aren't any logical errors in the HyperLogLog
+ # implementation, since it exercises all of the cases in HyperLogLog's
+ # count method except for the correction for very large set sizes.
+
+ it "produces acceptable estimates for counts" do
+ max_items = 1000
+ redis = Redis.new
+ (6..16).each do |b|
+ counter = counter_type.new(redis, b)
+ redis.del('mycounter')
+ bad_estimates = 0
+ very_bad_estimates = 0
+ expected_relative_error = 1.04 / Math.sqrt(2 ** b)
+ max_items.times do |i|
+ value = Digest::MD5.hexdigest("value#{i}")
+ counter.add("mycounter", value)
+ actual = i + 1
+ approximate = counter.count("mycounter")
+ relative_error = (actual - approximate).abs / Float(actual)
+ bad_estimates += 1 if relative_error > expected_relative_error * 2
+ very_bad_estimates += 1 if relative_error > expected_relative_error * 3
+ end
+ bad_estimates.should < max_items / 100.00
+ very_bad_estimates.should == 0
+ end
+ end
+
+ it "produces acceptable estimates for unions with few elements in common" do
+ b, max_items = 10, 2000
+ counter = counter_type.new(Redis.new, b)
bad_estimates = 0
very_bad_estimates = 0
expected_relative_error = 1.04 / Math.sqrt(2 ** b)
max_items.times do |i|
- value = Digest::MD5.hexdigest("value#{i}")
- counter.add("mycounter", value)
- actual = i + 1
- approximate = counter.count("mycounter")
+ value1 = Digest::MD5.hexdigest("value#{i}")
+ counter.add("mycounter1", value1)
+ value2 = Digest::MD5.hexdigest("value#{i}incounter2")
+ counter.add("mycounter2", value2)
+ value3 = Digest::MD5.hexdigest("this is value#{i}")
+ counter.add("mycounter3", value3)
+ actual = 3 * (i + 1)
+ approximate = counter.union(["mycounter1", "mycounter2", "mycounter3"])
relative_error = (actual - approximate).abs / Float(actual)
bad_estimates += 1 if relative_error > expected_relative_error * 2
very_bad_estimates += 1 if relative_error > expected_relative_error * 3
end
- bad_estimates.should < max_items / 100.00
+ bad_estimates.should < (3 * max_items) / 100.00
very_bad_estimates.should == 0
end
- end
-
- it "produces acceptable estimates for unions with few elements in common" do
- b, max_items = 10, 2000
- counter = HyperLogLog.new(Redis.new, b)
- bad_estimates = 0
- very_bad_estimates = 0
- expected_relative_error = 1.04 / Math.sqrt(2 ** b)
- max_items.times do |i|
- value1 = Digest::MD5.hexdigest("value#{i}")
- counter.add("mycounter1", value1)
- value2 = Digest::MD5.hexdigest("value#{i}incounter2")
- counter.add("mycounter2", value2)
- value3 = Digest::MD5.hexdigest("this is value#{i}")
- counter.add("mycounter3", value3)
- actual = 3 * (i + 1)
- approximate = counter.union("mycounter1", "mycounter2", "mycounter3")
- relative_error = (actual - approximate).abs / Float(actual)
- bad_estimates += 1 if relative_error > expected_relative_error * 2
- very_bad_estimates += 1 if relative_error > expected_relative_error * 3
- end
- bad_estimates.should < (3 * max_items) / 100.00
- very_bad_estimates.should == 0
- end
-
- it "produces acceptable estimates for unions with many elements in common" do
- b, max_items, intersection_size = 10, 1000, 2000
- counter = HyperLogLog.new(Redis.new, b)
- bad_estimates = 0
- very_bad_estimates = 0
- expected_relative_error = 1.04 / Math.sqrt(2 ** b)
-
- intersection_size.times do |i|
- value = Digest::MD5.hexdigest("test#{i}value")
- ['mycounter1', 'mycounter2', 'mycounter3'].each do |counter_name|
- counter.add(counter_name, value)
+
+ it "produces acceptable estimates for unions with many elements in common" do
+ b, max_items, intersection_size = 10, 1000, 2000
+ counter = counter_type.new(Redis.new, b)
+ bad_estimates = 0
+ very_bad_estimates = 0
+ expected_relative_error = 1.04 / Math.sqrt(2 ** b)
+
+ intersection_size.times do |i|
+ value = Digest::MD5.hexdigest("test#{i}value")
+ ['mycounter1', 'mycounter2', 'mycounter3'].each do |counter_name|
+ counter.add(counter_name, value)
+ end
end
+
+ max_items.times do |i|
+ value1 = Digest::MD5.hexdigest("value#{i}")
+ counter.add("mycounter1", value1)
+ value2 = Digest::MD5.hexdigest("value#{i}isincounter2")
+ counter.add("mycounter2", value2)
+ value3 = Digest::MD5.hexdigest("this is value#{i}")
+ counter.add("mycounter3", value3)
+ actual = 3 * (i + 1) + intersection_size
+ approximate = counter.union(["mycounter1", "mycounter2", "mycounter3"])
+ relative_error = (actual - approximate).abs / Float(actual)
+ bad_estimates += 1 if relative_error > expected_relative_error * 2
+ very_bad_estimates += 1 if relative_error > expected_relative_error * 3
+ end
+
+ bad_estimates.should < ((3 * max_items) + intersection_size) / 100.00
+ very_bad_estimates.should == 0
end
- max_items.times do |i|
- value1 = Digest::MD5.hexdigest("value#{i}")
- counter.add("mycounter1", value1)
- value2 = Digest::MD5.hexdigest("value#{i}isincounter2")
- counter.add("mycounter2", value2)
- value3 = Digest::MD5.hexdigest("this is value#{i}")
- counter.add("mycounter3", value3)
- actual = 3 * (i + 1) + intersection_size
- approximate = counter.union("mycounter1", "mycounter2", "mycounter3")
- relative_error = (actual - approximate).abs / Float(actual)
- bad_estimates += 1 if relative_error > expected_relative_error * 2
- very_bad_estimates += 1 if relative_error > expected_relative_error * 3
- end
-
- bad_estimates.should < ((3 * max_items) + intersection_size) / 100.00
- very_bad_estimates.should == 0
- end
-
- # There are no good theoretical guarantees that I know of for arbitrary
- # intersection estimation, since it's expressed as the sum of unions of
- # HyperLogLog counters, but it tends to work okay in practice, as seen below.
-
- it "produces decent estimates for intersections" do
- b, max_items = 6, 1000
- counter = HyperLogLog.new(Redis.new, b)
- expected_relative_error = 1.04 / Math.sqrt(2 ** b)
-
- max_items.times do |i|
- value1 = Digest::MD5.hexdigest("first-value#{i}")
- value2 = Digest::MD5.hexdigest("second-value#{i}")
- value3 = Digest::MD5.hexdigest("third-value#{i}")
- value4 = Digest::MD5.hexdigest("fourth-value#{i}")
- counter.add("mycounter1", value1)
- counter.add("mycounter2", value2)
- counter.add("mycounter3", value3)
- counter.add("mycounter4", value4)
- [value1, value2, value3, value4].each{ |value| counter.add("mycounter5", value) }
- end
-
- small_counters = ['mycounter1', 'mycounter2', 'mycounter3', 'mycounter4']
+ # There are no good theoretical guarantees that I know of for arbitrary
+ # intersection estimation, since it's expressed as the sum of unions of
+ # HyperLogLog counters, but it tends to work okay in practice, as seen below.
- small_counters.each do |counter_name|
- intersection_estimate = counter.intersection(counter_name, 'mycounter5')
- intersection_estimate.should > 0
- (intersection_estimate - counter.count(counter_name)).abs.should < max_items * expected_relative_error
- end
-
- [2,3].each do |intersection_size|
- small_counters.combination(intersection_size).each do |counter_names|
- intersection_estimate = counter.intersection(*counter_names)
- intersection_estimate.should >= 0
- intersection_estimate.should < intersection_size * max_items * expected_relative_error
+ it "produces decent estimates for intersections" do
+ b, max_items = 6, 1000
+ counter = counter_type.new(Redis.new, b)
+ expected_relative_error = 1.04 / Math.sqrt(2 ** b)
+
+ max_items.times do |i|
+ value1 = Digest::MD5.hexdigest("first-value#{i}")
+ value2 = Digest::MD5.hexdigest("second-value#{i}")
+ value3 = Digest::MD5.hexdigest("third-value#{i}")
+ value4 = Digest::MD5.hexdigest("fourth-value#{i}")
+ counter.add("mycounter1", value1)
+ counter.add("mycounter2", value2)
+ counter.add("mycounter3", value3)
+ counter.add("mycounter4", value4)
+ [value1, value2, value3, value4].each{ |value| counter.add("mycounter5", value) }
end
- end
-
- 100.times do |i|
- value = Digest::MD5.hexdigest("somethingintheintersection#{i}")
- small_counters.each { |counter_name| counter.add(counter_name, value) }
- end
-
- [2,3,4].each do |intersection_size|
- small_counters.combination(intersection_size).each do |counter_names|
- intersection_estimate = counter.intersection(*counter_names)
- intersection_estimate.should >= 0
- (intersection_estimate - 100).abs.should < intersection_size * (max_items + 100) * expected_relative_error
+
+ small_counters = ['mycounter1', 'mycounter2', 'mycounter3', 'mycounter4']
+
+ small_counters.each do |counter_name|
+ intersection_estimate = counter.intersection([counter_name, 'mycounter5'])
+ intersection_estimate.should > 0
+ (intersection_estimate - counter.count(counter_name)).abs.should < max_items * expected_relative_error
+ end
+
+ [2,3].each do |intersection_size|
+ small_counters.combination(intersection_size).each do |counter_names|
+ intersection_estimate = counter.intersection(counter_names)
+ intersection_estimate.should >= 0
+ intersection_estimate.should < intersection_size * max_items * expected_relative_error
+ end
end
+
+ 100.times do |i|
+ value = Digest::MD5.hexdigest("somethingintheintersection#{i}")
+ small_counters.each { |counter_name| counter.add(counter_name, value) }
+ end
+
+ [2,3,4].each do |intersection_size|
+ small_counters.combination(intersection_size).each do |counter_names|
+ intersection_estimate = counter.intersection(counter_names)
+ intersection_estimate.should >= 0
+ (intersection_estimate - 100).abs.should < intersection_size * (max_items + 100) * expected_relative_error
+ end
+ end
+
end
-
end
-
end
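The spec comments above note that intersection estimates are "expressed as the sum of unions of HyperLogLog counters," i.e. via inclusion-exclusion. With exact sets the identity behind that estimator holds without error; the sketch below (plain Ruby with `Set`, not the gem's implementation) shows the two- and three-set forms that `counter.intersection` approximates. With HLL counters, each union term carries its own relative error, which is why the spec only checks for "decent" estimates:

```ruby
require 'set'

# Recover an intersection size using only cardinalities and unions,
# the operations a HyperLogLog counter actually supports.
def intersection_via_unions(a, b, c = nil)
  if c.nil?
    a.size + b.size - (a | b).size
  else
    a.size + b.size + c.size -
      (a | b).size - (a | c).size - (b | c).size +
      (a | b | c).size
  end
end

a = Set.new(0...100)
b = Set.new(50...150)
c = Set.new(75...175)

intersection_via_unions(a, b)    # 50, matches (a & b).size
intersection_via_unions(a, b, c) # 25, matches (a & b & c).size
```

With estimated cardinalities the error of each term accumulates, and for three or more counters the alternating sum can even go slightly negative, hence the spec's loose `should >= 0` and error-window assertions.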
View
216 spec/time_series_counter_spec.rb
@@ -0,0 +1,216 @@
+require 'securerandom'
+require 'timecop'
+require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
+
+MINUTES=60
+HOURS=MINUTES*60
+DAYS=HOURS*24
+WEEKS=DAYS*7
+
+describe HyperLogLog::TimeSeriesCounter do
+
+ before(:each) do
+ @b = 11
+ @redis = Redis.new
+ @counter = HyperLogLog::TimeSeriesCounter.new(@redis, @b)
+ @expected_relative_error = 1.04 / Math.sqrt(2 ** @b)
+
+ def counter_should_equal(counter_val, expected_val, relative_error_base=nil)
+ (counter_val - expected_val).abs.should <= (relative_error_base || expected_val) * @expected_relative_error
+ end
+ end
+
+ it "can estimate cardinalities from any particular point in time until the present" do
+ Timecop.travel(Time.now - 2 * WEEKS) do
+ (0..100).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 1 * WEEKS) do
+ (100..200).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 6 * DAYS) do
+ (0..100).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 5 * DAYS) do
+ (100..200).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 4 * DAYS) do
+ (200..250).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+
+ counter_should_equal(@counter.count('mycounter'), 250)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 3 * WEEKS), 250)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * WEEKS - 3 * DAYS), 250)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * WEEKS), 250)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 5 * DAYS - 12 * HOURS), 150, 250)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 4 * DAYS - 12 * HOURS), 50, 250)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 3 * DAYS), 0, 250)
+ end
+
+ it "can estimate unions from any particular point in time until the present" do
+ Timecop.travel(Time.now - 2 * WEEKS) do
+ (0..100).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 1 * WEEKS) do
+ (100..200).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 6 * DAYS) do
+ (0..100).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 5 * DAYS) do
+ (100..200).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 4 * DAYS) do
+ (200..250).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2']), 250)
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2'], Time.now.to_i - 3 * WEEKS), 250)
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2'], Time.now.to_i - 1 * WEEKS - 3 * DAYS), 250)
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2'], Time.now.to_i - 1 * WEEKS), 250)
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2'], Time.now.to_i - 5 * DAYS - 12 * HOURS), 150, 250)
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2'], Time.now.to_i - 4 * DAYS - 12 * HOURS), 50, 250)
+ counter_should_equal(@counter.union(['mycounter1', 'mycounter2'], Time.now.to_i - 3 * DAYS), 0, 250)
+ end
+
+ it "can estimate intersections from any particular point in time until the present" do
+ Timecop.travel(Time.now - 2 * WEEKS) do
+ (0..100).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 1 * WEEKS) do
+ (100..200).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 6 * DAYS) do
+ (0..100).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 5 * DAYS) do
+ (100..200).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 4 * DAYS) do
+ (200..250).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 3 * DAYS) do
+ (200..250).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2']), 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 3 * WEEKS), 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 1 * WEEKS - 3 * DAYS), 150, 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 6 * DAYS - 12 * HOURS), 50, 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 5 * DAYS - 12 * HOURS), 50, 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 4 * DAYS - 12 * HOURS), 50, 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 3 * DAYS - 12 * HOURS), 0, 250)
+ counter_should_equal(@counter.intersection(['mycounter1', 'mycounter2'], Time.now.to_i - 2 * DAYS), 0, 250)
+ end
+
+ it "can use union_store to store snapshots of counters at particular points in time" do
+ Timecop.travel(Time.now - 2 * WEEKS) do
+ (0..100).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 1 * WEEKS) do
+ (100..200).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 6 * DAYS) do
+ (0..100).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 5 * DAYS) do
+ (100..200).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 4 * DAYS) do
+ (200..250).each { |i| @counter.add('mycounter1', "item#{i}") }
+ end
+ Timecop.travel(Time.now - 3 * DAYS) do
+ (200..250).each { |i| @counter.add('mycounter2', "item#{i}") }
+ end
+
+ @counter.union_store('counter1_1_week_ago', ['mycounter1'], Time.now.to_i - 1 * WEEKS)
+ @counter.union_store('counter2_5_days_ago', ['mycounter2'], Time.now.to_i - 5 * DAYS)
+ counter_should_equal(@counter.union(['counter1_1_week_ago', 'counter2_5_days_ago']), 150, 250)
+ end
+
+ it "allows you to override the time an event is registered when it's added" do
+ (0..1000).each { |i| @counter.add('mycounter', "item#{i}", Time.now.to_i - 3 * WEEKS) }
+ (1000..2000).each { |i| @counter.add('mycounter', "item#{i}", Time.now.to_i - 2 * WEEKS) }
+ (2000..3000).each { |i| @counter.add('mycounter', "item#{i}", Time.now.to_i - 1 * WEEKS) }
+ (3000..4000).each { |i| @counter.add('mycounter', "item#{i}") }
+
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 4 * WEEKS), 4000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 2 * WEEKS - 3 * DAYS), 3000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * WEEKS - 3 * DAYS), 2000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 3 * DAYS), 1000)
+ end
+
+ it "doesn't screw up more recent counts when items are injected with earlier timestamp overrides" do
+ Timecop.travel(Time.now - 3 * WEEKS) do
+ (0..1000).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+
+ Timecop.travel(Time.now - 2 * WEEKS) do
+ (1000..2000).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+
+ Timecop.travel(Time.now - 1 * WEEKS) do
+ (2000..3000).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+
+ Timecop.travel(Time.now - 2 * DAYS) do
+ (1000..2000).each { |i| @counter.add('mycounter', "item#{i}") }
+ end
+
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 4 * WEEKS), 3000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 2 * WEEKS - 3 * DAYS), 2000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * WEEKS - 3 * DAYS), 2000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 3 * DAYS), 1000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * DAYS), 0)
+
+ # Shouldn't change counts, since they're updates to counts that happen later
+ # than the time we're trying to inject
+ (1000..2000).each { |i| @counter.add('mycounter', "item#{i}", Time.now.to_i - 1 * WEEKS) }
+
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 4 * WEEKS), 3000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 2 * WEEKS - 3 * DAYS), 2000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * WEEKS - 3 * DAYS), 2000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 3 * DAYS), 1000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * DAYS), 0)
+
+ # Should change counts, since they're updates to counts for items we've never
+ # seen before in the past
+ (3000..4000).each { |i| @counter.add('mycounter', "item#{i}", Time.now.to_i - 1 * WEEKS) }
+
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 4 * WEEKS), 4000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 2 * WEEKS - 3 * DAYS), 3000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * WEEKS - 3 * DAYS), 3000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 3 * DAYS), 1000)
+ counter_should_equal(@counter.count('mycounter', Time.now.to_i - 1 * DAYS), 0)
+ end
+
+ it "can compute deltas over time on events correctly" do
+ # A larger-scale test that simulates user join events and tests that we can get
+ # week-by-week deltas. Generate new user counts according to the following
+ # weekly schedule: 55780 during the first week, 300 more during the next week,
+ # 10 more the next week, etc.
+
+ schedule = [55780, 300, 10, 4000, 1000, 1000, 5000, 15000, 30000, 3000]
+ schedule.each_with_index do |num_users, i|
+ Timecop.travel(Time.now - (schedule.length * WEEKS) + (i * WEEKS)) do
+ num_users.times do |i|
+ Timecop.travel(Time.now + 2 * HOURS + i) do
+ @counter.add("users", "user#{SecureRandom.uuid}")
+ end
+ end
+ end
+ end
+
+ actual_total = schedule.reduce(:+)
+ estimated_total = @counter.count("users")
+ (actual_total - estimated_total).abs.should < @expected_relative_error * actual_total
+
+ # Go through the schedule, computing week-by-week deltas and comparing them to the
+ # scheduled additions.
+
+ schedule.each_with_index do |users_joined, i|
+ week = schedule.length - 1 - i
+ c = @counter.count('users', Time.now.to_i - (week+1) * WEEKS) - @counter.count('users', Time.now.to_i - week * WEEKS)
+ (users_joined - c).abs.should < @expected_relative_error * schedule.reduce(:+)
+ end
+ end
+end
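The specs above all lean on `TimeSeriesCounter`'s central idea: each item is registered with a timestamp, a count "since t" includes only items first seen at or after t, and the difference between two since-counts recovers a per-window total. Here is a tiny exact analogue of that behavior (plain Ruby, no Redis, no estimation; the class and method names are illustrative, not part of the gem's API):

```ruby
# Record the earliest time each distinct item was seen; "count since t"
# is then the number of items whose first sighting is at or after t.
class ExactTimeSeries
  def initialize
    @first_seen = {}
  end

  def add(item, time)
    seen = @first_seen[item]
    @first_seen[item] = time if seen.nil? || time < seen
  end

  def count(since = 0)
    @first_seen.count { |_, t| t >= since }
  end
end

ts = ExactTimeSeries.new
(0...100).each   { |i| ts.add("item#{i}", 10) }
(100...250).each { |i| ts.add("item#{i}", 20) }

ts.count                     # 250 distinct items overall
ts.count(15)                 # 150 items first seen at or after t = 15
ts.count(15) - ts.count(25)  # 150 items arrived in the window [15, 25)

ts.add("item0", 99)          # re-adding later keeps the earliest sighting,
ts.count(15)                 # so this is still 150
```

The last two lines mirror the "doesn't screw up more recent counts" spec: re-registering an already-seen item never moves its first-seen time later, so counts for past windows stay stable.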
