Duplicate detection #86

wants to merge 8 commits

2 participants

ferd commented Nov 14, 2012

Optional duplicate and near-duplicate detection and merging of log messages,
based on simhashing.

Tests are included, code is up for review.

Documentation is missing, but will be added on demand.

ferd added some commits Oct 12, 2012
@ferd ferd Initial addition to log deduplication experiment
This commit shouldn't be used outside of experimental
settings until further fixes correct it.

This adds log deduplication based on a near-duplicate
analysis of formatted messages (without node or timestamp)
over a given period of time. The objective is to allow
long chains of repeated events in a short burst of time
to be reduced to smaller quantities of text transmitted
and buffered to the error logger. This should give more
resilience in overload situations, and consume less space
during long cascades of failures.

Left to do:

- have more tests
- refactor code to avoid duplication between lager.erl and
- figure out if the new application dependencies are a
  problem for users of lager.
- document new features.
@ferd ferd Moving to an ETS-based implementation of deduper
This allows much better performance than a dict-based version.
Also note that most of the improvement comes from moving to a
'first match' search for duplicates (below the given threshold)
rather than a 'best match' search, in order to bound the number
of match attempts made when logs are wildly different.

It's still not known whether this is the best approach, but it did
make surviving overload situations much easier.

Using ETS also opens the way to some interesting optimizations
later on, by making the table protected (e.g. do not send
an error message if 1000 are already in the table).
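The 'first match' search described in this commit can be sketched as follows. This is an illustrative Python model, not the Erlang/ETS implementation; `first_match`, `hamming`, and the toy integer hashes are made up for the example:

```python
def hamming(a, b):
    """Hamming distance between two equal-width integer hashes."""
    return bin(a ^ b).count("1")

def first_match(candidates, simhash, threshold):
    """Return the key of the FIRST stored hash within `threshold` bits,
    or None. Unlike a 'best match' scan, this stops at the first
    acceptable candidate, which bounds the number of comparisons
    when incoming logs are wildly different from one another."""
    for key, stored in candidates:
        if hamming(stored, simhash) <= threshold:
            return key
    return None

table = [(1, 0b1010), (2, 0b1011)]    # stand-in for the ETS table
print(first_match(table, 0b1001, 1))  # 0b1011 is 1 bit away -> 2
print(first_match(table, 0b0101, 1))  # nothing close enough -> None
```

A best-match scan would always walk the whole table; first-match degrades to that only in the worst case.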
@ferd ferd Improving shingling/weights for deduping
Now able to process a lot more messages per second while keeping
simhash significance just as good.

This allows us to move away from phash2 as a hash algorithm (32
bits) to MD5 (128 bits). While it copies a bit more data for each
communication, it allows much more accuracy in duplicate detection.
@ferd ferd Adding limits to deduplication
This adds an optional pre-read in ETS to find out whether
the currently deduplicated log message has been seen more than
a given number of times (dictated by the 'duplicate_limit'
env variable). If it has been seen too often, the entire
negotiation between lager_deduper and its children is skipped to
avoid undesirable message sending that may take a lot of space
or processing time.

This measure is best effort -- since ETS has no locks or anything
of that kind, it is possible to get more matched messages than the
limit states.
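The pre-read limit can be modeled like this. A hedged Python sketch: a plain dict stands in for the ETS table, and `should_forward` is a made-up name for the check described above, not the actual lager API:

```python
def should_forward(seen, key, limit):
    """Pre-read the counter table: once a message key has been seen
    `limit` times, skip the dedup negotiation entirely. A limit of
    0 or None ('undefined') disables the check. Best effort only:
    with no locking, concurrent writers can overshoot the limit."""
    if not limit:
        return True
    return seen.get(key, 0) < limit

seen = {}                    # stand-in for the ETS table
delivered = 0
for _ in range(150):         # 150 identical messages, limit of 5
    if should_forward(seen, "error X", 5):
        seen["error X"] = seen.get("error X", 0) + 1
        delivered += 1
print(delivered)             # only 5 reach the handler
```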
@ferd ferd Cleaning up a bit, adding tests.
Also upgrading simhash dependencies. The tests are a
deduper adaptation of lager_test_backend, and they all pass
similarly, but with a very low threshold.
@ferd ferd Fixing embarrassing typos b00dc78
@ferd ferd Adding an option for quick deduplicated delivery
When logging deduplicated messages, the error handlers only
learn about the existence of an error message after it has
happened. In some cases, the time value could be set high, and
this would create latency in fault detection.

This commit adds an option so that the first message of a group
(seen as too different to be merged in with another message) is
instantly sent to the error logger, while the following ones get
accumulated as usual.

This effectively changes the behaviour from receiving:
  "error X (N times)"
to receiving:
  "error X" + "error X (N-1 times)"
@ferd ferd Fixing a race condition.
During system blockage, the default gen_server:call timeout of
5 seconds could fire before an answer was given, leaving an
incomplete deduplication session between lager_deduper and the
error reporter. This would in turn create dead entries that would
never get dumped, effectively deleting a whole class of error
messages from the logs.

By setting the timeout to infinity, we allow more tolerance to
overload situations (which is when the deduper is useful) and
fix the issue entirely.
@Vagabond Vagabond commented on the diff Nov 14, 2012
@@ -36,6 +37,24 @@
%% current one - default is 0
{crash_log_count, 5},
%% Whether to redirect error_logger messages into lager - defaults to true
- {error_logger_redirect, true}
+ {error_logger_redirect, true},
+ %% How sensitive lager should be to near-duplicates. 0 means no
+ %% special handling is done
Vagabond added a note Nov 14, 2012

An example of what a non-zero value does here would be nice.

ferd added a note Nov 14, 2012

That's definitely something to document. The value is entirely arbitrary and needs to be experimented with. The thing is, simhashing calculates the distance between hashes as the Hamming distance between the values. As such, the intermediary hashing algorithm and how many bits it has (here we use MD5) will impact which distances are significant and which are not.

I'm thinking this is purely a documentation issue, and showing some examples may be nice here, rather than shoving it in the config file.
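To illustrate why the threshold is tied to the hash width, here is a minimal Python simhash over character shingles using MD5 (128 bits). This is a hedged sketch: the shingling and weighting are simplified stand-ins for what the simhash library actually does, and the sample messages are invented:

```python
import hashlib

BITS = 128  # MD5 width; distances are Hamming distances over these bits

def simhash(text, shingle=3):
    """Minimal simhash: MD5 each character shingle, then take a
    per-bit majority vote across all shingles."""
    votes = [0] * BITS
    for i in range(max(1, len(text) - shingle + 1)):
        piece = text[i:i + shingle].encode()
        h = int.from_bytes(hashlib.md5(piece).digest(), "big")
        for bit in range(BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)

def distance(a, b):
    return bin(a ^ b).count("1")

near = distance(simhash("error: connection refused (host alpha)"),
                simhash("error: connection refused (host bravo)"))
far = distance(simhash("error: connection refused (host alpha)"),
               simhash("supervisor got unexpected exit from child process"))
# `near` stays well below `far`, so a threshold between the two
# merges the first pair while keeping unrelated messages apart
```

The meaningful range of `duplicate_threshold` is 0..128 here; with the old 32-bit phash2 the same "distance" number would have meant something very different.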

@Vagabond Vagabond commented on the diff Nov 14, 2012
+ {duplicate_threshold, 0},
+ %% How frequently lager should dump its duplicate entries, in milliseconds.
+ %% only used if duplicate_threshold > 0
+ {duplicate_dump, 1000},
+ %% How many similar messages can be logged per dump. If 150 log
+ %% messages are identical but this limit is set to 5, the log handler
+ %% will only receive 5 of them and show (5 times+) instead of (N times).
+ %% a value of 0 or 'undefined' disables limits.
+ {duplicate_limit, undefined},
+ %% Duplicate detection, by default, will eat up all the messages in a given
+ %% time period before dumping them. If the waiting time is set high (say a
+ %% minute or so), it means reaction time is much higher. Setting
+ %% 'duplicate_quick_notification' to 'true' makes the first message of a
+ %% given error go through, while accumulating the later ones to make sure
+ %% you get both quick reporting and deduplication of messages.
+ {duplicate_quick_notification, false}
Vagabond added a note Nov 14, 2012

I believe syslog's default behaviour is to print the first message and then suppress all duplicates until a different event comes in, at which point it'll print "last message duplicated N times", is this what setting this to 'true' does?

ferd added a note Nov 14, 2012

In this case, what it does is buffer all the events for a given period of time. It doesn't dump as soon as something new happens, but only at the end of the time period. So if I get the messages 'a a b a b c a a b c' in that order during a single time slice, I should see the result being 'a b c a(4) b(2) c' or something equivalent. The first 3 'a b c' messages are printed as they come, the rest is accumulated.

The objective was that if you have a fast sequence of 'a b a b a b a b a a b a b b' or something similar, you don't end up being spammed again, which could happen during cascading failures. Instead we'd be getting 'a b a(6) b(6)'.

It is definitely possible to make it mimic syslog's behaviour if you want to, though.
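The buffering behaviour described above can be modeled with exact matching (the real deduper matches near-duplicates via simhash; `dedupe` is an illustrative name, not the lager API):

```python
from collections import OrderedDict

def dedupe(stream, quick_notification=True):
    """Buffer one time slice of messages. With quick notification, the
    first occurrence of each distinct message goes out immediately;
    everything else is merged into an '(N times)' entry at dump time."""
    out, counts = [], OrderedDict()
    for msg in stream:
        if msg not in counts:
            counts[msg] = 0
            if quick_notification:
                out.append(msg)      # first of its group: send right away
                continue
        counts[msg] += 1
    for msg, n in counts.items():    # end of the time slice: dump
        if n:
            out.append(f"{msg}({n})" if n > 1 else msg)
    return out

print(dedupe("a a b a b c a a b c".split()))
# ['a', 'b', 'c', 'a(4)', 'b(2)', 'c']
print(dedupe("a a b a b c a a b c".split(), quick_notification=False))
# ['a(5)', 'b(3)', 'c(2)']
```

This reproduces both behaviours from the discussion: 'a a b a b c a a b c' gives 'a b c a(4) b(2) c' with quick notification, and 'a b a b a b a b a a b a b b' collapses to 'a b a(6) b(6)'.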

@Vagabond Vagabond commented on the diff Nov 14, 2012
@@ -52,6 +52,27 @@ start(_StartType, _StartArgs) ->
{_, Traces} = lager_mochiglobal:get(loglevel),
lager_mochiglobal:put(loglevel, {MinLog, Traces}),
+ Threshold = case application:get_env(lager, duplicate_threshold) of
+ {ok, ThresV} -> ThresV;
Vagabond added a note Nov 14, 2012

Given the guard in safe_notify above to make sure the threshold is >= 0, wouldn't it make sense to guard on that here so we don't case_clause if it was set to a negative value?

ferd added a note Nov 14, 2012

Noted, will do. Do you want it to crash or silently change it to 0?


As per IRC, given the dependency issue and the penalties a logger pays for having deps, we decided to include simhash as part of the lager application.


Also, it looks like this branch could use a merge/rebase against master; this branch was forked before the 'metadata sent to backend' merge, which changed a lot of stuff around how events are sent to the backend (and the accessibility of metadata in events).

I actually think having specific metadata accessible will make things easier as you'll be able to weight different 'shingles' more intelligently. Unfortunately it also means writing more code...

@Vagabond Vagabond commented on the diff Nov 14, 2012
+setup(Threshold, Timer) ->
+    lager_mochiglobal:put(duplicate_threshold, Threshold),
+    lager_mochiglobal:put(duplicate_dump, Timer),
+    application:start(simhash),
+    {ok, DedupPid} = start_link(),
+    unlink(DedupPid),
+    DedupPid.
+
+cleanup(Pid) ->
+    exit(Pid, shutdown).
+
+low_threshold_test_() ->
+    {"Check that with low threshold, all messages are handled individually",
Vagabond added a note Nov 14, 2012

This test hangs forever for me with no output...

ferd commented Nov 14, 2012

Rebasing proved impossible for a few reasons, including but not limited to how much the internal logging message format changed.

Pending a major rewrite of my fork, there's no way to merge this work into lager as is.

@ferd ferd closed this Nov 14, 2012

Do you want to do the rebase, or would you like me to try it?


Or rather, the rebase and the rewrite.

ferd commented Nov 14, 2012

I'm rewriting it from scratch. A rebase would have been way too messy. Right now I'm redoing all the tests so that they work with the new metadata stuff. I'll send a new pull req when ready.


Ok, thanks. Sorry for the headache.
