Duplicate detection #86

Closed
wants to merge 8 commits

2 participants

@ferd

Optional duplicate and near-duplicate detection and merging of log messages,
based on simhashing.

Tests are included; the code is up for review.

Documentation is missing, but will be added on demand.
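
For reviewers unfamiliar with simhashing, here is a minimal, self-contained sketch of the general idea; the module and function names are hypothetical and this is not the simhash library's actual API. Each shingle of a message votes on every bit of a fixed-width fingerprint, so near-identical messages end up with fingerprints that differ in only a few bit positions:

%% Hypothetical sketch of simhashing, not the simhash library's API.
%% Each shingle's MD5 hash votes +1/-1 on every bit position; the
%% sign of the tally becomes that bit of the fingerprint.
-module(simhash_sketch).
-export([hash/1, hamming/2]).

-define(BITS, 128).

hash(Msg) when is_binary(Msg) ->
    Votes = lists:foldl(fun vote/2,
                        lists:duplicate(?BITS, 0),
                        shingles(Msg, 2)),
    << <<(sign(V)):1>> || V <- Votes >>.

%% Overlapping 2-byte shingles of the formatted message.
shingles(Bin, N) when byte_size(Bin) >= N ->
    [binary:part(Bin, I, N) || I <- lists:seq(0, byte_size(Bin) - N)];
shingles(Bin, _N) ->
    [Bin].

vote(Shingle, Tally) ->
    Bits = [B || <<B:1>> <= crypto:hash(md5, Shingle)],
    lists:zipwith(fun(B, V) -> V + 2 * B - 1 end, Bits, Tally).

sign(V) when V > 0 -> 1;
sign(_) -> 0.

%% Near-duplicates are fingerprints within a small Hamming distance.
hamming(<<A:?BITS>>, <<B:?BITS>>) ->
    popcount(A bxor B, 0).

popcount(0, Acc) -> Acc;
popcount(X, Acc) -> popcount(X band (X - 1), Acc + 1).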

ferd added some commits
@ferd ferd Initial addition to log deduplication experiment
This commit shouldn't be used except in experimental
settings, or until it is supplanted by further fixes.

This adds log deduplication based on a near-duplicate
analysis of formatted messages (without node or timestamp)
over a given period of time. The objective is to allow
long chains of repeated events in a short burst of time
to be reduced to smaller quantities of text transmitted
and buffered to the error logger. This should make the logger
more resilient to overload situations, and take up less
space during long cascades of failures.

TODO:
- have more tests
- refactor code to avoid duplication between lager.erl and
  lager_dedup.erl
- figure out if the new application dependencies are a
  problem for users of lager.
- document new features.
7405033
@ferd ferd Moving to an ETS-based implementation of deduper
This allows much better performances than a dict-based version.
Also note that most of the improvements are also available due
to moving to a 'first match' search for duplicates (below
the given treshold) rather than a 'best match' option, in order
to possibly reduce the maximum number of match attempts to be made
when logs are wildly different.

It's still not known if this is the best approach, but it did
allow to survive overload situations much better.

Also using ETS gives way to try some interesting optimizations
by making the table protected later on (i.e. do not send
an error message if 1000 are already in the table).
1240b68
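
For illustration, a minimal sketch of what such a first-match scan could look like, assuming hypothetical {Key, SimHash, Count} entries and the hamming/2 helper from the sketch above; this is not the actual lager_deduper code:

%% Hypothetical first-match scan over the dedup table: return at the
%% first stored hash within the threshold rather than scanning the
%% whole table for the closest match.
find_duplicate(Tab, Hash, Threshold) ->
    find_duplicate(Tab, Hash, Threshold, ets:first(Tab)).

find_duplicate(_Tab, _Hash, _Threshold, '$end_of_table') ->
    %% No stored hash was close enough; treat as a new message.
    undefined;
find_duplicate(Tab, Hash, Threshold, Key) ->
    [{Key, Stored, _Count}] = ets:lookup(Tab, Key),
    case hamming(Hash, Stored) =< Threshold of
        true  -> {ok, Key};  % first match wins; stop scanning
        false -> find_duplicate(Tab, Hash, Threshold, ets:next(Tab, Key))
    end.
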
@ferd ferd Improving shingling/weights for deduping
Now able to process many more messages per second with just as
much significance in the simhashing.

This allows us to move away from phash2 (32 bits) to MD5
(128 bits) as the hash algorithm. While it copies a bit more data
for each communication, it allows much more accuracy in duplicate
detection.
a950d24
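
Concretely, a hypothetical fragment just to show the widths involved:

%% erlang:phash2/2 caps out at 32 bits, while MD5 yields a 128-bit
%% fingerprint, giving finer-grained Hamming distances to vote over.
hash_widths(Shingle) ->
    Narrow = erlang:phash2(Shingle, 1 bsl 32), % integer in 0..2^32-1
    Wide = crypto:hash(md5, Shingle),          % 16-byte (128-bit) binary
    {Narrow, bit_size(Wide)}.                  % second element is 128
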
@ferd ferd Adding limits to deduplication
This adds an optional pre-read in ETS to find out whether the
log message currently being deduplicated has been seen more than
a given number of times (dictated by the 'duplicate_limit'
env variable). If it has been seen too often, the entire
negotiation between lager_deduper and its children is skipped to
avoid undesirable message sending that may take a lot of space
or processing time.

This measure is best-effort -- since ETS has no locks or anything
of that kind, it is possible to get more matched messages than the
limit states.
5049634
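
A hypothetical sketch of what that best-effort pre-read amounts to (names assumed, not the actual code):

%% A value of 0 or 'undefined' disables the limit. The read here and
%% the later write are not atomic, so the limit can be overshot.
over_limit(_Tab, _Key, undefined) -> false;
over_limit(_Tab, _Key, 0)         -> false;
over_limit(Tab, Key, Limit) ->
    case ets:lookup(Tab, Key) of
        [{Key, _Hash, Count}] when Count >= Limit -> true;
        _                                         -> false
    end.
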
@ferd ferd Cleaning up a bit, adding tests.
Also upgrading the simhash dependency. The tests are a
deduper adaptation of lager_test_backend, and they all pass
similarly, but while handling things with a very low threshold.
de63108
@ferd ferd Fixing embarrassing typos b00dc78
@ferd ferd Adding an option for quick deduplicated delivery
When logging deduplicated messages, the error handlers only
learn about the existence of an error message after it has
happened. If the time value is set high, this can create
latency in fault detection.

This commit adds an option so that the first message of a group
(one seen as too different to be merged in with another message) is
instantly sent to the error logger, while the following ones get
accumulated.

This effectively changes the behaviour from receiving:
  "error X (N times)"
to:
  "error X" + "error X (N-1 times)"
fda26db
@ferd ferd Fixing a race condition.
During system blockage, the default gen_server:call timeout of
5 seconds could fire before an answer arrived, leaving an
incomplete deduplication session between lager_deduper and the
error reporter. This would in turn create dead entries that would
never get dumped in the future, effectively deleting a whole type
of error message from the logs.

By setting the timeout to infinity, we tolerate overload
situations better (which is when the deduper is useful) and
fix the issue entirely.
3231aae
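
The shape of the fix, as a hypothetical fragment:

%% gen_server:call/2 defaults to a 5000 ms timeout; under heavy load
%% the deduper may legitimately take longer, so the caller now waits
%% indefinitely instead of abandoning a half-finished session.
ask_deduper(Deduper, Request) ->
    gen_server:call(Deduper, Request, infinity).
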
@Vagabond Vagabond commented on the diff
src/lager.app.src
@@ -36,6 +37,24 @@
%% current one - default is 0
{crash_log_count, 5},
%% Whether to redirect error_logger messages into lager - defaults to true
- {error_logger_redirect, true}
+ {error_logger_redirect, true},
+ %% How sensitive lager should be to near-duplicates. 0 means no
+ %% special handling is done
@Vagabond Collaborator

An example of what a non-zero value does here would be nice.

@ferd
ferd added a note

That's definitely stuff to document. The value is entirely arbitrary and needs to be experimented with. The thing is, simhashing calculates the distance between hashes as the Hamming distance between the values. As such, the intermediary hashing algorithm and how many bits it has (here we use MD5) will impact what distance is significant and what distance isn't.

I'm thinking this is purely a documentation issue, and showing some examples may be nice here, rather than shoving it in the config file.
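
As a hypothetical example of what such documentation could show (the value 10 is arbitrary and would need tuning, as noted):

%% sys.config: with 128-bit MD5-based simhashes, a threshold of 10
%% merges any two messages whose fingerprints differ in at most 10
%% of 128 bit positions; 0 keeps deduplication disabled entirely.
[{lager, [
    {duplicate_threshold, 10}
]}].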

@Vagabond Vagabond commented on the diff
src/lager.app.src
((8 lines not shown))
+ {duplicate_threshold, 0},
+ %% How frequently lager should dump its duplicate entries, in milliseconds.
+ %% only used if duplicate_threshold > 0
+ {duplicate_dump, 1000},
+ %% How many similar messages can be logged per dump. If 150 log
+ %% messages are identical but this limit is set to 5, the log handler
+ %% will only receive 5 of them and show (5 times+) instead of (N times).
+ %% a value of 0 or 'undefined' disables limits.
+ {duplicate_limit, undefined},
+ %% Duplicate detection, by default, will eat up all the messages in a given
+ %% time period before dumping them. If the waiting time is set high (say a
+ %% minute or so), it means reaction time is much higher. Setting
+ %% 'duplicate_quick_notification' to 'true' makes the first message of a
+ %% given error go through, while accumulating the later ones to make sure
+ %% you get both quick reporting and deduplication of messages.
+ {duplicate_quick_notification, false}
@Vagabond Collaborator

I believe syslog's default behaviour is to print the first message and then suppress all duplicates until a different event comes in, at which point it'll print "last message duplicated N times", is this what setting this to 'true' does?

@ferd
ferd added a note

In this case, what it does is buffer all the events for a given period of time. It doesn't dump as soon as something new happens, but only at the end of the time period. So if I get the messages 'a a b a b c a a b c' in that order during a single time slice, I should see the result being 'a b c a(4) b(2) c' or something equivalent. The first 3 'a b c' messages are printed as they come, the rest is accumulated.

The objective was that if you have a fast sequence of 'a b a b a b a b a a b a b b' or something similar, you don't end up being spammed again, which could happen during cascading failures. Instead we'd be getting 'a b a(6) b(6)'.

It is definitely possible to make it mimic syslog's behaviour if you want to, though.

@Vagabond Vagabond commented on the diff
src/lager_app.erl
@@ -52,6 +52,27 @@ start(_StartType, _StartArgs) ->
{_, Traces} = lager_mochiglobal:get(loglevel),
lager_mochiglobal:put(loglevel, {MinLog, Traces}),
+ Threshold = case application:get_env(lager, duplicate_threshold) of
+ {ok, ThresV} -> ThresV;
@Vagabond Collaborator

Given the guard in safe_notify above to make sure the threshold is >= 0, wouldn't it make sense to guard on that here too, so we don't case_clause if it was set to a negative value?

@ferd
ferd added a note

Noted, will do. Do you want it to crash or silently change it to 0?
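
Either behaviour would be a small change to the case shown in the diff above; a hypothetical sketch of both options:

%% Option 1: crash loudly on a bad setting.
Threshold = case application:get_env(lager, duplicate_threshold) of
    {ok, T} when is_integer(T), T >= 0 -> T;
    {ok, Bad} -> erlang:error({invalid_duplicate_threshold, Bad});
    undefined -> 0
end,
%% Option 2: silently clamp negatives to 0 instead:
%%   {ok, T} when is_integer(T) -> max(T, 0);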

@Vagabond
Collaborator

As per IRC, given the dependency issue and the penalties a logger pays for having deps, we decided to include simhash as part of the lager application.

@Vagabond
Collaborator

Also, it looks like this branch could use a merge/rebase against master; this branch was forked before the 'metadata sent to backend' merge, which changed a lot of stuff around how events are sent to the backend (and the accessibility of metadata in events).

I actually think having specific metadata accessible will make things easier as you'll be able to weight different 'shingles' more intelligently. Unfortunately it also means writing more code...

@Vagabond Vagabond commented on the diff
src/lager_deduper.erl
((212 lines not shown))
+
+
+-ifdef(TEST).
+setup(Threshold, Timer) ->
+ lager_mochiglobal:put(duplicate_threshold, Threshold),
+ lager_mochiglobal:put(duplicate_dump, Timer),
+ application:start(simhash),
+ {ok, DedupPid} = start_link(),
+ unlink(DedupPid),
+ DedupPid.
+
+cleanup(Pid) ->
+ exit(Pid, shutdown).
+
+low_threshold_test_() ->
+ {"Check that with low threshold, all messages are handled individually",
@Vagabond Collaborator

This test hangs forever for me with no output...

@ferd

Rebasing proved impossible for a few reasons, including but not limited to how much the internal logging message format changed.

Pending a major rewrite of my fork, there's no way to merge this work into lager as is.

@ferd ferd closed this
@Vagabond
Collaborator

Do you want to do the rebase, or would you like me to try it?

@Vagabond
Collaborator

Or rather, the rebase and the rewrite.

@ferd

I'm rewriting it from scratch. A rebase would have been way too messy. Right now I'm redoing all the tests so that they work with the new metadata stuff. I'll send a new pull req when ready.

@Vagabond
Collaborator

Ok, thanks. Sorry for the headache.
