Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exclude_lines and include_lines options #430

Merged
merged 1 commit into from
Dec 7, 2015

Conversation

monicasarbu
Copy link
Contributor

  • exclude_lines option accepts a list of regular expressions to match. If any line matches at least one of the regular
    expressions, the line is dropped and not exported
  • include_lines option accepts a list of regular expressions to match. If any line matches at least one of the regular
    expressions, the line is exported

Example:

  • include_lines: ["^ERR", "^WARN"]

Results in exporting only the lines that start with "ERR" or "WARN".
Note: First executes the include_lines and then exclude_lines.

This fixes the issue: https://github.com/elastic/filebeat/issues/78

@monicasarbu monicasarbu added the Filebeat Filebeat label Dec 2, 2015
@urso
Copy link

urso commented Dec 2, 2015

I'd opt for 'AND' combining the regular expressions, not 'OR' combining. Or is already possible from within the regular expression. Just use '|' operator (see my comment on original issue).

Maybe we should consider using 'CompilePOSIX'. These regexes are(should be) compatible to egrep. So I can test my filters with 'cat <infile> | egrep ... | egrep ... | egrep -v ... | egrep -v ...'. This is similar to AND-combining the regular expressions.

@ruflin
Copy link
Member

ruflin commented Dec 2, 2015

@urso I would disagree here. I think the general use case is OR here if a list of regexp is provided. See the example from @monicasarbu

@monicasarbu
Copy link
Contributor Author

@urso It's a good point to CompilePOSIX instead of Compile. We can also advice our users to test the regexp before using egrep. I didn't think about this before. Thanks for letting me know.

I think it makes more sense OR than AND. Imagine you configure: include_lines: ["expr1", "expr2"]. I would say each expr1 and expr2 should define a group of lines to be included.

It's true that you can use "|" inside the regexp to do OR, but I am not a fan of it as complicates the regexp. I think the regexp should be simpler as possible because you can easily have a mistake in it.

@ruflin
Copy link
Member

ruflin commented Dec 2, 2015

Implementation LGTM. It would be nice to add some system tests but we can still do that in a second step.

@urso
Copy link

urso commented Dec 2, 2015

Well, thinking about chains of egrep.

I think for exclude_lines OR is most similar to chains of 'egrep -v'. So I'm fine with exclude_lines using OR.

Still not convinced of using OR for include_lines (as chains of egrep behave like AND). But I can live with either semantics.

if len(h.ExcludeLinesRegexp) > 0 {
if MatchRegExps(h.ExcludeLinesRegexp, text) {
// drop line
continue
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using continue here is righteous dangerous. When line is dropped, file offset is broken + backoff algorithm is not properly reset.

Maybe move generation of FileEvent and forwarding into it's own function. The check for generating/sending the event will be made in this extra function.

@monicasarbu monicasarbu force-pushed the exclude_include_lines branch 3 times, most recently from a5bf779 to 006d8d9 Compare December 2, 2015 22:39
@ruflin
Copy link
Member

ruflin commented Dec 3, 2015

One potential solution to not have to modify the harvester would be to move the functionality to the spooler. The advantage of this would be that excluding / including would apply to all harvester types (at the moment there are only 2 and not well separated). The disadvantage is that the data would be "discarded" on step later.

@@ -47,6 +47,8 @@ type ProspectorConfig struct {
IgnoreOlderDuration time.Duration
ScanFrequency string `yaml:"scan_frequency"`
ScanFrequencyDuration time.Duration
ExcludeLines []string `yaml:"exclude_lines"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably belongs into the HarvesterConfig. Same for next line.

@ruflin
Copy link
Member

ruflin commented Dec 3, 2015

Think about the above again, moving it to the spooler will not be an option, as the config is harvester specific.

filebeat = self.start_filebeat()

self.wait_until(
lambda: self.log_contains(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More recently we started to prefer using conditions like self.output_has(lines=40). Seems to result in less flaky tests.

@tsg
Copy link
Contributor

tsg commented Dec 3, 2015

The current implementation looks clean enough to me. @monicasarbu may I ask for a system test that uses both the include and exclude options? Whatever behavior we have in that case, we have to make sure it doesn't change in the future.

event.SetFieldsUnderRoot(h.Config.FieldsUnderRoot)
h.SpoolerChan <- event // ship the new event downstream
// Eat line
h.Offset += int64(bytesRead) // Update offset if complete line has been processed
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line appears in both branches. we can move it out after the if-else and remove else branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I had the impression that h.Offset is used in SetFieldsUnderRoot. I'll fix it. Thanks!

@urso
Copy link

urso commented Dec 3, 2015

All in all LGTM

@monicasarbu monicasarbu force-pushed the exclude_include_lines branch 2 times, most recently from 05ef634 to ce4b039 Compare December 3, 2015 13:07
@monicasarbu
Copy link
Contributor Author

Done with all the comments. Thank you everyone!

@@ -30,6 +30,14 @@ filebeat:
# * stdin: Reads the standard in
input_type: log

# Exclude lines. A list of regular expressions to match. It drops the lines that are
# matching any regular expression from the list. By default, no lines are dropped.
# exclude_lines: ["^DBG"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One tiny detail: We always have not space after # for the actual config. This make it possible to differentiate it from the comment and make uncommenting easier. Same below.

@ruflin
Copy link
Member

ruflin commented Dec 3, 2015

@monicasarbu I left some additional comments.

- exclude_lines option accepts a list of regular expressions to match. It drops all the lines that are matching any of
 the regular expressions.
- include_lines option accepts a list of regular expressions to match. It exports only the lines that are matching any
  of the regular expressions.

Example:

 - include_lines: ["^ERR", "^WARN"]

Results in exporting only the lines that start with "ERR" or "WARN".
If both options are present, the include_lines is executed first.
@ruflin
Copy link
Member

ruflin commented Dec 5, 2015

@monicasarbu LGTM. Can you rebase again before merging? There seems to be a conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants