Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syslog inputs parses RFC3164 events via TCP or UDP #6842

Merged
merged 1 commit into from
Apr 19, 2018

Conversation

ph
Copy link
Contributor

@ph ph commented Apr 12, 2018

The Syslog inputs will use the UDP and TCP source lib, allowing the same socket behavior and the
same options as the two existing inputs. The parser is a state machine build with ragel and allow to parse FC3164 events with some less than perfect variants, if the received event is a complete
FC3164 we will extract all of them, for us the minimum valid message MUST have the date and the message defined.

Anything else we will log and drop them.

Fields:

  • priority
  • timestamp
  • program
  • pid
  • message
  • facility: extracted from the priority
  • severity: extracted from the priority
  • severity_label: mapped from the official list
  • facility_label: mapped from the official list

Sample Configuration:

  #enabled: false

  #protocol.tcp:
    # The host and port to receive the new event
    #host: "localhost:9000"

    # Character used to split new message
    #line_delimiter: "\n"

    # Maximum size in bytes of the message received over TCP
    #max_message_size: 20MiB

    # The number of seconds of inactivity before a remote connection is closed.
    #timeout: 300s

  #protocol.udp:
    # The host and port to receive the new event
    #host: "localhost:9000"

    # Maximum size of the message received over UDP
    #max_message_size: 10240

Limitations:

  • Doesn't support multiline events like darwin can do, we need to extract the multiline logic from the log input.
  • Only support RFC3164, RFC5424 will require more work on the parser.

close #5862

Missing in this PR

  • Update the fields.yml with the new fields, validate with ECS.
  • docs / asciidoc

Note to reviewers:

  • event.go is built by the state machine as we go, there is no validation in this class since all the validation
    is done at the state machine level.
  • Ignore the parser.go, this is a generated file from parser.rl and syslog_rfc3164.rl, it's a messy file using all the possible
    tricks in the book to make it really fast.
  • parser_test.go contains all the usecase I've defined, I might miss some, please add suggestions.
  • I haven't made the build install ragel, unless we have more than one file.
  • I've changed a bit how TCP/UDP interract to make it easier to use, the changes should be minimum.

Speed

Performance looks OK to me, I don't have the number from my regular expression poc, but it's much better.

pkg: github.com/elastic/beats/filebeat/input/syslog
BenchmarkParser-8       10000000               176 ns/op              64 B/op          3 allocs/op
PASS
ok      github.com/elastic/beats/filebeat/input/syslog  1.977s

@ph ph assigned kvch Apr 12, 2018
@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch from 9716054 to 6089e46 Compare April 12, 2018 14:58
@elastic elastic deleted a comment from houndci-bot Apr 12, 2018
@elastic elastic deleted a comment from houndci-bot Apr 12, 2018
@elastic elastic deleted a comment from houndci-bot Apr 12, 2018
@elastic elastic deleted a comment from houndci-bot Apr 12, 2018
@elastic elastic deleted a comment from houndci-bot Apr 12, 2018
@ph ph added the review label Apr 12, 2018

// Parser is generated from a ragel state machine using the following command:
//#go generate ragel -Z -G2 parser.rl -o parser.go
var log = logp.NewLogger("syslog")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work as desired.

This should never be used from any global contexts, otherwise you will receive a no-op Logger. This is because the logp package needs to be initialized first. Instead create new Logger instance that your object reuses.

https://godoc.org/github.com/elastic/beats/libbeat/logp#NewLogger

)

// Parser is generated from a ragel state machine using the following command:
//#go generate ragel -Z -G2 parser.rl -o parser.go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write //go:generate ragel -Z -G2 parser.rl -o parser.go so that go generate will run the commands for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I've commented it for testing something, I will revert it back.

@@ -0,0 +1,45 @@
%%{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, not too hard to understand and bit clearer to understand than a big regexp.

High level is each declaration prio = "<" + prioprity ">" will create a state machine, you can compose them together and reuse them, ragel will take care of extrapolating all of them. This file could be used as is in any language supported by ragel: python, ruby and Java. So if this code is OK, it could be shared with the logstash-input-syslog instead of using grok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this approach. Do you mind sharing why did you choose ragel? What other lexers did you consider? I am just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't look too much in other lexer other than https://github.com/timtadh/lexmachine

I've worked with ragel before, I've made a SQL like DSL on top of a lucene document. They really focus on speed and this is a mature project.

Also, I like the fact that the lexer and the target language are not linked, Meaning we can target both the beats syslog input and the logstash input syslog. But ragel isn't the only tool available that can do that.

@@ -7,6 +7,9 @@ import (
"github.com/dustin/go-humanize"
)

// NAME is the human readable name and identifier.
const NAME = "tcp"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CapitalCamelCase instead?

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Just saw it in the PR message that the two things below are planned ;-)

One thing that is missing in this PR are the additional fields in the fields.yml. Also we should add some basic docs for it to get started.

# The number of seconds of inactivity before a remote connection is closed.
#timeout: 300s

#protocol.udp:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can one input have tcp and udp at the same time? If not, we should probably have the syslog input defined twice here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it cannot for complexity reason :) I am OK to have it twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad it can't. Perhaps we should have a check to make sure only one is used at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uniqueness of the protocol is already taken care with the Config.namespace's handling, more than one namespace configured accessing 'filebeat.inputs.1.protocol'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a note to check if we can change that error message.

#host: "localhost:9000"

# Maximum size of the message received over UDP
#max_message_size: 10240
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we bring message size for the 2 in line? Also the way it's configured, I like the humanize one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to add support for humanize in udp, I will do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kvch @ruflin done in #6886

"github.com/stretchr/testify/assert"
)

func TestParseSyslog(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, that are the tests I was looking for.

}

if ev.HasPid() {
f["pid"] = ev.Pid()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we align the fields here already with ECS? This would be process.pid.

To make sure we don't conflict, all these fields should be prefixed with syslog. if they are not already used fields.

@kvch
Copy link
Contributor

kvch commented Apr 13, 2018

@ph Do you mind opening an issue for RFC5424 after this PR is merged?

@ph
Copy link
Contributor Author

ph commented Apr 13, 2018

@kvch 👍 for another issue for rfc5424, some state machine of the parser can be reused as is, the structured part of rfc5424 is the weirdest part..

@ph ph mentioned this pull request Apr 16, 2018
6 tasks
@ph
Copy link
Contributor Author

ph commented Apr 16, 2018

@kvch followup issue for rfc5424 in #6872

}

var defaultUDP = udp.Config{
MaxMessageSize: 10240,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please humanize this option, as it's done in defaultTCP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to make bytes a first class citizen in ucfg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewkroh I prefer to have it in the beats repository, this way we can more control deprecation if needed.

second = ([0-5][0-9])>tok %second;
nanosecond = digit+;
time = hour ":" minute ":" second ("." nanosecond)?;
timestamp= month space day space time;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a space before = :)

}
}

func TestInvalidSyslog(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When are test cases going to be added to this? I would like to see that we fail properly on unsupported messages to make sure we don't forward unpredictable or garbage events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can delete that, I've added all tests for somewhat invalid syslog in the main test https://github.com/elastic/beats/pull/6842/files#diff-6ae69c7f1d1edd97ba8ebcdc7df2e855R18

On failure the catch all is to only populate the event.Message and for us to be a valid syslog event we must have timestamp and a message defined, everything else is just extra data.

@kvch
Copy link
Contributor

kvch commented Apr 17, 2018

Thanks for implementing BSD syslog protocol. It's great that you added TCP to the supported transports, regardless that it's not required by the legacy protocol. I am very happy we went for using lexers instead of unmaintainable monster regexes.

Furthermore, with this PR we move forward with extending capabilities and use cases of Filebeat which is the "beatific" way. But it's clear that the current architecture is not flexible enough yet. I am a bit sad it can't receive syslog messages from arbitrary source (e.g file). Unfortunately, right now reading from sources and parsing the incoming bytes is too tightly coupled to achieve that. This coupling is the a major problem with the architecture of your PR. That's where all of my review notes stems from. I haven't shared those problems here, as it is out of scope of this PR. Also, this feature is valuable as is.

We really need a big refactor to address this architectural problem. We should decouple transport layer from creating, parsing and transforming events. Otherwise we would never be able to read syslog messages from a local file. (Unless, we send through to a local port over TCP or UDP :D)

I am glad to merge this PR as soon as review notes are resolved so we could focus on the refactor. FYI I am preparing a proposal of this. :)

@ph
Copy link
Contributor Author

ph commented Apr 17, 2018

Furthermore, with this PR we move forward with extending capabilities and use cases of Filebeat which is the "beatific" way. But it's clear that the current architecture is not flexible enough yet. I am a bit sad it can't receive syslog messages from arbitrary source (e.g file). Unfortunately, right now reading from sources and parsing the incoming bytes is too tightly coupled to achieve that. This coupling is the a major problem with the architecture of your PR. That's where all of my review notes stems from. I haven't shared those problems here, as it is out of scope of this PR. Also, this feature is valuable as is.

Yes, this is why I was making baby steps, moving the UDP logic and the TCP out of the inputs so they can become composable. I wanted to do the same with the log inputs, the syslog parsing is currently coupled with theses two protocols, but the syslog parsing is just a function call so it can be another building block.

I am glad to merge this PR as soon as review notes are resolved so we could focus on the refactor. FYI I am preparing a proposal of this. :)

This is nice I was working on a similar proposal, we should probably sync on that, when I've created that feature, I've seen the same problems but it would require a bit more work to achieve without doing some refactoring.

My idea higher level was to be able mostly works with stream and leverage more of the processor interface.

To be able to do:

  • inputs X
  • Multiline
  • Syslog
  • Dissect

@ph
Copy link
Contributor Author

ph commented Apr 17, 2018

@kvch If we want to make it work right now with the file input it could be a processor, nothing prevent us to do that without the need to refactor. The syslog input could just initialize the processor or use the library directly.

(maybe we have some problems with the system module)

@kvch
Copy link
Contributor

kvch commented Apr 18, 2018

@ph I would keep it as is, so we have a chance to release it in 6.3.

@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch from 4e4cee2 to 798ddf2 Compare April 18, 2018 18:13
@ph
Copy link
Contributor Author

ph commented Apr 18, 2018

@kvch @ruflin I've updated the PR with the reviews comments, I've followed ECS as much as I can, I think it make sense. I've updated the fields too, added a wip docs for it.

The following is an example of the output if all fields are present:

{
  "@timestamp": "2018-10-12T02:14:15.000Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "doc",
    "version": "7.0.0-alpha1",
    "truncated": false
  },
  "message": "'su root' failed for lonvick on \/dev\/pts\/8 0",
  "source": "127.0.0.1:56949",
  "hostname": "wopr.mymachine.co",
  "event": {
    "severity": 5
  },
  "input": {
    "type": "syslog"
  },
  "syslog": {
    "priority": 13,
    "severity_label": "Notice",
   "facility_label": "user-level"
    "facility": 1
  },
  "process": {
    "pid": 2000,
    "program": "postfix"
  },
  "prospector": {
    "type": "syslog"
  },
  "beat": {
    "name": "sashimi",
    "hostname": "sashimi",
    "version": "7.0.0-alpha1"
  }
}

<<{beatname_lc}-input-{type}-common-options>> described later.


[id="{beatname_lc}-input-{type}-common-options"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dedemorton This input share all the options from the TCP/UDP inputs, what would be the best way to keep track of them here? Should we extract the options into external files that can be included in the syslog / tcp and udp.

Something like

tcp-options
udp-options

tcp input include tcp-options
udp input include udp options
syslog input include both tcp-options and udp-options

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I'd do something similar to what I've done here:

https://raw.githubusercontent.com/elastic/beats/master/filebeat/docs/inputs/input-common-file-options.asciidoc

So maybe you would have inputs/input-common-tcp-options.asciidoc and input-common-udp-options.asciidoc. Make sure that you add a comment to the top of the file explaining how the options are shared (see input-common-file-options.asciidoc for an example).

Of course, you'll need to use attributes to resolve the section IDs for each option because the IDs need to be unique across the book. Something like:

[id="{beatname_lc}-input-{type}-option-name"]

@ruflin
Copy link
Member

ruflin commented Apr 19, 2018

@ph ECS implementation looks really good 🎉 . The only thing I wonder is the content of agent.name. Where is this value coming from?

@ph
Copy link
Contributor Author

ph commented Apr 19, 2018

@ruflin agent.name is kind of tricky, I think i will move it in the syslog, in BSD world its the facility this is extracted from the priority with some bits operation and mapped to a static list of facility this list is the official list in RFC, see https://github.com/elastic/beats/pull/6842/files#diff-325cb617098659d9dad2180c415be116R43.

So its kind-of-agent

@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch from 798ddf2 to 0e09ef5 Compare April 19, 2018 14:30
@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch from 0e09ef5 to 2cdd6bc Compare April 19, 2018 14:32
@ph ph removed the needs_docs label Apr 19, 2018
@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch 2 times, most recently from 0d27f10 to 8b744f7 Compare April 19, 2018 14:37
@ph
Copy link
Contributor Author

ph commented Apr 19, 2018

@kvch @ruflin I have made the changes to agent.name, extracted tcp/udp options and added docs.
The PR was rebased + squashed.

@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch from 956d3ad to 27ad0e5 Compare April 19, 2018 17:00
The Syslog inputs will use the UDP and TCP source lib, allowing the same socket behavior and the
same options as the two existing inputs. The parser is a state machine build with ragel[1] and allow
to parse FC3164[2] events with some less than perfect variants, if the received event is a complete
RFC3164 we will extract all of them, for us the minimum valid message MUST have the `date` and the
`message` defined.

Anything else we will log and drop them.

Fields:
* priority
* timestamp
* program
* pid
* message
* facility: extracted from the priority
* severity: extracted from the priority
* severity_label: mapped from the official list.
* facility_label: mapped from the official list[2]

Sample Configuration:

```yaml
  #enabled: false

  #protocol.tcp:
    # The host and port to receive the new event
    #host: "localhost:9000"

    # Character used to split new message
    #line_delimiter: "\n"

    # Maximum size in bytes of the message received over TCP
    #max_message_size: 20MiB

    # The number of seconds of inactivity before a remote connection is closed.
    #timeout: 300s

  #protocol.udp:
    # The host and port to receive the new event
    #host: "localhost:9000"

    # Maximum size of the message received over UDP
    #max_message_size: 10240
```

Limitations:
* Doesn't support multiline events like darwin can do, we need to extract the multiline logic from the log input.
* Only support RFC3164, RFC5424 will require more work on the parser.

close elastic#5862

[1]: http://www.colm.net/open-source/ragel/
[2]: https://tools.ietf.org/html/rfc3164
@ph ph force-pushed the feature/syslog-parser-with-udp-tcp branch from 27ad0e5 to f040a01 Compare April 19, 2018 17:06
@kvch kvch merged commit 2a0ad4d into elastic:master Apr 19, 2018
@rhclayto
Copy link

rhclayto commented Dec 2, 2018

I believe there is a bug in syslog_rfc3164.rl .

See referenced issue:
#9323

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow filebeat to collect syslog events
6 participants