
Support for plain JSON log files #42

Closed · bcicen opened this issue Sep 4, 2014 · 12 comments · Fixed by #377

@bcicen commented Sep 4, 2014

I already log my application events in JSON; is there an easy way to ship these into Logstash with the log-courier input? With the lumberjack input this was accomplished with "format => json_event" and no additional parsing; however, courier seems to lack any datatype or format classification in the Logstash plugin.

@driskell (Owner)

Hi @bcicen

Log Courier produces structured data from the log file: it takes the line, the host, the path, and any additional configured fields to generate the event. So the resulting event transmitted to Logstash already has a JSON structure to it.
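
For illustration, an event shipped by Log Courier looks roughly like this (a sketch only; the hostname, path, and offset are made up, and extra fields such as "type" come from the shipper's configuration):

{
  "host": "app-01",
  "path": "/var/log/app.log",
  "offset": 3288,
  "message": "the raw log line",
  "type": "devlog"
}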

The best way at the moment to decode and "merge" is to use a filter on the Logstash side to decode the line field. I don't plan to allow codecs in the plugin, as those generally expect single-line data rather than structured data.
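
Concretely, that filter is just the standard Logstash json filter pointed at the field carrying the raw line (a minimal sketch; it assumes the line lands in the "message" field, as in the examples further down):

filter {
  json {
    source => "message"
  }
}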

I am considering adding a JSON codec to Log Courier in the future to do this on the client side, which would save a filter on the indexers. It's a really small resource gain compared to the multiline and filter codecs, however, as JSON decoding in Logstash is very fast indeed.

Jason

@mahnunchik

+1

@matthughes

Maybe I'm wrong, but I believe the OP is just referring to supporting this: http://logstash.net/docs/1.4.2/codecs/json. Logstash Forwarder supports it, yet I don't see any reference to it in their source code. Basically, given a JSON log 'line' of:

{
  "foo": "bar",
  "bin": "baz",
  "message": "this is the message"
}

it should produce a "_source" of:

{
  "_index": "logstash-2015.01.16",
  "_type": "...",
  "_id": "AUrz8DM4tItrB37ovWRo",
  "_score": 1,
  "_source": {
    "host": "...",
    "message": "this is the message",
    "foo": "bar",
    "bin": "baz"
    "offset": 3288,
    "path": "/opt/app/logs/log.json",
    "type": "devlog",
    "@version": "1",
    "@timestamp": "2015-01-16T18:10:12.716Z"
  }
}

Instead, log-courier does this, squashing the whole JSON entry into the message field:

{
  "_index": "logstash-2015.01.16",
  "_type": "...",
  "_id": "AUrz8DM4tItrB37ovWRo",
  "_score": 1,
  "_source": {
    "host": "...",
    "message": "{ \"message\": \"this is the message\", \"foo\": \"bar\", \"bin\": \"baz\"}"
    "offset": 3288,
    "path": "/opt/app/logs/log.json",
    "type": "devlog",
    "@version": "1",
    "@timestamp": "2015-01-16T18:10:12.716Z"
  }
}

All I do to configure this with LSF is:

input {
  lumberjack {
    codec => "json"
  }
}

output {
   elasticsearch {
      host => "localhost"
      protocol => "transport"
   }
}

I don't need any filters. Again, I can't even tell if LSF is doing anything special to support this; I certainly don't see any references to a json codec in their codebase. I tried both json and json_lines, but both just embed the whole JSON structure inside message.

@matthughes

This is my workaround for the time being:

input {
  lumberjack {
    port => 9000
    codec => "json"
  }

  courier {
    port => 9001
    codec => "json"
  }
}

filter {
  if [shipper] == "lc-1.3" {
      json {
        source => "message"
      }
  }
}

I have my clients declare a 'shipper' field of lc-1.3. That way, if we ever get a JSON codec, I can just change that value to the new version and won't double-parse the JSON in the future.
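
On the Log Courier side, declaring that field is just a "fields" entry in the file stanza (a sketch of the client configuration; the paths and server address are placeholders):

{
  "network": {
    "servers": [ "logstash.example.com:9001" ]
  },
  "files": [
    {
      "paths": [ "/opt/app/logs/log.json" ],
      "fields": { "type": "devlog", "shipper": "lc-1.3" }
    }
  ]
}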

@driskell (Owner)

I'm going to look at this again.

The issue is that LSF does not stream logs to the codec properly. So if the codec is, say, multiline, it corrupts entries by mixing entries from different clients together. JSON would work OK; the problem is the streaming codecs like multiline. I just do not want to implement something that is inherently broken, even if it works "sometimes", and that's why I removed it when I forked.

I can see the TCP input has a real working implementation where streams are handled correctly. I'll use that as a reference point. The internal queue in the courier plugin looks to be the only barrier, but I'll know more once I can sit down and have another think.

It would definitely be useful to support codecs: now that Logstash 1.5 makes installing third-party plugins and codecs so easy, it would be silly not to let people take advantage of them.

@jenshz

jenshz commented Jan 17, 2015

+1

@driskell (Owner)

Things work great with JSON etc., so resolving this ticket would be feasible. However, it would then mean the "multiline" codec could be used, and this is where things get complicated.

If a multiline event consists of 11 lines and only 10 have been received so far, but not the final 11th line, the data cannot be acknowledged: otherwise, if we lose the connection or Logstash crashes, the chunk is lost and a lone-wolf single-line event appears (I see this as corruption).

Overall, this means some heavy work on the acknowledgement code. I've done some initial work in a feature branch to allow the codecs; it's just missing the heavy work on acknowledgements.

driskell added a commit that referenced this issue Jan 17, 2015
@driskell (Owner)

Further thoughts:

What if a codec is added that performs other types of modifications, such as filtering events or combining them in an arbitrary fashion? Such codecs would be completely impossible to handle with acknowledgements without the codec being aware that we are aiming for guaranteed delivery.

As such, it may be that PR #95 is all we need for now, but we're back in the realm where some codecs will just act strangely and break.

Proposal: I will hardcode that only plain, json, and other tested codecs are allowed, throwing a configuration error otherwise.

Interestingly, I noticed someone started looking at a guaranteed pipeline in Logstash before I did:
catalyst@5b9d27b
Further work there could mean codecs tag themselves as "I support guaranteed delivery", and events wouldn't be acknowledged until they reach Elasticsearch... definitely the path to go.

@landonix

landonix commented Jun 4, 2015

+1 for json codec support, filters are not needed

@padusumilli

+1 for json codec support

@Balanc3

Balanc3 commented Jan 9, 2016

+1

@driskell (Owner)

The multiline codec is now deprecated in newer Logstash, which makes codec support way simpler and more reliable.

I'm also looking at JSON decoding in the shipper.
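
If that lands, a file stanza could hypothetically look like this (my guess only: the "codec" key mirrors how Log Courier already configures its multiline codec, and a "json" codec name is not a shipped feature):

{
  "files": [
    {
      "paths": [ "/opt/app/logs/log.json" ],
      "codec": { "name": "json" }
    }
  ]
}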
