
Fails processing jsonl+gzip when using S3 Input plugin #18696

Closed
nhnicwaller opened this issue May 21, 2020 · 3 comments · Fixed by #18764
nhnicwaller commented May 21, 2020

When using Filebeat with the S3 input plugin (beta), Filebeat will fail processing files that contain JSON lines and are GZIP-compressed. This is the output format of AWS GuardDuty with S3 export enabled, which means that Filebeat is unable to process logs as written by AWS GuardDuty. The issue occurs specifically when the object contains JSON lines, is GZIP-compressed, and has the following metadata on the S3 object:

Content-Encoding: gzip
Content-Type: application/json

Note: I originally posted this as a thread on the discussion forum, but now I am confident it is a bug in Filebeat so I'm creating an issue here.

Actual Results

When using Filebeat 7.6.2 I get these error messages. In this case it seems that Filebeat is attempting to decompress the GZIP stream, but the stream has already been automatically decompressed by the transport in aws-sdk-go based on the object Metadata.

2020-05-20T22:42:27.973Z	ERROR	[s3]	s3/input.go:447	gzip.NewReader failed: gzip: invalid header
2020-05-20T22:42:27.974Z	ERROR	[s3]	s3/input.go:386	createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/15/659b5608-a71c-3b42-8979-f851e61d9098.jsonl.gz: gzip.NewReader failed: gzip: invalid header

And when using Filebeat 7.7.0 I get slightly different error messages. This seems to stem from the fact that GuardDuty is incorrectly assigning the application/json MIME type to files that actually contain JSON lines/newline-delimited JSON.

2020-05-21T19:41:28.122Z	ERROR	[s3]	s3/input.go:434	expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-05-21T19:41:28.122Z	ERROR	[s3]	s3/input.go:387	createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/14/8b55ad23-fe05-3f4e-8ff7-d61365cb2ad3.jsonl.gz: expand_event_list_from_field parameter is missing in config for application/json content-type file

Expected Results

Presumably GuardDuty should be using application/json-seq, application/jsonstream, application/x-json-stream, application/x-ndjson, or application/x-jsonlines, but there is no single standard MIME type for newline-delimited JSON. So Filebeat should be able to handle cases where JSON lines/NDJSON files are saved with the application/json MIME type, perhaps via automatic detection or a configuration override.
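One possible shape for such handling, sketched in Go (hypothetical helper name, not the actual Filebeat code): decode the body line by line, so the same path works for both a single JSON document on one line and NDJSON, regardless of the declared content type.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// decodeEvents parses a body as newline-delimited JSON, skipping blank
// lines. A single-line JSON document is just the degenerate one-event case.
// Note: bufio.Scanner has a default max line length (64 KiB); a real
// implementation would raise it for large events.
func decodeEvents(body string) ([]map[string]interface{}, error) {
	var events []map[string]interface{}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev map[string]interface{}
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return nil, err
		}
		events = append(events, ev)
	}
	return events, sc.Err()
}

func main() {
	body := "{\"id\":1}\n{\"id\":2}\n"
	events, err := decodeEvents(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(events))
}
```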

The S3 input plugin should also be careful to not attempt GZIP decompression twice (once automatic in the transport layer, and once explicitly in the s3 input code).

Additional information

  • I'm testing Filebeat inside a docker container based on ubuntu:18.04
  • I'm downloading Filebeat directly from artifacts.elastic.co and using the dpkg install method

My Filebeat Configuration

filebeat.inputs:
  - type: s3
    enabled: true
    queue_url: https://sqs.ca-central-1.amazonaws.com/123456789123/awslogs-guardduty

processors:
  - decode_json_fields:
      fields: ['message']
      target: "aws.guardduty"
  - timestamp:
      field: "aws.guardduty.updatedAt"
      layouts:
        - '2006-01-02T15:04:05Z'
  - add_fields:
      target: "event"
      fields:
        dataset: "aws.guardduty"
nhnicwaller (author) commented:
I discovered that Filebeat 7.5.2 has no trouble reading and shipping these objects from S3, so this appears to be a regression introduced between Filebeat 7.5.2 and 7.6.2.

@andresrc andresrc added the Team:Platforms Label for the Integrations - Platforms team label May 22, 2020
elasticmachine commented:
Pinging @elastic/integrations-platforms (Team:Platforms)

nhnicwaller (author) commented:
I suggest checking the first two bytes for the GZIP magic number 0x1F8B to decide whether to attempt GZIP decompression, rather than relying on potentially inaccurate metadata properties/headers.

Alternatively, since the GZIP library already checks for the magic number, if GZIP decompression fails on the header, just treat the stream as if it has already been decompressed.
