[Filebeat] Add timeout to GetObjectRequest for s3 input #15590

Merged
merged 22 commits into from
Jan 28, 2020

kaiyan-sheng
Contributor

@kaiyan-sheng kaiyan-sheng commented Jan 15, 2020

Problem we see when using the s3 input:
When using the s3 input to read logs from an S3 bucket, a connection reset by peer error shows up after a large volume of logs has been read. The error is raised by the reader.ReadString function. processorKeepAlive then finds that processMessage is taking longer than half of the configured visibility timeout, so the changeVisibilityTimeout function keeps getting called repeatedly.

This PR adds a timeout to the GetObjectRequest API call, using a context to cancel the request if it takes too long. Once the default timeout of 2 minutes is hit, that S3 object is skipped and the SQS message returns to the queue, so Filebeat can try to read it again later.

I decided to add a config option called context_timeout for the s3 input because context_timeout can be at most half of your visibility_timeout value. This lets users adjust both timeout values when using the s3 input or the Filebeat aws module with larger S3 objects or lower network bandwidth.

closes #15502

@kaiyan-sheng kaiyan-sheng self-assigned this Jan 15, 2020
@kaiyan-sheng kaiyan-sheng added the Filebeat, needs_backport, review, and Team:Integrations labels Jan 15, 2020
@kaiyan-sheng kaiyan-sheng changed the title [Filebeat] Add timeout to GetObjectRequest [Filebeat] Add timeout to GetObjectRequest for s3 input Jan 15, 2020
@kaiyan-sheng kaiyan-sheng requested a review from a team as a code owner January 16, 2020 14:55
@exekias exekias self-requested a review January 22, 2020 11:02
Contributor

@exekias exekias left a comment

I left a few comments. I understand you are still working on testing this one, right?

x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input.go
x-pack/filebeat/input/s3/config.go (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
Member

@jsoriano jsoriano left a comment

I see it is in progress, but I was taking a look and I have a couple of questions.

The new context inputCtx with its goroutine for cancelation doesn't smell well to me, why wasn't the previous p.context enough?

after the default timeout 2 minute is hit, this specific S3 object will be skipped, SQS message will return back to the queue later. So Filebeat can try to read it again later.

Does it mean that for big objects that can take more than 2 minutes to download, the timeout is hit and then Filebeat retries with the same object? Does filebeat keep the last offset or something so it doesn't keep continuously retrying?

x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc (outdated)
x-pack/filebeat/filebeat.reference.yml (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input.go (outdated)
x-pack/filebeat/input/s3/input_test.go (outdated)
@kaiyan-sheng
Contributor Author

The new context inputCtx with its goroutine for cancelation doesn't smell well to me, why wasn't the previous p.context enough?

Yes, p.context is enough for this. I just changed this PR to use the existing p.context.

Does it mean that for big objects that can take more than 2 minutes to download, the timeout is hit and then Filebeat retries with the same object? Does filebeat keep the last offset or something so it doesn't keep continuously retrying?

@jsoriano I haven't seen any big object take more than even 1 minute to download. I think the problem seen in the issue is caused by a resource leak from not having defer resp.Body.Close() in the code. The connection reset error happened when making the GetObjectRequest API call, which is one step before actually reading the log file. So if that call fails, the SQS message goes back into the queue and the same S3 object will be retried with GetObjectRequest later, after the visibility timeout expires.

Member

@jsoriano jsoriano left a comment

The connection reset error happened when making GetObjectRequest API call which is one step before actual reading the log file. So if that failed, the SQS message goes back into the queue and the same S3 object will be retried with GetObjectRequest later after visibility timeout is done.

Oh ok, so then this timeout and the retries don't affect the actual download of the log file, right? If that is the case then it LGTM.

Thanks for addressing all the changes!

x-pack/filebeat/input/s3/config.go (outdated)
@ycombinator
Contributor

@kaiyan-sheng Since the issue this fixes is labeled as a bug, should this fix be backported to 7.5 (in case there's a 7.5.3 before 7.6.0) as well? I see you did this for #15844.

@kaiyan-sheng
Contributor Author

@ycombinator Yes, I agree, just in case there's a 7.5.3. I will create the backport right now. Thank you!

kaiyan-sheng added a commit that referenced this pull request Jan 28, 2020
)

* Add timeout to GetObjectRequest which will cancel the request if it takes too long
* Close resp.Body from S3 GetObject API to prevent resource leak
* Change aws_api_timeout to api_timeout
(cherry picked from commit 86c3e63)
kaiyan-sheng added a commit that referenced this pull request Jan 29, 2020
… for s3 input (#15908)

* [Filebeat] Add timeout to GetObjectRequest for s3 input (#15590)

* Add timeout to GetObjectRequest which will cancel the request if it takes too long
* Close resp.Body from S3 GetObject API to prevent resource leak
* Change aws_api_timeout to api_timeout

(cherry picked from commit 86c3e63)

* update changelog

* Add default value in manifest.yml
kaiyan-sheng added a commit that referenced this pull request Feb 18, 2020
)

* Add timeout to GetObjectRequest which will cancel the request if it takes too long
* Close resp.Body from S3 GetObject API to prevent resource leak
* Change aws_api_timeout to api_timeout

(cherry picked from commit 86c3e63)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…Request for s3 input (elastic#15908)

* [Filebeat] Add timeout to GetObjectRequest for s3 input (elastic#15590)

* Add timeout to GetObjectRequest which will cancel the request if it takes too long
* Close resp.Body from S3 GetObject API to prevent resource leak
* Change aws_api_timeout to api_timeout

(cherry picked from commit cf7b92f)

* update changelog

* Add default value in manifest.yml
Successfully merging this pull request may close these issues.

[Filebeat] S3 input stop ingesting logs after some time
7 participants