
GetRecords blocked for more than 10 minutes #1037

Closed
nonexu opened this issue Jan 12, 2017 · 37 comments
Labels
guidance Question that needs advice or information.

Comments

nonexu commented Jan 12, 2017

Hi,
I'm running into an issue where a GetRecords call takes more than 10 minutes to return, so the next GetRecords call fails with an "Iterator expired" error.
Everything worked correctly when I started my application; the issue appeared after several minutes. According to the AWS documentation, GetRecords should not block like this.

I enabled the Kinesis API debug logging; the log is attached.
Have you encountered this issue, and how can it be solved?

Thank you in advance!

debug_info.log.gz
get_record_code

xibz (Contributor) commented Jan 12, 2017

Hello @nonexu, thank you for reaching out to us. Just looking at your small snippet, are you only sleeping if there is an error? How often do you make a call to the service if there isn't an error?
I am wondering if you are making too many calls too quickly. Try sleeping at the end of the loop to see if that fixes it, and try some different sleep durations. Let me know if that helps.

nonexu (Author) commented Jan 12, 2017

@xibz
Thanks for your reply. To answer your questions:

Are you only sleeping if there is an error?
Yes, I only sleep for 1 second when there is an error.

How often do you make a call to the service if there isn't an error?
I call GetRecords continuously, without sleeping.

This looks like the same issue as #301. I will make some modifications, test, and give feedback later.

@xibz xibz added guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jan 14, 2017
xibz (Contributor) commented Jan 17, 2017

Hello @nonexu, I am going to go ahead and close this. If you are still having issues with this, please let us know.

@xibz xibz closed this as completed Jan 17, 2017
nonexu (Author) commented Jan 19, 2017

Hello @xibz,
Sorry for the delay in responding.
This issue is not solved; I have created a ticket for the Kinesis team.
I enabled the SDK debug logging. The HTTP request is sent at 2017-01-19 00:22:13.580, but according to the SDK debug output the response is logged at 2017-01-19 00:34:18.304, while the Date header in the response is: Date: Thu, 19 Jan 2017 00:22:12 GMT.
Does the SDK log the headers immediately when it receives them, or is there some delay until after it receives the body? The body is logged after the headers.

2017-01-19 00:22:13.580 main.go:93 [Info ] aws sdk debug info:[DEBUG: Request kinesis/GetRecords Details:
---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1^M
Host: kinesis.us-west-2.amazonaws.com^M
User-Agent: aws-sdk-go/1.6.10 (go1.7.1; linux; amd64)^M
Content-Length: 257^M
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJYXRKCV4G2A7WBPQ/20170119/us-west-2/kinesis/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date;x-amz-target, Signature=fde4436512a27e006860b66f20a8dacb2f5af9206f0606e24f9cb7c8f653b8aa^M
Content-Type: application/x-amz-json-1.1^M
X-Amz-Date: 20170119T002213Z^M
X-Amz-Target: Kinesis_20131202.GetRecords^M
Accept-Encoding: gzip^M
^M
{"Limit":1000,"ShardIterator":"AAAAAAAAAAHeniHBOBOBRr7BWaOpMUFrHaZ5AXhBTJWQk/PRn/baWpvHtGFFLVeG9Mdj66pR90J7ya1k/F3763i3uYPXZDGX7q2uioMsSmYpaJ5uAGNr0trEFafLjFNKuq2SUHFpAn4FpDVLssCHD05/oO1XeKVzvIr+15tI7Fhpgn91Yza7G0P7XRMGL6Dl4FvJ9BKsdQDYQG7NFUu5WeOajT2aIoUU"}

2017-01-19 00:34:18.304 main.go:93 [Info ] aws sdk debug info:[DEBUG: Response kinesis/GetRecords Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK^M
Content-Length: 2915437^M
Content-Type: application/x-amz-json-1.1^M
Date: Thu, 19 Jan 2017 00:22:12 GMT^M
Server: Apache-Coyote/1.1^M
X-Amz-Id-2: nBTcoj6/ZtcWzmfhFmjPT//k33UxvXEyp22mRMeq6m/+8HYA9bWMWPa1HTyabaD6CvrXymfTE9E2UVLW5qptyUlih0F9gN1n^M
X-Amzn-Requestid: cb24b9b4-ca7d-5b31-9bde-b14185e803bd^M

xibz (Contributor) commented Jan 19, 2017

@nonexu - We log immediately when we have the body, IF you set the logger to log the body. Profiling with pprof to see where the bottleneck is occurring may give us better insight into what is going on.

@xibz xibz reopened this Jan 19, 2017
nonexu (Author) commented Jan 19, 2017

@xibz
Sorry, I don't quite understand what you mean by "profiling using pprof to see where the bottleneck is".
Could you give me a more detailed description of how to do that?

Another question: if the SDK logs the headers and body immediately when it receives them, why was there no timeout between 2017-01-19 00:22:13.580 and 2017-01-19 00:34:18.304?
Was any data being delivered during that time, and how can I see the details?
My server runs in the AWS cloud, so I would expect the data to be delivered quickly.

I tried setting a 3-minute HTTP timeout when initializing the SDK (see the sketch below), but it did not fix the issue.
Thank you in advance!
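
For reference, a minimal sketch of that kind of client setup with aws-sdk-go v1 (the region and the three-minute value mirror what is described above; the application's actual initialization code is not shown in this thread):

package main

import (
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

// newKinesisClient builds a Kinesis client whose underlying HTTP client gives
// up on any single request that takes longer than three minutes.
func newKinesisClient() *kinesis.Kinesis {
    sess := session.Must(session.NewSession(&aws.Config{
        Region: aws.String("us-west-2"),
        HTTPClient: &http.Client{
            Timeout: 3 * time.Minute,
        },
    }))
    return kinesis.New(sess)
}

func main() {
    svc := newKinesisClient()
    _ = svc // the polling loop elsewhere in the program would call svc.GetRecords
}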

xibz (Contributor) commented Jan 19, 2017

@nonexu, I was wondering about those times. Those are sent back from the service.
Adding your own timer at the end of the Send handler, to check whether the service is sending the correct time back, would at least tell us whether or not it is the SDK. If it isn't, I would mention that in the ticket.

req, out := svc.GetRecordsRequest(&kinesis.GetRecordsInput{
    // values here
})
req.Handlers.Send.PushBack(func(r *request.Request) {
    fmt.Println("Time:", time.Now().UTC())
})

If the times are correct, I'd profile next.
Here is a good blog post that shows how to profile your Go applications.

Please let me know if you have any issues with the profiling and/or with the snippet above.
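
A minimal sketch of one common way to expose Go's runtime profiles over HTTP with net/http/pprof (an illustration only, not code from this thread; the blog post linked above covers profiling in more depth):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    // Serve the pprof endpoints on a side port; the Kinesis consumer loop
    // would run elsewhere in the same process. A CPU profile can then be
    // captured with:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    select {} // block forever; stands in for the real application work
}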

nonexu (Author) commented Jan 20, 2017

@xibz
Thanks for your comment.
I hope I understood your idea correctly; I modified the function in the AWS SDK as below.

func (c *Kinesis) GetRecords(input *GetRecordsInput) (*GetRecordsOutput, error) {
    req, out := c.GetRecordsRequest(input)
    req.Handlers.Send.PushBack(func(r *request.Request) {
        fmt.Println("Time:", time.Now().UTC())
    })
    err := req.Send()
    return out, err
}

The issue was reproduced again; this time the call took more than 1 hour.
The time info is logged below, after the body is logged:
Time: 2017-01-20 05:31:15.605637078 +0000 UTC

2017-01-20 04:13:00.312        main.go:460 [Info  ] start get records:shardId-000000000002, AAAAAAAAAAHDBZ56CD+KUTPh2oPGPe9vDxaGOp3iYA9I6QATwcI4NlRWuFmyJROzK3D22/s5jfJhYDJBEFNY4yhoWimXthuGiC64gCSwOT0xO+4PVePkmLEiD/Ug6QuW2WgQOA/H3bpQW1eXalttSrzvXUkSBaM0HByFARLtqYFC8v5vi48xlRJ4emKKwjyvsprP2WfiCpjsjT0Xi2+kc7dORjuCyASI, 1000
2017-01-20 04:13:00.312        main.go:93  [Info  ] aws sdk debug info:[DEBUG: Request kinesis/GetRecords Details:
---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1^M
Host: kinesis.us-west-2.amazonaws.com^M
User-Agent: aws-sdk-go/1.6.10 (go1.7.1; linux; amd64)^M
Content-Length: 257^M
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJYXRKCV4G2A7WBPQ/20170120/us-west-2/kinesis/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date;x-amz-target, Signature=5a1dd0fae438cda816a5c166938777ce22eab2a1e85c92a2d4c7699656942922^M
Content-Type: application/x-amz-json-1.1^M
X-Amz-Date: 20170120T041300Z^M
X-Amz-Target: Kinesis_20131202.GetRecords^M
Accept-Encoding: gzip^M
^M
{"Limit":1000,"ShardIterator":"AAAAAAAAAAHDBZ56CD+KUTPh2oPGPe9vDxaGOp3iYA9I6QATwcI4NlRWuFmyJROzK3D22/s5jfJhYDJBEFNY4yhoWimXthuGiC64gCSwOT0xO+4PVePkmLEiD/Ug6QuW2WgQOA/H3bpQW1eXalttSrzvXUkSBaM0HByFARLtqYFC8v5vi48xlRJ4emKKwjyvsprP2WfiCpjsjT0Xi2+kc7dORjuCyASI"}
-----------------------------------------------------]

2017-01-20 05:31:15.598        main.go:93  [Info  ] aws sdk debug info:[DEBUG: Response kinesis/GetRecords Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK^M
Content-Length: 2120174^M
Content-Type: application/x-amz-json-1.1^M
Date: Fri, 20 Jan 2017 04:13:00 GMT^M
Server: Apache-Coyote/1.1^M
X-Amz-Id-2: 2nUehIoVcc8MfC4HDK7EJT8vHz3QPWNzH00LMzwX26tdi03Vb+mIEN+mQLeLKP9IN+crako3GC4lRaVZwl9kgYuBiYvw2WWJ^M
X-Amzn-Requestid: f4a299ca-c647-7e07-a459-09461b84cff5^M
{"MillisBehindLatest":0,"NextShardIterator":"AAAAAAAAAAFXMOfHraka9+Pw7oLD2AHQVmfwwSyml4fFfhCNG8IKTVA5NvlSE17fPu0JFNeBCRv7rSOaEOPkHaPAkOS0DcsQUxGqBsJ0qr+ZiVKnTbAokuvKIGZWlF+JUjzxgrOmwyysrkF2FEINmyqDEIgjxis/HZ6Mwc9GC4PiaXOHfmjZV+cl7bHo8LDPs8AZjdpKDyhFdIMVRHxkSnDsRIex1RwO","Records":[....body....]}
Time: 2017-01-20 05:31:15.605637078 +0000 UTC
2017-01-20 05:31:15.640        main.go:475 [Info  ] get records length 618:1000, shardId: shardId-000000000002

Do you need more information?

Some additional information about this issue:
The issue always happens on the server with IP 35.165.XXX.XXX.
Everything works fine on the server 52.27.XXX.XXX with the same executable and config file.
Both are in the AWS us-west-2c availability zone, but run different Linux OS versions.

Thank you in advance!

xibz (Contributor) commented Jan 20, 2017

@nonexu - It looks like the service is returning the incorrect time. I would mention that in the ticket. I believe this is a service issue, but I think we should leave this issue open until it is solved. Please let us know if you need anything or any further information.

nonexu (Author) commented Jan 20, 2017

Thank you. I just want to understand the cause of this issue and how to solve it.
Otherwise, I can't be sure my application will work well on different servers, and I can't upgrade or change servers without worrying.

xibz (Contributor) commented Jan 23, 2017

@nonexu - The service team just got back to me and is asking for the instance ID of the EC2 instance that this is occurring on. Can you please provide that?

nonexu (Author) commented Jan 24, 2017

@xibz Thanks for your help!
The instance ID the issue occurs on is i-093416b7e6559c70b.
Instance i-0511a166cfada4299 works correctly with the same config file.
I once started a new EC2 instance with the same OS version as i-093416b7e6559c70b and the same issue happened again, so I wonder whether it is related to the OS environment.

xibz (Contributor) commented Jan 24, 2017

@nonexu - Thank you for getting that to us. I've relayed this information to the service team. If you look at #301, the solution the other customer arrived at was to implement an HTTP timeout of 1 minute with retries. That may be a good workaround for now while we investigate what is going on.

nonexu (Author) commented Jan 25, 2017

@xibz Thanks for the suggestion!
I have already tried that solution, but in my test it timed out 5 times in a row and then the iterator expired.

Maybe I need to initialize a new Kinesis client when the iterator expires, but that is just a workaround.

jasdel (Contributor) commented Jan 25, 2017

@nonexu you shouldn't need a new client, but you may need to retrieve a new shard iterator if the current one expires; a rough sketch follows.
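
A minimal sketch of that pattern (an illustration only, not code from this thread; the stream name, shard ID, and starting iterator are placeholders, and a real consumer would obtain its first iterator from GetShardIterator as well):

package main

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/awserr"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

// refreshIterator asks Kinesis for a fresh shard iterator, resuming just after
// the last sequence number that was successfully processed.
func refreshIterator(svc *kinesis.Kinesis, stream, shard, lastSeq string) (string, error) {
    out, err := svc.GetShardIterator(&kinesis.GetShardIteratorInput{
        StreamName:             aws.String(stream),
        ShardId:                aws.String(shard),
        ShardIteratorType:      aws.String("AFTER_SEQUENCE_NUMBER"),
        StartingSequenceNumber: aws.String(lastSeq),
    })
    if err != nil {
        return "", err
    }
    return *out.ShardIterator, nil
}

func main() {
    stream, shard := "my-stream", "shardId-000000000002" // placeholders
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
    svc := kinesis.New(sess)

    iterator := "initial-iterator-from-GetShardIterator" // placeholder
    lastSeq := ""
    for {
        time.Sleep(time.Second)
        out, err := svc.GetRecords(&kinesis.GetRecordsInput{
            ShardIterator: aws.String(iterator),
            Limit:         aws.Int64(1000),
        })
        if err != nil {
            // Reuse the same client; only refresh the iterator when the service
            // reports that it has expired. If no record has been processed yet,
            // a real consumer would fall back to another iterator type instead.
            if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "ExpiredIteratorException" && lastSeq != "" {
                if iterator, err = refreshIterator(svc, stream, shard, lastSeq); err != nil {
                    fmt.Println("refresh iterator failed:", err)
                }
            }
            continue
        }
        if n := len(out.Records); n > 0 {
            lastSeq = *out.Records[n-1].SequenceNumber
        }
        iterator = *out.NextShardIterator
    }
}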

xibz (Contributor) commented Jan 25, 2017

@nonexu - Additionally, how long does it typically take before you experience a long delay?

@xibz xibz added closing-soon This issue will automatically close in 4 days unless further comments are made. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jan 27, 2017
nonexu (Author) commented Feb 3, 2017

@xibz Sorry for the delayed response; I was on leave for the past few days.
It always takes about 20 minutes for the call to return the result.
Is there any news about this issue?

xibz (Contributor) commented Feb 3, 2017

@nonexu - The service team is looking into it, but requires some information. I'll let them know how often this occurs. Thank you for providing the necessary information, and no worries about the delayed response!

nonexu (Author) commented Feb 3, 2017

@xibz
Is there anything I can do to help with this issue?
For example, I could reproduce it and the service team could check the requests from i-093416b7e6559c70b right away.
I will tell you as soon as it is reproduced.

nonexu (Author) commented Feb 3, 2017

@xibz It has been blocked for 5 minutes now on i-093416b7e6559c70b.
Can the service team check it now?

xibz (Contributor) commented Feb 3, 2017

@nonexu - I have asked them to check. I will let you know when they get back to me.

nonexu (Author) commented Feb 3, 2017

@xibz Here is detailed information for this occurrence:
The API was called at 2017-02-03 02:40:36 and returned at 2017-02-03 03:24:41.

@xibz xibz removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 3, 2017
xibz (Contributor) commented Feb 3, 2017

Hello @nonexu, the service team just got back to me. It looks like it is what we suspected earlier: that instance was running at 100% CPU. I am unsure whether the sleep I suggested in the loop is enough; I would profile and observe the instance to see what is causing the high CPU utilization.

@xibz xibz added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 3, 2017
nonexu (Author) commented Feb 4, 2017

@xibz Thanks for your feedback.
I'm waiting for good news.
The same issue happened on another server with the same OS version.

xibz (Contributor) commented Feb 4, 2017

@nonexu - Sorry, I think I may have been unclear here. The issue isn't the service or the SDK, but how it is being used. I suggested that a sleep would help. If it hasn't, I suggest profiling to see why your code base is utilizing 100% CPU. Please let us know if there is anything more we can do.

nonexu (Author) commented Feb 4, 2017

@xibz There is a 1-second sleep when a call fails.
If the root cause is how the SDK is being used, it is strange that the issue happens on i-093416b7e6559c70b while the same executable always succeeds on i-0511a166cfada4299.
I will profile and check it.

xibz (Contributor) commented Feb 4, 2017

@nonexu - Yes, it is suspicious that it only affects certain instances. Without being able to see all the instances and how they are configured, I can't draw any conclusions. I suggest looking into why CPU usage spikes to 100%; that may give more context as to why it affects some instances and not others.

nonexu (Author) commented Feb 4, 2017

@xibz
A daily job runs at 00:00 every day via crontab, so CPU usage is 100% at that moment.
And pprof shows no data while the call is blocked.
I updated the test code so that it only gets records from Kinesis, and the same issue happened.
Could you help me check whether the code is correct?
Thank you in advance!
main.go.gz

xibz (Contributor) commented Feb 5, 2017

@nonexu - You are only sleeping if there is an error or if the record length is 0. You need a sleep in your loop that is always hit; I've moved it to the beginning below.

for {
    time.Sleep(time.Second)
    params := &kinesis.GetRecordsInput{
        ShardIterator: aws.String(shardIterator), // Required
        Limit:         aws.Int64(1000),
    }

    fmt.Println("start get data, shardId:", shardId)
    result, err := this.client.GetRecords(params)
    fmt.Println("api return")

    if err != nil {
        fmt.Println("get record error:", err.Error())
        continue
    }

    recordLen := len(result.Records)
    fmt.Println("get records length:", recordLen)
    if recordLen > 0 {
        fmt.Println(result.Records[0])
    }
    shardIterator = *result.NextShardIterator
}

Try that and let us know if that helps!

nonexu (Author) commented Feb 6, 2017

@xibz Thanks for your reply.
The same issue happened after adding a sleep in the loop.
I also checked that the CPU load was normal during the test.

xibz (Contributor) commented Feb 6, 2017

@nonexu - Thank you for keeping us up to date. What was the CPU load? And how long was the delay?

nonexu (Author) commented Feb 6, 2017

@xibz It was delayed for 10 minutes this time.
The CPU load from the top command:
top - 01:14:47 up 38 days, 21:55, 1 user, load average: 0.00, 0.01, 0.05
image

nonexu (Author) commented Feb 6, 2017

@xibz Could you test this case on a server with the same OS version and configuration as i-093416b7e6559c70b?
I think it can be reproduced there.

@xibz xibz removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 8, 2017
xibz (Contributor) commented Feb 8, 2017

Thank you @nonexu for all the information. I've forwarded your graph to the service team, and I will try to reproduce this on my end using your code.

xibz (Contributor) commented Feb 8, 2017

@nonexu - The service team has asked for sar metrics from the instance, captured while the delay is occurring. Can you please provide those?

@xibz xibz added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 9, 2017
xibz (Contributor) commented Feb 16, 2017

Hello @nonexu, I am going to close this for now until we have further data. If you are still having issues with this, please feel free to reopen it.

jasdel (Contributor) commented Apr 4, 2017

Hi, we've merged PR #1166, which adds retrying of connection reset errors discovered while unmarshaling the response body, and read timeouts for Kinesis API Get operations. The read timeouts should prevent your application from hanging for significant amounts of time.

@diehlaws diehlaws removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Oct 29, 2018