
GetRecords blocked for more than 10 minutes #1037

Closed
nonexu opened this issue Jan 12, 2017 · 37 comments
Labels
guidance Question that needs advice or information.

Comments

nonexu commented Jan 12, 2017

Hi,
I'm running into an issue where a GetRecords call takes more than 10 minutes to return, so the next GetRecords call fails with an "Iterator expired" error.
Everything worked correctly when I started my application; the issue appeared after several minutes. According to the AWS documentation, GetRecords should not block like this.

I enabled the Kinesis API debug logging; the log is attached.
Have you encountered this issue, and how can it be solved?

Thank you in advance!

debug_info.log.gz
get_record_code

xibz (Contributor) commented Jan 12, 2017

Hello @nonexu, thank you for reaching out to us. Just looking at your small snippet, are you only sleeping if there is an error? How often do you make a call to the service if there isn't an error?
I am wondering if you are making too many calls too quickly. Try sleeping at the end of the loop to see if that fixes it, and try some different sleep durations. Let me know if that helps.

nonexu (Author) commented Jan 12, 2017

@xibz
Thanks for your reply. To answer your questions:

Are you only sleeping if there is an error?
Yes, I only sleep for 1 second when there is an error.

How often do you make a call to the service if there isn't an error?
I call GetRecords continuously, without sleeping.

This looks like the same issue as #301. I will make some modifications, test, and give feedback later.

@xibz xibz added guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jan 14, 2017
xibz (Contributor) commented Jan 17, 2017

Hello @nonexu, I am going to go ahead and close this. If you are still having issues with this, please let us know.

@xibz xibz closed this as completed Jan 17, 2017
nonexu (Author) commented Jan 19, 2017

Hello @xibz,
Sorry for the delay in responding.
This issue is not solved; I have created a ticket for the Kinesis team.
I enabled the SDK debug logging. The HTTP request is sent at 2017-01-19 00:22:13.580, but according to the SDK debug output the response is logged at 2017-01-19 00:34:18.304, while the Date header in the response is: Date: Thu, 19 Jan 2017 00:22:12 GMT.
Does the SDK log the headers immediately when it receives them, or is there some delay until after it receives the body? The body is logged after the headers.

2017-01-19 00:22:13.580 main.go:93 [Info ] aws sdk debug info:[DEBUG: Request kinesis/GetRecords Details:
---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1^M
Host: kinesis.us-west-2.amazonaws.com^M
User-Agent: aws-sdk-go/1.6.10 (go1.7.1; linux; amd64)^M
Content-Length: 257^M
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJYXRKCV4G2A7WBPQ/20170119/us-west-2/kinesis/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date;x-amz-target, Signature=fde4436512a27e006860b66f20a8dacb2f5af9206f0606e24f9cb7c8f653b8aa^M
Content-Type: application/x-amz-json-1.1^M
X-Amz-Date: 20170119T002213Z^M
X-Amz-Target: Kinesis_20131202.GetRecords^M
Accept-Encoding: gzip^M
^M
{"Limit":1000,"ShardIterator":"AAAAAAAAAAHeniHBOBOBRr7BWaOpMUFrHaZ5AXhBTJWQk/PRn/baWpvHtGFFLVeG9Mdj66pR90J7ya1k/F3763i3uYPXZDGX7q2uioMsSmYpaJ5uAGNr0trEFafLjFNKuq2SUHFpAn4FpDVLssCHD05/oO1XeKVzvIr+15tI7Fhpgn91Yza7G0P7XRMGL6Dl4FvJ9BKsdQDYQG7NFUu5WeOajT2aIoUU"}

2017-01-19 00:34:18.304 main.go:93 [Info ] aws sdk debug info:[DEBUG: Response kinesis/GetRecords Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK^M
Content-Length: 2915437^M
Content-Type: application/x-amz-json-1.1^M
Date: Thu, 19 Jan 2017 00:22:12 GMT^M
Server: Apache-Coyote/1.1^M
X-Amz-Id-2: nBTcoj6/ZtcWzmfhFmjPT//k33UxvXEyp22mRMeq6m/+8HYA9bWMWPa1HTyabaD6CvrXymfTE9E2UVLW5qptyUlih0F9gN1n^M
X-Amzn-Requestid: cb24b9b4-ca7d-5b31-9bde-b14185e803bd^M

xibz (Contributor) commented Jan 19, 2017

@nonexu - We log immediately when we have the body, IF you set the logger to log the body. Profiling with pprof to see where the bottleneck is occurring may give us better insight into what is going on.

@xibz xibz reopened this Jan 19, 2017
nonexu (Author) commented Jan 19, 2017

@xibz
Sorry, I don't quite understand what you mean by "profiling using pprof to see where the bottleneck is".
Could you give me a more detailed description of how to do that?

Another question: if the SDK logs the headers and body immediately when it receives them, why was there no timeout between 2017-01-19 00:22:13.580 and 2017-01-19 00:34:18.304?
Was any data being delivered during that time, and how can I see the details?
My server runs in the AWS cloud, so I would expect the data to be delivered quickly.

I tried setting a 3-minute HTTP timeout when initializing the SDK (see the sketch below), but it did not fix the issue.
Thank you in advance!
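
For reference, a minimal sketch of that kind of client setup with aws-sdk-go v1 (the region and the three-minute value mirror what is described above; the application's actual initialization code is not shown in this thread):

package main

import (
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

// newKinesisClient builds a Kinesis client whose underlying HTTP client gives
// up on any single request that takes longer than three minutes.
func newKinesisClient() *kinesis.Kinesis {
    sess := session.Must(session.NewSession(&aws.Config{
        Region: aws.String("us-west-2"),
        HTTPClient: &http.Client{
            Timeout: 3 * time.Minute,
        },
    }))
    return kinesis.New(sess)
}

func main() {
    svc := newKinesisClient()
    _ = svc // the polling loop elsewhere in the program would call svc.GetRecords
}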

xibz (Contributor) commented Jan 19, 2017

@nonexu, I was wondering about those times. Those are sent back from the service.
Adding your own timer at the end of the Send handler, to check whether the service is sending the correct time back, would at least tell us whether or not it is the SDK. If it isn't, I would mention that in the ticket.

req, out := svc.GetRecordsRequest(&kinesis.GetRecordsInput{
    // values here
})
req.Handlers.Send.PushBack(func(r *request.Request) {
    fmt.Println("Time:", time.Now().UTC())
})

If the times are correct, I'd profile next.
Here is a good blog post that shows how to profile your Go applications.

Please let me know if you have any issues with the profiling and/or with the snippet above.
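
A minimal sketch of one common way to expose Go's runtime profiles over HTTP with net/http/pprof (an illustration only, not code from this thread; the blog post linked above covers profiling in more depth):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    // Serve the pprof endpoints on a side port; the Kinesis consumer loop
    // would run elsewhere in the same process. A CPU profile can then be
    // captured with:
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    select {} // block forever; stands in for the real application work
}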

nonexu (Author) commented Jan 20, 2017

@xibz
Thanks for your comment.
I hope I understood your idea correctly; I modified the function in the AWS SDK as below.

func (c *Kinesis) GetRecords(input *GetRecordsInput) (*GetRecordsOutput, error) {
    req, out := c.GetRecordsRequest(input)
    req.Handlers.Send.PushBack(func(r *request.Request) {
        fmt.Println("Time:", time.Now().UTC())
    })
    err := req.Send()
    return out, err
}

The issue was reproduced again; this time the call took more than 1 hour.
The time info is logged below, after the body is logged:
Time: 2017-01-20 05:31:15.605637078 +0000 UTC

2017-01-20 04:13:00.312        main.go:460 [Info  ] start get records:shardId-000000000002, AAAAAAAAAAHDBZ56CD+KUTPh2oPGPe9vDxaGOp3iYA9I6QATwcI4NlRWuFmyJROzK3D22/s5jfJhYDJBEFNY4yhoWimXthuGiC64gCSwOT0xO+4PVePkmLEiD/Ug6QuW2WgQOA/H3bpQW1eXalttSrzvXUkSBaM0HByFARLtqYFC8v5vi48xlRJ4emKKwjyvsprP2WfiCpjsjT0Xi2+kc7dORjuCyASI, 1000
2017-01-20 04:13:00.312        main.go:93  [Info  ] aws sdk debug info:[DEBUG: Request kinesis/GetRecords Details:
---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1^M
Host: kinesis.us-west-2.amazonaws.com^M
User-Agent: aws-sdk-go/1.6.10 (go1.7.1; linux; amd64)^M
Content-Length: 257^M
Authorization: AWS4-HMAC-SHA256 Credential=AKIAJYXRKCV4G2A7WBPQ/20170120/us-west-2/kinesis/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date;x-amz-target, Signature=5a1dd0fae438cda816a5c166938777ce22eab2a1e85c92a2d4c7699656942922^M
Content-Type: application/x-amz-json-1.1^M
X-Amz-Date: 20170120T041300Z^M
X-Amz-Target: Kinesis_20131202.GetRecords^M
Accept-Encoding: gzip^M
^M
{"Limit":1000,"ShardIterator":"AAAAAAAAAAHDBZ56CD+KUTPh2oPGPe9vDxaGOp3iYA9I6QATwcI4NlRWuFmyJROzK3D22/s5jfJhYDJBEFNY4yhoWimXthuGiC64gCSwOT0xO+4PVePkmLEiD/Ug6QuW2WgQOA/H3bpQW1eXalttSrzvXUkSBaM0HByFARLtqYFC8v5vi48xlRJ4emKKwjyvsprP2WfiCpjsjT0Xi2+kc7dORjuCyASI"}
-----------------------------------------------------]

2017-01-20 05:31:15.598        main.go:93  [Info  ] aws sdk debug info:[DEBUG: Response kinesis/GetRecords Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK^M
Content-Length: 2120174^M
Content-Type: application/x-amz-json-1.1^M
Date: Fri, 20 Jan 2017 04:13:00 GMT^M
Server: Apache-Coyote/1.1^M
X-Amz-Id-2: 2nUehIoVcc8MfC4HDK7EJT8vHz3QPWNzH00LMzwX26tdi03Vb+mIEN+mQLeLKP9IN+crako3GC4lRaVZwl9kgYuBiYvw2WWJ^M
X-Amzn-Requestid: f4a299ca-c647-7e07-a459-09461b84cff5^M
{"MillisBehindLatest":0,"NextShardIterator":"AAAAAAAAAAFXMOfHraka9+Pw7oLD2AHQVmfwwSyml4fFfhCNG8IKTVA5NvlSE17fPu0JFNeBCRv7rSOaEOPkHaPAkOS0DcsQUxGqBsJ0qr+ZiVKnTbAokuvKIGZWlF+JUjzxgrOmwyysrkF2FEINmyqDEIgjxis/HZ6Mwc9GC4PiaXOHfmjZV+cl7bHo8LDPs8AZjdpKDyhFdIMVRHxkSnDsRIex1RwO","Records":[....body....]}
Time: 2017-01-20 05:31:15.605637078 +0000 UTC
2017-01-20 05:31:15.640        main.go:475 [Info  ] get records length 618:1000, shardId: shardId-000000000002

Do you need more information?

Some additional information about this issue:
The issue always happens on the server with IP 35.165.XXX.XXX.
Everything works fine on the server 52.27.XXX.XXX with the same executable and config file.
Both are in the AWS us-west-2c availability zone, but run different Linux OS versions.

Thank you in advance!

xibz (Contributor) commented Jan 20, 2017

@nonexu - It looks like the service is returning the incorrect time. I would mention that in the ticket. I believe this is a service issue, but I think we should leave this issue open until it is solved. Please let us know if you need anything or any further information.

nonexu (Author) commented Jan 20, 2017

Thank you. I just want to understand the cause of this issue and how to solve it.
Otherwise, I can't be sure my application will work well on different servers, and I can't upgrade or change servers without worrying.

xibz (Contributor) commented Jan 23, 2017

@nonexu - The service team just got back to me and is asking for the instance ID of the EC2 instance that this is occurring on. Can you please provide that?

nonexu (Author) commented Jan 24, 2017

@xibz Thanks for your help!
The instance ID the issue occurs on is i-093416b7e6559c70b.
Instance i-0511a166cfada4299 works correctly with the same config file.
I once started a new EC2 instance with the same OS version as i-093416b7e6559c70b and the same issue happened again, so I wonder whether it is related to the OS environment.

xibz (Contributor) commented Jan 24, 2017

@nonexu - Thank you for getting that to us. I've relayed this information to the service team. If you look at #301, the solution the other customer arrived at was to implement an HTTP timeout of 1 minute with retries. That may be a good workaround for now while we investigate what is going on.

nonexu (Author) commented Jan 25, 2017

@xibz Thanks for the suggestion!
I have already tried that solution, but in my test it timed out 5 times in a row and then the iterator expired.

Maybe I need to initialize a new Kinesis client when the iterator expires, but that is just a workaround.

jasdel (Contributor) commented Jan 25, 2017

@nonexu you shouldn't need a new client, but you may need to retrieve a new shard iterator if the current one expires; a rough sketch follows.
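
A minimal sketch of that pattern (an illustration only, not code from this thread; the stream name, shard ID, and starting iterator are placeholders, and a real consumer would obtain its first iterator from GetShardIterator as well):

package main

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/awserr"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/kinesis"
)

// refreshIterator asks Kinesis for a fresh shard iterator, resuming just after
// the last sequence number that was successfully processed.
func refreshIterator(svc *kinesis.Kinesis, stream, shard, lastSeq string) (string, error) {
    out, err := svc.GetShardIterator(&kinesis.GetShardIteratorInput{
        StreamName:             aws.String(stream),
        ShardId:                aws.String(shard),
        ShardIteratorType:      aws.String("AFTER_SEQUENCE_NUMBER"),
        StartingSequenceNumber: aws.String(lastSeq),
    })
    if err != nil {
        return "", err
    }
    return *out.ShardIterator, nil
}

func main() {
    stream, shard := "my-stream", "shardId-000000000002" // placeholders
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
    svc := kinesis.New(sess)

    iterator := "initial-iterator-from-GetShardIterator" // placeholder
    lastSeq := ""
    for {
        time.Sleep(time.Second)
        out, err := svc.GetRecords(&kinesis.GetRecordsInput{
            ShardIterator: aws.String(iterator),
            Limit:         aws.Int64(1000),
        })
        if err != nil {
            // Reuse the same client; only refresh the iterator when the service
            // reports that it has expired. If no record has been processed yet,
            // a real consumer would fall back to another iterator type instead.
            if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "ExpiredIteratorException" && lastSeq != "" {
                if iterator, err = refreshIterator(svc, stream, shard, lastSeq); err != nil {
                    fmt.Println("refresh iterator failed:", err)
                }
            }
            continue
        }
        if n := len(out.Records); n > 0 {
            lastSeq = *out.Records[n-1].SequenceNumber
        }
        iterator = *out.NextShardIterator
    }
}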

xibz (Contributor) commented Jan 25, 2017

@nonexu - Additionally, how long does it typically take before you experience a long delay?

@xibz xibz added closing-soon This issue will automatically close in 4 days unless further comments are made. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jan 27, 2017
nonexu (Author) commented Feb 3, 2017

@xibz Sorry for the delayed response; I was on leave for the past few days.
It always takes about 20 minutes for the call to return the result.
Is there any news about this issue?

xibz (Contributor) commented Feb 3, 2017

@nonexu - The service team is looking into it, but requires some information. I'll let them know how often this occurs. Thank you for providing the necessary information, and no worries about the delayed response!

nonexu (Author) commented Feb 3, 2017

@xibz
Is there anything I can do to help with this issue?
For example, I could reproduce it and the service team could check the requests from i-093416b7e6559c70b right away.
I will tell you as soon as it is reproduced.

nonexu (Author) commented Feb 3, 2017

@xibz It has been blocked for 5 minutes now on i-093416b7e6559c70b.
Can the service team check it now?

xibz (Contributor) commented Feb 3, 2017

@nonexu - I have asked them to check. I will let you know when they get back to me.

nonexu (Author) commented Feb 3, 2017

@xibz Here is detailed information for this occurrence:
The API was called at 2017-02-03 02:40:36 and returned at 2017-02-03 03:24:41.

@xibz xibz removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 3, 2017
xibz (Contributor) commented Feb 3, 2017

Hello @nonexu, the service team just got back to me. It looks like it is what we suspected earlier: that instance was running at 100% CPU. I am unsure whether the sleep I suggested in the loop is enough; I would profile and observe the instance to see what is causing the high CPU utilization.

@xibz xibz added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 3, 2017
nonexu (Author) commented Feb 4, 2017

@xibz Thanks for your feedback.
I'm waiting for good news.
The same issue happened on another server with the same OS version.

xibz (Contributor) commented Feb 4, 2017

@nonexu - Sorry, I think I may have been unclear here. The issue isn't the service or the SDK, but how it is being used. I suggested that a sleep would help. If it hasn't, I suggest profiling to see why your code base is utilizing 100% CPU. Please let us know if there is anything more we can do.

nonexu (Author) commented Feb 4, 2017

@xibz There is a 1-second sleep when a call fails.
If the root cause is how the SDK is being used, it is strange that the issue happens on i-093416b7e6559c70b while the same executable always succeeds on i-0511a166cfada4299.
I will profile and check it.

xibz (Contributor) commented Feb 4, 2017

@nonexu - Yes, it is suspicious that it only affects certain instances. Without being able to see all the instances and how they are configured, I can't draw any conclusions. I suggest looking into why CPU usage spikes to 100%; that may give more context as to why it affects some instances and not others.

nonexu (Author) commented Feb 4, 2017

@xibz
A daily job runs at 00:00 every day via crontab, so CPU usage is 100% at that moment.
And pprof shows no data while the call is blocked.
I updated the test code so that it only gets records from Kinesis, and the same issue happened.
Could you help me check whether the code is correct?
Thank you in advance!
main.go.gz

xibz (Contributor) commented Feb 5, 2017

@nonexu - You are only sleeping if there is an error or if the record length is 0. You need a sleep in your loop that is always hit; I've moved it to the beginning below.

for {
    time.Sleep(time.Second)
    params := &kinesis.GetRecordsInput{
        ShardIterator: aws.String(shardIterator), // Required
        Limit:         aws.Int64(1000),
    }

    fmt.Println("start get data, shardId:", shardId)
    result, err := this.client.GetRecords(params)
    fmt.Println("api return")

    if err != nil {
        fmt.Println("get record error:", err.Error())
        continue
    }

    recordLen := len(result.Records)
    fmt.Println("get records length:", recordLen)
    if recordLen > 0 {
        fmt.Println(result.Records[0])
    }
    shardIterator = *result.NextShardIterator
}

Try that and let us know if that helps!

nonexu (Author) commented Feb 6, 2017

@xibz Thanks for your reply.
The same issue happened after adding a sleep in the loop.
I also checked that the CPU load was normal during the test.

xibz (Contributor) commented Feb 6, 2017

@nonexu - Thank you for keeping us up to date. What was the CPU load? And how long was the delay?

nonexu (Author) commented Feb 6, 2017

@xibz It was delayed for 10 minutes this time.
The CPU load from the top command:
top - 01:14:47 up 38 days, 21:55, 1 user, load average: 0.00, 0.01, 0.05
image

nonexu (Author) commented Feb 6, 2017

@xibz Could you test this case on a server with the same OS version and configuration as i-093416b7e6559c70b?
I think it can be reproduced there.

@xibz xibz removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 8, 2017
xibz (Contributor) commented Feb 8, 2017

Thank you @nonexu for all the information. I've forwarded your graph to the service team, and I will try to reproduce this on my end using your code.

xibz (Contributor) commented Feb 8, 2017

@nonexu - The service team has asked for sar metrics from the instance, captured while the delay is occurring. Can you please provide those?

@xibz xibz added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Feb 9, 2017
xibz (Contributor) commented Feb 16, 2017

Hello @nonexu, I am going to close this for now until we have further data. If you are still having issues with this, please feel free to reopen it.

jasdel (Contributor) commented Apr 4, 2017

Hi, we've merged PR #1166, which adds retrying of connection reset errors discovered while unmarshaling the response body, and read timeouts for Kinesis API Get operations. The read timeouts should prevent your application from hanging for significant amounts of time.

@diehlaws diehlaws removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Oct 29, 2018