Skip to content
This repository has been archived by the owner on Jan 16, 2024. It is now read-only.

AVS will become abnormal after Internet disconnect / resume #727

Closed
zeusshuang opened this issue May 18, 2018 · 18 comments
Closed

AVS will become abnormal after Internet disconnect / resume #727

zeusshuang opened this issue May 18, 2018 · 18 comments

Comments

@zeusshuang
Copy link

Hi AVS team,

AVS SDK 1.7.1
I attached a log for this case, and it can be 100% reproduced.

  • Alexa, play BBC radio two from TuneIn
    -- while you listen audio start, remove the Internet from WiFi AP (but WiFi is still on)
    -- TuneIn will continue to play 1 or 2 minutes, maybe buffering.
    -- wait about 6 minutes (this log, I wait 10 minutes)
    -- put Internet resume on WiFi AP, and make sure Internet is actually working.
    -- No audio streaming resume from TuneIn
    -- You will see the BBC TuneIn is still running on APP or Web APP (in playing state)
  • Alexa, what's the captiple of France
    -- AVS go into listening state, no response. From the APP history, no text is detected.

AVS will never return until restart it.
And it's very interesting, if it's not in playing state, and test it in Idle state, I mean, in idle state, just disconnect Internet and resume, it will be fine and no problem.

Why I do this test is that, during past months test, we sometimes get AVS stuck in overnight test,
the symptom is AVS is no response, but whole system seems fine.
To review the source code, we found a ping time out for 5 minutes in "HTTP2Transport.cpp".
So I design this test to check if it will be problem while it failed to ping timeout.

And if you change 5 minutes to 1 minutes for example, it will reduce your test time.
I found, while you see the log "[ 3] 9 AttachmentReaderSource:read:size=0,status=1" dump,
it will go to fail.

In the log, you will see my first trigger it at line#5, and let it running for 10 minutes.
Next trigger is at line#7178, and line#7692. The AVS will go into listening state but no response.

2.log

@mavamazon
Copy link
Contributor

Hi!
Thank you for detailed report! This is definitely looks wrong. We will investigate it and let you know if (even) more information is required.

@zeusshuang
Copy link
Author

Hi @mavamazon ,

Here I update another test result yesterday, and I need to correct the condition in my origin report.
I found to wait for 5 minutes (wait for next HTTP2 ping timeout) is not must, it seems the system become abnormal after this log flooding: "[ 3] 9 AttachmentReaderSource:read:size=0,status=1".

@kjkh
Copy link

kjkh commented May 23, 2018

Hi @zeusshuang,

To help with the investigation, is this a new issue introduced by 1.7.1? Or have you seen this occur in previous versions of the SDK?

@zeusshuang
Copy link
Author

Hi @kjkh ,

It also happens on previous SDK. At least we start to look into this case since 1.6.
While it go into stuck, the media player will become pause state, so we ever consider to monitor it and restart AVS, but I think it's not a good solution.

@visayamv
Copy link
Contributor

Hi @zeusshuang , from your logs

2018-05-18 01:00:57.116 [ 14] 0 HTTP2StreamPool:getStream:streamId=49,numAcquiredStreams=2
2018-05-18 01:00:57.117 [ 14] 9 HTTP2Transport:insertActiveStream:handle=0x2099318
....
2018-05-18 01:01:27.132 [ 14] I HTTP2Transport:streamProgressTimedOut:streamId=49
...
2018-05-18 01:01:27.134 [ 14] I HTTP2Transport:streamTimedOutReason:curlRespId=0
2018-05-18 01:01:27.135 [ 14] 0 HTTP2StreamPool:releaseStream:streamId=49,numAcquiredStreams=1
2018-05-18 01:01:27.135 [ d] 1 FocusManager:releaseChannel:channelName=Dialog
2018-05-18 01:01:27.135 [ 14] 0 HTTP2Transport:networkLoop_582::Request Timeout

I can see that your utterance was tried to be sent to the server, but unfortunately timed out and was not able to send.

Can you check while the issue is happening that your device can connect to the server by issuing the following command:

curl https://avs-alexa-na.amazon.com

It should return a response.

After issuing that command, try to trigger your application again.

@zeusshuang
Copy link
Author

zeusshuang commented May 30, 2018

Hi @visayamv ,

I tried it, it will respond "Unauthorized_request_exception". But it's no help, the symptom is the same as my report before.

BTW, this log "HTTP2Transport:networkLoop_582::Request Timeout", is my added from case#582 suggestion.

@kencecka
Copy link
Contributor

kencecka commented Jun 5, 2018

Hi @zeusshuang,

I'm actively looking at this issue and have not been able to reproduce it yet. When you are running this test, are you using the SampleApp from 1.7.1, or are you testing a custom application. Are you testing it on an embedded platform, or on a desktop?

I do not have a router handy, but I set up an equivalent test environment by running SampleApp from 1.7.1 on my mac, connecting over WiFi to my cell phone. I then disabled cellular data on the phone while leaving the WiFi access point active. This should be equivalent to your scenario of pulling the internet connection from a router. In this test, I was able to observe the tunein station ceasing to play after disabling the internet connection. After 10 minutes, I restored the internet connection and was able to ask alexa for a joke successfully. I did not see the tunin station resume playing however.

If you are not testing with plain SampleApp on desktop, could I trouble you to try it and check whether you can reproduce the issue using your router to break the internet connection? In the mean time, I will continue to investigate here.

Ken

@zeusshuang
Copy link
Author

Hi @kencecka ,

I tested it on our embedded system.
Our team could also see this case on Amazon Intel reference kit before, I will update the logs from Amazon development kit for your reference.

By your test steps, I think it should be the same as my tests. Not sure in your test, will the player still play stream after Internet disconnected (due to it has buffer) ?
By my tests, it will continue to play for about 2 minutes, then no sound. Then in log, you will see the log "[ 3] 9 AttachmentReaderSource:read:size=0,status=1" , then you can plug back the Internet (so no need to wait for a long time).
Anyway, I will try to check this case on another device again, on Amazon reference board or build a Linux PC to try it.

@zeusshuang
Copy link
Author

Hi @kencecka ,

Sorry to update late. Our team verified this case with Amazon development Kit (Intel MIC solution), and we can see the same case.
And you are correct, sometimes, it may return to normal state while Internet disconnected, but most of the case, it will fail as my report, even we found the AVS crashed.
We found, if the player state could change to pause or stop after disconnected, it will be fine.
If it continue to read buffer and read fail, it will become for-ever loop.

Here I attached 5 logs, two are okay to return normal cases, three are failed cases.
I think the root cause should be in player reader, but I have no idea.

faile_play_and_ping_timeout_curl_error.txt
faile_program_interruption.txt
failed_read1351.txt
ok.txt
ok2_timeout.txt

@scotthea-amazon
Copy link
Contributor

Hi @zeusshuang,

Looking at failed_read1351.txt:

The last activity on AudioPlayer's MediaPlayer thread is here:

2018-06-07 03:24:41.649 [  3] 9 AttachmentReaderSource:read:size=1351,status=0

When the state switches to LISTENING, a context is needed to send the Recognize event to AVS, leading to:

2018-06-07 03:39:32.962 [  7] 0 AudioPlayer:provideState:stateRequestToken=3

...which leads to:

2018-06-07 03:39:32.962 [  9] 0 AudioPlayer:executeProvideState:sendToken=true,stateRequestToken=3 

... which leads to:

2018-06-07 03:39:32.962 [  9] 9 MediaPlayer:getOffsetCalled

...which should be handled by AudioPlayer's MediaPlayer thread (i.e. thread 3). However, we never see a subsequent log with 'handleGetOffsetCalled' for thread 3, and a couple of seconds later ContextManager's timeout expires:

2018-06-07 03:39:34.962 [  7] E ContextManager:updateStatesLoopFailed:reason=stateProviderTimedOut

...which cancels the pending recognize event before it is sent.

So, it appears that this is all caused by thread 3 getting stuck. That thread is driven by gstreamer. We expect that besides called things like handleGetOffset() it should be calling AttachmentReaderSource::handleReadData() over and over. If this thread is behaving as we expect, it seems very likely that it is in AttachmentReaderSource::handleReadData()'s call to m_reader->read().

In 1.7.1, that is expected to be a non-blocking call because AudioPlayer::handlePlayDirective() creates a NONBLOCKING attachment reader. In addition, the call to m_reader->read() is called with a 1 millisecond timeout value. Either should prevent this call from blocking.

I suggest that we add a new log line in AttachmentReaderSource::handleReadData(), right before the call to m_reader->read(). For example:

ACSDK_DEBUG9(LX("beforeCallingRead"));

If we reproduce this problem with the new log line, and we see the new log line without a subsequent log line emitted after the call to m_reader->read(), then we can be sure that is where the thread is getting stuck, and then dig deeper in to how that could be happening. If we do not find that pattern in the log, we will know that we need to track down other places this thread could be hung up.

If by chance you can reproduce this problem under a debugger, it might be even easier to just get a backtrace of where this thread is blocked when the problem occurs.

Can you reproduce the problem with the additional log line and/or provide us a backtrace of the blocked thread?

Best regards,
-SWH

@zeusshuang
Copy link
Author

Hi @scotthea-amazon ,

The last logs for you is from Amazon development kit, it's AVS 1.6 (We have no newer to upgrade it, so we tried 1.6).
Sorry, I must cause your confused. To test Amazon development kit is just for double confirm it's not platform porting issue.

I attached the new log by your suggestion.
In AVS 1.7.1, it will not block there, so it will continue to dump read is 0 ( you can see the log from my first report, it's from 1.7.1).
After it goes into this state, it will not response anything after keyword detection is triggered.
(see line#3005 in my attached log)

By our investigating, if the media player is in playing state and finally found no buffer and Internet failed, it will go to stuck. If it's not in playing state, eg. in idle or pause state, it will be safe.

5.log

@sanjayrd
Copy link
Contributor

Hi @zeusshuang,

Thank you for the logs and information. This line in particular is very interesting to me:
2018-06-13 07:01:42.372 [ 1d] E AudioPlayer:onFocusChangedTimedOut:newFocus=BACKGROUND,m_currentActivity=PLAYING

It looks like AudioPlayer is indeed trying to pause and get out of the playing state (as it should be), but is failing. This ties in with what you said about the MediaPlayer getting stuck. I will investigate this and see if it's helpful.

Thanks,
Sanjay

@scotthea-amazon
Copy link
Contributor

Hi @zeusshuang,

Since it appears that the blockage is occurring in the MediaPlayer, and likely within gstreamer, I suggest reproducing the problem with gstreamer's logging enabled. See: debugging tools.

Something else that might help would be to capture a backtrace of all threads once the app is in this problem state. That might tell us where things are getting stuck.

Also, we have still been unable to reproduce the problem from your description. Might there be something special about your media playback, network, or other configuration details that would help us to reproduce the issue?

Thank you,
-SWH

@zeusshuang
Copy link
Author

Hi @scotthea-amazon ,

Thanks for the help.
I am asking SDK 1.8 firmware for Amazon development kit (Pi+Intel solution platform) and I will try it here and get logs for checking.

@BennyAvramson
Copy link
Contributor

Hi,

Thanks for pointing this issue out. We'll include a fix in the next release.

Thanks,
Benny

@kclchan
Copy link
Contributor

kclchan commented Sep 13, 2018

This issue should be fixed in v1.9. Thanks!

@jackeyu
Copy link

jackeyu commented Oct 4, 2018

Hi @kclchan ,

Could you help to point which files need merger into v1.7 from v1.9 If we need to use AVS SDK v1.7 ?

Thanks

@kclchan
Copy link
Contributor

kclchan commented Oct 10, 2018

Hi @jackeyu, unfortunately the fix involves changes to multiple files and is not simple to merge into v1.7. I would recommend upgrading to v1.9.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants