Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue in pcm_dsnoop.c in alsa-lib #213

Closed
TE-N-ShengjiuWang opened this issue Mar 1, 2022 · 5 comments
Closed

Issue in pcm_dsnoop.c in alsa-lib #213

TE-N-ShengjiuWang opened this issue Mar 1, 2022 · 5 comments

Comments

@TE-N-ShengjiuWang
Copy link

TE-N-ShengjiuWang commented Mar 1, 2022

Hi Takashi Iwai, Jaroslav Kysela

We encountered an issue in the pcm_dsnoop use case, could you please help to have a look?

Issue description:
With two instances for dsnoop type device running in parallel, after suspend/resume, one of the instances will be hung in memcpy because the very large copy size is obtained.

#3 0x0000ffffa78d5098 in snd_pcm_dsnoop_sync_ptr (pcm=0xaaab06563da0)
at pcm_dsnoop.c:158
dsnoop = 0xaaab06563c20
slave_hw_ptr = 64
old_slave_hw_ptr = 533120
avail = 187651522444320

Reason analysis:
The root cause that I analysis is that after suspend/resume, one instance will get the SND_PCM_STATE_SUSPENDED state from slave pcm device, then it will do snd_pcm_prepare() and snd_pcm_start(), which will reset the dsnoop->slave_hw_ptr and the hw_ptr of slave pcm device, then the state of this instance is correct. But another instance may not get the SND_PCM_STATE_SUSPENDED state from slave pcm device because slave device may have been recovered by first instance, so the dsnoop->slave_hw_ptr is not reset. but because hw_ptr of slave pcm device has been reset, so there will be a very large "avail" size.

Solution:
I didn't come up with a fix for this issue, seems there is no easy way to let another instance know this case and reset the dsnoop->slave_hw_ptr, could you please help?

@perexg
Copy link
Member

perexg commented Mar 3, 2022

@TE-N-ShengjiuWang
Copy link
Author

TE-N-ShengjiuWang commented Mar 4, 2022

Thanks for the fix. But it doesn't solve the issue totally,
I do two modification base on your fix, after test, the issue can't be reproduced. could you please review?

diff --git a/src/pcm/pcm_direct.c b/src/pcm/pcm_direct.c
index 78716f229a14..5bff4e0836e0 100644
--- a/src/pcm/pcm_direct.c
+++ b/src/pcm/pcm_direct.c
@@ -659,6 +659,14 @@ int snd_pcm_direct_slave_recover(snd_pcm_direct_t *direct)
  */
 int snd_pcm_direct_client_chk_xrun(snd_pcm_direct_t *direct, snd_pcm_t *pcm)
 {
+
+       /*
+        * Add semaphore protection for getting the 'shmptr->s.recoveries'.
+        * As another instance maybe in snd_pcm_direct_slave_recover() but
+        * haven't update the 'shmptr->s.recoveries', then this instance
+        * can't return correct state.
+        */
+       snd_pcm_direct_semaphore_down(direct, DIRECT_IPC_SEM_CLIENT);
        if (direct->shmptr->s.recoveries != direct->recoveries) {
                /* no matter how many xruns we missed -
                 * so don't increment but just update to actual counter
@@ -676,6 +684,7 @@ int snd_pcm_direct_client_chk_xrun(snd_pcm_direct_t *direct, snd_pcm_t *pcm)
                else
                        direct->state = SND_PCM_STATE_XRUN;
        }
+       snd_pcm_direct_semaphore_up(direct, DIRECT_IPC_SEM_CLIENT);
        if (direct->state == SND_PCM_STATE_XRUN)
                return -EPIPE;
        else if (direct->state == SND_PCM_STATE_SUSPENDED)
diff --git a/src/pcm/pcm_dsnoop.c b/src/pcm/pcm_dsnoop.c
index 1653f6fba86c..ee7506bff442 100644
--- a/src/pcm/pcm_dsnoop.c
+++ b/src/pcm/pcm_dsnoop.c
@@ -146,14 +146,22 @@ static int snd_pcm_dsnoop_sync_ptr(snd_pcm_t *pcm)
        default:
                break;
        }
-       err = snd_pcm_direct_client_chk_xrun(dsnoop, pcm);
-       if (err < 0)
-               return err;
        if (dsnoop->slowptr)
                snd_pcm_hwsync(dsnoop->spcm);
        old_slave_hw_ptr = dsnoop->slave_hw_ptr;
        snoop_timestamp(pcm);
        slave_hw_ptr = dsnoop->slave_hw_ptr;
+
+       /*
+        * FIXME: Move snd_pcm_direct_client_chk_xrun after getting the
+        * dsnoop->spcm->hw.ptr. If the snd_pcm_direct_slave_recover()
+        * of another instance happening before dsnoop->spcm->hw.ptr
+        * is got, then a wrong spcm->hw.ptr is got which cause a wrong
+        * 'diff' data later.
+        */
+       err = snd_pcm_direct_client_chk_xrun(dsnoop, pcm);
+       if (err < 0)
+               return err;
        diff = pcm_frame_diff(slave_hw_ptr, old_slave_hw_ptr, dsnoop->slave_boundary);
        if (diff == 0)          /* fast path */
                return 0;

By the way, the RECOVERIES_FLAG_SUSPENDED is in BIT(31), do we need to consider the overflow of recovery counter? if it overflow to BT(31), then the FLAG_SUSPEND should be wrongly set.

Best regards
Wang shengjiu

@tiwai
Copy link
Contributor

tiwai commented Mar 4, 2022 via email

@TE-N-ShengjiuWang
Copy link
Author

Thanks

On Fri, 04 Mar 2022 09:29:54 +0100, Shengjiu Wang wrote: Thanks for the fix. But it doesn't solve the issue totally, I do two modification base on your fix, after test, the issue can't be reproduced. could you please review?
Could you rather submit a proper patch? I hesitated to apply the semaphore around snd_pcm_direct_client_chk_xrun() as it's a hot path. But it seems that we have to do it in anyway, and if any, this could be another patch (the function is already present and used for xrun checks). And, we can implement a fast-path there as well; if direct->state is XRUN or SUSPENDED at the beginning of the function, returns immediately. Or, we may move the update of recoveries count in snd_pcm_direct_slave_recover() earlier, almost at the beginning of the procedure. Then other in parallel can catch the change, practically works even without semaphore.

I saw you have update the origin/topic/pcm-direct-resume branch, I test your latest change, it is more stable than before, but still meet once of the issue after overnight test, it it very very low possibility.

So I suggest if we need to do below change, shall we?

diff --git a/src/pcm/pcm_dsnoop.c b/src/pcm/pcm_dsnoop.c
index 729ff447b41f..cc333b3f4384 100644
--- a/src/pcm/pcm_dsnoop.c
+++ b/src/pcm/pcm_dsnoop.c
@@ -134,14 +134,21 @@ static int snd_pcm_dsnoop_sync_ptr(snd_pcm_t *pcm)
        snd_pcm_sframes_t diff;
        int err;

-       err = snd_pcm_direct_check_xrun(dsnoop, pcm);
-       if (err < 0)
-               return err;
        if (dsnoop->slowptr)
                snd_pcm_hwsync(dsnoop->spcm);
        old_slave_hw_ptr = dsnoop->slave_hw_ptr;
        snoop_timestamp(pcm);
        slave_hw_ptr = dsnoop->slave_hw_ptr;
+       /*
+        * FIXME: Move snd_pcm_direct_client_chk_xrun after getting the
+        * dsnoop->spcm->hw.ptr. If the snd_pcm_direct_slave_recover()
+        * of another instance happening before dsnoop->spcm->hw.ptr
+        * is got, then a wrong spcm->hw.ptr is got which cause a wrong
+        * 'diff' data later.
+        */
+       err = snd_pcm_direct_check_xrun(dsnoop, pcm);
+       if (err < 0)
+               return err;
        diff = pcm_frame_diff(slave_hw_ptr, old_slave_hw_ptr, dsnoop->slave_boundary);

By the way, the RECOVERIES_FLAG_SUSPENDED is in BIT(31), do we need to consider the overflow of recovery counter? if it overflow to BT(31), then the FLAG_SUSPEND should be wrongly set.
It's masked with RECOVERIES_MASK. Takashi

@perexg
Copy link
Member

perexg commented May 3, 2022

For reference, this issue was fixed in bb9f258 .

@perexg perexg closed this as completed May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants