[Research] Audio-video offset synchronization and audio+input delay calibration #87

Closed
dtinth opened this issue Feb 4, 2015 · 7 comments

@dtinth
Member

dtinth commented Feb 4, 2015

Assumptions

  • Response time is zero. When the player hears the sound or sees that the note hits the judgement area, he/she pushes the button immediately.
  • When the game emits the sound, it takes time _S_ until the player hears it. (Audio latency)
  • When the game renders the display, it takes time _D_ until the player sees it. (Video latency)
  • When the player hits the button, it takes time _I_ until the computer recognizes it. (Input latency)

Findings

  • It is impossible to measure/calibrate the values of _S_, _D_, and _I_ separately.
  • But it is possible to find these values:
    • _S+I_ (Audio + input latency)
    • _D+I_ (Video + input latency)
    • _S-D_ (Audio-video offset)

Methods

Calibration

  1. Measure _A_ = _S+I_

    This is the time it takes from when the sound is emitted to when the computer recognizes the button press. This means that when the computer emits a sound at time _t_, it will receive the button press at time _t' = t+A_. To compensate, when a button press is received at time _t'_, judgement must be performed for time _t'-A = t_.

  2. Measure _B_ = _S-D_

    This is the audio/video offset. It can be measured by letting the user adjust the value until audio and video appear in sync. With this value, we adjust the display to show notes at time _t+B_.

    • When the computer emits a sound at time _t_, the player will hear it at time _t+S_.
    • When the computer draws graphics at time _t+B_, the player will see them at time _t+B+D = t+(S-D)+D = t+S_, so the player will perceive audio and video as being in sync.

Gameplay

  • sound(t): At time _t_, emit sound as usual.
  • display(t+B): At time _t_, emit graphic in sync with the sound.
  • judgment(t-A): Judge notes _A_ units of time behind the sound emission.
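
The scheme above can be sketched as follows. This is a hypothetical illustration, not code from the actual game; the offset values and function names are made up.

```python
# Hypothetical sketch of applying the calibrated offsets
# A = S+I (audio + input latency) and B = S-D (audio-video offset).
# The concrete values below are made-up examples, in seconds.

A = 0.030  # calibrated audio + input latency
B = 0.010  # calibrated audio-video offset

def sound_time(t):
    """Emit the sound at the note's nominal time, unadjusted."""
    return t

def display_time(t):
    """Draw the note B later, so the player perceives audio and video in sync."""
    return t + B

def judgment_time(t_press):
    """Map the wall-clock press time back onto the note timeline."""
    return t_press - A

# A press received at wall-clock time t' = t + A judges as exactly t:
t = 1.000
assert abs(judgment_time(t + A) - t) < 1e-9
```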

Auto-Keysounds

This is a problem if the value of _A_ is large: if the player hits the note at the correct time, they will hear delayed keysounds.

That's why in most keysounded music games, the user has to press the button significantly before the sound is emitted. Examples include DJMAX Technika and Tone Sphere.

Therefore, we have to play the keysound for the player, so that the player hears it at the correct time. This is called "Auto-sound" in Open2Jam.

Enabling AutoSound

AutoSound will only be enabled when the value of _A_ is significantly large; I'd say 16 milliseconds. A warning message should appear in the synchronization dialog, saying that AutoSound has been enabled to compensate for the audio delay.

AutoSound Mechanics

  • Each player has a playing state _p_, which defaults to true.
    • This state is true when the player is actively playing and false when the player isn't playing any notes.
    • When the user doesn't play any notes, we don't want keysounds to be emitted automatically, so we suspend the AutoSound mechanism until the user hits a note again, setting _p_ back to true.
  • Each note _n_ has a keysound state k(n), which defaults to "NONE".
  • If the player hits the note _n_ before it is sounded (k(n) is "NONE"):
    • Emit the keysound.
    • Set k(n) to "EMITTED".
    • Set _p_ to true.
  • If it's time for the note (_t_ = t(n)), k(n) is "NONE", and the player is playing (_p_ is true):
    • Emit the keysound.
    • Set k(n) to "EMITTED".
  • If the note is missed and k(n) is "EMITTED":
    • Stop the keysound.
    • Set _p_ to false.
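
The state machine above could look something like this. It is a minimal sketch; the `Note` and `AutoSound` names and the callback interface are illustrative, not from the actual codebase.

```python
# Hypothetical sketch of the AutoSound state machine described above.

class Note:
    def __init__(self, time, keysound):
        self.time = time
        self.keysound = keysound
        self.state = "NONE"  # the keysound state k(n)

class AutoSound:
    def __init__(self, play, stop):
        self.play = play     # callback: emit a keysound
        self.stop = stop     # callback: silence a keysound
        self.playing = True  # the playing state p, defaults to true

    def on_hit(self, note):
        """Player hits note n before it is sounded."""
        if note.state == "NONE":
            self.play(note.keysound)
            note.state = "EMITTED"
            self.playing = True

    def on_time(self, note):
        """It's time for note n (t = t(n))."""
        if note.state == "NONE" and self.playing:
            self.play(note.keysound)
            note.state = "EMITTED"

    def on_miss(self, note):
        """Note n is judged as missed."""
        if note.state == "EMITTED":
            self.stop(note.keysound)
            self.playing = False
```

Once a note is missed, `playing` becomes false and later notes stay silent until the player hits one, which flips `playing` back to true.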
@dtinth
Member Author

dtinth commented Feb 4, 2015

Calibrating the value of _A_

Methods

  • Play a song and instruct the user to press the button on every beat.
  • Record at least 56 samples of _A_ (the time the keypress is registered minus the time the sound is emitted).
  • Analyze the recorded samples.
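
The measurement itself is straightforward. A sketch, using made-up example timestamps:

```python
# Hypothetical sketch of collecting samples of A: each sample is the time
# a keypress is registered minus the time the corresponding beat's sound
# was emitted. The timestamps below are made-up example data, in seconds.
from statistics import mean, stdev

sound_times = [0.0, 0.5, 1.0, 1.5]          # when each beat's sound was emitted
press_times = [0.031, 0.528, 1.035, 1.526]  # when each keypress was registered

samples = [p - s for p, s in zip(press_times, sound_times)]
a_estimate = mean(samples)  # estimated audio + input latency A
spread = stdev(samples)     # sample standard deviation
```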

Status

Data have been collected and are being analyzed.

@dtinth
Member Author

dtinth commented Feb 25, 2015

From the data analysis, we have found that the delay is 13.96 ms on average. This is very small, so we are pretty safe to use a 0 ms delay as the default value.

The average standard deviation of the delay is 21.57 milliseconds. This means that, on average, we are 99% confident that the actual delay is within 6.59 ms of the obtained mean.

This shows that our method is quite effective.

Our advisor, @jittat, suggested that we can try to reduce the number of required samples, so that the calibration process would become shorter.


@Nachanok

Try simulating the same experiment, but using only the first _n_ samples. Find the smallest value of _n_ such that the average resulting 99% confidence interval half-width is less than 10 milliseconds.
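
The suggested simulation could be sketched like this. It is a hypothetical outline: `t_crit_for` stands in for a lookup of the two-tailed 99% t critical value for a given number of degrees of freedom (e.g. from a t-table or `scipy.stats.t.ppf(0.995, df)`), and the data shapes are assumptions.

```python
# Hypothetical sketch: truncate each person's taps to the first n samples,
# compute each person's 99% confidence-interval half-width, average them,
# and find the smallest n where that average drops below 10 ms.
from math import sqrt
from statistics import mean, stdev

def ci_halfwidth(samples, t_crit):
    """Half-width of the confidence interval for the mean of `samples`."""
    return t_crit * stdev(samples) / sqrt(len(samples))

def average_halfwidth(all_samples, n, t_crit):
    """Average half-width across people, each truncated to their first n taps."""
    return mean(ci_halfwidth(person[:n], t_crit) for person in all_samples)

def smallest_sufficient_n(all_samples, t_crit_for, limit_ms=10.0):
    """Smallest n whose average half-width is below `limit_ms`, or None."""
    max_n = min(len(person) for person in all_samples)
    for n in range(2, max_n + 1):
        if average_halfwidth(all_samples, n, t_crit_for(n - 1)) < limit_ms:
            return n
    return None
```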

@dtinth dtinth mentioned this issue Mar 4, 2015
@Nachanok
Contributor

Nachanok commented Mar 4, 2015

@dtinth
Member Author

dtinth commented Mar 5, 2015

@Nachanok Thank you! I think it's better for you to focus on the skin's code. I'll continue from here. 😄

@dtinth dtinth assigned dtinth and unassigned Nachanok Mar 5, 2015
@dtinth
Member Author

dtinth commented Mar 5, 2015

I've done some changes to our calculation:

Since our sample is going to be small, and we only have the sample standard deviation (we don't know the population standard deviation), I used the t-score instead of the z-score.
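
To illustrate the difference, here is a made-up numerical example (the 21.57 ms figure is the average sample standard deviation reported earlier; the t critical value is approximated from a standard t-table):

```python
# Hypothetical illustration: for small n, the 99% t critical value is
# larger than the normal (z) one, so the confidence interval is wider.
from math import sqrt

n = 35
sample_sd = 21.57  # average sample standard deviation from the data, in ms
z_crit = 2.576     # two-tailed 99% normal (z) critical value
t_crit = 2.73      # approx. two-tailed 99% t critical value for df = 34

z_halfwidth = z_crit * sample_sd / sqrt(n)
t_halfwidth = t_crit * sample_sd / sqrt(n)
# t_halfwidth > z_halfwidth: the t-based interval is wider, as expected
```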

Some communication problems also led to the sample data being incorrectly trimmed.

@dtinth
Member Author

dtinth commented Mar 5, 2015

Here are the confidence interval results (half-widths, in milliseconds) for various values of _n_. For each value of _n_, each person's tapping pattern is truncated to the first _n_ taps.

| n | 90% | 95% | 99% |
| --- | --- | --- | --- |
| 9999 | 3.940332 | 4.723988 | 5.654889 |
| 50 | 4.623089 | 5.554879 | 6.670214 |
| 49 | 4.613531 | 5.544791 | 6.660451 |
| 48 | 4.884783 | 5.869317 | 7.047786 |
| 47 | 4.927309 | 5.921907 | 7.113446 |
| 46 | 4.976760 | 5.982942 | 7.189458 |
| 45 | 5.037827 | 6.058083 | 7.282660 |
| 44 | 5.107284 | 6.143473 | 7.388458 |
| 43 | 5.138891 | 6.183502 | 7.439988 |
| 42 | 5.137204 | 6.183623 | 7.443767 |
| 41 | 5.494494 | 6.611390 | 7.954823 |
| 40 | 5.514859 | 6.638205 | 7.990986 |
| 39 | 5.581117 | 6.720470 | 8.094260 |
| 38 | 5.661806 | 6.820375 | 8.219230 |
| 37 | 5.737184 | 6.914181 | 8.337363 |
| 36 | 5.784753 | 6.974789 | 8.416010 |
| 35 | 5.761313 | 6.950079 | 8.392228 |
| 34 | 6.284942 | 7.577877 | 9.143717 |
| 33 | 6.369274 | 7.683485 | 9.277815 |
| 32 | 6.447250 | 7.781886 | 9.404011 |
| 31 | 6.690453 | 8.080363 | 9.773102 |
| 30 | 6.598211 | 7.974306 | 9.653954 |
| 29 | 6.629881 | 8.018521 | 9.717638 |
| 28 | 6.577397 | 7.961592 | 9.659865 |
| 27 | 7.366215 | 8.909082 | 10.796907 |
| 26 | 7.480202 | 9.054390 | 10.985765 |
| 25 | 7.561358 | 9.161038 | 11.129612 |
| 24 | 7.778887 | 9.434326 | 11.478393 |
| 23 | 7.907150 | 9.601103 | 11.700677 |
| 22 | 8.114986 | 9.866621 | 12.047054 |
| 21 | 8.375315 | 10.198808 | 12.479895 |
| 20 | 9.080049 | 11.039995 | 13.479732 |
| 19 | 9.291662 | 11.314664 | 13.845325 |
| 18 | 9.602974 | 11.714706 | 14.371431 |
| 17 | 10.136242 | 12.391364 | 15.247396 |
| 16 | 10.484268 | 12.849220 | 15.867984 |
| 15 | 10.148322 | 12.475783 | 15.474883 |
| 14 | 9.879257 | 12.191519 | 15.205670 |

Each song section is 28 hits. Therefore, with only one section, we are 98% sure that our observed mean will be within 10 ms of the actual audio+input delay.

To be more sure (99%), I think obtaining just 35 samples is good enough.

@dtinth dtinth changed the title Audio-video offset synchronization and audio+input delay calibration [Research] Audio-video offset synchronization and audio+input delay calibration Mar 7, 2015
@dtinth
Member Author

dtinth commented Mar 7, 2015

I think the preliminary research has obtained satisfactory results.

The task of implementing them in the game will be another issue.

@dtinth dtinth closed this as completed Mar 7, 2015
@dtinth dtinth removed the c:ready label Mar 7, 2015