timestamp_offset vs. frame_offset #23

Closed
bryanhpchiang opened this issue Jul 19, 2023 · 8 comments · Fixed by #25

@bryanhpchiang

[screenshot of the code in question]

What's the difference between timestamp_offset and frame_offset? I get what timestamp_offset is doing, but I'm not sure why frame_offset is also necessary. Thanks!

@makaveli10
Collaborator

frame_offset is used just to get rid of the stale audio frames that have already been processed here. But we then use it to update timestamp_offset as well.

We need to get rid of the frames because np.concatenate becomes time-consuming as the size of the array grows.
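
For anyone skimming, here is a minimal sketch of that rolling-buffer idea (hypothetical names, not the repo's exact code; the 45 s/30 s thresholds come from the discussion below, and the increment on the last line is exactly what the rest of this thread pins down):

```python
import numpy as np

RATE = 16000  # sample rate assumed here for illustration

class AudioBuffer:
    """Minimal sketch (hypothetical class) of the rolling buffer described above."""

    def __init__(self):
        self.frames_np = np.zeros(0, dtype=np.float32)
        self.frames_offset = 0.0     # absolute time (s) where frames_np begins
        self.timestamp_offset = 0.0  # absolute time (s) already transcribed

    def add_frames(self, chunk):
        # Appending forever makes np.concatenate progressively slower,
        # so once the buffer exceeds 45 s, drop the oldest 30 s.
        self.frames_np = np.concatenate([self.frames_np, chunk])
        if len(self.frames_np) > 45 * RATE:
            self.frames_np = self.frames_np[int(30 * RATE):]
            self.frames_offset += 30.0  # must match the seconds removed (see below)
```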

@bryanhpchiang
Author

Thanks for explaining.

[screenshot of the buffer-clipping code]

This part is a bit confusing to me.

It sounds like we're saying "hey, now frames_np represents the audio starting 45 seconds into the recording," but then only the first 30 seconds is removed?

Which makes sense, you want to keep that 15 s overlap, but then the calculation below is a bit off?

So: timestamp_offset = seconds of audio already processed; frames_offset = position of frames_np in the absolute audio timeline.

[screenshot of the samples_take calculation]

If timestamp_offset = 60 s and frames_offset = 45 s, then we would take frames_np[15 * RATE:], but in reality we would want frames_np[30 * RATE:]?

@makaveli10
Collaborator

makaveli10 commented Jul 20, 2023

> It sounds like we're saying "hey, now frames_np represents the audio starting 45 seconds into the recording," but then only the first 30 seconds is removed?

Yes, we keep 15 seconds to process; we don't remove the whole 45 seconds because that might contain unprocessed audio frames. We only remove the first 30 seconds when the length is more than 45 seconds.

> If timestamp_offset = 60 s and frames_offset = 45 s, then we would take frames_np[15 * RATE:], but in reality we would want frames_np[30 * RATE:]?

L158, i.e.:

samples_take = max(0, (self.timestamp_offset - self.frames_offset)*self.RATE)
input_bytes = self.frames_np[int(samples_take):].copy()

Let's say the duration of your frames_np is t. If frame_offset is 45 and timestamp_offset is 60, then according to this logic the samples that need to be processed are anything after the first timestamp_offset - frame_offset = 15 seconds, so input_bytes should ideally have (t - 15) * RATE samples. Let me know if this doesn't make sense.
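
Plugging in the numbers above (a quick arithmetic check, not repo code):

```python
RATE = 16000
timestamp_offset = 60  # seconds already transcribed
frames_offset = 45     # absolute time where frames_np begins

# Same formula as the snippet above:
samples_take = max(0, (timestamp_offset - frames_offset) * RATE)
print(samples_take / RATE)  # 15.0 -> input_bytes starts 15 s into frames_np,
                            # i.e. at absolute time 45 + 15 = 60 s
```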

@makaveli10
Collaborator

Closing this. Feel free to re-open if you have any queries.

@bryanhpchiang
Author

Hey! Thanks for clarifying.

Here's an example to show why I'm confused:

So initially, frames_offset = timestamp_offset = 0.

Let's say audio comes in and things are getting transcribed (so the clipping here never occurs):

[screenshot of the clipping condition]

Okay, now timestamp_offset reaches 50 seconds, so this is triggered:

So now frames_np contains around 50 - 30 = 20 seconds' worth of audio, and frames_offset is 45.

[screenshot]

The next time we try to transcribe, the last 15 seconds of audio in frames_np are used as the input_bytes (the 15 comes from skipping the first 50 - 45 = 5 seconds of the 20 s buffer).

Now let's say in the current loop we find two segments in those 15 seconds of audio, say segment s1 with s=0, e=5, and segment s2 with s=7, e=10.

[screenshot of the segment timestamping code]

When we add the first segment (s1) to the transcript, it's going to go in as start = 50 + 0 = 50, end = 50 + 5 = 55, but that'd be incorrect, since the segment actually starts at 35 seconds in the absolute timeline.
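
Walking through those numbers in code (a back-of-the-envelope check, assuming segments are stamped as timestamp_offset + segment start):

```python
# Numbers from the example above, with frames_offset set to 45:
removed = 30                    # seconds actually dropped from frames_np
buffer_start_abs = 0 + removed  # frames_np really begins at 30 s absolute
frames_offset = 45              # but the code believes it begins at 45 s
timestamp_offset = 50

skip = max(0, timestamp_offset - frames_offset)  # 5 s into frames_np
input_start_abs = buffer_start_abs + skip        # 35 s absolute, not 50 s

s1_start = 0                            # segment start, relative to input_bytes
reported = timestamp_offset + s1_start  # 50 s: what gets written
actual = input_start_abs + s1_start     # 35 s: ground truth
print(reported - actual)                # 15 s discrepancy
```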

Trying to figure out where I'm going wrong. Thanks for helping with this!

@bryanhpchiang
Author

I think the core issue is just that already-processed audio ends up getting reprocessed?

@makaveli10
Collaborator

makaveli10 commented Jul 26, 2023

Thanks for your curiosity, I had a deeper look and found a typo. I was previously removing 45 s of audio when I had 60 s; I changed the 60 s to 45 s but missed updating the 45 s to 30 s. So here we should ideally increment frames_offset by 30 s:

self.frames_offset += 30

This way frames_offset never exceeds timestamp_offset. Consider this example.

Let's say our timestamp_offset is 44 s (with only one segment being output from Whisper), so we have more audio coming in and we update frames_offset:

timestamp_offset = 44
frames_offset = 30

We should be processing anything after 44 seconds, since timestamp_offset tells us what has been processed already.

So from frames_np, which is now 15 s long (30-45 s in absolute audio time), the point we start processing from is timestamp_offset - frames_offset = 44 - 30 = 14 seconds in:

samples_take = max(0, (self.timestamp_offset - self.frames_offset)*self.RATE)
input_bytes = self.frames_np[int(samples_take):].copy()

We take anything after 14 seconds, which is correct, because the audio frame at 14 s in frames_np is the same as the audio frame at 44 s would have been if we hadn't removed anything from frames_np. Here is a plot over a 500 s audio with frames_offset incremented by 30 s instead of 45 s:

[plot: timestamp_offset vs. frames_offset over 500 s of audio]

It shows that we are not reprocessing. If we were, timestamp_offset would have fallen below frames_offset, and only then would

samples_take = max(0, (self.timestamp_offset - self.frames_offset)*self.RATE)

clamp to samples_take = 0, so that

input_bytes = self.frames_np[0:].copy()

picks up already-processed samples.

So,

self.frames_offset += 30

should resolve the issues that you are seeing.
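
As a sanity check, here is a toy simulation of that invariant (a sketch under assumptions, not the repo's code: audio arrives in 1 s steps and the transcriber lags the stream by 5 s):

```python
def simulate(total_seconds=500, lag=5.0, increment=30.0):
    """Return True if timestamp_offset never falls below frames_offset."""
    head = 0.0              # absolute seconds of audio received so far
    buffer_start = 0.0      # true absolute start of frames_np
    frames_offset = 0.0     # what the code believes the start is
    timestamp_offset = 0.0  # absolute seconds already transcribed
    for _ in range(total_seconds):
        head += 1.0                     # one second of audio arrives
        if head - buffer_start > 45.0:  # len(frames_np) > 45 s: clip
            buffer_start += 30.0        # 30 s of samples are always dropped
            frames_offset += increment  # the increment under discussion
        # assume the transcriber keeps up, minus a small lookback lag
        timestamp_offset = max(timestamp_offset, head - lag)
        if timestamp_offset < frames_offset:
            return False                # samples_take clamps to 0: reprocessing
    return True

print(simulate(increment=30.0))  # True: the fix, no reprocessing
print(simulate(increment=45.0))  # False: the old off-by-15 s behaviour
```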

@bryanhpchiang
Author

Awesome, thanks for looking into this!
