
Added TimestampRulesFilter implementation #45

Merged: 12 commits, Mar 22, 2024

Conversation

@jkrukowski (Contributor)

This PR adds an implementation of TimestampRulesFilter. The implementation is based on https://github.com/openai/whisper/blob/master/whisper/decoding.py#L441

A couple of questions here, @ZachNagengast:

  • The sampleBegin param passed to TimestampRulesFilter is 0, which I think might be incorrect. I compared it to the Python implementation in the OpenAI repo, where this param is always greater than or equal to 3 (and this makes sense: the first 3 tokens are special tokens, 50258, 50259, and 50359, and AFAIK we don't want to suppress them). If you run this code as is, some segments might be omitted (because sampleBegin is 0; if you change it to 3, it should be OK). See the sketch after this list.
  • This implementation slows down the whole inference code; maybe you have some ideas on how to optimize it?
  • You mentioned that it has duplicated logic with SegmentSeeker, but I don't see it (AFAIK TimestampRulesFilter just suppresses the token probabilities, while SegmentSeeker creates the whole segments). Could you please clarify?
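A minimal sketch of the first point (the token IDs are the ones quoted above; they are the multilingual Whisper start-of-transcript, language, and task tokens):

    let promptTokens = [50258, 50259, 50359] // <|startoftranscript|>, language token, task token
    let sampleBegin = promptTokens.count     // 3: the rules should only apply to tokens sampled after the prompt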

@jkrukowski (Contributor, Author) commented on this change:

    - XCTAssertEqual(result?.segments.count, 2, "Expected 2 segments")
    + XCTAssertEqual(result?.segments.count, 3, "Expected 3 segments")

Enabled timestamps are causing more segments to appear.

@ZachNagengast linked an issue on Mar 8, 2024 that may be closed by this pull request.

@ZachNagengast (Contributor)

@jkrukowski I pushed a small commit to measure the logit filtering time. Here is what I'm getting for tiny, with and without these new timestamp rules, on the jfk.wav file:
With:
[WhisperKit] - Logit Filtering: 192.41 ms / 28 runs ( 6.87 ms/run) 37.78%
Without:
[WhisperKit] - Logit Filtering: 0.07 ms / 28 runs ( 0.00 ms/run) 0.02%

This is a bit high, and it becomes especially noticeable with the tiny model. Interestingly, only the first and last few tokens are slow (graph by ChatGPT). This is for jfk.wav:

[image: per-token logit filtering time for jfk.wav, slow at the first and last few tokens]

Hopefully this gives you some guidance on where to look for optimizations. The majority of the slowdown is in this block of code:

            // timestamps have to appear in pairs, except directly before EOT; mask logits accordingly
            let sampledTokens = tokens[sampleBegin...]
            let lastWasTimestamp = sampledTokens.count >= 1 && sampledTokens.last! >= timeTokenBegin
            let penultimateWasTimestamp = sampledTokens.count < 2 || sampledTokens.dropLast().last! >= timeTokenBegin
            if lastWasTimestamp {
                if penultimateWasTimestamp {
                    // has to be non-timestamp
                    logits.fillLastDimension(indexes: timeTokenBegin..<logits.count, with: -FloatType.infinity)
                } else {
                    // cannot be normal text tokens
                    logits.fillLastDimension(indexes: 0..<endToken, with: -FloatType.infinity)
                }
            }

@jkrukowski (Contributor, Author)

> @jkrukowski I pushed a small commit to measure the logit filtering time [...] The majority of the slowdown is in this block of code: [...]

@ZachNagengast I've added a more performant version of the fillLastDimension function, and it seems to be doing better. This is what I get for a release build on the jfk.wav file:

[WhisperKit] ---- Transcription Timings ----
[WhisperKit] Audio Load:              2.33 ms /      1 runs (    2.33 ms/run)  0.66%
[WhisperKit] Audio Processing:        0.11 ms /      1 runs (    0.11 ms/run)  0.03%
[WhisperKit] Mels:                   35.53 ms /      1 runs (   35.53 ms/run) 10.11%
[WhisperKit] Encoding:               13.39 ms /      1 runs (   13.39 ms/run)  3.81%
[WhisperKit] Matrices Init:           0.22 ms /      1 runs (    0.22 ms/run)  0.06%
[WhisperKit] Prefill:                 0.00 ms /      1 runs (    0.00 ms/run)  0.00%
[WhisperKit] Decoding:              239.40 ms /     28 runs (    8.55 ms/run) 68.15%
[WhisperKit] Non-inference:          61.25 ms /     28 runs (    2.19 ms/run) 17.43%
[WhisperKit] - Logit Filtering:       3.24 ms /     28 runs (    0.12 ms/run)  0.92%
[WhisperKit] - Sampling:             14.17 ms /     28 runs (    0.51 ms/run)  4.03%
[WhisperKit] - Kv Caching:            2.79 ms /     28 runs (    0.10 ms/run)  0.80%
[WhisperKit] - Word Timestamps:       0.00 ms /      0 runs (    0.00 ms/run)  0.00%
[WhisperKit] - Windowing:             0.08 ms /      1 runs (    0.08 ms/run)  0.02%
[WhisperKit] Fallbacks:               0.00 ms /      0 runs (    0.00 ms/run)  0.00%
[WhisperKit] Decoding Full Loop:    351.06 ms /     28 runs (   12.54 ms/run) 99.93%
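For reference, the core of that kind of optimization can be sketched like this; a minimal sketch assuming Float32 logits in an MLMultiArray with the vocabulary on the last axis, not the exact committed code:

    import CoreML

    extension MLMultiArray {
        /// Fill a range of the (flattened) last dimension with a constant value.
        /// Writing through the raw buffer avoids per-element MLMultiArray
        /// subscripting, which boxes every index in NSNumber and is slow.
        func fillLastDimension(indexes: Range<Int>, with value: Float) {
            precondition(dataType == .float32, "sketch assumes Float32 logits")
            let pointer = dataPointer.assumingMemoryBound(to: Float.self)
            for index in indexes {
                pointer[index] = value
            }
        }
    }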

@ZachNagengast (Contributor)

Much better! This looks in line with what I was seeing for those faster middle tokens previously. Think this is ready to come out of draft now?

@jkrukowski (Contributor, Author)

> Much better! This looks in line with what I was seeing for those faster middle tokens previously. Think this is ready to come out of draft now?

Good to hear. Two things are left:

  1. self.sampleBegin = 3 // FIXME: it should not be a hardcoded value -- I'm not sure what value to put there
  2. the force unwrapping in sumOfProbabilityOverTimestampsIsAboveAnyOtherToken: maybe we should not force unwrap and instead return false gracefully, wdyt?

@ZachNagengast (Contributor)

> 1. self.sampleBegin = 3 // FIXME: it should not be a hardcoded value -- I'm not sure what value to put there

PrefilledIndex is already being passed into this function, but I think it should actually use initialPromptIndex. A good test to add for accuracy on this would be similar to this one:

    func testSampleLength() async {

where you'd create a bunch of options that change initialPromptIndex and make sure it's working properly (a hypothetical shape for such a test is sketched below).
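A hypothetical shape for such a test; the option names are from DecodingOptions, but the transcribe helper and the assertion are assumptions for illustration, not the repo's exact test API:

    func testTimestampRulesSampleBegin() async throws {
        for usePrefill in [true, false] {
            var options = DecodingOptions()
            options.usePrefillPrompt = usePrefill // changes how many prompt tokens precede sampling
            options.withoutTimestamps = false     // keep the timestamp rules active
            // `transcribe(with:options:)` is an assumed test helper
            let result = try await transcribe(with: .tiny, options: options)
            XCTAssertFalse(result?.segments.isEmpty ?? true,
                           "Expected segments with usePrefillPrompt=\(usePrefill)")
        }
    }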

> 2. the force unwrapping in sumOfProbabilityOverTimestampsIsAboveAnyOtherToken: maybe we should not force unwrap and instead return false gracefully, wdyt?

Besides the verbosity I think it's OK. If you want to be extra safe, you can wrap that whole part in a do/catch and log an error, similar to the sampling code. I'm not sure of all the scenarios where BNNS will throw, but returning false would just fall back to the default behavior, so no issues there.
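A sketch of that defensive variant; logSoftmax and logSumExp here are plain-Swift stand-ins for the BNNS-backed helpers, not the repo's exact API:

    import Foundation

    enum FilterError: Error { case emptyLogits }

    // Stand-in for the throwing BNNS-backed log-softmax
    func logSoftmax(_ x: [Float]) throws -> [Float] {
        guard let maxVal = x.max() else { throw FilterError.emptyLogits }
        let logSum = maxVal + log(x.reduce(Float(0)) { $0 + exp($1 - maxVal) })
        return x.map { $0 - logSum }
    }

    func logSumExp(_ x: ArraySlice<Float>) -> Float {
        guard let maxVal = x.max() else { return -.infinity }
        return maxVal + log(x.reduce(Float(0)) { $0 + exp($1 - maxVal) })
    }

    func sumOfProbabilityOverTimestampsIsAboveAnyOtherToken(
        logits: [Float],
        timeTokenBegin: Int
    ) -> Bool {
        do {
            let logProbs = try logSoftmax(logits)
            guard timeTokenBegin < logProbs.count else { return false }
            // total timestamp probability vs. the single best text token
            let timestampLogProb = logSumExp(logProbs[timeTokenBegin...])
            let maxTextLogProb = logProbs[..<timeTokenBegin].max() ?? -.infinity
            return timestampLogProb > maxTextLogProb
        } catch {
            // On failure, return false: the filter falls back to default behavior
            return false
        }
    }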

@jkrukowski marked this pull request as ready for review on March 20, 2024 at 11:33.
@ZachNagengast (Contributor) left a comment:

Approving for the pre-release tests, but I'm curious about your thoughts on the comments; they can be a future PR.

@ZachNagengast (Contributor) commented on this change:

    - public init(tokenizer: Tokenizer, sampleBegin: Int) {
    -     // TODO: implement
    -     fatalError("Not implemented: \(#function)")
    + public init(

This interface is complex in the same way SegmentSeeker's is. I believe most of these values are on the tokenizer object, but using them would require passing it in.

@jkrukowski (Contributor, Author)

You're right, it's complex. I didn't want to make it dependent on Tokenizer, so it's decoupled and relatively easier to test. I can change it if you think otherwise.

@ZachNagengast (Contributor)

No problem, your logic makes sense. We may want a simple object like SpecialTokens in the future and extend the tokenizer with it, rather than just adding these index properties as extensions.
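A hypothetical sketch of that SpecialTokens idea (not part of this PR; the Tokenizer protocol here is a stand-in, and the fallback IDs are the multilingual Whisper defaults):

    protocol Tokenizer {
        func convertTokenToId(_ token: String) -> Int?
    }

    struct SpecialTokens {
        let endToken: Int
        let timeTokenBegin: Int
        let noTimestampsToken: Int
    }

    extension Tokenizer {
        var specialTokens: SpecialTokens {
            SpecialTokens(
                endToken: convertTokenToId("<|endoftext|>") ?? 50257,
                timeTokenBegin: convertTokenToId("<|0.00|>") ?? 50364,
                noTimestampsToken: convertTokenToId("<|notimestamps|>") ?? 50363
            )
        }
    }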

@@ -135,6 +144,10 @@ func tokenizerNameForVariant(_ variant: ModelVariant) -> String {
return tokenizerName
}

func isModelMultilingual(logitsDim: Int?) -> Bool {
@ZachNagengast (Contributor)

We have this already here:

    public var isMultilingual: Bool {

Thoughts on combining them? I think checking logitsDim is more robust; perhaps it can be set on the model or the TextDecoder on load, here:

    if let logitsDim = textDecoder.logitsSize,
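One possible shape for the combined check, under the assumption that English-only Whisper vocabularies have 51864 logits and multilingual ones are larger (51865, or 51866 for large-v3):

    func isModelMultilingual(logitsDim: Int?) -> Bool {
        // Nil means the logits size is unknown; conservatively assume English-only
        guard let logitsDim else { return false }
        return logitsDim != 51864
    }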

@jkrukowski (Contributor, Author)

Fair enough, I can tackle it in a separate PR.

Sources/WhisperKit/Core/WhisperKit.swift (review comment resolved)
@ZachNagengast merged commit 508240f into argmaxinc:main on Mar 22, 2024.
11 of 12 checks passed
Successfully merging this pull request may close these issues.

Timestamp Rules Logits Filter