Cache attention keys + values to speed up inference #216
Conversation
# NOTE (epwalsh): we need to initialize the attn bias in order for attn to work properly
# with key+value cache. Otherwise `F.scaled_dot_product_attention()` doesn't seem to compute
# scores correctly.
or past_key_values is not None
I'm not sure why `F.scaled_dot_product_attention()` doesn't give the right scores in this situation, but since this works now I stopped looking into it.
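For context, here is a minimal sketch of what attention with a key/value cache might look like when passing an explicit additive attention bias to `F.scaled_dot_product_attention()` rather than relying on `is_causal`. The helper name, tensor shapes, and the `attn_bias` argument are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q, k, v, past_key_values=None, attn_bias=None):
    """Hypothetical sketch of one attention step with a key/value cache.

    q, k, v: (batch, n_heads, new_seq_len, head_dim) for the *new* tokens only.
    past_key_values: optional (past_k, past_v) tensors from earlier decoding steps.
    attn_bias: explicit additive bias covering the full (cached + new) key length;
        per the note above, passing an explicit bias appears to be needed for
        correct scores when a cache is present.
    """
    if past_key_values is not None:
        past_k, past_v = past_key_values
        # Extend keys and values along the sequence dimension with the cache.
        k = torch.cat([past_k, k], dim=-2)
        v = torch.cat([past_v, v], dim=-2)

    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
    # Return the updated cache so the next decoding step can reuse it.
    return out, (k, v)
```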
I haven't studied the code in detail, but at a glance it looks reasonable. I did try to rerun an experiment from earlier today with a 1B model, and it got identical predictions on all 2000 instances across two tasks, but much faster: it's more than 12x faster for the longer (18 words avg, but with some longer outliers) DROP answers, and over 2x faster for the shorter (10 words avg) NaturalQs answers.
The numbers on two summarization tasks are similar: again identical metrics, with a notable (~6x) speedup.
LGTM! I don't see any red flags from a cursory look over the code, and my end-to-end testing seems to indicate it works as intended.
Thank you @OyvindTafjord! Glad to hear there's a big speedup.
Adds support for attention key/value caching and enables this in our `.generate()` method.
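As a rough illustration of why caching speeds up generation, here is a minimal greedy-decoding loop in which only the newest token is fed through the model at each step while cached keys/values cover the prefix. The `model(next_input, past_key_values=...)` call and its return values are a hypothetical interface for this sketch, not the repo's actual `.generate()` implementation:

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=50):
    """Hypothetical greedy decoding loop using a key/value cache.

    Assumes `model(tokens, past_key_values=...)` returns (logits, new_past_key_values);
    names and signature are illustrative only.
    """
    past_key_values = None
    tokens = input_ids
    next_input = input_ids  # first step: feed the whole prompt
    for _ in range(max_new_tokens):
        logits, past_key_values = model(next_input, past_key_values=past_key_values)
        # Pick the most likely next token from the last position's logits.
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        # Subsequent steps: feed only the new token; the cache covers the rest.
        next_input = next_token
    return tokens
```

Without the cache, each step would re-run attention over the full sequence so far, which is where the 2x-12x speedups reported above come from.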