Feature/ait 51 token streaming granular history #3014
Conversation
```javascript
// ✅ Do this - publish without await for maximum throughput
for await (const event of stream) {
  if (event.type === 'token') {
    channel.publish('token', event.text);
  }
}
```
Do we have any guidance on how users are meant to handle the result of the publish in this scenario? In some failure modes (e.g. a bunch of messages end up queued client-side and then fail because the connection becomes SUSPENDED, but the user just ploughs on publishing subsequent messages) they might end up with gaps in the published token stream.
(Or, perhaps an even more realistic scenario: some publishes are rejected due to rate limits but we plough ahead with subsequent publishes, some of which might succeed once the rate limiting subsides)
We are considering a page about discontinuity handling generally, and I think we can consider how to tackle this problem as part of that, but it needs some more thinking. I'll make a note. If you have any ideas on how to handle it, I'm all ears :)
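In the meantime, purely as an illustrative sketch (not part of the PR's docs), one way to surface rejected publishes rather than silently dropping tokens might be the following; the failedTokens array and the recovery step are assumptions:

```javascript
// Sketch only: keep fire-and-forget publishing for throughput, but attach an
// error handler so rejected publishes (e.g. rate limits, or messages failed
// while the connection is SUSPENDED) are recorded rather than lost silently.
const failedTokens = [];

for await (const event of stream) {
  if (event.type === 'token') {
    channel.publish('token', event.text).catch((err) => {
      failedTokens.push({ text: event.text, err });
    });
  }
}

if (failedTokens.length > 0) {
  // Application-specific recovery, e.g. republish the gap or flag the
  // discontinuity to subscribers.
  console.warn(`${failedTokens.length} token publishes failed`, failedTokens[0].err);
}
```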
```javascript
const channel = realtime.channels.get('{{RANDOM_CHANNEL_NAME}}');

const responses = new Map();
```
A "Track responses by ID" comment, as above, would be useful here I think.
```javascript
const channel = realtime.channels.get('{{RANDOM_CHANNEL_NAME}}');

// Track responses by ID
const responses = new Map();
```
I'm not sure that it makes sense to suggest storing the partial responses in the case where we don't have explicit start and stop events given that the storage will potentially grow unboundedly. I'd suggest perhaps only showing the Map solution in the explicit start / stop events case and perhaps here just log the response ID alongside the message. Or have I missed something?
I included it because I wanted to illustrate that responses could be multiplexed on the channel (see "even when delivered concurrently" above, although we will likely have a specific page for this concept in more detail). I think in this case it's okay - the example is intended to be illustrative (and I wanted it to show how the client would append tokens for the same response together). In a real app, you would likely have more complex solutions if the data could genuinely grow large enough to cause memory issues (e.g. local storage and loading only the data into memory that is currently visible at your scroll position, and so on).
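As a hedged sketch of that multiplexing point (not the PR's actual example, though it reuses the responseId header from the quoted code), appending tokens per response might look like:

```javascript
// Sketch only: accumulate tokens per response so that concurrent responses
// multiplexed on the same channel stay separated. Assumes the publisher sets
// a responseId header on each token message.
await channel.subscribe('token', (message) => {
  const responseId = message.extras?.headers?.responseId;
  const current = responses.get(responseId) ?? '';
  responses.set(responseId, current + message.data);
});
```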
```javascript
// Handle response stop
await channel.subscribe('stop', (message) => {
  const responseId = message.extras?.headers?.responseId;
  const finalText = responses.get(responseId);
  // ... render or otherwise use finalText
});
```
Perhaps (assuming that the idea of the responses map is just to accumulate response content during generation) remove from responses?
Could do, although per the comment above, the example is intended to be illustrative, and if you want to render the messages they need to be stored somewhere (and I think it's out of scope for this page to discuss strategies for managing and displaying unbounded data in web apps generally).
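For reference, a minimal sketch of the suggested cleanup (the renderCompletedResponse helper is hypothetical):

```javascript
// Sketch only: drop the accumulated text once the response completes, so the
// Map only holds in-flight responses. Whether to delete or keep the entry
// depends on how the app renders completed responses.
await channel.subscribe('stop', (message) => {
  const responseId = message.extras?.headers?.responseId;
  const finalText = responses.get(responseId);
  renderCompletedResponse(responseId, finalText); // hypothetical rendering helper
  responses.delete(responseId);
});
```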
GregHolmes left a comment:
I've only made a couple of minor suggestions about starting the sentences earlier on. Other than that, I think you've got this spot on.
Add intro describing the pattern, its properties, and use cases.
Includes continuous token streams, correlating tokens for distinct responses, and explicit start/end events.
Splits each token streaming approach into distinct patterns and shows both the publish and subscribe side behaviour alongside one another.
Includes hydration with rewind and hydration with persisted history + untilAttach. Describes the pattern for handling in-progress live responses with complete responses loaded from the database.
Merged e5ea672 into AIT-129-AIT-Docs-release-branch.
Is it in this doc or another doc that we would discuss streaming with ephemeral messages?
Description
Adds a "Token Streaming" section to the AIT docs with a page for token streaming with a message per token.
Covers:
- Continuous token streams
- Correlating tokens for distinct responses
- Explicit start/end events
- Hydration with rewind
- Hydration with persisted history + untilAttach

Note that the 100 message rewind limit will change soon, and these docs will be updated to reflect that.
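As a hedged illustration of the untilAttach hydration approach listed above (assuming ably-js and a 'token' message name; not the PR's actual example):

```javascript
// Sketch only: attach and subscribe first, then page back through persisted
// history up to the attach point so history and live messages join up without gaps.
const channel = realtime.channels.get('{{RANDOM_CHANNEL_NAME}}');

await channel.subscribe('token', (message) => {
  // Live tokens arriving after attach.
});

let page = await channel.history({ untilAttach: true });
while (page) {
  for (const message of page.items) {
    // Historical tokens, newest first.
  }
  page = page.hasNext() ? await page.next() : null;
}
```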
Checklist