The reported token usage doesn't need to match 100%, but over a multi-turn conversation (say, 3 messages exchanged) the numbers should be roughly equivalent (within, say, a 20% delta). Something seems to be off right now, so add an integration test that verifies this.
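As a rough starting point, the tolerance check for such a test might look like the sketch below. It only covers the comparison step. The helper names (`relative_delta`, `assert_usage_roughly_equal`), the choice to compare conversation totals rather than individual turns, and the choice to measure the delta against the larger of the two totals are all assumptions, not something taken from the existing code. How the two sets of per-turn token counts are actually collected is left to the real integration test.

```python
def relative_delta(reported: int, reference: int) -> float:
    """Relative difference between two token counts, measured against the larger one."""
    largest = max(reported, reference)
    if largest == 0:
        # Both counts are zero, so there is nothing to disagree about.
        return 0.0
    return abs(reported - reference) / largest


def assert_usage_roughly_equal(reported_per_turn, reference_per_turn, max_delta=0.20):
    """Compare total token usage across a conversation, allowing max_delta drift.

    Both arguments are hypothetical lists of per-turn token counts, one entry
    per message exchanged. Returns the measured delta so a test can log it.
    """
    reported_total = sum(reported_per_turn)
    reference_total = sum(reference_per_turn)
    delta = relative_delta(reported_total, reference_total)
    assert delta <= max_delta, (
        f"token usage differs by {delta:.1%} "
        f"(reported={reported_total}, reference={reference_total}, "
        f"allowed={max_delta:.0%})"
    )
    return delta


if __name__ == "__main__":
    # Placeholder numbers for a 3-message conversation, purely illustrative.
    reported = [120, 260, 410]
    reference = [110, 250, 400]
    print(f"delta = {assert_usage_roughly_equal(reported, reference):.1%}")
```

In the real test, the two lists would be filled from whatever two usage sources are being compared over the same 3-message conversation. Comparing totals keeps the check tolerant of small per-turn differences while still catching a systematic mismatch.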

