# How to track token usage in ChatModels

:::info Prerequisites

This guide assumes familiarity with the following concepts:
- [Chat models](/docs/concepts/#chat-models)

:::

Tracking token usage to calculate cost is an important part of putting your app into production. This guide shows how to obtain this information from your LangChain model calls.

This guide requires `langchain-openai >= 0.1.9`.

In [None]:
# %pip install --upgrade --quiet langchain langchain-openai

## Using LangSmith

You can use [LangSmith](https://www.langchain.com/langsmith) to help track token usage in your LLM application. See the [LangSmith quick start guide](https://docs.smith.langchain.com/).

## Using AIMessage.usage_metadata

A number of model providers return token usage information as part of the chat generation response. When available, this information will be included on the `AIMessage` objects produced by the corresponding model.

LangChain `AIMessage` objects include a [usage_metadata](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.AIMessage.html#langchain_core.messages.ai.AIMessage.usage_metadata) attribute. When populated, this attribute will be a [UsageMetadata](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.UsageMetadata.html) dictionary with standard keys (e.g., `"input_tokens"` and `"output_tokens"`).

Examples:

**OpenAI**:

In [1]:
# !pip install -qU langchain-openai

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
openai_response = llm.invoke("hello")
openai_response.usage_metadata

{'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
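
Because `usage_metadata` uses standard keys, a cost estimate can be computed from it with plain arithmetic. The following is a minimal sketch: `estimate_cost` and the two price constants are hypothetical names, and the per-million-token rates are placeholder assumptions that you should replace with your provider's current pricing.

```python
from typing import Mapping

# Assumed placeholder prices in USD per one million tokens.
# Substitute the real rates for the model you are calling.
PRICE_PER_MILLION_INPUT = 0.15
PRICE_PER_MILLION_OUTPUT = 0.60


def estimate_cost(usage: Mapping[str, int]) -> float:
    """Estimate the USD cost of one call from a usage_metadata-style dict."""
    input_cost = usage.get("input_tokens", 0) * PRICE_PER_MILLION_INPUT / 1_000_000
    output_cost = usage.get("output_tokens", 0) * PRICE_PER_MILLION_OUTPUT / 1_000_000
    return input_cost + output_cost


# The usage dictionary shown in the output above.
usage = {"input_tokens": 8, "output_tokens": 9, "total_tokens": 17}
print(f"Estimated cost: ${estimate_cost(usage):.8f}")
```

Missing keys are treated as zero, so the helper also accepts partially populated dictionaries.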

### Using AIMessage.response_metadata

Metadata from the model response is also included in the `AIMessage` [response_metadata](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.AIMessage.html#langchain_core.messages.ai.AIMessage.response_metadata) attribute. This data is typically not standardized: different providers adopt different conventions for representing token counts. For OpenAI, the counts appear under the `"token_usage"` key:

In [4]:
print(f'OpenAI: {openai_response.response_metadata["token_usage"]}\n')

OpenAI: {'completion_tokens': 9, 'prompt_tokens': 8, 'total_tokens': 17, 'completion_tokens_details': {'reasoning_tokens': 0}}
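
If you only have the provider-specific payload, it can be mapped onto the standard keys by hand. The sketch below does this for the OpenAI-style dictionary printed above. `normalize_openai_usage` is a hypothetical helper, not part of LangChain, and it assumes the key names shown in that output.

```python
from typing import Any, Mapping


def normalize_openai_usage(token_usage: Mapping[str, Any]) -> dict[str, int]:
    """Map OpenAI-style token_usage keys onto the standard usage_metadata keys."""
    input_tokens = int(token_usage.get("prompt_tokens", 0))
    output_tokens = int(token_usage.get("completion_tokens", 0))
    # Fall back to the sum if the payload omits the total.
    total_tokens = int(token_usage.get("total_tokens", input_tokens + output_tokens))
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total_tokens,
    }


# The provider-specific payload printed above.
raw = {
    "completion_tokens": 9,
    "prompt_tokens": 8,
    "total_tokens": 17,
    "completion_tokens_details": {"reasoning_tokens": 0},
}
print(normalize_openai_usage(raw))
```

In most cases reading `usage_metadata` directly is simpler. A mapping like this is mainly useful for logs that stored only `response_metadata`.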



### Streaming

Some providers support token count metadata in a streaming context.

#### OpenAI

For example, OpenAI will return a message [chunk](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.AIMessageChunk.html) at the end of a stream with token usage information. This behavior is supported by `langchain-openai >= 0.1.9` and can be enabled by setting `stream_usage=True`. This attribute can also be set when `ChatOpenAI` is instantiated.

```{=mdx}
:::note
By default, the last message chunk in a stream will include a `"finish_reason"` in the message's `response_metadata` attribute. If we include token usage in streaming mode, an additional chunk containing usage metadata will be added to the end of the stream, such that `"finish_reason"` appears on the second to last message chunk.
:::
```

In [5]:
llm = ChatOpenAI(model="gpt-4o-mini")

aggregate = None
for chunk in llm.stream("hello", stream_usage=True):
    print(chunk)
    aggregate = chunk if aggregate is None else aggregate + chunk

content='' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content='Hello' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content='!' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content=' How' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content=' can' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content=' I' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content=' assist' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content=' you' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content=' today' additional_kwargs={} response_metadata={} id='run-81d90370-130a-4005-980d-0b64cac2a629'
content='?' additional_kwargs={} response_metadata={} id='run-81d90370-130a-

Note that the usage metadata will be included in the sum of the individual message chunks:

In [6]:
print(aggregate.content)
print(aggregate.usage_metadata)

Hello! How can I assist you today?
{'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17}
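
To make the aggregation concrete, the sketch below sums the standard integer keys across a sequence of chunks. It is a simplified model of what adding chunks together does for usage metadata, not LangChain's implementation: `sum_usage` is a hypothetical helper, and the chunks are `SimpleNamespace` stand-ins so the example runs without an API call.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def sum_usage(chunks: Iterable[Any]) -> dict[str, int]:
    """Sum the standard usage keys over every chunk that carries usage metadata."""
    totals = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
    for chunk in chunks:
        # Most chunks in a stream have no usage metadata; treat them as empty.
        usage = getattr(chunk, "usage_metadata", None) or {}
        for key in totals:
            totals[key] += usage.get(key, 0)
    return totals


# Stand-in chunks: content chunks carry no usage, the final chunk carries it all.
chunks = [
    SimpleNamespace(content="Hello", usage_metadata=None),
    SimpleNamespace(content="!", usage_metadata=None),
    SimpleNamespace(
        content="",
        usage_metadata={"input_tokens": 8, "output_tokens": 9, "total_tokens": 17},
    ),
]
print(sum_usage(chunks))
```

Summing rather than reading only the last chunk keeps the helper correct if usage is ever split across several chunks.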


To disable streaming token counts for OpenAI, set `stream_usage=False` or omit the parameter:

In [7]:
for chunk in llm.stream("hello"):
    print(chunk)

content='' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content='Hello' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content='!' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content=' How' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content=' can' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content=' I' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content=' assist' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content=' you' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content=' today' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-4f9f-9010-32d278cb526e'
content='?' additional_kwargs={} response_metadata={} id='run-bb61f0fa-87f6-

You can also enable streaming token usage by setting `stream_usage` when instantiating the chat model. This can be useful when incorporating chat models into LangChain [chains](/docs/concepts#langchain-expression-language-lcel): usage metadata can be monitored when [streaming intermediate steps](/docs/how_to/streaming#using-stream-events) or using tracing software such as [LangSmith](https://docs.smith.langchain.com/).

In the example below, we return output structured to a desired schema while still observing token usage streamed from intermediate steps.

In [8]:
from pydantic import BaseModel, Field


class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


llm = ChatOpenAI(
    model="gpt-4o-mini",
    stream_usage=True,
)
# Under the hood, .with_structured_output binds tools to the
# chat model and appends a parser.
structured_llm = llm.with_structured_output(Joke)

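# Top-level `async for` works in a notebook. In a script, wrap this loop in an
# async function and run it with asyncio.run().
async for event in structured_llm.astream_events("Tell me a joke", version="v2"):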
    if event["event"] == "on_chat_model_end":
        print(f'Token usage: {event["data"]["output"].usage_metadata}\n')
    elif event["event"] == "on_chain_end":
        print(event["data"]["output"])
    else:
        pass

Token usage: {'input_tokens': 74, 'output_tokens': 21, 'total_tokens': 95}

setup="Why don't scientists trust atoms?" punchline='Because they make up everything!'


Token usage is also visible in the corresponding [LangSmith trace](https://smith.langchain.com/public/fe6513d5-7212-4045-82e0-fefa28bc7656/r) in the payload from the chat model.

## Using callbacks

There are also some API-specific callback context managers that let you track token usage across multiple calls. They are currently implemented only for the OpenAI API and the Bedrock Anthropic API.

### OpenAI

Let's first look at a simple example of tracking token usage for a single chat model call.

In [9]:
# !pip install -qU langchain-community wikipedia

from langchain_community.callbacks.manager import get_openai_callback

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    stream_usage=True,
)

with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
    print(cb)

Tokens Used: 29
	Prompt Tokens: 11
	Completion Tokens: 18
Successful Requests: 1
Total Cost (USD): $1.2449999999999998e-05


Anything inside the context manager will get tracked. Here's an example of using it to track multiple calls in sequence.

In [10]:
with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
    result2 = llm.invoke("Tell me a joke")
    print(cb.total_tokens)

57
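
For providers without a callback context manager, a similar running total can be kept by hand from each response's `usage_metadata`. This is a minimal sketch under that assumption: `UsageTally` is a hypothetical class, and the two usage dictionaries are illustrative values rather than real model output.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class UsageTally:
    """Hypothetical running total of token usage across several model calls."""

    input_tokens: int = 0
    output_tokens: int = 0
    requests: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def add(self, usage: Optional[Mapping[str, int]]) -> None:
        """Fold one response's usage_metadata into the tally; None is skipped."""
        if not usage:
            return
        self.input_tokens += usage.get("input_tokens", 0)
        self.output_tokens += usage.get("output_tokens", 0)
        self.requests += 1


tally = UsageTally()
# Illustrative usage dicts; in practice these would come from response.usage_metadata.
for usage in (
    {"input_tokens": 11, "output_tokens": 18, "total_tokens": 29},
    {"input_tokens": 11, "output_tokens": 17, "total_tokens": 28},
):
    tally.add(usage)
print(tally.total_tokens, tally.requests)
```

Responses whose `usage_metadata` is `None` are not counted as requests, so the request count reflects only calls that reported usage.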


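The callback also tracks streamed calls. Note that the model above was created with `stream_usage=True`:

In [11]: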
with get_openai_callback() as cb:
    for chunk in llm.stream("Tell me a joke"):
        pass
    print(cb)

Tokens Used: 29
	Prompt Tokens: 11
	Completion Tokens: 18
Successful Requests: 1
Total Cost (USD): $1.2449999999999998e-05
