[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aurelio-labs/cookbook/blob/main/gen-ai/agents/video-agent.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/aurelio-labs/cookbook/blob/main/gen-ai/agents/video-agent.ipynb)

# Chat with Video

In this example we're going to work through building a "chat-with-video" AI pipeline and agent. We'll see how to:

1. Take any YouTube video and transcribe it to text using Aurelio's video-to-text endpoint.
2. Use Mistral LLMs to chat with our transcribed video content.
3. Add chat history to make our AI conversational.
4. Integrate async and streaming for a better UX and improved scalability.
5. Optimize costs by reducing the overall token count through semantic similarity search, using Aurelio's chunking endpoint and Mistral's embedding models.

In [None]:
!pip install -qU \
  aurelio-sdk==0.0.18 \
  "yt-dlp[default]==2025.2.19" \
  mistralai==1.5.1

In [None]:
!yt-dlp "https://www.youtube.com/watch?v=JaHfCrVTYF4" -f mp4

We will use the [Aurelio Platform](https://platform.aurelio.ai/) for both video processing _and_ later for chunking. To follow the tutorial you can use the coupon `JBVIDEOAGENT` for free credits.

In [None]:
from aurelio_sdk import AurelioClient
import os
from getpass import getpass

os.environ["AURELIO_API_KEY"] = os.getenv("AURELIO_API_KEY") or \
    getpass("Enter your Aurelio API key: ")

client = AurelioClient(api_key=os.environ["AURELIO_API_KEY"])

Now we send our video to the Aurelio Platform for transcription. We set `chunk=False` here because we will chunk the transcript separately later on:

In [None]:
response_video_file = client.extract_file(
    file_path="/content/AI Agents as Neuro-Symbolic Systems？ [JaHfCrVTYF4].mp4",
    quality="low", chunk=False, wait=-1
)

We can access the transcribed video like so:

In [5]:
content = response_video_file.document.content
content

" Okay, so I wanted to put together a sort of overview video of what I'm currently working on, which is thinking or restructuring the way that I'm thinking about agents and the way that I'm also teaching or talking about agents. So this isn't going to be like a fully sort of edited and structured video. I just want to show you a little bit of what I'm thinking about and explain or explain where I'm coming from really.  So all in all, this is part of actually a broader thing that I am working on, which is actually why I haven't been posting on YouTube specifically for quite a while. And I think it's almost two months, which is the longest I think I haven't posted in forever. And, you know, it's well, okay, it's because I'm working on this, but it's also for other things as well. I had my first son like a month ago. So I've been pretty busy there. and just working on a lot of things over Aurelio as well. But I wanted to go through this introduction to  AI agents article that I'm working 

We can count the number of words from our transcribed video:

In [6]:
len(content.split())

4152

## Connecting an LLM

We're using Mistral AI in this example, with both their LLM and embedding models. You can get [an API key from here](https://console.mistral.ai/api-keys).

In [15]:
import os
from mistralai import Mistral
from getpass import getpass

os.environ["MISTRAL_API_KEY"] = os.getenv("MISTRAL_API_KEY") or \
    getpass("Enter your Mistral API key: ")

# note: this rebinds `client`, which previously held the Aurelio client
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

In [17]:
from mistralai.models import (
    SystemMessage,
    UserMessage
)

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        SystemMessage(content=(
            "You are an AI expert providing help to the user "
            "based on the content of the provided transcribed "
            "document.\n\n---\n\nTranscription:\n\n" +
            content
        )),
        UserMessage(content="Hi can you summarize this for me?")
    ]
)

We get our message content like so:

In [18]:
response.choices[0].message

AssistantMessage(content="Sure! Here's a summary of the transcribed document:\n\n### Overview:\nThe speaker is working on restructuring their approach to AI agents and plans to create a more structured video and course on the topic. They have been busy with personal life events and other projects, which has delayed their YouTube posting.\n\n### Key Points:\n1. **Introduction to AI Agents**:\n   - The speaker discusses the React agent, a foundational structure for LLM-based agents.\n   - React agents use multiple reasoning steps and can call external tools to gather information or perform actions.\n\n2. **Example of a React Agent**:\n   - An example is provided where an agent answers a question about the Apple Remote by using a search tool to gather information in multiple steps.\n   - The agent uses a search tool to find out which program the Apple Remote controls and then searches for other devices that can control that program.\n\n3. **Definition of Agents**:\n   - The speaker argues

We can also track our usage:

In [19]:
response.usage

UsageInfo(prompt_tokens=5430, completion_tokens=546, total_tokens=5976)

## Adding Chat History

To make our video chat conversational we need to maintain chat history. To do that we'll write an `Agent` class that we can initialize and interact with, and that stores our messages between calls.

In [20]:
from mistralai.models import AssistantMessage, UsageInfo

class Agent:
    messages: list[AssistantMessage | SystemMessage | UserMessage]
    usage: list[UsageInfo]

    def __init__(self):
        self.messages = [
            SystemMessage(content=(
                "You are an AI expert providing help to the user "
                "based on the content of the provided transcribed "
                "document.\n\n---\n\nTranscription:\n\n" +
                content
            ))
        ]
        self.usage = []

    def chat(self, content: str) -> AssistantMessage:
        # append user message to self.messages
        self.messages.append(UserMessage(content=content))
        # generate response
        response = client.chat.complete(
            model="mistral-large-latest",
            messages=self.messages
        )
        # append assistant message to self.messages
        self.messages.append(response.choices[0].message)
        # append usage (we can use this later)
        self.usage.append(response.usage)
        return response.choices[0].message

Now we can chat with our agent, which keeps track of the conversation history:

In [21]:
from IPython.display import display, Markdown

# initialize
agent = Agent()

res = agent.chat(
    content="can you summarize the meaning of 'symbolic' in this article?"
)
# print output in markdown
display(Markdown(res.content))

In the context of the article, 'symbolic' refers to a traditional approach to artificial intelligence that involves using handwritten rules, ontologies, and logical functions to create AI systems. This method, often called "Good Old-Fashioned AI" (GOFAI), relies on explicit, predefined knowledge structures to enable reasoning and decision-making. The article contrasts this with the 'neural' approach, which uses neural networks and machine learning to learn patterns and representations from data. The combination of these two approaches—symbolic and neural—is referred to as a neurosymbolic architecture, which the article suggests is a more comprehensive and flexible way to define and build AI agents.

Now let's ask another question that requires context from the previous interaction:

In [22]:
res = agent.chat(
    content="can you give me that but in short bullet-points?"
)
# print output in markdown
display(Markdown(res.content))

Certainly! Here are the key points about 'symbolic' in the article:

- **Traditional AI Approach**: Involves using handwritten rules, ontologies, and logical functions.
- **Good Old-Fashioned AI (GOFAI)**: Relies on explicit, predefined knowledge structures for reasoning and decision-making.
- **Contrast with Neural AI**: Unlike neural networks, which learn from data, symbolic AI uses predefined logic.
- **Neurosymbolic Architecture**: Combines symbolic AI (handwritten rules) with neural AI (neural networks) for a more comprehensive and flexible AI system.
- **Agent Definition**: The article suggests that agents should be defined as neurosymbolic systems, integrating both symbolic and neural components.

This summary captures the essence of how 'symbolic' is discussed in the article.
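
The follow-up question works because every call resends the full message list, including earlier answers. The sketch below illustrates that mechanism in isolation. It is not the Mistral client: `Message`, `fake_llm` and `HistoryAgent` are hypothetical stand-ins, and `fake_llm` simply reports how many messages it received.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str  # "system", "user" or "assistant"
    content: str


def fake_llm(messages: list[Message]) -> str:
    # hypothetical stand-in for client.chat.complete: it only reports
    # how many messages it was given
    return f"(reply based on {len(messages)} messages)"


@dataclass
class HistoryAgent:
    messages: list[Message] = field(
        default_factory=lambda: [Message("system", "You are a helpful assistant.")]
    )

    def chat(self, content: str) -> str:
        self.messages.append(Message("user", content))
        reply = fake_llm(self.messages)  # the whole history is sent every turn
        self.messages.append(Message("assistant", reply))
        return reply


history_agent = HistoryAgent()
first = history_agent.chat("what does 'symbolic' mean?")
second = history_agent.chat("give me that in bullet points")
print(first)
print(second)
```

The second call sees four messages rather than two, which is what lets a real LLM resolve "that" in the follow-up question. It is also why prompt token counts grow with every turn.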

## Async and Streaming

When developing AI apps that rely heavily on external APIs we tend to write async code to make our applications more scalable. With async code, the time our code would otherwise spend waiting for API responses can instead be spent on other tasks.
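
As a rough illustration of that idea, the sketch below runs three simulated API calls concurrently. `fake_api_call` is a hypothetical stand-in in which `asyncio.sleep` plays the role of network latency, so no real API is contacted.

```python
import asyncio
import time


async def fake_api_call(name: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on a network response
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_concurrently() -> tuple[list[str], float]:
    start = time.perf_counter()
    # all three "requests" wait at the same time rather than one after another
    results = await asyncio.gather(
        fake_api_call("a", 0.2),
        fake_api_call("b", 0.2),
        fake_api_call("c", 0.2),
    )
    return results, time.perf_counter() - start


# in a notebook, use `results, elapsed = await run_concurrently()` instead
results, elapsed = asyncio.run(run_concurrently())
print(results, f"{elapsed:.2f}s")
```

Run one after another, these calls would take about 0.6 seconds. Because the waits overlap, the total should be close to the longest single wait.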

We will rewrite our `Agent` class to work fully asynchronously, and we'll also add streaming. Streaming can improve the user experience because we can show tokens to the user as soon as they're generated.

In [23]:
class Agent:
    messages: list[AssistantMessage | SystemMessage | UserMessage]
    usage: list[UsageInfo]

    def __init__(self):
        self.messages = [
            SystemMessage(content=(
                "You are an AI expert providing help to the user "
                "based on the content of the provided transcribed "
                "document.\n\n---\n\nTranscription:\n\n" +
                content
            ))
        ]
        self.usage = []

    async def chat(self, content: str) -> AssistantMessage:
        # append user message to self.messages
        self.messages.append(UserMessage(content=content))
        # generate response asynchronously
        response = await client.chat.stream_async(
            model="mistral-large-latest",
            messages=self.messages
        )
        # tokens collected from the stream, joined into the full response later
        all_tokens = []
        # iterate through the streamed chunks, printing tokens as they arrive
        async for chunk in response:
            if (token := chunk.data.choices[0].delta.content) is not None:
                print(token, end="", flush=True)
                all_tokens.append(token)
        # append assistant message to self.messages
        self.messages.append(AssistantMessage(content="".join(all_tokens)))
        # append usage, read from the final streamed chunk (we can use this later)
        self.usage.append(chunk.data.usage)
        return self.messages[-1]
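
The streaming pattern used in `chat` can be tried without an API key. In the sketch below, `fake_token_stream` is a hypothetical async generator standing in for the stream returned by the chat API. The consumer prints each token as it arrives and keeps a copy so the full reply can be stored afterwards.

```python
import asyncio
from typing import AsyncIterator


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    # hypothetical stand-in for the stream returned by the chat API
    for word in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield word + " "


async def consume(stream: AsyncIterator[str]) -> str:
    tokens = []
    async for token in stream:
        print(token, end="", flush=True)  # show each token as it arrives
        tokens.append(token)              # keep it for the chat history
    return "".join(tokens).strip()


# in a notebook, use `full_text = await consume(...)` instead
full_text = asyncio.run(consume(fake_token_stream("symbolic AI uses written rules")))
```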

In [24]:
agent = Agent()

res = await agent.chat(
    content="can you summarize the meaning of 'symbolic' in this article?"
)

In the context of the transcribed document, "symbolic" refers to a traditional approach in artificial intelligence (AI) that involves handwritten rules, ontologies, and logical functions to build AI systems. This approach is often contrasted with "neural" or connectionist AI, which relies on neural networks and machine learning.

Key points about the symbolic approach mentioned in the document:

1. **Historical Context**: The symbolic approach was prominent in the mid-20th century, often referred to as "good old-fashioned AI" (GOFAI).
2. **Logical Frameworks**: It involves logical methodologies, such as syllogistic logic from Aristotle, where conclusions are derived from premises.
3. **Handwritten Rules**: Symbolic AI systems are built using manually crafted rules and ontologies, aiming to create systems that can reason logically.
4. **Comparison with Neural AI**: Unlike neural AI, which learns from data and mimics the structure of the brain, symbolic AI relies on predefined logical st

We can continue our conversation:

In [25]:
res = await agent.chat(
    content="tell me in more detail what was said on point (1)"
)

In the transcribed document, point (1) discusses the historical context of the symbolic approach in artificial intelligence (AI). Here's a more detailed breakdown:

**Historical Context**:

* The symbolic approach was one of the earliest methods used in AI, emerging in the mid-20th century.
* It was prominent during the 1940s, 1950s, and 1960s, and possibly extended into the 1970s.
* This period is often referred to as the era of "good old-fashioned AI" (GOFAI), a term coined to distinguish it from more modern approaches like connectionist or neural AI.
* During this time, AI research was largely focused on creating intelligent systems through manual engineering, using rules, logic, and symbol manipulation.
* The key idea was that human intelligence could be reduced to symbol manipulation, and thus, AI could be achieved by programming machines to manipulate symbols in a similar way.
* This approach was heavily influenced by the work of philosophers and mathematicians, such as Aristotle

In [26]:
res = await agent.chat(
    content="and what does that have to do with AI agents?"
)

The historical context of the symbolic approach in AI is relevant to AI agents in several ways, as discussed in the transcribed document. Here's how it connects:

1. **Foundation of AI Agents**: The symbolic approach provides a foundational understanding of how AI agents can be designed to reason and make decisions. Early AI agents were often rule-based systems that used symbolic reasoning to interpret their environment and make decisions.
2. **Neuro-Symbolic Architecture**: The speaker in the document argues for a neuro-symbolic architecture for AI agents, which combines neural ( connectionist) and symbolic methods. The historical context of symbolic AI is crucial for understanding one half of this architecture.
3. **Logical Reasoning**: Symbolic AI's focus on logical reasoning and manipulation of symbols is still relevant for AI agents today. Even modern AI agents, which often rely heavily on neural networks, can benefit from incorporating symbolic reasoning to improve their decision

## Optimizing Token Cost

Throwing the full article into each interaction will give us maximal accuracy but is also _expensive_.

We can see this by checking our token usage and calculating our costs. To check our usage we simply access our agent's `usage` attribute:

In [27]:
agent.usage

[UsageInfo(prompt_tokens=5436, completion_tokens=272, total_tokens=5708),
 UsageInfo(prompt_tokens=5724, completion_tokens=551, total_tokens=6275),
 UsageInfo(prompt_tokens=6289, completion_tokens=521, total_tokens=6810)]

To estimate our costs for running our agent, we can take the latest pricing for Mistral's large model for `prompt_tokens` (input) and `completion_tokens` (output) from their [pricing page](https://mistral.ai/en/products/la-plateforme#pricing).

As of 08 March 2025 those prices are:

| Model | API name | Input (/M tokens) | Output (/M tokens) |
| ----- | -------- | ----------------- | ------------------ |
| Mistral Large 24.11 | mistral-large-latest | \$2 | \$6 |

We can define a cost calculator like so:

In [28]:
def cost(usage: UsageInfo) -> float:
    # prices are in USD per token ($2 input and $6 output per million tokens)
    input_cost = usage.prompt_tokens * 2e-6
    output_cost = usage.completion_tokens * 6e-6
    return round(input_cost + output_cost, 5)

In [29]:
# save these values for later reference
original_cost = []

for usage in agent.usage:
    usage_cost = cost(usage=usage)
    original_cost.append(usage_cost)
    print(f"${usage_cost}")

$0.0125
$0.01475
$0.0157


These seem like small numbers but they will quickly add up as we continue throwing the full transcription into our LLM with every new interaction.
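
To see where the money goes, we can split the total into input and output cost. This is a small sketch using the three usage records printed above. `Usage` is a hypothetical stand-in for `UsageInfo`, so the snippet runs without the `mistralai` package.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    # hypothetical stand-in for mistralai's UsageInfo
    prompt_tokens: int
    completion_tokens: int


INPUT_PRICE = 2e-6   # USD per input token, from the pricing table above
OUTPUT_PRICE = 6e-6  # USD per output token, from the pricing table above

# the three usage records printed by `agent.usage` above
usages = [Usage(5436, 272), Usage(5724, 551), Usage(6289, 521)]

input_cost = sum(u.prompt_tokens for u in usages) * INPUT_PRICE
output_cost = sum(u.completion_tokens for u in usages) * OUTPUT_PRICE
total = input_cost + output_cost
input_share = input_cost / total

print(f"total: ${total:.5f}, input share: {input_share:.0%}")
```

With these figures, input tokens make up roughly 81% of the total even though output tokens are priced three times higher. The repeated transcript is the main cost driver, which is what the next steps aim to reduce.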

To optimize our token cost we can pull in _only_ the most relevant chunks of information, making use of semantic similarity. To do this we will:

1. Break our transcribed document into smaller chunks.
2. Embed those chunks into vector embeddings.
3. Store those vector embeddings in a numpy array.
4. When querying, our LLM (now agent) will transform our question into a small query.
5. We embed that query into a vector embedding.
6. Compare the semantic similarity between our query vector and our chunk vectors to find the most similar chunks.
7. Return those chunks to our LLM ready for a final response.

Let's start by chunking our document. We use the _async_ Aurelio chunking endpoint for this:

In [30]:
from aurelio_sdk import AsyncAurelioClient

# we reinitialize the client for async
aurelio_client = AsyncAurelioClient()

Call the chunk endpoint on our document:

In [36]:
from aurelio_sdk import ChunkingOptions

# we use a semantic chunker with a max chunk length of 500 tokens
chunking_options = ChunkingOptions(
    chunker_type="semantic",
    max_chunk_length=500,
    window_size=5
)

chunks = await aurelio_client.chunk(
    content=content,
    processing_options=chunking_options
)

We can see the chunks like so:

In [37]:
chunks.document.chunks[5:10]

[ResponseChunk(id='chunk_3447ec47-f78c-427d-ac65-89e5fdc607ea', content="I know that Apple remote controls the front row program.  but what other device controls the front row program. So it says, okay, based on this, my next reasoning step is I need to search front row and find other devices that control it. So then it does this search for front row. It could also probably do something, if we're thinking in rag terms here, it could be like device to control front row. And probably more today LM would do that. But that's fine. this is just an example. So it goes back to the search tool again and it says query.", chunk_index=6, num_tokens=121, metadata={}),
 ResponseChunk(id='chunk_935696c3-3bfa-485b-a305-8e94166f3eb6', content="Front row. And this isn't  I've shortened this for the sake of brevity. I think in the actual example, or at least from the paper, the actual example returns a lot more information. But this is the part of it that is important.", chunk_index=7, num_tokens=54, me
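
The semantic chunker above is a hosted service, so its internals are not shown here. For intuition about what a chunker with a maximum length does, the sketch below is a deliberately naive local stand-in. It is not Aurelio's algorithm: `naive_chunks` is a hypothetical helper that packs sentences greedily and counts words rather than tokens.

```python
def naive_chunks(text: str, max_words: int = 100) -> list[str]:
    """Greedily pack sentences into chunks of at most `max_words` words."""
    # crude sentence split on ". "; a real splitter would handle more cases
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # start a new chunk if adding this sentence would exceed the limit
        if current and count + n_words > max_words:
            chunks.append(". ".join(current))
            current, count = [], 0
        # note: a single sentence longer than max_words still gets its own chunk
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(". ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
toy_chunks = naive_chunks(sample, max_words=6)
print(toy_chunks)
```

A semantic chunker differs in that it chooses split points based on meaning rather than a fixed word budget, which tends to keep related sentences together.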

Next, we need to embed each chunk. For that we will use Mistral's embedding model.

In [38]:
embeddings_response = await client.embeddings.create_async(
    model="mistral-embed",
    inputs=[x.content for x in chunks.document.chunks]
)

This returns a list of `EmbeddingResponseData` objects inside the `.data` attribute:

In [39]:
len(embeddings_response.data)

35

Each of those `EmbeddingResponseData` objects contains a single vector embedding inside the `.embedding` attribute. The dimensionality of the embedding model is `1024` which we can see below:

In [40]:
len(embeddings_response.data[0].embedding)

1024

Next we must add all of these vectors to a single numpy array:

In [41]:
import numpy as np

video_emb = np.asarray([x.embedding for x in embeddings_response.data])
# check the dimensionality of our video chunks array
video_emb.shape

(35, 1024)

From this we can see we have 35 vector embeddings, each with 1024 dimensions, that together represent our full transcribed document. To plug this back into our agent we will create a `query` tool that lets our LLM provide a natural language query and returns the most relevant chunks for that query.

Let's start by writing this tool step-by-step. First, we create a _query vector_:

In [42]:
query = "what is the relationship between AI agents and GOFAI?"

embeddings_response = await client.embeddings.create_async(
    model="mistral-embed",
    inputs=[query]
)
xq = np.asarray(embeddings_response.data[0].embedding)
xq.shape

(1024,)

Now calculate the dot product similarity between our query vector `xq` and precomputed document chunk vectors.

In [43]:
sim_arr = np.dot(xq, video_emb.T)
sim_arr

array([0.6828565 , 0.75058198, 0.69919688, 0.67968833, 0.61341964,
       0.60867901, 0.62188296, 0.60628245, 0.60307181, 0.74770941,
       0.76357249, 0.61461822, 0.75837007, 0.75699921, 0.64369794,
       0.71479521, 0.65117063, 0.68480309, 0.71202544, 0.69951878,
       0.68277037, 0.68066402, 0.74412483, 0.64809726, 0.66284248,
       0.64783145, 0.62315144, 0.62955026, 0.66573198, 0.70150278,
       0.69305011, 0.67002756, 0.69779223, 0.70010112, 0.58219499])
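
Dot product and cosine similarity only produce the same ranking when all vectors have the same length, which this notebook does not check for `mistral-embed`. If in doubt, a minimal sketch of cosine similarity is shown below. `cosine_sim` and the toy vectors are hypothetical and not part of the original pipeline.

```python
import numpy as np


def cosine_sim(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of doc_matrix."""
    # scale every vector to unit length so the dot product ignores magnitude
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q


toy_docs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = np.array([2.0, 0.0])
toy_scores = cosine_sim(toy_query, toy_docs)
print(toy_scores)
```

The first toy document points in the same direction as the query and scores highest, the second is orthogonal to it, and the third sits in between regardless of its larger magnitude.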

Now we return the indexes for the `top_k` most similar (highest scoring) chunks:

In [44]:
top_k = 3  # we'll set top_k to 3, returning the 3 most similar chunks

most_similar_idx = np.argsort(sim_arr)[-top_k:][::-1]
most_similar_idx

array([10, 12, 13])
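
The `argsort` one-liner packs three steps together. The toy example below, with made-up scores, walks through them one at a time.

```python
import numpy as np

toy_sim = np.array([0.1, 0.9, 0.5, 0.7])  # made-up similarity scores
k = 2

ascending = np.argsort(toy_sim)  # indexes ordered from lowest to highest score
last_k = ascending[-k:]          # the k highest scores, still in ascending order
top_idx = last_k[::-1]           # reversed so the best match comes first

print(ascending, last_k, top_idx)
```

Here the best match is index 1 (score 0.9), followed by index 3 (score 0.7).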

Before pulling the content of each chunk, we convert our list of chunks into a numpy array. This lets us retrieve several chunks at once by indexing with `most_similar_idx`.

In [45]:
chunks_content = np.asarray([x.content for x in chunks.document.chunks])
chunks_content.shape

(35,)

Now retrieve the chunks content:

In [46]:
chunks_content[most_similar_idx]

array([". .  basically agent, LM agent. I think this came just before the React agent paper. It's very similar, I would say, has a bit less structured in the React agent. But, yeah, it's super relevant. And the way that they described their system was that it was a neurosymbolic architecture. I really like this definition because a, so neurosymbolic architecture, It's two things, right? You have the neural part, you have the symbolic part. And I actually have another kind of starting  on this article but it's uh yeah there's this mostly notes at the moment so the neural part of this in fact let's start with the symbolic part the symbolic part is the more traditional AI right so the you know I think this is back in the 40s 50s 60s mostly and then maybe so actually 70s as well this was actually maybe not 70s this was this was a the sort of traditional approach to AI. And the idea,  or the symbolists that were just like full on symbolists felt that true AGI would be achieved through writt

We can see here that we've returned the chunks of our article _most_ relevant to our query. Now we wrap all of this up into a single function, i.e. our _tool_:

In [47]:
async def search(query: str) -> str:
    """Use this tool to search for relevant chunks of information
    from the provided video. Provide as much context as possible
    to the `query` parameter, ensuring to write your search
    query in natural language. If you must answer multiple
    questions you should use this tool to only answer one at a
    time. Do not include multiple questions in the `query`."""
    # embed our query to create a 'query vector'
    embeddings_response = await client.embeddings.create_async(
        model="mistral-embed",
        inputs=[query]
    )
    xq = np.asarray(embeddings_response.data[0].embedding)
    # perform the similarity search
    sim_arr = np.dot(xq, video_emb.T)
    # get the top_k most similar chunks
    most_similar_idx = np.argsort(sim_arr)[-top_k:][::-1]
    # return our most relevant chunks
    return "\n---\n".join(chunks_content[most_similar_idx].tolist())

Let's test quickly:

In [48]:
print(await search(query=query))

. .  basically agent, LM agent. I think this came just before the React agent paper. It's very similar, I would say, has a bit less structured in the React agent. But, yeah, it's super relevant. And the way that they described their system was that it was a neurosymbolic architecture. I really like this definition because a, so neurosymbolic architecture, It's two things, right? You have the neural part, you have the symbolic part. And I actually have another kind of starting  on this article but it's uh yeah there's this mostly notes at the moment so the neural part of this in fact let's start with the symbolic part the symbolic part is the more traditional AI right so the you know I think this is back in the 40s 50s 60s mostly and then maybe so actually 70s as well this was actually maybe not 70s this was this was a the sort of traditional approach to AI. And the idea,  or the symbolists that were just like full on symbolists felt that true AGI would be achieved through written rules

Now we need to redefine our `Agent` and plug our new tool into it. To do that we need to format our tool so that the Mistral API can read it:

In [49]:
import inspect

inspect.getdoc(search)

'Use this tool to search for relevant chunks of information\nfrom the provided video. Provide as much context as possible\nto the `query` parameter, ensuring to write your search\nquery in natural language. If you must answer multiple \nquestions you should use this tool to only answer one at a\ntime. Do not include multiple questions in the `query`.'

In [50]:
search.__name__

'search'

In [51]:
func_schema = {
    "name": search.__name__,
    "description": inspect.getdoc(search),
    "parameters": {"type": "object", "properties": {}, "required": []}
}
func_schema

{'name': 'search',
 'description': 'Use this tool to search for relevant chunks of information\nfrom the provided video. Provide as much context as possible\nto the `query` parameter, ensuring to write your search\nquery in natural language. If you must answer multiple \nquestions you should use this tool to only answer one at a\ntime. Do not include multiple questions in the `query`.',
 'parameters': {'type': 'object', 'properties': {}, 'required': []}}

In [52]:
dtype_map = {
    int: "number",
    float: "number",
    str: "string",
    bool: "boolean",
    None: "null",
    list: "array",
}

In [57]:
signature = inspect.signature(search)
for name, param in signature.parameters.items():
    # add param to properties, mapping its type annotation to a schema type
    func_schema["parameters"]["properties"][name] = {
        "type": dtype_map.get(param.annotation, "object")
    }
    # and required (assuming all are required)
    func_schema["parameters"]["required"].append(name)

We now have our fully defined function schema:

In [58]:
func_schema

{'name': 'search',
 'description': 'Use this tool to search for relevant chunks of information\nfrom the provided video. Provide as much context as possible\nto the `query` parameter, ensuring to write your search\nquery in natural language. If you must answer multiple \nquestions you should use this tool to only answer one at a\ntime. Do not include multiple questions in the `query`.',
 'parameters': {'type': 'object',
  'properties': {'query': {'type': 'string'}},
  'required': ['query']}}
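
If we later add more tools, the steps above can be folded into one reusable helper. The sketch below is one possible way to do that using only `inspect`. `build_schema`, `JSON_TYPES` and `lookup` are hypothetical names, and it differs from the cells above in two assumed details: `int` maps to the JSON Schema type `"integer"`, and parameters with default values are left out of `required`.

```python
import inspect

# assumed mapping from Python annotations to JSON Schema type names
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean", list: "array"}


def build_schema(func) -> dict:
    """Build a function schema from a callable's name, docstring and signature."""
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": JSON_TYPES.get(param.annotation, "object")}
        # only parameters without a default value are marked as required
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


async def lookup(query: str, limit: int = 3) -> str:
    """Hypothetical example tool."""
    return query


schema = build_schema(lookup)
print(schema)
```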

We wrap this in a `mistralai` `Function` object, placed inside a tool definition of type `"function"`:

In [59]:
from mistralai.models.function import Function

tool_signatures = [
    {
        "type": "function",
        "function": Function(
            name=func_schema["name"],
            description=func_schema["description"],
            parameters=func_schema["parameters"]
        )
    }
]

We can add our new `tool_signatures` list to our completion call within our `Agent`.

In [60]:
class Agent:
    messages: list[AssistantMessage | SystemMessage | UserMessage]
    usage: list[UsageInfo]
    tool_signatures: list[dict]

    def __init__(self, tool_signatures: list[dict]):
        self.messages = [
            SystemMessage(content=(
                "You are an AI expert providing help to the user "
                "based on the content of the provided transcribed "
                "document."
            ))
        ]
        self.usage = []
        self.tool_signatures = tool_signatures

    async def chat(self, content: str) -> tuple:
        # append user message to self.messages
        self.messages.append(UserMessage(content=content))
        # generate response asynchronously
        response = await client.chat.stream_async(
            model="mistral-large-latest",
            messages=self.messages,
            tools=self.tool_signatures,
            tool_choice="auto"
        )
        # tokens collected from the stream
        all_tokens = []
        # iterate through the token generator
        async for chunk in response:
            if isinstance((tool_call := chunk.data.choices[0].delta.tool_calls), list):
                print(tool_call)
            elif (token := chunk.data.choices[0].delta.content) is not None:
                print(token, end="", flush=True)
                all_tokens.append(token)
        # storing the assistant message and usage is left commented out
        # in this intermediate version
        #self.messages.append(AssistantMessage(content="".join(all_tokens)))
        #self.usage.append(chunk.data.usage)
        # `tool_call` holds the value from the final chunk (None if that
        # chunk carried no tool calls)
        return self.messages[-1], tool_call

In [61]:
agent = Agent(tool_signatures=tool_signatures)

res = await agent.chat(
    content="can you summarize the meaning of 'symbolic' in this article?"
)

[ToolCall(function=FunctionCall(name='search', arguments='{"query": "what is the meaning of \'symbolic\' in this article?"}'), id='wnRfREnw4', type=None, index=0)]


Our video agent can now create a tool call, but it cannot execute it. For that we need a little more scaffolding to detect a tool call coming from our LLM and translate it into an execution of our `search` function.

To do that, we will create a tool execution function:

In [64]:
import json
from mistralai.models import ToolCall, ToolMessage

tools = [search]

tool_map = {t.__name__: t for t in tools}

async def execute_tool(tool_call: ToolCall) -> ToolMessage:
    tool_name = tool_call.function.name
    tool_params = json.loads(tool_call.function.arguments)
    tool_call_id = tool_call.id
    out = await tool_map[tool_name](**tool_params)
    return ToolMessage(
        content=out,
        name=tool_name,
        tool_call_id=tool_call_id
    )
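
`execute_tool` assumes the LLM always names a known tool and produces valid JSON arguments. A more defensive variant could return the error as text so the LLM can see what went wrong and retry. The sketch below shows that idea with stand-ins only: `FakeToolCall` flattens the fields we read from a `mistralai` `ToolCall`, and `echo`, `toy_tool_map` and `safe_execute` are hypothetical.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class FakeToolCall:
    # hypothetical stand-in for the fields we read from a mistralai ToolCall
    id: str
    name: str
    arguments: str  # JSON-encoded keyword arguments


async def echo(text: str) -> str:
    return text.upper()


toy_tool_map = {"echo": echo}


async def safe_execute(call: FakeToolCall) -> dict:
    """Run a tool call, returning errors as text instead of raising."""
    func = toy_tool_map.get(call.name)
    if func is None:
        content = f"Error: unknown tool '{call.name}'"
    else:
        try:
            params = json.loads(call.arguments)
            content = await func(**params)
        except (json.JSONDecodeError, TypeError) as exc:
            # invalid JSON, or arguments that don't match the tool's signature
            content = f"Error: bad arguments ({exc})"
    return {"role": "tool", "name": call.name, "tool_call_id": call.id, "content": content}


# in a notebook, use `await safe_execute(...)` instead of asyncio.run
ok = asyncio.run(safe_execute(FakeToolCall("1", "echo", '{"text": "hi"}')))
missing = asyncio.run(safe_execute(FakeToolCall("2", "nope", "{}")))
broken = asyncio.run(safe_execute(FakeToolCall("3", "echo", "{not json")))
print(ok["content"], missing["content"], broken["content"], sep="\n")
```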

Now let's take the `tool_call` from our previous `Agent.chat` call and run it through our `execute_tool` function.

In [65]:
res[1][0]

ToolCall(function=FunctionCall(name='search', arguments='{"query": "what is the meaning of \'symbolic\' in this article?"}'), id='wnRfREnw4', type=None, index=0)

In [66]:
tool_message = await execute_tool(tool_call=res[1][0])
tool_message

ToolMessage(content=". .  basically agent, LM agent. I think this came just before the React agent paper. It's very similar, I would say, has a bit less structured in the React agent. But, yeah, it's super relevant. And the way that they described their system was that it was a neurosymbolic architecture. I really like this definition because a, so neurosymbolic architecture, It's two things, right? You have the neural part, you have the symbolic part. And I actually have another kind of starting  on this article but it's uh yeah there's this mostly notes at the moment so the neural part of this in fact let's start with the symbolic part the symbolic part is the more traditional AI right so the you know I think this is back in the 40s 50s 60s mostly and then maybe so actually 70s as well this was actually maybe not 70s this was this was a the sort of traditional approach to AI. And the idea,  or the symbolists that were just like full on symbolists felt that true AGI would be achieved 

This is our executed tool output. We'd append this alongside an `AssistantMessage` for the initial LLM-generated tool call to our `Agent.messages` attribute, then feed everything back into our LLM for it to decide what to do next. Hopefully, we'll see our LLM deciding to use the information it gathered to respond to the user.

In [67]:
agent.messages.extend([
    AssistantMessage(content="", tool_calls=res[1]),
    tool_message
])
agent.messages

[SystemMessage(content='You are an AI expert providing help to the user based on the content of the provided transcribed document.', role='system'),
 UserMessage(content="can you summarize the meaning of 'symbolic' in this article?", role='user'),
 AssistantMessage(content='', tool_calls=[ToolCall(function=FunctionCall(name='search', arguments='{"query": "what is the meaning of \'symbolic\' in this article?"}'), id='wnRfREnw4', type=None, index=0)], prefix=False, role='assistant'),
 ToolMessage(content=". .  basically agent, LM agent. I think this came just before the React agent paper. It's very similar, I would say, has a bit less structured in the React agent. But, yeah, it's super relevant. And the way that they described their system was that it was a neurosymbolic architecture. I really like this definition because a, so neurosymbolic architecture, It's two things, right? You have the neural part, you have the symbolic part. And I actually have another kind of starting  on this a

In [69]:
# generate response asynchronously
response = await client.chat.stream_async(
    model="mistral-large-latest",
    messages=agent.messages,
    tools=agent.tool_signatures,
    tool_choice="auto"
)
# full response object to be built
all_tokens = []
all_usage = []
# iterate through the streamed chunks, printing tool calls and tokens
async for chunk in response:
    if isinstance((tool_call := chunk.data.choices[0].delta.tool_calls), list):
        print(tool_call)
    elif (token := chunk.data.choices[0].delta.content) is not None:
        print(token, end="", flush=True)
        all_tokens.append(token)

In the article, the term "symbolic" refers to a traditional approach to artificial intelligence (AI) that relies on written rules, ontologies, and logical functions to achieve true AGI (Artificial General Intelligence). This method involves creating handwritten philosophical grammars and logical representations, such as syllogistic logic from Aristotle, which includes major premises, minor premises, and conclusions. For example, a symbolic approach might state that all dogs have four legs, using this logical structure to define and reason about concepts.

The article also discusses "neurosymbolic architecture," which combines both symbolic and neural components. The symbolic part involves handwritten code or rules that can be triggered by a large language model (LLM) or another type of neural network. Neural networks, on the other hand, learn logical representations of different concepts, such as what a strawberry or a dog is, without relying on handwritten rules.

In summary, "symboli

With the tool call and tool output messages added, our LLM is able to respond to our query directly. Now let's integrate all of this back into a new `Agent` class.

In [96]:
class Agent:
    messages: list[AssistantMessage | SystemMessage | ToolMessage | UserMessage]
    usage: list[UsageInfo]
    tool_signatures: list[Function]

    def __init__(self, tool_signatures: list[Function], max_steps: int = 3):
        self.messages = [
            SystemMessage(content=(
                "You are an AI expert providing help to the user "
                "based on the content of the provided transcribed "
                "document."
            ))
        ]
        self.usage = []
        self.tool_signatures = tool_signatures
        self.max_steps = max_steps

    async def chat(self, content: str) -> AssistantMessage:
        # append user message to self.messages
        self.messages.append(UserMessage(content=content))
        # we will need to enter a loop to support multiple iterations
        step = 0
        while step < self.max_steps:
            # generate response asynchronously
            response = await client.chat.stream_async(
                model="mistral-large-latest",
                messages=self.messages,
                tools=self.tool_signatures,
                tool_choice="auto"
            )
            # collect streamed tokens so we can build the full response
            all_tokens = []
            # iterate through the streamed chunks
            async for chunk in response:
                if isinstance(
                    (tool_calls := chunk.data.choices[0].delta.tool_calls), list
                ):
                    # print each tool call in a cleaner format, execute it,
                    # and collect the resulting tool output messages
                    tool_messages = []
                    for tc in tool_calls:
                        print(f"{tc.function.name}: {tc.function.arguments}")
                        tool_messages.append(await execute_tool(tool_call=tc))
                    # add the assistant tool call and tool output messages
                    # to our self.messages
                    self.messages.extend([
                        AssistantMessage(content="", tool_calls=tool_calls),
                        *tool_messages
                    ])

                elif (token := chunk.data.choices[0].delta.content) is not None:
                    print(token, end="", flush=True)
                    all_tokens.append(token)
            # append usage (we can use this later)
            self.usage.append(chunk.data.usage)
            # append assistant message to self.messages (if returned)
            if all_tokens:
                self.messages.append(AssistantMessage(content="".join(all_tokens)))
                break
            step += 1
        return self.messages[-1]
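
To make the control flow of `chat` easier to follow in isolation, here is a minimal sketch of the same step-bounded loop with the Mistral client and the `search` tool replaced by stubs. `Message`, `fake_llm`, `fake_search`, and `run` are hypothetical names introduced only for this illustration; they are not part of the notebook or of any library.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    role: str                         # "user", "assistant", or "tool"
    content: str
    tool_query: Optional[str] = None  # set when the assistant requests a search


async def fake_llm(messages: list) -> Message:
    """Hypothetical stand-in for the LLM: asks for two searches, then answers."""
    searches_done = sum(m.role == "tool" for m in messages)
    pending = ["good old fashioned AI", "deepseek"]
    if searches_done < len(pending):
        return Message("assistant", "", tool_query=pending[searches_done])
    return Message("assistant", f"Answer built from {searches_done} searches.")


async def fake_search(query: str) -> Message:
    """Hypothetical stand-in for the `search` tool."""
    return Message("tool", f"chunks matching {query!r}")


async def run(content: str, max_steps: int = 3) -> list:
    messages = [Message("user", content)]
    for _ in range(max_steps):
        reply = await fake_llm(messages)
        messages.append(reply)
        if reply.tool_query is None:
            break  # the model answered directly, so we stop looping
        # otherwise execute the tool and feed its output back in
        messages.append(await fake_search(reply.tool_query))
    return messages


# In a notebook with a running event loop, use `history = await run(...)`.
history = asyncio.run(run("Compare GOFAI and deepseek"))
print([m.role for m in history])
```

With `max_steps=3` the stub has room for two searches and a final answer. With a smaller budget the loop ends on a tool message, which is also what `Agent.chat` leaves at the end of `self.messages` when its step budget runs out before the model answers.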

Now let's try another query. This time we'll see whether our agent uses the `search` tool twice to collate information from several chunks.

In [97]:
agent = Agent(tool_signatures=tool_signatures)

res = await agent.chat(
    content=(
        "Does the document mention 'good old fashioned AI'? And does it "
        "say anything about deepseek? How does the document "
        "compare the two?"
    )
)

search: {"query": "good old fashioned AI"}
search: {"query": "deepseek"}
Yes, the document mentions 'good old fashioned AI'. This term is used to refer to traditional AI approaches, often called symbolic AI. The document describes this type of AI as involving logical frameworks and methodologies, such as those developed by Aristotle. It involves constructing deeper AI systems that can figure things out, using exercises and logical frameworks.

The document does not explicitly mention 'deepseek'.

The document compares what could be inferred as more modern AI approaches, involving neural networks and connectionist AI, with the traditional 'good old fashioned AI'. The traditional AI is described as involving written rules, ontologies, and logical functions, aiming to achieve AGI through handwritten philosophical grammars and logical frameworks like syllogistic logic. In contrast, the more modern approaches, which could be compared to 'deepseek' (though not explicitly mentioned), involve 

Despite performing two searches to answer this query, we still used _significantly_ fewer tokens than we did without chunking:

In [98]:
agent.usage

[UsageInfo(prompt_tokens=187, completion_tokens=23, total_tokens=210),
 UsageInfo(prompt_tokens=898, completion_tokens=20, total_tokens=918),
 UsageInfo(prompt_tokens=1727, completion_tokens=228, total_tokens=1955)]

Without chunking, each query (with a single question) cost roughly $0.012 to $0.016:

In [99]:
original_cost

[0.0125, 0.01475, 0.0157]

With chunking, for two questions within a single query we're spending:

In [100]:
for usage in agent.usage:
    print(f"${cost(usage=usage)}")

$0.00051
$0.00192
$0.00482


A total of ~$0.0072, roughly half the cost of a single query without chunking, despite this being a more complex query.
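
The arithmetic behind that comparison can be sketched as follows. This is a self-contained approximation, not the notebook's own `cost` helper: the `Usage` dataclass is a hypothetical stand-in for `UsageInfo`, and the rates of $2 per million prompt tokens and $6 per million completion tokens are an assumption chosen because they reproduce the three figures printed above.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for UsageInfo (only the fields used here)."""
    prompt_tokens: int
    completion_tokens: int


# Assumed rates in USD per million tokens; check current pricing before reuse.
PROMPT_RATE = 2.0
COMPLETION_RATE = 6.0


def estimate_cost(usage: Usage) -> float:
    """Estimate the USD cost of one LLM call, rounded to 5 decimal places."""
    dollars = (
        usage.prompt_tokens * PROMPT_RATE
        + usage.completion_tokens * COMPLETION_RATE
    ) / 1_000_000
    return round(dollars, 5)


# token counts copied from the agent.usage output above
steps = [Usage(187, 23), Usage(898, 20), Usage(1727, 228)]
chunked_total = sum(estimate_cost(u) for u in steps)

# per-query costs without chunking, copied from original_cost above
original_cost = [0.0125, 0.01475, 0.0157]
original_mean = sum(original_cost) / len(original_cost)

print(f"chunked total: ${chunked_total:.5f}")
print(f"original mean: ${original_mean:.5f}")
print(f"reduction:     {1 - chunked_total / original_mean:.0%}")
```

Comparing against the mean of the three earlier queries is one reasonable baseline. Against the cheapest or the most expensive of them, the saving comes out somewhat lower or higher.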

---

With that, we've built an agent capable of helping us understand videos, and we've cut the cost per query substantially by sending the LLM only the relevant chunks.