# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language (LCEL) to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because we use cycles rather than plain loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively recreating application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies


## Task 2: Environment Variables

We'll want to set our OpenAI, Tavily, and LangSmith API keys along with our LangSmith environment variables.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [2]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE8 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain-community/tree/main/libs/community) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/arxiv/tool.py)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

In [4]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
]



### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example - but it's not *necessary* to use OpenAI with LangGraph. LangGraph supports all models - though you might not find success with smaller models - as such, the recommendation is to stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [5]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1-nano", temperature=0)

Now that we have our model set-up, let's "put on the tool belt", which is to say: We'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [6]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

The model determines which tool to use based on the input query and the description of each tool, including its name (e.g. "arxiv", "tavily_search_results_json"), the description of its purpose, and its parameters (including expected inputs). The model reads the user query, matches it against the available tools, and chooses the ones that fit the task. The choice shows up in the `tool_calls` attribute of the `AIMessage` emitted by the LLM.
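To make this concrete, here is a rough sketch of the kind of information the model sees and what it emits. The schema layout follows the OpenAI function-calling shape, but the description strings and the `call_123` id are made up for illustration (they are not the tools' real descriptions), and `find_schema` is a hypothetical helper.

```python
# Simplified tool descriptions in the OpenAI function-calling shape.
# The description strings are illustrative assumptions, not the real ones.
tool_schemas = [
    {
        "type": "function",
        "function": {
            "name": "arxiv",
            "description": "Search arXiv for scientific papers.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "tavily_search_results_json",
            "description": "Search the web for current information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# A made-up tool call, shaped like one entry of `AIMessage.tool_calls`.
example_tool_call = {
    "name": "arxiv",
    "args": {"query": "A Comprehensive Survey of Deep Research"},
    "id": "call_123",
}


def find_schema(tool_call, schemas):
    """Return the schema whose name matches the tool call, or None."""
    for schema in schemas:
        if schema["function"]["name"] == tool_call["name"]:
            return schema
    return None


matched = find_schema(example_tool_call, tool_schemas)
print(matched["function"]["name"])  # arxiv
```

The model never runs the tool itself: it only names a tool and supplies arguments, and something else (later, our action node) has to execute the call.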

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is a `TypedDict` with the key `messages`, whose value is a sequence of `BaseMessage` objects that is appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!

In [7]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]
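
The `Annotated[list, add_messages]` annotation attaches a reducer to the `messages` key: nodes return only their update, and the reducer decides how that update is merged into the existing state. The snippet below is a plain-Python sketch of the append behaviour from the walkthrough above. `append_messages` is a simplified stand-in written for illustration, not LangGraph's `add_messages` (which, among other things, can also replace an existing message that has the same ID), and the strings stand in for real message objects.

```python
def append_messages(existing: list, new: list) -> list:
    """Simplified stand-in for a reducer: append the update to the current list."""
    return existing + new


state = {"messages": []}

# Each node returns only its *update*; the reducer merges it into the state.
for update in (["HumanMessage(#1)"], ["AIMessage(#1)"], ["ToolMessage(#1)"]):
    state = {"messages": append_messages(state["messages"], update)}

print(state["messages"])
# ['HumanMessage(#1)', 'AIMessage(#1)', 'ToolMessage(#1)']
```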

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [8]:
from langgraph.prebuilt import ToolNode

def call_model(state: AgentState):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt) # this will actually implement the tool calls 


# tool_calls structure follows the OpenAI function calling format, which is why it works seamlessly with OpenAI models and bind_tools().

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [9]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x115c16ba0>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [10]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x115c16ba0>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message has any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

It's important to highlight that `add_conditional_edges` also accepts an optional third parameter (the mapping), a dictionary that should be created with the possible outputs of our conditional function in mind. We don't need one here: `should_continue` outputs either `"action"` or `END`, which already name the action node and the END node directly.

In [11]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x115c16ba0>

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)
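
As a rough illustration of the two routing styles (router returns a node name directly, or router returns a label that a mapping translates), here is a plain-Python sketch. `resolve_next_node` is a hypothetical helper written for this explanation and the `END` value is a stand-in sentinel; neither is LangGraph's own code.

```python
END = "__end__"  # stand-in sentinel for this sketch


def resolve_next_node(router_output, path_map=None):
    """Hypothetical helper: turn a router's return value into the next node name."""
    if path_map is None:
        # No mapping: the return value is used as the node name directly.
        return router_output
    return path_map[router_output]


# Style used in this notebook: the router returns node names.
print(resolve_next_node("action"))  # action

# Alternative style: the router returns labels that a mapping translates.
labels_to_nodes = {"continue": "action", "end": END}
print(resolve_next_node("continue", labels_to_nodes))  # action
print(resolve_next_node("end", labels_to_nodes))       # __end__
```

With a mapping, every value the router can return needs a matching key; a label that is missing from the mapping has nowhere to go.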

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [12]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x115c16ba0>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [13]:
simple_agent_graph = uncompiled_graph.compile()

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?


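Answer: Nothing in the graph definition itself limits the number of cycles, although LangGraph applies a default `recursion_limit` of 25 steps per run and raises an error once it is exceeded. Ways to impose our own cycle limits: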

  **1. State-based counting (from notebook example)**
  ```python 

  def should_continue(state):
      last_message = state["messages"][-1]

      # Limit based on message count
      if len(state["messages"]) > 10:
          return END

      if last_message.tool_calls:
          return "action"
      return END
``` 
  **2. Explicit step counter in state:**
  ```python 
  
  class AgentState(TypedDict):
      messages: Annotated[list, add_messages]
      step_count: int

  def should_continue(state):
      # Read the step counter (it is incremented in call_model)
      current_steps = state.get("step_count", 0)

      if current_steps >= 5:  # Max 5 cycles
          return END

      if state["messages"][-1].tool_calls:
          return "action"
      return END

  def call_model(state):
      messages = state["messages"]
      response = model.invoke(messages)
      return {
          "messages": [response],
          "step_count": state.get("step_count", 0) + 1
      }
```

  **3. Run-time recursion limit**
  ```python

  app = graph.compile()

  # recursion_limit is passed in the run config rather than to compile().
  # The built-in limit is a global counter over the steps of a single run.
  app.invoke(inputs, config={"recursion_limit": 10})
  ```

  **4. Time-based limits**
  ```python
  import time

  class AgentState(TypedDict):
      messages: Annotated[list, add_messages]
      start_time: float

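  def should_continue(state):
      # Assumes "start_time" was set (e.g. to time.time()) in the graph input
      elapsed = time.time() - state.get("start_time", time.time())
      if elapsed > 30:  # 30 second limit
          return END

      if state["messages"][-1].tool_calls:
          return "action"
      return END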
```


## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [14]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="How are technical professionals using AI to improve their work?")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"][0].content)
        print("\n\n")

Receiving update from node: 'agent'
Technical professionals are using AI in various ways to enhance their work, including automating repetitive tasks, improving decision-making, analyzing large datasets, developing new products and services, and optimizing processes. They leverage AI for tasks such as machine learning model development, natural language processing, computer vision, predictive analytics, and automation of workflows. This integration helps increase efficiency, accuracy, and innovation across different industries. Would you like specific examples or insights into particular fields?





Let's look at how a run flows through the graph:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. If the conditional edge finds `tool_calls` on that `AIMessage`, it sends the state object to the action node
4. The action node runs the requested tools, adds their results to the state object as `ToolMessage`s, and passes it along the edge to the agent node
5. The agent node adds a response to the state object and passes it along the conditional edge
6. Once the conditional edge finds no `tool_calls` on the last message, it passes the state object to END where we see it output in the cell above!

In the run above, the model answered directly without calling a tool, so we went straight from step 2 to step 6.
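
The steps above can be sketched in plain Python without any LLM or LangGraph dependency. Everything here is a hypothetical stand-in: `fake_model` is a scripted function that asks for one search and then answers, messages are plain dicts, and `run` is a hand-written loop rather than the compiled graph.

```python
END = "__end__"  # stand-in sentinel for this sketch


def fake_model(messages):
    """Scripted stand-in for the LLM: request one search, then answer."""
    if not any(m["role"] == "tool" for m in messages):
        query = messages[0]["content"]
        return {"role": "ai", "content": "",
                "tool_calls": [{"name": "search", "args": {"query": query}}]}
    return {"role": "ai", "content": "final answer", "tool_calls": []}


def agent_node(state):
    return {"messages": [fake_model(state["messages"])]}


def action_node(state):
    # Run every tool call requested by the last AI message.
    calls = state["messages"][-1]["tool_calls"]
    return {"messages": [
        {"role": "tool", "name": c["name"],
         "content": f"results for {c['args']['query']}"}
        for c in calls
    ]}


def should_continue(state):
    return "action" if state["messages"][-1].get("tool_calls") else END


def run(state):
    """Walk agent -> (action -> agent)* -> END, recording the nodes visited."""
    trace, node = [], "agent"
    while node != END:
        trace.append(node)
        update = agent_node(state) if node == "agent" else action_node(state)
        state = {"messages": state["messages"] + update["messages"]}
        # The agent is followed by the conditional edge; action always returns to agent.
        node = should_continue(state) if node == "agent" else "agent"
    return state, trace


final_state, trace = run(
    {"messages": [{"role": "human", "content": "What is LangGraph?"}]}
)
print(trace)  # ['agent', 'action', 'agent']
```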

Now let's look at an example that shows multiple tool usage - all with the same flow!

In [15]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the A Comprehensive Survey of Deep Research paper, then search each of the authors to find out where they work now using Tavily!")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"\n Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_QoYxn47rIAb9XcbvHxAOmdl5', 'function': {'arguments': '{"query": "A Comprehensive Survey of Deep Research"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_mG0Vb5ivQiTFOKc5YaHav4ED', 'function': {'arguments': '{"query": "A Comprehensive Survey of Deep Research paper"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 59, 'prompt_tokens': 182, 'total_tokens': 241, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_7c233bf9d1', 'id': 'chatcmpl-CLExkOvmW70lcGq4N5sTPo9TwIjvl', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--fa1de3d9-d429-4e

In [16]:
import textwrap
from langchain_core.messages import AIMessage, ToolMessage, HumanMessage

def pretty_print_message(msg, width=80):
    """Pretty print a message with width limiting"""
    if isinstance(msg, AIMessage):
        print(f"  Type: AIMessage")
        print(f"  Content: {textwrap.fill(msg.content, width=width)}")
        if msg.tool_calls:
            print(f"  Tool Calls: {len(msg.tool_calls)} call(s)")
            for i, tool_call in enumerate(msg.tool_calls):
                print(f"    {i+1}. {tool_call['name']}")
    elif isinstance(msg, ToolMessage):
        print(f"  Type: ToolMessage")
        print(f"  Tool: {msg.name}")
        print(f"  Content: {textwrap.fill(msg.content[:200] + '...', width=width)}")
    elif isinstance(msg, HumanMessage):
        print(f"  Type: HumanMessage")
        print(f"  Content: {textwrap.fill(msg.content, width=width)}")
    else:
        print(f"  Type: {type(msg).__name__}")
        print(f"  Content: {textwrap.fill(str(msg.content), width=width)}")

inputs = {"messages": [HumanMessage(content="Search Arxiv for the A Comprehensive Survey of Deep Research paper, then search each of the authors to find out where they work now using Tavily!")]}

print("=" * 80)
print("AGENT EXECUTION TRACE")
print("=" * 80)

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"\n🔄 RECEIVING UPDATE FROM NODE: '{node}'")
        print("-" * 50)

        if node == "action" and values['messages']:
            print(f"🛠️  Tool Used: {values['messages'][0].name}")

        print(f"📝 Messages ({len(values['messages'])} new):")
        for i, msg in enumerate(values['messages']):
            print(f"\n  Message {i+1}:")
            pretty_print_message(msg)

        print("\n" + "=" * 80)

AGENT EXECUTION TRACE

🔄 RECEIVING UPDATE FROM NODE: 'agent'
--------------------------------------------------
📝 Messages (1 new):

  Message 1:
  Type: AIMessage
  Content: 
  Tool Calls: 2 call(s)
    1. arxiv
    2. tavily_search_results_json


🔄 RECEIVING UPDATE FROM NODE: 'action'
--------------------------------------------------
🛠️  Tool Used: arxiv
📝 Messages (2 new):

  Message 1:
  Type: ToolMessage
  Tool: arxiv
  Content: Published: 2025-06-14 Title: A Comprehensive Survey of Deep Research: Systems,
Methodologies, and Applications Authors: Renjun Xu, Jingwen Peng Summary: This
survey examines the rapidly evolving field...

  Message 2:
  Type: ToolMessage
  Tool: tavily_search_results_json
  Content: [{"title": "[2506.12594] A Comprehensive Survey of Deep Research - arXiv",
"url": "https://arxiv.org/abs/2506.12594", "content": "We gratefully acknowledge
support from the Simons Foundation, member i...


🔄 RECEIVING UPDATE FROM NODE: 'agent'
---------------------------------

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

**Answer:**

**Step 1: Initial Planning (Agent Node)**
- The agent received the complex multi-part request: "Search Arxiv for the A Comprehensive Survey of Deep Research paper, then search each of the authors to find out where they work now using Tavily!"
- The agent analyzed the request and decided to start with an Arxiv search for the specific paper
- It generated an AIMessage with 2 tool calls: one for Arxiv search and one for Tavily search

**Step 2: Search Execution (Action Node)**
- The action node executed both tools in parallel:
  - ArxivQueryRun tool with query "A Comprehensive Survey of Deep Research"
  - TavilySearchResults tool with query "A Comprehensive Survey of Deep Research paper"
- Both tools returned results (ToolMessages), with the Arxiv tool finding the target paper and extracting author information (Renjun Xu and Jingwen Peng)

**Step 3: Processing Results and Planning Next Steps (Agent Node)**
- The agent analyzed the Arxiv search results and identified the main authors from the paper
- It recognized that it needed to search for each author individually to find their current workplace
- Generated a new AIMessage with 2 additional tool calls for Tavily searches to find where each author currently works

**Step 4: Author Workplace Search (Action Node)**
- The action node executed multiple TavilySearchResults calls in parallel to search for the current employment information of each author
- These searches returned information about where Renjun Xu and Jingwen Peng currently work

**Step 5: Final Synthesis (Agent Node)**
- The agent compiled information from all tool results
- Synthesized findings about where each author currently works
- Generated a final comprehensive AIMessage with the complete answer
- Since no more tool calls were needed, the conditional edge routed to END, completing the task


The agent successfully demonstrated multi-step reasoning by breaking down the complex request into manageable parts, using multiple tools in sequence and parallel, and synthesizing all the gathered information into a coherent final answer.




# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [20]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["text"])]}

def parse_output(input_state):
  return {"answer" : input_state["messages"][-1].content}

agent_chain_with_formatting = convert_inputs | simple_agent_graph | parse_output

agent_chain_with_formatting.invoke({"text" : "What is Deep Research?"})

{'answer': 'Deep Research typically refers to an in-depth and comprehensive investigation or analysis into a specific topic, subject, or field. It involves thorough examination of available information, data, and resources to uncover detailed insights, understand complex issues, and generate well-informed conclusions. Deep Research is often characterized by extensive literature review, data collection, critical analysis, and synthesis of findings.\n\nIf you are referring to a specific organization, product, or platform named "Deep Research," please provide more context so I can give a more precise answer.'}
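
To see what the two wrapper functions contribute, here is a dependency-free sketch of the same input -> graph -> output pipeline. `stand_in_graph` is a hypothetical replacement for the compiled agent that just echoes the question, plain dicts stand in for message objects, and `pipe` is a small helper that imitates left-to-right composition in the spirit of `a | b | c`.

```python
def convert_inputs(input_object):
    # Dataset rows carry a "text" field; the graph expects a "messages" list.
    return {"messages": [{"role": "human", "content": input_object["text"]}]}


def stand_in_graph(state):
    """Hypothetical replacement for the compiled graph: appends a canned reply."""
    question = state["messages"][-1]["content"]
    reply = {"role": "ai", "content": f"Echo: {question}"}
    return {"messages": state["messages"] + [reply]}


def parse_output(input_state):
    # Evaluators only need the final answer text, not the whole state.
    return {"answer": input_state["messages"][-1]["content"]}


def pipe(*steps):
    """Compose steps left to right."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


chain = pipe(convert_inputs, stand_in_graph, parse_output)
print(chain({"text": "What is Deep Research?"}))
# {'answer': 'Echo: What is Deep Research?'}
```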

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    {
        "inputs" : {"text" : "Who were the main authors on the 'A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications' paper?"},
        "outputs" : {"must_mention" : ["Peng", "Xu"]}   
    },
    ...,
    {
        "inputs" : {"text" : "Where do the authors of the 'A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications' work now?"},
        "outputs" : {"must_mention" : ["Zhejiang", "Liberty Mutual"]}
    }
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions that pertain to the cohort use-case (more information [here](https://www.notion.so/Session-4-RAG-with-LangGraph-OSS-Local-Models-Eval-w-LangSmith-26acd547af3d80838d5beba464d7e701#26acd547af3d81d08809c9c82a462bdd)), or the use-case you're hoping to tackle in your Demo Day project.

In [21]:
questions = [
    {
        "inputs": {"text": "What AI project should a small e-commerce startup prioritize to maximize ROI in their first year?"},
        "outputs": {"must_mention": ["customer service", "recommendation", "automation"]}
    },
    {
        "inputs": {"text": "For a healthcare company with limited resources, what AI initiative would provide the most value to patients and reduce operational costs?"},
        "outputs": {"must_mention": ["diagnosis", "medical imaging", "patient triage", "cost reduction"]}
    },
    {
        "inputs": {"text": "What AI solution should a manufacturing company implement first to improve efficiency and reduce waste?"},
        "outputs": {"must_mention": ["predictive maintenance", "quality control", "supply chain", "optimization", "IoT"]}
    },
    {
        "inputs": {"text": "For a financial services firm, what AI project would best help them serve customers better while ensuring regulatory compliance?"},
        "outputs": {"must_mention": ["fraud detection", "risk assessment", "compliance", "customer onboarding", "automation"]}
    },
    {
        "inputs": {"text": "What AI initiative should a nonprofit organization focus on to maximize their social impact with limited technical resources?"},
        "outputs": {"must_mention": ["volunteer matching", "donor engagement", "impact measurement", "cost-effective"]}
    }
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [23]:
from langsmith import Client

client = Client()

dataset_name = f"Simple Search Agent - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the cohort use-case to evaluate the Simple Search Agent."
)

client.create_examples(
    dataset_id=dataset.id,
    examples=questions
)

{'example_ids': ['416e9181-d594-4c61-8c09-3bec7e5ab4c9',
  '42e9d875-5d64-42a0-9186-6d251d8ed191',
  '0ca8d4ab-3997-4a71-b9d0-498ec8487cc4',
  'c5722ca2-ad0f-45cd-bff9-0abcf59bf4da',
  'e124012b-1871-48c2-99a1-16d9191a7df6'],
 'count': 5}

### Task 2: Adding Evaluators

Let's use the OpenEvals library to produce an evaluator that we can then pass into LangSmith!

> NOTE: Examine the `CORRECTNESS_PROMPT` below!

In [24]:
from openevals.prompts import CORRECTNESS_PROMPT
print(CORRECTNESS_PROMPT)

You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

<Rubric>
  A correct answer:
  - Provides accurate and complete information
  - Contains no factual errors
  - Addresses all parts of the question
  - Is logically consistent
  - Uses precise and accurate terminology

  When scoring, you should penalize:
  - Factual errors or inaccuracies
  - Incomplete or partial answers
  - Misleading or ambiguous statements
  - Incorrect terminology
  - Logical inconsistencies
  - Missing key information
</Rubric>

<Instructions>
  - Carefully read the input and output
  - Check for factual accuracy and completeness
  - Focus on correctness of information rather than style or verbosity
</Instructions>

<Reminder>
  The goal is to evaluate factual correctness and completeness of the response.
</Reminder>

<input>
{inputs}
</input>

<output>
{outputs}
</output>

Use the reference outputs below to help you evaluate the

In [25]:
from openevals.llm import create_llm_as_judge

correctness_evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini", # very impactful to the final score
        feedback_key="correctness",
    )

In [26]:
result = correctness_evaluator(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    reference_outputs={"expected": "Paris"}
)

print(result)

{'key': 'correctness', 'score': True, 'comment': 'The provided answer is accurate and complete; it correctly identifies Paris as the capital of France without any factual errors, matching the expected answer exactly. Thus, the score should be: true.', 'metadata': None}


Let's also create a custom Evaluator for our created dataset above - we do this by first making a simple Python function!

In [27]:
def must_mention(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
  # determine if the phrases in the reference_outputs are in the outputs
  required = reference_outputs.get("must_mention") or []
  score = all(phrase in outputs["answer"] for phrase in required)
  return score

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

**Answer**

IMO the current evaluator is very strict and this can lead to significant bias in the way the answers are evaluated. Here are some potential methods to "relax" the evaluation criterion:

- Make the comparison case-insensitive for the must-have phrases

- Instead of using `all` we can use a partial credit scoring where we count how many of the `must_mention` phrases appear in the output, i.e., something like

```python
answer_lower = outputs["answer"].lower()
matches = sum(1 for phrase in required if phrase.lower() in answer_lower)
score = matches / len(required) if required else 1.0
```

- Checking for semantic similarity - we can use potential embedding models to see if the answer's embedding relates to the embeddings of the required phrases

- We can also define a weighting scheme where the appearance of some words is valued more highly than others. This is of course very subjective and depends on the use-case at hand.

- We can get even more exotic and use something like the `fuzzywuzzy` Python package, which measures fuzzy similarity (based on Levenshtein distance), to score how closely each required phrase matches rolling windows of the answer
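
A couple of the ideas above can be sketched as follows. This is only one possible take: `must_mention_partial` and `fuzzy_contains` are names made up here, the `0.85` threshold is an arbitrary assumption, and the standard library's `difflib.SequenceMatcher` is swapped in for `fuzzywuzzy` so the snippet has no extra dependencies (its ratio is not a Levenshtein distance).

```python
from difflib import SequenceMatcher


def must_mention_partial(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    """Case-insensitive partial credit: fraction of required phrases found."""
    required = reference_outputs.get("must_mention") or []
    if not required:
        return 1.0
    answer_lower = outputs["answer"].lower()
    matches = sum(1 for phrase in required if phrase.lower() in answer_lower)
    return matches / len(required)


def fuzzy_contains(phrase: str, text: str, threshold: float = 0.85) -> bool:
    """True if some window of words in `text` is similar enough to `phrase`."""
    phrase_words = phrase.lower().split()
    text_words = text.lower().split()
    window_size = len(phrase_words)
    target = " ".join(phrase_words)
    # Slide a window with as many words as the phrase across the text.
    for start in range(max(len(text_words) - window_size + 1, 1)):
        window = " ".join(text_words[start:start + window_size])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


score = must_mention_partial(
    {},
    {"answer": "Start with Fraud Detection and automation."},
    {"must_mention": ["fraud detection", "automation", "compliance"]},
)
print(round(score, 2))  # 0.67

# A misspelling still counts under fuzzy matching.
print(fuzzy_contains("predictive maintenance", "We recommend predictve maintenance first"))  # True
```

Partial credit turns the metric into a score between 0 and 1, so averages across the dataset become more informative than a strict pass/fail.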

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [28]:
results = client.evaluate(
    agent_chain_with_formatting,
    data=dataset.name,
    evaluators=[correctness_evaluator, must_mention],
    experiment_prefix="simple_agent, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4, # optional, add concurrency
)

View the evaluation results for experiment: 'simple_agent, baseline-a57ea0d3' at:
https://smith.langchain.com/o/08071020-873b-4614-a512-80a2bbd38f89/datasets/fa66cbbb-215b-46c5-ad9c-01c50d33198d/compare?selectedSessions=b05ba7c7-2253-4e52-b44f-6485f9c4b8f6





## Part 2: LangGraph with Helpfulness

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again - we can check the length of the state object, so we don't need additional state for this.

In [29]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #4:

Please write markdown for the following cells to explain what each is doing.

##### Arnab: Instantiating the state graph and adding relevant nodes (called "agent" and "action") - similar to the ReAct setup


In [30]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x125601810>

##### Arnab: Set the initial entrypoint to the graph via the agent node. This means the input query to the graph on invocation will enter via the agent node.

In [32]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x125601810>

##### Arnab: Define the conditional edge function that determines the next node to go to from the agent node based on the current state. There are 3 possible routes:

- If the agent LLM has asked to use tools (easy to check: its last message has a non-empty `tool_calls` attribute), then go to the `action` node (which executes the tools)

- If the total number of messages in the state history has exceeded 10 (i.e., there have been roughly 5 rounds of play between the agent and action nodes), then terminate and go to the `END` node

- Otherwise, build an LCEL chain that uses `gpt-4.1-mini` as a judge to determine whether the agent's final response to the user's initial query is helpful. If the judge answers `Y`, we go to the `END` node; if it answers `N`, we route back to the `agent` node for another attempt.

In [34]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

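  # Loop limit: stop once the history grows past 10 messages.
  # Return the lowercase "end" key so it matches a key in the path map.
  if len(state["messages"]) > 10:
    return "end"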

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  helpfulness_prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")

  helpfulness_chain = helpfulness_prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"
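
To exercise the three routes without calling OpenAI, the same branching can be sketched with the judge passed in as a plain callable. This is an illustration under assumptions, not the notebook's function: `route`, `msg`, and the lambda judges are hypothetical names, `SimpleNamespace` objects stand in for LangChain messages, and the sketch uses the lowercase `"end"` key throughout.

```python
from types import SimpleNamespace

def route(messages, judge, max_messages=10):
    """Three-way routing with the LLM judge replaced by an injected callable."""
    last = messages[-1]
    if getattr(last, "tool_calls", None):
        return "action"                      # the agent asked for a tool
    if len(messages) > max_messages:
        return "end"                         # loop limit reached
    verdict = judge(messages[0].content, last.content)
    return "end" if "Y" in verdict else "continue"

def msg(content, tool_calls=None):
    """Hypothetical stand-in for a chat message object."""
    return SimpleNamespace(content=content, tool_calls=tool_calls or [])

query = msg("What is RAG?")
results = [
    route([query, msg("", tool_calls=[{"name": "search"}])], judge=lambda q, a: "Y"),
    route([query, msg("RAG is ...")], judge=lambda q, a: "Y"),
    route([query, msg("Not sure.")], judge=lambda q, a: "N"),
    route([query] + [msg("x")] * 10, judge=lambda q, a: "N"),
]
print(results)
```

Injecting the judge keeps the branching logic testable on its own; the real function builds the judge chain inline instead.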

##### Arnab: Add the conditional edge to the graph using the `tool_call_or_helpful` function with the path map (string to node) defined based on the return values of the function.

In [35]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

<langgraph.graph.state.StateGraph at 0x125601810>
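
Every string the routing function can return needs a matching key in the path map, otherwise that return value has no destination. A small sanity check can be sketched in plain Python; `unmapped_routes` is a hypothetical helper, and the string `"__end__"` is only a placeholder for LangGraph's `END` constant.

```python
def unmapped_routes(possible_returns, path_map):
    """Return the router outputs that have no entry in the path map."""
    return sorted(set(possible_returns) - set(path_map))

# Same keys as the path map above; "__end__" is a placeholder for END.
path_map = {"continue": "agent", "action": "action", "end": "__end__"}

ok = unmapped_routes(["action", "end", "continue"], path_map)
bad = unmapped_routes(["action", "END", "continue"], path_map)
print(ok, bad)
```

Keys are case-sensitive, so a router that returns `"END"` is not covered by an `"end"` entry.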

##### Add the final "directed" edge from action to agent. Note that the argument order matters here: the edge runs from `action` (the source) to `agent` (the target).

In [36]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x125601810>

##### Compile the graph so that it is ready to be invoked as a Runnable.

In [37]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

##### Checking the streamed outputs from the node using LangGraph's `astream` method (in `updates` mode). Note that the LLM initiates no tool call here, so the process ends after the first invocation.

In [38]:
inputs = {"messages" : [HumanMessage(content="What are Deep Research Agents?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='Deep Research Agents are advanced AI systems designed to assist with complex research tasks. They leverage deep learning techniques and large datasets to analyze, synthesize, and generate insights across various fields of study. These agents can automate literature reviews, extract relevant information from vast sources, identify patterns, and even generate hypotheses or summaries to support researchers in their work. They are used in academia, industry, and scientific research to accelerate discovery and improve the accuracy and depth of research outcomes. Would you like more detailed information or specific examples?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 103, 'prompt_tokens': 158, 'total_tokens': 261, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_to
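
A top-level `async for` works in a notebook because the notebook already runs an event loop; in a plain Python script the same loop has to live inside a coroutine run with `asyncio.run`. The sketch below shows that shape. `fake_astream` is a hypothetical stand-in that only imitates the `{node: {"messages": [...]}}` chunk layout seen above; it is not LangGraph's API.

```python
import asyncio

async def fake_astream(inputs):
    """Hypothetical stand-in yielding one 'updates'-style chunk."""
    yield {"agent": {"messages": ["(agent reply)"]}}

async def collect_updates(stream):
    """Consume the stream and record (node, messages) pairs."""
    seen = []
    async for chunk in stream:
        for node, values in chunk.items():
            seen.append((node, values["messages"]))
    return seen

# In a script, drive the coroutine with asyncio.run instead of top-level await.
updates = asyncio.run(collect_updates(fake_astream({"messages": []})))
print(updates)
```

With the real graph, `agent_with_helpfulness_check.astream(inputs, stream_mode="updates")` would take the place of `fake_astream(...)`.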

## Part 3: LangGraph for the "Patterns" of GenAI

### Task 4: Helpfulness Check of Gen AI Pattern Descriptions

Let's ask our system about the 3 main patterns in Generative AI:

1. Context Engineering
2. Fine-tuning
3. Agents

In [39]:
patterns = ["Context Engineering", "Fine-tuning", "LLM-based agents"]

In [40]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Context Engineering is a relatively new and emerging field that focuses on designing, managing, and utilizing context information to improve the functionality and adaptability of systems, particularly in areas like artificial intelligence, ubiquitous computing, and human-computer interaction. It involves creating systems that can understand and respond appropriately to the context in which they are used, enhancing user experience and system performance.

The concept of Context Engineering began gaining attention in the early 2000s as the proliferation of mobile devices, sensors, and ubiquitous computing environments made context-aware systems more feasible and desirable. It became more prominent with the rise of pervasive computing and the need for systems that can adapt dynamically to changing environments and user needs.

If you want, I can look up more detailed and specific information about the origins and development of Context Engineering. Would you like me to do that?



Fine-tu