# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because cycles are part of the graph itself, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively recreating application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1: Dependencies


## Task 2: Environment Variables

We'll want to set our OpenAI and Tavily API keys, along with our LangSmith environment variables.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [2]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API Key: ")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE7 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")
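Before moving on, it can help to fail fast if a key was skipped. The sketch below uses a hypothetical helper (`missing_env_vars` is our own name, not part of any library) that reports which of the variables set above are still unset or empty.

```python
import os

# Variables this notebook relies on (taken from the cells above)
REQUIRED_ENV_VARS = [
    "OPENAI_API_KEY",
    "TAVILY_API_KEY",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
]


def missing_env_vars(names=REQUIRED_ENV_VARS):
    """Return the names that are unset or set to an empty string."""
    return [name for name in names if not os.environ.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {missing}")
```

Checking for empty strings as well as unset keys matters here, because pressing Enter at a `getpass` prompt stores an empty string rather than leaving the variable unset.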

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain-community/tree/main/libs/community) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/arxiv/tool.py)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

##### ✅ Answer:

Other tools we could add include Wikipedia, for more structured encyclopedic information, and YouTube, for searching videos.

```python
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun
from langchain_community.tools.wikipedia.tool import WikipediaQueryRun
from langchain_community.tools import YouTubeSearchTool
from langchain_community.utilities import WikipediaAPIWrapper

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
    # WikipediaQueryRun needs an API wrapper (requires the `wikipedia` package)
    WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper()),
    # YouTubeSearchTool requires the `youtube_search` package
    YouTubeSearchTool(),
]
```

In [4]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
]



### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example, but it's not *necessary* to use OpenAI with LangGraph. LangGraph works with any model, though you might not find success with smaller models; as such, they recommend you stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API, we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [5]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1-nano", temperature=0)

Now that we have our model set up, let's "put on the tool belt", which is to say: we'll bind our LangChain-formatted tools to the model in an OpenAI function calling format.

In [6]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

##### ✅ Answer:

The model determines which tool to use through **OpenAI's function calling API**, where the available tools (like Tavily search and Arxiv) are bound to the model in a structured format that includes function names, descriptions, and parameter schemas. When the model receives a user query, it analyzes the content and intent of the request, then generates a `tool_calls` field in its response that specifies which tool to invoke along with the appropriate arguments - essentially acting as an intelligent router that matches the user's needs to the most suitable available tool based on the tool descriptions and the query's requirements.
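The model only *proposes* a call; something else has to execute it. The library-free sketch below shows that dispatch step, loosely modeled on what `ToolNode` does with each entry in an `AIMessage`'s `tool_calls`. The two tool functions are hypothetical stand-ins, and the dict shape (`name`, `args`, `id`) mirrors the `tool_calls` entries printed later in this notebook.

```python
def fake_search(query: str) -> str:
    # Hypothetical stand-in for the Tavily tool
    return f"search results for {query!r}"


def fake_arxiv(query: str) -> str:
    # Hypothetical stand-in for the Arxiv tool
    return f"arxiv papers matching {query!r}"


# Registry keyed by the tool names the model sees in its schemas
TOOLS = {
    "tavily_search_results_json": fake_search,
    "arxiv": fake_arxiv,
}


def run_tool_calls(tool_calls):
    """Execute each proposed call and return one result per call, tagged with its id."""
    results = []
    for call in tool_calls:
        tool = TOOLS[call["name"]]       # route by the name the model chose
        output = tool(**call["args"])    # pass the model-generated arguments
        results.append({"tool_call_id": call["id"], "content": output})
    return results


calls = [
    {"name": "arxiv", "args": {"query": "QLoRA"}, "id": "call_1"},
    {"name": "tavily_search_results_json", "args": {"query": "Tim Dettmers"}, "id": "call_2"},
]
for result in run_tool_calls(calls):
    print(result)
```

The tool descriptions and schemas only influence which `name` and `args` the model writes; the lookup and execution happen entirely outside the model.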

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph`, which uses our `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below, but our `AgentState` is a `TypedDict` with a single key, `messages`. Its value is a list of messages annotated with the `add_messages` reducer, so whenever a node returns new messages they are appended to the existing list instead of replacing it.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
   - `{"messages" : []}`
2. Our user submits a query to our application.
   - New State: `HumanMessage(#1)`
   - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
   - New State: `AIMessage(#1, tool_calls=[{"name" : "WebSearchTool", ...}])`
   - `{"messages" : [HumanMessage(#1), AIMessage(#1, ...)]}`
4. We pass our state object to a "conditional edge" (more on this later) which reads the last message to determine if we need to use a tool - which it can determine properly because of our provided object!
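To make the walkthrough concrete, here is a plain-Python sketch of how state updates are merged. `append_messages` is a simplified, hypothetical stand-in for `add_messages`: the real reducer also replaces an existing message when a new one carries the same ID, which this sketch ignores. Messages are plain `(role, content)` tuples here instead of LangChain message objects.

```python
def append_messages(existing, update):
    """Simplified reducer: new messages are appended, never overwritten."""
    return existing + update


def apply_update(state, update):
    # Each node returns a partial state; the reducer merges the "messages" key
    return {"messages": append_messages(state["messages"], update.get("messages", []))}


state = {"messages": []}                                                        # step 1
state = apply_update(state, {"messages": [("human", "Who is the captain?")]})  # step 2
state = apply_update(state, {"messages": [("ai", "tool_call: web_search")]})   # step 3

# Step 4: the conditional edge only needs to look at the last message
last_role, last_content = state["messages"][-1]
print(len(state["messages"]), last_role)
```

Because `call_model` returns `{"messages": [response]}` rather than the full history, the reducer is what keeps the earlier messages around.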

In [7]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to every other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [8]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two nodes:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [9]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x1125b9e80>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [10]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x1125b9e80>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message contains any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or END (finish the graph).

It's important to highlight how the destination is chosen. Because `should_continue` returns either `"action"` (a node name) or `END`, LangGraph can route on its return value directly, so we don't pass a mapping here. If the function returned labels such as `"continue"` or `"end"` instead, we would pass a dictionary as the third parameter (the path map) that maps each possible output to a destination, for example `{"continue": "action", "end": END}`.
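Conceptually, following a conditional edge means calling the routing function and, if a mapping was supplied, looking its return value up in that mapping. The plain-Python sketch below mimics that step. `should_continue_labels` and `resolve_next` are hypothetical, messages are plain dicts (so `.get("tool_calls")` replaces the `.tool_calls` attribute), and `END_SENTINEL` is a local stand-in for LangGraph's `END`, which is the string `"__end__"`.

```python
END_SENTINEL = "__end__"  # local stand-in for langgraph.graph.END


def should_continue_labels(state):
    """Variant of should_continue that returns labels instead of node names."""
    last_message = state["messages"][-1]
    return "continue" if last_message.get("tool_calls") else "end"


# The mapping that would accompany the label-returning variant
path_map = {"continue": "action", "end": END_SENTINEL}


def resolve_next(router, state, mapping=None):
    """Hypothetical helper: call the router, then translate its output via the mapping."""
    label = router(state)
    return mapping[label] if mapping is not None else label


with_tools = {"messages": [{"tool_calls": [{"name": "arxiv"}]}]}
without_tools = {"messages": [{"tool_calls": []}]}
print(resolve_next(should_continue_labels, with_tools, path_map))     # action
print(resolve_next(should_continue_labels, without_tools, path_map))  # __end__
```

With LangGraph itself, a label-returning router would be registered with the mapping as the third argument of `add_conditional_edges`, reading `last_message.tool_calls` and mapping to the real `END`.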

In [11]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x1125b9e80>

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [12]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x1125b9e80>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [13]:
simple_agent_graph = uncompiled_graph.compile()

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

##### ✅ Answer:

Our graph logic itself imposes no limit: it keeps cycling between the agent and action nodes until the conditional edge finds no more tool calls. LangGraph does, however, enforce a recursion limit on the number of steps a single run can take (25 by default) and raises a `GraphRecursionError` when it is exceeded.

To impose our own limit on the number of cycles, we could:
1. Add a counter to the state - add a `cycle_count` field to `AgentState`, increment it with each cycle, and check it in the conditional function
2. Check message length - use the length of the `messages` list as a proxy for the cycle count (as shown later in the notebook with `if len(state["messages"]) > 10`)
3. Set the recursion limit - pass a config such as `{"recursion_limit": 10}` when invoking the compiled graph (this is set per invocation, not when compiling)
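The counter-in-state option can be sketched as a routing function that reads a counter from the state. This is plain Python: `cycle_count`, `MAX_CYCLES`, and the two functions are hypothetical names, and in a real graph the counter would be another `AgentState` key that the tool-running node increments.

```python
MAX_CYCLES = 3  # hypothetical cap on agent -> action round trips


def should_continue_with_limit(state):
    """Route to the action node only while tool calls remain AND the cap is not reached."""
    last_message = state["messages"][-1]
    if last_message.get("tool_calls") and state["cycle_count"] < MAX_CYCLES:
        return "action"
    return "end"


def action_with_counter(state):
    # Whatever node runs the tools also bumps the counter
    return {"cycle_count": state["cycle_count"] + 1}


# A "model" that keeps asking for tools forever is stopped by the cap
state = {"messages": [{"tool_calls": [{"name": "arxiv"}]}], "cycle_count": 0}
routes = []
while should_continue_with_limit(state) == "action":
    routes.append("action")
    state.update(action_with_counter(state))
print(routes, state["cycle_count"])  # ['action', 'action', 'action'] 3
```

When the cap is hit, the run ends even though the model still wanted tools, so the last message may not be a finished answer; a graph could route that case to a dedicated wrap-up node instead of ending outright.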

## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [14]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_Dnb9AjWdT75QHubXGu4AyCTf', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 23, 'prompt_tokens': 162, 'total_tokens': 185, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': None, 'id': 'chatcmpl-BsLMeH8oyTBMBgFbSE0CNesLJKnRS', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--7165ac25-bdde-4c3f-b20c-030b4bfcf126-0', tool_calls=[{'name': 'tavily_search_results_json', 'args': {'query': 'current captain of the Winnipeg Jets'}, 'id': 'call_Dnb9AjWdT75QHubXGu4AyCTf', 'type': 

Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (the agent node), which added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found `tool_calls` on the last message, and sent the state object to the action node
4. The action node ran the requested tool (Tavily search), added the tool's result to the state object as a `ToolMessage`, and passed it along the edge back to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, found no `tool_calls` on the last message, and passed the state object to END, where we see it output in the cell above!
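The six steps can be reproduced without any API calls. The sketch below replaces the model with a scripted list of replies (a hypothetical `ScriptedModel`) and the tool with a stub, then runs the same agent → conditional edge → action → agent loop until no tool calls remain. Messages are plain dicts rather than LangChain objects, and the names are chosen so they don't shadow `model` from earlier cells.

```python
class ScriptedModel:
    """Hypothetical stand-in for the LLM: returns pre-written replies in order."""

    def __init__(self, replies):
        self.replies = list(replies)

    def invoke(self, messages):
        # A real model would read `messages`; this one just pops the next reply
        return self.replies.pop(0)


def run_agent(llm, question, tool):
    messages = [{"role": "human", "content": question}]  # 1. populate the state
    node = "agent"                                        # entry point
    visited = []
    while node != "END":
        visited.append(node)
        if node == "agent":
            messages.append(llm.invoke(messages))         # 2./5. agent adds an AI message
            # 3./6. conditional edge: tool calls -> action, otherwise END
            node = "action" if messages[-1].get("tool_calls") else "END"
        else:
            # 4. action node runs each requested tool and records its result
            for call in messages[-1]["tool_calls"]:
                messages.append({"role": "tool", "content": tool(**call["args"])})
            node = "agent"                                # fixed edge back to the agent
    return messages, visited


scripted_model = ScriptedModel([
    {"role": "ai", "content": "", "tool_calls": [{"args": {"query": "Winnipeg Jets captain"}}]},
    {"role": "ai", "content": "Final answer based on the search results.", "tool_calls": []},
])
messages, visited = run_agent(scripted_model, "Who is the current captain of the Winnipeg Jets?",
                              tool=lambda query: f"results for {query}")
print(visited)  # ['agent', 'action', 'agent']
```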

Now let's look at an example that uses multiple tools - all with the same flow!

In [15]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_jvEGo8lMfMHZuC7ChE6m3h7e', 'function': {'arguments': '{"query": "QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_6Qy3UYekXRcX9uGgB0WLp1QP', 'function': {'arguments': '{"query": "latest Tweet of the first author of QLoRA"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_SjDrhazQhJpKV0pNENdHBZ86', 'function': {'arguments': '{"query": "latest Tweet of the second author of QLoRA"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_dTpGBywQUTzT9oibnEGKJHxO', 'function': {'arguments': '{"query": "latest Tweet of the third author of QLoRA"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 113, 'prompt_tokens': 178, 'total_tokens': 291, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0,

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [46]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain_with_formatting = convert_inputs | simple_agent_graph | parse_output
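When plain functions are piped with `|` next to a Runnable such as our compiled graph, LangChain wraps them as runnables, and the chain runs each step on the previous step's output. The sketch below shows that left-to-right composition in plain Python with a stubbed graph; `compose`, `fake_graph`, and the `_sketch` functions are hypothetical, and messages are plain dicts instead of `HumanMessage` objects.

```python
from functools import reduce


def compose(*steps):
    """Run each step on the previous step's output, left to right (like a | b | c)."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def convert_inputs_sketch(input_object):
    return {"messages": [{"role": "human", "content": input_object["question"]}]}


def fake_graph(state):
    # Hypothetical stand-in for simple_agent_graph: appends one AI reply
    reply = {"role": "ai", "content": f"Answer to: {state['messages'][-1]['content']}"}
    return {"messages": state["messages"] + [reply]}


def parse_output_sketch(input_state):
    return input_state["messages"][-1]["content"]


chain = compose(convert_inputs_sketch, fake_graph, parse_output_sketch)
print(chain({"question": "What is RAG?"}))  # Answer to: What is RAG?
```

This shape is why the chain can be invoked with `{"question": ...}` and return a plain string, which is exactly what the LangSmith evaluation below needs.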

In [47]:
agent_chain_with_formatting.invoke({"question" : "What is RAG?"})

'RAG, or Retrieval-Augmented Generation, is an AI framework that enhances the capabilities of large language models (LLMs) by allowing them to retrieve and incorporate external information before generating responses. This process involves two main phases: retrieval and content generation. During retrieval, the system searches for relevant information from external sources such as databases, documents, or web sources. The retrieved information is then used by the LLM to produce more accurate, relevant, and up-to-date responses. \n\nRAG is particularly useful because it enables LLMs to access new data without the need for retraining, making the AI more flexible and capable of providing authoritative and contextually appropriate answers. It is widely applied in areas like customer service, knowledge management, and enterprise solutions to improve the quality and reliability of AI-generated content.'

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

In [49]:
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [50]:
from langsmith import Client

client = Client()

dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

{'example_ids': ['aa591263-5df0-4599-85ee-9757d2c0c797',
  '40989ee7-8867-4ecc-b939-322cd0cae6ee',
  '518700b3-6132-48b5-a0a4-febe91b15f73',
  '85724822-161d-4e1a-864c-e9f2b295ffd1',
  '57e07f99-3248-44db-9762-31c948f2964e',
  '8ee519ae-9eb6-44c0-9c00-9200d7bcf24b'],
 'count': 6}

#### ❓ Question #3:

How are the correct answers associated with the questions?

> NOTE: Feel free to indicate if this is problematic or not

##### ✅ Answer:

The correct answers are associated with the questions through **index-based matching** in parallel arrays. Here's how it works:

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    # ... more questions
]

answers = [
    {"must_mention": ["paged", "optimizer"]},
    {"must_mention": ["NF4", "NormalFloat"]},
    # ... more answers
]
```

**Association Method:**
- Question at index `0` → Answer at index `0`
- Question at index `1` → Answer at index `1`
- And so on...

**Potential Problems with This Approach:**

1. **Fragile Index Matching**: If you accidentally misalign the arrays, questions and answers won't match correctly
2. **No Explicit Linking**: There's no explicit identifier connecting each question to its answer
3. **Maintenance Issues**: Adding/removing questions requires careful index management
4. **No Validation**: No built-in check to ensure arrays have the same length

**Better Alternative:**
```python
dataset = [
    {
        "question": "What optimizer is used in QLoRA?",
        "answer": {"must_mention": ["paged", "optimizer"]}
    },
    {
        "question": "What data type was created in the QLoRA paper?",
        "answer": {"must_mention": ["NF4", "NormalFloat"]}
    }
]
```

This approach would be more robust and self-documenting, but the current index-based method works for simple datasets.

### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive, case-sensitive substring check to determine if our response contains specific strings.

In [51]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

##### ✅ Answer:


### **Current Limitations of the `must_mention` Metric:**

1. **Literal Substring Matching Only**
   - Only finds the literal strings, not synonyms or paraphrases
   - A response that says "4-bit NormalFloat" but never "NF4" would fail
   - British spellings such as "optimiser" would fail the "optimizer" check

2. **No Semantic Understanding**
   - Doesn't understand context or meaning
   - "QLoRA does *not* use a paged optimizer" would pass, because both required words appear
   - No understanding of technical variations
3. **All-or-Nothing Scoring**
   - Binary pass/fail, no partial credit
   - Missing one phrase fails the entire response
   - No weighting of importance

4. **Case Sensitivity**
   - "Paged" vs "paged" would be treated as different
   - No normalization of text
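A quick way to see these limitations is to run the same check `must_mention` uses, `all(phrase in prediction for phrase in required)`, on a few hand-written responses. The responses below are made up purely for illustration, and `naive_must_mention` is a hypothetical name for the check without its LangSmith wrapper.

```python
def naive_must_mention(prediction: str, required: list[str]) -> bool:
    # Same check as the must_mention evaluator, without the LangSmith wrapper
    return all(phrase in prediction for phrase in required)


required = ["Tim", "Dettmers"]

print(naive_must_mention("QLoRA was written by Tim Dettmers et al.", required))  # True
print(naive_must_mention("QLoRA was written by TIM DETTMERS et al.", required))  # False: case-sensitive
print(naive_must_mention("QLoRA was written by Dettmers et al.", required))      # False: all-or-nothing
print(naive_must_mention("Tim Dettmers did NOT write QLoRA.", required))         # True: no semantics
```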

### **Ways to Improve the Metric:**

#### **1. Fuzzy Matching**
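```python
from difflib import SequenceMatcher

def fuzzy_match(prediction, required, threshold=0.8):
    # Compare each phrase to same-length word windows, not the whole response
    words = prediction.lower().split()
    def best_ratio(phrase):
        n = len(phrase.split())
        windows = [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))]
        return max(SequenceMatcher(None, phrase.lower(), w).ratio() for w in windows)
    # Require every phrase, mirroring must_mention
    return all(best_ratio(phrase) >= threshold for phrase in required)
```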

#### **2. Semantic Similarity**
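```python
from sentence_transformers import SentenceTransformer, util

# Load the embedding model once rather than on every call
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_check(prediction, required, threshold=0.7):
    pred_embedding = embedder.encode(prediction, convert_to_tensor=True)
    req_embeddings = embedder.encode(required, convert_to_tensor=True)
    # cos_sim returns a (1, len(required)) matrix; take its only row
    similarities = util.cos_sim(pred_embedding, req_embeddings)[0]
    # A whole answer vs. a one-word phrase tends to score low, so tune the threshold
    return any(float(sim) > threshold for sim in similarities)
```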

#### **3. Partial Credit Scoring**
```python
def partial_credit(prediction, required):
    if not required:
        return 1.0  # nothing required, so nothing is missing
    found = sum(1 for phrase in required if phrase.lower() in prediction.lower())
    return found / len(required)  # Returns 0.0 to 1.0
```

#### **4. Enhanced Evaluator**
```python
@run_evaluator
def improved_must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    
    # Normalize text
    prediction = prediction.lower()
    required = [phrase.lower() for phrase in required]
    
    # Check exact matches
    exact_matches = sum(1 for phrase in required if phrase in prediction)
    
    # Check partial matches (any individual word of the phrase appears)
    partial_matches = sum(1 for phrase in required 
                         if any(word in prediction for word in phrase.split()))
    
    # Calculate score
    exact_score = exact_matches / len(required) if required else 0
    partial_score = partial_matches / len(required) if required else 0
    
    # Weighted final score
    final_score = (exact_score * 0.7) + (partial_score * 0.3)
    
    return EvaluationResult(key="improved_must_mention", score=final_score)
```

#### **5. Multiple Evaluation Metrics**
```python
@run_evaluator
def comprehensive_evaluation(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    
    metrics = {
        "exact_match": all(phrase in prediction for phrase in required),
        "partial_match": sum(1 for phrase in required if phrase in prediction) / len(required) if required else 0.0,
        "length_appropriate": 50 <= len(prediction) <= 500,  # arbitrary character bounds; tune per task
        "contains_key_concepts": any(phrase in prediction for phrase in required)
    }
    
    # Overall score
    overall_score = sum(metrics.values()) / len(metrics)
    
    return EvaluationResult(key="comprehensive", score=overall_score)
```

### **Gaps in Current Method:**

1. **No Context Awareness** - Doesn't understand if the information is correctly used
2. **No Fact Verification** - Doesn't check if the information is accurate
3. **No Response Quality** - Doesn't evaluate coherence or completeness
4. **No Relevance Scoring** - Doesn't assess if the response actually answers the question
5. **No Hallucination Detection** - Doesn't identify if the agent made up information

The current metric is a **good starting point** but would benefit from more sophisticated NLP techniques and multiple evaluation dimensions.

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [54]:
experiment_results = client.evaluate(
    agent_chain_with_formatting,
    data=dataset_name,
    evaluators=[must_mention],
    experiment_prefix=f"Search Pipeline - Evaluation - {uuid4().hex[0:4]}",
    metadata={"version": "1.0.0"},
)

View the evaluation results for experiment: 'Search Pipeline - Evaluation - a1de-cb133507' at:
https://smith.langchain.com/o/88c6730c-2899-4f8d-a2a1-13e382f99e7f/datasets/ca6c3318-00ab-4585-b17f-854889524b5d/compare?selectedSessions=dfa10521-b4ef-46c1-90be-4c7a00dd1ba5





In [55]:
experiment_results

## Part 2: LangGraph with Helpfulness

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again. We can check the length of the `messages` list to limit loops, so we don't need any additional state for this.

In [56]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #5:

Please write markdown for the following cells to explain what each is doing.

##### ✅ Answer:

##### Creating the Enhanced Graph Structure

This cell initializes a new StateGraph with helpfulness checking capabilities. We create two main nodes:
- **"agent"**: The language model node that processes queries and generates responses
- **"action"**: The tool execution node that handles tool calls (search, arxiv, etc.)

This is the foundation for our enhanced agent that will include helpfulness evaluation.

In [57]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x118392e90>

##### ✅ Answer:

##### Setting the Graph Entry Point

This cell establishes the starting point of our graph workflow. By setting the entry point to "agent", we ensure that all queries first go through the language model node, which will analyze the input and decide whether to use tools or provide a direct response.

In [58]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x118392e90>

##### ✅ Answer:

##### Implementing Advanced Conditional Routing

This cell defines the `tool_call_or_helpful` function, which is the core logic for our enhanced agent. This function:

1. **Checks for tool calls first** - If the agent wants to use tools, route to the action node
2. **Implements loop limits** - Prevents infinite loops by ending once the conversation exceeds 10 messages
3. **Evaluates helpfulness** - Uses a second LLM to judge whether the response is helpful
4. **Routes accordingly** - Returns "action", "continue", or "end" based on these checks

This creates a more intelligent and robust agent that can self-evaluate its responses.

In [59]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  if len(state["messages"]) > 10:
    return "end"  # lowercase, matching the other stop branch below

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  helpfullness_prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")

  helpfulness_chain = helpfullness_prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"

#### 🏗️ Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

##### ✅ Answer:

This function is the **core decision-making logic** for the enhanced LangGraph agent. It determines what the agent should do next based on the current state.

### **Function Purpose:**
The function acts as a **smart router** that decides whether to:
1. Use tools
2. Continue processing
3. End the conversation

### **Step-by-Step Breakdown:**

```python
def tool_call_or_helpful(state):
    last_message = state["messages"][-1]
```

**Gets the most recent message** from the conversation history to analyze what just happened.

```python
if last_message.tool_calls:
    return "action"
```

**Checks if the agent wants to use tools:**
- If the last message contains `tool_calls`, the agent is requesting to use tools
- Returns `"action"` to route to the tool execution node
- This is the **highest priority** - tools take precedence

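```python
initial_query = state["messages"][0]
final_response = state["messages"][-1]

if len(state["messages"]) > 10:
    return "end"
```

**Implements loop limits:**
- Gets the original user query (the first message) and the latest response
- If the state holds more than 10 messages, **forces an end** by returning `"end"`; every message counts, including AI tool-call messages and tool results
- Because the tool-call check runs first, the limit only applies to replies without tool calls; an unbroken chain of tool calls is capped by LangGraph's recursion limit instead
- This is a **safety mechanism** to prevent the agent from getting stuck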
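The 10-message threshold counts every message in the state, so it is reached sooner than "10 turns" suggests. A minimal arithmetic sketch in plain Python, assuming the run starts with one human message, each tool round adds one AI message carrying the tool calls plus one tool message per call, and the run ends with one AI answer (`messages_after` is a hypothetical helper, not part of the notebook):

```python
def messages_after(rounds: int, tool_calls_per_round: int) -> int:
    """Messages in state when the router sees a final, tool-free AI answer.

    Assumed (hypothetical) accounting: 1 human message to start; per tool
    round, 1 AI message carrying the tool calls plus 1 ToolMessage per call;
    then 1 final AI answer.
    """
    human = 1
    per_round = 1 + tool_calls_per_round
    final_answer = 1
    return human + rounds * per_round + final_answer


LIMIT = 10  # same threshold as `len(state["messages"]) > 10`

for rounds in range(1, 4):
    total = messages_after(rounds, tool_calls_per_round=3)
    print(f"{rounds} round(s): {total} messages, over limit: {total > LIMIT}")
```

With three parallel searches per round, as in the multi-part query later in this notebook, one round already yields 6 messages, and three rounds push the count past the limit before the final answer is ever graded.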

```python
prompt_template = """\
Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

Initial Query:
{initial_query}

Final Response:
{final_response}"""

helpfullness_prompt_template = PromptTemplate.from_template(prompt_template)

helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")

helpfulness_chain = helpfullness_prompt_template | helpfulness_check_model | StrOutputParser()
```

**Creates a helpfulness evaluation system:**
- Makes a **separate LLM call** to judge whether the response is helpful
- Compares the original query with the latest response
- Asks for a single-letter "Y" or "N" verdict
- Uses `gpt-4.1-mini` as the grader, which is a larger model than the `gpt-4.1-nano` the agent itself runs on (see `model_name` in the streamed output below)
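To see exactly what the grader receives, the template can be rendered with Python's `str.format`, which, as far as we know, behaves the same as `PromptTemplate`'s default f-string formatting for simple placeholders like these. The query and answer strings below are made-up examples; in the graph they come from `state["messages"][0].content` and `state["messages"][-1].content`.

```python
# Same template text as the helpfulness prompt above.
prompt_template = """\
Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

Initial Query:
{initial_query}

Final Response:
{final_response}"""

# Hypothetical example inputs standing in for the first and last messages.
rendered = prompt_template.format(
    initial_query="What is QLoRA?",
    final_response="QLoRA is a technique for efficient fine-tuning...",
)
print(rendered)
```

The trailing backslash after the opening `"""` suppresses the leading newline, so the rendered prompt starts directly with the instruction.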

```python
helpfulness_response = helpfulness_chain.invoke({
    "initial_query" : initial_query.content, 
    "final_response" : final_response.content
})

if "Y" in helpfulness_response:
    return "end"
else:
    return "continue"
```

**Makes the final decision:**
- If the grader's reply contains a capital "Y" anywhere (helpful) → returns `"end"` and the conversation ends
- Otherwise (e.g. "N", not helpful) → returns `"continue"` and the agent gets another pass
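The substring test is looser than the Y/N instruction suggests: any capital "Y" in the grader's reply counts as helpful, including one inside a word such as "Your". A stdlib sketch comparing the notebook's check with a stricter parse (`strict_is_helpful` is our hypothetical illustration, not from the notebook):

```python
def naive_is_helpful(reply: str) -> bool:
    # Mirrors the notebook's check: any capital "Y" anywhere counts as helpful.
    return "Y" in reply


def strict_is_helpful(reply: str) -> bool:
    # Hypothetical stricter parse: only a reply that starts with Y/y counts.
    return reply.strip().upper().startswith("Y")


replies = ["Y", "N", "N. Your answer omits the release date.", " y "]
for r in replies:
    print(repr(r), naive_is_helpful(r), strict_is_helpful(r))
```

Checking only the first character assumes the grader puts its verdict first, which the prompt asks for but does not guarantee.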

### **Decision Flow Diagram:**

```
Start
  ↓
Check for tool calls?
  ↓ YES → Route to "action" (use tools)
  ↓ NO
Check message count > 10?
  ↓ YES → Route to "end" (force stop)
  ↓ NO
Evaluate helpfulness
  ↓
Helpful? (Y)
  ↓ YES → Route to "end" (satisfied)
  ↓ NO → Route to "continue" (try again)
```

### **Key Features:**

1. **Tool Priority**: Tools are always used when requested
2. **Loop Protection**: Prevents infinite loops with message count limit
3. **Self-Evaluation**: A second LLM call grades the agent's answer against the original query
4. **Intelligent Routing**: Three possible outcomes (action/end/continue)

### **Example Scenarios:**

**Scenario 1: Agent wants to use tools**
```
User: "What is QLoRA?"
Agent: [tool_calls: arxiv_search]
→ Returns "action" → Uses tools
```

**Scenario 2: Agent provides helpful response**
```
User: "What is QLoRA?"
Agent: "QLoRA is a technique for efficient fine-tuning..."
Helpfulness Check: "Y"
→ Returns "end" → Conversation ends
```

**Scenario 3: Agent provides unhelpful response**
```
User: "What is QLoRA?"
Agent: "I don't know about that."
Helpfulness Check: "N"
→ Returns "continue" → Agent tries again
```

**Scenario 4: Too many iterations**
```
User: "What is QLoRA?"
[more than 10 messages in state, last one without tool calls]
→ Returns "end" → Force stop
```
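The four scenarios can be replayed without any API calls. A stdlib sketch of the same decision order, assuming a hypothetical `Msg` class in place of LangChain messages and a stub grader in place of the `gpt-4.1-mini` call; this sketch returns `"end"` on the loop limit so that every outcome is a key of the path map used when the graph is wired up:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Msg:
    """Hypothetical stand-in for a LangChain message."""
    content: str
    tool_calls: list = field(default_factory=list)


def route(messages: list, grade: Callable[[str, str], str], limit: int = 10) -> str:
    """Same decision order as tool_call_or_helpful, with the grader injected."""
    last = messages[-1]
    if last.tool_calls:            # 1. a tool request always wins
        return "action"
    if len(messages) > limit:      # 2. loop limit; "end" is a path-map key
        return "end"
    verdict = grade(messages[0].content, last.content)  # 3. helpfulness grade
    return "end" if "Y" in verdict else "continue"


def stub_grader(query: str, answer: str) -> str:
    """Hypothetical grader: rejects answers that admit ignorance."""
    return "N" if "don't know" in answer else "Y"


q = Msg("What is QLoRA?")
s1 = route([q, Msg("", tool_calls=[{"name": "arxiv_search"}])], stub_grader)
s2 = route([q, Msg("QLoRA is a technique for efficient fine-tuning...")], stub_grader)
s3 = route([q, Msg("I don't know about that.")], stub_grader)
s4 = route([q] + [Msg("I don't know about that.")] * 10, stub_grader)
print(s1, s2, s3, s4)  # action end continue end
```

Injecting the grader as a parameter keeps the routing logic testable in isolation; the real function builds its grading chain inline instead.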

This function lets the agent **grade its own output** and **cap its own iterations**, making the loop more robust than routing on tool calls alone.

##### ✅ Answer:

##### Adding Smart Conditional Routing to the Graph

This cell adds conditional edges out of the "agent" node. After each agent step, LangGraph calls `tool_call_or_helpful` and uses its return value to pick one of three destinations:

- **"continue" → "agent"**: Routes back to the agent node for another processing cycle when the response needs improvement
- **"action" → "action"**: Routes to the tool execution node when the agent wants to use external tools (search, arxiv, etc.)
- **"end" → END**: Terminates the graph execution when a satisfactory response has been provided

This creates a **cyclic workflow** in which the agent can iteratively improve its responses while still deciding when to use tools and when to stop. The route is chosen by `tool_call_or_helpful`, which checks, in order, whether tools were requested, whether the message limit has been reached, and whether the response is helpful.
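Conceptually, the third argument to `add_conditional_edges` is a lookup table from router return values to node names, so every string the router can return needs an exact, case-sensitive key. A stdlib sketch of that lookup (not LangGraph's implementation; `END` here is a stand-in for `langgraph.graph.END`, and `next_node` is hypothetical). In our understanding, LangGraph raises an error for an unmapped value rather than ending the graph:

```python
# Stand-in for langgraph.graph.END so the sketch runs without LangGraph.
END = "__end__"

# Same mapping as the add_conditional_edges call that follows.
path_map = {"continue": "agent", "action": "action", "end": END}


def next_node(route: str) -> str:
    """Resolve a router return value to a destination node."""
    try:
        return path_map[route]
    except KeyError:
        raise ValueError(f"router returned {route!r}, which is not a key of the path map") from None


print(next_node("continue"), next_node("action"), next_node("end"))  # agent action __end__
try:
    next_node("END")  # case matters: "END" is not a key
except ValueError as err:
    print(err)
```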

In [60]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

<langgraph.graph.state.StateGraph at 0x118392e90>

##### ✅ Answer:

##### Completing the Cyclic Graph Structure

This cell adds the final edge that connects the "action" node back to the "agent" node, completing the cyclic workflow of our enhanced LangGraph agent. This edge ensures that after tool execution (such as searching the web or querying arXiv), the results are always sent back to the agent for processing and potential further action.

This creates a **continuous loop** where:
1. Agent decides to use tools → Routes to action node
2. Tools execute and return results → Routes back to agent node
3. Agent processes tool results and decides next steps → Can use more tools or provide final response

This edge is essential for maintaining the **cyclic behavior** that allows the agent to use multiple tools in sequence and iteratively improve its responses based on the information gathered from each tool execution.
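The interplay of the fixed `action → agent` edge and the conditional edge out of `agent` can be traced with a small state machine. A stdlib sketch, assuming a scripted list of router decisions in place of real LLM calls (`run` and its `plan` argument are hypothetical); `max_steps` plays roughly the role of LangGraph's `recursion_limit`, a hard cap on node visits:

```python
from collections import deque


def run(plan: list, max_steps: int = 25) -> list:
    """Walk the agent/action cycle for a scripted list of router decisions.

    `plan` holds hypothetical router outputs ("action", "continue", "end"),
    consumed one per visit to the agent node.
    """
    decisions = deque(plan)
    node, visited = "agent", []
    for _ in range(max_steps):
        visited.append(node)
        if node == "action":
            node = "agent"  # the plain edge: action always hands back to agent
            continue
        decision = decisions.popleft()  # conditional edge out of agent
        if decision == "end":
            return visited
        node = "action" if decision == "action" else "agent"
    raise RuntimeError(f"stopped after {max_steps} steps without reaching the end")


print(run(["action", "action", "continue", "end"]))
# ['agent', 'action', 'agent', 'action', 'agent', 'agent']
```

Every visit to `action` is followed by a visit to `agent`, which is what lets the agent chain several tool calls before answering.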

In [61]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x118392e90>

##### ✅ Answer:

##### Finalizing the Enhanced Agent with Helpfulness Checking

This cell compiles the graph with the features we've added, including helpfulness evaluation and loop limiting. The compiled graph is a runnable agent that cycles through tool usage and response evaluation until the grader accepts an answer or the message limit is reached.

The compiled agent now has:
- **Self-evaluation**: a grader LLM call decides whether responses are helpful
- **Loop protection** to prevent infinite conversation cycles
- **Tool routing** to use external resources when needed
- **Cyclic workflow** that can iteratively improve responses

This turns our basic LangGraph agent into one that can judge, within limits, when it has given a complete and helpful answer to the user's query.

In [62]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

##### ✅ Answer:

##### Testing the Enhanced Agent with Complex Multi-Part Query

This cell demonstrates the enhanced agent's capabilities by testing it with a complex query that requires multiple tool calls and iterations. The agent will now showcase its advanced features:

**What the agent will do:**
1. **Parse the complex query** into multiple sub-questions about LoRA, Tim Dettmers, and Attention mechanisms
2. **Use appropriate tools** (in the run below, the first agent step issues three parallel Tavily searches, one per sub-question)
3. **Evaluate its own helpfulness** each time it answers without requesting tools
4. **Continue iterating** until the grader judges the answer helpful
5. **Respect loop limits** to prevent infinite processing

**Streaming output** with `stream_mode="updates"` yields one chunk per node as it finishes, so we can watch the agent hand off between the "agent" and "action" nodes in real time. The helpfulness check runs inside the routing function rather than in a node, so it never appears as its own update; its LLM call should still be visible in the LangSmith trace.
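The shape of each streamed chunk can be seen without calling any model. A stdlib sketch, assuming a hypothetical async generator `fake_astream` that yields `{node_name: state_update}` dicts the way `astream(..., stream_mode="updates")` does, with message objects replaced by short strings:

```python
import asyncio


async def fake_astream():
    """Hypothetical stand-in for agent.astream(inputs, stream_mode="updates").

    Each chunk maps the node that just finished to the update it returned.
    """
    yield {"agent": {"messages": ["AIMessage with 3 tool_calls"]}}
    yield {"action": {"messages": ["ToolMessage 1", "ToolMessage 2", "ToolMessage 3"]}}
    yield {"agent": {"messages": ["AIMessage with the final answer"]}}


async def main() -> list:
    seen = []
    async for chunk in fake_astream():
        for node, values in chunk.items():  # same loop shape as the notebook cell
            seen.append(node)
            print(f"Receiving update from node: '{node}' ({len(values['messages'])} message(s))")
    return seen


nodes_seen = asyncio.run(main())  # in Jupyter, use `nodes_seen = await main()` instead
```

No chunk is produced for the routing decision itself, which is why the helpfulness verdict has to be inferred from which node runs next.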

In [63]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_cHEdU5aZyza6hJX2faZXab2K', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_xR1mBHYLHAYGOp0x5Y76O9aG', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_DjGHqf06LMTTImkyjlos70OM', 'function': {'arguments': '{"query": "Attention in machine learning"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 79, 'prompt_tokens': 205, 'total_tokens': 284, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': None, 'id': 'chatcmpl-Bsg

### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [64]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [65]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene?"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Prompt engineering is the process of designing and refining prompts to effectively communicate with AI language models, such as GPT, to obtain desired responses. It involves crafting specific, clear, and contextually appropriate prompts to guide the AI's output in a useful and accurate manner.

Prompt engineering has gained significant prominence with the rise of large language models (LLMs) like GPT-3 and GPT-4, which are highly sensitive to the input prompts. The practice became especially notable around 2020-2021, as these models were released and started being widely adopted for various applications. The ability to engineer prompts effectively became a crucial skill for leveraging the full potential of these models.

Would you like more detailed information on its origins, techniques, or current trends?



Retrieval-Augmented Generation (RAG) is an innovative approach in artificial intelligence that combines traditional language models with external data retrieval to produce more a