# Combining corpus tool functions with LLM processing

### Overview

* Proof-of-concept demonstration of using an agent in `langchain` together with some simple Python functions to carry out basic text analysis steps
  - tokenization
  - create lists of n-grams
  - create ranked frequency lists

In [26]:
#pip install langchain --upgrade

In [27]:
from langchain.agents import create_agent
from langchain.tools import tool, ToolRuntime
from langchain.chat_models import init_chat_model
from langgraph.checkpoint.memory import InMemorySaver
from langchain.agents.structured_output import ToolStrategy

from dataclasses import dataclass

import os
from collections import Counter

from dotenv import load_dotenv
_ = load_dotenv()

### Function definitions

* These are some simple functions for basic text analysis
* They can be used as tools by any LLM that supports tool calling

In [130]:
@tool
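def read_text_file(fpath: str) -> str | None:
    '''read a text file from a file path and return the text as a string

    Args:
        fpath      -- a filepath to the file to read

    Returns:
        String contents of the file, or None if the file does not exist
    '''

    if os.path.exists(fpath):
        # use a context manager so the file handle is closed promptly
        with open(fpath, encoding='utf-8') as f:
            return f.read()
    else:
        return None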

In [131]:
@tool
def tokenize(text: str, lowercase: bool = False, strip_chars: str = '') -> list[str]:
    '''create a list of tokens from a string by splitting on whitespace and applying optional normalization 
    
    Args:
        text        -- a string object containing the text to be tokenized
        lowercase   -- should text string be normalized as lowercase (default: False)
        strip_chars -- a string indicating characters to strip out of text, e.g. punctuation (default: empty string) 
        
    Return:
        A list of tokens
    '''
    
    # build a translation table that deletes every
    # character listed in **strip_chars**
    rdict = str.maketrans('','',strip_chars)
    
    if lowercase:
        text = text.lower()
    
    tokens = text.translate(rdict).split()
    
    return tokens

In [132]:
@tool
def get_ngram_tokens(tokens: list[str], n: int = 1) -> list[str]:
    '''create a list of n-gram tokens from a list of tokens
    
    Args:
        tokens -- a list of tokens
        n      -- the size of the window to use to build n-gram token list (default: 1)
        
    Returns:
        list of n-gram strings (whitespace separated) of length n;
        the input tokens are returned unchanged if n < 2 or n > len(tokens)
    '''
    
    if n<2 or n>len(tokens):
        return tokens
    
    new_tokens = []
    
    for i in range(len(tokens)-n+1):
        new_tokens.append(" ".join(tokens[i:i+n]))
        
    return new_tokens

In [133]:
@tool
def frequency_list(items, top=10):
    '''return the top N items in a list of items or from a Counter object

    Args:
        items     - either a sequence (e.g. list) or a Counter object
        top       - number of items to return from the top of the list (default 10)

    Return:
        list of `top` pairs (item, counts)
    '''
    if not isinstance(items, Counter):
        items = Counter(items)

    return items.most_common(top)

In [134]:
frequency_list(
    get_ngram_tokens(
        tokenize(
            read_text_file('threebears.txt'),
        lowercase=True, strip_chars="!'\".,"),
        n=2
    )
)   

TypeError: 'StructuredTool' object is not callable
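
The `TypeError` above is expected: `@tool` replaces each function with a `StructuredTool` object, which cannot be called like a plain function. The following is a minimal sketch of the same pipeline using undecorated stand-ins. The `*_plain` names and the sample sentence are hypothetical, introduced only for illustration, and are not part of the notebook.

```python
from collections import Counter


def tokenize_plain(text, lowercase=False, strip_chars=""):
    """Split on whitespace after optional lowercasing and character stripping."""
    if lowercase:
        text = text.lower()
    return text.translate(str.maketrans("", "", strip_chars)).split()


def ngrams_plain(tokens, n=1):
    """Join each window of n adjacent tokens into one space-separated string."""
    if n < 2 or n > len(tokens):
        return tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_plain(items, top=10):
    """Return the `top` most common items as (item, count) pairs."""
    return Counter(items).most_common(top)


# Hypothetical sample text standing in for the contents of threebears.txt
sample = "The cat sat. The cat ran. The dog sat."

top_bigrams = frequency_plain(
    ngrams_plain(tokenize_plain(sample, lowercase=True, strip_chars="."), n=2),
    top=3,
)
print(top_bigrams)
```

With the decorated versions, LangChain tools are normally run through `.invoke()` with a dict of arguments, for example `tokenize.invoke({"text": "..."})`. Check the LangChain tools documentation for the version you have installed.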

### Concept

To answer a question such as the following in plain Python:
- _What are the 10 most common bigrams in the text `threebears.txt`?_

we could nest the functions like this (using the plain, undecorated functions, since `@tool` objects are not directly callable):

```python
frequency_list(
    get_ngram_tokens(
        tokenize(
            read_text_file('threebears.txt'),
        lowercase=True, strip_chars="!'\".,"),
        n=2
    )
)   
```

* Very easy and efficient... **IF** you are comfortable with Python!

### Define model and agent

______________________

## Conclusion 🐄 🐍 🧑‍⚖️

### AWS Bedrock & LangChain Agent Tool


### AWS Cost Explorer -- Monitoring Bedrock Costs

Useful `jq` filters for your AWS Cost Explorer output:

> Note: costs do not show up immediately, so you may need to wait a day or two after incurring them before they appear in Cost Explorer.

```bash
# Show only Bedrock services among all services with costs
named_profile=atn-developer
region=us-east-1
start_date=2025-11-01
end_date=2025-11-28

# Cost Explorer accepts DAILY, MONTHLY, or HOURLY granularity (not WEEKLY)
aws ce get-cost-and-usage \
  --time-period "Start=$start_date,End=$end_date" \
  --granularity MONTHLY \
  --metrics "BlendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --region "$region" \
  --profile "$named_profile" \
  | jq '.ResultsByTime[].Groups[] | select(.Keys[0] | test("Bedrock"; "i")) | {service: .Keys[0], cost: .Metrics.BlendedCost.Amount}'
```
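
The same filtering can be done in Python if you save the Cost Explorer JSON and want a total per service. This is a rough sketch only: the `bedrock_costs` helper and the sample payload are hypothetical, and the payload is merely shaped like the `ResultsByTime` / `Groups` / `Keys` / `Metrics` structure that the `jq` filter above navigates.

```python
import json
from decimal import Decimal


def bedrock_costs(ce_output, metric="BlendedCost"):
    """Sum the cost per service whose name contains 'bedrock' (case-insensitive)."""
    totals = {}
    for period in ce_output.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            if "bedrock" in service.lower():
                # Amounts arrive as strings; Decimal avoids float rounding
                amount = Decimal(group["Metrics"][metric]["Amount"])
                totals[service] = totals.get(service, Decimal("0")) + amount
    return totals


# Hypothetical payload with made-up amounts, for illustration only
sample_json = """
{"ResultsByTime": [
  {"Groups": [
    {"Keys": ["Amazon Bedrock"],
     "Metrics": {"BlendedCost": {"Amount": "0.01", "Unit": "USD"}}},
    {"Keys": ["Amazon Simple Storage Service"],
     "Metrics": {"BlendedCost": {"Amount": "1.50", "Unit": "USD"}}}
  ]},
  {"Groups": [
    {"Keys": ["Amazon Bedrock"],
     "Metrics": {"BlendedCost": {"Amount": "0.02", "Unit": "USD"}}}
  ]}
]}
"""

totals = bedrock_costs(json.loads(sample_json))
for service, amount in totals.items():
    print(f"{service}: {amount}")
```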

#### Cost of AWS Bedrock

- On 10/01/2025 this was `"cost": "0.03760425"`. Costs from that day's testing runs should appear the following day.
- After various runs, the total came to about $0.10: `"cost": "0.10061025"`

In [135]:
# https://docs.langchain.com/oss/python/langchain/models#initialize-a-model
model_provider='bedrock_converse'
model_id = 'openai.gpt-oss-120b-1:0'
max_tokens = 128000
aws_profile = 'atn-developer'
temperature = 0.2

In [136]:
model = init_chat_model(
    model_id,
    model_provider=model_provider,
    credentials_profile_name=aws_profile,
    max_tokens=max_tokens,
    temperature=temperature
)

response = model.invoke('What is the capital of France?')
print('Model reasoning:', response.content[0]['reasoning_content']['text'])
print('Model answer:', response.content[1]['text'])

Model reasoning: The user asks a simple factual question. Answer: Paris.
Model answer: The capital of France is **Paris**.


In [137]:
from langchain.agents import create_agent
from langchain.agents.middleware import wrap_tool_call
from langchain.messages import ToolMessage


@wrap_tool_call
def handle_tool_errors(request, handler):
    """Handle tool execution errors with custom messages."""
    try:
        return handler(request)
    except Exception as e:
        # Return a custom error message to the model
        return ToolMessage(
            content=f"Tool error: Please check your input and try again. ({str(e)})",
            tool_call_id=request.tool_call["id"]
        )

In [138]:
from dataclasses import dataclass
from langchain.agents import create_agent
from langchain.agents.middleware import dynamic_prompt, ModelRequest


@dataclass
class ContextSchema:
    user_name: str

@dynamic_prompt
def personalized_prompt(request: ModelRequest) -> str:  
    user_name = request.runtime.context.user_name
    return f"You are a helpful assistant. Address the user as {user_name}."

In [139]:
# Define an 'agent' - just a specific LLM with a list of tools (functions) it can use
agent = create_agent(
    #model="claude-3-7-sonnet-latest",
    model=model,
    tools=[ read_text_file, 
            tokenize, 
            get_ngram_tokens,
            frequency_list
          ],
    middleware=[handle_tool_errors],
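    # system_prompt is a plain string, not a template: a "{user_name}" placeholder
    # would be sent to the model literally, so it is not used here.
    # Note: the `personalized_prompt` middleware defined above is not registered
    # in `middleware`, so it has no effect on this agent.
    system_prompt="You are a helpful corpus and text analysis assistant. ALWAYS TRUST THE TOOLS YOU HAVE, ACCEPT THEIR ANSWERS, AND PRESENT THEM TO THE USER.",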
    context_schema=ContextSchema
)


In [140]:
import subprocess

# capture output of `whoami` in Python

whoami = subprocess.run(['whoami'], capture_output=True, text=True).stdout.strip()
print(whoami)

ejacquot


In [173]:
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What is the capital of France?"}]},
    context=ContextSchema(user_name=whoami)
)
result

{'messages': [HumanMessage(content='What is the capital of France?', additional_kwargs={}, response_metadata={}, id='de662701-5805-48f8-9200-f9c4d8faabc3'),
  AIMessage(content=[{'type': 'reasoning_content', 'reasoning_content': {'text': 'User asks a simple factual question: capital of France is Paris. No need for tools. Provide answer.', 'signature': ''}}, {'type': 'text', 'text': 'The capital of France is **Paris**.'}], additional_kwargs={}, response_metadata={'ResponseMetadata': {'RequestId': '146cff2e-31df-4ef1-b6f9-19c46185cd45', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 28 Nov 2025 06:08:16 GMT', 'content-type': 'application/json', 'content-length': '390', 'connection': 'keep-alive', 'x-amzn-requestid': '146cff2e-31df-4ef1-b6f9-19c46185cd45'}, 'RetryAttempts': 0}, 'stopReason': 'end_turn', 'metrics': {'latencyMs': [1507]}, 'model_name': 'openai.gpt-oss-120b-1:0'}, id='lc_run--486b3043-5d5b-4043-8bf8-8321b162a953-0', usage_metadata={'input_tokens': 530, 'output_tokens':

In [174]:
result['messages'][1].content[0]['reasoning_content']['text']

'User asks a simple factual question: capital of France is Paris. No need for tools. Provide answer.'

In [175]:
result['messages'][1].content[1]['text']

'The capital of France is **Paris**.'
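
Indexing `content[0]` and `content[1]` relies on the reasoning block always coming first and both blocks always being present. A slightly more defensive sketch selects blocks by their `type` field instead. The helper name `split_content` is hypothetical, and the block layout is assumed to match the output printed above.

```python
def split_content(content):
    """Return (reasoning_text, answer_text) from a message's content."""
    # Some models return a plain string rather than a list of blocks
    if isinstance(content, str):
        return "", content
    reasoning, answer = [], []
    for block in content:
        if block.get("type") == "reasoning_content":
            reasoning.append(block["reasoning_content"].get("text", ""))
        elif block.get("type") == "text":
            answer.append(block.get("text", ""))
    return "\n".join(reasoning), "\n".join(answer)


# Content blocks copied from the agent output shown above
example = [
    {
        "type": "reasoning_content",
        "reasoning_content": {
            "text": "User asks a simple factual question: capital of France "
                    "is Paris. No need for tools. Provide answer.",
            "signature": "",
        },
    },
    {"type": "text", "text": "The capital of France is **Paris**."},
]

reasoning, answer = split_content(example)
print(answer)
```

In the notebook this would be called as `split_content(result['messages'][1].content)`.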

In [176]:
query = {"messages": [{"role": "user", 
                       "content": """From threebears.txt give me the top 10 bigrams
                                     (lowercase and no punctuation)
                                   """}]}

for chunk in agent.stream(query,
                          context=ContextSchema(user_name=whoami)):
    print(f"\n{'='*20}\n", chunk)


 {'model': {'messages': [AIMessage(content=[{'type': 'reasoning_content', 'reasoning_content': {'text': 'We need to read threebears.txt, tokenize lowercasing and stripping punctuation, then get bigram tokens (n=2), then frequency list top 10. Use functions.Now we have text. Need to see content.We need to replace placeholder with actual text from read_text_file result. Let\'s capture.We need to capture the result.We need to see the content. Possibly the tool returns a JSON with "result". Let\'s try again but maybe the environment expects us to assign. Try reading.It seems not returning visible. Maybe we need to simulate? Could be that the file is not present. But likely present. Let\'s try to list files? Not possible. Maybe we need to assume content? But we need accurate bigrams. Could be a known text "Three Bears" story. Let\'s assume it\'s the classic Goldilocks story. But better to actually read. Perhaps the tool returns a string but we need to capture. Let\'s try to assign to varia

* Extracting just the final step and the LLM message content

In [177]:
print(chunk['model']['messages'][0].content)

[{'type': 'reasoning_content', 'reasoning_content': {'text': 'The user wants top 10 bigrams (lowercase, no punctuation) from threebears.txt. We have already computed bigrams and frequency list. The frequency list returned earlier shows many bigrams but not sorted by count. The last frequency_list call gave a list of bigrams with counts? Actually the last call returned a list of bigrams (strings) without counts. Then we called frequency_list on that list? The tool frequency_list expects items either a sequence or Counter. If given a list, it will treat each element as an item and count frequencies. So we gave the list of bigrams (strings) and got frequencies: the result shows top 10 pairs (item, count). It gave:\n\n[["someones been", 9], ["in the", 7], ["so she", 6], ["bear someones", 6], ["in my", 6], ["is too", 4], ["the three", 4], ["three bears", 4], ["she tasted", 3], ["tasted the", 3]]\n\nThus top 10 bigrams with counts. That seems correct. We should present to user. Also note tha

In [178]:
chunk.keys()

dict_keys(['model'])
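
Printing every raw chunk produces very long output. A minimal sketch of a more compact view follows, assuming (as the printed output suggests) that each chunk maps a node name such as `'model'` to an update containing a `messages` list. The `summarize_chunk` helper and the stand-in message object are hypothetical.

```python
from types import SimpleNamespace


def summarize_chunk(chunk, width=80):
    """Return one short line per message in a streamed update chunk."""
    lines = []
    for node, update in chunk.items():
        for msg in (update or {}).get("messages", []):
            content = getattr(msg, "content", "")
            if isinstance(content, list):
                # Keep only plain text blocks; reasoning blocks are skipped
                content = " ".join(
                    block.get("text", "")
                    for block in content
                    if isinstance(block, dict) and block.get("type") == "text"
                )
            # Collapse whitespace and truncate to keep each line short
            text = " ".join(str(content).split())
            lines.append(f"[{node}] {type(msg).__name__}: {text[:width]}")
    return lines


# Stand-in object for illustration; real chunks hold LangChain message objects
fake_chunk = {
    "model": {
        "messages": [
            SimpleNamespace(content=[
                {"type": "reasoning_content",
                 "reasoning_content": {"text": "thinking"}},
                {"type": "text", "text": "Here are the top bigrams."},
            ])
        ]
    }
}

for line in summarize_chunk(fake_chunk):
    print(line)
```

Inside the streaming loop, `print("\n".join(summarize_chunk(chunk)))` could replace the raw `print` of each chunk.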

In [179]:
print('System reasoning content:\n\n', chunk['model']['messages'][0].content[0]['reasoning_content']['text'])

System reasoning content:

 The user wants top 10 bigrams (lowercase, no punctuation) from threebears.txt. We have already computed bigrams and frequency list. The frequency list returned earlier shows many bigrams but not sorted by count. The last frequency_list call gave a list of bigrams with counts? Actually the last call returned a list of bigrams (strings) without counts. Then we called frequency_list on that list? The tool frequency_list expects items either a sequence or Counter. If given a list, it will treat each element as an item and count frequencies. So we gave the list of bigrams (strings) and got frequencies: the result shows top 10 pairs (item, count). It gave:

[["someones been", 9], ["in the", 7], ["so she", 6], ["bear someones", 6], ["in my", 6], ["is too", 4], ["the three", 4], ["three bears", 4], ["she tasted", 3], ["tasted the", 3]]

Thus top 10 bigrams with counts. That seems correct. We should present to user. Also note that punctuation removed and lowercased. 

In [180]:
print(chunk['model']['messages'][0].content[1]['text'])

Here are the ten most frequent bigrams in **threebears.txt** (all lower-cased and with punctuation stripped), together with how many times each occurs:

| Rank | Bigram | Frequency |
|------|--------|-----------|
| 1 | **someones been** | 9 |
| 2 | **in the** | 7 |
| 3 | **so she** | 6 |
| 4 | **bear someones** | 6 |
| 5 | **in my** | 6 |
| 6 | **is too** | 4 |
| 7 | **the three** | 4 |
| 8 | **three bears** | 4 |
| 9 | **she tasted** | 3 |
|10 | **tasted the** | 3 |

These counts were obtained by tokenizing the text (lower-casing and removing punctuation), forming all adjacent two-word sequences (bigrams), and then counting how often each bigram appears.


In [181]:
query = {"messages": [{"role": "user", 
                       "content": """From sample.txt give me the top 10 bigrams
                                     (lowercase and no punctuation)
                                   """}]}

for chunk in agent.stream(query,
                          context=ContextSchema(user_name=whoami)):
    print(f"\n{'='*20}\n", chunk)


 {'model': {'messages': [AIMessage(content=[{'type': 'reasoning_content', 'reasoning_content': {'text': 'We need to read sample.txt, tokenize with lowercase and strip punctuation, then get bigram tokens (n=2), then frequency list top 10. Use functions.Now we have text. Need to see output.We need to feed the text from read_text_file. The placeholder <result of previous> should be actual content. Let\'s capture the result.It seems we need to actually get the content. The tool will return.Probably the tool will return a string. Let\'s assume it returns something like "..." We\'ll capture.It seems the tool hasn\'t responded yet. Possibly need to wait.We might need to call once and capture result.We need to see the result.Assuming we get text. Let\'s simulate: Suppose the content is something. But we need actual. Probably the environment has sample.txt. Let\'s try again.We need to see output.Given constraints, maybe the tool returns now.Assume the tool returned a string variable. Let\'s pr

In [182]:
print('System reasoning content:\n\n', chunk['model']['messages'][0].content[0]['reasoning_content']['text'])

print(chunk['model']['messages'][0].content[1]['text'])

System reasoning content:

 We have the top 10 bigrams with counts. Need to present to user. Provide list.
Here are the **top 10 most frequent bigrams** in **sample.txt** (all lower-cased and stripped of punctuation), together with how many times each appears:

| Rank | Bigram | Frequency |
|------|--------|-----------|
| 1 | **to the** | 3 |
| 2 | **it is** | 3 |
| 3 | **we can** | 3 |
| 4 | **can not** | 3 |
| 5 | **the people** | 3 |
| 6 | **a new** | 2 |
| 7 | **dedicated to** | 2 |
| 8 | **we are** | 2 |
| 9 | **in a** | 2 |
|10 | **a great** | 2 |

These counts were obtained by:

1. Reading the file `sample.txt`.  
2. Tokenizing the text to lower-case words while removing punctuation.
3. Generating all 2-gram (bigram) tokens.
4. Counting occurrences and selecting the ten most common.

Let me know if you'd like the full list of bigrams, a different n-gram size, or any other analysis!


In [None]:
# let's create new functions that use spaCy for POS tagging and named entity recognition
# very simple functions with clean output make it easier for the AI model to use them effectively
@tool
def pos_tagging(text):
    '''perform part-of-speech tagging on input text and return list of (token, POS) pairs
    
    Args:
        text -- a string object containing the text to be tagged
        
    Returns:
        list of (token, POS) pairs
    '''
    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]


@tool
def named_entity_recognition(text):
    '''perform named entity recognition on input text and return list of (entity, label) pairs
    
    Args:
        text -- a string object containing the text to be analyzed
        
    Returns:
        list of (entity, label) pairs
    '''
    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]