# Combining corpus tool functions with LLM processing

### Overview

* Proof-of-concept demonstration of using an agent in `langchain` together with some simple Python functions to carry out basic text analysis steps
  - tokenization
  - creating lists of n-grams
  - creating ranked frequency lists

In [46]:
from langchain.agents import create_agent
from langchain.tools import tool, ToolRuntime
from langchain.chat_models import init_chat_model
from langgraph.checkpoint.memory import InMemorySaver
from langchain.agents.structured_output import ToolStrategy

from dataclasses import dataclass

import os
from collections import Counter

from dotenv import load_dotenv
_ = load_dotenv()

### Function definitions

* These are some simple functions for basic text analysis
* They can be passed as tools to any LLM that supports tool calling
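
* A rough illustration of why the docstrings and default arguments in these functions matter: to expose a function to a model, a tool-calling framework needs a name, a description, and a parameter list. The sketch below collects those with the standard `inspect` module. `count_tokens` and `describe_tool` are hypothetical names used only here, and this is not a reproduction of how `langchain` builds its tool schemas; it only shows what information a plain Python function makes available.

```python
import inspect

def count_tokens(text, lowercase=False):
    '''count whitespace-separated tokens in a string (hypothetical example function)'''
    if lowercase:
        text = text.lower()
    return len(text.split())

def describe_tool(func):
    '''collect the name, docstring and parameters of a function (illustrative helper only)'''
    params = {}
    for name, p in inspect.signature(func).parameters.items():
        # a parameter without a default value must be supplied by the caller
        required = p.default is inspect.Parameter.empty
        params[name] = {"required": required,
                        "default": None if required else p.default}
    return {"name": func.__name__,
            "description": inspect.getdoc(func),
            "parameters": params}

spec = describe_tool(count_tokens)
print(spec["name"])
print(spec["description"])
print(spec["parameters"])
```

* A function with a vague or missing docstring gives the model less to go on when deciding which tool to call, which is one reason to keep the `Args` and `Returns` sections accurate.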

In [36]:
def read_text_file(fpath):
    '''read a text file from a file path and return the text as a string

    Args:
        fpath      -- a filepath to the file to read

    Returns:
        String contents of the file, or None if the file does not exist
    '''

    if not os.path.exists(fpath):
        return None

    # use a context manager so the file handle is closed after reading
    with open(fpath) as f:
        return f.read()

In [14]:
def tokenize(text, lowercase=False, strip_chars=''):
    '''create a list of tokens from a string by splitting on whitespace and applying optional normalization 
    
    Args:
        text        -- a string object containing the text to be tokenized
        lowercase   -- should text string be normalized as lowercase (default: False)
        strip_chars -- a string indicating characters to strip out of text, e.g. punctuation (default: empty string) 
        
    Return:
        A list of tokens
    '''
    
    # build a translation table that deletes every
    # character listed in strip_chars
    rdict = str.maketrans('','',strip_chars)
    
    if lowercase:
        text = text.lower()
    
    tokens = text.translate(rdict).split()
    
    return tokens

In [24]:
def get_ngram_tokens(tokens, n=1):
    '''create a list of n-gram tokens from a list of tokens
    
    Args:
        tokens -- a list of tokens
        n      -- the size of the window to use to build n-gram token list
        
    Returns:
        list of n-gram strings (whitespace separated) of length n;
        the input list is returned unchanged if n < 2 or n > len(tokens)
    '''
    
    if n<2 or n>len(tokens):
        return tokens
    
    new_tokens = []
    
    for i in range(len(tokens)-n+1):
        new_tokens.append(" ".join(tokens[i:i+n]))
        
    return new_tokens

In [44]:
def frequency_list(items, top=10):
    '''return the top N items in a list of items or from a Counter object

    Args:
        items     -- either a sequence (e.g. list) or a Counter object
        top       -- number of items to return from the top of the list (default: 10)

    Return:
        list of `top` pairs (item, counts)
    '''
    if not isinstance(items, Counter):
        items = Counter(items)

    return items.most_common(top)

### Concept

* If we were to use these functions in Python to answer a question such as:
  - _What are the 10 most common bigrams in the text `threebears.txt`?_
  <br/>
  we could nest the functions like this:

In [50]:
frequency_list(
    get_ngram_tokens(
        tokenize(
            read_text_file('threebears.txt'),
            lowercase=True,
            strip_chars="!'\".,"
        ),
        n=2
    )
)

[('someones been', 9),
 ('in the', 7),
 ('so she', 6),
 ('bear someones', 6),
 ('in my', 6),
 ('is too', 4),
 ('the three', 4),
 ('three bears', 4),
 ('she tasted', 3),
 ('tasted the', 3)]

* Very easy and efficient... **IF** you are comfortable with Python!
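
* The nested call can also be written one step at a time, which makes each intermediate result visible. The sketch below uses compact re-definitions of `tokenize`, `get_ngram_tokens` and `frequency_list` so that it runs on its own, and a short made-up `sample` string in place of `threebears.txt` (the sample text is an assumption for illustration, not taken from the file).

```python
from collections import Counter

def tokenize(text, lowercase=False, strip_chars=''):
    # optionally lowercase, delete unwanted characters, split on whitespace
    if lowercase:
        text = text.lower()
    return text.translate(str.maketrans('', '', strip_chars)).split()

def get_ngram_tokens(tokens, n=1):
    # slide a window of size n over the token list
    if n < 2 or n > len(tokens):
        return tokens
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def frequency_list(items, top=10):
    # count the items and keep the `top` most frequent ones
    return Counter(items).most_common(top)

# made-up sample text, used here instead of reading threebears.txt
sample = "The porridge is too hot. The porridge is too cold!"

tokens = tokenize(sample, lowercase=True, strip_chars="!'\".,")
bigrams = get_ngram_tokens(tokens, n=2)
top_bigrams = frequency_list(bigrams, top=3)

print(tokens)
print(bigrams)
print(top_bigrams)
# last line printed: [('the porridge', 2), ('porridge is', 2), ('is too', 2)]
```

* This is the same sequence of steps the agent has to discover for itself: each tool's output becomes the next tool's input.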

### Define agent

In [51]:


# Define an 'agent' - just a specific LLM with a list of tools (functions) it can use
agent = create_agent(
    model="claude-3-7-sonnet-latest",
    tools=[ read_text_file, 
            tokenize, 
            get_ngram_tokens,
            frequency_list
          ],
    system_prompt="You are a helpful corpus and text analysis assistant. ALWAYS TRUST THE TOOLS YOU HAVE AND ACCEPT THEIR ANSWERS.",
)



In [52]:

query = {"messages": [{"role": "user", 
                       "content": """From threebears.txt give me the top 10 bigrams
                                     (lowercase and no punctuation)
                                   """}]}

for chunk in agent.stream(query):
    print(f"\n{'='*20}\n", chunk)


 {'model': {'messages': [AIMessage(content=[{'text': 'I\'ll help you find the top 10 bigrams from the "threebears.txt" file, with text converted to lowercase and punctuation removed.\n\nFirst, I need to read the file, then tokenize it with lowercase and punctuation removal, create bigrams from those tokens, and finally get the frequency counts.', 'type': 'text'}, {'id': 'toolu_01Umt9YBkLBkuu8su3gQAdvz', 'input': {'fpath': 'threebears.txt'}, 'name': 'read_text_file', 'type': 'tool_use'}], additional_kwargs={}, response_metadata={'id': 'msg_01F3NemyrxrNbrWwoM6Dw94h', 'model': 'claude-3-7-sonnet-20250219', 'stop_reason': 'tool_use', 'stop_sequence': None, 'usage': {'cache_creation': {'ephemeral_1h_input_tokens': 0, 'ephemeral_5m_input_tokens': 0}, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 891, 'output_tokens': 129, 'server_tool_use': None, 'service_tier': 'standard'}, 'model_name': 'claude-3-7-sonnet-20250219', 'model_provider': 'anthropic'}, id='lc_

* Extracting just the LLM message content from the final step (after the loop finishes, `chunk` still holds the last streamed update)

In [55]:
print(chunk['model']['messages'][0].content)

Here are the top 10 bigrams from "threebears.txt" with text converted to lowercase and punctuation removed:

1. "someones been" (9 occurrences)
2. "in the" (7 occurrences)
3. "so she" (6 occurrences)
4. "bear someones" (6 occurrences)
5. "in my" (6 occurrences)
6. "is too" (4 occurrences)
7. "the three" (4 occurrences)
8. "three bears" (4 occurrences)
9. "she tasted" (3 occurrences)
10. "tasted the" (3 occurrences)

These bigrams reflect the repetitive structure of the Goldilocks story, with phrases like "someone's been" appearing frequently as the three bears discover what Goldilocks has done in their home.
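
* In the streamed output above, the `content` of the first message is a list of blocks (a `text` block followed by a `tool_use` block), whereas the final answer printed as plain text. A minimal sketch of a helper that handles both shapes is shown below. `message_text` is a hypothetical name, and the sample blocks are shortened stand-ins modelled on the output above, not real API responses.

```python
def message_text(content):
    '''return the plain text from message content (illustrative helper only)

    Args:
        content -- either a string or a list of content blocks (dicts or strings)
    '''
    if isinstance(content, str):
        return content

    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            # keep text blocks, skip tool_use and any other block types
            parts.append(block.get("text", ""))
    return "\n".join(parts)

# shortened stand-ins for the two content shapes seen in the output above
blocks = [
    {"text": "First, I need to read the file.", "type": "text"},
    {"id": "toolu_example", "input": {"fpath": "threebears.txt"},
     "name": "read_text_file", "type": "tool_use"},
]

print(message_text(blocks))
print(message_text("Here are the top 10 bigrams."))
```

* With a helper like this, the text of every step could be printed inside the streaming loop without special-casing the steps that contain tool calls.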
