# Tagging and Extraction 

Using OpenAI functions.

This is one of the major use cases for AI. Let's go through the process.

## Tagging

We have seen that the LLM, given a function description, selects arguments from the input text and generates structured output that forms a function call.

More generally, the LLM can evaluate the input text and generate structured output.

Example: we need to generate the sentiment of a text that is written in Spanish.


## Extraction

We are going to extract specific entities from the text. These entities are represented by a structure description.

- Given an input schema, the LLM has been tuned to find and fill in the parameters of that schema.
- The capability is not limited to function schemas.
- This can be used for general purposes.

In this case, rather than outputting a single element, we ask the model to output a list of these elements, as many as it can find in the text. For example, a list of the papers mentioned in an article.
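To make the "one element vs. a list of elements" idea concrete, here is a small hand-written sketch. The names `paper_schema` and `papers_schema` and the sample arguments string are made up for illustration; nothing here is produced by a library or by the model.

```python
import json

# Hypothetical schema for ONE element (a paper mentioned in an article).
paper_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
    },
    "required": ["title"],
}

# Extraction wraps that element in an array, so the answer can hold
# zero, one, or many of them.
papers_schema = {
    "type": "object",
    "properties": {
        "papers": {"type": "array", "items": paper_schema},
    },
    "required": ["papers"],
}

# A made-up arguments string with the shape such a schema asks for.
sample_arguments = (
    '{"papers": [{"title": "Paper A", "author": "Someone"}, {"title": "Paper B"}]}'
)
papers = json.loads(sample_arguments)["papers"]
titles = [paper["title"] for paper in papers]
print(titles)  # ['Paper A', 'Paper B']
```

The only structural difference from tagging is the `array` wrapper around the element schema.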


In [21]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [22]:
from typing import List
from pydantic import BaseModel, Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [23]:
class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")

Above, the field descriptions literally tell the model how to tag: they carry the instructions it needs to complete the task.

In [24]:
convert_pydantic_to_openai_function(Tagging)

{'name': 'Tagging',
 'description': 'Tag the piece of text with particular info.',
 'parameters': {'title': 'Tagging',
  'description': 'Tag the piece of text with particular info.',
  'type': 'object',
  'properties': {'sentiment': {'title': 'Sentiment',
    'description': 'sentiment of text, should be `pos`, `neg`, or `neutral`',
    'type': 'string'},
   'language': {'title': 'Language',
    'description': 'language of text (should be ISO 639-1 code)',
    'type': 'string'}},
  'required': ['sentiment', 'language']}}

In [25]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

We want `temperature=0` because we need the tagging to be spot on, so we ask for the least random output.

In [26]:
model = ChatOpenAI(temperature=0)

In [27]:
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

In [28]:
# creating the template
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])

In [29]:
# binding the functions to the model
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)
# we force it to always do the tagging

In [30]:
tagging_chain = prompt | model_with_functions

In [31]:
tagging_chain.invoke({"input": "I love langchain"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{"sentiment":"pos","language":"en"}'}})

In [32]:
# adding another example
tagging_chain.invoke({"input": "non mi piace questo cibo"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{"sentiment":"neg","language":"it"}'}})

What we actually want is to be able to parse the output, which is why we take the next steps.
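Conceptually, parsing here means reading the `arguments` string out of the message and decoding it as JSON. The sketch below does that by hand on a plain dict copied from the output above. It only illustrates the idea and is not a description of how `JsonOutputFunctionsParser` is implemented; `parse_function_arguments` is a hypothetical helper name.

```python
import json

# The additional_kwargs of the AIMessage shown above, as a plain dict.
additional_kwargs = {
    "function_call": {
        "name": "Tagging",
        "arguments": '{"sentiment":"neg","language":"it"}',
    }
}

def parse_function_arguments(kwargs: dict) -> dict:
    """Decode the JSON-encoded arguments of a function call."""
    return json.loads(kwargs["function_call"]["arguments"])

tags = parse_function_arguments(additional_kwargs)
print(tags)  # {'sentiment': 'neg', 'language': 'it'}
```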

In [33]:
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

In [34]:
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [35]:
tagging_chain.invoke({"input": "non mi piace questo cibo"})

{'sentiment': 'neg', 'language': 'it'}

Now we can see the dict extracted is much more usable after parsing.
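One thing to keep in mind: the field descriptions only ask the model for `pos`, `neg`, or `neutral` and for an ISO 639-1 code; the `str` type itself does not enforce either. A minimal sketch of a check on the parsed dict, where `validate_tags` and `ALLOWED_SENTIMENTS` are hypothetical names and not part of LangChain:

```python
ALLOWED_SENTIMENTS = {"pos", "neg", "neutral"}

def validate_tags(tags: dict) -> list:
    """Return a list of problems found in a tagging result (empty if none)."""
    problems = []
    sentiment = tags.get("sentiment")
    if sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {sentiment!r}")
    language = tags.get("language")
    # ISO 639-1 codes are two lowercase letters. This checks the shape only,
    # not whether the code is actually assigned to a language.
    looks_like_code = (
        isinstance(language, str)
        and len(language) == 2
        and language.isascii()
        and language.isalpha()
        and language.islower()
    )
    if not looks_like_code:
        problems.append(f"language does not look like an ISO 639-1 code: {language!r}")
    return problems

print(validate_tags({"sentiment": "neg", "language": "it"}))  # []
print(len(validate_tags({"sentiment": "negative", "language": "Italian"})))  # 2
```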

## Extraction

Extraction is similar to tagging, but used for extracting multiple pieces of information.

First we need to define the information we need to extract, generating a `person schema`.

What we really want is a list of these objects, so we define an `Information` object that holds a list of `Person` entries and, in turn, has its own `information schema`.

In [36]:
from typing import Optional
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")
    age: Optional[int] = Field(description="person's age")

In [37]:
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

In [38]:
convert_pydantic_to_openai_function(Information)

{'name': 'Information',
 'description': 'Information to extract.',
 'parameters': {'title': 'Information',
  'description': 'Information to extract.',
  'type': 'object',
  'properties': {'people': {'title': 'People',
    'description': 'List of info about people',
    'type': 'array',
    'items': {'title': 'Person',
     'description': 'Information about a person.',
     'type': 'object',
     'properties': {'name': {'title': 'Name',
       'description': "person's name",
       'type': 'string'},
      'age': {'title': 'Age',
       'description': "person's age",
       'type': 'integer'}},
     'required': ['name']}}},
  'required': ['people']}}

The conversion took care of resolving the nested `Person` model, putting all the information that was needed into that single JSON schema.
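A detail worth noticing in the schema printed above: `age` is missing from the inner `required` list, which matches its `Optional[int]` declaration. The sketch below walks a trimmed copy of that schema and lists the fields that are not required. `optional_fields` is a hypothetical helper written for this note, and it only handles the `object` and `array` shapes that appear here.

```python
# Trimmed copy of the 'parameters' part of the schema printed above.
parameters = {
    "type": "object",
    "properties": {
        "people": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name"],
            },
        }
    },
    "required": ["people"],
}

def optional_fields(schema: dict, prefix: str = "") -> list:
    """List dotted paths of properties that are not marked as required."""
    if schema.get("type") == "array":
        # Descend into the element schema, marking the path as a list.
        return optional_fields(schema["items"], prefix + "[]")
    required = set(schema.get("required", []))
    found = []
    for name, sub_schema in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if name not in required:
            found.append(path)
        found.extend(optional_fields(sub_schema, path))
    return found

print(optional_fields(parameters))  # ['people[].age']
```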

Now we are going to set up our extraction chain.

In [39]:
extraction_functions = [convert_pydantic_to_openai_function(Information)]
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})

In [40]:
extraction_model.invoke("Joe is 30, his mom is Martha")

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{"people":[{"name":"Joe","age":30},{"name":"Martha"}]}'}})

In [41]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")
])

In [42]:
extraction_chain = prompt | extraction_model

In [43]:
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{"people":[{"name":"Joe","age":30},{"name":"Martha"}]}'}})

In [44]:
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [45]:
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

{'people': [{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]}

In [46]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

In [47]:
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")

In [48]:
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

[{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]

## Doing it for real

We can apply tagging to a larger body of text.

For example, let's load this blog post and extract tag information from a sub-set of the text.

In [49]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
documents = loader.load()

In [50]:
doc = documents[0]

In [51]:
page_content = doc.page_content[:10000]

In [52]:
print(page_content[:1000])

LLM Powered Autonomous Agents | Lil'Log

Lil'Log

|

Posts

Archive

Search

Tags

FAQ

emojisearch.app

LLM Powered Autonomous Agents

Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents



Agent System Overview

Component One: Planning

Task Decomposition

Self-Reflection


Component Two: Memory

Types of Memory

Maximum Inner Product Search (MIPS)


Component Three: Tool Use

Case Studies

Scientific Discovery Agent

Generative Agents Simulation

Proof-of-Concept Examples


Challenges

Citation

References





Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful gene

In [53]:
class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.")
    language: str = Field(description="Provide the language that the content is written in.")
    keywords: str = Field(description="Provide keywords related to the content.")

In [54]:
overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}
)
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()

In [55]:
tagging_chain.invoke({"input": page_content})

{'summary': 'This text discusses the concept of building autonomous agents powered by LLM (large language model) as the core controller. It covers components such as planning, memory, and tool use, along with techniques like task decomposition and self-reflection. Various proof-of-concept examples and challenges are also mentioned.',
 'language': 'English',
 'keywords': 'LLM, autonomous agents, planning, memory, tool use, task decomposition, self-reflection, proof-of-concept, challenges'}

After tagging, we want to extract the papers mentioned in the article.

In [56]:
class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str
    author: Optional[str]


class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper]

In [57]:
paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}
)
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

In [58]:
extraction_chain.invoke({"input": page_content})

[{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]

This is confusing, because these are the title and author of the article we passed in, not the list of papers it mentions.

The model is getting confused, so we need to give it better, more explicit instructions.

In [59]:
template = """A article will be passed to you. Extract from it all papers that are mentioned by this article follow by its author. 

Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.

Do not make up or guess ANY extra information. Only extract what exactly is in the text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", "{input}")
])

In [60]:
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

In [61]:
extraction_chain.invoke({"input": page_content})

[{'title': 'Chain of thought (CoT; Wei et al. 2022)'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)'},
 {'title': 'LLM+P (Liu et al. 2023)'},
 {'title': 'ReAct (Yao et al. 2023)'},
 {'title': 'Reflexion (Shinn & Labash 2023)'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)'}]

In [62]:
extraction_chain.invoke({"input": "hi"})

[]

### Now extracting over the full article

Before, we were only processing a section. To handle the full article, we are going to use text splitting to ease the task.

This way each piece stays under the token window.
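To illustrate the idea of splitting, here is a deliberately simplified fixed-size chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks), and it counts characters rather than tokens. `chunk_text` and its parameters are hypothetical names for this sketch.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Split text into pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

pieces = chunk_text("abcdefghij", chunk_size=4)
print(pieces)  # ['abcd', 'efgh', 'ij']
```

With `chunk_overlap=0`, as in the notebook, no text is repeated between pieces, so nothing is extracted twice just because of the split.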

In [63]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)

In [64]:
splits = text_splitter.split_text(doc.page_content)

In [65]:
len(splits)

15

This function flattens the list of lists that we get back when we process the splits.

In [66]:
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list

In [67]:
flatten([[1, 2], [3, 4]])

[1, 2, 3, 4]
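The same one-level flattening can also be written with the standard library. This is just an alternative sketch; `flatten_iter` is a hypothetical name, and the import is aliased so it does not clash with a variable named `chain`.

```python
from itertools import chain as iter_chain

def flatten_iter(matrix):
    """Flatten one level of nesting using the standard library."""
    return list(iter_chain.from_iterable(matrix))

print(flatten_iter([[1, 2], [3, 4]]))  # [1, 2, 3, 4]
print(flatten_iter([[], [5]]))         # [5]
```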

We need to add a step that prepares the splits to be passed into the chain.

The chain takes an input variable (a dict with an `input` key), but the list of splits is just text, so we need to create a list of dictionaries with that text under the `input` key.

For this, we define a function, and because it is the first step in the chain, we wrap it in a `RunnableLambda`, a simple wrapper in LangChain that converts it into a runnable object so that we can pipe it together with the rest.

In [69]:
print(splits[0])

LLM Powered Autonomous Agents | Lil'Log

Lil'Log

|

Posts

Archive

Search

Tags

FAQ

emojisearch.app

LLM Powered Autonomous Agents

Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents



Agent System Overview

Component One: Planning

Task Decomposition

Self-Reflection


Component Two: Memory

Types of Memory

Maximum Inner Product Search (MIPS)


Component Three: Tool Use

Case Studies

Scientific Discovery Agent

Generative Agents Simulation

Proof-of-Concept Examples


Challenges

Citation

References





Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general pr

In [70]:
from langchain.schema.runnable import RunnableLambda

In [71]:
prep = RunnableLambda(
    lambda x: [{"input": doc} for doc in text_splitter.split_text(x)]
)

This literally takes a string and creates a list of dictionaries. The example below is super simple.

In [73]:
prep.invoke("hi")

[{'input': 'hi'}]
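The overall data flow (prepare the inputs, run the extraction on each one, flatten the results) can be sketched in plain Python. Everything here is a stand-in: `fake_split` replaces the text splitter and `fake_extract` replaces the extraction chain, so no model is called, and the names are made up for this sketch.

```python
def fake_split(text: str) -> list:
    """Stand-in for the text splitter: one chunk per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]

def prep(text: str) -> list:
    """Turn raw text into the list of {'input': ...} dicts the chain expects."""
    return [{"input": chunk} for chunk in fake_split(text)]

def fake_extract(inputs: dict) -> list:
    """Stand-in for the extraction chain: capitalized words become 'titles'."""
    return [{"title": word} for word in inputs["input"].split() if word.istitle()]

def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list

def run(text: str) -> list:
    # Apply the extraction to every prepared input, then merge the lists.
    per_chunk = [fake_extract(item) for item in prep(text)]
    return flatten(per_chunk)

print(run("see Alpha and Beta\nnothing here\nalso Gamma"))
# [{'title': 'Alpha'}, {'title': 'Beta'}, {'title': 'Gamma'}]
```

Note that a chunk with nothing to extract contributes an empty list, which flattening simply drops.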

In [74]:
chain = prep | extraction_chain.map() | flatten

In [76]:
chain.invoke(doc.page_content)

KeyError: 'papers'
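One plausible cause, which the error message alone does not confirm, is that for at least one chunk the model returned arguments without a `papers` key, and a strict key lookup cannot handle that. The sketch below shows a tolerant lookup on plain dicts. `get_papers` is a hypothetical helper, and the `parsed_chunks` data is made up. In the chain it would presumably run on the output of `JsonOutputFunctionsParser()`, taking the place of `JsonKeyOutputFunctionsParser(key_name="papers")`.

```python
def get_papers(parsed: dict) -> list:
    """Return the 'papers' list, or an empty list if it is missing or malformed."""
    papers = parsed.get("papers")
    return papers if isinstance(papers, list) else []

# Made-up parsed outputs for three chunks; the second one lacks the key.
parsed_chunks = [
    {"papers": [{"title": "Paper A"}]},
    {},
    {"papers": []},
]
results = [get_papers(chunk) for chunk in parsed_chunks]
print(results)  # [[{'title': 'Paper A'}], [], []]
```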