# Tagging and Extraction


**Tagging**

Tagging takes an unstructured piece of text along with a structure description, and uses the LLM to reason over that text and generate a response in the format of the structure description we provided.

[ Content ] -> [ LLM (with structure description) ] -> [ Structured output object (e.g. JSON) ]

The description given to the LLM can be something along the lines of "hey, extract names and dates from this piece of text".

**Extraction**

Extraction is similar in that it accepts a piece of text and a structure description, but instead of reasoning over the text and responding with a single output, we extract _specific_ entities from the text (as defined by the structure description) and return them as a list.

[ Content ] -> [ LLM (with structure description) ] -> [ List of entities ]


## Tagging


In [None]:
import os
import openai

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from typing import List
from pydantic import BaseModel, Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [None]:
class Tagging(BaseModel):

    """Tag the piece of text with particular info."""  # this is the description

    sentiment: str = Field(
        description="sentiment of text, should be `pos`, `neg`, or `neutral`"
    )
    language: str = Field(description="language of text, (should be ISO 639-1 code)")

In [None]:
convert_pydantic_to_openai_function(Tagging)
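
For orientation, the returned function definition is a plain dictionary. The sketch below writes out roughly what it looks like by hand. The variable name `tagging_function_sketch` is made up for this illustration, and the exact keys (for example extra `title` entries) can vary between LangChain and Pydantic versions, so treat it as an approximation rather than the library's literal output.

```python
import json

# Approximate, hand-written shape of the function definition for `Tagging`.
# This is an assumption for illustration, not output copied from LangChain.
tagging_function_sketch = {
    "name": "Tagging",
    # The class docstring becomes the function description.
    "description": "Tag the piece of text with particular info.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentiment": {
                "type": "string",
                "description": "sentiment of text, should be `pos`, `neg`, or `neutral`",
            },
            "language": {
                "type": "string",
                "description": "language of text, (should be ISO 639-1 code)",
            },
        },
        # Fields without a default value are required.
        "required": ["sentiment", "language"],
    },
}

print(json.dumps(tagging_function_sketch, indent=2))
```

Note that the allowed sentiment values only appear in the free-text `description`, so the model is guided by them but not strictly constrained.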

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

In [None]:
model = ChatOpenAI(temperature=0)

In [None]:
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Think carefully, and then tag the text as instructed."),
        ("user", "{input}"),
    ]
)

In [None]:
model_with_functions = model.bind(
    functions=tagging_functions, function_call={"name": "Tagging"}
)

In [None]:
tagging_chain = prompt | model_with_functions

In [None]:
tagging_chain.invoke({"input": "I love you."})

In [None]:
tagging_chain.invoke({"input": "Eu te odeio."})

The values we want are nested inside the message returned by the chain, in the `arguments` of the function call. That's why we add an `output parser` that extracts `arguments` and parses it as JSON.


In [None]:
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

In [None]:
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [None]:
tagging_chain.invoke({"input": "I love you."})
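
Conceptually, the parser step reads the `arguments` string of the function call and decodes it as JSON. A minimal sketch of that idea with the standard library follows. The `fake_message` dictionary and the helper `parse_function_arguments` are hand-written stand-ins (assumptions for illustration), not a real API response or LangChain's implementation.

```python
import json

# Hand-written stand-in for the function-call part of a chat response.
# In the OpenAI function-calling format, `arguments` arrives as a JSON string.
fake_message = {
    "content": "",
    "function_call": {
        "name": "Tagging",
        "arguments": '{"sentiment": "pos", "language": "en"}',
    },
}


def parse_function_arguments(message: dict) -> dict:
    """Decode the JSON-encoded arguments of a function call."""
    return json.loads(message["function_call"]["arguments"])


parsed = parse_function_arguments(fake_message)
print(parsed)
```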

## Extraction


Extraction is similar to tagging, but it is used for extracting multiple pieces of information.


In [None]:
from typing import Optional


class Person(BaseModel):  # this is a Person schema
    """Information about a person."""

    name: str = Field(description="person's name")
    age: Optional[int] = Field(description="person's age")

Now what we want to do is to extract a list of these objects.


In [None]:
class Information(BaseModel):
    """Information to extract."""

    people: List[Person] = Field(description="List of info about people")

This `Information` class is what we are going to use as an OpenAI function.


In [None]:
import json

json.dumps(convert_pydantic_to_openai_function(Information))

Under the hood, `convert_pydantic_to_openai_function` takes care of converting the Pydantic model into an OpenAI function definition:

```json
{
  "name": "Information",
  "description": "Information to extract.",
  "parameters": {
    "$defs": {
      "Person": {
        "description": "Information about a person.",
        "properties": {
          "name": {
            "description": "person's name",
            "title": "Name",
            "type": "string"
          },
          "age": {
            "anyOf": [
              {
                "type": "integer"
              },
              {
                "type": "null"
              }
            ],
            "description": "person's age",
            "title": "Age"
          }
        },
        "required": ["name", "age"],
        "title": "Person",
        "type": "object"
      }
    },
    "description": "Information to extract.",
    "properties": {
      "people": {
        "description": "List of info about people",
        "items": {
          "description": "Information about a person.",
          "properties": {
            "name": {
              "description": "person's name",
              "title": "Name",
              "type": "string"
            },
            "age": {
              "anyOf": [
                {
                  "type": "integer"
                },
                {
                  "type": "null"
                }
              ],
              "description": "person's age",
              "title": "Age"
            }
          },
          "required": ["name", "age"],
          "title": "Person",
          "type": "object"
        },
        "title": "People",
        "type": "array"
      }
    },
    "required": ["people"],
    "title": "Information",
    "type": "object"
  }
}
```
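
To get a feel for what such a conversion involves, here is a simplified sketch that builds a similar structure from standard-library dataclasses instead of Pydantic models. All names in it (`ToyPerson`, `ToyInformation`, `type_to_schema`, `class_to_schema`, `to_function`) are hypothetical, it handles only a handful of types, and it omits details such as `title` and `$defs`, so it is not how LangChain implements the conversion.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Union, get_args, get_origin

PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def type_to_schema(tp) -> dict:
    """Map a small subset of Python type hints to JSON Schema fragments."""
    if tp in PRIMITIVES:
        return {"type": PRIMITIVES[tp]}
    origin, args = get_origin(tp), get_args(tp)
    if origin is list:  # List[X] -> array of X
        return {"type": "array", "items": type_to_schema(args[0])}
    if origin is Union:  # Optional[X] is Union[X, None]
        return {"anyOf": [type_to_schema(arg) for arg in args]}
    if tp is type(None):
        return {"type": "null"}
    # Anything else is assumed to be a nested dataclass.
    return class_to_schema(tp)


def class_to_schema(cls) -> dict:
    """Describe a dataclass as a JSON Schema object."""
    return {
        "type": "object",
        "description": (cls.__doc__ or "").strip(),
        "properties": {f.name: type_to_schema(f.type) for f in fields(cls)},
        "required": [f.name for f in fields(cls)],
    }


def to_function(cls) -> dict:
    """Wrap the schema in the name/description/parameters envelope."""
    return {
        "name": cls.__name__,
        "description": (cls.__doc__ or "").strip(),
        "parameters": class_to_schema(cls),
    }


@dataclass
class ToyPerson:
    """Information about a person."""

    name: str
    age: Optional[int]


@dataclass
class ToyInformation:
    """Information to extract."""

    people: List[ToyPerson]


toy_function = to_function(ToyInformation)
print(toy_function)
```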


In [None]:
extraction_functions = [convert_pydantic_to_openai_function(Information)]
extraction_model = model.bind(
    functions=extraction_functions,
    function_call={
        # force the model to call the "Information" function
        "name": "Information"
    },
)

In [None]:
extraction_model.invoke("Joe is 30, his mom is Martha")

In this response, Martha's age is `null`. We can handle missing information like this more deliberately by improving the prompt.


In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Extract the relevant information, if not explicitly provided, do not guess. Extract partial info.",
        ),
        ("human", "{input}"),
    ]
)

In [None]:
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [None]:
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})

In [None]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

`JsonKeyOutputFunctionsParser` looks up a single key in the parsed function arguments and returns only the value stored under that key (here, the list under `people`).


In [None]:
extraction_chain = (
    prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")
)

In [None]:
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
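
The effect of selecting a single key can be sketched in a few lines of plain Python. The `arguments` string below is written by hand to mimic a function call's arguments, and `parse_key` is a hypothetical helper, not the LangChain parser itself.

```python
import json

# Hand-written stand-in for the JSON arguments of an "Information" call.
arguments = '{"people": [{"name": "Joe", "age": 30}, {"name": "Martha", "age": null}]}'


def parse_key(arguments_json: str, key_name: str):
    """Decode the arguments and return only the value under `key_name`."""
    return json.loads(arguments_json)[key_name]


people = parse_key(arguments, "people")
print(people)
```

Returning the list directly, rather than a dictionary that wraps it, makes it easier to combine results from several calls later on.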

## Real world example

Extracting information from a blog post.


In [None]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent")
documents = loader.load()

In [None]:
doc = documents[0]

In [None]:
page_content = doc.page_content[:10000]

In [None]:
print(page_content[:1000])

First, we create a class that defines what we want to **tag**. This class will be `Overview`.


In [None]:
class Overview(BaseModel):
    """Overview of a section of text."""

    summary: str = Field(description="Provide a concise summary of the content.")
    language: str = Field(
        description="Provide the language that the content is written in."
    )
    keywords: str = Field(description="Provide keywords related to the content.")

In [None]:
overview_tagging_function = [convert_pydantic_to_openai_function(Overview)]
tagging_model = model.bind(
    functions=overview_tagging_function, function_call={"name": "Overview"}
)
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()

In [None]:
tagging_chain.invoke({"input": page_content})

In [None]:
class Paper(BaseModel):
    """Information about papers mentioned."""

    title: str
    author: Optional[str]


class Info(BaseModel):
    """Information to extract."""

    papers: List[Paper]

In [None]:
paper_extraction_function = [convert_pydantic_to_openai_function(Info)]
extraction_model = model.bind(
    functions=paper_extraction_function,
    function_call={"name": "Info"},
)
extraction_chain = (
    prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")
)

In [None]:
extraction_chain.invoke({"input": page_content})

With this `extraction_chain`, the model gets confused and returns the title and author of the blog post itself, instead of the papers (and their authors) mentioned within the blog post.

So we have to give the model better instructions: extract information about the papers, not the article itself.

We do this by passing a better system message to the prompt.


In [None]:
template = """An article will be passed to you. Extract from it all papers that are mentioned by this article. 

Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.

Do not make up or guess ANY extra information. Only extract what exactly is in the text."""

prompt = ChatPromptTemplate.from_messages([("system", template), ("human", "{input}")])

In [None]:
extraction_chain = (
    prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")
)

In [None]:
extraction_chain.invoke({"input": page_content})

Here we perform a "sanity" check to see whether the model follows the instructions.

Given an input such as "hi", which mentions no papers, the model follows the instruction "Just return an empty list" and an empty list is returned.


In [None]:
extraction_chain.invoke({"input": "hi"})

What if we want to use the whole article as input? So far we have only passed the first 10,000 characters.

The concept behind this solution is **text splitting**.


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `chunk_overlap` is the number of characters that consecutive chunks may share when the text is split into smaller parts.
# For intuition: cutting "Hello World" into plain 5-character windows with an overlap of 2 gives "Hello", "lo Wo", "World", so each chunk repeats the last 2 characters of the previous one.
# (`RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks and spaces, so its actual chunks will look different.)
# Setting `chunk_overlap` to 0, as done here, means there is no overlap between chunks: each chunk is a completely separate piece of the original text.
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)
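
To build intuition for what overlap means, here is a minimal sliding-window splitter in plain Python. It is a deliberately naive sketch: the function `sliding_chunks` is hypothetical and cuts at fixed character positions, whereas `RecursiveCharacterTextSplitter` tries to split on separators first.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(sliding_chunks("Hello World", chunk_size=5, chunk_overlap=2))
# ['Hello', 'lo Wo', 'World']
print(sliding_chunks("Hello World", chunk_size=5, chunk_overlap=0))
# ['Hello', ' Worl', 'd']
```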

In [None]:
splits = text_splitter.split_text(doc.page_content)

In [None]:
len(splits)

We're going to create a chain that:

1. Takes in `page_content`
2. Splits the `page_content` into splits
3. Wraps each split in a dictionary with the `input` key, so it can be passed to the chain
4. Passes each split to the `extraction_chain`
5. Joins all the results together

So, we'll want to create a function that **joins a list of lists**.


In [None]:
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list

In [None]:
flatten([[1, 2], [3, 4]])
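
The same flattening can also be written with the standard library. This is an equivalent alternative shown for comparison, not a change to the chain; the name `flatten_with_itertools` is made up here, and `itertools.chain` is imported under an alias so it does not clash with the `chain` variable used later.

```python
from itertools import chain as iter_chain


def flatten_with_itertools(matrix):
    """Join a list of lists into a single flat list."""
    return list(iter_chain.from_iterable(matrix))


print(flatten_with_itertools([[1, 2], [3, 4]]))  # [1, 2, 3, 4]
```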

In [None]:
print(splits[0])

A `RunnableLambda` is a simple wrapper in LangChain that takes a function and turns it into a `Runnable` object, so it can be used as a step in a chain.


In [182]:
from langchain.schema.runnable import RunnableLambda

In [183]:
prep = RunnableLambda(
    # pre-processing step: takes the whole document text as input
    # and returns one {"input": ...} dictionary per split
    lambda x: [{"input": doc} for doc in text_splitter.split_text(x)]
)

In [185]:
prep.invoke("hi")

[{'input': 'hi'}]
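
To see why wrapping a function is useful, here is a toy version of the idea: a class that wraps a function, exposes `invoke`, and supports `|` for composition. `MiniRunnable` is a hypothetical name for this sketch and is far simpler than LangChain's own classes.

```python
class MiniRunnable:
    """Toy wrapper (not LangChain's class): a function plus `|` support."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        # Run the wrapped function on a single input.
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new runnable that feeds a's output into b.
        nxt = other if isinstance(other, MiniRunnable) else MiniRunnable(other)
        return MiniRunnable(lambda value: nxt.invoke(self.invoke(value)))


to_inputs = MiniRunnable(lambda text: [{"input": part} for part in text.split("\n")])
# A plain function on the right-hand side is wrapped automatically.
count = to_inputs | len

print(to_inputs.invoke("hi"))    # [{'input': 'hi'}]
print(count.invoke("one\ntwo"))  # 2
```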

`extraction_chain` operates on a single element, but `prep` returns a list of elements. That's why we call `map()`, which applies `extraction_chain` to each element of the list.

This produces a list of lists, which `flatten` then joins into a single list.
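
The split, map, and flatten pattern can be tried without an LLM. In the sketch below every name (`toy_prep`, `toy_extract`, `toy_flatten`, `article`) is hypothetical, and a regular expression stands in for the model call, so it only illustrates the data flow, not real extraction quality.

```python
import re


def toy_prep(text: str) -> list:
    """Split the text into lines and wrap each one as chain input."""
    return [{"input": line} for line in text.splitlines() if line.strip()]


def toy_extract(item: dict) -> list:
    """Stand-in for the LLM call: find 'Name et al. YEAR' citations."""
    matches = re.findall(r"(\w+) et al\. (\d{4})", item["input"])
    return [{"author": f"{name} et al.", "year": year} for name, year in matches]


def toy_flatten(matrix: list) -> list:
    """Join a list of lists into one list."""
    flat = []
    for row in matrix:
        flat += row
    return flat


article = (
    "CoT (Wei et al. 2022) helps.\n"
    "No citation here.\n"
    "ReAct (Yao et al. 2023) and LLM+P (Liu et al. 2023)."
)

# Equivalent of `.map()`: run the extractor once per split (list of lists).
per_split = [toy_extract(item) for item in toy_prep(article)]
papers = toy_flatten(per_split)

print(per_split)
print(papers)
```

A split that mentions nothing simply contributes an empty list, which disappears when the results are flattened.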


In [186]:
chain = prep | extraction_chain.map() | flatten

In [187]:
chain.invoke(doc.page_content)

[{'title': 'AutoGPT', 'author': None},
 {'title': 'GPT-Engineer', 'author': None},
 {'title': 'BabyAGI', 'author': None},
 {'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': None},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': None},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': None},
 {'title': 'ReAct (Yao et al. 2023)', 'author': None},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': None},
 {'title': 'Reflexion framework', 'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight', 'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al.'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'ED (expert distillation)', 'author': None},
 {'title': 'RL^2', 'author': 'Duan et al. 2017'},
 {'title': 'LSH: Locality-Sensitive Hashing', 'author': None},
 {'title': 'ANNOY: Approximate Nearest Neighbors Oh Yeah', 'author': None},
 {'title': 'HNSW: Hierarchical Navigable Small World', 'author': No