## Claude 3 Metadata Tagger
---

This notebook uses `Claude 3` on Amazon Bedrock to extract metadata from text. The metadata schema is passed to the model as part of the prompt, and the model returns the extracted fields as JSON so that downstream applications can parse them. The example documents are synthetic.

In [None]:
import json
import boto3
import logging
from typing import Optional, List, Dict

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
CLAUDE_MODEL_ID: str = "anthropic.claude-3-sonnet-20240229-v1:0"
REGION: str = "us-west-2"
ENDPOINT_URL: str = f"https://bedrock-runtime.{REGION}.amazonaws.com"
# Enter your schema below
YOUR_SCHEMA: Dict = {
    "properties": {
        "article_title": {"type": "string"},
        "author": {"type": "string"},
        "topic": {"type": "string", "enum": ["quantum physics",
                                             "classical mechanics",
                                             "thermodynamics",
                                             "relativity",
                                             "other"]},
        "publication_date": {
            "type": "string",
            "description": "The date the article was published, say \"date unknown\" if not found",
        },
    },
    "required": ["article_title", "author", "topic"],
}
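
Because the schema is only described to the model in the prompt, nothing guarantees that the returned JSON respects it. The following is a minimal sketch of a post-hoc check, assuming the schema keeps the shape shown above (`properties`, `required`, optional `enum`). The helper name `validate_metadata` and the trimmed `EXAMPLE_SCHEMA` are hypothetical and not part of the original notebook.

```python
from typing import Any, Dict, List, Optional

# Hypothetical, trimmed copy of the notebook's schema, used only for this sketch
EXAMPLE_SCHEMA: Dict[str, Any] = {
    "properties": {
        "article_title": {"type": "string"},
        "author": {"type": "string"},
        "topic": {"type": "string", "enum": ["quantum physics", "thermodynamics", "other"]},
    },
    "required": ["article_title", "author", "topic"],
}


def validate_metadata(schema: Dict[str, Any], metadata: Optional[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the metadata passed every check."""
    if not isinstance(metadata, dict):
        return ["metadata is missing or is not a JSON object"]
    problems: List[str] = []
    properties = schema.get("properties", {})
    # every field named in "required" must be present
    for field in schema.get("required", []):
        if field not in metadata:
            problems.append(f"missing required field: {field}")
    # every returned field must be declared, have the right type, and respect any enum
    for field, value in metadata.items():
        spec = properties.get(field)
        if spec is None:
            problems.append(f"unexpected field: {field}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{field} should be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{field} has a value outside the enum: {value!r}")
    return problems


print(validate_metadata(EXAMPLE_SCHEMA, {"article_title": "Cosmic Inflation", "topic": "cosmology"}))
```

This only covers the schema features used in this notebook; a full JSON Schema validator would be the better choice for anything more elaborate.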

In [None]:
# Example generated by Claude 3 Sonnet
ORIGINAL_DOCS: List[str] = [
    "Title: Quantum Entanglement\nAuthor: John Doe\n\nAn in-depth look at the fascinating world of quantum entanglement. Published 2022-03-04.",
    "Title: The Laws of Thermodynamics\nAuthor: Jane Smith\n\nA comprehensive guide to understanding the fundamental laws of thermodynamics",
    "Title: Cosmic Inflation\nAuthor: Michael Johnson\n\nExploring the theory of cosmic inflation and its implications for our universe",
    "Title: Neutrino Physics\nAuthor: Emily Wilson\n\nDiscovering the elusive nature of neutrinos and their role in particle physics",
    "Title: Astrobiology\nAuthor: David Thompson\n\nInvestigating the potential for life beyond Earth and the search for extraterrestrial intelligence",
    "Title: Evolutionary Genetics\nAuthor: Sarah Davis\n\nUnraveling the mysteries of how life evolves through the lens of genetics",
    "Title: Plate Tectonics\nAuthor: William Anderson\n\nUnderstanding the dynamic forces that shape the Earth's surface",
    "Title: Origin of Life\nAuthor: Robert Brown\n\nUnraveling the mysteries of how life began on Earth. Data published: March 1674.",
    "Title: Relativity and Spacetime\nAuthor: Amanda Miller\n\nUnderstanding Einstein's groundbreaking theories of relativity",
    "Title: Dark Matter and Dark Energy\nAuthor: Daniel Harris\n\nInvestigating the enigmatic components that shape our universe",
    "Title: Molecular Nanotechnology\nAuthor: Sophia Clark\n\nEngineering at the molecular scale for revolutionary applications"
]

In [None]:
METADATA_TAGGER_PROMPT: str = """
You are a metadata tagger. Read the schema below and extract the metadata fields it lists from the data that follows.
At a minimum, provide every field named in the schema's "required" list. Respond with a single JSON object and nothing else:
no tags, no code fences, and no explanatory text.

<schema>
{schema}
</schema>

<data>
{text}
</data>
"""

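One way to nudge the model toward bare JSON is to prefill the start of its answer: the Messages API accepts a trailing `assistant` message, and the model continues from that text. The sketch below builds such a request body with an opening brace as the prefill. The helper names `build_request_body` and `restore_prefill` are hypothetical, and the parameter values simply mirror the ones used elsewhere in this notebook.

```python
import json
from typing import Any, Dict


def build_request_body(prompt: str, prefill: str = "{", max_tokens: int = 2000) -> str:
    """Serialize a Messages API request whose last turn is a partial assistant answer."""
    body: Dict[str, Any] = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": 0.1,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]},
            # the model continues from this text, so the reply starts inside a JSON object
            {"role": "assistant", "content": [{"type": "text", "text": prefill}]},
        ],
    }
    return json.dumps(body)


def restore_prefill(completion: str, prefill: str = "{") -> str:
    """The returned text does not repeat the prefill, so add it back before parsing."""
    return prefill + completion


request_body = build_request_body("Extract the metadata for this document.")
print(json.loads(request_body)["messages"][-1]["role"])
```

The prefill should not end in trailing whitespace, and the completion has to be passed through `restore_prefill` before it is handed to `json.loads`.
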
In [None]:
# Build the prompt sent to Claude
def _generate_prompt(schema: Dict, text: str) -> Optional[str]:
    """
    Return the metadata tagger prompt with the schema (serialized as JSON)
    and the input text filled in, or None if formatting fails.
    """
    prompt: Optional[str] = None
    try:
        prompt = METADATA_TAGGER_PROMPT.format(schema=json.dumps(schema,
                                                                 indent=2),
                                               text=text)
    except Exception as e:
        logger.error(f"could not build the prompt from the schema and text: {e}")
    return prompt

In [None]:
# Extract metadata from a single document with Claude
def extract_metadata_from_text(schema: Dict, data: str) -> Optional[Dict]:
    """
    Ask Claude to extract the fields defined in `schema` from `data`.
    Returns the parsed metadata as a dictionary, or None if the prompt could
    not be built, the model call fails, or the response is not valid JSON.
    """
    prompt = _generate_prompt(schema, data)
    if prompt is None:
        return None
    logger.debug(f"prompt: {prompt}")
    body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "temperature": 0.1,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })

    try:
        bedrock = boto3.client(service_name="bedrock-runtime",
                               endpoint_url=ENDPOINT_URL)
        response = bedrock.invoke_model(modelId=CLAUDE_MODEL_ID,
                                        body=body)
        response_body = json.loads(response['body'].read().decode("utf-8"))
        # read the response text as string and then convert it into a dictionary
        response_as_str: str = response_body['content'][0]['text']
        # make it easy to convert to JSON by removing unnecessary formatting
        response_as_str = response_as_str.strip().replace("\n", " ").replace("\t", " ")
        llm_response = json.loads(response_as_str)
    except Exception as e:
        logger.error(f"exception={e}")
        llm_response = None
    return llm_response
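
The parsing step above assumes the response text is nothing but a JSON object. If the model adds a sentence before the JSON or trailing text after it, `json.loads` raises and the document ends up with no metadata. A more tolerant parser could look like the sketch below, which uses the standard library's `json.JSONDecoder.raw_decode` to read one object starting at the first opening brace. The function name `parse_json_object` is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def parse_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in `text`, or None if none can be decoded."""
    start = text.find("{")
    if start == -1:
        return None
    decoder = json.JSONDecoder()
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        parsed, _end = decoder.raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


print(parse_json_object('Here is the metadata: {"author": "Jane Smith"} Let me know if you need more.'))
```

The sketch gives up if the first brace in the text does not start a valid object, so it is a fallback for mildly noisy output rather than a general JSON extractor.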

In [None]:
# ready to extract metadata from docs
enhanced_documents = []
for doc in ORIGINAL_DOCS:
    metadata = extract_metadata_from_text(YOUR_SCHEMA, doc)
    enhanced_doc = {
        "page_content": doc,
        "metadata": metadata
    }
    enhanced_documents.append(enhanced_doc)

logger.info("==== original content with extracted metadata ====")
for doc in enhanced_documents:
    logger.info(f"Data: {doc['page_content']}\n\nMetadata: {json.dumps(doc['metadata'], indent=4)}\n\n")
    logger.info("---------------\n")
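
Since extraction returns `None` on any failure, it can help to see at a glance how many documents were tagged and which ones were not. The following sketch counts topics and collects the indices of failed documents. It assumes the list has the same shape as `enhanced_documents`; the function name `summarize_topics` and the sample records are hypothetical and are not real model output.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize_topics(documents: List[Dict[str, Any]]) -> Tuple[Counter, List[int]]:
    """Count extracted topics and list the indices of documents without usable metadata."""
    topic_counts: Counter = Counter()
    failed: List[int] = []
    for index, doc in enumerate(documents):
        metadata = doc.get("metadata")
        if not isinstance(metadata, dict):
            failed.append(index)
            continue
        topic_counts[metadata.get("topic", "missing")] += 1
    return topic_counts, failed


# Hypothetical records shaped like the entries of enhanced_documents
sample_documents = [
    {"page_content": "doc A", "metadata": {"topic": "relativity"}},
    {"page_content": "doc B", "metadata": None},
    {"page_content": "doc C", "metadata": {"topic": "relativity"}},
]
counts, failed_indices = summarize_topics(sample_documents)
print(counts, failed_indices)
```

The failed indices can then be used to retry just those documents instead of rerunning the whole batch.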