# Automatic Metadata Extraction

In this tutorial, we show how to extract metadata automatically to improve retrieval results.
We use two extractors: a `QuestionsAnsweredExtractor`, which generates questions that a piece of text can answer, and a `SummaryExtractor`, which summarizes not only the current text but also the adjacent texts.



## Setup

In [20]:
import nest_asyncio
nest_asyncio.apply()

In [21]:
import os
# Optional: load OPENAI_API_KEY from a .env file (adjust the path to your setup)
# from dotenv import load_dotenv, find_dotenv
# load_dotenv('D:/.env')
# OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

## Define Metadata Extractors

Here we define two metadata extractors:
- QuestionsAnsweredExtractor
- SummaryExtractor

In [27]:
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode

In [28]:
llm = OpenAI(temperature=0.1, model="gpt-4o", max_tokens=512)  # or model="gpt-3.5-turbo"

Next, we instantiate the node parser, the `SummaryExtractor`, and the `QuestionsAnsweredExtractor`.

In [29]:
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import SummaryExtractor, QuestionsAnsweredExtractor

# Node parser
node_parser = TokenTextSplitter(separator=" ", chunk_size=256, chunk_overlap=128)

# Question Answer Extractor
question_answer_extractor = QuestionsAnsweredExtractor(questions=3, llm=llm, metadata_mode=MetadataMode.EMBED)

# Summary Extractor
summary_extractor = SummaryExtractor(summaries=["prev", "self", "next"], llm=llm)
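
The `summaries=["prev", "self", "next"]` option asks for a summary of each node and of its neighbours. The sketch below illustrates that idea in plain Python, with a stub that keeps the first sentence in place of an LLM call. `stub_summarize` and `attach_summaries` are hypothetical helpers, not LlamaIndex functions. The `prev_section_summary` and `next_section_summary` keys mirror the extractor output shown later in this notebook, and `section_summary` is assumed for the "self" case.

```python
def stub_summarize(text):
    """Stand-in for an LLM call: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def attach_summaries(chunks, summaries=("prev", "self", "next")):
    """Return one metadata dict per chunk holding the requested summaries."""
    own = [stub_summarize(chunk) for chunk in chunks]
    metadata = []
    for i in range(len(chunks)):
        meta = {}
        # The first chunk has no predecessor, the last has no successor.
        if "prev" in summaries and i > 0:
            meta["prev_section_summary"] = own[i - 1]
        if "self" in summaries:
            meta["section_summary"] = own[i]
        if "next" in summaries and i < len(chunks) - 1:
            meta["next_section_summary"] = own[i + 1]
        metadata.append(meta)
    return metadata


chunks = [
    "Evals measure performance. They come first.",
    "RAG adds external knowledge. It reduces hallucination.",
    "Fine-tuning adapts a model. It needs data.",
]
result = attach_summaries(chunks)
for meta in result:
    print(meta)
```

Each chunk is summarized once and the result is reused for its neighbours, so the number of summarization calls stays equal to the number of chunks.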

## Load in Data, Run Extractors

We load Eugene Yan's essay (https://eugeneyan.com/writing/llm-patterns/) using the `UnstructuredURLLoader` from the LlamaHub web readers.

We then run our extractors.

In [30]:
from llama_index.readers.web import UnstructuredURLLoader

In [31]:
loader = UnstructuredURLLoader(urls=["https://eugeneyan.com/writing/llm-patterns/"])

In [32]:
documents = loader.load_data()

In [33]:
len(documents)

1

In [34]:
print(documents[0].get_content())

eugeneyan

Start Here

Writing

Speaking

Prototyping

About

Patterns for Building LLM-based Systems & Products

[
llm
engineering
production
🔥
]
 · 66 min read

Discussions on HackerNews, Twitter, and LinkedIn

“There is a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a car self-driving around a block, but making it into a product takes a decade.” - Karpathy

This write-up is about practical patterns for integrating large language models (LLMs) into systems & products. We’ll build on academic research, industry resources, and practitioner know-how, and distill them into key ideas and practices.

There are seven key patterns. They’re also organized along the spectrum of improving performance vs. reducing cost/risk, and closer to the data vs. closer to the user.

Evals: To measure performance

RAG: To add recent, external knowledge

Fine-tuning: To get better at specific tas

In [35]:
orig_nodes = node_parser.get_nodes_from_documents(documents)

In [36]:
len(orig_nodes)

171
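
The node count depends on the splitter's window size and overlap. With `chunk_size=256` and `chunk_overlap=128`, each window starts 128 tokens after the previous one, so every token except those near the edges lands in two nodes. The sketch below is a rough illustration of that sliding window, assuming whitespace-style tokens. `sliding_chunks` is a hypothetical helper, not the `TokenTextSplitter` implementation, which uses a real tokenizer and so will not produce identical counts.

```python
def sliding_chunks(tokens, chunk_size=256, chunk_overlap=128):
    """Split tokens into windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return chunks


tokens = [f"w{i}" for i in range(1000)]
chunks = sliding_chunks(tokens)
print(len(chunks), [len(c) for c in chunks])
```

A larger overlap gives each node more shared context with its neighbours, at the cost of more nodes and therefore more LLM calls during extraction.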

In [37]:
# take just these 8 nodes for testing
nodes = orig_nodes[20:28]

In [38]:
print(nodes[3].get_content(metadata_mode="all"))

source: https://eugeneyan.com/writing/llm-patterns/

LLM and ask it to generate a CoT of evaluation steps. Then, to evaluate coherence in news summarization, they concatenate the prompt, CoT, news article, and summary and ask the LLM to output a score between 1 to 5. Finally, they use the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result.

Overview of G-Eval (source)

They found that GPT-4 as an evaluator had a high Spearman correlation with human judgments (0.514), outperforming all previous methods. It also outperformed traditional metrics on aspects such as coherence, consistency, fluency, and relevance. On topical chat, it did better than traditional metrics such as ROUGE-L, BLEU-4, and BERTScore across several criteria such as naturalness, coherence, engagingness, and groundedness.

The Vicuna paper adopted a similar approach. They start by defining eight categories (writing, roleplay, extraction, reasoning

### Run metadata extractors

In [39]:
# process nodes with metadata extractors
nodes_1 = summary_extractor(nodes)


100%|██████████| 8/8 [00:17<00:00,  2.14s/it]


In [40]:
nodes_1[3].to_dict()

{'id_': 'e0008d03-0cc6-4c4f-8d55-cead0ff186a9',
 'embedding': None,
 'metadata': {'source': 'https://eugeneyan.com/writing/llm-patterns/',
  'prev_section_summary': 'The section discusses the use of large language models (LLMs) as a reference-free metric to evaluate other LLMs, eliminating the need for human judgments or gold references. It introduces G-Eval, a framework that uses LLMs with Chain-of-Thought (CoT) and a form-filling paradigm to evaluate LLM outputs. The process involves providing a task introduction and evaluation criteria to an LLM, generating a CoT of evaluation steps, and then using the LLM to output a score. The score is then normalized and a weighted summation is taken as the final result. The section also mentions that GPT-4, used as an evaluator, had a high correlation with human judgments, outperforming all previous methods and traditional metrics.',
  'next_section_summary': 'The section discusses the performance of GPT-4 in evaluating the quality of answers ge

In [44]:
nodes_1[3].metadata

{'source': 'https://eugeneyan.com/writing/llm-patterns/',
 'prev_section_summary': 'The section discusses the use of large language models (LLMs) as a reference-free metric to evaluate other LLMs, eliminating the need for human judgments or gold references. It introduces G-Eval, a framework that uses LLMs with Chain-of-Thought (CoT) and a form-filling paradigm to evaluate LLM outputs. The process involves providing a task introduction and evaluation criteria to an LLM, generating a CoT of evaluation steps, and then using the LLM to output a score. The score is then normalized and a weighted summation is taken as the final result. The section also mentions that GPT-4, used as an evaluator, had a high correlation with human judgments, outperforming all previous methods and traditional metrics.',
 'next_section_summary': 'The section discusses the performance of GPT-4 in evaluating the quality of answers generated by chatbots. It outperformed previous methods and traditional metrics in as

In [45]:
print(nodes_1[3].text)

LLM and ask it to generate a CoT of evaluation steps. Then, to evaluate coherence in news summarization, they concatenate the prompt, CoT, news article, and summary and ask the LLM to output a score between 1 to 5. Finally, they use the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result.

Overview of G-Eval (source)

They found that GPT-4 as an evaluator had a high Spearman correlation with human judgments (0.514), outperforming all previous methods. It also outperformed traditional metrics on aspects such as coherence, consistency, fluency, and relevance. On topical chat, it did better than traditional metrics such as ROUGE-L, BLEU-4, and BERTScore across several criteria such as naturalness, coherence, engagingness, and groundedness.

The Vicuna paper adopted a similar approach. They start by defining eight categories (writing, roleplay, extraction, reasoning, math, coding, STEM, and humanities/social science) 

In [46]:
nodes_1 = question_answer_extractor(nodes_1)


100%|██████████| 8/8 [00:12<00:00,  1.57s/it]


In [47]:
nodes_1[3].metadata

{'source': 'https://eugeneyan.com/writing/llm-patterns/',
 'prev_section_summary': 'The section discusses the use of large language models (LLMs) as a reference-free metric to evaluate other LLMs, eliminating the need for human judgments or gold references. It introduces G-Eval, a framework that uses LLMs with Chain-of-Thought (CoT) and a form-filling paradigm to evaluate LLM outputs. The process involves providing a task introduction and evaluation criteria to an LLM, generating a CoT of evaluation steps, and then using the LLM to output a score. The score is then normalized and a weighted summation is taken as the final result. The section also mentions that GPT-4, used as an evaluator, had a high correlation with human judgments, outperforming all previous methods and traditional metrics.',
 'next_section_summary': 'The section discusses the performance of GPT-4 in evaluating the quality of answers generated by chatbots. It outperformed previous methods and traditional metrics in as

In [48]:
print(nodes_1[3].metadata["questions_this_excerpt_can_answer"])

1. How does the G-Eval framework use a Large Language Model (LLM) to evaluate coherence in news summarization?
2. What was the Spearman correlation between GPT-4's evaluations and human judgments, and how did this compare to previous methods?
3. What approach did the Vicuna paper adopt for evaluating chatbots, and what categories were defined for this process?
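
The extracted questions are stored as a single numbered string. If you want them as a list (for example, to use each question as a test query), one possible approach is to split on the numbering. This is a minimal sketch: `parse_numbered_questions` is a hypothetical helper, and it assumes the LLM keeps the `1. ...` one-question-per-line format seen above, which is not guaranteed.

```python
import re


def parse_numbered_questions(raw):
    """Split a '1. ...' / '2. ...' multi-line string into a list of questions."""
    questions = []
    for line in raw.splitlines():
        # Accept "1." or "1)" prefixes and ignore lines without numbering.
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions


raw = (
    "1. How does the G-Eval framework use a Large Language Model (LLM) "
    "to evaluate coherence in news summarization?\n"
    "2. What was the Spearman correlation between GPT-4's evaluations and "
    "human judgments, and how did this compare to previous methods?\n"
    "3. What approach did the Vicuna paper adopt for evaluating chatbots, "
    "and what categories were defined for this process?\n"
)
questions = parse_numbered_questions(raw)
for question in questions:
    print(question)
```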


### Visualize some sample data

In [49]:
print(nodes_1[3].get_content(metadata_mode="all"))

[Excerpt from document]
source: https://eugeneyan.com/writing/llm-patterns/
prev_section_summary: The section discusses the use of large language models (LLMs) as a reference-free metric to evaluate other LLMs, eliminating the need for human judgments or gold references. It introduces G-Eval, a framework that uses LLMs with Chain-of-Thought (CoT) and a form-filling paradigm to evaluate LLM outputs. The process involves providing a task introduction and evaluation criteria to an LLM, generating a CoT of evaluation steps, and then using the LLM to output a score. The score is then normalized and a weighted summation is taken as the final result. The section also mentions that GPT-4, used as an evaluator, had a high correlation with human judgments, outperforming all previous methods and traditional metrics.
next_section_summary: The section discusses the performance of GPT-4 in evaluating the quality of answers generated by chatbots. It outperformed previous methods and traditional metri

In [50]:
print(nodes_1[1].get_content(metadata_mode="all"))

[Excerpt from document]
source: https://eugeneyan.com/writing/llm-patterns/
prev_section_summary: The section discusses the differences in the evaluation approach of three benchmarks: Original MMLU, HELM, and EleutherAI. Each uses a different method to predict probabilities and select answers. The Original MMLU focuses on the answers only, HELM uses the next token probabilities from the model, and EleutherAI computes the probability of the full answer sequence. The author notes that these differences can cause fluctuations in absolute scores and model ranking, making model metrics incomparable unless the evaluation's implementation is identical. The author also mentions that the QLoRA author found MMLU overly sensitive and untrustworthy.
next_section_summary: The section discusses the use of large language models (LLMs) as a reference-free metric to evaluate other LLMs, eliminating the need for human judgments or gold references. It introduces G-Eval, a framework that uses LLMs with Ch
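
The `metadata_mode` argument controls which metadata is rendered alongside the node text. The sketch below is a simplified model of that behaviour, not the LlamaIndex implementation: `SketchNode` is a hypothetical class, the mode names are plain strings, and the `key: value` layout is an assumption modelled on the output above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    text: str
    metadata: dict
    excluded_embed_keys: list = field(default_factory=list)
    excluded_llm_keys: list = field(default_factory=list)

    def get_content(self, metadata_mode="none"):
        """Render text, optionally prefixed by the metadata visible in this mode."""
        if metadata_mode == "none":
            return self.text
        if metadata_mode == "embed":
            hidden = set(self.excluded_embed_keys)
        elif metadata_mode == "llm":
            hidden = set(self.excluded_llm_keys)
        elif metadata_mode == "all":
            hidden = set()
        else:
            raise ValueError(f"unknown metadata_mode: {metadata_mode!r}")
        lines = [f"{k}: {v}" for k, v in self.metadata.items() if k not in hidden]
        if not lines:
            return self.text
        return "\n".join(lines) + "\n\n" + self.text


node = SketchNode(
    text="They found that GPT-4 as an evaluator had a high Spearman correlation with human judgments.",
    metadata={
        "source": "https://eugeneyan.com/writing/llm-patterns/",
        "questions_this_excerpt_can_answer": "1. What was the Spearman correlation between GPT-4's evaluations and human judgments?",
    },
    excluded_embed_keys=["source"],
)
print(node.get_content("all"))
print("---")
print(node.get_content("embed"))
```

Separating what the embedding model sees from what the LLM sees lets you keep noisy fields such as URLs out of the embedding while still showing them in the final prompt.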

## Set Up the RAG Query Engine


In [51]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node, display_response

In [53]:
orig_nodes[0].metadata

{'source': 'https://eugeneyan.com/writing/llm-patterns/'}

In [54]:
index = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])
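
The index is built from the original node list with the eight enriched nodes spliced back into positions 20 to 27, so document order and total node count are preserved. A small sketch of that splice using plain strings in place of nodes (`splice_enriched` is a hypothetical helper introduced only for illustration):

```python
def splice_enriched(all_nodes, enriched, start, stop):
    """Replace all_nodes[start:stop] with enriched, keeping order and length."""
    if len(enriched) != stop - start:
        raise ValueError("enriched must have exactly stop - start items")
    return all_nodes[:start] + enriched + all_nodes[stop:]


all_nodes = [f"node-{i}" for i in range(171)]
enriched = [f"{name}+metadata" for name in all_nodes[20:28]]
merged = splice_enriched(all_nodes, enriched, 20, 28)
print(len(merged), merged[19:21], merged[27:29])
```

The length check guards against an off-by-one in the slice bounds, which would otherwise silently duplicate or drop a node.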

In [55]:
query_engine = index.as_query_engine()

### Querying

In [56]:
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)

response = query_engine.query(query_str)
display_response(response, source_length=1000)

**`Final Response:`** There are metrics used to evaluate text generation quality, which can be grouped into two categories: context-dependent and context-free. Context-dependent metrics consider the context and are task-specific, requiring adjustments for different tasks. On the other hand, context-free metrics do not consider context and are task-agnostic, making them easier to apply across various tasks.

Some commonly used metrics for evaluating text generation quality include BLEU, ROUGE, BERTScore, and MoverScore. BLEU is a precision-based metric that counts the number of overlapping n-grams between the generated output and the reference, then divides it by the total number of words in the output. It is widely used in machine translation due to its cost-effectiveness.

However, these conventional metrics have downsides. Firstly, there is often a poor correlation between these metrics and human judgments. Metrics like BLEU and ROUGE have shown negative correlation with human evaluations of fluency and moderate to low correlation with human adequacy scores, especially in tasks requiring creativity and diversity. Secondly, these metrics may not be adaptable to a wide range of tasks. For instance, metrics like BLEU and ROUGE, which rely on n-gram overlap, are not suitable for tasks like abstractive summarization or dialogue where diverse responses are possible. Lastly, these metrics can have poor reproducibility, with high variance reported across different studies possibly due to variations in human judgment collection or metric parameter settings.
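
To build intuition for why extracted questions can help retrieval, the toy example below compares a query against a chunk with and without an attached question. It is only a sketch under strong assumptions: it uses bag-of-words cosine similarity instead of a real embedding model, it assumes the metadata is part of the text that gets embedded, and the attached question and query are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunk = "GPT-4 as an evaluator had a high Spearman correlation with human judgments"
extracted_question = "How well do LLM evaluators agree with people?"  # invented example
query = "Do LLM evaluators agree with people?"

without_metadata = cosine(bag_of_words(query), bag_of_words(chunk))
with_metadata = cosine(
    bag_of_words(query), bag_of_words(chunk + " " + extracted_question)
)
print(f"without metadata: {without_metadata:.3f}")
print(f"with metadata:    {with_metadata:.3f}")
```

The query shares almost no vocabulary with the chunk itself, but it overlaps heavily with the question attached to it, which is the gap that question-style metadata is meant to close.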