<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/index_structs/doc_summary/DocSummary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Summary Index

This demo showcases the document summary index over Wikipedia articles on different cities.

The document summary index extracts a summary from each document and stores that summary together with all nodes belonging to the document.

Retrieval can be performed through the LLM or through embeddings. We first select the documents relevant to the query based on their summaries, then return all nodes belonging to the selected documents.
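
To make the idea concrete, here is a minimal, library-free sketch of the data structure described above. `ToyDocSummaryIndex` and its word-overlap scoring are hypothetical stand-ins invented for illustration; they are not the LlamaIndex implementation, which uses an LLM or embeddings for the selection step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocSummaryIndex:
    """Hypothetical stand-in: one summary per document plus that document's nodes."""

    summaries: dict = field(default_factory=dict)  # doc_id -> summary text
    nodes: dict = field(default_factory=dict)  # doc_id -> list of text chunks

    def add_document(self, doc_id, summary, chunks):
        self.summaries[doc_id] = summary
        self.nodes[doc_id] = list(chunks)

    def _overlap(self, query_words, doc_id):
        # Crude relevance score: number of query words found in the summary.
        return len(query_words & set(self.summaries[doc_id].lower().split()))

    def retrieve(self, query, top_k=1):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.summaries,
            key=lambda doc_id: self._overlap(query_words, doc_id),
            reverse=True,
        )
        # Return every node of the selected documents, not just one chunk.
        return [chunk for doc_id in ranked[:top_k] for chunk in self.nodes[doc_id]]


index = ToyDocSummaryIndex()
index.add_document("Toronto", "Toronto sports teams and culture", ["chunk T1", "chunk T2"])
index.add_document("Boston", "Boston history and universities", ["chunk B1"])
toronto_nodes = index.retrieve("sports teams in Toronto")
print(toronto_nodes)
```

The point of the sketch is the two-stage shape: documents are chosen by their summaries, and all of a chosen document's nodes come back together.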

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index

In [18]:
!pip show openai

Name: openai
Version: 1.3.7
Summary: The official Python library for the openai API
Home-page: 
Author: 
Author-email: OpenAI <support@openai.com>
License: 
Location: /opt/miniconda3/envs/py310_chat/lib/python3.10/site-packages
Requires: anyio, distro, httpx, pydantic, sniffio, tqdm, typing-extensions
Required-by: instructor, litellm, llama-index, open-interpreter, pandasai, pyautogen


In [1]:
import os
import openai

In [3]:
# Set the key in your environment, or uncomment the line below with your own key.
# Do not commit a real key to a notebook.
# os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["OPENAI_API_KEY"]

'sk-...'
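
Because notebook outputs are saved with the file, it can help to confirm that a key is set without echoing it. The following is a small sketch of one way to do that; `mask_secret` is a hypothetical helper written for this example, and the sample value is a made-up placeholder.

```python
import os


def mask_secret(value, visible=4):
    """Return a display-safe version of a secret, keeping only the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# "sk-example-1234" is a made-up placeholder, not a real key.
masked = mask_secret("sk-example-1234")
print(masked)

# Shows "<not set>" or a masked value, never the full key.
print(mask_secret(os.environ.get("OPENAI_API_KEY")))
```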

In [2]:
os.environ["openai_api_base1"]

'https://aigc789.top/v1'

In [4]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# # Uncomment if you want to temporarily disable logger
# logger = logging.getLogger()
# logger.disabled = True

In [5]:
import nest_asyncio

nest_asyncio.apply()

In [10]:
from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    get_response_synthesizer,
)
from llama_index.indices.document_summary import DocumentSummaryIndex
from llama_index.llms import OpenAI

### Load Datasets

Load Wikipedia pages on different cities

In [7]:
from read import get_filelisform

In [28]:
pdf_files = get_filelisform('./data-zh','.pdf')[:1]
pdf_files

[]

In [34]:
pdf_files = get_filelisform('./data-zh','.txt')[:1]
pdf_files

['./data-zh/Boston.txt']
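
The `read` module that provides `get_filelisform` is a local helper and is not shown in this notebook. Judging only from the calls above, it appears to list the files in a directory that have a given extension. A rough equivalent might look like the sketch below; the name `list_files_with_suffix` and its exact behavior are assumptions, not the original helper.

```python
import tempfile
from pathlib import Path


def list_files_with_suffix(directory, suffix):
    """Return sorted paths of the files directly inside `directory` ending in `suffix`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.iterdir() if p.is_file() and p.suffix == suffix)


# Demonstrate on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Boston.txt").write_text("Boston is ...", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("scratch", encoding="utf-8")
    txt_names = [Path(p).name for p in list_files_with_suffix(tmp, ".txt")]
    pdf_names = [Path(p).name for p in list_files_with_suffix(tmp, ".pdf")]

print(txt_names, pdf_names)
```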

In [35]:
# Load all wiki documents
city_docs = []
for file in pdf_files:
    docs = SimpleDirectoryReader(
        input_files=[file]
    ).load_data()
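    # For a path without ':' this leaves the path unchanged,
    # so the document ID is simply the file path.
    title = file.split(':')[0]
    docs[0].doc_id = title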
    city_docs.extend(docs)



### Build Document Summary Index

We show two ways of building the index:
- default mode of building the document summary index
- customizing the summary query


In [15]:
!echo $http_proxy

http://127.0.0.1:7890


### ChatGPT

In [13]:
api_base1=os.environ['openai_api_base1']
api_base1

'https://aigc789.top/v1'

In [14]:
api_key1 = os.environ["openai_api_key1"]
# The key's value is intentionally not echoed; notebook outputs are saved with the file.

In [30]:
# LLM (gpt-3.5-turbo)
system_prompt = "Always respond in Chinese"
chatgpt = OpenAI(
    temperature=0, model="gpt-3.5-turbo", api_base=api_base1,
    api_key=api_key1, system_prompt=system_prompt,
)
service_context = ServiceContext.from_defaults(llm=chatgpt, chunk_size=1024)

In [31]:
from llama_index.prompts import SelectorPromptTemplate
from llama_index.prompts.chat_prompts import CHAT_TREE_SUMMARIZE_PROMPT
from llama_index.prompts.utils import is_chat_model
from llama_index.prompts.base import PromptTemplate
from llama_index.prompts.prompt_type import PromptType

DEFAULT_TREE_SUMMARIZE_TMPL = (
    "Context information from multiple sources is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the information from multiple sources and not prior knowledge, "
    "answer the query below. Always respond in Chinese.\n"
    "Query: {query_str}\n"
    "Answer: "
)
DEFAULT_TREE_SUMMARIZE_PROMPT = PromptTemplate(
    DEFAULT_TREE_SUMMARIZE_TMPL, prompt_type=PromptType.SUMMARY
)


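# The selector uses a conditional template when its predicate matches the LLM,
# so a chat model gets CHAT_TREE_SUMMARIZE_PROMPT and the custom default
# template above only applies to non-chat models.
summary_template = SelectorPromptTemplate(
    default_template=DEFAULT_TREE_SUMMARIZE_PROMPT,
    conditionals=[(is_chat_model, CHAT_TREE_SUMMARIZE_PROMPT)],
)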

In [38]:
new_summary_tmpl_str = (
    "Context information from multiple sources is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the information from multiple sources and not prior knowledge, "
    "answer the query below. Always respond in Chinese.\n"
    "Query: {query_str}\n"
    "Answer: "
)
summary_template = PromptTemplate(new_summary_tmpl_str)
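
To see what the model receives, the template string can be filled in by hand. The sketch below uses plain `str.format` as a stand-in for the library's own prompt formatting, and the context and query values are invented for illustration.

```python
summary_tmpl = (
    "Context information from multiple sources is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the information from multiple sources and not prior knowledge, "
    "answer the query below. Always respond in Chinese.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Placeholder values, invented for illustration only.
rendered = summary_tmpl.format(
    context_str="Boston is the capital of Massachusetts.",
    query_str="Summarize the document.",
)
print(rendered)
```

Both `{context_str}` and `{query_str}` must stay in the template, since they are the two slots that get filled at query time.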

In [39]:
# Build the index with tree_summarize and the custom summary template
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    use_async=True,
    summary_template=summary_template,
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/1 [00:00<?, ?it/s]

current doc id: ./data-zh/Boston.txt


Generating embeddings:   0%|          | 0/1 [00:00<?, ?it/s]

In [24]:
doc_summary_index.get_document_summary("./data/01讲 价值感：怎么表达才能凸显干货.pdf")

'The provided text appears to be an introduction or overview of a training program called "当众表达训练营" (Public Speaking Training Camp) conducted by a person named 徐昆鹏 (Xu Kunpeng). The text emphasizes the importance of public speaking and the ability to express oneself effectively in order to establish authority and influence others. It mentions that having "人生杠杆" (leverage in life) is crucial for having influence, and one way to achieve this is by being able to replicate one\'s thoughts and ideas to others. The text also mentions that while writing can have a broad impact, public speaking has a deeper impact on the audience. The training program is said to last for 21 days, during which participants will learn the art of public speaking.\n\nBased on this text, some questions that can be answered include:\n- What is the purpose of the "当众表达训练营" (Public Speaking Training Camp)?\n- Who is the trainer or instructor of the training program?\n- What is the importance of public speaking and exp

In [37]:
doc_summary_index.storage_context.persist("index-zh-3")

In [20]:
from llama_index.indices.loading import load_index_from_storage
from llama_index import StorageContext

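# Rebuild the storage context from a persisted index directory.
# Note: this loads "index-dedao-zh2", not the "index-zh-3" directory persisted above.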
storage_context = StorageContext.from_defaults(persist_dir="index-dedao-zh2")
doc_summary_index = load_index_from_storage(storage_context)

### Perform Retrieval from Document Summary Index

We show how to execute queries at a high level. We also show how to perform retrieval at a lower level, so that you can see the parameters involved. Both LLM-based and embedding-based retrieval over the document summaries are covered.

#### High-level Querying

Note: this uses the default, embedding-based form of retrieval

In [30]:
query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)

In [31]:
response = query_engine.query("怎样表达才能凸显干货")

In [32]:
print(response)

通过将工作目标拆解成小事情的能力来表达，以及将自己活成解决方案而不是问题，可以凸显干货。此外，还可以使用“不是而是”大法原创金句来提供新认知，从而表达干货。
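
The `tree_summarize` response mode combines retrieved text bottom-up until a single answer remains. The sketch below illustrates only that general shape: `fake_summarize` stands in for an LLM call, and the fixed `fan_in` grouping is a simplifying assumption (the library's own grouping of chunks may differ).

```python
def tree_summarize(chunks, summarize, fan_in=2):
    """Repeatedly merge groups of `fan_in` texts until a single text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2, otherwise the tree never shrinks")
    if not chunks:
        return ""
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(level[i : i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_summarize(texts):
    # Stand-in for an LLM call: join the inputs so the tree shape stays visible.
    return "(" + " + ".join(texts) + ")"


result = tree_summarize(["a", "b", "c", "d", "e"], fake_summarize)
print(result)
```

The nesting of parentheses in the printed result mirrors the levels of the tree.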


#### LLM-based Retrieval

In [None]:
from llama_index.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
)

In [None]:
retriever = DocumentSummaryIndexLLMRetriever(
    doc_summary_index,
    # choice_select_prompt=None,
    # choice_batch_size=10,
    # choice_top_k=1,
    # format_node_batch_fn=None,
    # parse_choice_select_answer_fn=None,
    # service_context=None
)

In [None]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [None]:
print(len(retrieved_nodes))

20


In [None]:
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

10.0
Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of U

In [None]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto include the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), and Toronto Argonauts (CFL).
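
The `parse_choice_select_answer_fn` parameter and the numeric score shown above suggest that the LLM's reply is parsed into document choices with relevance scores. As a rough sketch, assuming replies made of lines such as `Doc: 2, Relevance: 9` (an assumption about the reply format, not a confirmed detail of the library), such a parser could look like this. `parse_choices` is a hypothetical name.

```python
import re


def parse_choices(answer, num_docs):
    """Parse lines like 'Doc: 2, Relevance: 9' into (doc_number, score) pairs."""
    pattern = re.compile(r"Doc:\s*(\d+)\s*,\s*Relevance:\s*(\d+(?:\.\d+)?)")
    choices = []
    for line in answer.splitlines():
        match = pattern.search(line)
        if not match:
            continue  # skip lines the model did not format as expected
        doc_number = int(match.group(1))
        if 1 <= doc_number <= num_docs:  # ignore numbers outside the batch
            choices.append((doc_number, float(match.group(2))))
    # Highest relevance first.
    return sorted(choices, key=lambda pair: pair[1], reverse=True)


# A made-up model reply, for illustration only.
reply = "Doc: 2, Relevance: 7\nDoc: 1, Relevance: 10\nDoc: 9, Relevance: 3\nnot a choice"
ranked = parse_choices(reply, num_docs=3)
print(ranked)
```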


#### Embedding-based Retrieval

In [None]:
from llama_index.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)

In [None]:
retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)

In [None]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [None]:
len(retrieved_nodes)

20

In [None]:
print(retrieved_nodes[0].node.get_text())

Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of Upper 

In [None]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto include the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), Toronto Argonauts (CFL), Toronto Rock (NLL), Toronto Wolfpack (Rugby Football League), Toronto Six (NWHL), and Toronto Rush (American Ultimate Disc League).
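
Embedding-based selection can be pictured as ranking summary vectors by similarity to the query vector and keeping the top `similarity_top_k` documents. The sketch below uses cosine similarity over tiny hand-written vectors; the vectors, the `top_k_documents` helper, and the choice of cosine similarity are illustrative assumptions rather than the library's internals.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(query_vec, summary_vecs, similarity_top_k=1):
    """Return the ids of the `similarity_top_k` summaries closest to the query."""
    ranked = sorted(
        summary_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, summary_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:similarity_top_k]


# Tiny hand-written vectors standing in for real summary embeddings.
summary_vecs = {
    "Toronto": [0.9, 0.1, 0.0],
    "Boston": [0.1, 0.9, 0.0],
    "Chicago": [0.5, 0.5, 0.1],
}
selected = top_k_documents([1.0, 0.0, 0.0], summary_vecs, similarity_top_k=2)
print(selected)
```

Once the documents are selected this way, all of their nodes are returned, which is why a single query above yields many retrieved nodes.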
