<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/index_structs/doc_summary/DocSummary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Summary Index

This demo showcases the document summary index on Wikipedia articles about different cities.

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.

Retrieval can be performed with either the LLM or embeddings. We first select the documents relevant to the query based on their summaries, then return all nodes belonging to the selected documents.
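
To make the two-step retrieval concrete, here is a minimal toy sketch in plain Python. It is not the LlamaIndex implementation: the class `ToyDocSummaryIndex` is hypothetical, and the word-overlap score is only a stand-in for the LLM or embedding comparison the real index uses.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocSummaryIndex:
    """Toy stand-in for a document summary index (not the LlamaIndex class)."""

    summaries: dict[str, str] = field(default_factory=dict)  # doc_id -> summary
    nodes: dict[str, list[str]] = field(default_factory=dict)  # doc_id -> chunks

    def add_document(self, doc_id: str, summary: str, chunks: list[str]) -> None:
        self.summaries[doc_id] = summary
        self.nodes[doc_id] = list(chunks)

    def _overlap(self, query: str, doc_id: str) -> int:
        # Hypothetical scoring: number of words shared by query and summary.
        query_words = set(query.lower().split())
        return len(query_words & set(self.summaries[doc_id].lower().split()))

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        # Step 1: rank documents by how well their summary matches the query.
        ranked = sorted(
            self.summaries,
            key=lambda doc_id: self._overlap(query, doc_id),
            reverse=True,
        )
        # Step 2: return every chunk that belongs to the selected documents.
        return [chunk for doc_id in ranked[:top_k] for chunk in self.nodes[doc_id]]


index = ToyDocSummaryIndex()
index.add_document("Toronto", "sports teams and history of toronto", ["chunk A", "chunk B"])
index.add_document("Seattle", "coffee and rain in seattle", ["chunk C"])
print(index.retrieve("toronto sports teams"))
```

Note that the query is matched against summaries only; the chunks themselves are never scored, which is why a single selected document can return many nodes.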

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index

In [18]:
!pip show openai

Name: openai
Version: 1.3.7
Summary: The official Python library for the openai API
Home-page: 
Author: 
Author-email: OpenAI <support@openai.com>
License: 
Location: /opt/miniconda3/envs/py310_chat/lib/python3.10/site-packages
Requires: anyio, distro, httpx, pydantic, sniffio, tqdm, typing-extensions
Required-by: instructor, litellm, llama-index, open-interpreter, pandasai, pyautogen


In [3]:
import os
import openai

# Placeholder only; never commit a real key to a notebook.
# os.environ["OPENAI_API_KEY"] = "sk-..."
# openai.api_key = os.environ["OPENAI_API_KEY"]

In [16]:
# Read the key from the environment; do not echo it, since cell output is saved with the notebook
api_key1 = os.environ["openai_api_key1"]
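
Echoing a credential in a cell stores it in the notebook's saved output. If you want to confirm that a key was loaded, one option is to print a masked preview instead. The helper `mask_secret` below is a hypothetical name introduced for illustration; the environment variable name is the one used in this notebook.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return the secret with all but the last `visible` characters hidden."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Fall back to an empty string so the cell also runs when the variable is unset.
key = os.environ.get("openai_api_key1", "")
print("key loaded:", bool(key), "| preview:", mask_secret(key))
```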

In [9]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# # Uncomment if you want to temporarily disable logger
# logger = logging.getLogger()
# logger.disabled = True

In [10]:
import nest_asyncio

nest_asyncio.apply()

In [11]:
from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    get_response_synthesizer,
)
from llama_index.indices.document_summary import DocumentSummaryIndex
from llama_index.llms import OpenAI

### Load Datasets

Load Wikipedia pages on different cities

In [12]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

In [7]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

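    # Create the output directory if needed, then save the article text
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # Write as UTF-8 explicitly so non-ASCII characters survive on any platform
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)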
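
The download loop assumes every title resolves to a page with an `extract` field. As a hedged sketch, the lookup could be wrapped in a small helper that fails with a readable message instead. The function name `extract_page_text` is hypothetical, and the two payloads are hand-written to mimic the response shape used above, not real API output.

```python
def extract_page_text(payload: dict) -> str:
    """Pull the plain-text extract out of a MediaWiki `query` response.

    Raises KeyError with a readable message when no extract is present.
    """
    pages = payload.get("query", {}).get("pages", {})
    if not pages:
        raise KeyError("response contains no pages")
    page = next(iter(pages.values()))
    if "extract" not in page:
        raise KeyError(f"no extract for page {page.get('title', '<unknown>')!r}")
    return page["extract"]


# Hand-written payloads for illustration only.
ok = {"query": {"pages": {"123": {"title": "Toronto", "extract": "Toronto is ..."}}}}
missing = {"query": {"pages": {"-1": {"title": "Nope", "missing": ""}}}}

print(extract_page_text(ok))
try:
    extract_page_text(missing)
except KeyError as err:
    print("skipped:", err)
```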

In [21]:
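# Load the wiki documents (this run uses only the first title, Toronto).
# Note: files are read from "data-zh", not the "data" directory written above;
# "data-zh" is assumed to hold separately prepared Chinese copies of the articles.
city_docs = []
for wiki_title in wiki_titles[:1]: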
    docs = SimpleDirectoryReader(
        input_files=[f"data-zh/{wiki_title}.txt"]
    ).load_data()
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)

In [22]:
city_docs

[Document(id_='Toronto', embedding=None, metadata={'file_path': 'data-zh/Toronto.txt', 'file_name': 'Toronto.txt', 'file_type': 'text/plain', 'file_size': 71081, 'creation_date': '2024-01-03', 'last_modified_date': '2024-01-03', 'last_accessed_date': '2024-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, hash='1759bb8756d42cef8fdaa021eb42d5bab5b0481fb2429309b5b242b35a55cf4f', text="多伦多是加拿大人口最多的城市，也是加拿大安大略省的省会城市。2021 年有记录的人口为 2,794,356 人，是北美人口第四多的城市。该市是金马蹄城市群的支柱，金马蹄城市群围绕安大略湖西端，人口为 9,765,188 人（截至 2021 年），而大多伦多地区 2021 年人口为 6,712,341 人。多伦多是国际商业、金融、艺术、体育和文化中心，是世界上最具多元文化和国际化的城市之一。多伦多地区是原住民穿越和居住的地方，地处广阔的斜坡高原，河流纵横交错。 、深谷、城市森林，已有一万多年的历史。在备受争议的多伦多购买案之后，密西沙加将该地区交给英国王室，英国人于 1793 年建立了约克镇，后来将其指定为上加拿大的首府。1812年战争期间，该镇是约克战役的所在地，遭受美军的严重破坏。1834 年，约克更名为多伦多市。1867年加拿大联邦期间，它

### Build Document Summary Index

The index can be built in two ways: with the default settings, or with a customized summary query. This notebook uses the default mode.


In [23]:
api_base1 = "https://aigc789.top/v1"

In [15]:
api_base2 = "https://api.aigc369.com/v1"  # alternative endpoint; not used below

In [24]:
# LLM (gpt-3.5-turbo), called through the custom API base
chatgpt = OpenAI(
    temperature=0,
    model="gpt-3.5-turbo",
    api_base=api_base1,
    api_key=api_key1,
)
service_context = ServiceContext.from_defaults(llm=chatgpt, chunk_size=1024)

In [25]:
# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Summarizing documents:   0%|          | 0/1 [00:00<?, ?it/s]

current doc id: Toronto


Generating embeddings:   0%|          | 0/1 [00:00<?, ?it/s]
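
The `tree_summarize` response mode builds each document summary bottom-up. Roughly, groups of chunks are summarized, and the resulting summaries are summarized again until one remains. The sketch below illustrates that idea only; it is not the library's code. `fake_llm` is a stand-in for a model call, and the fixed `fan_in` is an assumption (the library packs chunks by prompt size instead).

```python
from typing import Callable


def tree_summarize(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    fan_in: int = 2,
) -> str:
    """Repeatedly merge groups of `fan_in` texts until one summary remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2, or the loop never shrinks")
    level = list(chunks)
    while len(level) > 1:
        # One pass: summarize each group, producing a shorter list of texts.
        level = [
            summarize(level[i : i + fan_in]) for i in range(0, len(level), fan_in)
        ]
    # A single input chunk is returned unchanged in this simplified version.
    return level[0]


def fake_llm(texts: list[str]) -> str:
    # Stand-in for an LLM call: keep the first word of each input text.
    return " ".join(text.split()[0] for text in texts)


print(tree_summarize(["alpha one", "beta two", "gamma three"], fake_llm))
```

Because the groups within one pass are independent of each other, they can be summarized concurrently, which is the kind of work `use_async=True` is meant to speed up.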

In [27]:
doc_summary_index.get_document_summary("Toronto")

"The provided text is about the city of Toronto in Canada. It covers various aspects of the city, including its population, history, economy, cultural diversity, landmarks, architecture, climate, parks, neighborhoods, suburbs, industrial areas, healthcare, education, sports, government, crime rates, transportation system, and major events. The text provides information on topics such as the city's population statistics, ethnic composition, languages spoken, economic sectors, real estate market, technology and biotechnology industries, tourism attractions, educational institutions, healthcare facilities, sports teams, government structure, crime rates, transportation infrastructure, and major events hosted by Toronto.\n\nSome of the questions that this text can answer include:\n- What is the population of Toronto and its population density?\n- What are the major ethnic groups and languages spoken in Toronto?\n- What is the economic significance of Toronto and what industries are promine

In [28]:
doc_summary_index.storage_context.persist("index-zh")

In [29]:
from llama_index.indices.loading import load_index_from_storage
from llama_index import StorageContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="index-zh")
doc_summary_index = load_index_from_storage(storage_context)
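
Persisting matters because building the summaries costs LLM calls. A common pattern is "load if a saved copy exists, otherwise build and save". The sketch below shows that pattern with plain JSON as a stand-in for the index storage; `load_or_build`, `expensive_build`, and the file name `summaries.json` are all hypothetical and unrelated to the files LlamaIndex writes.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(persist_dir: str, build: Callable[[], dict]) -> dict:
    """Load cached data from `persist_dir`, or build it once and save it."""
    cache_file = Path(persist_dir) / "summaries.json"  # hypothetical file name
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data


calls = []


def expensive_build() -> dict:
    calls.append(1)  # count how often the costly step actually runs
    return {"Toronto": "summary text"}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, expensive_build)
    second = load_or_build(tmp, expensive_build)  # served from disk this time

print(first == second, len(calls))
```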

### Perform Retrieval from Document Summary Index

We first show how to run queries at a high level. We then perform retrieval at a lower level so that you can see the parameters involved, covering both LLM-based and embedding-based retrieval over the document summaries.

#### High-level Querying

Note: this uses the default, embedding-based form of retrieval

In [31]:
query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)

In [32]:
response = query_engine.query("What are the sports teams in Toronto?")

In [33]:
print(response)

The sports teams in Toronto include the Toronto Maple Leafs (ice hockey), Toronto Raptors (basketball), Toronto Blue Jays (baseball), Toronto FC (soccer), and Toronto Argonauts (Canadian football).


#### LLM-based Retrieval

In [None]:
from llama_index.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
)

In [None]:
retriever = DocumentSummaryIndexLLMRetriever(
    doc_summary_index,
    # choice_select_prompt=None,
    # choice_batch_size=10,
    # choice_top_k=1,
    # format_node_batch_fn=None,
    # parse_choice_select_answer_fn=None,
    # service_context=None
)

In [None]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [None]:
print(len(retrieved_nodes))

20


In [None]:
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

10.0
Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of U
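
The score of `10.0` above is a relevance rating assigned by the LLM to the selected document. As a rough sketch of how such ratings can be read back from the model's reply, the parser below assumes the answer consists of lines shaped like `Doc: <number>, Relevance: <score>`. That format and the function `parse_choices` are assumptions for illustration, not the library's `parse_choice_select_answer_fn`.

```python
import re

# Assumed answer line format: "Doc: 2, Relevance: 7"
_LINE = re.compile(r"Doc:\s*(\d+)\s*,\s*Relevance:\s*(\d+(?:\.\d+)?)")


def parse_choices(answer: str, num_docs: int) -> list[tuple[int, float]]:
    """Return (zero-based doc index, relevance) pairs, highest relevance first."""
    picks = []
    for line in answer.splitlines():
        match = _LINE.search(line)
        if match is None:
            continue  # skip any extra text the model adds around the answer
        doc_number = int(match.group(1))
        if 1 <= doc_number <= num_docs:  # ignore document numbers that do not exist
            picks.append((doc_number - 1, float(match.group(2))))
    return sorted(picks, key=lambda pick: pick[1], reverse=True)


reply = "Doc: 2, Relevance: 7\nDoc: 1, Relevance: 10\nDoc: 9, Relevance: 3"
print(parse_choices(reply, num_docs=2))
```

Every node of a selected document then inherits that document's score, which is consistent with all 20 Toronto nodes being returned here.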

In [None]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto include the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), and Toronto Argonauts (CFL).


#### Embedding-based Retrieval

In [None]:
from llama_index.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)

In [None]:
retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k=1,
)

In [None]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")

In [None]:
len(retrieved_nodes)

20

In [None]:
print(retrieved_nodes[0].node.get_text())

Toronto is the most populous city in Canada and the capital city of the Canadian province of Ontario. With a recorded population of 2,794,356 in 2021, it is the fourth-most populous city in North America. The city is the anchor of the Golden Horseshoe, an urban agglomeration of 9,765,188 people (as of 2021) surrounding the western end of Lake Ontario, while the Greater Toronto Area proper had a 2021 population of 6,712,341. Toronto is an international centre of business, finance, arts, sports and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.Indigenous peoples have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Toronto Purchase, when the Mississauga surrendered the area to the British Crown, the British established the town of York in 1793 and later designated it as the capital of Upper 
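
Embedding-based retrieval compares the query embedding with each summary embedding, keeps the top `similarity_top_k` documents, and returns all of their nodes. The toy sketch below uses hand-made two-dimensional vectors in place of real embeddings; the function names and data are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_by_summary_embedding(
    query_vec: list[float],
    summary_vecs: dict[str, list[float]],
    doc_nodes: dict[str, list[str]],
    top_k: int = 1,
) -> list[str]:
    # Rank documents by similarity between the query and each summary vector.
    ranked = sorted(
        summary_vecs,
        key=lambda doc_id: cosine(query_vec, summary_vecs[doc_id]),
        reverse=True,
    )
    # Expand the selected documents into all of their nodes.
    return [node for doc_id in ranked[:top_k] for node in doc_nodes[doc_id]]


summary_vecs = {"Toronto": [0.9, 0.1], "Seattle": [0.1, 0.9]}
doc_nodes = {"Toronto": ["t-node-0", "t-node-1"], "Seattle": ["s-node-0"]}
print(retrieve_by_summary_embedding([1.0, 0.0], summary_vecs, doc_nodes))
```

Only one embedding per document is compared, so this step stays cheap even when documents are split into many nodes.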

In [None]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

The sports teams in Toronto include the Toronto Maple Leafs (NHL), Toronto Raptors (NBA), Toronto Blue Jays (MLB), Toronto FC (MLS), Toronto Argonauts (CFL), Toronto Rock (NLL), Toronto Wolfpack (Rugby Football League), Toronto Six (NWHL), and Toronto Rush (American Ultimate Disc League).
