<a href="https://colab.research.google.com/github/bacoco/LLM_train/blob/main/Chat_with_Data_(Advanced_RAG)_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with Data (Advanced RAG) Notebook

In this notebook we walk through advanced RAG abstractions in LlamaIndex that help reduce hallucinated responses compared to "naive" approaches. We cover two settings:

1. Multi-Document Comparisons
2. Complex Document (with Embedded Tables)

In both settings, we first set up a "naive" RAG stack (top-k retrieval over fixed-size chunks). We then implement a more advanced approach and compare the results.
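
To make "top-k retrieval with a fixed chunk size" concrete, here is a minimal offline sketch. It is not how LlamaIndex implements retrieval: word counts stand in for real embeddings, and `chunk_text`, `bag_of_words`, `cosine` and `top_k` are hypothetical helper names chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Made-up sentences used only to exercise the functions.
text = (
    "Lyft faces competition from other ridesharing platforms. "
    "Uber operates in many countries and faces regulatory risk. "
    "Both companies depend on drivers being classified as contractors."
)
chunks = chunk_text(text, chunk_size=8)
results = top_k("regulatory risk for Uber", chunks, k=2)
for chunk in results:
    print(repr(chunk))
```

Note how the fixed chunk boundary separates the word "Uber" from "regulatory risk": the best-matching chunk no longer says which company it describes. That loss of context is one reason naive top-k retrieval struggles on multi-document comparisons.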



In [1]:
!pip install llama-index==0.8.35 llama-hub pypdf

Collecting llama-index==0.8.35
  Downloading llama_index-0.8.35-py3-none-any.whl (843 kB)
Collecting llama-hub
  Downloading llama_hub-0.0.37-py3-none-any.whl (12.0 MB)
Collecting pypdf
  Downloading pypdf-3.16.4-py3-none-any.whl (276 kB)
Collecting tiktoken (from llama-index==0.8.35)
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting dataclasses-json (from llama-index==0.8.35)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl

In [2]:
# set up OpenAI: read the API key from the environment rather than hard-coding it
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

In [3]:
import nest_asyncio

nest_asyncio.apply()

In [4]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Multi-Document Comparisons


In [5]:
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

In [6]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo", max_tokens=-1)
service_context = ServiceContext.from_defaults(llm=llm)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
!wget "https://www.dropbox.com/s/qctkz6nxhm0y5qe/lyft_2021.pdf?dl=1" -O lyft_2021.pdf
!wget "https://www.dropbox.com/s/te0a2w227v27iag/uber_2021.pdf?dl=1" -O uber_2021.pdf

--2023-10-14 07:24:56--  https://www.dropbox.com/s/qctkz6nxhm0y5qe/lyft_2021.pdf?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601f:18::a27d:912
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/qctkz6nxhm0y5qe/lyft_2021.pdf [following]
--2023-10-14 07:24:56--  https://www.dropbox.com/s/dl/qctkz6nxhm0y5qe/lyft_2021.pdf
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucd82aa481be00d7bc9e00397fa7.dl.dropboxusercontent.com/cd/0/get/CFlZXzFfpKV2wLQbcVbNERW1rOej_ineAX6PxQq_KOZLVe2yJqmRHhbWFQrOPDZrfFRS9WJ_U6h0KT7Kjxi5bW1DoGRCygYNajSL8xV0BfclVgT_OesSz5b9Bjf7YQX973Szz4sur-wcsUDyKvZSCNue/file?dl=1# [following]
--2023-10-14 07:24:56--  https://ucd82aa481be00d7bc9e00397fa7.dl.dropboxusercontent.com/cd/0/get/CFlZXzFfpKV2wLQbcVbNERW1rOej_ineAX6PxQq_KOZLVe2yJqmRHhbWFQrOPDZrfFRS9WJ_U6h0KT7Kjxi5bW1DoGRCygYNa

In [8]:
lyft_docs = SimpleDirectoryReader(input_files=["lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["uber_2021.pdf"]).load_data()

In [9]:
print(lyft_docs[0].get_content())

UNITED STATESSECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
 
(Mark One)
☒
ANNUAL REPORT PURS UANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the fiscal year ended December 31, 2021
OR
☐
TRANSITION REPORT PURS UANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 FOR THE TRANSITION PERIODFROM                      TO
Commission File Number 001-38846
Lyft, Inc.
(Exact name of Registrant as specified in i
ts Charter)Delaware
20-8809830 (State or other jurisdiction of
incorporation or organization)
(I.R.S. EmployerIdentification No.)
185 Berry Street, Suite 5000
San Francisco, California
94107 (Address of principal executive offices)
(Zip Code) Registrant’s telephone number, including area code: (844) 250
-2773Securities registered pursuant to Section 12(b) of the Act: 
Title of each class
TradingSymbol(s)
Name of each exchange on which registered Class A common stock, par value of $0.00001 per share
LYFT Nasdaq Global Select Market S

### Setup Baseline Query Engine

We first set up a top-k query engine over both documents.

In [10]:
# setup baseline index
base_index = VectorStoreIndex.from_documents(lyft_docs + uber_docs)
base_engine = base_index.as_query_engine(similarity_top_k=4)

RetryError: ignored

In [None]:
response = base_engine.query("What are some risk factors for Uber?")
print(str(response))

In [None]:
print(len(response.source_nodes))

In [None]:
print(response.source_nodes[0].get_content())

In [None]:
response = base_engine.query("Compare and contrast the risk factors of Uber and Lyft?")
print(str(response))

In [None]:
print(response.source_nodes[3].get_content(metadata_mode="all"))

In [None]:
print(response.source_nodes[1].get_content(metadata_mode="all"))

### Setup SubQuestionQueryEngine

This query engine does the following:
- Treats different documents as different "tools"
- Breaks a complex question down into sub-questions over any subset of the documents
- Runs retrieval independently per document
- Combines the results at the end
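
The steps above can be sketched offline as follows. This is an illustration of the control flow only, not the `SubQuestionQueryEngine` implementation: the `Tool` and `SubQuestion` classes, the hard-coded `decompose` rule and the placeholder answers are all assumptions made so the sketch runs without an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    """Hypothetical stand-in for a per-document query engine."""
    name: str
    answer: Callable[[str], str]


@dataclass
class SubQuestion:
    tool_name: str
    question: str


def decompose(question: str, tools: List[Tool]) -> List[SubQuestion]:
    """Assumed decomposition rule: one sub-question per tool.

    A real engine would typically generate sub-questions with an LLM;
    the rule is hard-coded here so the sketch runs offline.
    """
    return [
        SubQuestion(tool.name, f"{question} (restricted to {tool.name})")
        for tool in tools
    ]


def answer_with_sub_questions(question: str, tools: List[Tool]) -> str:
    tools_by_name = {tool.name: tool for tool in tools}
    partial_answers = []
    for sub_q in decompose(question, tools):
        # Each sub-question is answered independently by its own tool.
        tool = tools_by_name[sub_q.tool_name]
        partial_answers.append(f"[{sub_q.tool_name}] {tool.answer(sub_q.question)}")
    # A real engine would synthesize a final answer; here we just join them.
    return "\n".join(partial_answers)


tools = [
    Tool("lyft_10k", lambda q: "placeholder answer from the Lyft filing"),
    Tool("uber_10k", lambda q: "placeholder answer from the Uber filing"),
]
combined = answer_with_sub_questions("What are the main risk factors?", tools)
print(combined)
```

Because every sub-question is routed to exactly one tool, each document contributes its own evidence to the final answer, instead of both documents competing for the same few top-k slots.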

In [None]:
lyft_index = VectorStoreIndex.from_documents(lyft_docs)

In [None]:
uber_index = VectorStoreIndex.from_documents(uber_docs)

In [None]:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=2)

In [None]:
uber_engine = uber_index.as_query_engine(similarity_top_k=2)

In [None]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description="Provides information about Lyft financials for year 2021",
        ),
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber_10k",
            description="Provides information about Uber financials for year 2021",
        ),
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)

### Try some Queries

We compare our SubQuestionQueryEngine against the base top-k vector query engine.

In [None]:
response = s_engine.query("Compare and contrast the risk factors of Uber and Lyft?")
print(str(response))

While the base query engine couldn't generate a proper response, here we see that the response is much more detailed.

Here are some other comparison queries.

In [None]:
query_str = "Tell me what was higher, Uber's revenue growth or Lyft's revenue growth, and using the text explain the reasons for revenue growth."
base_response = base_engine.query(query_str)
response = s_engine.query(query_str)


In [None]:
print(f"Base query engine: {str(base_response)}\n\n")
print(f"Sub-question query engine: {str(response)}\n\n")

In [None]:
print(base_response.source_nodes[3].get_content(metadata_mode="all"))

## Complex Document (with Embedded Tables)

In this setting we walk through a document that has tables embedded in it.

In [None]:
!apt install ghostscript python3-tk

In [None]:
!pip install camelot-py pymupdf frontend ghostscript

In [None]:
import camelot
from llama_index import Document, SummaryIndex

# https://en.wikipedia.org/wiki/The_World%27s_Billionaires
from llama_index import VectorStoreIndex, ServiceContext, LLMPredictor
from llama_index.query_engine import PandasQueryEngine, RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode
from llama_index.llms import OpenAI

from llama_hub.file.pymu_pdf.base import PyMuPDFReader
from pathlib import Path
from typing import List

### Parse out Table, build Pandas Query Engine

In [None]:
!wget "https://www.dropbox.com/scl/fi/waoz9bo9yiemnhnqvu0cc/billionaires_page.pdf?rlkey=4i08msa7zr1lpnuq2y1vs2xgw&dl=1" -O billionaires_page.pdf

In [None]:
# initialize PDF reader
reader = PyMuPDFReader()

In [None]:
file_path = "billionaires_page.pdf"
docs = reader.load(file_path=file_path)

In [None]:
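# use camelot to parse tables
def get_tables(path: str, pages: List[int]):
    """Return one DataFrame per page, built from the first table found on that page."""
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        # keep only the first table detected on the page
        table_df = table_list[0].df
        # promote the first row to column headers, then drop it from the data
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs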
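
The `rename` / `drop` / `reset_index` chain in `get_tables` can be tried in isolation on a small hand-made frame. The rows below are placeholders, not data from the PDF. The sketch assumes the parsed table arrives with integer column labels and its header text in row 0, which is what the chain is written to handle.

```python
import pandas as pd

# Placeholder table: header text sits in row 0, columns are labeled 0 and 1.
raw = pd.DataFrame(
    [
        ["Name", "Net worth"],
        ["Person A", "100"],
        ["Person B", "90"],
    ]
)

clean = (
    raw.rename(columns=raw.iloc[0])  # map column 0 -> "Name", 1 -> "Net worth"
    .drop(raw.index[0])              # remove the row that held the headers
    .reset_index(drop=True)          # renumber the remaining rows from 0
)
print(clean)
```

After this step, columns can be addressed by name (for example `clean["Net worth"]`), which is what lets a pandas-based query engine refer to them in generated code.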

In [None]:
table_dfs = get_tables(file_path, pages=[3, 25])

In [None]:
# shows list of top billionaires in 2023
table_dfs[0]

In [None]:
# shows the number of billionaires and their combined net worth by year
table_dfs[1]

In [None]:
# define query engines over these tables
df_query_engines = [PandasQueryEngine(table_df) for table_df in table_dfs]

In [None]:
response = df_query_engines[0].query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))

In [None]:
response = df_query_engines[1].query("How many billionaires were there in 2009?")
print(str(response))

### Build Recursive Retriever

We define a top-level vector index that does top-k lookup over a set of Nodes. Among them are two special nodes (`IndexNode` objects), one linking to each of these tables.

We define a `RecursiveRetriever` object to recursively retrieve/query nodes. We then put this in our `RetrieverQueryEngine` along with a `ResponseSynthesizer` to synthesize a response.

We pass in mappings from id to retriever and id to query engine. We then pass in a root id representing the retriever we query first.
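
The id-based routing described above can be sketched without LlamaIndex. This is a simplified model, not the `RecursiveRetriever` implementation: the `Node` class, the keyword-overlap retriever and the string-returning "query engines" are hypothetical stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    text: str
    index_id: Optional[str] = None  # set only on nodes that link elsewhere


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def make_top1_retriever(nodes: List[Node]) -> Callable[[str], List[Node]]:
    """Toy retriever: return the single node sharing the most words with the query."""
    def retrieve(query: str) -> List[Node]:
        q = tokens(query)
        return [max(nodes, key=lambda n: len(q & tokens(n.text)))]
    return retrieve


def recursive_retrieve(
    query: str,
    root_id: str,
    retriever_dict: Dict[str, Callable[[str], List[Node]]],
    query_engine_dict: Dict[str, Callable[[str], str]],
) -> List[str]:
    results = []
    for node in retriever_dict[root_id](query):
        if node.index_id is None:
            results.append(node.text)  # ordinary text node: use as-is
        elif node.index_id in query_engine_dict:
            # linked node: hand the query to the linked query engine
            results.append(query_engine_dict[node.index_id](query))
        elif node.index_id in retriever_dict:
            # linked retriever: recurse into it
            results.extend(
                recursive_retrieve(query, node.index_id, retriever_dict, query_engine_dict)
            )
        else:
            raise KeyError(f"no retriever or query engine for id {node.index_id!r}")
    return results


nodes = [
    Node("Forbes publishes an annual ranking of the wealthiest people."),
    Node("This node provides information about the world's richest billionaires in 2023",
         index_id="pandas0"),
    Node("This node provides information on the number of billionaires and their "
         "combined net worth from 2000 to 2023.", index_id="pandas1"),
]
retriever_dict = {"vector": make_top1_retriever(nodes)}
query_engine_dict = {
    "pandas0": lambda q: f"[pandas0] {q}",  # placeholder for a table query engine
    "pandas1": lambda q: f"[pandas1] {q}",
}

table_hit = recursive_retrieve(
    "What was the number of billionaires in 2009?", "vector", retriever_dict, query_engine_dict
)
text_hit = recursive_retrieve(
    "Who publishes the annual ranking?", "vector", retriever_dict, query_engine_dict
)
print(table_hit)
print(text_hit)
```

The first query matches a summary node and is forwarded to the linked table engine; the second matches an ordinary text node and is returned directly. The summary text is what gets embedded and matched, so its wording decides whether table questions reach the table.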

In [None]:
llm = OpenAI(temperature=0, model="gpt-4")

service_context = ServiceContext.from_defaults(
    llm=llm,
)

In [None]:
doc_nodes = service_context.node_parser.get_nodes_from_documents(docs)

In [None]:
# define index nodes
summaries = [
    "This node provides information about the world's richest billionaires in 2023",
    "This node provides information on the number of billionaires and their combined net worth from 2000 to 2023.",
]

df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}

In [None]:
# construct top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

In [None]:
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(
    # service_context=service_context,
    response_mode="compact"
)

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

### Define Baseline Retriever

We also define a baseline retriever that does top-k lookup over the raw document.

In [None]:
# baseline vector index (that doesn't include the extra df nodes).
# used to benchmark
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()

### Compare Results

We compare results between the recursive retriever and the baseline retriever.

In [None]:
response = query_engine.query(
    "How many billionaires were there in 2009?"
)
print(str(response))

In [None]:
response.source_nodes[0].node.get_content()

In [None]:
response = vector_query_engine0.query(
    "How many billionaires were there in 2009?"
)
print(str(response))

In [None]:
print(response.source_nodes[1].node.get_content())

In [None]:
response = query_engine.query(
    "What is the average age of top 5 billionaires in 2023? Make sure age is a float."
)
print(str(response))

In [None]:
response = vector_query_engine0.query(
    "What is the average age of top 5 billionaires in 2023? Make sure age is a float."
)
print(str(response))

Of course, just like the baseline vector query engine, the recursive retriever can answer semantic queries over the article as well.

In [None]:
response = query_engine.query(
    "How is wealth accounted for in recipients if the billionaire is deceased?"
)
print(str(response))

In [None]:
response = vector_query_engine0.query(
    "How is wealth accounted for in recipients if the billionaire is deceased?"
)
print(str(response))