# Doc Q&A Demo
This notebook contains an example of Doc Q&A, where a user can upload a document and ask questions about it. The pipeline will take the following steps:
1. Parse the document into text format
2. Chunk the text
3. Embed each chunk
4. Index chunks and store in an in-memory vector database to allow semantic search

The example is built with the llama-index library.

References: https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/#qa-patterns

In [1]:
from typing import Tuple, List

In [2]:
from src.utils.text_overlap import find_overlap, find_overlap_chunks

In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode

In [4]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the data

In [5]:
# Load all text documents from the folder docs/
documents = SimpleDirectoryReader("docs").load_data()

In [6]:
print(type(documents))
print(len(documents))

<class 'list'>
1


In [7]:
doc_0 = documents[0]
print(type(doc_0))
print(doc_0.dict().keys())
print(doc_0.metadata)

<class 'llama_index.core.schema.Document'>
dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys', 'relationships', 'text', 'start_char_idx', 'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator', 'class_name'])
{'file_path': '/home/experiments/docs/state_of_the_union.txt', 'file_name': 'state_of_the_union.txt', 'file_type': 'text/plain', 'file_size': 39027, 'creation_date': '2023-05-10', 'last_modified_date': '2023-05-10'}


In [8]:
# Print all properties and methods of a Document object
print(dir(doc_0))

['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_compat_fields', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init

In [9]:
# View the first few lines of object doc_0
print(doc_0.text[:1000])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. 

Groups of citizens blocking tanks with their bodies. Every

# Index the documents
The basic LlamaIndex example uses the one-line command `VectorStoreIndex.from_documents` to chunk, embed, and index all the documents. That didn't work in my case, though: I kept running into a `RateLimitError`, and the error message pointed to my OpenAI account. After some [digging](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/embeddings/utils.py#L31), I could confirm that LlamaIndex's default embedding model is OpenAI's, which fails in my case because my account has no remaining quota. Since I also want to use open-source solutions in this example, I need a different approach.

The solution is to explicitly define the embedding model that I want to use. To do that in LlamaIndex, we need to [use the ingestion pipeline](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/#using-the-ingestion-pipeline-to-create-nodes). LlamaIndex even has a nice [tutorial](https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval/) on how to set it up with a sentence embedding model from HuggingFace, which is exactly what I was hoping to do.

## Split the document

In this example, we're using `SentenceSplitter`, a fairly basic text splitter that only makes sure not to break sentences (see [documentation](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/)).

It would be interesting to play with a more complex splitter like `SemanticSplitterNodeParser`, which attempts to build chunks of semantically related text ([doc](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/)). That imo makes a lot of sense, and it's an idea I played with in the past. It'd be interesting to see how it works.
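
To make the idea of sentence-aware chunking concrete, here is a minimal pure-Python sketch. It is not how `SentenceSplitter` is implemented: the helper names `split_into_sentences` and `chunk_sentences` are hypothetical, sentences are detected with a naive regex, and sizes are counted in characters rather than tokens. It only illustrates the two properties discussed above: chunks never cut a sentence in half, and consecutive chunks overlap.

```python
import re
from typing import List


def split_into_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(text: str, chunk_size: int = 80, chunk_overlap: int = 30) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly `chunk_size` characters.

    Each new chunk starts with the trailing sentences of the previous chunk
    (up to `chunk_overlap` characters), so neighbouring chunks share context.
    A single sentence longer than `chunk_size` is kept whole (assumption of
    this sketch), so a chunk can exceed the target size.
    """
    chunks: List[str] = []
    current: List[str] = []
    for sentence in split_into_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            # Close the current chunk before it grows past the target size.
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One fish. Two fish. Red fish. Blue fish. Old fish. New fish."
chunks = chunk_sentences(text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

With these toy settings the text ends up in three chunks, and each pair of consecutive chunks shares exactly one sentence.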

In [10]:
SentenceSplitter?

Init signature:
SentenceSplitter(
    separator: str = ' ',
    chunk_size: int = 1024,
    chunk_overlap: int = 200,
    tokenizer: Optional[Callable] = None,
    paragraph_separator: str = '\n\n\n',
    chunking_tokenizer_fn: Optional[Callable[[str], List[str]]] = None,
    secondary_chun

In [11]:
chunk_size = 1024
text_parser = SentenceSplitter(
    chunk_size=chunk_size,
    # separator=" ",
)

In [27]:
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in next step
doc_idxs = [] # keep track of what document each chunk comes from
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

In [26]:
doc_idxs

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [14]:
print(f"All documents were split into a total of {len(text_chunks)} chunks.")
print(f"The total length of all chunks is {sum([len(ch) for ch in text_chunks])}" + 
      f" compared to {len(doc_0.text)} for the original document."
     )

All documents were split into a total of 10 chunks.
The total length of all chunks is 46228 compared to 38539 for the original document.


In [15]:
# Find out by how much consecutive chunks overlap
overlaps = find_overlap_chunks(text_chunks)
overlaps

[(781, 4047),
 (839, 3922),
 (811, 3790),
 (855, 3717),
 (944, 3588),
 (886, 3828),
 (836, 3656),
 (942, 3916),
 (795, 4000)]
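
The helpers `find_overlap` and `find_overlap_chunks` come from `src/utils/text_overlap.py`, which isn't shown in this notebook. Below is a minimal sketch of what they might look like, assuming each result is a tuple `(overlap_length, start_index_in_first_chunk)` as the way the tuples are used in the next cells suggests. The actual implementation may differ. The last lines use only numbers printed above to check that the overlaps account for the difference between the total chunk length and the document length.

```python
from typing import List, Tuple


def find_overlap(first: str, second: str) -> Tuple[int, int]:
    """Return (overlap_length, start_index) for the longest suffix of `first`
    that is also a prefix of `second`; (0, len(first)) if there is none."""
    max_len = min(len(first), len(second))
    for length in range(max_len, 0, -1):
        if first.endswith(second[:length]):
            return length, len(first) - length
    return 0, len(first)


def find_overlap_chunks(chunks: List[str]) -> List[Tuple[int, int]]:
    """Overlap between each pair of consecutive chunks."""
    return [find_overlap(a, b) for a, b in zip(chunks, chunks[1:])]


# Toy example (hypothetical data, not from the notebook)
toy_chunks = ["alpha beta gamma", "beta gamma delta", "delta epsilon"]
toy_overlaps = find_overlap_chunks(toy_chunks)
print(toy_overlaps)  # [(10, 6), (5, 11)]

# Sanity check with the numbers reported in this notebook
reported = [(781, 4047), (839, 3922), (811, 3790), (855, 3717), (944, 3588),
            (886, 3828), (836, 3656), (942, 3916), (795, 4000)]
total_overlap = sum(length for length, _ in reported)
print(total_overlap)  # 7689
print(46228 - 38539)  # 7689
```

The two totals match: the extra 7689 characters in the chunks are exactly the overlapping text between consecutive chunks.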

In [18]:
# Look at the overlap for one example
idx = 4
over = overlaps[idx]
len_over = over[0]
idx_over = over[1]
text_0 = text_chunks[idx]
text_1 = text_chunks[idx+1]

In [19]:
print(text_0[idx_over:])
print("----"*20)
print(text_1[:len_over])

So that’s my plan. It will grow the economy and lower costs for families. 

So what are we waiting for? Let’s get this done. And while you’re at it, confirm my nominees to the Federal Reserve, which plays a critical role in fighting inflation.  

My plan will not only lower costs to give families a fair shot, it will lower the deficit. 

The previous Administration not only ballooned the deficit with tax cuts for the very wealthy and corporations, it undermined the watchdogs whose job was to keep pandemic relief funds from being wasted. 

But in my administration, the watchdogs have been welcomed back. 

We’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.  

And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud. 

By the end of this year, the deficit will be down to less than half what it was before I took office.
-------------------------------------------------------

## Construct a Node for each text chunk

Nodes are a concept specific to LlamaIndex (afaik). They are chunks of documents (text, image, audio, ...) augmented with metadata and relational information (for more, see the [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_nodes/)).

In [24]:
TextNode?

Init signature:
TextNode(
    *,
    id_: str = None,
    embedding: Optional[List[float]] = None,
    extra_info: Dict[str, Any] = None,
    excluded_embed_metadata_keys: List[str] = None,
    excluded_llm_metadata_keys: List[str] = None,
    relationships: Dict[llama_index

In [29]:
nodes = []
for idx, _text_chunk in enumerate(text_chunks):
    node = TextNode(text=_text_chunk)  # create a node for this chunk
    src_doc = documents[doc_idxs[idx]]  # look up the document the chunk was taken from
    node.metadata = dict(src_doc.metadata)  # copy, so nodes don't all share one metadata dict
    nodes.append(node)

In [32]:
print(len(nodes))
print(dir(nodes[0]))

10
['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init_private_attrib

## Embed each Node

The embedding of each Node is stored on the Node itself, in its `embedding` attribute.

In [33]:
embed_model_name = "BAAI/bge-small-en"
embed_model = HuggingFaceEmbedding(model_name=embed_model_name)

In [34]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(node.get_content(metadata_mode="all"))
    node.embedding = node_embedding

In [42]:
len(nodes[0].embedding)

384
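
Once every node carries an embedding, semantic search boils down to embedding the query with the same model and ranking the nodes by similarity. Here is a minimal sketch of that ranking step in plain Python, using cosine similarity (a common choice; I'm assuming here, not claiming, that the vector store does something equivalent). The helper names and the toy 3-dimensional vectors are made up for illustration; the real embeddings above have 384 dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    node_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return the (index, score) pairs of the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(node_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for node and query embeddings (hypothetical data)
toy_node_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query_vec = [0.9, 0.1, 0.0]

ranking = top_k(toy_query_vec, toy_node_vecs, k=2)
print([i for i, _ in ranking])  # [0, 2]
```

The query vector points mostly along the first axis, so the first node ranks highest, followed by the node that mixes the first two axes.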

**TODO**: The simple approach with `VectorStoreIndex` doesn't work out of the box, as it still relies on OpenAI. It needs to be modified to use the HuggingFace embedding model.

In [43]:
VectorStoreIndex.from_documents?

Signature:
VectorStoreIndex.from_documents(
    documents: Sequence[llama_index.core.schema.Document],
    storage_context: Optional[llama_index.core.storage.storage_context.StorageContext] = None,
    show_progress: bool = False,
    callback_manager: Optional[llama_index.core.callbacks.base.CallbackManager] = None,

In [44]:
index = VectorStoreIndex(nodes)

In [47]:
print(dir(index))

['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_add_nodes_to_index', '_aget_node_with_embedding', '_async_add_nodes_to_index', '_build_index_from_nodes', '_callback_manager', '_delete_node', '_docstore', '_embed_model', '_get_node_with_embedding', '_graph_store', '_index_struct', '_insert', '_insert_batch_size', '_is_protocol', '_object_map', '_service_context', '_show_progress', '_storage_context', '_store_nodes_override', '_transformations', '_use_async', '_vector_store', 'as_chat_engine', 'as_query_engine', 'as_retriever', 'build_index_from_nodes', 'd

# Retrieve a document chunk

In [49]:
query_engine = index.as_query_engine()

In [50]:
response = query_engine.query("By how much will the deficit be down by the end of this year?")
print(response)

Retrying llama_index.embeddings.openai.base.get_embedding in 0.8513874749013055 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}.


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}