# Doc Q&A Demo
This notebook is an example of Doc Q&A, where a user can upload a document and ask questions about it. The pipeline takes the following steps:
1. Parse the document into plain text
2. Chunk the text
3. Embed each chunk
4. Index the chunks and store them in an in-memory vector database to allow semantic search

The example is built with the llama-index library.

References: https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/#qa-patterns

In [2]:
from src.utils.text_overlap import find_overlap, find_overlap_chunks

In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode

In [4]:
Settings.llm = None
Settings.embed_model = None
Settings

LLM is explicitly disabled. Using MockLLM.
Embeddings have been explicitly disabled. Using MockEmbedding.


_Settings(_llm=MockLLM(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fd5a5f99a10>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x7fd65557e0c0>, completion_to_prompt=<function default_completion_to_prompt at 0x7fd6553e9760>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, max_tokens=None), _embed_model=MockEmbedding(model_name='unknown', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fd5a5f99a10>, embed_dim=1), _callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fd5a5f99a10>, _tokenizer=None, _node_parser=None, _prompt_helper=None, _transformations=None)

In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the data

In [6]:
# Load all text documents from the folder docs/
documents = SimpleDirectoryReader("docs").load_data()

In [7]:
print(type(documents))
print(len(documents))

<class 'list'>
1


In [8]:
doc_0 = documents[0]
print(type(doc_0))
print(doc_0.dict().keys())
print(doc_0.metadata)

<class 'llama_index.core.schema.Document'>
dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys', 'relationships', 'text', 'start_char_idx', 'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator', 'class_name'])
{'file_path': '/home/experiments/docs/state_of_the_union.txt', 'file_name': 'state_of_the_union.txt', 'file_type': 'text/plain', 'file_size': 39027, 'creation_date': '2023-05-10', 'last_modified_date': '2023-05-10'}


In [9]:
# Print all properties and methods of a Document object
print(dir(doc_0))

['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_compat_fields', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init

In [10]:
# View the first 1000 characters of doc_0
print(doc_0.text[:1000])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. 

Groups of citizens blocking tanks with their bodies. Every

# Index the documents
The basic LlamaIndex example uses the one-line command `VectorStoreIndex.from_documents` to chunk, embed, and index all the documents. That didn't work in my case: I kept running into a `RateLimitError` whose message pointed toward my OpenAI account. After some [digging](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/embeddings/utils.py#L31), I could confirm that LlamaIndex's default embedding model is OpenAI's, which fails in my case because my account is empty. And since I want to use open-source solutions in this example, I need a different approach anyway.

The solution is to explicitly define the embedding model that I want to use. To do that in LlamaIndex, we can [use the ingestion pipeline](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/#using-the-ingestion-pipeline-to-create-nodes). LlamaIndex even has a nice [tutorial](https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval/) on how to set it up with a sentence embedding model from Hugging Face, which is exactly what I was hoping to do.

## Split the document

In this example, we're using `SentenceSplitter`, a pretty basic type of text splitter whose main job is to avoid breaking sentences when it cuts the text into chunks (see [documentation](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/)).
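
To make the idea of sentence-aware chunking concrete, here is a toy sketch in plain Python. It is not how `SentenceSplitter` is implemented: the helper names `split_sentences` and `chunk_by_sentence` are made up for illustration, the budget is counted in words rather than tokens, and the overlap is counted in whole sentences.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentence(text: str, max_words: int = 15, overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly max_words words.

    The last `overlap_sentences` sentences of a chunk are repeated at the start
    of the next one. Sentences are never cut, so a chunk can exceed max_words.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        n_words = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and n_words > max_words:
            chunks.append(" ".join(current))
            # carry the tail of the finished chunk over as overlap
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Last year COVID-19 kept us apart. This year we are finally together again. "
    "Tonight, we meet as Democrats Republicans and Independents. "
    "But most importantly as Americans."
)
chunks = chunk_by_sentence(sample, max_words=15, overlap_sentences=1)
for chunk in chunks:
    print(repr(chunk))
```

Every chunk starts and ends on a sentence boundary, and each chunk after the first begins with the last sentence of the previous one.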

It would be interesting to play with a more complex splitter like `SemanticSplitterNodeParser`, which attempts to build chunks containing text that is semantically related ([doc](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/)). That imo makes a lot of sense and is an idea I have played with in the past. It'd be interesting to see how it works.

In [11]:
SentenceSplitter?

Init signature:
SentenceSplitter(
    separator: str = ' ',
    chunk_size: int = 1024,
    chunk_overlap: int = 200,
    tokenizer: Optional[Callable] = None,
    paragraph_separator: str = '\n\n\n',
    chunking_tokenizer_fn: Optional[Callable[[str], List[str]]] = None,
    secondary_chun

In [33]:
chunk_size = 128
text_parser = SentenceSplitter(
    chunk_size=chunk_size,
    chunk_overlap=min(200, int(chunk_size*0.5)),
    # separator=" ",
)

In [34]:
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in next step
doc_idxs = [] # keep track of what document each chunk comes from
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

In [35]:
print(len(doc_idxs), doc_idxs)

129 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [36]:
total_chunk_len = sum(len(ch) for ch in text_chunks)
print(f"All documents were split into a total of {len(text_chunks)} chunks.")
print(
    f"The total length of all chunks is {total_chunk_len}"
    f" compared to {len(doc_0.text)} for the original document."
)

All documents were split into a total of 129 chunks.
The total length of all chunks is 68840 compared to 38539 for the original document.
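
The chunks add up to more text than the document because of the overlap. A quick back-of-the-envelope check, using the two numbers printed above and an idealized model of fixed-size windows (an assumption: the real splitter cuts on sentence boundaries, so it will not match exactly). The function name `duplication_factor` is made up for this sketch.

```python
# Numbers reported by the cell above.
total_chunk_chars = 68840
document_chars = 38539
observed_ratio = total_chunk_chars / document_chars
print(f"observed duplication factor: {observed_ratio:.2f}")


def duplication_factor(chunk_size: int, chunk_overlap: int) -> float:
    """Idealized factor for fixed-size sliding windows.

    Each new chunk contributes only (chunk_size - chunk_overlap) units of new
    text, so for many chunks the total grows by chunk_size / stride.
    """
    stride = chunk_size - chunk_overlap
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return chunk_size / stride


chunk_size = 128
chunk_overlap = min(200, int(chunk_size * 0.5))  # same expression as in the notebook
print(f"idealized duplication factor: {duplication_factor(chunk_size, chunk_overlap):.2f}")
```

The observed factor (about 1.79) sits a bit below the idealized 2.0, presumably because the overlap is only taken in whole sentences.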


In [38]:
# Find out by how much consecutive chunks overlap.
# Each entry is (overlap length, start index of the overlap in the first chunk).
overlaps = find_overlap_chunks(text_chunks)
overlaps[:10]

[(290, 200),
 (292, 293),
 (292, 295),
 (283, 295),
 (263, 286),
 (228, 340),
 (231, 299),
 (229, 372),
 (287, 305),
 (285, 343)]
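
The helpers imported from `src.utils.text_overlap` are not shown in this notebook. Below is a minimal sketch of what they could look like, assuming the convention that the outputs here suggest: each result is `(overlap length, start index in the first chunk)`, and `(0, -1)` means no overlap was found. The actual implementation may well differ.

```python
def find_overlap(first: str, second: str) -> tuple[int, int]:
    """Longest suffix of `first` that is also a prefix of `second`.

    Returns (overlap_length, start_index_in_first), or (0, -1) if the two
    strings share no such overlap.
    """
    max_len = min(len(first), len(second))
    # Try the longest candidate first so the first match is the longest one.
    for length in range(max_len, 0, -1):
        if first.endswith(second[:length]):
            return length, len(first) - length
    return 0, -1


def find_overlap_chunks(chunks: list[str]) -> list[tuple[int, int]]:
    """Apply find_overlap to every pair of consecutive chunks."""
    return [find_overlap(a, b) for a, b in zip(chunks, chunks[1:])]


toy_chunks = ["alpha beta gamma", "beta gamma delta", "epsilon"]
toy_overlaps = find_overlap_chunks(toy_chunks)
print(toy_overlaps)
```

This brute-force version is quadratic in the chunk length in the worst case, which is fine for chunks of a few hundred characters.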

In [40]:
# Test for potential problems: pairs of consecutive chunks that share no text
errors = [(idx, overlap) for idx, overlap in enumerate(overlaps) if overlap[0] == 0]
errors

[(58, (0, -1))]

In [46]:
# Look at the overlap for one example
idx = 4
over = overlaps[idx]
len_over = over[0]
idx_over = over[1]
text_0 = text_chunks[idx]
text_1 = text_chunks[idx+1]

In [48]:
# print the overlap
print(text_0[idx_over:])
print("----"*20)
print(text_1[:len_over])

Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. 

Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.   

They keep moving.
--------------------------------------------------------------------------------
Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. 

Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.   

They keep moving.


## Construct a Node for each text chunk

Nodes are a concept specific to LlamaIndex (afaik). They are chunks of documents (text, image, audio, ...) augmented with metadata and relational information (for more, see the [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_nodes/)).

In [49]:
TextNode?

Init signature:
TextNode(
    *,
    id_: str = None,
    embedding: Optional[List[float]] = None,
    extra_info: Dict[str, Any] = None,
    excluded_embed_metadata_keys: List[str] = None,
    excluded_llm_metadata_keys: List[str] = None,
    relationships: Dict[llama_index

In [50]:
nodes = []
for idx, _text_chunk in enumerate(text_chunks):
    node = TextNode(text=_text_chunk) # create a node
    src_doc = documents[doc_idxs[idx]] # look up the document that chunk was taken from
    node.metadata = dict(src_doc.metadata) # copy, so nodes don't all share one metadata dict
    nodes.append(node)

In [51]:
print(len(nodes))
print(dir(nodes[0]))

129
['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init_private_attri

## Embed each Node

The embedding of each Node is stored on the Node itself, in its `embedding` attribute.

In [52]:
# embed_model_name = "BAAI/bge-small-en-v1.5" # https://huggingface.co/BAAI/bge-small-en-v1.5
# embed_model_name = "BAAI/bge-base-en-v1.5" # larger dimension of the embedding space (768 vs 384)
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2" # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

embed_model = HuggingFaceEmbedding(model_name=embed_model_name)

In [53]:
Settings.embed_model = embed_model

In [54]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(node.get_content(metadata_mode="all"))
    node.embedding = node_embedding
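
Note that with `metadata_mode="all"`, the string being embedded is not just the chunk text: the node's metadata is rendered into it as well. Here is a rough illustration, assuming one `key: value` line per metadata entry followed by a blank line and the chunk text. The real layout is controlled by the `text_template`, `metadata_template` and `metadata_seperator` fields listed earlier, so treat `render_for_embedding` as a made-up stand-in, not LlamaIndex's exact output.

```python
def render_for_embedding(text: str, metadata: dict[str, object]) -> str:
    """Hypothetical stand-in for node.get_content(metadata_mode="all")."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    # Metadata block first, then a blank line, then the chunk itself.
    return f"{metadata_str}\n\n{text}" if metadata_str else text


toy_metadata = {"file_name": "state_of_the_union.txt", "file_type": "text/plain"}
rendered = render_for_embedding("He met the Ukrainian people.", toy_metadata)
print(rendered)
```

Since every chunk from the same file carries the same metadata, this prefix is identical across nodes. Whether that helps or hurts retrieval with short chunks is something to test; the `excluded_embed_metadata_keys` field is there to leave keys out of the embedded text.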

In [55]:
len(nodes[0].embedding)

384

In [56]:
index = VectorStoreIndex(nodes)

In [57]:
print(dir(index))

['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_add_nodes_to_index', '_aget_node_with_embedding', '_async_add_nodes_to_index', '_build_index_from_nodes', '_callback_manager', '_delete_node', '_docstore', '_embed_model', '_get_node_with_embedding', '_graph_store', '_index_struct', '_insert', '_insert_batch_size', '_is_protocol', '_object_map', '_service_context', '_show_progress', '_storage_context', '_store_nodes_override', '_transformations', '_use_async', '_vector_store', 'as_chat_engine', 'as_query_engine', 'as_retriever', 'build_index_from_nodes', 'd

In [100]:
len(index.docstore.get_all_document_hashes()), len(nodes)

(129, 129)

# Retrieve a document chunk

In [72]:
retriever = index.as_retriever(similarity_top_k=5)

In [73]:
retriever._embed_model

HuggingFaceEmbedding(model_name='sentence-transformers/all-MiniLM-L6-v2', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fd5a5f99a10>, max_length=256, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

In [76]:
query = "By how much will the deficit be down by the end of this year?"
documents_retrieved = retriever.retrieve(query)
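
Conceptually, the retriever embeds the query with the same model, scores it against every stored node embedding, and returns the `similarity_top_k` best matches. A toy version in plain Python, assuming cosine similarity as the score (I believe that is the default for the in-memory store, but I have not verified it here). The function names and the 2-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], node_embeddings: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (node_index, score) pairs for the k most similar embeddings."""
    scored = [
        (idx, cosine_similarity(query_embedding, emb))
        for idx, emb in enumerate(node_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings"; the real ones have 384 dimensions.
toy_embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
results = top_k(toy_query, toy_embeddings, k=2)
print(results)
```

This is a linear scan over all embeddings, which is perfectly fine for 129 nodes.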

In [80]:
for rank, doc in enumerate(documents_retrieved):
    print(f"Document #{rank+1}:")
    #print(doc)
    if "the deficit will be down to less than half what it was before I took office" in doc.get_content():
        print("Good answer")

Document #1:
Document #2:
Document #3:
Document #4:
Good answer
Document #5:
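
The expected passage shows up at rank 4 out of 5. To compare settings (chunk size, embedding model, ...) it helps to turn that check into a number. A small sketch, with hypothetical helper names and toy strings standing in for the retrieved chunks:

```python
from typing import Optional


def first_hit_rank(retrieved_texts: list[str], expected_snippet: str) -> Optional[int]:
    """1-based rank of the first retrieved text containing the snippet, else None."""
    for rank, text in enumerate(retrieved_texts, start=1):
        if expected_snippet in text:
            return rank
    return None


def reciprocal_rank(rank: Optional[int]) -> float:
    """1/rank for a hit, 0.0 when the snippet was not retrieved at all."""
    return 0.0 if rank is None else 1.0 / rank


# Toy stand-ins; in the notebook the texts would come from doc.get_content().
toy_retrieved = [
    "chunk about jobs",
    "chunk about taxes",
    "chunk about inflation",
    "the deficit will be down to less than half",
    "chunk about energy",
]
rank = first_hit_rank(toy_retrieved, "the deficit will be down")
print(rank, reciprocal_rank(rank))
```

Averaging the reciprocal rank over a handful of question/snippet pairs would give a single score to track while tuning the pipeline.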
