# Doc Q&A Demo
This notebook contains an example of Doc Q&A, where a user can upload a document and ask questions about it. The pipeline will take the following steps:
1. Load text documents
2. Chunk the text
3. Embed each chunk
4. Index chunks and store in an in-memory vector database
5. Retrieve relevant chunks given a user query

The example is built with the LangChain library.

References: 
* https://python.langchain.com/docs/integrations/vectorstores/faiss/
* https://huggingface.co/learn/cookbook/advanced_rag

In [1]:
from src.utils.text_overlap import find_overlap, find_overlap_chunks

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load text documents

We use `TextLoader` to load a single text document (https://python.langchain.com/docs/modules/data_connection/document_loaders/).

To load all files within a directory, use `DirectoryLoader` (https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory/).
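To make the shape of a directory load concrete, here is a minimal standard-library sketch. It is a stand-in, not `DirectoryLoader` itself: the function name `load_text_files` and the plain-dict records are assumptions for illustration, chosen to mirror the `page_content` text and `metadata['source']` path that the LangChain documents in this notebook carry.

```python
from pathlib import Path


def load_text_files(directory, pattern="**/*.txt", encoding="utf-8"):
    """Read every file matching `pattern` under `directory`.

    Hypothetical helper: returns one dict per file, with the text under
    'page_content' and the file path under metadata['source'].
    """
    records = []
    # sorted() keeps the order deterministic across platforms.
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            records.append(
                {
                    "page_content": path.read_text(encoding=encoding),
                    "metadata": {"source": str(path)},
                }
            )
    return records
```

Calling `load_text_files("docs")` would return one record per `.txt` file found under `docs/`, including subdirectories.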

In [4]:
document_path = "docs/state_of_the_union.txt"
loader = TextLoader(document_path)
document = loader.load()

In [4]:
print(f"'document' is of type {type(document)} and contains {len(document)} elements")
doc_0 = document[0]
print(f"The first element in 'document' is of type {type(doc_0)}")

'document' is of type <class 'list'> and contains 1 elements
The first element in 'document' is of type <class 'langchain_core.documents.base.Document'>


In [5]:
# Attributes of doc_0:
print(dir(doc_0))

['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init_private_attribute

In [6]:
# Can access the text of the document using the attribute 'page_content':
print(type(doc_0.page_content))
doc_0.page_content[:150]

<class 'str'>


'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fello'

# Split the document

We can start with the simplest text splitter, [CharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/character_text_splitter/), but LangChain also offers more complex chunking strategies, such as [Semantic Chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/).

In [43]:
CharacterTextSplitter?

Init signature:
CharacterTextSplitter(
    separator: 'str' = '\n\n',
    is_separator_regex: 'bool' = False,
    **kwargs: 'Any',
) -> 'None'
Docstring:      Splitting text that looks at characters.
Init docstring: Create a new TextSplitter.
File:           /usr/local/lib/python3.11/site-packages/langchain_text_splitters/character.py
Type:           ABCMeta
Subclasses:

In [44]:
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=200, add_start_index=True)
docs = text_splitter.split_documents(document)

In [52]:
# metadata 'start_index' gives the start position of that chunk
docs[10].metadata["start_index"]

2902
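To build intuition for how `chunk_size`, `chunk_overlap` and the start index interact, here is a simplified sketch of separator-based chunking. It is an approximation written for this notebook, not LangChain's implementation: the function name `split_with_overlap` is hypothetical, and details such as whitespace stripping are left out. The text is cut on the separator, pieces are merged greedily up to `chunk_size`, and each new chunk starts with the trailing pieces of the previous one that fit within `chunk_overlap`.

```python
def split_with_overlap(text, separator="\n\n", chunk_size=500, chunk_overlap=200):
    """Greedy separator-based chunking; returns (start_index, chunk_text) pairs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Record each non-empty piece together with its offset in the original text.
    pieces = []
    offset = 0
    for part in text.split(separator):
        if part:
            pieces.append((offset, part))
        offset += len(part) + len(separator)

    def span(items):
        """Characters from the start of the first piece to the end of the last."""
        if not items:
            return 0
        return items[-1][0] + len(items[-1][1]) - items[0][0]

    chunks = []
    window = []  # pieces of the chunk currently being built
    for piece in pieces:
        if window and span(window + [piece]) > chunk_size:
            start = window[0][0]
            chunks.append((start, text[start:start + span(window)]))
            # Keep only trailing pieces that fit in chunk_overlap
            # and still leave room for the incoming piece.
            while window and (
                span(window) > chunk_overlap or span(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        start = window[0][0]
        chunks.append((start, text[start:start + span(window)]))
    return chunks
```

Because whole pieces are carried over, the overlap in this sketch is at most `chunk_overlap` characters rather than exactly that many, and it drops to zero when the last piece of a chunk is longer than `chunk_overlap`.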

In [53]:
print(f"The text splitter returns a {type(docs)} of {len(docs)} elements, each element of type {type(docs[0])}")

The text splitter returns a <class 'list'> of 125 elements, each element of type <class 'langchain_core.documents.base.Document'>


In [54]:
# Calculate and print by how many characters consecutive chunks overlap (length of overlap, index of start of overlap):
print(find_overlap_chunks(docs, convert_any_to_str=lambda elt: elt.page_content))

[(150, 340), (152, 324), (138, 278), (100, 324), (89, 349), (178, 257), (186, 181), (176, 148), (111, 370), (123, 331), (115, 296), (112, 381), (154, 116), (0, -1), (181, 306), (140, 296), (160, 324), (144, 304), (145, 322), (150, 340), (169, 328), (132, 302), (0, -1), (102, 391), (172, 315), (97, 349), (140, 267), (184, 315), (193, 187), (0, -1), (82, 315), (122, 336), (107, 345), (83, 381), (113, 208), (43, 422), (92, 303), (137, 294), (174, 264), (142, 332), (81, 284), (169, 299), (163, 317), (115, 346), (78, 370), (197, 250), (176, 298), (110, 349), (181, 308), (187, 241), (165, 264), (122, 329), (108, 350), (147, 345), (94, 357), (0, -1), (0, -1), (91, 284), (89, 346), (199, 261), (90, 373), (187, 298), (199, 298), (176, 202), (75, 325), (141, 296), (173, 274), (170, 176), (101, 361), (180, 288), (117, 352), (196, 227), (122, 297), (59, 266), (105, 391), (165, 251), (109, 336), (127, 289), (114, 355), (153, 334), (196, 158), (81, 387), (157, 286), (138, 259), (129, 255), (145, 276
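The helper `find_overlap_chunks` comes from this project's own `src.utils` package and is not shown here. As a rough idea of how such a measurement could work, the sketch below finds the longest suffix of one chunk that is also a prefix of the next. The function names are hypothetical, and the return convention (overlap length, start index in the first chunk, with `(0, -1)` for no overlap) is inferred from the comment and the printed output above, not taken from the helper's source.

```python
def find_suffix_prefix_overlap(left, right):
    """Longest suffix of `left` that is also a prefix of `right`.

    Returns (overlap_length, start_index_in_left), or (0, -1) if there is none.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first match is the longest overlap.
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return size, len(left) - size
    return 0, -1


def overlaps_between_consecutive(texts):
    """Apply the overlap search to every pair of neighbouring texts."""
    return [find_suffix_prefix_overlap(a, b) for a, b in zip(texts, texts[1:])]
```

With LangChain documents, the input would be `[d.page_content for d in docs]`. The search is quadratic in the chunk length in the worst case, which is acceptable for chunks of a few hundred characters.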

In [55]:
print(docs[0].page_content[-250:])

 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.


In [56]:
print(docs[1].page_content[:250])

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking 


# Embed each chunk
We're going to use an open-source embedding model from Hugging Face.

Ref: https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub/

In [57]:
embed_model_name = "BAAI/bge-small-en-v1.5" # https://huggingface.co/BAAI/bge-small-en-v1.5
#embed_model_name = "BAAI/bge-base-en-v1.5" # larger dimension of the embedding space (768 vs 384)
#embed_model_name = "sentence-transformers/all-MiniLM-L6-v2" # https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

embed_model = HuggingFaceEmbeddings(model_name=embed_model_name)



In [60]:
print(dir(embed_model))

['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '__weakref__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init_pr

In [58]:
vector_db = FAISS.from_documents(docs, embed_model)

In [30]:
print(vector_db.index.ntotal, len(docs))

125 125


In [31]:
print(dir(vector_db))

['_FAISS__add', '_FAISS__from', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_aembed_documents', '_aembed_query', '_asimilarity_search_with_relevance_scores', '_cosine_relevance_score_fn', '_create_filter_func', '_embed_documents', '_embed_query', '_euclidean_relevance_score_fn', '_get_retriever_tags', '_max_inner_product_relevance_score_fn', '_normalize_L2', '_select_relevance_score_fn', '_similarity_search_with_relevance_scores', 'aadd_documents', 'aadd_texts', 'add_documents', 'add_embeddings', 'add_texts', 'adelete', 'afrom_documents', 'afrom_embeddings', 'afrom_texts', 'amax_marginal_relevance_search', 'amax_marginal_relev

In [32]:
print(dir(vector_db.index))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__swig_destroy__', '__weakref__', 'add', 'add_c', 'add_with_ids', 'add_with_ids_c', 'assign', 'assign_c', 'cached_l2norms', 'check_compatible_for_merge', 'clear_l2norms', 'code_size', 'codes', 'compute_distance_subset', 'compute_residual', 'compute_residual_n', 'd', 'get_CodePacker', 'get_FlatCodesDistanceComputer', 'get_distance_computer', 'get_xb', 'is_trained', 'merge_from', 'metric_arg', 'metric_type', 'ntotal', 'permute_entries', 'permute_entries_c', 'range_search', 'range_search_c', 'reconstruct', 'reconstruct_batch', 'reconstruct_batch_c', 'reconstruct_c', 'reconstruct_n', 'reconstruct_n_c', 'remove_ids', 'remove_ids_c', '

# Similarity search for a user query

In [33]:
vector_db.similarity_search?

Signature:
vector_db.similarity_search(
    query: 'str',
    k: 'int' = 4,
    filter: 'Optional[Union[Callable, Dict[str, Any]]]' = None,
    fetch_k: 'int' = 20,
    **kwargs: 'Any',
) -> 'List[Document]'
Docstring:
Return docs most similar to query.

Args:
    query: Text to look up documents similar to.
    k: Number of Documents to return. Defaults to 4.
    filter: (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
    fetch_k: (Optional[int
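Conceptually, a similarity search embeds the query with the same model as the chunks and returns the `k` chunk vectors closest to it. The brute-force sketch below illustrates that idea on hand-written 2-D vectors. It is not how FAISS is implemented: the function names are hypothetical, and the use of Euclidean distance is an assumption about the metric made for this illustration.

```python
import heapq
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_k(query_vector, indexed_vectors, k=4):
    """Return the k closest vectors as (distance, position) pairs, nearest first."""
    scored = (
        (l2_distance(query_vector, vec), pos)
        for pos, vec in enumerate(indexed_vectors)
    )
    return heapq.nsmallest(k, scored)


# Toy "index" of four 2-D vectors standing in for chunk embeddings.
toy_index = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
nearest = top_k([0.0, 0.0], toy_index, k=2)
```

The returned positions play the role of chunk ids: in a real pipeline they would be mapped back to the corresponding documents.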

In [34]:
query = "By how much will the deficit be down by the end of this year?"
nb_docs_retrieved = 5
documents_retrieved = vector_db.similarity_search(query, k=nb_docs_retrieved)

In [37]:
# Check answers
relevant_sentence_from_original_text = "the deficit will be down to less than half what it was before I took office"
for rank, doc in enumerate(documents_retrieved):
    print(f"Document #{rank+1}:")
    #print(doc)
    if relevant_sentence_from_original_text in doc.page_content:
        print("Good answer")

Document #1:
Good answer
Document #2:
Document #3:
Document #4:
Good answer
Document #5:
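The manual check above can be wrapped into small reusable helpers. This is a sketch with hypothetical function names: it works on plain strings, reports the 1-based ranks of the retrieved texts that contain the expected sentence, and derives the reciprocal rank of the first hit.

```python
def hit_ranks(retrieved_texts, expected_snippet):
    """1-based ranks of the retrieved texts that contain `expected_snippet`."""
    return [
        rank
        for rank, text in enumerate(retrieved_texts, start=1)
        if expected_snippet in text
    ]


def reciprocal_rank(retrieved_texts, expected_snippet):
    """1 / rank of the first hit, or 0.0 when nothing relevant was retrieved."""
    ranks = hit_ranks(retrieved_texts, expected_snippet)
    return 1.0 / ranks[0] if ranks else 0.0
```

For the query above, `hit_ranks([d.page_content for d in documents_retrieved], relevant_sentence_from_original_text)` would correspond to the ranks marked "Good answer" in the output.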


In [38]:
documents_retrieved[0]

Document(page_content='By the end of this year, the deficit will be down to less than half what it was before I took office.  \n\nThe only president ever to cut the deficit by more than one trillion dollars in a single year. \n\nLowering your costs also means demanding more competition. \n\nI’m a capitalist, but capitalism without competition isn’t capitalism. \n\nIt’s exploitation—and it drives up prices.', metadata={'source': 'docs/state_of_the_union.txt'})

In [41]:
documents_retrieved[3]

Document(page_content='But in my administration, the watchdogs have been welcomed back. \n\nWe’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.  \n\nAnd tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud. \n\nBy the end of this year, the deficit will be down to less than half what it was before I took office.  \n\nThe only president ever to cut the deficit by more than one trillion dollars in a single year.', metadata={'source': 'docs/state_of_the_union.txt'})