# Parent Document Retriever
https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/

https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html

#### Test files
The loader in this notebook reads `.txt` files from the `./util` subdirectory.

#### When to use?
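Use this retriever when the source documents are small enough to be passed to the model whole as context.
The documents are still split into small chunks for the similarity search, but the retriever returns the parent document of each matching chunk so that context is not lost.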
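
The idea of searching over small chunks while returning the whole document can be illustrated with a toy, dependency-free sketch. This is not how LangChain implements `ParentDocumentRetriever`: the class `ToyParentRetriever`, the fixed-width `split_into_chunks` helper, and the word-overlap score are all assumptions made for illustration, standing in for the real text splitter and the embedding-based similarity search.

```python
import re
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def word_overlap(query, chunk):
    """Toy relevance score: count of word occurrences shared by query and chunk."""
    query_words = Counter(re.findall(r"\w+", query.lower()))
    chunk_words = Counter(re.findall(r"\w+", chunk.lower()))
    return sum((query_words & chunk_words).values())


class ToyParentRetriever:
    """Hypothetical stand-in: indexes child chunks, returns parent documents."""

    def __init__(self, chunk_size=40):
        self.chunk_size = chunk_size
        self.parents = {}    # parent_id -> full document text
        self.children = []   # (child_chunk, parent_id) pairs

    def add_documents(self, docs):
        for parent_id, text in enumerate(docs, start=len(self.parents)):
            self.parents[parent_id] = text
            for chunk in split_into_chunks(text, self.chunk_size):
                self.children.append((chunk, parent_id))

    def invoke(self, query):
        # Search over the small chunks, then return the parent of the best match.
        _, parent_id = max(self.children, key=lambda c: word_overlap(query, c[0]))
        return self.parents[parent_id]


docs = [
    "RAG retrieves documents and passes them to the model as context. It needs no retraining.",
    "Fine tuning updates model weights using additional labeled training examples.",
]
retriever = ToyParentRetriever(chunk_size=40)
retriever.add_documents(docs)
print(retriever.invoke("What is RAG?"))
```

The match is found on a 40-character chunk, yet the caller receives the full document that the chunk came from. That is the behavior the rest of this notebook checks with the real retriever.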


In [1]:
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import logging

## 1. Create the vector store and child splitter

In [2]:
# Create the Chroma vector store
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = Chroma(collection_name="full_documents", embedding_function=embedding_function) 

# Smaller chunks stored in the vector DB
child_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)
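
To make the two splitter parameters concrete, here is a simplified sketch of what `chunk_size=300` and `chunk_overlap=20` mean. The function `sliding_window_chunks` is a hypothetical fixed-width splitter written for illustration. `RecursiveCharacterTextSplitter` works differently: it prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are usually shorter than `chunk_size` and the shared text between neighbors is not exactly `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=20):
    """Fixed-width chunks; each chunk starts with the last
    chunk_overlap characters of the previous chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A 700-character dummy text, used only to show the window arithmetic
sample_text = "".join(chr(97 + i % 26) for i in range(700))
for chunk in sliding_window_chunks(sample_text):
    print(len(chunk))
```

The overlap means that a sentence cut at a chunk boundary still appears partly in both neighboring chunks, which reduces the chance that a relevant passage is missed by the similarity search.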

## 2. Create the ParentDocumentRetriever

In [3]:
# In-memory store that holds the parent documents
parent_doc_store = InMemoryStore()

# Create the retriever
# If a parent splitter is NOT used, the entire document is returned
# If a parent splitter is used, the larger parent chunks are returned
parent_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=parent_doc_store,
    child_splitter=child_splitter
)

# Add the data
loader = DirectoryLoader('./util', glob="**/*.txt")
docs = loader.load()
parent_retriever.add_documents(docs, ids=None)



## 3. Test the retriever behavior

In [4]:
# Test questions (named `questions` to avoid shadowing the built-in `input`)
questions = ["What is RAG?",
             "How is fine tuning different than RAG?",
             "What data is used to train ChatGPT?",
             "What are the benefits of generative AI?"]


# Change the index to test a different question
ndx = 2
print(questions[ndx], "\n")

print("Results from vector store retriever: ", "\n")
results = vector_store.as_retriever().invoke(questions[ndx])
for doc in results:
    print("CHUNK: ", doc.page_content, "\n")
print("========================================")
print("Results from parent document retriever: ", "\n")

# This retrieves the entire parent document instead of the matching chunk
results = parent_retriever.invoke(questions[ndx])

# Print the first retrieved document to validate the behavior
print(results[0].page_content)

What data is used to train ChatGPT? 

Results from vector store retriever:  

CHUNK:  As mentioned in the previous section, ChatGPT does not copy or store training information in a database. Instead, it learns about associations between words, and those learnings help the model update its numbers/weights. The model then uses those weights to predict and generate new words in 

CHUNK:  What type of information is used to teach ChatGPT? As noted above, ChatGPT and our other services are developed using (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or human trainers provide. This 

CHUNK:  Is personal information used to teach ChatGPT? A large amount of data on the internet relates to people, so our training information does incidentally include personal information. We don’t actively seek out personal information to train our models. 

CHUNK:  What is ChatGPT, and how does it work? ChatGP

## 4. Create the RAG chain with PDR

Left as an exercise.
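
As one possible starting point for the exercise, the chain can be sketched without any framework: retrieve the parent documents, put their text into a prompt, and send the prompt to a model. The names `PROMPT_TEMPLATE`, `build_prompt`, and `answer_with_rag` are hypothetical, and `llm` is assumed to be any callable that takes a prompt string and returns a string. The only interface taken from this notebook is `retriever.invoke(question)`, which returns documents with a `page_content` attribute.

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that you do not know.

Context:
{context}

Question: {question}
"""


def build_prompt(question, documents):
    """Join the retrieved documents into one context block and fill the template."""
    context = "\n\n".join(doc.page_content for doc in documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer_with_rag(question, retriever, llm):
    """Retrieve parent documents for the question, then ask the model.

    retriever: any object with an invoke(question) method, e.g. parent_retriever
    llm: hypothetical callable, prompt string in, answer string out
    """
    documents = retriever.invoke(question)
    return llm(build_prompt(question, documents))
```

With the objects defined earlier, a call would look like `answer_with_rag(questions[ndx], parent_retriever, llm)` once `llm` wraps a real model. Because the parent retriever returns whole documents, the prompt can grow large, so it is worth checking the combined context against the model's input limit.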