In [2]:
!pip install langchain
!pip install langchain_community
!pip install langchain_core
!pip install chromadb
!pip install langchain_openai


In [3]:
OpenAI_key="your api key"

In [5]:
# Integrations live in langchain_community / langchain_openai; the old langchain.* paths are deprecated
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.prompts import PromptTemplate

chat = ChatOpenAI(temperature=0, model='gpt-4', openai_api_key=OpenAI_key)

In [6]:
# Loading a single website
loader = WebBaseLoader("http://www.paulgraham.com/superlinear.html")
paul_graham_essay = loader.load()
print (f"You have {len(paul_graham_essay)} document with length {len(paul_graham_essay[0].page_content)} characters or roughly {len(paul_graham_essay[0].page_content) / 4} tokens")

You have 1 document with length 24854 characters or roughly 6213.5 tokens


Next we need to define our parent and child splitters. These are the text splitters that chunk our documents into subsets. The only difference between the parent and child splitters is their chunk size.

In [7]:
# Split your website into big chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000 * 4, chunk_overlap=0)

# This text splitter creates the child documents. Its chunk size should be small.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=125*4)
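
As a rough sanity check on these sizes, here is a minimal sketch of the arithmetic, assuming the same four-characters-per-token heuristic used when the essay was loaded. The names `CHARS_PER_TOKEN` and `approx_tokens` are illustrative only and are not part of LangChain.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic (assumption), same as the estimate printed earlier

def approx_tokens(num_chars: int) -> float:
    """Estimate a token count from a character count."""
    return num_chars / CHARS_PER_TOKEN

parent_chunk_chars = 1000 * CHARS_PER_TOKEN  # value passed to parent_splitter
child_chunk_chars = 125 * CHARS_PER_TOKEN    # value passed to child_splitter
essay_chars = 24854                          # length reported when the essay was loaded

# Fewest parent chunks possible if every chunk were completely full
min_parent_chunks = math.ceil(essay_chars / parent_chunk_chars)

print(parent_chunk_chars, child_chunk_chars, min_parent_chunks)
```

The splitter breaks on separators such as paragraph breaks and newlines, so real chunks are usually shorter than the limit and the actual number of chunks can be higher than this lower bound.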

In [9]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="parent_document_splits",
    embedding_function=OpenAIEmbeddings(openai_api_key=OpenAI_key),
    persist_directory="./db",
    client_settings=None,
)

In [10]:
# The storage layer for the parent documents
docstore = InMemoryStore()

In [11]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

Next we'll add our documents, but it's worth taking a second to appreciate everything that happens in the background:

1. We'll add a large document.
2. It will be split into large chunks (check out the code for that here).
3. Those chunks will get an id assigned to them.
4. Those chunks will be further split into small chunks, and each child doc will be assigned the id of the parent chunk it was split from.
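
The steps above can be sketched in plain Python. This is only an illustration of the idea, not LangChain's implementation: `split_fixed` is a naive fixed-width stand-in for `RecursiveCharacterTextSplitter`, and `index_document`, `parent_store`, and `child_docs_demo` are hypothetical names.

```python
import uuid

def split_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-width splitter (stand-in for a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def index_document(text: str, parent_size: int = 4000, child_size: int = 500):
    """Split text into parents, then children that carry their parent's id."""
    parent_store = {}  # parent id -> parent chunk text (plays the role of the docstore)
    children = []      # child chunks that would be embedded into the vectorstore
    for parent in split_fixed(text, parent_size):
        parent_id = str(uuid.uuid4())      # step 3: each parent chunk gets an id
        parent_store[parent_id] = parent
        for child in split_fixed(parent, child_size):
            # step 4: each child records the id of the parent it came from
            children.append({"page_content": child,
                             "metadata": {"doc_id": parent_id}})
    return parent_store, children

# A 9,000-character dummy document: 3 parents (4000 + 4000 + 1000 characters)
parent_store, child_docs_demo = index_document("x" * 9000)
print(len(parent_store), len(child_docs_demo))
```

Only the children would be embedded and searched; the `doc_id` in each child's metadata is what lets the retriever find the parent again.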

In [12]:
retriever.add_documents(paul_graham_essay)

In [13]:
num_parent_docs = len(retriever.docstore.store.items())
num_child_docs = len(set(retriever.vectorstore.get()['documents']))

print (f"You have {num_parent_docs} parent docs and {num_child_docs} child docs")

You have 8 parent docs and 82 child docs


If we query the vectorstore directly, which holds our child docs, we'll get child docs back.

In [14]:
child_docs = retriever.vectorstore.similarity_search("what is some investing advice?")

print (f"{len(child_docs)} child docs were found")
child_docs[0]

4 child docs were found


Document(page_content="as true in investing, for example. It's only useful to believe that\na company will do well if most other investors don't; if everyone\nelse thinks the company will do well, then its stock price will\nalready reflect that, and there's no room to make money.What else can we learn from these fields? In all of them you have\nto put in the initial effort. Superlinear returns seem small at\nfirst. At this rate, you find yourself thinking, I'll never get", metadata={'doc_id': '97959f38-cf63-4cf4-996f-d529f5420154', 'language': 'No language found.', 'source': 'http://www.paulgraham.com/superlinear.html', 'title': 'Superlinear Returns'})

Notice the doc_id on that child doc? It corresponds to a parent document. Let's find that parent document to double check. I'll print only the first part of the page_content to save space.

In [15]:
retriever.docstore.store.get(child_docs[0].metadata['doc_id']).page_content[:500]

"science. It has exponential growth, in the form of learning, combined\nwith thresholds at the extreme edge of performance — literally at\nthe limits of knowledge.The result has been a level of inequality in scientific discovery\nthat makes the wealth inequality of even the most stratified societies\nseem mild by comparison. Newton's discoveries were arguably greater\nthan all his contemporaries' combined.\n[11]This point may seem obvious, but it might be just as well to spell\nit out. Superlinear retur"

Nice! There it is.

Now let's do proper parent document retrieval and ask the retriever (not the vectorstore) for similar docs. This returns the parent documents.

In [16]:
# invoke() replaces the deprecated get_relevant_documents()
retrieved_docs = retriever.invoke("what is some investing advice?")

print (f"{len(retrieved_docs)} retrieved docs were found")

2 retrieved docs were found


I'm going to only do the first doc to save space, but there are more waiting for you. Keep in mind that LangChain will do the union of docs, so if you have two child docs from the same parent doc, you'll only return the parent doc once, not twice.
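
The union step can be sketched as follows: four child hits that point at two parents come back as two parent docs. This is a minimal illustration of the idea using plain dictionaries; `parents_for_hits`, `demo_hits`, and `demo_parents` are hypothetical names, and the code is not LangChain's own.

```python
def parents_for_hits(child_hits, parent_store, id_key="doc_id"):
    """Return each matching parent once, in the order its first child was hit."""
    seen_ids = []
    for hit in child_hits:
        parent_id = hit["metadata"][id_key]
        if parent_id not in seen_ids:  # skip parents we have already collected
            seen_ids.append(parent_id)
    # Look up the parents, ignoring ids that are missing from the store
    return [parent_store[pid] for pid in seen_ids if pid in parent_store]

demo_hits = [
    {"metadata": {"doc_id": "a"}},
    {"metadata": {"doc_id": "b"}},
    {"metadata": {"doc_id": "a"}},  # second child of parent "a"
    {"metadata": {"doc_id": "b"}},  # second child of parent "b"
]
demo_parents = {"a": "parent A", "b": "parent B"}

unique_parents = parents_for_hits(demo_hits, demo_parents)
print(len(demo_hits), "child hits ->", len(unique_parents), "parent docs")
```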

Note that we did not get the full essay back. Without a parent splitter the retriever would return the whole source document, which is often too long to use as context. Because we passed a parent_splitter, we get the larger parent chunks instead.

Notice the chunk size difference between the parent splitter and child splitter.

Now let's do the full process: the small chunks are used for the search, but the larger chunks are returned as our relevant documents and passed to the model as context.

In [17]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"

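# Join the parent chunks into one context string rather than passing Document objects
context = "\n\n".join(doc.page_content for doc in retrieved_docs)

# invoke() replaces the deprecated predict(); .content holds the answer text
chat.invoke(PROMPT.format(context=context, question=question)).content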


"In investing, it's only useful to believe that a company will do well if most other investors don't; if everyone else thinks the company will do well, then its stock price will already reflect that, and there's no room to make money."