In [None]:
!pip install langchain
!pip install langchain_community
!pip install langchain_core
!pip install langchain_openai
!pip install chromadb


In [None]:
OpenAI_key="your api key"

In [None]:
# The OpenAI wrappers come from langchain_openai, and the loaders/vectorstores
# from langchain_community (both installed above); the old langchain.* paths are deprecated.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.vectorstores import Chroma
from langchain.storage import InMemoryStore
import uuid

In [None]:
# Set up the LLM that will write the summaries (the summarize chain is created further down)
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=OpenAI_key)

In [None]:
# Loading a single website
loader = WebBaseLoader("http://www.paulgraham.com/superlinear.html")
docs = loader.load()

In [None]:
docs

[Document(page_content='Superlinear Returns\n\nOctober 2023One of the most important things I didn\'t understand about the world\nwhen I was a child is the degree to which the returns for performance\nare superlinear.Teachers and coaches implicitly told us the returns were linear.\n"You get out," I heard a thousand times, "what you put in." They\nmeant well, but this is rarely true. If your product is only half\nas good as your competitor\'s, you don\'t get half as many customers.\nYou get no customers, and you go out of business.It\'s obviously true that the returns for performance are superlinear\nin business. Some think this is a flaw of capitalism, and that if\nwe changed the rules it would stop being true. But superlinear\nreturns for performance are a feature of the world, not an artifact\nof rules we\'ve invented. We see the same pattern in fame, power,\nmilitary victories, knowledge, and even benefit to humanity. In all\nof these, the rich get richer.\n[1]You can\'t understand 

In [None]:
# Split your website into big chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7000, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)

print(f"Your {len(docs)} documents have been split into {len(chunks)} chunks")

Your 1 documents have been split into 5 chunks
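
To make the effect of chunk_size concrete, here is a minimal pure-Python sketch of size-bounded splitting with no overlap. It is a simplified stand-in, not RecursiveCharacterTextSplitter's actual algorithm: the name split_into_chunks is hypothetical, it only packs paragraphs separated by blank lines, and it hard-cuts an oversized paragraph instead of falling back to finer separators.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    Assumes chunk_size > 0. Simplified stand-in for illustration only.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Hard-cut any paragraph that is longer than chunk_size.
        pieces = [
            paragraph[i:i + chunk_size]
            for i in range(0, len(paragraph), chunk_size)
        ] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= chunk_size:
                current = candidate      # still fits, keep growing this chunk
            else:
                chunks.append(current)   # chunk is full, start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
toy_chunks = split_into_chunks(sample, chunk_size=25)
print(f"{len(toy_chunks)} chunks, longest is {max(len(c) for c in toy_chunks)} characters")
```

A large chunk_size such as the 7000 used above keeps each chunk long enough to be worth summarizing, and a chunk_overlap of 0 means no text is summarized twice.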


Next, we create the chain that will generate the summaries for us.

In [None]:
chain = load_summarize_chain(llm)

Then we loop through the chunks. For each one we get a summary, then attach the same unique identifier to both the summary document and the original chunk so the two can be tied together later.

In [None]:
id_key = "doc_id" # This is the key that we will tell the retriever to connect the summaries and original docs on

summaries = [] # To hold our summaries

for chunk in chunks:
    # First let's get an ID that we'll assign to the chunk and summary. You don't need a UUID here, use whatever you want
    unique_id = str(uuid.uuid4())

    # Then let's get the summary
    chunk_summary = chain.run([chunk])
    chunk_summary_document = Document(page_content=chunk_summary, metadata={id_key: unique_id}) # Give the ID to the summary
    summaries.append(chunk_summary_document)

    # Then finally add that same ID to your chunk
    chunk.metadata[id_key] = unique_id

print (f"You have {len(summaries)} summaries to go along with your {len(chunks)} chunks")

You have 5 summaries to go along with your 5 chunks


We have the same number of chunks and summaries, as expected.
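
The linking pattern can also be tried without any API calls. The sketch below is only an illustration under stated assumptions: plain dicts stand in for LangChain Document objects, first_sentence is a hypothetical stand-in for the LLM summary, and the toy_ names and placeholder texts are made up for the example.

```python
import uuid

toy_id_key = "doc_id"  # same join key name the notebook uses


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the LLM summary: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


# Plain dicts stand in for Document objects (page_content + metadata).
toy_chunks = [
    {"page_content": "First chunk opens here. More detail follows.", "metadata": {}},
    {"page_content": "Second chunk opens here. It also continues.", "metadata": {}},
]
toy_summaries = []

for chunk in toy_chunks:
    shared_id = str(uuid.uuid4())
    # The summary gets the ID at creation time...
    toy_summaries.append(
        {"page_content": first_sentence(chunk["page_content"]),
         "metadata": {toy_id_key: shared_id}}
    )
    # ...and the original chunk gets the very same ID.
    chunk["metadata"][toy_id_key] = shared_id

# Each summary and its source chunk now carry the same ID.
ids_line_up = all(
    s["metadata"][toy_id_key] == c["metadata"][toy_id_key]
    for s, c in zip(toy_summaries, toy_chunks)
)
print(ids_line_up)
```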

Now we will set up our vectorstore (to hold the summaries and their embeddings) and docstore (to hold the original plain text chunks).

In [None]:
# The vectorstore to use to index the summary chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings(openai_api_key=OpenAI_key))

# The storage layer for the parent documents
docstore = InMemoryStore()

Then we make our retriever. This special retriever uses the id_key we set earlier to link each summary to its original document.

In [None]:
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key, # "Hey, what should we join on?"
)
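
Conceptually, the retriever searches the summaries first and then swaps each hit for the original document that shares its id_key. The sketch below illustrates that join in plain Python. It is not MultiVectorRetriever's source code: toy_retrieve and word_overlap are hypothetical names, word overlap stands in for embedding similarity, and a dict stands in for the docstore.

```python
def word_overlap(query: str, text: str) -> int:
    """Crude similarity stand-in: number of lowercase words shared by both strings."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def toy_retrieve(query, summary_index, parent_store, id_key="doc_id", k=1):
    """Search the summaries, then return the parent documents they point to."""
    ranked = sorted(
        summary_index,
        key=lambda s: word_overlap(query, s["page_content"]),
        reverse=True,
    )
    parent_ids = []
    for summary in ranked[:k]:
        parent_id = summary["metadata"].get(id_key)
        # Keep the ranking order and skip duplicate parents.
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Join: look each ID up in the parent store, ignoring IDs with no parent.
    return [parent_store[i] for i in parent_ids if i in parent_store]


summary_index = [
    {"page_content": "organizations influence equity and outcomes", "metadata": {"doc_id": "a"}},
    {"page_content": "superlinear returns in business", "metadata": {"doc_id": "b"}},
]
parent_store = {"a": "full original text of chunk a", "b": "full original text of chunk b"}

print(toy_retrieve("influence of organizations on equity", summary_index, parent_store))
```

The query is matched against the short summaries, but what comes back is the full original text, which is the whole point of indexing summaries separately.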

In [None]:
# Add your summary docs (with their ids) to the vectorstore.
# add_documents returns the IDs the vectorstore assigned to the stored entries;
# these are separate from the doc_id values we put in the metadata.
retriever.vectorstore.add_documents(summaries)

['0a82efe4-c3a9-4c7b-9d39-72ad621b6929',
 'd8252e1d-a516-4fb6-a984-5746cbea0434',
 'b13001bf-a270-42e2-ba6c-445e5c654f75',
 'd92e1218-c8d7-4fc6-ae9c-f77dc5c5bc34',
 '72fc93e3-ee45-480f-b4a4-f998364132b8']

If you want to run a regular similarity search over your summaries, you can try it out here. Just call .similarity_search on the vectorstore inside your retriever.

In [None]:
_similar_docs = retriever.vectorstore.similarity_search("What is is the influence of organizations on equity?")
_similar_docs[0]

Document(page_content="The passage discusses various topics related to learning, competition, equity, and wealth accumulation. It emphasizes the importance of gradual improvements in technique rather than relying on a few exceptional individuals. It also explores the concept of superlinear returns and how it relates to effort and reward. The passage suggests that seeking competition can be motivating but is not always a reliable indicator of promising problems. It mentions the influence of organizations and institutions on outcomes and the potential negative impact of pressuring children into prestigious fields. The passage also touches on the shift from resource capture to discovery as a means of wealth accumulation. It concludes by discussing the conventional-minded individuals' dislike of inequality and their inability to understand novel ideas and great variation in performance.", metadata={'doc_id': 'da9aa266-48a1-4a8d-8acb-d520bf2b5c9b'})

But you see, we don't want the summaries returned, we want the original documents that are associated with the summaries.

Next, we'll map the summary unique ids to the original documents, then add those to the docstore. .mset takes a list of (key, value) pairs and adds them to the docstore.

We already added those ids to the original documents as metadata in the loop above. Note: keeping the id in the chunk metadata isn't critical to make the operation work, as long as the docstore keys match the ids on the summaries. It mainly lets you double check that they match up later.

Now we'll add the original documents to the docstore along with their ids.

In [None]:
# Store each original chunk in the docstore, keyed by the ID you made earlier
retriever.docstore.mset([(x.metadata[id_key], x) for x in chunks])
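
To see the call shape of .mset and its counterpart .mget in isolation, here is a minimal dict-backed sketch. ToyStore is a hypothetical class written for this example, not LangChain's InMemoryStore; it only mirrors the idea that mset takes (key, value) pairs and mget returns one value per requested key, with None for keys that are missing.

```python
class ToyStore:
    """Hypothetical dict-backed key-value store with mset/mget-style methods."""

    def __init__(self):
        self._data = {}

    def mset(self, key_value_pairs):
        # Store every (key, value) pair, overwriting existing keys.
        for key, value in key_value_pairs:
            self._data[key] = value

    def mget(self, keys):
        # Return values in the order requested; missing keys give None.
        return [self._data.get(key) for key in keys]


store = ToyStore()
store.mset([("id-1", "original chunk one"), ("id-2", "original chunk two")])
print(store.mget(["id-2", "missing-id"]))  # ['original chunk two', None]
```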

Then we'll run the same query through the retriever, and this time the original document is returned, not the summary.

In [None]:
retrieved_docs = retriever.get_relevant_documents("What is is the influence of organizations on equity?")
print (retrieved_docs[0].page_content[:500])
print (retrieved_docs[0].metadata)

gradual improvements in technique, not the discoveries of a few
exceptionally learned people.[3]
It's not mathematically correct to describe a step function as
superlinear, but a step function starting from zero works like a
superlinear function when it describes the reward curve for effort
by a rational actor. If it starts at zero then the part before the
step is below any linearly increasing return, and the part after
the step must be above the necessary return at that point or no one
would bo
{'source': 'http://www.paulgraham.com/superlinear.html', 'title': 'Superlinear Returns', 'language': 'No language found.', 'doc_id': 'da9aa266-48a1-4a8d-8acb-d520bf2b5c9b'}


Notice how the 'doc_id' in the original document returned matches the 'doc_id' in the summary above - good to go ;)