## Multivector Retriever

In [11]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.document_loaders import TextLoader
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

loaders = [
    TextLoader("../docs/txt/paul_graham_essay.txt"),
    TextLoader("../docs/txt/state_of_the_union.txt"),
]
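# Load both source files into a single list of Documents
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split into large "parent" chunks; these are what the retriever will return
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)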

In [12]:
docs

[Document(page_content='What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on 

In [13]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
import torch

# Run the embedding model on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
embedding_model = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-base",
    model_kwargs={"device": device},
)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=embedding_model
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
import uuid

# One unique ID per parent document; each child chunk will store it under id_key
doc_ids = [str(uuid.uuid4()) for _ in docs]

load INSTRUCTOR_Transformer
max_seq_length  512


In [14]:
doc_ids

['cdce121a-7d40-4cf0-b24e-00a1fe00aae5',
 'f88c1d39-4a99-4e25-b1fa-ee43e4c33b69',
 '19d528ca-92c3-4a17-a083-e2af856601df',
 'd3d26779-119a-4605-85ed-d8d79e9db9ba',
 'c2be8652-ad7b-4aec-a7a8-b70e89974299',
 '2d6fba26-7b15-4c83-9a28-67155526daea',
 'd61a9525-67c5-47e3-aa12-0f965e063aac',
 '00d2c550-46ba-49e9-9d58-278c4a2bdf4e',
 'dae35cec-4f43-4924-bdbf-caf60dfc4d72',
 '6d139e7a-777d-4c28-a0c0-fda0bba2a825',
 '1e6ba8c4-e5ba-4e56-8353-906586b55037',
 'd1f62d29-407d-4e05-8618-b4165066b831']

In [16]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

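sub_docs = []
for doc_id, doc in zip(doc_ids, docs):
    _sub_docs = child_text_splitter.split_documents([doc])
    # Tag every child chunk with the ID of the parent it came from
    for _doc in _sub_docs:
        _doc.metadata[id_key] = doc_id
    sub_docs.extend(_sub_docs)

# Index the child chunks for similarity search
# (re-running this cell adds them again, so the collection can end up with duplicates)
retriever.vectorstore.add_documents(sub_docs)
# Store the parent documents under the same IDs
retriever.docstore.mset(list(zip(doc_ids, docs)))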
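
The cell above links every small chunk to its parent through the `doc_id` metadata key. As a rough illustration of how that link can be used at query time, here is a minimal pure-Python sketch. It is not LangChain's implementation: `resolve_parents`, `parent_store`, and `child_hits` are hypothetical names, and plain dicts and tuples stand in for the docstore and for search results.

```python
from typing import Dict, List, Tuple


def resolve_parents(
    child_hits: List[Tuple[str, dict]],
    parent_store: Dict[str, str],
    id_key: str = "doc_id",
) -> List[str]:
    """Map ranked child hits to unique parent documents, keeping rank order."""
    ordered_ids: List[str] = []
    for _text, metadata in child_hits:
        parent_id = metadata.get(id_key)
        # Keep only the first occurrence of each parent ID
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # IDs that are missing from the store are skipped
    return [parent_store[pid] for pid in ordered_ids if pid in parent_store]


# Hypothetical stand-ins for the docstore and for similarity-search results
parent_store = {"p1": "full parent one", "p2": "full parent two"}
child_hits = [
    ("chunk a", {"doc_id": "p2"}),
    ("chunk b", {"doc_id": "p2"}),
    ("chunk c", {"doc_id": "p1"}),
    ("chunk d", {"doc_id": "stale-id"}),
]

print(resolve_parents(child_hits, parent_store))
# ['full parent two', 'full parent one']
```

Several child hits from the same parent collapse into one result, which is why a query can match a 400-character chunk and still come back with the full 10000-character parent.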

In [17]:
retriever.vectorstore.similarity_search("justice breyer")

[Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '34f4c91f-3498-451b-9759-e2bde2872fbb', 'source': '../docs/txt/state_of_the_union.txt'}),
 Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '1e6ba8c4-e5ba-4e56-8353-906586b
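
The first hit above carries a `doc_id` (`34f4c91f-...`) that does not appear in the `doc_ids` list printed earlier, and the same passage is returned twice. This suggests the collection still holds chunks from an earlier run of the indexing cell. A small check along the following lines can separate current hits from stale ones. It is a sketch only: `split_hits_by_known_ids` is a hypothetical helper, and the second hit ID below is picked from the printed list purely for illustration.

```python
from typing import Iterable, List, Tuple


def split_hits_by_known_ids(
    hit_ids: Iterable[str], known_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (current, stale): hit IDs that are in known_ids and those that are not."""
    known = set(known_ids)
    current: List[str] = []
    stale: List[str] = []
    for hit_id in hit_ids:
        (current if hit_id in known else stale).append(hit_id)
    return current, stale


# Two of the IDs printed by the doc_ids cell
known_ids = [
    "cdce121a-7d40-4cf0-b24e-00a1fe00aae5",
    "f88c1d39-4a99-4e25-b1fa-ee43e4c33b69",
]
hit_ids = [
    "34f4c91f-3498-451b-9759-e2bde2872fbb",  # doc_id of the first hit above
    "f88c1d39-4a99-4e25-b1fa-ee43e4c33b69",  # chosen for illustration only
]

current, stale = split_hits_by_known_ids(hit_ids, known_ids)
print("current:", current)
print("stale:", stale)
```

In the notebook, the inputs would be the `doc_id` metadata values of the search hits and the `doc_ids` list. Any stale IDs point to chunks whose parent is not in the current docstore.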

In [18]:
retriever.get_relevant_documents("retriever")

[Document(page_content='We originally hoped to launch in September, but we got more ambitious about the software as we worked on it. Eventually we managed to build a WYSIWYG site builder, in the sense that as you were creating pages, they looked exactly like the static ones that would be generated later, except that instead of leading to static pages, the links all referred to closures stored in a hash table on the server.\n\nIt helped to have studied art, because the main goal of an online store builder is to make users look legit, and the key to looking legit is high production values. If you get page layouts and fonts and colors right, you can make a guy running a store out of his bedroom look more legit than a big company.\n\n(If you\'re curious why my site looks so old-fashioned, it\'s because it\'s still made with this software. It may look clunky today, but in 1996 it was the last word in slick.)\n\nIn September, Robert rebelled. "We\'ve been working on this for a month," he sai