# VectorStore Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.


### Set up two vector stores 

In [1]:
import os
import openai

openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

persist_directory = 'docs/chroma/'

In [3]:
embedding = OpenAIEmbeddings()

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [4]:
print(vectordb._collection.count())

223


In [5]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [6]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

### Similarity Search

In [7]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [8]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

In [9]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

### Addressing Diversity: Maximum marginal relevance

Last class we encountered one problem: similarity search alone can return near-duplicate chunks, so we need a way to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
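
To make the trade-off concrete, here is a minimal sketch of an MMR-style selection loop in plain Python. It is not LangChain's implementation: the function name `mmr_select`, the toy 2-D vectors, and the way `lambda_mult` weights relevance against redundancy are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return indices of k documents chosen by an MMR-style greedy loop."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: relevance[i], reverse=True)[:fetch_k]

    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings (made up): docs 0 and 1 are near-duplicates, doc 2 points elsewhere.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
picked = mmr_select(query, docs, k=2, fetch_k=3)
print(picked)  # [1, 2]
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the loop reduces to plain similarity ranking, which would return the two near-duplicates instead.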

In [10]:
question = "what did they say about Frontotemporal Dementia?"
docs_ss = vectordb.similarity_search(question, k=3)

In [11]:
docs_ss[0].page_content[:100]

'symptom onset and death and disease duration in genetic frontotemporal dementia: an international \nr'

In [12]:
docs_ss[1].page_content[:100]

'symptom onset and death and disease duration in genetic frontotemporal dementia: an international \nr'

Note the difference when using `MMR`:

In [13]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [14]:
docs_mmr[0].page_content[:100]

'symptom onset and death and disease duration in genetic frontotemporal dementia: an international \nr'

In [15]:
docs_mmr[1].page_content[:100]

'Table 1. continued from previous page.DiagnosisFrequencyComments (Frequency)ALS19.3%FTD-ALS11.0%\nAty'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about a specific PDF can also return results from other PDFs.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
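
Conceptually, a metadata filter narrows the candidate set before ranking. The sketch below shows that idea in plain Python; it is not how Chroma implements filtering. The `filtered_search` helper, the chunk dictionaries, and the scores (standing in for query-to-chunk similarity) are hypothetical.

```python
def filtered_search(chunks, score_fn, metadata_filter, k=3):
    """Rank only the chunks whose metadata matches every key in metadata_filter."""
    matching = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return sorted(matching, key=score_fn, reverse=True)[:k]

# Hypothetical chunks; "score" stands in for similarity to the query.
chunks = [
    {"text": "chunk A", "score": 0.91, "metadata": {"source": "docs/NBK1438.pdf", "page": 2}},
    {"text": "chunk B", "score": 0.85, "metadata": {"source": "docs/NBK268647.pdf", "page": 17}},
    {"text": "chunk C", "score": 0.80, "metadata": {"source": "docs/NBK268647.pdf", "page": 4}},
]

hits = filtered_search(chunks, lambda c: c["score"], {"source": "docs/NBK268647.pdf"})
for h in hits:
    print(h["metadata"])
```

Chunk A has the highest score but is excluded because its `source` does not match the filter.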

In [16]:
question = "what did they say about disorders in the article by Gossye?"

In [17]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/NBK268647.pdf"}
)

In [18]:
for d in docs:
    print(d.metadata)

{'page': 17, 'source': 'docs/NBK268647.pdf'}
{'page': 17, 'source': 'docs/NBK268647.pdf'}
{'page': 4, 'source': 'docs/NBK268647.pdf'}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
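
The two-part output can be pictured with a toy, rule-based stand-in for the LLM step. This is only an illustration of the shape of the result, not of how `SelfQueryRetriever` works internally: the `StructuredQuery` class, the `toy_self_query` function, and the author-to-file lookup table are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredQuery:
    query: str                     # text to embed for vector search
    filter: Optional[dict] = None  # metadata constraint, or None if nothing was inferred

# Hypothetical lookup table standing in for what an LLM would infer
# from the metadata field descriptions.
AUTHOR_TO_SOURCE = {"gossye": "docs/NBK268647.pdf"}

def toy_self_query(question: str) -> StructuredQuery:
    """Rule-based stand-in for the LLM: split a question into query text and a filter."""
    match = re.search(r"\s+in the article by (\w+)\??$", question, flags=re.IGNORECASE)
    if not match:
        return StructuredQuery(query=question)
    source = AUTHOR_TO_SOURCE.get(match.group(1).lower())
    query_text = question[: match.start()]
    return StructuredQuery(query=query_text, filter={"source": source} if source else None)

sq = toy_self_query("what did they say about disorders in the article by Gossye?")
print(sq)
```

A real LLM handles far more phrasings than this single regular expression; the point is only that one natural-language question yields both a search string and a filter.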

In [19]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [20]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The pdf the chunk is from, should be one of `docs/NBK268647.pdf`, `docs/NBK1438.pdf`, or `docs/NBK1513.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the pdf",
        type="integer",
    ),
]

In [21]:
# pip install lark

In [22]:
document_content_description = "Disease reviews from the National Center for Biotechnology Information"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [23]:
question = "what did they say about disorders in the article by Gossye?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [24]:
docs = retriever.get_relevant_documents(question)



query='disorders' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/NBK268647.pdf') limit=None


In [25]:
for d in docs:
    print(d.metadata)

{'page': 4, 'source': 'docs/NBK268647.pdf'}
{'page': 4, 'source': 'docs/NBK268647.pdf'}
{'page': 7, 'source': 'docs/NBK268647.pdf'}
{'page': 7, 'source': 'docs/NBK268647.pdf'}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression refers to the use of an LLM to shrink each returned document to just the relevant sentences. 
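
As a rough illustration of the idea, the sketch below keeps only the sentences that share a content word with the question. Keyword overlap is a crude stand-in for the LLM-based extraction used in this notebook; the `compress` function, its stopword list, and the sample document are made up for the example.

```python
import re

def compress(document: str, question: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap content words with the question."""
    stopwords = {"what", "did", "they", "say", "about", "the", "a", "of", "in", "is"}
    query_terms = set(re.findall(r"[a-z]+", question.lower())) - stopwords
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        s for s in sentences
        if len(query_terms & set(re.findall(r"[a-z]+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)

# Made-up document: only the middle sentence is relevant to the question.
doc = (
    "The clinic opens at nine. "
    "Several disorders share overlapping symptoms. "
    "Parking is available behind the building."
)
short = compress(doc, "what did they say about disorders?")
print(short)
```

An LLM-based compressor can judge relevance by meaning rather than exact word match, at the cost of one extra LLM call per retrieved document.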

In [26]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [27]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [28]:
# Build the LLM-based compressor; the next cell wraps our vectorstore retriever with it
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [29]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [30]:
question = "what did they say about disorders?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

"Pregnant women with CCD spectrum disorder should be monitored closely for cephalopelvic disproportion, which may require delivery by cesarean section. The primary cesarean section rate among women with a CCD spectrum disorder is 69%, which is higher than in controls [ Cooper et al 2001 ]."
----------------------------------------------------------------------------------------------------
Document 2:

"To inform affected persons & their families re nature, MOI, & implications of C9orf72 -FTD/ALS spectrum to facilitate medical & personal decision making"
----------------------------------------------------------------------------------------------------
Document 3:

"To inform affected persons & their families re nature, MOI, & implications of C9orf72 -FTD/ALS spectrum to facilitate medical & personal decision making"
----------------------------------------------------------------------------------------------------
Document 4:

Recurrent sinus infections and other upper 

Notice that the retrieved documents are now much shorter.

### Combining various techniques

In [31]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [32]:
question = "what did they say about disorders?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

"Pregnant women with CCD spectrum disorder should be monitored closely for cephalopelvic disproportion, which may require delivery by cesarean section. The primary cesarean section rate among women with a CCD spectrum disorder is 69%, which is higher than in controls [ Cooper et al 2001 ]."
----------------------------------------------------------------------------------------------------
Document 2:

"To inform affected persons & their families re nature, MOI, & implications of C9orf72 -FTD/ALS spectrum to facilitate medical & personal decision making"
----------------------------------------------------------------------------------------------------
Document 3:

Cruts et al 2013, Masrori & Van Damme 2020, Alzheimer Disease Overview, diffuse Lewy body disease, Huntington disease, GRN Frontotemporal Dementia, prion disease, corticobasal degeneration, progressive supranuclear palsy, Hensman Moss et al 2014, depression, obsessive compulsive disorder, bipolar disorder, schi

### Other types of retrieval

It's worth noting that a vector database is not the only tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
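
For intuition about the TF-IDF option, here is a minimal pure-Python sketch that ranks documents by cosine similarity between TF-IDF vectors. It does not reproduce `TFIDFRetriever`; the `tfidf_rank` function, the smoothed IDF formula chosen here, and the sample sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, docs, k=2):
    """Return indices of the k documents with the highest TF-IDF cosine score."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        counts = Counter(t for t in tokens if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    q = vectorize(tokenize(query))
    scores = [cosine(q, vectorize(toks)) for toks in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]

# Made-up sentences; only the last one contains the query term "disorders".
docs = [
    "Frontotemporal dementia affects behavior and language.",
    "The death cap is a poisonous mushroom.",
    "Movement disorders can accompany dementia.",
]
top = tfidf_rank("What disorders were mentioned?", docs, k=1)
print(top)  # [2]
```

Unlike embedding-based search, this approach matches only on shared terms, so a query about "illnesses" would not find the "disorders" sentence.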

In [33]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [34]:
# Load PDF
loader = PyPDFLoader("docs/NBK268647.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [35]:
# pip install scikit-learn

In [36]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [37]:
question = "What disorders were mentioned?"
docs_svm = svm_retriever.get_relevant_documents(question)



In [38]:
docs_svm[0].page_content[:100]

'functional impairment is significant.Cognitive functionCognitive rehab\nPsychiatric/ \nbehavioral \nman'

In [39]:
docs_svm[1].page_content[:100]

'FeatureFrequency\nCommentNearly allCommon\xa01InfrequentDisinhibition● Impulsivity, socially unacceptabl'

In [40]:
question = "what did they say about disorders?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content='•25% of familial FTD;\n•30%-50% of familial ALS (Of note, only 10% of individuals with ALS have a positive family history and \nsimplex cases [i.e., a single occurrence in a family] outnumber familial cases among individuals with \nC9orf72 -ALS.);\n•Up to 88% of individuals with manifestations of both FTD and ALS and a positive family history of these \ndisorders [ Cruts et al 2013 , Masrori & Van Damme 2020 ].\nDifferential  diagnosis for C9orf72 -FTD\n•Other types of dementia, especially with behavioral changes.  Differential  diagnosis includes "frontal \nvariant" Alzheimer disease (see Alzheimer Disease Overview ), diffuse  Lewy body disease, Huntington \ndisease , other forms of FTD (see GRN  Frontotemporal Dementia ), prion disease , corticobasal \ndegeneration, and progressive supranuclear palsy.\nSome individuals with C9orf72 -FTD/ALS have a choreiform movement disorder which (especially when \ncombined with behavioral abnormalities) may be confused with 