# Vectorstores and Embeddings

Recall the overall workflow for retrieval-augmented generation (RAG):

![](images/lesson-2-overview.png)

In [1]:
import os
import openai

openai.api_key = os.environ['OPENAI_API_KEY']

We just discussed `Document Loading` and `Splitting`.

In [2]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/NBK268647.pdf"),
    PyPDFLoader("docs/NBK268647.pdf"),
    PyPDFLoader("docs/NBK1438.pdf"),
    PyPDFLoader("docs/NBK1513.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

len(docs)

86

In [3]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

In [4]:
splits = text_splitter.split_documents(docs)

In [5]:
len(splits)

223

## Embeddings

In [6]:
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

In [7]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [8]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [9]:
import numpy as np

In [10]:
np.dot(embedding1, embedding2)

0.963220689072572

In [11]:
np.dot(embedding1, embedding3)

0.771106734568816

In [12]:
np.dot(embedding2, embedding3)

0.7596423521230876
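The dot products above act as similarity scores: the two sentences about dogs score about 0.96 against each other, while each scores below 0.78 against the weather sentence. A dot product equals cosine similarity only when both vectors have unit length, which OpenAI documents for its embedding models. The sketch below uses short made-up vectors (not real embeddings) and a hypothetical `cosine_similarity` helper to show the general formula, which also works for vectors that are not normalized.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (illustrative only, not real embeddings).
dogs = [0.9, 0.1, 0.0]
canines = [0.8, 0.2, 0.0]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(dogs, canines))  # close to 1: similar direction
print(cosine_similarity(dogs, weather))  # close to 0: nearly orthogonal
```

Because the result divides by both norms, rescaling either vector does not change the score; only the direction matters.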

## Vectorstores

In [13]:
# pip install chromadb

In [14]:
from langchain.vectorstores import Chroma

In [15]:
persist_directory = 'docs/chroma/'

In [16]:
!rm -rf ./docs/chroma  # remove old database files if any

In [17]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [18]:
print(vectordb._collection.count())

223


### Similarity Search

In [19]:
question = "What is the mechanism of pathology?"

In [20]:
docs = vectordb.similarity_search(question, k=3)

In [21]:
len(docs)

3
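Conceptually, `similarity_search(question, k=3)` embeds the question and returns the `k` stored chunks whose vectors are closest to it. The following is a minimal brute-force sketch of that idea, not Chroma's implementation (which relies on an index rather than scanning every vector). The `top_k` helper, the chunk labels, and the 2-dimensional vectors are all made up for illustration.

```python
import heapq


def top_k(query_vec, stored, k=3):
    """Return the k (score, text) pairs with the highest dot product.

    `stored` is a list of (text, vector) pairs. Vectors are assumed to be
    unit length, so the dot product ranks them by cosine similarity.
    """
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in stored
    )
    return heapq.nlargest(k, scored)


# Hypothetical unit vectors standing in for chunk embeddings.
stored = [
    ("chunk about disease mechanisms", [1.0, 0.0]),
    ("chunk about genetic counseling", [0.6, 0.8]),
    ("chunk about references", [0.0, 1.0]),
]
query = [0.8, 0.6]

for score, text in top_k(query, stored, k=2):
    print(round(score, 2), text)
```

Scanning every stored vector is fine for a few hundred chunks like the 223 here, but the cost grows linearly with the collection, which is why vector stores build an index.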

In [22]:
print(docs[0].page_content)

that alter the C9orf72 protein sequence are pathogenic.
Mechanism of disease causation.  Although the molecular basis of C9orf72 -FTD/ALS has been under intense 
investigation, the exact disease mechanism is not yet fully understood. Mounting evidence indicates the 
involvement of multiple disease mechanisms in a multiple-hit model of both gain-of-function and loss-of-
function mechanisms:
•Gain-of-function mechanisms
⚬RNA toxicity caused by sequestration of RNA-binding proteins and normal C9orf72  transcripts by 
RNA species containing the pathogenic G 4C2 (GGGGCC) hexanucleotide repeat expansion into 
nuclear RNA foci, thereby interfering with their physiologic functions [ Mori et al 2013b ]
⚬G4C2 repeat-associated, non-ATG bidirectional translation of the expanded G 4C2 hexanucleotide 
repeat sequences into diverse aggregation-prone dipeptide repeat (DPR) proteins (poly-GA, poly-
GP , poly-GR, poly-PA, and poly-PR), leading to DPR-positive inclusion pathology [ Mori et al 2013a , 
M

Let's save this so we can use it later!

In [23]:
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you roughly 80% of the way there with very little effort.

But there are some failure modes that can crop up.

Here are two edge cases that can arise; we'll fix them in the next class.

## Failure mode #1: Duplicate chunks

In [24]:
question = "what did they say about Frontotemporal Dementia?"

In [25]:
docs = vectordb.similarity_search(question, k=5)

In [26]:
docs[0]

Document(page_content='symptom onset and death and disease duration in genetic frontotemporal dementia: an international \nretrospective cohort study. Lancet Neurol. 2020;19:145–56. PubMed PMID: 31810826 .\nMora JS, Genge A, Chio A, Estol CJ, Chaverri D, Hernández M, Marín S, Mascias J, Rodriguez GE, Povedano M, \nPaipa A, Dominguez R, Gamez J, Salvado M, Lunetta C, Ballario C, Riva N, Mandrioli J, Moussy A, Kinet JP , \nAuclair C, Dubreuil P , Arnold V , Mansfield  CD, Hermine O, et al. Masitinib as an add-on therapy to riluzole \nin patients with amyotrophic lateral sclerosis: a randomized clinical trial. Amyotroph Lateral Scler \nFrontotemporal Degener. 2020;21:5–14. PubMed PMID: 31280619 .\nMori K, Arzberger T, Grässer FA, Gijselinck I, May S, Rentzsch K, Weng SM, Schludi MH, van der Zee J, Cruts \nM, Van Broeckhoven C, Kremmer E, Kretzschmar HA, Haass C, Edbauer D. Bidirectional transcripts of the \nexpanded C9orf72 hexanucleotide repeat are translated into aggregating dipeptide r

In [27]:
docs[1]

Document(page_content='symptom onset and death and disease duration in genetic frontotemporal dementia: an international \nretrospective cohort study. Lancet Neurol. 2020;19:145–56. PubMed PMID: 31810826 .\nMora JS, Genge A, Chio A, Estol CJ, Chaverri D, Hernández M, Marín S, Mascias J, Rodriguez GE, Povedano M, \nPaipa A, Dominguez R, Gamez J, Salvado M, Lunetta C, Ballario C, Riva N, Mandrioli J, Moussy A, Kinet JP , \nAuclair C, Dubreuil P , Arnold V , Mansfield  CD, Hermine O, et al. Masitinib as an add-on therapy to riluzole \nin patients with amyotrophic lateral sclerosis: a randomized clinical trial. Amyotroph Lateral Scler \nFrontotemporal Degener. 2020;21:5–14. PubMed PMID: 31280619 .\nMori K, Arzberger T, Grässer FA, Gijselinck I, May S, Rentzsch K, Weng SM, Schludi MH, van der Zee J, Cruts \nM, Van Broeckhoven C, Kremmer E, Kretzschmar HA, Haass C, Edbauer D. Bidirectional transcripts of the \nexpanded C9orf72 hexanucleotide repeat are translated into aggregating dipeptide r

Notice that we're getting duplicate chunks (because the same PDF was loaded into the index twice).

Semantic search returns the chunks most similar to the query, but it does nothing to enforce diversity among them.
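One simple workaround, shown here only as a sketch, is to drop exact duplicates from the returned list after the search. The `Doc` class is a hypothetical stand-in for the retrieved documents (it only mirrors their `page_content` and `metadata` attributes), and `drop_duplicate_chunks` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_duplicate_chunks(docs):
    """Keep the first occurrence of each distinct page_content, in order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Placeholder results: the first two chunks have identical text.
results = [
    Doc("chunk A", {"page": 17}),
    Doc("chunk A", {"page": 17}),
    Doc("chunk B", {"page": 9}),
]
unique = drop_duplicate_chunks(results)
print(len(results), "->", len(unique))
```

This removes only byte-for-byte copies. It does not make the remaining results more diverse, and it can leave fewer than `k` results, so it is a stopgap rather than a fix.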


## Failure mode #2: Not obeying the question 

The question below asks about one particular PDF, but the results include chunks from other PDFs as well.

In [28]:
question = "what did they say about disorders in the article by Gossye?"

In [29]:
docs = vectordb.similarity_search(question, k=5)

In [30]:
for doc in docs:
    print(doc.metadata)

{'page': 17, 'source': 'docs/NBK268647.pdf'}
{'page': 17, 'source': 'docs/NBK268647.pdf'}
{'page': 6, 'source': 'docs/NBK1513.pdf'}
{'page': 9, 'source': 'docs/NBK268647.pdf'}
{'page': 9, 'source': 'docs/NBK268647.pdf'}


In [31]:
print(docs[4].page_content)

Table 4. continued from previous page.System/ConcernEvaluationCommentGenetic 
counselingBy genetics professionals 2To inform affected  persons & their families re nature, MOI, & 
implications of C9orf72 -FTD/ALS spectrum to facilitate medical & 
personal decision makingFamily support 
& resourcesAssess need for:
•Community or online resources  such 
as Parent to Parent ;
•Social work involvement for parental 
support;
•Home nursing referral.•Early discussion of advanced care planning
•The affected  person's perspective & burden must be 
considered in clinical decision making.
•The presence of cognitive impairment may raise ethical 
concerns.ADL = activities of daily living; EMG = electromyography; LMN = lower motor neuron; MOI = mode of inheritance; OT = 
occupational therapy; PT = physical therapy; UMN = upper motor neuron1. Devenney et al [2014] , Piguet et al [2017] , Oskarsson et al [2018] , Masrori & Van Damme [2020] 2. Medical geneticist, certified  genetic counselor, or certifie
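The metadata printed above shows why this happens: the embedding captures what the question means, but nothing restricts the search to one source file. As a rough sketch, results can be post-filtered on their `source` metadata. The `Doc` class and `filter_by_source` helper are hypothetical, the page text is a placeholder, and the target file is chosen arbitrarily, since the notebook does not say which PDF contains the article by Gossye.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_source(docs, source):
    """Keep only the documents whose metadata 'source' equals `source`."""
    return [doc for doc in docs if doc.metadata.get("source") == source]


# Metadata copied from the output above; page_content is a placeholder.
results = [
    Doc("...", {"page": 17, "source": "docs/NBK268647.pdf"}),
    Doc("...", {"page": 17, "source": "docs/NBK268647.pdf"}),
    Doc("...", {"page": 6, "source": "docs/NBK1513.pdf"}),
    Doc("...", {"page": 9, "source": "docs/NBK268647.pdf"}),
    Doc("...", {"page": 9, "source": "docs/NBK268647.pdf"}),
]

# Arbitrary target, for illustration only.
kept = filter_by_source(results, "docs/NBK1513.pdf")
for doc in kept:
    print(doc.metadata)
```

Filtering after the search can leave far fewer than `k` results (one out of five here), so applying the restriction inside the query itself is usually preferable where the vector store supports it.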

## Next steps 

The approaches discussed in the next lecture can be used to address both failure modes.