# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database we built earlier.

## Vectorstore retrieval


In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [2]:
#!pip install lark

### Similarity Search

In [3]:
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [4]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [5]:
print(vectordb._collection.count())

150


In [6]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [7]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [8]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [9]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [10]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` (MMR) strives to achieve both relevance to the query *and diversity* among the results.
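
To make the relevance/diversity trade-off concrete, here is a minimal sketch of one common formulation of maximum marginal relevance. It is not the code the vectorstore runs: the function names, the toy 2-D vectors, and the `lambda_mult` weighting are assumptions for illustration. Each step picks the candidate with the best score, where the score rewards similarity to the query and penalizes similarity to documents already picked.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity.

    lambda_mult=1.0 means pure relevance; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy embeddings (made up): docs 0 and 1 are near-duplicates, doc 2 is different
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]

picked = mmr_select(query, docs, k=2, lambda_mult=0.5)
print(picked)  # [0, 2]: the near-duplicate doc 1 is skipped
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
```

With `lambda_mult=1.0` the selection reduces to a plain similarity ranking, which is why the two near-duplicates come back together.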

In [11]:
question = "what did they say about dress code?"
docs_ss = vectordb.similarity_search(question,k=3)

In [12]:
docs_ss[0].page_content[:900]

'TIFF V olunteer Dress Codes \n Volunteer positions will outline specific dress code requirements in addition to wearing your V olunteer \n T-shirt and photo I.D. Accreditation. Requirements are listed in the Description link of all shifts on the \n Volunteer  Hub. The three V olunteer dress codes are: \n 1.  Volunteer T -shirt, Photo I.D. Accreditation + Smart Casual \n The smart-casual dress code ensures that your clothes are comfortable while still neat, clean and \n suited for a professional environment. Follow these tips to make sure you’re adhering to our \n smart-casual dress code: \n ●  Jeans/pants may be worn, but should be free of large rips and frays. \n ●  Dresses, skirts and shorts of any kind may be worn, but must be knee-length. \n ●  High heels, including open-toe shoes, may be worn. \n ●  Running shoes may be worn, as long as the laces are tidy . \n ●  No midrif f-baring or backless '

In [13]:
docs_ss[1].page_content

'Bottoms and shoes worn must be black. \n ●  Black dress pants (shorts may not be worn) \n ●  Black dresses or skirts (below mid-thigh) \n ●  Black shoes (clean, black sneakers may be worn, but sandals may not) \n *Please note: There may be some volunteer roles that don’ t require you to wear your 2023 \n Volunteer T -shirt. This will be communicated ahead of time. (examples include smart-casual dress \n code for in-of fice shifts, wearing all black for special events, etc.) \n Personal Appearance \n TIFF celebrates the individuality and diversity of our staf f and V olunteers. Where not deemed an \n occupational safety hazard, individual choices about the following are at V olunteers’ discretion: \n ●  Hairstyles, haircuts or hair colour \n ●  Tattoos, body art, body piercings \n ●  Religious and culture-specific attire or jewelry \n 16'

Note the difference in results with `MMR`.

In [14]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [15]:
docs_mmr[0].page_content

'TIFF V olunteer Dress Codes \n Volunteer positions will outline specific dress code requirements in addition to wearing your V olunteer \n T-shirt and photo I.D. Accreditation. Requirements are listed in the Description link of all shifts on the \n Volunteer  Hub. The three V olunteer dress codes are: \n 1.  Volunteer T -shirt, Photo I.D. Accreditation + Smart Casual \n The smart-casual dress code ensures that your clothes are comfortable while still neat, clean and \n suited for a professional environment. Follow these tips to make sure you’re adhering to our \n smart-casual dress code: \n ●  Jeans/pants may be worn, but should be free of large rips and frays. \n ●  Dresses, skirts and shorts of any kind may be worn, but must be knee-length. \n ●  High heels, including open-toe shoes, may be worn. \n ●  Running shoes may be worn, as long as the laces are tidy . \n ●  No midrif f-baring or backless shirts. \n 2.  Volunteer T -shirt, Photo I.D. Accreditation + Business Casual \n Busine

In [16]:
docs_mmr[1].page_content

'0'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
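
As a rough picture of what a metadata filter does, the sketch below applies an exact-match filter to a handful of chunks. The chunk texts, page numbers, and the `matches` helper are made up for illustration; real vectorstores each have their own filter syntax and apply the filter inside the index.

```python
def matches(metadata, flt):
    """Return True when every key in the filter equals the chunk's metadata value."""
    return all(metadata.get(key) == value for key, value in flt.items())

# Hypothetical chunks: each one carries its text plus a metadata dict
chunks = [
    {"text": "Dress code overview", "metadata": {"page": 15}},
    {"text": "Black dress pants", "metadata": {"page": 16}},
    {"text": "Personal appearance", "metadata": {"page": 16}},
    {"text": "Shift sign-up", "metadata": {"page": 3}},
]

# Only chunks whose metadata satisfies the filter are eligible for ranking
page_16 = [c["text"] for c in chunks if matches(c["metadata"], {"page": 16})]
print(page_16)  # ['Black dress pants', 'Personal appearance']
```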

In [17]:
question = "what did they say about dress code on page 16?"

In [18]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"page":16}
)

In [19]:
for d in docs:
    print(d.metadata)

{'page': 16, 'source': 'saved/2023_Festival_Volunteer_Orientation_Manual.pdf'}
{'page': 16, 'source': 'saved/2023_Festival_Volunteer_Orientation_Manual.pdf'}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
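
The sketch below shows the shape of that split on our example question. It is only a toy: a regular expression stands in for the LLM, and `split_query` is a hypothetical helper, not part of `SelfQueryRetriever`. The point is the output, a search string plus a metadata filter.

```python
import re

def split_query(question):
    """Toy stand-in for the LLM step: separate 'on page N' from the question."""
    match = re.search(r"\bon page (\d+)\b", question, flags=re.IGNORECASE)
    if match is None:
        # Nothing to infer: search with the full question and no filter
        return question.strip(), {}
    page = int(match.group(1))
    # Drop the matched phrase and keep the rest as the search string
    query = question[:match.start()].rstrip() + question[match.end():]
    return query.strip(), {"page": page}

query, flt = split_query("what did they say about dress code on page 16?")
print(query)  # what did they say about dress code?
print(flt)    # {'page': 16}
```

An LLM can handle phrasings a fixed pattern cannot, which is the reason to use it for this step.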

In [20]:
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [21]:
metadata_field_info = [
#     AttributeInfo(
#         name="source",
#         description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
#         type="string",
#     ),
    AttributeInfo(
        name="page",
        description="The page from the manual",
        type="integer",
    ),
]

**Note:** The default model for `OpenAI` (`from langchain.llms import OpenAI`) is `text-davinci-003`. Due to the deprecation of OpenAI's model `text-davinci-003` on 4 January 2024, you'll be using OpenAI's recommended replacement model, `gpt-3.5-turbo-instruct`, instead.

In [22]:
document_content_description = "Training manual"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [23]:
question = "what did they say about dress code on page 16?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next cell. This can be safely ignored.

In [24]:
docs = retriever.get_relevant_documents(question)

In [25]:
for d in docs:
    print(d.metadata)

{'page': 16, 'source': 'saved/2023_Festival_Volunteer_Orientation_Manual.pdf'}
{'page': 16, 'source': 'saved/2023_Festival_Volunteer_Orientation_Manual.pdf'}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
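
To illustrate the idea without an LLM, here is a toy sketch that keeps only the sentences sharing a keyword with the question. The keyword-overlap rule, the stopword list, and the sample text (partly invented filler) are assumptions for illustration; the extractor used in this notebook asks an LLM to pull out the relevant passages instead.

```python
import re

# Small hand-picked stopword list (an assumption for this example)
STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the", "a", "on", "is"})

def words(text):
    """Lowercase alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def compress(document, question):
    """Keep only the sentences that share a non-stopword term with the question."""
    terms = words(question) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if terms & words(s)]
    return " ".join(kept)

# Made-up document: one relevant sentence buried between irrelevant ones
doc = (
    "Parking is not provided. "
    "Tattoos and body piercings are at the volunteer's discretion. "
    "Shifts are listed on the hub."
)

compressed = compress(doc, "What did they say about tattoos?")
print(compressed)  # Tattoos and body piercings are at the volunteer's discretion.
```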

In [26]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [27]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [28]:
# Build an LLM-based compressor that extracts the query-relevant parts of each document
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [29]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [30]:
question = "what did they say about dress code on page 16?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

Black dress pants (shorts may not be worn) 
Black dresses or skirts (below mid-thigh) 
Black shoes (clean, black sneakers may be worn, but sandals may not)
----------------------------------------------------------------------------------------------------
Document 2:

- Volunteer positions will outline specific dress code requirements in addition to wearing your V olunteer 
 T-shirt and photo I.D. Accreditation. Requirements are listed in the Description link of all shifts on the 
 Volunteer  Hub.
- The three V olunteer dress codes are: 
 1.  Volunteer T -shirt, Photo I.D. Accreditation + Smart Casual 
 The smart-casual dress code ensures that your clothes are comfortable while still neat, clean and 
 suited for a professional environment. Follow these tips to make sure you’re adhering to our 
 smart-casual dress code: 
 ●  Jeans/pants may be worn, but should be free of large rips and frays. 
 ●  Dresses, skirts and shorts of any kind may be worn, but must be knee-length.

## Combining various techniques

In [31]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [33]:
question = "what did they say about volunteer t-shirts?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

- "All registered Festival Volunteers who have signed up for a minimum of four (4) shifts will be eligible to pick up a 2023 Festival Volunteer T-shirt and photo I.D. Accreditation."
- "To uphold our standard of professionalism and for security purposes, you must wear your Volunteer T-shirt and photo I.D. Accreditation whenever you are on shift, unless otherwise directed by your supervisor or the Volunteer Office."
- "If you arrive at your scheduled shift without your Volunteer T-shirt and photo I.D. Accreditation, you may be sent home without a Volunteer Reward Voucher and marked as a no-show."
- "T-shirt and photo I.D. Accreditation pickup will commence in late August. Volunteers will be notified of the details by email closer to the date, including instructions on how to submit a photo for your I.D."
----------------------------------------------------------------------------------------------------
Document 2:

*Please note: There may be some volunteer roles that don’ 

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
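
For intuition about the TF-IDF option, here is a minimal sketch of term-frequency/inverse-document-frequency scoring in plain Python. It is not how `TFIDFRetriever` is implemented: the smoothing in `idf`, the unnormalized sum used as the score, and the sample texts are assumptions for illustration. No embeddings are involved; the score depends only on which words the query and each document share.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphabetic tokens, in order."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(query, docs):
    """Return document indices sorted from highest to lowest TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency: rarer terms weigh more
        return math.log((1 + n_docs) / (1 + df[term])) + 1

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(sum(tf[t] * idf(t) for t in query_terms))
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Made-up snippets in the spirit of the volunteer manual
snippets = [
    "Tattoos and body piercings are at the discretion of each volunteer.",
    "Black shoes must be worn with the volunteer shirt.",
    "Pickup will commence in late August.",
]

ranking = tfidf_rank("What did they say about tattoos?", snippets)
print(ranking[0])  # 0: the only snippet that contains the word "tattoos"
```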

In [34]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [35]:
# Load PDF
loader = PyPDFLoader("saved/2023_Festival_Volunteer_Orientation_Manual.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

In [36]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [37]:
question = "What did they say about tattoos?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content='Deb Carpenter \n Debbie Newell \n Debbie Phillips \n Debbie Randell \n Debbie Young-Hermanns \n Deborah Girardo \n Deborah Massa \n Debra Hubner \n Diana Chorozy \n Diane Reid \n Diane Sugai \n Diann Margott Santiago \n Diem Pham \n Dimitra Tzamtzis \n Dimple Dhawan \n Dinara Khalitova \n Dionisio Neto \n Divya Lamba \n Donna Shoom-Kirsch \n Dorothy De Souza \n Eddy Woo \n Edison Chai \n Edmond Kwan \n Edmund Smyk \n Eileen Chong \n Eimíle McLennon \n Elizabeth Archibald \n Elizabeth Hawtin \n Elizabeth Poad \n Emilia Tryon \n Emily Li \n Emma Wiatrzyk \n Etienne Harrison \n Eujin Ong \n Eva Chan \n Faith Horizon \n Fanny Lui \n Flora Sung \n Flora Velichkova \n Frances Scaini \n Francine Brodeur \n Gabriele Golz \n Gail Mackinnon \n Garry Meyer \n Gayle Forler \n George Vandebunte \n Gina Morrison \n Glenda Restoule \n Gloria Chan \n Grace Awang \n Grace Tsang \n Heather Ardiel \n Heather Brown \n Heather Kemp \n Heather Kerr \n Heather Wood \n Helen Ing \n Hele

In [38]:
question = "what did they say about tattoos?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content='receive postings for year-round V olunteer opportunities to support our programming all year long. \n Eligible V olunteers will receive postings for year-round opportunities as they become available, typically \n between October and June. \n Please note that volunteer opportunities are subject to availability and details may vary pending \n programming changes. \n 8   Volunteer Expectations \n All TIFF V olunteers are expected to be... \n INFORMED \n You must attend any mandatory orientation/training sessions to ensure that you know what your specific \n position entails. Y ou will be volunteering under the supervision and guidance of a staf f member or \n Volunteer Captain who has years of experience volunteering with TIFF . Please listen carefully to what \n they have to say . They are there to support you and help you do your best. If you have any questions, \n please ask. \n FRIENDL Y \n Volunteers are representatives of TIFF and work extensively with guests,