# Retriever & Vectorstore basics

https://python.langchain.com/docs/modules/data_connection/retrievers/

#### Sample PDF
Download it to your local file system: https://constitutioncenter.org/media/files/constitution.pd

#### Dependencies
!pip install sentence-transformers chromadb

## Part-1 : Retriever : ChromaDB

### 1. Load documents using a loader

Use the PyPDFLoader

In [1]:
from langchain_community.document_loaders import PyPDFLoader

# Change this to the path of the PDF you downloaded
pdf_source = 'C:/Users/raj/Downloads/us-constitution.pdf'

pdf_loader = PyPDFLoader(pdf_source) 

documents = pdf_loader.load()

### 2. Split the document into chunks

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 1000

chunk_overlap = 200

pdf_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    keep_separator = False,
    strip_whitespace = True,
)

chunked_documents = pdf_text_splitter.split_documents(documents)

print("Number of chunks: ", len(chunked_documents))

Number of chunks:  74
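
#### Sketch: how chunk_size and chunk_overlap interact
A minimal pure-Python sketch, not the splitter's actual algorithm. `RecursiveCharacterTextSplitter` splits on separators and then merges the pieces, while the hypothetical `sliding_window_chunks` helper below simply slides a fixed-size window over the text. It is only meant to show that consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

The last 4 characters of each chunk are repeated at the start of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.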


### 3. ChromaDB Vectorstore : Generate the embeddings for the chunks & add them to the store

In this sample we are using ChromaDB, but you may use any vector store.

https://python.langchain.com/docs/integrations/vectorstores/chroma



#### Parameters (check the doc for additional parameters)
https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma

* collection_name (optional, defaults to 'langchain')
* embedding_function (optional)
* persist_directory (optional)
* collection_metadata (optional)
* client (optional)


In [3]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# load it into Chroma using default embedding all-MiniLM-L6-v2
collection_name = 'us-constitution'
collection_metadata = {'embedding': 'all-MiniLM-L6-v2', 'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vector_store = Chroma(collection_name=collection_name, collection_metadata=collection_metadata, embedding_function=embedding_function)
vector_store.add_documents(chunked_documents)

vector_store

<langchain_community.vectorstores.chroma.Chroma at 0x2917f6ef850>

### 4. Similarity - Search the vector store

In [4]:
k = 2
query = "What did the president say about Ketanji Brown Jackson"
result_docs = vector_store.similarity_search(query, k = k)

# The following call is equivalent to the one above
# result_docs = vector_store.search(search_type="similarity", query=query, k=k)

result_docs

[Document(page_content='dent, and the  Time and Place for commencing Proceedings \nunder this Constitution  \nThat after such Publication the Electors should be ap-  \npointed, and the Senators and Representatives elected: That the Electors should meet on the Day fixed for the Election \nof the President, and should transmit their Votes certified, \nsigned, sealed and directed, as the Constitution requires, to the Secretary of the United States in Congress assembled, \nthat the Senators and Representatives should convene at the Time  and Place  assigned;  that the Senators  should  appoint \na President of the Senate, for the sole  Purpose of receiving, \nopening and counting the Votes for President; and, that  \nafter he shall be chosen, the Congress, together with the President, should, without Delay, proceed to execute this \nConstitution  \nBy the unanimous  Order of the Convention  \nGo. Washington- Presidt:  \nW. JACKSON  Secretary.  \n \n* Language  in brackets  has been changed
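
#### Sketch: what a similarity search does
A toy illustration, assuming a hypothetical bag-of-words "embedding" in place of all-MiniLM-L6-v2 and a plain list in place of ChromaDB. `toy_embed` and `ToyVectorStore` are made-up names, not part of LangChain or Chroma. The idea is the same as above: embed every chunk once, embed the query, and return the k closest chunks.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory store that keeps (text, embedding) pairs."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def similarity_search(self, query, k=4):
        query_vec = toy_embed(query)
        # Rank every stored text by similarity to the query, best first
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vec, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


texts = [
    "the president shall be commander in chief",
    "congress shall have power to lay and collect taxes",
    "the senate shall choose their other officers",
]
store = ToyVectorStore()
store.add_texts(texts)
top = store.similarity_search("who is commander in chief", k=1)
print(top)
```

A search always returns the k nearest chunks, even when nothing in the store is really about the query, so the returned documents should be checked against their scores.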

### 5. Similarity Search with Score

https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma.similarity_search_with_relevance_scores

#### Relevance scores
Relevance scores are meant to lie in [0, 1], where 0 is dissimilar and 1 is most similar. Higher is better.

##### Parameters
* query (str) – input text
* k (int) – Number of Documents to return. Defaults to 4.

#### Distance scores
The distance is an unbounded continuous number. Lower is better.

##### Parameters
* query (str) – Query text to search for.
* k (int) – Number of results to return. Defaults to 4.
* filter (Optional[Dict[str, str]]) – Filter by metadata. Defaults to None.
* where_document (Optional[Dict[str, str]]) –

#### Note
Behavior depends on the specific vector database implementation.

In [7]:
k = 2

query = "What did the president say about Ketanji Brown Jackson"

result_docs_relevance = vector_store.similarity_search_with_relevance_scores(query, k = k)

print("similarity_search_with_relevance_scores")
for result in result_docs_relevance:
    print("Metadata : ", result[0].metadata, "  Score : ",result[1])

print("similarity_search_with_score")
result_docs_score = vector_store.similarity_search_with_score(query, k = k)

for result in result_docs_score:
    print("Metadata : ", result[0].metadata, "  Score : ",result[1])

similarity_search_with_relevance_scores
Metadata :  {'page': 10, 'source': 'C:/Users/raj/Downloads/us-constitution.pdf'}   Score :  0.01704647858189623
Metadata :  {'page': 5, 'source': 'C:/Users/raj/Downloads/us-constitution.pdf'}   Score :  -0.02897627841175665
similarity_search_with_score
Metadata :  {'page': 10, 'source': 'C:/Users/raj/Downloads/us-constitution.pdf'}   Score :  1.390106201171875
Metadata :  {'page': 5, 'source': 'C:/Users/raj/Downloads/us-constitution.pdf'}   Score :  1.4551922082901
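
#### Sketch: relating the two scores
The two outputs above are consistent with a linear mapping from distance to relevance: relevance = 1 - distance / sqrt(2). This appears to be the default conversion LangChain applies for Euclidean distance, but treat the formula as an assumption and verify it against your installed version. The helper name `euclidean_relevance` is made up for this sketch.

```python
import math


def euclidean_relevance(distance: float) -> float:
    """Map a distance to a relevance score (assumed formula: 1 - d / sqrt(2))."""
    return 1.0 - distance / math.sqrt(2)


# Distances copied from the similarity_search_with_score output above
distances = [1.390106201171875, 1.4551922082901]

for d in distances:
    print("Distance : ", d, "  Relevance : ", euclidean_relevance(d))
```

Under this mapping any distance larger than sqrt(2) (about 1.414) produces a negative relevance, which explains the second score above falling outside [0, 1]. Scores this close to 0 also suggest the document has little to say about the query.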


### 6. MMR
Returns documents selected using maximal marginal relevance (MMR). MMR optimizes for similarity to the query and for diversity among the selected documents.

https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma.max_marginal_relevance_search

#### Parameters
* query (str) – Text to look up documents similar to.
* k (int) – Number of Documents to return. Defaults to 4.
* fetch_k (int) – Number of Documents to fetch to pass to MMR algorithm.
* lambda_mult (float) – Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.
* filter (Optional[Dict[str, str]]) – Filter by metadata. Defaults to None.
* where_document (Optional[Dict[str, str]]) –

In [14]:
k = 2

query = "What did the president say about Ketanji Brown Jackson"
  
# Control the diversity with a value between 0 and 1: 0 = max diversity, 1 = min diversity, default = 0.5
lambda_mult = 0.5

result_docs_mmr = vector_store.max_marginal_relevance_search(query, k = k, lambda_mult = lambda_mult)

for result in result_docs_mmr:
    print(result.metadata)

{'page': 10, 'source': 'C:/Users/raj/Downloads/us-constitution.pdf'}
{'page': 13, 'source': 'C:/Users/raj/Downloads/us-constitution.pdf'}
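
#### Sketch: the MMR selection loop
A minimal pure-Python sketch of greedy MMR over small hand-made vectors, to show the effect of `lambda_mult`. It is not LangChain's implementation. `mmr_select` and the example vectors are made up for illustration. Each step picks the candidate that maximizes `lambda_mult * similarity_to_query - (1 - lambda_mult) * max_similarity_to_already_selected`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return the indices of up to k documents chosen greedily by MMR."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)

    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],   # very close to the query
    [1.0, 0.12],   # near-duplicate of the first document
    [0.6, -0.80],  # less similar to the query, but different from the others
]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
```

With `lambda_mult = 1.0` the loop reduces to a plain similarity ranking and keeps the near-duplicate. With `lambda_mult = 0.5` the near-duplicate is penalized and the third document is picked instead.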


## Part-2 : Retriever : Information systems

https://python.langchain.com/docs/integrations/retrievers



### Arxiv
https://python.langchain.com/docs/integrations/retrievers/arxiv

https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html#langchain_community.retrievers.arxiv.ArxivRetriever

#### Dependency

!pip install --upgrade --quiet  arxiv

#### Creation Parameters
* ARXIV_MAX_QUERY_LENGTH  default = 300
* doc_content_chars_max  default = 4000
* get_full_documents default = False /* Setting this to True requires pip install pymupdf */
* load_all_available_meta default = False
* load_max_docs    default = 100

In [5]:
from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(load_max_docs=2)

COT_Document_identifier = '2201.11903'

# get_relevant_documents() is deprecated; invoke() is the current entry point
results = retriever.invoke(COT_Document_identifier)

In [6]:
results[0].metadata

{'Entry ID': 'http://arxiv.org/abs/2201.11903v6',
 'Published': datetime.date(2023, 1, 10),
 'Title': 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models',
 'Authors': 'Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou'}
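
#### Sketch: checking the returned paper against the requested identifier
The `Entry ID` above embeds both the arXiv identifier and the version. A small sketch, assuming new-style identifiers of the form `YYMM.NNNNN` followed by an optional `vN` suffix. `parse_entry_id` is a made-up helper, not part of the retriever.

```python
import re

# New-style arXiv identifier with an optional version suffix (assumed format)
ENTRY_ID_PATTERN = re.compile(
    r"arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$"
)


def parse_entry_id(entry_id: str):
    """Return (arxiv_id, version) from an Entry ID URL; version is None if absent."""
    match = ENTRY_ID_PATTERN.search(entry_id)
    if match is None:
        raise ValueError(f"Unrecognized Entry ID: {entry_id}")
    version = match.group("version")
    return match.group("id"), int(version) if version else None


arxiv_id, version = parse_entry_id("http://arxiv.org/abs/2201.11903v6")
print(arxiv_id, version)
```

Comparing `arxiv_id` with the identifier used in the query is a quick way to confirm the retriever returned the paper that was asked for.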

In [7]:
print(results[0])

page_content='We explore how generating a chain of thought -- a series of intermediate\nreasoning steps -- significantly improves the ability of large language models\nto perform complex reasoning. In particular, we show how such reasoning\nabilities emerge naturally in sufficiently large language models via a simple\nmethod called chain of thought prompting, where a few chain of thought\ndemonstrations are provided as exemplars in prompting. Experiments on three\nlarge language models show that chain of thought prompting improves performance\non a range of arithmetic, commonsense, and symbolic reasoning tasks. The\nempirical gains can be striking. For instance, prompting a 540B-parameter\nlanguage model with just eight chain of thought exemplars achieves state of the\nart accuracy on the GSM8K benchmark of math word problems, surpassing even\nfinetuned GPT-3 with a verifier.' metadata={'Entry ID': 'http://arxiv.org/abs/2201.11903v6', 'Published': datetime.date(2023, 1, 10), 'Title': '

In [8]:
docs = retriever.invoke("1605.08386")

In [9]:
docs[0]

Document(page_content='Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\ndimension.', metadata={'Entry ID': 'http://arxiv.org/abs/1605.08386v1', 'Published': datetime.date(2016, 5, 26), 'Title': 'Heat-bath random walks with Markov bases', 'Authors': 'Caprice Stanley, Tobias Windisch'})

In [10]:
docs[0].page_content[:400]

'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a ge'