In [1]:
import os

In [2]:
def get_file_contents(filename):
    """Return the stripped contents of `filename`, or None if the file is missing."""
    try:
        with open(filename, 'r') as f:
            # The file is assumed to contain a single line holding the API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)
        return None

In [None]:
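# Read each API key from a one-line text file in the parent directory
google_key = get_file_contents("..\\GoogleAPIKey.txt")
if google_key:  # os.environ rejects None, so skip a missing key file
    os.environ['GOOGLE_API_KEY'] = google_key
groq_key = get_file_contents("..\\GroqAPIKey.txt")
if groq_key:
    os.environ['GROQ_API_KEY'] = groq_key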

## Vector Storage

### Chroma

##### pip install langchain-chroma

In [4]:
from langchain_community.document_loaders import TextLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

In [5]:
embeddings_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

In [7]:
# Load the document, split it into chunks, embed each chunk, and load the chunks into the vector store
raw_documents = TextLoader("..\\RAGFiles\\LangchainRetrieval.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, embeddings_model)

In [9]:
# Embed the query text and print the closest stored chunk
query = "What is text embedding and how does langchain help in doing it"
docs = db.similarity_search(query)
print(docs[0].page_content)

Text embedding models
Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of a text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary API, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.

Vector stores
With the rise of embeddings, there has emerged a need for databases to support efficient storage and searching of these embeddings. LangChain provides integrations with over 50 different vectorstores, from open-source local ones to cloud-hosted proprietary ones, allowing you to choose the one best suited for your needs. LangChain exposes a standard interface, allowing you to easily swap between vector stores.


In [10]:
# Embed the query explicitly, then search with the raw vector instead of the text
embedding_vector = embeddings_model.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

Text embedding models
Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of a text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary API, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.

Vector stores
With the rise of embeddings, there has emerged a need for databases to support efficient storage and searching of these embeddings. LangChain provides integrations with over 50 different vectorstores, from open-source local ones to cloud-hosted proprietary ones, allowing you to choose the one best suited for your needs. LangChain exposes a standard interface, allowing you to easily swap between vector stores.
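
The two searches above return the same chunk because both end up comparing a query vector against the stored chunk vectors. As a rough illustration of that idea only, here is a minimal pure-Python sketch. The names `cosine_similarity`, `top_k`, `toy_store`, and `toy_query` are hypothetical, the three-dimensional vectors are made up, and cosine similarity is an assumption chosen for readability; the metric a real store uses depends on how it is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=4):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]

# Made-up "embeddings" standing in for real model output
toy_store = [
    ("text embedding models", [0.9, 0.1, 0.0]),
    ("vector stores", [0.7, 0.6, 0.1]),
    ("indexing api", [0.1, 0.9, 0.2]),
    ("retrievers", [0.0, 0.2, 0.9]),
    ("document loaders", [0.2, 0.1, 0.8]),
]
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_store)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The default of `k=4` in this sketch mirrors the four documents the notebook's searches return.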


In [11]:
print(len(docs))

4


In [12]:
print(docs[1].page_content)

Parent Document Retriever: This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
Self Query Retriever: User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the semantic part of a query from other metadata filters present in the query.
Ensemble Retriever: Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
And more!
Indexing
The LangChain Indexing API syncs your data from any source into a vector store, helping you:

Avoid writing duplicated content into the vector store
Avoid re-writing unchanged content
Avoid re-computing embeddings over unchanged content
All of which should save you time and money, as well as improve your vector search results.


In [13]:
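# Rebuild the index with smaller chunks (500 characters, 20 characters of overlap)
# Note: no collection_name is passed, so these chunks appear to be added to the same
# default in-memory collection as the 1000-character chunks above; the results below
# mix both sizes. Pass a distinct collection_name to keep the two indexes separate.
raw_documents = TextLoader("..\\RAGFiles\\LangchainRetrieval.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, embeddings_model)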

Created a chunk of size 760, which is longer than the specified 500
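
The warning above is expected with a separator-based splitter: the text is first cut at a separator and the pieces are then merged up to `chunk_size`, so a single piece that is already longer than `chunk_size` is kept whole. The following is a simplified sketch of that behaviour, not the library's implementation. The function name `split_on_separator` is hypothetical, the `"\n\n"` separator is assumed, and `chunk_overlap` is ignored entirely.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Cut text at `separator`, then merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # the piece still fits in the open chunk
        else:
            if current:
                chunks.append(current)   # close the open chunk
            current = piece              # may exceed chunk_size; it is kept whole
    if current:
        chunks.append(current)
    return chunks

# The middle paragraph (60 characters) is longer than chunk_size=40
sample = "short paragraph" + "\n\n" + "x" * 60 + "\n\n" + "another short one"
chunks = split_on_separator(sample, chunk_size=40)
print([len(c) for c in chunks])  # [15, 60, 17]
```

The 60-character paragraph contains no separator, so it survives as one oversized chunk, which is the situation the 760-character warning reports.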


In [14]:
query = "What is text embedding and how does langchain help in doing it"
docs = db.similarity_search(query)
print(docs[0].page_content)

Text embedding models
Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of a text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary API, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.


In [15]:
print(len(docs))

4


In [16]:
print(docs[1].page_content)

Text embedding models
Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of a text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary API, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.

Vector stores
With the rise of embeddings, there has emerged a need for databases to support efficient storage and searching of these embeddings. LangChain provides integrations with over 50 different vectorstores, from open-source local ones to cloud-hosted proprietary ones, allowing you to choose the one best suited for your needs. LangChain exposes a standard interface, allowing you to easily swap between vector stores.


In [17]:
print(docs[2].page_content)

Parent Document Retriever: This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
Self Query Retriever: User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the semantic part of a query from other metadata filters present in the query.
Ensemble Retriever: Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
And more!
Indexing
The LangChain Indexing API syncs your data from any source into a vector store, helping you:


### FAISS

##### pip install faiss-cpu

In [18]:
from langchain_community.vectorstores import FAISS


In [19]:
# Load the document, split it into chunks, embed each chunk, and load the chunks into the vector store
raw_documents = TextLoader("..\\RAGFiles\\LangchainRetrieval.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = FAISS.from_documents(documents, embeddings_model)
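
To give a feel for what an exact ("flat") index does at query time, here is a small pure-Python sketch of an exhaustive nearest-neighbour scan using squared Euclidean distance. It assumes, without the notebook confirming it, that the FAISS store built above uses a flat L2 index. The names `squared_l2`, `flat_l2_search`, and `toy_vectors` are hypothetical, and the two-dimensional vectors are made up.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(query_vector, vectors, k=4):
    """Scan every stored vector and return (index, distance) for the k closest."""
    distances = [(i, squared_l2(query_vector, v)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:k]

# Made-up 2-D vectors standing in for chunk embeddings
toy_vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.9, 1.2],
    [5.0, 5.0],
    [1.5, 1.0],
]

neighbours = flat_l2_search([1.0, 1.0], toy_vectors, k=3)
print(neighbours)
```

With a distance metric, smaller values mean closer matches, which is the opposite ordering from a similarity score.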