## Data Retriever (As a Retriever)

In LangChain, a retriever is a component that fetches relevant documents or chunks of data from a knowledge base (like a vector store) based on a user query. It's an essential part of Retrieval-Augmented Generation (RAG) pipelines.

```
User query → Retriever → Top N relevant documents
```

It works in conjunction with an LLM to build a RAG pipeline.

We can convert the vector store into a retriever by calling `as_retriever()`. This lets us plug it directly into a chain pipeline.
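
To make the query-to-top-N flow concrete, here is a minimal sketch of what a retriever does, written without LangChain. Everything in it is an assumption for illustration: `ToyRetriever`, `embed`, and `cosine` are hypothetical names, and word counts stand in for the real embedding model used later in this notebook.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyRetriever:
    """Hypothetical minimal retriever: ranks stored texts against a query."""

    def __init__(self, texts: list[str], k: int = 3):
        self.k = k
        # Embed every text once, up front, like a vector store does.
        self.index = [(text, embed(text)) for text in texts]

    def invoke(self, query: str) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self.index,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[: self.k]]


toy_retriever = ToyRetriever(
    [
        "Progress isn't always loud. Sometimes, it whispers.",
        "Think about the tallest buildings.",
        "But those whispers? They build momentum.",
    ],
    k=2,
)
top_chunks = toy_retriever.invoke("progress whispers")
print(top_chunks)
```

The real retriever built below follows the same shape (embed the query, rank stored chunks, return the top `k`), but it relies on learned embeddings and a vector database rather than word counts.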

In [2]:
# Required libraries
! pip install langchain-openai




In [3]:
# Get the OpenAI embedding model through my helper library (utility.llm_factory)
from langchain_openai import ChatOpenAI
from utility.llm_factory import LLMFactory

embedding_model = LLMFactory.get_embedding_model('openai')
embedding_model

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7a78c3229790>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7a78c3031150>, model='text-embedding-3-large', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [4]:
# Convert text to an embedding vector
text = "The quick brown fox jumps over the lazy dog."
query_result = embedding_model.embed_query(text)
# Output the embedding vector length
print(f"Embedding vector dimension: {len(query_result)}")
# Output the first 10 elements of the embedding vector
print(f"First 10 elements of the embedding vector: {query_result[:10]}")

Embedding vector dimension: 3072
First 10 elements of the embedding vector: [-0.012696055695414543, 0.009628890082240105, -0.011472541838884354, 0.0206488985568285, 0.0013094117166474462, -0.021486921235919, 0.010106563568115234, 0.06275119632482529, -0.01743088848888874, 0.006817320827394724]
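
Embedding vectors become useful once they are compared with each other. The sketch below computes cosine similarity, one common comparison measure, in plain Python. The three short vectors are made-up values standing in for real 3072-dimensional embeddings, and the metric a given vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length dense vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors (assumption: real embeddings have thousands of dimensions).
vec_fox = [0.9, 0.1, 0.0]
vec_dog = [0.8, 0.2, 0.1]
vec_tax = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(vec_fox, vec_dog)
sim_far = cosine_similarity(vec_fox, vec_tax)
print(f"fox vs dog: {sim_close:.3f}, fox vs tax: {sim_far:.3f}")
```

Vectors that point in nearly the same direction score close to 1, which is the property a retriever exploits when it ranks chunks against a query.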


In [5]:
## Step 1: Load a text file as a document
from langchain_community.document_loaders import TextLoader

loader = TextLoader('./_data/speech.txt', encoding='utf-8')
docs = loader.load()

docs

[Document(metadata={'source': './_data/speech.txt'}, page_content="Good morning everyone,\n\nToday, I want to talk about something incredibly simple, yet profoundly powerful: small steps.\n\nIn a world obsessed with big wins and overnight success, we often forget that every great achievement starts with a single small action.\n\nWhether you're trying to learn a new skill, change a habit, or build something meaningful — it always begins with the decision to take one small step forward.\n\nThink about the tallest buildings. They're built one brick at a time. Olympic athletes? They train for years, often making tiny improvements day after day.\n\nSo, the next time you feel overwhelmed by your goals, just focus on the next step. Not the next ten, not the whole staircase — just the next one.\n\nProgress isn’t always loud. Sometimes, it whispers.\n\nBut those whispers? They build momentum.\n\nAnd that momentum? It builds success.\n\nThank you.\n\n")]

In [6]:
## Step 2: Split the document into smaller chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter


with open('./_data/speech.txt', 'r', encoding='utf-8') as f:
    speech = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,
    chunk_overlap=20
)

# Alternative: text_splitter.split_text(speech) returns plain strings instead of Document objects.

# Note: create_documents takes a list of text strings, not Document objects.
document_chunks = text_splitter.create_documents([speech])
document_chunks  # Display all chunks

[Document(metadata={}, page_content='Good morning everyone,'),
 Document(metadata={}, page_content='Today, I want to talk about something incredibly simple, yet profoundly'),
 Document(metadata={}, page_content='yet profoundly powerful: small steps.'),
 Document(metadata={}, page_content='In a world obsessed with big wins and overnight success, we often forget that'),
 Document(metadata={}, page_content='often forget that every great achievement starts with a single small action.'),
 Document(metadata={}, page_content="Whether you're trying to learn a new skill, change a habit, or build something"),
 Document(metadata={}, page_content='or build something meaningful — it always begins with the decision to take one'),
 Document(metadata={}, page_content='to take one small step forward.'),
 Document(metadata={}, page_content="Think about the tallest buildings. They're built one brick at a time. Olympic"),
 Document(metadata={}, page_content='at a time. Olympic athletes? They train for yea
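
The repeated phrases between neighbouring chunks above (for example, "yet profoundly") come from `chunk_overlap`. The sketch below shows the basic idea with a simplified fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also cuts on separators such as paragraph breaks and spaces, which is why its chunks end on word boundaries), and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "Think about the tallest buildings. They're built one brick at a time."
window_chunks = sliding_window_chunks(sample, chunk_size=30, chunk_overlap=10)
for chunk in window_chunks:
    print(repr(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole, or nearly whole, in at least one chunk, at the cost of storing some text twice.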

In [7]:
# Install Required libraries
! pip install langchain-chroma



In [8]:
## Step 3: Create embeddings for the document chunks
## and store them in a vector database
# Use the langchain-chroma package installed above; langchain.vectorstores is deprecated
from langchain_chroma import Chroma
vector_store_db = Chroma.from_documents(
    document_chunks,
    embedding_model,
    collection_name="speech_collection",
    persist_directory="./_data/chroma_db"
)
# Check the number of documents in the vector store
print(f"Number of documents in the vector store: {vector_store_db._collection.count()}")


Number of documents in the vector store: 17


In [9]:
## Step 4: Load the vector store from the persisted directory
vector_store_db_loaded = Chroma(    
    collection_name="speech_collection",
    embedding_function=embedding_model,
    persist_directory="./_data/chroma_db"
)



In [10]:
## Step 5: Convert the vector store to a retriever
retriever = vector_store_db_loaded.as_retriever(search_kwargs={"k": 3})

# Fetch the 3 chunks most similar to the query
docs = retriever.invoke("Progress isn’t always loud. Sometimes, it whispers.")

docs


[Document(metadata={}, page_content='Progress isn’t always loud. Sometimes, it whispers.'),
 Document(metadata={}, page_content='But those whispers? They build momentum.'),
 Document(metadata={}, page_content='yet profoundly powerful: small steps.')]
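
In a RAG pipeline, the retrieved chunks are usually merged into the prompt that goes to the LLM. The sketch below shows one way that step might look. `RetrievedDoc` is a stand-in for LangChain's `Document` so the code runs on its own, and `format_docs`, `build_prompt`, and the prompt wording are hypothetical choices rather than anything defined in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Stand-in for a LangChain Document so the sketch runs without the library."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(documents) -> str:
    """Join the text of each retrieved chunk into one context string."""
    return "\n\n".join(doc.page_content for doc in documents)


def build_prompt(question: str, documents) -> str:
    """Assemble a prompt (hypothetical wording) that grounds the LLM in the chunks."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{format_docs(documents)}\n\n"
        f"Question: {question}"
    )


retrieved = [
    RetrievedDoc("Progress isn’t always loud. Sometimes, it whispers."),
    RetrievedDoc("But those whispers? They build momentum."),
]
prompt_text = build_prompt("What do the whispers build?", retrieved)
print(prompt_text)
```

Because `format_docs` only reads `page_content`, it should work the same way on the `docs` list returned by `retriever.invoke(...)` above.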