## Data Connection

The data connection layer includes:
- document loaders
- document transformers
- embedding models
- vector stores
- retrievers

![](retireval_system.png)

### DocumentLoader
Document loaders load data from a source as `Document` objects, each of which consists of text and associated metadata.

Document loaders have a `load()` method that loads data from the configured source and returns it as a list of documents.
They may also have a `lazy_load()` method that loads documents into **memory** one at a time, only as they are needed.

Each document consists of:
1. `page_content` (the text content of the document)
2. `metadata` (associated metadata such as the source URL or title)
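
The difference between `load()` and `lazy_load()` can be sketched without LangChain at all. The `Document` dataclass and the `LineLoader` class below are hypothetical stand-ins, not the library's own classes; they only mirror the two fields and the two methods described above, under the assumption that a lazy loader behaves like a Python generator.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that emits one Document per non-empty line of a string."""

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Generator: each Document is built only when the caller asks for it.
        for line_number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield Document(
                    page_content=line.strip(),
                    metadata={"source": self.source, "line": line_number},
                )

    def load(self) -> List[Document]:
        # Eager: materialize every Document in memory at once.
        return list(self.lazy_load())


loader = LineLoader(
    "LangChain is a framework.\n\nIt connects LLMs to data.",
    source="notes.txt",  # made-up source name, used only as metadata
)
docs = loader.load()              # list holding all documents
first = next(loader.lazy_load())  # only the first document is constructed
```

With a large source, iterating over `lazy_load()` keeps only one document in memory at a time, whereas `load()` holds all of them.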

In [5]:
from typing import Optional, List, Dict, Any

from langchain_community.document_loaders import TextLoader, WikipediaLoader
from langchain_core.callbacks import Callbacks

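# Load a local text file as a list of Document objects
txt_loader = TextLoader(file_path="./tmp/langchain.txt")
txt_docs = txt_loader.load()

# Load Wikipedia pages matching the query (requires the `wikipedia` package)
wiki_loader = WikipediaLoader(query='LangChain')
wiki_docs = wiki_loader.load()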

### Retrievers
Retrievers in LangChain are components that search for and retrieve information from an index. The index is often held in a vector store backend, such as Chroma, which indexes and searches embeddings.

A few examples of retrievers:
- BM25 Retriever
- TF-IDF Retriever
- Dense Retriever
- KNN Retriever
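
BM25 and TF-IDF retrievers score documents by term overlap rather than by embeddings. The sketch below illustrates that idea with a deliberately simplified TF-IDF score in plain Python. The function names, the toy corpus, and the smoothing term are assumptions made for illustration; this is not the implementation behind LangChain's retrievers.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; real retrievers use more careful preprocessing.
    return text.lower().split()


def tfidf_rank(query: str, corpus: List[str]) -> List[Tuple[float, str]]:
    """Rank texts by the summed TF-IDF weight of the query terms they contain."""
    tokenized = [tokenize(text) for text in corpus]
    n_docs = len(corpus)

    # Document frequency: in how many texts does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for text, tokens in zip(corpus, tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(tokens)
                # "+ 1.0" is an assumed smoothing so that common terms still count.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += tf * idf
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


corpus = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "computers run programs",
]
ranked = tfidf_rank("cat", corpus)
```

The shorter text that mentions "cat" ranks first because the term makes up a larger share of its tokens, and the text without the term scores zero. A dense or KNN retriever would instead compare embedding vectors, so it can match related words that share no tokens with the query.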

In [14]:
from langchain_community.retrievers import KNNRetriever, PubMedRetriever
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from local_settings import OPENAI_API_KEY

In [9]:
# KNN Retriever
words = ["cat", "dog", "computer", "animal"]
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
knn_retriever = KNNRetriever.from_texts(texts=words, embeddings=embeddings)

result = knn_retriever.invoke('dog')
result

[Document(page_content='dog'),
 Document(page_content='animal'),
 Document(page_content='cat'),
 Document(page_content='computer')]
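
The ordering above reflects how close each word's embedding is to the embedding of the query. A minimal sketch of nearest-neighbour ranking by cosine similarity follows. The three-dimensional vectors are made up and chosen by hand so that the ranking matches the output above; they are not real OpenAI embeddings, and the helper names are hypothetical rather than part of `KNNRetriever`.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> List[str]:
    # Sort every indexed text by similarity to the query, most similar first.
    ranked = sorted(
        index,
        key=lambda text: cosine_similarity(query_vec, index[text]),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy "embeddings"; real models return far more dimensions.
toy_index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "animal": [0.7, 0.2, 0.1],
    "computer": [0.0, 0.1, 0.9],
}
neighbours = knn(toy_index["dog"], toy_index, k=4)
```

Here the query vector is simply the stored vector for "dog", so "dog" itself is the closest match, followed by the other animal-related words, with "computer" last.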

In [13]:
# specialized retrievers -> biomedical
pubmed_retriever = PubMedRetriever()
documents = pubmed_retriever.invoke('COVID')
for document in documents:
    print(document.metadata['Title'])

{'@book': 'lactmed', '@part': 'Covid_vaccines', '#text': 'COVID-19 Vaccines'}
Prescription Digital Therapeutics for Substance Use Disorder in Primary Care: Mixed Methods Evaluation of a Pilot Implementation Study.
Nourishing the Infant Gut Microbiome to Support Immune Health: Protocol of SUN (Seeding Through Feeding) Randomized Controlled Trial.


In [15]:
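# Custom retrievers
# Import the names used in the annotations here so the cell runs on its own;
# the NameError came from `Callbacks` not being defined when this cell ran.
from typing import Any, Dict, List, Optional
from langchain_core.callbacks import Callbacks

class MyRetriever(BaseRetriever):
    def _get_relevant_documents(
        self,
        query: str,
        *,
        callbacks: Callbacks = None,
        tags: Optional[List[str]] = None,
        metadata: Optional[Dict[str, Any]] = None,
        run_name: Optional[str] = None,
        **kwargs: Any,
    ) -> List[Document]:
        # Customize this method to perform any retrieval operation you need,
        # such as querying a database or searching through indexed documents.
        # Placeholder: return an empty list so the declared return type holds.
        return []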