# Data Connections Pipeline
source -> load -> transform -> embed -> store (and then later retrieve)

# Loading Libraries


In [2]:
# !pip install chromadb
# !pip install sentence_transformers

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
# from langchain_openai import OpenAIEmbeddings  # could be used instead of HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFaceEmbeddings
from dotenv import load_dotenv
load_dotenv()

## 1. Document Loader
We load the documents here. To load more files, append their documents to the list: the *documents* object below is simply a **Python list**.

In [40]:
loader = TextLoader("Sample.txt")
documents = loader.load()
len(documents), type(documents)

## 2. Document Transformer
We split the documents into chunks here.

In [3]:
# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1_000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print("Number of chunks: ", len(texts))
print("Example of one document (chunk): ", texts[0])

# Add more metadata to the document (can be used to combine vector search with keyword search)
author_name = "Bruno"
for text in texts:
    text.metadata['author'] = author_name
    text.metadata['starts_with_india'] = text.page_content.lower().startswith("india")
print("Number of documents (chunks): ", len(texts))
print("Example of one document (chunk): ", texts[0])
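To make *chunk_size* and *chunk_overlap* concrete, here is a minimal sketch of a fixed-window chunker in plain Python. It is a simplification and not how *CharacterTextSplitter* works internally (that class splits on a separator and then merges the pieces); the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration, not a LangChain API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


no_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'defg', 'ghij']
```

With an overlap of 1, the last character of each chunk is repeated at the start of the next one, which helps when a sentence straddles a chunk boundary.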

## 3. Data Embedding
We will only instantiate the embedder here, so that we can later use it for our Chroma vector database.

In [4]:
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

## 4. Vector Database
Here, we will use ChromaDB, but other vector databases can be used as well.

In [5]:
db = Chroma.from_documents(texts, embedding_model)
print(len(db), "(It should be the same as the number of chunks from above)")
# db._collection.get(include=["embeddings"])  # To have a look at the embeddings created

## 5. Retrieval
A retriever wraps the vector database and returns the *k* most similar documents for a query; *k* is set through *search_kwargs*. The similarity function (*cosine similarity*, *Euclidean distance*, ...) is configured on the Chroma collection when it is created, not on the retriever.

- *Cosine similarity* compares only the direction of two vectors, which makes it a common default for text embeddings.
- *Euclidean distance* also depends on the length of the vectors. For unit-length embeddings, both metrics produce the same ranking.
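The difference between the two metrics can be sketched in plain Python. The vectors below are made up for illustration (real embeddings have hundreds of dimensions), and the helper names are hypothetical. The sketch shows that the two metrics can disagree on raw vectors but agree once the vectors are scaled to unit length.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length."""
    length = math.hypot(*v)
    return [x / length for x in v]


def best_by_cosine(query, candidates):
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


def best_by_euclidean(query, candidates):
    # math.dist is the Euclidean distance (lower means more similar)
    return min(candidates, key=lambda name: math.dist(query, candidates[name]))


# Made-up 2-dimensional vectors, purely for illustration
query = [1.0, 0.0]
candidates = {"a": [10.0, 1.0], "b": [1.0, 0.5]}

raw_cos = best_by_cosine(query, candidates)
raw_euc = best_by_euclidean(query, candidates)

unit_candidates = {name: normalize(v) for name, v in candidates.items()}
unit_cos = best_by_cosine(query, unit_candidates)
unit_euc = best_by_euclidean(query, unit_candidates)

print("raw vectors: ", raw_cos, raw_euc)    # a b
print("unit vectors:", unit_cos, unit_euc)  # a a
```

Vector "a" points in almost the same direction as the query but is much longer, so only the Euclidean distance penalizes it.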

In [36]:
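# Let's first specify a simple retriever that returns only the best match.
# Note: as_retriever() takes search_type ("similarity", "mmr", ...) and search_kwargs.
# The distance metric belongs to the Chroma collection and can be chosen at creation time,
# e.g. Chroma.from_documents(..., collection_metadata={"hnsw:space": "cosine"}).
retriever_simple = db.as_retriever(search_type="similarity", search_kwargs={"k": 1})

# Now, let's specify a more complex retriever that also filters on the metadata.
# With a single filter criterion, a plain dict is enough:
# filter_criteria = {"starts_with_india": True}
# With multiple filter criteria, combine them with "$and":
filter_criteria = {
    "$and": [
        {"starts_with_india": True},
        {"author": "Bruno"}
    ]
}
retriever_complex = db.as_retriever(search_type="similarity",
                                    search_kwargs={"k": 2, "filter": filter_criteria})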
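The effect of the metadata filter can be sketched without a database. The snippet below is a toy re-implementation of the equality, `$and` and `$or` semantics, not Chroma's own code; the function `matches`, the sample metadata and the second author are all hypothetical.

```python
from typing import Any


def matches(metadata: dict[str, Any], where: dict[str, Any]) -> bool:
    """Return True if metadata satisfies the filter (toy version of the semantics)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif key not in metadata or metadata[key] != condition:
            return False  # plain key: the value must be equal
    return True


# Made-up metadata for three chunks
chunk_metadata = [
    {"author": "Bruno", "starts_with_india": True},
    {"author": "Bruno", "starts_with_india": False},
    {"author": "Alice", "starts_with_india": True},  # hypothetical second author
]

single_filter = {"starts_with_india": True}
multiple_filter = {"$and": [{"starts_with_india": True}, {"author": "Bruno"}]}

kept_single = [i for i, m in enumerate(chunk_metadata) if matches(m, single_filter)]
kept_multiple = [i for i, m in enumerate(chunk_metadata) if matches(m, multiple_filter)]

print(kept_single)    # [0, 2]
print(kept_multiple)  # [0]
```

Only the chunks that pass the filter are candidates for the similarity search, so a retriever with *k* = 2 may return fewer than two documents when the filter is strict.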

# Questions to test the DB
Invoking a retriever returns the *k* most similar documents, where *k* is the number of documents set in *search_kwargs* above. This only searches the vector database; no LLM is needed.
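What such a search does can be sketched in a few lines of plain Python. This is a toy stand-in and makes several assumptions: word counts replace the real embedding model, a plain list replaces Chroma, and the chunks are made up rather than taken from *Sample.txt*. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Made-up chunks, purely for illustration
chunks = [
    "New Delhi is the capital of India.",
    "The currency of India is the rupee.",
    "Chroma stores embedding vectors.",
]

top_capital = retrieve("What is the capital of India?", chunks, k=1)
top_currency = retrieve("What is the currency of India?", chunks, k=2)
print(top_capital)
print(top_currency)
```

The query is embedded the same way as the chunks and the chunks are ranked by similarity; nothing in this step generates text, which is why no LLM is involved.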

In [37]:
retriever_simple.invoke("What is the capital of India?")

In [38]:
retriever_simple.invoke("What is the currency of India?")

In [39]:
retriever_complex.invoke("What is the currency of India?")