<a href="https://colab.research.google.com/drive/1aaU4YZC-fswSImo1fV-w67FXPQg5Ictm?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

### 📊 What are Vector Embeddings?

A vector embedding represents a word, phrase, or longer text as a numerical vector in a multi-dimensional space. This helps a model work with language by capturing meanings and relationships between words.

![Vector Embedding](https://qdrant.tech/articles_data/what-are-embeddings/BERT-model.jpg)
Source: [Qdrant Blog](https://qdrant.tech/articles/what-are-embeddings/)



#### 📌 Key Points

1. **Representation:** Each word or token is a vector of real numbers.
2. **Dimensionality:** These vectors usually have hundreds or thousands of dimensions. Together, the dimensions encode learned features of the word (e.g., meaning, context, or usage), although a single dimension is rarely interpretable on its own.
3. **Semantic Meaning:** Similar words are closer together in this space.


#### 🔄 Example: Animal Relationship Analogy 🐶❤️🐱

1. Consider a simplified 3D vector space with these word embeddings:

  * cat: [2, 4, -1]
  * kitten: [1.5, 3, -0.5]
  * dog: [-2, 3, 1]
  * puppy: [-1.5, 2, 0.5]

2. **Explanation**

  * **Semantic Similarity:**
    * "**Cat**" and "**kitten**," as well as "**dog**" and "**puppy**," have similar vectors, reflecting their semantic closeness.

  * **Analogy Relationship:** For "cat is to kitten as dog is to what?"

    * **We can solve this by vector arithmetic**

      [dog] - [cat] + [kitten] ≈ [puppy]
    
    * **Let's calculate**

      (−2, 3, 1) **−** (2, 4, −1) **+** (1.5, 3, −0.5) **=** (−2.5, 2, 1.5), which points in nearly the same direction as "puppy" (−1.5, 2, 0.5); their cosine similarity is about 0.94.

    * 🤔💡 **Understanding the Vector Arithmetic Above**

      In the analogy "cat is to kitten as dog is to what?" we want to find a word that fits the same relationship.

      Here's a simpler way to think about the vector arithmetic:

      * **Think of Relationships**:

        * "Cat" is a more general term, while "kitten" is a specific younger version of "cat."

        * Similarly, "dog" is the general term, and we want to find the younger version of "dog."

      * **Why the Math?**

        When we say [dog]−[cat]+[kitten], we are using the relationship between the words:
          * Subtracting the vector for "cat" from "dog" helps us find out what makes "dog" different from "cat." This captures the essence of being a dog instead of a cat.
        
          * Adding the vector for "kitten" applies that difference to the young cat, which moves us to the young dog, just as "kitten" is the younger version of "cat."

      * **Result:**

        The result of this calculation gives us a vector that points to a word that shares a similar relationship with "dog" as "kitten" does with "cat." In this case, that word is "puppy."

      So, in simple terms, we're using vector math to navigate the relationships between these words and find the one that fits the analogy.

  * [**Cosine Similarity:**](https://www.youtube.com/watch?v=e9U0QAFbfLI) Measures similarity between words, with higher values indicating similar meanings.

  * **Capturing Relationships:** Embeddings encode relationships such as animal type, age, and size.

  * **Contextual Usage:** In advanced embedding models, the embedding of a word varies with its context, so the same word can capture different meanings.

This shows how vector embeddings can encode complex semantic relationships, enabling mathematical operations to reveal meaningful linguistic patterns, foundational for understanding and generating text in Large Language Models.
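
The toy example above can be checked with a few lines of NumPy. This is a minimal sketch that uses the four hand-made 3D vectors from the example, not output from a real embedding model; the helper names `cosine_sim` and `solve_analogy` are illustrative.

```python
import numpy as np

# Hand-made 3D vectors from the example above (not produced by a real model)
vectors = {
    "cat": np.array([2.0, 4.0, -1.0]),
    "kitten": np.array([1.5, 3.0, -0.5]),
    "dog": np.array([-2.0, 3.0, 1.0]),
    "puppy": np.array([-1.5, 2.0, 0.5]),
}

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' by ranking words against vec(c) - vec(a) + vec(b)."""
    target = vectors[c] - vectors[a] + vectors[b]
    # Common practice in analogy tasks: the three query words cannot be the answer
    candidates = [w for w in vectors if w not in (a, b, c)]
    best = max(candidates, key=lambda w: cosine_sim(target, vectors[w]))
    return best, target

answer, target = solve_analogy("cat", "kitten", "dog", vectors)
print("target vector:", target)
print("answer:", answer)

# Related words point in similar directions; less related words do not
print("cat vs kitten:", round(cosine_sim(vectors["cat"], vectors["kitten"]), 3))
print("cat vs dog:   ", round(cosine_sim(vectors["cat"], vectors["dog"]), 3))
```

With only four words in the vocabulary, excluding the three query words leaves a single candidate, so the ranking step only becomes meaningful with a larger vocabulary.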

[Qdrant Blog](https://qdrant.tech/articles/what-are-embeddings/)


--------------------------------------------------------------------------


### 📚 What are Embedding Models?

Embedding models are machine learning algorithms that convert text, images, or other data into dense vector representations in a high-dimensional space. These representations capture semantic meaning and relationships between items.

#### 📌 Key points

1. Transform data into numerical vectors
2. Preserve semantic similarity in vector space
3. Useful for various downstream tasks

#### 🔄 Examples

1. **Text embeddings**

  * OpenAI's Embedding Model
  * Google's Embedding Model
  * Huggingface open-source Embedding Models

2. **Image embeddings**

  * OpenAI's CLIP (Contrastive Language-Image Pre-Training)

3. [**Multi-modal embeddings**](https://weaviate.io/blog/multimodal-models)

  * Amazon Titan Multimodal
  * Google's Multimodal Embedding API
  * Microsoft's Azure Multimodal Embedding API
  * multi2vec-clip

These models are used in search, recommendation systems, clustering, and as input for other machine learning tasks.


--------------------------------------------------------------------------


### 🏆 Where to Find the "Right" Embedding Model Based on Rankings?

1. The Massive Text Embedding Benchmark (MTEB) Leaderboard on Hugging Face is an excellent resource for finding the latest proprietary and open-source text embedding models, along with their performance statistics for tasks like **retrieval** and **summarization**.

  Treat benchmark results with care: they are self-reported, and the benchmark datasets are publicly available, so models may have been exposed to them during training. It's advisable to evaluate popular embedding models on your own data and pick the best fit.

  A good general strategy is to start with the most popular embedding models. You can find the top 10 models, ranked by anonymous voting, at the [MTEB Arena Leaderboard](https://huggingface.co/spaces/mteb/arena).

2. Key metrics to focus on include the **"average"** and **"retrieval average"** scores, as well as the model's **Size** (manageable on consumer hardware), **Max Tokens** (ideally around 100-200 tokens per embedded chunk), and **Embedding Dimensions** (ideally up to about 512 dimensions, balancing detail capture and operational efficiency).

3. The leaderboard features a mix of **Small**, **Large**, **Proprietary**, and **Open-Source** models for comparison.

[MTEB Leaderboard Link](https://huggingface.co/spaces/mteb/leaderboard)

[MTEB Arena Leaderboard: Retrieval](https://huggingface.co/spaces/mteb/arena)


# Install required libraries

In [None]:
!pip install -qU \
     Sentence-transformers==3.4.1 \
     langchain==0.3.19 \
     langchain-openai==0.3.7 \
     langchain-google-genai==2.0.11 \
     langchain-huggingface==0.1.2 \
     langchain-chroma==0.2.2 \
     langchain-community==0.3.18 \
     einops==0.8.1

### Import related libraries

In [None]:
import os
import getpass

from langchain_google_genai import (
    ChatGoogleGenerativeAI,
    HarmBlockThreshold,
    HarmCategory,
)
from langchain_chroma import Chroma  # provided by the langchain-chroma package installed above
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from sklearn.metrics.pairwise import cosine_similarity

### Provide OpenAI API Key.

Needed only if you want to use OpenAI embeddings. You can create an OpenAI API key using the following link:

- [OpenAI API Key](https://platform.openai.com/settings/profile?tab=api-keys)

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()



### Provide Google API Key.

It is used for both the Gemini Pro LLM and the Google embedding model. You can create a Google API key using the following links:

- [Google Gemini-Pro API Key](https://console.cloud.google.com/apis/credentials)

- [YouTube Video explaining Google API Key](https://www.youtube.com/watch?v=ZHX7zxvDfoc)



In [None]:
os.environ["GOOGLE_API_KEY"] = getpass.getpass()



### Provide Huggingface API Key.

Needed only if you want to use Hugging Face embedding models. You can create a Hugging Face API key using the following link:

- [Huggingface API Key](https://huggingface.co/settings/tokens)




In [None]:
os.environ["HF_TOKEN"] = getpass.getpass()



### Embedding Model Decision Flow

<br>

![MTEB Arena](https://raw.githubusercontent.com/genieincodebottle/generative-ai/main/images/embedding.png)

### 💰 Paid: OpenAI Embedding Model

1. 🤖 **Models:** text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
2. 👍 **Pros:**
  * High quality embeddings
  * Multiple model options for different needs
  * Easy integration with OpenAI API
  * Batch processing available for cost savings
  * Regular updates and improvements
  * Suitable for various NLP tasks
3. 👎 **Cons:**
  * Not free - usage costs can add up
  * Closed-source model
  * Requires API key and internet connection
  * Limited control over model architecture
  * Potential privacy concerns with data handling
  * Usage subject to OpenAI's terms and policies
4. 🔗 [Document Link](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)


In [None]:
from langchain_openai import OpenAIEmbeddings

def get_openai_embeddings():
    openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    return openai_embeddings

### 🔓 Free: Google Gemini's Embedding Model

1. 🤖 **Models:** text-embedding-004
2. 👍 **Pros:**
  * Free to use
  * Supports 100+ languages
  * Input limit of 2,048 tokens per text
  * 768-dimensional embeddings
  * High-quality embeddings
  * Easy integration with Google Cloud
3. 👎 **Cons:**
  * Potential usage limits or quotas
  * Closed-source model
  * Possible vendor lock-in to Google ecosystem
  * Limited customization options
  * Dependency on Google's infrastructure
  * Potential for unexpected changes to service terms
4. 🔗 [Document Link](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api#generative-ai-get-text-embedding-python_vertex_ai_sdk)


In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

def get_google_embeddings():
    gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
    return gemini_embeddings

### 🔓 Free: Huggingface Open-Source Embedding Models

1. 🤖 **Models:** gte-large-en-v1.5, bge-multilingual-gemma2, snowflake-arctic-embed-l, nomic-embed-text-v1.5, e5-mistral-7b-instruct, and more. See the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for the top open-source embedding models for your use case.
2. 👍 **Pros:**
  * Open-source and freely available
  * Flexible and customizable
  * Community support and updates
  * Integration with Hugging Face ecosystem
  * Suitable for various NLP tasks
  * Can be fine-tuned on specific domains
3. 👎 **Cons:**
  * Potentially less optimized than commercial alternatives
  * Performance can vary depending on specific use case
  * Limited official support compared to commercial options
  * May need significant computational resources for training/fine-tuning
  * Ongoing maintenance and updates depend on community involvement
4. 🔗 **MTEB Leaderboard:**

  * [Huggingface Embedding Models Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)

  * [Massive Text Embedding Benchmark (MTEB) Arena Leaderboard](https://huggingface.co/spaces/mteb/arena)

### Huggingface based Nomic AI Embedding model

You can use any other open-source Hugging Face embedding model that suits your requirements and system constraints. You can get the model name from the MTEB leaderboard.

Popular Models

1. nomic-ai/nomic-embed-text-v1.5
2. nomic-ai/nomic-embed-text-v1
3. sentence-transformers/all-MiniLM-L12-v2
4. sentence-transformers/all-MiniLM-L6-v2

.....



In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

def get_huggingface_embeddings():

    # Change model_name to your chosen Hugging Face embedding model
    nomic_embeddings = HuggingFaceEmbeddings(model_name="nomic-ai/nomic-embed-text-v1.5", model_kwargs = {'trust_remote_code': True})
    return nomic_embeddings

# Let's implement Basic RAG using the following components

1. **Chroma:** Used as the vector store for efficient similarity search.
2. **Embedding Models:** Whichever of the OpenAI, Google, or Hugging Face embedding models you choose.
3. **ChatGoogleGenerativeAI:** The Gemini Pro model used for generating responses.
4. **cosine_similarity:** Used for computing similarities in the evaluation step.

In [None]:
# Helper function for printing docs
def pretty_print_docs(docs):
    # Iterate through each document and format the output
    for i, d in enumerate(docs):
        print(f"{'-' * 50}\nDocument {i + 1}:")
        print(f"Content:\n{d.page_content}\n")
        print("Metadata:")
        for key, value in d.metadata.items():
            print(f"  {key}: {value}")
    print(f"{'-' * 50}")  # Final separator for clarity

# Example usage (assuming `docs` is a list of Document objects):
# pretty_print_docs(docs)

# Step 1: Load and preprocess data

In [None]:
def load_and_process_data(url):
    # Load data from web
    loader = WebBaseLoader(url)
    data = loader.load()

    # Split text into chunks (Experiment with Chunk Size and Chunk Overlap to get optimal chunking)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = text_splitter.split_documents(data)

    # Add unique IDs to each text chunk
    for idx, chunk in enumerate(chunks):
        chunk.metadata["id"] = idx

    return chunks
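
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width chunker. It is only a sketch: `RecursiveCharacterTextSplitter` additionally prefers to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks

demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role overlap plays: text near a boundary stays visible to both neighbouring chunks.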

# Step 2: Initialize Embedding Model

In [None]:
def get_embedding_model(embedding_model="huggingface"):

    if embedding_model == "openai":
        embeddings = get_openai_embeddings()
    elif embedding_model == "google":
        embeddings = get_google_embeddings()
    elif embedding_model == "huggingface":
        embeddings = get_huggingface_embeddings()
    else:
        embeddings = get_huggingface_embeddings()
    return embeddings

# Step 3: Create Chroma vector store

In [None]:
def create_vector_stores(chunks, embeddings):
    # Create vector stores using the specified embedding model
    vectorstore = Chroma.from_documents(chunks, embeddings)
    return vectorstore

# Step 4: Implement Basic RAG

1. **Retrieval:** We use the Chroma vector store's similarity search to retrieve the top 5 most relevant documents for the query.
2. **Context Formation:** We combine the retrieved documents into a single context string.
3. **Response Generation:** Using the Gemini Pro model, we generate a final response based on the retrieved context and the query.

In [None]:
def basic_rag(query, vectorstore, llm):
    # Retrieve relevant documents
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n\n".join([doc.page_content for doc in docs])

    # Generate response
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    response = llm.invoke(prompt)

    return {
        "query": query,
        "final_answer": response.content,
        "retrieval_method": "Basic similarity search",
        "context": context
    }
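
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are nearest to it. The sketch below shows that idea with brute-force cosine similarity over toy 2D vectors. It is an illustration only: Chroma uses an approximate nearest-neighbour index, and its configured distance metric may not be cosine. The name `top_k_by_cosine` and the vectors are made up for the example.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k document vectors closest to the query."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity of the query against every document vector at once
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy "document embeddings" (assumed values, not from a real model)
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]

hits = top_k_by_cosine(query_vec, doc_vecs, k=2)
print([i for i, _ in hits])  # [0, 2]
```

Brute force scans every vector, which is fine for a few thousand chunks; vector stores exist to make the same lookup fast at larger scale.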

# Step 5: RAG Evaluation

1. We compute embeddings for the query, response, and context.
2. We calculate cosine similarities between query-response and response-context.
3. We compute an overall relevance score as the average of these similarities.


In [None]:
def evaluate_response(query, embeddings, response, context):
    # Compute embeddings
    query_embedding = embeddings.embed_query(query)
    response_embedding = embeddings.embed_query(response)
    context_embedding = embeddings.embed_query(context)

    # Compute cosine similarities
    query_response_similarity = cosine_similarity([query_embedding], [response_embedding])[0][0]
    response_context_similarity = cosine_similarity([response_embedding], [context_embedding])[0][0]

    # Compute relevance score (average of the two similarities)
    relevance_score = (query_response_similarity + response_context_similarity) / 2

    return {
        "query_response_similarity": query_response_similarity,
        "response_context_similarity": response_context_similarity,
        "relevance_score": relevance_score
    }
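
The scoring logic can be tried without any API key by swapping in a stand-in embedder. The sketch below assumes a hypothetical `ToyEmbeddings` class that counts vocabulary words and exposes `embed_query` like the LangChain embedding objects; the vocabulary and sentences are invented for illustration, and `relevance_score` re-implements the same averaging with NumPy.

```python
import numpy as np

class ToyEmbeddings:
    """Hypothetical stand-in: embeds text as word counts over a fixed vocabulary."""
    def __init__(self, vocab):
        self.vocab = vocab

    def embed_query(self, text):
        words = text.lower().split()
        return [float(words.count(w)) for w in self.vocab]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    a, b = np.asarray(a), np.asarray(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def relevance_score(query, response, context, embeddings):
    """Average of query-response and response-context cosine similarity."""
    q = embeddings.embed_query(query)
    r = embeddings.embed_query(response)
    c = embeddings.embed_query(context)
    qr, rc = cosine(q, r), cosine(r, c)
    return {
        "query_response_similarity": qr,
        "response_context_similarity": rc,
        "relevance_score": (qr + rc) / 2,
    }

toy = ToyEmbeddings(vocab=["ai", "health", "diagnosis", "music"])
scores = relevance_score(
    query="ai health",
    response="ai health diagnosis",
    context="ai diagnosis health health",
    embeddings=toy,
)
print({name: round(value, 4) for name, value in scores.items()})
```

Embedding similarity is a rough proxy: a high score means the response is on topic and resembles the context, not that it is factually correct.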

# Step 6: Load and process data and create chunks to store in the Chroma Vector Store

1. [Langchain Chunking Strategy](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/)
2. [Langchain Vectorstore](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/)


In [None]:
# Initialize the gemini-pro language model (adjust temperature and other parameters as needed)
llm = ChatGoogleGenerativeAI(
    model="gemini-pro",
    temperature=0.3,
    safety_settings={
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)

# Load Documents from a web URL
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
chunks = load_and_process_data(url)

# Get Embedding Model.
# embedding_model value options (openai, google, huggingface)
embeddings = get_embedding_model(embedding_model="google")

# Create vector stores using the specified embedding model
vectorstore = create_vector_stores(chunks, embeddings)

# Step 7: RAG in Action to test Embedding Model

## This implementation demonstrates the key parts of Basic RAG with evaluation:

1. Chunking and embedding of source documents
2. Retrieval of relevant documents based on query similarity
3. Generation of a response using the retrieved context
4. Evaluation of the response based on its similarity to both the query and the context

In [None]:
query = "What are the main applications of artificial intelligence in healthcare?"
result = basic_rag(query, vectorstore, llm)


print(f"Query: {result['query']}")
print("=========================")
print(f"Final Answer: {result['final_answer']}")
print("=========================")
print(f"Retrieval Method: {result['retrieval_method']}")


# Evaluate the response
evaluation = evaluate_response(query, embeddings, result["final_answer"], result["context"])
print("\nEvaluation:")
print(f"Query-Response Similarity: {evaluation['query_response_similarity']:.4f}")
print(f"Response-Context Similarity: {evaluation['response_context_similarity']:.4f}")
print(f"Overall Relevance Score: {evaluation['relevance_score']:.4f}")

Query: What are the main applications of artificial intelligence in healthcare?
Final Answer: The main applications of artificial intelligence in healthcare include:
- Increasing patient care and quality of life
- Processing and integrating big data for medical research
- Overcoming discrepancies in funding allocated to different fields of research
- Deepening the understanding of biomedically relevant pathways
Retrieval Method: Basic similarity search

Evaluation:
Query-Response Similarity: 0.8314
Response-Context Similarity: 0.7585
Overall Relevance Score: 0.7949


# Retrieval Demonstration:

We showcase the retrieval process by displaying the retrieved documents for a sample query.

In [None]:
# Demonstrate retrieval
demo_query = "Explain the concept of machine learning and its relationship to AI"
print(f"\nDemonstration Query: {demo_query}")

# Retrieve documents
docs = vectorstore.similarity_search(demo_query, k=5)
print("\nRetrieved Documents:")
pretty_print_docs(docs)


Demonstration Query: Explain the concept of machine learning and its relationship to AI

Retrieved Documents:
--------------------------------------------------
Document 1:
Content:
There are several kinds of machine learning. Unsupervised learning analyzes a stream of data and finds patterns and makes predictions without any other guidance.[49] Supervised learning requires a human to label the input data first, and comes in two main varieties: classification (where the program must learn to predict what category the input belongs in) and regression (where the program must deduce a numeric function based on numeric input).[50]

Metadata:
  id: 40
  language: en
  source: https://en.wikipedia.org/wiki/Artificial_intelligence
  title: Artificial intelligence - Wikipedia
--------------------------------------------------
Document 2:
Content:
Learning
Machine learning is the study of programs that can improve their performance on a given task automatically.[46] It has been a part of AI fr