# 📐 Vector Representations of Text

## 🔎 What are Vector Representations?
Vector representations of text convert **words, sentences, or documents** into numerical vectors in a high-dimensional space, capturing their **semantic meaning**.  

These vectors enable machines to understand and compare text for tasks like:  
- Similarity search  
- Classification  
- Text generation  

---

## 🤖 Why for RAG?
In **Retrieval-Augmented Generation (RAG)**, embeddings are used to:  
- **Retrieve**: Compare query and document embeddings to find relevant chunks.  
- **Generate**: Pass the retrieved chunks to the generative model as context, so its responses are grounded in the source documents.  
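
The retrieval step can be sketched in plain Python. This is a minimal illustration, not a production retriever: the 3-dimensional vectors and chunk names are hypothetical stand-ins for real model output, and `retrieve` is a helper invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, top_k=2):
    """Return (chunk_id, score) pairs for the top_k most similar chunks."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings; a real model would produce hundreds of dimensions.
chunk_vecs = {
    "screen_repair_guide": [0.9, 0.1, 0.0],
    "billing_faq": [0.0, 0.2, 0.9],
    "display_troubleshooting": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

top_chunks = retrieve(query_vec, chunk_vecs, top_k=2)
for chunk_id, score in top_chunks:
    print(f"{chunk_id}: {score:.3f}")
```

The text of the returned chunks, not their vectors, is what gets placed in the prompt for the generation step.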

---

## ⚙️ Key Properties
- **Semantic Similarity**  
  Similar texts (e.g., *“broken screen”* and *“display issue”*) have similar vectors.  

- **Dimensionality**  
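  Dense vectors commonly have **a few hundred to a few thousand dimensions** (e.g., 300 for the GoogleNews Word2Vec vectors, 384 for `all-MiniLM-L6-v2`, 768 for BERT-base), depending on the model.  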

- **Dense vs. Sparse**  
  - **Dense embeddings** (e.g., Word2Vec, BERT) capture rich semantics in fewer dimensions.  
  - **Sparse representations** (e.g., TF-IDF) rely on large, sparse vectors with limited semantic richness.  
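
To see why sparse representations miss paraphrases, here is a small sketch of TF-IDF written from scratch. It uses the plain `tf * log(N / df)` weighting as an assumption; libraries such as scikit-learn apply smoothed variants, so exact numbers will differ. The three-document corpus is made up for illustration.

```python
import math
from collections import Counter

docs = ["broken screen", "display issue", "broken charger"]
tokenized = [doc.split() for doc in docs]

# One vector position per vocabulary term; real vocabularies have many thousands.
vocab = sorted({term for tokens in tokenized for term in tokens})
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tfidf_vector(tokens):
    """Term frequency times log inverse document frequency, one entry per vocab term."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = [tfidf_vector(tokens) for tokens in tokenized]

# No shared terms, so the similarity is exactly zero despite the related meaning.
paraphrase_score = cosine_similarity(vectors[0], vectors[1])
# One shared term ("broken"), so the similarity is positive despite different problems.
overlap_score = cosine_similarity(vectors[0], vectors[2])

print(f"'broken screen' vs 'display issue':  {paraphrase_score:.3f}")
print(f"'broken screen' vs 'broken charger': {overlap_score:.3f}")
```

A dense model is expected to score the first pair higher than the second, which is the behaviour retrieval over support tickets usually needs.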


### Example: Converting a text to embeddings using sentence transformers

In [2]:
cleaned_text = "hello i have an issue with product123 the screen is broken please help"

In [3]:
from sentence_transformers import SentenceTransformer
import logging

# Set up logging
logging.basicConfig(
    filename='embeddings.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def get_vector_representation(text: str, model_name: str = 'all-MiniLM-L6-v2') -> list:
    """
    Convert text to a vector representation using SentenceTransformers.
    
    Args:
        text (str): Input text.
        model_name (str): Name of the SentenceTransformers model.
    
    Returns:
        list: Vector representation.
    """
    try:
        model = SentenceTransformer(model_name)
        embedding = model.encode([text], show_progress_bar=False)[0]
        logging.info(f"Generated vector for text: {text[:50]}...")
        return embedding.tolist()
    except Exception as e:
        logging.error(f"Vector representation error: {str(e)}")
        return []

# Get vector representation
vector = get_vector_representation(cleaned_text)
print(f"\nVector Representation (first 5 dimensions): {vector[:5]}")
print(f"Vector Length: {len(vector)}")




Vector Representation (first 5 dimensions): [-0.06879273802042007, -0.001389497541822493, 0.029840512201189995, -0.1683955192565918, -0.010793237015604973]
Vector Length: 384



### 🔎 Explanation
- The **all-MiniLM-L6-v2** model generates a **384-dimensional dense vector**.  
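- Individual dimensions are not directly interpretable; meaning is encoded in the vector as a whole, so texts are compared by the angle or distance between their vectors (e.g., cosine similarity).  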

---

### 🤖 RAG Considerations
- ✅ Use **dense embeddings** for retrieval to capture semantic relationships.  
- ✅ Ensure **query and document embeddings** use the **same model** for consistency.  
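
One way to enforce the same-model rule is to store the model name and dimension next to the vectors and check every query against them. The sketch below is a hypothetical in-memory `VectorIndex` written for this example, not the API of any vector database. Note that matching dimensions alone do not prove two vectors share a space, which is why the model name is checked too.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which model produced its vectors."""
    model_name: str
    dimension: int
    vectors: dict = field(default_factory=dict)

    def _check_dimension(self, vector):
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )

    def add(self, doc_id, vector):
        self._check_dimension(vector)
        self.vectors[doc_id] = vector

    def validate_query(self, model_name, vector):
        """Raise ValueError if the query was embedded with a different model."""
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, query uses {model_name!r}"
            )
        self._check_dimension(vector)

# Placeholder zero vectors stand in for real embeddings.
index = VectorIndex(model_name="all-MiniLM-L6-v2", dimension=384)
index.add("ticket-1", [0.0] * 384)

try:
    # distiluse-base-multilingual-cased-v1 produces 512-dimensional vectors.
    index.validate_query("distiluse-base-multilingual-cased-v1", [0.0] * 512)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(f"Rejected query: {err}")
```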


# 📝 Word Embeddings (Word2Vec, GloVe)

## 🔎 What are Word Embeddings?
Word embeddings map **individual words** to fixed-size vectors, capturing their **semantic** and **syntactic** properties.  
Unlike sentence or document embeddings, they focus on **word-level meaning**.  

---

## ⚙️ Common Models
- **Word2Vec**  
  - Trains a shallow neural network to predict words from their neighbours in a large corpus.  
  - Two main architectures: **CBOW (Continuous Bag of Words)** and **Skip-gram**.  

- **GloVe (Global Vectors)**  
  - Fits vectors to **word co-occurrence statistics** aggregated over the whole corpus.  
  - Emphasizes **global context**: it learns from corpus-wide counts rather than from one local window at a time.  
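
The difference between the two inputs can be sketched as follows: Skip-gram trains on individual (center, context) pairs, while GloVe starts from counts of those pairs summed over the corpus. This is only the data-preparation step under simplifying assumptions; real Word2Vec adds tricks such as negative sampling, and GloVe weights each count by the distance between the words. The two-sentence corpus and helper names are hypothetical.

```python
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cooccurrence_counts(corpus, window=2):
    """Corpus-wide counts of how often each (center, context) pair occurs."""
    counts = Counter()
    for tokens in corpus:
        counts.update(skipgram_pairs(tokens, window))
    return counts

corpus = [
    "the screen is broken".split(),
    "the screen is cracked".split(),
]

pairs = skipgram_pairs(corpus[0], window=1)
counts = cooccurrence_counts(corpus, window=1)

print(pairs)
print(counts[("screen", "is")], counts[("is", "broken")])
```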

---

## 🤖 Why for RAG?
Word embeddings are less common in **modern RAG pipelines** (which prefer sentence/document embeddings).  
However, they can still be useful for:  
- 🔍 **Fine-grained analysis** (e.g., matching specific terms in queries).  
- 🛠️ **Custom embeddings** for domain-specific vocabulary (e.g., *“product123”*).  


In [6]:
!pip install gensim





In [None]:
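from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize
import nltk
import numpy as np

# word_tokenize needs NLTK's Punkt tokenizer data; recent NLTK releases look for
# "punkt_tab", older ones for "punkt". Requesting both covers either case.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)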

# Load pretrained Word2Vec (GoogleNews model, 300-dim)
# Download: https://code.google.com/archive/p/word2vec/
# file: GoogleNews-vectors-negative300.bin.gz
word2vec = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

# Sample text
cleaned_text = "hello i have an issue with product123 the screen is broken please help"

# Tokenize
tokens = word_tokenize(cleaned_text.lower())

# Get embeddings for each word
embeddings = []
for token in tokens:
    if token in word2vec:  # check if word exists in pretrained vocab
        embeddings.append(word2vec[token])
    else:
        print(f"'{token}' not in vocabulary")  # e.g., product123 may not exist

# Option 1: Keep per-word embeddings
print("Word embeddings shape:", [vec.shape for vec in embeddings])

# Option 2: Create sentence-level embedding by averaging
if embeddings:
    sentence_embedding = np.mean(embeddings, axis=0)
    print("Sentence embedding shape:", sentence_embedding.shape)


## Sentence Embeddings (Sentence-BERT, Universal Sentence Encoder)

### What are Sentence Embeddings?  
Sentence embeddings map entire **sentences** to fixed-size vectors, capturing their **overall meaning**.  
Unlike word embeddings, they consider **context and word order**.
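
The word-order point is easy to demonstrate: averaging word vectors gives the same result for any permutation of the same words. The 2-dimensional `toy_vectors` below are hypothetical values chosen for illustration, not output from Word2Vec.

```python
# Hypothetical word vectors, small enough to check by hand.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def average_embedding(sentence):
    """Sentence vector as the element-wise mean of its word vectors."""
    vecs = [toy_vectors[token] for token in sentence.split()]
    dim = len(vecs[0])
    return [sum(vec[d] for vec in vecs) / len(vecs) for d in range(dim)]

dog_bites_man = average_embedding("dog bites man")
man_bites_dog = average_embedding("man bites dog")

# The two sentences mean different things but get identical averaged vectors.
print(dog_bites_man, man_bites_dog, dog_bites_man == man_bites_dog)
```

A sentence encoder processes the token sequence as a whole, so it is not bound to give these two sentences the same vector.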

---

### Popular Models
- **Sentence-BERT (SBERT):**  
  Fine-tunes BERT-style encoders in a siamese setup so that pooled sentence vectors can be compared directly with cosine similarity, which makes it well suited to similarity search.  

- **Universal Sentence Encoder (USE):**  
  A TensorFlow-based model for general-purpose sentence embeddings.  

---

### Why for RAG?  
Sentence embeddings are ideal for:
- **Retrieval:** Comparing query and document chunk similarity.  
- **Clustering:** Grouping similar tickets in a knowledge base.  
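
As a rough sketch of the clustering use case, the snippet below groups tickets greedily: each ticket joins the first cluster whose representative is similar enough, otherwise it starts a new one. The 2-dimensional vectors and the 0.9 threshold are hypothetical; with real sentence embeddings the threshold would need tuning, and a library algorithm such as k-means or agglomerative clustering is usually a better choice.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_cluster(vectors, threshold=0.9):
    """Group vector indices; the first index in each cluster is its representative."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters

# Hypothetical ticket embeddings (comments describe the imagined ticket text).
ticket_vecs = [
    [0.9, 0.1],    # "screen is broken"
    [0.1, 0.95],   # "charged twice on my invoice"
    [0.85, 0.2],   # "display issue"
    [0.05, 0.9],   # "refund not received"
]

clusters = greedy_cluster(ticket_vecs, threshold=0.9)
print(clusters)
```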


In [7]:
!pip install sentence-transformers tensorflow-hub





In [8]:
from sentence_transformers import SentenceTransformer
import tensorflow_hub as hub
import numpy as np

# Sample sentences (from chunked ticket)
sentences = [
    "hello i have an issue with product123 the screen is broken please help",
    "i tried restarting the device but it did not work"
]

def get_sbert_embeddings(sentences: list, model_name: str = 'all-MiniLM-L6-v2') -> list:
    """
    Generate Sentence-BERT embeddings.
    
    Args:
        sentences (list): List of sentences.
        model_name (str): Name of the SentenceTransformers model.
    
    Returns:
        list: List of embeddings.
    """
    try:
        model = SentenceTransformer(model_name)
        embeddings = model.encode(sentences, show_progress_bar=False)
        logging.info(f"Generated SBERT embeddings for {len(sentences)} sentences")
        return embeddings.tolist()
    except Exception as e:
        logging.error(f"SBERT error: {str(e)}")
        return []

def get_use_embeddings(sentences: list, model_url: str = 'https://tfhub.dev/google/universal-sentence-encoder/4') -> list:
    """
    Generate Universal Sentence Encoder embeddings.
    
    Args:
        sentences (list): List of sentences.
        model_url (str): URL of the USE model.
    
    Returns:
        list: List of embeddings.
    """
    try:
        model = hub.load(model_url)
        embeddings = model(sentences).numpy()
        logging.info(f"Generated USE embeddings for {len(sentences)} sentences")
        return embeddings.tolist()
    except Exception as e:
        logging.error(f"USE error: {str(e)}")
        return []

# Get embeddings
sbert_embeddings = get_sbert_embeddings(sentences)
use_embeddings = get_use_embeddings(sentences)

print("\nSBERT Embeddings (first 5 dimensions for first sentence):")
print(sbert_embeddings[0][:5])
print(f"SBERT Embedding Length: {len(sbert_embeddings[0])}")

print("\nUSE Embeddings (first 5 dimensions for first sentence):")
print(use_embeddings[0][:5])
print(f"USE Embedding Length: {len(use_embeddings[0])}")




## 📄 Document Embeddings  

### What are Document Embeddings?  
Document embeddings represent entire documents (or chunks) as fixed-size vectors, capturing their overall meaning.  

They can be derived by:  
- **Averaging word embeddings** (e.g., Word2Vec).  
- **Using sentence embedding models** on document chunks.  
- **Fine-tuned transformer models** designed for long texts.  
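
When pooling chunk vectors into one document vector, two details are worth knowing: the plain mean treats a short chunk the same as a long one, and the mean of unit-length vectors is generally shorter than unit length. The sketch below shows an optional length-weighted mean and a re-normalization step. The vectors, chunk lengths, and helper names are hypothetical.

```python
import math

def mean_pool(chunk_vecs, weights=None):
    """Weighted element-wise mean of chunk vectors (equal weights by default)."""
    if not chunk_vecs:
        raise ValueError("need at least one chunk vector")
    if weights is None:
        weights = [1.0] * len(chunk_vecs)
    total = sum(weights)
    dim = len(chunk_vecs[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, chunk_vecs)) / total
        for d in range(dim)
    ]

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = l2_norm(vec)
    return [x / norm for x in vec] if norm else list(vec)

# Two hypothetical unit-length chunk embeddings and their character counts.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0]]
chunk_lengths = [300, 100]

plain = mean_pool(chunk_vecs)
weighted = mean_pool(chunk_vecs, weights=chunk_lengths)
unit = l2_normalize(plain)

print(plain, round(l2_norm(plain), 4))   # the mean is shorter than unit length
print(weighted)                          # the longer chunk dominates
print(round(l2_norm(unit), 4))
```

Re-normalizing matters mainly when the index ranks by dot product; cosine similarity already ignores vector length.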

### Why Important?  
Document embeddings allow us to:  
- Compare large texts for **semantic similarity**.  
- Power **retrieval in RAG systems**.  
- Enable **clustering and topic modeling** at the document level.  


### Example
**Let’s generate a document embedding for our ticket using SBERT by averaging the embeddings of its chunks**.



In [10]:
from langchain.text_splitter import SpacyTextSplitter

def sentence_chunk(text: str) -> list:
    """
    Split text into sentence-based chunks using LangChain SpacyTextSplitter.
    
    Args:
        text (str): Input text.
    
    Returns:
        list: List of sentence chunks.
    """
    try:
        # SpacyTextSplitter splits on sentence boundaries, then merges consecutive
        # sentences into chunks of up to chunk_size characters.
        splitter = SpacyTextSplitter(chunk_size=1000)
        chunks = splitter.split_text(text)
        print(f"Sentence chunking: {len(chunks)} chunks created")
        return chunks
    except Exception as e:
        print(f"Sentence chunking error: {str(e)}")
        return []


In [14]:
sample_ticket = """
Hello, I am experiencing a critical issue with my device (Product123). The screen suddenly stopped 
responding, and I can no longer interact with it properly. Initially, I thought it was a temporary 
glitch, but after multiple restarts, the issue persists. This started immediately after I updated 
the device to version 2.3, which makes me think the update caused the malfunction. 

Before the screen issue occurred, I had noticed that the device was running slower than usual, 
frequently lagging when switching between applications. I assumed this was normal after the update, 
but now with the screen completely failing, I suspect the two issues are connected. 

I already tried basic troubleshooting: restarting the device, disconnecting it from power, leaving 
it off for several minutes, and reconnecting. None of these worked. I also reset the device to 
factory settings, but the issue remains. The problem is making it impossible for me to use the device 
for my daily work, and it’s becoming a serious inconvenience. 

I reached out to customer support last week and logged this under ticket12345, but unfortunately I 
have not received any response. My account is linked to user123, and I’ve been a premium customer for 
over two years. I was expecting faster assistance, especially given the urgency of the issue. 

Please note, this product is still under warranty (code xyz-789). I also purchased extended coverage, 
so I would like to request either a replacement device or a repair service as soon as possible. I 
depend on this product for my work, and the downtime is costing me productivity every single day. 

In addition, I want to mention that I tried connecting the device to an external monitor, and while 
the output displays fine externally, the touch screen on the actual device does not respond at all. 
This seems to confirm that the issue is specifically with the screen hardware or the drivers related 
to it. If this is a known bug introduced in version 2.3, I would like to be informed about any 
upcoming patches or fixes. 
"""

In [15]:
def get_document_embedding(chunks: list, model_name: str = 'all-MiniLM-L6-v2') -> list:
    """
    Generate document embedding by averaging sentence embeddings.
    
    Args:
        chunks (list): List of document chunks.
        model_name (str): Name of the SentenceTransformers model.
    
    Returns:
        list: Document embedding.
    """
    try:
        model = SentenceTransformer(model_name)
        embeddings = model.encode(chunks, show_progress_bar=False)
        doc_embedding = np.mean(embeddings, axis=0).tolist()
        logging.info("Generated document embedding")
        return doc_embedding
    except Exception as e:
        logging.error(f"Document embedding error: {str(e)}")
        return []

# Use sentence chunks from previous section
chunks = sentence_chunk(sample_ticket)
doc_embedding = get_document_embedding(chunks)
print("\nDocument Embedding (first 5 dimensions):")
print(doc_embedding[:5])
print(f"Document Embedding Length: {len(doc_embedding)}")



Sentence chunking: 3 chunks created





Document Embedding (first 5 dimensions):
[-0.030796987935900688, -0.04647569730877876, 0.034133993089199066, -0.06959479302167892, 0.01470125000923872]
Document Embedding Length: 384


## 🌍 Multilingual Embeddings  

### What are Multilingual Embeddings?  
Multilingual embeddings map text from multiple languages to a **shared vector space**, enabling cross-lingual similarity comparisons.  

### Popular Models  
- **distiluse-base-multilingual-cased-v1** (SentenceTransformers)  
- **LaBSE** (Language-agnostic BERT Sentence Embedding)  

### Why for RAG?  
For customer support tickets in multiple languages, multilingual embeddings ensure:  
- **Consistent retrieval** across languages.  
- **Better coverage** in global applications.  
- **Cross-lingual search**, where a query in one language can retrieve documents in another.  


### Example
**Let’s embed an English and Spanish ticket using a multilingual model**.



In [16]:
english_ticket = "hello i have an issue with product123 the screen is broken please help"
spanish_ticket = "hola tengo un problema con product123 la pantalla está rota por favor ayuda"

In [17]:
def get_multilingual_embeddings(texts: list, model_name: str = 'distiluse-base-multilingual-cased-v1') -> list:
    """
    Generate multilingual embeddings.
    
    Args:
        texts (list): List of texts in different languages.
        model_name (str): Name of the multilingual SentenceTransformers model.
    
    Returns:
        list: List of embeddings.
    """
    try:
        model = SentenceTransformer(model_name)
        embeddings = model.encode(texts, show_progress_bar=False)
        logging.info(f"Generated multilingual embeddings for {len(texts)} texts")
        return embeddings.tolist()
    except Exception as e:
        logging.error(f"Multilingual embedding error: {str(e)}")
        return []

# Get embeddings
texts = [english_ticket, spanish_ticket]
multi_embeddings = get_multilingual_embeddings(texts)
print("\nMultilingual Embeddings (first 5 dimensions for English):")
print(multi_embeddings[0][:5])
print("Multilingual Embeddings (first 5 dimensions for Spanish):")
print(multi_embeddings[1][:5])

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([multi_embeddings[0]], [multi_embeddings[1]])[0][0]
print(f"\nSimilarity between English and Spanish tickets: {similarity:.4f}")


## 🎯 Fine-tuning Embeddings for Domain-specific Tasks  

### 🔎 What is Fine-tuning Embeddings?  
Fine-tuning adapts **pre-trained embedding models** to domain-specific data (e.g., customer support tickets), improving performance on specialized tasks.  

---

### 🤖 Why for RAG?  
Fine-tuned embeddings:  
- Better capture **domain-specific terms** (e.g., `product123`, `warranty code`).  
- Improve **retrieval accuracy** in specialized knowledge bases.  
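
A common training signal for this kind of adaptation is a contrastive loss over labeled pairs: similar pairs are pulled together, dissimilar pairs are pushed apart until they are at least a margin away. The function below is a minimal per-pair sketch of that idea. The formula and the `margin=0.5` default are assumptions meant to mirror the usual formulation with a cosine distance; check the library documentation for the exact definition it implements.

```python
def contrastive_loss(distance, label, margin=0.5):
    """Per-pair contrastive loss.

    distance: distance between the two embeddings (e.g., 1 - cosine similarity)
    label:    1 for a similar pair, 0 for a dissimilar pair
    margin:   dissimilar pairs further apart than this contribute no loss
    """
    if label == 1:
        # Similar pairs are penalized for any remaining distance.
        return 0.5 * distance ** 2
    # Dissimilar pairs are penalized only while they sit inside the margin.
    return 0.5 * max(0.0, margin - distance) ** 2

close_positive = contrastive_loss(0.1, label=1)   # small penalty
close_negative = contrastive_loss(0.1, label=0)   # larger penalty: too close
far_negative = contrastive_loss(0.8, label=0)     # already beyond the margin

print(close_positive, close_negative, far_negative)
```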

---

### 📝 Example  
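Let’s simulate fine-tuning an **SBERT model** on a small dataset of **customer support tickets** to adapt it for the RAG pipeline. The two training pairs below only illustrate the workflow; a useful fine-tune needs far more labeled pairs, and the resulting similarity should be compared against the base model's score.  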


In [18]:
# Simulated positive and negative pairs
training_data = [
    ("hello i have an issue with product123 the screen is broken", "product123 screen broken issue help", 1),  # Positive pair
    ("hello i have an issue with product123 the screen is broken", "i love my new phone", 0),  # Negative pair
]

In [20]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sklearn.metrics.pairwise import cosine_similarity

def fine_tune_embeddings(training_data: list, model_name: str = 'all-MiniLM-L6-v2', epochs: int = 1) -> SentenceTransformer:
    """
    Fine-tune a SentenceTransformer model on domain-specific data.
    
    Args:
        training_data (list): List of (sentence1, sentence2, label) tuples.
        model_name (str): Base model name.
        epochs (int): Number of training epochs.
    
    Returns:
        SentenceTransformer: Fine-tuned model.
    """
    try:
        model = SentenceTransformer(model_name)
        train_examples = [InputExample(texts=[pair[0], pair[1]], label=pair[2]) for pair in training_data]
        train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
        train_loss = losses.ContrastiveLoss(model=model)
        
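        model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=epochs,
            # With only two pairs there is a single optimizer step per epoch, so a
            # 100-step warmup would keep the learning rate near zero throughout.
            warmup_steps=0
        )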
        logging.info("Fine-tuning completed")
        return model
    except Exception as e:
        logging.error(f"Fine-tuning error: {str(e)}")
        return None

# Fine-tune model
fine_tuned_model = fine_tune_embeddings(training_data, epochs=1)

# Test fine-tuned model
test_sentences = [
    "hello i have an issue with product123 the screen is broken",
    "product123 screen issue"
]
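# fine_tune_embeddings returns None on failure; stop here instead of hitting an AttributeError
if fine_tuned_model is None:
    raise RuntimeError("Fine-tuning failed; see embeddings.log for details")
embeddings = fine_tuned_model.encode(test_sentences, show_progress_bar=False)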
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"\nSimilarity after fine-tuning: {similarity:.4f}")

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]


Similarity after fine-tuning: 0.8846
