<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width=400px style="opacity:0.8">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



# SPLADE Embeddings Walkthrough

## Introduction

SPLADE (SParse Lexical AnD Expansion model) is an information-retrieval model that produces sparse vectors carrying some of the semantic strength of dense representations. It uses a pretrained language model to weight and expand terms, so the resulting vectors can be searched as efficiently as keyword vectors while matching more than the literal words of the query.

SPLADE embeddings are built on the idea of merging sparse and dense retrieval methods. Sparse vectors, like those used in traditional TF-IDF or BM25 models, are efficient and interpretable but suffer from vocabulary mismatch issues. Dense vectors, on the other hand, capture semantic meaning but require extensive data for fine-tuning and are computationally expensive.

SPLADE aims to bridge this gap by using a pretrained language model to identify and expand relevant terms, creating a sparse vector that retains the efficiency of traditional methods while incorporating semantic understanding.


## From Input Sentence to Sparse Embedding

- **Transformer encoding**  
  SPLADE passes the input sentence through a transformer (e.g., BERT). Instead of only producing a pooled embedding, it uses the hidden states of each token.

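- **Vocabulary projection**  
  Each hidden state is projected, through the masked-language-model head, into a vector the size of the whole vocabulary.  
  → For every token, the model predicts scores for *all vocabulary words*, not just the input tokens.

- **Sparse activation via regularization**  
  To avoid dense vectors, SPLADE applies:
  - ReLU (to enforce non-negativity)  
  - Log-saturation, `log(1 + x)` (to keep a few terms from dominating)  
  - A sparsity regularizer during training (the SPLADE paper uses the FLOPS regularizer; L1 is a simpler alternative)  

  Most dimensions are driven to zero, leaving only a **small subset of activated vocabulary terms**.  
  This explains why the sparse embedding has *more tokens than the input sentence*: the model expands into semantically related words.

- **Pooling over tokens**  
  The per-token vocabulary vectors are combined into a single vector for the sentence by summing or max-pooling each vocabulary dimension.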

**Example**:  
Input: `"jaguar speed"`  
Expansion: `"animal"`, `"leopard"`, `"fast"`, etc.
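
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the helper name `splade_pool`, the seven-word vocabulary, and the logit values are all made up for the example, and max pooling is assumed (sum pooling is the other common choice). In a real model the logits would come from the transformer's masked-language-model head.

```python
import numpy as np

def splade_pool(logits: np.ndarray, pooling: str = "max") -> np.ndarray:
    """Collapse per-token vocabulary logits (seq_len x vocab_size) into one vector."""
    # ReLU removes negative scores; log(1 + x) dampens very large activations.
    saturated = np.log1p(np.maximum(logits, 0.0))
    if pooling == "max":
        return saturated.max(axis=0)
    if pooling == "sum":
        return saturated.sum(axis=0)
    raise ValueError(f"unknown pooling: {pooling}")

# Hypothetical toy vocabulary and invented logits for the input "jaguar speed".
vocab = ["jaguar", "speed", "animal", "leopard", "fast", "the", "bank"]
logits = np.array([
    [4.0, -1.0, 2.0, 1.5, -0.5, -3.0, -2.0],   # scores from the token "jaguar"
    [-0.5, 3.5, -1.0, -2.0, 2.5, -3.0, -4.0],  # scores from the token "speed"
])

weights = splade_pool(logits)

# Keep only the activated (non-zero) dimensions as a term -> weight mapping.
sparse = {term: float(w) for term, w in zip(vocab, weights) if w > 0}
print(sparse)
```

With these invented numbers, the two input tokens activate five vocabulary terms, while `"the"` and `"bank"` stay at exactly zero because their logits are negative for every token.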



## Indexing with SPLADE

- Each document is processed into a sparse vocabulary vector.  
- Non-zero terms are stored in a standard **inverted index**.  
- The weight of each activated term plays the role that a TF-IDF or BM25 term weight plays in a traditional index.

Result: Indexing is efficient and compatible with traditional IR systems.
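
As a rough sketch of what this indexing step looks like, the snippet below builds an inverted index from already-computed sparse vectors. The function name `build_inverted_index`, the document ids, and the weights are hypothetical; a production system would normally delegate this to a search engine or vector database rather than a Python dictionary.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each activated term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:  # only non-zero terms are stored
                index[term].append((doc_id, weight))
    return dict(index)

# Invented sparse vectors standing in for SPLADE output on two documents.
docs = {
    "d1": {"jaguar": 1.6, "animal": 1.1, "fast": 0.9},
    "d2": {"car": 1.4, "automobile": 1.0, "fast": 0.7},
}

index = build_inverted_index(docs)
print(index["fast"])
```

Each postings list holds only the documents in which that term is activated, so storage grows with the number of non-zero weights rather than with the vocabulary size.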



## Search with SPLADE

- Queries are processed the same way as documents.  
- Retrieval = **sparse dot product** between query and document vectors.  
- Matches occur when:
  - **Exact words overlap**  
  - **Expanded terms overlap** (semantic matches)
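
A sparse dot product over an inverted index can be sketched as follows. The `search` helper, the small hand-written index, and the query weights are illustrative assumptions; the point is only that scoring touches the postings lists of the query's activated terms and nothing else.

```python
from collections import defaultdict

def search(query_vector, index, top_k=10):
    """Score documents by the dot product of query and document term weights."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        # Terms absent from the index contribute nothing to any score.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical inverted index: term -> [(doc_id, weight), ...]
index = {
    "jaguar": [("d1", 1.6)],
    "animal": [("d1", 1.1)],
    "fast": [("d1", 0.9), ("d2", 0.7)],
    "car": [("d2", 1.4)],
    "automobile": [("d2", 1.0)],
}

# Invented expansion of the query "jaguar speed".
query = {"jaguar": 1.5, "speed": 1.2, "fast": 1.0, "animal": 0.8}

results = search(query, index)
print(results)
```

Here `d1` ranks first because it overlaps with the query on an exact word (`"jaguar"`) and on expanded terms (`"fast"`, `"animal"`), while `d2` scores only through the expanded term `"fast"`.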



## Why It Works

- **Expansion effect**: Queries and documents get expanded into related words, reducing vocabulary mismatch.  
- **Sparse structure**: Interpretable, efficient, and indexable in existing search engines.  
- **Hybrid efficiency**: Combines BM25-like sparse retrieval with semantic richness of transformer models.



## Summary

SPLADE converts a sentence into a **sparse, vocabulary-sized vector** by expanding tokens into semantically related terms.  
- **Indexing**: Documents are stored as sparse term-weight vectors.  
- **Search**: Queries are expanded the same way, and matching is done via dot product.  
- **Benefit**: Captures both lexical overlap *and* semantic similarity within a sparse, efficient framework.


## BM42 vs. SPLADE: Key Differences

### BM42
- **What it is**: A refinement of BM25 that keeps the IDF component but replaces the raw **term frequency (TF)** of each token in the chunk/document with a **transformer-derived importance score**.  
- **Effect**:  
  - Still a **lexical keyword match** → if a query term isn’t present in the document, there is no hit.  
  - But among the matching terms, the scoring is much smarter, since the transformer assigns higher weight to “important” words in context and lower weight to stopwords or less relevant tokens.  
- **So**: BM42 = *keyword search with improved weighting* (context-aware TF).


### SPLADE
- **What it is**: Uses a transformer to generate a **sparse vocabulary-sized embedding** by projecting contextualized hidden states onto the entire vocabulary.  
- **Effect**:  
  - Not restricted to exact query tokens.  
  - Documents and queries both expand into semantically related terms.  
  - Matches can happen on **synonyms or related words**, not just identical words.  
- **So**: SPLADE = *semantic search in sparse space*, since overlap may occur through expanded equivalents.


### Key Difference
- **BM42**: Purely keyword-based → retrieval depends on **exact term overlap**, but scoring is transformer-aware.  
- **SPLADE**: Expands each text into related vocabulary terms → retrieval allows **semantic overlap** (e.g., query *“car”* matches doc with *“automobile”*).  


Summary:  
- **BM42** stays lexical, just with smarter importance weighting.  
- **SPLADE** pushes into semantic territory by **expanding the representation space itself**.
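
The contrast can be made concrete with a toy comparison. All weights below are invented, and the two dictionaries only mimic the *shape* of each representation: the BM42-style vectors contain just the words literally present in the text, while the SPLADE-style vectors also contain expanded terms. The helper `sparse_dot` is a hypothetical name for the shared scoring function.

```python
def sparse_dot(query, doc):
    """Dot product over the terms that query and document share."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)

# BM42-style: only terms that literally occur, each with an importance weight.
bm42_query = {"car": 1.0}
bm42_doc = {"automobile": 0.8, "dealer": 0.5}

# SPLADE-style: both sides are expanded into related vocabulary terms.
splade_query = {"car": 1.5, "automobile": 0.9, "vehicle": 0.7}
splade_doc = {"automobile": 1.4, "car": 0.8, "dealer": 0.6, "vehicle": 0.5}

bm42_score = sparse_dot(bm42_query, bm42_doc)
splade_score = sparse_dot(splade_query, splade_doc)
print(bm42_score, splade_score)
```

The BM42-style score is zero because the query and document share no literal term, whereas the SPLADE-style score is positive because the expansions overlap on `"car"`, `"automobile"`, and `"vehicle"`.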
