In [None]:
# !pip uninstall purecpp_libs purecpp_extract purecpp_chunks_clean -y

In [None]:
#!pip install purecpp_extract_dev==0.0.400 purecpp_chunks_clean_dev==0.0.400

In [None]:
import os
# Set your OpenAI API key here (needed for the OpenAI embedding models used below).
os.environ["OPENAI_API_KEY"] = ""

In [4]:
import purecpp_libs_dev as libs
from purecpp_extract_dev import WebLoader
import purecpp_chunks_clean_dev as cc

In [5]:
loader = WebLoader("https://www.espn.com/")
docs = loader.Load()

Scrapped https://www.espn.com/


In [6]:
cc.AvailableModels()

╔══════════════════════════════════════════════════════════╗
║                Available Embedding Models                ║
╠══════════════════════════════════════════════════════════╣
║ 🔸 Vendor: cohere
║    └── embed-english-light-v3.0
╠──────────────────────────────────────────────────────────╣
║ 🔸 Vendor: huggingface
║    └── bge-small
║    └── bge-large
╠──────────────────────────────────────────────────────────╣
║ 🔸 Vendor: openai
║    └── text-embedding-ada-002
║    └── text-embedding-3-small
║    └── text-embedding-3-large
╠──────────────────────────────────────────────────────────╣
╚══════════════════════════════════════════════════════════╝



In [7]:
chunk_default = cc.ChunkDefault(chunk_size=420, overlap=60, items_opt=docs, max_workers=4)
chunk_default

<purecpp_chunks_clean_dev.ChunkDefault at 0x7f6b267e0b30>

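For intuition about `chunk_size` and `overlap`, here is a minimal pure-Python sketch of a sliding-window splitter. It assumes character-based windows that advance by `chunk_size - overlap`; `split_with_overlap` is a hypothetical helper, not part of `purecpp_chunks_clean_dev`, and the library's actual splitting rules may differ.

```python
def split_with_overlap(text: str, chunk_size: int = 420, overlap: int = 60) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Toy input: 1000 characters with a repeating alphabet pattern.
sample = "".join(chr(97 + i % 26) for i in range(1000))
parts = split_with_overlap(sample)
print([len(p) for p in parts])
print(parts[0][-60:] == parts[1][:60])  # consecutive chunks share 60 characters
```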
# Embedding Results by Model

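Below are the results obtained after running `chunk_default.CreateEmb()` with each of the three OpenAI models.  
Calling `CreateEmb()` with no argument uses `text-embedding-ada-002`.  
The figures in this section were recorded in a run that produced 24 chunks; the cell outputs further down come from a run with 23 chunks, so their flattened sizes are 35,328 and 70,656 instead.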

## Summary Table

| Model                        | Vendor | Dimension (`dim`) | Number of Chunks (`n`) | Flattened Size (`flatVD.size`) |
|------------------------------|--------|-----------------|----------------------|-------------------------------|
| text-embedding-ada-002       | openai | 1536            | 24                   | 36,864                        |
| text-embedding-3-small       | openai | 1536            | 24                   | 36,864                        |
| text-embedding-3-large       | openai | 3072            | 24                   | 73,728                        |

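The last column is simply `dim × n`. A quick check in plain Python, using only the values from the table above; the memory figure assumes 4-byte floats, which is typical for a `std::vector<float>` but is not stated in the notebook.

```python
# (model, dim, n) rows copied from the summary table
rows = [
    ("text-embedding-ada-002", 1536, 24),
    ("text-embedding-3-small", 1536, 24),
    ("text-embedding-3-large", 3072, 24),
]

# Flattened size is the number of floats in the concatenated buffer.
flat_sizes = {model: dim * n for model, dim, n in rows}

for model, size in flat_sizes.items():
    # Assumption: each float occupies 4 bytes.
    print(f"{model}: {size:,} floats, about {size * 4 / 1024:.0f} KiB")
```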
---

## Details by Model

### 1. `text-embedding-ada-002`

```plaintext
Flatten vector dimensions: <36864>
dim: 1536 → n: 24 → expected size: 36864
Model: text-embedding-ada-002, openai

VDBdata(
    flatVD = [-0.0164003, -0.00974306, 0.0139928, -0.011572, -0.0230242, ...],
    flatVD.size = 36864,
    model = 'text-embedding-ada-002',
    vendor = 'openai',
    dim = 1536,
    n = 24
)
```


In [8]:
chunk_default.CreateEmb()

Flatten vector dimensions: <35328>
dim: 1536 n: 23 → expected size: 35328
Model text-embedding-ada-002, openai


VDBdata(
  flatVD = [-0.0164003, -0.00974306, 0.0139928, -0.011572, -0.0230242, -0.00333359, -0.011818, -0.0224124, ...]
  flatVD.size = 35328,
  model = 'text-embedding-ada-002',
  vendor = 'openai',
  dim = 1536,
  n = 23
)

______________________________________________________________________________________
|  Model: text-embedding-ada-002 was added to chunks                                    
|_____________________________________________________________________________________|


In [9]:
chunk_default.CreateEmb("text-embedding-3-small")

Flatten vector dimensions: <35328>
dim: 1536 n: 23 → expected size: 35328
Model text-embedding-3-small, openai


VDBdata(
  flatVD = [0.0122094, 0.0349976, 0.015416, 0.0186773, -0.00630341, -0.00091468, 0.00028648, 0.00851646, ...]
  flatVD.size = 35328,
  model = 'text-embedding-3-small',
  vendor = 'openai',
  dim = 1536,
  n = 23
)

______________________________________________________________________________________
|  Model: text-embedding-3-small was added to chunks                                    
|_____________________________________________________________________________________|


In [10]:
chunk_default.CreateEmb("text-embedding-3-large")

Flatten vector dimensions: <70656>
dim: 3072 n: 23 → expected size: 70656
Model text-embedding-3-large, openai
______________________________________________________________________________________
|  Model: text-embedding-3-large was added to chunks                                    
|_____________________________________________________________________________________|


VDBdata(
  flatVD = [0.0366267, 0.0108322, -0.0039584, 0.0206926, -0.0455404, 0.00583078, 0.0461771, 0.0477186, ...]
  flatVD.size = 70656,
  model = 'text-embedding-3-large',
  vendor = 'openai',
  dim = 3072,
  n = 23
)

## Internal Structure of `ChunkDefault` and `VectorDataBase`

- **References to Chunks**  
  - `ChunkDefault` holds only _references_ to the `RAGLibrary::Document` objects (“chunks”), without copying each instance multiple times.  
  - This minimizes memory duplication and preserves direct access to the original text.

- **Contiguous `vdb_data` Array (`VectorDataBase`)**  
  - Each call to `CreateEmb(model)` produces a `vdb_data` object that aggregates all embeddings for that model:  
    - **`flatVD`**: a contiguous `std::vector<float>` containing all embeddings concatenated.  
    - **`dim`**: the dimensionality of each embedding (D).  
    - **`n`**: the total number of chunks.  
    - **`model`** and **`vendor`**: identifiers of the embedding model.  
  - This contiguous memory layout:  
    1. **Avoids overhead** of having N separate `std::vector<float>`.  
    2. **Enables efficient vector search** (e.g., with FAISS or manual computations), since all data resides in one contiguous block.

- **Benefits**  
  1. **Memory efficiency**: a single flattened vector instead of N independent vectors.  
  2. **Performance**: cache-friendly sequential access and reduced heap fragmentation.  
  3. **Simplicity**: `vdb_data` encapsulates everything needed for downstream searches—pointer, dimensions, and model metadata.  

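The contiguous layout described above can be illustrated in plain Python. This is only an analogy for `flatVD`, not the library's code: `array("f")` stands in for `std::vector<float>`, `memoryview` slices stand in for the per-chunk views, and `embedding_view` is a hypothetical helper.

```python
from array import array

dim, n = 4, 3  # toy sizes; the notebook uses dim = 1536 or 3072 and n = 23

# One contiguous float buffer holding n embeddings back to back
# (values are made up for illustration).
flat = array("f", [float(i) for i in range(dim * n)])
assert len(flat) == dim * n

buf = memoryview(flat)

def embedding_view(i: int) -> memoryview:
    """Return a zero-copy view of the i-th embedding inside the flat buffer."""
    if not 0 <= i < n:
        raise IndexError(i)
    return buf[i * dim:(i + 1) * dim]

second = embedding_view(1)
print(second.tolist())  # [4.0, 5.0, 6.0, 7.0]

# Writing through the view changes the underlying buffer: no copy was made.
second[0] = -1.0
print(flat[dim])  # -1.0
```

Embedding `i` always starts at offset `i * dim`, so the only bookkeeping needed is `dim` and `n`.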

In [11]:
chunk_default.printVD()

Chunk list contains 23 chunks.
With an average size of: 420 and  overlap of: 60 | Quantity of different embeddings: 3
╔═════════════════════════════════════════════════════════════════════════════════════╗
║ ➤ Model / Vendor  : text-embedding-ada-002 / openai              
║ ➤ Dimensions      : 1536                                    
║ ➤ Flat Vector Size: 35328                                   
╚═════════════════════════════════════════════════════════════════════════════════════╝
╔═════════════════════════════════════════════════════════════════════════════════════╗
║ ➤ Model / Vendor  : text-embedding-3-small / openai              
║ ➤ Dimensions      : 1536                                    
║ ➤ Flat Vector Size: 35328                                   
╚═════════════════════════════════════════════════════════════════════════════════════╝
╔═════════════════════════════════════════════════════════════════════════════════════╗
║ ➤ Model / Vendor  : text-embedding-3-large / openai 

# How `VectorDataBase` and `ChunkQuery` Work Together

1. **Preparing the Query and the Vector Database**  
   - When you call, for example:  
     ```cpp
     auto result = VectorDataBase::PureCosine("Playoff", chunk_default, 1, 4);
     ```  
     you are requesting:
     1. That the string `"Playoff"` be converted into an embedding using the **embedding model** stored at index `pos = 1` of `chunk_default` (in this notebook, `text-embedding-3-small`, the second entry listed by `printVD()`).  
     2. That the **4** most similar chunks be returned (using cosine similarity).

2. **`ChunkQuery` — Generating the Query Embedding**  
   - Internally, `PureCosine` does:
     ```cpp
     Chunk::ChunkQuery cq(query, {}, &chunks, pos);
     ```  
     which:
     - **`setChunks(&chunks, pos)`**: selects the `vdb_data` (`VectorDataBase` entry) at position `pos` in the `ChunkDefault` embeddings vector.  
     - Creates _views_ (spans) pointing to each chunk embedding within the same contiguous block (`flatVD`).  
     - Generates a new embedding for the query (`"Playoff"`) using the same model (`model`) from that `vdb_data`.

3. **Building the FAISS Index**  
   - Still in `PureCosine`, it retrieves:
     ```cpp
     const float* base = vdb->flatVD.data();
     ```
     and, knowing `d = vdb->dim` and `ndb = vdb->n`, normalizes each chunk vector in-place to unit length, so that the inner product of two vectors equals their cosine similarity.  
   - It then creates a FAISS **`IndexFlatIP`** index with dimension `d` and adds the contiguous block of `ndb` vectors.

4. **Performing the Top-k Search**  
   - The normalized query embedding is passed to FAISS:
     ```cpp
     index.search(1, normalized_query.data(), k, D.data(), I.data());
     ```  
     returning:
     - **`I`**: the indices of the top-`k` most similar chunks.  
     - **`D`**: the similarity scores (cosine values).

5. **Mapping Indices Back to Chunks**  
   - With the index vector `I`, the code does:
     ```cpp
     for (auto idx : I)
         matched.push_back(chunks_list[idx]);
     ```  
     translating each FAISS index back into the corresponding chunk, that is, the `Document` holding the original `page_content`.

6. **Constructing the `PureResult`**  
   - Finally, it packs everything into a `PureResult`:
     ```cpp
     {
       indices = I,
       distances = D,
       query = query,
       query_embedding = emb_query,
       matched_chunks = matched
     }
     ```  
   - The Python user receives an object containing:
     - The original **query**.
     - The **query embedding**.
     - The **indices** and **scores** of the matched chunks.
     - The actual **chunks** of text that were most relevant.

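Steps 3 to 5 above can be mimicked without FAISS. The sketch below is a brute-force stand-in, not the library's implementation: it normalizes each row of a flat buffer, scores the query by inner product, and maps the top-`k` indices back to chunk texts. `normalize`, `cosine_top_k`, and the toy data are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_top_k(query_emb, flat, dim, texts, k):
    """Brute-force inner-product search over unit vectors (cosine similarity)."""
    n = len(flat) // dim
    q = normalize(query_emb)
    scored = []
    for i in range(n):
        row = normalize(flat[i * dim:(i + 1) * dim])  # i-th embedding
        scored.append((sum(a * b for a, b in zip(q, row)), i))
    # Highest similarity first; ties broken by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    top = scored[:k]
    indices = [i for _, i in top]
    distances = [score for score, _ in top]
    matched = [texts[i] for i in indices]  # map indices back to chunks
    return indices, distances, matched

# Three 2-dimensional toy embeddings stored back to back.
flat = [1.0, 0.0, 0.0, 2.0, 3.0, 3.0]
texts = ["chunk A", "chunk B", "chunk C"]

indices, distances, matched = cosine_top_k([2.0, 1.0], flat, 2, texts, k=2)
print(indices, [round(d, 4) for d in distances], matched)
```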
---

## Benefits of This Design

- **`vdb_data` Reuse**: each `CreateEmb()` creates just one contiguous embeddings vector (`flatVD`), avoiding redundant copies.  
- **`ChunkQuery` as an Adapter**: unifies query embedding generation, data views, and metadata retrieval.  
- **Pure FAISS Backends**: easily switch between L2, dot-product, or cosine similarity without changing index-building logic.  
- **Straightforward Pipeline**: FAISS indices → chunk indices → `Document` objects → user response.

This way, the entire pipeline (chunking → embeddings → indexing → search → text return) remains modular and memory-efficient, and it integrates into a Python notebook via pybind11.


In [12]:
cc.PureCosine("Playoff", chunk_default, 1, 4)

PureResult(
  query: 'Playoff',
  index: [9, 8, 21, 4],
  distances: [0.4066, 0.3835, 0.3833, 0.3810],
  query_embed: [0.0258, -0.0375, -0.0212, 0.0033, -0.0075, -0.0061, 0.0021, 0.0370, ...],
  chunks: [
 [0]: 'Document(metadata={}, page_content="test playoff games of the Eastern Conference finals. 14h NBA insiders J.R. Smith crashes Brunson'...")'

 [1]: 'Document(metadata={}, page_content="s could have interesting salary cap fallout. 6h Dan Graziano Barnwell's NFC offseason superlative...")'

 [2]: 'Document(metadata={}, page_content="valishvili vs. O'Malley 2 (Jun. 7, ESPN+ PPV) Quick Links NBA Playoffs Stanley Cup Playoffs Frenc...")'

 [3]: 'Document(metadata={}, page_content="experts break down the 2025 field and give their takes on who can advance to Omaha. 6h ESPN ⚾ F...")'

]

In [13]:
cc.PureIP("Playoff", chunk_default, 0, 5)

PureResult(
  query: 'Playoff',
  index: [21, 9, 8, 1, 17],
  distances: [0.8222, 0.8195, 0.8108, 0.7996, 0.7970],
  query_embed: [-0.0187, 0.0000, -0.0039, -0.0072, -0.0470, 0.0238, -0.0083, -0.0021, ...],
  chunks: [
 [0]: 'Document(metadata={}, page_content="valishvili vs. O'Malley 2 (Jun. 7, ESPN+ PPV) Quick Links NBA Playoffs Stanley Cup Playoffs Frenc...")'

 [1]: 'Document(metadata={}, page_content="test playoff games of the Eastern Conference finals. 14h NBA insiders J.R. Smith crashes Brunson'...")'

 [2]: 'Document(metadata={}, page_content="s could have interesting salary cap fallout. 6h Dan Graziano Barnwell's NFC offseason superlative...")'

 [3]: 'Document(metadata={}, page_content="Watch ESPN BET ESPN+ Subscribe Now NCAA Women's College World Series NCAA Baseball Tournament: Re...")'

 [4]: 'Document(metadata={}, page_content="atriots: How close are the Patriots to returning to winning? The Patriots have revamped their fro...")'

]

In [14]:
cc.PureIP("Playoff", chunk_default, 2, 5)

PureResult(
  query: 'Playoff',
  index: [9, 1, 0, 19, 14],
  distances: [0.3616, 0.3392, 0.3367, 0.3298, 0.3236],
  query_embed: [0.0045, 0.0009, -0.0094, 0.0127, -0.0343, 0.0141, -0.0097, 0.0827, ...],
  chunks: [
 [0]: 'Document(metadata={}, page_content="test playoff games of the Eastern Conference finals. 14h NBA insiders J.R. Smith crashes Brunson'...")'

 [1]: 'Document(metadata={}, page_content="Watch ESPN BET ESPN+ Subscribe Now NCAA Women's College World Series NCAA Baseball Tournament: Re...")'

 [2]: 'Document(metadata={}, page_content="\n        Skip to main content\n     \n        Skip to navigation\n     < > Menu ESPN scores NFL ...")'

 [3]: 'Document(metadata={}, page_content="A League Join a Public League Practice With a Mock Draft Sign up for FREE! Reactivate A League Cr...")'

 [4]: 'Document(metadata={}, page_content="ds club-record $147M for Wirtz Braves' Sale fastest to reach 2,500 career K's Swiatek, Sabalenka ...")'

]

In [15]:
cc.PureL2("Playoff", chunk_default, 0, 1)

PureResult(
  query: 'Playoff',
  index: [21],
  distances: [0.3557],
  query_embed: [-0.0187, 0.0000, -0.0039, -0.0072, -0.0470, 0.0238, -0.0083, -0.0021, ...],
  chunks: [
 [0]: 'Document(metadata={}, page_content="valishvili vs. O'Malley 2 (Jun. 7, ESPN+ PPV) Quick Links NBA Playoffs Stanley Cup Playoffs Frenc...")'

]

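The `PureIP` and `PureL2` results for position 0 can be cross-checked against each other. For unit-length vectors, squared Euclidean distance and inner product satisfy ‖a − b‖² = 2 − 2·(a·b). Assuming the `text-embedding-ada-002` vectors are unit-normalized and that `PureL2` reports squared distances (as FAISS's `IndexFlatL2` does), the scores for chunk 21 in the two cells should agree:

```python
ip_score = 0.8222     # PureIP, pos 0, chunk 21 (cell In [13])
l2_reported = 0.3557  # PureL2, pos 0, chunk 21 (cell In [15])

# Squared L2 distance predicted from the inner product of two unit vectors.
l2_from_ip = 2.0 - 2.0 * ip_score
print(f"predicted {l2_from_ip:.4f} vs reported {l2_reported:.4f}")

# Both inputs are rounded to four decimals, so allow a small tolerance.
assert abs(l2_from_ip - l2_reported) < 5e-4
```

The match is consistent with both assumptions; it also means that, for this model, `PureIP` and `PureL2` should rank chunks in the same order.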
In [16]:
result = cc.PureL2("Playoff", chunk_default, 2, 6)