<a href="https://colab.research.google.com/github/arssite/Datalysis/blob/main/Model_testing_for_PCM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
!pip install -q sentence-transformers transformers adapters torch

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.2/302.2 kB[0m [31m763.1 kB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.4/10.4 MB[0m [31m60.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m79.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

In [9]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from adapters import AutoAdapterModel
import time

In [10]:
# === 2. Define Helper Pooling Functions ===

# For SciBERT (and bge-m3 if we used AutoModel)
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

# For Specter-2
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

In [18]:
# === 1. Define Our Data (7 Chunks) ===
hard_test_chunks = [
    # [0] Physics - Electrostatics
    "Coulomb's Law describes the electrostatic force $\vec{F}$ between two stationary point charges, $q_1$ and $q_2$. The force is proportional to the product of the charges and inversely proportional to the square of the distance $r$ between them. $\vec{F} = k_e \frac{q_1 q_2}{r^2} \hat{r}$",
    # [1] Physics - Gravity
    "Newton's Law of Universal Gravitation states that any two bodies in the universe attract each other with a force $\vec{F}$ that is directly proportional to the product of their masses, $m_1$ and $m_2$, and inversely proportional to the square of the distance $r$ between their centers. $\vec{F} = G \frac{m_1 m_2}{r^2} \hat{r}$",
    # [2] Math - Calculus
    "The definite integral $\int_a^b f(x) \, dx$ represents the signed area of the region in the xy-plane that is bounded on the left by $x=a$, on the right by $x=b$, by the x-axis, and by the graph of $y=f(x)$",
    # [3] Math - Algebra
    "The quadratic formula is a solution for all quadratic equations in the form $ax^2 + bx + c = 0$. The roots of the equation, $x$, are given by $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$",
    # [4] Chemistry - Physical
    "A mole is a unit of measurement for the amount of substance. One mole contains exactly $6.022 \times 10^{23}$ elementary entities. This number is known as the Avogadro constant, $N_A$.",
    # [5] Chemistry - Organic
    "Benzene ($C_6H_6$) is an organic chemical compound. It is a cyclic molecule with a ring of six carbon atoms, each with one hydrogen atom attached. It is known for its delogcalized pi electrons, which result in a resonance structure.",
    # [6] Physics - Dynamics (NEW CHUNK)
    "Newton's Second Law (Dynamics) states that the force $\vec{F}$ acting on a body is equal to the mass $m$ of the body times its acceleration $\vec{a}$. The formula is $\vec{F} = m\vec{a}$."
]

# === 2. Define the Retrieval Stress-Test Queries ===
test_queries = [
    ("Conceptual F=ma", "What is the relation between force and acceleration?"),
    ("Hard LaTeX (Coulomb)", "Find information on $\vec{F} = k_e \frac{q_1 q_2}{r^2} \hat{r}$"),
    ("Hard LaTeX (Gravity)", "Find information on $\vec{F} = G \frac{m_1 m_2}{r^2} \hat{r}$"),
    ("Conceptual Math", "How do I find the area under a curve?"),
    ("Keyword Chem", "What is a mole?"),
    ("Keyword Chem 2", "What is Benzene ($C_6H_6$)?")
]

print("Data and queries defined successfully.")

Data and queries defined successfully.


  "Coulomb's Law describes the electrostatic force $\vec{F}$ between two stationary point charges, $q_1$ and $q_2$. The force is proportional to the product of the charges and inversely proportional to the square of the distance $r$ between them. $\vec{F} = k_e \frac{q_1 q_2}{r^2} \hat{r}$",
  "Newton's Law of Universal Gravitation states that any two bodies in the universe attract each other with a force $\vec{F}$ that is directly proportional to the product of their masses, $m_1$ and $m_2$, and inversely proportional to the square of the distance $r$ between their centers. $\vec{F} = G \frac{m_1 m_2}{r^2} \hat{r}$",
  "The definite integral $\int_a^b f(x) \, dx$ represents the signed area of the region in the xy-plane that is bounded on the left by $x=a$, on the right by $x=b$, by the x-axis, and by the graph of $y=f(x)$",
  "The quadratic formula is a solution for all quadratic equations in the form $ax^2 + bx + c = 0$. The roots of the equation, $x$, are given by $x = \frac{-b \pm 

In [19]:
print("\n" + "="*80)
print("🚀 TESTING MODEL: BAAI/bge-m3")
print("="*80)

# --- 1. Load Model ---
model_name = "BAAI/bge-m3"
model = SentenceTransformer(model_name, device=device)
print(f"Loaded {model_name}.")

# --- 2. Create Chunk Embeddings (The "Database") ---
with torch.no_grad():
    chunk_embeddings = model.encode(hard_test_chunks, normalize_embeddings=True)
    chunk_embeddings = torch.tensor(chunk_embeddings).to(device)

# --- 3. Run All Queries ---
for q_name, q_text in test_queries:
    print(f"\n  QUERY: '{q_name}'")

    with torch.no_grad():
        query_vector = model.encode(q_text, normalize_embeddings=True)
        query_vector = torch.tensor(query_vector).to(device)

    # 4. Calculate Scores
    scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)
    top_scores, top_indices = torch.topk(scores, k=3)

    # 5. Print results
    for i in range(3):
        score = top_scores[0][i].item()
        chunk_index = top_indices[0][i].item()
        print(f"    {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")


🚀 TESTING MODEL: BAAI/bge-m3
Loaded BAAI/bge-m3.

  QUERY: 'Conceptual F=ma'
    1. [Chunk 6] (Score: 0.6012)
    2. [Chunk 1] (Score: 0.5680)
    3. [Chunk 0] (Score: 0.5372)

  QUERY: 'Hard LaTeX (Coulomb)'
    1. [Chunk 0] (Score: 0.5666)
    2. [Chunk 1] (Score: 0.4780)
    3. [Chunk 6] (Score: 0.4477)

  QUERY: 'Hard LaTeX (Gravity)'
    1. [Chunk 1] (Score: 0.5293)
    2. [Chunk 0] (Score: 0.4944)
    3. [Chunk 6] (Score: 0.4743)

  QUERY: 'Conceptual Math'
    1. [Chunk 2] (Score: 0.4666)
    2. [Chunk 4] (Score: 0.4224)
    3. [Chunk 3] (Score: 0.4166)

  QUERY: 'Keyword Chem'
    1. [Chunk 4] (Score: 0.6618)
    2. [Chunk 5] (Score: 0.4095)
    3. [Chunk 1] (Score: 0.3315)

  QUERY: 'Keyword Chem 2'
    1. [Chunk 5] (Score: 0.7637)
    2. [Chunk 4] (Score: 0.3767)
    3. [Chunk 6] (Score: 0.3334)


In [20]:
print("\n" + "="*80)
print("🚀 TESTING MODEL: allenai/scibert_scivocab_uncased")
print("="*80)

# --- 1. Load Model ---
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
print(f"Loaded {model_name}.")

# --- 2. Create Chunk Embeddings (The "Database") ---
with torch.no_grad():
    encoded_input = tokenizer(hard_test_chunks, padding=True, truncation=True, return_tensors='pt').to(device)
    model_output = model(**encoded_input)
    chunk_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    chunk_embeddings = F.normalize(chunk_embeddings, p=2, dim=1)

# --- 3. Run All Queries ---
for q_name, q_text in test_queries:
    print(f"\n  QUERY: '{q_name}'")

    with torch.no_grad():
        encoded_input = tokenizer(q_text, padding=True, truncation=True, return_tensors='pt').to(device)
        model_output = model(**encoded_input)
        query_vector = mean_pooling(model_output, encoded_input['attention_mask'])
        query_vector = F.normalize(query_vector, p=2, dim=1)

    # 4. Calculate Scores (with the bug fix)
    if query_vector.dim() == 1:
        scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)
    else:
        scores = torch.mm(query_vector, chunk_embeddings.T)

    top_scores, top_indices = torch.topk(scores, k=3)

    # 5. Print results
    for i in range(3):
        score = top_scores[0][i].item()
        chunk_index = top_indices[0][i].item()
        print(f"    {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")


🚀 TESTING MODEL: allenai/scibert_scivocab_uncased


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded allenai/scibert_scivocab_uncased.

  QUERY: 'Conceptual F=ma'
    1. [Chunk 5] (Score: 0.7416)
    2. [Chunk 4] (Score: 0.7412)
    3. [Chunk 6] (Score: 0.7385)

  QUERY: 'Hard LaTeX (Coulomb)'
    1. [Chunk 3] (Score: 0.8155)
    2. [Chunk 0] (Score: 0.7992)
    3. [Chunk 1] (Score: 0.7735)

  QUERY: 'Hard LaTeX (Gravity)'
    1. [Chunk 3] (Score: 0.8146)
    2. [Chunk 0] (Score: 0.7918)
    3. [Chunk 1] (Score: 0.7901)

  QUERY: 'Conceptual Math'
    1. [Chunk 2] (Score: 0.7471)
    2. [Chunk 3] (Score: 0.7462)
    3. [Chunk 5] (Score: 0.7320)

  QUERY: 'Keyword Chem'
    1. [Chunk 4] (Score: 0.6290)
    2. [Chunk 5] (Score: 0.5970)
    3. [Chunk 3] (Score: 0.5794)

  QUERY: 'Keyword Chem 2'
    1. [Chunk 5] (Score: 0.8284)
    2. [Chunk 3] (Score: 0.7793)
    3. [Chunk 4] (Score: 0.7713)


In [22]:
print("\n" + "="*80)
print("🚀 TESTING MODEL: allenai/specter2")
print("="*80)

# --- 1. Load Model ---
model_name = "allenai/specter2_base"
adapter_name = "allenai/specter2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoAdapterModel.from_pretrained(model_name).to(device)
model.load_adapter(adapter_name, source="hf", set_active=True)
print(f"Loaded {model_name} with {adapter_name} adapter.")

# --- 2. Create Chunk Embeddings (The "Database") ---
with torch.no_grad():
    encoded_input = tokenizer(hard_test_chunks, padding=True, truncation=True, return_tensors='pt').to(device)
    model_output = model(**encoded_input)
    chunk_embeddings = cls_pooling(model_output)
    chunk_embeddings = F.normalize(chunk_embeddings, p=2, dim=1)

# --- 3. Run All Queries ---
for q_name, q_text in test_queries:
    print(f"\n  QUERY: '{q_name}'")

    with torch.no_grad():
        encoded_input = tokenizer(q_text, padding=True, truncation=True, return_tensors='pt').to(device)
        model_output = model(**encoded_input)
        query_vector = cls_pooling(model_output)
        query_vector = F.normalize(query_vector, p=2, dim=1)

    # 4. Calculate Scores (with the bug fix)
    if query_vector.dim() == 1:
        scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)
    else:
        scores = torch.mm(query_vector, chunk_embeddings.T)

    top_scores, top_indices = torch.topk(scores, k=3)

    # 5. Print results
    for i in range(3):
        score = top_scores[0][i].item()
        chunk_index = top_indices[0][i].item()
        print(f"    {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")


🚀 TESTING MODEL: allenai/specter2


Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded allenai/specter2_base with allenai/specter2 adapter.

  QUERY: 'Conceptual F=ma'
    1. [Chunk 6] (Score: 0.9212)
    2. [Chunk 1] (Score: 0.8891)
    3. [Chunk 0] (Score: 0.8596)

  QUERY: 'Hard LaTeX (Coulomb)'
    1. [Chunk 2] (Score: 0.9279)
    2. [Chunk 3] (Score: 0.9076)
    3. [Chunk 0] (Score: 0.9012)

  QUERY: 'Hard LaTeX (Gravity)'
    1. [Chunk 2] (Score: 0.9299)
    2. [Chunk 3] (Score: 0.9090)
    3. [Chunk 0] (Score: 0.9006)

  QUERY: 'Conceptual Math'
    1. [Chunk 2] (Score: 0.9080)
    2. [Chunk 4] (Score: 0.8758)
    3. [Chunk 3] (Score: 0.8627)

  QUERY: 'Keyword Chem'
    1. [Chunk 4] (Score: 0.9291)
    2. [Chunk 2] (Score: 0.8956)
    3. [Chunk 5] (Score: 0.8851)

  QUERY: 'Keyword Chem 2'
    1. [Chunk 5] (Score: 0.9421)
    2. [Chunk 4] (Score: 0.8910)
    3. [Chunk 2] (Score: 0.8748)


In [21]:
print("\n" + "="*80)
print("🚀 TESTING MODEL: jina-ai/jina-embeddings-v2-base-en")
print("="*80)

# --- 1. Load Model ---
model_name = "jina-ai/jina-embeddings-v2-base-en"
model = SentenceTransformer(model_name, device=device, trust_remote_code=True)
print(f"Loaded {model_name}.")

# --- 2. Create Chunk Embeddings (The "Database") ---
with torch.no_grad():
    chunk_embeddings = model.encode(hard_test_chunks, normalize_embeddings=True)
    chunk_embeddings = torch.tensor(chunk_embeddings).to(device)

# --- 3. Run All Queries ---
for q_name, q_text in test_queries:
    print(f"\n  QUERY: '{q_name}'")

    with torch.no_grad():
        query_vector = model.encode(q_text, normalize_embeddings=True)
        query_vector = torch.tensor(query_vector).to(device)

    # 4. Calculate Scores
    scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)
    top_scores, top_indices = torch.topk(scores, k=3)

    # 5. Print results
    for i in range(3):
        score = top_scores[0][i].item()
        chunk_index = top_indices[0][i].item()
        print(f"    {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")




🚀 TESTING MODEL: jina-ai/jina-embeddings-v2-base-en


OSError: jina-ai/jina-embeddings-v2-base-en is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`

In [23]:
print("\n" + "="*80)
print("🚀 TESTING MODEL: intfloat/e5-mistral-7b-instruct")
print("="*80)

# --- 1. Load Model ---
model_name = "intfloat/e5-mistral-7b-instruct"
model = SentenceTransformer(model_name, device=device)
print(f"Loaded {model_name}.")

# --- 2. Create Chunk Embeddings (The "Database") ---
# E5 needs a "passage: " prefix for chunks
passage_chunks = [f"passage: {chunk}" for chunk in hard_test_chunks]
with torch.no_grad():
    chunk_embeddings = model.encode(passage_chunks, normalize_embeddings=True)
    chunk_embeddings = torch.tensor(chunk_embeddings).to(device)

# --- 3. Run All Queries ---
for q_name, q_text in test_queries:
    print(f"\n  QUERY: '{q_name}'")

    # E5 needs a "query: " prefix for queries
    query_with_prefix = f"query: {q_text}"

    with torch.no_grad():
        query_vector = model.encode(query_with_prefix, normalize_embeddings=True)
        query_vector = torch.tensor(query_vector).to(device)

    # 4. Calculate Scores
    scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)
    top_scores, top_indices = torch.topk(scores, k=3)

    # 5. Print results
    for i in range(3):
        score = top_scores[0][i].item()
        chunk_index = top_indices[0][i].item()
        print(f"    {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")


🚀 TESTING MODEL: intfloat/e5-mistral-7b-instruct


ImportError: cannot import name 'LossKwargs' from 'transformers.utils' (/usr/local/lib/python3.12/dist-packages/transformers/utils/__init__.py)

In [13]:
# === 4. Main Test Loop ===
models_to_test = [
    "BAAI/bge-m3",
    "intfloat/e5-mistral-7b-instruct",
    "allenai/specter2_base", # This will also load the adapter
    "allenai/scibert_scivocab_uncased",
    "jina-ai/jina-embeddings-v2-base-en"
]

In [14]:
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"--- Running on {device} ---")

--- Running on cpu ---


In [17]:
for model_id in models_to_test:
    print("\n" + "="*80)
    print(f"🚀 TESTING MODEL: {model_id}")
    print("="*80)

    model = None
    tokenizer = None
    model_type = ""

    # --- Model Loading Logic ---
    start_time = time.time()
    try:
        if "bge-m3" in model_id or "jina" in model_id:
            model_type = "SentenceTransformer"
            # Jina needs trust_remote_code
            trust = "jina" in model_id
            model = SentenceTransformer(model_id, device=device, trust_remote_code=trust)

        elif "e5-mistral" in model_id:
            model_type = "E5_Mistral"
            # This is a very large model, needs GPU
            model = SentenceTransformer(model_id, device=device)

        elif "scibert" in model_id:
            model_type = "AutoModel_Mean"
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            model = AutoModel.from_pretrained(model_id).to(device)

        elif "specter2_base" in model_id:
            model_type = "AdapterModel_CLS"
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            model = AutoAdapterModel.from_pretrained(model_id).to(device)
            adapter_name = "allenai/specter2"
            model.load_adapter(adapter_name, source="hf", set_active=True)

        print(f"Loaded {model_id} in {time.time() - start_time:.2f} seconds.")

    except Exception as e:
        print(f"Failed to load {model_id}: {e}")
        continue

    # --- 1. Create Chunk Embeddings (The "Database") ---
    chunk_embeddings = None
    if model_type == "SentenceTransformer":
        chunk_embeddings = model.encode(hard_test_chunks, normalize_embeddings=True)
    elif model_type == "E5_Mistral":
        # E5 needs a "passage: " prefix for chunks
        passage_chunks = [f"passage: {chunk}" for chunk in hard_test_chunks]
        chunk_embeddings = model.encode(passage_chunks, normalize_embeddings=True)
    else: # SciBERT and Specter
        encoded_input = tokenizer(hard_test_chunks, padding=True, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            model_output = model(**encoded_input)

        if model_type == "AutoModel_Mean":
            chunk_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
        elif model_type == "AdapterModel_CLS":
            chunk_embeddings = cls_pooling(model_output)

        chunk_embeddings = F.normalize(chunk_embeddings, p=2, dim=1).cpu().numpy()

    chunk_embeddings = torch.tensor(chunk_embeddings).to(device)


    # --- 2. Run All Queries ---
    for q_name, q_text in test_queries:
        print(f"\n  QUERY: '{q_name}'")

        query_vector = None
        if model_type == "SentenceTransformer":
            query_vector = model.encode(q_text, normalize_embeddings=True)
        elif model_type == "E5_Mistral":
            # E5 needs a "query: " prefix for queries
            query_vector = model.encode(f"query: {q_text}", normalize_embeddings=True)
        else: # SciBERT and Specter
            encoded_input = tokenizer(q_text, padding=True, truncation=True, return_tensors='pt').to(device)
            with torch.no_grad():
                model_output = model(**encoded_input)

            if model_type == "AutoModel_Mean":
                query_vector = mean_pooling(model_output, encoded_input['attention_mask'])
            elif model_type == "AdapterModel_CLS":
                query_vector = cls_pooling(model_output)

            query_vector = F.normalize(query_vector, p=2, dim=1).cpu().numpy()

        query_vector = torch.tensor(query_vector).to(device)

        # 3. Calculate Scores
        scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)
        top_scores, top_indices = torch.topk(scores, k=3)

        # 4. Print results
        for i in range(3):
            score = top_scores[0][i].item()
            chunk_index = top_indices[0][i].item()
            print(f"    {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")

print("\n" + "="*80)
print("✅ Bake-off Complete.")


🚀 TESTING MODEL: BAAI/bge-m3
Loaded BAAI/bge-m3 in 4.12 seconds.

  QUERY: 'Conceptual F=ma'
    1. [Chunk 6] (Score: 0.6012)
    2. [Chunk 1] (Score: 0.5680)
    3. [Chunk 0] (Score: 0.5372)

  QUERY: 'Hard LaTeX (Coulomb)'
    1. [Chunk 0] (Score: 0.5666)
    2. [Chunk 1] (Score: 0.4780)
    3. [Chunk 6] (Score: 0.4477)

  QUERY: 'Hard LaTeX (Gravity)'
    1. [Chunk 1] (Score: 0.5293)
    2. [Chunk 0] (Score: 0.4944)
    3. [Chunk 6] (Score: 0.4743)

  QUERY: 'Conceptual Math'
    1. [Chunk 2] (Score: 0.4666)
    2. [Chunk 4] (Score: 0.4224)
    3. [Chunk 3] (Score: 0.4166)

  QUERY: 'Keyword Chem'
    1. [Chunk 4] (Score: 0.6618)
    2. [Chunk 5] (Score: 0.4095)
    3. [Chunk 1] (Score: 0.3315)

  QUERY: 'Keyword Chem 2'
    1. [Chunk 5] (Score: 0.7637)
    2. [Chunk 4] (Score: 0.3767)
    3. [Chunk 6] (Score: 0.3334)

🚀 TESTING MODEL: intfloat/e5-mistral-7b-instruct
Failed to load intfloat/e5-mistral-7b-instruct: cannot import name 'LossKwargs' from 'transformers.utils' (/usr/loca

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded allenai/specter2_base in 0.90 seconds.

  QUERY: 'Conceptual F=ma'


RuntimeError: self must be a matrix

In [4]:
# === 3. Load Our Winning Model ===
model_name = "BAAI/bge-m3"
print(f"Loading model: {model_name}...")
model = SentenceTransformer(model_name)
print("Model loaded.")

Loading model: BAAI/bge-m3...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/123 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/54.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/687 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/444 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/191 [00:00<?, ?B/s]

Model loaded.


In [5]:
# === 4. Create Our "Vector Database" ===
print("Generating embeddings for all 7 chunks...")
# We use normalize_embeddings=True to make cosine similarity calculation faster
chunk_embeddings = model.encode(hard_test_chunks, normalize_embeddings=True)
# Convert to a PyTorch tensor
chunk_embeddings = torch.tensor(chunk_embeddings)
print(f"Created 'database' tensor of shape: {chunk_embeddings.shape}")

Generating embeddings for all 7 chunks...
Created 'database' tensor of shape: torch.Size([7, 1024])


In [6]:
# === 5. Define the Retrieval Stress-Test Queries ===
test_queries = [
    # 1. User's Conceptual Query
    "What is the relation between force and acceleration?",

    # 2. Hard LaTeX Query (Coulomb)
    "Find information on $\vec{F} = k_e \frac{q_1 q_2}{r^2} \hat{r}$",

    # 3. Hard LaTeX "Nuisance" Query (Gravity)
    "Find information on $\vec{F} = G \frac{m_1 m_2}{r^2} \hat{r}$",

    # 4. Math Conceptual Query
    "How do I find the area under a curve?",

    # 5. Chemistry Keyword Query
    "What is a mole?",

    # 6. Benzene Query
    "What is Benzene ($C_6H_6$)?"
]
print("\n" + "="*50)
print("🚀 RUNNING MANUAL RETRIEVAL STRESS-TEST 🚀")
print("="*50 + "\n")





🚀 RUNNING MANUAL RETRIEVAL STRESS-TEST 🚀



  "Find information on $\vec{F} = k_e \frac{q_1 q_2}{r^2} \hat{r}$",
  "Find information on $\vec{F} = G \frac{m_1 m_2}{r^2} \hat{r}$",


In [7]:
for query in test_queries:
    print(f"\nQUERY: '{query}'")

    # 1. Create the query embedding
    query_vector = model.encode(query, normalize_embeddings=True)
    query_vector = torch.tensor(query_vector)

    # 2. Calculate Cosine Similarity
    # This is the "search"
    # We multiply the single query vector (1, 1024)
    # by the transposed chunk database (1024, 7)
    # The result is a (1, 7) tensor of scores.
    scores = torch.mm(query_vector.unsqueeze(0), chunk_embeddings.T)

    # 3. Get the top results
    # We sort the scores and get the top 3
    top_scores, top_indices = torch.topk(scores, k=3)

    # 4. Print results
    print("--- Top 3 Results ---")
    for i in range(3):
        score = top_scores[0][i].item()
        chunk_index = top_indices[0][i].item()
        retrieved_text = hard_test_chunks[chunk_index]

        print(f"  {i+1}. [Chunk {chunk_index}] (Score: {score:.4f})")
        print(f"      > {retrieved_text[:100]}...")
    print("---")


QUERY: 'What is the relation between force and acceleration?'
--- Top 3 Results ---
  1. [Chunk 6] (Score: 0.6012)
      > Newton's Second Law (Dynamics) states that the force $ec{F}$ acting on a body is equal to the mass ...
  2. [Chunk 1] (Score: 0.5680)
      > Newton's Law of Universal Gravitation states that any two bodies in the universe attract each other ...
  3. [Chunk 0] (Score: 0.5372)
      > Coulomb's Law describes the electrostatic force $ec{F}$ between two stationary point charges, $q_1$...
---

QUERY: 'Find information on $ec{F} = k_e rac{q_1 q_2}{r^2} \hat{r}$'
--- Top 3 Results ---
  1. [Chunk 0] (Score: 0.5666)
      > Coulomb's Law describes the electrostatic force $ec{F}$ between two stationary point charges, $q_1$...
  2. [Chunk 1] (Score: 0.4780)
      > Newton's Law of Universal Gravitation states that any two bodies in the universe attract each other ...
  3. [Chunk 6] (Score: 0.4477)
      > Newton's Second Law (Dynamics) states that the force $ec{F}$ a