In [2]:
import arXiv_rag as ar # local file


In [3]:
from dotenv import load_dotenv
import os
from openai import OpenAI
from IPython.display import Markdown, display

In [4]:
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Set to True to send a one-off test request (this uses API credits).
ping_client = False

if ping_client:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is 2 + 2?"}
            ]
        )

        print("✅ Success! Response:")
        print(response.choices[0].message.content)

    except Exception as e:
        print(f"❌ Error: {e}")


# Retrieval-Augmented Generation on arXiv Abstracts (`hep-ph`)

This notebook demonstrates a simple RAG (Retrieval-Augmented Generation) system using abstracts from the [arXiv preprint server](https://arxiv.org).

We:
- Fetch recent abstracts from arXiv
- Embed them using a transformer-based sentence encoder
- Store and retrieve embeddings using FAISS
- Use GPT to generate answers from retrieved context

##  FAISS: Facebook AI Similarity Search

FAISS (Facebook AI Similarity Search) is a high-performance library for **efficient similarity search** over dense vector representations. It is especially well-suited for applications like retrieval-augmented generation (RAG), recommendation systems, and nearest-neighbor search in embedding spaces.  

###  How It Works:
- FAISS stores all of our abstract embeddings as vectors in a **vector index**.
- When a user enters a query, we embed it using the same model (`all-MiniLM-L6-v2`).
- FAISS compares the query vector to the stored vectors and returns the **top-k most similar** entries, ranked by a distance metric (typically L2 distance, or inner product, which equals cosine similarity when the vectors are normalized).
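
The steps above amount to a nearest-neighbor search. As a minimal pure-Python sketch of what an exact (brute-force) L2 search computes, assuming toy 3-dimensional vectors in place of real 384-dimensional embeddings (the function name `top_k_l2` is illustrative and not part of FAISS):

```python
import heapq

def top_k_l2(query, vectors, k=3):
    """Return the k (squared_distance, index) pairs closest to `query`.

    Brute-force exact search: the query is compared against every stored vector.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance; it ranks results the same way as the true distance.
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    return heapq.nsmallest(k, scored)

# Toy "index" of four stored vectors (hypothetical data).
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
results = top_k_l2([1.0, 0.0, 0.0], stored, k=2)
print(results)
```

FAISS's `IndexFlatL2` performs the same exhaustive comparison with optimized routines and likewise reports squared distances.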

###  Why We Use FAISS:
- **Speed**: Handles millions of vectors efficiently with GPU/CPU support.
- **Scalability**: Works well for large-scale document search.
- **Simplicity**: Easy to use for exact or approximate nearest neighbor search.

###  In This Project:
- We use `IndexFlatL2` (a brute-force exact search index using Euclidean distance).
- Abstracts are embedded once and stored.
- At query time, FAISS retrieves the most semantically similar papers in milliseconds.

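This gives us a fast, responsive retrieval system. Exact search cost grows linearly with the number of vectors, which is fine for a few hundred abstracts; FAISS also provides approximate indexes for larger collections. For more information, see the [FAISS wiki on GitHub](https://github.com/facebookresearch/faiss/wiki/).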


##  Model Overview: `all-MiniLM-L6-v2`

We use the `all-MiniLM-L6-v2` model from the [SentenceTransformers](https://www.sbert.net/) library to convert text into dense vector embeddings. These embeddings represent the **semantic meaning** of text and are used for similarity search in our RAG system.

###  Key Features:
- **Architecture**: MiniLM (6 Transformer layers, distilled from BERT)
- **Embedding Dimension**: 384
- **Speed**: Extremely fast, making it suitable for real-time or large-scale applications
- **Use Case**: Optimized for general-purpose semantic similarity tasks (e.g., question answering, duplicate detection, clustering)

###  Why We Use It:
- Lightweight and fast — ideal for prototyping and scalable applications
- High-quality embeddings despite small size
- Fine-tuned with a contrastive objective on more than 1 billion sentence pairs from diverse sources, including Natural Language Inference (NLI) data

###  Output:
Each input text (e.g., an arXiv abstract or a user query) is mapped to a 384-dimensional vector that can be compared to other vectors using cosine or Euclidean distance.

This model is especially useful for identifying semantically similar scientific texts, even when exact keywords don’t match.
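
The two distance measures mentioned above are closely related: for unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so both produce the same ranking. A small check of that identity, using made-up 2-D vectors rather than real embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_ab = cosine_similarity(a, b)
l2_ab = squared_l2(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(math.isclose(l2_ab, 2 - 2 * cos_ab))
```

This only holds when the embeddings are normalized; with unnormalized vectors, L2 and cosine rankings can differ.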


In [6]:
fetch_and_embed_abstracts = False

## Step 1: Fetch Recent arXiv Abstracts

The function `fetch_arxiv_abstracts()` uses the official arXiv API to fetch recent papers in the `hep-ph` category.
We store the title, abstract, arXiv ID, and PDF link for each paper.
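
The implementation of `fetch_arxiv_abstracts()` lives in the local `arXiv_rag` module and is not shown here. As a rough sketch of how such a fetch might work, the snippet below builds an arXiv API query URL and parses an Atom response using only the standard library. The helper names `build_query_url` and `parse_feed` are hypothetical, and the sample feed is hand-written so the example runs without network access:

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"

def build_query_url(category="hep-ph", max_results=5):
    """Build an arXiv API URL for the newest papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"

def parse_feed(atom_xml):
    """Extract title, abstract, ID, and PDF link from an Atom feed string."""
    papers = []
    for entry in ET.fromstring(atom_xml).iter(f"{ATOM}entry"):
        pdf_link = None
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf":
                pdf_link = link.get("href")
        papers.append({
            "id": entry.findtext(f"{ATOM}id"),
            # Collapse the hard line breaks that appear inside titles and abstracts.
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", "").split()),
            "pdf": pdf_link,
        })
    return papers

# Tiny hand-written feed standing in for a real API response (placeholder ID).
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Example
      title</title>
    <summary>Example abstract text.</summary>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related"/>
  </entry>
</feed>"""

url = build_query_url()
parsed = parse_feed(sample)
print(url)
print(parsed)
```

In a real run, the URL would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_feed`.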

## Step 2: Embed Abstracts into Semantic Vectors

We use `sentence-transformers` with the model `all-MiniLM-L6-v2` to convert each abstract into an embedding vector. These embeddings capture the semantic meaning of each paper.

## Step 3: Store Embeddings in a FAISS Index

We store the embeddings in a FAISS index for efficient similarity search. Metadata (like titles and abstracts) is saved separately in a JSON file.

To skip these three steps and reuse a previously built index, set `fetch_and_embed_abstracts` to `False`.
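
For Step 3, a flat vector index stores only vectors and identifies them by insertion order, so the metadata has to be kept in the same order to map search hits back to papers. The exact file layout used by `store_faiss_index()` is not shown in this notebook; the sketch below assumes a plain JSON list and uses hypothetical helper names (`save_metadata`, `lookup`):

```python
import json
import os
import tempfile

def save_metadata(papers, path):
    """Write paper metadata as a JSON list; list position = row in the index."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(papers, f, ensure_ascii=False, indent=2)

def lookup(indices, path):
    """Map row numbers returned by a vector search back to paper records."""
    with open(path, encoding="utf-8") as f:
        papers = json.load(f)
    return [papers[i] for i in indices]

# Hypothetical records; real ones would come from the fetch step.
records = [
    {"title": "Paper A", "abstract": "First abstract."},
    {"title": "Paper B", "abstract": "Second abstract."},
    {"title": "Paper C", "abstract": "Third abstract."},
]

path = os.path.join(tempfile.mkdtemp(), "metadata.json")
save_metadata(records, path)

# Suppose a search returned rows 2 and 0 as the nearest neighbors.
hits = lookup([2, 0], path)
print([p["title"] for p in hits])
```

The key design point is that the embeddings and the metadata list must be written in the same order and never reshuffled independently.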


In [8]:
if fetch_and_embed_abstracts:
    papers = ar.fetch_arxiv_abstracts(query="hep-ph", max_results=500)
    embeddings = ar.embed_abstracts(papers, show_progress_bar = False)
    ar.store_faiss_index(embeddings, papers)

## Step 4: Retrieve Similar Abstracts

Given a user query, we embed it and use FAISS to retrieve the most semantically similar abstracts. These are the papers most relevant to the question being asked. By default, the full abstracts are returned.



In [10]:
query = input("Ask a physics question.  I will return papers with abstracts for you to read: ")
ppr = ar.retrieve_similar_abstracts(query)

Ask a physics question.  I will return papers with abstracts for you to read:  What is going on with the Higss Boson?



Top Matching Abstracts:

--- [1] Deconstructing resonant Higgs pair production at the LHC: effects of coloured and neutral scalars in the NMSSM test case ---
 Abstract: We study resonant production of pairs of Standard Model (SM)-like Higgs
bosons, in the presence of new neutral Higgs states together with new coloured
scalars (stops or sbottoms) in loops within the Next-to-Minimal Supersymmetric
SM (NMSSM). This is used as a test case to prove that the Large Hadron Collider
has sensitivity to a variety of effects stemming from interferences between
resonant (heavy) Higgs diagrams and/or among these and non-resonant topologies
involving loops of both tops and stops. These effects can alter significantly
the naive description of individual $s$-channel Breit-Wigner resonances,
leading to distortions of the latter which, on the one hand, may mask their
presence but, on the other hand, could enable one to extract features of the
underlying new physics scenario. This last aspect is made pos

### You can also return just the list of paper titles.

In [12]:
query = input("This query will provide a list of titles. Please ask your question: ")
ppr_titles = ar.retrieve_similar_abstracts(query, include_abstract = False)

This query will provide a list of titles. Please ask your question:  What is going on with the Higgs Boson?



Top Matches:

--- [1] Status of the W boson mass and the future of the electroweak fit in the next decades ---
--- [2] Loop-corrected Trilinear Higgs Self-Couplings in the NMSSM with Inverse Seesaw Mechanism ---
--- [3] Deconstructing resonant Higgs pair production at the LHC: effects of coloured and neutral scalars in the NMSSM test case ---


## Step 5: Adding a Generative AI Component

For this step, add your OpenAI API key to a `.env` file and make sure your account has available credit:
```python
OPENAI_API_KEY=sk-proj-cy...
``` 

### Model Overview: GPT-4o

**GPT-4o** ("o" for *omni*) was OpenAI's flagship model when it was released in May 2024. It is a **multimodal** model that accepts text, image, and audio inputs (this notebook uses text only).

Key characteristics:

-  **High Accuracy**: Comparable to GPT-4-turbo in reasoning tasks, with improved response coherence and factuality.
-  **Faster & Cheaper**: Lower latency and cost per token compared to GPT-4-turbo, making it suitable for interactive applications.
-  **Long Context**: Supports a context window of up to 128k tokens.
-  **Chat-Optimized**: Built for chat-style usage with role-tagged messages (system, user, assistant). The API is stateless, so any earlier turns must be resent with each request.

In this notebook, GPT-4o is used to **synthesize answers** from a set of retrieved arXiv abstracts. The model is provided with relevant context and prompted to generate expert-level answers to user queries.

API pricing at launch (rates change, so check OpenAI's pricing page for current values):

-  Input: \$0.005 per 1,000 tokens  
-  Output: \$0.015 per 1,000 tokens  

[Token estimator tool](https://platform.openai.com/tokenizer)
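
As a worked example of the per-token rates listed above, the snippet below estimates the cost of a single request. The token counts are hypothetical; actual counts depend on the retrieved abstracts and the length of the answer.

```python
INPUT_PRICE_PER_1K = 0.005   # USD per 1,000 input tokens, from the list above
OUTPUT_PRICE_PER_1K = 0.015  # USD per 1,000 output tokens, from the list above

def estimate_cost(input_tokens, output_tokens):
    """Estimate the USD cost of one request at the rates above."""
    input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

# Hypothetical request: three abstracts plus a question (about 1,500 tokens in)
# and an answer of about 400 tokens.
cost = estimate_cost(1500, 400)
print(f"Estimated cost: {cost:.4f} USD")
```

At these rates a request of that size costs roughly 1.35 cents, so a few dozen queries stay well under one dollar.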

#### Asking GPT-4o Questions About Retrieved arXiv Abstracts

This cell demonstrates how to use OpenAI's GPT-4o model to analyze and answer questions based on a set of retrieved arXiv paper abstracts. 

The workflow is as follows:

1. **Retrieve Relevant Abstracts**  
   We use the `retrieve_similar_abstracts()` function to get the top-k papers related to a given query. These papers include metadata such as title and abstract.

2. **Formulate a Research Question**  
   The user provides a natural language question they'd like to answer using the context of the retrieved papers.

3. **Query GPT-4o via API**  
   The function `ask_question_about_abstracts()`:
   - Concatenates all titles and abstracts into a single context string
   - Builds a structured prompt with this context plus the user's question
   - Sends it to the GPT-4o model via the OpenAI API
   - Returns a concise and informed answer grounded in the abstract content

This enables a form of lightweight, domain-specific retrieval-augmented generation (RAG) using just the OpenAI API and local FAISS-based retrieval.
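
The prompt-building part of step 3 can be sketched as follows. The actual prompt used by `ask_question_about_abstracts()` is not shown in this notebook, so the wording, the system message, and the helper name `build_messages` are all assumptions made for illustration:

```python
def build_messages(papers, question):
    """Assemble chat messages that ground the model in retrieved abstracts."""
    # Number each paper so the answer can cite "Paper 1", "Paper 2", ...
    context = "\n\n".join(
        f"Paper {i}: {p['title']}\nAbstract: {p['abstract']}"
        for i, p in enumerate(papers, start=1)
    )
    user_prompt = (
        "Use only the abstracts below to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return [
        {"role": "system", "content": "You are an expert in high-energy physics."},
        {"role": "user", "content": user_prompt},
    ]

# Hypothetical retrieved papers.
retrieved = [
    {"title": "Title one", "abstract": "Abstract one."},
    {"title": "Title two", "abstract": "Abstract two."},
]
messages = build_messages(retrieved, "What do these papers study?")
print(messages[1]["content"])
```

The resulting list can be passed as the `messages` argument to `client.chat.completions.create(model="gpt-4o", ...)`, in the same way as the test request near the top of the notebook.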


In [14]:
answer = ar.ask_question_about_abstracts(ppr, "What is going on with the Higgs Boson these days?")
#print("GPT-4o's Answer:\n", answer)
answer = ar.format_for_markdown(answer)
display(Markdown(f"### GPT-4o's Answer:\n\n{answer}"))

### GPT-4o's Answer:

Recent research on the Higgs boson has been exploring various extensions and modifications of the Standard Model (SM) to understand its properties and interactions better. Here are some key developments:

1. **Resonant Higgs Pair Production**: Researchers are investigating the production of Higgs boson pairs at the Large Hadron Collider (LHC) in the context of the Next-to-Minimal Supersymmetric Standard Model (NMSSM). This involves studying the effects of new neutral Higgs states and colored scalars, such as stops or sbottoms, which can significantly alter the expected signals. These studies aim to identify potential new physics scenarios by analyzing the interference effects between resonant and non-resonant processes (Paper 1).

2. **Trilinear Higgs Self-Couplings**: Another area of focus is the loop-corrected trilinear Higgs self-couplings in the NMSSM with an Inverse Seesaw Mechanism. This research highlights the importance of higher-order corrections to the Higgs boson mass and self-couplings, which are crucial for precise theoretical predictions. The study finds that new effects from extended neutrino and sneutrino sectors can lead to significant corrections, impacting both the trilinear couplings and the Higgs boson mass (Paper 2).

3. **New Scalar Particles**: There is also interest in new scalar particles associated with additional gauge symmetries, such as a quark-universal $U(1)$ symmetry. These scalars, which may mix with the SM Higgs boson, could be produced at the LHC and decay into Standard Model particles. Such scenarios could lead to observable phenomena like diphoton resonances, providing potential new avenues for discovering physics beyond the Standard Model (Paper 3).

Overall, these studies reflect a vibrant area of research focused on understanding the Higgs boson's role in the broader context of particle physics and exploring potential new physics that could emerge from its interactions.

# Retrieval and Rerank

The function `retrieve_and_rerank(query, initial_k=10, final_k=3)` enhances the dense retrieval step by adding a reranking phase that uses a more expressive cross-encoder model.

##  What is an Encoder in NLP?

An **Encoder** is a model that transforms input text into a dense vector (embedding) that captures its semantic meaning. Encoders are the backbone of many NLP systems, especially those built on the Transformer architecture (e.g., BERT, RoBERTa).

Encoders enable semantic search by mapping both queries and documents into a shared vector space, where similarity can be measured using cosine similarity or Euclidean distance.

---

##  BiEncoder vs. CrossEncoder

| Type | Description | Ranking accuracy | Speed |
|------|-------------|------------------|-------|
| **BiEncoder** | Encodes the query and document **separately**. Suitable for fast retrieval. | Lower | Fast (document embeddings can be precomputed) |
| **CrossEncoder** | Encodes the query and document **together**, allowing deep interaction. | Higher | Slow (every query-document pair is scored at query time) |

- **BiEncoder** (used in retrieval): `all-MiniLM-L6-v2`  
   → Used to embed the query and all abstracts separately for fast FAISS search.

- **CrossEncoder** (used in reranking): `cross-encoder/ms-marco-MiniLM-L-6-v2`  
   → Used to score each query–abstract pair more accurately via token-level attention.

---

## Function: `retrieve_and_rerank(query, initial_k=10, final_k=3)`

This function performs a two-step semantic search:

1. **Retrieve**: 
   - Use a **BiEncoder** to find the top `initial_k` abstracts from the FAISS index.
2. **Rerank**: 
   - Use a **CrossEncoder** to evaluate and sort these top abstracts based on fine-grained semantic relevance to the query.
   - Return the top `final_k` abstracts with highest CrossEncoder scores.

This hybrid approach balances **scalability** (fast retrieval) and **accuracy** (deep reranking).
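
The two-step pattern can be illustrated without any models. In the sketch below, two toy scoring functions stand in for the bi-encoder and the cross-encoder; they are deliberately simplistic and are not how `retrieve_and_rerank()` scores abstracts. All names here are hypothetical:

```python
def retrieve_and_rerank_sketch(query, docs, cheap_score, precise_score,
                               initial_k=10, final_k=3):
    """Two-stage search: cheap shortlist, then precise rerank of the shortlist."""
    # Stage 1: rank every document with the cheap scorer and keep the top initial_k.
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:initial_k]
    # Stage 2: rescore only the shortlist with the expensive scorer.
    reranked = sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:final_k]

def word_overlap(query, doc):
    """Stand-in for the bi-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    """Stand-in for the cross-encoder: also rewards the exact query phrase."""
    bonus = 5 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

docs = [
    "boson mass measurements and the higgs sector",
    "higgs boson pair production at the LHC",
    "neutrino oscillation experiments",
    "the higgs mechanism and a heavy boson",
]
top = retrieve_and_rerank_sketch("higgs boson", docs, word_overlap, phrase_match,
                                 initial_k=3, final_k=2)
print(top)
```

Here the cheap scorer cannot distinguish the three Higgs-related documents, while the more precise scorer promotes the one that contains the exact phrase. In the notebook, FAISS search over bi-encoder embeddings plays the role of `cheap_score`, and with the `sentence-transformers` library the precise scores would typically come from calling `CrossEncoder(...).predict(...)` on a list of (query, abstract) pairs.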



In [26]:
from IPython.display import display, Markdown, HTML
import warnings
warnings.filterwarnings("ignore")  # Optional to silence general warnings


query = "What is going on with the Higgs Boson?"

# Naive top-k
print("=== Naive Dense Retrieval ===")
papers_naive = ar.retrieve_similar_abstracts(query, k=3)
answer_naive = ar.ask_question_about_abstracts(papers_naive, query)
display(Markdown("**Naive Answer:**\n\n" + ar.format_for_markdown(answer_naive)))

# Retrieve + Rerank
print("=== Retrieve and Rerank ===")
papers_rerank = ar.retrieve_and_rerank(query, initial_k=10, final_k=3)
answer_rerank = ar.ask_question_about_abstracts(papers_rerank, query)
display(Markdown("**Reranked Answer:**\n\n" + ar.format_for_markdown(answer_rerank)))


=== Naive Dense Retrieval ===

Top Matching Abstracts:

--- [1] Status of the W boson mass and the future of the electroweak fit in the next decades ---
 Abstract: A precise determination of the W boson mass is an essential test for the
Standard Model of particle physics: the comparison of experimental value and
theoretical prediction allows to probe the internal consistency of the
electroweak sector and could possibly highlight signals of New Physics. We
provide a concise and up-to-date summary of past and recent measurements at
lepton and hadron colliders, a discussion of the known perturbative and
non-perturbative theoretical ingredients used to provide predictions for the
relevant observables, and an overview of future prospects to reduce systematic
uncertainties and to compare different measurements in a consistent way. We
conclude with a brief discussion on the relevance of the global electroweak fit
at present and future colliders.

 arXiv link: http://arxiv.org/pdf/2506.01887v1

**Naive Answer:**

The research papers you provided discuss various aspects of the Higgs boson, particularly within the context of extensions to the Standard Model of particle physics, such as the Next-to-Minimal Supersymmetric Standard Model (NMSSM). Here's a summary of the key points related to the Higgs boson from these papers:

1. **Trilinear Higgs Self-Couplings and Mass Corrections (Paper 2)**:
   - The paper by Thi Nhung Dao et al. focuses on the NMSSM with an Inverse Seesaw Mechanism. It examines higher-order corrections to the SM-like Higgs boson mass and the trilinear Higgs self-couplings.
   - The study finds that these corrections are significant and highly correlated, with new effects from the extended neutrino and sneutrino sectors.
   - The corrections can lead to changes of up to 10% in the effective SM-like trilinear Higgs self-couplings and up to 4.5% in the SM-like Higgs boson mass.
   - These findings are important for understanding the Higgs boson's properties and potential deviations from the Standard Model predictions.

2. **Resonant Higgs Pair Production (Paper 3)**:
   - The paper by Stefano Moretti et al. investigates the resonant production of Higgs boson pairs at the Large Hadron Collider (LHC) within the NMSSM framework.
   - It explores the effects of new neutral Higgs states and colored scalars (stops or sbottoms) in loops, which can interfere with resonant Higgs production.
   - These interferences can significantly alter the expected signal, potentially masking or revealing new physics scenarios.
   - The study provides benchmark points that could exhibit observable features in di-Higgs decay channels during current or future LHC runs.

Overall, these papers highlight ongoing research efforts to understand the Higgs boson's properties and interactions, particularly in scenarios beyond the Standard Model. They explore how new physics, such as additional particles and interactions in the NMSSM, could affect Higgs boson measurements and potentially reveal new insights into fundamental physics.

=== Retrieve and Rerank ===

Reranked Top Matches:

--- [1] Flavour hierarchies, extended groups and composites ---
 Abstract: In these proceedings, I present a composite Higgs model in which the flavour
hierarchies between the third and light families emerge naturally. In
particular, CKM mixing angles turn out to be suppressed while PMNS matrix
remains anarchic. This flavour structure arises as a consequence of the
extended non-universal gauge symmetry of the model and the electroweak charges
of the fundamental fermions of the new composite sector that realises the Higgs
boson as a pseudo Nambu-Goldstone boson. The model is described in detail in
arXiv:2412.14243.

 arXiv link: http://arxiv.org/pdf/2505.15787v1
————————————————————————————————————————————————————————————————————————————————
--- [2] Searching for charged Higgs bosons via $e^+ e^- \to H^\pm W^\mp S$ at the ILC ---
 Abstract: We investigate the phenomenology of the charged Higgs boson at the
International Linear Collider

**Reranked Answer:**

The Higgs boson continues to be a central focus of theoretical and experimental research in particle physics, as evidenced by the recent papers you provided. Here's a summary of the current research directions and findings related to the Higgs boson:

1. **Composite Higgs Models**: In Paper 1, a composite Higgs model is discussed where the Higgs boson is realized as a pseudo Nambu-Goldstone boson. This model aims to explain the flavor hierarchies between different families of particles through an extended non-universal gauge symmetry. The model predicts specific patterns in the CKM and PMNS matrices, which are important for understanding particle mixing and flavor physics.

2. **Charged Higgs Bosons**: Paper 2 explores the potential for discovering charged Higgs bosons at the International Linear Collider (ILC) within the framework of the type-X Two-Higgs Doublet Model (2HDM). The study focuses on the production processes and decay channels of these charged Higgs bosons, particularly in scenarios where they have masses around 200 GeV. The research highlights the potential for observing these particles through specific decay signatures, which could provide insights into Higgs mixing parameters.

3. **Trilinear Higgs Self-Couplings**: Paper 3 investigates the higher-order corrections to the trilinear Higgs self-couplings in the Next-to-Minimal Supersymmetric Standard Model (NMSSM) with an Inverse Seesaw Mechanism. The study emphasizes the importance of loop corrections and their impact on the Higgs boson mass and self-couplings. The research also considers the effects of extended neutrino and sneutrino sectors, which can lead to significant corrections in the Higgs sector.

Overall, these papers reflect ongoing efforts to understand the properties and interactions of the Higgs boson, explore extensions to the Standard Model, and investigate potential new particles and phenomena that could be revealed through future collider experiments.