In [2]:
import arXiv_rag as ar # local file


In [3]:
from dotenv import load_dotenv
import os
from openai import OpenAI
from IPython.display import Markdown, display

In [6]:
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

ping_client = False  # set to True to send a one-off test request to the API

if ping_client:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is 2 + 2?"}
            ]
        )

        print("✅ Success! Response:")
        print(response.choices[0].message.content)

    except Exception as e:
        print(f"❌ Error: {e}")


# Retrieval-Augmented Generation on arXiv Abstracts (`hep-ph`)

This notebook demonstrates a simple RAG (Retrieval-Augmented Generation) system using abstracts from the [arXiv preprint server](https://arxiv.org).

We:
- Fetch recent abstracts from arXiv
- Embed them using a transformer-based sentence encoder
- Store and retrieve embeddings using FAISS
- Use GPT to generate answers from retrieved context

##  FAISS: Facebook AI Similarity Search

FAISS (Facebook AI Similarity Search) is a high-performance library for **efficient similarity search** over dense vector representations. It is especially well-suited for applications like retrieval-augmented generation (RAG), recommendation systems, and nearest-neighbor search in embedding spaces.  

###  How It Works:
- FAISS stores all of our abstract embeddings as vectors in a **vector index**.
- When a user enters a query, we embed it using the same model (`all-MiniLM-L6-v2`).
- FAISS compares the query vector to the stored vectors and returns the **top-k most similar** entries, ranked by a distance or similarity measure (typically L2 distance or cosine similarity).
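
To make the top-k lookup concrete, here is a minimal pure-Python sketch of what an exact (brute-force) nearest-neighbor search computes. It is not FAISS itself: the helper names `l2_distance` and `top_k_nearest` and the toy 3-dimensional vectors are assumptions made purely for illustration.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(query, vectors, k=3):
    """Return (distance, row_index) pairs for the k stored vectors closest to query."""
    scored = ((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional "embeddings"; the real ones in this notebook have 384 dimensions.
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

results = top_k_nearest(query_vec, stored, k=2)
for dist, idx in results:
    print(f"row {idx}: distance {dist:.3f}")
```

FAISS performs the same comparison with heavily optimized routines. Its flat L2 index reports squared distances rather than their square roots, which produces the same ranking.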

###  Why We Use FAISS:
- **Speed**: Handles millions of vectors efficiently with GPU/CPU support.
- **Scalability**: Works well for large-scale document search.
- **Simplicity**: Easy to use for exact or approximate nearest neighbor search.

###  In This Project:
- We use `IndexFlatL2` (a brute-force exact search index using Euclidean distance).
- Abstracts are embedded once and stored.
- At query time, FAISS retrieves the most semantically similar papers in milliseconds.

This allows us to build a fast and responsive retrieval system. Please visit the [FAISS wiki on GitHub](https://github.com/facebookresearch/faiss/wiki/) for more information.


##  Model Overview: `all-MiniLM-L6-v2`

We use the `all-MiniLM-L6-v2` model from the [SentenceTransformers](https://www.sbert.net/) library to convert text into dense vector embeddings. These embeddings represent the **semantic meaning** of text and are used for similarity search in our RAG system.

###  Key Features:
- **Architecture**: MiniLM (6 Transformer layers, distilled from BERT)
- **Embedding Dimension**: 384
- **Speed**: Extremely fast, making it suitable for real-time or large-scale applications
- **Use Case**: Optimized for general-purpose semantic similarity tasks (e.g., question answering, duplicate detection, clustering)

###  Why We Use It:
- Lightweight and fast — ideal for prototyping and scalable applications
- High-quality embeddings despite small size
- Fine-tuned on a large and diverse collection of sentence pairs (over one billion, including Natural Language Inference data) using a contrastive objective

###  Output:
Each input text (e.g., an arXiv abstract or a user query) is mapped to a 384-dimensional vector that can be compared to other vectors using cosine or Euclidean distance.

This model is especially useful for identifying semantically similar scientific texts, even when exact keywords don’t match.
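
Since the vectors can be compared with either cosine or Euclidean distance, it helps to see how the two relate. The sketch below checks numerically that, for vectors normalized to unit length, the squared L2 distance equals `2 - 2 * cosine_similarity`, so both measures rank neighbors identically. The helper names and example vectors are illustrative assumptions, and the identity only holds if the embeddings are normalized.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = cosine_similarity(a, b)
d2_ab = squared_l2(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = math.isclose(d2_ab, 2 - 2 * cos_ab, rel_tol=1e-9)
print(f"cosine = {cos_ab:.4f}, squared L2 = {d2_ab:.4f}, identity holds: {identity_holds}")
```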


In [8]:
fetch_and_embed_abstracts = False

## Step 1: Fetch Recent arXiv Abstracts

The function `fetch_arxiv_abstracts()` uses the official arXiv API to fetch recent papers in the `hep-ph` category.
We store the title, abstract, arXiv ID, and PDF link for each paper.
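
As a rough illustration of this step, the sketch below builds a query URL for the public arXiv API and parses an Atom response using only the standard library. The helper names `build_arxiv_query_url` and `parse_entries` are hypothetical, the sample feed is made up, and the real `fetch_arxiv_abstracts()` in the local module may work differently. No network request is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by the Atom feed


def build_arxiv_query_url(category="hep-ph", max_results=500):
    """Build a URL requesting the most recently submitted papers in a category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entries(atom_xml):
    """Extract id, title, and abstract from each <entry> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Collapse the line breaks arXiv inserts into titles and summaries.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
        })
    return papers


url = build_arxiv_query_url(max_results=5)
print(url)

# Made-up feed with placeholder values, standing in for a real API response.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Example
      title</title>
    <summary>Example abstract text.</summary>
  </entry>
</feed>"""
parsed = parse_entries(sample_feed)
print(parsed)
```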

## Step 2: Embed Abstracts into Semantic Vectors

We use `sentence-transformers` with the model `all-MiniLM-L6-v2` to convert each abstract into an embedding vector. These embeddings capture the semantic meaning of each paper.

## Step 3: Store Embeddings in a FAISS Index

We store the embeddings in a FAISS index for efficient similarity search. Metadata (like titles and abstracts) is saved separately in a JSON file.

To skip these steps (for example, when an index has already been saved), set `fetch_and_embed_abstracts` to `False`.
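
Keeping metadata in a separate JSON file works because a vector index returns row positions, and those positions can be used to look up the matching paper. The sketch below shows one way to keep the two aligned. The file name `metadata.json`, the helper names, and the placeholder records are all assumptions; the actual layout used by `store_faiss_index()` is not shown in this notebook.

```python
import json
import os
import tempfile


def save_metadata(papers, path):
    """Write paper metadata as a JSON list; list position i matches index row i."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(papers, f, ensure_ascii=False, indent=2)


def load_metadata(path):
    """Read the metadata list back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def lookup(metadata, row_indices):
    """Map row indices returned by a vector search back to paper records."""
    return [metadata[i] for i in row_indices]


# Placeholder records; real entries would hold actual arXiv titles and abstracts.
papers = [
    {"id": "example-1", "title": "First placeholder paper", "abstract": "Placeholder A."},
    {"id": "example-2", "title": "Second placeholder paper", "abstract": "Placeholder B."},
    {"id": "example-3", "title": "Third placeholder paper", "abstract": "Placeholder C."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.json")
    save_metadata(papers, path)
    metadata = load_metadata(path)

# Suppose a search returned rows 2 and 0 as the nearest neighbors.
hits = lookup(metadata, [2, 0])
print([p["title"] for p in hits])
```

The important design point is that embeddings must be added to the index in the same order as the records are written, otherwise row indices point at the wrong paper.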


In [10]:
if fetch_and_embed_abstracts:
    papers = ar.fetch_arxiv_abstracts(query="hep-ph", max_results=500)
    embeddings = ar.embed_abstracts(papers, show_progress_bar=False)
    ar.store_faiss_index(embeddings, papers)

## Step 4: Retrieve Similar Abstracts

Given a user query, we embed it and use FAISS to retrieve the most semantically similar abstracts. These are the papers most relevant to the question being asked. By default, the abstracts are returned along with the titles.



In [12]:
query = input("Ask a physics question.  I will return papers with abstracts for you to read: ")
ppr = ar.retrieve_similar_abstracts(query)

Ask a physics question.  I will return papers with abstracts for you to read:  Higgs Bosons



Top Matching Abstracts:

--- [1] Searching for charged Higgs bosons via $e^+ e^- \to H^\pm W^\mp S$ at the ILC ---
 Abstract: We investigate the phenomenology of the charged Higgs boson at the
International Linear Collider (ILC) within the framework of the type-X
Two-Higgs Doublet Model (2HDM), where a light charged Higgs boson, with a mass
around 200 GeV or even smaller than top quark mass, is still being consistent
with flavor physics data as well as with the colliders experimental data. In
the theoretically and experimentally allowed parameter space, the $e^+ e^- \to
H^\pm W^\mp S$ (with $S = H, A$) production processes can yield signatures with
event rates larger than those from $e^+ e^- \to H^+ H^-$ and offer sensitivity
to the Higgs mixing parameter $\sin(\beta-\alpha)$. We consider the bosonic
$H^\pm \to W^\pm S$ decays, where the neutral scalar $S$ further decays into a
pair of tau leptons. We show, through a detector-level Monte Carlo analysis,
that the resulting $[\tau\tau][

### You can also return just the list of paper titles.

In [None]:
query = input("This query will provide a list of titles. Please ask your question: ")
ppr_titles = ar.retrieve_similar_abstracts(query, include_abstract=False)

## Step 5: Adding a Generative AI Component

For this step, add your OpenAI API key to a `.env` file and make sure your account has available API credit:
```bash
OPENAI_API_KEY=sk-proj-cy...
```

### Model Overview: GPT-4o

**GPT-4o** ("o" for *omni*) is OpenAI's flagship multimodal model, released in May 2024. It is optimized for **multimodal reasoning** and accepts text, vision, and audio inputs, though this notebook uses text only.

Key characteristics:

-  **High Accuracy**: Comparable to GPT-4-turbo in reasoning tasks, with improved response coherence and factuality.
-  **Faster & Cheaper**: Lower latency and cost per token compared to GPT-4-turbo, making it suitable for interactive applications.
-  **Context-Aware**: Supports longer context windows (up to 128k tokens).
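-  **Chat-Optimized**: Built for chat-style usage with role-tagged messages (`system`, `user`, `assistant`). The API itself is stateless, so any conversation history must be included in each request.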

In this notebook, GPT-4o is used to **synthesize answers** from a set of retrieved arXiv abstracts. The model is provided with relevant context and prompted to generate expert-level answers to user queries.

API pricing (rates at GPT-4o's launch; check OpenAI's pricing page for current values):

-  Input: \$0.005 per 1,000 tokens  
-  Output: \$0.015 per 1,000 tokens  

[Token estimator tool](https://platform.openai.com/tokenizer)
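
Using the per-1,000-token rates quoted above, the cost of a single request can be estimated with simple arithmetic. This is a back-of-the-envelope sketch: the function name `estimate_cost_usd` and the token counts are illustrative assumptions, and the constants should be updated if pricing changes.

```python
# Rates quoted in this notebook, in USD per 1,000 tokens.
INPUT_USD_PER_1K = 0.005
OUTPUT_USD_PER_1K = 0.015


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request from its input and output token counts."""
    input_cost = (input_tokens / 1000) * INPUT_USD_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_USD_PER_1K
    return input_cost + output_cost


# Example: a prompt holding a few abstracts (~2,000 tokens) and a ~400-token answer.
cost = estimate_cost_usd(input_tokens=2000, output_tokens=400)
print(f"Estimated cost: ${cost:.4f}")  # 0.010 for input + 0.006 for output
```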

#### Asking GPT-4o Questions About Retrieved arXiv Abstracts

This cell demonstrates how to use OpenAI's GPT-4o model to analyze and answer questions based on a set of retrieved arXiv paper abstracts. 

The workflow is as follows:

1. **Retrieve Relevant Abstracts**  
   We use the `retrieve_similar_abstracts()` function to get the top-k papers related to a given query. These papers include metadata such as title and abstract.

2. **Formulate a Research Question**  
   The user provides a natural language question they'd like to answer using the context of the retrieved papers.

3. **Query GPT-4o via API**  
   The function `ask_question_about_abstracts()`:
   - Concatenates all titles and abstracts into a single context string
   - Builds a structured prompt with this context plus the user's question
   - Sends it to the GPT-4o model via the OpenAI API
   - Returns a concise and informed answer grounded in the abstract content

This enables a form of lightweight, domain-specific retrieval-augmented generation (RAG) using just the OpenAI API and local FAISS-based retrieval.
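
The prompt-building part of the workflow above can be sketched as follows. This is only an assumption about how such a function might be organized: `build_context` and `build_messages` are hypothetical names, the system prompt wording is invented, and the real `ask_question_about_abstracts()` in the local module may differ. The sketch stops short of calling the API.

```python
def build_context(papers):
    """Concatenate titles and abstracts into one numbered context string."""
    blocks = []
    for i, paper in enumerate(papers, start=1):
        blocks.append(f"[{i}] Title: {paper['title']}\nAbstract: {paper['abstract']}")
    return "\n\n".join(blocks)


def build_messages(papers, question):
    """Build chat messages that ground the model's answer in the retrieved abstracts."""
    context = build_context(papers)
    system = (
        "You are a physics research assistant. "
        "Answer using only the abstracts provided."
    )
    user = f"Abstracts:\n\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Placeholder papers standing in for the output of the retrieval step.
example_papers = [
    {"title": "Placeholder paper A", "abstract": "Placeholder abstract A."},
    {"title": "Placeholder paper B", "abstract": "Placeholder abstract B."},
]
messages = build_messages(example_papers, "What do these papers have in common?")
print(messages[1]["content"])

# The messages list could then be sent with the client created earlier, e.g.:
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

Numbering the abstracts in the context makes it easier to ask the model to cite which paper supports each statement.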


In [31]:
answer = ar.ask_question_about_abstracts(ppr, "What is going on with the Higgs Boson these days?")
#print("GPT-4o's Answer:\n", answer)
answer = ar.format_for_markdown(answer)
display(Markdown(f"### GPT-4o's Answer:\n\n{answer}"))

### 🧠 GPT-4o's Answer:

Recent research on the Higgs boson is focused on several key areas:

1. **Charged Higgs Bosons**: Studies are being conducted at the International Linear Collider (ILC) to explore the phenomenology of charged Higgs bosons within the type-X Two-Higgs Doublet Model (2HDM). This involves investigating production processes like $e^+ e^- \to H^\pm W^\mp S$ and analyzing decay channels to detect potential signatures of charged Higgs bosons, particularly those lighter than the top quark.

2. **Higgs Self-Couplings**: Researchers are calculating corrections to the Higgs wave-function renormalization constant due to modified cubic, quartic, and quintic Higgs self-couplings up to the two-loop level. These calculations aim to set constraints on Higgs self-interactions through precision measurements at the high-luminosity LHC and future colliders, enhancing our understanding of Higgs boson interactions.

3. **New Physics Searches**: The Circular Electron-Positron Collider (CEPC) is proposed as a next-generation Higgs factory, offering opportunities to explore physics beyond the Standard Model. It aims to conduct precision measurements and searches for new physics, including exotic decays of the Higgs, dark matter phenomena, and electroweak phase transition studies. The CEPC's capabilities are expected to significantly advance the exploration of fundamental particle physics in the post-Higgs discovery era.

Overall, the focus is on understanding the properties and interactions of the Higgs boson, searching for new physics, and testing theoretical models that extend beyond the Standard Model.