# Retrieval-Augmented Generation (RAG) Perplexity Scoring Demo

This notebook demonstrates how to use the RAG pipeline to generate perplexity scores for research paper abstracts. We'll walk through the entire process:

1. **Loading sample research paper abstracts** - We'll start with two versions of an abstract (original and modified)
2. **Creating masked abstracts** - Identifying and masking differences between the two abstracts
3. **Generating queries for research services** - Creating semantic search queries based on the masked abstract
4. **Retrieving relevant papers** - Sending queries to research paper services
5. **Culling citation relationships** - Ensuring we don't include papers that cite our test paper
6. **Generating RAG context** - Creating context from retrieved papers
7. **Calculating perplexity scores** - Comparing perplexity scores to identify the "real" abstract

Note: Run this notebook from the project root directory, not from the `notebooks` directory.

Let's begin by importing necessary libraries.

In [None]:
# Import required libraries
import sys
import os
import logging
import json
from pprint import pprint
import datetime
import zlib

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)

# Import project-specific modules
from data.db import get_db_session, close_db_session
from data.models import ResearchPaper, PerplexityScoreEvent
from rag_orchestration.utils import (
    mask_abstract_differences,
    generate_service_queries,
    dispatch_queries,
    cull_citing_papers,
    generate_rag_context,
    assemble_test_abstract_prompt
)
from prompts_library.rag import service_query_creation_template
from services.llm_services import BasicOpenAI, TogetherClient

## Database Connection and Sample Paper

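We open a database session because later steps use it to fetch full text and search stored chunks. Instead of fetching a paper from the database, though, we'll create a toy sample paper with original and modified abstracts to demonstrate the pipeline.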

In [None]:
session = get_db_session()

# Toy sample paper; normally this would be loaded from the database
sample_paper = ResearchPaper(
    id=999999,  # Use a dummy ID that won't conflict with real records
    doi="10.1234/demo.12345",
    title="The Effect of Neural Oscillations on Cognitive Performance in Laboratory Settings",
    authors="Smith, J., Johnson, M., & Williams, P.",
    date="2023-04-15",
    abstract="""Neural oscillations have been implicated in numerous cognitive processes, including attention, memory, and executive function. In this study, we examined the relationship between different frequency bands of neural oscillations and cognitive performance in a controlled laboratory setting. Electroencephalography (EEG) data was collected from 45 healthy participants while they performed a battery of cognitive tasks. Results showed that theta band (4-8 Hz) power positively correlated with working memory performance, while alpha band (8-12 Hz) power was inversely related to attentional control. Beta oscillations (13-30 Hz) were found to predict performance on tasks requiring inhibitory control. These findings suggest that specific neural oscillation patterns serve as electrophysiological signatures of distinct cognitive processes. Our results contribute to the growing understanding of the neural mechanisms underlying cognitive function and may have implications for interventions targeting cognitive enhancement.""",
    gpt4_incorrect_abstract="""Neural oscillations have been implicated in numerous cognitive processes, including attention, memory, and executive function. In this study, we examined the relationship between different frequency bands of neural oscillations and cognitive performance in a controlled laboratory setting. Electroencephalography (EEG) data was collected from 45 healthy participants while they performed a battery of cognitive tasks. Results showed that theta band (4-8 Hz) power negatively correlated with working memory performance, while alpha band (8-12 Hz) power was directly related to attentional control. Gamma oscillations (30-100 Hz) were found to predict performance on tasks requiring inhibitory control. These findings suggest that specific neural oscillation patterns serve as electrophysiological signatures of distinct cognitive processes. Our results contribute to the growing understanding of the neural mechanisms underlying cognitive function and may have implications for interventions targeting cognitive enhancement.""",
    category="neuroscience",
    license="CC-BY-4.0",
    published_journal="Journal of Cognitive Neuroscience"
)

print(f"Created test paper: {sample_paper.title}")
print(f"DOI: {sample_paper.doi}")

## Sample Paper Abstracts

Let's examine both the original and modified (GPT-4 incorrect) versions of the abstract. 

The original abstract represents the real content of the paper, while the modified abstract contains subtle but scientifically significant changes that alter the meaning or claims.

In [None]:
print("=== ORIGINAL ABSTRACT ===")
print(sample_paper.abstract)
print("\n=== GPT-4 MODIFIED ABSTRACT ===")
print(sample_paper.gpt4_incorrect_abstract)

## Creating a Masked Abstract

Now we need to identify the differences between the original and modified abstracts. The `mask_abstract_differences` function compares both versions and replaces each segment that differs with a double-bracket placeholder (e.g., `[[DIFF]]`).

This masking process is crucial as it helps us create search queries that won't be biased towards either version of the abstract.
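
To make the idea concrete, here is a minimal word-level sketch built on the standard library's `difflib`. It is not the project's `mask_abstract_differences` implementation, which is not shown in this notebook; the function name `mask_differences_sketch` and the fixed `[[DIFF]]` placeholder are assumptions for illustration only.

```python
import difflib


def mask_differences_sketch(text_a: str, text_b: str, placeholder: str = "[[DIFF]]") -> str:
    """Hypothetical helper: mask every word span where the two texts disagree."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)

    masked = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if tag == "equal":
            # Shared words are kept verbatim
            masked.extend(words_a[a_start:a_end])
        else:
            # replace / delete / insert all collapse to a single placeholder
            masked.append(placeholder)
    return " ".join(masked)


masked_demo = mask_differences_sketch(
    "theta power positively correlated with memory",
    "theta power negatively correlated with memory",
)
print(masked_demo)  # theta power [[DIFF]] correlated with memory
```

Because only the shared words survive, a query built from the masked text cannot favor either version of the abstract.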

In [None]:
# Create a masked abstract by identifying and bracketing differences
masked_abstract = mask_abstract_differences(
    abstract_a=sample_paper.abstract,
    abstract_b=sample_paper.gpt4_incorrect_abstract
)

print("=== MASKED ABSTRACT ===")
print(masked_abstract)

## Generating Service Queries

Next, we'll use the masked abstract to generate search queries for various academic research services. We use an LLM (in this case, OpenAI's model) to generate appropriate search queries based on the content of the masked abstract.

The queries will target services like arXiv, bioRxiv, and PubMed, while avoiding specific terms that might bias the search toward either the original or the modified version.

In [None]:
# Initialize OpenAI client for query generation
llm_client = BasicOpenAI()

# Generate service queries based on the masked abstract
query_plan = generate_service_queries(
    masked_abstract=masked_abstract,
    llm_client=llm_client,
    template_str=service_query_creation_template
)

print("=== GENERATED SERVICE QUERIES ===")
pprint(query_plan)

## Dispatching Queries to Services

Now we'll send the generated queries to various research services to retrieve relevant papers. The `dispatch_queries` function handles communication with these services and returns a list of research papers that match our queries.

Each service may return papers in different formats, but our function standardizes them into `ResearchPaper` objects.

In [None]:
# Dispatch queries to services
results = dispatch_queries(query_plan, max_results_per_query=5)

# Flatten the results into a single list of papers
candidate_papers = [p for papers in results.values() for p in papers]

# Display the number of papers returned
print(f"Retrieved {len(candidate_papers)} candidate papers from all services")
print("\nSample of retrieved papers:")
for i, paper in enumerate(candidate_papers[:3], 1):
    print(f"\n{i}. {paper.title}")
    print(f"   DOI: {paper.doi}")


## Culling Papers with Citation Relationships

To prevent "data leakage" in our RAG context, we need to remove any papers that cite our test paper. This step is crucial for maintaining the integrity of the perplexity scoring.

If we included papers that cite our test paper, the RAG context might contain information directly derived from the test paper, making it too easy to distinguish between the original and modified abstracts.
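
The filtering step itself can be sketched as a set lookup on DOIs. This is a simplified illustration, not the logic of `cull_citing_papers`: the `PaperStub` class, the `cull_citing_sketch` function, and the idea that the citing DOIs arrive as a ready-made set are all assumptions. Dropping the test paper itself and keeping papers without a DOI are likewise choices made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PaperStub:
    """Hypothetical stand-in for a retrieved paper."""
    doi: Optional[str]
    title: str


def cull_citing_sketch(
    candidates: List[PaperStub], citing_dois: Iterable[str], test_doi: str
) -> List[PaperStub]:
    """Drop candidates whose DOI is in the citing set, plus the test paper itself."""
    # DOIs are case-insensitive, so compare them in lowercase
    blocked = {d.lower() for d in citing_dois} | {test_doi.lower()}
    # Assumption: papers without a DOI cannot be checked, so they are kept
    return [p for p in candidates if not p.doi or p.doi.lower() not in blocked]


demo_candidates = [
    PaperStub("10.1/A", "Cites the test paper"),
    PaperStub("10.1/b", "Unrelated paper"),
    PaperStub("10.1234/demo.12345", "The test paper itself"),
]
kept = cull_citing_sketch(demo_candidates, {"10.1/a"}, "10.1234/demo.12345")
print([p.title for p in kept])  # ['Unrelated paper']
```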

In [None]:
# Remove papers that cite our test paper
culled_papers = cull_citing_papers(
    candidate_papers, 
    test_doi=sample_paper.doi, 
    include_opencitations=True
)

print(f"Original candidate papers: {len(candidate_papers)}")
print(f"After culling papers that cite test paper: {len(culled_papers)}")
print(f"Removed {len(candidate_papers) - len(culled_papers)} citing papers")

## Fetching Full Text and Generating RAG Context

Next, we'll fetch the full text of the remaining papers and generate a RAG context. This involves:

1. Ensuring full-text chunks and embeddings are stored in the database
2. Performing vector similarity search to find relevant chunks
3. Assembling these chunks into a coherent context

If vector search fails to return sufficient data, we'll use a fallback method of concatenating full texts.
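
As a rough illustration of steps 2 and 3 and of the fallback, the sketch below ranks chunks by cosine similarity in plain Python. It does not reflect how `generate_rag_context` is implemented; the function names, the `(text, embedding)` tuple layout, and the `min_chunks` threshold are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_context_sketch(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],  # (chunk text, chunk embedding)
    full_texts: List[str],
    k: int = 15,
    min_chunks: int = 1,
) -> str:
    """Join the k most similar chunks; fall back to full texts if too few are found."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    selected = [text for text, _ in ranked[:k]]
    if len(selected) < min_chunks:
        # Fallback: concatenate the full texts instead
        return "\n\n".join(full_texts)
    return "\n\n".join(selected)


demo_chunks = [
    ("theta and memory", [1.0, 0.0]),
    ("alpha and attention", [0.0, 1.0]),
    ("theta rhythms", [0.9, 0.1]),
]
context_demo = build_context_sketch([1.0, 0.0], demo_chunks, full_texts=[], k=2)
fallback_demo = build_context_sketch([1.0, 0.0], [], full_texts=["full text A", "full text B"])
print(context_demo)
```

A production system would normally delegate the ranking to a vector index or database extension rather than sorting every chunk in Python.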

In [None]:
# Import additional required function
from rag_orchestration.utils import get_or_fetch_research_paper_text

# Ensure full-text chunks and embeddings are stored
print("Fetching full text for culled papers...")
for i, paper in enumerate(culled_papers[:5], 1):  # Limit to 5 for demo purposes
    try:
        text = get_or_fetch_research_paper_text(paper, session)
        print(f"Paper {i}/{min(5, len(culled_papers))}: Retrieved {len(text) if text else 0} chars")
    except Exception as e:
        print(f"Error fetching text for DOI {paper.doi}: {str(e)}")

# Generate RAG context using vector similarity search
print("\nGenerating RAG context...")
rag_context, chunks_retrieved = generate_rag_context(
    session, 
    masked_abstract, 
    k=15,  # Number of chunks to retrieve
    return_chunk_count=True
)

print(f"Retrieved {chunks_retrieved} text chunks for RAG context")
print(f"Generated RAG context with {len(rag_context)} characters")

# Truncate if necessary to a reasonable size for demonstration
rag_size_chars = 10000
if len(rag_context) > rag_size_chars:
    print(f"Truncating RAG context from {len(rag_context)} to {rag_size_chars} chars")
    rag_context = rag_context[:rag_size_chars]
    # Snap to a clean break
    for sep in (". ", ".\n", "\n\n", "\n"):
        pos = rag_context.rfind(sep, max(0, rag_size_chars - 200))
        if pos != -1:
            rag_context = rag_context[: pos + len(sep)]
            break

print(f"Final RAG context: {len(rag_context)} chars")
print("\nSample of RAG context (first 500 chars):")
print(rag_context[:500] + "...")

## Calculating Perplexity Scores

Now we'll calculate perplexity scores for both the original and modified abstracts using the RAG context. The perplexity score measures how "surprised" the language model is by the text. Lower perplexity indicates text that seems more natural or expected to the model.

In our case, we expect the original abstract to have a lower perplexity score than the modified one, as the RAG context should provide information that corroborates the claims in the original abstract.
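
For reference, perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token. The sketch below applies that standard definition to a list of per-token log-probabilities. How `TogetherClient.perplexity_score` obtains and aggregates log-probabilities is not shown in this notebook, so `perplexity_from_logprobs` is a hypothetical helper, not the client's method.

```python
import math
from typing import Sequence


def perplexity_from_logprobs(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns each token probability 0.5 is less "surprised"
# than one that assigns each token probability 0.1.
confident = perplexity_from_logprobs([math.log(0.5)] * 4)
surprised = perplexity_from_logprobs([math.log(0.1)] * 4)
print(f"{confident:.2f} vs {surprised:.2f}")  # 2.00 vs 10.00
```

A perplexity of 2 means the model was, on average, as uncertain as if it were choosing between two equally likely tokens at each step.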

In [None]:
# Initialize the Together.ai client for perplexity scoring
together_model = "meta-llama/Llama-3.3-70B-Instruct-Turbo"
tc = TogetherClient(model=together_model)

# Helper function to calculate and display perplexity
def calculate_perplexity(abstract_text, abstract_source):
    # Assemble the full prompt with RAG context and abstract
    full_prompt = assemble_test_abstract_prompt(rag_context.strip(), abstract_text.strip())
    
    print(f"\n=== {abstract_source.upper()} ABSTRACT ===")
    print(f"Prompt size: {len(full_prompt)} chars")
    
    # Calculate perplexity and zlib ratio
    perplexity = tc.perplexity_score(full_prompt)
    zlib_size = len(zlib.compress(full_prompt.encode("utf-8")))
    zlib_ratio = perplexity / zlib_size if zlib_size else None
    
    print(f"Perplexity score: {perplexity:.6f}")
    print(f"Zlib compression size: {zlib_size}")
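    # zlib_ratio can be None per the guard above, so format it defensively
    ratio_text = f"{zlib_ratio:.8f}" if zlib_ratio is not None else "n/a"
    print(f"Zlib-perplexity ratio: {ratio_text}")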
    
    return perplexity, zlib_size, zlib_ratio

# Calculate perplexity for both abstracts
print(f"Calculating perplexity scores using model: {together_model}")

# Original abstract
orig_perplexity, orig_zlib, orig_ratio = calculate_perplexity(
    sample_paper.abstract, "original"
)

# Modified abstract
mod_perplexity, mod_zlib, mod_ratio = calculate_perplexity(
    sample_paper.gpt4_incorrect_abstract, "modified"
)

## Results Analysis

Let's analyze the results of our perplexity calculations and determine which abstract the model considers more likely to be the original.

In [None]:
# Determine which abstract has lower perplexity
if orig_perplexity < mod_perplexity:
    preferred = "ORIGINAL"
    diff = mod_perplexity - orig_perplexity
    diff_percent = (diff / mod_perplexity) * 100
else:
    preferred = "MODIFIED"
    diff = orig_perplexity - mod_perplexity
    diff_percent = (diff / orig_perplexity) * 100

print("\n=== RESULTS ANALYSIS ===")
print(f"Model preferred the {preferred} abstract")
print(f"Perplexity difference: {diff:.6f} ({diff_percent:.2f}% lower)")


## Storing the Results

Finally, let's store these perplexity scores in the database for future analysis. This is the same process that happens in the `score_rag_perplexity_for_paper` function in the pipeline module.

In [None]:
# Helper function to store perplexity event
def create_perplexity_event(abstract_text, abstract_source, perplexity, zlib_size, zlib_ratio, full_prompt):
    # Create metadata dictionary
    metadata = {
        "method": "rag",
        "rag_size_chars": len(rag_context),
        "abstract_source": abstract_source,
        "chunks_retrieved": chunks_retrieved,
        "demo_notebook": True
    }
    
    # Create perplexity event
    evt = PerplexityScoreEvent(
        research_paper_id=sample_paper.id,
        model=together_model,
        abstract_text=abstract_text,
        abstract_source=abstract_source,
        prompt_template_name="rag_demo",
        full_prompt=full_prompt,
        zlib_compression_size=zlib_size,
        perplexity_score=perplexity,
        zlib_perplexity_ratio=zlib_ratio,
        evaluation_datetime=datetime.datetime.utcnow(),
        additional_metadata=metadata
    )
    
    session.add(evt)
    return evt

# Store perplexity events (commented out to prevent actual database writes)
# Store for original abstract
full_prompt_orig = assemble_test_abstract_prompt(rag_context.strip(), sample_paper.abstract.strip())
# evt_orig = create_perplexity_event(
#     sample_paper.abstract, "original", 
#     orig_perplexity, orig_zlib, orig_ratio, full_prompt_orig
# )

# Store for modified abstract
full_prompt_mod = assemble_test_abstract_prompt(rag_context.strip(), sample_paper.gpt4_incorrect_abstract.strip())
# evt_mod = create_perplexity_event(
#     sample_paper.gpt4_incorrect_abstract, "gpt4", 
#     mod_perplexity, mod_zlib, mod_ratio, full_prompt_mod
# )

# session.commit()
# print("Stored perplexity events in database")
print("Database storage commented out to prevent actual writes")

## Conclusion

In this notebook, we've demonstrated the complete RAG pipeline for perplexity scoring of research paper abstracts. We've seen how to:

1. Create masked abstracts to identify differences between original and modified versions
2. Generate service queries based on these masked abstracts
3. Retrieve and filter relevant papers
4. Generate a RAG context from these papers
5. Calculate perplexity scores for both abstracts
6. Analyze the results to determine which abstract is more likely to be original

This approach demonstrates how RAG can be used to enhance the model's ability to distinguish between real and fake scientific content by grounding its evaluation in relevant literature.

Such techniques can be valuable for detecting misinformation in scientific publications, helping to maintain the integrity of the scientific literature.