# Research Paper Peer Review System - Agentic Testing

This notebook tests the LangGraph-based agentic review system, whose agents return structured outputs defined as Pydantic models.

## Features Tested:
- LangGraph StateGraph workflow
- Pydantic BaseModel state management
- Structured outputs from all agents
- ClarityAgent with ReAct pattern
- RigorAgent with ReAct pattern
- Orchestrator validation with structured decision making

## Setup and Imports

In [None]:
import sys
import os
import asyncio
import json
import time
from pathlib import Path

# Add the backend root to sys.path. A notebook run from app/tests sits two levels
# below it (cwd); a script's __file__ sits three levels below it.
backend_path = Path(__file__).parent.parent.parent if '__file__' in globals() else Path.cwd().parent.parent
if str(backend_path) not in sys.path:
    sys.path.insert(0, str(backend_path))

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print(f"Backend path: {backend_path}")
print(f"OpenAI API Key loaded: {'OPENAI_API_KEY' in os.environ}")

Backend path: /Users/arnabbhattacharya/Desktop/AIE8-certification-challenge/backend
OpenAI API Key loaded: True


In [8]:
import importlib
import app.agents.section.section_analyzer
importlib.reload(app.agents.section.section_analyzer)

<module 'app.agents.section.section_analyzer' from '/Users/arnabbhattacharya/Desktop/AIE8-certification-challenge/backend/app/agents/section/section_analyzer.py'>

In [9]:
# Import the agentic system components
from app.agents.review_controller_langgraph import LangGraphReviewController
from app.agents.clarity import ClarityAgent
from app.agents.rigor import RigorAgent
from app.agents.section import SectionAnalyzer
from app.models.schemas import (
    ReviewState,
    Section,
    Suggestion,
    ClarityIssue,
    ClarityAnalysisResponse,
    RigorIssue,
    OrchestratorDecision
)

print("✓ All imports successful")

✓ All imports successful


## Sample Research Paper

The sample paper below contains clarity and rigor problems (undefined acronyms, vague claims, missing statistical detail) for the agents to find.

In [10]:
SAMPLE_PAPER = """
# Neural Architecture Search with Reinforcement Learning

# 1. Abstract

We present a novel approach to NAS using RL. Our method achieves SOTA results on ImageNet.
The architecture it discovered outperformed manually designed architectures by a significant margin.

# 2. Introduction

Deep learning has revolutionized computer vision. However, designing neural architectures remains 
a time-consuming process that requires expert knowledge. This is problematic because different 
tasks may require different architectures, and it's not always clear what architecture will work 
best for a given task.

In this work, we propose using RL to automatically search for optimal neural architectures. 
Our approach uses a controller RNN to generate architecture descriptions, which are then trained 
and evaluated. The validation accuracy is used as the reward signal to train the controller.

# 3. Methods

This section describes our neural architecture search methodology, including the search space 
design and training procedures.

## 3.1 Architecture Search Space

Our search space includes various layer types including convolutional layers, pooling layers, 
and skip connections. The controller generates a sequence of tokens that specifies the architecture.
Each token represents a decision about the architecture like layer type, kernel size, etc.

## 3.2 Training Procedure

We trained the controller using REINFORCE. The controller generates 100 architectures per iteration.
Each architecture is trained for 50 epochs on CIFAR-10. We used the validation accuracy as the 
reward signal. The controller is then updated using policy gradients.

### 3.2.1 Hyperparameters

The learning rate was set to 0.001 with Adam optimizer. We used a batch size of 32 for training
the controller network.

# 4. Results

Our method discovered architectures that achieved 96.4% accuracy on CIFAR-10 test set. This 
outperformed the previous best result. The discovered architecture also transferred well to 
ImageNet, achieving competitive accuracy.

Figure 1 shows the accuracy over time. As you can see, it improves significantly. The final 
architecture found by our method has an interesting structure with multiple skip connections.

## 4.1 Comparison with Baselines

We compared our approach against manually designed architectures and other NAS methods. Our 
method achieved superior performance while requiring less computational resources.

# 5. Discussion

The results demonstrate that RL can effectively search for neural architectures. One limitation 
is the computational cost - searching for architectures required significant GPU resources. 
Future work could explore more efficient search methods.

Our approach has implications for AutoML and could make deep learning more accessible. It shows 
that automated methods can match or exceed human-designed architectures.

# 6. Conclusion

We presented a method for neural architecture search using reinforcement learning. The approach 
successfully discovered high-performing architectures that outperformed manual designs.
"""

print(f"Sample paper length: {len(SAMPLE_PAPER)} characters")
print(f"Sample paper word count: ~{len(SAMPLE_PAPER.split())} words")

Sample paper length: 3075 characters
Sample paper word count: ~430 words


## Test 1: Section Analyzer

Test the markdown parser that converts the paper's content into Pydantic `Section` models.

In [11]:
analyzer = SectionAnalyzer()
sections = analyzer.parse_markdown(SAMPLE_PAPER)

print(f"\n{'='*80}")
print("SECTION ANALYSIS")
print(f"{'='*80}\n")
print(f"Total sections parsed: {len(sections)}\n")

for i, section in enumerate(sections, 1):
    print(f"{i}. {section.title}")
    print(f"   Level: {section.level}")
    print(f"   Lines: {section.line_start}-{section.line_end}")
    print(f"   Content length: {len(section.content)} chars")
    print(f"   Type: {type(section).__name__}")
    print()

# Verify every parsed section is a Pydantic Section model
assert sections, "Parser returned no sections"
assert all(isinstance(s, Section) for s in sections), "Sections should be Pydantic Section models"
print("✓ All sections are valid Pydantic Section models")


SECTION ANALYSIS

Total sections parsed: 6

1. Abstract
   Level: 1
   Lines: 4-8
   Content length: 194 chars
   Type: Section

2. Introduction
   Level: 1
   Lines: 9-19
   Content length: 603 chars
   Type: Section

3. Methods
   Level: 1
   Lines: 20-41
   Content length: 902 chars
   Type: Section

4. Results
   Level: 1
   Lines: 42-55
   Content length: 632 chars
   Type: Section

5. Discussion
   Level: 1
   Lines: 56-64
   Content length: 420 chars
   Type: Section

6. Conclusion
   Level: 1
   Lines: 65-69
   Content length: 187 chars
   Type: Section

✓ All sections are valid Pydantic Section models


In [12]:
sections[0]

Section(title='Abstract', content='\nWe present a novel approach to NAS using RL. Our method achieves SOTA results on ImageNet.\nThe architecture it discovered outperformed manually designed architectures by a significant margin.\n\n', level=1, line_start=4, line_end=8, section_number='1', parent_section=None, subsections=[])
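
How such a parser might work can be sketched with the standard library alone. This is an assumption, not `SectionAnalyzer`'s actual implementation: `SectionSketch` is a hypothetical dataclass standing in for the Pydantic `Section` model, and `parse_top_level` keeps only numbered level-1 headings, folding subsections into their parent as the output above suggests. It is not guaranteed to reproduce the exact line numbers shown above.

```python
import re
from dataclasses import dataclass

# Matches numbered headings such as "# 2. Introduction" or "## 3.1 Search Space".
HEADING = re.compile(r"^(#+)\s+(\d+(?:\.\d+)*)\.?\s+(.*\S)\s*$")


@dataclass
class SectionSketch:  # hypothetical stand-in for the Pydantic Section model
    title: str
    level: int
    section_number: str
    line_start: int  # 1-based line of the heading
    line_end: int    # 1-based last line before the next top-level heading
    content: str


def parse_top_level(markdown: str) -> list:
    """Split markdown into numbered level-1 sections; deeper headings stay in content."""
    lines = markdown.splitlines()
    starts = []  # (line index, section number, title) per numbered level-1 heading
    for idx, line in enumerate(lines):
        match = HEADING.match(line)
        if match and len(match.group(1)) == 1:
            starts.append((idx, match.group(2), match.group(3)))

    sections = []
    for pos, (idx, number, title) in enumerate(starts):
        end = starts[pos + 1][0] if pos + 1 < len(starts) else len(lines)
        sections.append(SectionSketch(
            title=title,
            level=1,
            section_number=number,
            line_start=idx + 1,
            line_end=end,
            content="\n".join(lines[idx + 1:end]),
        ))
    return sections
```

The unnumbered paper title does not match the heading pattern, which is one possible reason a parser of this kind would report six sections rather than seven.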

## Test 2: Individual Agent Testing

Test ClarityAgent and RigorAgent with structured outputs.

In [13]:
# Initialize agents
clarity_agent = ClarityAgent()
rigor_agent = RigorAgent()

print("✓ Agents initialized successfully")
print(f"  - ClarityAgent: {clarity_agent.agent_name}")
print(f"  - RigorAgent: {rigor_agent.agent_name}")

✓ Agents initialized successfully
  - ClarityAgent: ClarityAgent
  - RigorAgent: RigorAgent


### Test ClarityAgent on Abstract Section

In [14]:
# Find abstract section
abstract_section = next((s for s in sections if 'abstract' in s.title.lower()), None)

if abstract_section:
    # Convert to dict for agent (agents expect dict input)
    section_dict = {
        "title": abstract_section.title,
        "content": abstract_section.content,
        "level": abstract_section.level,
        "line_start": abstract_section.line_start,
        "line_end": abstract_section.line_end
    }
    
    print(f"\n{'='*80}")
    print("TESTING CLARITY AGENT")
    print(f"{'='*80}\n")
    print(f"Section: {abstract_section.title}")
    print(f"Content:\n{abstract_section.content[:200]}...\n")
    
    start_time = time.time()
    clarity_suggestions = await clarity_agent.analyze(section_dict)
    elapsed = time.time() - start_time
    
    print(f"\n{'='*80}")
    print(f"CLARITY ANALYSIS COMPLETE ({elapsed:.2f}s)")
    print(f"{'='*80}\n")
    print(f"Found {len(clarity_suggestions)} clarity issues:\n")
    
    for i, suggestion in enumerate(clarity_suggestions, 1):
        print(f"{i}. {suggestion['title']}")
        print(f"   Severity: {suggestion['severity']}")
        print(f"   Issue: {suggestion['description'][:100]}...")
        print(f"   Fix: {suggestion['suggested_fix'][:100]}...")
        print()
else:
    print("No abstract section found")


TESTING CLARITY AGENT

Section: Abstract
Content:

We present a novel approach to NAS using RL. Our method achieves SOTA results on ImageNet.
The architecture it discovered outperformed manually designed architectures by a significant margin.

...


CLARITY ANALYSIS COMPLETE (12.49s)

Found 3 clarity issues:

1. Clarity Issue
   Issue: The term 'NAS' is not defined, which may confuse readers unfamiliar with the acronym....
   Fix: Define 'NAS' as 'Neural Architecture Search' in the first sentence....

2. Clarity Issue
   Issue: The phrase 'SOTA results' may not be clear to all readers....
   Fix: Replace 'SOTA results' with 'state-of-the-art results' for clarity....

3. Clarity Issue
   Severity: error
   Issue: The phrase 'by a significant margin' is vague and does not provide quantitative context....
   Fix: Provide specific metrics or percentages to quantify the performance improvement over manually design...
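
The timing boilerplate around each `await` can be factored into a small helper. The sketch below is illustrative only: `timed` and `fake_agent` are hypothetical names, and `fake_agent` merely imitates the shape of an agent's `analyze` coroutine. It uses `time.perf_counter`, which is better suited to measuring intervals than `time.time`.

```python
import asyncio
import time


async def timed(awaitable):
    """Await `awaitable` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start


async def fake_agent(section: dict) -> list:
    # Hypothetical stand-in for an agent call; sleeps briefly instead of calling an LLM.
    await asyncio.sleep(0.01)
    return [{"title": "Clarity Issue", "section": section["title"]}]
```

In a notebook cell this would be used as `clarity_suggestions, elapsed = await timed(clarity_agent.analyze(section_dict))`.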



### Test RigorAgent on Methods Section

In [15]:
# Find methods section
methods_section = next((s for s in sections if 'method' in s.title.lower()), None)

if methods_section:
    section_dict = {
        "title": methods_section.title,
        "content": methods_section.content,
        "level": methods_section.level,
        "line_start": methods_section.line_start,
        "line_end": methods_section.line_end
    }
    
    print(f"\n{'='*80}")
    print("TESTING RIGOR AGENT")
    print(f"{'='*80}\n")
    print(f"Section: {methods_section.title}")
    print(f"Content:\n{methods_section.content[:200]}...\n")
    
    start_time = time.time()
    rigor_suggestions = await rigor_agent.analyze(section_dict)
    elapsed = time.time() - start_time
    
    print(f"\n{'='*80}")
    print(f"RIGOR ANALYSIS COMPLETE ({elapsed:.2f}s)")
    print(f"{'='*80}\n")
    print(f"Found {len(rigor_suggestions)} rigor issues:\n")
    
    for i, suggestion in enumerate(rigor_suggestions, 1):
        print(f"{i}. {suggestion['title']}")
        print(f"   Severity: {suggestion['severity']}")
        print(f"   Issue: {suggestion['description'][:100]}...")
        print(f"   Fix: {suggestion['suggested_fix'][:100]}...")
        print()
else:
    print("No methods section found")


TESTING RIGOR AGENT

Section: Methods
Content:

This section describes our neural architecture search methodology, including the search space 
design and training procedures.


## 3.1. Architecture Search Space


Our search space includes various ...


RIGOR ANALYSIS COMPLETE (12.87s)

Found 3 rigor issues:

1. Rigor Issue
   Issue: Lack of detail on search space design...
   Fix: Provide a more detailed description of the specific layer types and their configurations included in...

2. Rigor Issue
   Issue: Insufficient detail on training procedure...
   Fix: Include more information on the training process, such as the number of iterations for training the ...

3. Rigor Issue
   Issue: Missing justification for hyperparameter choices...
   Fix: Add a rationale for the choice of learning rate, optimizer, and batch size, including any experiment...
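
Both agent tests build the same input dict by hand. A minimal sketch of a shared helper follows; `section_to_agent_input` and `AGENT_FIELDS` are hypothetical names, and the field list is taken from the cells above. It relies only on attribute access, so it should work with any object exposing those attributes. If the project uses Pydantic v2, `section.model_dump(include=set(AGENT_FIELDS))` would be an alternative.

```python
from types import SimpleNamespace

# Fields the agent tests above pass to analyze().
AGENT_FIELDS = ("title", "content", "level", "line_start", "line_end")


def section_to_agent_input(section) -> dict:
    """Copy the agent-facing fields from a section object into a plain dict."""
    return {name: getattr(section, name) for name in AGENT_FIELDS}


# Usage with a stand-in object (the values here are illustrative).
example = SimpleNamespace(
    title="Methods", content="...", level=1,
    line_start=20, line_end=41, section_number="3",
)
agent_input = section_to_agent_input(example)
```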



## Test 3: Full LangGraph Workflow

Test the complete LangGraph StateGraph with all agents and orchestrator validation.

In [16]:
# Initialize LangGraph controller
controller = LangGraphReviewController()

print("✓ LangGraph ReviewController initialized")
print("  Workflow nodes: parse_sections, analyze_section, next_section, validate_suggestions")
print("  Conditional edges: continue/validate based on section index")

✓ LangGraph ReviewController initialized
  Workflow nodes: parse_sections, analyze_section, next_section, validate_suggestions
  Conditional edges: continue/validate based on section index


In [17]:
print(f"\n{'='*80}")
print("RUNNING FULL LANGGRAPH WORKFLOW")
print(f"{'='*80}\n")

start_time = time.time()

result = await controller.review(
    content=SAMPLE_PAPER,
    session_id="notebook-test-session",
    target_venue="NeurIPS"
)

elapsed_time = time.time() - start_time

print(f"\n{'='*80}")
print("LANGGRAPH WORKFLOW COMPLETE")
print(f"{'='*80}\n")
print(f"Processing time: {elapsed_time:.2f}s")
print(f"Session ID: {result['session_id']}")
print(f"\nMetadata:")
for key, value in result['metadata'].items():
    print(f"  {key}: {value}")


RUNNING FULL LANGGRAPH WORKFLOW


LANGGRAPH WORKFLOW COMPLETE

Processing time: 87.53s
Session ID: notebook-test-session

Metadata:
  total_sections: 6
  target_venue: NeurIPS
  clarity_suggestions: 18
  rigor_suggestions: 8
  final_suggestions: 6
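
The control flow implied by the node and edge names can be sketched in plain Python, without LangGraph. Everything here is an assumption for illustration: `SketchState` is a hypothetical stand-in for `ReviewState`, the agents are replaced by placeholders, and the real controller's graph wiring and validation criteria may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SketchState:  # hypothetical stand-in for the Pydantic ReviewState
    sections: list
    index: int = 0
    raw: list = field(default_factory=list)
    final: list = field(default_factory=list)


def analyze_section(state: SketchState) -> SketchState:
    # Placeholder for the clarity and rigor agents: one finding of each kind.
    name = state.sections[state.index]
    state.raw += [("clarity", name), ("rigor", name)]
    return state


def next_section(state: SketchState) -> SketchState:
    state.index += 1
    return state


def route(state: SketchState) -> str:
    # Conditional edge: loop while sections remain, then hand off to validation.
    return "continue" if state.index < len(state.sections) else "validate"


def validate_suggestions(state: SketchState) -> SketchState:
    # Placeholder filter only; the orchestrator's real criteria are not shown here.
    state.final = [s for s in state.raw if s[0] == "rigor"]
    return state


def run_workflow(sections) -> SketchState:
    state = SketchState(sections=list(sections))  # stand-in for parse_sections
    while route(state) == "continue":
        state = next_section(analyze_section(state))
    return validate_suggestions(state)
```

The point of the sketch is the routing: the section index drives a loop over `analyze_section` and `next_section`, and validation runs once after the last section.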


## Test 4: Analyze Results

Analyze the final suggestions from the orchestrator.

In [18]:
suggestions = result['suggestions']

print(f"\n{'='*80}")
print("FINAL SUGGESTIONS ANALYSIS")
print(f"{'='*80}\n")
print(f"Total suggestions: {len(suggestions)}\n")

# Group by type
by_type = {}
for s in suggestions:
    stype = s['type']
    by_type[stype] = by_type.get(stype, 0) + 1

print("By Type:")
for stype, count in sorted(by_type.items()):
    print(f"  {stype}: {count}")

# Group by severity
by_severity = {}
for s in suggestions:
    severity = s['severity']
    by_severity[severity] = by_severity.get(severity, 0) + 1

print("\nBy Severity:")
for severity, count in sorted(by_severity.items()):
    print(f"  {severity}: {count}")

# Group by section
by_section = {}
for s in suggestions:
    section = s['section']
    by_section[section] = by_section.get(section, 0) + 1

print("\nBy Section:")
for section, count in sorted(by_section.items(), key=lambda x: -x[1]):
    print(f"  {section}: {count}")


FINAL SUGGESTIONS ANALYSIS

Total suggestions: 6

By Type:
  SuggestionType.CLARITY: 4
  SuggestionType.RIGOR: 2

By Severity:
  SeverityLevel.ERROR: 4

By Section:
  Results: 2
  Abstract: 2
  Introduction: 1
  Methods: 1
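
The three grouping loops above follow one pattern, which `collections.Counter` expresses more compactly. In this sketch, `label` and `count_by` are hypothetical helpers and the sample data is made up for illustration; `label` unwraps enum members so that keys read `clarity` rather than `SuggestionType.CLARITY`.

```python
from collections import Counter
from enum import Enum


def label(value) -> str:
    """Return an enum member's value, or the value itself for plain strings."""
    return str(getattr(value, "value", value))


def count_by(suggestions, key) -> Counter:
    """Count suggestions by the given dict key."""
    return Counter(label(s[key]) for s in suggestions)


class KindSketch(str, Enum):  # hypothetical stand-in for SuggestionType
    CLARITY = "clarity"
    RIGOR = "rigor"


# Illustrative data only, not the notebook's results.
sample = [
    {"type": KindSketch.RIGOR, "section": "Results"},
    {"type": KindSketch.RIGOR, "section": "Results"},
    {"type": KindSketch.CLARITY, "section": "Abstract"},
]
by_type_sketch = count_by(sample, "type")
by_section_sketch = count_by(sample, "section")
```

`Counter.most_common()` then yields the groups sorted by descending count, replacing the explicit `sorted(..., key=lambda x: -x[1])` call.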


## Test 5: Display Top Suggestions

In [19]:
print(f"\n{'='*80}")
print("TOP 10 SUGGESTIONS")
print(f"{'='*80}\n")

for i, suggestion in enumerate(suggestions[:10], 1):
    print(f"{i}. [{suggestion['type'].upper()}] {suggestion['title']}")
    print(f"   Section: {suggestion['section']}")
    print(f"   Severity: {suggestion['severity']}")
    print(f"   Lines: {suggestion.get('line_start', 'N/A')}-{suggestion.get('line_end', 'N/A')}")
    print(f"   Description: {suggestion['description']}")
    if suggestion.get('suggested_fix'):
        print(f"   Suggested Fix: {suggestion['suggested_fix']}")
    print()


TOP 10 SUGGESTIONS

1. [RIGOR] Rigor Issue
   Section: Results
   Severity: SeverityLevel.ERROR
   Lines: 42-55
   Description: Lack of statistical significance testing for accuracy results
   Suggested Fix: Include statistical significance tests (e.g., t-tests) to validate that the accuracy improvement is not due to random chance.

2. [RIGOR] Rigor Issue
   Section: Results
   Severity: SeverityLevel.ERROR
   Lines: 42-55
   Description: No mention of statistical analysis for comparison results
   Suggested Fix: Include statistical analysis (e.g., confidence intervals) for the performance comparison to support claims of superiority.

3. [CLARITY] Clarity Issue
   Section: Abstract
   Severity: SeverityLevel.ERROR
   Lines: 4-8
   Description: The phrase 'by a significant margin' is vague and does not provide quantitative context.
   Suggested Fix: Provide specific metrics or percentages to quantify the performance improvement over manually designed architectures.

4. [CLARITY] Clarit

## Test 6: Verify Structured Outputs

Verify that every final suggestion carries the required fields and uses valid severity and type values, as the Pydantic models are meant to guarantee.

In [20]:
print(f"\n{'='*80}")
print("STRUCTURED OUTPUT VERIFICATION")
print(f"{'='*80}\n")

# Verify all suggestions have required fields
required_fields = ['id', 'type', 'severity', 'title', 'description', 'section']
all_valid = True

for i, suggestion in enumerate(suggestions, 1):
    for field in required_fields:
        if field not in suggestion:
            print(f"❌ Suggestion {i} missing required field: {field}")
            all_valid = False

if all_valid:
    print("✓ All suggestions have required fields")

# Verify severity values are valid
valid_severities = {'info', 'warning', 'error'}
invalid_severities = set()

for suggestion in suggestions:
    if suggestion['severity'] not in valid_severities:
        invalid_severities.add(suggestion['severity'])

if invalid_severities:
    print(f"❌ Invalid severity levels found: {invalid_severities}")
else:
    print("✓ All severity levels are valid")

# Verify suggestion types
valid_types = {'clarity', 'rigor', 'coherence', 'citation', 'best_practices', 'structure'}
invalid_types = set()

for suggestion in suggestions:
    if suggestion['type'] not in valid_types:
        invalid_types.add(suggestion['type'])

if invalid_types:
    print(f"❌ Invalid suggestion types found: {invalid_types}")
else:
    print("✓ All suggestion types are valid")

print("\n✓ Structured output verification complete")
print("  All agent responses validated via Pydantic models")
print("  Data contracts enforced throughout the workflow")


STRUCTURED OUTPUT VERIFICATION

✓ All suggestions have required fields
✓ All severity levels are valid
✓ All suggestion types are valid

✓ Structured output verification complete
  All agent responses validated via Pydantic models
  Data contracts enforced throughout the workflow
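
Earlier output shows severities as `SeverityLevel.ERROR`, yet the membership checks against plain strings pass. One explanation, assumed here rather than confirmed from `app.models.schemas`, is that the enums mix in `str`. The sketch below uses a hypothetical `SeverityLevelSketch` to show the behavior such an enum would have.

```python
import json
from enum import Enum


class SeverityLevelSketch(str, Enum):  # hypothetical; assumes a str mixin
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


valid_severities = {"info", "warning", "error"}
member = SeverityLevelSketch.ERROR

# A str-mixin member hashes and compares like its string value.
is_valid = member in valid_severities

# String methods operate on the value, which is why .upper() works on it.
shouted = member.upper()

# The json module serializes it as an ordinary string.
encoded = json.dumps({"severity": member})
```

A plain `Enum` without the `str` mixin would fail the membership check and raise `TypeError` in `json.dumps`, so the checks in this test depend on that design choice.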


## Test 7: Performance Metrics

In [21]:
print(f"\n{'='*80}")
print("PERFORMANCE METRICS")
print(f"{'='*80}\n")

metadata = result['metadata']
total_sections = metadata['total_sections']
clarity_count = metadata['clarity_suggestions']
rigor_count = metadata['rigor_suggestions']
final_count = metadata['final_suggestions']

print(f"Processing Metrics:")
print(f"  Total time: {elapsed_time:.2f}s")
print(f"  Time per section: {elapsed_time / total_sections:.2f}s")
print(f"  Suggestions per second: {len(suggestions) / elapsed_time:.2f}")

print(f"\nAgent Metrics:")
print(f"  Clarity suggestions: {clarity_count}")
print(f"  Rigor suggestions: {rigor_count}")
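total_before = clarity_count + rigor_count
# Guard against division by zero when the agents return no suggestions
filter_rate = (1 - final_count / total_before) * 100 if total_before else 0.0
print(f"  Total before orchestrator: {total_before}")
print(f"  Final after orchestrator: {final_count}")
print(f"  Orchestrator filter rate: {filter_rate:.1f}%")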

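# Rough token and cost estimate (~1.3 tokens per word in, ~4 characters per token out).
# This is a lower bound: it counts the paper once and only the final suggestions,
# not the per-agent prompts or the suggestions the orchestrator filtered out.
input_tokens = len(SAMPLE_PAPER.split()) * 1.3
output_tokens = len(json.dumps(suggestions)) / 4
total_tokens = input_tokens + output_tokens

# gpt-4o-mini pricing in USD per 1K tokens
cost_per_1k_input = 0.00015
cost_per_1k_output = 0.0006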
estimated_cost = (input_tokens / 1000 * cost_per_1k_input + 
                  output_tokens / 1000 * cost_per_1k_output)

print(f"\nCost Estimation (gpt-4o-mini):")
print(f"  Estimated tokens: ~{int(total_tokens):,}")
print(f"  Estimated cost: ${estimated_cost:.4f}")


PERFORMANCE METRICS

Processing Metrics:
  Total time: 87.53s
  Time per section: 14.59s
  Suggestions per second: 0.07

Agent Metrics:
  Clarity suggestions: 18
  Rigor suggestions: 8
  Total before orchestrator: 26
  Final after orchestrator: 6
  Orchestrator filter rate: 76.9%

Cost Estimation (gpt-4o-mini):
  Estimated tokens: ~1,183
  Estimated cost: $0.0005
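
The two derived metrics can be written as small functions, which makes the arithmetic easy to check. `filter_rate` and `estimate_cost` are hypothetical helper names; the default prices are the per-1K-token constants used in the cell above. With 26 suggestions before the orchestrator and 6 after, the filter rate is (1 - 6/26) × 100 ≈ 76.9%, matching the output.

```python
def filter_rate(before: int, after: int) -> float:
    """Percentage of suggestions removed; 0.0 when there was nothing to filter."""
    if before <= 0:
        return 0.0
    return (1 - after / before) * 100


def estimate_cost(
    input_tokens: float,
    output_tokens: float,
    usd_per_1k_input: float = 0.00015,
    usd_per_1k_output: float = 0.0006,
) -> float:
    """Estimated cost in USD from token counts and per-1K-token prices."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)


rate = filter_rate(26, 6)
```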


## Test 8: Export Results

In [None]:
# Save results to JSON
output_file = "notebook_review_results.json"

output_data = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "session_id": result['session_id'],
    "processing_time": elapsed_time,
    "metadata": metadata,
    "suggestions": suggestions,
    "performance": {
        "time_per_section": elapsed_time / total_sections,
        "suggestions_per_second": len(suggestions) / elapsed_time,
        "orchestrator_filter_rate": (1 - final_count/(clarity_count + rigor_count))*100,
        "estimated_cost": estimated_cost
    }
}

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2)

print(f"\n✓ Results exported to {output_file}")
print(f"  File size: {os.path.getsize(output_file)} bytes")
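
If the exported data ever contains values that `json` cannot serialize directly, such as plain `Enum` members, a `default` hook avoids a `TypeError`. This is a sketch under that assumption: `to_jsonable`, `export_results`, and `KindSketch` are hypothetical, and the example writes to a temporary directory rather than the notebook's output file.

```python
import json
import tempfile
from enum import Enum
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dump: unwrap enum members, stringify anything else."""
    if isinstance(value, Enum):
        return value.value
    return str(value)


def export_results(data: dict, path: Path) -> int:
    """Write data as indented UTF-8 JSON and return the file size in bytes."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, default=to_jsonable)
    return path.stat().st_size


class KindSketch(Enum):  # hypothetical plain enum, not JSON-serializable by default
    RIGOR = "rigor"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.json"
    size = export_results({"type": KindSketch.RIGOR, "count": 1}, target)
    loaded = json.loads(target.read_text(encoding="utf-8"))
```

Reading the file back, as the last line does, is a cheap way to confirm the export is valid JSON.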

## Summary

This notebook demonstrates:

1. ✅ **Section Parsing**: Markdown → Pydantic Section models
2. ✅ **Agent Testing**: Individual ClarityAgent and RigorAgent testing
3. ✅ **Structured Outputs**: All LLM responses validated via Pydantic
4. ✅ **LangGraph Workflow**: Complete StateGraph execution
5. ✅ **Orchestrator Validation**: Structured decision making with filtering
6. ✅ **Data Contracts**: Type-safe agent communication
7. ✅ **Performance Metrics**: Timing, cost, and efficiency analysis

The agentic system successfully processes research papers using:
- **LangGraph** for workflow orchestration
- **Pydantic** for state management and structured outputs
- **ReAct pattern** for interleaved agent reasoning and acting
- **Structured outputs** for guaranteed data contracts