# Arxiv Demo Runbook

This notebook guides you through setting up and running the Arxiv Knowledge Assistant demo.

## Prerequisites
* Unity Catalog enabled workspace
* Permissions to create Catalogs/Schemas/Volumes/Tables
* Permissions to create Agents

## 1. Setup Environment
Install dependencies and configure catalog settings.

In [None]:
%pip install arxiv databricks-sdk streamlit
dbutils.library.restartPython()

In [None]:
import os
import sys
from databricks.sdk import WorkspaceClient

# Add src to path so we can import arxiv_demo
sys.path.append(os.path.abspath("src"))

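# Helper to auto-detect a SQL warehouse
def get_warehouse_id():
    """Return the ID of a running warehouse if one exists, else the first one listed."""
    w = WorkspaceClient()
    warehouses = list(w.warehouses.list())
    if not warehouses:
        # No warehouse is visible to this user; the widget default stays empty.
        return ""

    # Prefer running warehouses. `state` is an SDK enum, so compare its
    # value (not the enum member itself) against the string.
    running = [
        wh for wh in warehouses
        if getattr(wh.state, "value", wh.state) == "RUNNING"
    ]
    if running:
        return running[0].id

    # Fall back to the first available warehouse
    return warehouses[0].id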

detected_wh = get_warehouse_id()

# Configuration Widgets
dbutils.widgets.text("catalog", "arxiv_demo", "Catalog")
dbutils.widgets.text("schema", "main", "Schema")
dbutils.widgets.text("volume", "pdfs", "Volume")
dbutils.widgets.text("warehouse_id", detected_wh, "SQL Warehouse ID")

# Set env vars for the demo code to pick up
os.environ["ARXIV_CATALOG"] = dbutils.widgets.get("catalog")
os.environ["ARXIV_SCHEMA"] = dbutils.widgets.get("schema")
os.environ["ARXIV_VOLUME"] = dbutils.widgets.get("volume")
os.environ["DATABRICKS_WAREHOUSE_ID"] = dbutils.widgets.get("warehouse_id")

print(f"Configuring for {os.environ['ARXIV_CATALOG']}.{os.environ['ARXIV_SCHEMA']}")
print(f"Using Warehouse ID: {os.environ['DATABRICKS_WAREHOUSE_ID']}")
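
The catalog, schema, and volume values set above combine into a Unity Catalog volume path of the form `/Volumes/<catalog>/<schema>/<volume>`. As a minimal sketch, the helper below builds that path and rejects empty or obviously malformed values before any setup runs. `volume_path` is a hypothetical name, not part of `arxiv_demo`, and the identifier check is a deliberately conservative assumption (letters, digits, underscores); Unity Catalog itself accepts more.

```python
import re

# Conservative identifier pattern (an assumption for this sketch; Unity Catalog
# allows more characters when names are quoted).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def volume_path(catalog: str, schema: str, volume: str) -> str:
    """Build the /Volumes path for a Unity Catalog volume.

    Hypothetical helper for illustration; not part of arxiv_demo.
    """
    for label, value in (("catalog", catalog), ("schema", schema), ("volume", volume)):
        if not _IDENT.match(value):
            raise ValueError(f"Invalid {label} name: {value!r}")
    return f"/Volumes/{catalog}/{schema}/{volume}"


print(volume_path("arxiv_demo", "main", "pdfs"))
```

With the widget defaults this prints `/Volumes/arxiv_demo/main/pdfs`. An empty widget value raises `ValueError` instead of producing a broken path.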

In [None]:
from arxiv_demo.setup import DatabricksSetup
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

catalog = os.environ["ARXIV_CATALOG"]
warehouse_id = os.environ["DATABRICKS_WAREHOUSE_ID"]

# Pre-check: make sure the catalog exists, since setup.py expects it
w = WorkspaceClient()
print(f"Ensuring catalog '{catalog}' exists...")
try:
    w.catalogs.get(catalog)
except NotFound:
    # Create the catalog only when it is actually missing, so that permission
    # or network errors surface instead of being masked.
    print(f"Creating catalog {catalog}...")
    w.catalogs.create(name=catalog)

# Run Setup (Schema, Volume, Tables)
setup = DatabricksSetup()
setup.setup_all(warehouse_id)
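
`setup_all` lives in `arxiv_demo/setup.py` and is not shown in this notebook. As a rough sketch of what schema and volume setup typically involves, the function below builds idempotent Databricks SQL statements. The name `build_setup_statements` is hypothetical, and whether `DatabricksSetup` issues these exact statements is an assumption.

```python
def build_setup_statements(catalog: str, schema: str, volume: str) -> list[str]:
    """Return idempotent DDL for a schema and a managed volume.

    Hypothetical illustration; the real DatabricksSetup may differ.
    """
    def q(name: str) -> str:
        # Backtick-quote an identifier, escaping embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    fq_schema = f"{q(catalog)}.{q(schema)}"
    return [
        f"CREATE SCHEMA IF NOT EXISTS {fq_schema}",
        f"CREATE VOLUME IF NOT EXISTS {fq_schema}.{q(volume)}",
    ]


for stmt in build_setup_statements("arxiv_demo", "main", "pdfs"):
    print(stmt)
```

Because both statements use `IF NOT EXISTS`, re-running them against an already configured workspace should be harmless.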

## 2. Ingest Golden Set Papers
Download seminal LLM Agent papers (ReAct, Reflexion, etc.) and upload them to the UC Volume.

In [None]:
from arxiv_demo.ingestion import ArxivIngestion

# Initialize ingestion (uses env vars set above)
ingestion = ArxivIngestion()

# Golden Set IDs
GOLDEN_SET_IDS = [
    "2210.03629", "2303.11366", "2305.04091", "2304.08354", "2305.16291"
]

print("Ingesting golden set papers...")
# The lookup logic is reimplemented inline here (rather than imported from a
# script) so that each step is visible in the notebook.

import arxiv
from arxiv_demo.ingestion import PaperMetadata

client = arxiv.Client()
search = arxiv.Search(id_list=GOLDEN_SET_IDS)
papers = []

for result in client.results(search):
    p = PaperMetadata(
        arxiv_id=result.entry_id.split("/")[-1],
        title=result.title,
        authors=[a.name for a in result.authors],
        abstract=result.summary,
        published=result.published.isoformat(),
        updated=result.updated.isoformat(),
        categories=result.categories,
        pdf_url=result.pdf_url
    )
    papers.append(p)

print(f"Found {len(papers)} papers. Uploading...")
ingestion.download_and_upload(papers, delay_seconds=10.0)
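
`result.entry_id` is a URL of the form `http://arxiv.org/abs/2210.03629v1`, so the `split("/")[-1]` above keeps the version suffix in `arxiv_id`. If downstream tables should key on the version-less ID (an assumption; the demo code may rely on the versioned form), a small helper such as the hypothetical `strip_version` below can normalize it.

```python
import re

# Matches a trailing version suffix such as "v1" or "v12".
_VERSION_SUFFIX = re.compile(r"v\d+$")


def strip_version(arxiv_id: str) -> str:
    """Drop a trailing version suffix from an arXiv ID, if present."""
    return _VERSION_SUFFIX.sub("", arxiv_id)


# Example entry_id in the format returned by the arxiv package.
entry_id = "http://arxiv.org/abs/2210.03629v1"
short_id = entry_id.split("/")[-1]
print(short_id, "->", strip_version(short_id))
```

IDs without a suffix pass through unchanged, so the helper is safe to apply to the entries in `GOLDEN_SET_IDS` as well.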

## 3. Manual Step: Create Knowledge Assistant

1. Go to **Agents** > **Create Agent**.
2. Select **Knowledge Source**: Unity Catalog Volume (`arxiv_demo.main.pdfs`, or the catalog, schema, and volume you configured in step 1).
3. Name it: `arxiv-papers`.
4. Deploy and copy the **Serving Endpoint Name** (e.g. `agents_arxiv-papers`).
5. Paste it below.

In [None]:
dbutils.widgets.text("ka_endpoint", "agents_arxiv-papers", "KA Endpoint (After Creation)")
os.environ["KA_ENDPOINT"] = dbutils.widgets.get("ka_endpoint")

## 4. Manual Step: Create KIE Agent

1. Go to **Agents** > **Create Agent**.
2. Select **Pattern**: Key Information Extraction.
3. Name it: `arxiv-kie`.
4. Configure **Schema** using this JSON Definition:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Generated Schema",
  "type": "object",
  "properties": {
    "affiliation": {
      "description": "The \"affiliation\" field must contain the name of the organization, institution, or company with which the authors are associated. This information should be extracted as a string and may include department names, university names, or corporate entities. Ensure that the extracted content is precise and accurately reflects the authors' affiliations as stated in the source document.",
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ]
    },
    "contributions": {
      "description": "The \"contributions\" field must contain an array of strings that explicitly list the contributions made by the authors to the dataset or research presented in the document. Each entry in the array should clearly articulate a specific contribution, such as data collection, analysis, or writing, and should not include vague or general statements. If no contributions are provided, this field should be set to null.",
      "anyOf": [
        {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        {
          "type": "null"
        }
      ]
    },
    "authors": {
      "description": "The \"authors\" field must contain an array of strings, each representing the full name of an author associated with the dataset or research work. The names should be formatted as \"First Last\" without any titles or affiliations included. This field is required to accurately attribute contributions to the respective authors in the context of the dataset.",
      "anyOf": [
        {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        {
          "type": "null"
        }
      ]
    },
    "title": {
      "description": "The \"title\" field must contain the complete title of the document or work being referenced. It should be a string that accurately reflects the main subject or focus of the content, without any abbreviations or alterations. Ensure that the title is extracted as it appears in the source material, maintaining proper capitalization and punctuation.",
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ]
    },
    "methodology": {
      "type": "string",
      "description": "The \"methodology\" field must describe the specific research methods, experimental designs, techniques, or approaches used to conduct the study. This includes data collection procedures, model architectures, training strategies, evaluation protocols, and any novel technical contributions to the research process itself."
    },
    "limitations": {
      "type": "array",
      "description": "The \"limitations\" field must contain an array of strings listing the acknowledged weaknesses, constraints, and boundaries of the research. This includes scope restrictions, potential biases in data or methods, scenarios where the approach may fail, computational requirements, and areas identified for future improvement.",
      "items": {
        "type": "string"
      }
    },
    "topics": {
      "type": "array",
      "description": "The \"topics\" field must list the main subject areas, research themes, and technical domains covered by the paper. This includes specific tasks addressed (e.g., question answering, code generation), model types (e.g., transformer, diffusion), and application areas (e.g., healthcare, robotics).",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "affiliation",
    "contributions",
    "authors",
    "title",
    "methodology",
    "limitations",
    "topics"
  ]
}
```

5. Deploy and copy the **Serving Endpoint Name**.
6. Paste it below.

In [None]:
dbutils.widgets.text("kie_endpoint", "agents_arxiv-kie", "KIE Endpoint (After Creation)")
os.environ["KIE_ENDPOINT"] = dbutils.widgets.get("kie_endpoint")
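
The schema above marks all seven fields as required, but only `affiliation`, `contributions`, `authors`, and `title` accept `null`. As a minimal sketch of what that means for an extracted record, the stdlib-only checker below validates a dictionary against a hand-transcribed summary of the schema. `check_kie_record` and the sample record are hypothetical, this is not a full JSON Schema validator, and the response envelope of the actual endpoint is not covered.

```python
# Field -> (expected Python type, element type for arrays, nullable?)
# Hand-transcribed from the JSON schema above.
KIE_FIELDS = {
    "affiliation": (str, None, True),
    "contributions": (list, str, True),
    "authors": (list, str, True),
    "title": (str, None, True),
    "methodology": (str, None, False),
    "limitations": (list, str, False),
    "topics": (list, str, False),
}


def check_kie_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, (kind, item_kind, nullable) in KIE_FIELDS.items():
        if name not in record:
            problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if value is None:
            if not nullable:
                problems.append(f"{name} must not be null")
            continue
        if not isinstance(value, kind):
            problems.append(f"{name} should be {kind.__name__}")
        elif item_kind and not all(isinstance(v, item_kind) for v in value):
            problems.append(f"{name} items should be {item_kind.__name__}")
    return problems


# Hypothetical record, for illustration only.
sample = {
    "affiliation": None,
    "contributions": ["Example contribution"],
    "authors": ["First Last"],
    "title": "Example Title",
    "methodology": None,
    "limitations": [],
    "topics": ["question answering"],
}
print(check_kie_record(sample))
```

The sample is rejected only for `methodology`, which is required and not nullable; the `null` in `affiliation` is allowed.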

## 5. Create Evaluation Dataset
Create the table `arxiv_demo.main.eval_questions` for use in the KA Evaluation UI. The cell below runs `scripts/create_eval_table` inline, which works when this notebook is opened from within the repo. `%run` must be the only command in its cell.

In [None]:
%run ./scripts/create_eval_table

## 6. Run the App
Run the Streamlit app directly in this notebook.

In [None]:
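import sys
from streamlit.web.cli import main

# Streamlit's CLI entry point reads sys.argv, so set it to mimic
# `streamlit run app/main.py`. The call below blocks while the app is running.
sys.argv = ["streamlit", "run", "app/main.py"]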
main()
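
Calling `main()` in-process keeps the cell busy for as long as the server runs. An alternative, sketched below, is to start Streamlit as a child process so the notebook stays usable. `streamlit_command` is a hypothetical helper, and `8501` is Streamlit's default port. Whether a port opened this way is reachable from your browser depends on the workspace and cluster configuration, which this sketch does not address.

```python
import sys


def streamlit_command(script: str, port: int = 8501) -> list[str]:
    """Build an argv list that runs a Streamlit script in headless mode."""
    return [
        sys.executable, "-m", "streamlit", "run", script,
        "--server.port", str(port),
        "--server.headless", "true",
    ]


cmd = streamlit_command("app/main.py")
print(" ".join(cmd[1:]))

# To launch without blocking the notebook cell (not executed here):
#   import subprocess
#   proc = subprocess.Popen(cmd)
#   ...
#   proc.terminate()  # stop the server when finished
```

Using `sys.executable -m streamlit` runs the copy of Streamlit installed in the notebook's own Python environment.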