# 🚀 O-ISAC Extraction Pipeline v3.0

**Optimized PRISMA-compliant data extraction for O-ISAC systematic review**

## Features:
- ✅ **Resume Support** - Only processes new/changed PDFs
- ✅ **v2.0 Schema** - PRISMA Protocol Section 9 aligned
- ✅ **GPU Optimized** - Batched visual analysis
- ✅ **Async LLM** - Parallel Groq API calls
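
The "Async LLM" feature refers to issuing several Groq API calls in parallel. The real implementation lives in `extraction_pipeline_v3` and is not shown in this notebook, so the following is only a minimal sketch of one common pattern: bounded concurrency with `asyncio.Semaphore`. The names `fake_llm_call`, `extract_all`, and `max_concurrency` are hypothetical, and the stub coroutine stands in for a real client request.

```python
import asyncio

async def fake_llm_call(paper_id: str) -> dict:
    """Hypothetical stand-in for one LLM request; swap in a real client call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"paper_id": paper_id, "status": "ok"}

async def extract_all(paper_ids, max_concurrency=4):
    """Run one request per paper with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(paper_id):
        async with semaphore:
            return await fake_llm_call(paper_id)

    # gather preserves input order; return_exceptions=True keeps one failed
    # request from discarding the results of the others.
    return await asyncio.gather(
        *(bounded(p) for p in paper_ids), return_exceptions=True
    )

# Works in a plain script. Inside a notebook the event loop is already
# running, so use `await extract_all([...])` in the cell instead.
demo_results = asyncio.run(extract_all(["P001", "P002", "P003"]))
print(demo_results)
```

Capping concurrency this way is a simple guard against API rate limits. In Colab, either `await` the coroutine directly or call `nest_asyncio.apply()` first (the package is installed in Step 1).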

## Requirements:
- GPU Runtime (A100 recommended)
- GROQ_API_KEY in Colab secrets
- Google Drive mounted with PDFs
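
Before installing anything, it can help to confirm that the runtime actually has a GPU attached. The sketch below is one way to do that using only the standard library and the `nvidia-smi` tool; `gpu_summary` is a hypothetical helper name and is not part of the pipeline module.

```python
import shutil
import subprocess

def gpu_summary():
    """Return nvidia-smi's name/memory listing, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Driver present but not usable, command timed out, etc.
        return None
    return out.stdout.strip() or None

summary = gpu_summary()
print(summary if summary else "No GPU detected - switch the runtime type to GPU")
```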

## 📦 Step 1: Install Dependencies

In [None]:
# Install required packages
!pip install -q marker-pdf openai nest_asyncio transformers torch pillow pandas

# Verify marker installation
import shutil
if shutil.which('marker_single'):
    print('✅ Marker installed successfully')
else:
    print('⚠️ Marker not in PATH - may need runtime restart')

## 🔗 Step 2: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Verify project path
import os
PROJECT = '/content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST'
PDF_DIR = os.path.join(PROJECT, 'data/retrieved_docs')

if os.path.exists(PDF_DIR):
    # Match the extension case-insensitively so files named *.PDF are counted too
    pdfs = [f for f in os.listdir(PDF_DIR) if f.lower().endswith('.pdf')]
    print(f'✅ Found {len(pdfs)} PDFs in retrieved_docs')
else:
    print('❌ PDF directory not found!')

## ⚙️ Step 3: Load Pipeline Module

In [None]:
# Add notebooks folder to path and import pipeline
import sys
sys.path.insert(0, os.path.join(PROJECT, 'analysis/notebooks'))

from extraction_pipeline_v3 import (
    Config, CheckpointManager,
    phase1_marker_conversion,
    phase2_visual_analysis,
    phase3_llm_extraction,
    run_full_pipeline
)

# Initialize directories
Config.init_dirs()
print('✅ Pipeline loaded')
print(f'📁 Output: {Config.OUTPUT_DIR}')

## 🔑 Step 4: Verify API Key

Make sure `GROQ_API_KEY` is set in Colab Secrets (the 🔑 icon in the left sidebar).

In [None]:
from google.colab import userdata
try:
    key = userdata.get('GROQ_API_KEY')
    print(f'✅ GROQ_API_KEY found ({key[:8]}...)')
except Exception as exc:  # e.g. secret missing or notebook access not granted
    print(f'❌ GROQ_API_KEY not available ({type(exc).__name__}). Add it to Colab Secrets.')

---
## 🎯 Step 5: Run Pipeline

Choose your run mode:
- **Test**: Process first 3 papers
- **Full**: Process all papers
- **Resume**: Continue from checkpoint
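
Resume mode depends on knowing which PDFs are new or have changed since the last run. How `CheckpointManager` decides this is not shown in this notebook. A minimal sketch of one common approach, comparing content hashes against the checkpoint, is given below. It assumes (this is an assumption, not confirmed by the pipeline) that the checkpoint's `processed` entry maps file names to SHA-256 digests; `file_sha256` and `pending_pdfs` are hypothetical helpers.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def pending_pdfs(pdf_dir, processed):
    """Return PDF names that are new or whose hash differs from `processed`.

    `processed` is assumed to map file name -> SHA-256 hex digest.
    """
    pending = []
    for name in sorted(os.listdir(pdf_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        if processed.get(name) != file_sha256(os.path.join(pdf_dir, name)):
            pending.append(name)
    return pending

# Demonstration on throwaway files: a.pdf is already recorded, b.pdf is new.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.pdf", b"one"), ("b.pdf", b"two")]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(body)
    seen = {"a.pdf": file_sha256(os.path.join(tmp, "a.pdf"))}
    todo = pending_pdfs(tmp, seen)

print(todo)  # ['b.pdf']
```

Hashing content rather than comparing modification times avoids reprocessing files that Drive has merely re-synced, at the cost of reading every PDF once per run.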

In [None]:
# ========================================
# 🧪 TEST RUN (First 3 papers)
# ========================================
results = run_full_pipeline(limit=3)
print(f'\n📊 Extracted {len(results)} papers')

In [None]:
# ========================================
# 🚀 FULL RUN (All papers)
# ========================================
# Uncomment to run all:
# results = run_full_pipeline()
# print(f'\n📊 Extracted {len(results)} papers')

In [None]:
# ========================================
# ⏭️ SKIP PHASES (Resume from checkpoint)
# ========================================
# If Marker/Visual already done, skip to LLM:
# results = run_full_pipeline(skip_phase1=True, skip_phase2=True)

## 📊 Step 6: View Results

In [None]:
import pandas as pd

csv_path = os.path.join(Config.OUTPUT_DIR, 'extraction_v3.csv')
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    print(f'üìä Total experiments: {len(df)}')
    print(f'üìÑ Total papers: {df["Paper_ID"].nunique()}')
    print('\n--- Sample Data ---')
    display(df.head(10))
else:
    print('No results yet - run the pipeline first!')

In [None]:
# Distribution analysis
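if 'df' in globals() and len(df) > 0:
    # Column name -> heading to print above its value counts
    distributions = {
        'Medium': 'Medium Distribution',
        'ISAC_Relationship': 'ISAC Waveform Relationship',
        'Coupling_Mode': 'Coupling Mode',
    }
    for column, title in distributions.items():
        if column not in df.columns:
            print(f'\n⚠️ Column {column!r} missing from results')
            continue
        print(f'\n📈 {title}:')
        print(df[column].value_counts())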

---
## 🔧 Utility: Check Checkpoint Status

In [None]:
# View checkpoint status
import json
checkpoint_path = os.path.join(Config.OUTPUT_DIR, 'checkpoint.json')
if os.path.exists(checkpoint_path):
    with open(checkpoint_path) as f:
        cp = json.load(f)
    print(f'✅ Processed: {len(cp.get("processed", {}))} papers')
    print(f'❌ Errors: {len(cp.get("errors", []))}')
    print(f'🕐 Last run: {cp.get("last_run", "Never")}')
else:
    print('No checkpoint file yet')

In [None]:
# Force reprocess all (clears checkpoint)
# Uncomment to reset:
# import os
# checkpoint_path = os.path.join(Config.OUTPUT_DIR, 'checkpoint.json')
# if os.path.exists(checkpoint_path):
#     os.remove(checkpoint_path)
#     print('üóëÔ∏è Checkpoint cleared!')