# 🚀 O-ISAC Extraction Pipeline v3.0

**Optimized PRISMA-compliant data extraction for O-ISAC systematic review**

## Features:
- ✅ **Resume Support** - Only processes new/changed PDFs
- ✅ **v2.0 Schema** - PRISMA Protocol Section 9 aligned
- ✅ **GPU Optimized** - Batched visual analysis
- ✅ **Async LLM** - Parallel Groq API calls
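
The pipeline's own async code is not shown in this notebook, so the following is only a minimal sketch of how parallel API calls are commonly bounded with `asyncio`. The names `run_bounded` and `fake_extract` are hypothetical, and `fake_extract` is a stand-in that sleeps instead of calling any real API.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency=4):
    """Apply the async `worker` to every item, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # The semaphore caps how many workers are in flight at once.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_extract(paper_id):
    """Hypothetical stand-in for an LLM request; it only sleeps briefly."""
    await asyncio.sleep(0.01)
    return {"Paper_ID": paper_id}


# In a notebook cell (an event loop is already running):
#     results = await run_bounded(["O_ISAC_001", "O_ISAC_002"], fake_extract)
```

Capping concurrency this way is one option for staying under a provider's rate limit. The right value for `max_concurrency` depends on the account's quota and is not specified in this notebook.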

## Requirements:
- GPU Runtime (A100 recommended)
- GROQ_API_KEY in Colab secrets
- Google Drive mounted with PDFs

## 📦 Step 1: Install Dependencies

In [1]:
# Install required packages
!pip install -q marker-pdf openai nest_asyncio transformers torch pillow pandas

# Verify marker installation
import shutil
if shutil.which('marker_single'):
    print('✅ Marker installed successfully')
else:
    print('⚠️ Marker not in PATH - may need runtime restart')
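
Because a systematic review benefits from a reproducible environment, it may help to record the installed package versions after the install step. The sketch below uses the standard-library `importlib.metadata`. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the `pip install` line above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip install command in this step.
PACKAGES = ["marker-pdf", "openai", "nest_asyncio", "transformers",
            "torch", "pillow", "pandas"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```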


## 🔗 Step 2: Mount Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

# Verify project path
import os
PROJECT = '/content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST'
PDF_DIR = os.path.join(PROJECT, 'data/retrieved_docs')

if os.path.exists(PDF_DIR):
    pdfs = [f for f in os.listdir(PDF_DIR) if f.endswith('.pdf')]
    print(f'✅ Found {len(pdfs)} PDFs in retrieved_docs')
else:
    print('❌ PDF directory not found!')

Mounted at /content/drive
✅ Found 32 PDFs in retrieved_docs
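
The check above matches only a lowercase `.pdf` suffix, so files named `*.PDF` would be missed. A minimal sketch of a case-insensitive listing with `pathlib` follows. The helper name `list_pdfs` is hypothetical and is not part of the pipeline module.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return PDF files in `directory`, sorted by name, matching .pdf in any case."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )
```

In this notebook it could be called as `list_pdfs(PDF_DIR)`. Whether the pipeline itself treats uppercase extensions the same way is not shown here.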


## ⚙️ Step 3: Load Pipeline Module

In [3]:
# Add notebooks folder to path and import pipeline
import sys
sys.path.insert(0, os.path.join(PROJECT, 'analysis/notebooks'))

from extraction_pipeline_v3 import (
    Config, CheckpointManager,
    phase1_marker_conversion,
    phase2_visual_analysis,
    phase3_llm_extraction,
    run_full_pipeline
)

# Initialize directories
Config.init_dirs()
print('✅ Pipeline loaded')
print(f'📁 Output: {Config.OUTPUT_DIR}')

✅ Pipeline loaded
📁 Output: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST/data/extraction_results_v3


## 🔑 Step 4: Verify API Key

Make sure `GROQ_API_KEY` is set in Colab Secrets (🔑 icon in the left sidebar).

In [4]:
from google.colab import userdata
try:
    key = userdata.get('GROQ_API_KEY')
    print(f'✅ GROQ_API_KEY found ({key[:8]}...)')
except Exception as exc:  # e.g. secret missing or notebook access not granted
    print(f'❌ GROQ_API_KEY not available ({type(exc).__name__}). Add it to Colab Secrets.')

✅ GROQ_API_KEY found (gsk_XXCl...)
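
Printing part of a key leaves that fragment in the saved notebook output. As a hedged alternative, the sketch below reports only a short prefix and the key length. `describe_secret` is a hypothetical helper, and the default of four visible characters is an arbitrary choice.

```python
def describe_secret(value, visible=4):
    """Return a log-safe description: a short prefix plus the total length."""
    if not value:
        return "<missing>"
    return f"{value[:visible]}... ({len(value)} chars)"
```

For example, `print(describe_secret(key))` could replace the f-string in the cell above.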


---
## 🎯 Step 5: Run Pipeline

Choose your run mode:
- **Test**: Process first 3 papers
- **Full**: Process all papers
- **Resume**: Continue from checkpoint
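
The three modes map onto keyword arguments that appear in the cells below (`limit`, `skip_phase1`, `skip_phase2`). A small sketch of that mapping follows. The helper `build_run_kwargs` is hypothetical and only builds the argument dictionary; it does not call the pipeline.

```python
def build_run_kwargs(mode, test_limit=3):
    """Translate a run-mode name into keyword arguments for run_full_pipeline."""
    if mode == "test":
        return {"limit": test_limit}
    if mode == "full":
        return {}
    if mode == "resume":
        # Assumes Marker conversion and visual analysis are already checkpointed.
        return {"skip_phase1": True, "skip_phase2": True}
    raise ValueError(f"Unknown mode: {mode!r} (expected 'test', 'full' or 'resume')")


# Usage in this notebook (run_full_pipeline comes from extraction_pipeline_v3):
#     results = run_full_pipeline(**build_run_kwargs("test"))
```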

In [5]:
# ========================================
# 🧪 TEST RUN (First 3 papers)
# ========================================
results = run_full_pipeline(limit=3)
print(f'\n📊 Extracted {len(results)} papers')


🚀 O-ISAC EXTRACTION PIPELINE v3.0
📅 Started: 2025-12-08 14:34

📄 PHASE 1: PDF → MARKDOWN (Marker)
Found 32 PDFs

📋 PDFs to convert: 32

[1/32] 🔨 Processing: O_ISAC_001
   ✅ Done

[2/32] 🔨 Processing: O_ISAC_002
   ✅ Done

[3/32] 🔨 Processing: O_ISAC_003
   ✅ Done

[4/32] 🔨 Processing: O_ISAC_004
   ✅ Done

[5/32] 🔨 Processing: O_ISAC_005
   ✅ Done

[6/32] 🔨 Processing: O_ISAC_006
   ✅ Done

[7/32] 🔨 Processing: O_ISAC_007
   ✅ Done

[8/32] 🔨 Processing: O_ISAC_008
   ✅ Done

[9/32] 🔨 Processing: O_ISAC_009
   ✅ Done

[10/32] 🔨 Processing: O_ISAC_010
   ✅ Done

[11/32] 🔨 Processing: O_ISAC_011
   ✅ Done

[12/32] 🔨 Processing: O_ISAC_012
   ✅ Done

[13/32] 🔨 Processing: O_ISAC_013
   ✅ Done

[14/32] 🔨 Processing: O_ISAC_014
   ✅ Done

[15/32] 🔨 Processing: O_ISAC_015
   ✅ Done

[16/32] 🔨 Processing: O_ISAC_016
   ✅ Done

[17/32] 🔨 Processing: O_ISAC_017
   ✅ Done

[18/32] 🔨 Processing: O_ISAC_018
   ✅ Done

[19/32] 🔨 Processing: O_ISAC_019
   ✅ Done

[20/32] 🔨 Processing: O_ISAC_020
   



Device: CUDA
Loading BLIP model...



Loading DePlot model...



🖼️ Analyzing: O_ISAC_009
   ✅ 20 images analyzed

🖼️ Analyzing: O_ISAC_011
   ✅ 7 images analyzed

🖼️ Analyzing: O_ISAC_012
   ✅ 3 images analyzed

🖼️ Analyzing: O_ISAC_013
   ✅ 21 images analyzed

🖼️ Analyzing: O_ISAC_014
   ✅ 5 images analyzed

🖼️ Analyzing: O_ISAC_015
   ✅ 5 images analyzed

🖼️ Analyzing: O_ISAC_016
   ✅ 4 images analyzed

🖼️ Analyzing: O_ISAC_017
   ✅ 4 images analyzed

🖼️ Analyzing: O_ISAC_018
   ✅ 12 images analyzed

🖼️ Analyzing: O_ISAC_019
   ✅ 7 images analyzed

🖼️ Analyzing: O_ISAC_020
   ✅ 6 images analyzed

🖼️ Analyzing: O_ISAC_021
   ✅ 5 images analyzed

🖼️ Analyzing: O_ISAC_022
   ✅ 15 images analyzed

🖼️ Analyzing: O_ISAC_023
   ✅ 13 images analyzed

🖼️ Analyzing: O_ISAC_024
   ✅ 3 images analyzed

🖼️ Analyzing: O_ISAC_025
   ✅ 2 images analyzed

🖼️ Analyzing: O_ISAC_026
   ✅ 4 images analyzed

🖼️ Analyzing: O_ISAC_027
   ✅ 10 images analyzed

🖼️ Analyzing: O_ISAC_028
   ✅ 6 images analyzed

🖼️ Analyzing: O_ISAC_029
   ✅ 10 images analyzed

🖼️ Analyzing

In [None]:
# ========================================
# 🚀 FULL RUN (All papers)
# ========================================
# Uncomment to run all:
# results = run_full_pipeline()
# print(f'\n📊 Extracted {len(results)} papers')

In [None]:
# ========================================
# ⏭️ SKIP PHASES (Resume from checkpoint)
# ========================================
# If Marker/Visual already done, skip to LLM:
# results = run_full_pipeline(skip_phase1=True, skip_phase2=True)

## 📊 Step 6: View Results

In [6]:
import pandas as pd

csv_path = os.path.join(Config.OUTPUT_DIR, 'extraction_v3.csv')
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    print(f'📊 Total experiments: {len(df)}')
    print(f'📄 Total papers: {df["Paper_ID"].nunique()}')
    print('\n--- Sample Data ---')
    display(df.head(10))
else:
    print('No results yet - run the pipeline first!')

📊 Total experiments: 5
📄 Total papers: 3

--- Sample Data ---


| Unnamed: 0 | Paper_ID | Title | Year | Medium | Exp_ID | Scenario | Data_Rate_Gbps | Range_Resolution_m | ISAC_Relationship | Coupling_Mode |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O_ISAC_001 | Modulation Strategies for Robust Optical Wirel... | 2025 | wireless_vlc | E1 | VLC CE-OFDM system | 1 | NR | single_dual_function | joint_waveform |
| 1 | O_ISAC_002 | Photonic Terahertz Integrated Sensing and Comm... | 2025 | wireless_fso | E1 | Radar-centric photonic terahertz integrated se... | 6 | 0.013 | single_dual_function | joint_waveform |
| 2 | O_ISAC_002 | Photonic Terahertz Integrated Sensing and Comm... | 2025 | wireless_fso | E2 | Multi-channel photonic THz-ISAC system | 120 | 0.0025 | single_dual_function | joint_waveform |
| 3 | O_ISAC_003 | Analysis of Visible Light-Based ISAC Channel C... | 2025 | wireless_vlc | E1 | Monostatic Sensing | NR | NR | NR | NR |
| 4 | O_ISAC_003 | Analysis of Visible Light-Based ISAC Channel C... | 2025 | wireless_vlc | E2 | Bi-static Sensing | NR | NR | NR | NR |
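
The numeric columns mix numbers with the token `NR`, so pandas will read them as strings. The sketch below converts such values to floats and maps `NR` to `None`. It assumes `NR` means "not reported", which this notebook does not state explicitly, and `parse_metric` is a hypothetical helper.

```python
import math


def parse_metric(value, missing_tokens=("NR", "NA", "")):
    """Convert an extracted metric to float, or None when it is missing or unparseable."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if text.upper() in missing_tokens:  # assumption: "NR" marks a value not reported
        return None
    try:
        return float(text)
    except ValueError:
        return None
```

It could be applied per column, for example `df["Data_Rate_Gbps"].map(parse_metric)`. A pandas-native alternative is `pd.to_numeric(df["Data_Rate_Gbps"], errors="coerce")`, which yields `NaN` instead of `None`.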


In [7]:
# Distribution analysis
if 'df' in globals() and len(df) > 0:
    print('\n📈 Medium Distribution:')
    print(df['Medium'].value_counts())

    print('\n📈 ISAC Waveform Relationship:')
    print(df['ISAC_Relationship'].value_counts())

    print('\n📈 Coupling Mode:')
    print(df['Coupling_Mode'].value_counts())


📈 Medium Distribution:
Medium
wireless_vlc    3
wireless_fso    2
Name: count, dtype: int64

📈 ISAC Waveform Relationship:
ISAC_Relationship
single_dual_function    3
NR                      2
Name: count, dtype: int64

📈 Coupling Mode:
Coupling_Mode
joint_waveform    3
NR                2
Name: count, dtype: int64
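
The counts above are per experiment row, so a paper with two experiments contributes twice. If per-paper counts are wanted instead, one possible approach is sketched below on plain dictionaries. `count_per_paper` is a hypothetical helper, and the sample rows are copied from the table shown earlier.

```python
from collections import Counter


def count_per_paper(records, column, id_column="Paper_ID"):
    """Count distinct papers per value of `column`, ignoring repeated experiments."""
    seen = set()
    counts = Counter()
    for row in records:
        key = (row[id_column], row[column])
        if key not in seen:
            seen.add(key)
            counts[row[column]] += 1
    return counts


# Rows taken from the sample output above (only the relevant columns).
sample = [
    {"Paper_ID": "O_ISAC_001", "Medium": "wireless_vlc"},
    {"Paper_ID": "O_ISAC_002", "Medium": "wireless_fso"},
    {"Paper_ID": "O_ISAC_002", "Medium": "wireless_fso"},
    {"Paper_ID": "O_ISAC_003", "Medium": "wireless_vlc"},
    {"Paper_ID": "O_ISAC_003", "Medium": "wireless_vlc"},
]
print(count_per_paper(sample, "Medium"))
```

With the DataFrame from Step 6, the input could be `df.to_dict("records")`.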


---
## 🔧 Utility: Check Checkpoint Status

In [None]:
# View checkpoint status
import json
checkpoint_path = os.path.join(Config.OUTPUT_DIR, 'checkpoint.json')
if os.path.exists(checkpoint_path):
    with open(checkpoint_path) as f:
        cp = json.load(f)
    print(f'✅ Processed: {len(cp.get("processed", {}))} papers')
    print(f'❌ Errors: {len(cp.get("errors", []))}')
    print(f'🕐 Last run: {cp.get("last_run", "Never")}')
else:
    print('No checkpoint file yet')
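
How `CheckpointManager` decides that a PDF is new or changed is not shown in this notebook. One common technique is to compare content hashes, sketched below. It assumes, purely for illustration, that `processed` maps a paper ID (the file stem) to a SHA-256 hex digest; the real checkpoint layout may differ, and both helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pending_pdfs(pdf_paths, processed):
    """Return paths whose hash is missing from, or differs from, `processed`.

    Assumption: `processed` maps file stem -> SHA-256 hex digest.
    """
    pending = []
    for path in map(Path, pdf_paths):
        if processed.get(path.stem) != file_sha256(path):
            pending.append(path)
    return pending
```

Hashing file contents rather than comparing modification times avoids false positives when Drive re-syncs a file without changing it.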

In [None]:
# Force reprocess all (clears checkpoint)
# Uncomment to reset:
# import os
# checkpoint_path = os.path.join(Config.OUTPUT_DIR, 'checkpoint.json')
# if os.path.exists(checkpoint_path):
#     os.remove(checkpoint_path)
#     print('🗑️ Checkpoint cleared!')
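
Deleting the checkpoint is irreversible. As a more cautious alternative, the sketch below renames the file to a timestamped backup so a forced reprocess can still be undone. `archive_checkpoint` is a hypothetical helper, and the backup naming scheme is an arbitrary choice.

```python
import shutil
import time
from pathlib import Path


def archive_checkpoint(checkpoint_path):
    """Move the checkpoint to a timestamped .bak file; return the new path or None."""
    src = Path(checkpoint_path)
    if not src.exists():
        return None
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = src.with_name(f"{src.stem}.{stamp}{src.suffix}.bak")
    shutil.move(str(src), str(dest))
    return dest


# Usage in this notebook (checkpoint_path as defined in the cell above):
#     backup = archive_checkpoint(checkpoint_path)
```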