# 🧠 O-ISAC CoT Master Pipeline

Run the entire extraction pipeline from a single notebook.

**Stages:**
1. 📦 Setup & Mount
2. 🏭 Phase 1: Data Prep (PDF → Markdown)
3. 🖼️ Phase 2: Visual Analysis
4. 🧠 Phase 3: CoT Extraction
5. 📊 Results & Export

**Requirements:**
- Colab GPU runtime (T4 or A100)
- GROQ_API_KEY (set in Colab Secrets)

---
**Last Updated:** 2025-12-11
**Version:** 1.0

---
## 📦 Section 1: Setup & Mount

In [None]:
# @title 1.1 Install Dependencies
# Phase 1 & 2 heavy dependencies
!pip install marker-pdf -q
!pip install transformers torch pillow -q

# Phase 3 light dependencies
!pip install groq nest_asyncio pandas pyyaml -q

print("✅ All dependencies installed!")

In [None]:
# @title 1.2 Mount Google Drive & Setup Paths
from google.colab import drive
from google.colab import userdata
import os
import sys

# Mount Drive
drive.mount('/content/drive')

# Project Paths
PROJECT_ROOT = '/content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST'
NOTEBOOKS_DIR = os.path.join(PROJECT_ROOT, 'analysis/notebooks')
COT_LAB_DIR = os.path.join(PROJECT_ROOT, 'analysis/cot_laboratory')
PDF_DIR = os.path.join(PROJECT_ROOT, 'data/retrieved_docs')
MARKDOWN_DIR = os.path.join(PROJECT_ROOT, 'data/processed_markdowns')
OUTPUT_DIR = os.path.join(PROJECT_ROOT, 'data/extraction_results_v3')

# Add to Python Path
sys.path.insert(0, NOTEBOOKS_DIR)
sys.path.insert(0, PROJECT_ROOT)

print(f"📁 Project Root: {PROJECT_ROOT}")
print(f"📄 PDF Directory: {PDF_DIR}")
print(f"📝 Markdown Directory: {MARKDOWN_DIR}")
print(f"📊 Output Directory: {OUTPUT_DIR}")
print("✅ Paths configured!")
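
Before the later phases run, it can help to confirm that the configured directories actually exist on the mounted Drive. The following is a minimal sketch, assuming a hypothetical helper `missing_dirs` (not part of the pipeline) that takes a mapping of names to paths:

```python
import os


def missing_dirs(named_paths):
    """Return the names whose paths are not existing directories.

    `named_paths` is a dict such as {"PDF_DIR": "/some/path"}.
    This helper is illustrative only; it is not part of the pipeline.
    """
    return [name for name, path in named_paths.items() if not os.path.isdir(path)]
```

In this notebook it could be called as `missing_dirs({"PDF_DIR": PDF_DIR, "MARKDOWN_DIR": MARKDOWN_DIR, "OUTPUT_DIR": OUTPUT_DIR})`; an empty list means every directory was found.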

In [None]:
# @title 1.3 Load API Key
try:
    os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
    print("✅ GROQ_API_KEY loaded!")
except Exception as e:
    print("❌ ERROR: Add GROQ_API_KEY under 🔑 Secrets in the left sidebar!")
    print(f"   Error detail: {e}")
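
The cell above only reports a missing key and lets the notebook continue. One possible alternative is to fall back to an environment variable and stop early when no key is found. This is a sketch under assumptions: `resolve_api_key` and its parameters are hypothetical names, and the secrets lookup is passed in as a callable so the function also works outside Colab.

```python
import os


def resolve_api_key(name="GROQ_API_KEY", secret_getter=None, environ=None):
    """Return the API key from a secrets getter, else from the environment.

    `secret_getter` is any callable taking the secret name (in Colab this
    could be `userdata.get`). Raises RuntimeError if no key is found.
    """
    environ = os.environ if environ is None else environ
    if secret_getter is not None:
        try:
            value = secret_getter(name)
            if value:
                return value
        except Exception:
            # Secret missing or access denied: fall through to the environment.
            pass
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} not found in secrets or environment")
    return value
```

In Colab this might be used as `os.environ["GROQ_API_KEY"] = resolve_api_key(secret_getter=userdata.get)`, so that a missing key fails in this cell rather than partway through Phase 3.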

---
## 🏭 Section 2: Phase 1 - Data Prep (PDF → Markdown)

**⚠️ Requires GPU!** This step converts PDFs to Markdown using OCR.
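
Since this phase needs a GPU, a quick check before starting can save a failed run. The sketch below assumes the `nvidia-smi` command-line tool is on the PATH when a GPU runtime is attached; `list_gpus` is a hypothetical helper, not part of the pipeline.

```python
import shutil
import subprocess


def list_gpus():
    """Return one description line per GPU reported by `nvidia-smi -L`.

    Returns an empty list when the tool is missing or reports an error,
    which usually means no GPU runtime is attached.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

An empty result from `list_gpus()` suggests switching the Colab runtime type to a GPU before running Phase 1.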

In [None]:
# @title 2.1 Import Pipeline & Check Status
from extraction_pipeline_v3 import Config, CheckpointManager, phase1_marker_conversion

# Initialize
Config.init_dirs()
checkpoint = CheckpointManager(Config.CHECKPOINT_FILE)

# Show current status
processed = checkpoint.data.get('processed', {})
print(f"📊 Current status: {len(processed)} papers processed")
print(f"📂 PDFs: {PDF_DIR}")

# List PDFs
import glob
pdfs = glob.glob(os.path.join(PDF_DIR, '*.pdf'))
print(f"📄 Total PDFs: {len(pdfs)}")
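
The checkpoint count and the PDF count above do not show which papers are still pending. A minimal sketch for listing them follows, assuming two things that the source does not confirm: that each PDF's file stem is its paper ID, and that Phase 1 writes Markdown to `<markdown_dir>/<id>/<id>/<id>.md` (the layout used by the single-paper test later in this notebook). `find_unconverted` is a hypothetical helper.

```python
from pathlib import Path


def find_unconverted(pdf_dir, markdown_dir):
    """Return sorted paper IDs that have a PDF but no Markdown file yet.

    Assumes the PDF stem is the paper ID and the Markdown lives at
    <markdown_dir>/<id>/<id>/<id>.md (an assumption, see lead-in).
    """
    pending = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        paper_id = pdf.stem
        md_path = Path(markdown_dir) / paper_id / paper_id / f"{paper_id}.md"
        if not md_path.is_file():
            pending.append(paper_id)
    return pending
```

Calling `find_unconverted(PDF_DIR, MARKDOWN_DIR)` before and after Phase 1 gives a rough view of conversion progress that is independent of the checkpoint file.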

In [None]:
# @title 2.2 Run PDF → Markdown Conversion (Phase 1)
# ⚠️ This step can take a long time (~1-2 min per paper)

print("⏳ Phase 1: PDF → Markdown conversion starting...")
phase1_marker_conversion(checkpoint, force_all=False)
print("✅ Phase 1 complete!")

---
## 🖼️ Section 3: Phase 2 - Visual Analysis

Visual analysis is performed with the BLIP and DePlot models.

In [None]:
# @title 3.1 Run Visual Analysis (Phase 2)
from extraction_pipeline_v3 import phase2_visual_analysis

print("⏳ Phase 2: Visual analysis starting...")
phase2_visual_analysis(checkpoint)
print("✅ Phase 2 complete!")

---
## 🧠 Section 4: Phase 3 - CoT Extraction

Extracts structured data using Chain-of-Thought (CoT) extraction.

In [None]:
# @title 4.1 Import CoT Laboratory
sys.path.insert(0, COT_LAB_DIR)
from core.assembler import CoTAssembler
from core.batch_runner import CoTFactory

# Default Recipe
RECIPE_PATH = 'analysis/cot_laboratory/recipes/experiment_v1_full_analysis.yaml'

print("✅ CoT Laboratory loaded!")
print(f"📜 Recipe: {RECIPE_PATH}")

In [None]:
# @title 4.2 Single Paper Test
# Test on a single paper first

TEST_PAPER_ID = "O_ISAC_029"  # @param {type:"string"}

import json

# Find paper markdown
paper_path = os.path.join(MARKDOWN_DIR, TEST_PAPER_ID, TEST_PAPER_ID, f"{TEST_PAPER_ID}.md")
vis_path = os.path.join(MARKDOWN_DIR, TEST_PAPER_ID, TEST_PAPER_ID, "visual_analysis.txt")

print(f"📄 Paper: {paper_path}")
print(f"   Exists: {os.path.exists(paper_path)}")

# Read content
with open(paper_path, 'r', encoding='utf-8') as f:
    content = f.read()

# Read visual if exists
visual_content = None
if os.path.exists(vis_path):
    with open(vis_path, 'r', encoding='utf-8') as f:
        visual_content = f.read()
    print(f"🖼️ Visual Analysis: {len(visual_content)} chars")

# Run extraction
assembler = CoTAssembler(PROJECT_ROOT)
result = assembler.run_extraction(
    RECIPE_PATH,
    content,
    paper_id=TEST_PAPER_ID,
    visual_content=visual_content
)

# Show result
if result['status'] == 'success':
    print("\n✅ EXTRACTION SUCCESSFUL!")
    print("\n📋 Reasoning Trace:")
    trace = result['parsed_output'].get('reasoning_trace', [])
    for step in trace:
        # Coerce to str so a missing or non-string value cannot break the slice
        value = str(step.get('value') or '')
        print(f"  {step.get('key')}: {value[:80]}...")
else:
    print(f"\n❌ ERROR: {result.get('error_message')}")

In [None]:
# @title 4.3 Batch Extraction (All Papers)
# ⚠️ This will take a long time! ~2-3 min per paper

RUN_BATCH = False  # @param {type:"boolean"}

if RUN_BATCH:
    factory = CoTFactory(PROJECT_ROOT)
    factory.run_batch(RECIPE_PATH)
else:
    print("ℹ️ Batch mode is off. Set RUN_BATCH = True to run it.")
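
Given the rough figure of 2-3 minutes per paper noted in the cell above, the total batch time can be estimated before enabling `RUN_BATCH`. A small sketch, where `estimate_batch_minutes` is a hypothetical helper and the per-paper bounds default to that rough figure:

```python
def estimate_batch_minutes(n_papers, min_per_paper=2.0, max_per_paper=3.0):
    """Return (low, high) estimated batch duration in minutes.

    The defaults mirror the rough 2-3 minutes per paper noted above;
    actual timing depends on the model and on API rate limits.
    """
    if n_papers < 0:
        raise ValueError("n_papers must be non-negative")
    return n_papers * min_per_paper, n_papers * max_per_paper
```

For a hypothetical corpus of 40 papers this gives a range of 80 to 120 minutes, which is worth comparing against the Colab session limit before starting.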

---
## 📊 Section 5: Results & Export

In [None]:
# @title 5.1 View Latest Logs
import glob

logs_dir = os.path.join(COT_LAB_DIR, 'logs')
log_files = sorted(glob.glob(os.path.join(logs_dir, '*_RESULT.json')))[-5:]

print("📋 Last 5 extraction logs:")
for log in log_files:
    filename = os.path.basename(log)
    # Parse: 20251210_144637_O_ISAC_029_llama-3.3-70b-versatile_RESULT.json
    parts = filename.split('_')
    date = parts[0]
    # Assumes a three-part paper ID such as O_ISAC_029
    paper_id = f"{parts[2]}_{parts[3]}_{parts[4]}"
    print(f"  • {date}: {paper_id}")
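
The index-based parsing above depends on the paper ID having exactly three underscore-separated parts. A regular expression is one way to loosen that. The sketch below assumes the filename pattern shown in the comment above (`<date>_<time>_<paper_id>_<model>_RESULT.json`) and additionally assumes that the model name contains no underscore; `parse_log_name` is a hypothetical helper.

```python
import re

# date (8 digits), time (6 digits), paper ID (may contain underscores),
# model name (assumed to contain no underscore), fixed suffix.
LOG_NAME = re.compile(
    r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<paper_id>.+)_(?P<model>[^_]+)_RESULT\.json$"
)


def parse_log_name(filename):
    """Return a dict with date, time, paper_id and model, or None if no match."""
    match = LOG_NAME.match(filename)
    return match.groupdict() if match else None
```

Returning `None` for unexpected names lets the listing loop skip stray files instead of raising an `IndexError`.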

In [None]:
# @title 5.2 Export to CSV (TODO)
# This function will merge all log JSONs and convert them to CSV

print("📊 The CSV export function is not implemented yet.")
print("   Results are available as JSON in the logs/ folder.")
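
Until the export is implemented, the following sketch shows one way the merge could work using only the standard library. It rests on an assumption: that each `*_RESULT.json` log is a JSON object with a `status` field and a `parsed_output.reasoning_trace` list of `{key, value}` steps, mirroring the result structure used in the single-paper test. The real log schema may differ, and `export_logs_to_csv` is a hypothetical name.

```python
import csv
import glob
import json
import os


def export_logs_to_csv(logs_dir, csv_path):
    """Flatten every *_RESULT.json in logs_dir into one CSV row per log.

    Each reasoning-trace key becomes a column (assumed schema, see lead-in).
    Returns the number of rows written.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(logs_dir, "*_RESULT.json"))):
        with open(path, "r", encoding="utf-8") as f:
            record = json.load(f)
        parsed = record.get("parsed_output") or {}
        row = {"log_file": os.path.basename(path), "status": record.get("status", "")}
        for step in parsed.get("reasoning_trace", []):
            key = step.get("key")
            if key:
                row[str(key)] = step.get("value", "")
        rows.append(row)

    # Fixed columns first, then trace keys in first-seen order.
    fieldnames = ["log_file", "status"]
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Logs that lack a given trace key get an empty cell for that column, so papers with partial extractions still appear in the output.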

---
## ✅ Done!

**Next Steps:**
1. Check the single-paper test results
2. If the quality is good, turn on batch mode
3. Process all papers
4. Export to CSV