# 🧠 O-ISAC CoT Master Pipeline

Run the entire extraction pipeline from a single notebook.

**Stages:**
1. 📦 Setup & Mount
2. 🏭 Phase 1: Data Prep (PDF → Markdown)
3. 🖼️ Phase 2: Visual Analysis
4. 🧠 Phase 3: CoT Extraction
5. 📊 Results & Export

**Requirements:**
- Colab GPU Runtime (T4 or A100)
- GROQ_API_KEY (set in Colab Secrets)
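
Both requirements can be checked up front. The sketch below uses a hypothetical helper, `check_requirements`, which is not part of the pipeline. It assumes a GPU runtime exposes the `nvidia-smi` binary on the PATH (usually true on Colab GPU runtimes, not guaranteed elsewhere) and that the key has already been copied into `os.environ`, as cell 1.3 does.

```python
import os
import shutil


def check_requirements(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper, not part of extraction_pipeline_v3.
    """
    env = os.environ if env is None else env
    problems = []
    # Assumption: a GPU runtime exposes the `nvidia-smi` binary on PATH.
    if which("nvidia-smi") is None:
        problems.append("No GPU detected: nvidia-smi not found on PATH")
    # An empty or missing key is treated as "not configured".
    if not env.get("GROQ_API_KEY"):
        problems.append("GROQ_API_KEY is not set")
    return problems


for problem in check_requirements():
    print(f"[WARN] {problem}")
```

Returning the problems as a list, rather than raising on the first one, lets the notebook report everything that is missing in a single pass.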

---
**Last Updated:** 2025-12-11
**Version:** 1.0

---
## 📦 Section 1: Setup & Mount

In [1]:
# @title 1.1 Install Dependencies
# Phase 1 & 2 heavy dependencies
!pip install marker-pdf -q
!pip install transformers torch pillow -q

# Phase 3 light dependencies
!pip install groq nest_asyncio pandas pyyaml -q

print("✅ All dependencies installed!")


In [1]:
# @title 1.2 Mount Google Drive & Setup Paths
from google.colab import drive
from google.colab import userdata
import os
import sys

# Mount Drive
drive.mount('/content/drive')

# Project Paths
PROJECT_ROOT = '/content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST'
NOTEBOOKS_DIR = os.path.join(PROJECT_ROOT, 'analysis/notebooks')
COT_LAB_DIR = os.path.join(PROJECT_ROOT, 'analysis/cot_laboratory')
PDF_DIR = os.path.join(PROJECT_ROOT, 'data/retrieved_docs')
MARKDOWN_DIR = os.path.join(PROJECT_ROOT, 'data/processed_markdowns')
OUTPUT_DIR = os.path.join(PROJECT_ROOT, 'data/extraction_results_v3')

# Add to Python Path
sys.path.insert(0, NOTEBOOKS_DIR)
sys.path.insert(0, PROJECT_ROOT)

print(f"📁 Project Root: {PROJECT_ROOT}")
print(f"📄 PDF Directory: {PDF_DIR}")
print(f"📝 Markdown Directory: {MARKDOWN_DIR}")
print(f"📊 Output Directory: {OUTPUT_DIR}")
print("✅ Paths configured!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
📁 Project Root: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST
📄 PDF Directory: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST/data/retrieved_docs
📝 Markdown Directory: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST/data/processed_markdowns
📊 Output Directory: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST/data/extraction_results_v3
✅ Paths configured!
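
Cell 1.2 prints the configured paths but does not check that they exist on the mounted Drive. A minimal sketch of such a check follows; `missing_dirs` is a hypothetical helper, not something the pipeline provides.

```python
from pathlib import Path


def missing_dirs(paths):
    """Return the names of entries in `paths` that are not existing directories."""
    return [name for name, path in paths.items() if not Path(path).is_dir()]


# Usage in the notebook (variable names taken from cell 1.2):
# missing = missing_dirs({"PDF_DIR": PDF_DIR, "MARKDOWN_DIR": MARKDOWN_DIR})
# if missing:
#     raise FileNotFoundError(f"Missing directories: {missing}")
```

Failing here gives a clearer error than a later phase silently finding zero PDFs.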


In [2]:
# @title 1.3 Load API Key
try:
    os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
    print("✅ GROQ_API_KEY loaded!")
except Exception as e:
    print("❌ ERROR: Add GROQ_API_KEY under 🔑 Secrets in the left sidebar!")
    print(f"   Error details: {e}")

✅ GROQ_API_KEY loaded!


---
## 🏭 Section 2: Phase 1 - Data Prep (PDF → Markdown)

**⚠️ Requires a GPU!** This step converts the PDFs to Markdown using OCR.

In [3]:
# @title 2.1 Import Pipeline & Check Status
from extraction_pipeline_v3 import Config, CheckpointManager, phase1_marker_conversion

# Initialize
Config.init_dirs()
checkpoint = CheckpointManager(Config.CHECKPOINT_FILE)

# Show current status
processed = checkpoint.data.get('processed', {})
print(f"📊 Current status: {len(processed)} papers processed")
print(f"📂 PDFs: {PDF_DIR}")

# List PDFs
import glob
pdfs = glob.glob(os.path.join(PDF_DIR, '*.pdf'))
print(f"📄 Total PDFs: {len(pdfs)}")

🌍 Environment: Google Colab
📊 Current status: 32 papers processed
📂 PDFs: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST/data/retrieved_docs
📄 Total PDFs: 32
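
The two counts above only show that the totals match. To see which PDFs still need conversion, the checkpoint can be compared against the file list. The sketch below assumes that the keys of `processed` are paper IDs equal to the PDF file name without its extension (for example `O_ISAC_001`); that is an assumption about the checkpoint format, and `pending_pdfs` is a hypothetical helper.

```python
from pathlib import Path


def pending_pdfs(pdf_paths, processed):
    """Return PDF paths whose file stem is not a key in `processed`.

    Assumption: checkpoint keys are paper IDs equal to the PDF file name
    without its extension. Verify this against the actual checkpoint file.
    """
    return sorted(p for p in pdf_paths if Path(p).stem not in processed)


example = pending_pdfs(
    ["docs/O_ISAC_001.pdf", "docs/O_ISAC_002.pdf"],
    {"O_ISAC_001": {}},
)
print(example)  # ['docs/O_ISAC_002.pdf']
```

In the notebook this would be called as `pending_pdfs(pdfs, processed)`.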


In [4]:
# @title 2.2 Run PDF → Markdown Conversion (Phase 1)
# ⚠️ This step can take a while (~1-2 min per paper)

print("⏳ Phase 1: PDF → Markdown conversion starting...")
phase1_marker_conversion(checkpoint, force_all=False)
print("✅ Phase 1 complete!")

⏳ Phase 1: PDF → Markdown conversion starting...

📄 PHASE 1: PDF → MARKDOWN (Marker)
Found 32 PDFs
   ⏩ O_ISAC_001 - already processed, skipping
   ⏩ O_ISAC_002 - already processed, skipping
   ⏩ O_ISAC_003 - already processed, skipping
   ⏩ O_ISAC_004 - already processed, skipping
   ⏩ O_ISAC_005 - already processed, skipping
   ⏩ O_ISAC_006 - already processed, skipping
   ⏩ O_ISAC_007 - already processed, skipping
   ⏩ O_ISAC_008 - already processed, skipping
   ⏩ O_ISAC_009 - already processed, skipping
   ⏩ O_ISAC_010 - already processed, skipping
   ⏩ O_ISAC_011 - already processed, skipping
   ⏩ O_ISAC_012 - already processed, skipping
   ⏩ O_ISAC_013 - already processed, skipping
   ⏩ O_ISAC_014 - already processed, skipping
   ⏩ O_ISAC_015 - already processed, skipping
   ⏩ O_ISAC_016 - already processed, skipping
   ⏩ O_ISAC_017 - already processed, skipping
   ⏩ O_ISAC_018 - already processed, skipping
   ⏩ O_ISAC_019 - already processed, skipping
   ⏩ O_ISAC_020 - already pro

---
## 🖼️ Section 3: Phase 2 - Visual Analysis

Visual analysis is performed with the BLIP and DePlot models.

In [5]:
# @title 3.1 Run Visual Analysis (Phase 2)
from extraction_pipeline_v3 import phase2_visual_analysis

print("⏳ Phase 2: Visual analysis starting...")
phase2_visual_analysis(checkpoint)
print("✅ Phase 2 complete!")

⏳ Phase 2: Visual analysis starting...

👁️ PHASE 2: VISUAL ANALYSIS (BLIP + DePlot)


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Device: CUDA
Loading BLIP model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Loading DePlot model...



Papers to analyze: 32
[1/32] 👁️ Analyzing: O_ISAC_001
   ✅ 2 images analyzed
[2/32] 👁️ Analyzing: O_ISAC_002
   ✅ 3 images analyzed
[3/32] 👁️ Analyzing: O_ISAC_003
   ✅ 8 images analyzed
[4/32] 👁️ Analyzing: O_ISAC_004
   ✅ 6 images analyzed
[5/32] 👁️ Analyzing: O_ISAC_005
   ✅ 5 images analyzed
[6/32] 👁️ Analyzing: O_ISAC_006
   ✅ 8 images analyzed
[7/32] 👁️ Analyzing: O_ISAC_007
   ✅ 5 images analyzed
[8/32] 👁️ Analyzing: O_ISAC_008
   ✅ 3 images analyzed
[9/32] 👁️ Analyzing: O_ISAC_009
   ✅ 20 images analyzed
[10/32] 👁️ Analyzing: O_ISAC_010
   ✅ 9 images analyzed
[11/32] 👁️ Analyzing: O_ISAC_011
   ✅ 7 images analyzed
[12/32] 👁️ Analyzing: O_ISAC_012
   ✅ 3 images analyzed
[13/32] 👁️ Analyzing: O_ISAC_013
   ✅ 21 images analyzed
[14/32] 👁️ Analyzing: O_ISAC_014
   ✅ 5 images analyzed
[15/32] 👁️ Analyzing: O_ISAC_015
   ✅ 5 images analyzed
[16/32] 👁️ Analyzing: O_ISAC_016
   ✅ 4 images analyzed
[17/32] 👁️ Analyzing: O_ISAC_017
   ✅ 4 images analyzed
[18/32] 👁️ Analyzing: O_ISAC_018


---
## 🧠 Section 4: Phase 3 - CoT Extraction

Structured data is extracted from each paper using Chain-of-Thought (CoT) extraction.

In [6]:
# @title 4.1 Import CoT Laboratory
sys.path.insert(0, COT_LAB_DIR)
from core.assembler import CoTAssembler
from core.batch_runner import CoTFactory

# Default Recipe
RECIPE_PATH = 'analysis/cot_laboratory/recipes/experiment_v1_full_analysis.yaml'

print("✅ CoT Laboratory loaded!")
print(f"📜 Recipe: {RECIPE_PATH}")

✅ CoT Laboratory loaded!
📜 Recipe: analysis/cot_laboratory/recipes/experiment_v1_full_analysis.yaml


In [7]:
# @title 4.2 Single Paper Test
# Test on a single paper first

TEST_PAPER_ID = "O_ISAC_029"  # @param {type:"string"}

import json

# Find paper markdown
paper_path = os.path.join(MARKDOWN_DIR, TEST_PAPER_ID, TEST_PAPER_ID, f"{TEST_PAPER_ID}.md")
vis_path = os.path.join(MARKDOWN_DIR, TEST_PAPER_ID, TEST_PAPER_ID, "visual_analysis.txt")

print(f"📄 Paper: {paper_path}")
print(f"   Exists: {os.path.exists(paper_path)}")

# Read content
with open(paper_path, 'r', encoding='utf-8') as f:
    content = f.read()

# Read visual if exists
visual_content = None
if os.path.exists(vis_path):
    with open(vis_path, 'r', encoding='utf-8') as f:
        visual_content = f.read()
    print(f"🖼️ Visual Analysis: {len(visual_content)} chars")

# Run extraction
assembler = CoTAssembler(PROJECT_ROOT)
result = assembler.run_extraction(
    RECIPE_PATH,
    content,
    paper_id=TEST_PAPER_ID,
    visual_content=visual_content
)

# Show result
if result['status'] == 'success':
    print("\n✅ EXTRACTION SUCCEEDED!")
    print("\n📋 Reasoning Trace:")
    trace = result['parsed_output'].get('reasoning_trace', [])
    for step in trace:
        # Coerce to str so a missing or non-string value cannot break the slice
        print(f"  {step.get('key')}: {str(step.get('value', ''))[:80]}...")
else:
    print(f"\n❌ ERROR: {result.get('error_message')}")

📄 Paper: /content/drive/MyDrive/AKU_WorkSpace/survey_fdgit/OISAC_PRISMA_COMST/data/processed_markdowns/O_ISAC_029/O_ISAC_029/O_ISAC_029.md
   Exists: True
🖼️ Visual Analysis: 1012 chars
[INFO] Loading Recipe: analysis/cot_laboratory/recipes/experiment_v1_full_analysis.yaml...
[INFO] Assembling System Prompt from Modules...
[INFO] Calling Groq API (Model: llama-3.3-70b-versatile)...

[DEBUG] RAW RESPONSE LEN: 6994
[DEBUG] RAW RESPONSE START: {
  "reasoning_trace":[
      {
         "key":"step_0_visual_inspection",
         "type":"string",
         "required":true,
         "description":"MANDATORY: You MUST describe what you see in the ...
[INFO] Logging Run Evidence...
[OK] Run Logged: 20251211_065216_O_ISAC_029_llama-3.3-70b-versatile

✅ EXTRACTION SUCCEEDED!

📋 Reasoning Trace:
  step_0_visual_inspection: The paper contains several figures, including a diagram of a photonic-based THz ...
  step_1_concept_analysis: The system uses a photonic-based THz ISAC architecture, where communi
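
Printing the reasoning trace can be factored into a small helper that tolerates missing or non-string values. The sketch below assumes `parsed_output["reasoning_trace"]` is a list of dicts with `key` and `value` entries, which is how cell 4.2 reads it. `summarize_trace` is a hypothetical name, not part of the CoT Laboratory.

```python
def summarize_trace(parsed_output, width=80):
    """Return one "key: value" line per reasoning step, truncated to `width`.

    Assumption: parsed_output["reasoning_trace"] is a list of dicts with
    "key" and "value" entries, as the single-paper test cell reads them.
    """
    lines = []
    for step in parsed_output.get("reasoning_trace") or []:
        if not isinstance(step, dict):
            continue  # skip malformed entries instead of raising
        # Coerce to str so a missing or non-string value cannot break slicing.
        value = str(step.get("value", ""))
        suffix = "..." if len(value) > width else ""
        lines.append(f"{step.get('key', '?')}: {value[:width]}{suffix}")
    return lines


demo = {"reasoning_trace": [{"key": "step_0_visual_inspection", "value": "Two figures."}]}
print(summarize_trace(demo))  # ['step_0_visual_inspection: Two figures.']
```

Appending the ellipsis only when the value was actually cut makes short values easier to tell apart from truncated ones.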

In [8]:
# @title 4.3 Batch Extraction (All Papers)
# ⚠️ This will take a long time! ~2-3 min per paper

RUN_BATCH = False  # @param {type:"boolean"}

if RUN_BATCH:
    factory = CoTFactory(PROJECT_ROOT)
    factory.run_batch(RECIPE_PATH)
else:
    print("ℹ️ Batch mode is off. Set RUN_BATCH = True to run it.")

ℹ️ Batch mode is off. Set RUN_BATCH = True to run it.


---
## 📊 Section 5: Results & Export

In [9]:
# @title 5.1 View Latest Logs
import glob
from datetime import datetime

logs_dir = os.path.join(COT_LAB_DIR, 'logs')
log_files = sorted(glob.glob(os.path.join(logs_dir, '*_RESULT.json')))[-5:]

print("📋 Last 5 extraction logs:")
for log in log_files:
    filename = os.path.basename(log)
    # Parse: 20251211_093015_O_ISAC_029_llama-3.3-70b-versatile_RESULT.json
    parts = filename.split('_')
    if len(parts) < 5:
        continue  # name does not follow the expected pattern
    date_str = parts[0]  # YYYYMMDD
    time_str = parts[1]  # HHMMSS
    paper_id = f"{parts[2]}_{parts[3]}_{parts[4]}"

    # Format full timestamp; fall back to the raw strings if they do not parse
    try:
        dt = datetime.strptime(f"{date_str}_{time_str}", "%Y%m%d_%H%M%S")
        formatted = dt.strftime("%Y-%m-%d %H:%M:%S")
    except ValueError:
        formatted = f"{date_str}_{time_str}"

    print(f"  • {formatted} | {paper_id}")
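
Splitting on underscores works as long as the paper ID has exactly three parts. A regular expression also recovers the model name and rejects names that do not fit. The sketch below assumes the pattern shown in the example file name above, with paper IDs of the form `O_ISAC_<digits>`; `parse_log_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern assumed from the example name in the cell above:
# <YYYYMMDD>_<HHMMSS>_<paper id>_<model>_RESULT.json, paper id like O_ISAC_029.
LOG_NAME = re.compile(
    r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<paper_id>O_ISAC_\d+)_(?P<model>.+)_RESULT\.json$"
)


def parse_log_name(filename):
    """Return (timestamp, paper_id, model), or None if the name does not match."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return None  # digits present but not a valid calendar date or time
    return stamp, match["paper_id"], match["model"]


print(parse_log_name("20251211_093015_O_ISAC_029_llama-3.3-70b-versatile_RESULT.json"))
```

Returning `None` for non-matching names lets the caller skip stray files in `logs/` without a `try` block.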

In [10]:
# @title 5.2 Export to CSV (TODO)
# This function will merge all log JSON files and convert them to CSV

print("📊 The CSV export function is not implemented yet.")
print("   Results are available as JSON in the logs/ folder.")

📊 The CSV export function is not implemented yet.
   Results are available as JSON in the logs/ folder.
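
One possible shape for the missing export, using only the standard library, is sketched below. It is not the notebook's implementation. It assumes each `*_RESULT.json` file is a JSON object holding `parsed_output.reasoning_trace` as a list of `{"key", "value"}` dicts, mirroring the in-memory result read in cell 4.2. Whether the log files on disk share that layout is an assumption to verify. `export_logs_to_csv` is a hypothetical name.

```python
import csv
import glob
import json
import os


def export_logs_to_csv(logs_dir, csv_path):
    """Flatten every *_RESULT.json in `logs_dir` into one CSV row per file.

    Assumption: each log holds parsed_output.reasoning_trace as a list of
    {"key", "value"} dicts. Files that do not fit this shape produce a row
    containing only `source_file`.
    """
    rows, columns = [], ["source_file"]
    for path in sorted(glob.glob(os.path.join(logs_dir, "*_RESULT.json"))):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        row = {"source_file": os.path.basename(path)}
        parsed = data.get("parsed_output") if isinstance(data, dict) else None
        trace = parsed.get("reasoning_trace") if isinstance(parsed, dict) else None
        for step in trace or []:
            if not isinstance(step, dict) or step.get("key") is None:
                continue  # skip malformed steps
            key = step["key"]
            row[key] = step.get("value")
            if key not in columns:
                columns.append(key)  # keep first-seen column order
        rows.append(row)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

In the notebook this could be called as `export_logs_to_csv(os.path.join(COT_LAB_DIR, 'logs'), os.path.join(OUTPUT_DIR, 'results.csv'))`, where the output file name is only an example.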


---
## ✅ Done!

**Next Steps:**
1. Review the single-paper test results
2. If the quality is good, enable batch mode
3. Process all papers
4. Export to CSV