## IPCC CORPUS EXTRACTION: 400 DOCUMENTS NEEDED FOR THE CORPUS

**NOTES**: We ran into numerous problems extracting the PDFs for the IPCC. First, we tried to extract complete reports. Because they are very long, they used up all the RAM of our free account; Colab crashed and restarted the session, and we had to start over. There were simply too many huge files being processed at once. That is why we opted for *incremental processing*, which lets us free memory as we go.

We then tried working chapter by chapter and still faced several challenges. For one, some of the links had become obsolete. For AR6 and SREX the URLs are stable (they have not changed in years); for SR15, SROCC and SRCCL, however, the folders in our links no longer existed. The IPCC reorganized its website between 2020 and 2023, so precisely those reports were affected. As a result, we had to find the new paths and the new file names for those 18 chapters.
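One way to catch obsolete links before a long run is to probe every catalog URL up front. The sketch below is only an illustration and is not part of the pipeline: `head_ok` and `find_broken_links` are hypothetical names, and the catalog is assumed to have the same shape as `IPCC_CHAPTERS_CATALOG` in the code cell (a `base_url` plus a list of `chapters` with a `file` key). It uses the standard library instead of `requests`, and it assumes the server answers `HEAD` requests, which may not hold everywhere.

```python
import urllib.request
from typing import Callable, Dict, List


def head_ok(url: str) -> bool:
    """Return True when a HEAD request for `url` answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers malformed URLs
        return False


def find_broken_links(catalog: Dict, probe: Callable[[str], bool] = head_ok) -> List[str]:
    """List every chapter URL in `catalog` that `probe` rejects."""
    broken = []
    for report in catalog.values():
        for chapter in report["chapters"]:
            url = report["base_url"] + chapter["file"]
            if not probe(url):
                broken.append(url)
    return broken
```

Passing the probe as a parameter keeps the check testable without network access; in a notebook, `find_broken_links(IPCC_CHAPTERS_CATALOG)` would list the links that need to be looked up again.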

Last but not least, we had to build automatic resumption into the code, because the process takes 60 to 75 minutes. Since Colab may restart after a certain time, the pipeline picks up where it left off, without downloading the PDFs, processing them or saving the corpus again if it has already done so. This is achieved simply by integrating a series of checkpoints into the different steps.
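The checkpoint idea can be sketched in a few lines: a result file on disk doubles as the marker that an item is finished. This is a minimal illustration under stated assumptions, not the notebook's implementation; `run_with_checkpoints`, `work` and the `.part` suffix are hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def run_with_checkpoints(items: Iterable[str],
                         work: Callable[[str], dict],
                         out_dir: Path) -> List[str]:
    """Run `work` on each item unless its result file already exists.

    Returns the names of the items actually processed in this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    done_now = []
    for name in items:
        target = out_dir / f"{name}.json"
        if target.exists():
            # Checkpoint found: this item was finished in an earlier session
            continue
        result = work(name)
        # Write under a temporary name, then rename, so an interrupted
        # write never looks like a finished checkpoint
        tmp = target.with_name(target.name + ".part")
        tmp.write_text(json.dumps(result), encoding="utf-8")
        tmp.replace(target)
        done_now.append(name)
    return done_now
```

Calling the function a second time with the same `out_dir` only processes the items that are still missing, which is the behavior needed after a Colab restart.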

**Each chapter is processed incrementally: it is downloaded, read page by page, split into segments, and the progress is saved continuously, so the process can resume without repeating work if the session is interrupted. This approach also frees memory after each page or block is processed, which avoids loading the whole PDF into RAM and ensures the system can handle long documents even in resource-constrained environments such as Google Colab.**
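As an illustration of the page-by-page idea, the following sketch yields word-based chunks while consuming pages lazily, so only a small buffer is held at any time. It is a simplified variant and not the notebook's `segment_text`, which splits the whole extracted text at once and drops chunks shorter than 500 characters; `stream_chunks` and `max_words` are hypothetical names.

```python
from typing import Iterable, Iterator, List


def stream_chunks(pages: Iterable[str], max_words: int = 5000) -> Iterator[str]:
    """Yield chunks of at most `max_words` words while reading pages lazily."""
    buffer: List[str] = []
    for page_text in pages:
        buffer.extend(page_text.split())
        # Emit full chunks as soon as enough words have accumulated
        while len(buffer) >= max_words:
            yield " ".join(buffer[:max_words])
            del buffer[:max_words]
    if buffer:
        # Emit the remaining words as a final, shorter chunk
        yield " ".join(buffer)
```

For instance, `list(stream_chunks(["a b c", "d e", "f g h i"], max_words=4))` returns `["a b c d", "e f g h", "i"]`. Because `pages` can be a generator, each page's text can be discarded as soon as its words have been buffered.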

In [None]:
"""
IPCC Document Extractor - OPTIMIZED FOR FREE COLAB
Version with individual chapters and incremental processing
Peak RAM usage: ~3-4 GB (safe for free Colab)
"""

import os
import re
import json
import requests
from pathlib import Path
from typing import List, Dict, Set, Tuple
from collections import defaultdict
import time
import gc
import warnings
import logging

# Silence noisy warnings
logging.getLogger('pdfminer').setLevel(logging.ERROR)
warnings.filterwarnings('ignore')

# ============================================================================
# DEPENDENCY INSTALLATION
# ============================================================================

!pip install PyPDF2 pdfplumber requests tqdm -q


import PyPDF2
import pdfplumber
from tqdm.auto import tqdm


# ============================================================================
# OPTIMIZED CATALOG - INDIVIDUAL CHAPTERS
# ============================================================================

IPCC_CHAPTERS_CATALOG = {
    "AR6_WG2": {
        "name": "AR6 Working Group II (2022)",
        "priority": "HIGH",
        "base_url": "https://www.ipcc.ch/report/ar6/wg2/downloads/report/",
        "chapters": [
            # Technical Summary and SPM (essential)
            {"file": "IPCC_AR6_WGII_TechnicalSummary.pdf", "title": "Technical Summary", "expected_docs": 15},
            {"file": "IPCC_AR6_WGII_SummaryForPolicymakers.pdf", "title": "Summary for Policymakers", "expected_docs": 8},

            # Key thematic chapters (strategic selection of the most relevant ones)
            {"file": "IPCC_AR6_WGII_Chapter02.pdf", "title": "Chapter 2 - Terrestrial and Freshwater Ecosystems", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter03.pdf", "title": "Chapter 3 - Oceans and Coastal Ecosystems", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter04.pdf", "title": "Chapter 4 - Water", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter05.pdf", "title": "Chapter 5 - Food, Fibre and Livelihoods", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter06.pdf", "title": "Chapter 6 - Cities, Settlements and Infrastructure", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter07.pdf", "title": "Chapter 7 - Health, Wellbeing and Communities", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter08.pdf", "title": "Chapter 8 - Poverty, Livelihoods and Development", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter09.pdf", "title": "Chapter 9 - Africa", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter10.pdf", "title": "Chapter 10 - Asia", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter11.pdf", "title": "Chapter 11 - Australasia", "expected_docs": 8},
            {"file": "IPCC_AR6_WGII_Chapter12.pdf", "title": "Chapter 12 - Central and South America", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter13.pdf", "title": "Chapter 13 - Europe", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter14.pdf", "title": "Chapter 14 - North America", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter15.pdf", "title": "Chapter 15 - Small Islands", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter16.pdf", "title": "Chapter 16 - Key Risks", "expected_docs": 12},
            {"file": "IPCC_AR6_WGII_Chapter17.pdf", "title": "Chapter 17 - Decision Making", "expected_docs": 10},
            {"file": "IPCC_AR6_WGII_Chapter18.pdf", "title": "Chapter 18 - Climate Resilient Development", "expected_docs": 12},
        ]
    },

    "SREX": {
        "name": "Special Report on Extreme Events (2012)",
        "priority": "HIGH",
        "base_url": "https://www.ipcc.ch/site/assets/uploads/2018/03/",
        "chapters": [
            {"file": "SREX-Chap1_FINAL-1.pdf", "title": "Chapter 1 - Climate Extremes", "expected_docs": 8},
            {"file": "SREX-Chap2_FINAL-1.pdf", "title": "Chapter 2 - Determinants of Risk", "expected_docs": 8},
            {"file": "SREX-Chap3_FINAL-1.pdf", "title": "Chapter 3 - Changes in Extremes", "expected_docs": 10},
            {"file": "SREX-Chap4_FINAL-1.pdf", "title": "Chapter 4 - Changes in Impacts", "expected_docs": 10},
            {"file": "SREX-Chap5_FINAL-1.pdf", "title": "Chapter 5 - Managing Risks", "expected_docs": 8},
            {"file": "SREX-Chap6_FINAL-1.pdf", "title": "Chapter 6 - National Systems", "expected_docs": 8},
            {"file": "SREX-Chap7_FINAL-1.pdf", "title": "Chapter 7 - Managing Disaster Risks", "expected_docs": 8},
            {"file": "SREX-Chap8_FINAL-1.pdf", "title": "Chapter 8 - Climate Change Context", "expected_docs": 8},
            {"file": "SREX-Chap9_FINAL-1.pdf", "title": "Chapter 9 - Case Studies", "expected_docs": 10},
        ]
    },

    "SR15": {
        "name": "Special Report on 1.5°C (2018)",
        "priority": "MEDIUM",
        "base_url": "https://www.ipcc.ch/site/assets/uploads/sites/2/2022/06/",
        "chapters": [
            {"file": "SR15_Chapter_1_LR.pdf", "title": "Chapter 1 - Framing and Context", "expected_docs": 8},
            {"file": "SR15_Chapter_2_LR.pdf", "title": "Chapter 2 - Mitigation Pathways", "expected_docs": 6},
            {"file": "SR15_Chapter_3_LR.pdf", "title": "Chapter 3 - Impacts of 1.5°C", "expected_docs": 10},
            {"file": "SR15_Chapter_4_LR.pdf", "title": "Chapter 4 - Strengthening Response", "expected_docs": 8},
            {"file": "SR15_Chapter_5_HR.pdf", "title": "Chapter 5 - Sustainable Development", "expected_docs": 8},
        ]
    },

    "SROCC": {
        "name": "Special Report on Ocean and Cryosphere (2019)",
        "priority": "MEDIUM",
        "base_url": "https://www.ipcc.ch/site/assets/uploads/sites/3/2022/03/",
        "chapters": [
            {"file": "03_SROCC_Ch01_FINAL.pdf", "title": "Chapter 1 - Framing", "expected_docs": 6},
            {"file": "04_SROCC_Ch02_FINAL.pdf", "title": "Chapter 2 - High Mountain Areas", "expected_docs": 8},
            {"file": "05_SROCC_Ch03_FINAL.pdf", "title": "Chapter 3 - Polar Regions", "expected_docs": 8},
            {"file": "06_SROCC_Ch04_FINAL.pdf", "title": "Chapter 4 - Sea Level Rise", "expected_docs": 10},
            {"file": "07_SROCC_Ch05_FINAL.pdf", "title": "Chapter 5 - Marine Ecosystems", "expected_docs": 10},
            {"file": "08_SROCC_Ch06_FINAL.pdf", "title": "Chapter 6 - Extremes and Abrupt Changes", "expected_docs": 8},
        ]
    },

    "SRCCL": {
        "name": "Special Report on Climate and Land (2019)",
        "priority": "MEDIUM",
        "base_url": "https://www.ipcc.ch/site/assets/uploads/sites/4/2020/05/",
        "chapters": [
            {"file": "Chapter-1_FINAL-1.pdf", "title": "Chapter 1 - Framing", "expected_docs": 6},
            {"file": "Chapter-2_FINAL_updated-30-April.pdf", "title": "Chapter 2 - Land-Climate Interactions", "expected_docs": 8},
            {"file": "Chapter-3_FINAL-1.pdf", "title": "Chapter 3 - Desertification", "expected_docs": 10},
            {"file": "Chapter-4_FINAL-1.pdf", "title": "Chapter 4 - Land Degradation", "expected_docs": 8},
            {"file": "Chapter-5_FINAL-1.pdf", "title": "Chapter 5 - Food Security", "expected_docs": 10},
            {"file": "Chapter-6_FINAL-1.pdf", "title": "Chapter 6 - Interlinkages", "expected_docs": 8},
            {"file": "Chapter-7_FINAL-1.pdf", "title": "Chapter 7 - Risk Management", "expected_docs": 8},
        ]
    }
}


# ============================================================================
# TAXONOMY (full taxonomy)
# ============================================================================

IPCC_TAXONOMY = {
    'hazards': {
        'heat': ['heat', 'heatwave', 'heat wave', 'extreme temperature', 'thermal stress',
                'hot extremes', 'warming', 'high temperature'],
        'flood': ['flood', 'flooding', 'inundation', 'pluvial', 'fluvial', 'riverine',
                 'coastal flood', 'flash flood', 'storm surge', 'sea level rise'],
        'drought': ['drought', 'water scarcity', 'water stress', 'aridity', 'dry spell',
                   'hydrological drought', 'agricultural drought', 'water shortage'],
        'storm': ['tropical cyclone', 'hurricane', 'typhoon', 'storm', 'wind extremes',
                 'extratropical cyclone', 'severe weather'],
        'compound': ['compound event', 'concurrent', 'cascading', 'multiple hazard',
                    'combined risk', 'interacting'],
        'wildfire': ['wildfire', 'fire', 'fire weather', 'fire risk'],
        'cold': ['cold extreme', 'frost', 'freeze', 'ice', 'snow'],
        'landslide': ['landslide', 'mudslide', 'mass movement', 'slope instability'],
        'coastal': ['coastal erosion', 'shoreline retreat', 'coastal hazard']
    },

    'adaptation_measures': {
        'nature_based': ['nature-based solution', 'nbs', 'ecosystem-based', 'green infrastructure',
                        'wetland', 'mangrove', 'forest', 'restoration', 'conservation'],
        'infrastructure': ['infrastructure', 'sea wall', 'levee', 'barrier', 'dike',
                          'drainage', 'engineered', 'hard adaptation'],
        'planning': ['adaptation plan', 'planning', 'policy', 'governance', 'institutional',
                    'regulation', 'zoning', 'land use'],
        'early_warning': ['early warning', 'forecasting', 'monitoring', 'climate service',
                         'risk assessment', 'vulnerability assessment'],
        'water_management': ['irrigation', 'water management', 'water storage', 'rainwater',
                            'water efficiency', 'demand management'],
        'agriculture': ['drought-resistant', 'crop adaptation', 'climate-smart agriculture',
                       'agricultural adaptation', 'crop diversification'],
        'financial': ['insurance', 'climate finance', 'adaptation finance', 'funding',
                     'investment', 'economic instrument']
    },

    'sectors': {
        'urban': ['urban', 'city', 'cities', 'municipal', 'settlement'],
        'health': ['health', 'mortality', 'morbidity', 'disease', 'public health'],
        'water': ['water supply', 'water resource', 'water system', 'freshwater'],
        'agriculture': ['agriculture', 'crop', 'food security', 'farming', 'livestock'],
        'coastal': ['coastal', 'coast', 'marine', 'ocean', 'shoreline'],
        'infrastructure': ['infrastructure', 'transport', 'energy', 'critical infrastructure'],
        'ecosystem': ['ecosystem', 'biodiversity', 'species', 'habitat', 'ecological']
    },

    'impacts': {
        'health_impacts': ['mortality', 'death', 'morbidity', 'illness', 'health risk',
                          'heat-related', 'disease burden'],
        'economic_impacts': ['economic loss', 'damage', 'cost', 'gdp', 'productivity loss',
                            'economic impact'],
        'social_impacts': ['displacement', 'migration', 'livelihood', 'poverty', 'inequality',
                          'vulnerable', 'community'],
        'environmental_impacts': ['ecosystem degradation', 'habitat loss', 'biodiversity loss',
                                 'species extinction', 'water quality']
    },

    'concepts': {
        'adaptation': ['adaptation', 'adaptive capacity', 'resilience', 'vulnerability',
                      'exposure', 'sensitivity', 'coping capacity'],
        'risk': ['climate risk', 'risk assessment', 'disaster risk reduction', 'risk management',
                'hazard', 'impact', 'consequence'],
        'transformation': ['transformation', 'transformational adaptation', 'systemic change',
                          'paradigm shift', 'maladaptation']
    },

    'regions': {
        'vulnerable': ['small island', 'sids', 'developing countries', 'least developed',
                      'vulnerable regions', 'low-income', 'global south'],
        'geographic': ['africa', 'asia', 'europe', 'americas', 'mediterranean', 'arctic',
                      'tropics', 'arid', 'semi-arid', 'coastal regions', 'mountains']
    }
}


# ============================================================================
# MAIN CLASS - OPTIMIZED FOR RAM
# ============================================================================

class IPCCExtractorOptimized:
    """
    Optimized extractor with incremental processing
    Peak RAM usage: ~3-4 GB
    """

    def __init__(self, output_dir: str = "./ipcc_data"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)

        self.pdfs_dir = self.output_dir / "pdfs"
        self.pdfs_dir.mkdir(exist_ok=True)

        self.segments_dir = self.output_dir / "segments"
        self.segments_dir.mkdir(exist_ok=True)

        self.processed_files = []

    def check_existing_progress(self):
        """
        Check which parts of the pipeline are already complete.
        Returns a dictionary with:
        - pdfs: whether at least one PDF has been downloaded
        - segments: whether processed segments already exist
        - corpus: whether the final corpus already exists
        """
        status = {
            "pdfs": False,
            "segments": False,
            "corpus": False
        }

        pdf_dir = self.pdfs_dir
        seg_dir = self.segments_dir
        corpus_file = self.output_dir / "ipcc_corpus.json"

        # Downloaded PDFs
        if pdf_dir.exists() and any(pdf_dir.glob("*.pdf")):
            status["pdfs"] = True

        # Processed segments
        if seg_dir.exists() and any(seg_dir.glob("*_segments.json")):
            status["segments"] = True

        # Final corpus
        if corpus_file.exists():
            status["corpus"] = True

        return status


    def download_chapter(self, base_url: str, chapter_info: Dict, report_key: str) -> bool:
        """Download a single chapter."""
        filename = f"{report_key}_{chapter_info['file']}"
        filepath = self.pdfs_dir / filename

        if filepath.exists():
            size_mb = filepath.stat().st_size / 1024 / 1024
            print(f"   ✓ Already exists: {filename} ({size_mb:.1f} MB)")
            return True

        url = base_url + chapter_info['file']
        # Download under a temporary name so an interrupted transfer is never
        # mistaken for a finished PDF on the next run
        tmp_path = filepath.with_name(filename + ".part")

        try:
            print(f"   📥 Downloading: {chapter_info['title']}")
            response = requests.get(url, stream=True, timeout=120)
            response.raise_for_status()

            total_size = int(response.headers.get('content-length', 0))

            with open(tmp_path, 'wb') as f:
                with tqdm(total=total_size or None, unit='B', unit_scale=True,
                         desc=f"      {filename[:40]}", leave=False) as pbar:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                        pbar.update(len(chunk))

            tmp_path.replace(filepath)
            size_mb = filepath.stat().st_size / 1024 / 1024
            print(f"   ✓ Downloaded: {filename} ({size_mb:.1f} MB)")
            return True

        except Exception as e:
            print(f"   ✗ Error: {e}")
            return False

    def download_all_chapters(self):
        """Download every chapter in the catalog."""
        print("\n" + "="*70)
        print("📥 DOWNLOADING IPCC CHAPTERS")
        print("="*70 + "\n")

        total_chapters = sum(len(r['chapters']) for r in IPCC_CHAPTERS_CATALOG.values())
        print(f"Total chapters to download: {total_chapters}\n")

        for report_key, report_data in IPCC_CHAPTERS_CATALOG.items():
            print(f"\n📄 {report_data['name']}")
            print(f"   Priority: {report_data['priority']}")
            print(f"   Chapters: {len(report_data['chapters'])}")

            for chapter in report_data['chapters']:
                self.download_chapter(report_data['base_url'], chapter, report_key)
                time.sleep(0.5)  # Gentle rate limiting

    def extract_text_from_pdf(self, pdf_path: Path) -> str:
        """Extract text from a PDF, page by page."""
        parts = []

        try:
            with pdfplumber.open(pdf_path) as pdf:
                for page in tqdm(pdf.pages, desc="      Extracting", leave=False):
                    try:
                        page_text = page.extract_text()
                        if page_text:
                            parts.append(page_text)
                    except Exception:
                        continue
                    finally:
                        # Drop pdfplumber's cached page objects to keep RAM flat
                        page.flush_cache()

            if not any(p.strip() for p in parts):
                # Fall back to PyPDF2
                parts = []
                with open(pdf_path, 'rb') as f:
                    pdf_reader = PyPDF2.PdfReader(f)
                    for page in pdf_reader.pages:
                        try:
                            page_text = page.extract_text()
                            if page_text:
                                parts.append(page_text)
                        except Exception:
                            continue

        except Exception as e:
            print(f"      ⚠️  Error: {e}")

        return "\n\n".join(parts)

    def segment_text(self, text: str, max_chunk_size: int = 5000) -> List[str]:
        """
        Split text into manageable chunks.
        max_chunk_size is counted in words: ~5000 words per chunk by default
        (the size chosen here with embeddings in mind).
        """
        words = text.split()
        chunks = []

        for i in range(0, len(words), max_chunk_size):
            chunk = ' '.join(words[i:i+max_chunk_size])
            if len(chunk) > 500:  # Minimum of 500 characters
                chunks.append(chunk)

        return chunks

    def score_segment(self, text: str) -> Tuple[float, Dict]:
        """Score a segment against the taxonomy."""
        text_lower = text.lower()

        matches = {
            'hazards': set(),
            'adaptation_measures': set(),
            'sectors': set(),
            'impacts': set(),
            'concepts': set(),
            'regions': set()
        }

        # Look for matches (plain substring search; one hit per subcategory is enough)
        for category, subcategories in IPCC_TAXONOMY.items():
            for subcat, terms in subcategories.items():
                for term in terms:
                    if term.lower() in text_lower:
                        matches[category].add(subcat)
                        break

        # Compute the score
        score = (
            len(matches['hazards']) * 3.0 +
            len(matches['adaptation_measures']) * 2.5 +
            len(matches['impacts']) * 2.0 +
            len(matches['sectors']) * 1.5 +
            len(matches['concepts']) * 1.0 +
            len(matches['regions']) * 1.0
        )

        # Diversity bonus
        if len(matches['hazards']) >= 2:
            score += 2.0
        if len(matches['adaptation_measures']) >= 2:
            score += 1.5

        return score, {k: list(v) for k, v in matches.items()}

    def process_single_pdf(self, pdf_path: Path, report_key: str, chapter_title: str):
        """
        Process ONE PDF and save the results immediately.
        This is the key to keeping RAM under control.
        """
        print(f"\n📖 Processing: {pdf_path.name}")

        # Extract text
        text = self.extract_text_from_pdf(pdf_path)

        if not text or len(text) < 500:
            print(f"   ⚠️  Not enough text")
            return

        print(f"   ✓ Extracted {len(text)} characters")

        # Segment
        chunks = self.segment_text(text, max_chunk_size=5000)
        print(f"   ✓ {len(chunks)} segments created")

        # Score and save
        segments = []
        for i, chunk in enumerate(chunks):
            score, taxonomy = self.score_segment(chunk)

            if score >= 1.0:  # Minimum threshold for saving
                segment = {
                    'source': 'ipcc',
                    'source_id': f"{report_key}_{pdf_path.stem}_{i}",
                    'title': f"{chapter_title} - Segment {i+1}",
                    'text': chunk[:2000],  # Abstract
                    'full_text': chunk,
                    'report': report_key,
                    'chapter': chapter_title,
                    'score': score,
                    'matched_taxonomy': taxonomy
                }
                segments.append(segment)

        # SAVE IMMEDIATELY (frees RAM)
        if segments:
            output_file = self.segments_dir / f"{pdf_path.stem}_segments.json"
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(segments, f, ensure_ascii=False)

            print(f"   💾 {len(segments)} segments saved → {output_file.name}")
            self.processed_files.append(output_file)

        # FREE MEMORY
        del text, chunks, segments
        gc.collect()

    def process_all_pdfs(self):
        """Process all PDFs one at a time (incremental)."""
        print("\n" + "="*70)
        print("📄 PROCESSING CHAPTERS")
        print("="*70 + "\n")

        pdf_files = sorted(self.pdfs_dir.glob("*.pdf"))

        if not pdf_files:
            print("⚠️  No PDFs found")
            return

        print(f"Total files to process: {len(pdf_files)}\n")

        # Map each downloaded file name back to its report key and chapter title.
        # Splitting the stem on '_' would turn "AR6_WG2_..." into "AR6".
        lookup = {
            f"{key}_{ch['file']}": (key, ch['title'])
            for key, report_data in IPCC_CHAPTERS_CATALOG.items()
            for ch in report_data['chapters']
        }

        for pdf_file in tqdm(pdf_files, desc="Processing chapters"):
            report_key, chapter_title = lookup.get(pdf_file.name, ("UNKNOWN", pdf_file.stem))

            # Checkpoint: skip chapters whose segments were saved in an earlier session
            segment_file = self.segments_dir / f"{pdf_file.stem}_segments.json"
            if segment_file.exists():
                self.processed_files.append(segment_file)
                continue

            self.process_single_pdf(pdf_file, report_key, chapter_title)

        print(f"\n✅ {len(self.processed_files)} files with saved segments")

    def select_top_segments(self, n: int = 400, min_score: float = 2.5) -> List[Dict]:
        """
        Load every saved segment and select the best ones.
        full_text is dropped right after each file is loaded (RAM efficient).
        """
        print(f"\n🎯 Selecting top {n} segments...")

        # Read the segment files from disk rather than self.processed_files,
        # which is empty after a session restart
        segment_files = sorted(self.segments_dir.glob("*_segments.json"))
        all_segments = []

        for segment_file in tqdm(segment_files, desc="Loading scores"):
            with open(segment_file, 'r', encoding='utf-8') as f:
                segments = json.load(f)
                # Drop full_text to save RAM
                for seg in segments:
                    seg_light = {k: v for k, v in seg.items() if k != 'full_text'}
                    seg_light['_file'] = segment_file  # Keep a reference to the source file
                    all_segments.append(seg_light)

        print(f"   Total segments: {len(all_segments)}")

        # Filter by score
        qualified = [s for s in all_segments if s['score'] >= min_score]
        print(f"   Qualified (score >= {min_score}): {len(qualified)}")

        if len(qualified) < n:
            new_threshold = min_score * 0.7
            print(f"   Lowering threshold to {new_threshold:.2f}")
            qualified = [s for s in all_segments if s['score'] >= new_threshold]

        # Sort and select with diversity
        sorted_segments = sorted(qualified, key=lambda x: x['score'], reverse=True)

        selected = []
        coverage = defaultdict(int)

        for seg in sorted_segments:
            if len(selected) >= n:
                break

            report = seg['report']
            if coverage[report] >= n * 0.35:  # At most 35% per report
                continue

            selected.append(seg)
            coverage[report] += 1

        print(f"\n✅ Selected {len(selected)} segments")
        print("\n📊 Distribution:")
        for report, count in sorted(coverage.items(), key=lambda x: x[1], reverse=True):
            print(f"   {report}: {count} ({count/len(selected)*100:.1f}%)")

        return selected

    def load_full_segments(self, selected_light: List[Dict]) -> List[Dict]:
        """Load the full text of the selected segments only."""
        print(f"\n📥 Loading full text for {len(selected_light)} segments...")

        # Group by file so each file is opened only once
        by_file = defaultdict(set)
        for seg in selected_light:
            by_file[seg['_file']].add(seg['source_id'])

        full_segments = []

        for file_path, source_ids in tqdm(by_file.items(), desc="Loading text"):
            with open(file_path, 'r', encoding='utf-8') as f:
                all_segs = json.load(f)
                for seg in all_segs:
                    if seg['source_id'] in source_ids:
                        full_segments.append(seg)

        return full_segments

    def save_corpus(self, segments: List[Dict], filename: str = "ipcc_corpus.json"):
        """Save the final corpus."""
        output_path = self.output_dir / filename

        # Add the publication year
        for seg in segments:
            seg['year'] = self._extract_year(seg['report'])

        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(segments, f, indent=2, ensure_ascii=False)

        print(f"\n💾 Corpus saved: {output_path}")

        # Generate the report
        self._save_report(segments)

        return segments

    def _extract_year(self, report_key: str) -> int:
        """Look up the publication year of a report (defaults to 2020 for unknown keys)."""
        year_map = {
            'AR6_WG2': 2022,
            'SREX': 2012,
            'SR15': 2018,
            'SROCC': 2019,
            'SRCCL': 2019,
            'AR5_WG2': 2014
        }
        return year_map.get(report_key, 2020)

    def _save_report(self, segments: List[Dict]):
        """Generate the analysis report."""
        report_path = self.output_dir / "ipcc_selection_report.txt"

        with open(report_path, 'w', encoding='utf-8') as f:
            f.write("="*70 + "\n")
            f.write("SELECTION REPORT - IPCC CORPUS\n")
            f.write("="*70 + "\n\n")

            f.write(f"Total segments: {len(segments)}\n")
            f.write("Method: individual chapters + incremental processing\n")
            f.write("Peak RAM usage: ~3-4 GB\n\n")

            # Taxonomy statistics
            all_hazards = defaultdict(int)
            all_adaptation_measures = defaultdict(int)
            all_sectors = defaultdict(int)

            for seg in segments:
                for h in seg['matched_taxonomy']['hazards']:
                    all_hazards[h] += 1
                for m in seg['matched_taxonomy']['adaptation_measures']:
                    all_adaptation_measures[m] += 1
                for s in seg['matched_taxonomy']['sectors']:
                    all_sectors[s] += 1

            f.write("TAXONOMY COVERAGE:\n")
            f.write("-"*70 + "\n")
            f.write(f"\nHAZARDS ({len(all_hazards)} types):\n")
            for h, count in sorted(all_hazards.items(), key=lambda x: x[1], reverse=True):
                f.write(f"  {h}: {count} segments\n")

            f.write(f"\nADAPTATION MEASURES ({len(all_adaptation_measures)} types):\n")
            for m, count in sorted(all_adaptation_measures.items(), key=lambda x: x[1], reverse=True):
                f.write(f"  {m}: {count} segments\n")

            f.write(f"\nSECTORS ({len(all_sectors)} types):\n")
            for s, count in sorted(all_sectors.items(), key=lambda x: x[1], reverse=True):
                f.write(f"  {s}: {count} segments\n")

            # Distribution by report
            by_report = defaultdict(int)
            for seg in segments:
                by_report[seg['report']] += 1

            f.write("\n" + "="*70 + "\n")
            f.write("DISTRIBUTION BY REPORT:\n")
            f.write("="*70 + "\n\n")

            for report, count in sorted(by_report.items(), key=lambda x: x[1], reverse=True):
                pct = count / len(segments) * 100
                f.write(f"{report}: {count} segments ({pct:.1f}%)\n")

            # Top 10
            f.write("\n" + "="*70 + "\n")
            f.write("TOP 10 SEGMENTS:\n")
            f.write("="*70 + "\n\n")

            top_10 = sorted(segments, key=lambda x: x['score'], reverse=True)[:10]
            for i, seg in enumerate(top_10, 1):
                f.write(f"{i}. {seg['title']}\n")
                f.write(f"   Score: {seg['score']:.2f}\n")
                f.write(f"   Hazards: {', '.join(seg['matched_taxonomy']['hazards'])}\n")
                f.write(f"   Adaptation measures: {', '.join(seg['matched_taxonomy']['adaptation_measures'])}\n\n")

        print(f"📄 Report saved: {report_path}")


# ============================================================================
# FULL PIPELINE
# ============================================================================

def run_optimized_pipeline(n_documents: int = 400):
    """
    Pipeline optimizado para Colab con reanudaci√≥n autom√°tica.
    Si la sesi√≥n se reinicia, contin√∫a desde el √∫ltimo paso completado.
    """
    print("\n" + "="*70)
    print("üåç IPCC EXTRACTOR - REANUDACI√ìN AUTOM√ÅTICA")
    print("="*70)

    print("\nüìä Caracter√≠sticas:")
    print(" ‚Ä¢ Cap√≠tulos individuales (40-50 archivos)")
    print(" ‚Ä¢ Procesamiento incremental")
    print(" ‚Ä¢ Consumo RAM: ~3-4 GB")
    print(" ‚Ä¢ Compatible con Colab gratuito")
    print("\n" + "="*70 + "\n")

    extractor = IPCCExtractorOptimized(output_dir="./ipcc_data")
    status = extractor.check_existing_progress()

    print("\nüìå Estado detectado:")
    print(f"   ‚Ä¢ PDFs descargados:     {status['pdfs']}")
    print(f"   ‚Ä¢ Segmentos procesados: {status['segments']}")
    print(f"   ‚Ä¢ Corpus final:         {status['corpus']}")

    # STEP 1: Download PDFs
    if not status["pdfs"]:
        print("\n[STEP 1/4] Downloading chapters...")
        extractor.download_all_chapters()
    else:
        print("\n✔️ PDFs already downloaded. Skipping step 1.")

    # STEP 2: Process PDFs
    if not status["segments"]:
        print("\n[STEP 2/4] Processing chapters...")
        extractor.process_all_pdfs()
    else:
        print("\n✔️ Segments already generated. Skipping step 2.")

    # STEP 3: Select the best segments
    # Note: this step always runs; its result is only used in step 4,
    # when the corpus has to be regenerated.
    print("\n[STEP 3/4] Selecting segments...")
    selected_light = extractor.select_top_segments(n=n_documents, min_score=2.5)

    # STEP 4: Generate final corpus
    if not status["corpus"]:
        print("\n[STEP 4/4] Generating final corpus...")
        full_segments = extractor.load_full_segments(selected_light)
        corpus = extractor.save_corpus(full_segments)
    else:
        print("\n✔️ Corpus already exists. Loading existing file...")
        with open("./ipcc_data/ipcc_corpus.json", "r", encoding="utf-8") as f:
            corpus = json.load(f)

    # ============================
    # FINAL SUMMARY
    # ============================

    print("\n" + "="*70)
    print("✅ PIPELINE COMPLETED")
    print("="*70)
    print(f"\n📊 {len(corpus)} documents ready")

    print("\n📁 Generated files:")
    print("   • ipcc_data/ipcc_corpus.json")
    print("   • ipcc_data/ipcc_selection_report.txt")
    print("   • ipcc_data/segments/ (intermediate segments)")
    print("   • ipcc_data/pdfs/ (downloaded chapters)")

    # Final statistics
    print("\n📈 Statistics:")
    by_report = defaultdict(int)
    for seg in corpus:
        by_report[seg['report']] += 1

    for report, count in sorted(by_report.items(), key=lambda x: x[1], reverse=True):
        pct = count / len(corpus) * 100
        print(f"   {report}: {count} docs ({pct:.1f}%)")

    return corpus


# ============================================================================
# EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Run the optimized pipeline
    # Estimated time: 60-75 minutes
    # Peak RAM: ~3-4 GB (safe for free Colab)

    corpus = run_optimized_pipeline(n_documents=400)


🌍 IPCC EXTRACTOR - AUTOMATIC RESUME

📊 Features:
 • Individual chapters (40-50 files)
 • Incremental processing
 • RAM usage: ~3-4 GB
 • Compatible with free Colab



📌 Detected state:
   • PDFs downloaded:    True
   • Segments processed: True
   • Final corpus:       True

✔️ PDFs already downloaded. Skipping step 1.

✔️ Segments already generated. Skipping step 2.

[STEP 3/4] Selecting segments...

🎯 Selecting top 400 segments...


Loading scores: 0it [00:00, ?it/s]

   Total segments: 0
   Qualified (score >= 2.5): 0
   Adjusting threshold to 1.8

✅ Selected 0 segments

📊 Distribution:

✔️ Corpus already exists. Loading existing file...

✅ PIPELINE COMPLETED

📊 400 documents ready

📁 Generated files:
   • ipcc_data/ipcc_corpus.json
   • ipcc_data/ipcc_selection_report.txt
   • ipcc_data/segments/ (intermediate segments)
   • ipcc_data/pdfs/ (downloaded chapters)

📈 Statistics:
   AR6: 140 docs (35.0%)
   SRCCL: 103 docs (25.8%)
   SROCC: 56 docs (14.0%)
   SREX: 55 docs (13.8%)
   SR15: 46 docs (11.5%)
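
In the run logged above, the segments step is reported as complete, yet the selection step loads zero scores. One possible explanation (an assumption, since the implementation of `check_existing_progress` is not shown in this section) is that the check only tests whether an output directory exists, not whether it contains anything. The sketch below illustrates a stricter checkpoint check. The function name `detect_progress` is hypothetical and is not the notebook's code; the directory layout (`pdfs/`, `segments/`, `ipcc_corpus.json`) is taken from the file list printed by the pipeline, and the corpus is assumed to be a JSON list, as the statistics loop suggests.

```python
import json
from pathlib import Path


def detect_progress(output_dir: str = "./ipcc_data") -> dict:
    """Hypothetical stand-in for check_existing_progress (not the notebook's code).

    A step is reported as done only when its output is actually usable,
    not merely when the file or directory exists.
    """
    base = Path(output_dir)
    pdf_dir = base / "pdfs"
    seg_dir = base / "segments"
    corpus_path = base / "ipcc_corpus.json"

    # Step 1 is done if at least one PDF is present.
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))

    # Step 2 is done if the segments directory holds at least one file.
    # The segment file format is not assumed here; any regular file counts.
    has_segments = seg_dir.is_dir() and any(p.is_file() for p in seg_dir.iterdir())

    # Step 4 is done if the corpus parses as a non-empty JSON list.
    has_corpus = False
    if corpus_path.is_file():
        try:
            with open(corpus_path, "r", encoding="utf-8") as f:
                data = json.load(f)
            has_corpus = isinstance(data, list) and len(data) > 0
        except (ValueError, OSError):
            # Truncated or unreadable file: treat the step as not done.
            has_corpus = False

    return {"pdfs": has_pdfs, "segments": has_segments, "corpus": has_corpus}
```

Calling `detect_progress("./ipcc_data")` returns a dictionary with the same three keys the pipeline reads (`pdfs`, `segments`, `corpus`), so an interrupted session that left an empty `segments/` directory would trigger reprocessing instead of being skipped.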


In [None]:
!zip -r ipcc_pdfs.zip ipcc_data/pdfs
!zip -r ipcc_segments.zip ipcc_data/segments


  adding: ipcc_data/pdfs/ (stored 0%)
  adding: ipcc_data/pdfs/SROCC_06_SROCC_Ch04_FINAL.pdf (deflated 5%)
  adding: ipcc_data/pdfs/AR6_WG2_IPCC_AR6_WGII_Chapter10.pdf (deflated 11%)
  adding: ipcc_data/pdfs/AR6_WG2_IPCC_AR6_WGII_Chapter12.pdf (deflated 15%)
  adding: ipcc_data/pdfs/AR6_WG2_IPCC_AR6_WGII_Chapter05.pdf (deflated 14%)
  adding: ipcc_data/pdfs/SRCCL_Chapter-3_FINAL-1.pdf (deflated 7%)
  adding: ipcc_data/pdfs/SREX_SREX-Chap6_FINAL-1.pdf (deflated 25%)
  adding: ipcc_data/pdfs/SROCC_04_SROCC_Ch02_FINAL.pdf (deflated 6%)
  adding: ipcc_data/pdfs/SR15_SR15_Chapter_5_HR.pdf (deflated 15%)
  adding: ipcc_data/pdfs/AR6_WG2_IPCC_AR6_WGII_Chapter14.pdf (deflated 6%)
  adding: ipcc_data/pdfs/SR15_SR15_Chapter_1_LR.pdf (deflated 20%)
  adding: ipcc_data/pdfs/SREX_SREX-Chap4_FINAL-1.pdf (deflated 7%)
  adding: ipcc_data/pdfs/SRCCL_Chapter-1_FINAL-1.pdf (deflated 11%)
  adding: ipcc_data/pdfs/SREX_SREX-Chap8_FINAL-1.pdf (deflated 32%)
  adding: ipcc_data/pdfs/AR6_WG2_IPCC_AR6_WGII_Te