# Class Parser Workflow Test

This notebook reproduces the workflow from poc.ipynb step by step using the new src/ structure.
Test each component individually and then run the complete pipeline.

## Workflow Overview
1. **Data Extraction** - Extract slides and plan from PDF (deterministic)
2. **Branch A** - With plan: outline_two_pass → mapping_two_pass → enrich → writer
3. **Branch B** - No plan: outline_one_shot → enrich → writer
4. **Export** - Generate DOCX document


In [2]:
# Setup and imports
import os
import json
import logging
import sys
from pathlib import Path

# Configure logging for the src package
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)

# Set our package to DEBUG level for detailed progress
logging.getLogger("src").setLevel(logging.DEBUG)

# Load environment variables
import dotenv
dotenv.load_dotenv()

# Import our src modules
from src import (
    Course, CoursePipeline, Content, ContentSection, 
    MappingItem, SectionSlideMapping, Slides
)
from src.data_extraction import SlidesExtractor, PlanExtractor
from src.llm import (
    OutlineOneShot, OutlineTwoPass, MappingTwoPass, Writer, PromptManager
)
# Ensure src is importable when running the notebook directly
import sys
sys.path.append(str(Path('.').resolve()))

print("✅ All imports successful!")
print(f"📁 Working directory: {os.getcwd()}")

# Check if volume directory exists
volume_path = Path("./volume")
slides_path = volume_path / "slides"
print(f"📂 Volume directory exists: {volume_path.exists()}")
print(f"📂 Slides directory exists: {slides_path.exists()}")

if slides_path.exists():
    pdf_files = list(slides_path.glob("*.pdf"))
    print(f"📄 PDF files found: {len(pdf_files)}")
    for pdf in pdf_files:
        print(f"  - {pdf.name}")


✅ All imports successful!
📁 Working directory: /Users/youssefjanjar/Documents/formascience/class_parser
📂 Volume directory exists: True
📂 Slides directory exists: True
📄 PDF files found: 3
  - cours_4_plan.pdf
  - cours_1.pdf
  - cours_4.pdf


In [12]:
# Dev hot-reload setup (no kernel restart needed)
%load_ext autoreload
%autoreload 2

import sys, importlib
from pathlib import Path
# Ensure project root is on sys.path
sys.path.insert(0, str(Path('.').resolve()))

import src
import src.models as models
import src.llm as llm
import src.data_extraction as data_extraction
import src.course as course_mod
import src.pipeline as pipeline_mod


def refresh_imports():
    """Reload modules and rebind notebook aliases."""
    importlib.reload(src)
    importlib.reload(models)
    importlib.reload(llm)
    importlib.reload(data_extraction)
    importlib.reload(course_mod)
    importlib.reload(pipeline_mod)
    
    globals().update({
        # Models
        "Content": models.Content,
        "ContentSection": models.ContentSection,
        "MappingItem": models.MappingItem,
        "SectionSlideMapping": models.SectionSlideMapping,
        "Slides": models.Slides,
        "OutlineAndMapping": models.OutlineAndMapping,
        # LLM
        "OutlineOneShot": llm.OutlineOneShot,
        "OutlineTwoPass": llm.OutlineTwoPass,
        "MappingTwoPass": llm.MappingTwoPass,
        "Writer": llm.Writer,
        "PromptManager": llm.PromptManager,
        # Deterministic extractors
        "SlidesExtractor": data_extraction.SlidesExtractor,
        "PlanExtractor": data_extraction.PlanExtractor,
        # Orchestrators
        "Course": course_mod.Course,
        "CoursePipeline": pipeline_mod.CoursePipeline,
    })
    print("🔁 Imports refreshed")

# Initial bind
refresh_imports()


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
🔁 Imports refreshed


In [13]:
# Initialize optional variables to avoid NameError in later cells
branch_a_outline = None
branch_a_mapping = None
branch_b_outline = None
branch_b_mapping = None
enriched_for_writer = None
course_b = None
course_a = None
plan_text = None


## Step 1: Data Extraction - Slides

Extract slides from PDF using the exact logic from poc.ipynb


In [4]:
# Test slides extraction
pdf_path = "./volume/slides/cours_1.pdf"  # Update this path as needed

# Initialize extractor
slides_extractor = SlidesExtractor(
    min_avg_len=10,
    max_lines=20,
    merge_tol=2.0
)

print(f"🔍 Extracting slides from: {pdf_path}")

try:
    # Extract slides as Slides objects
    slides = slides_extractor.extract_slides(pdf_path)
    print(f"✅ Extracted {len(slides)} slides")
    
    # Display first few slides
    print("\n📋 First 3 slides:")
    for i, slide in enumerate(slides[:3]):
        print(f"\n--- Slide {i+1}: {slide.id} ---")
        print(f"Title: {slide.title}")
        print(f"Content (first 200 chars): {slide.content[:200]}...")
        print("=" * 50)
        
except FileNotFoundError:
    print(f"❌ PDF file not found: {pdf_path}")
    print("Please ensure the PDF file exists in ./volume/slides/")
    slides = []  # Empty list for testing
except Exception as e:
    print(f"❌ Error extracting slides: {e}")
    slides = []


🔍 Extracting slides from: ./volume/slides/cours_1.pdf


Cannot set gray non-stroke color because /'P412' is an invalid float value
Cannot set gray non-stroke color because /'P414' is an invalid float value
Cannot set gray non-stroke color because /'P444' is an invalid float value
Cannot set gray non-stroke color because /'P451' is an invalid float value
Cannot set gray non-stroke color because /'P455' is an invalid float value
Cannot set gray non-stroke color because /'P465' is an invalid float value
Cannot set gray non-stroke color because /'P467' is an invalid float value
Cannot set gray non-stroke color because /'P482' is an invalid float value
Cannot set gray non-stroke color because /'P488' is an invalid float value


✅ Extracted 110 slides

📋 First 3 slides:

--- Slide 1: SL_001 ---
Title: 7.1 Architecture du génome
Content (first 200 chars): L1SpS: UE 2 Les molécules du vivant
U.E.2 Les molécules du vivant
7. Organisation du génome humain, méthodes en
biotechnologie
7.1 Architecture du génome
humain
Jean Muller
Laboratoire Diagnostic Géné...

--- Slide 2: SL_002 ---
Title: Plan du cours
Content (first 200 chars): L1SpS: UE 2 Les molécules du vivant
Plan du cours
• Introduction
• Notions fondamentales
• Le génome humain
• Projet de séquençage
• Architecture globale
• Génome mitochondrial
• Génome nucléaire
• De...

--- Slide 3: SL_003 ---
Title: Introduction
Content (first 200 chars): L1SpS: UE 2 Les molécules du vivant
Introduction
3...


In [6]:
# Test raw slides extraction (poc.ipynb compatible format)
if slides:
    print("🔄 Testing raw slides extraction...")
    raw_slides = slides_extractor.extract_raw_slides(pdf_path)
    print(f"✅ Raw slides extracted: {len(raw_slides)}")
    
    # Verify format matches poc.ipynb
    if raw_slides:
        sample_slide = raw_slides[0]
        print(f"\n📋 Raw slide format:")
        print(f"  - Keys: {list(sample_slide.keys())}")
        print(f"  - ID: {sample_slide['id']}")
        print(f"  - Title: {sample_slide['title']}")
        print(f"  - Content length: {len(sample_slide['content'])} chars")


🔄 Testing raw slides extraction...


Cannot set gray non-stroke color because /'P412' is an invalid float value
Cannot set gray non-stroke color because /'P414' is an invalid float value
Cannot set gray non-stroke color because /'P444' is an invalid float value
Cannot set gray non-stroke color because /'P451' is an invalid float value
Cannot set gray non-stroke color because /'P455' is an invalid float value
Cannot set gray non-stroke color because /'P465' is an invalid float value
Cannot set gray non-stroke color because /'P467' is an invalid float value
Cannot set gray non-stroke color because /'P482' is an invalid float value
Cannot set gray non-stroke color because /'P488' is an invalid float value


✅ Raw slides extracted: 110

📋 Raw slide format:
  - Keys: ['id', 'title', 'content']
  - ID: SL_001
  - Title: 7.1 Architecture du génome
  - Content length: 282 chars


## Step 2: Data Extraction - Plan (Optional)

Extract plan text from PDF for Branch A (two-pass approach)


In [24]:
# Test plan extraction
plan_extractor = PlanExtractor()

# Option 1: Extract from specific page (like poc.ipynb page 2)
try:
    plan_text_page2 = plan_extractor.extract_plan_from_page(pdf_path, page_number=2)
    print(f"✅ Plan extracted from page 2")
    print(f"📄 Plan length: {len(plan_text_page2)} characters")
    print(f"\n📋 Plan preview (first 500 chars):")
    print(plan_text_page2 + "...")
    
    # Use this as our plan for Branch A testing
    plan_text = plan_text_page2
    
except Exception as e:
    print(f"❌ Error extracting plan from page 2: {e}")
    plan_text = None

print("\n" + "="*60)

# # Option 2: Extract full PDF as plan
# try:
#     full_plan_text = plan_extractor.extract_plan_from_pdf(pdf_path)
#     print(f"✅ Full PDF plan extracted")
#     print(f"📄 Full plan length: {len(full_plan_text)} characters")
    
# except Exception as e:
#     print(f"❌ Error extracting full plan: {e}")
#     full_plan_text = None


✅ Plan extracted from page 2
📄 Plan length: 1599 characters

📋 Plan preview (first 500 chars):
   L1SpS: UE 2 Les molécules du vivant                                                             
      Plan      du     cours                                                                       
        •  Introduction                                                                            
             •  Notions  fondamentales                                                             
        •  Le génome    humain                                                                     
             •  Projet de séquençage                                                               
             •  Architecture globale                                                               
                  •  Génome    mitochondrial                                                       
                  •  Génome    nucléaire                                                           
     

## Step 3: Branch B - One-Shot Approach (No Plan)

Test the one-shot outline and mapping generation when no plan is provided


In [14]:
slides

[Slides(id='SL_001', title='7.1 Architecture du génome', content='L1SpS: UE 2 Les molécules du vivant\nU.E.2 Les molécules du vivant\n7. Organisation du génome humain, méthodes en\nbiotechnologie\n7.1 Architecture du génome\nhumain\nJean Muller\nLaboratoire Diagnostic Génétique (HUS)\nLaboratoire de Génétique médicale (Inserm U1112)\njeanmuller@unistra.fr'),
 Slides(id='SL_002', title='Plan du cours', content='L1SpS: UE 2 Les molécules du vivant\nPlan du cours\n• Introduction\n• Notions fondamentales\n• Le génome humain\n• Projet de séquençage\n• Architecture globale\n• Génome mitochondrial\n• Génome nucléaire\n• Description des principaux éléments constituants\n• Gènes (codants et non-codants), pseudogènes et éléments répétés\n• Comparaison aux autres génomes\n• Les types de variations du génome humain et leurs conséquences\n• La variabilité du génome humain\n• Evolution des génomes: notions essentielles, mécanismes\n2'),
 Slides(id='SL_003', title='Introduction', content='L1SpS: UE 2

In [15]:
# Test Branch B: One-shot approach
if slides:
    print("🚀 Testing Branch B: One-shot outline and mapping generation")
    
    # Initialize one-shot generator
    one_shot = OutlineOneShot(model="gpt-5-mini")
    
    try:
        print("🤖 Generating outline and mapping in one shot...")
        outline_b, mapping_b = one_shot.build_outline_and_mapping(slides)  # Use first 10 slides for testing
        
        print(f"✅ One-shot generation successful!")
        print(f"📋 Outline sections: {len(outline_b.sections)}")
        print(f"🗺️ Mapping entries: {len(mapping_b.mapping)}")
        
        # Display outline structure
        print("\n📋 Generated Outline:")
        print(outline_b.print_outline())
        
        # Display mapping
        print("\n🗺️ Generated Mapping:")
        for item in mapping_b.mapping[:20]:  # Show first 5 mappings
            print(f"  {item.section_id} → {item.slide_ids}")
        
        # Store for later use
        branch_b_outline = outline_b
        branch_b_mapping = mapping_b
        
    except Exception as e:
        print(f"❌ Error in one-shot generation: {e}")
        branch_b_outline = None
        branch_b_mapping = None

        
else:
    print("⚠️ Skipping Branch B test - no slides available")
    branch_b_outline = None
    branch_b_mapping = None


🚀 Testing Branch B: One-shot outline and mapping generation
🤖 Generating outline and mapping in one shot...
outline=Content(sections=[ContentSection(id='SEC_1', title='Introduction et objectifs du cours sur le génome humain', content=[], subsections=[ContentSection(id='SEC_1.1', title='Présentation et organisation du cours', content=[], subsections=[]), ContentSection(id='SEC_1.2', title='Introduction au thème et contexte biologique', content=[], subsections=[])]), ContentSection(id='SEC_2', title='Notions fondamentales en évolution et homologie génétique', content=[], subsections=[ContentSection(id='SEC_2.1', title='Arbre de la vie, évolution et sélection naturelle', content=[], subsections=[]), ContentSection(id='SEC_2.2', title='Homologie, orthologie et paralogie: définitions et méthodes', content=[], subsections=[]), ContentSection(id='SEC_2.3', title='Dogme central et référentiels moléculaires', content=[], subsections=[])]), ContentSection(id='SEC_3', title='Le génome humain: str

In [16]:
# Display mapping
print("\n🗺️ Generated Mapping:")
for item in mapping_b.mapping[:20]:  # Show first 5 mappings
    print(f"  {item.section_id} → {item.slide_ids}")


🗺️ Generated Mapping:
  SEC_1 → ['SL_001']
  SEC_1.1 → ['SL_002']
  SEC_1.2 → ['SL_003']
  SEC_2 → ['SL_004']
  SEC_2.1 → ['SL_004', 'SL_005']
  SEC_2.2 → ['SL_006', 'SL_007', 'SL_008', 'SL_009', 'SL_010']
  SEC_2.3 → ['SL_011', 'SL_012']
  SEC_3 → ['SL_013']
  SEC_3.1 → ['SL_014', 'SL_015', 'SL_016', 'SL_017']
  SEC_3.2 → ['SL_018', 'SL_019', 'SL_020', 'SL_021', 'SL_022', 'SL_024', 'SL_025', 'SL_026', 'SL_027', 'SL_028', 'SL_029']
  SEC_3.3 → ['SL_030', 'SL_031', 'SL_032', 'SL_033', 'SL_034', 'SL_035']
  SEC_4 → ['SL_036']
  SEC_4.1 → ['SL_037', 'SL_038', 'SL_039', 'SL_040', 'SL_041', 'SL_042', 'SL_043', 'SL_044', 'SL_045', 'SL_046', 'SL_047']
  SEC_4.2 → ['SL_048']
  SEC_4.3 → ['SL_049', 'SL_050', 'SL_051', 'SL_052', 'SL_053']
  SEC_4.4 → ['SL_054', 'SL_055', 'SL_056', 'SL_057', 'SL_058', 'SL_059', 'SL_060', 'SL_061', 'SL_062', 'SL_063', 'SL_064', 'SL_065', 'SL_066']
  SEC_5 → ['SL_068']
  SEC_5.1 → ['SL_067', 'SL_069', 'SL_070', 'SL_071', 'SL_072']
  SEC_5.2 → ['SL_073']
  SEC_6 → 

## Step 3.5: Content Enrichment (using Content.enrich_with_slides)

Note: No separate ContentEnricher class is needed; enrichment is a method on Content.


In [17]:
# Demonstrate enrichment method directly on Content
# Use whatever outline/mapping you have (prefer Branch B if available)

if 'branch_b_outline' in locals() and branch_b_outline and 'branch_b_mapping' in locals() and branch_b_mapping and slides:
    enriched_demo = branch_b_outline.enrich_with_slides(slides, branch_b_mapping)
    print("✅ Enrichment via Content.enrich_with_slides complete (Branch B)")
    print(f"Sections enriched: {len(enriched_demo.sections)}")
elif 'branch_a_outline' in locals() and branch_a_outline and 'branch_a_mapping' in locals() and branch_a_mapping and slides:
    enriched_demo = branch_a_outline.enrich_with_slides(slides, branch_a_mapping)
    print("✅ Enrichment via Content.enrich_with_slides complete (Branch A)")
    print(f"Sections enriched: {len(enriched_demo.sections)}")
else:
    print("⚠️ Skipping enrichment demo - no outline/mapping available")
    enriched_demo = None


✅ Enrichment via Content.enrich_with_slides complete (Branch B)
Sections enriched: 9


In [None]:
type(enriched_demo)

writer = Writer()
written_course = writer.write_course(enriched_demo)

print(written_course.print_content())

## Step 4: Complete Pipeline Test

Test the full CoursePipeline end-to-end for both branches


In [26]:
print(plan_text)

   L1SpS: UE 2 Les molécules du vivant                                                             
      Plan      du     cours                                                                       
        •  Introduction                                                                            
             •  Notions  fondamentales                                                             
        •  Le génome    humain                                                                     
             •  Projet de séquençage                                                               
             •  Architecture globale                                                               
                  •  Génome    mitochondrial                                                       
                  •  Génome    nucléaire                                                           
             •  Description  des principaux  éléments   constituants                               


In [27]:
two_pass = OutlineTwoPass()

outline = two_pass.build_outline(plan_text)

print(outline.print_outline())

Course Outline:
Introduction
  Notions fondamentales
Le génome humain
  Projet de séquençage
  Architecture globale
    Génome mitochondrial
    Génome nucléaire
  Description des principaux éléments constituants
    Gènes (codants et non-codants), pseudogènes et éléments répétés
Comparaison aux autres génomes
Les types de variations du génome humain et leurs conséquences
La variabilité du génome humain
Evolution des génomes: notions essentielles, mécanismes


In [None]:
mapping_two_pass = MappingTwoPass()

mapping = mapping_two_pass.build_mapping(slides, outline)


In [None]:
print(mapping.visualize_mapping(outline))

In [None]:
course_with_slides = outline.enrich_with_slides(slides, mapping)
print(course_with_slides.print_content())

In [None]:
writer = Writer()

final_course = writer.write_course(course_with_slides)

print(final_course.print_content())

In [36]:
print(final_course.print_content())

Course Content:

[Root] ID: SEC_1
Title: Introduction
Content:
  [1] Ce module propose une introduction générale aux molécules du vivant en
      prenant le génome humain comme point d'appui pour aborder notions
      fondamentales, architecture moléculaire, variabilité et mécanismes
      évolutifs. Le plan couvre successivement les concepts de base, la
      description du génome, l'histoire du séquençage, l'organisation nucléaire
      et mitochondriale, les différents types d'éléments composant le génome,
      les variations et leur interprétation, ainsi que les processus d'évolution
      génomique.
  [2] Le génome constitue l'une des références biologiques majeures d'un
      organisme et s'articule avec d'autres couches d'information fonctionnelle
      : le transcriptome (ensemble des ARN exprimés), le méthylome (marquages
      épigénétiques) et le protéome (ensemble des protéines). L'étude conjointe
      de ces niveaux permet de comprendre la régulation des gènes, l'express

## Step 5: Summary and Production Example

Summary of all tests and example of how to use the system in production
