# Class Parser Workflow Test

This notebook reproduces the workflow from poc.ipynb step by step using the new src/ structure.
Test each component individually and then run the complete pipeline.

## Workflow Overview
1. **Data Extraction** - Extract slides and plan from PDF (deterministic)
2. **Branch A** - With plan: outline_two_pass ‚Üí mapping_two_pass ‚Üí enrich ‚Üí writer
3. **Branch B** - No plan: outline_one_shot ‚Üí enrich ‚Üí writer
4. **Export** - Generate DOCX document


In [2]:
# Setup and imports
import os
import json
import logging
import sys
from pathlib import Path

# Configure logging for the src package
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)

# Set our package to DEBUG level for detailed progress
logging.getLogger("src").setLevel(logging.DEBUG)

# Load environment variables
import dotenv
dotenv.load_dotenv()

# Import our src modules
from src import (
    Course, CoursePipeline, Content, ContentSection, 
    MappingItem, SectionSlideMapping, Slides
)
from src.data_extraction import SlidesExtractor, PlanExtractor
from src.llm import (
    OutlineOneShot, OutlineTwoPass, MappingTwoPass, Writer, PromptManager
)
# Ensure src is importable when running the notebook directly
import sys
sys.path.append(str(Path('.').resolve()))

print("‚úÖ All imports successful!")
print(f"üìÅ Working directory: {os.getcwd()}")

# Check if volume directory exists
volume_path = Path("./volume")
slides_path = volume_path / "slides"
print(f"üìÇ Volume directory exists: {volume_path.exists()}")
print(f"üìÇ Slides directory exists: {slides_path.exists()}")

if slides_path.exists():
    pdf_files = list(slides_path.glob("*.pdf"))
    print(f"üìÑ PDF files found: {len(pdf_files)}")
    for pdf in pdf_files:
        print(f"  - {pdf.name}")


‚úÖ All imports successful!
üìÅ Working directory: /Users/youssefjanjar/Documents/formascience/class_parser
üìÇ Volume directory exists: True
üìÇ Slides directory exists: True
üìÑ PDF files found: 3
  - cours_4_plan.pdf
  - cours_1.pdf
  - cours_4.pdf


In [12]:
# Dev hot-reload setup (no kernel restart needed)
%load_ext autoreload
%autoreload 2

import sys, importlib
from pathlib import Path
# Ensure project root is on sys.path
sys.path.insert(0, str(Path('.').resolve()))

import src
import src.models as models
import src.llm as llm
import src.data_extraction as data_extraction
import src.course as course_mod
import src.pipeline as pipeline_mod


def refresh_imports():
    """Reload modules and rebind notebook aliases."""
    importlib.reload(src)
    importlib.reload(models)
    importlib.reload(llm)
    importlib.reload(data_extraction)
    importlib.reload(course_mod)
    importlib.reload(pipeline_mod)
    
    globals().update({
        # Models
        "Content": models.Content,
        "ContentSection": models.ContentSection,
        "MappingItem": models.MappingItem,
        "SectionSlideMapping": models.SectionSlideMapping,
        "Slides": models.Slides,
        "OutlineAndMapping": models.OutlineAndMapping,
        # LLM
        "OutlineOneShot": llm.OutlineOneShot,
        "OutlineTwoPass": llm.OutlineTwoPass,
        "MappingTwoPass": llm.MappingTwoPass,
        "Writer": llm.Writer,
        "PromptManager": llm.PromptManager,
        # Deterministic extractors
        "SlidesExtractor": data_extraction.SlidesExtractor,
        "PlanExtractor": data_extraction.PlanExtractor,
        # Orchestrators
        "Course": course_mod.Course,
        "CoursePipeline": pipeline_mod.CoursePipeline,
    })
    print("üîÅ Imports refreshed")

# Initial bind
refresh_imports()


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
üîÅ Imports refreshed


In [13]:
# Initialize optional variables to avoid NameError in later cells
branch_a_outline = None
branch_a_mapping = None
branch_b_outline = None
branch_b_mapping = None
enriched_for_writer = None
course_b = None
course_a = None
plan_text = None


## Step 1: Data Extraction - Slides

Extract slides from PDF using the exact logic from poc.ipynb


In [4]:
# Test slides extraction
pdf_path = "./volume/slides/cours_1.pdf"  # Update this path as needed

# Initialize extractor
slides_extractor = SlidesExtractor(
    min_avg_len=10,
    max_lines=20,
    merge_tol=2.0
)

print(f"üîç Extracting slides from: {pdf_path}")

try:
    # Extract slides as Slides objects
    slides = slides_extractor.extract_slides(pdf_path)
    print(f"‚úÖ Extracted {len(slides)} slides")
    
    # Display first few slides
    print("\nüìã First 3 slides:")
    for i, slide in enumerate(slides[:3]):
        print(f"\n--- Slide {i+1}: {slide.id} ---")
        print(f"Title: {slide.title}")
        print(f"Content (first 200 chars): {slide.content[:200]}...")
        print("=" * 50)
        
except FileNotFoundError:
    print(f"‚ùå PDF file not found: {pdf_path}")
    print("Please ensure the PDF file exists in ./volume/slides/")
    slides = []  # Empty list for testing
except Exception as e:
    print(f"‚ùå Error extracting slides: {e}")
    slides = []


üîç Extracting slides from: ./volume/slides/cours_1.pdf


Cannot set gray non-stroke color because /'P412' is an invalid float value
Cannot set gray non-stroke color because /'P414' is an invalid float value
Cannot set gray non-stroke color because /'P444' is an invalid float value
Cannot set gray non-stroke color because /'P451' is an invalid float value
Cannot set gray non-stroke color because /'P455' is an invalid float value
Cannot set gray non-stroke color because /'P465' is an invalid float value
Cannot set gray non-stroke color because /'P467' is an invalid float value
Cannot set gray non-stroke color because /'P482' is an invalid float value
Cannot set gray non-stroke color because /'P488' is an invalid float value


‚úÖ Extracted 110 slides

üìã First 3 slides:

--- Slide 1: SL_001 ---
Title: 7.1 Architecture du g√©nome
Content (first 200 chars): L1SpS: UE 2 Les mol√©cules du vivant
U.E.2 Les mol√©cules du vivant
7. Organisation du g√©nome humain, m√©thodes en
biotechnologie
7.1 Architecture du g√©nome
humain
Jean Muller
Laboratoire Diagnostic G√©n√©...

--- Slide 2: SL_002 ---
Title: Plan du cours
Content (first 200 chars): L1SpS: UE 2 Les mol√©cules du vivant
Plan du cours
‚Ä¢ Introduction
‚Ä¢ Notions fondamentales
‚Ä¢ Le g√©nome humain
‚Ä¢ Projet de s√©quen√ßage
‚Ä¢ Architecture globale
‚Ä¢ G√©nome mitochondrial
‚Ä¢ G√©nome nucl√©aire
‚Ä¢ De...

--- Slide 3: SL_003 ---
Title: Introduction
Content (first 200 chars): L1SpS: UE 2 Les mol√©cules du vivant
Introduction
3...


In [6]:
# Test raw slides extraction (poc.ipynb compatible format)
if slides:
    print("üîÑ Testing raw slides extraction...")
    raw_slides = slides_extractor.extract_raw_slides(pdf_path)
    print(f"‚úÖ Raw slides extracted: {len(raw_slides)}")
    
    # Verify format matches poc.ipynb
    if raw_slides:
        sample_slide = raw_slides[0]
        print(f"\nüìã Raw slide format:")
        print(f"  - Keys: {list(sample_slide.keys())}")
        print(f"  - ID: {sample_slide['id']}")
        print(f"  - Title: {sample_slide['title']}")
        print(f"  - Content length: {len(sample_slide['content'])} chars")


üîÑ Testing raw slides extraction...


Cannot set gray non-stroke color because /'P412' is an invalid float value
Cannot set gray non-stroke color because /'P414' is an invalid float value
Cannot set gray non-stroke color because /'P444' is an invalid float value
Cannot set gray non-stroke color because /'P451' is an invalid float value
Cannot set gray non-stroke color because /'P455' is an invalid float value
Cannot set gray non-stroke color because /'P465' is an invalid float value
Cannot set gray non-stroke color because /'P467' is an invalid float value
Cannot set gray non-stroke color because /'P482' is an invalid float value
Cannot set gray non-stroke color because /'P488' is an invalid float value


‚úÖ Raw slides extracted: 110

üìã Raw slide format:
  - Keys: ['id', 'title', 'content']
  - ID: SL_001
  - Title: 7.1 Architecture du g√©nome
  - Content length: 282 chars


## Step 2: Data Extraction - Plan (Optional)

Extract plan text from PDF for Branch A (two-pass approach)


In [24]:
# Test plan extraction
plan_extractor = PlanExtractor()

# Option 1: Extract from specific page (like poc.ipynb page 2)
try:
    plan_text_page2 = plan_extractor.extract_plan_from_page(pdf_path, page_number=2)
    print(f"‚úÖ Plan extracted from page 2")
    print(f"üìÑ Plan length: {len(plan_text_page2)} characters")
    print(f"\nüìã Plan preview (first 500 chars):")
    print(plan_text_page2 + "...")
    
    # Use this as our plan for Branch A testing
    plan_text = plan_text_page2
    
except Exception as e:
    print(f"‚ùå Error extracting plan from page 2: {e}")
    plan_text = None

print("\n" + "="*60)

# # Option 2: Extract full PDF as plan
# try:
#     full_plan_text = plan_extractor.extract_plan_from_pdf(pdf_path)
#     print(f"‚úÖ Full PDF plan extracted")
#     print(f"üìÑ Full plan length: {len(full_plan_text)} characters")
    
# except Exception as e:
#     print(f"‚ùå Error extracting full plan: {e}")
#     full_plan_text = None


‚úÖ Plan extracted from page 2
üìÑ Plan length: 1599 characters

üìã Plan preview (first 500 chars):
   L1SpS: UE 2 Les mol√©cules du vivant                                                             
      Plan      du     cours                                                                       
        ‚Ä¢  Introduction                                                                            
             ‚Ä¢  Notions  fondamentales                                                             
        ‚Ä¢  Le g√©nome    humain                                                                     
             ‚Ä¢  Projet de s√©quen√ßage                                                               
             ‚Ä¢  Architecture globale                                                               
                  ‚Ä¢  G√©nome    mitochondrial                                                       
                  ‚Ä¢  G√©nome    nucl√©aire                                    

## Step 3: Branch B - One-Shot Approach (No Plan)

Test the one-shot outline and mapping generation when no plan is provided


In [14]:
slides

[Slides(id='SL_001', title='7.1 Architecture du g√©nome', content='L1SpS: UE 2 Les mol√©cules du vivant\nU.E.2 Les mol√©cules du vivant\n7. Organisation du g√©nome humain, m√©thodes en\nbiotechnologie\n7.1 Architecture du g√©nome\nhumain\nJean Muller\nLaboratoire Diagnostic G√©n√©tique (HUS)\nLaboratoire de G√©n√©tique m√©dicale (Inserm U1112)\njeanmuller@unistra.fr'),
 Slides(id='SL_002', title='Plan du cours', content='L1SpS: UE 2 Les mol√©cules du vivant\nPlan du cours\n‚Ä¢ Introduction\n‚Ä¢ Notions fondamentales\n‚Ä¢ Le g√©nome humain\n‚Ä¢ Projet de s√©quen√ßage\n‚Ä¢ Architecture globale\n‚Ä¢ G√©nome mitochondrial\n‚Ä¢ G√©nome nucl√©aire\n‚Ä¢ Description des principaux √©l√©ments constituants\n‚Ä¢ G√®nes (codants et non-codants), pseudog√®nes et √©l√©ments r√©p√©t√©s\n‚Ä¢ Comparaison aux autres g√©nomes\n‚Ä¢ Les types de variations du g√©nome humain et leurs cons√©quences\n‚Ä¢ La variabilit√© du g√©nome humain\n‚Ä¢ Evolution des g√©nomes: notions essentielles, m√©canismes\n2'),
 Sl

In [15]:
# Test Branch B: One-shot approach
if slides:
    print("üöÄ Testing Branch B: One-shot outline and mapping generation")
    
    # Initialize one-shot generator
    one_shot = OutlineOneShot(model="gpt-5-mini")
    
    try:
        print("ü§ñ Generating outline and mapping in one shot...")
        outline_b, mapping_b = one_shot.build_outline_and_mapping(slides)  # Use first 10 slides for testing
        
        print(f"‚úÖ One-shot generation successful!")
        print(f"üìã Outline sections: {len(outline_b.sections)}")
        print(f"üó∫Ô∏è Mapping entries: {len(mapping_b.mapping)}")
        
        # Display outline structure
        print("\nüìã Generated Outline:")
        print(outline_b.print_outline())
        
        # Display mapping
        print("\nüó∫Ô∏è Generated Mapping:")
        for item in mapping_b.mapping[:20]:  # Show first 5 mappings
            print(f"  {item.section_id} ‚Üí {item.slide_ids}")
        
        # Store for later use
        branch_b_outline = outline_b
        branch_b_mapping = mapping_b
        
    except Exception as e:
        print(f"‚ùå Error in one-shot generation: {e}")
        branch_b_outline = None
        branch_b_mapping = None

        
else:
    print("‚ö†Ô∏è Skipping Branch B test - no slides available")
    branch_b_outline = None
    branch_b_mapping = None


üöÄ Testing Branch B: One-shot outline and mapping generation
ü§ñ Generating outline and mapping in one shot...
outline=Content(sections=[ContentSection(id='SEC_1', title='Introduction et objectifs du cours sur le g√©nome humain', content=[], subsections=[ContentSection(id='SEC_1.1', title='Pr√©sentation et organisation du cours', content=[], subsections=[]), ContentSection(id='SEC_1.2', title='Introduction au th√®me et contexte biologique', content=[], subsections=[])]), ContentSection(id='SEC_2', title='Notions fondamentales en √©volution et homologie g√©n√©tique', content=[], subsections=[ContentSection(id='SEC_2.1', title='Arbre de la vie, √©volution et s√©lection naturelle', content=[], subsections=[]), ContentSection(id='SEC_2.2', title='Homologie, orthologie et paralogie: d√©finitions et m√©thodes', content=[], subsections=[]), ContentSection(id='SEC_2.3', title='Dogme central et r√©f√©rentiels mol√©culaires', content=[], subsections=[])]), ContentSection(id='SEC_3', title='Le

In [16]:
# Display mapping
print("\nüó∫Ô∏è Generated Mapping:")
for item in mapping_b.mapping[:20]:  # Show first 5 mappings
    print(f"  {item.section_id} ‚Üí {item.slide_ids}")


üó∫Ô∏è Generated Mapping:
  SEC_1 ‚Üí ['SL_001']
  SEC_1.1 ‚Üí ['SL_002']
  SEC_1.2 ‚Üí ['SL_003']
  SEC_2 ‚Üí ['SL_004']
  SEC_2.1 ‚Üí ['SL_004', 'SL_005']
  SEC_2.2 ‚Üí ['SL_006', 'SL_007', 'SL_008', 'SL_009', 'SL_010']
  SEC_2.3 ‚Üí ['SL_011', 'SL_012']
  SEC_3 ‚Üí ['SL_013']
  SEC_3.1 ‚Üí ['SL_014', 'SL_015', 'SL_016', 'SL_017']
  SEC_3.2 ‚Üí ['SL_018', 'SL_019', 'SL_020', 'SL_021', 'SL_022', 'SL_024', 'SL_025', 'SL_026', 'SL_027', 'SL_028', 'SL_029']
  SEC_3.3 ‚Üí ['SL_030', 'SL_031', 'SL_032', 'SL_033', 'SL_034', 'SL_035']
  SEC_4 ‚Üí ['SL_036']
  SEC_4.1 ‚Üí ['SL_037', 'SL_038', 'SL_039', 'SL_040', 'SL_041', 'SL_042', 'SL_043', 'SL_044', 'SL_045', 'SL_046', 'SL_047']
  SEC_4.2 ‚Üí ['SL_048']
  SEC_4.3 ‚Üí ['SL_049', 'SL_050', 'SL_051', 'SL_052', 'SL_053']
  SEC_4.4 ‚Üí ['SL_054', 'SL_055', 'SL_056', 'SL_057', 'SL_058', 'SL_059', 'SL_060', 'SL_061', 'SL_062', 'SL_063', 'SL_064', 'SL_065', 'SL_066']
  SEC_5 ‚Üí ['SL_068']
  SEC_5.1 ‚Üí ['SL_067', 'SL_069', 'SL_070', 'SL_071', 'S

## Step 3.5: Content Enrichment (using Content.enrich_with_slides)

Note: No separate ContentEnricher class is needed; enrichment is a method on Content.


In [17]:
# Demonstrate enrichment method directly on Content
# Use whatever outline/mapping you have (prefer Branch B if available)

if 'branch_b_outline' in locals() and branch_b_outline and 'branch_b_mapping' in locals() and branch_b_mapping and slides:
    enriched_demo = branch_b_outline.enrich_with_slides(slides, branch_b_mapping)
    print("‚úÖ Enrichment via Content.enrich_with_slides complete (Branch B)")
    print(f"Sections enriched: {len(enriched_demo.sections)}")
elif 'branch_a_outline' in locals() and branch_a_outline and 'branch_a_mapping' in locals() and branch_a_mapping and slides:
    enriched_demo = branch_a_outline.enrich_with_slides(slides, branch_a_mapping)
    print("‚úÖ Enrichment via Content.enrich_with_slides complete (Branch A)")
    print(f"Sections enriched: {len(enriched_demo.sections)}")
else:
    print("‚ö†Ô∏è Skipping enrichment demo - no outline/mapping available")
    enriched_demo = None


‚úÖ Enrichment via Content.enrich_with_slides complete (Branch B)
Sections enriched: 9


In [None]:
type(enriched_demo)

writer = Writer()
written_course = writer.write_course(enriched_demo)

print(written_course.print_content())

## Step 4: Complete Pipeline Test

Test the full CoursePipeline end-to-end for both branches


In [26]:
print(plan_text)

   L1SpS: UE 2 Les mol√©cules du vivant                                                             
      Plan      du     cours                                                                       
        ‚Ä¢  Introduction                                                                            
             ‚Ä¢  Notions  fondamentales                                                             
        ‚Ä¢  Le g√©nome    humain                                                                     
             ‚Ä¢  Projet de s√©quen√ßage                                                               
             ‚Ä¢  Architecture globale                                                               
                  ‚Ä¢  G√©nome    mitochondrial                                                       
                  ‚Ä¢  G√©nome    nucl√©aire                                                           
             ‚Ä¢  Description  des principaux  √©l√©ments   constituants       

In [27]:
two_pass = OutlineTwoPass()

outline = two_pass.build_outline(plan_text)

print(outline.print_outline())

Course Outline:
Introduction
  Notions fondamentales
Le g√©nome humain
  Projet de s√©quen√ßage
  Architecture globale
    G√©nome mitochondrial
    G√©nome nucl√©aire
  Description des principaux √©l√©ments constituants
    G√®nes (codants et non-codants), pseudog√®nes et √©l√©ments r√©p√©t√©s
Comparaison aux autres g√©nomes
Les types de variations du g√©nome humain et leurs cons√©quences
La variabilit√© du g√©nome humain
Evolution des g√©nomes: notions essentielles, m√©canismes


In [None]:
mapping_two_pass = MappingTwoPass()

mapping = mapping_two_pass.build_mapping(slides, outline)


In [None]:
print(mapping.visualize_mapping(outline))

In [None]:
course_with_slides = outline.enrich_with_slides(slides, mapping)
print(course_with_slides.print_content())

In [None]:
writer = Writer()

final_course = writer.write_course(course_with_slides)

print(final_course.print_content())

In [36]:
print(final_course.print_content())

Course Content:

[Root] ID: SEC_1
Title: Introduction
Content:
  [1] Ce module propose une introduction g√©n√©rale aux mol√©cules du vivant en
      prenant le g√©nome humain comme point d'appui pour aborder notions
      fondamentales, architecture mol√©culaire, variabilit√© et m√©canismes
      √©volutifs. Le plan couvre successivement les concepts de base, la
      description du g√©nome, l'histoire du s√©quen√ßage, l'organisation nucl√©aire
      et mitochondriale, les diff√©rents types d'√©l√©ments composant le g√©nome,
      les variations et leur interpr√©tation, ainsi que les processus d'√©volution
      g√©nomique.
  [2] Le g√©nome constitue l'une des r√©f√©rences biologiques majeures d'un
      organisme et s'articule avec d'autres couches d'information fonctionnelle
      : le transcriptome (ensemble des ARN exprim√©s), le m√©thylome (marquages
      √©pig√©n√©tiques) et le prot√©ome (ensemble des prot√©ines). L'√©tude conjointe
      de ces niveaux permet de comprendre la r

## Step 5: Summary and Production Example

Summary of all tests and example of how to use the system in production
