# Class Parser Workflow Test

This notebook reproduces the workflow from poc.ipynb step by step using the new src/ structure.
Test each component individually and then run the complete pipeline.

## Workflow Overview
1. **Data Extraction** - Extract slides and plan from PDF (deterministic)
2. **Branch A** - With plan: outline_two_pass → mapping_two_pass → enrich → writer
3. **Branch B** - No plan: outline_one_shot → enrich → writer
4. **Export** - Generate DOCX document
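
The branching in the overview can be sketched as follows. This is a minimal illustration, not the project's implementation: `run_workflow`, `trace`, and the step callables are hypothetical stand-ins for the outline, mapping, enrichment, and writer steps. Only the order of the steps is taken from the list above.

```python
from typing import Callable, Optional, Sequence


def run_workflow(
    slides: Sequence[str],
    plan_text: Optional[str],
    outline_from_plan: Callable,  # stand-in for the two-pass outline step
    map_slides: Callable,         # stand-in for the two-pass mapping step
    outline_one_shot: Callable,   # stand-in for the one-shot step
    enrich: Callable,
    write: Callable,
):
    """Pick Branch A when a plan is available, otherwise Branch B."""
    if plan_text:
        # Branch A: build the outline from the plan, then map slides onto it
        outline = outline_from_plan(plan_text)
        mapping = map_slides(slides, outline)
    else:
        # Branch B: a single call returns both the outline and the mapping
        outline, mapping = outline_one_shot(slides)
    return write(enrich(outline, slides, mapping))


def trace(plan_text):
    """Run the workflow with stub steps and record the order of the calls."""
    calls = []

    def step(name, result):
        def _run(*args):
            calls.append(name)
            return result
        return _run

    run_workflow(
        slides=["SL_001"],
        plan_text=plan_text,
        outline_from_plan=step("outline_two_pass", "outline"),
        map_slides=step("mapping_two_pass", "mapping"),
        outline_one_shot=step("outline_one_shot", ("outline", "mapping")),
        enrich=step("enrich", "enriched"),
        write=step("writer", "course"),
    )
    return calls


print(trace("Plan du cours"))  # Branch A
print(trace(None))             # Branch B
```

Passing the steps in as callables keeps the sketch independent of any LLM client, so the branching itself can be checked without network access.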


In [1]:
# Dev hot-reload setup (no kernel restart needed)
%load_ext autoreload
%autoreload 2

import sys, importlib
from pathlib import Path
# Ensure project root is on sys.path
sys.path.insert(0, str(Path('.').resolve()))

import src
import src.models as models
import src.llm as llm
import src.data_extraction as data_extraction
import src.course as course_mod
import src.pipeline as pipeline_mod


def refresh_imports():
    """Reload modules and rebind notebook aliases."""
    importlib.reload(src)
    importlib.reload(models)
    importlib.reload(llm)
    importlib.reload(data_extraction)
    importlib.reload(course_mod)
    importlib.reload(pipeline_mod)
    
    globals().update({
        # Models
        "Content": models.Content,
        "ContentSection": models.ContentSection,
        "MappingItem": models.MappingItem,
        "SectionSlideMapping": models.SectionSlideMapping,
        "Slides": models.Slides,
        "OutlineAndMapping": models.OutlineAndMapping,
        # LLM
        "OutlineOneShot": llm.OutlineOneShot,
        "OutlineTwoPass": llm.OutlineTwoPass,
        "MappingTwoPass": llm.MappingTwoPass,
        "Writer": llm.Writer,
        "PromptManager": llm.PromptManager,
        # Deterministic extractors
        "SlidesExtractor": data_extraction.SlidesExtractor,
        "PlanExtractor": data_extraction.PlanExtractor,
        # Orchestrators
        "Course": course_mod.Course,
        "CoursePipeline": pipeline_mod.CoursePipeline,
    })
    print("🔁 Imports refreshed")

# Initial bind
refresh_imports()


🔁 Imports refreshed


## Step 1: Data Extraction - Slides

Extract slides from PDF using the exact logic from poc.ipynb


In [2]:
import os

slides_path = "/Users/youssefjanjar/Documents/formascience/class_parser/volume/slides/"
chapters = os.listdir(slides_path)
for index, chapter in enumerate(chapters):
    print(f"id {index} : {chapter}")

id 0 : cours_4_plan.pdf
id 1 : .DS_Store
id 2 : Ch6-Code génétique et traduction.pdf
id 3 : cours_1.pdf
id 4 : cours_4.pdf
id 5 : Ch1-Sucres et Lipides.pdf
id 6 : Ch3_Métabolisme des glucides.pdf
id 7 : Ch2-Acides aminés et protéines.pdf
id 8 : Ch7.1-Génome humain.pdf
id 9 : Ch7.2-Techniques de biologie moléculaire.pdf
id 10 : Ch4-Plan du polycopié.docx
id 11 : Ch4-Structure, réplication et réparation de l_ADN.pdf
id 12 : Ch5-Transcription, maturation et régulation.pdf
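
The listing above includes non-PDF entries (`.DS_Store`, a `.docx`), and `os.listdir` returns entries in arbitrary order, so an index such as `chapters[2]` may point at a different file on another machine. A minimal sketch of a more stable listing, using only the standard library; `list_pdfs` is a hypothetical helper, not part of `src/`:

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by file name."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )
```

With such a helper, a chapter could be selected by name (for example, the first path whose name starts with `Ch6`) rather than by position.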


In [3]:
# Test slides extraction
pdf_path = slides_path+chapters[2]

# Initialize extractor
slides_extractor = SlidesExtractor(
    min_avg_len=10,
    max_lines=20,
    merge_tol=2.0
)

print(f"🔍 Extracting slides from: {pdf_path}")

try:
    # Extract slides as Slides objects
    slides = slides_extractor.extract_slides(pdf_path)
    print(f"✅ Extracted {len(slides)} slides")

    # Display first few slides
    print("\n📋 First 3 slides:")
    for i, slide in enumerate(slides[:3]):
        print(f"\n--- Slide {i+1}: {slide.id} ---")
        print(f"Title: {slide.title}")
        print(f"Content (first 200 chars): {slide.content[:200]}...")
        print("=" * 50)

except FileNotFoundError:
    print(f"❌ PDF file not found: {pdf_path}")
    print("Please ensure the PDF file exists in ./volume/slides/")
    slides = []  # Empty list for testing
except Exception as e:
    print(f"❌ Error extracting slides: {e}")
    slides = []


🔍 Extracting slides from: /Users/youssefjanjar/Documents/formascience/class_parser/volume/slides/Ch6-Code génétique et traduction.pdf
✅ Extracted 68 slides

📋 First 3 slides:

--- Slide 1: SL_001 ---
Title: Code génétique et traduction
Content (first 200 chars): U.E.2 2024
Chapitre 6
Code génétique et traduction
Philippe KASTNER
Texte en caractères normaux: connaissances requises pour l’examen
Texte en italiques: non requis pour l’examen...

--- Slide 2: SL_002 ---
Title: Première partie: le code génétique
Content (first 200 chars): Première partie: le code génétique
I. Nature du code génétique
II. Mécanisme de déchiffrage: les ARN de transfert
III. Les aminoacyl-tRNA synthétases...

--- Slide 3: SL_003 ---
Title: I. Nature du code génétique
Content (first 200 chars): I. Nature du code génétique
II. Mécanisme de déchiffrage: les ARN de transfert
III. Les aminoacyl-tRNA synthétases...


In [4]:
import json
import tiktoken

# Dump slides to JSON
slides_json = json.dumps([slide.model_dump() for slide in slides], ensure_ascii=False, indent=2)

# Calculate tokens using GPT-4 encoding (tiktoken doesn't have GPT-5 yet, using GPT-4 as closest)
encoding = tiktoken.encoding_for_model("gpt-4")
token_count = len(encoding.encode(slides_json))

print("📊 Slides JSON dump:")
print(f"Number of slides: {len(slides)}")
print(f"JSON size: {len(slides_json)} characters")
print(f"Estimated tokens (GPT-4 encoding): {token_count}")

# Display the JSON (truncated for readability)
print("\n📋 JSON content (first 1000 chars):")
print(slides_json[:1000] + "..." if len(slides_json) > 1000 else slides_json)

📊 Slides JSON dump:
Number of slides: 68
JSON size: 36152 characters
Estimated tokens (GPT-4 encoding): 11539

📋 JSON content (first 1000 chars):
[
  {
    "id": "SL_001",
    "title": "Code génétique et traduction",
    "content": "U.E.2 2024\nChapitre 6\nCode génétique et traduction\nPhilippe KASTNER\nTexte en caractères normaux: connaissances requises pour l’examen\nTexte en italiques: non requis pour l’examen"
  },
  {
    "id": "SL_002",
    "title": "Première partie: le code génétique",
    "content": "Première partie: le code génétique\nI. Nature du code génétique\nII. Mécanisme de déchiffrage: les ARN de transfert\nIII. Les aminoacyl-tRNA synthétases"
  },
  {
    "id": "SL_003",
    "title": "I. Nature du code génétique",
    "content": "I. Nature du code génétique\nII. Mécanisme de déchiffrage: les ARN de transfert\nIII. Les aminoacyl-tRNA synthétases"
  },
  {
    "id": "SL_004",
    "title": "Découverte du code génétique",
    "content": 

# Single Shot

# Single shot: Extract the plan for this class

In [5]:
outline_one_shot = OutlineOneShot()

outline, mapping = outline_one_shot.build_outline_and_mapping(slides)

print(outline.print_outline())

Course Outline:
Nature du code génétique
  Découverte et preuves expérimentales du code triplet
  Propriétés du code (triplets, dégénérescence, codons initiateur/stop, universalité)
  Variations et exceptions au code (mitochondrial, sélénocystéine)
  Structure du message codant et conséquences des mutations (phases de lecture)
Décodage : ARN de transfert, appariement codon‑anticodon et modifications
  Fonction et architecture des ARNt (anticodon, extrémité 3’, structure 3D)
  Modifications post‑transcriptionnelles des ARN (nucléosides tRNA, rRNA, pseudo‑uridine en ARNm)
  Appariements bancals (wobble) et correspondance codon‑anticodon
  Biais d’usage des codons (codon usage bias, abondance d’ARNt)
  Aminoacyl‑tRNA synthétases : activation, greffe et édition (fidelité)
Structure et biogenèse du ribosome
  Organisation des sous‑unités, ARNr et sites fonctionnels (A, P, E)
  Modifications des ARNr et implications fonctionnelles
  Biogenèse et asse

In [6]:
print(mapping.visualize_mapping(outline))

Section-to-Slides Mapping:

[Root] SEC_1
Title: Nature du code génétique
Slides: 10 slide(s)
Slide IDs: SL_004, SL_005, SL_006, SL_007, SL_008, SL_009, SL_010, SL_011, SL_012, SL_013

  [Level 1] SEC_1.1
  Title: Découverte et preuves expérimentales du code triplet
  Slides: 2 slide(s)
  Slide IDs: SL_004, SL_005

  [Level 1] SEC_1.2
  Title: Propriétés du code (triplets, dégénérescence, codons initiateur/stop, universalité)
  Slides: 3 slide(s)
  Slide IDs: SL_006, SL_007, SL_008

  [Level 1] SEC_1.3
  Title: Variations et exceptions au code (mitochondrial, sélénocystéine)
  Slides: 2 slide(s)
  Slide IDs: SL_009, SL_013

  [Level 1] SEC_1.4
  Title: Structure du message codant et conséquences des mutations (phases de lecture)
  Slides: 3 slide(s)
  Slide IDs: SL_010, SL_011, SL_012

[Root] SEC_2
Title: Décodage : ARN de transfert, appariement codon‑anticodon et modifications
Slides: 11 slide(s)
Slide IDs: SL_015, SL_016, SL_017, SL_018, SL_020, SL_021, SL_022, SL_023

# Single shot: Writing content from mapping and the outline

In [7]:
outline_with_slides = outline.enrich_with_slides(slides=slides, mapping=mapping)
writer = Writer()

course = writer.write_course(enriched_content=outline_with_slides)

type(course)

src.models.Content

In [11]:
print(course.print_content())

Course Content:

[Root] ID: SEC_1
Title: Nature du code génétique
Content:
  [1] Le code génétique définit la correspondance entre la séquence de
      nucléotides d’un ARN messager et la séquence d’acides aminés d’une
      protéine. Il permet de traduire l’information contenue dans l’ARNm en une
      chaîne polypeptidique fonctionnelle.
  [2] Repères historiques et preuves expérimentales :
  [3] • 1953 : découverte de la structure de l’ADN, posant le cadre pour
      l’hypothèse d’un support séquentiel de l’information génétique.
  [4] • 1961 : démonstration que les acides aminés sont spécifiés par des
      triplets de bases adjacents (modèle non chevauchant).
  [5] • 1961–1968 : déchiffrage expérimental progressif des correspondances
      codon → acide aminé (récompensé par le Prix Nobel en 1968).
  [6] Caractéristiques fondamentales du code :
  [7] • Les codons sont des triplets de bases ; 4^3 = 64 combinaisons possibles

# Store a course as an artifact that can be loaded back into Course

In [13]:
print(course.print_outline())

Course Outline:
Nature du code génétique
  Découverte et preuves expérimentales du code triplet
  Propriétés du code (triplets, dégénérescence, codons initiateur/stop, universalité)
  Variations et exceptions au code (mitochondrial, sélénocystéine)
  Structure du message codant et conséquences des mutations (phases de lecture)
Décodage : ARN de transfert, appariement codon‑anticodon et modifications
  Fonction et architecture des ARNt (anticodon, extrémité 3’, structure 3D)
  Modifications post‑transcriptionnelles des ARN (nucléosides tRNA, rRNA, pseudo‑uridine en ARNm)
  Appariements bancals (wobble) et correspondance codon‑anticodon
  Biais d’usage des codons (codon usage bias, abondance d’ARNt)
  Aminoacyl‑tRNA synthétases : activation, greffe et édition (fidelité)
Structure et biogenèse du ribosome
  Organisation des sous‑unités, ARNr et sites fonctionnels (A, P, E)
  Modifications des ARNr et implications fonctionnelles
  Biogenèse et asse

In [16]:
course_dict = None  # defined up front so the except block can always inspect it
try:
    # Round-trip through JSON so the content is rebuilt with the currently loaded Content class
    course_json = course.model_dump_json()
    course_dict = json.loads(course_json)
    course_formatted = Content.model_validate(course_dict)

    pilot_course = Course(
        name="Code Genetique et traduction",
        course_title="Code Genetique et traduction",
        level="L1",
        block="SANTE",
        semester="S1",
        subject="UE2",
        chapter="6",
        content=course_formatted
    )

    print("✅ Course created successfully!")

except Exception as e:
    print(f"❌ Error creating course: {e}")
    # Inspect the serialized structure to debug validation failures
    if course_dict is not None:
        print("Course data structure:")
        print(json.dumps(course_dict, indent=2)[:500] + "...")

✅ Course created successfully!


In [17]:
print(pilot_course.print_outline())

Course: Code Genetique et traduction
Level: L1
Block: SANTE
Semester: S1
Subject: UE2
Chapter: 6


Course Outline:
Nature du code génétique
  Découverte et preuves expérimentales du code triplet
  Propriétés du code (triplets, dégénérescence, codons initiateur/stop, universalité)
  Variations et exceptions au code (mitochondrial, sélénocystéine)
  Structure du message codant et conséquences des mutations (phases de lecture)
Décodage : ARN de transfert, appariement codon‑anticodon et modifications
  Fonction et architecture des ARNt (anticodon, extrémité 3’, structure 3D)
  Modifications post‑transcriptionnelles des ARN (nucléosides tRNA, rRNA, pseudo‑uridine en ARNm)
  Appariements bancals (wobble) et correspondance codon‑anticodon
  Biais d’usage des codons (codon usage bias, abondance d’ARNt)
  Aminoacyl‑tRNA synthétases : activation, greffe et édition (fidelité)
Structure et biogenèse du ribosome
  Organisation des sous‑unités, ARNr et sites 

In [18]:
pilot_course.save_to_json(output_path="volume/artifacts/json")

💾 Course saved to: volume/artifacts/json/code_genetique_et_traduction_20250819_181142.json


'volume/artifacts/json/code_genetique_et_traduction_20250819_181142.json'
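
The saved file name above appears to combine the lowercased course name, with spaces replaced by underscores, and a `YYYYMMDD_HHMMSS` timestamp. A minimal sketch of that naming scheme, inferred only from the file names shown in this notebook; `artifact_filename` is a hypothetical helper, and the real `save_to_json` may build the name differently:

```python
from datetime import datetime


def artifact_filename(name, when, ext="json"):
    """Build '<name_with_underscores>_<YYYYMMDD_HHMMSS>.<ext>'."""
    slug = "_".join(name.lower().split())
    return f"{slug}_{when.strftime('%Y%m%d_%H%M%S')}.{ext}"


print(artifact_filename("Code Genetique et traduction",
                        datetime(2025, 8, 19, 18, 11, 42)))
# → code_genetique_et_traduction_20250819_181142.json
```

Because the timestamp is zero-padded, file names for the same course sort chronologically as plain strings.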

In [19]:
pilot_course.to_docx(output_path="volume/artifacts/docx", template_path="volume/templates/fs_template.docx")

📄 Course exported to DOCX: volume/artifacts/docx/Chapitre_6_code_genetique_et_traduction.docx


'volume/artifacts/docx/Chapitre_6_code_genetique_et_traduction.docx'

In [32]:
# Check what's actually in your fresh_content variable
print(f"Type: {type(fresh_content)}")
print(f"Is Content? {isinstance(fresh_content, Content)}")

# Check the sections
if hasattr(fresh_content, 'sections'):
    print(f"Sections type: {type(fresh_content.sections)}")
    if fresh_content.sections:
        print(f"First section type: {type(fresh_content.sections[0])}")
        print(f"First section: {fresh_content.sections[0]}")
else:
    print("No sections attribute")

# Check if it's the right Content class
print(f"Content class: {Content}")
print(f"fresh_content's content class: {type(fresh_content).__module__}.{type(fresh_content).__name__}")

Type: <class 'src.models.Content'>
Is Content? True
Sections type: <class 'list'>
First section type: <class 'src.models.ContentSection'>
First section: id='SEC_1' title='Présentation du module et plan du cours' content=["Ce module porte sur l'organisation du génome humain et sur les méthodes biotechnologiques permettant de le caractériser et de l'analyser. Il vise à donner une vision intégrée de la composition, de l'architecture et de la dynamique des génomes, ainsi que des conséquences fonctionnelles et médicales des variations génomiques.", "Le plan du cours présente successivement une introduction aux concepts évolutifs et aux organismes modèles, les notions fondamentales de génétique moléculaire, la description du génome humain (nucléaire et mitochondrial), le projet de séquençage, l'architecture globale du génome, la description des éléments constitutifs (gènes codants et non codants, pseudogènes, éléments répétés), la comparaison intergénomique, l

# Two-shots

## Plan from page

Extract plan text from PDF for Branch A (two-pass approach)


## Extract the slides

In [70]:
pdf_path = './volume/slides/cours_1.pdf'
plan_path = './volume/slides/cours_4_plan.pdf'

extractor = SlidesExtractor()

slides = extractor.extract_slides(pdf_path=pdf_path)

slides[:5]



2025-08-18 10:12:25,875 [DEBUG] src.data_extraction.slides_extractor: Starting slide extraction from ./volume/slides/cours_1.pdf
2025-08-18 10:12:25,924 [DEBUG] src.data_extraction.slides_extractor: PDF opened, found 111 pages
2025-08-18 10:12:27,073 [DEBUG] src.data_extraction.slides_extractor: Successfully extracted 110 slides


[Slides(id='SL_001', title='7.1 Architecture du génome', content='L1SpS: UE 2 Les molécules du vivant\nU.E.2 Les molécules du vivant\n7. Organisation du génome humain, méthodes en\nbiotechnologie\n7.1 Architecture du génome\nhumain\nJean Muller\nLaboratoire Diagnostic Génétique (HUS)\nLaboratoire de Génétique médicale (Inserm U1112)\njeanmuller@unistra.fr'),
 Slides(id='SL_002', title='Plan du cours', content='L1SpS: UE 2 Les molécules du vivant\nPlan du cours\n• Introduction\n• Notions fondamentales\n• Le génome humain\n• Projet de séquençage\n• Architecture globale\n• Génome mitochondrial\n• Génome nucléaire\n• Description des principaux éléments constituants\n• Gènes (codants et non-codants), pseudogènes et éléments répétés\n• Comparaison aux autres génomes\n• Les types de variations du génome humain et leurs conséquences\n• La variabilité du génome humain\n• Evolution des génomes: notions essentielles, mécanismes\n2'),
 Sl

In [74]:
# Test plan extraction
plan_extractor = PlanExtractor()

plan_txt = plan_extractor.extract_plan_from_page(pdf_path=pdf_path, page_number=2)

print(plan_txt)

2025-08-18 10:13:00,872 [DEBUG] src.data_extraction.plan_extractor: Starting plan extraction from ./volume/slides/cours_1.pdf (page 2)
2025-08-18 10:13:00,928 [DEBUG] src.data_extraction.plan_extractor: PDF opened, found 111 pages
2025-08-18 10:13:00,948 [DEBUG] src.data_extraction.plan_extractor: Successfully extracted plan from page 2 (1599 chars)
   L1SpS: UE 2 Les molécules du vivant
      Plan      du     cours
        •  Introduction
             •  Notions  fondamentales
        •  Le génome    humain
             •  Projet de séquençage
             •  Architecture globa

## Build the Outline

In [75]:
out_builder = OutlineTwoPass()

outline = out_builder.build_outline(plan_txt)

2025-08-18 10:13:26,958 [DEBUG] src.llm.outline_two_pass: Starting outline generation from plan text (1599 chars)
2025-08-18 10:13:41,765 [INFO] httpx: HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-08-18 10:13:41,817 [DEBUG] src.llm.outline_two_pass: Successfully generated outline with 6 sections


In [78]:
print(outline.print_outline())

Course Outline:
Introduction
  Notions fondamentales
Le génome humain
  Projet de séquençage
  Architecture globale
    Génome mitochondrial
    Génome nucléaire
  Description des principaux éléments constituants
    Gènes (codants et non-codants), pseudogènes et éléments répétés
Comparaison aux autres génomes
Les types de variations du génome humain et leurs conséquences
La variabilité du génome humain
Evolution des génomes: notions essentielles, mécanismes


## Build the mapping

In [79]:
# Keep only the first 10 slides for a quick test run
slides = slides[:10]

In [80]:
mapping_builder = MappingTwoPass()

mapping = mapping_builder.build_mapping(slides=slides, outline=outline) 

type(mapping)

2025-08-18 10:14:02,625 [DEBUG] src.llm.mapping_two_pass: Starting mapping generation for 10 slides and 6 sections
2025-08-18 10:15:20,967 [INFO] httpx: HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-08-18 10:15:20,996 [DEBUG] src.llm.mapping_two_pass: Successfully generated mapping with 13 entries


src.models.SectionSlideMapping

In [82]:
print(mapping.visualize_mapping(outline))

Section-to-Slides Mapping:

[Root] SEC_1
Title: Introduction
Slides: 2 slide(s)
Slide IDs: SL_002, SL_003

  [Level 1] SEC_1.1
  Title: Notions fondamentales
  Slides: 7 slide(s)
  Slide IDs: SL_004, SL_005, SL_006, SL_007, SL_008, SL_009, SL_010

[Root] SEC_2
Title: Le génome humain
Slides: 1 slide(s)
Slide IDs: SL_001

  [Level 1] SEC_2.1
  Title: Projet de séquençage
  Slides: 1 slide(s)
  Slide IDs: SL_002

  [Level 1] SEC_2.2
  Title: Architecture globale
  Slides: 1 slide(s)
  Slide IDs: SL_001

    [Level 2] SEC_2.2.1
    Title: Génome mitochondrial
    Slides: 1 slide(s)
    Slide IDs: SL_001

    [Level 2] SEC_2.2.2
    Title: Génome nucléaire
    Slides: 1 slide(s)
    Slide IDs: SL_001

  [Level 1] SEC_2.3
  Title: Description des principaux éléments constituants
  Slides: 1 slide(s)
  Slide IDs: SL_001

    [Level 2] SEC_2.3.1
    Title: Gènes (codants et non-codants), pseudogènes et éléments répétés
    Slides: 1 slide(s)
    Slide IDs: SL_001

[Root] SEC_3


In [87]:
course_with_slides = outline.enrich_with_slides(slides=slides, mapping=mapping)

print(course_with_slides.print_outline())

Course Outline:
Introduction
  Notions fondamentales
Le génome humain
  Projet de séquençage
  Architecture globale
    Génome mitochondrial
    Génome nucléaire
  Description des principaux éléments constituants
    Gènes (codants et non-codants), pseudogènes et éléments répétés
Comparaison aux autres génomes
Les types de variations du génome humain et leurs conséquences
La variabilité du génome humain
Evolution des génomes: notions essentielles, mécanismes


## Write the course

In [84]:
writer = Writer()

course = writer.write_course(enriched_content=course_with_slides)
type(course)

2025-08-18 10:16:07,926 [DEBUG] src.llm.writer: Starting content enhancement for 6 sections
2025-08-18 10:19:58,868 [INFO] httpx: HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-08-18 10:19:58,980 [DEBUG] src.llm.writer: Successfully enhanced content with 6 sections


src.models.Content

In [89]:
type(course)

src.models.Content

In [88]:
test_course = Course(
    name="Architecture du génome humain",
    course_title="Architecture du génome humain",
    level="L1",
    block="SANTE",
    semester="S1",
    subject="UE-1 Constitution et transformation de la matière",
    content=course
)

print(test_course.print_outline())

ValidationError: 1 validation error for Course
content
  Input should be a valid dictionary or instance of Content [type=model_type, input_value=Content(sections=[Content...ides=None, mapping=None), input_type=Content]
    For further information visit https://errors.pydantic.dev/2.11/v/model_type

In [90]:
from src.models import Content as ModelsContent
print("Course content annotation ->", ModelsContent, ModelsContent.__module__)

print("Returned content instance ->", type(course), type(course).__module__)

print("Same class object? ->", type(course) is ModelsContent)

Course content annotation -> <class 'src.models.Content'> src.models
Returned content instance -> <class 'src.models.Content'> src.models
Same class object? -> False
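
The `False` above is consistent with how `importlib.reload` behaves: reloading re-executes the module body and rebinds its names to new class objects, so an instance created before the reload no longer passes an `isinstance` check against the reloaded class. That would explain the `ValidationError` in the previous cell, although the notebook does not confirm the exact sequence of reloads. A minimal, self-contained sketch of the effect; `demo_models` is a throwaway module created only for this illustration and is not part of `src/`:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module that defines a single class
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_models.py").write_text("class Content:\n    pass\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import demo_models

old_instance = demo_models.Content()  # created before the reload
old_class = demo_models.Content

importlib.reload(demo_models)         # re-executes the module body

same_name = old_class.__qualname__ == demo_models.Content.__qualname__
same_object = old_class is demo_models.Content
still_instance = isinstance(old_instance, demo_models.Content)

print(same_name, same_object, still_instance)
# → True False False
```

Rebuilding the object from plain data after the reload, for example with `Content.model_validate(course.model_dump())`, yields an instance of the current class. The JSON round trip used in the single-shot section has the same effect.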


## Tests


In [64]:
refresh_imports()

from src.models import PipelineConfig

config = PipelineConfig.load("pipeline.yaml")
config.inputs, config.outputs, config.metadata

🔁 Imports refreshed


({'slides_pdf': '/Users/youssefjanjar/Documents/formascience/class_parser/volume/slides/cours_1.pdf',
  'plan_page': 2},
 {'save_json': True,
  'save_docx': True,
  'template_path': '/Users/youssefjanjar/Documents/formascience/class_parser/volume/templates/fs_template.docx',
  'output_dir': '/Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts'},
 CourseMetadata(name='Architecture du génome humain', course_title='Architecture du génome humain', level='L1', block='SANTE', semester='S1', subject='UE-1 - Les bases de la biologie', chapter=None, year=2024, professor='Youssef Janjar'))

In [65]:
import os 
os.listdir(config.outputs['output_dir'])

['architecture_du_génome_humain_20250817_224951.json',
 '.DS_Store',
 'architecture_du_génome_humain_filled.docx',
 'docx']

In [66]:
from src.course import Course

course = Course.load_from_json(config.outputs['output_dir']+'/architecture_du_génome_humain_20250817_224951.json')

print(course.content.print_outline())

Course Outline:
Introduction
  Notions fondamentales
Le génome humain
  Projet de séquençage
  Architecture globale
    Génome mitochondrial
    Génome nucléaire
  Description des principaux éléments constituants
    Gènes (codants et non-codants), pseudogènes et éléments répétés
Comparaison aux autres génomes
Les types de variations du génome humain et leurs conséquences
La variabilité du génome humain
Evolution des génomes: notions essentielles, mécanismes


In [67]:
course.save_to_json(output_path=config.outputs['output_dir'])

💾 Course saved to: /Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/json/architecture_du_génome_humain_20250818_082858.json


'/Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/json/architecture_du_génome_humain_20250818_082858.json'

In [61]:
course.to_docx(template_path=config.outputs['template_path'], output_path=config.outputs['output_dir']+'/docx')

📄 Course exported to DOCX: /Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/docx/architecture_du_génome_humain.docx


'/Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/docx/architecture_du_génome_humain.docx'

# Testing the pipeline

In [68]:
import os
from pathlib import Path

# Print current working directory
print("Current working directory:", os.getcwd())

# Print root folder using pathlib
root_path = Path.cwd()
print("Root folder (pathlib):", root_path)

# Print absolute path
print("Absolute path:", root_path.absolute())

# List contents of root folder
print("\nContents of root folder:")
for item in root_path.iterdir():
    if item.is_dir():
        print(f"📁 {item.name}/")
    else:
        print(f"📄 {item.name}")


Current working directory: /Users/youssefjanjar/Documents/formascience/class_parser
Root folder (pathlib): /Users/youssefjanjar/Documents/formascience/class_parser
Absolute path: /Users/youssefjanjar/Documents/formascience/class_parser

Contents of root folder:
📁 .cursor/
📄 .DS_Store
📄 create_structure.sh
📁 audio_extractor/
📄 pyproject.toml
📄 data_structure.md
📁 tests/
📁 __pycache__/
📄 README.md
📁 .mypy_cache/
📄 .gitignore
📄 diagram.mermaid
📄 .env
📁 .venv/
📄 pipeline.yaml
📄 poetry.lock
📁 .git/
📄 main.py
📄 test_workflow.ipynb
📁 data/
📁 volume/
📁 notebooks/
📁 src/


In [2]:
import logging
import sys
from pathlib import Path

from src.pipeline import CoursePipeline
from src.models import CourseMetadata

# Configure logging level
logging.basicConfig(
    level=logging.INFO,  # Change to DEBUG for more detailed logs, WARNING for less
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(),  # Console output
        logging.FileHandler('pipeline.log')  # Optional: save to file
    ]
)

# Initialize the pipeline
pipeline = CoursePipeline()
# Method 1: Using config file
course = pipeline.process_from_config("configs/cours_1.yaml")


# Get processing statistics
stats = pipeline.get_processing_statistics(course)
print(f"Processed {stats['total_sections']} sections from {stats['total_slides']} slides")

2025-08-18 13:10:15,727 - src.pipeline - INFO - 🚀 Processing course 'Architecture du génome humain' (Branch A - Plan Provided)
2025-08-18 13:10:15,727 - src.pipeline - INFO - 📝 Generating outline from plan...
2025-08-18 13:10:26,263 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-08-18 13:10:26,318 - src.pipeline - INFO - ✅ Generated outline with 7 sections
2025-08-18 13:10:26,319 - src.pipeline - INFO - 🗺️ Generating section-slide mapping...
2025-08-18 13:12:39,982 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-08-18 13:12:40,014 - src.pipeline - INFO - ✅ Generated section-slide mapping
2025-08-18 13:12:40,015 - src.pipeline - INFO - 🔗 Enriching content with slides...
2025-08-18 13:12:40,016 - src.pipeline - INFO - ✅ Content enriched with slide data
2025-08-18 13:12:40,017 - src.pipeline - INFO - ✍️ Enhancing content with AI...
2025-08-18 13:16:48,819 - httpx - INFO 

💾 Course saved to: /Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/json/architecture_du_génome_humain_20250818_132011.json


2025-08-18 13:20:11,906 - src.pipeline - INFO - 💾 Course saved to DOCX
2025-08-18 13:20:11,906 - src.pipeline - INFO - 🎉 Course processing complete!


📄 Course exported to DOCX: /Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/docx/architecture_du_génome_humain.docx
Processed 7 sections from 110 slides


In [4]:
print(course.print_outline())

Course: Architecture du génome humain
Level: L1
Block: SANTE
Semester: S1
Subject: UE-1
Chapter: 1


Course Outline:
Introduction
  Notions fondamentales
Le génome humain
  Projet de séquençage
  Architecture globale
    Génome mitochondrial
    Génome nucléaire
Description des principaux éléments constituants
  Gènes (codants et non-codants), pseudogènes et éléments répétés
Comparaison aux autres génomes
Les types de variations du génome humain et leurs conséquences
La variabilité du génome humain
Evolution des génomes: notions essentielles, mécanismes


In [6]:
import os

json_dir = "/Users/youssefjanjar/Documents/formascience/class_parser/volume/artifacts/json/"
jsons = os.listdir(json_dir)
# os.listdir order is arbitrary, so pick the most recently modified file explicitly
file_path = max((json_dir + name for name in jsons), key=os.path.getmtime)

jsons

['architecture_du_génome_humain_20250818_113431.json',
 'architecture_du_génome_humain_20250818_123215.json',
 'réplication_et_réparation_de_l’adn_20250818_120405.json',
 'architecture_du_génome_humain_20250818_114200.json',
 'architecture_du_génome_humain_20250818_132011.json']

In [7]:
from src.course import Course

course = Course.load_from_json(file_path)
course

Course(name='Architecture du génome humain', course_title='Architecture du génome humain', level='L1', block='SANTE', semester='S1', subject='UE-1', chapter=1, content=Content(sections=[ContentSection(id='SEC_1', title='Introduction', content=["Ce module aborde les molécules du vivant en plaçant le génome au centre d'un cadre plus large qui comprend également le transcriptome, le protéome et le méthylome. Le programme couvre des notions fondamentales d'évolution et d'homologie, la structure et l'architecture du génome humain (nucléaire et mitochondrial), le projet de séquençage, la description des éléments constitutifs du génome, la comparaison avec d'autres génomes, ainsi que les types et les conséquences des variations génomiques.", "Le génome est présenté comme une référence stable de l'information héréditaire d'un organisme, servant de base de comparaison pour les autres couches fonctionnelles (expression, modifications épigénétiques, production protéi

In [10]:
course.subject = "UE-2"
course.chapter = "7.1"

In [11]:
print(course.print_outline())

Course: Architecture du génome humain
Level: L1
Block: SANTE
Semester: S1
Subject: UE-2
Chapter: 7.1


Course Outline:
Introduction
  Notions fondamentales
Le génome humain
  Projet de séquençage
  Architecture globale
    Génome mitochondrial
    Génome nucléaire
Description des principaux éléments constituants
  Gènes (codants et non-codants), pseudogènes et éléments répétés
Comparaison aux autres génomes
Les types de variations du génome humain et leurs conséquences
La variabilité du génome humain
Evolution des génomes: notions essentielles, mécanismes


In [7]:
course.save_to_json(output_path="volume/artifacts/json")

💾 Course saved to: volume/artifacts/json/architecture_du_génome_humain_20250818_114200.json


'volume/artifacts/json/architecture_du_génome_humain_20250818_114200.json'

In [12]:
artifacts_path = "volume/artifacts/docx"
template_path = "volume/templates/fs_template.docx"
course.to_docx(output_path=artifacts_path, template_path=template_path)

📄 Course exported to DOCX: volume/artifacts/docx/architecture_du_génome_humain.docx


'volume/artifacts/docx/architecture_du_génome_humain.docx'


In [None]:
# Method 2: Direct method call
metadata = CourseMetadata(
    name="Architecture du génome humain",
    course_title="Architecture du génome humain",
    level="L1",
    block="SANTE",
    semester="S1",
    subject="UE-1 Constitution et transformation de la matière"
)

course = pipeline.process_course_no_plan(
    slides=slides,  # Your slides list
    metadata=metadata,
    save_json=True,
    save_docx=True,
    test_mode=True,  # Process only first 10 slides
    output_path=""
)

# Get processing statistics
stats = pipeline.get_processing_statistics(course)
print(f"Processed {stats['total_sections']} sections from {stats['total_slides']} slides")