# 🏷️ Step 5: Mini-Taxonomy of Definitions

**Objective**: Extract and classify the definitions of "Agentic AI" found in the corpus

**Hybrid approach**:
1. **Semi-automatic extraction** - Identify definitional sentences and paragraphs
2. **Manual classification** - Group them into conceptual categories
3. **Visualization** - Tables and diagrams (bar chart, heatmap, treemap, sunburst, Sankey)

**Expected categories**:
- AI as Copilots/Assistants
- AI as Autonomous Workers
- AI as Multi-Agent Ecosystems/Orchestrators
- AI as Governance/Risk Challenges

**Output**: A taxonomy of definitions plus visualizations for the report

## 🔧 Setup Config & Imports

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Imports
import json
import re
from pathlib import Path
from collections import Counter, defaultdict

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

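# NLP
# sent_tokenize needs NLTK's Punkt sentence models; if they are missing,
# download them once with nltk.download('punkt') (recent NLTK releases
# use the 'punkt_tab' resource instead).
import nltk
from nltk.tokenize import sent_tokenize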

# Viz
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
%matplotlib inline

plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 10

print("✅ Imports loaded")

## 📂 Load processed corpus

In [None]:
# Paths
PROJECT_ROOT = Path.cwd().parent
PROCESSED_DATA = PROJECT_ROOT / "data" / "processed"
TEXTS_DIR = PROCESSED_DATA / "texts"
METADATA_FILE = PROCESSED_DATA / "metadata" / "corpus_metadata.json"

# Create the taxonomy folder (and any missing parent folders)
TAXONOMY_DIR = PROCESSED_DATA / "taxonomy"
TAXONOMY_DIR.mkdir(parents=True, exist_ok=True)

print(f"📁 Taxonomy folder: {TAXONOMY_DIR}")

In [None]:
# Load the corpus metadata, then the raw text of each listed document
with open(METADATA_FILE, 'r', encoding='utf-8') as f:
    metadata = json.load(f)

texts = {}
for doc_id in metadata:
    text_file = TEXTS_DIR / f"{doc_id}.txt"
    if text_file.exists():
        with open(text_file, 'r', encoding='utf-8') as f:
            texts[doc_id] = f.read()

print(f"✅ {len(texts)} documents loaded")

In [None]:
# Mapping doc_id -> source_type
doc_to_source = {doc_id: metadata[doc_id]['source_type'] 
                 for doc_id in texts.keys()}

## 🔍 Semi-automatic extraction of definitions

### Extraction strategy

We identify the sentences/paragraphs that:
1. Contain key terms: "agentic ai", "ai agent", "autonomous agent", "agentic system"
2. Use definitional markers: "is defined as", "refers to", "means", "characterized by", "consists of"
3. Appear early in the document (the first 30% of sentences, where introductions and definitions usually sit)
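
To make the scoring rule concrete, here is a minimal sketch that scores a single sentence with the same weights as the extraction function in the next cell (+2 key term, +3 definitional marker, +1 for 30-300 words, +1 for the first 30% of the document). The helper name `score_sentence`, the shortened pattern lists, and the sample sentences are assumptions made for illustration only.

```python
import re

# Shortened pattern lists (a subset of the notebook's patterns)
KEY_TERMS = [r'\bagentic\s+ai\b', r'\bai\s+agent[s]?\b', r'\bautonomous\s+agent[s]?\b']
DEF_MARKERS = [r'\bis\s+defined\s+as\b', r'\brefers?\s+to\b', r'\bmeans?\b']

def score_sentence(sentence, index, n_sentences):
    """Hypothetical helper: heuristic score of one sentence at position `index`."""
    lower = sentence.lower()
    if not any(re.search(p, lower) for p in KEY_TERMS):
        return 0  # sentences without a key term are skipped entirely
    score = 2  # key term present
    if any(re.search(p, lower) for p in DEF_MARKERS):
        score += 3  # definitional marker
    if 30 <= len(sentence.split()) <= 300:
        score += 1  # reasonable length
    if index < n_sentences * 0.3:
        score += 1  # early in the document
    return score

# Invented sample sentences
definition = "Agentic AI refers to systems that plan and act toward goals with limited supervision."
mention = "AI agents were discussed at the meeting."

early = score_sentence(definition, index=2, n_sentences=100)
late = score_sentence(definition, index=80, n_sentences=100)
weak = score_sentence(mention, index=90, n_sentences=100)
print(early, late, weak)
```

With a threshold of 3, the short definition is kept wherever it appears, while a late, short sentence that only mentions a key term scores 2 and is dropped.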

In [None]:
def extract_definition_candidates(text, doc_id):
    """
    Extract candidate definition sentences from one document.
    
    Scoring criteria:
    - Contains a key term (agentic AI, AI agent, autonomous agent, ...); required
    - Contains a definitional marker (+3)
    - Has a reasonable length, 30-300 words (+1)
    - Appears in the first 30% of the document (+1)
    """
    # Split the text into sentences
    sentences = sent_tokenize(text)
    
    # Key terms to search
    key_terms = [
        r'\bagentic\s+ai\b',
        r'\bai\s+agent[s]?\b',
        r'\bautonomous\s+agent[s]?\b',
        r'\bagentic\s+system[s]?\b',
        r'\bagent[s]?\s+are\b',
        r'\bagent[s]?\s+is\b'
    ]
    
    # Definitional markers
    def_markers = [
        r'\bis\s+defined\s+as\b',
        r'\bare\s+defined\s+as\b',
        r'\brefers?\s+to\b',
        r'\bmeans?\b',
        r'\bcan\s+be\s+understood\s+as\b',
        r'\bcharacterized\s+by\b',
        r'\bconsists?\s+of\b',
        r'\benables?\b',
        r'\bcapable\s+of\b'
    ]
    
    candidates = []
    
    for i, sentence in enumerate(sentences):
        sentence_lower = sentence.lower()
        
        # Check key terms presence
        has_key_term = any(re.search(pattern, sentence_lower) for pattern in key_terms)
        
        if not has_key_term:
            continue
        
        # Check definitional markers presence
        has_def_marker = any(re.search(pattern, sentence_lower) for pattern in def_markers)
        
        # Check length
        word_count = len(sentence.split())
        
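        # Score the sentence. A key term is guaranteed at this point
        # (sentences without one were skipped above), so the base score is 2.
        score = 2
        if has_def_marker:
            score += 3
        if 30 <= word_count <= 300:
            score += 1
        if i < len(sentences) * 0.3:  # In the first 30% of the document
            score += 1
        
        # Keep the sentence if at least one extra criterion is met
        if score >= 3:
            # Append the next sentence for context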
            context = sentence
            if i + 1 < len(sentences):
                context += " " + sentences[i + 1]
            
            candidates.append({
                'doc_id': doc_id,
                'sentence_id': i,
                'text': sentence.strip(),
                'context': context.strip(),
                'word_count': word_count,
                'score': score,
                'has_def_marker': has_def_marker
            })
    
    return candidates

In [None]:
# Extract candidates for docs
all_candidates = []

for doc_id, text in texts.items():
    candidates = extract_definition_candidates(text, doc_id)
    all_candidates.extend(candidates)
    
    filename = metadata[doc_id]['filename']
    print(f"\n📄 {filename}")
    print(f"   {len(candidates)} potential definitions found")

print(f"\n✅ Total: {len(all_candidates)} candidate definitions extracted")

### Extracted definitions overview

In [None]:
# Create DataFrame
df_candidates = pd.DataFrame(all_candidates)

# Add metadata
df_candidates['filename'] = df_candidates['doc_id'].map(
    lambda x: metadata[x]['filename']
)
df_candidates['source_type'] = df_candidates['doc_id'].map(doc_to_source)

# Sort by descending score
df_candidates = df_candidates.sort_values('score', ascending=False)

print("TOP 10 CANDIDATE DEFINITIONS (by score)")
for i, row in df_candidates.head(10).iterrows():
    print(f"\n{row['filename']} (Score: {row['score']})")
    print(f"  {row['text'][:200]}...")

## 📝 Manual classification of definitions

**IMPORTANT**: This section requires manual review.

For each candidate definition, you need to:
1. Read the full text
2. Assign a category
3. Merge or remove entries where appropriate
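
Because the review is done by hand in a spreadsheet, typos in the category column are easy to introduce. The sketch below is one possible sanity check, not part of the original workflow: it reads a reviewed CSV and reports rows whose `final_category` is not a known category id or whose `keep` flag is not True/False. The helper name `find_review_problems` and the sample rows are invented for illustration; only the column names and category ids come from this notebook.

```python
import csv
import io

# Category ids used by this notebook's taxonomy
VALID_CATEGORIES = {'copilot', 'autonomous_worker', 'orchestrator', 'governance', 'other'}

def find_review_problems(csv_text):
    """Hypothetical helper: return (row_number, message) pairs for suspicious rows."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        category = (row.get('final_category') or '').strip()
        keep = (row.get('keep') or '').strip().lower()
        if category not in VALID_CATEGORIES:
            problems.append((row_number, f"unknown category: {category!r}"))
        if keep not in {'true', 'false'}:
            problems.append((row_number, f"keep must be True or False, got {keep!r}"))
    return problems

# Invented sample of a reviewed file with two mistakes
sample_csv = (
    "text,final_category,keep\n"
    "Agentic AI refers to goal-driven systems,copilot,True\n"
    "AI agents are independent workers,autonomus_worker,True\n"
    "Agents enable oversight,governance,yes\n"
)

problems = find_review_problems(sample_csv)
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

Running a check like this before loading the reviewed file makes misspelled categories visible instead of letting them fall silently into 'other'.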

In [None]:
# Define taxonomy categories
TAXONOMY_CATEGORIES = {
    'copilot': {
        'label': 'AI as Copilots/Assistants',
        'description': 'AI agents that augment human work, provide suggestions, collaborate with users',
        'keywords': ['copilot', 'assistant', 'augment', 'support', 'collaborate', 'suggest', 'help']
    },
    'autonomous_worker': {
        'label': 'AI as Autonomous Workers',
        'description': 'AI agents that independently execute tasks with minimal human intervention',
        'keywords': ['autonomous', 'independent', 'execute', 'automate', 'replace', 'perform']
    },
    'orchestrator': {
        'label': 'AI as Multi-Agent Ecosystems/Orchestrators',
        'description': 'AI systems coordinating multiple agents, workflows, or complex processes',
        'keywords': ['orchestrate', 'coordinate', 'multi-agent', 'ecosystem', 'workflow', 'multi-step', 'planning']
    },
    'governance': {
        'label': 'AI as Governance/Risk Challenges',
        'description': 'AI agents framed through ethical, regulatory, or risk management lens',
        'keywords': ['governance', 'risk', 'compliance', 'regulation', 'ethics', 'safety', 'alignment', 'control']
    },
    'other': {
        'label': 'Other/Uncategorized',
        'description': 'Definitions that don\'t fit main categories',
        'keywords': []
    }
}

print("\n📋 Taxonomy categories:")
for cat_id, cat_info in TAXONOMY_CATEGORIES.items():
    print(f"\n  {cat_info['label']}")
    print(f"    {cat_info['description']}")
    print(f"    Keywords: {', '.join(cat_info['keywords'][:5])}")

### Semi-automatic classification (first pass)

Keywords are used to suggest a category, but manual validation is still required.
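
One caveat of keyword suggestions is worth illustrating: plain substring matching can fire inside unrelated words. The sketch below compares substring matching with a whole-word variant on an invented sentence. The helper name `count_keyword_hits` and its `whole_words` flag are assumptions for illustration; the keyword list is the governance list from this notebook.

```python
import re

def count_keyword_hits(text, keywords, whole_words=False):
    """Hypothetical helper: count how many keywords occur in the text."""
    lower = text.lower()
    if whole_words:
        # \b anchors require the keyword to stand alone as a word
        return sum(1 for kw in keywords
                   if re.search(rf'\b{re.escape(kw)}\b', lower))
    # Plain substring test: also matches inside longer words
    return sum(1 for kw in keywords if kw in lower)

governance_keywords = ['governance', 'risk', 'compliance', 'regulation',
                       'ethics', 'safety', 'alignment', 'control']

# Invented sentence: "briskly" contains the substring "risk"
sample = "Agents complete tasks briskly and support staff."

substring_hits = count_keyword_hits(sample, governance_keywords)
whole_word_hits = count_keyword_hits(sample, governance_keywords, whole_words=True)
print(substring_hits, whole_word_hits)
```

The trade-off is that whole-word matching misses inflected forms such as "risks" or "orchestrates", which substring matching catches, so either choice still calls for the manual validation described above.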

In [None]:
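def suggest_category(text):
    """
    Suggest a category by counting keyword matches (substring matching).
    
    Returns (category_id, match_count). Ties go to the first category in
    TAXONOMY_CATEGORIES order; no match at all yields ('other', 0).
    """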
    text_lower = text.lower()
    
    scores = {}
    for cat_id, cat_info in TAXONOMY_CATEGORIES.items():
        if cat_id == 'other':
            continue
        
        score = sum(1 for keyword in cat_info['keywords'] if keyword in text_lower)
        scores[cat_id] = score
    
    if max(scores.values()) == 0:
        return 'other', 0
    
    suggested_cat = max(scores.items(), key=lambda x: x[1])[0]
    confidence = scores[suggested_cat]
    
    return suggested_cat, confidence

In [None]:
# Suggest a category for each candidate (computed once per row)
suggestions = [suggest_category(text) for text in df_candidates['text']]
df_candidates['suggested_category'] = [cat for cat, _ in suggestions]
df_candidates['category_confidence'] = [conf for _, conf in suggestions]

In [None]:
print("SUGGESTED CLASSIFICATION")
for cat_id in ['copilot', 'autonomous_worker', 'orchestrator', 'governance', 'other']:
    count = (df_candidates['suggested_category'] == cat_id).sum()
    print(f"  {TAXONOMY_CATEGORIES[cat_id]['label']:45} : {count:2} definitions")

### Export for manual classification

In [None]:
# 

# %%
# Create a CSV file for manual review
# (doc_id and sentence_id are kept so reviewed rows can be traced back to the corpus)
review_df = df_candidates[[
    'doc_id', 'sentence_id', 'filename', 'source_type', 'text', 'context', 
    'suggested_category', 'category_confidence'
]].copy()

# Add a final-category column, pre-filled with the suggestion (to be corrected manually)
review_df['final_category'] = review_df['suggested_category']
review_df['notes'] = ''
review_df['keep'] = True  # Marks the definitions to keep

# Save (note: re-running this cell overwrites the file, including manual edits)
review_file = TAXONOMY_DIR / 'definitions_for_manual_review.csv'
review_df.to_csv(review_file, index=False, encoding='utf-8')

print(f"\n💾 File for manual review: {review_file}")
print("\n⚠️  MANUAL STEP REQUIRED:")
print("   1. Open the CSV file in Excel/Google Sheets")
print("   2. Read each definition")
print("   3. Correct the 'final_category' column where needed")
print("   4. Set 'keep' to False for irrelevant definitions")
print("   5. Add notes if needed")
print("   6. Save the file")

# %% [markdown]
# ## 📊 Loading the Manual Classification Results
# 
# **NOTE**: Run this section after completing the manual review of the CSV.
# 
# If the review has not been done yet, the analysis continues with the automatic classification.

# %%
print("\n" + "="*70)
print("LOADING MANUAL RESULTS")
print("="*70)

# Check whether the review file exists. It has the same name as the exported
# file, so an unedited export simply reproduces the automatic suggestions.
reviewed_file = TAXONOMY_DIR / 'definitions_for_manual_review.csv'

if reviewed_file.exists():
    # Load the reviewed results
    df_reviewed = pd.read_csv(reviewed_file)
    
    # Keep only the validated definitions; comparing as text also accepts
    # the "TRUE"/"FALSE" spellings written by spreadsheet tools
    keep_mask = df_reviewed['keep'].astype(str).str.strip().str.lower() == 'true'
    df_definitions = df_reviewed[keep_mask].copy()
    
    print(f"✅ Manual results loaded: {len(df_definitions)} validated definitions")
    
    # Use the final categories
    category_col = 'final_category'
else:
    print("⚠️  Reviewed file not found. Falling back to the automatic classification.")
    df_definitions = df_candidates.copy()
    df_definitions['final_category'] = df_definitions['suggested_category']
    category_col = 'final_category'

# Replace invalid categories with 'other'
valid_categories = list(TAXONOMY_CATEGORIES.keys())
df_definitions[category_col] = df_definitions[category_col].apply(
    lambda x: x if x in valid_categories else 'other'
)

print(f"\n📊 Final distribution of definitions:")
print("─"*70)

for cat_id in ['copilot', 'autonomous_worker', 'orchestrator', 'governance', 'other']:
    count = (df_definitions[category_col] == cat_id).sum()
    pct = count / len(df_definitions) * 100 if len(df_definitions) > 0 else 0
    print(f"  {TAXONOMY_CATEGORIES[cat_id]['label']:45} : {count:2} ({pct:5.1f}%)")

# %% [markdown]
# ## 📋 Building the Definitions Table

# %%
print("\n" + "="*70)
print("DEFINITIONS TABLE BY CATEGORY")
print("="*70)

# Build a clean table for the report
table_data = []

for _, row in df_definitions.iterrows():
    table_data.append({
        'Report': row['filename'][:40],
        'Source Type': row['source_type'],
        'Category': TAXONOMY_CATEGORIES[row[category_col]]['label'],
        'Definition': row['text'][:150] + '...' if len(row['text']) > 150 else row['text']
    })

df_table = pd.DataFrame(table_data)

# Save the full table
df_table.to_csv(TAXONOMY_DIR / 'definitions_table.csv', index=False, encoding='utf-8')
print(f"💾 Table saved: {TAXONOMY_DIR / 'definitions_table.csv'}")

# Show a few examples per category
print("\n📋 Examples by category:\n")

for cat_id in ['copilot', 'autonomous_worker', 'orchestrator', 'governance']:
    cat_label = TAXONOMY_CATEGORIES[cat_id]['label']
    cat_defs = df_table[df_table['Category'] == cat_label]
    
    if len(cat_defs) == 0:
        continue
    
    print(f"{'─'*70}")
    print(f"{cat_label}")
    print(f"{'─'*70}")
    
    for _, row in cat_defs.head(2).iterrows():
        print(f"\n  📄 {row['Report']}")
        print(f"     {row['Definition']}")
    
    print()

# %% [markdown]
# ## 📊 Taxonomy Visualizations

# %% [markdown]
# ### 1. Overall Distribution (Bar Chart)

# %%
# Count the definitions per category
category_counts = df_definitions[category_col].value_counts()

# Map category ids to their full labels
category_labels = {cat_id: TAXONOMY_CATEGORIES[cat_id]['label'] 
                  for cat_id in category_counts.index}
category_counts.index = category_counts.index.map(category_labels)

# Plot
fig, ax = plt.subplots(figsize=(14, 8))

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#98D8C8']
bars = ax.bar(range(len(category_counts)), category_counts.values, 
             color=colors[:len(category_counts)], edgecolor='black', linewidth=1.5)

ax.set_xticks(range(len(category_counts)))
ax.set_xticklabels(category_counts.index, rotation=45, ha='right', fontsize=10)
ax.set_ylabel('Number of Definitions', fontsize=12, fontweight='bold')
ax.set_title('Distribution of Definitions by Conceptual Category', 
            fontsize=14, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3)

# Add the count above each bar
for i, count in enumerate(category_counts.values):
    ax.text(i, count + 0.5, str(count), ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig(TAXONOMY_DIR / 'taxonomy_distribution.png', bbox_inches='tight')
plt.show()

print(f"💾 Chart saved: {TAXONOMY_DIR / 'taxonomy_distribution.png'}")

# %% [markdown]
# ### 2. Distribution by Source Type

# %%
# Build a category × source type matrix
cross_tab = pd.crosstab(
    df_definitions[category_col].map(lambda x: TAXONOMY_CATEGORIES[x]['label']),
    df_definitions['source_type']
)

# Heatmap
fig, ax = plt.subplots(figsize=(12, 8))

sns.heatmap(cross_tab, annot=True, fmt='d', cmap='YlGnBu', 
           linewidths=0.5, cbar_kws={'label': 'Number of Definitions'}, ax=ax)

ax.set_xlabel('Source Type', fontsize=12, fontweight='bold')
ax.set_ylabel('Conceptual Category', fontsize=12, fontweight='bold')
ax.set_title('Distribution of Definitions: Category × Source Type', 
            fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.savefig(TAXONOMY_DIR / 'taxonomy_by_source_heatmap.png', bbox_inches='tight')
plt.show()

print(f"💾 Heatmap saved: {TAXONOMY_DIR / 'taxonomy_by_source_heatmap.png'}")

# %% [markdown]
# ### 3. Treemap (Hierarchical View)

# %%
# Prepare the treemap data: root -> category -> source type
treemap_data = []

for cat_id, cat_info in TAXONOMY_CATEGORIES.items():
    count = (df_definitions[category_col] == cat_id).sum()
    if count > 0:
        treemap_data.append({
            'category': cat_info['label'],
            'count': count,
            'parent': 'Agentic AI Definitions'
        })
        
        # Add one sub-level per source type
        for source_type in df_definitions['source_type'].unique():
            source_count = ((df_definitions[category_col] == cat_id) & 
                          (df_definitions['source_type'] == source_type)).sum()
            if source_count > 0:
                treemap_data.append({
                    # Plotly matches parents by name when no ids are given, so node
                    # names must be unique: the category id is part of each leaf name
                    'category': f"{source_type} [{cat_id}] ({source_count})",
                    'count': source_count,
                    'parent': cat_info['label']
                })

df_treemap = pd.DataFrame(treemap_data)

# Add the root node
root_count = df_definitions.shape[0]
df_treemap = pd.concat([
    pd.DataFrame([{'category': 'Agentic AI Definitions', 'count': root_count, 'parent': ''}]),
    df_treemap
], ignore_index=True)

# Create the treemap with Plotly
# branchvalues='total': each parent count already includes its children,
# so it must not be added on top of them
fig = px.treemap(
    df_treemap,
    names='category',
    parents='parent',
    values='count',
    branchvalues='total',
    title='Hierarchical Taxonomy of Agentic AI Definitions',
    color='count',
    color_continuous_scale='RdYlBu_r'
)

fig.update_layout(
    font=dict(size=14),
    title_font=dict(size=18, family='Arial Black'),
    height=700
)

fig.write_html(TAXONOMY_DIR / 'taxonomy_treemap.html')
fig.show()

print(f"💾 Interactive treemap saved: {TAXONOMY_DIR / 'taxonomy_treemap.html'}")

# %% [markdown]
# ### 4. Sunburst Chart (Alternative to the Treemap)

# %%
# Create a sunburst chart from the same hierarchy
# branchvalues='total': each parent count already includes its children
fig = px.sunburst(
    df_treemap,
    names='category',
    parents='parent',
    values='count',
    branchvalues='total',
    title='Taxonomy of Definitions - Sunburst View',
    color='count',
    color_continuous_scale='Viridis'
)

fig.update_layout(
    font=dict(size=13),
    title_font=dict(size=18, family='Arial Black'),
    height=700
)

fig.write_html(TAXONOMY_DIR / 'taxonomy_sunburst.html')
fig.show()

print(f"💾 Sunburst saved: {TAXONOMY_DIR / 'taxonomy_sunburst.html'}")

# %% [markdown]
# ### 5. Sankey Diagram (Flow: Source Type → Category)

# %%
# Prepare the data for the Sankey diagram
source_types = df_definitions['source_type'].unique()
categories = df_definitions[category_col].unique()

# Map each node to an index (source types first, then categories)
source_to_idx = {s: i for i, s in enumerate(source_types)}
cat_to_idx = {c: i + len(source_types) for i, c in enumerate(categories)}

# Build the flows: one link per (source type, category) pair, weighted by its count.
# Counting the pairs up front avoids emitting duplicate links for the same pair.
labels = list(source_types) + [TAXONOMY_CATEGORIES[c]['label'] for c in categories]
flow_counts = Counter(zip(df_definitions['source_type'], df_definitions[category_col]))

sources = [source_to_idx[s] for s, _ in flow_counts]
targets = [cat_to_idx[c] for _, c in flow_counts]
values = list(flow_counts.values())

# Create the Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=labels,
        color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', 
               '#98D8C8', '#F7DC6F', '#BB8FCE', '#85C1E9'][:len(labels)]
    ),
    link=dict(
        source=sources,
        target=targets,
        value=values
    )
)])

fig.update_layout(
    title="Flow of Definitions: Source Type → Conceptual Category",
    font=dict(size=12),
    height=600
)

fig.write_html(TAXONOMY_DIR / 'taxonomy_sankey.html')
fig.show()

print(f"💾 Sankey saved: {TAXONOMY_DIR / 'taxonomy_sankey.html'}")

# %% [markdown]
# ## 📊 Comparative Analysis of Categories

# %%
print("\n" + "="*70)
print("COMPARATIVE ANALYSIS OF CATEGORIES")
print("="*70)

# Per-category statistics
for cat_id, cat_info in TAXONOMY_CATEGORIES.items():
    cat_defs = df_definitions[df_definitions[category_col] == cat_id]
    
    if len(cat_defs) == 0:
        continue
    
    print(f"\n{'─'*70}")
    print(f"{cat_info['label']}")
    print(f"{'─'*70}")
    
    print(f"  Total: {len(cat_defs)}")
    print(f"  Breakdown by source type:")
    
    for source_type in cat_defs['source_type'].value_counts().index:
        count = (cat_defs['source_type'] == source_type).sum()
        pct = count / len(cat_defs) * 100
        print(f"    • {source_type:15} : {count:2} ({pct:5.1f}%)")

# %% [markdown]
# ## 📋 Building the Summary for the Report

# %%
print("\n" + "="*70)
print("SUMMARY FOR THE REPORT")
print("="*70)

summary_text = f"""
MINI-TAXONOMY OF AGENTIC AI DEFINITIONS
{'='*70}

METHODOLOGY
{'-'*70}
• Semi-automatic extraction: {len(df_candidates)} candidate definitions
• Manual validation: {len(df_definitions)} final definitions
• Categorization: 4 main conceptual categories

RESULTS
{'-'*70}
"""

for cat_id, cat_info in TAXONOMY_CATEGORIES.items():
    if cat_id == 'other':
        continue
    
    count = (df_definitions[category_col] == cat_id).sum()
    pct = count / len(df_definitions) * 100 if len(df_definitions) > 0 else 0
    
    summary_text += f"\n{cat_info['label']}: {count} definitions ({pct:.1f}%)\n"
    summary_text += f"  {cat_info['description']}\n"

summary_text += f"\n{'-'*70}\n"
summary_text += "KEY INSIGHTS\n"
summary_text += f"{'-'*70}\n"

# Identify the dominant category
dominant_cat = df_definitions[category_col].value_counts().idxmax()
dominant_count = df_definitions[category_col].value_counts().max()
dominant_pct = dominant_count / len(df_definitions) * 100

summary_text += f"\n• Dominant framing: {TAXONOMY_CATEGORIES[dominant_cat]['label']} ({dominant_pct:.1f}%)\n"

# Break down by source type
summary_text += "\n• Differences by source type:\n"
for source_type in df_definitions['source_type'].unique():
    source_defs = df_definitions[df_definitions['source_type'] == source_type]
    if len(source_defs) > 0:
        top_cat = source_defs[category_col].value_counts().idxmax()
        summary_text += f"  - {source_type}: favors '{TAXONOMY_CATEGORIES[top_cat]['label']}'\n"

print(summary_text)

# Save the summary
summary_file = TAXONOMY_DIR / 'taxonomy_summary.txt'
with open(summary_file, 'w', encoding='utf-8') as f:
    f.write(summary_text)

print(f"\n💾 Summary saved: {summary_file}")

# %% [markdown]
# ## 💾 Saving All Results

# %%
print("\n" + "="*70)
print("SAVING RESULTS")
print("="*70)

# 1. Full definitions with categories
df_definitions.to_csv(TAXONOMY_DIR / 'definitions_categorized.csv', 
                     index=False, encoding='utf-8')
print(f"✅ Categorized definitions: {TAXONOMY_DIR / 'definitions_categorized.csv'}")

# 2. Table for the report (clean format)
report_table = []
for cat_id, cat_info in TAXONOMY_CATEGORIES.items():
    if cat_id == 'other':
        continue
    
    cat_defs = df_definitions[df_definitions[category_col] == cat_id]
    
    for _, row in cat_defs.iterrows():
        report_table.append({
            'Category': cat_info['label'],
            'Report': row['filename'],
            'Source_Type': row['source_type'],
            'Definition_Excerpt': row['text'][:200] + '...' if len(row['text']) > 200 else row['text']
        })

df_report_table = pd.DataFrame(report_table)
df_report_table.to_csv(TAXONOMY_DIR / 'taxonomy_table_for_report.csv', 
                       index=False, encoding='utf-8')
print(f"✅ Report table: {TAXONOMY_DIR / 'taxonomy_table_for_report.csv'}")

# 3. Aggregate statistics
stats = {
    'total_definitions': len(df_definitions),
    'num_categories': len([c for c in TAXONOMY_CATEGORIES.keys() if c != 'other']),
    'category_distribution': {
        TAXONOMY_CATEGORIES[cat]['label']: int((df_definitions[category_col] == cat).sum())
        for cat in TAXONOMY_CATEGORIES.keys()
    },
    'by_source_type': {}
}

for source_type in df_definitions['source_type'].unique():
    source_defs = df_definitions[df_definitions['source_type'] == source_type]
    stats['by_source_type'][source_type] = {
        'total': len(source_defs),
        'distribution': {
            TAXONOMY_CATEGORIES[cat]['label']: int((source_defs[category_col] == cat).sum())
            for cat in TAXONOMY_CATEGORIES.keys()
        }
    }

with open(TAXONOMY_DIR / 'taxonomy_statistics.json', 'w', encoding='utf-8') as f:
    json.dump(stats, f, indent=2)

print(f"✅ JSON statistics: {TAXONOMY_DIR / 'taxonomy_statistics.json'}")

print("\n" + "─"*70)
print("Generated files:")
print("  • definitions_for_manual_review.csv - Candidates exported for manual review")
print("  • definitions_table.csv - Definitions table with truncated excerpts")
print("  • definitions_categorized.csv - All definitions with categories")
print("  • taxonomy_table_for_report.csv - Table formatted for the report")
print("  • taxonomy_distribution.png - Distribution bar chart")
print("  • taxonomy_by_source_heatmap.png - Category × source heatmap")
print("  • taxonomy_treemap.html - Interactive treemap")
print("  • taxonomy_sunburst.html - Interactive sunburst")
print("  • taxonomy_sankey.html - Sankey flow diagram")
print("  • taxonomy_statistics.json - Aggregate statistics")
print("  • taxonomy_summary.txt - Text summary")

# %% [markdown]
# ## 📝 Text Template for the Report

# %%
print("\n" + "="*70)
print("TEMPLATE FOR THE REPORT (Section: Mini-Taxonomy)")
print("="*70)

template = f"""
### Mini-Taxonomy of Definitions of "Agentic AI"

A hybrid qualitative-quantitative analysis was conducted to extract and 
categorize explicit definitions of "agentic AI" across the corpus. The 
methodology combined:

1. **Semi-automatic extraction**: Pattern matching identified {len(df_candidates)} 
   candidate definitions containing key terms ("agentic AI", "AI agents", 
   "autonomous agents") and definitional markers ("is defined as", "refers to").

2. **Manual validation**: Each candidate was manually reviewed, resulting in 
   {len(df_definitions)} validated definitions spanning {df_definitions['filename'].nunique()} 
   reports.

3. **Conceptual categorization**: Definitions were grouped into four primary 
   conceptual frames:

"""

for cat_id in ['copilot', 'autonomous_worker', 'orchestrator', 'governance']:
    cat_info = TAXONOMY_CATEGORIES[cat_id]
    count = (df_definitions[category_col] == cat_id).sum()
    pct = count / len(df_definitions) * 100 if len(df_definitions) > 0 else 0
    
    template += f"\n**{cat_info['label']}** ({pct:.1f}%)\n"
    template += f"{cat_info['description']}\n"

template += f"""
#### Results and Interpretation

Figure X presents the distribution of definitions across conceptual categories. 
The dominant frame is **{TAXONOMY_CATEGORIES[dominant_cat]['label']}** ({dominant_pct:.1f}%), 
suggesting that the prevailing discourse conceptualizes agentic AI as 
{TAXONOMY_CATEGORIES[dominant_cat]['description'].lower()}.

**Institutional differences** (Figure Y - Heatmap) reveal:

"""

for source_type in df_definitions['source_type'].unique():
    source_defs = df_definitions[df_definitions['source_type'] == source_type]
    if len(source_defs) > 0:
        top_cat = source_defs[category_col].value_counts().idxmax()
        top_pct = (source_defs[category_col] == top_cat).sum() / len(source_defs) * 100
        template += f"- **{source_type}**: {top_pct:.0f}% frame agentic AI as '{TAXONOMY_CATEGORIES[top_cat]['label']}'\n"

template += f"""
**Key insight**: The taxonomy exposes conceptual fragmentation in how agentic 
AI is defined. While some actors emphasize augmentation (copilots), others 
stress full autonomy (autonomous workers) or systemic complexity (orchestrators). 
This definitional ambiguity poses challenges for standardization and may lead 
to misaligned adoption strategies across organizations.

**Table 1** (see Appendix) provides the complete taxonomy with representative 
definitions from each category and source type.
"""

print(template)

# Save the template
with open(TAXONOMY_DIR / 'report_template.txt', 'w', encoding='utf-8') as f:
    f.write(template)

print(f"\n💾 Template saved: {TAXONOMY_DIR / 'report_template.txt'}")

# %% [markdown]
# ## 📋 Step 5 Summary
# 
# **✅ Analyses completed:**
# - Semi-automatic extraction of definitions (pattern matching)
# - Classification into 4 conceptual categories
# - Manual validation (CSV file for review)
# - Comparative analysis by source type
# - Multiple visualizations (bar, heatmap, treemap, sunburst, sankey)
# 
# **📂 Files generated:**
# - 5 visualizations (2 PNG, 3 HTML)
# - 4 CSV files (manual review file, definitions table, categorized definitions, report table)
# - 1 JSON file (statistics)
# - 2 TXT files (summary, template)
# 
# **📊 Visuals for the report:**
# 1. `taxonomy_distribution.png` - Overall distribution
# 2. `taxonomy_by_source_heatmap.png` - Heatmap by source
# 3. `taxonomy_treemap.html` - Interactive hierarchical view
# 4. `taxonomy_table_for_report.csv` - Full table
# 
# **🔑 Key insights:**
# - Conceptual fragmentation of the definitions
# - Narrative differences between source types
# - Dominance of one particular framing (to be interpreted)
# 
# **➡️ Next step:**
# - Final synthesis and writing of the full report

# %%
print("\n" + "="*70)
print("🎉 STEP 5 COMPLETED SUCCESSFULLY!")
print("="*70)
print(f"\n📊 Taxonomy summary:")
print(f"  • Definitions extracted    : {len(df_candidates)} candidates")
print(f"  • Definitions validated    : {len(df_definitions)}")
print(f"  • Conceptual categories    : 4 main")
print(f"  • Documents covered        : {df_definitions['filename'].nunique()}")
print(f"  • Source types             : {len(df_definitions['source_type'].unique())}")
print(f"\n📂 All files are in: {TAXONOMY_DIR}")
print(f"\n✅ All analysis steps are complete!")
print(f"➡️ Ready to write the final report\n")