# Sync Validations Excel

Bidirectional synchronization between Excel and Fabric for validating opportunities.

## Execution modes

| Mode | Description | Frequency |
|------|-------------|-----------|
| `export` | Lakehouse → Excel (MERGE: appends new opportunities) | Weekly |
| `import` | Excel → Lakehouse + SQL + email notifications | Daily |
| `full_sync` | Export, then import | Ad hoc |
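
The mode table can be read as a mapping from a mode name to an ordered list of steps. The sketch below is only an illustration of that reading: `VALID_MODES` and `steps_for_mode` are hypothetical names that do not exist in the notebook, which simply tests `MODE` with `if` statements.

```python
# Hypothetical helper: maps each mode name to the ordered steps it runs.
VALID_MODES = {
    "export": ["export"],
    "import": ["import"],
    "full_sync": ["export", "import"],
}


def steps_for_mode(mode):
    """Return the ordered steps for a mode, or raise on an unknown mode."""
    try:
        return list(VALID_MODES[mode])
    except KeyError:
        raise ValueError(
            f"Unknown MODE {mode!r}; expected one of {sorted(VALID_MODES)}"
        ) from None


print(steps_for_mode("full_sync"))  # ['export', 'import']
```

Failing early on a misspelled mode avoids a run in which neither the export nor the import cell does anything.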

## Excel file

- **Location**: `/lakehouse/default/Files/weak_signals_validation.xlsx`
- **User access**: OneLake File Explorer, or download from Fabric

## 1. Configuration

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

# Execution mode: "export", "import", or "full_sync"
MODE = "full_sync"

# Ingestion week to export (format: "2024-W49")
# None = use the current week automatically
INGESTION_WEEK = None

# Path to the Excel file in the Lakehouse
EXCEL_PATH = "/lakehouse/default/Files/weak_signals_validation.xlsx"

# Lakehouse tables
TABLE_OPPORTUNITIES = "landing_feedly_opportunities"
TABLE_SALERS = "landing_salers"
TABLE_SALERS_BD = "landing_salers_bd"
TABLE_VALIDATIONS = "landing_validations"

# SMTP configuration (for email notifications)
SMTP_ENABLED = True  # Set to False to disable emails
SMTP_HOST = "smtp.office365.com"
SMTP_PORT = 587
SMTP_USER = ""  # To fill in: your @l-acoustics.com email address
SMTP_PASSWORD = ""  # To fill in: App Password

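# Compute the current week if none was specified.
# %G is the ISO 8601 year that matches the ISO week number %V; %Y (calendar year)
# would mislabel days around New Year (2024-12-30 belongs to ISO week 2025-W01).
from datetime import datetime
if INGESTION_WEEK is None:
    INGESTION_WEEK = datetime.now().strftime("%G-W%V")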

print(f"Mode: {MODE}")
print(f"Ingestion week: {INGESTION_WEEK}")
print(f"Excel: {EXCEL_PATH}")
print(f"SMTP: {'Enabled' if SMTP_ENABLED else 'Disabled'}")
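
The ingestion week uses the `YYYY-Www` label format, for example `2024-W49`. As a minimal sketch of how such labels behave at year boundaries, the hypothetical helper `iso_week_label` below builds the label from `date.isocalendar()`, which returns the ISO 8601 year together with the ISO week number. It is an illustration, not part of the notebook.

```python
from datetime import date


def iso_week_label(day):
    """Format a date as 'YYYY-Www' using the ISO 8601 year and week number."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


print(iso_week_label(date(2024, 12, 4)))   # 2024-W49
print(iso_week_label(date(2024, 12, 30)))  # 2025-W01 (calendar year is still 2024)
print(iso_week_label(date(2021, 1, 1)))    # 2020-W53 (calendar year is already 2021)
```

The last two calls show why the ISO year and the calendar year must not be mixed when building a week label.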

## 2. Imports and setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DoubleType
from pyspark.sql.functions import (
    col, current_timestamp, lit, trim, lower, when, coalesce
)
from delta.tables import DeltaTable
from datetime import datetime
import pandas as pd
import os

spark = SparkSession.builder.getOrCreate()
print("Spark session OK")

## 3. Schema of the landing_validations table

In [None]:
# Schema for the validations stored in the Lakehouse
schema_validations = StructType([
    # Identifier
    StructField("opportunity_id", StringType(), False),
    
    # Ingestion week
    StructField("ingestion_week", StringType(), True),
    
    # Opportunity details (for context)
    StructField("article_title", StringType(), True),
    StructField("article_url", StringType(), True),
    StructField("venue_name", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("vertical", StringType(), True),
    StructField("evaluation_score", IntegerType(), True),
    StructField("audit_opportunity_reason", StringType(), True),
    
    # Assigned saler
    StructField("saler_name", StringType(), True),
    StructField("saler_email", StringType(), True),
    
    # Deduplication (contextual information)
    StructField("is_duplicate", IntegerType(), True),  # 1=duplicate, 0=unique
    StructField("is_suspected_duplicate", IntegerType(), True),  # 1=gray zone
    StructField("duplicate_score", DoubleType(), True),  # Similarity score
    
    # Validation (filled in by the user)
    StructField("is_validated", IntegerType(), True),  # 1=OK, 0=KO, NULL=PENDING
    StructField("validation_comment", StringType(), True),
    StructField("validated_by", StringType(), True),
    StructField("validation_date", TimestampType(), True),
    
    # Email
    StructField("email_sent_at", TimestampType(), True),
    
    # Metadata
    StructField("created_at", TimestampType(), True),
    StructField("updated_at", TimestampType(), True),
])

print(f"Schema landing_validations: {len(schema_validations.fields)} columns")

## 4. Utility functions

In [None]:
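def file_exists(path):
    """Check whether a file exists in the Lakehouse."""
    try:
        mssparkutils.fs.head(path, 1)
        return True
    except Exception:
        # Fall back to the local mount; this also covers mssparkutils being unavailable.
        # A bare `except:` would additionally swallow KeyboardInterrupt and SystemExit.
        return os.path.exists(path)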


def get_opportunities_with_salers(ingestion_week=None):
    """
    Fetch the opportunities (audit_opportunity = 1) together with their assigned salers.
    
    Filters:
    - audit_opportunity = 1
    - is_duplicate = false OR NULL (excludes confirmed duplicates)
    - ingestion_week = the given week (if provided)
    
    Joins on country to resolve the RAE/RSM.
    """
    week_filter = ""
    if ingestion_week:
        week_filter = f"AND o.ingestion_week = '{ingestion_week}'"
    
    query = f"""
    SELECT 
        o.id,
        o.ingestion_week,
        o.article_title,
        o.article_url,
        o.venue_name,
        o.city,
        o.country,
        o.vertical,
        o.evaluation_score,
        o.audit_opportunity_reason,
        -- Deduplication
        o.is_duplicate,
        o.is_suspected_duplicate,
        o.duplicate_score,
        -- Saler
        s.who AS saler_name,
        s.email AS saler_email
    FROM {TABLE_OPPORTUNITIES} o
    LEFT JOIN {TABLE_SALERS} s 
        ON TRIM(LOWER(o.country)) = TRIM(LOWER(s.country))
    WHERE o.audit_opportunity = 1
      AND (o.is_duplicate = false OR o.is_duplicate IS NULL)
      {week_filter}
    """
    return spark.sql(query)


def send_email(to, subject, body):
    """
    Send an email through Office 365 SMTP.
    """
    if not SMTP_ENABLED:
        print(f"  [SMTP disabled] Email not sent to {to}")
        return False
    
    if not SMTP_USER or not SMTP_PASSWORD:
        print(f"  [SMTP not configured] Email not sent to {to}")
        return False
    
    try:
        import smtplib
        from email.mime.text import MIMEText
        from email.mime.multipart import MIMEMultipart
        
        msg = MIMEMultipart("alternative")
        msg["Subject"] = subject
        msg["From"] = SMTP_USER
        msg["To"] = to
        
        # HTML version
        html_part = MIMEText(body, "html")
        msg.attach(html_part)
        
        with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
            server.starttls()
            server.login(SMTP_USER, SMTP_PASSWORD)
            server.sendmail(SMTP_USER, to, msg.as_string())
        
        print(f"  ✓ Email sent to {to}")
        return True
    
    except Exception as e:
        print(f"  ✗ Failed to send email to {to}: {e}")
        return False


def format_validation_email(row):
    """
    Build the HTML body of the notification email.
    """
    status = "OK (Validated)" if row["is_validated"] == 1 else "KO (Rejected)"
    status_color = "#28a745" if row["is_validated"] == 1 else "#dc3545"
    
    def value_or(key, default):
        # row.get() only falls back when the key is absent; empty Excel cells
        # come back as NaN, so substitute the default for those as well.
        value = row.get(key, default)
        return default if pd.isna(value) else value
    
    # Extract the values up front to keep backslashes out of the f-string
    venue_name = value_or("venue_name", "N/A")
    city = value_or("city", "N/A")
    country = value_or("country", "N/A")
    vertical = value_or("vertical", "N/A")
    score = value_or("evaluation_score", "N/A")
    article_url = value_or("article_url", "#")
    article_title = value_or("article_title", "View the article")
    audit_reason = value_or("audit_opportunity_reason", "N/A")
    validated_by = value_or("validated_by", "N/A")
    comment = value_or("validation_comment", "-")
    ingestion_week = value_or("ingestion_week", "N/A")
    
    # Deduplication notice for gray-zone articles
    dedup_info = ""
    if row.get("is_suspected_duplicate") == 1:
        dedup_score = value_or("duplicate_score", 0)
        dedup_info = f"""
            <div style="background: #fff3cd; padding: 10px; border-radius: 5px; margin-top: 10px;">
                <strong>⚠️ Gray zone:</strong> This article may be a duplicate (score: {dedup_score:.0%})
            </div>
        """
    
    html = f"""
    <html>
    <body style="font-family: Arial, sans-serif; max-width: 600px; margin: 0 auto;">
        <div style="background: {status_color}; color: white; padding: 20px; text-align: center;">
            <h1 style="margin: 0;">Opportunity {status}</h1>
        </div>
        
        <div style="padding: 20px; background: #f8f9fa;">
            <h2 style="color: #333; margin-top: 0;">{venue_name}</h2>
            <p><strong>Location:</strong> {city}, {country}</p>
            <p><strong>Vertical:</strong> {vertical}</p>
            <p><strong>AI score:</strong> {score}/100</p>
            <p><strong>Week:</strong> {ingestion_week}</p>
            {dedup_info}
        </div>
        
        <div style="padding: 20px;">
            <h3>Article</h3>
            <p><a href="{article_url}">{article_title}</a></p>
            
            <h3>AI rationale</h3>
            <p style="background: #e9ecef; padding: 10px; border-radius: 5px;">
                {audit_reason}
            </p>
        </div>
        
        <div style="padding: 20px; background: #f8f9fa; border-top: 1px solid #ddd;">
            <p><strong>Validated by:</strong> {validated_by}</p>
            <p><strong>Comment:</strong> {comment}</p>
        </div>
        
        <div style="padding: 10px; text-align: center; color: #666; font-size: 12px;">
            <p>Weak Signals Pipeline - L-Acoustics</p>
        </div>
    </body>
    </html>
    """
    return html


print("Utility functions loaded")
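
`get_opportunities_with_salers` interpolates the ingestion week directly into the SQL text. One possible guard, sketched here under the assumption that week labels always look like `2024-W49`, is to check the label against that pattern before building the filter. `WEEK_PATTERN` and `build_week_filter` are hypothetical names, not part of the notebook.

```python
import re

# Hypothetical guard: accept only labels shaped like "2024-W49" (weeks 01 to 53).
WEEK_PATTERN = re.compile(r"\d{4}-W(0[1-9]|[1-4]\d|5[0-3])")


def build_week_filter(ingestion_week, alias="o"):
    """Return the SQL fragment for the week filter, or '' when no week is given."""
    if not ingestion_week:
        return ""
    if not WEEK_PATTERN.fullmatch(ingestion_week):
        raise ValueError(f"Invalid ingestion week: {ingestion_week!r}")
    return f"AND {alias}.ingestion_week = '{ingestion_week}'"


print(build_week_filter("2024-W49"))  # AND o.ingestion_week = '2024-W49'
```

Because only digits, `-` and `W` pass the check, a mistyped or malicious value cannot alter the structure of the query.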

## 5. EXPORT mode: Lakehouse → Excel

In [None]:
def run_export():
    """
    Export the opportunities to Excel.
    
    Filters applied:
    - ingestion_week = current week (INGESTION_WEEK)
    - is_duplicate = false or NULL (excludes confirmed duplicates)
    - audit_opportunity = 1
    
    Smart MERGE: appends new opportunities and keeps existing validations.
    """
    print("=" * 60)
    print("EXPORT MODE: Lakehouse → Excel")
    print("=" * 60)
    print(f"Ingestion week: {INGESTION_WEEK}")
    
    # 1. Fetch the opportunities with their salers (filtered)
    print("\n1. Fetching opportunities...")
    df_opportunities = get_opportunities_with_salers(ingestion_week=INGESTION_WEEK).toPandas()
    print(f"   {len(df_opportunities)} opportunities found for {INGESTION_WEEK}")
    
    if len(df_opportunities) == 0:
        print("   No opportunities to export for this week.")
        return
    
    # Deduplication stats
    n_suspected = len(df_opportunities[df_opportunities["is_suspected_duplicate"] == True])
    if n_suspected > 0:
        print(f"   ⚠️ {n_suspected} articles in the gray zone (potential duplicates)")
    
    # 2. Prepare the DataFrame with empty validation columns
    df_new = df_opportunities.rename(columns={"id": "opportunity_id"})
    df_new["is_validated"] = None
    df_new["validation_comment"] = None
    df_new["validated_by"] = None
    df_new["email_sent_at"] = None
    # 3. Load the existing Excel file (if there is one)
    print("\n2. Checking for an existing Excel file...")
    if file_exists(EXCEL_PATH):
        print(f"   Existing file found: {EXCEL_PATH}")
        df_existing = pd.read_excel(EXCEL_PATH)
        print(f"   {len(df_existing)} existing rows")
        
        # 4. MERGE: keep the existing validations
        print("\n3. MERGE...")
        
        # IDs already present in the Excel file
        existing_ids = set(df_existing["opportunity_id"].dropna().astype(str))
        
        # IDs from this export
        new_ids = set(df_new["opportunity_id"].astype(str))
        
        # IDs to add (in new but not in existing)
        ids_to_add = new_ids - existing_ids
        print(f"   Existing IDs: {len(existing_ids)}")
        print(f"   New IDs: {len(ids_to_add)}")
        
        # Keep only the new rows
        df_to_add = df_new[df_new["opportunity_id"].astype(str).isin(ids_to_add)]
        
        # Concatenate: existing + new
        df_final = pd.concat([df_existing, df_to_add], ignore_index=True)
        
        print(f"   Rows added: {len(df_to_add)}")
        print(f"   Final total: {len(df_final)}")
    else:
        print("   No existing file, creating one...")
        df_final = df_new
    
    # 5. Write the Excel file
    print("\n4. Writing Excel...")
    
    # Column order
    columns_order = [
        "opportunity_id",
        "ingestion_week",
        "article_title",
        "article_url",
        "venue_name",
        "city",
        "country",
        "vertical",
        "evaluation_score",
        "audit_opportunity_reason",
        "is_duplicate",
        "is_suspected_duplicate",
        "duplicate_score",
        "saler_name",
        "saler_email",
        "is_validated",
        "validation_comment",
        "validated_by",
        "email_sent_at"
    ]
    
    # Make sure every column exists
    for c in columns_order:
        if c not in df_final.columns:
            df_final[c] = None
    
    df_final = df_final[columns_order]
    df_final.to_excel(EXCEL_PATH, index=False)
    
    print(f"   ✓ File written: {EXCEL_PATH}")
    print(f"   ✓ {len(df_final)} rows in total")
    
    return df_final


# Run when the mode is export or full_sync
if MODE in ["export", "full_sync"]:
    df_exported = run_export()
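
The MERGE step of the export can be reproduced on toy data. The sketch below isolates that logic in a hypothetical function, `merge_new_rows`, using only pandas. It follows the same rule as `run_export`: rows whose `opportunity_id` already exists in the sheet are left untouched, and only unseen IDs are appended.

```python
import pandas as pd


def merge_new_rows(df_existing, df_new, key="opportunity_id"):
    """Append rows of df_new whose key is absent from df_existing."""
    existing_ids = set(df_existing[key].dropna().astype(str))
    is_new = ~df_new[key].astype(str).isin(existing_ids)
    return pd.concat([df_existing, df_new[is_new]], ignore_index=True)


# Toy data: "b" is already in the sheet, "c" is new this week.
existing = pd.DataFrame(
    {"opportunity_id": ["a", "b"], "is_validated": [1.0, float("nan")]}
)
fresh = pd.DataFrame(
    {"opportunity_id": ["b", "c"], "is_validated": [float("nan"), float("nan")]}
)

merged = merge_new_rows(existing, fresh)
print(merged["opportunity_id"].tolist())  # ['a', 'b', 'c']
```

The validation already entered for `a` survives the merge because existing rows are never rewritten from the Lakehouse side.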

## 6. IMPORT mode: Excel → Lakehouse + SQL + emails

In [None]:
def run_import():
    """
    Import the validations from Excel.
    - MERGE into the Lakehouse (landing_validations)
    - SYNC to the SQL Database (validations) via pyodbc (not implemented in this function)
    - Send emails for the new validations
    """
    print("=" * 60)
    print("IMPORT MODE: Excel → Lakehouse + SQL + Emails")
    print("=" * 60)
    
    # 1. Read the Excel file
    print("\n1. Reading Excel...")
    if not file_exists(EXCEL_PATH):
        print(f"   ✗ File not found: {EXCEL_PATH}")
        print("   Run the 'export' mode first.")
        return None
    
    df_excel = pd.read_excel(EXCEL_PATH)
    print(f"   {len(df_excel)} rows read")
    
    # 2. Keep only the validated rows (is_validated != NULL)
    df_validated = df_excel[df_excel["is_validated"].notna()].copy()
    print(f"   {len(df_validated)} validated rows (is_validated != NULL)")
    
    if len(df_validated) == 0:
        print("   No validations to import.")
        return None
    
    # 3. Identify the NEW validations (email_sent_at IS NULL)
    df_new_validations = df_validated[df_validated["email_sent_at"].isna()].copy()
    print(f"   {len(df_new_validations)} NEW validations (email not sent yet)")
    
    # 4. Convert to a Spark DataFrame
    print("\n2. Converting to Spark...")
    
    # Prepare the data
    df_validated["is_validated"] = df_validated["is_validated"].astype(int)
    df_validated["validation_date"] = datetime.now()
    df_validated["created_at"] = datetime.now()
    df_validated["updated_at"] = datetime.now()
    
    # Make sure the columns exist
    for col_name in ["ingestion_week", "is_duplicate", "is_suspected_duplicate", "duplicate_score"]:
        if col_name not in df_validated.columns:
            df_validated[col_name] = None
    
    # Rename id to opportunity_id if needed
    if "id" in df_validated.columns and "opportunity_id" not in df_validated.columns:
        df_validated = df_validated.rename(columns={"id": "opportunity_id"})
    
    sdf_validations = spark.createDataFrame(df_validated)
    print(f"   {sdf_validations.count()} rows converted")
    
    # 5. MERGE into the Lakehouse (landing_validations)
    print("\n3. MERGE into the Lakehouse...")
    
    if not spark.catalog.tableExists(TABLE_VALIDATIONS):
        print(f"   Creating table {TABLE_VALIDATIONS}...")
        sdf_validations.write.format("delta").mode("overwrite").saveAsTable(TABLE_VALIDATIONS)
    else:
        delta_table = DeltaTable.forName(spark, TABLE_VALIDATIONS)
        delta_table.alias("target").merge(
            sdf_validations.alias("source"),
            "target.opportunity_id = source.opportunity_id"
        ).whenMatchedUpdate(
            set={
                "is_validated": "source.is_validated",
                "validation_comment": "source.validation_comment",
                "validated_by": "source.validated_by",
                "validation_date": "source.validation_date",
                "email_sent_at": "source.email_sent_at",
                "ingestion_week": "source.ingestion_week",
                "is_duplicate": "source.is_duplicate",
                "is_suspected_duplicate": "source.is_suspected_duplicate",
                "duplicate_score": "source.duplicate_score",
                "updated_at": "source.updated_at"
            }
        ).whenNotMatchedInsertAll().execute()
    
    total_lakehouse = spark.sql(f"SELECT COUNT(*) FROM {TABLE_VALIDATIONS}").collect()[0][0]
    print(f"   ✓ Table {TABLE_VALIDATIONS}: {total_lakehouse} rows")
    
    # 6. Send the emails for the new validations
    print("\n4. Sending emails...")
    emails_sent = 0
    emails_to_update = []
    
    for idx, row in df_new_validations.iterrows():
        saler_email = row.get("saler_email")
        if pd.isna(saler_email) or not saler_email:
            print(f"  - {row.get('venue_name', 'N/A')}: no saler email")
            continue
        
        # Format and send the email
        status = "OK" if row["is_validated"] == 1 else "KO"
        subject = f"[Weak Signals] Opportunity {status}: {row.get('venue_name', 'N/A')}"
        body = format_validation_email(row)
        
        if send_email(saler_email, subject, body):
            emails_sent += 1
            emails_to_update.append(row["opportunity_id"])
    
    print(f"   ✓ {emails_sent} emails sent")
    
    # 7. Update email_sent_at in Excel
    if emails_to_update:
        print("\n5. Updating email_sent_at in Excel...")
        now = datetime.now()
        
        for opp_id in emails_to_update:
            df_excel.loc[df_excel["opportunity_id"] == opp_id, "email_sent_at"] = now
        
        df_excel.to_excel(EXCEL_PATH, index=False)
        print(f"   ✓ {len(emails_to_update)} rows updated")
        
        # Update the Lakehouse as well
        print("\n6. Updating email_sent_at in the Lakehouse...")
        for opp_id in emails_to_update:
            spark.sql(f"""
                UPDATE {TABLE_VALIDATIONS}
                SET email_sent_at = current_timestamp(), updated_at = current_timestamp()
                WHERE opportunity_id = '{opp_id}'
            """)
        print(f"   ✓ Lakehouse updated")
    
    # Summary
    print("\n" + "=" * 60)
    print("IMPORT SUMMARY")
    print("=" * 60)
    print(f"Validations imported: {len(df_validated)}")
    print(f"New validations: {len(df_new_validations)}")
    print(f"Emails sent: {emails_sent}")
    
    return df_validated


# Run when the mode is import or full_sync
if MODE in ["import", "full_sync"]:
    df_imported = run_import()
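
The import passes the pandas frame straight to `spark.createDataFrame`, which has to infer a type for every column, including columns that are entirely empty in Excel. One way to avoid relying on inference, sketched here in pandas only, is to coerce each row to plain Python values first. `COLUMN_TYPES` and `to_typed_records` are hypothetical names, and the mapping mirrors only a few fields of `schema_validations`.

```python
import pandas as pd

# Hypothetical mapping from column name to target Python type.
# It mirrors only part of schema_validations, for illustration.
COLUMN_TYPES = {
    "opportunity_id": str,
    "is_validated": int,
    "duplicate_score": float,
    "validation_comment": str,
}


def to_typed_records(df, column_types):
    """Convert rows to dicts of plain Python values, mapping NaN/None to None."""
    records = []
    for _, row in df.iterrows():
        record = {}
        for name, py_type in column_types.items():
            value = row[name] if name in df.columns else None
            record[name] = None if pd.isna(value) else py_type(value)
        records.append(record)
    return records


# Toy sheet: is_validated is read back from Excel as float, one comment is empty,
# and duplicate_score is missing altogether.
sheet = pd.DataFrame({
    "opportunity_id": ["101", "102"],
    "is_validated": [1.0, 0.0],
    "validation_comment": [None, "too small"],
})
records = to_typed_records(sheet, COLUMN_TYPES)
print(records[0])
```

Records shaped like this could then be handed to `spark.createDataFrame` together with an explicit schema. That last step is an untested assumption about the Fabric runtime rather than something the notebook does today.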

## 7. Statistics

In [None]:
# Statistics for the landing_validations table
print("=" * 60)
print("STATISTICS")
print("=" * 60)

if spark.catalog.tableExists(TABLE_VALIDATIONS):
    print(f"\nTable {TABLE_VALIDATIONS}:")
    
    # Count by status and week
    spark.sql(f"""
        SELECT 
            ingestion_week,
            CASE 
                WHEN is_validated = 1 THEN 'OK'
                WHEN is_validated = 0 THEN 'KO'
                ELSE 'PENDING'
            END as statut,
            COUNT(*) as nb,
            SUM(CASE WHEN email_sent_at IS NOT NULL THEN 1 ELSE 0 END) as emails_envoyes
        FROM {TABLE_VALIDATIONS}
        GROUP BY ingestion_week, is_validated
        ORDER BY ingestion_week DESC, statut
    """).show(20)
    
    # Validations without an email
    pending_emails = spark.sql(f"""
        SELECT COUNT(*) FROM {TABLE_VALIDATIONS}
        WHERE is_validated IS NOT NULL AND email_sent_at IS NULL
    """).collect()[0][0]
    
    if pending_emails > 0:
        print(f"⚠ {pending_emails} validations with no email sent")
    
    # Deduplication stats
    print("\nDeduplication statistics:")
    spark.sql(f"""
        SELECT 
            ingestion_week,
            SUM(CASE WHEN is_suspected_duplicate = 1 THEN 1 ELSE 0 END) as zone_grise,
            COUNT(*) as total
        FROM {TABLE_VALIDATIONS}
        GROUP BY ingestion_week
        ORDER BY ingestion_week DESC
    """).show(10)
else:
    print(f"Table {TABLE_VALIDATIONS} does not exist yet.")

# Excel file stats
if file_exists(EXCEL_PATH):
    df_stats = pd.read_excel(EXCEL_PATH)
    print(f"\nExcel file ({EXCEL_PATH}):")
    print(f"  Total rows: {len(df_stats)}")
    print(f"  Validated (OK): {len(df_stats[df_stats['is_validated'] == 1])}")
    print(f"  Rejected (KO): {len(df_stats[df_stats['is_validated'] == 0])}")
    print(f"  Pending: {len(df_stats[df_stats['is_validated'].isna()])}")
    
    # Stats per week
    if "ingestion_week" in df_stats.columns:
        print(f"\n  Per week:")
        print(df_stats.groupby("ingestion_week").size().to_string())

## 8. Delta history

In [None]:
if spark.catalog.tableExists(TABLE_VALIDATIONS):
    print(f"Delta history of '{TABLE_VALIDATIONS}':")
    spark.sql(f"DESCRIBE HISTORY {TABLE_VALIDATIONS}").select(
        "version", "timestamp", "operation", "operationMetrics"
    ).show(5, truncate=50)

---

## User guide

### Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| `MODE` | Execution mode | `"export"`, `"import"`, `"full_sync"` |
| `INGESTION_WEEK` | Week to export | `"2024-W49"` or `None` (auto) |

### Weekly workflow

```
Monday morning:
1. Run deduplicate_weekly.ipynb (deduplication)
2. Run sync_validations_excel.ipynb with MODE="export"
3. Download the Excel file

During the week:
4. Validate the opportunities in Excel
   - is_validated: 1 (OK) or 0 (KO)
   - validation_comment: optional comment
   - validated_by: your email address

Friday:
5. Upload the Excel file
6. Run sync_validations_excel.ipynb with MODE="import"
7. Emails are sent automatically to the salers
```

### Excel columns

| Column | Description | Filled in by |
|--------|-------------|--------------|
| `opportunity_id` | Unique ID | Auto |
| `ingestion_week` | Ingestion week | Auto |
| `article_title` | Article title | Auto |
| `venue_name` | Venue name | Auto |
| `is_duplicate` | Confirmed duplicate | Auto (0/1) |
| `is_suspected_duplicate` | Gray zone | Auto (0/1) |
| `duplicate_score` | Similarity score | Auto (0.0-1.0) |
| `is_validated` | **Validation** | **User** (1=OK, 0=KO) |
| `validation_comment` | Comment | User |
| `validated_by` | Validator's email | User |
| `email_sent_at` | Time the email was sent | Auto |
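
Since the user-filled columns are typed by hand, a quick check before import can catch slips. The sketch below is a hypothetical pre-import check, not something the notebook runs. It assumes that `is_validated` must be 0 or 1 when filled in, and that a filled-in row should carry an email address in `validated_by`. The second rule is an assumption, as the notebook does not enforce it.

```python
import pandas as pd


def find_invalid_rows(df):
    """Return the rows whose user-filled columns break the conventions above."""
    filled = df["is_validated"].notna()
    # is_validated must be 1 or 0 when it is filled in
    bad_flag = filled & ~df["is_validated"].isin([0, 1])
    # assumed rule: a filled-in row names its validator (loosely checked via '@')
    has_author = df["validated_by"].fillna("").astype(str).str.contains("@")
    bad_author = filled & ~has_author
    return df[bad_flag | bad_author]


sheet = pd.DataFrame({
    "opportunity_id": ["a", "b", "c", "d"],
    "is_validated": [1, 0, 2, None],
    "validated_by": ["x@example.com", None, "y@example.com", None],
})
print(find_invalid_rows(sheet)["opportunity_id"].tolist())  # ['b', 'c']
```

Row `b` is flagged for its missing validator, row `c` for the out-of-range value, and the pending row `d` is ignored.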

### SMTP configuration

Fill in the variables at the top of the notebook:
- `SMTP_USER`: your @l-acoustics.com email address
- `SMTP_PASSWORD`: your App Password (created at https://mysignins.microsoft.com/security-info)
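
As an alternative to typing the credentials into the notebook, they could be read from environment variables so that the password never appears in a saved cell. This is only a sketch: the variable names `WEAK_SIGNALS_SMTP_USER` and `WEAK_SIGNALS_SMTP_PASSWORD` and the helper `load_smtp_credentials` are hypothetical, and whether environment variables are available depends on how the notebook is run.

```python
import os


def load_smtp_credentials(env=None):
    """Read SMTP credentials from an environment-style mapping.

    Returns (user, password, configured); the variable names are hypothetical.
    """
    env = os.environ if env is None else env
    user = env.get("WEAK_SIGNALS_SMTP_USER", "")
    password = env.get("WEAK_SIGNALS_SMTP_PASSWORD", "")
    return user, password, bool(user and password)


# Passing a plain dict makes the behaviour easy to try without touching os.environ.
user, password, configured = load_smtp_credentials(
    {"WEAK_SIGNALS_SMTP_USER": "someone@example.com"}
)
print(configured)  # False: the password is missing
```

The `configured` flag plays the same role as the `SMTP_USER`/`SMTP_PASSWORD` emptiness check in `send_email`.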

### Deduplication

Articles are filtered before export:
- **Confirmed duplicates** (score ≥ 0.90): **excluded** automatically
- **Gray zone** (0.85 ≤ score < 0.90): **included** with a warning ⚠️
- **Unique** (score < 0.85): **included** as usual
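
The three bands above can be written as a small classification function. The sketch below uses the thresholds stated in the list. The function name `classify_duplicate` and the label strings are hypothetical, since the actual classification happens upstream in the deduplication notebook.

```python
DUPLICATE_THRESHOLD = 0.90  # score >= 0.90: confirmed duplicate, excluded
GRAY_ZONE_THRESHOLD = 0.85  # 0.85 <= score < 0.90: gray zone, included with a warning


def classify_duplicate(score):
    """Map a similarity score in [0, 1] to one of the three bands."""
    if score >= DUPLICATE_THRESHOLD:
        return "duplicate"
    if score >= GRAY_ZONE_THRESHOLD:
        return "gray_zone"
    return "unique"


for score in (0.95, 0.87, 0.50):
    print(score, classify_duplicate(score))
```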