# Sync Validations - IMPORT

Imports validations from Excel into the Lakehouse and sends notification emails to the sales reps.

## Frequency
- **CRON**: daily or weekly (Friday)

## Flow
```
Excel → landing_validations (Lakehouse) → Emails to sales reps
```

## Actions
1. Read the Excel file
2. MERGE into the `landing_validations` table
3. Send emails for the new validations
4. Update `email_sent_at` in Excel and in the Lakehouse
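
As a rough illustration of step 2, the MERGE behaves like an upsert keyed on `opportunity_id`: rows that already exist only have their validation fields refreshed, and unknown rows are inserted whole. The sketch below mimics that rule on plain dictionaries. `merge_validations` and `UPDATED_FIELDS` are hypothetical names used for illustration, and this is not the Delta Lake API.

```python
# Plain-Python illustration of upsert semantics (not the Delta Lake API).
# Columns refreshed when a row with the same opportunity_id already exists.
UPDATED_FIELDS = (
    "is_validated", "validation_comment", "validated_by",
    "validation_date", "email_sent_at", "updated_at",
)


def merge_validations(target, source):
    """Upsert `source` rows into `target`, a dict keyed by opportunity_id."""
    for row in source:
        key = row["opportunity_id"]
        if key in target:
            # Matched: update only the validation-related columns.
            for field in UPDATED_FIELDS:
                target[key][field] = row.get(field)
        else:
            # Not matched: insert the full row.
            target[key] = dict(row)
    return target


table = {"A1": {"opportunity_id": "A1", "venue_name": "Hall", "is_validated": None}}
incoming = [
    {"opportunity_id": "A1", "venue_name": "Renamed", "is_validated": 1},
    {"opportunity_id": "B2", "venue_name": "Arena", "is_validated": 0},
]
merge_validations(table, incoming)
```

Here the existing row `A1` keeps its original `venue_name` but picks up the new `is_validated` value, while `B2` is added as a new row.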

## 1. Configuration

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================

# Path of the Excel file in the Lakehouse
EXCEL_PATH = "/lakehouse/default/Files/weak_signals_validation.xlsx"

# Lakehouse table for the validations
TABLE_VALIDATIONS = "landing_validations"

# SMTP configuration (for the emails)
SMTP_ENABLED = True  # Set to False to disable emails
SMTP_HOST = "smtp.office365.com"
SMTP_PORT = 587
SMTP_USER = ""  # To fill in: your @l-acoustics.com email
SMTP_PASSWORD = ""  # To fill in: App Password

print(f"Excel: {EXCEL_PATH}")
print(f"Table: {TABLE_VALIDATIONS}")
print(f"SMTP: {'Enabled' if SMTP_ENABLED else 'Disabled'}")

## 2. Imports and Setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DoubleType, BooleanType
from pyspark.sql.functions import col, current_timestamp, lit
from delta.tables import DeltaTable
from datetime import datetime
import pandas as pd
import os

spark = SparkSession.builder.getOrCreate()
print("Spark session ready")

## 3. Schema of the landing_validations table

In [None]:
# Schema of the validations in the Lakehouse
schema_validations = StructType([
    # Identifier
    StructField("opportunity_id", StringType(), False),

    # Ingestion week
    StructField("ingestion_week", StringType(), True),

    # Opportunity details (for context)
    StructField("article_title", StringType(), True),
    StructField("article_url", StringType(), True),
    StructField("venue_name", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("vertical", StringType(), True),
    StructField("evaluation_score", IntegerType(), True),
    StructField("audit_opportunity_reason", StringType(), True),

    # Assigned sales rep
    StructField("saler_name", StringType(), True),
    StructField("saler_email", StringType(), True),

    # Deduplication (contextual info)
    StructField("is_duplicate", BooleanType(), True),
    StructField("is_suspected_duplicate", BooleanType(), True),
    StructField("duplicate_score", DoubleType(), True),

    # Validation (filled in by the user)
    StructField("is_validated", IntegerType(), True),  # 1=OK, 0=KO, NULL=PENDING
    StructField("validation_comment", StringType(), True),
    StructField("validated_by", StringType(), True),
    StructField("validation_date", TimestampType(), True),

    # Email
    StructField("email_sent_at", TimestampType(), True),

    # Metadata
    StructField("created_at", TimestampType(), True),
    StructField("updated_at", TimestampType(), True),
])

print(f"Schema landing_validations: {len(schema_validations.fields)} columns")

## 4. Utility functions

In [None]:
def file_exists(path):
    """Check whether a file exists in the Lakehouse."""
    try:
        # mssparkutils is provided by the notebook runtime, not imported here
        mssparkutils.fs.head(path, 1)
        return True
    except Exception:
        # Fall back to the local filesystem (also covers a missing mssparkutils)
        return os.path.exists(path)


def send_email(to, subject, body):
    """
    Send an email through Office 365 SMTP.
    Returns True when the email was sent, False otherwise.
    """
    if not SMTP_ENABLED:
        print(f"  [SMTP disabled] Email not sent to {to}")
        return False

    if not SMTP_USER or not SMTP_PASSWORD:
        print(f"  [SMTP not configured] Email not sent to {to}")
        return False

    try:
        import smtplib
        from email.mime.text import MIMEText
        from email.mime.multipart import MIMEMultipart

        msg = MIMEMultipart("alternative")
        msg["Subject"] = subject
        msg["From"] = SMTP_USER
        msg["To"] = to

        # HTML version
        html_part = MIMEText(body, "html")
        msg.attach(html_part)

        with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
            server.starttls()
            server.login(SMTP_USER, SMTP_PASSWORD)
            server.sendmail(SMTP_USER, to, msg.as_string())

        print(f"  ✓ Email sent to {to}")
        return True

    except Exception as e:
        print(f"  ✗ Error sending email to {to}: {e}")
        return False


def format_validation_email(row):
    """
    Format the HTML body of the notification email.
    """
    status = "OK (Validated)" if row["is_validated"] == 1 else "KO (Rejected)"
    status_color = "#28a745" if row["is_validated"] == 1 else "#dc3545"

    def value(key, default="N/A"):
        # Empty Excel cells are read as NaN, which row.get() returns as-is,
        # so fall back to the default explicitly.
        v = row.get(key, default)
        return default if pd.isna(v) else v

    # Extract the values
    venue_name = value("venue_name")
    city = value("city")
    country = value("country")
    vertical = value("vertical")
    score = value("evaluation_score")
    article_url = value("article_url", "#")
    article_title = value("article_title", "View the article")
    audit_reason = value("audit_opportunity_reason")
    validated_by = value("validated_by")
    comment = value("validation_comment", "-")
    ingestion_week = value("ingestion_week")

    # Deduplication info for gray-zone articles
    dedup_info = ""
    if value("is_suspected_duplicate", False) == 1:  # True == 1 in Python
        dedup_score = value("duplicate_score", 0)
        dedup_info = f"""
            <div style="background: #fff3cd; padding: 10px; border-radius: 5px; margin-top: 10px;">
                <strong>⚠️ Gray zone:</strong> This article may be a duplicate (score: {dedup_score:.0%})
            </div>
        """

    html = f"""
    <html>
    <body style="font-family: Arial, sans-serif; max-width: 600px; margin: 0 auto;">
        <div style="background: {status_color}; color: white; padding: 20px; text-align: center;">
            <h1 style="margin: 0;">Opportunity {status}</h1>
        </div>

        <div style="padding: 20px; background: #f8f9fa;">
            <h2 style="color: #333; margin-top: 0;">{venue_name}</h2>
            <p><strong>Location:</strong> {city}, {country}</p>
            <p><strong>Vertical:</strong> {vertical}</p>
            <p><strong>AI score:</strong> {score}/100</p>
            <p><strong>Week:</strong> {ingestion_week}</p>
            {dedup_info}
        </div>

        <div style="padding: 20px;">
            <h3>Article</h3>
            <p><a href="{article_url}">{article_title}</a></p>

            <h3>AI rationale</h3>
            <p style="background: #e9ecef; padding: 10px; border-radius: 5px;">
                {audit_reason}
            </p>
        </div>

        <div style="padding: 20px; background: #f8f9fa; border-top: 1px solid #ddd;">
            <p><strong>Validated by:</strong> {validated_by}</p>
            <p><strong>Comment:</strong> {comment}</p>
        </div>

        <div style="padding: 10px; text-align: center; color: #666; font-size: 12px;">
            <p>Weak Signals Pipeline - L-Acoustics</p>
        </div>
    </body>
    </html>
    """
    return html


print("Utility functions loaded")
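
The email template interpolates spreadsheet values straight into HTML, so a comment containing `<` or `&` could break the layout. One possible hardening, sketched here under the assumption that every field should be rendered as plain text, is to escape each value first. `safe_html` is a hypothetical helper, not part of the notebook.

```python
import html
import math


def safe_html(value, default="N/A"):
    """Return `value` as HTML-escaped text, or `default` when it is missing."""
    # Empty Excel cells typically arrive as None or float NaN.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    return html.escape(str(value), quote=True)


print(safe_html("Rock & <Roll> Arena"))  # Rock &amp; &lt;Roll&gt; Arena
print(safe_html(float("nan")))           # N/A
```

Each template variable could be wrapped in such a helper before being placed in the f-string, with the URL handled separately if stricter validation is needed.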

## 5. Import: Excel → Lakehouse + Emails

In [None]:
def run_import():
    """
    Import the validations from Excel.
    - MERGE into the Lakehouse (landing_validations)
    - Send emails for the new validations
    """
    import numpy as np

    print("=" * 60)
    print("IMPORT: Excel → Lakehouse + Emails")
    print("=" * 60)

    # 1. Read the Excel file
    print("\n1. Reading Excel...")
    if not file_exists(EXCEL_PATH):
        print(f"   ✗ File not found: {EXCEL_PATH}")
        print("   Run sync_validations_export.ipynb first.")
        return None

    df_excel = pd.read_excel(EXCEL_PATH)
    print(f"   {len(df_excel)} rows read")

    # 2. Keep only the validated rows (is_validated != NULL)
    df_validated = df_excel[df_excel["is_validated"].notna()].copy()
    print(f"   {len(df_validated)} validated rows (is_validated != NULL)")

    if len(df_validated) == 0:
        print("   No validation to import.")
        return None

    # 3. Identify the NEW validations (email_sent_at IS NULL)
    df_new_validations = df_validated[df_validated["email_sent_at"].isna()].copy()
    print(f"   {len(df_new_validations)} NEW validations (email not sent yet)")
    
    # 4. Prepare the data for Spark
    print("\n2. Preparing data...")

    # Make sure every column of the schema exists
    required_cols = ["opportunity_id", "ingestion_week", "article_title", "article_url",
                     "venue_name", "city", "country", "vertical", "evaluation_score",
                     "audit_opportunity_reason", "saler_name", "saler_email",
                     "is_duplicate", "is_suspected_duplicate", "duplicate_score",
                     "is_validated", "validation_comment", "validated_by",
                     "validation_date", "email_sent_at", "created_at", "updated_at"]

    for col_name in required_cols:
        if col_name not in df_validated.columns:
            df_validated[col_name] = None

    # Keep only the required columns, in schema order
    df_validated = df_validated[required_cols]

    # Convert types - IMPORTANT: do this before replacing NaN
    df_validated["is_validated"] = df_validated["is_validated"].astype(int)

    # Replace NaN with None in every column
    df_validated = df_validated.replace({np.nan: None})

    # Convert the booleans explicitly
    for bool_col in ["is_duplicate", "is_suspected_duplicate"]:
        df_validated[bool_col] = df_validated[bool_col].apply(
            lambda x: bool(x) if x is not None else None
        )

    # Set the timestamps: every imported row is stamped with the current time
    now = datetime.now()
    df_validated["validation_date"] = now
    df_validated["created_at"] = now
    df_validated["updated_at"] = now
    # email_sent_at stays None when empty
    df_validated["email_sent_at"] = df_validated["email_sent_at"].apply(
        lambda x: pd.to_datetime(x) if x is not None else None
    )

    # 5. Convert to a Spark DataFrame
    print("\n3. Spark conversion...")
    sdf_validations = spark.createDataFrame(df_validated, schema=schema_validations)
    print(f"   {sdf_validations.count()} rows converted")
    
    # 6. MERGE into the Lakehouse (landing_validations)
    print("\n4. MERGE into Lakehouse...")

    if not spark.catalog.tableExists(TABLE_VALIDATIONS):
        print(f"   Creating table {TABLE_VALIDATIONS}...")
        sdf_validations.write.format("delta").mode("overwrite").saveAsTable(TABLE_VALIDATIONS)
    else:
        # Matched rows: refresh only the validation fields; new rows: insert everything
        delta_table = DeltaTable.forName(spark, TABLE_VALIDATIONS)
        delta_table.alias("target").merge(
            sdf_validations.alias("source"),
            "target.opportunity_id = source.opportunity_id"
        ).whenMatchedUpdate(
            set={
                "is_validated": "source.is_validated",
                "validation_comment": "source.validation_comment",
                "validated_by": "source.validated_by",
                "validation_date": "source.validation_date",
                "email_sent_at": "source.email_sent_at",
                "updated_at": "source.updated_at"
            }
        ).whenNotMatchedInsertAll().execute()

    total_lakehouse = spark.sql(f"SELECT COUNT(*) FROM {TABLE_VALIDATIONS}").collect()[0][0]
    print(f"   ✓ Table {TABLE_VALIDATIONS}: {total_lakehouse} rows")
    
    # 7. Send the emails for the new validations
    print("\n5. Sending emails...")
    emails_sent = 0
    emails_to_update = []

    for idx, row in df_new_validations.iterrows():
        saler_email = row.get("saler_email")
        if pd.isna(saler_email) or not saler_email:
            print(f"  - {row.get('venue_name', 'N/A')}: no sales rep email")
            continue

        # Format and send the email
        status = "OK" if row["is_validated"] == 1 else "KO"
        subject = f"[Weak Signals] Opportunity {status}: {row.get('venue_name', 'N/A')}"
        body = format_validation_email(row)

        # Only rows whose email actually went out get email_sent_at updated
        if send_email(saler_email, subject, body):
            emails_sent += 1
            emails_to_update.append(row["opportunity_id"])

    print(f"   ✓ {emails_sent} emails sent")
    
    # 8. Update email_sent_at
    if emails_to_update:
        print("\n6. Updating email_sent_at...")
        now = datetime.now()

        # In Excel
        df_excel.loc[df_excel["opportunity_id"].isin(emails_to_update), "email_sent_at"] = now

        df_excel.to_excel(EXCEL_PATH, index=False)
        print(f"   ✓ Excel: {len(emails_to_update)} rows updated")

        # In the Lakehouse: a single UPDATE for all ids, with no id interpolated into SQL text
        DeltaTable.forName(spark, TABLE_VALIDATIONS).update(
            condition=col("opportunity_id").isin(emails_to_update),
            set={"email_sent_at": current_timestamp(), "updated_at": current_timestamp()},
        )
        print("   ✓ Lakehouse updated")

    # Summary
    print("\n" + "=" * 60)
    print("IMPORT SUMMARY")
    print("=" * 60)
    print(f"Validations imported: {len(df_validated)}")
    print(f"New validations: {len(df_new_validations)}")
    print(f"Emails sent: {emails_sent}")

    return df_validated


# Run the import
df_imported = run_import()

## 6. Statistics

In [None]:
# Statistics for the landing_validations table
print("=" * 60)
print("STATISTICS")
print("=" * 60)

if spark.catalog.tableExists(TABLE_VALIDATIONS):
    print(f"\nTable {TABLE_VALIDATIONS}:")

    # Count by status and week
    spark.sql(f"""
        SELECT 
            ingestion_week,
            CASE 
                WHEN is_validated = 1 THEN 'OK'
                WHEN is_validated = 0 THEN 'KO'
                ELSE 'PENDING'
            END as statut,
            COUNT(*) as nb,
            SUM(CASE WHEN email_sent_at IS NOT NULL THEN 1 ELSE 0 END) as emails_envoyes
        FROM {TABLE_VALIDATIONS}
        GROUP BY ingestion_week, is_validated
        ORDER BY ingestion_week DESC, statut
    """).show(20)

    # Validations with no email sent
    pending_emails = spark.sql(f"""
        SELECT COUNT(*) FROM {TABLE_VALIDATIONS}
        WHERE is_validated IS NOT NULL AND email_sent_at IS NULL
    """).collect()[0][0]

    if pending_emails > 0:
        print(f"⚠ {pending_emails} validations with no email sent")
else:
    print(f"Table {TABLE_VALIDATIONS} does not exist yet.")

# Excel file stats
if file_exists(EXCEL_PATH):
    df_stats = pd.read_excel(EXCEL_PATH)
    print(f"\nExcel file ({EXCEL_PATH}):")
    print(f"  Total rows: {len(df_stats)}")
    print(f"  Validated (OK): {len(df_stats[df_stats['is_validated'] == 1])}")
    print(f"  Rejected (KO): {len(df_stats[df_stats['is_validated'] == 0])}")
    print(f"  Pending: {len(df_stats[df_stats['is_validated'].isna()])}")
    print(f"  Emails sent: {len(df_stats[df_stats['email_sent_at'].notna()])}")

## 7. Delta history

In [None]:
if spark.catalog.tableExists(TABLE_VALIDATIONS):
    print(f"Delta history of '{TABLE_VALIDATIONS}':")
    spark.sql(f"DESCRIBE HISTORY {TABLE_VALIDATIONS}").select(
        "version", "timestamp", "operation", "operationMetrics"
    ).show(5, truncate=50)

---

## Usage guide

### Prerequisites

1. Run `sync_validations_export.ipynb` at least once
2. Validate opportunities in the Excel file

### SMTP configuration

To enable email sending, fill in these variables:
- `SMTP_USER`: your @l-acoustics.com email
- `SMTP_PASSWORD`: your App Password

Create an App Password: https://mysignins.microsoft.com/security-info
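
To keep the App Password out of the notebook itself, one option is to read both values from environment variables. This is only a sketch: the variable names `WEAK_SIGNALS_SMTP_USER` and `WEAK_SIGNALS_SMTP_PASSWORD` and the helper `load_smtp_credentials` are assumptions, and your environment may offer a dedicated secret store instead.

```python
import os


def load_smtp_credentials(env=os.environ):
    """Return (user, password) read from the environment, or empty strings."""
    # Hypothetical variable names; adapt them to your own setup.
    user = env.get("WEAK_SIGNALS_SMTP_USER", "")
    password = env.get("WEAK_SIGNALS_SMTP_PASSWORD", "")
    return user, password


# Empty strings keep the notebook's "SMTP not configured" path working.
SMTP_USER, SMTP_PASSWORD = load_smtp_credentials()
```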

### Validation in Excel

Columns to fill in:
- `is_validated`: 1 (OK) or 0 (KO)
- `validation_comment`: optional comment
- `validated_by`: your email
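
Because the file is edited by hand, it may be worth checking `is_validated` before importing. The sketch below flags values other than 1, 0, or empty. `invalid_validation_flags` and `is_empty` are hypothetical helpers, and rows are assumed to be plain dictionaries (for example from `df.to_dict("records")`).

```python
import math


def is_empty(value):
    """True for None, NaN, or a blank string (an unfilled Excel cell)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def invalid_validation_flags(rows):
    """Return the opportunity_ids whose is_validated is not 1, 0, or empty."""
    bad = []
    for row in rows:
        flag = row.get("is_validated")
        if is_empty(flag):
            continue  # still pending, nothing to check
        # Excel numbers are often read as floats, so 1.0 and 0.0 are accepted.
        if flag not in (0, 1):
            bad.append(row["opportunity_id"])
    return bad


rows = [
    {"opportunity_id": "A1", "is_validated": 1.0},
    {"opportunity_id": "B2", "is_validated": "yes"},
    {"opportunity_id": "C3", "is_validated": None},
    {"opportunity_id": "D4", "is_validated": 2},
]
print(invalid_validation_flags(rows))  # ['B2', 'D4']
```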

### Emails sent

An email is sent to the sales rep (`saler_email`) for each validation where:
- `is_validated` is filled in (1 or 0)
- `email_sent_at` is empty (avoids duplicates)
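
The two conditions above amount to a simple predicate. A minimal sketch, assuming each row is a plain dictionary and that empty cells arrive as `None` or NaN; `needs_email` is a hypothetical name, not a function of the notebook.

```python
import math


def _is_missing(value):
    # Empty Excel cells typically arrive as None or float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))


def needs_email(row):
    """True when the row is validated (1 or 0) and no email was sent yet."""
    validated = not _is_missing(row.get("is_validated"))
    already_sent = not _is_missing(row.get("email_sent_at"))
    return validated and not already_sent
```

Rows that pass this check but have no `saler_email` are still skipped by the import, so they remain candidates on the next run.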