# Weekly Deduplication

Detects semantic duplicates among the current week's articles by comparing them with each other and with the full history.

## Principle

Uses `ai.similarity()` from Fabric AI Functions to compare text:
- **Compared text**: `article_title`, `venue_name`, `city` and `country`, joined with ` | `
- **Thresholds**:
  - `>= 0.90`: confirmed duplicate (`is_duplicate = true`)
  - `>= 0.85` and `< 0.90`: gray zone (`is_suspected_duplicate = true`)
  - `< 0.85`: unique
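
A minimal sketch of the threshold rule above in plain Python, handy for checking boundary values without a Spark session. The function name `classify_score` and its return labels are illustrative and do not appear in the notebook.

```python
THRESHOLD_DUPLICATE = 0.90  # same value as in the notebook configuration
THRESHOLD_SUSPECTED = 0.85  # same value as in the notebook configuration


def classify_score(score: float) -> str:
    """Map a similarity score to one of the three classes (hypothetical helper)."""
    if score >= THRESHOLD_DUPLICATE:
        return "duplicate"   # would set is_duplicate = true
    if score >= THRESHOLD_SUSPECTED:
        return "suspected"   # would set is_suspected_duplicate = true
    return "unique"


for value in (0.95, 0.90, 0.87, 0.85, 0.50):
    print(value, classify_score(value))
```

Both boundaries are inclusive on the lower side, so a score of exactly `0.90` counts as a confirmed duplicate and exactly `0.85` falls in the gray zone.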

## Scheduling

```
# Monday 6:00 - Article ingestion
landing_feedly_opportunities.ipynb (incremental mode)

# Monday 7:00 - Deduplication (THIS NOTEBOOK)
deduplicate_weekly.ipynb

# Monday 8:00 - Excel export
sync_validations_excel.ipynb (export mode)
```

## 1. Configuration

In [None]:
# Source table
TABLE_LANDING = "landing_feedly_opportunities"

# Similarity thresholds
THRESHOLD_DUPLICATE = 0.90      # >= 0.90 = confirmed duplicate
THRESHOLD_SUSPECTED = 0.85      # >= 0.85 and < 0.90 = gray zone

# Debug mode (prints more detail)
DEBUG_MODE = False

# FORCE_REPROCESS mode: reprocess all of this week's articles, even those already processed
FORCE_REPROCESS = True  # Set to False in production

print(f"Table: {TABLE_LANDING}")
print(f"Confirmed-duplicate threshold: >= {THRESHOLD_DUPLICATE}")
print(f"Gray-zone threshold: >= {THRESHOLD_SUSPECTED} and < {THRESHOLD_DUPLICATE}")
print(f"Force reprocess: {FORCE_REPROCESS}")

## 2. Imports and Setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, lit, concat_ws, coalesce, when,
    max as spark_max, first, row_number
)
from pyspark.sql.window import Window
from pyspark.sql.types import DoubleType, BooleanType, StringType
from datetime import datetime
from delta.tables import DeltaTable

import synapse.ml.spark.aifunc as aifunc

spark = SparkSession.builder.getOrCreate()

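# Current week in ISO 8601 format (e.g. "2024-W49").
# %G is the ISO year that pairs with the ISO week number %V; %Y can mislabel days around New Year.
# The format must match whatever the ingestion notebook writes to ingestion_week.
current_week = datetime.now().strftime("%G-W%V")
print("Spark session ready")
print("AI Functions loaded")
print(f"Current week: {current_week}")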
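
The week label combines a year with an ISO week number, and the two can disagree around New Year: 30 December 2024 already belongs to ISO week 1 of 2025. The sketch below, built around a hypothetical helper `iso_week_label`, derives the label from `date.isocalendar()` so that the ISO year and the ISO week always agree.

```python
from datetime import date


def iso_week_label(day: date) -> str:
    """Return an ISO 8601 week label such as '2024-W49' (hypothetical helper)."""
    # isocalendar() yields (ISO year, ISO week number, ISO weekday)
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


print(iso_week_label(date(2024, 12, 4)))   # 2024-W49
print(iso_week_label(date(2024, 12, 30)))  # 2025-W01, although the calendar year is still 2024
```

Whichever convention is chosen, it has to be the same one used to populate `ingestion_week`, otherwise the weekly filter returns no rows.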

## 3. Loading the data

In [None]:
# Columns needed for the comparison
COLUMNS = "id, article_title, venue_name, city, country, ingestion_week"

if FORCE_REPROCESS:
    # Force mode: reprocess ALL of this week's articles;
    # the history is then limited to previous weeks.
    new_filter = f"ingestion_week = '{current_week}'"
    hist_filter = f"ingestion_week != '{current_week}'"
    print("⚠️ FORCE_REPROCESS mode enabled - all of this week's articles will be reprocessed")
else:
    # Normal mode: only articles not yet processed;
    # the history = previous weeks + this week's already-processed articles.
    new_filter = f"ingestion_week = '{current_week}' AND is_duplicate IS NULL"
    hist_filter = (
        f"ingestion_week != '{current_week}' "
        f"OR (ingestion_week = '{current_week}' AND is_duplicate IS NOT NULL)"
    )

# Articles to process and reference articles
new_articles = spark.sql(f"SELECT {COLUMNS} FROM {TABLE_LANDING} WHERE {new_filter}")
historical = spark.sql(f"SELECT {COLUMNS} FROM {TABLE_LANDING} WHERE {hist_filter}")

new_count = new_articles.count()
hist_count = historical.count()

print(f"📥 Articles to process (week {current_week}): {new_count}")
print(f"📚 Historical articles (reference): {hist_count}")

if new_count == 0:
    print("\n✅ No new articles to process this week.")

## 4. Preparing the text to compare

In [None]:
if new_count > 0:
    # Text used for the comparison, in the form
    # "article_title | venue_name | city | country" (NULLs become empty strings)
    text_expr = concat_ws(
        " | ",
        coalesce(col("article_title"), lit("")),
        coalesce(col("venue_name"), lit("")),
        coalesce(col("city"), lit("")),
        coalesce(col("country"), lit(""))
    )

    # Build the text to compare for the new articles
    new_articles_with_text = new_articles.withColumn(
        "text_to_compare", text_expr
    ).select("id", "text_to_compare", "article_title")

    # Build the text to compare for the history (hist_ prefix avoids name clashes in the cross join)
    historical_with_text = historical.withColumn(
        "text_to_compare", text_expr
    ).select(
        col("id").alias("hist_id"),
        col("text_to_compare").alias("hist_text"),
        col("article_title").alias("hist_title")
    )

    if DEBUG_MODE:
        print("\n🔍 Sample texts to compare (new articles):")
        new_articles_with_text.show(5, truncate=60)

    print("✅ Texts prepared for comparison")
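
The comparison text can be reproduced outside Spark to see what is actually sent for scoring. This sketch assumes a hypothetical helper `build_text`; it mirrors the `coalesce` plus `concat_ws` combination, so a missing field leaves an empty slot instead of shifting the remaining fields to the left.

```python
from typing import Optional


def build_text(
    article_title: Optional[str],
    venue_name: Optional[str],
    city: Optional[str],
    country: Optional[str],
) -> str:
    """Join the four fields with ' | ', replacing None with '' (hypothetical helper)."""
    fields = (article_title, venue_name, city, country)
    return " | ".join(value if value is not None else "" for value in fields)


print(build_text("New arena announced", "City Arena", "Lyon", "France"))
print(build_text("New arena announced", None, "Lyon", "France"))
```

The second call keeps four slots even though the venue is missing, which keeps the city and country in the same position for every article.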

## 5. Computing similarity with Fabric AI Functions

Uses `ai.similarity()` to compute semantic similarity:
1. **Intra-week**: compares the new articles with one another (same week)
2. **History**: compares the new articles with the history (previous weeks)
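
Both steps rely on cross joins, so the number of scored pairs grows quickly with the size of the week and of the history. A rough sketch of the counts, using a hypothetical helper `comparison_counts`; the figures passed to it are made up for illustration.

```python
from itertools import combinations


def comparison_counts(new_count: int, hist_count: int) -> dict:
    """Count the pairs produced by each step (hypothetical helper)."""
    intra_pairs = new_count * (new_count - 1) // 2  # unordered pairs, A vs B only
    history_pairs = new_count * hist_count          # every new article vs every historical one
    return {
        "intra_pairs": intra_pairs,
        "history_pairs": history_pairs,
        "total_pairs": intra_pairs + history_pairs,
    }


# The closed-form count matches an explicit enumeration of unordered pairs
ids = ["a", "b", "c", "d"]
assert len(list(combinations(ids, 2))) == comparison_counts(len(ids), 0)["intra_pairs"]

print(comparison_counts(40, 5000))
# {'intra_pairs': 780, 'history_pairs': 200000, 'total_pairs': 200780}
```

The history term dominates as the table grows. The notebook also scores each intra-week pair in both directions, so the number of similarity calls for that step is twice `intra_pairs`.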

In [None]:
if new_count > 0:
    # =========================================================================
    # STEP 1: INTRA-WEEK comparison (new articles against each other)
    # =========================================================================
    # Important: two articles about the same event can arrive in the same week

    print("📊 Step 1: Intra-week comparison (new articles against each other)...")
    print(f"   Number of articles to compare with each other: {new_count}")

    # Add a numeric index for the comparison (avoids issues with string IDs)
    from pyspark.sql.functions import monotonically_increasing_id

    new_with_idx = new_articles_with_text.withColumn("idx", monotonically_increasing_id())

    # Debug: show the articles
    print("\n🔍 Debug - Articles to process:")
    new_with_idx.select("idx", "id", "article_title", "text_to_compare").show(5, truncate=50)

    # Self-join to compare each article with all the others
    new_left = new_with_idx.select(
        col("idx").alias("left_idx"),
        col("id").alias("left_id"),
        col("text_to_compare").alias("left_text"),
        col("article_title").alias("left_title")
    )
    
    new_right = new_with_idx.select(
        col("idx").alias("right_idx"),
        col("id").alias("right_id"),
        col("text_to_compare").alias("right_text"),
        col("article_title").alias("right_title")
    )
    
    # Cross join, filtered to skip self-comparisons and mirrored pairs (A vs B, not B vs A)
    intra_week_df = new_left.crossJoin(new_right).filter(
        col("left_idx") < col("right_idx")  # Reliable numeric comparison
    )

    intra_comparisons = intra_week_df.count()
    expected_comparisons = (new_count * (new_count - 1)) // 2
    print(f"   Intra-week comparisons: {intra_comparisons} (expected: {expected_comparisons})")

    if intra_comparisons > 0:
        # Debug: show a few pairs to compare
        print("\n🔍 Debug - Sample pairs to compare:")
        intra_week_df.select("left_id", "left_title", "right_id", "right_title").show(5, truncate=40)

        # Compute the intra-week similarity with df.ai.similarity()
        print("\n   Computing intra-week similarities...")
        intra_similarity = intra_week_df.select(
            col("left_id").alias("new_id"),
            col("left_text").alias("new_text"),
            col("left_title").alias("new_title"),
            col("right_id").alias("hist_id"),
            col("right_text").alias("hist_text"),
            col("right_title").alias("hist_title")
        ).ai.similarity(
            input_col="new_text",
            other_col="hist_text",
            output_col="similarity_score"
        )
        
        # Also score the reverse direction (B vs A) so that every article gets its best match
        intra_similarity_reverse = intra_week_df.select(
            col("right_id").alias("new_id"),
            col("right_text").alias("new_text"),
            col("right_title").alias("new_title"),
            col("left_id").alias("hist_id"),
            col("left_text").alias("hist_text"),
            col("left_title").alias("hist_title")
        ).ai.similarity(
            input_col="new_text",
            other_col="hist_text",
            output_col="similarity_score"
        )
        
        intra_similarity_all = intra_similarity.union(intra_similarity_reverse)
        intra_count = intra_similarity_all.count()
        print(f"   ✅ Intra-week similarities computed: {intra_count} comparisons")
    else:
        intra_similarity_all = None
        print("   ⚠️ Not enough articles for the intra-week comparison")

    # =========================================================================
    # STEP 2: Comparison with the HISTORY (previous weeks)
    # =========================================================================

    if hist_count > 0:
        print(f"\n📊 Step 2: Comparison with the history ({hist_count} articles)...")

        cross_df = new_articles_with_text.crossJoin(historical_with_text)
        hist_comparisons = new_count * hist_count
        print(f"   History comparisons: {hist_comparisons}")

        # Compute the similarity against the history
        hist_similarity = cross_df.select(
            col("id").alias("new_id"),
            col("text_to_compare").alias("new_text"),
            col("article_title").alias("new_title"),
            col("hist_id"),
            col("hist_text"),
            col("hist_title")
        ).ai.similarity(
            input_col="new_text",
            other_col="hist_text",
            output_col="similarity_score"
        )
        print("   ✅ History similarities computed")
    else:
        hist_similarity = None
        print("\n📚 No history available - only the intra-week comparison will be used")

    # =========================================================================
    # STEP 3: Combine the results
    # =========================================================================

    print("\n📊 Step 3: Merging the results...")

    # Combine intra-week and history
    if intra_similarity_all is not None and hist_similarity is not None:
        all_similarities = intra_similarity_all.union(hist_similarity)
        print("   Sources: intra-week + history")
    elif intra_similarity_all is not None:
        all_similarities = intra_similarity_all
        print("   Sources: intra-week only")
    elif hist_similarity is not None:
        all_similarities = hist_similarity
        print("   Sources: history only")
    else:
        all_similarities = None
        print("   ⚠️ No comparison source")

    # =========================================================================
    # STEP 4: Classification of ALL articles
    # =========================================================================

    print("\n📊 Step 4: Classifying the articles...")

    # Build a base DataFrame with all the new articles
    all_new_articles = new_articles_with_text.select(
        col("id").alias("new_id"),
        col("article_title").alias("new_title")
    )
    
    if all_similarities is not None:
        total_similarities = all_similarities.count()
        print(f"   Total similarities computed: {total_similarities}")

        # Debug: show the highest similarities
        print("\n🔍 Debug - Top 10 highest similarities:")
        all_similarities.select(
            "new_id", "new_title", "hist_title", "similarity_score"
        ).orderBy(col("similarity_score").desc()).show(10, truncate=50)

        # Window per new_id, ordered by descending score
        window_spec = Window.partitionBy("new_id").orderBy(col("similarity_score").desc())

        # Add the rank and keep only the best match
        best_matches = all_similarities.withColumn(
            "rank", row_number().over(window_spec)
        ).filter(col("rank") == 1).drop("rank").select(
            col("new_id"),
            col("similarity_score").alias("max_similarity"),
            col("hist_id").alias("best_match_id"),
            col("hist_title").alias("best_match_title")
        )
        
        # LEFT JOIN to include ALL articles, even those without a match
        classified = all_new_articles.join(
            best_matches,
            on="new_id",
            how="left"
        ).withColumn(
            # If no match was found, use 0.0
            "max_similarity", coalesce(col("max_similarity"), lit(0.0))
        ).withColumn(
            "is_duplicate",
            when(col("max_similarity") >= THRESHOLD_DUPLICATE, True)
            .otherwise(False)
        ).withColumn(
            "is_suspected_duplicate",
            when(
                (col("max_similarity") >= THRESHOLD_SUSPECTED) & 
                (col("max_similarity") < THRESHOLD_DUPLICATE),
                True
            ).otherwise(False)
        ).withColumn(
            "duplicate_of",
            when(col("max_similarity") >= THRESHOLD_SUSPECTED, col("best_match_id"))
            .otherwise(None)
        )
        
    else:
        # No comparison possible = all unique
        print("   ⚠️ No comparison possible - all articles marked as unique")
        classified = all_new_articles.withColumn(
            "max_similarity", lit(0.0)
        ).withColumn(
            "is_duplicate", lit(False)
        ).withColumn(
            "is_suspected_duplicate", lit(False)
        ).withColumn(
            "duplicate_of", lit(None).cast(StringType())
        ).withColumn(
            "best_match_id", lit(None).cast(StringType())
        ).withColumn(
            "best_match_title", lit(None).cast(StringType())
        )
    
    # Stats
    # Cache first: otherwise each action below re-runs the whole plan, similarity calls included
    classified = classified.cache()
    duplicates_count = classified.filter(col("is_duplicate") == True).count()
    suspected_count = classified.filter(col("is_suspected_duplicate") == True).count()
    unique_count = classified.filter(
        (col("is_duplicate") == False) & (col("is_suspected_duplicate") == False)
    ).count()
    total_classified = classified.count()

    print("\n" + "="*60)
    print("📈 DEDUPLICATION RESULTS")
    print("="*60)
    print(f"   Total articles classified: {total_classified} / {new_count}")
    print(f"   ✅ Confirmed duplicates (>= {THRESHOLD_DUPLICATE}): {duplicates_count}")
    print(f"   ⚠️ Gray zone ({THRESHOLD_SUSPECTED} - {THRESHOLD_DUPLICATE}): {suspected_count}")
    print(f"   🆕 Unique (< {THRESHOLD_SUSPECTED}): {unique_count}")
    print("="*60)

    # Always show the details of articles with potential duplicates
    if duplicates_count > 0 or suspected_count > 0:
        print("\n🔍 Details of articles with potential duplicates:")
        classified.filter(
            col("max_similarity") >= THRESHOLD_SUSPECTED
        ).select(
            "new_id", "new_title", "max_similarity", 
            "is_duplicate", "is_suspected_duplicate", "best_match_title"
        ).orderBy(col("max_similarity").desc()).show(20, truncate=40)

    # Debug: show all results
    if DEBUG_MODE:
        print("\n🔍 Debug - All classified articles:")
        classified.select(
            "new_id", "new_title", "max_similarity", 
            "is_duplicate", "is_suspected_duplicate"
        ).orderBy(col("max_similarity").desc()).show(50, truncate=40)

## 6. Updating the Delta table

In [None]:
if new_count > 0:
    # Prepare the DataFrame for the merge
    updates_df = classified.select(
        col("new_id").alias("id"),
        col("is_duplicate"),
        col("is_suspected_duplicate"),
        col("duplicate_of"),
        col("max_similarity").alias("duplicate_score")
    )
    
    # Merge into the Delta table
    delta_table = DeltaTable.forName(spark, TABLE_LANDING)
    
    delta_table.alias("target").merge(
        updates_df.alias("source"),
        "target.id = source.id"
    ).whenMatchedUpdate(
        set={
            "is_duplicate": "source.is_duplicate",
            "is_suspected_duplicate": "source.is_suspected_duplicate",
            "duplicate_of": "source.duplicate_of",
            "duplicate_score": "source.duplicate_score"
        }
    ).execute()
    
    print(f"\n Table '{TABLE_LANDING}' updated with the deduplication results")
    print(f"   - {duplicates_count + suspected_count + unique_count} articles processed")

## 7. Final statistics

In [None]:
# Stats for the current week
print(f"\n Deduplication statistics - Week {current_week}:\n")

stats = spark.sql(f"""
    SELECT
        COUNT(*) as total_articles,
        SUM(CASE WHEN is_duplicate = true THEN 1 ELSE 0 END) as doublons_confirmes,
        SUM(CASE WHEN is_suspected_duplicate = true THEN 1 ELSE 0 END) as zone_grise,
        SUM(CASE WHEN is_duplicate = false AND is_suspected_duplicate = false THEN 1 ELSE 0 END) as uniques,
        SUM(CASE WHEN is_duplicate IS NULL THEN 1 ELSE 0 END) as non_traites
    FROM {TABLE_LANDING}
    WHERE ingestion_week = '{current_week}'
""")

stats.show()

---

## Summary

This notebook performs semantic deduplication of Feedly articles:

1. **Loads** the current week's articles (`ingestion_week`)
2. **Compares intra-week**: each new article with the others from the same week
3. **Compares with the history**: each article with previous weeks via `ai.similarity()`
4. **Classifies** by threshold:
   - `>= 0.90` → `is_duplicate = true`
   - `>= 0.85` and `< 0.90` → `is_suspected_duplicate = true`
   - `< 0.85` → unique
5. **Updates** the Delta table with the results

### Why the intra-week comparison?

If two articles about the **same event** arrive in the **same week** (e.g. the same stadium covered by two different newspapers), they must be detected as duplicates of each other, not only against the history.
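
To illustrate, the sketch below picks the best match per article from a small set of already-scored pairs. The scores are made up and `best_match_per_article` is a hypothetical helper, not part of the notebook. Because each pair is considered in both directions, two articles that match each other each end up pointing at the other; as far as the code shown goes, the notebook does not decide which of the two to keep.

```python
from typing import Dict, Iterable, Tuple


def best_match_per_article(
    scored_pairs: Iterable[Tuple[str, str, float]]
) -> Dict[str, Tuple[str, float]]:
    """For each article ID, keep the partner with the highest score (hypothetical helper)."""
    best: Dict[str, Tuple[str, float]] = {}
    for left_id, right_id, score in scored_pairs:
        # Consider the pair in both directions, as the notebook does
        for article_id, other_id in ((left_id, right_id), (right_id, left_id)):
            if article_id not in best or score > best[article_id][1]:
                best[article_id] = (other_id, score)
    return best


# Made-up scores: A and B cover the same event, C is unrelated
pairs = [("A", "B", 0.93), ("A", "C", 0.40), ("B", "C", 0.38)]
for article_id, (other_id, score) in sorted(best_match_per_article(pairs).items()):
    print(article_id, "->", other_id, score)
```

Here `A` and `B` are each other's best match at `0.93`, so both would cross the confirmed-duplicate threshold, while `C` stays well below `0.85`.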

### Updated columns

| Column | Type | Description |
|--------|------|-------------|
| `is_duplicate` | Boolean | `true` if confirmed duplicate |
| `is_suspected_duplicate` | Boolean | `true` if in the gray zone |
| `duplicate_of` | String | ID of the best-matching article when the score is `>= 0.85` (may come from the same week or from the history) |
| `duplicate_score` | Double | Similarity score of the best match (0.0 to 1.0; 0.0 when there was nothing to compare against) |