# RAKUTEN MULTI-CLASS CLASSIFICATION
## TF-IDF, Transformers, Calibration and Probabilistic Fusion

# I. Introduction

To predict a product type code (that is, a category) from the title and description entered by the seller, we set out to build a model that is both accurate and robust.

The challenge is that these texts are often short, 37% of product descriptions are missing, the text is noisy (typos, abbreviations, reference numbers), and some categories are inherently similar to one another.

We therefore combined two complementary families of models:
* a lexical/statistical approach that is very effective on keywords and reference numbers,
* a semantic approach that captures context better, thanks to Transformers.

#### 1) Lexical baseline: TF-IDF + linear classifier

The first family converts the text into TF-IDF vectors (words and word fragments weighted by their importance), then trains a linear classifier on them (in the TF-IDF cell of this notebook, a linear SVM with calibrated probabilities).

This captures keywords, brands, reference numbers, and formats, and serves as a stable foundation. It also provides a signal that differs from the one the Transformers produce.
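
To see why word fragments help on noisy seller text, the sketch below compares a correctly spelled word with a misspelled one. It is a minimal, dependency-free illustration: `char_ngrams`, `jaccard`, and the example strings are hypothetical and not part of the notebook, and scikit-learn's `TfidfVectorizer` differs in its tokenization and weighting details.

```python
def char_ngrams(text: str, n_min: int = 3, n_max: int = 5) -> set:
    """Return the set of character n-grams of length n_min..n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

clean = "playstation"
typo = "playstaton"   # hypothetical seller typo (missing "i")

word_overlap = jaccard({clean}, {typo})                         # whole-word features
char_overlap = jaccard(char_ngrams(clean), char_ngrams(typo))   # sub-word features

print(f"word overlap: {word_overlap:.2f}, char 3-5 gram overlap: {char_overlap:.2f}")
```

The two spellings share no whole-word feature but half of their character 3-5 grams, which is the kind of signal a word-only vocabulary would miss.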

#### 2) Language models: Transformers (CamemBERT, FlauBERT, XLM-RoBERTa, etc.)

The second family uses pre-trained Transformer models. After fine-tuning, they are better at capturing:

- the overall meaning of a sentence,

- implicit or varied phrasings,

- relationships between words (context) that TF-IDF does not capture directly.

Why we use several:
Each Transformer has its own strengths (language coverage, style, robustness, size) and weaknesses. Combining them often improves performance and reduces variance.

#### 3) Calibration: temperature scaling

Transformers can be poorly calibrated: overconfident (extreme probabilities) or sometimes underconfident.
Before combining models, we apply temperature scaling to adjust the confidence level of their probabilities, which improves the quality of the blend. Temperature scaling divides the logits by a single scalar fitted on held-out data, so it changes how confident a model is without changing which class it predicts.
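
As an illustration, the sketch below fits a temperature on a tiny held-out set by minimising the negative log-likelihood. It is a simplified stand-in, not the notebook's implementation: it uses a plain grid search in pure Python, whereas the `TEMP_MAX_ITERS` and `TEMP_LR` settings suggest an iterative optimiser. The logits, labels, and function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y]) for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature that minimises NLL on held-out data (grid search)."""
    grid = grid or [0.5 + 0.1 * k for k in range(96)]   # 0.5 .. 10.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Hypothetical over-confident logits (3 classes): the fourth item is confidently wrong.
blend_logits = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 7.0],
                [7.0, 1.0, 0.0], [6.0, 0.0, 1.0], [0.0, 8.0, 1.0]]
blend_labels = [0, 1, 2, 1, 0, 1]

T = fit_temperature(blend_logits, blend_labels)
print("fitted temperature:", round(T, 2))
print("before:", [round(p, 3) for p in softmax(blend_logits[3])])
print("after: ", [round(p, 3) for p in softmax(blend_logits[3], T)])
```

A fitted temperature above 1 softens the probabilities, which is the expected outcome when a model is confidently wrong on part of the held-out data.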

#### 4) Transformer ensemble: averaging probabilities

We then combine the Transformers through a weighted average of their probabilities.
Because the models do not make the same errors, the ensemble is more robust than any single model.
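
A weighted average of probabilities can be sketched in a few lines. The probabilities, the weights, and the `weighted_average` helper below are hypothetical and only illustrate the arithmetic; the notebook works on full probability matrices rather than single rows.

```python
def weighted_average(prob_rows, weights):
    """Weighted average of per-model probability rows; weights are renormalised to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    n_classes = len(prob_rows[0])
    return [sum(w * p[c] for w, p in zip(norm, prob_rows)) for c in range(n_classes)]

# Hypothetical calibrated probabilities from three models for one product (3 classes).
p_model_a = [0.60, 0.30, 0.10]
p_model_b = [0.20, 0.70, 0.10]
p_model_c = [0.50, 0.40, 0.10]

p_ens = weighted_average([p_model_a, p_model_b, p_model_c], weights=[2.0, 1.0, 1.0])
print([round(p, 3) for p in p_ens])
```

Because the weights are renormalised, the result is still a probability distribution, so it can be blended again in the next step.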


#### 5) Final fusion TF-IDF ↔ Transformers

Finally, we fuse the TF-IDF baseline with the Transformer ensemble using a weight w learned on a dedicated subset called blend (an intermediate set kept separate from the final validation set).

- If w is high, we trust the Transformers more.

- If w is low, TF-IDF carries more weight.

This mechanism yields a final classifier that is stable (less sensitive to variations), accurate (better generalization), and better balanced between lexical and semantic signals.
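
One way to learn w is a grid search that maximises weighted-F1 on the blend set. The sketch below assumes the fusion is the convex combination w · P_transformers + (1 − w) · P_tfidf, which matches the two bullets above but is not spelled out in the text; the probabilities and helper names are hypothetical, and the grid mirrors the 21 points of `W_GRID`.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    total = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * y_true.count(c) / len(y_true)
    return total

def blend(p_transformers, p_tfidf, w):
    """Row-wise convex combination: w * transformers + (1 - w) * tfidf."""
    return [[w * a + (1 - w) * b for a, b in zip(row_t, row_l)]
            for row_t, row_l in zip(p_transformers, p_tfidf)]

def predict(probs):
    """Index of the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

def fit_blend_weight(p_transformers, p_tfidf, y_true, grid):
    """Return the grid value of w with the best weighted-F1 (first one on ties)."""
    return max(grid, key=lambda w: weighted_f1(y_true, predict(blend(p_transformers, p_tfidf, w))))

# Hypothetical probabilities on a 4-item blend set with 2 classes.
y_blend = [0, 0, 1, 1]
p_tr = [[0.90, 0.10], [0.80, 0.20], [0.30, 0.70], [0.60, 0.40]]    # wrong on the last item
p_lex = [[0.45, 0.55], [0.60, 0.40], [0.40, 0.60], [0.10, 0.90]]   # wrong on the first item

grid = [k / 20 for k in range(21)]   # same 21 points as np.linspace(0, 1, 21)
w_best = fit_blend_weight(p_tr, p_lex, y_blend, grid)
print("best w:", w_best)
```

In this toy case each model alone misclassifies one item, while an intermediate weight classifies all four correctly. Fitting w on the blend set rather than on the validation set keeps the final evaluation untouched.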

#### 6) Tracking metric

Throughout the notebook, we primarily track the weighted F1-score (weighted-F1), which is more relevant than accuracy because, as shown in the exploratory notebook, the classes are imbalanced.
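
The difference between the metrics is easiest to see on a small worked example. The labels below are hypothetical, and `f1_per_class` is a hand-rolled helper written for this illustration; the notebook itself relies on scikit-learn's `f1_score`.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: (f1, support)} for every class present in y_true."""
    out = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        out[c] = (f1, y_true.count(c))
    return out

# Hypothetical imbalanced sample: 9 items of class 0, 1 item of class 1.
y_true = [0] * 9 + [1]
y_pred = [0] * 10          # a lazy model that always predicts the majority class

scores = f1_per_class(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_weighted = sum(f1 * n for f1, n in scores.values()) / len(y_true)
f1_macro = sum(f1 for f1, _ in scores.values()) / len(scores)

print(f"accuracy={accuracy:.3f} weighted-F1={f1_weighted:.3f} macro-F1={f1_macro:.3f}")
# accuracy=0.900 weighted-F1=0.853 macro-F1=0.474
```

Accuracy rewards the lazy model most, weighted-F1 penalises it somewhat, and macro-F1 (also reported by the notebook) penalises it heavily because every class counts equally.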

#### 7) Pipeline architecture



```
┌──────────────────────────────┐
│ Raw data (title + desc)      │
└───────────────┬──────────────┘
                v
┌──────────────────────────────────────────────┐
│ Text preparation                             │
│ - concatenation TITLE/DESC                   │
│ - minimal cleaning                           │
│ - text_hash (anti-leak / GroupSplit)         │
└───────────────┬──────────────────────────────┘
                v
┌──────────────────────────────────────────────┐
│ Splits                                       │
│ - train_base : trains the models             │
│ - blend     : learns calibration & weights   │
│ - val       : final evaluation (never fit)   │
└───────┬───────────────────────────────┬──────┘
        │                               │
        v                               v
┌──────────────────────┐       ┌───────────────────────────┐
│ TF-IDF + linear      │       │ Transformers (fine-tuning)│
│ → proba (blend/val)  │       │ → logits/proba (blend/val)│
└──────────┬───────────┘       └─────────────┬─────────────┘
           │                                  v
           │                     ┌───────────────────────────┐
           │                     │ Calibration (TempScaling) │
           │                     └─────────────┬─────────────┘
           │                                   v
           │                     ┌───────────────────────────┐
           │                     │ Transformer ensemble      │
           │                     │ (average / weights)       │
           │                     └─────────────┬─────────────┘
           └───────────────────────┬───────────┘
                                   v
                ┌──────────────────────────────────────────┐
                │ Final blending TF-IDF ↔ Transformers     │
                │ - weights learned on BLEND               │
                │ - option: per-super-category weights     │
                │ - objective: maximize weighted-F1        │
                └──────────────────────────┬───────────────┘
                                           v
                ┌───────────────────────────────────────────┐
                │ Evaluation & exports                      │
                │ - weighted-F1, reports, confusion matrix  │
                │ - save predictions                        │
                └───────────────────────────────────────────┘
```



# II. Installation, Imports and General Configuration

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# a) Installations
!pip -q install -U "transformers>=4.40" datasets accelerate evaluate scikit-learn
!pip -q install -U langdetect
!pip -q install -U sacremoses
!pip -q install -U fasttext

In [None]:

# b) Imports

import os, re, json, math, hashlib, random

from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple, Optional

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F

from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException



In [4]:
# c) General configuration
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

TRAIN_PATH = "/content/drive/MyDrive/X_train_update.csv"  # raw text
Y_PATH     = "/content/drive/MyDrive/Y_train_CVw08PX.csv"

LABEL_COL = "prdtypecode"
TEXT_COLS = ("designation", "description")
TEXT_SEP = " [SEP] "
TEXT_MODEL_COL = "text_model"   # column used as model input
TEXT_HASH_COL  = "text_hash"    # hash used against leakage (GroupSplit + cache)
HASH_NORMALIZE = True           # normalizes ONLY for the hash (does not affect the model text)
TFIDF_LOWERCASE = False         # keeps the original case (raw input) on the TF-IDF side
VAL_SIZE = 0.15
BLEND_SIZE_IN_TRAIN = 0.15

DEFAULT_USE_GROUP_SPLIT_FOR_VAL = False  # False = stratified split, which better preserves the class distribution

DEFAULT_FILTER_LANG_FR = False
LANG_CACHE_PATH = "./cache/lang_by_text_hash.parquet"


OUT_DIR = Path("./runs_outputs")  # output directory
OUT_DIR.mkdir(exist_ok=True, parents=True)

# TF-IDF
WORD_NGRAM = (1, 2)
CHAR_NGRAM = (3, 5)
TFIDF_MIN_DF = 2
TFIDF_MAX_DF = 0.95
TFIDF_MAX_FEATURES = 250_000
TFIDF_C = 4.0
SVM_C = 2.5
CAL_METHOD = "sigmoid"  # probability calibration method

# Transformer
ENABLE_GRAD_CHECKPOINTING = False  # keep disabled if it causes instabilities (autograd/checkpoint)
GRAD_CKPT_USE_REENTRANT = False  # recommended for PyTorch>=2

MAX_LENGTH = 384
EPOCHS = 6
PATIENCE = 2
LR = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.06
BATCH_SIZE = 16
GRAD_ACCUM = 1
DROPOUT = 0.15
MLP_DIM = 512
POOLING = "mean"  # averages the token embeddings to obtain a sentence vector

# Calibration
TEMP_MAX_ITERS = 200
TEMP_LR = 0.05

# Blending
W_GRID = np.linspace(0, 1, 21)
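
To make these settings more concrete, the sketch below works out what they imply for split sizes and training length. It is a rough estimate under stated assumptions: the row count is the one printed by the data-loading cell, the exact split sizes may differ by a row because of rounding, and the step counts ignore early stopping and any per-model batch-size fallback.

```python
import math

# Values copied from the configuration cell.
VAL_SIZE = 0.15
BLEND_SIZE_IN_TRAIN = 0.15
BATCH_SIZE, GRAD_ACCUM = 16, 1
EPOCHS, WARMUP_RATIO = 6, 0.06

# Share of the full dataset that ends up in each split.
frac_val = VAL_SIZE
frac_blend = (1 - VAL_SIZE) * BLEND_SIZE_IN_TRAIN
frac_train_base = (1 - VAL_SIZE) * (1 - BLEND_SIZE_IN_TRAIN)

n_rows = 84_916                                   # from the "raw df" output of the loading cell
n_train_base = round(n_rows * frac_train_base)    # approximate row count

effective_batch = BATCH_SIZE * GRAD_ACCUM
steps_per_epoch = math.ceil(n_train_base / effective_batch)
total_steps = steps_per_epoch * EPOCHS            # upper bound: early stopping can end sooner
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(f"train_base={frac_train_base:.4f} blend={frac_blend:.4f} val={frac_val:.4f}")
print(f"~{n_train_base} rows, {steps_per_epoch} steps/epoch, "
      f"{total_steps} steps max, {warmup_steps} warmup steps")
```

The blend set is carved out of the training portion, so it represents 12.75% of the full dataset rather than 15%.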


In [5]:
# d) Category definitions
categories = {
    "Livres & Revues": {
        "Livres spécialisés": 10,
        "Littérature": 2705,
        "Presse & Magazines": 2280,
        "Séries & Encyclopédies": 2403,
    },
    "Jeux Vidéo": {
        "Rétro Gaming": 40,
        "Accessoires": 50,
        "Consoles": 60,
        "Jeux Vidéo Modernes": 2462,
        "Jeux PC": 2905,
    },
    "Collection": {
        "Figurines": 1140,
        "Jeux de cartes": 1160,
        "Jeux de rôle & Figurines": 1180,
    },
    "Jouets & Loisirs": {
        "Jouets & Figurines": 1280,
        "Jeux éducatifs": 1281,
        "Modélisme & Drones": 1300,
        "Loisirs & Plein air": 1302,
    },
    "Bébé": {
        "Vêtement Bébé": 1301,
        "Puériculture": 2584,
    },
    "Maison": {
        "Équipement Maison": 1560,
        "Textiles d'intérieur": 1920,
        "Décoration & Lumières": 2060,
    },
    "Jardin & Extérieur": {
        "Équipement Jardin": 2582,
        "Bricolage": 2585,
    },
    "Autres": {
        "Épicerie": 1940,
        "Animaux": 2220,
        "Bureau & Papeterie": 1320,
        "Hygiène & Beauté": 2522,
    },
}

label_to_super: Dict[int,str] = {}
for sup, d in categories.items():
    for _, lab in d.items():
        label_to_super[int(lab)] = sup

SUPERS = sorted(set(label_to_super.values()))
print("Supers:", SUPERS)


Supers: ['Autres', 'Bébé', 'Collection', 'Jardin & Extérieur', 'Jeux Vidéo', 'Jouets & Loisirs', 'Livres & Revues', 'Maison']


In [6]:
# =========================
# 3) Load data + text
# =========================
def load_data(train_path: str, y_path: Optional[str], label_col: str) -> pd.DataFrame:
    p = Path(train_path)
    if not p.exists():
        raise FileNotFoundError(f"TRAIN_PATH not found: {train_path}")

    if p.suffix.lower() in (".parquet", ".pq"):
        df = pd.read_parquet(p)
    else:
        df = pd.read_csv(p)

    # Merge the labels if they are missing
    if label_col not in df.columns:
        if y_path is None:
            raise ValueError(f"{label_col} is missing and Y_PATH=None")
        y = pd.read_csv(y_path)
        if len(y) == len(df):
            # Same length: assumes both files are in the same row order
            df[label_col] = y[label_col].values
        else:
            keys = [k for k in ["imageid","productid"] if k in df.columns and k in y.columns]
            if not keys:
                raise ValueError("Cannot merge Y: sizes differ and no join keys (imageid/productid).")
            df = df.merge(y, on=keys, how="inner")
    return df

df = load_data(TRAIN_PATH, Y_PATH, LABEL_COL)
print("raw df:", df.shape)


raw df: (84916, 6)


In [7]:
# =========================
# 3b) Build the text (RAW) + anti-duplicate hash
# =========================
# Goal: model the raw text (no lowercasing/regex before tokenization),
# while keeping a *normalized* hash in order to:
# - limit leakage (GroupSplit on exact / near-exact duplicates)
# - cache language detection

def safe_str(x) -> str:
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return ""
    return str(x)

_ws = re.compile(r"\s+")

def normalize_for_hash(s: str) -> str:
    # Does NOT affect the model input: only used for the hash (stable duplicate detection)
    s = safe_str(s)
    s = _ws.sub(" ", s).strip()
    if HASH_NORMALIZE:
        s = s.lower()
    return s

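# Fallback columns used when the raw ones are absent.
# Assumption: the cleaned columns are named "<raw column>_cleaned"; adjust to your data.
TEXT_COLS_FALLBACK = tuple(f"{c}_cleaned" for c in TEXT_COLS)

def build_text(df: pd.DataFrame) -> pd.Series:
    # Prefer the raw columns; otherwise fall back to the *_cleaned ones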
    cols = [c for c in TEXT_COLS if c in df.columns]
    if not cols:
        cols = [c for c in TEXT_COLS_FALLBACK if c in df.columns]
    if not cols:
        raise ValueError(
            f"Aucune colonne texte trouvée. Cherché: {TEXT_COLS} puis {TEXT_COLS_FALLBACK}. "
            f"Colonnes disponibles: {list(df.columns)[:30]}..."
        )
    parts = [df[c].map(safe_str) for c in cols]
    txt = parts[0]
    for p in parts[1:]:
        txt = txt + TEXT_SEP + p
    return txt

# ---- Text used by the models (RAW) ----
df[TEXT_MODEL_COL] = build_text(df)

# Minimal filter (non-empty text)
df = df.loc[df[TEXT_MODEL_COL].str.strip().str.len() > 0].reset_index(drop=True)

# ---- Hash (on the normalized version) ----
df["_hash_src"] = df[TEXT_MODEL_COL].map(normalize_for_hash)
df[TEXT_HASH_COL] = df["_hash_src"].map(lambda s: hashlib.md5(s.encode("utf-8")).hexdigest())

# (Optional) alias for compatibility with older code
df["text_clean"] = df[TEXT_MODEL_COL]

print("after non-empty:", df.shape)


after non-empty: (84916, 10)
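
The effect of hashing a normalized copy of the text can be checked on a few strings. The sketch below mirrors `normalize_for_hash` with `HASH_NORMALIZE = True`; the `text_hash` helper and the product titles are hypothetical and only serve the illustration.

```python
import hashlib
import re

_ws = re.compile(r"\s+")

def text_hash(text: str) -> str:
    """MD5 of the whitespace-collapsed, lower-cased text."""
    norm = _ws.sub(" ", text).strip().lower()
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

a = "Console  PS4 Slim [SEP] "     # extra space, mixed case, empty description
b = "console ps4 slim [sep]"       # same listing typed differently
c = "Console PS5 [SEP] "           # a different product

same = text_hash(a) == text_hash(b)
different = text_hash(a) != text_hash(c)
print("a == b:", same, "| a != c:", different)
```

Listings that differ only by case or spacing receive the same hash, so they can be kept on the same side of a split, while the text fed to the models stays untouched.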


In [8]:
# =========================
# 3c) Optional: FR-only filter (langdetect) + cache
# =========================
def detect_lang_safe(text: str) -> str:
    try:
        return detect(text)
    except LangDetectException:
        return "unk"

def add_lang_column(df: pd.DataFrame, cache_path: str) -> pd.DataFrame:
    cache_p = Path(cache_path)
    if cache_p.exists():
        cache = pd.read_parquet(cache_p)
        if {"text_hash","lang"}.issubset(cache.columns):
            df2 = df.merge(cache[["text_hash","lang"]], on="text_hash", how="left")
            miss = df2["lang"].isna().sum()
            print(f"Lang cache loaded. missing={miss}/{len(df2)}")
            if miss == 0:
                return df2
        else:
            print("Incompatible language cache, recomputing…")

    langs = []
    for t in df[TEXT_MODEL_COL].tolist():
        langs.append(detect_lang_safe(t[:5000]))
    df2 = df.copy()
    df2["lang"] = langs
    out = df2[["text_hash","lang"]].drop_duplicates("text_hash")
    cache_p.parent.mkdir(parents=True, exist_ok=True)
    out.to_parquet(cache_p, index=False)
    print("Lang cache saved:", cache_p)
    return df2

def maybe_filter_fr(df: pd.DataFrame, filter_fr: bool) -> pd.DataFrame:
    if not filter_fr:
        return df
    df2 = add_lang_column(df, LANG_CACHE_PATH)
    before = len(df2)
    df2 = df2.loc[df2["lang"]=="fr"].reset_index(drop=True)
    print(f"FR-only: {before} -> {len(df2)}")
    return df2


In [9]:
# =========================
# 4) Split: train_base / blend / val (+ label encoding)
# =========================
label_list = sorted(df[LABEL_COL].unique().astype(int).tolist())
label2id = {lab:i for i,lab in enumerate(label_list)}
id2label = {i:lab for lab,i in label2id.items()}
num_labels = len(label_list)
print("num_labels:", num_labels)

def y_to_ids(y: np.ndarray) -> np.ndarray:
    return np.array([label2id[int(v)] for v in y], dtype=np.int64)

def split_train_val(df_in: pd.DataFrame, use_group_split: bool):
    if use_group_split:
        gss = GroupShuffleSplit(n_splits=1, test_size=VAL_SIZE, random_state=SEED)
        tr_idx, va_idx = next(gss.split(df_in, groups=df_in["text_hash"]))
        train_full = df_in.iloc[tr_idx].reset_index(drop=True)
        test_df = df_in.iloc[va_idx].reset_index(drop=True)
    else:
        train_full, test_df = train_test_split(
            df_in,
            test_size=VAL_SIZE,
            stratify=df_in[LABEL_COL],
            random_state=SEED,
        )
        train_full = train_full.reset_index(drop=True)
        test_df = test_df.reset_index(drop=True)

    train_base_df, blend_df = train_test_split(
        train_full,
        test_size=BLEND_SIZE_IN_TRAIN,
        stratify=train_full[LABEL_COL],
        random_state=SEED,
    )
    train_base_df = train_base_df.reset_index(drop=True)
    blend_df = blend_df.reset_index(drop=True)

    overlap = len(set(train_full["text_hash"]).intersection(set(test_df["text_hash"])))
    print("train_base:", train_base_df.shape, "blend:", blend_df.shape, "val:", test_df.shape)
    print("overlap text_hash train_full/val =", overlap, "ratio_val =", overlap/len(test_df))
    return train_base_df, blend_df, test_df


num_labels: 27
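
The reason for splitting by `text_hash` can be illustrated with a simplified group-aware split. This is a hypothetical pure-Python stand-in written for this note, not scikit-learn's `GroupShuffleSplit` algorithm: it sends whole groups to validation until roughly `val_size` of the rows are covered, then runs the same overlap check that `split_train_val` prints.

```python
import random
from collections import Counter

def group_split(groups, val_size, seed=42):
    """Send whole groups to validation until about val_size of the rows are covered."""
    counts = Counter(groups)
    order = sorted(counts)                  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(order)
    target = val_size * len(groups)
    val_groups, n_val = set(), 0
    for g in order:
        if n_val >= target:
            break
        val_groups.add(g)
        n_val += counts[g]
    train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
    val_idx = [i for i, g in enumerate(groups) if g in val_groups]
    return train_idx, val_idx

# Hypothetical hashes: "h1" and "h3" are duplicated listings.
hashes = ["h1", "h2", "h1", "h3", "h4", "h3", "h5", "h6", "h7", "h8"]
train_idx, val_idx = group_split(hashes, val_size=0.3)

train_hashes = {hashes[i] for i in train_idx}
val_hashes = {hashes[i] for i in val_idx}
overlap = len(train_hashes & val_hashes)
print("val rows:", len(val_idx), "| overlapping hashes:", overlap)
```

With a group-aware split the overlap is zero by construction. With the stratified split used by default, duplicates can land on both sides, which is why the notebook reports the overlap ratio.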


In [10]:

# =========================
# 5) TF-IDF baseline (word+char): calibrated linear SVM
# =========================
def train_tfidf(train_base_df: pd.DataFrame, blend_df: pd.DataFrame, test_df: pd.DataFrame):
    """TF-IDF word+char + LinearSVC + CalibratedClassifierCV.
    Returns: (pipe, P_blend, P_val, metrics_val)
    """
    X_tr = train_base_df[TEXT_MODEL_COL].values
    y_tr = y_to_ids(train_base_df[LABEL_COL].values)

    # Features: word + char n-grams (robuste aux typos, marques, refs)
    word_vec = TfidfVectorizer(
        ngram_range=(1,2),
        max_features=TFIDF_MAX_FEATURES,
        lowercase=TFIDF_LOWERCASE,
        sublinear_tf=True,
        min_df=2,
    )
    char_vec = TfidfVectorizer(
        analyzer="char",
        ngram_range=(3,5),
        max_features=int(TFIDF_MAX_FEATURES * 0.8),
        lowercase=TFIDF_LOWERCASE,
        sublinear_tf=True,
        min_df=2,
    )
    feats = FeatureUnion([("word", word_vec), ("char", char_vec)], n_jobs=-1)

    base = LinearSVC(C=TFIDF_C, class_weight=None)
    clf = CalibratedClassifierCV(base, cv=2, method="sigmoid")

    pipe = Pipeline([("feats", feats), ("clf", clf)])
    pipe.fit(X_tr, y_tr)

    # Predict proba on blend/val (for later fusion)
    P_bl = pipe.predict_proba(blend_df[TEXT_MODEL_COL].values).astype(np.float32)
    P_va = pipe.predict_proba(test_df[TEXT_MODEL_COL].values).astype(np.float32)

    y_va = y_to_ids(test_df[LABEL_COL].values)
    pred_va = P_va.argmax(axis=1)

    acc = float((pred_va == y_va).mean())
    f1w = float(f1_score(y_va, pred_va, average="weighted"))
    f1m = float(f1_score(y_va, pred_va, average="macro"))

    return pipe, P_bl, P_va, {"acc": acc, "f1_weighted": f1w, "f1_macro": f1m}

# =========================
# 5b) fastText baseline (a very different model): char subwords + n-grams
# =========================
def _fasttext_available() -> bool:
    try:
        import fasttext  # noqa
        return True
    except Exception:
        return False

def _fasttext_clean_one_line(t: str) -> str:
    """fastText predict() only accepts a single line, so newlines are removed and whitespace is collapsed."""
    if t is None:
        return ""
    t = str(t)
    # regular newlines + Unicode line/paragraph separators
    t = t.replace("\r", " ").replace("\n", " ").replace("\u2028", " ").replace("\u2029", " ")
    t = re.sub(r"\s+", " ", t).strip()
    return t

def _patch_fasttext_for_numpy2():
    """Workaround pour fasttext + NumPy>=2 (np.array(copy=False) -> np.asarray)."""
    try:
        import numpy as _np
        ver = tuple(int(x) for x in _np.__version__.split(".")[:2])
        if ver < (2, 0):
            return
    except Exception:
        return

    try:
        import fasttext  # noqa: F401
        import fasttext.FastText as _ft_mod
        import inspect as _inspect
        import os as _os
        import importlib as _importlib

        ft_path = getattr(_ft_mod, "__file__", None) or _inspect.getsourcefile(_ft_mod)
        if (not ft_path) or (not _os.path.exists(ft_path)):
            return

        with open(ft_path, "r", encoding="utf-8") as f:
            code = f.read()

        if "np.array(probs, copy=False)" not in code:
            return

        new_code = code.replace("np.array(probs, copy=False)", "np.asarray(probs)")
        if new_code != code:
            with open(ft_path, "w", encoding="utf-8") as f:
                f.write(new_code)
            _importlib.reload(_ft_mod)
            print("Patched fastText for NumPy>=2.0 (FastText.py).")
    except Exception as e:
        print("WARN: fastText patch (NumPy>=2) failed:", repr(e))

def _write_fasttext_file(df_in: pd.DataFrame, out_path: str):
    """Create a fastText supervised file: __label__<id> <text>"""
    out_p = Path(out_path)
    out_p.parent.mkdir(parents=True, exist_ok=True)
    with out_p.open("w", encoding="utf-8") as f:
        for _, row in df_in.iterrows():
            y_id = int(label2id[int(row[LABEL_COL])])
            txt = str(row[TEXT_MODEL_COL]).replace("\n", " ").replace("\r", " ")
            f.write(f"__label__{y_id} {txt}\n")
    return str(out_p)

def _fasttext_predict_proba(model, texts: List[str], num_labels: int) -> np.ndarray:
    """Return dense proba matrix (N, C) from fastText model.predict.
    Robust: one-line cleaning + sample-by-sample prediction.
    """
    P = np.zeros((len(texts), num_labels), dtype=np.float32)
    for i, t in enumerate(texts):
        t = _fasttext_clean_one_line(t)
        labs, probs = model.predict(t, k=num_labels)
        for lab, pr in zip(labs, probs):
            j = int(str(lab).replace("__label__", ""))
            P[i, j] = float(pr)

    s = P.sum(axis=1, keepdims=True)
    P = P / np.maximum(s, 1e-12)
    return P

def train_fasttext(train_base_df: pd.DataFrame, blend_df: pd.DataFrame, test_df: pd.DataFrame):
    """Train fastText supervised model and return (model, P_blend, P_val, metrics_val).
    If fastText isn't available, returns (None, None, None, None).
    """
    if not _fasttext_available():
        print("fastText not installed/available -> skipping.")
        return None, None, None, None

    import fasttext

    _patch_fasttext_for_numpy2()

    tr_file = _write_fasttext_file(train_base_df, str(OUT_DIR / "fasttext" / "train.txt"))

    print("Training fastText...")
    ft = fasttext.train_supervised(
        input=tr_file,
        lr=0.35,
        epoch=30,
        wordNgrams=3,
        dim=240,
        minn=2,
        maxn=5,
        bucket=2000000,
        loss="softmax",
        thread=max(2, os.cpu_count() or 2),
        verbose=2,
    )

    X_bl = [_fasttext_clean_one_line(t) for t in blend_df[TEXT_MODEL_COL].astype(str).tolist()]
    X_va = [_fasttext_clean_one_line(t) for t in test_df[TEXT_MODEL_COL].astype(str).tolist()]

    P_bl = _fasttext_predict_proba(ft, X_bl, num_labels).astype(np.float32)
    P_va = _fasttext_predict_proba(ft, X_va, num_labels).astype(np.float32)

    y_va = y_to_ids(test_df[LABEL_COL].values)
    pred_va = P_va.argmax(axis=1)

    acc = float((pred_va == y_va).mean())
    f1w = float(f1_score(y_va, pred_va, average="weighted"))
    f1m = float(f1_score(y_va, pred_va, average="macro"))

    return ft, P_bl, P_va, {"acc": acc, "f1_weighted": f1w, "f1_macro": f1m}


In [11]:
# =========================
# 6) Transformers (Trainer)
# =========================
from transformers import (
    AutoTokenizer, AutoModel,
    TrainingArguments, Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    set_seed,
)
from datasets import Dataset



# --- Extra deps (some tokenizers like FlauBERT may require them) ---
import importlib.util, subprocess, sys
def _ensure_pkg(pkg: str):
    if importlib.util.find_spec(pkg) is None:
        print(f"Installing missing dependency: {pkg}")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

for _pkg in ["sacremoses", "sentencepiece"]:
    _ensure_pkg(_pkg)

def load_tokenizer(model_name: str):
    # FlauBERT has no fast tokenizer; use slow to avoid AutoTokenizer issues
    use_fast = True
    if "flaubert" in model_name.lower():
        use_fast = False
    return AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)


import inspect

def safe_training_args(**kwargs):
    """Create TrainingArguments with backward/forward compatibility across transformers versions."""
    sig = inspect.signature(TrainingArguments.__init__)
    params = set(sig.parameters.keys())

    # rename if needed: newer transformers versions use `eval_strategy`
    if "evaluation_strategy" in kwargs and "evaluation_strategy" not in params and "eval_strategy" in params:
        kwargs["eval_strategy"] = kwargs.pop("evaluation_strategy")

    # drop unknown keys (older/newer transformers)
    kwargs = {k: v for k, v in kwargs.items() if k in params}
    return TrainingArguments(**kwargs)

class TextClassifier(nn.Module):
    def __init__(self, model_name: str, num_labels: int, dropout: float = 0.15, mlp_dim: int = 512, pooling: str = "mean"):
        super().__init__()
        self.pooling = pooling
        self.backbone = AutoModel.from_pretrained(model_name)
        cfg = self.backbone.config
        hidden = getattr(cfg, 'hidden_size', None) or getattr(cfg, 'd_model', None) or getattr(cfg, 'emb_dim', None)
        if hidden is None:
            raise ValueError(f"Cannot infer hidden size for backbone config type={type(cfg)}")
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Linear(hidden, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, num_labels),
        )

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        backbone_inputs = {'input_ids': input_ids, 'attention_mask': attention_mask, 'token_type_ids': token_type_ids}

        backbone_inputs = {k: v for k, v in backbone_inputs.items() if v is not None}

        # Pass through only keys that the backbone actually supports (e.g. langs for FlauBERT/XLM)
        try:
            allowed = set(inspect.signature(self.backbone.forward).parameters.keys())
            for k, v in kwargs.items():
                if (k in allowed) and (v is not None):
                    backbone_inputs[k] = v
        except Exception:
            pass

        out = self.backbone(**backbone_inputs)
        last = out.last_hidden_state
        if self.pooling == "cls":
            pooled = last[:, 0]
        else:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (last * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        logits = self.classifier(pooled)
        loss = None
        if labels is not None:
            loss = F.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}

def make_hf_dataset(df: pd.DataFrame) -> Dataset:
    # "label" holds the original code (e.g. 2705); it is converted to ids 0..C-1 afterwards
    try:
        return Dataset.from_pandas(df[[TEXT_MODEL_COL, LABEL_COL]].rename(columns={TEXT_MODEL_COL:"text", LABEL_COL:"label"}), preserve_index=False)
    except TypeError:
        return Dataset.from_pandas(df[[TEXT_MODEL_COL, LABEL_COL]].rename(columns={TEXT_MODEL_COL:"text", LABEL_COL:"label"}))

def tokenize_fn(examples, tokenizer, max_length):
    # Some tokenizers (e.g., BERT) return token_type_ids; RoBERTa/XLM-R don't need them.
    # We disable them when supported to avoid passing unexpected keys around.
    try:
        return tokenizer(examples["text"], truncation=True, max_length=max_length, return_token_type_ids=False)
    except TypeError:
        return tokenizer(examples["text"], truncation=True, max_length=max_length)

def add_label_ids(ds: Dataset) -> Dataset:
    def map_labels(batch):
        return {"labels": [label2id[int(x)] for x in batch["label"]]}
    ds = ds.map(map_labels, batched=True)
    ds = ds.remove_columns(["label"])
    return ds

# ---- Safety checks to avoid CUDA "device-side assert triggered" ----
def ensure_tokenizer_model_compat(tokenizer, model):
    # Ensure pad token exists (required by DataCollatorWithPadding)
    if tokenizer.pad_token is None:
        # Prefer existing special tokens if any
        if tokenizer.eos_token is not None:
            tokenizer.pad_token = tokenizer.eos_token
        elif tokenizer.sep_token is not None:
            tokenizer.pad_token = tokenizer.sep_token
        else:
            tokenizer.add_special_tokens({"pad_token": "<pad>"})
    # Resize embeddings if tokenizer vocab > model embeddings
    try:
        emb_n = model.backbone.get_input_embeddings().weight.shape[0]
        tok_n = len(tokenizer)
        if tok_n > emb_n:
            model.backbone.resize_token_embeddings(tok_n)
            print(f"Resized embeddings: {emb_n} -> {tok_n}")
    except Exception as e:
        print("⚠️ Could not check/resize embeddings:", repr(e))
    return tokenizer, model

def dataset_sanity_checks(ds: Dataset, num_labels: int, tokenizer, model, sample_n: int = 512):
    # label range
    labs = ds["labels"] if "labels" in ds.column_names else ds["label"]
    labs_arr = np.array(labs, dtype=np.int64)
    mn, mx = int(labs_arr.min()), int(labs_arr.max())
    if mn < 0 or mx >= num_labels:
        raise ValueError(f"Invalid labels range: min={mn}, max={mx}, num_labels={num_labels}. "
                         "This will crash CrossEntropyLoss on GPU.")

    # input id range (sampled)
    if "input_ids" in ds.column_names:
        import random as _random
        idxs = list(range(len(ds)))
        _random.shuffle(idxs)
        idxs = idxs[:min(sample_n, len(ds))]
        max_id = -1
        for i in idxs:
            arr = ds[i]["input_ids"]
            # arr may be list[int] or torch tensor
            if hasattr(arr, "max"):
                v = int(arr.max())
            else:
                v = int(max(arr)) if len(arr) else -1
            if v > max_id:
                max_id = v
        # model vocab size (embeddings)
        try:
            vocab_n = int(model.backbone.get_input_embeddings().weight.shape[0])
        except Exception as e:
            print("⚠️ Could not validate input_ids range:", repr(e))
            vocab_n = None
        # Raise outside the try block so a real mismatch is not swallowed by `except Exception`
        if vocab_n is not None and max_id >= vocab_n:
            raise ValueError(f"input_ids contain token id {max_id} but model vocab size is {vocab_n}. "
                             "Tokenizer/model mismatch or missing resize_token_embeddings().")

def preflight_forward_pass(ds: Dataset, tokenizer, model, batch_size: int = 8):
    # one CPU forward pass on a small batch to catch label/vocab issues before GPU
    from torch.utils.data import DataLoader
    collator = DataCollatorWithPadding(tokenizer=tokenizer)
    dl = DataLoader(ds, batch_size=batch_size, shuffle=False, collate_fn=collator)
    batch = next(iter(dl))
    model_cpu = model.cpu()
    batch_cpu = {k: v.cpu() if hasattr(v, "cpu") else v for k, v in batch.items()}
    with torch.no_grad():
        out = model_cpu(**batch_cpu)
    return out

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = (preds == labels).mean()
    f1w = f1_score(labels, preds, average="weighted")
    f1m = f1_score(labels, preds, average="macro")
    return {"accuracy": acc, "f1_weighted": f1w, "f1_macro": f1m}


# ---- Per-model scaling / OOM safety ----
TARGET_EFF_BS = int(BATCH_SIZE) * int(GRAD_ACCUM)

def _hparam_candidates_for_model(model_name: str):
    """Small list of (max_len, per_device_bs, grad_accum) candidates.
    Tries strongest first, then progressively safer fallbacks.
    """
    name = (model_name or "").lower()

    def _cand(max_len: int, per_bs: int):
        per_bs = int(max(1, per_bs))
        max_len = int(max(64, max_len))
        ga = int(max(1, math.ceil(TARGET_EFF_BS / per_bs)))
        return (max_len, per_bs, ga)

    cands = [_cand(MAX_LENGTH, BATCH_SIZE)]

    heavy = ("large" in name) or ("xl" in name) or ("mdeberta" in name)
    if heavy:
        cands += [
            _cand(min(MAX_LENGTH, 320), min(BATCH_SIZE, 8)),
            _cand(min(MAX_LENGTH, 256), min(BATCH_SIZE, 6)),
            _cand(min(MAX_LENGTH, 192), min(BATCH_SIZE, 4)),
        ]

    cands.append(_cand(128, 2))

    # dedupe while preserving order
    out, seen = [], set()
    for t in cands:
        if t not in seen:
            out.append(t)
            seen.add(t)
    return out

def _is_oom(e: Exception) -> bool:
    msg = str(e).lower()
    return ("out of memory" in msg) or ("cuda oom" in msg)


def train_transformer_one(
    model_name: str,
    train_base_df: pd.DataFrame,
    blend_df: pd.DataFrame,
    seed: int,
    out_dir: str,
):
    set_seed(seed)

    last_err = None
    for (max_len, per_bs, ga) in _hparam_candidates_for_model(model_name):
        try:
            # (Re)load fresh model each attempt (safer after an OOM)
            tokenizer = load_tokenizer(model_name)
            model = TextClassifier(
                model_name,
                num_labels=num_labels,
                dropout=DROPOUT,
                mlp_dim=MLP_DIM,
                pooling=POOLING,
            )
            tokenizer, model = ensure_tokenizer_model_compat(tokenizer, model)

            # Memory savers (esp. large backbones)
            if ENABLE_GRAD_CHECKPOINTING:
                try:
                    if hasattr(model.backbone, "gradient_checkpointing_enable"):
                        # PyTorch's re-entrant checkpointing mode can trigger autograd errors in some cases.
                        try:
                            model.backbone.gradient_checkpointing_enable(
                                gradient_checkpointing_kwargs={"use_reentrant": bool(GRAD_CKPT_USE_REENTRANT)}
                            )
                        except TypeError:
                            # fallback for older Transformers versions
                            model.backbone.gradient_checkpointing_enable()
                    if hasattr(model.backbone, "config") and hasattr(model.backbone.config, "use_cache"):
                        model.backbone.config.use_cache = False
                except Exception as _e:
                    print("⚠️ gradient checkpointing failed:", repr(_e))

            tr_ds = make_hf_dataset(train_base_df)
            bl_ds = make_hf_dataset(blend_df)

            tr_ds = tr_ds.map(lambda x: tokenize_fn(x, tokenizer, max_len), batched=True, remove_columns=["text"])
            bl_ds = bl_ds.map(lambda x: tokenize_fn(x, tokenizer, max_len), batched=True, remove_columns=["text"])

            tr_ds = add_label_ids(tr_ds)
            bl_ds = add_label_ids(bl_ds)

            tr_ds.set_format(type="torch")
            bl_ds.set_format(type="torch")

            # Preflight checks (catch label/id mismatch early)
            _ = tr_ds[0]["labels"]
            _ = bl_ds[0]["labels"]
            # Mixed precision: bf16 on compute capability >= 8 (Ampere or newer, e.g. A100/H100), otherwise fp16 (e.g. T4)
            use_bf16 = False
            use_fp16 = torch.cuda.is_available()
            if torch.cuda.is_available():
                try:
                    major, _minor = torch.cuda.get_device_capability(0)
                    use_bf16 = (major >= 8)
                    use_fp16 = (not use_bf16)
                except Exception:
                    pass

            args = safe_training_args(
                output_dir=out_dir,
                overwrite_output_dir=True,
                evaluation_strategy="epoch",
                save_strategy="epoch",
                save_total_limit=2,
                load_best_model_at_end=True,
                metric_for_best_model="f1_weighted",
                greater_is_better=True,
                num_train_epochs=EPOCHS,
                per_device_train_batch_size=per_bs,
                per_device_eval_batch_size=min(per_bs, 16),
                gradient_accumulation_steps=ga,
                learning_rate=LR,
                weight_decay=WEIGHT_DECAY,
                warmup_ratio=WARMUP_RATIO,
                bf16=use_bf16,
                fp16=use_fp16,
                report_to="none",
                logging_steps=200,
                seed=seed,
            )

            trainer_kwargs = dict(
                model=model,
                args=args,
                train_dataset=tr_ds,
                eval_dataset=bl_ds,
                data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
                compute_metrics=compute_metrics,
                callbacks=[EarlyStoppingCallback(early_stopping_patience=PATIENCE)],
            )
            # Compat Transformers>=4.47: the tokenizer argument is deprecated in favor of processing_class
            try:
                if "processing_class" in inspect.signature(Trainer.__init__).parameters:
                    trainer_kwargs["processing_class"] = tokenizer
                else:
                    trainer_kwargs["tokenizer"] = tokenizer
            except Exception:
                trainer_kwargs["tokenizer"] = tokenizer

            trainer = Trainer(**trainer_kwargs)
            trainer.train()
            met = trainer.evaluate()
            return trainer, tokenizer, met

        except RuntimeError as e:
            if _is_oom(e):
                last_err = e
                print(f"[OOM] {model_name} (max_len={max_len}, bs={per_bs}, ga={ga}) -> retry smaller...")
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                continue
            raise

    # If we reach here, all candidates OOM'ed
    raise last_err

@torch.no_grad()
def predict_logits(trainer: Trainer, df: pd.DataFrame, tokenizer) -> Tuple[np.ndarray, np.ndarray]:
    ds = make_hf_dataset(df)
    ds = ds.map(lambda x: tokenize_fn(x, tokenizer, MAX_LENGTH), batched=True, remove_columns=["text"])
    ds = add_label_ids(ds)
    ds.set_format(type="torch")
    preds = trainer.predict(ds)
    logits = preds.predictions.astype(np.float32)
    labels = np.array(preds.label_ids, dtype=np.int64)  # already ids in 0..C-1
    return logits, labels
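
The OOM fallback in `_hparam_candidates_for_model` shrinks the per-device batch size and raises the gradient accumulation steps so that the effective batch size stays close to `TARGET_EFF_BS`. The sketch below reproduces only that arithmetic. The values `BATCH_SIZE=16`, `GRAD_ACCUM=2` and `MAX_LENGTH=384` are assumptions chosen for illustration, and `fallback_schedule` is a hypothetical helper, not a function from this notebook.

```python
import math

def fallback_schedule(target_eff_bs, candidates):
    """For each (max_len, per_device_bs) pair, derive the accumulation steps
    needed to reach at least target_eff_bs examples per optimizer step."""
    rows = []
    for max_len, per_bs in candidates:
        per_bs = max(1, int(per_bs))
        ga = max(1, math.ceil(target_eff_bs / per_bs))
        rows.append((max_len, per_bs, ga, per_bs * ga))
    return rows

# Assumed settings: BATCH_SIZE=16, GRAD_ACCUM=2 -> target effective batch size 32.
schedule = fallback_schedule(32, [(384, 16), (320, 8), (256, 6), (192, 4), (128, 2)])
for max_len, per_bs, ga, eff in schedule:
    print(f"max_len={max_len:3d} per_device_bs={per_bs:2d} grad_accum={ga:2d} effective_bs={eff}")
```

Because the accumulation count is rounded up, a per-device batch size that does not divide the target (6 here) slightly overshoots it (36 instead of 32), so fallback runs are close to, but not exactly, the original optimization setting.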


In [12]:
# =========================
# 7) Temperature scaling
# =========================
def temperature_scale_logits(logits: np.ndarray, labels: np.ndarray, max_iters: int = 200, lr: float = 0.05) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    logits_t = torch.tensor(logits, dtype=torch.float32, device=device)
    labels_t = torch.tensor(labels, dtype=torch.long, device=device)

    log_T = torch.zeros(1, device=device, requires_grad=True)
    opt = torch.optim.Adam([log_T], lr=lr)

    for _ in range(max_iters):
        opt.zero_grad()
        T = torch.exp(log_T)
        loss = F.cross_entropy(logits_t / T, labels_t)
        loss.backward()
        opt.step()

    T = float(torch.exp(log_T).detach().cpu().item())
    return max(T, 1e-3)

def softmax_np(logits: np.ndarray) -> np.ndarray:
    x = logits - logits.max(axis=1, keepdims=True)
    ex = np.exp(x)
    return ex / np.maximum(ex.sum(axis=1, keepdims=True), 1e-12)
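
To make the effect of the temperature concrete, here is a minimal standard-library sketch on made-up logits. It is not the notebook's implementation: `temperature_scale_logits` optimizes `log_T` with Adam, whereas this sketch simply scans a grid of temperatures and keeps the one with the lowest negative log-likelihood. The toy logits and the grid bounds are assumptions.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the max."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logits_rows, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    total = -sum(math.log(softmax(row, T)[y]) for row, y in zip(logits_rows, labels))
    return total / len(labels)

# Toy over-confident model: three very sharp predictions, the last one is wrong.
logits_rows = [[6.0, 0.0, 0.0], [0.0, 6.0, 0.0], [6.0, 0.0, 0.0]]
labels = [0, 1, 1]

grid = [0.5 + 0.1 * i for i in range(96)]  # candidate temperatures in [0.5, 10.0]
best_T = min(grid, key=lambda T: nll(logits_rows, labels, T))

print("best T:", round(best_T, 2))
print("NLL at T=1:", round(nll(logits_rows, labels, 1.0), 4))
print("NLL at best T:", round(nll(logits_rows, labels, best_T), 4))
```

Dividing by a temperature above 1 flattens the probabilities without changing the argmax, so accuracy and F1 of a single model are unaffected; only the confidence that the model contributes to the blend changes.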


In [13]:

# =========================
# 8) Blending (global / groupwise) — objective: weighted-F1
# =========================
def _score_weighted_f1(P: np.ndarray, y_true: np.ndarray) -> float:
    preds = P.argmax(axis=1)
    return float(f1_score(y_true, preds, average="weighted"))

def blend_global(P_tfidf: np.ndarray, P_tr: np.ndarray, y_true: np.ndarray, w_grid=W_GRID):
    best_f1, best_w, best_acc = -1.0, 0.5, -1.0
    for w in w_grid:
        P = (1-w)*P_tfidf + w*P_tr
        f1w = _score_weighted_f1(P, y_true)
        acc = float((P.argmax(axis=1) == y_true).mean())
        if f1w > best_f1:
            best_f1, best_w, best_acc = f1w, float(w), acc
    return {"f1_weighted": best_f1, "acc": best_acc, "w": best_w}

def apply_group_blend(P_tfidf: np.ndarray, P_tr: np.ndarray, w_by_super: Dict[str,float]) -> np.ndarray:
    P = P_tfidf.copy()
    for sup, w in w_by_super.items():
        ids = [i for i,lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not ids:
            continue
        P[:, ids] = (1-w)*P_tfidf[:, ids] + w*P_tr[:, ids]
    return P

def grid_search_group_w(P_t, P_r, y_true, grid=np.linspace(0,1,11)):
    """Coordinate-descent grid search, optimizing weighted-F1 on the blend set."""
    w_by = {s: 0.5 for s in SUPERS}
    for s in SUPERS:
        best_f1, best_w = -1.0, 0.5
        for w in grid:
            tmp = dict(w_by)
            tmp[s] = float(w)
            P = apply_group_blend(P_t, P_r, tmp)
            f1w = _score_weighted_f1(P, y_true)
            if f1w > best_f1:
                best_f1, best_w = f1w, float(w)
        w_by[s] = best_w
    return w_by

# =========================
# 8b) Blending multi-modèles (simplex) — Dirichlet random search + refine
# =========================
def _blend_with_weights(P_list: List[np.ndarray], w: np.ndarray) -> np.ndarray:
    w = np.asarray(w, dtype=np.float64)
    w = np.clip(w, 0.0, 1.0)
    s = w.sum()
    if s <= 0:
        w = np.ones_like(w) / len(w)
    else:
        w = w / s
    P_stack = np.stack(P_list, axis=0).astype(np.float64)  # (K,N,C)
    P = np.tensordot(w, P_stack, axes=(0,0))               # (N,C)
    P = P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)
    return P.astype(np.float32)

def search_dirichlet_weights(P_list: List[np.ndarray], y_true: np.ndarray, n_samples: int = 8000, alpha: float = 1.0, seed: int = 42):
    """Random search on simplex (Dirichlet). Returns best weights for weighted-F1."""
    rng = np.random.default_rng(seed)
    K = len(P_list)
    best_w = np.ones(K, dtype=np.float64) / K
    best_f1 = -1.0
    P_stack = np.stack(P_list, axis=0).astype(np.float64)  # (K,N,C)
    for _ in range(int(n_samples)):
        w = rng.dirichlet(alpha * np.ones(K))
        P = np.tensordot(w, P_stack, axes=(0,0))
        preds = P.argmax(axis=1)
        f1w = float(f1_score(y_true, preds, average="weighted"))
        if f1w > best_f1:
            best_f1 = f1w
            best_w = w
    return best_w.astype(np.float32), float(best_f1)

def refine_weights_coordinate(P_list: List[np.ndarray], y_true: np.ndarray, w0: np.ndarray, step: float = 0.05, n_rounds: int = 3):
    """Coordinate ascent on simplex around w0 (fast local improvement)."""
    w = np.asarray(w0, dtype=np.float64).copy()
    K = len(P_list)
    for _ in range(int(n_rounds)):
        improved = False
        for j in range(K):
            best = w.copy()
            best_f1 = _score_weighted_f1(_blend_with_weights(P_list, best), y_true)
            for delta in (-step, step):
                cand = w.copy()
                cand[j] = max(0.0, cand[j] + delta)
                cand = cand / max(cand.sum(), 1e-12)
                f1w = _score_weighted_f1(_blend_with_weights(P_list, cand), y_true)
                if f1w > best_f1:
                    best_f1 = f1w
                    best = cand
            if not np.allclose(best, w):
                w = best
                improved = True
        if not improved:
            break
    return w.astype(np.float32), float(_score_weighted_f1(_blend_with_weights(P_list, w), y_true))

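def apply_group_blend_multi(P_list: List[np.ndarray], w_by_super: Dict[str, List[float]]) -> np.ndarray:
    """Legacy version: blend per super-category by acting on the columns (classes) that belong to that super.

    Note: _blend_with_weights renormalizes rows inside each group; see apply_group_blend_multi_fixed.
    """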
    P_out = P_list[0].copy()
    for sup, w in w_by_super.items():
        cols = [i for i, lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not cols:
            continue
        sub_list = [P[:, cols] for P in P_list]
        sub = _blend_with_weights(sub_list, np.asarray(w))
        P_out[:, cols] = sub
    P_out = P_out / np.maximum(P_out.sum(axis=1, keepdims=True), 1e-12)
    return P_out.astype(np.float32)


# --- FIX: groupwise simplex without per-super renormalization (avoids the metric collapse) ---
def apply_group_blend_multi_fixed(P_list, w_by_super, w_fallback=None):
    """
    Fixed version:
    - Do NOT normalize inside a sub-group of classes.
    - Replace the group's columns (classes) with a weighted sum.
    - Normalize ONCE at the end (over all classes).
    """
    # Base: either the global blend (recommended) or the first model if no fallback is given
    if w_fallback is not None:
        P_out = _blend_with_weights(P_list, w_fallback).astype(np.float64)
    else:
        P_out = P_list[0].astype(np.float64).copy()

    for sup, w in w_by_super.items():
        cols = [i for i, lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not cols:
            continue

        w = _norm_w(w)  # ensures the weights sum to 1
        sub_stack = np.stack([P[:, cols] for P in P_list], axis=0).astype(np.float64)  # (K,N,|cols|)
        sub = np.tensordot(w, sub_stack, axes=(0, 0))  # (N,|cols|): weighted sum, NO internal renormalization
        P_out[:, cols] = sub

    # Final global normalization
    P_out = P_out / np.maximum(P_out.sum(axis=1, keepdims=True), 1e-12)
    return P_out.astype(np.float32)

def grid_search_group_simplex_fixed(P_list_bl, y_true, w_global,
                                   n_samples_per_super=2000, alpha=1.0, shrink=0.35, seed=42, verbose=True):
    """
    Same logic as grid_search_group_simplex, but relies on apply_group_blend_multi_fixed()
    to avoid the per-group renormalization bug.
    """
    rng = np.random.default_rng(seed)
    K = len(P_list_bl)

    # init: every super starts from the global weights
    w_by = {s: np.asarray(w_global, dtype=np.float32).copy() for s in SUPERS}

    # current predictions (with fallback = global)
    P_cur = apply_group_blend_multi_fixed(P_list_bl, w_by, w_fallback=w_global)
    best_global = _score_weighted_f1(P_cur, y_true)

    for sup in SUPERS:
        cols = [i for i, lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not cols:
            continue

        best_w = w_by[sup].copy()
        best_f1 = best_global

        sub_stack = np.stack([P[:, cols] for P in P_list_bl], axis=0).astype(np.float64)  # (K,N,|cols|)

        # candidates: current best + global + Dirichlet draws
        cand_ws = [best_w.astype(np.float64), np.asarray(w_global, dtype=np.float64)]
        for _ in range(int(n_samples_per_super)):
            cand_ws.append(rng.dirichlet(alpha * np.ones(K)))

        # evaluate the candidates for this super
        for w_raw in cand_ws:
            w = _norm_w(w_raw)

            # shrink: pull toward w_global to limit overfitting on the blend set
            wg = _norm_w(w_global)
            w = (1.0 - float(shrink)) * w + float(shrink) * wg
            w = _norm_w(w)

            # replace only this super's columns in P_cur
            P_tmp = P_cur.astype(np.float64).copy()
            sub = np.tensordot(w, sub_stack, axes=(0, 0))  # (N,|cols|)
            P_tmp[:, cols] = sub

            # global renormalization
            P_tmp = P_tmp / np.maximum(P_tmp.sum(axis=1, keepdims=True), 1e-12)

            f1w = _score_weighted_f1(P_tmp, y_true)
            if f1w > best_f1:
                best_f1 = f1w
                best_w = w.astype(np.float32)

        # freeze the best weights for this super
        w_by[sup] = best_w

        # update P_cur by applying the best weights found for this super
        P_cur = apply_group_blend_multi_fixed(P_list_bl, w_by, w_fallback=w_global)
        best_global = _score_weighted_f1(P_cur, y_true)

        if verbose:
            print(f"  [group_simplex_fixed] {sup}: f1_weighted={best_global:.6f}")

    return w_by

def grid_search_group_simplex(P_list_bl: List[np.ndarray], y_true: np.ndarray, w_global: np.ndarray,
                             n_samples_per_super: int = 2000, alpha: float = 1.0, shrink: float = 0.35, seed: int = 42):
    """Coordinate descent over supers; for each super, random-search simplex weights (Dirichlet).
    shrink pulls super-specific weights toward global weights to reduce overfit.
    """
    rng = np.random.default_rng(seed)
    K = len(P_list_bl)
    w_by = {s: w_global.astype(np.float32).copy() for s in SUPERS}

    P_cur = apply_group_blend_multi(P_list_bl, w_by)
    best_global = _score_weighted_f1(P_cur, y_true)

    for sup in SUPERS:
        cols = [i for i, lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not cols:
            continue

        best_w = w_by[sup].copy()
        best_f1 = best_global

        sub_stack = np.stack([P[:, cols] for P in P_list_bl], axis=0).astype(np.float64)  # (K,N,|cols|)

        cand_ws = [best_w.astype(np.float64), w_global.astype(np.float64)]
        for _ in range(int(n_samples_per_super)):
            cand_ws.append(rng.dirichlet(alpha * np.ones(K)))

        for w_raw in cand_ws:
            w = np.asarray(w_raw, dtype=np.float64)
            w = (1.0 - float(shrink)) * w + float(shrink) * w_global
            w = w / max(w.sum(), 1e-12)

            P_tmp = P_cur.copy()
            sub = np.tensordot(w, sub_stack, axes=(0,0))
            P_tmp[:, cols] = sub
            P_tmp = P_tmp / np.maximum(P_tmp.sum(axis=1, keepdims=True), 1e-12)

            f1w = _score_weighted_f1(P_tmp, y_true)
            if f1w > best_f1:
                best_f1 = f1w
                best_w = w.astype(np.float32)

        w_by[sup] = best_w
        P_cur = apply_group_blend_multi(P_list_bl, w_by)
        best_global = _score_weighted_f1(P_cur, y_true)
        print(f"  [group_simplex] {sup}: f1_weighted={best_global:.6f}")

    return w_by
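
The random search on the simplex used above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the notebook's code: Dirichlet draws are obtained by normalizing Gamma samples (`random.gammavariate`), plain lists replace NumPy arrays, and the objective is accuracy rather than weighted-F1 to keep the example short. The two toy "models" and their probabilities are invented.

```python
import random

def sample_simplex(k, alpha, rng):
    """Dirichlet(alpha, ..., alpha) draw: normalize k independent Gamma(alpha, 1) samples."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [v / s for v in g]

def blend(P_list, w):
    """Weighted average of K probability matrices, each with shape (N, C)."""
    n, c = len(P_list[0]), len(P_list[0][0])
    return [[sum(wk * P[i][j] for wk, P in zip(w, P_list)) for j in range(c)]
            for i in range(n)]

def accuracy(P, y):
    return sum(1 for row, t in zip(P, y) if row.index(max(row)) == t) / len(y)

def random_search(P_list, y, n_samples=2000, alpha=1.0, seed=42):
    """Start from uniform weights and keep the best-scoring simplex draw."""
    rng = random.Random(seed)
    k = len(P_list)
    best_w = [1.0 / k] * k
    best = accuracy(blend(P_list, best_w), y)
    for _ in range(n_samples):
        w = sample_simplex(k, alpha, rng)
        score = accuracy(blend(P_list, w), y)
        if score > best:
            best, best_w = score, w
    return best_w, best

# Toy data: model A always predicts class 0, model B always predicts class 1.
P_a = [[0.90, 0.10], [0.60, 0.40], [0.55, 0.45]]
P_b = [[0.40, 0.60], [0.35, 0.65], [0.10, 0.90]]
y = [0, 0, 1]

best_w, best_score = random_search([P_a, P_b], y)
print("weights:", [round(v, 3) for v in best_w], "accuracy:", best_score)
```

On this toy set neither model alone, nor the uniform blend, classifies all three samples correctly, while any weight on model A between 0.6 and about 0.889 does. The objective is piecewise constant in the weights, which is why a sampling-based search is a reasonable choice here instead of gradient descent.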

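The reason `apply_group_blend_multi_fixed` avoids renormalizing inside each super-category can be seen on a single sample. In the sketch below (standard library only, invented probabilities, hypothetical helper `group_blend`), renormalizing each group to sum to 1 inflates a group that both models consider unlikely, and the predicted class flips.

```python
def group_blend(rows, weights, groups, renorm_inside_group):
    """rows: one probability vector per model, all for the same sample.
    groups: list of column-index lists (one list per super-category).
    The same model weights are used for every group in this sketch."""
    out = [0.0] * len(rows[0])
    for cols in groups:
        sub = [sum(w * r[c] for w, r in zip(weights, rows)) for c in cols]
        if renorm_inside_group:
            s = sum(sub)
            sub = [v / s for v in sub]  # legacy behavior: each group sums to 1
        for c, v in zip(cols, sub):
            out[c] = v
    total = sum(out)
    return [v / total for v in out]  # single global normalization at the end

# Two models, four classes, two groups: classes {0, 1} and classes {2, 3}.
model_a = [0.40, 0.35, 0.24, 0.01]
model_b = [0.44, 0.31, 0.22, 0.03]
groups = [[0, 1], [2, 3]]

legacy = group_blend([model_a, model_b], [0.5, 0.5], groups, renorm_inside_group=True)
fixed = group_blend([model_a, model_b], [0.5, 0.5], groups, renorm_inside_group=False)

print("legacy:", [round(v, 3) for v in legacy], "argmax:", legacy.index(max(legacy)))
print("fixed: ", [round(v, 3) for v in fixed], "argmax:", fixed.index(max(fixed)))
```

Both models put 75% of the mass on the first group, yet after per-group renormalization class 2 wins because it dominates its own small group. Keeping the raw weighted sums preserves the relative mass between groups, and both models' top class (class 0) remains the prediction.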

In [14]:

# =========================
# 9) Diagnostics & exports
# =========================
def export_run_artifacts(run_dir: Path, df_eval: pd.DataFrame, y_true_ids: np.ndarray, P: np.ndarray):
    run_dir.mkdir(parents=True, exist_ok=True)

    pred_ids = P.argmax(axis=1)
    acc = float((pred_ids == y_true_ids).mean())
    f1w = float(f1_score(y_true_ids, pred_ids, average="weighted"))
    f1m = float(f1_score(y_true_ids, pred_ids, average="macro"))

    rep = classification_report(y_true_ids, pred_ids, output_dict=True, zero_division=0)
    pd.DataFrame(rep).T.to_csv(run_dir / "classification_report.csv", index=True)

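    # Pass the full label range so the matrix stays C x C even if a class is absent from this split.
    cm = confusion_matrix(y_true_ids, pred_ids, labels=np.arange(len(label_list)))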
    pd.DataFrame(cm, index=label_list, columns=label_list).to_csv(run_dir / "confusion_matrix.csv")

    # metrics by super-category
    y_true_super = df_eval[LABEL_COL].map(label_to_super)
    sup_rows = []
    for sup in SUPERS:
        idx = (y_true_super == sup).values
        if idx.sum() == 0:
            continue
        acc_sup = float((pred_ids[idx] == y_true_ids[idx]).mean())
        f1w_sup = float(f1_score(y_true_ids[idx], pred_ids[idx], average="weighted"))
        f1m_sup = float(f1_score(y_true_ids[idx], pred_ids[idx], average="macro"))
        sup_rows.append([sup, int(idx.sum()), acc_sup, f1w_sup, f1m_sup])

    pd.DataFrame(
        sup_rows,
        columns=["super","n","acc","f1_weighted","f1_macro"],
    ).to_csv(run_dir / "metrics_by_super.csv", index=False)

    with open(run_dir / "metrics.json", "w") as f:
        json.dump({"acc": acc, "f1_weighted": f1w, "f1_macro": f1m}, f, indent=2)

    return {"acc": acc, "f1_weighted": f1w, "f1_macro": f1m}
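
Since both weighted and macro F1 are exported, a small worked example may help show how they differ. The function below is a plain-Python sketch of the usual definitions (per-class F1, then an unweighted mean for macro and a support-weighted mean for weighted); it is meant as an illustration on invented labels, not as a replacement for `sklearn.metrics.f1_score`.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (per-class F1, macro F1, support-weighted F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = (2 * tp / denom) if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted

# Toy imbalanced problem: class 0 is frequent, class 2 is never predicted.
y_true = [0] * 6 + [1] * 2 + [2] * 2
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

per_class, macro, weighted = f1_scores(y_true, y_pred)
print({c: round(v, 3) for c, v in per_class.items()})
print("macro:", round(macro, 4), "weighted:", round(weighted, 4))
```

Here the weighted score (about 0.613) is noticeably higher than the macro score (about 0.489) because the completely missed class has little support. This is worth keeping in mind when reading `metrics_by_super.csv`: a blend tuned for weighted-F1 can still neglect rare classes.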

In [15]:
# =========================
# 10) RUN3 (single config)
# =========================
from IPython.display import display

@dataclass
class RunCfg:
    name: str
    use_group_split_for_val: bool
    filter_lang_fr: bool
    transformer_models: List[str]
    transformer_seeds: List[int]
    blend_mode_legacy: str                 # legacy: "global" | "group_fixed" | "group_search"
    use_tfidf: bool = True
    use_fasttext: bool = True
    blend_mode: str = "simplex_group_search"  # simplex_global | simplex_group_search
    simplex_samples_global: int = 14000
    simplex_samples_group: int = 3000
    simplex_alpha: float = 1.0
    group_shrink: float = 0.30
    include_transformers_individual: bool = True
    include_transformer_mean: bool = True
    fixed_w_by_super: Optional[Dict[str,float]] = None


# ---- Models (pruning + diversity) ----
# Based on the latest reblending (global simplex):
# - camembert-base-ccnet: ~0.4% of the weight, redundant with CamemBERT large, so disabled by default.
# - fastText: ~3–4% of the weight, marginal gain, so disabled by default.
#
# We keep a very solid "core" (CamemBERT + XLM-R + FlauBERT) and offer
# 1–2 optional add-ons (mDeBERTa, ELECTRA) to gain diversity.

MODEL_CCNET           = "almanach/camembert-base-ccnet"
MODEL_CAMEMBERT_BASE  = "almanach/camembert-base"
MODEL_CAMEMBERT_LARGE = "almanach/camembert-large"
MODEL_XLMR_BASE       = "xlm-roberta-base"
MODEL_XLMR_LARGE      = "FacebookAI/xlm-roberta-large"

MODEL_FLAUBERT_BASE   = "flaubert/flaubert_base_cased"
MODEL_FLAUBERT_LARGE  = "flaubert/flaubert_large_cased"
MODEL_MDEBERTA        = "microsoft/mdeberta-v3-base"
MODEL_ELECTRA_FR      = "dbmdz/electra-base-french-europeana-cased-discriminator"

AUTO_SCALE = True

def _gpu_vram_gb() -> float:
    if torch.cuda.is_available():
        return float(torch.cuda.get_device_properties(0).total_memory) / (1024**3)
    return 0.0

VRAM_GB = _gpu_vram_gb()
print(f"GPU VRAM (GB): {VRAM_GB:.1f}")

# Automatic selection (base vs large)
MODEL_CAMEMBERT = MODEL_CAMEMBERT_LARGE if (AUTO_SCALE and VRAM_GB >= 16) else MODEL_CAMEMBERT_BASE
MODEL_XLMR      = MODEL_XLMR_LARGE      if (AUTO_SCALE and VRAM_GB >= 22) else MODEL_XLMR_BASE
MODEL_FLAUBERT  = MODEL_FLAUBERT_LARGE  if (AUTO_SCALE and VRAM_GB >= 24) else MODEL_FLAUBERT_BASE

# Switches
INCLUDE_CCNET    = False
INCLUDE_MDEBERTA = True
INCLUDE_ELECTRA  = False

TRANSFORMER_MODELS_CORE = [MODEL_CAMEMBERT, MODEL_XLMR, MODEL_FLAUBERT]
TRANSFORMER_MODELS_EXTRA = []
if INCLUDE_MDEBERTA:
    TRANSFORMER_MODELS_EXTRA.append(MODEL_MDEBERTA)
if INCLUDE_ELECTRA:
    TRANSFORMER_MODELS_EXTRA.append(MODEL_ELECTRA_FR)
if INCLUDE_CCNET:
    TRANSFORMER_MODELS_EXTRA.append(MODEL_CCNET)

TRANSFORMER_MODELS = TRANSFORMER_MODELS_CORE + TRANSFORMER_MODELS_EXTRA
TRANSFORMER_SEEDS  = list(range(41, 41 + len(TRANSFORMER_MODELS)))

# --- Transformers used in RUN3 ---

# (Optional) user-provided weights (blending per super-category), used only if blend_mode="group_fixed"
W_BY_SUP_USER = {
    "Autres": 0.30,
    "Bébé": 0.20,
    "Collection": 0.60,
    "Jardin & Extérieur": 0.40,
    "Jeux Vidéo": 0.50,
    "Jouets & Loisirs": 0.60,
    "Livres & Revues": 0.30,
    "Maison": 0.40,
    "Mode": 0.35,
    "Santé & Beauté": 0.35,
    "Sports & Loisirs": 0.45,
}

RUNS = [
    RunCfg(
        name="R_MAX_PERF_simplex_group_wF1",
        # For a "leaderboard-like" score: keep False (non-strict split).
        # For a more robust estimate (duplicate-safe), set True.
        use_group_split_for_val=False,
        filter_lang_fr=False,
        transformer_models=TRANSFORMER_MODELS,
        transformer_seeds=TRANSFORMER_SEEDS,
        blend_mode_legacy="global",
        use_tfidf=True,
        use_fasttext=False,
        blend_mode="simplex_group_search",
        simplex_samples_global=14000,
        simplex_samples_group=3000,
        simplex_alpha=1.0,
        group_shrink=0.30,
        include_transformers_individual=True,
        include_transformer_mean=True,
    ),
]

# Cache (logits/probabilities) for reruns
CACHE = {}

GPU VRAM (GB): 79.3


In [16]:
# =========================
# 11) Run execution
# =========================

def get_split(df_base: pd.DataFrame, filter_fr: bool, use_group_split: bool):
    """Return (train_base_df, blend_df, test_df).

    Robustness:
    - Calls maybe_filter_fr ONLY if filter_fr=True.
    - If maybe_filter_fr is not defined, skips the FR filter and continues.
    - The cache key includes the split sizes to avoid reusing a stale split.
    """
    key = ("split", bool(filter_fr), bool(use_group_split), float(VAL_SIZE), float(BLEND_SIZE_IN_TRAIN), int(SEED))
    if key in CACHE:
        return CACHE[key]

    df_run = df_base

    # Apply the FR filter only if requested
    if filter_fr:
        fn = globals().get("maybe_filter_fr", None)
        if callable(fn):
            df_run = fn(df_base, filter_fr=True)
        else:
            print("⚠️  maybe_filter_fr is not defined: FR filter skipped (fr_only=True but the langdetect cell was not run).")
            df_run = df_base

    train_base_df, blend_df, test_df = split_train_val(df_run, use_group_split)
    CACHE[key] = (train_base_df, blend_df, test_df)
    return train_base_df, blend_df, test_df


def run_one(cfg: RunCfg):
    print("\n" + "="*100)
    print("RUN:", cfg.name)
    print("group_split:", cfg.use_group_split_for_val, "| fr_only:", cfg.filter_lang_fr)
    print("models:", cfg.transformer_models, "| blend_mode:", cfg.blend_mode)

    run_dir = OUT_DIR / cfg.name
    run_dir.mkdir(parents=True, exist_ok=True)

    train_base_df, blend_df, test_df = get_split(df, cfg.filter_lang_fr, cfg.use_group_split_for_val)


    # TF-IDF (cached per split)
    P_t_bl = P_t_va = None
    met_t = None
    if cfg.use_tfidf:
        key_t = ("tfidf", cfg.filter_lang_fr, cfg.use_group_split_for_val)
        if key_t in CACHE:
            P_t_bl, P_t_va, met_t = CACHE[key_t]
            print("TF-IDF cached.")
        else:
            _, P_t_bl, P_t_va, met_t = train_tfidf(train_base_df, blend_df, test_df)
            CACHE[key_t] = (P_t_bl, P_t_va, met_t)
        print("TF-IDF val:", {k: round(float(v), 5) for k,v in met_t.items()})

    # fastText (optional), cached per split
    if cfg.use_fasttext:
        key_ft = ("fasttext", cfg.filter_lang_fr, cfg.use_group_split_for_val)
        if key_ft in CACHE:
            ft_model, P_ft_bl, P_ft_va, met_ft = CACHE[key_ft]
            print("fastText cached")
        else:
            ft_model, P_ft_bl, P_ft_va, met_ft = train_fasttext(train_base_df, blend_df, test_df)
            CACHE[key_ft] = (ft_model, P_ft_bl, P_ft_va, met_ft)
        if met_ft is not None:
            print("fastText val:", {k: round(float(v), 5) for k,v in met_ft.items()})

    # Transformers (cached per split+model+seed)
    logits_bl_list = []
    logits_va_list = []

    for i, model_name in enumerate(cfg.transformer_models):
        seed = cfg.transformer_seeds[i] if i < len(cfg.transformer_seeds) else cfg.transformer_seeds[0]
        key_m = ("tr", cfg.filter_lang_fr, cfg.use_group_split_for_val, model_name, seed)

        if key_m in CACHE:
            logits_bl, y_bl_ids, logits_va, y_va_ids, T = CACHE[key_m]
            print(f"Transformer cached: {model_name} seed={seed} (T={T:.3f})")
        else:
            out_dir = str(run_dir / f"model_{i}_{seed}")
            trainer, tok, met = train_transformer_one(
                model_name=model_name,
                train_base_df=train_base_df,
                blend_df=blend_df,
                seed=seed,
                out_dir=out_dir,
            )
            print("Blend metrics:", {k: round(float(v), 5) for k,v in met.items()})

            logits_bl, y_bl_ids = predict_logits(trainer, blend_df, tok)
            logits_va, y_va_ids = predict_logits(trainer, test_df, tok)

            # temperature scaling fitted on the blend set
            T = temperature_scale_logits(logits_bl, y_bl_ids, max_iters=TEMP_MAX_ITERS, lr=TEMP_LR)
            CACHE[key_m] = (logits_bl, y_bl_ids, logits_va, y_va_ids, T)
            print(f"Temp {model_name} seed={seed}: T={T:.3f}")

        logits_bl_list.append(logits_bl / float(T))
        logits_va_list.append(logits_va / float(T))

    # Ensemble logits (mean)
    logits_bl_ens = np.mean(np.stack(logits_bl_list, axis=0), axis=0)
    logits_va_ens = np.mean(np.stack(logits_va_list, axis=0), axis=0)

    P_tr_bl = softmax_np(logits_bl_ens)
    P_tr_va = softmax_np(logits_va_ens)

    # y ids
    y_bl = y_to_ids(blend_df[LABEL_COL].values)
    y_va = y_to_ids(test_df[LABEL_COL].values)

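    # sanity: label ids returned by the transformer predictions should match the y_to_ids mapping
    if not (np.array_equal(y_bl, y_bl_ids) and np.array_equal(y_va, y_va_ids)):
        print("⚠️ Label id mismatch between transformer predictions and y_to_ids mapping; metrics below may be wrong.")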

    pred_tr = P_tr_va.argmax(axis=1)
    acc_tr = float((pred_tr == y_va).mean())
    f1w_tr = float(f1_score(y_va, pred_tr, average="weighted"))
    print("Transformer ensemble val:", {"acc": round(acc_tr,5), "f1_weighted": round(f1w_tr,5)})

    # Fusion / Blending multi-composants
    # Components:
    # - TF-IDF (probabilities)
    # - fastText (probabilities)
    # - Transformers: per-model probabilities (and optionally their mean)
    comp_names = []
    P_bl_list = []
    P_va_list = []

    if cfg.use_tfidf:
        comp_names.append("tfidf_svm")
        P_bl_list.append(P_t_bl)
        P_va_list.append(P_t_va)

    if cfg.use_fasttext and ("P_ft_bl" in locals()) and (P_ft_bl is not None):
        comp_names.append("fasttext")
        P_bl_list.append(P_ft_bl)
        P_va_list.append(P_ft_va)

    # Transformers: per model (more flexible than a fixed mean)
    if cfg.include_transformers_individual:
        for j, model_name in enumerate(cfg.transformer_models):
            comp_names.append(f"tr_{j}:" + Path(model_name).name)
            P_bl_list.append(softmax_np(logits_bl_list[j]))
            P_va_list.append(softmax_np(logits_va_list[j]))

    if cfg.include_transformer_mean:
        comp_names.append("tr_mean")
        P_bl_list.append(P_tr_bl)
        P_va_list.append(P_tr_va)

    assert len(P_bl_list) >= 2, "Need at least 2 components for simplex blending."

    # 1) Global simplex weights (Dirichlet) + refine
    w0, f1_0 = search_dirichlet_weights(
        P_bl_list, y_bl,
        n_samples=int(cfg.simplex_samples_global),
        alpha=float(cfg.simplex_alpha),
        seed=SEED,
    )
    w1, f1_1 = refine_weights_coordinate(P_bl_list, y_bl, w0, step=0.05, n_rounds=4)
    w_global = w1 if f1_1 >= f1_0 else w0

    print("Best simplex global f1_weighted (blend):", round(max(f1_0, f1_1), 6))
    print("Weights global:")
    for n, w in zip(comp_names, w_global):
        print(f"  {n:25s} {float(w):.4f}")

    # 2) Groupwise per super-category (act on class columns), objective weighted-F1
    if cfg.blend_mode == "simplex_group_search":
        w_by = grid_search_group_simplex_fixed(
            P_bl_list, y_bl,
            w_global=w_global,
            n_samples_per_super=int(cfg.simplex_samples_group),
            alpha=float(cfg.simplex_alpha),
            shrink=float(cfg.group_shrink),
            seed=SEED,
        )
        P_val = apply_group_blend_multi_fixed(P_va_list, w_by, w_fallback=w_global)
        meta = {
            "blend_mode": "simplex_group_search_fixed",
            "components": comp_names,
            "w_global": [float(x) for x in w_global],
            "w_by_super": {k: [float(x) for x in v] for k,v in w_by.items()},
            "params": {
                "simplex_samples_global": int(cfg.simplex_samples_global),
                "simplex_samples_group": int(cfg.simplex_samples_group),
                "simplex_alpha": float(cfg.simplex_alpha),
                "group_shrink": float(cfg.group_shrink),
            }
        }
    else:
        P_val = _blend_with_weights(P_va_list, w_global)
        meta = {
            "blend_mode": "simplex_global",
            "components": comp_names,
            "w_global": [float(x) for x in w_global],
            "params": {
                "simplex_samples_global": int(cfg.simplex_samples_global),
                "simplex_alpha": float(cfg.simplex_alpha),
            }
        }

    metrics = export_run_artifacts(run_dir, test_df, y_va, P_val)
    with open(run_dir / "run_meta.json", "w") as f:
        json.dump({**cfg.__dict__, **meta}, f, indent=2)

    print("VAL metrics:", {k: round(v, 5) for k,v in metrics.items()})
    return {"run": cfg.name, **metrics}

results = []
for cfg in RUNS:
    results.append(run_one(cfg))

res_df = pd.DataFrame(results).sort_values("f1_weighted", ascending=False)
print("\n=== SUMMARY (val) ===")
display(res_df)
res_df.to_csv(OUT_DIR / "summary.csv", index=False)


RUN: R_MAX_PERF_simplex_group_wF1
group_split: False | fr_only: False
models: ['almanach/camembert-large', 'FacebookAI/xlm-roberta-large', 'flaubert/flaubert_large_cased', 'microsoft/mdeberta-v3-base'] | blend_mode: simplex_group_search
train_base: (61351, 10) blend: (10827, 10) val: (12738, 10)
overlap text_hash train_full/val = 317 ratio_val = 0.024886167373214006
TF-IDF val: {'acc': 0.86552, 'f1_weighted': 0.86499, 'f1_macro': 0.85233}


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Some weights of CamembertModel were not initialized from the model checkpoint at almanach/camembert-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1 Weighted,F1 Macro
1,0.5091,0.46224,0.859518,0.859003,0.837559
2,0.3664,0.369394,0.891383,0.891031,0.876606
3,0.2452,0.398353,0.8948,0.895389,0.882884
4,0.1682,0.440505,0.900065,0.900251,0.890567
5,0.0986,0.47909,0.906807,0.906842,0.897633
6,0.0667,0.502226,0.912164,0.912073,0.904086


Blend metrics: {'eval_loss': 0.50223, 'eval_accuracy': 0.91216, 'eval_f1_weighted': 0.91207, 'eval_f1_macro': 0.90409, 'eval_runtime': 26.3309, 'eval_samples_per_second': 411.19, 'eval_steps_per_second': 25.711, 'epoch': 6.0}


Temp almanach/camembert-large seed=41: T=1.775
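
The `T=...` values logged for each Transformer are temperature-scaling factors: logits are divided by `T` before the softmax, and a `T` above 1 softens over-confident probabilities. The fitting routine is not shown in this part of the notebook, so the block below is only a minimal sketch under assumptions: it picks `T` by a simple grid search that minimizes the negative log-likelihood on held-out logits (the helper names `nll` and `fit_temperature` and the toy data are hypothetical, not the notebook's code).

```python
import numpy as np

def softmax(logits):
    # numerically stable row-wise softmax
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, y, T):
    # mean negative log-likelihood of the true labels at temperature T
    P = softmax(logits / T)
    return float(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12)))

def fit_temperature(logits, y, grid=None):
    # assumption: plain grid search; an optimizer such as L-BFGS is a common alternative
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)
    losses = [nll(logits, y, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Toy data: calibrated logits, then made over-confident by a factor of 1.7
rng = np.random.default_rng(0)
n, c = 2000, 5
y = rng.integers(0, c, size=n)
x = rng.normal(size=(n, c))
x[np.arange(n), y] += 2.0
calibrated = 2.0 * x            # exact posterior logits for this toy generator
logits = 1.7 * calibrated       # over-confident version

T_fit = fit_temperature(logits, y)
print("fitted T:", T_fit)
print("NLL before:", nll(logits, y, 1.0), "after:", nll(logits, y, T_fit))
```

On this toy setup the fitted temperature should land close to the over-confidence factor, and the NLL at the fitted `T` should be lower than at `T=1`.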


Epoch,Training Loss,Validation Loss,Accuracy,F1 Weighted,F1 Macro
1,0.6035,0.510131,0.850097,0.849776,0.827718
2,0.4178,0.450888,0.869308,0.869012,0.850033
3,0.314,0.414311,0.888427,0.888537,0.873271
4,0.2477,0.417695,0.895077,0.895114,0.880201
5,0.1537,0.464014,0.899326,0.899108,0.887185
6,0.1139,0.504498,0.900619,0.900272,0.887076


Blend metrics: {'eval_loss': 0.5045, 'eval_accuracy': 0.90062, 'eval_f1_weighted': 0.90027, 'eval_f1_macro': 0.88708, 'eval_runtime': 26.4347, 'eval_samples_per_second': 409.576, 'eval_steps_per_second': 25.61, 'epoch': 6.0}


Temp FacebookAI/xlm-roberta-large seed=42: T=1.728


Epoch,Training Loss,Validation Loss,Accuracy,F1 Weighted,F1 Macro
1,0.598,0.557155,0.831717,0.831884,0.80587
2,0.4271,0.417085,0.875866,0.875649,0.85896
3,0.3275,0.385314,0.887134,0.88718,0.873888
4,0.2344,0.400806,0.891567,0.891311,0.878606
5,0.2088,0.421166,0.897756,0.897613,0.886905
6,0.1374,0.441438,0.898217,0.898178,0.887637


Blend metrics: {'eval_loss': 0.44144, 'eval_accuracy': 0.89822, 'eval_f1_weighted': 0.89818, 'eval_f1_macro': 0.88764, 'eval_runtime': 40.522, 'eval_samples_per_second': 267.188, 'eval_steps_per_second': 16.707, 'epoch': 6.0}


Temp flaubert/flaubert_large_cased seed=43: T=1.606


Epoch,Training Loss,Validation Loss,Accuracy,F1 Weighted,F1 Macro
1,0.6012,0.568704,0.832825,0.830366,0.795388
2,0.4519,0.467547,0.859518,0.859237,0.836871
3,0.3776,0.43853,0.870694,0.870364,0.852431
4,0.275,0.43503,0.881685,0.881185,0.86612
5,0.2516,0.419743,0.890736,0.89057,0.876279
6,0.1824,0.436224,0.893692,0.893356,0.879642


Blend metrics: {'eval_loss': 0.43622, 'eval_accuracy': 0.89369, 'eval_f1_weighted': 0.89336, 'eval_f1_macro': 0.87964, 'eval_runtime': 30.2332, 'eval_samples_per_second': 358.116, 'eval_steps_per_second': 22.393, 'epoch': 6.0}


Temp microsoft/mdeberta-v3-base seed=44: T=1.447
Transformer ensemble val: {'acc': 0.91443, 'f1_weighted': 0.91402}
Best simplex global f1_weighted (blend): 0.919639
Weights global:
  tfidf_svm                 0.3723
  tr_0:camembert-large      0.2144
  tr_1:xlm-roberta-large    0.0948
  tr_2:flaubert_large_cased 0.0267
  tr_3:mdeberta-v3-base     0.0052
  tr_mean                   0.2866


NameError: name '_norm_w' is not defined
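
The global weights printed above come from a random search on the probability simplex: candidate weight vectors are drawn from a Dirichlet distribution and the one with the best weighted F1 on the blend set is kept. The body of `search_dirichlet_weights` is not shown here, so the following is a hedged sketch of the idea only; `sketch_dirichlet_search`, `weighted_f1` and the toy data are illustrative names and assumptions, not the notebook's implementation.

```python
import numpy as np

def weighted_f1(y_true, y_pred, n_classes):
    # support-weighted average of per-class F1 scores
    total = 0.0
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        total += f1 * np.sum(y_true == c)
    return float(total / len(y_true))

def blend_score(w, stack, y):
    # weighted sum over models, then argmax over classes
    P = np.tensordot(w, stack, axes=(0, 0))
    return weighted_f1(y, P.argmax(axis=1), stack.shape[2])

def sketch_dirichlet_search(P_list, y, n_samples=500, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    stack = np.stack(P_list, axis=0).astype(np.float64)  # (K, N, C)
    K = stack.shape[0]
    best_w = np.ones(K) / K                 # start from uniform weights
    best_f1 = blend_score(best_w, stack, y)
    for _ in range(n_samples):
        w = rng.dirichlet(alpha * np.ones(K))  # non-negative, sums to 1
        f1 = blend_score(w, stack, y)
        if f1 > best_f1:
            best_w, best_f1 = w, f1
    return best_w, best_f1

# Toy components of decreasing quality
rng = np.random.default_rng(1)
n, c = 600, 4
y = rng.integers(0, c, size=n)

def noisy_probs(strength):
    logits = rng.normal(size=(n, c))
    logits[np.arange(n), y] += strength
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

P_list = [noisy_probs(2.0), noisy_probs(1.0), noisy_probs(0.3)]
f1_uniform = blend_score(np.ones(3) / 3, np.stack(P_list, axis=0), y)
best_w, best_f1 = sketch_dirichlet_search(P_list, y)
print("uniform f1:", f1_uniform, "best f1:", best_f1, "weights:", best_w)
```

Because the search starts from the uniform vector and only accepts strict improvements, the returned score can never be worse than uniform averaging on the data it was tuned on; it says nothing about generalization to the validation split.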

In [18]:
# ============================================================
# RE-BLENDING WITHOUT RETRAINING: groupwise simplex FIX
# - Recomputes w_global + w_by_super + the final P_val
# - Fixes the renormalization bug "within each super-category"
# - Exports to OUT_DIR/<RUN_NAME>_REBLEND_FIXED/
# ============================================================

import numpy as np
import json
from pathlib import Path

# ----------------------------
# 0) Run selection
# ----------------------------
cfg = RUNS[0]  # RUN3, the only run in this notebook
RUN_NAME = cfg.name
print("Reblend run =", RUN_NAME)

run_dir_fixed = Path(OUT_DIR) / f"{RUN_NAME}_REBLEND_FIXED"
run_dir_fixed.mkdir(parents=True, exist_ok=True)

# ----------------------------
# 1) Split (must be identical to the original run)
# ----------------------------
train_base_df, blend_df, test_df = get_split(df, cfg.filter_lang_fr, cfg.use_group_split_for_val)

y_bl = y_to_ids(blend_df[LABEL_COL].values)
y_va = y_to_ids(test_df[LABEL_COL].values)

# ----------------------------
# 2) Fixed blending helpers
# ----------------------------
def _norm_w(w):
    """Clip weights to [0, 1] and rescale them to sum to 1 (uniform if the sum is not positive)."""
    w = np.asarray(w, dtype=np.float64)
    w = np.clip(w, 0.0, 1.0)
    s = w.sum()
    if s <= 0:
        return np.ones_like(w) / len(w)
    return w / s

def apply_group_blend_multi_fixed(P_list, w_by_super, w_fallback=None):
    """
    FIXED version:
    - Do NOT normalize inside a sub-group of classes.
    - Replace the group's columns (classes) with a weighted sum.
    - Normalize ONCE at the end (over all classes).
    """
    # Base: either the global blend (recommended), or the first model if no fallback is given
    if w_fallback is not None:
        P_out = _blend_with_weights(P_list, w_fallback).astype(np.float64)
    else:
        P_out = P_list[0].astype(np.float64).copy()

    for sup, w in w_by_super.items():
        cols = [i for i, lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not cols:
            continue

        w = _norm_w(w)  # ensures the weights sum to 1
        sub_stack = np.stack([P[:, cols] for P in P_list], axis=0).astype(np.float64)  # (K,N,|cols|)
        sub = np.tensordot(w, sub_stack, axes=(0, 0))  # (N,|cols|) -> weighted sum, NO internal renormalization
        P_out[:, cols] = sub

    # Final global normalization
    P_out = P_out / np.maximum(P_out.sum(axis=1, keepdims=True), 1e-12)
    return P_out.astype(np.float32)

def grid_search_group_simplex_fixed(P_list_bl, y_true, w_global,
                                   n_samples_per_super=2000, alpha=1.0, shrink=0.35, seed=42, verbose=True):
    """
    Same logic as grid_search_group_simplex, but relies on apply_group_blend_multi_fixed()
    to avoid the bug at the final step.
    """
    rng = np.random.default_rng(seed)
    K = len(P_list_bl)

    # init: every super-category starts from the global weights
    w_by = {s: np.asarray(w_global, dtype=np.float32).copy() for s in SUPERS}

    # current predictions (fallback = global weights)
    P_cur = apply_group_blend_multi_fixed(P_list_bl, w_by, w_fallback=w_global)
    best_global = _score_weighted_f1(P_cur, y_true)

    for sup in SUPERS:
        cols = [i for i, lab in enumerate(label_list) if label_to_super.get(int(lab), None) == sup]
        if not cols:
            continue

        best_w = w_by[sup].copy()
        best_f1 = best_global

        sub_stack = np.stack([P[:, cols] for P in P_list_bl], axis=0).astype(np.float64)  # (K,N,|cols|)

        # candidates: current best + global + Dirichlet draws
        cand_ws = [best_w.astype(np.float64), np.asarray(w_global, dtype=np.float64)]
        for _ in range(int(n_samples_per_super)):
            cand_ws.append(rng.dirichlet(alpha * np.ones(K)))

        # evaluate the candidates for this super-category
        wg = _norm_w(w_global)  # loop-invariant, computed once
        for w_raw in cand_ws:
            w = _norm_w(w_raw)

            # shrink: pull towards w_global to limit overfitting on the blend set
            w = (1.0 - float(shrink)) * w + float(shrink) * wg
            w = _norm_w(w)

            # replace only this super-category's columns in P_cur
            # NOTE: P_cur is already row-normalized, so this candidate score can differ
            # slightly from the one recomputed by apply_group_blend_multi_fixed() below.
            P_tmp = P_cur.astype(np.float64).copy()
            sub = np.tensordot(w, sub_stack, axes=(0, 0))  # (N,|cols|)
            P_tmp[:, cols] = sub

            # global renormalization
            P_tmp = P_tmp / np.maximum(P_tmp.sum(axis=1, keepdims=True), 1e-12)

            f1w = _score_weighted_f1(P_tmp, y_true)
            if f1w > best_f1:
                best_f1 = f1w
                best_w = w.astype(np.float32)

        # freeze the best weights for this super-category
        w_by[sup] = best_w

        # update P_cur by applying the best weights found for this super-category
        P_cur = apply_group_blend_multi_fixed(P_list_bl, w_by, w_fallback=w_global)
        best_global = _score_weighted_f1(P_cur, y_true)

        if verbose:
            print(f"  [group_simplex_fixed] {sup}: f1_weighted={best_global:.6f}")

    return w_by

# ----------------------------
# 3) Rebuild the components from CACHE
# ----------------------------
comp_names, P_bl_list, P_va_list = [], [], []

# 3a) TF-IDF: load from the cache if possible, otherwise retrain quickly
if cfg.use_tfidf:
    key_t = ("tfidf", cfg.filter_lang_fr, cfg.use_group_split_for_val)
    if key_t in CACHE:
        P_t_bl, P_t_va, met_t = CACHE[key_t]
        print("TF-IDF loaded from CACHE.")
    else:
        # fallback: train TF-IDF (fast)
        _, P_t_bl, P_t_va, met_t = train_tfidf(train_base_df, blend_df, test_df)
        CACHE[key_t] = (P_t_bl, P_t_va, met_t)
        print("TF-IDF retrained (quick fallback).")

    comp_names.append("tfidf_svm")
    P_bl_list.append(P_t_bl)
    P_va_list.append(P_t_va)

# 3b) fastText: load from the cache if possible (otherwise it can be skipped to save time)
if cfg.use_fasttext:
    key_ft = ("fasttext", cfg.filter_lang_fr, cfg.use_group_split_for_val)
    if key_ft in CACHE:
        ft_model, P_ft_bl, P_ft_va, met_ft = CACHE[key_ft]
        if P_ft_bl is not None:
            comp_names.append("fasttext")
            P_bl_list.append(P_ft_bl)
            P_va_list.append(P_ft_va)
        print("fastText loaded from CACHE.")
    else:
        print("fastText missing from CACHE -> skip (to avoid retraining).")

# 3c) Transformers: rebuild from the logits stored in CACHE (no retraining)
logits_bl_list, logits_va_list = [], []
for i, model_name in enumerate(cfg.transformer_models):
    seed = cfg.transformer_seeds[i] if i < len(cfg.transformer_seeds) else cfg.transformer_seeds[0]
    key_m = ("tr", cfg.filter_lang_fr, cfg.use_group_split_for_val, model_name, seed)

    if key_m not in CACHE:
        raise RuntimeError(
            f"Missing logits in CACHE for {model_name} seed={seed}. "
            "Run the full pipeline at least once (or reload the logits)."
        )

    logits_bl, y_bl_ids, logits_va, y_va_ids, T = CACHE[key_m]
    # temperature scaling: divide the logits by the cached temperature T
    logits_bl_list.append(logits_bl / float(T))
    logits_va_list.append(logits_va / float(T))

# Add the individual Transformers
if cfg.include_transformers_individual:
    for j, model_name in enumerate(cfg.transformer_models):
        comp_names.append(f"tr_{j}:" + Path(model_name).name)
        P_bl_list.append(softmax_np(logits_bl_list[j]).astype(np.float32))
        P_va_list.append(softmax_np(logits_va_list[j]).astype(np.float32))

# Add the Transformer mean (average of the temperature-scaled logits)
logits_bl_ens = np.mean(np.stack(logits_bl_list, axis=0), axis=0)
logits_va_ens = np.mean(np.stack(logits_va_list, axis=0), axis=0)
P_tr_bl = softmax_np(logits_bl_ens).astype(np.float32)
P_tr_va = softmax_np(logits_va_ens).astype(np.float32)

if cfg.include_transformer_mean:
    comp_names.append("tr_mean")
    P_bl_list.append(P_tr_bl)
    P_va_list.append(P_tr_va)

assert len(P_bl_list) >= 2, "Need at least 2 components for simplex blending."

print("\nComposants utilisés pour le reblending :")
for n in comp_names:
    print(" -", n)

# ----------------------------
# 4) Recompute w_global on BLEND (Dirichlet + refine)
# ----------------------------
w0, f1_0 = search_dirichlet_weights(
    P_bl_list, y_bl,
    n_samples=int(cfg.simplex_samples_global),
    alpha=float(cfg.simplex_alpha),
    seed=SEED,
)
w1, f1_1 = refine_weights_coordinate(P_bl_list, y_bl, w0, step=0.05, n_rounds=4)
w_global = w1 if f1_1 >= f1_0 else w0

print("\n✅ Simplex GLOBAL (BLEND)")
print("Best f1_weighted (blend) =", round(max(f1_0, f1_1), 6))
print("Weights global :")
for n, w in zip(comp_names, w_global):
    print(f"  {n:25s} {float(w):.4f}")

# ----------------------------
# 5) Recompute the FIXED groupwise weights (if requested) + final P_val
# ----------------------------
if cfg.blend_mode == "simplex_group_search":
    w_by_super = grid_search_group_simplex_fixed(
        P_list_bl=P_bl_list,
        y_true=y_bl,
        w_global=w_global,
        n_samples_per_super=int(cfg.simplex_samples_group),
        alpha=float(cfg.simplex_alpha),
        shrink=float(cfg.group_shrink),
        seed=SEED,
        verbose=True,
    )
    P_val = apply_group_blend_multi_fixed(P_va_list, w_by_super, w_fallback=w_global)
    meta = {
        "blend_mode": "simplex_group_search_FIXED",
        "components": comp_names,
        "w_global": [float(x) for x in w_global],
        "w_by_super": {k: [float(x) for x in v] for k, v in w_by_super.items()},
        "params": {
            "simplex_samples_global": int(cfg.simplex_samples_global),
            "simplex_samples_group": int(cfg.simplex_samples_group),
            "simplex_alpha": float(cfg.simplex_alpha),
            "group_shrink": float(cfg.group_shrink),
        }
    }
else:
    P_val = _blend_with_weights(P_va_list, w_global)
    meta = {
        "blend_mode": "simplex_global",
        "components": comp_names,
        "w_global": [float(x) for x in w_global],
        "params": {
            "simplex_samples_global": int(cfg.simplex_samples_global),
            "simplex_alpha": float(cfg.simplex_alpha),
        }
    }

# ----------------------------
# 6) Metrics + exports (to a new folder)
# ----------------------------
metrics = export_run_artifacts(run_dir_fixed, test_df, y_va, P_val)

with open(run_dir_fixed / "run_meta.json", "w", encoding="utf-8") as f:
    json.dump({**cfg.__dict__, **meta}, f, ensure_ascii=False, indent=2)

print("\n✅ REBLEND terminé")
print("Run dir :", run_dir_fixed)
print("VAL metrics :", {k: round(float(v), 6) for k, v in metrics.items()})
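
To make the fix concrete, here is a tiny self-contained illustration of the rule used by `apply_group_blend_multi_fixed`: each group of class columns receives its own weighted sum of the component probabilities, and rows are normalized only once at the very end. The function `toy_group_blend`, the two-model probabilities and the group definitions are made up for illustration; this is a sketch of the principle, not a copy of the notebook's helper.

```python
import numpy as np

def toy_group_blend(P_list, groups, w_by_group, w_global):
    stack = np.stack(P_list, axis=0).astype(np.float64)        # (K, N, C)
    # start from the global blend so that ungrouped columns keep a sensible value
    P_out = np.tensordot(np.asarray(w_global, dtype=np.float64), stack, axes=(0, 0))
    for g, cols in groups.items():
        w = np.asarray(w_by_group[g], dtype=np.float64)
        # weighted sum on this group's columns, no renormalization inside the group
        P_out[:, cols] = np.tensordot(w, stack[:, :, cols], axes=(0, 0))
    # single normalization over all classes
    return P_out / np.maximum(P_out.sum(axis=1, keepdims=True), 1e-12)

P1 = np.array([[0.6, 0.3, 0.1]])   # model 1, one sample, three classes
P2 = np.array([[0.2, 0.2, 0.6]])   # model 2
groups = {"A": [0, 1], "B": [2]}   # hypothetical super-categories
w_global = [0.5, 0.5]

# Group A trusts model 1, group B trusts model 2
P = toy_group_blend([P1, P2], groups, {"A": [0.9, 0.1], "B": [0.2, 0.8]}, w_global)
print(P, P.sum(axis=1))

# With identical weights everywhere, the result equals the plain global blend
P_same = toy_group_blend([P1, P2], groups, {"A": w_global, "B": w_global}, w_global)
print(np.allclose(P_same, 0.5 * P1 + 0.5 * P2))
```

In the first call the raw row (0.56, 0.29, 0.50) sums to 1.35 before normalization; dividing the whole row by that sum preserves the ratio between classes of the same group, which is what normalizing inside each group would have distorted relative to the other groups.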


Reblend run = R_MAX_PERF_simplex_group_wF1
TF-IDF loaded from CACHE.

Composants utilisés pour le reblending :
 - tfidf_svm
 - tr_0:camembert-large
 - tr_1:xlm-roberta-large
 - tr_2:flaubert_large_cased
 - tr_3:mdeberta-v3-base
 - tr_mean

✅ Simplex GLOBAL (BLEND)
Best f1_weighted (blend) = 0.919639
Weights global :
  tfidf_svm                 0.3723
  tr_0:camembert-large      0.2144
  tr_1:xlm-roberta-large    0.0948
  tr_2:flaubert_large_cased 0.0267
  tr_3:mdeberta-v3-base     0.0052
  tr_mean                   0.2866
  [group_simplex_fixed] Autres: f1_weighted=0.920209
  [group_simplex_fixed] Bébé: f1_weighted=0.920216
  [group_simplex_fixed] Collection: f1_weighted=0.920503
  [group_simplex_fixed] Jardin & Extérieur: f1_weighted=0.920612
  [group_simplex_fixed] Jeux Vidéo: f1_weighted=0.920891
  [group_simplex_fixed] Jouets & Loisirs: f1_weighted=0.920902
  [group_simplex_fixed] Livres & Revues: f1_weighted=0.920721
  [group_simplex_fixed] Maison: f1_weighted=0.920796

✅ REBLEND 