# <a id='toc1_'></a>[Projet 7 : Réalisez une analyse de sentiments grâce au Deep Learning](#toc0_)
# <a id='toc2_'></a>[Modèle avancé BERT](#toc0_)

[Lien OpenClassroom](https://openclassrooms.com/fr/paths/795/projects/1516/1578-mission)

---

**Table of contents**<a id='toc0_'></a>    
 

<!-- vscode-jupyter-toc-config
    numbering=false
    anchor=true
    flat=false
    minLevel=2
    maxLevel=6
    /vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---
---

## <a id='toc1_'></a>[Imports](#toc0_)

In [10]:
import os

os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, TFBertForSequenceClassification
import mlflow
import mlflow.tensorflow
import logging
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.callbacks import EarlyStopping

# Configuration du logging pour TensorFlow et Transformers (moins verbeux)
tf.get_logger().setLevel(logging.ERROR)
logging.getLogger("transformers").setLevel(logging.ERROR)

---
---

## <a id='toc2_'></a>[Chargement des données](#toc0_)

In [11]:
TRAIN_DATA_PATH = "./train_data.csv"
VAL_DATA_PATH = "./validation_data.csv"
TEST_DATA_PATH = "./test_data.csv"

train_df = pd.read_csv(TRAIN_DATA_PATH).sample(10000)
val_df = pd.read_csv(VAL_DATA_PATH).sample(3000)
test_df = pd.read_csv(TEST_DATA_PATH)

# Handle potential NaN values in 'cleaned_text' that might result from preprocessing
train_df["cleaned_text"].fillna("", inplace=True)
val_df["cleaned_text"].fillna("", inplace=True)
test_df["cleaned_text"].fillna("", inplace=True)


X_train = train_df["cleaned_text"].to_list()
y_train = (
    train_df["sentiment"].replace({"negative": 0, "positive": 1}).astype(int).to_list()
)
X_val = val_df["cleaned_text"].to_list()
y_val = (
    val_df["sentiment"].replace({"negative": 0, "positive": 1}).astype(int).to_list()
)
X_test = test_df["cleaned_text"].to_list()
y_test = (
    test_df["sentiment"].replace({"negative": 0, "positive": 1}).astype(int).to_list()
)

print("Data loaded successfully:")
print(f"Train samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")
print(f"Test samples: {len(X_test)}")

Data loaded successfully:
Train samples: 10000
Validation samples: 3000
Test samples: 238738


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train_df["cleaned_text"].fillna("", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  val_df["cleaned_text"].fillna("", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting va

---
---

## <a id='toc3_'></a>[Préparation pour BERT](#toc0_)

In [12]:
EXPERIMENT_NAME = "Tweet Sentiment Analysis - BERT Models"
mlflow.set_experiment(EXPERIMENT_NAME)
print(f"MLflow Experiment: {EXPERIMENT_NAME}")

# --- 1. Paramètres et Données ---

MODEL_NAME = "distilbert-base-uncased"
MAX_LENGTH = 16
BATCH_SIZE = 16
EPOCHS = 6
LEARNING_RATE = 0.0001

MLflow Experiment: Tweet Sentiment Analysis - BERT Models


---

### <a id='toc3_1_'></a>[Création d'un Tokenizer](#toc0_)

In [None]:
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)


def encode_texts(texts):
    return tokenizer(
        texts,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
        return_attention_mask=True,
        return_token_type_ids=False,
        return_tensors="tf",
    )


train_encodings = encode_texts(X_train)
val_encodings = encode_texts(X_val)
test_encodings = encode_texts(X_test)

# Conversion en tf.data.Dataset pour l'efficacité
train_dataset = (
    tf.data.Dataset.from_tensor_slices((dict(train_encodings), y_train))
    .shuffle(len(X_train))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

val_dataset = (
    tf.data.Dataset.from_tensor_slices((dict(val_encodings), y_val))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

test_dataset = (
    tf.data.Dataset.from_tensor_slices((dict(test_encodings), y_test))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

print("\nExemple d'encodage (première phrase d'entraînement):")
for key, value in train_encodings.items():
    print(f"{key}: {value[0].numpy().tolist()[:10]}...")  # Affiche les 10


Exemple d'encodage (première phrase d'entraînement):
input_ids: [101, 9379, 2154, 5199, 102, 0, 0, 0, 0, 0]...
attention_mask: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]...


---

### Construction modèle BERT pour Classification

In [None]:
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Compile the model
optimizer = Adam(learning_rate=2e-5)
loss = SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

# Prepare TensorFlow datasets
train_dataset = (
    tf.data.Dataset.from_tensor_slices((dict(train_encodings), y_train))
    .shuffle(1000)
    .batch(16)
)

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), y_test)).batch(
    16
)

---
---

## Entrainement

---

### <a id='toc4_1_'></a>[MLFlow Setup](#toc0_)

In [15]:
EXPERIMENT_NAME = "Tweet Sentiment Analysis - BERT Models"
mlflow.set_experiment(EXPERIMENT_NAME)
print(f"MLflow experiment set to: '{EXPERIMENT_NAME}'")
MODEL_ARTIFACT_PATH_BERT = "bert-model"

MLflow experiment set to: 'Tweet Sentiment Analysis - BERT Models'


---

### <a id='toc4_2_'></a>[Entrainement du modèle avec MLFlow](#toc0_)

In [None]:
with mlflow.start_run(run_name="DistilBERT_FineTuning_CPU") as run:
    mlflow.log_param("model_name", MODEL_NAME)
    mlflow.log_param("max_length", MAX_LENGTH)
    mlflow.log_param("batch_size", BATCH_SIZE)
    mlflow.log_param("epochs", EPOCHS)
    mlflow.log_param("learning_rate", LEARNING_RATE)
    mlflow.log_param("train_samples", len(X_train))
    mlflow.log_param("val_samples", len(X_val))
    mlflow.log_param("test_samples", len(X_test))

    print("\n--- Démarrage de l'entraînement ---")
    mlflow.tensorflow.autolog(
        log_models=False,
        log_input_examples=False,
        log_model_signatures=False,
    )

    history = model.fit(train_dataset, epochs=EPOCHS, validation_data=val_dataset)
    print("--- Entraînement terminé ---")

    # --- 5. Évaluation ---
    print("\n--- Évaluation sur le jeu de test ---")
    results = model.evaluate(test_dataset, batch_size=BATCH_SIZE, return_dict=True)
    print(f"Résultats du test: {results}")
    mlflow.log_metrics(
        {"test_loss": results["loss"], "test_accuracy": results["accuracy"]}
    )

    # --- 6. Sauvegarde du Modèle et du Tokenizer avec MLflow ---

    print("\n--- Sauvegarde du modèle et du tokenizer avec MLflow ---")
    mlflow.transformers.log_model(
        transformers_model={
            "model": model,
            "tokenizer": tokenizer,
        },
        artifact_path="bert_sentiment_model",
        input_example=X_train[:5],
    )

    print(f"MLflow Run ID: {run.info.run_id}")
    print(
        f"Modèle et Tokenizer sauvegardés dans MLflow run sous 'bert_sentiment_model'"
    )




--- Démarrage de l'entraînement ---
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6




--- Entraînement terminé ---

--- Évaluation sur le jeu de test ---

KeyboardInterrupt: 

---
---

## <a id='toc6_2_'></a>[Evaluation du modèle sur les données de Test](#toc0_)

In [None]:
print("\n--- Exemple de prédiction avec le modèle chargé depuis MLflow ---")

# Tokenization des exemples
sample_encodings = encode_texts(X_test)

# Prédictions (logits)
predictions = model.predict(dict(sample_encodings))
logits = predictions.logits

# Conversion des logits en probabilités et en classes prédites
probabilities = tf.nn.softmax(logits, axis=-1).numpy()
predicted_classes = np.argmax(probabilities, axis=1)

for tweet, true, prob, pred_class in zip(
    X_test, y_test, probabilities, predicted_classes
)[:50]:
    sentiment = "Non-Négatif/Positif" if pred_class == 1 else "Négatif"
    print(f"\nTweet: {tweet}")
    print(f"  Probabilités (Négatif, Non-Négatif/Positif): {prob}")
    print(f"  Sentiment Prédit: {sentiment} | 'Vrai' Sentiment : {true}")


--- Exemple de prédiction avec le modèle chargé depuis MLflow ---

Tweet: This is an amazing flight!
  Probabilités (Négatif, Non-Négatif/Positif): [0.27767384 0.72232616]
  Sentiment Prédit: Non-Négatif/Positif (Classe: 1)

Tweet: I will never fly with this airline again, horrible service.
  Probabilités (Négatif, Non-Négatif/Positif): [0.5397522  0.46024784]
  Sentiment Prédit: Négatif (Classe: 0)

Tweet: The flight was okay, nothing special.
  Probabilités (Négatif, Non-Négatif/Positif): [0.38441566 0.6155843 ]
  Sentiment Prédit: Non-Négatif/Positif (Classe: 1)
