# R√©alisez une analyse de sentiments gr√¢ce au Deep Learning

## advanced-model

#### Load data

In [2]:
import pandas as pd
import numpy as np
import re

# Chargement du fichier avec un encodage diff√©rent
df = pd.read_csv("./output/data_clean.csv")

# V√©rification des premi√®res lignes
df.head()

Unnamed: 0,id,timestamp,date,query,user,tweet,tweet_tokenized
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,a thats a bummer you shoulda got david carr of...,"['thats', 'bummer', 'shoulda', 'got', 'david',..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he cant update his facebook by t...,"['upset', 'cant', 'update', 'facebook', 'texti..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,i dived many times for the ball managed to sav...,"['dived', 'many', 'times', 'ball', 'managed', ..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,"['whole', 'body', 'feels', 'itchy', 'like', 'f..."
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,no its not behaving at all im mad why am i her...,"['behaving', 'im', 'mad', 'cant', 'see']"


In [3]:
df[df["tweet"].isna() | (df["tweet"] == "")]
df = df[~(df['tweet'].isna() | (df['tweet'] == ""))]

#### Pr√©traitement et Vectorisation

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # WordNet d√©pend de cette ressource
nltk.download('punkt')    # Pour la tokenisation

[nltk_data] Downloading package wordnet to /home/bruno/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/bruno/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /home/bruno/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

##### Stemming et Lemmatization

- **stemming** : Le stemming consiste √† troncater un mot jusqu‚Äô√† sa racine ou un radical commun en appliquant des r√®gles heuristiques simples. Cette approche est souvent rapide, mais elle peut produire des formes de mots non valides.
- **Lemmatization** : La lemmatisation consiste √† r√©duire un mot √† sa "lemme", c‚Äôest-√†-dire sa forme canonique ou de base, en tenant compte de son contexte linguistique et de sa cat√©gorie grammaticale.
    - ex :
        -  Happily	-> Happy
        -  Better	-> Good

In [4]:
from mlflow.tracking import MlflowClient
import mlflow
import mlflow.keras
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score, classification_report, f1_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, Embedding, Dropout
from transformers import TFBertForSequenceClassification, BertTokenizer
from gensim.models import Word2Vec, FastText
import numpy as np

# Choix des techniques de pr√©traitement
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess_text(text, technique="lemmatization"):
    if technique == "lemmatization":
        return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    elif technique == "stemming":
        return " ".join([stemmer.stem(word) for word in text.split()])
    else:
        return text

df['processed_tweet_lemma'] = df['tweet'].apply(lambda x: preprocess_text(x, "lemmatization"))
df['processed_tweet_stem'] = df['tweet'].apply(lambda x: preprocess_text(x, "stemming"))

#### S√©lection et D√©coupage des Donn√©es

In [6]:
X = df['processed_tweet_lemma']  # Peut √™tre chang√© pour `processed_tweet_stem`
y = df['id'].apply(lambda x: 1 if x == 4 else 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Embedding avec GloVe (Global Vectors for Word Representation)

GloVe (Global Vectors for Word Representation) est une m√©thode populaire pour g√©n√©rer des embeddings pr√©-entra√Æn√©s. D√©velopp√© par Stanford, GloVe repose sur l‚Äôid√©e de capturer les relations s√©mantiques et contextuelles entre les mots en utilisant les **cooccurrences globales** dans un corpus de texte.

- **cooccurrences globales** :  Les cooccurrences globales utilis√©es dans des m√©thodes comme GloVe sont obtenues en agr√©geant les cooccurrences locales sur l'ensemble du corpus. Cela signifie que les cooccurrences locales (les mots qui apparaissent ensemble dans une fen√™tre contextuelle autour d'un mot cible) sont comptabilis√©es et accumul√©es pour former une vue d'ensemble du corpus. 

- **descente de gradient** : La descente de gradient est une m√©thode d'optimisation utilis√©e pour minimiser une fonction objective (ou fonction de co√ªt) dans de nombreux algorithmes de machine learning, y compris les r√©seaux neuronaux. C'est un processus it√©ratif qui ajuste les param√®tres du mod√®le (comme les poids et les biais) pour r√©duire l'erreur entre les pr√©dictions du mod√®le et les valeurs r√©elles.

- glove.6B.50d.txt  : 50 dimensions
- glove.6B.100d.txt : 100 dimensions
- glove.6B.200d.txt : 200 dimensions
- glove.6B.300d.txt : 300 dimensions

In [7]:
embedding_dim = 300
embedding_index = {}

with open("./input/glove.6B.300d.txt", "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coeffs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coeffs

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(X_train)
vocab_size = len(tokenizer.word_index) + 1
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
max_length = 300
X_train_pad = tf.keras.preprocessing.sequence.pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_test_pad = tf.keras.preprocessing.sequence.pad_sequences(X_test_seq, maxlen=max_length, padding='post')

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[idx] = embedding_vector

#### Pr√©pare MLflow

In [8]:
# D√©finir l'URI de tracking pour MLflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# V√©rifier et terminer les runs actives
if mlflow.active_run() is not None:
    mlflow.end_run()

# D√©finir un nom d'exp√©rience
experiment_name = "p7-sentiment-analysis"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='mlflow-artifacts:/277281536415448661', creation_time=1733133411530, experiment_id='277281536415448661', last_update_time=1733133411530, lifecycle_stage='active', name='p7-sentiment-analysis', tags={}>

#### Mod√®le LSTM Bidirectionnel

In [9]:
from mlflow.tracking import MlflowClient

with mlflow.start_run(run_name="bidirectional_lstm") as run:
    model_lstm = Sequential([
        Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length=max_length, trainable=False),
        Bidirectional(LSTM(128, return_sequences=True)),
        Dropout(0.5),
        Bidirectional(LSTM(64)),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model_lstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model_lstm.fit(X_train_pad, y_train, validation_split=0.2, epochs=5, batch_size=32)

    # √âvaluation
    y_pred_lstm = (model_lstm.predict(X_test_pad) > 0.5).astype("int32")
    acc_lstm = accuracy_score(y_test, y_pred_lstm)
    f1_lstm = f1_score(y_test, y_pred_lstm)

    mlflow.log_metric("accuracy", acc_lstm)
    mlflow.log_metric("f1_score", f1_lstm)
    mlflow.keras.log_model(model_lstm, "lstm_model")
    

    mlflow.log_metric("lemmatization", "true")
    mlflow.log_metric("glove_embedding_dim", embedding_dim)

    # Ajouter le mod√®le au Model Registry
    model_name = "bidirectional-lstm"
    client = MlflowClient()
    model_uri = f"runs:/{run.info.run_id}/lstm_model"
    try:
        client.get_registered_model(model_name)
    except mlflow.exceptions.MlflowException:
        client.create_registered_model(model_name)
    client.create_model_version(name=model_name, source=model_uri, run_id=run.info.run_id)


I0000 00:00:1733408458.840353  126573 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5520 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9


Epoch 1/5


I0000 00:00:1733408463.436164  151012 cuda_dnn.cc:529] Loaded cuDNN version 90501


[1m31932/31932[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m2514s[0m 79ms/step - accuracy: 0.7806 - loss: 0.4593 - val_accuracy: 0.8161 - val_loss: 0.4034
Epoch 2/5
[1m31932/31932[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m2359s[0m 74ms/step - accuracy: 0.8258 - loss: 0.3865 - val_accuracy: 0.8233 - val_loss: 0.3896
Epoch 3/5
[1m31932/31932[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m2342s[0m 73ms/step - accuracy: 0.8365 - loss: 0.3659 - val_accuracy: 0.8250 - val_loss: 0.3877
Epoch 4/5
[1m31932/31932[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m2406s[0m 75ms/step - accuracy: 0.8444 - loss: 0.3518 - val_accuracy: 0.8241 - val_loss: 0.3891
Epoch 5/5
[1m31932/31932[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m2492s[0m 78ms/step - accuracy: 0.8486 - loss: 0.3442 - val_a



üèÉ View run bidirectional_lstm at: http://127.0.0.1:5000/#/experiments/277281536415448661/runs/605b207f04ec4418bc4bc8cff95827e7
üß™ View experiment at: http://127.0.0.1:5000/#/experiments/277281536415448661


TypeError: 'true' has type str, but expected one of: int, float

#### Mod√®le BERT

In [35]:
import mlflow
import mlflow.keras
from transformers import TFBertForSequenceClassification, BertTokenizer, AdamWeightDecay
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import tensorflow as tf
from mlflow.tracking import MlflowClient

# Configurations de MLflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# V√©rifier et terminer les runs actives
if mlflow.active_run() is not None:
    mlflow.end_run()

# D√©finir un nom d'exp√©rience
experiment_name = "bert-classification-experiment"
mlflow.set_experiment(experiment_name)

# D√©marrer une run MLflow
with mlflow.start_run(run_name="bert_classification") as run:
    # Initialisation du tokenizer BERT
    tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')

    # Tokenisation des donn√©es d'entra√Ænement et de test
    max_length = 100
    X_train_enc = tokenizer_bert(list(X_train), truncation=True, padding=True, max_length=max_length, return_tensors='tf')
    X_test_enc = tokenizer_bert(list(X_test), truncation=True, padding=True, max_length=max_length, return_tensors='tf')

    # Initialisation du mod√®le BERT pour la classification
    model_bert = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

    # Configuration de l'optimiseur AdamWeightDecay
    optimizer = AdamWeightDecay(learning_rate=5e-5, weight_decay_rate=0.01)

    # Compilation du mod√®le
    model_bert.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # Entra√Ænement du mod√®le
    model_bert.fit(
        X_train_enc['input_ids'], 
        y_train, 
        validation_split=0.2, 
        epochs=3, 
        batch_size=16
    )

    # √âvaluation sur le jeu de test
    y_pred_bert = np.argmax(model_bert.predict(X_test_enc['input_ids']).logits, axis=1)
    acc_bert = accuracy_score(y_test, y_pred_bert)
    f1_bert = f1_score(y_test, y_pred_bert)

    # Enregistrer les m√©triques dans MLflow
    mlflow.log_metric("accuracy", acc_bert)
    mlflow.log_metric("f1_score", f1_bert)

    # Enregistrer le mod√®le dans MLflow
    mlflow.keras.log_model(model_bert, "bert_model")

    # Ajouter le mod√®le au Model Registry
    model_name = "bert-classification"
    client = MlflowClient()
    model_uri = f"runs:/{run.info.run_id}/bert_model"

    try:
        client.get_registered_model(model_name)
    except mlflow.exceptions.MlflowException:
        client.create_registered_model(model_name)

    client.create_model_version(name=model_name, source=model_uri, run_id=run.info.run_id)

    # Affichage des r√©sultats finaux
    print("Test Accuracy:", acc_bert)
    print("Test F1 Score:", f1_bert)

2024/12/02 23:26:50 INFO mlflow.tracking.fluent: Experiment with name 'bert-classification-experiment' does not exist. Creating a new experiment.
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3


2024/12/03 08:49:34 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: bert-classification, version 1


Test Accuracy: 0.5013262848678256
Test F1 Score: 0.667844545080967
üèÉ View run bert_classification at: http://127.0.0.1:5000/#/experiments/740765759709431351/runs/5752810efd5a440e9e99c9508e6ecfc1
üß™ View experiment at: http://127.0.0.1:5000/#/experiments/740765759709431351


#### R√©sultats Finaux

In [None]:
print("LSTM Accuracy:", acc_lstm, "F1 Score:", f1_lstm)
print("BERT Accuracy:", acc_bert, "F1 Score:", f1_bert)

Vectorisation parie en production.
TEster USE et Word Embeding :