# Plateforme Agnostique de Traitement et d'Analyse des Textes
### Experiment notebook
---

## Topic: BERT Embeddings

---

# Observations and environment
---

## Environment

In [40]:
_rs = 42

In [41]:
cd ../..

/Volumes/Geek


In [42]:
import ast
import importlib
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
tqdm.pandas()

## Data

In [4]:
import patat.util.file

filename = 'data/prod/230517-OIDS-Label.pickle'

df_label = patat.util.file.pickle_load(filename)

In [5]:
labels = ['infox', 'entites_nommees', 'ouverture_esprit', 'faits', 'opinions',
       'propos_raportes', 'sources_citees', 'fausse_nouvelle', 'insinuations',
       'exageration', ]

In [6]:
df_label[labels].describe()

| | infox | entites_nommees | ouverture_esprit | faits | opinions | propos_raportes | sources_citees | fausse_nouvelle | insinuations | exageration |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 900.0 | 804.0 | 803.0 | 804.0 | 804.0 | 803.0 | 803.0 | 802.0 | 802.0 | 552.0 |
| mean | 0.414444 | 0.618159 | 0.063512 | 0.717662 | 0.547264 | 0.244085 | 0.400996 | 0.15212 | 0.331671 | 0.317029 |
| std | 0.4929 | 0.48614 | 0.244033 | 0.450417 | 0.498071 | 0.429811 | 0.490406 | 0.359361 | 0.471107 | 0.465741 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Duplicated URLs

In [7]:
df_label.duplicated(subset='url').sum()

0

### Sites

In [8]:
df_label.value_counts('site')

site
www.francesoir.fr                    169
www.francetvinfo.fr                   91
www.breizh-info.com                   66
www.wikistrike.com                    62
lezarceleurs.blogspot.com             58
lesmoutonsrebelles.com                47
lemediaen442.fr                       32
www.profession-gendarme.com           28
lesdeqodeurs.fr                       28
fr.sott.net                           26
www.dreuz.info                        25
www.lelibrepenseur.org                23
www.polemia.com                       19
reseauinternational.net               17
actu.fr                               17
www.mondialisation.ca                 16
www.nouvelordremondial.cc             14
lesakerfrancophone.fr                 13
www.lesalonbeige.fr                   13
www.voltairenet.org                   12
lesobservateurs.ch                     9
www.anguillesousroche.com              9
lecourrier-du-soir.com                 9
www.cnews.fr                           9
www.preuves

# Experiment
---

## Get BERT Embeddings
Choice no. 0

To obtain a sentence embedding with BERT, you can follow these steps:

1. First, make sure the Hugging Face Transformers library is installed, for example with `pip install transformers`.

2. Import the required libraries in your script:

```python
from transformers import BertTokenizer, BertModel
import torch
```

3. Load the pre-trained BERT model and the matching tokenizer:

```python
model_name = 'bert-base-uncased'  # example: uncased English BERT
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
```

4. Split your sentence into tokens that BERT can use, with the tokenizer:

```python
sentence = "Your sentence here"
tokens = tokenizer.tokenize(sentence)
```

5. Convert the tokens into the numeric ids that the model expects:

```python
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```

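6. Add padding and an attention mask so that the input has a fixed size:

```python
max_length = 64  # Maximum length of the input sequence, in tokens
input_ids = input_ids[:max_length]
n_tokens = len(input_ids)  # Number of real tokens, counted before padding
pad_length = max_length - n_tokens
input_ids = input_ids + [tokenizer.pad_token_id] * pad_length  # Padding
attention_mask = [1] * n_tokens + [0] * pad_length  # 1 = real token, 0 = padding
```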

7. Turn the data into PyTorch tensors:

```python
input_ids = torch.tensor(input_ids).unsqueeze(0)  # Add a batch dimension
attention_mask = torch.tensor(attention_mask).unsqueeze(0)  # Add a batch dimension
```

8. Pass the data through the BERT model to obtain the embeddings:

```python
outputs = model(input_ids, attention_mask=attention_mask)
embeddings = outputs[0]  # Last hidden layer: one embedding per token
```
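The resulting embeddings are a PyTorch tensor of shape `(1, max_length, hidden_size)`: one vector per token position of the given sentence (`hidden_size` is 768 for the base models).

Note that `tokenize` followed by `convert_tokens_to_ids` does not add the special tokens (such as `[CLS]` and `[SEP]`) that the model saw during pre-training; calling the tokenizer directly, as in `tokenizer(sentence, ...)`, adds them.

Note also that this method uses uncased English BERT base as an example; you can choose a different model depending on the language and casing of your texts.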
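
The padding and masking step can be isolated in a small helper. The sketch below is plain Python and assumes nothing beyond a list of token ids; the function name `pad_and_mask` and the example ids are hypothetical. In practice the padding id would come from `tokenizer.pad_token_id` rather than being hard-coded.

```python
def pad_and_mask(input_ids, max_length, pad_id):
    """Truncate or pad a list of token ids and build the matching attention mask.

    Hypothetical helper for illustration; it is not part of Transformers.
    """
    ids = list(input_ids[:max_length])   # Truncate to at most max_length ids
    n_real = len(ids)                    # Real tokens, counted before padding
    n_pad = max_length - n_real
    padded = ids + [pad_id] * n_pad
    mask = [1] * n_real + [0] * n_pad    # 1 = real token, 0 = padding
    return padded, mask


# Arbitrary example ids, padded to a length of 6
ids, mask = pad_and_mask([101, 7592, 102], max_length=6, pad_id=0)
print(ids)   # [101, 7592, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Counting the real tokens before padding is what lets the mask mark the padded positions with zeros.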

In [9]:
import torch

In [10]:
# English Bert version
from transformers import BertTokenizer, BertModel
model_name = 'bert-base-uncased'  # example: uncased English BERT
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
# Camembert version
from transformers import CamembertTokenizer, CamembertModel
model_name = 'camembert-base'
tokenizer = CamembertTokenizer.from_pretrained(model_name)
model = CamembertModel.from_pretrained(model_name)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertModel: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
def get_embeddings(sentence,tokenizer,model):
    # NOTE: tokenize + convert_tokens_to_ids adds no special tokens (<s>, </s>, [CLS], [SEP])
    tokens = tokenizer.tokenize(sentence)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    max_length = 512  # Maximum length of the input sequence, in tokens
    input_ids = input_ids[:max_length]
    # NOTE: pads with id 0, which is not necessarily the tokenizer's pad_token_id
    input_ids = input_ids + [0] * (max_length - len(input_ids))  # Padding
    # NOTE: computed after padding, so the mask is all ones and padding is attended to
    attention_mask = [1] * len(input_ids)
    input_ids = torch.tensor(input_ids).unsqueeze(0)  # Add a batch dimension
    attention_mask = torch.tensor(attention_mask).unsqueeze(0)  # Add a batch dimension
    outputs = model(input_ids, attention_mask=attention_mask)
    embeddings = outputs[0]  # Last hidden layer: one embedding per token
    # Vector of the first token of the single sentence in the batch
    return embeddings[0][0].detach().numpy()
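
`embeddings[0][0]` selects the first item of the batch and then the first token position of the last hidden state. The sketch below reproduces that indexing on a random NumPy array standing in for the model output, and contrasts it with a masked mean over tokens, a common alternative that is not what the cell above does. The array, its shape, and the helper name `masked_mean` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for outputs[0]: (batch size, sequence length, hidden size)
hidden = rng.normal(size=(1, 6, 4))
# Three real tokens followed by three padding positions
mask = np.array([[1, 1, 1, 0, 0, 0]])

# What get_embeddings returns: the vector at batch index 0, token index 0
first_token = hidden[0][0]


def masked_mean(hidden, mask):
    """Average token vectors over the positions where mask == 1 (hypothetical helper)."""
    weights = mask[..., None].astype(hidden.dtype)   # (batch, seq, 1)
    return (hidden * weights).sum(axis=1) / weights.sum(axis=1)


pooled = masked_mean(hidden, mask)[0]
print(first_token.shape, pooled.shape)
```

Both vectors have the hidden size as their length; they differ in whether one position or all real positions contribute.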

In [13]:
sentence = 'Ceci est un test avec une phrase un peu plus longue. On verra ce que ca donne...'
get_embeddings(sentence,tokenizer,model)

array([-5.06394589e-03,  9.97290388e-02,  1.32782057e-01, -1.41223207e-01,
       -2.34577842e-02,  4.36505564e-02,  3.25186364e-03,  1.93185136e-01,
        2.32618488e-02,  6.91335797e-02,  3.44001874e-02,  1.63361967e-01,
       -7.94454068e-02,  9.57238898e-02,  2.68591940e-01,  2.20870338e-02,
        6.40096068e-02, -8.44110250e-02,  1.07058696e-01, -1.41572759e-01,
        3.04726381e-02, -7.29611069e-02, -3.19031179e-02, -3.52645546e-01,
        2.42682531e-01, -2.10098833e-01, -4.53932956e-02, -1.08195864e-01,
       -1.59330852e-02,  6.23205751e-02,  4.71787900e-02, -2.09260687e-01,
        7.66488835e-02,  1.05318323e-01,  1.70413986e-01, -1.11778617e-01,
       -5.83170131e-02,  1.07461050e-01, -1.19124502e-01, -4.59931567e-02,
       -1.44121408e-01,  8.08116868e-02,  2.13408649e-01, -5.47286719e-02,
        1.32382274e-01,  1.99937314e-01, -2.52126783e-01,  1.79821216e-02,
       -7.60386288e-02,  9.41729546e-02,  5.94821572e-02,  1.07978750e-02,
       -4.17250514e-01,  

## Computing the text embeddings

In [14]:
df_label['embeddings']=df_label['text'].progress_apply(lambda text: get_embeddings(text,tokenizer,model))

  0%|          | 0/904 [00:00<?, ?it/s]

## Predicting infox

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split

In [17]:
def get_balanced_df_ml(label,df_label):
    df_0 = df_label[df_label[label] == 0]
    df_1 = df_label[df_label[label] == 1]
    min_sample = min(len(df_0),len(df_1))
    df_0=df_0.sample(min_sample,random_state=_rs)
    df_1=df_1.sample(min_sample,random_state=_rs)
    df_ml = pd.concat([df_0,df_1])
    df_ml = df_ml.sample(frac=1,random_state=_rs)
    return df_ml
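
To see what the balancing does, the sketch below applies the same down-sampling logic to a toy frame with an unbalanced binary column. The function name `balance`, the column name `flag`, and the data are made up for illustration. Rows whose label is missing match neither class and are therefore dropped.

```python
import numpy as np
import pandas as pd

_rs = 42  # Same seed convention as the notebook


def balance(label, df):
    """Down-sample the majority class so that both classes have the same size."""
    df_0 = df[df[label] == 0]
    df_1 = df[df[label] == 1]
    n = min(len(df_0), len(df_1))
    out = pd.concat([df_0.sample(n, random_state=_rs),
                     df_1.sample(n, random_state=_rs)])
    return out.sample(frac=1, random_state=_rs)   # Shuffle the rows


# Four negatives, two positives, one unlabeled row
toy = pd.DataFrame({'flag': [0, 0, 0, 0, 1, 1, np.nan]})
balanced = balance('flag', toy)
print(len(balanced))
print(balanced['flag'].value_counts().to_dict())
```

The balanced frame always has an even number of rows, twice the size of the minority class.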

In [18]:
df_ml = get_balanced_df_ml('infox',df_label)

In [20]:
import numpy as np

In [24]:
matrix = np.stack(df_ml['embeddings'].to_list())  # (n_samples, hidden_size)

In [25]:
X = pd.DataFrame(matrix)

In [26]:
y = df_ml['infox']

In [27]:
logreg = LogisticRegression(random_state=_rs, solver='lbfgs', multi_class='ovr', max_iter=1000)

In [28]:
scores = cross_validate(logreg, X, y, cv=4,scoring=('roc_auc','f1','accuracy','precision','recall'))

In [30]:
pd.DataFrame(scores).mean()

fit_time          0.044024
score_time        0.008357
test_roc_auc      0.838486
test_f1           0.748893
test_accuracy     0.760106
test_precision    0.784991
test_recall       0.718628
dtype: float64
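
`cross_validate` returns a dictionary with one array per metric, holding one value per fold; averaging those arrays gives a summary like the one above. The sketch below shows that structure on synthetic data from `make_classification`, which stands in for the embedding matrix. The data and its dimensions are assumptions, and the scores it produces say nothing about the real task.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data standing in for (X, y)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

clf = LogisticRegression(max_iter=1000, random_state=42)
scores = cross_validate(clf, X, y, cv=4, scoring=('roc_auc', 'f1', 'accuracy'))

# One row per fold, one column per timing or metric
df_scores = pd.DataFrame(scores)
print(df_scores.shape)
print(df_scores.mean())
```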

## Predicting the list of labels

In [31]:
labels = ['infox', 'entites_nommees', 'ouverture_esprit', 'faits', 'opinions',
       'propos_raportes', 'sources_citees', 'fausse_nouvelle', 'insinuations',
       'exageration', ]

In [32]:
def get_df_ml(label,df_label):
    return df_label[df_label[label].notna()]

In [33]:
def get_balanced_df_ml(label,df_label):
    df_0 = df_label[df_label[label] == 0]
    df_1 = df_label[df_label[label] == 1]
    min_sample = min(len(df_0),len(df_1))
    df_0=df_0.sample(min_sample,random_state=_rs)
    df_1=df_1.sample(min_sample,random_state=_rs)
    df_ml = pd.concat([df_0,df_1])
    df_ml = df_ml.sample(frac=1,random_state=_rs)
    return df_ml

In [43]:
def get_scores(label,df_ml):
    # C=100 regularises less than the default C=1 used in the infox-only run above
    logreg = LogisticRegression(C=100,random_state=_rs, solver='lbfgs', multi_class='ovr', max_iter=1000)
    # Stack the per-row embedding vectors into an (n_samples, hidden_size) matrix
    matrix = np.stack(df_ml['embeddings'].to_list())
    X = pd.DataFrame(matrix)
    y = df_ml[label]
    scores = cross_validate(logreg, X, y, cv=4,scoring=('roc_auc','f1','accuracy','precision','recall'))
    # Average each metric over the 4 folds
    score_dic = pd.DataFrame(scores).mean().to_dict()
    score_dic['label']=label
    score_dic['n_samples']=len(df_ml)
    return score_dic

In [44]:
score_list = []
for label in labels:
    print(f'Processing {label}')
    df_ml = get_balanced_df_ml(label,df_label)
    score_list.append(get_scores(label,df_ml))

Processing infox
Processing entites_nommees
Processing ouverture_esprit
Processing faits
Processing opinions
Processing propos_raportes
Processing sources_citees
Processing fausse_nouvelle
Processing insinuations
Processing exageration


In [45]:
pd.DataFrame(score_list)

| | fit_time | score_time | test_roc_auc | test_f1 | test_accuracy | test_precision | test_recall | label | n_samples |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.204311 | 0.008248 | 0.806511 | 0.71634 | 0.725231 | 0.741514 | 0.694349 | infox | 746 |
| 1 | 0.195557 | 0.008036 | 0.667461 | 0.637787 | 0.63526 | 0.635635 | 0.641576 | entites_nommees | 614 |
| 2 | 0.057497 | 0.007549 | 0.490015 | 0.483891 | 0.470769 | 0.472588 | 0.508013 | ouverture_esprit | 102 |
| 3 | 0.156301 | 0.007646 | 0.630469 | 0.62283 | 0.618926 | 0.615948 | 0.630561 | faits | 454 |
| 4 | 0.230577 | 0.007785 | 0.648533 | 0.608811 | 0.615385 | 0.619753 | 0.598901 | opinions | 728 |
| 5 | 0.126148 | 0.007612 | 0.616306 | 0.583541 | 0.591837 | 0.5984 | 0.571429 | propos_raportes | 392 |
| 6 | 0.24885 | 0.00778 | 0.579437 | 0.551167 | 0.562112 | 0.566377 | 0.537191 | sources_citees | 644 |
| 7 | 0.090349 | 0.007559 | 0.608737 | 0.589782 | 0.577869 | 0.570344 | 0.614785 | fausse_nouvelle | 244 |
| 8 | 0.178159 | 0.007935 | 0.708899 | 0.645714 | 0.648496 | 0.65367 | 0.63936 | insinuations | 532 |
| 9 | 0.119112 | 0.007674 | 0.670602 | 0.606895 | 0.611383 | 0.615636 | 0.600026 | exageration | 350 |


# Saving the results
---

# Conclusions
---

# Scratchpad
---

In [None]:
import patat.model.camembert

In [None]:
importlib.reload(patat.model.camembert)

In [None]:
model = patat.model.camembert.Camembert()

In [None]:
pd.DataFrame(matrix)

In [None]:
emb2 = model.get_embeddings('Voici est un autre texte')

In [None]:
import numpy as np

In [None]:
pd.DataFrame(np.array([df_label['embeddings'][0],df_label['embeddings'][1]]))

In [None]:
np.array([[1,2,3],[4,5,6]])