**Bilal FAYE (USPN)**<br><br><br>


Sur ce notebook, un tokenizer et trois modèles ont été utiisés.<br>
**Tokenizer:** **tiktoken** de OpenAI qui utilise l'approche BPE.<br>
**Modèles:**<br>
  **1. BERT:** Qui utilise que l'encoder. Il est entraîné par **MLM** (Masked Langage Model), qui consiste à masquer des mots de la séquence d'entrée.<br><br>

  **2. GPT:** Qui utilise de le decoder. Il est entraîné par l'approche "Predict the Next Token", qui consiste à prédire de manière autorégressive (token par token) des tokens en se basant sur le contexte (tokens avant le token à prédire).<br><br>

  **3. T5:** Qui utilise le Transformer complet pour faire diverse tâche (traduction, classification, résumé, QA, etc.). Il est entraîné avec un processus autoregressif. <br><br>

**Principe à comprendre**<br><br>
Ces modèles LLMs pré-entraînés présentent un **tokenizer** un le **modèle**.<br>

Le modèle est toujours présenté sous deux formes: <br>
**1 ère forme:** La sortie de l'encoder ou du decoder qui est représentée sous forme de séquence. Ce modèle doit être utilisé avec d'autres couches de sortie pour fine-tuner le modèle dans la même tâche de pré-entraînement ou dans une autres tâche. Sortie: **last_hidden_state** (séquence) et/ou **projection** (vecteur).<br><br>

**2 nde forme:** La sortie est présente sous forme de **logits**, dans ce cas le modèle contient une **tête de projection**. Un tel modèle pourra être utilisé avec la méthode **generate** et pourra être **fine-tuner**, mais dans une tâche similaire que la tâche de pré-entraînement. Sortie: **last_hidden_state** (séquence)  et **logits** (séquence). Les logits sont des séquences qui ont **la dimension du vocabulaire**, donc permettent de savoir avec **argmax** le token correspondant sur le vocabulaire. Par exemple si la tâche de pré-entrainement est la génération de textes, le modèle peut être utilisé pour une tâche de chatbot, en donnant l'attribut labels qui sont les réponses de de chaque question.<br>


# **Visualize attention with BertViz**

Ce lien permet d'avoir accés au notebook de bertviz et de visualiser le mécanismisme d'auto-attention: [BertViz](https://colab.research.google.com/github/davidarps/2022_course_embeddings_and_transformers/blob/main/Visualizing_Attention_with_BertViz.ipynb)

# **Tokenization with tiktoken**

**tiktoken est un tokenizer [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) utilisé par les modèles d'OpenAI.**

In [None]:
! pip install tiktoken

In [None]:
import tiktoken

In [None]:
# Import tiktoken, une bibliothèque de Huggingface utile pour avoir une idée de la taille d'un modèle avant de l'utiliser.
import tiktoken

# Récupérer le tokenizer de GPT-3
encoding = tiktoken.encoding_for_model("davinci")

# Donner un texte en exemple et l'afficher
text = "We need to stop anthropomorphizing ChatGPT."
print(f"text: {text}")

# Tokenizer le texte et afficher la taille du vocabulaire sur lequel GPT-3 est entraîné
token_integers = encoding.encode(text)
print(f"total number of tokens: {encoding.n_vocab}")

# Afficher la liste des tokens (ids) et les tokens correspondants (mots)
print(f"token integers: {token_integers}")
token_strings = [encoding.decode_single_token_bytes(token) for token in token_integers]
print(f"token strings: {token_strings}")
print(f"number of tokens in text: {len(token_integers)}")

# Décoder les tokens (ids) pour reconstruire la phrase de départ (avant tokenization)
encoded_decoded_text = encoding.decode(token_integers)
print(f"encoded-decoded text: {encoded_decoded_text}")

text: We need to stop anthropomorphizing ChatGPT.
total number of tokens: 50257
token integers: [1135, 761, 284, 2245, 17911, 25831, 2890, 24101, 38, 11571, 13]
token strings: [b'We', b' need', b' to', b' stop', b' anthrop', b'omorph', b'izing', b' Chat', b'G', b'PT', b'.']
number of tokens in text: 11
encoded-decoded text: We need to stop anthropomorphizing ChatGPT.


# **T5**

## **Inference**

Cette approche permet de tester de manière directe le modèle T5 générique.

In [None]:
# Installer la bibliothèque transformers de HuggingFace pour avoir accés à T5
! pip install transformers

In [None]:
# Installer sentencepiece pour la tokenization (ici on utilise pas BPE)
! pip install sentencepiece

In [None]:
# Importer le tokiner et le modèle de T5
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Charger le tokenizer pré-entraîné
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Charger le modèle pré-entraîné
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Spécifier la tâche à exécuter (ici, on fait de la traduction d'anglais vers l'allemand)
task_prefix = "translate English to German: "

# Donner deux phrases à traduire en guise d'exemple
sentences = ["The house is wonderful.", "I like to work in NYC."]

# Tokeniser les deux phrases en concaténant avec la tâche cible (translate English to German)
inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)

In [None]:
# Afficher la sortie de la tokenisation
inputs

{'input_ids': tensor([[13959,  1566,    12,  2968,    10,    37,   629,    19,  1627,     5,
             1,     0,     0],
        [13959,  1566,    12,  2968,    10,    27,   114,    12,   161,    16,
         13465,     5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
# Afficher la taille de la matrice des ids obtenue après tokenisation
inputs['input_ids'].shape

torch.Size([2, 13])

In [None]:
# Tester l'encodeur du modèle T5
encoder_output = model.encoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(encoder_output)
print(encoder_output.keys())
print(encoder_output["last_hidden_state"].shape)

BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[ 0.0154,  0.1263,  0.0301,  ..., -0.0117,  0.0373,  0.1015],
         [-0.1926, -0.1285,  0.0228,  ..., -0.0339,  0.0535,  0.1575],
         [ 0.0109, -0.0210,  0.0022,  ...,  0.0008, -0.0056, -0.0393],
         ...,
         [ 0.0756, -0.0119, -0.0273,  ..., -0.0044, -0.0505,  0.0554],
         [ 0.0373, -0.1201,  0.2142,  ...,  0.1950,  0.0546, -0.1554],
         [ 0.0321, -0.1309,  0.2183,  ...,  0.1683,  0.0588, -0.1607]],

        [[ 0.0164,  0.1266,  0.0305,  ..., -0.0114,  0.0363,  0.0988],
         [-0.0324, -0.0596, -0.0139,  ...,  0.0319, -0.0131,  0.0450],
         [ 0.0098, -0.0204,  0.0016,  ...,  0.0017, -0.0040, -0.0397],
         ...,
         [-0.0303, -0.3878,  0.1376,  ..., -0.0516,  0.0796, -0.3034],
         [ 0.0241, -0.1246,  0.0031,  ...,  0.0801, -0.2018, -0.0541],
         [ 0.0946,  0.0095, -0.0317,  ..., -0.0046, -0.0563,  0.0309]]],
       grad_fn=<MulBackward0>), past_key_values=None, hi

In [None]:
# Tester le décodeur du modèle T5 (cette méthode permet uniquement de prendre le décodeur pas la couche de logits pour la prédiction)
decoder_output = model.decoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(decoder_output)
print(decoder_output.keys())
print(decoder_output["last_hidden_state"].shape)

BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[ 4.0756e-02,  2.2054e-01, -9.8648e-02,  ...,  1.0998e-01,
           4.5049e-05, -1.1511e-01],
         [-4.2975e-02,  8.5212e-02, -2.6601e-02,  ...,  9.1395e-02,
           2.6582e-04, -2.4325e-01],
         [ 4.9926e-02,  2.8985e-02,  3.7887e-03,  ...,  1.1895e-01,
           5.1045e-04,  3.9288e-02],
         ...,
         [-6.6533e-03, -6.9466e-03,  8.2814e-02,  ...,  8.2662e-02,
           1.5332e-04, -1.0807e-01],
         [-2.9056e-02,  4.2412e-02,  2.6476e-02,  ...,  6.4245e-02,
           1.1994e-04,  6.3035e-02],
         [-2.5118e-02,  3.7915e-02,  3.1097e-02,  ...,  6.8023e-02,
           1.1328e-04,  4.0840e-02]],

        [[ 4.0756e-02,  2.2054e-01, -9.8648e-02,  ...,  1.0998e-01,
           4.5049e-05, -1.1511e-01],
         [-4.2975e-02,  8.5212e-02, -2.6601e-02,  ...,  9.1395e-02,
           2.6582e-04, -2.4325e-01],
         [ 4.9926e-02,  2.8985e-02,  3.7887e-03,  ...,  1.1895e-01,
           5.104

In [None]:
# Utiliser l'encodeur et le décodeur (le transformer) pour traduire.
output_sequences = model.generate(

    input_ids=inputs["input_ids"],

    attention_mask=inputs["attention_mask"],

    do_sample=False,  # disable sampling to test if batching affects output

)

# Afficher la sortie (des tokens en ids)
print(output_sequences)
print(output_sequences.shape)



tensor([[    0,   644,  4598,   229, 19250,     5,     1,     0,     0,     0],
        [    0,  1674,     3,  5269,    15,  5462,    16, 13465,     5,     1]])
torch.Size([2, 10])


In [None]:
# Traduire les tokens (ids) en mots
print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))

['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']


In [None]:
# Test pour obtenir le token de l'id 0
tokenizer.decode(0)

'<pad>'

In [None]:
# Test pour obtenir le token de l'id 644
tokenizer.decode(644)

'Das'

## **Training**

Cette partie permet d'orienter le modèle dans un domaine spécifique

### **Supervised method: Translation example**

In [None]:
# Importer le tokenizer et modèle T5
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Charger le tokenizer et le modèle T5 pré-entraîné
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenizer le texte à traduire (on retourne seulement input_ids, pas attention_mask)
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids

# Tokenizer la sortie escompter: la bonne traduction
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids

# Traduire le texte
output = model(input_ids=input_ids, labels=labels)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Type de sotie retourner par model
type(output)

transformers.modeling_outputs.Seq2SeqLMOutput

In [None]:
# Les clés du dictionnaire de sortie
output.keys()

odict_keys(['loss', 'logits', 'past_key_values', 'encoder_last_hidden_state'])

In [None]:
# Les logits sont les sorties de prédictions, qui correspond à une séquence avec des vecteurs de taille, la taille du vocabulaire
output.logits.shape

torch.Size([1, 6, 32128])

In [None]:
# La loss (entropy dans ce cas)
output.loss

tensor(0.2542, grad_fn=<NllLossBackward0>)

In [None]:
# Exemple d'entrainement du modèle (code complet)
from transformers import T5Tokenizer, T5ForConditionalGeneration, AdamW

# Charger le tokenizer et le modèle
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Créer un jeu de données
source_texts = ["translate English to German: The house is wonderful.", ...]
target_texts = ["Das Haus ist wunderbar.", ...]

# Tokenizer les données
tokenized_inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
tokenized_labels = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True)

# Récupérer les ids
input_ids = tokenized_inputs["input_ids"]
labels = tokenized_labels["input_ids"]

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Entraîner le modèle
num_epochs = 3
for epoch in range(num_epochs):
    model.train()

    # Forward pass
    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}')

# Sauvegarder le modèle
model.save_pretrained("trained_t5_model")


### **Unsupervised method with mask**

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")

model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

# the forward function automatically creates the correct decoder_input_ids

loss = model(input_ids=input_ids, labels=labels).loss

loss.item()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


3.7837319374084473

# **GPT-1**

Pour tester le modèle, on peut utiliser l'application avec gradio: [GPT-1](https://huggingface.co/instruct-generalize/gpt-1)

# **GPT-2**

Tester le modèle sur, l'application gradio: [GPT-2](https://huggingface.co/gpt2)

In [None]:
# Installer la bibliothèque de HuggingFace transformers
! pip install transformers

Successfully installed huggingface-hub-0.17.3 safetensors-0.4.0 tokenizers-0.14.1 transformers-4.35.0


## **Inference**

In [None]:
# Importer la bibliothèque pipeline pour tester en inférence des modèles. set_seed permet d'avoir les mêmes résultats
from transformers import pipeline, set_seed

# Charger le modèle gpt-2
generator = pipeline('text-generation', model='gpt2')

# Fixer l'aléatoire
set_seed(42)

# Générer 5 textes, de taille maximale égale à 30, en prenant pour contexte  "Hello, I'm a language model,"
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, but what I'm really doing is making a human-readable document. There are other languages, but those are"},
 {'generated_text': "Hello, I'm a language model, not a syntax model. That's why I like it. I've done a lot of programming projects.\n"},
 {'generated_text': "Hello, I'm a language model, and I'll do it in no time!\n\nOne of the things we learned from talking to my friend"},
 {'generated_text': "Hello, I'm a language model, not a command line tool.\n\nIf my code is simple enough:\n\nif (use (string"},
 {'generated_text': "Hello, I'm a language model, I've been using Language in all my work. Just a small example, let's see a simplified example."}]

## **Training**

Cette approche permet aux développeur d'orienter le modèle vers un domaine précis.

GPT2Tokenizer est utilisé pour tokeniser le texte.<br>
GPT2Model est utilisé pour obtenir les représentations cachées des tokens du texte.<br>
Les embeddings (représentations) des tokens sont stockés dans hidden_states.<br>

Notez que GPT2Model est utilisé ici pour obtenir les représentations cachées, mais il ne génère pas de texte de manière générative. Pour générer du texte à l'aide de GPT-2, vous pouvez utiliser GPT2LMHeadModel qui est spécifiquement conçu pour la génération de langage.

In [2]:
! pip install transformers

In [None]:
# Importer le modèle GPT-2 et son tokenizer
from transformers import GPT2Tokenizer, GPT2Model

# Charger le modèle GPT-2 et le tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

In [None]:
# Prendre un texte pour exemple
text = "Replace me by any text you'd like."

In [None]:
# Tokenizer le texte d'exemple, pour les développeur tensorflow, on peut remplacer 'pt' par 'tf' pour avoir un tf.Tensor en sortie
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input

{'input_ids': tensor([[3041, 5372,  502,  416,  597, 2420,  345, 1549,  588,   13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

**input_ids :**<br>
    Il s'agit d'une séquence d'identifiants numériques qui représentent les mots ou les tokens dans le texte d'entrée. Chaque mot ou token est encodé sous forme d'un identifiant unique dans le vocabulaire du modèle. Dans votre exemple, la séquence d'input_ids est [3041, 5372, 502, 416, 597, 2420, 345, 1549, 588, 13].

**attention_mask :**<br>
    Il s'agit d'un masque binaire qui indique au modèle quels tokens dans la séquence d'entrée doivent être pris en compte et quels tokens doivent être ignorés lors du calcul des représentations. Un "1" dans le masque signifie que le token correspondant dans la séquence d'input_ids doit être pris en compte, tandis qu'un "0" signifie que le token doit être ignoré. Dans votre exemple, le masque d'attention est [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], ce qui signifie que tous les tokens de la séquence d'input_ids doivent être pris en compte.

In [None]:
# Afficher la taille du vocabulaire d'entraînement
tokenizer.vocab_size

50257

In [None]:
# Encoder le texte
output = model(**encoded_input)

In [None]:
# Voir les clés du dictionnaire
output.keys()

odict_keys(['last_hidden_state', 'past_key_values'])

In [None]:
# Voir la taille du dernier bloc du decoder
output.last_hidden_state.shape

torch.Size([1, 10, 768])

GPT2LMHeadModel est utilisé pour générer du texte continuant à partir du texte d'entrée.<br>
La méthode generate est utilisée pour générer le texte en spécifiant divers paramètres tels que max_length, num_beams, no_repeat_ngram_size, top_k, top_p, et temperature.<br><br>

GPT2LMHeadModel est particulièrement utile pour des tâches de génération de texte, comme la rédaction automatique, la création de dialogues, et d'autres applications où la création de texte contextuellement cohérent est nécessaire.

Tout le modèle ne peuvent pas utiliser generate, pour utiliser generate avec cette approche, ça ne va pas marche, il faut utiliser un modèle qui a une tête de prédiction (**lm_head**) avec par exemple: GPT2Tokenizer, GPT2LMHeadModel (voir code en dessous).

In [3]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Charger le modèle GPT-2 et le tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Texte d'entrée
input_text = "It is your,"

# Tokenization (tokenizer.encoder retourne uniquement input_ids, alors tokenizer retourne inputs_ids et attention_mask)
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Génération
output = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=0.8, pad_token_id=model.config.eos_token_id)


# Décodage de la sortie
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

# Afficher le résultat
print(decoded_output)


vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

It is your, dear. I have given you the best gift I have ever given or I will give you the best gift. Please, please!

You have no other choice, dear.

Just a little further.

But when you take off your coat and walk around with that heavy, heavy coat on the ground, you will see that it is very cold, very cold.

And you will understand that it is cold.

And I will tell you why.

Because


In [None]:
# On doit nécessaire avoir logits sur les keys pour pouvoir exploiter generate
model(input_ids).keys()

odict_keys(['logits', 'past_key_values'])

In [6]:
output = model(input_ids)

In [8]:
output.logits

tensor([[[ -37.0891,  -36.4551,  -40.3288,  ...,  -45.2598,  -43.0251,
           -37.6606],
         [-119.6817, -118.9799, -127.9672,  ..., -125.9709, -125.1079,
          -122.6017],
         [ -99.3594,  -98.1936, -102.1048,  ..., -101.8920, -104.3247,
           -99.3788],
         [ -79.1349,  -80.0958,  -81.9465,  ...,  -83.8376,  -85.3862,
           -77.1349]]], grad_fn=<UnsafeViewBackward0>)

In [9]:
output.logits.shape

torch.Size([1, 4, 50257])

In [12]:
output.logits[0].argmax(axis=-1)

tensor([ 318,  407, 5798,  534])

In [13]:
ids = output.logits[0].argmax(axis=-1)

In [14]:
tokenizer.decode(ids)

' is not responsibility your'

# **GPT-3 and GPT-4**

A ma connaissance, le code n'est pas encore disponible sur des plateformes comme HuggingFace, mais pour faire des tests avec DALL-E (Génération d'images à partir de textes) et GPT-4 (communiquer en donnant en entrée du texte et/ou de l'image), on peut se connecter sur : [GPT](https://platform.openai.com/docs/overview)

In [None]:
! pip install openai==0.28

In [None]:
import openai

openai.api_key = "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"

prompt = f"Ecris moi une conclusion d'un article qui parle de l'utilisation de l'API ChatGPT"

completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[{"role": "user", "content": prompt}]
)

print(completion['choices'][0]['message']['content'])

"user" (utilisateur) : Ce rôle est généralement attribué aux messages de l'utilisateur. Il contient les instructions, questions ou commentaires que l'utilisateur souhaite adresser à l'assistant.<br><br><br>

"assistant" (assistant) : Message qu'on donne à l'assistant, sur lequel il
se base pour donner une réponse.<br><br><br>

"system" (système) : Ce rôle est utilisé pour fournir des instructions spécifiques au système ou pour définir des paramètres pour le comportement de l'assistant. Par exemple, indiquer que l'assistant est "utile" peut influencer son comportement.

In [None]:
message1 = [
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Sure, here's one: Why did the scarecrow win an award? Because he was outstanding in his field!"}
]

message2 = [
    {"role": "user", "content": "Quel temps fait-il aujourd'hui?"},
    {"role": "assistant", "content": "Le temps aujourd'hui est ensoleillé avec une température maximale de 25 degrés Celsius."},
    {"role": "user", "content": "Devrais-je prendre un parapluie?"}
]



    GPT-3:
        Nom du modèle: text-davinci-003
        Description: GPT-3 est le modèle le plus puissant d'OpenAI. Il excelle dans une variété de tâches de traitement du langage naturel et de génération de texte.
        Utilisation recommandée: Tâches complexes nécessitant une compréhension approfondie du langage, création de contenu, traduction, questionnement et réponse avancés, etc.
        Tarification: Consultez le site d'OpenAI pour les détails sur la tarification.

    GPT-3.5:
        Nom du modèle: text-davinci-002
        Description: Version précédente de GPT-3, moins puissante que text-davinci-003.
        Utilisation recommandée: Tâches similaires à GPT-3, mais avec une puissance de calcul réduite.
        Tarification: Consultez le site d'OpenAI pour les détails sur la tarification.

    GPT-3.5 Turbo:
        Nom du modèle: text-davinci-003
        Description: Version optimisée pour la vitesse et la latence par rapport à GPT-3 standard.
        Utilisation recommandée: Conversations en temps réel, chatbots, applications nécessitant des réponses rapides.
        Tarification: Consultez le site d'OpenAI pour les détails sur la tarification.

    ChatGPT:
        Nom du modèle: gpt-3.5-turbo
        Description: Modèle spécifique pour les conversations et le chat en temps réel. Il offre une expérience fluide dans les échanges interactifs.
        Utilisation recommandée: Conversations, chat en direct, chatbots interactifs.
        Tarification: Consultez le site d'OpenAI pour les détails sur la tarification.

# **Other Chatbots to test**

On peut tester des chatbots autre que ChatGPT: [chatbots](https://24pm.com/intelligence-artificielle/ia-generative/965-les-llms-open-source-alternatifs-a-chatgpt-gpt)

# **Make ChatBot with GPT-2**

In [None]:
# Install transformer package
! pip install transformers

In [None]:
# Importer les packages nécessaires
import os
import csv
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer, GPT2LMHeadModel
import numpy as np

import json
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense,Embedding,GlobalAveragePooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

## **Load dataset**

In [None]:
# Charger le jeu de données
with open('intents.json') as file:
    data = json.load(file)


In [None]:
# Pour chaque question, on donne la réponse correspondante
items = []
for intent in data['intents']:
    for patterns in intent['patterns']:
        for responses in intent['responses']:
            items.append((patterns, responses))

items

[('Hi there', 'Hello, thanks for asking'),
 ('Hi there', 'Good to see you again'),
 ('Hi there', 'Hi there, how can I help?'),
 ('How are you', 'Hello, thanks for asking'),
 ('How are you', 'Good to see you again'),
 ('How are you', 'Hi there, how can I help?'),
 ('Is anyone there?', 'Hello, thanks for asking'),
 ('Is anyone there?', 'Good to see you again'),
 ('Is anyone there?', 'Hi there, how can I help?'),
 ('Hey', 'Hello, thanks for asking'),
 ('Hey', 'Good to see you again'),
 ('Hey', 'Hi there, how can I help?'),
 ('Hola', 'Hello, thanks for asking'),
 ('Hola', 'Good to see you again'),
 ('Hola', 'Hi there, how can I help?'),
 ('Hello', 'Hello, thanks for asking'),
 ('Hello', 'Good to see you again'),
 ('Hello', 'Hi there, how can I help?'),
 ('Good day', 'Hello, thanks for asking'),
 ('Good day', 'Good to see you again'),
 ('Good day', 'Hi there, how can I help?'),
 ('Bye', 'See you!'),
 ('Bye', 'Have a nice day'),
 ('Bye', 'Bye! Come back again soon.'),
 ('See you later', 'See

## **Load model**

In [None]:
# Charger GPT-2 et son tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='left') # padding_side = "left" est obligatoire si on utilise padding sur le tokenizer
model = TFGPT2LMHeadModel.from_pretrained('gpt2')

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [None]:
# get_lm_head donne l'information que le modèle à une tête de prédiction et on peut utiliser generate
model.get_lm_head

<bound method TFPreTrainedModel.get_lm_head of <transformers.models.gpt2.modeling_tf_gpt2.TFGPT2LMHeadModel object at 0x7e68e40c6c80>>

In [None]:
model.summary()

Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  124439808 
 er)                                                             
                                                                 
Total params: 124439808 (474.70 MB)
Trainable params: 124439808 (474.70 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## **Inference**

In [None]:
# token spécial qui permet d'arrêter la génération
tokenizer.eos_token


'<|endoftext|>'

In [None]:
# Mettre le tokenizer de padding à eos
tokenizer.pad_token = tokenizer.eos_token

# Créer deux textes pour exemples
text_list = ["i am a man", "the boy is here but,"]

# Tokenizer les deux textes avec tensorflow
tokenized_texts = tokenizer.batch_encode_plus(
    text_list,
    padding=True,        # Pad the sequences to the same length
    truncation=True,     # Truncate sequences to a maximum length
    return_tensors="tf"  # Return tf
)


input_ids = tokenized_texts["input_ids"]
attention_mask = tokenized_texts["attention_mask"]


In [None]:
# [50256, 50256,    72,   716,   257,   582] est complété avec 50256
input_ids

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[50256, 50256,    72,   716,   257,   582],
       [ 1169,  2933,   318,   994,   475,    11]], dtype=int32)>

In [None]:
# 0 pour ne pas tenir en compte les deux premiers tokens du premier vecteur
attention_mask

<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[0, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1]], dtype=int32)>

In [None]:
# Taille du vocabulaire
tokenizer.vocab_size

50257

In [None]:
# Prédire les ids avec generate
output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

In [None]:
output_ids

<tf.Tensor: shape=(2, 50), dtype=int32, numpy=
array([[50256, 50256,    72,   716,   257,   582,   508,   468,   587,
          257,   636,   286,   262,   995,   329,   625,  1542,   812,
           13,   314,   423,   587,   257,   636,   286,   262,   995,
          329,   625,  1542,   812,    13,   314,   423,   587,   257,
          636,   286,   262,   995,   329,   625,  1542,   812,    13,
          314,   423,   587,   257,   636],
       [ 1169,  2933,   318,   994,   475,    11,   314,  1101,   407,
         1654,   611,   339,   338,  1016,   284,   307,  1498,   284,
          466,   340,    13,   314,  1101,   407,  1654,   611,   339,
          338,  1016,   284,   307,  1498,   284,   466,   340,    13,
          314,  1101,   407,  1654,   611,   339,   338,  1016,   284,
          307,  1498,   284,   466,   340]], dtype=int32)>

In [None]:
# Decoder la génération du premier texte
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

i am a man who has been a part of the world for over 30 years. I have been a part of the world for over 30 years. I have been a part of the world for over 30 years. I have been a part


In [None]:
# Decoder la génération du deuxième texte
generated_text = tokenizer.decode(output_ids[1], skip_special_tokens=True)
print(generated_text)

the boy is here but, I'm not sure if he's going to be able to do it. I'm not sure if he's going to be able to do it. I'm not sure if he's going to be able to do it


## **Test training**

In [None]:
tokenizer.pad_token = tokenizer.eos_token

text_list = ["i am a man", "the boy is here"]


tokenized_texts = tokenizer.batch_encode_plus(
    text_list,
    padding=True,        # Pad the sequences to the same length
    truncation=True,     # Truncate sequences to a maximum length
    return_tensors="tf"  # Return tf
)

input_ids = tokenized_texts["input_ids"]
attention_mask = tokenized_texts["attention_mask"]

In [None]:
input_ids

<tf.Tensor: shape=(2, 4), dtype=int32, numpy=
array([[  72,  716,  257,  582],
       [1169, 2933,  318,  994]], dtype=int32)>

In [None]:
dec_output = model(input_ids)

In [None]:
dec_output.keys()

odict_keys(['logits', 'past_key_values'])

In [None]:
dec_output['logits'].shape

TensorShape([2, 4, 50257])

## **Create the chatbot**

In [None]:
# Séparer les inputs et outputs
input_ids = []
output_ids = []
for item in items:
    input_ids.append(item[0])
    output_ids.append(item[1])

In [None]:
# Afficher input
input_ids[:10]

['Hi there',
 'Hi there',
 'Hi there',
 'How are you',
 'How are you',
 'How are you',
 'Is anyone there?',
 'Is anyone there?',
 'Is anyone there?',
 'Hey']

In [None]:
# Afficher output
output_ids[:10]

['Hello, thanks for asking',
 'Good to see you again',
 'Hi there, how can I help?',
 'Hello, thanks for asking',
 'Good to see you again',
 'Hi there, how can I help?',
 'Hello, thanks for asking',
 'Good to see you again',
 'Hi there, how can I help?',
 'Hello, thanks for asking']

In [None]:
len(output_ids)

91

In [None]:
# Tokonizer les inputs et outputs
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer.batch_encode_plus(
    input_ids,
    padding='max_length',
    max_length=20,        # Pad the sequences to the same length
    truncation=True,     # Truncate sequences to a maximum length
    return_tensors="tf"  # Return tf
)
input_ids = input_ids["input_ids"]
output_ids = tokenizer.batch_encode_plus(
    output_ids,
    padding='max_length',
    max_length=20,
    truncation=True,     # Truncate sequences to a maximum length
    return_tensors="tf"  # Return tf
)
output_ids = output_ids["input_ids"]

In [None]:
# Afficher le vocabulaire
tokenizer.vocab_size

50257

In [None]:
input_ids[:10]

<tf.Tensor: shape=(10, 20), dtype=int32, numpy=
array([[17250,   612, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [17250,   612, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [17250,   612, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [ 2437,   389,   345, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [ 2437,   389,   345, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [ 2437,   389,   345, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256,

In [None]:
input_ids.shape

TensorShape([91, 20])

In [None]:
output_ids[:10]

<tf.Tensor: shape=(10, 20), dtype=int32, numpy=
array([[15496,    11,  5176,   329,  4737, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [10248,   284,   766,   345,   757, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [17250,   612,    11,   703,   460,   314,  1037,    30, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [15496,    11,  5176,   329,  4737, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [10248,   284,   766,   345,   757, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256],
       [17250,   612,    11,   703,   460,   314,  1037,    30, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256,

In [None]:
# Définir la taille du batch
batch_size = 16

# Créer un générateur
class CustomDataGenerator(tf.keras.utils.Sequence):
    def __init__(self, input_ids, output_ids, batch_size, shuffle=True):
        self.input_ids = input_ids
        self.output_ids = output_ids
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.indices = np.arange(len(input_ids))
        if self.shuffle:
            np.random.shuffle(self.indices)

    def __len__(self):
        return int(np.ceil(len(self.input_ids) / self.batch_size))

    def __getitem__(self, index):
        batch_indices = self.indices[index * self.batch_size:(index + 1) * self.batch_size]
        batch_input_ids = self.input_ids[batch_indices]
        batch_output_ids = self.output_ids[batch_indices]

        return (batch_input_ids, batch_output_ids)

    def on_epoch_end(self):
        if self.shuffle:
            np.random.shuffle(self.indices)

# Initialiser le générateur
data_generator = CustomDataGenerator(input_ids, output_ids, batch_size)



In [None]:
# Définir les paramètres d'apprentissage
num_epochs = 5
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
learning_rate = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=len(data_generator) * num_epochs,
    end_learning_rate=0.0,
)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

In [None]:
# Entraîner le modèle
model.fit(input_ids, output_ids, batch_size=batch_size,epochs=num_epochs)

# Sauvegarder le modèle
model.save_pretrained('chatbot_model')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
a = tokenizer.batch_encode_plus(
    ["hello"],
    padding=True,        # Pad the sequences to the same length
    truncation=True,     # Truncate sequences to a maximum length
    return_tensors="tf"  # Return tf
)['input_ids']

In [None]:
output_ids = model.generate(a, max_length=50, pad_token_id=tokenizer.eos_token_id, no_repeat_ngram_size=2)

In [None]:
tokenizer.decode(output_ids[0], skip_special_tokens=True)

'how are you can provide you provide Adverseverse'

In [None]:
#tokenizer = GPT2Tokenizer.from_pretrained("gpt2", padding_side="left")
#model = TFGPT2LMHeadModel.from_pretrained("chatbot_model")
#tokenizer.pad_token = tokenizer.eos_token

def test():
    while True:
        input_text = input("User: ")
        input_tokenized =  tokenizer.batch_encode_plus(
                [input_text],
                padding=True,        # Pad the sequences to the same length
                truncation=True,     # Truncate sequences to a maximum length
                return_tensors="tf"  # Return tf
        )
        input_ids = input_tokenized['input_ids']

        # Calculate the maximum length for generation based on the length of input_ids


        output_ids = model.generate(input_ids, max_length=50, pad_token_id=tokenizer.eos_token_id, no_repeat_ngram_size=2)

        output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        print("Bot:", output_text)


In [None]:
test()

User: hello
Bot: helloHello
User: Hi there
Bot: Hi there to provide you provide Adverseverse
User: how are you
Bot: how are you can provide you provide Adverseverse
User: no thanks
Bot: no thanks!!
User: how you could help me
Bot: how you could help me provide Adverseverse Ad Ad


KeyboardInterrupt: ignored

https://mrmaheshrajput.medium.com/how-to-build-an-intelligent-qa-chatbot-on-your-data-with-llm-or-chatgpt-d0009d256dce

# **Fine-tune GPT2 for specific domain and language**

In [15]:
! pip install git+https://github.com/keras-team/keras-nlp.git -q

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m950.8/950.8 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m16.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m475.2/475.2 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.5/5.5 MB[0m [31m46.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m442.0/442.0 kB[0m [31m37.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m81.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for keras-nlp (pyproject.toml) ... [?25l[?25hdone


In [18]:
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"
import keras_nlp
import tensorflow as tf
import keras_core as keras
import time

In [23]:
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

In [52]:
# On peut aussi utiliser le backbone juste pour la représentation des tokens, la sortie du decoder
# Il prend en entrée token_ids et padding_mask qu'on peut obtenir avec preprocessor
# voir obligatoirement la doc bien détaillé sur les approches de fit avec et sans label

backbone_decoder = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en")
backbone_decoder.summary()

In [34]:
gpt2_lm.summary()

In [25]:
test = preprocessor(["It is a test"])

In [29]:
type(test)

tuple

In [30]:
len(test)

3

In [31]:
test[0]

{'token_ids': Array([[50256,  1026,   318,   257,  1332, 50256,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0, 

In [32]:
test[1]

Array([[ 1026,   318,   257,  1332, 50256,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
      

In [33]:
test[2]

Array([[ True,  True,  True,  True,  True, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
      

In [46]:
help(gpt2_lm)

Help on GPT2CausalLM in module keras_nlp.models.gpt2.gpt2_causal_lm object:

class GPT2CausalLM(keras_nlp.models.generative_task.GenerativeTask)
 |  GPT2CausalLM(backbone, preprocessor=None, **kwargs)
 |  
 |  An end-to-end GPT2 model for causal language modeling.
 |  
 |  A causal language model (LM) predicts the next token based on previous
 |  tokens. This task setup can be used to train the model unsupervised on
 |  plain text input, or to autoregressively generate plain text similar to
 |  the data used for training. This task can be used for pre-training or
 |  fine-tuning a GPT-2 model, simply by calling `fit()`.
 |  
 |  This model has a `generate()` method, which generates text based on a
 |  prompt. The generation strategy used is controlled by an additional
 |  `sampler` argument on `compile()`. You can recompile the model with
 |  different `keras_nlp.samplers` objects to control the generation. By
 |  default, `"top_k"` sampling will be used.
 |  
 |  This model can option

In [43]:
predict = gpt2_lm(inputs = test[0])

In [45]:
predict

Array([[[ -43.43164 ,  -39.836376,  -43.065914, ...,  -54.087627,
          -54.3451  ,  -42.364388],
        [ -72.13415 ,  -72.012955,  -74.436134, ...,  -81.43657 ,
          -75.13473 ,  -73.7148  ],
        [-111.18003 , -112.136024, -118.76081 , ..., -118.91605 ,
         -116.65909 , -114.36863 ],
        ...,
        [ -94.296364,  -96.13812 ,  -95.53333 , ..., -104.58594 ,
         -103.9342  ,  -91.63178 ],
        [ -94.082924,  -95.92499 ,  -95.301315, ..., -104.340256,
         -103.687515,  -91.42145 ],
        [ -93.96155 ,  -95.79409 ,  -95.14514 , ..., -104.18474 ,
         -103.5395  ,  -91.29171 ]]], dtype=float32)

In [44]:
predict.shape

(1, 128, 50257)

In [None]:
# Generate a text
output = gpt2_lm.generate("The goal of apple company", max_length=200)
print("\nGPT-2 output:")
print(output)


GPT-2 output:
The goal of apple company, Inc. was to make a better apple.

"I don't know how you would describe it, but the apple is the best apple we ever tasted, so we're very pleased with it," said Steve Hirsch, Apple's president and CEO. "It is one of the best apples ever."

The company is now working with the U.S. Department of Agriculture to develop and sell the apple.

The company is currently testing its product on apples grown by farmers in the Midwest, and is also working with the U.S. Department of Agriculture to test its products on apples grown in the Midwest.

Apple's apple has been the focus of controversy since its introduction in 2007 and its reputation has been tarnished by its poor quality.

Apple's reputation has been tarnished by a series of lawsuits that it has filed against the U.S. government, which is investigating the company for fraud.

Apple is


In [None]:
# Generate a text
output = gpt2_lm.generate("steave jobs", max_length=200)
print("\nGPT-2 output:")
print(output)

## **Fine-tune GPT2**

In [None]:
import tensorflow_datasets as tfds

reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)

Downloading and preparing dataset 639.54 MiB (download: 639.54 MiB, generated: 141.46 MiB, total: 781.00 MiB) to /root/tensorflow_datasets/reddit_tifu/short/1.1.2...


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

Generating splits...:   0%|          | 0/1 [00:00<?, ? splits/s]

Generating train examples...:   0%|          | 0/79740 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/reddit_tifu/short/1.1.2.incompleteXTPWMX/reddit_tifu-train.tfrecord*...:  …

Dataset reddit_tifu downloaded and prepared to /root/tensorflow_datasets/reddit_tifu/short/1.1.2. Subsequent calls will reuse this data.


In [None]:
reddit_ds

<_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None))>

In [None]:
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break

b"me and a friend decided to go to the beach last sunday. we loaded up and headed out. we were about half way there when i decided that i was not leaving till i had seafood. \n\nnow i'm not talking about red lobster. no friends i'm talking about a low country boil. i found the restaurant and got directions. i don't know if any of you have heard about the crab shack on tybee island but let me tell you it's worth it. \n\nwe arrived and was seated quickly. we decided to get a seafood sampler for two and split it. the waitress bought it out on separate platters for us. the amount of food was staggering. two types of crab, shrimp, mussels, crawfish, andouille sausage, red potatoes, and corn on the cob. i managed to finish it and some of my friends crawfish and mussels. it was a day to be a fat ass. we finished paid for our food and headed to the beach. \n\nfunny thing about seafood. it runs through me faster than a kenyan \n\nwe arrived and walked around a bit. it was about 45min since we a

In [None]:
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

In [None]:
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

In [None]:
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

In [None]:
# SEARCH FOR TOP-K
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself,
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

## **Finetune on Chinese Poem Dataset**

In [None]:
!git clone https://github.com/chinese-poetry/chinese-poetry.git

In [None]:
import os
import json

poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
    with open(full_filename, "r") as f:
        content = json.load(f)
        poem_collection.extend(content)

paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]

In [None]:
print(paragraphs[0])

In [None]:
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes long, only take `500` and run 1
# epochs for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

In [None]:
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)

# **Annexe**

## **BERT**

Si le modèle prend en entrée input et label, cela peut être exploité pour le fine-tuner sans rajouter un autre modèle dérrière. Car desqu'on donne labels en argument de model(), toute méthode de Bert retourne en plus de la sortie une loss.

Si la méthode n'a pas utilisé labels, dans ce cas il ne retourne pas toujours loss, on peut prendre la sortie pour l'utiliser sur un autre réseau. (09) modèles spécialisés en tout.

### **BertModel**

**BertModel :** Cette classe représente le modèle de base BERT. Elle inclut les couches d'encodeur BERT et est utilisée pour des tâches spécifiques qui nécessitent uniquement la représentation des tokens en sortie de l'encodeur. Elle n'a pas de têtes de classification spécifiques attachées.

In [None]:
from transformers import AutoTokenizer, BertModel

import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

In [None]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [None]:
outputs.last_hidden_state.shape

torch.Size([1, 8, 768])

In [None]:
# 768: juste un flatten
outputs.pooler_output.shape

torch.Size([1, 768])

### **BertForPreTraining**

**BertForPreTraining :** Cette classe inclut, en plus des couches d'encodeur BERT, des têtes de pré-entraînement spécifiques pour les tâches de masquage de tokens (MLM - Masked Language Modeling) et de prédiction de la phrase suivante (NSP - Next Sentence Prediction). Elle est pré-entraînée sur des tâches de pré-entraînement spécifiques.

In [None]:
from transformers import AutoTokenizer, BertForPreTraining

import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

outputs = model(**inputs)

prediction_logits = outputs.prediction_logits

seq_relationship_logits = outputs.seq_relationship_logits

In [None]:
inputs

{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
outputs.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

In [None]:
outputs.prediction_logits

tensor([[[ -7.8962,  -7.8105,  -7.7903,  ...,  -7.0694,  -7.1693,  -4.3590],
         [ -8.4461,  -8.4401,  -8.5044,  ...,  -8.0625,  -7.9909,  -5.7160],
         [-15.2953, -15.4727, -15.5865,  ..., -12.9857, -11.7039, -11.4293],
         ...,
         [-14.0628, -14.2535, -14.3645,  ..., -12.7151, -11.1621, -10.2317],
         [-10.6576, -10.7892, -11.0402,  ..., -10.3233, -10.1578,  -3.7722],
         [-11.3383, -11.4590, -11.1767,  ...,  -9.2152,  -9.5209,  -9.5571]]],
       grad_fn=<ViewBackward0>)

In [None]:
# 30522 : taille du vocabulaire, donc le softmax permet de retrouver des mots
outputs.prediction_logits.shape

torch.Size([1, 8, 30522])

### **BertLMHeadModel**

**BertLMHeadModel :** est une classe spécifique dans la bibliothèque Hugging Face Transformers qui représente le modèle BERT avec une tête de langage (Language Modeling Head). Cette classe est spécifiquement conçue pour être utilisée dans des tâches de modélisation de langage et de génération de texte.

In [None]:
import torch

from transformers import AutoTokenizer, BertLMHeadModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = BertLMHeadModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

outputs = model(**inputs, labels=inputs["input_ids"])

loss = outputs.loss

logits = outputs.logits

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


In [None]:
inputs

{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
outputs.keys()

odict_keys(['loss', 'logits'])

In [None]:
outputs.logits

tensor([[[ -7.8962,  -7.8105,  -7.7903,  ...,  -7.0694,  -7.1693,  -4.3590],
         [ -8.4461,  -8.4401,  -8.5044,  ...,  -8.0625,  -7.9909,  -5.7160],
         [-15.2953, -15.4727, -15.5865,  ..., -12.9857, -11.7039, -11.4293],
         ...,
         [-14.0628, -14.2535, -14.3645,  ..., -12.7151, -11.1621, -10.2317],
         [-10.6576, -10.7892, -11.0402,  ..., -10.3233, -10.1578,  -3.7722],
         [-11.3383, -11.4590, -11.1767,  ...,  -9.2152,  -9.5209,  -9.5571]]],
       grad_fn=<ViewBackward0>)

In [None]:
outputs.logits.shape

torch.Size([1, 8, 30522])

In [None]:
# Générer du texte avec Bert

from transformers import BertTokenizer, BertLMHeadModel

"""
is_decoder=True." suggère que si vous utilisez BertLMHeadModel de manière indépendante (c'est-à-dire, sans modèle préentraîné spécifique),
vous devriez spécifier explicitement que le modèle est un décodeur en ajoutant le paramètre is_decoder=True lors de son initialisation.

Cela peut être dû au fait que BertLMHeadModel est dérivé de BertForCausalLM,
qui est un modèle décodeur pour des tâches de modélisation de langage causale, généralement utilisé pour des prédictions
de séquences dans le temps. L'ajout de is_decoder=True informe le modèle que vous l'utilisez en tant que décodeur dans un contexte plus large.
"""
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertLMHeadModel.from_pretrained('bert-base-uncased', is_decoder=True)

input_text = "Hello, this is a test sentence."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Génération de texte masqué
outputs = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=3, do_sample=True, temperature=0.8)
"""
Test avec:
outputs = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, do_sample=True, temperature=0.8)
outputs = model.generate(input_ids, max_length=50, num_beams=10, no_repeat_ngram_size=2, do_sample=True, temperature=0.8)
outputs = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=2, do_sample=True, temperature=0.8)
outputs = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=3, do_sample=True, temperature=0.8)
"""
# Décodage de la séquence générée
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)


hello, this is a test sentence.... and and and,, for for for,,, and and as as as and and or or or,, in in in and and a a a is is is be be be


### **BertForMaskedLM (a creuser pour le maskage)**

**BertForMaskedLM:** est une classe spécifique dans la bibliothèque Hugging Face Transformers qui est conçue pour la tâche de Masked Language Modeling (MLM). Cette classe est dérivée de la classe BertPreTrainedModel et inclut les couches nécessaires pour la tâche MLM, y compris la projection linéaire vers l'espace du vocabulaire.

In [None]:
from transformers import AutoTokenizer, BertForMaskedLM

import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")

with torch.no_grad():

    outputs = model(**inputs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
inputs

{'input_ids': tensor([[ 101, 1996, 3007, 1997, 2605, 2003,  103, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
outputs.keys()

odict_keys(['logits'])

In [None]:
outputs.logits

tensor([[[ -6.4346,  -6.4063,  -6.4097,  ...,  -5.7691,  -5.6326,  -3.7883],
         [-14.0119, -14.7240, -14.2120,  ..., -11.6976, -10.7304, -12.7617],
         [ -9.6561, -10.3125,  -9.7459,  ...,  -8.7782,  -6.6036, -12.6596],
         ...,
         [ -3.7861,  -3.8572,  -3.5644,  ...,  -2.5593,  -3.1093,  -4.3820],
         [-11.6598, -11.4274, -11.9266,  ...,  -9.8772, -10.2103,  -4.7594],
         [-11.7267, -11.7509, -11.8040,  ..., -10.5943, -10.9407,  -7.5151]]])

In [None]:
# La taille du vocabulaire
outputs.logits.shape

torch.Size([1, 9, 30522])

In [None]:
# index de [MASK]
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
mask_token_index

tensor([6])

In [None]:
# Recupérer l'argmax pour la prédiction de [MASK]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(axis=-1)
predicted_token_id

tensor([3000])

In [None]:
tokenizer.decode(predicted_token_id)

'paris'

In [None]:
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

# mask labels of non-[MASK] tokens

labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)

outputs = model(**inputs, labels=labels)

round(outputs.loss.item(), 2)

0.88

In [None]:
labels

tensor([[-100, -100, -100, -100, -100, -100, 3000, -100, -100]])

### **BertForNextSentencePrediction**

**BertForNextSentencePrediction** est une classe de la bibliothèque Hugging Face Transformers qui est spécifiquement conçue pour la tâche de prédiction de la phrase suivante (Next Sentence Prediction, ou NSP) dans le contexte de modèles BERT (Bidirectional Encoder Representations from Transformers). Cette tâche est souvent utilisée comme composant auxiliaire dans des modèles pré-entraînés comme BERT.

In [None]:
from transformers import BertTokenizer, BertForNextSentencePrediction

# Charger le tokenizer et le modèle
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

# Exemple de paires de phrases
sentence_1 = "The capital of France is Paris."
sentence_2_correct = "It is a beautiful city."
sentence_2_incorrect = "Bananas grow on trees."

# Tokeniser les phrases
inputs_correct = tokenizer(sentence_1, sentence_2_correct, return_tensors='pt')
inputs_incorrect = tokenizer(sentence_1, sentence_2_incorrect, return_tensors='pt')

# Prédire si la deuxième phrase suit logiquement la première
outputs_correct = model(**inputs_correct)
outputs_incorrect = model(**inputs_incorrect)

# Afficher les scores de prédiction
print("Score pour une suite logique :", outputs_correct.logits[0, 0].item())
print("Score pour une suite non logique :", outputs_incorrect.logits[0, 0].item())


Score pour une suite logique : 5.836156845092773
Score pour une suite non logique : 1.2631127834320068


In [None]:
inputs_correct

{'input_ids': tensor([[ 101, 1996, 3007, 1997, 2605, 2003, 3000, 1012,  102, 2009, 2003, 1037,
         3376, 2103, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
outputs_correct.keys()

odict_keys(['logits'])

In [None]:
outputs_correct.logits

tensor([[ 5.8362, -5.3472]], grad_fn=<AddmmBackward0>)

### **BertForSequenceClassification**

**BertForSequenceClassification** est une classe de la bibliothèque Hugging Face Transformers spécifiquement conçue pour la tâche de classification de séquences. Elle est basée sur l'architecture BERT (Bidirectional Encoder Representations from Transformers) et est pré-entraînée sur de grandes quantités de données textuelles. Cette classe permet de fine-tuner un modèle BERT pré-entraîné pour des tâches spécifiques de classification de texte.

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax

# Charger le tokenizer et le modèle pré-entraîné
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # Choisir le nombre d'étiquettes de sortie

# Exemple de texte à classifier
text = "BERT is a powerful tool for natural language processing tasks."

# Tokeniser le texte
inputs = tokenizer(text, return_tensors='pt')

# Obtenir les logits de sortie du modèle
outputs = model(**inputs)

# Appliquer la fonction softmax pour obtenir les probabilités de classe
probs = softmax(outputs.logits, dim=1)

# Afficher les probabilités de classe
print("Probabilités de classe :", probs)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Probabilités de classe : tensor([[0.5004, 0.4996]], grad_fn=<SoftmaxBackward0>)


In [None]:
inputs

{'input_ids': tensor([[  101, 14324,  2003,  1037,  3928,  6994,  2005,  3019,  2653,  6364,
          8518,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
outputs.keys()

odict_keys(['logits'])

In [None]:
outputs.logits

tensor([[-0.5120, -0.5137]], grad_fn=<AddmmBackward0>)

In [None]:
# METTRE LABELS DANS LE CALCUL PERMET D'AVOIR LOSS EN SORTIE
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

# Charger le tokenizer et le modèle pré-entraîné
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # Choisir le nombre d'étiquettes de sortie

# Exemple de texte à classifier
text = "BERT is a powerful tool for natural language processing tasks."

# Tokeniser le texte
inputs = tokenizer(text, return_tensors='pt')

# Étiquettes pour l'entraînement (supposons une tâche binaire)
labels = torch.tensor([1])  # Par exemple, classe positive

# Obtenir les logits de sortie du modèle
outputs = model(**inputs, labels=labels)

# La perte sera calculée automatiquement ici

# Appliquer la fonction softmax pour obtenir les probabilités de classe
probs = softmax(outputs.logits, dim=1)

# Afficher les probabilités de classe
print("Probabilités de classe :", probs)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Probabilités de classe : tensor([[0.5337, 0.4663]], grad_fn=<SoftmaxBackward0>)


In [None]:
outputs.keys()

odict_keys(['loss', 'logits'])

In [None]:
# Fine-tune
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from torch.nn.functional import softmax
import torch

# Charger le tokenizer et le modèle pré-entraîné
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)  # 3 classes de sortie

# Exemple de texte à classifier
text = "BERT is a powerful tool for natural language processing tasks."

# Créer un exemple de dataset
class CustomDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {"text": self.texts[idx], "label": self.labels[idx]}

# Exemple de jeu de données
train_texts = ["Texte de la classe 0", "Texte de la classe 1", "Texte de la classe 2"]
train_labels = [0, 1, 2]

# Créer une instance du dataset
train_dataset = CustomDataset(train_texts, train_labels)

# Exemple de fonction d'entraînement
def train(model, dataset, epochs=3, batch_size=1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    for epoch in range(epochs):
        for batch in dataloader:
            inputs = tokenizer(batch["text"], return_tensors='pt', padding=True, truncation=True)
            labels = torch.tensor(batch["label"]).unsqueeze(0)  # Ajouter une dimension pour le nombre de classes
            outputs = model(**inputs, labels=labels)

            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Entraîner le modèle
train(model, train_dataset)

# Exemple d'inférence
test_text = "BERT is amazing!"
test_input = tokenizer(test_text, return_tensors='pt')
test_output = model(**test_input)

# Obtenir les probabilités de classe avec softmax
probs = softmax(test_output.logits, dim=1)

# Afficher les probabilités de classe
print("Probabilités de classe :", probs)


### **BertForMultipleChoice**

**BertForMultipleChoice** est une classe spécifique de la bibliothèque Hugging Face Transformers qui est conçue pour la tâche du choix multiple. Cette tâche est souvent rencontrée dans des scénarios où vous avez plusieurs options et vous devez choisir la meilleure réponse parmi elles. Par exemple, les questions à choix multiples dans les examens standardisés ou les enquêtes en ligne sont des exemples de situations où BertForMultipleChoice pourrait être utilisé.

In [None]:
from transformers import AutoTokenizer, BertForMultipleChoice

import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."

choice0 = "It is eaten with a fork and a knife."

choice1 = "It is eaten while held in the hand."

labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)

outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1

# the linear classifier still needs to be trained

loss = outputs.loss

logits = outputs.logits

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
encoding

{'input_ids': tensor([[  101,  1999,  3304,  1010, 10733,  2366,  1999,  5337, 10906,  1010,
          2107,  2004,  2012,  1037,  4825,  1010,  2003,  3591,  4895, 14540,
          6610,  2094,  1012,   102,  2009,  2003,  8828,  2007,  1037,  9292,
          1998,  1037,  5442,  1012,   102],
        [  101,  1999,  3304,  1010, 10733,  2366,  1999,  5337, 10906,  1010,
          2107,  2004,  2012,  1037,  4825,  1010,  2003,  3591,  4895, 14540,
          6610,  2094,  1012,   102,  2009,  2003,  8828,  2096,  2218,  1999,
          1996,  2192,  1012,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1

In [None]:
labels

tensor([0])

In [None]:
loss

tensor(0.6923, grad_fn=<NllLossBackward0>)

In [None]:
logits

tensor([[0.7012, 0.6995]], grad_fn=<ViewBackward0>)

### **BertForTokenClassification - NER**

**BertForTokenClassification** est un modèle de la bibliothèque Hugging Face Transformers spécifiquement conçu pour la tâche de classification de jetons. Cette tâche est couramment utilisée dans le traitement du langage naturel (NLP) pour attribuer une étiquette ou une catégorie à chaque jeton dans une séquence de texte.

In [None]:
from transformers import AutoTokenizer, BertForTokenClassification

import torch

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

model = BertForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

inputs = tokenizer(

    "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"

)

with torch.no_grad():

    logits = model(**inputs).logits

predicted_token_class_ids = logits.argmax(-1)

# Note that tokens are classified rather then input words which means that

# there might be more predicted token classes than words.

# Multiple token classes might account for the same word

predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]

predicted_tokens_classes

labels = predicted_token_class_ids

loss = model(**inputs, labels=labels).loss

round(loss.item(), 2)

tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


0.01

In [None]:
logits

tensor([[[ 8.8331e+00, -2.6947e+00, -1.4335e+00, -2.0760e+00, -1.9929e+00,
          -1.5367e+00, -1.3373e-01, -2.4921e+00,  1.1420e+00],
         [ 1.8645e+00, -1.9180e+00,  6.6708e-01, -2.8445e+00, -1.8780e-01,
          -1.2112e+00,  4.3467e+00, -2.1351e+00, -6.0643e-01],
         [ 1.6579e+00, -2.9294e+00, -9.1255e-01, -2.8349e+00, -7.5518e-01,
          -9.7249e-01,  7.4999e+00, -2.6189e+00, -2.2596e-01],
         [ 1.7139e+00, -2.8050e+00, -6.7317e-01, -3.0918e+00, -2.3132e-02,
          -1.3786e+00,  7.2262e+00, -3.0614e+00, -5.5431e-01],
         [ 9.9437e+00, -2.3024e+00, -7.7941e-01, -2.6059e+00, -1.2730e+00,
          -1.4824e+00,  1.6615e+00, -2.5419e+00, -8.0900e-01],
         [ 1.0505e+01, -2.4093e+00, -8.5326e-01, -2.7351e+00, -1.2257e+00,
          -1.5433e+00,  1.2509e+00, -2.4653e+00, -8.2943e-01],
         [ 9.9881e+00, -2.7991e+00, -6.8294e-01, -2.8792e+00, -1.1324e+00,
          -1.8547e+00,  2.4081e+00, -2.6043e+00, -5.8434e-01],
         [ 1.0769e+01, -2.2194e+00

In [None]:
logits.shape

torch.Size([1, 13, 9])

In [None]:
predicted_token_class_ids

tensor([[0, 6, 6, 6, 0, 0, 0, 0, 0, 8, 0, 8, 8]])

In [None]:
predicted_tokens_classes


['O',
 'I-ORG',
 'I-ORG',
 'I-ORG',
 'O',
 'O',
 'O',
 'O',
 'O',
 'I-LOC',
 'O',
 'I-LOC',
 'I-LOC']

### **BertForQuestionAnswering**

***BertForQuestionAnswering** est un modèle de la bibliothèque Hugging Face Transformers spécialement conçu pour la tâche de question-réponse. Ce modèle est basé sur l'architecture BERT (Bidirectional Encoder Representations from Transformers) et est pré-entraîné sur de grandes quantités de données textuelles pour comprendre les relations entre les différents éléments du langage.

In [None]:
from transformers import AutoTokenizer, BertForQuestionAnswering

import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")

model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt")

with torch.no_grad():

    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()

answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

# target is "nice puppet"

target_start_index = torch.tensor([14])

target_end_index = torch.tensor([15])

outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)

loss = outputs.loss

round(loss.item(), 2)

tokenizer_config.json:   0%|          | 0.00/152 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


7.41

In [None]:
outputs.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

In [None]:
outputs.start_logits

tensor([[ 1.0374, -6.9344, -6.9556, -2.8814, -7.0174, -8.2111, -7.6869, -7.3035,
          0.8261, -4.2656, -5.2627,  0.3830,  7.0740,  5.2306,  5.6687, -7.3035]],
       grad_fn=<CloneBackward0>)

### **Tensorflow**

On peut refaire l'ensemble avec Tensorflow: https://huggingface.co/docs/transformers/v4.35.2/en/model_doc/bert#transformers.BertForQuestionAnswering

In [None]:
from transformers import AutoTokenizer, TFBertForMaskedLM

import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="tf")

logits = model(**inputs).logits

# retrieve index of [MASK]

mask_token_index = tf.where((inputs.input_ids == tokenizer.mask_token_id)[0])

selected_logits = tf.gather_nd(logits[0], indices=mask_token_index) # récupérer la ligne correspondant à [MASK]

predicted_token_id = tf.math.argmax(selected_logits, axis=-1)

tokenizer.decode(predicted_token_id)
'paris'

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


'paris'

In [None]:
logits

<tf.Tensor: shape=(1, 9, 30522), dtype=float32, numpy=
array([[[ -6.4346056,  -6.4063463,  -6.4097404, ...,  -5.7691364,
          -5.6326175,  -3.7882855],
        [-14.011929 , -14.724046 , -14.21197  , ..., -11.6976385,
         -10.730408 , -12.761751 ],
        [ -9.656149 , -10.312495 ,  -9.745864 , ...,  -8.77816  ,
          -6.603597 , -12.659605 ],
        ...,
        [ -3.7861161,  -3.8571935,  -3.5644362, ...,  -2.5592573,
          -3.1093237,  -4.3819623],
        [-11.659788 , -11.427394 , -11.926661 , ...,  -9.877242 ,
         -10.210293 ,  -4.759405 ],
        [-11.726652 , -11.750855 , -11.8039665, ..., -10.594333 ,
         -10.940655 ,  -7.5151224]]], dtype=float32)>

In [None]:
selected_logits

<tf.Tensor: shape=(1, 30522), dtype=float32, numpy=
array([[-3.7861161, -3.8571935, -3.5644362, ..., -2.5592573, -3.1093237,
        -4.3819623]], dtype=float32)>