<h1>Génération de Texte</h1>

Dans ce notebook, nous allons voir les techniques pour générer du texte avec la bibliothèque <code>transformers</code> de HuggingFace

On utilise également les bibliothèques pytorch et numpy.

In [3]:
import torch
import torch.nn.functional as F
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import random

charger le modèle

In [4]:
model_name = "gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
model.eval();

amorce et tokenisation

In [5]:
amorce = "I went to the restaurant last night. I ordered"
#amorce = "Hello, I am"

tokens = tokenizer.tokenize(amorce)
print("tokenization de l'amorce: ", tokens)

encoded_context = tokenizer.encode(amorce)
print("tokens correspondants: ", encoded_context)

tensor_encoded_context = torch.LongTensor(encoded_context).view(1,-1)

#liste des tokens visibles sur:
#https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json

tokenization de l'amorce:  ['I', 'Ġwent', 'Ġto', 'Ġthe', 'Ġrestaurant', 'Ġlast', 'Ġnight', '.', 'ĠI', 'Ġordered']
tokens correspondants:  [40, 1816, 284, 262, 7072, 938, 1755, 13, 314, 6149]


Création d'une prédiction

In [6]:
outputs = model(tensor_encoded_context)
logits = outputs.logits
print("shape des logits: ", logits.shape)

shape des logits:  torch.Size([1, 10, 50257])


Implémentation manuelle de greedy search

In [7]:
tokens_temp = encoded_context.copy()
tensor_tokens = torch.LongTensor(tokens_temp).view(1,-1)
sentence_length = 30

for i in range(sentence_length):
    with torch.no_grad():
            outputs = model(tensor_tokens)
            logits = outputs.logits
            
            tokens_temp.append(logits[0,-1].argsort()[-1].item())
            tensor_tokens = torch.LongTensor(tokens_temp).view(1,-1)
            
print("Tokens générés: {} \n".format(tokens_temp))
print("Texte généré: {}".format(tokenizer.decode(tokens_temp)))

Tokens générés: [40, 1816, 284, 262, 7072, 938, 1755, 13, 314, 6149, 262, 9015, 290, 266, 48501, 13, 314, 373, 407, 12617, 13, 383, 9015, 373, 5894, 290, 262, 266, 48501, 547, 407, 4713, 13, 314, 481, 407, 307, 8024, 13, 198] 

Texte généré: I went to the restaurant last night. I ordered the chicken and waffles. I was not impressed. The chicken was dry and the waffles were not fresh. I will not be returning.



Utilisation de l'implémentation de la bibliothèque <code>transformers</code> de la greedy search

In [8]:
greedy_output = model.generate(tensor_encoded_context, max_length=40)


print(tokenizer.decode(greedy_output[0].tolist()))
#Note: le 0 correspond à la séquence 0 de l'output. On ne lui demande de générer
#qu'une séquence donc il n'y a rien d'autre
#Note2: on convertit en liste, le tokenizer digère mal les tenseur torch

  "You have modified the pretrained model configuration to control generation. This is a"


I went to the restaurant last night. I ordered the chicken and waffles. I was not impressed. The chicken was dry and the waffles were not fresh. I will not be returning.



Utilisation de l'implémentation de la bibliothèque <code>transformers</code> de la beam search

In [9]:
beam_output = model.generate(tensor_encoded_context, max_length=30, num_beams=3)

print(tokenizer.decode(beam_output[0].tolist()))

I went to the restaurant last night. I ordered a steak, and it came with a side of fries. The steak was good, but the fries


Utilisation de l'implémentation de la bibliothèque <code>transformers</code> du sampling
avec les paramètres de préselection de distribution.
Par défaut: top_k=50, top_p=1 et temperature=1

In [21]:
sampling_output = model.generate(tensor_encoded_context, do_sample=True, max_length=50,\
                                 top_k=50257, top_p=0.05, temperature=10.0, num_return_sequences=3)

for i in range(sampling_output.shape[0]):
        print("exemple {}: {}\n".format(i,tokenizer.decode(sampling_output[i].tolist())))

exemple 0: I went to the restaurant last night. I ordered Rajiri Ayton or Pinotes style Mayadellete!! Who Draniesvex postsic changed'may let my-footmaid ke# around unless kindly I ke???? where ma good!!"

exemple 1: I went to the restaurant last night. I ordered mis at ramstarkstein christurch kitchen standardand inside upon l cleaningdex seight bag fo various delivery off lines A16 MP €$$$$ thanor single while dis showed if having dough make

exemple 2: I went to the restaurant last night. I ordered Maâ boreíuu < Mangled Cream Old Jea Virgin Bloda <- coconut sausage ladis (< crackedbread Greek mixture kinda brobr ^ who belink version)" <@Devertsuct&



Génération de texte complètement aléatoire:

In [23]:
tokens_temp = encoded_context.copy()
tensor_tokens = torch.LongTensor(tokens_temp).view(1,-1)
outputs = model(tensor_tokens)
random_ranks = []

#En réduisant le top_k, on peut parvenir à obtenir quelque chose de plus cohérent 
# que la valeur 50257 qui considère TOUS les tokens de la distribution
top_k = 10

for i in range(50):
    with torch.no_grad():
            outputs = model(tensor_tokens)
            logits, past = outputs.logits, outputs.past_key_values
            
            random_rank = np.random.randint(top_k)
            tokens_temp.append(logits[0,-1].argsort()[-random_rank].item())
            tensor_tokens = torch.LongTensor(tokens_temp).view(1,-1)
            random_ranks.append(random_rank)
            
print('charabia généré: {}\n'.format(tokenizer.decode(tokens_temp)))
print('Rang des probas utilisées: {}'.format(random_ranks))

charabia généré: I went to the restaurant last night. I ordered some pasta salad with shrimp andi...
A few weeks and a lot changed around my family�. The is very good but i donembedreportprint i am. My mom told i have been getting better and i think the reason i feel

Rang des probas utilisées: [5, 5, 6, 4, 7, 1, 0, 6, 6, 2, 7, 2, 1, 9, 1, 2, 6, 8, 3, 7, 0, 2, 4, 0, 6, 8, 2, 4, 7, 9, 0, 8, 4, 0, 6, 3, 7, 6, 9, 3, 3, 7, 5, 1, 2, 5, 5, 2, 4, 7]


Visualisation des distributions et de l'effet de la température

In [None]:
import matplotlib.pyplot as plt
import numpy as np
# if using a Jupyter notebook, includue:
%matplotlib inline
x = np.random.normal(size=10)
print(x)


In [None]:
temperature = 5

softmax_temp = np.exp(x/temperature)/np.exp(x/temperature).sum()
x_axis = np.arange(x.shape[0])

print(softmax_temp, softmax_temp.sum())
plt.bar(x_axis, softmax_temp)
plt.ylabel('Probability')
plt.xlabel('Token id')
plt.title('Temperature {}'.format(temperature))
plt.show()

In [None]:
softmax_temp1 = np.exp(x)/np.exp(x).sum()
cumsum = np.sort(softmax_temp1).cumsum()
x_axis = np.arange(x.shape[0])

plt.bar(x_axis, cumsum)
plt.ylabel('Cumulative Probability')
plt.title('Temperature 1')
plt.show()