I will explore text generation using a GPT-2 model, which was trained to predict next words on 40GB of Internet text data. The fully trained model is actually not available as the creators were concerned about 'malicious applications of the technology', but there is a much smaller version that is available for enthusiants to play with, which we will use here

In [1]:
# let start of import our Hugging face 


!pip install transformers 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m59.8 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m67.8 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0


In [2]:
from transformers import GPT2Tokenizer 
from transformers import TFGPT2LMHeadModel 


tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2LMHeadModel.from_pretrained('gpt2-large' , pad_token_id = tokenizer.eos_token_id)






Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

Downloading (…)"tf_model.h5";:   0%|          | 0.00/3.10G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [3]:
import tensorflow as tf 
tf.random.set_seed(43)
input_sentence = "I don't know about you, but there's only one thing I want to do after a long day of work" 

In [4]:
input_ids = tokenizer.encode(input_sentence , return_tensors ='tf')
output = model.generate(input_ids , max_length=70 ) 
### our result 
print("********** our result ************\n\n")
print(tokenizer.decode(output[0] , skip_special_tokens =True)  )



********** our result ************


I don't know about you, but there's only one thing I want to do after a long day of work: go to the gym.

I'm not talking about the gym that's right next to my house. I'm talking about the gym that's right next to my office.

I'm not talking about the gym that


In [5]:
### here we will collect all steps in one function to make it easy 
# first we use tokenizer to get input ids .  from it name that mean now we have ids of our input words 
# then we take this as input for our model to generate text 
# we have to define our maximum length for our predict 
# then we do some operations to get clean text .
max_length =50
def generate_text( inputs , model= model , max_length =max_length , tokenizer =tokenizer):
  
  
  input_ids = tokenizer.encode(inputs , return_tensors ='tf')
  output = model.generate(input_ids , max_length=70 ) 
  words = tokenizer.decode(output[0] , skip_special_tokens =True)
  sents = [sent for sent in words.split('\n')]
  final_sents =[]
  for i in range(len(sents)):
    if i %2 ==0 :     ### we chosse even numbers that have accual text  
      final_sents.append(sents[i])
  text = ' '.join(final_sents)
  return words

In [6]:
## let test our function 

# determine max length 
max_length = 256 
text_input ='model what mean'  

final_text = generate_text(inputs = text_input ,max_length = max_length)


In [7]:
final_text

'model what mean to you.\n\nThe first thing you need to do is to create a new project.\n\n$ git init\n\nThis will create a new directory called "my-project" and a new file called "my-project.yml".\n\nThe first thing you need to do is to create a new project.'

as we can see, our model starts repeating itself rather quickly. The main issue with Greedy Search is that words with high probabilities can be masked by words in front of them with low probabilities, so the model is unable to explore more diverse combinations of words. We can prevent this by implementing Beam Search:

# Beam Search 
Beam search is essentially Greedy Search  - but the model tracks and keeps num_beams of hypotheses at each time step, so the model is able to compare alternative paths as it generates text. We can also include a n-gram penalty by setting **no_repeat_ngram_size** = 2 which ensures that no 2-grams appear twice. We will also set **num_return_sequences** = 5 so we can see what the other 5 beams looked like

To use Beam Search, we need only modify some parameters in the generate function:

In [8]:
input_sentence = "I don't know about you, but there's only one thing I want to do after a long day of work"  
max_length = 70
inputs_ids = tokenizer.encode(input_sentence ,return_tensors='tf')
Beam_output = model.generate(  
                              inputs_ids ,
                             max_length = max_length ,
                             no_repeat_ngram_size =2 ,
                             num_return_sequences = 5 ,
                             num_beams = 5 , ### this option beam search to avoid repeated method to creat text ---> it make model chose differents ways to generate text 
                             early_stopping = True
 )

In [9]:
#now we have 5 sequences in Beam_output 
# let show it :


print('*'*10 , 'result' ,"*"*10)
print('-'*50)


for i , beam in enumerate(Beam_output):
  print(f'{i+1} : {tokenizer.decode(beam , skip_special_tokens =True)}')


********** result **********
--------------------------------------------------
1 : I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's not a good movie. I mean, it's
2 : I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a girl
3 : I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a woman
4 : I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not g

**of course, Beam Search is not perfect either. It works well when the legnth of the generated text is more or less constant**

# Basic Sampling 
so we can include the ***temperature*** parameter which increases the chances of high probability words and decreases the chances of low probability words in the sampling:

We just need to set **do_sample** = True to implement sampling and for demonstration purposes (you'll shortly see why) we set **top_k** = 0:

In [10]:
Sample_output = model.generate(  
    
                                inputs_ids ,
                               max_length = max_length ,
                               do_sample = True ,
                               top_k = 0 , 
                               temperature = 0.8 ,

)

print('*'*10 , 'result' ,'*'*10)
print('-'*50)
print(tokenizer.decode(Sample_output[0] , skip_special_tokens =True))

********** result **********
--------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of workman's paradise: eat a pork chop in a chile relleno. That's right. Pork chops? I guess that's not a thing. Cameron: No, it's not a thing. Rachel: I know right?


# Top_k 
Top-K Sampling
In Top-K sampling, the top k most likely next words are selected and the entire probability mass is shifted to these k words. So instead of increasing the chances of high probability words occuring and decreasing the chances of low probabillity words, we just remove low probability words all together

We just need to set top_k to however many of the top words we want to consider for our conditional probability distribution:

In [11]:
top_k_output = model.generate( 
                                inputs_ids ,
                              max_length = max_length ,
                              do_sample =True ,
                              top_k =50 ,
                              temperature = 0.8
)



print('*'*10 , 'result' ,'*'*10)
print('-'*50)
print(tokenizer.decode(top_k_output[0] , skip_special_tokens =True))


********** result **********
--------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: curl up with a good book and enjoy a good night's sleep. No distractions, no worries. No worries about anything.

And then, once I get that good night's sleep, I'll get right back to work


**Top-K Sampling seems to generate more coherent text than our random sampling before. But we can do even better:**

# top-p 
Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the top k most likely wordsm we choose the smallest set of words whose total probability is larger than p, and then the entire probability mass is shifted to the words in this set

The main difference here is that with Top-K sampling, the size of the set of words is static (obviously) whereas in Top-P sampling, the size of the set can change. To use this sampling method, we just set ***top_k*** = 0 and choose a value **top_p**:

In [12]:
top_p_output = model.generate( 
                              inputs_ids ,
                              max_length = max_length ,
                              do_sample = True ,
                              top_k=0 ,
                              top_p =0.8 ,


)



print('*'*10 , 'result' , '*'*10)
print('-'*50)
print(tokenizer.decode(top_p_output[0] , skip_special_tokens =True))




********** result **********
--------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. And that's to put the excitement back in business. So, if you have time to spare, maybe I can help. The more work I do for myself, the more I want to help others. So, how can I


let combine all hyperparameters togethor to show our result.....

In [14]:
all_output = model.generate( 
                              inputs_ids ,
                              max_length = max_length ,
                              do_sample =True ,
                              top_k =50 ,
                              top_p =0.85 ,
                              num_return_sequences =5
)



print('*'*10 , 'result' , '*'*10)
print('-'*50) 

for i , sample in enumerate(all_output):
  print(f'{i+1} : {tokenizer.decode(sample , skip_special_tokens =True)}')



********** result **********
--------------------------------------------------
1 : I don't know about you, but there's only one thing I want to do after a long day of work: sit in a corner and look up at the ceiling.

But now, for the first time ever, we've got a solution. Just click the button below, and we'll send you a personalized note to the ceiling of
2 : I don't know about you, but there's only one thing I want to do after a long day of work: get out and get a nice long hot shower. Here's some information to help you do that.

You've Got to Do It. I know I don't always make it a habit of reading about how to have a
3 : I don't know about you, but there's only one thing I want to do after a long day of work: relax. So I made a simple, healthy, low-carb keto snack that's sure to leave you satisfied in no time!

Ingredients:

1/2 cup of your favorite low-carb, low-
4 : I don't know about you, but there's only one thing I want to do after a long day of work. So I've decided to do s

In [28]:
#  let test our model with Benchmark Prompts



text = '''In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, 
in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.''' 




max_length = 256 
inputs_ids = tokenizer.encode(text , return_tensors ='tf')

prompt_out = model.generate( 
                     inputs_ids ,
                     max_length = max_length ,
                     do_sample = True ,
                     top_k = 50 ,
                     top_p = 0.85 

)



print('*'*10 , 'result' , '*'*10)
print('-'*50) 
for i  ,prom in enumerate(prompt_out):

  print(f'{i+1}:{tokenizer.decode(prom , skip_special_tokens=True)}') 




********** result **********
--------------------------------------------------
1:In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, 
in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. "This is one of the biggest breakthroughs in animal communication ever," said Dr. Peter A. Williams, professor of wildlife science at the University of Colorado at Boulder, who co-authored the study. "The animals are showing a very high level of communication ability. They may be the first animals to communicate in a language, which is pretty unusual." Dr. Williams' study is published in the journal Animal Behaviour.

The study found that the animals were communicating in English, not Spanish. The creatures are the descendants of the wild llamas, which were imported to South America thousands of years ago and domesticated. The llamas had been imported to the US from Mexico, and

In [30]:
prompt2 = 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.' 
inputs_ids = tokenizer.encode(prompt2 , return_tensors = 'tf')


out2 = model.generate(
                      inputs_ids ,
                      max_length = max_length ,
                      do_sample =True ,
                      top_k=50 ,
                      top_p =0.85
)



print('*'*10 , 'result' , '*'*10)
print('-'*50) 
for i  ,prom in enumerate(out2):
  print(f'{i+1}:{tokenizer.decode(prom , skip_special_tokens=True)}') 

********** result **********
--------------------------------------------------
1:Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today. The singer was also spotted walking home wearing a pair of blue cargo pants and an orange hoodie.

A photo posted by Miley Cyrus (@mileycyrus) on Jul 9, 2016 at 8:54pm PDT

PHOTOS: Celebrity mugshots

The 26-year-old has already started work on her new music video, but this latest photo is likely to have caused a lot of confusion for her fans.

PHOTOS: Celebs gone wild

Cyrus has been spotted in recent weeks rocking a pair of black leather boots, a white t-shirt and red high-tops. The singer has also been spotted wearing a black turtleneck and pink pants.

The singer was spotted heading out to lunch in Malibu on Saturday, which was reportedly followed by a few "bonded" fans.

PHOTOS: Celebrity tattoos

A video posted by Miley Cyrus (@mileycyrus) on Jul 8, 2016 at 7:34pm PDT

PHOTOS: Miley Cyrus' best moments

After