<a href="https://colab.research.google.com/github/bhargavakusuma/CSV-Agent/blob/main/5_1_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with HuggingFace - GPT2

In [None]:
In this notebook, I will explore text generation using a GPT-2 model, which was trained to predict the next word on 40GB of Internet text data.

In [2]:
!pip install transformers



In [3]:
#for reproducibility
SEED = 34

#maximum number of tokens in the output text (the prompt counts toward this limit)
MAX_LEN = 70

In [4]:
input_sequence = "I don't know about you, but there's only one thing I want to do after a long day of work"

In [7]:
#get transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

#get large GPT2 tokenizer and GPT2 model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

#view model parameters
GPT2.summary()

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  774030080 
 er)                                                             
                                                                 
Total params: 774030080 (2.88 GB)
Trainable params: 774030080 (2.88 GB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
# II. Different Decoding Methods
First Pass (Greedy Search)
With greedy search, the word with the highest probability is selected as the next word at every time step, i.e. the next word is updated as $w_t = \operatorname{argmax}_w P(w \mid w_{1:t-1})$.
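To make the idea concrete, here is a minimal sketch of greedy decoding on a toy next-word table. `TOY_MODEL` and `greedy_decode` are hypothetical names and the probabilities are made up for illustration; this is not how `GPT2.generate` is implemented internally.

```python
# Hypothetical toy "language model": maps the previous word to a
# probability distribution over possible next words (made-up values).
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
}


def greedy_decode(start, steps):
    """Pick the single most probable next word at every step."""
    words, prob = [start], 1.0
    for _ in range(steps):
        dist = TOY_MODEL.get(words[-1])
        if dist is None:  # no known continuation for this word
            break
        next_word = max(dist, key=dist.get)  # argmax over next words
        prob *= dist[next_word]
        words.append(next_word)
    return words, prob


greedy_words, greedy_prob = greedy_decode("The", steps=2)
print(" ".join(greedy_words), round(greedy_prob, 2))
```

In this toy table the greedy path is "The nice woman", even though "The dog has" has a higher overall probability (0.4 * 0.9 = 0.36 versus 0.5 * 0.4 = 0.2). Greedy search can miss a high-probability word hidden behind a less likely one.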

In [8]:
#get deep learning basics
import tensorflow as tf
tf.random.set_seed(SEED)

In [9]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')

# generate text until the output length (which includes the context length) reaches MAX_LEN tokens
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: go to the gym.

I'm not talking about the gym that's right next to my house. I'm talking about the gym that's right next to my office.

I'm not talking about the gym that


In [None]:
# Beam Search with N-Gram Penalties
#Beam search is essentially greedy search, except that the model keeps the num_beams most likely hypotheses at each time step, so it can compare alternative paths as it generates text. We can also include an n-gram penalty by setting no_repeat_ngram_size = 2, which ensures that no 2-gram appears twice. We will also set num_return_sequences = 5 so we can see what all 5 beams look like
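As a rough illustration of how keeping several hypotheses changes the result, the sketch below runs beam search over a small made-up next-word table. `TOY_MODEL` and `beam_search` are hypothetical names, and the n-gram penalty is left out to keep the example short; the real `generate` implementation is considerably more involved.

```python
import math

# Hypothetical toy "language model" with made-up probabilities.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring hypotheses at every step."""
    beams = [([start], 0.0)]  # each hypothesis is (words, log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            dist = TOY_MODEL.get(words[-1])
            if dist is None:  # cannot be extended, carry it over unchanged
                candidates.append((words, score))
                continue
            for word, p in dist.items():
                candidates.append((words + [word], score + math.log(p)))
        # prune: keep only the best num_beams hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return [(words, math.exp(score)) for words, score in beams]


beams = beam_search("The", steps=2, num_beams=2)
for words, prob in beams:
    print(" ".join(words), round(prob, 2))
```

With two beams the top hypothesis is "The dog has" (probability 0.36), which a single beam would never reach because "dog" is not the most likely first step. Setting `num_beams=1` reduces this sketch to greedy search.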

#To use Beam Search, we need only modify some parameters in the generate function:

In [10]:
# set num_return_sequences > 1
beam_outputs = GPT2.generate(
    input_ids,
    max_length = MAX_LEN,
    num_beams = 5,
    no_repeat_ngram_size = 2,
    num_return_sequences = 5,
    early_stopping = True
)

print('')
print("Output:\n" + 100 * '-')

# now we have 5 output sequences
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's not a good movie. I mean, it's
1: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a girl
2: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," you say. "But you're not going to like this one. It's about a guy who has a crush on a woman
3: I don't know about you, but there's only one thing I want to do after a long day of work, and that's to sit down and watch a movie."

"I know, I know," 

In [None]:
# Basic Sampling
Now we will explore non-deterministic decoding: sampling. Instead of following a fixed path to the text with the highest probability, we randomly pick the next word according to its conditional probability distribution:
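A small standalone sketch of what sampling with a temperature does, using only the Python standard library. The vocabulary and logits are made up, and `softmax_with_temperature` is a hypothetical helper rather than a function from `transformers`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-word candidates and scores (made-up values).
vocab = ["gym", "movie", "beer", "xylophone"]
logits = [2.0, 1.5, 1.0, -1.0]

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_cool = softmax_with_temperature(logits, temperature=0.8)

rng = random.Random(34)  # seeded like SEED in this notebook, for repeatability
next_word = rng.choices(vocab, weights=probs_cool, k=1)[0]

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_cool])
print(next_word)
```

With a temperature below 1, the most likely word gains probability and the least likely word loses it, so unlikely candidates are drawn less often while the choice stays random.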

In [11]:
# a temperature below 1 sharpens the distribution, making low probability candidates less likely; top_k = 0 disables Top-K filtering
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 0,
                             temperature = 0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work."

"Hmm. Must be quite the choice of words."

"Well, it's not a choice of words, but a need. I can't find the right answer until I find my answer."

"


In [None]:
#Top-K Sampling
#In Top-K sampling, the k most likely next words are selected and the entire probability mass is redistributed among these k words. So instead of increasing the chances of high probability words occurring and decreasing the chances of low probability words, we just remove low probability words altogether

#We just need to set top_k to however many of the top words we want to consider for our conditional probability distribution:
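The filtering step described above can be sketched in a few lines. `top_k_filter` is a hypothetical helper and the probabilities are made up; it only illustrates the "keep k words, then renormalize" idea.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


# Hypothetical next-word distribution (made-up values).
next_word_probs = {"gym": 0.5, "movie": 0.3, "beer": 0.15, "xylophone": 0.05}

filtered = top_k_filter(next_word_probs, k=2)
for word, p in filtered.items():
    print(word, round(p, 3))
```

Here only "gym" and "movie" survive, and their probabilities are rescaled to sum to 1, so "beer" and "xylophone" can no longer be sampled.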

In [13]:
#sample from only top_k most likely words
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: sit down for long, uninterrupted hours in front of a computer and play a game! And today, it turns out, that's exactly what I'll do. It's a new Android game called Pocket Heroes' World by Pocket Heroes ...


In [None]:
#Top-P Sampling
#Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the k most likely words, we choose the smallest set of words whose total probability is larger than p, and then the entire probability mass is redistributed among the words in this set

#The main difference here is that with Top-K sampling the size of the set of words is fixed, whereas in Top-P sampling the size of the set can change. To use this sampling method, we just set top_k = 0 and choose a value for top_p:
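A minimal sketch of the nucleus filter, assuming made-up distributions and a hypothetical helper named `top_p_filter`. It keeps adding words in order of probability until the running total exceeds p.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of words whose cumulative probability exceeds p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:  # nucleus is complete
            break
    # renormalize so the kept probabilities sum to 1
    return {word: prob / cumulative for word, prob in kept}


# Two hypothetical next-word distributions (made-up values).
peaked = {"gym": 0.85, "movie": 0.1, "beer": 0.04, "xylophone": 0.01}
flat = {"gym": 0.4, "movie": 0.3, "beer": 0.2, "xylophone": 0.1}

nucleus_peaked = top_p_filter(peaked, p=0.8)
nucleus_flat = top_p_filter(flat, p=0.8)
print(len(nucleus_peaked), list(nucleus_peaked))
print(len(nucleus_flat), list(nucleus_flat))
```

With the same p = 0.8, the peaked distribution keeps a single word while the flatter one keeps three, so the candidate set adapts to how confident the model is.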

In [14]:
#sample only from the smallest set of words whose cumulative probability exceeds 80%
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_p = 0.8,
                             top_k = 0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work: drink a beer. I don't have to be able to read and write to drink one. My parents taught me to do that before I was four, and I think I got the hang of it pretty well.

Advertisement ...


In [None]:
#Top-K and Top-P Sampling
#As you may have guessed, we can use Top-K and Top-P sampling together. This reduces the chances of getting weird (low probability) words while allowing for a dynamic selection size. We need only set a value for both top_k and top_p. We can even include the temperature parameter from before if we want to. Let's now see how our model performs after putting everything together. We will return 5 sequences to see how diverse our answers are:
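One plausible way to chain the three knobs is sketched below: temperature first, then Top-K, then Top-P, then a random draw. `sample_next_word` is a hypothetical helper, the logits are made up, and the ordering is an assumption for illustration rather than a statement about how `generate` is implemented.

```python
import math
import random


def sample_next_word(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one word after temperature scaling, Top-K and Top-P filtering."""
    # temperature-scaled softmax over the raw scores
    scaled = {w: x / temperature for w, x in logits.items()}
    m = max(scaled.values())
    exps = {w: math.exp(x - m) for w, x in scaled.items()}
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    # Top-K: keep at most top_k candidates (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(e for _, e in ranked)
    ranked = [(w, e / total) for w, e in ranked]  # renormalize
    # Top-P: keep the smallest prefix whose cumulative probability exceeds top_p
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative > top_p:
            break
    words = [w for w, _ in kept]
    weights = [p for _, p in kept]  # choices() does not need normalized weights
    return rng.choices(words, weights=weights, k=1)[0], words


# Hypothetical next-word scores (made-up values).
logits = {"gym": 2.0, "movie": 1.5, "beer": 1.0, "nap": 0.5, "xylophone": -2.0}

word, candidates = sample_next_word(
    logits, temperature=0.7, top_k=3, top_p=0.85, rng=random.Random(34)
)
print(candidates, word)
```

In this example Top-K first trims five words to three, and Top-P then trims those to two, so the final draw is made only among the most plausible candidates.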

In [15]:
#combine both sampling techniques and double the length to test how long the generated text stays coherent
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = 2*MAX_LEN,
                              #temperature = .7,
                              top_k = 50,
                              top_p = 0.85,
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: I don't know about you, but there's only one thing I want to do after a long day of work, and that is go to the bathroom." The only thing I want to do after a long day of work, and that is go to the bathroom.

What is your goal as an author? Is it just to write better books?

To write better books. Writing is about the journey, the creation, the writing. It's not about making it easy, or finding a good format, it's about the journey. There are two main things you can do when you write a book, and that's the writing of the book itself and then you have to have...

1: I don't know about you, but there's only one thing I want to do after a long day of work. I want to watch a movie. And since I'm here in America, I just had to take a road trip. It's not my usual choice, but I couldn't get my own way. I'm on my own. That's why I'm here."

"That's not the only reason, of course, bu

In [16]:
MAX_LEN = 150

In [17]:
prompt1 = 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.'

input_ids = tokenizer.encode(prompt1, return_tensors='tf')

In [18]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #test how long the generated text stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

A study by University of the Andes researcher Francisco Cisneros and colleagues has found evidence that unicorns speak a perfect version of English that humans can understand. The researchers discovered that the wild unicorns communicate by using only the four basic sounds – "toot" for a click, "chow" for a swallow and "wah" for a woof. This suggests that they have developed a language to communicate in, which we cannot speak.

Researchers believe that these unicorns were raised in a...



In [19]:
prompt2 = 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.'

input_ids = tokenizer.encode(prompt2, return_tensors='tf')

In [21]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #test how long the generated text stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry. As they approached, they were immediately met with the first orc attack of the day, a small number of orcs charged towards them, their arrows piercing through their shields. The battle was short-lived, however, as the two hobbits quickly disarmed and brought down the charging warriors. As a result of their quick thinking, they were able to kill two of the orc warlords before they were able to kill their own. After the battle, the hobbits had killed over 50 orcs in a single battle. With a score of around 100 kills, the two hobbits had earned the rank of Grand Champion.

Afterwards, the two hobbits...



In [20]:
prompt3 = 'Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.'

input_ids = tokenizer.encode(prompt3, return_tensors='tf')

In [22]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,                              #test how long the generated text stays coherent
                              #temperature = .8,
                              top_k = 50,
                              top_p = 0.85
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry. Behind them, the orcs raised their own weapons and fought. But before the battle could be waged, a huge, terrible beast had appeared. It roared and bellowed and roared, and the battle was over.

The orc that had fought the trolls had been slain. But the orc who had fought the trolls had not been slain. The orcs fought and battled with their weapons and their spells, but the creature that had attacked them was not the orc they had been fighting. The creature was not even a troll. The orc was something much more horrible than the trolls. And it was coming toward them.

The battle between the orcs and the...

