<a href="https://colab.research.google.com/github/docheem/NLP-Portfolio/blob/main/PR_TXT_Gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation
- Implementing Greedy Search Decoding
- Implementing Beam Search Decoding
- Sampling Methods
    - Top-k
    - Nucleus Sampling
    


# Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model’s continuous output is to greedily select the token with the highest probability at each timestep.

In [None]:
!pip install -q transformers
!pip install -q torch
!pip install datasets
!pip install sentencepiece
!nvidia-smi



We’ll implement a decoding method ourselves to see what goes on under the hood. We’ll use “Transformers are the” as the input prompt and run the decoding for eight timesteps.

At each timestep, we pick out the model’s logits for the last token in the prompt and wrap them with a softmax to get a probability distribution. We then pick the next token with the highest probability, add it to the input sequence, and run the process again.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "gpt2-xl"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


In [None]:

import pandas as pd


input_txt = "Transformers are the"

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)

iterations = []

n_steps = 8

choices_per_step = 5



with torch.no_grad():

    for _ in range(n_steps):

        iteration = dict()

        iteration["Input"] = tokenizer.decode(input_ids[0])

        output = model(input_ids = input_ids)


        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]

        next_token_probs = torch.softmax(next_token_logits,
                                         dim = -1)

        sorted_ids = torch.argsort(next_token_probs,
                                   dim = -1,
                                   descending = True)


        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):

            token_id = sorted_ids[choice_idx]

            token_prob = next_token_probs[token_id].cpu().numpy()

            token_choice = (f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)")

            iteration[f"Choice {choice_idx+1}"] = token_choice


        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)

        iterations.append(iteration)



pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| 1 | Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| 2 | Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| 3 | Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| 4 | Transformers are the most popular toy line | in (46.28%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| 5 | Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| 6 | Transformers are the most popular toy line in the | world (69.26%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| 7 | Transformers are the most popular toy line in ... | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |


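Approach 2: the same loop written more compactly, calling .detach() on each probability instead of wrapping the loop in torch.no_grad()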

In [None]:
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
n_steps = 8
choices_per_step = 5
iterations = []

for _ in range(n_steps):

    output = model(input_ids = input_ids)

    next_token_probs = torch.softmax(output.logits[0, -1, :],
                                     dim = -1)

    sorted_ids = torch.argsort(next_token_probs,
                               dim = -1,
                               descending = True)

    iteration = {"Input": tokenizer.decode(input_ids[0])}


    for choice_idx in range(choices_per_step):

        token_id = sorted_ids[choice_idx]

        token_prob = next_token_probs[token_id].detach().cpu().numpy()

        token_choice = f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"

        iteration[f"Choice {choice_idx+1}"] = token_choice

    input_ids = torch.cat([input_ids,
                           sorted_ids[None, 0, None]],
                           dim=-1)

    iterations.append(iteration)

pd.DataFrame(iterations)


The output is identical to the table produced by the first approach.


With this simple method we were able to generate the sentence “Transformers are the most popular toy line in the world”. Interestingly, this indicates that GPT-2 has internalized some knowledge about the Transformers media franchise.

Let's use the built-in generate() function from Transformers to explore more sophisticated decoding methods.

In [None]:
input_ids = tokenizer(input_txt,
                      return_tensors = "pt")["input_ids"].to(device)


output = model.generate(input_ids,
                        max_new_tokens = n_steps,
                        do_sample = False)



print(tokenizer.decode(output[0]))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world,


Let's reproduce a unicorn story from OpenAI.

We’ll encode the prompt with the tokenizer, and we’ll specify a larger value for max_length to generate a longer sequence of text.

In [None]:
# Generating longer sequence

max_length = 128


input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""


input_ids = tokenizer(input_txt,
                      return_tensors="pt")["input_ids"].to(device)


output_greedy = model.generate(input_ids,
                               max_length = max_length,
                               do_sample = False)



print(tokenizer.decode(output_greedy[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able


Greedy search decoding has two main drawbacks:

- It tends to produce repetitive output sequences, which is certainly undesirable in a news article.

- Like other greedy algorithms, it can fail to find the optimal solution. In the context of decoding, it can miss word sequences whose overall probability is higher just because high-probability words happen to be preceded by low-probability ones.

# Beam Search Decoding

Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses.

In other words, beam search considers several options at the same time. At each step, every beam is extended with its possible next tokens and only the b most likely extensions are kept. This repeats until the maximum length or an end-of-sequence token is reached, and the beam with the highest overall probability is returned.
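As a rough illustration of keeping several partial hypotheses alive, the sketch below runs beam search over a tiny hand-written next-token table. The table NEXT_TOKEN_PROBS, its probabilities, and the beam_search helper are all invented for this example and are not part of Transformers. With one beam the procedure reduces to greedy search.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
# All numbers are made up for illustration.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "bird": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
}


def beam_search(num_beams, n_steps=2):
    """Return the best (tokens, log_prob) pair after n_steps.

    The toy table only supports two steps, so n_steps should stay at 2.
    """
    # Each beam is (list of tokens, cumulative log probability).
    beams = [(["<s>"], 0.0)]
    for _ in range(n_steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams most probable partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'the', 'cat'] 0.24
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'a', 'cat'] 0.36
```

Greedy search commits to "the" because it is the most likely first token, and so never sees that "a cat" has the higher overall probability. Keeping two beams is enough to recover it in this toy case.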

Let’s calculate and compare the log probabilities of the texts generated by greedy and beam search to see if beam search can improve the overall probability.
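One reason to score sequences with log probabilities rather than raw probabilities is numerical: the probability of a sequence is a product of many per-token conditional probabilities, each below 1, and that product quickly underflows. The sketch below uses a hypothetical sequence of 2,000 tokens that each have probability 0.5; the numbers are assumptions chosen only to make the effect visible.

```python
import math

# Hypothetical per-token conditional probabilities (illustrative only).
token_probs = [0.5] * 2000

# Multiplying the probabilities directly underflows to zero in floating point.
product = 1.0
for p in token_probs:
    product *= p

# Summing log probabilities gives the same quantity in a stable form.
log_prob = sum(math.log(p) for p in token_probs)

print(product)             # 0.0
print(round(log_prob, 2))  # about -1386.29
```

Because the logarithm is monotonic, ranking sequences by summed log probability gives the same order as ranking them by the product of probabilities.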

In [None]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):

    logp = F.log_softmax(logits,
                         dim = -1)

    logp_label = torch.gather(logp,
                              2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label


In [None]:
# sum the log probabilities for each token

def sequence_logprob(model, labels, input_len=0):

    with torch.no_grad():

        output = model(labels)

        # log_probs[:, i] is the log probability the model assigns to token i + 1
        log_probs = log_probs_from_logits(output.logits[:, :-1, :], labels[:, 1:])

        # Sum the log probabilities from position input_len onward so the prompt is ignored
        # (because of the one-token shift, this also skips the first generated token)
        seq_log_prob = torch.sum(log_probs[:, input_len:])

    return seq_log_prob.cpu().numpy()


# Calculating log probabilities of the texts generated by greedy and beam search

In [None]:
# Calculating the sequence log probability of the greedy decoder on the OpenAI prompt

logp = sequence_logprob(model,
                        output_greedy,
                        input_len = len(input_ids[0]))

print(tokenizer.decode(output_greedy[0]))


print(f"\nlog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able

log-prob: -87.43


Now let’s compare this to a sequence that is generated with beam search.

To activate beam search with the generate() function we just need to specify the number of beams with the num_beams parameter.

In [None]:
output_beam = model.generate(input_ids,
                             max_length = max_length,
                             num_beams = 5,
                             do_sample = False)

logp = sequence_logprob(model,
                        output_beam,
                        input_len = len(input_ids[0]))


print(tokenizer.decode(output_beam[0]))

print(f"\nlog-prob: {logp:.2f}")



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery of the unicorns was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.


The scientists were conducting a study of the Andes Mountains when they discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English

log-prob: -55.23


We can see that we get a better log probability (higher is better) with beam search than we did with simple greedy decoding. However, beam search also suffers from repetitive text. One way to address this is to impose an n-gram penalty with the no_repeat_ngram_size parameter, which tracks which n-grams have been seen and sets the next-token probability to zero if that token would produce a previously seen n-gram.
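To make the n-gram rule concrete, here is a minimal sketch of how the set of blocked tokens could be computed for one step. The helper banned_next_tokens is hypothetical and works on plain Python lists of tokens; it is not the code Transformers runs internally, only an illustration of the idea.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already in `generated`."""
    if len(generated) < ngram_size:
        return set()
    # The last n-1 tokens are the prefix of the n-gram the next token would complete.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token is blocked.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["the", "unicorns", "spoke", "the"]
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)  # {'unicorns'}
```

With ngram_size = 2, the bigram "the unicorns" has already appeared, so after a second "the" the token "unicorns" is blocked. A decoder would then set the probability of every blocked token to zero before choosing the next token.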

In [None]:
output_beam = model.generate(input_ids,
                             max_length = max_length,
                             num_beams = 5,
                             do_sample = False,
                             no_repeat_ngram_size = 2)

logp = sequence_logprob(model,
                        output_beam,
                        input_len = len(input_ids[0]))

print(tokenizer.decode(output_beam[0]))


print(f"\nlog-prob: {logp:.2f}")



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the area when they came across the herd. They were surprised to find that they were able to converse with the animals in English, even though they had never seen a unicorn in person before. The researchers were

log-prob: -93.12


We’ve managed to stop the repetitions, and despite the lower score (-93.12 versus -55.23), the text remains coherent.

Beam search with an n-gram penalty is a good way to trade off focusing on high-probability tokens (with beam search) against reducing repetitions (with the n-gram penalty, which lowers the sequence score).

Sampling methods can also be used to reduce repetition and improve diversity when the accuracy of the information is less important.

The simplest sampling method is to randomly sample from the probability distribution of the model’s outputs over the full vocabulary at each timestep.

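We can control the diversity of the output with a temperature parameter T that rescales the logits before the softmax. When T is much smaller than 1, the distribution becomes peaked around the most likely tokens and the text becomes nearly deterministic. When T is much larger than 1, the distribution flattens out and the text becomes more diverse but less coherent. Let’s sample with T = 2 by setting the temperature parameter in the generate() function (top_k = 0 disables top-k filtering so that we sample from the full vocabulary).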
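The effect of temperature is easy to see on a small example. The sketch below divides a made-up set of three logits by T before applying a softmax; the function name softmax_with_temperature and the logit values are assumptions for illustration, not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.5 most of the probability mass sits on the first token, while at T = 2 the three tokens are much closer to equally likely, which is why high temperatures let rare tokens through.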

In [None]:
output_temp = model.generate(input_ids,
                             max_length = max_length,
                             do_sample = True,
                             temperature = 2.0,
                             top_k = 0)


print(tokenizer.decode(output_temp[0]))




In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


 amplitude grand merMadavexit mentium cor un shoot commentary Smile consoniliaMom Thr VIDEO whis con proof MET ADHD youtube boomxfergusonAustinFeb disposed debt131mpegoba Earn horrifying SHA 510 ke Me YelpIntroduction Cyborg554 Cohibe LP Leyenan SeeMilitaryACTIONprime lion international uncredo muff interact6 VisualJustice 74 Afghansolder Handle jQuery WAS'. I indirect Mount Quick instinct AnimandedMCstrandingdim


High temperature has produced mostly gibberish. Let’s cool down the temperature.


In [None]:
output_temp = model.generate(input_ids,
                             max_length = max_length,
                             do_sample = True,
                             temperature=0.5,
                             top_k=0)


print(tokenizer.decode(output_temp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The discovery was made by scientists from Universidad Nacional Autónoma de México (UNAM) and the University of Bristol in the Andes Mountains. The team found the unicorns while conducting a study on the range of the Andean condors, which are a subspecies of the great horned owl.


The scientists believe that the unicorns are descendants of the original birds


Significantly better! Temperature allows us to control the quality of the samples.

Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary, which excludes words that would be too strange in the context (low-probability words). There are two main ways to do this:

- top-k
- nucleus (or top-p)

The basic idea is to restrict the number of possible tokens we can sample from at each timestep.
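For top-k, restricting the candidates means keeping only the k most probable tokens, renormalizing, and sampling from what is left. The sketch below does this on a hypothetical five-token distribution; top_k_filter and the probabilities are invented for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


# Hypothetical next-token distribution (illustrative numbers only).
probs = {"world": 0.70, "United": 0.10, "history": 0.08, "US": 0.07, "zebra": 0.05}

filtered = top_k_filter(probs, k=2)
print(filtered)

# Sample the next token from the truncated distribution.
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

With k = 2 only "world" and "United" can ever be sampled, so the unlikely "zebra" is excluded no matter how the random draw goes.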

In [None]:
# Top-k sampling
# The value of k is chosen manually and is the same
# for each choice in the sequence, independent of the actual output distribution

output_topk = model.generate(input_ids,
                             max_length = max_length,
                             do_sample = True,
                             top_k = 50)


print(tokenizer.decode(output_topk[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The finding, reported by The Guardian, is being compared to "Lord of the Rings" lore, and could revolutionize the technology world by allowing people to communicate with machines in a language that the machines understand.

Some experts believe the discovery could not only change the way we use computers, but open up new technological and research opportunities. The researchers said the project, called "Urania", has


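Nucleus (top-p) sampling does not use a fixed cutoff value. Instead, we set a condition for when to cut off: when a certain probability mass in the selection is reached. With top_p = 0.90, tokens are ranked by descending probability and added one by one until their cumulative probability reaches 90%.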
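The following sketch shows how a probability-mass cutoff adapts to the shape of the distribution. The helper top_p_filter and both example distributions are hypothetical and chosen only to contrast a peaked case with a flat one.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top-ranked tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical distributions (illustrative numbers only).
peaked = {"the": 0.92, "a": 0.05, "an": 0.03}
flat = {"red": 0.3, "blue": 0.3, "green": 0.2, "pink": 0.2}

peaked_kept = top_p_filter(peaked, top_p=0.9)
flat_kept = top_p_filter(flat, top_p=0.9)

print(len(peaked_kept), len(flat_kept))  # 1 4
```

When the model is confident, a single token already covers 90% of the mass and nothing else can be sampled. When the distribution is flat, all four tokens are needed, so the candidate pool grows on its own.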

In [None]:
output_topp = model.generate(input_ids,
                             max_length = max_length,
                             do_sample = True,
                             top_p=0.90)


print(tokenizer.decode(output_topp[0]))



In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The species of "Unicorn" has been spotted in the Andes Mountains and is estimated to number no more than 60 animals. This particular individual of the species is the only one that can speak.

This incredible new species of unicorn is believed to live in an ancient mountain range, where the vegetation is particularly lush. Due to their environment, the creature is unable to reproduce in the wild,


We can even combine the two sampling approaches to get the best of both worlds. Setting top_k=50 and top_p=0.9 corresponds to the rule of choosing tokens with a probability mass of 90%, from a pool of at most 50 tokens. We can also apply beam search when we use sampling.