<a href="https://colab.research.google.com/github/brianchuo/AI-Storytelling-in-Virtual-Worlds/blob/main/Prompting_and_generation_using_GPT_Neo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompting and Generation using GPT-Neo.

In this notebook, you will experiment with language generation and controlling outputs of a pre-trained model without any additional training or fine-tuning. 

We will be utilizing the [Huggingface transformers library](https://huggingface.co/docs/transformers/index); this library gives us access to many pre-trained language models with a simple interface that allows us to generate text in as little as four lines of code. 

In the first part of the notebook, you will implement a function that generates several candidate sentences for an input prompt. In the second part, you will let the user select one of the candidate sentences to be the next input prompt, giving the user control over the generated text.

In the third part of the notebook, we will automate the candidate selection by scoring the sentences for toxicity and returning the least toxic sentence to be the next prompt.

In the fourth and final part of the notebook, you will propose your own scoring function to select which sentence to return. 

# Installations

In this section, we will install the required libraries. Before running any code, make sure that you select GPU in your runtime environment. The code in this section needs only to be run once at the Colab environment's initialization. 

In [None]:
!pip install transformers

In [None]:
!pip install detoxify

# Imports 
In this section, we import all the packages and libraries that we will use in the rest of the notebook. 

In [None]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
import torch 

In [None]:
from detoxify import Detoxify

# Download Model and Tokenizer

In [None]:
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

Verify that there is a GPU assigned to this notebook runtime environment.

In [None]:
torch.cuda.is_available()

Move the model to the GPU

In [None]:
model = model.to('cuda')

# Part 1: Text Generation with the GPT-Neo Language Model

## 1.A

Complete the function below that will generate and return a list of ```num_returned_sentences``` sentences from a given prompt. Each returned sentence should contain the prompt plus ```max_new_tokens``` additional tokens.

Consult the [Huggingface Transformers API](https://huggingface.co/docs/transformers/index). Of special importance to this project is the ```generate()``` function call, and the ```Tokenizer``` class.

There are generally four steps in generating text with a neural language model from a given prompt. The prompt is a string.

1. Encode: convert the string prompt to a tensor of tokens. 

Each word (or part of a word) that a neural language model recognizes as valid input can be represented as a one-hot vector. Since a one-hot vector is a vector the size of the recognized vocabulary with a single bit turned on, each word (or part of a word) is equivalent to the index of that bit in the vocabulary. Thus each word (or part of a word) can be transformed into an integer, called a token. The ```tokenizer``` object knows the language model's vocabulary.

2. Pass the tokenized prompt into the language model to generate a distribution over successor tokens.

The distribution for the successor token assigns a probability to each possible token given the prior tokens, p(token_n | token_1, ..., token_{n-1}), according to the model. Since we will be sampling from this distribution, we might get some low-probability tokens. To improve quality, we can truncate the tail of the distribution in two ways. First, we can keep the top k tokens of the distribution and zero out the rest of the distribution (top-k sampling). Second, we can keep only the most probable tokens whose cumulative probability reaches p and zero out the rest (top-p, or nucleus, sampling). 

3. Sample from the distribution to construct one or more sequences of tokens, typically in the form of a multidimensional tensor.

There are many ways to sample from this distribution. For example, greedy decoding always takes the single most probable token. Multinomial sampling randomly selects a token in proportion to its probability. ```model.generate()``` can be used to combine steps 2 and 3.

4. Decode: convert sampled tokens back into words.

Use the ```tokenizer``` object again to convert the tensors of tokens into strings. 
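
To make steps 2 and 3 concrete, here is a minimal sketch in plain Python that works on a toy distribution rather than on GPT-Neo. The vocabulary, the probabilities, and all function names are made up for illustration; ```model.generate()``` does the equivalent work internally on the model's real distribution.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def greedy_pick(probs):
    """Greedy decoding: index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def multinomial_pick(probs, rng=random):
    """Multinomial sampling: draw an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy vocabulary with made-up next-token probabilities (illustration only)
vocab = ["the", "unicorn", "spoke", "English", "banana"]
next_token_probs = [0.40, 0.30, 0.15, 0.10, 0.05]

top2 = top_k_filter(next_token_probs, k=2)
nucleus = top_p_filter(next_token_probs, p=0.8)

print("greedy:", vocab[greedy_pick(next_token_probs)])
print("sampled from top-2:", vocab[multinomial_pick(top2)])
```

Tokens that were zeroed out can never be sampled, which is how truncation removes the unlikely tail while still leaving room for variety among the tokens that remain.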

Below you will implement these four steps by completing the following function, ```generate_candidate_sentences```.

In [None]:
def generate_candidate_sentences(model, tokenizer, prompt, k, num_returned_sentences, max_new_tokens=20):
  # This function will return num_returned_sentences based on a given prompt.
  # Each sentence should be max_new_tokens longer than the prompt.
  # model: the GPT-neo model 
  # tokenizer: the object that tokenizes and de-tokenizes text
  # prompt: a string
  # k: stochastically sample from the top k tokens in the distribution generated by the model.
  #    A larger k gives more variability and variety.
  #    This must be equal to or larger than num_returned_sentences
  # num_returned_sentences: the number of sentences to return from this function
  # max_new_tokens: the maximum number of new tokens to generate after the prompt
  ### STEP 1: Use the tokenizer to convert the prompt into a tensor of tokens
  ###         (don't forget to move the tensor to the GPU)
  ### STEP 2 and 3: Call model.generate and receive a tensor of length num_returned_sentences
  ### STEP 4: Convert the tensor into a list of num_returned_sentences strings
  output_sentences = [] # A list of output sentences to collect up and return
  ### TODO: WRITE YOUR CODE BELOW HERE:

  ### WRITE YOUR CODE ABOVE HERE
  return output_sentences


## 1.B
Test your ```generate_candidate_sentences()``` here.

In [None]:
k = 5
num_returned_sentences = 5
text_length = 20

In [None]:
def print_candidate_sentences(candidate_sentences):
  print("Output:\n" + 100 * '-')
  for i, output in enumerate(candidate_sentences):
    print("{}: {} \n".format(i, output) + 20*'-')

In [None]:
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."

In [None]:
generated_sequences = generate_candidate_sentences(model, tokenizer, prompt, k, num_returned_sentences, text_length)
print_candidate_sentences(generated_sequences)

## 1.C
Large pre-trained language models have some limitations that we sometimes need to work around to use them effectively for our tasks. One of these limitations is the maximum token length: on each call to the ```model.generate()``` function, the prompt and the generated text together cannot exceed a fixed number of tokens. For GPT-2, the maximum number of tokens is 1024, and for GPT-Neo, it is 2048. This limit restricts the generation of longer texts such as articles and stories. 


For 1.C: 
To work around this limitation and successfully handle longer text generation, we can call the generate function iteratively, truncating the input prompt each time so that the input plus the generated text stays within the maximum token length.

Given a prompt, iteratively generate a story longer than the 2048-token limit. 

---------------
Tip: For ease of testing, you can set the ```max_length``` parameter in the ```model.generate()``` function to 10 and try to use your code to generate a story of length 50. Setting this parameter can mimic the model's behavior when it receives a much larger input than expected. 
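
The sliding-window idea can be sketched without loading any model. The example below works on lists of token ids instead of strings, and ```fake_generate``` is a hypothetical stand-in for ```model.generate()``` that simply appends consecutive integers. The window size of 10 and target of 50 mirror the tip above; every name here is an assumption made for illustration, not part of the assignment's API.

```python
def truncate_to_window(token_ids, max_context, new_tokens):
    """Keep only the most recent tokens so that prompt + new tokens fits the window."""
    budget = max_context - new_tokens
    if budget <= 0:
        raise ValueError("new_tokens must be smaller than max_context")
    return token_ids[-budget:]

def fake_generate(context_ids, new_tokens):
    """Hypothetical stand-in for model.generate(): returns the context plus new ids."""
    start = context_ids[-1] + 1 if context_ids else 0
    return context_ids + list(range(start, start + new_tokens))

def generate_long_ids(prompt_ids, target_length, max_context=10, new_tokens=4):
    full_ids = list(prompt_ids)
    while len(full_ids) < target_length:
        # The context always includes the END of the text generated so far
        context = truncate_to_window(full_ids, max_context, new_tokens)
        output = fake_generate(context, new_tokens)
        # Append only the newly generated tokens, not the repeated context
        full_ids.extend(output[len(context):])
    return full_ids[:target_length]

story_ids = generate_long_ids(prompt_ids=[0, 1, 2], target_length=50)
print(len(story_ids))
```

The full text keeps growing while the context passed to the generator never exceeds the window, which is the property your own ```generate_long``` needs to preserve when it works with real prompts.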

In [None]:
k = 5
target_length = 4096

In [None]:
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."

In [None]:
def generate_long(model, tokenizer, prompt, k, target_length):
  # Use the generate_candidate_sentences iteratively to generate longer sequences 
  # Step 1: Check the length of the current prompt, and truncate it if it would result in text that is longer than the max length threshold. 
  # Make sure your new prompt substring includes the end of the prompt so the model can correctly produce a continuation.  
  # Step 2: Call the generate_candidate_sentences with the new substring prompt (Note, num_returned_sentences can be set to 1 for this section)
  # Step 3: Update the full_generated text to include the full prompt and the newly generated content.
  # Step 4: Repeat steps 1-3 until the goal length is reached. 
  full_generated_text = prompt
  ### TODO: WRITE YOUR CODE BELOW HERE

  ### WRITE YOUR CODE ABOVE HERE
  return full_generated_text

In [None]:
full = generate_long(model, tokenizer, prompt, k, target_length)
print(full)

# Part 2: Text generation based on user preference

For this section you will iteratively generate text based on user preferences. At each iteration, generate a few candidate continuations for the user to select from, then use the selected continuation as the prompt for the next segment.

The behavior of the function is as follows. It should produce ```num_choices``` possible candidates. Print out only the newly generated text. The user will provide a number indicating which of the candidates they want to use (see ```get_user_preference``` for a useful utility). The selected candidate will be added to the prompt. This will be done ```num_sentences_to_generate``` times.
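
The control flow described above can be sketched with stand-ins so that it runs without a model or keyboard input. Here ```fake_candidates``` is a hypothetical replacement for ```generate_candidate_sentences()```, and the ```choose``` argument plays the role of ```get_user_preference```. The slice that shows only new text assumes each candidate begins with exactly the text it was generated from, which is worth verifying with a real tokenizer.

```python
def fake_candidates(prompt, num_choices):
    """Hypothetical stand-in for generate_candidate_sentences()."""
    return [prompt + " [continuation {}]".format(i) for i in range(num_choices)]

def interactive_loop(prompt, num_choices, num_rounds, choose, generate=fake_candidates):
    text = prompt
    for _ in range(num_rounds):
        candidates = generate(text, num_choices)
        # Show only the newly generated part by slicing off the current text
        for i, candidate in enumerate(candidates):
            print("{}: {}".format(i, candidate[len(text):]))
        # The chosen candidate (prior context + new text) becomes the next prompt
        text = candidates[choose(candidates)]
    return text

# A scripted chooser replaces keyboard input so the sketch runs unattended
result = interactive_loop("Once upon a time", 3, 2, choose=lambda candidates: 1)
print(result)
```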

In [None]:
def get_user_preference(candidate_sentences):
  # Keep asking until the user enters a valid candidate index
  while True:
    try:
      user_choice = int(input("Enter the sentence number: "))
      if 0 <= user_choice < len(candidate_sentences):
        return user_choice
      raise ValueError
    except ValueError:
      print("Please enter a correct sentence number between {} and {}".format(0, len(candidate_sentences)-1))


In [None]:
k = 5
num_choices = 5
text_length = 20
num_sentences_to_generate = 5

In [None]:
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."

Start with the prompt. Generate ```num_choices``` continuations, have the user choose the sentence that they want, and return that sentence.

In [None]:
def generate_sentence_interactive(model, tokenizer, prompt, k, num_choices, num_sentences_to_generate, max_new_tokens=text_length):
  # We will generate the candidate sentences for the next prompt and then get the index of the user's preference. 
  # Then we use this to assign a new prompt to our model:
  # STEP 1: Generate candidate sentences.
  # STEP 2: Get the user's preference.
  # STEP 3: Map user choice to a sentence and then use this sentence (added to the prior context) to generate the next candidates.
  final_sentence = None
  ### TODO: WRITE YOUR CODE BELOW HERE

  ### WRITE YOUR CODE ABOVE HERE
  return final_sentence

Check your final output here

In [None]:
continuation = generate_sentence_interactive(model, tokenizer, prompt, k, num_choices, num_sentences_to_generate, text_length)
print("Final output: \n {}".format(continuation))

# Part 3: Text generation with automated toxicity scoring

In this part we will use [Detoxify](https://github.com/unitaryai/detoxify) scoring to choose the least toxic next sentence. 

First, run the code cell below to get an idea of how Detoxify works and what the output of the predict function looks like. 

In [None]:
# In this cell we show a demo for the Detoxify model. 
# You can use the Detoxify model 
detoxify_model = Detoxify('unbiased') 
results = detoxify_model.predict(['example text 1','example text 2'])
results

In [None]:
k = 5
num_choices = 5
text_length = 20

In [None]:
prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
"previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
"researchers was the fact that the unicorns spoke perfect English."

Second, use the Detoxify model to rank the generated candidates and return the full text with the lowest ```toxicity```.
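
The selection step itself is a small argmin. The sketch below assumes the scores arrive as a dictionary that maps each label name to a list with one score per input text, which is what the demo cell's output should look like when ```predict``` is given a list. The numbers are made up for illustration and ```least_toxic``` is a hypothetical helper name.

```python
def least_toxic(candidates, scores, key="toxicity"):
    """Return the candidate whose score under `key` is smallest."""
    best_index = min(range(len(candidates)), key=lambda i: scores[key][i])
    return candidates[best_index]

# Made-up scores for illustration only; in the notebook the real values
# would come from detoxify_model.predict(candidates)
candidates = ["text A", "text B", "text C"]
example_scores = {
    "toxicity": [0.12, 0.03, 0.40],
    "insult": [0.01, 0.02, 0.30],
}

print(least_toxic(candidates, example_scores))
```

Passing a different ```key``` selects by another label, and the same pattern carries over to any scoring function that returns one number per candidate.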

In [None]:
def generate_detox_sentence(model, tokenizer, prompt, k, num_choices, max_new_tokens=text_length):
  # STEP 1: Generate candidate texts.
  # STEP 2: Find the candidate with the lowest toxicity score. 
  # STEP 3: Return the full text with the lowest toxicity score.
  best_sentence = None
  ### TODO: WRITE YOUR CODE BELOW HERE

  ### WRITE YOUR CODE ABOVE HERE
  return best_sentence

In [None]:
continuation = generate_detox_sentence(model, tokenizer, prompt, k, num_choices, text_length)
print(continuation)

# Part 4: Text generation with other scoring functions.

For this part we want you to come up with your own scoring mechanism. Choose a method that you think would improve the story generation process using the pretrained model. Demonstrate it and explain briefly why you think it is a suitable scoring function. 

In [None]:
# STEP 1: Generate candidate text.
# STEP 2: Use your scoring function on the generated candidates 
# STEP 3: Return the full text with the best score according to your scoring function.
### TODO: WRITE YOUR CODE BELOW HERE

### WRITE YOUR CODE ABOVE HERE

**YOUR EXPLANATION HERE:** ...