# Playing with transformers

*Notebook created by Lauren Klein, borrowing heavily from material created by Allison Parrish for her "Playing with Transformers" notebook*

We'll begin by installing the transformers library, and then importing the relevant parts of the library as well as a few additional print formatting libraries that I like to use.



In [None]:
!pip3 install transformers

In [None]:
import textwrap # for nice wrapping

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

Set parameters, then load the `distilgpt2` tokenizer and model:

In [None]:
# for the GPU
device_name = 'cuda'       

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

## Generation in more detail (advanced, but interesting)

So now we'll go a little bit under the hood. 

What's actually happening when you ask the model to generate text is this: 

First, you encode the prompt as a sequence of IDs using the tokenizer (remember this from the last notebook). 

Then, the model assigns a probability to every token in the tokenizer's vocabulary, based on which tokens it thinks are most likely to come next. 

Here's what it looks like to run that process "by hand," so to speak. 

First, create the prompt:

In [None]:
prompt = "Two roads diverged in a yellow wood, and"

Then encode the prompt as a sequence of token IDs:

In [None]:
prompt_encoded = tokenizer([prompt], return_tensors="pt")

prompt_encoded

Then we call the model ("distilgpt2") as though it were a function, passing in the key/value pairs that the tokenizer returned as keyword arguments (using [Python's `**` operator](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists) for unpacking argument lists):

In [None]:
result = model(**prompt_encoded)
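
If the `**` in that call looks mysterious, here's a minimal sketch of what it does, using a plain dictionary and a stand-in function. The function `fake_model` and the ID values are hypothetical, made up for illustration; the real tokenizer output holds PyTorch tensors rather than lists.

```python
# A stand-in for the model: it just reports which keyword arguments it received.
# (Hypothetical function, for illustration only.)
def fake_model(input_ids=None, attention_mask=None):
    return f"got {len(input_ids)} ids and {len(attention_mask)} mask values"

# A plain dict shaped like the tokenizer's output (the values here are made up).
encoded = {"input_ids": [101, 202, 303], "attention_mask": [1, 1, 1]}

# These two calls are equivalent: ** spreads the dict out into keyword arguments.
unpacked = fake_model(**encoded)
spelled_out = fake_model(input_ids=encoded["input_ids"],
                         attention_mask=encoded["attention_mask"])

print(unpacked)
print(unpacked == spelled_out)
```

So `model(**prompt_encoded)` is shorthand for passing each key in the tokenizer's result as a named argument.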

The value returned from calling the model is an object with various attributes that we can examine. 

I'm most interested in `.logits`, which is a PyTorch tensor that contains information about the probability that the model assigned to each vocabulary item. ("Tensor" btw is just a fancy word for "array with a bunch of dimensions.") The prediction for the next token can be found in the very last row of this tensor:

In [None]:
next_token_probs = result.logits[0,-1]

next_token_probs

The scores are shown in "raw" form, meaning that they don't have the kinds of values that we would normally associate with a probability distribution (i.e., multiple options all adding up to one). But we can still compare them in this state. Higher numbers mean higher probability.

This tensor has a shape that corresponds to the number of vocabulary items in that particular model:

In [None]:
next_token_probs.shape
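
A quick aside on those "raw" scores: the usual way to turn them into probabilities that add up to one is a function called softmax. Here's a minimal sketch in plain Python, with made-up scores for a three-token vocabulary (not real model output). With PyTorch you'd get the same effect from `torch.softmax(next_token_probs, dim=-1)`.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max first keeps math.exp from overflowing on large scores.
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for a three-token vocabulary (not real model output).
raw_scores = [2.0, 1.0, -1.0]
probabilities = softmax(raw_scores)

for score, prob in zip(raw_scores, probabilities):
    print(f"raw score {score:5.1f} -> probability {prob:.3f}")
print("sum:", round(sum(probabilities), 6))
```

Note that the ordering doesn't change: the highest raw score becomes the highest probability, which is why we can compare the raw scores directly.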

And we can actually inquire about the probability of particular tokens by looking them up. The code in the following cells uses the tokenizer's `.encode()` method to convert a token to its ID, then looks up the ID by index in the array with the predictions. (The `.item()` call converts the resulting PyTorch tensor to a native Python value, which just makes the result a bit easier to look at.)

In [None]:
next_token_probs[tokenizer.encode(' the')].item()

In [None]:
next_token_probs[tokenizer.encode(' x')].item()

In [None]:
next_token_probs[tokenizer.encode(' an')].item()

We can see that the tokens ` the` and ` an` have fairly high probability, while the token ` x` has low probability.  Interesting!

### Generating text, the home-grown way

Using the PyTorch library, we can get a list of the most likely tokens to come next. (This has some dark magic in it if you're not familiar with PyTorch—or another array processing library like NumPy—so... just trust me for a sec.)

In [None]:
import torch
for idx in reversed(torch.argsort(next_token_probs)[-12:]):
    print("'" + tokenizer.decode(idx) + "'")

(Again, I've added in the quotation marks so you can clearly see that these tokens have whitespace at the beginning.) These are the top twelve tokens to come next in the sequence, as predicted by the model. One way to *generate* a text would be to take one of these tokens (maybe the top-scoring token, maybe one of the top *n* picked at random), append it to our original list of tokens, ask the model to make a prediction on *that* list of tokens, and repeat. The loop would look something like this:

In [None]:
import random

prompt = "Two roads diverged in a yellow wood, and"
for i in range(10):
    # encode the prompt
    prompt_encoded = tokenizer([prompt], return_tensors="pt")
    # run a forward pass on the network
    result = model(**prompt_encoded)
    # get the raw scores for the next token
    next_token_probs = result.logits[0,-1]
    # sort by value, get the top 12 (you can change this number! try 1, or 1000)
    nexts = torch.argsort(next_token_probs)[-12:]
    # append the decoded ID to the current prompt
    prompt += tokenizer.decode(random.choice(nexts))
    print(prompt)
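
If you want to see the shape of that loop without the model getting in the way, here's a toy version in plain Python. Everything in it is an assumption for illustration: `toy_scores` is a made-up lookup table standing in for the model, and it only looks at the last word, whereas the real model takes the whole prompt into account.

```python
import random

# A toy stand-in for the model: for each word, made-up scores for what comes next.
# (Hypothetical data; a real model scores every token in its vocabulary.)
toy_scores = {
    "and":  {"I": 3.0, "the": 2.5, "then": 1.0},
    "I":    {"took": 3.0, "then": 1.0, "and": 0.5},
    "the":  {"road": 3.0, "and": 0.5, "I": 0.2},
    "took": {"the": 3.0, "and": 1.0, "I": 0.5},
    "road": {"and": 3.0, "then": 2.0, "I": 1.0},
    "then": {"I": 3.0, "the": 2.0, "took": 1.0},
}

def top_n(scores, n):
    """Return the n highest-scoring candidates, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def generate(prompt_words, steps, n, rng):
    words = list(prompt_words)
    for _ in range(steps):
        # "forward pass": look up the scores for whatever follows the last word
        scores = toy_scores[words[-1]]
        # keep the n best candidates, pick one at random, append it, and repeat
        words.append(rng.choice(top_n(scores, n)))
    return words

rng = random.Random(0)  # seeded so that reruns give the same result
greedy = generate(["wood,", "and"], steps=6, n=1, rng=rng)
varied = generate(["wood,", "and"], steps=6, n=2, rng=rng)
print(" ".join(greedy))
print(" ".join(varied))
```

With `n=1` there is only ever one candidate, so the output is the same every time; with a larger `n` the text starts to wander.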

### The `.generate()` method

Our home-grown solution above does the job, but it's very rudimentary. Because generating text is such a common use-case, the Transformers library provides a `.generate()` method that is quite fast and also has a bunch of bells and whistles that we can exploit to add expressiveness to our use of the language model. Under the hood, though, the `.generate()` method is essentially doing exactly what we did above—iteratively constructing a string based on predicted tokens from the model. Use the `.generate()` method like this:

In [None]:
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer(prompt, return_tensors="pt") # the "return_tensors" thing is important!
result = model.generate(**prompt_encoded)[0]
tokenizer.decode(result, skip_special_tokens=True)

For a more detailed overview of all of the ways you can use `.generate()`, see [How to generate](https://huggingface.co/blog/how-to-generate) on the Hugging Face blog. One argument of `.generate()` that is useful right off the bat is `max_length`, which sets the maximum length of the result in tokens, with the tokens of the prompt counted toward that total:

In [None]:
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer(prompt, return_tensors="pt")
result = model.generate(**prompt_encoded, max_length=250)[0] # note the addition of the max length here 
tokenizer.decode(result, skip_special_tokens=True)

### Back to the pipeline

The process of encoding the prompt and decoding the results is pretty tedious. That's why the "pipeline" was invented. The `text-generation` pipeline takes care of encoding and decoding for you. Create a pipeline by calling `pipeline` with `text-generation` as the first parameter, and then the model and tokenizer that you want to use:

In [None]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

And then you can generate text with the pipeline. The first argument is the prompt; any remaining parameters will be forwarded to the model's `.generate()` method.

In [None]:
generator("Two roads diverged in a yellow", max_length=100)

As a reminder, to get the actual generated text, use indexing to get the value of the dictionary in the list returned from the pipeline:

In [None]:
generator("Two roads diverged in a yellow", max_length=100)[0]['generated_text']

And print it even more nicely:

In [None]:
print(textwrap.fill(generator("Two roads diverged in a yellow", max_length=100)[0]['generated_text'], 60))
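
To make that indexing a little less mysterious, here's a sketch that uses a hand-written stand-in for the pipeline's return value. The text in `mock_result` is made up, not model output; what matters is the structure, a list containing a dictionary with a `generated_text` key.

```python
import textwrap

# A hand-written stand-in for what the pipeline returns: a list with one
# dictionary per generated sequence. (The text is made up, not model output.)
mock_result = [
    {"generated_text": "Two roads diverged in a yellow wood, and I stood for a "
                       "long while looking down one as far as I could see."}
]

first = mock_result[0]             # the list's first (here, only) item: a dict
text = first["generated_text"]     # the string stored under that key
wrapped = textwrap.fill(text, 60)  # re-wrap so no line is longer than 60 chars
print(wrapped)
```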

## Controlling the model

By default, the `text-generation` pipeline for `distilgpt2` samples from the possible next tokens, weighted by the probability assigned to each token. This strategy leads to text that shows a good deal of variety, but there are strategies that we can use and parameters that we can tweak to exert a little more control over the model's output. In this section, I show a few of these strategies.

### The magic of the prompt

Transformer models are often able to follow up on cues you give about the desired content and style of the text in the prompt itself. The smaller transformer models aren't especially good at this, but it's still worth playing around with. For example, to get `distilgpt2` to generate something that looks like a movie review:

In [None]:
print(textwrap.fill(generator("My review of The Road Not Taken, the Movie:", max_length=100)[0]['generated_text'], 60))

You can also generate dialogues and interview transcripts:

In [None]:
print(textwrap.fill(generator("Lauren: I took the road less traveled by.\nRobert Frost:", max_length=100)[0]['generated_text'], 60))

Poetry facts:

In [None]:
print(textwrap.fill(generator("My favorite facts about poetry:\n\n1.", max_length=100)[0]['generated_text'], 60))

In general, this kind of prompting works best with texts that are likely to have a lot of representation in the training corpus.

### Sampling with temperature

As mentioned above, the `distilgpt2` model, by default, picks the next token at random, weighted by the probability that the model assigns to the word. To demonstrate how this works, let's imagine that the model only has five tokens in its vocabulary (instead of 50,000+). A schematic illustration of those probabilities might look like this:

    prompt: Whose woods these are I think I
    probabilities:
        know -> 0.5
        knew -> 0.2
        smell -> 0.15
        see -> 0.1
        am -> 0.05


The probabilities will add up to `1.0`. A probability of `0.5` indicates that the token has a 50% probability of coming next; a probability of 0.2 means that the token has a 20% probability of coming next, etc. Here are those probabilities represented in Python as two lists—one with the words, and one with the probabilities that correspond to those words by index:

In [None]:
tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

By default, to select the next token, the generation code picks from this list weighted by probability. The code to do this with PyTorch looks like this:

In [None]:
index = torch.multinomial(torch.tensor(probs), 1).item()
print(tokens[index])

You don't have to worry about the specifics of this code; I'm just using it to demonstrate how the sampling process works. Run the code a few times and you'll see that about half the time you get "know", the token with the highest probability. Running the code in a loop makes this a bit easier to see:

In [None]:
for i in range(10):
    index = torch.multinomial(torch.tensor(probs), 1).item()
    print(tokens[index])
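
Ten draws is a small sample, so here's a sketch that draws ten thousand and counts how often each token comes up. As an assumption for this example, I've swapped in `random.choices` from Python's standard library in place of `torch.multinomial`; both pick items at random according to the weights you give them.

```python
import random
from collections import Counter

tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

rng = random.Random(42)  # seeded so that the counts are reproducible
draws = rng.choices(tokens, weights=probs, k=10_000)
counts = Counter(draws)

# Print each token's share of the draws next to its assigned probability.
for token, prob in zip(tokens, probs):
    share = counts[token] / len(draws)
    print(token.ljust(6), f"sampled {share:.3f}", f"(assigned {prob})")
```

With this many draws, each token's share should land close to the probability assigned to it.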

The generation process has a parameter called *temperature*, which lets you shift the probability distribution of the next token before it's sampled. If the temperature parameter is `1.0`, then sampling will proceed as normal, with the tokens weighted by their estimated probability. If the temperature parameter is less than `1.0`, then tokens that were already probable will get *more* probable. If the temperature parameter is greater than `1.0`, then the probabilities start to even out, approaching a uniform distribution (meaning that no token is more likely to be chosen than any other). To demonstrate this, I've written some code below that applies temperature to the probabilities defined above, and shows the resulting changes:

In [None]:
for temperature in [0.1, 0.35, 1.0, 2.0, 50.0]:
    modified = torch.softmax(
        torch.log(torch.tensor(probs)) / temperature, dim=-1)
    print(f"temperature {temperature:0.2f}")
    for tok, prob in zip(tokens, modified):
        print(tok.ljust(6), "→", f"{prob:0.2f}")
    print()

You can see that at temperature `1.0`, the probabilities are identical to the original. At temperature `0.35`, the probability of the most likely token has been boosted, but the other tokens still have a small chance of occurring. At temperature `0.1`, the most likely token is practically the only one with a chance of being selected (the others round to zero). At temperature `2.0`, the most likely token is still the most likely, but the probabilities of the other tokens have been boosted in comparison; at temperature `50.0`, the distribution is nearly uniform, so no token is much more likely than any other.
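
If the `softmax` and `log` calls feel opaque, the same arithmetic can be written without PyTorch. Taking the softmax of `log(p) / T` works out to raising each probability to the power `1/T` and then rescaling so that everything adds up to one again. Here's a minimal sketch of that; the function name `apply_temperature` is my own, not something from the Transformers library.

```python
def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature."""
    # Equivalent to softmax(log(p) / T): raise each p to the power 1/T,
    # then renormalize so that the results sum to 1 again.
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]

sharp = apply_temperature(probs, 0.1)   # low temperature: winner takes nearly all
same = apply_temperature(probs, 1.0)    # temperature 1.0: no change
flat = apply_temperature(probs, 50.0)   # high temperature: close to uniform

for name, dist in [("0.1", sharp), ("1.0", same), ("50.0", flat)]:
    print(f"temperature {name}:", [round(p, 3) for p in dist])
```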

To apply temperature sampling to the model when generating text, pass the `temperature` parameter to the pipeline, like so:

In [None]:
generator("Two roads diverged in a yellow",
          temperature=0.1,
          max_length=100)[0]['generated_text']

Low temperatures generally produce predictable, repetitive results. Here's an attempt with high temperature:

In [None]:
generator("Two roads diverged in a yellow",
          temperature=4.0,
          max_length=100)[0]['generated_text']

The higher temperature example produces less likely sequences of words, so the text is a bit livelier—sometimes at the cost of coherence.

Adjusting the temperature can be useful when you want the text to be more or less "weird." It can be helpful to adjust the temperature downward when you feel as though the model is producing text that is a bit too unpredictable; it can be helpful to adjust upward when you want the model to take more unexpected turns when generating.

### Top-k sampling

By default, the generation process only selects from the top 50 most probable tokens at each step. This is called "top-k filtering." Because of top-k filtering, you're not likely to sample truly unusual tokens even when the temperature is high. You can adjust the threshold for top-k filtering with the `top_k` parameter of the model. For example, adjusting `top_k` to the number of items in the vocabulary ensures that every token gets its chance:

In [None]:
generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          max_length=100)[0]['generated_text']

Using this with a temperature greater than `1.0` can yield some unusual turns of phrase:

In [None]:
generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          temperature=1.2,
          max_length=100)[0]['generated_text']

Formatted more nicely:

In [None]:
# the wrap width (60) belongs inside textwrap.fill(), not print()
print(textwrap.fill(generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          temperature=1.2,
          max_length=100)[0]['generated_text'], 60))

On the other extreme, setting the `top_k` value to `1` ensures that *only* the most likely token is chosen at each step. This is the same thing as ["greedy decoding"](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.greedy_search):

In [None]:
print(textwrap.fill(generator("Two roads diverged in a yellow",
          top_k=1,
          max_length=100)[0]['generated_text'], 60))

Playing around with `top_k` and `temperature` in tandem is a good way to make adjustments to the texture of your generated text.
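
To see what top-k filtering does to a distribution, here's a sketch in plain Python using the five-token example from the temperature section. The function name `top_k_filter` is my own, and this version works on probabilities directly for readability; treat it as an illustration of the idea rather than the library's actual code.

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    # Indices of the k largest probabilities.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

top3 = top_k_filter(probs, 3)                 # only the three likeliest survive
top1 = top_k_filter(probs, 1)                 # same as greedy decoding
everything = top_k_filter(probs, len(probs))  # nothing filtered out

for tok, a, b in zip(tokens, top3, top1):
    print(tok.ljust(6), f"top_k=3: {a:.3f}", f"top_k=1: {b:.3f}")
```

With `k` set to 1, all of the probability lands on a single token, which is why `top_k=1` behaves like greedy decoding.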

### Logit warping: Exclude "bad" words

The `.generate()` method has a parameter called `bad_words_ids`, which causes the model to zero out the probabilities of tokens associated with words that you pass in. The intended use of this feature is to stop the model from generating offensive or harmful words. But we can also repurpose it for poetic purposes. For example, in the cell below, I make the model complete the prompt "It was a dark and stormy" *without* using the words "night" or "day":

In [None]:
print(textwrap.fill(generator("It was a dark and stormy",
          bad_words_ids=tokenizer([" night", " day"]).input_ids)[0]['generated_text'], 60))

The syntax for specifying the "bad words" is to call the tokenizer on a list of words that you want to exclude, and then get the `.input_ids` attribute of the value returned from calling the tokenizer. This yields a list of lists that looks like this:

In [None]:
tokenizer(["Lauren", "Klein"]).input_ids

Note that I used ` night` and ` day` as the words, with leading spaces—this is necessary because I ended the prompt without whitespace, so the model is likely to generate a token with leading whitespace at the next step. I've found that the `bad_words_ids` parameter works best if your list of words includes versions both with and without whitespace.

Here's another example: getting the model to complete a prompt without using any forms of the verb *to be*:

In [None]:
print(textwrap.fill(generator("Once upon a time,",
          bad_words_ids=tokenizer(
              ["be", " be",
               "am", " am",
               "are", " are",
               "is", " is",
               "was", " was",
               "were", " were"]).input_ids,
          max_length=100)[0]['generated_text'], 60))

You can also create a list of token IDs that you want to exclude on the fly. In the following example, I make a list of token IDs that have the letter `e` in them, and pass that list to the `bad_words_ids` parameter:

In [None]:
forbidden_ids = []
for key, val in tokenizer.get_vocab().items():
    if 'e' in key:
        forbidden_ids.append([val]) # needs to be a list of lists
# the wrap width (60) belongs inside textwrap.fill(), not print()
print(textwrap.fill(generator("Last month, I",
          bad_words_ids=forbidden_ids,
          max_length=100)[0]['generated_text'], 60))
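
For a sense of what "excluding" a token means under the hood, here's a sketch in plain Python. It is an illustration under my own assumptions, not the library's code: the vocabulary and scores are made up, the function `ban_tokens` is hypothetical, and it only handles single tokens. The idea is to push a banned token's raw score down to negative infinity, so that its probability comes out as exactly zero.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def ban_tokens(scores, banned_ids):
    """Set the score of each banned token ID to -inf so it can never be picked."""
    return [-math.inf if i in banned_ids else s for i, s in enumerate(scores)]

# Made-up vocabulary and raw scores for the prompt "It was a dark and stormy".
vocab = [" night", " day", " evening", " sea"]
scores = [4.0, 2.0, 1.5, 0.5]

banned = {vocab.index(" night"), vocab.index(" day")}
masked_probs = softmax(ban_tokens(scores, banned))

for word, before, after in zip(vocab, softmax(scores), masked_probs):
    print(repr(word).ljust(11), f"before: {before:.3f}", f"after: {after:.3f}")
```

The probability that the banned tokens gave up gets shared out among the tokens that remain.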

### Fine-tuning a model -- SKIP IF LOW ON TIME!

"Fine-tuning" is a way of slightly modifying a model by training it a few extra steps on a corpus of your choice. This process adjusts the probabilities of the model so that it more closely reflects the probabilities of the source text you train it on. Fine-tuning models with Transformers is a little bit tricky! First, you'll need to install Hugging Face's `datasets` package:

In [None]:
!pip3 install datasets

And then import it:

In [None]:
import datasets

You'll want to select a text file to fine-tune the model on. Fine-tuning works best on large amounts of text, but fine-tuning is also very slow if you're not using a GPU. 

For demonstration purposes, I create a special version of [Frankenstein](https://www.gutenberg.org/ebooks/84) that contains only the first 20000 characters, and save it to a local file:

In [None]:
import requests 

resp = requests.get('https://www.gutenberg.org/files/84/84-0.txt')
text = resp.text[:20000]

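# write the excerpt out as UTF-8 so the result doesn't depend on the platform default
with open("84-0-20k.txt", "w", encoding="utf-8") as fh:
    fh.write(text)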

Then I load this text file as my fine-tuning dataset:

In [None]:
training_data = datasets.load_dataset('text', data_files="84-0-20k.txt")

Now, there's a bunch of obligatory processing that we need to do to the data in order to prepare it for the model. This is boilerplate stuff, which I'm not going to go into in detail. If you want details, consult Hugging Face's [fine-tuning language models notebook](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb).

First, we tokenize the text:

In [None]:
# uncomment these if you needed to restart the runtime above and need to re-import necessary libraries
# from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
# model = AutoModelForCausalLM.from_pretrained('distilgpt2')

tokenizer.pad_token = tokenizer.eos_token
tokenized_training_data = training_data.map(
    lambda x: tokenizer(x['text']),
    remove_columns=["text"]
)

Then we break the tokenized text up into fixed-size blocks of `block_size` tokens each:

In [None]:
block_size = 64
# magic from https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_training_data = tokenized_training_data.map(
    group_texts,
    batched=True,
    batch_size=200
)
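
If the "magic" in `group_texts` is hard to follow, here's a stripped-down sketch of its core idea on made-up token IDs: glue all of the tokenized lines together into one long list, then slice that list into equal-sized blocks, dropping whatever is left over at the end. The function name `chunk_into_blocks` is my own, and it leaves out the dictionary handling and the `labels` copy that the real function performs.

```python
def chunk_into_blocks(token_lists, block_size):
    """Concatenate lists of token IDs, then cut the result into equal blocks."""
    # Join every line's tokens into one long list (what sum(..., []) does above).
    joined = [tok for tokens in token_lists for tok in tokens]
    # Round the length down to a multiple of block_size; leftovers are dropped.
    usable = (len(joined) // block_size) * block_size
    return [joined[i:i + block_size] for i in range(0, usable, block_size)]

# Made-up token IDs standing in for three tokenized lines of text.
lines = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
blocks = chunk_into_blocks(lines, block_size=4)

# Ten tokens make two full blocks of four; the last two IDs are dropped.
print(blocks)
```

Notice that the blocks ignore line boundaries: a block can start in the middle of one line and end in the middle of the next.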

Now we import the `Trainer` class, which implements a training loop.

In [None]:
from transformers import Trainer, TrainingArguments

Running the following cell creates the `Trainer` object. The `output_dir` parameter specifies a directory where your fine-tuned model will be saved. The `num_train_epochs` parameter sets how many "epochs" the trainer will run; one epoch is one pass over the entire dataset. More epochs generally pull the model closer to your text (though on a corpus this small, too many can make it simply parrot the source), but even one epoch can significantly change the way the model generates text.

In [None]:
trainer = Trainer(model=model,
                  train_dataset=lm_training_data['train'],
                  args=TrainingArguments(
                      output_dir='distilgpt2-finetune-frankenstein20k',
                      num_train_epochs=1,
                      do_train=True,
                      do_eval=False
                  ),
                  tokenizer=tokenizer)

Finally, the cell below will start the training process. If you're running this on a computer without a GPU, it will take a while.

In [None]:
trainer.train()

Running the cell below will save the model to disk:

In [None]:
trainer.save_model()

Now you can generate with the fine-tuned model! The fine-tuning process modifies the model in-place, so the `pipeline` you created before will make use of the fine-tuned model. (Note that if you want to get the original `distilgpt2` back, you'll need to reload it with the `.from_pretrained()` method, as demonstrated at the top of the notebook.)

In [None]:
# uncomment the line below if you needed to restart your runtime
# generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

generator("Two roads diverged in a yellow", max_length=100)[0]['generated_text']

You can see that fine-tuning on even a small dataset produces big changes in the model.

If you want to use your fine-tuned model in another project, use the same syntax that we used above to load `distilgpt2`—just replace `distilgpt2` with the name of the directory where you saved your model:

*Note that this is slightly more complicated for Colab, so if you want to do this for the class, please just ask me how to sync with Google Drive.*

In [None]:
my_tokenizer = AutoTokenizer.from_pretrained('distilgpt2-finetune-frankenstein20k')
my_model = AutoModelForCausalLM.from_pretrained('distilgpt2-finetune-frankenstein20k')

Now generate with it:

In [None]:
my_generator = pipeline("text-generation", model=my_model, tokenizer=my_tokenizer)

In [None]:
my_generator("Two roads diverged in a yellow")[0]['generated_text']