# Text generation with GPT-2

Today we will try to generate texts using the GPT-2 model proposed by OpenAI. This model is based on the Transformer architecture (in fact, GPT-2 is a modified Transformer decoder, just as BERT is a modified encoder). GPT-2 is a pretrained model that can be downloaded and used in the same way as BERT.

Here you can find a great introduction to the general idea behind GPT-2: https://jalammar.github.io/illustrated-gpt2/

In general, it is a language model: a model that gives us the probability of a given word being the continuation of a given text. For instance, given the context `Transformer is a neural network`, GPT-2 can estimate that there is a `50%` chance that the next word is `architecture` and a `0.0001%` chance that the next word is `donut`.
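
To make the idea of "a probability for every possible next word" concrete, here is a minimal sketch of how raw model scores (logits) become such a distribution via softmax. The four candidate words and their scores are made up for illustration; they are not real GPT-2 outputs, and a real model scores its whole vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for the context "Transformer is a neural network";
# these numbers are invented for illustration, not produced by GPT-2.
toy_logits = {"architecture": 5.0, "model": 4.5, "that": 3.0, "donut": -8.0}

probs = softmax(toy_logits)
for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:>12}: {p:.6f}")
```

The highest-scoring word gets the largest share of the probability mass, while an implausible word such as `donut` ends up with a tiny but non-zero probability.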

Let's use the Hugging Face `transformers` library to experiment with GPT-2.

---
**Done by:** Sofya Aksenyuk, 150284

---

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


# Basic text generation

Let's start with the basic scenario: since GPT-2 can calculate the probability of the next word given some context, it can be used to generate texts. In the `transformers` library, this is easy to do. `transformers` provides so-called pipelines, which hide the lower-level steps so that we can generate texts using two lines of code. 
A pipeline wraps the following phases: `Input -> Tokenization -> Model Inference -> Post-Processing (task dependent) -> Output`.


Please read the docs here: https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/pipelines to familiarize yourself with pipelines.

Then, fill the code below with appropriate fragments. In line 2, let's construct a pipeline of type `text-generation` and set the `model` parameter to `gpt2`.

Then, `generator` can be called like a function: `generator(__some params here__)`. Pass the first words of the text as a string in the first positional argument (do not add a space at the end of it). You can provide additional parameters such as `max_length` (to limit the length of the generated text, counted in tokens and including the prompt) or `num_return_sequences` (to make GPT-2 produce multiple texts).

In [7]:
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")   # construct a text-generation pipeline backed by the gpt2 model
generator("It's raining cats and", max_length=10, num_return_sequences=3)      # prompt has no trailing space; return 3 texts of at most 10 tokens each


Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'It\'s raining cats and dogs and birds," explains'},
 {'generated_text': 'It\'s raining cats and dogs," said Dave Sch'},
 {'generated_text': "It's raining cats and all. One night,"}]
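
As the output above shows, the pipeline returns a list of dictionaries with a `generated_text` key, and each text starts with the prompt we passed in. If we only care about the newly generated part, a small helper can strip the prompt. This is a minimal sketch: the `continuations` function is a hypothetical helper of ours, not part of `transformers`, and the sample data is copied from the output above rather than regenerated.

```python
def continuations(results, prompt):
    """Return only the newly generated part of each pipeline result."""
    new_parts = []
    for item in results:
        text = item["generated_text"]
        # The generated text echoes the prompt, so drop it when present.
        new_parts.append(text[len(prompt):] if text.startswith(prompt) else text)
    return new_parts

# Copied from the output above.
sample_results = [
    {"generated_text": 'It\'s raining cats and dogs and birds," explains'},
    {"generated_text": 'It\'s raining cats and dogs," said Dave Sch'},
    {"generated_text": "It's raining cats and all. One night,"},
]

for part in continuations(sample_results, "It's raining cats and"):
    print(repr(part))
```

Note that every continuation begins with a space: the model produces the space together with the word that follows it.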

There are various GPT-based models available in the `transformers` library; you can find a list of them here: https://huggingface.co/models?search=gpt. They differ in the datasets they were trained on (the original GPT-2 was trained on WebText, https://paperswithcode.com/dataset/webtext, which consists of ~40GB of text scraped from the internet) and in model size (e.g., GPT2-small has 117M parameters, GPT2-medium 345M, GPT2-large 762M).

Depending on our needs and available GPU memory, we can choose an appropriate one. 
There are also distilled models that are compressed in a way similar to DistilBERT: https://huggingface.co/distilgpt2 (you can read more about distillation here: https://neptune.ai/blog/knowledge-distillation).
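
To get a rough feeling for how model size translates into memory, we can estimate the space taken by the weights alone. This is a back-of-the-envelope sketch under two stated assumptions: every parameter is stored as a 32-bit float (4 bytes), and we ignore activations, buffers and anything else needed at run time, so the real footprint will be higher.

```python
BYTES_PER_PARAM_FP32 = 4  # assumption: weights stored as 32-bit floats

def weights_mib(num_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Memory needed for the weights alone, in MiB."""
    return num_params * bytes_per_param / 2**20

# Parameter counts as quoted in the text above.
sizes = {
    "GPT2-small": 117_000_000,
    "GPT2-medium": 345_000_000,
    "GPT2-large": 762_000_000,
}

for name, count in sizes.items():
    print(f"{name}: about {weights_mib(count):.0f} MiB of weights")
```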

Check how model size relates to the quality of generated texts. Replace `gpt2` in the pipeline with `gpt2-medium` or `gpt2-large` and analyze the results. (Among the official checkpoints, the small model is published as plain `gpt2`, not `gpt2-small`.)

Check how models trained on some more "specific" data work (e.g., `CodeGPT-small-java-adaptedGPT2`, which can be used to write Java code).

*No report on the results is required. Just experiment if you are interested in this topic :)*


# GPT-2 as a source of knowledge
Since the model provides probable continuations of texts, we can use it to find answers to some questions. 
You can type `The capital of Poland is` as a context to check if `Warsaw` will be proposed.

(Beware: don't add any whitespace at the end of the context. GPT-2's tokens usually carry the space that precedes a word, so a trailing space in the prompt frequently leads to some... strange results.)

However, remember that the internet is biased. A lot of work explores the bias of GPT models; this paper is an easy-to-follow analysis of the problem: http://aclanthology.lst.uni-saarland.de/D19-1339.pdf. Because these models are trained on human-generated content, we should not treat them as oracles. Instead, we should treat them as a model of a stereotypical human being ;).

In [8]:
generator("A woman works as", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'A woman works as a public defender for a homeless couple in Boston, Mass., and on Monday morning, she was fired from her job after she called'},
 {'generated_text': 'A woman works as a "producers" for a company in a hotel in New York. (Charles Krupa/Associated Press)\n\n"'},
 {'generated_text': 'A woman works as head of a charity she founded that has helped millions in need of money through loans and work in need of social services. The charity'},
 {'generated_text': 'A woman works as a laborer working outside the office to support her daughter while her husband helps her raise some vegetables. (Photo: Kim Chae'},
 {'generated_text': 'A woman works as a teacher at a high school in North Carolina and was targeted to steal her belongings and car, Fox News reported Wednesday. The teen'}]

# Greedy search vs beam search

The default workflow of text generation with GPT-2 (a plain call to `generate`) uses the greedy search strategy: given some context sequence, at each step the model chooses the token with the highest probability as the continuation. However, the locally best token does not always lead to the most probable sequence overall, so we may generate "suboptimal" sequences. Please look at this webpage to grasp the idea of beam search: https://huggingface.co/blog/how-to-generate. In short, beam search keeps the `num_beams` most likely hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability.
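
The difference between the two strategies is easiest to see on a tiny hand-made "language model". The sketch below is a toy under stated assumptions: the next-word probabilities depend only on the previous word, and the numbers in `NEXT` are invented for illustration (in the spirit of the example in the linked post), not taken from GPT-2. Greedy search is simply beam search with a single beam.

```python
import math

# Toy next-word probabilities keyed by the previous word.
# Invented numbers for illustration; these are not GPT-2 outputs.
NEXT = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable hypotheses after every step."""
    beams = [([start], 0.0)]  # each beam is (words, log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            for word, p in NEXT[words[-1]].items():
                candidates.append((words + [word], score + math.log(p)))
        # Sort all extensions by score and keep only the best num_beams.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    words, score = beams[0]
    return " ".join(words), math.exp(score)

greedy_text, greedy_p = beam_search("The", steps=2, num_beams=1)  # greedy = one beam
beam_text, beam_p = beam_search("The", steps=2, num_beams=2)
print(greedy_text, round(greedy_p, 2))  # The nice woman 0.2
print(beam_text, round(beam_p, 2))      # The dog has 0.36
```

Greedy search commits to `nice` because it is the best first step, and so it never discovers that `dog` followed by `has` gives a more probable sequence overall. With two beams, the `dog` hypothesis survives the first step and wins in the end.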

The code below shows an alternative (let's call it classic) approach to using GPT. Instead of a pipeline, we create the tokenizer and the model manually and then pass the tokenized context to the model. Look at the call to the `generate` function: the `num_beams` parameter sets the number of beams to keep! Try changing it to see how the quality of the output changes.

In [14]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt_model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

starting_context = "The GPT model is great"

input_ids = tokenizer(starting_context, return_tensors="pt").input_ids

outputs = gpt_model.generate(
    input_ids,
    num_beams=100,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    max_length=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The GPT model is great, but there's still a lot of work to be done.


In [15]:
outputs = gpt_model.generate(
    input_ids,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    max_length=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The GPT model is great, but there are a few things that need to be taken into account. The first and most important thing you should look out for when deciding whether or not it's worth investing your money in one of these companies:



In [16]:
outputs = gpt_model.generate(
    input_ids,
    num_beams=1,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    max_length=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The GPT model is great for the first few years, but it's not going to be as good in a long time.
I'm sure there are some people who think that this will make them feel better about their lives and they'll start
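
All three runs above print a warning that the attention mask and the pad token id were not set. A possible way to address it, sketched below, is to pass the `attention_mask` returned by the tokenizer and to set `pad_token_id` explicitly to the end-of-sequence token (GPT-2 has no dedicated padding token). The function name `generate_with_mask` and its defaults are our own choices for illustration; the imports sit inside the function so that defining it does not load the library or the model.

```python
def generate_with_mask(prompt, num_beams=5, max_length=50):
    """Beam-search generation that passes attention_mask and pad_token_id explicitly."""
    # Imported here so that defining the function does not load the library.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # The tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
        num_beams=num_beams,
        num_return_sequences=1,
        no_repeat_ngram_size=1,
        max_length=max_length,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling, for example, `print(generate_with_mask("The GPT model is great", num_beams=10))` downloads the model on first use, just like the cells above. For a single unpadded prompt the generated text should not depend on the mask, so the main benefit here is that the intent is explicit.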


# Constrained GPT-2

Sometimes we would like to constrain the output generated by the model. If you use the GPT-2 model to write comments about your products, you want them to be positive :). Wouldn't it be useful to force GPT-2 to generate texts that must contain selected words like `wonderful`, `best` or `amazing`? :).

The GPT-2 models allow us to constrain the output in such a way. You can find a good introduction here: https://towardsdatascience.com/new-hugging-face-feature-constrained-beam-search-with-transformers-7ebcfc2d70e9

Analyze the snippet below (modified code from the website mentioned above) to see how we can force GPT-2 to use some tokens. There are 2 cases: 
* we give a single token that has to be present somewhere in the generated text
* we give a list of alternatives from which the GPT-2 model chooses one.

Important side note: when experimenting with the code, I once noticed that the model generated `besting` instead of the expected word `best`. I was surprised at first, but this is correct behavior: `best` is the token we require in the generated text, and pretrained transformer models use tokenization that may produce subword units. If a continuation subtoken follows `best` (in the WordPiece notation used in BERT it would be written `##ing`; GPT-2's byte-level BPE instead attaches the preceding space to tokens that start a new word, so a continuation is simply a token without that space), the two tokens are joined into one word. That doesn't make the result wrong: the token `best` is included in the generated text!
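
The `besting` effect can be reproduced without any model. The sketch below assumes GPT-2-style tokens, where a token that starts a new word carries its leading space and a continuation token does not. The two token sequences are hypothetical examples written by hand, not real tokenizer output.

```python
# Hypothetical token sequences; a leading space marks the start of a new word.
tokens_plain = ["The", " product", " is", " the", " best"]
tokens_joined = ["The", " product", " keeps", " best", "ing", " rivals"]

def detokenize(tokens):
    """Concatenate tokens; word boundaries come from the tokens themselves."""
    return "".join(tokens)

def contains_token(tokens, wanted):
    """True if the exact token appears in the sequence."""
    return wanted in tokens

def contains_word(tokens, wanted):
    """True if the wanted string appears as a whole whitespace-separated word."""
    return wanted.strip() in detokenize(tokens).split()

for seq in (tokens_plain, tokens_joined):
    print(detokenize(seq), contains_token(seq, " best"), contains_word(seq, " best"))
```

In the second sequence the required token ` best` is present, so a token-level constraint is satisfied, even though the decoded text contains the word `besting` rather than `best`.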

In [17]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt_model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

must_contain_token = "best"
must_contain_alternatives = ["amazing", "wonderful", "beautiful", "exceptional"]  # let gpt choose which word to use


force_words_ids = [
    tokenizer([must_contain_token], add_prefix_space=True, add_special_tokens=False).input_ids,
    tokenizer(must_contain_alternatives, add_prefix_space=True, add_special_tokens=False).input_ids,
]

starting_text = ["The laptop", "The product"]
input_ids = tokenizer(starting_text, return_tensors="pt").input_ids


outputs = gpt_model.generate(
    input_ids,
    force_words_ids=force_words_ids,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(tokenizer.decode(outputs[1], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The laptop is powered by an Intel Core i7-4790K CPU, which has amazing best
The product is available in a variety of colors and sizes, including the standard black. The beautiful best
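
We can quickly verify that both generated texts respect the constraints: each must contain `best` and at least one of the alternatives. The `satisfies_constraints` helper below is a hypothetical function of ours, and it uses a plain substring check on purpose, so a joined form such as `besting` would also count. The texts are copied from the output above rather than regenerated.

```python
def satisfies_constraints(text, required, alternatives):
    """Check that `required` and at least one of `alternatives` occur in text."""
    has_required = required in text
    has_alternative = any(word in text for word in alternatives)
    return has_required and has_alternative

required = "best"
alternatives = ["amazing", "wonderful", "beautiful", "exceptional"]

# Outputs copied from the run above.
generated = [
    "The laptop is powered by an Intel Core i7-4790K CPU, which has amazing best",
    "The product is available in a variety of colors and sizes, "
    "including the standard black. The beautiful best",
]

for text in generated:
    print(satisfies_constraints(text, required, alternatives))
```

Both checks pass, but note where the forced words ended up: they are appended at the very end of each text, which satisfies the constraint without making the sentence more natural.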


In 2020, a new version called GPT-3 was created. OpenAI did not release the model itself and provides only API-based access, but some attempts to replicate it are being made. You can find an open model that follows a GPT-3-like design, at a much smaller scale, here: https://huggingface.co/EleutherAI/gpt-neo-1.3B.
The story behind GPT-3 and the reasons why it was not published as a downloadable model are described on Wikipedia: https://en.wikipedia.org/wiki/GPT-3.
