# GPT-2 Large (774M params) with PyTorch: Not that impressive



_Update, October 21st: Migrated the code to [transformers](https://github.com/huggingface/transformers)._

# This Notebook

In this notebook we will apply the out-of-the-box GPT-2 models (`gpt2`, `gpt2-medium` and the recently released and ported `gpt2-large`) to the samples in the original blog post ([Better Language Models and Their Implications](https://openai.com/blog/better-language-models/)) using `huggingface`'s [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) library, with fairly simple code based on the library's [Quick Start](https://huggingface.co/pytorch-transformers/quickstart.html).

We tried to write a concise and legible set of functions in order to make it easy to play a little with these models.

The results are good but not impressive: they are not consistent throughout, and they fall into some trivial loops. There are several possible causes. First, the recently released large model has only 774M parameters, while the one used to generate the examples (which has not been released yet) has roughly twice as many. Second, our proof of concept (this notebook) lacks some basic tweaks that might improve the results. Third, we generate deterministically, always picking the most likely next token, instead of using top-k truncated sampling and manually cherry-picking the best sample, as their article explicitly mentions:

>Note that while we have hand-chosen these samples, and are thus engaging in some meta-cherry-picking, we believe they are not too unrepresentative of the sampling process. We are simply using top-k truncated sampling, and have yet to explore more advanced methods of sampling (such as beam-search methods).
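
To make the difference with our greedy decoding concrete, here is a minimal sketch of top-k truncated sampling for a single next token. It is written with the standard library only so that it runs without a GPU, and it is not the code used in this notebook or by OpenAI. The function name `sample_top_k`, the default `k=40` and the `temperature` parameter are assumptions made for illustration. In the notebook, the `logits` list would correspond to the scores in `predictions[0, -1, :]`.

```python
import math
import random

def sample_top_k(logits, k=40, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits (illustrative sketch)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one of the kept token ids in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [0.1, 5.0, 0.3, -2.0]
print(sample_top_k(toy_logits, k=1))  # k=1 always returns the argmax (greedy decoding)
print(sample_top_k(toy_logits, k=2, rng=random.Random(0)))
```

With `k=1` this reduces to the deterministic argmax used in this notebook; larger values of `k` trade determinism for variety, which is what makes cherry-picking among several samples possible.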

_Note: We added two simple post-processing steps that seem fair for this first experiment, although they break the out-of-the-box spirit: we remove repeated sentences as well as the last incomplete sentence of the generated text (if any). This noticeably improves the results._
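
As a self-contained illustration of these two steps, the sketch below mirrors the `remove_repetitions` and `trim_last_sentence` helpers from the code section and runs them on a made-up string. It also includes a hypothetical variant, `remove_repetitions_normalized`, which is not part of the notebook: it compares sentences after stripping whitespace, because an exact string comparison treats `"\n\nI'm doing great"` and `" I'm doing great"` as different sentences.

```python
def remove_repetitions(text):
    """Keep only the first occurrence of each '.'-delimited sentence (exact match)."""
    first_occurrences = []
    for sentence in text.split("."):
        if sentence not in first_occurrences:
            first_occurrences.append(sentence)
    return ".".join(first_occurrences)

def remove_repetitions_normalized(text):
    """Hypothetical variant: ignore leading/trailing whitespace when comparing."""
    seen, kept = set(), []
    for sentence in text.split("."):
        key = sentence.strip()
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return ".".join(kept)

def trim_last_sentence(text):
    """Cut everything after the last period, dropping an unfinished sentence."""
    return text[: text.rfind(".") + 1]

raw = "\n\nI'm doing great. I'm doing great. I'm so"
print(repr(trim_last_sentence(remove_repetitions(raw))))
print(repr(trim_last_sentence(remove_repetitions_normalized(raw))))
```

The first call keeps the repeated sentence because the two copies differ in leading whitespace; the second keeps a single copy. Both approaches split on every period, so abbreviations such as "Dr." are treated as sentence boundaries.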


## What's Next?
This is a one-shot, out-of-the-box experiment, which, to be fair, proved too difficult a setting. We are currently writing a follow-up with a friendlier experiment environment, some simple post-processing steps and manual cherry-picking of the best result out of a set of generated ones. The partial results generated in this friendlier experiment are much better than the ones shown here, but still not that terrifying. You can check the unfinished notebook (with all the outputs already generated) [here](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-%20Day%202%20-%20More%20friendly%20experiments%20-%20maybe%20impressive.ipynb).

## Important Notes 
_We were unable to make the large model work on Kaggle, as it just runs out of disk space._ 

In order to post a functional version on Kaggle, we are commenting out the `gpt2-large` executions :( , but a fully executed notebook with the output is available [**here**](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20impressive.ipynb).

Running it on a [`p2.xlarge`](https://aws.amazon.com/ec2/instance-types/p2/) instance on AWS works fine (1 GPU, 4 cores). Also, if you are running this version on Kaggle, remember to turn on the Internet and the GPU. Both toggles are in the Settings panel on the right.
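
Since the failure on Kaggle was about disk space, a quick check before downloading the weights can save a failed run. The sketch below is only a rough estimate under stated assumptions: weights stored as 32-bit floats (4 bytes per parameter) and an arbitrary safety factor of 2 for caches and temporary files. The helper name `has_room_for_model` and its parameters are hypothetical.

```python
import shutil

def has_room_for_model(n_params, path=".", bytes_per_param=4, safety_factor=2.0):
    """Roughly compare the free space at `path` with the estimated size of the weights."""
    needed = n_params * bytes_per_param * safety_factor
    free = shutil.disk_usage(path).free
    return free >= needed, needed, free

# 774M parameters is the size of gpt2-large.
ok, needed, free = has_room_for_model(774_000_000)
print(f"need ~{needed / 1e9:.1f} GB, free {free / 1e9:.1f} GB, enough: {ok}")
```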

## Some Comments on the Results

We didn't get impressive human-like results with the large model in an out-of-the-box set-up.

There are various potential reasons for this:
* The large model is not as good as the huge one. We can see a difference between the small, medium and large versions, so it's possible that the 1.5B-parameter model is simply much better out of the box than the large one.
* There may be some tweaks that improve the output (we are not too familiar with the topic and are just jumping in with this notebook, so maybe some obvious tricks could improve the GPT-2 results a lot). Currently, we are working on a second notebook that tries top-k sampling, generating several results and manually picking the best ones.
* We may be missing something huge, like wrong tokenization. Please, if you find some error in the code, leave a comment!
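
One cheap sanity check for the tokenization concern is an encode/decode round trip. The sketch below works with any object exposing `encode` and `decode`, which the `GPT2Tokenizer` used in this notebook does. The helper `roundtrip_mismatches` and the `LowercasingTokenizer` stand-in are hypothetical and exist only so the example runs offline. A mismatch is a hint to look closer rather than proof of a bug, since decoding options in some library versions may normalize spacing.

```python
def roundtrip_mismatches(tokenizer, texts):
    """Return the texts that do not survive encode -> decode unchanged."""
    return [t for t in texts if tokenizer.decode(tokenizer.encode(t)) != t]

class LowercasingTokenizer:
    """Deliberately lossy stand-in tokenizer, for demonstration only."""

    def __init__(self):
        self.vocab = {}   # word -> id
        self.words = []   # id -> word

    def encode(self, text):
        ids = []
        for word in text.lower().split(" "):
            if word not in self.vocab:
                self.vocab[word] = len(self.words)
                self.words.append(word)
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self.words[i] for i in ids)

print(roundtrip_mismatches(LowercasingTokenizer(), ["hello world", "Hello world"]))

# With the real tokenizer (requires the transformers package and a download):
# from transformers import GPT2Tokenizer
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# print(roundtrip_mismatches(tokenizer, SAMPLE_INPUTS))
```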


# Some Context

## The GPT-2, _February, 2019_

Around 6 months ago, OpenAI published [Better Language Models and Their Implications](https://openai.com/blog/better-language-models/). The article presents GPT-2, a large deep network with impressive new text-generation capabilities, and shows some *amazing* human-like generated text examples. They trained 4 different sizes and released the pre-trained weights for the small and medium versions, keeping the large and huge ones private and calling for a public debate around the social impact of the new human ability to produce machine-generated text of human-like quality.


### The English-Speaking Unicorns and Dr. Pérez:


Let's see the first, best-known example of coherent speech, taken from the previously mentioned blog post: the generation of a fantastic narrative based on a human-written introductory paragraph:

>**SYSTEM PROMPT (HUMAN-WRITTEN)**

>In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

>**MODEL COMPLETION (MACHINE-WRITTEN, 10 TRIES)**

>The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

>Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

>Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

>Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

_(...continues for a few paragraphs with a reasonable, interesting narrative...)_


## GPT-2: 6-Month Follow-Up - GPT-2 Large release, _August, 20th, 2019_

On August 20th, 2019 (4 days ago), OpenAI published [GPT-2: 6-Month Follow-Up](https://openai.com/blog/gpt-2-6-month-follow-up/) on their blog together with the release of the `774M` (large) pre-trained weights and architecture in [this commit](https://github.com/openai/gpt-2/commit/f35fa1d920e9d2d0690f66d03aa3f76b3c59230e).

As the project's [README](https://github.com/openai/gpt-2) says:
> We have currently released small (124M parameter), medium (355M parameter), and large (774M parameter) versions of GPT-2*, with only the full model as of yet unreleased. We have also released a dataset for researchers to study their behaviors
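
As a rough cross-check of these sizes, the parameter counts can be reproduced from the architecture alone. The sketch below assumes the published GPT-2 configurations (number of layers and embedding width per model, a 50257-token vocabulary and 1024 positions) and an output layer that shares its weights with the token embedding; none of these numbers come from this notebook. The function name `count_gpt2_params` is hypothetical.

```python
GPT2_CONFIGS = {
    # model id: (n_layer, n_embd), assumed from the published GPT-2 configurations
    "gpt2": (12, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
}
VOCAB_SIZE = 50257   # GPT-2 BPE vocabulary size (assumed)
N_POSITIONS = 1024   # maximum context length (assumed)

def count_gpt2_params(n_layer, n_embd, vocab_size=VOCAB_SIZE, n_positions=N_POSITIONS):
    """Count trainable parameters of a GPT-2 style decoder with tied output weights."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d      # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)      # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)        # two feed-forward layers
    layer_norms = 2 * (2 * d)                          # two layer norms per block
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_block + final_layer_norm

for model_id, (n_layer, n_embd) in GPT2_CONFIGS.items():
    total = count_gpt2_params(n_layer, n_embd)
    print(f"{model_id}: {total / 1e6:.0f}M parameters")
```

Under these assumptions the totals come out at roughly 124M, 355M and 774M, matching the figures quoted from the README.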

BERT, GPT and GPT-2 were originally released for TensorFlow, and there is an awesome GitHub account named `huggingface` that is migrating all these models to the PyTorch world with a simple library called `pytorch-transformers`.

`gpt2-large` support was added to `master` on August 20th, with [this merge](https://github.com/huggingface/pytorch-transformers/commit/07681b6b5859b630077b742b2f06d440869f17e3).


# References
## Blogposts:
* [Better Language Models and Their Implications  (GPT-2) blogpost](https://openai.com/blog/better-language-models/) - February 14th, 2019 
* [GPT-2 6-month follow-up blogpost](https://openai.com/blog/gpt-2-6-month-follow-up/) - August 20th, 2019
* [Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing (BERT) blogpost](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) - November 2nd, 2018

## Technical:
* [This notebook](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20impressive.ipynb)
* [gpt2 github](https://github.com/openai/gpt-2/)
* [pytorch-transformers github](https://github.com/huggingface/pytorch-transformers/)
* [pytorch-transformers - Docs: quick start](https://huggingface.co/pytorch-transformers/quickstart.html)

# Show me the code!

In [1]:
!pip install transformers



## Fix randomness!

In [2]:
import torch
import numpy as np
    
def fix_randomness():
    seed = 123
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
fix_randomness()

In [3]:
import logging
from transformers import GPT2Tokenizer, GPT2LMHeadModel
logging.getLogger().setLevel(logging.CRITICAL) # Disable an annoying warning for now

SAMPLE_INPUTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.",
    "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.",
    "We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.",
    "Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.",
    "For today’s homework assignment, please describe the reasons for the US Civil War.",
    "John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.",
    "Recycling is good for the world.\n\nNO! YOU COULD NOT BE MORE WRONG!!"
    ]

def get_tokenizer_and_model(model_id):
    """Load the pre-trained tokenizer and model (eval mode, on the GPU)."""
    assert model_id in ['gpt2', 'gpt2-medium', 'gpt2-large']
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id).eval().to('cuda')
    return tokenizer, model

def stoi(tokenizer, text):
    """Encode text into a (1, sequence_length) tensor of token ids on the GPU."""
    indexed_tokens = tokenizer.encode(text)
    tokens = torch.tensor([indexed_tokens]).to('cuda')
    return tokens

def generate_next_token(model, tokens):
    """Greedy decoding: return the id of the most likely next token."""
    with torch.no_grad():
        outputs = model(tokens)
        predictions = outputs[0]
    predicted_token = torch.argmax(predictions[0, -1, :]).item()
    return predicted_token

def add_token(tokens, token):
    """Append a single token id to the (1, sequence_length) tensor."""
    token_in_dimensions_and_gpu = torch.tensor([[token]]).to('cuda')
    return torch.cat([tokens, token_in_dimensions_and_gpu], dim=1)

def add_n_tokens(model, tokens, n_tokens):
    """Generate n_tokens one at a time; return the full sequence and the new ids."""
    generated = []
    for _ in range(n_tokens):
        new = generate_next_token(model, tokens)
        tokens = add_token(tokens, new)
        generated.append(new)
    return tokens, generated

def remove_repetitions(text):
    """Keep only the first occurrence of each '.'-delimited sentence (exact match)."""
    first_occurrences = []
    for sentence in text.split("."):
        if sentence not in first_occurrences:
            first_occurrences.append(sentence)
    return '.'.join(first_occurrences)

def trim_last_sentence(text):
    """Drop whatever follows the last period (the unfinished sentence, if any)."""
    return text[:text.rfind(".")+1]

def postprocess(text):
    return trim_last_sentence(remove_repetitions(text))

def generate(text, n_words=10, model_id='gpt2'):
    """Print a greedy continuation of text. Note: n_words counts tokens, not words."""
    print(f"MODEL: {model_id}")

    tokenizer, model = get_tokenizer_and_model(model_id)
    tokens = stoi(tokenizer, text)
    tokens, generated = add_n_tokens(model, tokens, n_words)

    generated_text = postprocess(tokenizer.decode(generated))

    print(f"INPUT: {text}")
    print(f"OUTPUT: {generated_text}\n")

I1021 21:45:33.513968 140031914428160 file_utils.py:32] TensorFlow version 2.0.0-beta1 available.
I1021 21:45:33.515211 140031914428160 file_utils.py:39] PyTorch version 1.1.0 available.
I1021 21:45:33.979461 140031914428160 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [4]:
%%time
text = "Hello GPT-2, how are you doing?"
n_words = 20

generate(text, n_words)
generate(text, n_words, 'gpt2-medium')
generate(text, n_words, 'gpt2-large')

MODEL: gpt2
INPUT: Hello GPT-2, how are you doing?
OUTPUT: 

I'm doing great. I'm doing great.

MODEL: gpt2-medium
INPUT: Hello GPT-2, how are you doing?
OUTPUT: 

MODEL: gpt2-large
INPUT: Hello GPT-2, how are you doing?
OUTPUT: 

I'm doing great! I'm so happy to be here.

CPU times: user 49.9 s, sys: 7.16 s, total: 57.1 s
Wall time: 56.4 s
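
Part of that wall time is presumably spent loading each model, and the benchmark cells further down load the same models again. One possible way to avoid repeated loads is to memoize the loader. The sketch below is an assumption-laden illustration, not part of the notebook: `make_cached_loader` is a hypothetical helper, and `fake_loader` stands in for `get_tokenizer_and_model` so the example runs without downloading anything.

```python
from functools import lru_cache

def make_cached_loader(load_fn, maxsize=3):
    """Wrap a `model_id -> (tokenizer, model)` loader so repeated ids reuse the result."""
    return lru_cache(maxsize=maxsize)(load_fn)

calls = []

def fake_loader(model_id):
    """Stand-in for get_tokenizer_and_model; records each real load."""
    calls.append(model_id)
    return f"tokenizer:{model_id}", f"model:{model_id}"

load = make_cached_loader(fake_loader)
load("gpt2")
load("gpt2")          # served from the cache, no second load
load("gpt2-medium")
print(calls)
```

The trade-off is memory: keeping several models cached on the same GPU at once may not fit, so a smaller `maxsize` could be the safer choice.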


# Benchmark function against SAMPLE_INPUTS

In [5]:
def benchmark(model_id, n_words, texts=SAMPLE_INPUTS):
    """Load model_id once and print a greedy continuation for every text."""

    print(f"{model_id} with n_words={n_words}\n=========================")
    tokenizer, model = get_tokenizer_and_model(model_id)

    for text in texts:
        tokens = stoi(tokenizer, text)
        tokens, generated = add_n_tokens(model, tokens, n_words)
        generated_text = postprocess(tokenizer.decode(generated))

        # Newlines are stripped so each input/output prints on a single line.
        print("INPUT: {}".format(text.replace('\n', '')))
        print("OUTPUT: {}".format(generated_text.replace('\n', '')))
        print("\n====\n")


# GPT-2 small - 124M

In [6]:
%%time
benchmark('gpt2', n_words=200)

gpt2 with n_words=200
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT: "The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent."The researchers found that the unicorns were able to communicate with each other through their tongues."They were able to communicate with each other through their tongues," Siegel said. "They were able to communicate with each other through their tongues."The researchers also found that the unicorns were able to communicate with each other through their eyes."They were able to communicate with each other through their eyes," Siegel said. "They were able to communicat

# GPT-2 medium - 355M

In [7]:
%%time
benchmark('gpt2-medium', n_words=200)

gpt2-medium with n_words=200
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT: The researchers, led by Dr. David M. Koehler, a professor of anthropology at the University of California, Santa Cruz, discovered the unicorns in the remote valley of La Paz, in the Andes Mountains."We were surprised to find that the unicorns spoke perfect English," said Koehler. "They were very friendly and friendly with us, and they were very friendly with the locals. They were very friendly with the locals, and they were very friendly with the locals."The researchers were surprised to find that the unicorns spoke perfect English. They were very friendly and friendly with us, and they were very friendly with the locals.The researchers were surprised to find that the unicorns spoke perfect English.

====

INPUT: A 

# GPT-2 large - 774M

### Kaggle crashes with gpt2-large. Results on [github](http://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20impressive.ipynb).

In [8]:
%%time
benchmark('gpt2-large', n_words=200)

gpt2-large with n_words=200
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT: The researchers, led by Dr. David R. Williams of the University of California, Santa Cruz, discovered the unicorns in the Andes Mountains of Peru. The area is known for its unique geology and is home to a number of rare species of animals.The researchers found the unicorns in the Andes Mountains of Peru."We were surprised to find that the unicorns were able to communicate with each other," Williams said. "We were also surprised to find that they were able to communicate in English."The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago."The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of the Andes," Will

# GPT-2 large on custom texts

In [9]:
%%time 

ww2 = """World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]"""
benchmark('gpt2-large', n_words=200, texts=[ww2])

gpt2-large with n_words=200
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT: The war was fought in Europe, Asia, and the Pacific. The A

In [10]:
benchmark('gpt2-medium', n_words=200, texts=[ww2])

gpt2-medium with n_words=200
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT: The war was fought in Europe, Asia, Africa, and the Ameri