# GPT-2 - Large (774M params) with PyTorch: Not that terrifying

## This notebook

In this notebook we will apply huggingface's [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) `gpt2`, `gpt2-medium` and `gpt2-large` models to the samples in the original GPT-2 blogpost ([Better Language Models and Their Implications](https://openai.com/blog/better-language-models/)), using fairly simple code based on the [pytorch-transformers - Docs: quick start](https://huggingface.co/pytorch-transformers/quickstart.html).

NOTE: I was unable to make the large model work on Kaggle: it runs out of disk space while downloading the weights. I'm currently running this notebook on a `p2.xlarge` on AWS and it runs OK. I'm commenting out the large model usage for Kaggle. The executed notebook with the output is available [here](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20terrifying.ipynb).

## The GPT-2

Around 6 months ago, OpenAI published [Better Language Models and Their Implications](https://openai.com/blog/better-language-models/), where they showed some examples of human-like narrative text generated by a large deep network, which they refused to release due to its potential for harm, calling for a mature social debate around deep learning in text processing.


### The English-speaking unicorns and Dr. Pérez:

This is an example of realistic machine-produced prose, taken from the previously mentioned blogpost:

>**SYSTEM PROMPT (HUMAN-WRITTEN)**

>In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

>**MODEL COMPLETION (MACHINE-WRITTEN, 10 TRIES)**

>The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

>Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

>Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

>Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

> (...continues for a few paragraphs with a reasonable narrative...)


## GPT-2: 6-Month Follow-Up - GPT-2 Large release

On August 20th, 2019 (4 days ago), OpenAI published [GPT-2: 6-Month Follow-Up](https://openai.com/blog/gpt-2-6-month-follow-up/) on their blog, together with the release of the `774M` (large) pre-trained weights and architecture in [this commit](https://github.com/openai/gpt-2/commit/f35fa1d920e9d2d0690f66d03aa3f76b3c59230e).

As the project's [README](https://github.com/openai/gpt-2) says:
> We have currently released small (124M parameter), medium (355M parameter), and large (774M parameter) versions of GPT-2*, with only the full model as of yet unreleased. We have also released a dataset for researchers to study their behaviors

BERT, GPT and GPT-2 were originally released for TensorFlow, and there is an awesome GitHub account named `huggingface` that is porting all these models to the PyTorch world with a simple library called `pytorch-transformers`.

`gpt2-large` support was added to `master` on August 20th, with [this merge](https://github.com/huggingface/pytorch-transformers/commit/07681b6b5859b630077b742b2f06d440869f17e3).

## Results

I didn't get terrifyingly human results with the large model out of the box.
There are several possible reasons:
* The generator gets better as the size increases, so the still-unreleased full model (with 1.5 billion parameters) may be a lot better than this one out of the box.
* Maybe there are some tweaks that improve the output. I'm new to the topic and just jumping in, so some obvious tricks may improve the GPT-2 generation a lot.
* I may be missing something big, like wrong tokenization.

### References
#### Blogpost:
* [Better Language Models and Their Implications  (GPT-2) blogpost](https://openai.com/blog/better-language-models/) - February 14th, 2019 
* [GPT-2 6-month follow-up blogpost](https://openai.com/blog/gpt-2-6-month-follow-up/) - August 20th, 2019
* [Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing (BERT) blogpost](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) - November 2nd, 2018

#### Technical:
* [gpt2 github](https://github.com/openai/gpt-2/)
* [pytorch-transformers github](https://github.com/huggingface/pytorch-transformers/)
* [pytorch-transformers - Docs: quick start](https://huggingface.co/pytorch-transformers/quickstart.html)

# Let's do it

In [1]:
# git cloning because the `pip` version of the library doesn't have the commit we need
!git clone https://github.com/huggingface/pytorch-transformers.git
!rm -rf ./pytorch-transformers/.git

fatal: destination path 'pytorch-transformers' already exists and is not an empty directory.


## Fix randomness!

In [2]:
def fix_randomness():
    import numpy as np
    import torch
    seed = 123
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
fix_randomness()

In [3]:
import torch

import logging
logging.basicConfig(level=logging.CRITICAL) # Only show CRITICAL messages, which silences the library's log output

import sys; sys.path.append("pytorch-transformers/")
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

SAMPLE_INPUTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.",
    "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.",
    "We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.",
    "Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.",
    "For today’s homework assignment, please describe the reasons for the US Civil War.",
    "John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.",
    "Recycling is good for the world.\n\nNO! YOU COULD NOT BE MORE WRONG!!"
    ]

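# Run on the GPU when one is available, otherwise fall back to the CPU.
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def get_tokenizer_and_model(model_id):
    # Download (or load from cache) the pretrained tokenizer and weights.
    assert model_id in ['gpt2', 'gpt2-medium', 'gpt2-large']
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id).eval().to(DEVICE)
    return tokenizer, model

def stoi(tokenizer, text):
    # String to indices: encode the text as a (1, sequence_length) tensor of token ids.
    indexed_tokens = tokenizer.encode(text)
    tokens = torch.tensor([indexed_tokens]).to(DEVICE)
    return tokens

def generate_next_token(model, tokens):
    # Greedy decoding: always pick the single most likely next token.
    with torch.no_grad():
        outputs = model(tokens)
        predictions = outputs[0]  # logits, shape (batch, sequence_length, vocab_size)
    predicted_token = torch.argmax(predictions[0, -1, :]).item()
    return predicted_token

def add_token(tokens, token):
    # Append one token id to the running (1, sequence_length) tensor.
    token_in_dimensions_and_device = torch.tensor([[token]]).to(DEVICE)
    return torch.cat([tokens, token_in_dimensions_and_device], dim=1)

def add_n_tokens(model, tokens, n_tokens):
    # Generate n_tokens one at a time, feeding each prediction back in as input.
    generated = []
    for _ in range(n_tokens):
        new = generate_next_token(model, tokens)
        tokens = add_token(tokens, new)
        generated.append(new)
    return tokens, generated

def run(text, n_words=10, model_id='gpt2'):
    # Note: n_words is the number of tokens to generate, not the number of words.
    tokenizer, model = get_tokenizer_and_model(model_id)
    tokens = stoi(tokenizer, text)
    tokens, generated = add_n_tokens(model, tokens, n_words)
    print(f"INPUT: {text}")
    print(f"OUTPUT: {tokenizer.decode(generated)}")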

In [4]:
%%time
text = "Hello my dear GPT, how are you doing?"
n_words = 20

print("====SMALL")
run(text, n_words)

print("\n\n====MEDIUM")
run(text, n_words, 'gpt2-medium')

print("\n\n====LARGE")
run(text, n_words, 'gpt2-large')

====SMALL
INPUT: Hello my dear GPT, how are you doing?
OUTPUT: 

I am so happy to see you are doing well.

I am so happy to


====MEDIUM
INPUT: Hello my dear GPT, how are you doing?
OUTPUT:  I am very happy to see you here. I am very happy to see you here. I am


====LARGE
INPUT: Hello my dear GPT, how are you doing?
OUTPUT:  I am very happy to see you. I am very happy to see you. I am very happy
CPU times: user 49.7 s, sys: 6.48 s, total: 56.2 s
Wall time: 55.5 s
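All three models fall into a loop here, which is a typical symptom of greedy decoding: `generate_next_token` always takes the `argmax`, so the same context always yields the same continuation. One tweak that may help is top-k sampling, where the next token is drawn at random from the `k` most likely candidates. Below is a minimal sketch in plain Python, not part of the original notebook. The function name `sample_top_k` and its parameters are made up for illustration, and `k=40` is an assumption taken from the sampling setup mentioned in the GPT-2 paper, not a value tuned here.

```python
import math
import random

def sample_top_k(logits, k=40, temperature=1.0, rng=random):
    """Sample a token id from the k highest-scoring entries of `logits`.

    `logits` is a plain list of floats, one score per vocabulary entry.
    With k=1 this reduces to the greedy (argmax) choice.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving logits only (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]

# Toy vocabulary of 5 tokens (hypothetical scores); token 2 has the highest score.
toy_logits = [0.1, 2.0, 3.5, -1.0, 2.9]
print(sample_top_k(toy_logits, k=1))  # greedy choice: 2
print(sample_top_k(toy_logits, k=3))  # one of 1, 2 or 4, chosen at random
```

To try it in this notebook, the logits for the last position could be passed as `predictions[0, -1, :].tolist()` in place of the `torch.argmax` call inside `generate_next_token`. Note that the output then depends on the random seed, and that `fix_randomness` does not seed Python's `random` module.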


# Check architecture size

In [5]:
def get_arch_desc(model_id):
    _, model = get_tokenizer_and_model(model_id)
    modules = dict(model.named_modules())
    n_layers = len(modules['transformer.h'])  # number of transformer blocks
    n_hidden = modules['transformer.wte'].embedding_dim  # embedding width
    # model.parameters() yields each shared tensor once, so tied weights are not double-counted.
    n_params = sum(p.numel() for p in model.parameters())
    print(model_id, n_layers, n_hidden, int(n_params/1000000), "M")

In [6]:
get_arch_desc('gpt2')
get_arch_desc('gpt2-medium')
get_arch_desc('gpt2-large')

gpt2 12 768 124 M
gpt2-medium 24 1024 354 M
gpt2-large 36 1280 774 M
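These totals can be cross-checked by hand from the layer count and hidden size alone. The sketch below is a back-of-the-envelope calculation, not something read from the library. It assumes the standard GPT-2 layout: a vocabulary of 50257 tokens, 1024 positions, a 4x wide feed-forward layer, biases on every linear layer, and an output layer that shares its weights with the token embedding. The function name `gpt2_param_count` is made up for illustration.

```python
def gpt2_param_count(n_layers, n_hidden, vocab_size=50257, n_positions=1024):
    """Closed-form parameter count for a GPT-2 style decoder (assumed layout)."""
    d = n_hidden
    # Token and position embeddings (the output layer reuses the token embedding).
    embeddings = vocab_size * d + n_positions * d
    # Attention: fused query/key/value projection plus output projection, with biases.
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # Feed-forward: expand to 4*d and project back, with biases.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layers * per_block + final_layer_norm

for name, n_layers, n_hidden in [('gpt2', 12, 768),
                                 ('gpt2-medium', 24, 1024),
                                 ('gpt2-large', 36, 1280)]:
    total = gpt2_param_count(n_layers, n_hidden)
    print(name, n_layers, n_hidden, total // 1_000_000, "M")
# gpt2 12 768 124 M
# gpt2-medium 24 1024 354 M
# gpt2-large 36 1280 774 M
```

Each block contributes `12 * d**2 + 13 * d` parameters, so the transformer blocks dominate the total as the model grows, while the embeddings matter most for the small model.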


# Benchmark function

In [7]:
def benchmark(model_id, n_words, texts=SAMPLE_INPUTS):
    print(f"{model_id} with n_words={n_words}\n=========================")
    results = []
    tokenizer, model = get_tokenizer_and_model(model_id)
    for text in texts:
        tokens = stoi(tokenizer, text)
        tokens, generated = add_n_tokens(model, tokens, n_words)
        generated_text = tokenizer.decode(generated)
        results.append(generated_text)
        print("INPUT: {}".format(text.replace('\n', '')))
        print("OUTPUT: {}".format(generated_text.replace('\n', '')))
        print("\n====\n")

# GPT-2 small (124M) & GPT-2 medium (354M)

In [8]:
%%time
benchmark('gpt2', n_words=200)

gpt2 with n_words=200
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT: "The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent."The researchers found that the unicorns were able to communicate with each other through their tongues."They were able to communicate with each other through their tongues," Siegel said. "They were able to communicate with each other through their tongues."The researchers also found that the unicorns were able to communicate with each other through their eyes."They were able to communicate with each other through their eyes," Siegel said. "They were able to communicat

In [9]:
%%time
benchmark('gpt2-medium', n_words=200)

gpt2-medium with n_words=200
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT: The researchers, led by Dr. David M. Koehler, a professor of anthropology at the University of California, Santa Cruz, discovered the unicorns in the remote valley of La Paz, in the Andes Mountains."We were surprised to find that the unicorns spoke perfect English," said Koehler. "They were very friendly and friendly with us, and they were very friendly with the locals. They were very friendly with the locals, and they were very friendly with the locals."The researchers were surprised to find that the unicorns spoke perfect English. They were very friendly and friendly with us, and they were very friendly with the locals.The researchers were surprised to find that the unicorns spoke perfect English. They were very f

# GPT-2 large (774M)

In [10]:
%%time
benchmark('gpt2-large', n_words=200)

gpt2-large with n_words=200
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT: The researchers, led by Dr. David R. Williams of the University of California, Santa Cruz, discovered the unicorns in the Andes Mountains of Peru. The area is known for its unique geology and is home to a number of rare species of animals.The researchers found the unicorns in the Andes Mountains of Peru."We were surprised to find that the unicorns were able to communicate with each other," Williams said. "We were also surprised to find that they were able to communicate in English."The researchers believe that the unicorns are descendants of the ancient Incas, who lived in the area around 2,000 years ago."The Incas were the first people to use the Andes Mountains as a refuge from the harsh climate of the Andes," Will
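The generations above read as more repetitive for the smaller models. A rough way to put a number on that is to count how many word n-grams in a text have already appeared earlier in it. This is only a sketch of one possible measure, with a made-up function name (`repeated_ngram_fraction`) and an arbitrary choice of `n=4`; it is not a metric used in the GPT-2 blogposts.

```python
def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams in `text` that duplicate an earlier n-gram.

    Returns 0.0 for texts with no repeats (or fewer than n words);
    values close to 1.0 mean the text loops on the same phrases.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# The `gpt2-large` reply to the "Hello my dear GPT" prompt from earlier in this notebook.
looping = "I am very happy to see you. I am very happy to see you. I am very happy"
print(round(repeated_ngram_fraction(looping), 2))  # 0.53
```

Running it on the `results` collected inside `benchmark` for each model would allow a like-for-like comparison across `gpt2`, `gpt2-medium` and `gpt2-large`.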

# GPT-2 large on custom texts

In [11]:
%%time 

ww2 = """World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]"""
benchmark('gpt2-large', n_words=200, texts=[ww2])

gpt2-large with n_words=200
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT: The war was fought in Europe, Asia, and the Pacific. The A

In [12]:
benchmark('gpt2-medium', n_words=200, texts=[ww2])

gpt2-medium with n_words=200
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT: The war was fought in Europe, Asia, Africa, and the Ameri