# GPT2-Large - Day 2 - More friendly experiments - maybe impressive

<br>

_Update, October 21st: Migrated code to [transformers](https://github.com/huggingface/transformers)._

## This Notebook

In this notebook we run the recently released PyTorch port of GPT-2 large over the samples shown in the original blog post.

We originally ran an *out-of-the-box zero-shot experiment*, with not very impressive results, in [this](https://www.kaggle.com/julian3833/gpt2-large-774m-w-pytorch-not-that-impressive/) notebook (also [here](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20impressive.ipynb) on GitHub).


In this notebook we take the opposite approach, **making it easy for the model to show its best face**. We post-process the generated text to remove trivial problems, generate several samples, and manually pick the best ones.

## Results

Results are much better than in previous experiments.

## Experimentation framework

We will do the following:
* Run 10 trials with some parameters for different inputs
* Remove repeated sentences and the incomplete last sentence, if any
* Ignore results shorter than a threshold (33% of the requested length in the first experiment, a fixed number of words in the second one)
* Manually pick the best generated texts
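
The length filter in the third step can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper name `passes_length_filter` and its parameters are hypothetical, and words are counted by splitting on spaces.

```python
def passes_length_filter(text, n_requested, min_ratio=0.33, min_words=None):
    """Return True if a post-processed sample is long enough to keep.

    Hypothetical helper. With min_words=None the relative rule of the first
    experiment applies (more than min_ratio * n_requested words); otherwise
    the absolute rule of the second experiment applies (more than min_words).
    """
    n_gen_words = len(text.split(" "))
    if min_words is None:
        return n_gen_words > min_ratio * n_requested
    return n_gen_words > min_words


# First experiment: 500 requested, so a sample needs more than 165 words.
short_sample = " ".join(["word"] * 100)
long_sample = " ".join(["word"] * 300)
print(passes_length_filter(short_sample, n_requested=500))  # False
print(passes_length_filter(long_sample, n_requested=500))   # True

# Second experiment: absolute threshold of 250 words.
print(passes_length_filter(long_sample, n_requested=800, min_words=250))  # True
```

Samples that fail the filter are still generated; they are simply not shown for manual review.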


### About the post processing

Removing repeated sentences seems fair: the problem is trivially solvable, and fixing it does not compromise the big picture. Repetition is a second-order problem compared to generating coherent text, since a simple post-filter removes it. The filter could be applied at generation time, but we apply it _a posteriori_ because that is simpler to implement, at the cost of extra generation time. We also remove the last sentence when it is incomplete.

A good example of the application of this rule is the following output from the experiment with the WW2 paragraph. Note how much the repetitions affect the perceived coherence:


#### Curated result

> The war was fought on two fronts: the Western Front (Western Europe) and the Eastern Front (Eastern Europe). The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled. The war was fought in Europe, the Middle East, Africa, and Asia. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. 

#### Original result

> The war was fought on two fronts: the Western Front (Western Europe) and the Eastern Front (Eastern Europe). The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled. The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled. The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled.The war was fought in Europe, the Middle East, Africa, and Asia. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. 
The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were


#### If you know a native deep-learning way to manage the repetitions, please comment!
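
One generation-time option, sketched here under assumptions rather than taken from this notebook, is to forbid any token that would complete an n-gram already present in the output. The helper `banned_next_tokens` below is hypothetical and works on plain lists of token ids; in a sampling loop, the logits of the banned ids would be set to a very negative value before the softmax.

```python
def banned_next_tokens(generated, n=3):
    """Return the token ids that would complete an n-gram already in `generated`.

    Illustrative helper (not part of the notebook): `generated` is the list
    of token ids produced so far.
    """
    if len(generated) < n:
        return set()
    # The last n-1 tokens are the prefix that the next token would extend.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = generated[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


history = [5, 7, 9, 5, 7]  # made-up token ids
print(banned_next_tokens(history, n=3))  # {9}
```

Later releases of transformers expose comparable options on `generate` (`no_repeat_ngram_size`, `repetition_penalty`); check the documentation of the installed version before relying on them.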


# Results

In [1]:
!pip install transformers



In [2]:
import logging

import numpy as np
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

logging.getLogger().setLevel(logging.CRITICAL)  # Disable an annoying warning for now

SAMPLE_INPUTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.",
    "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.",
    "We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.",
    "Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.",
    "For today’s homework assignment, please describe the reasons for the US Civil War.",
    "John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.",
    "Recycling is good for the world.\n\nNO! YOU COULD NOT BE MORE WRONG!!"
    ]

def fix_randomness(seed=123):
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
def get_tokenizer_and_model(model_id):
    assert model_id in ['gpt2', 'gpt2-medium', 'gpt2-large']
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id).eval().to('cuda')
    return tokenizer, model

def stoi(tokenizer, text):
    indexed_tokens = tokenizer.encode(text)
    tokens = torch.tensor([indexed_tokens]).to('cuda')
    return tokens

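def top_k_logits(logits, k):
    # Keep the k largest logits in each row and push the rest to a large negative value.
    if k == 0:
        return logits
    values, _ = torch.topk(logits, k)
    # Keep a trailing dimension so the comparison broadcasts row by row for any batch size.
    min_values = values[:, -1, None]
    return torch.where(logits < min_values, torch.full_like(logits, -1e10), logits)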

def add_n_tokens(model, tokens, n_tokens, temperature=1.0, top_k=40):
    # Sample n_tokens new tokens one at a time, feeding the growing sequence back to the model.
    generated = []
    with torch.no_grad():
        for _ in range(n_tokens):
            logits = model(tokens)[0]
            # Use only the logits of the last position and apply the temperature.
            logits = logits[:, -1, :] / temperature
            logits = top_k_logits(logits, k=top_k)
            # softmax returns probabilities, not log-probabilities.
            probs = F.softmax(logits, dim=-1)
            new = torch.multinomial(probs, num_samples=1)

            tokens = torch.cat((tokens, new), dim=1)
            generated.append(new[0][0].item())
    return tokens, generated

def remove_repetitions(text):
    # Keep only the first occurrence of each sentence (split on "."), preserving order.
    first_occurrences = []
    for sentence in text.split("."):
        if sentence not in first_occurrences:
            first_occurrences.append(sentence)
    return '.'.join(first_occurrences)

def trim_last_sentence(text):
    return text[:text.rfind(".")+1]

def postprocess(text):
    return trim_last_sentence(remove_repetitions(text))

def benchmark(model_id, n_words, n_trials=10, top_k=40, temperature=0.1, texts=SAMPLE_INPUTS):
    # Note: n_words is passed to add_n_tokens, so it counts generated tokens, not words.
    print(f"{model_id} with n_words={n_words}\n=========================")
    
    tokenizer, model = get_tokenizer_and_model(model_id)
    results = {}
    
    for text in texts:
        results[text] = []
        print("---------------------------------------")
        print("INPUT: {}".format(text.replace('\n', '')))
        
        for trial in range(n_trials):
            tokens = stoi(tokenizer, text)
            tokens, generated = add_n_tokens(model, tokens, n_words, temperature=temperature, top_k=top_k)
            generated_text = postprocess(tokenizer.decode(generated))
            n_gen_words = len(generated_text.split(" "))
            results[text].append(generated_text)
            if n_gen_words > 0.33*n_words:
                print("OUTPUT {:2d}: {}".format(trial+1, generated_text.replace('\n', '')))
                print("\n====\n")
        print("---------------------------------------")
    return results



In [3]:
fix_randomness()

# GPT-2 large on the first paragraph of the Wikipedia article for World War II

In this block we run GPT-2 large on the first paragraph of the Wikipedia article for World War II.
We generate 500 tokens (`n_words` is passed to `add_n_tokens`, so it counts tokens rather than words) and run 10 trials.


### Results: the 3 post-filtered generated texts are clearly distinguishable from human-written text




In [4]:
%%time 
ww2 = """World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]"""
results_ww2 = benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=10, top_k=10, temperature=0.1)

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
--------------------------

# English-speaking Unicorns


### Results: quite good, but some semantic repetitions

In [5]:
%%time 
results_unicorns = benchmark('gpt2-large', n_words=500, texts=[SAMPLE_INPUTS[0]], n_trials=10, top_k=10, temperature=0.1)

gpt2-large with n_words=500
---------------------------------------
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT  3: The researchers, from the University of California, Santa Cruz, and the University of California, Berkeley, were studying the behavior of the animals in the Andes Mountains when they discovered the unicorns."We were surprised to find that the unicorns were able to communicate with each other in English," said study co-author Dr. David H. Smith, a professor of biology at UC Santa Cruz. "We were also surprised to find that the unicorns were able to communicate with each other in English, which is a language that is not spoken by humans."The researchers were able to track the unicorns' movements by using GPS technology. The GPS technology allowed them to track the animals' move

# Train carriage and Miley Cyrus shoplifting


### Results: no results for Miley and almost perfect result for Train carriage


In [6]:
SAMPLE_INPUTS[1:3]

['A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.',
 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.']

In [7]:
results = {}
results.update(results_ww2)
results.update(results_unicorns)
results.keys()

dict_keys(["World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]", 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, pre

In [8]:
%%time 
x = benchmark('gpt2-large', n_words=500, texts=SAMPLE_INPUTS[1:3], n_trials=10, top_k=10, temperature=0.1)
results.update(x)

gpt2-large with n_words=500
---------------------------------------
INPUT: A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
OUTPUT  5: The incident occurred at about 7:30 a.m. in the area of West 12th Street and East Ninth Avenue, according to the Cincinnati Police Department.The train was carrying a shipment of nuclear materials, including a fuel rod, a control rod and a control rod assembly, according to the CPD.The incident is being investigated by the CPD's Special Investigations Unit.The CPD is asking anyone with information to call the CPD's Special Investigations Unit at 614-645-5200.AlertMe<|endoftext|>The U.S. Department of Justice has filed a lawsuit against the city of Chicago for its failure to provide a safe environment for its police officers.The lawsuit, filed in federal court in Chicago on Wednesday, alleges that the city's police department has a "pattern or practice of violating the constitutional ri

# The rest of the samples

In [9]:
%%time 
x = benchmark('gpt2-large', n_words=500, texts=SAMPLE_INPUTS[3:], n_trials=10, top_k=10, temperature=0.1)
results.update(x)


gpt2-large with n_words=500
---------------------------------------
INPUT: We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.
---------------------------------------
---------------------------------------
INPUT: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.
---------------------------------------
---------------------------------------
INPUT: For today’s homework assignment, please describe the reasons for the US Civil War.
---------------------------------------
---------------------------------------
INPUT: John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous develop

# Second experiment: generate 800 tokens and keep results with more than 250 words after post-processing

We also changed the sampling parameters to `top_k=20, temperature=0.5`.
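
To get a feel for what the temperature does, here is a toy sketch of the `logits / temperature` step inside `add_n_tokens`. The logits are made-up numbers and `softmax_with_temperature` is an illustrative helper, not part of the notebook.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 1.0]  # made-up scores for three candidate tokens
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `temperature=0.1` almost all of the probability mass sits on the top candidate, so sampling is close to greedy decoding, which is one plausible reason for the loops seen in the first experiment. A value of `0.5` leaves more room for variety.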

In [10]:
def benchmark_v2(model_id, n_words, min_words=250, n_trials=10, top_k=40, temperature=0.1, texts=SAMPLE_INPUTS):
    print(f"{model_id} with n_words={n_words}\n=========================")
    
    tokenizer, model = get_tokenizer_and_model(model_id)
    results = {}
    
    for text in texts:
        results[text] = []
        print("---------------------------------------")
        print("INPUT: {}".format(text.replace('\n', '')))
        
        for trial in range(n_trials):
            tokens = stoi(tokenizer, text)
            tokens, generated = add_n_tokens(model, tokens, n_words, temperature=temperature, top_k=top_k)
            generated_text = postprocess(tokenizer.decode(generated))
            n_gen_words = len(generated_text.split(" "))
            results[text].append(generated_text)
            if n_gen_words > min_words:
                print("OUTPUT {:2d}: {}".format(trial+1, generated_text.replace('\n', '')))
                print("\n====\n")
        print("---------------------------------------")
    return results

In [12]:
%%time
results_2 = benchmark_v2('gpt2-large', n_words=800, texts=[ww2]+SAMPLE_INPUTS, n_trials=10, top_k=20, temperature=0.5)

gpt2-large with n_words=800
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT  1: The conflict wa

OUTPUT  7: The war also marked the end of the German Empire, the dissolution of the Soviet Union, the dissolution of the Warsaw Pact, and the end of the Cold War. The war also marked the end of the Second World War, which lasted from 1939 to 1945.The war was fought in Europe, the Middle East, Asia, and North Africa, and was the largest conflict since the Second World War. It was fought primarily in Europe, but also in Africa, the Middle East, and Asia. In the Middle East, the war was fought between the United Kingdom and the Axis powers of Germany, Italy, and Japan. In Asia, it was fought between the Soviet Union and Japan, and in North Africa, between the Axis Powers and the Allies. In North Africa, the war was fought between the Axis Powers and the Allies, while in the Middle East, it was fought between the Allies and the Axis Powers. The war also involved the United States, but in a more limited role than in the First World War. In the Pacific, it was fought between the Allies and t

OUTPUT  3: The findings were published in the Journal of Mammalogy.The researchers, from the University of California, Santa Cruz, and the Smithsonian National Zoo in Washington, D.C., spent months searching for the unicorns, which they believe were domesticated in the Andes Mountains."It was like a wild west," said lead author Dr. Jennifer A. Geller, a wildlife biologist at UCSC. "We had no idea what to expect."The researchers found the unicorns by chance, while conducting a survey of the Andes Mountains in the summer of 2012. They were looking for evidence of a species of giant, horned, antelope that had disappeared from the area."We were looking for a species that was rare and that was also very well-known," Geller said. "We didn't know what to expect."The team had been looking for the species for years, but were surprised to find the unicorns in their search."It was a surprise," Geller said."We were looking for a species that was rare and that was also very well-known."The research

OUTPUT 10: The discovery of the unicorns, which were named after the Latin word for "unicorn" and are native to the Andes Mountains, came as a shock to the researchers, who said the animals were not native to the region.The animals were discovered in the Santa Cruz region of Argentina, near the town of Arica, about 100 miles from the city of Santa Cruz.The scientists said the animals were likely descended from a group of wild animals that had been hunted to extinction by humans."The discovery of the unicorns is very important because it shows that the Andean population is still alive," said biologist and researcher Carlos Crespo, who led the research team."The animals are the first evidence of this population's existence in the Andes," he said.Crespo said the animals were likely descended from a group of wild animals that had been hunted to extinction by humans.The scientists said they had not been able to estimate the population size of the animals, but they said they estimate there a

---------------------------------------
---------------------------------------
INPUT: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.
OUTPUT  1: The "Wrecking Ball" singer was seen wearing a red T-shirt and black pants when she was caught shoplifting at the store, TMZ reports.A source told TMZ that Cyrus was "shocked" when she saw the police tape outside the store."She was shocked and confused," the source said.Cyrus was reportedly caught in the act of stealing a pair of sunglasses from the store, but did not have any money.The singer was reportedly taken to the police station and was then released without any charges.The singer was also reportedly wearing a black hoodie and black pants at the time of the incident.TMZ reports that Cyrus was not arrested for the incident.The singer's rep, however, confirmed to PEOPLE that the singer was taken to the police station and released without any charges."Miley was released without charges and did no

OUTPUT  4:  The singer was photographed by a security guard and was quickly detained.The singer was spotted in a black Adidas jacket and black Adidas sneakers, according to TMZ.A source told the site that the singer was "in and out of the store in a matter of minutes" and was "shocked" by the shoplifting.Cyrus was reportedly spotted entering the store at around 2:30 pm and leaving the store at around 4:15 pm.The singer was reportedly seen leaving the store with her arm around a man who was sitting on a bench outside the store.The source told TMZ that the singer was "shocked" by the shoplifting and was "shouting at the security guard" when she was detained.The source said that the singer was "shocked and confused" at the time and was "shaking and crying" after the incident.A source told TMZ that the singer was not arrested and that the incident was "completely unprovoked."Cyrus is currently in the middle of a promotional tour for her new album "The Miseducation of Lauryn Hill." She was 

---------------------------------------
---------------------------------------
INPUT: We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.
OUTPUT  2:  We have also trained a deep neural network called GPT-3, which is designed to be able to learn and generalize to new tasks, including text generation.We have also trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks. We have also trained a deep neural network called GPT-4, w

---------------------------------------
---------------------------------------
INPUT: For today’s homework assignment, please describe the reasons for the US Civil War.
---------------------------------------
---------------------------------------
INPUT: John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.
OUTPUT  3: "I am here tonight because I believe that our future is at stake, and that we must act now to protect it. I am here because I believe that the United States of America is the greatest nation on earth, because we have the greatest potential to make the world a better place. I am here because I believe that the United States of America is the only nation that can save the world. I am here 

---------------------------------------
---------------------------------------
INPUT: Recycling is good for the world.NO! YOU COULD NOT BE MORE WRONG!!
---------------------------------------
CPU times: user 8h 21min 22s, sys: 3h 58min 17s, total: 12h 19min 40s
Wall time: 12h 19min 24s
