# GPT-2 - Large - 2 - More friendly experiments

# This notebook

In this notebook we will run the recently released PyTorch port of GPT-2 large on the samples shown in the original blog post.

We originally ran an *out-of-the-box zero-shot experiment*, with not very impressive results, in [this](https://www.kaggle.com/julian3833/gpt2-large-774m-w-pytorch-not-that-impressive/) notebook (also [here](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20impressive.ipynb) on GitHub).


In this notebook we take the opposite approach, **making it easy for the model to show its best face**. We generate several samples per input, post-process the generated text to remove trivial problems, and manually pick the best ones.

We will discuss the results from that perspective later on.


## Experimentation framework

We will do the following:
* Run 10 trials per input with a fixed set of sampling parameters
* Remove repeated sentences with a post-processing filter
* Ignore results that are shorter than a threshold (33% of the requested length in the first experiment, an absolute minimum in the second one)
* Manually pick the best generated texts


## About the post-processing

Removing repeated sequences seems fair: the issue is trivially solvable and fixing it doesn't compromise the big picture. Repetition is a second-order problem compared to generating coherent text, since a simple post-filter removes it. Although this filter could run at generation time, we apply it _a posteriori_ for simplicity of implementation, even though that takes more time. We also remove the last, incomplete sentence.

A good example of applying this filter is the following output from the experiment with the WW2 paragraph. Notice how much the repetitions affect the perceived coherence:


#### Curated result

> The war was fought on two fronts: the Western Front (Western Europe) and the Eastern Front (Eastern Europe). The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled. The war was fought in Europe, the Middle East, Africa, and Asia. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. 

#### Original result

> The war was fought on two fronts: the Western Front (Western Europe) and the Eastern Front (Eastern Europe). The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled. The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled. The Western Front was the front that the Allies controlled, and the Eastern Front was the front that the Axis controlled.The war was fought in Europe, the Middle East, Africa, and Asia. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. 
The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were led by Germany, Italy, and Japan. The Allies and Axis were divided into two major factions: the Allies and the Axis. The Allies were led by the United States, the Soviet Union, and Great Britain. The Axis were


#### If you know a native deep-learning way of handling the repetitions, please comment!
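
One decoding-time heuristic that acts during sampling rather than afterwards is n-gram blocking: before drawing the next token, forbid any token that would complete an n-gram already present in the sequence. The sketch below is only an illustration and was not used in these experiments; `banned_next_tokens` and its parameter `n` are hypothetical names, and it works on plain lists of token ids so that it runs without a model.

```python
def banned_next_tokens(token_ids, n=3):
    """Return the token ids that would complete an n-gram already seen in token_ids.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    if len(token_ids) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram we are about to complete.
    prefix = tuple(token_ids[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(token_ids) - n + 1):
        # If an earlier window starts with the same prefix, its last token is banned.
        if tuple(token_ids[i:i + n - 1]) == prefix:
            banned.add(token_ids[i + n - 1])
    return banned


history = [1, 2, 3, 1, 2]
print(banned_next_tokens(history, n=3))  # → {3}: token 3 would repeat the trigram (1, 2, 3)
```

Inside a sampling loop such as `add_n_tokens`, the returned ids could be masked by setting their logits to a very negative value before the softmax, the same way `top_k_logits` masks everything outside the top k.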


# Results

In [1]:
import sys
import logging
logging.basicConfig(level=logging.CRITICAL) # Disable an annoying warning
sys.path.append("pytorch-transformers/")

import torch
import numpy as np
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch.nn.functional as F

SAMPLE_INPUTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.",
    "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.",
    "We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.",
    "Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.",
    "For today’s homework assignment, please describe the reasons for the US Civil War.",
    "John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.",
    "Recycling is good for the world.\n\nNO! YOU COULD NOT BE MORE WRONG!!"
    ]

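def fix_randomness(seed=None):
    # Draw the seed at call time; a default argument is evaluated only once, at definition
    if seed is None:
        seed = np.random.randint(1000)
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)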
    
def get_tokenizer_and_model(model_id):
    assert model_id in ['gpt2', 'gpt2-medium', 'gpt2-large']
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id).eval().to('cuda')
    return tokenizer, model

def stoi(tokenizer, text):
    indexed_tokens = tokenizer.encode(text)
    tokens = torch.tensor([indexed_tokens]).to('cuda')
    return tokens

def top_k_logits(logits, k):
    if k == 0:
        return logits
    values, _ = torch.topk(logits, k)
    min_values = values[:, -1, None]  # keep a trailing dim so the comparison broadcasts per row
    return torch.where(logits < min_values, torch.ones_like(logits, dtype=logits.dtype) * -1e10, logits)

def add_n_tokens(model, tokens, n_tokens, temperature=1.0, top_k=40):
    generated = []
    with torch.no_grad():
        for i in range(n_tokens):
            res = model(tokens)
            logits = res[0]
            logits = logits[:, -1, :] / temperature
            logits = top_k_logits(logits, k=top_k)
            probs = F.softmax(logits, dim=-1)  # probabilities, not log-probabilities
            new = torch.multinomial(probs, num_samples=1)
            
            tokens = torch.cat((tokens, new), dim=1)
            generated.append(new[0][0].item())
    return tokens, generated

def remove_repetitions(text):
    first_ocurrences = []
    for sentence in text.split("."):
        if sentence not in first_ocurrences:
            first_ocurrences.append(sentence)
    return '.'.join(first_ocurrences)

def trim_last_sentence(text):
    return text[:text.rfind(".")+1]

def postprocess(text):
    return trim_last_sentence(remove_repetitions(text))

def benchmark(model_id, n_words, n_trials=10, top_k=40, temperature=0.1, texts=SAMPLE_INPUTS):
    print(f"{model_id} with n_words={n_words}\n=========================")
    
    tokenizer, model = get_tokenizer_and_model(model_id)
    results = {}
    
    for text in texts:
        results[text] = []
        print("---------------------------------------")
        print("INPUT: {}".format(text.replace('\n', '')))
        
        for trial in range(n_trials):
            tokens = stoi(tokenizer, text)
            tokens, generated = add_n_tokens(model, tokens, n_words, temperature=temperature, top_k=top_k)
            generated_text = postprocess(tokenizer.decode(generated))
            n_gen_words = len(generated_text.split(" "))
            results[text].append(generated_text)
            if n_gen_words > 0.33*n_words:
                print("OUTPUT {:2d}: {}".format(trial+1, generated_text.replace('\n', '')))
                print("\n====\n")
        print("---------------------------------------")
    return results

In [2]:
# git cloning because the `pip` version of the library doesn't have the commit we need
!git clone https://github.com/huggingface/pytorch-transformers.git
!rm -rf ./pytorch-transformers/.git
fix_randomness()

fatal: destination path 'pytorch-transformers' already exists and is not an empty directory.


# GPT-2 large on the first paragraph of the Wikipedia article for World War II.

In this block we feed GPT-2 large the first paragraph of the Wikipedia article on World War II.
We generate 500 tokens per trial (the `n_words` argument counts tokens, not words) and run 10 trials.




In [3]:
%%time 
ww2 = """World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]"""
results_ww2 = benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=10, top_k=10, temperature=0.1)

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT  4: The war was fou

# English-speaking Unicorns

In [6]:
%%time 
results_unicorns = benchmark('gpt2-large', n_words=500, texts=[SAMPLE_INPUTS[0]], n_trials=10, top_k=10, temperature=0.1)

gpt2-large with n_words=500
---------------------------------------
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT  1: The researchers, from the University of California, Santa Cruz, and the University of Colorado, Boulder, were studying the behavior of the animals in the Andes Mountains when they stumbled upon the unicorns."We were surprised to find that the animals were able to communicate with each other in English," said lead author Dr. Michael W. Smith, a professor of ecology and evolutionary biology at UC Santa Cruz. "We were also surprised to find that the animals were able to communicate in English, which is not a common language in the Andes."The researchers found that the animals were able to communicate with each other in English because they had learned to use the language of the

# Train carriage and Miley Cyrus shoplifting

In [8]:
SAMPLE_INPUTS[1:3]

['A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.',
 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.']

In [10]:
results = {}
results.update(results_ww2)
results.update(results_unicorns)
results.keys()

dict_keys(["World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]", 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, pre

In [11]:
%%time 
x = benchmark('gpt2-large', n_words=500, texts=SAMPLE_INPUTS[1:3], n_trials=10, top_k=10, temperature=0.1)
results.update(x)

gpt2-large with n_words=500
---------------------------------------
INPUT: A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
OUTPUT  6: The theft occurred at about 4:30 p.m. in the area of North Main Street and East Market Street, according to the Cincinnati Police Department.The train was carrying a shipment of nuclear materials, including a fuel rod, a control rod and a control rod assembly, according to the Cincinnati Police Department.The train was stopped at the railroad crossing at East Market Street and North Main Street, police said.The train was loaded with a total of 1,000 pounds of nuclear material.The train was taken to a secure location for further investigation.The investigation is ongoing.The Cincinnati Police Department is asking anyone with information to call the department at 614-645-5200.AlertMe<|endoftext|>The U.S. Department of Justice has filed a lawsuit against the city of Chicago, alleging that

# The rest of the samples

In [15]:
%%time 
x = benchmark('gpt2-large', n_words=500, texts=SAMPLE_INPUTS[3:], n_trials=10, top_k=10, temperature=0.1)
results.update(x)


gpt2-large with n_words=500
---------------------------------------
INPUT: We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.
---------------------------------------
---------------------------------------
INPUT: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.
OUTPUT  1:  The orcs were not prepared for the onslaught of the two heroes. The battle raged for hours, and the orcs were forced to retreat.The two heroes then returned to the city of Minas Tirith, where they were greeted by the King of Gondor, Aragorn. Aragorn was pleased to see that the orcs had been defeated, and he offered them a place in his kingdom. The two heroes accepted, and Aragorn gave them th

# Second experiment: generate 800 words and keep results with at least 250 words after post-processing.

We also changed the sampling parameters to `top_k=20, temperature=0.5`.
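
To get a feel for what these two parameters do, here is a small sketch on a toy set of logits. It mirrors the order of operations in `add_n_tokens` (divide by the temperature, keep the top k, apply softmax), but it is written in plain Python; `sampling_distribution` and the toy logits are assumptions made for illustration, not values taken from the model.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=0):
    """Return the probabilities a token would be sampled with (illustrative helper)."""
    # Lower temperatures stretch the gaps between logits, sharpening the distribution.
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        # Keep only the k largest logits; everything else gets zero probability.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical values
for t in (0.1, 0.5, 1.0):
    print(t, [round(p, 3) for p in sampling_distribution(toy_logits, temperature=t)])
```

With `temperature=0.1` almost all of the probability mass lands on the single most likely token, so sampling behaves nearly greedily. Raising it to `0.5` leaves noticeably more room for the alternatives, and a larger `top_k` widens the pool they are drawn from.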

In [17]:
def benchmark_v2(model_id, n_words, min_words=250, n_trials=10, top_k=40, temperature=0.1, texts=SAMPLE_INPUTS):
    print(f"{model_id} with n_words={n_words}\n=========================")
    
    tokenizer, model = get_tokenizer_and_model(model_id)
    results = {}
    
    for text in texts:
        results[text] = []
        print("---------------------------------------")
        print("INPUT: {}".format(text.replace('\n', '')))
        
        for trial in range(n_trials):
            tokens = stoi(tokenizer, text)
            tokens, generated = add_n_tokens(model, tokens, n_words, temperature=temperature, top_k=top_k)
            generated_text = postprocess(tokenizer.decode(generated))
            n_gen_words = len(generated_text.split(" "))
            results[text].append(generated_text)
            if n_gen_words > min_words:
                print("OUTPUT {:2d}: {}".format(trial+1, generated_text.replace('\n', '')))
                print("\n====\n")
        print("---------------------------------------")
    return results

In [18]:
%%time
results_2 = benchmark_v2('gpt2-large', n_words=800, texts=[ww2]+SAMPLE_INPUTS, n_trials=10, top_k=20, temperature=0.5)

gpt2-large with n_words=800
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
OUTPUT  2: The war began i

OUTPUT  5: The war was fought in Europe, Asia, and the Middle East. The war began on June 20, 1939, when Germany invaded the Soviet Union. The German invasion of the Soviet Union was led by Adolf Hitler and his SS, the Nazi Party, and the Luftwaffe. The invasion was a surprise to the Soviet Union, which had been preparing for an invasion of Germany for the past eight months. The Soviet Union's response was to attack the German invasion force, and to launch a massive offensive against the German army. The Soviet Union's offensive was successful, and the German army retreated to the Eastern Front, where the war was fought for the next three years.The war was a major turning point in world history. It was the first major conflict in which the United States was involved, and it marked the beginning of the Cold War. The war was also a turning point in the history of the United States, as it marked the beginning of the decline of the American empire and the beginning of the rise of the Unite

---------------------------------------
---------------------------------------
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
OUTPUT  1: The study was conducted by the University of Colorado, Boulder, and published in the journal PLOS One. The researchers found that the unicorns had been living in the region for about 20,000 years.The researchers are not the first to find a unicorn herd in the Andes Mountains, as the phenomenon has been documented in the region for years. However, this is the first time the phenomenon has been documented in the Andes Mountains.The researchers spent five years studying the area, and discovered that the unicorns had been living in an area known as the Huayhuash Valley, which is located in the middle of the Andes Mountains.The researchers were able to observe the uni

OUTPUT  6: "It's a little bit like a unicorn, but not quite," one of the researchers, Dr. David R. G. Rood, told CNN.The researchers said that the unicorns were not domesticated, but were a natural phenomenon that had been around for thousands of years.The animals were discovered in the Andean region of Bolivia, which is about 2,000 miles west of the equator. The researchers were able to travel to the valley because the animals are not found in other areas of the Andes.The researchers found the animals in a valley that is known as the Cajamarca Valley, which is about 2,500 miles northwest of the capital, La Paz."We can see the animals as they are walking along the valley floor," Rood said. "We can see the animals as they are eating grasses, eating leaves, and eating leaves and grasses. We can see them as they are feeding on the leaves of the trees."The animals are so far removed from any human culture that they don't speak English."They are not domesticated, they are not even close to 

OUTPUT 10: The study was conducted by a team of scientists from the University of California, Santa Cruz, and the Smithsonian Institution. They believe the unicorns are descendants of a herd of unicorns that were once found in the Andes Mountains. The researchers believe the unicorns were brought to the valley by a group of Spanish conquistadors.The group of Spanish conquistadors brought the unicorns to the valley in 1616, where they lived in peace. However, in the 1930s, the Spanish government began to hunt down the unicorns. The scientists were able to track down the last surviving unicorn in the valley, and brought it back to the U.S.The scientists also found a herd of other unicorns in the valley, which they believe are descendants of the original herd."It's a very exciting finding, and it's a very rare find," said study researcher Dr. Brian Switek, a professor at the University of California, Santa Cruz, in a statement. "There's never been anything like this in the Andes."The scie

OUTPUT  5: The theft took place at around 2:30 p.m. at the Amtrak train station in the city's East End, according to the Cincinnati Police Department.Police said the stolen carriage contained a nuclear material.No arrests have been made.The train was carrying a shipment of nuclear material, according to the Cincinnati Police Department.The train was headed to Pittsburgh, according to the police department.The stolen train was being transported to Pittsburgh.The train is being investigated by the FBI and the U.S. Department of Energy.Cincinnati Mayor John Cranley said in a statement that the incident is "an alarming reminder of the dangers posed by nuclear materials.""We are working closely with law enforcement officials to apprehend those responsible for this crime," Cranley said.Police said the train was carrying a shipment of nuclear material.The FBI is assisting with the investigation.<|endoftext|>If you have been following the news, you know that the United States has been in a sta

OUTPUT  7: The singer was spotted by a security guard at the store wearing a black hoodie and black pants.The singer was seen in the store at around 1pm when she was spotted carrying a large bag of clothes in her hand.The singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clothes in her handThe singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clothes in her handThe singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clothes in her handThe singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clothes in her handThe singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clothes in her handThe singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clothes in her handThe singer was spotted in the store at around 1pm when she was spotted carrying a large bag of clo

OUTPUT  6: The model is trained using a large corpus of documents, which are generated by a large corpus of natural language processing tasks. These tasks are commonly used by researchers, and the model's output is used to generate a large corpus of documents from which the model can be trained. The model can be used to generate a corpus of documents with a number of different text types, such as news articles, news stories, and news headlines, as well as to generate a corpus of documents with a number of different document types, such as news stories, news headlines, and news documents. GPT-2 is trained on a large corpus of texts, and is used to generate a large corpus of documents from which the model can be trained.The model is trained on a large corpus of texts, and is used to generate a large corpus of documents from which the model can be trained. The model is trained on a large corpus of texts, and is used to generate a large corpus of documents from which the model can be train

OUTPUT 10:  If you are not sure, please read the question and answer at http://www.theatlantic.com/history/archive/2010/01/the-civil-war-and-the-american-republic/272536/.This essay was written by the late David McCullough.<|endoftext|>The UESPWiki – Your source for The Elder Scrolls since 1995This page is currently being rewritten as part of the Morrowind Overhaul Project.The page is being rewritten and checked in several stages. All users are welcome to make changes to the page. If you make a change that is relevant to the project, please update this template accordingly, and make sure you have observed the project guidelines.Detail Walkthrough: not writtenInterior Images:added by Jeancey, checked by ChezburgarExterior Images:added by Chezburgar, checked by ChezburgarDaedric Ruins:Daedric Shrine( ) # of Zones 1 Occupants Daedra, Daedra Lords Console Location Code(s) Daedric Shrine, [2,2]Daedric Shrine, [2,2] Region Mournhold, Morrowind, [2,2]Daedric ShrineThe Daedric Shrine is the ma

OUTPUT 10: I am John F. Kennedy, President of the United States of America, and I pledge to you today that I will never, ever, ever let you down.I have been asked many times today to speak about the things that make America great. I will do so today. But first, I want to say a few words about the man who was my friend and mentor.I met John F. Kennedy when I was in college. I was a sophomore at the University of Texas. I remember sitting in the library with two of my best friends, Jerry and Joe, and we were talking about the election of John F. Kennedy as President of the United States. Jerry and Joe had been active in the student government, and they were both Democrats.I remember how excited I was. Jerry had a big smile on his face. Joe was a little more reserved, but he was still smiling. I could tell that he was proud of his friend.And I remember how excited I was when I learned that John Kennedy had been shot through the head.I remember the next day when I got to the hospital. I wa