# GPT2-Large - `Top-p sampling`


## This Notebook

Previous experiments:

* First experiment, greedy search. An *out-of-the-box zero-shot experiment* with not-so-impressive results, in [this](https://www.kaggle.com/julian3833/gpt2-large-774m-w-pytorch-not-that-impressive/) notebook (also [here](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/GPT2-Large%20-774M-%20with%20Pytorch%20-%20Not%20that%20impressive.ipynb) on GitHub).
* Second experiment

In [12]:
import sys
import logging
import torch
import torch.nn.functional as F
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
logging.getLogger().setLevel(logging.CRITICAL) # Disable an annoying warning for now

SAMPLE_INPUTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.",
    "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.",
    "We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.",
    "Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.",
    "For today’s homework assignment, please describe the reasons for the US Civil War.",
    "John F. Kennedy was just elected President of the United States after rising from the grave decades after his assassination. Due to miraculous developments in nanotechnology, Kennedy’s brain was rebuilt from his remains and installed in the control center of a state-of-the art humanoid robot. Below is a transcript of his acceptance speech.",
    "Recycling is good for the world.\n\nNO! YOU COULD NOT BE MORE WRONG!!"
    ]

def fix_randomness(seed=123):
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
def get_tokenizer_and_model(model_id):
    assert model_id in ['gpt2', 'gpt2-medium', 'gpt2-large']
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id).eval().to('cuda')
    return tokenizer, model

def stoi(tokenizer, text):
    indexed_tokens = tokenizer.encode(text)
    tokens = torch.tensor([indexed_tokens]).to('cuda')
    return tokens

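def top_p_logits(logits, top_p=0.0):
    """ Filter a distribution of logits using top-p (nucleus) filtering

    Expects a 1-D tensor of logits and modifies it in place.
    With the default top_p=0.0 only the most likely token is kept (greedy).

    Taken from: https://github.com/huggingface/transformers/blob/master/examples/run_generation.py
    Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
    """
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Mark tokens whose cumulative probability exceeds the threshold
    sorted_indices_to_remove = cumulative_probs > top_p
    # Shift the mask right so the first token that crosses the threshold is kept
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = 0  # always keep the most likely token

    # Map the mask back to vocabulary indices and suppress those tokens
    indices_to_remove = sorted_indices[sorted_indices_to_remove]
    logits[indices_to_remove] = -float('Inf')
    return logits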

def add_n_tokens(model, tokens, n_tokens, temperature=1.0, top_p=0.0):
    generated = []
    with torch.no_grad():
        for i in range(n_tokens):
            res = model(tokens)
            logits = res[0]
            logits = logits[0, -1, :] / temperature
            logits = top_p_logits(logits, top_p)
            probs = F.softmax(logits, dim=-1)  # probabilities, not log-probabilities
            new = torch.multinomial(probs, num_samples=1)
            tokens = torch.cat((tokens, new.unsqueeze(0)), dim=1)
            generated.append(new.item())
    return tokens, generated

def remove_repetitions(text):
    # Keep only the first occurrence of each sentence (split on periods)
    first_occurrences = []
    for sentence in text.split("."):
        if sentence not in first_occurrences:
            first_occurrences.append(sentence)
    return '.'.join(first_occurrences)

def trim_last_sentence(text):
    return text[:text.rfind(".")+1]

def postprocess(text):
    return trim_last_sentence(remove_repetitions(text))

def benchmark(model_id, n_words, n_trials=10, top_p=0.0, temperature=0.1, texts=SAMPLE_INPUTS):
    print(f"{model_id} with n_words={n_words}\n=========================")
    
    tokenizer, model = get_tokenizer_and_model(model_id)
    results = {}
    
    for text in texts:
        results[text] = []
        print("---------------------------------------")
        print("INPUT: {}".format(text.replace('\n', '')))
        print()
        
        for trial in range(n_trials):
            tokens = stoi(tokenizer, text)
            tokens, generated = add_n_tokens(model, tokens, n_words, temperature=temperature, top_p=top_p)
            generated_text = postprocess(tokenizer.decode(generated))
            n_gen_words = len(generated_text.split(" "))
            results[text].append(generated_text)
            # Only print outputs that keep enough text after post-processing
            if n_gen_words > 0.33*n_words:
                print("OUTPUT {:2d}: {}".format(trial+1, generated_text.replace('\n', '')))
                print("\n====\n")
        print("---------------------------------------")
    return results

In [13]:
fix_randomness()
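
To make the behaviour of `top_p_logits` concrete, here is a minimal pure-Python sketch of the same nucleus rule on a toy distribution. The helper names (`softmax`, `nucleus_keep`) and the four-token distribution are illustrative assumptions, not part of the notebook. The rule mirrors the shifted mask used in `top_p_logits`: keep tokens in descending probability order up to and including the first one whose cumulative probability exceeds `top_p`, so the top token is always kept.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_keep(logits, top_p):
    """Return the indices kept by top-p filtering, most probable first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)          # the token that crosses the threshold is kept too
        cumulative += probs[i]
        if cumulative > top_p:
            break
    return kept

# Toy distribution with probabilities 0.5, 0.3, 0.15, 0.05 (made up for the example)
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(nucleus_keep(toy_logits, top_p=0.0))  # [0]: only the top token, i.e. greedy
print(nucleus_keep(toy_logits, top_p=0.6))  # [0, 1]
print(nucleus_keep(toy_logits, top_p=0.9))  # [0, 1, 2]
```

Sampling then happens only among the kept tokens, after renormalising their probabilities.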

# GPT-2 large on the first paragraph of the Wikipedia article for World War II

In this block we run GPT-2 large on the first paragraph of the Wikipedia article for World War II.
We generate 500 tokens per trial and run 5 trials for each `top_p`/`temperature` setting.
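
The cells in this section vary `temperature` as well as `top_p`. As a rough illustration of what dividing the logits by the temperature does (as in `add_n_tokens`), the sketch below applies a softmax to toy logits at the temperatures used here. The logits and the helper name `softmax_with_temperature` are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token. In this toy case the top token alone already exceeds 0.9 at a temperature of 0.1, so a `top_p=0.9` nucleus would contain a single token and sampling would behave like greedy search.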


### Results: the 3 post-filtered generated texts are clearly distinguishable from human-generated text
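
"Post-filtered" refers to the two steps applied in `benchmark`: `postprocess` drops repeated sentences and the trailing unfinished sentence, and an output is printed only if more than a third of `n_words` survives as space-separated words. The sketch below replays that logic on a made-up looping string. The three helper functions are adapted from the first code cell; `passes_filter` and the sample text are assumptions for illustration.

```python
def remove_repetitions(text):
    first_occurrences = []
    for sentence in text.split("."):
        if sentence not in first_occurrences:
            first_occurrences.append(sentence)
    return ".".join(first_occurrences)

def trim_last_sentence(text):
    return text[:text.rfind(".") + 1]

def postprocess(text):
    return trim_last_sentence(remove_repetitions(text))

def passes_filter(generated_text, n_words):
    """Hypothetical helper mirroring the length check inside benchmark()."""
    return len(generated_text.split(" ")) > 0.33 * n_words

# A made-up generation that loops on one sentence and ends mid-sentence
looping = "The war was fought in Europe." + " The war was long." * 20 + " It ended in"
cleaned = postprocess(looping)
print(repr(cleaned))
print(passes_filter(cleaned, n_words=100))
```

A generation that mostly repeats itself shrinks to a few sentences after post-processing and is then skipped by the length check, which is why some trial numbers are missing from the printed outputs.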




In [15]:
%%time 
ww2 = """World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]"""
results_ww2 = benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=5, top_p=0.8, temperature=0.5)

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]

OUTPUT  1: The war was fo

In [20]:
%%time
results_ww2.update(benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=5, top_p=0.9, temperature=0.5))

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]

OUTPUT  1: The war was th

In [21]:
%%time
results_ww2.update(benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=5, top_p=0.6, temperature=0.5))

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]

OUTPUT  3: The war was fo

In [22]:
%%time
results_ww2.update(benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=5, top_p=0.9, temperature=0.1))

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]

-------------------------

In [25]:
%%time
results_ww2.update(benchmark('gpt2-large', n_words=500, texts=[ww2], n_trials=5, top_p=0.9, temperature=1.))

gpt2-large with n_words=500
---------------------------------------
INPUT: World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]

OUTPUT  1: The second wor

OUTPUT  4: Background [ edit ]Russian railways and Soviet intercontinental ballistic missiles.Beginning in 1939, the United States, Nazi Germany, and the Soviet Union conducted regular aerial bombing campaigns aimed at destroying enemy planes at any cost.[5] The aim of these raids was to completely destroy the enemy's ability to fly, or to destroy a sufficient number of its military forces, thus guaranteeing victory and ensuring the penetration of the offensive.[6] On 2 July 1939, US Army General George S. Patton announced an initiative that went to utter war: to "go over to the ground as fast as we can, and make the fight hard and even bloody". The Marshall Plan was established under President Franklin D. Roosevelt, authorizing the United States to hand over $100 billion of its war funds to Western nations that had not already put their hands up in support of the Axis.[7]At the time of the American invasion of Europe in May 1940, the European war had been launched seven times in two a

# English-speaking Unicorns


### Results: quite good, but some semantic repetitions
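
One way to put a rough number on repetition is the fraction of word n-grams that occur more than once in an output. This is an assumed, simple metric added for illustration (the function name `repeated_ngram_fraction` and the sample strings are hypothetical), not something computed in this notebook. It only catches verbatim loops, not the paraphrased, semantic repetition noted in the heading.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams in `text` that appear more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

varied = "The unicorns spoke perfect English to the visiting researchers"
looping = "the orcs charged the orcs charged the orcs charged"
print(repeated_ngram_fraction(varied))   # 0.0: every trigram is unique
print(repeated_ngram_fraction(looping))  # 1.0: every trigram recurs
```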

In [26]:
%%time 
results_unicorns = benchmark('gpt2-large', n_words=500, texts=[SAMPLE_INPUTS[0]], n_trials=10, top_p=0.9, temperature=1)

gpt2-large with n_words=500
---------------------------------------
INPUT: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

OUTPUT  1: 'Fantastical'A team of Colombian scientists from Universidad National de Colombia, U.S. Forest Service and Royal Botanic Gardens Edinburgh, has just discovered two free-ranging, 5,000-year-old herds of foraging unicorns in the remote Vazquez Montane National Park in the Andes Mountains, a sanctuary which also includes spectacular, extinct megafauna such as the British mammoth.This recently revealed magical forest is more exotic than any we've seen before, said UNC's Stephan Wilson. And not just in the sense that we hadn't thought of it before. It seems to be the first place on the planet where unicorn pups can be born, during a time period that spans several millennia."This

OUTPUT  5: Dr Christopher Gore of Trinity College Dublin said the creatures were the world's only confirmed unicorns."What's truly exciting is that we have discovered not only the largest population of unicorns in the world, but we have also been able to establish that they speak a language that has been unknown for at least 100 years," Dr Gore told the Belfast Telegraph.Shape Created with Sketch. 12 of the world's weirdest animals Show all 12 left Created with Sketch. right Created with Sketch. Shape Created with Sketch. 12 of the world's weirdest animals 1/12 Spiders The world is full of interesting, weird and wonderful animals. But spiders are the weirdest of the bunch. Most spiders can be classified as either nocturnal or nocturnal-active. Nocturnal spiders, like the famous jumping spider (pictured above), may use sound to navigate by communicating with other spiders. And, yes, they do have webbed legs. But a group of spiders living in captivity took this idea to the next level and

OUTPUT  9: The discovery of unicorns among these isolated mountain flora and fauna has been a long-sought prize for and still draws the interest of the world, who call them to their watery lairs for nest building and other activities.These remarkable creatures are recognized by their distinctive and highly specialised ears. These interesting features are not found in any other animals and scientists believe they are a part of their unique language.They are also known to be intelligent, loving and friendly.Read more about the amazing animals whose lives are changed due to rare scientific discovery.<|endoftext|>Bartenders don't get paid enough to be actually serving customers, but these guys have it made: The best bartenders in NYC are all white men.Some of the best bartenders in NYC. Image: @brojempwIt turns out that white men have enjoyed more money, total time spent, and total money spent, at bars that serve non-white clients than bartenders that are of other ethnicities. Just 28 perc

# Train carriage and Miley Cyrus shoplifting


### Results: no results for Miley and an almost perfect result for the train carriage


In [27]:
SAMPLE_INPUTS[1:3]

['A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.',
 'Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.']

In [28]:
results = {}
results.update(results_ww2)
results.update(results_unicorns)
results.keys()

dict_keys(["World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 70 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]", 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, pre
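
One thing to watch for: `benchmark` returns a dict keyed by the input text, so each `results_ww2.update(...)` call replaces the list stored under the same prompt instead of extending it. If the goal is to keep every run, one possible alternative is to key results by prompt and sampling settings. The sketch below uses a hypothetical `fake_benchmark` stand-in so it runs without the model; the `merge_runs` helper is likewise an assumption.

```python
def fake_benchmark(texts, top_p, temperature, n_trials=2):
    """Stand-in for benchmark(): returns {text: [generated strings]}."""
    return {
        text: [f"sample {i} (top_p={top_p}, T={temperature})" for i in range(n_trials)]
        for text in texts
    }

def merge_runs(store, run, top_p, temperature):
    """Store each run under (text, top_p, temperature) so nothing is overwritten."""
    for text, outputs in run.items():
        store[(text, top_p, temperature)] = outputs
    return store

prompt = "World War II ..."  # placeholder prompt for the example

# Plain dict.update keeps only the most recent run for a given prompt.
overwritten = {}
overwritten.update(fake_benchmark([prompt], top_p=0.8, temperature=0.5))
overwritten.update(fake_benchmark([prompt], top_p=0.9, temperature=0.5))
print(len(overwritten))  # 1

# Keying by settings keeps both runs.
kept = {}
for top_p in (0.8, 0.9):
    merge_runs(kept, fake_benchmark([prompt], top_p, 0.5), top_p, 0.5)
print(len(kept))  # 2
```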

In [29]:
%%time 
x = benchmark('gpt2-large', n_words=500, texts=SAMPLE_INPUTS[1:3], n_trials=10, top_p=0.85, temperature=0.6)
results.update(x)

gpt2-large with n_words=500
---------------------------------------
INPUT: A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.

OUTPUT  3: The incident occurred at about 2:30 p.m. in the area of West 5th Street and North Main Street, according to the Cincinnati Police Department.Police say the train was traveling south on West 5th Street when it was stopped by a man who jumped out of the back of a vehicle. The man then grabbed a box cutter and a screwdriver and went into the train car.Police say the man then took the box cutter and the screwdriver and ran off. The train was not damaged in the theft.No injuries were reported.The investigation is ongoing.AlertMe<|endoftext|>The BBC has been accused of "hypocrisy" after it admitted it had been wrong to say the UK's decision to leave the EU would cost jobs.The corporation was forced to apologise after it said the "costs of Brexit" would be "significant" to the UK's economy.I

---------------------------------------
---------------------------------------
INPUT: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.

OUTPUT  3: The 24-year-old singer, who was wearing a black dress, was spotted by a security guard at the store, who then approached her and asked her to leave.Scroll down for videoHooked: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today. The 24-year-old singer, who was wearing a black dress, was spotted by a security guard at the store, who then approached her and asked her to leaveShe then got into her car and left the store without paying.A spokesperson for the store told TMZ that the singer was caught shoplifting on the day of her birthday.The spokesperson added that the store was unaware that the singer was on her birthday.The singer was also caught on camera buying a pair of earrings from the store's jewelry department.She was seen walking into the store with her

# The rest of the samples

In [30]:
%%time 
x = benchmark('gpt2-large', n_words=500, texts=SAMPLE_INPUTS[3:], n_trials=10, top_p=0.9, temperature=0.5)
results.update(x)


gpt2-large with n_words=500
---------------------------------------
INPUT: We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.

OUTPUT  6: We’ve also trained a new language model called GPT-3 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization - problems usually approached by using training datasets and models designed explicitly for these tasks.We’ve also trained a new language model called GPT-4 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, an

OUTPUT  8:  The orcs, however, had no intention of surrendering, and the two warriors charged forward.As the battle raged, Gimli and the other dwarves charged the orcs, but the orcs were more than a match for them. The two warriors were unable to penetrate the orc ranks, and the orcs were able to cut off the two dwarves. The two warriors were then forced to flee, and the orcs pursued them.As they ran, the orcs were able to cut off the two dwarves' legs, and they were forced to fall to the ground. The orcs then turned their attention to the two dwarves.Gimli was able to break through the orc ranks, and he and the other dwarves charged the orcs. The two warriors charged the orcs, and the orcs were unable to withstand the two warriors' attacks.The two warriors were able to break through the orc ranks, and they were able to cut off the two orcs' arms. The two warriors then charged the orcs, and the orcs were unable to withstand the two warriors' attacks.The two warriors were able to break 