# **ELIZA** on Steroids: Text Generation with GPT-2

One of the most exciting and challenging topics in artificial intelligence is the generation of text.

Currently there is GPT-2, a language model that generates artificial text so convincing that OpenAI has [released](https://openai.com/blog/better-language-models/) it only in a "slimmed-down" version.

With this language model, only a few lines of code are needed to continue an English text.

In [3]:
import numpy as np
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from random import choice

In [23]:
# Load tokenizer and model
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

I1102 22:50:51.707658 139965096539968 tokenization_utils.py:373] loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /home/jupyter/.cache/torch/transformers/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
I1102 22:50:51.710873 139965096539968 tokenization_utils.py:373] loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /home/jupyter/.cache/torch/transformers/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
I1102 22:50:52.358298 139965096539968 configuration_utils.py:151] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at /home/jupyter/.cache/torch/transformers/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d2

In [98]:
def get_pred(text, model, tok, p=0.7):
    """ Continue text by one token using model.
    Pick at random from the most probable tokens whose cumulative probability just exceeds p.
    """
    
    # Split the text into tokens and build the input tensor (batch size 1)
    input_ids = torch.tensor(tok.encode(text)).unsqueeze(0)
    
    # Apply the model and compute the probability of each candidate for the next token
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_ids)[0][:, -1]
    probs = F.softmax(logits, dim=-1).squeeze()
    
    # Sort candidates by descending probability
    idxs = torch.argsort(probs, descending=True)
    res, cumsum = [], 0.
    
    # Collect candidates until their cumulative probability exceeds p
    for idx in idxs:
        res.append(idx)
        cumsum += probs[idx]
        if cumsum > p:
            break
    
    # Pick one candidate at random (uniformly, not weighted by probability).
    # Doing this after the loop also covers p >= 1, where the threshold may never be exceeded.
    pred_idx = choice(res)
    
    # Convert the result back to text
    pred = tok.convert_ids_to_tokens(int(pred_idx))
    return tok.convert_tokens_to_string([pred])


def generate_continuation(text):
    """ Continue text until an end-of-text token is generated or the text contains five periods (a rough proxy for five sentences). """
    text = text.replace("\n", " ")
    print(text)

    while text.count(".") < 5:
        res = get_pred(text, model, tok)
        if res == "<|endoftext|>":
            break
        print(res, end="")
        text += res
    return text

generate_continuation("This app is pretty")

This app is pretty
 simple and there's nothing I want to write here about other than that I just like what they're doing here and this is really what they're trying to accomplish.

"This app is pretty simple and there's nothing I want to write here about other than that I just like what they're doing here and this is really what they're trying to accomplish."

In [93]:
text = """In a shocking finding, scientist discovered a herd of unicorns living in a remote, 
previously unexplored valley, in the Andes Mountains. 
Even more surprising to the researchers was the fact that the unicorns spoke perfect English."""

generate_continuation(text)

In a shocking finding, scientist discovered a herd of unicorns living in a remote,  previously unexplored valley, in the Andes Mountains.  Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
 "That was really exciting to see that there were hundreds of different species living in this remote part of the world," explains Christopher Matthews, one of the scientists involved in the research. "It really gives you a feeling that these little things could live out in nature for many, many, many, many years."

 And of course, not all unicorns have unique uses.

'In a shocking finding, scientist discovered a herd of unicorns living in a remote,  previously unexplored valley, in the Andes Mountains.  Even more surprising to the researchers was the fact that the unicorns spoke perfect English. "That was really exciting to see that there were hundreds of different species living in this remote part of the world," explains Christopher Matthews, one of the scientists involved in the research. "It really gives you a feeling that these little things could live out in nature for many, many, many, many years."\n\n And of course, not all unicorns have unique uses.'
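
The candidate selection inside `get_pred` can be tried out without loading the model. The following is a minimal sketch in plain Python that mirrors the same stopping rule on a toy distribution. The helper name `top_p_candidates`, the toy vocabulary, and the probabilities are assumptions made up for illustration; they do not come from GPT-2.

```python
import random


def top_p_candidates(probs, p=0.7):
    """Return the indices of the most probable entries whose cumulative
    probability first exceeds p (the same stopping rule as in get_pred)."""
    # Indices sorted by descending probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumsum = [], 0.0
    for i in order:
        chosen.append(i)
        cumsum += probs[i]
        if cumsum > p:
            break
    return chosen


# Hypothetical toy vocabulary and probabilities, for illustration only
vocab = [" simple", " good", " bad", " purple"]
probs = [0.5, 0.3, 0.15, 0.05]

candidates = top_p_candidates(probs, p=0.7)
print([vocab[i] for i in candidates])    # [' simple', ' good']
print(vocab[random.choice(candidates)])  # uniform pick among the candidates
```

With `p=0.7`, the first token alone (0.5) is not enough, so the second one is added and the loop stops at a cumulative probability of 0.8. Unlikely tokens such as `" purple"` never become candidates.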

In [95]:
text = """I guess the problem is that you need to start a separate thread for each
connection and call serverSocket.accept() in a loop to accept more than one connection.
It is not a problem to have more than"""

generate_continuation(text)

I guess the problem is that you need to start a separate thread for each connection and call serverSocket.accept() in a loop to accept more than one connection. It is not a problem to have more than
 one thread running. However, some implementations that implement that provide methods that only accept incoming connections will fail because their data doesn't actually pass. In other cases, such as multi-connection communications, that implementation should create separate queues.

"I guess the problem is that you need to start a separate thread for each connection and call serverSocket.accept() in a loop to accept more than one connection. It is not a problem to have more than one thread running. However, some implementations that implement that provide methods that only accept incoming connections will fail because their data doesn't actually pass. In other cases, such as multi-connection communications, that implementation should create separate queues."

In [63]:
text = """As the above samples show, our model is capable of generating samples from a variety of 
prompts that feel close to human quality and show coherence over a page or more of text. 
Nevertheless"""

text = text.replace("\n", " ")

while text.count(".") < 5:
    text += get_pred(text, model, tok)

text

'As the above samples show, our model is capable of generating samples from a variety of  prompts that feel close to human quality and show coherence over a page or more of text.  Nevertheless, we are not satisfied with our answer to the second question of what does each background pose. This begs the question of how an ensemble can form one specific individual context or interpret its sounds. It has become an obvious problem that every time a message in the future "knew what he signed up for" , there are patterns emerging of increased alarm to expectations and habituation of these auditory cues, regardless of how an individual contexts themselves do not affect these things. These data are summarized below, summarizing the complex associations observed among one person\'s name and track as seen through various radio displays, phonograph radios, t-shirts, balloons, televisions, YouTube, social media, laptops, digital radio (indicator boards), notebook paper, cards, old magazine posters,
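
A loop of the form `while text.count(".") < 5` only stops once enough periods have been generated, so a continuation that rarely produces a period can run for a long time. One possible safeguard is an upper bound on the number of generated tokens. The sketch below shows the idea with a stand-in token function; `generate_until`, `max_tokens`, and `dummy_next_token` are hypothetical names introduced here and are not part of the notebook.

```python
def generate_until(next_token, text, max_sentences=5, max_tokens=200,
                   eos="<|endoftext|>"):
    """Append tokens until enough periods appear, the end-of-text marker
    is produced, or max_tokens tokens have been added."""
    for _ in range(max_tokens):
        if text.count(".") >= max_sentences:
            break
        token = next_token(text)
        if token == eos:
            break
        text += token
    return text


# Stand-in for get_pred: a hypothetical generator that never emits a period
def dummy_next_token(text):
    return " la"


result = generate_until(dummy_next_token, "Sing:", max_tokens=3)
print(result)  # Sing: la la la
```

In the notebook, the stand-in could be replaced by something like `lambda t: get_pred(t, model, tok)`, assuming `model` and `tok` are loaded as above.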

In [58]:
text = "I am"
while text.count(".") < 2:
    text += get_pred(text, model, tok, 0.8)

text

'I am committed to work hard in everything we do. While working at Max here I realize we have lost one step here," Klicker added.'

In [40]:
text = "You are"
while text.count(".") < 2:
    text += get_pred(text, model, tok, 0.8)

text

'You are prohibited from changing or destroying documents in the vehicle at a rally in Kansas. By being a law-abiding citizen and owning a valid concealed handgun license you do not hereby warrant any type of evidence or expert testimony about this case to any FBI or law enforcement agency in Missouri."'
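
The short prompts in this part of the notebook call `get_pred` with different values of `p` (0.8 and 0.6 instead of the default 0.7). A larger `p` admits more candidates and tends to make the output more varied. The sketch below counts the candidates for a made-up distribution and also shows a variant that samples in proportion to probability instead of uniformly. This weighted variant is an alternative shown for comparison, not what `get_pred` does, and the names `nucleus` and `sample_weighted` as well as the numbers are assumptions.

```python
import random


def nucleus(probs, p):
    """Indices of the most probable entries whose cumulative
    probability first exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumsum = [], 0.0
    for i in order:
        chosen.append(i)
        cumsum += probs[i]
        if cumsum > p:
            break
    return chosen


def sample_weighted(probs, p, rng=random):
    """Pick a candidate in proportion to its probability
    (get_pred picks uniformly instead)."""
    chosen = nucleus(probs, p)
    weights = [probs[i] for i in chosen]
    return rng.choices(chosen, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only
probs = [0.45, 0.2, 0.1, 0.1, 0.08, 0.07]

for p in (0.6, 0.7, 0.8):
    print(p, len(nucleus(probs, p)))  # candidate counts: 2, 3, 4

print(sample_weighted(probs, 0.8))  # index of one sampled candidate
```

With uniform choice, the fourth candidate at `p=0.8` is as likely to be picked as the first. With weighted sampling, the first candidate is picked 4.5 times as often as the fourth.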

In [48]:
text = "Dad is"
while text.count(".") < 2:
    text += get_pred(text, model, tok, 0.6)
    
text

"Dad is no stranger to building dynamic muscle, and the season's in no small part because of that. He'll pull up in the first couple of games, so his workouts will be limited, but if you don't feel he's in a rhythm he'll definitely take on that load."