In [2]:
# install the requirements
# !pip install torch torchvision
# !pip install transformers



# OpenAI Language Models

In mid-February 2019, [OpenAI published a language model](https://blog.openai.com/better-language-models/) capable of generating coherent natural language. The model is general-purpose and, despite that, rivals the best task-specific systems at machine reading comprehension, machine translation, question answering, and automatic summarization.

This model, called GPT-2, has 1.5 billion parameters and was trained on 8 million web pages (40 GB of text) with a single objective: predicting the next word.

However, OpenAI has not released the full model, to prevent people with bad intentions from misusing the technology. They have released a smaller, simplified version, along with the paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), which explains the whole process.

With enough motivation and GPUs (plus time and money), the process can be replicated. Other interesting reads on the topic:

- [OpenAI's new Multitalented AI Writes, Translates, and Slanders](https://www.theverge.com/2019/2/14/18224704/ai-machine-learning-language-models-read-write-openai-gpt2)
- [Some thoughts on zero-day threats in AI, and OpenAI's GP2](https://www.fast.ai/2019/02/15/openai-gp2/)


This example code is inspired by [a tweet from Thomas Wolf](https://twitter.com/Thom_Wolf/status/1097465312579072000) of [Hugging Face](https://huggingface.co/).

In [3]:
import torch
from torch.nn import functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

Downloading: 100%|██████████| 1.04M/1.04M [00:00<00:00, 1.47MB/s]
Downloading: 100%|██████████| 456k/456k [00:00<00:00, 958kB/s] 
Downloading: 100%|██████████| 665/665 [00:00<00:00, 340kB/s]
Downloading: 100%|██████████| 548M/548M [00:45<00:00, 12.0MB/s] 


Next, we define a function that:

1. tokenizes the input text and encodes it as a list of token IDs from the GPT-2 vocabulary
2. repeatedly samples the next token from the probability distribution predicted by the model
3. decodes the resulting token IDs back into text

In [4]:
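def generate(text, length=50):
    """Generate a continuation of `text` by sampling `length` tokens from GPT-2."""
    vec_text = tokenizer.encode(text)  # list of token IDs
    my_input, past = torch.tensor([vec_text]), None

    with torch.no_grad():  # inference only, no gradients needed
        for _ in range(length):
            # `past_key_values` caches attention state so only the new token is fed in
            # (older transformers releases called this argument `past`)
            outputs = model(my_input, past_key_values=past)
            logits, past = outputs.logits, outputs.past_key_values
            # sample the next token from the distribution over the vocabulary
            my_input = torch.multinomial(F.softmax(logits[:, -1], dim=1), 1)
            vec_text.append(my_input.item())

    return tokenizer.decode(vec_text)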
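
The sampling step inside `generate` can be easier to follow on a toy example. The sketch below uses plain Python instead of PyTorch, and its four-word vocabulary and logit values are invented for illustration; they do not come from GPT-2. It contrasts greedy decoding (always take the most probable token) with sampling in proportion to the softmax probabilities, which is roughly what `torch.multinomial` does in the function above.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits, invented for illustration only.
vocab = ["Spanish", "English", "no", "French"]
logits = [2.0, 1.0, 0.5, 0.1]
probs = softmax(logits)

# Greedy decoding: always pick the single most probable token.
greedy = vocab[probs.index(max(probs))]

# Sampling: draw each token in proportion to its probability,
# so less likely tokens can still appear from time to time.
random.seed(0)
sampled = [random.choices(vocab, weights=probs)[0] for _ in range(5)]

print("greedy: ", greedy)
print("sampled:", sampled)
```

Because `generate` samples rather than taking the maximum, running it twice on the same prompt will usually give different text.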

In [9]:
# define an input text
text = "The only think we can do to fight climate change is"

# and generate three sampled continuations
for _ in range(3):
    print(generate(text, 50), "\n")

The only think we can do to fight climate change is encourage policymakers to take some steps to support large-scale research into climate change."

After the vote, an Endo Group spokesperson did not respond to requests for comment on whether the firm would hire more diverse climate scientists in the years ahead.
 

The only think we can do to fight climate change is to make your administration less Incredibly authoritarian and fearful. Make sure that your administration does not impose radical Sharia law on America because all Americans believe they are right, and that doesn't include you.

Let's continue to work together on tomorrow's 

The only think we can do to fight climate change is to start sending even more kids to South Sudan. Beyond the basic poverty problem and every other reason."


Dharab Dhillon:

"Kerry is going to do his best – he's smart and a courageous man, but 



In [6]:
countries = "Spain France Italy Greece Russia China Japan India".split()

for country in countries:
    text = f"I was born in {country} so I speak"
    print(generate(text, 1))

I was born in Spain so I speak Spanish
I was born in France so I speak English
I was born in Italy so I speak no
I was born in Greece so I speak Greek
I was born in Russia so I speak Russian
I was born in China so I speak no
I was born in Japan so I speak Japanese
I was born in India so I speak English
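
Each line above is a single sampled token, so answers such as "no" may just reflect one unlucky draw. One way to get a steadier picture is to repeat the call and count the completions. The sketch below is a hedged illustration: `tally_completions` is a hypothetical helper, and `fake_generate` is a stand-in with invented word weights so the block runs without downloading GPT-2. In the notebook, the real `generate` could be passed in its place, assuming the decoded text starts with the prompt unchanged.

```python
import random
from collections import Counter


def tally_completions(generate_fn, prompt, runs=20):
    """Call generate_fn(prompt, 1) `runs` times and count the text appended to the prompt."""
    counts = Counter()
    for _ in range(runs):
        completion = generate_fn(prompt, 1)
        # keep only what was added after the prompt
        counts[completion[len(prompt):].strip()] += 1
    return counts


def fake_generate(prompt, length=1):
    """Hypothetical stand-in for the notebook's `generate`; the weights are invented."""
    word = random.choices(["French", "English", "no"], weights=[6, 3, 1])[0]
    return f"{prompt} {word}"


random.seed(0)
prompt = "I was born in France so I speak"
counts = tally_completions(fake_generate, prompt, runs=100)
print(counts.most_common())
```

With the real model, `tally_completions(generate, prompt)` would show how often each next token is drawn for a given country.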
