# Generation models

This notebook explores two recent models for text generation:

* [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html): generates text conditioned on control codes.
* [Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html): a summarization model, included because it appears to give the best summarization performance.

## 1. CTRL

For the complete list of control codes, see [the appendix of the paper](https://arxiv.org/pdf/1909.05858.pdf). In brief, a control code can specify:

* the domain,
* a topic (from Reddit),
* a review rating,
* a specific task (e.g. QA or MT), or
* a source, given as a fake URL prompt.

These control codes can be mixed and matched to create novel utterances.
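
As a minimal sketch of how such prompts can be assembled, the helper below joins a control code, an optional rating, and free text into one string. The function name `build_ctrl_prompt` and its parameters are hypothetical and not part of `transformers`; the `Rating: 5.0` layout simply follows the review prompts used in the cells of this notebook.

```python
def build_ctrl_prompt(control_code, text, rating=None):
    """Join a CTRL control code, an optional review rating, and free text."""
    parts = [control_code]
    if rating is not None:
        # Same "Rating: 5.0" layout as the review prompts in this notebook.
        parts.append(f"Rating: {rating:.1f}")
    parts.append(text.strip())
    return " ".join(parts)


print(build_ctrl_prompt("Links", "https://www.cnn.com/technology/12/22/2020/knorex-beats-google-in-machine-translation"))
print(build_ctrl_prompt("Reviews", "Knorex XPO Advertising Platform", rating=5))
```

Keeping the control code as the first element mirrors the prompts in this notebook, where the code always precedes the text.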

**Note:** the model is quite large (1.6B parameters, 6.5 GB on disk). Fine-tuning it will likely require multiple GPUs.
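
A rough back-of-the-envelope check of these numbers, assuming 32-bit weights and ignoring checkpoint file overhead. The byte count is the size of the largest file reported while downloading the checkpoint. The figure of 16 bytes per parameter for training (float32 weights, gradients, and two Adam moment buffers) is a common rule of thumb and an assumption here; it excludes activations.

```python
CTRL_CHECKPOINT_BYTES = 6_552_025_106  # largest file in the CTRL download
BYTES_PER_FP32 = 4


def approx_params(checkpoint_bytes, bytes_per_param=BYTES_PER_FP32):
    """Estimate the parameter count from the checkpoint size."""
    return checkpoint_bytes / bytes_per_param


def approx_training_bytes(n_params, bytes_per_param=16):
    """Assumed: weights + gradients + two Adam buffers, all float32, no activations."""
    return n_params * bytes_per_param


n_params = approx_params(CTRL_CHECKPOINT_BYTES)
print(f"parameters: {n_params / 1e9:.2f}B")
print(f"weights only: {CTRL_CHECKPOINT_BYTES / 1e9:.2f} GB")
print(f"training state (no activations): {approx_training_bytes(n_params) / 1e9:.1f} GB")
```

Under these assumptions the estimate comes to roughly 1.6B parameters and about 26 GB of weights, gradients, and optimizer state before any activations are counted, which is consistent with the expectation that fine-tuning needs more than one GPU.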

In [1]:
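# AutoModelWithLMHead is deprecated in recent transformers releases;
# AutoModelForCausalLM is its replacement for decoder-only models such as CTRL.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ctrl")
tokenizer = AutoTokenizer.from_pretrained("ctrl")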



In [17]:
control_code = "Links"
prompt = " https://www.cnn.com/technology/12/22/2020/knorex-beats-google-in-machine-translation"
prompt = control_code + prompt  # the control code goes at the start of the prompt
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

# Character length of the round-tripped prompt, used below to cut the prompt off the decoded output.
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
# Sample up to 100 tokens (prompt included) with top-k and nucleus (top-p) filtering.
outputs = model.generate(inputs, max_length=100, do_sample=True, num_beams=1, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
generated

"Links https://www.cnn.com/technology/12/22/2020/knorex-beats-google-in-machine-translation \n Google's machine learning projects are so impressive, it's hard to believe they could get any better. But Knorek thinks it can. \n \n The company recently released a prototype version of its translation system, called Knorex. It was built by Knorex, a Google-led startup based in Mountain View, California. It's designed to translate English text into a string of letters that it thinks sounds like the source language, and then"
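
The `generate` calls in this section sample with `top_k=60` and `top_p=0.95`. As a rough illustration of what these two settings do to a next-token distribution, here is a pure-Python sketch. It is not the `transformers` implementation, which works on logits tensors; the function name `top_k_top_p_filter` and the toy probabilities are made up for this example.

```python
def top_k_top_p_filter(probs, top_k=60, top_p=0.95):
    """Return {token_id: renormalized probability} for the tokens left to sample from."""
    # Top-k step: rank token ids from most to least probable and keep the best top_k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # Nucleus step: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens so their probabilities sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
print(top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8))
```

With the toy distribution, the token with probability 0.05 is dropped by the top-k step and the one with 0.15 by the top-p step, leaving two tokens to sample from.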

In [21]:
control_code = "Reviews Rating: 5.0"
prompt = " Knorex XPO Advertising Platform "
prompt = control_code + prompt
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=50, do_sample=True, num_beams=1, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
generated

'Reviews Rating: 5.0 Knorex XPO Advertising Platform  is the best that I have ever seen.Just love it, \n Rating: 5.0 \n The Knorex XPO Platform is very versatile. It will hold two or three items on each leg to weigh the'

In [22]:
control_code = "Reviews Rating: 1.0"
prompt = " Knorex XPO Advertising Platform "
prompt = control_code + prompt
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=50, do_sample=True, num_beams=1, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
generated

'Reviews Rating: 1.0 Knorex XPO Advertising Platform  is a poor excuse for a marketing platform. This is all about the ability to post and then click and go while the company makes money from people posting on this platform. I have used this product for years and'
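
The three cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible refactoring, with a hypothetical name `generate_with_control`. It assumes `model` and `tokenizer` are the CTRL objects loaded earlier, and it cuts off the prompt by token count rather than by character count, so the returned text may differ slightly in whitespace from the outputs shown above.

```python
def generate_with_control(model, tokenizer, control_code, text, max_length=50):
    """Generate a continuation for `text` under a CTRL control code.

    Returns the prompt and the newly generated text as two strings.
    """
    prompt = f"{control_code} {text.strip()}"
    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    # Same sampling settings as the cells in this notebook.
    outputs = model.generate(inputs, max_length=max_length, do_sample=True, top_p=0.95, top_k=60)

    # Drop the prompt by token count so that only new tokens are decoded.
    prompt_tokens = len(inputs[0])
    continuation = tokenizer.decode(outputs[0][prompt_tokens:])
    return prompt, continuation
```

A call such as `generate_with_control(model, tokenizer, "Reviews Rating: 1.0", "Knorex XPO Advertising Platform")` would then stand in for one of the review cells.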

## 2. Pegasus

The [Pegasus-large](https://huggingface.co/google/pegasus-large#) checkpoint looks promising. It was pre-trained on a mixture of two large corpora, C4 and HugeNews, and is reported to perform better than models pre-trained on either corpus alone.

**Note:** the model size is 2.28GB.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")

model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")

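
With the tokenizer and model loaded, a summarization call could look like the sketch below. The helper name `summarize`, the 1024-token input limit, the 64-token summary limit, and the beam width of 4 are all assumptions chosen for illustration, not settings taken from this notebook. Note also that `google/pegasus-large` is a pre-trained checkpoint, so its summaries may be rougher than those of a checkpoint fine-tuned on a specific summarization dataset.

```python
def summarize(model, tokenizer, text, max_input_tokens=1024, max_summary_tokens=64):
    """Summarize `text` with a seq2seq model such as Pegasus (illustrative defaults)."""
    # Tokenize as a batch of one, truncating inputs that exceed the assumed limit.
    batch = tokenizer([text], truncation=True, max_length=max_input_tokens, return_tensors="pt")

    # Beam search is used here as an assumed default for summarization.
    summary_ids = model.generate(**batch, max_length=max_summary_tokens, num_beams=4)

    # Decode the single summary in the batch, dropping padding and end-of-sequence tokens.
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

Calling `summarize(model, tokenizer, article_text)` on a string `article_text` would return the generated summary as plain text.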