# Backprop Core Example: Finetuning for Text Generation

Finetuning lets you take a model that has been trained on a very broad task and adapt it to your specific niche.

For general models this works really well. To intuitively understand why, let's look at a text generation model.

T5 has been trained on around 750GB of text. The task was simply to predict some masked words within sentences. Take this sentence as an example: "The man went to the `___`, he bought a gallon of `___`." In order to correctly fill in the gaps, the model must understand enough about language and the world.

It turns out that this knowledge is transferable to other tasks, hence the term transfer learning.
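To make the masked-word task concrete, here is a minimal sketch of how the example sentence could be turned into a T5-style input and target pair. T5 marks each gap with a sentinel token (`<extra_id_0>`, `<extra_id_1>`, ...). The fill-in words "store" and "milk" and the helper `corrupt` are illustrative assumptions, and the sketch masks single words only, whereas real pretraining masks spans of varying length.

```python
# Simplified sketch of T5-style masking on the example sentence.
# "store" and "milk" are illustrative guesses for the gaps, not from the source.

def corrupt(tokens, masked_positions):
    """Replace each masked token with a sentinel and build the matching target."""
    model_input, target = [], []
    sentinel = 0
    for pos, tok in enumerate(tokens):
        if pos in masked_positions:
            marker = f"<extra_id_{sentinel}>"
            model_input.append(marker)
            target.extend([marker, tok])
            sentinel += 1
        else:
            model_input.append(tok)
    # The target ends with one final sentinel
    target.append(f"<extra_id_{sentinel}>")
    return " ".join(model_input), " ".join(target)

tokens = "The man went to the store , he bought a gallon of milk .".split()
masked = {tokens.index("store"), tokens.index("milk")}
model_input, target = corrupt(tokens, masked)
print(model_input)  # The man went to the <extra_id_0> , he bought a gallon of <extra_id_1> .
print(target)       # <extra_id_0> store <extra_id_1> milk <extra_id_2>
```

The model only sees `model_input` and has to produce `target`, which is why it needs some knowledge of language and the world to do well.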

## Generating questions

One of the tasks that Backprop supports right out of the box is question answering based on context.

What if you wanted to do the reverse instead? That is, generate questions based on context. That's certainly possible with text generation.

With a minimal amount of code, we'll build a model that can take any paragraph of text (such as something from Elon Musk's Wikipedia page) and generate questions that can be answered based on the paragraph.

### Preparing the data

This step is not as scary as it may sound. Our finetuning functionality does most of the heavy lifting. No complicated data transformations are necessary.

We are going to be using the SQuAD dataset.

This dataset has multiple questions and answers on different paragraphs of text. What we'll do is use a paragraph of text as the input and a list of its questions as the output. Pretty simple!

In [1]:
from datasets import load_dataset

dataset = load_dataset("squad")

Reusing dataset squad (/home/kristo/.cache/huggingface/datasets/squad/plain_text/1.0.0/1244d044b266a5e4dbd4174d23cb995eead372fbca31a03edc3f8a132787af41)


In [2]:
dataset["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

That's what the data looks like: each item in the dataset is a dictionary with a context, a question and an answer. All we'll do is group the questions by context.

In [3]:
input_data = []
output_data = []

last_context = ""

# Limit to 5000 items for proof of concept
for i in range(5000):
    context = dataset["train"][i]["context"]
    question = dataset["train"][i]["question"]
    if context != last_context:
        input_data.append(context)
        last_context = context
        output_data.append([])

    output_index = len(input_data) - 1
    output_data[output_index].append(question)

In [4]:
input_data[0]

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [5]:
output_data[0]

['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'What is in front of the Notre Dame Main Building?',
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
 'What is the Grotto at Notre Dame?',
 'What sits on top of the Main Building at Notre Dame?']

In [6]:
len(input_data), len(output_data)

(820, 820)

Great! We got 820 examples for training. That's just a small portion of the data, but it's enough to achieve some promising results. There is just one final step before finetuning.
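A side note on the grouping loop above: it assumes that all questions for the same context appear on consecutive rows. If that were not the case, a dictionary keyed by context would group them regardless of order. Here is a minimal sketch on hypothetical toy records (the helper `group_questions` and the sample data are assumptions for illustration):

```python
# Hypothetical toy records that mimic the two SQuAD fields used in this example.
records = [
    {"context": "Paragraph A", "question": "Question A1?"},
    {"context": "Paragraph B", "question": "Question B1?"},
    {"context": "Paragraph A", "question": "Question A2?"},
]

def group_questions(rows):
    """Group questions by context, keeping first-seen context order."""
    grouped = {}  # dicts preserve insertion order in Python 3.7+
    for row in rows:
        grouped.setdefault(row["context"], []).append(row["question"])
    return list(grouped.keys()), list(grouped.values())

contexts, questions = group_questions(records)
print(contexts)   # ['Paragraph A', 'Paragraph B']
print(questions)  # [['Question A1?', 'Question A2?'], ['Question B1?']]
```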

The T5 model that we are planning on using has already been finetuned on some tasks such as translation. When a model handles multiple tasks, it is useful to add a prefix to the input that lets the model know what it should do.

For example, `generate questions: Some paragraph of text.`

Additionally, each item in our `output_data` is currently a list of strings. Our model expects just a single string for an output example.

In [7]:
input_data = [f"generate questions: {i}" for i in input_data]
output_data = ["; ".join(o) for o in output_data]

In [8]:
input_data[0]

'generate questions: Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [9]:
output_data[0]

'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?; What is in front of the Notre Dame Main Building?; The Basilica of the Sacred heart at Notre Dame is beside to which structure?; What is the Grotto at Notre Dame?; What sits on top of the Main Building at Notre Dame?'

For output data, we chose `;` as a separator that's not too common in text. Anything similar should work fine.
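Because the questions are joined with `;` for training, the generated text can be split on the same separator to get a list of questions back. A minimal sketch, where `split_questions` is a hypothetical helper and not part of Backprop:

```python
def split_questions(generated, separator=";"):
    """Split generated text into individual questions, dropping empty pieces."""
    parts = (part.strip() for part in generated.split(separator))
    return [part for part in parts if part]

sample = "What is the Grotto at Notre Dame?; What sits on top of the Main Building at Notre Dame?"
parsed = split_questions(sample)
print(parsed)
```

Stripping whitespace and skipping empty pieces keeps the result clean even if the model emits a trailing separator.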

### Finetuning

We are ready to finetune. All we'll need to do is pass the list of input and output strings to Backprop.
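Since finetuning takes maximum input and output lengths in tokens, it can help to check roughly how many examples exceed a limit beforehand. The sketch below uses whitespace-separated words as a rough proxy for tokens. That is an assumption: a subword tokenizer usually produces somewhat more tokens than words, so the count is a lower bound. `count_over_limit` is a hypothetical helper.

```python
def count_over_limit(texts, limit=256):
    """Count texts whose whitespace-separated word count exceeds the limit."""
    return sum(1 for text in texts if len(text.split()) > limit)

# Hypothetical toy inputs: one short text and one with 300 words
toy_inputs = ["short paragraph", "word " * 300]
over = count_over_limit(toy_inputs)
print(over)  # 1
```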

In [10]:
import backprop

In [11]:
# Start a local text generation task with T5
tg = backprop.TextGeneration(backprop.models.T5)
# Length here refers to number of tokens (1 token ~ 1 word)
tg.finetune(input_data, output_data, max_input_length=256, max_output_length=256)

Processing data...


GPU available: True, used: True
TPU available: None, using: 0 TPU cores


Finding the optimal batch size...


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg_sq_row.mul_(beta2t).add_(1.0 - beta2t, update.mean(dim=-1))
Batch size 2 succeeded, trying batch size 4
Batch size 4 succeeded, trying batch size 8
Batch size 8 succeeded, trying batch size 16
Batch size 16 succeeded, trying batch size 32
Batch size 32 succeeded, trying batch size 64
Batch size 64 succeeded, trying batch size 128
Batch size 128 failed, trying batch size 64
Finished batch size finder, will continue with full run using batch size 64
Restored states from the checkpoint file at /home/kristo/Documents/backprop/examples/scale_batch_size_temp_model.ckpt
GPU available: True, used: True
TPU available: None, using: 0 TPU cores

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration

Validation sanity check: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validating: |          | 0/? [00:00<?, ?it/s]

Training finished! Save your model for later with kiri.save or upload it with kiri.upload
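The batch size finder in the log above doubles the batch size until a trial fails, then continues with the last size that succeeded. A minimal sketch of that strategy, assuming a hypothetical `fits` callback that stands in for a trial training step (this is not Backprop's actual implementation):

```python
def find_batch_size(fits, start=2, max_trials=25):
    """Double the batch size until `fits` fails, then return the last success."""
    last_good = None
    size = start
    for _ in range(max_trials):
        if not fits(size):
            break
        last_good = size
        size *= 2
    return last_good

# Hypothetical stand-in: pretend any batch above 100 examples runs out of memory.
best = find_batch_size(lambda size: size <= 100)
print(best)  # 64
```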


In [12]:
context = """generate questions: Tesla, Inc. (originally Tesla Motors) was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning, who financed the company until the Series A round of funding.[90] Both men played active roles in the company's early development prior to Musk's involvement.[91] Musk led the Series A round of investment in 2004, joining Tesla's board of directors as its chairman.[92][93][94][95] Musk took an active role within the company and oversaw Roadster product design but was not deeply involved in day-to-day business operations.[96] Following a series of escalating conflicts in 2007 and the 2008 financial crisis, Eberhard was ousted from the firm.[97][98] Musk assumed leadership of the company as CEO and product architect in 2008, positions he still holds today. A 2009 lawsuit settlement with Eberhard designated Musk as a Tesla co-founder, along with Tarpenning and two others.[4][5] As of 2019, Musk is the longest tenured CEO of any automotive manufacturer globally.[99]"""

In [13]:
# Have a look at the text generation notebook to understand these parameters
tg(context, max_length=50, min_length=30, temperature=0.5)

RuntimeError: Input, output and indices must be on the current device

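The call above failed with a `RuntimeError`, which in PyTorch usually means the model weights and the input tensors are on different devices (for example, the model on the CPU and the inputs on the GPU). Once both are on the same device, the call should return a single string of questions separated by `;`, matching the format of the training outputs.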

We used fewer than 1,000 training examples and less than 1 minute of training. This is the power of transfer learning!

I can see this model being useful for generating questions customers might have based on support articles (FAQs), or even for automating the creation of reading comprehension tests.

Where to from here? You can save this model for later with `backprop.save(model)` and load with `model = backprop.load("your-model-name")`.

Backprop even supports uploading your custom models to our production-ready environment. You can do this with a single line of code.

The model will be private, always available and scale to support thousands of requests a second if needed. The best part is that you only pay for what you use. If it is idle, you pay nothing at all. 

In [None]:
# The model attached to the text generation task
model = tg.model
# Name the model
model.name = "t5-question-generation"
# Give a description
model.description = "This T5 small model was partly finetuned on SQuAD to generate questions based on given context."

# Requires API Key
backprop.upload(model, api_key="YOUR_API_KEY")

After the build has completed, the model is available anywhere via our API, and *only* you can use it.

In [None]:
import requests

context = "generate questions: In 2016, Musk co-founded Neuralink, a neurotechnology start-up company to integrate the human brain with AI. The company is centered on creating devices that can be implanted in the human brain, with the eventual purpose of helping human beings merge with software and keep pace with advancements in artificial intelligence. These enhancements could improve memory or allow more direct interfacing with computing devices.[142][143]"

body = {
    "model": "t5-question-generation",
    "temperature": 0.5,
    "max_length": 40,
    "min_length": 15,
    "text": context
}

# Requires API Key
requests.post("https://api.backprop.co/generation", json=body,
              headers={"x-api-key": "YOUR_API_KEY"}).json()
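Rather than pasting the API key into the notebook, it can be read from an environment variable so that it never ends up in a shared file. The sketch below only builds the arguments for `requests.post` and sends nothing. The variable name `BACKPROP_API_KEY` and the helper `build_generation_request` are assumptions made for this example; the URL, header name and body fields are the ones used in the call above.

```python
import os

API_URL = "https://api.backprop.co/generation"  # endpoint used in the call above

def build_generation_request(text, api_key=None):
    """Return keyword arguments for requests.post without sending anything."""
    # BACKPROP_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = api_key or os.environ.get("BACKPROP_API_KEY")
    if not api_key:
        raise ValueError("Set BACKPROP_API_KEY or pass api_key explicitly")
    body = {
        "model": "t5-question-generation",
        "temperature": 0.5,
        "max_length": 40,
        "min_length": 15,
        "text": text,
    }
    return {"url": API_URL, "json": body, "headers": {"x-api-key": api_key}}

request_kwargs = build_generation_request(
    "generate questions: Some paragraph of text.", api_key="test-key"
)
print(request_kwargs["url"])
```

To send the request, pass the result on with `requests.post(**request_kwargs).json()`.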