# Text Generation

In this notebook, we train an adapter for **GPT-2** that performs **poem generation**. We use a dataset of poem verses extracted from Project Gutenberg that is [available via HuggingFace datasets](https://huggingface.co/datasets/poem_sentiment).

First, let's install all required libraries:

In [1]:
!pip install -U adapter-transformers
!pip install datasets

Collecting git+https://github.com/hSterz/adapter-transformers.git@notebooks
  Cloning https://github.com/hSterz/adapter-transformers.git (to revision notebooks) to /tmp/pip-req-build-yq66v6vw
  Running command git clone -q https://github.com/hSterz/adapter-transformers.git /tmp/pip-req-build-yq66v6vw
  Running command git checkout -b notebooks --track origin/notebooks
  Switched to a new branch 'notebooks'
  Branch 'notebooks' set up to track remote branch 'notebooks' from 'origin'.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: adapter-transformers
  Building wheel for adapter-transformers (PEP 517) ... done
  Created wheel for adapter-transformers: filename=adapter_transformers-2.0.0a1-cp37-none-any.whl size=2009422 sha256=191637b46cc2556c1eda5f1920af739aa96a3633cd569252f9e13a163a67259b
  Stored in directory: /tmp/pip-e

Next, we need to download the dataset:

In [2]:
from datasets import load_dataset

dataset = load_dataset("poem_sentiment")
print(dataset)

Using custom data configuration default
Reusing dataset poem_sentiment (/root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56)


DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})


Before training, we need to preprocess the dataset. We tokenize the entries in the dataset and remove all columns we don't need to train the adapter.

In [3]:
from transformers import GPT2Tokenizer

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  encoding = tokenizer(batch["verse_text"])
  # For language modeling, the labels are the input_ids themselves. They are
  # added later in group_texts(), after the tokens are split into blocks.
  return encoding

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# The GPT-2 tokenizer does not have a padding token. To process the data
# in batches, we reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token
column_names = dataset["train"].column_names
dataset = dataset.map(encode_batch, remove_columns=column_names, batched=True)



Loading cached processed dataset at /root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56/cache-d6ab54fa3fbc78bd.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56/cache-6b5d3f18db518d3d.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56/cache-5f751c4a95f6b033.arrow
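
To make the role of the padding token concrete, here is a minimal sketch of how sequences of different lengths can be padded into one rectangular batch. The helper `pad_batch` is hypothetical and written in plain Python; in practice the tokenizer or a data collator does this work. The token IDs are made up, and 50256 is the ID of GPT-2's EOS token, which we reuse for padding.

```python
# ID of GPT-2's EOS token, reused as the padding ID.
PAD_TOKEN_ID = 50256


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad lists of token IDs to a common length (hypothetical helper).

    Returns the padded IDs together with an attention mask that is 1 for
    real tokens and 0 for padding positions.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, for illustration only.
batch = pad_batch([[11, 22, 33], [44]])
print(batch["input_ids"])       # [[11, 22, 33], [44, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions, which is why reusing the EOS token as the padding token is safe here.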


Next, we concatenate the tokenized verses in the dataset and split the result into chunks of `block_size` tokens. This gives the model fixed-length training sequences, so no padding is needed within a batch.

In [4]:
block_size = 50
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
  # Concatenate all texts.
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # We drop the small remainder. We could pad it instead if the model supported
  # it; customize this part to your needs.
  total_length = (total_length // block_size) * block_size
  # Split into chunks of block_size.
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result

dataset = dataset.map(group_texts, batched=True)

dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Loading cached processed dataset at /root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56/cache-3f8369251f49a308.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56/cache-6f7b971f033f4873.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/f4990808f049126bcea572bba70613313212cd45f3b12a3e5586135e2de42f56/cache-5b3fb38a938eecda.arrow
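
To see what the grouping step does, the following sketch runs the same idea on a tiny hand-made batch. The function `group_into_blocks` is a hypothetical stand-in that mirrors `group_texts` but takes `block_size` as an argument, and the token IDs are invented for illustration.

```python
def group_into_blocks(examples, block_size):
    """Concatenate all sequences per key and cut them into fixed-size blocks."""
    # Flatten the list of lists for every key (input_ids, attention_mask, ...).
    concatenated = {
        key: [token for seq in seqs for token in seq]
        for key, seqs in examples.items()
    }
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


# Three "verses" with 3, 2, and 4 made-up token IDs (9 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(blocks["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a block can span several verses and that the leftover token `9` is dropped. The labels are identical to the inputs because the model shifts them internally when computing the next-token loss.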


Next, we create the model and add our new adapter. Let's just call it `poem` since it is trained to create new poems. Then we activate it and prepare it for training.

In [5]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# add new adapter
model.add_adapter("poem")
# activate adapter for training
model.train_adapter("poem")
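
Calling `train_adapter` freezes the pretrained GPT-2 weights, so only the adapter weights are updated during training. The sketch below gives a rough back-of-the-envelope estimate of how few parameters that is. It assumes a bottleneck adapter (a down-projection followed by an up-projection) with a reduction factor of 16 and one adapter per transformer layer; the actual default configuration of your installed version may differ and may include extra parameters, so treat the numbers as an illustration only.

```python
def bottleneck_adapter_params(hidden_size, reduction_factor):
    """Parameter count of one down-projection/up-projection pair, with biases."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck   # weights + bias
    up = bottleneck * hidden_size + hidden_size    # weights + bias
    return down + up


HIDDEN_SIZE = 768      # hidden size of the "gpt2" checkpoint
NUM_LAYERS = 12        # number of transformer layers in "gpt2"
REDUCTION_FACTOR = 16  # ASSUMPTION: not taken from this notebook

per_layer = bottleneck_adapter_params(HIDDEN_SIZE, REDUCTION_FACTOR)
total = per_layer * NUM_LAYERS
print(per_layer)  # 74544
print(total)      # 894528
```

Under these assumptions, the adapter adds under one million trainable parameters, which is well below 1% of the roughly 124 million parameters of the base model.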

The last thing we need to do before we can start training is to create the trainer. As training arguments, we choose a learning rate of 5e-4 and train for 3 epochs. Feel free to play around with the parameters and see how they affect the result.

In [6]:
from transformers import AdapterTrainer, TrainingArguments
training_args = TrainingArguments(
  output_dir="./examples", 
  do_train=True,
  remove_unused_columns=False,
  learning_rate=5e-4,
  num_train_epochs=3,
)


trainer = AdapterTrainer(
  model=model,
  args=training_args,
  tokenizer=tokenizer,
  train_dataset=dataset["train"],
  eval_dataset=dataset["validation"],
)

In [7]:
trainer.train()


TrainOutput(global_step=66, training_loss=5.653754956794508, metrics={'train_runtime': 264.1691, 'train_samples_per_second': 0.25, 'total_flos': 19852958822400.0, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 292637, 'init_mem_cpu_peaked_delta': 14018, 'train_mem_cpu_alloc_delta': 609432, 'train_mem_cpu_peaked_delta': 259598})
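
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes the default per-device batch size of 8, a single device, and no gradient accumulation; the chunk count of 170 is hypothetical, chosen only because any count from 169 to 176 yields the reported 66 steps under these assumptions. It also converts the average training loss into a rough perplexity, assuming the loss is the mean cross-entropy per token in nats.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size) * num_epochs


BATCH_SIZE = 8    # ASSUMPTION: default batch size, single device
NUM_EPOCHS = 3    # from the training arguments
num_chunks = 170  # HYPOTHETICAL number of training chunks

steps = total_optimizer_steps(num_chunks, BATCH_SIZE, NUM_EPOCHS)
print(steps)  # 66

# Perplexity is the exponential of the mean cross-entropy loss.
training_loss = 5.653754956794508  # from the TrainOutput
perplexity = math.exp(training_loss)
print(round(perplexity))  # 285
```

Because the loss is averaged over the whole run, including the early steps, this perplexity is only a coarse indicator of how well the adapter fits the poems.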

Now that we have a trained adapter, we save it for future use.

In [8]:
model.save_adapter("adapter_poem", "poem")

Next, let's generate some poetry with our trained adapter. To do this, we create a `GPT2LMHeadModel`, which puts a language modeling head on top of GPT-2 and is therefore well suited for text generation. Then we load our trained adapter and activate it. Finally, we have to choose the start of our poem. If you want your poem to start differently, just change `PREFIX` accordingly.

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Load the locally trained adapter that we saved above
model.load_adapter("adapter_poem")
model.set_active_adapters("poem")

PREFIX = "In the night"

For the generation, we need to tokenize the prefix first and then pass it to the model. In this case, we create five possible continuations for the beginning we chose.

In [10]:
encoding = tokenizer(PREFIX, return_tensors="pt")
output_sequence = model.generate(
  input_ids=encoding["input_ids"],
  attention_mask=encoding["attention_mask"],
  do_sample=True,
  num_return_sequences=5,
  max_length=50,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
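
With `do_sample=True`, each next token is drawn at random from the model's predicted distribution instead of always taking the most likely token, which is why the five continuations differ. The following toy sketch illustrates one such sampling step in plain Python. The vocabulary and scores are invented, and the helper names are hypothetical; the real `generate` method operates on tensors and offers more options.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Draw one token index according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up vocabulary and scores for the token that follows "In the night".
toy_vocab = ["moon", "stars", "wind", "sea"]
toy_logits = [2.0, 1.5, 0.5, 0.1]

rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = [toy_vocab[sample_token(toy_logits, rng)] for _ in range(5)]
print(draws)
```

Likely tokens such as `moon` appear most often, but less likely ones still get picked from time to time, which gives the generated poems their variety.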


Lastly, we want to see what the model actually created. To do this, we decode the token IDs back to text and cut each sequence at the first EOS token. You can easily use this code with another dataset. Don't forget to share your adapters at [AdapterHub](https://adapterhub.ml/).

In [11]:
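for generated_sequence_idx, generated_sequence in enumerate(output_sequence):
  print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
  generated_sequence = generated_sequence.tolist()

  # Decode the token IDs back to text
  text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
  # Cut the text at the first end-of-sequence token, if there is one.
  # str.find returns -1 when the token is absent, which would otherwise
  # chop off the last character of the text.
  eos_index = text.find(tokenizer.eos_token)
  if eos_index != -1:
    text = text[:eos_index]

  print(text)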

=== GENERATED SEQUENCE 1 ===
In the night, he would go;and she is the queen, and a mistress,and she keeps in the nightthe king who died" (the "giant," said the ancient, as a poet)and a child in his home
=== GENERATED SEQUENCE 2 ===
In the night,when one thinks of the war upon the world, and of men who live in it;that's all you have, though, that's all, that's what you want. and that makes me want, but here's th
=== GENERATED SEQUENCE 3 ===
In the night, she was the first, for once, the girl of good cheer!--of the people, the love of her life, she has not come to see her sister again;yet i think if i could not have loved her I wer
=== GENERATED SEQUENCE 4 ===
In the night, she sang the sweetest lullaby of morning-the very sound he heard:the silent and delicate voice of the holy sea,that his face would not come to grief.a quiet and silent night,the song as always i
=== GENERATED SEQUENCE 5 ===
In the nighttime, the king says:but there can be no peace or sorrow if that night's not a bless
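
One detail is worth keeping in mind when trimming generated text: a sequence that reaches `max_length` before the model emits an EOS token contains no EOS marker at all. Since `str.find` returns -1 in that case, slicing with its result directly would silently drop the last character. The sketch below shows a small guard for this; the helper `trim_at_eos` is hypothetical, and `<|endoftext|>` is the text form of GPT-2's EOS token.

```python
def trim_at_eos(text, eos_token="<|endoftext|>"):
    """Cut text at the first EOS token; return it unchanged if none is found."""
    eos_index = text.find(eos_token)
    if eos_index == -1:
        return text
    return text[:eos_index]


with_eos = "In the night<|endoftext|><|endoftext|>"
without_eos = "In the night"

print(trim_at_eos(with_eos))     # In the night
print(trim_at_eos(without_eos))  # In the night

# Slicing without the guard loses the final character:
print(without_eos[: without_eos.find("<|endoftext|>")])  # In the nigh
```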