# Text Generation

In this notebook, we train an adapter for **GPT-2** that performs **poem generation**. We use a dataset of poem verses extracted from Project Gutenberg that is [available via HuggingFace datasets](https://huggingface.co/datasets/poem_sentiment).

First, let's install all required libraries:

In [1]:
!pip install -Uq adapters
!pip install -q datasets
!pip install -Uq accelerate


Next, we need to download the dataset:

In [2]:
from datasets import load_dataset

dataset = load_dataset("poem_sentiment")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})


Before training, we need to preprocess the dataset. We tokenize the entries in the dataset and remove all columns we don't need to train the adapter.

In [3]:
from transformers import GPT2Tokenizer

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  encoding = tokenizer(batch["verse_text"])
  # For language modeling the labels are a copy of the input_ids. They are added
  # after chunking in group_texts, so nothing is set here.
  return encoding

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# The GPT-2 tokenizer does not have a padding token. To process the data
# in batches, we reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token
column_names = dataset["train"].column_names
dataset = dataset.map(encode_batch, remove_columns=column_names, batched=True)




Next, we concatenate the tokenized verses and split the result into chunks of `block_size` tokens. Because every chunk then has the same length, the examples can be batched without padding, which suits causal language modeling.

In [4]:
block_size = 50
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
  # Concatenate all texts.
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # We drop the small remainder. We could pad it instead if the model supported
  # padding; customize this part to your needs.
  total_length = (total_length // block_size) * block_size
  # Split into chunks of block_size.
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result

dataset = dataset.map(group_texts, batched=True)

dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
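To make the chunking concrete, here is a minimal, self-contained sketch that applies the same concatenate-and-split logic to a toy batch. The token ids and the `toy_block_size` of 4 are made up for illustration; `chunk_examples` is a hypothetical helper that mirrors `group_texts` but does not depend on the dataset above.

```python
# Toy illustration of the concatenate-and-chunk step (all values are made up).
toy_block_size = 4

def chunk_examples(examples, block_size):
    """Concatenate each column and split it into fixed-size chunks."""
    # Flatten the list of token lists in every column into one long list.
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole chunk.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are a copy of the input ids.
    result["labels"] = result["input_ids"].copy()
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = chunk_examples(toy_batch, toy_block_size)
print(chunks["input_ids"])
```

The three toy verses contain ten tokens in total, so two chunks of four are produced and the last two tokens are dropped. Note that a chunk can span the boundary between two verses.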


Next, we create the model and add our new adapter. Because we create the model with the `AutoModelForCausalLM` class from the `transformers` package and not directly from `adapters`, we first need to enable adapter support by calling the `init()` function. Then we add the adapter, which we simply call `poem` since it is trained to create new poems. Finally, we activate it and prepare it for training with `train_adapter()`, which freezes the pretrained GPT-2 weights so that only the adapter parameters are updated.

In [5]:
from transformers import AutoModelForCausalLM
from adapters import init

model = AutoModelForCausalLM.from_pretrained("gpt2")
# Enable adapter support
init(model)
# Add new adapter
model.add_adapter("poem")
# Activate adapter for training
model.train_adapter("poem")

The last thing we need to do before we can start training is to create the trainer. As training arguments, we choose a learning rate of 5e-4 and three training epochs. Feel free to play around with the parameters and see how they affect the result.

In [6]:
from transformers import TrainingArguments
from adapters import AdapterTrainer
training_args = TrainingArguments(
  output_dir="./examples",
  do_train=True,
  remove_unused_columns=False,
  learning_rate=5e-4,
  num_train_epochs=3,
)


trainer = AdapterTrainer(
  model=model,
  args=training_args,
  tokenizer=tokenizer,
  train_dataset=dataset["train"],
  eval_dataset=dataset["validation"],
)

In [7]:
trainer.train()



TrainOutput(global_step=66, training_loss=5.651855931137547, metrics={'train_runtime': 11.5993, 'train_samples_per_second': 45.52, 'train_steps_per_second': 5.69, 'total_flos': 13614563635200.0, 'train_loss': 5.651855931137547, 'epoch': 3.0})
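The `global_step=66` in this output can be reconstructed with a little arithmetic. The sketch below assumes the default `per_device_train_batch_size` of 8 and a single device, neither of which is stated in the notebook. The number of training chunks is inferred from the logged runtime and throughput rather than read from the dataset.

```python
import math

# Values copied from the TrainOutput above.
train_runtime = 11.5993
train_samples_per_second = 45.52
num_train_epochs = 3

# Assumption: default TrainingArguments batch size, one device.
batch_size = 8

# Samples processed over the whole run, then per epoch.
total_samples = round(train_runtime * train_samples_per_second)
chunks_per_epoch = total_samples // num_train_epochs

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(chunks_per_epoch / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(chunks_per_epoch, steps_per_epoch, total_steps)
```

Under these assumptions, the run saw about 176 chunks of 50 tokens per epoch, which gives 22 optimizer steps per epoch and 66 in total, matching the logged value.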

Now that we have a trained adapter, we save it for future use.

In [8]:
model.save_adapter("adapter_poem", "poem")

Next, let's generate some poetry with our trained adapter. To do this, we create a fresh `GPT2LMHeadModel`, the GPT-2 variant with a language modeling head, which is well suited for text generation. Then we load our trained adapter and activate it. Finally, we have to choose the start of our poem. If you want your poem to start differently, just change `PREFIX` accordingly.

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Enable adapter support
init(model)
# Load the locally trained adapter that we saved above
model.load_adapter("adapter_poem")
model.set_active_adapters("poem")

PREFIX = "In the night"

For the generation, we first tokenize the prefix and then pass it to the model. In this case, we sample five possible continuations of the beginning we chose, each limited to 50 tokens in total (`max_length` counts the prefix tokens as well).

In [10]:
encoding = tokenizer(PREFIX, return_tensors="pt")
output_sequence = model.generate(
  input_ids=encoding["input_ids"],
  attention_mask=encoding["attention_mask"],
  do_sample=True,
  num_return_sequences=5,
  max_length=50,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
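With `do_sample=True`, each next token is drawn at random from the model's predicted distribution instead of always taking the most likely one, which is why the five continuations differ. The following toy sketch illustrates the idea on a made-up four-word vocabulary with invented logits. It is only an illustration of the principle, not how `generate()` is implemented internally.

```python
import math
import random

# Hypothetical vocabulary and logits, invented for illustration only.
vocab = ["moon", "stars", "wind", "sea"]
logits = [2.0, 1.0, 0.5, 0.1]

def softmax(xs):
    """Convert logits into probabilities that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding always picks the highest-probability token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling draws a token in proportion to its probability.
rng = random.Random(0)
sampled_tokens = rng.choices(vocab, weights=probs, k=5)

print(greedy_token)
print(sampled_tokens)
```

Greedy decoding would return the same continuation every time, while sampling can produce a different token on each draw.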


Lastly, we want to see what the model actually created. To do this, we decode the token ids back into text and cut each sequence at the first EOS token. You can easily use this code with another dataset. Don't forget to share your adapters at [AdapterHub](https://adapterhub.ml/).

In [11]:
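for generated_sequence_idx, generated_sequence in enumerate(output_sequence):
  print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
  generated_sequence = generated_sequence.tolist()

  # Decode the token ids back into text
  text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
  # Cut at the first end-of-sequence token. split() leaves the text unchanged
  # when no EOS token is present, whereas slicing with find() would drop the
  # last character in that case because find() returns -1.
  text = text.split(tokenizer.eos_token)[0]

  print(text)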

=== GENERATED SEQUENCE 1 ===
In the night,when it's still,how shall he sing,the king's favorite music.

for those of us who live in the old cities to-day:

all the streets and the rivers,the rivers that ran
=== GENERATED SEQUENCE 2 ===
In the night,he died at the barroom on the morning,a silent man, a widow who had borne the ill son of his trolls.the son was long standing and the son remembered him for his service.even the wisest ma
=== GENERATED SEQUENCE 3 ===
In the night that night, a man left a stone in his hand,and in his face he is seen--as he looked round him he could not see,and when he did,the two men walked,he turned his head with one though
=== GENERATED SEQUENCE 4 ===
In the night, the sun had a bright ray to dazzle upon earth.and how, though they call his name, is he like the angel of hell upon the world?when he, who lives in his dream, says,--what i
=== GENERATED SEQUENCE 5 ===
In the night!not from the earth's gates!the sweet scent of sweet-day."they said,what thou dost 
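One detail matters when cutting decoded text at the EOS marker: `str.find` returns `-1` if the marker is absent, and slicing up to `-1` silently removes the last character. The small self-contained sketch below uses the literal GPT-2 EOS string `<|endoftext|>` and made-up sample texts to compare slicing with `find` against using `split`. The helper names `cut_with_find` and `cut_with_split` are hypothetical.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-sequence token string

def cut_with_find(text):
    # Slices up to the EOS position; find() returns -1 when EOS is absent.
    return text[: text.find(EOS)]

def cut_with_split(text):
    # Keeps everything before the first EOS, or the whole text if none exists.
    return text.split(EOS)[0]

# Made-up sample texts: one padded with EOS tokens, one without any.
with_eos = "In the night, the wind" + EOS + EOS
without_eos = "In the night, the wind"

print(repr(cut_with_find(with_eos)), repr(cut_with_split(with_eos)))
print(repr(cut_with_find(without_eos)), repr(cut_with_split(without_eos)))
```

A sequence that reaches `max_length` without emitting an EOS token therefore loses its final character under the `find` variant, while `split` returns it intact.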