First we need to download the dataset. In this case we use a dataset containing poems, so that the model learns to create its own poems.

In [1]:
from datasets import load_dataset

dataset = load_dataset("poem_sentiment")
print(dataset)

Using custom data configuration default
Reusing dataset poem_sentiment (/home/eason/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})


Before training we need to preprocess the dataset. We tokenize the entries in the dataset and remove all columns we don't need to train the adapter.

In [2]:
from transformers import GPT2Tokenizer

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  encoding = tokenizer(batch["verse_text"])
  # For language modeling the labels need to be the input_ids.
  # They are added after chunking (see group_texts), so we do not set them here.
  return encoding

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# The GPT-2 tokenizer does not have a padding token. In order to process the data 
# in batches we set one here 
tokenizer.pad_token = tokenizer.eos_token
column_names = dataset["train"].column_names
dataset = dataset.map(encode_batch, remove_columns=column_names, batched=True)



  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

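Next we concatenate the documents in the dataset and create chunks with a length of `block_size`. This is beneficial for language modeling because the individual verses are short: concatenating them gives the model fixed-length training sequences that need no padding.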

In [3]:
block_size = 50
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
  # Concatenate all texts.
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # We drop the small remainder that does not fill a whole block. Padding it would be
  # an alternative if the model supported it; customize this part to your needs.
  total_length = (total_length // block_size) * block_size
  # Split into chunks of block_size.
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result

dataset = dataset.map(group_texts, batched=True)

dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
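
To make the chunking step concrete, here is a minimal sketch that applies the same concatenate-and-split logic to a tiny hand-written batch. The name `group_texts_demo`, the toy token ids and the block size of 4 are illustrative assumptions and not part of the notebook.

```python
from itertools import chain


def group_texts_demo(examples, block_size):
    """Concatenate all sequences per column and split them into equal blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Hypothetical batch of three tokenized "verses" with 3, 2 and 4 tokens.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}

grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The nine tokens yield two full blocks of four, and the ninth token is dropped. Note that a block can span the boundary between two verses.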

Next we create the model and add our new adapter. Let's just call it `poem`, since it is trained to create new poems. Then we activate it and prepare it for training.

In [4]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# add new adapter
model.add_adapter("poem")
# activate adapter for training
model.train_adapter("poem")

The last thing we need to do before we can start training is to create the trainer. As training arguments we choose a learning rate of 5e-4 and train for three epochs. Feel free to play around with the parameters and see how they affect the result.

In [21]:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
  output_dir="./examples", 
  do_train=True,
  remove_unused_columns=False,
  learning_rate=5e-4,
  num_train_epochs=3,
)


trainer = Trainer(
  model=model,
  args=training_args,
  tokenizer=tokenizer,
  train_dataset=dataset["train"],
  eval_dataset=dataset["validation"],
)

In [25]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=33, training_loss=5.555782896099669, metrics={'train_runtime': 5.5207, 'train_samples_per_second': 5.977, 'total_flos': 19852958822400.0, 'epoch': 3.0})
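
As a sanity check on `global_step=33`, the sketch below relates the number of optimizer steps to the number of training chunks. It assumes a batch size of 8 (the `Trainer` default per device), a single device, no gradient accumulation and no dropped last batch; none of these settings are stated explicitly in the notebook, and `expected_steps` is a hypothetical helper.

```python
import math


def expected_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps when every epoch processes ceil(n / batch_size) batches."""
    return num_epochs * math.ceil(num_examples / batch_size)


BATCH_SIZE = 8   # assumption: default per-device batch size, one device
NUM_EPOCHS = 3   # matches num_train_epochs above

# Which training-set sizes are consistent with the 33 steps reported above?
matching = [n for n in range(1, 200) if expected_steps(n, BATCH_SIZE, NUM_EPOCHS) == 33]
print(matching[0], matching[-1])  # 81 88
```

Under these assumptions the chunked training split holds between 81 and 88 blocks of 50 tokens, so each epoch sees only a few thousand tokens.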

Now that we have a trained adapter, we save it for future use.

In [26]:
model.save_adapter("adapter_poem", "poem")

With our trained adapter we want to create some poems. To do this we create a `GPT2LMHeadModel`, which is best suited for language generation. Then we load our trained adapter and activate it. Finally we have to choose the start of our poem. If you want your poem to start differently, just change `PREFIX` accordingly.

In [27]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Load the adapter we trained and saved locally above
model.load_adapter("adapter_poem")
model.set_active_adapters("poem")

PREFIX = "In the night"

For the generation we need to tokenize the prefix first and then pass it to the model. In this case we create five possible continuations for the beginning we chose.

In [7]:
encoding = tokenizer(PREFIX, return_tensors="pt")
output_sequence = model.generate(
  input_ids=encoding["input_ids"],
  attention_mask=encoding["attention_mask"],
  do_sample=True,
  num_return_sequences=5,
  max_length = 50,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
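
With `do_sample=True` the next token is drawn at random from the model's predicted distribution, which is why the same prefix can produce several different continuations. The following toy sketch illustrates the difference from always picking the most likely token. The three-token distribution is made up for illustration and does not come from the model.

```python
import random

# Hypothetical next-token probabilities after the prefix "In the night".
next_token_probs = {" sky": 0.5, " wind": 0.3, " sea": 0.2}


def greedy_pick(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
greedy = [greedy_pick(next_token_probs) for _ in range(5)]
sampled = [sample_pick(next_token_probs, rng) for _ in range(5)]

print(greedy)   # the same token five times
print(sampled)  # a mix of tokens, depending on the random draws
```

Greedy picking is deterministic, so it cannot produce five distinct continuations on its own, while sampling trades some coherence for variety.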


Lastly we want to see what the model actually created. To do this we need to decode the tokens from ids back to words and remove the end-of-sequence tokens. You can easily use this code with another dataset. Don't forget to share your adapters at [AdapterHub](https://adapterhub.ml/).

In [8]:
for generated_sequence_idx, generated_sequence in enumerate(output_sequence):
  print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
  generated_sequence = generated_sequence.tolist()

  # Decode text
  text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
  # Keep only the text before the first end-of-sequence token, if there is one.
  # (Slicing at text.find(...) would cut off the last character when the token
  # is absent, because find then returns -1.)
  text = text.split(tokenizer.eos_token)[0]

  print(text)

=== GENERATED SEQUENCE 1 ===
In the night, on Monday, and just the way of things went over in the home game.

The game came down to a narrow majority vote.

They finished third and then just after. Just so.


Just 
=== GENERATED SEQUENCE 2 ===
In the night out, Mr. Aoki was taken to the Todomac in North, where he died, with the help of his wife, with the consequence that that he died on the 4th November.

The following day i
=== GENERATED SEQUENCE 3 ===
In the night of December 28, 1968 in Berkeley, Berkeley with the intent to take down a long-haired man who had turned out to be Hitler. All around Berkeley, a large group of student workers, many of which had been in the mai
=== GENERATED SEQUENCE 4 ===
In the night, he is in an "out of heart".

"It is a surprise because he's one. He got a little boy. He gets his little boy boy," Mr Aamuja told the BBC Channel 1.
=== GENERATED SEQUENCE 5 ===
In the night heat, on cold roads he was an actor and a gun. "We played him and a script that the
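
A sequence that reaches `max_length` contains no end-of-sequence token at all. The sketch below uses plain strings (no model required) to show why slicing at `str.find` is risky in that case and how `str.split` avoids the problem. The example sentences are invented; `<|endoftext|>` is the GPT-2 end-of-sequence token.

```python
EOS = "<|endoftext|>"  # GPT-2 end-of-sequence token

with_eos = "In the night, the stars." + EOS + EOS
without_eos = "In the night, the stars"


def cut_with_find(text):
    """Slice at the first EOS; str.find returns -1 if EOS is missing."""
    return text[: text.find(EOS)]


def cut_with_split(text):
    """Keep everything before the first EOS, or the whole text if none."""
    return text.split(EOS)[0]


print(repr(cut_with_find(with_eos)))      # 'In the night, the stars.'
print(repr(cut_with_find(without_eos)))   # 'In the night, the star'  (last char lost)
print(repr(cut_with_split(with_eos)))     # 'In the night, the stars.'
print(repr(cut_with_split(without_eos)))  # 'In the night, the stars'
```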

In [9]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (invertible_adapters): ModuleDict()
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (attention_adapters): GPT2AttentionAdaptersModule(
          (adapters): ModuleDict()
          (adapter_fusion_layer): ModuleDict()
        )
        (output_adapters): GPT2OutputAdaptersModule(
          (adapters): ModuleDict(
            (poem): Adapter(
              (no