First we need to download the dataset. In this case we use a dataset containing poems, so that the model learns to write its own poems.

In [1]:
from datasets import load_dataset

dataset = load_dataset("poem_sentiment")
print(dataset)

Using custom data configuration default
Reusing dataset poem_sentiment (/home/eason/.cache/huggingface/datasets/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})


Before training we need to preprocess the dataset. We tokenize the entries in the dataset and remove all columns we don't need to train the adapter.

In [2]:
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

In [3]:
def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  # The labels are added later in group_texts, once the texts have been chunked.
  return tokenizer(batch["verse_text"])

# Tokenize every split and drop the original columns, which are not needed for training.
column_names = dataset["train"].column_names
dataset = dataset.map(encode_batch, remove_columns=column_names, batched=True)




Next we concatenate the tokenized texts and split them into chunks of `block_size` tokens. This gives every training example the same length, so no padding is needed, which is convenient for language modeling.

In [4]:
block_size = 50
# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
  # Concatenate all texts.
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  # We drop the small remainder. We could pad instead if the model supported it;
  # customize this part to your needs.
  total_length = (total_length // block_size) * block_size
  # Split into chunks of block_size.
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result

dataset = dataset.map(group_texts, batched=True)

dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

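To see what `group_texts` does, here is a small sketch that runs the same logic on a hand-made batch. The `toy_batch` values and the tiny `block_size` of 4 are made up for illustration; the notebook itself uses 50.

```python
block_size = 4  # tiny value for illustration only; the notebook uses 50

def group_texts(examples):
    # Concatenate the lists of every column into one long list per column.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Cut each column into consecutive blocks of block_size items.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Hypothetical tokenized batch: three "verses" with 3, 2 and 5 tokens.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The two leftover tokens (9 and 10) are dropped. The labels are a plain copy of `input_ids` because causal language models in `transformers` shift the labels internally when computing the loss.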

Next we create the model and add our new adapter. Let's just call it `poem`, since it is trained to create new poems. Then we activate it and prepare it for training.

In [5]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

In [33]:

# add new adapter
model.add_adapter("poem")
# activate adapter for training
model.train_adapter("poem")

The last thing we need to do before we can start training is create the trainer. As training arguments we choose a learning rate of 5e-4 and 3 training epochs. Feel free to play around with the parameters and see how they affect the result.

In [35]:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
  output_dir="./examples", 
  do_train=True,
  remove_unused_columns=False,
  learning_rate=5e-4,
  num_train_epochs=3,
)


trainer = Trainer(
  model=model,
  args=training_args,
  tokenizer=tokenizer,
  train_dataset=dataset["train"],
  eval_dataset=dataset["validation"],
)

In [36]:
trainer.train()





TrainOutput(global_step=54, training_loss=4.2605831004955155, metrics={'train_runtime': 10.0689, 'train_samples_per_second': 5.363, 'total_flos': 26502744153600.0, 'epoch': 3.0})
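The `global_step=54` reported above can be sanity-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the default `per_device_train_batch_size` of 8, none of which the notebook states explicitly; the helper name `total_training_steps` is ours.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# 54 steps over 3 epochs means 18 steps per epoch.
steps_per_epoch = 54 // 3
print(steps_per_epoch)  # 18

# Which training-set sizes (in chunks) would give 54 steps under these assumptions?
matching = [n for n in range(1, 1000) if total_training_steps(n, 8, 3) == 54]
print(matching[0], matching[-1])  # 137 144
```

Under these assumptions the chunked training split would hold between 137 and 144 blocks of 50 tokens. We did not verify this number against the actual dataset.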

Now that we have a trained adapter, we can save it for future use. Below we use it directly to generate poems from a short prefix.

In [40]:
PREFIX = "what a "

In [41]:
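encoding = tokenizer(PREFIX, return_tensors="pt")
encoding = encoding.to(model.device)
# The tokenizer appends a closing [SEP] token; drop it with [:, :-1]
# so that the model continues the prefix instead of starting a new verse.
output_sequence = model.generate(
  input_ids=encoding["input_ids"][:, :-1],
  attention_mask=encoding["attention_mask"][:, :-1],
  do_sample=True,  # sample instead of decoding greedily, so the sequences differ
  num_return_sequences=5,
  max_length=50,  # total length in tokens, including the prefix
)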

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Lastly we want to see what the model actually created. To do this we need to decode the tokens from ids back to words and cut the text at the padding token. You can easily use this code with another dataset. Don't forget to share your adapters at [AdapterHub](https://adapterhub.ml/).

In [42]:
for generated_sequence_idx, generated_sequence in enumerate(output_sequence):
  print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
  generated_sequence = generated_sequence.tolist()

  # Decode the token ids back to text
  text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
  # Cut the text at the first padding token, if there is one.
  # str.find returns -1 when nothing is found, which would otherwise drop the last character.
  pad_index = text.find(tokenizer.pad_token)
  if pad_index != -1:
    text = text[:pad_index]

  print(text)

=== GENERATED SEQUENCE 1 ===
[CLS] what a little wonder does [SEP] [CLS] within that worst [SEP] [CLS] that is a truth. [SEP] [CLS] truth was in trup. and he bed a spread : [SEP] [CLS] and fa
=== GENERATED SEQUENCE 2 ===
[CLS] what a great mother, [SEP] [CLS] and i have your broos or not that! this, whose stranged, [SEP] [CLS] and that even bleed to [SEP] [CLS] in their time to do the poo
=== GENERATED SEQUENCE 3 ===
[CLS] what a snow father below : in folin ， like noone ， he likeed means on which a snow ; [SEP] [CLS] their eyes of your growt ， and likely grown 
=== GENERATED SEQUENCE 4 ===
[CLS] what a situations to me, [SEP] [CLS] with me, [SEP] [CLS] what a worrys is, and they't [SEP] [CLS] that there'' their name and i mugul! when, the eagle th
=== GENERATED SEQUENCE 5 ===
[CLS] what a sack when quink [SEP] [CLS] the mercies seed [UNK] while in the moon, [SEP] [CLS] and my father i will looking be a seats and the dar, [SEP] [CLS] a sea'
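
The generated text still contains the `[CLS]` and `[SEP]` markers that the tokenizer puts around each verse. As a possible post-processing step, the sketch below cuts the text at the first padding token only if one is present and then splits it into verses. The helper names `truncate_at` and `split_verses` and the `sample` string are our own illustrations, not part of the notebook.

```python
def truncate_at(text, marker):
    """Return text up to the first marker, or all of text if marker is absent."""
    index = text.find(marker)
    return text if index == -1 else text[:index]

def split_verses(text, cls="[CLS]", sep="[SEP]"):
    """Split decoded text into verses using the [CLS] ... [SEP] markers."""
    verses = []
    for piece in text.split(sep):
        verse = piece.replace(cls, "").strip()
        if verse:  # skip empty pieces left over after the last [SEP]
            verses.append(verse)
    return verses

# Hand-made sample in the style of the output above (hypothetical, with padding added).
sample = "[CLS] what a great mother, [SEP] [CLS] and that even bleed to [SEP] [PAD] [PAD]"
clean = truncate_at(sample, "[PAD]")
print(split_verses(clean))  # ['what a great mother,', 'and that even bleed to']
```

Checking for `-1` matters because `str.find` returns `-1` when the marker is missing, and slicing with `text[:-1]` would then silently drop the last character.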
