# Text Generation

This notebook helps to understand how text generation models may be further finetuned for certain domains. 
In the example below, at first we prompt a [GPT2 model](https://huggingface.co/openai-community/gpt2) with some medical condition.
Then we use a [medical dataset](https://huggingface.co/datasets/gamino/wiki_medical_terms) to try and improve the generation for this domain. 

In [16]:
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import torch
from transformers import pipeline

In [None]:
pipe = pipeline("text-generation", model="openai-community/gpt2", device = "cuda", model_kwargs={"max_length": 50})

In [5]:
prompt_generate = "I went to the bakery and ate a bagel. Now my stomach hurts. I think I have "
pipe(prompt_generate)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I went to the bakery and ate a bagel. Now my stomach hurts. I think I have icky poo and my mouth is burning.\n\nI tried to pick up the bagel and was kind of disappointed because it was so close, but that\'s okay. I got it next to me. I went to the bakery and I saw the bagel.\n\nWhen I opened the bagel I was so upset. I have a lot of friends and I\'m not alone. It\'s the only bagel in the Whole Foods. It\'s a bagel. I\'m kind of scared.\n\nI was crying when I saw the bagel.\n\nIt was so good. I was so happy.\n\nI started to cry. I was so happy.\n\nI was so happy.\n\nI was so happy.\n\nI felt so good.\n\nI was so happy.\n\nI had so many friends who have been there for me.\n\nI had so many friends who\'ve been there for me. I have so many friends who have been with me. I was so happy.\n\nMy friends are like, "This is my friend."\n\nMy friends are like, "This is my friend."\n\nMy friends are like, "This is my friend."\n\nI\'m so happy.'}]

In [12]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer

# Load dataset
dataset = load_dataset('gamino/wiki_medical_terms')
train_val = dataset["train"].train_test_split(
    test_size=0.2, seed=42)

dataset = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"]
})

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('openai-community/gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    inputs = tokenizer(examples['page_text'], truncation=True, padding='max_length', max_length=128)
    inputs['labels'] = inputs['input_ids'].copy()
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=['page_title', 'page_text', '__index_level_0__'] )

Map:   0%|          | 0/5488 [00:00<?, ? examples/s]

Map:   0%|          | 0/1373 [00:00<?, ? examples/s]

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['page_title', 'page_text', '__index_level_0__'],
        num_rows: 5488
    })
    validation: Dataset({
        features: ['page_title', 'page_text', '__index_level_0__'],
        num_rows: 1373
    })
})

In [14]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5488
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1373
    })
})

In [18]:
model =AutoModelForCausalLM.from_pretrained('openai-community/gpt2').to("cuda")

In [21]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='models/results',
    eval_strategy='epoch',
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='models/logs'
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

# Train the model
trainer.train()


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,2.898,2.694345
2,2.5937,2.653255


TrainOutput(global_step=2744, training_loss=2.771134757439527, metrics={'train_runtime': 107.3711, 'train_samples_per_second': 102.225, 'train_steps_per_second': 25.556, 'total_flos': 716985335808000.0, 'train_loss': 2.771134757439527, 'epoch': 2.0})

In [24]:

# save the model and tokenizer explicitly
model_output_dir = 'models/gpt2/'

model.save_pretrained(model_output_dir)
tokenizer.save_pretrained(model_output_dir)

('models/gpt2/tokenizer_config.json',
 'models/gpt2/special_tokens_map.json',
 'models/gpt2/vocab.json',
 'models/gpt2/merges.txt',
 'models/gpt2/added_tokens.json',
 'models/gpt2/tokenizer.json')

In [26]:
pipe_causal_lm_finetuned = pipeline("text-generation", model="models/gpt2", device = "cuda", model_kwargs={"max_length": 50})

Device set to use cuda


In [27]:
pipe(prompt_generate)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I went to the bakery and ate a bagel. Now my stomach hurts. I think I have ileotosis. I\'m in pain. I think I have diarrhea. I have vomiting. I am so tired I am vomiting. And I think I have a migraine."\n\n"You can\'t just say that. There are so many people who are in pain. There are people who have had surgery, cancer, and they\'re so tired. But they have no way to stop pain. There is no way to stop them from doing what they\'re doing. I think my stomach\'s going to be very unbalanced. I think it\'s going to be a very difficult day."\n\n"I feel so sick. I\'m going to come out of this with all my bones. I\'m going to fall ill. I\'m going to lose a lot of muscle. I\'m going to have a bad headache. I\'m going to have a bad break in my heart. I\'m going to have a bad, broken heart."\n\n"I just don\'t think I\'m going to be able to do something that will save me. I can\'t do it. I don\'t know what I am going to do. I can\'t do anything. I can\'t do anything. I can\'t d

# To Do
1. Discuss if there is any improvement in the text generation as a result of the prompting.
2. What happens if you swap the model with a different Decoder only model such as (gpt2-xl, google/gemma-2b, mistralai/Mistral-7B-v0.1)?
3. What happens if you finetune in a different domain?