# Text Generation

This notebook shows how a text generation model can be finetuned further for a specific domain.
In the example below, we first prompt a [GPT2 model](https://huggingface.co/openai-community/gpt2) with a description of a medical complaint.
We then finetune the model on a [medical dataset](https://huggingface.co/datasets/gamino/wiki_medical_terms) to try to improve generation for this domain.

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, pipeline

In [2]:
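# Note: model_kwargs is forwarded to from_pretrained(); the recorded output below runs well past 50 tokens.
# To cap the length, pass max_new_tokens when calling the pipeline instead.
pipe = pipeline("text-generation", model="openai-community/gpt2", device="cuda", model_kwargs={"max_length": 50})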

Device set to use cuda


In [3]:
prompt_generate = "I went to the bakery and ate a bagel. Now my stomach hurts. I think I have "
pipe(prompt_generate)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I went to the bakery and ate a bagel. Now my stomach hurts. I think I have ileostomy. I need to eat more food. I don\'t want to go to the doctor, I just want to go to the doctor and get better."\n\nHe then gave his family a short video description of the procedure, which was posted on his Facebook page.\n\n"I\'m sorry, but I am a full-fledged cancer survivor, so I can\'t speak for the entire community."\n\nHe was also seen on Twitter, writing: "Please help me get better. I am so sick. I am trying to be good to everyone. I can\'t even go to the doctor. Please find me a doctor."\n\nHe later posted a video on Instagram of himself walking down the street, standing at a gas station and talking to customers.\n\n"I want to thank all of the people I met and all the people who came to visit me. I\'m so grateful. Thank you so much."\n\nThe mother of a friend, who also called for help after the incident, said she had seen her son on Twitter and that she was very upset.\n\n"I\
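The pipeline returns a list of dictionaries, and each `generated_text` value starts with the prompt itself. When comparing the base and finetuned models it can help to look only at the continuation. The helper below is a minimal sketch of that idea; `extract_continuation` is a hypothetical name, not part of `transformers`, and the example input is a shortened stand-in for the output above.

```python
def extract_continuation(outputs, prompt):
    """Return only the newly generated text for each pipeline result."""
    continuations = []
    for item in outputs:
        text = item["generated_text"]
        # The text-generation pipeline echoes the prompt by default, so drop it if present.
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


# Stand-in for a pipeline result, shortened from the output above.
example_prompt = "I think I have "
example_outputs = [{"generated_text": "I think I have ileostomy. I need to eat more food."}]
print(extract_continuation(example_outputs, example_prompt))
```

With the real pipeline, passing `return_full_text=False` at call time should have a similar effect without post-processing.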

In [6]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer

# Load dataset
dataset = load_dataset('gamino/wiki_medical_terms')
train_val = dataset["train"].train_test_split(
    test_size=0.2, seed=42)

dataset = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"]
})

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('openai-community/gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    inputs = tokenizer(examples['page_text'], truncation=True, padding='max_length', max_length=128)
    inputs['labels'] = inputs['input_ids'].copy()
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=['page_title', 'page_text', '__index_level_0__'] )
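In the cell above, the pad token is set to the EOS token and `labels` is a plain copy of `input_ids`, so any padded positions also count toward the training loss. A common alternative is to set the label to `-100` wherever `attention_mask` is 0, since `-100` is the default ignore index of PyTorch's cross-entropy loss. The sketch below shows that masking on a toy batch; `mask_padding_in_labels` is a hypothetical helper, and this is not what the notebook ran, so the recorded losses would differ.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_in_labels(batch):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    batch["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


# Toy batch: 50256 is the GPT-2 EOS id that the notebook reuses for padding.
toy_batch = {
    "input_ids": [[101, 102, 103, 50256, 50256]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
print(mask_padding_in_labels(toy_batch)["labels"])
```

Inside `tokenize_function`, one option would be to call this helper on `inputs` in place of the `.copy()` line.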

In [5]:
dataset["train"][0]

{'page_title': 'Amédée Galzin',
 'page_text': 'Amédée Galzin (1 May 1853, Parrinet, Aveyron – 14 February 1925, Parrinet) was a French veterinarian and mycologist.\nIn 1878 he obtained his degree from the veterinary college in Toulouse. From 1879 to 1905, he served as a military veterinarian, becoming a knight of the Legion of Honour in 1899.With Abbé Hubert Bourdot, he was co-author of a series of publications (11 parts, 1909 to 1925) involving Hymenomycetes native to France; all parts being published in the Bulletin de la Société Mycologique de France. With Bourdot, he also wrote Heterobasidiae nondum descriptae (Descriptions of a few jelly fungi).With Bourdot, he was the taxonomic authority of the fungi genus Oxyporus, as well as of numerous mycological species.\n\n\n== References ==',
 '__index_level_0__': 3481}

In [10]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5488
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1373
    })
})

In [12]:
model = AutoModelForCausalLM.from_pretrained('openai-community/gpt2').to("cuda")

In [13]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='models/results',
    eval_strategy='epoch',
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='models/logs'
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

# Train the model
trainer.train()


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.898         | 2.694345        |
| 2     | 2.5937        | 2.653255        |
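Cross-entropy losses like these are often easier to interpret as perplexity, which is the exponential of the mean per-token loss. The snippet below is a small sketch that converts the two validation losses from the table above. It assumes the loss is a mean in nats per token, and because padded positions are included in the labels in this notebook, the result is only a rough guide.

```python
import math

# Validation losses copied from the table above (mean cross-entropy in nats per token).
validation_loss = {1: 2.694345, 2: 2.653255}

perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}
for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: validation perplexity = {ppl:.2f}")
```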


TrainOutput(global_step=2744, training_loss=2.771134757439527, metrics={'train_runtime': 108.9743, 'train_samples_per_second': 100.721, 'train_steps_per_second': 25.18, 'total_flos': 716985335808000.0, 'train_loss': 2.771134757439527, 'epoch': 2.0})
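The numbers in `TrainOutput` can be cross-checked against the training configuration. The sketch below recomputes the step count and throughput from values that appear in this notebook, assuming a single device and no gradient accumulation (neither is stated explicitly here).

```python
import math

# Values copied from the notebook's configuration and outputs.
num_train_rows = 5488
per_device_train_batch_size = 4
num_train_epochs = 2
warmup_steps = 500
train_runtime_s = 108.9743

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_fraction = warmup_steps / total_steps
samples_per_second = num_train_rows * num_train_epochs / train_runtime_s

print(total_steps, round(warmup_fraction, 3), round(samples_per_second, 2))
```

Under these assumptions the total matches `global_step=2744`, and the 500 warmup steps cover roughly 18% of the run, which is worth keeping in mind when changing the dataset size or the number of epochs.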

In [15]:

# save the model and tokenizer explicitly
model_output_dir = 'models/gpt2/'

model.save_pretrained(model_output_dir)
tokenizer.save_pretrained(model_output_dir)

('models/gpt2/tokenizer_config.json',
 'models/gpt2/special_tokens_map.json',
 'models/gpt2/vocab.json',
 'models/gpt2/merges.txt',
 'models/gpt2/added_tokens.json',
 'models/gpt2/tokenizer.json')

In [16]:
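# Note: as with the base pipeline, model_kwargs is forwarded to from_pretrained(); the recorded output below is much longer than 50 tokens.
pipe_causal_lm_finetuned = pipeline("text-generation", model="models/gpt2", device="cuda", model_kwargs={"max_length": 50})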

Device set to use cuda


In [17]:
pipe_causal_lm_finetuned(prompt_generate)

[{'generated_text': 'I went to the bakery and ate a bagel. Now my stomach hurts. I think I have  stomachache. I eat a lot of bread and some raw meat. I have a small appetite. I feel tired. The next day, I feel like I am in a coma. I feel sick, but I am not in a coma either. I am in my mid-30s now. I am in a state of shock, but I am still recovering from my stomachache.My stomachache is caused by the presence of a gastric acid, usually at the site of the stomach attack. This can be caused by the presence of a gastric acid, in the lower intestine, or the presence of a large acid in the stomach. It is most often caused by the presence of a large androgen receptor-positive gastric acid (GARPA).\nI am not a risk factor for gastric acidosis. However, in my experience, GARPA is one of the most common risk factors, and is extremely rare.\n\nCauses\nThe most common cause of gastric acidosis is a drug-induced gastric acidosis.  It is triggered by drugs interfering with the enzyme GABAA.  This le

# To Do
1. Discuss whether finetuning improved the generated text.
2. What happens if you swap the model for a different decoder-only model, such as gpt2-xl, google/gemma-2b, or mistralai/Mistral-7B-v0.1?
3. What happens if you finetune on a different domain?
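For the first item, one crude way to put a number on a generation is to measure how repetitive it is. The sketch below computes a distinct-n ratio, the share of word n-grams that are unique. `distinct_n` is a hypothetical helper written for this illustration, the sample string is a made-up repetition built from a phrase in the finetuned output, and the metric says nothing about medical correctness.

```python
def distinct_n(text, n=2):
    """Fraction of word n-grams in `text` that are unique (1.0 means no repeats)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Illustrative input only: a phrase from the finetuned output, repeated twice.
sample = "caused by the presence of a gastric acid caused by the presence of a gastric acid"
print(distinct_n(sample, n=2))
```

Applying the same function to the continuations from both pipelines gives one rough point of comparison, to be read alongside a manual judgment of fluency and relevance.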