#### LLM Fine-tuning
* First, use a T5 model to perform a given task
* Then set up and run fine-tuning on the model to improve its accuracy on that task

In [1]:
import transformers
import datasets
from datasets import load_dataset

In [2]:
imdb_ds = load_dataset("imdb")

Found cached dataset imdb (/home/daniel/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)



In [3]:
train = imdb_ds['train']
test = imdb_ds['test']


In [4]:
model_checkpoint = "t5-base"

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_auto = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [15]:
model_checkpoint_alt = "google/flan-t5-base"
model_auto_alt = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint_alt)
tokenizer_alt = AutoTokenizer.from_pretrained(model_checkpoint_alt)

In [6]:
lb_map = {0: 'negative', 1: 'positive', -1: 'neutral'}

In [7]:
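def process(x):
    # Map integer labels to the words T5 should generate as targets.
    target_labels = [lb_map[y] for y in x["label"]]
    # Tokenize reviews (inputs) and label words (targets, stored under "labels").
    token_res = tokenizer(
        x["text"],
        text_target=target_labels,
        return_tensors="pt",
        truncation=True,  # cut reviews longer than the tokenizer's maximum length
        padding=True,  # pad to the longest sequence within each mapped batch
    )
    return token_res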

In [8]:
tokenized_dataset = imdb_ds.map(
    process, batched=True, remove_columns=["text", "label"]
)
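Because `padding=True` is applied inside a batched `map`, each review is padded to the longest review in its map batch (1,000 examples by default in `datasets`) rather than in its training batch of 16. A rough sketch of what that costs, using made-up token lengths and a hypothetical helper named `padded_tokens` (neither is part of the notebook):

```python
def padded_tokens(lengths, batch_size):
    """Total tokens stored when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Made-up token lengths for illustration only; not measured from IMDB.
lengths = [120, 512, 80, 300, 95, 60, 450, 200]

print(padded_tokens(lengths, batch_size=8))  # padding decided over one large batch
print(padded_tokens(lengths, batch_size=2))  # padding decided over small batches
```

Smaller padding groups waste fewer tokens, which is the usual argument for leaving padding to the data collator at training time.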

In [12]:
origin = "translate Spanish to English: Los estudios han demostrado que es bueno tener un perro."
inputs = tokenizer(origin, return_tensors="pt")

In [13]:
pred = model_auto.generate(**inputs)

In [14]:
tokenizer.decode(pred[0])

'<pad> El estudio es muy agradable y es mu'

In [81]:
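def generate(model, tokenizer):
    """Return a function that runs `model` on one prompt and decodes the output."""
    def generate_fn(text):
        inputs = tokenizer(text, return_tensors="pt")
        gen = model.generate(**inputs)
        # Drop special tokens such as <pad> and </s> from the decoded text.
        return tokenizer.batch_decode(gen, skip_special_tokens=True)
    return generate_fn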

In [54]:
generate_fn = generate(model_auto, tokenizer)
generate_fn("translate Spanish to English: Los estudios han demostrado que es bueno tener un perro.")

['<pad> El estudio es muy agradable y es mu']
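The output above stops mid-word ("es mu"), which is consistent with `model.generate` falling back to its default length limit (20 tokens in the `transformers` releases this notebook appears to use). A minimal sketch, assuming a hypothetical variant of the helper named `generate_with_kwargs` that forwards generation options such as `max_new_tokens`:

```python
def generate_with_kwargs(model, tokenizer, **gen_kwargs):
    """Hypothetical variant of `generate` that forwards options to model.generate."""
    def generate_fn(text):
        inputs = tokenizer(text, return_tensors="pt")
        # Extra keyword arguments (e.g. max_new_tokens) go straight to generate().
        gen = model.generate(**inputs, **gen_kwargs)
        return tokenizer.batch_decode(gen, skip_special_tokens=True)
    return generate_fn


# Possible usage in this notebook (not run here; 64 is an arbitrary limit):
# generate_fn_long = generate_with_kwargs(model_auto, tokenizer, max_new_tokens=64)
# generate_fn_long("translate English to German: The house is wonderful.")
```

A longer limit only removes the truncation; it does not make the base model better at a language pair it was not trained on.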

In [55]:
generate_fn_alt = generate(model_auto_alt, tokenizer_alt)
generate_fn_alt("translate to French: Los estudios han demostrado que es bueno tener un perro.")


["<pad> Les études ont montré que il est bien d'avoir un chien.</s>"]

In [41]:
# Count trainable parameters (those with requires_grad=True) for each model.
total_params = sum(p.numel() for p in model_auto.parameters() if p.requires_grad)
print(f"{total_params:,} total parameters for {model_checkpoint}")

total_params = sum(p.numel() for p in model_auto_alt.parameters() if p.requires_grad)
print(f"{total_params:,} total parameters for {model_checkpoint_alt}")


222,903,552 total parameters for t5-base
247,577,856 total parameters for google/flan-t5-base
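The gap between the two counts equals exactly one vocabulary-by-hidden-size matrix. A quick check, assuming both checkpoints use a vocabulary of 32,128 entries and a hidden size of 768 (values taken from their published configs, not shown in this notebook). One plausible reading is that FLAN-T5 keeps a separate output projection while `t5-base` shares it with the input embedding:

```python
t5_base_params = 222_903_552       # printed above for t5-base
flan_t5_base_params = 247_577_856  # printed above for google/flan-t5-base

vocab_size = 32_128  # assumption: from the model configs
d_model = 768        # assumption: from the model configs

difference = flan_t5_base_params - t5_base_params
one_vocab_matrix = vocab_size * d_model
print(f"{difference:,} extra parameters; vocab_size * d_model = {one_vocab_matrix:,}")
```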


In [68]:
text = "Translate to Spanish: " + train[0]['text']
text

'Translate to Spanish: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

In [69]:
generate_fn(text)

['<pad> sex and nudity scenes are a major staple in Swedish cinema. ']

In [70]:
generate_fn_alt(text)

['<pad> El filme es un buen filme para cualquier persona que']

In [50]:
train[0]['text'][0:242]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore '

In [71]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    "test-llm",  # output directory for checkpoints and TensorBoard logs
    num_train_epochs=2,
    per_device_train_batch_size=16,
    optim="adamw_torch",
    report_to=["tensorboard"],
)

In [72]:
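from transformers import DataCollatorWithPadding, Trainer
# DataCollatorWithPadding pads only the tokenizer inputs, so this setup relies on
# process() having already padded the labels. DataCollatorForSeq2Seq is the
# usual choice for seq2seq models.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model_auto,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)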

In [73]:
tensorboard_display_dir = "test-llm/runs"

In [74]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


TrainOutput(global_step=3126, training_loss=0.11657680415680069, metrics={'train_runtime': 1485.1983, 'train_samples_per_second': 33.666, 'train_steps_per_second': 2.105, 'total_flos': 3.0447894528e+16, 'train_loss': 0.11657680415680069, 'epoch': 2.0})
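The numbers in `TrainOutput` can be cross-checked against the training setup. A small sketch, assuming a single device and no gradient accumulation (neither is stated in the notebook):

```python
import math

num_train_examples = 25_000  # size of the IMDB train split
batch_size = 16              # per_device_train_batch_size
num_epochs = 2               # num_train_epochs
train_runtime = 1485.1983    # seconds, from TrainOutput

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
global_step = steps_per_epoch * num_epochs
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime

print(global_step, round(samples_per_second, 3), round(steps_per_second, 3))
```

Under those assumptions the step count and both throughput figures line up with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.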

In [75]:
trainer.save_model("test-llm")

In [76]:
model_finetuned = AutoModelForSeq2SeqLM.from_pretrained("test-llm")

In [77]:
tokenizer_finetuned = AutoTokenizer.from_pretrained("test-llm")

In [82]:
generate_fn_finetuned = generate(model_finetuned, tokenizer_finetuned)

In [84]:
generate_fn_finetuned(text)

['positive']

In [85]:
test

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [86]:
generate_fn_finetuned(test[0]['text'])

['negative']

In [87]:
generate_fn_finetuned(test[1]['text'])

['negative']

In [91]:
test[1]['label']

0
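The spot checks above compare single predictions by eye. A minimal sketch of measuring accuracy over a sample, assuming a hypothetical helper named `accuracy` and reusing the `lb_map` label names; `predict_fn` stands for any function with the same shape as `generate_fn_finetuned` (text in, list of strings out):

```python
lb_map = {0: "negative", 1: "positive", -1: "neutral"}


def accuracy(predict_fn, examples):
    """Fraction of examples whose generated word matches the label name."""
    correct = 0
    total = 0
    for example in examples:
        predicted = predict_fn(example["text"])[0].strip().lower()
        correct += predicted == lb_map[example["label"]]
        total += 1
    return correct / total if total else 0.0


# Possible usage in this notebook (not run here; 200 is an arbitrary sample size):
# sample = test.shuffle(seed=0).select(range(200))
# print(accuracy(generate_fn_finetuned, sample))
```

Running the same helper on the base model before fine-tuning would give the baseline needed to say whether accuracy actually improved.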

In [94]:
generate_fn_finetuned("I hate to say, but I love this movie so much.")

['positive']

In [101]:
review = """The Idol presents a provocative view of the entertainment industry, showing the highs and lows experienced by the stars. This show, directed by Sam Levinson who made Euphoria continues to push boundaries and weird storylines.

Lily-Rose Depp gave us an extraordinary performance. Some may not enjoy it but i did. And it sure pulled viewers deeper into the narrative. As the episode kept going some probably couldnt complete it but the authenticity of the story cannot be ignored so you must finish it.

In a world where real-life issues often get ignored, this show pushes viewers to confront uncomfortable truths, creating an experience that stucks in the minds of the viewers."""
generate_fn_finetuned(review)

['positive']

In [102]:
review = """translate to english: Los perros son los mejores amigos del hombre."""
generate_fn_finetuned(review)

['en en en en en en en en en ']
