# Fine-tuning FairBERTa

This notebook follows the Hugging Face NLP course tutorial on fine-tuning a masked language model: <https://huggingface.co/learn/nlp-course/en/chapter7/3?fw=pt>. The original tutorial uses a **BERT**-based masked model; here we use the **RoBERTa-based FairBERTa**.

In [2]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

In [2]:
text = "This is a great <mask>."

In [3]:
model_checkpoint = 'facebook/FairBERTa'
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at facebook/FairBERTa and are newly initialized: ['lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of the <mask> token and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the <mask> candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great  deprecated.'
'>>> This is a great  ridden.'
'>>> This is a great  overpowered.'
'>>> This is a great  outnumbered.'
'>>> This is a great aze.'


In [4]:
fairberta_filler = pipeline("fill-mask", model=model_checkpoint)
text = "The director was great in this movie. <mask> was spectacular."
preds = fairberta_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at facebook/FairBERTa and are newly initialized: ['lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


>>> The director was great in this movie. sprang was spectacular.
>>> The director was great in this movie. antim was spectacular.
>>> The director was great in this movie.cult was spectacular.
>>> The director was great in this movie.olving was spectacular.
>>> The director was great in this movie.IUM was spectacular.


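Interestingly, unlike RoBERTa, FairBERTa predicts near-random tokens before fine-tuning. The loading warning above explains why: the checkpoint does not contain the `lm_head.*` weights, so the masked-language-modeling head is newly (randomly) initialized.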

In [5]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [6]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

In [7]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [8]:
tokenizer.model_max_length

512

In [9]:
chunk_size = 128

In [10]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 360'
'>>> Review 1 length: 286'
'>>> Review 2 length: 121'


In [11]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 767'


In [12]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 127'
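
The chunk counts above follow from integer division. As a quick check, the sketch below reworks the arithmetic in plain Python from the three review lengths printed earlier (360, 286, 121) and `chunk_size = 128`; the variable names are illustrative only.

```python
# Token counts of the three sample reviews, as printed above
review_lengths = [360, 286, 121]
chunk_size = 128

total_length = sum(review_lengths)
full_chunks, remainder = divmod(total_length, chunk_size)

# Lengths of the chunks produced by slicing in steps of chunk_size
chunk_lengths = [
    min(chunk_size, total_length - start)
    for start in range(0, total_length, chunk_size)
]

print(f"total={total_length}, full chunks={full_chunks}, remainder={remainder}")
print(chunk_lengths)
```

The short final chunk is the part that gets discarded once the total length is rounded down to a multiple of `chunk_size`.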


In [13]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [14]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 58910
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 57570
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 118147
    })
})

In [15]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

" issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably"

In [16]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)


In [17]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> <s>I rented I Pig<mask>URI<mask>-YELLOW from my video store because of all<mask><mask> that surrounded it<mask> it was first released<mask> 1967. I also heard that at first it<mask> seized by U.S. customs if it<mask> tried to enter this country, therefore being a fan<mask> films considered "controversial" I really had to see this for myself<mask>br /catentrybr />The<mask> is centered around a young Swedish drama student<mask>restling who wants to learn everything she can about life. In particular she wants to focus her attentions deities making some<mask> of documentary on whatfig average Swede thought about certain political'

'>>>  issues such as the Vietnam War<mask> race issues in the United States.<mask> between<mask> politicians and ordinary denizens of Stockholm about their opinions on politics, she<mask> sex with her drama teacher, classmates, and<mask> men.<br /><br /><mask> kills me about I AM<mask>URIOUSpsych<mask>ELLOW is that 40 years ago, this was NS<mask>. Really,<
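
The decoded chunks contain not only `<mask>` tokens but also unrelated tokens such as `Pig` or `deities`. This matches the default behaviour of `DataCollatorForLanguageModeling`: of the positions selected with `mlm_probability`, roughly 80% become the mask token, 10% become a random vocabulary token, and 10% are left unchanged, while every unselected position gets the label `-100` so the loss ignores it. The following is a simplified pure-Python sketch of that rule, not the library's implementation; `mask_tokens`, the toy ids, the mask id, and the vocabulary size are all assumptions made for illustration.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Toy version of BERT-style masking for a single sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # about 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining ~10%: keep the original token
    return masked, labels


toy_ids = list(range(1000, 1200))  # hypothetical token ids
masked, labels = mask_tokens(toy_ids, mask_id=4, vocab_size=500)
selected = [i for i, label in enumerate(labels) if label != -100]
print(f"{len(selected)} of {len(toy_ids)} positions selected for prediction")
```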

In [18]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
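
The mapping step in the collator groups token positions by the word they belong to, skipping special tokens whose word id is `None`. Below is a minimal standalone sketch of just that step, run on a made-up `word_ids` list; the helper name `group_tokens_by_word` is hypothetical.

```python
import collections


def group_tokens_by_word(word_ids):
    """Map a running word index to the token positions that make up that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special tokens such as <s> and </s> have no word id
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# Made-up example: <s>, a one-token word, a three-token word, a one-token word, </s>
word_ids = [None, 0, 1, 1, 1, 2, None]
word_to_tokens = group_tokens_by_word(word_ids)
print(word_to_tokens)
```

Masking word 1 in this toy example would mask token positions 2, 3 and 4 together, which is the point of whole word masking.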

In [19]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> <s><mask> rented I AM CURIOUS-<mask><mask><mask><mask> my video<mask> because of all the controversy that surrounded it<mask><mask> was<mask> released in 1967. I<mask> heard that at first it was seized by U.S. customs<mask> it<mask> tried to enter this<mask>, therefore being<mask><mask> of films considered "controversial" I<mask> had<mask> see this for myself<mask>br<mask><mask>br<mask><mask> plot is<mask> around a<mask><mask><mask> student<mask> Lena who wants to learn everything she can about life.<mask> particular<mask> wants to focus her attentions<mask> making some sort<mask> documentary on what the<mask><mask><mask> thought about certain political'

'>>>  issues such as the<mask> War and race issues<mask> the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions<mask> politics<mask> she has sex with<mask> drama teacher,<mask>,<mask> married men<mask>br /><br />What kills me about I AM CURIOUS-YELLOW is<mask> 40 years<mask>,<mas

In [20]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [21]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=logging_steps,
)



In [22]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)


In [23]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 276970790.03



In [24]:
trainer.train()

{'loss': 6.9105, 'grad_norm': 7.788623332977295, 'learning_rate': 1.337579617834395e-05, 'epoch': 0.99}
{'eval_loss': 4.783680438995361, 'eval_runtime': 17.5666, 'eval_samples_per_second': 56.926, 'eval_steps_per_second': 0.911, 'epoch': 1.0}
{'loss': 4.5902, 'grad_norm': 7.977917194366455, 'learning_rate': 6.751592356687898e-06, 'epoch': 1.99}
{'eval_loss': 4.132420063018799, 'eval_runtime': 22.1342, 'eval_samples_per_second': 45.179, 'eval_steps_per_second': 0.723, 'epoch': 2.0}
{'loss': 4.2198, 'grad_norm': 8.551613807678223, 'learning_rate': 1.2738853503184715e-07, 'epoch': 2.98}
{'eval_loss': 3.999345541000366, 'eval_runtime': 20.3122, 'eval_samples_per_second': 49.231, 'eval_steps_per_second': 0.788, 'epoch': 3.0}
{'train_runtime': 1901.7977, 'train_samples_per_second': 15.775, 'train_steps_per_second': 0.248, 'train_loss': 5.232855713797729, 'epoch': 3.0}


TrainOutput(global_step=471, training_loss=5.232855713797729, metrics={'train_runtime': 1901.7977, 'train_samples_per_second': 15.775, 'train_steps_per_second': 0.248, 'total_flos': 1974490974720000.0, 'train_loss': 5.232855713797729, 'epoch': 3.0})
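
The `learning_rate` values in the log are consistent with a linear decay from 2e-5 to zero over the 471 optimizer steps, which is the schedule `Trainer` uses by default when no warmup is configured. The sketch below reproduces the logged values under that assumption; `linear_lr` is a hypothetical helper, and the logging interval is derived the same way as `logging_steps` above.

```python
import math

base_lr = 2e-5
train_size, batch_size, num_epochs = 10_000, 64, 3

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs  # matches global_step=471
logging_steps = train_size // batch_size


def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Learning rate after `step` optimizer steps: linear decay, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


for step in range(logging_steps, total_steps + 1, logging_steps):
    print(f"step {step}: lr = {linear_lr(step):.6e}")
```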

In [25]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 53.86
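
Perplexity here is the exponential of the mean cross-entropy loss, so each reported value can be converted back to a loss. A short sketch of the conversion follows; the vocabulary size of 50,265 is an assumption carried over from RoBERTa-base and is used only to give a uniform-guess baseline.

```python
import math


def perplexity(loss):
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(loss)


# Perplexities reported in this notebook, before and after Trainer fine-tuning
ppl_before, ppl_after = 276970790.03, 53.86
loss_before = math.log(ppl_before)
loss_after = math.log(ppl_after)
print(f"loss before fine-tuning: {loss_before:.2f}")
print(f"loss after fine-tuning:  {loss_after:.2f}")

# Baseline: a model that spreads probability uniformly over the vocabulary.
# ASSUMPTION: vocabulary size of RoBERTa-base; check tokenizer.vocab_size.
assumed_vocab_size = 50_265
uniform_loss = math.log(assumed_vocab_size)
print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"uniform-guess perplexity: {perplexity(uniform_loss):.0f}")
```

Under that assumption the initial perplexity is far above the uniform baseline, which fits a randomly initialized head that is confidently wrong rather than merely uninformed. The perplexity also varies slightly between evaluation runs because the collator draws new random masks each time.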


In [26]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
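
With `batched=True`, `map` passes a dict of columns, whereas the data collator expects a list of per-example dicts. The `zip` expression in `insert_random_mask` performs that transposition. A toy illustration with made-up values:

```python
# A batch as passed by Dataset.map(batched=True): one list per column (toy values)
batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 22, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1]],
}

# Transpose into one dict per example, the shape a data collator expects
features = [dict(zip(batch, values)) for values in zip(*batch.values())]

for feature in features:
    print(feature)
```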

In [27]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [28]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [29]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [30]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [31]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [32]:
model_name = "fairberta-imdb"
output_dir = model_name

In [33]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

>>> Epoch 0: Perplexity: 24.73842177666774
>>> Epoch 1: Perplexity: 18.27911835014019
>>> Epoch 2: Perplexity: 34.1904029830448
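
In the evaluation loop each batch contributes a single mean loss, which is repeated `batch_size` times and then truncated to `len(eval_dataset)`. With 1,000 examples and a batch size of 64, the last batch holds only 40 examples, and the truncation keeps it from being over-weighted. A pure-Python sketch of this effect, assuming a single process and using placeholder loss values that are not taken from the run:

```python
batch_size = 64
num_examples = 1000

# Sizes of the evaluation batches: full batches plus one shorter final batch
batch_sizes = [
    min(batch_size, num_examples - start)
    for start in range(0, num_examples, batch_size)
]

# PLACEHOLDER per-batch mean losses, for illustration only
batch_losses = [2.0 + 0.1 * i for i in range(len(batch_sizes))]

# What the loop does: repeat each loss batch_size times, then truncate
repeated = [loss for loss in batch_losses for _ in range(batch_size)]
truncated = repeated[:num_examples]
loop_mean = sum(truncated) / len(truncated)

# Equivalent view: batch losses averaged with the true batch sizes as weights
weighted_mean = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / num_examples

print(batch_sizes[-1], len(repeated), len(truncated))
print(round(loop_mean, 6), round(weighted_mean, 6))
```

This is still an approximation, since each batch loss is itself averaged over masked tokens and not over examples.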


In [36]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="./fairberta-imdb")

In [40]:
text = "The director was great in this movie. <mask> was spectacular."
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> The director was great in this movie. He was spectacular.
>>> The director was great in this movie. It was spectacular.
>>> The director was great in this movie. he was spectacular.
>>> The director was great in this movie. She was spectacular.
>>> The director was great in this movie. Acting was spectacular.


It looks like using FairBERTa did not reduce gender bias in this prediction: after fine-tuning, "He" still ranks above "She".
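
The ranking alone does not say how large the gap is. Each prediction returned by the `fill-mask` pipeline also carries a `score` and a `token_str`, so the probability mass on male and female pronouns can be compared directly. A minimal sketch of that comparison follows; `pronoun_mass` is a hypothetical helper, and the scores in `example_preds` are made-up placeholders, not output from the fine-tuned model.

```python
def pronoun_mass(preds, pronouns):
    """Sum the scores of predictions whose token matches one of the pronouns."""
    wanted = {p.lower() for p in pronouns}
    return sum(
        pred["score"]
        for pred in preds
        if pred["token_str"].strip().lower() in wanted
    )


# PLACEHOLDER predictions in the shape returned by a fill-mask pipeline
example_preds = [
    {"token_str": " He", "score": 0.40},
    {"token_str": " It", "score": 0.20},
    {"token_str": " he", "score": 0.10},
    {"token_str": " She", "score": 0.05},
]

male = pronoun_mass(example_preds, {"he"})
female = pronoun_mass(example_preds, {"she"})
print(f"male={male:.2f} female={female:.2f} ratio={male / female:.1f}")
```

In practice `example_preds` would be replaced by `mask_filler(text)`, ideally with a larger `top_k` so that both pronouns appear among the candidates.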