In [1]:
# ! pip install datasets transformers

## Causal language modeling
The model has to predict the next token in the sentence, so the labels are the inputs shifted by one position. To keep the model from cheating, an attention mask prevents it from accessing any token after token i when it predicts token i+1.
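
A minimal sketch in plain Python of the two ideas above: the causal attention mask and the one-position shift between inputs and labels. The function names and toy token ids are illustrative assumptions, not part of `transformers`; the causal models used in this notebook build the mask and shift the labels internally.

```python
def causal_mask(seq_len):
    """Lower-triangular mask: row i has 1s for positions 0..i and 0s afterwards."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(token_ids):
    """Pair the context visible at each position with the token to predict next."""
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]


toy_ids = [10, 11, 12, 13]  # hypothetical token ids

for row in causal_mask(len(toy_ids)):
    print(row)

for context, target in next_token_pairs(toy_ids):
    print(context, "->", target)
```

Each mask row corresponds to one position: a 0 marks a later token that the position is not allowed to attend to.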
## Masked language modeling
The model has to predict tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens both before and after the masked ones to predict their values.
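
A rough sketch of how masked inputs and labels could be built, in plain Python. `MASK_ID`, the 30% masking rate, and the helper name are assumptions for illustration; real data collators add refinements that this sketch omits. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default.

```python
import random

MASK_ID = -1   # hypothetical id standing in for the tokenizer's mask token
IGNORE = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK_ID and keep the originals as labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # hide the token from the model
            labels.append(tok)      # the model must recover it
        else:
            inputs.append(tok)      # token stays visible
            labels.append(IGNORE)   # no loss is computed at this position
    return inputs, labels


toy_ids = list(range(100, 120))  # hypothetical token ids
inputs, labels = mask_tokens(toy_ids, mask_prob=0.3)
print(inputs)
print(labels)
```

Only the masked positions carry a real label, so the loss is computed on those positions alone.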

In [2]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

In [3]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [4]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Sample distinct row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
    
show_random_elements(datasets["train"])

Unnamed: 0,text
0,
1,= = Early life = = \n
2,
3,
4,"This mindlessness is connected to the context in which Brooks was writing . He declared : "" at this point we 're pretty much living in an irrational time "" , full of human suffering and lacking reason or logic . When asked in a subsequent interview about how he would compare terrorists with zombies , Brooks said : \n"
5,www.kreusch @-@ sheet @-@ music.net – Free Scores by Alkan \n
6,= = Legal framework = = \n
7,
8,"At the time when the poem Lietuva , Tėvyne mūsų was written , Lithuania was part of the Russian Empire . Kudirka , a medical student at the University of Warsaw , was writing as a columnist for the newspaper Varpas ( The Bell ) . In his Varpas columns , Kudirka urged Lithuanians to take pride in their heritage , discussed the problems the Russian Government was causing the Lithuanian population , and denounced those who wished to work for the Tsarist autocracy . In the course of writing for Varpas , he wrote down his thoughts on what Lithuania was and what it should be , resulting in the fifty @-@ word poem Lietuva , Tėvynė mūsų ( "" Lithuania , Our Homeland "" ) . \n"
9,"The forms in which the gods are shown , although diverse , are limited in many ways . Many creatures that are widespread in Egypt were never used in divine iconography , whereas a few , such as falcons , cobras , and cattle , can each represent many deities . Animals that were absent from Egypt in the early stages of its history were not used as divine images . For instance , the horse , which was only introduced in the Second Intermediate Period ( c . 1650 – 1550 BC ) , never represented a god . Similarly , the clothes worn by anthropomorphic deities in all periods changed little from the styles used in the Old Kingdom : a kilt , false beard , and often a shirt for male gods and a long , tight @-@ fitting dress for goddesses . \n"


In [5]:
# model_checkpoint = "distilgpt2"
# model_checkpoint = "EleutherAI/pythia-70m-deduped"
model_checkpoint = "EleutherAI/pythia-160m-deduped"
# model_checkpoint = "RWKV/rwkv-4-169m-pile"
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"])
    
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
tokenized_datasets["train"][1]

{'input_ids': [426, 657, 1278, 90, 5182, 28289, 868, 6490, 426, 2490],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
# block_size = tokenizer.model_max_length
block_size = 128

In [8]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

In [9]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' time gameplay as its predecessors, the story runs parallel to the first game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both'
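
The grouping step used to build `lm_datasets` can be checked on toy data. The sketch below mirrors the logic of `group_texts` (concatenate, drop the remainder, split into fixed-size blocks, copy the inputs as labels), but takes `block_size` as a parameter. The helper name and the toy batch are assumptions for illustration.

```python
def group_into_blocks(batch, block_size):
    """Concatenate every column of the batch, then cut it into block_size chunks."""
    concatenated = {k: [tok for seq in seqs for tok in seq] for k, seqs in batch.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    blocks = {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    blocks["labels"] = [list(b) for b in blocks["input_ids"]]
    return blocks


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
grouped = group_into_blocks(toy_batch, block_size=4)
print(grouped["input_ids"])  # 10 tokens give two blocks of 4; the last 2 tokens are dropped
```

Blocks cross the boundaries of the original texts, which is why the decoded example above starts mid-sentence.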

In [10]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [11]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)


In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33myimei-yang[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss
1,3.3986,3.507483
2,2.9941,3.481176
3,2.6643,3.55397


TrainOutput(global_step=7050, training_loss=3.052152225142675, metrics={'train_runtime': 431.3981, 'train_samples_per_second': 130.703, 'train_steps_per_second': 16.342, 'total_flos': 5356209314856960.0, 'train_loss': 3.052152225142675, 'epoch': 3.0})

In [13]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 34.95
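
Perplexity is the exponential of the mean per-token negative log-likelihood, which is what `eval_loss` reports for a causal language model. The sketch below shows the relationship on hand-made numbers. The `perplexity` helper and the uniform toy distribution are assumptions for illustration; the value 3.55397 is the epoch 3 validation loss from the training table above.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 candidates at every step
# has a perplexity of about 4: it is as uncertain as a fair 4-way choice.
uniform_log_probs = [math.log(1 / 4)] * 5
print(perplexity(uniform_log_probs))

# Exponentiating the epoch 3 validation loss recovers the printed perplexity.
print(f"{math.exp(3.55397):.2f}")
```

Lower is better: a perplexity of about 35 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 35 tokens.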


Perplexity: 45.93 for pythia-70m-deduped

Perplexity: 34.95 for pythia-160m-deduped

Perplexity: 26.12 for RWKV/rwkv-4-169m-pile