In [1]:
# ! pip install datasets transformers

## Causal language modeling
The model has to predict the next token in the sentence, so the labels are the same as the inputs shifted by one position. To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when trying to predict token i+1 in the sentence.
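
A minimal sketch of the two ideas above, using made-up token ids rather than real tokenizer output. The names `token_ids`, `inputs`, `targets`, and `causal_mask` are illustrative assumptions and do not come from any library.

```python
# Hypothetical token ids for a five-token sentence (illustrative values only).
token_ids = [11, 22, 33, 44, 55]

# Next-token prediction: at position i the model sees tokens 0..i
# and is asked to predict token i+1.
inputs = token_ids[:-1]
targets = token_ids[1:]

# Causal mask: row i marks with 1 the positions that are visible
# while the model processes position i (itself and everything before it).
n = len(token_ids)
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask:
    print(row)
```

The mask is lower triangular: no position can see a later one. In the Transformers library the one-position shift is typically applied inside the model's loss computation, which would explain why this notebook later sets `labels` to a plain copy of `input_ids`.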
## Masked language modeling
The model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their values.
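
To make the masking idea concrete, here is a simplified sketch on whitespace-split words. The helper `mask_tokens`, the `MASK` placeholder, and the `mask_prob` value are assumptions for illustration; real tokenizers define their own mask token, and recipes such as BERT's also sometimes keep or randomly replace the selected tokens, which is omitted here.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption; tokenizers define their own)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so only masked positions would contribute to the loss.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the model has to predict some tokens that are masked".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Unlike the causal case, nothing hides the context: every unmasked word on either side of a `[MASK]` remains visible to the model.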

In [2]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

In [3]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [4]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
    
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"Titanoceratops - ( New Mexico , USA ) \n"
1,
2,"As of 1984 , the mean annual precipitation for the Loyalsock Creek watershed ( which Plunketts Creek is part of ) was 42 to 48 inches ( 1067 to 1219 mm ) . Pennsylvania receives the greatest amount of acid rain of any state in the United States . Because Plunketts Creek is in a sandstone and shale mountain region , it has a relatively low capacity to neutralize added acid . This makes it especially vulnerable to increased acidification from acid rain , which poses a threat to the long term health of the plants and animals in the creek . The total alkalinity ( TA ) is a measure of the capacity of water to neutralize acid , with a larger TA corresponding to a greater capacity . In 2007 , the TA of two subtributaries was known : Engle Run , a 4 @.@ 9 @-@ mile ( 7 @.@ 9 km ) tributary of King Run , had a TA of 5 , and the Noon Branch , a 1 @.@ 9 @-@ mile ( 3 @.@ 1 km ) tributary of Wolf Run , had a TA of 9 . \n"
3,"The episode was broadcast online by Netflix on February 1 , 2013 as part of the simultaneous release of all 13 episodes of season 1 of the series . The debut date was a weekend when there was little competition on television other than Super Bowl XLVII and a new episode of Downton Abbey on PBS . Netflix broadcast "" Chapter 1 "" and "" Chapter 2 "" to critics several days in advance of the release . \n"
4,"The 1975 tour with the Revue provided the backdrop to Dylan 's nearly four @-@ hour film Renaldo and Clara , a sprawling narrative mixed with concert footage and reminiscences . Released in 1978 , the movie received poor , sometimes scathing , reviews . Later in that year , a two @-@ hour edit , dominated by the concert performances , was more widely released . \n"
5,= = Service history = = \n
6,Editor Christopher Gay spoke about the episode in August 2012 : \n
7,"On 26 July , Federer announced that he would miss the 2016 Summer Olympics and the remainder of the 2016 season to fully recover from his knee injury . \n"
8,""" Mothers of the Disappeared "" was favourably received by critics . Steve Morse of The Boston Globe called the song "" powerful "" and described the backing vocals as tender and choirlike . Don McLeese of the Chicago Sun @-@ Times described it as a "" hymn to human rights "" . Adrian Thrills of NME called it "" a simple , plaintive lament of stunning beauty and sadness "" . Nicholas Jennings of Maclean 's felt that it was The Joshua Tree 's "" most topical song "" . Music journalist Andrew Mueller felt the track was a "" wilfully downbeat finale "" . In Rolling Stone , Steve Pond said "" ' Mothers of the Disappeared ' is built around desolate images of loss , but the setting is soothing and restorative — music of great sadness but also of unutterable compassion , acceptance and calm . "" Lennox Samuels of The Dallas Morning News stated that there was "" an ineffable sadness in Bono 's vocals and images where ' Night hangs like a prisoner / Stretched over black and blue ' "" , calling it "" a moving tribute "" to people around the world who had lost loved ones to warfare and conflict . He added "" [ w ] hat 's remarkable about the song is that despite the intrinsic pain , it remains eerily cleansing . Even in the midst of decay and excess and horror , Bono can find hope and absolution . "" In 2006 Bono described it as "" a beautiful end to the album "" , saying , "" That song means as much to me as any of the songs on that album , it 's right up there for me , "" and noting that it is a song "" I 'm very proud of to this day . "" \n"
9,


In [5]:
# model_checkpoint = "distilgpt2"
# model_checkpoint = "EleutherAI/pythia-70m-deduped"
# model_checkpoint = "EleutherAI/pythia-160m"
model_checkpoint = "RWKV/rwkv-4-169m-pile"
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"])
    
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
tokenized_datasets["train"][1]

{'input_ids': [426, 657, 1278, 90, 5182, 28289, 868, 6490, 426, 2490],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
# block_size = tokenizer.model_max_length
block_size = 128

In [8]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
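
To see what the grouping step does, the sketch below applies the same concatenate-then-chunk logic to a tiny hand-made batch. `group_texts_toy` and `toy_batch` are illustrative names, and `block_size` is passed as an argument here (rather than read from a global) so the example is self-contained.

```python
def group_texts_toy(examples, block_size):
    # Concatenate every per-text list in each column into one long list.
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split each column into consecutive chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels start as a copy of the inputs.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result

# Three short "texts" with 3 + 4 + 3 = 10 tokens in total (made-up ids).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
out = group_texts_toy(toy_batch, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Tokens 9 and 10 are dropped because they do not fill a whole block, and chunk boundaries ignore the original text boundaries. With `batched=True` and `batch_size=1000`, this drop happens once per batch of 1000 texts rather than once for the whole dataset.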

In [9]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' time gameplay as its predecessors, the story runs parallel to the first game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both'

In [10]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)



In [11]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)



In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

# Zero-shot evaluation: training is skipped, so the pretrained checkpoint is evaluated as is.
# trainer.train()

In [13]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")


Perplexity: 51.08
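
Perplexity is the exponential of the mean cross-entropy loss, which is what the `math.exp(eval_results['eval_loss'])` call computes. The worked example below uses made-up probabilities (`token_probs` is purely illustrative) to show the arithmetic on four tokens.

```python
import math

# Hypothetical probabilities the model assigned to the correct next token
# at four positions (illustrative values, not from any real model).
token_probs = [0.25, 0.5, 0.125, 0.5]

# Cross-entropy loss: mean negative log-likelihood, using the natural log.
mean_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp(mean loss); equivalently, the inverse geometric mean
# of the probabilities.
perplexity = math.exp(mean_loss)
print(f"loss={mean_loss:.4f}, perplexity={perplexity:.2f}")  # loss=1.2130, perplexity=3.36
```

Read the other way, a perplexity of 51.08 corresponds to an evaluation loss of about ln(51.08) ≈ 3.93 nats per token.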


Perplexity: 45.93 for pythia-70m-deduped

Perplexity: 34.95 for pythia-160m-deduped

Perplexity: 26.12 for RWKV/rwkv-4-169m-pile

Perplexity: 61.54 for pythia-160m zero shot evaluation

Perplexity: 51.08 for RWKV/rwkv-4-169m-pile zero shot evaluation