In [1]:
# ! pip install datasets transformers

## Causal language modeling
The model has to predict the next token in the sentence (so the labels are the same as the inputs shifted to the right). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when predicting token i+1 in the sentence.
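
As a rough illustration of the two ideas above (the causal mask and the shifted labels), here is a minimal pure-Python sketch. The helper names `causal_mask` and `next_token_pairs` and the token ids are made up for this example; real models build the mask internally as tensors.

```python
def causal_mask(seq_len):
    """Row i lists which positions are visible at position i (1 = visible)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(token_ids):
    """Pair each input token with the token the model must predict next."""
    return list(zip(token_ids[:-1], token_ids[1:]))


tokens = [11, 22, 33, 44]  # made-up token ids, for illustration only
mask = causal_mask(len(tokens))
for row in mask:
    print(row)  # each row sees itself and everything before it, nothing after
print(next_token_pairs(tokens))
```

Each row of the mask is lower-triangular, so position i never sees a later token, and each (input, target) pair is the sequence shifted by one position.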
## Masked language modeling
The model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their values.
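
The masking step can be sketched as follows. This is only an assumed, simplified scheme: the `[MASK]` symbol, the `mask_prob` parameter, and the helper `mask_tokens` are hypothetical, and real data collators work on token ids and apply extra rules.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with MASK with probability mask_prob.

    Returns the masked sequence and the labels: the original token at
    masked positions, None everywhere else (no loss computed there).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Unlike the causal case, nothing hides the surrounding context: the unmasked tokens on both sides stay visible to the model.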

In [2]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

In [3]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [4]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw unique random indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
    
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"The earliest known written version of the Laws of Cricket , dating from 1744 , does not include an lbw rule . At the time , batsmen in English cricket used curved bats , which made it unlikely that they would be able to stand directly in front of the wickets . However , a clause in the 1744 laws gave umpires the power to take action if the batsman was "" standing unfair to strike "" . Cricket bats were modified to become straighter over the following years , allowing batsmen to stand closer to the wickets . Subsequently , some players deliberately began to obstruct the ball from hitting the wickets . Such tactics were criticised by writers and a revision of the laws in 1774 ruled that the batsman was out if he deliberately stopped the ball from hitting the wicket with his leg . However , critics noted that the umpires were left the difficult task of interpreting the intentions of batsmen . The 1788 version of the laws no longer required the umpires to take account of the batsman 's intent ; now a batsman was lbw if he stopped a ball that "" pitch [ ed ] straight "" . Further clarification of the law came in 1823 , when a condition was added that "" the ball must be delivered in a straight line to the wicket "" . The ambiguity of the wording was highlighted when two prominent umpires disagreed over whether the ball had to travel in a straight line from the bowler to the wicket , or between the wickets at either end of the pitch . In 1839 the MCC , by then responsible for drafting the Laws of Cricket , endorsed the latter interpretation and ruled the batsman out lbw if the ball pitched in between the wickets and would have hit the stumps . \n"
1,"In the two @-@ part finale to series two , "" Counterfeit "" ( 1994 ) , James Horton ( Peter Hudson ) , a renegade Watcher who believes all Immortals must be eliminated , uses killer Lisa Halle ( Meilani Paul ) to try and kill MacLeod . Lisa undergoes plastic surgery to resemble Tessa and therefore is played by Vandernoot from that point on . MacLeod meets Lisa just after he admitted to himself how much he missed Tessa , and he is stunned by her resemblance with Tessa . Despite knowing that Tessa is dead and cannot return , he eagerly pursues a relationship with Lisa . He eventually admits the truth when he discovers a scar on Lisa 's jaw . Horton kills Lisa on Tessa 's grave before being himself killed by MacLeod . \n"
2,
3,
4,"After reigning for barely one month , Zhang Bangchang was persuaded by the Song to step down as emperor of the Great Chu and to recognize the legitimacy of the Song imperial line . Li Gang pressured Gaozong to execute Zhang for betraying the Song . The emperor relented and Zhang was coerced into suicide . The killing of Zhang showed that the Song was willing to provoke the Jin , and that the Jin had yet to solidify their control over the newly conquered territories . The submission and abolition of Chu meant that Kaifeng was now back under Song control . Zong Ze ( 宗澤 ; 1059 – 1128 ) , the Song general responsible for fortifying Kaifeng , entreated Gaozong to move the court back to the city , but Gaozong refused and retreated south . The southward move marked the end of the Northern Song and the beginning of the Southern Song era of Chinese history . \n"
5,= = = Setting = = = \n
6,
7,= = Playing style = = \n
8,
9,"Piggott claimed that Wheeler 's appointment as Director @-@ General of the Archaeological Survey of India represented "" the most remarkable archaeological achievement of his career , an enormous challenge accepted and surmounted in the autocratic and authoritarian terms within which he could best deploy his powers as administrator and excavator . No other archaeologist of the time , it seems fair to remark , could have come near to attaining his command of incisive strategy and often ruthless tactics which won him the bewildered admiration and touching devotion of his Indian staff . "" The Indian archaeologist Dilip K. Chakrabarti later stated that Wheeler 's accomplishments while in India were "" considerable "" , particularly given the socio @-@ political turmoil of independence and partition . Chakrabarti stated that Wheeler had contributed to South Asian archaeology in various ways : by establishing a "" total view "" of the region 's development from the Palaeolithic onward , by introducing new archaeological techniques and methodologies to the subcontinent , and by encouraging Indian universities to begin archaeological research . Ultimately , Chakrabarti was of the opinion that Wheeler had "" prepared the archaeology of the subcontinent for its transition to modernity in the post @-@ Partition period . "" Similarly , Peter Johansen praised Wheeler for systematising and professionalising Indian archaeology and for "" instituting a clearly defined body of techniques and methods for field and laboratory work and training . "" \n"


In [5]:
# model_checkpoint = "distilgpt2"
# model_checkpoint = "EleutherAI/pythia-70m-deduped"
# model_checkpoint = "EleutherAI/pythia-160m"
model_checkpoint = "RWKV/rwkv-4-169m-pile"
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"])
    
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
tokenized_datasets["train"][1]

{'input_ids': [426, 657, 1278, 90, 5182, 28289, 868, 6490, 426, 2490],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
# block_size = tokenizer.model_max_length
block_size = 128

In [8]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
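
To see what the grouping step does, here is a small standalone sketch on toy data. It mirrors the logic of `group_texts` but takes `block_size` as an argument; the name `group_into_blocks` and the toy batch are made up for illustration.

```python
def group_into_blocks(batch, block_size):
    """Concatenate all sequences per key, then cut into equal-size blocks."""
    concatenated = {
        key: [tok for seq in seqs for tok in seq] for key, seqs in batch.items()
    }
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [toks[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, toks in concatenated.items()
    }
    # For causal LM the labels are a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
out = group_into_blocks(toy_batch, block_size=4)
print(out["input_ids"])
```

The ten toy tokens yield two blocks of four, and the last two tokens are dropped. Block boundaries ignore the original text boundaries, which is why a decoded block can start mid-sentence.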

In [9]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' time gameplay as its predecessors, the story runs parallel to the first game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both'

In [10]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)




In [11]:
# Add peft

from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
    TaskType,
)

# ## Prefix-tuning
# peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=30, num_attention_heads=12)

## P-tuning
peft_type = PeftType.P_TUNING
# peft_config = PromptEncoderConfig(task_type="SEQ_CLS", num_virtual_tokens=20, encoder_hidden_size=128)
peft_config = PromptEncoderConfig(task_type="CAUSAL_LM", num_virtual_tokens=20, encoder_hidden_size=128, num_attention_heads=12)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


trainable params: 229,376 || all params: 169,571,840 || trainable%: 0.13526774256857743
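
The reported percentage is just the ratio of the two counts printed above. A quick check of the arithmetic, with the numbers copied from that output:

```python
trainable_params = 229_376
all_params = 169_571_840

# Share of parameters updated during p-tuning, as a percentage.
trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")
```

So only about 0.14% of the parameters are trained; the rest of the model stays frozen.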


In [12]:
# !pip install peft

In [13]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)



In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

# Zero-shot evaluation: leave training commented out.
# trainer.train()

In [15]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

[34m[1mwandb[0m: Currently logged in as: [33myimei-yang[0m. Use [1m`wandb login --relogin`[0m to force relogin


Perplexity: 60.05
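
Perplexity is the exponential of the mean per-token cross-entropy loss, which is what `math.exp(eval_results['eval_loss'])` computes. A minimal sketch with made-up token probabilities (the helper `perplexity` is hypothetical, not part of `transformers`):

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    mean_loss = sum(nll) / len(nll)
    return math.exp(mean_loss)


# A model that always gives the correct token probability 1/4 is as
# uncertain as a uniform choice among 4 tokens.
uniform_ppl = perplexity([0.25] * 8)
print(f"{uniform_ppl:.2f}")

# Going the other way: a perplexity of 60.05 corresponds to this mean loss.
loss_for_60 = math.log(60.05)
print(f"{loss_for_60:.3f}")
```

Lower is better: a perplexity of 60.05 means the model is, on average, about as uncertain as a uniform choice among roughly 60 tokens.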


Perplexity: 45.93 for pythia-70m-deduped

Perplexity: 34.95 for pythia-160m-deduped, with prefix tuning: 1105.27, with p-tuning: 39.32

Perplexity: 26.12 for RWKV/rwkv-4-169m-pile, with p-tuning: 35.49

Perplexity: 61.54 for pythia-160m zero-shot evaluation, with prefix tuning: 1516.04, with p-tuning: 87.91

Perplexity: 51.08 for RWKV/rwkv-4-169m-pile zero-shot evaluation, with p-tuning: 60.05