In [1]:
# ! pip install datasets transformers

## Causal language modeling
The model has to predict the next token in the sentence, so the labels are the same as the inputs shifted by one position. To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when predicting token i+1.
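
As a rough illustration of the two ideas above, the sketch below builds a lower-triangular attention mask and pairs each input token with the token it has to predict. The function names `causal_mask` and `next_token_pairs` and the token ids are hypothetical, chosen only for this example; real models build the mask internally.

```python
def causal_mask(seq_len):
    """Row i has 1 at column j when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(token_ids):
    """Pair each input token with the next token, which serves as its label."""
    return list(zip(token_ids[:-1], token_ids[1:]))


tokens = [11, 22, 33, 44]  # hypothetical token ids
for row in causal_mask(len(tokens)):
    print(row)
print(next_token_pairs(tokens))
```

Each row of the mask only exposes positions up to and including the current one, so the prediction for the next token never sees that token.
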
## Masked language modeling
The model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their values.
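
A minimal sketch of how masked inputs and labels could be prepared is shown below. The mask token id, the 15% default masking rate, and the helper name `mask_tokens` are assumptions for illustration and are not taken from this notebook; the label value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
import random

MASK_ID = 0          # hypothetical id of the mask token
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK_ID; only those get a label."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)       # hide the token from the model
            labels.append(tok)           # the model must recover it
        else:
            inputs.append(tok)           # token stays visible
            labels.append(IGNORE_INDEX)  # no loss at this position
    return inputs, labels


tokens = [101, 102, 103, 104, 105, 106, 107, 108]  # hypothetical token ids
inputs, labels = mask_tokens(tokens, mask_prob=0.5)
print(inputs)
print(labels)
```

Unlike the causal case, no position is hidden by the attention mask; only the masked tokens themselves are replaced.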

In [1]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

In [2]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [3]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
    
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"Cougars are slender and agile members of the cat family . They are the fourth @-@ largest cat ; adults stand about 60 to 90 cm ( 24 to 35 in ) tall at the shoulders . Adult males are around 2 @.@ 4 m ( 7 @.@ 9 ft ) long nose @-@ to @-@ tail and females average 2 @.@ 05 m ( 6 @.@ 7 ft ) , with overall ranges between 1 @.@ 5 to 2 @.@ 75 m ( 4 @.@ 9 to 9 @.@ 0 ft ) nose to tail suggested for the species in general . Of this length , 63 to 95 cm ( 25 to 37 in ) is comprised by the tail . Males typically weigh 53 to 100 kg ( 115 to 220 lb ) , averaging 62 kg ( 137 lb ) . Females typically weigh between 29 and 64 kg ( 64 and 141 lb ) , averaging 42 kg ( 93 lb ) . Cougar size is smallest close to the equator , and larger towards the poles . The largest recorded cougar , shot in 1901 , weighed 105 @.@ 2 kg ( 232 lb ) ; claims of 125 @.@ 2 kg ( 276 lb ) and 118 kg ( 260 lb ) have been reported , though they were most likely exaggerated . On average , adult male cougars in British Columbia weigh 56 @.@ 7 kg ( 125 lb ) and adult females 45 @.@ 4 kg ( 100 lb ) , though several male cougars in British Columbia weighed between 86 @.@ 4 and 95 @.@ 5 kg ( 190 to 210 lb ) . \n"
1,"Citizens in the south were opposed to a centralised government , and to the decrees of its rule , which resulted in rebellion . Prior to the revolution France had been divided into provinces with local governments . In 1790 the government , the National Constituent Assembly , reorganised France into administrative departments in order to rebalance the uneven distribution of French wealth , which had been subject to feudalism under the monarchical Ancien Régime . \n"
2,"Hobbs was born on May 8 , 1883 , in Bloomington , Nebraska , to John Alden Hobbs and Cora Bush Hobbs . Her family moved to Salt Lake City , Utah when she was six years old ; she lived there for 12 years , finishing high school . Her father then met with financial difficulties , and she moved to Oregon , settling in Hillsboro . There , she put her younger brother and sister through school , while studying stenography and working for a living . \n"
3,Roger Federer at the Davis Cup \n
4,
5,
6,
7,
8,
9,"Major declines in populations have been observed from 1980 onward in Sweden , Finland , northern Russia ( Karelia ) and the Baltic States , and smaller declines in much of the rest of northern and central Europe . The bird has been adversely affected in these areas by intensive agriculture , and in several countries it has been red @-@ listed due to population declines of more than 50 % . Numbers dwindled in the United Kingdom by more than 80 % between 1966 and 2004 ; although populations in some areas such as Northern Ireland were stable or even increased , those in other areas , mainly England , declined even more sharply . The overall decline seems to be due to the low survival rate of young birds , which may be caused by changes in agricultural practices . The intensive farming methods used in northern Europe mean there is less pasture and meadow habitat available , and the supply of grassland invertebrates needed for the nestlings to thrive is correspondingly reduced . \n"


In [4]:
# model_checkpoint = "distilgpt2"
# model_checkpoint = "EleutherAI/pythia-70m-deduped"
model_checkpoint = "EleutherAI/pythia-160m"
# model_checkpoint = "RWKV/rwkv-4-169m-pile"
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"])
    
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
tokenized_datasets["train"][1]

{'input_ids': [426, 657, 1278, 90, 5182, 28289, 868, 6490, 426, 2490],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
# block_size = tokenizer.model_max_length
block_size = 128

In [7]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
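
To make the effect of `group_texts` easier to see, here is a small self-contained sketch of the same concatenate-and-chunk logic on a toy batch. The name `group_texts_toy`, the toy token ids, and `block_size=4` are made up for this example; the body mirrors the cell above but takes `block_size` as a parameter.

```python
def group_texts_toy(examples, block_size):
    """Concatenate all sequences per key, then cut them into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten toy tokens yield two blocks of four, and the last two tokens are dropped. Note that blocks can span the boundary between two original texts.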

In [8]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' time gameplay as its predecessors, the story runs parallel to the first game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both'

In [9]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [10]:
# Add peft

from peft import (
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
    TaskType,
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
)

# ## Prefix-tuning
# peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=30)

## P-tuning
peft_type = PeftType.P_TUNING
# peft_config = PromptEncoderConfig(task_type="SEQ_CLS", num_virtual_tokens=20, encoder_hidden_size=128)
peft_config = PromptEncoderConfig(task_type="CAUSAL_LM", num_virtual_tokens=20, encoder_hidden_size=128)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


trainable params: 229,376 || all params: 162,552,320 || trainable%: 0.1411090287730129
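
The percentage in the output above can be checked with plain arithmetic. This sketch only reuses the two counts printed by `print_trainable_parameters()`; the variable names are chosen here for illustration.

```python
trainable_params = 229_376   # parameters added by the prompt encoder
all_params = 162_552_320     # base model plus the added parameters

frozen_params = all_params - trainable_params
trainable_pct = 100 * trainable_params / all_params

print(f"frozen params: {frozen_params:,}")  # frozen params: 162,322,944
print(f"trainable%: {trainable_pct:.4f}")   # trainable%: 0.1411
```

So only about 0.14% of the parameters would receive gradient updates; the rest of the model stays frozen.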


In [11]:
# !pip install peft

Collecting peft
  Downloading peft-0.10.0-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.10.0-py3-none-any.whl (199 kB)
Installing collected packages: peft
Successfully installed peft-0.10.0


In [11]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

comet_ml is installed but `COMET_API_KEY` is not set.


In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

# Zero shot evaluation
# trainer.train()

In [13]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

wandb: Currently logged in as: yimei-yang. Use `wandb login --relogin` to force relogin


Perplexity: 87.91
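
For reference, perplexity is the exponential of the mean per-token cross-entropy loss, which is what `math.exp(eval_results['eval_loss'])` computes. The sketch below shows the relation on made-up loss values; the helper name `perplexity` and the inputs are illustrative only.

```python
import math


def perplexity(token_losses):
    """Exponential of the mean per-token cross-entropy (natural log)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# A model that is uniformly unsure between 4 tokens has loss ln(4) per token,
# so its perplexity is 4.
losses = [math.log(4)] * 5
print(round(perplexity(losses), 2))  # 4.0
```

Lower is better: a perplexity of 1 would mean the model assigns probability 1 to every correct token.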


Perplexity: 45.93 for pythia-70m-deduped

Perplexity: 34.95 for pythia-160m-deduped; with prefix tuning: 2739.01; with P-tuning: 87.91

Perplexity: 26.12 for RWKV/rwkv-4-169m-pile

Perplexity: 61.54 for pythia-160m zero shot evaluation

Perplexity: 51.08 for RWKV/rwkv-4-169m-pile zero shot evaluation