Week 1 Notebook: https://colab.research.google.com/drive/1bSQwJ8HiyJh_v_TnBCJl_FoI7WWhzqI4?usp=sharing

What is Pretraining?

As the name suggests, pretraining is something you do before training on the task that you actually care about. In most cases this means training on some other labeled data or on a large pool of unlabeled data. There are many different kinds of pretraining techniques, and you can get creative with them. What we are trying to optimize for is that the pretraining process increases the performance of our model on downstream tasks.

You might pretrain a model on all available data at your company, or on a bunch of data from the internet, regardless of whether it has labels, and then finetune the model later on your own labeled data for your specific task.

The pretraining and finetuning tasks do not need to be the same. You just hope that pretraining teaches the model something relevant to the finetuning task, so that performance improves beyond what you would get starting from a randomly initialized model.

Examples of Pretraining tasks:
- Next word prediction
- Fill in the blank (text, image, time series, audio)
- Classification (on different targets)
- Contrastive learning (pass an input through the network, perturb it slightly, pass it through again, and train the two outputs to be similar)
- Pretty much anything
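
The first two tasks in the list above need no human labels, because the text itself supplies the targets. The following is a minimal sketch of how one sentence can be turned into training examples for each. The function names and the `<mask>` placeholder are illustrative assumptions, and real pipelines work on token ids rather than whole words.

```python
def next_word_examples(tokens):
    """Each prefix of the sequence is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def fill_in_blank_example(tokens, position, mask_token="<mask>"):
    """Hide one token; the hidden token becomes the label."""
    masked = list(tokens)
    label = masked[position]
    masked[position] = mask_token
    return masked, label


tokens = "the dog chased the red ball".split()

# Next word prediction: the context grows one token at a time.
for context, target in next_word_examples(tokens)[:3]:
    print(context, "->", target)

# Fill in the blank: one position is hidden and must be recovered.
masked, label = fill_in_blank_example(tokens, position=5)
print(masked, "->", label)
```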

<img src = "https://production-media.paperswithcode.com/datasets/ImageNet-0000000008-f2e87edd_Y0fT5zg.jpg">


One particularly famous case of pretraining that many ML people are familiar with is ImageNet (https://en.wikipedia.org/wiki/ImageNet). This is a dataset of millions of images that have been labeled into various classes. People have found that training on this large labeled dataset and then reusing the resulting model weights as the initialization point for other tasks significantly improves performance. A model that has distilled the information in ImageNet transfers better to new tasks because it doesn't have to learn everything from scratch.

We have seen that this works even when the domain changes drastically. In ImageNet you are classifying cats, dogs, birds, and thousands of other classes, but the weights aren't just useful for those specific classes. They have been used for cancer detection, pneumothorax detection, cell segmentation, and more. It is incredible how well it works.



---





How effective is pretraining?

In the ideal scenario our pretraining will transfer well to our downstream task. We want to align the pretraining task and the finetuning task so that increased performance on one leads to better performance on the other. Sometimes it is non-obvious how to do this, and you need to get creative with the tasks and with where you get your data. A good example is the Whisper system from OpenAI (https://cdn.openai.com/papers/whisper.pdf), if you are interested in seeing a creative use of weakly labeled data collected from the internet. In some rare cases the pretraining can virtually become the actual training.

But if this isn't possible, what performance can we expect in some sample scenarios? One we will be looking at extensively is language modeling: predicting the next token in a sequence of text.

<img src="https://i.imgur.com/nOQsSvS.png">

We can see that the top model, trained with language modeling beforehand, massively beats the transformer without any pretraining and even beats an LSTM that was also trained with language modeling.

GPT paper - https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf



---



Where to get data?
There is a lot of data, and specifically text, out on the internet. Let's look at some of the popular datasets to train on.

GPT:

> Unsupervised pre-training We use the BooksCorpus dataset [71] for training the language model.
It contains over 7,000 unique unpublished books from a variety of genres including Adventure,
Fantasy, and Romance

BERT:
> Pre-training data The pre-training procedure
largely follows the existing literature on language
model pre-training. For the pre-training corpus we
use the BooksCorpus (800M words) (Zhu et al.,
2015) and English Wikipedia (2,500M words).
For Wikipedia we extract only the text passages
and ignore lists, tables, and headers.

RoBERTa:

> We consider five English-language corpora of
varying sizes and domains, totaling over 160GB
of uncompressed text. We use the following text
corpora:
• BOOKCORPUS (Zhu et al., 2015) plus English
WIKIPEDIA. This is the original data used to
train BERT. (16GB).
• CC-NEWS, which we collected from the English portion of the CommonCrawl News
dataset (Nagel, 2016). The data contains 63
million English news articles crawled between
September 2016 and February 2019. (76GB after filtering). [Footnote: We use news-please (Hamborg et al., 2017) to collect and extract CC-NEWS. CC-NEWS is similar to the REALNEWS dataset described in Zellers et al. (2019).]
• OPENWEBTEXT (Gokaslan and Cohen, 2019),
an open-source recreation of the WebText corpus
described in Radford et al. (2019). The text
is web content extracted from URLs shared on
Reddit with at least three upvotes. (38GB).
• STORIES, a dataset introduced in Trinh and Le
(2018) containing a subset of CommonCrawl
data filtered to match the story-like style of
Winograd schemas. (31GB).

One commonality we see here is BookCorpus. We can find a version of it on the Hugging Face dataset hub:
https://huggingface.co/datasets/bookcorpus

But more generally we can look at all the datasets available on the hub for the fill mask task https://huggingface.co/datasets?task_categories=task_categories:fill-mask&sort=downloads

We can see that this has a lot of the same datasets available from the various papers. If you wanted to download all of these and concatenate them you probably could but it would be a lot of data. 

In [None]:
!pip install datasets

For our simple test case we will use the Amazon reviews dataset because it is small enough to be manageable: https://huggingface.co/datasets/amazon_reviews_multi If you want to, you can visit this page, see what other languages are available, and swap them in to get a different dataset.

In [13]:
from datasets import load_dataset

dataset = load_dataset("amazon_reviews_multi", "en")

Found cached dataset amazon_reviews_multi (C:/Users/kleinada/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)
100%|██████████| 3/3 [00:00<00:00, 199.98it/s]


In [None]:
dataset

In [None]:
roughly_word_size = 0
char_size = 0
for sample in dataset["train"]["review_body"]:
  roughly_word_size += len(sample.split())
  char_size += len(sample)
print(f"Characters in the set: {char_size}")
print(f"Rough number of words in the set: {roughly_word_size}")

So our small review dataset has about 6 million words. For context, BookCorpus has 800 million, so we are roughly two orders of magnitude smaller. All of English Wikipedia has about 2,500 million. There are datasets like C4, a cleaned-up version of Common Crawl, that run to hundreds of gigabytes or even terabytes. https://huggingface.co/datasets/c4
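
To put those sizes side by side, here is a small arithmetic sketch that takes the word counts quoted above at face value (they are approximate figures, not measurements made in this notebook) and expresses each corpus as a multiple of the review dataset.

```python
import math

# Approximate word counts quoted in the text above.
corpus_words = {
    "amazon reviews (en)": 6_000_000,
    "BookCorpus": 800_000_000,
    "English Wikipedia": 2_500_000_000,
}

baseline = corpus_words["amazon reviews (en)"]
for name, words in corpus_words.items():
    ratio = words / baseline
    # log10 of the ratio gives the gap in orders of magnitude.
    print(f"{name:22s} {words:>14,d} words  {ratio:7.1f}x  (10^{math.log10(ratio):.1f})")
```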

C4 was used to train the T5 model. This paper is very interesting for understanding pretraining and scaling across various axes. https://arxiv.org/abs/1910.10683

<img src = "https://i.imgur.com/1f9J0b6.png">



In [None]:
!pip install tokenizers

In [None]:
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
dir(models)

There are various options for tokenizers. Many models use WordPiece or Byte Pair Encoding. Here is a useful article explaining their differences. https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble/55416944#55416944

For this example we will use a byte-level Byte Pair Encoding tokenizer (ByteLevelBPETokenizer), the same scheme RoBERTa uses.
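
To make the idea behind Byte Pair Encoding concrete, here is a toy character-level sketch of its training loop: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. This is a simplified illustration and not what the tokenizers library does internally. The corpus and the helper names are made up, and the real implementation works on bytes and is far more optimized.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters mapped to its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
```

Each merge adds one entry to the vocabulary, so the number of merges is what a vocab_size setting ultimately controls.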

In [3]:
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()

In [4]:
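vocab_size = 20000
# Assumption: the special tokens were stripped to "" in this export; RoBERTa's standard set is restored here.
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
tokenizer.train_from_iterator(dataset["train"]["review_body"], vocab_size=vocab_size, special_tokens=special_tokens)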


In [None]:
!mkdir tokenizer

In [None]:
tokenizer.save_model("tokenizer")

In [None]:
tokenizer.get_vocab()

In [None]:
tokenized_text = tokenizer.encode_batch(dataset["train"]["review_body"])

In [None]:
import numpy as np

In [None]:
tokenized_text[0]

In [None]:
np.array(tokenized_text[0].ids)

In [None]:
np.array(tokenized_text[0].tokens)

In [None]:
tokenizer.decode(tokenized_text[0].ids)

In [None]:
!pip install transformers

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./tokenizer/vocab.json",
    "./tokenizer/merges.txt",
)
     

In [None]:
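# Assumption: the token strings were stripped to "" in this export. BertProcessing takes (sep, cls),
# which for a RoBERTa-style vocabulary are "</s>" and "<s>".
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)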

In [5]:
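from transformers import RobertaConfig
vocab_size = 20000

config = RobertaConfig(
    vocab_size=vocab_size + 5,  # presumably headroom for the five special tokens
    max_position_embeddings=514,  # 512 tokens plus RoBERTa's position-id offset of 2
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)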

In [7]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer", model_max_length=512)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [9]:
model.num_parameters()


101427493
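
The count reported above can be reproduced by hand, which is a useful check on what the config actually builds. The sketch below assumes the library defaults for the values the notebook does not set (hidden_size=768, intermediate_size=3072) and assumes that the masked-LM output projection shares its weight matrix with the word embeddings.

```python
# Values set in the RobertaConfig cell above.
vocab_size = 20000 + 5
max_position_embeddings = 514
type_vocab_size = 1
num_hidden_layers = 12

# Assumed library defaults (not set explicitly in the notebook).
hidden_size = 768
intermediate_size = 3072


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden_size  # one scale and one shift vector

embeddings = (
    vocab_size * hidden_size                 # word embeddings
    + max_position_embeddings * hidden_size  # position embeddings
    + type_vocab_size * hidden_size          # token type embeddings
    + layer_norm
)

per_layer = (
    3 * linear(hidden_size, hidden_size)      # query, key, value projections
    + linear(hidden_size, hidden_size)        # attention output projection
    + layer_norm
    + linear(hidden_size, intermediate_size)  # feed-forward up projection
    + linear(intermediate_size, hidden_size)  # feed-forward down projection
    + layer_norm
)

# Masked-LM head: dense layer, layer norm, and one output bias per vocabulary entry.
# The output projection matrix is assumed to be tied to the word embeddings.
lm_head = linear(hidden_size, hidden_size) + layer_norm + vocab_size

total = embeddings + num_hidden_layers * per_layer + lm_head
print(f"{total:,}")
```

About 85 million of the parameters sit in the twelve transformer layers, and roughly 15 million are word embeddings, so shrinking the vocabulary has only a modest effect on model size.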

In [10]:
def tokenize_function(examples):
    return tokenizer(examples["review_body"], truncation = True)

In [11]:
import transformers
transformers.__version__

'4.25.1'

In [14]:
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["review_body"])


In [None]:
block_size = tokenizer.model_max_length

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
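
A toy run makes the effect of group_texts easier to see. The sketch below repeats the same concatenate-and-chunk logic with block_size passed in as an argument so that it runs on its own (the labels copy is left out). The token ids are made up, and a block size of 4 is used only to keep the output readable.

```python
def group_into_blocks(examples, block_size):
    """Concatenate all sequences per column, then cut into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Three made-up tokenized reviews (4 + 3 + 3 = 10 tokens in total).
batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 2], [0, 31, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1]],
}

blocks = group_into_blocks(batch, block_size=4)
for row in blocks["input_ids"]:
    print(row)
```

Note that a block can straddle two reviews, and the last two tokens are discarded because they do not fill a block.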

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(['review_id', 'product_id', 'reviewer_id', 'stars', 'review_title', 'language', 'product_category'])

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    "amazon_reviews",  # output directory for checkpoints
    evaluation_strategy="steps",  # evaluate every eval_steps optimizer steps
    eval_steps=500,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    # push_to_hub=True,
    fp16=True,  # mixed precision; requires a CUDA GPU
    num_train_epochs=20,
)
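
Before launching a run it can help to estimate how many optimizer steps and evaluations these settings imply. The sketch below assumes a single GPU and no gradient accumulation. The number of training blocks is a hypothetical placeholder; substitute len(lm_datasets["train"]) once it is known.

```python
import math


def training_schedule(num_blocks, batch_size, epochs, eval_steps):
    """Return (steps per epoch, total steps, number of evaluations)."""
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    total_steps = steps_per_epoch * epochs
    return steps_per_epoch, total_steps, total_steps // eval_steps


num_blocks = 20_000  # hypothetical count of 512-token training blocks
steps_per_epoch, total_steps, num_evals = training_schedule(
    num_blocks, batch_size=16, epochs=20, eval_steps=500
)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total, {num_evals} evaluations")
```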

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm = True, mlm_probability=0.15)
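
To show roughly what the collator does to each batch, here is a plain-Python sketch of BERT-style masking: about 15% of the non-special tokens are selected as prediction targets, and of those about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. This assumes the collator's default behaviour and is not its actual implementation. The function name and the token ids are illustrative.

```python
import random


def mask_for_mlm(input_ids, mask_token_id, vocab_size, special_ids,
                 mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    inputs = list(input_ids)
    labels = [-100] * len(inputs)  # -100 marks positions the loss should ignore
    for i, token_id in enumerate(input_ids):
        if token_id in special_ids:
            continue  # never mask <s>, </s>, <pad>, ...
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token_id  # the model has to recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels


# Hypothetical ids: 0 = <s>, 2 = </s>, 4 = <mask>; 100..139 stand in for real tokens.
sequence = [0] + list(range(100, 140)) + [2]
masked, labels = mask_for_mlm(
    sequence, mask_token_id=4, vocab_size=20005, special_ids={0, 1, 2, 3, 4}
)
print(masked)
print(labels)
```

Because the masking is drawn fresh for every batch, the model sees different blanks each epoch even though the underlying text is the same.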

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
trainer.train()

In [None]:
input = tokenizer("The dog chased the red <mask>", return_tensors = "pt")

In [None]:
input = input.to("cuda")

In [None]:
output = model(**input)

In [None]:
tokenizer.mask_token_id

In [None]:
input["input_ids"].to("cpu")

In [None]:
# torch.where returns (row_indices, column_indices); the columns are the token positions
torch.where(input["input_ids"].to("cpu") == tokenizer.mask_token_id)[1]

In [None]:
mask_token_index = torch.where(input["input_ids"].to("cpu") == tokenizer.mask_token_id)[1]


In [None]:
mask_token_logits = output["logits"].detach().cpu()[0, mask_token_index, :]
# Pick the <mask> candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
print(mask_token_logits)
for token in top_5_tokens:
    print(f" {tokenizer.decode(token)}")
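
The logits printed above are unnormalized scores, not probabilities. If you want to report how confident the model is in each candidate, a softmax over the vocabulary dimension converts them. The following is a dependency-free sketch with made-up logits for a five-token vocabulary; in the notebook itself, torch.softmax on mask_token_logits would do the same job.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return the k largest values as (index, value) pairs, best first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return [(i, values[i]) for i in order[:k]]


logits = [2.0, -1.0, 0.5, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
probs = softmax(logits)
for index, p in top_k(probs, 3):
    print(f"token id {index}: p = {p:.3f}")
```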