# Introduction to AI part 2: fine tuning

This time we will fine-tune a model first and train a model from scratch second. In either case we will NOT design a new architecture but pick an existing one.

## Fine tuning

Fine tuning refers to taking a pre-trained model (it may have been fully trained for a specific task, or pre-trained with an objective such as masked language modelling) and adjusting it for the task at hand using a more limited dataset.
Today we will use distilbert-base-uncased because it is small, and we will use it for sentence classification. There are many other tasks, and the Hugging Face documentation has decent examples for most of them.

In [None]:
# load an existing dataset, we will be doing sentiment analysis

from datasets import load_dataset
imdb = load_dataset("imdb")

In [None]:
imdb["test"][0]

In [None]:
# this is mostly a convenience: as long as you remember which label is which, you can apply the mapping afterwards at inference time

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In the previous notebook we talked about AutoTokenizer and AutoModel (or at least used them). These are generic classes that can load a model for any supported architecture (and there are a lot!). The `.from_pretrained` methods are used to load existing models. These can be models from Hugging Face, but they can also be models that you have fine-tuned yourself and now want to use for inference.

The TrainingArguments and Trainer classes store the information that you need to start training.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# padding, truncation and max_length take effect when the tokenizer is called (see preprocess_function), not when it is loaded
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

We will create a preprocess function that calls our tokenizer with the options we need. Truncation cuts off any text beyond the maximum length, which is 512 tokens, and padding fills shorter texts with padding tokens (id 0 for this tokenizer) so that every sequence in a batch has the same length. You can set the maximum length to any value on the tokenizer, but you cannot do the same for the model: it does not accept sequences longer than its own maximum, so raising the tokenizer limit beyond that gains you nothing. If you want to change the context length of a model you will need to 1) change its architecture and 2) re-train it. That said, there are many models with widely varying context windows, from 256 to 1 million tokens.
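
To make truncation and padding concrete, here is a minimal pure-Python sketch of what happens to a single list of token ids. It is only an illustration: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop everything past max_length
    n_pad = max_length - len(kept)           # how many padding tokens a short input needs
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# made-up token ids, with a tiny max_length so the result is easy to read
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)  # short input gets padded
print(long_ids, long_mask)    # long input gets cut
```

The attention mask is what lets the model tell real tokens from padding, which is why the tokenizer returns it alongside the input ids.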

We will then use this preprocess function to tokenize our dataset and get it ready for training. It's a small dataset and we can hold the entire thing in memory, but when a dataset is larger than your RAM you can use the [datasets](https://huggingface.co/docs/datasets/index) package to create data streams. That feature is not discussed here but it is very well documented.

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

In [None]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [None]:
list(imdb.keys())

The [evaluate](https://huggingface.co/docs/evaluate/index) package holds several useful functions for, well, evaluation. You can load your favourite metric or pass an arbitrary function. We then need to wrap the metric in yet another function, because not every model outputs things the same way, so you have to reshape the model outputs and the labels into something the metric function accepts. In the example below, `eval_pred` holds the model's outputs (logits) for the test/validation set together with the labels we already know. We pick the highest-scoring class for each example (`np.argmax`) and pass the predictions and the labels to the accuracy metric's `compute` method.

In [None]:
import evaluate
import numpy as np  # needed for np.argmax in compute_metrics

accuracy = evaluate.load("accuracy")  # precision, recall, f1 etc. are also available

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
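
If it is unclear what the argmax step does, the sketch below reproduces the same logic in plain Python on four made-up examples. The names `argmax` and `accuracy_from_logits` are hypothetical and stand in for `np.argmax(..., axis=1)` and the accuracy metric; the logits are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a list (what np.argmax does per row)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    # one predicted class id per example: the column with the highest score
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# made-up logits for 4 reviews, columns are [NEGATIVE, POSITIVE]
logits = [[2.1, -0.3], [-1.0, 0.5], [0.2, 0.1], [-0.7, 1.9]]
labels = [0, 1, 1, 1]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}, the third example is misclassified
```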

Here we define how we are going to train our model. There are many, many more options, but the transformers package already has a lot of very sensible defaults and they are generally a very good start. As a general rule you will not go from 70% accuracy to 99% accuracy by playing around with hyperparameters; it will be more like from 85% to 89% on a good day. As with any ML model, the data quality and the model's capacity to capture all of the data's dimensions are the most important factors.

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,  # most important parameter when fine-tuning
    per_device_train_batch_size=16,  # as large as your GPU allows
    per_device_eval_batch_size=16,  # same as above
    num_train_epochs=2,  # better to overshoot and load a previous checkpoint
    weight_decay=0.01,  # a small value (this one is reasonable) to limit overfitting; like the learning rate, it is found by trial and error
    evaluation_strategy="epoch",  # renamed eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)
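
To get a feel for what these numbers mean in practice, here is a small back-of-the-envelope calculation of how many optimizer steps this configuration produces. It is a rough sketch that assumes a single GPU, no gradient accumulation, and the 25,000 reviews in the IMDB training split; `count_steps` is a hypothetical helper, not part of transformers.

```python
import math


def count_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch and in total, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = count_steps(num_examples=25_000, batch_size=16, num_epochs=2)
print(steps_per_epoch, total_steps)  # 1563 3126
```

With evaluation and saving set to run once per epoch, that means a checkpoint and an accuracy figure after each block of 1563 steps.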

While these parameters are important, the most important thing is your data and your labels. If they are not good, there is nothing you can do.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"], #need to pass tokenized dataset
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)

In [None]:
trainer.train()

## Training from scratch

Unlike fine-tuning, this takes a lot more resources, because the weights start from a random initialization instead of a pre-trained checkpoint and the model has to learn everything from our data. We will be training a masked language model for Esperanto, again using the distilbert-base-uncased model architecture. We will need to create our own tokenizer, along with a bunch of other pieces for masked language modelling.

### Tokenizer

There are many different kinds of tokenizers. We will be using a WordPiece tokenizer, which is the one BERT models use. You can also use byte-pair encoding (BPE), which the GPT models use, or other schemes. Check the Hugging Face tokenizers documentation for more details.

In [None]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)

# each tokenizer has different parameters, you will need to check the documentation
# in this case we are creating a tokenizer with a 50K vocab and up to 1000 different characters; since we are training
# a single language model this is probably overkill, but if you are training on multiple languages, or you have symbols,
# emojis etc. you might want to keep the number large. Also check your encoding: you cannot have emojis in ASCII, for example
tokenizer.train(files="epo_literature_2011_300K-sentences.txt",
                vocab_size=50_000, min_frequency=2,
                limit_alphabet=1000,
                special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])

In [None]:
tokenizer.save_model("tokenizer")

There are many ways to tokenize a text. This can be as simple as splitting the text at every whitespace, or as complicated as using a rule-based linguistic model (like spaCy) to get a word's root and any prefixes and suffixes and tokenize them separately. A good starting point is generally a [byte-pair encoding tokenizer](https://huggingface.co/learn/llm-course/en/chapter6/5).

In [None]:
# reload the tokenizer from the vocabulary file we just saved
from tokenizers.implementations import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    "./tokenizer/vocab.txt"
)

In [None]:
tokenizer.encode("Mi estas Julien.").tokens
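
For intuition about what comes back from `encode`, the sketch below shows the greedy longest-match-first lookup that WordPiece tokenizers use to split a single word into pieces. It is a simplified illustration under stated assumptions: `wordpiece_tokenize` is a hypothetical helper, the vocabulary is a made-up toy one (the trained vocabulary will differ), and the real tokenizer also normalizes text and splits off punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one lowercased word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # pieces inside a word carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# toy vocabulary, invented for this example
toy_vocab = {"mi", "est", "##as", "jul", "##ien", "."}

pieces = [wordpiece_tokenize(w, toy_vocab) for w in ["mi", "estas", "julien", "xyz"]]
print(pieces)  # [['mi'], ['est', '##as'], ['jul', '##ien'], ['[UNK]']]
```

A larger vocabulary means more words survive as a single piece, which is one reason the vocabulary size is worth thinking about when training the tokenizer.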

Now that we have our tokenizer we need to pick a model. Usually you pick a model and use the tokenizer that comes with it; except for rare cases you cannot really use a different tokenizer than the one the model was trained with. When we are training a model from scratch we do not need the previously trained weights. All we need is the model configuration, which you can download or write yourself. In the Hugging Face repository you can see this configuration among the model's files. [Here](https://huggingface.co/distilbert/distilbert-base-uncased/blob/main/config.json) is an example.

In [None]:
from transformers import DistilBertConfig #RobertaConfig

config = DistilBertConfig(
    vocab_size=50_000,
    max_position_embeddings=514)
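
One reason the configuration has to agree with the tokenizer is that `vocab_size` fixes the shape of the embedding table. The rough calculation below shows how many parameters the two settings above account for. It assumes the default DistilBERT hidden size of 768, which we did not override; `embedding_params` is a hypothetical helper used only for this arithmetic.

```python
def embedding_params(num_rows, hidden_dim):
    """An embedding table stores one vector of size hidden_dim per row."""
    return num_rows * hidden_dim


hidden_dim = 768  # assumption: the DistilBertConfig default, not set explicitly above

word_embedding_params = embedding_params(50_000, hidden_dim)    # one row per vocabulary entry
position_embedding_params = embedding_params(514, hidden_dim)   # one row per position

print(word_embedding_params)      # 38400000
print(position_embedding_params)  # 394752
```

So the vocabulary alone costs roughly 38 million parameters, and every token id the tokenizer can produce must have a row in that table.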

In [None]:
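from transformers import DistilBertTokenizer

# wrap the trained vocabulary in a transformers tokenizer so it works with the model and the Trainer
tokenizer = DistilBertTokenizer.from_pretrained("./tokenizer/", model_max_length=512)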

In [None]:
from transformers import DistilBertForMaskedLM
model = DistilBertForMaskedLM(config=config)

We have everything we need, except for the most important part: the data. I will not be sharing this dataset because it's too big to put on GitHub, but it's basically a bunch of sentences written in Esperanto. It does not have any labels. What we are doing at the moment is pre-training: we are just teaching the model to speak Esperanto. Once that is done we can work on fine tuning (like above) for a specific task. BERT models are "masked language models", which means that for a given data point we hide a fraction of the tokens (usually 15%) behind a `[MASK]` token and ask the model to predict what goes in there. As the model goes through examples it starts learning the relationships between different tokens.

During fine tuning we exploit these learned relationships between tokens and ask the model to assign the labels we care about.

In [None]:
import pandas as pd
data=pd.read_csv("epo_literature_2011_300K-sentences.txt", header=None, sep="\t")
data=data.rename(columns={0:"text"})

In [None]:
data

In [None]:
from datasets import Dataset
dataset=Dataset.from_pandas(data)

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

dataset_tokenized=dataset.map(preprocess_function)

The data collator class is the part that sets up this masking. All we need to specify is what kind of model we are dealing with and a few additional parameters. Of course it needs our tokenizer, because without it there is no `[MASK]` token.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
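
To show roughly what the collator does to each example, here is a pure-Python sketch of the masking step. It follows the usual BERT recipe that `DataCollatorForLanguageModeling` applies by default: of the selected tokens, 80% become `[MASK]`, 10% become a random token and 10% stay unchanged, and every other position gets the label -100 so the loss ignores it. `mask_tokens` is a hypothetical helper and all token ids below are made up.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling on one sequence."""
    if rng is None:
        rng = random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 means "do not compute a loss here"
    for i, tok in enumerate(token_ids):
        # never mask special tokens; select the rest with probability mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token as it is
    return inputs, labels


# made-up ids: 2 and 3 play the role of [CLS] and [SEP], 4 the role of [MASK]
tokens = [2] + list(range(10, 10_010)) + [3]
inputs, labels = mask_tokens(tokens, mask_id=4, vocab_size=50_000,
                             special_ids={2, 3}, rng=random.Random(0))

selected = sum(label != -100 for label in labels)
print(selected / len(tokens))  # close to 0.15
```

Because the selection is random, the same sentence gets masked differently every time it is seen, which gives the model more variety than a fixed masking would.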



In [None]:

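from transformers import Trainer, TrainingArguments

# hold out a small validation split: evaluation_strategy="epoch" and
# load_best_model_at_end=True both need an eval_dataset (the 5% size and the seed are arbitrary choices)
split = dataset_tokenized.train_test_split(test_size=0.05, seed=42)

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,  # carried over from the fine-tuning run above, worth tuning for pre-training
    per_device_train_batch_size=16,  # as large as your GPU allows
    num_train_epochs=2,  # better to overshoot and load a previous checkpoint
    weight_decay=0.01,  # a small value to limit overfitting; like the learning rate, it is found by trial and error
    evaluation_strategy="epoch",  # renamed eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)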

In [None]:
trainer.train()