# CCM AI syncup session2

This time we will first fine-tune a model and then train a model from scratch. In both cases we will NOT design a new architecture but pick an existing one.

## Fine tuning

Fine-tuning means taking a pre-trained model (either fully trained for a specific task or pre-trained with an objective such as masked language modelling) and adapting it to a specific task using a smaller dataset.
Today we will use distilbert-base-uncased because it is small, and we will use it for sentence classification. There are many other tasks, and the Hugging Face documentation has decent examples of most of them.

In [1]:
# load an existing dataset, we will be doing sentiment analysis

from datasets import load_dataset
imdb = load_dataset("imdb")

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [2]:
imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [3]:
# This is mostly a convenience: as long as you remember which label is which, you can apply the mapping after the fact at inference time.

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
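
As a rough illustration of applying the mapping after the fact: assuming the classifier returns one score (logit) per class, the predicted class id is the index of the largest score, and `id2label` turns it into a name. The helper `predict_label` and the logits below are made up for this sketch; they are not real model output.

```python
# Toy illustration: map raw class scores to a readable label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {name: idx for idx, name in id2label.items()}


def predict_label(logits):
    """Return the label name of the highest-scoring class (hypothetical helper)."""
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_id]


# Invented logits for two reviews
print(predict_label([2.1, -1.3]))
print(predict_label([-0.4, 0.9]))
```

Passing the two dictionaries to the model simply stores them in its config, so the label names travel with the saved checkpoint.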

In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# padding and truncation are set when the tokenizer is called (see preprocess_function), not when it is loaded
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

Some weights of the model checkpoint at distilbert/distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier

In [8]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [25]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)
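
To make the `truncation`, `padding` and `max_length` arguments concrete, here is a minimal pure-Python sketch of what they do to a batch of token ids. It is a simplification, not the library's implementation: the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch ignores. The function name and the token ids are invented; the pad id of 0 matches the `padding_idx=0` shown in the model printout.

```python
PAD_ID = 0  # matches padding_idx=0 in the model's word embeddings


def pad_and_truncate(batch, max_length=512, pad_id=PAD_ID):
    """Cut each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, with a tiny max_length so the truncation is visible
out = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 7, 102]], max_length=4)
print(out["input_ids"])
print(out["attention_mask"])
```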

In [26]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [12]:
list(imdb.keys())

['train', 'test', 'unsupervised']

In [13]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")  # precision, recall, f1, etc. are also available

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
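
For intuition about what `compute_metrics` reports, the sketch below recomputes accuracy by hand, plus the precision, recall and F1 mentioned in the comment, using plain Python. The helper names, logits and labels are invented for illustration; in practice the `evaluate` metrics do this work.

```python
def argmax_rows(scores):
    """Index of the largest score in each row, like np.argmax(scores, axis=1)."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in scores]


def binary_metrics(predictions, references, positive=1):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented logits for four reviews and their true labels
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.2]]
labels = [0, 1, 1, 1]
preds = argmax_rows(logits)
metrics = binary_metrics(preds, labels)
print(preds)
print(metrics)
```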

In [18]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,  # the most important parameter when fine-tuning
    per_device_train_batch_size=16,  # as large as your GPU memory allows
    per_device_eval_batch_size=16,  # same as above
    num_train_epochs=2,  # better to overshoot and load an earlier checkpoint
    weight_decay=0.01,  # a small value (this one is reasonable) to limit overfitting; like the learning rate, it is found by trial and error
    evaluation_strategy="epoch",  # recent transformers releases call this eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
)
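
To get a feel for how long a run with these settings is, the arithmetic below estimates the number of optimizer steps. It assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept; the function name is made up for this sketch. The 25,000 figure is the size of the IMDB train split.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Rough step count: batches per epoch (rounded up) times number of epochs."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


steps = total_optimizer_steps(num_examples=25_000, per_device_batch_size=16, num_epochs=2)
print(steps)  # 1563 batches per epoch, two epochs
```

Doubling the batch size roughly halves the step count, which is one reason the learning rate and batch size are usually tuned together.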

While these parameters matter, the most important thing is your data and your labels: if they are not good, there is nothing you can do.

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"], #need to pass tokenized dataset
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)

In [None]:
trainer.train()

## Training from scratch

Unlike fine-tuning, this takes far more resources because the weights start from random initialization instead of a pre-trained checkpoint, so the model has to learn everything from our data. We will train a masked language model for Esperanto, again using the DistilBERT architecture behind distilbert-base-uncased. We will need to build our own tokenizer, along with a few other pieces for masked language modelling.

### Tokenizer

There are many kinds of tokenizers. We will use one of the simplest, a WordPiece tokenizer, the kind BERT uses. You could also use byte-pair encoding (BPE), which GPT models use in a byte-level variant, or other subword schemes. Check the Hugging Face tokenizers documentation for more details.
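
To show how a WordPiece tokenizer splits a word once it has a vocabulary, here is a minimal sketch of greedy longest-match-first splitting in plain Python. It covers only the lookup step, not how the vocabulary is learned, and the toy vocabulary and function name are invented for illustration. Pieces that continue a word carry the `##` prefix.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one lowercased word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces are marked
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # nothing fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, far smaller than a real one
toy_vocab = {"mi", "estas", "est", "julie", "##n", "##as"}
print(wordpiece("julien", toy_vocab))
print(wordpiece("estas", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because the longest match wins, `estas` stays whole even though `est` and `##as` are also in the toy vocabulary.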

In [1]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)

# each tokenizer has different parameters, so check the documentation
# here we create a tokenizer with a 50K vocabulary and an alphabet of up to 1000 distinct characters; since we are training
# a single-language model this is probably overkill, but if you train on multiple languages or have symbols, emojis, etc.
# you might want to keep the number large; also check your encoding, you cannot have emojis in ASCII, for example
tokenizer.train(files="epo_literature_2011_300K-sentences.txt",
                vocab_size=50_000, min_frequency=2,
                limit_alphabet=1000,
                special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])

In [2]:
tokenizer.save_model("tokenizer")

['tokenizer/vocab.txt']

In [3]:
from tokenizers import BertWordPieceTokenizer

# reload the trained tokenizer from the saved vocabulary file
tokenizer = BertWordPieceTokenizer("./tokenizer/vocab.txt")

In [4]:
tokenizer.encode("Mi estas Julien.").tokens

['[CLS]', 'mi', 'estas', 'julie', '##n', '.', '[SEP]']

In [5]:
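from transformers import DistilBertConfig  # RobertaConfig

config = DistilBertConfig(
    vocab_size=50_000,  # must match the tokenizer's vocabulary size
    max_position_embeddings=514)  # 514 follows the RoBERTa convention; DistilBERT's own default is 512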
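
The vocabulary size has a direct cost in parameters. The arithmetic below compares the embedding tables of the pre-trained checkpoint (30,522 tokens and 512 positions, as in the model printout earlier) with this configuration, assuming the default hidden size of 768. It counts only the word and position embeddings, not LayerNorm or the transformer blocks, and the function name is made up for this sketch.

```python
def embedding_params(vocab_size, max_positions, dim=768):
    """Number of weights in the word and position embedding tables."""
    return vocab_size * dim + max_positions * dim


pretrained = embedding_params(vocab_size=30_522, max_positions=512)
ours = embedding_params(vocab_size=50_000, max_positions=514)
print(pretrained)         # 23,834,112
print(ours)               # 38,794,752
print(ours - pretrained)  # extra weights caused mostly by the larger vocabulary
```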

In [6]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained("./tokenizer/", model_max_length=512)

In [7]:
from transformers import DistilBertForMaskedLM
model = DistilBertForMaskedLM(config=config)

In [10]:
import pandas as pd
data=pd.read_csv("epo_literature_2011_300K-sentences.txt", header=None, sep="\t")
data=data.rename(columns={0:"text"})

In [11]:
data

Unnamed: 0,text
0,0.10 La trajno forlasis Vincovci (malfrue). 0....
1,101 De çiu malbona vojo mi detenas mian piedon...
2,103 Kiel dolça estas por mia palato Via vorto!
3,103 Malfeliˆa homo plendas pro tio ke neniu ha...
4,104 Verkisto diras pri kolego: “Li estas sprit...
...,...
276699,"Þi havis larmojn en la okuloj, kiam þia patro ..."
276700,"Þi povas eliri vespere, sen nerveksciteco dum ..."
276701,"Þi respondis: - Franæjo, tion mi ne scias.."
276702,"Þi ridetis kaj respondis: - Jes, sed Jesuo dez..."
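
Some rows above, for example those starting with `Þi`, look like text that was stored as ISO-8859-3 (Latin-3, which covers the Esperanto letters) and then decoded as Latin-1. That is only a guess about this file, and other rows (`çiu`, `Malfeliˆa`) do not fit the pattern and would need separate handling. Under that assumption, a minimal repair sketch with a hypothetical helper looks like this:

```python
def latin1_to_latin3(text):
    """Re-read text that was wrongly decoded as Latin-1 as ISO-8859-3 instead."""
    return text.encode("latin-1").decode("iso-8859-3")


# Two snippets taken from the rows shown above
print(latin1_to_latin3("Þi ridetis kaj respondis"))
print(latin1_to_latin3("Franæjo"))
```

The call raises an error on characters that do not exist in Latin-1 or on bytes Latin-3 leaves undefined, so it is safer to apply it row by row and inspect the failures than to convert the whole column blindly.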


In [13]:
from datasets import Dataset
dataset=Dataset.from_pandas(data)

In [14]:
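def preprocess_function(examples):
    # padding is left to the data collator, which pads each training batch on the fly
    return tokenizer(examples["text"], truncation=True, max_length=512)

# batched=True tokenizes many sentences per call, which is much faster
dataset_tokenized = dataset.map(preprocess_function, batched=True)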

In [15]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
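
To make `mlm_probability=0.15` concrete, here is a pure-Python sketch of the masking rule the collator applies: about 15% of the non-special tokens are selected for prediction, and of those roughly 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. Positions that are not selected get the label -100 so the loss ignores them. The function name is hypothetical, and the special-token ids 0 to 4 assume they follow the order of the `special_tokens` list used when training the tokenizer.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue  # position is not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels


rng = random.Random(0)
ids = [2] + list(range(10, 30)) + [3]  # assumed [CLS]=2, [SEP]=3, made-up token ids
masked, labels = mask_tokens(ids, special_ids={0, 1, 2, 3, 4}, mask_id=4,
                             vocab_size=50_000, rng=rng)
print(masked)
print(labels)
```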



In [21]:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,  # copied from the fine-tuning run above; treat it as a starting point
    per_device_train_batch_size=16,  # as large as your GPU memory allows
    num_train_epochs=2,  # better to overshoot and load an earlier checkpoint
    weight_decay=0.01,  # a small value to limit overfitting; found by trial and error, like the learning rate
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# evaluating every epoch requires an eval_dataset, so hold out a small validation split (5% is an arbitrary choice)
split = dataset_tokenized.train_test_split(test_size=0.05, seed=42)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)

In [None]:
trainer.train()