# Overview
This notebook collects my notes from learning the Hugging Face libraries (`transformers` and `datasets`).

## Part 1

### Playing with the Pipeline class

In [2]:
from transformers import pipeline

# sentiment analysis it is
sent_pipe = pipeline('sentiment-analysis')

sents = ['Life does not seem to entail any meaning, my friend', 
         "Well, It is what it is", 
         "Man, I have never been this happy", 
         "Well, things could probably be better."]

res = sent_pipe(sents)
print(res)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'


[{'label': 'NEGATIVE', 'score': 0.9989687204360962}, {'label': 'POSITIVE', 'score': 0.9997546076774597}, {'label': 'POSITIVE', 'score': 0.9932047128677368}, {'label': 'NEGATIVE', 'score': 0.9978812336921692}]


In [3]:
# as the warning above says, relying on the pipeline's default model is not recommended in production:
# it is better to pass an explicit model name (and revision), as done below

# let's try generation

gen = pipeline('text-generation', max_length=100, num_return_sequences=3, model='distilgpt2')

answer = gen('I am so tired. I want to')
print(answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I am so tired. I want to say I could Step It up in the game and give it time I'm in like half-sleep, and it definitely is one of the reasons I decided to spend half my off-season at the start of next year. I don't know how it would help with any of the other things, so if I made my decisions so this would be something you'll appreciate as a bonus for me to go down and play against the best teams there has ever been"}, {'generated_text': "I am so tired. I want to die. For years of my life, I thought he couldn't find a place in my life. At age 31, I couldn't find anyone else with the resources to get me back in the U.S... and just go home happy. I also couldn't find someone to give me a phone call...\nWhy stop trying to get me back in the U.S.?\nI've been missing since July 2015. It's too good an"}, {'generated_text': "I am so tired. I want to live alone. Please let me know. I hate all of my life. I hate all I've heard about so far. I hate making a decision.\n\n\nFol

### Resources:
Worth checking:
1. [Understand BERT](https://huggingface.co/docs/transformers/model_doc/bert)
2. [Decoder models for text generation](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)

# Part 2

## Tokenizer API
The first preprocessing step is to tokenize the input and add the special tokens the model expects. Which special tokens are needed depends on the model in question; this information is generally available on the model's page on the Hub, and loading the tokenizer from the same checkpoint as the model takes care of it.
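
To make the two steps concrete, here is a toy sketch in plain Python. The vocabulary, the whitespace splitting and the `toy_encode` helper are all made up for illustration; real tokenizers use learned subword vocabularies, so this only mirrors the overall shape of the process (split, map to ids, wrap in special tokens).

```python
# Hypothetical vocabulary for illustration only; not the vocabulary of any real model.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "see": 4, "you": 5, "later": 6}


def toy_encode(text: str) -> list[int]:
    # step 1: split the text into tokens (here: lowercase + whitespace split)
    words = text.lower().split()
    # step 2: map each token to its id, falling back to [UNK] for unknown words
    ids = [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    # step 3: add the special tokens a BERT-like model expects around the sequence
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


print(toy_encode("See you later alligator"))
```

The unknown word maps to the `[UNK]` id, and the sequence starts and ends with the `[CLS]` and `[SEP]` ids, which is the same pattern as the `101 ... 102` ids in the real tokenizer output.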

In [4]:
# let's start using some models
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
model_inputs = tokenizer(['Well, I have better things to do for now. See you later', "Do you really think I cannot win this match ?"], 
                        padding=True,
                        truncation=True, 
                        return_tensors='pt')

print(model_inputs['input_ids'])
print(model_inputs['attention_mask'])

tensor([[ 101, 2092, 1010, 1045, 2031, 2488, 2477, 2000, 2079, 2005, 2085, 1012,
         2156, 2017, 2101,  102],
        [ 101, 2079, 2017, 2428, 2228, 1045, 3685, 2663, 2023, 2674, 1029,  102,
            0,    0,    0,    0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])


In [6]:
# now time to use a model
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)

# let's do some inference
inference = model(**model_inputs)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
# the base model loaded with AutoModel returns hidden states of shape:
# (batch size, sequence length, hidden size)
# to get one logit per class, load the checkpoint with its sequence classification head

from transformers import AutoModelForSequenceClassification, AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**model_inputs)

In [15]:
import torch
preds = torch.argmax(outputs.logits.detach(), dim=-1).tolist()
# extract the map from class indices to class names
itos = model.config.id2label

print([itos[p] for p in preds])

['POSITIVE', 'NEGATIVE']
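
The `argmax` above only gives the winning class. To get a confidence score like the ones the pipeline printed earlier, the logits can be passed through a softmax (as far as I understand, that is what the sentiment pipeline does for a two-class model). A minimal sketch in plain Python; the logit values are hypothetical, and `id2label` is a stand-in for `model.config.id2label`:

```python
import math


def softmax(logits: list[float]) -> list[float]:
    # subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: list[float], id2label: dict[int, str]) -> tuple[str, float]:
    # turn raw logits into (label, probability) for the most likely class
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# stand-in for model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
# hypothetical logits for two sentences, not taken from an actual model run
example_logits = [[-3.0, 3.2], [2.5, -2.1]]

for row in example_logits:
    print(predict(row, id2label))
```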


## Creating a Transformer

### Tokenization functions

In [None]:
from transformers import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
print(config)
# building the model from a config creates a BertModel from scratch: the weights are randomly initialized, not pretrained

In [10]:
# there are a couple of cool tokenizers that we can use
from transformers import BertTokenizer
BERT_CHECKPOINT = 'bert-base-cased'
bert_tokenizer = BertTokenizer.from_pretrained(BERT_CHECKPOINT)

sentence = 'fine tuning transformers is basically the first step in an NLP project'

tokens_data = bert_tokenizer(sentence)

# calling the tokenizer directly executes the entire tokenization pipeline
for k, v in tokens_data.items():
    print(k, v, sep='\t')

input_ids	[101, 2503, 19689, 11303, 1468, 1110, 11519, 1103, 1148, 2585, 1107, 1126, 21239, 2101, 1933, 102]
token_type_ids	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [11]:
# let's break it down a bit
# the 1st step is to split the sentence into tokens
tokens = bert_tokenizer.tokenize(sentence)
# the 2nd step is to convert each of these tokens to its numerical id
ids = bert_tokenizer.convert_tokens_to_ids(tokens)
# we can convert the numerical ids back to text as follows
string = bert_tokenizer.decode(ids)
print(string)

fine tuning transformers is basically the first step in an NLP project


### Multi-dimensional input 
* It is important to keep in mind that most models expect batched input.
* By default, the tokenizer returns plain Python lists. Use the `return_tensors` argument to get tensors, which is what batching requires.
* With great power comes great responsibility. Each element in the batch must have the exact same shape. This is where padding and the attention mask come in.

In [77]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)
from transformers import AutoModelForSequenceClassification
bm = AutoModelForSequenceClassification.from_pretrained(BERT_CHECKPOINT)

import torch

# first, let's extract the ids of the special tokens
pad_id = tokenizer.pad_token_id
sep_id = tokenizer.sep_token_id
cls_id = tokenizer.cls_token_id

def prepare_input(sentences: list[str], tokenizer):
    # convert each sentence to token ids and find the length of the longest one
    tokens = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s)) for s in sentences]
    max_length = max(len(t) for t in tokens)

    def pad(ids: list[int], num_pads: int) -> list[int]:
        # [CLS] + tokens + [SEP], then right-pad up to the common length
        return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id] + [tokenizer.pad_token_id] * num_pads

    def attention_mask(padded_ids: list[int]) -> list[int]:
        # 1 for real tokens, 0 for padding
        return [int(t != tokenizer.pad_token_id) for t in padded_ids]

    pads = [pad(s, max_length - len(s)) for s in tokens]

    return {"input_ids": torch.tensor(pads), "attention_mask": torch.tensor([attention_mask(s) for s in pads])}

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

In [78]:
sens = ['I am a good person', 'I love drinking tea with milk']
i1, m1 = (prepare_input(sens, tokenizer).values())
i1, m1

(tensor([[ 101,  146, 1821,  170, 1363, 1825,  102,    0],
         [ 101,  146, 1567, 5464, 5679, 1114, 6831,  102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 1, 1, 1]]))

In [79]:
o =  tokenizer(sens, padding=True, return_tensors='pt')
o['input_ids'], o['attention_mask']
# nice: the manual version gives the same result as the tokenizer call, so we got the hang of the process!

(tensor([[ 101,  146, 1821,  170, 1363, 1825,  102,    0],
         [ 101,  146, 1567, 5464, 5679, 1114, 6831,  102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 1, 1, 1]]))

### Tokenization API: More Details

In [None]:
# let's import a tokenizer really quick
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)

# there are several arguments in the tokenizer call that might be of use
help(tokenizer.__call__)
# 1. padding: whether and how to pad. False (the default) means no padding, True (or 'longest') pads to the
#    longest sequence in the batch, and 'max_length' pads to the value given by max_length
# 2. max_length: the maximum length of an input sequence, used by both padding and truncation
# 3. truncation: whether sequences exceeding max_length are cut down to max_length (they are not split into parts)
# 4. return_tensors: one of 'np', 'tf', 'pt'; the default value (None) makes the call return Python lists
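
The interaction between `padding`, `truncation` and `max_length` is easier to see on plain lists of ids. The function below is a toy imitation written for these notes, not the library's code: `pad_and_truncate` and its `pad_id` parameter are hypothetical, and it ignores details such as keeping the special tokens in place when truncating.

```python
def pad_and_truncate(batch, padding=False, truncation=False, max_length=None, pad_id=0):
    """Toy imitation of the tokenizer's padding/truncation options on lists of ids."""
    # truncation: cut every sequence that is longer than max_length
    if truncation and max_length is not None:
        batch = [ids[:max_length] for ids in batch]

    # padding: decide the target length, if any
    if padding in (True, "longest"):
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length in this sketch")
        target = max_length
    else:
        # no padding: sequences keep their own lengths
        return [list(ids) for ids in batch]

    return [list(ids) + [pad_id] * (target - len(ids)) for ids in batch]


batch = [[1, 2, 3, 4, 5], [6, 7]]
print(pad_and_truncate(batch))                                    # ragged lists, unchanged
print(pad_and_truncate(batch, padding=True))                      # padded to the longest sequence
print(pad_and_truncate(batch, padding="max_length",
                       truncation=True, max_length=4))            # everything has length 4
```

Only the padded variants can be turned into a rectangular tensor, which is why `return_tensors` is usually combined with `padding=True`.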


## Part 3: Fine-Tuning Models

### Data Preprocessing: with Datasets

In [1]:
# the Hugging Face Hub hosts not only models but datasets as well
# let's start with a minimalistic dataset for experimenting
from datasets import load_dataset
ds = load_dataset('glue', 'mrpc')
train_ds, val_ds, test_ds = ds['train'], ds['validation'], ds['test']
for d in train_ds:
    print(d['sentence1'])
    print(d['sentence2'])
    break

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset glue (C:/Users/bouab/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 299.59it/s]

Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .





In [4]:
# this dataset is for paraphrase detection: deciding whether 2 sentences are semantically equivalent
# the input is expected to be a pair of sentences
# let's start with a tokenizer
from transformers import AutoTokenizer
BERT_CHECKPOINT = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)
pair = tokenizer('He is a workaholic', 'He is working no stop')
for k, v in pair.items():
    print(k, v, sep='\t')


input_ids	[101, 1124, 1110, 170, 1250, 3354, 14987, 102, 1124, 1110, 1684, 1185, 1831, 102]
token_type_ids	[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
attention_mask	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [5]:
# we haven't considered the token_type_ids before
# they indicate which sentence each token belongs to
tokenizer.convert_ids_to_tokens(pair['input_ids'])
# so as we can see, the tokenizer returns [CLS] sentence1 [SEP] sentence2 [SEP], as well as token_type_ids to let the model know the boundaries of the sentences

['[CLS]',
 'He',
 'is',
 'a',
 'work',
 '##ah',
 '##olic',
 '[SEP]',
 'He',
 'is',
 'working',
 'no',
 'stop',
 '[SEP]']
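
The layout of a sentence pair can be reproduced by hand, which helps to see where the `token_type_ids` come from. This is a sketch in plain Python using the tokens printed above; `build_pair` is a hypothetical helper, not a library function.

```python
def build_pair(first_tokens: list[str], second_tokens: list[str]):
    # layout used for BERT-style pairs: [CLS] first [SEP] second [SEP]
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP]
    token_type_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    return tokens, token_type_ids


first = ["He", "is", "a", "work", "##ah", "##olic"]
second = ["He", "is", "working", "no", "stop"]

tokens, token_type_ids = build_pair(first, second)
for tok, seg in zip(tokens, token_type_ids):
    print(seg, tok)
```

The resulting segment ids are eight zeros followed by six ones, the same pattern as the `token_type_ids` returned by the tokenizer for this pair.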

In [6]:
# the Datasets library makes preprocessing efficient:
# a raw example from a Dataset object cannot be fed directly to a model, so some preprocessing is needed
# it is important to define these steps in a function

def preprocess(example):
    # all we need this time is to tokenize the sentence pairs
    # note: padding=True pads each example to the longest sequence in its map batch;
    # leaving the padding to a data collator at batching time is a common alternative
    return tokenizer(example['sentence1'], example['sentence2'], truncation=True, padding=True) 

# then we use the map function, with batched=True so that preprocess receives many examples at once
tokenized_ds = ds.map(preprocess, batched=True)
# the map function keeps the original features and adds the ones returned by preprocess

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-5b59d8dacba58b79.arrow
Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-175c3b5777158902.arrow


In [7]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [8]:
# let's check some of the individual examples
ex1 = tokenized_ds['train'][11]
for k, v in ex1.items():
    print(k, v, sep='\t')

sentence1	The Nasdaq composite index increased 10.73 , or 0.7 percent , to 1,514.77 .
sentence2	The Nasdaq Composite index , full of technology stocks , was lately up around 18 points .
label	0
idx	12
input_ids	[101, 1109, 11896, 1116, 1810, 4426, 14752, 7448, 2569, 1275, 119, 5766, 117, 1137, 121, 119, 128, 3029, 117, 1106, 122, 117, 4062, 1527, 119, 5581, 119, 102, 1109, 11896, 1116, 1810, 4426, 3291, 24729, 13068, 7448, 117, 1554, 1104, 2815, 17901, 117, 1108, 10634, 1146, 1213, 1407, 1827, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
token_type_ids	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [9]:
from transformers import DataCollatorWithPadding
# this collator pads the examples of a batch to a common length and returns them as tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
samples = [train_ds[i]['sentence1'] for i in range(10)]
s = data_collator({'input_ids': tokenizer(samples)['input_ids']})


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
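
The idea behind a padding collator is that each batch is padded only up to its own longest sequence, so short batches stay short. Below is a rough sketch of that idea in plain Python; `toy_collate` is a hypothetical stand-in and the ids are made up, so this is not how `DataCollatorWithPadding` is implemented.

```python
def toy_collate(features: list[dict], pad_id: int = 0) -> dict:
    # pad every example of this batch to the longest sequence in this batch only
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# made-up token ids, two batches with different maximum lengths
short_batch = [{"input_ids": [101, 7, 8, 102]}, {"input_ids": [101, 7, 102]}]
long_batch = [{"input_ids": [101, 7, 8, 9, 10, 11, 102]}, {"input_ids": [101, 7, 102]}]

print(len(toy_collate(short_batch)["input_ids"][0]))  # width of the first batch
print(len(toy_collate(long_batch)["input_ids"][0]))   # width of the second batch
```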


### The Trainer class

In [13]:
import torch
torch.cuda.is_available()

True

In [14]:
# let's train the model, we will use the Trainer class provided by the Hugging Face transformers library
from transformers import TrainingArguments
# honestly, I will just copy paste this part until I get to wrap my head around the 450-line doc string.
training_args = TrainingArguments('test-trainer') 
# the number of labels used for the upcoming task is different from the one used in pretraining
# so let's set that number in the model's definition

from transformers import AutoModelForSequenceClassification

bert = AutoModelForSequenceClassification.from_pretrained(BERT_CHECKPOINT, num_labels=2)

# this means that the model discards the head used in pretraining and adds a fully connected layer with 2 outputs, as specified.
# this new last layer is randomly initialized

from transformers import Trainer

trainer = Trainer(
    bert,
    training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# let's train the model

trainer.train()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

{'loss': 0.4971, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}


 73%|███████▎  | 1000/1377 [09:43<03:44,  1.68it/s]

{'loss': 0.2465, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}


100%|██████████| 1377/1377 [13:27<00:00,  1.71it/s]

{'train_runtime': 807.1518, 'train_samples_per_second': 13.633, 'train_steps_per_second': 1.706, 'train_loss': 0.2944130381554387, 'epoch': 3.0}





TrainOutput(global_step=1377, training_loss=0.2944130381554387, metrics={'train_runtime': 807.1518, 'train_samples_per_second': 13.633, 'train_steps_per_second': 1.706, 'train_loss': 0.2944130381554387, 'epoch': 3.0})
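
The numbers in the training log can be checked by hand. The sketch below assumes the run used what I understand to be the `TrainingArguments` defaults (batch size 8 on a single device, 3 epochs, initial learning rate 5e-5, linear decay with no warmup, logging every 500 steps). None of these are stated in the notebook, but they reproduce the logged values.

```python
import math

num_train_examples = 3668  # size of the train split, from the DatasetDict output
batch_size = 8             # assumption: default per-device batch size, single device
num_epochs = 3             # assumption: default number of epochs
initial_lr = 5e-5          # assumption: default initial learning rate

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 1377, the total shown in the progress bar


def linear_lr(step: int) -> float:
    # linear decay from initial_lr to 0 over the whole run, no warmup
    return initial_lr * (1 - step / total_steps)


for step in (500, 1000):
    # compare with the 'learning_rate' and 'epoch' values in the logged dictionaries
    print(step, linear_lr(step), round(step / steps_per_epoch, 2))
```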