# Overview
These are my working notes from learning the Hugging Face libraries (`transformers` and `datasets`).

## Part 1

### Playing with the Pipeline class

In [2]:
from transformers import pipeline

# sentiment analysis it is
sent_pipe = pipeline('sentiment-analysis')

sents = ['Life does not seem to entail any meaning, my friend', 
         "Well, It is what it is", 
         "Man, I have never been this happy", 
         "Well, things could probably be better."]

res = sent_pipe(sents)
print(res)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'


[{'label': 'NEGATIVE', 'score': 0.9989687204360962}, {'label': 'POSITIVE', 'score': 0.9997546076774597}, {'label': 'POSITIVE', 'score': 0.9932047128677368}, {'label': 'NEGATIVE', 'score': 0.9978812336921692}]


In [3]:
# as the warning above says, a pipeline created without an explicit model and revision is not meant for production
# it is mainly useful for quick demos

# let's try generation

gen = pipeline('text-generation', max_length=100, num_return_sequences=3, model='distilgpt2')

answer = gen('I am so tired. I want to')
print(answer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am so tired. I want to do more as soon as possible. Please forgive your pain and I can never live again."'}, {'generated_text': "I am so tired. I want to go out to work. We've all been together and have been talking since the very beginning but we are doing the same things.\n\n\nIf you've been coming at me for so long, you've given my energy.\nI don't feel my heart can pick up on what I was doing.\nIt takes less. I just want to go out and be awesome.\nI believe in your power, but no sooner can you make it to"}, {'generated_text': "I am so tired. I want to feel the feeling of having it all over again. It was nice when my mom asked me to join me, she would tell me to stay in my home. But she told me to let me spend my days with him. Not by any means. I couldn't even keep my mind shut. I had to watch everything. It was hard.\nI was nervous after what she said and the other people on the show. I had to watch the whole episode"}]


### Resources:
Worth reading:
1. [Understand Bert](https://huggingface.co/docs/transformers/model_doc/bert)
2. [Decoder Models For text generation](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)

# Part 2

## Tokenizer API
The first preprocessing step is tokenizing the input and adding any special tokens the model expects. Which special tokens are needed depends on the model in question; this information is generally available on the model hub.
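
To make the two steps concrete, here is a dependency-free sketch. The vocabulary and the `encode` helper are made up for illustration (a real tokenizer loads a subword vocabulary from the checkpoint); only the special-token ids 0, 101 and 102 mirror what the outputs in this notebook show for BERT-style models, and the `[UNK]` id is an assumption.

```python
# Toy vocabulary: the word ids are invented for this example.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "see": 5, "you": 6, "later": 7}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Lowercase, split on whitespace, map tokens to ids, add special tokens."""
    tokens = text.lower().split()
    # Words missing from the vocabulary fall back to the [UNK] id.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

encoded = encode("See you later alligator", TOY_VOCAB)
print(encoded)  # [101, 5, 6, 7, 100, 102]
```

The real tokenizer does the same thing in spirit, but splits text into subwords rather than whole words.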

In [4]:
# let's start using some models
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
model_inputs = tokenizer(['Well, I have better things to do for now. See you later', "Do you really think I cannot win this match ?"], 
                        padding=True,
                        truncation=True, 
                        return_tensors='pt')

print(model_inputs['input_ids'])
print(model_inputs['attention_mask'])

tensor([[ 101, 2092, 1010, 1045, 2031, 2488, 2477, 2000, 2079, 2005, 2085, 1012,
         2156, 2017, 2101,  102],
        [ 101, 2079, 2017, 2428, 2228, 1045, 3685, 2663, 2023, 2674, 1029,  102,
            0,    0,    0,    0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])


In [6]:
# now time to use a model
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)

# let's do some inference
inference = model(**model_inputs)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
# the base model (AutoModel) returns hidden states, generally of shape:
# 1. the batch size
# 2. the sequence length
# 3. the hidden dimension
# for classification we need a model with a classification head on top

from transformers import AutoModelForSequenceClassification, AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**model_inputs)

In [8]:
import torch
preds = torch.argmax(outputs.logits.detach(), dim=-1).tolist()
# extract the map from indices to class labels
itos = model.config.id2label

print([itos[p] for p in preds])

['POSITIVE', 'NEGATIVE']
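
The logits can also be turned into scores like the ones the pipeline printed earlier. As far as I understand, those scores are softmax probabilities over the logits. A minimal sketch in plain Python; the logits below are hypothetical, not taken from the run above, and the label mapping is assumed to match `model.config.id2label`.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two sentences.
toy_logits = [[-2.0, 3.0], [4.0, -1.5]]
# Assumed to match model.config.id2label for this checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in toy_logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```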


## Creating a Transformer

### Tokenization functions

In [9]:
from transformers import BertConfig, BertModel
config= BertConfig()
model = BertModel(config)
print(config)
# building the model from a config creates a BertModel with randomly initialized weights (nothing is loaded from a checkpoint)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [10]:
# there are a couple of cool tokenizers that we can use
from transformers import BertTokenizer
BERT_CHECKPOINT = 'bert-base-cased'
bert_tokenizer = BertTokenizer.from_pretrained(BERT_CHECKPOINT)

sentence = 'fine tuning transformers is basically the first step in an NLP project'

tokens_data = bert_tokenizer(sentence)

# calling the tokenizer directly runs the entire tokenization pipeline
for k, v in tokens_data.items():
    print(k, v, sep='\t')

input_ids	[101, 2503, 19689, 11303, 1468, 1110, 11519, 1103, 1148, 2585, 1107, 1126, 21239, 2101, 1933, 102]
token_type_ids	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [11]:
# let's break it down a bit
# the 1st step is to split the text into tokens
tokens = bert_tokenizer.tokenize(sentence)
# the 2nd step is to convert each of these tokens to its numerical id
ids = bert_tokenizer.convert_tokens_to_ids(tokens)
# we can convert the numerical ids back to text as follows
string = bert_tokenizer.decode(ids)
print(string)

fine tuning transformers is basically the first step in an NLP project
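
The `tokenize` step splits words into subword pieces. BERT tokenizers use WordPiece, which at inference time is a greedy longest-match-first search over the vocabulary. The sketch below is a simplified version under that assumption; the `wordpiece` helper and its tiny vocabulary are hypothetical.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"work", "##ah", "##olic", "tea", "##s"}
print(wordpiece("workaholic", toy_vocab))  # ['work', '##ah', '##olic']
print(wordpiece("teas", toy_vocab))        # ['tea', '##s']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```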


### Multi-dimensional input 
* It is important to keep in mind that most models expect batched input.
* By default the tokenizer returns plain Python lists. Pass `return_tensors` to get tensors that account for batching.
* With great power comes great responsibility: each element in the batch must have the exact same length. This is where padding and the attention mask come in.
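
Padding tokens must not influence the real tokens, and the attention mask is what enforces that inside the model. A rough sketch of the idea in plain Python: masked positions get a score of minus infinity before the softmax, so their attention weight is exactly zero. Real implementations typically add a large negative number to the scores instead; the scores here are hypothetical and the helper assumes at least one unmasked position.

```python
import math

def masked_softmax(scores: list[float], mask: list[int]) -> list[float]:
    """Softmax over `scores`, giving zero weight wherever mask == 0."""
    # Masked positions become -inf so that exp(-inf) == 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

attn_scores = [2.0, 1.0, 0.5, 3.0]  # hypothetical scores for 4 positions
attn_mask = [1, 1, 1, 0]            # the last position is padding
weights = masked_softmax(attn_scores, attn_mask)
print([round(w, 3) for w in weights])
```

Although the padded position has the highest raw score, it receives no weight, and the remaining weights still sum to 1.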

In [12]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)
bm = AutoModelForSequenceClassification.from_pretrained(BERT_CHECKPOINT)

# first let's start with extracting some interesting values

pad_id = tokenizer.pad_token_id
sep_id = tokenizer.sep_token_id
cls_id = tokenizer.cls_token_id

def prepare_input(sentences: list[str], tokenizer):
    # tokenize each sentence and find the length of the longest one
    tokens = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s)) for s in sentences]
    max_length = len(max(tokens, key=len))

    def pad(ids: list[int], num_pads: int) -> list[int]:
        # [CLS] + ids + [SEP], followed by the padding tokens
        return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id] + [tokenizer.pad_token_id] * num_pads

    def attention_mask(padded_ids: list[int]) -> list[int]:
        # 1 for real tokens, 0 for padding
        return [int(t != tokenizer.pad_token_id) for t in padded_ids]

    pads = [pad(s, max_length - len(s)) for s in tokens]

    return {"input_ids": torch.tensor(pads), "attention_mask": torch.tensor([attention_mask(s) for s in pads])}

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

In [13]:
sens = ['I am a good person', 'I love drinking tea with milk']
i1, m1 = (prepare_input(sens, tokenizer).values())
i1, m1

(tensor([[ 101,  146, 1821,  170, 1363, 1825,  102,    0],
         [ 101,  146, 1567, 5464, 5679, 1114, 6831,  102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 1, 1, 1]]))

In [14]:
o =  tokenizer(sens, padding=True, return_tensors='pt')
o['input_ids'], o['attention_mask']
# nice: the output matches prepare_input, so we got the hang of the process

(tensor([[ 101,  146, 1821,  170, 1363, 1825,  102,    0],
         [ 101,  146, 1567, 5464, 5679, 1114, 6831,  102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1, 1, 1, 1]]))

### Tokenization API: More Details

In [15]:
# let's import a tokenizer really quick
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)

# there are several arguments in the API call that might be of use
help(tokenizer.__call__)
# 1. padding: whether and how to pad. True (or 'longest') pads to the longest sequence in the batch,
#    'max_length' pads to max_length, and False (the default) does not pad at all
# 2. max_length: the maximum length of a sequence, used by both padding='max_length' and truncation
# 3. truncation: whether sequences exceeding max_length should be cut down to max_length
# 4. return_tensors: one of 'np', 'tf', 'pt'; the default (None) makes the call return Python lists


Help on method __call__ in module transformers.tokenization_utils_base:

__call__(text: Union[str, List[str], List[List[str]]] = None, text_pair: Union[str, List[str], List[List[str]], NoneType] = None, text_target: Union[str, List[str], List[List[str]]] = None, text_pair_target: Union[str, List[str], List[List[str]], NoneType] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, ve
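
To see how the `padding`, `max_length` and `truncation` arguments interact, here is a toy re-implementation on lists of ids. `pad_and_truncate` is a hypothetical helper, not part of the library, and it is cruder than the real tokenizer: it cuts the raw list, whereas the real tokenizer keeps the closing special token when truncating.

```python
def pad_and_truncate(batch, pad_id=0, padding="longest", max_length=None, truncation=False):
    """Toy version of the padding / truncation logic (hypothetical helper)."""
    if truncation and max_length is not None:
        # Truncation cuts sequences down; it does not split them into parts.
        batch = [ids[:max_length] for ids in batch]
    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        target = max_length
    else:
        return batch  # no padding requested
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]

toy_batch = [[101, 7, 8, 9, 10, 102], [101, 7, 102]]

longest = pad_and_truncate(toy_batch)
fixed = pad_and_truncate(toy_batch, padding="max_length", max_length=4, truncation=True)
print(longest)  # [[101, 7, 8, 9, 10, 102], [101, 7, 102, 0, 0, 0]]
print(fixed)    # [[101, 7, 8, 9], [101, 7, 102, 0]]
```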

## Part 3: Fine-Tuning Models

### Data Preprocessing: with Datasets

In [16]:
# the Hugging Face Hub hosts datasets as well as models
# let's start with a small dataset for experimenting
from datasets import load_dataset
ds = load_dataset('glue', 'mrpc')
train_ds, val_ds, test_ds = ds['train'], ds['validation'], ds['test']
for d in train_ds:
    print(d['sentence1'])
    print(d['sentence2'])
    break

Found cached dataset glue (C:/Users/bouab/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .


In [17]:
# this dataset is for paraphrase detection: deciding whether 2 sentences are equivalent
# the input is expected to be a pair of sentences
# let's start with a tokenizer
from transformers import AutoTokenizer
BERT_CHECKPOINT = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)
pair = tokenizer('He is a workaholic', 'He is working no stop')
for k, v in pair.items():
    print(k, v, sep='\t')


input_ids	[101, 1124, 1110, 170, 1250, 3354, 14987, 102, 1124, 1110, 1684, 1185, 1831, 102]
token_type_ids	[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
attention_mask	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [18]:
# we haven't considered the token_type_ids before
# they determine which tokens belong to which sentence
tokenizer.convert_ids_to_tokens(pair['input_ids'])
# so as we can see: the tokenizer returns [CLS] sentence1 [SEP] sentence2 [SEP], as well as token_type_ids to let the model know the borders of the sentences

['[CLS]',
 'He',
 'is',
 'a',
 'work',
 '##ah',
 '##olic',
 '[SEP]',
 'He',
 'is',
 'working',
 'no',
 'stop',
 '[SEP]']
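
The layout of a sentence pair can be rebuilt by hand. In this sketch the ids are copied from the tokenizer output above with the special tokens stripped; `build_pair` is a hypothetical helper, and the ids 101 and 102 for `[CLS]` and `[SEP]` are read off that same output.

```python
def build_pair(ids_a: list[int], ids_b: list[int], cls_id: int = 101, sep_id: int = 102):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token_type_ids."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

first = [1124, 1110, 170, 1250, 3354, 14987]   # He is a work ##ah ##olic
second = [1124, 1110, 1684, 1185, 1831]        # He is working no stop
pair_ids, pair_types = build_pair(first, second)
print(pair_ids)
print(pair_types)
```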

In [19]:
# working with the Datasets library is needed for efficiency:
# a single example of the Dataset object cannot be fed directly to a model, so some preprocessing is needed
# it is important to define such steps in a function

def preprocess(example):
    # all we need this time is to tokenize each pair of sentences
    # note: with batched=True, padding=True pads to the longest sequence of each map batch, not of the whole dataset
    return tokenizer(example['sentence1'], example['sentence2'], truncation=True, padding=True)

# then we use the map function, as well as the batched=True argument for batched preprocessing
tokenized_ds = ds.map(preprocess, batched=True)
# the map function will keep the original features in the dataset and add the new ones returned by preprocess

Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-44839de31f1bb325.arrow


Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-175c3b5777158902.arrow


In [20]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
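
What `map` with `batched=True` does can be sketched in a few lines: the function receives a dict of column lists for a slice of rows and returns new columns, which are appended next to the original ones. `batched_map` and `add_lengths` are hypothetical stand-ins, and the sketch assumes the function only returns new column names.

```python
def batched_map(columns: dict[str, list], fn, batch_size: int = 2) -> dict[str, list]:
    """Toy stand-in for map(fn, batched=True) on a dict of columns."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result

def add_lengths(batch: dict[str, list]) -> dict[str, list]:
    """Count the words of both sentences; stands in for a real tokenizer."""
    return {"n_words": [len(a.split()) + len(b.split())
                        for a, b in zip(batch["sentence1"], batch["sentence2"])]}

toy_columns = {"sentence1": ["a b", "c", "d e f"], "sentence2": ["x", "y z", "w"]}
mapped = batched_map(toy_columns, add_lengths)
print(mapped)
```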

In [21]:
# let's check some of the individual examples
ex1 = tokenized_ds['train'][11]
for k, v in ex1.items():
    print(k, v, sep='\t')

sentence1	The Nasdaq composite index increased 10.73 , or 0.7 percent , to 1,514.77 .
sentence2	The Nasdaq Composite index , full of technology stocks , was lately up around 18 points .
label	0
idx	12
input_ids	[101, 1109, 11896, 1116, 1810, 4426, 14752, 7448, 2569, 1275, 119, 5766, 117, 1137, 121, 119, 128, 3029, 117, 1106, 122, 117, 4062, 1527, 119, 5581, 119, 102, 1109, 11896, 1116, 1810, 4426, 3291, 24729, 13068, 7448, 117, 1554, 1104, 2815, 17901, 117, 1108, 10634, 1146, 1213, 1407, 1827, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
token_type_ids	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask	[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [22]:
from transformers import DataCollatorWithPadding
# this collator pads the tokenized samples of a batch to a common length and returns them as tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
samples = [train_ds[i]['sentence1'] for i in range(10)]
s = data_collator({'input_ids': tokenizer(samples)['input_ids']})


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
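
The idea behind this collator is dynamic padding: each batch is padded only to the length of its own longest sample, not to the longest sample of the whole dataset. A plain-Python sketch of that idea; `collate_with_padding` is a hypothetical helper returning lists rather than tensors, and the token ids are invented.

```python
def collate_with_padding(features: list[dict], pad_id: int = 0) -> dict[str, list]:
    """Pad every sample in one batch to that batch's longest sequence."""
    longest = max(len(f["input_ids"]) for f in features)
    out = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        out["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        out["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return out

toy_samples = [{"input_ids": [101, 5, 102]},
               {"input_ids": [101, 5, 6, 7, 102]},
               {"input_ids": [101, 8, 9, 10, 11, 12, 13, 102]},
               {"input_ids": [101, 9, 102]}]

batch_a = collate_with_padding(toy_samples[:2])  # padded to length 5
batch_b = collate_with_padding(toy_samples[2:])  # padded to length 8

dynamic_total = sum(len(row) for b in (batch_a, batch_b) for row in b["input_ids"])
static_total = len(toy_samples) * max(len(s["input_ids"]) for s in toy_samples)
print(dynamic_total, static_total)  # 26 32
```

Fewer padded positions means less wasted computation, which is why the tokenization step can skip padding and leave it to the collator.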


### The Trainer class

In [30]:
# to save all the preprocessing steps in one cell:
from transformers import AutoTokenizer
from functools import partial
# 1st import the data
data = load_dataset('glue', 'mrpc')
# we can easily extract the different components as follows
data_train, data_val, data_test = data['train'], data['validation'], data['test']

# to proceed one needs to understand the nature of the task and the data itself
# each two sentences should be tokenized together
# let's first create the tokenizer
paraphrase_tokenizer = AutoTokenizer.from_pretrained(BERT_CHECKPOINT)

# we define a function to process a given input, but first let's understand one sample of the dataset
train_sample = data_train[0]
print(train_sample)

def process_paraphrase_sentence(train_sample, tokenizer):
    return tokenizer(train_sample['sentence1'], train_sample['sentence2'], truncation=True, padding=True)   # the default will be the longest sequence in the batch

# let's apply this function to each batch using the map function
# keep in mind that map calls the function with a single argument (the batch of examples)
# partial comes to the rescue by binding the tokenizer argument:
process_function = partial(process_paraphrase_sentence, tokenizer=paraphrase_tokenizer)
tokenized_data = data.map(process_function, batched=True)

# we need a collator to make use of dynamic padding: each batch is padded to the length of its own longest sequence
data_collator = DataCollatorWithPadding(tokenizer=paraphrase_tokenizer)
# let's see our tokenized data
tokenized_train_sample = tokenized_data['train'][0]
print(tokenized_train_sample)
# we can see that the input_ids, token_type_ids and attention_mask fields were added

Found cached dataset glue (C:/Users/bouab/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-393aa7e5374d03aa.arrow


{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}


Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-590f93adf0413883.arrow


{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
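
The `partial` trick used in this cell is easier to see in isolation. In the sketch below, `fake_tokenizer` is a hypothetical stand-in that counts words instead of producing ids, so the example runs without downloading anything.

```python
from functools import partial

def fake_tokenizer(first: list[str], second: list[str]) -> dict[str, list]:
    """Stand-in tokenizer (hypothetical): counts words instead of producing ids."""
    return {"n_tokens": [len(a.split()) + len(b.split()) for a, b in zip(first, second)]}

def process_pair(batch: dict, tokenizer) -> dict:
    """Two-argument function: the batch plus the tokenizer to apply."""
    return tokenizer(batch["sentence1"], batch["sentence2"])

# partial fixes `tokenizer`, leaving a function of the batch only.
process_fn = partial(process_pair, tokenizer=fake_tokenizer)

toy_batch = {"sentence1": ["He is tired", "Hello"],
             "sentence2": ["He needs rest", "Hi there"]}
processed = process_fn(toy_batch)
print(processed)  # {'n_tokens': [6, 3]}
```

As far as I know, `map` also accepts a `fn_kwargs` argument that serves the same purpose without `partial`.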

In [54]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Found cached dataset glue (C:/Users/bouab/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-bb21e6423b980722.arrow
Loading cached processed dataset at C:\Users\bouab\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-d1e8c90b5d349f7a.arrow


Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

In [49]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# this means that the model will discard the output layer used in pretraining and add a fully connected layer with 2 output labels, as specified.
# this last layer will be randomly initialized

from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# let the model train
trainer.train()

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

  0%|          | 0/1377 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.518, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}
{'loss': 0.3407, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}
{'train_runtime': 225.2463, 'train_samples_per_second': 48.853, 'train_steps_per_second': 6.113, 'train_loss': 0.36598954100321407, 'epoch': 3.0}


TrainOutput(global_step=1377, training_loss=0.36598954100321407, metrics={'train_runtime': 225.2463, 'train_samples_per_second': 48.853, 'train_steps_per_second': 6.113, 'train_loss': 0.36598954100321407, 'epoch': 3.0})
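
The numbers in this log can be checked by hand. The sketch assumes the `TrainingArguments` defaults were in effect (batch size 8 on a single device, 3 epochs, learning rate 5e-5, linear decay without warmup, a log line every 500 steps). These are assumptions, but they reproduce the logged values.

```python
import math

num_samples = 3668   # rows in the MRPC training split
batch_size = 8       # assumed default
num_epochs = 3       # assumed default
base_lr = 5e-5       # assumed default

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step: int) -> float:
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    return base_lr * (total_steps - step) / total_steps

print(total_steps)  # 1377
for step in (500, 1000):
    # step, epoch reached at that step, learning rate at that step
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```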

In [52]:
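# predict on the validation split tokenized with the same (uncased) tokenizer used for training
predictions = trainer.predict(tokenized_datasets['validation'])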
predictions, labels = predictions.predictions, predictions.label_ids

  0%|          | 0/51 [00:00<?, ?it/s]

In [None]:
# the current values in the predictions are just logits
# we need the argmax function to convert them to predictions
import evaluate
import numpy as np
preds = np.argmax(predictions, axis=-1)
# we load the metric associated with the dataset
metric = evaluate.load('glue', 'mrpc')
# the keyword arguments must be spelled exactly `predictions` and `references`
results = metric.compute(predictions=preds, references=labels)
print(results)
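
The GLUE MRPC metric reports accuracy and F1. To make clear what those two numbers mean, here is a hand-rolled version on toy predictions. `accuracy_and_f1` is a hypothetical helper and the labels below are invented, not results from this run.

```python
def accuracy_and_f1(predictions: list[int], references: list[int]) -> dict[str, float]:
    """Accuracy and binary F1 (positive class = 1), computed by hand."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
    correct = sum(p == r for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

toy_preds = [1, 0, 1, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]
toy_metrics = accuracy_and_f1(toy_preds, toy_refs)
print(toy_metrics)
```

Here 4 of 6 predictions are correct, and precision and recall are both 3/4, so the F1 score is 0.75.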

### Full Training Loop

In [56]:
# before proceeding with using PyTorch directly, we need to prepare the data for training
token_ds = tokenized_datasets.remove_columns(['sentence1', 'sentence2', 'idx']).rename_column('label', 'labels')
token_ds.set_format('torch')  # return torch tensors instead of Python lists
token_ds['train'][0]

{'labels': tensor(1),
 'input_ids': tensor([  101,  2572,  3217,  5831,  5496,  2010,  2567,  1010,  3183,  2002,
          2170,  1000,  1996,  7409,  1000,  1010,  1997,  9969,  4487, 23809,
          3436,  2010,  3350,  1012,   102,  7727,  2000,  2032,  2004,  2069,
          1000,  1996,  7409,  1000,  1010,  2572,  3217,  5831,  5496,  2010,
          2567,  1997,  9969,  4487, 23809,  3436,  2010,  3350,  1012,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1])}

In [None]:
# we need to create the dataloaders
from torch.utils.data import DataLoader
train_loader = DataLoader(token_ds['train'], 
                          shuffle=True, 
                          batch_size=8, 
                          collate_fn=data_collator)

val_loader = DataLoader(token_ds['validation'], 
                         shuffle=False, 
                         batch_size=8, 
                         collate_fn=data_collator)


In [62]:
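# grab one batch from the training loader to inspect its shapes
# (the sequence length varies from batch to batch because of dynamic padding)
batch = next(iter(train_loader))
for k, v in batch.items():
    print(k, v.shape, sep='\t')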

labels	torch.Size([8])
input_ids	torch.Size([8, 68])
token_type_ids	torch.Size([8, 68])
attention_mask	torch.Size([8, 68])
