<a href="https://colab.research.google.com/github/benzionchen/transformer_NLP_research/blob/main/hugging_face_chapter_1_3_codealong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers



In [2]:
!pip install transformers[sentencepiece]

Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.2.0-cp39-cp39-win_amd64.whl (991 kB)
     -------------------------------------- 991.5/991.5 kB 7.0 MB/s eta 0:00:00
Collecting protobuf
  Downloading protobuf-6.30.2-cp39-cp39-win_amd64.whl (431 kB)
     ---------------------------------------- 431.0/431.0 kB ? eta 0:00:00
Installing collected packages: sentencepiece, protobuf
Successfully installed protobuf-6.30.2 sentencepiece-0.2.0


In [3]:
!nvidia-smi

Fri Apr 11 22:25:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 572.83                 Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3080 Ti   WDDM  |   00000000:01:00.0  On |                  N/A |
| 52%   62C    P3             85W /  350W |    4739MiB /  12288MiB |     24%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
!pip install transformers torch



In [5]:
import transformers

In [6]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier("I've been waiting for Hugging Face course")

# pipeline() is a function provided by HF that uses their API to load a pretrained model and run inference (text generation, sentiment analysis, etc.)
# it does model loading -> tokenization -> prediction in 1 line of code (defaults to the distilbert-base-uncased-finetuned-sst-2-english model)
# what happens when you call pipeline()? the pipeline is initialized (loads tokenizer and model weights), PyTorch or TensorFlow is set up as the inference backend, then a GPU is used if one is available (the log below reports cuda:0),
# which will run locally on my

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.995071530342102}]

In [7]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

# sentiment analysis determines the tone behind text: a positive or negative label plus a confidence score (a softmax probability, so it is mathematically below 1.0 even when it rounds very close to it)
# the output comes from input tensors that are fed into the pretrained model: the model produces logits -> softmax -> classification probabilities
# model weights are downloaded from HF the first time and then cached; inference runs on whatever machine hosts the notebook kernel (a Colab VM when run on Colab; the nvidia-smi output and "cuda:0" log above point to the local RTX 3080 Ti here)

# 3 main steps involved when passing text into pipeline: 1. text is preprocessed into format model can understand 2. inputs are passed to model 3. predictions are post-processed so you can make sense of them
# format = vectors (tensors), matmul happens on these tensors, tokens are mapped to numbers via tokenizer, token ID used to index an embedding matrix to get vectors, turning words into vectors
# vectors are passed through layers of the transformer (inside attention layer + feedforward networks use matmul operations)
# attention weights = softmax(Q K^T / sqrt(d_k)); attention output = attention weights x V

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
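
The attention formula in the comments above can be sketched in plain Python. This is only an illustration under stated assumptions: the tiny `Q`, `K`, and `V` matrices are made-up values, and `softmax` and `attention` are hypothetical helper names, not the library's implementation (which uses batched tensor matmuls and multiple heads).

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # score of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# toy example: 2 tokens, d_k = 2 (all values are made up)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a weighted average of the rows of `V`, so the first token's output sits closer to `V[0]` and the second token's output closer to `V[1]`.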

In [8]:
classifier = pipeline('zero-shot-classification')

classifier(
    "this is a course on transformers",
    candidate_labels = ["education", "politics", "business"],
)
# my scores differ from the example in the HF course. possible causes: updated model weights or a different default revision, new variants of the model, or changes to tokenization and processing logic in newer library versions

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cuda:0


{'sequence': 'this is a course on transformers',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8483405709266663, 0.1200680136680603, 0.031591422855854034]}

In [9]:
print(classifier.model.name_or_path)

# zero-shot = the user doesn't have to fine-tune the model on these labels

facebook/bart-large-mnli


In [None]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

# The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
# try with r1? might run out of space

generator = pipeline(
    "text-generation",
    model="deepseek-ai/deepseek-llm-7b-base",
    device_map="auto",        # use GPU if available
    trust_remote_code=True    # custom model logic
)

output = generator("In this course, we will teach you how to", max_new_tokens=50)
print(output[0]["generated_text"])

# runs way slower than the one above

In [None]:
# big language models (like DeepSeek or LLaMA) are too massive to store in a single file (sometimes >20GB), so they're sharded: split into smaller chunks (e.g., pytorch_model-00001-of-00003.bin, pytorch_model-00002-of-00003.bin)
# .bin files contain PyTorch tensors (learned weights of the model); .bin files are the frozen brain of the model, with all the neurons' values saved here

# how are the model weights stored in .bin? weights are usually FP32 (or FP16/BF16 for larger models) and are written with PyTorch serialization: each tensor is stored as a binary blob of floats, keyed by parameter name
# looks like {
#  'encoder.layer.0.attention.self.query.weight': tensor([...]),
#  'encoder.layer.0.attention.self.key.weight': tensor([...]),
#  'encoder.layer.0.attention.self.value.weight': tensor([...]),
#  ...
#}

# each tensor is big matrix of floats stored as bin data, need pytorch to deserialize - this is different than an instruction set
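
To make "a tensor is stored as a binary blob of floats" concrete, here is a minimal sketch using only the standard library. It is not PyTorch's actual file format (which wraps the tensors in its own serialization container); the parameter name and weight values are made up, and `to_bytes` / `from_bytes` are hypothetical helpers.

```python
import struct

# hypothetical tiny "state dict": parameter name -> flat list of float weights
weights = {"layer.0.weight": [0.5, -0.25, 1.0, 2.0]}

def to_bytes(values):
    """Pack floats as little-endian FP32 (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def from_bytes(blob):
    """Unpack little-endian FP32 bytes back into Python floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

blob = to_bytes(weights["layer.0.weight"])
print(len(blob))      # 4 values * 4 bytes each
restored = from_bytes(blob)
print(restored)
```

The bytes are just numbers; they only become a model once code that knows the architecture loads them into the right layers, which is why the weights are data rather than instructions.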

In [None]:
import torch

# load the binary weight file
state_dict = torch.load("(insert pytorch_model.bin)") # this would probably be proprietary?

# list all parameter names
print(state_dict.keys())

# see one weight tensor
print(state_dict['encoder.layer.0.attention.self.query.weight'])

# will probably output something like:
# tensor([[ 0.023, -0.042, ..., 0.019],
#         [-0.001,  0.112, ..., -0.055], ...])

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

# <mask> is like "(insert here)" and the model will predict it

# different than prefill, prefill = decoder-only LLMs like GPT and Deepseek where first pass of model encodes input context prompt into internal hidden states, prepping model memory before generation
# modern inference systems like vLLM use prefill (encode the input text) + decode (generate new tokens)

# top_k argument controls how many possibilities you want to be displayed

In [None]:
ner = pipeline("ner", aggregation_strategy="simple")  # groups sub-tokens into whole entities; replaces the deprecated grouped_entities=True
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

# named entity recognition (NER) = task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

In [None]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

In [None]:
# transformer models can broadly be grouped into 3 categories

# GPT-like (also called auto-regressive Transformer models)
# BERT-like (also called auto-encoding Transformer models)
# BART/T5-like (also called sequence-to-sequence Transformer models)

# Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models, trained on large amounts of raw text in a self-supervised fashion
# the model develops a statistical understanding of the language it was trained on, but on its own that is not very useful for specific practical tasks; because of this, the general pretrained model then goes through transfer learning (fine-tuning on a task)
# An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones
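
Next-word prediction can be illustrated with a toy bigram counter. This is a deliberately simplified sketch: it conditions on only the single previous word (not the n previous words a Transformer attends to), and the corpus and the `train_bigrams` / `predict_next` helpers are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word` seen in training."""
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat . the cat ate ."
model_counts = train_bigrams(corpus)
print(predict_next(model_counts, "the"))
print(predict_next(model_counts, "on"))
```

The training signal comes from the raw text itself (each word is the label for the words before it), which is what makes this kind of objective self-supervised.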

In [None]:
!pip install codecarbon

In [None]:
# transfer learning = more efficient
# fine-tuned models typically perform better, but they also inherit the pretrained model's biases

# base model + very large corpus + days of training + $$$ -> pre-trained language model + training hardware + easily reproducible + $$$ -> fine-tuned language model

# encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input
# encoder takes inputs that represent text converting into numerical representations (embeddings or features), uses self-attention mechanism
# these features are bidirectional (each token's representation uses context from both sides)

# encoder-only models are good for tasks that require understanding of the input such as sentence classification, named entity recognition

In [None]:
# decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs
# decoder can also accept inputs, similar mechanism as encoder (masked self-attention) + unidirectional features used in auto regressive manner

# decoder-only models are good for generative tasks

In [None]:
# adding the two together, you get an encoder-decoder, which is a sequence-to-sequence transformer
# encoder accepts inputs -> builds a high-level representation of them -> representation is passed to the decoder -> decoder uses the encoder's output to generate the predicted output

# encoder-decoder is good for generative tasks that require an input, such as translation or summarization

In [None]:
# attention layers pay attention to specific tokens/words in the sentence it's passed (and more or less ignore others)

# transformer architecture originally was for translation
# during training, encoder receives input sentences in one language while decoder receives same sentences in desired target language (like chinese -> english)
# attention layers can use all the words in a sentence and the translation is dependent on what is after as well as before the word in the sentence
# decoder works sequentially and can only pay attention to the words in the sentence that it has already translated, for example, when we have predicted the first 3 words of the translated target, we give them to the decoder, which uses all
# inputs of the encoder to try to predict the 4th word

# to speed up training, the decoder is fed the whole target, but isn't allowed to use future words
# the first attention layer in a decoder block pays attention to all the past inputs to the decoder, but the second attention layer uses the output of the encoder, so it can access the whole input sentence to best predict the current word
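
The "not allowed to use future words" rule is usually implemented as a causal mask over the attention scores. The sketch below is a plain-Python illustration with made-up scores; `causal_mask` and `masked_softmax` are hypothetical helper names, not library functions.

```python
import math

def causal_mask(n):
    """n x n mask: row i may attend to columns 0..i (1 = allowed, 0 = blocked)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax where blocked positions get a score of -inf, i.e. weight 0."""
    masked = [s if a else float("-inf") for s, a in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print(row)

# made-up attention scores for the 2nd token (row index 1) over 4 positions
attn_weights = masked_softmax([0.3, 0.1, 0.9, 0.5], mask[1])
print(attn_weights)
```

Because the mask is applied inside attention, the whole target sentence can be processed in one pass during training while each position still sees only its own past.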

In [None]:
# a decoder-only model takes the previous tokens to predict the next token; GPT-2 has a context window of 1024 tokens
# GPT2 would still benefit from inference accel because a chip like Sohu is optimized for low-latency token by token generation + matmul heavy workloads + weight streaming/caching past K V + minimize memory bottleneck
# GPT2 does 1 forward pass per token, compute-intensive, each pass requires querying all past tokens, layer-wise matmul and attentions, hardware like Sohu handles this well because of KVCache + streaming attention

In [None]:
# encoder-decoder works like this: the entire input is first turned into numerical representation, and outputs contextualized embeddings for each word, this is encoder output/memory and is cached and reused by decoder

# step 1: tokenization, turning words into IDs
# step 2: each token ID is mapped to a vector embedding (the input into the transformer) - token IDs are not random, they're lookups from a vocabulary (fixed mapping between token & integers that model learned during pretrain)
# step 3: add positional embeddings because transformers don't understand order by default, so we have to say "word 2" came after "word 1"
# step 4: transformer layers begin - each token looks at every other token and scores its relevance, as in should the word "cat" care about the word "sat" or "on", then feedforward layer makes each embedding go through the NN
# - linear -> activation -> linear, and then finally normalize the layers (this is repeated layer after layer progressively enriching the embeddings with more context)
# step 5: output tokens
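
Steps 1 to 3 above (token IDs, embedding lookup, positional information) can be sketched with small lookup tables. Everything here is an assumption for illustration: the four-word vocabulary, the 2-dimensional vectors, and the choice of adding a learned position vector (BERT-style) are made up, and `embed` is a hypothetical helper.

```python
# hypothetical toy vocabulary and 2-dimensional embedding tables (all values made up)
vocab = {"[PAD]": 0, "the": 1, "cat": 2, "sat": 3}
token_embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
position_embeddings = [[0.01, 0.02], [0.03, 0.04], [0.05, 0.06]]

def embed(words):
    """Steps 1-3: words -> token IDs -> token vector + position vector."""
    ids = [vocab[w] for w in words]
    vectors = []
    for pos, token_id in enumerate(ids):
        tok = token_embeddings[token_id]      # row lookup by token ID
        posv = position_embeddings[pos]       # row lookup by position
        vectors.append([t + p for t, p in zip(tok, posv)])
    return ids, vectors

ids, vectors = embed(["the", "cat", "sat"])
print(ids)
print(vectors)

# the same word at two different positions gets two different input vectors
_, repeated = embed(["cat", "the", "cat"])
print(repeated[0] != repeated[2])
```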

In [None]:
# biases of pre-trained models

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

In [None]:
# what happens inside pipeline()?

# convert raw text -> numbers (IDs) using tokenizer -> logits -> predictions (labels and scores)

# tokenization first splits text into tokens, add special tokens that model is expecting, and matches each token to unique ID

# when fitting the tokenizer to sentences, if there are 2 sentences with different lengths, will need to pad the shortest one in order to build an array

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)


In [None]:
# the output itself is a dictionary containing two keys, input_ids and attention_mask. input_ids contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence

# vector output by the Transformer module is usually large. It generally has three dimensions:

# Batch size: The number of sequences processed at a time (2 in our example).
# Sequence length: The length of the numerical representation of the sequence (16 in our example).
# Hidden size: The vector dimension of each model input.



In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

In [None]:
# for our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

In [None]:
# the values we get as output from our model don't make sense by themselves
print(outputs.logits)

# logits are the raw, unnormalized scores output by the last layer of the model; to convert them to probabilities they go through a softmax layer (models output logits because the loss function for training generally fuses the last
# activation function, such as softmax, with the actual loss function, such as cross entropy)
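
Why the softmax is fused with the loss can be seen in a small worked example: cross entropy only needs the log-softmax of the target class, which can be computed directly from the logits with the log-sum-exp trick. The logits below are made-up values and `log_softmax` / `cross_entropy` are hypothetical helpers written for this sketch.

```python
import math

def log_softmax(logits):
    """log(softmax(logits)) computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Negative log-probability of the target class, straight from logits."""
    return -log_softmax(logits)[target]

logits = [2.0, -1.0]                      # made-up scores for 2 classes
probs = [math.exp(v) for v in log_softmax(logits)]
print(probs)
print(cross_entropy(logits, 0))           # small loss: class 0 is favored
print(cross_entropy(logits, 1))           # larger loss: class 1 is unlikely
```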

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

In [None]:
model.config.id2label

In [None]:
# creating a transformer

from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

print(config)

In [None]:
# the hidden_size attribute defines the size of the hidden_states vector, and num_hidden_layers defines the number of layers the Transformer model has


In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

model = BertModel.from_pretrained("bert-base-cased")

In [None]:
model.save_pretrained("directory on my computer")

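# save_pretrained() writes two files: config.json and the weights file (pytorch_model.bin in older versions of transformers, model.safetensors in recent ones)
# the weights file is known as the state dictionary; it contains all your model’s weights
# the two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters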

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)

output = model(model_inputs)

print(output)

In [None]:
# 3 approaches to tokenization: word-based, character-based, and subword-based

In [None]:
# 1. word-based (tokenize entire words as IDs)
# - this is a problem because "dog" and "dogs" are related, but they get completely unrelated IDs, so the relationship is not represented in the IDs
# - also, there are a lot of words in the English language: to cover all possible words, each needs a unique ID, and each ID is mapped to a large vector to represent word meaning, so the vocabulary to keep track of is enormous
# - can train tokenizers on a smaller subset of words that are much more commonly used, rather than the entirety of the English language or the entirety of a given text
# - words that are not a part of that smaller subset will be "UNKNOWN", which can result in a large number of unknown words


In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

# If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a huge amount of tokens.
# For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we’d need to keep track of that many IDs.
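
A small sketch of a capped word-level vocabulary shows both problems at once: related words get unrelated IDs, and anything outside the vocabulary collapses to a single unknown token. The three-sentence corpus, the `[UNK]` name, and the `build_vocab` / `encode` helpers are all assumptions for illustration.

```python
from collections import Counter

def build_vocab(texts, max_size):
    """Keep only the most frequent words; ID 0 is reserved for unknown words."""
    counts = Counter(word for t in texts for word in t.split())
    vocab = {"[UNK]": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map each whitespace-separated word to its ID, or to [UNK]."""
    return [vocab.get(w, vocab["[UNK]"]) for w in text.split()]

corpus = ["the dog runs", "the dogs run", "the dog sleeps"]
vocab = build_vocab(corpus, max_size=3)
print(vocab)
print(encode("the dog runs", vocab))
print(encode("the dogs run", vocab))
```

With a vocabulary of only three entries, "dogs" and "run" both map to the same unknown ID, so the model cannot tell them apart.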

In [None]:
# 2. character-based
# - there are a lot of words in a language (~500k in English), but only about 256 characters (letters, digits, punctuation) are needed for English
# - 20k characters exist in chinese dictionary
# - tokenized characters will be more complete (no UNKNOWNs)
# - however characters hold less info than complete words but chinese language will have more info in one character over something like 1 character in latin language
# - what this means is that this will impact the size of the context that the model will carry around
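
The trade-off can be seen by tokenizing the same sentence both ways: the character vocabulary is tiny, but the sequence gets much longer, which eats into the context window. This is a plain-Python sketch, not a real tokenizer.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()      # word-based: one token per word
char_tokens = list(text)        # character-based: one token per character (spaces included)
char_vocab = sorted(set(char_tokens))

print(len(word_tokens))         # 5
print(len(char_tokens))         # 26
print(char_vocab)
```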


In [None]:
# 3. subword-based
# - frequently used words should not be split into smaller subwords, and rare words should be decomposed into meaningful subwords ('rare' is decided by frequency in the tokenizer's training corpus)
# - would separate "dogs" into "dog" and "s" as "dog" is the root of the meaning
# - "tokenization" would be split to "token" (the root meaning of the word) and "ization" that changes the meaning of the word (what about words that change spelling like "writing"? it depends on the learned vocabulary: a common word is often kept whole, otherwise pieces like "writ" + "ing" are possible)

# common subword algorithms include byte-pair encoding (BPE, used by GPT-2), WordPiece (BERT), and Unigram/SentencePiece (T5)
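
A WordPiece-style greedy longest-match split can be sketched in a few lines. This is a simplified illustration, not the library's tokenizer: the five-entry vocabulary is made up, the `##` prefix for word-internal pieces follows the BERT convention, and `wordpiece` is a hypothetical helper.

```python
def wordpiece(word, vocab):
    """Greedily take the longest vocabulary piece from the left, then repeat."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]           # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"the", "dog", "token", "##s", "##ization"}   # made-up vocabulary
print(wordpiece("dogs", toy_vocab))
print(wordpiece("tokenization", toy_vocab))
print(wordpiece("cat", toy_vocab))
```

In a real tokenizer the vocabulary is learned from corpus statistics, so which words stay whole and which get split depends entirely on the training data.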

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

# the algorithm split the words differently than what was given in the example output

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

# these ids are the same though

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

In [None]:
# convert this list of numbers to a tensor and send it to the model:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

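sequence = sequences[0]

# tokenize() returns token strings; tokenizer(sequence) would return a dict of encodings, not tokens
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)  # 1-D tensor: no batch dimension
model(input_ids)  # expected to fail: the model wants a 2-D (batch, sequence) tensor

# IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)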

In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

In [None]:
# Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:

batched_ids = [ids, ids]

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. If you’ve ever worked with tensors before, you know that they need to be of rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually pad the inputs.

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In [None]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

In [None]:
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
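
Padding and the attention mask always go together, so it can help to build both in one step. The helper below is a minimal sketch in plain Python; `pad_batch` is a hypothetical name, and `pad_id=0` is an assumption (it matches the `[PAD]` ID of BERT-style vocabularies, but `tokenizer.pad_token_id` is the value to use in practice).

```python
def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 = real token (attend to it), 0 = padding (ignore it)
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(input_ids)
print(attention_mask)
```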

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

With Transformer models, there is a limit to the lengths of the sequences we can pass the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

Use a model with a longer supported sequence length.
Truncate your sequences.

In [None]:
# Otherwise, we recommend you truncate your sequences by specifying the max_sequence_length parameter: sequence = sequence[:max_sequence_length]

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [None]:
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

The tokenizer object can handle the conversion to specific framework tensors, which can then be directly sent to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks — "pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))


In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

In [None]:
!pip install datasets

In [None]:
!pip install datasets --upgrade --no-deps

In [None]:
!pip install fsspec==2025.3.0

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121


In [None]:
!pip install evaluate
!pip install accelerate
!pip install bitsandbytes
!pip install sentencepiece
!pip install peft

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

In [None]:
raw_train_dataset.features

# behind the scenes, label is of type ClassLabel, and the mapping of integers to label names is stored in its names attribute. 0 corresponds to not_equivalent, and 1 corresponds to equivalent.

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

In [None]:
# preprocess training dataset

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

# works well, but has the drawback of returning a plain dict (with keys input_ids, attention_mask, and token_type_ids, each holding a list of lists) rather than a Dataset
# it also only works if there is enough RAM to store the entire dataset during tokenization


In [None]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Takes a dict and returns a new dict with the keys input_ids, attention_mask, and token_type_ids.
# It also works if the example dict contains several samples (each key holding a list of sentences), since the tokenizer accepts lists of sentence pairs.
# This lets us pass batched=True to map(), which speeds up tokenization.
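
To make the batched case concrete, here is a minimal plain-Python sketch of the idea: the function receives a dict of lists (one list per column) and returns new columns as lists. `map_batched` and `toy_tokenize` are hypothetical stand-ins, not the `datasets` implementation, and the data is made up.

```python
# Toy stand-in for a dataset: a dict of equal-length columns (made-up data).
columns = {
    "sentence1": ["a b", "c d e", "f", "g h"],
    "sentence2": ["x", "y z", "w", "v u t"],
}


def toy_tokenize(batch):
    # Receives lists of sentences and returns a new column as a list,
    # the same shape of input and output a batched function deals with.
    return {
        "n_tokens": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


def map_batched(columns, fn, batch_size=2):
    # Slice every column into chunks, call fn once per chunk, and
    # collect the returned columns next to the original ones.
    n_rows = len(next(iter(columns.values())))
    out = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out


result = map_batched(columns, toy_tokenize)
print(result["n_tokens"])
```

Calling the function once per chunk instead of once per row is what allows a fast tokenizer to process many sentence pairs in a single call.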

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

In [None]:
# Dynamic padding: pad each batch to the length of its longest element when the batch is built,
# instead of padding the whole dataset to one global maximum. This avoids spending CPU/GPU compute on padding tokens.

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
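
A small plain-Python sketch of what dynamic padding saves. `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption; the real collator also handles the other keys and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the longest one in this batch only.
    max_len = max(len(seq) for seq in sequences)
    return {
        "input_ids": [seq + [pad_id] * (max_len - len(seq)) for seq in sequences],
        "attention_mask": [
            [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
        ],
    }


# Made-up token ids with lengths 3, 5, 2 and 8.
dataset = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]
batches = [dataset[i:i + 2] for i in range(0, len(dataset), 2)]
padded = [pad_batch(batch) for batch in batches]

total_tokens = sum(len(seq) for seq in dataset)
dynamic_widths = [len(p["input_ids"][0]) for p in padded]
dynamic_pad = sum(w * len(b) for w, b in zip(dynamic_widths, batches)) - total_tokens
global_width = max(len(seq) for seq in dataset)
global_pad = global_width * len(dataset) - total_tokens

print("per-batch widths:", dynamic_widths)
print("padding tokens, dynamic vs global:", dynamic_pad, global_pad)
```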

In [None]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

In [None]:
!pip install wandb


In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

In [None]:
import numpy as np
import evaluate

preds = np.argmax(predictions.predictions, axis=-1)

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
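
For intuition, here is a plain-Python sketch of the two numbers the MRPC metric reports (accuracy and F1), computed from argmax predictions. The logits and labels are made up, and `accuracy_and_f1` is a hypothetical helper, not the `evaluate` implementation.

```python
def argmax(row):
    # Index of the largest logit in one row.
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(predictions, references, positive=1):
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for four examples and two classes.
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 1]

toy_preds = [argmax(row) for row in logits]
scores = accuracy_and_f1(toy_preds, labels)
print(toy_preds, scores)
```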

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
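# Recent transformers releases name this argument eval_strategy; older releases used evaluation_strategy.
training_args = TrainingArguments("test-trainer", eval_strategy="epoch")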
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
outputs = model(**batch)

print(outputs.loss, outputs.logits.shape)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
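
A sketch of the arithmetic behind the step count and of a linear schedule's shape (optional warmup, then a straight-line decay to zero). `linear_lr` is a hypothetical helper written for illustration, and the 3,668 training examples are an assumption about the MRPC train split.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    # Ramp up linearly during warmup, then decay linearly to zero.
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


num_examples = 3668  # assumption: size of the MRPC train split
batch_size = 8
num_epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
num_training_steps = num_epochs * steps_per_epoch

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, 0, num_training_steps))
```

With `num_warmup_steps=0`, the learning rate starts at its full value and reaches zero on the final step.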

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

# Make sure this shows a CUDA device (GPU) rather than the CPU.

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

In [None]:
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, get_scheduler
from torch.optim import AdamW

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)