In [5]:
from transformers import AutoTokenizer

In [7]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [9]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

## Input Parameters Explanation

### `raw_inputs`
The list of sentences defined above - the raw text we want to process.

### `padding=True`
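- When sentences have different lengths:
  - Shorter sentences are filled with padding tokens (`[PAD]`, ID `0` for this tokenizer) up to the length of the longest sentence in the batch
  - Ensures all sequences in the batch have the same length
  - Needed because a batch is stored as one rectangular tensor, so every row must have the same number of tokens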

### `truncation=True`
- When sentences exceed the model's maximum length:
  - Automatically cuts off excess tokens
  - Cuts to the model's maximum length when no `max_length` is passed (512 tokens for this checkpoint)
  - Prevents errors from overly long inputs

### `return_tensors="pt"`
- Specifies output format:
  - `"pt"` returns PyTorch tensors
  - Alternative options: `"tf"` for TensorFlow tensors, `"np"` for NumPy arrays
  - If not specified, returns regular Python lists
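
The effect of `padding=True` and `truncation=True` can be sketched in plain Python without the library. The helper `pad_and_truncate` below is hypothetical (it is not part of `transformers`) and assumes a padding ID of `0`; the token IDs are the ones the tokenizer produces for the two sentences in this notebook. It is a simplified picture of the behavior, not the tokenizer's actual implementation.

```python
PAD_ID = 0  # assumption: padding ID 0, as in this notebook's tokenizer output

def pad_and_truncate(batch, max_length):
    """Hypothetical helper: truncate to max_length, then pad to the longest row."""
    truncated = []
    for ids in batch:
        if len(ids) > max_length:
            # Keep the final [SEP] token, since special tokens survive truncation
            ids = ids[: max_length - 1] + [ids[-1]]
        truncated.append(list(ids))
    longest = max(len(ids) for ids in truncated)
    return [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]

batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

# Nothing exceeds 512 tokens, so only padding happens here
padded = pad_and_truncate(batch, max_length=512)
print([len(row) for row in padded])  # [16, 16]

# With a small limit, the long sentence is cut and the short one is padded
short = pad_and_truncate(batch, max_length=10)
print([len(row) for row in short])   # [10, 10]
```

Both rows end up with the same length in each case, which is what allows them to be stacked into a single tensor.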

In [10]:
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

# Why We Use `input_ids` in Transformer Models

## The Core Reason
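Transformer models (like BERT, DistilBERT, GPT) **cannot process raw text directly** - they only operate on numbers. The tokenizer therefore maps each token to an integer ID from its vocabulary, and these IDs are the `input_ids`. In the output above, `101` and `102` are the special `[CLS]` and `[SEP]` tokens that mark the start and end of each sentence, and `0` is padding.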
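
As a rough illustration of the text-to-ID step, the sketch below uses a tiny hand-written vocabulary. The IDs are read off the `input_ids` output above for the second sentence, assuming they align one-to-one with its tokens. The dictionary `toy_vocab` and the function `encode` are hypothetical stand-ins for the tokenizer, which really applies WordPiece subword splitting over a vocabulary of about 30,000 entries.

```python
# Toy subset of the vocabulary (assumption: IDs aligned one-to-one with the
# tokens of "I hate this so much!" in the tokenizer output above)
toy_vocab = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}

def encode(text):
    """Hypothetical encoder: lowercase, split off punctuation, look up IDs."""
    tokens = text.lower().replace("!", " !").split()
    body = [toy_vocab[token] for token in tokens]
    # Wrap the sentence in the special start and end tokens
    return [toy_vocab["[CLS]"]] + body + [toy_vocab["[SEP]"]]

ids = encode("I hate this so much!")
print(ids)  # [101, 1045, 5223, 2023, 2061, 2172, 999, 102]
```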

# Attention Masks in Transformer Models

## What is an Attention Mask?

An attention mask is a binary tensor with the same shape as `input_ids` that tells the model which tokens to pay attention to:

- `1` = Real token (pay attention) ✅
- `0` = Padding token (ignore) ❌
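
The relationship between the two tensors can be sketched as follows. The helper `build_attention_mask` is hypothetical and assumes the padding ID is `0`; the real tokenizer produces the mask while it pads rather than by comparing IDs afterwards, so treat this as an illustration only.

```python
PAD_ID = 0  # assumption: the padding ID seen in this notebook's output

def build_attention_mask(input_ids):
    """Hypothetical helper: 1 for real tokens, 0 for padding."""
    return [[0 if token_id == PAD_ID else 1 for token_id in row] for row in input_ids]

# The padded second sentence from the tokenizer output above
input_ids = [
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0],
]

mask = build_attention_mask(input_ids)
print(mask[0])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```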

## Why Do We Need Attention Masks?

### The Problem:
- A batch is one rectangular tensor, so every sequence in it needs the same length
- We pad shorter sequences with `[PAD]` tokens
- Without masks, models might process padding as meaningful content
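
To make the problem concrete, the sketch below compares attention weights with and without a mask. One common way a mask takes effect is to set the scores of padded positions to negative infinity before the softmax, so their weights become exactly zero. The function `masked_softmax` and the score values are illustrative assumptions, not the library's internal code.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0, 3.0, 3.0]  # made-up scores; last three are padding
mask = [1, 1, 1, 0, 0, 0]

# Without a mask, the padding positions absorb part of the attention
unmasked = masked_softmax(scores, [1] * len(scores))
# With the mask, padding gets zero weight and real tokens share all of it
weights = masked_softmax(scores, mask)

print([round(w, 3) for w in unmasked])
print([round(w, 3) for w in weights])
```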

### The Solution:

```python
# For the padded sentence: "Hello" + 3 [PAD] tokens
input_ids =      [101, 7592, 102, 0, 0, 0]
attention_mask = [1,   1,    1,   0, 0, 0]
```

# Going through the model

In [11]:
from transformers import AutoModel

In [12]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [13]:
outputs = model(**inputs)

# Model Input Tensors

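This passes the input tensors (`inputs`) through the model. The `**` operator unpacks the `inputs` dictionary into keyword arguments, so the call is equivalent to `model(input_ids=..., attention_mask=...)`.

Here, `inputs` contains: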

- **`input_ids`**: Tokenized text converted to numerical IDs.
- **`attention_mask`**: Mask showing which tokens are real vs padding.
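
The `**` unpacking is ordinary Python and can be tried without loading a model. In the sketch below, `fake_model` is a hypothetical stand-in for the model's forward call, and the plain dictionary stands in for the dict-like object the tokenizer returns.

```python
def fake_model(input_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model's forward call."""
    return {"n_sequences": len(input_ids), "n_mask_rows": len(attention_mask)}

inputs = {
    "input_ids": [[101, 7592, 102, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 0, 0, 0]],
}

# ** spreads the dictionary's keys into keyword arguments,
# so these two calls do the same thing
unpacked = fake_model(**inputs)
explicit = fake_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
print(unpacked == explicit)  # True
```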

In [15]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8987, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

In [16]:
outputs.last_hidden_state.shape

torch.Size([2, 16, 768])

The output shape `[2, 16, 768]` means:

- **2**: Batch size (2 input sequences processed together).
- **16**: Sequence length (the longer sentence has 16 tokens, and the shorter one is padded to match).
- **768**: Hidden dimension size (the length of each token's vector; 768 for DistilBERT-base, the same as BERT-base).

- This is just the base model without its classification head (since we used `AutoModel` not `AutoModelForSequenceClassification`).
- The `last_hidden_state` contains the contextual embeddings for each token in each sequence.
- For classification tasks, you'd typically use the `[CLS]` token's embedding or pool the outputs.
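
The two options in the last point can be sketched on a toy example. The numbers below are made up and the hidden size is 3 instead of 768, so that the result is easy to check by hand. The functions `cls_embedding` and `masked_mean_pool` are hypothetical helpers written with plain lists; with real tensors the same ideas would be expressed as tensor operations.

```python
# Toy "last_hidden_state": 1 sequence, 4 token positions, hidden size 3
# (made-up numbers; the real tensor in this notebook has shape [2, 16, 768])
last_hidden_state = [
    [
        [0.1, 0.2, 0.3],  # [CLS]
        [0.5, 0.4, 0.3],  # a real token
        [0.9, 0.0, 0.3],  # [SEP]
        [9.0, 9.0, 9.0],  # padding position, should be ignored
    ]
]
attention_mask = [[1, 1, 1, 0]]

def cls_embedding(hidden, seq_index):
    """Option 1: take the vector at position 0, the [CLS] token."""
    return hidden[seq_index][0]

def masked_mean_pool(hidden, mask, seq_index):
    """Option 2: average the vectors of real tokens only."""
    rows = [vec for vec, m in zip(hidden[seq_index], mask[seq_index]) if m == 1]
    hidden_size = len(rows[0])
    return [sum(vec[d] for vec in rows) / len(rows) for d in range(hidden_size)]

cls_vector = cls_embedding(last_hidden_state, 0)
pooled = masked_mean_pool(last_hidden_state, attention_mask, 0)
print(cls_vector)
print([round(x, 3) for x in pooled])
```

Either way, each sequence is reduced to a single vector of hidden-size length, which a classification head can then map to class scores. With PyTorch tensors, the first option corresponds to indexing `outputs.last_hidden_state[:, 0]`.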