# Hugging Face

This page discusses various aspects of the Hugging Face infrastructure.

## Transformers

`transformers` is a Python package that lets you use pre-trained machine learning models built on the Transformer architecture.

| Component                   | Description                                                                                 |
| --------------------------- | ------------------------------------------------------------------------------------------- |
| **Models**                  | Pretrained architectures for tasks like classification, generation, or embeddings.          |
| **Tokenizers**              | Convert text into numerical input for models; handle batching, padding, truncation.         |
| **Pipelines**               | High-level API combining tokenizer + model for a specific task (e.g., `summarization`).     |
| **Configurations**          | Define model hyperparameters and architecture settings (e.g., `BertConfig`).                |
| **Trainer**                 | High-level training API handling loops, evaluation, logging, and checkpointing.             |
| **Schedulers & Optimizers** | Learning rate schedulers and optimizer integrations for training models.                    |
| **Data Utilities**          | Helpers for preprocessing and batching (e.g., `DataCollator`, `BatchEncoding`).             |
| **Hub Integration**         | Download/upload pretrained models from Hugging Face Hub (`from_pretrained`, `push_to_hub`). |

See the [transformers](hugging_face/transformers.ipynb) page for more details.

## Tokenizers

`tokenizers` is a package that implements different tokenization approaches and related tools.

| Component          | Description                                                                       |
| ------------------ | --------------------------------------------------------------------------------- |
| **PreTokenizers**  | Split text into initial units (words, punctuation, subwords) before encoding.     |
| **Models**         | Define the algorithm for tokenization (BPE, WordPiece, WordLevel, Unigram).       |
| **Normalizers**    | Clean and standardize text (lowercasing, accent stripping, punctuation handling). |
| **Trainers**       | Learn tokenization vocabulary from a dataset.                                     |
| **Decoders**       | Convert token IDs back to readable text.                                          |
| **Processors**     | Post-process tokenized output (e.g., adding special tokens like `[CLS]`).         |
| **Batch Encoding** | Handle batch tokenization with padding, truncation, and attention masks.          |

For details, see the [documentation](https://huggingface.co/docs/tokenizers/en/index) of the `tokenizers` package.

---

The rest of this section walks through the components of the `tokenizers` package that are used most often.

In [42]:
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.models import BPE
from tokenizers import Tokenizer

Most tokenization algorithms start with a pre-tokenization step: the text is split into sections (for example, words) using a deterministic rule. A fitting procedure is then applied to these sections to determine the final set of tokens.
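The two stages described above can be sketched in plain Python. This is a toy illustration of BPE-style fitting, not the `tokenizers` implementation: the function name `learn_bpe_merges`, the regex used for splitting, and the tie-breaking rule are all assumptions made for this example.

```python
import re
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Toy BPE fitting: return (list of merged pairs, final split of each word)."""
    # Stage 1: deterministic split into words (a stand-in for a pre-tokenizer).
    words = Counter(
        word for text in corpus for word in re.findall(r"\w+|[^\w\s]+", text)
    )
    # Each word starts out as a tuple of single-character symbols.
    splits = {word: tuple(word) for word in words}

    merges = []
    for _ in range(num_merges):
        # Stage 2: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += words[word]
        if not pair_counts:
            break  # every word is already a single symbol
        # Take the most frequent pair; max() keeps the first one on ties.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the chosen pair with the merged symbol.
        for word in list(splits):
            symbols, merged, i = splits[word], [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits


merges, splits = learn_bpe_merges(["some super check", "super some check"], 20)
print(merges)
print(splits)
```

With a merge budget this generous, every word in the toy corpus ends up as a single symbol. A real trainer stops earlier, once the requested vocabulary size is reached.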

The following cell applies the pre-tokenizer to an arbitrary string.

In [30]:
pretokinizer = Whitespace()
pretokinizer.pre_tokenize_str("Some test text")

[('Some', (0, 4)), ('test', (5, 9)), ('text', (10, 14))]
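The result above pairs each piece with its character offsets in the original string. A rough pure-Python equivalent is sketched below, assuming the `Whitespace` pre-tokenizer splits on the pattern `\w+|[^\w\s]+` as its documentation describes. The helper name `whitespace_pre_tokenize` is hypothetical, and edge cases such as unusual Unicode may behave differently in the library.

```python
import re


def whitespace_pre_tokenize(text):
    """Split text into runs of word characters or runs of punctuation, with offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]


print(whitespace_pre_tokenize("Some test text"))
# [('Some', (0, 4)), ('test', (5, 9)), ('text', (10, 14))]
```

Keeping the offsets makes it possible to map every token back to the exact span of the input it came from.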

The `tokenizers.Tokenizer` class is the main entry point for working with a tokenizer. It takes a model that defines the tokenization algorithm.

In [38]:
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pretokinizer

The trainer is another component of the system: it holds the training parameters, such as the target vocabulary size (`vocab_size`). The following cell trains the tokenizer defined earlier on a small in-memory corpus.

In [None]:
trainer = BpeTrainer(vocab_size=20)

tokenizer.train_from_iterator(
    [
        "some super check",
        "super some check"
    ],
    trainer
)

The following cell shows the vocabulary of the trained tokenizer. Each token is mapped to an integer ID, and these IDs are what the tokenizer produces when it encodes text.

In [40]:
tokenizer.get_vocab()

{'some': 18,
 'per': 16,
 'ck': 11,
 'h': 2,
 'u': 9,
 'me': 14,
 'o': 5,
 'er': 12,
 'p': 6,
 'r': 7,
 'c': 0,
 'k': 3,
 'ch': 10,
 's': 8,
 'e': 1,
 'check': 19,
 'ome': 15,
 'm': 4,
 'su': 17,
 'eck': 13}
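The dictionary is printed in arbitrary order, which hides its structure. The sketch below copies the output above into a plain dictionary and lists the tokens by ID. In this particular run, the single characters occupy IDs 0 to 9 and the merged tokens follow. The variable names are chosen for this example only.

```python
# Vocabulary copied from the get_vocab() output above.
vocab = {
    "some": 18, "per": 16, "ck": 11, "h": 2, "u": 9, "me": 14, "o": 5,
    "er": 12, "p": 6, "r": 7, "c": 0, "k": 3, "ch": 10, "s": 8, "e": 1,
    "check": 19, "ome": 15, "m": 4, "su": 17, "eck": 13,
}

# Invert the mapping and list the tokens in ID order.
id_to_token = {token_id: token for token, token_id in vocab.items()}
ordered = [id_to_token[i] for i in range(len(vocab))]

print(ordered[:10])  # ['c', 'e', 'h', 'k', 'm', 'o', 'p', 'r', 's', 'u']
print(ordered[10:])  # ['ch', 'ck', 'er', 'eck', 'me', 'ome', 'per', 'su', 'some', 'check']
```

The list has exactly 20 entries, matching `vocab_size=20`. The word `super` never appears as a single token because the budget ran out after `su` and `per` were learned.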

The following cell encodes a string and shows the resulting tokens. The input includes the word `start`, which does not occur in the training data.

In [41]:
tokenizer.encode("start some check").tokens

['s', 'r', 'some', 'check']
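The output contains no trace of `t` or `a`. A plausible reading, assuming the `BPE` model was created without an `unk_token` (as in the cell that builds the tokenizer), is that characters missing from the learned vocabulary are skipped. The sketch below reproduces only this filtering step for the word `start`. The `alphabet` set is copied from the single-character entries of the vocabulary shown earlier.

```python
# Single-character tokens from the vocabulary printed by get_vocab() above.
alphabet = set("cehkmoprsu")

word = "start"
kept = [ch for ch in word if ch in alphabet]
dropped = [ch for ch in word if ch not in alphabet]

print(kept)     # ['s', 'r']
print(dropped)  # ['t', 'a', 't']
```

To keep a placeholder for such characters instead of losing them, the library lets you pass `unk_token` to `BPE` and list the same token in the trainer's `special_tokens`.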