# 🤗 HuggingFace: A Deep Dive

Welcome to the HuggingFace universe! In this journey, we'll explore the main features of this amazing platform for NLP and generative AI. Get ready to learn with practical examples, detailed explanations, and lots of emojis to make everything more fun! 🚀

🤖 HuggingFace is a technology company that has revolutionized the field of Natural Language Processing (NLP) and Machine Learning. Its flagship is the open-source Transformers library, which made state-of-the-art NLP models accessible to researchers, developers, and enthusiasts.

### Example use cases for HuggingFace:
- **Smart chatbots** (like virtual assistants)
- **Sentiment analysis** on social media
- **Automatic language translation**
- **Creative text generation** (stories, scripts, etc)
- **Summarizing long texts** automatically
- **Named Entity Recognition** (extracting names, places, etc)
- **Question answering** (extracting answers from documents)

💡 Let's see how all this works in practice!

## 🗺️ Roadmap for Our Journey

Hugging Face offers several components to make advanced research and practical applications easier. Here, we'll focus on the four essential pillars for NLP tasks:

* 🧩 **Tokenizers**: Transform text into numbers that models can understand, by splitting sentences into words or subwords and mapping each piece to an ID.
  - Example: "I love AI!" → ["I", "love", "AI", "!"]
  - Another example, with a subword split (the exact pieces depend on the tokenizer's vocabulary): "Transformers are awesome!" → ["Transform", "##ers", "are", "awesome", "!"]
* 🧠 **Models**: The heart of the platform! Pre-trained models like GPT, BERT, Qwen, and many others, ready for various tasks.
  - Example: Generate text, answer questions, translate languages.
  - Example: Classify emails as spam or not spam.
* 📚 **Datasets**: Data is the fuel for machine learning. Hugging Face offers a standardized library to access and process a wide range of datasets.
  - Example: Datasets of tweets, news, question-answer pairs, product reviews, etc.
* 🏋️‍♂️ **Trainers**: Make model training easier, simplifying and speeding up the process.
  - Example: Train a model to classify movie reviews as positive or negative.

We'll also explore two amazing Hugging Face libraries for:
- 🔧 **Efficient fine-tuning** (adapting models by training only a small number of parameters)
- 🏆 **Reinforcement Learning** for transformer development

Let's get started! ✨

## 🧩 Tokenizers

Tokenization means breaking down text into smaller units, called tokens, which can be as small as characters or as long as words. Hugging Face offers a dedicated library for tokenization that is both robust and efficient.

### Why is tokenization important? 🤔
- Models can't understand raw text, only numbers!
- Tokenization helps convert text into a format that models can process.

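### Examples (illustrative; the exact split depends on the tokenizer's vocabulary):
- "I love AI!" → ["I", "love", "AI", "!"]
- "Transformers are awesome!" → ["Transform", "##ers", "are", "awesome", "!"]
- "Let's tokenize this sentence." → ["Let", "'s", "token", "##ize", "this", "sentence", "."]

The `##` prefix marks a piece that continues the previous token rather than starting a new word.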

In [1]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(f"The number of distinct tokens that bert-base-uncased uses is {tokenizer.vocab_size}")

The number of distinct tokens that bert-base-uncased uses is 30522


Let's see tokenizers in action! Here we load the tokenizer used for a BERT model, specifically a pre-trained model called `bert-base-uncased`. *Uncased* means the model was trained with a tokenizer that does not distinguish between uppercase and lowercase letters.

The number of distinct tokens that `bert-base-uncased` uses is 30,522. This includes individual characters, whole words, and word pieces.

### More examples:
- Try tokenizing: "HuggingFace makes NLP easy!"
- Try tokenizing: "2026 is the year of AI."

Notice how numbers, punctuation, and even unknown words are handled!
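To make the subword and unknown-word handling concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are hypothetical and exist only for illustration; the real `bert-base-uncased` vocabulary is far larger, and the library's implementation handles more cases.

```python
# Toy WordPiece-style splitter: greedy longest-match-first over a tiny,
# made-up vocabulary (hypothetical; not the real bert-base-uncased vocab).
TOY_VOCAB = {"token", "##ize", "##izer", "##s", "are", "awesome", "!", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("tokenizers"))  # ['token', '##izer', '##s']
print(wordpiece("awesome"))     # ['awesome']
print(wordpiece("xyz"))         # ['[UNK]']
```

Because the longest match wins, "tokenizers" picks `##izer` over the shorter `##ize`, and a word with no matching pieces collapses to a single `[UNK]` token.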

In [2]:
sentence = "Transformers are awesome!!"
tokens = tokenizer.tokenize(sentence)
print(tokens)

['transformers', 'are', 'awesome', '!', '!']


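Here we tokenize the sentence "Transformers are awesome!!" using the `tokenizer.tokenize` method and print the result. Notice that the text was lowercased (the tokenizer is uncased), that "transformers" is in the vocabulary as a single token, and that each "!" became its own token.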

### Try more examples:
- "Deep learning is fun!"
- "Let's build something amazing."

See how different words and punctuation are split into tokens!

In [3]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[19081, 2024, 12476, 999, 999]


We can also see the actual token values as numbers that the model uses internally, starting with 19081.

### Try converting tokens to IDs for other sentences:
- "AI is everywhere!"
- "HuggingFace rocks!"

Each token has a unique ID in the model's vocabulary.
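The token-to-ID lookup can be pictured as a plain dictionary. The sketch below uses a made-up five-entry vocabulary (`toy_vocab` and both helper functions are hypothetical names, not part of the library) to show the round trip and what could happen to a token that is missing from the vocabulary.

```python
# Hypothetical five-entry vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "ai": 1, "is": 2, "everywhere": 3, "!": 4}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def tokens_to_ids(tokens):
    """Map each token to its ID, falling back to the [UNK] ID for unknown tokens."""
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]


def ids_to_tokens(ids):
    """Map IDs back to their tokens."""
    return [id_to_token[i] for i in ids]


ids = tokens_to_ids(["ai", "is", "everywhere", "!"])
print(ids)                        # [1, 2, 3, 4]
print(ids_to_tokens(ids))         # ['ai', 'is', 'everywhere', '!']
print(tokens_to_ids(["robots"]))  # [0]
```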

One notable feature of Hugging Face’s tokenizers is their speed. By leveraging a programming language called Rust under the hood, the library ensures rapid tokenization, even for vast amounts of text.

### Fun fact:
- You can tokenize entire books or large datasets in seconds!
- Try tokenizing a paragraph or a list of sentences and see how fast it is.

## 🧠 Models

Let's look at accessing the diverse and expansive collection of open-source models available on HuggingFace. You'll find lots of foundation models such as GPT, Qwen, Google's Gemma, and many more. There are also fine-tuned variants made by the community!

### What can you do with models?
- Text classification (spam detection, sentiment analysis)
- Text generation (stories, code, emails)
- Translation (English ↔️ French, etc.)
- Question answering
- Named entity recognition

Let's see how to use a pre-trained model for sentiment analysis! 🎬

In [4]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

model_name = 'textattack/bert-base-uncased-imdb'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)




Let's download a model from HuggingFace and use it for sentiment analysis!

Here's a quick demonstration using a pre-trained model to classify the sentiment of the sentence "I love Generative AI". First, we load a model and its tokenizer, shared by the community on HuggingFace. In this case, we are using the `textattack/bert-base-uncased-imdb` model, a BERT model fine-tuned on IMDB movie reviews for sentiment analysis.

### Try more examples:
- "This movie was fantastic!"
- "I didn't like the ending."
- "The actors did a great job."

See how the model predicts the sentiment for different sentences!

In [5]:
sentence = "I love Generative AI"
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1)
    prediction = torch.argmax(probabilities, dim=1).item()

if prediction == 1:
    print(f'The sentiment is positive with a probability of {probabilities[0][1]:.2f}')
else:
    print(f'The sentiment is negative with a probability of {probabilities[0][0]:.2f}')

The sentiment is positive with a probability of 0.89


To get the prediction, we go through a few steps:

1. Tokenize the input using the tokenizer
2. Use `torch.no_grad()` to disable gradient tracking, since we're only making predictions (not training)
3. Convert the output logits into probabilities for negative vs positive sentiment with softmax, and pick the more likely class
4. Print the result with the probability

So, according to this model, "I love Generative AI" has a positive sentiment, with a probability of 89% in this run.
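The softmax-then-argmax step can be worked through by hand. This is a minimal sketch in plain Python; the two logits are hypothetical values chosen for illustration (not the model's actual outputs), ordered as [negative, positive] like the labels above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.0, 1.1]  # hypothetical [negative, positive] scores
probs = softmax(logits)

# argmax: index of the most probable class
prediction = max(range(len(probs)), key=probs.__getitem__)
label = "positive" if prediction == 1 else "negative"
print(f"{label}: {probs[prediction]:.2f}")  # positive: 0.89
```

Only the difference between the two logits matters here: a gap of 2.1 in favor of the positive class gives a probability of about 0.89.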

### Try more sentences and see the probabilities!
- "This is the best product ever!"
- "I wouldn't recommend this to anyone."
- "The plot was confusing but the visuals were stunning."

## 📚 Datasets

Now let's turn to accessing datasets. Hugging Face's "Datasets" library is designed to make it fast and easy to access, preprocess, and manage large amounts of data for AI projects.

### Why use HuggingFace Datasets?
- Access thousands of public datasets with one line of code
- Built-in tools for filtering, mapping, and transforming data
- Efficient memory usage, even for huge datasets

### Example datasets:
- IMDB (movie reviews)
- SQuAD (question answering)
- AG News (news classification)
- Amazon Reviews
- Common Crawl (web data)

Let's load a dataset and explore it! 🔍

In [1]:
from datasets import load_dataset

imdb_dataset = load_dataset('imdb')




Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

How easy was that? The `datasets` library provides an efficient way to access a wide range of datasets, including the popular IMDB dataset of movie reviews. In this example, we load the IMDB dataset and display one of its reviews.

Try loading other datasets, like 'ag_news' or 'amazon_polarity', and print some samples!

### More examples:
- Load the 'squad' dataset and print a question-answer pair
- Load the 'emotion' dataset and print a labeled sentence

In [2]:
review_number = 42
review_text = imdb_dataset['train'][review_number]['text']
review_label = imdb_dataset['train'][review_number]['label']

print(f"Review text: {review_text}")
print(f"Label: {'Positive' if review_label == 1 else 'Negative'}")

Label: Negative


There are a lot of reviews in here!

We'll look at number 42 from the train split of the dataset, since 42 is a great number 😉

To access the text of the dataset, we use the key named 'text'.

The label is provided as either 0 or 1 and obtained using the 'label' key.

Try printing other reviews and their labels. Are they positive or negative?

### More ideas:
- Count how many positive and negative reviews are in the first 100 samples
- Find the longest review in the dataset

An important feature of the datasets library is its efficiency. Built on top of Apache Arrow, it allows for lightning-fast operations, ensuring that even large datasets can be processed seamlessly without hogging memory resources.

### Fun fact:
- You can filter, map, and batch process millions of samples quickly!
- Try filtering all positive reviews or mapping a function to clean the text.
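As a rough picture of what counting, filtering, and mapping do to each record, here is a plain-Python stand-in that works on a list of dictionaries rather than a real `Dataset`. The three records are made up for illustration (they are not taken from IMDB) and only mimic the 'text' and 'label' keys used above.

```python
from collections import Counter

# Made-up records shaped like IMDB rows (hypothetical data, not from the dataset).
sample = [
    {"text": "A wonderful film.<br />Loved it.", "label": 1},
    {"text": "Dull and far too long.", "label": 0},
    {"text": "Great cast, great story.", "label": 1},
]

# Count positive (1) and negative (0) labels.
label_counts = Counter(row["label"] for row in sample)

# "Filter": keep only the positive reviews.
positives = [row for row in sample if row["label"] == 1]

# "Map": return a cleaned copy of every record, replacing HTML line breaks with spaces.
cleaned = [{**row, "text": row["text"].replace("<br />", " ")} for row in sample]

# Find the longest review by character count.
longest = max(sample, key=lambda row: len(row["text"]))

print(label_counts[1], label_counts[0])  # 2 1
print(len(positives))                    # 2
print(cleaned[0]["text"])                # A wonderful film. Loved it.
```

With a loaded dataset, the `filter` and `map` methods of the `datasets` library play the role of the two list comprehensions, applied split by split.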

## 🏋️‍♂️ Trainer

HuggingFace also makes training a model easier! The `Trainer` class offers a streamlined solution for training and fine-tuning machine learning models. It encapsulates much of the complexity associated with training loops, evaluation, and optimization.

### Why use Trainer?
- Handles training, evaluation, and saving models automatically
- Supports logging, early stopping, and more
- Works with any HuggingFace model and dataset

Let's see how to set up a Trainer for a classification task!

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Define the pre-trained model name
model_name = "distilbert-base-uncased"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define a function to tokenize the dataset
def tokenize_function(examples):
    # Tokenize the text, pad to the max length, and truncate
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Load the IMDb dataset
imdb_dataset = load_dataset('imdb')

# Select a smaller subset of the dataset for faster training
subset_size = 1000 # Define the size of the subset
small_train_dataset = imdb_dataset["train"].shuffle(seed=42).select(range(subset_size))
small_test_dataset = imdb_dataset["test"].shuffle(seed=42).select(range(subset_size))


# Apply the tokenization function to the subset datasets
tokenized_datasets = {
    "train": small_train_dataset.map(tokenize_function, batched=True),
    "test": small_test_dataset.map(tokenize_function, batched=True)
}

Loading weights:   0%|          | 0/100 [00:00<?, ?it/s]

DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_layer_norm.weight | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
pre_classifier.bias     | MISSING    | 
pre_classifier.weight   | MISSING    | 
classifier.bias         | MISSING    | 
classifier.weight       | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Let's start by loading the 'distilbert-base-uncased' pre-trained model and its tokenizer.

Next, we create a function called `tokenize_function`, which takes a batch of reviews and converts them to token IDs that the model can process. Notice that we truncate long reviews and pad shorter ones to the model's maximum length. We then load the IMDB movie review dataset, take a shuffled subset of 1,000 training and 1,000 test reviews for faster training, and map each subset to a tokenized, padded, and truncated version. You can train on the whole dataset if you want!
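The effect of padding and truncation on a single sequence can be sketched in a few lines. The helper `pad_or_truncate` is hypothetical, the pad ID of 0 is an assumption, and the sketch ignores special tokens such as `[CLS]` and `[SEP]`, which a real tokenizer adds and preserves.

```python
PAD_ID = 0  # assumed padding ID for this sketch


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut a sequence to max_length, or right-pad it, and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


ids = [19081, 2024, 12476, 999, 999]  # "transformers are awesome ! !" from earlier

padded = pad_or_truncate(ids, max_length=8)
print(padded["input_ids"])       # [19081, 2024, 12476, 999, 999, 0, 0, 0]
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]

truncated = pad_or_truncate(ids, max_length=3)
print(truncated["input_ids"])    # [19081, 2024, 12476]
```

Padding every review to the same length lets them be stacked into fixed-shape batches, at the cost of extra computation on short reviews.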

Try changing the subset size or using a different dataset for your own experiments.

In [2]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    save_strategy="epoch",
    report_to='none', # Disable reporting to experiment tracking services
)

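With that all set, we define some training arguments. Here we set the output directory, learning rate, batch sizes, and number of epochs. We also save a checkpoint at the end of each epoch (`save_strategy="epoch"`) and disable reporting to experiment tracking services (`report_to='none'`). The dataset splits used for training and evaluation are passed to the `Trainer` itself in the next step.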

### Try changing:
- The learning rate (try 5e-5 or 1e-4)
- The number of epochs (try 1 or 5)
- The batch size (try 16 or 128)

See how these changes affect training speed and model performance!
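One way to reason about these settings before running anything is to count optimizer steps. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; `optimizer_steps` is a hypothetical helper, not a library function.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run, keeping the final partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Values from the cells above: 1,000 training reviews, batch size 64, 3 epochs.
print(optimizer_steps(1000, 64, 3))   # 48 (16 steps per epoch)

# Smaller batches mean more steps per epoch; larger batches mean fewer.
print(optimizer_steps(1000, 16, 3))   # 189
print(optimizer_steps(1000, 128, 3))  # 24
```

The total number of examples processed is the same in all three cases; what changes is how often the weights are updated and how much memory each step needs.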

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    #tokenizer=tokenizer,
)

trainer.train()

Now we create a `Trainer` object and start training!

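The Trainer handles the training loop for you and, with the arguments above, saves a checkpoint to `./results` at the end of each epoch. The test subset is passed as `eval_dataset`, so you can evaluate the model with `trainer.evaluate()`.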

### More ideas:
- Add evaluation metrics (like accuracy or F1-score)
- Save and reload your trained model
- Try training on a different dataset or with a different model
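For the first idea, `Trainer` accepts a `compute_metrics` callable that receives the model's predictions together with the true labels and returns a dictionary of metric names and values. Below is a minimal accuracy sketch in plain Python; the logits and labels in the usage lines are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels), with one row of logits per example."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews, ordered as [negative, positive].
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.4]]
toy_labels = [0, 1, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` and then calling `trainer.evaluate()` should report the metric alongside the evaluation loss.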

Congratulations, you just trained a model with HuggingFace! 🎉
