# 🤗 Notebook 07: Introduction to the Hugging Face Ecosystem

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 🎯 Objectives

Welcome to Hugging Face! This is your gateway to the world of modern, pre-trained NLP models. The `transformers` library and the Hugging Face Hub have become the industry standard for building, sharing, and deploying state-of-the-art models.

In this notebook, you will learn the complete end-to-end workflow:

1.  **The Ecosystem:** Get a high-level overview of the Hugging Face Hub, `transformers`, `datasets`, and other key libraries.
2.  **Off-the-Shelf Inference:** Use powerful, pre-trained models for tasks like zero-shot classification and summarization *without any training of your own*.
3.  **The Core Components:** Understand the roles of `AutoTokenizer` and `AutoModel`.
4.  **Fine-Tuning:** Learn how to adapt a pre-trained model to a specific, custom task using the `Trainer` API.
5.  **Inference and Deployment:** Use your fine-tuned model for predictions and save it for future use.

**Estimated Time:** 3-4 hours

---

## 🌍 The Hugging Face Ecosystem at a Glance

Hugging Face is more than just a library; it's a collaborative platform for machine learning.

*   **The Hub:** A central repository with over 500,000 models, 100,000 datasets, and 300,000 "Spaces" (apps). It's like GitHub for machine learning.
*   **`transformers`:** The core Python library. It provides a unified API for accessing thousands of pre-trained models (like BERT, GPT, T5) and tools for training them.
*   **`datasets`:** A library for efficiently loading, processing, and sharing large datasets.
*   **`evaluate`:** A library for easily calculating standard NLP metrics (like accuracy, F1, BLEU).
*   **`tokenizers`:** A high-performance library for converting text into the numerical inputs that models understand.

We'll touch on all of these as we build a practical text classifier.

In [None]:
# --- Core Library Imports ---
import torch
import numpy as np  # Used by compute_metrics (np.argmax) later in the notebook
import pandas as pd
from typing import List, Dict

# --- Hugging Face Library Imports ---
# The 'pipeline' is a high-level, easy-to-use API for inference
from transformers import pipeline

# Auto* classes are smart wrappers that can automatically load the correct
# architecture from a given model checkpoint
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The Trainer API provides a complete training and evaluation loop
from transformers import Trainer, TrainingArguments

# A helper for dynamically padding batches of text
from transformers import DataCollatorWithPadding

# The 'datasets' library for loading and processing data
from datasets import Dataset, DatasetDict

# The 'evaluate' library for metrics
import evaluate

# --- Configuration ---
# Set a seed for reproducibility
torch.manual_seed(42)

# Set the device (use GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")

# Check the transformers version
import transformers
print(f"Transformers version: {transformers.__version__}")

## 🧠 Part 1: Instant Results with Inference Pipelines

The `pipeline` is the easiest way to get started with Hugging Face. It abstracts away all the complexity of tokenization, model loading, and post-processing, allowing you to use a pre-trained model for a specific task in just a few lines of code.

When a pipeline applies a model to labels or a task it wasn't *explicitly* trained on, as in the classification example below, this is called **zero-shot inference**: the model relies on its general language understanding instead of task-specific training.

### Example: Zero-Shot Text Classification

Let's say we have a maintenance log and we want to classify its severity. We can use a model pre-trained on a general "Natural Language Inference" (NLI) task to do this, without any fine-tuning. The model understands whether a "hypothesis" is an "entailment," "contradiction," or "neutral" to a "premise." We can cleverly adapt this to our classification task.

*   **Premise:** The maintenance log text.
*   **Hypotheses:** Our candidate labels (e.g., "This text is about a critical issue," "This text is about a normal event").

The model will tell us which "hypothesis" is most entailed by the "premise."
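
To make the premise/hypothesis trick concrete, here is a minimal sketch of the scoring logic, not the pipeline's actual implementation. It assumes the hypothesis template `"This example is {}."` and uses made-up entailment logits in place of real model output; the helper names `build_hypotheses` and `softmax` are illustrative only.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ["normal operation", "warning", "critical failure"]
hypotheses = build_hypotheses(candidate_labels)

# HYPOTHETICAL entailment logits, one per (premise, hypothesis) pair.
# A real NLI model would produce these; the numbers here are invented.
entailment_logits = [-2.1, 0.3, 3.4]

# Normalizing the entailment logits across labels yields the label scores.
scores = softmax(entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)

for label, score in ranked:
    print(f"- {label}: {score:.3f}")
```

The label whose hypothesis is most strongly entailed ends up first in the ranking, which is the same shape of output the zero-shot pipeline returns.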

In [None]:
# 1. Create an inference pipeline for 'zero-shot-classification'
# The model 'facebook/bart-large-mnli' is a popular choice for this task.
# We specify the device to ensure it runs on the GPU if available.
classifier = pipeline(
    task='zero-shot-classification',
    model='facebook/bart-large-mnli',
    device=0 if torch.cuda.is_available() else -1
)

# 2. Define the text we want to classify and our candidate labels
maintenance_log = "Hydraulic pump output pressure collapsed, leading to an immediate line shutdown."
candidate_labels = ['normal operation', 'warning', 'critical failure']

# 3. Run the classification
result = classifier(maintenance_log, candidate_labels)

# The output is a dictionary containing the labels and their corresponding scores
print(f"Log Entry: '{result['sequence']}'\n")
print("Classification Scores:")
for label, score in zip(result['labels'], result['scores']):
    print(f"- {label}: {score:.3f}")

### Example: Summarization

Another task that pipelines handle out of the box is summarization. Unlike the zero-shot example above, the model used here was already fine-tuned for summarization, so we are simply reusing it as-is. We can take a long, detailed log entry and condense it into a short, actionable summary, which is useful for generating daily reports or quickly grasping the gist of a complex issue.

In [None]:
# 1. Create a summarization pipeline
# 'philschmid/bart-large-cnn-samsum' is a model fine-tuned for summarizing conversations,
# so treat its output on technical logs as a draft and review it before relying on it.
summarizer = pipeline(
    task='summarization',
    model='philschmid/bart-large-cnn-samsum',
    device=0 if torch.cuda.is_available() else -1
)

# 2. Define a long log entry
long_log_entry = (
    "During the midnight shift, operators in Bay 3 reported a series of persistent, high-frequency "
    "vibration spikes originating from the main gearbox assembly of Conveyor Belt #5. "
    "Following the vibration alerts, the coolant temperature for the associated motor began to rise, "
    "exceeding the nominal operating threshold of 85°C. A manual inspection by the on-site technician "
    "confirmed a partial blockage in the primary coolant loop, likely due to sediment buildup. "
    "As a temporary measure, the secondary coolant bypass was engaged, which successfully restored "
    "coolant flow and stabilized the temperature. However, the root cause of the blockage has not "
    "been addressed, and the gearbox pressure remains unstable."
)

# 3. Generate the summary
# We can control the length of the desired summary.
summary = summarizer(
    long_log_entry,
    max_length=50,  # Maximum number of tokens (not words) in the summary
    min_length=20,  # Minimum number of tokens
    do_sample=False # Use deterministic decoding (no randomness)
)[0]['summary_text']

print("--- Original Log ---")
print(long_log_entry)
print("\n--- Generated Summary ---")
print(summary)

## 🧠 Part 2: The Core Components - Tokenizers and Models

While pipelines are great for quick results, building custom solutions requires understanding the two core components that power them: the **Tokenizer** and the **Model**.

### The Tokenizer

A tokenizer's job is to convert raw text into a numerical format that the model can understand. This is a multi-step process:
1.  **Splitting:** The text is broken down into words, subwords, or characters. Most modern models use **subword tokenization** (like WordPiece or BPE), which can handle out-of-vocabulary words and capture morphological information.
2.  **Converting to IDs:** Each token is mapped to a unique integer ID from the model's vocabulary.
3.  **Adding Special Tokens:** Special tokens required by the model are added, such as `[CLS]` (start of sequence) and `[SEP]` (separator).
4.  **Creating Attention Masks:** An attention mask (a tensor of 1s and 0s) is created to tell the model which tokens are real and which are padding.
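
The four steps above can be sketched with a toy tokenizer. This is an illustration under stated assumptions, not how the `tokenizers` library is implemented: the vocabulary `TOY_VOCAB` is invented, the splitting is a simplified greedy longest-match in the style of WordPiece, and the sketch pads to a fixed `max_len` without truncating.

```python
# HYPOTHETICAL nine-entry vocabulary; real models have tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "pump": 4, "press": 5, "##ure": 6, "drop": 7, "##ped": 8,
}

def wordpiece(word, vocab):
    """Greedy longest-match split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab, max_len=10):
    # Step 1: split into subword tokens (lowercased, like an 'uncased' model)
    tokens = ["[CLS]"]  # Step 3: special token at the start
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")  # Step 3: special token at the end
    # Step 2: map tokens to integer IDs
    input_ids = [vocab[token] for token in tokens]
    # Step 4: attention mask is 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return tokens, input_ids, attention_mask

tokens, input_ids, attention_mask = encode("Pump pressure dropped", TOY_VOCAB)
print(tokens)
print(input_ids)
print(attention_mask)
```

Because "pressure" and "dropped" are missing from the toy vocabulary, each is assembled from two known pieces instead of collapsing to `[UNK]`.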

The `AutoTokenizer` class automatically downloads and configures the correct tokenizer for a given model checkpoint from the Hub.

In [None]:
# 1. Load the tokenizer for 'distilbert-base-uncased'
# This is a fast and popular model, good for general-purpose tasks.
model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# 2. Tokenize a sentence
# We pass the text and ask for PyTorch tensors in return ('pt').
encoded_input = tokenizer(maintenance_log, return_tensors='pt')

print("--- Tokenizer Output ---")
# The output is a dictionary containing the input IDs and the attention mask
for key, value in encoded_input.items():
    print(f"{key}:")
    print(value)

# 3. Decode the input IDs back to tokens to see what's happening
input_ids = encoded_input['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)

print("\n--- Decoded Tokens ---")
print(tokens)

# Notice the special tokens '[CLS]' and '[SEP]' that have been added.
# Any word missing from the vocabulary is split into subword pieces, with
# continuation pieces prefixed by '##'. This is subword tokenization in action!

## 🧠 Part 3: Fine-Tuning a Transformer for a Custom Task

Zero-shot inference is powerful, but for high accuracy on a specific domain, you need to **fine-tune**. Fine-tuning takes a general-purpose pre-trained model and further trains it on a small, labeled dataset specific to your task.

**Our Goal:** Build a classifier to categorize maintenance logs into three severity levels: **Normal**, **Warning**, and **Critical**.

**The Workflow:**
1.  **Prepare the Dataset:** We'll create a small, labeled dataset of maintenance logs.
2.  **Tokenize the Data:** We'll use our `AutoTokenizer` to process the entire dataset.
3.  **Load the Model:** We'll use `AutoModelForSequenceClassification` to load a pre-trained model with a classification head on top.
4.  **Define Training Arguments:** We'll configure the training process (learning rate, batch size, epochs, etc.) using `TrainingArguments`.
5.  **Train:** We'll use the `Trainer` API to run the fine-tuning loop.

### Step 1: Prepare the Dataset

In a real-world scenario, this data would come from a database or log files. For this example, we'll create a small `pandas` DataFrame.

In [None]:
# Create a labeled dataset of maintenance logs
incident_data = {
    'text': [
        'Lubrication schedule for gearbox #3 completed with no deviations noted.',
        'Pressure fluctuations observed above the 5% tolerance threshold on the main hydraulic press.',
        'Emergency stop was triggered by a high voltage surge on production line B.',
        'Routine monthly inspection confirmed all safety sensors are calibrated and functional.',
        'A minor coolant leak was observed near the heat exchanger housing during the night shift.',
        'The critical alarm for furnace #2 temperature persisted for 15 minutes despite a manual override attempt.',
        'Conveyor speed oscillation was resolved after a standard system reset procedure.',
        'The primary bearing temperature exceeded the critical safety threshold of 120°C.',
        'Complete hydraulic pump failure on the stamping machine caused an immediate production halt.',
        'A minor increase in vibration was logged during the swing shift for the milling machine.'
    ],
    'label': [
        0, # Normal
        1, # Warning
        2, # Critical
        0, # Normal
        1, # Warning
        2, # Critical
        0, # Normal
        2, # Critical
        2, # Critical
        1  # Warning
    ]
}

# Create a pandas DataFrame
incident_df = pd.DataFrame(incident_data)

# Create mappings between integer labels and human-readable names
id2label = {0: 'NORMAL', 1: 'WARNING', 2: 'CRITICAL'}
label2id = {'NORMAL': 0, 'WARNING': 1, 'CRITICAL': 2}

print("--- Sample of the Labeled Dataset ---")
incident_df.head()

### Step 2: Convert to a `Dataset` Object and Tokenize

The `Trainer` API works best with the `Dataset` object from the `datasets` library. We'll convert our DataFrame to a `Dataset` and then apply our tokenizer to the entire dataset at once.
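
One design choice when tokenizing a whole dataset is when to pad. The sketch below compares padding every example to a fixed maximum length with padding each batch only to its own longest example. It is a plain-Python illustration, not the `DataCollatorWithPadding` implementation: the helper `pad_batch` and the token IDs are made up, and 512 is assumed as the model's maximum input length.

```python
def pad_batch(batch, pad_id=0, pad_to=None):
    """Pad every sequence to `pad_to`, or to the longest sequence if None."""
    target = pad_to if pad_to is not None else max(len(seq) for seq in batch)
    return {
        "input_ids": [seq + [pad_id] * (target - len(seq)) for seq in batch],
        "attention_mask": [
            [1] * len(seq) + [0] * (target - len(seq)) for seq in batch
        ],
    }

# HYPOTHETICAL token IDs for three short log entries of different lengths
batch = [
    [101, 7, 8, 102],
    [101, 7, 8, 9, 10, 102],
    [101, 7, 102],
]

dynamic = pad_batch(batch)            # pad to the longest example (6 tokens)
fixed = pad_batch(batch, pad_to=512)  # pad to an assumed model maximum

dynamic_tokens = sum(len(row) for row in dynamic["input_ids"])
fixed_tokens = sum(len(row) for row in fixed["input_ids"])

print(f"Dynamic padding: {dynamic_tokens} token slots")
print(f"Fixed padding:   {fixed_tokens} token slots")
```

With short texts like maintenance logs, most slots in the fixed-length batch are padding, which is why padding per batch is usually the cheaper option.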

In [None]:
# Convert the pandas DataFrame into a Hugging Face Dataset object
full_dataset = Dataset.from_pandas(incident_df)

# The .map() function is a powerful way to apply a function to the entire dataset.
# We define a simple function to tokenize a batch of examples.
def tokenize_function(examples):
    # Truncate examples that are longer than the model's max input size.
    # Padding is left to the DataCollatorWithPadding defined later, which pads
    # each batch only to the length of its longest example.
    return tokenizer(examples["text"], truncation=True)

# Apply the tokenization to the entire dataset
tokenized_dataset = full_dataset.map(tokenize_function, batched=True)

# For training, we need to split our data into a training set and a validation set.
# This helps us check if the model is overfitting.
train_test_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# The result is a DatasetDict, which is a dictionary-like object holding our splits.
split_dataset = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test']
})

print("--- Dataset Structure after Splitting and Tokenization ---")
print(split_dataset)

print("\n--- Example from the Training Set ---")
print(split_dataset['train'][0])

### Step 3: Load the Model and Define Metrics

We'll now load our pre-trained model. It's crucial to use `AutoModelForSequenceClassification` because it loads the base `distilbert` model *plus* a randomly initialized classification head on top, ready for fine-tuning.

We also need to define the metric we'll use to evaluate the model during training. The `evaluate` library makes this easy.
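
As a rough picture of what the metric computation involves, the sketch below turns raw logits into predicted classes with an argmax and then measures accuracy. It uses plain Python and invented logits, so the helper names `argmax` and `accuracy` and the numbers are assumptions for illustration only.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# HYPOTHETICAL logits for four examples over three classes
# (0 = Normal, 1 = Warning, 2 = Critical)
logits = [
    [2.0, 0.1, -1.0],  # predicted class 0
    [0.2, 0.3, 1.5],   # predicted class 2
    [0.9, 1.1, 0.4],   # predicted class 1
    [0.1, 0.2, 0.0],   # predicted class 1
]
labels = [0, 2, 0, 1]

metrics = accuracy(logits, labels)
print(metrics)
```

Three of the four predictions match their labels here, so the reported accuracy is 0.75.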

In [None]:
# Load the pre-trained model with a classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, 
    num_labels=3, # We have 3 classes: Normal, Warning, Critical
    id2label=id2label, # Pass our mappings to the model
    label2id=label2id
).to(device) # Move the model to the GPU if available

# The DataCollatorWithPadding will dynamically pad the sentences to the
# longest length in a batch during training. This is more efficient than
# padding all sentences to the overall maximum length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load the accuracy metric from the 'evaluate' library
accuracy_metric = evaluate.load("accuracy")

# Define a function to compute metrics during evaluation
def compute_metrics(eval_pred):
    """
    This function is called by the Trainer during evaluation.
    It takes the model's predictions and the true labels, and returns a dictionary of metrics.
    """
    logits, labels = eval_pred
    # The model outputs raw logits; we get the predicted class by taking the argmax
    predictions = np.argmax(logits, axis=-1)
    # The metric's 'compute' method returns a dictionary, e.g., {'accuracy': 0.9}
    return accuracy_metric.compute(predictions=predictions, references=labels)

### Step 4 & 5: Define Training Arguments and Train!

This is the final step before training. The `TrainingArguments` class is a powerful configuration object that lets you control every aspect of the training loop.

We'll then instantiate the `Trainer`, passing it all the components we've prepared: the model, tokenizer, datasets, and training arguments. Calling `trainer.train()` will kick off the fine-tuning process.
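
It helps to estimate how much work a training run will do before starting it. The arithmetic below uses the values from this notebook (10 examples, a 20% validation split, batch size 4, 15 epochs). It assumes a single device, no gradient accumulation, and a split that rounds to 2 validation examples.

```python
import math

n_examples = 10   # rows in the labeled dataset
test_size = 0.2   # fraction held out for validation
batch_size = 4    # per-device training batch size
num_epochs = 15

# Assumption: the split yields ceil(10 * 0.2) = 2 validation examples
n_validation = math.ceil(n_examples * test_size)
n_train = n_examples - n_validation

# Assumption: one device and no gradient accumulation,
# so one optimizer step is taken per batch
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Training examples:   {n_train}")
print(f"Validation examples: {n_validation}")
print(f"Steps per epoch:     {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
```

With only two validation examples, accuracy can only take the values 0.0, 0.5, or 1.0, so the per-epoch metrics on this toy dataset are a rough signal at best.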

In [None]:
# Define the directory where model checkpoints will be saved
output_dir = 'incident_classifier_model'

# Configure the training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,               # A small learning rate is crucial for fine-tuning
    per_device_train_batch_size=4,    # Batch size for training
    per_device_eval_batch_size=4,     # Batch size for evaluation
    num_train_epochs=15,              # Number of times to iterate over the training data
    weight_decay=0.01,                # Adds a bit of regularization
    # Note: newer transformers releases rename `evaluation_strategy` to `eval_strategy`.
    evaluation_strategy="epoch",      # Run evaluation at the end of each epoch
    save_strategy="epoch",            # Save a model checkpoint at the end of each epoch
    load_best_model_at_end=True,      # Automatically load the best model at the end of training
    logging_strategy="epoch",         # Log metrics at the end of each epoch
    push_to_hub=False,                # We won't push to the Hub in this example
)

# Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["validation"],
    tokenizer=tokenizer,              # Newer releases prefer `processing_class=tokenizer`
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start the training!
print("🚀 Starting fine-tuning...")
trainer.train()
print("✅ Fine-tuning complete!")

## 🧠 Part 4: Using and Evaluating the Fine-Tuned Model

Now that the model is trained, let's use it for what it was built for: **inference**. We can use the `trainer.predict()` method to get predictions on our validation set, or we can build a `pipeline` for easy inference on new, unseen data.

### Inference on New, Unseen Data

This is the true test of our model. Let's give it a log entry it has never seen before and see how it classifies it. We can wrap our fine-tuned model in a `pipeline` for maximum convenience.
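
To show what a text-classification pipeline does with the model's output, here is a minimal sketch of the post-processing step: a softmax over the logits, followed by a lookup in the `id2label` mapping. The logits are invented and the `postprocess` helper is hypothetical; a real run would take its logits from the fine-tuned model.

```python
import math

ID2LABEL = {0: "NORMAL", 1: "WARNING", 2: "CRITICAL"}

def postprocess(logits, id2label):
    """Convert raw logits into a {'label', 'score'} result."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # stable softmax
    total = sum(exps)
    probabilities = [e / total for e in exps]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {"label": id2label[best], "score": probabilities[best]}

# HYPOTHETICAL logits for a new log such as
# "Main drive motor seized and the line has been halted."
hypothetical_logits = [-1.2, 0.4, 2.9]

result = postprocess(hypothetical_logits, ID2LABEL)
print(f"Predicted label: {result['label']} (score: {result['score']:.3f})")
```

With the objects from this notebook, the real call would be `pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)`, followed by calling that pipeline on a new log string.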

## 🧠 Part 5: Saving Your Model for Deployment

Training a model is only half the battle. To use it in a real application, you need to save it and be able to load it later. The `Trainer` makes this incredibly simple.

When you save a model, Hugging Face saves everything you need to recreate it:
*   The model weights (`model.safetensors` in recent `transformers` releases, `pytorch_model.bin` in older ones).
*   The model configuration file (`config.json`).
*   The tokenizer files (`tokenizer.json`, `vocab.txt`, etc.).

In [None]:
# Define a path to save the final, best-performing model
final_model_path = "./final_incident_classifier"

# Use the trainer's 'save_model' method to save the model and tokenizer
trainer.save_model(final_model_path)

print(f"✅ Model and tokenizer saved to: {final_model_path}")

# You can now find a new directory with all the necessary files.
# This directory can be zipped and sent to a server or loaded in another notebook.
import os
print("\n--- Files in the saved model directory ---")
for filename in os.listdir(final_model_path):
    print(f"- {filename}")

### Loading and Using the Saved Model

Let's simulate a new session. We'll pretend we've just opened our notebook and want to use our previously trained model. We can load it directly from the directory we just created.

In [None]:
# Load the model and tokenizer from the saved path
# The Auto* classes are smart enough to figure everything out from the directory.
loaded_model = AutoModelForSequenceClassification.from_pretrained(final_model_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(final_model_path)

# Create a new pipeline with the loaded components
loaded_pipeline = pipeline(
    "text-classification",
    model=loaded_model,
    tokenizer=loaded_tokenizer
)

print("✅ Successfully loaded model from disk and created a new pipeline.")

# Test the loaded pipeline with another new log
another_new_log = "All systems are running within normal parameters after the quarterly maintenance check was completed."
loaded_result = loaded_pipeline(another_new_log)

print("\n--- Inference with Loaded Model ---")
print(f"Log: '{another_new_log}'")
print(f"Predicted Result: {loaded_result}")

## 🎉 Summary & Next Steps

Congratulations! You have completed the full, end-to-end workflow for fine-tuning a Transformer model for a custom task.

### Key Takeaways:

*   **Pipelines for Quick Inference:** The `pipeline` function is the fastest way to use pre-trained models for standard tasks.
*   **Tokenizer is Key:** The tokenizer prepares your text for the model, handling subword splitting, special tokens, and padding.
*   **Fine-Tuning Unlocks Performance:** By fine-tuning a general-purpose model on your specific data, you can usually reach much higher accuracy on your custom task than zero-shot inference provides, given enough labeled examples.
*   **The `Trainer` API:** This high-level API handles the entire training and evaluation loop, making the process streamlined and repeatable.
*   **Save and Load for Production:** Saving a trained model with `trainer.save_model()` is the first step toward deploying it in a real-world application.

### Where to Go From Here:

*   **Push to Hub:** Learn how to use `trainer.push_to_hub()` to share your model with the community or your team.
*   **More Complex Tasks:** Explore other tasks like Named Entity Recognition (NER), Question Answering, or Text Generation. The workflow is very similar.
*   **Model Optimization:** For production, you might want to optimize your model for speed and size using techniques like quantization or pruning.
*   **MLOps:** Integrate your training workflow into a robust MLOps pipeline for automated testing, deployment, and monitoring.

This notebook provides the foundational workflow. You are now equipped to tackle a wide range of NLP problems using the power of the Hugging Face ecosystem.

<div align="center">
<b>You've fine-tuned your first Transformer. The possibilities are endless! 🚀</b>
</div>