# 🤗 Notebook 07: HuggingFace Transformers & Hub Intro

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

By the end of this notebook, you will master:
1. ✅ Navigating the HuggingFace ecosystem (Hub, Datasets, Spaces)
2. ✅ Using inference pipelines for manufacturing text tasks
3. ✅ Tokenizers, AutoModel, and configuration overview
4. ✅ Fine-tuning a transformer with the `Trainer` API
5. ✅ Evaluating and saving models locally & to the Hub
6. ✅ Manufacturing best practices for secure deployment

**Estimated Time:** 4-5 hours

---

## 🌍 HuggingFace Ecosystem at a Glance

- **Hub**: 400k+ models, datasets, spaces
- **Transformers**: State-of-the-art model library
- **Datasets**: Efficient dataset loading and streaming
- **Evaluate**: Metrics ready-to-use
- **Spaces**: Deploy demos with Gradio/Streamlit

We'll connect these capabilities to manufacturing, pharma, and agribusiness automation.

In [None]:
# Core libraries
import os
import torch
import pandas as pd
from typing import List, Dict

# HuggingFace libraries (pre-installed via requirements.txt)
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict
import evaluate

torch.manual_seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")
import transformers  # imported here only to read the installed version
print(f"Transformers version: {transformers.__version__}")

## 1️⃣ Authenticating with the HuggingFace Hub

1. Create an account: https://huggingface.co
2. Generate a **User Access Token** (Settings → Access Tokens)
3. Login from notebook or CLI

```python
from huggingface_hub import login
login(token='hf_xxx...')  # or use os.environ['HF_TOKEN']
```

> 🔐 Tip: Store tokens securely in environment variables or GitHub Secrets for CI/CD.
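
One way to follow this tip is to read the token from the environment and skip the login when it is absent. The sketch below assumes the token is stored in a variable named `HF_TOKEN` (the name used in the comment above). `read_hf_token` and `login_from_env` are illustrative helper names, not part of `huggingface_hub`.

```python
import os
from typing import Optional


def read_hf_token(var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in the environment, or None if it is unset or blank."""
    token = os.environ.get(var_name, "").strip()
    return token or None


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in to the Hub only when a token is available; never hard-code it."""
    token = read_hf_token(var_name)
    if token is None:
        print(f"{var_name} is not set; skipping Hub login.")
        return False
    # Imported lazily so the helper can be defined without the package installed.
    from huggingface_hub import login
    login(token=token)
    return True


print("token found" if read_hf_token() else "no token set")
```

Calling `login_from_env()` at the top of the notebook keeps the secret out of the saved cells and out of version control.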

## 2️⃣ Quick Inference Pipelines

### Zero-Shot Classification for Incident Severity
Classify maintenance logs into Normal/Warning/Critical without training.

In [None]:
incident_pipeline = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli',
    device=0 if torch.cuda.is_available() else -1
)

maintenance_log = 'Hydraulic pump output pressure collapsed causing line shutdown'
labels = ['normal', 'warning', 'critical']
result = incident_pipeline(maintenance_log, candidate_labels=labels)
result

In [None]:
pd.DataFrame({
    'label': result['labels'],
    'score': result['scores']
})
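
The pipeline result is a dict with parallel `labels` and `scores` lists. A minimal sketch of turning that into a routing decision follows. The `0.6` threshold, the `needs_review` fallback, and the sample scores are illustrative assumptions, not real model output.

```python
from typing import Dict, List, Union

ZeroShotResult = Dict[str, Union[str, List[str], List[float]]]


def route_incident(result: ZeroShotResult,
                   threshold: float = 0.6,
                   fallback: str = "needs_review") -> str:
    """Return the top label if its score clears the threshold, else a fallback."""
    # Pair each label with its score and pick the highest-scoring pair.
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return label if score >= threshold else fallback


# Hand-written sample that mimics the shape of the pipeline output.
sample_result = {
    "sequence": "Hydraulic pump output pressure collapsed causing line shutdown",
    "labels": ["critical", "warning", "normal"],
    "scores": [0.81, 0.15, 0.04],
}
print(route_incident(sample_result))  # → critical
```

Sending low-confidence predictions to a human reviewer is a common safeguard when a zero-shot model has not been validated on plant data.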

### Summarization for Maintenance Logs
Compress a long log entry into an actionable summary.

In [None]:
summarizer = pipeline(
    'summarization',
    model='philschmid/bart-large-cnn-samsum',
    device=0 if torch.cuda.is_available() else -1
)

long_log = (
    'During the night shift operators noticed persistent vibration spikes, '
    'followed by coolant temperature rise in furnace bay three. '
    'Manual inspection confirmed partial blockage in the coolant loop. '
    'Temporary bypass restored flow but pressure remains unstable.'
)
summary = summarizer(long_log, max_length=45, min_length=15, do_sample=False)[0]['summary_text']
print(summary)

### Named Entity Recognition (NER)
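Identify named entities within logs. A general-purpose checkpoint such as `dslim/bert-base-NER` tags persons, organizations, locations, and miscellaneous entities, so it will surface vendor names such as Siemens and ABB. Recognizing components and actions would require a model fine-tuned on domain-specific labels.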

In [None]:
ner_pipeline = pipeline(
    'ner',
    model='dslim/bert-base-NER',
    aggregation_strategy='simple',
    device=0 if torch.cuda.is_available() else -1
)
entities = ner_pipeline('Technicians replaced the Siemens servo motor on line 4 and recalibrated the ABB controller.')
entities
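
With `aggregation_strategy='simple'`, each entity is a dict that includes `entity_group`, `score`, and `word` keys. The sketch below filters such a list by entity type and confidence. The sample entities and the `0.9` cut-off are hand-written assumptions for illustration, not actual model output.

```python
from typing import Dict, List, Union

Entity = Dict[str, Union[str, float]]


def filter_entities(entities: List[Entity],
                    group: str = "ORG",
                    min_score: float = 0.9) -> List[str]:
    """Keep the text of entities that match a group and meet a minimum score."""
    return [
        entity["word"]
        for entity in entities
        if entity["entity_group"] == group and entity["score"] >= min_score
    ]


# Illustrative entities shaped like the aggregated pipeline output.
sample_entities = [
    {"entity_group": "ORG", "score": 0.98, "word": "Siemens"},
    {"entity_group": "ORG", "score": 0.71, "word": "ABB"},
]
print(filter_entities(sample_entities))  # → ['Siemens']
```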

## 3️⃣ Tokenizers & AutoModel Essentials

Tokenizers split text into subword tokens and map each token to the integer ID the model expects. HuggingFace `AutoTokenizer` loads the tokenizer that matches a given checkpoint.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
tokens = tokenizer(maintenance_log, return_tensors='pt')
tokens

Inspect decoded tokens to understand subword splitting.

In [None]:
decoded_tokens = tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])
list(zip(decoded_tokens, tokens['input_ids'][0].tolist()))
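
To make subword splitting concrete, here is a toy version of the greedy longest-match-first rule used by WordPiece-style tokenizers, where continuation pieces carry a `##` prefix. The tiny vocabulary is invented for illustration, and the real tokenizer also lowercases, handles punctuation, and uses a vocabulary of roughly 30k entries, so treat this as a sketch of the idea only.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary, not the DistilBERT vocabulary.
toy_vocab = {"hydra", "##ulic", "pump", "##s", "press", "##ure"}
print(wordpiece_tokenize("hydraulic", toy_vocab))  # → ['hydra', '##ulic']
print(wordpiece_tokenize("pumps", toy_vocab))      # → ['pump', '##s']
print(wordpiece_tokenize("valve", toy_vocab))      # → ['[UNK]']
```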

## 4️⃣ Building a Manufacturing Incident Classifier

We'll fine-tune `distilbert-base-uncased` to classify logs into **normal**, **warning**, and **critical** severity levels.

In [None]:
incident_data = pd.DataFrame({
    'text': [
        'Lubrication schedule completed with no deviations',
        'Pressure fluctuations above tolerance noted on press',
        'Emergency stop triggered due to high voltage surge',
        'Routine inspection confirmed sensors calibrated',
        'Coolant leak observed near heat exchanger',
        'Critical alarm persisted despite manual override',
        'Conveyor speed oscillation resolved after reset',
        'Bearing temperature exceeded safety threshold',
        'Hydraulic pump failure caused production halt',
        'Minor vibration increase logged during swing shift'
    ],
    'label': [0, 1, 2, 0, 1, 2, 0, 2, 2, 1]
})
label_map = {0: 'normal', 1: 'warning', 2: 'critical'}
incident_data

Split into train/validation sets and convert to a `datasets.Dataset`.

In [None]:
train_df = incident_data.sample(frac=0.8, random_state=42)
valid_df = incident_data.drop(train_df.index)
dataset = DatasetDict({
    'train': Dataset.from_pandas(train_df.reset_index(drop=True)),
    'validation': Dataset.from_pandas(valid_df.reset_index(drop=True))
})
dataset
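
A purely random split of ten rows can leave a class out of the validation set. As an optional alternative, the sketch below splits row indices per class so that every label appears on both sides. `stratified_split_indices` is an illustrative helper, not a `datasets` or pandas function.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split_indices(labels: Sequence[int],
                             train_frac: float = 0.8,
                             seed: int = 42) -> Tuple[List[int], List[int]]:
    """Split row indices class by class so both sides see every label."""
    by_label: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx: List[int] = []
    valid_idx: List[int] = []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(train_frac * len(indices))
        if len(indices) > 1:
            # Keep at least one row of this class on each side of the split.
            n_train = min(max(n_train, 1), len(indices) - 1)
        train_idx.extend(indices[:n_train])
        valid_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(valid_idx)


labels = [0, 1, 2, 0, 1, 2, 0, 2, 2, 1]  # same labels as incident_data
train_idx, valid_idx = stratified_split_indices(labels)
print(len(train_idx), len(valid_idx))  # → 7 3
```

The returned lists can be passed to `incident_data.iloc[...]` to build the two DataFrames before calling `Dataset.from_pandas`.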

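Tokenize the dataset with truncation. Padding is left to the data collator, which pads each batch to its longest sequence at training time.

In [None]:
def preprocess(batch: Dict[str, List[str]]):
    # Truncate only; DataCollatorWithPadding pads dynamically per batch
    return tokenizer(batch['text'], truncation=True)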

tokenized_dataset = dataset.map(preprocess, batched=True)
tokenized_dataset
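
Padding only needs to reach the longest sequence in each batch. The plain-Python sketch below shows that idea, which is conceptually what the `DataCollatorWithPadding` imported earlier does for tensors. The token IDs and the pad ID of `0` are illustrative assumptions.

```python
from typing import Dict, List


def pad_batch(batch_input_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_input_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_input_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # → [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # → [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps short logs from being processed at the length of the longest log in the dataset.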

### Define Model & Data Collator

In [None]:
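model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=3,
    id2label=label_map,  # store readable label names in the model config
    label2id={name: idx for idx, name in label_map.items()}
)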
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return metric.compute(predictions=preds, references=labels)
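
For intuition, the same computation can be written without any libraries: take the index of the largest logit in each row and compare it with the reference label. The logits below are made-up numbers, and `accuracy_from_logits` is an illustrative name.

```python
from typing import Dict, List


def accuracy_from_logits(logits: List[List[float]], labels: List[int]) -> Dict[str, float]:
    """Argmax each row of logits and report the fraction of correct predictions."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four logs over (normal, warning, critical).
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.1, 0.9, 0.0],
    [1.2, 0.4, 0.3],
]
toy_labels = [0, 2, 2, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # → {'accuracy': 0.75}
```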

### Fine-Tuning with the Trainer API

In [None]:
training_args = TrainingArguments(
    output_dir='model_outputs/hf_incident_classifier',
    evaluation_strategy='epoch',  # newer transformers releases rename this to eval_strategy
    save_strategy='no',
    logging_strategy='epoch',
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=6,
    weight_decay=0.01,
    load_best_model_at_end=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

train_result = trainer.train()
train_result
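
With a dataset this small, it helps to know how many optimizer steps the run performs. A rough calculation follows, assuming a single device, no gradient accumulation, and the 8 training rows produced by the 80/20 split of the ten sample logs.

```python
import math


def total_training_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Count optimizer steps: batches per epoch (last one may be partial) times epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


# 8 training rows, per_device_train_batch_size=4, num_train_epochs=6
print(total_training_steps(n_examples=8, batch_size=4, epochs=6))  # → 12
```

Twelve updates and a two-row validation set make the reported accuracy very noisy, so this run is best read as a walkthrough of the `Trainer` workflow rather than a usable classifier.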