# Notebook 03 · Model Selection & Preprocessing

Choose the right model, tokenizer, and preprocessing steps for downstream manufacturing NLP projects.

## Selection Framework
1. **Business objective**: root cause summarization vs. ticket routing
2. **Data sensitivity**: keep the model on-premises if logs contain intellectual property (IP)
3. **Latency budget**: edge analytics vs. cloud batch jobs
4. **Multilingual needs**: supplier documentation is often bilingual
5. **Tuning appetite**: zero-shot vs. fine-tuned

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Candidate checkpoint; its classification head is newly initialized until fine-tuned.
candidate = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(candidate)
model = AutoModelForSequenceClassification.from_pretrained(candidate)

sample = "Line 4 packaging robot stopped due to torque sensor fault."
inputs = tokenizer(sample, return_tensors="pt")
with torch.no_grad():  # inference only, no gradient tracking
    logits = model(**inputs).logits
logits.shape  # (batch_size, num_labels)

## Tokenizer Deep Dive
`AutoTokenizer` abstracts vocabulary, normalization, and pre-tokenization logic. Inspect tokens to understand truncation and special tokens.
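
As a minimal sketch of that inspection, the cell below tokenizes the earlier sample ticket with `distilroberta-base` and truncates it to a deliberately small `max_length` of 8. That value is an assumption chosen only to make truncation visible. The output shows which special tokens the tokenizer adds and that they count against the length budget.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
sample = "Line 4 packaging robot stopped due to torque sensor fault."

# Full encoding: special tokens are added at the start and end of the sequence.
full_ids = tokenizer(sample)["input_ids"]
full_tokens = tokenizer.convert_ids_to_tokens(full_ids)

# Truncated encoding: max_length=8 is an illustrative value, not a recommendation.
short_ids = tokenizer(sample, truncation=True, max_length=8)["input_ids"]
short_tokens = tokenizer.convert_ids_to_tokens(short_ids)

print("Special tokens:", tokenizer.all_special_tokens)
print("Model max length:", tokenizer.model_max_length)
print("Full:", len(full_tokens), full_tokens)
print("Truncated:", len(short_tokens), short_tokens)
```

Because the special tokens are kept after truncation, only `max_length` minus the number of special tokens is available for the ticket text itself.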

## 🎯 Outcomes
- Build a scoring rubric for candidate models and tokenizers.
- Perform dataset profiling to anticipate context window requirements.
- Implement normalization, redaction, and chunking strategies for plant data.
- Design post-processing for actionable insights and audit trails.

## 🔍 Data Landscape Snapshot
| Source | Format | Typical Length | Sensitivities |
| --- | --- | --- | --- |
| Shift reports | Plain text | 1k-3k tokens | Operational KPIs |
| Maintenance tickets | Semi-structured | 80-200 tokens | Machine IDs, operators |
| SOP manuals | PDF/DOCX | 5k-100k tokens | Compliance-critical |
| Sensor anomalies | CSV/JSON | 50-200 rows | Real-time streaming |
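
One way to read the table against context-window limits is sketched below. The token ranges come from the table; the 512 and 4096 limits are the context sizes checked later in this notebook, and the helper name `context_plan` is an illustrative assumption. Sensor anomalies are left out because their length is given in rows rather than tokens.

```python
# Typical token ranges per source, taken from the table above.
SOURCES = {
    "Shift reports": (1_000, 3_000),
    "Maintenance tickets": (80, 200),
    "SOP manuals": (5_000, 100_000),
}

# Context sizes to compare against (standard encoder vs. long-context encoder).
STANDARD_CONTEXT = 512
LONG_CONTEXT = 4096


def context_plan(max_tokens: int) -> str:
    """Suggest a handling strategy from the upper end of a source's length range."""
    if max_tokens <= STANDARD_CONTEXT:
        return "standard 512-token model"
    if max_tokens <= LONG_CONTEXT:
        return "long-context 4096-token model"
    return "chunk before inference"


plan = {name: context_plan(high) for name, (low, high) in SOURCES.items()}
for name, strategy in plan.items():
    print(f"{name}: {strategy}")
```

Planning against the upper end of each range is a conservative choice; profiling the real length distribution may show that most documents fit a smaller window.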

## 🧮 Model Selection Rubric
| Factor | Question | Weight | Notes |
| --- | --- | --- | --- |
| Accuracy | Does the model understand manufacturing jargon? | 30% | Evaluate on curated validation set |
| Latency | Can it respond in under 700 ms per request? | 20% | Edge vs. cloud |
| Context window | Do tickets exceed the default length? | 15% | Longformer vs. standard |
| Privacy | Can data leave the facility? | 15% | Prefer on-prem models if not |
| Cost | Is licensing/OPEX acceptable? | 10% | Consider API pricing tiers |
| Maintainability | Do we have talent to fine-tune? | 10% | Evaluate team skillset |

In [None]:
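import pandas as pd

# Hypothetical placeholder scores on a 0-1 scale; replace with results from your own evaluation.
rubric = pd.DataFrame([
    {'model': 'distilroberta-base', 'accuracy': 0.7, 'latency': 0.9, 'context': 0.4, 'privacy': 1.0, 'cost': 0.9, 'maintainability': 0.8},
    {'model': 'allenai/longformer-base-4096', 'accuracy': 0.8, 'latency': 0.5, 'context': 1.0, 'privacy': 1.0, 'cost': 0.7, 'maintainability': 0.6},
]).set_index('model')
# Weights mirror the rubric table above and sum to 1.0.
weights = {'accuracy': 0.3, 'latency': 0.2, 'context': 0.15, 'privacy': 0.15, 'cost': 0.1, 'maintainability': 0.1}
rubric['score'] = rubric.apply(lambda row: sum(row[col] * weight for col, weight in weights.items()), axis=1)
rubric.sort_values('score', ascending=False)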

- WordPiece splits units like `psi` cleanly, while GPT-2 BPE may fragment them.
- For domain-specific acronyms (OEE, SPC), consider training a custom tokenizer (see Notebook 04).
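
Normalization and redaction can run before either tokenizer sees the text. The sketch below is a minimal example under stated assumptions: the machine-ID pattern (`MX-` plus digits) and the `Operator: Name` convention are hypothetical formats, not ones taken from real plant data, and `normalize` and `redact` are illustrative helper names.

```python
import re

# Hypothetical formats: adjust these patterns to the conventions in your own data.
MACHINE_ID = re.compile(r"\bMX-\d+\b")
OPERATOR = re.compile(r"\bOperator:\s*[A-Z][a-z]+(?:\s[A-Z][a-z]+)*")
PSI = re.compile(r"\bpsi\b", re.IGNORECASE)  # PSI, Psi, psi -> psi


def normalize(text: str) -> str:
    """Collapse runs of whitespace and standardize the spelling of the psi unit."""
    text = re.sub(r"\s+", " ", text).strip()
    return PSI.sub("psi", text)


def redact(text: str) -> str:
    """Replace machine IDs and operator names with placeholder tags."""
    text = MACHINE_ID.sub("[MACHINE_ID]", text)
    return OPERATOR.sub("Operator: [REDACTED]", text)


raw = "Operator: Jane Doe   reported MX-4471 pressure at 85 PSI."
clean = redact(normalize(raw))
print(clean)
```

Placeholder tags such as `[MACHINE_ID]` keep the sentence structure intact, so the model still sees that an asset was mentioned without seeing which one.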

## 🧪 Context Window Stress Test
Estimate token lengths before selecting model context size.

In [None]:
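from transformers import AutoTokenizer

# Synthetic stand-in for a long shift log; replace with real plant text when profiling.
long_log = '\n'.join([
    f'Shift log entry {i}: torque sensor reading nominal, no faults reported.'
    for i in range(150)
])
tok = AutoTokenizer.from_pretrained('allenai/longformer-base-4096')
length = len(tok(long_log)['input_ids'])  # count includes special tokens
print('Token length:', length)
print('Fits in 512 context:', length <= 512)
print('Fits in 4096 context:', length <= 4096)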
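
When a log exceeds the chosen context size, one common option is to split its token IDs into overlapping windows. The sketch below is a minimal illustration: `chunk_ids`, the window of 512, and the stride of 384 are assumptions chosen for the example, and the integer list stands in for real `input_ids`.

```python
from typing import List


def chunk_ids(ids: List[int], window: int = 512, stride: int = 384) -> List[List[int]]:
    """Split a token-ID sequence into windows of at most `window` IDs.

    Consecutive windows start `stride` IDs apart, so they overlap by
    `window - stride` IDs and text is not cut abruptly at a boundary.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in the range 1..window")
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + window])
        if start + window >= len(ids):
            break  # this window already reaches the end of the sequence
    return chunks


# Stand-in for tokenizer output: 1,200 dummy token IDs.
ids = list(range(1200))
chunks = chunk_ids(ids)
print([len(c) for c in chunks])
```

Leave room for special tokens when picking the window: if the model adds two per sequence, a window of 510 keeps each chunk within a 512-token limit.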

## 📤 Post-processing
After inference, translate logits into actionable recommendations.

In [None]:
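from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import torch.nn.functional as F

# Tokenizer and model must come from the same checkpoint so token IDs match the embedding table.
checkpoint = 'distilbert-base-uncased'
tok = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=3 adds a new, untrained head; fine-tune before relying on the probabilities.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
ticket = 'Compressor 7 vibration exceeded 12 mm/s despite recent bearing change.'
inputs = tok(ticket, return_tensors='pt', truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# Softmax over the label dimension, then drop the batch dimension.
probabilities = F.softmax(logits, dim=-1).squeeze(0).tolist()
labels = ['normal', 'maintenance', 'safety']
classified = dict(zip(labels, [round(p, 3) for p in probabilities]))
classified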
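
To turn those probabilities into something a planner can act on, a thin rule layer plus an audit record is often enough. The following is a minimal sketch: the action texts, the 0.6 confidence threshold, and the `recommend` helper are hypothetical, and the example probabilities are made-up inputs rather than model output.

```python
import json
from datetime import datetime, timezone

# Hypothetical mapping from predicted label to recommended action.
ACTIONS = {
    "normal": "No action required",
    "maintenance": "Create maintenance work order",
    "safety": "Escalate to safety officer",
}


def recommend(classified: dict, ticket: str, threshold: float = 0.6) -> dict:
    """Pick the top label and build an audit record for the decision."""
    label, confidence = max(classified.items(), key=lambda item: item[1])
    # Low-confidence predictions go to a person instead of being auto-actioned.
    action = ACTIONS[label] if confidence >= threshold else "Route to human review"
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "ticket": ticket,
        "label": label,
        "confidence": confidence,
        "action": action,
        "probabilities": classified,
    }


# Made-up probabilities for illustration only.
example = {"normal": 0.12, "maintenance": 0.71, "safety": 0.17}
record = recommend(example, "Compressor 7 vibration exceeded 12 mm/s despite recent bearing change.")
print(json.dumps(record, indent=2))
```

Storing the full probability vector alongside the chosen action lets a later audit see how close the decision was, not just what it was.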