# 🏭 Week 5-6 · Notebook 03 · Model Selection & Preprocessing for Manufacturing

**Module:** LLMs, Prompt Engineering & RAG  
**Project:** Build the Knowledge Core for the Manufacturing Copilot

---

Choosing the right Large Language Model and preparing your data correctly are critical first steps in any NLP project. In a manufacturing environment, this is especially true due to the unique nature of the data: technical jargon, machine IDs, sensor readings, and safety-critical information. This notebook provides a framework for selecting a model and preprocessing your data for tasks like classification and analysis.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
1. ✅ **Create a Model Selection Rubric:** Evaluate LLMs based on criteria relevant to manufacturing (accuracy, latency, cost, privacy).
2. ✅ **Understand Tokenization:** See how different tokenizers handle domain-specific text (e.g., `CNC-12`, `500-PSI`).
3. ✅ **Perform Text Preprocessing:** Prepare text for a model by tokenizing it and converting it to tensors.
4. ✅ **Build a Simple Classifier:** Use a pre-trained model to classify maintenance tickets into categories.

## 📊 A Framework for Model Selection

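Not all models are created equal. A very large model such as GPT-4 is powerful, but it may be too slow or too expensive for your use case, and sending plant data to a hosted API can raise privacy concerns. A smaller, fine-tuned model is often a better fit. Use a rubric to make the trade-offs explicit and the decision repeatable. The scores below are illustrative; replace them with your own measurements where you have them.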

In [None]:
import pandas as pd

# Define a rubric to score candidate models
rubric_data = {
    'Model': ['BERT-base', 'DistilBERT', 'RoBERTa-large', 'On-prem Llama-2 7B'],
    'Accuracy (Score 1-5)': [3, 3, 4, 4], # General pre-trained accuracy
    'Latency (Score 1-5)': [4, 5, 2, 3], # 5 is fastest
    'Cost (Score 1-5)': [5, 5, 4, 2], # 5 is cheapest
    'Privacy (Score 1-5)': [5, 5, 5, 5], # All can be run on-prem
    'Fine-tuning Ease (Score 1-5)': [4, 4, 3, 2] # 5 is easiest
}

rubric_df = pd.DataFrame(rubric_data)

# Define weights for each criterion
weights = {
    'Accuracy (Score 1-5)': 0.30,
    'Latency (Score 1-5)': 0.25,
    'Cost (Score 1-5)': 0.20,
    'Privacy (Score 1-5)': 0.15,
    'Fine-tuning Ease (Score 1-5)': 0.10
}

# Calculate a weighted score
score = sum(rubric_df[col] * weight for col, weight in weights.items())
rubric_df['Weighted_Score'] = score

print("--- Model Selection Rubric ---")
rubric_df.sort_values('Weighted_Score', ascending=False).reset_index(drop=True)

**Conclusion:** With these weights, `DistilBERT` comes out on top thanks to its balance of speed and performance, which makes it a sensible default for a prototype. For the highest accuracy, a larger model such as `RoBERTa-large` or a fine-tuned `Llama-2` may be a better choice, at the cost of higher latency and compute spend.
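
The ranking depends heavily on the weights. As a minimal sketch of a sensitivity check, the snippet below recomputes the weighted scores in plain Python using the same rubric values as the cell above, once with the original weights and once with an alternative weighting. The `accuracy_heavy` weights and the `rank_models` helper are assumptions made up for illustration, not a recommendation.

```python
# Rubric scores copied from the cell above, in the order:
# accuracy, latency, cost, privacy, fine-tuning ease.
scores = {
    "BERT-base": [3, 4, 5, 5, 4],
    "DistilBERT": [3, 5, 5, 5, 4],
    "RoBERTa-large": [4, 2, 4, 5, 3],
    "On-prem Llama-2 7B": [4, 3, 2, 5, 2],
}

baseline = [0.30, 0.25, 0.20, 0.15, 0.10]        # weights used in the rubric above
accuracy_heavy = [0.60, 0.10, 0.10, 0.15, 0.05]  # hypothetical alternative weighting


def rank_models(scores, weights):
    """Return model names sorted by weighted score, best first."""
    totals = {
        name: sum(s * w for s, w in zip(values, weights))
        for name, values in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


baseline_ranking = rank_models(scores, baseline)
accuracy_ranking = rank_models(scores, accuracy_heavy)
print("Baseline weights:      ", baseline_ranking)
print("Accuracy-heavy weights:", accuracy_ranking)
```

With these numbers, `DistilBERT` ranks first under the baseline weights and `RoBERTa-large` ranks first under the accuracy-heavy weights, so it is worth agreeing on the weights before comparing models.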

## ⚙️ Preprocessing: The Tokenizer

Models don't understand words; they understand numbers. A **tokenizer** converts your text into a sequence of numerical IDs that correspond to tokens in the model's vocabulary.

The choice of tokenizer is important, as it affects how domain-specific terms are handled.
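
To make the idea concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers (the family used by BERT and DistilBERT) apply to each word. The vocabulary `toy_vocab` and the function name `wordpiece_split` are made up for illustration; a real vocabulary has tens of thousands of entries, and real tokenizers also handle normalization and punctuation splitting.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # '##' marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"pressure", "cn", "##c", "12", "psi"}

print(wordpiece_split("pressure", toy_vocab))  # in vocabulary: stays whole
print(wordpiece_split("cnc", toy_vocab))       # not in vocabulary: split into pieces
print(wordpiece_split("xyz", toy_vocab))       # no matching pieces: unknown token
```

A term that is missing from the vocabulary is not rejected; it is broken into smaller pieces the model does know, which is why domain terms tend to consume more tokens than everyday words.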

In [None]:
from transformers import AutoTokenizer

# Let's use a standard tokenizer for DistilBERT
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

manufacturing_text = "Incident on CNC-12: pressure dropped to 5 PSI."

# Tokenize the text
tokens = tokenizer.tokenize(manufacturing_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"Original Text: {manufacturing_text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Notice how 'CNC-12' is lowercased (the model is uncased) and split into subword
# pieces such as 'cn', '##c', '-', '12'.
# This is common for specialized terms not in the base vocabulary.
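
One rough way to see which terms a tokenizer fragments is to group the `##` continuation pieces back into words and list the words that needed more than one piece. The sketch below assumes WordPiece-style output; `example_tokens` is hard-coded to mirror the split described in the comment above and is not captured from a real run, and `group_wordpieces` is a hypothetical helper.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens into words; '##' marks a continuation piece."""
    groups = []
    for token in tokens:
        if token.startswith("##") and groups:
            groups[-1].append(token)  # continuation of the previous word
        else:
            groups.append([token])    # start of a new word or symbol
    return groups


# Illustrative token list (an assumption, not real tokenizer output).
example_tokens = [
    "incident", "on", "cn", "##c", "-", "12", ":",
    "pressure", "dropped", "to", "5", "psi", ".",
]

# Words that were broken into more than one piece.
split_words = [g for g in group_wordpieces(example_tokens) if len(g) > 1]
print(split_words)
```

Running the same check over a sample of real tickets, using the actual output of `tokenizer.tokenize`, gives a quick list of domain terms the base vocabulary does not cover.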

## 🛠️ Building a Simple Maintenance Ticket Classifier

Let's build a simple classifier to categorize maintenance tickets. This is a common first step for routing issues to the correct team.

**Task:** Classify a ticket as `Mechanical`, `Electrical`, or `Software`.

In [None]:
from transformers import AutoModelForSequenceClassification
import torch

# Define our labels
labels = ["Mechanical", "Electrical", "Software"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Load a pre-trained model, configured for our 3 labels
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", 
    num_labels=3,
    id2label=id2label,
    label2id=label2id
)

# A new maintenance ticket arrives
new_ticket = "The main conveyor belt is slipping and making a loud grinding noise."

# 1. Preprocess the text with the tokenizer
inputs = tokenizer(new_ticket, return_tensors="pt")

# 2. Feed the inputs to the model
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Interpret the output
predicted_class_id = logits.argmax(dim=-1).item()  # argmax over the label dimension
predicted_label = model.config.id2label[predicted_class_id]

print(f"Ticket: '{new_ticket}'")
print(f"Predicted Category: {predicted_label}")

# Note: The prediction is essentially random. The DistilBERT encoder is pre-trained, but the
# classification head on top of it is newly initialized and has not been fine-tuned on our labels.
# The goal here is to show the end-to-end workflow.
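
Logits are raw scores, not probabilities. As a minimal sketch, the snippet below applies a softmax in plain Python to show how logits map to per-label probabilities. The values in `example_logits` are hypothetical and are not taken from the model above.

```python
import math

labels = ["Mechanical", "Electrical", "Software"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [0.12, -0.05, 0.03]  # hypothetical values, for illustration only
probs = softmax(example_logits)

for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
```

Logits this close together yield probabilities near one third each, which is what to expect from a head that has not been trained. With PyTorch tensors, `torch.softmax(logits, dim=-1)` performs the same conversion.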

## 📏 Handling Long Documents: Context Windows

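Most transformer models have a **maximum context window**: a limit on the number of tokens they can process at once (e.g., 512 for BERT, 4096 for Longformer).

If a document exceeds that limit, it must either be truncated (for example by passing `truncation=True` to the tokenizer), which discards everything past the limit, or be split into smaller pieces. Without truncation, the tokenizer only emits a warning, and the model will typically fail on the over-length input. It's important to check document lengths up front.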

In [None]:
long_sop = """Standard Operating Procedure for the Hydraulic Press #H-75. Section 1: Safety. Always ensure the machine is in a full stop and de-energized state before performing maintenance. Use lockout-tagout procedures as documented in SOP-GEN-001. Section 2: Weekly Maintenance. Check hydraulic fluid levels and top off if below the minimum line. Inspect for any visible leaks around the main cylinder and hoses. Section 3: Annual Maintenance. Replace the primary hydraulic filter (Part #HF-2045). Send a fluid sample for analysis. Check and re-torque all mounting bolts to 250 ft-lbs.""" * 5 # Make it longer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # Max length 512
longformer_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096") # Max length 4096

bert_tokens = bert_tokenizer(long_sop, return_tensors="pt")['input_ids']
longformer_tokens = longformer_tokenizer(long_sop, return_tensors="pt")['input_ids']

print(f"Original document length: {len(long_sop)} characters")
print(f"Tokens (BERT): {bert_tokens.shape[1]}")
print(f"Tokens (Longformer): {longformer_tokens.shape[1]}")

print(f"\nFits in BERT's 512 context window? {'Yes' if bert_tokens.shape[1] <= 512 else 'No'}")
print(f"Fits in Longformer's 4096 context window? {'Yes' if longformer_tokens.shape[1] <= 4096 else 'No'}")

# For documents longer than the context window, you need a strategy like chunking (see RAG notebooks).
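
As a minimal sketch of what such a chunking strategy can look like, the function below splits a list of token IDs into overlapping windows. The name `chunk_token_ids` and the `overlap` value are assumptions for illustration; `max_len=510` assumes a 512-token model with two positions reserved for BERT's `[CLS]` and `[SEP]` special tokens.

```python
def chunk_token_ids(token_ids, max_len=510, overlap=64):
    """Split token IDs into windows of at most max_len, each sharing `overlap` IDs with the previous one."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for real token IDs (e.g. from a tokenizer called with add_special_tokens=False).
example_ids = list(range(1000))
chunks = chunk_token_ids(example_ids, max_len=510, overlap=64)
print([len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the price of processing some tokens twice.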

## ✅ Next Steps

This notebook covered the essential first steps for any applied NLP project: choosing the right model and preparing your data. You've learned:

- How to evaluate models systematically for a specific use case.
- The role of the tokenizer and how it handles domain-specific text.
- The end-to-end workflow for a simple classification task.
- The importance of considering context window limits.

In the next notebook, we will dive into the art of **tokenization** in more detail, including how to train your own custom tokenizer to better understand your manufacturing vocabulary.