# 🏭 Week 5-6, Notebook 3: Model Selection & Data Preprocessing

**Module:** LLMs, Prompt Engineering & RAG  
**Project:** Build the Knowledge Core for the Manufacturing Copilot

---

Choosing the right Large Language Model and preparing your data correctly are two of the most critical decisions you will make in any applied NLP project. This is especially true in a specialized domain like manufacturing, where the language is filled with technical jargon, machine IDs, sensor readings, and safety-critical information.

This notebook provides a structured framework for:
1.  **Model Selection:** How to choose the best model for your specific task by balancing performance, speed, cost, and privacy.
2.  **Data Preprocessing:** The essential steps of turning raw text into a format that a model can understand, a process centered on tokenization.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1.  **Create a Model Selection Rubric:** Systematically evaluate and compare different LLMs based on criteria crucial for manufacturing, such as accuracy, latency, cost, and data privacy.
2.  **Understand Tokenization in Depth:** Analyze how different tokenizers handle domain-specific text (e.g., `CNC-12`, `500-PSI`, `Part#A34-Z2`) and the impact this has on performance.
3.  **Perform Text Preprocessing:** Master the workflow of preparing raw text for a model by tokenizing it and converting it into the numerical tensors the model expects.
4.  **Build a Foundational Classifier:** Implement an end-to-end workflow for a simple text classifier, demonstrating how to use a pre-trained model for a custom task.

## 📊 Part 1: A Framework for Model Selection

Not all models are created equal. A massive, 70-billion-parameter model is incredibly powerful but might be too slow, expensive, or insecure for your specific use case. Conversely, a small, fast model might not be accurate enough for a critical task.

To make a data-driven decision, we can use a **selection rubric**. This is a simple but powerful tool for scoring candidate models against the criteria that matter most to your project.

**Key Criteria for Manufacturing:**

*   **Accuracy:** How well does the model perform the target task (e.g., classification accuracy, summarization quality)?
*   **Latency:** How quickly does the model produce a response? Critical for real-time applications.
*   **Cost:** What is the hardware and operational cost? Includes GPU requirements and API fees.
*   **Privacy/Security:** Can the model be run on-premises (`on-prem`) to protect sensitive company data?
*   **Fine-tuning Ease:** How difficult is it to adapt the model to your specific data and terminology?
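
To base the Latency score on measurements rather than guesses, one option is a small timing harness. The sketch below is illustrative only: `measure_latency` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for a real model call.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=3, runs=20):
    """Time repeated calls to `predict(sample)` and summarize them in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the numbers.
    for _ in range(warmup):
        predict(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000)

    timings_ms.sort()
    # Approximate 95th percentile: pick the value at the 95% position of the sorted list.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }


def dummy_predict(text):
    """Hypothetical stand-in for a model call; replace with your own inference function."""
    return len(text.split())


stats = measure_latency(dummy_predict, "Pressure dropped on CNC-12")
print(stats)
```

Reporting a tail value such as the 95th percentile alongside the median is a common choice for real-time applications, where occasional slow responses matter as much as typical ones.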

In [None]:
# Hands-On: Building a Model Selection Rubric
import pandas as pd

# Define a rubric to score candidate models for a maintenance ticket classification task.
# Scores are on a 1-5 scale, where 5 is best. They are illustrative estimates, not measured results.
rubric_data = {
    'Model': ['BERT-base-uncased', 'DistilBERT-base-uncased', 'RoBERTa-large', 'On-prem Llama-3-8B'],
    'Accuracy (General)': [3, 3, 4, 5],  # General pre-trained accuracy on standard benchmarks
    'Latency (Speed)': [4, 5, 2, 3],     # 5 = fastest inference
    'Cost (Hardware/API)': [5, 5, 4, 2], # 5 = cheapest to run
    'Privacy (On-Prem)': [5, 5, 5, 5],   # All these models can be run on-premises
    'Fine-Tuning Ease': [4, 4, 3, 2]      # 5 = easiest to fine-tune
}

rubric_df = pd.DataFrame(rubric_data)

# Define the weights for each criterion based on project priorities (they should sum to 1.0).
# For a prototype, latency and cost together (0.45) outweigh accuracy (0.30).
weights = {
    'Accuracy (General)': 0.30,
    'Latency (Speed)': 0.25,
    'Cost (Hardware/API)': 0.20,
    'Privacy (On-Prem)': 0.15,
    'Fine-Tuning Ease': 0.10
}

# Calculate a weighted score for each model.
# Columns are selected by name so that each weight is applied to its own criterion,
# regardless of the order of the DataFrame columns or of the weights dictionary.
weight_series = pd.Series(weights)
weighted_scores = rubric_df[weight_series.index].mul(weight_series, axis=1).sum(axis=1)
rubric_df['Weighted_Score'] = weighted_scores

print("--- Model Selection Rubric ---")
# Display the results sorted by the final score
rubric_df.sort_values('Weighted_Score', ascending=False).reset_index(drop=True)

**Conclusion:** Based on this rubric, `DistilBERT-base-uncased` emerges as the top candidate for a balanced prototype, with a weighted score of 4.30 against 4.05 for `BERT-base-uncased`. It offers the best latency and cost while keeping adequate accuracy and fine-tuning potential. For a final production system where accuracy is paramount, `RoBERTa-large` or a fine-tuned `Llama-3-8B` might be a better choice, but they come with significantly higher computational costs.
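
The ranking depends on the weights as much as on the scores. As a rough sensitivity check, the sketch below re-scores the same rubric in plain Python under a second weight set. The `production_weights` values are a hypothetical, accuracy-heavy assumption and not a recommendation; `rank_models` is an illustrative helper name.

```python
# Scores copied from the rubric above (1-5 scale, 5 is best), in the order:
# accuracy, latency, cost, privacy, fine-tuning ease.
scores = {
    "BERT-base-uncased":       [3, 4, 5, 5, 4],
    "DistilBERT-base-uncased": [3, 5, 5, 5, 4],
    "RoBERTa-large":           [4, 2, 4, 5, 3],
    "On-prem Llama-3-8B":      [5, 3, 2, 5, 2],
}

prototype_weights = [0.30, 0.25, 0.20, 0.15, 0.10]   # weights used in the rubric above
production_weights = [0.60, 0.10, 0.10, 0.15, 0.05]  # hypothetical accuracy-heavy weights


def rank_models(scores, weights):
    """Return (model, weighted_score) pairs sorted from best to worst."""
    totals = {
        model: sum(score * weight for score, weight in zip(values, weights))
        for model, values in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


prototype_ranking = rank_models(scores, prototype_weights)
production_ranking = rank_models(scores, production_weights)

print("Prototype: ", [model for model, _ in prototype_ranking])
print("Production:", [model for model, _ in production_ranking])
```

With these assumed production weights the ordering shifts toward the larger models, which is the trade-off the conclusion describes.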

## ⚙️ Part 2: Preprocessing with the Tokenizer

Models don't understand words; they understand numbers. A **tokenizer** is the critical component that bridges this gap. It converts your raw text into a sequence of numerical IDs that correspond to tokens in the model's vocabulary.

The choice of tokenizer is just as important as the choice of model, and the two are tied together: you must use the tokenizer the model was trained with. Using a different tokenizer will result in poor performance because the numerical IDs will not align with what the model expects.
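
A toy illustration of why the IDs must line up: the two vocabularies below are hypothetical and contain the same tokens under different IDs. The `encode` and `decode` helpers are simplified stand-ins and not part of any library.

```python
# Two hypothetical toy vocabularies: same tokens, different IDs.
vocab_a = {"[UNK]": 0, "pressure": 1, "dropped": 2, "alarm": 3}
vocab_b = {"[UNK]": 0, "alarm": 1, "pressure": 2, "dropped": 3}


def encode(text, vocab):
    """Map whitespace-separated words to IDs, falling back to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to tokens using the given vocabulary."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return " ".join(id_to_token[token_id] for token_id in ids)


ids = encode("pressure dropped alarm", vocab_a)
matched = decode(ids, vocab_a)      # same vocabulary on both sides
mismatched = decode(ids, vocab_b)   # IDs interpreted with the wrong vocabulary

print(ids)
print(matched)
print(mismatched)
```

A model paired with the wrong tokenizer is in the position of the `mismatched` decode: every ID is valid, but it points to a different token than the one intended.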

Let's see how a standard tokenizer handles the specialized vocabulary found in a manufacturing environment.

In [None]:
# Hands-On: Exploring Tokenization on Manufacturing Data
from transformers import AutoTokenizer

# Let's load the tokenizer for our chosen model, DistilBERT.
# The "uncased" part means the tokenizer converts all text to lowercase.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# A typical sentence from a maintenance log
manufacturing_text = "Incident on CNC-12: pressure dropped to 5 PSI and triggered alarm A-45."

# The .tokenize() method splits the text into a list of token strings.
tokens = tokenizer.tokenize(manufacturing_text)

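# The .encode() method performs the full pipeline: tokenization and conversion to numerical IDs.
# It also adds the special [CLS] and [SEP] tokens, so the ID list is two items longer than `tokens`.
token_ids = tokenizer.encode(manufacturing_text)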

print(f"Original Text: '{manufacturing_text}'")
print("-" * 30)
print(f"Tokens: {tokens}")
print("-" * 30)
print(f"Token IDs: {token_ids}")

# --- Analysis ---
# Notice how the tokenizer handles the domain-specific terms. Compare with your printed output;
# the exact pieces depend on the model's vocabulary:
# - "cnc-12" is split into several pieces, e.g. ['cn', '##c', '-', '12'].
# - "psi" is a single token: ['psi'].
# - "a-45" is split into ['a', '-', '45'].
# The "##" prefix marks a piece that continues the previous token within the same word.

# Splitting out-of-vocabulary words into known pieces is a key feature of subword tokenization,
# but it can hurt performance on highly specialized vocabularies, because identifiers such as
# machine IDs are scattered across several generic tokens. We will address this
# in the next notebook by training a custom tokenizer.
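
The splitting seen above can be reproduced with a greedy longest-match-first procedure, which is how WordPiece-style tokenizers break a single word into pieces. The sketch below is a simplified illustration: `toy_vocab` is a hypothetical vocabulary, and the real tokenizer also lowercases the text and splits on punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to illustrate the algorithm.
toy_vocab = {"cn", "##c", "psi", "press", "##ure", "a", "12", "45"}

print(wordpiece("cnc", toy_vocab))
print(wordpiece("pressure", toy_vocab))
print(wordpiece("psi", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

A word stays intact only if it appears in the vocabulary as a whole, which is why adding domain terms to the vocabulary changes how they are tokenized.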

## 🛠️ Part 3: Building a Simple Maintenance Ticket Classifier

Now, let's put everything together to build a simple classifier. This is a common and highly valuable task, as it can automatically route maintenance tickets to the correct team (e.g., Mechanical, Electrical, or Software).

**The Goal:** Classify a new maintenance ticket into one of three categories.

**The Workflow:**
1.  **Load a pre-trained model:** We'll use `DistilBERT`, but we need to tell it that we want to use it for a classification task with 3 specific labels.
2.  **Prepare the input:** We take a new ticket (raw text) and use our tokenizer to convert it into the format the model expects (input IDs, attention mask, etc.).
3.  **Feed the input to the model:** We pass the tokenized input through the model to get its raw output, known as **logits**.
4.  **Interpret the output:** We convert the logits into a predicted label.

**Important Note:** The model we are using has been pre-trained on a massive amount of general text, but it has **not** been fine-tuned on our specific maintenance labels. Therefore, its predictions will be essentially random. The purpose of this exercise is to demonstrate the complete, end-to-end workflow, which is the foundation for fine-tuning later.
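
Step 2 mentions an attention mask alongside the input IDs. A minimal sketch of what that mask encodes, using plain Python lists instead of tensors: `pad_batch` is a hypothetical helper, and the token IDs are made-up examples, with `0` assumed as the padding ID.

```python
def pad_batch(sequences, pad_id=0):
    """Pad ID sequences to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two tickets of different lengths, already converted to (made-up) token IDs.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])

print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding makes every sequence in a batch the same length so they fit in one tensor, and the mask tells the model which positions are filler.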

In [None]:
# Hands-On: End-to-End Classification Workflow
from transformers import AutoModelForSequenceClassification
import torch

# --- Step 1: Define Labels and Load Model ---

# Define the categories for our classification task
labels = ["Mechanical", "Electrical", "Software"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Load the DistilBERT model, but configure it for sequence classification with our 3 labels.
# This adds a new, randomly initialized classification "head" on top of the pre-trained model.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

# --- Step 2: Prepare Input Data ---

# A new maintenance ticket arrives
new_ticket = "The main conveyor belt is slipping off its track and making a loud grinding noise near motor M-17."

# Tokenize the text. `return_tensors="pt"` tells the tokenizer to output PyTorch tensors.
inputs = tokenizer(new_ticket, return_tensors="pt", truncation=True, padding=True)

print(f"Ticket: '{new_ticket}'\n")
print(f"Tokenizer Output (Input IDs): {inputs['input_ids']}\n")

# --- Step 3: Get Model Prediction ---

# We use `torch.no_grad()` to disable gradient calculations, as we are only doing inference, not training.
# This makes the process faster and uses less memory.
with torch.no_grad():
    # The model returns an output object. The raw, unnormalized scores are in its `logits` attribute.
    outputs = model(**inputs)
    logits = outputs.logits

print(f"Model Logits: {logits}\n")

# --- Step 4: Interpret the Results ---

# The logits are the raw scores for each class. To get a probability, you can apply a softmax function.
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities: {probabilities.numpy()}\n")

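# To get the final predicted label, we find the class with the highest logit/probability.
# `dim=-1` takes the argmax over the class dimension; `.item()` works because the batch holds one ticket.
predicted_class_id = logits.argmax(dim=-1).item()
predicted_label = model.config.id2label[predicted_class_id]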

print(f"--> Predicted Category: {predicted_label}")
print("\nNote: This prediction is from an untrained classification head. It only becomes meaningful after fine-tuning.")
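
To make Step 4 concrete without loading a model, the softmax and argmax can be worked through by hand. The logit values below are hypothetical and chosen only for illustration; `softmax` is a small helper written for this sketch.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


labels = ["Mechanical", "Electrical", "Software"]
example_logits = [0.12, -0.05, 0.03]  # hypothetical logits for one ticket

probs = softmax(example_logits)
predicted_index = max(range(len(probs)), key=probs.__getitem__)
predicted_label = labels[predicted_index]

print([round(p, 3) for p in probs])
print(predicted_label)
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits or of the probabilities selects the same class. Logits this close together give nearly uniform probabilities, which is what an untrained head tends to produce.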

## 📏 Part 4: Handling Long Documents and Context Windows

A critical limitation of most Transformer models is their **maximum context window**. This is the fixed number of tokens the model can process at one time. For example:
*   `BERT-base`: 512 tokens
*   `Llama-3-8B`: 8,192 tokens
*   `GPT-4-Turbo`: 128,000 tokens

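If you provide a document that is longer than the model's context window, the model cannot process all of it. Hugging Face tokenizers do not truncate by default: they return the full sequence with a length warning, and a model such as BERT will then typically raise an error. Passing `truncation=True` instead **truncates** the input, cutting off the end of the document, which can silently discard crucial information.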

It is essential to be aware of your model's context window and have a strategy for handling documents that exceed it.

In [None]:
# Hands-On: Checking Document Length Against Context Windows
# A long Standard Operating Procedure (SOP) document
long_sop = """
Standard Operating Procedure for the Hydraulic Press #H-75.
Section 1: Safety Protocols. Before any maintenance, ensure the machine is in a full stop and de-energized state. All personnel must follow the lockout-tagout procedures as documented in SOP-GEN-001. Personal Protective Equipment (PPE), including safety glasses and steel-toed boots, is mandatory within a 10-foot radius of the machine.
Section 2: Weekly Maintenance Checklist. 1. Check hydraulic fluid levels and top off if below the minimum fill line. Use only approved fluid type H-45. 2. Inspect for any visible leaks around the main cylinder, hoses, and fittings. Report any leaks immediately to the shift supervisor. 3. Clean debris from the base of the press.
Section 3: Annual Maintenance and Calibration. 1. Replace the primary hydraulic filter (Part #HF-2045-B). 2. Send a sample of the hydraulic fluid to an external lab for analysis of particulate matter and viscosity. 3. Check and re-torque all main frame mounting bolts to the specified 250 ft-lbs. 4. Calibrate the pressure sensor using the Fluke 700G Precision Pressure Gauge Calibrator.
""" * 3 # Repeat the text to make it longer

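# Load tokenizers for models with different context window sizes
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Max length: 512
# Note: the Llama 3 repository is gated on the Hugging Face Hub. You need to accept the license
# and authenticate (e.g. `huggingface-cli login`) before the next line will work.
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") # Max length: 8192

# Tokenize the long document with each tokenizer.
# No truncation is applied, so every token is counted. The BERT tokenizer may print a
# length warning if the document exceeds 512 tokens, which is expected here.
bert_tokens = bert_tokenizer(long_sop, return_tensors="pt")['input_ids']
llama_tokens = llama_tokenizer(long_sop, return_tensors="pt")['input_ids']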

print(f"Original document length: {len(long_sop.split())} words")
print("-" * 40)
print(f"Tokens (BERT tokenizer): {bert_tokens.shape[1]}")
print(f"Tokens (Llama-3 tokenizer): {llama_tokens.shape[1]}")
print("-" * 40)

# Check if the document fits within each model's context window
print(f"Fits in BERT's 512 context window? {'Yes' if bert_tokens.shape[1] <= 512 else 'No'}")
print(f"Fits in Llama-3's 8192 context window? {'Yes' if llama_tokens.shape[1] <= 8192 else 'No'}")

# For documents that are too long, you must implement a chunking strategy.
# This is a core concept in Retrieval-Augmented Generation (RAG), which we will cover in detail later.
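
As a preview, one simple chunking strategy is a fixed-size sliding window with overlap, so that text cut at a boundary still appears whole in one of the chunks. This is a minimal sketch under stated assumptions: `chunk_token_ids` is a hypothetical helper, the window and overlap sizes are arbitrary, and the input is a stand-in list of integers in place of real token IDs.

```python
def chunk_token_ids(token_ids, max_tokens=512, overlap=64):
    """Split a list of token IDs into overlapping windows of at most `max_tokens`."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    stride = max_tokens - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_tokens])
        # Stop once a window has reached the end of the document.
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# Stand-in for a tokenized document of 1,200 tokens.
fake_ids = list(range(1200))
chunks = chunk_token_ids(fake_ids, max_tokens=512, overlap=64)

print([len(chunk) for chunk in chunks])
```

In practice the window should be a little smaller than the model's limit, to leave room for special tokens such as `[CLS]` and `[SEP]` that are added to each chunk.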

## ✅ Summary and Next Steps

This notebook covered the essential foundational steps for any applied NLP project: choosing the right model and preparing your data. You have learned:

-   **How to Evaluate Models Systematically:** You used a weighted rubric to compare different models based on criteria like accuracy, speed, and cost, allowing you to make a data-driven choice for your specific use case.
-   **The Role of the Tokenizer:** You saw how a tokenizer converts text into numerical IDs and how it handles domain-specific vocabulary by splitting out-of-vocabulary terms into subword tokens.
-   **The End-to-End Classification Workflow:** You implemented the full process of taking raw text, tokenizing it, feeding it to a model, and interpreting the output for a classification task.
-   **The Importance of Context Windows:** You learned about the token limits of different models and why it's crucial to handle long documents to avoid information loss through truncation.

In the next notebook, we will dive even deeper into the art of **tokenization**. You will learn about different tokenization algorithms and how to **train your own custom tokenizer** to better understand the unique vocabulary of your manufacturing environment, which is a key step toward improving model performance.