# 🔡 Week 09-10 · Notebook 02 · Bigram Language Model with Maintenance Logs

Build a from-scratch bigram language model to understand token statistics inside manufacturing maintenance notes.

## 🎯 Learning Objectives
- Clean and normalize maintenance text for n-gram modeling.
- Implement a count-based bigram language model in Python and compute perplexity.
- Analyze jargon-heavy vs. plain-language performance.
- Document failure modes relevant to plant-floor deployment.

## 🧩 Scenario
Technicians log work orders on a shared tablet. Autocomplete powered by a small language model can reduce typing time, but it must understand torque specs, part numbers, and bilingual notes.

In [None]:
import re
import math
import torch
from collections import defaultdict, Counter
import pandas as pd

torch.manual_seed(42)

## 🛠️ Maintenance Log Samples
Synthetic logs mimic common maintenance narratives. Replace with your CMMS export to train on real data.

In [None]:
logs_data = [
    {"log_id": "L001", "text": "Replaced the main conveyor belt motor (Part #MTR-789). Torqued bolts to 150 Nm. System back online."},
    {"log_id": "L002", "text": "Calibrated the CNC machine's laser sensor. Accuracy is now within 0.05mm tolerance. Se requiere una revisión en 24 horas."},
    {"log_id": "L003", "text": "Hydraulic fluid leak detected on Press-03. Replaced seal (Part #SL-456) and refilled fluid. Pressure at 2000 psi."},
    {"log_id": "L004", "text": "Routine check on the HVAC system. Cleaned filters and checked coolant levels. All nominal."},
    {"log_id": "L005", "text": "Emergency stop button on Assembly Line 2 was faulty. Replaced the entire switch assembly. Pruebas completadas."},
    {"log_id": "L006", "text": "Welding robot WR-08 reported joint misalignment. Ran diagnostic and recalibrated arm. Positional error is now less than 0.1mm."},
    {"log_id": "L007", "text": "Power supply for the packaging machine failed. Swapped with a new PSU (Part #PSU-123). El sistema está funcionando."},
    {"log_id": "L008", "text": "Updated firmware on all PLCs to version 3.4.1. System rebooted without issues."},
    {"log_id": "L009", "text": "Investigated high-temperature alarm on Furnace-01. Found a faulty thermocouple. Replaced and tested. Temp stable at 900°C."},
    {"log_id": "L010", "text": "La carretilla elevadora F-05 necesita una recarga de batería. Maintenance scheduled."}
]
logs = pd.DataFrame(logs_data)
logs

### 🔄 Text Normalization
Normalize units, tokenize multilingual text, and preserve domain-specific numbers.

In [None]:
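def normalize_and_tokenize(text: str) -> list[str]:
    """
    Cleans and tokenizes maintenance log text.
    - Converts to lowercase
    - Separates numbers from units (e.g. "0.05mm" -> "0.05", "mm")
    - Rewrites "(Part #MTR-789)" as the single token "part_mtr-789"
    - Removes punctuation, keeping decimal points, hyphens, underscores, and "°"
    - Splits on whitespace
    """
    text = text.lower()
    # Split a number (with optional decimal part) from the unit that follows it
    text = re.sub(r'(\d+(?:\.\d+)?)\s*(nm|mm|psi|kpa|°c)\b', r' \1 \2 ', text)
    # Collapse a parenthesized part number into one token
    text = re.sub(r'\(part\s*#([\w-]+)\)', r' part_\1 ', text)
    # Drop periods that are not decimal points (i.e. not between two digits),
    # so "0.05" and "3.4.1" survive but sentence-final periods do not
    text = re.sub(r'(?<!\d)\.|\.(?!\d)', ' ', text)
    # Remove remaining punctuation except '.', '-', '°', and '_' (matched by \w)
    text = re.sub(r'[^\w\s.°-]', '', text)
    return text.split()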

logs['tokens'] = logs['text'].apply(normalize_and_tokenize)
logs[['text', 'tokens']].head()
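
Real logs often spell the same unit several ways, which fragments the bigram counts. The sketch below shows one way to map variants to a single form before tokenizing. The `UNIT_ALIASES` table and the `canonicalize_units` helper are hypothetical names introduced for illustration, and the spellings listed are assumptions about what technicians might type, not values taken from the sample logs.

```python
import re

# Hypothetical alias table: the spellings on the left are assumptions;
# extend or replace them based on what appears in your own logs.
UNIT_ALIASES = {
    "n·m": "nm",
    "n-m": "nm",
    "newton meters": "nm",
    "degrees c": "°c",
    "deg c": "°c",
    "lbs/in2": "psi",
}

def canonicalize_units(text: str) -> str:
    """Lowercase the text and rewrite known unit spellings to one canonical form."""
    text = text.lower()
    # Try longer aliases first so a long spelling is not partially rewritten
    for alias in sorted(UNIT_ALIASES, key=len, reverse=True):
        # An alias may directly follow a digit ("150n-m") but not another letter,
        # and must not run on into a longer word.
        pattern = r'(?<![a-z])' + re.escape(alias) + r'(?!\w)'
        text = re.sub(pattern, UNIT_ALIASES[alias], text)
    return text

print(canonicalize_units("Torqued bolts to 150 N·m at 40 deg C"))
# torqued bolts to 150 nm at 40 °c
```

Running a step like this before `normalize_and_tokenize` would let the unit-splitting regex see one spelling per unit; whether that is worth the maintenance cost of the alias table depends on how varied the real logs are.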

## 🧮 Bigram Model Implementation
We collect transition counts and transform them into log probabilities.
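
For reference, the standard add-$k$ estimate can be written as follows. The notation is introduced here for illustration: $c(u, v)$ is the number of times token $v$ follows token $u$, $V$ is the vocabulary size, and $k > 0$ is the smoothing constant.

$$
\log P(v \mid u) = \log \frac{c(u, v) + k}{\sum_{w} c(u, w) + kV}
$$

With $k = 1$ this is add-one (Laplace) smoothing. A token $v$ never seen after $u$ has $c(u, v) = 0$ and receives $k / \left(\sum_{w} c(u, w) + kV\right)$, so the fallback probability depends on how often $u$ itself occurred. Only a context $u$ with no counts at all falls back to the uniform $1 / V$.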

In [None]:
START_TOKEN = '<s>'
STOP_TOKEN = '</s>'

def build_bigram_counts(token_sequences):
    """Builds a vocabulary and counts of token pairs (bigrams)."""
    counts = defaultdict(Counter)
    vocab = set()
    for seq in token_sequences:
        # Add start and stop tokens to each sequence
        full_seq = [START_TOKEN] + seq + [STOP_TOKEN]
        vocab.update(full_seq)
        for prev_token, next_token in zip(full_seq, full_seq[1:]):
            counts[prev_token][next_token] += 1
    return counts, vocab

counts, vocab = build_bigram_counts(logs['tokens'])
print(f"Vocabulary size: {len(vocab)}")

# Example: See what follows the token 'replaced'
print("\nTokens that follow 'replaced':")
print(counts['replaced'])

In [None]:
def bigram_log_probs(counts, vocab, smoothing=1.0):
    """
    Converts bigram counts to log probabilities with add-k smoothing
    (add-one by default). Requires smoothing > 0.
    Using log probabilities helps prevent underflow with long sequences.
    """
    vocab_size = len(vocab)
    # A prev_token never seen in training has no counts, so smoothing alone
    # gives every next_token the uniform probability 1 / vocab_size.
    uniform_log_prob = -math.log(vocab_size)

    # Outer defaultdict: unseen prev_tokens fall back to the uniform distribution.
    probs = defaultdict(lambda: defaultdict(lambda: uniform_log_prob))

    for prev_token, next_counts in counts.items():
        total_count = sum(next_counts.values())
        denominator = total_count + smoothing * vocab_size

        # An unseen next_token after a *seen* prev_token has count 0, so its
        # probability is smoothing / denominator, which depends on prev_token.
        unseen_log_prob = math.log(smoothing / denominator)
        row = defaultdict(lambda lp=unseen_log_prob: lp)

        # Only observed next_tokens need explicit entries; the row default
        # covers every other token.
        for next_token, count in next_counts.items():
            row[next_token] = math.log((count + smoothing) / denominator)
        probs[prev_token] = row

    return probs

log_probs = bigram_log_probs(counts, vocab)

# Display some example log probabilities
print("Log probabilities of tokens following the start token:")
sorted_start_probs = sorted(log_probs[START_TOKEN].items(), key=lambda item: item[1], reverse=True)
for token, log_prob in sorted_start_probs[:5]:
    print(f"  '{token}': {log_prob:.4f}")
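
A quick sanity check for any smoothing scheme is that the probabilities for one context sum to 1 over the whole vocabulary. The sketch below checks this for add-k smoothing on a made-up context; `smoothed_row`, `toy_vocab`, and `toy_counts` are hypothetical names and values used only for this illustration.

```python
import math
from collections import Counter

def smoothed_row(next_counts: Counter, vocab: set, k: float = 1.0) -> dict:
    """Return add-k smoothed probabilities over the whole vocabulary for one context."""
    denominator = sum(next_counts.values()) + k * len(vocab)
    # Counter returns 0 for tokens that never followed this context
    return {token: (next_counts[token] + k) / denominator for token in vocab}

# Toy context: tokens that followed "replaced" in a made-up corpus
toy_vocab = {"the", "seal", "and", "motor", "</s>"}
toy_counts = Counter({"the": 2, "seal": 1, "and": 1})

row = smoothed_row(toy_counts, toy_vocab)
print(row["the"])    # (2 + 1) / (4 + 5)
print(row["motor"])  # (0 + 1) / (4 + 5), an unseen bigram
print(math.isclose(sum(row.values()), 1.0))  # True
```

If the sum comes out above 1, the usual cause is a fallback value for unseen bigrams that ignores the context's total count.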

### 📉 Perplexity Calculation
Evaluate how well the model predicts held-out sequences.

In [None]:
def sentence_log_prob(tokens, log_probs):
    """Calculates the total log probability of a sequence."""
    tokens = [START_TOKEN] + tokens + [STOP_TOKEN]
    logp = 0.0
    for prev_token, next_token in zip(tokens, tokens[1:]):
        # Index with [] rather than .get(): .get() bypasses the defaultdict
        # fallback, so unseen bigrams would not receive their smoothed value.
        logp += log_probs[prev_token][next_token]
    return logp

def perplexity(dataset, log_probs):
    """
    Calculates perplexity, a measure of how well a model predicts a sample.
    Lower is better.
    """
    total_logp = 0.0
    total_tokens = 0
    for tokens in dataset:
        # One prediction per transition: every token plus the stop token.
        # The start token is context only and is never predicted.
        total_tokens += len(tokens) + 1
        total_logp += sentence_log_prob(tokens, log_probs)

    # Perplexity is e^(-L), where L is the average log probability per token
    avg_logp = total_logp / max(total_tokens, 1)
    return math.exp(-avg_logp)

# Split data for a simple train/test evaluation
train_data = logs['tokens'][:8]
test_data = logs['tokens'][8:]

# Build model on training data
train_counts, train_vocab = build_bigram_counts(train_data)
train_log_probs = bigram_log_probs(train_counts, train_vocab)

# Evaluate on test data
pp = perplexity(test_data, train_log_probs)
print(f"Perplexity on test data: {pp:.2f}")

# Example: log probability of a phrase that also appears in training log L008
test_sentence = ['updated', 'firmware', 'on', 'all', 'plcs']
prob = sentence_log_prob(test_sentence, train_log_probs)
print(f"Log probability of a test sentence: {prob:.2f}")
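
To build intuition for the numbers this cell prints, it helps to check perplexity on cases where the answer is known in advance. The sketch below is a standalone illustration; `perplexity_from_log_probs` is a hypothetical helper, not part of the notebook's pipeline. A model that spreads probability evenly over N tokens should have a perplexity of N.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp(-mean log probability) over the predicted tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability evenly over 50 tokens
vocab_size = 50
uniform = [math.log(1 / vocab_size)] * 12    # 12 predictions, each with p = 1/50
print(perplexity_from_log_probs(uniform))    # approximately 50

# A model that assigns p = 0.5 to every correct token is far less "surprised"
confident = [math.log(0.5)] * 12
print(perplexity_from_log_probs(confident))  # approximately 2
```

Read this way, a test perplexity close to the vocabulary size suggests the model is doing little better than guessing uniformly on the held-out logs.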

## 🧪 Experiment: Extend to Trigrams
Use this cell as a template for students to implement trigram logic and compare perplexities.

In [None]:
# TODO (Lab): implement trigram counts, probabilities, and perplexity
# Document perplexity reduction and note failure cases (e.g., rare part numbers).
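
As a possible starting point for the counting step only, the sketch below extends the bigram idea to two-token contexts. It assumes the same `<s>` and `</s>` markers as the bigram cell and pads with two start tokens; `build_trigram_counts` is a hypothetical name, and the probabilities, smoothing, and perplexity are left for the lab.

```python
from collections import Counter, defaultdict

START_TOKEN = '<s>'
STOP_TOKEN = '</s>'

def build_trigram_counts(token_sequences):
    """Count how often each token follows each (prev2, prev1) context."""
    counts = defaultdict(Counter)
    for seq in token_sequences:
        # Two start tokens so the first real token also has a two-token context
        full_seq = [START_TOKEN, START_TOKEN] + list(seq) + [STOP_TOKEN]
        for a, b, c in zip(full_seq, full_seq[1:], full_seq[2:]):
            counts[(a, b)][c] += 1
    return counts

# Made-up sequences, already tokenized
toy = [["replaced", "the", "seal"], ["replaced", "the", "motor"]]
tri = build_trigram_counts(toy)
print(tri[("replaced", "the")])  # one count each for 'seal' and 'motor'
```

Because contexts are now pairs, far more of them are seen only once, which is why rare part numbers tend to dominate the trigram failure cases.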

## 🩺 Error Analysis Framework
- Inspect top 20 predicted next tokens for bilingual sentences.
- Flag mispredictions on torque specs, units, and compliance phrases.
- Track per-language perplexity to justify multilingual adapters.
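
The first bullet can be approached with a small lookup over the bigram counts. The sketch below is an illustration under assumptions: `top_k_next` is a hypothetical helper, the toy counts are made up, and it reports unsmoothed relative frequencies rather than the smoothed log probabilities used elsewhere in the notebook.

```python
from collections import Counter, defaultdict

def top_k_next(counts, prev_token, k=20):
    """Return up to k (token, relative frequency) pairs that followed prev_token."""
    # .get() avoids inserting an empty entry into a defaultdict for unseen contexts
    next_counts = counts.get(prev_token, Counter())
    total = sum(next_counts.values())
    return [(tok, n / total) for tok, n in next_counts.most_common(k)]

# Made-up bigram observations mixing Spanish and English contexts
toy_counts = defaultdict(Counter)
for prev_tok, next_tok in [("se", "requiere"), ("se", "requiere"),
                           ("se", "necesita"), ("torqued", "bolts")]:
    toy_counts[prev_tok][next_tok] += 1

print(top_k_next(toy_counts, "se", k=2))  # 'requiere' first, then 'necesita'
print(top_k_next(toy_counts, "unseen"))   # [] because the context has no counts
```

An empty or very short list for a bilingual context is itself a finding worth recording, since it shows where autocomplete would have nothing useful to suggest.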

## 🧪 Lab Assignment
1. Swap in three months of real maintenance logs and compute bigram perplexity.
2. Implement trigram smoothing (Kneser–Ney or Good–Turing) and compare results.
3. Document failure modes tied to acronyms, bilingual phrases, and numerical tolerances.
4. Present a memo recommending whether to proceed with larger-scale transformer pre-training.

## ✅ Checklist
- [ ] Logs normalized with consistent unit handling
- [ ] Bigram model implemented with smoothing
- [ ] Perplexity baseline captured per language
- [ ] Failure report shared with maintenance SMEs

## 📚 References
- Jurafsky & Martin, *Speech and Language Processing* (Chapter on N-grams)
- TorchText tutorials on language modeling
- *Multilingual Maintenance Communications* (Society of Manufacturing Engineers, 2024)