# 🔡 Week 09-10 · Notebook 02 · Bigram Language Model with Maintenance Logs

Build a from-scratch bigram language model to understand token statistics inside manufacturing maintenance notes.

## 🎯 Learning Objectives
- Clean and normalize maintenance text for n-gram modeling.
- Implement a bigram language model in PyTorch and compute perplexity.
- Analyze jargon-heavy vs. plain-language performance.
- Document failure modes relevant to plant-floor deployment.

## 🧩 Scenario
Technicians log work orders on a shared tablet. Autocomplete powered by a small language model can reduce typing time, but it must understand torque specs, part numbers, and bilingual notes.

In [None]:
import re
import math
import torch
from collections import defaultdict, Counter
import pandas as pd

torch.manual_seed(42)

## 🛠️ Maintenance Log Samples
Synthetic logs mimic common maintenance narratives. Replace with your CMMS export to train on real data.
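
When you swap in a CMMS export, the only hard requirement downstream is a `text` column. The following is a minimal sketch, assuming the export is a CSV whose free-text field is named `description`. That column name, the `load_cmms_export` helper, and the sample rows are hypothetical placeholders, not part of the original notebook.

```python
import io
import pandas as pd

def load_cmms_export(source, text_column="description"):
    """Load a CMMS CSV export and expose its free-text field as 'text'.

    `text_column` is an assumed name; adjust it to match your export.
    """
    df = pd.read_csv(source)
    if text_column not in df.columns:
        raise KeyError(f"expected a '{text_column}' column, found {list(df.columns)}")
    df = df.rename(columns={text_column: "text"})
    # Drop rows with no narrative so the tokenizer never sees NaN.
    df = df.dropna(subset=["text"]).reset_index(drop=True)
    return df

# Invented rows standing in for a real export file.
fake_export = io.StringIO(
    "work_order,description\n"
    "1001,Replaced gasket on pump 3\n"
    "1002,\n"
    "1003,Torque checked at 45 Nm\n"
)
demo_logs = load_cmms_export(fake_export)
```

Work orders with an empty description are dropped here; whether that is appropriate depends on how your site uses blank entries.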

In [None]:
logs = pd.DataFrame([

### 🔄 Text Normalization
Normalize units, tokenize multilingual text, and preserve domain-specific numbers.

In [None]:
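def normalize(text: str) -> list[str]:
    text = text.lower()
    # Split units off numbers ("45nm" -> "45 nm") without breaking words like "alignment".
    text = re.sub(r'(\d)\s*(psi|nm|kpa)\b', r'\1 \2 ', text)
    # The original character class was lost in export; this one is an assumption that
    # keeps Unicode letters, digits, and the . / - found in decimals and part numbers.
    text = re.sub(r'[^\w\s./-]', ' ', text)
    # Trim stray punctuation such as a sentence-final period.
    tokens = [t.strip('./-') for t in text.split()]
    return [t for t in tokens if t]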

logs['tokens'] = logs['text'].apply(normalize)
logs[['text', 'tokens']]

## 🧮 Bigram Model Implementation
We collect transition counts and transform them into log probabilities.
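
For reference, the counts-to-probabilities step can be written out. This assumes additive (add-$\alpha$) smoothing, which is what the `smoothing` argument of `bigram_probs` suggests. Here $c(\cdot)$ is a count over the padded training sequences and $V$ is the vocabulary; both symbols are notation introduced for this note.

$$
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + \alpha}{c(w_{i-1}) + \alpha\,|V|}
$$

With $\alpha = 1$ this is add-one (Laplace) smoothing, so every bigram receives a nonzero probability and its logarithm is always defined.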

In [None]:
START_TOKEN = '<s>'
STOP_TOKEN = '</s>'

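def build_bigram_counts(token_sequences):
    counts = defaultdict(Counter)
    vocab = set()
    for seq in token_sequences:
        # Pad each sequence so sentence starts and ends are modeled too.
        padded = [START_TOKEN] + list(seq) + [STOP_TOKEN]
        vocab.update(padded)
        for prev_token, next_token in zip(padded, padded[1:]):
            counts[prev_token][next_token] += 1
    return counts, vocab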

counts, vocab = build_bigram_counts(logs['tokens'])
len(vocab)

In [None]:
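def bigram_probs(counts, smoothing=1.0):
    probs = {}
    # Include next tokens too, so tokens that never open a bigram (e.g. </s>) are covered.
    vocab = set(counts.keys())
    for next_counts in counts.values():
        vocab.update(next_counts)
    for prev_token, next_counts in counts.items():
        total = sum(next_counts.values()) + smoothing * len(vocab)
        probs[prev_token] = {tok: math.log((next_counts[tok] + smoothing) / total) for tok in vocab}
    return probs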

probs = bigram_probs(counts)
list(probs[START_TOKEN].items())[:5]

### 📉 Perplexity Calculation
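Evaluate how well the model predicts a set of sequences. The cell below scores the training logs themselves, which gives an optimistic baseline; score held-out logs for a fair estimate.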
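
Written out, and assuming the per-token normalization used by `perplexity`, the score over $N$ predicted tokens is:

$$
\mathrm{PP} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)
$$

Lower is better. As a reference point, a model that spreads probability uniformly over a vocabulary of size $|V|$ scores exactly $|V|$.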

In [None]:
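def sentence_log_prob(tokens, probs):
    tokens = [START_TOKEN] + list(tokens) + [STOP_TOKEN]
    logp = 0.0
    for prev_token, next_token in zip(tokens, tokens[1:]):
        # probs stores log probabilities; tokens never seen in training raise KeyError.
        logp += probs[prev_token][next_token]
    return logp

def perplexity(dataset, probs):
    logp_sum, token_count = 0.0, 0
    for tokens in dataset:
        logp_sum += sentence_log_prob(tokens, probs)
        token_count += len(tokens) + 1  # every predicted token, including </s>
    return math.exp(-logp_sum / max(token_count, 1))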

pp = perplexity(logs['tokens'], probs)
pp
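
Scoring the same logs the counts came from flatters the model. Below is a minimal sketch of a held-out split with pandas, using an invented five-row frame. The 80/20 ratio and the `train` / `held_out` names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Invented placeholder rows; substitute the real logs frame.
toy = pd.DataFrame({"text": [f"note {i}" for i in range(5)]})

# Sample 80% of rows for counting; the remainder is held out for perplexity.
train = toy.sample(frac=0.8, random_state=42)
held_out = toy.drop(train.index)
```

Counts would then be built from the training rows only. Held-out text will usually contain tokens the model has never seen, so decide up front how to score them, for example by mapping rare tokens to an unknown-word symbol.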

## 🧪 Experiment: Extend to Trigrams
Use this cell as a template for students to implement trigram logic and compare perplexities.
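
One possible starting point for the counting step, offered as a sketch rather than a reference solution: pad with two start tokens so the first real word has a full two-token context. The function name `build_trigram_counts` and the toy sequences are hypothetical.

```python
from collections import defaultdict, Counter

START_TOKEN = "<s>"
STOP_TOKEN = "</s>"

def build_trigram_counts(token_sequences):
    """Count next-token occurrences keyed by the two preceding tokens."""
    counts = defaultdict(Counter)
    for seq in token_sequences:
        # Two start tokens give the first word a complete (a, b) context.
        padded = [START_TOKEN, START_TOKEN] + list(seq) + [STOP_TOKEN]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    return counts

toy_sequences = [["replace", "seal"], ["replace", "seal", "kit"]]
trigram_counts = build_trigram_counts(toy_sequences)
```

Probabilities, smoothing, and perplexity are left for the lab.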

In [None]:
# TODO (Lab): implement trigram counts, probabilities, and perplexity
# Document perplexity reduction and note failure cases (e.g., rare part numbers).

## 🩺 Error Analysis Framework
- Inspect the top 20 predicted next tokens for bilingual sentences.
- Flag mispredictions on torque specs, units, and compliance phrases.
- Track per-language perplexity to justify multilingual adapters.
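
The first bullet can be sketched as a small helper. It assumes `probs` maps each previous token to a dict of next-token log probabilities, as this notebook describes. The name `top_k_next` and the toy values are hypothetical.

```python
import math

def top_k_next(probs, prev_token, k=20):
    """Return the k most likely next tokens after prev_token as (token, probability)."""
    next_logps = probs.get(prev_token, {})
    # Sort by log probability, highest first.
    ranked = sorted(next_logps.items(), key=lambda item: item[1], reverse=True)
    return [(tok, math.exp(logp)) for tok, logp in ranked[:k]]

# Invented distribution for a single context.
toy_probs = {
    "torque": {"45": math.log(0.5), "spec": math.log(0.3), "llave": math.log(0.2)},
}
suggestions = top_k_next(toy_probs, "torque", k=2)
```

An unseen context returns an empty list, which is itself worth logging during error analysis.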

## 🧪 Lab Assignment
1. Swap in three months of real maintenance logs and compute bigram perplexity.
2. Implement trigram smoothing (Kneser–Ney or Good–Turing) and compare results.
3. Document failure modes tied to acronyms, bilingual phrases, and numerical tolerances.
4. Present a memo recommending whether to proceed with larger-scale transformer pre-training.

## ✅ Checklist
- [ ] Logs normalized with consistent unit handling
- [ ] Bigram model implemented with smoothing
- [ ] Perplexity baseline captured per language
- [ ] Failure report shared with maintenance SMEs

## 📚 References
- Jurafsky & Martin, *Speech and Language Processing* (Chapter on N-grams)
- TorchText tutorials on language modeling
- *Multilingual Maintenance Communications* (Society of Manufacturing Engineers, 2024)