# 🏭 Week 09-10 · Notebook 01 · Pre-Training Concepts for Manufacturing Corpora

Understand how to audit, curate, and prepare manufacturing text data before launching large-scale language model pre-training.

## 🎯 Learning Objectives
- Diagnose whether a manufacturing corpus is ready for masked or causal language modeling.
- Engineer domain-specific curricula that balance routine operations with edge-case incidents.
- Quantify coverage, freshness, and risk hotspots across maintenance, quality, and safety documents.
- Produce a governance-ready data audit that satisfies IT/Compliance stakeholders.

## 🧩 Scenario
You have five years of shift logs, non-conformance reports (NCRs), maintenance tickets, and safety bulletins collected from four automotive plants. Leadership wants a maintenance co-pilot pre-trained on this corpus. Your job is to surface data gaps, compliance hazards, and curriculum strategy before anyone spins up GPUs.

In [None]:
# Core libraries for profiling
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-darkgrid')

## 🗂️ Sample Manufacturing Corpus
The notebook builds a synthetic corpus inventory in memory that mimics the metadata of mixed-format manufacturing documents. Replace it with your own plant exports (CSV/JSON) when running in production.

In [None]:
# Create a synthetic dataset with multiple document classes
np.random.seed(42)
num_docs = 1000
doc_types = ['shift_log', 'maintenance_ticket', 'ncr', 'safety_bulletin']
plants = ['Plant_A', 'Plant_B', 'Plant_C', 'Plant_D']
pii_flags = ['none', 'name', 'id', 'contact']

data = {
    'doc_id': [f'DOC-{i:04d}' for i in range(num_docs)],
    'doc_type': np.random.choice(doc_types, num_docs, p=[0.5, 0.3, 0.15, 0.05]),
    'plant': np.random.choice(plants, num_docs),
    'last_updated': pd.to_datetime('2025-10-13') - pd.to_timedelta(np.random.randint(1, 1000, size=num_docs), unit='d'),
    'pii_flags': np.random.choice(pii_flags, num_docs, p=[0.8, 0.1, 0.05, 0.05]),
    'safety_sensitive': np.random.choice([True, False], num_docs, p=[0.1, 0.9])
}
documents = pd.DataFrame(data)

# Mark every safety bulletin as safety sensitive
documents.loc[documents['doc_type'] == 'safety_bulletin', 'safety_sensitive'] = True

documents.head()

In [None]:
# Coverage summary by document type
coverage = documents.groupby('doc_type').agg(
    count=('doc_id', 'size'),
    percentage=('doc_id', lambda x: 100 * x.count() / len(documents))
).sort_values('count', ascending=False)

print("Corpus Coverage Analysis:")
print(coverage)

### 🔎 Interpretation Guidance
- **Coverage**: Ensure high-risk document classes (e.g., safety bulletins) have sufficient volume.
- **Curriculum Candidate**: Stage training from routine shift logs → maintenance tickets → high-severity NCRs.
- **Action**: Flag doc types with <10% representation for synthetic augmentation or targeted collection.
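
The under-representation rule above can be sketched as a small helper. This is a minimal illustration rather than part of the original notebook: `flag_underrepresented`, `min_share_pct`, and the toy series are hypothetical names and data. In the notebook you would pass `documents['doc_type']` instead of the toy series.

```python
import pandas as pd


def flag_underrepresented(doc_types: pd.Series, min_share_pct: float = 10.0) -> pd.DataFrame:
    """Return each class's share of the corpus (%) and flag classes below min_share_pct."""
    share = doc_types.value_counts(normalize=True).mul(100).round(1)
    report = share.rename("share_pct").to_frame()
    report["needs_augmentation"] = report["share_pct"] < min_share_pct
    return report


# Toy stand-in for documents['doc_type'] (20 documents, made-up mix).
toy_types = pd.Series(
    ["shift_log"] * 10 + ["maintenance_ticket"] * 6 + ["ncr"] * 3 + ["safety_bulletin"] * 1
)
coverage_flags = flag_underrepresented(toy_types)
print(coverage_flags)
```

Classes flagged here are candidates for targeted collection or synthetic augmentation; the 10% default mirrors the threshold in the guidance and should be tuned per plant.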

In [None]:
# Freshness analysis: days since last update
today = pd.Timestamp('2025-10-13')
documents['days_stale'] = (today - documents['last_updated']).dt.days
staleness = documents.groupby('doc_type')['days_stale'].describe()[['mean', '50%','max']].astype(int)

print("Data Freshness Analysis (in days):")
print(staleness)

### 🧭 Governance Check
- Define a freshness SLA: e.g., ≤ 180 days for maintenance tickets, ≤ 90 days for safety bulletins.
- Trigger review workflows if `max` > 730 days (stale procedures).
- Document exceptions and notify plant managers for updates.
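
One way to turn the governance check into code is sketched below. It assumes the SLA is an upper bound on days since last update and that a review is triggered once staleness exceeds 730 days. `FRESHNESS_SLA_DAYS`, `REVIEW_TRIGGER_DAYS`, and `check_freshness` are hypothetical names, and document types without an SLA entry are simply never marked as breached.

```python
import pandas as pd

# Assumed SLA limits in days, taken from the examples in the governance check.
FRESHNESS_SLA_DAYS = {"maintenance_ticket": 180, "safety_bulletin": 90}
# Assumed review trigger: anything older than this is treated as a stale procedure.
REVIEW_TRIGGER_DAYS = 730


def check_freshness(docs: pd.DataFrame) -> pd.DataFrame:
    """Add 'sla_breached' and 'needs_review' columns based on 'days_stale'."""
    out = docs.copy()
    sla = out["doc_type"].map(FRESHNESS_SLA_DAYS)   # NaN where no SLA is defined
    out["sla_breached"] = out["days_stale"] > sla   # comparisons against NaN are False
    out["needs_review"] = out["days_stale"] > REVIEW_TRIGGER_DAYS
    return out


# Toy rows with the same column names as the notebook's documents frame.
toy_docs = pd.DataFrame({
    "doc_id": ["DOC-0001", "DOC-0002", "DOC-0003", "DOC-0004"],
    "doc_type": ["maintenance_ticket", "safety_bulletin", "shift_log", "maintenance_ticket"],
    "days_stale": [200, 30, 800, 100],
})
checked = check_freshness(toy_docs)
print(checked)
```

The rows where `needs_review` is true are the ones to route to plant managers; exceptions can be recorded by adding a column rather than by dropping rows.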

In [None]:
# Visualize plant-wise distribution and PII risk
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plant Distribution
plant_counts = documents.groupby('plant').size()
plant_counts.plot(kind='bar', ax=axes[0], title='Documents per Plant', color='#1f77b4', rot=0)
axes[0].set_ylabel('Document Count')
axes[0].set_xlabel('Plant ID')

# PII Flags Distribution
pii_counts = documents.groupby('pii_flags').size()
pii_counts.plot(kind='bar', ax=axes[1], title='PII Flags Detected', color='#d62728', rot=0)
axes[1].set_ylabel('Document Count')
axes[1].set_xlabel('PII Type')


for ax in axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', xytext=(0, 9), textcoords='offset points')

plt.tight_layout()
plt.show()

## 🧱 Masked vs. Causal LM Readiness

| Dimension | Masked LM (e.g., BERT) Proof Points | Causal LM (e.g., GPT/Llama) Proof Points | Manufacturing Notes |
|---|---|---|---|
| **Goal** | Bidirectional understanding, classification, entity recognition. | Text generation, summarization, Q&A. | **Causal LM** is better for a copilot that needs to draft reports and answer questions. **Masked LM** is good for pre-processing steps like identifying machine parts in text. |
| **Data Structure** | Unstructured text, sentences with missing words. | Sequential, conversational, or instruction-formatted text. | Our corpus is a mix. Shift logs are sequential (good for Causal), while NCRs are structured forms (good for Masked). |
| **Task Example** | "The [MASK] failed due to overheating." -> Predicts "motor". | "Generate a maintenance report for motor failure..." -> Drafts a full report. | The copilot's primary value is generative, favoring a **Causal LM** architecture. |
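| **Training Cost** | Often cheaper and faster to pre-train in practice, because encoder models are typically much smaller. | More computationally expensive at the model sizes needed for fluent generation. | Start with a pre-trained Causal LM and fine-tune it; full pre-training from scratch is a massive undertaking. |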
| **Verdict** | Use for specialized NLP tasks (e.g., PII detection). | **Primary choice** for the Manufacturing Copilot's core generative engine. | A hybrid approach is powerful: use a fine-tuned Masked LM to extract structured data from logs, then feed that data to a Causal LM to generate a summary. |

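To make the two objectives in the table concrete, the sketch below builds one training example of each kind from the same sentence. It is a simplified illustration with hypothetical helper names: real masked-LM recipes such as BERT's also replace some selected tokens with random or unchanged tokens, and real pipelines operate on tokenizer IDs rather than whitespace-split words.

```python
import random

MASK_TOKEN = "[MASK]"


def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide random tokens; the label is the hidden token, None elsewhere."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)    # loss is computed only at masked positions
        else:
            inputs.append(tok)
            labels.append(None)   # position is ignored by the loss
    return inputs, labels


def make_causal_example(tokens):
    """Causal LM: each target is the token that follows the corresponding input."""
    return tokens[:-1], tokens[1:]


tokens = "The motor failed due to overheating .".split()
masked_inputs, masked_labels = make_masked_example(tokens, mask_prob=0.3)
causal_inputs, causal_targets = make_causal_example(tokens)
print(masked_inputs, masked_labels)
print(causal_inputs, causal_targets)
```

The masked example sees context on both sides of each hidden token, while the causal example only ever sees tokens to the left, which is why the latter maps naturally onto report drafting.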

In [None]:
# Risk scoring heuristic to prioritize governance reviews
risk_weights = {
    'safety_bulletin': 5,
    'ncr': 4,
    'maintenance_ticket': 3,
    'shift_log': 1
}

def calculate_risk(row):
    doc_type_risk = risk_weights.get(row['doc_type'], 0)
    pii_risk = 2 if row['pii_flags'] != 'none' else 0
    stale_risk = 1 if row['days_stale'] > 365 else 0
    safety_risk = 3 if row['safety_sensitive'] else 0
    return doc_type_risk + pii_risk + stale_risk + safety_risk

documents['governance_risk'] = documents.apply(calculate_risk, axis=1)

risk_summary = documents.groupby('plant')['governance_risk'].mean().sort_values(ascending=False)

print("Average Governance Risk Score per Plant:")
print(risk_summary)

# Display documents with the highest risk scores
print("\nTop 5 High-Risk Documents:")
print(documents.sort_values('governance_risk', ascending=False).head())

### 🛡️ Risk Register Template
- Plants with an average risk score ≥ 6 require Compliance sign-off before data export.
- Flag PII types and note anonymization method (hash, redact, aggregate).
- Capture risks in ISO 9001 change log with mitigation owners.
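
A register along these lines could be assembled as follows. This is a sketch under stated assumptions: sign-off is required when the plant's average risk is at or above 6, `build_risk_register` and `pseudonymize` are hypothetical helpers, and the employee ID and salt are made up. Salted hashing is shown as one of the listed options; it is pseudonymization rather than full anonymization, so the salt needs to be stored and rotated under your own policy.

```python
import hashlib

import pandas as pd

SIGNOFF_THRESHOLD = 6  # assumed: sign-off when average risk is at or above this value


def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a short salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def build_risk_register(docs: pd.DataFrame) -> pd.DataFrame:
    """Summarize average risk and PII document count per plant, with a sign-off flag."""
    register = docs.groupby("plant").agg(
        avg_risk=("governance_risk", "mean"),
        pii_docs=("pii_flags", lambda s: int((s != "none").sum())),
    )
    register["compliance_signoff"] = register["avg_risk"] >= SIGNOFF_THRESHOLD
    return register.sort_values("avg_risk", ascending=False)


# Toy rows with the same column names as the notebook's documents frame.
toy_docs = pd.DataFrame({
    "plant": ["Plant_A", "Plant_A", "Plant_B", "Plant_B"],
    "governance_risk": [8, 6, 1, 4],
    "pii_flags": ["name", "none", "none", "none"],
})
register = build_risk_register(toy_docs)
token = pseudonymize("EMP-10442", salt="rotate-me")  # made-up ID and salt
print(register)
print(token)
```

Mitigation owners and the chosen anonymization method can be added as further columns before the register is attached to the change log.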

## 🧪 Lab Assignment
1. Replace the synthetic dataset with your plant corpus exports (CSV/JSON/PDF).
2. Extend the freshness SLA by equipment criticality and create alert rules.
3. Propose a three-phase curriculum schedule and justify each phase with metrics.
4. Present a governance report to IT + Compliance with risk scores and mitigation actions.
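
As a possible starting point for item 3, a curriculum can be expressed as per-phase sampling weights over document types. Everything in this sketch is an assumption to be replaced with your own justified values: the phase names, the weights, the inclusion of safety bulletins in the last phase, and the `sample_batch` helper.

```python
import random

# Hypothetical three-phase schedule: routine logs first, high-severity documents last.
CURRICULUM = [
    ("phase_1_routine", {"shift_log": 0.8, "maintenance_ticket": 0.2}),
    ("phase_2_maintenance", {"shift_log": 0.3, "maintenance_ticket": 0.5, "ncr": 0.2}),
    ("phase_3_high_severity", {"shift_log": 0.1, "maintenance_ticket": 0.3,
                               "ncr": 0.4, "safety_bulletin": 0.2}),
]


def sample_batch(docs_by_type, weights, batch_size, rng):
    """Draw document IDs so that doc types appear roughly in proportion to weights."""
    # Skip types that have a weight but no documents available.
    types = [t for t in weights if docs_by_type.get(t)]
    probs = [weights[t] for t in types]
    chosen_types = rng.choices(types, weights=probs, k=batch_size)
    return [rng.choice(docs_by_type[t]) for t in chosen_types]


# Toy inventory keyed by document type (made-up IDs).
toy_inventory = {
    "shift_log": ["SL-1", "SL-2"],
    "maintenance_ticket": ["MT-1"],
    "ncr": ["NCR-1"],
    "safety_bulletin": ["SB-1"],
}
rng = random.Random(0)
phase_name, phase_weights = CURRICULUM[0]
batch = sample_batch(toy_inventory, phase_weights, batch_size=8, rng=rng)
print(phase_name, batch)
```

Justifying each phase then comes down to reporting metrics per phase, for example validation loss by document type, alongside the weights used.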

## ✅ Checklist
- [ ] Corpus inventoried with volume, freshness, and language breakdown
- [ ] PII and safety-sensitive text cataloged with mitigation plan
- [ ] Curriculum roadmap drafted and validated with SMEs
- [ ] Governance report delivered to stakeholders

## 📚 References
- *ISO 9001:2015 Quality Management Systems*
- *OSHA Recordkeeping Guidelines*
- HuggingFace Datasets: [Data Curation Playbook](https://huggingface.co/docs/datasets/main/en/process)
- NVIDIA: *Data-Centric AI for Industrial Use Cases* (2024)