# 🏭 Week 09-10 · Notebook 01 · Pre-Training Concepts for Manufacturing Corpora

Understand how to audit, curate, and prepare manufacturing text data before launching large-scale language model pre-training.

## 🎯 Learning Objectives
- Diagnose whether a manufacturing corpus is ready for masked or causal language modeling.
- Engineer domain-specific curricula that balance routine operations with edge-case incidents.
- Quantify coverage, freshness, and risk hotspots across maintenance, quality, and safety documents.
- Produce a governance-ready data audit that satisfies IT/Compliance stakeholders.

## 🧩 Scenario
You have five years of shift logs, non-conformance reports (NCRs), maintenance tickets, and safety bulletins collected from four automotive plants. Leadership wants a maintenance co-pilot pre-trained on this corpus. Your job is to surface data gaps, compliance hazards, and curriculum strategy before anyone spins up GPUs.

In [None]:
# Core libraries for profiling
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8')

## 🗂️ Sample Manufacturing Corpus
The notebook ships with a synthetic corpus that mimics mixed-format manufacturing documents. Replace these CSV/JSON stubs with your plant exports when running in production.
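
As a rough illustration of the shape such a corpus might take, the sketch below builds a small metadata table with one row per document. The column names `doc_type`, `plant`, `last_updated`, `pii_flags`, and `safety_sensitive` match the ones the later cells read. Everything else (the helper `make_synthetic_corpus`, the `doc_id` column, the category labels, the sampling probabilities, and the row count) is a hypothetical stand-in, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical category values; only the column names come from later cells.
DOC_TYPES = ["shift_log", "maintenance_ticket", "ncr", "safety_bulletin"]
PLANTS = ["plant_a", "plant_b", "plant_c", "plant_d"]
PII_FLAGS = ["none", "employee_name", "badge_id"]


def make_synthetic_corpus(n_rows: int = 200, seed: int = 42) -> pd.DataFrame:
    """Build a toy metadata table with one row per document."""
    rng = np.random.default_rng(seed)
    end = pd.Timestamp("2025-10-13")
    # Spread update dates over roughly five years before `end`.
    age_days = rng.integers(0, 5 * 365, size=n_rows)
    doc_type = rng.choice(DOC_TYPES, size=n_rows, p=[0.55, 0.25, 0.15, 0.05])
    return pd.DataFrame({
        "doc_id": np.arange(n_rows),
        "doc_type": doc_type,
        "plant": rng.choice(PLANTS, size=n_rows),
        "last_updated": end - pd.to_timedelta(age_days, unit="D"),
        "pii_flags": rng.choice(PII_FLAGS, size=n_rows, p=[0.8, 0.15, 0.05]),
        # Assumption: only safety bulletins are marked safety-sensitive.
        "safety_sensitive": doc_type == "safety_bulletin",
    })


documents = make_synthetic_corpus()
print(documents.head())
```

Skewing the sampling probabilities toward shift logs is deliberate: it reproduces the class imbalance that the coverage checks are meant to surface.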

In [None]:
# Create a synthetic dataset with multiple document classes
np.random.seed(42)
documents = pd.DataFrame([

In [None]:
# Coverage summary by document type
coverage = documents.groupby('doc_type').agg(

### 🔎 Interpretation Guidance
- **Coverage**: Ensure high-risk document classes (e.g., safety bulletins) have sufficient volume.
- **Curriculum Candidate**: Stage training from routine shift logs → maintenance tickets → high-severity NCRs.
- **Action**: Flag doc types with <10% representation for synthetic augmentation or targeted collection.
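
The guidance above can be sketched as a small report: compute each document type's share of the corpus, flag anything under the 10% threshold, and attach a curriculum phase. The toy counts, the `coverage_report` helper, and the `CURRICULUM_PHASE` mapping are illustrative assumptions; only the 10% cut-off and the shift log, maintenance ticket, NCR ordering come from the text.

```python
import pandas as pd

# Toy stand-in for the notebook's `documents` frame (counts are made up).
documents = pd.DataFrame({
    "doc_type": ["shift_log"] * 11 + ["maintenance_ticket"] * 5
                + ["ncr"] * 3 + ["safety_bulletin"] * 1,
})

# Hypothetical phase mapping that follows the routine-to-incident ordering.
CURRICULUM_PHASE = {
    "shift_log": 1,
    "maintenance_ticket": 2,
    "ncr": 3,
}


def coverage_report(df: pd.DataFrame, min_share: float = 0.10) -> pd.DataFrame:
    """Share of the corpus per document type, with an under-coverage flag."""
    report = df["doc_type"].value_counts().rename("n_docs").to_frame()
    report["share"] = report["n_docs"] / len(df)
    report["needs_augmentation"] = report["share"] < min_share
    # Doc types missing from the mapping stay unassigned (NaN) for SME review.
    report["curriculum_phase"] = report.index.map(CURRICULUM_PHASE)
    return report


report = coverage_report(documents)
print(report)
```

In this toy corpus only safety bulletins fall under the threshold. They also have no curriculum phase, which is a prompt to decide with SMEs where high-risk, low-volume classes belong in the schedule.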

In [None]:
# Freshness analysis: days since last update
today = pd.Timestamp('2025-10-13')
documents['days_stale'] = (today - documents['last_updated']).dt.days
staleness = documents.groupby('doc_type')['days_stale'].describe()[['mean', '50%', 'max']]

### 🧭 Governance Check
- Define a freshness SLA: e.g., ≤ 180 days for maintenance tickets, ≤ 90 days for safety bulletins.
- Trigger review workflows if `max` > 730 days (stale procedures).
- Document exceptions and notify plant managers for updates.
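
One way to turn these checks into code is sketched below. It treats the 180-day and 90-day figures as maximum ages and the 730-day figure as a review trigger. The fallback limit for other document types, the constant names, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# SLA limits in days. The two listed values come from the governance check;
# the fallback for other document types is an assumption.
FRESHNESS_SLA_DAYS = {"maintenance_ticket": 180, "safety_bulletin": 90}
DEFAULT_SLA_DAYS = 365
REVIEW_THRESHOLD_DAYS = 730

# Toy rows standing in for the notebook's `documents` frame.
documents = pd.DataFrame({
    "doc_type": ["maintenance_ticket", "maintenance_ticket",
                 "safety_bulletin", "shift_log"],
    "last_updated": pd.to_datetime(
        ["2025-08-01", "2023-01-15", "2025-06-01", "2025-10-01"]),
})

today = pd.Timestamp("2025-10-13")
documents["days_stale"] = (today - documents["last_updated"]).dt.days

# Look up each row's limit, then compare it with the document's age.
sla = documents["doc_type"].map(FRESHNESS_SLA_DAYS).fillna(DEFAULT_SLA_DAYS)
documents["sla_breach"] = documents["days_stale"] > sla

# Doc types whose oldest document exceeds the review threshold.
max_stale = documents.groupby("doc_type")["days_stale"].max()
needs_review = max_stale[max_stale > REVIEW_THRESHOLD_DAYS].index.tolist()

print(documents[["doc_type", "days_stale", "sla_breach"]])
print("Review workflow for:", needs_review)
```

Keeping the limits in a dictionary makes it straightforward to extend the SLA later, for example by keying on equipment criticality as well as document type.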

In [None]:
# Visualize plant-wise distribution and PII risk
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
documents.groupby('plant').size().plot(kind='bar', ax=axes[0], title='Documents per Plant', color='#1f77b4')
documents.groupby('pii_flags').size().plot(kind='bar', ax=axes[1], title='PII Flags', color='#d62728')
for ax in axes:

## 🧱 Masked vs. Causal LM Readiness
| Dimension | Masked LM Proof Points | Causal LM Proof Points | Manufacturing Notes |
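
To make the two objectives concrete, the sketch below turns one log line into a training example under each. It is a simplification: tokens are split on whitespace, masking uses the conventional 15% rate without the usual random-replacement variants, and the log line and function names are hypothetical.

```python
import random

MASK = "[MASK]"


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; only hidden positions carry a label."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None for i, t in enumerate(tokens)]
    return inputs, labels


def causal_lm_example(tokens):
    """Each position predicts the next token from the tokens to its left."""
    return tokens[:-1], tokens[1:]


# Hypothetical shift-log line, split on whitespace for illustration only.
line = "Press 4 hydraulic pressure dropped after die change on second shift"
tokens = line.split()

mlm_inputs, mlm_labels = masked_lm_example(tokens)
clm_inputs, clm_labels = causal_lm_example(tokens)

print("MLM inputs:", mlm_inputs)
print("CLM inputs:", clm_inputs)
print("CLM labels:", clm_labels)
```

The masked variant sees context on both sides of each gap, which suits classification and retrieval over short records. The causal variant only sees what came before, which matches the generation behavior a co-pilot needs.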

In [None]:
# Risk scoring heuristic to prioritize governance reviews
risk_weights = {
documents['governance_risk'] = documents.apply(
    lambda row: risk_weights[row['doc_type']]
    + (2 if row['pii_flags'] != 'none' else 0)
    + (1 if row['days_stale'] > 365 else 0)
    + (2 if row['safety_sensitive'] else 0),
    axis=1,
)
risk_summary = documents.groupby('plant')['governance_risk'].mean().sort_values(ascending=False)

### 🛡️ Risk Register Template
- Plants with average risk ≥ 6 require Compliance sign-off before data export.
- Flag PII types and note anonymization method (hash, redact, aggregate).
- Capture risks in ISO 9001 change log with mitigation owners.
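
A minimal sketch of how the register could be assembled: average the per-document scores by plant, flag plants at or above the sign-off threshold, and hash a PII column before export. The `technician` column, the `pseudonymize` helper, the salt value, and the toy scores are hypothetical, and treating the threshold of 6 as inclusive is an assumption.

```python
import hashlib

import pandas as pd

RISK_THRESHOLD = 6  # assumed inclusive cut-off; align with Compliance policy


def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


# Toy per-document scores standing in for `documents['governance_risk']`.
documents = pd.DataFrame({
    "plant": ["plant_a", "plant_a", "plant_b", "plant_b"],
    "governance_risk": [7, 6, 3, 4],
    "technician": ["J. Doe", "A. Smith", "J. Doe", "R. Lee"],
})

# Average risk per plant, with a flag for plants that need sign-off.
risk_summary = documents.groupby("plant")["governance_risk"].mean()
register = risk_summary.rename("avg_risk").to_frame()
register["needs_signoff"] = register["avg_risk"] >= RISK_THRESHOLD

# Hash the hypothetical identifier column before the data leaves the plant.
documents["technician"] = documents["technician"].map(
    lambda name: pseudonymize(name, salt="replace-with-secret-salt"))

print(register)
print(documents)
```

Salted hashing is pseudonymization rather than full anonymization: the same person maps to the same digest, so records stay linkable. Record that choice in the register next to the PII type so reviewers can judge whether redaction or aggregation is needed instead.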

## 🧪 Lab Assignment
1. Replace the synthetic dataset with your plant corpus exports (CSV/JSON/PDF).
2. Extend the freshness SLA by equipment criticality and create alert rules.
3. Propose a three-phase curriculum schedule and justify each phase with metrics.
4. Present a governance report to IT + Compliance with risk scores and mitigation actions.

## ✅ Checklist
- [ ] Corpus inventoried with volume, freshness, and language breakdown
- [ ] PII and safety-sensitive text cataloged with mitigation plan
- [ ] Curriculum roadmap drafted and validated with SMEs
- [ ] Governance report delivered to stakeholders

## 📚 References
- *ISO 9001:2015 Quality Management Systems*
- *OSHA Recordkeeping Guidelines*
- Hugging Face Datasets: [Data Curation Playbook](https://huggingface.co/docs/datasets/main/en/process)
- NVIDIA: *Data-Centric AI for Industrial Use Cases* (2024)