# FAA Maintenance Log Classification with BERT

This project builds a **text classification pipeline** for FAA Service Difficulty Reports (SDR) using **BERT fine-tuning**.

## Goal
Classify free-text **Discrepancy** narratives into high-level aircraft system categories (e.g., STRUCTURES, AVIONICS/ELECTRICAL, POWERPLANT).  
This enables automated triage and trend analysis of maintenance issues.

To achieve this, we demonstrate:
- **Data wrangling & cleaning**: parsing raw FAA SDR `.xls` HTML tables into a consistent dataset.  
- **Category engineering**: mapping fine-grained JASC codes into 7 consolidated categories.  
- **Fine-tuning BERT**: adapting a pretrained model (DistilBERT / BERT-base) for aerospace maintenance text.  
- **Few-shot inference**: testing model predictions on novel, hand-crafted examples to evaluate generalization.  
- **Class imbalance handling**: merging sparse categories, experimenting with weighted loss.  
- **Evaluation & analysis**: reporting macro-F1, confusion matrices, and error case studies.  
- **Reusable inference interface**: providing a lightweight CLI script to classify new log entries.

## Why BERT?
- **Domain-specific challenge**: Maintenance logs are short, technical, and use aviation-specific terminology.  
- **Limitations of traditional models**: TF-IDF + Logistic Regression baselines often fail to capture context or rare phrases.  
- **BERT advantage**: Pretrained on large corpora, BERT can capture contextual meaning (e.g., “wing crack” vs “engine crack”) and generalize better to unseen phrasing.
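
For reference, the kind of TF-IDF + Logistic Regression baseline mentioned above can be sketched as follows. This is a minimal illustration, not the baseline actually used in this project: the n-gram range, `class_weight="balanced"`, and the toy narratives are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline():
    """Bag-of-words baseline: TF-IDF features fed into a linear classifier."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),  # unigrams + bigrams (assumed setting)
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )


# Toy narratives, invented purely for illustration (not real SDR rows).
toy_texts = [
    "CRACK FOUND ON FUSELAGE SKIN",
    "CORROSION ON WING SPAR",
    "ENGINE OIL PRESSURE LOW",
    "ENGINE VIBRATION DURING CLIMB",
    "GENERATOR FAULT LIGHT ON",
    "WIRE CHAFED IN AVIONICS BAY",
]
toy_labels = [
    "STRUCTURES", "STRUCTURES",
    "POWERPLANT", "POWERPLANT",
    "AVIONICS/ELECTRICAL", "AVIONICS/ELECTRICAL",
]

baseline = build_baseline()
baseline.fit(toy_texts, toy_labels)
toy_preds = baseline.predict(["CRACK FOUND ON WING SKIN"])
```

In the notebook, the same pipeline could be fitted on `X_train` / `y_train` and scored on `X_val` with macro-F1 to give a number to compare the fine-tuned model against.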

## Data 
(See the previous EDA notebook for more detail on how labels are grouped by JASC code.)
- Source: FAA Service Difficulty Reports (SDR), Boeing 767 (2022–2025).  
- After cleaning and mapping JASC codes → Categories, dataset size is ~14k rows.  
- Categories were **consolidated into 7 groups** to reduce data sparsity (the classes remain imbalanced: STRUCTURES accounts for about 62% of rows):
  - STRUCTURES  
  - AVIONICS/ELECTRICAL  
  - POWERPLANT  
  - CABIN  
  - FLUID_SYSTEMS  
  - ENVIRONMENTAL/SAFETY  
  - FLIGHT_CONTROLS

## Pipeline
1. **Data acquisition & merge**: Parse FAA SDR HTML `.xls` exports into one combined CSV.  
2. **EDA & Cleaning**: Normalize dates, remove blanks, analyze text length & class balance.  
3. **Category Mapping**: Convert JASC codes → domain-relevant system categories, merge rare labels.  
4. **Train/Val/Test Split**: Stratified to preserve class distribution.  
5. **Modeling**: Fine-tune `distilbert-base-uncased` (later `bert-base-uncased`).  
6. **Evaluation**: Macro-F1, confusion matrix, per-class reports.  
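
As a rough illustration of step 3, the sketch below maps a JASC code to a category using its first two digits (the ATA chapter). The chapter-to-category table here is hypothetical and deliberately incomplete; the actual grouping used for `Category_adj` is defined in the EDA notebook, and `jasc_to_category` is a name invented for this example.

```python
# Hypothetical, partial chapter -> category table (an assumption for illustration only).
CHAPTER_TO_CATEGORY = {
    "24": "AVIONICS/ELECTRICAL",   # electrical power
    "34": "AVIONICS/ELECTRICAL",   # navigation
    "25": "CABIN",                 # equipment / furnishings
    "27": "FLIGHT_CONTROLS",       # flight controls
    "28": "FLUID_SYSTEMS",         # fuel
    "29": "FLUID_SYSTEMS",         # hydraulic power
    "53": "STRUCTURES",            # fuselage
    "57": "STRUCTURES",            # wings
    "72": "POWERPLANT",            # engine
}


def jasc_to_category(code, default="OTHER"):
    """Map a 4-digit JASC code (str or int) to a high-level category."""
    chapter = str(code).strip().zfill(4)[:2]  # first two digits = chapter
    return CHAPTER_TO_CATEGORY.get(chapter, default)
```

Codes that fall outside the table land in `default`, which makes it easy to count how many rows would still need a mapping or should be merged into a broader group.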

In [1]:
#Import and data
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("../data/processed/sdr_training.csv")
OUTPUT_DIR = "../outputs/bert"

In [2]:
# Split the dataset
X = df["Discrepancy"].astype(str)
y = df["Category_adj"]
# Stratified 70/15/15 train/val/test split, keeping the same class proportions as the original data
SEED = 676

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=SEED
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=SEED
)

print(f"Train: {len(X_train):,} | Val: {len(X_val):,} | Test: {len(X_test):,}")

# quick distribution check
def dist(s): 
    return s.value_counts(normalize=True).sort_index().round(3)

print("\nClass distribution (proportions):")
print("Full:\n", dist(y))
print("Train:\n", dist(y_train))
print("Val:\n", dist(y_val))
print("Test:\n", dist(y_test))

Train: 10,026 | Val: 2,149 | Test: 2,149

Class distribution (proportions):
Full:
 Category_adj
AVIONICS/ELECTRICAL     0.158
CABIN                   0.046
ENVIRONMENTAL/SAFETY    0.046
FLIGHT_CONTROLS         0.027
FLUID_SYSTEMS           0.042
POWERPLANT              0.065
STRUCTURES              0.617
Name: proportion, dtype: float64
Train:
 Category_adj
AVIONICS/ELECTRICAL     0.158
CABIN                   0.046
ENVIRONMENTAL/SAFETY    0.046
FLIGHT_CONTROLS         0.027
FLUID_SYSTEMS           0.042
POWERPLANT              0.065
STRUCTURES              0.617
Name: proportion, dtype: float64
Val:
 Category_adj
AVIONICS/ELECTRICAL     0.158
CABIN                   0.047
ENVIRONMENTAL/SAFETY    0.046
FLIGHT_CONTROLS         0.027
FLUID_SYSTEMS           0.042
POWERPLANT              0.065
STRUCTURES              0.617
Name: proportion, dtype: float64
Test:
 Category_adj
AVIONICS/ELECTRICAL     0.158
CABIN                   0.047
ENVIRONMENTAL/SAFETY    0.046
FLIGHT_CONTROLS         0

In [3]:
# Label encoding
# Models can't work with text labels, so we convert our labels from text to integer IDs.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(y_train)
y_train_id = le.transform(y_train)
y_val_id   = le.transform(y_val)
y_test_id  = le.transform(y_test)

num_labels = len(le.classes_)
id2label = {i: c for i, c in enumerate(le.classes_)}
label2id = {c: i for i, c in id2label.items()}

print("Classes:", list(le.classes_))
print("num_labels:", num_labels)

Classes: ['AVIONICS/ELECTRICAL', 'CABIN', 'ENVIRONMENTAL/SAFETY', 'FLIGHT_CONTROLS', 'FLUID_SYSTEMS', 'POWERPLANT', 'STRUCTURES']
num_labels: 7


In [4]:
# Build HF datasets
from datasets import Dataset

train_df = pd.DataFrame({"text": X_train.tolist(), "labels": y_train_id})
val_df   = pd.DataFrame({"text": X_val.tolist(),   "labels": y_val_id})
test_df  = pd.DataFrame({"text": X_test.tolist(),  "labels": y_test_id})

ds_train = Dataset.from_pandas(train_df, preserve_index=False)
ds_val   = Dataset.from_pandas(val_df,   preserve_index=False)
ds_test  = Dataset.from_pandas(test_df,  preserve_index=False)

len(ds_train), len(ds_val), len(ds_test)

(10026, 2149, 2149)

In [5]:
#import sys, platform
#print(sys.executable)   # should point to ...\anaconda3\envs\ds-interview\python.exe
#print(platform.python_version())
#%pip install --upgrade --index-url https://download.pytorch.org/whl/cpu torch torchvision

In [6]:
# Run this only if you need PyTorch or if there is a conflict in the environment
%pip install --upgrade --index-url https://download.pytorch.org/whl/cu121 torch torchvision

Looking in indexes: https://download.pytorch.org/whl/cu121
Note: you may need to restart the kernel to use updated packages.


In [7]:
# [5] Tokenizer
# ---------------------------
# Why?
# - Transformers like BERT cannot read raw text, they need token IDs.
# - Hugging Face's AutoTokenizer loads the matching tokenizer and produces the token IDs.
# - Each sentence will be:
#     * Split into subword tokens (WordPiece)
#     * Mapped to IDs from the pretrained vocab
#     * Truncated or padded to a fixed length
# - After this step, our Hugging Face Dataset will contain fields like:
#     input_ids, attention_mask, labels
from transformers import AutoTokenizer

#Config
MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 128

# Load tokenizer tied to DistilBERT
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
# Function to apply tokenizer to a batch of examples
def tok_fn(batch):
    return tok(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LEN,
    )
    
# Apply tokenizer to train/val/test datasets
# remove_columns=["text"] since we don’t need raw text inside model input
ds_train_tok = ds_train.map(tok_fn, batched=True, remove_columns=["text"])
ds_val_tok   = ds_val.map(tok_fn, batched=True, remove_columns=["text"])
ds_test_tok  = ds_test.map(tok_fn, batched=True, remove_columns=["text"])
# Convert Hugging Face dataset into PyTorch tensors (for Trainer)
ds_train_tok.set_format("torch")
ds_val_tok.set_format("torch")
ds_test_tok.set_format("torch")

ds_train_tok[0].keys()


dict_keys(['labels', 'input_ids', 'attention_mask'])
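
Because every narrative is truncated to `MAX_LEN` tokens, it can be worth checking how many narratives actually exceed that limit. The helper below is a small sketch of such a check; `truncation_rate` and its `n_special` parameter (two slots for `[CLS]` and `[SEP]`) are names assumed for this example, and the demo uses whitespace splitting as a stand-in tokenizer so it runs without downloading a model.

```python
def truncation_rate(texts, tokenize, max_len=128, n_special=2):
    """Fraction of texts whose token count plus special tokens exceeds max_len."""
    texts = list(texts)
    if not texts:
        return 0.0
    too_long = sum(1 for t in texts if len(tokenize(t)) + n_special > max_len)
    return too_long / len(texts)


# Stand-in demo: whitespace tokens instead of WordPiece, tiny max_len.
demo_texts = ["LEAK AT HYDRAULIC LINE", "CRACK " * 10]
demo_rate = truncation_rate(demo_texts, str.split, max_len=8)
```

In the notebook this could be called as `truncation_rate(X_train, tok.tokenize, MAX_LEN)`; a rate near zero would suggest that raising `MAX_LEN` is unlikely to help.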

In [8]:
# [6] Model setup
# - We load a pretrained DistilBERT model for sequence classification.
# - The classification head is configured with our number of labels.
# - We also handle class imbalance by computing weights for each class.
     
import numpy as np, torch, torch.nn as nn
from transformers import AutoModelForSequenceClassification
# Configure if we will use class weights. In this case we use class weights as our classes are not balanced
################################################################################
use_class_weights = True  # flip to False if we don't want to use class weights
#################################################################################
# Load DistilBERT with a classification head
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,   # number of categories
    id2label=id2label,       # int -> class name mapping
    label2id=label2id,       # class name -> int mapping
)

if use_class_weights:
    counts = np.bincount(y_train_id, minlength=num_labels) # samples per class
    weights = counts.sum() / (counts + 1e-9)  # inverse-frequency weighting
    weights = weights / weights.mean() # normalize
    class_weights_t = torch.tensor(weights, dtype=torch.float)
    
# These weights will be plugged into CrossEntropyLoss later in Trainer


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
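
To make the weighting in the cell above concrete, here is the same inverse-frequency formula applied to a toy label vector. The counts (600 / 300 / 100) are invented for illustration and are not the real class counts.

```python
import numpy as np

# Toy labels: class 0 is common, class 2 is rare (illustrative counts only).
toy_labels = np.array([0] * 600 + [1] * 300 + [2] * 100)

counts = np.bincount(toy_labels, minlength=3)    # samples per class
raw = counts.sum() / (counts + 1e-9)             # inverse frequency
toy_weights = raw / raw.mean()                   # normalize so the mean weight is 1

print(np.round(toy_weights, 3))  # roughly 0.333, 0.667, 2.0
```

The rarest class gets the largest weight, so its errors contribute more to the loss, while normalizing to a mean of 1 keeps the overall loss scale comparable to the unweighted case.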


In [9]:
#import transformers, inspect, sys
#print("transformers version:", transformers.__version__)
#print("loaded from:", transformers.__file__)
#from transformers import TrainingArguments
#print("TrainingArguments signature:\n", inspect.signature(TrainingArguments.__init__))

In [10]:
# [7] TrainingArguments + Trainer (Transformers 4.56+)
# Configure hyperparameters and training arguments
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import torch, torch.nn as nn


# Function: compute metrics after each eval step
# We report both accuracy and macro-F1 (balances across classes)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "acc": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }

# Hyperparameter config
EPOCHS = 3
LR = 2e-5
BATCH_TRAIN = 16
BATCH_EVAL = 32
FP16 = torch.cuda.is_available() # use mixed precision if GPU available

# TrainingArguments control the Trainer's behavior
args_tr = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_TRAIN,
    per_device_eval_batch_size=BATCH_EVAL,
    learning_rate=LR,
    weight_decay=0.01,            # L2 regularization
    eval_strategy="steps",        # evaluate every N steps
    save_strategy="steps",        # save checkpoint every N steps
    logging_steps=50,
    eval_steps=300,               # run eval every 300 steps
    save_steps=300,               # save a checkpoint every 300 steps
    load_best_model_at_end=True,  # restore best checkpoint (based on metric_for_best_model)
    metric_for_best_model="f1_macro",
    fp16=FP16,                     # use FP16 for faster training on GPU
    report_to="none",
)

# ---- Weighted trainer ----
class WeightedTrainer(Trainer):
    def __init__(self, class_weights=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights 

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        outputs = model(**inputs)
        logits  = outputs.get("logits")
        labels  = inputs.get("labels")

        if labels is not None and logits is not None:
            if self.class_weights is not None:
                loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(logits.device))
            else:
                loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels)
        else:
            loss = outputs.get("loss")

        return (loss, outputs) if return_outputs else loss

# ---- Instantiate trainer (use processing_class instead of deprecated tokenizer) ----
if use_class_weights:
    trainer = WeightedTrainer(
        model=model,
        args=args_tr,
        train_dataset=ds_train_tok,
        eval_dataset=ds_val_tok,
        processing_class=tok,         
        compute_metrics=compute_metrics,
        class_weights=class_weights_t,
    )
else:
    trainer = Trainer(
        model=model,
        args=args_tr,
        train_dataset=ds_train_tok,
        eval_dataset=ds_val_tok,
        processing_class=tok,          
        compute_metrics=compute_metrics,
    )


In [11]:
# [8] Train
train_output = trainer.train()
train_output

Step,Training Loss,Validation Loss,Acc,F1 Macro
300,0.8861,0.719475,0.794323,0.685859
600,0.4735,0.485246,0.86738,0.772676
900,0.4003,0.500465,0.901815,0.827178
1200,0.4808,0.4888,0.913913,0.840139
1500,0.3009,0.472811,0.90926,0.831906
1800,0.2878,0.451557,0.90228,0.825684


TrainOutput(global_step=1881, training_loss=0.5631307797632212, metrics={'train_runtime': 139.52, 'train_samples_per_second': 215.582, 'train_steps_per_second': 13.482, 'total_flos': 996177423324672.0, 'train_loss': 0.5631307797632212, 'epoch': 3.0})

In [12]:
# [9] Test evaluation + classification report
from sklearn.metrics import classification_report

# Aggregate metrics on TEST (uses the HF eval loop)
test_metrics = trainer.evaluate(ds_test_tok)
print("TEST metrics:", test_metrics)
# Per-class metrics: get raw predictions, argmax to class IDs
pred_logits = trainer.predict(ds_test_tok).predictions
pred_ids = np.argmax(pred_logits, axis=1)

print("\nTEST classification report:\n",
      classification_report(y_test_id, pred_ids, target_names=list(le.classes_), digits=3))

TEST metrics: {'eval_loss': 0.5481747984886169, 'eval_acc': 0.9060027919962773, 'eval_f1_macro': 0.823120792426267, 'eval_runtime': 1.4518, 'eval_samples_per_second': 1480.258, 'eval_steps_per_second': 46.839, 'epoch': 3.0}

TEST classification report:
                       precision    recall  f1-score   support

 AVIONICS/ELECTRICAL      0.896     0.894     0.895       339
               CABIN      0.710     0.710     0.710       100
ENVIRONMENTAL/SAFETY      0.879     0.879     0.879        99
     FLIGHT_CONTROLS      0.702     0.690     0.696        58
       FLUID_SYSTEMS      0.703     0.933     0.802        89
          POWERPLANT      0.832     0.826     0.829       138
          STRUCTURES      0.961     0.942     0.951      1326

            accuracy                          0.906      2149
           macro avg      0.812     0.839     0.823      2149
        weighted avg      0.909     0.906     0.907      2149



About 90.6% of test examples are classified correctly, with a macro-F1 of about 0.823.

Main takeaways:
 - The model does very well on the dominant classes: STRUCTURES (F1 0.951) and AVIONICS/ELECTRICAL (F1 0.895).
 - Macro-F1 below accuracy indicates that some smaller classes perform worse.
 - CABIN (F1 0.710) and FLIGHT_CONTROLS (F1 0.696) are the weakest classes, with both precision and recall near 0.70.
 - FLUID_SYSTEMS has high recall (0.933) but weaker precision (0.703), which suggests the model over-predicts it in borderline cases: about 70% of its FLUID_SYSTEMS predictions are correct, while it finds about 93% of the true FLUID_SYSTEMS reports.
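
The gap between macro-F1 and the weighted average can be reproduced directly from the per-class F1 scores and supports in the test report above. The snippet is just arithmetic on those reported numbers; the variable names are chosen for this example.

```python
# (f1, support) per class, copied from the TEST classification report.
report = {
    "AVIONICS/ELECTRICAL": (0.895, 339),
    "CABIN": (0.710, 100),
    "ENVIRONMENTAL/SAFETY": (0.879, 99),
    "FLIGHT_CONTROLS": (0.696, 58),
    "FLUID_SYSTEMS": (0.802, 89),
    "POWERPLANT": (0.829, 138),
    "STRUCTURES": (0.951, 1326),
}

# Macro-F1: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted F1: each class counts in proportion to its support.
total = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total

print(round(macro_f1, 3), round(weighted_f1, 3))  # 0.823 0.907
```

STRUCTURES makes up most of the support, so it dominates the weighted average, while the macro average is pulled down by the smaller CABIN and FLIGHT_CONTROLS classes.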

Things to inspect and consider for improvement:
 - Inspect confusion: build a confusion matrix to see which classes are confused with CABIN.
 - Try a larger `MAX_LEN` if narratives are being truncated
 - Hyperparams: small LR sweep (1e-5–5e-5), epochs=3–5, adjust eval_steps
 - Model: try roberta-base (often stronger on classification) with same pipeline
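
As a starting point for the confusion inspection, the helper below lists the most frequent (true, predicted) error pairs. It is a small stdlib-only sketch, and `top_confusions` is a name invented here; `sklearn.metrics.confusion_matrix` would give the full matrix instead. The demo labels are made up.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=5):
    """Return the k most common (true_label, predicted_label) pairs among errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Made-up labels for illustration only.
demo_true = ["CABIN", "CABIN", "STRUCTURES", "FLUID_SYSTEMS", "CABIN"]
demo_pred = ["STRUCTURES", "STRUCTURES", "STRUCTURES", "POWERPLANT", "CABIN"]
demo_confusions = top_confusions(demo_true, demo_pred)
```

In the notebook, the integer IDs could first be mapped back to names, for example `top_confusions(le.inverse_transform(y_test_id), le.inverse_transform(pred_ids))`.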


In [13]:
# [10] Save best model + tokenizer + label classes
from pathlib import Path
import json

best_dir = Path(OUTPUT_DIR) / "best_model"
best_dir.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(best_dir))
tok.save_pretrained(str(best_dir))

with open(best_dir / "label_classes.json", "w") as f:
    json.dump(list(le.classes_), f, indent=2)

(best_dir / "id2label.json").write_text(json.dumps(id2label, indent=2))
(best_dir / "label2id.json").write_text(json.dumps(label2id, indent=2))

print("Saved to:", best_dir)

Saved to: ..\outputs\bert\best_model


In [14]:
# [11] Inference helper
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loaded_tok = AutoTokenizer.from_pretrained(best_dir)
loaded_model = AutoModelForSequenceClassification.from_pretrained(best_dir).to(device)

def predict_category(texts):
    if isinstance(texts, str):
        texts = [texts]
    enc = loaded_tok(texts, padding=True, truncation=True, max_length=MAX_LEN, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = loaded_model(**enc).logits
    pred_ids = logits.argmax(dim=1).cpu().numpy().tolist()
    return [id2label[i] for i in pred_ids]

# try it:
predict_category("FUEL PRESSURE LOW LIGHT ON DURING CLIMB; CHECKED LINES AND REPLACED PUMP.")

['FLUID_SYSTEMS']
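
The lightweight CLI mentioned in the goals could wrap the same inference logic. The sketch below is one possible shape, not the project's actual script: the argument names, the default `--model-dir`, and the injectable `predict_fn` (which lets the parser be tested without loading a model) are all assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Classify SDR discrepancy narratives.")
    parser.add_argument("texts", nargs="+", help="one or more narratives to classify")
    parser.add_argument("--model-dir", default="../outputs/bert/best_model",
                        help="directory containing the saved model and tokenizer")
    return parser


def main(argv=None, predict_fn=None):
    args = build_parser().parse_args(argv)
    if predict_fn is None:
        # Load the model lazily so that argument parsing stays cheap.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        tok = AutoTokenizer.from_pretrained(args.model_dir)
        model = AutoModelForSequenceClassification.from_pretrained(args.model_dir).eval()

        def predict_fn(texts):
            enc = tok(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
            with torch.no_grad():
                ids = model(**enc).logits.argmax(dim=1).tolist()
            return [model.config.id2label[i] for i in ids]

    labels = predict_fn(args.texts)
    for text, label in zip(args.texts, labels):
        print(f"{label}\t{text}")
    return labels
```

Saved as a script, this would call `main()` under the usual `if __name__ == "__main__":` guard, so that something like `python classify.py "FUEL PRESSURE LOW LIGHT ON DURING CLIMB"` prints one label per input line (the script name is hypothetical).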