# Notebook: Use a Fine-Tuned BERT Model for Inference on New Data

### Objective:

1. Load a fine-tuned model + metadata
2. Tokenize new data
3. Run inference
4. Save predictions to CSV



### Step 1 - Imports and Setup

In this step we: 
- Import `pandas`.
- Import utility functions from `local_llm.training.text_finetune`.
- Define:
    - Where the fine-tuned model artifacts live.
    - Where the unlabeled data is stored.
    - Which columns should be concatenated to form the text input.
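
Because both locations are hard-coded, a quick existence check before loading anything can catch a wrong path early. The sketch below is only illustrative: `check_inputs` is a hypothetical helper (not part of `local_llm`), and it is exercised on a temporary directory standing in for the real locations.

```python
from pathlib import Path
import tempfile


def check_inputs(finetune_dir: Path, unlabeled_csv_path: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if not finetune_dir.is_dir():
        problems.append(f"Artifacts directory not found: {finetune_dir}")
    if not unlabeled_csv_path.is_file():
        problems.append(f"Unlabeled CSV not found: {unlabeled_csv_path}")
    return problems


# Demonstration on a throwaway directory (stand-in for the real locations).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "finetune_bert"
    demo_dir.mkdir()
    demo_csv = Path(tmp) / "wbs_inference_data.csv"  # deliberately not created
    demo_problems = check_inputs(demo_dir, demo_csv)

print(demo_problems)
```

In the notebook itself, the same check could be called with `finetune_dir` and `unlabeled_csv_path` right after they are defined.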

In [2]:
from pathlib import Path
import pandas as pd

from local_llm.training.text_finetune import (
    set_seed,
    load_finetune_meta,
    load_finetuned_classifier_for_inference,
    encode_unlabeled_dataframe,
    predict_unlabeled_tensors,
    merge_unlabeled_with_predictions,
    export_unlabeled_predictions_csv,
)

print("✅ Imports complete.")

# Directory where the fine-tuning artifacts were saved by the training notebook
finetune_dir = Path("C:/Users/Cameron.Webster/Python/local-llm/artifacts/finetune_bert")

# Path to UNLABELED data you want to classify
unlabeled_csv_path = Path("C:/Users/Cameron.Webster/Python/local-llm/data/wbs_inference_data.csv")   # adjust if needed

# Text columns to concatenate (must match what you used when training)
text_cols = ("wbs_name", "keyword")

print(f"Fine-tune artifacts directory: {finetune_dir.resolve()}")
print(f"Unlabeled data CSV: {unlabeled_csv_path.resolve()}")
print(f"Text columns used for the model: {text_cols}")



✅ Imports complete.
Fine-tune artifacts directory: C:\Users\Cameron.Webster\Python\local-llm\artifacts\finetune_bert
Unlabeled data CSV: C:\Users\Cameron.Webster\Python\local-llm\data\wbs_inference_data.csv
Text columns used for the model: ('wbs_name', 'keyword')


### Step 2 - Load Metadata, Rebuild Config, and Load the Fine-Tuned Model

When you fine-tuned BERT, the training scripts saved:
- `classifier_full.pt` - full classifier (BERT encoder + classifier head) weights.
- `finetune_meta.json` - metadata about labels and training setup.

Here we: 
- Load the metadata to reconstruct label mappings (`label_to_id`, `id_to_label`).
- Rebuild a `FineTuneConfig` suitable for inference.
- Rebuild the model architecture and then load the fine-tuned weights.

This gives us the **same model** that was trained earlier, ready to run predictions.
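
As a rough illustration of the label-mapping step, the sketch below inverts a `label_to_id` dictionary to obtain `id_to_label`. The JSON layout is an assumption made for this example, since the actual schema of `finetune_meta.json` is not shown in this notebook; only the label names and IDs come from the mapping this notebook prints.

```python
import json

# Hypothetical metadata payload; the real file may use a different schema.
meta_text = json.dumps({
    "label_to_id": {
        "Construction": 0,
        "D&D": 1,
        "Design and NRE": 2,
        "OPC Testing and Startup": 3,
        "Process Equipment": 4,
        "SEPM": 5,
        "Site Preparation": 6,
        "Standard Equipment": 7,
    }
})

meta = json.loads(meta_text)
label_to_id = {str(label): int(idx) for label, idx in meta["label_to_id"].items()}

# Invert the mapping so predicted class indices can be turned back into names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

# An invertible mapping must not assign one ID to two labels.
assert len(id_to_label) == len(label_to_id)

print(id_to_label[5])  # SEPM
```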


In [3]:
# Optional: set seed for reproducibility of any stochastic ops
set_seed(42)
print("Random seed set to 42.\n")

# Load meta + model in one go
model, cfg, label_to_id, id_to_label, meta = load_finetuned_classifier_for_inference(
    output_dir=finetune_dir,
    text_cols=text_cols,
)

print("✅ Fine-tuned classifier loaded.")
print("Label mapping (label_to_id):")
for lab, idx in label_to_id.items():
    print(f"  {lab!r} -> {idx}")

print("\nSome key config fields for inference:")
print(f"  assets_dir:      {cfg.assets_dir}")
print(f"  output_dir:      {cfg.output_dir}")
print(f"  max_len:         {cfg.max_len}")
print(f"  pooling:         {cfg.pooling}")
print(f"  finetune_policy: {cfg.finetune_policy}")
print(f"  finetune_last_n: {cfg.finetune_last_n}")
print(f"  device:          {cfg.device}")

# Confirm model is on the right device
print("\nModel is on device:", next(model.parameters()).device)



Random seed set to 42.

✅ Fine-tuned classifier loaded.
Label mapping (label_to_id):
  'Construction' -> 0
  'D&D' -> 1
  'Design and NRE' -> 2
  'OPC Testing and Startup' -> 3
  'Process Equipment' -> 4
  'SEPM' -> 5
  'Site Preparation' -> 6
  'Standard Equipment' -> 7

Some key config fields for inference:
  assets_dir:      C:\Users\Cameron.Webster\Python\local-llm\assets\bert-base-local
  output_dir:      C:\Users\Cameron.Webster\Python\local-llm\artifacts\finetune_bert
  max_len:         256
  pooling:         cls
  finetune_policy: last_n
  finetune_last_n: 2
  device:          cuda

Model is on device: cuda:0


### Step 3 - Load New Unlabeled Data

Now we load the CSV file that contains new examples to classify. 

These rows: 
- Have text columns (here, `wbs_name` and `keyword`).
- Do not have label columns (we want to predict them).

We print:
- Shape of the data
- Column names
- First few rows

This helps verify that the text columns we configured actually exist in the data.
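
The visual check can also be made explicit. The following is a minimal sketch, assuming a hypothetical helper `missing_text_columns` (not part of `local_llm`); it works on any iterable of column names, so in the notebook it could be given `unlabeled_df.columns` and `text_cols`.

```python
def missing_text_columns(columns, text_cols):
    """Return the configured text columns that are absent from `columns`."""
    available = set(columns)
    return [col for col in text_cols if col not in available]


# Column names as printed for the inference CSV in this notebook.
columns = ["parsid", "wbs", "wbs_name", "keyword"]

print(missing_text_columns(columns, ("wbs_name", "keyword")))              # []
print(missing_text_columns(columns, ("wbs_name", "wbs_title_hierarchy")))  # ['wbs_title_hierarchy']
```

Raising an error when the returned list is non-empty would stop the run before encoding, rather than failing later with a less obvious message.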


In [4]:
print(f"Loading unlabeled data from: {unlabeled_csv_path.resolve()}")
unlabeled_df = pd.read_csv(unlabeled_csv_path)

print("\n✅ Unlabeled data loaded.")
print(f"Data shape: {unlabeled_df.shape[0]} rows × {unlabeled_df.shape[1]} columns")
print("Columns:", unlabeled_df.columns.tolist())

print("\nFirst 5 rows of the unlabeled data:")
display(unlabeled_df.head())


Loading unlabeled data from: C:\Users\Cameron.Webster\Python\local-llm\data\wbs_inference_data.csv

✅ Unlabeled data loaded.
Data shape: 802 rows × 4 columns
Columns: ['parsid', 'wbs', 'wbs_name', 'keyword']

First 5 rows of the unlabeled data:


|   | parsid | wbs | wbs_name | keyword |
|---|---|---|---|---|
| 0 | 1268 | 1.10.01 | Management Support | management support |
| 1 | 1268 | 1.10.01.02 | Project Management | management support project management |
| 2 | 1268 | 1.10.01.03 | Subcontractor Support | management support |
| 3 | 1268 | 1.10.01.SM | Program Management | management support program management |
| 4 | 1268 | 1.10.02.01 | Design Engineering | design engineering |


### Step 4 - Encode Unlabeled Text into Model-Ready Tensors

The model can't work directly with raw text; it needs:
- `input_ids` - token IDs for each position in the sequence.
- `token_type_ids` - segment IDs (for single-sentence inputs these are usually all zeros).
- `attention_mask` - 1 where tokens are real, 0 where they are padding.

In this step we: 
- Use `encode_unlabeled_dataframe` (from the library) to: 
    - Concatenate your selected text columns into a single string.
    - Tokenize and encode each example.
    - Return the three tensors in a dict.

We then print the shapes so you can see how many examples there are and how long each sequence is.
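
To make the three tensors concrete, here is a toy sketch of the same idea in plain Python. It is not how `encode_unlabeled_dataframe` is implemented: it uses a whitespace tokenizer with a made-up vocabulary instead of BERT's WordPiece tokenizer, and `toy_encode` is a hypothetical name. It assumes `max_len` is at least 2 so the two special tokens fit.

```python
def toy_encode(rows, text_cols, max_len):
    """Concatenate text columns, map words to toy IDs, and pad to max_len."""
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
    encoded = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for row in rows:
        # Join the selected columns into one string, as described above.
        text = " ".join(str(row[col]) for col in text_cols)
        # Truncate so that [CLS] and [SEP] still fit within max_len.
        tokens = ["[CLS]"] + text.lower().split()[: max_len - 2] + ["[SEP]"]
        # Unseen words get the next free ID in the toy vocabulary.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        pad = max_len - len(ids)
        encoded["input_ids"].append(ids + [0] * pad)
        encoded["token_type_ids"].append([0] * max_len)  # single segment
        encoded["attention_mask"].append([1] * len(ids) + [0] * pad)
    return encoded


rows = [
    {"wbs_name": "Design Engineering", "keyword": "design engineering"},
    {"wbs_name": "Management Support", "keyword": "management support"},
]
toy = toy_encode(rows, ("wbs_name", "keyword"), max_len=8)

for name, values in toy.items():
    print(name, values[0])
```

Every row has the same length (`max_len`), which is why the real tensors all share one shape, and the attention mask marks which positions hold real tokens rather than padding.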

In [5]:
unlabeled_tensors = encode_unlabeled_dataframe(
    df=unlabeled_df,
    cfg=cfg,
)

print("✅ Unlabeled data encoded.")
print("Tensor shapes:")
for k, v in unlabeled_tensors.items():
    print(f"  {k}: {tuple(v.shape)}")


✅ Unlabeled data encoded.
Tensor shapes:
  input_ids: (802, 256)
  token_type_ids: (802, 256)
  attention_mask: (802, 256)


### Step 5 - Run Inference and Collect Predictions

With the text encoded and the model loaded, we can now run inference:
- `predict_unlabeled_tensors`: 
    - Wraps the tensors into a dataset and dataloader.
    - Runs the model in evaluation mode (no gradient updates).
    - For each example, computes:
        - `pred_label_id` - the predicted class index.
        - `pred_label` - the human-readable label (using `id_to_label`).
        - `pred_confidence` - the model's confidence for that label.

We'll inspect the first few prediction rows to see what the model is doing.
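
The three output columns can be illustrated with a small standalone sketch. It assumes that `pred_confidence` is the softmax probability of the highest-scoring class, which is a common convention but is not confirmed by this notebook; `predict_from_logits` is a hypothetical helper and the logits are made up.

```python
import math


def predict_from_logits(logits, id_to_label):
    """Turn one row of raw class scores into (label_id, label, confidence)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the index with the highest probability.
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return pred_id, id_to_label[pred_id], probs[pred_id]


id_to_label = {0: "Construction", 1: "D&D", 2: "Design and NRE"}
pred_id, pred_label, confidence = predict_from_logits([0.5, -1.0, 4.0], id_to_label)
print(pred_id, pred_label, round(confidence, 4))
```

Under this assumption, a confidence near 1.0 means one class's score dominates the others; it does not by itself guarantee that the prediction is correct.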

In [6]:
preds_df = predict_unlabeled_tensors(
    model=model,
    unlabeled_tensors=unlabeled_tensors,
    cfg=cfg,
    id_to_label=id_to_label,
)

print("✅ Inference complete.")
print(f"Number of predictions: {len(preds_df)}")

print("\nFirst 5 prediction rows:")
display(preds_df.head())


✅ Inference complete.
Number of predictions: 802

First 5 prediction rows:


|   | pred_label_id | pred_label | pred_confidence |
|---|---|---|---|
| 0 | 5 | SEPM | 0.997838 |
| 1 | 5 | SEPM | 0.997665 |
| 2 | 5 | SEPM | 0.99706 |
| 3 | 5 | SEPM | 0.998559 |
| 4 | 2 | Design and NRE | 0.997837 |


### Step 6 - Merge Predictions Back Onto the Original Rows


The `preds_df` we just created only contains prediction-related columns.

To make the results more useful, we merge predictions with the original data, so each row has:
- The original input columns (here, `parsid`, `wbs`, `wbs_name`, and `keyword`).
- `pred_label_id` - numeric class ID.
- `pred_label` - human-readable label.
- `pred_confidence` - the model's confidence.

This way you can filter, sort, and analyze results easily.
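
As a sketch of what such a merge and a follow-up filter might look like, the example below pairs rows by position using plain dictionaries in place of DataFrame rows. Positional alignment is an assumption here; how `merge_unlabeled_with_predictions` aligns rows is not shown in this notebook. `merge_rows` and the 0.9 threshold are hypothetical.

```python
def merge_rows(raw_rows, pred_rows):
    """Pair each input row with the prediction at the same position."""
    if len(raw_rows) != len(pred_rows):
        raise ValueError("Row counts differ; positional merge would misalign.")
    return [{**raw, **pred} for raw, pred in zip(raw_rows, pred_rows)]


raw_rows = [
    {"wbs": "1.10.01", "wbs_name": "Management Support"},
    {"wbs": "1.10.02.01", "wbs_name": "Design Engineering"},
]
pred_rows = [
    {"pred_label": "SEPM", "pred_confidence": 0.997838},
    {"pred_label": "Design and NRE", "pred_confidence": 0.997837},
]

merged = merge_rows(raw_rows, pred_rows)

# Example follow-up: flag rows below an arbitrary review threshold.
needs_review = [row for row in merged if row["pred_confidence"] < 0.9]

print(merged[0]["wbs_name"], "->", merged[0]["pred_label"])
print(len(needs_review))
```

The length check matters because a positional merge silently mislabels rows if any input row was dropped or reordered between encoding and prediction.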

In [7]:
merged_df = merge_unlabeled_with_predictions(
    raw_df=unlabeled_df,
    preds_df=preds_df,
)

print("✅ Merged original data with predictions.")
print(f"Merged shape: {merged_df.shape[0]} rows × {merged_df.shape[1]} columns")

print("\nFirst 5 rows of merged data:")
display(merged_df.head())


✅ Merged original data with predictions.
Merged shape: 802 rows × 7 columns

First 5 rows of merged data:


|   | parsid | wbs | wbs_name | keyword | pred_label_id | pred_label | pred_confidence |
|---|---|---|---|---|---|---|---|
| 0 | 1268 | 1.10.01 | Management Support | management support | 5 | SEPM | 0.997838 |
| 1 | 1268 | 1.10.01.02 | Project Management | management support project management | 5 | SEPM | 0.997665 |
| 2 | 1268 | 1.10.01.03 | Subcontractor Support | management support | 5 | SEPM | 0.99706 |
| 3 | 1268 | 1.10.01.SM | Program Management | management support program management | 5 | SEPM | 0.998559 |
| 4 | 1268 | 1.10.02.01 | Design Engineering | design engineering | 2 | Design and NRE | 0.997837 |


### Step 7 - Save Predictions to CSV and Inspect Basic Stats

Finally, we save the merged predictions to disk:
- The file is written inside the same output directory used for fine-tuning.
- You can open it in Excel, pandas, or any other tool. 

We also:
- Reload the CSV for a quick sanity check. 
- Show the distribution of predicted labels.
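
The save, reload, and count pattern can be sketched with the standard library alone. This is a stand-in for illustration, not the implementation of `export_unlabeled_predictions_csv`; it writes to a temporary directory and uses `collections.Counter` in place of pandas `value_counts`.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

rows = [
    {"wbs_name": "Management Support", "pred_label": "SEPM"},
    {"wbs_name": "Project Management", "pred_label": "SEPM"},
    {"wbs_name": "Design Engineering", "pred_label": "Design and NRE"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "unlabeled_predictions.csv"

    # newline="" lets the csv module control line endings on every platform.
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["wbs_name", "pred_label"])
        writer.writeheader()
        writer.writerows(rows)

    # Reload the file to confirm that every row survived the round trip.
    with out_path.open(newline="", encoding="utf-8") as f:
        reloaded = list(csv.DictReader(f))

label_counts = Counter(row["pred_label"] for row in reloaded)
print(len(reloaded), label_counts.most_common())
```

Comparing the reloaded row count with the number of rows written is a cheap way to detect a truncated or partially written file.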

In [8]:
output_csv_path = export_unlabeled_predictions_csv(
    merged_df=merged_df,
    cfg=cfg,
    filename="unlabeled_predictions.csv",
)

print("✅ Predictions saved to CSV:")
print(output_csv_path.resolve())

# Optional: reload to sanity-check
loaded_preds = pd.read_csv(output_csv_path)
print(f"\nReloaded {len(loaded_preds)} rows from saved predictions.")
print("Columns:", loaded_preds.columns.tolist())

print("\nFirst 5 rows of saved predictions:")
display(loaded_preds.head())

print("\nPredicted label distribution:")
display(loaded_preds["pred_label"].value_counts())


✅ Predictions saved to CSV:
C:\Users\Cameron.Webster\Python\local-llm\artifacts\finetune_bert\unlabeled_predictions.csv

Reloaded 802 rows from saved predictions.
Columns: ['parsid', 'wbs', 'wbs_name', 'keyword', 'pred_label_id', 'pred_label', 'pred_confidence']

First 5 rows of saved predictions:


|   | parsid | wbs | wbs_name | keyword | pred_label_id | pred_label | pred_confidence |
|---|---|---|---|---|---|---|---|
| 0 | 1268 | 1.10.01 | Management Support | management support | 5 | SEPM | 0.997838 |
| 1 | 1268 | 1.10.01.02 | Project Management | management support project management | 5 | SEPM | 0.997665 |
| 2 | 1268 | 1.10.01.03 | Subcontractor Support | management support | 5 | SEPM | 0.99706 |
| 3 | 1268 | 1.10.01.SM | Program Management | management support program management | 5 | SEPM | 0.998559 |
| 4 | 1268 | 1.10.02.01 | Design Engineering | design engineering | 2 | Design and NRE | 0.997837 |



Predicted label distribution:


pred_label
Process Equipment          436
Standard Equipment         159
SEPM                        76
Design and NRE              48
OPC Testing and Startup     45
Construction                36
D&D                          2
Name: count, dtype: int64