### Preliminaries

Essential libraries for data handling, model training, and dataset preparation.

In [65]:
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F

from torch import nn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Imports the custom factory modules used in the pipeline:
- `token_factory`: Handles preprocessing/tokenization based on input type  
- `dataloader_factory`: Builds dataloaders for labeled, unlabeled, and validation sets  
- `model_factory`: Constructs models based on input type and task

The modules are reloaded with `importlib.reload()` so that recent edits take effect without restarting the kernel, which is useful in interactive environments such as notebooks.

In [None]:
import importlib
import token_factory as tf
import dataloader_factory as dl
import model_factory as md

importlib.reload(tf)
importlib.reload(dl)
importlib.reload(md)
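
The effect of `importlib.reload()` can be seen in isolation with a throwaway module. The sketch below is only an illustration: `demo_factory` is a hypothetical module written to a temporary directory, not one of the pipeline's factory modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical throwaway module, created only for this demonstration.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "demo_factory.py"
module_file.write_text("VERSION = 1\n")

sys.path.insert(0, str(module_dir))
import demo_factory

version_before = demo_factory.VERSION

# Edit the source on disk, as one would while developing a factory module.
# The new text has a different length, so any stale bytecode cache is ignored.
module_file.write_text("VERSION = 'edited'\n")
importlib.invalidate_caches()

# A second plain `import demo_factory` would return the cached module object;
# reload() re-executes the source and updates that object in place.
importlib.reload(demo_factory)
version_after = demo_factory.VERSION

print(version_before, version_after)
```

Names bound earlier with `from module import name` keep pointing at the old objects after a reload, so such imports need to be re-run as well.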

### Configuration

This dictionary contains all the settings used for training the pseudo-labeling model on tabular data.

Sections:
- **General**: Includes session ID and random seed for reproducibility.
- **Pseudo-Labeling Settings**:
  - `learning_rate`: Controls how fast the model learns.
  - `confidence_threshold`: Minimum confidence required to accept a pseudo-label from the model.
  - `epochs`: Number of full passes through the training data.
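- **Dataset Paths & Structure**:
  - `labeled_dataset_path` and `unlabeled_dataset_path` must point to the CSV files read in the main loop; the validation set is split off from the labeled data.
  - `input_type`: "image", "text", or "tabular" in the general pipeline. This notebook is tabular-only, so the key does not appear in the dictionary below.
  - `validation_set_percentage`: Share of the labeled data held out for validation. It is passed to `train_test_split` as `test_size`, so set it to a fraction between 0 and 1 (for example 0.2); the placeholder `0` in the template below is rejected by scikit-learn.
  - `batch_size`: Number of samples per batch in every dataloader.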
- **Tabular Input Columns**:
  - `categorical_columns`: List of columns treated as categories (e.g., gender, city).
  - `numeric_columns`: List of columns with continuous values (e.g., age, income).
  - `tabular_target_column`: Column name of the prediction target (must be categorical for this setup).
<br><br>
> 💡 Detailed explanations of configuration variables can be found in the README.md

In [None]:
config = {
    # General
    "training_session": 1,
    "seed": 27,

    # Pseudo-Labeling Model
    "learning_rate": 3e-4,
    "confidence_threshold": 0.98,
    "epochs": 20,

    # Dataset
    "labeled_dataset_path": "",
    "unlabeled_dataset_path": "",
    "validation_set_percentage": 0,
    "batch_size": 64,

    # Tabular input
    "categorical_columns": [],
    "numeric_columns": [],
    "tabular_target_column": "",
}
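
Since the template ships with placeholder values, a quick sanity check before training can catch an unfilled configuration. The following is a minimal sketch, not part of the pipeline: `validate_config` is a hypothetical helper, and it assumes `validation_set_percentage` is given as a fraction rather than an absolute sample count.

```python
def validate_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    if not 0.0 < config["confidence_threshold"] <= 1.0:
        problems.append("confidence_threshold must be in (0, 1]")
    if config["learning_rate"] <= 0:
        problems.append("learning_rate must be positive")
    if config["epochs"] < 1:
        problems.append("epochs must be at least 1")
    # Assumption: the value is a fraction, not an absolute number of samples.
    if not 0.0 < config["validation_set_percentage"] < 1.0:
        problems.append("validation_set_percentage must be a fraction in (0, 1)")
    for key in ("labeled_dataset_path", "unlabeled_dataset_path", "tabular_target_column"):
        if not config[key]:
            problems.append(f"{key} is empty")
    if not config["categorical_columns"] and not config["numeric_columns"]:
        problems.append("no feature columns listed")

    return problems


# The untouched template values from the configuration cell.
template = {
    "learning_rate": 3e-4,
    "confidence_threshold": 0.98,
    "epochs": 20,
    "labeled_dataset_path": "",
    "unlabeled_dataset_path": "",
    "validation_set_percentage": 0,
    "categorical_columns": [],
    "numeric_columns": [],
    "tabular_target_column": "",
}
template_problems = validate_config(template)

# Illustrative values only; the file and column names are made up.
filled = dict(
    template,
    labeled_dataset_path="labeled.csv",
    unlabeled_dataset_path="unlabeled.csv",
    validation_set_percentage=0.2,
    categorical_columns=["city"],
    numeric_columns=["age"],
    tabular_target_column="churned",
)
filled_problems = validate_config(filled)

for problem in template_problems:
    print("-", problem)
```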

### Training Utilities

***train_one_epoch(...)***

Trains the model for a single epoch using labeled data.

How it works:
- Sets the model to training mode (enables dropout and batch-norm statistic updates).
- Iterates through each batch of labeled data.
- Computes the **Cross Entropy Loss**, suitable for classification tasks.
- Performs backpropagation and updates the model using the optimizer.

Parameters:
- `model`: The classification model to train.
- `loader`: Dataloader containing labeled training samples.
- `device`: Target device (CPU or GPU).
- `optimizer`: Optimizer used to update model weights.
<br><br>
> 💡 This function is used during the initial supervised training phase and later when retraining with pseudo-labeled data.

In [67]:
def train_one_epoch(model, loader, device, optimizer):
    model.train()
    
    loss_function = nn.CrossEntropyLoss()
    total_loss = 0.00
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        optimizer.zero_grad()
        outputs = model(x)

        loss = loss_function(outputs, y)
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    print(f"Total Loss: {total_loss:.4f}")
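
To make the loss in `train_one_epoch` concrete, the sketch below recomputes cross-entropy by hand and compares it with PyTorch's built-in function. The logits and targets are arbitrary illustration values. Note that the loss expects raw logits (no softmax applied beforehand) and integer class indices.

```python
import torch
import torch.nn.functional as F

# Two samples, three classes; values are arbitrary illustration data.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# Manual computation: log-softmax, pick the log-probability of the true class,
# negate, and average over the batch.
log_probabilities = torch.log_softmax(logits, dim=1)
picked = log_probabilities[torch.arange(len(targets)), targets]
manual_loss = -picked.mean()

# Built-in equivalent (the default reduction is "mean").
builtin_loss = F.cross_entropy(logits, targets)

print(f"manual: {manual_loss.item():.6f} | built-in: {builtin_loss.item():.6f}")
```

Because each batch loss is already a mean, the printed `Total Loss` is a sum of per-batch means and therefore grows with the number of batches.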

***evaluate(...)***

Evaluates the model’s performance on a labeled validation dataset.

How it works:
- Switches the model to **evaluation mode** (disables dropout and freezes batch-norm statistics) and runs inference under `torch.no_grad()` so that no gradients are tracked.
- Iterates over the validation data.
- Computes the **cross-entropy loss** for classification tasks.
- Generates predicted class labels using `argmax`.
- Calculates and prints **accuracy** across all validation samples.

Parameters:
- `model`: The trained model to evaluate.
- `loader`: Dataloader for the validation set.
- `device`: Target device (CPU or GPU).
<br><br>
> 💡 This function is used to monitor validation performance after training and pseudo-labeling phases.

In [68]:
def evaluate(model, loader, device):
    model.eval()

    all_predictions, all_labels = [], []
    total_loss = 0.00

    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)

            logits = model(x)
            predictions = torch.argmax(logits, dim=1)

            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(y.cpu().numpy())

            loss = F.cross_entropy(logits, y.long())
            total_loss += loss.item()

    accuracy = np.mean(np.array(all_predictions) == np.array(all_labels))
    print(f"Validation Accuracy: {accuracy:.4f} | Loss: {total_loss:.4f}")
    
    return all_predictions, all_labels
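
The prediction and accuracy steps of `evaluate` reduce to an `argmax` over the class axis followed by an element-wise comparison. A small worked example with made-up logits and labels:

```python
import numpy as np

# Four samples, two classes; values are made up for illustration.
logits = np.array([
    [2.0, 0.1],   # highest score: class 0
    [0.3, 1.5],   # highest score: class 1
    [0.9, 0.2],   # highest score: class 0
    [0.4, 0.6],   # highest score: class 1
])
labels = np.array([0, 1, 1, 1])

# The predicted class is the index of the largest logit in each row.
predictions = logits.argmax(axis=1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = np.mean(predictions == labels)
print(f"predictions = {predictions.tolist()} | accuracy = {accuracy:.2f}")
```

Three of the four predictions match the labels here, so the accuracy is 0.75.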

***train_without_pseudo_labels(...)***

Performs the **initial training phase** of the Pseudo-Labeling algorithm using only labeled data. After training, the model is used to generate pseudo-labels for confident predictions on the unlabeled data.

Phase 1: Supervised Training
- Trains the model using only the labeled dataset.
- After each epoch, the model is evaluated on a validation set.
- The best-performing model (based on validation accuracy) is saved to disk.

Phase 2: Pseudo-Label Generation
- Switches the model to evaluation mode. The weights in memory are those from the final epoch, not necessarily the saved best checkpoint.
- For each batch of **unlabeled data**:
  - Predicts class probabilities using `softmax`.
  - Retains only the samples where the model is confident (`confidence >= threshold`).
  - These confident predictions are returned as pseudo-labeled data.

Parameters:
- `model`: The tabular classification model to train.
- `labeled_loader`: Dataloader containing the labeled training data.
- `validation_loader`: Dataloader for evaluating validation performance.
- `unlabeled_loader`: Dataloader for the unlabeled dataset (no ground truth).
- `device`: The target device (CPU or GPU).
<br><br>
> ⚠️ Only confident predictions are retained to reduce the risk of propagating noisy pseudo-labels.

In [69]:
def train_without_pseudo_labels(model, labeled_loader, validation_loader, unlabeled_loader, device):
    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])

    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    best_model_path = f"../../models/pseudo_label/best_model_tabular_{config['training_session']}.pt"
    best_accuracy = 0.00

    # Train on labeled data
    for epoch in range(1, config["epochs"] + 1):
        print(f"--- Start of Epoch {epoch}! ---")

        train_one_epoch(model, labeled_loader, device, optimizer)

        predictions, labels = evaluate(model, validation_loader, device)
        validation_accuracy = accuracy_score(labels, predictions)

        if validation_accuracy > best_accuracy:
            best_accuracy = validation_accuracy
            torch.save(model.state_dict(), best_model_path)
    
            print(f"✅ Best model saved to {best_model_path} | Accuracy: {best_accuracy:.4f}")

    print("--- End of Training ---")
    
    # Generate pseudo-labels
    model.eval()
    pseudo_features, pseudo_labels = [], []
    with torch.no_grad():
        for x in unlabeled_loader:
            x = x.to(device)

            outputs = model(x)

            probabilities = torch.softmax(outputs, dim=1)
            confidence, predictions = torch.max(probabilities, dim=1)
            mask = confidence >= config["confidence_threshold"]

            pseudo_features.append(x[mask].cpu())
            pseudo_labels.append(predictions[mask].cpu())
            
    return pseudo_features, pseudo_labels
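
The confidence filter in the second phase can be traced on a tiny batch. This is a sketch with hand-picked logits; `features` is a hypothetical stand-in for one batch `x` from the unlabeled loader.

```python
import torch

confidence_threshold = 0.98  # same value as in the configuration template

# Hypothetical batch of three unlabeled rows with two features each.
features = torch.arange(6, dtype=torch.float32).reshape(3, 2)

# Hand-picked model outputs for those rows (two classes).
logits = torch.tensor([
    [10.0, 0.0],   # very confident in class 0
    [0.1, 0.0],    # nearly undecided
    [0.0, 8.0],    # very confident in class 1
])

probabilities = torch.softmax(logits, dim=1)
confidence, predictions = torch.max(probabilities, dim=1)

# Boolean mask selecting only the rows at or above the threshold.
mask = confidence >= confidence_threshold

kept_features = features[mask]
kept_labels = predictions[mask]

print(mask.tolist(), kept_labels.tolist())
```

Only the first and third rows pass the threshold; the undecided middle row is dropped instead of receiving a noisy label.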

***train_with_pseudo_labels(...)***

Fine-tunes the model using a **combined dataset** of real labeled samples and high-confidence pseudo-labeled samples generated from the previous phase.

How it works:
- Continues training the model using the expanded dataset (original + pseudo-labeled).
- Tracks validation accuracy after each epoch.
- Saves the best-performing model to a separate path (`_pseudo.pt` suffix).

Parameters:
- `model`: The tabular model to be fine-tuned.
- `labeled_loader`: Dataloader containing both real and pseudo-labeled training data.
- `validation_loader`: Dataloader used to monitor model performance.
- `device`: Target device (CPU or GPU).
<br><br>
> 💡 This phase helps the model generalize better by exposing it to confidently predicted unlabeled samples.

In [None]:
def train_with_pseudo_labels(model, labeled_loader, validation_loader, device):
    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])

    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    best_model_path = f"../../models/pseudo_label/best_model_tabular_{config['training_session']}_pseudo.pt"
    best_accuracy = 0.00

    # Fine-tune the model with pseudo-labeled data
    for epoch in range(1, config["epochs"] + 1):
        print(f"--- Start of Epoch {epoch}! ---")

        train_one_epoch(model, labeled_loader, device, optimizer)

        predictions, labels = evaluate(model, validation_loader, device)
        validation_accuracy = accuracy_score(labels, predictions)

        if validation_accuracy > best_accuracy:
            best_accuracy = validation_accuracy
            torch.save(model.state_dict(), best_model_path)
            print(f"✅ Best model saved to {best_model_path} | Accuracy: {best_accuracy:.4f}")

    print("--- End of Training ---")
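
Both training functions write the best `state_dict` to disk but leave the model holding its final-epoch weights. The sketch below shows one way to restore a saved checkpoint. It uses a hypothetical `nn.Linear` stand-in and a temporary directory rather than the notebook's model and paths.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # hypothetical stand-in for the tabular model

# torch.save does not create missing parent directories, so make them first.
checkpoint_dir = os.path.join(tempfile.mkdtemp(), "models", "pseudo_label")
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, "best_model_demo.pt")

# Save the weights at the "best" epoch.
torch.save(model.state_dict(), checkpoint_path)
best_weight = model.weight.detach().clone()

# Simulate further training that changes the weights.
with torch.no_grad():
    model.weight.add_(1.0)
changed = not torch.equal(model.weight, best_weight)

# Restore the best checkpoint before inference or pseudo-label generation.
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)
restored = torch.equal(model.weight, best_weight)

print(f"changed after training: {changed} | restored from checkpoint: {restored}")
```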

### Main Training Loop

This is the main script that runs the **Pseudo-Labeling training process** for tabular classification tasks. It follows a two-phase structure: initial supervised training, followed by fine-tuning with pseudo-labeled data.

General Workflow
1. **Device Setup**  
   Automatically selects GPU if available, otherwise defaults to CPU.

2. **Data Loading**  
   Reads labeled and unlabeled tabular datasets from CSV files.

3. **Train-Validation Split**  
   Splits the labeled dataset into training and validation sets using stratified sampling.

4. **Tokenization (Preprocessing)**  
   - `token_factory()` returns a tokenizer that is fitted on the training split only and then prepares the data:
     - Encodes categorical columns.
     - Normalizes numeric columns.
     - Converts target labels into class indices.

5. **Dataloader Creation**  
   - `dataloader_factory()` builds three loaders:
     - `labeled_loader`: For supervised training.
     - `unlabeled_loader`: For pseudo-label generation.
     - `validation_loader`: For tracking accuracy.

6. **Model Initialization**  
   Creates a tabular classification model (typically an MLP) based on input dimensions and number of classes.

7. **Phase 1: Supervised Training**  
   Trains the model using only labeled data via `train_without_pseudo_labels()`, then uses the trained model to generate **pseudo-labels** for confident unlabeled examples.

8. **Pseudo-Label Filtering**  
   If no confident predictions were found (based on `confidence_threshold`), the script exits early.

9. **Dataset Combination**  
   Combines original labeled data with pseudo-labeled examples using `torch.cat()` to create a new training set.
   
10. **Phase 2: Fine-Tuning with Pseudo-Labels**  
    Re-trains the model using the expanded dataset via `train_with_pseudo_labels()`.
<br><br>
> 💡 This pipeline is designed to boost performance when labeled data is limited but unlabeled data is abundant — a common scenario in real-world banking and business datasets.
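
Step 3 relies on stratified sampling, which keeps the class ratio the same in the training and validation splits. A minimal illustration with synthetic labels (80 samples of class 0, 20 of class 1); the arrays are made up and only stand in for the labeled dataframe:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% of them in the positive class.
features = np.arange(100).reshape(-1, 1)
labels = np.array([0] * 80 + [1] * 20)

X_tr, X_val, y_tr, y_val = train_test_split(
    features,
    labels,
    test_size=0.2,      # fraction held out for validation
    stratify=labels,    # preserve the class ratio in both splits
    random_state=27,
)

train_positive_rate = y_tr.mean()
validation_positive_rate = y_val.mean()
print(f"train: {train_positive_rate:.2f} | validation: {validation_positive_rate:.2f}")
```

Without `stratify`, a small validation split could by chance contain very few samples of a rare class, which makes the accuracy estimate unstable.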

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from token_factory import token_factory
from model_factory import model_factory
from dataloader_factory import dataloader_factory, combined_dataloader_factory

# Load labeled and unlabeled dataset
labeled_dataframe = pd.read_csv(config["labeled_dataset_path"])
unlabeled_dataframe = pd.read_csv(config["unlabeled_dataset_path"])

# Split labeled dataset into train and validation sets
train_dataframe, validation_dataframe = train_test_split(
    labeled_dataframe,
    test_size=config["validation_set_percentage"],
    stratify=labeled_dataframe[config["tabular_target_column"]],
    random_state=config["seed"]
)

# Obtain the tokenizer for tabular inputs
tokenizer = token_factory(
    categorical_columns=config["categorical_columns"],
    numeric_columns=config["numeric_columns"],
    target_column=config["tabular_target_column"],
)

# Fit only on training dataframe
tokenizer.fit(train_dataframe)

# Tokenize features
X_train = tokenizer.transform(train_dataframe)
y_train = tokenizer.transform_target(train_dataframe)

X_validation = tokenizer.transform(validation_dataframe)
y_validation = tokenizer.transform_target(validation_dataframe)

X_unlabeled = tokenizer.transform(unlabeled_dataframe)

# Create dataloaders
labeled_loader, unlabeled_loader, validation_loader = dataloader_factory(
    X_train=X_train, y_train=y_train, 
    X_validation=X_validation, y_validation=y_validation, 
    X_unlabeled=X_unlabeled, batch_size=config["batch_size"]
)

# Create MLP model
# Use the tokenized feature width; it can differ from the raw column count if the tokenizer expands or drops columns
input_dim = X_train.shape[1]
num_classes = labeled_dataframe[config["tabular_target_column"]].nunique()
model = model_factory(
    input_dim=input_dim,
    num_classes=num_classes,
).to(device)

# Train on labeled data and generate pseudolabels
X_pseudo, y_pseudo = train_without_pseudo_labels(model, labeled_loader, validation_loader, unlabeled_loader, device)

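# No pseudo-labels were generated because the model is not confident enough.
# Each list holds one tensor per unlabeled batch (possibly empty), so count the retained samples.
if sum(len(batch) for batch in y_pseudo) == 0:
    print("No pseudo-labels generated")
    exit(0)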

# Combine generated pseudolabels with original labeled dataset
X_combined = torch.cat([
    torch.tensor(X_train, dtype=torch.float32),  
    torch.cat(X_pseudo, dim=0)            
], dim=0)
y_combined = torch.cat([
    torch.tensor(y_train, dtype=torch.long),
    torch.cat(y_pseudo, dim=0)
], dim=0)

# Create dataloader for combined dataset
labeled_loader = combined_dataloader_factory(
    X_combined=X_combined, y_combined=y_combined, batch_size=config["batch_size"]
)

# Train on labeled data with pseudolabels
train_with_pseudo_labels(model, labeled_loader, validation_loader, device)
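
The pseudo-label lists returned by the first phase hold one tensor per unlabeled batch, including empty tensors for batches where nothing passed the threshold. The number of retained samples is therefore the sum of the batch lengths rather than the list length. The sketch below shows the counting and concatenation with hypothetical stand-in tensors:

```python
import torch

# Hypothetical stand-ins: 5 labeled rows with 3 features each.
X_train = torch.zeros(5, 3)
y_train = torch.zeros(5, dtype=torch.long)

# Per-batch pseudo-label outputs; the second batch retained nothing.
X_pseudo = [torch.ones(2, 3), torch.empty(0, 3), torch.ones(1, 3)]
y_pseudo = [
    torch.tensor([1, 0]),
    torch.empty(0, dtype=torch.long),
    torch.tensor([1]),
]

# len(X_pseudo) is the number of batches (3 here), not the number of samples.
num_batches = len(X_pseudo)
num_pseudo = sum(batch.shape[0] for batch in X_pseudo)

# torch.cat accepts empty tensors as long as the other dimensions match.
X_combined = torch.cat([X_train, *X_pseudo], dim=0)
y_combined = torch.cat([y_train, *y_pseudo], dim=0)

print(num_batches, num_pseudo, tuple(X_combined.shape), tuple(y_combined.shape))
```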