# Semi-supervised pipeline to complete YieldStrength

This notebook implements a semi-supervised workflow to predict and fill the column `YieldStrength` when many rows are unlabeled.

In [1]:
# Basic imports and module path (adjust if repo layout differs)
import sys, os
from pathlib import Path
module_path = os.path.abspath('../src')
if module_path not in sys.path: sys.path.append(module_path)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

# ML imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Self Training Regressor
from self_training_regression import SelfTrainingRegressorCustom

In [2]:
# Paths and target column
TEST_PATH = Path('..') / 'data' / 'test_normalised.csv'
TRAIN_PATH = Path('..') / 'data' / 'train_normalised.csv'
target_col = 'YieldStrength'

# Load data (expect CSVs to exist)
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)
full_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
print('Loaded shapes:\n train:', train_df.shape, 'test:', test_df.shape, 'full:', full_df.shape)


Loaded shapes:
 train: (1500, 45) test: (152, 45) full: (1652, 45)


## Prepare labeled / unlabeled data

Only part of the dataset is labeled (semi-supervised setting). Rows whose `YieldStrength` is missing (NaN) are treated as unlabeled; no labels are masked artificially.

In [3]:
X_full = full_df.drop(columns=[target_col])
y_full = full_df[target_col].copy()

Identify labeled and unlabeled samples (missing target values)

In [4]:
labeled_mask = y_full.notna()
n_labeled = labeled_mask.sum()
n_unlabeled = (~labeled_mask).sum()

In [5]:
print("The number of unlabeled targets is:", n_unlabeled)
print("The number of labeled targets is:", n_labeled)

The number of unlabeled targets is: 895
The number of labeled targets is: 757


Split the labeled data into train and test sets (80/20) so the model can be evaluated on true labels it never saw during training.

In [6]:
X_labeled = X_full[labeled_mask]
y_labeled = y_full[labeled_mask]
X_unlabeled = X_full[~labeled_mask]

In [7]:
X_lab_train, X_lab_test, y_lab_train, y_lab_test = train_test_split(
    X_labeled, y_labeled, test_size=0.2, random_state=42
)

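Combine labeled training data with unlabeled data into a single semi-supervised pool, with NaN targets marking the unlabeled rows. This combined form is kept for reference only: `SelfTrainingRegressorCustom` takes the labeled and unlabeled sets separately.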

In [8]:
X_combined = pd.concat([X_lab_train, X_unlabeled], ignore_index=True)
y_combined = pd.concat(
    [y_lab_train, pd.Series([np.nan] * len(X_unlabeled))],
    ignore_index=True
)

## Define the base regressor

We use a `RandomForestRegressor` with 300 trees as the base learner.

In [9]:
base = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

In [10]:
self_trainer = SelfTrainingRegressorCustom(
    base_estimator=base,
    max_iter=8,
    add_per_iter=0.15,            # add 15% of the unlabeled pool per iteration (tunable)
    min_samples_added=5,         # ensure at least 5 per iteration if available
    confidence_threshold=None,   # or try 0.6..0.9 after inspecting confidence distribution
    random_state=42,
    verbose=1
)

Pass `X_lab_train`, `y_lab_train` and `X_unlabeled` to the algorithm as separate pools:

**NOTE**: SelfTrainingRegressorCustom expects separate labeled and unlabeled inputs.
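
The module `self_training_regression` is not shown in this notebook, so the following is only a minimal sketch of what a self-training loop for regression can look like, not the actual `SelfTrainingRegressorCustom` code. It assumes that confidence is measured by how closely the trees of a random forest agree (a small standard deviation across trees is read as high confidence). The function name `self_train_sketch` and the parameter `add_frac` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor


def self_train_sketch(base, X_lab, y_lab, X_unlab, max_iter=3, add_frac=0.15):
    """Illustrative self-training loop (assumption, not the notebook's class)."""
    X_lab, y_lab, pool = np.asarray(X_lab), np.asarray(y_lab), np.asarray(X_unlab)
    model = clone(base).fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(pool) == 0:
            break
        # Per-tree predictions on the unlabeled pool: shape (n_trees, n_pool)
        per_tree = np.stack([tree.predict(pool) for tree in model.estimators_])
        pseudo = per_tree.mean(axis=0)   # pseudo-label = forest prediction
        spread = per_tree.std(axis=0)    # disagreement between trees
        # Move the rows with the smallest spread into the labeled set
        k = max(1, int(np.ceil(add_frac * len(pool))))
        chosen = np.argsort(spread)[:k]
        X_lab = np.vstack([X_lab, pool[chosen]])
        y_lab = np.concatenate([y_lab, pseudo[chosen]])
        pool = np.delete(pool, chosen, axis=0)
        # Refit on true labels plus pseudo-labels
        model = clone(base).fit(X_lab, y_lab)
    return model, len(y_lab)


# Synthetic demo data (not the YieldStrength dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model, n_final = self_train_sketch(
    RandomForestRegressor(n_estimators=20, random_state=0),
    X[:50], y[:50], X[50:], max_iter=2,
)
print("Labeled size after self-training:", n_final)
```

Pseudo-labels are treated exactly like true labels after they are added, so any error in them is carried into every later iteration.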


In [11]:
self_trainer.fit(X_lab_train, y_lab_train, X_unlabeled)


[Self-Training] Iteration 1/8 | Labeled size: 605 | Unlabeled pool: 895
  -> Added 135 pseudo-labeled samples (confidence mean: 0.2692)

[Self-Training] Iteration 2/8 | Labeled size: 740 | Unlabeled pool: 760
  -> Added 114 pseudo-labeled samples (confidence mean: 0.6706)

[Self-Training] Iteration 3/8 | Labeled size: 854 | Unlabeled pool: 646
  -> Added 97 pseudo-labeled samples (confidence mean: 0.2454)

[Self-Training] Iteration 4/8 | Labeled size: 951 | Unlabeled pool: 549
  -> Added 83 pseudo-labeled samples (confidence mean: 0.6392)

[Self-Training] Iteration 5/8 | Labeled size: 1034 | Unlabeled pool: 466
  -> Added 70 pseudo-labeled samples (confidence mean: 0.7535)

[Self-Training] Iteration 6/8 | Labeled size: 1104 | Unlabeled pool: 396
  -> Added 60 pseudo-labeled samples (confidence mean: 0.0844)

[Self-Training] Iteration 7/8 | Labeled size: 1164 | Unlabeled pool: 336
  -> Added 51 pseudo-labeled samples (confidence mean: 0.6774)

[Self-Training] Iteration 8/8 | Labeled size: 1215 | Unlabeled pool: 285
  -> Added 43 pseudo-labeled samples (confidence mean: 0.6582)


<self_training_regression.SelfTrainingRegressorCustom at 0x2889f602510>
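
The batch sizes in the log are consistent with adding the ceiling of 15% of the *remaining* unlabeled pool in each iteration, with a floor of 5 samples. This is an inference from the log, not a statement about how `SelfTrainingRegressorCustom` is implemented. The helper `planned_batches` below is hypothetical and only reproduces the arithmetic:

```python
def planned_batches(pool_size, percent_per_iter=15, min_added=5, max_iter=8):
    """Batch size per iteration under an assumed ceil(percent * pool) rule."""
    batches = []
    for _ in range(max_iter):
        if pool_size == 0:
            break
        # Integer ceiling of percent_per_iter % of the remaining pool
        batch = -(-percent_per_iter * pool_size // 100)
        # Respect the minimum batch size, but never exceed the pool
        batch = min(pool_size, max(batch, min_added))
        batches.append(batch)
        pool_size -= batch
    return batches, pool_size


batches, remaining = planned_batches(895)
print(batches)    # [135, 114, 97, 83, 70, 60, 51, 43]
print(remaining)  # 242
```

Because each batch is a share of a shrinking pool, 242 of the 895 unlabeled rows never receive a pseudo-label within the 8 iterations.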

## Evaluate on held-out labeled test set

In [12]:
y_pred = self_trainer.predict(X_lab_test)
# mean_squared_error returns the MSE; take the square root to get the RMSE
mse = mean_squared_error(y_lab_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_lab_test, y_pred)
r2 = r2_score(y_lab_test, y_pred)

In [13]:
print("\nEvaluation on held-out labeled test set:")
print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE : {mae:.4f}")
print(f"R2  : {r2:.4f}")


Evaluation on held-out labeled test set:
MSE : 2721.9191
RMSE: 52.1720
MAE : 37.5464
R2  : 0.6201


Baseline: a supervised model trained only on the labeled training portion.

In [14]:
# Note: 200 trees here, versus 300 in the self-training base learner
baseline = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
baseline.fit(X_lab_train, y_lab_train)
y_base_pred = baseline.predict(X_lab_test)
# mean_squared_error returns the MSE; take the square root to get the RMSE
base_mse = mean_squared_error(y_lab_test, y_base_pred)
base_rmse = np.sqrt(base_mse)
base_mae = mean_absolute_error(y_lab_test, y_base_pred)
base_r2 = r2_score(y_lab_test, y_base_pred)


In [15]:
print("\nBaseline (supervised only) results on the same held-out set:")
print(f"Baseline MSE : {base_mse:.4f}")
print(f"Baseline RMSE: {base_rmse:.4f}")
print(f"Baseline MAE : {base_mae:.4f}")
print(f"Baseline R2  : {base_r2:.4f}")


Baseline (supervised only) results on the same held-out set:
Baseline MSE : 2459.7454
Baseline RMSE: 49.5958
Baseline MAE : 36.6682
Baseline R2  : 0.6567


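The semi-supervised model performs slightly worse than the supervised baseline on the held-out set (R2 0.6201 vs 0.6567, MAE 37.5464 vs 36.6682).

This suggests that the pseudo-labels were too noisy to help the model generalize. With `confidence_threshold=None`, no minimum confidence was enforced, and the mean confidence of the added samples dropped as low as 0.0844 (iteration 6). The comparison is also not strictly like-for-like, because the baseline uses 200 trees while the self-training base learner uses 300.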
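
One way to check whether pseudo-labels are too noisy is to hide some of the true labels, pseudo-label those rows, and compare the result with the hidden truth. The sketch below does this on synthetic stand-in data. It assumes tree disagreement in a random forest as the confidence signal, which may differ from the confidence score used by `SelfTrainingRegressorCustom`; `pseudo_label_check` and `hide_frac` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def pseudo_label_check(X, y, hide_frac=0.3, seed=0):
    """Hide some true labels, pseudo-label them, return error and tree spread."""
    X, y = np.asarray(X), np.asarray(y)
    X_keep, X_hide, y_keep, y_hide = train_test_split(
        X, y, test_size=hide_frac, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_keep, y_keep)
    # Per-tree predictions on the hidden rows: shape (n_trees, n_hidden)
    per_tree = np.stack([tree.predict(X_hide) for tree in model.estimators_])
    abs_err = np.abs(per_tree.mean(axis=0) - y_hide)  # pseudo-label error
    spread = per_tree.std(axis=0)                     # low spread = confident
    return abs_err, spread


# Synthetic stand-in data; in the notebook, pass X_lab_train / y_lab_train instead
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

abs_err, spread = pseudo_label_check(X_demo, y_demo)
confident = spread <= np.median(spread)
print("MAE, most confident half :", abs_err[confident].mean())
print("MAE, least confident half:", abs_err[~confident].mean())
```

If the most confident half is not clearly more accurate than the rest, the confidence score is a poor guide for selecting pseudo-labels, and a stricter `confidence_threshold` is unlikely to help on its own.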