## **Self-Training**

**Type:** Semi-Supervised Learning Technique

---

### **Definition (Formal)**

Self-training is a semi-supervised learning approach where a model is first trained on a small set of labeled data, then used to assign labels (pseudo-labels) to unlabeled data, and finally retrained using both the original labeled data and the confidently labeled pseudo-data.

---

### **Key Idea (Easy)**

It’s like teaching yourself — first you learn from a teacher (labeled data), then you try answering questions you don’t have answers for (unlabeled data), and if you’re confident, you add those answers to your notes and study again.

---

### **Core Steps**

1. **Train Initial Model** — on the small labeled dataset.
2. **Pseudo-Labeling** — use the trained model to predict labels for unlabeled data.
3. **Select Confident Predictions** — keep only predictions above a certain confidence threshold.
4. **Retraining Loop** —

   * Add the pseudo-labeled data to the labeled set.
   * Retrain the model on this combined dataset.
   * Repeat until performance stops improving or no new predictions pass the threshold.
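
The four steps above can be sketched as one generic loop. This is a minimal sketch under stated assumptions, not a reference implementation: `fit_means`, `predict_nearest`, and `self_train` are hypothetical names, the model is a toy 1-D nearest-mean classifier for two classes, and the loop stops as soon as no prediction passes the threshold (a simplification of "no significant improvement").

```python
from statistics import mean


def fit_means(xs, ys):
    """Toy 'model' (assumption): the mean of the 1-D points in each class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in sorted(set(ys))}


def predict_nearest(model, x):
    """Return (label, confidence) for a two-class model; confidence lies in [0.5, 1]."""
    (near_c, near_d), (_, far_d) = sorted(
        ((c, abs(x - mu)) for c, mu in model.items()), key=lambda t: t[1]
    )
    total = near_d + far_d
    conf = 0.5 if total == 0 else far_d / total
    return near_c, conf


def self_train(xs, ys, unlabeled, threshold=0.9, max_rounds=10):
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    model = fit_means(xs, ys)                                      # 1. train initial model
    for _ in range(max_rounds):
        scored = [(x, *predict_nearest(model, x)) for x in pool]   # 2. pseudo-label
        accepted = [(x, c) for x, c, conf in scored if conf >= threshold]  # 3. select
        if not accepted:
            break                                                  # nothing confident is left
        xs += [x for x, _ in accepted]                             # 4. grow the labeled set
        ys += [c for _, c in accepted]
        pool = [x for x, _, conf in scored if conf < threshold]
        model = fit_means(xs, ys)                                  #    and retrain
    return model, pool


model, leftover = self_train([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1], [0.5, 5.0, 9.5])
print(model, leftover)
```

In this made-up run, 0.5 and 9.5 are accepted in the first round, while 5.0 sits exactly between the two means (confidence 0.5) and stays unlabeled.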

---

### **Pros**

* Makes use of large unlabeled datasets.
* Can improve performance when labeled data is scarce.

### **Cons**

* If pseudo-labels are wrong, errors can accumulate.
* Needs careful choice of confidence threshold.
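
The threshold trade-off can be made concrete with a small sketch. The triples below are invented purely for illustration, and `coverage_and_error` is a hypothetical helper. In practice the true labels of unlabeled data are unknown, so a check like this would need a held-out labeled validation set.

```python
# Invented (confidence, pseudo_label, true_label) triples, for illustration only.
preds = [
    (0.99, 1, 1), (0.95, 0, 0), (0.90, 1, 1),
    (0.80, 0, 1), (0.70, 1, 1), (0.60, 0, 1),
]


def coverage_and_error(preds, threshold):
    """Share of predictions kept at this threshold, and the error rate among them."""
    kept = [(p, t) for conf, p, t in preds if conf >= threshold]
    coverage = len(kept) / len(preds)
    error_rate = sum(p != t for p, t in kept) / len(kept) if kept else 0.0
    return coverage, error_rate


for threshold in (0.9, 0.75, 0.6):
    print(threshold, coverage_and_error(preds, threshold))
```

With these made-up numbers, a threshold of 0.9 keeps half of the predictions with no errors, while 0.6 keeps all of them but a third of the kept pseudo-labels are wrong. Those wrong labels are the ones that would be fed back into training.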

---

### **Real-Life Examples**

* Text classification when only a small set of articles is labeled.
* Image recognition with few labeled photos but many unlabeled ones.
* Speech recognition where manual transcription is expensive.


## **Self-Training with Pseudo-Labeling**

We’ll use **a small dataset** with just two classes: **Class 0** and **Class 1**.

---

### **Step 1: Start with labeled data**

We have a few labeled points:

| Data (x) | Class |
| -------- | ----- |
| 1.0      | 0     |
| 1.2      | 0     |
| 3.0      | 1     |
| 3.2      | 1     |

Think of “x” as a number from a measurement.

---

### **Step 2: Find class centers (average)**

We calculate the mean (average) for each class:

* **Class 0 mean:** $(1.0 + 1.2) / 2 = 1.1$
* **Class 1 mean:** $(3.0 + 3.2) / 2 = 3.1$

The halfway point between these means is:

$$
(1.1 + 3.1) / 2 = 2.1
$$
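
The midpoint matters because assigning each point to its nearest class mean is, in one dimension, the same as comparing it with the halfway point between the two means. A minimal sketch of this step, with the variable names `labeled`, `mu0`, `mu1`, and `boundary` chosen here for illustration:

```python
from statistics import mean

# Labeled points from Step 1 as (x, class) pairs
labeled = [(1.0, 0), (1.2, 0), (3.0, 1), (3.2, 1)]

mu0 = mean([x for x, y in labeled if y == 0])   # Class 0 center
mu1 = mean([x for x, y in labeled if y == 1])   # Class 1 center
boundary = (mu0 + mu1) / 2                      # halfway point between the centers

print(round(mu0, 2), round(mu1, 2), round(boundary, 2))
```

The values are rounded for printing because binary floating point cannot represent numbers such as 1.1 exactly.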

---

### **Step 3: Classify unlabeled data**

Unlabeled data:

```
0.9, 1.1, 2.0, 2.9, 3.3
```

Rule:

* If $x \le 2.1$ → **Class 0**
* If $x > 2.1$ → **Class 1**

Predictions:

* 0.9 → Class 0 ✅
* 1.1 → Class 0 ✅
* 2.0 → Class 0 ✅
* 2.9 → Class 1 ✅
* 3.3 → Class 1 ✅
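
The rule can be sketched in a few lines. The helper `classify` is a hypothetical name, and sending a point that lands exactly on the boundary to Class 0 is an arbitrary choice made for this sketch. The distance to the boundary is included as a rough indicator of how safe each prediction is.

```python
boundary = 2.1                           # halfway point from Step 2
unlabeled = [0.9, 1.1, 2.0, 2.9, 3.3]


def classify(x, boundary):
    """Class 0 on or below the boundary, Class 1 above it (tie rule is an assumption)."""
    return 0 if x <= boundary else 1


pseudo_labels = [(x, classify(x, boundary)) for x in unlabeled]
closest = min(unlabeled, key=lambda x: abs(x - boundary))

for x, label in pseudo_labels:
    print(f"{x} -> Class {label} (distance to boundary: {abs(x - boundary):.1f})")
print("Closest to the boundary:", closest)
```

All five labels match the list above, but 2.0 lies only about 0.1 from the boundary, far closer than any other point.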

---

### **Step 4: Add confident predictions**

For simplicity, we **pretend** every prediction is confident and keep all five. Note that 2.0 is only 0.1 away from the boundary at 2.1, so a real confidence threshold would probably hold it back.

Add them to the labeled set:

| Data (x) | Class |
| -------- | ----- |
| 1.0      | 0     |
| 1.2      | 0     |
| 0.9      | 0     |
| 1.1      | 0     |
| 2.0      | 0     |
| 3.0      | 1     |
| 3.2      | 1     |
| 2.9      | 1     |
| 3.3      | 1     |
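
The walkthrough does not define "very sure". One possible score, the same one used in the Python code at the end of this section, is the farther mean's share of the total distance to both means: 0.5 at the midpoint and 1.0 exactly on a class mean. The sketch below applies it to the five points; the function name `confidence` is an assumption.

```python
mu0, mu1 = 1.1, 3.1                      # class means from Step 2
unlabeled = [0.9, 1.1, 2.0, 2.9, 3.3]


def confidence(x, mu0, mu1):
    """1 minus the nearer mean's share of the total distance (assumes mu0 != mu1)."""
    d0, d1 = abs(x - mu0), abs(x - mu1)
    return 1 - min(d0, d1) / (d0 + d1)


scores = {x: confidence(x, mu0, mu1) for x in unlabeled}
for x, score in scores.items():
    print(x, round(score, 2))
```

Under this score, 2.0 gets roughly 0.55, barely better than a coin flip, so a strict threshold would leave it in the unlabeled pool.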

---

### **Step 5: Recalculate class centers**

Now with more data:

* **Class 0 mean:** $(1.0 + 1.2 + 0.9 + 1.1 + 2.0) / 5 = 1.24$
* **Class 1 mean:** $(3.0 + 3.2 + 2.9 + 3.3) / 4 = 3.10$

New halfway point:

$$
(1.24 + 3.10) / 2 = 2.17
$$
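
To see how much a single borderline pseudo-label moves the boundary, the sketch below recomputes it with and without the point 2.0. The list names and the helper `boundary` are illustrative assumptions.

```python
from statistics import mean

class0_initial = [1.0, 1.2]
class0_confident = [0.9, 1.1]     # pseudo-labels far from the old boundary
class0_borderline = [2.0]         # pseudo-label only 0.1 from the old boundary
class1 = [3.0, 3.2, 2.9, 3.3]


def boundary(c0, c1):
    """Halfway point between the two class means."""
    return (mean(c0) + mean(c1)) / 2


with_borderline = boundary(class0_initial + class0_confident + class0_borderline, class1)
without_borderline = boundary(class0_initial + class0_confident, class1)
print(round(with_borderline, 3), round(without_borderline, 3))
```

Including 2.0 puts the boundary at about 2.17; leaving it out gives about 2.075. If 2.0 actually belonged to Class 1, that shift would be an error carried into every later round.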

---

### **Step 6: Repeat**

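In this walkthrough all five unlabeled points have now been used, so the loop stops here. With a larger unlabeled pool, we would classify the **remaining unlabeled points** using the new boundary $2.17$, add the confident ones, and keep going until nothing changes.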


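The same loop in Python, this time with an explicit confidence threshold `tau`. Confidence is the farther mean's share of the total distance to both means: 0.5 at the midpoint, 1.0 exactly on a class mean.

```python
from statistics import mean

# Initial labeled sets (Class 0, Class 1) and the unlabeled pool
L0, L1 = [1.0, 1.2], [3.0, 3.2]
U = [0.9, 1.1, 2.0, 2.9, 3.3]

tau = 0.90  # minimum confidence for accepting a pseudo-label

for _ in range(5):  # at most 5 self-training rounds
    mu0, mu1 = mean(L0), mean(L1)
    keep0, keep1 = [], []
    for x in U:
        d0, d1 = abs(x - mu0), abs(x - mu1)
        yhat = 0 if d0 <= d1 else 1           # nearest class mean
        conf = 1 - min(d0, d1) / (d0 + d1)    # 0.5 at the midpoint, 1.0 at a mean
        if conf >= tau:
            (keep0 if yhat == 0 else keep1).append(x)
    if not (keep0 or keep1):
        break  # nothing passed the threshold, so stop
    L0 += keep0
    L1 += keep1
    U = [x for x in U if x not in keep0 + keep1]

print("Final means:", mean(L0), mean(L1))
print("Remaining unlabeled:", U)
```

```
Final means: 1.05 3.1666666666666665
Remaining unlabeled: [2.0, 2.9]
```

Unlike the hand calculation, which accepted every prediction, this run accepts only 0.9, 1.1, and 3.3. The point 2.0 scores about 0.55. The point 2.9 sits right at the threshold in the first round (0.90 on paper, marginally less in floating point) and drops to about 0.87 once the means are updated, so both stay unlabeled.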
