# Semi-supervised Classification on a Text Dataset

In this example, semi-supervised classifiers are trained on the 20 newsgroups dataset (which will be automatically downloaded).

You can adjust the number of categories by passing their names to the dataset loader, or set `categories` to `None` to load all 20 of them.

## What is Semi-Supervised Learning?

Semi-supervised learning is a machine learning technique that uses a combination of labeled and unlabeled data to train a model. It bridges the gap between supervised learning, which requires labeled data, and unsupervised learning, which uses only unlabeled data. The key idea is to leverage the large amounts of unlabeled data available to improve model performance, especially when labeled data is scarce or expensive to obtain[1][2][3].

## When to Use Semi-Supervised Learning

Semi-supervised learning is particularly useful in the following scenarios:

- **Limited resources for labeling data**, such as medical images that require annotation by specialists[1]
- **Large volumes of unlabeled data available**, like in social networks or the internet, which can be leveraged to improve models[1]
- **Working with unstructured data** like text, images, or audio that is difficult to label[1]
- **Dealing with rare classes** in classification tasks, where labeled examples may be limited, such as in fraud detection or rare disease diagnosis[1][2]
- **When the labeled data alone is not representative of the entire data distribution**, making supervised learning ineffective[3]

However, semi-supervised learning is not a guaranteed improvement. It is most effective when a significant amount of unlabeled data is available and the data satisfies assumptions such as continuity (nearby points tend to share a label), clustering, or manifold structure. If the few labeled examples badly misrepresent the underlying data distribution, the unlabeled data can reinforce the model's early mistakes instead of correcting them.

In [9]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

# Loading dataset containing first five categories
data = fetch_20newsgroups(
    subset="train",
    categories=[
        "alt.atheism",
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
    ],
)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

2823 documents
5 categories



In [10]:
# parameters
sdg_params = dict(alpha=1e-5, penalty="l2", loss="log_loss")
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

sdg_params, vectorizer_params

({'alpha': 1e-05, 'penalty': 'l2', 'loss': 'log_loss'},
 {'ngram_range': (1, 2), 'min_df': 5, 'max_df': 0.8})

In [11]:
text = ["hoi this is me", "I'm august is me"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['august' 'hoi' 'is' 'me' 'this']
[[0 1 1 1 1]
 [1 0 1 1 0]]


In [12]:
# supervised pipeline => trained exclusively on labeled data
pipeline = Pipeline(
    [
        # Step 1: Convert the text documents into a bag-of-words model
        # This will create a document-term matrix where each entry is the count of a word in a document
        ("vect", CountVectorizer(**vectorizer_params)),

        # Step 2: Apply TF-IDF transformation on the bag-of-words output
        # This will reweight the counts with Term Frequency - Inverse Document Frequency (TF-IDF),
        # which gives more importance to rare terms and reduces the impact of frequently occurring ones.
        ("tfidf", TfidfTransformer()),

        # Step 3: Train a linear classifier with SGD on the TF-IDF weighted features.
        ("clf", SGDClassifier(**sdg_params))
    ]
)

In [13]:
# SelfTraining Pipeline
st_pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SelfTrainingClassifier(SGDClassifier(**sdg_params), verbose=True)),
    ]
)

In [14]:
# LabelSpreading Pipeline
ls_pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        # The TF-IDF output is a sparse matrix; convert it to a dense array for LabelSpreading
        ("toarray", FunctionTransformer(lambda x: x.toarray())),
        ("clf", LabelSpreading()),
    ]
)

In [17]:
def eval_and_print_metrics(clf, X_train, y_train, X_test, y_test):
    print("Number of training samples:", len(X_train))
    print("Unlabeled samples in training set:", sum(1 for x in y_train if x == -1))

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(
        "Micro-averaged F1 score on test set: %0.3f"
        % f1_score(y_test, y_pred, average="micro")
    )
    print("-" * 10)
    print()
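
A note on the metric: for a single-label multiclass problem like this one, the micro-averaged F1 score coincides with plain accuracy, because every misclassified sample counts as one false positive and one false negative. The sketch below illustrates this on made-up labels (the label lists are invented for the example and are not taken from the newsgroups data).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions over five classes
y_true = [0, 1, 2, 2, 1, 0, 3, 4]
y_pred = [0, 2, 2, 2, 1, 0, 4, 4]

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print(f"micro F1: {micro_f1:.3f}")
print(f"accuracy: {accuracy:.3f}")
```

Six of the eight predictions are correct, so both numbers come out as 0.750.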

In [18]:
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

print("Supervised SGDClassifier on 100% of the data:")
eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)

# select a mask of 20% of the train dataset
y_mask = np.random.rand(len(y_train)) < 0.2

# X_20 and y_20 are the subset of the train dataset indicated by the mask
X_20, y_20 = map(
    list, zip(*((x, y) for x, y, m in zip(X_train, y_train, y_mask) if m))
)
print("Supervised SGDClassifier on 20% of the training data:")
eval_and_print_metrics(pipeline, X_20, y_20, X_test, y_test)

# set the non-masked subset to be unlabeled
y_train[~y_mask] = -1
print("SelfTrainingClassifier on 20% of the training data (rest is unlabeled):")
eval_and_print_metrics(st_pipeline, X_train, y_train, X_test, y_test)

print("LabelSpreading on 20% of the data (rest is unlabeled):")
eval_and_print_metrics(ls_pipeline, X_train, y_train, X_test, y_test)

Supervised SGDClassifier on 100% of the data:
Number of training samples: 2117
Unlabeled samples in training set: 0
Micro-averaged F1 score on test set: 0.888
----------

Supervised SGDClassifier on 20% of the training data:
Number of training samples: 445
Unlabeled samples in training set: 0
Micro-averaged F1 score on test set: 0.766
----------

SelfTrainingClassifier on 20% of the training data (rest is unlabeled):
Number of training samples: 2117
Unlabeled samples in training set: 1672
End of iteration 1, added 1080 new labels.
End of iteration 2, added 209 new labels.
End of iteration 3, added 49 new labels.
End of iteration 4, added 30 new labels.
End of iteration 5, added 18 new labels.
End of iteration 6, added 10 new labels.
End of iteration 7, added 2 new labels.
End of iteration 8, added 3 new labels.
End of iteration 9, added 6 new labels.
End of iteration 10, added 1 new labels.
Micro-averaged F1 score on test set: 0.827
----------

LabelSpreading on 20% of the data (rest is unlabeled):

### **Self-Training Pipeline (Semi-Supervised)**

- **Definition**: In a self-training pipeline, the model is first trained in a supervised way on a small labeled dataset. It is then applied to the unlabeled data, and its high-confidence predictions are added to the training set as pseudo-labels. The process repeats until no confident predictions remain or an iteration limit is reached.
- **Data**: Starts with a small amount of labeled data, but utilizes a large amount of unlabeled data for learning.
- **Example Use Case**: Image classification with limited labeled examples, where the model is able to self-label additional images based on its confidence.
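
The pseudo-labeling loop described above can be sketched by hand. This is a simplified illustration of the idea, not scikit-learn's implementation: the synthetic data from `make_blobs`, the `LogisticRegression` base model, and the `threshold` and `max_iter` values are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption; not the newsgroups dataset)
X, y_true = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=0)

# Keep 10 labels per class and mark everything else as unlabeled (-1)
y = np.full_like(y_true, -1)
for c in np.unique(y_true):
    keep = np.flatnonzero(y_true == c)[:10]
    y[keep] = y_true[keep]
n_labeled_start = int((y != -1).sum())

threshold = 0.75  # hypothetical confidence cut-off for accepting a pseudo-label
max_iter = 10     # hypothetical cap on the number of rounds
model = LogisticRegression()

for iteration in range(max_iter):
    labeled = y != -1
    model.fit(X[labeled], y[labeled])  # train on true labels + pseudo-labels
    if labeled.all():
        break
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    # Pseudo-label only the samples the model is confident about
    new_idx = np.flatnonzero(~labeled)[confident]
    y[new_idx] = model.classes_[proba.argmax(axis=1)[confident]]
    print(f"iteration {iteration + 1}: added {confident.sum()} pseudo-labels")

n_labeled_end = int((y != -1).sum())
print(f"labeled samples: {n_labeled_start} -> {n_labeled_end}")
```

The per-iteration printout plays the same role as the `End of iteration ...` lines that `SelfTrainingClassifier(verbose=True)` emits in the notebook output.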
  
### **Label Spreading Pipeline (Semi-Supervised)**

- **Definition**: Label spreading is a graph-based semi-supervised learning method that propagates labels through the data. It works by constructing a graph where each node represents a sample, and edges represent similarities between samples. Labeled data is used to propagate information to the unlabeled data based on their similarity.
- **Data**: A small amount of labeled data and a large amount of unlabeled data. Labels are propagated based on feature similarity.
- **Example Use Case**: Assigning topics to documents when only a few are labeled, but many are unlabeled, and you rely on similarities between documents to spread labels.
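
The graph-based propagation can likewise be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the `LabelSpreading` implementation: it uses a fully connected RBF similarity graph, symmetric normalization, and hypothetical values for `gamma` (kernel width) and `alpha` (how much each node listens to its neighbors versus its original label).

```python
import numpy as np

# Six 1-D points forming two tight groups; one labeled point per group, -1 = unlabeled
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, -1, -1, 1, -1, -1])

gamma = 20.0  # hypothetical RBF kernel width
alpha = 0.2   # hypothetical weight given to neighbor information

# Edge weights: RBF similarity between every pair of samples, no self-loops
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-gamma * sq_dists)
np.fill_diagonal(W, 0.0)

# Symmetric normalization: S = D^{-1/2} W D^{-1/2}
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))

# One-hot matrix of the known labels; unlabeled rows stay all-zero
classes = np.array([0, 1])
Y0 = (y[:, None] == classes[None, :]).astype(float)

# Repeatedly mix neighbor information with the original labels
F = Y0.copy()
for _ in range(30):
    F = alpha * (S @ F) + (1 - alpha) * Y0

pred = classes[F.argmax(axis=1)]
print(pred)  # [0 0 0 1 1 1]
```

Each unlabeled point ends up with the label of the group it sits in, because similarity within a group is far larger than similarity across groups.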
  
---

### Key Differences:

| Aspect                  | Supervised Pipeline             | Self-Training Pipeline             | Label Spreading Pipeline          |
|-------------------------|---------------------------------|------------------------------------|-----------------------------------|
| **Data Requirements**    | Fully labeled dataset           | Small labeled + large unlabeled    | Small labeled + large unlabeled   |
| **Learning Strategy**    | Purely based on labeled data    | Uses labeled + pseudo-labeled data | Spreads labels through similarity |
| **Model Iteration**      | No iteration, standard training | Iterative: model pseudo-labels data | Iterative: propagates labels over the graph until convergence |
| **Model Type**           | Any standard supervised model   | Self-training classifier           | Graph-based semi-supervised model |
| **Use Case**             | Fully labeled tasks (classification) | When labels are limited but unlabeled data is abundant | For tasks with clear similarity patterns among data points |

Each approach is chosen based on how much labeled data is available and how much unlabeled data can be leveraged for learning.