In [227]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


# Semi-supervised and Positive Unlabeled Learning
## Learning with Limited Labels: Weak Supervision and Uncertainty-Aware Training
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)


## Summary

### Keypoints

- Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data to improve model performance.

- Key approaches to semi-supervised learning include self-training, co-training, and multi-view learning.

- Self-training iteratively uses a model's most confident predictions on unlabeled data to expand the training set.

- Label propagation constructs a graph of data points and spreads labels from known instances to unlabeled ones.

- Co-training leverages multiple views of the data, training separate models on each view and allowing them to teach each other.

- PU learning is a specialized form of semi-supervised learning that uses only positive and unlabeled data, without explicit negative examples.

- PU learning relies on key assumptions, including positive label reliability and label flipping independence.

- The Elkan and Noto approach to PU learning involves estimating the probability of a sample being labeled and adjusting predictions accordingly.

### Takeaways

- Semi-supervised and PU learning techniques can significantly improve model performance when labeled data is scarce or expensive to obtain.

- The effectiveness of these methods depends on the validity of their underlying assumptions about the data distribution and labeling process.

- Choosing the appropriate semi-supervised or PU learning technique requires careful consideration of the specific problem, available data, and computational resources.

- While these methods can be powerful, they also introduce additional complexity and potential sources of error compared to fully supervised approaches.

- Practitioners should always validate the results of semi-supervised and PU learning methods against a held-out test set or through cross-validation to ensure their effectiveness.


## Overview of Semi-supervised learning

Semi-supervised learning is a type of machine learning that uses a small amount of labeled data along with a large amount of unlabeled data to train models. This approach is particularly useful when labeled data is scarce or expensive to obtain. Semi-supervised learning can be applied to a variety of tasks, including image classification, speech recognition, and natural language processing.

### Definition and Motivation

**Semi-supervised learning** bridges the gap between supervised learning (which relies entirely on labeled data) and unsupervised learning (which uses only unlabeled data). The primary motivation for using semi-supervised learning is to leverage the abundant unlabeled data available in many practical scenarios to improve model performance when labeled data is limited.

### Differences Between Supervised, Unsupervised, and Semi-Supervised Learning

- **Supervised Learning**: Uses a fully labeled dataset to train models. Each training example is paired with an output label. Examples include classification and regression tasks.
- **Unsupervised Learning**: Uses only unlabeled data to find hidden patterns or intrinsic structures in the input data. Examples include clustering and dimensionality reduction.
- **Semi-Supervised Learning**: Combines a small amount of labeled data with a large amount of unlabeled data. The goal is to improve learning accuracy by incorporating the unlabeled data, which can provide additional context and structure.

### Approaches to Semi-Supervised Learning

Several approaches can be employed in semi-supervised learning, each with its own strengths and weaknesses. The choice of approach depends on the specific problem being addressed:

1. **Self-Training**:
   - The model is initially trained using the labeled data.
   - The model then labels the unlabeled data, and these pseudo-labels are used to retrain the model.
   - *Strength*: Simple to implement.
   - *Weakness*: The initial model's errors can propagate through the pseudo-labels.

2. **Co-Training**:
   - Two or more models are trained on different views of the data (different feature sets).
   - Each model labels the unlabeled data, and these labels are used to train the other models.
   - *Strength*: Can leverage complementary information from different views.
   - *Weakness*: Requires that the data can be naturally split into distinct views.

3. **Multi-View Learning**:
   - Similar to co-training but more generalized.
   - Combines multiple views of the data in a unified framework.
   - *Strength*: More flexible and can handle complex data structures.
   - *Weakness*: Computationally more intensive.
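
To make the co-training idea more concrete, here is a minimal sketch on synthetic data. It is not the notebook's method: the two "views" are simply the two halves of the feature columns, and the number of rounds (5) and of pseudo-labels exchanged per round (20) are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 20 features, most of them informative
X, y_true = make_classification(n_samples=600, n_features=20, n_informative=16,
                                n_redundant=0, random_state=0)

# Keep only 30 true labels; -1 marks "unlabeled"
rng = np.random.default_rng(0)
labeled = np.zeros(len(X), dtype=bool)
labeled[rng.choice(len(X), size=30, replace=False)] = True

# Two hypothetical views: first and second half of the columns
views = [X[:, :10], X[:, 10:]]
y_views = [np.where(labeled, y_true, -1) for _ in views]
models = [LogisticRegression(max_iter=1000) for _ in views]

for _ in range(5):
    # Train each model on its own view with the labels it currently has
    for i, model in enumerate(models):
        mask = y_views[i] != -1
        model.fit(views[i][mask], y_views[i][mask])

    # Each model hands its 20 most confident predictions to the other model
    for i, model in enumerate(models):
        other = 1 - i
        candidates = np.flatnonzero(y_views[other] == -1)
        if candidates.size == 0:
            continue
        proba = model.predict_proba(views[i][candidates])
        top = np.argsort(proba.max(axis=1))[-20:]
        y_views[other][candidates[top]] = model.classes_[proba[top].argmax(axis=1)]

print("Labels available to model 0:", int((y_views[0] != -1).sum()))
print("Labels available to model 1:", int((y_views[1] != -1).sum()))
```

Each model only ever receives pseudo-labels produced from the other view, which is what distinguishes this from plain self-training.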

### Techniques for Combining Labeled and Unlabeled Data

Effectively combining labeled and unlabeled data is a key challenge in semi-supervised learning. Several techniques can be used:

- **Generative Models**:
  - Models the joint probability distribution of the data and labels.
  - Examples include Gaussian Mixture Models and Variational Autoencoders.
  - *Strength*: Can generate new data points.
  - *Weakness*: Often requires strong assumptions about the data distribution.

- **Graph-Based Methods**:
  - Represents data as a graph, where nodes are data points and edges represent similarities.
  - Labels are propagated through the graph based on these similarities.
  - *Strength*: Captures the manifold structure of the data.
  - *Weakness*: Can be computationally expensive for large datasets.

- **Self-Training Algorithms**:
  - Iteratively refine the model by using its own predictions as additional training data.
  - *Strength*: Simple and effective for many tasks.
  - *Weakness*: Risk of reinforcing initial model biases.


### Key Assumptions in Semi-Supervised Learning

Semi-supervised learning relies on certain assumptions about the data for it to be effective.  Violating these assumptions may lead to inaccurate predictions. Here are some key assumptions to consider:

- **Cluster Assumption (Homogeneity):**  Imagine plotting your data points in a high-dimensional space. The cluster assumption suggests that points close together in this space are likely to belong to the same class or have the same label.  Think of it as "birds of a feather flock together" – data instances within a cluster tend to share characteristics. 

- **Continuity Assumption (Smoothness):** This assumption focuses on regions of varying data density:
    - **High-Density Regions:** If two points are very close and lie within a high-density region (lots of data points nearby), they are likely to have the same label. 
    - **Low-Density Regions:** Points in low-density regions may have different labels even if they are close. Imagine two data clusters separated by a low-density gap; points on opposite sides of this gap are likely from different classes, even if they are spatially near each other.

- **Manifold Assumption:** This assumption suggests that high-dimensional data often lie on a lower-dimensional manifold. Think of a folded piece of paper in three-dimensional space: while it exists in 3D, the paper itself is a 2D surface. In machine learning terms, distances measured along the manifold are more informative than straight-line distances in the original space, so points that are close on the manifold are likely to share a label. Combined with the cluster and continuity assumptions, this implies that the decision boundary between classes should ideally pass through low-density regions rather than cut through high-density clusters.

> **Important Note**: These assumptions may not apply universally. Always consider your data's characteristics and problem context.
>
> **Also**: The effectiveness of semi-supervised learning depends on the assumption that the labeled and unlabeled data come from the same distribution and that the unlabeled data provides useful information about the structure of the data.
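
A quick way to get a feel for the cluster and continuity assumptions is to measure how often a point's nearest neighbors share its label. The sketch below does this on synthetic blobs; the blob centers, the spread, and the choice of 5 neighbors are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Two well-separated synthetic clusters (assumed centers and spread)
X, y = make_blobs(n_samples=500, centers=[[-3, 0], [3, 0]],
                  cluster_std=1.0, random_state=0)

# Ask for 6 neighbors because the closest hit of each point is the point itself
nn = NearestNeighbors(n_neighbors=6).fit(X)
_, idx = nn.kneighbors(X)

# Fraction of the 5 true neighbors that carry the same label as the query point
neighbor_labels = y[idx[:, 1:]]
agreement = (neighbor_labels == y[:, None]).mean()
print(f"Share of neighbors with the same label: {agreement:.2f}")
```

A value close to 1 suggests the assumptions are plausible for that feature space; a value near chance level is a warning that neighborhood-based methods may propagate wrong labels.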


## Our dataset

We're using the classic "Dogs vs Cats" image dataset for this class. It's structured as follows:

**Source and Organization:**

- Images are stored in the `data/dogs-vs-cats` directory.
- Within this directory, we have separate folders for `train`, `valid`, and `test` sets.
- Each of these sets is further divided into `DOG` and `CAT` folders containing the respective images in `.jpg` format.

**Dataset Split and Size:**

- **Training:** Used to train the model (exact numbers of dog and cat images are printed during preprocessing).
- **Validation:**  Used for hyperparameter tuning and model evaluation during training.
- **Test:**  A held-out set for the final, unbiased evaluation of the trained model.

**Preprocessing:**

1. **Shuffling:**  We shuffle the image lists within each split using a fixed random seed to prevent order-based learning and ensure diverse training batches.
2. **Feature Extraction with Hugging Face:**
   - We use the `google/vit-base-patch16-224-in21k` Vision Transformer model.
   - Images are loaded, preprocessed using the `AutoImageProcessor`, and fed to the model.
   - The average of the last hidden state is extracted as the feature vector for each image.
   - Features are saved using `joblib.dump` for efficiency.



In [228]:
from glob import glob
import os
from pathlib import Path
import random
import numpy as np
from tqdm import tqdm
from PIL import Image
import joblib

dog_files_train = glob('data/dogs-vs-cats/train/DOG/*.jpg')
cat_files_train = glob('data/dogs-vs-cats/train/CAT/*.jpg')
dog_files_valid = glob('data/dogs-vs-cats/valid/DOG/*.jpg')
cat_files_valid = glob('data/dogs-vs-cats/valid/CAT/*.jpg')
dog_files_test = glob('data/dogs-vs-cats/test/DOG/*.jpg')
cat_files_test = glob('data/dogs-vs-cats/test/CAT/*.jpg')


# Define a specific seed
seed = 271828

# Create a random number generator with the specific seed
rng = np.random.default_rng(seed)

# Shuffle the files using the random number generator
rng.shuffle(dog_files_train)
rng.shuffle(cat_files_train)
rng.shuffle(dog_files_valid)
rng.shuffle(cat_files_valid)
rng.shuffle(dog_files_test)
rng.shuffle(cat_files_test)


In [229]:
print(f"""
Sizes:
  Train - Dog: {len(dog_files_train)}, Cat: {len(cat_files_train)}
  Valid - Dog: {len(dog_files_valid)}, Cat: {len(cat_files_valid)}
  Test  - Dog: {len(dog_files_test)}, Cat: {len(cat_files_test)}
""")


Sizes:
  Train - Dog: 9976, Cat: 9971
  Valid - Dog: 1239, Cat: 1253
  Test  - Dog: 1246, Cat: 1246



In [230]:
from pathlib import Path
from typing import List, Tuple
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
from tqdm import tqdm
import numpy as np
import joblib
import torch

def extract_features_from_files(files: List[str], processor: AutoImageProcessor, model: AutoModel) -> List[Tuple[int, str, np.ndarray]]:
    """Extracts features from a list of image files using a pre-trained model.

    Args:
        files (List[str]): List of file paths to the images.
        processor (AutoImageProcessor): The image processor from Hugging Face.
        model (AutoModel): The pre-trained model from Hugging Face.

    Returns:
        List[Tuple[int, str, np.ndarray]]: List of tuples containing labels, file paths, and their corresponding extracted features.
    """
    features = []
    # Iterate over each file and extract features
    for file in tqdm(files):
        # Open the image and force 3 channels (some JPEGs are grayscale or CMYK)
        with Image.open(file) as img:
            image = img.convert("RGB")
        # Process the image to prepare it for the model
        inputs = processor(images=image, return_tensors="pt")
        # Run the model without tracking gradients; only the activations are needed
        with torch.no_grad():
            outputs = model(**inputs)
        # Extract the last hidden states from the model outputs
        last_hidden_states = outputs.last_hidden_state
        # Average over the token dimension to get one feature vector per image
        feature_vector = last_hidden_states.mean(dim=1).squeeze().numpy()
        # Assign a label based on the parent folder name (1 for DOG, 0 for CAT)
        label = 1 if Path(file).parent.name == "DOG" else 0
        # Append the label, file path, and feature vector to the features list
        features.append((label, file, feature_vector))
    return features

# Load the pre-trained model and processor from Hugging Face
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Extract features from training, validation, and test image files
train_features = extract_features_from_files(dog_files_train + cat_files_train, processor, model)
valid_features = extract_features_from_files(dog_files_valid + cat_files_valid, processor, model)
test_features = extract_features_from_files(dog_files_test + cat_files_test, processor, model)

# Shuffle the features to ensure random distribution
rng.shuffle(train_features)
rng.shuffle(valid_features)
rng.shuffle(test_features)

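# Save the extracted features to disk using joblib for later use
# (joblib.dump does not create missing directories, so create the output folder first)
Path('outputs/cats-vs-dogs').mkdir(parents=True, exist_ok=True)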
joblib.dump(train_features, 'outputs/cats-vs-dogs/train_features.joblib')
joblib.dump(valid_features, 'outputs/cats-vs-dogs/valid_features.joblib')
joblib.dump(test_features, 'outputs/cats-vs-dogs/test_features.joblib')


100%|██████████| 19947/19947 [23:27<00:00, 14.18it/s]
100%|██████████| 2492/2492 [02:59<00:00, 13.89it/s]
100%|██████████| 2492/2492 [03:00<00:00, 13.81it/s]


['outputs/cats-vs-dogs/test_features.joblib']

In [231]:
import umap
import numpy as np
import plotly.express as px
import pandas as pd

# Create a UMAP projection of the valid features
# UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique
# that is particularly well-suited for visualizing high-dimensional data.

# Assuming valid_features is a list of tuples where each tuple contains (label, file_path, feature_vector)
# Extract the feature vectors from valid_features and convert them to a numpy array
feature_vectors = np.array([x[2] for x in valid_features])

# Initialize the UMAP model with desired parameters
# n_neighbors: The size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation
# min_dist: The minimum distance between points in the low-dimensional space
# n_components: The number of dimensions of the low-dimensional space (2D in this case)
# random_state: Seed for the random number generator to ensure reproducibility
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=271828)

# Fit the UMAP model to the feature vectors and transform them into a 2D projection
valid_features_umap = umap_model.fit_transform(feature_vectors)

# valid_features_umap now contains the 2D coordinates of the valid features in the UMAP projection

# Create a DataFrame from the UMAP projection
# The DataFrame will have columns "x" and "y" for the 2D coordinates
df = pd.DataFrame(valid_features_umap, columns=["x", "y"])

# Add the labels to the DataFrame
# Extract the labels from valid_features and map them to "Cat" and "Dog"
df["label"] = [x[0] for x in valid_features]
df["label"] = df["label"].map({0: "Cat", 1: "Dog"})

# Plot the UMAP projection with uniform opacity
# Use Plotly Express to create a scatter plot of the UMAP projection
# Color the points by their labels ("Cat" or "Dog")
fig = px.scatter(df, x="x", y="y", color="label", title="UMAP Projection of Valid Features", opacity=0.7)

# Show the plot
fig.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



## Self-Training

Self-training is a **semi-supervised learning technique** that leverages both a small labeled dataset and a larger unlabeled dataset to improve model performance. This approach is particularly beneficial when labeled data is scarce and expensive to obtain. The core idea is to iteratively refine the model by generating pseudo-labels for the unlabeled data and incorporating these pseudo-labels into the training process.

<p align="center">
  <img src="images/self_training.png"  width="80%" height="80%" />
  <br>
</p>


### Algorithm

The self-training algorithm can be broken down into the following steps:

1. **Initialize Model**: Train an initial model on the small labeled dataset.
2. **Generate Pseudo-Labels**: Use the trained model to predict labels for the unlabeled dataset, creating pseudo-labels.
3. **Combine Data**: Merge the original labeled data with the pseudo-labeled data.
4. **Retrain Model**: Retrain the model on this combined dataset.
5. **Repeat**: Iterate steps 2-4 until the model's performance converges or a predetermined number of iterations is reached.
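
The steps above can be sketched as a short loop. This is a minimal illustration on synthetic data, not the implementation used later in this notebook. It also adds one common refinement that the step list does not mention: only predictions above a confidence threshold (0.9 here, an arbitrary choice) are turned into pseudo-labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption); keep 50 true labels and mark the rest as -1
X, y_true = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
y = np.full(len(X), -1)
seed_idx = rng.choice(len(X), size=50, replace=False)
y[seed_idx] = y_true[seed_idx]

threshold = 0.9   # hypothetical confidence cut-off
max_rounds = 10   # hypothetical iteration limit
model = LogisticRegression(max_iter=1000)

for _ in range(max_rounds):
    labeled = y != -1
    model.fit(X[labeled], y[labeled])               # steps 1 and 4: (re)train
    unlabeled = np.flatnonzero(~labeled)
    if unlabeled.size == 0:
        break                                       # nothing left to label
    proba = model.predict_proba(X[unlabeled])       # step 2: predict
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break                                       # no confident predictions left
    # Step 3: merge the confident pseudo-labels into the training labels
    y[unlabeled[confident]] = model.classes_[proba[confident].argmax(axis=1)]

# Final fit on everything that ended up labeled
model.fit(X[y != -1], y[y != -1])
n_labeled_final = int((y != -1).sum())
print(f"Labeled samples: 50 at the start, {n_labeled_final} at the end")
```

Note that the 50 original labels are never overwritten; only samples still marked `-1` can receive a pseudo-label.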

### Detailed Explanation

#### Initialize Model
- **Objective**: Create a baseline model using the limited labeled data available.
- **Approach**: Train the model using standard supervised learning techniques on this small dataset.

#### Generate Pseudo-Labels
- **Objective**: Utilize the trained model to predict labels for the unlabeled data.
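- **Approach**: The model assigns the most likely class label to each unlabeled instance, treating these predictions as pseudo-labels. In practice, only the most confident predictions are usually kept, which limits the noise introduced by wrong pseudo-labels.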

#### Combine Data
- **Objective**: Create an expanded training dataset by merging the original labeled data with the newly pseudo-labeled data.
- **Approach**: The combined dataset now includes both the actual labeled examples and the pseudo-labeled examples, increasing the training data size.

#### Retrain Model
- **Objective**: Improve the model's performance by training it on the expanded dataset.
- **Approach**: Retrain the model, allowing it to learn from both true labels and pseudo-labels, enhancing its generalization capabilities.

#### Repeat
- **Objective**: Refine the model iteratively for better accuracy.
- **Approach**: Continue generating pseudo-labels and retraining the model until performance stabilizes or a set number of iterations is completed.

### Pseudo-Labeling

**Pseudo-labeling** is the process of assigning labels to unlabeled data based on predictions from a model trained on labeled data. These pseudo-labels are then treated as ground truth during subsequent training phases, effectively incorporating the unlabeled data into the training process.

- **Purpose**: To utilize the vast amount of unlabeled data by turning it into a form that can assist in model training.
- **Process**: 
  - Predict labels for the unlabeled data using the current model.
  - Treat these predictions as if they were actual labels.
  - Incorporate these pseudo-labeled data points into the training set.

### Key Points to Consider

- **Model Confidence**: Pseudo-labeling relies on the assumption that the model's predictions are reasonably accurate. Low-confidence predictions can introduce noise.
- **Iteration Control**: Monitor performance metrics to decide when to stop iterations. Too many iterations can lead to overfitting.
- **Data Balance**: Ensure that the pseudo-labeled data does not overwhelm the original labeled data, maintaining a balance to avoid biasing the model.
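
Scikit-learn's `SelfTrainingClassifier` exposes the first two of these concerns as parameters: `threshold` controls how confident a prediction must be before it becomes a pseudo-label, and `max_iter` bounds the number of rounds. The sketch below uses synthetic data and arbitrary values (`threshold=0.95`, `max_iter=5`) purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data (assumption); keep 50 true labels and mark the rest as -1
X, y_true = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
y = np.full(len(X), -1)
seed_idx = rng.choice(len(X), size=50, replace=False)
y[seed_idx] = y_true[seed_idx]

# Stricter threshold and fewer rounds than the defaults (0.75 and 10)
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                             threshold=0.95, max_iter=5)
clf.fit(X, y)

print("Stopped because:", clf.termination_condition_)
print("Iterations run:", clf.n_iter_)
print("Still unlabeled:", int((clf.transduction_ == -1).sum()))

# labeled_iter_: 0 = originally labeled, k > 0 = pseudo-labeled in round k,
# -1 = never labeled
rounds, counts = np.unique(clf.labeled_iter_, return_counts=True)
print(dict(zip(rounds.tolist(), counts.tolist())))
```

Inspecting `labeled_iter_` is a simple way to check the third concern: it shows how many pseudo-labels entered the training set in each round relative to the original labeled samples.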

### Implementation Example

To illustrate the self-training process, consider the "Dogs vs Cats" image classification task. Suppose we have:
- A small labeled dataset with 100 images from the `train` set.
- A large unlabeled dataset with the rest of the `train` set.
- The `valid` set, which we use to evaluate the resulting model.

Let's try it now.


In [232]:
import joblib
import numpy as np

# Load pre-extracted features from disk
# These features were previously saved using joblib
train_features = joblib.load('outputs/cats-vs-dogs/train_features.joblib')
valid_features = joblib.load('outputs/cats-vs-dogs/valid_features.joblib')
# test_features = joblib.load('outputs/cats-vs-dogs/test_features.joblib')

# Ensure train_features is a numpy array with a consistent shape
# dtype=object is used because the array contains tuples of different types
train_features = np.array(train_features, dtype=object)

# Create a copy of train_features for self-learning
self_learning_train_features = train_features.copy()

# Randomly select 100 elements from self_learning_train_features to keep as labeled
# Set the labels of the other elements to -1 to indicate they are unlabeled
indices = rng.choice(len(self_learning_train_features), 100, replace=False)
unlabeled_mask = np.ones(len(self_learning_train_features), dtype=bool)
unlabeled_mask[indices] = False
self_learning_train_features[unlabeled_mask, 0] = -1

# Extract feature vectors (X) and labels (y) from self_learning_train_features
X_train = np.array([x for _, _, x in self_learning_train_features])
y_train = np.array([y for y, _, _ in self_learning_train_features])

# Extract feature vectors (X) and labels (y) from valid_features
X_valid = np.array([x for _, _, x in valid_features])
y_valid = np.array([y for y, _, _ in valid_features])

In [233]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((19947, 768), (19947,), (2492, 768), (2492,))

In [234]:
import pandas as pd

# Convert the y_train array into a pandas Series
# This allows us to use pandas' built-in functions for data analysis
y_train_series = pd.Series(y_train)

# Count the occurrences of each unique value in y_train_series
# This provides a summary of the distribution of labels in the training data
label_counts = y_train_series.value_counts()

# Display the counts of each label
label_counts

-1    19847
 1       54
 0       46
Name: count, dtype: int64

In [235]:
pd.Series(y_valid).value_counts()

0    1253
1    1239
Name: count, dtype: int64

In [236]:
# scikit-learn's SelfTrainingClassifier adds self-training to any classifier that implements predict_proba
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, matthews_corrcoef

# Create a random forest classifier
rf = RandomForestClassifier(random_state=271828, n_jobs=-1)

# Create a self-training classifier
self_training_rf = SelfTrainingClassifier(rf)

# Fit the self-training classifier
self_training_rf.fit(X_train, y_train)

# Predict on the validation set
y_valid_pred = self_training_rf.predict(X_valid)

# Show the classification report
print(classification_report(y_valid, y_valid_pred))

# Show Accuracy
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show Matthews Correlation Coefficient
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1253
           1       0.98      1.00      0.99      1239

    accuracy                           0.99      2492
   macro avg       0.99      0.99      0.99      2492
weighted avg       0.99      0.99      0.99      2492

Accuracy: 0.9915730337078652
Matthews Correlation Coefficient: 0.9832383379379628


In [237]:
# termination_condition_ records why self-training stopped. It is a string:
#   'no_change'   - no new pseudo-labels were added in an iteration
#   'max_iter'    - the iteration limit was reached (max_iter, 10 by default)
#   'all_labeled' - every sample received a label before the limit was reached
# The attribute is always set after fit, so it can be printed directly.
print(f"Termination condition met: {self_training_rf.termination_condition_}")

Termination condition met: max_iter


In [238]:
# transduction_ holds the final label of every training sample:
# the original labels, the pseudo-labels added during self-training,
# and -1 for samples that never reached the confidence threshold
transduction_labels_series = pd.Series(self_training_rf.transduction_)

# Count how many samples ended up in each class (or stayed unlabeled)
label_counts = transduction_labels_series.value_counts()

# Display the counts of each label
label_counts

# Note that the final model was trained on far more (pseudo-labeled) data than the 100 initially labeled samples

 1    10000
 0     9671
-1      276
Name: count, dtype: int64

## Label Propagation

**Label propagation** is a semi-supervised learning technique that leverages both labeled and unlabeled data to assign labels to previously unclassified instances. Imagine a social network where some users have declared their political affiliation, while others haven't. Label propagation would attempt to predict the political leanings of these undeclared users based on their connections and the affiliations of their friends.

This method relies on constructing a graph representation of the data, where nodes represent data points (e.g., users in the network) and edges represent relationships or similarities between them (e.g., friendships). The labeled data points act as "anchors," and the algorithm propagates these labels through the graph, influencing the labels of connected nodes.


<p align="center">
  <img src="images/label_propagation.png"  width="80%" height="80%" />
  <br>
</p>

### How Label Propagation Works

1. **Graph Construction:** The algorithm begins by creating a graph where each data point, labeled or unlabeled, is represented by a node. The edges connecting these nodes are weighted based on the similarity between the data points. This similarity can be determined using various measures like Euclidean distance or kernel functions.

2. **Label Initialization:** Initially, the labeled data points retain their known labels, while the unlabeled points are assigned a uniform distribution over the possible labels. For instance, in a binary classification problem, unlabeled points might start with a 50/50 probability for each class.

3. **Label Propagation:** The algorithm iteratively updates the label probabilities of unlabeled nodes based on the labels of their neighbors.  This propagation can be envisioned as a "diffusion" process, where the label information flows from the labeled "anchors" to the unlabeled nodes through the edges of the graph. The strength of this influence is determined by the edge weights, with stronger connections carrying more weight.

4. **Convergence:** The process continues until the label probabilities for the unlabeled nodes stabilize, meaning further iterations result in minimal changes. At this point, the algorithm assigns the label with the highest probability to each unlabeled node.
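
The four steps above can be traced on a toy example. The sketch below is a plain NumPy illustration with hard clamping, not scikit-learn's implementation; the six one-dimensional points and `gamma = 0.5` are made up for the example.

```python
import numpy as np

# Six points on a line forming two groups; only the two end points are labeled
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
y = np.array([0, -1, -1, -1, -1, 1])   # -1 marks "unlabeled"
n_classes = 2

# 1. Graph construction: RBF edge weights, then row-normalize into transition probabilities
gamma = 0.5
W = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
T = W / W.sum(axis=1, keepdims=True)

# 2. Label initialization: one-hot rows for labeled points, uniform rows otherwise
labeled = y != -1
F = np.full((len(x), n_classes), 1.0 / n_classes)
F[labeled] = np.eye(n_classes)[y[labeled]]
clamp = F[labeled].copy()

# 3. Propagation: each node takes the weighted average of its neighbors' distributions
for _ in range(1000):
    F_new = T @ F
    F_new[labeled] = clamp              # hard clamping: known labels never change
    converged = np.abs(F_new - F).max() < 1e-9
    F = F_new
    if converged:                       # 4. Convergence
        break

predicted = F.argmax(axis=1)
print(predicted)  # [0 0 0 1 1 1]
```

Each unlabeled point ends up with the label of the anchor in its own group, because the edge weights between the two groups are practically zero.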

### Model Features

* **Label Clamping:**  During the propagation process, the algorithm can handle the initial labeled data in two primary ways:
    * **Hard Clamping:**  The labels of the initially labeled data points remain fixed throughout the iterations, ensuring they don't change.
    * **Soft Clamping:** Allows for some flexibility in the initial labels. This means the assigned labels can change slightly during each iteration, controlled by a parameter (alpha). This flexibility can be beneficial if there's a chance of noise or errors in the initial labeling.

* **Kernel:** The choice of kernel function influences how the similarity between data points is measured, which in turn affects the edge weights in the graph.
    * **RBF Kernel:**  Creates a dense matrix, potentially leading to higher computational costs, especially with large datasets.
    * **KNN Kernel:** Constructs a sparse matrix by connecting each data point only to its 'k' nearest neighbors. This results in faster computation, particularly for large datasets. 
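
The difference between the two kernels is easy to see by building both affinity matrices for the same data. The sketch below uses `rbf_kernel` and `kneighbors_graph` from scikit-learn on random points. The data, `gamma=0.5`, and `n_neighbors=7` are illustrative assumptions, and the code only mimics the kind of matrix each kernel produces rather than reproducing the estimators' internals.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # synthetic points (assumption)

# RBF: every pair of points gets a positive weight, so the matrix is dense
W_rbf = rbf_kernel(X, gamma=0.5)

# KNN: each point is connected only to its 7 nearest neighbors (itself included),
# so the matrix can be stored in sparse form
W_knn = kneighbors_graph(X, n_neighbors=7, mode="connectivity", include_self=True)

print("RBF non-zero entries:", np.count_nonzero(W_rbf), "of", W_rbf.size)
print("KNN non-zero entries:", W_knn.nnz, "of", W_knn.shape[0] * W_knn.shape[1])
```

The dense matrix grows quadratically with the number of samples, while the sparse one grows linearly, which is why the KNN kernel scales better to large datasets.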

### Advantages and Limitations

**Advantages:**

* **Utilizes Unlabeled Data:** Label propagation effectively leverages the information present in unlabeled data, which is often abundant and cheaper to obtain than labeled data.
* **Simple and Intuitive:** The core concept of propagating labels based on graph connectivity is relatively straightforward to grasp.

**Limitations:**

* **Computational Cost:** Constructing and manipulating the graph, especially with dense matrices, can be computationally expensive for large datasets.
* **Sensitivity to Graph Structure:** The performance of label propagation heavily relies on the quality of the graph representation. Poorly constructed graphs with inaccurate similarity measures can lead to inaccurate label assignments. 


Let's try it now.


In [239]:
# Load pre-extracted features from disk using joblib
# These features were previously saved and contain tuples of (label, file_path, feature_vector)
train_features = joblib.load('outputs/cats-vs-dogs/train_features.joblib')
valid_features = joblib.load('outputs/cats-vs-dogs/valid_features.joblib')

# Ensure train_features is a numpy array with a consistent shape
# dtype=object is used because the array contains tuples of different types
train_features = np.array(train_features, dtype=object)

# Create a copy of train_features for label propagation purposes
label_propagation_train_features = train_features.copy()

# Randomly select 100 elements from label_propagation_train_features to be labeled
# Set the labels of the other elements to -1 to indicate they are unlabeled
labeled_indices = rng.choice(len(label_propagation_train_features), 100, replace=False)
unlabeled_mask = np.ones(len(label_propagation_train_features), dtype=bool)
unlabeled_mask[labeled_indices] = False
label_propagation_train_features[unlabeled_mask, 0] = -1

# Extract feature vectors (X) and labels (y) from label_propagation_train_features
X_train = np.array([x for _, _, x in label_propagation_train_features])
y_train = np.array([y for y, _, _ in label_propagation_train_features])

# Extract feature vectors (X) and labels (y) from valid_features
X_valid = np.array([x for _, _, x in valid_features])
y_valid = np.array([y for y, _, _ in valid_features])

# Identify the indices of the unlabeled data points in the training set
unlabeled_indices = np.where(y_train == -1)[0]

In [240]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((19947, 768), (19947,), (2492, 768), (2492,))

In [241]:
len(unlabeled_indices)

19847

In [242]:
from sklearn.semi_supervised import LabelPropagation

# Create a label propagation model
label_propagation = LabelPropagation(kernel='knn', n_jobs=-1, max_iter=2000)

# Fit the label propagation model
label_propagation.fit(X_train, y_train)

# transduction_ holds the label assigned to every training sample
# (the original labels plus the propagated ones)
y_train_label_propagation = label_propagation.transduction_

# Train a random forest classifier on the updated training data
rf = RandomForestClassifier(random_state=271828, n_jobs=-1)
rf.fit(X_train, y_train_label_propagation)

# Predict on the validation set
y_valid_pred = rf.predict(X_valid)

# Show the classification report
print(classification_report(y_valid, y_valid_pred))

# Show Accuracy
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show Matthews Correlation Coefficient
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")


              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1253
           1       0.99      0.98      0.98      1239

    accuracy                           0.98      2492
   macro avg       0.98      0.98      0.98      2492
weighted avg       0.98      0.98      0.98      2492

Accuracy: 0.9839486356340289
Matthews Correlation Coefficient: 0.9679553454117247


In [243]:
np.bincount(y_train_label_propagation)

array([10435,  9512])

In [244]:
# Now the same thing, but with a different kernel

# Create a label propagation model
label_propagation = LabelPropagation(kernel='rbf', n_jobs=-1, max_iter=2000)

# Fit the label propagation model
label_propagation.fit(X_train, y_train)

# transduction_ holds the label assigned to every training sample
# (the original labels plus the propagated ones)
y_train_label_propagation = label_propagation.transduction_

# Train a random forest classifier on the updated training data
rf = RandomForestClassifier(random_state=271828, n_jobs=-1)
rf.fit(X_train, y_train_label_propagation)

# Predict on the validation set
y_valid_pred = rf.predict(X_valid)

# Show the classification report
print(classification_report(y_valid, y_valid_pred))

# Show Accuracy
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show Matthews Correlation Coefficient
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")

              precision    recall  f1-score   support

           0       0.50      1.00      0.67      1253
           1       1.00      0.00      0.00      1239

    accuracy                           0.50      2492
   macro avg       0.75      0.50      0.34      2492
weighted avg       0.75      0.50      0.34      2492

Accuracy: 0.5040128410914928
Matthews Correlation Coefficient: 0.034913071783961136


In [245]:
np.bincount(y_train_label_propagation)

array([19802,   145])
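
One plausible reason for this collapse, offered as an assumption rather than something verified on these embeddings: scikit-learn's RBF kernel weights a pair of points by $\exp(-\gamma \lVert x - x' \rVert^2)$ with a default `gamma` of 20, and in a 768-dimensional space squared distances are often large enough for those weights to underflow to zero, so almost no label information travels between points. The sketch below uses random stand-in vectors (not the notebook's features) to show the effect and how a much smaller `gamma` restores non-zero weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for 768-dimensional embeddings (NOT the cats-vs-dogs features)
X = rng.normal(size=(50, 768))

# Pairwise squared Euclidean distances
sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
np.fill_diagonal(sq_dists, 0.0)

# Compare the largest weight between two distinct points for two gamma values
off_diagonal = ~np.eye(len(X), dtype=bool)
for gamma in (20.0, 1e-3):
    weights = np.exp(-gamma * sq_dists)
    print(f"gamma={gamma}: largest off-diagonal weight = {weights[off_diagonal].max():.3e}")
```

If this is what happens on the real features, lowering `gamma` or rescaling the features would be the first things to try. The `knn` kernel is less exposed to the problem because it connects each point to its nearest neighbors regardless of how large the distances are.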

In [246]:
# Now the same thing, but with a slightly different approach called Label Spreading
from sklearn.semi_supervised import LabelSpreading

# Create a label spreading model
label_spreading = LabelSpreading(kernel='knn', n_jobs=-1, max_iter=2000)

# Fit the label spreading model   
label_spreading.fit(X_train, y_train)

# Predict on the training unlabeled data
y_train_label_spreading = label_spreading.transduction_

# Train a random forest classifier on the updated training data
rf = RandomForestClassifier(random_state=271828, n_jobs=-1)
rf.fit(X_train, y_train_label_spreading)

# Predict on the validation set
y_valid_pred = rf.predict(X_valid)

# Show the classification report
print(classification_report(y_valid, y_valid_pred))

# Show Accuracy
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show Matthews Correlation Coefficient
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")



              precision    recall  f1-score   support

           0       0.92      0.99      0.96      1253
           1       0.99      0.92      0.95      1239

    accuracy                           0.96      2492
   macro avg       0.96      0.95      0.95      2492
weighted avg       0.96      0.96      0.95      2492

Accuracy: 0.9550561797752809
Matthews Correlation Coefficient: 0.9127863817792571


## Co-Training

**Co-training** is a technique in semi-supervised learning that leverages multiple models trained on different views of the data. This method uses the predictions of one model to improve the training of another, making it particularly effective when data can be naturally divided into distinct feature sets or views. Each view should capture different aspects of the underlying structure of the data.

<br>
<p align="center">
  <img src="images/cotraining.png"  width="80%" height="80%" />
  <br>
</p>
 <br>
 
### Key Concepts

- **Different Views of Data**: In co-training, the data is split into different subsets of features or representations, referred to as *views*. For example, in a dogs-cats dataset, consider an image embedding with 768 dimensions:
  - One view could be 384 dimensions of the embedding.
  - Another view could be the remaining 384 dimensions of the embedding.
- **Model Training and Interaction**: Two models are trained separately on each view. These models then exchange predictions on the unlabeled data, allowing each model to learn from the other's predictions.
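
As a minimal sketch of the view split described above, the snippet below cuts a stand-in embedding matrix (random numbers, not real image features) into two 384-dimensional views. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for image embeddings: 5 samples, 768 dimensions each
embeddings = rng.normal(size=(5, 768))

# View 1: the first 384 dimensions; view 2: the remaining 384 dimensions
view_1, view_2 = embeddings[:, :384], embeddings[:, 384:]

print(view_1.shape, view_2.shape)
```

Each model would then be trained on one of the two arrays only.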

### Traditional Co-Training Process

1. **Initial Training**:
   - The method starts by checking and validating the input data.
   - It then randomly splits the features into two views.
   - Two base classifiers are initialized and trained on the labeled data, each using its respective view.

2. **Co-training Loop**:
   - The main loop runs for `n_iter` iterations or until there's no more unlabeled data.
   - In each iteration:
      - A pool of unlabeled data is randomly selected
      - Both classifiers make predictions on this pool
      - The most confident positive and negative predictions from each classifier are selected
      - These selected examples are added to the labeled dataset
      - The selected examples are removed from the unlabeled dataset
      - Both classifiers are retrained on the updated labeled dataset

3. **Final Prediction**:
   - For making predictions, the classifier:
      - Gets predictions from both views. 
      - Averages these predictions. 
      - Returns either the class label (`predict`) or the class probabilities (`predict_proba`).

The key idea behind co-training is that each view of the data can provide different, complementary information. By allowing each classifier to "teach" the other with its most confident predictions, the algorithm can leverage unlabeled data to improve overall performance. The random split of features into views adds an element of diversity, potentially capturing different aspects of the data in each view.
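
The three steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the implementation behind `MultiViewCoTrainingClassifier`: the data is synthetic, `CentroidScorer` is a hypothetical nearest-centroid stand-in for the base classifiers, and `co_train` and its parameters (`pool_size`, `per_class`) are names invented for this example.

```python
import numpy as np


class CentroidScorer:
    """Tiny stand-in classifier: probabilities from distances to the class centroids."""

    def fit(self, X, y):
        self.centroids_ = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
        return self

    def predict_proba(self, X):
        sq_dist = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        logits = -sq_dist
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)


def co_train(X_lab, y_lab, X_unlab, n_iter=10, pool_size=200, per_class=5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random split of the feature columns into two views, then initial training
    views = np.array_split(rng.permutation(X_lab.shape[1]), 2)
    clfs = [CentroidScorer().fit(X_lab[:, v], y_lab) for v in views]

    # Step 2: co-training loop
    for _ in range(n_iter):
        if len(X_unlab) == 0:
            break
        pool = rng.choice(len(X_unlab), size=min(pool_size, len(X_unlab)), replace=False)
        picked = {}                                      # unlabeled row index -> pseudo-label
        for clf, v in zip(clfs, views):
            p_pos = clf.predict_proba(X_unlab[pool][:, v])[:, 1]
            order = np.argsort(p_pos)
            for i in order[:per_class]:                  # most confident negatives
                picked.setdefault(int(pool[i]), 0)
            for i in order[-per_class:]:                 # most confident positives
                picked.setdefault(int(pool[i]), 1)
        rows = np.array(list(picked.keys()))
        pseudo = np.array(list(picked.values()))
        X_lab = np.vstack([X_lab, X_unlab[rows]])        # grow the labeled set
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = np.delete(X_unlab, rows, axis=0)       # shrink the unlabeled set
        clfs = [CentroidScorer().fit(X_lab[:, v], y_lab) for v in views]
    return clfs, views


def predict_proba(clfs, views, X):
    # Step 3: average the positive-class probability over both views
    return np.mean([clf.predict_proba(X[:, v])[:, 1] for clf, v in zip(clfs, views)], axis=0)


# Synthetic two-class data (class means at -1 and +1 in every dimension)
rng = np.random.default_rng(42)

def sample(y):
    return rng.normal(size=(len(y), 20)) + (2 * y[:, None] - 1)

y_lab = np.array([0] * 10 + [1] * 10)
y_unlab_hidden = rng.integers(0, 2, size=800)            # never shown to the learner
y_test = rng.integers(0, 2, size=500)
X_lab, X_unlab, X_test = sample(y_lab), sample(y_unlab_hidden), sample(y_test)

clfs, views = co_train(X_lab, y_lab, X_unlab)
accuracy = ((predict_proba(clfs, views, X_test) > 0.5).astype(int) == y_test).mean()
print(f"test accuracy: {accuracy:.3f}")
```

When both classifiers pick the same row, the sketch keeps the first pseudo-label it saw; a fuller implementation might resolve such conflicts by confidence instead.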

### Advantages of Co-Training

- **Improved Learning**: By leveraging multiple views, co-training can improve learning performance, especially when labeled data is scarce.
- **Complementary Information**: Different views can provide complementary information, making the combined model more robust and accurate.

### Multi-View Learning

**Multi-view learning** generalizes the concept of co-training by combining multiple views of data. This approach offers more flexibility and can capture more complex data structures.

#### Differences from Traditional Co-Training

- **Flexibility**: Unlike traditional co-training, which typically involves two models, multi-view learning can integrate more than two views.
- **Complex Structures**: It is capable of modeling more complex relationships and dependencies among the views.
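
Going from two views to several can be as simple as partitioning the feature columns into more groups. The helper below is a hypothetical sketch of such a split (`split_into_views` is not part of any library, and it is an assumption that a multi-view classifier would split features this way).

```python
import numpy as np


def split_into_views(X, n_views, rng):
    """Randomly partition the feature columns of X into n_views disjoint groups."""
    shuffled_columns = rng.permutation(X.shape[1])
    column_groups = np.array_split(shuffled_columns, n_views)
    return [X[:, cols] for cols in column_groups]


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 768))           # stand-in for 768-dimensional embeddings

for n_views in (2, 4, 6, 8):
    views = split_into_views(X, n_views, rng)
    print(n_views, "views with widths", [v.shape[1] for v in views])
```

With a fixed number of features, more views means fewer features per view, so each individual model sees less information.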

### Potential Questions and Misconceptions

- **What if the views are not truly independent?**
  - The effectiveness of co-training relies on the assumption that the views are conditionally independent given the class label. If this assumption is violated, the performance might degrade.
- **Can co-training be applied to any type of data?**
  - Co-training is particularly useful when data can be naturally split into distinct views. If such views are not available, other semi-supervised learning techniques might be more appropriate.


In [247]:
# Load pre-extracted features from disk using joblib
# These features were previously saved and contain tuples of (label, file_path, feature_vector)
train_features = joblib.load('outputs/cats-vs-dogs/train_features.joblib')
valid_features = joblib.load('outputs/cats-vs-dogs/valid_features.joblib')

# Ensure train_features is a numpy array with a consistent shape
# dtype=object is used because the array contains tuples of different types
train_features = np.array(train_features, dtype=object)

# Create a copy of train_features for co-training purposes
co_training_train_features = train_features.copy()

# Randomly select 100 elements from co_training_train_features to be labeled
# Set the labels of the other elements to -1 to indicate they are unlabeled
labeled_indices = rng.choice(len(co_training_train_features), 100, replace=False)
co_training_train_features[:, 0][~np.isin(np.arange(len(co_training_train_features)), labeled_indices)] = -1

# Extract feature vectors (X) and labels (y) from co_training_train_features
X_train = np.array([x for _, _, x in co_training_train_features])
y_train = np.array([y for y, _, _ in co_training_train_features])

# Extract feature vectors (X) and labels (y) from valid_features
X_valid = np.array([x for _, _, x in valid_features])
y_valid = np.array([y for y, _, _ in valid_features])

# Identify the indices of the unlabeled data points in the training set
unlabeled_indices = np.where(y_train == -1)[0]

In [248]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((19947, 768), (19947,), (2492, 768), (2492,))

In [249]:
# Separate the labeled and unlabeled training data
# Labeled data has labels not equal to -1
# Unlabeled data has labels equal to -1

# Extract feature vectors and labels for labeled training data
X_train_labeled = X_train[y_train != -1]
y_train_labeled = y_train[y_train != -1]

# Extract feature vectors for unlabeled training data
X_train_unlabeled = X_train[y_train == -1]

# Display the shapes of the labeled and unlabeled datasets
# This helps in understanding the distribution of labeled and unlabeled data
X_train_labeled.shape, y_train_labeled.shape, X_train_unlabeled.shape

((100, 768), (100,), (19847, 768))

In [250]:
from sklearn.metrics import accuracy_score, matthews_corrcoef
from helpers.semisupervised import MultiViewCoTrainingClassifier

In [251]:
for n_views in [2, 4, 6, 8]:

    # Initialize and train the co-training classifier
    co_clf = MultiViewCoTrainingClassifier(n_views=n_views, n_iter=20)
    co_clf.fit(X_train_labeled, y_train_labeled, X_train_unlabeled)

    # Predict on the validation set
    y_valid_pred = co_clf.predict(X_valid)

    # Show Accuracy
    print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)} - {n_views} views")

    # Show Matthews Correlation Coefficient
    print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)} - {n_views} views")

Accuracy: 0.9923756019261637 - 2 views
Matthews Correlation Coefficient: 0.9848052156854211 - 2 views
Accuracy: 0.992776886035313 - 4 views
Matthews Correlation Coefficient: 0.985616441444488 - 4 views
Accuracy: 0.9915730337078652 - 6 views
Matthews Correlation Coefficient: 0.9832383379379628 - 6 views
Accuracy: 0.9891653290529695 - 8 views
Matthews Correlation Coefficient: 0.9784989168942184 - 8 views


> As we can see, semi-supervised learning techniques like self-training, label propagation, and co-training offer powerful tools for leveraging unlabeled data to improve model performance. By incorporating the information present in the unlabeled data, these methods can enhance the generalization capabilities of models trained on limited labeled data.
>
> Understanding the underlying principles and implementation details of these techniques is essential for effectively applying them to real-world problems.
> 
> You must be aware of the assumptions and limitations of these methods to ensure their successful application in practice. Always experiment with different hyperparameters, monitor performance metrics, and validate the results to determine the effectiveness of semi-supervised learning techniques in your specific use case.

## Introduction to PU Learning

PU learning, or Positive-Unlabeled learning, is a specialized form of semi-supervised learning designed for scenarios where we only have access to **positive (P)** and **unlabeled (U)** data, lacking explicitly labeled negative instances. This approach proves particularly valuable in data-centric AI applications where:

- **Labeled data is scarce or expensive:** Obtaining labeled data can be resource-intensive, making PU learning an attractive alternative.
- **Positive instances are easily identifiable:** In some domains, identifying positive cases is straightforward, while labeling negatives might be ambiguous or costly.
- **Unlabeled data is abundant:** PU learning leverages the readily available unlabeled data to enhance model performance.

<p align="center">
  <img src="images/pulearn.webp"  width="80%" height="80%" />
  <br>
</p>

PU learning offers a powerful framework for leveraging partially labeled datasets in data-centric AI applications. By transforming the problem and relying on carefully considered assumptions, we can train effective models even when traditional fully-labeled datasets are unavailable or impractical to obtain.


## Core Concepts and Notation

Let's define our dataset as $(x, y^*, \tilde{y})$, where:

* $x$ represents the input features.
* $y^*$ represents the true target variable (unobserved), which we aim to predict.
* $\tilde{y}$ represents the observed label (positive or unlabeled).

In PU learning, our goal is to estimate the probability of an instance being positive given its features:

$$f(x) = P(y^* = 1 | x)$$

However, due to the absence of labeled negative instances, we can only directly estimate:

$$\tilde{f}(x) = P(\tilde{y} = 1 | x)$$

where $\tilde{f}(x)$ is the probability that an instance is *labeled* positive, which is what a model trained on the observed labels $\tilde{y}$ learns to predict.
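
To make the triple $(x, y^*, \tilde{y})$ concrete, here is a small simulation. Everything in it is an assumption made for illustration: one Gaussian feature, a logistic relation between $x$ and $y^*$, and a labeling rate of 0.3 among positives.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
label_frequency = 0.3                     # assumed P(y_tilde = 1 | y_star = 1)

x = rng.normal(size=n)                                             # input feature
y_star = (rng.random(n) < 1 / (1 + np.exp(-3 * x))).astype(int)    # hidden true label
# Only true positives can receive a label, each with the same probability
y_tilde = ((y_star == 1) & (rng.random(n) < label_frequency)).astype(int)

print("true positives:   ", y_star.sum())
print("labeled positives:", y_tilde.sum())
print("unlabeled:        ", (y_tilde == 0).sum())
```

In a real PU dataset only `x` and `y_tilde` would be available; `y_star` exists here because the data is simulated.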



## Key Assumptions

PU learning relies on several crucial assumptions to function effectively:

1. **Positive Label Reliability:** All instances labeled as positive are indeed positive. This assumption implies:

   $$P(\tilde{y} = 1 | x, y^* = 0) = 0$$

2. **Unlabeled Data Composition:** Unlabeled instances can belong to either the positive or negative class.

3. **Label Flipping Independence:** The probability that a positive instance is left unlabeled is independent of its features, i.e., the labeled positives are selected completely at random from all positives. This assumption, while strong, is necessary for tractability and can be expressed as:

   $$P(\tilde{y} = 0 | x, y^* = 1) = P(\tilde{y} = 0 | y^* = 1)$$
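
The first and third assumptions can be checked directly on simulated data, where the true labels are known. The sketch below (synthetic data, hypothetical names) compares a labeling process that satisfies assumption 3 with one that violates it by labeling high-$x$ positives more often.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
y_star = rng.random(n) < 1 / (1 + np.exp(-3 * x))        # hidden true labels
median_pos_x = np.median(x[y_star])

# Labeling that satisfies the assumptions: constant rate 0.3 among positives
y_scar = y_star & (rng.random(n) < 0.3)
# Labeling that violates assumption 3: the rate depends on x
y_biased = y_star & (rng.random(n) < np.where(x > median_pos_x, 0.5, 0.1))


def label_rates(y_tilde):
    """Label rate among true positives below and above the positives' median x."""
    pos_x, pos_labeled = x[y_star], y_tilde[y_star]
    upper = pos_x > median_pos_x
    return round(float(pos_labeled[~upper].mean()), 3), round(float(pos_labeled[upper].mean()), 3)


print("labeled negatives (assumption 1):", int((y_scar & ~y_star).sum()))
print("constant labeling, rates (low x, high x):", label_rates(y_scar))
print("biased labeling,   rates (low x, high x):", label_rates(y_biased))
```

With real data this check is not possible, since $y^*$ is unobserved; the assumption has to be argued from how the labels were collected.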



## The PU Learning Transformation

The fundamental idea behind PU learning is to transform the problem from predicting the unknown true target variable $y^*$ to predicting the observed positive class label $\tilde{y} = 1$. This transformation allows us to effectively use the information contained within the unlabeled data.



### Key Lemma

This transformation hinges on a key lemma that connects the true positive probability to our observable probabilities:

$$P(y^* = 1 | x) = \frac{P(\tilde{y} = 1 | x)}{c}$$

where $c = P(\tilde{y} = 1 | y^* = 1)$ is the **label frequency**: the probability that a positive instance is observed as positive in our data. It should not be confused with the class prior $P(y^* = 1)$, the overall share of positives.
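
A short numeric illustration of the lemma: with $c = 0.2$, a labeled-probability of 0.10 corresponds to a positive-probability of 0.5. The helper name is hypothetical, and the clipping to $[0, 1]$ is a choice made here because estimated ratios can exceed 1 when $c$ is underestimated.

```python
import numpy as np


def positive_probability(p_labeled, c):
    """Apply P(y*=1 | x) = P(y_tilde=1 | x) / c and clip the result to [0, 1]."""
    return np.clip(np.asarray(p_labeled, dtype=float) / c, 0.0, 1.0)


p_labeled = np.array([0.02, 0.10, 0.19, 0.25])
print(positive_probability(p_labeled, c=0.2))
```

The last input (0.25) is larger than $c$, so the raw ratio would be 1.25 and is clipped to 1.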


### Proof Sketch

1. We start by marginalizing the joint probability:

   $$P(\tilde{y} = 1 | x) = P(y^* = 1, \tilde{y} = 1 | x) + P(y^* = 0, \tilde{y} = 1 | x)$$

2. Applying the positive label reliability assumption ($P(\tilde{y} = 1 | x, y^* = 0) = 0$), the second term vanishes:

   $$P(\tilde{y} = 1 | x) = P(y^* = 1, \tilde{y} = 1 | x)$$

3. Using the definition of conditional probability, we can rewrite this as:

   $$P(\tilde{y} = 1 | x) = P(y^* = 1 | x) \cdot P(\tilde{y} = 1 | y^* = 1, x)$$

4. Applying the label flipping independence assumption:

   $$P(\tilde{y} = 1 | x) = P(y^* = 1 | x) \cdot P(\tilde{y} = 1 | y^* = 1)$$

5. Rearranging the terms leads us to the key lemma:

   $$P(y^* = 1 | x) = \frac{P(\tilde{y} = 1 | x)}{P(\tilde{y} = 1 | y^* = 1)} = \frac{P(\tilde{y} = 1 | x)}{c}$$
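
The lemma can also be checked by simulation. In the sketch below (synthetic data, a discrete feature with three values, and $c = 0.25$ chosen arbitrarily), the labeled frequency per feature value divided by $c$ should recover the true positive probability.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 300_000, 0.25
f_true = np.array([0.1, 0.5, 0.9])            # P(y*=1 | x) for x in {0, 1, 2}

x = rng.integers(0, 3, size=n)
y_star = rng.random(n) < f_true[x]            # hidden true labels
y_tilde = y_star & (rng.random(n) < c)        # labels only on positives, independent of x

# Empirical P(y_tilde=1 | x) for each feature value, then the lemma's division by c
p_labeled = np.array([y_tilde[x == k].mean() for k in range(3)])
f_recovered = p_labeled / c

print("true     :", f_true)
print("recovered:", np.round(f_recovered, 2))
```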




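## Estimating the Label Frequency (c)

To bridge the gap between our model's predictions $\tilde{f}(x)$ and the true positive probabilities $f(x)$, we need to estimate the label frequency $c$. Since $P(\tilde{y} = 1) = c \cdot P(y^* = 1)$ and $P(\tilde{y} = 1)$ is directly observable, knowing $c$ is equivalent to knowing the class prior $P(y^* = 1)$, so methods that estimate either quantity can be used. Common approaches include: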

1. **Proportion-based Estimation:**
   This simple method trains a classifier on the combined set of positive and unlabeled instances and estimates $c$ as the average predicted probability of being labeled over a held-out set of labeled positives:

   $$c \approx \frac{1}{|P|} \sum_{x \in P} P(\tilde{y} = 1 | x)$$

   where $P$ is the held-out set of labeled positive instances and $|P|$ is its size.

2. **Spy Technique:**
   This method involves "spying" on the unlabeled data by mixing a small subset of positive examples with the unlabeled set. By observing how these "spy" instances are classified, we can better estimate the class prior.

3. **Expectation-Maximization (EM) Algorithm:**
   The EM algorithm can be adapted for PU learning to iteratively estimate the class prior and refine the classifier. This approach alternates between:
   - E-step: Estimating the probability of each unlabeled instance being positive.
   - M-step: Updating the classifier parameters and the class prior estimate.

4. **Kernel Mean Matching (KMM):**
   This non-parametric method estimates the class prior by matching the means of the positive and unlabeled data distributions in a high-dimensional feature space.

The choice of method depends on factors such as dataset size, computational resources, and the specific characteristics of the problem at hand. We'll stick to the proportion-based estimation for simplicity in this class.
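
A minimal sketch of the hold-out estimate of $c$, on a deliberately easy synthetic problem. The assumptions are: a single categorical feature, classes that are perfectly separable, a per-category frequency table standing in for the classifier, and a random hold-out split of the whole PU dataset (the notebook's `fit_positive_unlabeled_estimator` holds out labeled positives only).

```python
import numpy as np

rng = np.random.default_rng(7)
n, true_c = 60_000, 0.3                       # true_c is known only because this is a simulation

# Separable toy problem: category 2 is always positive, categories 0 and 1 never are
x = rng.integers(0, 3, size=n)
y_star = x == 2
labeled = y_star & (rng.random(n) < true_c)

# Random train / hold-out split of the whole PU dataset
hold_out = rng.random(n) < 0.25

# "Classifier" for P(labeled | x): labeled fraction per category on the training part
g = np.array([labeled[~hold_out & (x == k)].mean() for k in range(3)])

# Estimate c as the mean score over held-out examples that carry a label
c_hat = g[x[hold_out & labeled]].mean()
print(f"estimated c = {c_hat:.3f} (true value {true_c})")
```

When the classes overlap, the scores of some labeled positives fall below $c$, so this average tends to underestimate $c$.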



### Application in Binary Classification

PU learning finds frequent application in binary classification tasks, where the true target variable $y^*$ takes on values of 0 or 1. In this context:

* Positive instances: $y^* = 1$ (known)
* Unlabeled instances: $y^* \in \{0, 1\}$ (unknown)

Our dataset can be represented as pairs $(x, \tilde{y})$ with $\tilde{y} \in \{0, 1\}$, where $\tilde{y}$ represents the observed labels (1 for positive, 0 for unlabeled).


#### PU Learning Workflow

The typical workflow for PU learning in binary classification involves the following steps:

1. **Data Preparation:**
   - Separate the dataset into positive instances ($P$) and unlabeled instances ($U$).
   - Combine the positive and unlabeled instances into a single dataset $(X, \tilde{y})$.

2. **Label Frequency Estimation:**
   - Estimate the label frequency $c$ using one of the methods discussed earlier.

3. **Model Training:**
   - Train a classifier on the combined positive and unlabeled dataset $(X, \tilde{y})$.
   - Divide the model's predictions by the estimated $c$ to obtain $P(y^* = 1 | x)$.

4. **Model Evaluation:**
    - Evaluate the model's performance on a separate test set or through cross-validation.
    - Assess the model's ability to generalize to new data and make accurate predictions.
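
The four steps can be sketched end to end on simulated data. This is an illustration under assumptions, not the notebook's pipeline: one Gaussian feature per class, a binned frequency table standing in for the classifier, and evaluation against true labels that are available only because the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(11)
n, true_c = 40_000, 0.3                                  # true_c is known only in simulation

# Step 1: data preparation (simulated PU data with one feature)
y_star = rng.integers(0, 2, size=n)                      # hidden true labels
x = rng.normal(loc=np.where(y_star == 1, 2.0, -2.0))
y_tilde = ((y_star == 1) & (rng.random(n) < true_c)).astype(int)
is_valid = rng.random(n) < 0.25                          # random validation split

edges = np.linspace(-5, 5, 21)
n_bins = len(edges) - 1

def to_bin(values):
    """Map feature values to bin indices 0..n_bins-1, clipping the tails."""
    return np.clip(np.digitize(values, edges), 1, n_bins) - 1

# Step 3 (training): estimate P(y_tilde=1 | x) as the labeled fraction per bin
train_bins = to_bin(x[~is_valid])
labeled = np.bincount(train_bins, weights=y_tilde[~is_valid].astype(float), minlength=n_bins)
total = np.bincount(train_bins, minlength=n_bins)
g = labeled / np.maximum(total, 1)

# Step 2: label frequency from validation rows that carry a positive label
c_hat = g[to_bin(x[is_valid & (y_tilde == 1)])].mean()

# Step 3 (adjustment): divide by c_hat to get an estimate of P(y*=1 | x)
f_hat = np.clip(g[to_bin(x[is_valid])] / c_hat, 0, 1)

# Step 4: evaluation against the hidden labels of the validation rows
accuracy = ((f_hat > 0.5).astype(int) == y_star[is_valid]).mean()
print(f"c_hat = {c_hat:.3f}, validation accuracy = {accuracy:.3f}")
```

Step 2 appears after the training part of step 3 in the code because this particular estimate of $c$ needs a fitted model.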

    

In [252]:
# Load pre-extracted features from disk using joblib
# These features were previously saved and contain tuples of (label, file_path, feature_vector)
train_features = joblib.load('outputs/cats-vs-dogs/train_features.joblib')
valid_features = joblib.load('outputs/cats-vs-dogs/valid_features.joblib')

# Ensure train_features is a numpy array with a consistent shape
# dtype=object is used because the array contains tuples of different types
train_features = np.array(train_features, dtype=object)

# Create a copy of train_features for PU learning purposes
pu_learn_train_features = train_features.copy()

# Identify indices of positive samples (label == 1)
positive_indices = np.where(pu_learn_train_features[:, 0] == 1)[0]

# Randomly select 200 positive samples
positive_indices_sample = rng.choice(positive_indices, 200, replace=False)

# Randomly select 2000 samples to be treated as unlabeled
unlabeled_indices = rng.choice(len(pu_learn_train_features), 2000, replace=False)

# Combine the selected positive samples and unlabeled samples
indices_to_keep = np.concatenate([positive_indices_sample, unlabeled_indices])

# Extract feature vectors and labels for the selected positive samples
X_train_positive = np.array([x for _, _, x in pu_learn_train_features[positive_indices_sample]])
y_train_positive = np.array([y for y, _, _ in pu_learn_train_features[positive_indices_sample]])

# Extract feature vectors for the selected unlabeled samples
# Unlabeled samples are assigned a label of 0
X_train_unlabeled = np.array([x for _, _, x in pu_learn_train_features[unlabeled_indices]])
y_train_unlabeled = np.zeros(len(unlabeled_indices))

# Combine the positive and unlabeled samples to form the training set
X_train = np.concatenate([X_train_positive, X_train_unlabeled])
y_train = np.concatenate([y_train_positive, y_train_unlabeled]).astype(int)

# Extract feature vectors and labels for the entire training set
X_train_full = np.array([x for _, _, x in pu_learn_train_features])
y_train_full = np.array([y for y, _, _ in pu_learn_train_features]).astype(int)

# Extract feature vectors and labels for the validation set
X_valid = np.array([x for _, _, x in valid_features])
y_valid = np.array([y for y, _, _ in valid_features]).astype(int)

# Display the shapes of the training and validation datasets
# This helps in understanding the distribution of data
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((2200, 768), (2200,), (2492, 768), (2492,))

In [253]:
# Establish our baseline: the same model trained on the fully labeled dataset
# Note: This is for demonstration purposes; in a real PU setting the true labels for the whole training set would not be available.

# Import necessary libraries from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, matthews_corrcoef
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Initialize the RandomForestClassifier model with specified parameters
# random_state=271828: Seed for random number generator to ensure reproducibility
# n_jobs=-1: Use all available CPU cores for parallel processing
model = RandomForestClassifier(random_state=271828, n_jobs=-1)

# Train the model on the full supervised dataset
model.fit(X_train_full, y_train_full)

# Predict the target values for the validation set using the trained model
y_valid_pred = model.predict(X_valid)

# Print the classification report to evaluate the model's performance
# target_names=['CAT', 'DOG']: Specify the names of the target classes
print(classification_report(y_valid, y_valid_pred, target_names=['CAT', 'DOG']))

# Show the accuracy of the model
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show the Matthews Correlation Coefficient (MCC) of the model
# MCC is a balanced measure that can be used even if the classes are of very different sizes
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")

              precision    recall  f1-score   support

         CAT       1.00      0.99      0.99      1253
         DOG       0.99      1.00      0.99      1239

    accuracy                           0.99      2492
   macro avg       0.99      0.99      0.99      2492
weighted avg       0.99      0.99      0.99      2492

Accuracy: 0.9939807383627608
Matthews Correlation Coefficient: 0.9880001945828281


### Approaches to PU Learning

#### Naive Approach

A straightforward approach is to treat every unlabeled sample as a negative and train an ordinary binary classifier. This works only when the unlabeled set contains few hidden positives: the more positives it hides, the more label noise the negative class carries, and the lower the scores the classifier assigns to true positives. Despite being simple to implement, the naive approach therefore often results in suboptimal performance in practice.
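
The following sketch illustrates why the naive approach struggles. It uses simulated one-dimensional data and a binned frequency table as a stand-in classifier (both assumptions made for this example). Because only a fraction $c$ of positives is labeled, the naive score stays near $c$ even in regions that are almost purely positive, so a 0.5 threshold finds hardly any positives.

```python
import numpy as np

rng = np.random.default_rng(5)
n, c = 40_000, 0.2                                   # c is known only because we simulate

y_star = rng.integers(0, 2, size=n)                  # hidden true labels
x = rng.normal(loc=np.where(y_star == 1, 2.0, -2.0))
y_tilde = (y_star == 1) & (rng.random(n) < c)        # unlabeled rows are treated as negatives

# Naive "classifier": fraction of labeled rows in each bin of x
edges = np.linspace(-5, 5, 21)
n_bins = len(edges) - 1
bins = np.clip(np.digitize(x, edges), 1, n_bins) - 1
labeled_per_bin = np.bincount(bins, weights=y_tilde.astype(float), minlength=n_bins)
rows_per_bin = np.bincount(bins, minlength=n_bins)
score = (labeled_per_bin / np.maximum(rows_per_bin, 1))[bins]


def recall(predicted_positive):
    """Share of true positives that are predicted positive."""
    return predicted_positive[y_star == 1].mean()


print(f"largest naive score:         {score.max():.2f}")
print(f"recall, naive threshold:     {recall(score > 0.5):.2f}")
print(f"recall, after dividing by c: {recall(score / c > 0.5):.2f}")
```

Dividing the score by $c$ before thresholding is the correction that the estimator-based approach in the next cells builds on.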


In [254]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from typing import Tuple
from sklearn.preprocessing import MinMaxScaler

def fit_positive_unlabeled_estimator(X: np.ndarray, y: np.ndarray, hold_out_ratio: float, estimator: LogisticRegression) -> Tuple[LogisticRegression, float, float]:
    """Fits a positive-unlabeled (PU) estimator and estimates P(s=1|y=1).

    Args:
        X (np.ndarray): Feature matrix.
        y (np.ndarray): Labels array.
        hold_out_ratio (float): Ratio of positive samples to hold out for estimating P(s=1|y=1).
        estimator (LogisticRegression): The logistic regression estimator.

    Returns:
        Tuple[LogisticRegression, float, float]: The fitted estimator, the estimated P(s=1|y=1), and the estimated c (here the class prior P(y=1) over the training rows that remain after the hold-out split).
    """
    assert isinstance(y, np.ndarray), "Must pass np.ndarray rather than list as y"
    
    # Find indices of positive/labeled elements
    positive_indices = np.where(y == 1.0)[0]
    
    # Calculate the number of positive samples to hold out
    hold_out_size = int(np.ceil(len(positive_indices) * hold_out_ratio))
    
    # Shuffle the positive indices to ensure randomness
    np.random.shuffle(positive_indices)
    
    # Select hold-out indices and corresponding samples
    hold_out_indices = positive_indices[:hold_out_size]
    X_hold_out = X[hold_out_indices]
    
    # Remove hold-out samples from X and y to create the training set
    X = np.delete(X, hold_out_indices, axis=0)
    y = np.delete(y, hold_out_indices)
    
    # Fit the estimator on the remaining samples
    estimator.fit(X, y)
    
    # Predict probabilities for the hold-out set
    hold_out_predictions = estimator.predict_proba(X_hold_out)[:, 1]
    
    # Estimate P(s=1|y=1), the probability that a positive sample is labeled,
    # as the mean predicted probability over the held-out positives.
    prob_s1_given_y1 = np.mean(hold_out_predictions)

    # The mean of P(s=1|x) over the remaining training rows estimates P(s=1)
    c = np.mean(estimator.predict_proba(X)[:, 1])

    # Dividing by P(s=1|y=1) turns it into an estimate of the class prior P(y=1).
    # Note: `c` here is the class prior, not the label frequency called c in the Key Lemma.
    c = c / prob_s1_given_y1
    
    return estimator, prob_s1_given_y1, c

def predict_positive_unlabeled_prob(X: np.ndarray, estimator: LogisticRegression, prob_s1_given_y1: float) -> np.ndarray:
    """Predicts the probability of being labeled for positive-unlabeled data.

    Args:
        X (np.ndarray): Feature matrix.
        estimator (LogisticRegression): The fitted logistic regression estimator.
        prob_s1_given_y1 (float): The estimated P(s=1|y=1), that is, the probability of being labeled given a positive sample.

    Returns:
        np.ndarray: Predicted probabilities.
    """
    # Predict probabilities for the input data
    predicted_probabilities = estimator.predict_proba(X)[:, 1]
    
    # Adjust the predicted probabilities by dividing by P(s=1|y=1)
    return predicted_probabilities / prob_s1_given_y1

In [255]:
np.bincount(y_train)

array([2000,  200])

In [256]:
# Initialize predictions array to accumulate predicted probabilities
predicted_probabilities = np.zeros(len(X_train))

# Initialize list to store c values from each iteration
c_values = []

# Number of learning iterations
num_iterations = 100

# Perform learning iterations
for iteration in range(num_iterations):
    # Fit the PU estimator and get the estimated probabilities and c value
    # hold_out_ratio=0.25: 25% of positive samples are held out for estimating P(s=1|y=1)
    pu_estimator, prob_s1_given_y1, c = fit_positive_unlabeled_estimator(
        X_train, y_train, 0.25, LogisticRegression(max_iter=5000, n_jobs=-1)
    )
    
    # Store the c value
    c_values.append(c)
    
    # Predict probabilities for the current iteration
    predicted_probabilities_unscaled = predict_positive_unlabeled_prob(
        X_train, pu_estimator, prob_s1_given_y1
    )
    
    # Accumulate the predicted probabilities
    predicted_probabilities += predicted_probabilities_unscaled
    
    # Print progress every 10 iterations
    if iteration % 10 == 0:
        print(f'Learning Iteration::{iteration}/{num_iterations} => P(s=1|y=1)={round(prob_s1_given_y1, 2)} c={round(c, 2)}')

# Normalize the accumulated probabilities by the number of iterations
predicted_probabilities /= num_iterations

# Scale the normalized probabilities to the range [0, 1]
predicted_probabilities_scaled = MinMaxScaler().fit_transform(
    predicted_probabilities.reshape(-1, 1)
).flatten()

# Calculate the mean c value from all iterations
mean_c = np.mean(c_values)

Learning Iteration::0/100 => P(s=1|y=1)=0.1 c=0.73
Learning Iteration::10/100 => P(s=1|y=1)=0.11 c=0.65
Learning Iteration::20/100 => P(s=1|y=1)=0.12 c=0.59
Learning Iteration::30/100 => P(s=1|y=1)=0.11 c=0.66
Learning Iteration::40/100 => P(s=1|y=1)=0.09 c=0.81
Learning Iteration::50/100 => P(s=1|y=1)=0.12 c=0.61
Learning Iteration::60/100 => P(s=1|y=1)=0.12 c=0.57
Learning Iteration::70/100 => P(s=1|y=1)=0.12 c=0.59
Learning Iteration::80/100 => P(s=1|y=1)=0.12 c=0.58
Learning Iteration::90/100 => P(s=1|y=1)=0.12 c=0.59


In [257]:
print(f"The estimated positive class prior probability (c) is: {mean_c:.4f}")
print(f"The mean of the min-max-scaled predicted probabilities is: {predicted_probabilities_scaled.mean():.4f}")

The estimated positive class prior probability (c) is: 0.6053
The mean of the min-max-scaled predicted probabilities is: 0.0848


In [258]:
# We would not have this data in real life, but the true values are:

# Proportion of positive samples in the full training data
# np.bincount(...) / len(...) gives the class proportions; [1] selects the positive class
real_c = (np.bincount(y_train_full) / len(y_train_full))[1]

# Fraction of PU training rows that carry a positive label, P(s=1) = 200 / 2200
# Note: this is not P(s=1|y=1), because the unlabeled rows also contain positives
real_p = len(y_train_positive) / len(y_train)

# Print both reference values
print(f"The real positive class prior probability (c) is: {real_c:.4f}")
print(f'The fraction of training samples labeled positive is: {real_p:.4f}')

The real positive class prior probability (c) is: 0.5001
The fraction of training samples labeled positive is: 0.0909


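The estimated class prior (0.61) is in the neighborhood of the true value (0.50), overshooting by about 0.1. The second pair of numbers (0.0848 and 0.0909) should be read with care: `real_p` is the labeled fraction 200 / 2200 and `predicted_probabilities_scaled.mean()` is the mean of the min-max-scaled scores, so neither is an estimate of $P(s=1|y=1)$. The per-iteration values printed during training (roughly 0.09 to 0.12) are the estimates of that quantity.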

In [259]:
# Convert predicted probabilities to binary labels
# If the predicted probability is greater than 0.5, classify as positive (1, 'DOG')
# Otherwise, classify as negative (0, 'CAT')
predicted_labels = predicted_probabilities > 0.5


Now we can use the estimated labels to train another classifier

In [260]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, matthews_corrcoef

# Initialize the RandomForestClassifier
# random_state=271828: Seed for random number generator to ensure reproducibility
# n_jobs=-1: Use all available CPU cores for parallel processing
classifier = RandomForestClassifier(random_state=271828, n_jobs=-1)

# Fit the classifier on the combined dataset
# X_train: Feature matrix containing both positive and unlabeled samples
# predicted_labels: Labels predicted in the previous step
classifier.fit(X_train, predicted_labels)

# Predict labels for the validation set using the trained classifier
y_valid_pred = classifier.predict(X_valid)

# Print the classification report to evaluate the model's performance
# target_names=['CAT', 'DOG']: Specify the names of the target classes
print(classification_report(y_valid, y_valid_pred, target_names=['CAT', 'DOG']))

# Show the accuracy of the model
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show the Matthews Correlation Coefficient (MCC) of the model
# MCC is a balanced measure that can be used even if the classes are of very different sizes
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")

              precision    recall  f1-score   support

         CAT       0.87      1.00      0.93      1253
         DOG       1.00      0.85      0.92      1239

    accuracy                           0.92      2492
   macro avg       0.93      0.92      0.92      2492
weighted avg       0.93      0.92      0.92      2492

Accuracy: 0.9229534510433387
Matthews Correlation Coefficient: 0.8560109094778148


### Elkan and Noto’s Approach

A more sophisticated method for positive-unlabeled classification is [Elkan and Noto’s (E&N) approach](https://dl.acm.org/doi/10.1145/1401890.1401920). This method typically involves:

1. **Training a Classifier:** Predict the probability that a sample is labeled.
2. **Estimating Probabilities:** Use the model to estimate the probability that a positive sample is labeled.
3. **Adjusting Probabilities:** Divide the probability that an unlabeled sample is labeled by the probability that a positive sample is labeled to get the actual probability that the sample is positive.

While the E&N approach is effective in practice, it requires a significant amount of labeled data to accurately estimate the likelihood of a sample being positive, which may be impractical in real-world scenarios where labeled data is scarce.

> **Note:** The E&N approach is advantageous because it directly handles the labeling issue, but its reliance on sufficient labeled data can be a limitation in certain applications.

### Positive-Unlabeled Learning with [pulearn](https://pulearn.github.io/pulearn/doc/pulearn/)
 
The pulearn Python package provides a collection of scikit-learn wrappers for various Positive-Unlabeled (PU) learning methods. PU learning is a type of semi-supervised learning where the training data consists of positive samples and unlabeled samples, with the latter potentially containing both positive and negative samples.

This library expects the input data as a feature matrix `X` and a target vector `y`, where `y` marks positive samples with `1` and unlabeled samples with `-1`. The cells below convert the labels to this format and then fit pulearn's `ElkanotoPuClassifier`.

In [261]:
# Convert the labels to the format pulearn expects:
# 0 (unlabeled) becomes -1, and 1 (positive) stays 1
y_train_formatted = np.array([-1 if x == 0 else 1 for x in y_train])

# Count the occurrences of each label in the formatted array
# np.bincount only accepts non-negative integers, so shift every label by 1 first:
# -1 maps to index 0 and 1 maps to index 2 (index 1 stays empty)
counts = np.bincount(y_train_formatted + 1)

# Print the counts of -1 and 1
# counts[0] holds the count of -1 (shifted to 0)
# counts[2] holds the count of 1 (shifted to 2)
print(f"Count of -1 (unlabeled): {counts[0]}")
print(f"Count of 1 (positive): {counts[2]}")

Count of -1 (unlabeled): 2000
Count of 1 (positive): 200


In [262]:
from pulearn import ElkanotoPuClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, matthews_corrcoef

# Initialize the RandomForestClassifier
# n_jobs=-1: Use all available CPU cores for parallel processing
# random_state=271828: Seed for random number generator to ensure reproducibility
model = RandomForestClassifier(n_jobs=-1, random_state=271828)

# Initialize the ElkanotoPuClassifier with the RandomForestClassifier as the base estimator
# hold_out_ratio=0.30: Ratio of positive samples to hold out for estimating P(s=1|y=1)
pu_estimator = ElkanotoPuClassifier(estimator=model, hold_out_ratio=0.30)

# Fit the PU classifier on the combined dataset
# X_train: Feature matrix containing both positive and unlabeled samples
# y_train_formatted: Labels formatted to -1 for unlabeled and 1 for positive samples
pu_estimator.fit(X_train, y_train_formatted)

# Predict labels for the validation set using the trained PU classifier
y_valid_pred = pu_estimator.predict(X_valid)

# Print the classification report to evaluate the model's performance
# target_names=['CAT', 'DOG']: Specify the names of the target classes
print(classification_report(y_valid, y_valid_pred, target_names=['CAT', 'DOG']))

# Show the accuracy of the model
print(f"Accuracy: {accuracy_score(y_valid, y_valid_pred)}")

# Show the Matthews Correlation Coefficient (MCC) of the model
# MCC is a balanced measure that can be used even if the classes are of very different sizes
print(f"Matthews Correlation Coefficient: {matthews_corrcoef(y_valid, y_valid_pred)}")

              precision    recall  f1-score   support

         CAT       0.94      0.96      0.95      1253
         DOG       0.96      0.94      0.95      1239

    accuracy                           0.95      2492
   macro avg       0.95      0.95      0.95      2492
weighted avg       0.95      0.95      0.95      2492

Accuracy: 0.9514446227929374
Matthews Correlation Coefficient: 0.9030864730146777


# Questions

1. What is the primary motivation for using semi-supervised learning?

2. Describe the self-training algorithm in semi-supervised learning.

3. What are the key assumptions in semi-supervised learning?

4. Explain the concept of label propagation in semi-supervised learning.

5. How does co-training leverage multiple views of the data?

6. What is the core idea behind PU learning?

7. What are the key assumptions of PU learning?

8. Describe Elkan and Noto’s approach to PU learning.

9. What is the purpose of estimating the class prior in PU learning?

10. How does the naive approach to PU learning treat the unlabeled data?


`Answers are commented inside this cell.`

<!-- 1. The primary motivation for using semi-supervised learning is to leverage the abundant unlabeled data available in many practical scenarios to improve model performance when labeled data is limited or expensive to obtain.

2. The self-training algorithm involves initializing a model with a small labeled dataset, generating pseudo-labels for the unlabeled data using the model, combining the labeled and pseudo-labeled data, retraining the model, and repeating the process until the model's performance converges.

3. The key assumptions in semi-supervised learning include the cluster assumption (homogeneity), the continuity assumption (smoothness), and the manifold assumption. These assumptions suggest that similar data points are likely to have the same label and that the data lies on a lower-dimensional manifold.

4. Label propagation constructs a graph where nodes represent data points and edges represent similarities. It spreads labels from known instances to unlabeled ones through the graph, iteratively updating label probabilities until they stabilize.

5. Co-training leverages multiple views of the data by training separate models on each view and allowing them to teach each other through their most confident predictions. This method is effective when data can be naturally divided into distinct feature sets or views.

6. The core idea behind PU learning is to transform the problem from predicting the unknown true target variable to predicting the observed positive class label, effectively using the information contained within the unlabeled data.

7. The key assumptions of PU learning are positive label reliability (all instances labeled as positive are indeed positive), unlabeled data composition (unlabeled instances can belong to either the positive or negative class), and label flipping independence (the probability of a positive instance being mislabeled as unlabeled is independent of its features).

8. Elkan and Noto’s approach to PU learning involves training a classifier to predict the probability that a sample is labeled, estimating the probability that a positive sample is labeled, and adjusting the probabilities to get the actual probability that the sample is positive.

9. Estimating the class prior in PU learning is crucial for adjusting the model's predictions to reflect the true positive probabilities. It helps bridge the gap between the observed data (positive and unlabeled) and the true underlying distribution.

10. The naive approach to PU learning treats the unlabeled data as samples of the negative class, assuming that the proportion of positive samples in the unlabeled data is small enough not to significantly impact the model’s performance. -->
