<a href="https://colab.research.google.com/github/Zhaotai924/Pytorch-Tutorial-Youtube/blob/main/Transductive_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

# Sample data
labeled_docs = [
    "football match", "basketball game", "tech conference", "new technology",
    "soccer team", "baseball tournament", "latest gadget", "innovation in tech"
]
labels = [0, 0, 1, 1, 0, 0, 1, 1]  # 0: Sports, 1: Tech

unlabeled_docs = [
    "innovation in football", "basketball technology", "tech soccer",
    "new baseball gadget", "team conference", "gadget competition"
]

# Combine labeled and unlabeled data
all_docs = labeled_docs + unlabeled_docs

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_docs)
# Convert the sparse TF-IDF matrix to a dense NumPy array
X_dense = X.toarray()


The shape of X_dense is determined by two factors:

Number of Documents: This is the total number of documents you have in all_docs (both labeled and unlabeled). Each document becomes a row in the X_dense matrix.

Size of Vocabulary: This is the number of unique words (or tokens) present in your entire corpus after the TfidfVectorizer analyzes and tokenizes the text. Each unique word becomes a column in the X_dense matrix.

In essence, X_dense is a numerical representation of your text data, where:

Each row corresponds to a document.
Each column corresponds to a unique word in your vocabulary.
The values in the matrix represent the TF-IDF score of each word in each document.
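
To illustrate with your example:

You have 14 documents (8 labeled + 6 unlabeled).
With its default settings, the TfidfVectorizer identifies 17 unique words in your corpus. The default tokenizer keeps every token of two or more characters and removes no stop words, so "in" counts as a word.
Therefore, the shape of X_dense would be (14, 17).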
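The vocabulary size can be cross-checked without fitting a vectorizer. The sketch below mimics TfidfVectorizer's default tokenization (lowercasing, tokens of two or more word characters, no stop-word removal) using only the standard library. The names TOKEN_PATTERN, vocabulary, and expected_shape are illustrative and not part of scikit-learn.

```python
import re

labeled_docs = [
    "football match", "basketball game", "tech conference", "new technology",
    "soccer team", "baseball tournament", "latest gadget", "innovation in tech",
]
unlabeled_docs = [
    "innovation in football", "basketball technology", "tech soccer",
    "new baseball gadget", "team conference", "gadget competition",
]
all_docs = labeled_docs + unlabeled_docs

# Same regular expression as TfidfVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Lowercase each document (the vectorizer's default) and collect unique tokens.
vocabulary = sorted(
    {token for doc in all_docs for token in TOKEN_PATTERN.findall(doc.lower())}
)

# Rows = documents, columns = unique tokens.
expected_shape = (len(all_docs), len(vocabulary))
print(expected_shape)
print(vocabulary)
```

If the vectorizer were created with different options (for example stop_words="english"), short words such as "in" would be dropped and the column count would shrink accordingly.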

In [15]:
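# Create the labels array, using -1 to mark unlabeled documents.
# np.ones returns floats, so all_labels and the predicted labels are floats (e.g. 1.0).
all_labels = np.concatenate([labels, -1 * np.ones(len(unlabeled_docs))])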

# Apply Label Propagation
label_prop_model = LabelPropagation()
label_prop_model.fit(X_dense, all_labels) # Use the dense array here

# Predict labels for unlabeled data (the last len(unlabeled_docs) entries of transduction_)
predicted_labels = label_prop_model.transduction_[-len(unlabeled_docs):]
predicted_labels

array([1., 1., 0., 1., 1., 1.])

In your code, label_prop_model is an instance of the LabelPropagation class from scikit-learn. This class implements a semi-supervised machine learning algorithm called Label Propagation.

What is Label Propagation?

Label Propagation is a technique used to assign labels to unlabeled data points based on the labels of a smaller set of labeled data points. It's based on the idea that data points that are "close" to each other in some feature space are likely to share the same label.

How does it work (in a simplified way)?

Graph Construction: The algorithm constructs a graph where each data point (both labeled and unlabeled) is a node. Edges between nodes represent the similarity or proximity between data points.

Label Propagation: The algorithm iteratively propagates labels from the labeled nodes to the unlabeled nodes. In each iteration, the label of an unlabeled node is updated based on the labels of its neighbors, weighted by their similarity.

Convergence: The process continues until the labels of the unlabeled nodes stabilize, meaning they no longer change significantly between iterations.
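
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the points are one-dimensional, the graph is fully connected with RBF edge weights, and the function name propagate_labels and its parameters (gamma, max_iter, tol) are hypothetical choices for this example.

```python
import math

def propagate_labels(points, labels, n_classes, gamma=1.0, max_iter=1000, tol=1e-6):
    """Toy label propagation on 1-D points; label -1 marks an unlabeled point."""
    n = len(points)

    # 1. Graph construction: every pair of points is connected, and the edge
    #    weight is an RBF similarity that decays with squared distance.
    weights = [[math.exp(-gamma * (points[i] - points[j]) ** 2) for j in range(n)]
               for i in range(n)]
    # Row-normalize so the influence each node receives sums to 1.
    transition = [[w / sum(row) for w in row] for row in weights]

    # One-hot rows for labeled points, all-zero rows for unlabeled points.
    clamped = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
               for i in range(n)]
    scores = [row[:] for row in clamped]

    for _ in range(max_iter):
        # 2. Propagation: each node takes the similarity-weighted average
        #    of its neighbors' current class scores.
        updated = [[sum(transition[i][j] * scores[j][c] for j in range(n))
                    for c in range(n_classes)] for i in range(n)]
        # Labeled nodes are reset to their known labels after every step.
        for i in range(n):
            if labels[i] != -1:
                updated[i] = clamped[i][:]
        # 3. Convergence: stop once no score moves by more than tol.
        change = max(abs(updated[i][c] - scores[i][c])
                     for i in range(n) for c in range(n_classes))
        scores = updated
        if change < tol:
            break

    # Each node gets the class with the highest score.
    return [max(range(n_classes), key=lambda c: scores[i][c]) for i in range(n)]

# Two clusters on a line; only the outermost point of each cluster is labeled.
points = [0.0, 0.5, 1.0, 4.0, 4.5, 5.0]
labels = [0, -1, -1, -1, -1, 1]
predicted = propagate_labels(points, labels, n_classes=2)
print(predicted)  # [0, 0, 0, 1, 1, 1]
```

Each unlabeled point ends up with the label of the cluster it sits in, because its strongest edges lead to nearby points and the cross-cluster edges carry almost no weight.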

In your code:

You call label_prop_model.fit(X_dense, all_labels) to fit the Label Propagation model on your data:

X_dense: This is the dense NumPy array representing your text data (TF-IDF features).
all_labels: This is an array containing the labels for the labeled data points and -1 for the unlabeled data points.

Fitting propagates labels from the labeled data to the unlabeled data. The fitted model's transduction_ attribute then holds the label assigned to every input point. Because the unlabeled documents come last in all_docs, the final len(unlabeled_docs) entries of transduction_ are their predicted labels.
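
Beyond transduction_, a fitted LabelPropagation model also exposes label_distributions_, which holds one row of class probabilities per input point. The sketch below repeats the pipeline with integer labels (np.full is used here instead of np.ones, which is a substitution relative to the cell above) and prints each unlabeled document with its predicted label and probabilities. The variable names model, predicted, and probabilities are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

labeled_docs = [
    "football match", "basketball game", "tech conference", "new technology",
    "soccer team", "baseball tournament", "latest gadget", "innovation in tech",
]
labels = [0, 0, 1, 1, 0, 0, 1, 1]  # 0: Sports, 1: Tech
unlabeled_docs = [
    "innovation in football", "basketball technology", "tech soccer",
    "new baseball gadget", "team conference", "gadget competition",
]

# TF-IDF features for all documents, labeled first and unlabeled last.
X_dense = TfidfVectorizer().fit_transform(labeled_docs + unlabeled_docs).toarray()

# Integer label array; -1 marks the unlabeled documents.
all_labels = np.concatenate([labels, np.full(len(unlabeled_docs), -1)])

model = LabelPropagation()
model.fit(X_dense, all_labels)

# The unlabeled documents occupy the last rows of both attributes.
n_unlabeled = len(unlabeled_docs)
predicted = model.transduction_[-n_unlabeled:]
probabilities = model.label_distributions_[-n_unlabeled:]

for doc, label, probs in zip(unlabeled_docs, predicted, probabilities):
    print(f"{doc!r}: label={label}, class probabilities={np.round(probs, 3)}")
```

Probabilities close to 0.5 suggest a document that sits between the two classes, which is worth checking before trusting its predicted label.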



In [16]:
# Output results
for doc, label in zip(unlabeled_docs, predicted_labels):
    print(f"Document: '{doc}' => Predicted Label: {label}")

Document: 'innovation in football' => Predicted Label: 1.0
Document: 'basketball technology' => Predicted Label: 1.0
Document: 'tech soccer' => Predicted Label: 0.0
Document: 'new baseball gadget' => Predicted Label: 1.0
Document: 'team conference' => Predicted Label: 1.0
Document: 'gadget competition' => Predicted Label: 1.0
