# Semi-Supervised Learning

In this exercise, we use a breast cancer dataset to explore the concepts of semi-supervised learning. In particular, we will perform the following tasks: 

1. Create a dataset suitable for semi-supervised learning
2. Create a baseline and report accuracy
3. Solve the classification task using a semi-supervised method and report accuracy
4. Create a classification model that utilizes the predicted output from the semi-supervised learning

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from numpy import concatenate
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from seaborn import catplot

### Load the data

data location: `/dsa/data/DSA-8410/Wisconsin-Breast-Cancer-Cytology/BreastCancer.csv`

In [2]:
data = pd.read_csv("/dsa/data/DSA-8410/Wisconsin-Breast-Cancer-Cytology/BreastCancer.csv")

In [3]:
data.shape

(683, 4)

In [4]:
data.head()

| | id | thickness | size | class |
|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 0 |
| 2 | 1015425 | 3 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 0 |
| 4 | 1017023 | 4 | 1 | 0 |


### Remove the 'id' column

In [5]:
data = data.drop(["id"], axis=1)
data.head()

| | thickness | size | class |
|---|---|---|---|
| 0 | 5 | 1 | 0 |
| 1 | 5 | 4 | 0 |
| 2 | 3 | 1 | 0 |
| 3 | 6 | 8 | 0 |
| 4 | 4 | 1 | 0 |


### Extract the first two features and the class variable

In [6]:
X = data.iloc[:,0:2]
y = data.loc[:,"class"]

### T1. Create datasets for semi-supervised learning

1. Create train and test datasets with a stratified 50-50 split
2. Split the training set into labeled and unlabeled datasets with a stratified 50-50 split (the unlabeled part is stored in `X_test_unlab` and `y_test_unlab`)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
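
The two-stage split above can be checked on stand-in data. The sketch below is only an illustration: it uses `make_classification` to generate a synthetic dataset with the same number of rows (683) and an assumed class balance of roughly 65/35, then confirms that stratification keeps the positive-class rate close to the overall rate in every part. All `*_demo` names and the class weights are assumptions, not part of the exercise data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 683 rows, 2 features,
# roughly 65/35 class balance.
X_demo, y_demo = make_classification(
    n_samples=683, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.65, 0.35], random_state=1)

# Stage 1: hold out half of the data as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.50, random_state=1, stratify=y_demo)

# Stage 2: split the training half into labeled and unlabeled parts.
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_tr, y_tr, test_size=0.50, random_state=1, stratify=y_tr)

split_sizes = {"labeled": len(y_lab), "unlabeled": len(y_unlab), "test": len(y_te)}
pos_rates = {name: float(np.mean(part)) for name, part in
             [("full", y_demo), ("labeled", y_lab),
              ("unlabeled", y_unlab), ("test", y_te)]}
print(split_sizes)
print(pos_rates)
```

The split sizes depend only on the row count, so they should be 170, 171, and 342 here as well.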

### T2. Report the sizes of the labeled, unlabeled, and test sets

In [8]:
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
print('Test Set:', X_test.shape, y_test.shape)

Labeled Train Set: (170, 2) (170,)
Unlabeled Train Set: (171, 2) (171,)
Test Set: (342, 2) (342,)


### T3. Baseline Performance 

We can establish a baseline by fitting a classifier on the labeled training data alone. This matters because a semi-supervised learning algorithm should outperform a supervised learning algorithm fit only on the labeled data. If it does not, we need to rethink the semi-supervised model and/or the data we are using.

### T4. Define and fit the random forest model as a baseline

In [9]:
model = RandomForestClassifier()
model.fit(X_train_lab, y_train_lab)

RandomForestClassifier()

### T5. Report baseline prediction accuracy

In [10]:
yhat = model.predict(X_test)
score1 = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score1*100))

Accuracy: 94.444


### T6. Fit a label propagation model (`LabelSpreading`)

In [11]:
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
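
scikit-learn's semi-supervised estimators treat the label `-1` as "unlabeled", which is why the cell above pads the label vector with `-1`. The following minimal sketch shows the same construction on a few made-up rows; the `*_demo` arrays are hypothetical and only illustrate the layout of the mixed training set.

```python
import numpy as np

# Tiny hypothetical arrays standing in for the labeled and unlabeled parts.
X_lab_demo = np.array([[5, 1], [6, 8], [3, 1]])
y_lab_demo = np.array([0, 1, 0])
X_unlab_demo = np.array([[4, 1], [8, 7]])

# Stack the features, labeled rows first.
X_mixed_demo = np.concatenate((X_lab_demo, X_unlab_demo))

# -1 marks "no label" for scikit-learn's semi-supervised estimators.
y_mixed_demo = np.concatenate((y_lab_demo, np.full(len(X_unlab_demo), -1)))

# Boolean mask that recovers the unlabeled rows later on.
unlabeled_mask = y_mixed_demo == -1
print(y_mixed_demo)
print(unlabeled_mask.sum())
```

Keeping the labeled rows first means the unlabeled rows can also be recovered by position (everything after the first `len(y_lab_demo)` entries).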

In [12]:
model = LabelSpreading(max_iter=2000)
model.fit(X_train_mixed, y_train_mixed)

LabelSpreading(max_iter=2000)
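
The task title mentions label propagation, while the cell above fits `LabelSpreading`; both estimators are imported at the top of the notebook and share the same interface. As a rough sketch, the two can be fit side by side and compared on held-out points. The data here is synthetic (generated with `make_classification`), and the sample size, class separation, and every-second-label masking are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# Synthetic two-feature data (assumption, not the exercise dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=1)

X_fit, X_new, y_fit, y_new = train_test_split(
    X_demo, y_demo, test_size=0.50, random_state=1, stratify=y_demo)

# Hide every second training label by setting it to -1.
y_partial = y_fit.copy()
y_partial[1::2] = -1

results = {}
for name, est in [("LabelPropagation", LabelPropagation(max_iter=2000)),
                  ("LabelSpreading", LabelSpreading(max_iter=2000))]:
    est.fit(X_fit, y_partial)
    # predict() labels unseen points from the fitted label distributions.
    results[name] = accuracy_score(y_new, est.predict(X_new))
    print(f"{name}: held-out accuracy = {results[name]:.3f}")
```

Both estimators use an RBF kernel by default, which is sensitive to feature scale, so the `kernel`, `gamma`, and `n_neighbors` parameters are worth trying on real data.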

### T7. Report the prediction accuracy of the label propagation method

In [13]:
yhat = model.predict(X_test)
score2 = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score2*100))

Accuracy: 93.567


### T8. Fit a supervised model using the estimated labels for the training dataset

In [14]:
tran_labels = model.transduction_
model2 = RandomForestClassifier()
model2.fit(X_train_mixed, tran_labels)

RandomForestClassifier()
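
Because the supervised model in this task is trained on `transduction_` rather than on true labels, it can help to check how good those estimated labels are before trusting the downstream model. The sketch below does this on synthetic data; the dataset, the 100-row labeled prefix, and the fixed `random_state` are assumptions for illustration. It also counts how many originally labeled rows were relabeled, which can happen because `LabelSpreading` clamps the known labels only softly (its `alpha` parameter).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in data (assumption); rows are shuffled by default.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=1)

# Keep the first 100 labels and hide the rest.
n_labeled = 100
y_partial = y_demo.copy()
y_partial[n_labeled:] = -1

spreader = LabelSpreading(max_iter=2000).fit(X_demo, y_partial)
pseudo = spreader.transduction_  # one estimated label per training row

# Accuracy of the estimated labels on the rows whose labels were hidden.
pseudo_acc = accuracy_score(y_demo[n_labeled:], pseudo[n_labeled:])
# Known labels that the soft clamping changed.
n_relabeled = int(np.sum(pseudo[:n_labeled] != y_demo[:n_labeled]))

# Supervised model trained on all rows with the estimated labels.
forest = RandomForestClassifier(random_state=0).fit(X_demo, pseudo)
print(f"pseudo-label accuracy: {pseudo_acc:.3f}, relabeled: {n_relabeled}")
```

In the notebook, the same check would compare `tran_labels[len(y_train_lab):]` against `y_test_unlab`, since the true labels of the unlabeled part were kept.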

In [15]:
yhat = model2.predict(X_test)
score3 = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score3*100))

Accuracy: 94.152


### T9. Discuss your observations
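
One way to ground the discussion is to convert the reported accuracies back into counts of correctly classified test samples. The sketch below uses only the numbers printed above (a test set of 342 rows and accuracies of 94.444, 93.567, and 94.152); the dictionary keys are descriptive labels chosen here.

```python
n_test = 342  # test-set size reported in T2

# Accuracies (in percent) as printed in T5, T7, and T8.
reported = {
    "baseline random forest": 94.444,
    "label spreading": 93.567,
    "random forest on estimated labels": 94.152,
}

# Number of correctly classified test samples implied by each accuracy.
correct = {name: round(pct / 100 * n_test) for name, pct in reported.items()}
for name, k in correct.items():
    print(f"{name}: {k}/{n_test} correct")
# baseline random forest: 323/342 correct
# label spreading: 320/342 correct
# random forest on estimated labels: 322/342 correct
```

The three models differ by at most three test samples. The random forests are also fit without a fixed `random_state`, so their scores can shift slightly from run to run.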