# Transductive Learning

* Transductive learning (TL) trains a model on the labeled points and uses it to label the unlabeled data points that have already been observed.

* TL does not build a general predictive model. If new unlabeled points are encountered, we have to rerun the whole procedure.

* Transductive learning can predict only the points in the observed testing dataset, based on the observed training dataset.


source: https://towardsdatascience.com/a-simple-svm-based-implementation-of-semi-supervised-learning-f44eafb0a970

<img src='https://miro.medium.com/max/589/1*Va3RZ9tPKRTmV932Wnvj2w.png' width='400'  align="center">

<img src='https://miro.medium.com/max/471/1*1JmNnvBVFYF-elmdHloI3g.png' width='400'  align="center">
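
As a rough illustration of the idea above, the sketch below uses scikit-learn's `LabelSpreading`, which is a different technique from the SVM-based approach used in the rest of this notebook. The toy arrays `X_demo` and `y_demo` are made up for illustration, and `-1` is scikit-learn's marker for an unlabeled point. After fitting, `transduction_` holds one label for every point that was passed to `fit`, labeled or not.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy data (hypothetical): two well-separated 1-D clusters.
# -1 marks an unlabeled point, following the scikit-learn convention.
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_demo = np.array([0, -1, -1, 1, -1, -1])

# Fit on labeled and unlabeled points together.
model = LabelSpreading(kernel="rbf", gamma=1.0)
model.fit(X_demo, y_demo)

# One label per point seen during fit, including the unlabeled ones.
transduced = model.transduction_
print(transduced)
```

The `transduction_` attribute only covers the points given to `fit`, which mirrors the limitation described in the bullets above.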

### Importing the library and the datasets

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import datasets

http://rasbt.github.io/mlxtend/user_guide/data/wine_data/

In [2]:
# Samples: 178, Features: 13, Classes: 3
X, y = datasets.load_wine(return_X_y=True) # Load Wine dataset
X = X[:,10:12] # keep only two features for training: columns 10 and 11 (hue, od280/od315_of_diluted_wines)
# y is the class id (there are 3 classes: 0 (59 samples), 1 (71 samples), 2 (48 samples))
X

array([[1.04 , 3.92 ],
       [1.05 , 3.4  ],
       [1.03 , 3.17 ],
       [0.86 , 3.45 ],
       [1.04 , 2.93 ],
       [1.05 , 2.85 ],
       [1.02 , 3.58 ],
       [1.06 , 3.58 ],
       [1.08 , 2.85 ],
       [1.01 , 3.55 ],
       [1.25 , 3.17 ],
       [1.17 , 2.82 ],
       [1.15 , 2.9  ],
       [1.25 , 2.73 ],
       [1.2  , 3.   ],
       [1.28 , 2.88 ],
       [1.07 , 2.65 ],
       [1.13 , 2.57 ],
       [1.23 , 2.82 ],
       [0.96 , 3.36 ],
       [1.09 , 3.71 ],
       [1.03 , 3.52 ],
       [1.11 , 4.   ],
       [1.09 , 3.63 ],
       [1.12 , 3.82 ],
       [1.13 , 3.2  ],
       [0.92 , 3.22 ],
       [1.02 , 2.77 ],
       [1.25 , 3.4  ],
       [1.04 , 3.59 ],
       [1.19 , 2.71 ],
       [1.09 , 2.88 ],
       [1.23 , 2.87 ],
       [1.25 , 3.   ],
       [1.1  , 2.87 ],
       [1.04 , 3.47 ],
       [1.09 , 2.78 ],
       [1.12 , 2.51 ],
       [1.18 , 2.69 ],
       [0.89 , 3.53 ],
       [0.95 , 3.38 ],
       [0.91 , 3.   ],
       [0.88 , 3.56 ],
       [0.8

### Dividing the dataset into train (labeled), unlabeled and test sets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # split the dataset into train and test data
X_train, X_unl, y_train, y_unl = train_test_split(X_train, y_train, test_size=0.7, random_state=1) # split the train data into labeled data and unlabeled data


In [4]:
print(X_train.shape)
print(X_test.shape)
print(X_unl.shape)

(37, 2)
(54, 2)
(87, 2)
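
The shapes above follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the remainder goes to the training side. The sketch below redoes that arithmetic by hand and then checks it against `train_test_split` on a plain index array; the variable names (`n_lab`, `n_unl`, `idx`) are illustrative and not part of the original notebook.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 178

# First split: 30% held out as the test set (rounded up).
n_test = math.ceil(0.3 * n_samples)
n_rest = n_samples - n_test

# Second split: 70% of the remainder is treated as unlabeled (rounded up).
n_unl = math.ceil(0.7 * n_rest)
n_lab = n_rest - n_unl

# Cross-check with train_test_split on an index array of the same length.
idx = np.arange(n_samples)
rest_idx, test_idx = train_test_split(idx, test_size=0.3, random_state=1)
lab_idx, unl_idx = train_test_split(rest_idx, test_size=0.7, random_state=1)

print((n_lab, n_test, n_unl))
print((len(lab_idx), len(test_idx), len(unl_idx)))
```

Only about a fifth of the data keeps its label in this setup, which is what makes the unlabeled pool worth exploiting.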


### Training on the labeled set

<font size="4"> <font color='royalblue'> __What is Support Vector Machine?__ </font> </font>

* The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N is the number of features) that distinctly classifies the data points.

* To separate the two classes of data points, there are many possible hyperplanes that could be chosen. The objective of SVM is to find the plane that has the maximum margin, i.e., the maximum distance to the nearest data points of both classes.

* Maximizing the margin distance provides some robustness, so that future data points can be classified with more confidence.

source: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47

<img src='https://miro.medium.com/max/600/0*9jEWNXTAao7phK-5.png' width='200'  align="center">
<img src='https://miro.medium.com/max/600/0*0o8xIA4k3gXUDCFU.png' width='200'  align="center">
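
To make the margin concrete, here is a minimal sketch on a made-up, linearly separable toy set (`X_toy`, `y_toy` are hypothetical). For a linear SVM with weight vector `w`, the margin width is `2 / ||w||`. A very large `C` is used here only to approximate a hard-margin SVM; it is not the setting used in this notebook.

```python
import numpy as np
from sklearn import svm

# Toy data (hypothetical): class 0 sits at x=0, class 1 at x=2.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# Large C approximates a hard margin on separable data.
toy_clf = svm.SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# For a linear kernel, coef_ holds the hyperplane's weight vector w.
w = toy_clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print(margin_width)  # close to 2.0, the gap between the two classes
```

The separating line ends up halfway between the two groups, so both classes are as far from it as possible.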

In [5]:
clf = svm.SVC(kernel='linear', probability=True, C=1).fit(X_train, y_train) # fit a linear SVM classifier on the labeled training data
acc = clf.score(X_test, y_test) # baseline accuracy on the held-out test data
acc

0.6851851851851852

### Predict on the unlabeled data

In [6]:
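clp = clf.predict_proba(X_unl) # extract the predicted probability assigned to each class
lab = clf.predict(X_unl) # predict the class of each unlabeled sample (SVC.predict uses the decision function, so it can differ from the class with the highest predict_proba value)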

In [7]:
df = pd.DataFrame(clp, columns = ['C1Prob', 'C2Prob','C3Prob']) # one row per unlabeled sample, one column per class probability
df['max']=df[["C1Prob", "C2Prob","C3Prob"]].max(axis=1) # highest class probability of each sample
df['lab']=lab # predicted class of each unlabeled sample (from clf.predict)
df['actual']=y_unl # ground truth

In [8]:
df

| | C1Prob | C2Prob | C3Prob | max | lab | actual |
|---|---|---|---|---|---|---|
| 0 | 0.563874 | 0.381664 | 0.054462 | 0.563874 | 1 | 0 |
| 1 | 0.061865 | 0.054491 | 0.883644 | 0.883644 | 2 | 2 |
| 2 | 0.467037 | 0.457587 | 0.075376 | 0.467037 | 1 | 0 |
| 3 | 0.289868 | 0.158459 | 0.551673 | 0.551673 | 2 | 2 |
| 4 | 0.490104 | 0.343825 | 0.166071 | 0.490104 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 82 | 0.110907 | 0.883948 | 0.005144 | 0.883948 | 0 | 0 |
| 83 | 0.023730 | 0.026267 | 0.950003 | 0.950003 | 2 | 2 |
| 84 | 0.514501 | 0.177224 | 0.308274 | 0.514501 | 1 | 1 |
| 85 | 0.373670 | 0.598724 | 0.027607 | 0.598724 | 1 | 0 |
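
When `probability=True`, scikit-learn's `SVC` derives `predict_proba` from an internal cross-validated calibration, so the class with the highest probability does not always match what `predict` returns. The sketch below measures how often the two disagree on the unlabeled pool. It rebuilds the same splits as above under different names (`X_lab`, `X_rest` are illustrative), and `random_state=0` on the classifier is an added assumption to make the calibration repeatable.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Rebuild the splits used in this notebook (names here are illustrative).
X, y = datasets.load_wine(return_X_y=True)
X = X[:, 10:12]
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_lab, X_unl, y_lab, y_unl = train_test_split(X_rest, y_rest, test_size=0.7, random_state=1)

# random_state is an assumption added for repeatable probability estimates.
clf = svm.SVC(kernel="linear", probability=True, C=1, random_state=0).fit(X_lab, y_lab)

proba = clf.predict_proba(X_unl)
by_proba = clf.classes_[proba.argmax(axis=1)]  # class with the highest probability
by_predict = clf.predict(X_unl)                # class from the decision function

# Fraction of unlabeled samples where the two rules pick different classes.
disagreement = float(np.mean(by_proba != by_predict))
print(disagreement)
```

If the disagreement is not negligible, one option is to take the pseudo-label from `by_proba` so that the label and the confidence used for thresholding come from the same source.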


In [9]:
nc=np.arange(0.98,.33,-.03) # create a set of confidence thresholds (22 values, from 0.98 down to 0.35)

In [10]:
nc

array([0.98, 0.95, 0.92, 0.89, 0.86, 0.83, 0.8 , 0.77, 0.74, 0.71, 0.68,
       0.65, 0.62, 0.59, 0.56, 0.53, 0.5 , 0.47, 0.44, 0.41, 0.38, 0.35])
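
Because `np.arange` with a floating-point step can be sensitive to rounding at the end of the range, the same grid can also be written with `np.linspace`, where the number of points is explicit. This is only an alternative sketch; `nc_arange` and `nc_linspace` are illustrative names.

```python
import numpy as np

# Grid as written in the notebook: start 0.98, stop before 0.33, step -0.03.
nc_arange = np.arange(0.98, 0.33, -0.03)

# Same grid with explicit endpoints and an explicit number of points.
nc_linspace = np.linspace(0.98, 0.35, 22)

print(len(nc_arange), np.allclose(nc_arange, nc_linspace))
```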

In [11]:
acc = np.empty(len(nc)) # one accuracy value per threshold
for i, k in enumerate(nc): # k is the confidence threshold
    conf_ind = (df["max"] > k).to_numpy() # mask of unlabeled samples whose highest probability exceeds the threshold k
    X_train1 = np.append(X_train, X_unl[conf_ind, :], axis=0) # add the chosen unlabeled data to the new training X list
    y_train1 = np.append(y_train, df.loc[conf_ind, 'lab']) # add the predicted labels of the chosen unlabeled data to the new training Y list
    clf = svm.SVC(kernel='linear', probability=True, C=1).fit(X_train1, y_train1) # retrain the SVM classifier using the new dataset
    acc[i] = clf.score(X_test, y_test) # accuracy using labeled data + unlabeled data with probabilities higher than the threshold k

In [12]:
acc

array([0.68518519, 0.68518519, 0.68518519, 0.68518519, 0.66666667,
       0.64814815, 0.68518519, 0.72222222, 0.74074074, 0.75925926,
       0.7037037 , 0.7037037 , 0.7037037 , 0.62962963, 0.66666667,
       0.68518519, 0.68518519, 0.68518519, 0.68518519, 0.68518519,
       0.68518519, 0.68518519])
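
The best-performing threshold can be read off programmatically. The sketch below copies the accuracy values reported above into `acc_reported` (an illustrative name) and pairs them with the threshold grid. Note that choosing the threshold by test accuracy, as done here, uses the test set for model selection; a separate validation split would give a less optimistic estimate.

```python
import numpy as np

# Threshold grid and accuracies copied from the outputs above.
nc = np.arange(0.98, 0.33, -0.03)
acc_reported = np.array([
    0.68518519, 0.68518519, 0.68518519, 0.68518519, 0.66666667,
    0.64814815, 0.68518519, 0.72222222, 0.74074074, 0.75925926,
    0.7037037, 0.7037037, 0.7037037, 0.62962963, 0.66666667,
    0.68518519, 0.68518519, 0.68518519, 0.68518519, 0.68518519,
    0.68518519, 0.68518519,
])

best_idx = int(np.argmax(acc_reported))   # position of the highest accuracy
best_threshold = float(nc[best_idx])      # threshold at that position
best_acc = float(acc_reported[best_idx])

print(best_idx, round(best_threshold, 2), best_acc)
```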

In [13]:
conf_ind2=df["max"]>0.9
X_train2 = np.append(X_train,X_unl[conf_ind2,:],axis=0)
y_train2 = np.append(y_train,df.loc[conf_ind2,['lab']])
clf2 = svm.SVC(kernel='linear', probability=True,C=1).fit(X_train2, y_train2)
acc_2 =  clf2.score(X_test, y_test)
acc_2

0.6851851851851852

In [14]:
conf_ind3=df["max"]>0.55
X_train3 = np.append(X_train,X_unl[conf_ind3,:],axis=0)
y_train3 = np.append(y_train,df.loc[conf_ind3,['lab']])
clf3 = svm.SVC(kernel='linear', probability=True,C=1).fit(X_train3, y_train3)
acc_3 =  clf3.score(X_test, y_test)
acc_3

0.6666666666666666

In [16]:
conf_ind4=df["max"]>0.71
X_train4 = np.append(X_train,X_unl[conf_ind4,:],axis=0)
y_train4 = np.append(y_train,df.loc[conf_ind4,['lab']])
clf4 = svm.SVC(kernel='linear', probability=True,C=1).fit(X_train4, y_train4)
acc_4 =  clf4.score(X_test, y_test)
acc_4

0.7592592592592593
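
The three cells above repeat the same steps with different thresholds, so they can be folded into one helper. The sketch below is one possible refactoring, not the notebook's own code: `make_svm`, `pseudo_label_score`, and the `X_lab`/`X_rest` names are hypothetical, and `random_state=0` is an added assumption, so the scores may differ slightly from the outputs shown above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

def make_svm():
    # Same settings as the notebook; random_state is an added assumption.
    return svm.SVC(kernel="linear", probability=True, C=1, random_state=0)

def pseudo_label_score(threshold, X_lab, y_lab, X_unl, X_test, y_test):
    """Add confident pseudo-labels to the labeled set, retrain, return test accuracy."""
    base = make_svm().fit(X_lab, y_lab)
    confidence = base.predict_proba(X_unl).max(axis=1)  # highest class probability
    pseudo = base.predict(X_unl)                        # pseudo-labels
    keep = confidence > threshold                       # confident samples only
    X_new = np.vstack([X_lab, X_unl[keep]])
    y_new = np.concatenate([y_lab, pseudo[keep]])
    return make_svm().fit(X_new, y_new).score(X_test, y_test)

# Rebuild the splits used in this notebook (names here are illustrative).
X, y = datasets.load_wine(return_X_y=True)
X = X[:, 10:12]
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_lab, X_unl, y_lab, y_unl = train_test_split(X_rest, y_rest, test_size=0.7, random_state=1)

scores = {t: pseudo_label_score(t, X_lab, y_lab, X_unl, X_test, y_test)
          for t in (0.9, 0.55, 0.71)}
baseline = make_svm().fit(X_lab, y_lab).score(X_test, y_test)
print(baseline, scores)
```

A threshold of `1.0` adds no pseudo-labels at all, so `pseudo_label_score(1.0, ...)` should reproduce the supervised baseline, which is a convenient sanity check.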