# Pseudo Labelling

Pseudo labelling is a semi-supervised technique in which a model trained on labelled data predicts labels for unlabelled data. First, a model is trained on the dataset that has labels, and that model is used to generate pseudo labels for the unlabelled dataset. Finally, both datasets and both sets of labels (original and pseudo) are combined to train a final model. The labels are called "pseudo" (meaning not genuine) because they are model predictions, which may or may not match the true labels.
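The three steps above can be sketched as one small helper. This is a minimal sketch rather than a definitive implementation: the function name `fit_with_pseudo_labels`, the `seed` parameter, and the toy data from `make_classification` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_with_pseudo_labels(X_labelled, y_labelled, X_unlabelled, seed=0):
    """Hypothetical helper: train, pseudo-label, then retrain on the union."""
    # Step 1: train on the labelled data only.
    base_model = RandomForestClassifier(random_state=seed)
    base_model.fit(X_labelled, y_labelled)

    # Step 2: predict pseudo labels for the unlabelled data.
    pseudo_labels = base_model.predict(X_unlabelled)

    # Step 3: retrain on the original labels plus the pseudo labels.
    X_all = np.concatenate((X_labelled, X_unlabelled))
    y_all = np.concatenate((y_labelled, pseudo_labels))
    final_model = RandomForestClassifier(random_state=seed)
    final_model.fit(X_all, y_all)
    return final_model, pseudo_labels


# Toy data (an assumption for illustration): 200 samples, the first 80 treated as labelled.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
final_model, pseudo_labels = fit_with_pseudo_labels(X_demo[:80], y_demo[:80], X_demo[80:])
print(pseudo_labels.shape)
```

The helper returns the pseudo labels alongside the final model so they can be inspected before they are trusted.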


# Implementation in Python

For this demonstration, we use the breast cancer dataset from sklearn. It already contains labels, so we simulate the semi-supervised setting by splitting the data into two parts: one that keeps its labels and one treated as unlabelled. We train a model on the labelled part, use it to generate labels for the unlabelled part, and then use both parts to train a final model.

Importing libraries

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels --user -q

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Loading dataset

In [None]:
data = load_breast_cancer()
X = data['data']
y = data['target']

Splitting the dataset into labelled and unlabelled data in a 40:60 ratio

In [None]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=.6)
x_train.shape,y_train.shape,x_test.shape

Creating a model and fitting it on the labelled data

In [None]:
model1 = RandomForestClassifier()
history = model1.fit(x_train,y_train)
history

Accuracy of the model on the labelled data it was trained on

In [None]:
model1.score(x_train,y_train)

Now we use this model to predict labels (called pseudo labels) for the unlabelled data

In [None]:
y_new = model1.predict(x_test)
y_new.shape
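Because this dataset actually has labels for the "unlabelled" part, the pseudo labels can be compared against the labels that were held back. The sketch below repeats the steps so far in one self-contained cell; `random_state=0` and the use of `accuracy_score` are assumptions added here, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# random_state=0 is an assumption added so the split is repeatable.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=0)

# Train on the labelled part and pseudo-label the rest.
model1 = RandomForestClassifier(random_state=0).fit(x_train, y_train)
y_new = model1.predict(x_test)

# Fraction of pseudo labels that match the labels we held back.
agreement = accuracy_score(y_test, y_new)
print(f"pseudo-label agreement: {agreement:.3f}")
```

On real unlabelled data this check is not available, so it is only a sanity check for the demonstration.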

We now concatenate the labelled and unlabelled datasets

In [None]:
final_X = np.concatenate((x_train,x_test))
final_X.shape

Similarly, the original labels and the pseudo labels are concatenated.

In [None]:
final_Y = np.concatenate((y_train,y_new))  # y_new holds the pseudo labels, not the true y_test
final_Y.shape

The final model is fitted on the entire dataset and its accuracy on that same data is reported

In [None]:
model2 = RandomForestClassifier()  # a classifier, so score() returns accuracy rather than R^2
model2.fit(final_X,final_Y)
model2.score(final_X,final_Y)
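A score computed on the data a model was fitted on tends to be optimistic, so one way to judge whether the pseudo labels helped is to set aside an evaluation set that neither model sees. The sketch below is one possible setup under stated assumptions: the 20% evaluation split, the stratification, the seeds, and all variable names are illustrative choices, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% for evaluation only (split size and seed are assumptions).
X_rest, X_eval, y_rest, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Split the remainder 40:60 into labelled and "unlabelled" parts, as in the walkthrough.
X_lab, X_unlab, y_lab, _ = train_test_split(
    X_rest, y_rest, test_size=0.6, random_state=0, stratify=y_rest
)

# Model trained on the labelled part only.
baseline = RandomForestClassifier(random_state=0).fit(X_lab, y_lab)
pseudo = baseline.predict(X_unlab)

# Model trained on the labelled part plus the pseudo-labelled part.
combined = RandomForestClassifier(random_state=0).fit(
    np.concatenate((X_lab, X_unlab)), np.concatenate((y_lab, pseudo))
)

# Both models are scored on data that neither saw during training.
baseline_acc = baseline.score(X_eval, y_eval)
combined_acc = combined.score(X_eval, y_eval)
print(f"labelled only:      {baseline_acc:.3f}")
print(f"with pseudo labels: {combined_acc:.3f}")
```

The two accuracies may be close or may differ in either direction; the comparison, not a particular number, is the point of the sketch.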