### Ensemble-Based Semi-Supervised Learning for Incomplete Labeled Data

This project applies a novel ensemble-based semi-supervised learning model to the Marketing Campaign dataset downloaded from Kaggle.

The project uses ensemble techniques such as Bagging, Boosting, and Stacking alongside a Semi-Supervised technique called Self-Training.


#### Installing Libraries

In [None]:
#!pip install xgboost
#!pip install lightgbm

#### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import datetime
import openpyxl

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import train_test_split,RandomizedSearchCV
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier 
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

#### Loading Dataset
Once the dataset is loaded, the ‘Dependents_Flag’ column is created to be used as the prediction target. In other words, the project seeks to predict whether our supermarket shopper has any dependents (kids/teens) at home or not.

In [2]:
df = pd.read_csv('marketing_campaign.csv', 
                 encoding='utf-8', delimiter=';',
                 usecols=['ID','Marital_Status','Kidhome', 'Teenhome', 
                          'MntWines', 'MntMeatProducts']
                )

# Create a flag to denote whether the person has any dependants at home
# (either kids or teens)
df['Dependents_Flag']=df.apply(lambda x: 1 
                               if x['Kidhome']+x['Teenhome']>0 
                               else 0, axis=1)

df

Unnamed: 0,ID,Marital_Status,Kidhome,Teenhome,MntWines,MntMeatProducts,Dependents_Flag
0,5524,Single,0,0,635,546,0
1,2174,Single,1,1,11,6,1
2,4141,Together,0,0,426,127,0
3,6182,Together,1,0,11,20,1
4,5324,Married,1,0,173,118,1
...,...,...,...,...,...,...,...
2235,10870,Married,0,1,709,182,1
2236,4001,Together,2,1,406,30,1
2237,7270,Divorced,0,0,908,217,0
2238,8235,Together,0,1,428,214,1
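
As a quick check of the flag logic, the sketch below rebuilds `Dependents_Flag` on three rows copied from the output above, using a vectorized equivalent of the row-wise `apply`. The small frame `sample` is illustrative only and is not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the displayed output (IDs 5524, 2174, 4001)
sample = pd.DataFrame({
    'ID': [5524, 2174, 4001],
    'Kidhome': [0, 1, 2],
    'Teenhome': [0, 1, 1],
})

# 1 if the shopper has at least one kid or teen at home, else 0
sample['Dependents_Flag'] = (
    (sample['Kidhome'] + sample['Teenhome']) > 0
).astype(int)

print(sample)
```

The resulting flags match the `Dependents_Flag` values shown for the same IDs in the output above.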


#### Split data 
The dataset is split into train and test samples, and their sizes are checked. We then mask 95% of the labels within the training data and create a target variable that uses ‘-1’ to denote unlabeled (masked) data.

In [3]:
df_train, df_test = train_test_split(df, test_size=0.25, random_state=0)
print('Size of train dataframe: ', df_train.shape[0])
print('Size of test dataframe: ', df_test.shape[0])

Size of train dataframe:  1680
Size of test dataframe:  560


In [6]:
# Create a flag for label masking (True = label will be hidden)
df_train = df_train.copy()  # work on a copy to avoid SettingWithCopyWarning
df_train['Random_Mask'] = True
df_train.loc[df_train.sample(frac=0.05, random_state=0).index, 
             'Random_Mask'] = False

# Create a new target column with labels. The 1's and 0's are 
# original labels and -1 represents unlabeled (masked) data
df_train['Dependents_Target'] = np.where(df_train['Random_Mask'], -1,
                                         df_train['Dependents_Flag'])

# Show target value distribution
print('Target Value Distribution:')
print(df_train['Dependents_Target'].value_counts())

Target Value Distribution:
-1    1596
 1      58
 0      26
Name: Dependents_Target, dtype: int64
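
The counts above follow directly from the sampling fraction: 5% of 1680 training rows keep their label and the remaining 95% are masked. The sketch below reproduces that arithmetic on a hypothetical stand-in frame (`toy`, with random 0/1 labels), so only the labeled/masked split matches the notebook, not the class counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training labels: 1680 rows of random 0/1 labels
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Dependents_Flag': rng.integers(0, 2, size=1680)})

# Keep the true label for a 5% random sample; mask everything else with -1
keep_idx = toy.sample(frac=0.05, random_state=0).index
toy['Dependents_Target'] = -1
toy.loc[keep_idx, 'Dependents_Target'] = toy.loc[keep_idx, 'Dependents_Flag']

n_labeled = int((toy['Dependents_Target'] != -1).sum())
n_masked = int((toy['Dependents_Target'] == -1).sum())
print(n_labeled, n_masked)  # 84 1596
```

The 84 labeled rows correspond to the 58 + 26 observations with targets 1 and 0 in the distribution printed above.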


#### Visualization
The training data is plotted on a 2D scatterplot to see how the observations are distributed. The plot shows the mix of labeled and unlabeled data that Semi-Supervised Learning works with.

In [7]:
# Create a scatter plot
fig = px.scatter(df_train, x='MntMeatProducts', y='MntWines', 
                 opacity=1, 
                 color=df_train['Dependents_Target'].astype(str),
                 color_discrete_sequence=['lightgrey', 'red', 'blue'],
                )

# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='white', 
                 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
                 showline=True, linewidth=1, linecolor='white')

fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='white', 
                 zeroline=True, zerolinewidth=1, zerolinecolor='white', 
                 showline=True, linewidth=1, linecolor='white')

# Set figure title
fig.update_layout(
    title_text="Marketing Campaign Training Data - Labeled vs. Unlabeled")

# Update marker size
fig.update_traces(marker=dict(size=5))

fig.show()

#### Model training
As shown in the above plot, we will use ‘MntMeatProducts’ (shopper’s annual spend on meat products) and ‘MntWines’ (shopper’s annual spend on wine) as two features to predict whether the shopper has any dependents at home.


We will first train several individual models (the Random Forest Classifier, XGBoost Classifier, Light GBM Classifier, and Support Vector Machine Classifier) on the labeled data only, to establish a performance benchmark. This lets us judge whether the ensemble-based Semi-Supervised approach in a later step is better or worse.

In [6]:
# Select only records with known labels
df_train_labeled=df_train[df_train['Dependents_Target']!=-1]

# Select data for modeling 
X_baseline=df_train_labeled[['MntMeatProducts', 'MntWines']]
y_baseline=df_train_labeled['Dependents_Target'].values

# Put test data into an array
X_test=df_test[['MntMeatProducts', 'MntWines']]
y_test=df_test['Dependents_Flag'].values

In [7]:
# Creating dictionary of models
models = {
    'RandomForestClassifier':RandomForestClassifier(random_state=2018),
    'XGBClassifier':XGBClassifier(random_state=2018, learning_rate=0.01),
    'LGBMClassifier':LGBMClassifier(random_state=2018,learning_rate=0.001),
    'SVMClassifier':SVC(kernel='rbf', probability=True, C=1.0, 
                        gamma='scale',random_state=2018)
}

In [8]:
for key, model in models.items():
    # Fit each baseline model on the labeled subset only
    model.fit(X_baseline, y_baseline)
    
    # Use score method to get accuracy of the model
    print('Model Evaluation ---', key)
    print('----------------------------------------------')
    accuracy_score_B = model.score(X_test, y_test)
    print('Accuracy Score: ', accuracy_score_B)
    # Look at classification report to evaluate the model
    print(classification_report(y_test, model.predict(X_test)))
    print(' ')

Model Evaluation --- RandomForestClassifier
----------------------------------------------
Accuracy Score:  0.8232142857142857
              precision    recall  f1-score   support

           0       0.71      0.60      0.65       154
           1       0.86      0.91      0.88       406

    accuracy                           0.82       560
   macro avg       0.78      0.75      0.77       560
weighted avg       0.82      0.82      0.82       560

 
Model Evaluation --- XGBClassifier
----------------------------------------------
Accuracy Score:  0.8196428571428571
              precision    recall  f1-score   support

           0       0.72      0.56      0.63       154
           1       0.85      0.92      0.88       406

    accuracy                           0.82       560
   macro avg       0.78      0.74      0.76       560
weighted avg       0.81      0.82      0.81       560

 
Model Evaluation --- LGBMClassifier
----------------------------------------------
Accuracy Score


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





The results above show that most individual models already perform reasonably well, with accuracies of 82.3%, 82.0%, 72.5%, and 82.85% for the Random Forest Classifier, XGBoost Classifier, Light GBM Classifier, and Support Vector Machine Classifier respectively. The Light GBM score is the clear outlier, and the warnings in its output indicate that it never predicted one of the two labels. Note that the f1 score is higher for label=1 (shopper with dependents) due to class imbalance.
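
Because the test set is imbalanced, it helps to compare these accuracies against a trivial majority-class predictor. The sketch below uses only the `support` counts and the Random Forest recalls printed in the reports above. Balanced accuracy is computed from the rounded recalls, so it is approximate.

```python
# Test-set class counts, read from the 'support' column of the reports above
support = {0: 154, 1: 406}
n_test = sum(support.values())

# Accuracy of a model that always predicts the majority class (label=1)
majority_accuracy = max(support.values()) / n_test

# Balanced accuracy = mean of per-class recall (Random Forest report, rounded)
rf_recall = {0: 0.60, 1: 0.91}
rf_balanced_accuracy = sum(rf_recall.values()) / len(rf_recall)

print(f'Majority-class accuracy: {majority_accuracy:.3f}')
print(f'Random Forest balanced accuracy (approx.): {rf_balanced_accuracy:.3f}')
```

The majority-class accuracy works out to 0.725, the same value reported for the Light GBM Classifier. This suggests that model predicted label=1 for every test shopper, and that the other models improve on the trivial baseline by roughly ten percentage points.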

#### Using Stacking Methods
Here, we implement the stacking ensemble learning technique by combining the predictions of all the individual models mentioned above, trained on the labeled data. These estimators are used to create a Stacking Classifier model.

In [18]:
# Creating list of estimators
list_estimators = [
    ('RandomForestClassifier', 
     RandomForestClassifier(random_state=2018)),
    ('XGBClassifier',
     XGBClassifier(random_state=2018,learning_rate=0.01)),
    ('LGBMClassifier',
     LGBMClassifier(random_state=2018, learning_rate=0.001)),
    ('SVMClassifier',
     SVC(kernel='rbf', probability=True, C=1.0, gamma='scale',
         random_state=2018))
                  ]

In [21]:
# Defining the Stacking Classifier model
model_sclf = StackingClassifier(estimators=list_estimators, 
                                final_estimator=LogisticRegression())

In [22]:
# Training 
sclf = model_sclf.fit(X_baseline, y_baseline)
    
# Use score method to get accuracy of the model
print('Model Evaluation ---', 'Stacking Classifier')
print('----------------------------------------------')
accuracy_score_B = model_sclf.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_B)
# Look at classification report to evaluate the model
print(classification_report(y_test, model_sclf.predict(X_test)))
print(' ')

Model Evaluation --- Stacking Classifier
----------------------------------------------
Accuracy Score:  0.825
              precision    recall  f1-score   support

           0       0.78      0.51      0.61       154
           1       0.83      0.95      0.89       406

    accuracy                           0.82       560
   macro avg       0.81      0.73      0.75       560
weighted avg       0.82      0.82      0.81       560

 


The results above show an accuracy of 82.5%, marginally higher than that of the Random Forest Classifier model (82.3%). Note that the f1 score is also higher here for label=1 (shopper with dependents) due to class imbalance.
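
To make the mechanics of stacking more concrete, here is a minimal sketch on synthetic data. It assumes only scikit-learn estimators (a Random Forest and an SVC) rather than the four models used above, and the data from `make_classification` is a stand-in for the labeled shoppers. It shows that the final `LogisticRegression` is trained on cross-validated predictions of the base models, one column per base model in a binary problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-feature data standing in for the labeled shoppers
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('svc', SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner inputs come from 5-fold cross-validated predictions
)
stack.fit(X_tr, y_tr)

# Inputs seen by the meta-learner: one probability column per base model
meta_features = stack.transform(X_te)
test_accuracy = stack.score(X_te, y_te)
print('Meta-feature shape:', meta_features.shape)
print('Test accuracy:', test_accuracy)
```

With only 84 labeled rows in the notebook, each cross-validation fold inside the stacker is small, which may be one reason the stacked model gains little over its best base model.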

#### Implementing Sklearn’s Self-Training Classifier
Now we implement a Semi-Supervised approach with Sklearn’s Self-Training Classifier, using the Stacking Classifier model as the base estimator and fitting on the whole training set (labeled and unlabeled).

In [23]:
# Select data for modeling - we are including masked (-1) labels this time
X_train = df_train[['MntMeatProducts', 'MntWines']]
y_train = df_train['Dependents_Target'].values

In [24]:
# Specify Self-Training model parameters
self_training_model = SelfTrainingClassifier(base_estimator=model_sclf, 
                                             threshold=0.7, 
                                             criterion='threshold', 
                                             max_iter=100, verbose=True)

# Fit the model
clf_ST = self_training_model.fit(X_train, y_train)

End of iteration 1, added 1343 new labels.
End of iteration 2, added 220 new labels.
End of iteration 3, added 26 new labels.
End of iteration 4, added 3 new labels.
End of iteration 5, added 2 new labels.
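
The log can be summed to see how much of the unlabeled pool ended up pseudo-labeled. The sketch below uses only the numbers printed above and the 1596 masked rows from the target distribution.

```python
# Pseudo-labels added per iteration, copied from the verbose log above
added_per_iteration = [1343, 220, 26, 3, 2]

n_unlabeled_start = 1596  # rows with target -1 before self-training
n_pseudo_labeled = sum(added_per_iteration)
n_still_unlabeled = n_unlabeled_start - n_pseudo_labeled
first_round_share = added_per_iteration[0] / n_unlabeled_start

print(n_pseudo_labeled, n_still_unlabeled)  # 1594 2
print(f'Share labeled in iteration 1: {first_round_share:.1%}')
```

Almost all masked rows receive a pseudo-label, and most of them in the very first iteration, so the quality of the initial stacked model largely determines the labels the later iterations train on.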


In [27]:
# Model Evaluation - Self Training Model

print('Model Evaluation ---', 'Self-Training Classifier')
print('----------------------------------------------')
accuracy_score_ST = clf_ST.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_ST)
# Look at classification report to evaluate the model
print(classification_report(y_test, clf_ST.predict(X_test)))

Model Evaluation --- Self-Training Classifier
----------------------------------------------
Accuracy Score:  0.8303571428571429
              precision    recall  f1-score   support

           0       0.80      0.51      0.62       154
           1       0.84      0.95      0.89       406

    accuracy                           0.83       560
   macro avg       0.82      0.73      0.76       560
weighted avg       0.83      0.83      0.82       560



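The results above show a small improvement in model performance, with an accuracy of approximately 83% compared with 82.5% for the Stacking Classifier alone. The F1 score is also marginally better for label=0 (0.62 vs. 0.61), driven by improved precision (0.80 vs. 0.78).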
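
To put the size of the improvement in perspective, the reported accuracies can be converted back into counts of correctly classified test shoppers. The sketch below uses only the accuracy scores printed in the evaluation outputs above and the test size of 560.

```python
n_test = 560  # size of the test dataframe

# Accuracy scores copied from the evaluation outputs above
reported_accuracy = {
    'Random Forest': 0.8232142857142857,
    'XGBoost': 0.8196428571428571,
    'Stacking': 0.825,
    'Self-Training': 0.8303571428571429,
}

# Convert each accuracy back into a count of correct predictions
correct = {name: round(acc * n_test) for name, acc in reported_accuracy.items()}
gain = correct['Self-Training'] - correct['Stacking']

print(correct)
print('Extra correct predictions vs. stacking:', gain)
```

The gain over the Stacking Classifier corresponds to three additional correct predictions out of 560, so the improvement should be read as modest on a single train/test split.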

This model used a probability threshold of 0.7. It means that any unlabeled observation whose highest predicted class probability is above 0.7 is added to the pool of pseudo-labeled data and used to train the model in the next iteration.
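
The selection step described above can be sketched by hand for a single round. This is a simplified illustration, not Sklearn's internal code: it assumes a plain `LogisticRegression` as a hypothetical base estimator instead of the Stacking Classifier, and synthetic data in place of the shopper features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 40 labeled rows, the rest treated as unlabeled (-1)
X, y_true = make_classification(n_samples=300, n_features=2, n_informative=2,
                                n_redundant=0, random_state=0)
y = np.full_like(y_true, -1)
y[:40] = y_true[:40]

labeled = y != -1
threshold = 0.7

# One round of self-training: fit on labeled rows, score the unlabeled ones
base = LogisticRegression().fit(X[labeled], y[labeled])
proba = base.predict_proba(X[~labeled])
max_proba = proba.max(axis=1)

# Only predictions more confident than the threshold become pseudo-labels
confident = max_proba > threshold
pseudo_labels = base.classes_[proba.argmax(axis=1)][confident]

y_next = y.copy()
unlabeled_idx = np.flatnonzero(~labeled)
y_next[unlabeled_idx[confident]] = pseudo_labels

print('Pseudo-labels added:', int(confident.sum()))
print('Still unlabeled:', int((y_next == -1).sum()))
```

Repeating this round with `y_next` as the new target, until no prediction clears the threshold or an iteration limit is reached, gives the overall self-training loop. A higher threshold admits fewer pseudo-labels per round, which typically trades coverage for label reliability.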