# ***End to End tutorial for SMS_SPAM labeling using Cage:***
**The paper, documentation, colab notebook can be found here:** [Paper](https://ojs.aaai.org/index.php/AAAI/article/view/5742), [Documentation](https://spear-decile.readthedocs.io/en/latest/#cage), [Colab](https://colab.research.google.com/drive/1vec-Q-xO9wQtM3p_CZ7237gCq0xIR9b9?usp=sharing)

In [1]:
#pip install

In [2]:
'''
Users do not need to include this cell to use the package
'''
import sys
sys.path.append('../../')

In [3]:
import numpy as np

# ***Defining an Enum to hold labels:***
### **Representation of class Labels**

<p>All the class labels for which we define labeling functions are encoded in an enum and used in the following tasks. Do not define an Abstain class (for a labeling function (LF) that makes no decision) inside this Enum; instead, import the ABSTAIN object, as shown later in the LF section.</p>

<p>The SPAM dataset contains 2 classes, i.e. <b>HAM</b> and <b>SPAM</b>. The numbers we associate with the classes can be anything, but it is suggested to use consecutive integers from 0 to number_of_classes-1.</p>

<p><b>Note that even though this example is binary classification, the SPEAR library supports multi-class classification.</b></p>

In [4]:
import enum

# enum to hold the class labels
class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0

THRESHOLD = 0.8
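
As a small illustration of why consecutive values starting at 0 are convenient, the sketch below (plain Python, no SPEAR calls) checks that an enum's values form the range 0 to number_of_classes-1 and looks a label up by its integer value. The helper name `has_contiguous_values` is hypothetical and not part of the package.

```python
import enum


class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0


def has_contiguous_values(labels_enum):
    """Return True when the enum values are exactly 0..number_of_classes-1."""
    values = sorted(member.value for member in labels_enum)
    return values == list(range(len(values)))


print(has_contiguous_values(ClassLabels))
# An integer value can be turned back into its enum member and name.
print(ClassLabels(1).name)
```

With such values, an integer label can double as a column index into any per-class array.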

# ***Defining preprocessors, continuous_scorers, labeling functions:***
While labeling the unlabelled data, we look for a few keywords to assign a class to an SMS.

<b>Example</b> : *If a message contains apply or buy in it then most probably the message is spam*

In [5]:
trigWord1 = {"free","credit","cheap","apply","buy","attention","shop","sex","soon","now","spam"}
trigWord2 = {"gift","click","new","online","discount","earn","miss","hesitate","exclusive","urgent"}
trigWord3 = {"cash","refund","insurance","money","guaranteed","save","win","teen","weight","hair"}
notFreeWords = {"toll","Toll","freely","call","meet","talk","feedback"}
notFreeSubstring = {"not free","you are","when","wen"}
firstAndSecondPersonWords = {"I","i","u","you","ur","your","our","we","us","youre"}
thirdPersonWords = {"He","he","She","she","they","They","Them","them","their","Their"}
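
The keyword lookup can be sketched in plain Python before any decorators are involved: lowercase the message, split it on whitespace, and test for a non-empty intersection with a keyword set. The function name `contains_trigger` is hypothetical, and the keyword set is only a subset of `trigWord1`.

```python
trig_words = {"free", "credit", "cheap", "apply", "buy"}  # subset of trigWord1


def contains_trigger(message, keywords):
    """Return True if any whitespace-separated token of the message is a keyword."""
    tokens = set(message.lower().strip().split())
    return len(keywords & tokens) > 0


print(contains_trigger("Apply now for a CHEAP loan", trig_words))
print(contains_trigger("See you at lunch", trig_words))
```

Because tokens are split on whitespace only, a keyword with punctuation attached (for example `buy!`) does not match; this is a property of the simple matching scheme, not something specific to this sketch.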

### **Declaration of a simple preprocessor function**


For most tasks in NLP and computer vision, we preprocess a datapoint and then label it instead of using the raw datapoint. Preprocessor functions are used to preprocess an instance before labeling it. We use the **`@preprocessor(name,resources)`** decorator to declare a function as a preprocessor.

In [6]:
from spear.labeling import preprocessor


@preprocessor(name = "LOWER_CASE")
def convert_to_lower(x):
    return x.lower().strip()

lower = convert_to_lower("RED")

### **Some Labeling function(LF) definitions**
Below are some examples of how to define LFs and continuous LFs (CLFs). To get the continuous score for a CLF, we define a function with the continuous_scorer decorator (just like the labeling_function decorator) and pass it to the CLF, as displayed below. Also note how the continuous score can be used in a CLF. Here word_similarity is a function with the continuous_scorer decorator; it is written in the con_scorer file in the same folder (this file is not part of the package).

In [7]:
from spear.labeling import labeling_function, ABSTAIN

from helper.con_scorer import word_similarity
import re


@preprocessor()
def convert_to_lower(x):
    return x.lower().strip()


@labeling_function(resources=dict(keywords=trigWord1),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF1(c,**kwargs):    
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=trigWord2),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF2(c,**kwargs):
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=trigWord3),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF3(c,**kwargs):
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM 
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=notFreeWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF4(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=notFreeSubstring),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF5(c,**kwargs):
    for pattern in kwargs["keywords"]:    
        if "free" in c.split() and re.search(pattern,c, flags= re.I):
            return ClassLabels.HAM
    return ABSTAIN

@labeling_function(resources=dict(keywords=firstAndSecondPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF6(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN


@labeling_function(resources=dict(keywords=thirdPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF7(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(label=ClassLabels.SPAM)
def LF8(c,**kwargs):
    if (sum(1 for ch in c if ch.isupper()) > 6):
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord1),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF1(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord2),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF2(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord3),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF3(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=notFreeWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF4(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=notFreeSubstring),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF5(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=firstAndSecondPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF6(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=thirdPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF7(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=lambda x: 1-np.exp(float(-(sum(1 for ch in x if ch.isupper()))/2)),label=ClassLabels.SPAM)
def CLF8(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

model loading
model loaded
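
To make the continuous score of `CLF8` concrete, the sketch below recomputes its formula, 1 - exp(-n/2) where n is the number of uppercase characters, in plain Python. Solving 1 - exp(-n/2) >= 0.8 gives n >= 2 ln 5 ≈ 3.22, so with `THRESHOLD = 0.8` the score crosses the threshold from 4 uppercase characters onward. The function name `uppercase_score` is hypothetical.

```python
import math

THRESHOLD = 0.8


def uppercase_score(text):
    """Score in [0, 1) that grows with the number of uppercase characters."""
    n_upper = sum(1 for ch in text if ch.isupper())
    return 1 - math.exp(-n_upper / 2)


for text in ["Hi", "FREE", "WIN CASH NOW"]:
    score = uppercase_score(text)
    print(text, round(score, 4), score >= THRESHOLD)
```

Note the difference from `LF8`, which fires only when a message has more than 6 uppercase characters; the continuous version with this threshold is less strict.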


# ***Accumulating all LFs into rules, an LFSet (a class) object:***
### **Importing LFSet and passing LFs we defined, to that class**

In [8]:
from spear.labeling import LFSet

LFS = [LF1,
    LF2,
    LF3,
    LF4,
    LF5,
    LF6,
    LF7,
    LF8,
    CLF1,
    CLF2,
    CLF3,
    CLF4,
    CLF5,
    CLF6,
    CLF7,
    CLF8
      ]

rules = LFSet("SPAM_LF")
rules.add_lf_list(LFS)

# ***Loading data:***
### **Load the data: X, Y**
<p>Note that the utils module below is not part of the package but is used to load the necessary data. Users can load their data (X, Y) by any means. X is the raw data that is passed to the LFs and Y holds the true labels (if available). Note that a feature matrix is not needed by the Cage algorithm but is needed by the JL algorithm.</p>

In [9]:
from helper.utils import load_data_to_numpy, get_test_U_data

X, _, Y = load_data_to_numpy()

test_size = 400
U_size = 4500
n_lfs = len(rules.get_lfs())

X_T, Y_T, _, X_U, _= get_test_U_data(X, Y, n_lfs, test_size, U_size)
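
Since `get_test_U_data` lives in a helper file outside the package, here is a minimal sketch of one way to carve a labelled test set and an unlabelled set out of loaded data. It is an assumption about what such a split could look like, not the helper's actual implementation; `split_test_unlabelled` and the toy arrays are hypothetical.

```python
import numpy as np


def split_test_unlabelled(X, Y, test_size, u_size, seed=0):
    """Shuffle once, then take disjoint test and unlabelled subsets."""
    assert test_size + u_size <= len(X), "not enough instances"
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    test_idx = order[:test_size]
    u_idx = order[test_size:test_size + u_size]
    # True labels are kept for the test set only.
    return X[test_idx], Y[test_idx], X[u_idx]


X_demo = np.array([f"message {i}" for i in range(10)])
Y_demo = np.array([i % 2 for i in range(10)])
X_T_demo, Y_T_demo, X_U_demo = split_test_unlabelled(X_demo, Y_demo, test_size=3, u_size=5)
print(X_T_demo.shape, Y_T_demo.shape, X_U_demo.shape)
```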

# ***Labeling data:***
### **Paths**
* path_json: path to json file generated by PreLabels
* T_path_pkl: path to pkl file generated by PreLabels containing the test data with true labels
* U_path_pkl: path to pkl file generated by PreLabels containing the unlabelled data without true labels
* log_path: path to save the log which is generated during the algorithm
* params_path: path to save parameters of model

<p>Make sure that the directories of the files in the above paths exist. Note that any existing contents in the pickle files will be erased.</p>

In [10]:
path_json = 'data_pipeline/Cage/sms_json.json'
T_path_pkl = 'data_pipeline/Cage/sms_pickle_T.pkl' #test data - have true labels
U_path_pkl = 'data_pipeline/Cage/sms_pickle_U.pkl' #unlabelled data - don't have true labels

log_path_cage_1 = 'log/Cage/sms_log_1.txt' #cage is an algorithm, can be found below
params_path = 'params/Cage/sms_params.pkl' #file path to store parameters of Cage, used below
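
One way to make sure the parent directories exist is to create them up front with the standard library. The sketch below does this under a temporary base directory so that it can be run anywhere; the helper name `ensure_parent_dirs` is hypothetical, and for real use the paths defined above would be passed directly.

```python
import os
import tempfile


def ensure_parent_dirs(paths):
    """Create the parent directory of every path if it is missing."""
    for path in paths:
        parent = os.path.dirname(path)
        if parent:  # a bare filename has no parent directory to create
            os.makedirs(parent, exist_ok=True)


base = tempfile.mkdtemp()  # throwaway location used only for this demo
demo_paths = [
    os.path.join(base, 'data_pipeline/Cage/sms_json.json'),
    os.path.join(base, 'log/Cage/sms_log_1.txt'),
    os.path.join(base, 'params/Cage/sms_params.pkl'),
]
ensure_parent_dirs(demo_paths)
print(all(os.path.isdir(os.path.dirname(p)) for p in demo_paths))
```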

### **Importing PreLabels class and using it to label data**
The JSON file needs to be generated only once, as shown below.
<p><b>Note:</b> We don't pass a feature matrix because the CAGE algorithm doesn't need one. Also note that we don't pass gold_labels (the true labels) to the 2nd PreLabels object, which generates labels for the unlabelled data (U).</p>

In [11]:
from spear.labeling import PreLabels

sms_noisy_labels = PreLabels(name="sms",
                               data=X_T,
                               gold_labels=Y_T,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(T_path_pkl)
sms_noisy_labels.generate_json(path_json) #generating json files once is enough

sms_noisy_labels = PreLabels(name="sms",
                               data=X_U,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2) #note that we don't pass gold_labels here, for the unlabelled data
sms_noisy_labels.generate_pickle(U_path_pkl)

100%|██████████| 400/400 [00:40<00:00,  9.99it/s]
100%|██████████| 4500/4500 [07:36<00:00,  9.85it/s]


# ***Accessing labeled data:***
### **Importing and the use of get_data and get_classes**
<p>These functions can be used to extract data from pickle files and json file respectively. Note that these are the files generated using PreLabels.</p>
<p>For the detailed contents of the output, please refer to the documentation.</p>

In [12]:
from spear.utils import get_data, get_classes

data_U = get_data(path = U_path_pkl, check_shapes=True)
# with check_shapes=True (above), the relative shapes of the arrays in the pickle file are asserted
print("Number of elements in data list: ", len(data_U))
print("Shape of feature matrix: ", data_U[0].shape)
print("Shape of labels matrix: ", data_U[1].shape)
print("Shape of continuous scores matrix : ", data_U[6].shape)
print("Total number of classes: ", data_U[9])

classes = get_classes(path = path_json)
print("Classes dictionary in json file(modified to have integer keys): ", classes)

Number of elements in data list:  10
Shape of feature matrix:  (0,)
Shape of labels matrix:  (4500, 16)
Shape of continuous scores matrix :  (4500, 16)
Total number of classes:  2
Classes dictionary in json file(modified to have integer keys):  {1: 'SPAM', 0: 'HAM'}
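
The labels matrix has one row per instance and one column per LF, so a quick sanity check is per-LF coverage: the fraction of instances on which an LF did not abstain. The sketch below uses a small synthetic matrix. The abstain marker `ABSTAIN_VALUE` is an assumption made for illustration; check the documentation for the value actually stored in the pickle file before applying this to `data_U[1]`.

```python
import numpy as np

ABSTAIN_VALUE = -1  # hypothetical marker, not necessarily what SPEAR stores


def lf_coverage(label_matrix, abstain_value):
    """Fraction of instances on which each LF (column) produced a label."""
    fired = label_matrix != abstain_value
    return fired.mean(axis=0)


demo_matrix = np.array([
    [1, -1, 0],
    [-1, -1, 0],
    [1, -1, -1],
    [-1, 1, 0],
])
coverage = lf_coverage(demo_matrix, ABSTAIN_VALUE)
print(coverage)
```

An LF with coverage near zero rarely fires and contributes little signal; one with coverage near one fires on almost everything and may be too broad.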


# ***Cage Algorithm:***
### **Importing Cage class (the algorithm) and declaring an object of it**
The Cage algorithm needs only the pickle file of unlabelled data (the data without true/gold labels), with labels given by the LFs through the PreLabels class, and it predicts the labels of this data. Optional test data (which has true/gold labels) can also be passed to log accuracy information.
<p><b>Note:</b> Multiple calls to fit_* functions train the parameters continuously, i.e. parameters are not reinitialised in fit_* functions. So, to train on large data, one can call fit_* functions repeatedly on smaller chunks. Also, in order to perform multiple runs of the algorithm, one needs to reinitialise the parameters (by creating a new Cage object) at the start of each run.</p>

In [13]:
from spear.cage import Cage

cage = Cage(path_json = path_json, n_lfs = n_lfs)

### **fit_and_predict_proba function of Cage class**
The output(probs) is a numpy matrix of shape (num_instances, num_classes) having the probability of a particular instance being that class. 
<p>Here the order of classes along a row for any instance is the ascending order of values used in enum defined before to hold labels.</p>
<p>For more details about the arguments, please refer to the documentation; the same applies to all the member functions used from here on.</p>

In [14]:
cage = Cage(path_json = path_json, n_lfs = n_lfs)

probs = cage.fit_and_predict_proba(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                                   qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01)
labels = np.argmax(probs, 1)
print("probs shape: ", probs.shape)
print("labels shape: ",labels.shape)

100%|██████████| 200/200 [00:15<00:00, 12.57it/s]

final_test_accuracy_score: 0.7975
test_average_metric: binary	final_test_f1_score: 0.5030674846625766
probs shape:  (4500, 2)
labels shape:  (4500,)
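
Because the columns of `probs` follow the ascending order of the enum values, column 0 corresponds to HAM (0) and column 1 to SPAM (1) in this tutorial. The sketch below, on a made-up probability matrix rather than the tutorial's output, maps the argmax column back to an enum value explicitly; with values 0 to number_of_classes-1 this mapping is the identity, which is why `np.argmax` alone suffices above.

```python
import enum

import numpy as np


class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0


# Column j of the probability matrix belongs to the j-th smallest enum value.
sorted_values = np.array(sorted(member.value for member in ClassLabels))

demo_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])  # illustrative only
column_idx = np.argmax(demo_probs, axis=1)
demo_labels = sorted_values[column_idx]
demo_names = [ClassLabels(int(v)).name for v in demo_labels]
print(demo_labels, demo_names)
```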





### **fit_and_predict function of Cage class**
The output (labels) is a numpy array of shape (num_instances,) containing integers (because need_strings is False) that give the class of each instance.

In [15]:
cage = Cage(path_json = path_json, n_lfs = n_lfs)

labels = cage.fit_and_predict(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                              qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01, \
                              need_strings = False)

print("labels shape: ", labels.shape)
print(type(labels[0]))

100%|██████████| 200/200 [00:25<00:00,  7.92it/s]

final_test_accuracy_score: 0.7975
test_average_metric: binary	final_test_f1_score: 0.5030674846625766
labels shape:  (4500,)
<class 'numpy.int64'>





### **fit_and_predict function of Cage class**
The output (labels_strings) is a numpy array of shape (num_instances,) containing strings (because need_strings is True) that give the class of each instance.

In [16]:
cage = Cage(path_json = path_json, n_lfs = n_lfs)

labels_strings = cage.fit_and_predict(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                              qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01, \
                              need_strings = True)

print("labels_strings shape: ", labels_strings.shape)
print(type(labels_strings[0]))

100%|██████████| 200/200 [00:21<00:00,  9.12it/s]

final_test_accuracy_score: 0.7975
test_average_metric: binary	final_test_f1_score: 0.5030674846625766
labels_strings shape:  (4500,)
<class 'numpy.str_'>





### **Save parameters**
<p> Make sure that the directory of the save_path file exists. Note that any existing contents in pickle file will be erased.</p>

In [17]:
cage.save_params(save_path = params_path)

### **Load parameters**

In [18]:
cage_2 = Cage(path_json = path_json, n_lfs = n_lfs)
cage_2.load_params(load_path = params_path)

### **predict_proba function of Cage class**
The output(probs_test) is a numpy matrix of shape (num_instances, num_classes) having the probability of a particular instance being that class.
<p>Here the order of classes along a row for any instance is the ascending order of values used in enum defined before to hold labels.</p>

In [19]:
probs_test = cage_2.predict_proba(path_test = T_path_pkl, qc = 0.85) 
# This need not be the same test data that was passed to the Cage object above.
print("probs_test shape: ",probs_test.shape)

probs_test shape:  (400, 2)


### **predict function of Cage class**
The output (labels_test) is a numpy array of shape (num_instances,) that gives the class of each instance, as integers if need_strings is False or as strings if need_strings is True. Only the use case with need_strings set to False is displayed here.

In [20]:
labels_test = cage_2.predict(path_test = T_path_pkl, qc = 0.85, need_strings = False)
print("labels_test shape: ", labels_test.shape)

from sklearn.metrics import accuracy_score, f1_score

#Y_T is true labels of test data, type is numpy array of shape (num_instances,)
print("accuracy_score: ", accuracy_score(Y_T, labels_test))
print("f1_score: ", f1_score(Y_T, labels_test, average = 'binary'))

labels_test shape:  (400,)
accuracy_score:  0.7975
f1_score:  0.5030674846625766
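
Accuracy and binary F1 can differ considerably when one class dominates, as the scores above suggest. The sketch below computes both by hand on toy labels (not the tutorial's data), treating label 1 as the positive class, which matches `average = 'binary'` with scikit-learn's default `pos_label=1`. The function name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, binary F1) computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Eight negatives and two positives: most predictions are right,
# yet only half of the positive-class decisions are.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
acc_demo, f1_demo = accuracy_and_f1(y_true_demo, y_pred_demo)
print(acc_demo, f1_demo)
```

Here one true positive, one false positive, and one false negative give precision and recall of 0.5 each, while the many correctly predicted negatives keep accuracy high.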


### **Converting numpy array of integers to enums**
The utility below from spear converts the return values of predict (obtained when need_strings is False) to a numpy array of enums.

In [21]:
from spear.utils import get_enum

labels_test_enum = get_enum(np_array = labels_test, enm = ClassLabels) 
#the second argument is the Enum class defined at beginning
print(type(labels_test_enum[0]))

<enum 'ClassLabels'>
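
For intuition, the same kind of conversion can be sketched in plain Python and numpy. This is only an illustration of the idea, not how `get_enum` is implemented in spear; the function name `ints_to_enum` is hypothetical.

```python
import enum

import numpy as np


class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0


def ints_to_enum(int_labels, labels_enum):
    """Map each integer label to the enum member that has that value."""
    members = [labels_enum(int(v)) for v in int_labels]
    # dtype=object keeps the enum members themselves as array elements.
    return np.array(members, dtype=object)


demo_enum_labels = ints_to_enum(np.array([0, 1, 1]), ClassLabels)
print(type(demo_enum_labels[0]), demo_enum_labels.shape)
```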
