# **End-to-end tutorial for SMS spam using SPEAR (Cage and JL)**

In [1]:
import sys
sys.path.append('../../')
import numpy as np

# **Defining an Enum to hold labels**
### **Representation of class Labels**

<p>All the class labels for which we define labeling functions are encoded in an enum and used in the following steps. Do not define an Abstain class (the case where a labeling function (LF) makes no decision) inside this enum; instead, use the ABSTAIN object shown later in the LF section.</p>

<p>The SPAM dataset contains 2 classes, i.e. <b>HAM</b> and <b>SPAM</b>. The numbers we associate with the classes can be anything, but it is suggested to use consecutive integers from 0 to number_of_classes-1.</p>

In [2]:
import enum

# enum to hold the class labels
class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0

THRESHOLD = 0.8
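
The numbering advice above can be checked mechanically. The sketch below uses only the standard-library `enum` module; `DemoLabels` and `values_are_contiguous` are names made up for this illustration and are not part of SPEAR.

```python
import enum


class DemoLabels(enum.Enum):
    # Same values as the tutorial's ClassLabels, under a separate name
    SPAM = 1
    HAM = 0


def values_are_contiguous(labels_enum):
    """Return True if the enum values are exactly 0..n-1 (hypothetical helper)."""
    values = sorted(member.value for member in labels_enum)
    return values == list(range(len(values)))


# Map each integer value back to its class name, e.g. for reading predictions.
value_to_name = {member.value: member.name for member in DemoLabels}

print(values_are_contiguous(DemoLabels))
print(value_to_name)
```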

# **Defining preprocessors, continuous_scorers, labeling functions**
While labeling the unlabelled data, we look for a few keywords to assign a class to each SMS.

<b>Example</b> : *If a message contains apply or buy in it then most probably the message is spam*
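
A minimal plain-Python sketch of this keyword lookup, without any SPEAR decorators. The helper name `contains_keyword` and the two-word keyword set are illustrative assumptions, not the lists used in this tutorial.

```python
def contains_keyword(message, keywords):
    """Return True if any whitespace-separated token of message is a keyword."""
    tokens = set(message.lower().strip().split())
    return len(keywords & tokens) > 0


demo_keywords = {"apply", "buy"}  # illustrative subset only

print(contains_keyword("Buy one get one", demo_keywords))
print(contains_keyword("See you at lunch", demo_keywords))
# Splitting on whitespace keeps punctuation attached, so "buy," is not "buy".
print(contains_keyword("Buy, now!", demo_keywords))
```

Because tokens are produced by a plain `split()`, punctuation stays attached to words; a message such as "Buy, now!" is therefore not matched by the keyword "buy".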

In [4]:
trigWord1 = {"free","credit","cheap","apply","buy","attention","shop","sex","soon","now","spam"}
trigWord2 = {"gift","click","new","online","discount","earn","miss","hesitate","exclusive","urgent"}
trigWord3 = {"cash","refund","insurance","money","guaranteed","save","win","teen","weight","hair"}
notFreeWords = {"toll","Toll","freely","call","meet","talk","feedback"}
notFreeSubstring = {"not free","you are","when","wen"}
firstAndSecondPersonWords = {"I","i","u","you","ur","your","our","we","us","youre"}
thirdPersonWords = {"He","he","She","she","they","They","Them","them","their","Their"}

### **Declaration of a simple preprocessor function**


For most tasks in NLP and computer vision, instead of using the raw data point we preprocess it and then label it. Preprocessor functions are used to preprocess an instance before labeling it. We use the **`@preprocessor(name,resources)`** decorator to declare a function as a preprocessor.

In [5]:
from spear.labeling import preprocessor


@preprocessor(name = "LOWER_CASE")
def convert_to_lower(x):
    return x.lower().strip()

lower = convert_to_lower("RED")
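
Conceptually, a preprocessor is a transformation applied to an instance before the LF body sees it, and LFs receive their preprocessors as a list (`pre=[...]`). The plain-Python stand-in below chains such functions in order. It is only a sketch of the idea: it does not use SPEAR, it says nothing about how SPEAR implements this internally, and all helper names are hypothetical.

```python
def strip_whitespace(x):
    return x.strip()


def to_lower(x):
    return x.lower()


def apply_preprocessors(x, preprocessors):
    """Apply each preprocessing function in order and return the final value."""
    for fn in preprocessors:
        x = fn(x)
    return x


cleaned = apply_preprocessors("  FREE Entry  ", [strip_whitespace, to_lower])
print(repr(cleaned))
```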

### **Some Labeling function(LF) definitions**
Below are some examples of how to define LFs and continuous LFs (CLFs). To get the continuous score for a CLF, we define a function with the continuous_scorer wrapper (just like the labeling_function wrapper) and pass it to the CLF as shown below. Also note how the continuous score is used inside a CLF. The word_similarity function carries the continuous_scorer wrapper and is defined in the con_scorer file in the same folder (this file is not part of the package).
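
To make the idea of a continuous score concrete, here is a toy scorer that returns a value in [0, 1] and is then compared against a threshold. This is not the tutorial's `word_similarity`, whose implementation is not shown here; it is a stand-in based on character-level similarity from the standard-library `difflib`, and the names `toy_word_similarity` and `DEMO_THRESHOLD` are assumptions for illustration.

```python
from difflib import SequenceMatcher


def toy_word_similarity(sentence, keywords):
    """Best character-level similarity in [0, 1] between any token and any keyword."""
    best = 0.0
    for token in sentence.lower().split():
        for keyword in keywords:
            best = max(best, SequenceMatcher(None, token, keyword).ratio())
    return best


DEMO_THRESHOLD = 0.8  # same value as THRESHOLD in this tutorial

score = toy_word_similarity("get your discounts today", {"discount", "gift"})
# A CLF-style decision: assign the label only when the score clears the threshold.
print(score >= DEMO_THRESHOLD)
```

Unlike the exact set intersection used by the discrete LFs, a continuous score lets a near match such as "discounts" still trigger the rule.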

In [6]:
from spear.labeling import labeling_function, ABSTAIN

from con_scorer import word_similarity
import re


@preprocessor()
def convert_to_lower(x):
    return x.lower().strip()


@labeling_function(resources=dict(keywords=trigWord1),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF1(c,**kwargs):    
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=trigWord2),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF2(c,**kwargs):
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=trigWord3),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF3(c,**kwargs):
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM 
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=notFreeWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF4(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=notFreeSubstring),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF5(c,**kwargs):
    for pattern in kwargs["keywords"]:    
        if "free" in c.split() and re.search(pattern,c, flags= re.I):
            return ClassLabels.HAM
    return ABSTAIN

@labeling_function(resources=dict(keywords=firstAndSecondPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF6(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN


@labeling_function(resources=dict(keywords=thirdPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF7(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(label=ClassLabels.SPAM)
def LF8(c,**kwargs):
    if (sum(1 for ch in c if ch.isupper()) > 6):
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord1),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF1(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord2),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF2(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord3),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF3(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=notFreeWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF4(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=notFreeSubstring),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF5(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=firstAndSecondPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF6(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=thirdPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF7(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=lambda x: 1-np.exp(float(-(sum(1 for ch in x if ch.isupper()))/2)),label=ClassLabels.SPAM)
def CLF8(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

model loading
model loaded
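
The scorer passed to CLF8 maps the number of uppercase characters n to 1 - exp(-n/2). With THRESHOLD = 0.8 the rule fires when n >= 2 ln 5 ≈ 3.22, i.e. from 4 uppercase characters onward, whereas the discrete LF8 requires more than 6. The sketch below reproduces that arithmetic; `uppercase_score` is a name introduced here for illustration.

```python
import numpy as np


def uppercase_score(text):
    """Same formula as the CLF8 scorer: 1 - exp(-n/2) for n uppercase characters."""
    n_upper = sum(1 for ch in text if ch.isupper())
    return 1 - np.exp(-n_upper / 2)


# Tabulate the score for small counts to see where it crosses 0.8.
for n in range(1, 6):
    print(n, round(float(1 - np.exp(-n / 2)), 4))
```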


# **Accumulating all LFs into rules, an LFSet (a class) object**
### **Importing LFSet and passing LFs we defined, to that class**

In [7]:
from spear.labeling import LFSet

LFS = [LF1,
    LF2,
    LF3,
    LF4,
    LF5,
    LF6,
    LF7,
    LF8,
    CLF1,
    CLF2,
    CLF3,
    CLF4,
    CLF5,
    CLF6,
    CLF7,
    CLF8
      ]

rules = LFSet("SPAM_LF")
rules.add_lf_list(LFS)

# **Loading data**
### **Load the data: X, X_feats, Y**
<p>Note that the utils module below is not part of the package but is used to load the necessary data. X is the raw data that is passed to the LFs, X_feats is a numpy array of shape (num_instances, num_features), and Y holds the true labels (if available).</p>

In [8]:
from utils import load_data_to_numpy, get_various_data

X, X_feats, Y = load_data_to_numpy()

validation_size = 100
test_size = 200
L_size = 100
U_size = 4000
n_lfs = len(rules.get_lfs())

X_V, Y_V, X_feats_V,_, X_T, Y_T, X_feats_T,_, X_L, Y_L, X_feats_L,_, X_U, X_feats_U,_ = get_various_data(X, Y,\
    X_feats, n_lfs, validation_size, test_size, L_size, U_size)
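
How `get_various_data` divides the data is not shown in this tutorial. As a rough illustration of one plausible approach, the sketch below shuffles row indices and cuts them into disjoint validation, test, labeled and unlabelled chunks. The function name `split_by_sizes`, the fixed seed and the dataset size of 5000 are all assumptions made for the example.

```python
import numpy as np


def split_by_sizes(n_items, sizes, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into consecutive disjoint chunks."""
    total = sum(sizes)
    if total > n_items:
        raise ValueError("requested more items than available")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    bounds = np.cumsum(sizes)[:-1]  # cut points between chunks
    return np.split(order[:total], bounds)


# 5000 is a placeholder dataset size; the chunk sizes match the cell above.
idx_V, idx_T, idx_L, idx_U = split_by_sizes(5000, [100, 200, 100, 4000])
print([len(i) for i in (idx_V, idx_T, idx_L, idx_U)])
```

The resulting index arrays can be used to slice X, X_feats and Y consistently, so every split refers to the same rows in all three.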

# **Labeling data**
### **Paths**
* path_json: path to json file generated by PreLabels
* V_path_pkl: path to pkl file generated by PreLabels containing the validation data with true labels
* L_path_pkl: path to pkl file generated by PreLabels containing the labeled data with true labels
* T_path_pkl: path to pkl file generated by PreLabels containing the test data with true labels
* U_path_pkl: path to pkl file generated by PreLabels containing the unlabelled data without true labels
* log_path: path to save the log which is generated during the algorithm
<p>The difference between test and labeled data is that labeled data may be used by the algorithm (JL uses it while Cage doesn't) but test data isn't. Make sure the pickle files are <font color='red'><b>EMPTY</b></font>, i.e. they should not contain any data, before passing them to the .generate_pickle() member function of PreLabels.</p>

In [9]:
path_json = 'data_pipeline/sms_json.json'
V_path_pkl = 'data_pipeline/sms_pickle_V.pkl' #validation data - have true labels
T_path_pkl = 'data_pipeline/sms_pickle_T.pkl' #test data - have true labels
L_path_pkl = 'data_pipeline/sms_pickle_L.pkl' #Labeled data - have true labels
U_path_pkl = 'data_pipeline/sms_pickle_U.pkl' #unlabelled data - don't have true labels

log_path_cage_1 = 'log/cage_log_1.txt' #cage is an algorithm, can be found below
log_path_jl_1 = 'log/jl_log_1.txt' #jl is an algorithm, can be found below
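
Since the pickle files must be empty before use, one way to guarantee that is to create the parent directory and truncate each file up front. The helper below is a hypothetical convenience built on `pathlib`; it is not part of SPEAR, and it destroys any existing content at the given path. It is demonstrated in a temporary directory so that nothing real is overwritten.

```python
import tempfile
from pathlib import Path


def ensure_empty_file(path):
    """Create parent directories if needed and truncate the file to zero bytes."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b"")  # overwrites any previous content
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo = ensure_empty_file(Path(tmp) / "data_pipeline" / "demo.pkl")
    demo_size = demo.stat().st_size

print(demo_size)
```

In the notebook itself, the same helper could be called on V_path_pkl, T_path_pkl, L_path_pkl and U_path_pkl before running PreLabels.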

### **Importing PreLabels class and using it to label data**

In [10]:
from spear.labeling import PreLabels

sms_noisy_labels = PreLabels(name="sms",
                               data=X_V,
                               gold_labels=Y_V,
                               data_feats=X_feats_V,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(V_path_pkl)
sms_noisy_labels.generate_json(path_json) #generating json files once is enough

sms_noisy_labels = PreLabels(name="sms",
                               data=X_T,
                               gold_labels=Y_T,
                               data_feats=X_feats_T,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(T_path_pkl)

sms_noisy_labels = PreLabels(name="sms",
                               data=X_L,
                               gold_labels=Y_L,
                               data_feats=X_feats_L,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(L_path_pkl)

sms_noisy_labels = PreLabels(name="sms",
                               data=X_U,
                               rules=rules,
                               data_feats=X_feats_U,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(U_path_pkl)

100%|██████████| 100/100 [00:06<00:00, 15.90it/s]
100%|██████████| 200/200 [00:11<00:00, 16.79it/s]
100%|██████████| 100/100 [00:07<00:00, 13.14it/s]
100%|██████████| 4000/4000 [04:07<00:00, 16.15it/s]


# **Accessing labeled data**
### **Importing and the use of get_data and get_classes**
<p>These functions can be used to extract data from pickle files and json file respectively. Note that these are the files generated using PreLabels.</p>
<p>For the detailed contents of the output, please refer to the documentation.</p>

In [11]:
from spear.utils import get_data, get_classes

data_U = get_data(U_path_pkl, check_shapes=True)
# check_shapes=True (above) asserts that the arrays in the pickle file have mutually consistent shapes
print("Number of elements in data list: ", len(data_U))
print("Shape of feature matrix: ", data_U[0].shape)
print("Shape of labels matrix: ", data_U[1].shape)
print("Shape of continuous scores matrix : ", data_U[6].shape)
print("Total number of classes: ", data_U[9])

classes = get_classes(path_json)
print("Classes dictionary in json file(modified to have integer keys): ", classes)

Number of elements in data list:  10
Shape of feature matrix:  (4000, 1024)
Shape of labels matrix:  (4000, 16)
Shape of continuous scores matrix :  (4000, 16)
Total number of classes:  2
Classes dictionary in json file(modified to have integer keys):  {1: 'SPAM', 0: 'HAM'}
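
Once the labels matrix of shape (num_instances, num_LFs) is available, a common sanity check is the coverage of each LF, i.e. the fraction of instances on which it does not abstain. The sketch below works on a small synthetic matrix. How abstentions are encoded in the pickle is not stated in this tutorial, so the sentinel is a parameter; the value 2 used here is purely an assumption for the demo, and `lf_coverage` is a hypothetical helper.

```python
import numpy as np


def lf_coverage(label_matrix, abstain_value):
    """Fraction of rows, per column (LF), whose entry is not the abstain sentinel."""
    return np.mean(label_matrix != abstain_value, axis=0)


# Synthetic 4 x 3 matrix: 4 instances, 3 LFs, with 2 standing in for "abstain".
demo_matrix = np.array([
    [1, 2, 0],
    [2, 2, 0],
    [1, 2, 2],
    [2, 0, 2],
])

coverage = lf_coverage(demo_matrix, abstain_value=2)
print(coverage)
```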


# **Cage Algorithm**
### **Importing Cage class (the algorithm) and declaring an object of it**

In [12]:
from spear.Cage import Cage

cage = Cage(path_json, n_lfs)

### **Use of fit_and_predict_proba function of Cage class**
The output (probs) is a numpy matrix of shape (num_instances, num_classes) holding the probability of each instance belonging to each class. For more details about the arguments, please refer to the documentation; the same applies to all member functions used from here on.

In [13]:
probs = cage.fit_and_predict_proba(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                                   qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01)
labels = np.argmax(probs, 1)
print("probs shape: ", probs.shape)
print("labels shape: ",labels.shape)

final_test_accuracy_score: 0.81
test_average_metric: binary	final_test_f1_score: 0.5250000000000001
probs shape:  (4000, 2)
labels shape:  (4000,)
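
The conversion from probabilities to labels used above is a row-wise `argmax`. The short sketch below shows it on a hand-made matrix (the values are invented for illustration), along with the per-instance confidence, which can be useful for spotting uncertain predictions.

```python
import numpy as np

# Rows are instances, columns are classes (column 0 = HAM, column 1 = SPAM).
demo_probs = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.5, 0.5],
])

demo_labels = np.argmax(demo_probs, axis=1)    # index of the most probable class
demo_confidence = np.max(demo_probs, axis=1)   # probability of that class

print(demo_labels)
print(demo_confidence)
```

Note that `np.argmax` returns the first index on ties, so an instance with equal probabilities is assigned class 0.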


### **Use of fit_and_predict function of Cage class**
The output (labels) is a numpy array of shape (num_instances,) containing integers: the predicted class of each instance.

In [14]:
labels = cage.fit_and_predict(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                              qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01, \
                              need_strings=False)
print("labels shape: ", labels.shape)

final_test_accuracy_score: 0.79
test_average_metric: binary	final_test_f1_score: 0.5116279069767442
labels shape:  (4000,)


### **Use of fit_and_predict function of Cage class**
The output (labels_strings) is a numpy array of shape (num_instances,) containing strings: the predicted class name of each instance.

In [16]:
labels_strings = cage.fit_and_predict(U_path_pkl, T_path_pkl, log_path_cage_1, need_strings=True)
print("labels_strings shape: ", labels_strings.shape)

final_test_accuracy_score: 0.79
test_average_metric: binary	final_test_f1_score: 0.5116279069767442
labels_strings shape:  (4000,)
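
Integer labels can also be converted to class names after the fact using the dictionary returned by `get_classes`. The sketch below hard-codes that dictionary as printed earlier in this tutorial and applies it to a small invented label array; it illustrates the mapping only and makes no claim about how `need_strings=True` is implemented.

```python
import numpy as np

# Same content as the classes dictionary printed by get_classes above.
demo_classes = {1: "SPAM", 0: "HAM"}

demo_int_labels = np.array([0, 1, 1, 0])
demo_str_labels = np.array([demo_classes[int(i)] for i in demo_int_labels])

print(demo_str_labels)
```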


### **Save parameters**
<p>Make sure the pickle file you pass here is empty.</p>

In [17]:
cage.save_params('params/sms_cage_params.pkl')

### **Load parameters**

In [18]:
cage_2 = Cage(path_json, n_lfs)
cage_2.load_params('params/sms_cage_params.pkl')

### **Use of predict_proba function of Cage class**
The output (probs_test) is a numpy matrix of shape (num_instances, num_classes) holding the probability of each instance belonging to each class.

In [19]:
probs_test = cage_2.predict_proba(path_test = T_path_pkl, qc = 0.85) 
# The test data here need not be the same test data passed to Cage earlier.
print("probs_test shape: ",probs_test.shape)

probs_test shape:  (200, 2)


### **Use of predict function of Cage class**
The output (labels_test) is a numpy array of shape (num_instances,) containing integers if need_strings is False, or strings if it is True: the predicted class of each instance. Only the use case with need_strings as False is shown here.

In [20]:
labels_test = cage_2.predict(path_test = T_path_pkl, qc = 0.85, need_strings = False)
print("labels_test shape: ", labels_test.shape)

labels_test shape:  (200,)
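
Given predicted labels and gold labels of the same shape, accuracy and a binary F1 score can be computed by hand. The sketch below uses plain NumPy on invented arrays. Treating class 1 (SPAM) as the positive class is an assumption, and the helper names are hypothetical; the metrics printed by Cage itself may be computed differently.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return float(np.mean(y_true == y_pred))


def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


demo_true = np.array([1, 0, 1, 1, 0, 0])
demo_pred = np.array([1, 0, 0, 1, 1, 0])

print(accuracy(demo_true, demo_pred))
print(binary_f1(demo_true, demo_pred))
```

With imbalanced classes, as is typical for spam data, accuracy can look healthy while F1 on the minority class stays low, which is consistent with the gap between the two numbers reported above.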


# **Joint Learning(JL) Algorithm**
### **Subset selection**