# ***End to End tutorial for SMS_SPAM labeling using High Level Supervision:***
**The paper and documentation can be found here:** [Paper](https://openreview.net/pdf/e4d3b0f4237ea03ce6b9b73bd796822f7f84a40c.pdf), [Documentation](https://spear-decile.readthedocs.io/en/latest)

In [1]:
'''
Users don't need to include this cell to use the package
'''
import sys
sys.path.append('../../')

# ***Defining an Enum to hold labels:***
### **Representation of class Labels**

<p>All class labels for which we define labeling functions are encoded in an enum that the later steps reuse. Do not define an Abstain class (for a labeling function (LF) that makes no decision) inside this enum; instead, import the ABSTAIN object, as used later in the LF section.</p>

<p>The SPAM dataset contains 2 classes, i.e. <b>HAM</b> and <b>SPAM</b>. The numbers associated with the labels can be anything, but consecutive integers from 0 to number_of_classes-1 are suggested.</p>

<p><b>**Note that even though this example is a binary classification, the SPEAR library supports multi-class classification.**</b></p>

In [2]:
import enum

# enum to hold the class labels
class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0

THRESHOLD = 0.8
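
As a small illustration of why consecutive integer values are convenient, the sketch below redefines the same enum and maps integer predictions back to label names. The `predictions` list is made up for this example and is not output from SPEAR.

```python
import enum

# Same enum as in the cell above, repeated so the sketch runs on its own
class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0

# Hypothetical integer predictions, invented for illustration
predictions = [1, 0, 0, 1]

# Enum lookup by value turns each integer back into its member
label_names = [ClassLabels(p).name for p in predictions]

# The number of classes follows directly from the enum
num_labels = len(ClassLabels)
print(label_names, num_labels)
```

Because the values run from 0 to number_of_classes-1, the same integers can also index directly into a per-class array of scores.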

# ***Defining preprocessors, continuous_scorers, labeling functions:***
While labeling the unlabelled data, we look for a few keywords to assign a class to an SMS.

<b>Example</b> : *If a message contains apply or buy in it then most probably the message is spam*

In [3]:
trigWord1 = {"free","credit","cheap","apply","buy","attention","shop","sex","soon","now","spam"}
trigWord2 = {"gift","click","new","online","discount","earn","miss","hesitate","exclusive","urgent"}
trigWord3 = {"cash","refund","insurance","money","guaranteed","save","win","teen","weight","hair"}
notFreeWords = {"toll","Toll","freely","call","meet","talk","feedback"}
notFreeSubstring = {"not free","you are","when","wen"}
firstAndSecondPersonWords = {"I","i","u","you","ur","your","our","we","us","youre"}
thirdPersonWords = {"He","he","She","she","they","They","Them","them","their","Their"}
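
To make the keyword idea concrete, here is a minimal plain-Python sketch of a keyword lookup. It does not use the SPEAR labeling-function API; the function name `keyword_vote`, the return value `None` standing in for an abstain, and the sample messages are all assumptions made for illustration.

```python
# A subset of the trigger words defined above, repeated so the sketch is self-contained
trigger_words = {"free", "credit", "cheap", "apply", "buy"}

def keyword_vote(message, keywords):
    """Return "SPAM" if any keyword occurs as a word in the message, else None (abstain)."""
    words = set(message.lower().split())
    if words & keywords:
        return "SPAM"
    return None  # no keyword found, so this rule makes no decision

print(keyword_vote("Apply now for cheap credit", trigger_words))
print(keyword_vote("see you at lunch", trigger_words))
```

Splitting on whitespace means that punctuation attached to a word (for example `buy!`) would not match, so a real rule would likely need proper tokenization.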

### **Declaration of a simple preprocessor function**


For most tasks in NLP and computer vision, we preprocess a raw datapoint before labeling it instead of using it directly. Preprocessor functions preprocess an instance before it is labeled. We use the **`@preprocessor(name,resources)`** decorator to declare a function as a preprocessor.

In [4]:
from spear.labeling import preprocessor


@preprocessor(name = "LOWER_CASE")
def convert_to_lower(x):
    return x.lower().strip()

lower = convert_to_lower("RED")

# ***High Level Supervision Algorithm:***

In [5]:
#my_data_feeders
import collections

f_d = 'f_d'
f_d_U = 'f_d_U'
test_w = 'test_w'

train_modes = [f_d, f_d_U]

F_d_U_Data = collections.namedtuple('GMMDataF_d_U', 'x l m L d r')
F_d_Data = collections.namedtuple('GMMDataF_d', 'x labels')

### **Importing the required functionalities**


Import the required libraries. The code relies on the TensorFlow 1.x-style API, which is accessed through `tensorflow.compat.v1` with TensorFlow 2 behavior disabled.

In [6]:
from spear.Implyloss import *
import numpy as np
import sys, os, shutil
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
# tf.reset_default_graph()

Instructions for updating:
non-resource variables are not supported in the long term


### **Setting up the model's checkpoints**

In [7]:
# import tensorflow.compat.v1 as tf
# tf.disable_v2_behavior()
tf.reset_default_graph()


test_best_ckpt()
test_checkmate()
test_checkpoint()
test_mru_checkpoints(num_to_keep=1)
test_mru_checkpoints(num_to_keep=5)


Checkpoint file does not exist
INFO:tensorflow:best.ckpt-51 is not in all_model_checkpoint_paths. Manually adding it.
Saved new best checkpoint to path:  /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-51
Restoring best checkpoint from path:  /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-51
INFO:tensorflow:Restoring parameters from /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-51
INFO:tensorflow:best.ckpt-52 is not in all_model_checkpoint_paths. Manually adding it.
Saved new best checkpoint to path:  /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-52
Restoring best checkpoint from path:  /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-51
INFO:tensorflow:Restoring parameters from /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-51
INFO:tensorflow:best.ckpt-53 is not in all_model_checkpoint_paths. Manually adding it.
Saved new best checkpoint to path:  /tmp/best_ckpt_0.266783/foo-bar/best.ckpt-53
INFO:tensorflow:best.ckpt-54 is not in all_model_checkpoint_paths. Manually adding it.
Saved new best checkpoint to path:  /t

### **Initializing the Directories for storing relevant information**

In [8]:
checkpoint_dir =  './checkpoint'
data_dir = "../../examples/SMS_SPAM/data_pipeline/" # Directory containing data pickles
inference_output_dir = './inference_output/'
log_dir = './log/hls'
metric_pickle_dir = './met_pickl/'
tensorboard_dir =  './tensorboard'

### **Creating the directories if they don't exist**

In [9]:
if not os.path.exists(inference_output_dir):
    os.makedirs(inference_output_dir)

if not os.path.exists(log_dir):
    os.makedirs(log_dir)

if not os.path.exists(metric_pickle_dir):
    os.makedirs(metric_pickle_dir)

if not os.path.exists(tensorboard_dir):
    os.makedirs(tensorboard_dir)
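
The four checks above follow the same pattern, so they could also be written as a loop. The sketch below uses `os.makedirs(..., exist_ok=True)` from the standard library, which skips directories that already exist. It creates the directories under a temporary folder, an assumption made here so that the example leaves the working directory untouched.

```python
import os
import tempfile

# Temporary root so the example does not touch the working directory
root = tempfile.mkdtemp()

# Directory names mirror the ones initialized above
dir_names = ["inference_output", "log/hls", "met_pickl", "tensorboard"]

for name in dir_names:
    path = os.path.join(root, name)
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(path, exist_ok=True)

# Calling it a second time is safe and raises no error
os.makedirs(os.path.join(root, "log/hls"), exist_ok=True)
created = sorted(os.listdir(root))
print(created)
```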

### **Initializing the parameter values**

In [10]:
# Which kind of checkpoint to restore from. Possible options are:
#   mru: most recently saved checkpoint. Use this to continue a run.
#   f_d, f_d_U: use these to load the best checkpoint from these runs.
checkpoint_load_mode = 'mru'
# d_pickle = data_dir+"d_processed.p"
d_pickle = data_dir+"sms_pickle_L.pkl"
dropout_keep_prob =  0.8
early_stopping_p = 20 # early stopping patience (in epochs)
f_d_adam_lr =  0.0003 # default = 0.01
f_d_batch_size = 16
f_d_class_sampling = [10,10] # Comma-separated list of number of times each d instance should be sampled depending on its class for training f on d. Size of list must equal number of classes.
f_d_epochs = 4 # default = 2
f_d_metrics_pickle = metric_pickle_dir+"metrics_train_f_on_d.p"
f_d_primary_metric = 'accuracy' #'f1_score_1' # Metric for best checkpoint computation. The best metrics pickle will also be stored on this basis. Valid values are: accuracy: overall accuracy. f1_score_1: f1_score of class 1. avg_f1_score: average of all classes f1_score 
f_d_U_adam_lr =  0.0003 # default = 0.01
f_d_U_batch_size = 32
f_d_U_epochs = 4 # default = 2  
f_d_U_metrics_pickle = metric_pickle_dir+"metrics_train_f_on_d_U.p"
f_infer_out_pickle = inference_output_dir+"infer_f.p" # output file name for any inference that was run on f (classification) network
gamma = 0.1 # weighting factor for loss on U used in implication, pr_loss, snorkel, generalized cross entropy etc. 
lamda = 0.1
min_rule_coverage = 0 # Minimum coverage of a rule in U in order to include it in co-training. Rules which have coverage less than this are assigned a constant weight of 1.0.
mode = "implication" # "learn2reweight" / "implication" / "pr_loss" / "f_d" 
test_mode = "" # "" / "test_f" / "test_w" / "test_all"
num_classes = 2 # can be 0. Number of classes. If 0, this will be dynamically determined using max of labels in 'd'.
num_load_d = None # can be 0. Number of instances to load from d. If 0 load all.
num_load_U = None # can be 0. Number of instances to load from U. If 0 load all.
num_load_validation = None # can be 0. Number of instances to load from validation. If 0 load all.
q = "1"
rule_classes = None # Comma-separated list of the classes predicted by each rule. If empty, rule classes are determined from data associated with rule firings.
shuffle_batches = True # Shuffle batches. Set to False for debugging and stepping through batch by batch.
test_w_batch_size = 1000
# U_pickle = data_dir+"U_processed.p"
U_pickle = data_dir+"sms_pickle_U.pkl"
use_joint_f_w = False # whether to utilize w network during inference
# validation_pickle = data_dir+"validation_processed.p"
validation_pickle = data_dir+"sms_pickle_V.pkl"
w_infer_out_pickle = inference_output_dir+"infer_w.p" # output file name for any inference that was run on w (rule) network
json_file = data_dir+"sms_json.json"
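
The comment on `f_d_primary_metric` names three metrics: overall accuracy, the F1 score of class 1, and the average F1 score over all classes. The sketch below computes them in plain Python from made-up labels, only to show what the names mean. It is not the code SPEAR uses internally, and `f1_for_class` is a helper introduced here.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # avoids division by zero when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up gold labels and predictions (1 = SPAM, 0 = HAM)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_score_1 = f1_for_class(y_true, y_pred, 1)
avg_f1_score = (f1_for_class(y_true, y_pred, 0) + f1_score_1) / 2
print(accuracy, f1_score_1, avg_f1_score)
```

On an imbalanced dataset, accuracy can look high even when the minority class is mostly missed, which is one reason to consider `f1_score_1` or `avg_f1_score` as the primary metric.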


In [11]:
import shutil
output_dir = "./" + str(mode) + "_" + str(gamma) + "_" + str(lamda) + "_" + str(q)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

if test_mode=="":
    if os.path.exists(checkpoint_dir):
        shutil.rmtree(checkpoint_dir, ignore_errors=True)    
    os.makedirs(checkpoint_dir)

# number of input dir - 1 (data_dir)
# number of output dir - 6 (checkpoint, inference_output, log_dir, metric_pickle, output, tensorboard)


### **Creating a Data Feeder Object to process data**

In [12]:
if(str(test_mode)==""):
    output_text_file=log_dir + "/" + str(mode) + "_" + str(gamma) + "_" + str(lamda) + "_" + str(q)+".txt"
else:    
    output_text_file=log_dir + "/" + str(test_mode) + "_" + str(mode) + "_" + str(gamma) + "_" + str(lamda) + "_" + str(q)+".txt"
sys.stdout = open(output_text_file,"w")
if(test_mode!=""):
    mode = test_mode
if mode not in ['learn2reweight', 'implication', 'f_d', 'pr_loss', 'gcross',  'label_snorkel', 'pure_snorkel', 'gcross_snorkel', 'test_f', 'test_w', 'test_all']:
    raise ValueError('Invalid run mode ' + mode)

data_feeder = DataFeeder(d_pickle, 
                         U_pickle, 
                         validation_pickle,
                         json_file,
                         shuffle_batches, 
                         num_load_d, 
                         num_load_U, 
                         num_classes, 
                         f_d_class_sampling, 
                         min_rule_coverage, 
                         rule_classes, 
                         num_load_validation, 
                         f_d_batch_size, 
                         f_d_U_batch_size, 
                         test_w_batch_size,
                         out_dir=output_dir)


Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:./checkpoint/f_d_U/best.ckpt-124 is not in all_model_checkpoint_paths. Manually adding it.


In [13]:
num_features, num_classes, num_rules, num_rules_to_train = data_feeder.get_features_classes_rules()

In [14]:
print("Number of features: ", num_features)
print("Number of classes: ",num_classes)
print("Number of rules to train: ", num_rules_to_train)
print("Number of rules: ", num_rules)
print("\n\n")

In [15]:
rule_classes = data_feeder.rule_classes

### **Initializing the rule network and classification network of the algorithm**

In [16]:
w_network = networks.w_network_fully_connected #rule network - CHANGE config in w_network_fully_connected of my_networks - DONE
f_network = networks.f_network_fully_connected #classification network - CHANGE config in f_network_fully_connected of my_networks - DONE
    

### **Creating a High Level Supervision Network Object to be trained and tested**

In [17]:
tf.reset_default_graph()
hls = HighLevelSupervisionNetwork(
            num_features,
            num_classes,
            num_rules,
            num_rules_to_train,
            rule_classes,
            w_network,
            f_network,
            f_d_epochs, 
            f_d_U_epochs, 
            f_d_adam_lr, 
            f_d_U_adam_lr, 
            dropout_keep_prob, 
            f_d_metrics_pickle, 
            f_d_U_metrics_pickle, 
            early_stopping_p, 
            f_d_primary_metric, 
            mode, 
            data_dir, 
            tensorboard_dir, 
            checkpoint_dir, 
            checkpoint_load_mode, 
            gamma, 
            lamda,
            raw_d_x=data_feeder.raw_d.x, #instances from the "d" set
            raw_d_L=data_feeder.raw_d.L) #labels from the "d" set

float_formatter = lambda x: "%.3f" % x # Output 3 digits after decimal point in numpy arrays
np.set_printoptions(formatter={'float_kind':float_formatter})



In [18]:
print('Run mode is ' + mode)

### **Train and Test on the hls object**

In [None]:
if mode == 'f_d':
    print('training f on d')
    hls.train.train_f_on_d(data_feeder, f_d_epochs)
elif mode[:4]!="test":
    print(mode+" training started")
    hls.train.train_f_on_d_U(data_feeder, f_d_U_epochs, loss_type=mode)
    print(mode+" training ended")
elif mode == 'test_f':
    print('Running test_f')
    hls.test.test_f(data_feeder, log_output=True, 
                    save_filename=f_infer_out_pickle, 
                    use_joint_f_w=use_joint_f_w)
elif mode == 'test_w': # use only if train_mode = implication or train_mode = pr_loss
    print('Running test_w')
    hls.test.test_w(data_feeder, log_output=True, save_filename=w_infer_out_pickle+"_test")
elif mode == 'test_all': # use only if train_mode = implication or train_mode = pr_loss
    print('Running all tests')
    print('\ninference on f network ...\n')
    hls.test.test_f(data_feeder, log_output=True, 
                    save_filename=f_infer_out_pickle,
                    use_joint_f_w=use_joint_f_w)
    print('\ninference on w network...')
    print('we only test on instances covered by at least one rule\n')
    hls.test.test_w(data_feeder, log_output=True, save_filename=w_infer_out_pickle+"_test")
else:
    raise ValueError("Invalid mode string: %s" % mode)

sys.stdout.close()
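
The cells above assign a file object to `sys.stdout` and close it at the end, which leaves `sys.stdout` pointing at a closed file for any later `print` call. One alternative, sketched below with the standard library's `contextlib.redirect_stdout`, restores the previous stream automatically when the block exits. The log file name and the printed message are placeholders, not values from this tutorial.

```python
import contextlib
import os
import sys
import tempfile

# Placeholder log path inside a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), "run_log.txt")

stdout_before = sys.stdout
with open(log_path, "w") as log_file:
    with contextlib.redirect_stdout(log_file):
        # Everything printed inside this block goes to the log file
        print("implication training started")

# After the block, sys.stdout is the original stream again
restored = sys.stdout is stdout_before

with open(log_path) as log_file:
    logged = log_file.read()
```

With this pattern, the training and testing calls would sit inside the `with` block, and later cells could print to the notebook as usual.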