## Description
#### This is one of the assignments from the course CS 431/631 (Data-Intensive Distributed Analytics) at the University of Waterloo.
#### This assignment uses a big data framework (Spark) for a machine learning task: spam detection.
#### Some modifications have been made to improve the presentation on this platform.
---

In this notebook, use the Spark installation in the CS451 course account:

In [1]:
import findspark, random
findspark.init("/u/cs451/packages/spark")

from pyspark import SparkContext, SparkConf
sc = SparkContext(appName="YourTest", master="local[2]", conf=SparkConf().set('spark.ui.port', random.randrange(4000,5000)))

---
#### Overview
**Goal:** Use Python and Spark to perform spam detection. The first step is to build spam prediction models from training data sets using stochastic gradient descent (SGD). The second step is to use these models to predict whether the documents in a test data set are spam.
The stochastic gradient descent technique used here is based on [a paper](http://arxiv.org/abs/1004.5168) by Cormack, Smucker and Clarke.  
**Files needed:** `spamminess.py`, plus the training and test datasets in `/u/cs451/public_html/spam/`

#### Training a Spam Classification Model
To build a spam classification model, start with a training data set.   Each instance in the training set represents a single document, and is labeled to indicate whether that document should be considered to be spam or ham.
An instance looks like this:
```
clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171 ...
```
The first field, `clueweb09-en0094-20-13546`, is the (unique) document name.   The second field is the label, indicating whether the document should be considered spam (as in this example) or ham.   The remaining fields are integers representing *features* present in the document.   In this case, the features are hashed byte 4-grams, represented as integers.   Each training data set is stored as a text file, with one instance per line.   The training files have been preloaded into the directory `/u/cs451/public_html/spam`.   They are:
* `spam.train.group_x.txt`   (25 MB)
* `spam.train.group_y.txt`   (20 MB)
* `spam.train.britney.txt`   (766 MB)
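
As a minimal sketch of the instance format described above, the following plain-Python helper splits one line into its three parts. The name `parse_instance` is illustrative and is not part of the course code, and the example line is a shortened copy of the one shown earlier (the real line carries more features).

```python
def parse_instance(line):
    """Split one training/test line into (doc_id, label, features)."""
    fields = line.split()
    doc_id, label = fields[0], fields[1]
    features = [int(tok) for tok in fields[2:]]  # hashed byte 4-grams as integers
    return doc_id, label, features

# Shortened version of the example instance (only the first five features).
example = "clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171"
doc_id, label, features = parse_instance(example)
print(doc_id, label, features)
```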


---

#### Part 1:

The first task is to write a sequential SGD model trainer in Python (no Spark).   For our purposes, a model associates a *weight* with each feature.   The model trainer decides what these weights should be, based on the training instances.  Since the model trainer is based on SGD, the trainer should behave like this:
```
for each training instance T
   predict whether T is spam or ham using the weights of the current model
   update the model weights by comparing T's predicted label with its actual label
```
Of course, the important part is how to update the model.

In [the paper](http://arxiv.org/abs/1004.5168), the model is used to assign a "spamminess" score to a document.   Documents with positive spamminess are predicted to be spam.   Those with negative spamminess are predicted to be ham.  The spamminess of a document $D$ is simply the sum of the weights (from the model) of each of the document's features:
\begin{equation}
spamminess(D) = \sum_{f \in D}{w(f)}
\end{equation}
where $w(f)$ is the weight associated with feature $f$.

The Python module `spamminess.py` defines a function `spamminess(F,W)` which computes this quantity.   This function takes two arguments, `F` and `W`.  `F` is a list of features (integers) associated with the document whose spamminess we want to compute, and `W` is a dictionary representing the current model.  `W` maps features ($f$) to their weights ($w(f)$) under the model.
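
The source of `spamminess.py` is not reproduced in this write-up, so the following is only a sketch of what such a function might look like. The name `spamminess_sketch` is hypothetical, and treating a feature that is missing from the model as having weight 0 is an assumption (it is what lets training start from an empty dictionary).

```python
def spamminess_sketch(F, W):
    """Sum the model weights of the features in F.

    Assumption: a feature that is not in W contributes a weight of 0.
    """
    return sum(W.get(f, 0.0) for f in F)

# Toy model with two known features; 426572 is unseen and contributes nothing.
W = {387908: 0.5, 697162: -0.25}
print(spamminess_sketch([387908, 697162, 426572], W))
```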

The cell below presents the partial pseudo-code that shows how to implement the SGD model trainer defined by Cormack, Smucker, and Clarke.   It reads the training instances one at a time from one of the training files, and uses them to adjust the model weights.   The task here is to turn this pseudo-code into actual runnable Python code that can be used to learn a model from any one of the training files. Implement the function `sequential_SGD()` that takes as input a model (`w`), the training dataset and a value for the update parameter `delta`, and returns the trained model.

In [2]:
from spamminess import spamminess
from math import exp

def sequential_SGD(model, training_dataset='/u/cs451/public_html/spam/spam.train.group_x.txt', delta = 0.002):
    # open one of the training files - defaults to group_x
    with open(training_dataset) as f:
        for line in f:
            # each line is one document: <name> <spam|ham> <feature> <feature> ...
            doc = line.split()
            t = doc[1]
            F = [int(i) for i in doc[2:]]
            # score the document with the current weights, then map the score into (0, 1)
            score = spamminess(F,model)
            prob = 1.0/(1+exp(-score))
            # features that are not in the model yet start from a weight of 0
            for f in F:
                if t == 'spam':
                    model[f] = model.get(f, 0.0) + (1.0-prob)*delta
                elif t == 'ham':
                    model[f] = model.get(f, 0.0) - prob*delta

    #   for line in f:
    #      each line represents a document
    #      read and parse the line
    #      Let:
    #        t represent the spam/ham tag for this document
    #        F represent the list of features for this document

    #      find the spamminess of the current document using the current model:
    #      score = spamminess(F,w)

    #      then, update the model:
    #      prob = 1.0/(1+exp(-score))
    #      for each feature f in F:
    #          if t == 'spam':
    #              increase model(f) by (1.0-prob)*delta (or set model(f) to (1.0-prob)*delta if f is not in the dict yet)
    #          elif t == 'ham':
    #              decrease model(f) by prob*delta (or set model(f) to -prob*delta if f is not in the dict yet)
    return model

In [3]:
# tests here
w = sequential_SGD({}) # Providing an empty model
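
The update rule can also be tried on toy data without the course files. The sketch below is not the trainer itself: `sgd_step` is an illustrative name, the spamminess sum is inlined (assuming unseen features weigh 0), and any label other than `'spam'` is treated as ham, which is a simplification of the cell above.

```python
from math import exp

def sgd_step(model, label, features, delta=0.002):
    """Apply one SGD update in place for a single labelled document."""
    score = sum(model.get(f, 0.0) for f in features)  # spamminess under current weights
    prob = 1.0 / (1.0 + exp(-score))                  # predicted probability of spam
    target = 1.0 if label == 'spam' else 0.0          # assumption: non-spam means ham
    for f in features:
        # spam: weight += (1 - prob) * delta; ham: weight -= prob * delta
        model[f] = model.get(f, 0.0) + (target - prob) * delta
    return model

model = {}
sgd_step(model, 'spam', [1, 2])  # empty model: score 0, prob 0.5, so each weight rises by 0.5 * delta
sgd_step(model, 'ham', [2, 3])   # feature 2 is pulled back down, feature 3 becomes negative
print(model)
```

Writing both cases as `(target - prob) * delta` makes it visible that the size of each update shrinks as the prediction approaches the true label.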

#### Part 2:

Next, implement a Spark version of the SGD model trainer.   Before starting, create a new folder called `models` inside the cs631 folder.   The Spark implementation should read a training file, train the model, and then output the model to the models folder.  The model output file should list the weight associated with each feature, with one feature per line, like this:
```
(802123, 0.0009858585991850937)
(438450, 4.267897922108138e-05)
(271525, 0.0013133437007968654)
(92853, 0.0004300009932503611)
```
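
Each line of the model file is the text form of a Python tuple. As a hedged sketch, such a line can be read back with `ast.literal_eval`, which accepts only literals and is therefore a safer choice than the `eval` calls used in later cells of this notebook. The helper name `parse_model_line` is illustrative.

```python
from ast import literal_eval

def parse_model_line(line):
    """Turn a line such as '(802123, 0.0009858585991850937)' into (feature, weight)."""
    feature, weight = literal_eval(line.strip())
    return int(feature), float(weight)

# Two of the sample lines shown above.
lines = ["(802123, 0.0009858585991850937)", "(438450, 4.267897922108138e-05)"]
weights = dict(parse_model_line(l) for l in lines)
print(weights[802123])
```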


Implement the function `spark_SGD()` below that takes as input the path to the training dataset, an output path `output_model` and a value for the update parameter `delta`, and writes the trained model to `output_model` using Spark's `saveAsTextFile`.


In [4]:
from spamminess import spamminess
from math import exp
import shutil, os

def spark_SGD(training_dataset='/u/cs451/public_html/spam/spam.train.group_x.txt', output_model='models/group_x_model', delta = 0.002):
    if os.path.isdir(output_model):
        shutil.rmtree(output_model) # Remove the previous model to create a new one
    training_data = sc.textFile(training_dataset) 
    # split the data and force all of the training instances into a single partition
    training_data_new = training_data.map(lambda x:x.split()).repartition(1)    
    # define a function that updates the model weights
    def update_weight(x):
        model = {}
        for line in x:
            t = line[1]
            F = []
            for i in line[2:]:
                F.append(int(i))
            score = spamminess(F,model)
            prob = 1.0/(1+exp(-score))
            for f in F:
                if t == 'spam':
                    if f not in model:
                        model[f] = (1.0-prob)*delta
                    else:
                        model[f] += (1.0-prob)*delta
                elif t == 'ham':
                    if f not in model:
                        model[f] = -prob*delta
                    else:
                        model[f] -= prob*delta
        return model.items()
    # use mapPartitions to call the above function and get the desired output
    final_model = training_data_new.mapPartitions(update_weight).map(lambda x:(x[0],x[1]))
    final_model.saveAsTextFile(output_model)

In [5]:
# tests here
spark_SGD()
spark_SGD(training_dataset='/u/cs451/public_html/spam/spam.train.group_y.txt', output_model='models/group_y_model')
spark_SGD(training_dataset='/u/cs451/public_html/spam/spam.train.britney.txt', output_model='models/britney_model')

#### Part 3:

Re-implement the trainer from part 2 so that it will randomly reorder the training instances before using them to update the model. One way to shuffle the training instances is to assign a random sort key to each training instance as we read it from the input file, and then sort the instances using the random sort key.
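
The random-sort-key idea can be sketched in plain Python before moving it to Spark. The function name, the optional `seed` parameter and the toy instances below are all made up for illustration.

```python
import random

def shuffle_by_random_key(instances, seed=None):
    """Return the instances in random order by sorting on a random key."""
    rng = random.Random(seed)                      # optional seed, for repeatable runs
    keyed = [(rng.random(), inst) for inst in instances]
    keyed.sort(key=lambda pair: pair[0])           # order by the random key only
    return [inst for _, inst in keyed]

docs = ["doc-a spam 1 2", "doc-b ham 2 3", "doc-c spam 4", "doc-d ham 5"]
shuffled = shuffle_by_random_key(docs, seed=0)
print(shuffled)
```

The same three steps (attach a key, sort by it, drop it) map onto `map`, `sortByKey` and `map` on an RDD.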

Implement the function `spark_shuffled_SGD` below that takes as input the path to the training dataset, an output path `output_model` and a value for the update parameter `delta`, shuffles the training instances using the method described above and writes the trained model to `output_model` using Spark's `saveAsTextFile`.


In [6]:
from spamminess import spamminess
from math import exp
import shutil, os, random

def spark_shuffled_SGD(training_dataset='/u/cs451/public_html/spam/spam.train.group_x.txt', output_model='models/group_x_model', delta = 0.002):
    if os.path.isdir(output_model):
        shutil.rmtree(output_model) # Remove the previous model to create a new one
    training_data = sc.textFile(training_dataset)
    # split the data and force all of the training instances into a single partition
    # also, assign a random key to each training instance and shuffle the training instances
    training_data_new = training_data.map(lambda x:x.split()).repartition(1)\
                        .map(lambda x:(random.random(),x)).sortByKey(True).map(lambda x:x[1])
    # define a function that updates the model weights
    def update_weight(x):
        model = {}
        for line in x:
            t = line[1]
            F = []
            for i in line[2:]:
                F.append(int(i))
            score = spamminess(F,model)
            prob = 1.0/(1+exp(-score))
            for f in F:
                if t == 'spam':
                    if f not in model:
                        model[f] = (1.0-prob)*delta
                    else:
                        model[f] += (1.0-prob)*delta
                elif t == 'ham':
                    if f not in model:
                        model[f] = -prob*delta
                    else:
                        model[f] -= prob*delta
        return model.items()
    # use mapPartitions to call the above function and get the desired output
    final_model = training_data_new.mapPartitions(update_weight).map(lambda x:(x[0],x[1]))
    final_model.saveAsTextFile(output_model)

### Additional part - Comparison between the models in Part 2 and Part 3 ###
Based on the output of the following cell:  
a) Most of the 5 features with the highest weights differ between the original model and the shuffled model, although the weight values are similar.  
b) The 1-ROCA% scores of the two models are similar (17.26 vs. 17.94).  
To sum up, shuffling the training instances does change the trained model, but the overall accuracy changes little.

In [9]:
# tests here (!!! run this cell as the last step !!!)
# the original model in part2
spark_SGD(output_model='models/group_x_model')
top5_weights = sc.textFile('models/group_x_model').map(lambda x:eval(x))\
               .map(lambda x:(x[1],x[0])).sortByKey(False).map(lambda x:(x[1],x[0])).take(5)
print("The 5 features with the highest weights in the original model are")
print(top5_weights)
spark_classify(input_model='models/group_x_model')
print("The ROCA of the original model is")
!/u/cs451/bin/spam_eval.sh results/test_qrels

# the shuffled model in part3
spark_shuffled_SGD(output_model='models/group_x_model_shuffled')
top5_weights_shuffled = sc.textFile('models/group_x_model_shuffled').map(lambda x:eval(x))\
                        .map(lambda x:(x[1],x[0])).sortByKey(False).map(lambda x:(x[1],x[0])).take(5)
print("The 5 features with the highest weights in the shuffled model are")
print(top5_weights_shuffled)
spark_classify(input_model='models/group_x_model_shuffled')
print("The ROCA of the shuffled model is")
!/u/cs451/bin/spam_eval.sh results/test_qrels

The 5 features with the highest weights in the original model are
[(288281, 0.022996007768337472), (316070, 0.02178760768974702), (305568, 0.02166991395934579), (737304, 0.02155627086676939), (102264, 0.021381723804112546)]
The ROCA of the original model is
1-ROCA%: 17.26
The 5 features with the highest weights in the shuffled model are
[(278969, 0.024026215713269107), (275149, 0.023040579273347354), (943477, 0.02284840305297022), (944342, 0.022367091269665434), (288281, 0.0223646197909275)]
The ROCA of the shuffled model is
1-ROCA%: 17.94
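
Observation (a) can be checked directly from the printed output. The snippet below copies the feature IDs from the two top-5 lists above and counts how many appear in both; it is only a convenience check, not part of the assignment.

```python
# Feature IDs copied from the printed top-5 lists above.
original_top5 = [288281, 316070, 305568, 737304, 102264]
shuffled_top5 = [278969, 275149, 943477, 944342, 288281]

shared = set(original_top5) & set(shuffled_top5)
print(sorted(shared), len(shared))
```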


#### Part 4:

Last but not least, write a Spark program that can be used to classify documents as spam or ham, using the classification models we produced.

The test data, i.e., the document instances that should be classified, are located in the file
* `/u/cs451/public_html/spam/spam.test.qrels.txt`

Each line in this file represents a document that needs to be classified as spam or ham.  The format of this file is identical to the format of the files that hold the training instances.

Implement the function `spark_classify` below that will load a model (from a specified folder under `models`), classify all of the instances in a given test data file (`spam.test.qrels.txt` by default) using that model, and then output the results in the folder `results_path` using Spark's `saveAsTextFile`.   The contents of the output file should look like this:
```
(clueweb09-en0000-00-00142,spam,2.601624279252943,spam)
(clueweb09-en0000-00-01005,ham,2.5654162439491004,spam)
(clueweb09-en0000-00-01382,ham,2.5893946346394188,spam)
```
Each line of the output represents one test instance.   The first two fields are the document ID and the test label.  These are just copied from the test data.   The third field is the spamminess score of the document, produced by the spamminess function using the model we are classifying with.   The fourth field is the spam/ham prediction made by the model.

Unlike model training, classification is easily parallelizable, since each document is classified independently. 
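
Because each document is scored on its own, the per-document work can be written as a pure function and then applied to every line independently. The sketch below uses plain Python with a toy model; `classify_line`, the document names and the weights are made up, and sending a score of exactly 0 to `spam` is a choice rather than something the assignment specifies.

```python
def classify_line(line, weights):
    """Return (doc_id, label, score, prediction) for one test line."""
    fields = line.split()
    doc_id, label = fields[0], fields[1]
    # spamminess: sum of the weights of the document's features (unseen features count as 0)
    score = sum(weights.get(int(tok), 0.0) for tok in fields[2:])
    prediction = 'spam' if score >= 0 else 'ham'  # a score of exactly 0 is labelled spam here
    return doc_id, label, score, prediction

weights = {1: 0.5, 2: -0.75}  # toy model, not trained weights
print(classify_line("doc-a spam 1 1", weights))
print(classify_line("doc-b ham 1 2", weights))
```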

In [7]:
from spamminess import spamminess
import shutil, os

def spark_classify(input_model='models/group_x_model', test_dataset='/u/cs451/public_html/spam/spam.test.qrels.txt', results_path='results/test_qrels'):
    if os.path.isdir(results_path):
        shutil.rmtree(results_path) # Remove the previous results
    test_data = sc.textFile(test_dataset)
    # load the training model into the driver program as side data
    model_weights = sc.textFile(input_model).map(lambda x:eval(x)).collect()
    # transform the model into the dictionary to match the parameter format in spamminess calculation
    model_dict = dict(model_weights)
    # split and rearrange the test data
    test_data_new = test_data.map(lambda x:x.split()).map(lambda x:(x[0],x[1],[int(x[i]) for i in range(2,len(x))]))
    # calculate the spamminess score of each instance in test data
    test_score = test_data_new.map(lambda x:(x[0],x[1],spamminess(x[2],model_dict)))
    # predict whether it is 'spam' or 'ham' based on the spamminess score, and rearrange the data to a desired output
    test_prediction = test_score.map(lambda x:(x[0],x[1],x[2],'spam' if x[2] >= 0 else 'ham'))  
    test_prediction.saveAsTextFile(results_path)

We have installed a program in the CS451 account that can be used to evaluate our classification results.  Given the output file, in the proper format, it computes the area under the receiver operating characteristic (ROC) curve and reports it as `1-ROCA%`, that is, the percentage of the area that is not under the curve.   This is a common way to characterize classifier error.    The lower this score, the better.   The evaluation program should produce one line of output, like this:
```
1-ROCA%: 17.25
```
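
To give a feel for what this number measures, here is a small sketch that computes a 1-ROCA% style score from (label, score) pairs. It is not the course's `spam_eval.sh`, and whether that script computes the area in exactly this way is an assumption. The sketch uses the pairwise definition of ROC area: the fraction of (spam, ham) pairs in which the spam document has the higher score, with ties counted as half.

```python
def one_minus_roca_percent(results):
    """results: list of (label, score) pairs. Returns 100 * (1 - ROC area)."""
    spam = [s for lab, s in results if lab == 'spam']
    ham = [s for lab, s in results if lab == 'ham']
    if not spam or not ham:
        raise ValueError("need at least one spam and one ham instance")
    # Count the pairs ranked correctly; this O(n*m) loop only suits small inputs.
    wins = 0.0
    for s in spam:
        for h in ham:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    auc = wins / (len(spam) * len(ham))
    return 100.0 * (1.0 - auc)

# Toy scores: three of the four (spam, ham) pairs are ranked correctly.
toy = [('spam', 2.6), ('ham', 2.5), ('spam', 1.0), ('ham', -0.3)]
print(one_minus_roca_percent(toy))
```

A score of 0 would mean every spam document outscores every ham document, which is why lower is better.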

In [8]:
# tests here
#  Run the evaluation program like this.   Its argument is the folder that holds
#  the classifier's output (results/test_qrels is the default results_path).
#  Note the "!" character, which is important.   This is the escape character
#  that tells the notebook to run an external program.
spark_classify()
!/u/cs451/bin/spam_eval.sh results/test_qrels

1-ROCA%: 17.26
