# Spam Detection

## Outline:
- Data Preprocessing
- Modeling
    - Naive Bayes
    - Naive Bayes + ngram
    - Logistic Regression
    - Random Forest
- Best Model
    - Naive Bayes Classifier
        - Assumptions
    - References for Model Introduction and Algorithms
    - More Model Introductions
- Next Step

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, StringIndexer, NGram
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes, LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [2]:
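# Reuse the active session (e.g. from the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()
# Tab-separated file without a header; the raw string keeps the Windows backslashes literal
raw = spark.read.option("delimiter", "\t").csv(
    r'..\project-spam-detection-and-pyspark-streaming\data\SMSSpamCollection').toDF('spam', 'message')
raw.show(5)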

+----+--------------------+
|spam|             message|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
+----+--------------------+
only showing top 5 rows



In [3]:
# Split into train and test sets
trainingData, testData = raw.randomSplit([0.7, 0.3])

## Data Preprocessing

In [4]:
# Tokenize each message into words
tokenizer = Tokenizer().setInputCol('message').setOutputCol('words')

# Custom stopwords
stopwords = StopWordsRemover().getStopWords() + ['-']

# Remove stopwords
remover = StopWordsRemover().setStopWords(stopwords).setInputCol('words').setOutputCol('filtered')

# Set 2-gram
bigram = NGram().setN(2).setInputCol('filtered').setOutputCol('bigrams')

# Generate features
cvmodel = CountVectorizer().setInputCol('filtered').setOutputCol('features')
cvmodel_ngram = CountVectorizer().setInputCol('bigrams').setOutputCol('features')

# Convert to binary label
indexer = StringIndexer().setInputCol('spam').setOutputCol('label')

In [5]:
# Fit the preprocessing stages on the full dataset to inspect the intermediate columns
pipeline_preprocess = Pipeline(stages = [tokenizer, remover, bigram, cvmodel, indexer])
preprocessed = pipeline_preprocess.fit(raw)
preprocessed.transform(raw).show(5)

+----+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
|spam|             message|               words|            filtered|             bigrams|            features|label|
+----+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
| ham|Go until jurong p...|[go, until, juron...|[go, jurong, poin...|[go jurong, juron...|(13459,[8,12,33,6...|  0.0|
| ham|Ok lar... Joking ...|[ok, lar..., joki...|[ok, lar..., joki...|[ok lar..., lar.....|(13459,[0,26,307,...|  0.0|
|spam|Free entry in 2 a...|[free, entry, in,...|[free, entry, 2, ...|[free entry, entr...|(13459,[2,14,20,3...|  1.0|
| ham|U dun say so earl...|[u, dun, say, so,...|[u, dun, say, ear...|[u dun, dun say, ...|(13459,[0,71,83,1...|  0.0|
| ham|Nah I don't think...|[nah, i, don't, t...|[nah, don't, thin...|[nah don't, don't...|(13459,[36,39,141...|  0.0|
+----+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
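
To make the role of each preprocessing stage concrete, here is a minimal pure-Python sketch of the same steps applied to a single message. It is only an illustration under stated assumptions: the message, the three-entry `STOP_WORDS` set, and the alphabetical vocabulary order are hypothetical, and the helper functions are not part of PySpark.

```python
from collections import Counter

# Hypothetical, tiny stop word list for illustration only;
# StopWordsRemover ships a much longer default English list.
STOP_WORDS = {"a", "to", "-"}

def tokenize(message):
    """Lowercase the message and split it on whitespace."""
    return message.lower().split()

def remove_stop_words(words):
    """Drop every token that appears in the stop word list."""
    return [w for w in words if w not in STOP_WORDS]

def bigrams(words):
    """Join each pair of consecutive tokens with a space."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

def count_vector(tokens, vocabulary):
    """Return term counts ordered by a fixed vocabulary list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical message, not taken from the dataset
message = "Win a FREE prize - reply to claim free entry"
words = tokenize(message)
filtered = remove_stop_words(words)
pairs = bigrams(filtered)

# Alphabetical order is an arbitrary choice made for this sketch
vocabulary = sorted(set(filtered))
features = count_vector(filtered, vocabulary)

print(filtered)
print(pairs)
print(dict(zip(vocabulary, features)))
```

The `features` list plays the role of the sparse vectors in the `features` column above: one count per vocabulary term, most of which are zero once the vocabulary covers the whole corpus.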

## Modeling

Three models are compared: Naive Bayes, Logistic Regression, and Random Forest. Naive Bayes is run with and without n-gram (bigram) features; the other two models use only the pipeline without n-grams.

### Naive Bayes

In [6]:
nb = NaiveBayes(smoothing=1)
pipeline = Pipeline(stages = [tokenizer, remover, cvmodel, indexer, nb])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)
predictions.select('message', 'label', 'rawPrediction', 'probability', 'prediction').show(5)

# Note: AUC is computed here from the hard 0/1 'prediction' column, not from 'rawPrediction'
evaluator = BinaryClassificationEvaluator().setLabelCol('label').setRawPredictionCol('prediction').setMetricName('areaUnderROC')
AUC = evaluator.evaluate(predictions)
print(AUC)

+--------------------+-----+--------------------+--------------------+----------+
|             message|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
| &lt;DECIMAL&gt; ...|  0.0|[-96.541405187808...|[0.99999999907516...|       0.0|
| gonna let me kno...|  0.0|[-91.510438513462...|[0.99999999999997...|       0.0|
|"Keep ur problems...|  0.0|[-212.69897555180...|[0.99999987280161...|       0.0|
|"Speak only when ...|  0.0|[-29.501650783090...|[0.99999474020436...|       0.0|
|"The world suffer...|  0.0|[-67.080948350198...|[0.99999475556872...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 5 rows

0.953538905240701


Naive Bayes reaches an AUC of about 0.95 on the test set, which is fairly good prediction performance.

### Naive Bayes + ngram

In [7]:
nb = NaiveBayes(smoothing=1)
pipeline = Pipeline(stages = [tokenizer, remover, bigram, cvmodel_ngram, indexer, nb])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)
#predictions.select('message', 'label', 'rawPrediction', 'probability', 'prediction').show(5)

evaluator = BinaryClassificationEvaluator().setLabelCol('label').setRawPredictionCol('prediction').setMetricName('areaUnderROC')
AUC = evaluator.evaluate(predictions)
print(AUC)

0.8886194680942128


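Surprisingly, switching the features from single words to bigrams does not improve prediction: the AUC drops from about 0.95 to about 0.89. Note that this pipeline uses bigram counts only, not words and bigrams together.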
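
One plausible explanation, which is not verified on this dataset, is that bigram features are sparser than word features: there are more distinct bigrams, and each one occurs in fewer messages, so the per-class frequency estimates rest on less evidence. The sketch below illustrates the effect on a small hypothetical corpus; the corpus and the helper names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus (already tokenized and stop-word filtered)
corpus = [
    ["free", "prize", "call", "now"],
    ["call", "now", "free", "entry"],
    ["win", "free", "prize", "now"],
    ["call", "me", "now"],
]

def bigrams(words):
    """Join each pair of consecutive tokens with a space."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def mean_df(df):
    """Average number of documents in which a term appears."""
    return sum(df.values()) / len(df)

unigram_df = document_frequency(corpus)
bigram_df = document_frequency([bigrams(doc) for doc in corpus])

print(len(unigram_df), round(mean_df(unigram_df), 2))
print(len(bigram_df), round(mean_df(bigram_df), 2))
```

In this toy corpus the bigram vocabulary is larger than the word vocabulary while the average bigram appears in fewer documents, which is the sparsity effect described above.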

### Logistic Regression

In [8]:
log_reg = LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
pipeline = Pipeline(stages = [tokenizer, remover, cvmodel, indexer, log_reg])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)

evaluator = BinaryClassificationEvaluator().setLabelCol('label').setRawPredictionCol('prediction').setMetricName('areaUnderROC')
AUC = evaluator.evaluate(predictions)
print(AUC)

0.5


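With an AUC of 0.5, Logistic Regression performs no better than random guessing. Because the evaluator is given the hard `prediction` column, an AUC of exactly 0.5 is consistent with the model assigning the same class to every test message. This may be an effect of the strong regularization (`regParam=0.3`, `elasticNetParam=0.8`) rather than a weakness of logistic regression itself.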
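
When AUC is computed from hard 0/1 predictions, the ROC curve has only three points, (0, 0), (FPR, TPR), and (1, 1), and the area under it reduces to (1 + TPR - FPR) / 2. The following sketch, with hypothetical labels and a helper function that is not part of PySpark, shows that a classifier predicting ham for every message lands at exactly 0.5.

```python
def auc_from_hard_predictions(labels, predictions):
    """Area under the three-point ROC curve (0,0) -> (FPR,TPR) -> (1,1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    tpr = tp / positives  # true positive rate
    fpr = fp / negatives  # false positive rate
    return (1 + tpr - fpr) / 2

# Hypothetical ground truth: 4 ham (0) and 2 spam (1)
labels = [0, 0, 0, 0, 1, 1]
all_ham = [0, 0, 0, 0, 0, 0]          # every message predicted as ham
one_spam_caught = [0, 0, 0, 0, 1, 0]  # one of the two spam messages found

print(auc_from_hard_predictions(labels, all_ham))          # 0.5
print(auc_from_hard_predictions(labels, one_spam_caught))  # 0.75
```

The same formula suggests that an AUC only slightly above 0.5 can arise when a model flags just a small fraction of the spam messages.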

### Random Forest

In [9]:
%%time
rf = RandomForestClassifier().setLabelCol('label').setFeaturesCol('features').setNumTrees(10)
pipeline = Pipeline(stages = [tokenizer, remover, cvmodel, indexer, rf])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)

evaluator = BinaryClassificationEvaluator().setLabelCol('label').setRawPredictionCol('prediction').setMetricName("areaUnderROC")
AUC = evaluator.evaluate(predictions)
print(AUC)

0.5093896713615024
Wall time: 2min 38s


Random Forest, with an AUC of about 0.51, is also barely better than random guessing, and the cell took 2 min 38 s to run.

## Best Model
Naive Bayes without n-grams in the pipeline is clearly the best of the models tried here, with an AUC of about 0.95 versus about 0.89 with bigrams and roughly 0.5 for the other two models. The nature of this model fits the problem well, because the data frame contains sparse word-count features. Logistic regression and random forest, at least with the settings used above, show no advantage on this type of question and dataset. A brief introduction to the Naive Bayes model follows.

### Naive Bayes Classifier

The Naive Bayes classifier makes the simplifying assumption that the features are conditionally independent given the label: the joint probability of $a_1$, $a_2$, ..., $a_d$ given label $y$ is the product of the probabilities of each feature ($a_1$ to $a_d$) given $y$.
$$ p(a_1, a_2, ..., a_d|y) = \prod_{j}p(a_j|y)$$
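
Why this assumption yields a simple classifier can be seen from Bayes' rule. The step below is a short derivation sketch using the same symbols; the denominator $p(a_1, a_2, ..., a_d)$ does not depend on $y$, so it can be ignored when comparing labels:

$$ p(y|a_1, a_2, ..., a_d) = \frac{p(y)\,p(a_1, a_2, ..., a_d|y)}{p(a_1, a_2, ..., a_d)} \propto p(y) \prod_{j}p(a_j|y)$$

Only the prior $p(y)$ and the per-feature probabilities $p(a_j|y)$ are therefore needed to rank the labels.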

#### Algorithm
**Learning:** Based on the frequency counts in the dataset:
1. Estimate all $p(y), \forall y \in \mathbb{Y}$
2. Estimate all $p(a_j|y), \forall y \in \mathbb{Y}, \forall a_j$

**Classification:** For a new sample, use:
$$y_{new} = \operatorname*{arg\,max}_{y \in \mathbb{Y}} p(y) \prod_{j}p(a_j|y)$$

***Note: There is no hyperplane or other fitted decision boundary per se; the classifier just counts the frequencies of labels and feature values within the training examples.***
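
The learning and classification steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the PySpark implementation: it assumes a word-count (multinomial) variant with add-one smoothing, mirroring `smoothing=1` in the notebook, works in log space to avoid underflow, and uses a hypothetical five-message training set.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, smoothing=1.0):
    """Estimate log p(y) and log p(word | y) from frequency counts."""
    vocabulary = sorted({w for doc in docs for w in doc})
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)

    log_prior = {y: math.log(n / len(labels)) for y, n in label_counts.items()}
    log_likelihood = {}
    for y in label_counts:
        # Smoothed denominator: total words in class y plus one pseudo-count per term
        total = sum(word_counts[y].values()) + smoothing * len(vocabulary)
        log_likelihood[y] = {
            w: math.log((word_counts[y][w] + smoothing) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def classify(doc, log_prior, log_likelihood):
    """Return the label maximizing log p(y) + sum_j log p(a_j | y)."""
    scores = {}
    for y in log_prior:
        # Words never seen during training are skipped
        scores[y] = log_prior[y] + sum(
            log_likelihood[y][w] for w in doc if w in log_likelihood[y]
        )
    return max(scores, key=scores.get)

# Hypothetical training data, not taken from the SMS dataset
docs = [
    ["free", "prize", "call"],
    ["win", "free", "entry"],
    ["see", "you", "later"],
    ["call", "you", "later"],
    ["ok", "see", "you"],
]
labels = ["spam", "spam", "ham", "ham", "ham"]

log_prior, log_likelihood = train(docs, labels)
print(classify(["free", "prize"], log_prior, log_likelihood))
print(classify(["see", "you", "later"], log_prior, log_likelihood))
```

Summing logarithms instead of multiplying probabilities gives the same arg max while keeping the scores numerically stable for long messages.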

### References for Model Introduction and Algorithms
- Post Graduate Diploma of Applied Machine Learning and Artificial Intelligence - Columbia Engineering Executive Education

### More Model Introductions
For a more detailed discussion of the Bayes Classifier, Logistic Regression, and Random Forest, please refer to the notebook **models-and-algothrms** in [project-parkinsons-disease-classification on my GitHub](https://github.com/byrontang/project-parkinsons-disease-classification).



## Next Step
With the best model identified, the next step is to build an application that uses the model to flag spam messages in streaming data. The application will connect to Flume to retrieve the streaming data and make predictions in near real time.