### Spark MLLib - Naive Bayes

**Description**

- Based on Bayes' theorem.
- The probability of an event A, written P(A), is between 0 and 1.
- The conditional probability of A given B is P(A | B) = P(A ∩ B) / P(B) = P(B | A) * P(A) / P(B).
- The target variable becomes the event A.
- The model stores the conditional probability of the target variable for each possible value of the predictor variables.
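As a small worked example of Bayes' theorem, the sketch below computes the probability that a message is spam given that it contains a particular word. All counts are invented for illustration and do not come from the SMS dataset; event A is "the message is spam" and event B is "the message contains the word".

```python
from fractions import Fraction

# Hypothetical counts, invented for illustration only.
total_messages = 100
spam_messages = 20        # messages labelled spam (event A)
messages_with_word = 15   # messages containing the word (event B)
spam_with_word = 12       # spam messages containing the word (A and B)

p_spam = Fraction(spam_messages, total_messages)             # P(A)
p_word = Fraction(messages_with_word, total_messages)        # P(B)
p_spam_and_word = Fraction(spam_with_word, total_messages)   # P(A ∩ B)
p_word_given_spam = Fraction(spam_with_word, spam_messages)  # P(B | A)

# Definition of conditional probability: P(A | B) = P(A ∩ B) / P(B)
posterior_from_joint = p_spam_and_word / p_word

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
posterior_from_bayes = p_word_given_spam * p_spam / p_word

print(posterior_from_joint, posterior_from_bayes)  # 4/5 4/5
```

Both routes give the same posterior, which is the point of the theorem: the model only needs P(B | A) and P(A) from the training data.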

**Pros:** Fast and simple, works well even with missing values, returns class probabilities along with its predictions, and handles categorical variables very well.

**Cons:** It works poorly with many numeric variables and assumes the predictor variables are independent of one another given the class.
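To make the independence assumption concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. The training messages and the helper names (`log_score`, `classify`) are hypothetical, and this is not how Spark implements it internally; it only illustrates that each word contributes its own factor P(word | class) to the score.

```python
import math
from collections import Counter

# Toy training data, invented for illustration.
training = [
    ("ham", "see you at lunch"),
    ("ham", "call me at home"),
    ("spam", "win a free prize"),
    ("spam", "free prize call now"),
]

# Count messages per class and word occurrences per class.
class_counts = Counter(label for label, _ in training)
word_counts = {label: Counter() for label in class_counts}
for label, text in training:
    word_counts[label].update(text.split())
vocabulary = {word for counts in word_counts.values() for word in counts}


def log_score(label, text, smoothing=1.0):
    """Unnormalized log P(label | text) under the independence assumption."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(training))  # log prior
    for word in text.split():
        # Laplace-smoothed P(word | label); words are treated as independent,
        # so their log-probabilities are simply added.
        likelihood = (word_counts[label][word] + smoothing) / (
            total_words + smoothing * len(vocabulary)
        )
        score += math.log(likelihood)
    return score


def classify(text):
    """Return the class with the highest score."""
    return max(class_counts, key=lambda label: log_score(label, text))


print(classify("free prize"))       # spam
print(classify("see you at home"))  # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.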

**Application:** Spam filter, medical diagnosis, document classification.

### Classifying SMS Spam

https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes, NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [2]:
spSession = SparkSession.builder.master('local').appName('smsSpam').getOrCreate()

In [3]:
rddSpam = spSession.sparkContext.textFile('aux/datasets/sms-spam-collection.csv')

**Caching the RDD lets later actions reuse it instead of re-reading the file.**

In [4]:
rddSpam.cache()

aux/datasets/sms-spam-collection.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [5]:
rddSpam.count()

1000

In [6]:
rddSpam.take(5)

['ham,Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...,,,,,,,,,',
 'ham,Ok lar... Joking wif u oni...,,,,,,,,,,',
 'ham,U dun say so early hor... U c already then say...,,,,,,,,,,',
 "ham,Nah I don't think he goes to usf, he lives around here though,,,,,,,,,",
 'ham,Even my brother is not like to speak with me. They treat me like aids patent.,,,,,,,,,,']

### Data Pre-Processing

In [7]:
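def vectorize(row):
    # Split the raw CSV line on every comma. Only the second field is kept,
    # so a message that contains commas is cut at its first comma.
    attrList = row.split(',')

    # Encode the label as a double: 0.0 for ham, 1.0 for anything else (spam).
    smsType = 0.0 if attrList[0] == 'ham' else 1.0

    return [smsType, attrList[1]]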

In [8]:
rddSpam02 = rddSpam.map(vectorize)

In [9]:
rddSpam02.take(5)

[[0.0, 'Go until jurong point'],
 [0.0, 'Ok lar... Joking wif u oni...'],
 [0.0, 'U dun say so early hor... U c already then say...'],
 [0.0, "Nah I don't think he goes to usf"],
 [0.0,
  'Even my brother is not like to speak with me. They treat me like aids patent.']]
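The first row of the output above is cut at its first comma ('Go until jurong point'), because the line is split on every comma. A possible alternative, sketched below in plain Python, splits only on the first comma and strips the trailing run of empty fields. The name `parse_row` is hypothetical, and the sketch assumes that commas at the very end of a line are always empty CSV fields rather than part of the message.

```python
def parse_row(row):
    """Split a raw line into [label, message], keeping commas inside the message."""
    # Split once: everything after the first comma belongs to the message.
    label, _, rest = row.partition(',')

    # Assumption: commas at the end of the line are empty CSV fields.
    message = rest.rstrip(',')

    sms_type = 0.0 if label == 'ham' else 1.0
    return [sms_type, message]


# First line of the dataset, as shown by rddSpam.take(5).
line = ('ham,Go until jurong point, crazy.. Available only in bugis n great '
        'world la e buffet... Cine there got amore wat...,,,,,,,,,')
print(parse_row(line))
```

A function like this could be passed to `rddSpam.map` in place of `vectorize`. The counts and metrics reported in the rest of this section were produced with the original `vectorize`, so they would likely differ.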

In [10]:
dfSpam = spSession.createDataFrame(rddSpam02, ['label', 'message'])

In [11]:
dfSpam.cache()

DataFrame[label: double, message: string]

In [12]:
dfSpam.select('label', 'message').show(5)

+-----+--------------------+
|label|             message|
+-----+--------------------+
|  0.0|Go until jurong p...|
|  0.0|Ok lar... Joking ...|
|  0.0|U dun say so earl...|
|  0.0|Nah I don't think...|
|  0.0|Even my brother i...|
+-----+--------------------+
only showing top 5 rows



### Machine Learning

In [13]:
# No seed is passed, so the split (and the counts below) can change between runs.
(dataTraining, dataTest) = dfSpam.randomSplit([.7, .3])

In [14]:
dataTraining.count()

711

In [15]:
dataTest.count()

289

In [16]:
dataTraining.count() + dataTest.count() == dfSpam.count()

True

In [17]:
tokenizer = Tokenizer(inputCol = 'message', outputCol = 'words')

In [18]:
hashingTF = HashingTF(inputCol = tokenizer.getOutputCol(), outputCol = 'hash_features')

In [19]:
idf = IDF(inputCol = hashingTF.getOutputCol(), outputCol = 'features')

In [20]:
naiveBayes = NaiveBayes()

In [21]:
pipeline = Pipeline(stages = [tokenizer, hashingTF, idf, naiveBayes])
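The first three pipeline stages turn each message into a numeric feature vector. The plain-Python sketch below illustrates the idea on three invented messages: tokenize, hash each word into a fixed number of buckets, then down-weight buckets that appear in many messages. It is an illustration only. Spark's `HashingTF` uses MurmurHash3 with 262,144 buckets by default, whereas this sketch uses `zlib.crc32` and 16 buckets as stand-ins; the IDF weight follows the formula in the Spark documentation, log((m + 1) / (df + 1)). The helper names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 16  # tiny on purpose, for readability


def tokenize(text):
    # Like Spark's Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def hashed_tf(words, num_buckets=NUM_BUCKETS):
    # Hashing trick: map each word to a bucket index and count occurrences.
    # crc32 is a stand-in hash, chosen only because it is deterministic.
    return dict(Counter(zlib.crc32(w.encode("utf-8")) % num_buckets for w in words))


def idf_weights(tf_vectors):
    # Inverse document frequency per bucket: log((m + 1) / (df + 1)).
    m = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {bucket: math.log((m + 1) / (df + 1)) for bucket, df in doc_freq.items()}


# Invented messages, for illustration only.
docs = ["free prize now", "call me now", "free free prize"]
tf_vectors = [hashed_tf(tokenize(d)) for d in docs]
idf = idf_weights(tf_vectors)
tfidf = [{bucket: count * idf[bucket] for bucket, count in tf.items()} for tf in tf_vectors]

for doc, vector in zip(docs, tfidf):
    print(doc, vector)
```

With so few buckets, different words may collide in the same bucket; a large default bucket count makes such collisions rare.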

In [22]:
model = pipeline.fit(dataTraining)

In [23]:
predictions = model.transform(dataTest)

In [24]:
predictions.select('prediction', 'label').show(5)

+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       1.0|  0.0|
+----------+-----+
only showing top 5 rows



In [25]:
evaluator = MulticlassClassificationEvaluator(
    predictionCol = 'prediction', 
    labelCol = 'label', 
    metricName = 'accuracy')

In [26]:
evaluator.evaluate(predictions)

0.916955017301038

**Confusion Matrix - Summing Up Predictions**

In [27]:
predictions.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  1.0|       1.0|  139|
|  0.0|       1.0|   18|
|  1.0|       0.0|    6|
|  0.0|       0.0|  126|
+-----+----------+-----+
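The counts in the table above can be turned into the usual metrics by hand. The sketch below treats spam (label 1.0) as the positive class, which is a choice made here for illustration, and hard-codes the four counts from the table.

```python
# Counts copied from the confusion matrix above (label, prediction).
true_positives = 139   # label 1.0, prediction 1.0
false_positives = 18   # label 0.0, prediction 1.0
false_negatives = 6    # label 1.0, prediction 0.0
true_negatives = 126   # label 0.0, prediction 0.0

total = true_positives + false_positives + false_negatives + true_negatives

# Share of all test messages classified correctly.
accuracy = (true_positives + true_negatives) / total
# Of the messages flagged as spam, the share that really is spam.
precision = true_positives / (true_positives + false_positives)
# Of the real spam messages, the share that was caught.
recall = true_positives / (true_positives + false_negatives)

print(f"total={total} accuracy={accuracy:.4f} "
      f"precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way matches the value returned by the evaluator (about 0.917), and the total of 289 matches the size of the test set. The 18 false positives are ham messages flagged as spam, which is usually the more costly error for a spam filter.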

