## NLP & Binary Classification: Spam Collection Data
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

**Dataset Information:**

A collection of more than 5,000 SMS phone messages.

**Attribute Information: (1 raw text feature and 1 class)**

Each line has the correct class followed by the raw message. Some examples are shown below:

ham What you doing?how are you?

ham Ok lar... Joking wif u oni... 
 
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop 

spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B 

**Objective of this project**

Predict whether a message is spam or ham.

## Data

In [1]:
import findspark
findspark.init('/home/danny/spark-2.2.1-bin-hadoop2.7')

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spam').getOrCreate()

In [2]:
# Load Data
df = spark.read.csv('smsspamcollection/SMSSpamCollection',
                    inferSchema=True,sep='\t')
# Inspect Data
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



In [3]:
df = df.withColumnRenamed('_c0','class').withColumnRenamed('_c1','text')

In [4]:
df.show(5)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
+-----+--------------------+
only showing top 5 rows



In [5]:
df.head()

Row(class='ham', text='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')

In [6]:
df.describe().show()

+-------+-----+--------------------+
|summary|class|                text|
+-------+-----+--------------------+
|  count| 5574|                5574|
|   mean| null|               645.0|
| stddev| null|                 NaN|
|    min|  ham| &lt;#&gt;  in mc...|
|    max| spam|… we r stayin her...|
+-------+-----+--------------------+



In [7]:
df.groupby('class').count().show()

+-----+-----+
|class|count|
+-----+-----+
|  ham| 4827|
| spam|  747|
+-----+-----+
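
The class counts above show a clear imbalance, which matters when reading any accuracy figure later on. As a small worked example (plain Python, using only the two counts printed above), the accuracy of a trivial classifier that always predicts the majority class can be computed like this:

```python
# Class counts copied from the groupby output above.
counts = {"ham": 4827, "spam": 747}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always answers with the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"Always predicting '{majority_class}' is right "
      f"{baseline_accuracy:.1%} of the time on {total} messages")
```

Roughly 86.6% of the messages are ham, so a useful spam model has to beat that baseline rather than merely reach it.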



## Data preprocessing

In [8]:
from pyspark.sql.functions import length
from pyspark.ml.feature import (Tokenizer,StopWordsRemover,
                                CountVectorizer,IDF,StringIndexer,
                                VectorAssembler,OneHotEncoder)
from pyspark.ml import Pipeline

**Pipeline for data prep**

In [9]:
# Create a new length feature (feature engineering)
df = df.withColumn('length', length(df['text']))
df.show(5)
# Average message length per class
df.groupby('class').mean().show()

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
|  ham|U dun say so earl...|    49|
|  ham|Nah I don't think...|    61|
+-----+--------------------+------+
only showing top 5 rows

+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+



In [10]:
# create bag-of-words model (feature transformations)
tokenizer = Tokenizer(inputCol="text", outputCol="token_text")
stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens')
count_vec = CountVectorizer(inputCol='stop_tokens',outputCol='c_vec')
idf = IDF(inputCol="c_vec", outputCol="tf_idf")
ham_spam_to_num = StringIndexer(inputCol='class',outputCol='label')
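
To make the `CountVectorizer` and `IDF` stages less opaque, here is a minimal pure-Python sketch of the same idea on a toy corpus. It assumes the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The corpus and the helper names `idf_weights` and `tf_idf` are hypothetical, and stop-word removal is left out for brevity.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on whitespace
# (roughly what Tokenizer produces).
docs = [
    "win a free prize now".split(),
    "are you free now".split(),
    "call me now".split(),
]

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(tokenized_docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(doc, idf):
    """Raw term counts (the CountVectorizer step) scaled by IDF weights."""
    counts = Counter(doc)
    return {term: count * idf[term] for term, count in counts.items()}

idf = idf_weights(docs)
vec = tf_idf(docs[0], idf)

for term, weight in sorted(vec.items()):
    print(f"{term:>6}: {weight:.3f}")
```

A term that occurs in every document (here `now`) gets a weight of zero, while rarer terms such as `win` or `prize` are weighted up.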

In [11]:
# combine features into a single column
assembler = VectorAssembler(inputCols=['tf_idf','length'],
                            outputCol='features')

In [12]:
# pipeline
data_prep_pipe = Pipeline(stages=[ham_spam_to_num,tokenizer,stopremove,count_vec,idf,assembler])
clean_data = data_prep_pipe.fit(df).transform(df)
final_data = clean_data.select(['label','features'])
final_data.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(13424,[7,11,31,6...|
|  0.0|(13424,[0,24,297,...|
|  1.0|(13424,[2,13,19,3...|
|  0.0|(13424,[0,70,80,1...|
|  0.0|(13424,[36,134,31...|
+-----+--------------------+
only showing top 5 rows
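
A note on reading this output: `StringIndexer` assigns index 0.0 to the most frequent label, so here 0.0 is ham and 1.0 is spam. Each `features` value is printed as a sparse vector in the form `(size, [indices], [values])`. Since `VectorAssembler` appends `length` after the TF-IDF vector, the last of the 13424 slots should hold the message length. The sketch below expands such a triple into a dense list; the small vector and the helper name `to_dense` are hypothetical and only illustrate the notation.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Hypothetical vector: 5 vocabulary slots plus one trailing 'length' slot.
size, indices, values = 6, [0, 3, 5], [1.2, 0.7, 29.0]
dense = to_dense(size, indices, values)

print(dense)
print("length feature:", dense[-1])
```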



**Split into train and test sets**

In [13]:
seed = 101 #for reproducibility
train_data,test_data = final_data.randomSplit([0.7,0.3],seed=seed)
train_data.describe().show()
test_data.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|               3901|
|   mean| 0.1330428095360164|
| stddev|0.33966453354208315|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|               1673|
|   mean|0.13628212791392708|
| stddev| 0.3431904862170847|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+
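
Because `label` is 0.0 for ham and 1.0 for spam, its mean in each split is the spam fraction. As a quick sanity check (plain Python, using only the numbers printed above), the spam counts per split can be recovered and compared with the 747 spam messages counted earlier:

```python
# Row counts and label means copied from the describe() output above.
train_count, train_mean = 3901, 0.1330428095360164
test_count, test_mean = 1673, 0.13628212791392708

# mean(label) * count gives the number of rows with label 1.0 (spam).
train_spam = round(train_count * train_mean)
test_spam = round(test_count * test_mean)

print("spam in train:", train_spam)
print("spam in test: ", test_spam)
print("total spam:   ", train_spam + test_spam)
print("train share:  ", round(train_count / (train_count + test_count), 3))
```

The two splits hold a similar proportion of spam, and the train share is close to, but not exactly, the requested 0.7, since `randomSplit` assigns rows randomly rather than cutting at an exact boundary.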



## Naive Bayes Model
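
Spark's `NaiveBayes` defaults to the multinomial variant with additive smoothing of 1.0. The following is a minimal pure-Python sketch of that technique on word counts, meant only to show the idea; it is not Spark's implementation (which, among other differences, works on the TF-IDF and length features built above). The toy data and the function names `train_multinomial_nb` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed per-class log term likelihoods."""
    vocab = {term for doc in docs for term in doc}
    class_doc_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_terms = sum(term_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Additive (Laplace) smoothing keeps unseen terms from zeroing a class.
        log_likelihood = {
            term: math.log((term_counts[label][term] + smoothing)
                           / (total_terms + smoothing * len(vocab)))
            for term in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior; unknown terms are skipped."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[t] for t in doc if t in log_likelihood)
    return max(model, key=score)

# Hypothetical toy training data.
docs = [
    ["win", "free", "prize"],
    ["free", "cash", "now"],
    ["see", "you", "at", "lunch"],
    ["are", "you", "home"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(docs, labels)
print(predict(model, ["free", "prize", "now"]))
print(predict(model, ["you", "home"]))
```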

In [14]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Train the model
nbc = NaiveBayes()
spam_predictor = nbc.fit(train_data)
# Make predictions
predictions = spam_predictor.transform(test_data)
predictions.select(['features','label','prediction']).show()
# Evaluate the model (the evaluator defaults to F1, so request accuracy explicitly)
acc_eval = MulticlassClassificationEvaluator(metricName='accuracy')
acc = acc_eval.evaluate(predictions)
print("Accuracy of model at predicting spam: {:.1f}%".format(acc*100))

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(13424,[0,1,5,20,...|  0.0|       0.0|
|(13424,[0,1,7,8,1...|  0.0|       0.0|
|(13424,[0,1,7,15,...|  0.0|       0.0|
|(13424,[0,1,11,32...|  0.0|       0.0|
|(13424,[0,1,14,78...|  0.0|       0.0|
|(13424,[0,1,17,19...|  0.0|       0.0|
|(13424,[0,1,18,20...|  0.0|       0.0|
|(13424,[0,1,21,27...|  0.0|       0.0|
|(13424,[0,1,146,1...|  0.0|       0.0|
|(13424,[0,1,498,5...|  0.0|       0.0|
|(13424,[0,1,874,1...|  0.0|       0.0|
|(13424,[0,2,4,5,1...|  0.0|       0.0|
|(13424,[0,2,4,8,1...|  0.0|       0.0|
|(13424,[0,2,4,8,2...|  0.0|       0.0|
|(13424,[0,2,4,10,...|  0.0|       0.0|
|(13424,[0,2,4,25,...|  0.0|       0.0|
|(13424,[0,2,4,40,...|  0.0|       0.0|
|(13424,[0,2,4,128...|  0.0|       0.0|
|(13424,[0,2,7,8,1...|  0.0|       0.0|
|(13424,[0,2,7,11,...|  0.0|       0.0|
+--------------------+-----+----------+
only showing top 20 rows

Accuracy of mo
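
One detail worth knowing when reading the reported figure: `MulticlassClassificationEvaluator` uses `metricName='f1'` (a support-weighted F1 score) unless another metric is requested, so a number labeled "accuracy" should come from an evaluator created with `metricName='accuracy'`. The sketch below computes both metrics in plain Python to show that they can differ on imbalanced data. The `(label, prediction)` pairs and the function names are hypothetical and are not the model's actual results.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of rows where the prediction equals the label."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def weighted_f1(pairs):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(label for label, _ in pairs)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for label, pred in pairs if label == cls and pred == cls)
        fp = sum(1 for label, pred in pairs if label != cls and pred == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n / len(pairs)
    return total

# Hypothetical pairs: 8 ham (0.0) all correct, 2 spam (1.0) with one missed.
pairs = [(0.0, 0.0)] * 8 + [(1.0, 1.0), (1.0, 0.0)]

print(f"accuracy:    {accuracy(pairs):.3f}")
print(f"weighted F1: {weighted_f1(pairs):.3f}")
```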

In [15]:
spark.stop()