<a href="https://colab.research.google.com/github/bdmlworkshop/Examples/blob/main/Spam_Detection_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Simple SMS Spam Detection

In this code-along we will build a spam filter! We'll use the NLP tools we have learned about so far, together with a new classifier, Naive Bayes.

We'll use a classic dataset for this, the SMS Spam Collection from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [None]:
!wget -q https://mirrors.netix.net/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
!tar -xzf spark-3.0.1-bin-hadoop3.2.tgz
!pip install -q findspark

# Define some environment variables directly from Python using the os module
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/default-java"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
import findspark
findspark.init()

In [None]:
# Get the data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

In [None]:
!unzip smsspamcollection.zip

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('BDML_smsSpam').getOrCreate()

In [None]:
data = spark.read.csv("SMSSpamCollection",inferSchema=True,sep='\t')

In [None]:
data = data.withColumnRenamed('_c0','class').withColumnRenamed('_c1','text')

In [None]:
data.show()

## Clean and Prepare the Data

**Create a new length feature:**

In [None]:
from pyspark.sql.functions import length

In [None]:
data = data.withColumn('length',length(data['text']))

In [None]:
data.show()

In [None]:
# Mean message length per class: the two classes differ pretty clearly
data.groupby('class').mean().show()

## Feature Transformations

In [None]:
from pyspark.ml.feature import Tokenizer,StopWordsRemover, CountVectorizer,IDF,StringIndexer

tokenizer = Tokenizer(inputCol="text", outputCol="token_text")
stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens')
count_vec = CountVectorizer(inputCol='stop_tokens',outputCol='c_vec')
idf = IDF(inputCol="c_vec", outputCol="tf_idf")
ham_spam_to_num = StringIndexer(inputCol='class',outputCol='label')
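
To make the stages above less of a black box, here is a minimal pure-Python sketch of what the tokenize, count and IDF steps compute. The three toy messages are hypothetical (not taken from the dataset), stop-word removal is skipped for brevity, and the `idf` and `tf_idf` helpers are illustrative names rather than Spark APIs. The IDF formula used is the smoothed form documented for Spark's `IDF`, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages, not taken from the dataset).
docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]

# Tokenize roughly the way Tokenizer does: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
num_docs = len(tokenized)


def idf(term):
    # Smoothed inverse document frequency: log((m + 1) / (df + 1)).
    return math.log((num_docs + 1) / (doc_freq[term] + 1))


def tf_idf(tokens):
    # Raw term counts (what CountVectorizer produces) scaled by each term's IDF.
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}


weights = tf_idf(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer messages (here `prize` and `now`) end up with larger weights than terms shared across messages (`free`, `call`), which is the point of the IDF rescaling.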

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
clean_up = VectorAssembler(inputCols=['tf_idf','length'],outputCol='features')

### The Model

We'll use Naive Bayes, but feel free to play around with this choice!

In [None]:
from pyspark.ml.classification import NaiveBayes

In [None]:
# Use the defaults (multinomial model, smoothing=1.0)
nb = NaiveBayes()
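
For intuition about what the classifier does, the following is a simplified pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. It is not Spark's implementation: the toy training set is hypothetical, the class prior is left unsmoothed, and it works on raw token counts, whereas the pipeline in this notebook feeds TF-IDF weights plus the length feature. The names `log_score` and `predict` are made up for the illustration.

```python
import math
from collections import Counter

# Hypothetical toy training set of (label, tokens); not from the SMS dataset.
train = [
    ("spam", ["free", "prize", "call"]),
    ("spam", ["free", "cash", "now"]),
    ("ham", ["call", "me", "later"]),
    ("ham", ["see", "you", "later"]),
]

smoothing = 1.0  # Laplace smoothing, same value as the NaiveBayes default
vocab = {tok for _, toks in train for tok in toks}
class_docs = Counter(label for label, _ in train)

# Per-class token counts.
term_counts = {label: Counter() for label in class_docs}
for label, toks in train:
    term_counts[label].update(toks)


def log_score(label, tokens):
    # log P(class) + sum over tokens of log P(token | class)
    total = sum(term_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for tok in tokens:
        if tok not in vocab:
            continue  # skip words never seen during training
        score += math.log(
            (term_counts[label][tok] + smoothing) / (total + smoothing * len(vocab))
        )
    return score


def predict(tokens):
    # Pick the class with the highest log score.
    return max(class_docs, key=lambda label: log_score(label, tokens))


print(predict(["free", "cash", "prize"]))
print(predict(["see", "you", "later"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.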

### Pipeline

In [None]:
from pyspark.ml import Pipeline

In [None]:
data_prep_pipe = Pipeline(stages=[ham_spam_to_num,tokenizer,stopremove,count_vec,idf,clean_up])

In [None]:
cleaner = data_prep_pipe.fit(data)

In [None]:
clean_data = cleaner.transform(data)

### Training and Evaluation!

In [None]:
clean_data = clean_data.select(['label','features'])

In [None]:
clean_data.show()

In [None]:
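# 70/30 train/test split; the seed value is arbitrary and only makes the split reproducible
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=42)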

In [None]:
spam_predictor = nb.fit(training)

In [None]:
data.printSchema()

In [None]:
test_results = spam_predictor.transform(testing)

In [None]:
test_results.show()

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
# The evaluator's default metric is F1, so request accuracy explicitly
acc_eval = MulticlassClassificationEvaluator(metricName="accuracy")
acc = acc_eval.evaluate(test_results)
print(f"Accuracy of model at predicting spam was: {acc}")
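
A single score can hide how the errors are split between the two classes, which matters here because ham messages heavily outnumber spam in this collection. As a rough sketch, the snippet below computes accuracy, precision, recall and F1 for the spam class by hand from a list of hypothetical `(label, prediction)` pairs. It assumes `StringIndexer` mapped the more frequent class (ham) to 0.0 and spam to 1.0, which is its default frequency-based ordering. The evaluator's own `f1` metric is averaged across classes, so it will not match the spam-only F1 shown here.

```python
# Hypothetical (label, prediction) pairs; 0.0 = ham, 1.0 = spam.
pairs = [
    (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
    (0.0, 1.0),  # ham flagged as spam (false positive)
    (1.0, 1.0), (1.0, 1.0),
    (1.0, 0.0),  # spam that slipped through (false negative)
]

# Confusion-matrix counts, treating spam (1.0) as the positive class.
tp = sum(1 for y, p in pairs if y == 1.0 and p == 1.0)
fp = sum(1 for y, p in pairs if y == 0.0 and p == 1.0)
fn = sum(1 for y, p in pairs if y == 1.0 and p == 0.0)
tn = sum(1 for y, p in pairs if y == 0.0 and p == 0.0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With real results, the same four counts could be obtained from `test_results` by grouping on the `label` and `prediction` columns.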

Not bad, considering we're using nothing but straight math on text data! Try swapping in other classification models, or come up with additional engineered features of your own.

## Great Job!