<a href="https://colab.research.google.com/github/ayaamr11/SMSSpamCollect-NLP/blob/main/SMSSpamCollect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install pyspark

Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


# NLP Using PySpark

## Objective:
- The objective of this project is to build a <b>spam filter using a NaiveBayes classifier</b>.
- We'll use the SMS Spam Collection dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

### Create a spark session and import the required libraries

In [None]:
from pyspark.sql import SparkSession
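# Import only the functions used below instead of a wildcard import,
# which would shadow Python built-ins such as sum, min and max.
from pyspark.sql.functions import avg, length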
spark = SparkSession.builder.getOrCreate()

### Read the data into a DataFrame

In [None]:
sms = spark.read.option("delimiter","\t").csv("/content/SMSSpamCollection (1).csv",header=False)

### Print the schema

In [None]:
sms.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



### Rename the first column to 'class' and second column to 'text'

In [None]:
sms = sms.withColumnRenamed("_c0","class").\
    withColumnRenamed("_c1","text")

In [None]:
sms.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)



### Show the first 10 rows from the dataframe
- Show once with truncate=True and once with truncate=False

In [None]:
sms.show(10,truncate=True)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
+-----+--------------------+
only showing top 10 rows



In [None]:
sms.show(10,truncate=False)

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|class|text                                                                                                                                                            |
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                                 |
|ham  |Ok lar... Joking wif u oni...                                                                                                                                   |
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075o

## Clean and Prepare the Data

### Create a new feature column containing the length of the text column

In [None]:
df = sms.withColumn("col_len",length("text"))

### Show the new dataframe

In [None]:
df.show()

+-----+--------------------+-------+
|class|                text|col_len|
+-----+--------------------+-------+
|  ham|Go until jurong p...|    111|
|  ham|Ok lar... Joking ...|     29|
| spam|Free entry in 2 a...|    155|
|  ham|U dun say so earl...|     49|
|  ham|Nah I don't think...|     61|
| spam|FreeMsg Hey there...|    147|
|  ham|Even my brother i...|     77|
|  ham|As per your reque...|    160|
| spam|WINNER!! As a val...|    157|
| spam|Had your mobile 1...|    154|
|  ham|I'm gonna be home...|    109|
| spam|SIX chances to wi...|    136|
| spam|URGENT! You have ...|    155|
|  ham|I've been searchi...|    196|
|  ham|I HAVE A DATE ON ...|     35|
| spam|XXXMobileMovieClu...|    149|
|  ham|Oh k...i'm watchi...|     26|
|  ham|Eh u remember how...|     81|
|  ham|Fine if thats th...|     56|
| spam|England v Macedon...|    155|
+-----+--------------------+-------+
only showing top 20 rows



### Get the average text length for each class (give alias name to the average length column)

In [None]:
df.groupby("class").agg(avg("col_len").alias("Avg. Length")).show()

+-----+-----------------+
|class|      Avg. Length|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+
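
The per-class average above can also be reproduced without Spark, which may help when checking the aggregation by hand. This is a minimal sketch in plain Python; the `rows` list and the `average_length_by_class` helper are hypothetical and are not part of the notebook or the dataset.

```python
from collections import defaultdict

# Hypothetical toy rows (label, text); these are NOT taken from the dataset.
rows = [
    ("ham", "Ok see you"),
    ("ham", "Call me later"),
    ("spam", "WINNER! Claim your free prize now"),
]

def average_length_by_class(rows):
    """Return {label: mean character length of the texts with that label}."""
    totals = defaultdict(int)   # sum of text lengths per label
    counts = defaultdict(int)   # number of texts per label
    for label, text in rows:
        totals[label] += len(text)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

averages = average_length_by_class(rows)
print(averages)
```

In the real data, spam messages are on average almost twice as long as ham messages, which is why the length is kept as an extra feature.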



## Feature Transformations

### In this part, the raw text is transformed into TF-IDF features:
- For more information about TF-IDF, see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

### Perform the following steps to obtain TF-IDF:
1. Import the required transformers/estimators for the subsequent steps.
2. Create a <b>Tokenizer</b> from the text column.
3. Create a <b>StopWordsRemover</b> to remove the <b>stop words</b> from the column obtained from the <b>Tokenizer</b>.
4. Create a <b>CountVectorizer</b> after removing the <b>stop words</b>.
5. Create the <b>TF-IDF</b> from the <b>CountVectorizer</b>.

In [None]:
from pyspark.ml.feature import Tokenizer,StopWordsRemover,CountVectorizer,IDF,StringIndexer,VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
tokenizer = Tokenizer(inputCol="text",outputCol="words")

In [None]:
remover = StopWordsRemover(inputCol="words",outputCol="removed")
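
As a rough illustration of what these two stages do, the sketch below lowercases a message, splits it on whitespace, and drops stop words. It is plain Python, not the Spark implementation: `STOP_WORDS` is a deliberately tiny, hypothetical list (Spark ships a much longer default English list), and `tokenize` and `remove_stop_words` are made-up helper names.

```python
# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "you", "i"}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

tokens = tokenize("Free entry in a weekly competition")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

Note that whitespace splitting leaves punctuation attached to the words, so "prize!" and "prize" would count as different tokens.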

In [None]:
cv = CountVectorizer(inputCol="removed",outputCol="vectors")

In [None]:
idf = IDF(inputCol="vectors", outputCol="idf")
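
To make the last two stages concrete, here is a small sketch of the arithmetic in plain Python. It assumes raw term counts as the term frequency and the smoothed inverse document frequency `log((N + 1) / (df + 1))` described in the Spark MLlib documentation; the `tf_idf` helper and the toy `docs` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw count of a term in a document.
    IDF is log((N + 1) / (df + 1)), where N is the number of documents
    and df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({term: count * idf[term] for term, count in tf.items()})
    return weighted, idf

# Hypothetical toy corpus, already tokenized and stop-word filtered.
docs = [["free", "prize", "free"], ["see", "you", "later"], ["free", "later"]]
weights, idf = tf_idf(docs)
print(weights[0])
```

Terms that occur in many messages get a small IDF, so they contribute less to the feature vector than rare terms.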

- Convert the <b>class column</b> to a numeric index using <b>StringIndexer</b>.
- Create the feature column from the <b>TF-IDF</b> and <b>length</b> columns.

In [None]:
indexer = StringIndexer(inputCol="class",outputCol="class_indexed")
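
By default, StringIndexer orders labels by descending frequency, so the most common label receives index 0.0. The sketch below mimics that ordering in plain Python; `fit_label_index` is a hypothetical helper, and the alphabetical tie-break is an assumption made for illustration.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = fit_label_index(["ham", "spam", "ham", "ham", "spam"])
print(mapping)
```

Since ham is the majority class in this dataset, it would be expected to map to 0.0 and spam to 1.0.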

In [None]:
assembler = VectorAssembler(inputCols=['idf','col_len'],outputCol="features")
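
Conceptually, the assembler concatenates its input columns, in the order given, into one vector per row. A minimal sketch in plain Python, with a hypothetical `assemble` helper and made-up TF-IDF values:

```python
def assemble(tfidf_vector, text_length):
    """Append the scalar text length to the TF-IDF values."""
    return list(tfidf_vector) + [float(text_length)]

# Made-up TF-IDF weights followed by a message length of 155 characters.
features = assemble([0.0, 0.58, 1.39], 155)
print(features)
```

The raw length is typically much larger than any single TF-IDF weight, so it may be worth checking how strongly this one feature influences the classifier.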

## The Model
- Create a <b>NaiveBayes</b> classifier with the default parameters.

In [None]:
nb = NaiveBayes(featuresCol='features',labelCol='class_indexed')
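
With its default parameters, Spark's NaiveBayes is a multinomial model with additive smoothing of 1.0. The sketch below shows the idea in plain Python on token counts. It is a simplified illustration, not Spark's implementation: the class prior is left unsmoothed here, and `train_multinomial_nb`, `predict`, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({token for tokens in docs for token in tokens})
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n_docs / len(docs))  # unsmoothed prior (simplification)
        denominator = total + smoothing * len(vocab)
        log_likelihood = {
            token: math.log((term_counts[label][token] + smoothing) / denominator)
            for token in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, tokens):
    """Return the label with the highest log posterior; unseen tokens are ignored."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[token] for token in tokens if token in log_likelihood
        )
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = [["free", "prize", "now"], ["free", "cash"],
        ["see", "you", "later"], ["call", "you", "later"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["free", "prize"]))
print(predict(model, ["see", "later"]))
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.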

## Pipeline
### Create a pipeline containing all the steps, from the Tokenizer to the NaiveBayes classifier.

In [None]:
pipe = Pipeline(stages=[tokenizer,remover,cv,idf,indexer,assembler,nb])

### Split your data into train and test sets with ratios 0.7 and 0.3, respectively.

In [None]:
train_df,test_df=df.randomSplit([0.7,0.3],seed=42)
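
`randomSplit` assigns each row to a split at random, so the 0.7/0.3 ratio holds only approximately, and the seed makes the assignment repeatable. The sketch below illustrates that behavior in plain Python; `random_split` is a hypothetical helper and does not reproduce Spark's actual sampling.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    accumulated = 0.0
    for weight in weights:
        accumulated += weight / total
        bounds.append(accumulated)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Every row lands in exactly one split, but the split sizes vary from seed to seed instead of being exactly 700 and 300.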

### Fit your Pipeline model to the training data

In [None]:
pl_model = pipe.fit(train_df)

### Perform predictions on the test DataFrame

In [None]:
pred = pl_model.transform(test_df)

### Print the schema of the prediction dataframe

In [None]:
pred.printSchema()

root
 |-- class: string (nullable = true)
 |-- text: string (nullable = true)
 |-- col_len: integer (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- removed: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- vectors: vector (nullable = true)
 |-- idf: vector (nullable = true)
 |-- class_indexed: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



## Model Evaluation
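- Use <b>MulticlassClassificationEvaluator</b> to calculate the <b>F1 score</b>. With `metricName="f1"`, this is the per-class F1 averaged with weights proportional to each class's frequency.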

In [None]:
classval = MulticlassClassificationEvaluator(predictionCol = "prediction", labelCol = "class_indexed", metricName = "f1")

In [None]:
classval.evaluate(pred)

0.9727502290227267
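
For reference, the weighted F1 can be computed by hand as follows. This is a minimal sketch in plain Python with a hypothetical `weighted_f1` helper and made-up labels; it is meant to show the arithmetic, not to reproduce the score above.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n_true / total) * f1
    return score

# Made-up labels: three ham (0) and one spam (1), with one ham misclassified.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(weighted_f1(y_true, y_pred))
```

Because ham is the majority class, the weighted score is dominated by how well ham is classified, so inspecting the spam-class precision and recall separately can give a fuller picture.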