# Text analytics (Unstructured)

## Spark Environment

In [75]:
# Is Spark working?
print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])
spark

Spark UI running on http://YOURIPADDRESS:4040
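
The cell above gets the port by splitting the URL on `:` and taking the third piece. That works for a URL shaped like `http://host:port`, but it depends on the URL containing exactly that layout. Here is a minimal sketch of a sturdier alternative using the standard library, assuming `sc.uiWebUrl` returns a string like the hypothetical value below:

```python
from urllib.parse import urlsplit

# Hypothetical value; in the notebook this would come from sc.uiWebUrl.
ui_web_url = "http://192.168.1.5:4040"

# Splitting on ':' gives ['http', '//192.168.1.5', '4040'], so index 2 is the port.
port_by_split = ui_web_url.split(":")[2]

# urlsplit parses the URL and exposes the port as an int.
port_by_parse = urlsplit(ui_web_url).port

print("Spark UI running on http://YOURIPADDRESS:" + str(port_by_parse))
```

Both approaches agree for this URL shape; the parsed version simply avoids relying on the position of the colons.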


## Read email data


In [66]:
email_text = spark.read.text("../data/emails/")
email_text.show(30, False)

+------------------------------------------+
|value                                     |
+------------------------------------------+
|From: "my name" <me@me.com>               |
|To: "your name" <you@you.com>             |
|Sent-From:  4.4.4.4                       |
|Date: 2017-11-01T16:42:15-0500            |
|Subject: team meeting this afternoon @ 2pm|
|                                          |
|Team,                                     |
|let's do a quick meeting today afternoon. |
|Let's discuss the current project.        |
|                                          |
|see you then!                             |
|From: "me" <me@me.com>                    |
|To: "your name" <you@you.com>             |
|Sent-From:  3.3.3.3                       |
|Date: 2017-11-01T16:42:15-0500            |
|Subject: Free Diploma!                    |
|                                          |
|!!!FREE Diploma!!!                        |
|Get your free diploma here                |
|Just clic

In [67]:
# How many lines of text were read across all email files?
email_text.count()

45

## Hmm SPAM!
Let's look for spammy content.  
For simplicity, we are going to classify an email as spam if any of its lines contains `!!!`.

In [68]:

spam_lines = email_text.filter(email_text['value'].contains('!!!'))
spam_lines.show(10, False)
spam_lines.count()

+--------------------------+
|value                     |
+--------------------------+
|!!!FREE Diploma!!!        |
|Subject: !!!HOT DEALS!!!  |
|!!!! HOT DEALS!!!!        |
|Subject: !!!VIAGRA Sale!!!|
+--------------------------+



4
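
The filter above is a plain substring test applied to each line independently. Here is a Spark-free sketch of the same rule, using a few lines copied from the outputs above as sample data (`sample_lines` and `is_spammy` are illustrative names, not part of the notebook):

```python
# A few lines copied from the outputs above.
sample_lines = [
    "Subject: team meeting this afternoon @ 2pm",
    "see you then!",
    "!!!FREE Diploma!!!",
    "Subject: !!!HOT DEALS!!!",
    "!!!! HOT DEALS!!!!",
    "Subject: !!!VIAGRA Sale!!!",
]


def is_spammy(line: str) -> bool:
    """Return True if the line contains three consecutive exclamation marks."""
    return "!!!" in line


# Keep only the lines that match the rule, mirroring the DataFrame filter.
spam = [line for line in sample_lines if is_spammy(line)]
for line in spam:
    print(line)
```

A single `!` (as in "see you then!") does not trigger the rule, while `!!!!` does because it contains `!!!` as a substring.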

## Identify Spam Emails
To flag whole emails rather than single lines, we need to know which file each line came from, so we add a `file_name` column.

In [69]:
from pyspark.sql.functions import input_file_name

emails = spark.read.text("../data/emails/").withColumn("file_name", input_file_name())
emails.show(100, False)

+------------------------------------------+---------------------------------------------------------------------------------+
|value                                     |file_name                                                                        |
+------------------------------------------+---------------------------------------------------------------------------------+
|From: "my name" <me@me.com>               |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e4.txt|
|To: "your name" <you@you.com>             |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e4.txt|
|Sent-From:  4.4.4.4                       |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e4.txt|
|Date: 2017-11-01T16:42:15-0500            |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e4.txt|
|Subject: team meeting this afternoon @ 2pm|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data

In [70]:
# Find spam lines again, now with the source file attached

spam_lines = emails.filter(emails['value'].contains('!!!'))
spam_lines.show(10, False)

+--------------------------+---------------------------------------------------------------------------------+
|value                     |file_name                                                                        |
+--------------------------+---------------------------------------------------------------------------------+
|!!!FREE Diploma!!!        |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e3.txt|
|Subject: !!!HOT DEALS!!!  |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e2.txt|
|!!!! HOT DEALS!!!!        |file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e2.txt|
|Subject: !!!VIAGRA Sale!!!|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e5.txt|
+--------------------------+---------------------------------------------------------------------------------+



In [71]:
# Select only the file names of the matching lines
spam_lines.select('file_name').show(10, False)

+---------------------------------------------------------------------------------+
|file_name                                                                        |
+---------------------------------------------------------------------------------+
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e3.txt|
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e2.txt|
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e2.txt|
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e5.txt|
+---------------------------------------------------------------------------------+



In [72]:
# Distinct file names: each spam email is listed once
spam_lines.select('file_name').distinct().show(10, False)

+---------------------------------------------------------------------------------+
|file_name                                                                        |
+---------------------------------------------------------------------------------+
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e3.txt|
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e5.txt|
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e2.txt|
+---------------------------------------------------------------------------------+



In [73]:
# Group by file name to count the matching lines per email
spam_lines.groupby('file_name').count().show(10, False)

+---------------------------------------------------------------------------------+-----+
|file_name                                                                        |count|
+---------------------------------------------------------------------------------+-----+
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e3.txt|1    |
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e5.txt|1    |
|file:///Volumes/PhotoDisk/Dropbox/ElephantScale/spark-workshop/data/emails/e2.txt|2    |
+---------------------------------------------------------------------------------+-----+
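
The `distinct` and `groupby` steps above turn line-level matches into an email-level verdict: a file is flagged if at least one of its lines matched. Here is a plain-Python sketch of that aggregation, using the four matches from the output above with paths shortened to their base names (the variable names are illustrative):

```python
from collections import Counter

# (line, file) pairs taken from the spam_lines output above; paths shortened.
spam_matches = [
    ("!!!FREE Diploma!!!", "e3.txt"),
    ("Subject: !!!HOT DEALS!!!", "e2.txt"),
    ("!!!! HOT DEALS!!!!", "e2.txt"),
    ("Subject: !!!VIAGRA Sale!!!", "e5.txt"),
]

# Counterpart of groupby('file_name').count(): matching lines per file.
matches_per_file = Counter(file_name for _, file_name in spam_matches)

# Counterpart of select('file_name').distinct(): each flagged file once.
spam_files = sorted(matches_per_file)

print(spam_files)
print(dict(matches_per_file))
```

Unlike this sketch, Spark does not guarantee the row order of a `groupby` result, so an explicit `orderBy` is needed when the order matters.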

