# Ex6c - Email Features Extraction - Spam classification with SVM
To use an SVM to classify emails as spam or non-spam, you first need to convert each email into a feature vector. This notebook walks through the steps used to extract those features from a given email.
We will be using NLTK (the Natural Language Toolkit) for language processing.

In [1]:
import re
from nltk.stem import PorterStemmer
import numpy as np

In [2]:
fname = 'emailSample1.txt'
words = list()
# The 'with' block closes the file even if reading fails
with open(fname) as fh:
    for line in fh:
        for word in line.split():
            # Keep only the first occurrence of each raw token
            if word not in words:
                words.append(word)
# words.sort()

In [3]:
print(words)

['>', 'Anyone', 'knows', 'how', 'much', 'it', 'costs', 'to', 'host', 'a', 'web', 'portal', '?', 'Well,', 'depends', 'on', 'many', 'visitors', "you're", 'expecting.', 'This', 'can', 'be', 'anywhere', 'from', 'less', 'than', '10', 'bucks', 'month', 'couple', 'of', '$100.', 'You', 'should', 'checkout', 'http://www.rackspace.com/', 'or', 'perhaps', 'Amazon', 'EC2', 'if', 'youre', 'running', 'something', 'big..', 'To', 'unsubscribe', 'yourself', 'this', 'mailing', 'list,', 'send', 'an', 'email', 'to:', 'groupname-unsubscribe@egroups.com']


## Email preprocess
In this part, we implement the preprocessing steps that normalize each email before it is mapped to a vector of word indices. This involves:
* Lower-casing the text
* Removing any HTML markup
* Replacing all numbers with the text "number"
* Replacing all URLs with the text "httpaddr"
* Replacing all email addresses with the text "emailaddr"
* Replacing dollar signs with the text "dollar"
* Removing non-alphanumeric characters and empty strings


In [4]:
# lower case
words = list(map(lambda word: word.lower(), words))
# Stripping HTML
words = list(map(lambda word: re.sub('<[^<>]+>', ' ', word), words))
# Handle numbers
words = list(map(lambda word: re.sub('[0-9]+', 'number', word), words))
# Handle URLs (raw strings keep '\s' from being read as a string escape)
words = list(map(lambda word: re.sub(r'(http|https)://[^\s]*', 'httpaddr', word), words))
# Handle email addresses
words = list(map(lambda word: re.sub(r'[^\s]+@[^\s]+', 'emailaddr', word), words))
# Handle dollar signs
words = list(map(lambda word: re.sub('[$]+', 'dollar', word), words))
# Remove any non alphanumeric characters
words = list(map(lambda word: re.sub('[^a-zA-Z0-9]', '', word), words))
# Remove any empty string
words = list(filter(None, words))

print(words)

['anyone', 'knows', 'how', 'much', 'it', 'costs', 'to', 'host', 'a', 'web', 'portal', 'well', 'depends', 'on', 'many', 'visitors', 'youre', 'expecting', 'this', 'can', 'be', 'anywhere', 'from', 'less', 'than', 'number', 'bucks', 'month', 'couple', 'of', 'dollarnumber', 'you', 'should', 'checkout', 'httpaddr', 'or', 'perhaps', 'amazon', 'ecnumber', 'if', 'youre', 'running', 'something', 'big', 'to', 'unsubscribe', 'yourself', 'this', 'mailing', 'list', 'send', 'an', 'email', 'to', 'emailaddr']
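
The cell above normalizes tokens one at a time, after splitting on whitespace. As a minimal sketch, the same substitutions can also be applied to the whole email body before splitting, which lets the HTML pattern match tags that contain spaces. The function name `preprocess_email` is hypothetical and not part of the original exercise; unlike the cell above, it keeps repeated words.

```python
import re

def preprocess_email(text):
    """Normalize a raw email body and return its tokens (hypothetical helper)."""
    text = text.lower()
    # Strip HTML tags, including tags that contain spaces
    text = re.sub(r'<[^<>]+>', ' ', text)
    # Same substitution order as the notebook cell
    text = re.sub(r'[0-9]+', 'number', text)
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    text = re.sub(r'[$]+', 'dollar', text)
    # Drop non-alphanumeric characters, then discard empty tokens
    tokens = [re.sub(r'[^a-z0-9]', '', tok) for tok in text.split()]
    return [tok for tok in tokens if tok]

sample = "Visit <b>us</b> at http://example.com for $100 or mail me@example.com!"
print(preprocess_email(sample))
```

Working on the full string is a design choice rather than a requirement: the per-word version gives the same tokens for the sample email, because that email contains no markup.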


## Email tokenize
We will use the <b>PorterStemmer</b> algorithm to reduce each word to its stem (so 'running' becomes 'run'). The stems are not always dictionary words: 'anyone' becomes 'anyon'.
> This algorithm is part of NLTK: https://www.nltk.org/. The accompanying O'Reilly book is available at https://www.nltk.org/book/

In [5]:
ps = PorterStemmer()
words = list(map(lambda word: ps.stem(word), words))

print(words)

['anyon', 'know', 'how', 'much', 'it', 'cost', 'to', 'host', 'a', 'web', 'portal', 'well', 'depend', 'on', 'mani', 'visitor', 'your', 'expect', 'thi', 'can', 'be', 'anywher', 'from', 'less', 'than', 'number', 'buck', 'month', 'coupl', 'of', 'dollarnumb', 'you', 'should', 'checkout', 'httpaddr', 'or', 'perhap', 'amazon', 'ecnumb', 'if', 'your', 'run', 'someth', 'big', 'to', 'unsubscrib', 'yourself', 'thi', 'mail', 'list', 'send', 'an', 'email', 'to', 'emailaddr']


## Vocabulary list
Next we decide which words the filter should use and which it should ignore. For this we use a <b>vocabulary list</b>: the words that occur at least 100 times in the spam corpus. The list used here contains 1899 words; in practice, such lists range from 10,000 to 50,000 words.
> <b>Purpose of the vocabulary list:</b> Words that occur only rarely in the training set may cause the model to overfit it, so they are excluded.
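
This notebook loads a ready-made `vocab.txt`, so the list is never built here. As a rough sketch of how such a list could be derived from a tokenized corpus, the snippet below counts every word and keeps those that reach a threshold. The name `build_vocab`, the `min_count` parameter, and the alphabetical ordering are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_count=100):
    """Return the words occurring at least min_count times (hypothetical helper)."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # counts every occurrence, not one per email
    # Sorting alphabetically is an assumption; any fixed order would work
    return sorted(word for word, n in counts.items() if n >= min_count)

# Toy corpus with a low threshold so the effect is visible
corpus = [['buy', 'now', 'cheap'], ['buy', 'cheap', 'pill'], ['hello', 'buy']]
print(build_vocab(corpus, min_count=2))
```

Whatever order is chosen, it has to stay fixed afterwards, because each word's position becomes its feature index.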

In [6]:
# Load the vocabulary list into a list.
# I hesitated between a list, a dictionary and an array.
# A dictionary or an array would maintain a strong index-word relationship.
# A list is a simpler structure, but the word order within the list
# must be preserved: calling 'sort' would change the indices.
fname = 'vocab.txt'
fh = open(fname)
vocabList = list()
for line in fh:
    linewords = line.split()
    for word in linewords[1:]:
        if word not in vocabList:
            vocabList.append(word)
fh.close()

In [7]:
# Two list operations are used here:
# 1) 'ab' in vocabList          (membership test)
# 2) vocabList.index('patata')  (0-based position of a word)
# Words missing from the vocabulary are skipped. Filtering on membership,
# rather than with filter(None, ...), keeps index 0, which is a valid position.
wordIndices = [vocabList.index(word) for word in words if word in vocabList]

print(wordIndices)

[85, 915, 793, 1076, 882, 369, 1698, 789, 1821, 1830, 430, 1170, 1001, 1894, 591, 1675, 237, 161, 88, 687, 944, 1662, 1119, 1061, 374, 1161, 478, 1892, 1509, 798, 1181, 1236, 809, 1894, 1439, 1546, 180, 1698, 1757, 1895, 1675, 991, 960, 1476, 70, 529, 1698, 530]
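
Both `in` and `index` scan the list from the start, so every lookup is linear in the vocabulary size. A possible alternative, sketched here under the assumption that the vocabulary has no duplicate words, is to build a word-to-index dictionary once. The helper names `build_index` and `words_to_indices` are hypothetical. Index 0 is a valid position, so the sketch filters by membership rather than by truthiness.

```python
def build_index(vocab):
    """Map each vocabulary word to its 0-based position (hypothetical helper)."""
    return {word: i for i, word in enumerate(vocab)}

def words_to_indices(words, word_to_index):
    """Return the index of every word found in the vocabulary, in order."""
    return [word_to_index[w] for w in words if w in word_to_index]

# Toy vocabulary; 'zzz' is not in it and is skipped
vocab = ['aa', 'buy', 'cheap']
lookup = build_index(vocab)
print(words_to_indices(['buy', 'zzz', 'aa', 'buy'], lookup))
```

For the 1899-word list used here the difference is negligible; it matters more with vocabularies of tens of thousands of words and many emails.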


## Extracting Features vector from Emails
We will convert every email into a vector in $\mathbb{R}^{n}$, where $n$ is the number of words in our vocabulary list <code>vocabList</code>.
$x_i=1$ if the i-th word is present in the email and $x_i=0$ if it is not.

In [8]:
featVector = np.zeros(len(vocabList), dtype=int)
# Mark every vocabulary word that appears in the email
for idx in wordIndices:
    featVector[idx] = 1
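
To check the construction on a small case, here is a minimal sketch that wraps the same step in a function and runs it on a toy vocabulary of five words. The name `email_features` is hypothetical; the logic mirrors the cell above, with 0-based indices.

```python
import numpy as np

def email_features(word_indices, vocab_size):
    """Binary feature vector: x[i] = 1 if vocabulary word i occurs (hypothetical helper)."""
    x = np.zeros(vocab_size, dtype=int)
    # Repeated indices are harmless: the entry is simply set to 1 again
    x[np.asarray(word_indices, dtype=int)] = 1
    return x

# Words 1 and 3 occur in the email; word 1 occurs twice
x = email_features([1, 3, 1], vocab_size=5)
print(x)        # [0 1 0 1 0]
print(x.sum())  # 2 distinct vocabulary words present
```

The vector records only presence, so an email that repeats a word many times yields the same features as one that uses it once.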