# <div style="text-align:center"> DS7333 - Case Study 6 | SPAM</div>
### <div style="text-align:center">Andy Ho, An Nguyen, Jodi Pafford</div>
<div style="text-align:center">June 17, 2019</div>

# 1 Introduction

This case study will answer Question 20 in Chapter 3 of "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving" (Nolan and Lang). 

>Question 20 is: "In the section called “Classifying New Messages” we used the test set that we had put aside to both select , the threshold for the log odds, and to evaluate the Type I and II errors incurred when we use this threshold. Ideally, we choose from another set of messages that is both independent of our training data and our test data. The method of cross-validation is designed to use the training set for training and validating the model. Implement 5-fold cross-validation to choose and assess the error rate with our training data. To do this, follow the steps: 
    <br> 
    <br> A.) Use the sample() function to permute the indices of the training set, and organize these permuted indices into 5 equal-size sets, called folds. 
    <br> 
    <br> B.) For each fold, take the corresponding subset from the training data to use as a 'test' set. Use the remaining messages in the training data as the training set. Apply the functions developed in the section called “Implementing the Naïve Bayes Classifier” to estimate the probabilities that a word occurs in a message given it is spam or ham, and use these probabilities to compute the log likelihood ratio for the messages in the training set.
    <br> 
    <br> C.) Pool all of the LLR values from the messages in all of the folds, i.e., from all of the training data, and use these values and the typeIErrorRate() function to select a threshold that achieves a 1% Type I error.
    <br> 
    <br> D.) Apply this threshold to our original/real test set and find its Type I and Type II errors."

The data was extracted, cleaned, reformatted and prepared for analysis.  The preparation includes parsing out words from the body of emails and removing stop words. Once the data was prepared, we set aside one third of the data to be used as a validation set.  The remaining two third of the data used as the training set.  The training set was further divided into 5 folds for cross-validation of the Naive-Bayes classification method in determining spam and non-spam emails.

# 2 Background

Spam messages have been sent to email accounts almost since the inception of emails. People have gotten good at recognizing Spam messages, however AI should be able to create filters to also recognize Spam. Doing so would save companies from the risk of employees who aren't as 'savy' clicking links inside Spam emails, potentially saving the company time and money. 

Before an AI can learn to classify messages, programmers (people) must do some basic analysis to find the right parameters to train the AI. Does the computer simply need to look at the sending email address, or does it also need to include the entire message? How many Type I and Type II errors are found in doing any kind of filtering?

# 3 Methods

Following the code from Chapter 3 Sections 3.1 - 3.5, we cleaned the email files. The first step was removing the stop words and punctuation. Once this was complete, we were able to split the messages into Ham and Spam and combine into one file. The final step in preparation was to create a test and train set in order to complete the steps of the assignment.

# 4 Results

#### A)
To create the 5 folds we first created two 5-element lists, one for spam emails and one for ham (non-spam emails).  We used the sample() function to randomly sample ```r floor(trainNumSpam/fold) ``` [1] for the spam emails and ```r floor(trainNumSpam/fold) ``` [2] in the ham emails training set.  With each new fold the index that have already been allocated is removed from the elements from which sample() is choosing from.

[1]

In [19]:
#run appendix code first
floor(trainNumSpam/fold)

[2]

In [20]:
#run appendix code first
floor(trainNumSpam/fold)

#### B)
Once the indexes have been created. For each fold the indexes are then used to create a list of words from the designated test emails, labeling them as ham or spam. The remaining words are used to create a frequency table from which the likelihood-ratio(LLR) are calculated using the words in test set. This process is repeated over the 5 folds resulting in a list of 5 lists of LLR.

#### C)
We then determine the error rate for each word. For each fold all the LLR with error greater than 1% is deteremined. The tau is the minimum LLR from this set. Once all five taus are found we averaged.

Setting the Type I error rate at 1%, we found a tau of 387.465811932302.

#### D)
The Type I and II error is calculated by applying the validation test set to each of the five training set created above. The error rates are determined and then averaged together.

When applied to the validation data set we see that the Type I error is indeed 1% and the Type II error is 8%.

# 5 Conclusion

### References
+ Nolan, D. and Lang, D. T. “Data Science in R.” CRC Press, 2015 (Chapter 3)
+ http://rdatasciencecases.org/
+ https://rstudio-pubs-static.s3.amazonaws.com/351788_b8d5de284dd645a1b920b7bd77e0967b.html

# 6 Appendix - Code

Ok so you can change the dir structure, but this code does depend on data being in certain directories:


`This Notebook`

`    < spam path folder > `

        messages
    
            easy_ham
        
            spam
        
            etc....
        `
`

In [1]:
#load files
spamPath = "./SpamAssassinMessages/"
#List of messages
dirNames = list.files(path = paste(spamPath, "messages",  sep = .Platform$file.sep))
#Findn file names and place in directories
fullDirNames = paste(spamPath, "messages", dirNames, sep = .Platform$file.sep)

In [2]:
#Function to split message into 2 character vectors (header and body)
splitMessage = function(msg) {
  splitPoint = match("", msg)
  header = msg[1:(splitPoint-1)]
  body = msg[ -(1:splitPoint) ]
  return(list(header = header, body = body))
}

#find the boundary string
getBoundary = function(header) {
  boundaryIdx = grep("boundary=", header)
  boundary = gsub('"', "", header[boundaryIdx])
  gsub(".*boundary= *([^;]*);?.*", "\\1", boundary)
}

#drop attachment using boundary info above
dropAttach = function(body, boundary){
  
  bString = paste("--", boundary, sep = "")
  bStringLocs = which(bString == body)
  
  if (length(bStringLocs) <= 1) return(body)
  
  eString = paste("--", boundary, "--", sep = "")
  eStringLoc = which(eString == body)
  if (length(eStringLoc) == 0) 
    return(body[ (bStringLocs[1] + 1) : (bStringLocs[2] - 1)])
  
  n = length(body)
  if (eStringLoc < n) 
     return( body[ c( (bStringLocs[1] + 1) : (bStringLocs[2] - 1), 
                    ( (eStringLoc + 1) : n )) ] )
  
  return( body[ (bStringLocs[1] + 1) : (bStringLocs[2] - 1) ])
}

In [3]:
library(tm)

stopWords = stopwords()

#clean to stop words to remove case and punctuation
cleanSW = tolower(gsub("[[:punct:]0-9[:blank:]]+", " ", stopWords))

#divide stop words into strings into words by splitting the string on blanks
SWords = unlist(strsplit(cleanSW, "[[:blank:]]+"))

#Drop one-letter stop words
SWords = SWords[ nchar(SWords) > 1 ]

stopWords = unique(SWords)

"package 'tm' was built under R version 3.4.4"Loading required package: NLP


In [4]:
#final clean text code (like above but in a function for the message words)
cleanText =
function(msg)   {
  tolower(gsub("[[:punct:]0-9[:space:][:blank:]]+", " ", msg))
}

findMsgWords = 
function(msg, stopWords) {
 if(is.null(msg))
  return(character())

 words = unique(unlist(strsplit(cleanText(msg), "[[:blank:]\t]+")))
 
 # drop empty and 1 letter words
 words = words[ nchar(words) > 1]
 words = words[ !( words %in% stopWords) ]
 invisible(words)
}

In [5]:
#Cleaning the entire group of messages in one larger function that contains the functions of the others
processAllWords = function(dirName, stopWords)
{
       # read all files in the directory
  fileNames = list.files(dirName, full.names = TRUE)
       # drop files that are not email, i.e., cmds
  notEmail = grep("cmds$", fileNames)
  if ( length(notEmail) > 0) fileNames = fileNames[ - notEmail ]

  messages = lapply(fileNames, readLines, encoding = "latin1")
  
       # split header and body
  emailSplit = lapply(messages, splitMessage)
       # put body and header in own lists
  bodyList = lapply(emailSplit, function(msg) msg$body)
  headerList = lapply(emailSplit, function(msg) msg$header)
  rm(emailSplit)
  
       # determine which messages have attachments
  hasAttach = sapply(headerList, function(header) {
    CTloc = grep("Content-Type", header)
    if (length(CTloc) == 0) return(0)
    multi = grep("multi", tolower(header[CTloc])) 
    if (length(multi) == 0) return(0)
    multi
  })
  
  hasAttach = which(hasAttach > 0)
  
       # find boundary strings for messages with attachments
  boundaries = sapply(headerList[hasAttach], getBoundary)
  
       # drop attachments from message body
  bodyList[hasAttach] = mapply(dropAttach, bodyList[hasAttach], 
                               boundaries, SIMPLIFY = FALSE)
  
       # extract words from body
  msgWordsList = lapply(bodyList, findMsgWords, stopWords)
  
  invisible(msgWordsList)
}

In [6]:
#apply the 'processAllWords to the entire directory
msgWordsList = lapply(fullDirNames, processAllWords, 
                      stopWords = stopWords) 

"incomplete final line found on './SpamAssassinMessages//messages/spam/0143.260a940290dcb61f9327b224a368d4af'"

In [7]:
#vector of the number of elements in each list
numMsgs = sapply(msgWordsList, length)
numMsgs

In [8]:
#notating which is spam and which is ham
isSpam = rep(c(FALSE, FALSE, FALSE, TRUE, TRUE), numMsgs)

#one list of all words (all 5 files combined)
msgWordsList = unlist(msgWordsList, recursive = FALSE)

#number of emails
numEmail = length(isSpam)
#number of spam emails
numSpam = sum(isSpam)
#number of ham emails
numHam = numEmail - numSpam

set.seed(418910)

#determine indices
validationSpamIdx = sample(numSpam, size = floor(numSpam/3))
validationHamIdx = sample(numHam, size = floor(numHam/3))

#select word vectors
validationMsgWords = c((msgWordsList[isSpam])[validationSpamIdx],
                 (msgWordsList[!isSpam])[validationHamIdx] )
trainingMsgWords = c((msgWordsList[isSpam])[ - validationSpamIdx], 
                  (msgWordsList[!isSpam])[ - validationHamIdx])

#create test (validation) and train
validationIsSpam = rep(c(TRUE, FALSE), 
                 c(length(validationSpamIdx), length(validationHamIdx)))
trainingIsSpam = rep(c(TRUE, FALSE), 
                 c(numSpam - length(validationSpamIdx), 
                   numHam - length(validationHamIdx)))

In [9]:

#number of folds to produce
fold = 5

trainNumEmail = length(trainingIsSpam)
trainNumSpam = sum(trainingIsSpam)
trainNumHam = trainNumEmail - trainNumSpam

#initialize lists of indexes
foldSpamIdx = list()
foldHamIdx = list()

#create a list of x folds holding a list of indexes
#sample() will not sample indexes that have already been taken for previous folds
for (x in 1:fold){
    foldSpamIdx[x] = list(sample((1:trainNumSpam)[!((1:trainNumSpam) %in% unlist(foldSpamIdx))], size = floor(trainNumSpam/fold)))
    foldHamIdx[x] = list(sample((1:trainNumHam)[!((1:trainNumHam) %in% unlist(foldHamIdx))], size = floor(trainNumHam/fold)))
}

In [10]:
#Function to form a matrix of log likelihood rations and porportions
computeFreqs =
function(wordsList, spam, bow = unique(unlist(wordsList)))
{
   # create a matrix for spam, ham, and log odds
  wordTable = matrix(0.5, nrow = 4, ncol = length(bow), 
                     dimnames = list(c("spam", "ham", 
                                        "presentLogOdds", 
                                        "absentLogOdds"),  bow))

   # For each spam message, add 1 to counts for words in message
  counts.spam = table(unlist(lapply(wordsList[spam], unique)))
  wordTable["spam", names(counts.spam)] = counts.spam + .5

   # Similarly for ham messages
  counts.ham = table(unlist(lapply(wordsList[!spam], unique)))  
  wordTable["ham", names(counts.ham)] = counts.ham + .5  


   # Find the total number of spam and ham
  numSpam = sum(spam)
  numHam = length(spam) - numSpam

   # Prob(word|spam) and Prob(word | ham)
  wordTable["spam", ] = wordTable["spam", ]/(numSpam + .5)
  wordTable["ham", ] = wordTable["ham", ]/(numHam + .5)
  
   # log odds
  wordTable["presentLogOdds", ] = 
     log(wordTable["spam",]) - log(wordTable["ham", ])
  wordTable["absentLogOdds", ] = 
     log((1 - wordTable["spam", ])) - log((1 - wordTable["ham", ]))

  invisible(wordTable)
}

In [11]:
#create a list of word table, 1 for each fold as the test set
trainTable = list()
testMsgWords = list()
testIsSpam = list()
trainIsSpam = list()

for (x in 1:fold){
    testSpamIdx = foldSpamIdx[[x]]
    testHamIdx = foldHamIdx[[x]]

    testMsgWords[[x]] = c((trainingMsgWords[trainingIsSpam])[testSpamIdx],
                     (trainingMsgWords[!trainingIsSpam])[testHamIdx] )
    trainMsgWords = c((trainingMsgWords[trainingIsSpam])[ - testSpamIdx], 
                      (trainingMsgWords[!trainingIsSpam])[ - testHamIdx])

    testIsSpam[[x]] = rep(c(TRUE, FALSE), 
                     c(length(testSpamIdx), length(testHamIdx)))
    trainIsSpam[[x]] = rep(c(TRUE, FALSE), 
                     c(numSpam - length(testSpamIdx), 
                       numHam - length(testHamIdx)))

    trainTable[[x]] = list(computeFreqs(trainMsgWords, trainIsSpam[[x]]))
}

In [12]:
#Function to calculate the log likelihood ration (LLR)
computeMsgLLR = function(words, freqTable) 
{
       # Discards words not in training data.
  words = words[!is.na(match(words, colnames(freqTable)))]

       # Find which words are present
  present = colnames(freqTable) %in% words

  sum(freqTable["presentLogOdds", present]) +
    sum(freqTable["absentLogOdds", !present])
}

In [13]:
#Calculate Log Likelihood Ratio (LLR) for Test Messages
testLLR = list()
for (x in 1:fold){
    testLLR[[x]] = sapply(testMsgWords[[x]], computeMsgLLR, trainTable[[x]][[1]])
}

In [14]:
#finding the Type I Error Rate (how many ham are classified as Spam)
typeIErrorRates = 
function(llrVals, isSpam) 
{
  o = order(llrVals)
  llrVals =  llrVals[o]
  isSpam = isSpam[o]

  idx = which(!isSpam)
  N = length(idx)
  list(error = (N:1)/N, values = llrVals[idx])
}

In [15]:
xI = list()
tau01 = list()

for (x in 1:fold){
    xI[[x]] = typeIErrorRates(testLLR[[x]], testIsSpam[[x]])
    tau01[[x]] = min(xI[[x]]$values[xI[[x]]$error <= 0.01])
}

mean_tau01 = mean(unlist(tau01))
mean_tau01


In [16]:
validationLLR = list()
for (x in 1:fold){
    validationLLR[[x]] = sapply(validationMsgWords, computeMsgLLR, trainTable[[x]][[1]])
}

In [17]:
# Type I and Type II error rate
typeIErrorRate = 
function(tau, llrVals, spam)
{
  classify = llrVals > tau
  sum(classify & !spam)/sum(!spam)
}

typeIIErrorRate = 
function(tau, llrVals, spam)
{
  classify = llrVals > tau
  sum(classify & spam)/sum(spam)
}

In [18]:
tIerror = list()
tIIerror = list()

for (x in 1:fold){
    tIerror[[x]] = typeIErrorRate(mean_tau01, validationLLR[[x]], validationIsSpam) #1 if all is classify as spam
    tIIerror[[x]] = typeIIErrorRate(mean_tau01, validationLLR[[x]], validationIsSpam) #1 if all is classify as ham
}

print(paste("Type I error rate with tau = ", round(mean_tau01, 2), " is ", round(mean(unlist(tIerror))*100), "%.", sep=""))
print(paste("Type II error rate with tau = ", round(mean_tau01, 2), " is ", round(mean(unlist(tIIerror))*100), "%.", sep=""))


[1] "Type I error rate with tau = 387.47 is 1%."
[1] "Type II error rate with tau = 387.47 is 8%."
