#DATASCI W261: Machine Learning at Scale
##Section 3
##Homework, Week 1
##Name: T.Thomas
##Email: tgthomas@berkeley.edu
##Submission Date: 1/22/2016 1:30AM PST

**nbviewer link:**
http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/leu326mah0c148f/MIDS-W261-2016-Hwk-Week01-Thomas.T.ipynb?flush_cache=true

**pdf link:**
https://www.dropbox.com/s/ur1ehwhqi27s4da/MIDS-W261-2016-Hwk-Week01-Thomas.T.pdf?dl=0

##HW1.0.0. Define big data. Provide an example of a big data problem in your domain of expertise. 

Big Data is a general term used to represent very large and/or complex data sets that cannot be consumed for analysis or processing using traditional sequential data processing applications. This applies to limitations in Storage, Processing and Throughput.

The above definition, however, says nothing about how big Big Data really is; the size threshold has evolved over time with technology. Ten years ago 200 GB could have been big. Today 2 terabytes, or any data set that does not fit on a typical hard drive, might be considered big. Processing such data with a traditional application on a single desktop or laptop might be impractical in terms of storage, processing time and throughput.

In the Banking and Finance domain, where I have worked over the past 10 years, a typical 'Big Data' data set could be the millions of daily transactions executed by each of the bank's millions of customers. These transaction data need to be analyzed for different reasons, such as fraud detection, new cross-selling opportunities and customer retention initiatives.

##HW1.0.1. In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, and the irreducible error for a test dataset T when polynomial regression models of degree 1, 2, 3, 4, 5 are considered. How would you select a model?

The general idea is to fit the data set $T$ using polynomial estimators $g_N(x)$ of degree $N = 1, 2, 3, 4, 5$. For each estimator we measure the variance, the bias and the irreducible error. Finally we choose the model with the best trade-off between bias and variance, since the two usually cannot both be minimized at once.

However, to get a good estimate for each model we would need as many potential data sets as possible, and we have only one sample data set to work with, $T$. To simulate potential data sets, we bootstrap: we randomly draw samples from $T$, say 50 times, to generate 50 different sample data sets. We repeat the fitting process for each of these 50 data sets to measure variance and bias.

To measure ***variance*** for each model $g_N(x)$, we compute an average estimator over all 50 data sets for that model, and take the variance to be a measure of how much the estimate from each individual data set deviates from that average estimator. The variance essentially measures how much the individual estimators differ from one another; if there is too much variation, the model has high variance. Formally:

***variance*** = the average squared difference between any single data-set-dependent estimate $g_N(x)$ and the average estimate $E[g_N(x)]$ over all data sets.

***variance*** = $E[(g_N(x) - E[g_N(x)])^2]$

To measure bias for each model, we compute an average estimator over all the data sets and take the bias to be a measure of how much this average estimator deviates from the true model underlying $T$. Formally:

***bias*** = how much the average estimator fit over the data sets, $E[g_N(x)]$, deviates from the value of the underlying target function $f(x)$.

***bias*** = $E[g_N(x)] - f(x)$

In addition to *bias* and *variance* we also determine each model's goodness of fit. For a new data point $(x^*, y^*)$ with

$y^* = f(x^*) + \epsilon$

we consider the ***Expected Prediction Error***, the expected squared difference between the model's prediction $g_N(x^*)$ and the observation $y^*$:

$Err(x^*) = E[(g_N(x^*) - y^*)^2]$

This error may then be decomposed into bias and variance components:

$Err(x^*) = Bias^2 + Variance + Irreducible\ Error$

That third term, irreducible error, is the noise term, that cannot be modeled away. It cannot be reduced by any model. 
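
For reference, a standard way to write this decomposition is given below. It assumes $y^* = f(x^*) + \epsilon$ with $E[\epsilon] = 0$ and $Var(\epsilon) = \sigma^2$, and noise that is independent of the data set used to fit $g_N$. The symbol $\sigma^2$ is introduced here for illustration and does not appear elsewhere in this answer.

$$E\big[(y^* - g_N(x^*))^2\big] = \underbrace{\big(E[g_N(x^*)] - f(x^*)\big)^2}_{Bias^2} + \underbrace{E\big[(g_N(x^*) - E[g_N(x^*)])^2\big]}_{Variance} + \underbrace{\sigma^2}_{Irreducible\ Error}$$

The cross terms vanish because the noise has zero mean and is independent of the fitted model, which is why only these three terms remain.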


<img src="https://theclevermachine.files.wordpress.com/2013/04/bias-variance-tradeoff.png" height="300" width="300" align=left hspace=100 vspace=10 border=100>



   Pseudo code describing the model estimation process

    T = data set with k observations
    f(x) = target function (bias can only be measured when f is known or simulated)

    For N = 1 to 5 #choose polynomial degree
        For i = 1 to 50 #draw 50 bootstrap samples of n observations from T

             S = n observations drawn at random (with replacement) from T
             Gx[i] = Estimate gN(x) fitted on S

        avg_est = AVG over i of Gx[i]
        variance[N] = AVG over i = 1 to 50 [(Gx[i] - avg_est)^2]
        bias[N] = avg_est - f(x)   #squared bias = (avg_est - f(x))^2


Choose the model whose combined squared bias and variance is smallest. If several models have similarly low bias and variance, pick the simpler model where possible.
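
The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not part of the graded answer: the target function `f`, the noise level, the grid of 60 points and the helper names `polyfit`, `polyval` and `bias_variance` are all hypothetical, and the data are simulated so that the true `f` is known (which is what makes the bias measurable).

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    m = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coefs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(A[i][j] * coefs[j] for j in range(i + 1, m))
        coefs[i] = (b[i] - s) / A[i][i]
    return coefs

def polyval(coefs, x):
    """Evaluate a polynomial given coefficients ordered lowest power first."""
    return sum(c * x ** k for k, c in enumerate(coefs))

def bias_variance(xs, ys, f_true, degree, n_boot=50, seed=0):
    """Return (squared bias, variance), averaged over the points in xs."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        coefs = polyfit([xs[i] for i in idx], [ys[i] for i in idx], degree)
        preds.append([polyval(coefs, x) for x in xs])
    avg = [sum(p[j] for p in preds) / n_boot for j in range(n)]   # estimate of E[g_N(x)]
    bias_sq = sum((avg[j] - f_true(xs[j])) ** 2 for j in range(n)) / n
    variance = sum((p[j] - avg[j]) ** 2 for p in preds for j in range(n)) / (n_boot * n)
    return bias_sq, variance

# Simulated data: a hypothetical cubic target plus Gaussian noise.
noise = random.Random(1)
f = lambda t: t ** 3 - 0.5 * t
xs = [-1.0 + 2.0 * k / 59 for k in range(60)]
ys = [f(x) + noise.gauss(0.0, 0.1) for x in xs]

results = {d: bias_variance(xs, ys, f, d) for d in (1, 2, 3, 4, 5)}
best_degree = min(results, key=lambda d: sum(results[d]))
```

Here `best_degree` is the degree with the smallest sum of squared bias and variance. With real data the target function is unknown, so in practice held-out prediction error is usually compared instead.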


##HW1.1. Read through the provided control script (pNaiveBayes.sh)    and all of its comments. When you are comfortable with their purpose and function, respond to the remaining homework questions below. 

In [1]:
print "Done!"

Done!


##HW1.2. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.

***NOTE:***

* The same mapper.py and reducer.py are used for all questions 1.2 - 1.5.
* I tested 1.3 without smoothing, but since I am using the same code for the remaining questions 1.4 - 1.5, the outputs for 1.3 - 1.5 all include smoothing.


In [169]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Tigi Thomas
## Description: mapper code for HW1.2-1.5
import sys
import re

#fixed list of stop words
stopwords = ['a','able','about','across','after','all','almost','also','am','among',
             'an','and','any','are','as','at','be','because','been','but','by','can',
             'cannot','could','dear','did','do','does','either','else','ever','every',
             'for','from','get','got','had','has','have','he','her','hers','him','his',
             'how','however','i','if','in','into','is','it','its','just','least','let',
             'like','likely','may','me','might','most','must','my','neither','no','nor',
             'not','of','off','often','on','only','or','other','our','own','rather','said',
             'say','says','she','should','since','so','some','than','that','the','their',
             'them','then','there','these','they','this','tis','to','too','twas','us',
             'wants','was','we','were','what','when','where','which','while','who',
             'whom','why','will','with','would','yet','you','your']

#Use some pre-compiled re-gex for punctuation and numbers
punctpattern = re.compile(r'[\.\^\$\*\+\-\=\{\}\[\]\\\|\(\)<>-@#%_=!:;,/\'\"]')
numpatt1 = re.compile(r'\b[0-9]{2,100}\b')
numpatt2 = re.compile(r'[0-9]{2,100}\b')
numpatt3 = re.compile(r'\b[0-9]{2,100}')

# Preprocess any text
def preprocess_txt(s): 
    s = s.lower()
    s = re.sub(punctpattern, r'',s)
    s = re.sub(numpatt1, r'', s )
    s = re.sub(numpatt2, r'', s )
    s = re.sub(numpatt3, r'', s )
    return s

# -- check ---
#print sys.argv

#initialize arguments
count = 0
filename = sys.argv[1]
findwords = []

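#set up word list if provided; "*" (or no list at all) means use all words
#(check len(sys.argv) > 2 first so sys.argv[2] is never read when it is missing)
if (len(sys.argv) > 2 and sys.argv[2] != "*"):
    findwords = set(sys.argv[2].lower().split())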

with open (filename, "rU") as myfile:
#Please insert your code
    for line in myfile:
        line = line.lower()
        emaildata = line.split("\t")
        emaildatalen = len(emaildata)
        
        #don't really care about the first item, 
        #start from spam or not.. 
        if(emaildatalen >= 2):            
            if(emaildatalen >= 3 ):
                subject = emaildata[2]
            else:
                subject = " "
            
            if(emaildatalen == 4 ):
                body = emaildata[3]
            else:
                body = " "
            
            #Get subject and body of email together
            emailcontent = subject + " "  + body 
            
            #pre-process to remove punctuation etc.
            emailcontent = preprocess_txt(emailcontent)
            
            #extract words now and filter out stop-words
            words = emailcontent.split()
            filtered_words = [tk for tk in words if tk not in stopwords]

            #we write the ID , current Spam classification, total word count in document
            #the output file will contain one row for every document
            if emaildata[1] == "1":
                outtxt = emaildata[0]+'\t'+ "1" +'\t'+ str(len(filtered_words)) 
            else:
                outtxt = emaildata[0]+'\t'+ "0" +'\t'+ str(len(filtered_words)) 
    
            #Prepare our word count list
            #Add word to dict if in findword list or all words '*' specified
            wc = {}
            for word in filtered_words:
                if( (len(findwords) == 0) or (word in findwords)):
                    wc[word] = wc.get(word, 0) + 1
            
            #Append word=count for every tracked word in the document
            #for faster processing on the reducer side.
            #"*" is a placeholder meaning no tracked word occurred.
            wordcounttxt = "\t*"
            if(len(wc) > 0):
                wordcounttxt = ""
                for word, count in wc.iteritems():
                    wordcounttxt += '\t'+word+"="+str(count)
            
            #write out document row with 
            print outtxt + wordcounttxt

Overwriting mapper.py


In [170]:
!chmod a+x mapper.py
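
Each row the mapper prints has the layout `doc_id<TAB>label<TAB>total_words<TAB>word=count...`, with a lone `*` when no tracked word occurs. The sketch below shows how such a row can be parsed back into its parts. The function name `parse_mapper_record` and the sample row (including its counts) are hypothetical and only illustrate the format.

```python
def parse_mapper_record(line):
    """Split one mapper output row into (doc_id, label, total_words, word_counts)."""
    fields = line.rstrip("\n").split("\t")
    doc_id, label, total_words = fields[0], fields[1], int(fields[2])
    word_counts = {}
    for field in fields[3:]:
        if field.strip() == "*":      # placeholder: no tracked word in this document
            continue
        word, count = field.split("=", 1)
        word_counts[word] = int(count)
    return doc_id, label, total_words, word_counts

# Hypothetical rows in the mapper's output format.
with_words = parse_mapper_record("0016.2003-12-19.gp\t1\t120\tassistance=2\tvalium=1\n")
without_words = parse_mapper_record("0001.2000-01-17.beck\t0\t35\t*\n")
```

Parsing counts with `int()` keeps the reducer side from having to evaluate arbitrary text.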

#Reducer Code. 

In [199]:
%%writefile reducer.py
#!/usr/bin/python
## reducer.py
## Author: T.Thomas
## Description: reducer code for HW1.2-1.5
import sys
import math


def printcollection(Title, coll):
    if(Title != ""):
        print ""
        print Title          
    for k, v in coll.iteritems():
        print k, v


# some dict structures to keep track of intermediate data
wc = {}
spwc = {}
nb = {}
w_condprob = {}
testdata = {}

#Going to add smoothing params if specified
lp_smooth = False
if (sys.argv[1] == "1"):
    lp_smooth = True

# -- check ---
#print sys.argv

# we skip the very first argument and the rest are the file 
# names that are being passed in
for filename in sys.argv[2:]:
    #open each file and get the counts into Wc
    with open (filename, "rU") as myfile:
        for line in myfile:
            item = line.split("\t")
            
            #extract total spam=1 and spam=0 doc/email counts
            nb["spam"+item[1]+"dc"] = nb.get("spam"+item[1]+"dc", 0) + 1
            #extract spam=1 and spam=0 word counts
            #(int() rather than eval(): the fields are plain integers)
            nb["spam"+item[1]+"wc"] = nb.get("spam"+item[1]+"wc", 0) + int(item[2])

            #extract the words and their counts per document
            #this way we have how many times word occurs in document
            #and how many times it occurs in a spam document
            if(item[3].strip() != "*"):
                for itm in item[3:]:
                    word, count = itm.strip().split("=",1)
                    wc[word] = wc.get(word, 0) + int(count)
                    if(item[1] == "1"):                  
                        spwc[word] = spwc.get(word, 0) + int(count)
                    #nb[item[0]+"_InSpamCount"] = nb.get(item[0]+"_InSpamCount", 0) + eval(item[2])
            testdata[item[0]] = item[1:]

#Calculate prior probabilities - these will be used everywhere.            
prior_spam =  ( nb["spam1dc"] / float( nb["spam1dc"] + nb["spam0dc"] ) ) 
prior_notspam = 1 - prior_spam 

#setup smoothing parameters if smoothing was specified. see line #58-59 in pNaiveBayes.sh
lp_num = 0
lp_denom = 0   
if (lp_smooth):
    lp_num = 1
    lp_denom = len(wc) 
    
#Here we pre-compute the conditional probability for each word
#P(word in email | spam) &  #P(word in email | not spam)
for word, count in wc.iteritems():
    in_spam_count = spwc.get(word,0)
    not_in_spam_count = (count - in_spam_count)
    spam_wordcount = nb.get("spam1wc")
    not_spam_wordcount = nb.get("spam0wc")
    
    #NOTE: with smoothing on, lp_num (1) is added to both numerator and denominator.
    #Textbook Laplace smoothing would add lp_denom (the vocabulary size) to the denominator;
    #lp_denom is computed above but not used here.
    spam_condprob =  (in_spam_count + lp_num)/ float(spam_wordcount + lp_num)
    notspam_condprob = (not_in_spam_count + lp_num) / float(not_spam_wordcount + lp_num)
    
    # ----------- Debug check if we have any prob that are 0.0 -----------------
    #if(spam_condprob == 0.0 or notspam_condprob == 0.0):
    #   print word, count, in_spam_count,  not_in_spam_count   
    
    w_condprob[word] = [spam_condprob, notspam_condprob]
          
#print all words and the count
#printcollection("-----------Total Word Count-----------------", wc)

#print in-spam count for each word
#printcollection("-----------In Spam Word Count-----------------", spwc)

#print some NB model parameters
printcollection("-----------Counts for Naive Bayes Model -------------", nb)

#print computed probabilities after running thru training set
print ""
print "------Computed Probabilities from Training Set---"
print "P(prior_spam) = {0:.5f}".format(prior_spam)
print "P(prior_not_Spam) = {0:.5f}".format(prior_notspam)

print ""
#print "Word [ P(spam | word in email), P(not spam | word in email) ]"
#print "-------------------------------------------------------------"
#printcollection("", w_condprob)


#print classification - classify test data set
print ""
print "RESULTS: Classification of Test Data ----------"
print ""
print "ID  Truth(Spam/Ham : 1/0) Class(Spam/Ham : 1/0)"
print "-----------------------------------------------"
result = {}

#run the model on our data set classifying all emails.
#note: 
#the first item in w_condprob, namely w_condprob[word][0] stores P(word in email | spam)
#the second item in w_condprob, namely w_condprob[word][1] stores P(word in email | not spam)
doc_count = len(testdata)
error_count = 0
for docid, data in testdata.iteritems():
    #data[0] is current classification
    #data[1] is total word count in the document
    #data[2:] contains all the individual words and their counts;
    #         for each word we add count * log P(word in email | class)
    #         to the log prior of each class
    p_spam = math.log(prior_spam)
    p_notspam = math.log(prior_notspam)
    if(data[2].strip() != "*"):        
        for itm in data[2:]:            
            word, count = itm.strip().split("=",1)
            p_spam += math.log(w_condprob[word][0])*int(count)
            p_notspam += math.log(w_condprob[word][1])*int(count)
        
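    predicted = "1" if (p_spam > p_notspam) else "0"
    print docid, data[0], predicted
    #count a misclassification in either direction (spam as ham, or ham as spam)
    if(data[0] != predicted):
        error_count += 1

print ""
#float() avoids Python 2 integer division, which would truncate the rate to 0
print "Model Error Rate = {0:.5f}".format(error_count / float(doc_count))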

print 

Overwriting reducer.py


In [196]:
!chmod a+x reducer.py

Write pNaiveBayes.sh to file

***NOTES/ASSUMPTIONS:*** 
* The "enronemail_1h.txt" file used here was re-saved to a new file to fix line endings which `wc -l` could not recognize.
* I am using smoothing for all examples, mainly to handle the log(0) issue.
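
The log(0) issue mentioned above can be illustrated with a small sketch. It uses the textbook additive (Laplace) form, which adds `alpha` to the count and `alpha * vocab_size` to the class total; note that the reducer in this notebook instead adds 1 to both numerator and denominator. The function name `word_log_prob` and the vocabulary size of 5000 are assumptions made for illustration only.

```python
import math

def word_log_prob(word_count, class_word_total, vocab_size, alpha=1.0):
    """log P(word | class) with additive smoothing; alpha=0 disables smoothing."""
    numerator = word_count + alpha
    denominator = class_word_total + alpha * vocab_size
    if numerator == 0:
        # math.log(0) raises ValueError, so report the limit explicitly.
        return float("-inf")
    return math.log(numerator / float(denominator))

# A word never seen in the class: unsmoothed vs. smoothed (hypothetical vocabulary size).
unsmoothed = word_log_prob(0, 8054, 5000, alpha=0)
smoothed = word_log_prob(0, 8054, 5000, alpha=1.0)
```

Without smoothing, a single unseen word drives the class score to negative infinity regardless of the other words; with smoothing the score stays finite.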

In [173]:
%%writefile pNaiveBayes.sh
## pNaiveBayes.sh
## Author: Jake Ryland Williams
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.

## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used

## a test set data of 100 messages
data="enronemail_1h.txt" 

## the full set of data (33746 messages)
# data="enronemail.txt" 

## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## determine the lines per chunk for the desired number of processes
## bc is the unix basic calculator: you pipe an expression to it and
## it prints the result.
linesinchunk=`echo "$linesindata/$m+1" | bc`

## split the original file into chunks by line
## do the split by the lines data.
split -l $linesinchunk $data $data.chunk.

## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ./mapper.py $datachunk "$wordlist" > $datachunk.counts &
    ####
    ####
done
## wait for the mappers to finish their work
wait

## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`
## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
##### ********* NOTE THE EXTRA 1 as first parameter to the reducer.. I am enforcing Smoothing for log(0) error
./reducer.py 1 $countfiles > $data.output
####
####

## clean up the data chunks and temporary count files
\rm $data.chunk.*

Overwriting pNaiveBayes.sh
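
The chunking arithmetic in the script (`linesindata/m+1` lines per chunk, then `split -l`) can be checked with a short sketch. The function name `chunk_sizes` is hypothetical; the integer division mirrors what `bc` does by default.

```python
def chunk_sizes(lines_in_data, m):
    """Sizes of the chunks produced by splitting with lines_in_data/m+1 lines each."""
    lines_in_chunk = lines_in_data // m + 1   # integer division, as in bc
    sizes = []
    remaining = lines_in_data
    while remaining > 0:
        size = min(lines_in_chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

# 100 messages and 5 mappers, as used in this notebook.
sizes = chunk_sizes(100, 5)
```

For 100 lines and 5 mappers this gives four chunks of 21 lines and one of 16. Because of the `+1`, some inputs produce fewer than `m` chunks (for example 4 lines with `m=4` gives two chunks of 2).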


#Run the file

In [174]:
!chmod a+x pNaiveBayes.sh

##HW1.2. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.

***NOTE:***
The same mapper and reducer are used for 1.2 - 1.5 but some code will be commented out to reduce outputs based on the HW question

USAGE:
./pNaiveBayes.sh number_of_parallel_mappers "some word list"

In [177]:
!./pNaiveBayes.sh 5 "assistance"

Output for "assistance" word count

In [178]:
!cat ./enronemail_1h.txt.output


-----------Total Word Count-----------------
assistance 10

-----------In Spam Word Count-----------------
assistance 8

-----------Counts for Naive Bayes Model -------------
spam0wc 8054
spam0dc 56
spam1dc 44
spam1wc 10666

------Computed Probabilities from Training Set---
P(prior_spam) = 0.44000
P(prior_not_Spam) = 0.56000

Word [ P(spam | word in email), P(not spam | word in email) ]
-------------------------------------------------------------
assistance [0.0008437236336364489, 0.00037243947858472997]


##HW1.3. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by a single, user-specified word using the multinomial Naive Bayes Formulation. Examine the word “assistance” and report your results. 

To do so, make sure that
  
   * mapper.py and
   * reducer.py 

perform a single-word Naive Bayes classification. For multinomial Naive Bayes, Pr(X=“assistance”|Y=SPAM) is calculated as follows:

   * the number of times “assistance” occurs in SPAM labeled documents / the number of words in documents labeled SPAM 
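
As a worked example of this formula, the sketch below plugs in the counts reported in the HW1.2 output above ("assistance" occurs 10 times in total, 8 of them in spam; 10666 words in spam documents and 8054 in non-spam documents). The smoothed values follow what the reducer in this notebook does (add 1 to numerator and denominator); the variable names are illustrative.

```python
# Counts taken from the reported output for "assistance".
spam_count, total_count = 8, 10
ham_count = total_count - spam_count
spam_words, ham_words = 10666, 8054

# Unsmoothed multinomial estimates: occurrences / words in the class.
p_word_given_spam = spam_count / float(spam_words)
p_word_given_ham = ham_count / float(ham_words)

# Smoothed as in this notebook's reducer: +1 in numerator and denominator.
p_word_given_spam_smoothed = (spam_count + 1) / float(spam_words + 1)
p_word_given_ham_smoothed = (ham_count + 1) / float(ham_words + 1)
```

The smoothed values reproduce the pair printed for "assistance" in the output, which confirms that the reported probabilities include smoothing.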

In [181]:
!./pNaiveBayes.sh 5 "assistance"

Output word counts and Classification for "assistance"

In [182]:
!cat ./enronemail_1h.txt.output


-----------Total Word Count-----------------
assistance 10

-----------In Spam Word Count-----------------
assistance 8

-----------Counts for Naive Bayes Model -------------
spam0wc 8054
spam0dc 56
spam1dc 44
spam1wc 10666

------Computed Probabilities from Training Set---
P(prior_spam) = 0.44000
P(prior_not_Spam) = 0.56000

Word [ P(spam | word in email), P(not spam | word in email) ]
-------------------------------------------------------------
assistance [0.0008437236336364489, 0.00037243947858472997]

--------------- Classify Test Data ----------------

ID  Truth(Spam/Ham : 1/0) Class(Spam/Ham : 1/0)
-----------------------------------------------
0001.2000-01-17.beck 0 0
0018.1999-12-14.kaminski 0 0
0004.2001-06-12.sa_and_hp 1 0
0016.2003-12-19.gp 1 0
0015.2001-07-05.sa_and_hp 1 0
0009.1999-12-14.farmer 0 0
0015.2001-02-12.kitchen 0 0
0017.1999-12-14.kaminski 0 0
0006.1999-12-13.kaminski 0 0
0003.2000-01-17.beck 0 0
0007.2004-08-01.bg 1 0
0008

##HW1.4. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results.

To do so, make sure that

   * mapper.py counts all occurrences of a list of words, and
   * reducer.py 

performs the multiple-word multinomial Naive Bayes classification via the chosen list.
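
The multiple-word decision rule can be sketched as a sum of log probabilities. The priors and conditional probabilities below are copied from the output reported for this question; the function name `classify` and the choice to skip words with no stored probability (such as "enlargementwithatypo", which does not appear in the word counts) are assumptions made for this illustration.

```python
import math

PRIOR = {"spam": 0.44, "ham": 0.56}
# word -> (P(word | spam), P(word | not spam)), copied from the run output.
COND_PROB = {
    "assistance": (0.0008437236336364489, 0.00037243947858472997),
    "valium": (0.00037498828161619947, 0.00012414649286157667),
}

def classify(word_counts):
    """Return 1 (spam) or 0 (ham) for a dict of word -> count in one email."""
    log_spam = math.log(PRIOR["spam"])
    log_ham = math.log(PRIOR["ham"])
    for word, count in word_counts.items():
        if word not in COND_PROB:
            continue   # assumption: words without a stored probability are ignored
        p_spam, p_ham = COND_PROB[word]
        log_spam += count * math.log(p_spam)
        log_ham += count * math.log(p_ham)
    return 1 if log_spam > log_ham else 0

no_listed_words = classify({})
one_valium = classify({"valium": 1})
```

An email containing none of the listed words is decided by the priors alone, so it is labeled ham (0.44 < 0.56); this is why most emails are classified as 0 when only a few words are used.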

In [183]:
!./pNaiveBayes.sh 5 "assistance valium enlargementWithATypo"

Output word counts and Classification for "assistance valium enlargementWithATypo"

In [184]:
!cat ./enronemail_1h.txt.output


-----------Total Word Count-----------------
assistance 10
valium 3

-----------In Spam Word Count-----------------
assistance 8
valium 3

-----------Counts for Naive Bayes Model -------------
spam0wc 8054
spam0dc 56
spam1dc 44
spam1wc 10666

------Computed Probabilities from Training Set---
P(prior_spam) = 0.44000
P(prior_not_Spam) = 0.56000

Word [ P(spam | word in email), P(not spam | word in email) ]
-------------------------------------------------------------
assistance [0.0008437236336364489, 0.00037243947858472997]
valium [0.00037498828161619947, 0.00012414649286157667]

--------------- Classify Test Data ----------------

ID  Truth(Spam/Ham : 1/0) Class(Spam/Ham : 1/0)
-----------------------------------------------
0001.2000-01-17.beck 0 0
0018.1999-12-14.kaminski 0 0
0004.2001-06-12.sa_and_hp 1 0
0016.2003-12-19.gp 1 1
0015.2001-07-05.sa_and_hp 1 0
0009.1999-12-14.farmer 0 0
0015.2001-02-12.kitchen 0 0
0017.1999-12-14.kaminski 0 0
0006.19

##HW1.5. Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by all words present.

   To do so, make sure that

   * mapper.py counts all occurrences of all words, and
   * reducer.py performs a word-distribution-wide Naive Bayes classification.

In all cases, mapper.py will read in a portion of the email data, count some words and print out counts to a file.

***NOTE:***
Word Counts and other Intermediate results are not shown here (too long to list). This was just done by commenting code in the reducer.py and printing only some high level info and final classification results.

In [200]:
!./pNaiveBayes.sh 5 "*"

Output word counts and Classification using all words

In [201]:
!cat ./enronemail_1h.txt.output


-----------Counts for Naive Bayes Model -------------
spam0wc 8054
spam0dc 56
spam1dc 44
spam1wc 10666

------Computed Probabilities from Training Set---
P(prior_spam) = 0.44000
P(prior_not_Spam) = 0.56000


RESULTS: Classification of Test Data ----------

ID  Truth(Spam/Ham : 1/0) Class(Spam/Ham : 1/0)
-----------------------------------------------
0001.2000-01-17.beck 0 0
0018.1999-12-14.kaminski 0 0
0004.2001-06-12.sa_and_hp 1 1
0016.2003-12-19.gp 1 1
0015.2001-07-05.sa_and_hp 1 1
0009.1999-12-14.farmer 0 0
0015.2001-02-12.kitchen 0 0
0017.1999-12-14.kaminski 0 0
0006.1999-12-13.kaminski 0 0
0003.2000-01-17.beck 0 0
0007.2004-08-01.bg 1 1
0008.2001-06-12.sa_and_hp 1 1
0007.2001-02-09.kitchen 0 0
0008.2004-08-01.bg 1 1
0010.1999-12-14.kaminski 0 0
0015.2000-06-09.lokay 0 0
0017.2003-12-18.gp 1 1
0016.1999-12-15.farmer 0 0
0012.2003-12-19.gp 1 1
0012.2001-02-09.kitchen 0 0
0005.1999-12-12.kaminski 0 0
0007.2003-12-18.gp 1 1
0014.2004-08-01.bg 1 

##HW1.6 Benchmark your code with the Python SciKit-Learn implementation of multinomial Naive Bayes

* Run the Multinomial Naive Bayes algorithm (using default settings) from SciKit-Learn over the same training data used in HW1.5 and report the Training error (please note some data preparation might be needed to get the Multinomial Naive Bayes algorithm from SciKit-Learn to run over this dataset)
* Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) over the same training data used in HW1.5 and report the Training error 
* Run the Multinomial Naive Bayes algorithm you developed for HW1.5 over the same data used in HW1.5 and report the Training error 
* Please prepare a table to present your results

***NOTE:***
* The following code assumes "enronemail_1h.txt" is in the same folder as this notebook.
* Results from HW1.5 are hard coded into the results data set for convenience.
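
The training error requested above is the fraction of training emails whose predicted label differs from the true label, counting mistakes in both directions. A minimal sketch follows; the function name `training_error` and the sample labels are hypothetical.

```python
def training_error(predicted, truth):
    """Fraction of examples where the predicted label differs from the true label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    mistakes = sum(1 for p, t in zip(predicted, truth) if p != t)
    return mistakes / float(len(truth))

# Hypothetical labels: one spam missed and one ham flagged out of four emails.
example_error = training_error(["1", "0", "0", "1"], ["1", "1", "0", "0"])
```

Accuracy is then `1 - training_error(...)`, which is how the table below relates its two numeric columns.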

In [206]:
#!/usr/bin/python
## 
## Author: Tigi Thomas
## Description: Benchmark Classifier using Scikit Learn for HW 1.6
import sys
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

from IPython.display import display, HTML


#fixed list of stop words
stopwords = ['a','able','about','across','after','all','almost','also','am','among',
             'an','and','any','are','as','at','be','because','been','but','by','can',
             'cannot','could','dear','did','do','does','either','else','ever','every',
             'for','from','get','got','had','has','have','he','her','hers','him','his',
             'how','however','i','if','in','into','is','it','its','just','least','let',
             'like','likely','may','me','might','most','must','my','neither','no','nor',
           'not','of','off','often','on','only','or','other','our','own','rather','said',
             'say','says','she','should','since','so','some','than','that','the','their',
             'them','then','there','these','they','this','tis','to','too','twas','us',
             'wants','was','we','were','what','when','where','which','while','who',
             'whom','why','will','with','would','yet','you','your']

#Use some pre-compiled re-gex for punctuation and numbers
punctpattern = re.compile(r'[\.\^\$\*\+\-\=\{\}\[\]\\\|\(\)<>-@#%_=!:;,/\'\"]')
numpatt1 = re.compile(r'\b[0-9]{2,100}\b')
numpatt2 = re.compile(r'[0-9]{2,100}\b')
numpatt3 = re.compile(r'\b[0-9]{2,100}')
stpwordspattern = re.compile(r'\b(' + r'|'.join(stopwords) + r')\b')

# Preprocess any text
def preprocess_txt(s): 
    s = s.strip().lower()
    s = re.sub(punctpattern, r'',s)
    s = re.sub(numpatt1, r'', s )
    s = re.sub(numpatt2, r'', s )
    s = re.sub(numpatt3, r'', s )
    s = re.sub(stpwordspattern, r'',s)
    return s

email = []
emailclass = []
data = []

# Read Data File and setup Prep Data for Scikit Learn Model
filename = "enronemail_1h.txt"
with open (filename, "rU") as myfile:
    for line in myfile:
        line = line.lower()
        emaildata = line.split("\t")
        emaildatalen = len(emaildata)
        
        #don't really care about the first item, 
        #start from spam or not.. 
        if(emaildatalen >= 2):            
            if(emaildatalen >= 3 ):
                subject = emaildata[2]
            else:
                subject = " "
            
            if(emaildatalen == 4 ):
                body = emaildata[3]
            else:
                body = " "
                    
            emailcontent = subject + " "  + body 
            
            #pre-process to remove punctuation etc.
            emailcontent = preprocess_txt(emailcontent).strip();
            
            data.append([emaildata[0], emailcontent])
            email.append(emailcontent)
            emailclass.append(emaildata[1])       
                      

# Shuffle and create Training and Test data Set- both are same for now.
emails = np.array(data)
X, Y = np.array(email), np.array(emailclass)

np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]
emails = emails[shuffle]

#Both Test and Train are same data set as 
#per instructions in HW.
train_data, train_labels  = X, Y
test_data,  test_labels   = X, Y

##checks
#print emails.shape
#print 'training label shape:', train_labels.shape
#print 'test label shape:', test_labels.shape

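#Initialize Count Vectorizer
count_vect = CountVectorizer()
#fit the vocabulary on the training data only, then reuse it for the test data
#(calling fit_transform on the test data would rebuild the vocabulary)
fitX_train_counts = count_vect.fit_transform(train_data)
fitX_test_counts  = count_vect.transform(test_data)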


# Run thru the different Scikit Learn Models and Compare accuracy.

print "Error Rates: Benchmark Comparison between Scikit Learn and HW1.5 "
print "----------------------------------------------------------------"
results = []

# ----------------------- MultinomialNB --------------------------
mnb = MultinomialNB() #(alpha = a)
mnb.fit(fitX_train_counts, train_labels)

# Predict using our Model
mnb_predicted = mnb.predict(fitX_test_counts)
accuracy  = round(np.where(mnb_predicted == test_labels, 1, 0).sum() / float(len(test_data)),5)
error = round(1 - accuracy,5)
#print 'Error Rate : Scikit-Learn Multinomia NB = {0:.5f}'.format(error)
results.append(['Scikit MultinomialNB', accuracy, error])

# ----------------------- BernoulliNB --------------------------
bnb = BernoulliNB() #(alpha = a)
bnb.fit(fitX_train_counts, train_labels)

# Predict using our Model
bnb_predicted = bnb.predict(fitX_test_counts)
accuracy = round(np.where(bnb_predicted == test_labels, 1, 0).sum() / float(len(test_data)),5)
error = round(1 - accuracy,5)
#print 'Error Rate : Sciki-Learn Bernoulli NB = {0:.5f}'.format(error)
results.append(['Scikit BernoulliNB', accuracy, error])

# ----------------------- HW1.5 --------------------------
#Add in error rate from HW1.5 above
#print 'Error Rate : HW1.5 MulinomilNB = {0:.5f}'.format(0.0)
results.append(['HW1.5 MulinomilNB', 1.0, 0.0])

# Tabularize the Results
results = pd.DataFrame(results, columns=["Model", "Accuracy", "Error"])
display(results)
#HTML(results.to_html())

Error Rates: Benchmark Comparison between Scikit Learn and HW1.5 
----------------------------------------------------------------


| | Model | Accuracy | Error |
|---|---|---|---|
| 0 | Scikit MultinomialNB | 1.0 | 0.0 |
| 1 | Scikit BernoulliNB | 0.77 | 0.23 |
| 2 | HW1.5 MulinomilNB | 1.0 | 0.0 |


##Explain/justify any differences in terms of training error rates over the dataset in HW1.5 between your Multinomial Naive Bayes implementation (in Map Reduce) versus the Multinomial Naive Bayes implementation in SciKit-Learn (Hint: smoothing, which we will discuss in next lecture)

With both the Scikit MultinomialNB and the MultinomialNB from HW1.5, the training error rate was zero. In both cases the entire data set was used both to train and to test the model. Additionally, smoothing was used in HW1.5, and the default Scikit MultinomialNB `alpha` (the additive Laplace smoothing parameter; 0 means no smoothing) is 1.0, which gives similar results.

##Discuss the performance differences in terms of training error rates over the dataset in HW1.5 between the  Multinomial Naive Bayes implementation in SciKit-Learn with the  Bernoulli Naive Bayes implementation in SciKit-Learn

BernoulliNB implements naive Bayes assuming each feature is a binary-valued (Bernoulli, boolean) variable, so samples must be represented as binary-valued feature vectors. The decision rule for BernoulliNB explicitly penalizes the non-occurrence of a feature, whereas MultinomialNB simply ignores a feature that does not occur. Because the feature vectors record only whether a word occurs rather than how often, word-frequency information is lost. This is a likely reason for the higher training error of BernoulliNB on this data set (0.23 versus 0.0), and the effect should be larger for longer documents.
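
The difference between the two decision rules can be sketched for a single class. This is an illustration only, not SciKit-Learn's implementation: the function names and the probability values are hypothetical, and the two models use different parameters (per-word occurrence shares for the multinomial model, per-word presence probabilities for the Bernoulli model).

```python
import math

def multinomial_log_likelihood(doc_counts, word_probs):
    """Sum count * log P(word | class) over the words that occur in the document."""
    return sum(c * math.log(word_probs[w]) for w, c in doc_counts.items())

def bernoulli_log_likelihood(doc_counts, presence_probs):
    """Sum over the whole vocabulary: log p if the word is present, log(1 - p) if absent."""
    total = 0.0
    for word, p in presence_probs.items():
        present = doc_counts.get(word, 0) > 0
        total += math.log(p) if present else math.log(1.0 - p)
    return total

# Hypothetical class parameters for a two-word vocabulary.
word_probs = {"valium": 0.01, "assistance": 0.02}
presence_probs = {"valium": 0.2, "assistance": 0.5}

once = {"valium": 1}
thrice = {"valium": 3}
multinomial_scores = (multinomial_log_likelihood(once, word_probs),
                      multinomial_log_likelihood(thrice, word_probs))
bernoulli_scores = (bernoulli_log_likelihood(once, presence_probs),
                    bernoulli_log_likelihood(thrice, presence_probs))
```

The multinomial score changes when "valium" occurs three times instead of once, while the Bernoulli score does not; the Bernoulli score also includes a `log(1 - p)` term for the absent word "assistance".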