# MIDS W261 Machine Learning At Scale

Christopher Llop | christopher.llop@ischool.berkeley.edu <br>
Week 1 | Submission Date:

<span style="color:red">[Placeholder for introduction to assignment]</span>

<span style="color:silver"><b>HW1.0.0.</b> Define big data. Provide an example of a big data problem in your domain of expertise. </span>

<span style="color:green"><b>Answer:</b></span> In short, big data refers to problems using large or complex data sets that cannot quickly and easily be processed by a single machine using "traditional" methods of data-processing. There are several reasons why data could be not easy to process by traditional means. 

Many people talk about the 3 (or 4) V's: Velocity, Volume, Variety and Varacity. These lead to data challenges of processing, storage, or throughput that can be addressed by some "big data" techniques.

In my current domain of economic and litigation consulting, most of our data processing can be handled by traditional means. However, we are starting to run into situations that push the boundaries of what we can do by traditional means. For example:
- Quickly processing 104,000 Analyst Report text documents
- Analyzing multiple TB of credit card transaction records at stake in litigation
- Modeling complicated relationships between nodes in the electric system at an hourly level over decades


<span style="color:silver"><b>HW1.0.1.</b> In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, the error for a test dataset T when using polynomial regression models of degree 1, 2,3, 4,5 are considered. How would you select a model?</span>


<span style="color:green"><b>Answer:</b></span>

<b>Bias:</b> We can estimate the bais by taking bootstrap samples of our data (resampling), building many of each  polynomial model, and useing these models to calculate average predicted values for each type of model. By comparing average predicted values of data to the actual values, we can determine how off our model is on average. If the average prediction values are very different from the actual values, bais is high. If they are similar, bais is low.
$$E[H(x^*)] - f(x^*)$$

<b>Variance:</b> We can estimate the variances through a similar bootstrapping process. For each polynomial model, we can measure how consistent the predictions are from one bootstrap to the next. If the predictions for an example are all similar (tightly clustered), the variance is low. If they are dispersed, the variance is high. We can use the expected value formula for variance for the calculation: 
$$E[(h(x^*) - E[h(x^*)])^2]$$

<b>Error:</b> The error can be evaluated by running our predictions on a held out test set, then evaluating how far off we are using a loss function such as squared prediction error. Because the model was not trained on this held out set, it will (likely) not be impacted by overfitting.  We can then compare the error of each polynomial and select the model with the least error on the held out test set. As shown in lecture and class readings, a similar result can be obtained by minimzing the $bais^2 + variance$.


<span style="color:silver"><b>HW1.1.</b></style> Read through the provided control script (pNaiveBayes.sh) and all of its comments. When you are comfortable with their purpose and function, respond to the remaining homework questions below. A simple cell in the notebook with a print statmement with  a "done" string will suffice here. (dont forget to include the Question Number and the quesition in the cell as a multiline comment!)

In [150]:
# HW 1.1: Read through pNaiveBayes and all comments to 
#         become comfortable with the code.

def HW1_1():
    print "done"

HW1_1()

done


<span style="color:silver"><b>HW1.2.</b> Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.</span>

   <span style="color:silver">To do so, make sure that:</span>
   - <span style="color:silver">mapper.py counts all occurrences of a single word, and</span>
   - <span style="color:silver">reducer.py collates the counts of the single word.</span>
   
   

In [350]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

# Get input parameters
findword = sys.argv[2]
findfile = sys.argv[1]
term_hits = {}

with open (findfile, "r") as myfile:
    for full_email in myfile:
        try:
            # Spam classification
            is_spam = re.findall("\t([0-1])\t",full_email)[0]

            # Parse out email body for processing. Find body using "tab spam/ham tab"
            # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine.
            keyword = re.findall("\t[0-1]\t",full_email)[0]
            email_id, is_spam_tabbed, email_body = full_email.partition(keyword)
            email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)
            email_len = len(email_body.split())

            # Number of hits for search word. Using a dictionary now will allow us to use
            #    a similar Mapper output format when using multiple search terms later 
            #    in this problem set.
            term_hits[findword] = len(re.findall(findword,email_body))
            
            # Print as tuple with unique splitter "|||"
            print "{} ||| {} ||| {}".format(is_spam, email_len, term_hits)

        except:
            pass
            

Overwriting mapper.py


In [351]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast

# Get input parameters - list of file names
filelist = sys.argv[1:]

term_sum = {}

# Open each map result
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        # Process each emaul one at a time
        for processed_email in openfile:
            # Read data for email
            processed_email = processed_email.split(" ||| ")
            count_dict = ast.literal_eval(processed_email[2])
            
            # Fold (sum) together the results from mapping
            for key, value in count_dict.iteritems():
                term_sum[key] = term_sum.get(key,0) + value

# Print results
for key, value in term_sum.iteritems():
    print "The word count for '{}' is {}".format(key, value)


Overwriting reducer.py


In [352]:
# Use chmod for permissions
!chmod a+x mapper.py
!chmod a+x reducer.py
!chmod a+x pNaiveBayes.sh

In [353]:
# HW 1.2: Create map/reduce pair that determins occurances of a single word
def HW1_2():
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 10 "assistance"

    # Print the output file contents to screen
    with open ("enronemail_1h.txt.output", "r") as openfile:
        print "Result:", openfile.read()

    # Crosscheck results (data is small enough to use RE in python)
    with open ("enronemail_1h.txt", "r") as myfile:
        print "Check Result:", len(re.findall("assistance",myfile.read()))
        
HW1_2()

Result: The word count for 'assistance' is 10

Check Result: 10


<span style="color:silver"><b>HW1.3. </b>Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a single, user-specified word. Examine the word “assistance” and report your results. To do so, make sure that</span>
   
   - <span style="color:silver">mapper.py is same as in part (2), and</span>
   - <span style="color:silver">reducer.py performs a single word Naive Bayes classification.</span>

In [354]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast

filelist = sys.argv[1:]

spam_term_count = 0
ham_term_count = 0
spam_count = 0
ham_count = 0
spam_len = 0
ham_len = 0

# Open each file and build Multinomial Naive Bayes model
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            is_spam = int(processed_email[0])
            email_len = int(processed_email[1])
            count_dict = ast.literal_eval(processed_email[2])
            
            # Build counts for spam and ham definitions. Note, we assume dictionaries with
            #    one entry as requested in the problem. We will make this robust to larger
            #    dictionaries in the next problem. This code looks awkward because part of
            #    it is a placeholder for the next problem.
            if is_spam:
                for key, value in count_dict.iteritems():
                    spam_term_count += value
                spam_count += 1
                spam_len += email_len
            else:
                for key, value in count_dict.iteritems():
                    ham_term_count += value
                ham_count += 1
                ham_len += email_len
    
# Calculate our priors based on the overall ratio of spam to ham
spam_prior = float(spam_count) / (spam_count + ham_count)
ham_prior = 1 - spam_prior

# Calculate our conditional probabilites for the search term using MNB formulas
search_given_spam = (spam_term_count + 1.0) / (spam_len + 1.0)
search_given_ham = (ham_term_count + 1.0) / (ham_len + 1.0)

# Open each file and predict. Note - prediction is embaressingly parallel and could
#   be done effectively via Mapping. However, we were asked not to modify pNaiveBayes.sh
#   as part of this assignment. As a result, I will predict here in the reducer although
#   it is not the most efficient method.
accuracy = []
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Defaults
            term_count = 0
            pred_spam = 0

            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            is_spam = int(processed_email[0])
            count_dict = ast.literal_eval(processed_email[2])
            
            # Read in counts to use in prediction
            for key, value in count_dict.iteritems():
                term_count += value
            
            # Calculate the probability for each class
            spam_prediction = spam_prior * search_given_spam**term_count
            ham_prediction = ham_prior * search_given_ham**term_count
            
            # Pick the higher probability
            if spam_prediction > ham_prediction: 
                pred_spam = 1
            
            # Store accuracy in a list
            accuracy.append(1*(pred_spam==is_spam))

# Print accuracy
print "Accuracy = {:.2f}".format(float(sum(accuracy))/len(accuracy))



Overwriting reducer.py


In [356]:
# HW 1.3: Create multinomial bayes map/reduce pair that predicts spam/ham using a single word
def HW1_3():
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 10 "assistance"

    # Print the output file contents to screen
    with open ("enronemail_1h.txt.output", "r") as openfile:
        print openfile.read()
        
HW1_3()

Accuracy = 0.60



<b>HW1.4.</b> Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results
   To do so, make sure that

   - mapper.py counts all occurrences of a list of words, and
   - reducer.py performs the multiple-word Naive Bayes classification via the chosen list.

In [357]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

# Get input parameters
findwords = sys.argv[2].split()
findfile = sys.argv[1]
term_hits = {}

with open (findfile, "r") as myfile:
    for full_email in myfile:
        try:
            # Spam classification
            is_spam = re.findall("\t([0-1])\t",full_email)[0]

            # Parse out email body for processing. Find body using "tab spam/ham tab"
            # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine.
            keyword = re.findall("\t[0-1]\t",full_email)[0]
            email_id, is_spam_tabbed, email_body = full_email.partition(keyword)
            email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)
            email_len = len(email_body.split())

            # Number of hits for search word, stored in dictionary
            for word in findwords:
                term_hits[word] = len(re.findall(word,email_body))
            
            
            # Print as tuple with unique splitter "|||"
            print "{} ||| {} ||| {}".format(is_spam, email_len, term_hits)

        except:
            pass
            

Overwriting mapper.py


In [423]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast
import math

filelist = sys.argv[1:]

spam_term_counts = {}
ham_term_counts = {}
spam_count = 0
ham_count = 0
spam_len = 0
ham_len = 0

# Open each file and build Multinomial Naive Bayes model
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            is_spam = int(processed_email[0])
            email_len = int(processed_email[1])
            count_dict = ast.literal_eval(processed_email[2])
            
            # Build counts for spam and ham definitions.
            if is_spam:
                for key, value in count_dict.iteritems():
                    spam_term_counts[key] = spam_term_counts.get(key, 0) + value
                spam_count += 1
                spam_len += email_len
            else:
                for key, value in count_dict.iteritems():
                    ham_term_counts[key] = ham_term_counts.get(key, 0) + value
                ham_count += 1
                ham_len += email_len
    
# Calculate our priors based on the overall ratio of spam to ham
spam_prior = float(spam_count) / (spam_count + ham_count)
ham_prior = 1 - spam_prior

# Get full vocabulary
vocab_full = sorted(list(set(spam_term_counts.keys() + ham_term_counts.keys())))
vocab_len = len(vocab_full)

# Store probabilites for each term in the full vocabulary in dictionary. 
#    Calculate with MNB formulas.
probs_given_spam = {}
probs_given_ham = {}
for key in vocab_full:
    # Probability of term | Spam
    spam_term_count = spam_term_counts.get(key, 0)
    probs_given_spam[key] = (spam_term_count + 1.0) / (spam_len + vocab_len)
    
    # Probability of term | Ham
    ham_term_count = ham_term_counts.get(key, 0)
    probs_given_ham[key] = (ham_term_count + 1.0) / (ham_len + vocab_len)

print spam_term_counts
print ham_term_counts
print
print spam_len
print ham_len
print vocab_len
print
print probs_given_spam
print probs_given_ham
print
print spam_prior
print ham_prior
print math.log(spam_prior)
print math.log(ham_prior)

# Open each file and predict. Note - prediction is embaressingly parallel and could
#   be done effectively via Mapping. However, we were asked not to modify pNaiveBayes.sh
#   as part of this assignment. As a result, I will predict here in the reducer although
#   it is not the most efficient method.

accuracy = []

for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Defaults
            term_count = 0
            pred_spam = 0

            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            is_spam = int(processed_email[0])
            count_dict = ast.literal_eval(processed_email[2])
            
            # We will predict using the sum of logs to avoid multiplying together very small
            #   numbers, which would risk a floating point error.
            spam_prediction = math.log(spam_prior)
            ham_prediction = math.log(ham_prior)
            for key, value in count_dict.iteritems():
                spam_prediction += value*math.log(probs_given_spam[key])
                ham_prediction += value*math.log(probs_given_ham[key])
                #print key, value
                #print is_spam
                #print spam_prediction
                #print ham_prediction
                        
            # Pick the higher probability
            if spam_prediction > ham_prediction: 
                pred_spam = 1
                
            #print pred_spam
            # Store accuracy in a list
            accuracy.append(1*(pred_spam==is_spam))

# Print accuracy
print "Accuracy = {:.2f}".format(float(sum(accuracy))/len(accuracy))



Overwriting reducer.py


In [426]:
# HW 1.4: TODO
def HW1_4():
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 1 "assistance valium enlargementWithATypo"
#    !./pNaiveBayes.sh 1 "assistance"
    
    # Print the output file contents to screen
    with open ("enronemail_1h.txt.output", "r") as openfile:
        print openfile.read()
        
HW1_4()

{'assistance': 8, 'enlargementWithATypo': 0, 'valium': 3}
{'assistance': 2, 'enlargementWithATypo': 0, 'valium': 0}

18284
13184
3

{'assistance': 0.0004921528954995352, 'enlargementWithATypo': 5.468365505550391e-05, 'valium': 0.00021873462022201564}
{'assistance': 0.00022749677712899068, 'enlargementWithATypo': 7.58322590429969e-05, 'valium': 7.58322590429969e-05}

0.44
0.56
-0.82098055207
-0.579818495253
Accuracy = 0.63



<b>HW1.5.</b> Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by all words present. To do so, make sure that

   - mapper.py counts all occurrences of all words, and
   - reducer.py performs a word-distribution-wide Naive Bayes classification.

In [444]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

# Get input parameters
findwords = sys.argv[2].split()
findfile = sys.argv[1]
term_hits = {}

if findwords[0] == "*":
    with open (findfile, "r") as myfile:
        for full_email in myfile:
            try:
                # Spam classification
                is_spam = re.findall("\t([0-1])\t",full_email)[0]

                # Parse out email body for processing. Find body using "tab spam/ham tab"
                # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine.
                keyword = re.findall("\t[0-1]\t",full_email)[0]
                email_id, is_spam_tabbed, email_body = full_email.partition(keyword)
                email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)
                email_len = len(email_body.split())

                # Number of hits for search word, stored in dictionary
                for word in list(set(email_body.split())):
                    term_hits[word] = len(re.findall(word,email_body))


                # Print as tuple with unique splitter "|||"
                print "{} ||| {} ||| {}".format(is_spam, email_len, term_hits)

            except:
                pass
else:
    print "Invalid input"

Overwriting mapper.py


In [455]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast
import math

filelist = sys.argv[1:]

spam_term_counts = {}
ham_term_counts = {}
spam_count = 0
ham_count = 0
spam_len = 0
ham_len = 0

# Open each file and build Multinomial Naive Bayes model
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            is_spam = int(processed_email[0])
            email_len = int(processed_email[1])
            count_dict = ast.literal_eval(processed_email[2])
            
            # Build counts for spam and ham definitions.
            if is_spam:
                for key, value in count_dict.iteritems():
                    spam_term_counts[key] = spam_term_counts.get(key, 0) + value
                spam_count += 1
                spam_len += email_len
            else:
                for key, value in count_dict.iteritems():
                    ham_term_counts[key] = ham_term_counts.get(key, 0) + value
                ham_count += 1
                ham_len += email_len
    
# Calculate our priors based on the overall ratio of spam to ham
spam_prior = float(spam_count) / (spam_count + ham_count)
ham_prior = 1 - spam_prior

# Get full vocabulary
vocab_full = sorted(list(set(spam_term_counts.keys() + ham_term_counts.keys())))
vocab_len = len(vocab_full)

# Store probabilites for each term in the full vocabulary in dictionary. 
#    Calculate with MNB formulas.
probs_given_spam = {}
probs_given_ham = {}
for key in vocab_full:
    # Probability of term | Spam
    spam_term_count = spam_term_counts.get(key, 0)
    probs_given_spam[key] = (spam_term_count + 1.0) / (spam_len + vocab_len)
    
    # Probability of term | Ham
    ham_term_count = ham_term_counts.get(key, 0)
    probs_given_ham[key] = (ham_term_count + 1.0) / (ham_len + vocab_len)

print spam_len
print ham_len
print vocab_len
print
print spam_prior
print ham_prior
print math.log(spam_prior)
print math.log(ham_prior)

# Open each file and predict. Note - prediction is embaressingly parallel and could
#   be done effectively via Mapping. However, we were asked not to modify pNaiveBayes.sh
#   as part of this assignment. As a result, I will predict here in the reducer although
#   it is not the most efficient method.

accuracy = []

for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Defaults
            term_count = 0
            pred_spam = 0

            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            is_spam = int(processed_email[0])
            count_dict = ast.literal_eval(processed_email[2])
            
            # We will predict using the sum of logs to avoid multiplying together very small
            #   numbers, which would risk a floating point error.
            spam_prediction = math.log(spam_prior)
            ham_prediction = math.log(ham_prior)
            for key, value in count_dict.iteritems():
                spam_prediction += value*math.log(probs_given_spam[key])
                ham_prediction += value*math.log(probs_given_ham[key])
                #print key, value
                #print spam_prediction
                #print ham_prediction
                        
            # Pick the higher probability
            if spam_prediction > ham_prediction: 
                pred_spam = 1
                
            #print "Predict:", pred_spam
            #print "Actual:", is_spam
            # Store accuracy in a list
            accuracy.append(1*(pred_spam==is_spam))
            print pred_spam
            print accuracy

# Print accuracy
print "Accuracy = {:.2f}".format(float(sum(accuracy))/len(accuracy))



Overwriting reducer.py


In [456]:
# HW 1.5: TODO
def HW1_5():
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 1 "*"
    
    # Print the output file contents to screen
    with open ("enronemail_1h.txt.output", "r") as openfile:
        print openfile.read()
        
HW1_5()

18284
13184
5740

0.44
0.56
-0.82098055207
-0.579818495253
0
[1]
0
[1, 1]
0
[1, 1, 1]
0
[1, 1, 1, 1]
0
[1, 1, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
0
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,

In [348]:
%%writefile mapper.py
#!/usr/bin/python
# This mapper requires each mapper to be given a chunk size of one individual email
import sys
import re

findword = sys.argv[2]
findfile = sys.argv[1]

with open (findfile, "r") as myfile:
    for content in myfile:
        try:
            # Spam classification
            is_spam = re.findall("\t([0-1])\t",content)[0]

            # Parse out email body for processing. Find body using "tab spam/ham tab"
            # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine.
            keyword = re.findall("\t[0-1]\t",content)[0]
            email_id, is_spam_tabbed, email_body = content.partition(keyword)
            email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)
            email_distinct = sorted(list(set(email_body.split())))

            # Number of hits for search word
            term_hits = len(re.findall(findword,email_body))
            
            # Print as triple with unique splitter "|||"
            print "{} ||| {} ||| {}".format(is_spam, term_hits, email_distinct)

        except:
            pass
        


Overwriting mapper.py


In [349]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast

filelist = sys.argv[1:]

spam_words = []
ham_words = []
spam_search_count = 0
ham_search_count = 0
spam_count = 0
ham_count = 0

# Open each file and build Multinomial Naive Bayes model
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for line in openfile:
            # Read in triples created by mapper
            line = line.split(" ||| ")
            is_spam = int(line[0])
            searchterm_count = int(line[1])
            distinct_words = ast.literal_eval(line[2])
            
            if is_spam:
                spam_words += distinct_words
                spam_search_count += searchterm_count
                spam_count += 1
            else:
                ham_words += distinct_words
                ham_search_count += searchterm_count
                ham_count += 1
    
    # Find unique word counts
    spam_words = sorted(list(set(spam_words)))
    ham_words = sorted(list(set(ham_words)))
    all_words = sorted(list(set(spam_words + ham_words)))

    
    spam_prior = spam_count / (spam_count + ham_count)
    ham_prior = 1 - spam_prior
    
    search_given_spam = (spam_search_count + 1) / ()
    search_given_ham = 
             
#            print is_spam, searchterm_count, distinct_words

print "Spam words", len(spam_words)
print "Ham words", len(ham_words)


print "Spam words", len(spam_words)
print "Ham words", len(ham_words)

# Open each file and predict. Note - prediction is embaressingly parallel and could
#   be done effectively via Mapping. However, we were asked not to modify pNaiveBayes.sh
#   as part of this assignment. As a result, I will predict here in the reducer although
#   it is not the most efficient method.



Overwriting reducer.py


In [125]:
content = "12313313	1	asdad13	1	a13	0	a"
print re.findall("\t([0-1])\t",content)[0]

1


In [174]:
content = "12313313	1	asdad13	1	a13	0	a"
keyword = re.findall("\t[0-1]\t",content)[0]
email_id, is_spam, email_body = content.partition(keyword)
print email_id
print is_spam
print email_body

12313313
	1	
asdad13	1	a13	0	a
