# MIDS W261 Machine Learning At Scale

Christopher Llop | christopher.llop@ischool.berkeley.edu <br>
Week 2 | Submission Date:


<span style="color:black"><b>HW2.0.</b> What is a race condition in the context of parallel computation? Give an example.
What is MapReduce?
How does it differ from Hadoop?
Which programming paradigm is Hadoop based on? Explain and give a simple example in code and show the code running.</span>

<span style="color:green"><b>Answer:</b></span>

A <b>race condition</b> occurs when two or more threads (or processes) access the same shared data, at least one of them modifies it, and nothing in the program controls the order in which those accesses happen. As a result, the final output of the code can differ from run to run depending on which thread reaches the data first.

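As an example, say the number 3 is stored on disk. Thread 1 wants to double the number, while Thread 2 wants to add 5 to it. If Thread 1 acts first, the result is $(3 * 2) + 5 = 11$. If Thread 2 acts first, the result is $(3 + 5) * 2 = 16$. If both threads read 3 before either writes, one update is lost entirely and the result is 6 or 8. Because the scheduler decides the order, the same program can produce different answers on different runs, which makes such bugs hard to reproduce and debug.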
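
To make the two orderings concrete, here is a minimal sketch that replays the example deterministically rather than with real threads. The function names (`double`, `add_five`, `apply_in_order`) are made up for illustration, and each update is assumed to be atomic.

```python
def double(value):
    """Thread 1's update: double the stored number."""
    return value * 2


def add_five(value):
    """Thread 2's update: add 5 to the stored number."""
    return value + 5


def apply_in_order(start, updates):
    """Apply each update to the shared value, one after another."""
    value = start
    for update in updates:
        value = update(value)
    return value


# The two possible orderings when each update is atomic.
thread1_first = apply_in_order(3, [double, add_five])
thread2_first = apply_in_order(3, [add_five, double])

print("Thread 1 first: {}".format(thread1_first))
print("Thread 2 first: {}".format(thread2_first))
```

With real threads the ordering is chosen by the scheduler, so a lock or another synchronization primitive would be needed to make the outcome predictable.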

<br>
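<b>MapReduce</b> is a programming model for embarrassingly parallel data analysis. The input is split into chunks that are first processed independently and in parallel by a number of mappers, each emitting key-value pairs. The framework then groups the pairs by key, and reducers "fold" together the values for each key into a final output. <b>Hadoop</b> is an open-source software framework that implements this model: combined with the Hadoop Distributed File System (HDFS), it handles job scheduling, data movement, and fault tolerance, so a programmer only has to supply the map and reduce functions. In short, MapReduce is the paradigm and Hadoop is one implementation of it. Hadoop programming is based on the MapReduce paradigm, which in turn borrows the map and fold (reduce) operations from functional programming.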
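
As a simple illustration of the paradigm, the sketch below runs a word count in plain Python, imitating the three phases (map, shuffle/sort, reduce) in a single process. It is only a toy model of what Hadoop does across machines; the function names and the sample sentences are made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reducer(word, counts):
    """Fold all counts for one word into a single total."""
    return (word, sum(counts))


def run_word_count(lines):
    # Map phase: each line is processed independently of the others.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: sort by key so that equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(word, [count for _, count in group])
            for word, group in groupby(mapped, key=itemgetter(0))]


result = run_word_count(["the quick fox", "the lazy dog", "the end"])
print(result)
```

Because each `mapper` call looks at one line only, the map phase could be spread over many workers without coordination; only the shuffle needs to bring equal keys together.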

<span style="color:silver"><b>HW2.1. </b> Sort in Hadoop MapReduce
Given as input: Records of the form (integer, “NA”), where integer is any integer, and “NA” is just the empty string.
Output: sorted key value pairs of the form (integer, “NA”); what happens if you have multiple reducers? Do you need additional steps? Explain.</span>

<span style="color:silver">Write code to generate N  random records of the form (integer, “NA”). Let N = 10,000.
Write the python Hadoop streaming map-reduce job to perform this sort.</span>


<span style="color:green"><b>Answer:</b></span>

If we have multiple reducers, the results will be sorted within each reducer's output file, but they will not be globally sorted. While all the records for a given key wind up at the same reducer, the default partitioner assigns keys to reducers by hash, so a reducer is not guaranteed to receive a contiguous range of keys. To correct for this, we could either partition the keys by range, so that each reducer receives one sorted chunk of the key space and the part files can simply be concatenated in order, or we could post-process all the reducer outputs with a final merge sort.
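
The first option, sending keys to the reducers in sorted chunks, can be sketched in plain Python as follows. This is an illustration of the idea only, not Hadoop code: the boundaries (3334 and 6667, splitting the 0 to 10,000 key range into three parts) and the function names are assumptions made for the example.

```python
def range_partition(key, boundaries):
    """Return the index of the reducer responsible for this key.

    boundaries is a sorted list of exclusive upper bounds; keys at or above
    the last bound go to the final reducer.
    """
    for reducer_index, upper_bound in enumerate(boundaries):
        if key < upper_bound:
            return reducer_index
    return len(boundaries)


def sort_with_reducers(keys, boundaries):
    # One bucket per reducer: len(boundaries) + 1 in total.
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        buckets[range_partition(key, boundaries)].append(key)
    # Each reducer sorts only its own bucket.
    sorted_buckets = [sorted(bucket) for bucket in buckets]
    # Concatenating the outputs in reducer order yields a global sort.
    return [key for bucket in sorted_buckets for key in bucket]


keys = [7421, 15, 9980, 3300, 15, 6000, 42]
globally_sorted = sort_with_reducers(keys, boundaries=[3334, 6667])
print(globally_sorted)
```

Evenly spaced boundaries work here because the keys are drawn uniformly; for skewed data the boundaries would have to be chosen by sampling the keys so that the reducers receive similar loads.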

In [46]:
%%writefile mapper.py
#!/usr/bin/python
from random import randint

for x in range(0,10000):
    # Generate random integers
    print "{}\t{}".format(randint(0,10000),"NA")

Overwriting mapper.py


In [47]:
%%writefile reducer.py
#!/usr/bin/python
import sys

# input comes from STDIN
for line in sys.stdin:
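    # The framework hands the reducer its keys in sorted order, so we can simply print.
    # Note: Hadoop streaming compares keys as text by default, so a true numeric
    # ordering additionally requires a numeric key comparator.
    print line.rstrip('\n')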

Overwriting reducer.py


In [48]:
# Use chmod for permissions
!chmod a+x mapper.py
!chmod a+x reducer.py

In [50]:
# We need a dummy text file to run streaming
!echo nothing > dummy.txt
!hadoop fs -mkdir ./W261/In/HW2
!hdfs dfs -put ./dummy.txt ./W261/In/HW2/

15/09/12 19:25:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `W261/In/HW2/dummy.txt': File exists


In [54]:
# HW2.1: Execute a job using Hadoop Streaming to generate 10,000 random integers and sort them.
def HW2_1():
    !hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -Dmapreduce.job.maps=10 \
    -Dmapreduce.job.reduces=1 \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py  \
    -reducer ./reducer.py \
    -input ./W261/In/HW2/dummy.txt -output ./W261/Out/HW2_1
    
    print
    print "Display head of file to prove run worked:"
    !hadoop fs -cat ./W261/Out/HW2_1/part-00000 | head -n15

HW2_1()

15/09/12 19:26:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/12 19:26:26 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/12 19:26:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/12 19:26:26 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/12 19:26:27 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 19:26:27 INFO mapreduce.JobSubmitter: number of splits:1
15/09/12 19:26:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1950920983_0001
15/09/12 19:26:27 INFO mapred.LocalDistributedCacheManager: Localized file:/Users/cjllop/Code/MIDS/W261/HW/W2/mapper.py as file:/usr/local/Cellar/hadoop/hdfs/tmp/mapred/local/1442100387402/mapper.py
15/09/12 19:26:27 INFO mapred.LocalDistributedCacheManager: L

<span style="color:silver"><b>HW2.2.</b> Using the Enron data from HW1 and Hadoop MapReduce streaming, write a mapper/reducer pair that will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.</span>


   <span style="color:silver">To do so, make sure that</span>
   
   - <span style="color:silver">mapper.py counts all occurrences of a single word, and</span>
   - <span style="color:silver">reducer.py collates the counts of the single word.</span>

In [74]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

findword = sys.argv[1]

# input comes from standard input
for full_email in sys.stdin:

    # Parse out email body for processing. Find body using "tab spam/ham tab"
    # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine.
    keyword = re.findall("\t[0-1]\t",full_email)[0]
    email_id, is_spam_tabbed, email_body = full_email.partition(keyword)
    email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)

    for word in email_body.split():
        if word == findword:
            print '%s\t%s' % (word, 1)

Overwriting mapper.py


In [92]:
%%writefile reducer.py
#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

# input comes from standard input
for line in sys.stdin:
    # parse the input we got from mapper.py
    word, count = line.strip().split('\t', 1)
    count = int(count)

    # take advantage of sorted keys
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # print result when word changes
            print "{}\t{}".format(current_word, current_count)
        current_count = count
        current_word = word

# print final word (skipped when the reducer received no input)
if current_word:
    print "{}\t{}".format(current_word, current_count)


Overwriting reducer.py
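
The reducer above relies on the keys arriving in sorted order and tracks the current word by hand. As an alternative sketch (not the code that was run for this assignment), the same accumulation can be expressed with `itertools.groupby`, which groups consecutive lines that share a key. The helper names `parse` and `reduce_counts` and the sample lines are made up for illustration.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-separated mapper output."""
    for line in lines:
        word, count = line.strip().split('\t', 1)
        yield word, int(count)


def reduce_counts(lines):
    """Sum counts per word; assumes the lines arrive sorted by word."""
    totals = []
    for word, pairs in groupby(parse(lines), key=lambda pair: pair[0]):
        totals.append((word, sum(count for _, count in pairs)))
    return totals


sample = ["assistance\t1\n", "assistance\t1\n", "valium\t1\n"]
totals = reduce_counts(sample)
print(totals)
```

In a streaming job, `sys.stdin` would be passed in place of `sample`. An empty input simply produces an empty result, so no special case is needed for the final word.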


In [77]:
# Move input file to HDFS
!hdfs dfs -put ./enronemail_1h.txt ./W261/In/HW2/

15/09/12 21:06:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [94]:
# HW2.2: Execute a job using Hadoop Streaming to search the input file for a user-specified word
def HW2_2(term="assistance"):
    !hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -Dmapreduce.job.maps=10 \
    -Dmapreduce.job.reduces=1 \
    -files ./mapper.py,./reducer.py \
    -mapper './mapper.py {term}' \
    -reducer ./reducer.py \
    -input ./W261/In/HW2/enronemail_1h.txt -output ./W261/Out/HW2_2
    
    print
    print "Display head of file to prove run worked:"
    !hadoop fs -cat ./W261/Out/HW2_2/part-00000 | head -n15

    # Crosscheck results (data is small enough to use RE in python)
    # Note: re.findall counts substring matches, so this can exceed the whole-word count above
    import re
    print "Running Crosscheck..."
    with open ("enronemail_1h.txt", "r") as myfile:
        print "Check Result:", len(re.findall(term,myfile.read()))
        
HW2_2(term="assistance")



15/09/12 21:16:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/12 21:16:17 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/12 21:16:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/12 21:16:17 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/12 21:16:17 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 21:16:17 INFO mapreduce.JobSubmitter: number of splits:1
15/09/12 21:16:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local299837932_0001
15/09/12 21:16:18 INFO mapred.LocalDistributedCacheManager: Localized file:/Users/cjllop/Code/MIDS/W261/HW/W2/mapper.py as file:/usr/local/Cellar/hadoop/hdfs/tmp/mapred/local/1442106978099/mapper.py
15/09/12 21:16:18 INFO mapred.LocalDistributedCacheManager: Lo

<span style="color:silver"><b>HW2.3.</b> Using the Enron data from HW1 and Hadoop MapReduce, write a mapper/reducer pair that
   will classify the email messages by a single, user-specified word. Examine the word “assistance” and report your results. To do so, make sure that</span>
   
   - <span style="color:silver">mapper.py</span>
   - <span style="color:silver">reducer.py </span>

   <span style="color:silver">performs a single word multinomial Naive Bayes classification.</span>
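
For reference, the quantities computed by the reducer below can be written out as follows. The notation is introduced here for convenience and is not part of the assignment: $N_c$ is the total number of words in class $c$ (`spam_len` or `ham_len`), $|V|$ is the number of distinct terms seen (`distinct_term_count`), $T$ is the set of search terms, and $n_w$ is the count of term $w$ in the email being classified.

$$\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{N_c + |V|}$$

$$\hat{c} = \arg\max_{c \in \{\mathrm{spam},\, \mathrm{ham}\}} \left[ \log_{10} \hat{P}(c) + \sum_{w \in T} n_w \, \log_{10} \hat{P}(w \mid c) \right]$$

The $+1$ in the numerator is the add-one (Laplace) smoothing term, and the prior $\hat{P}(c)$ is the fraction of training emails that belong to class $c$.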

In [113]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

# Get search term(s)
findword = sys.argv[1]

for full_email in sys.stdin:
#with open (findfile, "r") as myfile:
#    for full_email in myfile:
    # Empty dictionary
    term_hits = {}

    # Spam classification
    is_spam = re.findall("\t([0-1])\t",full_email)[0]

    # Parse out email body for processing. Find body using "tab spam/ham tab"
    # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine for classifying.
    keyword = re.findall("\t[0-1]\t",full_email)[0]
    email_id, is_spam_tabbed, email_body = full_email.partition(keyword)
    email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)
    # Must process search query and email bodies the same
    findword = re.sub('[^A-Za-z0-9\s]+', '', findword)
    email_len = len(email_body.split())

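    # Build counts of term words.
    # Note: re.findall counts substring matches, so a short word such as "a"
    # is also counted inside longer words.
    for word in set(email_body.split()):
        term_hits[word] = len(re.findall(word,email_body))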

    # Print as tuple with unique splitter "|||"
    print "{}\t{} ||| {} ||| {}".format(email_id, is_spam, email_len, term_hits)


Overwriting mapper.py


In [114]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast
import math

# Get search term(s)
findword = sys.argv[1]
search_terms = findword.split()

# Parse all mapper results into a list so we can loop through again to predict after 
# looping through to train the model
mapper_results = []
for line in sys.stdin:
    mapper_results.append(line)

spam_term_counts = {}
ham_term_counts = {}
word_prob_given_spam = {}
word_prob_given_ham = {}
spam_count = 0
ham_count = 0
spam_len = 0
ham_len = 0
distinct_term_list = []

# Open each file and build Multinomial Naive Bayes model
for processed_email in mapper_results:
    # Read in tuples created by mapper
    email_id, processed_email = processed_email.split("\t")
    processed_email = processed_email.split(" ||| ")
    is_spam = int(processed_email[0])
    email_len = int(processed_email[1])
    count_dict = ast.literal_eval(processed_email[2])

    # Build counts for spam and ham definitions.
    if is_spam:
        for key, value in count_dict.iteritems():
            spam_term_counts[key] = spam_term_counts.get(key, 0) + value
        spam_count += 1
        spam_len += email_len
    else:
        for key, value in count_dict.iteritems():
            ham_term_counts[key] = ham_term_counts.get(key, 0) + value
        ham_count += 1
        ham_len += email_len

    distinct_term_list = list(set(distinct_term_list + count_dict.keys()))

# Calculate our priors based on the overall ratio of spam to ham
spam_prior = float(spam_count) / (spam_count + ham_count)
ham_prior = 1 - spam_prior
spam_prior = math.log10(spam_prior)
ham_prior = math.log10(ham_prior)

# Calculate our conditional probabilities for the search term using MNB formula
#     term_given_spam = (spam_term_count + 1 for smoothing) / (total count of spam words + total distinct vocab size)
distinct_term_count = len(distinct_term_list)

for term in search_terms:
    word_prob_given_spam[term] = math.log10((spam_term_counts.get(term,0) + 1.0) / (float(spam_len) + distinct_term_count))
    word_prob_given_ham[term] = math.log10((ham_term_counts.get(term,0) + 1.0) / (float(ham_len) + distinct_term_count))

# Now let's predict!
accuracy = []
for processed_email in mapper_results:
    # Defaults
    pred_spam = 0
    spam_prediction = spam_prior
    ham_prediction = ham_prior

    # Read in tuples created by mapper
    email_id, processed_email = processed_email.split("\t")
    processed_email = processed_email.split(" ||| ")
    is_spam = int(processed_email[0])
    count_dict = ast.literal_eval(processed_email[2])

    # Read in counts to use in prediction
    for term in word_prob_given_spam.keys():
        # Calculate the probability for each class
        spam_prediction += (word_prob_given_spam[term] * count_dict.get(term, 0))
        ham_prediction += (word_prob_given_ham[term] * count_dict.get(term, 0))

    # Pick the higher probability
    if spam_prediction > ham_prediction: 
        pred_spam = 1

    # Store accuracy in a list
    accuracy.append(1*(pred_spam==is_spam))

    # Print predictions to results file
    print '{}\t{}\t{}'.format(email_id, is_spam, pred_spam)

# Print accuracy
sys.stderr.write("\nSpam Probs: {}\n".format(word_prob_given_spam))
sys.stderr.write("Ham Probs: {}\n".format(word_prob_given_ham))
sys.stderr.write("Accuracy = {:.2f}\n".format(float(sum(accuracy))/len(accuracy)))


Overwriting reducer.py
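
To see the scoring step of the reducer in isolation, here is a small self-contained sketch that mirrors its smoothing and decision rule. The counts are made-up toy numbers, not values from the Enron data, and the function names `smoothed_log_prob` and `classify` are hypothetical.

```python
import math


def smoothed_log_prob(term_count, class_len, vocab_size):
    """log10 of the add-one smoothed P(term | class)."""
    return math.log10((term_count + 1.0) / (float(class_len) + vocab_size))


def classify(email_counts, spam_prior, ham_prior, spam_probs, ham_probs):
    """Return 1 (spam) only if the spam score is strictly higher, else 0."""
    spam_score = spam_prior
    ham_score = ham_prior
    for term in spam_probs:
        spam_score += spam_probs[term] * email_counts.get(term, 0)
        ham_score += ham_probs[term] * email_counts.get(term, 0)
    return 1 if spam_score > ham_score else 0


# Made-up toy counts: 450 words per class and a vocabulary of 50 terms.
vocab_size = 50
spam_prior = math.log10(0.5)
ham_prior = math.log10(0.5)
spam_probs = {"assistance": smoothed_log_prob(9, 450, vocab_size)}
ham_probs = {"assistance": smoothed_log_prob(1, 450, vocab_size)}

with_term = classify({"assistance": 2}, spam_prior, ham_prior, spam_probs, ham_probs)
without_term = classify({"hello": 1}, spam_prior, ham_prior, spam_probs, ham_probs)
print("{} {}".format(with_term, without_term))
```

Note that an email containing none of the search terms is scored by the priors alone, so with a single search term most emails receive the same prediction.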


In [115]:
# HW2.3: Predict via MNB using Hadoop Streaming for a user-specified word
def HW2_3(term="assistance"):
    !hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -Dmapreduce.job.maps=10 \
    -Dmapreduce.job.reduces=1 \
    -files ./mapper.py,./reducer.py \
    -mapper './mapper.py {term}' \
    -reducer './reducer.py {term}' \
    -input ./W261/In/HW2/enronemail_1h.txt -output ./W261/Out/HW2_3
    
    print
    print "Display head of file to prove run worked:"
    !hadoop fs -cat ./W261/Out/HW2_3/part-00000 | head -n15
        
HW2_3(term="assistance")

# Note - output shows same accuracy of 60% that we saw in HW1


15/09/12 21:57:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/12 21:57:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/12 21:57:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/12 21:57:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/12 21:57:33 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 21:57:33 INFO mapreduce.JobSubmitter: number of splits:1
15/09/12 21:57:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local540684409_0001
15/09/12 21:57:34 INFO mapred.LocalDistributedCacheManager: Localized file:/Users/cjllop/Code/MIDS/W261/HW/W2/mapper.py as file:/usr/local/Cellar/hadoop/hdfs/tmp/mapred/local/1442109453919/mapper.py
15/09/12 21:57:34 INFO mapred.LocalDistributedCacheManager: Lo

<span style="color:black"><b>HW2.4.</b></span> Using the Enron data from HW1 and the Hadoop MapReduce framework, write a mapper/reducer pair that
   will classify the email messages with a multinomial Naive Bayes classifier using a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results.
   To do so, make sure that

   - mapper.py 
   - reducer.py 

   performs the multiple-word multinomial Naive Bayes classification via the chosen list.

In [119]:
# HW2.4: Predict via MNB using Hadoop Streaming for multiple user-specified words
# Note - the solution program to HW2.3 can already do this. We just need to give it more terms.
def HW2_4(term="assistance valium enlargementWithATypo"):
    !hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -Dmapreduce.job.maps=10 \
    -Dmapreduce.job.reduces=1 \
    -files ./mapper.py,./reducer.py \
    -mapper './mapper.py "{term}"' \
    -reducer './reducer.py "{term}"' \
    -input ./W261/In/HW2/enronemail_1h.txt -output ./W261/Out/HW2_4
    
    print
    print "Display head of file to prove run worked:"
    !hadoop fs -cat ./W261/Out/HW2_4/part-00000 | head -n15
        
HW2_4(term="assistance valium enlargementWithATypo")

# Note - output shows same accuracy of 63% that we saw in HW1


15/09/12 22:07:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/12 22:07:08 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/12 22:07:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/12 22:07:08 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/12 22:07:08 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 22:07:08 INFO mapreduce.JobSubmitter: number of splits:1
15/09/12 22:07:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local864627615_0001
15/09/12 22:07:09 INFO mapred.LocalDistributedCacheManager: Localized file:/Users/cjllop/Code/MIDS/W261/HW/W2/mapper.py as file:/usr/local/Cellar/hadoop/hdfs/tmp/mapred/local/1442110029290/mapper.py
15/09/12 22:07:09 INFO mapred.LocalDistributedCacheManager: Lo

In [118]:
!hadoop fs -rm -r ./W261/Out/HW2_4

15/09/12 22:07:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/12 22:07:01 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted W261/Out/HW2_4


<span style="color:black"><b>HW2.5.</b></span> Using the Enron data from HW1 and the Hadoop MapReduce framework, write a mapper/reducer pair for a multinomial Naive Bayes classifier that
   will classify the email messages using all words present. Also drop words with a frequency of less than three (3). How does this affect the misclassification error of the learned multinomial Naive Bayes classifier on the training dataset?
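
One possible way to implement the frequency cut-off is sketched below. It is not a complete answer to this question: it assumes that "frequency" means the total count of a word across both classes, and the helper names (`total_counts`, `frequent_vocabulary`) and the toy counts are made up for illustration. The dictionaries follow the `spam_term_counts` / `ham_term_counts` layout used by the reducers in this notebook.

```python
def total_counts(spam_term_counts, ham_term_counts):
    """Combine the per-class counts into one corpus-wide count per term."""
    totals = dict(spam_term_counts)
    for term, count in ham_term_counts.items():
        totals[term] = totals.get(term, 0) + count
    return totals


def frequent_vocabulary(spam_term_counts, ham_term_counts, min_count=3):
    """Return the sorted terms whose corpus-wide count is at least min_count."""
    totals = total_counts(spam_term_counts, ham_term_counts)
    return sorted(term for term, count in totals.items() if count >= min_count)


# Made-up toy counts for illustration only.
spam_counts = {"valium": 4, "assistance": 2, "typo": 1}
ham_counts = {"assistance": 1, "meeting": 5, "typo": 1}
vocab = frequent_vocabulary(spam_counts, ham_counts)
print(vocab)
```

The resulting vocabulary would then replace the full list of distinct terms when computing the conditional probabilities and the vocabulary size used for smoothing.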

In [9]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast
import math

filelist = sys.argv[1:]

spam_term_counts = {}
ham_term_counts = {}
word_prob_given_spam = {}
word_prob_given_ham = {}
spam_count = 0
ham_count = 0
spam_len = 0
ham_len = 0
distinct_term_list = []

# Open each file and build Multinomial Naive Bayes model
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            search_terms = ast.literal_eval(processed_email[0])
            email_id = processed_email[1]
            is_spam = int(processed_email[2])
            email_len = int(processed_email[3])
            count_dict = ast.literal_eval(processed_email[4])
            
            # Build counts for spam and ham definitions.
            if is_spam:
                for key, value in count_dict.iteritems():
                    spam_term_counts[key] = spam_term_counts.get(key, 0) + value
                spam_count += 1
                spam_len += email_len
            else:
                for key, value in count_dict.iteritems():
                    ham_term_counts[key] = ham_term_counts.get(key, 0) + value
                ham_count += 1
                ham_len += email_len
                
            distinct_term_list = list(set(distinct_term_list + count_dict.keys()))

# Calculate our priors based on the overall ratio of spam to ham
spam_prior = float(spam_count) / (spam_count + ham_count)
ham_prior = 1 - spam_prior
spam_prior = math.log10(spam_prior)
ham_prior = math.log10(ham_prior)

# Calculate our conditional probabilities for the search term using MNB formula
#     term_given_spam = (spam_term_count + 1 for smoothing) / (total count of spam words + total distinct vocab size)
distinct_term_count = len(distinct_term_list)

for term in search_terms:
    word_prob_given_spam[term] = math.log10((spam_term_counts.get(term,0) + 1.0) / (float(spam_len) + distinct_term_count))
    word_prob_given_ham[term] = math.log10((ham_term_counts.get(term,0) + 1.0) / (float(ham_len) + distinct_term_count))

# Open each file and predict. Note - prediction is embarrassingly parallel and could
#   be done effectively via Mapping. However, the assignment asks us to solve the problem using
#   the provided pNaiveBayes.sh.
accuracy = []
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Defaults
            pred_spam = 0
            spam_prediction = spam_prior
            ham_prediction = ham_prior

            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            email_id = processed_email[1]
            is_spam = int(processed_email[2])
            count_dict = ast.literal_eval(processed_email[4])
            
            # Read in counts to use in prediction
            for term in word_prob_given_spam.keys():
                # Calculate the probability for each class
                spam_prediction += (word_prob_given_spam[term] * count_dict.get(term, 0))
                ham_prediction += (word_prob_given_ham[term] * count_dict.get(term, 0))
                
            # Pick the higher probability
            if spam_prediction > ham_prediction: 
                pred_spam = 1
            
            # Store accuracy in a list
            accuracy.append(1*(pred_spam==is_spam))

            # Print predictions to results file
            print '{}\t{}\t{}'.format(email_id, is_spam, pred_spam)
# Print accuracy
sys.stderr.write("Spam Probs: {}\n".format(word_prob_given_spam))
sys.stderr.write("Ham Probs: {}\n".format(word_prob_given_ham))
sys.stderr.write("Accuracy = {:.2f}\n".format(float(sum(accuracy))/len(accuracy)))


Overwriting reducer.py


In [10]:
# HW 1.3: Create multinomial bayes map/reduce pair that predicts spam/ham using a single word
def HW1_3(terms="assistance"):
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 10 "{terms}"
        
HW1_3(terms = "assistance")

Spam Probs: {'assistance': -3.4264028097516}
Ham Probs: {'assistance': -3.799891684656865}
Accuracy = 0.60


<span style="color:silver"><b>HW1.4.</b> Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results
   To do so, make sure that</span>
 

   - <span style="color:silver">mapper.py counts all occurrences of a list of words, and</span>
   - <span style="color:silver">reducer.py performs the multiple-word Naive Bayes classification via the chosen list.</span>

In [11]:
# HW 1.4: Our function for HW 1.3 can also classify multiple words, as requested by problem 1.4
def HW1_4(words = "assistance valium enlargementWithATypo"):
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 10 "{words}"
            
HW1_4(words = "assistance valium enlargementWithATypo")

Spam Probs: {'assistance': -3.4264028097516, 'enlargementWithATypo': -4.380645319190925, 'valium': -3.778585327862962}
Ham Probs: {'assistance': -3.799891684656865, 'enlargementWithATypo': -4.277012939376528, 'valium': -4.277012939376528}
Accuracy = 0.63


<span style="color:silver"><b>HW1.5.</b> Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by all words present. To do so, make sure that</span>

   - <span style="color:silver">mapper.py counts all occurrences of all words, and</span>
   - <span style="color:silver">reducer.py performs a word-distribution-wide Naive Bayes classification.</span>

In [12]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

# Get input parameters
findword = sys.argv[2]
findfile = sys.argv[1]

star_switch = 0
if findword == "*":
    star_switch = 1

with open (findfile, "r") as myfile:
    for full_email in myfile:
        try:
            # Empty dictionary
            term_hits = {}

            # Spam classification
            is_spam = re.findall("\t([0-1])\t",full_email)[0]

            # Parse out email body for processing. Find body using "tab spam/ham tab"
            # use regex to strip out non alpha-numeric. "don't" will become "dont" which is fine for classifying.
            keyword = re.findall("\t[0-1]\t",full_email)[0]
            email_id, is_spam_tabbed, email_body = full_email.partition(keyword)
            email_body = re.sub('[^A-Za-z0-9\s]+', '', email_body)
            # Must process search query and email bodies the same
            # This is a good place to add logic to use all terms in body as search terms when "*" appears
            if star_switch == 1:
                findword = " ".join(list(set(email_body.split())))
            else:
                findword = re.sub('[^A-Za-z0-9\s]+', '', findword)
            email_len = len(email_body.split())

            # Build counts of term words.
            for word in list(set(email_body.split())):
                term_hits[word] = len(re.findall(word,email_body))

            # Print as tuple with unique splitter "|||"
            print "{} ||| {} ||| {} ||| {} ||| {}".format(re.findall(r'\w+', findword), email_id, is_spam, email_len, term_hits)

        except Exception:
            # Skip malformed lines (e.g. no tab-delimited spam flag was found)
            pass
            

Overwriting mapper.py


In [13]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import ast
import math

filelist = sys.argv[1:]

spam_term_counts = {}
ham_term_counts = {}
word_prob_given_spam = {}
word_prob_given_ham = {}
spam_count = 0
ham_count = 0
spam_len = 0
ham_len = 0
distinct_term_list = []

# Open each file and build Multinomial Naive Bayes model
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            search_terms = ast.literal_eval(processed_email[0])
            email_id = processed_email[1]
            is_spam = int(processed_email[2])
            email_len = int(processed_email[3])
            count_dict = ast.literal_eval(processed_email[4])
            
            # Build counts for spam and ham definitions.
            if is_spam:
                for key, value in count_dict.iteritems():
                    spam_term_counts[key] = spam_term_counts.get(key, 0) + value
                spam_count += 1
                spam_len += email_len
            else:
                for key, value in count_dict.iteritems():
                    ham_term_counts[key] = ham_term_counts.get(key, 0) + value
                ham_count += 1
                ham_len += email_len
                
            distinct_term_list = list(set(distinct_term_list + count_dict.keys()))

# Calculate our priors based on the overall ratio of spam to ham
spam_prior = float(spam_count) / (spam_count + ham_count)
ham_prior = 1 - spam_prior
spam_prior = math.log10(spam_prior)
ham_prior = math.log10(ham_prior)

# Calculate our conditional probabilities for the search term using MNB formula
#     term_given_spam = (spam_term_count + 1 for smoothing) / (total count of spam words + total distinct vocab size)
distinct_term_count = len(distinct_term_list)

for term in distinct_term_list:
    word_prob_given_spam[term] = math.log10((spam_term_counts.get(term,0) + 1.0) / (float(spam_len) + distinct_term_count))
    word_prob_given_ham[term] = math.log10((ham_term_counts.get(term,0) + 1.0) / (float(ham_len) + distinct_term_count))

# Open each file and predict. Note - prediction is embarrassingly parallel and could
#   be done effectively via Mapping. However, the assignment asks us to solve the problem using
#   the provided pNaiveBayes.sh.
accuracy = []
for thisfile in filelist:
    with open (thisfile, "r") as openfile:
        for processed_email in openfile:
            # Defaults
            pred_spam = 0
            spam_prediction = spam_prior
            ham_prediction = ham_prior

            # Read in tuples created by mapper
            processed_email = processed_email.split(" ||| ")
            email_id = processed_email[1]
            is_spam = int(processed_email[2])
            count_dict = ast.literal_eval(processed_email[4])
            
            # Read in counts to use in prediction
            for term in distinct_term_list:
                # Calculate the probability for each class
                spam_prediction += (word_prob_given_spam[term] * count_dict.get(term, 0))
                ham_prediction += (word_prob_given_ham[term] * count_dict.get(term, 0))
                
            # Pick the higher probability
            if spam_prediction > ham_prediction: 
                pred_spam = 1
            
            # Store accuracy in a list
            accuracy.append(1*(pred_spam==is_spam))

            # Print predictions to results file
            print '{}\t{}\t{}'.format(email_id, is_spam, pred_spam)
# Print accuracy
sys.stderr.write("Accuracy = {:.2f}\n".format(float(sum(accuracy))/len(accuracy)))


Overwriting reducer.py


In [14]:
# HW 1.5: Run on all words. 
def HW1_5():
    # Run pNaiveBayes.sh
    !./pNaiveBayes.sh 10 "*"
            
HW1_5()

Accuracy = 0.96


In [237]:
test = ["1","3","five"]
print str(test)
print " ".join(test)


['1', '3', 'five']
1 3 five


In [125]:
content = "12313313	1	asdad13	1	a13	0	a"
print re.findall("\t([0-1])\t",content)[0]

1


In [174]:
content = "12313313	1	asdad13	1	a13	0	a"
keyword = re.findall("\t[0-1]\t",content)[0]
email_id, is_spam, email_body = content.partition(keyword)
print email_id
print is_spam
print email_body

12313313
	1	
asdad13	1	a13	0	a
