# DATASCI W261: Machine Learning at Scale 

**Name: Carlos Eduardo Rodriguez Castillo**

**email: cerodriguez@berkeley.edu**

**Week 1**

**Section 2**


### This notebook provides a poor man's MapReduce framework built from the command line and Python. Please note that I kept the logging code and the commented-out debugging lines to show my work.

#### HW0.0

__Prepare your bio and include it in this HW submission. Please limit to 100 words. Count the words in your bio and print the length of your bio (in terms of words) in a separate cell.__

My name is Carlos Eduardo Rodriguez Castillo. I was born and raised in Caracas, Venezuela but currently live in Brooklyn, NY. I have been in New York City for the past eight years. I received my undergrad degree at Columbia University where I majored in Operations Research. I am in the May 2015 cohort of the MIDS program. I currently work at an adtech company called AppNexus as a team lead in our Services org. I enjoy being outside, riding my bike to the beach, going bouldering and rock climbing as well as going on skiing trips in the Rockies.

In [92]:
bio = "My name is Carlos Eduardo Rodriguez Castillo. I was born and raised in Caracas, Venezuela but currently live in Brooklyn, NY. I have been in New York City for the past eight years. I received my undergrad degree at Columbia University where I majored in Operations Research. I am in the May 2015 cohort of the MIDS program. I currently work at an adtech company called AppNexus as a team lead in our Services org. I enjoy being outside, riding my bike to the beach, going bouldering and rock climbing as well as going on skiing trips in the Rockies."


print "The length of the bio is: %s" % len(bio.split())

The length of the bio is: 100


#### HW1.0.0

__Define big data. Provide an example of a big data problem in your domain of expertise.__

I consider big data to be data with extreme volume, velocity and variety (or at least two of these characteristics at the same time). I define extreme volume as an amount of data that cannot fit on a single machine to be processed in a meaningful way (e.g. several hundred gigabytes as a minimum). I define extreme velocity as data that is generated (at the atomic record level) multiple times per second. I define extreme variety as data that is multidimensional. A corollary of this definition is that big data requires special methods to be processed and used effectively, as traditional data processing methodologies are inadequate for it.

Ad tech, the industry that I work in, has no shortage of big data problems. One of them is the primary issue of counting and reporting impressions bought by media-buying campaigns. Given that campaigns can buy hundreds of thousands of impressions per day (velocity), and that each impression has tens of fields to report on (variety), which together amount to tens to hundreds of gigabytes of data created per day, storing and processing these logs quickly turns into an exercise in processing several terabytes of data (volume).

#### HW1.0.1

__In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, and the irreducible error for a test dataset T when polynomial regression models of degree 1, 2, 3, 4, 5 are considered. How would you select a model?__

For a test dataset __T__, the sum of the [squared] bias, the variance and the irreducible error can be estimated by calculating the prediction error of the model (which I will assume was computed using a separate development dataset __D__) on __T__.

Specifically, given the constraints in the instructions, this can be achieved by separately fitting [polynomial regression models](https://en.wikipedia.org/wiki/Polynomial_regression) of degrees 1, 2, 3, 4, and 5 to __D__ and then having each model predict on __T__. The prediction error of each model (which in this case I will define as the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)) could then be plotted on a graph to show their relative order, which in turn gives us an (albeit potentially quite poor) estimate of the sum of the [squared] bias, the variance and the irreducible error as a function of the polynomial order of the regression model.

__Figure: Polynomial Regression Models Fitting test data set T__ (image not available)

__Figure: Prediction error for the different Polynomial Regression Models__ (image not available)

The [squared] bias, variance and irreducible error can be estimated from the prediction error on __T__ because the expected (unconditional) prediction error equals the sum of these three quantities. Since we do not realistically have access to the full population of test data sets from which __T__ was drawn, the best we can do is estimate the prediction error by sampling from that population. (The same reasoning applies to development data sets such as __D__, which we sample to estimate the 'true' model for each of the selected polynomial orders.) With only the single test data set __T__, our estimate of the expected prediction error, and thus of the variance, [squared] bias and irreducible error, will be poor; the more test data sets we can average over, the closer the estimate should come to the true unconditional expected prediction error. To work around this, I would generate a sample of development and test data sets by bootstrapping __D__ and __T__ respectively, and estimate the expected unconditional prediction error from that larger simulated sample.

Finally, using the prediction-error graph above as an input, I would select the model with the lowest prediction error on __T__ because, as stated earlier, we estimate that it has the lowest sum of [squared] bias and variance (the irreducible error is ignored since, by definition, it cannot be reduced by the choice of model). This in turn yields the model with the best generalizable precision and accuracy.
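The selection procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code behind the original figures: the quadratic signal, the noise level, the sample sizes and all function names (`fit_polynomial`, `mse`, `make_data`) are hypothetical, and only __D__ is bootstrapped to keep the example short. The fit uses plain normal equations so that the block needs nothing beyond the standard library; in practice a routine such as `numpy.polyfit` would be the usual choice.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares fit: solve the normal equations (A^T A) c = A^T y."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(ata[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aty[i] - tail) / ata[i][i]
    return coeffs

def mse(coeffs, xs, ys):
    """Mean squared error of the fitted polynomial on (xs, ys)."""
    preds = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def make_data(rng, n):
    """Hypothetical data: a quadratic signal plus Gaussian noise (sd 0.3)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x - 3.0 * x * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
d_x, d_y = make_data(rng, 60)   # development set D
t_x, t_y = make_data(rng, 60)   # test set T

test_mse = {}
for degree in range(1, 6):
    errors = []
    for _ in range(20):  # bootstrap resamples of D (assumed count)
        idx = [rng.randrange(len(d_x)) for _ in d_x]
        coeffs = fit_polynomial([d_x[i] for i in idx], [d_y[i] for i in idx], degree)
        errors.append(mse(coeffs, t_x, t_y))
    test_mse[degree] = sum(errors) / len(errors)
    print("degree %d: mean test MSE %.3f" % (degree, test_mse[degree]))

best_degree = min(test_mse, key=test_mse.get)
print("selected degree: %d" % best_degree)
```

Averaging the test error over bootstrap refits gives one number per degree that still mixes [squared] bias, variance and irreducible error; the spread of the individual refits around that average is a rough indication of the variance component alone.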

#### Simple EDA

In [None]:
!wc -l enronemail_1h.txt  #100 email records
!cut -f2 -d$'\t' enronemail_1h.txt|wc  #extract second field which is SPAM flag
!cut -f2 -d$'\t' enronemail_1h.txt|head
!head -n 100 enronemail_1h.txt|tail -1|less #an example SPAM email record
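
The `cut -f2` commands above pull out the second tab-separated field, the SPAM flag. A rough Python equivalent for tallying that field might look like the sketch below. The helper name `label_counts` and the sample records are hypothetical, and the assumed layout (ID, flag, subject, body) is inferred from the `cut` calls rather than confirmed against the data file.

```python
from collections import Counter

def label_counts(lines):
    """Count the values of the second tab-separated field (the SPAM flag)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:  # skip malformed records with no flag field
            counts[fields[1]] += 1
    return counts

# Hypothetical records standing in for lines of enronemail_1h.txt.
sample = [
    "id1\t0\tsubject a\tbody a\n",
    "id2\t1\tsubject b\tbody b\n",
    "id3\t0\tsubject c\tbody c\n",
]
counts = label_counts(sample)
print("HAM: %d, SPAM: %d" % (counts["0"], counts["1"]))
```

To run it on the real file, the list could be replaced by an open file handle, since the function only iterates over lines.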

#### HW1.1

__Read through the provided control script (pNaiveBayes.sh) and all of its comments. When you are comfortable with their purpose and function, respond to the remaining homework questions below. A simple cell in the notebook with a print statmement with  a  "done" (print "done") string will suffice here. (don't forget to include the Question Number and the question in the cell as a markdown multiline comment!)__

In [7]:
print "done"

done


#### HW1.2

__Provide a mapper/reducer pair that, when executed by **pNaiveBayes.sh** will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.__

In [100]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: mapper code for HW1.2
import sys
import re
import logging

## setting up logger
################################################################################################

################################################################################################
## NOTE: make sure to set the logging directory path appropriately as the below one is custom!!!
################################################################################################

################################################################################################

logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'map_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    stream = sys.stderr,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False

count = 0
WORD_RE = re.compile(r"[\w']+")

## collect user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())

with open (filename, "r") as myfile:
    # Reading input file line by line
    for line in myfile:
        logging.debug('Line: %s'% line)
        words = line.split()
        for word in words:
            # Doing minimal token clean up
            logging.debug('Word: %s' % word)
            word = word.rstrip(',')
            word = word.rstrip(';')
            word = word.rstrip(':')
            word = word.rstrip('.')
            word = word.rstrip('"')
            word = word.lstrip('"')
            # Scanning for our word of choice
            for findword in findwords:
                if word == findword:
                    # increment when word encountered is word of choice
                    logging.debug('Word: [%s] is equal to [%s]! Increment!' % (word, findword))
                    count = count + 1
    # This is a special case where findwords has a single entry
    print "%s %d" % (findwords[0], count)

Overwriting mapper.py
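
The token clean-up in the mapper applies one `rstrip` per character, in a fixed order, so the result depends on how punctuation is stacked at the end of a token. The sketch below reproduces that chain and contrasts it with a single `strip` over the whole character set. Both helper names are hypothetical, and `set_clean` is an alternative shown for comparison, not what the mapper does: it also removes leading `, ; : .`, which the mapper leaves in place.

```python
def chained_clean(word):
    """Reproduce the mapper's clean-up: one character per call, in order."""
    for ch in (',', ';', ':', '.', '"'):
        word = word.rstrip(ch)
    return word.lstrip('"')

def set_clean(word):
    """Alternative: strip any mix of these characters from both ends."""
    return word.strip('",;:.')

for token in ['assistance,', '"assistance",', 'help."']:
    print("%-16s chained=%-12s set=%s" % (token, chained_clean(token), set_clean(token)))
```

A token such as `help."` keeps its period under the chained version because the period is hidden behind the quote when `rstrip('.')` runs, so it would not match a search word of `help`.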


In [101]:
%%writefile reducer.py
#!/usr/bin/python
## reducer.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: reducer code for HW1.2
import sys
import re
import logging

## setting up logger
################################################################################################

################################################################################################
## NOTE: make sure to set the logging directory path appropriately as the below one is custom!!!
################################################################################################

################################################################################################

logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'reduce_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    stream = sys.stderr,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False

WORD_RE = re.compile(r"[\w']+")

## collect mapper input
logging.debug('Input file chunks: %s'% sys.argv)
## removing the first element of sys.argv
filenames = sys.argv[1:]

## running total (named 'total' to avoid shadowing the built-in sum())
total = 0

for filename in filenames:
    logging.debug('processing input file: %s'% filename)
    with open (filename, "r") as myfile:
        for line in myfile:
            inputs = line.split()
            # This is the special case where there is a single word to be counted
            word = inputs[0]
            chunk_count = int(inputs[1])
            logging.debug('Sum line: %s'% line)
            total = total + chunk_count
            logging.debug('Sum: %s ' % total)
print "%s\t%d" % (word, total)

Overwriting reducer.py


Make mapper.py and reducer.py executable.

In [102]:
!chmod a+x mapper.py

In [103]:
!chmod a+x reducer.py

In [104]:
%%writefile pNaiveBayes.sh
## pNaiveBayes.sh
## Author: Jake Ryland Williams
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.
 
## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used
 
## a test set data of 100 messages
#data=`cat enronemail_1h.txt | cut -d$'\t' -f 3,4`
data="enronemail_1h.txt" 
    
## the full set of data (33746 messages)
# data="enronemail.txt" 
 
## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## debugging line
# echo "$linesindata"

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## debugging line
# echo "$linesinchunk"
 
## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.
 
## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ##### debugging line
#     echo "MAPPING CHUNK: $datachunk"
    
    ./mapper.py $datachunk "$wordlist" > $datachunk.counts &
    
    ##### debugging line
#     echo "FINISHED MAPPING CHUNK: $datachunk"
    ####
    ####
done
## wait for the mappers to finish their work
wait
 
## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## debugging line
# echo "$countfiles"
 
## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
##### debugging line
# echo "REDUCING CHUNKS"
./reducer.py $countfiles > $data.output
####
####

## print results

cat $data.output

## clean up the data chunks and temporary count files
\rm $data.chunk.*

Overwriting pNaiveBayes.sh


In [105]:
!chmod a+x pNaiveBayes.sh

In [106]:
!./pNaiveBayes.sh 2 "assistance"

assistance	10


__ANSWER: We see that the word "assistance" occurs ten times.__
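
The control flow of the shell script (split the file into chunks, run one mapper per chunk, sum the partial counts in the reducer) can be mimicked in memory. The sketch below is an illustration only: the function names and the 100-line stand-in data are hypothetical, the mappers run sequentially rather than as background processes, and the mapper's punctuation clean-up is left out.

```python
def split_into_chunks(lines, m):
    """Mirror the shell script: chunk size is floor(n / m) + 1 lines."""
    chunk_size = len(lines) // m + 1
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def map_count(chunk, target):
    """Mapper: count exact matches of `target` among whitespace-split tokens."""
    return sum(1 for line in chunk for token in line.split() if token == target)

def reduce_counts(partial_counts):
    """Reducer: add up the per-chunk counts."""
    return sum(partial_counts)

# Hypothetical stand-in for a 100-line data file.
lines = ["need assistance now" if i % 25 == 0 else "nothing to see"
         for i in range(100)]

chunks = split_into_chunks(lines, 2)
partials = [map_count(chunk, "assistance") for chunk in chunks]
print("chunk sizes: %s" % [len(c) for c in chunks])
print("partial counts: %s, total: %d" % (partials, reduce_counts(partials)))
```

Because the chunk size is rounded up with `+ 1`, the last chunk is usually shorter than the others, and for some values of `m` the split produces fewer than `m` chunks.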

#### HW1.3

__Provide a mapper/reducer pair that, when executed by *pNaiveBayes.sh* will classify the email messages by a single, user-specified word using the multinomial Naive Bayes Formulation. Examine the word “assistance” and report your results.__

In [186]:
%%writefile mapper_2.py
#!/usr/bin/python
## mapper_2.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: mapper code for HW1.3
import sys
import re
import json
import logging

## setting up logger
################################################################################################

################################################################################################
## NOTE: make sure to set the logging directory path appropriately as the below one is custom!!!
################################################################################################

################################################################################################

logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'map_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    stream = sys.stderr,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False


## collecting user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())
logging.debug("The words to be processed are: %s" % ','.join(findwords))

## setting up JSON that contains emitted information
emit_dict = {}
emit_dict['SPAM_total_counter'] = 0
emit_dict['HAM_total_counter'] = 0
emit_dict['words'] = {}

## Initializing counts for all the words in the vocabulary
for findword in findwords:
    emit_dict['words'][findword] = {"SPAM_count":0,"HAM_count":0}

count = 0
total_count = 0
WORD_RE = re.compile(r"[\w']+")

## opening the chunk file for mapping
with open (filename, "r") as myfile:
    ## iterating line by line
    for line in myfile:
        HAM = True
        line_components = line.split("\t")
        output_variable = int(line_components[1])
        ## check whether the training document is SPAM or HAM
#         if int(output_variable) == 0:
#             emit_dict['HAM_total_counter'] = emit_dict['HAM_total_counter'] + 1
#         else:
#             emit_dict['SPAM_total_counter'] = emit_dict['SPAM_total_counter'] + 1
#             HAM = False
        line = ' '.join([str(x) for x in line_components[2:]])
        logging.debug('Line: %s'% line)
        words = line.split()
        ## iterating through words in document
        for word in words:
            ## minimally trimming words in document
            logging.debug('Word: %s' % word)
            word = word.rstrip(',')
            word = word.rstrip(';')
            word = word.rstrip(':')
            word = word.rstrip('.')
            word = word.rstrip('"')
            word = word.lstrip('"')
            ## incrementing count of word for each of the classes as necessary
            for findword in findwords:
                if word == findword:
                    ## check whether the training document is SPAM or HAM
                    if int(output_variable) == 0:
                        emit_dict['HAM_total_counter'] = emit_dict['HAM_total_counter'] + 1
                    else:
                        emit_dict['SPAM_total_counter'] = emit_dict['SPAM_total_counter'] + 1
                        HAM = False
                    logging.debug('Word: [%s] is equal to [%s]! Increment!' % (word, findword))
                    if HAM:
                        emit_dict['words'][findword]['HAM_count'] = emit_dict['words'][findword]['HAM_count'] + 1
                    else:
                        emit_dict['words'][findword]['SPAM_count'] = emit_dict['words'][findword]['SPAM_count'] + 1
    print json.dumps(emit_dict)

Overwriting mapper_2.py


In [187]:
!chmod a+x mapper_2.py

In [188]:
%%writefile reducer_2.py
#!/usr/bin/python
## reducer_2.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: reducer code for HW1.3
import sys
import re
import json
import logging
import pprint

## setting up logger
################################################################################################

################################################################################################
## NOTE: make sure to set the logging directory path appropriately as the below one is custom!!!
################################################################################################

################################################################################################

logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'reduce_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    stream = sys.stderr,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False

WORD_RE = re.compile(r"[\w']+")

## collecting mapper input
logging.debug('Input file chunks: %s'% sys.argv)
## taking the raw set of documents as an input
raw_data = sys.argv[1]
## taking the chunk files array as an input
filenames = sys.argv[2:]

## set up JSON that contains emitted information
reduced_dict = {}
reduced_dict['SPAM_total_counter'] = 0
reduced_dict['HAM_total_counter'] = 0
reduced_dict['words'] = {}

sum = 0
accurate_count = 0
inaccurate_count = 0

for filename in filenames:
    logging.debug('processing input file: %s'% filename)
    with open (filename, "r") as myfile:
        data = json.load(myfile)
        logging.debug('MAP_DICT -> SPAM_total_counter: %s ' % data['SPAM_total_counter'])
        reduced_dict['SPAM_total_counter'] = reduced_dict['SPAM_total_counter'] + data['SPAM_total_counter']
        logging.debug('REDUCED_DICT -> SPAM_total_counter: %s ' % reduced_dict['SPAM_total_counter'])
        logging.debug('MAP_DICT -> HAM_total_counter: %s ' % data['HAM_total_counter'])
        reduced_dict['HAM_total_counter'] = reduced_dict['HAM_total_counter'] + data['HAM_total_counter']
        logging.debug('REDUCED_DICT -> HAM_total_counter: %s ' % reduced_dict['HAM_total_counter'])
        for word in data['words']:
            if word in reduced_dict['words'].keys():
                logging.debug('%s -> HAM_counter: %s ' % (word,reduced_dict['words'][word]['HAM_count']))
                reduced_dict['words'][word]['HAM_count'] = reduced_dict['words'][word]['HAM_count'] + data['words'][word]['HAM_count']
                reduced_dict['words'][word]['SPAM_count'] = reduced_dict['words'][word]['SPAM_count'] + data['words'][word]['SPAM_count']
            else:
                reduced_dict['words'][word] = data['words'][word]
                
for word in reduced_dict['words'].keys():
    
    reduced_dict['words'][word]['SPAM_cond_prob'] = float(reduced_dict['words'][word]['SPAM_count']) / float(reduced_dict['SPAM_total_counter'])
    reduced_dict['words'][word]['HAM_cond_prob'] = float(reduced_dict['words'][word]['HAM_count']) / float(reduced_dict['HAM_total_counter'])

total_words = reduced_dict['SPAM_total_counter'] + reduced_dict['HAM_total_counter']
reduced_dict['SPAM_prob_total'] = float(reduced_dict['SPAM_total_counter']) / float(total_words)
reduced_dict['HAM_prob_total'] = float(reduced_dict['HAM_total_counter']) / float(total_words)

for word in reduced_dict['words'].keys():
    reduced_dict['words'][word]['HAM_prob_cond_doc'] = float(reduced_dict['HAM_prob_total']) * float(reduced_dict['words'][word]['HAM_cond_prob'])
    reduced_dict['words'][word]['SPAM_prob_cond_doc'] = float(reduced_dict['SPAM_prob_total']) * float(reduced_dict['words'][word]['SPAM_cond_prob'])
    reduced_dict['words'][word]['IS_SPAM'] = (1 if reduced_dict['words'][word]['SPAM_prob_cond_doc'] > reduced_dict['words'][word]['HAM_prob_cond_doc'] else 0)

total_documents = 0

## running classification now that model has been fitted
with open (raw_data, "r") as myfile:
    for line in myfile:
        total_documents = total_documents + 1
        found_vocabulary = False
        line_components = line.split("\t")
        output_variable = int(line_components[1])
        variable_id = line_components[0]
        line = ' '.join([str(x) for x in line_components[2:]])
        logging.debug('Line: %s'% line)
        words = line.split()
        ## default prediction: the class with the larger prior
        if reduced_dict['SPAM_prob_total'] > reduced_dict['HAM_prob_total']:
            predicted_outcome = 1
        else:
            predicted_outcome = 0
        for findword in reduced_dict['words'].keys():
            for word in words:
                logging.debug('Word: %s' % word)
                word = word.rstrip(',')
                word = word.rstrip(';')
                word = word.rstrip(':')
                word = word.rstrip('.')
                word = word.rstrip('"')
                word = word.lstrip('"')
                if word == findword:
                    ## a vocabulary word was found: use its fitted class
                    found_vocabulary = True
                    predicted_outcome = reduced_dict['words'][findword]['IS_SPAM']
                    break
            if found_vocabulary:
                break
        ## score each document exactly once
        if output_variable == predicted_outcome:
            accurate_count = accurate_count + 1
        else:
            inaccurate_count = inaccurate_count + 1
        print "%s\t%d\t%d" % (variable_id, output_variable, predicted_outcome)
print "Accurate count: %s" % accurate_count
print "Inaccurate count: %s" % inaccurate_count
print "accurate_count type: %s" % type(accurate_count)
print "Multinomial Naive Bayes accuracy: %.2f" % (float(int(accurate_count)) / float(total_documents))

Overwriting reducer_2.py


In [189]:
!chmod a+x reducer_2.py

In [190]:
%%writefile pNaiveBayes_2.sh
## pNaiveBayes.sh
## Author: Carlos Eduardo Rodriguez Castillo (original author: Jake Ryland Williams)
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.
 
## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used
 
## a test set data of 100 messages
#data=`cat enronemail_1h.txt | cut -d$'\t' -f 3,4`
data="enronemail_1h.txt" 
    
## the full set of data (33746 messages)
# data="enronemail.txt" 
 
## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## debugging line
# echo "$linesindata"

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## debugging line
# echo "$linesinchunk"
 
## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.
 
## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ##### debugging line
#     echo "MAPPING CHUNK: $datachunk"
    
    ./mapper_2.py $datachunk "$wordlist" > $datachunk.counts &
    
    ##### debugging line
#     echo "FINISHED MAPPING CHUNK: $datachunk"
    ####
    ####
done
## wait for the mappers to finish their work
wait
 
## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## debugging line
# echo "$countfiles"
 
## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
##### debugging line
# echo "REDUCING CHUNKS"
./reducer_2.py $data $countfiles > $data.output
####
####

## print results

cat $data.output

## clean up the data chunks and temporary count files
\rm $data.chunk.*

Overwriting pNaiveBayes_2.sh


In [191]:
!chmod a+x pNaiveBayes_2.sh

In [192]:
!./pNaiveBayes_2.sh 2 "assistance"

0001.1999-12-10.farmer	0	1
0001.1999-12-10.kaminski	0	1
0001.2000-01-17.beck	0	1
0001.2000-06-06.lokay	0	1
0001.2001-02-07.kitchen	0	1
0001.2001-04-02.williams	0	1
0002.1999-12-13.farmer	0	1
0002.2001-02-07.kitchen	0	1
0002.2001-05-25.SA_and_HP	1	1
0002.2003-12-18.GP	1	1
0002.2004-08-01.BG	1	1
0003.1999-12-10.kaminski	0	1
0003.1999-12-14.farmer	0	1
0003.2000-01-17.beck	0	1
0003.2001-02-08.kitchen	0	1
0003.2003-12-18.GP	1	1
0003.2004-08-01.BG	1	1
0004.1999-12-10.kaminski	0	1
0004.1999-12-14.farmer	0	1
0004.2001-04-02.williams	0	1
0004.2001-06-12.SA_and_HP	1	1
0004.2004-08-01.BG	1	1
0005.1999-12-12.kaminski	0	1
0005.1999-12-14.farmer	0	1
0005.2000-06-06.lokay	0	1
0005.2001-02-08.kitchen	0	1
0005.2001-06-23.SA_and_HP	1	1
0005.2003-12-18.GP	1	1
0006.1999-12-13.kaminski	0	1
0006.2001-02-08.kitchen	0	1
0006.2001-04-03.williams	0	1
0006.2001-06-25.SA_and_HP	1	1
0006.2003-12-18.GP	1	1
0006.2004-08-01.BG	1	1
0007.1999-12-13.kaminski	0	1
0007.1999-12-14.farmer	0	1
0007.2000-01-17.beck	0	1
0007.2

__ANSWER: We see that our accuracy is low (44%).__
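
One way to see why the predictions are uniform: in `mapper_2.py` the `SPAM_total_counter` and `HAM_total_counter` are incremented only when a vocabulary word is matched, so with a single word the per-class totals equal the per-word counts. The arithmetic below walks through what `reducer_2.py` then computes. The 8/2 split between classes is a hypothetical assumption for illustration; only the total of ten occurrences comes from HW1.2.

```python
# Hypothetical split of the 10 "assistance" occurrences between the classes.
spam_count, ham_count = 8, 2

# With one vocabulary word, the per-class totals equal the per-word counts.
spam_total, ham_total = spam_count, ham_count

# Conditional probabilities as computed in the reducer: count / class total.
spam_cond = spam_count / float(spam_total)
ham_cond = ham_count / float(ham_total)

# "Priors" as computed in the reducer: share of matched words per class.
spam_prior = spam_total / float(spam_total + ham_total)
ham_prior = ham_total / float(spam_total + ham_total)

# Decision for documents containing the word, and fallback for the rest.
is_spam = 1 if spam_prior * spam_cond > ham_prior * ham_cond else 0
default_class = 1 if spam_prior > ham_prior else 0

print("cond probs: %.1f / %.1f" % (spam_cond, ham_cond))
print("with word -> %d, without word -> %d" % (is_spam, default_class))
```

Both conditional probabilities come out as 1, so the word carries no information and every document receives the class with the larger share of matches. When every document is labeled SPAM, the accuracy is simply the share of SPAM documents in the file.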

#### HW1.4

__Provide a mapper/reducer pair that, when executed by *pNaiveBayes.sh* will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results (accuracy).__

*__For this exercise I will use the same mapper code that I wrote for HW1.3. That said, I need to enhance the reducer code such that it takes into consideration the probabilities of multiple vocabulary words in computing the final predicted class for each document.__*
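
For comparison, combining several vocabulary words by summing log-probabilities can be sketched in its textbook form. This is not the reducer that follows: it assumes document-level priors, Laplace (add-one) smoothing restricted to the vocabulary, and ties resolved in favor of HAM. The function names and the five training documents are hypothetical, and both classes are assumed to appear in the training data.

```python
from collections import Counter
from math import log

def train(docs, vocabulary):
    """Fit log-priors and smoothed log-conditionals for classes 0 (HAM) and 1 (SPAM)."""
    doc_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    for label, text in docs:
        doc_counts[label] += 1
        for token in text.split():
            if token in vocabulary:
                word_counts[label][token] += 1
    model = {}
    for label in (0, 1):
        total = sum(word_counts[label].values())
        model[label] = {
            "log_prior": log(doc_counts[label] / float(len(docs))),
            # add-one smoothing keeps unseen words away from log(0)
            "log_cond": dict(
                (w, log((word_counts[label][w] + 1.0) / (total + len(vocabulary))))
                for w in vocabulary),
        }
    return model

def classify(model, text):
    """Sum log-prior and per-token log-conditionals; ties go to HAM (0)."""
    scores = {}
    for label in (0, 1):
        score = model[label]["log_prior"]  # start fresh for every document
        for token in text.split():
            if token in model[label]["log_cond"]:
                score += model[label]["log_cond"][token]
        scores[label] = score
    return 0 if scores[0] >= scores[1] else 1

# Hypothetical training documents: (label, text).
docs = [
    (1, "cheap valium here"),
    (1, "need assistance buying valium"),
    (0, "meeting agenda attached"),
    (0, "thanks for your assistance"),
    (0, "see schedule"),
]
vocabulary = ["assistance", "valium", "enlargementWithATypo"]
model = train(docs, vocabulary)

for text in ["valium for sale", "thanks for the assistance", "lunch tomorrow"]:
    print("%-28s -> %d" % (text, classify(model, text)))
```

Working in log space avoids underflow when many small probabilities are multiplied, and smoothing means a word with zero count in one class (such as the deliberately misspelled one) lowers that class's score instead of being skipped.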

In [193]:
%%writefile reducer_3.py
#!/usr/bin/python
## reducer_3.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: reducer code for HW1.4
import sys
import re
import json
import logging
import pprint
from math import log

## setting up logger
logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'reduce_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    stream = sys.stderr,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False

WORD_RE = re.compile(r"[\w']+")

## collecting mapper input
logging.debug('Input file chunks: %s'% sys.argv)
## taking the raw set of documents as an input
raw_data = sys.argv[1]
## taking the chunk files array as an input
filenames = sys.argv[2:]

## set up JSON that contains emitted information
reduced_dict = {}
reduced_dict['SPAM_total_counter'] = 0
reduced_dict['HAM_total_counter'] = 0
reduced_dict['words'] = {}

sum = 0
accurate_count = 0
inaccurate_count = 0

for filename in filenames:
    logging.debug('processing input file: %s'% filename)
    with open (filename, "r") as myfile:
        data = json.load(myfile)
        logging.debug('MAP_DICT -> SPAM_total_counter: %s ' % data['SPAM_total_counter'])
        reduced_dict['SPAM_total_counter'] = reduced_dict['SPAM_total_counter'] + data['SPAM_total_counter']
        logging.debug('REDUCED_DICT -> SPAM_total_counter: %s ' % reduced_dict['SPAM_total_counter'])
        logging.debug('MAP_DICT -> HAM_total_counter: %s ' % data['HAM_total_counter'])
        reduced_dict['HAM_total_counter'] = reduced_dict['HAM_total_counter'] + data['HAM_total_counter']
        logging.debug('REDUCED_DICT -> HAM_total_counter: %s ' % reduced_dict['HAM_total_counter'])
        for word in data['words']:
            if word in reduced_dict['words'].keys():
                logging.debug('%s -> HAM_counter: %s ' % (word,reduced_dict['words'][word]['HAM_count']))
                reduced_dict['words'][word]['HAM_count'] = reduced_dict['words'][word]['HAM_count'] + data['words'][word]['HAM_count']
                reduced_dict['words'][word]['SPAM_count'] = reduced_dict['words'][word]['SPAM_count'] + data['words'][word]['SPAM_count']
            else:
                reduced_dict['words'][word] = data['words'][word]
                
for word in reduced_dict['words'].keys():
    
    reduced_dict['words'][word]['SPAM_cond_prob'] = float(reduced_dict['words'][word]['SPAM_count']) / float(reduced_dict['SPAM_total_counter'])
    reduced_dict['words'][word]['HAM_cond_prob'] = float(reduced_dict['words'][word]['HAM_count']) / float(reduced_dict['HAM_total_counter'])

total_words = reduced_dict['SPAM_total_counter'] + reduced_dict['HAM_total_counter']
reduced_dict['SPAM_prob_total'] = float(reduced_dict['SPAM_total_counter']) / float(total_words)
reduced_dict['HAM_prob_total'] = float(reduced_dict['HAM_total_counter']) / float(total_words)

HAM_prob_cond_doc = log(reduced_dict['HAM_prob_total'])
SPAM_prob_cond_doc = log(reduced_dict['SPAM_prob_total'])

total_documents = 0

## running classification now that model has been fitted
with open (raw_data, "r") as myfile:
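    for line in myfile:
        total_documents = total_documents + 1
        ## reset the per-document log-probabilities to the class priors
        ## so that scores do not accumulate from one document to the next
        HAM_prob_cond_doc = log(reduced_dict['HAM_prob_total'])
        SPAM_prob_cond_doc = log(reduced_dict['SPAM_prob_total'])
        found_vocabulary = False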
        line_components = line.split("\t")
        output_variable = int(line_components[1])
        variable_id = line_components[0]
        line = ' '.join([str(x) for x in line_components[2:]])
        logging.debug('Line: %s'% line)
        words = line.split()
        for findword in reduced_dict['words'].keys():
            for word in words:
                logging.debug('Word: %s' % word)
                word = word.rstrip(',')
                word = word.rstrip(';')
                word = word.rstrip(':')
                word = word.rstrip('.')
                word = word.rstrip('"')
                word = word.lstrip('"')
                if word == findword:
                    if reduced_dict['words'][word]['HAM_cond_prob'] > 0:
                        HAM_prob_cond_doc = HAM_prob_cond_doc + log(reduced_dict['words'][word]['HAM_cond_prob'])
                    if reduced_dict['words'][word]['SPAM_cond_prob'] > 0:
                        SPAM_prob_cond_doc = SPAM_prob_cond_doc  + log(reduced_dict['words'][word]['SPAM_cond_prob'])
                    found_vocabulary = True
        if found_vocabulary:
            if HAM_prob_cond_doc >= SPAM_prob_cond_doc:
                predicted_outcome = 0
            else:
                predicted_outcome = 1      
        if not found_vocabulary:
            if reduced_dict['HAM_prob_total'] >= reduced_dict['SPAM_prob_total']:
                predicted_outcome = 0
            else:
                predicted_outcome = 1
        if output_variable == predicted_outcome:
            accurate_count = accurate_count + 1
        else:
            inaccurate_count = inaccurate_count + 1
        print "%s\t%d\t%d" % (variable_id, output_variable, predicted_outcome)
print "Accurate count: %s" % accurate_count
print "Inaccurate count: %s" % inaccurate_count
print "accurate_count type: %s" % type(accurate_count)
print "Multinomial Naive Bayes accuracy: %.2f" % (float(int(accurate_count)) / float(total_documents))

Overwriting reducer_3.py


In [194]:
!chmod a+x reducer_3.py

In [195]:
%%writefile pNaiveBayes_3.sh
## pNaiveBayes.sh
## Author: Carlos Eduardo Rodriguez Castillo (original author: Jake Ryland Williams)
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.
 
## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used
 
## a test set data of 100 messages
#data=`cat enronemail_1h.txt | cut -d$'\t' -f 3,4`
data="enronemail_1h.txt" 
    
## the full set of data (33746 messages)
# data="enronemail.txt" 
 
## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## debugging line
# echo "$linesindata"

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## debugging line
# echo "$linesinchunk"
 
## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.
 
## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ##### debugging line
#     echo "MAPPING CHUNK: $datachunk"
    
    ./mapper_2.py $datachunk "$wordlist" > $datachunk.counts &
    
    ##### debugging line
#     echo "FINISHED MAPPING CHUNK: $datachunk"
#     echo "THE WORDLIST IS $wordlist"
    ####
    ####
done
## wait for the mappers to finish their work
wait

##### debugging lines
# echo "Content of transmitted data (raw):"
# for file in $data.chunk.*.counts; do
#   cat $file
# done
 
## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## debugging lines
# echo "$countfiles"
# echo "Content of transmitted data:"

# for file in $countfiles; do
#   cat $file
# done
 
## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
##### debugging line
# echo "REDUCING CHUNKS"
./reducer_3.py $data $countfiles > $data.output
####
####

## print results

cat $data.output

## clean up the data chunks and temporary count files
\rm $data.chunk.*

Overwriting pNaiveBayes_3.sh


In [196]:
!chmod a+x pNaiveBayes_3.sh

In [197]:
!./pNaiveBayes_3.sh 2 "assistance valium enlargementWithATypo"

0001.1999-12-10.farmer	0	1
0001.1999-12-10.kaminski	0	1
0001.2000-01-17.beck	0	1
0001.2000-06-06.lokay	0	1
0001.2001-02-07.kitchen	0	1
0001.2001-04-02.williams	0	1
0002.1999-12-13.farmer	0	1
0002.2001-02-07.kitchen	0	1
0002.2001-05-25.SA_and_HP	1	1
0002.2003-12-18.GP	1	1
0002.2004-08-01.BG	1	1
0003.1999-12-10.kaminski	0	1
0003.1999-12-14.farmer	0	1
0003.2000-01-17.beck	0	1
0003.2001-02-08.kitchen	0	1
0003.2003-12-18.GP	1	1
0003.2004-08-01.BG	1	1
0004.1999-12-10.kaminski	0	1
0004.1999-12-14.farmer	0	1
0004.2001-04-02.williams	0	1
0004.2001-06-12.SA_and_HP	1	1
0004.2004-08-01.BG	1	1
0005.1999-12-12.kaminski	0	1
0005.1999-12-14.farmer	0	1
0005.2000-06-06.lokay	0	1
0005.2001-02-08.kitchen	0	1
0005.2001-06-23.SA_and_HP	1	1
0005.2003-12-18.GP	1	1
0006.1999-12-13.kaminski	0	1
0006.2001-02-08.kitchen	0	1
0006.2001-04-03.williams	0	1
0006.2001-06-25.SA_and_HP	1	1
0006.2003-12-18.GP	1	1
0006.2004-08-01.BG	1	1
0007.1999-12-13.kaminski	0	1
0007.1999-12-14.farmer	

__ANSWER: We see that our accuracy is very poor, 36%.__
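
The decision rule that the reducer applies to each message can be sketched as a standalone function. This is a minimal sketch rather than the reducer itself: the function name `classify` and the `toy_model` values are hypothetical and chosen for illustration only (they are not fitted on the Enron data), while the dictionary keys mirror those of `reduced_dict`. Each document's score starts from the log class priors, and the log conditional probability of every vocabulary word found in the message is added.

```python
from __future__ import division, print_function
from math import log


def classify(words, model):
    """Return 0 (HAM) or 1 (SPAM) for one tokenized document."""
    # Start every document from the log class priors.
    ham_score = log(model['HAM_prob_total'])
    spam_score = log(model['SPAM_prob_total'])
    for word in words:
        probs = model['words'].get(word)
        if probs is None:
            continue  # word outside the vocabulary: ignore it
        # Skip zero probabilities, as the reducer does, since log(0) is undefined.
        if probs['HAM_cond_prob'] > 0:
            ham_score += log(probs['HAM_cond_prob'])
        if probs['SPAM_cond_prob'] > 0:
            spam_score += log(probs['SPAM_cond_prob'])
    # Ties go to HAM, matching the >= comparison in the reducer.
    return 0 if ham_score >= spam_score else 1


# Hypothetical toy model, for illustration only.
toy_model = {
    'HAM_prob_total': 0.5,
    'SPAM_prob_total': 0.5,
    'words': {
        'assistance': {'HAM_cond_prob': 0.02, 'SPAM_cond_prob': 0.01},
        'valium': {'HAM_cond_prob': 0.001, 'SPAM_cond_prob': 0.03},
    },
}

print(classify(['need', 'assistance'], toy_model))
print(classify(['cheap', 'valium'], toy_model))
```

Note that skipping a zero probability means a word never seen in one class carries no penalty for that class, which is one reason a three-word vocabulary without smoothing can classify poorly.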

#### HW1.5

__Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will classify the email messages by all words present.__

In [198]:
%%writefile mapper_3.py
#!/usr/bin/python
## mapper.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: mapper code for HW1.5
import sys
import re
import json
import logging

## setting up logger
logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'map_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False


## collecting user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())
logging.debug("The words to be processed are: %s" % ','.join(findwords))

## setting up JSON that contains emitted information
emit_dict = {}
emit_dict['SPAM_total_counter'] = 0
emit_dict['HAM_total_counter'] = 0
emit_dict['words'] = {}

if findwords[0] != "*":
    logging.debug("Illegal word argument for this exercise (%s)! Please define the vocabulary exclusively as '*'" % findwords)
    ## report the error on STDERR so that it does not end up in the .counts file
    sys.stderr.write("Illegal word argument for this exercise (%s)! Please define the vocabulary exclusively as '*'\n" % findwords)
    sys.exit(1)

## Initializing counts for all the words in the vocabulary
# for findword in findwords:
#     emit_dict['words'][findword] = {"SPAM_count":0,"HAM_count":0}
#     emit_dict['words'][findword] = {"SPAM_count":0,"HAM_count":0}

count = 0
total_count = 0
WORD_RE = re.compile(r"[\w']+")

## opening the chunk file for mapping
with open (filename, "r") as myfile:
    ## iterating line by line
    for line in myfile:
        HAM = True
        line_components = line.split("\t")
        output_variable = line_components[1]
        line = ' '.join([str(x) for x in line_components[2:]])
        logging.debug('Line: %s'% line)
        words = line.split()
        ## iterating through words in document
        for word in words:
            ## trimming words in document
            logging.debug('Word: %s' % word)
            word = word.rstrip(',')
            word = word.rstrip(';')
            word = word.rstrip(':')
            word = word.rstrip('.')
            word = word.rstrip('"')
            word = word.lstrip('"')
            if int(output_variable) == 0:
                emit_dict['HAM_total_counter'] = emit_dict['HAM_total_counter'] + 1
            else:
                emit_dict['SPAM_total_counter'] = emit_dict['SPAM_total_counter'] + 1
                HAM = False
            if word not in emit_dict['words']:
                emit_dict['words'][word] = {"SPAM_count":0,"HAM_count":0}
            if HAM:
                emit_dict['words'][word]['HAM_count'] = emit_dict['words'][word]['HAM_count'] + 1
            else:
                emit_dict['words'][word]['SPAM_count'] = emit_dict['words'][word]['SPAM_count'] + 1
    print json.dumps(emit_dict)

Overwriting mapper_3.py


In [199]:
!chmod a+x mapper_3.py

In [208]:
%%writefile reducer_4.py
#!/usr/bin/python
## reducer.py
## Author: Carlos Eduardo Rodriguez Castillo
## Description: reducer code for HW1.5
import sys
import re
import json
import logging
import pprint
from math import log

## setting up logger
################################################################################################
## NOTE: make sure to set the logging directory path appropriately as the below one is custom!!!
################################################################################################
logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'reduce_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False

WORD_RE = re.compile(r"[\w']+")

## collecting mapper input
logging.debug('Input file chunks: %s'% sys.argv)
## taking the raw set of documents as an input
raw_data = sys.argv[1]
## taking the chunk files array as an input
filenames = sys.argv[2:]

## set up JSON that contains emitted information
reduced_dict = {}
reduced_dict['SPAM_total_counter'] = 0
reduced_dict['HAM_total_counter'] = 0
reduced_dict['words'] = {}

sum = 0
accurate_count = 0
inaccurate_count = 0
total_documents = 0

for filename in filenames:
    logging.debug('processing input file: %s'% filename)
    with open (filename, "r") as myfile:
        data = json.load(myfile)
        logging.debug('MAP_DICT -> SPAM_total_counter: %s ' % data['SPAM_total_counter'])
        reduced_dict['SPAM_total_counter'] = reduced_dict['SPAM_total_counter'] + data['SPAM_total_counter']
        logging.debug('REDUCED_DICT -> SPAM_total_counter: %s ' % reduced_dict['SPAM_total_counter'])
        logging.debug('MAP_DICT -> HAM_total_counter: %s ' % data['HAM_total_counter'])
        reduced_dict['HAM_total_counter'] = reduced_dict['HAM_total_counter'] + data['HAM_total_counter']
        logging.debug('REDUCED_DICT -> HAM_total_counter: %s ' % reduced_dict['HAM_total_counter'])
        for word in data['words']:
            if word in reduced_dict['words'].keys():
                logging.debug('%s -> HAM_counter: %s ' % (word,reduced_dict['words'][word]['HAM_count']))
                
                ## assigning the word counts
                reduced_dict['words'][word]['HAM_count'] = reduced_dict['words'][word]['HAM_count'] + data['words'][word]['HAM_count']
                reduced_dict['words'][word]['SPAM_count'] = reduced_dict['words'][word]['SPAM_count'] + data['words'][word]['SPAM_count']
                
                ## adding 1 to zero counts only, as a simple stand-in for Laplace smoothing
                ## (full Laplace smoothing would add 1 to every count)
                if reduced_dict['words'][word]['HAM_count'] == 0:
                    reduced_dict['words'][word]['HAM_count'] = reduced_dict['words'][word]['HAM_count'] + data['words'][word]['HAM_count'] + 1
                if reduced_dict['words'][word]['SPAM_count'] == 0:
                    reduced_dict['words'][word]['SPAM_count'] = reduced_dict['words'][word]['SPAM_count'] + data['words'][word]['SPAM_count'] + 1
            else:
                reduced_dict['words'][word] = data['words'][word]
                if reduced_dict['words'][word]['HAM_count'] == 0:
                    reduced_dict['words'][word]['HAM_count'] = reduced_dict['words'][word]['HAM_count'] + data['words'][word]['HAM_count'] + 1
                if reduced_dict['words'][word]['SPAM_count'] == 0:
                    reduced_dict['words'][word]['SPAM_count'] = reduced_dict['words'][word]['SPAM_count'] + data['words'][word]['SPAM_count'] + 1
                
for word in reduced_dict['words'].keys():
    
    reduced_dict['words'][word]['SPAM_cond_prob'] = float(reduced_dict['words'][word]['SPAM_count']) / float(reduced_dict['SPAM_total_counter'])
    reduced_dict['words'][word]['HAM_cond_prob'] = float(reduced_dict['words'][word]['HAM_count']) / float(reduced_dict['HAM_total_counter'])

total_words = reduced_dict['SPAM_total_counter'] + reduced_dict['HAM_total_counter']
reduced_dict['SPAM_prob_total'] = float(reduced_dict['SPAM_total_counter']) / float(total_words)
reduced_dict['HAM_prob_total'] = float(reduced_dict['HAM_total_counter']) / float(total_words)

HAM_prob_cond_doc = log(reduced_dict['HAM_prob_total'])
SPAM_prob_cond_doc = log(reduced_dict['SPAM_prob_total'])

## running classification now that model has been fitted
count = 0
with open (raw_data, "r") as myfile:
    for line in myfile:
        total_documents = total_documents + 1
        HAM_prob_cond_doc = log(reduced_dict['HAM_prob_total'])
        SPAM_prob_cond_doc = log(reduced_dict['SPAM_prob_total'])
        count = count + 1
        found_vocabulary = False
        line_components = line.split("\t")
        variable_id = line_components[0]
        output_variable = int(line_components[1])
        line = ' '.join([str(x) for x in line_components[2:]])
        logging.debug('Line: %s'% line)
#         print "Processing line: %s" % line
        words = line.split()
#         for findword in reduced_dict['words'].keys():
        for word in words:
            logging.debug('Word: %s' % word)
            word = word.rstrip(',')
            word = word.rstrip(';')
            word = word.rstrip(':')
            word = word.rstrip('.')
            word = word.rstrip('"')
            word = word.lstrip('"')
            ## only score words that are in the fitted vocabulary,
            ## and only take the log of non-zero probabilities
            if word in reduced_dict['words']:
                found_vocabulary = True
                if reduced_dict['words'][word]['HAM_cond_prob'] > 0:
                    HAM_prob_cond_doc = HAM_prob_cond_doc + log(reduced_dict['words'][word]['HAM_cond_prob'])
                if reduced_dict['words'][word]['SPAM_cond_prob'] > 0:
                    SPAM_prob_cond_doc = SPAM_prob_cond_doc  + log(reduced_dict['words'][word]['SPAM_cond_prob'])
        if found_vocabulary:
            if HAM_prob_cond_doc >= SPAM_prob_cond_doc:
                predicted_outcome = 0
            else:
                predicted_outcome = 1      
        if not found_vocabulary:
#             print "Pretty sure this can never happen!!!"
            if reduced_dict['HAM_prob_total'] >= reduced_dict['SPAM_prob_total']:
                predicted_outcome = 0
            else:
                predicted_outcome = 1
        if output_variable == predicted_outcome:
            accurate_count = accurate_count + 1
        else:
            inaccurate_count = inaccurate_count + 1
#         print "True outcome: %.2f Prediction: %.2f\nHAM_PROB: %.2f SPAM_PROB: %.2f" % (output_variable, predicted_outcome, HAM_prob_cond_doc, SPAM_prob_cond_doc)
        print "%s\t%d\t%d" % (variable_id, output_variable, predicted_outcome)
print "Accurate count: %s" % accurate_count
print "Inaccurate count: %s" % inaccurate_count
print "Accurate_count type: %s" % type(accurate_count)
print "Multinomial Naive Bayes accuracy: %.2f" % (float(int(accurate_count)) / float(total_documents))
# print reduced_dict

Overwriting reducer_4.py


In [209]:
!chmod a+x reducer_4.py

In [210]:
%%writefile pNaiveBayes_4.sh
## pNaiveBayes.sh
## Author: Carlos Eduardo Rodriguez Castillo (original author: Jake Ryland Williams)
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.
 
## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used
 
## a test set data of 100 messages
#data=`cat enronemail_1h.txt | cut -d$'\t' -f 3,4`
data="enronemail_1h.txt"
    
## the full set of data (33746 messages)
# data="enronemail.txt" 
 
## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## debugging line
# echo "$linesindata"

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## debugging line
# echo "$linesinchunk"
 
## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.
 
## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ##### debugging line
#     echo "MAPPING CHUNK: $datachunk"
    
    ./mapper_3.py $datachunk "$wordlist" > $datachunk.counts &
    
    ##### debugging line
#     echo "FINISHED MAPPING CHUNK: $datachunk"
    #echo "THE WORDLIST IS $wordlist"
    ####
    ####
done
## wait for the mappers to finish their work
wait

#echo "Content of transmitted data (raw):"

#for file in $data.chunk.*.counts; do
#   cat $file
# done
 
## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## debugging line
# echo "$countfiles"
# echo "Content of transmitted data:"

# for file in $countfiles; do
#   cat $file
# done
 
## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
##### debugging line
# echo "REDUCING CHUNKS"
./reducer_4.py $data $countfiles > $data.output
####
####

## print results

cat $data.output

## clean up the data chunks and temporary count files
\rm $data.chunk.*

Overwriting pNaiveBayes_4.sh


In [211]:
!chmod a+x pNaiveBayes_4.sh

In [212]:
!./pNaiveBayes_4.sh 5 "*"

0001.1999-12-10.farmer	0	1
0001.1999-12-10.kaminski	0	0
0001.2000-01-17.beck	0	0
0001.2000-06-06.lokay	0	0
0001.2001-02-07.kitchen	0	0
0001.2001-04-02.williams	0	0
0002.1999-12-13.farmer	0	0
0002.2001-02-07.kitchen	0	0
0002.2001-05-25.SA_and_HP	1	1
0002.2003-12-18.GP	1	1
0002.2004-08-01.BG	1	1
0003.1999-12-10.kaminski	0	0
0003.1999-12-14.farmer	0	0
0003.2000-01-17.beck	0	0
0003.2001-02-08.kitchen	0	0
0003.2003-12-18.GP	1	1
0003.2004-08-01.BG	1	1
0004.1999-12-10.kaminski	0	0
0004.1999-12-14.farmer	0	0
0004.2001-04-02.williams	0	0
0004.2001-06-12.SA_and_HP	1	1
0004.2004-08-01.BG	1	1
0005.1999-12-12.kaminski	0	0
0005.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0005.2001-02-08.kitchen	0	0
0005.2001-06-23.SA_and_HP	1	1
0005.2003-12-18.GP	1	1
0006.1999-12-13.kaminski	0	0
0006.2001-02-08.kitchen	0	0
0006.2001-04-03.williams	0	0
0006.2001-06-25.SA_and_HP	1	1
0006.2003-12-18.GP	1	1
0006.2004-08-01.BG	1	1
0007.1999-12-13.kaminski	0	0
0007.1999-12-14.farmer	

__ANSWER: We see a fairly high training accuracy of 97%, i.e. a training error of 3%.__
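
The accuracy figure can be recomputed from the tab-separated prediction lines (`id`, true label, predicted label) that the reducer prints. The following is a minimal sketch: the function name `score_predictions` and the sample records are hypothetical, and lines that are not complete three-column records (such as the summary lines or a truncated record) are skipped.

```python
from __future__ import division, print_function


def score_predictions(lines):
    """Return (accuracy, training_error) from 'id<TAB>truth<TAB>prediction' lines."""
    correct = 0
    total = 0
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Skip anything that is not a complete three-column record,
        # e.g. the summary lines printed after the predictions.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        total += 1
        if int(fields[1]) == int(fields[2]):
            correct += 1
    if total == 0:
        raise ValueError("no prediction records found")
    accuracy = correct / total
    return accuracy, 1 - accuracy


# Hypothetical sample records, for illustration only.
sample_lines = [
    "doc_a\t0\t1\n",
    "doc_b\t0\t0\n",
    "doc_c\t1\t1\n",
    "doc_d\t1\t1\n",
    "doc_e\t\n",              # truncated record: skipped
    "Accurate count: 3\n",    # summary line: skipped
]

accuracy, error = score_predictions(sample_lines)
print("accuracy: %.2f, training error: %.2f" % (accuracy, error))
```

Under the assumption that the script's output file is still on disk, the same function could be applied to `open("enronemail_1h.txt.output")`.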

#### HW1.6

__Benchmark your code with the Python SciKit-Learn implementation of multinomial Naive Bayes.__

__A. Run the Multinomial Naive Bayes algorithm (using default settings) from SciKit-Learn over the same training data used in HW1.5 and report the Training error (please note some data preparation might be needed to get the Multinomial Naive Bayes algorithm from SciKit-Learn to run over this dataset)__

__B. Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) over the same training data used in HW1.5 and report the Training error__

__C. Run the Multinomial Naive Bayes algorithm you developed for HW1.5 over the same data used in HW1.5 and report the Training error__

__D. Please prepare a table to present your results__

__E. Explain/justify any differences in terms of training error rates over the dataset in HW1.5 between your Multinomial Naive Bayes implementation (in Map Reduce) versus the Multinomial Naive Bayes implementation in SciKit-Learn (Hint: smoothing, which we will discuss in next lecture)__

__F. Discuss the performance differences in terms of training error rates over the dataset in HW1.5 between the Multinomial Naive Bayes implementation in SciKit-Learn with the  Bernoulli Naive Bayes implementation in SciKit-Learn__

In [2]:
import numpy as np

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for feature extraction from text.
from sklearn.feature_extraction.text import CountVectorizer

import sys
import re
import logging
import pprint
from math import log

## setting up logger
logging_directory_path = '/home/crodriguez1/W261/HW1/log'
logging_file = logging_directory_path + "/" + 'sklearn_log.log'
log_format = '%(levelname)s\n%(asctime)s.%(msecs)-3d filename:%(filename)-20s line:%(lineno)-5d \n%(message)s\n\n'
log_date_format = '%H:%M:%S'
logging.basicConfig(filename = logging_file,
    level = logging.DEBUG,
    format = log_format,
    datefmt = log_date_format)
suds_logger = logging.getLogger("suds")
suds_logger.propagate = False

WORD_RE = re.compile(r"[\w']+")

train_data = []
train_labels = []

## Training error calculating function
def training_error(labels, predictions):
    errors = 0
    num_observations = len(labels)
    ## compare the arguments passed in, not the module-level MultinomialNB results
    for truth, prediction in zip(labels, predictions):
        logging.debug("Truth: %s | Prediction: %s"% (truth, prediction))
        if truth != prediction:
            errors = errors + 1
    print "The training error is %.2f" % (float(errors)/float(num_observations))

with open ("enronemail_1h.txt", "r") as myfile:
    for line in myfile:
        line_components = line.split("\t")
        output_variable = int(line_components[1])
        line = ' '.join([str(x) for x in line_components[2:]])
        train_data.extend([line])
        train_labels.extend([output_variable])
# Initialize CountVectorizer and fit and transform the data
cv = CountVectorizer()
cv_train = cv.fit_transform(train_data)

## Create and fit MultinomialNB classifier (default settings)
MultiNB = MultinomialNB()
MultiNB.fit(cv_train,train_labels)
MultiNB_predicted = MultiNB.predict(cv_train)

## Create and fit BernoulliNB classifier (default settings)
## (the instance is named BernNB so that it does not shadow the imported BernoulliNB class)
BernNB = BernoulliNB()
BernNB.fit(cv_train,train_labels)
BernoulliNB_predicted = BernNB.predict(cv_train)

print "ANSWER:"
print "HW1.6 A. Multinomial Naive Bayes algorithm (using default settings)"
training_error(train_labels, MultiNB_predicted)
print "HW1.6 B. Bernoulli Naive Bayes algorithm (using default settings)"
training_error(train_labels, BernoulliNB_predicted)
print "HW1.6 C. From the results in HW1.5 we see that the training error for the Multinomial NB is 3%"

HW1.6 A. Multinomial Naive Bayes algorithm (using default settings)
The training error is 0.00
HW1.6 B. Bernoulli Naive Bayes algorithm (using default settings)
The training error is 0.00


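__HW1.6 E. I believe that the difference in training error between my Multinomial Naive Bayes implementation (in Map Reduce) and the Multinomial Naive Bayes implementation in SciKit-Learn comes down to smoothing and data preparation. SciKit-Learn's MultinomialNB applies Laplace (add-one) smoothing by default (alpha=1.0): it adds 1 to every word count and enlarges the denominator by the size of the vocabulary. My implementation only adds 1 to counts that are zero and leaves the denominators unchanged, so its conditional probabilities are not properly normalized. In addition, my class priors are based on word counts rather than document counts, and CountVectorizer tokenizes differently from my whitespace split (by default it lowercases the text and drops single-character tokens). Any of these differences can shift the class probabilities for borderline messages, which would account for the 3% training error of my implementation versus 0% for SciKit-Learn.__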
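
Additive (Laplace) smoothing, the technique the hint in part E refers to, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used above: the function name `smoothed_cond_probs` and the toy counts are hypothetical, and the parameter name `alpha` is chosen to mirror the `alpha` parameter of SciKit-Learn's `MultinomialNB`. Every count receives `alpha`, and the denominator grows by `alpha` times the vocabulary size so that the probabilities still sum to 1.

```python
from __future__ import division, print_function


def smoothed_cond_probs(word_counts, alpha=1.0):
    """Return P(word | class) with additive smoothing for one class.

    word_counts maps every vocabulary word to its count in the class
    (words never seen in the class must be present with a count of 0).
    """
    vocab_size = len(word_counts)
    total = sum(word_counts.values())
    denominator = total + alpha * vocab_size
    return {word: (count + alpha) / denominator
            for word, count in word_counts.items()}


# Hypothetical per-class counts, for illustration only.
toy_counts = {'assistance': 3, 'valium': 0, 'meeting': 5}
probs = smoothed_cond_probs(toy_counts)

for word in sorted(probs):
    print("%s: %.4f" % (word, probs[word]))
print("sum: %.4f" % sum(probs.values()))
```

With `alpha=1.0` the unseen word `valium` gets probability 1/11 instead of 0, so its logarithm is defined and no special case for zero probabilities is needed.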

__HW1.6 F. I believe that the lack of a performance difference in training error over the HW1.5 dataset between the Multinomial Naive Bayes and Bernoulli Naive Bayes implementations in SciKit-Learn is due to both models heavily overfitting the training data: the vocabulary is every word in the corpus, and the predictions are made on the same data the models were trained on.__
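
The two models compared in part F differ in how they represent a message: the multinomial model uses word counts, while the Bernoulli model only uses whether each vocabulary word is present. The sketch below illustrates that difference in plain Python; the function names, vocabulary, and tokens are hypothetical and chosen for illustration only.

```python
from __future__ import print_function
from collections import Counter


def count_features(tokens, vocabulary):
    """Multinomial-style features: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def binary_features(tokens, vocabulary):
    """Bernoulli-style features: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary and message, for illustration only.
vocabulary = ['assistance', 'meeting', 'valium']
tokens = ['valium', 'valium', 'assistance']

print(count_features(tokens, vocabulary))
print(binary_features(tokens, vocabulary))
```

Here the repeated word `valium` contributes twice to the count features but only once to the binary features, and the absent word `meeting` is a 0 in both representations.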