# HW 2 - Naive Bayes in Hadoop MR
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Fall 2025`__

In the live sessions for week 2 and week 3 you got some practice designing and debugging Hadoop Streaming jobs. In this homework we'll use Hadoop MapReduce to implement your first parallelized machine learning algorithm: Naive Bayes. As you develop your implementation you'll test it on a small dataset that matches the 'Chinese Example' in the _Manning, Raghavan and Shutze_ reading for Week 2. For the main task in this assignment you'll be working with a small subset of the Enron Spam/Ham Corpus. By the end of this assignment you should be able to:
* __... describe__ the Naive Bayes algorithm including both training and inference.
* __... perform__ EDA on a corpus using Hadoop MR.
* __... implement__ parallelized Naive Bayes.
* __... constrast__ partial, unordered and total order sort and their implementations in Hadoop Streaming.
* __... explain__ how smoothing affects the bias and variance of a Multinomial Naive Bayes model.

As always, your work will be graded both on the correctness of your output and on the clarity and design of your code.

## IMPORTANT NOTE (READ THIS TO SAVE YOURSELF LOTS OF TIME)

#### WE HAVE MIGRATED OVER TO GRADESCOPE. THE FOLLOWING FILES ARE REQUIRED TO BE UPLOADED FOR GRADESCOPE

Notebook (Don't change the name)
- HW2.ipynb

1.) 
- mapper.py
- reducer.py

2.) 
- chineseResults.txt

3.)
- chineseModelUnsmoothed.txt
- chineseModelSmoothed.txt

4.) THERE ARE CELLS TO CREATE THESE FILES BELOW.
- Unsmoothed_results.txt
- Smoothed_results.txt
- Unsmoothed_NBmodel.txt
- Smoothed_NBmodel.txt

## Notebook Setup
Before starting, run the following cells to confirm your setup.

In [1]:
!hadoop version

Hadoop 3.2.4
Source code repository https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r bd7653f7bc314f79f77d426015074df3eac9d6da
Compiled by bigtop on 2025-09-10T20:59Z
Compiled with protoc 2.5.0
From source with checksum feb1e797681713e36fa183c0646b3381
This command was run using /usr/lib/hadoop/hadoop-common-3.2.4.jar


In [2]:
# imports
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [3]:
# global vars (paths) - ADJUST AS NEEDED
JAR_FILE = "/usr/lib/hadoop/hadoop-streaming.jar"
HDFS_DIR = "/user/root/HW2"
HOME_DIR = "/media/notebooks/Assignments/HW2"

In [4]:
%cd {HOME_DIR}

/media/notebooks/Assignments/HW2


In [5]:
# save path for use in Hadoop jobs (-cmdenv PATH={PATH})
from os import environ
PATH  = environ['PATH']

In [6]:
# data path
ENRON = "data/enronemail_1h.txt"

In [7]:
# make the HDFS directory if it doesn't already exist
!hdfs dfs -mkdir -p {HDFS_DIR}
!hdfs dfs -ls 

Found 1 items
drwxr-xr-x   - root hadoop          0 2025-09-21 05:03 HW2


# Question 1: Hadoop MapReduce Key Takeaways

This assignment will be the only one in which you use Hadoop Streaming to implement a distributed algorithm. The key reason we continue to teach Hadoop streaming is because of the way it forces the programmer to think carefully about what is happening under the hood when you parallelize a calculation. This question will briefly highlight some of the most important concepts that you need to understand about Hadoop Streaming and MapReduce before we move on to Spark next week.   

### Q1 Tasks:

* __a) Multiple Choice:__ What "programming paradigm" is Hadoop MapReduce based on? 

* __b) Multiple Answers:__ What are the main ideas of this programming paradigm and how does MapReduce exemplify these ideas? (Select 3)

* __c) Short Essay:__ What is the Hadoop Shuffle? When does it happen? Why is it potentially costly? Describe one specific thing we can we do to mitigate the cost associated with this stage of our Hadoop Streaming jobs.

* __d) Multiple Choice:__ In Hadoop Streaming why do the input and output record format of a combiner script have to be the same? [__`HINT`__ What level of combining does the framework guarantee? what is the relationship between the record format your mapper emits and the format your reducer expects to receive?_]

* __e) Multiple Choice:__ To what extent can you control the level of parallelization of your Hadoop Streaming jobs? Please be specific.

* __f) Multiple Choice:__ What change in the kind of computing resources available prompted the creation of parallel computation frameworks like Hadoop?


In [8]:
# q1a
### MULTIPLE CHOICE
### QUESTION: What "programming paradigm" is Hadoop MapReduce based on?

#   a.) Object-Oriented Programming
#   b.) Dynamic Programming
#   c.) Recursive Programming
#   d.) Functional Programming
#   e.) None of the provided responses


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "d"


#####################
print(answer)

d


In [9]:
# q1b
### MULTIPLE ANSWERS
### QUESTION: What are the main ideas of this programming paradigm and how does MapReduce exemplify these ideas? (Select 3)

#   a.) The programming paradigms allows functions to accept other functions passed to them as
#       parameters (higher order functions). In Hadoop Map-Reduce, you can write your own mappers
#       and reducers and execute them against your data via map and reduce.

#   b.) The programming paradigm is designed to keep data structures that hold all
#       of the data in memory at once. In a map-reduce framework,
#       this stateful quality allows you to process large amounts of information in parallel.

#   c.) This programming paradigm optimizes execution of functions using recursion,
#       allowing for efficient processing of calls to functions made over a large
#       body of inputs. In map-reduce, this ability allows frameworks like Hadoop
#       to apply map, but not reduce functions, with relatively low run times.

#   d.) This paradigm attempts to avoid state changes/mutable data. In Hadoop Map-Reduce,
#       this means that data is read from source, processed during the map phase, saved to disk,
#       processed during reduce phase, and then again saved to disk.

#   e.) The results of a Hadoop Map-Reduce job can vary slightly from job to job,
#       depending on the under-the-hood operations that the framework puts in place.
#       We as programmers can't always control these under-the-hood decisions.
#       In practice, this means that a particular one set of inputs may produce
#       results that are slightly different each time the map-reduce job is executed.

#   f.) Another core idea of this programming programming is the idea of deterministic output,
#       in which a function will always return the same output, given the same input.
#       We can rely on a map-reduce job produce the same results, given the same source data.

### ENTER ONLY THE LETTERS INSIDE THE PRINT STATEMENT. (i.e. if your answer is x.), y.), and z.), enter "xyz")
answer = "adf"


#####################
print(answer)

adf


In [10]:
# q1c
### SHORT ESSAY
### QUESTION: What is the Hadoop Shuffle? When does it happen? Why is it potentially costly?
#             Describe one specific thing we can we do to mitigate the cost associated
#             with this stage of our Hadoop Streaming jobs.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
The Hadoop Shuffle is the step that happens in between the map phase and the reduce phase.
It’s the process of moving the mapper outputs around the cluster so that all the values for
a given key end up on the same reducer. This happens right after the map tasks finish and
before the reducers can start working.

The shuffle can get expensive because it often requires writing data to disk, sorting it,
and sending large amounts of information across the network. For really big jobs, the
extra movement of data can slow things down a lot.

One way to cut down on the cost is by using a combiner. A combiner runs after the mapper
and does a mini-reduce at the local level, so less data has to be sent over the network during the
shuffle stage.
"""
)



The Hadoop Shuffle is the step that happens in between the map phase and the reduce phase.
It’s the process of moving the mapper outputs around the cluster so that all the values for
a given key end up on the same reducer. This happens right after the map tasks finish and
before the reducers can start working.

The shuffle can get expensive because it often requires writing data to disk, sorting it,
and sending large amounts of information across the network. For really big jobs, the
extra movement of data can slow things down a lot.

One way to cut down on the cost is by using a combiner. A combiner runs after the mapper
and does a mini-reduce at the local level, so less data has to be sent over the network during the
shuffle stage.



In [11]:
# q1d
### MULTIPLE CHOICE
### QUESTION: In Hadoop Streaming why do the input and output record format of a combiner
#             script have to be the same? [HINT: What level of combining does the
#             framework guarantee? What is the relationship between the record
#             format your mapper emits and the format your reducer expects to receive?]

#   a.) The combiner processes the Key/Value records produced by the mapper.
#       As such a combiner is a replacement function for a reducer.

#   b.) Since Hadoop does not guarantee that a combiner will be executed,
#       our record format has to work whether or not it goes through the combiner.
#       In other words, the signature of the input and output of the combiner
#       must match the signature of the input to the reducer.

#   c.) If using combiners, you can only use a specific record format which
#       is determined by the Hadoop environment you are working in. Since this
#       is the case, the record must be in the same system-specified format
#       for the input and output of all combiners.

#   d.) None of the provided responses are correct.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "b"


#####################
print(answer)

b


In [12]:
# q1e
### MULTIPLE CHOICE
### QUESTION: To what extent can you control the level of parallelization of your Hadoop Streaming jobs?

#   a.) We can explicitly control both the number of reducers used and the number of mappers used
#       (for example by setting the -numReduceTasks parameter).

#   b.) We can explicitly control the number of reducers used (for example by setting the
#       -numReduceTasks parameter), but we can't force Hadoop to use the number of mappers we desire.

#   c.) We can't explicitly control the number of reducers or the number of mappers used
#       (but can make suggestions to the framework, for example by setting the -numReduceTasks parameter).

#   d.) None of the provided responses are correct.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "b"


#####################
print(answer)

b


In [13]:
# q1f
### MULTIPLE CHOICE
### QUESTION: What change in the kind of computing resources available prompted the creation of
#             parallel computation frameworks like Hadoop?

#   a.) The rise of cloud computing providers like GCP, AWS, and Microsoft Azure,
#       and the availability of OS-level virtualization frameworks such as Docker,
#       and orchestration frameworks such as Kubernetes, and Powerpoint made it
#       possible to orchestrate multiple virtual computers in the cloud, giving rise to frameworks like Hadoop.

#   b.) The rise of parallel computing frameworks was made possible by an increase
#       in the availability of cheap commodity hardware. Instead of investing in super
#       computers, the idea is to link together a lot of less powerful machines.

#   c.) The rise of High Performance Computers made it possible to run huge numbers
#       of calculations quickly on a single machine, which gave rise to frameworks like
#       Hadoop which could take advantage of that extra computing power to run programs on huge datasets.

#   d.) None of the provided responses are correct.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "b"


#####################
print(answer)

b


# Question 2: MapReduce Design Patterns 

In the last two live sessions and in your readings from Lin & Dyer you encountered a number of techniques for manipulating the logistics of a MapReduce implementation to ensure that the right information is available at the right time and location. In this question we'll review a few of the key techniques you learned.   

### Q2 Tasks:

* __a) Multiple Answers:__ What are counters (in the context of Hadoop Streaming)? How are they useful? What kinds of counters does Hadoop provide for you? How do you create your own custom counter? (Select 4)

* __b) Multiple Choice:__ What are composite keys? How are they useful? How are they related to the idea of custom partitioning?

* __c) Multiple Choice:__ What is the order inversion pattern? What problem does it help solve? How do we implement it? 


In [14]:
# q2a
### MULTIPLE ANSWERS
### QUESTION: What are counters (in the context of Hadoop Streaming)? How are they useful? What kinds of
#             counters does Hadoop provide for you? How do you create your own custom counter? (Select 4)

#   a.) Counters are a shared variable that is incremented and decremented
#       atomically by the Hadoop framework. This means that all running
#       instances within a job can update this variable to get a total
#       count at the end. This is a departure from the principle of statelessness
#       but is very useful for confirming that your jobs are running properly or
#       aggregating summary statistics while performing other computations.

#   b.) The built-in counters tell us information about IO, timing, and
#       job orchestration. These are useful because such information can help
#       you optimize your jobs. Especially useful are the Job Counters that tell
#       you how many map, combine, and reduce tasks were run as well as the Map-Reduce
#       Framework counters that tell you how many lines were input and output from your tasks.

#   c.) To create a custom counter we just write to standard output.
#       For example: sys.stderr.write("reporter:counter:MyWordCounter,count,1\n")

#   d.) It is important to keep in mind that counter values are not available to
#       mapper and reducer functions, and are only exposed after the job has finished.

#   e.) Hadoop counters provide visibility into the inner workings of a Hadoop
#       map-reduce job using NumPy and pandas.

#   f.) None of the provided responses are correct.

### ENTER ONLY THE LETTERS INSIDE THE PRINT STATEMENT. (i.e. if your answer is x.), y.), and z.), enter "xyz")
answer = "abcd"


#####################
print(answer)

abcd


In [15]:
# q2b
### MULTIPLE CHOICE
### QUESTION: What are composite keys in Hadoop? Hint: How are they useful;
#             How are they related to the idea of custom partitioning;

#             Select the combination of the statements below that best answers the question posed above.
#             i.  A composite key in Hadoop composes two or more fields or pieces of information from
#                 the mapper output record to form the key that Hadoop can use during the shuffle phase.


#             ii. In Hadoop a composite key can be used when we want to control which sets of records are
#                 shuffled (partition the data) to the same reducer node. For example, the following specifies
#                 a composite key that is used to partition mapper records such that records in the
#                 same partition are sent to the same reducer:

#                   -D stream.num.map.output.key.fields=4 \
#                   -D map.output.key.field.separator=. \
#                   -D mapreduce.partition.keypartitioner.options=-k1,2

#             iii. In Hadoop a composite key can be used to determine the sort order of records arriving
#                  at a reducer. For example, this can be accomplished as follows:

#                   -D stream.num.map.output.key.fields=4 \
#                   -D mapreduce.map.output.key.field.separator=. \
#                   -D mapreduce.partition.keycomparator.options=-k1,1 -k2,2 -k3n,3

#             iv. The (composite) key for partitioning and the (composite) key for sorting records arriving
#                 at the reduced need not be the same. Secondary sorting is an example of a use case where
#                 the fields used for partitioning and sorting are different. 


#   a.) i, ii, iii, iv
#   b.) i, ii, iv
#   c.) i, ii
#   d.) i, ii, iii
#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [16]:
# q2c
### MULTIPLE CHOICE
### QUESTION: What is the order inversion pattern? What problem does it help solve? How do we implement it?

#   a.) The Order Inversion (OI) design pattern can be used to control the order of reducer records in the
#       MapReduce framework (which is useful because some computations require ordered data).
#       The order inversion pattern is when we use a special key (think of a key having a value
#       such as __total_word_frequency; notice that __ appear before any key starting with
#       alphanumerical characters)  to ensure that a record (key, and value pair) is read and
#       processed before all other records at the reduce stage. In some situations, this can help
#       us avoid a second shuffle, for example, if you want to normalize word count frequency values
#       by the total word count. Without using the OI pattern, the reducers would not have access
#       to the total word until all records were processed by the reducers. The OI pattern enables
#       the processing of records with special keys, such as __total_word_frequency, in advance
#       of processing all individual word frequencies, leading to the total word count. This then
#       enables the normalization step as the reducer processes each individual word count record. 

#   b.) The Order Inversion (OI) design pattern is when we choose a custom sort order
#       for keys or values in the reduce stage, so that we get the desired results we want
#       from the program execution faster. For example, in a word counting problem we may
#       want the most common words to be output first, so we invert the order of the reducer
#       output records so that they are sorted by descending order of the values, instead of
#       the default behavior of sorting by the ascending order of the keys.

#   c.) The Order Inversion (OI) design pattern is when we switch the keys and the values
#       around so that each record output by the mapper can be reduced by its computed value
#       instead of by its key (the default behavior). For example, if the input to the
#       mappers for a word counting problem is a tuple of (count, word) then we would need to
#       invert the order of the key and the value, so that the reducer can then combine the counts
#       by using the word as the key.

#   d.) The order inversion pattern does not apply to Hadoop map-reduce jobs.

#   e.) None of the provided responses are correct.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


# Question 3: Understanding Total Order Sort

The key challenge in distributed computing is to break a problem into a set of sub-problems that can be performed without communicating with each other. Ideally, we should be able to define an arbitrary number of splits and still get the right result, but that is not always possible. Parallelization becomes particularly challenging when we need to make comparisons between records, for example when sorting. Total Order Sort allows us to order large datasets in a way that enables efficient retrieval of results. Before beginning this assignment, make sure you have read and understand the [Total Order Sort Notebook](https://github.com/UCB-w261/main/blob/main/HelpfulResources/TotalSortGuide/_total-sort-guide-spark2.01-JAN27-2017.ipynb) in GCS folder (__`/GCS/HelpfulResources/TotalSortGuide/_total-sort-guide-spark2.01-JAN27-2017.ipynb`__). You can skip the first two MRJob sections, but the rest of section III and all of section IV are **very** important (and apply to Hadoop Streaming) so make sure to read them closely. Feel free to read the Spark sections as well but you won't be responsible for that material until later in the course. To verify your understanding, answer the following questions.

### Q3 Tasks:

* __a) Short Essay:__ What is the difference between a Partial Sort, an Unordered Total Sort, and a Total Order Sort? From the programmer's perspective, what does Total Order Sort allow us to do that we can't with Unordered Total? Why is this important with large datasets?

* __b) Multiple Choice:__ Which phase of a MapReduce job is leveraged to implement Total Order Sort? Which default behaviors must be changed. Why must they be changed?

* __c) Short Essay:__ Describe in words how to configure a Hadoop Streaming job for the custom sorting and partitioning that is required for Total Order Sort.  

* __d) Multiple Choice:__ Explain why we need to use an inverse hash code function.

* __e) Multiple Choice:__ Where does this function need to be located so that a Total Order Sort can be performed?


In [17]:
# q3a
### SHORT ESSAY
### QUESTION: What is the difference between a Partial Sort, an Unordered Total Sort, and a Total Order Sort?
#             From the programmer's perspective, what does Total Order Sort allow us to do that we can't
#             with Unordered Total? Why is this important with large datasets?

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""

These are all examples of how data can be sorted in distributed systems. For Partial Sort, 
each reducer sorts the data it receives locally, but there’s no guarantee about how the outputs 
of different reducers relate to each other. For example, reducer 1 might output sorted keys that 
overlap with reducer 2’s keys. We might know the data is sorted within each partition, but not 
across the whole dataset. Unordered Total Sort has data partitioned so that each reducer gets 
a disjoint key range. This ensures that, reducer 1 processes only smaller keys and reducer 2 
processes only larger keys. However, the final outputs are not in a single global order because 
Hadoop doesn’t enforce an order in how reducers write results. We would still need to merge or 
reorder the output files afterward. Finally, Total Order Sort makes the entire dataset sorted 
across all reducers. Each reducer’s output is sorted internally and the reducers’ outputs are 
written in key order. Thus, if you concatenate the reducer outputs, you get one single sorted file.

From a programmer’s point of view, total order sort is valuable because it produces one globally 
ordered dataset straight from MapReduce without the need for an extra merge step. This is especially
important with very large datasets, where merging reducer outputs would be too costly. It also makes
the system more scalable since we can divide the work across many reducers while still preserving a
consistent global order. Finally, it improves usability by enabling efficient operations on the
distributed data.

"""
)




These are all examples of how data can be sorted in distributed systems. For Partial Sort, 
each reducer sorts the data it receives locally, but there’s no guarantee about how the outputs 
of different reducers relate to each other. For example, reducer 1 might output sorted keys that 
overlap with reducer 2’s keys. We might know the data is sorted within each partition, but not 
across the whole dataset. Unordered Total Sort has data partitioned so that each reducer gets 
a disjoint key range. This ensures that, reducer 1 processes only smaller keys and reducer 2 
processes only larger keys. However, the final outputs are not in a single global order because 
Hadoop doesn’t enforce an order in how reducers write results. We would still need to merge or 
reorder the output files afterward. Finally, Total Order Sort makes the entire dataset sorted 
across all reducers. Each reducer’s output is sorted internally and the reducers’ outputs are 
written in key order. Thus, if you concaten

In [18]:
# q3b
### MULTIPLE CHOICE
### QUESTION: Which phase of a MapReduce job is leveraged to implement Total Order Sort? Which default behaviors
#             must be changed. Why must they be changed?

#   a.) The shuffle/sort phase is where the sorting functionality takes place. To implement Total
#       Order Sort, we must override the default partitioning behavior. By default, Hadoop
#       does an alphanumerically increasing sort on a single key field within each
#       partition resulting in partial order sort.

#   b.) The shuffle/sort phase is where the sorting functionality takes place.
#       To implement Total Order Sort, we must override the default partitioning behavior.
#       By default, the Hadoop combiner does an alphanumerically increasing sort on a
#       single key field within each partition resulting in unordered total sort.

#   c.) To implement Total Order Sort, we must override the default partitioning behavior
#       in the shuffle/sort phase. By default, Hadoop does not sort records, so one has to
#       provide a post-processing step to sort the output partition files using, say,
#       the Unix sort command.

#   d.) None of the provided responses are correct.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [19]:
# q3c
### SHORT ESSAY
### QUESTION: Describe in words how to configure a Hadoop Streaming job for the custom sorting and
#             partitioning that is required for Total Order Sort. (hint: feel free to try and write
#             a Hadoop Streaming job and come back to this question if you need. Additionally,
#             be sure to check the Hadoop Streaming documentation!)

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
To configure a Hadoop Streaming job for Total Order Sort, the main goal is to ensure that all the data
is globally ordered across all reducers. By default, Hadoop Streaming sorts records within each
partition but does not control how records are distributed across partitions. To achieve Total Order
Sort, we need to override the default partitioning by using a KeyFieldBasedPartitioner and supplying
either a custom partition file or a method to compute partition keys such that each reducer receives a
distinct, ordered subset of the data. This ensures that after all reducers finish, the concatenated
output is fully sorted.

Additionally, the job configuration must specify the proper key comparator to handle secondary sorting
of values if needed. For example, setting the mapreduce.job.output.key.comparator.class to
KeyFieldBasedComparator allows us to define which fields of the composite key to sort and whether to
sort them numerically or lexicographically. Other settings, such as stream.map.output.field.separator
and mapreduce.partition.keycomparator.options, help control how fields are parsed and sorted, ensuring
that the shuffle phase respects the intended order.

Finally, the mapper may need to prepend or transform the key so that it hashes to the correct partition
index according to the custom partitioning scheme. This involves computing an inverse hash
function or using percentile-based partition keys derived from a random sample of the dataset.
Combined, these steps allow Hadoop Streaming to distribute the data across reducers while maintaining a
total order, enabling efficient downstream operations without additional post-processing.
"""
)


To configure a Hadoop Streaming job for Total Order Sort, the main goal is to ensure that all the data
is globally ordered across all reducers. By default, Hadoop Streaming sorts records within each
partition but does not control how records are distributed across partitions. To achieve Total Order
Sort, we need to override the default partitioning by using a KeyFieldBasedPartitioner and supplying
either a custom partition file or a method to compute partition keys such that each reducer receives a
distinct, ordered subset of the data. This ensures that after all reducers finish, the concatenated
output is fully sorted.

Additionally, the job configuration must specify the proper key comparator to handle secondary sorting
of values if needed. For example, setting the mapreduce.job.output.key.comparator.class to
KeyFieldBasedComparator allows us to define which fields of the composite key to sort and whether to
sort them numerically or lexicographically. Other settings, such as stream.

In [20]:
# q3d
### MULTIPLE CHOICE
### QUESTION: Explain why we need to use an inverse hash code function.

#   a.) The inverse hash code function lets us know the actual key that Hadoop will use for
#       partitioning. By knowing this, we can reorder the partition keys that we are
#       using according to the order that they will be sorted in after being hashed.
#       Without this, we cannot know which partition key to use for which partition so
#       that the end result is ordered by file name.

#   b.) The inverse hash code function tells Hadoop which key to use for partitioning,
#       allowing us to control which records get sent to which reducer. This means that
#       we can know the full sorted order of all records once the Hadoop job finishes.

#   c.) The inverse hash code function tells Hadoop us to invert the order of the
#       keys which are sent from the mapper to the reducer, allowing us to sort
#       the reducer output in descending order.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [21]:
# q3e
### MULTIPLE CHOICE
### QUESTION: Where does this function (i.e.,  inverse hash code function) need to be located
#             so that a Total Order Sort can be performed?

#   a.) The inverse hash function must be accessible before partitioning takes place.
#       Hence, it needs to reside inside the mapper function.

#   b.) The inverse hash function is only used after partitioning takes place.
#       Hence, it needs to reside inside the reducer function.

#   c.) The inverse hash function must be accessible before and after partitioning takes place.
#       Hence, it must reside inside both the combiner and the reducer functions.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


# About the Data

For the main task in this portion of the homework you will train a classifier to determine whether an email represents spam or not. You will train your Naive Bayes model on a 100 record subset of the Enron Spam/Ham corpus available in the HW2 data directory (__`HW2/data/enronemail_1h.txt`__).

__Source:__   
The original data included about 93,000 emails which were made public after the company's collapse. There have been a number raw and preprocessed versions of this corpus (including those available [here](http://www.aueb.gr/users/ion/data/enron-spam/index.html) and [here](http://www.aueb.gr/users/ion/publications.html)). The subset we will use is limited to emails from 6 Enron employees and a number of spam sources. It is part of [this data set](http://www.aueb.gr/users/ion/data/enron-spam/) which was created by researchers working on personalized Bayesian spam filters. Their original publication is [available here](http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf). __`IMPORTANT!`__ _For this homework please limit your analysis to the 100 email subset which we provide. No need to download or run your analysis on any of the original datasets, those links are merely provided as context._

__Preprocessing:__  
For their work, Metsis et al. (the authors) appeared to have pre-processed the data, not only collapsing all text to lower-case, but additionally separating "words" by spaces, where "words" unfortunately include punctuation. As a concrete example, the sentence:  
>  `Hey Jon, I hope you don't get lost out there this weekend!`  

... would have been reduced by Metsis et al. to the form:  
> `hey jon , i hope you don ' t get lost out there this weekend !` 

... so we have reverted the data back toward its original state, removing spaces so that our sample sentence would now look like:
> `hey jon, i hope you don't get lost out there this weekend!`  

Thus we have at least preserved contractions and other higher-order lexical forms. However, one must be aware that this reversion is not complete, and that some object (specifically web sites) will be ill-formatted, and that all text is still lower-cased.


__Format:__   
All messages are collated to a tab-delimited format:  

>    `ID \t SPAM \t SUBJECT \t CONTENT \n`  

where:  
>    `ID = string; unique message identifier`  
    `SPAM = binary; with 1 indicating a spam message`  
    `SUBJECT = string; title of the message`  
    `CONTENT = string; content of the message`   
    
Note that either of `SUBJECT` or `CONTENT` may be "NA", and that all tab (\t) and newline (\n) characters have been removed from both of the `SUBJECT` and `CONTENT` columns.  

In [22]:
!pwd

/media/notebooks/Assignments/HW2


In [23]:
# take a look at the first 100 characters of the first 5 records (RUN THIS CELL AS IS)
!head -n 5 {ENRON} | cut -c-100

0001.1999-12-10.farmer	0	 christmas tree farm pictures	NA
0001.1999-12-10.kaminski	0	 re: rankings	 thank you.
0001.2000-01-17.beck	0	 leadership development pilot	" sally:  what timing, ask and you shall receiv
0001.2000-06-06.lokay	0	" key dates and impact of upcoming sap implementation over the next few week
0001.2001-02-07.kitchen	0	 key hr issues going forward	 a) year end reviews-report needs generating 


In [24]:
# see how many messages/lines are in the file 
#(this number may be off by 1 if the last line doesn't end with a newline)
!wc -l {ENRON}

100 data/enronemail_1h.txt


In [25]:
# load the data into HDFS (RUN THIS CELL AS IS)
!hdfs dfs -copyFromLocal {ENRON} {HDFS_DIR}/enron.txt

In [26]:
!hdfs dfs -ls {HDFS_DIR}

Found 11 items
drwxr-xr-x   - root hadoop          0 2025-09-21 03:15 /user/root/HW2/chinese-train-output
drwxr-xr-x   - root hadoop          0 2025-09-21 04:16 /user/root/HW2/chinese-train-smooth-output
-rw-r--r--   1 root hadoop        119 2025-09-21 03:13 /user/root/HW2/chineseTest.txt
-rw-r--r--   1 root hadoop        107 2025-09-21 03:13 /user/root/HW2/chineseTrain.txt
drwxr-xr-x   - root hadoop          0 2025-09-21 04:21 /user/root/HW2/enron-model
drwxr-xr-x   - root hadoop          0 2025-09-21 05:00 /user/root/HW2/enron-smoothed-eval
drwxr-xr-x   - root hadoop          0 2025-09-21 05:04 /user/root/HW2/enron-unsmoothed-eval
-rw-r--r--   1 root hadoop     204559 2025-09-21 05:25 /user/root/HW2/enron.txt
-rw-r--r--   1 root hadoop      41493 2025-09-21 04:17 /user/root/HW2/enron_test.txt
-rw-r--r--   1 root hadoop     163066 2025-09-21 04:17 /user/root/HW2/enron_train.txt
drwxr-xr-x   - root hadoop          0 2025-09-21 04:27 /user/root/HW2/smooth-model


# Question 4:  Enron Ham/Spam EDA

Before building our classifier, let's get aquainted with our data. In particular, we're interested in which words occur more in spam emails than in legitimate ("ham") emails. In this question you'll implement two Hadoop MapReduce jobs to count and sort word occurrences by document class. You'll also learn about two new Hadoop streaming parameters that will allow you to control how the records output from your mappers are partitioned for reducing on separate nodes. 

__`IMPORTANT NOTE:`__ For this question and all subsequent items, you should include both the subject and the body of the email in your analysis (i.e. concatetate them to get the 'text' of the document).

### Q4 Tasks:
* __a) Code in Notebook:__ Complete the missing components of the code in __`EnronEDA/mapper.py`__ and __`EnronEDA/reducer.py`__ to create a Hadoop  MapReduce job that counts how many times each word in the corpus occurs in an email for each class. Pay close attention to the data format specified in the docstrings of these scripts _-- there are a number of ways to accomplish this task, we've chosen this format to help illustrate a technique in `part e`_. Run the provided unit tests to confirm that your code works as expected, then run the provided Hadoop Streaming command to apply your analysis to the Enron data.

* __b) Code in Notebook + Multiple Choice:__ How many times does the word "__assistance__" occur in each class? (`HINT:` Use a `grep` command to read from the results file you generated in '`a`' and then report the answer in the space provided.)

* __c) Multiple Choice:__ Would it have been possible to add some sorting parameters to the Hadoop streaming command that would cause our `part a` results to be sorted by count? Explain why or why not. (`HINT:` This question demands an understanding of the sequence of the phases of MapReduce.)

* __d) Code in Notebook + Short Essay:__ Write a second Hadoop MapReduce job to sort the output of `part a` first by class and then by count. Run your job and save the results to a local file. Then describe in words how you would go about printing the top 10 words in each class given this sorted output. (`HINT 1:` _remember that you can simply pass the `part a` output directory to the input field of this job; `HINT 2:` since this task is just reordering the records from `part a` we don't need to write a mapper or reducer, just use `/bin/cat` for both_)

* __e) Code in Notebook:__ A more efficient alternative to '`grep`-ing' for the top 10 words in each class would be to use the Hadoop framework to separate records from each class into its own partition so that we can just read the top lines in each. Rewrite your job from ` part d` to specify 2 reduce tasks and to tell Hadoop to partition based on the second field (which indicates spam/ham in our data). Your code should maintain the secondary sort -- that is each partition should list words from most to least frequent.

In [27]:
# part a - do your work in the provided scripts then RUN THIS CELL AS IS
!chmod a+x EnronEDA/mapper.py
!chmod a+x EnronEDA/reducer.py

In [28]:
# part a - unit test EnronEDA/mapper.py (RUN THIS CELL AS IS)
# STRING IS:
#   d1	1	title	body
#   d2	0	title	body
!echo -e "d1\t1\ttitle\tbody\nd2\t0\ttitle\tbody" | EnronEDA/mapper.py

title	1	1
body	1	1
title	0	1
body	0	1


In [29]:
# part a - unit test EnronEDA/reducer.py (RUN THIS CELL AS IS)
# STRING IS:
#   one	1	1
#   one	0	2
#   two	0	1
!echo -e "one\t1\t1\none\t0\t1\none\t0\t1\ntwo\t0\t1" | EnronEDA/reducer.py

-e one	1	1
one	0	2
two	0	1


In [30]:
# part a - clear output directory in HDFS (RUN THIS CELL AS IS)
!hdfs dfs -rm -r {HDFS_DIR}/eda-output

rm: `/user/root/HW2/eda-output': No such file or directory


In [31]:
# part a - Hadoop streaming job (RUN THIS CELL AS IS)
!hadoop jar {JAR_FILE} \
  -files EnronEDA/reducer.py,EnronEDA/mapper.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input {HDFS_DIR}/enron.txt \
  -output {HDFS_DIR}/eda-output \
  -numReduceTasks 2 \
  -cmdenv PATH={PATH}

packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob5258217519449079432.jar tmpDir=null
2025-09-21 05:25:38,753 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:25:39,058 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:25:39,577 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:25:39,578 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:25:39,794 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0017
2025-09-21 05:25:40,531 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 05:25:40,582 INFO mapreduce.JobSubmitter: number of splits:9
2025-09-21 05:25:40,778 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0017
2025-09-21 05:25:40,780 INFO mapred

In [32]:
# part a - retrieve results from HDFS & copy them into a local file (RUN THIS CELL AS IS)
!hdfs dfs -cat {HDFS_DIR}/eda-output/part-0000* > EnronEDA/results.txt

In [33]:
# part b - write your grep command here
# !grep -P "^assistance\t1\t" EnronEDA/results.txt
# !grep -P "^assistance\t0\t" EnronEDA/results.txt
!grep '^assistance' EnronEDA/results.txt

assistance	1	8
assistance	0	2


In [34]:
# q4b
### MULTIPLE CHOICE
### QUESTION: How many times does the word "assistance" occur in each class?
#             (HINT: Use a grep command to read from the results file you
#             generated in 'a' and then report the answer in the space provided.)

#   a.) 'assistance' occurs 8 times in Spam emails and only 2 times in real emails.
#   b.) 'assistance' occurs only 2 times in Spam emails and 8 times in real emails.
#   c.) 'assistance' occurs 10 times in Spam emails and only 2 times in real emails.
#   d.) 'assistance' occurs 8 times in Spam emails and only 3 times in real emails.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [35]:
# q4c
### MULTIPLE CHOICE
### QUESTION: Would it have been possible to add some sorting parameters to the Hadoop
#             streaming command that would cause our part a results to be sorted by count?
#             (HINT: This question demands an understanding of the sequence of the phases
#             of MapReduce.)

#   a.) No, we can't sort on counts in the original job because Hadoop's sorting only allows
#       keys to be sorted by letters, and counts are numbers.

#   b.) Yes, we can sort on counts in the original job because Hadoop's sorting occurs after
#       the reducer stage, which is where our counts are tallied up.

#   c.) Yes, we can sort on counts in the original job because Hadoop allows sorting in any
#       of the stages of the map-reduce job.

#   d.) No, we can't sort on counts in the original job because Hadoop's sorting occurs in the
#       phase between the mapper and reducer since our counts aren't tallied up until after the reducer.


### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "d"


#####################
print(answer)

d


In [36]:
# part d - clear the output directory in HDFS (RUN THIS CELL AS IS)
!hdfs dfs -rm -r {HDFS_DIR}/eda-sort-output

rm: `/user/root/HW2/eda-sort-output': No such file or directory


In [37]:
# q4d1
# part d - write your Hadoop streaming job here
!hdfs dfs -rm -r {HDFS_DIR}/eda-sort-output

!hadoop jar {JAR_FILE} \
  -D stream.num.map.output.key.fields=3 \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
  -D mapreduce.partition.keycomparator.options="-k2,2n -k3,3nr" \
  -input {HDFS_DIR}/eda-output \
  -output {HDFS_DIR}/eda-sort-output \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -numReduceTasks 1 \
  -cmdenv PATH={PATH}

!hdfs dfs -cat {HDFS_DIR}/eda-sort-output/part-00000 > EnronEDA/sorted_results.txt

rm: `/user/root/HW2/eda-sort-output': No such file or directory
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob3802559138501684604.jar tmpDir=null
2025-09-21 05:26:34,385 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:26:34,631 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:26:35,287 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:26:35,288 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:26:35,838 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0018
2025-09-21 05:26:36,161 INFO mapred.FileInputFormat: Total input files to process : 2
2025-09-21 05:26:36,252 INFO mapreduce.JobSubmitter: number of splits:10
2025-09-21 05:26:36,486 INFO mapreduce.JobSubmitter: Submitting tokens fo

In [38]:
# q4d2
### SHORT ESSAY
### QUESTION: Describe in words how you would go about printing the top 10
#             words in each class given this sorted output.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
Since the results are already sorted first by class and then by count, printing the top 10 words in
each class is just going to be scanning through the file and keeping track of how many words we’ve 
seen for each class. I’d read the sorted output line by line, check the class field, and maintain a 
counter for each class. When I see a line with class 1 (spam), I increment the spam 
counter and print the word only if I haven’t yet reached 10. I would do the same for class 0 (ham).
Once the counter for a class reaches 10, I would stop printing for that class and continue on until
both classes have their top 10 words. The sorted order allows the most frequent words for each class
appear first in the group, so I don’t need to do any extra sorting or tallying. I just read, check the
class, and print the first 10 entries per class. This makes it simple and efficient to pull out the top
terms from the data.
"""
)


Since the results are already sorted first by class and then by count, printing the top 10 words in
each class is just going to be scanning through the file and keeping track of how many words we’ve 
seen for each class. I’d read the sorted output line by line, check the class field, and maintain a 
counter for each class. When I see a line with class 1 (spam), I increment the spam 
counter and print the word only if I haven’t yet reached 10. I would do the same for class 0 (ham).
Once the counter for a class reaches 10, I would stop printing for that class and continue on until
both classes have their top 10 words. The sorted order allows the most frequent words for each class
appear first in the group, so I don’t need to do any extra sorting or tallying. I just read, check the
class, and print the first 10 entries per class. This makes it simple and efficient to pull out the top
terms from the data.



In [39]:
# part e - clear the output directory in HDFS (RUN THIS CELL AS IS)
!hdfs dfs -rm -r {HDFS_DIR}/eda-sort-output

Deleted /user/root/HW2/eda-sort-output


In [40]:
# q4e
# part e - write your Hadoop streaming job here
!hdfs dfs -rm -r {HDFS_DIR}/eda-sort-output

!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=3 \
    -D stream.map.output.field.separator="\t" \
    -D mapreduce.partition.keypartitioner.options=-k2,2 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k2,2n -k3,3nr" \
    -input {HDFS_DIR}/eda-output \
    -output {HDFS_DIR}/eda-sort-output \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -cmdenv PATH={PATH}

!hdfs dfs -cat {HDFS_DIR}/eda-sort-output/part-* > EnronEDA/sorted_results_partitioned.txt

rm: `/user/root/HW2/eda-sort-output': No such file or directory
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob7189516461276457532.jar tmpDir=null
2025-09-21 05:27:29,913 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:27:30,182 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:27:30,817 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:27:30,818 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:27:31,371 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0019
2025-09-21 05:27:31,690 INFO mapred.FileInputFormat: Total input files to process : 2
2025-09-21 05:27:31,773 INFO mapreduce.JobSubmitter: number of splits:10
2025-09-21 05:27:32,009 INFO mapreduce.JobSubmitter: Submitting tokens fo

In [41]:
# part e - view the top 10 records from each partition (RUN THIS CELL AS IS)
for idx in range(2):
    print(f"\n===== part-0000{idx}=====\n")
    !hdfs dfs -cat {HDFS_DIR}/eda-sort-output/part-0000{idx} | head 2>&1 | tee q4e.txt


===== part-00000=====

the	0	549	
to	0	398	
ect	0	382	
and	0	278	
of	0	230	
hou	0	206	
a	0	196	
in	0	182	
for	0	170	
on	0	135	
cat: Unable to write to output stream.

===== part-00001=====

the	1	698	
to	1	566	
and	1	392	
your	1	357	
a	1	347	
you	1	345	
of	1	336	
in	1	236	
for	1	204	
com	1	153	
cat: Unable to write to output stream.


__Expected output:__
<table>
<th>part-00000:</th>
<th>part-00001:</th>
<tr><td><pre>
the	0	549	
to	0	398	
ect	0	382	
and	0	278	
of	0	230	
hou	0	206	
a	0	196	
in	0	182	
for	0	170	
on	0	135
</pre></td>
<td><pre>
the	1	698	
to	1	566	
and	1	392	
your	1	357	
a	1	347	
you	1	345	
of	1	336	
in	1	236	
for	1	204	
com	1	153
</pre></td></tr>
</table>

# Question 5: Counters and Combiners

Tuning the number of mappers & reducers is helpful to optimize very large distributed computations. Doing so successfully requires a thorough understanding of the data size at each stage of the job. As you learned in the week3 live session, counters are an invaluable resource for understanding this kind of detail. In this question, we will take the EDA performed in Question 4 as an opportunity to illustrate some related concepts.

### Q5 Tasks:
* __a) Multiple Choice:__ Read the Hadoop output from your job in Question 4a to report how many records are emitted by the mappers and how many records are received by the reducers (hint: we are not using combiners here).

* __b) Multiple Choice:__ In the context of word counting in Question 4b, what does the number of records emitted by the mapper represent practically?

* __c) Code in Notebook:__ Note that we wrote the reducer in Question 4a such that the input and output record format is identical. This makes it easy to use the same reducer script as a combiner. In the space provided below, write the Hadoop Streaming command to re-run your job from Question 4a with this this __combiner step__ added.

* __d) Multiple Choice:__ Read the Hadoop output from your job in Question 5c to report how many records are emitted by the mappers and how many records are received by the reducers (hint: we are using combiners here). 

* __e) Short Essay:__ Compare your results from Question 5d to what you saw in Question 5a. Explain the differences, if any.

* __f) Short Essay:__ Describe a scenario where using a combiner would NOT improve the efficiency of the shuffle stage. Explain.


In [42]:
# q5a1
### MULTIPLE CHOICE
### QUESTION: Read the Hadoop output from your job in Question 4a to report how
#             many records are emitted by the mappers (hint: we are not using combiners here). 

#   a.) 20576
#   b.) 13096
#   c.) 10101
#   d.) 31490
#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "d"


#####################
print(answer)

d


In [43]:
# q5a2
### MULTIPLE CHOICE
### QUESTION: Read the Hadoop output from your job in Question 4a to report how
#             many records are received by the reducers (hint: we are not using combiners here). 

#   a.) 13096
#   b.) 31490
#   c.) 10101
#   d.) 20576
#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "b"


#####################
print(answer)

b


In [44]:
# q5b
### MULTIPLE CHOICE
### QUESTION: In the context of word counting in Question 4b, what does the number of records emitted by the mapper
#             represent practically?

#   a.) The total number of unique words in all documents.
#   b.) The total number of documents.
#   c.) The total number of words in all documents.
#   d.) The average number of words per document.

### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "c"


#####################
print(answer)

c


In [45]:
# part c - clear output directory in HDFS (RUN THIS CELL AS IS)
!hdfs dfs -rm -r {HDFS_DIR}/eda-output

Deleted /user/root/HW2/eda-output


In [46]:
!cat EnronEDA/mapper.py

#!/usr/bin/env python
"""
Mapper tokenizes and emits words with their class.
INPUT:
    ID \t SPAM \t SUBJECT \t CONTENT \n
OUTPUT:
    word \t class \t count 
"""
import re
import sys

# read from standard input
for line in sys.stdin:
    # parse input
    docID, _class, subject, body = line.split('\t')
    # tokenize
    words = re.findall(r'[a-z]+', subject + ' ' + body)
    
############ YOUR CODE HERE #########
    # emit each word with its class
    for word in words:
        print(f"{word}\t{_class}\t1")
############ (END) YOUR CODE #########

In [47]:
!cat EnronEDA/reducer.py

#!/usr/bin/env python
"""
Reducer takes words with their class and partial counts and computes totals.
INPUT:
    word \t class \t partialCount 
OUTPUT:
    word \t class \t totalCount  
"""
import re
import sys

# initialize trackers
current_word = None
spam_count, ham_count = 0,0

# read from standard input
for line in sys.stdin:
    # parse input
    word, is_spam, count = line.split('\t')
    
############ YOUR CODE HERE #########
    count = int(count)
    if current_word is None:
        current_word = word
    if word != current_word:
        # output both counts for the finished word
        if spam_count > 0:
            print(f"{current_word}\t1\t{spam_count}")
        if ham_count > 0:
            print(f"{current_word}\t0\t{ham_count}")
        # reset trackers
        current_word = word
        spam_count, ham_count = 0, 0
    # add to the counters
    if is_spam == "1":
        spam_count += count
    else:
        ham_count += count
# flush the last word
if current_word

In [48]:
# q5c
# write your Hadoop streaming job here
!hadoop jar {JAR_FILE} \
  -files EnronEDA/mapper.py,EnronEDA/reducer.py \
  -mapper mapper.py \
  -combiner reducer.py \
  -reducer reducer.py \
  -input {HDFS_DIR}/enron.txt \
  -output {HDFS_DIR}/eda-output \
  -numReduceTasks 2 \
  -cmdenv PATH={PATH}

packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob1478029178737531473.jar tmpDir=null
2025-09-21 05:28:31,118 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:28:31,434 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:28:31,948 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:28:31,948 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:28:32,149 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0020
2025-09-21 05:28:32,495 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 05:28:32,568 INFO mapreduce.JobSubmitter: number of splits:9
2025-09-21 05:28:32,778 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0020
2025-09-21 05:28:32,780 INFO mapred

In [49]:
# q5d1
### MULTIPLE CHOICE
### QUESTION: Read the Hadoop output from your job in Question 5c to report how
#             many records are emitted by the mappers (hint: we are not using combiners here). 

#   a.) 10130
#   b.) 13096
#   c.) 20576
#   d.) 31490
#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "d"


#####################
print(answer)

d


In [50]:
# q5d2
### MULTIPLE CHOICE
### QUESTION: Read the Hadoop output from your job in Question 5c to report how
#             many records are received by the reducers (hint: we are not using combiners here). 

#   a.) 13096
#   b.) 31490
#   c.) 10130
#   d.) 20576
#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE PRINT STATEMENT. (i.e. if your answer is f.), enter "f")
answer = "e"


#####################
print(answer)

e


In [51]:
# q5e
### SHORT ESSAY
### QUESTION: Compare your results from Question 5d to what you saw in
#             Question 5a. Explain the differences, if any.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
In Question 5a, the number of records received by the reducers was 31,490, matching the number
of records outputted by the mappers since no combiner was used. In Question 5d, after adding the
combiner in Question 5c, the number of records received by the reducers dropped to 11,433. There is a
difference because the combiner partially aggregates the mapper output before it is sent to the
reducers. This reduces the amount of intermediate data shuffled across the network, lowering both the
reduce input records and the overall network input/output, which improves efficiency without affecting
the final output counts.
"""
)


In Question 5a, the number of records received by the reducers was 31,490, matching the number
of records outputted by the mappers since no combiner was used. In Question 5d, after adding the
combiner in Question 5c, the number of records received by the reducers dropped to 11,433. There is a
difference because the combiner partially aggregates the mapper output before it is sent to the
reducers. This reduces the amount of intermediate data shuffled across the network, lowering both the
reduce input records and the overall network input/output, which improves efficiency without affecting
the final output counts.



In [52]:
# q5f
### SHORT ESSAY
### QUESTION: Describe a scenario where using a combiner would NOT improve the efficiency
#             of the shuffle stage. Explain.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
A combiner won’t help if most of the keys output by the mappers are unique or occur only once.
If this is true, there would be nothing that would be aggregated locally before sending the data to
the reducers, so every record still has to travel across the network. Thus, the shuffle stage
sees no reduction in data volume or network traffic, and the overall job efficiency does not improve.
Therefore, combiners are beneficial when there are repeated keys that can be partially reduced at the
mapper level.
"""
)


A combiner won’t help if most of the keys output by the mappers are unique or occur only once.
If this is true, there would be nothing that would be aggregated locally before sending the data to
the reducers, so every record still has to travel across the network. Thus, the shuffle stage
sees no reduction in data volume or network traffic, and the overall job efficiency does not improve.
Therefore, combiners are beneficial when there are repeated keys that can be partially reduced at the
mapper level.



# Question 6: Document Classification Task Overview

The week 2 assigned reading from Chapter 13 of _Introduction to Information Retrieval_ by Manning, Raghavan and Schutze provides a thorough introduction to the document classification task and the math behind Naive Bayes. In this question we'll use the example from Table 13.1 (reproduced below) to 'train' an unsmoothed Multinomial Naive Bayes model and classify a test document by hand.

<table>
<th>DocID</th>
<th>Class</th>
<th>Subject</th>
<th>Body</th>
<tr><td>Doc1</td><td>1</td><td></td><td>Chinese Beijing Chinese</td></tr>
<tr><td>Doc2</td><td>1</td><td></td><td>Chinese Chinese Shanghai</td></tr>
<tr><td>Doc3</td><td>1</td><td></td><td>Chinese Macao</td></tr>
<tr><td>Doc4</td><td>0</td><td></td><td>Tokyo Japan Chinese</td></tr>
</table>

### Q6 Tasks:

* __a) Multiple Choice:__ Assume we estimate the following probabilities from a collection of SPAM emails and HAM emails. We limit the vocabulary of the Naive Bayes model to the following keywords: Urgent, Sale, Hello.  In the following $Pr(word)$ indicates the probability of that word appearing in any email.

* $Pr(Urgent) = 50\%,$
* $Pr(Sale) = 8\%,$
* $Pr(Hello) = 20\%$
    
We also know that spam is $40\%$ of all email, and $80\%$ of spam email contains “Urgent”, i.e, $Pr(Urgent|SPAM) = 0.8.$

Given a multinomial Naive Bayes model learnt from this data with no smoothing, what is the probability that an email is a spam if it contains only one word that is our vocabulary, that of,  “Urgent”?  I.e., $Pr(SPAM|X=Urgent)$ = ?????

`HINT`: The [law of total probabilities](https://en.wikipedia.org/wiki/Law_of_total_probability) is used to calculate the denominator for a multinomial Naive Bayes Classifier. But we don't need to use that rule here since our document contains just a single word in the model vocabulary i.e., it contains the word "Urgent". Plus we know the $Pr(X) = Pr(Urgent) = 0.5$. Having said that what is $Pr(Urgent|HAM) = 0.3$ (just out of curiosity's sake; this is not needed in this quiz).


* $Pr(Urgent) = 0.4 * 0.8 + 0.6 * Pr(Urgent|HAM)$
* $0.5 = 0.4 * 0.8 + 0.6 * Pr(Urgent|HAM)$
* $Pr(Urgent|HAM) = 0.3$

Ordinarily, we tend to not use the denominator in our calculation for classification tasks (as it is common to all the classes). But if we need probabilities then we would calculate this quantity.

QUESTION: 

What is $Pr(SPAM|X=Urgent)$?


* __b) Numerical Input:__ Given the following training dataset of 5 documents for a 2 Class problem: HAM versus SPAM.

**Training Data**


|DocId |Class | Document String
|---|---|---|
|d1 | HAM | good
|d2 | SPAM | very good
|d3 | SPAM | good bad
|d4 | HAM | very bad
|d5 | SPAM | very bad very good


The vocabulary of the dataset is [good, very, bad]. The word class conditionals are calculated in the following table(without smoothing). 

Please fill in the blanks with appropriate answers in the following table:


| Word | $\text{Pr(word|HAM)}$ | $\text{Pr(word|SPAM)}$ |
|---|---|---|
| good | $1/3$ | $3/8$ 
| very | $1/3$ | $3/8$ 
| bad | \[Blank_1\] | \[Blank_2\] 

Please submit your response as a fraction using "/" or as a decimal with at least 2 decimal places. Example: 1/16 or 0.06


* __c) Numerial Input:__ In this question learn a multinomial Naive Bayes model using all the training data. Please use unigrams features (i.e., single words, like "very", "bad").  You would then learn the Pr(bad | SPAM) and Pr(bad | ham).  Note: please do not use any higher-order features such as bigrams (e.g., "very bad"). 

Given the following training corpus of documents for a two Class problem: __HAM__ versus __SPAM__.

__Training Data__:

| DocId | Class | Document String
|---|---|---|
| d1 | HAM | good
| d2 | SPAM | very good
| d3 | SPAM | good bad
| d4 | HAM | very bad 
| d5 | SPAM | very bad very good

and a test data set consisting of a single test case:

__Test Data__

| DocId | Class | Document String
|---|---|---|
| d6 | ?? | good bad very

__TASK:__ Learn a multinomial Naive Bayes model with Laplace (plus one) smoothing using all the training data. 

Given a test document $d6$ calculate the posterior probability for __HAM__ using the learned model.

* $Pr(HAM | d6) = ?? \% $ 

Recall at a high level the posterior probability of, say, $Pr(HAM|d6)$ is a follows:

$Pr(HAM|d6) = Pr(d6|HAM)/( Pr(d6|HAM) + Pr(d6|SPAM))$

where $Pr(HAM|d6)$ and $Pr(SPAM|d6)$ can be calculated as follows:

\begin{equation}
\begin{aligned}
p(C_k \mid x_1, \dots, x_n)
&=\frac{ p( x_1, \dots, x_n\mid C_k)p(C_k) }{p( x_1, \dots, x_n)}\\
&=\frac{  \prod_{i=1}^n p(x_i \mid C_k)p(C_k) }{p( x_1, \dots, x_n)}
\end{aligned}
\end{equation}

where,

$x_1, \dots, x_n$ are the words in d6, and $C_k$ is the class label (HAM or SPAM).

The formula given above is a generic formula. For our example it will be:

\begin{equation}
\begin{aligned} 
p(HAM \mid good, bad, very)
&=\frac{  p(HAM)\prod_{i=1}^3 p(w_i \mid HAM) }{p( good, bad, very)}\\
&=\frac{  p(HAM)\prod_{i=1}^3 p(w_i \mid HAM) }{(p(HAM) \prod_{i=1}^3 p(w_i \mid HAM)) +(p(SPAM) \prod_{i=1}^3 p(w_i \mid SPAM))}
\end{aligned}
\end{equation}

Here $w_i$ will be the $i^{th}$ word $\text{d6}$.

Sometimes the above equations are simplified for the purposes of classification as follows:

\begin{equation}
\begin{aligned}
p(C_k \mid x_1, \dots, x_n)
& \varpropto p(C_k, x_1, \dots, x_n) \\
& \varpropto p(C_k) \ p(x_1 \mid C_k) \ p(x_2 \mid C_k) \ p(x_3 \mid C_k) \ \cdots \\
& \varpropto p(C_k) \prod_{i=1}^n p(x_i \mid C_k) \,.
\end{aligned}
\end{equation}

__Please report the probability, $Pr(HAM|d6)$, as an integer percentage (please round).__

For example, if $Pr(HAM|d6)  = 0.709493671$ then the $Pr(HAM|d6)$ should be reported as $71$. Please input $71$ for your response.

You can calculate these probabilities by hand and you can verify your calculations by running the code given below.

-----------------------------------

* __d) Short Essay:__ Equation 13.3 in Manning, Raghavan and Shutze shows how a Multinomial Naive Bayes model classifies a document. It predicts the class, $c$, for which the estimated conditional probability of the class given the document's contents,  $\hat{P}(c|d)$, is greatest. In this equation what two pieces of information are required to calculate  $\hat{P}(c|d)$? Your answer should include both mathematical notatation and verbal explanation.

* __e) Short Essay:__ The Enron data includes two classes of documents: `spam` and `ham` (they're actually labeled `1` and `0`). In plain English, explain what  $\hat{P}(c)$ and  $\hat{P}(t_{k} | c)$ mean in the context of this data. How would we estimate these values from a training corpus?

* __f) Multiple Choice:__ How many passes over the data would we need to make to retrieve this information for all classes and all words?

* __g) Hand Calculations:__ Above we've reproduced the document classification example from the textbook (we added an empty subject field to mimic the Enron data format). Remember that the classes in this "Chinese Example" are `1` (about China) and `0` (not about China). Calculate the class priors and the conditional probabilities for an __unsmoothed__ Multinomial Naive Bayes model trained on this data. Show the calculations that lead to your result using markdown and $\LaTeX$ in the space provided or by embedding an image of your hand written work. [`NOTE:` _Your results should NOT match those in the text -- they are training a model with +1 smoothing you are training a model without smoothing_]

    The following is a sample table in Latex. Please feel free to adapt as needed:

    $$
    \begin{matrix} 
    Word & Freq(word in China Docs) & Freq(word in NOT China Docs) & Pr(w_i|y=China) &Pr(w_i|y= NOT China)\\
    beijing    & 0&0&0&0\\
     chinese  & 0&0&0&0\\
     tokyo      & 0&0&0&0\\
     shanghai   & 0&0&0&0\\
     japan      & 0&0&0&0\\
     macao      & 0&0&0&0\\
    CLASS PRIORS    & 0&0&0&0\\
    \end{matrix}
    $$

* __h) Hand Calculations:__ Use the model you trained to classify the following test document: `Chinese Chinese Chinese Tokyo Japan`. Show the calculations that lead to your result using markdown and $\LaTeX$ in the space provided or by embedding an image of your hand written work.

    $$
    \begin{align} 
    Pr(Class| Doc = D_5)  &\approx \underset{c_{j} \in \{China, not China\}}{\operatorname{\text{arg}max}}  P(Class=c_{j}| Doc = D_5) \\
                &\approx \underset{c_{j} \in \{China, not China\}}{\operatorname{\text{arg}max}}  P(Class=c_{j}) \prod_{w_i \in Doc=D_5 }P(w_{i}|c_{j})\\
                 &\approx \underset{c_{j} \in \{China, not China\}}{\operatorname{\text{arg}max}} (\frac{..}{...}, ..)\\
    \end{align}
    $$


* __i) Short Essay:__ Compare the classification you get from this unsmoothed model in `g`/`h` to the results in the textbook's "Example 1" which reflects a model with Laplace plus 1 smoothing. How does smoothing affect our inference?


In [53]:
# q6a
### MULTIPLE CHOICE
### QUESTION: See whole question above. What is Pr(SPAM|X=Urgent)?

#   a.) 64%
#   b.) 32%
#   c.) 8%
#   d.) 15%

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [54]:
# q6b1
### NUMERICAL INPUT
### QUESTION: See whole question above if necessary. What is `Blank_1`?

# | Word | Pr(word|HAM)| Pr(word|SPAM)
# | ---  | ---         | ---
# | good | 1/3         | 3/8 
# | very | 1/3         | 3/8 
# | bad  | Blank_1     | Blank_2


### ENTER ONLY THE ANSWER INSIDE THE ANSWER VARIABLE, AS A STRING. USE THE DECIMAL, NOT THE FRACTION. (i.e. "0.06", NOT "1/16")
#       Please submit your response as a DECIMAL with at least 2 decimal places (as shown above)

answer = "0.33"


#####################
print(answer)


0.33


In [55]:
# q6b2
### NUMERICAL INPUT
### QUESTION: See whole question above if necessary. What is `Blank_2`?

# | Word | Pr(word|HAM)| Pr(word|SPAM)
# | ---  | ---         | ---
# | good | 1/3         | 3/8 
# | very | 1/3         | 3/8 
# | bad  | Blank_1     | Blank_2


### ENTER ONLY THE ANSWER INSIDE THE ANSWER VARIABLE, AS A STRING. USE THE DECIMAL, NOT THE FRACTION. (i.e. "0.06", NOT "1/16")
#       Please submit your response as a DECIMAL with at least 2 decimal places (as shown above)

answer = "0.25"


#####################
print(answer)


0.25


In [56]:
# Notebook Only - DON'T REMOVE THIS
# part c

import pandas as pd
import numpy as np
from IPython import display

vocabulary = ["bad", "good", "very"]

# Document by term matrix
doc_per_term= np.array([[0, 1, 0 ],[0, 1, 1],[1, 1, 0],[1, 0, 1],[1, 1, 2]])

# y_train: 0 for Ham and 1 for spam
class_per_doc= np.array([0,1,1,0,1])
print(pd.DataFrame(np.c_[class_per_doc, doc_per_term], index = ["d1", "d2","d3", "d4", "d5"],columns = ["Class"]+ vocabulary))

## Learn the Naïve Bayes Classification:
model_priors = np.bincount(class_per_doc)/ len(class_per_doc)
print(f"model_priors: {model_priors}")


# Calculate Pr(w_i|ham) aka ham  class conditionals
print (doc_per_term[class_per_doc==0,:])
model_data_given_ham= (np.sum(doc_per_term[class_per_doc==0,:],axis=0)+1)/(np.sum(doc_per_term[class_per_doc==0,:]) +len(vocabulary))
print(f"Pr(w_i|ham):  {np.round(model_data_given_ham, 3)}")

# Calculate Pr(w_i|spam) aka SPAM class conditionals:
model_data_given_spam= (np.sum(doc_per_term[class_per_doc==1,:],axis=0)+1)/(np.sum(doc_per_term[class_per_doc==1,:]) +len(vocabulary))
print(f"Pr(w_i|spam):  {np.round(model_data_given_spam, 3)}")


# Test document terms are: bad, good, very
d6 = [1, 1, 1] #TEST DOCUMENT
print(pd.DataFrame([d6], index = ["d3"], columns = vocabulary))

# Naïve Bayes Classification
# Likelihood
# Applying the Unigram Language Model
# Calculate Posterior Probabilities using the learnt Naive Bayes Model
print(f"likelihood Pr(d6|ham): {np.prod(np.power(model_data_given_ham, d6))}")
print(f"likelihood Pr(d6|SPAM): {np.prod(np.power(model_data_given_spam, d6))}")

pr_ham = np.prod(np.power(model_data_given_ham, d6)) * model_priors[0]
pr_spam = np.prod(np.power(model_data_given_spam, d6))* model_priors[1]
print(f"unnormalized Pr(D6|ham)*Pr(ham) is : {pr_ham:7.5f}")
print(f"unnormalized Pr(D6|SPAM)*Pr(SPAM) is : {pr_spam:7.5f}")

print(f"Posterior Probabilities in % is: Pr(Ham|D6) is : {100*pr_ham/(pr_spam+pr_ham):7.0f}")
print(f"Posterior Probabilities in % is: Pr(SPAM|D6) is : {100*pr_spam/(pr_spam+pr_ham):7.0f}")

    Class  bad  good  very
d1      0    0     1     0
d2      1    0     1     1
d3      1    1     1     0
d4      0    1     0     1
d5      1    1     1     2
model_priors: [0.4 0.6]
[[0 1 0]
 [1 0 1]]
Pr(w_i|ham):  [0.333 0.333 0.333]
Pr(w_i|spam):  [0.273 0.364 0.364]
    bad  good  very
d3    1     1     1
likelihood Pr(d6|ham): 0.037037037037037035
likelihood Pr(d6|SPAM): 0.03606311044327573
unnormalized Pr(D6|ham)*Pr(ham) is : 0.01481
unnormalized Pr(D6|SPAM)*Pr(SPAM) is : 0.02164
Posterior Probabilities in % is: Pr(Ham|D6) is :      41
Posterior Probabilities in % is: Pr(SPAM|D6) is :      59


In [57]:
# q6c
### NUMERICAL INPUT
### QUESTION: See whole question above if necessary. Please report the probability, Pr(HAM|d6),
#             as an integer percentage (please round).

### ENTER ONLY THE ANSWER INSIDE THE ANSWER VARIABLE, AS A STRING. (i.e. an answer of 0.709493671 should be entered as "71")
answer = "41"


#####################
print(answer)

41


In [58]:
# q6d
### SHORT ESSAY
### QUESTION: See whole question above if necessary.
#             Equation 13.3 in Manning, Raghavan, and Shutze  (included above for convenience) shows how a Multinomial
#             Naive Bayes model classifies a document. It predicts the class, 'c' for which the estimated conditional probability
#             of the class given the document's contents, P_hat(c|d), is greatest. In this equation, what two pieces of information
#             are required to calculate P_hat(c|d)? Your answer should include both mathematical notation and verbal explanation.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
To calculate the estimated probability 𝑃̂_hat(c|d) in Equation 13.3, we need two pieces of information:

1. The class prior probability, 𝑃(c) = (number of documents in class c) / (total number of documents), which represents how likely each class is overall in the training data.  
2. The conditional probability of each word in the document given the class, 𝑃(w_i|c) = (count of word w_i in class c documents) / (total words in class c documents), which represents how likely each word w_i in the document is to appear in documents of class c.  

For a document "d" containing words w_1, w_2, ..., w_n, the model estimates:

𝑃̂(c|d) ∝ 𝑃(c) × ∏_{i=1}^{n} 𝑃(w_i|c)

That is, we multiply the prior probability of the class by the likelihood of observing all the words in the document in that class. 
The class with the highest value is chosen as the predicted class.
"""
)


To calculate the estimated probability 𝑃̂_hat(c|d) in Equation 13.3, we need two pieces of information:

1. The class prior probability, 𝑃(c) = (number of documents in class c) / (total number of documents), which represents how likely each class is overall in the training data.  
2. The conditional probability of each word in the document given the class, 𝑃(w_i|c) = (count of word w_i in class c documents) / (total words in class c documents), which represents how likely each word w_i in the document is to appear in documents of class c.  

For a document "d" containing words w_1, w_2, ..., w_n, the model estimates:

𝑃̂(c|d) ∝ 𝑃(c) × ∏_{i=1}^{n} 𝑃(w_i|c)

That is, we multiply the prior probability of the class by the likelihood of observing all the words in the document in that class. 
The class with the highest value is chosen as the predicted class.



In [59]:
# q6e
### SHORT ESSAY
### QUESTION: See whole question above if necessary.
#             The Enron data includes two classes of documents: spam and ham (they're actually labeled 1 and 0).
#             In plain English, explain what P_hat(c) and P_hat(t_k|c) means in the context of this data. How would
#             we estimate these values from a training corpus?

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
In the context of the Enron email data, 𝑃̂_hat(c) means the estimated probability of a class, either
spam (1) or ham (0), based on the training data, which in plain English tells us how common each class
is overall. It can be estimated by counting the number of emails in each class and dividing by the
total number of emails in the training set. Similarly, 𝑃̂(t_k|c) means the estimated probability of a
specific word t_k occurring given a class c, which tells us how likely a given word is to appear in
spam emails versus ham emails. It can be estimated by counting how many times that word appears in
all documents of class c and dividing by the total number of words in class c documents (or possibly 
adding 1 if using Laplace smoothing). These probabilities allow a Naive Bayes model to predict the
likelihood that a new email belongs to spam or ham based on the words it contains.
"""
)


In the context of the Enron email data, 𝑃̂_hat(c) means the estimated probability of a class, either
spam (1) or ham (0), based on the training data, which in plain English tells us how common each class
is overall. It can be estimated by counting the number of emails in each class and dividing by the
total number of emails in the training set. Similarly, 𝑃̂(t_k|c) means the estimated probability of a
specific word t_k occurring given a class c, which tells us how likely a given word is to appear in
spam emails versus ham emails. It can be estimated by counting how many times that word appears in
all documents of class c and dividing by the total number of words in class c documents (or possibly 
adding 1 if using Laplace smoothing). These probabilities allow a Naive Bayes model to predict the
likelihood that a new email belongs to spam or ham based on the words it contains.



In [60]:
# q6f
### MULTIPLE CHOICE
### QUESTION: How many passes over the data would we need to make to retrieve this information for all classes and all words?

#   a.) We'll need three passes over the data because we need to count the words (first pass), count each word's occurrence (second pass) and then compute the probabilities (third pass)
#   b.) We only need one pass over the data to tally up the information for these prior & conditional probabilities... however after counting each word's occurrences in each class we will need to go on to divide by the class totals which is a little extra work after completing the pass over the data.
#   c.) We'll need two passes over the data as we need to compute the totals before we can calculate the probabilities.
#   d.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "b"


#####################
print(answer)

b


Part G My work in Markdown:
Training data:

| DocID | Class | Body                     |
| ----- | ----- | ------------------------ |
| Doc1  | 1     | Chinese Beijing Chinese  |
| Doc2  | 1     | Chinese Chinese Shanghai |
| Doc3  | 1     | Chinese Macao            |
| Doc4  | 0     | Tokyo Japan Chinese      |

We will mimic the tetbook except calculate frequencies and probabilities without smoothing:

* Class counts:

  * Class 1 (China): 3 documents
  * Class 0 (NOT China): 1 document
  * Priors:

    * Pr(y=1) = 3/4
    * Pr(y=0) = 1/4

* Word counts:

  * Chinese: 2+2+1=5 times in Class 1, 1 time in Class 0
  * Beijing: 1 in Class 1, 0 in Class 0
  * Shanghai: 1 in Class 1, 0 in Class 0
  * Macao: 1 in Class 1, 0 in Class 0
  * Tokyo: 0 in Class 1, 1 in Class 0
  * Japan: 0 in Class 1, 1 in Class 0

* Conditional probabilities:

  For each word, $Pr(word|class) = \frac{\text{count of word in class}}{\text{total words in class}}$

  * Class 1 total words = 2+2+2=8
    Word counts in Class 1:

    * Beijing: 1 → Pr = 1/8
    * Chinese: 5 → Pr = 5/8
    * Shanghai: 1 → Pr = 1/8
    * Macao: 1 → Pr = 1/8
    * Tokyo: 0 → Pr = 0
    * Japan: 0 → Pr = 0

  * Class 0 total words: 3
    Word counts in Class 0:

    * Tokyo: 1 →  Pr = 1/3
    * Japan: 1 →  Pr = 1/3
    * Chinese: 1 →  Pr = 1/3
    * Beijing, Shanghai, Macao: 0 →  Pr = 0

Table filled:
 $$
    \begin{matrix} 
    Word & Freq(word in China Docs) & Freq(word in NOT China Docs) & Pr(w_i|y=China) &Pr(w_i|y= NOT China)\\
    beijing    & 1&0&1/8&0\\
     chinese  & 5&1&5/8&1/3\\
     tokyo      & 0&1&0&1/3\\
     shanghai   & 1&0&1/8&0\\
     japan      & 0&1&0&1/3\\
     macao      & 1&0&1/8&0\\
    CLASS PRIORS    & 3&1&3/4&1/4\\
    \end{matrix}
    $$

In [61]:
# q6g
### HAND CALCULATIONS / REPLACE LIST VALUES
### QUESTION: See whole question above if necessary.
#             Above, we've reproduced the document classification example from the textbook (we added an empty subject
#             field to mimic the Enron data format). Remember that the classes in this "Chinese Example" are 1 (about China)
#             and 0 (not about China). Calculate the class priors and the conditional probabilities for an unsmoothed Multinomial
#             Naive Bayes model trained on this data. Fill in the answers below:

#             [NOTE: Your results should NOT match those in the text -- they are trained  with +1 smoothing.
#             You are training a model without smoothing]

### ENTER ONLY THE **FRACTION**. DO NOT ENTER THE DECIMAL.


word =    ["freq_wordinChinaDocs", "freq_wordinNOTChinaDocs", "pr_wi_given_y_eq_China", "pr_wi_given_y_neq_China"]
beijing =      [1,                       0,                          1/8,                      0]
chinese =      [5,                       1,                          5/8,                      1/3]
tokyo =        [0,                       1,                          0,                      1/3]
shanghai =     [1,                       0,                          1/8,                      0]
japan =        [0,                       1,                          0,                      1/3]
macao =        [1,                       0,                          1/8,                      0]
class_priors = [3,                       1,                          3/4,                      1/4]

#####################
# DO NOT MODIFY

all = [beijing, chinese, tokyo, shanghai, japan, macao, class_priors]
print(all)

[[1, 0, 0.125, 0], [5, 1, 0.625, 0.3333333333333333], [0, 1, 0, 0.3333333333333333], [1, 0, 0.125, 0], [0, 1, 0, 0.3333333333333333], [1, 0, 0.125, 0], [3, 1, 0.75, 0.25]]


In [62]:
# q6h
### HAND CALCULATIONS
### QUESTION: See whole question above if necessary.
###           Use the model you trained to classify the following test document: Chinese Chinese Chinese Tokyo Japan.
#             Show the calculations that lead to your result inside the print statement below
#             or by uploading an image called 'q6h.<png|jpeg|etc.>'. For example: 'q6h.png'.

### IF YOU UPLOADED AN IMAGE, ENTER THE FILENAME INSIDE THE PRINT STATEMENT. IF YOU WANT TO WRITE IN LATEX, CREATE A MARKDOWN CELL
###     BELOW AND PUT 'SEE ANSWER BELOW' INSIDE THE PRINT STATEMENT
print(
"""
SEE ANSWER BELOW
"""
)


SEE ANSWER BELOW



Q6h: My Calculations

We classify the test document:

$$D_5 = \text{Chinese Chinese Chinese Tokyo Japan}$$ using the unsmoothed Multinomial Naive Bayes model.

Class priors:

$$P(y = \text{China}) = \frac{3}{4}, \quad P(y \neq \text{China}) = \frac{1}{4}$$

Conditional probabilities:

$$
\begin{aligned}
P(\text{Chinese} \mid \text{China}) &= \frac{5}{8}, & P(\text{Chinese} \mid \text{NOT China}) &= \frac{1}{3} \\
P(\text{Beijing} \mid \text{China}) &= \frac{1}{8}, & P(\text{Beijing} \mid \text{NOT China}) &= 0 \\
P(\text{Shanghai} \mid \text{China}) &= \frac{1}{8}, & P(\text{Shanghai} \mid \text{NOT China}) &= 0 \\
P(\text{Macao} \mid \text{China}) &= \frac{1}{8}, & P(\text{Macao} \mid \text{NOT China}) &= 0 \\
P(\text{Tokyo} \mid \text{China}) &= 0, & P(\text{Tokyo} \mid \text{NOT China}) &= \frac{1}{3} \\
P(\text{Japan} \mid \text{China}) &= 0, & P(\text{Japan} \mid \text{NOT China}) &= \frac{1}{3} \\
\end{aligned}
$$

Now I compute the unnormalized posterior probabilities:

$$
\begin{aligned}
P(\text{China} \mid D_5) &\propto P(y = \text{China}) \cdot P(\text{Chinese} \mid \text{China})^3 \cdot P(\text{Tokyo} \mid \text{China}) \cdot P(\text{Japan} \mid \text{China}) \\
&= \frac{3}{4} \cdot \left(\frac{5}{8}\right)^3 \cdot 0 \cdot 0 = 0
\end{aligned}
$$

$$
\begin{aligned}
P(\text{NOT China} \mid D_5) &\propto P(y \neq \text{China}) \cdot P(\text{Chinese} \mid \text{NOT China})^3 \cdot P(\text{Tokyo} \mid \text{NOT China}) \cdot P(\text{Japan} \mid \text{NOT China}) \\
&= \frac{1}{4} \cdot \left(\frac{1}{3}\right)^3 \cdot \frac{1}{3} \cdot \frac{1}{3} \\
&= \frac{1}{4} \cdot \frac{1}{27} \cdot \frac{1}{9} = \frac{1}{972}
\end{aligned}
$$

Finally, we do the classification:

$$\hat{y} = \arg\max_{c \in \{\text{China}, \text{NOT China}\}} P(c \mid D_5) = \text{NOT China}$$


In [63]:
# q6i
### SHORT ESSAY
### QUESTION: Use the model you trained to classify the following test document: Chinese Chinese Chinese Tokyo Japan.
#             Show the calculations that lead to your result inside the print statement below
#             or by uploading an image called 'q6i.<png|jpeg|etc.>'. For example: 'q6i.png'.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
I believe the question here is incorrect, we should be referencing the question written in the markdown instead.
In our unsmoothed model (g and h above), the test document 'Chinese Chinese Chinese Tokyo Japan'
was classified as NOT China because the words 'Tokyo' and 'Japan' had zero probability under the
China class, which drove the entire probability for that class down to zero despite the overall large
amount of 'Chinese'. In contrast, Example 13.1 in the textbook applies Laplace +1 smoothing, which
makes every word in the vocabulary have at least a small probability in each class. That is, our 0
probabilities for the unsmoothed model now have probabilities > 0 with smoothing. Thus, the China
class does not get wiped out by the unseen words, and the strong evidence from three occurrences of
'Chinese' in China class outweighs the two negative indicators ('Tokyo' and 'Japan'). With smoothing,
the model now correctly labels the test document as China. Smoothing prevents zero probabilities from
dominating the final result, leading to more balanced and realistic classifications, which is
especially useful when the training data is small or incomplete.
"""
)


I believe the question here is incorrect, we should be referencing the question written in the markdown instead.
In our unsmoothed model (g and h above), the test document 'Chinese Chinese Chinese Tokyo Japan'
was classified as NOT China because the words 'Tokyo' and 'Japan' had zero probability under the
China class, which drove the entire probability for that class down to zero despite the overall large
amount of 'Chinese'. In contrast, Example 13.1 in the textbook applies Laplace +1 smoothing, which
makes every word in the vocabulary have at least a small probability in each class. That is, our 0
probabilities for the unsmoothed model now have probabilities > 0 with smoothing. Thus, the China
class does not get wiped out by the unseen words, and the strong evidence from three occurrences of
'Chinese' in China class outweighs the two negative indicators ('Tokyo' and 'Japan'). With smoothing,
the model now correctly labels the test document as China. Smoothing prevents zero probabili

# Question 7: Naive Bayes Inference

In the next two questions you'll write code to parallelize the Naive Bayes calculations that you performed above. We'll do this in two phases: one MapReduce job to perform training and a second MapReduce to perform inference. While in practice we'd need to train a model before we can use it to classify documents, for learning purposes we're going to develop our code in the opposite order. By first focusing on the pieces of information/format we need to perform the classification (inference) task you should find it easier to develop a solid implementation for training phase when you get to question 8 below. In both of these questions we'll continue to use the Chinese example corpus from the textbook to help us test our MapReduce code as we develop it. Below we've reproduced the corpus, test set and model in text format that matches the Enron data.

### Q7 Tasks:

* __a) short essay:__ run the provided cells to create the example files and load them in to HDFS. Then take a closer look at __`NBmodel.txt`__. This text file represents a Naive Bayes model trained (with Laplace +1 smoothing) on the example corpus. What are the 'keys' and 'values' in this file? Which record means something slightly different than the rest? The value field of each record includes two numbers which will be helpful for debugging but which we don't actually need to perform inference -- what are they? [`HINT`: _This file represents the model from Example 13.1 in the textbook, if you're having trouble getting oriented try comparing our file to the numbers in that example._]

* __b) short essay:__ When performing Naive Bayes in practice instead of multiplying the probabilities (as in equation 13.3) we add their logs (as in equation 13.4). Why do we choose to work with log probabilities? If we had an unsmoothed model, what potential error could arise from this transformation?

* __c) multiple choice:__ Documents 6 and 8 in the test set include a word that did not appear in the training corpus (and as a result does not appear in the model). What should we do at inference time when we need a class conditional probability for this word?

* __d) multiple choice:__ The goal of our MapReduce job is to stream over the test set and classify each document by peforming the calculation from equation 13.4. To do this we'll load the model file (which contains the probabilities for equation 13.4) into memory on the nodes where we do our mapping. This is called an in-memory join. Does loading a model 'state' like this depart from the functional programming principles? Explain why or why not. From a scability perspective when would this kind of memory use be justified? When would it be unwise?

* __e) code:__ Complete the code in __`NaiveBayes/classify_mapper.py`__. Read the docstring carefully to understand how this script should work and the format it should return. Run the provided unit tests to confirm that your script works as expected then write a Hadoop streaming job to classify the Chinese example test set. [`HINT 1:` _you shouldn't need a reducer for this one._ `HINT 2:` _Don't forget to add the model file to the_ `-files` _parameter in your Hadoop streaming job so that it gets shipped to the mapper nodes where it will be accessed by your script._]

Run these cells to create the example corpus and model.

In [64]:
%%writefile NaiveBayes/chineseTrain.txt
D1	1		Chinese Beijing Chinese
D2	1		Chinese Chinese Shanghai
D3	1		Chinese Macao
D4	0		Tokyo Japan Chinese

Overwriting NaiveBayes/chineseTrain.txt


In [65]:
%%writefile NaiveBayes/chineseTest.txt
D5	1		Chinese Chinese Chinese Tokyo Japan
D6	1		Beijing Shanghai Trade
D7	0		Japan Macao Tokyo
D8	0		Tokyo Japan Trade

Overwriting NaiveBayes/chineseTest.txt


In [66]:
%%writefile NBmodel.txt
beijing	0.0,1.0,0.111111111111,0.142857142857
chinese	1.0,5.0,0.222222222222,0.428571428571
tokyo	1.0,0.0,0.222222222222,0.0714285714286
shanghai	0.0,1.0,0.111111111111,0.142857142857
ClassPriors	1.0,3.0,0.25,0.75
japan	1.0,0.0,0.222222222222,0.0714285714286
macao	0.0,1.0,0.111111111111,0.142857142857

Overwriting NBmodel.txt


In [67]:
# load the data files into HDFS
!hdfs dfs -copyFromLocal NaiveBayes/chineseTrain.txt {HDFS_DIR}
!hdfs dfs -copyFromLocal NaiveBayes/chineseTest.txt {HDFS_DIR}

copyFromLocal: `/user/root/HW2/chineseTrain.txt': File exists
copyFromLocal: `/user/root/HW2/chineseTest.txt': File exists


In [68]:
# q7a
### SHORT ESSAY
### QUESTION: Run the provided cells to create the example files and load them into HDFS. Then take a
#             closer look at NBmodel.txt. This text file represents a Naive Bayes model trained
#             (with Laplace +1 smoothing) on the example corpus. What are the 'keys' and 'values'
#             in this file? Which record means something slightly different than the rest? The value
#             field of each record includes two numbers that will be helpful for debugging but which
#             we don't actually need to perform inference -- what are they? [HINT: This file represents
#             the model from Example 13.1 in the textbook, if you're having trouble getting oriented
#             try comparing our file to the numbers in that example.]

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
In NBModel.txt, each line in here is a key-value pair representing part of the trained Naive Bayes
model. I believe the keys are words from the vocabulary, such as 'beijing', 'chinese', and 'tokyo'.
The values consist of four numbers: the first number is the frequency count of that word in documents
labeled 0 (Not China), the second number is the frequency count in documents labeled 1 (China), the
third number is the conditional probability P(word|class=Not China), and the fourth number is
P(word|class=China). The third and fourth numbers are what we actually use for inference.
Most records correspond to individual words and their conditional probabilities, while the
'ClassPriors' record is slightly different because it represents the prior probabilities of the classes
rather than word probabilities. The first two numbers in each record are probably for debugging and are
not needed for the inference step.
"""
)


In NBModel.txt, each line in here is a key-value pair representing part of the trained Naive Bayes
model. I believe the keys are words from the vocabulary, such as 'beijing', 'chinese', and 'tokyo'.
The values consist of four numbers: the first number is the frequency count of that word in documents
labeled 0 (Not China), the second number is the frequency count in documents labeled 1 (China), the
third number is the conditional probability P(word|class=Not China), and the fourth number is
P(word|class=China). The third and fourth numbers are what we actually use for inference.
Most records correspond to individual words and their conditional probabilities, while the
'ClassPriors' record is slightly different because it represents the prior probabilities of the classes
rather than word probabilities. The first two numbers in each record are probably for debugging and are
not needed for the inference step.



In [69]:
# q7b
### SHORT ESSAY
### QUESTION: When performing Naive Bayes in practice instead of multiplying the probabilities
#             (as in equation 13.3) we add their logs (as in equation 13.4).
#             Why do we choose to work with log probabilities? If we had an unsmoothed model,
#             what potential error could arise from this transformation?

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
We choose to work with log probabilities because multiplying many small probabilities can quickly
result in the product becoming so small that it cannot be accurately represented by the computer
(floating point underflow as the textbook says). By taking the logarithm of each probability,
multiplication becomes addition, which is much more stable and prevents this issue. Using logs also
makes the calculations easier to manage and allows us to compare values without computing extremely
tiny products. However, if we have an unsmoothed model and a word occurs in the test document but never
appeared in the training data for a given class, its probability is zero. Taking the logarithm of zero
is undefined, which would cause a computation error and prevent us from obtaining a valid result for
that class. Thus, smoothing is important when using log probabilities in Naive Bayes.
"""
)


We choose to work with log probabilities because multiplying many small probabilities can quickly
result in the product becoming so small that it cannot be accurately represented by the computer
(floating point underflow as the textbook says). By taking the logarithm of each probability,
multiplication becomes addition, which is much more stable and prevents this issue. Using logs also
makes the calculations easier to manage and allows us to compare values without computing extremely
tiny products. However, if we have an unsmoothed model and a word occurs in the test document but never
appeared in the training data for a given class, its probability is zero. Taking the logarithm of zero
is undefined, which would cause a computation error and prevent us from obtaining a valid result for
that class. Thus, smoothing is important when using log probabilities in Naive Bayes.



In [70]:
# q7c
### MULTIPLE CHOICE
### QUESTION: Documents 6 and 8 in the test set include a word that did not appear in the training
#             corpus (and as a result does not appear in the model). What should we do at inference
#             time when we need a class conditional probability for this word?

#   a.) We should ignore such documents with previously unseen words.

#   b.) We should assign the class probability of the majority class to the unseen word.

#   c.) We could either assign the same conditional probability to each class (eg. 0.5 for two classes)
#       or we could simply disregard that word since multiplying the same value for each class won't
#       ultimately affect the argmax determination.

#   d.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "c"


#####################
print(answer)

c


In [71]:
# q7d
### MULTIPLE CHOICE
### QUESTION: The goal of our MapReduce job is to stream over the test set and classify each
#             document by performing the calculation from equation 13.4 (see figure above for
#             more details of this equation). To do this, we'll load the model file (which contains
#             the probabilities for equation 13.4 (see figure above for more details of this equation)
#             into memory on the nodes where we do our mapping. This is called an in-memory join.
#             Does loading a model 'state' like this depart from the functional programming principles?
#             Explain why or why not. From a scalability perspective, when would this kind of memory
#             use be justified? When would it be unwise?

#   a.) Loading the model into memory fundamentally breaks the functional programming model as it
#       makes it fully stateful. This is unacceptable and it's a discouraged programming practice
#       in a map-reduce model. We should find an alternative way of solving the problem.

#   b.) Loading the model into memory has absolutely nothing to do with the functional programming model.
#       This is a very acceptable solution that scales really well and has the benefit of being applicable
#       to small as well as large vocabularies.

#   c.) Loading the model into memory is a slight departure from statelessness since we're maintaining
#       a model state. This is forgivable since we're not updating that state at all after it gets
#       loaded on each mapper node so there's no risk of a race condition.
#       Furthermore our model is small, however with a really large vocabulary the model
#       could get too large to fit in memory and we might pursue a different solution in that case.

#   d.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "c"


#####################
print(answer)

c


Your work for `part e` starts here:

In [137]:
# part e - do your work in NaiveBayes/classify_mapper.py first, then run this cell.
!chmod a+x NaiveBayes/classify_mapper.py

In [138]:
!cat NaiveBayes/classify_mapper.py

#!/usr/bin/env python
"""
Mapper for Naive Bayes Inference.
INPUT:
    ID \t true_class \t subject \t body \n
OUTPUT:
    ID \t true_class \t logP(ham|doc) \t logP(spam|doc) \t predicted_class
SUPPLEMENTAL FILE: 
    This script requires a trained Naive Bayes model stored 
    as NBmodel.txt in the current directory. The model should 
    be a tab separated file whose records look like:
        WORD \t ham_count,spam_count,P(word|ham),P(word|spam)
        
Instructions:
    We have loaded the supplemental file and taken the log of 
    each conditional probability in the model. We also provide
    the code to tokenize the input lines for you. Keep in mind 
    that each 'line' of this file represents a unique document 
    that we wish to classify. Fill in the missing code to get
    the probability of each class given the words in the document.
    Remember that you will need to handle the case where you
    encounter a word that is not represented in the model.
"""
import os
import r

In [139]:
# part e - unit test NaiveBayes/classify_mapper.py (RUN THIS CELL AS IS)
!cat NaiveBayes/chineseTest.txt | NaiveBayes/classify_mapper.py | column -t

d5  1  -8.90668134500626   -8.10769031284611   1
d6  1  -5.780743515794329  -4.179502370564408  1
d7  0  -6.591673732011658  -7.511706880737812  0
d8  0  -4.394449154674438  -5.565796731681498  0


In [140]:
# part e - clear the output directory in HDFS (RUN THIS CELL AS IS)
!hdfs dfs -rm -r {HDFS_DIR}/chinese-output

Deleted /user/root/HW2/chinese-output


In [141]:
# part e - write your Hadooop streaming job here
#NOTE: I HAD TO ADD REDUCER WITH SORT JUST TO MAP THE EXPECTED OUTCOME SHOWN BELOW:
!hadoop jar {JAR_FILE} \
    -files NBmodel.txt,NaiveBayes/classify_mapper.py \
    -mapper classify_mapper.py \
    -reducer /bin/cat \
    -input {HDFS_DIR}/chineseTest.txt \
    -output {HDFS_DIR}/chinese-output \
    -cmdenv PATH={PATH}

packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob703969390542198683.jar tmpDir=null
2025-09-21 05:57:16,789 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:57:17,078 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:57:17,591 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:57:17,592 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:57:17,828 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0029
2025-09-21 05:57:18,581 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 05:57:19,433 INFO mapreduce.JobSubmitter: number of splits:10
2025-09-21 05:57:19,627 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0029
2025-09-21 05:57:19,629 INFO mapred

In [142]:
# part e - retrieve test set results from HDFS (RUN THIS CELL AS IS)
!hdfs dfs -cat {HDFS_DIR}/chinese-output/part-000* > NaiveBayes/chineseResults.txt

In [285]:
# part e - take a look (RUN THIS CELL AS IS)
!cat NaiveBayes/chineseResults.txt | sort -k1,1 | column -t

d5  1  -8.90668134500626   -8.10769031284611   1
d6  1  -5.780743515794329  -4.179502370564408  1
d7  0  -6.591673732011658  -7.511706880737812  0
d8  0  -4.394449154674438  -5.565796731681498  0


<table>
<th> Expected output for the test set:</th>
<tr align=Left><td><pre>
d5	1	-8.90668134	-8.10769031	1
d6	1	-5.78074351	-4.17950237	1
d7	0	-6.59167373	-7.51170688	0
d8	0	-4.39444915	-5.56579673	0
</pre></td><tr>
</table>

# Question 8: Naive Bayes Training

In Question 7 we used a model that we had trained by hand. Next we'll develop the code to do that same training in parallel, making it suitable for use with larger corpora (like the Enron emails). The end result of the MapReduce job you write in this question should be a model text file that looks just like the example (`NBmodel.txt`) that we created by hand above.

To refresh your memory about the training process take a look at  `6a` and `6b` where you described the pieces of information you'll need to collect in order to encode a Multinomial Naive Bayes model. We now want to retrieve those pieces of information while streaming over a corpus. The bulk of the task will be very similar to the word counting excercises you've already done but you may want to consider a slightly different key-value record structure to efficiently tally counts for each class. 

The most challenging (interesting?) design question will be how to retrieve the totals (# of documents and # of words in documents for each class). Of course, counting these numbers is easy. The hard part is the timing: you'll need to make sure you have the counts totalled up _before_ you start estimating the class conditional probabilities for each word. It would be best (i.e. most scalable) if we could find a way to do this tallying without storing the whole vocabulary in memory... Use an appropriate MapReduce design pattern to implement this efficiently! 


### Q8 Tasks:

* __a) make a plan:__  Fill in the docstrings for __`NaiveBayes/train_mapper.py`__ and __`NaiveBayes/train_reducer.py`__ to appropriately reflect the format that each script will input/output. [`HINT:` _the input files_ (`enronemail_1h.txt` & `chineseTrain.txt`) _have a prespecified format and your output file should match_ `NBmodel.txt` _so you really only have to decide on an internal format for Hadoop_].


* __b) short essay:__ Read the code in __`NaiveBayes/train_mapper.py`__ and __`NaiveBayes/train_reducer.py`__ so that together they train a Multinomial Naive Bayes model __with no smoothing__. Confirm that your trained model matches your hand calculations from Question 6. Explain the code.


* __c) multiple choice:__ We saw in Question 6 that adding Laplace smoothing (where the smoothing parameter $k=1$) makes our classifications less sensitve to rare words. However implementing this technique requires access to one additional piece of information that we had not previously used in our Naive Bayes training. What is that extra piece of information? [`HINT:` see equation 13.7 in Manning, Raghavan and Schutze].

* __d) multiple choice:__ There are a couple of approaches that we could take to handle the extra piece of information you identified in `c`: 1) if we knew this extra information beforehand, we could provide it to our reducer as a configurable parameter for the vocab size dynamically (_where would we get it in the first place?_). Or 2) we could compute it in the reducer without storing any bulky information in memory but then we'd need some postprocessing or a second MapReduce job to complete the calculation (_why?_). What is non-ideal about each of these options?

* __e) code + short essay:__ Choose one of the 2 options above. State your choice & reasoning in the space below then use that strategy to complete the code in __`NaiveBayes/train_reducer_smooth.py`__. Test this alternate reducer then write and run a Hadoop streaming job to train an MNB model with smoothing on the Chinese example. Your results should match the model that we provided for you above (and the calculations in the textbook example). __IMPORTANT NOTE:__ For full credit on this question, your code must work with multiple reducers. 

    - [`HINT:` You will need to implement custom partitioning - [Total Order Sort Notebook](https://github.com/UCB-w261/main/tree/master/HelpfulResources/TotalSortGuide/_total-sort-guide-spark2.01-JAN27-2017.ipynb) in GCS bucket __`GCS/HelpfulResources/TotalSortGuide/_total-sort-guide-spark2.01-JAN27-2017.ipynb`__] 
    
    - [`HINT:` To make your custom partitioning code more flexible, you can read the number of reduce tasks configuration parameter in your mapper code (instead of hard coding it). See pg 204. Hadoop Defintive Guide - Streaming environment variables]

    - [`HINT:` Don't start from scratch with this one -- you can just copy over your reducer code from part `b` and make the needed modifications]. 


In [144]:
# part a - do your work in train_mapper.py and train_reducer.py then RUN THIS CELL AS IS
!chmod a+x NaiveBayes/train_mapper.py
!chmod a+x NaiveBayes/train_reducer.py
!echo "=========== MAPPER DOCSTRING ============"
!head -n 8 NaiveBayes/train_mapper.py | tail -n 6
!echo "=========== REDUCER DOCSTRING ============"
!head -n 8 NaiveBayes/train_reducer.py | tail -n 6

Mapper reads in text documents and emits word counts by class.
INPUT:                                                    
    DocID \t true_class \t subject \t body                
OUTPUT:                                                   
    partitionKey \t word \t class0_partialCount,class1_partialCount       
    
Reducer aggregates word counts by class and emits frequencies.

INPUT:
    Each line of input comes from the mapper and has the format:
        partitionKey \t word \t class0_partialCount,class1_partialCount



__`part b starts here`:__ MNB _without_ Smoothing (training on Chinese Example Corpus).

In [145]:
# q8b1
### SHORT ESSAY
### QUESTION: Read the code in NaiveBayes/train_mapper.py and NaiveBayes/train_reducer.py so
#             that together they train a Multinomial Naive Bayes model with no smoothing.
#             Confirm that your trained model matches your hand calculations from Question 6.
#             Explain the code.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""

The `train_mapper.py` and `train_reducer.py` scripts work together to train a Multinomial Naive Bayes
model using Hadoop Streaming. The mapper’s job is to read each document in the training set, split it
into words, and keep track of which class the document belongs to. For every word, the mapper emits a
record that says whether it came from class 0 or class 1, using the format
`[class0_count, class1_count]`. The mapper also keeps track of the total number of documents and total
number of words per class. At the end, it sends these totals along with special keys (`docTotals` and
`wordTotals`) so that every reducer can access the global class statistics.

The reducer then takes all of these intermediate records and combines them. For words, it adds up the
counts from different mappers so we get the total number of times each word appears in class 0 and in
class 1. Once the totals are ready, the reducer computes the conditional probabilities for each word in
each class: `P(word|class) = count(word, class) / total_words_in_class`. It also computes the class
priors, which are just the fraction of documents in each class compared to the total number of
documents. These results are formatted to match the `NBmodel.txt` structure.

When we compare the output from these scripts to the hand calculations in Question 6, we see that the
values match. For example, the word “Chinese” appears exactly the right number of times in the class 1
documents, and the computed probability lines up with the manual fraction we calculated. The same is
true for other words like “Tokyo,” “Beijing,” and “Macao,” and the overall priors (25% class 0,
75% class 1) are correct. This confirms that the MapReduce implementation is reproducing
the unsmoothed Multinomial Naive Bayes model.

The design choice in these scripts is how the mapper and reducer communicate. Instead of trying to hold
everything in memory, the mapper outputs both word counts and global totals so that reducers can
independently compute the necessary probabilities. This design makes the workflow scalable for large
datasets, like the Enron emails, while still matching the logic of the hand-calculated example.
Thus, the code is a parallelized version of the exact same training process we did manually.
"""
)



The `train_mapper.py` and `train_reducer.py` scripts work together to train a Multinomial Naive Bayes
model using Hadoop Streaming. The mapper’s job is to read each document in the training set, split it
into words, and keep track of which class the document belongs to. For every word, the mapper emits a
record that says whether it came from class 0 or class 1, using the format
`[class0_count, class1_count]`. The mapper also keeps track of the total number of documents and total
number of words per class. At the end, it sends these totals along with special keys (`docTotals` and
`wordTotals`) so that every reducer can access the global class statistics.

The reducer then takes all of these intermediate records and combines them. For words, it adds up the
counts from different mappers so we get the total number of times each word appears in class 0 and in
class 1. Once the totals are ready, the reducer computes the conditional probabilities for each word in
each class: `P(word|class) 

In [146]:
# part b - write a unit test for your mapper here - RUN CELL AS IS
!cat NaiveBayes/chineseTrain.txt | NaiveBayes/train_mapper.py

A	chinese	0,1
A	beijing	0,1
A	chinese	0,1
A	chinese	0,1
A	chinese	0,1
A	shanghai	0,1
A	chinese	0,1
A	macao	0,1
A	tokyo	1,0
A	japan	1,0
A	chinese	1,0
A	*docTotals	1,3
A	*wordTotals	3,8


In [147]:
# part b - write a unit test for your reducer here - RUN CELL AS IS
!cat NaiveBayes/chineseTrain.txt | NaiveBayes/train_mapper.py | sort -k2,2 > mapper_test_output.txt
!cat mapper_test_output.txt | NaiveBayes/train_reducer.py | column -t

beijing      0,1,0.0,0.125
chinese      1,5,0.3333333333333333,0.625
japan        1,0,0.3333333333333333,0.0
macao        0,1,0.0,0.125
shanghai     0,1,0.0,0.125
tokyo        1,0,0.3333333333333333,0.0
ClassPriors  1.0,3.0,0.25,0.75


In [148]:
# part b - write a systems test for your mapper + reducer together here - RUN CELL AS IS
!cat NaiveBayes/chineseTrain.txt | NaiveBayes/train_mapper.py | sort | NaiveBayes/train_reducer.py | column -t

beijing      0,1,0.0,0.125
chinese      1,5,0.3333333333333333,0.625
japan        1,0,0.3333333333333333,0.0
macao        0,1,0.0,0.125
shanghai     0,1,0.0,0.125
tokyo        1,0,0.3333333333333333,0.0
ClassPriors  1.0,3.0,0.25,0.75


In [149]:
# part b - clear (and name) an output directory in HDFS for your unsmoothed chinese NB model - RUN CELL AS IS
!hdfs dfs -rm -r {HDFS_DIR}/chinese-train-output

Deleted /user/root/HW2/chinese-train-output


In [150]:
# part b - write your hadoop streaming job - RUN CELL AS IS
!hadoop jar {JAR_FILE} \
  -D stream.num.map.output.key.fields=2 \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapreduce.partition.keycomparator.options="-k2,2" \
  -D mapreduce.partition.keypartitioner.options="-k1,1" \
  -files NaiveBayes/train_mapper.py,NaiveBayes/train_reducer.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -mapper train_mapper.py \
  -reducer train_reducer.py \
  -input {HDFS_DIR}/chineseTrain.txt \
  -output {HDFS_DIR}/chinese-train-output \
  -cmdenv PATH={PATH} \
  -numReduceTasks 2 # <-- feel free to modify the number of reducers

packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob2085878467307239662.jar tmpDir=null
2025-09-21 05:58:55,860 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:58:56,164 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:58:56,669 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 05:58:56,670 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 05:58:56,888 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0030
2025-09-21 05:58:58,037 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 05:58:58,096 INFO mapreduce.JobSubmitter: number of splits:10
2025-09-21 05:58:58,326 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0030
2025-09-21 05:58:58,328 INFO mapre

In [151]:
# part b - extract your results (i.e. model) to a local file - RUN CELL AS IS
!hdfs dfs -cat {HDFS_DIR}/chinese-train-output/part-0000* > NaiveBayes/chineseModelUnsmoothed.txt

In [152]:
# part b - print your model so that we can confirm that it matches expected results - RUN CELL AS IS
!cat NaiveBayes/chineseModelUnsmoothed.txt | column -t

beijing      0,1,0.0,0.125
japan        1,0,0.3333333333333333,0.0
tokyo        1,0,0.3333333333333333,0.0
ClassPriors  1.0,3.0,0.25,0.75
chinese      1,5,0.3333333333333333,0.625
macao        0,1,0.0,0.125
shanghai     0,1,0.0,0.125
ClassPriors  1.0,3.0,0.25,0.75


In [153]:
# q8c
### MULTIPLE CHOICE
### QUESTION: What is that extra piece of information you will need in order to smooth
#             the word class conditional probabilities?

#   a.) vocabulary size
#   b.) the number of occurrences of t in training documents from class c
#   c.) the number of documents in class c
#   d.) the total number of documents (including class c and not class c)
#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [154]:
# q8d
### MULTIPLE CHOICE
### QUESTION: There are a couple of approaches that we could take to handle the extra piece of information
#             you identified in Q8.c)

#             1) if we knew this extra information beforehand, we could provide it to our reducer as a
#                configurable parameter for the vocab size dynamically (where would we get it in 
#                the first place?).
#             2) we could compute it in the reducer without storing any bulky information in memory
#                but then we'd need some postprocessing or a second MapReduce job to complete the
#                calculation (why?). 

#             What is non-ideal about each of these options?

#   a.) For option 1, we need to do some EDA in advance, but the data might be too big for us to pass through.
#       For option 2, it is incompatible with using multiple reducers.

#   b.) For option 1, the information we got might not be accurate because it is changing dynamically.
#       For option 2, we might get out of memory issue.

#   c.) For option 1, there is no way to have the information in advance.
#       For option 2, it will take a long time to process.

#   d.) For option 1, we can not pass it as a configurable parameter.
#       For option 2, we wouldn't be able to compute the correct conditional probability (estimates)
#       until the reducer has already parsed all of the records.

#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "d"


#####################
print(answer)

d


__`part e starts here`:__ MNB _with_ Smoothing (training on Chinese Example Corpus).

In [175]:
# q8e1
### SHORT ESSAY
### QUESTION: Choose one of the 2 options above. State your choice & reasoning in the space below.

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
I would choose Option 2, computing the vocabulary size in the reducer, because of scalability. 
If we choose Option 1, we’d have to run an additional preprocessing pass or do some EDA just to know 
the vocabulary size, which becomes inefficient when dealing with very large datasets. In contrast, 
Option 2 allows us to compute the vocabulary size as part of the reducer’s workflow, even though it
means we’ll need either a second MapReduce job or a postprocessing step before finalizing probabilities.
While this adds some overhead, it avoids assumptions or prior knowledge about the data and is more
true to the distributed nature of MapReduce. In practice, needing an extra pass for smoothing is more
acceptable than requiring complete knowledge of the dataset ahead of time.
"""
)


I would choose Option 2, computing the vocabulary size in the reducer, because of scalability. 
If we choose Option 1, we’d have to run an additional preprocessing pass or do some EDA just to know 
the vocabulary size, which becomes inefficient when dealing with very large datasets. In contrast, 
Option 2 allows us to compute the vocabulary size as part of the reducer’s workflow, even though it
means we’ll need either a second MapReduce job or a postprocessing step before finalizing probabilities.
While this adds some overhead, it avoids assumptions or prior knowledge about the data and is more
true to the distributed nature of MapReduce. In practice, needing an extra pass for smoothing is more
acceptable than requiring complete knowledge of the dataset ahead of time.



In [244]:
# part e
!cat NaiveBayes/train_reducer_smooth.py

#!/usr/bin/env python

import os
import sys                                                  
import numpy as np  

#################### YOUR CODE HERE ###################
"""
Multi-reducer for Naive Bayes with Laplace smoothing, aggregates word counts by class and emits smoothed conditional probabilities

INPUT:
    partitionKey \t word \t class0_partialCount,class1_partialCount
    where:
      - word can be a token, or special keys:
            *docTotals, *wordTotals, *vocabWord

OUTPUT:
    word \t class0_count,class1_count,P(word|class0),P(word|class1)
    OR:
    ClassPriors \t doc0_count,doc1_count,P(class0),P(class1)

This reducer works correctly with multiple reducers.
Vocabulary size is inferred dynamically as the number of unique words seen
(excluding docTotals and wordTotals).
"""

def EMIT(word, c0, c1, p0, p1):
    print(f"{word}\t{c0},{c1},{p0},{p1}")

# trackers
docTotals = np.array([0.0, 0.0])
wordTotals = np.array([0.0, 0.0])
vocab = set()
word_counts = {}

for line 

In [245]:
# part e - write a unit test for your NEW reducer here
!chmod +x NaiveBayes/train_reducer_smooth.py
!cat NaiveBayes/chineseTrain.txt | NaiveBayes/train_mapper.py | sort -k2,2 > mapper_test_output.txt
!cat mapper_test_output.txt | NaiveBayes/train_reducer_smooth.py | column -t

beijing      0,1,0.3333333333333333,0.25
chinese      1,5,0.6666666666666666,0.75
japan        1,0,0.6666666666666666,0.125
macao        0,1,0.3333333333333333,0.25
shanghai     0,1,0.3333333333333333,0.25
tokyo        1,0,0.6666666666666666,0.125
ClassPriors  1,3,0.25,0.75


In [246]:
# part e - write a systems test for your mapper + reducer together here
!cat NaiveBayes/chineseTrain.txt | NaiveBayes/train_mapper.py | sort | NaiveBayes/train_reducer_smooth.py | column -t

beijing      0,1,0.3333333333333333,0.25
chinese      1,5,0.6666666666666666,0.75
japan        1,0,0.6666666666666666,0.125
macao        0,1,0.3333333333333333,0.25
shanghai     0,1,0.3333333333333333,0.25
tokyo        1,0,0.6666666666666666,0.125
ClassPriors  1,3,0.25,0.75


In [247]:
# part e - clear (and name) an output directory in HDFS for your SMOOTHED chinese NB model
!hdfs dfs -rm -r {HDFS_DIR}/chinese-train-smooth-output

Deleted /user/root/HW2/chinese-train-smooth-output


In [248]:
# part e - write your hadoop streaming job
!hadoop jar {JAR_FILE} \
  -D stream.num.map.output.key.fields=2 \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapreduce.partition.keycomparator.options="-k2,2" \
  -D mapreduce.partition.keypartitioner.options="-k1,1" \
  -files NaiveBayes/train_mapper.py,NaiveBayes/train_reducer_smooth.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -mapper train_mapper.py \
  -reducer train_reducer_smooth.py \
  -input {HDFS_DIR}/chineseTrain.txt \
  -output {HDFS_DIR}/chinese-train-smooth-output \
  -cmdenv PATH={PATH} \
  -numReduceTasks 2

packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob8662185012174384812.jar tmpDir=null
2025-09-21 07:07:25,864 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:07:26,166 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:07:26,686 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:07:26,687 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:07:26,907 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0048
2025-09-21 07:07:27,276 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 07:07:27,336 INFO mapreduce.JobSubmitter: number of splits:10
2025-09-21 07:07:27,550 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0048
2025-09-21 07:07:27,551 INFO mapre

In [249]:
# part e - extract your results (i.e. model) to a local file called "chineseModelSmoothed.txt" in the NaiveBayes folder.
!hdfs dfs -getmerge {HDFS_DIR}/chinese-train-smooth-output NaiveBayes/chineseModelSmoothed.txt

In [250]:
# part e - RUN CELL AS IS
!cat NaiveBayes/chineseModelSmoothed.txt | column -t

beijing      0,1,0.3333333333333333,0.25
japan        1,0,0.6666666666666666,0.125
tokyo        1,0,0.6666666666666666,0.125
ClassPriors  1,3,0.25,0.75
chinese      1,5,0.6666666666666666,0.75
macao        0,1,0.3333333333333333,0.25
shanghai     0,1,0.3333333333333333,0.25
ClassPriors  1,3,0.25,0.75


# Question 9: Enron Ham/Spam NB Classifier & Results

Fantastic work. We're finally ready to perform Spam Classification on the Enron Corpus. In this question you'll run the analysis you've developed, report its performance.

### Q9 Tasks:
* __a) train/test split:__ Run the provided code to split our Enron file into a training set and testing set then load them into HDFS. [`NOTE:` _Make sure you re calculate the vocab size for just the training set!_]

* __b) code:__ Write Hadoop Streaming jobs to train MNB Models on the training set with smoothing (without smoothing is provided for your reference). Save your models to local files at __`NaiveBayes/Unsmoothed/NBmodel.txt`__ and __`NaiveBayes/Smoothed/NBmodel.txt`__. [`NOTE:` _This naming is important because we wrote our classification task so that it expects a file of that name... if this inelegance frustrates you there is an alternative that would involve a few adjustments to your code [read more about it here](http://www.tnoda.com/blog/2013-11-23)._] Finally run the checks that we provide to confirm that your results are correct.


* __c) code:__ Recall that we designed our classification job with just a mapper. An efficient way to report the performance of our models would be to simply add a reducer phase to this job and compute precision and recall right there. Complete the code in __`NaiveBayes/evaluation_reducer.py`__ and then write Hadoop jobs to evaluate your two models on the test set. Report their performance side by side. [`NOTE:` if you need a refresher on precision, recall and F1-score [Wikipedia](https://en.wikipedia.org/wiki/F1_score) is a good resource.]


* __d) short essay:__ Compare the performance of your two models. What do you notice about the unsmoothed model's predictions? Can you guess why this is happening? Which evaluation measure do you think is most relevant in our use case? [`NOTE:` _Feel free to answer using your common sense but if you want more information on evaluating the classification task checkout_ [this blogpost](https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/) or [here](https://web.archive.org/web/20141112020055/https://www.flinders.edu.au/science_engineering/fms/School-CSEM/publications/tech_reps-research_artfcts/TRRA_2007.pdf)

* __e.1) multiple answers:__ What is the reason behind the different performance of two models? (Select 2)

* __e.2) multiple choide:__ Which evaluation measure do you think is least relevant in our use case?


__Test/Train split__

In [251]:
# part a - test/train split (RUN THIS CELL AS IS)
!head -n 80 data/enronemail_1h.txt > data/enron_train.txt
!tail -n 20 data/enronemail_1h.txt > data/enron_test.txt
!hdfs dfs -copyFromLocal data/enron_train.txt {HDFS_DIR}
!hdfs dfs -copyFromLocal data/enron_test.txt {HDFS_DIR}

copyFromLocal: `/user/root/HW2/enron_train.txt': File exists
copyFromLocal: `/user/root/HW2/enron_test.txt': File exists


In [252]:
# Get vocab size from the training set only
!cat data/enron_train.txt | NaiveBayes/train_mapper.py | cut -f2 | sort | uniq | wc -l

4557


__Training__ (Enron MNB Model _without smoothing_ )

In [253]:
# part b -  Unsmoothed model (RUN CELL AS IS)

# clear the output directory
!hdfs dfs -rm -r {HDFS_DIR}/enron-model

# hadoop command
!hadoop jar {JAR_FILE} \
  -D stream.num.map.output.key.fields=2 \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapreduce.partition.keycomparator.options="-k2,2" \
  -D mapreduce.partition.keypartitioner.options="-k1,1" \
  -files NaiveBayes/train_mapper.py,NaiveBayes/train_reducer.py \
  -mapper train_mapper.py \
  -reducer train_reducer.py \
  -input {HDFS_DIR}/enron_train.txt \
  -output {HDFS_DIR}/enron-model \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -cmdenv PATH={PATH}

Deleted /user/root/HW2/enron-model
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob3186959711895925512.jar tmpDir=null
2025-09-21 07:08:30,930 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:08:31,235 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:08:31,766 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:08:31,766 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:08:31,967 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0049
2025-09-21 07:08:32,288 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 07:08:32,352 INFO mapreduce.JobSubmitter: number of splits:9
2025-09-21 07:08:32,645 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0049


In [254]:
# save the model locally - RUN CELL AS IS
!mkdir NaiveBayes/Unsmoothed

mkdir: cannot create directory ‘NaiveBayes/Unsmoothed’: File exists


In [255]:
!hdfs dfs -cat {HDFS_DIR}/enron-model/part-000* > NaiveBayes/Unsmoothed/NBmodel.txt

In [256]:
# part b - check your UNSMOOTHED model results (RUN THIS CELL AS IS)
!grep assistance NaiveBayes/Unsmoothed/NBmodel.txt
# EXPECTED OUTPUT: assistance	2,4,0.000172547666293,0.000296823983378

assistance	2,4,0.0001725476662928134,0.00029682398337785694


In [257]:
# part b - check your UNSMOOTHED model results (RUN THIS CELL AS IS)
!grep money NaiveBayes/Unsmoothed/NBmodel.txt
# EXPECTED OUTPUT: money	1,22,8.62738331464e-05,0.00163253190858

money	1,22,8.62738331464067e-05,0.001632531908578213


__Training__ (Enron MNB Model _with Laplace +1 smoothing_ )

In [258]:
# part b -  Smoothed model (FILL IN THE MISSING CODE BELOW)

# clear the output directory
!hdfs dfs -rm -r {HDFS_DIR}/smooth-model

# hadoop command
!hadoop jar {JAR_FILE} \
  -D stream.num.map.output.key.fields=2 \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapreduce.partition.keycomparator.options="-k2,2" \
  -D mapreduce.partition.keypartitioner.options="-k1,1" \
  -files NaiveBayes/train_mapper.py,NaiveBayes/train_reducer_smooth.py \
  -mapper train_mapper.py \
  -reducer train_reducer_smooth.py \
  -input {HDFS_DIR}/enron_train.txt \
  -output {HDFS_DIR}/smooth-model \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -cmdenv PATH={PATH}

# save the model locally
!mkdir NaiveBayes/Smoothed
!hdfs dfs -cat {HDFS_DIR}/smooth-model/part-000* > NaiveBayes/Smoothed/NBmodel.txt

Deleted /user/root/HW2/smooth-model
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob8285227735498364300.jar tmpDir=null
2025-09-21 07:09:28,181 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:09:28,485 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:09:29,023 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:09:29,023 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:09:29,248 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0050
2025-09-21 07:09:29,635 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 07:09:29,689 INFO mapreduce.JobSubmitter: number of splits:9
2025-09-21 07:09:29,919 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1758422346528_0050

In [259]:
# part b - check your SMOOTHED model results (RUN THIS CELL AS IS)
!grep assistance NaiveBayes/Smoothed/NBmodel.txt
# EXPECTED OUTPUT: assistance	2,4,0.0001858045336306206,0.00027730020520215184

assistance	2,4,0.0002588214994392201,0.00037102997922232116


In [260]:
# part b - check your SMOOTHED model results (RUN THIS CELL AS IS)
!grep money NaiveBayes/Smoothed/NBmodel.txt
# EXPECTED OUTPUT: money	1,22,0.0001238696890870804,0.0012755809439298986

money	1,22,0.0001725476662928134,0.0017067379044226774


In [261]:
# part b - Copy to new files.
# IMPORTANT: Use this for the autograder!
!cp NaiveBayes/Unsmoothed/NBmodel.txt Unsmoothed_NBmodel.txt
!cp NaiveBayes/Smoothed/NBmodel.txt Smoothed_NBmodel.txt

__Evaluation__

In [262]:
# part c - write your code in NaiveBayes/evaluation_reducer.py then RUN THIS
!chmod a+x NaiveBayes/evaluation_reducer.py

In [263]:
!chmod a+x NaiveBayes/classify_mapper.py 

In [264]:
# part c - unit test your evaluation job on the chinese model (RUN THIS CELL AS IS)
!cat NaiveBayes/chineseTest.txt | NaiveBayes/classify_mapper.py 
!cat NaiveBayes/chineseTest.txt | NaiveBayes/classify_mapper.py | NaiveBayes/evaluation_reducer.py

d5	1	-8.90668134500626	-8.10769031284611	1
d6	1	-5.780743515794329	-4.179502370564408	1
d7	0	-6.591673732011658	-7.511706880737812	0
d8	0	-4.394449154674438	-5.565796731681498	0
d5	1	-8.90668134500626	-8.10769031284611	 True
d6	1	-5.780743515794329	-4.179502370564408	 True
d7	0	-6.591673732011658	-7.511706880737812	 True
d8	0	-4.394449154674438	-5.565796731681498	 True
# Documents:	4
True Positives:	2
True Negatives:	2
False Positives:	0
False Negatives:	0
Accuracy	1.0000
Precision	1.0000
Recall	1.0000
F-Score	1.0000


In [265]:
# part c - Evaluate the UNSMOOTHED Model Here (FILL IN THE MISSING CODE)

# clear output directory
!hdfs dfs -rm -r {HDFS_DIR}/enron-unsmoothed-eval

# hadoop job
!hadoop jar {JAR_FILE} \
  -files NaiveBayes/classify_mapper.py,NaiveBayes/evaluation_reducer.py,NaiveBayes/Unsmoothed/NBmodel.txt \
  -mapper "classify_mapper.py Unsmoothed/NBmodel.txt" \
  -reducer evaluation_reducer.py \
  -input {HDFS_DIR}/enron_test.txt \
  -output {HDFS_DIR}/enron-unsmoothed-eval \
  -cmdenv PATH={PATH}

# retrieve results locally
!hdfs dfs -cat {HDFS_DIR}/enron-unsmoothed-eval/part-000* > NaiveBayes/Unsmoothed/eval_results.txt
!cat NaiveBayes/Unsmoothed/eval_results.txt

Deleted /user/root/HW2/enron-unsmoothed-eval
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob4052303460531890707.jar tmpDir=null
2025-09-21 07:10:27,767 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:10:28,071 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:10:28,580 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:10:28,580 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:10:28,783 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0051
2025-09-21 07:10:29,155 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 07:10:29,215 INFO mapreduce.JobSubmitter: number of splits:9
2025-09-21 07:10:29,419 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_175842234

In [266]:
# part c - Evaluate the SMOOTHED Model Here (FILL IN THE MISSING CODE)

# clear output directory
!hdfs dfs -rm -r {HDFS_DIR}/enron-smoothed-eval

# hadoop job
!hadoop jar {JAR_FILE} \
  -files NaiveBayes/classify_mapper.py,NaiveBayes/evaluation_reducer.py,NaiveBayes/Smoothed/NBmodel.txt \
  -mapper "classify_mapper.py Smoothed/NBmodel.txt" \
  -reducer evaluation_reducer.py \
  -input {HDFS_DIR}/enron_test.txt \
  -output {HDFS_DIR}/enron-smoothed-eval \
  -cmdenv PATH={PATH}

# retrieve results locally
!hdfs dfs -cat {HDFS_DIR}/enron-smoothed-eval/part-000* > NaiveBayes/Smoothed/eval_results.txt
!cat NaiveBayes/Smoothed/eval_results.txt

Deleted /user/root/HW2/enron-smoothed-eval
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob6827331058676559847.jar tmpDir=null
2025-09-21 07:11:23,451 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:11:23,774 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:11:24,294 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:11:24,294 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:11:24,507 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0052
2025-09-21 07:11:25,302 INFO mapred.FileInputFormat: Total input files to process : 1
2025-09-21 07:11:25,352 INFO mapreduce.JobSubmitter: number of splits:9
2025-09-21 07:11:25,586 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_17584223465

In [267]:
# part c - display results 
# NOTE: feel free to modify the tail commands to match the format of your results file
print('=========== UNSMOOTHED MODEL ============')
!tail -n 9 NaiveBayes/Unsmoothed/eval_results.txt
print('=========== SMOOTHED MODEL ============')
!tail -n 9 NaiveBayes/Smoothed/eval_results.txt

# Documents:	6
True Positives:	2
True Negatives:	1
False Positives:	3
False Negatives:	0
Accuracy	0.5000
Precision	0.4000
Recall	1.0000
F-Score	0.5714
# Documents:	6
True Positives:	2
True Negatives:	4
False Positives:	0
False Negatives:	0
Accuracy	1.0000
Precision	1.0000
Recall	1.0000
F-Score	1.0000


In [284]:
# part c - Copy to new files.
# IMPORTANT: Use this for the autograder!
!cp NaiveBayes/Unsmoothed/eval_results.txt Unsmoothed_results.txt
!cp NaiveBayes/Smoothed/eval_results.txt Smoothed_results.txt

__`EXPECTED RESULTS:`__ 
<table>
<th>Unsmoothed Model</th>
<th>Smoothed Model</th>
<tr>
<td><pre>
# Documents:	20
True Positives:	1
True Negatives:	9
False Positives:	0
False Negatives:	10
Accuracy	0.5
Precision	1.0
Recall	0.0909
F-Score	0.1667
</pre></td>
<td><pre>
# Documents:	20
True Positives:	11
True Negatives:	6
False Positives:	3
False Negatives:	0
Accuracy	0.85
Precision	0.7857
Recall	1.0
F-Score	0.88
</pre></td>
</tr>
</table>

__`NOTE:`__ _Don't be too disappointed if these seem low to you. We've trained and tested on a very very small corpus... bigger datasets coming soon!_

In [268]:
# q9d
### ESSAY
### QUESTION: Compare the performance of your two models. What do you notice about the unsmoothed model's
#             predictions? Can you guess why this is happening? Which evaluation measure do you think is
#             most relevant in our use case?

### ENTER ANSWER IN BETWEEN THE """ """ INSIDE THE PRINT STATEMENT.
print(
"""
Looking at the (expected) results, the unsmoothed model falls apart. Even though it gets a precision of 1.0,
it almost never actually predicts spam correctly – the recall is only 0.09. That means it is too conservative
and only calls something spam in very rare cases. The reason for this is that without Laplace smoothing,
any word that never appeared in training for a certain class gets probability zero. Once you multiply (or add
log-probabilities), that one zero wipes out the whole score for that class. So the model ends up defaulting
to the safer class (ham) almost all the time, which explains why recall is so terrible.

Once we add smoothing, the model starts performing much more reasonably. It no longer collapses whenever
it sees a new or rare word, and we can see that recall shoots up to 1.0 while precision stays at a pretty good
0.79. The F-score also jumps from 0.17 to 0.88, which shows a huge gain in overall balance. Smoothing
fixes the zero-probability problem and lets the classifier generalize better.

For spam filtering specifically, recall is the metric that matters the most, since missing spam (false negatives)
is worse than occasionally flagging a ham email. But precision is still important because if too many good
emails get flagged, users won’t trust the filter. Because of that, the F-score ends up being the most useful
measure here, since it balances both precision and recall and gives a clearer picture of real-world performance.
"""
)


Looking at the (expected) results, the unsmoothed model falls apart. Even though it gets a precision of 1.0,
it almost never actually predicts spam correctly – the recall is only 0.09. That means it is too conservative
and only calls something spam in very rare cases. The reason for this is that without Laplace smoothing,
any word that never appeared in training for a certain class gets probability zero. Once you multiply (or add
log-probabilities), that one zero wipes out the whole score for that class. So the model ends up defaulting
to the safer class (ham) almost all the time, which explains why recall is so terrible.

Once we add smoothing, the model starts performing much more reasonably. It no longer collapses whenever
it sees a new or rare word, and we can see that recall shoots up to 1.0 while precision stays at a pretty good
0.79. The F-score also jumps from 0.17 to 0.88, which shows a huge gain in overall balance. Smoothing
fixes the zero-probability problem and lets the cl

In [269]:
# q9e1
### MULTIPLE ANSWERS
### QUESTION: What is the reason behind the different performances of the two models? (Select 2)

#   a.) There are a lot of words with 0 probability in one or the other class.
#   b.) The Class Prior for the negative class is larger than the Class Prior for the positive class.
#   c.) The number of words in negative class is different from positive class.
#   d.) An unsmoothed Naive Bayes model will always yield non-zero posterior class probabilities.

### ENTER ONLY THE LETTERS INSIDE THE PRINT STATEMENT. (i.e. if your answer is x.), y.), and z.), enter "xyz")
answer = "ac"


#####################
print(answer)

ac


In [270]:
# q9e2
### MULTIPLE CHOICE
### QUESTION: Which evaluation measure do you think is least relevant in our use case?
#             [Hint: Think about the class imbalance between Ham/Spam and the meaning
#             of each measure in relations to spam emails.]

#   a.) Precision
#   b.) Recall
#   c.) F1 Score
#   d.) Accuracy

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "d"


#####################
print(answer)

d


# Question 10: Custom Partitioning and Secondary Sort - EXTRA CREDIT (Optional)

Now that we have our model, we can analyse the results and think about future improvements.

### Q10 Tasks:

* __a.1) code + multiple choice:__ Let's look at the top ten words with the highest conditional probability in `Spam` and in `Ham`. We'll do this by writing a Hadoop job that sorts the model file (`NaiveBayes/Smoothed/NBmodel.txt`). Normally we'd have to run two jobs -- one that sorts on $P(word|ham)$ and another that sorts on $P(word|spam)$. However if we slighly modify the data format in the model file then we can get the top words in each class with just one job. We've written a mapper that will do just this for you. Read through __`NaiveBayes/model_sort_mapper.py`__. How will this mapper allow us to partition and sort our model file? 


* __a.2) code:__ Write a Hadoop job that uses our mapper and `/bin/cat` for a reducer to partition and sort. Print out the top 10 words in each class (where 'top' == highest conditional probability).[`HINT:` _this should remind you a lot of what we did in Question 6._]

* __b)__ Print top words in each class. What do you notice about the 'top words' we printed? [`NOTE:` _you do not need to code anything for this task, but if you are struggling with it you could try changing 'k' and see what happens to the test set. We don't recommend doing this exploration with the Enron data because it will be harder to see the impact with such a big vocabulary_]

* __b.1) multiple choice:__ How does the smoothing parameter 'k' affect the bias and the variance of our model?

* __b.2) True/False:__ : Increasing the smoothing parameter 'k' would mostly affect the probabilities of words that occur much more in one class than another, and would have little effect on words whose probabilities were similar in each class?


In [271]:
# q10a1
### MULTIPLE CHOICE (Extra Credit)
### QUESTION: How will this mapper allow us to partition and sort model file?

#   a.) This mapper output two new fields: name of the class, and conditional probability
#       of the corresponding class. We can use them to partition our file and sort on both
#       classes simultaneously.

#   b.) This mapper output two new fields: name of the class, and payload.
#       We can use them to sort based on the payload.

#   c.) This mapper output two new fields: name of the class, and Class Prior.
#       We can use them to sort based on the Class Prior.

#   d.) This mapper output two new fields: name of the class, and conditional
#       probability of the both classes. We can use them to partition our file and
#       sort on both classes simultaneously.

#   e.) None of the provided responses are correct.

### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [280]:
# part a - write your Hadoop job here (sort smoothed model on P(word|class))

# clear output directory
!hdfs dfs -rm -r -f {HDFS_DIR}/smooth-model-sorted

# Hadoop job
!hadoop jar {JAR_FILE} \
  -D stream.num.map.output.key.fields=4 \
  -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapreduce.partition.keycomparator.options="-k4,4nr" \
  -D mapreduce.partition.keypartitioner.options="-k1,1" \
  -files NaiveBayes/model_sort_mapper.py \
  -mapper model_sort_mapper.py \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input {HDFS_DIR}/smooth-model \
  -output {HDFS_DIR}/smooth-model-sorted \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -cmdenv PATH={PATH}

Deleted /user/root/HW2/smooth-model-sorted
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.2.4.jar] /tmp/streamjob16796280175296395.jar tmpDir=null
2025-09-21 07:42:02,185 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:42:02,464 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:42:02,995 INFO client.RMProxy: Connecting to ResourceManager at w261-m/10.142.0.14:8032
2025-09-21 07:42:02,996 INFO client.AHSProxy: Connecting to Application History server at w261-m/10.142.0.14:10200
2025-09-21 07:42:03,228 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1758422346528_0055
2025-09-21 07:42:03,599 WARN concurrent.ExecutorHelper: Thread (Thread[GetFileInfo #1,5,main]) interrupted: 
java.lang.InterruptedException
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:510)
	at com.google.common.util.concur

In [281]:
# part b - print top words in each class
for idx in range(2):
    print(f"============== PART-0000{idx}===============")
    !hdfs dfs -cat {HDFS_DIR}/smooth-model-sorted/part-0000{idx} | head

the	453,535,0.03916832024846864,0.03977441377263283	spam	0.039774	
the	453,535,0.03916832024846864,0.03977441377263283	ham	0.039168	
of	188,252,0.016305754464670866,0.01877411694864945	spam	0.018774	
of	188,252,0.016305754464670866,0.01877411694864945	ham	0.016306	
for	148,153,0.012854801138814598,0.011427723360047493	ham	0.012855	
for	148,153,0.012854801138814598,0.011427723360047493	spam	0.011428	
enron	116,0,0.010094038478129584,7.420599584446424e-05	ham	0.010094	
i	113,106,0.009835216978690364,0.007940041555357673	ham	0.009835	
i	113,106,0.009835216978690364,0.007940041555357673	spam	0.007940	
or	41,88,0.003623500992149081,0.0066043336301573165	spam	0.006604	
cat: Unable to write to output stream.
ect	378,0,0.03269778276248814,7.420599584446424e-05	ham	0.032698	
and	258,277,0.022344922784919334,0.020629266844761057	ham	0.022345	
and	258,277,0.022344922784919334,0.020629266844761057	spam	0.020629	
a	168,274,0.014580277801742732,0.020406648857227663	spam	0.020407	
your	35,271,0.00310

Expected results:
============== PART-00000===============
the	453,535,0.02811841942276725,0.029726581997670677	ham	0.028118	
ect	378,0,0.023473306082001735,5.546004104043037e-05	ham	0.023473	
to	350,420,0.021739130434782608,0.023348677278021184	ham	0.021739	
and	258,277,0.01604112473677691,0.015417891409239643	ham	0.016041	
hou	203,0,0.0126347082868822,5.546004104043037e-05	ham	0.012635	
of	188,252,0.011705685618729096,0.014031390383228884	ham	0.011706	
a	168,274,0.010466988727858293,0.015251511286118352	ham	0.010467	
in	160,157,0.009971509971509971,0.008762686484387999	ham	0.009972	
for	148,153,0.00922829183698749,0.008540846320226277	ham	0.009228	
on	122,95,0.007617985878855444,0.005324163939881316	ham	0.007618	
cat: Unable to write to output stream.
============== PART-00001===============
the	453,535,0.02811841942276725,0.029726581997670677	spam	0.029727	
to	350,420,0.021739130434782608,0.023348677278021184	spam	0.023349	
and	258,277,0.01604112473677691,0.015417891409239643	spam	0.015418	
a	168,274,0.010466988727858293,0.015251511286118352	spam	0.015252	
your	35,271,0.002229654403567447,0.01508513116299706	spam	0.015085	
of	188,252,0.011705685618729096,0.014031390383228884	spam	0.014031	
you	80,252,0.005016722408026756,0.014031390383228884	spam	0.014031	
in	160,157,0.009971509971509971,0.008762686484387999	spam	0.008763	
for	148,153,0.00922829183698749,0.008540846320226277	spam	0.008541	
it	30,119,0.0019199801808497462,0.0066552049248516446	spam	0.006655	
cat: Unable to write to output stream.

In [282]:
# q10b1
### MULTIPLE CHOICE (Extra Credit)
### QUESTION: How does the smoothing parameter 'k' affect the bias and the variance of our model?

#   a.) Increasing k reduces the variance of our model, and increases the bias of our model.
#   b.) Increasing k reduces the variance of our model, and reduces the bias of our model.
#   c.) Increasing k increases the variance of our model, and reduces the bias of our model.
#   d.) Increasing k increases the variance of our model, and increases the bias of our model.


### ENTER ONLY THE LETTER INSIDE THE ANSWER VARIABLE. (i.e. if your answer is f.), enter "f")
answer = "a"


#####################
print(answer)

a


In [283]:
# q10b2
### TRUE OR FALSE (Extra Credit)
### QUESTION: Increasing the smoothing parameter 'k' would mostly affect the probabilities of
#             words that occur much more in one class than another, and would have little
#             effect on words whose probabilities were similar in each class?

### ENTER "t" for True or "f" for False.
answer = "f"


#####################
print(answer)

f


### Congratulations, you have completed HW2!

## THE FOLLOWING FILES ARE REQUIRED TO BE UPLOADED FOR GRADESCOPE

#### Please make sure that the filenames exactly match:
- HW2.ipynb
- mapper.py
- reducer.py
- chineseResults.txt
- chineseModelUnsmoothed.txt
- chineseModelSmoothed.txt
- Unsmoothed_results.txt
- Smoothed_results.txt
- Unsmoothed_NBmodel.txt
- Smoothed_NBmodel.txt