## MRJob Exercise
This exercise explores the concept of MapReduce with MRJob, running basic word-parsing jobs over large text corpora from Project Gutenberg.

`%%time` cell magics are included, but timings will vary with hardware and operating system configuration.

In [1]:
# !pip install MRJob

The following cell uses the `%%file` cell magic, which writes the contents of the cell out to a file. MRJob needs this because each job is run as a standalone Python script.

In [2]:
%%file word_count.py
from mrjob.job import MRJob
import re

class MRWordFrequencyCount(MRJob):

  ### input: self, in_key, in_value
  def mapper(self, _, line):
    yield "chars", len(line)           # how many characters in a line, regardless of alphanumeric
    yield "words", len(line.split())   # split line into words based on spaces, to be mapped, return length of list
    yield "lines", 1                   # returns number of lines analyzed on pass, each line one pass (value of 1)

  ### input: self, in_key from mapper, in_value from mapper
  def reducer(self, key, values):
    yield key, sum(values)             # for each key ("chars", "words", "lines") return sum of values for each key

if __name__ == "__main__":
    MRWordFrequencyCount.run()

Overwriting word_count.py
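
To make the data flow concrete, the sketch below imitates in plain Python what happens when `word_count.py` runs: every line is mapped, the emitted values are grouped by key, and each group is reduced. The helper names `map_line`, `shuffle`, and `reduce_key` and the two sample lines are illustrative assumptions, not part of MRJob.

```python
from collections import defaultdict

def map_line(line):
    """Mimic MRWordFrequencyCount.mapper for one input line."""
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_key(key, values):
    """Mimic MRWordFrequencyCount.reducer for one key."""
    return key, sum(values)

sample_lines = ["to be or not to be", "that is the question"]
mapped = [pair for line in sample_lines for pair in map_line(line)]
totals = dict(reduce_key(k, v) for k, v in shuffle(mapped).items())
print(totals)  # {'chars': 38, 'words': 10, 'lines': 2}
```

Each input line produces three pairs, so the reducer for each key only has to add up one number per line.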


Run `word_count.py` on a short text file as an example:

In [3]:
%%time
!python word_count.py ../assets/gutenberg/short.t1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033835.919762
Running step 1 of 1...
job output is in /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033835.919762/output
Streaming final output from /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033835.919762/output...
"words"	1822
"lines"	200
"chars"	10653
Removing temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033835.919762...
CPU times: user 4.79 ms, sys: 5.63 ms, total: 10.4 ms
Wall time: 349 ms


Run `word_count.py` on an extremely long text file containing the works of William Shakespeare:

In [4]:
%%time
!python word_count.py ../assets/gutenberg/t8.shakespeare.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033836.265327
Running step 1 of 1...
job output is in /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033836.265327/output
Streaming final output from /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033836.265327/output...
"words"	901325
"lines"	124456
"chars"	5333743
Removing temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033836.265327...
CPU times: user 20.4 ms, sys: 10.4 ms, total: 30.8 ms
Wall time: 2.11 s


The "No configs" messages mean that no MRJob configuration file was found, so the job ran with the default settings on the inline runner. MRJob creates a temporary directory for its work and calculations, streams the final output (for Shakespeare, <b>5,333,743 characters, 901,325 words, and 124,456 lines</b>), and then removes the temporary directory.
__________________________
Next, a slightly more complicated example extracts the most used word and introduces the concept of stop words.

In [5]:
%%file most_used_word.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+") # runs of word characters or apostrophes, used to extract words from lines below


class MRMostUsedWord(MRJob):
    STOPWORDS = {'i', 'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   #combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            if word.lower() not in self.STOPWORDS:
                yield (word.lower(), 1)

# INCLUSION/EXCLUSION OF COMBINER CODE DISCUSSED BELOW
#     def combiner_count_words(self, word, counts):
#         # optimization: sum the words we've seen so far
#         yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is used so we can easily use Python's max() function.
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding one results in key=counts, value=word
        yield max(word_count_pairs)



if __name__ == '__main__':
    import time
    start = time.time()
    MRMostUsedWord.run()
    end = time.time()
    print(end - start)

Overwriting most_used_word.py


The `yield` in `mapper_get_words()` emits each word, lowercased, with a count of 1, provided the word is not in the `STOPWORDS` set.

The `yield` in `combiner_count_words()` (commented out above) emits the sum of the counts seen so far for each word. Because every occurrence contributes a 1, this sum is a partial count for the word.

The `yield` in `reducer_count_words()` emits the total for each word as a `(num_occurrences, word)` pair under the single key `None`, so that every pair reaches the same reducer in the next step and the pairs can be compared by count.

The `yield` in `reducer_find_max_word()` emits the pair with the highest number of occurrences. It runs as a separate second step, so it picks the top result <i>after</i> all counts are final.
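
The two steps can be traced by hand on a tiny input. The following is a minimal sketch in plain Python, not MRJob itself: `group_by_key` is a hypothetical stand-in for the grouping the framework performs between steps, and the three sample lines and the four-word stop list are made up for illustration.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")
STOPWORDS = {"the", "and", "a", "to"}  # tiny stand-in for the full set

def group_by_key(pairs):
    """Collect the values emitted for each key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

lines = ["The day and the night", "A day to remember", "Night and day"]

# Step 1 mapper: one (word, 1) pair per non-stopword
mapped = [(w.lower(), 1) for line in lines
          for w in WORD_RE.findall(line) if w.lower() not in STOPWORDS]

# Step 1 reducer: sum per word, re-keyed under None as (count, word)
step1 = [(None, (sum(counts), word))
         for word, counts in group_by_key(mapped).items()]

# Step 2 reducer: all pairs share the key None, so max() sees every word
best_count, best_word = max(group_by_key(step1)[None])
print(best_count, best_word)  # 3 day
```

Putting the count first in the tuple is what lets `max()` compare pairs by count before falling back to the word.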


Now we run the file against the short .txt file again.

In [6]:
%%time
!python most_used_word.py ../assets/gutenberg/short.t1.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.386294
Running step 1 of 2...
Running step 2 of 2...
job output is in /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.386294/output
Streaming final output from /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.386294/output...
11	"day"
Removing temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.386294...
0.1792588233947754
CPU times: user 5.83 ms, sys: 5.45 ms, total: 11.3 ms
Wall time: 410 ms


"day" is the most used word, with a count of 11.

Now let's run it on the Shakespeare file again:

In [7]:
%%time
!python most_used_word.py ../assets/gutenberg/t8.shakespeare.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.801353
Running step 1 of 2...
Running step 2 of 2...
job output is in /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.801353/output
Streaming final output from /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.801353/output...
5479	"thou"
Removing temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/most_used_word.Allie.20220101.033838.801353...
3.152860164642334
CPU times: user 27 ms, sys: 13.1 ms, total: 40.1 ms
Wall time: 3.38 s


The most common word is "thou" with 5479 instances.

The run time for the script <i>with</i> the combiner was almost a second and a half longer than without it, and both versions produced the same result. This suggests that, on this local run, the extra combining step adds overhead rather than saving time. A combiner pays off by shrinking the data passed from mappers to reducers; here, summing everything once in the reducer appears to be faster than combining at an intermediate point.
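
What a combiner changes is the number of records handed to the reducer, not the totals. The sketch below illustrates this in plain Python under stated assumptions: the two lists stand in for the output of two hypothetical mapper tasks, and `combine` and `reduce_all` are illustrative helpers rather than MRJob functions.

```python
from collections import Counter

# Hypothetical output of two mapper tasks, one (word, 1) pair per word
mapper_outputs = [
    [("thou", 1), ("art", 1), ("thou", 1), ("thou", 1)],
    [("thou", 1), ("love", 1), ("love", 1)],
]

def combine(pairs):
    """Sum counts per word within a single mapper's output."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reduce_all(pairs):
    """Sum counts per word across everything that reaches the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

without = [p for out in mapper_outputs for p in out]
with_combiner = [p for out in mapper_outputs for p in combine(out)]

print(len(without), len(with_combiner))                  # 7 4
print(reduce_all(without) == reduce_all(with_combiner))  # True
```

On a cluster the smaller record count means less data moved over the network; on a single machine there is little transfer cost to save, which is consistent with the timing observed above.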

Now we will write a script that allows us to find the 10 words with the most syllables from the `t5.churchill.txt` file.

The output will be somewhat messy, because several words may tie on syllable count and the cut-off at 10 can fall in the middle of a tie.
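
One way to keep ties deterministic is to sort on the pair (syllable count, word), so that words with equal counts are ordered by the word itself. A minimal sketch, assuming a hand-written sample of `(word, syllables)` pairs and a hypothetical helper `top_n`:

```python
def top_n(pairs, n):
    """Return the n (word, syllables) pairs with the most syllables.

    Ties are broken by word, in reverse alphabetical order, so the
    result does not depend on the order in which pairs arrive.
    """
    return sorted(pairs, key=lambda p: (p[1], p[0]), reverse=True)[:n]

# Illustrative sample; the syllable counts are assumed, not computed
sample = [("war", 1), ("imagination", 5), ("history", 3),
          ("university", 5), ("opportunity", 5)]

print(top_n(sample, 2))  # [('university', 5), ('opportunity', 5)]
```

Here three words tie at five syllables but only two fit, so "imagination" is dropped purely because of the tie-breaking rule.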

In [8]:
# !pip install syllapy
# MRStep is part of mrjob (mrjob.step); no separate install is needed

In [9]:
%%file word_syllables.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import itertools
import syllapy
import re
# WORD_RE = re.compile(r"[\w']+") # alternative: runs of word characters or apostrophes
WORDS_ONLY_RE = re.compile(r"[a-zA-Z]+") # runs of ASCII letters, used to extract words from lines below
class MRWordSyllables(MRJob):
    
    STOPWORDS = {'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and',
                 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below',
                 'between', 'both', 'but', 'by', 'can', 'did', 'do', 'does', 'doing', 'don',
                 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'has', 'have',
                 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his',
                 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'itself', 'just', 'me',
                 'more','most', 'my', 'myself', 'no', 'nor', 'not', 'now', 'of', 'off', 'on',
                 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own',
                 's', 'same', 'she', 'should', 'so', 'some', 'such', 't', 'th', 'than', 'that', 'the',
                 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they',
                 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was',
                 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why',
                 'will', 'with', 'you', 'your', 'yours', 'yourself', 'yourselves'}
    
    word_seen = set()
    
    def steps(self):
        
        return [MRStep(mapper = self.mapper_get_words,
                       combiner = self.combiner_swap_tuple,
                       reducer = self.reducer_order_set)]
    
    def mapper_get_words(self, _, line):
        # lower and extract words from each line of data
        for word in WORDS_ONLY_RE.findall(line.lower()):
            # if the word is not a stopword and has not already been processed
            # update word_seen and yield word and syllable count
            if word not in self.STOPWORDS and word not in self.word_seen:
                self.word_seen.add(word)
                yield (word, syllapy.count(word))
                
    def combiner_swap_tuple(self, word, syllable_counts):
        # re-key every pair under None so that a single reducer sees all words;
        # emit one (word, count) tuple per value instead of the raw iterator,
        # which not every JSON encoder can serialize
        # (note: the reducer relies on this re-keying having run)
        for syllable_count in syllable_counts:
            yield (None, (word, syllable_count))
                
    def reducer_order_set(self, _, syllable_count_pairs):
        # sort the pairs by syllable count, then word, in descending order,
        # keeping the top 10
        sorted_list = sorted(syllable_count_pairs,
                             key=lambda pair: (pair[1], pair[0]),
                             reverse=True)[:10]
        
        # yield each of the top 10 as (word, syllable count)
        for (word, syllables) in sorted_list:
            yield (word, syllables)
            
if __name__ == '__main__':
    import time
    start = time.time()
    MRWordSyllables.run()
    end = time.time()
    print(end - start)

Overwriting word_syllables.py
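
The two regular expressions used in this notebook split text differently, which is why fragments such as `'s'`, `'t'`, and `'don'` appear in the stop word sets. The short comparison below uses a made-up sample line:

```python
import re

WORD_RE = re.compile(r"[\w']+")           # letters, digits, underscores, apostrophes
WORDS_ONLY_RE = re.compile(r"[a-zA-Z]+")  # ASCII letters only

line = "Don't stop: the 42nd day's work"

print(WORD_RE.findall(line))
# ["Don't", 'stop', 'the', '42nd', "day's", 'work']

print(WORDS_ONLY_RE.findall(line.lower()))
# ['don', 't', 'stop', 'the', 'nd', 'day', 's', 'work']
```

The letters-only pattern breaks contractions and possessives at the apostrophe and drops digits, so the leftover pieces have to be filtered out as stop words.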


To give some reference, we will look at how large the `t5.churchill.txt` file is:

In [10]:
%%time
!python word_count.py ../assets/gutenberg/t5.churchill.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033842.197317
Running step 1 of 1...
job output is in /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033842.197317/output
Streaming final output from /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033842.197317/output...
"words"	1671473
"lines"	189685
"chars"	9160853
Removing temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_count.Allie.20220101.033842.197317...
CPU times: user 27.3 ms, sys: 13.2 ms, total: 40.4 ms
Wall time: 3.08 s


To compare the Churchill output with the Shakespeare output:

||Shakespeare|Churchill|
|----|:---:|:---:|
|characters|5,333,743|9,160,853|
|words|901,325|1,671,473|
|lines|124,456|189,685|

We can see that the Churchill text is materially longer than the Shakespeare text, and may take longer to process. (Noting, of course, that these are not necessarily "long" texts in relation to some corpora that one may need to work with.)
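
To put a number on "materially longer", the ratios can be computed directly from the counts in the table. This is simple arithmetic on the figures reported above; the dictionary names are illustrative.

```python
shakespeare = {"characters": 5_333_743, "words": 901_325, "lines": 124_456}
churchill = {"characters": 9_160_853, "words": 1_671_473, "lines": 189_685}

# Ratio of Churchill to Shakespeare for each measure
ratios = {name: churchill[name] / shakespeare[name] for name in shakespeare}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

By every measure the Churchill text is roughly 1.5 to 1.9 times the size of the Shakespeare text.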

With that said, let's look at the 10 words with the most syllables:

In [11]:
%%time
!python word_syllables.py ../assets/gutenberg/t5.churchill.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_syllables.Allie.20220101.033845.460481
Running step 1 of 1...
job output is in /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_syllables.Allie.20220101.033845.460481/output
Streaming final output from /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_syllables.Allie.20220101.033845.460481/output...
"overcapitalization"	8
"incommunicability"	8
"unenforceability"	7
"overcapitalized"	7
"materialistically"	7
"invulnerability"	7
"interrogatively"	7
"infinitesimally"	7
"indissolubility"	7
"indispensability"	7
Removing temp directory /var/folders/7_/nkpwttfs0l398qx_653rtr2w0000gn/T/word_syllables.Allie.20220101.033845.460481...
1.5923857688903809
CPU times: user 19.7 ms, sys: 10.2 ms, total: 30 ms
Wall time: 2.01 s


______________________
<div style="text-align: right"><sub>Exercise adapted and modified from UMSI homework assignment for SIADS 516.</sub></div>