# **MapReduce**

# Python MRJob

mrjob lets you write MapReduce jobs in Python 2.7/3.4+ and run them on several platforms.

* Write multi-step MapReduce jobs in pure Python
* Test on your local machine
* Run on a Hadoop cluster

To get started, install 'mrjob' with pip:

In [1]:
!pip install mrjob

Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl.metadata (7.3 kB)
Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4


## Example file - Deli by Ice Spice (CLEAN) lyrics

In [18]:
with open('Deli_IceSpice.txt') as file:
    lyrics = file.readlines() # Load the lines into a list

for line in lyrics[:10]:
    print(line.replace("\n", "")) # Print the first 10 lines

Grrah (Grrah, grrah, grrah)
Grrah (Grrah, grrah, Grrah)
She a baddie, she showin' her hair (Stop playin' with 'em, RIOT)

She a baddie, she showin' her hair (She showin' her hair)
She shake it like jelly (She shake it like jelly)
Hundred bands in Chanely (Hundred bands in Chanely)
But I'm still shakin' cake in a deli (But I'm still shakin' cake, grrah, grrah)
With my friend gettin' deady (With my friend gettin' deady)
He like him all friendly (He like him all friendly)


## **Use Case 1**: Frequency Count

In [19]:
%%file IceSpiceCount.py
# The %%file magic above writes the rest of this cell to IceSpiceCount.py

from mrjob.job import MRJob

class MRLinesWordsChars(MRJob): # Creating a class that inherits "MRJob"

    def mapper(self, _, line): # Defining our own mapper function (Overriding)
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values): # Defining our own reducer function (Overriding)
        yield key, sum(values)

if __name__ == '__main__':
    MRLinesWordsChars.run()

Overwriting IceSpiceCount.py


In [20]:
! python IceSpiceCount.py IceSpiceLyricsClean.txt -q

"words"	498
"chars"	2688
"lines"	65

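To see why the reducer receives one list of values per key, here is a minimal plain-Python sketch of the map, shuffle, and reduce phases for the job above. It does not use mrjob; `run_local` is a hypothetical helper written for illustration, and the two sample lines are shortened lyrics rather than the full file.

```python
from collections import defaultdict

def mapper(line):
    # Same three pairs that MRLinesWordsChars.mapper yields
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

def reducer(key, values):
    # Same logic as MRLinesWordsChars.reducer
    yield key, sum(values)

def run_local(lines):
    # Shuffle: group every mapped value under its key
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Reduce: call the reducer once per key
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results

sample = ["She shake it like jelly", "Hundred bands in Chanely"]
totals = run_local(sample)
print(totals)
```

Each input line contributes exactly one value to each of the three keys, so the reducer for `"lines"` simply adds up a list of ones.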

## **Use Case 2**: Word Count

In [21]:
%%file IceSpiceWordCount.py

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+") # Breaking sentences into words

class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        # Intermediate step: pre-sum the counts seen so far to reduce the data sent to the reducer
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting IceSpiceWordCount.py

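The effect of the combiner can be illustrated with a small plain-Python sketch. It assumes two hypothetical map tasks (`split_a` and `split_b`) and helper functions invented for this example; in a real Hadoop run the framework decides whether and how often a combiner is called, so treat this as an illustration rather than a description of mrjob internals.

```python
import re
from collections import Counter
from itertools import chain

WORD_RE = re.compile(r"[\w']+")

def map_words(line):
    # One (word, 1) pair per word, as in the mapper above
    return [(word.lower(), 1) for word in WORD_RE.findall(line)]

def combine(pairs):
    # Pre-sum the counts produced by a single map task
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reduce_counts(pairs):
    # Final sum per word across all map tasks
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two hypothetical map tasks, each handling one input split
split_a = ["She shake it like jelly", "She shake it like jelly"]
split_b = ["Hundred bands in Chanely"]

mapped_a = list(chain.from_iterable(map_words(line) for line in split_a))
mapped_b = list(chain.from_iterable(map_words(line) for line in split_b))

without_combiner = mapped_a + mapped_b
with_combiner = combine(mapped_a) + combine(mapped_b)

# Fewer pairs reach the reducer, but the totals are identical
print(len(without_combiner), len(with_combiner))
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))
```

Because summing is associative, pre-summing inside each map task does not change the final counts; it only shrinks the data that has to be shuffled.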

In [22]:
! python IceSpiceWordCount.py IceSpiceLyricsClean.txt -q

"ice"	2
"icy"	1
"if"	1
"in"	11
"inch"	2
"it"	6
"jacket"	1
"jelly"	6
"juice"	2
"just"	1
"know"	4
"like"	17
"lil'"	1
"loopy"	2
"loose"	1
"lose"	1
"lot"	2
"lov\u0435"	1
"love"	1
"lucy"	1
"me"	5
"middle"	2
"miss"	1
"money"	2
"moon"	2
"move"	1
"much"	1
"my"	12
"name"	1
"not"	1
"off"	2
"on"	3
"out"	1
"outside"	1
"pack"	1
"parted"	2
"partner"	4
"pass"	1
"passenger"	1
"perform"	1
"personality"	1
"petty"	4
"phone"	1
"pics"	1
"pj"	1
"playin'"	1
"pot"	1
"princess"	1
"pucci"	1
"put"	1
"react"	1
"real"	2
"regular"	1
"riot"	1
"sad"	1
"shake"	6
"shakin'"	7
"she"	17
"shoutout"	1
"showin'"	8
"singin'"	1
"so"	2
"song"	1
"spot"	1
"started"	1
"stay"	1
"steppin'"	1
"still"	7
"stop"	1
"straight"	2
"takin'"	1
"that"	4
"the"	10
"then"	2
"they"	5
"three"	1
"to"	10
"too"	1
"while"	1
"white"	1
"with"	9
"you"	3
"your"	1
"'em"	1
"'ku"	1
"'oot"	1
"'ooters"	2
"a"	11
"actin'"	1
"ain't"	1
"all"	9
"along"	1
"always"	2
"and"	8
"artist"	1
"ay"	1
"ayo"	1
"back"	2
"baddest"	1
"baddie"	4
"baddies"	3
"bag"	1
"baggin'"	4
"ban

## **Exercise 1**: How many times does Ice Spice say "Grrah" in the song 'Deli'?

Write an MRJob script that counts how many times Ice Spice says 'grrah' in the song Deli.

Do it efficiently: which stage is the best place to filter in this case?

In [None]:
# Your Solution Here

### Solution: (DON'T PEEK BEFORE YOU TRY!)

In [30]:
%%file GrrahCount.py

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRGrrahFrequencyCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            if word.lower() == "grrah":  # Filter for "grrah" in the mapper
                yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRGrrahFrequencyCount.run()

Overwriting GrrahCount.py


In [31]:
! python GrrahCount.py IceSpiceLyricsClean.txt -q

"grrah"	34

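Why filter in the mapper? The following plain-Python sketch compares how many intermediate pairs are produced when the filter runs early (in the mapper) versus late (after every word has been emitted). The two sample lines and the function names are assumptions made for this illustration; it does not run mrjob.

```python
import re

WORD_RE = re.compile(r"[\w']+")

lines = [
    "Grrah (Grrah, grrah, grrah)",
    "She shake it like jelly (She shake it like jelly)",
]

def map_filtered(line):
    # Filter in the mapper: only "grrah" is ever emitted
    for word in WORD_RE.findall(line):
        if word.lower() == "grrah":
            yield "grrah", 1

def map_unfiltered(line):
    # No filter: every word is emitted and would have to be shuffled
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

filtered_pairs = [pair for line in lines for pair in map_filtered(line)]
all_pairs = [pair for line in lines for pair in map_unfiltered(line)]

# Both approaches give the same count; the late filter just moves more data
early_count = sum(count for _, count in filtered_pairs)
late_count = sum(count for word, count in all_pairs if word == "grrah")

print(len(filtered_pairs), len(all_pairs), early_count, late_count)
```

Dropping unwanted words as early as possible keeps the shuffle small, which is usually the expensive part of a MapReduce job on a cluster.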

## **Use Case 3**: Most Used Word

In [32]:
%%file mr_most_used_word.py

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")


class MRMostUsedWord(MRJob):

    def steps(self):
        # Override steps() to define a multi-step job
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words), # Up until here we had a 'regular' MRJob
            MRStep(reducer=self.reducer_find_max_word) # We add another step - finding the maximum
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer.
        # num_occurrences is so we can easily use Python's max() function.
        yield None, (sum(counts), word)

    # discard the key; it is just None
    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word),
        # so yielding the largest one results in key=count, value=word
        yield max(word_count_pairs)


if __name__ == '__main__':
    MRMostUsedWord.run()

Overwriting mr_most_used_word.py


In [33]:
! python mr_most_used_word.py IceSpiceLyricsClean.txt -q

34	"grrah"

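The second step relies on how Python compares sequences: element by element, left to right. A short sketch, using a handful of counts taken from the word-count output earlier in this notebook, shows why putting the count first makes `max()` pick the most frequent word. The `counts` dictionary is a hand-built sample, not the full job output.

```python
# A few (word -> count) results from the word-count job
counts = {"grrah": 34, "like": 17, "she": 17, "my": 12}

# The first-step reducer emits (None, (count, word)), so all pairs share one key
pairs = [(count, word) for word, count in counts.items()]

# Tuples compare element by element, so max() ranks by count first
# and falls back to the word itself to break ties
top_count, top_word = max(pairs)
print(top_count, top_word)

# Tie example: with equal counts, the alphabetically later word wins
tie_winner = max([(17, "like"), (17, "she")])
print(tie_winner)
```

If the pairs were ordered `(word, count)` instead, `max()` would return the alphabetically last word rather than the most frequent one. Inside mrjob the pairs typically arrive as lists after JSON serialization, and lists compare the same way.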

### Exercise 2 will be your lab assignment :)