# Comparing Word Count with ipyparallel, MRJob, and single CPU

## Word Count Example with ipyparallel

### Starting engines
```
!ipcluster nbextension enable # doesn't seem to work anymore
!ipython profile create mycluster --parallel
!ipcluster start --n=4 --profile=mycluster # --daemon=True doesn't seem to work, so run this in a terminal instead
```

In [1]:
import os
import ipyparallel as ipp


# wait 10 seconds before running this cell after starting the cluster
client = ipp.Client()
print(client.ids)
load_balanced_view = client.load_balanced_view()

[0, 1, 2, 3]
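Rather than waiting a fixed 10 seconds, the cell could poll until the expected number of engines has registered. The following is a minimal sketch, not part of the original notebook: `wait_for_engines` and its parameters are hypothetical names, and the only ipyparallel feature it relies on is the `client.ids` attribute used above.

```python
import time


def wait_for_engines(get_ids, expected, timeout=30.0, poll_interval=0.5):
    """Poll get_ids() until it reports at least `expected` engines.

    get_ids is any zero-argument callable returning the current engine ids,
    for example `lambda: client.ids` (hypothetical helper, see lead-in).
    """
    deadline = time.monotonic() + timeout
    while True:
        ids = list(get_ids())
        if len(ids) >= expected:
            return ids
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "only %d of %d engines registered after %.1f s"
                % (len(ids), expected, timeout)
            )
        time.sleep(poll_interval)
```

With the client from the cell above, the call would look like `wait_for_engines(lambda: client.ids, expected=4)`.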


In [2]:
%%time
# pass function into load_balanced view
from glob import glob
from collections import Counter
from tqdm import tqdm

def word_counter(file_name):
    # imports needed on the engine go inside the function body
    from collections import Counter
    counter = Counter()
    with open(file_name) as f:
        for line in f:
            counter.update(line.lower().split())
    return counter

num_pieces = 5
!split --number=l/{num_pieces} encyclopedia_britannica.txt temp_file.

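# `async` is a reserved word from Python 3.7 on, so use a different name
async_result = load_balanced_view.map(word_counter,
    glob("/Users/Eugene/Desktop/Repos/ipyparallel-vs-MRJob/temp_file*"),
    ordered=False, chunksize=1)

# merge the per-file counters as each engine finishes
global_counter1 = Counter()
for engine_result in tqdm(async_result):
    global_counter1.update(engine_result)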
    
!rm temp_file*

100%|███████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.51s/it]


Wall time: 6.76 s


In [3]:
%%time
# decorate with load_balanced_view
@load_balanced_view.parallel(ordered=False, chunksize=1)
def word_counter(file_name):
    # imports needed on the engine go inside the function body
    from collections import Counter
    counter = Counter()
    with open(file_name) as f:
        for line in f:
            counter.update(line.lower().split())
    return counter

num_pieces = 5
!split --number=l/{num_pieces} encyclopedia_britannica.txt temp_file.

global_counter2 = Counter()
    
file_names = glob("/Users/Eugene/Desktop/Repos/ipyparallel-vs-MRJob/temp_file*")
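# a @parallel function is still called through .map;
# `async` is a reserved word from Python 3.7 on, so use a different name
async_result = word_counter.map(file_names)

for engine_result in async_result:
    global_counter2.update(engine_result)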
    
!rm temp_file*

Wall time: 5.87 s


In [4]:
global_counter1 == global_counter2

True
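Both cells agree because the work is split on line boundaries (`split --number=l/N` does not break lines) and `Counter.update` adds counts, so the merge does not depend on the order in which the engines return. A minimal single-process sketch of that split, count, and merge pattern follows; `split_into_chunks` and `sample_lines` are hypothetical stand-ins, and the chunks here are split by line count rather than by file size.

```python
from collections import Counter


def count_words(lines):
    """Count lower-cased, whitespace-separated words in an iterable of lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return counter


def split_into_chunks(lines, num_pieces):
    """Split a list of lines into at most num_pieces runs of whole lines."""
    size = max(1, -(-len(lines) // num_pieces))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# hypothetical sample text standing in for the encyclopedia file
sample_lines = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "The dog barks",
    "the fox runs",
    "A quiet night",
]

chunks = split_into_chunks(sample_lines, 3)

# merge per-chunk counters in reverse order to show that order is irrelevant
merged = Counter()
for chunk in reversed(chunks):
    merged.update(count_words(chunk))

print(merged == count_words(sample_lines))
```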

In [5]:
!ipcluster stop

2017-08-27 18:58:59.753 [IPClusterStop] Removing pid file: C:\Users\Eugene\.ipython\profile_default\pid\ipcluster.pid


## MRJob Version of Word Count

In [6]:
%%writefile mr_word_counter.py
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        for word in line.lower().split():
            yield (word, 1)

    def combiner(self, word, aggregated_counts):
        yield word, sum(aggregated_counts)

    def reducer(self, key, count):
        yield key, sum(count)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Overwriting mr_word_counter.py


In [7]:
%%time
!python mr_word_counter.py < encyclopedia_britannica.txt > temp_encyclopedia_counter_results.txt

# sort by the second field (the count), numerically, in descending order
!cat temp_encyclopedia_counter_results.txt | sort --key 2nr -n | head -20

No configs found; falling back on auto-configuration
Creating temp directory c:\users\eugene\appdata\local\temp\mr_word_counter.Eugene.20170827.235900.359000
Running step 1 of 1...
reading from STDIN
Streaming final output from c:\users\eugene\appdata\local\temp\mr_word_counter.Eugene.20170827.235900.359000\output...
Removing temp directory c:\users\eugene\appdata\local\temp\mr_word_counter.Eugene.20170827.235900.359000...


"the"	693015
"of"	424805
"and"	247850
"in"	214955
"to"	173425
"a"	155675
"is"	102865
"by"	81345
"was"	70005
"as"	58580
"on"	51990
"which"	51645
"it"	51545
"with"	50830
"for"	50715
"at"	48265
"that"	47875
"his"	46175
"from"	46035
"are"	45785


sort: write failed: 'standard output'
sort: write error


Wall time: 49 s


## Comparison of ipyparallel, MRJob, Manual Counter

In [8]:
# ipyparallel version
print(global_counter1.most_common()[:10])

[('the', 693015), ('of', 424805), ('and', 247850), ('in', 214955), ('to', 173425), ('a', 155675), ('is', 102865), ('by', 81345), ('was', 70005), ('as', 58580)]


In [9]:
# MRJob version
from collections import Counter

counter_mrjob = Counter()

with open('temp_encyclopedia_counter_results.txt') as f:
    for line in f:
        word, count = line.strip().split('\t')
        counter_mrjob[word.strip('"')] = int(count)

print(counter_mrjob.most_common()[:10])

!rm temp_encyclopedia_counter_results.txt

[('the', 693015), ('of', 424805), ('and', 247850), ('in', 214955), ('to', 173425), ('a', 155675), ('is', 102865), ('by', 81345), ('was', 70005), ('as', 58580)]


In [10]:
%%time
counter_manual = Counter()
with open('encyclopedia_britannica.txt') as f:
    for line in f:
        counter_manual.update(line.lower().split())

print(counter_manual.most_common()[:10])

[('the', 693015), ('of', 424805), ('and', 247850), ('in', 214955), ('to', 173425), ('a', 155675), ('is', 102865), ('by', 81345), ('was', 70005), ('as', 58580)]
Wall time: 6.86 s


In [11]:
print(global_counter1 == counter_manual) # perfect!
print((counter_manual - counter_mrjob).most_common()[:10]) # close enough!

True
[('"', 9710), ('the"', 2730), ('("', 995), ('\\\\ith', 820), ('\xc2\xb7', 745), ('"the', 650), ('"),', 595), ('\'"', 475), ('and"', 465), ('/', 425)]
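The leftover entries all contain quotes, backslashes, or non-ASCII characters, which suggests the mismatch comes from how the MRJob output was parsed rather than from the counting itself. Assuming the job used MRJob's default JSON output protocol (each line is a JSON-encoded key, a tab, and a JSON-encoded value), decoding the key with `json.loads` instead of `strip('"')` should recover the original words. The sketch below is illustrative only: `parse_mrjob_line` is a hypothetical helper and the sample lines and counts are made up.

```python
import json
from collections import Counter


def parse_mrjob_line(line):
    """Decode one 'json_key<TAB>json_value' output line into (word, count)."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# hypothetical output lines; the second and third keys contain an escaped quote
sample_output = [
    '"the"\t3\n',
    '"the\\""\t2\n',
    '"\\""\t1\n',
]

counter_decoded = Counter()
for line in sample_output:
    word, count = parse_mrjob_line(line)
    counter_decoded[word] = count

print(counter_decoded.most_common())
```

With `strip('"')`, the second sample key would collapse to `the\` and the third to a lone backslash, so neither would match the manually counted word.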


### Conclusion:
For this word count example, ipyparallel is roughly 7 to 8 times faster than MRJob (about 6 to 7 s versus 49 s of wall time). Based on the CPU utilization shown in `htop`, ipyparallel loads the CPUs more heavily than MRJob, so I believe MRJob is not using the full CPU capacity. That would be consistent with MRJob's default inline runner, which runs the job in a single Python process. However, ipyparallel is only slightly faster than the manual word counter (a single CPU process, 6.86 s).  
Perhaps this file is not large enough to merit multiple processors.
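
The ratios behind this conclusion can be checked from the wall times printed above. The snippet is a small worked calculation, not a benchmark: each figure comes from a single run, the ipyparallel cells also include the `split` and `rm` steps, and the MRJob cell includes the `sort` and `head` pipeline. The dictionary labels and the `speedup` helper are hypothetical names.

```python
# wall times in seconds, copied from the cell outputs above
wall_times = {
    "ipyparallel (map)": 6.76,
    "ipyparallel (@parallel)": 5.87,
    "MRJob": 49.0,
    "single process": 6.86,
}


def speedup(baseline, candidate):
    """Return how many times faster `candidate` ran than `baseline`."""
    return wall_times[baseline] / wall_times[candidate]


for name in ("ipyparallel (map)", "ipyparallel (@parallel)"):
    print("%s vs MRJob: %.2fx" % (name, speedup("MRJob", name)))
    print("%s vs single process: %.2fx" % (name, speedup("single process", name)))
```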