**MapReduce**
The MapReduce programming technique was designed to analyze massive data sets across a cluster.
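
To make the technique concrete, here is a minimal single-machine sketch of the three phases (map, shuffle, reduce) in plain Python. It only illustrates the idea and is not how Hadoop or mrjob are implemented. The function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample list are hypothetical, chosen for this illustration.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (song, 1) pair for every non-empty input line."""
    for line in lines:
        song = line.strip()
        if song:
            yield (song, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sample input, not the songs.txt file used later in this notebook
plays = ["Deep Dreams", "Data House Rock", "Deep Dreams"]
counts = reduce_phase(shuffle_phase(map_phase(plays)))
print(counts)  # {'Deep Dreams': 2, 'Data House Rock': 1}
```

On a real cluster the map and reduce steps run in parallel on many machines, and the shuffle step moves each key's pairs to the machine responsible for reducing that key.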

The biggest difference between Hadoop MapReduce and Spark is that Spark tries to keep as many intermediate results as possible in memory, which avoids repeatedly writing data to disk and reading it back between steps. Hadoop MapReduce writes intermediate results out to disk, which can be less efficient. Hadoop is an older technology than Spark and one of the cornerstone big data technologies.

The code below counts the number of times each song was played.

In [1]:
# Install the mrjob library, a package for running MapReduce jobs with Python
# In Jupyter notebooks, "!" runs terminal commands from inside the notebook

! pip install mrjob

Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Collecting PyYAML>=3.10
  Downloading PyYAML-5.3.1-cp38-cp38-win_amd64.whl (219 kB)
Installing collected packages: PyYAML, mrjob
Successfully installed PyYAML-5.3.1 mrjob-0.7.4


In [3]:
%%file wordcount.py
# %%file is an IPython cell magic that saves the code cell as a file

from mrjob.job import MRJob # import the mrjob library

class MRSongCount(MRJob):
    
    # the map step: mrjob passes each line of the input file as a (key, value) pair
    # each line of the txt file holds only a value (the song name) and no key,
    # so the key argument is ignored and named _
    def mapper(self, _, song):
        # output each line as a tuple of (song_name, 1)
        yield (song, 1)

    # the reduce step: combine all tuples with the same key
    # in this case, the key is the song name
    # then sum all the values of the tuple, which will give the total song plays
    def reducer(self, key, values):
        yield (key, sum(values))
        
if __name__ == "__main__":
    MRSongCount.run()

Overwriting wordcount.py


In [1]:
# run the code as a terminal command
! python wordcount.py songs.txt

"Broken Networks"	510
"Data House Rock"	828
"Deep Dreams"	1131


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\Dell\AppData\Local\Temp\wordcount.Dell.20201224.171644.527842
Running step 1 of 1...
job output is in C:\Users\Dell\AppData\Local\Temp\wordcount.Dell.20201224.171644.527842\output
Streaming final output from C:\Users\Dell\AppData\Local\Temp\wordcount.Dell.20201224.171644.527842\output...
Removing temp directory C:\Users\Dell\AppData\Local\Temp\wordcount.Dell.20201224.171644.527842...
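
As a sanity check on small inputs, the same per-song totals can be reproduced without mrjob by using `collections.Counter`. This is a sketch under stated assumptions: the input file holds one song name per line (the layout implied by the mapper above), blank lines are skipped, and the helper name `count_plays` and the temporary sample file are hypothetical. It runs on a single machine, so it does not replace the MapReduce job for data that does not fit on one machine.

```python
import os
import tempfile
from collections import Counter

def count_plays(path):
    """Count how often each non-blank line (song name) occurs in a text file."""
    with open(path, encoding="utf-8") as f:
        return Counter(line.rstrip("\n") for line in f if line.strip())

# Demo on a small temporary file so the sketch runs without songs.txt
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_songs.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Deep Dreams\nBroken Networks\nDeep Dreams\n")
    demo_counts = count_plays(sample)

# Print tab-separated pairs, sorted by song name
for song, n in sorted(demo_counts.items()):
    print(f"{song}\t{n}")
```

Calling `count_plays("songs.txt")` on the file used above should give the same totals as the mrjob run, provided the file follows the assumed one-song-per-line layout.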
