<a href="https://colab.research.google.com/github/Vasiliki655/DSC511-Introduction/blob/main/Lab04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab04: Map - Reduce
Traditionally, Hadoop uses Java as its primary programming language. However, for hands-on experience in writing MapReduce programs, we will use the <strong>MRJob Python library</strong>. To execute the code in Google Colab, we will <strong>export our MapReduce code into a Python file</strong> and then invoke it, providing the desired file as input. Here is an example of how to use MRJob:

```python
%%file python_file_to_be_exported.py

from mrjob.job import MRJob

# Define a class which inherits from MRJob
class MRJobExtended(MRJob):

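    # The mapper is called once per input record (by default, one line of text)
    def mapper(self, key, value):
        # Do something with the record,
        # then yield zero or more (key, value) pairs
        yield (key, value)

    # The reducer is called once per key, with an iterator over that key's values
    def reducer(self, key, values):
        # Combine the values (placeholder: collect them into a list),
        # then yield the final (key, result) pair
        final_value = list(values)
        yield (key, final_value)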
```

<br><strong>Note:</strong> Keep in mind that the exported file is not automatically saved in your Google Drive; instead, it is stored on temporary disk space which will be reclaimed shortly after your session ends. For permanent data storage, we will explore how to mount files from Google Drive in another tutorial.
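
To make the data flow in the skeleton above concrete, here is a minimal pure-Python sketch of the three phases (map, shuffle/sort, reduce). It does not use MRJob: the helper name `simulate_mapreduce`, the two example functions, and the toy input are assumptions made for illustration only, and a real runner distributes these steps across processes or machines.

```python
from itertools import groupby
from operator import itemgetter


def simulate_mapreduce(records, mapper, reducer):
    """Run map, shuffle/sort, and reduce in a single process (illustration only)."""
    # Map phase: each input record may produce zero or more (key, value) pairs
    mapped = [pair for record in records for pair in mapper(None, record)]

    # Shuffle/sort phase: bring all pairs with the same key next to each other
    mapped.sort(key=itemgetter(0))

    # Reduce phase: call the reducer once per key with an iterator over its values
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results


def word_mapper(_, line):
    # Emit (word, 1) for every whitespace-separated token
    for word in line.split():
        yield (word, 1)


def sum_reducer(key, values):
    # Add up all values that share the same key
    yield (key, sum(values))


print(simulate_mapreduce(["a b a", "b a"], word_mapper, sum_reducer))
# [('a', 3), ('b', 2)]
```

The important point is that the reducer never sees a single value: it receives every value that the mappers emitted for one key.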

## Setup
First, install the mrjob package:

In [1]:
! pip install mrjob

Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl.metadata (7.3 kB)
Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4
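
After a session restart it can be useful to confirm that the package is still available before importing it. The following is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption for illustration. In the session logged above it would be expected to report `0.7.4`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if mrjob is installed, otherwise None
print(installed_version("mrjob"))
```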


## Task 0: WordCount
The aim of this task is to find the frequency of each word in a set of (one or more) input files.<br>
<strong>Note:</strong> You need to download [wordcount.txt](https://www.cs.ucy.ac.cy/~jgeorg02/dsc511/hadoop/wordcount.txt) (open the link, right-click, and choose "Save as...") and upload it as shown in slide 7.

In [2]:
%%file wordcount.py

from mrjob.job import MRJob

# In this lab the job runs locally, inside the Colab session;
# on a real cluster the mappers and reducers run as separate processes.

# This file is saved in the local (temporary) disk space of Google Colab.

# Define a class named MRWordCount which inherits from MRJob
class MRWordCount(MRJob):   # inherits the methods that MRJob provides

    # Define a mapper method within the class
    def mapper(self, _, line):  # self is the instance (Python convention); _ is the unused input key
        # Split the line into words and iterate over each word
        for word in line.split():
            # Yield (return) key-value pairs, where the key is the word and the value is 1
            yield (word, 1)  # the reducer receives all the 1s emitted for a given word

    # Define a reducer method within the class
    def reducer(self, word, counts):
        # Sum the counts emitted for this word
        result = 0
        for value in counts:
            result += value
        yield (word, result)

Writing wordcount.py
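
Because the mapper uses `line.split()`, tokens keep their punctuation and capitalization, so `this,` and `this`, or `The` and `the`, are counted as different words. If that is not desired, one possible normalization (an assumption, not part of the lab's required solution) is to lower-case the line and extract words with a regular expression. The helper name `tokenize` and the pattern below are illustrative choices.

```python
import re

# Runs of lowercase letters/digits, optionally joined by single hyphens or
# apostrophes, e.g. "real-world" or "don't"
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


def tokenize(line):
    """Return the lower-cased words of a line with surrounding punctuation removed."""
    return WORD_PATTERN.findall(line.lower())


print(tokenize("The course, this course: real-world e-commerce!"))
# ['the', 'course', 'this', 'course', 'real-world', 'e-commerce']
```

Inside the mapper, `for word in tokenize(line)` would then replace `for word in line.split()`.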


In [3]:
def run_mr_job(mr_job):
    # Create a runner for the MapReduce job
    with mr_job.make_runner() as runner:
        # Run the MapReduce job
        runner.run()

        # Iterate over the output of the MapReduce job
        for key, value in mr_job.parse_output(runner.cat_output()):
            # Print each key-value pair (word, count)
            print(key, value)

In [5]:
import wordcount

# Uncomment the following two lines to reload the Python file after editing it; alternatively, restart the session
# import importlib
# importlib.reload(wordcount)

# Create an instance of the MRWordCount class, specifying the input file 'wordcount.txt'.
# The pattern is module.ClassName: the exported file (wordcount.py) and the job class inside it.
mr_job = wordcount.MRWordCount(args=['wordcount.txt'])

# Call run_mr_job to run the job and print the output
run_mr_job(mr_job)



cloud 1
clustering 1
combination 1
community 1
computing, 1
course 3
coverage 1
current 1
data 1
datasets 1
dedicated 1
detection, 1
dimensionality 1
e-commerce, 1
etc. 1
explored. 1
fields, 1
for 1
foundational 1
from 1
good 1
graph 1
in 2
information 1
infrastructure 1
Apache 1
For 1
Hadoop 1
Spark. 1
Specifically, 1
The 1
This 1
Together, 1
a 4
ad 1
algorithms, 1
analytics 1
and 8
applications 2
at 1
auctions. 1
balance 1
based 1
basic 1
be 1
between 1
but 1
by 1
inspired 1
internet 1
is 1
laboratory 1
large-scale 1
look 1
main 1
material 1
materials 1
media, 1
mining 1
models, 1
networking, 1
networks, 1
of 3
on 1
online 1
practice 1
provide 1
real 1
real-world 1
recommender 1
reduction, 1
related 1
relatively 1
search/retrieval/topic 1
seeks 1
services. 1
sessions, 1
several 1
social 3
statistics, 1
stream 1
students 1
systems, 1
the 3
theoretical 1
theory 1
these 1
this 1
this, 1
uses 1
weekly 1
where 1
will 3
with 2
work 1
world 1
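
The pairs above are printed in the order the runner returns them, which is not a single alphabetical order (for example, `cloud` appears before `Apache`). If a ranked view is wanted, the pairs can be collected and sorted in plain Python after the job finishes. The sketch below uses a few pairs copied from the output above; the helper name `rank_by_count` is hypothetical.

```python
def rank_by_count(pairs):
    """Sort (word, count) pairs by descending count, then alphabetically."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# A handful of (word, count) pairs taken from the WordCount output
sample = [("cloud", 1), ("course", 3), ("a", 4), ("and", 8), ("will", 3)]

for word, count in rank_by_count(sample):
    print(word, count)
# and 8
# a 4
# course 3
# will 3
# cloud 1
```

With the `run_mr_job` helper, the list passed to the function could be built with `list(mr_job.parse_output(runner.cat_output()))` instead of printing each pair directly.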


## Task 1: Palindrome Words
Find the frequency of palindrome words. <br>
A palindrome is a word that reads the same backwards as forwards, such as madam or racecar. <br>
Use the file [palindrome_words.txt](https://www.cs.ucy.ac.cy/~jgeorg02/dsc511/hadoop/palindrome_words.txt) as input. <br>
Example of output: `word frequency`, where frequency is the number of times that word appeared in the file.
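
The palindrome test itself is a comparison of a string with its reverse. A minimal sketch follows, where the helper name `is_palindrome` is an illustrative assumption and the comparison is case-sensitive:

```python
def is_palindrome(word):
    """Return True if a non-empty word reads the same forwards and backwards."""
    # word[::-1] is the word reversed via slicing with a step of -1
    return bool(word) and word == word[::-1]


print(is_palindrome("racecar"))  # True
print(is_palindrome("madam"))    # True
print(is_palindrome("hadoop"))   # False
```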

In [1]:
%%file palindrome.py

from mrjob.job import MRJob

class MRPalindrome(MRJob):

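    def mapper(self, _, word):
        # The mapper receives one line at a time; this assumes one word per line
        word = word.strip()
        # Emit (word, 1) only if the word is non-empty and equals its reverse
        if word and word == word[::-1]:
            yield (word, 1)

    def reducer(self, word, counts):
        # Sum the occurrences of each palindrome
        yield (word, sum(counts))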

Overwriting palindrome.py


In [4]:
import palindrome

mr_job = palindrome.MRPalindrome(args=['palindrome_words.txt'])
run_mr_job(mr_job)



level 2
noon 2
racecar 2
radar 2
civic 1
deed 2
kayak 6
rotor 3
stats 1
tenet 5
