<a href="https://colab.research.google.com/github/d-vinha/SPBD/blob/main/lab1/SPBD_Labs_mapreduce1_exercise_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Install Hadoop on Google Colab
!curl -s https://raw.githubusercontent.com/smduarte/spbd-2324/main/lab1/install_hadoop.sh | bash

# Python MapReduce Exercise

In this notebook, you will write a MapReduce program that counts the number of occurrences of each word in a text.

In this exercise, Hadoop runs in standalone mode and reads data from the local filesystem.


### Download the dataset

In [None]:
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

## WordCount Example
Read the words from input and count the number of occurrences of each word.
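
As a rough mental model, the whole job can be sketched in plain Python as three phases: map each line to `(word, 1)` pairs, group the pairs by key (the step the framework performs between the mapper and the reducer), and sum the counts for each key. The function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are illustrative only and are not part of Hadoop; the sketch also skips punctuation handling.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Hypothetical in-memory stand-in for the whole MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["a rosa e a casa", "a casa"]))
```

In the real job, the map and reduce phases become two separate scripts that communicate through tab-separated lines on standard input and output.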


### Mapper
Complete with the code for the mapper.
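
The punctuation-stripping step can be tried in isolation before wiring it into the mapper. The sketch below deletes punctuation with `str.translate` and then splits on whitespace. The helper name `tokenize` is made up for this illustration, and adding the guillemets `«»` to `string.punctuation` is an assumption about the quoting style used in the text.

```python
import string

# Translation table that deletes ASCII punctuation plus the guillemets « ».
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation + '«»')


def tokenize(line):
    """Strip punctuation from a line and split it into words."""
    return line.strip().translate(PUNCTUATION_TABLE).split()


if __name__ == "__main__":
    print(tokenize('«Olá», disse ele.\n'))
```

Note that the hyphen is part of `string.punctuation`, so a hyphenated form such as `disse-lhe` becomes the single token `disselhe`, and that words are not lowercased, so capitalised and lowercase forms are counted separately.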

In [None]:
%%file mapper_words.py
#!/usr/bin/env python

# sys provides STDIN; string provides the set of ASCII punctuation characters
import sys
import string

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # remove punctuation characters
    line = line.translate(str.maketrans('', '', string.punctuation+'«»'))
    # split the line into words
    words = line.split()
    # emit each word as the key with a count of 1, separated by a tab
    for w in words:
        print('%s\t1' % w)

### Reducer
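
Hadoop Streaming sorts the mapper output by key before passing it to the reducer, so all lines for a given word arrive consecutively and the reducer only has to track the current word and a running total. The same idea can be sketched with `itertools.groupby`; the function name `sum_sorted_counts` is hypothetical, and the sketch assumes every input line has the form `word<TAB>count`.

```python
from itertools import groupby


def sum_sorted_counts(lines):
    """Yield (word, total) pairs from key-sorted, tab-separated lines."""
    # Parse each line into a [word, count] pair.
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    # groupby only merges adjacent equal keys, hence the sorted-input requirement.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    for word, total in sum_sorted_counts(['a\t1\n', 'a\t1\n', 'casa\t1\n']):
        print('%s\t%d' % (word, total))
```

If the input were not sorted, the same word would be reported several times with partial totals, which is also true of the hand-written loop in the next cell.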

In [None]:
%%file reducer_words.py
#!/usr/bin/env python

import sys

# the word currently being counted and its running total
lastWord = None
lastCounter = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper_words.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

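    # input arrives sorted by key, so a new word means the previous one is complete
    if word != lastWord:
        if lastWord:
            print('%s\t%d' % (lastWord, lastCounter))
        lastWord = word
        lastCounter = count
    else:
        lastCounter += count

# emit the total for the last word
if lastWord:
    print('%s\t%d' % (lastWord, lastCounter))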

### Hadoop standalone mode execution


Hadoop refuses to start a job if the output directory already exists, so it needs to be removed before each run.

In [None]:
!rm -rf results_words

#### Submitting the job

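The _hadoop_ command is used to submit the MapReduce job. In standalone mode there is no cluster: the job runs locally as a single Java process. The `-files` option makes the two scripts available to the job, `-mapper` and `-reducer` name the scripts to run, and `-input` and `-output` give the input file and the output directory.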

In [None]:
!hadoop jar /usr/local/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_words.py,reducer_words.py -mapper mapper_words.py -reducer reducer_words.py -input os_maias.txt -output results_words

#### Checking the results
The result is stored in the results_words directory.
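
For further analysis inside the notebook, the part files can also be loaded into a Python dictionary. This is a minimal sketch: the helper name `load_counts` is hypothetical, and it assumes each line holds a word and a count separated by a tab, as the reducer above prints them.

```python
import glob
import os


def load_counts(output_dir):
    """Read every part-* file in output_dir into a {word: count} dict."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, 'part-*'))):
        with open(path, encoding='utf-8') as part_file:
            for line in part_file:
                word, count = line.rstrip('\n').split('\t', 1)
                counts[word] = int(count)
    return counts


if __name__ == "__main__":
    counts = load_counts('results_words')
    print(len(counts), 'distinct words')
```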

In [None]:
!cat results_words/part-*

## Sorting
The results are not sorted by frequency. Let's sort them so that the most frequent words come first.
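
Hadoop Streaming compares keys as text by default, in ascending order, so emitting the raw frequency as the key would neither sort numerically nor put the largest counts first. One common workaround, sketched below, is to emit `MAX_FREQ - freq` zero-padded to a fixed width, so that text order matches descending frequency. The names `MAX_FREQ` and `sort_key` and the width of 4 are assumptions for illustration; the scheme only works while no count exceeds `MAX_FREQ`.

```python
MAX_FREQ = 10000  # assumed upper bound on any single word's count


def sort_key(freq):
    """Map a frequency to a fixed-width string that sorts in reverse order."""
    return '%04d' % (MAX_FREQ - freq)


if __name__ == "__main__":
    freqs = [7, 120, 3500]
    # Plain text keys sort character by character, not numerically.
    print(sorted(str(f) for f in freqs))
    # Complemented, zero-padded keys put the highest frequency first.
    print(sorted(freqs, key=sort_key))
```

The reducer can recover the original frequency by applying the same subtraction again, since `MAX_FREQ - (MAX_FREQ - freq)` equals `freq`.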

### Mapper
Complete with the code for the mapper.

In [None]:
%%file mapper_sort.py
#!/usr/bin/env python

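import sys

# upper bound used to invert the frequency; keys are zero-padded to 4 digits,
# so this assumes no word occurs more than 10000 times
MAX_FREQ = 10000

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into word and frequency
    word, freq = line.split('\t', 1)
    # output the inverted frequency as the key and the word as the value,
    # so that increasing key order corresponds to decreasing frequency
    print('%04d\t%s' % (MAX_FREQ - int(freq), word))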

### Reducer

In [None]:
%%file reducer_sort.py
#!/usr/bin/env python

import sys

# must match the bound used in mapper_sort.py
MAX_FREQ = 10000

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into inverted frequency and word
    freq, word = line.split('\t', 1)
    # the trick is that the reducer receives the keys sorted in increasing order,
    # so the lines already arrive by decreasing frequency; recover it and print
    print('%s\t%d' % (word, MAX_FREQ - int(freq)))

### Hadoop standalone mode execution


The output directory of the sorting job must not already exist either, so it needs to be removed first.

In [None]:
!rm -rf results_sort

#### Submitting the job

The _hadoop_ command is used to submit the sorting job, with the same options as before.

Note that the output of the previous MapReduce step is the input for the sorting step.

In [None]:
!hadoop jar /usr/local/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_sort.py,reducer_sort.py -mapper mapper_sort.py -reducer reducer_sort.py -input results_words/part-* -output results_sort

#### Checking the results
The result is stored in the results_sort directory; the first lines hold the most frequent words.
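
As a quick sanity check on the sorted output, the same ranking can be computed directly from the word-count lines in plain Python. The sketch below uses `heapq.nlargest`; the function name `top_words` is hypothetical, and the input is assumed to be `word<TAB>count` lines such as those in results_words.

```python
import heapq


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = []
    for line in lines:
        word, count = line.rstrip('\n').split('\t', 1)
        pairs.append((word, int(count)))
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    sample = ['a\t3\n', 'casa\t2\n', 'rosa\t1\n']
    print(top_words(sample, n=2))
```

Words with equal counts may appear in a different order than in the Hadoop output, because the sorting job only orders lines by the frequency key.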

In [None]:
!head -10 results_sort/part-*