
In [29]:
#@title Install Hadoop on Google Colab
!curl -s https://raw.githubusercontent.com/smduarte/spbd-2324/main/lab1/install_hadoop.sh | bash

mv: cannot move 'hadoop-3.3.6/' to '/usr/local/hadoop-3.3.6': Directory not empty


# Python MapReduce Exercise

In this notebook, you will write a MapReduce program that counts the number of occurrences of each word in a text.

In this exercise, Hadoop runs in standalone mode and reads its data from the local filesystem.


### Download the dataset

In [30]:
!wget -q -O os_maias.txt "https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0"

## WordCount Example
Read the words from the input and count the number of occurrences of each word.


### Mapper
Complete the cell below with the code for the mapper.

In [31]:
%%file mapper_words.py
#!/usr/bin/env python

import string
import sys

# translation table that deletes ASCII punctuation and the guillemets « »
table = str.maketrans('', '', string.punctuation + '«»')

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # remove punctuation characters
    line = line.translate(table)
    # split the line into words
    words = line.split()
    for word in words:
        # emit a tab-separated (word, 1) pair
        print('%s\t%s' % (word, 1))


Writing mapper_words.py
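
The mapper's cleaning step can be tried on a single line without Hadoop. The sketch below wraps the same logic in a helper, `map_line`, which is a hypothetical name introduced here for illustration; the sample sentence is made up and not taken from the dataset. Note that `string.punctuation` includes the hyphen, so hyphenated forms are joined into one token.

```python
import string

# Same translation table as the mapper: delete ASCII punctuation and guillemets.
TABLE = str.maketrans('', '', string.punctuation + '«»')

def map_line(line):
    """Return the (word, 1) pairs the mapper would emit for one input line."""
    cleaned = line.strip().translate(TABLE)
    return [(word, 1) for word in cleaned.split()]

# Hypothetical sample line, not taken from os_maias.txt.
pairs = map_line('  «Adeus», disse ele; abraçaram-se.  ')
print(pairs)
# → [('Adeus', 1), ('disse', 1), ('ele', 1), ('abraçaramse', 1)]
```

This is why tokens such as `Abraçaramse` appear in the word-count output: the hyphen in the original text is removed together with the other punctuation.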




### Reducer

In [32]:
%%file reducer_words.py
#!/usr/bin/env python

import sys

current_word = None
word_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the tab-separated (word, count) pair emitted by mapper_words.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

    # if this is the first word or if the word changes
    if current_word != word:
        # if this is not the first word
        if current_word:
            print('%s\t%s' % (current_word, word_count))
        current_word = word
        word_count = count
    else:
        word_count += count

# output the count for the last word
if current_word:
    print('%s\t%s' % (current_word, word_count))


Writing reducer_words.py
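
The reducer above only compares each word with the previous one, so it relies on its input arriving sorted by key. A minimal sketch of that dependency, using `itertools.groupby` in place of the explicit loop (a swapped-in technique, not the notebook's code) and a made-up list of mapper pairs:

```python
from itertools import groupby
from operator import itemgetter

def reduce_sorted(pairs):
    """Sum the counts of consecutive pairs that share the same word."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

# Hypothetical mapper output (illustrative words only).
mapped = [('a', 1), ('casa', 1), ('a', 1), ('Maia', 1), ('casa', 1), ('a', 1)]

# Sorting by key stands in for the sort that happens between map and reduce.
shuffled = sorted(mapped, key=itemgetter(0))
print(reduce_sorted(shuffled))
# → [('Maia', 1), ('a', 3), ('casa', 2)]

# Without the sort, equal words are not adjacent and nothing gets merged.
print(len(reduce_sorted(mapped)))
# → 6
```

Uppercase letters sort before lowercase ones here, which matches the order seen in the word-count output (digits, then capitalised words, then lowercase words).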


### Hadoop standalone mode execution


Hadoop refuses to start a job whose output directory already exists, so remove it before each run.

In [33]:
!rm -rf results_words
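
The same cleanup can be done from Python, which may be convenient when the job is launched from a script rather than a notebook cell. This is a sketch only: `clear_output_dir` is a hypothetical helper name, and the demonstration uses a throwaway temporary directory instead of the real output directory.

```python
import os
import shutil
import tempfile

def clear_output_dir(path):
    """Delete a job output directory if it exists; do nothing otherwise."""
    shutil.rmtree(path, ignore_errors=True)

# Demonstration on a throwaway directory rather than the real results_words.
demo = os.path.join(tempfile.mkdtemp(), 'results_demo')
os.makedirs(demo)
clear_output_dir(demo)
clear_output_dir(demo)  # a second call on a missing directory is harmless
print(os.path.exists(demo))
# → False
```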

#### Submitting the job

The _hadoop_ command is used to submit the MapReduce job. In standalone mode the job runs locally on this machine rather than on a cluster.

In [34]:
!hadoop jar /usr/local/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_words.py,reducer_words.py -mapper mapper_words.py -reducer reducer_words.py -input os_maias.txt -output results_words

2023-09-21 05:40:12,122 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-09-21 05:40:12,453 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-09-21 05:40:12,453 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2023-09-21 05:40:12,506 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2023-09-21 05:40:13,162 INFO mapred.FileInputFormat: Total input files to process : 1
2023-09-21 05:40:13,194 INFO mapreduce.JobSubmitter: number of splits:1
2023-09-21 05:40:13,765 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1242855447_0001
2023-09-21 05:40:13,765 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-21 05:40:14,568 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper_words.py as file:/tmp/hadoop-root/mapred/local/job_local1242855447_0001_12ca37b8-c474-43e1-881d-7e5edd981bef/mapper_words.py
2023-09-21 05:40:14,605 INFO mapred.LocalDistri

#### Checking the results
The result is stored in the results_words directory.

In [35]:
!cat results_words/part-*

0	2
1	1
15	1
1815	1
1830	3
1836	1
1848	1
1858	1
1870	1
1872	1
1875	3
1886	1
1887	1
20	1
26	2
3	1
32	1
3º	1
4	1
46	1
52	1
6	1
64	1
71	1
79	1
93	1
A	472
Abafava	1
Abaixo	1
Abalemos	1
Abandonaste	1
Abandoneia	1
Abandono	1
Abanouse	1
Abecê	1
Abegoaria	1
Abissínia	1
Abracemse	1
Abraão	10
Abraçaramse	1
Abraçou	1
Abraçoua	1
Abraçouo	1
Abria	2
Abril	4
Abrilada	1
Abriu	11
Absoluto	1
Acabou	1
Acabouse	2
Academia	4
Académico	2
Aceitar	1
Aceito	1
Aceitou	1
Acendeu	2
Acendia	1
Aceso	1
Acha	2
Achandose	1
Acharaa	1
Acharam	1
Achas	3
Achava	2
Achavaa	1
Achavao	2
Achavase	1
Acheia	1
Acheime	1
Acho	4
Achote	1
Achou	2
Achoua	1
Achoulhe	1
Achouo	1
Acompanhada	1
Acordame	1
Acordaria	1
Acordou	3
Acreditas	1
Acredite	4
Acrópole	1
Acudam	1
Addisson	1
Adeus	7
Adiante	7
Admirável	1
Adormeci	1
Adosinda	6
Adquirese	1
Adélia	13
Afastado	1
Aferrolhou	1
Afigiame	1
Afirmoumo	1
Afonso	320
Africana	1
Agarrara	1
Agarraralhe	1
Agarrou	2
Agitandose	1
Agora	57
Agostinho	1
Agosto	1
Agradavalhe	1
Agradeceu	1
Agradecido	1
Agr

## Sorting
The results are ordered by word, not by frequency. Let's sort them by frequency, with the most frequent words first.

### Mapper
Complete the cell below with the code for the mapper.

In [62]:
%%file mapper_sort.py
#!/usr/bin/env python

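import sys

# upper bound on the word counts; must be larger than every count in the input
maximum = 100000

for line in sys.stdin:
    # each input line holds a word and its count, as written by the word-count job
    word, count = line.split()
    # invert the count so that more frequent words get smaller keys
    count = maximum - int(count)
    # emit the inverted count as the key, so the sort phase orders by it
    print('%s\t%s' % (count, word))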

Overwriting mapper_sort.py
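
The subtraction in the sort mapper does two things, which the sketch below illustrates under the assumption that keys are compared as text (the usual default for Hadoop Streaming): it reverses the order, and it makes every key the same width so that text order agrees with numeric order. The counts used are a small illustrative sample.

```python
counts = [8311, 472, 57, 9, 1]

# Sorting the raw counts as text misorders them: '472' comes before '57'.
as_text = sorted(str(c) for c in counts)
print(as_text)
# → ['1', '472', '57', '8311', '9']

# Same bound as in mapper_sort.py.
maximum = 100000

# Inverted keys all have five digits, so text order equals numeric order.
keys = sorted(str(maximum - c) for c in counts)
print(keys)
# → ['91689', '99528', '99943', '99991', '99999']

# Undoing the inversion recovers the counts, highest first.
print([maximum - int(k) for k in keys])
# → [8311, 472, 57, 9, 1]
```

With `maximum = 100000` the keys stay five digits wide only for counts between 1 and 90000; a larger count would produce a shorter key and break the ordering, so the bound should be chosen with the dataset in mind.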


### Reducer

In [63]:
%%file reducer_sort.py
#!/usr/bin/env python

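import sys

# must match the value used in mapper_sort.py
maximum = 100000

for line in sys.stdin:
    # each input line holds an inverted count and a word, as emitted by mapper_sort.py
    count, word = line.split()
    # undo the inversion to recover the original count
    count = maximum - int(count)
    print('%s\t%s' % (word, count))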


Overwriting reducer_sort.py


### Hadoop standalone mode execution


Hadoop refuses to start a job whose output directory already exists, so remove it before each run.

In [64]:
!rm -rf results_sort

#### Submitting the job

The _hadoop_ command is used to submit the MapReduce job. In standalone mode the job runs locally on this machine rather than on a cluster.

Note that the output of the previous MapReduce job (the word counts) is the input of the sorting job.

In [65]:
!hadoop jar /usr/local/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_sort.py,reducer_sort.py -mapper mapper_sort.py -reducer reducer_sort.py -input results_words/part-* -output results_sort

2023-09-21 06:04:47,085 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-09-21 06:04:47,273 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-09-21 06:04:47,273 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2023-09-21 06:04:47,300 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2023-09-21 06:04:47,571 INFO mapred.FileInputFormat: Total input files to process : 1
2023-09-21 06:04:47,600 INFO mapreduce.JobSubmitter: number of splits:1
2023-09-21 06:04:47,910 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local768109200_0001
2023-09-21 06:04:47,910 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-21 06:04:48,386 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper_sort.py as file:/tmp/hadoop-root/mapred/local/job_local768109200_0001_872bcb01-5c1c-4379-b281-cb11d0059fa7/mapper_sort.py
2023-09-21 06:04:48,405 INFO mapred.LocalDistribute

#### Checking the results
The result is stored in the results_sort directory.

In [60]:
!cat results_sort/part-*

de	8311
a	6736
o	6615
que	4986
e	4533
um	3026
com	2794
do	2571
da	2202
uma	2170
Carlos	1797
os	1763
para	1737
E	1697
não	1661
em	1510
no	1441
se	1436
as	1401
ao	1394
na	1251
Ega	1125
como	1045
por	961
ele	945
é	942
O	930
à	891
seu	882
mais	774
sua	765
era	723
Mas	680
lhe	671
dos	620
ela	581
já	564
muito	558
Não	508
das	474
A	472
sobre	467
lá	452
Maria	439
num	408
sem	402
Dâmaso	396
tinha	385
eu	382
ainda	381
numa	364
onde	364
tão	355
É	354
seus	353
Sr	351
disse	342
olhos	335
estava	331
tudo	328
entre	328
quando	328
dum	324
nos	324
grande	324
Afonso	320
também	319
bem	318
casa	314
Vilaça	309
agora	308
logo	303
só	302
depois	301
todo	295
Que	294
foi	283
pela	283
me	281
mas	280
pelo	273
ali	273
sempre	269
Maia	269
mão	269
então	269
toda	268
Depois	261
outro	259
ou	255
Então	254
mesmo	252
ar	248
Era	247
havia	245
dois	244
homem	240
Craft	239
fora	233
assim	227
ser	226
duas	224
lado	223
até	222
aos	221
ás	221
noite	220
duma	214
meu	214
senhora	214
há	211
nem	210
outra	209
essa	208
coisa	206
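
To cross-check a ranking like the one above without Hadoop, the word-count lines can be sorted directly in Python. This is a sketch, not part of the exercise: `rank_by_frequency` is a hypothetical helper, and the sample lines are copied from the word-count output shown earlier. Words with equal counts may appear in a different order than in the Hadoop output.

```python
def rank_by_frequency(lines):
    """Parse 'word<TAB>count' lines and return (word, count) pairs, highest count first."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split('\t', 1)
        pairs.append((word, int(count)))
    # sorted() is stable, so words with equal counts keep their input order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# A few lines copied from the word-count output.
sample = ['A\t472', 'Abril\t4', 'Afonso\t320', 'Agora\t57']
for word, count in rank_by_frequency(sample):
    print('%s\t%s' % (word, count))
```

On the real data, the lines would come from the files matched by `results_words/part-*`.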