
# Python MapReduce Exercises

## 1. MrJob MapReduce Word Frequency

Using the [MrJob](https://mrjob.readthedocs.io/en/latest/) library, create a MapReduce program that counts the number of occurrences of each word and **sorts** the words by frequency (most frequent words first).

Check the MrJob documentation to see how multi-step MapReduce jobs can
be implemented in the same Python class.
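
As a rough illustration of how a second step can sort the output of the first, the sketch below simulates a two-step pipeline in plain Python, without MrJob. All helper names (`run_step`, `count_mapper`, `sort_mapper`, and so on) are hypothetical; in MrJob the analogous chaining is done by overriding `steps()` to return a list of `MRStep` objects. Sending every pair to a single key is only one simple way to obtain a global order, and it assumes the whole vocabulary fits in one reducer's memory.

```python
from collections import defaultdict

def run_step(records, mapper, reducer):
    """Simulate one map-shuffle-reduce step over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key, values in groups.items():
        output.extend(reducer(out_key, iter(values)))
    return output

# Step 1: count the occurrences of each word
def count_mapper(_, line):
    for word in line.lower().split():
        yield word, 1

def count_reducer(word, counts):
    yield word, sum(counts)

# Step 2: send every (count, word) pair to one key so a single reducer can sort
def sort_mapper(word, count):
    yield None, (count, word)

def sort_reducer(_, count_word_pairs):
    # highest count first; ties are broken alphabetically
    for count, word in sorted(count_word_pairs, key=lambda p: (-p[0], p[1])):
        yield word, count

lines = [(None, "a b a"), (None, "c a b")]
step1 = run_step(lines, count_mapper, count_reducer)
step2 = run_step(step1, sort_mapper, sort_reducer)
print(step2)
```

The output of step 1 becomes the input of step 2, which is the same contract MrJob applies between consecutive steps.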

Note that you will need a notebook cell to install MrJob before you run your solution, as shown in this week's example notebook.

In [5]:
#@title Download the dataset and install MrJob
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0
!pip install mrjob --quiet
!wget -q -O /etc/mrjob.conf https://raw.githubusercontent.com/smduarte/spbd-2324/main/lab2/mrjob.conf


### MrJob Word Frequency 1st Mapper-Reducer
Read the words from input and count each ***unique*** word.

The processing is split into:

+ The mapper emits, for each word in a line, the tuple (word, 1)
+ The reducer sums the values produced by the mapper stage for each unique word and emits a tuple with the word and its count

Using MrJob, a MapReduce job can be expressed in a single Python class,
with methods for each of the phases. The reducer phase is called separately for each key, with the collection of values to be reduced.
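
To make the "once per key" behaviour concrete, here is a minimal plain-Python sketch of the map, shuffle, and reduce phases. It does not use MrJob; the function names and the sample lines are illustrative assumptions. The point is that the reducer receives a key together with an iterator over all values emitted for that key.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit (word, 1) for every word in the line
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # counts is an iterator over every value emitted for this word
    yield word, sum(counts)

lines = ["to be or not", "to be"]

# Map phase: apply the mapper to every input line
mapped = [pair for line in lines for pair in mapper(line)]

# Shuffle phase: sort by key so that equal keys become adjacent
mapped.sort(key=itemgetter(0))

# Reduce phase: one reducer call per distinct key
result = []
for word, pairs in groupby(mapped, key=itemgetter(0)):
    counts = (count for _, count in pairs)
    result.extend(reducer(word, counts))

print(result)
```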



In [11]:
%%file eachwordcount.py

from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import string


WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # remove leading and trailing whitespace
        line = line.strip()
        # remove punctuation characters, including the guillemets used in the text
        line = line.translate(str.maketrans('', '', string.punctuation + '«»'))
        # extract the words; findall needs a string, so the line must not be split into a list first
        for word in WORD_RE.findall(line):
            # yield each lower-cased word with a value of 1
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the counts for each word we've seen so far
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()

Writing eachwordcount.py


#### Local Execution of the First MapReduce MrJob Job


In [12]:
import eachwordcount

# prepare the mapreduce job for local execution
mr_job = eachwordcount.MRWordFreqCount(args=['-r', 'local','os_maias.txt'])

# execute the job and print the output results
with mr_job.make_runner() as runner:
    runner.run()
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)

AttributeError: ignored

## 2. Weblog Analysis

Consider a set of log files captured during a DDoS (*Distributed Denial of Service*) attack, containing information on the web accesses made to the server during the attack.

Create a new notebook that processes the log of web entries using MrJob and map-reduce to:

1. Count the number of unique IP addresses involved in the DDOS attack.

2. For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

3. Create an inverted index that, for each interval of 10 seconds, has a list of (unique) IPs executing accesses (to each URL).
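
For the per-interval statistics in item 2, one detail is worth keeping in mind: a reducer typically receives its values as a one-shot iterator, so calling `len`, `max`, and `min` on it one after the other would not work. The sketch below computes all four figures in a single pass. The function name `interval_stats` is hypothetical, and the execution times are made-up sample values rather than data from the log.

```python
def interval_stats(exec_times):
    """Return [count, average, maximum, minimum] in a single pass."""
    count = 0
    total = 0.0
    maximum = float("-inf")
    minimum = float("inf")
    for t in exec_times:
        count += 1
        total += t
        maximum = max(maximum, t)
        minimum = min(minimum, t)
    # a reducer is only called for keys with at least one value,
    # so count is assumed to be non-zero here
    return [count, total / count, maximum, minimum]

# iter() mimics the one-shot iterator a reducer receives
stats = interval_stats(iter([0.026, 0.010, 0.050]))
print(stats)
```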


The log files contain text lines as shown below, with TAB as the separator:

date |IP_source | status_code | operation | URL | execution time |
-|-|-|-|-|-
timestamp  | string | int | string | string| float |
2016-12-06T08:58:35.318+0000|37.139.9.11|404|GET|/codemove/TTCENCUFMH3C|0.026
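
A minimal sketch of how one such line could be split into typed fields is shown below. The `LogEntry` type and the `parse_log_line` function are hypothetical names; the sample string is the row from the table above with TAB characters as separators, as the file is described.

```python
from typing import NamedTuple

class LogEntry(NamedTuple):
    timestamp: str
    ip: str
    status: int
    operation: str
    url: str
    exec_time: float

def parse_log_line(line):
    # the six fields are separated by TAB characters
    ts, ip, status, op, url, exec_time = line.rstrip("\n").split("\t")
    return LogEntry(ts, ip, int(status), op, url, float(exec_time))

sample = "2016-12-06T08:58:35.318+0000\t37.139.9.11\t404\tGET\t/codemove/TTCENCUFMH3C\t0.026"
entry = parse_log_line(sample)
print(entry)
```

Lines that do not contain exactly six fields raise a `ValueError` here; whether to skip or report them is left open.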

<br>
The log can be downloaded from:

[https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0](https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0)

Suggestion: to start, make a copy of an existing notebook and modify it.

If you really must, you can use [dateutil.parser](https://dateutil.readthedocs.io/en/stable/parser.html) to decode timestamps.
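
If an extra dependency is not wanted, the standard library can likely handle this timestamp layout as well. The sketch below assumes every timestamp follows the format of the sample line; `TS_FORMAT` and `ten_second_interval` are hypothetical names. It parses the timestamp and rounds it down to the start of its 10-second interval, which could then serve as a grouping key.

```python
from datetime import datetime

# matches timestamps such as 2016-12-06T08:58:35.318+0000
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

def ten_second_interval(timestamp):
    """Return the ISO-formatted start of the 10-second interval containing timestamp."""
    moment = datetime.strptime(timestamp, TS_FORMAT)
    # round the seconds down to a multiple of 10 and drop the fraction
    start = moment.replace(second=moment.second - moment.second % 10, microsecond=0)
    return start.isoformat()

key = ten_second_interval("2016-12-06T08:58:35.318+0000")
print(key)
```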