## 2. Weblog Analysis

Consider a set of log files captured during a DDOS (*Distributed Denial of Service*) attack. They record the web accesses made to the server while the attack was in progress.

Create a new notebook that processes the log of web entries using MrJob and map-reduce to:

1. Count the number of unique IP addresses involved in the DDOS attack.

2. For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

3. Create an inverted index that, for each interval of 10 seconds, has a list of (unique) IPs executing accesses (to each URL).


The log files contain text lines as shown below, with SPACE as the separator:

date |IP_source | status_code | operation | URL | execution time |
-|-|-|-|-|-
timestamp  | string | int | string | string| float |
2016-12-06T08:58:35.318+0000|37.139.9.11|404|GET|/codemove/TTCENCUFMH3C|0.026

<br>
The log can be downloaded from:

[https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0](https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0)

Suggestion: to start, make a copy of an existing notebook and modify it.

If you really must, you can use [dateutil.parser](https://dateutil.readthedocs.io/en/stable/parser.html) for decoding timestamps.
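
As a point of comparison, the timestamp in the sample line can also be decoded with the standard library alone. The sketch below assumes every timestamp follows the layout of the sample (`2016-12-06T08:58:35.318+0000`); the helper name `ten_second_bucket` is hypothetical and not part of the exercise.

```python
from datetime import datetime

# Assumed layout, taken from the sample line: 2016-12-06T08:58:35.318+0000
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

def ten_second_bucket(timestamp: str) -> datetime:
    """Return the start of the 10-second interval containing `timestamp` (hypothetical helper)."""
    dt = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return dt.replace(second=dt.second - dt.second % 10, microsecond=0)

# The sample record, written with SPACE as the separator.
line = "2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026"
date, ip, status, operation, url, exec_time = line.split()

print(ten_second_bucket(date).isoformat())  # 2016-12-06T08:58:30+00:00
print(date[:18])                            # 2016-12-06T08:58:3
```

Because the timestamps are fixed-width, the first 18 characters of the string already identify the 10-second interval, so parsing is optional when only a grouping key is needed.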

In [1]:
#@title Download the input file
!pip install mrjob --quiet
!wget -q -O weblog.txt "https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0"


## Pure Python Solution

In [2]:
#@title Exercise 2a)
unique_ips = set()

with open("weblog.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:   # make sure the line has at least date + IP
            ip = parts[1]
            unique_ips.add(ip)

print("Number of unique IPs:", len(unique_ips))


Number of unique IPs: 167


## MrJob MapReduce Solutions

In [3]:
#@title Exercise 2a)
%%file weblog_stats_2a.py

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRWebLogStats2a(MRJob):

    def mapper1(self, _, line):
        parts = line.split()
        if len(parts) >= 2:   # make sure the line has at least date + IP
            ip = parts[1]
            yield ip, None

    def combiner1(self, ip, _):
        # Collapse repeated occurrences of the same IP before the shuffle.
        yield ip, None

    def reducer1(self, ip, _):
        # Each distinct IP reaches exactly one reducer call: emit a single 1.
        yield None, 1

    def reducer2(self, _, ones):
        # All ones share the key None, so their sum is the number of distinct IPs.
        yield "Unique Ips", sum(ones)

    def steps(self):
        return [
            MRStep(mapper=self.mapper1, combiner=self.combiner1, reducer=self.reducer1),
            MRStep(reducer=self.reducer2),
        ]

if __name__ == '__main__':
    MRWebLogStats2a.run()

Writing weblog_stats_2a.py
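
The two-step structure of this job can be traced by hand. The sketch below is a plain-Python simulation of the data flow, not a description of how MrJob executes a job; the sample records and the `shuffle` helper are made up for illustration.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, mimicking the shuffle phase (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Made-up records: (timestamp, ip)
records = [("t1", "10.0.0.1"), ("t2", "10.0.0.2"), ("t3", "10.0.0.1")]

# Step 1: map each record to (ip, None); after grouping, each distinct IP
# is one key, and the reducer emits a single (None, 1) for it.
step1_groups = shuffle((ip, None) for _, ip in records)
step1_output = [(None, 1) for _ip in step1_groups]

# Step 2: every pair has the key None, so one reducer call sums all the ones.
step2_groups = shuffle(step1_output)
unique_ips = sum(step2_groups[None])

print("Unique IPs:", unique_ips)  # Unique IPs: 2
```

The second step is needed because the first reducer only sees one IP at a time; funnelling everything to a single key is what makes a global count possible.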


In [4]:
!rm -rf results
!python3 -m weblog_stats_2a -r local --output-dir results --cleanup NONE weblog.txt
!cat results/* | head -10

No configs found; falling back on auto-configuration
No configs specified for local runner
Creating temp directory /tmp/weblog_stats_2a.root.20251014.084926.117949
Running step 1 of 2...
Running step 2 of 2...
job output is in results
"Unique Ips"	167


In [5]:
#@title Exercise 2b) version 1
%%file weblog_stats_2b_v1.py

from mrjob.job import MRJob
from statistics import mean

class MRWebLogStats2b_V1(MRJob):

    def mapper(self, _, line):
        parts = line.split()
        if len(parts) == 6:   # make sure the line is complete
            timestamp = parts[0]
            execution_time = float(parts[5])

            # The first 18 characters end at the tens-of-seconds digit,
            # so they identify the 10-second interval.
            time_interval_10s = timestamp[0:18]
            yield time_interval_10s, execution_time

    def reducer(self, interval, execution_times):
        # Materializes every execution time of the interval in memory.
        values = list(execution_times)
        yield interval, "count: {}, min: {}, max: {}, avg: {}".format(len(values), min(values), max(values), mean(values))

if __name__ == '__main__':
    MRWebLogStats2b_V1.run()

Writing weblog_stats_2b_v1.py


In [6]:
!rm -rf results
!python3 -m weblog_stats_2b_v1 -r local --output-dir results --cleanup NONE weblog.txt
!cat results/* | head -10

No configs found; falling back on auto-configuration
No configs specified for local runner
Creating temp directory /tmp/weblog_stats_2b_v1.root.20251014.084948.005824
Running step 1 of 1...
job output is in results
"2016-12-06T08:58:3"	"count: 483, min: 0.013, max: 46.849, avg: 7.593424430641822"
"2016-12-06T08:58:4"	"count: 2611, min: 0.014, max: 69.654, avg: 30.15984565300651"
"2016-12-06T08:58:5"	"count: 5500, min: 0.017, max: 80.846, avg: 38.52511163636364"
"2016-12-06T08:59:0"	"count: 6914, min: 0.018, max: 81.659, avg: 38.534382123228234"
"2016-12-06T08:59:1"	"count: 6271, min: 0.017, max: 83.993, avg: 32.96384978472333"
"2016-12-06T08:59:2"	"count: 5434, min: 0.051, max: 77.967, avg: 17.29333143172617"
"2016-12-06T08:59:3"	"count: 8015, min: 0.056, max: 67.441, avg: 11.21015221459763"
"2016-12-06T08:59:4"	"count: 7947, min: 0.914, max: 65.706, avg: 7.7618157795394485"
"2016-12-06T08:59:5"	"count: 5983, min: 0.678, max: 54.29, avg: 3.8216643824168477"
"2016-12-06T09:00:0"	"count:

In [7]:
#@title Exercise 2b) version 2
%%file weblog_stats_2b_v2.py

from mrjob.job import MRJob

class MRWebLogStats2b_V2(MRJob):

    def mapper(self, _, line):
        parts = line.split()
        if len(parts) == 6:   # make sure the line is complete
            timestamp = parts[0]
            execution_time = float(parts[5])

            # The first 18 characters identify the 10-second interval.
            time_interval_10s = timestamp[0:18]
            # Emit a partial aggregate: (count, min, max, sum)
            yield time_interval_10s, (1, execution_time, execution_time, execution_time)

    def combiner(self, interval, execution_times):
        # Merge partial aggregates locally; the output has the same shape as the input.
        values = list(execution_times)

        requests = sum([x[0] for x in values])
        min_time = min([x[1] for x in values])
        max_time = max([x[2] for x in values])
        sum_time = sum([x[3] for x in values])
        yield interval, (requests, min_time, max_time, sum_time)

    def reducer(self, interval, execution_times):
        # Final merge; the average is only computed here, from the total sum and count.
        values = list(execution_times)

        requests = sum([x[0] for x in values])
        min_time = min([x[1] for x in values])
        max_time = max([x[2] for x in values])
        sum_time = sum([x[3] for x in values])

        yield interval, "count: {}, min: {}, max: {}, avg: {}".format(requests, min_time, max_time, sum_time/requests)

if __name__ == '__main__':
    MRWebLogStats2b_V2.run()

Writing weblog_stats_2b_v2.py
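
Version 2 can use a combiner because `(count, min, max, sum)` tuples can be merged in any grouping and still give the same final result, whereas an average cannot be merged directly. The sketch below illustrates this with made-up execution times split across two hypothetical mapper outputs; the `merge` and `as_partial` helpers are illustrative names, not part of the job.

```python
def as_partial(t):
    """Wrap one execution time as a (count, min, max, sum) tuple."""
    return (1, t, t, t)

def merge(partials):
    """Merge (count, min, max, sum) tuples into one tuple of the same shape."""
    partials = list(partials)
    return (
        sum(p[0] for p in partials),
        min(p[1] for p in partials),
        max(p[2] for p in partials),
        sum(p[3] for p in partials),
    )

# Made-up execution times for one interval, seen by two different mappers.
split_a = [1.0, 3.0]
split_b = [2.0, 6.0, 8.0]

partial_a = merge(as_partial(t) for t in split_a)  # what a combiner would emit
partial_b = merge(as_partial(t) for t in split_b)

count, min_time, max_time, total = merge([partial_a, partial_b])  # reducer side
print(count, min_time, max_time, total / count)  # 5 1.0 8.0 4.0

# Averaging the per-split averages weights both splits equally and is wrong here.
naive_avg = (sum(split_a) / len(split_a) + sum(split_b) / len(split_b)) / 2
print(naive_avg)
```

Carrying the sum and the count separately and dividing only in the reducer is what keeps the average correct regardless of how the input is split.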


In [None]:
!rm -rf results
!python3 -m weblog_stats_2b_v2 -r local --output-dir results --cleanup NONE weblog.txt
!cat results/* | head -10

No configs found; falling back on auto-configuration
No configs specified for local runner
Creating temp directory /tmp/weblog_stats_2b_v2.root.20251007.190536.627056
Running step 1 of 1...
job output is in results
"2016-12-06T08:58:3"	"count: 483, min: 0.013, max: 46.849, avg: 7.5934244306418215"
"2016-12-06T08:58:4"	"count: 2611, min: 0.014, max: 69.654, avg: 30.159845653006514"
"2016-12-06T08:58:5"	"count: 5500, min: 0.017, max: 80.846, avg: 38.52511163636364"
"2016-12-06T08:59:0"	"count: 6914, min: 0.018, max: 81.659, avg: 38.534382123228234"
"2016-12-06T08:59:1"	"count: 6271, min: 0.017, max: 83.993, avg: 32.96384978472333"
"2016-12-06T08:59:2"	"count: 5434, min: 0.051, max: 77.967, avg: 17.29333143172617"
"2016-12-06T08:59:3"	"count: 8015, min: 0.056, max: 67.441, avg: 11.21015221459763"
"2016-12-06T08:59:4"	"count: 7947, min: 0.914, max: 65.706, avg: 7.761815779539449"
"2016-12-06T08:59:5"	"count: 5983, min: 0.678, max: 54.29, avg: 3.8216643824168477"
"2016-12-06T09:00:0"	"count

In [None]:
#@title Exercise 2c)
%%file weblog_stats_2c.py

from mrjob.job import MRJob

class MRWebLogStats2c(MRJob):

    def mapper(self, _, line):
        parts = line.split()
        if len(parts) == 6:   # make sure the line is complete
            timestamp = parts[0]
            ip = parts[1]
            url = parts[4]
            # The first 18 characters identify the 10-second interval.
            time_interval_10s = timestamp[0:18]
            yield "{}-{}".format(time_interval_10s, url), ip

    def combiner(self, interval_key, ips):
        # Drop duplicate IPs locally to reduce the data sent to the reducers.
        for ip in set(ips):
            yield interval_key, ip

    def reducer(self, interval_key, ips):
        # Deduplicate across mappers; sorting makes the output order deterministic.
        yield interval_key, sorted(set(ips))

if __name__ == '__main__':
    MRWebLogStats2c.run()

Writing weblog_stats_2c.py
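
A small pure-Python version of the same inverted index can serve as a cross-check for the job's output. This is a sketch on made-up log lines in the exercise's space-separated format, keyed by `(interval, url)` rather than by the job's `interval-url` string.

```python
from collections import defaultdict

# Made-up log lines; only the field layout matches the exercise.
lines = [
    "2016-12-06T08:58:31.000+0000 10.0.0.1 200 GET /a 0.1",
    "2016-12-06T08:58:35.000+0000 10.0.0.2 200 GET /a 0.2",
    "2016-12-06T08:58:39.000+0000 10.0.0.1 200 GET /a 0.3",
    "2016-12-06T08:58:41.000+0000 10.0.0.1 200 GET /b 0.4",
]

index = defaultdict(set)  # (interval, url) -> set of unique IPs
for line in lines:
    parts = line.split()
    if len(parts) == 6:   # skip incomplete lines
        timestamp, ip, url = parts[0], parts[1], parts[4]
        index[(timestamp[:18], url)].add(ip)

for (interval, url), ips in sorted(index.items()):
    print(interval, url, sorted(ips))
```

Using a set per key removes duplicate IPs as they arrive, which plays the same role as the `set(ips)` calls in the combiner and reducer.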


In [None]:
!rm -rf results
!python3 -m weblog_stats_2c -r local --output-dir results --cleanup NONE weblog.txt
!head -10 results/*

No configs found; falling back on auto-configuration
No configs specified for local runner
Creating temp directory /tmp/weblog_stats_2c.root.20251007.190608.392761
Running step 1 of 1...
job output is in results
==> results/part-00000 <==
"2016-12-06T08:58:3-/codemove/01IX95N3AFP4"	["120.52.73.98"]
"2016-12-06T08:58:3-/codemove/0GLNQSHCISWJ"	["120.52.73.98"]
"2016-12-06T08:58:3-/codemove/1N80W0N2R36C"	["120.52.73.97"]
"2016-12-06T08:58:3-/codemove/1U6HCG3V2S9D"	["185.28.193.95"]
"2016-12-06T08:58:3-/codemove/2CEBGK8M78Y7"	["192.241.151.220"]
"2016-12-06T08:58:3-/codemove/5Q9SRR2G46PJ"	["120.52.73.98"]
"2016-12-06T08:58:3-/codemove/6GTXIA9YHX09"	["120.52.73.98"]
"2016-12-06T08:58:3-/codemove/7HIW17K7FDZI"	["97.77.104.22"]
"2016-12-06T08:58:3-/codemove/B35MFVKMU1C4"	["2002:894a:3a93:d:250:56ff:fe00:88c0"]
"2016-12-06T08:58:3-/codemove/BJJHJB8J8T7C"	["120.52.73.97"]

==> results/part-00001 <==
"2016-12-06T08:59:4-/codemove/JO322U2THUS7"	["94.177.171.187"]
"2016-12-06T08:59:4-/codemove/JP3