
## 2. Weblog Analysis

Consider a set of log files captured during a DDoS (*Distributed Denial of Service*) attack. They record the web requests the server received during the attack.

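Each line of a log file describes one request and has the six fields shown below. In the downloaded file the fields are separated by a single space, which is why the code in this section splits on `' '`: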

| date | IP_source | status_code | operation | URL | execution time |
|-|-|-|-|-|-|
| timestamp | string | int | string | string | float |
| 2016-12-06T08:58:35.318+0000 | 37.139.9.11 | 404 | GET | /codemove/TTCENCUFMH3C | 0.026 |
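
As a minimal sketch of how one such line maps onto the typed fields above, the following plain-Python snippet parses the sample row. `LogEntry` and `parse_line` are hypothetical names introduced here for illustration and are not part of the lab code. `split()` is called without an argument so that the sketch tolerates either TAB or space separators.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical typed view of one log line (illustration only)."""
    date: datetime
    ip_source: str
    status_code: int
    operation: str
    url: str
    execution_time: float


def parse_line(line: str) -> LogEntry:
    # split() with no argument splits on any run of whitespace (space or TAB).
    parts = line.strip().split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    date, ip, status, operation, url, elapsed = parts
    return LogEntry(
        # Timestamp layout taken from the sample row: milliseconds plus UTC offset.
        date=datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%f%z"),
        ip_source=ip,
        status_code=int(status),
        operation=operation,
        url=url,
        execution_time=float(elapsed),
    )


entry = parse_line(
    "2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026"
)
print(entry.ip_source, entry.status_code, entry.execution_time)
```

The Spark solutions below keep every field as a string and convert only what they need, which avoids parsing timestamps for each of the log lines.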

In [None]:
#@title Download the dataset
!wget -q -O web.log https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0
!head -3 web.log

2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026  
2016-12-06T08:58:35.356+0000 178.22.148.122 404 GET /codemove/PSO83TYKET12 0.088  
2016-12-06T08:58:35.357+0000 178.22.148.122 404 GET /codemove/PSO83TYKET12 0.088  


2.1. Count the number of unique IP addresses involved in the DDoS attack.


In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("WebLogExample2.1 version1") \
    .getOrCreate()

sc = spark.sparkContext

try:
  lines = sc.textFile('web.log') \
          .filter( lambda line : len( line ) > 0 ) \
          .map( lambda line : line.strip().split(' ') ) \
          .filter( lambda parts : len(parts) == 6 )

  for line in lines.take( 3 ) :
      print( line )

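  # Version 1: count unique IPs using only map and reduceByKey.
  # Step 1: key each record by its IP and collapse duplicates.
  # Step 2: re-key every surviving IP to None and sum the ones.
  ip_count = lines.map( lambda parts : (parts[1], None) ) \
        .reduceByKey( lambda a, b : None ) \
        .map( lambda kv : (None, 1) ) \
        .reduceByKey( lambda a, b : a + b ) \
        .map( lambda kv : kv[1] )

  # The resulting RDD holds a single element: the number of unique IPs.
  for count in ip_count.take(10):
    print(count)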

except Exception as e:
  print(e)

['2016-12-06T08:58:35.318+0000', '37.139.9.11', '404', 'GET', '/codemove/TTCENCUFMH3C', '0.026']
['2016-12-06T08:58:35.356+0000', '178.22.148.122', '404', 'GET', '/codemove/PSO83TYKET12', '0.088']
['2016-12-06T08:58:35.357+0000', '178.22.148.122', '404', 'GET', '/codemove/PSO83TYKET12', '0.088']
167


In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("WebLogExample2.1 version2") \
    .getOrCreate()

sc = spark.sparkContext

try:
  lines = sc.textFile('web.log') \
          .filter( lambda line : len( line ) > 0 ) \
          .map( lambda line : line.strip().split(' ') ) \
          .filter( lambda parts : len(parts) == 6 )

  for line in lines.take( 3 ) :
      print( line )

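  # Version 2: let distinct() remove duplicate IPs, then sum one per IP.
  ip_count = lines.map( lambda parts : parts[1] ) \
        .distinct() \
        .map( lambda _ : (None, 1) ) \
        .reduceByKey( lambda a, b : a + b ) \
        .map( lambda kv : kv[1] )

  # The resulting RDD holds a single element: the number of unique IPs.
  for count in ip_count.take(10):
    print(count)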

except Exception as e:
  print(e)

['2016-12-06T08:58:35.318+0000', '37.139.9.11', '404', 'GET', '/codemove/TTCENCUFMH3C', '0.026']
['2016-12-06T08:58:35.356+0000', '178.22.148.122', '404', 'GET', '/codemove/PSO83TYKET12', '0.088']
['2016-12-06T08:58:35.357+0000', '178.22.148.122', '404', 'GET', '/codemove/PSO83TYKET12', '0.088']
167
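
To make the two-step pattern of version 1 easier to follow, here is a minimal plain-Python sketch that runs without Spark. `reduce_by_key` is a hypothetical local stand-in written for this illustration; it only mimics the semantics of `reduceByKey` on a small list and is not the Spark API. The three IPs are the ones from the first three log lines.

```python
def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: merge the values of equal keys."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


ips = ["37.139.9.11", "178.22.148.122", "178.22.148.122"]

# Step 1: key by IP; merging duplicates keeps exactly one pair per IP.
unique = reduce_by_key([(ip, None) for ip in ips], lambda a, b: None)

# Step 2: move every unique IP under the same key and add up the ones.
counted = reduce_by_key([(None, 1) for _ in unique], lambda a, b: a + b)

unique_count = counted[0][1]
print(unique_count)  # 2
```

In Spark itself, `lines.map(lambda parts: parts[1]).distinct().count()` should return the same number directly as an action, without the final key/value round trip.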


2.2. For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("WebLogExample2.2") \
    .getOrCreate()

sc = spark.sparkContext

try:
  lines = sc.textFile('web.log') \
          .filter( lambda line : len( line ) > 0 ) \
          .map( lambda line : line.strip().split(' ') ) \
          .filter( lambda parts : len(parts) == 6 )

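  # Key: the timestamp cut after the tens digit of the seconds, i.e. a 10-second bucket.
  # Value while reducing: (count, max time, min time, total time).
  # Final value: (count, average time, max time, min time).
  intervals = lines.map( lambda parts : (parts[0][0:18], (1, float(parts[5]), float(parts[5]), float(parts[5])))) \
          .reduceByKey( lambda a, b : (a[0] + b[0], max(a[1], b[1]), min(a[2], b[2]), a[3] + b[3])) \
          .map( lambda t : (t[0], (t[1][0], t[1][3] / t[1][0], t[1][1], t[1][2])))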

  for i in intervals.take(10):
    print(i)

except Exception as e:
  print(e)

('2016-12-06T08:58:3', (483, 7.5934244306418215, 46.849, 0.013))
('2016-12-06T08:58:4', (2611, 30.159845653006503, 69.654, 0.014))
('2016-12-06T08:58:5', (5500, 38.52511163636371, 80.846, 0.017))
('2016-12-06T08:59:4', (7947, 7.761815779539431, 65.706, 0.914))
('2016-12-06T09:00:0', (6882, 8.649971519907023, 45.314, 0.017))
('2016-12-06T09:00:1', (9719, 7.857372672085602, 34.406, 0.225))
('2016-12-06T09:00:3', (6771, 1.6047638458130256, 26.53, 0.007))
('2016-12-06T09:01:2', (5315, 0.1536705550329246, 1.361, 0.005))
('2016-12-06T09:01:3', (6163, 0.11656384877494576, 1.117, 0.005))
('2016-12-06T09:01:5', (3343, 0.0984113072090947, 1.098, 0.005))
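
The interval keys above come from slicing the timestamp string rather than from date arithmetic. The following plain-Python sketch shows the same bucketing and the same (count, max, min, total) accumulator on three records. `interval_key` is a hypothetical helper name; the first two records are taken from the log, and the third is a made-up value added only to produce a second bucket.

```python
def interval_key(timestamp: str) -> str:
    # '2016-12-06T08:58:35.318+0000'[0:18] == '2016-12-06T08:58:3'
    # Keeping only the tens digit of the seconds groups 30..39 s together.
    return timestamp[0:18]


records = [
    ("2016-12-06T08:58:35.318+0000", 0.026),
    ("2016-12-06T08:58:35.356+0000", 0.088),
    ("2016-12-06T08:58:41.000+0000", 1.5),  # made-up record, not from the log
]

# Accumulate (count, max, min, total) per bucket, as the reduceByKey step does.
stats = {}
for timestamp, elapsed in records:
    key = interval_key(timestamp)
    count, max_t, min_t, total = stats.get(key, (0, elapsed, elapsed, 0.0))
    stats[key] = (count + 1, max(max_t, elapsed), min(min_t, elapsed), total + elapsed)

# Reorder to (count, average, max, min), as the final map step does.
summary = {
    key: (count, total / count, max_t, min_t)
    for key, (count, max_t, min_t, total) in stats.items()
}

for key, value in sorted(summary.items()):
    print(key, value)
```

Slicing only works because every timestamp has the same fixed-width layout. It also explains why the Spark output is not in chronological order: `take(10)` returns whichever keys come first, and adding a `sortByKey()` before it would be one way to order them.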


2.3. Create an inverted index that maps each pair (10-second interval, URL) to the set of unique IPs that accessed that URL during that interval.

In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("WebLogExample2.3") \
    .getOrCreate()

sc = spark.sparkContext

try:

  lines = sc.textFile('web.log') \
          .filter( lambda line : len( line ) > 0 ) \
          .map( lambda line : line.strip().split(' ') ) \
          .filter( lambda parts : len(parts) == 6 )

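  # Key: "<10-second interval>-<URL>"; value: a one-element set holding the IP.
  # Set union in reduceByKey merges the IPs and drops duplicates.
  intervals = lines.map( lambda parts : ("{}-{}".format(parts[0][0:18], parts[4]), { parts[1] } )) \
              .reduceByKey( lambda a, b : a | b )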

  for i in intervals.take(10):
    print(i)

except Exception as e:
  print(e)

('2016-12-06T08:58:3-/codemove/PSO83TYKET12', {'178.22.148.122'})
('2016-12-06T08:58:3-/codemove/1U6HCG3V2S9D', {'185.28.193.95'})
('2016-12-06T08:58:3-/codemove/B35MFVKMU1C4', {'2002:894a:3a93:d:250:56ff:fe00:88c0'})
('2016-12-06T08:58:3-/codemove/2CEBGK8M78Y7', {'192.241.151.220'})
('2016-12-06T08:58:3-/codemove/ZBOWM9VZMHE1', {'2a02:c207:2008:5497::1'})
('2016-12-06T08:58:3-/codemove/TYVRFD3NGGXK', {'2a01:488:66:1000:5c33:8503:0:1'})
('2016-12-06T08:58:3-/codemove/BRPB8Y32OAGA', {'120.52.73.97'})
('2016-12-06T08:58:3-/codemove/ICGPOQXLVXFS', {'120.52.73.97'})
('2016-12-06T08:58:3-/codemove/BJJHJB8J8T7C', {'120.52.73.97'})
('2016-12-06T08:58:3-/codemove/1N80W0N2R36C', {'120.52.73.97'})
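
For comparison, the same inverted index can be sketched in plain Python with a dictionary of sets. This is an illustration under the assumption that the data fits in memory, not a replacement for the Spark job. The rows are the first three log lines reduced to (timestamp, IP, URL), and the key is a tuple rather than the formatted string used above, which avoids ambiguity if a URL itself contains `-`.

```python
from collections import defaultdict

# (timestamp, IP, URL) taken from the first three lines of web.log.
rows = [
    ("2016-12-06T08:58:35.318+0000", "37.139.9.11", "/codemove/TTCENCUFMH3C"),
    ("2016-12-06T08:58:35.356+0000", "178.22.148.122", "/codemove/PSO83TYKET12"),
    ("2016-12-06T08:58:35.357+0000", "178.22.148.122", "/codemove/PSO83TYKET12"),
]

# (10-second interval, URL) -> set of IPs; a set keeps each IP only once.
index = defaultdict(set)
for timestamp, ip, url in rows:
    index[(timestamp[0:18], url)].add(ip)

for key, ip_set in sorted(index.items()):
    print(key, ip_set)
```

The repeated request from `178.22.148.122` appears once in its set, which is the same effect as the `a | b` union in the `reduceByKey` step.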
