<a href="https://colab.research.google.com/github/d-vinha/SPBD/blob/main/lab4/SPBD_Labs_spark1_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Spark Exercises


In [None]:
#@title Install Pyspark
!pip install --quiet pyspark

## 1. Sorted Word Frequency

Create a [Spark](https://spark.apache.org/docs/latest/api/python/) program that counts the occurrences of each word in the novel "Os Maias" and sorts the words by frequency (most frequent first).

Note that the sorting must be performed as a transformation (i.e., it should produce an RDD), not as an action.

In [None]:
#@title Download "Os Maias"
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

1.1) Implement the sorted word count described above. The sort must be a transformation, so the final result is still an RDD.

In [None]:
import pyspark
import string

sc = pyspark.SparkContext('local[*]')
try:
  # Normalize each line: trim, lowercase, drop punctuation (incl. guillemets)
  lines = sc.textFile('os_maias.txt') \
      .map( lambda line: line.strip() ) \
      .map( lambda line: line.lower() ) \
      .map( lambda line: line.translate(str.maketrans('', '', string.punctuation+'«»')) )

  # Classic word count: one (word, 1) pair per token, summed per word
  words = lines.flatMap( lambda line: line.split() ) \
          .map( lambda word: (word, 1)) \
          .reduceByKey( lambda a, b: a+b)

  # sortBy is a transformation, so the result is still an RDD
  sorted_words = words.sortBy(lambda x: x[1], ascending = False)

  for w,f in sorted_words.take(100):
      print("{}\t{}".format(w,f))
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()
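
To sanity-check the Spark output, the same normalization and counting can be reproduced on a small in-memory sample using only the standard library. This is a sketch rather than part of the exercise solution: `normalize` and `word_frequencies` are hypothetical helper names, and the sample sentences are invented for illustration.

```python
import string
from collections import Counter

# Same punctuation table as the Spark job: ASCII punctuation plus guillemets.
PUNCT_TABLE = str.maketrans('', '', string.punctuation + '«»')

def normalize(line):
    """Trim, lowercase and strip punctuation, mirroring the three map() steps."""
    return line.strip().lower().translate(PUNCT_TABLE)

def word_frequencies(lines):
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter()
    for line in lines:
        counts.update(normalize(line).split())
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Invented sample text, not taken from the novel.
sample = ["«A casa», disse ele.", "A casa era grande; a CASA!"]
result = word_frequencies(sample)
```

Running the Spark job on a file containing the same two lines should give the same counts, which makes this a convenient way to test the pipeline before processing the whole novel.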

1.2) Create a Spark program that finds the top 10 most used words in the novel "Os Maias".

You should try to avoid sorting or finding the top 10 as actions. Your top 10 most used words should be an RDD at the end of the computation. Check *zipWithIndex* in the [pyspark RDD](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html#pyspark.RDD) documentation.

In [None]:
import pyspark
import string

sc = pyspark.SparkContext('local[*]')
try:
  # Normalize each line: trim, lowercase, drop punctuation (incl. guillemets)
  lines = sc.textFile('os_maias.txt') \
      .map( lambda line: line.strip() ) \
      .map( lambda line: line.lower() ) \
      .map( lambda line: line.translate(str.maketrans('', '', string.punctuation+'«»')) )

  # Count words, then keep only the 10 most frequent words of each partition
  top10_words_partitions = lines.flatMap( lambda line: line.split() ) \
          .map( lambda word: (word, 1)) \
          .reduceByKey( lambda a, b: a+b) \
          .mapPartitions( lambda partition : sorted(partition, key=lambda kv: kv[1], reverse=True)[0:10])

  # Sort the per-partition candidates, rank them with zipWithIndex,
  # and keep ranks 0..9; every step is a transformation
  top10_words = top10_words_partitions.sortBy(lambda x: x[1], ascending = False) \
                .zipWithIndex() \
                .filter( lambda ranked: ranked[1] < 10) \
                .map( lambda ranked: ranked[0])

  print("Partitions: Top-10")
  for x in top10_words_partitions.glom().collect():
      print("{}".format(x))

  print("Top-10 Most frequent words:")
  for x in top10_words.collect():
      print("{}".format(x))
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()
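
The two-stage idea used here (keep only each partition's local top 10, then rank the survivors globally) can be checked without Spark. The sketch below emulates partitions as plain Python lists; `top_k_per_partition` and `global_top_k` are hypothetical names, and the counts are invented.

```python
import heapq

def top_k_per_partition(partition, k):
    """Local top-k of one partition, playing the role of the mapPartitions step."""
    return heapq.nlargest(k, partition, key=lambda kv: kv[1])

def global_top_k(partitions, k):
    """Merge local winners, sort them, rank them and keep ranks below k."""
    survivors = [kv for p in partitions for kv in top_k_per_partition(p, k)]
    # enumerate() stands in for zipWithIndex(): it attaches a 0-based rank.
    ranked = enumerate(sorted(survivors, key=lambda kv: kv[1], reverse=True))
    return [kv for rank, kv in ranked if rank < k]

# Two emulated partitions with invented (word, count) pairs.
partitions = [
    [("a", 50), ("de", 40), ("casa", 3)],
    [("que", 45), ("o", 60), ("mar", 1)],
]
top2 = global_top_k(partitions, 2)
```

The pruning is safe because, after `reduceByKey`, each word lives in exactly one partition with its final count, so a word that misses its partition's local top k cannot be in the global top k.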

## 2. Weblog Analysis

Consider a set of log files captured during a DDoS (*Distributed Denial of Service*) attack. They record the web accesses made to the server during the attack.

The log files contain text lines as shown below, with TAB as the separator:

| date | IP_source | status_code | operation | URL | execution time |
|-|-|-|-|-|-|
| timestamp | string | int | string | string | float |
| 2016-12-06T08:58:35.318+0000 | 37.139.9.11 | 404 | GET | /codemove/TTCENCUFMH3C | 0.026 |
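
As a sketch of how one such line maps onto the six typed columns, the parser below splits on TAB and converts the numeric fields. `LogRecord` and `parse_line` are hypothetical names introduced for illustration, and the sample string is the row shown above rewritten with TAB separators.

```python
from typing import NamedTuple, Optional

class LogRecord(NamedTuple):
    timestamp: str
    ip: str
    status: int
    operation: str
    url: str
    exec_time: float

def parse_line(line: str) -> Optional[LogRecord]:
    """Split a TAB-separated line; return None if it is malformed."""
    fields = line.strip().split('\t')
    if len(fields) != 6:
        return None
    ts, ip, status, op, url, exec_time = fields
    try:
        return LogRecord(ts, ip, int(status), op, url, float(exec_time))
    except ValueError:
        # Non-numeric status code or execution time
        return None

sample = "2016-12-06T08:58:35.318+0000\t37.139.9.11\t404\tGET\t/codemove/TTCENCUFMH3C\t0.026"
record = parse_line(sample)
```

Returning `None` for malformed lines plays the same role as the `len(values) == 6` filter in the Spark solutions: bad records are dropped instead of failing the job.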

In [None]:
#@title Download the dataset
!wget -q -O web.log https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0
!head -1 web.log

1) Count the number of unique IP addresses involved in the DDoS attack. Do not use the ***distinct()*** transformation.


In [None]:
import pyspark

sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log') \
          .map( lambda line: line.strip() )

  # Key every record by IP; reduceByKey leaves one pair per distinct IP.
  # Then move all pairs under a single key and sum them to get the count.
  unique_ips = lines.map( lambda line: line.split()) \
          .filter( lambda values: len(values) == 6) \
          .map( lambda values : (values[1], None)) \
          .reduceByKey( lambda a, b : None ) \
          .map( lambda _ : (None, 1)) \
          .reduceByKey( lambda a, b : a+b )

  for _,c in unique_ips.collect():
      print("{}".format(c))
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()
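
The reason the `reduceByKey` trick removes duplicates can be seen in a few lines of plain Python. This is only an emulation under simplifying assumptions (a dict stands in for the shuffled key space), `count_unique` is a hypothetical name, and the addresses are invented.

```python
def count_unique(values):
    """Emulate map -> (value, None), reduceByKey, then counting the keys."""
    keyed = {}
    for value in values:
        # All pairs sharing a key collapse into one entry, as in reduceByKey;
        # the payload (None) is irrelevant, only the key survives.
        keyed[value] = None
    # One remaining entry per distinct value
    return sum(1 for _ in keyed)

# Invented IP addresses for illustration.
ips = ["37.139.9.11", "10.0.0.1", "37.139.9.11", "10.0.0.2", "10.0.0.1"]
unique_count = count_unique(ips)
```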

In [None]:
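import pyspark

# Variant for comparison: same count, but using distinct()
sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log') \
          .map( lambda line: line.strip() )

  # distinct() removes duplicate IPs; the count is still computed
  # as a transformation, by summing 1s under a single key
  unique_ips = lines.map( lambda line: line.split()) \
          .filter( lambda values: len(values) == 6) \
          .map( lambda values : values[1] ) \
          .distinct() \
          .map( lambda _ : (None, 1)) \
          .reduceByKey( lambda a, b : a+b )

  for _,c in unique_ips.collect():
      print("{}".format(c))
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()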

In [None]:
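import pyspark

# Variant for comparison: distinct() followed by the count() action
sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log') \
          .map( lambda line: line.strip() )

  unique_ips = lines.map( lambda line: line.split()) \
          .filter( lambda values: len(values) == 6) \
          .map( lambda values : values[1] ) \
          .distinct()

  # count() is an action: it returns a number to the driver, not an RDD
  print(unique_ips.count())
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()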

2) For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

In [None]:
import pyspark

sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log') \
         .map( lambda line: line.strip() )

  # The first 18 characters of the timestamp (e.g. 2016-12-06T08:58:3)
  # end at the tens digit of the seconds, i.e. they name a 10-second interval.
  # Per key we carry (count, sum, max, min) and derive the average at the end.
  intervals = lines.map( lambda line: line.split()) \
          .filter( lambda values: len(values) == 6) \
          .map( lambda values: (values[0][0:18], float(values[5]))) \
          .map( lambda kv : (kv[0], (1, kv[1], kv[1], kv[1]))) \
          .reduceByKey( lambda a, b : (a[0] + b[0], a[1] + b[1], max(a[2],b[2]), min(a[3],b[3])) ) \
          .map( lambda kv : (kv[0], (kv[1][0], kv[1][1] / kv[1][0], kv[1][2], kv[1][3]))) \
          .sortByKey()

  for interval in intervals.take(100):
    print(interval)
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()
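
The key derivation and the combine step above can be tested in isolation. The sketch below factors the lambdas into named functions (`bucket_key`, `to_stats`, `merge_stats` and `finalize` are hypothetical names) and runs them on invented execution times.

```python
def bucket_key(timestamp):
    """First 18 characters: date, hour, minute and the tens digit of the seconds."""
    return timestamp[0:18]

def to_stats(exec_time):
    """Initial (count, sum, max, min) tuple for a single request."""
    return (1, exec_time, exec_time, exec_time)

def merge_stats(a, b):
    """Associative and commutative combine, as reduceByKey requires."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]), min(a[3], b[3]))

def finalize(stats):
    """Turn (count, sum, max, min) into (count, average, max, min)."""
    count, total, longest, shortest = stats
    return (count, total / count, longest, shortest)

key = bucket_key("2016-12-06T08:58:35.318+0000")
# Invented execution times: 0.5 s, 1.5 s and 1.0 s
merged = merge_stats(merge_stats(to_stats(0.5), to_stats(1.5)), to_stats(1.0))
summary = finalize(merged)
```

Carrying the sum and the count separately, and dividing only at the end, matters because averages cannot be merged pairwise: the average of two partial averages is wrong whenever the partial counts differ.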

3) Create an inverted index that, for each 10-second interval and each URL, lists the (unique) IPs that accessed it.

In [None]:
import pyspark

sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log') \
          .map( lambda line: line.strip() )

  # Key: "<10-second interval>-<URL>"; value: a one-element set with the IP.
  # Set union in reduceByKey merges the IPs and removes duplicates.
  intervals = lines.map( lambda line: line.split()) \
          .filter( lambda values: len(values) == 6) \
          .map( lambda values: ("{}-{}".format(values[0][0:18], values[4]), { values[1] } )) \
          .reduceByKey( lambda a, b : a | b ) \
          .sortByKey()

  for v in intervals.collect():
    print(v)
except Exception as e:
  print(e)
finally:
  # Always release the context, whether or not the job succeeded
  sc.stop()
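
The structure of the resulting index can be illustrated with a small in-memory emulation. This sketch assumes a tuple key `(interval, url)` instead of the formatted string used in the Spark job; `build_index` is a hypothetical name, and the URL `/a` and the second IP address are invented.

```python
from collections import defaultdict

def build_index(records):
    """records: iterable of (timestamp, ip, url) -> {(interval, url): set of IPs}."""
    index = defaultdict(set)
    for timestamp, ip, url in records:
        # Same 10-second bucketing as above: first 18 characters of the timestamp
        index[(timestamp[0:18], url)].add(ip)
    return dict(index)

records = [
    ("2016-12-06T08:58:35.318+0000", "37.139.9.11", "/a"),
    ("2016-12-06T08:58:36.000+0000", "37.139.9.11", "/a"),  # repeated IP, same interval
    ("2016-12-06T08:58:37.100+0000", "10.0.0.1", "/a"),
    ("2016-12-06T08:58:41.000+0000", "10.0.0.1", "/a"),     # next interval
]
index = build_index(records)
```

A tuple key keeps the interval and the URL separately addressable, which would be harder with the `"interval-URL"` string if a URL itself contained a hyphen.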