<a href="https://colab.research.google.com/github/d-vinha/SPBD/blob/main/lab3/SPBD_Labs_spark1_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Spark Exercises


In [13]:
#@title Install Pyspark
!pip install --quiet pyspark

In [14]:
#@title Download "Os Maias"
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

## 1. Sorted Word Frequency

1.1) Create a [Spark](https://spark.apache.org/docs/latest/api/python/) program that counts the number of occurrences of each word in the novel "Os Maias" and sorts the words by frequency, most frequent first.

Note that the sorting should be performed as a transformation (i.e. it should produce an RDD), not as an action.

In [17]:
import pyspark
import string

def preprocess_line(line):
    # Trim whitespace, delete ASCII punctuation and the « » quotation marks,
    # then lowercase the line
    return line.strip().translate(str.maketrans('', '', string.punctuation + '«»')).lower()

def main():
    # Initialise Spark context
    sc = pyspark.SparkContext('local[*]')
    try:
        # Read and preprocess the text file using the preprocess function
        non_empty_lines = sc.textFile('os_maias.txt') \
            .map(preprocess_line) \
            .filter(lambda line: len(line) > 0)

        # Word count:
        #   1. flatMap splits each line into words and flattens the per-line
        #      lists of words into a single collection of words.
        #   2. map pairs each word with the count 1.
        #   3. reduceByKey sums the counts of each word.
        words = non_empty_lines.flatMap(lambda line: line.split()) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)

        # Sort words by frequency (descending)
        sorted_words = words.sortBy(lambda x: x[1], ascending=False)

        # Print the top 100 words
        top_words = sorted_words.take(100)
        for word, freq in top_words:
            print("{}\t{}".format(word, freq))

    except Exception as e:
        print("An error occurred: {}".format(str(e)))
    finally:
        # Stop SparkContext
        sc.stop()

if __name__ == "__main__":
    main()


de	8459
o	7545
a	7208
e	6230
que	5280
um	3191
com	2872
do	2588
uma	2298
da	2208
não	2169
os	1888
para	1821
carlos	1798
em	1626
no	1560
se	1527
ao	1487
as	1455
na	1320
é	1296
ega	1125
como	1122
ele	1076
por	1068
era	970
mas	960
à	891
seu	885
mais	798
sua	770
ela	740
já	671
lhe	671
dos	623
muito	594
eu	586
depois	562
então	523
sobre	481
lá	479
das	475
tinha	441
maria	440
quando	437
ainda	433
sem	422
num	420
dâmaso	396
tudo	390
onde	385
foi	382
numa	380
estava	366
agora	365
bem	364
tão	357
seus	353
sr	351
disse	346
só	346
nos	339
entre	336
olhos	335
grande	332
também	331
casa	327
dum	324
afonso	320
logo	317
todo	314
vilaça	309
pela	294
ali	292
toda	291
sempre	289
pelo	286
me	281
tu	279
havia	278
fora	277
mão	269
maia	269
ou	268
assim	267
outro	267
até	262
dois	261
há	259
mesmo	258
homem	254
nem	252
isto	251
ar	248
craft	239
ás	237
está	237
duas	235
ser	228
meu	228
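The preprocessing step used in the cell above can be tried on its own, without Spark. The sketch below is a variant that builds the translation table once instead of on every call; the name `_STRIP_TABLE` and the sample sentence are illustrative and not part of the original notebook. It also shows a side effect of this approach: because the hyphen belongs to `string.punctuation`, hyphenated forms are fused into a single token.

```python
import string

# Translation table that deletes ASCII punctuation plus the « » quotation marks.
# Built once at import time (hypothetical name, not from the notebook).
_STRIP_TABLE = str.maketrans('', '', string.punctuation + '«»')

def preprocess_line(line):
    """Trim whitespace, drop punctuation and lowercase a line of text."""
    return line.strip().translate(_STRIP_TABLE).lower()

# Illustrative input, not taken from the novel
sample = '  «Olá», disse-lhe Carlos!  '
cleaned = preprocess_line(sample)
print(cleaned)          # the hyphenated 'disse-lhe' becomes the single token 'disselhe'
print(cleaned.split())
```

If hyphenated forms should stay separable, one option would be to replace the hyphen with a space before deleting the remaining punctuation.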


1.2) Create a Spark program that computes the top 10 most used words in the novel "Os Maias".

Try to avoid performing the sort or the top-10 selection as actions: the top 10 most used words should still be an RDD at the end of the computation. See *zipWithIndex* in the [pyspark RDD](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html#pyspark.RDD) documentation.

In [26]:
import pyspark
import string

def preprocess_line(line):
    #preprocess the lines of text (same as before)
    return line.strip().translate(str.maketrans('', '', string.punctuation + '«»')).lower()

def main():
  # Initialise Spark context
  sc = pyspark.SparkContext('local[*]')
  try:
    # Read and preprocess the text file using the preprocess function
    non_empty_lines = sc.textFile('os_maias.txt') \
            .map(preprocess_line) \
            .filter(lambda line: len(line) > 0)

    # Word count
    words_count = non_empty_lines.flatMap(lambda line: line.split()) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)

    # We only need the ten most-used words, not the full ordering produced
    # with sortBy in 1.1; zipWithIndex is the hint given for this.
    # The line below does not compute the top 10 yet. It only inspects how the
    # counts are spread across partitions: glom() turns each partition into a
    # list, and collect() (an action) brings those lists to the driver.
    words_top = words_count.glom().collect()

    print(words_top)
    # Planned next step (not implemented here): .map(lambda x: (x[1], x[0]))
    # swaps the word and the count in each element of words_count, making the
    # count the key and the word the value, i.e. an RDD of (count, word) pairs.

  except Exception as e:
    print("An error occurred: {}".format(str(e)))

  finally:
    # Stop Spark context
    sc.stop()

if __name__ == "__main__":
    main()

[[('os', 1888), ('queirós', 1), ('i', 11), ('que', 5280), ('em', 1626), ('lisboa', 195), ('no', 1560), ('outono', 21), ('conhecida', 8), ('da', 2208), ('s', 131), ('paula', 2), ('das', 475), ('do', 2588), ('ramalhete', 170), ('ou', 268), ('deste', 17), ('fresco', 28), ('sombrio', 24), ('paredes', 19), ('severas', 3), ('um', 3191), ('varandas', 6), ('primeiro', 86), ('andar', 39), ('cima', 99), ('uma', 2298), ('tímida', 7), ('à', 891), ('tinha', 441), ('aspecto', 7), ('tristonho', 8), ('eclesiástica', 2), ('edificação', 1), ('reinado', 3), ('d', 179), ('sineta', 13), ('cruz', 22), ('assimilarseia', 1), ('colégio', 6), ('certo', 73), ('dum', 324), ('revestimento', 1), ('quadrado', 3), ('fazendo', 29), ('painel', 5), ('heráldico', 2), ('escudo', 2), ('armas', 28), ('chegara', 19), ('colocado', 2), ('girassóis', 3), ('onde', 385), ('letras', 27), ('duma', 214), ('permanecera', 2), ('teias', 2), ('grades', 5), ('postigos', 2), ('térreos', 2), ('cobrindose', 2), ('bucarini', 1), ('núncio', 1
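One way to keep the top 10 as an RDD, following the *zipWithIndex* hint, is sketched below. It assumes `word_counts` is an RDD of `(word, count)` pairs such as `words_count` in the cell above; the function name `top_n_rdd` is hypothetical. Note that `sortBy` is still needed to establish the order, and that both `sortBy` and `zipWithIndex` may launch a Spark job internally when the RDD has more than one partition, even though they are transformations.

```python
def top_n_rdd(word_counts, n=10):
    """Return an RDD holding the n most frequent (word, count) pairs.

    word_counts is assumed to be an RDD of (word, count) tuples.
    """
    return (word_counts
            .sortBy(lambda pair: pair[1], ascending=False)  # most frequent first
            .zipWithIndex()                                  # ((word, count), rank)
            .filter(lambda ranked: ranked[1] < n)            # keep ranks 0..n-1
            .map(lambda ranked: ranked[0]))                  # drop the rank again
```

Inside `main()` this could be called as `top10 = top_n_rdd(words_count)`; an action such as `top10.collect()` is then only needed to display the result.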

## 2. Weblog Analysis

Consider a set of log files captured during a DDoS (*Distributed Denial of Service*) attack, recording the web accesses made to the server during the attack.

The log files contain text lines as shown below, with TAB as the separator:

| date | IP_source | status_code | operation | URL | execution time |
|-|-|-|-|-|-|
| timestamp | string | int | string | string | float |
| 2016-12-06T08:58:35.318+0000 | 37.139.9.11 | 404 | GET | /codemove/TTCENCUFMH3C | 0.026 |
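A minimal sketch of how one such line could be split into typed fields, using only the standard library. The field names follow the table header; the names `LogRecord` and `parse_log_line` are hypothetical, and the sample line is the one shown in the table with TABs restored.

```python
from collections import namedtuple

# Field names follow the table header; 'LogRecord' itself is a hypothetical name.
LogRecord = namedtuple('LogRecord', 'date ip status operation url exec_time')

def parse_log_line(line):
    """Split one TAB-separated log line into typed fields."""
    date, ip, status, operation, url, exec_time = line.rstrip('\n').split('\t')
    return LogRecord(date, ip, int(status), operation, url, float(exec_time))

example = '2016-12-06T08:58:35.318+0000\t37.139.9.11\t404\tGET\t/codemove/TTCENCUFMH3C\t0.026'
record = parse_log_line(example)
print(record.ip, record.status, record.exec_time)
```

The timestamp is kept as a string here; for the exercises below, slicing that string is enough to group requests into time windows.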

In [None]:
#@title Download the dataset
!wget -q -O web.log https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0
!wc web.log

2.1. Count the number of unique IP addresses involved in the DDOS attack.


In [None]:
import pyspark

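sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log')

  # Starting point: print the first 10 log lines to check the format
  for line in lines.take(10):
    print(line)
except Exception as e:
  print(e)
finally:
  # Stop the Spark context whether or not an error occurred
  sc.stop()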
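A possible approach to 2.1, sketched under the assumption that `lines` is the RDD returned by `sc.textFile('web.log')` and that every line has the six TAB-separated fields described above. The function names `extract_ip` and `count_unique_ips` are hypothetical.

```python
def extract_ip(line):
    """Return the IP_source field (second TAB-separated column)."""
    return line.split('\t')[1]

def count_unique_ips(lines):
    """Count the distinct source IPs in an RDD of raw log lines."""
    return (lines
            .map(extract_ip)   # keep only the IP of each request
            .distinct()        # one entry per IP
            .count())          # action: number of unique IPs
```

Inside the `try` block of the cell above, this would be used as `print(count_unique_ips(lines))`.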

2.2. For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

In [None]:
import pyspark

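sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log')

  # Starting point: print the first 10 log lines to check the format
  for line in lines.take(10):
    print(line)
except Exception as e:
  print(e)
finally:
  # Stop the Spark context whether or not an error occurred
  sc.stop()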
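A possible approach to 2.2 is sketched below. It assumes that all timestamps have the fixed-width form shown in the sample line (so a 10-second window can be obtained by truncating the seconds digit) and that `lines` is the RDD returned by `sc.textFile('web.log')`. All function names are hypothetical.

```python
def window_key(timestamp):
    """Map '2016-12-06T08:58:35.318+0000' to its 10-second window '2016-12-06T08:58:30'."""
    # Assumes fixed-width timestamps: keep everything up to the tens digit
    # of the seconds and set the units digit to 0.
    return timestamp[:18] + '0'

def to_stats(line):
    """Turn a raw log line into (window, (count, total, max, min))."""
    fields = line.split('\t')
    t = float(fields[5])
    return (window_key(fields[0]), (1, t, t, t))

def merge_stats(a, b):
    """Combine two partial (count, total, max, min) tuples."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]), min(a[3], b[3]))

def finalize(stats):
    """Convert (count, total, max, min) into [count, average, max, min]."""
    count, total, longest, shortest = stats
    return [count, total / count, longest, shortest]

def stats_per_window(lines):
    """Return an RDD of (window, [count, average, max, min]), ordered by window."""
    return (lines
            .map(to_stats)
            .reduceByKey(merge_stats)
            .mapValues(finalize)
            .sortByKey())
```

Carrying the running total and count through `reduceByKey` and dividing only at the end avoids averaging averages, which would give wrong results for windows whose requests are spread over several partitions.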

2.3. Create an inverted index that, for each interval of 10 seconds, has a list of (unique) IPs executing accesses (to each URL).

In [None]:
import pyspark

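sc = pyspark.SparkContext('local[*]')
try:
  lines = sc.textFile('web.log')

  # Starting point: print the first 10 log lines to check the format
  for line in lines.take(10):
    print(line)
except Exception as e:
  print(e)
finally:
  # Stop the Spark context whether or not an error occurred
  sc.stop()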
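A possible approach to 2.3 is sketched below. The exercise statement leaves the exact key open; this sketch assumes the index is keyed by `(10-second window, URL)` and maps each key to the sorted list of unique IPs that accessed that URL in that window. It also assumes fixed-width timestamps as in the sample line, and that `lines` is the RDD returned by `sc.textFile('web.log')`. The function names are hypothetical.

```python
def to_window_url_ip(line):
    """Turn a raw log line into ((window, url), ip)."""
    fields = line.split('\t')
    # Assumes fixed-width timestamps: truncate to the 10-second window
    window = fields[0][:18] + '0'
    return ((window, fields[4]), fields[1])

def inverted_index(lines):
    """Return an RDD of ((window, url), [unique IPs]) built from raw log lines."""
    return (lines
            .map(to_window_url_ip)
            .distinct()           # drop repeated (window, url, ip) combinations
            .groupByKey()         # gather the IPs of each (window, url)
            .mapValues(sorted))   # materialise each group as a sorted list
```

Calling `distinct()` before `groupByKey()` keeps the grouped values small, which matters during an attack where the same IP may hit the same URL many times within one window.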