# Spark Training Exercise Book Day 2



In this exercise you will process Apache HTTP access logs with Spark. Below you can see some sample lines from our test dataset.

The security team has let you know that they would like to see some statistics from this large volume of log files, for example which resource produced the most non-200 (i.e. not OK) HTTP response codes. They also have a list of malicious IP addresses and would like a report on the requests initiated from any of these IPs.

It is up to you how to implement this use case in Spark.

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

sc = pyspark.SparkContext()
spark = SparkSession(sc)

In [3]:
!head -n 5 data/apache_logs.txt


109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"


As you can see, the Apache access log is not a clean CSV file, so it needs some additional processing before we can build a DataFrame from it. The helper functions below extract the fields relevant to this exercise. Use the **process_access_log_line** function to parse one access log line; it returns the IP address, the raw timestamp, the date (as a string), the hour (as an int), the request header fields (method, resource, protocol) and the HTTP response code generated by the server.

In [4]:
import time
import datetime

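def get_ip(line):
    # The client IP is the first space-separated field.
    return line.split(' ')[0]

def get_timestamp(line):
    # The timestamp is the text between the first '[' and the first ']'.
    start = line.find('[')
    end = line.find(']')
    return line[start + 1:end]

def get_header(line):
    # The request line ("METHOD resource protocol") is the first double-quoted field.
    start = line.find('"')
    length = line[start + 1:].find('"')
    header = line[start + 1:start + length + 1].split(' ')
    method = header[0] if len(header) > 0 else "malformed"
    resource = header[1] if len(header) > 1 else "malformed"
    protocol = header[2] if len(header) > 2 else "malformed"
    return (method, resource, protocol)

def get_error_code(line):
    # The HTTP response code is the 9th space-separated field.
    # 0 marks a missing or non-numeric code.
    fields = line.split(' ')
    if len(fields) < 9:
        return 0
    try:
        code = int(fields[8])
    except ValueError:
        code = 0
    return code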

# input: one raw access log line from the RDD
# output: structured data: (ip, ts, date, hour, method, resource, protocol, response code)
def process_access_log_line(log_line):
    header = get_header(log_line)
    ts_str = get_timestamp(log_line)
    # Fallback values, kept when the timestamp cannot be parsed
    date_str = "1980-01-01"
    hour = 12
    try:
        td = datetime.datetime.strptime(ts_str, "%d/%b/%Y:%H:%M:%S %z")
        date_str = '{}-{}-{}'.format(td.year, td.month, td.day)   
        hour = td.hour
    except ValueError:
        pass
    return (get_ip(log_line), ts_str, date_str, hour, header[0], header[1], header[2], get_error_code(log_line))
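
To make the field positions used by the helpers more concrete, here is a minimal plain-Python walkthrough of a single line taken from the sample output above. It does not call the helpers; it only repeats the same splitting and slicing steps, so the variable names (`sample`, `fields`, `first_quote`, `second_quote`) are illustrative and not part of the exercise code.

```python
import datetime

# One line copied from the sample output above.
sample = ('109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] '
          '"GET /administrator/ HTTP/1.1" 200 4263 "-" '
          '"Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"')

# Splitting on single spaces puts the IP at index 0 and the response code at index 8.
fields = sample.split(' ')
ip = fields[0]
response = int(fields[8])

# The timestamp sits between the square brackets.
ts_str = sample[sample.find('[') + 1:sample.find(']')]
td = datetime.datetime.strptime(ts_str, "%d/%b/%Y:%H:%M:%S %z")

# The request line is the first double-quoted field.
first_quote = sample.find('"')
second_quote = sample.find('"', first_quote + 1)
method, resource, protocol = sample[first_quote + 1:second_quote].split(' ')

print(ip, ts_str, td.hour, method, resource, protocol, response)
# prints: 109.169.248.247 12/Dec/2015:18:25:11 +0100 18 GET /administrator/ HTTP/1.1 200
```

Note that the timestamp itself contains a space (before `+0100`), which is why the response code ends up at index 8 rather than 7.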


Now it is time to read the **data/apache_logs.txt** file and transform it so that you can create a DataFrame and run some aggregations. We have also provided a schema that matches the output of the **process_access_log_line** function and can be used to create the DataFrame.

In [6]:
access_log_schema = StructType([
            StructField('ip', StringType(), True),
            StructField('ts', StringType(), True),
            StructField('date', StringType(), True),
            StructField('hour', IntegerType(), True),
            StructField('method', StringType(), True),
            StructField('resource', StringType(), True),
            StructField('protocol', StringType(), True),
            StructField('response', IntegerType(), True)
        ])

In [7]:
# Your code comes here

# Step 1: read the data into an RDD
rdd = sc.textFile('data/apache_logs.txt')
rdd.take(5)

# Step 2: transform the lines of the RDD into a "structured" format
#   you can use the process_access_log_line function for the transformation

# Step 3: create the dataframe
#df.show()

['',
 '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
 '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
 '46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"',
 '46.72.177.4 - - [12/Dec/2015:18:31:08 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"']

The security team would like to see the top 10 resources that generate the highest number of error (non-200) response codes, broken down by hour. They would like to store this information on shared distributed storage (such as S3) where they can query it efficiently. Most of their queries filter on one specific day.

In [32]:
# Your code comes here


+----------+----+------+--------------------+--------+------------+
|      date|hour|method|            resource|response|access_count|
+----------+----+------+--------------------+--------+------------+
| 2016-1-14|  20|   GET|/templates/_syste...|     404|          63|
| 2016-2-11|   0|   GET|/apache-log/acces...|     206|          57|
|  2016-1-1|  21|   GET|/templates/_syste...|     404|          35|
|  2016-2-8|   9|   GET|/templates/_syste...|     404|          31|
|  2016-1-6|  19|   GET|/templates/_syste...|     404|          29|
|2015-12-20|  18|   GET|/templates/_syste...|     404|          22|
| 2016-1-15|  20|   GET|/templates/_syste...|     404|          22|
|2015-12-20|  12|   GET|/apache-log/acces...|     404|          20|
| 2016-1-21|  13|   GET|      /administrator|     301|          19|
| 2016-1-14|   8|   GET|/templates/_syste...|     404|          19|
+----------+----+------+--------------------+--------+------------+



The security team also provided a list of malicious IP addresses. Your task is to count how many requests came from each malicious IP address in each hour. You will find the IP list in **data/ip-list.txt**.

In [8]:
!head -10 data/ip-list.txt

Malicious IP address:  109.106.142.176
Malicious IP address:  109.184.11.34
Malicious IP address:  128.72.82.254
Malicious IP address:  130.255.13.57
Malicious IP address:  176.194.74.20
Malicious IP address:  176.214.240.8
Malicious IP address:  177.201.52.125
Malicious IP address:  178.35.29.219
Malicious IP address:  195.8.51.14
Malicious IP address:  2.86.149.141


In [37]:
# your code comes here


+---------------+
|   malicious_ip|
+---------------+
|109.106.142.176|
|  109.184.11.34|
|  128.72.82.254|
|  130.255.13.57|
|  176.194.74.20|
|  176.214.240.8|
| 177.201.52.125|
|  178.35.29.219|
|    195.8.51.14|
|   2.86.149.141|
|    2.93.22.237|
|   204.44.90.14|
|  206.59.253.65|
| 213.24.132.190|
|   37.1.206.196|
| 37.113.244.223|
| 46.146.101.241|
|  46.166.75.133|
|   46.211.46.18|
|   5.138.36.198|
+---------------+
only showing top 20 rows

