# Apache Server Log Analysis

## Objective:

Perform server log analysis to help businesses identify and analyze critical business errors, as well as potential customers and their domains.

Dataset Name: server-access-log.txt (Can be downloaded from the Course Resources tab)

Access_log file description:

Snippet:

10.1.2.3 - rehg [10/Nov/2021:19:22:12 -0000] "GET /sematext.png HTTP/1.1" 200 3423

The following elements are present in the dataset:

%h: Resolved into 10.1.2.3 – the IP address of the remote host that made the request
%l: The remote logname supplied by identd; a hyphen (-) is logged when the requested information is not available
%u: Resolved into rehg, the user identifier determined by the HTTP authentication
%t: The date and time of the request with the time zone; in the above case it is [10/Nov/2021:19:22:12 -0000]
\"%r\": The first line of the request inside double quotes; in the above case it is "GET /sematext.png HTTP/1.1"
%>s: The status code reported to the client. This information is crucial because it shows whether the request was successful or not.
%b: The size of the object sent to the client; in our case, the object was the sematext.png file and its size was 3423 bytes.
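
The field list above can be turned into a small parser. The following is a minimal sketch in plain Python, assuming the line follows the layout shown in the snippet; the names `LOG_PATTERN` and `parse_log_line` are made up for illustration, and any extra trailing fields (such as a referrer or user agent) are simply ignored.

```python
import re

# Pattern for the fields described above (host, logname, user, time,
# request line, status, size). Trailing fields are ignored by this sketch.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache logs "-" when no bytes were sent; treat that as 0 here.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = '10.1.2.3 - rehg [10/Nov/2021:19:22:12 -0000] "GET /sematext.png HTTP/1.1" 200 3423'
parsed = parse_log_line(sample)
print(parsed["host"], parsed["request"], parsed["status"], parsed["size"])
```

A regex is somewhat more robust than splitting on spaces, because the quoted request line stays in one piece, but it is only one possible approach.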

## Status Code Analysis
    Read the log file as an RDD in PySpark
    Take the sixth space-separated element (index 5), which holds the request method, and remove its leading double quote
    Convert each word into a tuple of (word, 1)
    Apply the "reduceByKey" transformation to count the values
    Display the data

In [5]:
# Read the log file as an RDD in PySpark via Spark Session and Spark Context
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark').getOrCreate()

log_rdd = spark.sparkContext.textFile('server-access-log.txt')

In [45]:
log_rdd.count()

98378

In [24]:
# Print the first two lines to check out the data and keep one line to work with
test_line = ""
for i, line in enumerate(log_rdd.take(2)):
    print((line, i))
    test_line = line  # a plain string, so it can be split in the next cell

('13.66.139.0 - - [19/Dec/2020:13:57:26 +0100] "GET /index.php?option=com_phocagallery&view=category&id=1:almhuette-raith&Itemid=53 HTTP/1.1" 200 32653 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"', 0)
('157.48.153.185 - - [19/Dec/2020:14:08:06 +0100] "GET /apache-log/access.log HTTP/1.1" 200 233 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"', 1)


In [38]:
word = test_line.split(" ")[5].replace('"','')

In [39]:
word = (word,1)

In [40]:
word

('GET', 1)

In [64]:
code_rdd = log_rdd.filter(lambda line : line != "")\
        .map(lambda line : (line.split(" ")[5].replace('"',''),1))\
        .reduceByKey(lambda a,b:a+b)
# .filter drops empty lines
# .map applies a function to each line, turning it into a (method, 1) pair
# .reduceByKey is a wide transformation (it shuffles data), but it first combines
# values locally within each partition before sending the partial sums across
# the network, which reduces network overhead
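
The local-combine behaviour described in the comments can be imitated in plain Python. This is only a sketch of the idea, not how Spark is implemented: the two partitions are made up, and `local_combine` and `merge_partials` are hypothetical helper names.

```python
from collections import Counter

def local_combine(partition):
    """Sum the counts per key inside one partition (the map-side combine)."""
    partial = Counter()
    for key, value in partition:
        partial[key] += value
    return partial

def merge_partials(partials):
    """Merge the per-partition partial sums (the step after the shuffle)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

# Two hypothetical partitions of (method, 1) pairs.
partitions = [
    [("GET", 1), ("POST", 1), ("GET", 1)],
    [("POST", 1), ("HEAD", 1), ("POST", 1)],
]
partials = [local_combine(p) for p in partitions]
# Only one record per key per partition has to cross the "network".
shuffled_records = sum(len(p) for p in partials)
counts = merge_partials(partials)
print(shuffled_records, counts)
```

Six input pairs shrink to four partial sums before the merge, which is where the saving in network traffic comes from.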

In [65]:
# print result
for line in code_rdd.collect():
    print(line)

('GET', 41075)
('HEAD', 307)
('POST', 56995)
('OPTIONS', 1)


In [70]:
# sort by descending order
ranked_rdd = code_rdd.sortBy(lambda x : x[1],ascending = False)
for line in ranked_rdd.collect():
    print(line)

('POST', 56995)
('GET', 41075)
('HEAD', 307)
('OPTIONS', 1)


## Identify the top 10 frequent visitors of the website

In [79]:
# let's assume each IP address counts as one visitor
visitor_rdd = log_rdd.filter(lambda x : x != "")\
                .map(lambda line : (line.split(' ')[0],1))\
                .reduceByKey(lambda a,b : a+b)\
                .sortBy(lambda x : x[1], ascending = False)
# filter eliminates any empty lines
# map transforms each row into a tuple (ip_address, 1), which reduceByKey then
# uses to count all occurrences of each IP address
# sortBy finally sorts the counts in descending order

In [80]:
# take(10) fetches only the first ten rows instead of collecting the whole RDD
for rank, entry in enumerate(visitor_rdd.take(10)):
    print((entry, rank))

(('193.106.31.130', 43423), 0)
(('173.255.176.5', 5220), 1)
(('178.44.47.170', 2824), 2)
(('51.210.183.78', 2684), 3)
(('45.15.143.155', 1927), 4)
(('45.144.0.179', 946), 5)
(('176.222.58.254', 934), 6)
(('45.132.207.154', 890), 7)
(('45.153.227.55', 888), 8)
(('45.138.4.22', 880), 9)


## Identify the top 10 missing (does not exist) URLs
    Read the log file as an RDD in PySpark
    Identify the URLs for which the server returns the 404 status code and display the data

In [302]:
missing_rdd = log_rdd.map(lambda line : line.split(' ')[8:11])\
                    .filter(lambda line : line[0] == '404')\
                    .map(lambda line : (line[2].replace('"',''),1))\
                    .reduceByKey(lambda a,b : a+b)\
                    .sortBy(lambda line : line[1],ascending=False)
# After some exploratory analysis (see the Reference section), the two values used
# here sit at indices 8 (status code) and 10 of the split line, so map() slices [8:11].
# Note: index 10 is the quoted field after the response size (the referrer in Apache's
# combined log format); the requested path itself is at index 6.
# filter() then keeps only the entries with a 404 status code.
# The second map() strips the double quotes from the URL string.
# Finally, reduceByKey() counts the occurrences of each URL, and sortBy() sorts descending.

In [303]:
# I could've filtered out '-', but I decided to keep it in here because
# it reflects the data most accurately ... the most missing url is ... a missing url

for line in zip(range(11),missing_rdd.collect()):
    print(line)

(0, ('-', 3070))
(1, ('http://www.almhuette-raith.at', 609))
(2, ('http://www.almhuette-raith.at/', 447))
(3, ('http://www.almhuette-raith.at/apache-log/access.log', 398))
(4, ('http://www.almhuette-raith.at/apache-log/', 183))
(5, ('http://almhuette-raith.at/', 153))
(6, ('http://www.almhuette-raith.at/index.php?option=com_phocagallery&view=category&id=1&Itemid=53', 90))
(7, ('http://www.almhuette-raith.at/index.php?option=com_content&view=article&id=49&Itemid=55', 68))
(8, ('http://www.almhuette-raith.at/index.php?option=com_content&view=article&id=50&Itemid=56', 53))
(9, ('http://www.almhuette-raith.at/robots.txt', 51))
(10, ('http://www.almhuette-raith.at/index.php?option=com_content&view=article&id=46&Itemid=54', 29))
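
The values listed above come from index 10 of the split line. As an alternative reading of "missing URLs", the sketch below keys on the requested path at index 6 instead. It is plain Python so it can be run without Spark; `missing_path` is a hypothetical helper and the sample lines are made up for illustration, not taken from the dataset.

```python
from collections import Counter

def missing_path(line):
    """Return the requested path if the line has status 404, else None."""
    parts = line.split(' ')
    # Index 6 is the path and index 8 the status code (see the Reference section).
    if len(parts) > 8 and parts[8] == '404':
        return parts[6]
    return None

# Hypothetical sample lines, made up for illustration only.
sample_lines = [
    '1.1.1.1 - - [19/Dec/2020:13:57:26 +0100] "GET /missing.html HTTP/1.1" 404 217 "-" "agent" "-"',
    '1.1.1.2 - - [19/Dec/2020:13:58:00 +0100] "GET /index.php HTTP/1.1" 200 1024 "-" "agent" "-"',
    '1.1.1.3 - - [19/Dec/2020:13:59:10 +0100] "GET /missing.html HTTP/1.1" 404 217 "-" "agent" "-"',
    '',
]
paths = [missing_path(line) for line in sample_lines]
missing_counts = Counter(p for p in paths if p is not None)
print(missing_counts.most_common(10))
```

The same helper could presumably be applied to the RDD with `log_rdd.map(missing_path).filter(lambda p: p is not None)` followed by the usual `(path, 1)` and `reduceByKey` steps.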


## Identify the traffic (total number of HTTP requests received per day)
    Read the log file as an RDD in PySpark
    Fetch the DateTime string and replace "[" with blank
    Get the date string from the DateTime
    Identify HTTP requests using the map function
    Display the data

In [259]:
traffic_rdd = log_rdd.filter(lambda line : line.split(' ')[7].split(r'/')[0] == 'HTTP')\
                    .map(lambda line : (line.split(' ')[3].replace('[','').split(':')[0],1))\
                    .reduceByKey(lambda a,b:a+b)\
                    .sortBy(lambda line:line[1],ascending=False)

traffic_rdd.collect()

[('28/Dec/2020', 7478),
 ('25/Dec/2020', 5644),
 ('18/Jan/2021', 4988),
 ('11/Jan/2021', 4283),
 ('08/Jan/2021', 4056),
 ('21/Dec/2020', 3982),
 ('23/Dec/2020', 3856),
 ('20/Dec/2020', 3698),
 ('22/Dec/2020', 3645),
 ('24/Dec/2020', 3607),
 ('07/Jan/2021', 3098),
 ('29/Dec/2020', 2919),
 ('09/Jan/2021', 2805),
 ('04/Jan/2021', 2788),
 ('17/Jan/2021', 2498),
 ('13/Jan/2021', 2475),
 ('30/Dec/2020', 2389),
 ('06/Jan/2021', 2386),
 ('03/Jan/2021', 2379),
 ('16/Jan/2021', 2328),
 ('10/Jan/2021', 2313),
 ('19/Jan/2021', 2302),
 ('12/Jan/2021', 2300),
 ('26/Dec/2020', 2269),
 ('15/Jan/2021', 2227),
 ('20/Jan/2021', 2204),
 ('27/Dec/2020', 2181),
 ('01/Jan/2021', 2165),
 ('31/Dec/2020', 2067),
 ('05/Jan/2021', 2017),
 ('14/Jan/2021', 1954),
 ('02/Jan/2021', 1942),
 ('19/Dec/2020', 1135)]
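
The result above is ordered by request volume. To read it as a timeline instead, the day strings can be parsed into dates and sorted chronologically. A minimal sketch in plain Python, using three of the pairs from the output above; `day_key` is a hypothetical helper name.

```python
from datetime import datetime

def day_key(day_string):
    """Parse a 'dd/Mon/yyyy' string such as '28/Dec/2020' into a date."""
    # %b expects English month abbreviations under the default C locale.
    return datetime.strptime(day_string, "%d/%b/%Y").date()

# A few (day, count) pairs copied from the output above.
traffic = [
    ('28/Dec/2020', 7478),
    ('18/Jan/2021', 4988),
    ('19/Dec/2020', 1135),
]
chronological = sorted(traffic, key=lambda pair: day_key(pair[0]))
print(chronological)
```

Sorting the raw strings would not work here, because '18/Jan/2021' sorts before '19/Dec/2020' alphabetically even though it is the later date.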

## Identify the top 10 endpoints that transfer maximum content in megabytes and display the data

In [307]:
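# keep only lines with a numeric response size (index 9), then sort the sizes descending
# note: only the sizes are kept here, not the endpoint paths
max_rdd = log_rdd.filter(lambda line : line.split(' ')[9] != '-')\
                .map(lambda line : int(line.split(' ')[9]))\
                .sortBy(lambda line:line,ascending=False)

# convert bytes to megabytes (1 MB = 1,000,000 bytes) and show the ten largest
for line in zip(max_rdd.map(lambda x:str(x/1000000)+" MB").collect(),range(10)):
    print(line)

('19.734268 MB', 0)
('19.733582 MB', 1)
('19.733209 MB', 2)
('19.732606 MB', 3)
('19.689319 MB', 4)
('19.675282 MB', 5)
('19.674632 MB', 6)
('19.666675 MB', 7)
('19.666118 MB', 8)
('19.655908 MB', 9)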
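
The list above shows the largest individual responses without their paths. One possible reading of "endpoints that transfer maximum content" is the total number of bytes served per path. The sketch below follows that reading in plain Python; `endpoint_and_bytes` is a hypothetical helper and the sample lines are made up for illustration, not taken from the dataset.

```python
from collections import defaultdict

def endpoint_and_bytes(line):
    """Return (path, bytes_sent), or None when the size is missing."""
    parts = line.split(' ')
    # Index 6 is the path and index 9 the response size (see the Reference section).
    if len(parts) <= 9 or not parts[9].isdigit():
        return None
    return parts[6], int(parts[9])

# Hypothetical sample lines, made up for illustration only.
sample_lines = [
    '1.1.1.1 - - [19/Dec/2020:13:57:26 +0100] "GET /video.mp4 HTTP/1.1" 200 3000000 "-" "agent" "-"',
    '1.1.1.2 - - [19/Dec/2020:13:58:00 +0100] "GET /index.php HTTP/1.1" 200 500000 "-" "agent" "-"',
    '1.1.1.3 - - [19/Dec/2020:13:59:10 +0100] "GET /video.mp4 HTTP/1.1" 200 1000000 "-" "agent" "-"',
    '1.1.1.4 - - [19/Dec/2020:14:00:00 +0100] "GET /gone HTTP/1.1" 304 - "-" "agent" "-"',
]

total_bytes = defaultdict(int)
for line in sample_lines:
    pair = endpoint_and_bytes(line)
    if pair is not None:
        path, size = pair
        total_bytes[path] += size

# Convert to megabytes (1 MB = 1,000,000 bytes) and rank, largest first.
top_endpoints = sorted(
    ((path, size / 1_000_000) for path, size in total_bytes.items()),
    key=lambda item: item[1],
    reverse=True,
)[:10]
print(top_endpoints)
```

On the RDD, the same pairing could feed `reduceByKey` to sum the bytes per path before sorting.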


## Reference

In [308]:
# Split the first log line on spaces and print each field with its index
for index, field in enumerate(log_rdd.first().split(' ')):
    print(str(index) + " " + field)

0 13.66.139.0
1 -
2 -
3 [19/Dec/2020:13:57:26
4 +0100]
5 "GET
6 /index.php?option=com_phocagallery&view=category&id=1:almhuette-raith&Itemid=53
7 HTTP/1.1"
8 200
9 32653
10 "-"
11 "Mozilla/5.0
12 (compatible;
13 bingbot/2.0;
14 +http://www.bing.com/bingbot.htm)"
15 "-"
