# Lab Assignment

## Initialising the Spark context

In [1]:
import re # Using re module for regular expressions

from pyspark.sql import SparkSession # Importing SparkSession 

In [2]:
# Initialising sparkContext
spark = SparkSession.builder.appName("LogFileParser").getOrCreate()
sc = spark.sparkContext

In [3]:
# Path where the log file is located
filePath = "/user/dalonlobo2857/Spark"

# Read the log file using sc.textFile and store the rdd in logFileRdd
logFileRdd = sc.textFile(filePath)

In [4]:
# Check if the log file is read correctly
logFileRdd.take(2)

[u'in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
 u'uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0']
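
The two sample entries follow the Common Log Format: host, two placeholder fields, a bracketed timestamp, the quoted request line, the status code, and the byte count. As a minimal local sketch, the fields can be pulled out with plain `re`. It assumes a simplified pattern, and `CLF_PATTERN` and `parse_line` are hypothetical names. This is not the regular expression used in the answers below.

```python
import re

# Simplified Common Log Format pattern (an assumption for illustration;
# it is stricter than the regular expression used later in this notebook).
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '           # host and two placeholder fields
    r'\[(?P<timestamp>[^\]]+)\] '        # timestamp inside square brackets
    r'"(?P<method>\S+) (?P<url>\S+)'     # request method and URL
    r'(?: (?P<protocol>\S+))?" '         # optional protocol, closing quote
    r'(?P<status>\d{3}) (?P<size>\S+)$'  # status code and byte count
)


def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


sample = ('uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] '
          '"GET / HTTP/1.0" 304 0')
fields = parse_line(sample)
print(fields["host"], fields["url"], fields["status"])
```

Returning `None` for a non-matching line keeps malformed entries easy to count separately.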

## Answer to Q1

Find the top 10 requested URLs, along with the number of times each has been requested.

In [5]:
# The following regular expression is based on the pattern of a log file entry.
# It breaks the log entry into groups, which are then used to solve the problem.
# Each group of the regular expression is explained in the comments below.

regex = r"""(?:[\w\-]*(?=[/,@:&]))?[/,@:&]? # Used to cleanup the domain name/host name/ip
            ([\w\.\-\*]+)                   # First group with domain name/host name/ip (Group 1)
            (?:\s-){2}\s                    # Unwanted characters
            \[(.*)\]                        # Second group with timestamp               (Group 2)
            \s"                             # Space and ", these are consumed by regex
            ([\w]+)?\s*                     # Request type, ex: GET, POST               (Group 3)
            ([^\s]+)?\s*                    # Requested url                             (Group 4)
            (.*(?=[\s*][HTTP]))?\s*         # Unwanted characters                       (Group 5)
            (HTTP/\d\.\d)?                  # Request protocol and version              (Group 6)
            (.*)?                           # Unwanted characters                       (Group 7)
            "\s                             # " and space
            (\d+)\s                         # HTTP reply code                           (Group 8)
            ([\d\-]+)                       # Number of bytes returned by the server    (Group 9)
        """

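# The following function extracts the requested URL (Group 4) from a log line
def splitLines(line):
    try:
        return (re.match(regex, line, re.VERBOSE).groups()[3], 1)
    except Exception:
        # Count failures under a sentinel key; the value stays numeric so that
        # reduceByKey and the later sort on the count keep working
        return ('Regex match failed', 1)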

# Using map to apply 'splitLines' function on each line
# splitLines will return a tuple with Requested URL as key and count as value
urlCountRdd = logFileRdd.map(splitLines)

In [6]:
# Using reduceByKey function, since each entry in rdd is key = URL and Value = count
# reduceByKey function will do the aggregation 
reducedUrlCountRdd = urlCountRdd.reduceByKey(lambda x, y: x + y)
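
To make the aggregation step concrete, here is a purely local sketch of what `reduceByKey` computes for `(URL, 1)` pairs. The helper `reduce_by_key` is a hypothetical stand-in written for illustration. Spark applies the same combining function within each partition and then across partitions, which is why the function should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, like a local reduceByKey."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            # Key seen before: merge the running value with the new one
            combined[key] = func(combined[key], value)
        else:
            # First occurrence of the key: start from its value
            combined[key] = value
    return combined


# Made-up pairs in the same (URL, 1) shape that splitLines emits
pairs = [("/ksc.html", 1), ("/", 1), ("/ksc.html", 1)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)
```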

In [7]:
# Testing if any record in rdd has failed to match the regular expression
if reducedUrlCountRdd.filter(lambda x: x[0] == "Regex match failed").collect() == []:
    print("The supplied regular expression has matched all the entries successfully")

The supplied regular expression has matched all the entries successfully


In [8]:
# Function to format the output
def formatOutput(key, val):
    key = key.ljust(40)
    val = str(val).center(11)
    return (key, val)

### Result: Top 10 requested URLs, along with the number of times each has been requested

The URL '/' is ignored in the ranking because it gives no information about which page was requested. The top 11 rows are therefore displayed so that 10 URLs remain besides '/'.

In [9]:
# Display the top 11 URLs, so that 10 remain once '/' is ignored:
print("{0}|{1}|{2}".format("Rank".center(10, '-'), 
                           "URL".center(40, '-'), 
                           "Count".center(11, '-')))

for item, element in enumerate(reducedUrlCountRdd.takeOrdered(11, key=lambda x: -x[1])):
    print("{0}|{1}|{2}".format(str(item + 1).center(10, '-'),
                               *formatOutput(element[0], element[1])))

---Rank---|------------------URL-------------------|---Count---
----1-----|/images/NASA-logosmall.gif              |   97410   
----2-----|/images/KSC-logosmall.gif               |   75337   
----3-----|/images/MOSAIC-logosmall.gif            |   67448   
----4-----|/images/USA-logosmall.gif               |   67068   
----5-----|/images/WORLD-logosmall.gif             |   66444   
----6-----|/images/ksclogo-medium.gif              |   62778   
----7-----|/ksc.html                               |   43687   
----8-----|/history/apollo/images/apollo-logo1.gif |   37826   
----9-----|/images/launch-logo.gif                 |   35138   
----10----|/                                       |   30347   
----11----|/images/ksclogosmall.gif                |   27810   


> #### Insights
> - Most of the top requested URLs are .gif images (9 of the 11 rows shown)
> - Among HTML pages, /ksc.html is the most requested, with 43687 requests
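
Rather than displaying 11 rows and skipping '/' by eye, the ignored URL can be filtered out before ranking. The sketch below does this locally on the counts from the table above, using `heapq.nsmallest` with a negated key to mirror the `takeOrdered(n, key=lambda x: -x[1])` idiom. The names `url_counts` and `top_urls` are hypothetical.

```python
import heapq

# (URL, count) pairs copied from the result table above
url_counts = [
    ("/images/NASA-logosmall.gif", 97410),
    ("/images/KSC-logosmall.gif", 75337),
    ("/images/MOSAIC-logosmall.gif", 67448),
    ("/images/USA-logosmall.gif", 67068),
    ("/images/WORLD-logosmall.gif", 66444),
    ("/images/ksclogo-medium.gif", 62778),
    ("/ksc.html", 43687),
    ("/history/apollo/images/apollo-logo1.gif", 37826),
    ("/images/launch-logo.gif", 35138),
    ("/", 30347),
    ("/images/ksclogosmall.gif", 27810),
]


def top_urls(pairs, n, ignored=("/",)):
    """Return the n pairs with the highest counts, skipping ignored URLs."""
    kept = [pair for pair in pairs if pair[0] not in ignored]
    # Smallest negated count == largest count
    return heapq.nsmallest(n, kept, key=lambda pair: -pair[1])


top10 = top_urls(url_counts, 10)
for rank, (url, count) in enumerate(top10, start=1):
    print(rank, url, count)
```

On the RDD itself, the same effect could presumably be obtained by applying `filter(lambda x: x[0] != "/")` before `takeOrdered(10, ...)`.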

## Answer to Q2

Spark code to find the top 5 hosts/IPs making requests, along with their request counts.

In [10]:
# All the imports

from datetime import datetime

# import datatypes for schema
from pyspark.sql.types import StringType, StructType, StructField, TimestampType
from pyspark.sql.functions import desc

In [11]:
# The following regular expression is based on the pattern of a log file entry and was refined over many iterations.
# It breaks the log entry into groups, which are then used to solve the problem.
# Each group of the regular expression is explained in the comments below.

# Ignoring the timezone -0400 since all the entries are in the same timezone

regex = r"""(?:[\w\-]*(?=[/,@:&]))?[/,@:&]? # Used to cleanup the domain name/host name/ip
            ([\w\.\-\*]+)                   # First group with domain name/host name/ip (Group 1)
            (?:\s-){2}\s                    # Unwanted characters
            \[(.*)\s-0400\]                 # Second group with timestamp               (Group 2)
            \s"                             # Space and ", these are consumed by regex
            ([\w]+)?\s*                     # Request type, ex: GET, POST               (Group 3)
            ([^\s]+)?\s*                    # Requested url                             (Group 4)
            (.*(?=[\s*][HTTP]))?\s*         # Unwanted characters                       (Group 5)
            (HTTP/\d\.\d)?                  # Request protocol and version              (Group 6)
            (.*)?                           # Unwanted characters                       (Group 7)
            "\s                             # " and space
            (\d+)\s                         # HTTP reply code                           (Group 8)
            ([\d\-]+)                       # Number of bytes returned by the server    (Group 9)
        """

# The following function will split the lines based on the regular expression
def splitLines(line):
    try:
        tem = re.match(regex, line, re.VERBOSE).groups()
        return [tem[0], datetime.strptime(tem[1], '%d/%b/%Y:%H:%M:%S'), tem[3], tem[7], tem[8]]
    except Exception as e:
        return ['Regex match failed', e]

# Using map to apply 'splitLines' function on each line
# splitLines will return a list with all the required matched groups
generalRdd = logFileRdd.map(splitLines)

In [12]:
# Display one element of rdd
generalRdd.take(1)

[[u'in24.inetnebr.com',
  datetime.datetime(1995, 8, 1, 0, 0, 1),
  u'/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
  u'200',
  u'1839']]
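
The parsed timestamp above is naive because the `-0400` offset is matched by the regular expression and discarded. As a hedged alternative, assuming Python 3 (where `strptime` supports `%z`) and an English locale for the `%b` month abbreviation, the offset could be kept. `parse_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def parse_timestamp(text):
    """Parse a log timestamp such as '01/Aug/1995:00:00:01 -0400'.

    %z keeps the UTC offset, so the result is timezone-aware.
    %b (month abbreviation) depends on the current locale.
    """
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


ts = parse_timestamp("01/Aug/1995:00:00:01 -0400")
print(ts.isoformat())
```

Dropping the offset, as the notebook does, is adequate while every entry shares the same timezone.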

- Converting the RDD to a DataFrame for ease of use

In [13]:
# Converting the rdd to df

fields = [StructField("HostName", StringType()),
         StructField("TimeStamp", TimestampType()),
         StructField("URL", StringType()),
         StructField("ResponseCode", StringType()),
         StructField("Bytes", StringType())]

schema = StructType(fields)

logDf = generalRdd.toDF(schema)
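
The schema expects five fields per record, whereas the failure branch of `splitLines` returns a two-element list. A minimal local sketch of separating well-formed records from failures before building the DataFrame is shown below. `split_records` is a hypothetical helper, and the second sample record is made up.

```python
from datetime import datetime


def split_records(records, expected_fields=5):
    """Separate records that fit the schema from parse failures."""
    rows, failures = [], []
    for record in records:
        is_failure = record[0] == "Regex match failed"
        if len(record) == expected_fields and not is_failure:
            rows.append(tuple(record))
        else:
            failures.append(record)
    return rows, failures


records = [
    ["in24.inetnebr.com", datetime(1995, 8, 1, 0, 0, 1),
     "/shuttle/missions/sts-68/news/sts-68-mcc-05.txt", "200", "1839"],
    ["Regex match failed", "made-up error message"],  # hypothetical failure
]
rows, failures = split_records(records)
print(len(rows), len(failures))
```

In Spark, the equivalent would presumably be a `filter` on the RDD (for example on the record length) applied before `toDF(schema)`.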

In [14]:
# Display the values in data frame
logDf.show(5)

+-----------------+--------------------+--------------------+------------+-----+
|         HostName|           TimeStamp|                 URL|ResponseCode|Bytes|
+-----------------+--------------------+--------------------+------------+-----+
|in24.inetnebr.com|1995-08-01 00:00:...|/shuttle/missions...|         200| 1839|
|  uplherc.upl.com|1995-08-01 00:00:...|                   /|         304|    0|
|  uplherc.upl.com|1995-08-01 00:00:...|/images/ksclogo-m...|         304|    0|
|  uplherc.upl.com|1995-08-01 00:00:...|/images/MOSAIC-lo...|         304|    0|
|  uplherc.upl.com|1995-08-01 00:00:...|/images/USA-logos...|         304|    0|
+-----------------+--------------------+--------------------+------------+-----+
only showing top 5 rows



In [15]:
groupByHostNameDF = logDf.groupBy(logDf.HostName).count()

### Result: Top 5 hosts/IPs making requests, along with their request counts

In [16]:
groupByHostNameDF.orderBy(desc("count")).show(5)

+--------------------+-----+
|            HostName|count|
+--------------------+-----+
|  edams.ksc.nasa.gov| 6530|
|piweba4y.prodigy.com| 4846|
|        163.206.89.4| 4791|
|piweba5y.prodigy.com| 4607|
|piweba3y.prodigy.com| 4416|
+--------------------+-----+
only showing top 5 rows



> #### Insights
> The server received the most requests (6530) from edams.ksc.nasa.gov

## Answer to Q3

Spark code to find the top 5 time frames with the highest traffic.

In [17]:
from pyspark.sql.functions import date_format, udf

In [18]:
# Creating a new column, which contains the hourly TimeFrame
# Note: "yyyy" is the calendar year, whereas "YYYY" is the week-based year
timeFramelogDf = logDf.withColumn("TimeFrame", date_format(logDf["TimeStamp"], "dd/MM/yyyy:HH"))


In [19]:
groupByTimeFrameDf = timeFramelogDf.groupBy(timeFramelogDf.TimeFrame).count()

### Result: Top 5 time frames with the highest traffic

In [20]:
timeFrameOrderDesc = groupByTimeFrameDf.orderBy(desc("count"))
timeFrameOrderDesc.show(5)

+-------------+-----+
|    TimeFrame|count|
+-------------+-----+
|31/08/1995:11| 6321|
|31/08/1995:10| 6283|
|31/08/1995:13| 5948|
|30/08/1995:15| 5919|
|31/08/1995:09| 5627|
+-------------+-----+
only showing top 5 rows



> #### Insights
> - The top 5 time frames with the highest traffic are displayed above
> - During the hour starting at 11:00 on 31/08/1995, the server received its highest traffic: 6321 requests.
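
Each time frame is simply the timestamp truncated to the hour and rendered as a string, after which identical strings are counted. A local sketch of that bucketing with `strftime` and `Counter` follows. The timestamps are made up, and `hour_bucket` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts):
    """Render a timestamp as day/month/year:hour, dropping minutes and seconds."""
    return ts.strftime("%d/%m/%Y:%H")


# Made-up timestamps: two fall in the 11:00 hour, one in the 10:00 hour
timestamps = [
    datetime(1995, 8, 31, 11, 5, 0),
    datetime(1995, 8, 31, 11, 59, 59),
    datetime(1995, 8, 31, 10, 0, 0),
]
counts = Counter(hour_bucket(ts) for ts in timestamps)
print(counts.most_common(1))
```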

### Grouping by day of week and hour of day

In [21]:
# User defined function that formats a timestamp as "<weekday name> <hour>", e.g. "Thursday 15"
formatTimestamp = udf(lambda x: x.strftime("%A %H"), StringType())

# Creating a new column, which contains WeekdayHour
weekdayHourlogDf = logDf.withColumn("WeekdayHour", formatTimestamp(logDf["TimeStamp"]))

In [22]:
weekdayHourlogDf.groupBy(weekdayHourlogDf.WeekdayHour)\
                .count()\
                .orderBy(desc("count"))\
                .show(5)

+-----------+-----+
|WeekdayHour|count|
+-----------+-----+
|Thursday 15|23380|
|Thursday 12|23035|
| Tuesday 13|21115|
| Tuesday 12|20908|
|Thursday 13|20423|
+-----------+-----+
only showing top 5 rows



> #### Insights
> The company receives peak traffic on Thursdays during the 15:00 hour
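
The weekday and hour label can be checked locally without Spark. The sketch below avoids the locale-dependent `%A` by using a fixed list of day names, and returns `None` for a missing timestamp instead of raising. `DAY_NAMES` and `weekday_hour` are hypothetical names, not part of the notebook.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_hour(ts):
    """Format a timestamp as '<weekday name> <two-digit hour>'."""
    if ts is None:
        return None
    return "{0} {1:02d}".format(DAY_NAMES[ts.weekday()], ts.hour)


print(weekday_hour(datetime(1995, 8, 31, 15, 30)))  # prints "Thursday 15"
```

A function like this could be wrapped in `udf(..., StringType())` in the same way as the lambda above.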

## Answer to Q4

Spark code to find the 5 time frames with the least traffic.

In [23]:
# I will reuse groupByTimeFrameDf from Q3
timeFrameOrderAsc = groupByTimeFrameDf.orderBy("count")

### Result: 5 time frames with the least traffic

In [24]:
timeFrameOrderAsc.show(5)

+-------------+-----+
|    TimeFrame|count|
+-------------+-----+
|03/08/1995:04|   16|
|03/08/1995:09|   22|
|03/08/1995:05|   43|
|03/08/1995:10|   57|
|03/08/1995:07|   58|
+-------------+-----+
only showing top 5 rows



> #### Insights
> - The 5 time frames with the least traffic are displayed above
> - During the hour starting at 04:00 on 03/08/1995, the server received its lowest recorded traffic: only 16 requests.
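
One caveat with grouping: an hour with no requests at all produces no rows, so it never appears in a grouped count and cannot show up as the minimum. A local sketch of listing such gaps is shown below. The counts are made up for a different date, and `missing_hours` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def missing_hours(start, end, seen, fmt="%d/%m/%Y:%H"):
    """List the hour buckets between start and end that are absent from seen."""
    current = start.replace(minute=0, second=0, microsecond=0)
    missing = []
    while current <= end:
        label = current.strftime(fmt)
        if label not in seen:
            missing.append(label)
        current += timedelta(hours=1)
    return missing


# Made-up hourly counts with no entry for the 02:00 hour
seen = {"01/01/2000:00": 12, "01/01/2000:01": 7, "01/01/2000:03": 9}
gaps = missing_hours(datetime(2000, 1, 1, 0), datetime(2000, 1, 1, 3), seen)
print(gaps)
```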

### Grouping by day of week and hour of day

In [25]:
# using the weekdayHourlogDf from Q3
weekdayHourlogDf.groupBy(weekdayHourlogDf.WeekdayHour)\
                .count()\
                .orderBy("count")\
                .show(5)

+-----------+-----+
|WeekdayHour|count|
+-----------+-----+
|  Sunday 06| 2437|
|Saturday 05| 2579|
|  Sunday 05| 2734|
|Saturday 06| 2748|
|  Sunday 04| 2807|
+-----------+-----+
only showing top 5 rows



> #### Insights
> The company can schedule production deployments on Sundays during the 06:00 hour, as the servers are least used at that time.

## Answer to Q5

Spark code to find the unique HTTP response codes returned by the server, along with their counts.


In [26]:
# Using logDf from Q2
groupByRespCodeDF = logDf.groupBy(logDf.ResponseCode).count()

### Result: Unique HTTP response codes returned by the server, along with their counts

In [27]:
# Formatting the result
print("{0}|{1}".format("Response Code".center(20, '-'),
                       "Count".center(15, '-')))

for row in groupByRespCodeDF.orderBy(desc("count")).collect():
    print("{0}|{1}".format(str(row.ResponseCode).center(20, ' '),
                           str(row["count"]).rjust(11, ' ')))



---Response Code----|-----Count-----
        200         |    1398988
        304         |     134146
        302         |      26497
        404         |      10056
        403         |        171
        501         |         27
        400         |         10
        500         |          3


> #### Insights
> - Most requests (1398988) received response code 200
> - Only 3 requests resulted in an internal server error (response code 500)
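
The individual codes can also be rolled up into their classes (2xx success, 3xx redirection, 4xx client error, 5xx server error). The sketch below does this locally on the counts from the table above. `status_counts` and `count_by_class` are hypothetical names.

```python
from collections import defaultdict

# Response code counts copied from the result table above
status_counts = {
    "200": 1398988,
    "304": 134146,
    "302": 26497,
    "404": 10056,
    "403": 171,
    "501": 27,
    "400": 10,
    "500": 3,
}


def count_by_class(counts):
    """Sum the counts per status class, keyed by the first digit of the code."""
    totals = defaultdict(int)
    for code, count in counts.items():
        totals[code[0] + "xx"] += count
    return dict(totals)


by_class = count_by_class(status_counts)
for status_class in sorted(by_class):
    print(status_class, by_class[status_class])
```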