# Processing NASA Logs  
  
This challenge was solved with PySpark: Spark SQL functions clean and wrangle the raw logs, and DataFrame operations answer the proposed questions.  
  
The challenge consists of answering the following questions:  
  
- Number of unique hosts  
- Total 404 errors  
- The 5 URLs that cause the most 404 errors  
- Number of 404 errors for each day  
- Total bytes returned  
  
Official dataset source:  
https://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html  
  
Data:  
ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz  
ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz  

About the dataset:  
The dataset holds all the HTTP requests made to the server at NASA's Kennedy Space Center during July and August of 1995.  

## Data Preparation

In [1]:
# Creating a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Processing NASA Logs').getOrCreate()

In [2]:
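# Reading files into a dataframe, one log line per row
# (read as text rather than CSV so a comma inside a URL does not split the line;
# the column is renamed to '_c0' so the following cells work unchanged)
df = spark.read.text('access_log_*').withColumnRenamed('value', '_c0')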

In [3]:
# Splitting each row into columns
from pyspark.sql.functions import split

split_col = split(df['_c0'], ' ')
df = df.withColumn('host', split_col.getItem(0)) \
       .withColumn('rfc931', split_col.getItem(1)) \
       .withColumn('username', split_col.getItem(2)) \
       .withColumn('date:time', split_col.getItem(3)) \
       .withColumn('timezone', split_col.getItem(4)) \
       .withColumn('method', split_col.getItem(5)) \
       .withColumn('resource', split_col.getItem(6)) \
       .withColumn('protocol', split_col.getItem(7)) \
       .withColumn('statuscode', split_col.getItem(8)) \
       .withColumn('bytes', split_col.getItem(9))
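
Splitting on single spaces assumes that every line has exactly ten space-separated tokens. A request whose resource contains a space, or a malformed request without a protocol, would shift `statuscode` and `bytes` into the wrong columns. A minimal sketch of a regex-based alternative in plain Python follows; the pattern, the `parse_line` helper, and the sample line are illustrative assumptions, not part of the original notebook. A similar pattern could in principle be applied in Spark with `regexp_extract`.

```python
import re

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfc931>\S+) (?P<username>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>.*)" '
    r'(?P<statuscode>\d{3}) (?P<bytes>\S+)$'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical log line whose resource contains a space
sample = ('example.host - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /docs/my file.html HTTP/1.0" 404 -')

fields = parse_line(sample)
print(fields['request'])     # the full request, brackets and quotes already removed
print(fields['statuscode'])  # stays aligned even though the request has an extra space
```

Because the request is captured between the quotes as a whole, the status code and byte count are always taken from the end of the line, regardless of how many spaces the request contains.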

In [4]:
# Merging appropriate columns
from pyspark.sql.functions import col, concat, lit

df = df.withColumn('date:time timezone', concat(col('date:time'), lit(' '), col('timezone')))
df = df.withColumn('request', concat(col('method'), lit(' '), col('resource'), lit(' '), col('protocol')))

In [5]:
# Dropping redundant columns
df = df.drop('_c0', 'date:time', 'timezone', 'method', 'resource', 'protocol')

In [6]:
# Reordering the dataframe
df = df.select('host', 'rfc931', 'username', 'date:time timezone', 'request', 'statuscode', 'bytes')

In [7]:
# Cleaning request and date:time timezone columns
from pyspark.sql.functions import expr, substring

df = df.withColumn('request', expr('substring(request, 2, length(request)-2)'))
df = df.withColumn('date:time timezone', substring(col('date:time timezone'), 2, 26))

In [8]:
# Casting date:time timezone as TimeStamp and bytes as Int
from pyspark.sql.functions import to_timestamp

df = df.withColumn('date:time timezone', to_timestamp(col('date:time timezone'), 'dd/MMM/yyyy:HH:mm:ss Z'))
df = df.withColumn('bytes', col('bytes').cast('int'))

In [9]:
# Visualizing dataframe
df.show(truncate=False)
df.printSchema()

+---------------------------+------+--------+-------------------+----------------------------------------------------------------+----------+-----+
|host                       |rfc931|username|date:time timezone |request                                                         |statuscode|bytes|
+---------------------------+------+--------+-------------------+----------------------------------------------------------------+----------+-----+
|in24.inetnebr.com          |-     |-       |1995-08-01 01:00:01|GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0    |200       |1839 |
|uplherc.upl.com            |-     |-       |1995-08-01 01:00:07|GET / HTTP/1.0                                                  |304       |0    |
|uplherc.upl.com            |-     |-       |1995-08-01 01:00:08|GET /images/ksclogo-medium.gif HTTP/1.0                         |304       |0    |
|uplherc.upl.com            |-     |-       |1995-08-01 01:00:08|GET /images/MOSAIC-logosmall.gif HTTP/1.0      

## Number of unique hosts
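
One possible query for this question, assuming the `df` built in the data preparation step, is `df.select('host').distinct().count()`. The plain-Python sketch below mirrors that logic on a few hypothetical host values so it can be checked without a Spark session; it is an illustration, not the notebook's original cell.

```python
# Hypothetical values standing in for the `host` column
hosts = [
    'in24.inetnebr.com',
    'uplherc.upl.com',
    'uplherc.upl.com',   # repeated host, counted once
    'example.host',
]

unique_hosts = set(hosts)  # a set keeps each distinct host exactly once
print(len(unique_hosts))
```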

## Total 404 errors
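
A plausible query here, again assuming the prepared `df`, is `df.filter(col('statuscode') == '404').count()`. Note that `statuscode` is never cast in the preparation cells, so it is still a string column. The sketch below shows the same filter-and-count logic on hypothetical records.

```python
# Hypothetical (host, statuscode) pairs; statuscode is a string, as in `df`
records = [
    ('in24.inetnebr.com', '200'),
    ('uplherc.upl.com', '304'),
    ('example.host', '404'),
    ('another.example', '404'),
]

# Count only the records whose status code is exactly '404'
total_404 = sum(1 for _, status in records if status == '404')
print(total_404)
```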

## The 5 URLs that cause the most 404 errors
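
Since the `resource` column was merged into `request` and then dropped, the URL has to be recovered as the second token of `request`. In Spark this could be done by grouping the 404 rows on `split(col('request'), ' ').getItem(1)`, counting, and keeping the five largest groups. The sketch below shows that idea in plain Python with `collections.Counter`; the sample requests are hypothetical.

```python
from collections import Counter

# Hypothetical (request, statuscode) pairs
rows = [
    ('GET /pub/winvn/readme.txt HTTP/1.0', '404'),
    ('GET /pub/winvn/readme.txt HTTP/1.0', '404'),
    ('GET /missing.html HTTP/1.0', '404'),
    ('GET / HTTP/1.0', '200'),
]

url_counts = Counter()
for request, status in rows:
    parts = request.split(' ')
    # Skip non-404 rows and malformed requests that have no URL token
    if status == '404' and len(parts) > 1:
        url_counts[parts[1]] += 1

top_urls = url_counts.most_common(5)  # at most five (url, count) pairs, highest first
for url, count in top_urls:
    print(url, count)
```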

## Number of 404 errors for each day
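
A possible Spark query is to filter the 404 rows, group by `to_date(col('date:time timezone'))`, and count. One caveat: Spark displays timestamps in the session time zone, so the day a request falls on can differ from the day written in the raw log. The sketch below groups by the date in the log's own UTC offset, which is an assumption about the intended definition of "day"; the sample entries are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, statuscode) pairs in the raw log format
entries = [
    ('01/Aug/1995:00:00:01 -0400', '404'),
    ('01/Aug/1995:23:59:59 -0400', '404'),
    ('02/Aug/1995:08:15:00 -0400', '404'),
    ('02/Aug/1995:09:00:00 -0400', '200'),
]

LOG_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'

# Count 404 responses per calendar day, keeping the offset given in the log
errors_per_day = Counter(
    datetime.strptime(timestamp, LOG_TIME_FORMAT).date()
    for timestamp, status in entries
    if status == '404'
)

for day, count in sorted(errors_per_day.items()):
    print(day, count)
```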

## Total bytes returned
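
With `bytes` cast to an integer column, one way to get the total is `df.agg({'bytes': 'sum'})`. Values such as `-`, which the log uses when no body was returned, become null after the cast, and Spark's `sum` skips nulls. The sketch below reproduces that behavior in plain Python; the `to_int_or_none` helper and the sample values are illustrative assumptions.

```python
# Hypothetical raw values of the `bytes` field; '-' means no body was returned
raw_bytes = ['1839', '0', '-', '786']


def to_int_or_none(value):
    """Mimic the cast to int: non-numeric values become None (null)."""
    return int(value) if value.isdigit() else None


# Sum only the values that survived the cast, as a null-skipping sum would
total_bytes = sum(
    size for size in map(to_int_or_none, raw_bytes) if size is not None
)
print(total_bytes)
```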