## In this Notebook we will learn how to write a customized function for PySpark

### Reading in our initial dataset
For this first section, we'll work with a set of Apache log files. Databricks makes these log files available in the databricks-datasets directory, which sits right at the root directory.

In [3]:
log_files = "/databricks-datasets/sample_logs"

In [4]:
import re
from pyspark.sql import Row
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'

# Returns a Row containing the parts of one Apache access log line.
def parse_apache_log_line(logline):
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    if match is None:
        # Alternatively, skip the line here if individual lines are not critical.
        # For this example, we fail fast so that an inconsistent format is noticed.
        raise Exception("Invalid logline: %s" % logline)
    return Row(
        ipAddress    = match.group(1),
        clientIdentd = match.group(2),
        userId       = match.group(3),
        dateTime     = match.group(4),
        method       = match.group(5),
        endpoint     = match.group(6),
        protocol     = match.group(7),
        responseCode = int(match.group(8)),
        contentSize  = int(match.group(9)))
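
The comment inside the parser mentions ignoring malformed lines as an alternative to raising. Below is a minimal sketch of that variant, written in plain Python so it can be tried without a cluster. The name `try_parse_apache_log_line`, the `FIELD_NAMES` list, and the two sample lines are hypothetical, and the function returns a plain dict instead of a `Row` only to keep the sketch free of Spark imports.

```python
import re

# Same pattern as in the cell above, written as a raw string.
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'

# Field names in the same order as the capture groups (hypothetical helper list).
FIELD_NAMES = ["ipAddress", "clientIdentd", "userId", "dateTime",
               "method", "endpoint", "protocol", "responseCode", "contentSize"]

def try_parse_apache_log_line(logline):
    """Hypothetical lenient parser: returns a dict, or None if the line does not match."""
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    if match is None:
        return None
    fields = dict(zip(FIELD_NAMES, match.groups()))
    # The two numeric fields are converted, as in the strict parser.
    fields["responseCode"] = int(fields["responseCode"])
    fields["contentSize"] = int(fields["contentSize"])
    return fields

# Made-up sample input: one well-formed line and one malformed line.
sample_lines = [
    '127.0.0.1 - - [21/Jul/2014:9:55:27 -0800] "GET /home.html HTTP/1.1" 200 2048',
    'this line is not an access log entry',
]
parsed = [try_parse_apache_log_line(line) for line in sample_lines]
good = [p for p in parsed if p is not None]
```

On an RDD, the same idea would presumably be a `map` with the lenient parser followed by a `filter` that drops the `None` results. Whether silently dropping lines is acceptable depends on how critical each record is.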

In [5]:
# Next, we read the log files into an RDD

log_files = "/databricks-datasets/sample_logs"
raw_log_files = sc.textFile(log_files)

In [6]:
raw_log_files.count()

In [7]:
## Let's define a transformation that parses each log line using the customized function we have written

parsed_log_files = raw_log_files.map(parse_apache_log_line)


Note that the map call above has not started any computation: we have only created a logical execution plan that has not been realized yet. In Apache Spark terminology, we have specified a transformation. Because transformations are specified up front and not performed right away, Spark can apply many optimizations under the hood, most relevantly pipelining. Pipelining means that Spark performs as much computation as it can in memory and in a single stage, as opposed to spilling to disk after each step.
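
To make the idea of deferred, pipelined execution concrete, here is a small analogy in plain Python generators. It is only an analogy, not Spark itself: the stage functions `parse` and `to_length`, the `events` log, and the sample data are all hypothetical. Chaining the generators does no work, and when the result is finally consumed, each element flows through both stages before the next element is read, with no intermediate list in between.

```python
events = []  # records the order in which the stages actually run

def parse(lines):
    # First "transformation": upper-case each line.
    for line in lines:
        events.append(("parse", line))
        yield line.upper()

def to_length(records):
    # Second "transformation": measure each record.
    for rec in records:
        events.append(("length", rec))
        yield len(rec)

raw = ["get /a", "get /bb"]

plan = to_length(parse(raw))    # builds the chain only; nothing has run yet
events_before = list(events)    # still empty at this point

result = list(plan)             # consuming the chain plays the role of an action
stage_order = [name for name, _ in events]
```

After the last line, `stage_order` alternates between the two stages, which is the element-by-element behaviour that pipelining refers to.
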
Now that we've set up the transformation for parsing this raw text file, we want to make the data available in a more structured format. We'll do this by creating a DataFrame (using toDF()) and then registering it as a temporary table (using registerTempTable()). This gives us the added advantage of working with the same dataset from multiple languages. On Spark 2.0 and later, registerTempTable() is deprecated in favor of createOrReplaceTempView().

In [9]:
parsed_log_files.toDF().registerTempTable("log_data")

In [10]:
%sql select * from log_data limit 5

<img src="screenshots/temp22.png">