In the previous file, we learned about functional programming. We briefly discussed the requirements of tasks and how tasks combine to create a data pipeline. In this file, we will build on those functional programming concepts and construct a real pipeline from scratch.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In our exercises, we will be focusing on the `request` type that can be one of `POST`, `GET`, or `PUT`. To begin, we're going to learn about special iterable types in Python, called **generators**. Then, we will use these generators to create a highly performant data pipeline.

Before we can dive into our task pipeline, we need to introduce generators in Python. The best way to do this is with an example.

In the previous file, we read the `example_log.txt` file and wrote it to a list. Recall that when creating a list, Python loads every element of the list into RAM. For files of multiple gigabytes, loading the whole file this way can cause a program to run out of memory.

Instead of reading the entire file into memory, we can take advantage of **file streaming**. File streaming works by breaking a file into small sections (called **chunks**) that are loaded into memory one at a time. Once a chunk has been exhausted (all the bytes of that chunk have been read), Python requests the next chunk, and that chunk is loaded into memory to be iterated on.
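As a rough illustration of the chunking idea described above, the sketch below reads a throwaway temporary file 1,024 characters at a time. The file contents and the chunk size are assumptions made for the example; Python's own buffering happens internally and uses its own sizes.

```python
import os
import tempfile

CHUNK_SIZE = 1024  # Hypothetical chunk size, chosen only for illustration.

# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('x' * 2500)
    path = tmp.name

chunk_sizes = []
with open(path) as f:
    while True:
        chunk = f.read(CHUNK_SIZE)  # Only this chunk is held in memory.
        if not chunk:               # An empty string means the file is exhausted.
            break
        chunk_sizes.append(len(chunk))

os.remove(path)
print(chunk_sizes)  # → [1024, 1024, 452]
```

Only one chunk is referenced at a time, so memory use stays roughly constant regardless of the file's size.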

This is abstracted away when we run the following:

![image.png](attachment:image.png)

We can see evidence of exhausted bytes if we try to read from the opened file again:

![image.png](attachment:image.png)
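The exhaustion behavior mentioned above can be reproduced with a small sketch. It uses a temporary two-line file (an assumption for this example, standing in for `example_log.txt`) and shows that a second pass over the same file object yields nothing until the position is reset with `seek(0)`.

```python
import os
import tempfile

# Create a tiny stand-in file; the contents are made up for this example.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\nsecond\n')
    path = tmp.name

log = open(path)
first_pass = [line for line in log]   # Consumes every line.
second_pass = [line for line in log]  # Nothing left to read.

log.seek(0)                           # Move back to the start of the file.
third_pass = log.readlines()          # The lines are available again.
log.close()
os.remove(path)

print(first_pass)   # → ['first\n', 'second\n']
print(second_pass)  # → []
print(third_pass)   # → ['first\n', 'second\n']
```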

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)


**Answer**

In [1]:
def squares(n):
    i = 0
    while True:
        if i > n:
            return
        yield i * i
        i += 1

squared_values = [i for i in squares(20)]

Generator comprehensions are extremely similar to list comprehensions. We can turn any list comprehension into a generator comprehension by replacing the square brackets `[]` with parentheses `()`. For example, here's how we could express the logic of the `squares` function as a list comprehension and as a generator comprehension:

`squared_list = [i * i for i in range(20)]
squared_gen = (i * i for i in range(20))`
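To make the difference between the two forms concrete, here is a minimal sketch comparing them. The range of 100,000 is an arbitrary choice for the example, and note that `sys.getsizeof` reports only the size of the container object itself, not of the integers it references.

```python
import sys

squared_list = [i * i for i in range(100_000)]
squared_gen = (i * i for i in range(100_000))

# The list holds references to all 100,000 results up front;
# the generator only stores the state needed to produce the next one.
list_size = sys.getsizeof(squared_list)
gen_size = sys.getsizeof(squared_gen)
print(gen_size < list_size)  # → True

# Values are computed only when requested.
first_three = [next(squared_gen) for _ in range(3)]
print(first_three)  # → [0, 1, 4]
```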

Before we begin replacing all our lists with generators, let's discuss a major drawback of generators. Suppose we had two places in our code that wanted to use the `squared_gen` generator. With a list, `squared_list`, we could easily do:

![image.png](attachment:image.png)

Using a generator, however, the second loop will **not** run. Like a file, a generator is exhausted once its final `yield` has been executed, and it cannot be iterated over again. Be cautious of this behavior when using generators like `squared_gen` in our code!

![image.png](attachment:image.png)
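The exhaustion drawback can be seen in a short sketch. The helper `make_squares()` is a hypothetical name introduced here to show one common workaround: building a fresh generator each time one is needed.

```python
squared_gen = (i * i for i in range(5))

first_pass = list(squared_gen)   # Consumes every value.
second_pass = list(squared_gen)  # The generator is already exhausted.

print(first_pass)   # → [0, 1, 4, 9, 16]
print(second_pass)  # → []

def make_squares(n):
    """Hypothetical helper: return a new generator on every call."""
    return (i * i for i in range(n))

# Each call yields an independent generator, so both passes see all values.
again_one = list(make_squares(5))
again_two = list(make_squares(5))
print(again_one == again_two)  # → True
```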

It's time to use generators in our pipeline. Recall from the previous file that we combined a sequence of `map`, `filter`, and `reduce` calls to produce a final count output. Using a sequence of generators instead of the built-in objects, we will mimic this composition behavior for our pipeline.

To restate our goals, we want to perform the following to get from a raw log file to a summarized CSV:

![image.png](attachment:image.png)

Furthermore, we still want to adhere to the general practices of functional programming. The tenets are highly composable functions and a focus on function purity.

To emphasize composability, we can create a general `parse_log()` function that takes in a log file, splits the lines, and then extracts the fields.
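As a minimal sketch of what composable, pure generator functions can look like, consider the example below. The sample lines and the function names `split_fields`, `only_status`, and `count_rows` are all hypothetical and exist only for illustration; they are not the functions used later in this file.

```python
from functools import reduce

# Made-up, simplified log lines: "<request_type> <path> <status>".
raw_lines = ['GET /a 200', 'POST /b 500', 'GET /c 404', 'PUT /d 200']

def split_fields(lines):
    """Lazily split each line into its fields (plays the role of map)."""
    return (line.split() for line in lines)

def only_status(rows, status):
    """Lazily keep rows whose third field equals `status` (plays the role of filter)."""
    return (row for row in rows if row[2] == status)

def count_rows(rows):
    """Consume the rows and count them (plays the role of reduce)."""
    return reduce(lambda total, _: total + 1, rows, 0)

# Each function depends only on its inputs, so the stages compose freely.
ok_count = count_rows(only_status(split_fields(raw_lines), '200'))
print(ok_count)  # → 2
```

No intermediate list is built: each line flows through all three stages before the next line is read.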

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [2]:
log = open('example_log.txt')
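def parse_log(log):
    # Iterating over the file object streams one line at a time.
    for line in log:
        # Split on whitespace; each field sits at a fixed position.
        split_line = line.split()
        remote_addr = split_line[0]
        # The timestamp is split across two tokens, so rejoin them.
        time_local = split_line[3] + " " + split_line[4]
        request_type = split_line[5]
        request_path = split_line[6]
        status = split_line[8]
        body_bytes_sent = split_line[9]
        http_referrer = split_line[10]
        # The user agent contains spaces, so rejoin the remaining tokens.
        http_user_agent = " ".join(split_line[11:])
        # Yield one tuple per line instead of building a list.
        yield (
            remote_addr, time_local, request_type, request_path,
            status, body_bytes_sent, http_referrer, http_user_agent
        )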

first_line = next(parse_log(log))

We can update the `parse_log()` function to also perform some data cleaning for us. Notice that we've stored the `time_local` field with square brackets, the HTTP status code is a string, and there are unnecessary double quotes (`"`) around some fields.

To help fix these, we have exposed a couple utility functions in the exercise:

In [3]:
from datetime import datetime
def parse_time(time_str):
    """
    Parses time in the format [30/Nov/2017:11:59:54 +0000]
    to a datetime object.
    """
    time_obj = datetime.strptime(time_str, '[%d/%b/%Y:%H:%M:%S %z]')
    return time_obj

def strip_quotes(s):
    return s.replace('"', '')
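To see what these helpers produce, here is a small usage sketch. The two functions are repeated so the block runs on its own, and the inputs are sample values in the format described in the docstring above.

```python
from datetime import datetime, timezone

def parse_time(time_str):
    """Parse a timestamp such as [30/Nov/2017:11:59:54 +0000]."""
    return datetime.strptime(time_str, '[%d/%b/%Y:%H:%M:%S %z]')

def strip_quotes(s):
    """Remove every double quote from the string."""
    return s.replace('"', '')

parsed_time = parse_time('[30/Nov/2017:11:59:54 +0000]')
cleaned = strip_quotes('"PUT')

print(parsed_time)  # → 2017-11-30 11:59:54+00:00
print(cleaned)      # → PUT
```

Because the format string includes `%z`, the resulting `datetime` is timezone-aware.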

**Task**

![image.png](attachment:image.png)

**Answer**

In [4]:
def parse_log(log):
    for line in log:
        split_line = line.split()
        remote_addr = split_line[0]
        time_local = parse_time(split_line[3] + " " + split_line[4])
        request_type = strip_quotes(split_line[5])
        request_path = split_line[6]
        status = int(split_line[8])
        body_bytes_sent = int(split_line[9])
        http_referrer = strip_quotes(split_line[10])
        http_user_agent = strip_quotes(" ".join(split_line[11:]))
        yield (
            remote_addr, time_local, request_type, request_path,
            status, body_bytes_sent, http_referrer, http_user_agent
        )

first_line = next(parse_log(log))

After parsing our logs into a generator of tuples, it's time to write a task that saves the rows to a CSV file. This keeps the data in a well-known storage format that we can use in future tasks. In the next file, we will discuss the role of files in a data pipeline.

A CSV is best understood when it has a set of header names for the columns, and the proper data types for its values. After parsing the logs, we have the proper data types, but we don't have the metadata of the column names. At the end of the exercise, we'll want to have the following output for our CSV file:

![image.png](attachment:image.png)

We have worked with the Python csv module a few times in the data engineering track. Here's an example of how we can use the csv module to write to a file:

![image.png](attachment:image.png)
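For reference, a minimal sketch of writing rows with `csv.writer` could look like the following. It writes to an in-memory `io.StringIO` buffer rather than a real file (an assumption made to keep the example self-contained), and the rows are a made-up subset of columns.

```python
import csv
import io

# Made-up rows: a header followed by two records.
rows = [
    ('ip', 'status'),
    ('200.155.108.44', 401),
    ('36.139.255.202', 404),
]

buffer = io.StringIO(newline='')  # In-memory stand-in for an opened file.
writer = csv.writer(buffer, delimiter=',')
writer.writerows(rows)            # Accepts any iterable of rows.

buffer.seek(0)                    # Rewind so the contents can be read back.
read_back = list(csv.reader(buffer))
print(read_back)
# → [['ip', 'status'], ['200.155.108.44', '401'], ['36.139.255.202', '404']]
```

Note that values read back from a CSV are always strings, so the integer status codes return as `'401'` and `'404'`.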

**Task**

![image.png](attachment:image.png)

**Answer**

In [5]:
import csv
log = open('example_log.txt')
parsed = parse_log(log)

def build_csv(lines, file, header=None):
    if header:
        lines = [header] + [l for l in lines]
    writer = csv.writer(file, delimiter=',')
    writer.writerows(lines)
    file.seek(0)
    return file

file = open('temporary.csv', 'r+')
csv_file = build_csv(
    parsed,
    file,
    header=[
        'ip', 'time_local', 'request_type',
        'request_path', 'status', 'bytes_sent',
        'http_referrer', 'http_user_agent'
    ]
)
    
contents = csv_file.readlines()
print(contents[:5])

['ip,time_local,request_type,request_path,status,bytes_sent,http_referrer,http_user_agent\n', '\n', '200.155.108.44,2017-11-30 11:59:54+00:00,PUT,/categories/categories/categories,401,963,http://www.yates.com/list/tags/category/,"Mozilla/5.0 (Windows CE) AppleWebKit/5332 (KHTML, like Gecko) Chrome/13.0.864.0 Safari/5332"\n', '\n', '36.139.255.202,2017-11-30 11:59:54+00:00,PUT,/search,404,171,https://www.butler.org/main/tag/category/home.php,"Mozilla/5.0 (Macintosh; PPC Mac OS X 10_5_0) AppleWebKit/5332 (KHTML, like Gecko) Chrome/15.0.813.0 Safari/5332"\n']


![image.png](attachment:image.png)

In [6]:
import itertools
import random

nums = [1, 2]
letters = ('a', 'b')
# Random number generator.
randoms = (random.random() for _ in range(2))

for ele in itertools.chain(nums, letters, randoms):
    print(ele)

1
2
a
b
0.5227404413400094
0.7928320230310916


**Task**

![image.png](attachment:image.png)

**Answer**

In [7]:
import itertools

log = open('example_log.txt')
parsed = parse_log(log)

def build_csv(lines, file, header=None):
    if header:
        lines = itertools.chain([header], lines)
    writer = csv.writer(file, delimiter=',')
    writer.writerows(lines)
    file.seek(0)
    return file

file = open('temporary.csv', 'r+')
csv_file = build_csv(
    parsed,
    file,
    header=[
        'ip', 'time_local', 'request_type',
        'request_path', 'status', 'bytes_sent',
        'http_referrer', 'http_user_agent'
    ]
)

contents = csv_file.readlines()
print(contents[:5])

['ip,time_local,request_type,request_path,status,bytes_sent,http_referrer,http_user_agent\n', '\n', '200.155.108.44,2017-11-30 11:59:54+00:00,PUT,/categories/categories/categories,401,963,http://www.yates.com/list/tags/category/,"Mozilla/5.0 (Windows CE) AppleWebKit/5332 (KHTML, like Gecko) Chrome/13.0.864.0 Safari/5332"\n', '\n', '36.139.255.202,2017-11-30 11:59:54+00:00,PUT,/search,404,171,https://www.butler.org/main/tag/category/home.php,"Mozilla/5.0 (Macintosh; PPC Mac OS X 10_5_0) AppleWebKit/5332 (KHTML, like Gecko) Chrome/15.0.813.0 Safari/5332"\n']
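The `'\n'` entries between records in the output above are likely a side effect of newline translation: `csv.writer` ends rows with `\r\n` by default, and on some platforms a file opened in text mode translates the `\n` again. The `csv` documentation recommends opening files with `newline=''`. The sketch below, which uses a temporary file and made-up rows, shows that no blank lines appear when the file is opened that way.

```python
import csv
import os
import tempfile

rows = [['request_type', 'count'], ['GET', 3], ['PUT', 2]]  # Made-up data.

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')
    # newline='' leaves line endings untouched, so csv controls them.
    with open(path, 'w+', newline='') as f:
        csv.writer(f).writerows(rows)
        f.seek(0)
        contents = f.readlines()

blank_lines = [line for line in contents if not line.strip()]
print(len(contents), len(blank_lines))  # → 3 0
```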


![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [8]:
def count_unique_request(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('request_type')

    uniques = {}
    for line in reader:
        # Skip blank rows (the CSV written above has an empty line
        # between records); indexing into them raises an IndexError.
        if not line:
            continue
        uniques[line[idx]] = uniques.get(line[idx], 0) + 1
    return uniques

log = open('example_log.txt')
parsed = parse_log(log)
file = open('temporary.csv', 'r+')
csv_file = build_csv(
    parsed,
    file,
    header=[
        'ip', 'time_local', 'request_type',
        'request_path', 'status', 'bytes_sent',
        'http_referrer', 'http_user_agent'
    ]
)
uniques = count_unique_request(csv_file)
print(uniques)

![image.png](attachment:image.png)
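The counting step can also be sketched with `collections.Counter` from the standard library. The function name `count_request_types` and the in-memory CSV text below are hypothetical and used only for illustration; the blank lines in the sample mimic the empty rows seen in the earlier output.

```python
import csv
import io
from collections import Counter

# Made-up CSV content, including blank lines between records.
csv_text = 'ip,request_type\n1.1.1.1,GET\n\n2.2.2.2,PUT\n\n3.3.3.3,GET\n'

def count_request_types(csv_file):
    """Hypothetical variant: count values in the request_type column."""
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('request_type')
    # csv.reader returns an empty list for a blank line, so `if row` skips it.
    return Counter(row[idx] for row in reader if row)

counts = count_request_types(io.StringIO(csv_text))
print(dict(counts))  # → {'GET': 2, 'PUT': 1}
```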

**Task**

![image.png](attachment:image.png)

**Answer**

In [9]:
def count_unique_request(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('request_type')

    uniques = {}
    for line in reader:
        # Skip blank rows (the CSV written above has an empty line
        # between records); indexing into them raises an IndexError.
        if not line:
            continue
        uniques[line[idx]] = uniques.get(line[idx], 0) + 1
    # Return a generator of (request_type, count) rows for build_csv().
    return ((k, v) for k, v in uniques.items())

log = open('example_log.txt')
parsed = parse_log(log)
file = open('temporary.csv', 'r+')
csv_file = build_csv(
    parsed,
    file,
    header=[
        'ip', 'time_local', 'request_type',
        'request_path', 'status', 'bytes_sent',
        'http_referrer', 'http_user_agent'
    ]
)
uniques = count_unique_request(csv_file)
summarized_file = open('summarized.csv', 'r+')
summarized_csv = build_csv(uniques, summarized_file, header=['request_type', 'count'])
print(summarized_file.readlines())

In this file, we expanded on the concept of functional programming and explored how composition naturally creates a data pipeline. We built a sequence of tasks and completed a pipeline that transformed raw log data into a summarized CSV file.

In the next file, we will generalize these tasks and create a general-purpose pipeline. We will learn about closures and function decorators that provide additional code reusability in functional programming. Finally, we will rebuild this pipeline using the general-purpose pipeline.
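To recap the flow in one place, here is a condensed, in-memory sketch of the same sequence of tasks. It is a simplification under stated assumptions: the three log lines are hypothetical samples, only three fields are parsed, and `io.StringIO` buffers stand in for the files on disk.

```python
import csv
import io
import itertools

# Hypothetical sample lines in the same general layout as the log file.
RAW_LOG = [
    '200.155.108.44 - - [30/Nov/2017:11:59:54 +0000] "PUT /a HTTP/1.1" 401 963 "-" "Mozilla/5.0"',
    '36.139.255.202 - - [30/Nov/2017:11:59:54 +0000] "PUT /search HTTP/1.1" 404 171 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [30/Nov/2017:11:59:55 +0000] "GET /home HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

def parse_log(lines):
    """Simplified parser: yield (ip, request_type, status) per line."""
    for line in lines:
        parts = line.split()
        yield parts[0], parts[5].replace('"', ''), int(parts[8])

def build_csv(rows, file, header=None):
    """Write rows (optionally preceded by a header) and rewind the file."""
    if header:
        rows = itertools.chain([header], rows)
    csv.writer(file).writerows(rows)
    file.seek(0)
    return file

def count_unique_request(csv_file):
    """Yield (request_type, count) pairs from a CSV with a header row."""
    reader = csv.reader(csv_file)
    idx = next(reader).index('request_type')
    counts = {}
    for row in reader:
        if row:  # Ignore blank rows.
            counts[row[idx]] = counts.get(row[idx], 0) + 1
    return ((k, v) for k, v in counts.items())

parsed_csv = build_csv(
    parse_log(RAW_LOG),
    io.StringIO(newline=''),
    header=['ip', 'request_type', 'status'],
)
summary_csv = build_csv(
    count_unique_request(parsed_csv),
    io.StringIO(newline=''),
    header=['request_type', 'count'],
)
summary_rows = list(csv.reader(summary_csv))
print(summary_rows)
# → [['request_type', 'count'], ['PUT', '2'], ['GET', '1']]
```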