# Chapter 15 - Data Engineering
"Python for DevOps" - Noah Gift

## Small Data

This section covers the small-data tools of data engineering: reading and writing files, and working with `pickle`, `JSON`, and `YAML`. Mastering these formats is critical to becoming the kind of automator who can take any task and turn it into a script.

### Write a File

In [10]:
with open("containers.txt", "w") as file_to_write:
    file_to_write.write("Pod\n")
    file_to_write.write("Service\n")
    file_to_write.write("Volume\n")
    file_to_write.write("Namespace\n")

In [11]:
!cat containers.txt

Pod
Service
Volume
Namespace


### Read a File

In [12]:
with open("containers.txt") as file_to_read:
    lines = file_to_read.readlines()
    print(lines)

['Pod\n', 'Service\n', 'Volume\n', 'Namespace\n']
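
Each element in the list above keeps its trailing newline. A minimal sketch of one way to drop it is to iterate over the file object and strip the line ending. The temporary directory is an assumption made only to keep the sketch self-contained; in the notebook, `containers.txt` already exists.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Recreate the file from the previous section in a scratch directory
    path = Path(tmp_dir) / "containers.txt"
    path.write_text("Pod\nService\nVolume\nNamespace\n")

    with open(path) as file_to_read:
        # rstrip("\n") removes only the line ending and leaves other whitespace intact
        lines = [line.rstrip("\n") for line in file_to_read]

print(lines)
```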


### Generator Pipeline to Read and Process Lines
>NOTE: maybe use to parse HTML

In [13]:
def process_file_lazily():
    """Uses generator to lazily process file"""
    
    with open("containers.txt") as file_to_read:
        for line in file_to_read:
            yield line

> Next, this generator is used to create a pipeline that performs operations line by line. This example converts each line to a lowercase string. Chaining generators this way is efficient because each stage produces a value only when the next stage asks for one.

In [19]:
# Create generator object
pipeline = process_file_lazily()
# convert to lowercase
lowercase = (line.lower() for line in pipeline)
# print first processed line
print(next(lowercase))

pod



> This means that very large, or even unbounded, files could still be processed, because the code exits as soon as it finds a matching condition. For example, a generator pipeline could look for a customer ID and stop processing at the first occurrence.
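
The early-exit idea can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the book: `record_stream`, `find_first`, and the `Customer-<n>` record format are hypothetical names invented for the example, and `itertools.count` stands in for an input that never ends.

```python
import itertools


def record_stream():
    """Yield an endless sequence of hypothetical records, one per line."""
    for number in itertools.count(1):
        yield f"Customer-{number}\n"


def find_first(lines, customer_id):
    """Return the first cleaned line equal to customer_id, or None if the input ends."""
    # Both stages are generator expressions, so nothing runs until next() pulls a value
    cleaned = (line.strip().lower() for line in lines)
    matches = (line for line in cleaned if line == customer_id)
    return next(matches, None)


# Only the first 42 records are ever generated; the rest of the stream is untouched
match = find_first(record_stream(), "customer-42")
print(match)
```

Note that the search only terminates on an endless stream if a match exists, so a real pipeline would likely also need a limit or timeout.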

### Using YAML
> YAML is an emerging standard for configuration files. It is a human-readable data serialization format that is a superset of JSON. It is often chosen because highly automated systems need a configuration language that allows rapid iteration.

In [21]:
import yaml

In [22]:
kubernetes_components = {
    "Pod": "Basic building block of Kubernetes.",
    "Service": "An abstraction for dealing with Pods.",
    "Volume": "A directory accessible to containers in a Pod.",
    "Namespaces": "A way to divide cluster resources between users."
}

In [23]:
with open("kubernetes_info.yaml", "w") as yaml_to_write:
    yaml.safe_dump(kubernetes_components, yaml_to_write, default_flow_style=False)

In [24]:
!cat kubernetes_info.yaml

Namespaces: A way to divide cluster resources between users.
Pod: Basic building block of Kubernetes.
Service: An abstraction for dealing with Pods.
Volume: A directory accessible to containers in a Pod.


> The takeaway is that YAML makes it trivial to serialize a Python data structure into a format that is easy to edit and iterate on. Reading the file back takes just two lines of code.

In [25]:
with open("kubernetes_info.yaml", "rb") as yaml_to_read:
    result = yaml.safe_load(yaml_to_read)

In [26]:
# Pretty-print the data loaded from the YAML file
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(result)

{   'Namespaces': 'A way to divide cluster resources between users.',
    'Pod': 'Basic building block of Kubernetes.',
    'Service': 'An abstraction for dealing with Pods.',
    'Volume': 'A directory accessible to containers in a Pod.'}
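
The same write-then-read round trip can be sketched with `json` and `pickle`, the other two formats named at the start of this section. This is a minimal sketch using only the standard library. The file names and the temporary directory are assumptions chosen for illustration, and the dictionary is a shortened copy of the one above.

```python
import json
import pickle
import tempfile
from pathlib import Path

kubernetes_components = {
    "Pod": "Basic building block of Kubernetes.",
    "Service": "An abstraction for dealing with Pods.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "kubernetes_info.json"
    pickle_path = Path(tmp_dir) / "kubernetes_info.pickle"

    # JSON is a text format, so the file is opened in text mode
    with open(json_path, "w") as json_to_write:
        json.dump(kubernetes_components, json_to_write, indent=4)
    with open(json_path) as json_to_read:
        from_json = json.load(json_to_read)

    # pickle is a binary format, so both file modes need "b"
    with open(pickle_path, "wb") as pickle_to_write:
        pickle.dump(kubernetes_components, pickle_to_write)
    with open(pickle_path, "rb") as pickle_to_read:
        from_pickle = pickle.load(pickle_to_read)

# Both formats return a dictionary equal to the original
print(from_json == kubernetes_components)
print(from_pickle == kubernetes_components)
```

JSON stays readable and portable across languages, while `pickle` is specific to Python and should only be loaded from sources you trust, because unpickling can execute arbitrary code.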


## Big Data