This repository contains the required files and write-up for Cheuk Lau's 2018 Insight DevOps project.
- Introduction
- Data Engineering Application
- Overview - AirAware
- Data Extraction
- Data Transformation
- Data Loading
- DevOps Pipeline
- Overview
- Terraform
- Packer
- Prometheus
- Advanced DevOps
- Demo - Traffic Spike
- Demo - Availability Zone Outage
- Build Instructions
- Prerequisites
- Download Data into S3
- Deploy using Terraform and Packer
- Run Spark Jobs
- Monitor with Prometheus
- Conclusion
- Future Work
- CI/CD with Jenkins
- Configuration Management with Puppet
- Docker + ECS/EKS Containerization for Flask
- AWS RDS for Postgres
- AWS EMR for Spark
- System Design - Scaling Up 10x
The goal of this project is to automate the deployment of an application onto AWS by writing infrastructure as code (IaC), and to build a high-reliability infrastructure using auto-scaling and redundant pipelines across multiple availability zones. The DevOps pipeline uses Terraform and Packer for automated deployment, and Git for version control. Prometheus and Grafana are used to monitor the health and status of all servers. We demonstrate the robustness of our infrastructure by spiking the traffic and by simulating an availability zone outage.
The application we will deploy is called AirAware (https://github.com/agaiduk/AirAware). The goal of AirAware is to show the historical pollution level at any address in the United States. The following figure illustrates the data pipeline that AirAware uses.
The measurement data is available from the EPA (https://aqs.epa.gov/aqsweb/airdata/download_files.html#Raw) for the years 1980 to 2017. The measurement data provides hourly pollutant levels at fixed stations throughout the United States. The amount of data for the years after 2000 is approximately 10GB/year. For the data extraction step, the data is downloaded into an Amazon Web Services (AWS) S3 storage bucket, then loaded into Spark for processing.
The data transformation step calculates the pollution level at 100,000 spatial points distributed throughout the United States such that any arbitrary address is at most 15 miles away from one of the grid points. The pollution level at each grid point is calculated by inverse-distance weighting of the measurement data at the fixed stations. Several additional metrics are also computed including monthly averages for further analysis. Note that to reduce computational costs, the distances between grid points and measurement stations were pre-computed.
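The inverse-distance weighting described above can be illustrated with a minimal sketch. This is not the project's actual Spark code; the function and variable names are hypothetical.

```python
import math

def idw_pollution(grid_point, stations, power=2):
    """Estimate the pollution level at a grid point by inverse-distance
    weighting of measurements at nearby fixed stations.

    grid_point: (x, y) coordinates of the grid point
    stations:   list of ((x, y), measured_value) tuples
    """
    num, den = 0.0, 0.0
    for (sx, sy), value in stations:
        d = math.hypot(grid_point[0] - sx, grid_point[1] - sy)
        if d == 0:
            return value  # grid point coincides with a station
        w = 1.0 / d ** power  # closer stations get larger weights
        num += w * value
        den += w
    return num / den

# Two stations at equal distance contribute equally:
print(idw_pollution((0, 0), [((1, 0), 10.0), ((-1, 0), 20.0)]))  # 15.0
```

With the common choice `power=2`, a station twice as far away contributes a quarter of the weight, which is why pre-computing the grid-to-station distances pays off: the weights never change between batch runs.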
The processed data along with additional metrics from Spark are loaded into a PostgreSQL database. We also use the PostGIS extension in order to perform easier location-based search. The user interface is provided through a Flask application which allows the user to select an address and view the monthly-averaged data from PostgreSQL.
The DevOps pipeline will write infrastructure as code (IaC) using Terraform and Packer and version control the application and IaC using Git. The following figure illustrates the DevOps pipeline used for AirAware:
The proposed DevOps pipeline is an example of an immutable infrastructure, where once an instance is launched, it is never changed, only replaced. The benefits of an immutable infrastructure include more consistency and reliability, in addition to a simpler, more predictable deployment process. It also eliminates issues with mutable infrastructures, such as configuration drift, which would otherwise require a configuration management tool (e.g., Chef, Puppet) to correct.
Terraform is used to set up the virtual private cloud (VPC) and the security group settings. The following figure illustrates the VPC used for AirAware:
The figure above shows two subnets: public and private. Flask uses the public subnet, which is connected to the internet through the internet gateway. The remaining data pipeline components (i.e., Spark and PostgreSQL) reside in the private subnet since the outside internet should not have access to them. In addition to setting up the VPC, Terraform also sets up the security groups, which limit communication between components to specific ports. Terraform is also used to spin up the Amazon Machine Images (AMIs) created by Packer and configure them accordingly.
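A Terraform sketch of this layout might look as follows. The resource names, CIDR blocks, and ports are illustrative assumptions, not the project's actual configuration.

```hcl
# VPC with a public subnet (Flask) and a private subnet (Spark, PostgreSQL)
resource "aws_vpc" "airaware" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.airaware.id
  cidr_block = "10.0.1.0/24" # Flask, reachable via the internet gateway
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.airaware.id
  cidr_block = "10.0.2.0/24" # Spark and PostgreSQL, no internet exposure
}

resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.airaware.id
}

# Security groups limit communication to specific ports:
# the Flask tier accepts public web traffic...
resource "aws_security_group" "flask" {
  vpc_id = aws_vpc.airaware.id
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# ...and only the Flask tier may reach Postgres on 5432.
resource "aws_security_group" "postgres" {
  vpc_id = aws_vpc.airaware.id
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.flask.id]
  }
}
```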
Packer is used to create the Amazon Machine Images (AMIs) for each component (i.e., Flask, Spark, and PostgreSQL) of the data engineering pipeline. Each AMI uses a base Ubuntu image and installs the required software.
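A Packer template for one such AMI could look like the sketch below. The AMI ID placeholder, region, and installed packages are assumptions for illustration, not the project's actual template.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-west-2",
    "source_ami": "<base Ubuntu AMI ID>",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "airaware-flask-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "sudo apt-get install -y python3-pip",
      "pip3 install flask"
    ]
  }]
}
```

The shell provisioner bakes the required software into the image so that new instances need no configuration at boot, which is what makes the infrastructure immutable.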
In this section, we explore the use of an auto-scaled multi-pipeline infrastructure across multiple availability zones. The following figure illustrates the proposed infrastructure for this project:
The above infrastructure creates separate pipelines across two availability zones. A Spark cluster is placed in only one of the pipelines. This reduces the cost of having to spin up multiple Spark clusters, and is acceptable since batch processing occurs only periodically; we are mainly concerned with the customer having access to the front-end and the databases containing post-processed data. We also place an elastic load balancer (ELB) to redirect traffic across the two availability zones, and auto-scale the front-end application (Flask servers) according to the fluctuation in user demand.
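Such an auto-scaling setup could be expressed in Terraform roughly as follows. The resource names, sizes, threshold, and cooldown are illustrative assumptions.

```hcl
# Auto-scaling group for the Flask front-end, spread across two AZ subnets
resource "aws_autoscaling_group" "flask" {
  min_size             = 1
  max_size             = 8
  launch_configuration = aws_launch_configuration.flask.name
  vpc_zone_identifier  = [aws_subnet.us_west_2a.id, aws_subnet.us_west_2b.id]
  load_balancers       = [aws_elb.airaware.name]
}

# Add one server per scaling event
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "flask-scale-up"
  autoscaling_group_name = aws_autoscaling_group.flask.name
  scaling_adjustment     = 1
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
}

# Trigger the scale-up policy when average CPU exceeds the upper threshold
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "flask-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 10
  evaluation_periods  = 2
  period              = 120
  statistic           = "Average"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.flask.name
  }
}
```

A mirrored policy and alarm (scaling adjustment of -1 below a lower threshold) would handle the spin-down seen in the demos.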
We use LocustIO to simulate 1000 users pinging our elastic load balancer (ELB) at a rate of 3 clicks per second. The figure below plots the CPU usage as a function of time for the two availability zones (us-west-2a and us-west-2b) throughout the simulation.
The results show an initial spike in CPU usage in both availability zones, followed by automatic provisioning of servers to decrease the CPU usage across all servers until they are all below the upper threshold. Once LocustIO is turned off, the CPU usage decreases to nearly zero across all servers, and they begin to spin down one by one until only a single server remains. We can also see that the ELB distributed work fairly evenly between the availability zones (with perhaps a slight bias towards us-west-2b in this example) and between the servers within each availability zone (as evidenced by the nearly overlapping lines). The screencast of this demo can be found here (https://youtu.be/I6_M_wAIVqY).
We use LocustIO to again simulate 1000 users pinging our ELB at a rate of 3 clicks per second. We shut off the ELB connection to one of the availability zones (us-west-2b) at approximately the peak CPU usage. The figure below plots the CPU usage as a function of time for the two availability zones throughout the simulation.
The results show the initial spike in CPU usage in both availability zones. The CPU usage in us-west-2b promptly falls to zero once its ELB connection is removed. Note that the CPU usage in us-west-2a plateaus at around 12 percent, which is nearly double that of the previous example (7 percent). This makes sense since us-west-2a should be handling nearly double the traffic with us-west-2b disconnected from the ELB. It also takes 7 servers for the CPU load on each machine of us-west-2a to fall below the upper threshold. This is approximately the sum of the servers across both availability zones in the previous example (8 servers), which again makes sense since us-west-2a is now handling all of the traffic. After Locust is turned off, the number of servers in us-west-2a spins down to one. The screencast of this demo can be found here (https://youtu.be/SjjnE2ZPrtU).
The following software must be installed into your local environment:
- Terraform
- Packer
- AWS command line interface (CLI)
Clone the repository:
```
git clone https://github.com/cheuklau/insight_devops_airaware.git
```
Perform the following steps to download the EPA data into AWS S3. First, create an S3 bucket called `epa-data`, then perform the following:

```
wget https://aqs.epa.gov/aqsweb/airdata/hourly_44201_xxxx.zip
wget https://aqs.epa.gov/aqsweb/airdata/hourly_88101_xxxx.zip
unzip hourly_44201_xxxx.zip
unzip hourly_88101_xxxx.zip
export AWS_ACCESS_KEY_ID=<insert AWS access key ID>
export AWS_SECRET_ACCESS_KEY=<insert AWS secret key>
aws s3 cp hourly_44201_xxxx.csv s3://epa-data
aws s3 cp hourly_88101_xxxx.csv s3://epa-data
```

where `xxxx` is the year of interest.
```
cd insight_devops_airaware/devops/final
vi build.sh
```

Change the user inputs in `build.sh` as needed, then run:

```
./build.sh --packer y --terraform y
```

Running `build.sh` performs the following:
- Calls Packer to build the Spark, Postgres, and Flask AMIs.
- Calls Terraform to spin up the Spark cluster, Spark controller, PostgreSQL, and Flask instances.
- Outputs the ELB DNS name used to access the Flask front-end servers.
The AirAware front-end is now visible at `<ELB-DNS>`. However, we must first run Spark jobs from the Spark controller before any data is visible.
Perform the following to submit Spark jobs:
```
ssh ubuntu@<Spark-Controller-IP> -i mykeypair
cd insight_devops_airaware/AirAware/spark
spark-submit raw_batch.py hourly_44201_xxxx
spark-submit raw_batch.py hourly_88101_xxxx
```

where `xxxx` is the year of interest.
We can monitor the status of the Spark jobs at `<Spark-Master-IP>:8080`.
We can go to the Prometheus dashboard at `<Prometheus-IP>:9090` or the Grafana dashboard at `<Prometheus-IP>:3000`. Node Exporter is installed and running on the Flask servers, and the Prometheus server continuously scrapes them for data. Grafana reads the data collected by Prometheus and displays it in more elegant dashboards. A default dashboard is provided in this repo to display the CPU usage as a function of time across the used availability zones.
In this project, we have automated the deployment of an application onto AWS using a high-reliability infrastructure. We used Terraform and Packer to automate deployment, added auto-scaling to our user-facing servers to handle changes in traffic, and built redundant pipelines across multiple availability zones for increased fault tolerance.
The developer-to-customer pipeline is summarized below:
1. Developer
2. Build
3. Test
4. Release
5. Provision and Deploy
6. Customer
Terraform and Packer handle steps 4 and 5. However, we still need a CI/CD tool (e.g., Jenkins) to handle steps 2 and 3, and to automatically trigger Terraform and Packer to perform steps 4 and 5. CI/CD for AirAware using Jenkins is summarized below:
- Developer pushes code into Git repository.
- Jenkins detects the change and automatically triggers:
- Packer to build the AMIs in the staging environment.
- Terraform to spin up the AMIs in the staging environment.
- Jenkins performs unit tests as provided by the AirAware developer.
- If build is not green, developers are notified.
- If green, we can either automatically deploy into the production environment (continuous deployment) or wait for manual approval (continuous delivery).
Below are some specific work items to incorporate Jenkins for CI/CD:
- Create separate staging and production environments that both use the same Terraform modules.
- Use Terraform to spin up an additional instance to run Jenkins.
- Create a `Jenkinsfile` that:
  - Monitors for changes to the AirAware Git repository.
  - Creates a `build` stage which triggers Packer to build new AMIs in the staging environment.
  - Creates a `deploy-to-staging` stage which triggers Terraform to spin up the new AMIs in the staging environment.
  - Creates a `testing` stage to perform unit tests on the staging environment.
  - Creates a `deploy-to-production` stage which either automatically deploys to production or waits for manual approval once the `testing` stage passes.
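A declarative Jenkinsfile covering these stages might look as follows. The shell commands are placeholders, not the project's actual scripts; the manual `input` gate implements the continuous delivery option.

```groovy
// Sketch of a declarative Jenkinsfile for AirAware (commands assumed)
pipeline {
  agent any
  stages {
    stage('build') {
      steps { sh 'packer build flask.json' }          // build new AMIs
    }
    stage('deploy-to-staging') {
      steps { sh 'terraform apply -auto-approve -var env=staging' }
    }
    stage('testing') {
      steps { sh './run_unit_tests.sh' }              // developer-provided tests
    }
    stage('deploy-to-production') {
      steps {
        input 'Deploy to production?'                 // manual approval gate
        sh 'terraform apply -auto-approve -var env=production'
      }
    }
  }
}
```

Removing the `input` step would turn this into continuous deployment, with every green build going straight to production.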
In this project, we used Packer to build AMIs with the required software to run each component of AirAware, and used Bash scripts invoked by Terraform to configure new instances. Using Bash scripts is undesirable for the following reasons:
- Requires expert knowledge of scripting language standards and style.
- Increases complexity when dealing with mixed operating systems (OS).
An alternative is to use a configuration management (CM) tool, e.g., Puppet, which can achieve the same results without worrying about the underlying OS or Bash commands. Puppet uses a declarative domain-specific language (DSL), allowing users to specify only the end goal. The main objective of Puppet is to maintain a defined state configuration. For AirAware, we will use a master-agent setup, where agents check in with the master to see if anything needs to be updated. Communication between the master and agents is summarized below:
1. The agent sends data about its state to the master (`facts`, which include hostname, kernel details, etc.).
2. The master compiles a list of configurations to be performed on the agent (a `catalog`, which includes upgrades, file creation, etc.).
3. The agent receives the `catalog` from the master and executes the tasks.
4. The agent reports back to the master after the `catalog` tasks are complete.
Below are some specific work items to incorporate Puppet for CM:
- Create a Puppet `module` for AirAware. The `module` contains Puppet `manifests` for each AirAware component, i.e., Spark, Postgres, and Flask. `Manifests` contain Puppet scripts which describe the desired state using Puppet DSL.
- Install Puppet on the Packer-generated AMIs.
- Create an additional instance to run the `master`.
- Provision the `manifest` on the `master`.
- Install Puppet on the `master`, and start the Puppet server.
- Use Puppet `node definitions` to send `manifest` tasks to the appropriate servers.
In this project, we used virtual machines (VMs) for every component of AirAware. Each VM runs a full copy of an operating system (OS) and virtual copies of all the hardware that the OS requires. An alternative to VMs are containers, which contain just the application, its dependencies, and the libraries and binaries required to run the application. Containers are MBs instead of GBs in size compared to VMs, and take seconds rather than minutes to spin up. Flask is a good candidate for containerization since it requires low OS overhead and needs to be quickly spun up or down based on user demand. We can use Docker to containerize Flask by creating a `Dockerfile` that performs many of the same tasks as Packer in order to create a Flask Docker image. Docker can be used in conjunction with AWS Elastic Container Service (ECS) or AWS Elastic Kubernetes Service (EKS) for container orchestration. A container orchestration tool:
- Defines the relationship between containers.
- Sets up container auto-scaling.
- Defines how containers connect with the internet.
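A Dockerfile for the Flask front-end could look like the sketch below. The base image, file names, and entry point are assumptions, since the actual application layout is defined in the AirAware repo.

```dockerfile
# Illustrative Dockerfile for the Flask front-end (paths assumed)
FROM python:3.9-slim
WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and expose Flask's default port
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```

The resulting image plays the role the Packer AMI plays today, but at MB scale and with second-level startup, which suits the auto-scaling front-end.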
Note that ECS and EKS clusters can be built in Terraform using the `aws_ecs_cluster` and `aws_eks_cluster` resources, respectively.
Amazon Relational Database Service (RDS) supports Postgres and performs the following tasks:
- Scales database storage with little to no downtime.
- Performs backups.
- Patches software.
- Manages synchronous data replication across availability zones.
Note that RDS can be built in Terraform using the `aws_rds_cluster` resource.
Amazon Elastic MapReduce (EMR) provides a Hadoop framework to run Spark and performs the following tasks:
- Installs and configures all of the required software.
- Provisions nodes and sets up the cluster.
- Increases or decreases the number of instances with autoscaling.
- Uses spot instances.
Note that EMR can be built in Terraform using the `aws_emr_cluster` resource.
For this exercise, we will scale up the architecture to handle:
- A 10x increase in the amount of data.
- A new ability for users to download the full-resolution hourly data.
- An increase in user traffic requiring multiple webservers.
We have the following non-functional requirements:
- High availability - the monthly-averaged and full-resolution hourly data should not be lost.
- Low latency - requests should be performed within a given latency requirement.
Full-resolution hourly data storage after 10 years:
- Multiplying the original storage of 500GB by 10 results in 5TB of data.
- Multiplying the original data growth of 10GB/year by 10 results in 100GB/year.
- Over 10 years, we expect to accumulate 6TB of data (5TB plus 100GB/year for 10 years).
Monthly-averaged data storage after 10 years:
- Monthly-to-hourly data ratio = 1/720.
- Over 10 years, we expect to accumulate roughly 8GB of data (6TB / 720).
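The storage estimates above can be checked with a few lines of arithmetic:

```python
# Back-of-the-envelope storage estimates from the section above
original_storage_gb = 500
growth_gb_per_year = 10
scale = 10
years = 10

scaled_storage_gb = original_storage_gb * scale                  # 5000 GB = 5 TB
scaled_growth_gb = growth_gb_per_year * scale                    # 100 GB/year
hourly_total_gb = scaled_storage_gb + scaled_growth_gb * years   # 6000 GB = 6 TB

# Monthly averages are 1/720th the volume of the hourly data
# (roughly 30 days * 24 hours per month)
monthly_total_gb = hourly_total_gb / 720

print(hourly_total_gb)             # 6000
print(round(monthly_total_gb, 1))  # 8.3
```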
For the monthly-averaged data:
- Continue to use PostGIS to handle geo-spatial lookups.
- A modern-day server can easily handle 8GB of data.
- Increase availability by using a master-slave architecture.
- Writes are performed on the master which replicates to the slaves.
- Reads are performed on the slaves.
For the full-resolution hourly data:
- Assuming 1TB of data per shard, we can shard the full-resolution data across 6 databases.
- Each database can be NoSQL Cassandra with high replication based on the number of worker nodes.
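The sharding scheme can be sketched as a simple hash of the location key. The shard function and ID format are hypothetical; any stable hash that keeps all hourly rows for one location on the same shard would do.

```python
import hashlib

NUM_SHARDS = 6  # ~1 TB per shard for 6 TB of full-resolution data

def shard_for(location_id: str) -> int:
    """Map a grid-point ID to one of the six Cassandra shards.

    Hash-based sharding spreads locations evenly across shards and
    keeps every hourly row for a given location on the same shard,
    so a download request touches only one database.
    """
    digest = hashlib.md5(location_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("grid-000042"))  # deterministic shard index in [0, 5]
```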
We can speed up requests to the read API by caching part of the monthly-averaged data in Memcached or Redis. Requests to the read API first check the cache for the location. On a cache miss, the read API reads from the PostGIS database, and the result is written to the cache. The cache can be cleaned up routinely by removing the least frequently used locations.
Load balancers should be placed between:
- Client and the reverse-proxy/webservers.
- Reverse-proxy/webservers and the read/download APIs.
- Read APIs and cache.
- Read APIs and PostGIS.
- Download APIs and Cassandra.
The detailed design with data partitioning/replication, caching and load balancers is shown in the figure below.