# REL 5: How are you executing change?

Software is delivered continuously. The faster new features reach users, the sooner new data, learnings and profits arrive. However, as architectures and teams grow in size and complexity, more needs to be done and the longer it takes to release a new feature reliably. Continuous delivery, for our purposes, refers to the set of techniques used to automate the software pipeline and get changes to users quickly and reliably.

The cloud puts all the effort of acquiring and managing infrastructure behind an API, so infrastructure can be expressed as code. That allows every code push to be immediately deployed to production, but that is not necessarily a good idea. To be reliable, continuous delivery requires proper safeguards in testing, monitoring and automation. Each organization will have different repositories and processes for code and data. AWS CodePipeline lets you model your delivery process as a workflow and manage its execution as changes flow through it. This is how a sample pipeline would look in the management console, although remember that pipelines can also be created using AWS CloudFormation, the command line or any language supported by the SDKs.
 
Software delivery is extremely heterogeneous: each language, framework, legacy system and business can be updated quite differently. AWS CodeStar is a service that helps build pipelines for common technology stacks, such as Java with Spring or Node.js with Express. The CodeStar new project wizard guides you from the connection to source repositories on AWS CodeCommit or GitHub to deployment on several services.
 
Continuous delivery is not only about tools, but also about the process of applying them and the people they serve. The following techniques are frequently used to help software evolve quickly, reliably and safely. They are not strict rules but building blocks, applied successfully in many architectures, each time with variations to fit the context. The techniques reinforce each other, but are rarely adopted all at once. As an architecture evolves, so does its development process, helped by some or all of the following: Collective Ownership, Zero Regressions, Continuous Integration, Feature Flags, Microservices, Infrastructure as Code, Immutable Infrastructure, Blue-Green Deployments and Canary Releases.

## Collective Ownership

The cornerstone of continuous delivery is the commitment to build for the customer. An outage or issue is not a development or operations problem. It’s a business commitment to the customer first and foremost. As Amazon CTO Werner Vogels declared:

“The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”

Collective ownership is key to balancing the pace of innovation. In an ideal world, every service would be launched at the same time in all regions, with support for AWS CloudTrail and AWS CloudFormation. Launches are getting faster, but it is still more efficient to serve some customers first than to make everyone wait. The same logic can be applied to mitigate technical debt and other issues that arise when teams are segregated into rigid roles. Everyone is responsible for some aspect of business continuity and should share the same priorities: keeping the software up and improving it as fast as possible, in the direction of customer needs, while preventing and eliminating bugs as early as possible.

## Fix before build

The fastest possible delivery would be to publish every code change directly to production. However, it is very hard to determine when a bug has been introduced and what impact it will have. What is very likely is that the longer bugs take to fix, the more they will cost and the more they breed. As they do, the schedule, cost and quality of the software process become more and more unreliable. The “broken windows” theory suggests that if everyone helps prevent small crimes, such as breaking the build or leaving exceptions uncaught, the bigger issues can be significantly reduced. No developer wants to be the only one doing things wrong, but no developer wants to be the only one who cares about doing things right either.

To maintain reliability, bugs must be prevented thoroughly, detected early and corrected quickly. The many different types of automated software testing are beyond the scope of this discussion, but they are the first line of defense in bug prevention. At each commit, the pipeline can use AWS CodeBuild or partner test tools to automatically verify, against the automated test suite, whether the code is ready to proceed.
 
## Continuous Integration

The longer a developer works alone in a local copy of the code, the harder it is to merge it back into the master branch or trunk. Ideally, coding tasks would be short enough to merge frequently and without developers stepping on each other's toes. But tasks such as large refactorings, epic features or irreproducible bugs can keep a developer away from the team for a long time.

Continuous Integration (CI) is about getting feedback from automated checks on the codebase as often as possible, so everyone can build faster and more safely. The “State of DevOps Report” indicates that adopting CI helps reduce change failures by up to 5 times, but the real difference is in the time from commit to deploy, which can be up to 440 times faster, with deployments happening up to 46 times more often.

Continuous integration starts with the shared repository. AWS CodeCommit provides private managed Git repositories, so it can be used with any CI tool, in addition to its native integration with AWS Developer Tools. GitHub is also widely supported and is often a better fit for open source projects. Once a repository is set up, it is important to decide on a branching strategy, so that everybody follows the same flow. Here again each team has its own quirks, but they usually revolve around two general strategies:

* Trunk-based Development is when everyone commits directly to the master branch. It cuts branching issues by branching minimally or not at all. As there is a single codebase that is integrated often, extra care must be taken not to break it for other developers, or even users, which is a good idea in the first place. The culture and tools for comprehensive automated testing are an important safety net for reliable trunk-based development.

* Git Flow is the popular name of the branching model described by Vincent Driessen in 2010 and adopted by many teams and tools. The master branch still holds the production-ready code, but a develop branch aggregates changes while the new release is under development. Each of those changes is developed in a separate feature branch and merged back into the develop branch when done. This way unfinished code gets integrated in develop and can be thoroughly tested before merging to master and proceeding through the pipeline to production.

In both strategies, automated validation should begin as soon as code is pushed into the repository, ideally to any branch. From there the pipeline starts a new cycle of compiling, testing, packaging and deploying. AWS CodePipeline manages the execution of the build workflow, but the actual computation of each build step is managed by AWS CodeBuild or one of the integrated partner tools. 
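
The cycle described above can be sketched as an ordered list of stages in which a failure stops everything downstream. This is only an illustration of the control flow, not how AWS CodePipeline is implemented; the stage names and the `run_pipeline` helper are assumptions made for the example.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that reports success (True) or failure (False).
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            print(f"stage '{name}' failed; stopping the pipeline")
            break
        completed.append(name)
    return completed


# Hypothetical stage actions; real ones would call a compiler, a test runner, etc.
stages = [
    ("compile", lambda: True),
    ("test", lambda: False),  # simulate a failing test suite
    ("package", lambda: True),
    ("deploy", lambda: True),
]
result = run_pipeline(stages)
```

With the failing test stage above, nothing is packaged or deployed, which is the safeguard that makes frequent pushes tolerable.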

Using AWS CodeBuild, the build specification file (buildspec.yml) declares the list of commands to be executed at each of four lifecycle phases (install, pre_build, build, post_build) and their resulting artifacts. Here is a sample build specification, for Java using Maven in this case, but the commands could be replaced to run anything.
 
Teams frequently adopt further practices to improve the development process. Like any other scheduled task, nightly builds can be triggered by a Lambda function and a CloudWatch Events schedule to ensure at least a daily pace of integration.
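
A minimal sketch of such a function follows, assuming the Python Lambda runtime and the boto3 `codebuild` client. The project name, the `BUILD_PROJECT` environment variable and the optional `codebuild` argument (added so the handler can be exercised without AWS access) are assumptions made for this example. A CloudWatch Events rule with a schedule expression such as `cron(0 3 * * ? *)` could then invoke it once a day.

```python
import os


def handler(event, context, codebuild=None):
    """Lambda entry point: start a build of the configured CodeBuild project."""
    if codebuild is None:
        # boto3 ships with the AWS Lambda Python runtime.
        import boto3

        codebuild = boto3.client("codebuild")
    # Hypothetical project name, overridable through an environment variable.
    project = os.environ.get("BUILD_PROJECT", "nightly-build")
    response = codebuild.start_build(projectName=project)
    # Return the build identifier so it shows up in the invocation result.
    return response["build"]["id"]
```

Lambda calls the handler with two arguments only, so the third parameter keeps its default in production and is replaced by a stub in local tests.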

Integrating frequently, even when there are no code changes, can be important to detect issues with dependencies and integrations. For example, the Amazon Inspector service agent can detect vulnerabilities known to the Common Vulnerabilities and Exposures (CVE) database. Integrating that rules package in your build pipeline can prevent you from deploying a dependency with a known CVE, such as Heartbleed, Shellshock, POODLE and many others. Some package managers may even let you upgrade dependency versions automatically, and that may be a good idea as well.

CodeBuild can also cache dependencies on S3 to significantly improve build times, since each build environment starts empty and would otherwise fetch all dependencies again on every build.

Once a build is successful, CodeBuild uploads the artifact files to S3 and passes the key name to the following stages in the pipeline. Build results are also emitted to CloudWatch Events, so they can trigger Lambda functions for any kind of post-processing or notification, such as an e-mail or a message to a chat channel. Artifacts then keep progressing through the pipeline to further tests, approvals and deployments. A deployment, however, is not necessarily a new release.

## Feature Flags

A new software release can be a significant event for the business and involve announcements, promotions, sales and more. But it does not need to be a huge event operationally. The code can be there and working perfectly for a long time, just disabled for all or most customers by feature flags.

For example, a new feature designed for Christmas may hardly make sense before then, but its code can be deployed as early as Halloween. As nobody wants to watch logs on Christmas, the new feature can be tested and deployed, but kept hidden. The business release then becomes a separate event from the software deployment, which probably happened long before.

The actual implementation may range from a simple feature flag and an if statement to dynamic properties of the system. There may be value in partial releases, whitelisting new features and services gradually for larger groups of users. Many AWS services are announced first in a “preview” period before general availability. To use them, customers need to submit an application and may be contacted by the service team for feedback.
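
As a rough sketch of that range, the example below combines a global switch, an allowlist (the whitelisting mentioned above) and a percentage rollout. The flag names, the `FLAGS` structure and the helper functions are hypothetical; a real system would likely load the configuration from a store that can change without a deployment.

```python
import hashlib

# Hypothetical flag configuration, kept in code only to keep the sketch short.
FLAGS = {
    "christmas_theme": {"enabled": True, "allow": {"alice"}, "percent": 10},
    "new_checkout": {"enabled": False, "allow": set(), "percent": 100},
}


def bucket(flag: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str) -> bool:
    config = FLAGS.get(flag)
    if config is None or not config["enabled"]:
        return False  # unknown or switched-off flags stay hidden
    if user_id in config["allow"]:
        return True  # allowlisted users always see the feature
    # Everyone else is admitted gradually as "percent" is raised.
    return bucket(flag, user_id) < config["percent"]


def render_home(user_id: str) -> str:
    # The deployed code contains both paths; the flag decides which one runs.
    if is_enabled("christmas_theme", user_id):
        return "home page with Christmas theme"
    return "regular home page"
```

Hashing the flag name together with the user identifier keeps each user's experience stable between requests while letting different flags roll out to different subsets of users.
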

## Microservices

As business grows, so do software development teams and the complexity of their coordination. To sustain the vertiginous pace of growth of Amazon, and of countless startups, software teams must be able to work independently.

"Adding human resources to a late software project makes it later" – Brooks's Law

This is so much of an issue that back in 2002 Jeff Bezos started to restructure Amazon.com around what he called “two pizza teams”: autonomous groups of around a dozen people, a crew size good enough for effective teamwork and small enough to be fed with a couple of pizzas. This decomposition of teams and services was challenging, but its benefits went well beyond reducing synchronization and complexity. Onboarding new engineers faster and building teams effectively are key to accelerating the pace of innovation. In 2008, Amazon was already decomposed into hundreds of services, whose dependency graph looks like this:
 


Netflix, Uber, eBay and many other fast-scaling enterprises are adopting this strategy of fine-grained services and responsibilities to scale faster, both in terms of business and software architecture. Considering that each of those services will be in continuous improvement and delivery, provisioning infrastructure by hand quickly becomes unviable.

## Infrastructure as Code

Deep automation is essential to deliver software at such scale and rate. A new software version or feature may require not only new code, but also new infrastructure components. In traditional on-premises IT, infrastructure components are frequently restricted to servers (application, messaging and database) and perhaps a load balancer or network storage appliance. Bringing in a new resource type, like a NoSQL database or machine learning cluster, may pose a significant business commitment and risk. On the cloud, new features may bring in new resource types or replace existing ones with little to no consequence.

As the name implies, Amazon Web Services are offered over a public web API. This allows infrastructure resources to be managed with code and to inherit many benefits of code development: configuration management, version control, automated testing, costless duplication and so on. It also brings different approaches to coding and abstractions, as fit for different programming languages and styles. Automation can be as simple as a shell script using the AWS CLI, or a program in most modern languages using the AWS SDKs. However, those scripts quickly grow in complexity and may end up consuming as much attention as the application they manage. Instead of managing resources imperatively, declaring resources and having an interpreter manage them is more effective for infrastructure code.

AWS CloudFormation creates and manages resources based on templates declared in either JSON or YAML. The template is a recipe for the application architecture and, once it is built and tested, can be used to create multiple instances of it. It can also take parameters, evaluate mappings and functions, share outputs and use many other features for managing infrastructure as code. Here is a sample template file using YAML:
 

CloudFormation stacks can be managed through the console, SDK or CLI, just like other services. During development and testing those tools help debug templates, which, when ready, are published to the code repository, usually together with the application code. This can trigger another cycle of your continuous delivery pipeline to take changes to production. This way not only new application code can be automatically provisioned, but also the infrastructure elements it depends on, such as caches, databases and many others. See the “CloudFormation Type Reference” page for the complete list of resources and properties.
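
Because a template is plain data, it can also be produced by code. The sketch below builds a small template as a Python dictionary and serializes it to JSON, the other format CloudFormation accepts. The bucket, the `Environment` parameter and the `build_template` helper are illustrative assumptions, not part of any existing stack.

```python
import json


def build_template(versioning: bool = True) -> dict:
    """Return a minimal CloudFormation template as a plain dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sample stack: one S3 bucket for build artifacts",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": ["dev", "prod"],
                "Default": "dev",
            }
        },
        "Resources": {
            # Hypothetical logical name; CloudFormation generates the bucket name.
            "ArtifactBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {
                        "Status": "Enabled" if versioning else "Suspended"
                    },
                    "Tags": [
                        {"Key": "environment", "Value": {"Ref": "Environment"}}
                    ],
                },
            }
        },
        "Outputs": {
            # Ref on an S3 bucket resolves to the bucket name.
            "BucketName": {"Value": {"Ref": "ArtifactBucket"}}
        },
    }


template_body = json.dumps(build_template(), indent=2)
```

The resulting string could be saved to a file under version control and handed to the console, CLI or SDK to create as many stacks as needed.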

## Immutable Infrastructure

As applications are decomposed into a complex set of dependencies, it may be difficult to determine whether a change will impact availability or have other negative consequences. Some application servers can hot-deploy new application packages, others can't. Some databases need downtime to change a schema, others don't have a schema at all. Some AWS resources can be changed in place, others may need replacement. Once an environment is up and running, it is safer not to change it at all.

Instead of manipulating resources that are receiving live traffic, consider creating new “clone” environments and redirecting traffic to them once they are stable and ready. With infrastructure declared as code, the new environment can be provisioned quickly and, as it still has no traffic, the extra cost should be negligible. After stabilizing and passing all health checks, the new environment can start receiving live traffic. Once all traffic has shifted to the new environment and it is performing well, the old environment can be safely decommissioned. Or, even better, it can be kept for a while to simplify rollback in case a bug slips into production.

Immutability and infrastructure as code have important implications for security as well. As all changes are applied automatically, there should be no need for SSH or other management access. Particularly in production environments, an interactive login should even raise an alarm. If someone logged in, it is either an automation issue or an attack, and system administrators should be notified in either case. Immutability also makes it possible to use or respawn old environments for auditing or for understanding the origin of issues, even long after they were fixed.

## Blue Green Deployment

When a bug gets deployed or anything goes wrong, traffic can be rolled back to the old environment to quickly prevent further damage. How long “quickly” is depends on the failure detection and routing capabilities of each piece of software. It is common to rely on changing DNS records to fail over, pointing resolution to a secondary environment and requiring only a reconnection from clients. While this is simple to implement, DNS changes may take a couple of minutes to propagate, depending on the records' Time to Live (TTL) and on how DNS caches expire. The AWS Elastic Beanstalk “Swap Environment URLs” feature does exactly that with a single invocation.

Failover times can be reduced using “heartbeat” protocols and client-side fault detection, such as those featured in MariaDB clients and Netflix Eureka. But however fast and reliable failover can be, it is even better to prevent it. To that end, development processes can go further than the “zero regressions” suggested earlier and also adopt a policy of data compatibility. The new version must be compatible with the old version and able to execute concurrently with it against the same data. Databases that do not require a predefined schema, such as Amazon DynamoDB, are very helpful for complying with such a policy, as compatibility moves entirely to application code under the developers' control.
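
One way to honor such a policy in application code is for the new version to read both record layouts and to keep writing the attributes the old version expects. The sketch below assumes a hypothetical customer record that used to hold a single `name` attribute and now splits it in two; the attribute names and the `schema_version` marker are illustrative only.

```python
def read_customer(item: dict) -> dict:
    """Read a customer record written by either the old or the new version."""
    if item.get("schema_version", 1) >= 2:
        # New layout: the name is already split.
        first, last = item["first_name"], item["last_name"]
    else:
        # Old layout: a single "name" attribute, split on the first space.
        first, _, last = item["name"].partition(" ")
    return {"id": item["id"], "first_name": first, "last_name": last}


def write_customer(customer_id: str, first: str, last: str) -> dict:
    """Write the new layout while keeping the attribute the old version reads."""
    return {
        "id": customer_id,
        "schema_version": 2,
        "first_name": first,
        "last_name": last,
        "name": f"{first} {last}",  # kept so the old version keeps working
    }
```

Once the old version is fully decommissioned, the redundant attribute and the fallback branch can be removed in a later release.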

Having several application environments with the same data is helpful not only for rolling out changes, but also for enabling experimentation. You can have alternative versions of the application running at the same time and monitor metrics to find out which is better, whatever “better” means for the experiment. This could be used to optimize for performance, sales, conversions or any other business criteria. This is instrumental in adopting lean methodologies and making decisions based not on speculation but on experience and data from actual users.

## Canary Release

A canary release, like the canaries used in coal mines, increases reliability by detecting issues early. Instead of deploying a new environment directly into the firehose of production traffic, canary releases receive reduced or simulated traffic first. Once monitoring indicates a new environment is reliable, or more successful in an experiment, it can gradually receive more traffic. The actual implementation may depend on how safe and how fast deployments must be. Some common alternatives are:

* Synthetic requests can be designed and fired at the application to check whether it behaves reliably. Modern tools let developers craft requests in several protocols and fire them from a distributed cluster of worker nodes. However, as this usually generates fake data, synthetic canary environments are usually discarded and a new copy of the vetted version is deployed into production afterwards.

* Recorded traffic from live production can be replayed to the canaries. Beyond the scenarios tested by developers with synthetic requests, replayed traffic exposes the application to the creativity of users in breaking software. More than that, it can confirm that bugs were corrected properly and help enforce the zero regressions policy. Recorded canaries are also usually discarded, as the same data is already in production and transactions would be duplicated.

* Live canaries receive real traffic in proportion to their behavior. As long as they keep up with reliability metrics, more traffic is periodically shifted to the new environment; otherwise, traffic is rolled back to the old one.
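
The live canary loop can be sketched as follows. This is a simplified illustration under stated assumptions: `error_rate_at` stands in for whatever monitoring query a team actually uses, and the traffic steps and the 1% error threshold are arbitrary example values.

```python
from typing import Callable, Sequence


def shift_traffic(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Shift traffic to the canary step by step and return its final weight.

    error_rate_at(weight) is a hypothetical monitoring hook that returns the
    error rate observed while the canary receives `weight` percent of traffic.
    The result is the last weight reached, or 0 if the canary was rolled back.
    """
    current = 0
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            print(f"error rate {observed:.3f} at {weight}%: rolling back")
            return 0
        current = weight
        print(f"canary healthy at {weight}% of traffic")
    return current


# A canary that only misbehaves under heavier load is caught at 50%.
final_weight = shift_traffic(lambda weight: 0.05 if weight >= 50 else 0.0)
```

The same loop works for experiments if the reliability check is replaced by a comparison of latency, conversions or any other metric of interest.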

Although these change management techniques were presented in the context of reliability, their benefits can extend much further. The reliable and simultaneous execution of several environments can be used not only to roll out new versions, but also to perform business experiments. Instead of shifting traffic according to reliability, it can be shifted according to other technical metrics, such as latency or throughput, or even business goals, such as conversions or revenue.