# AWS Whitepapers
> Introduction to AWS Additional Key Services

- toc: true 
- comments: true
- author: Ankush Agarwal
- categories: [aws, whitepapers]

### Introduction 
    The process for reviewing an architecture is a constructive conversation about architectural decisions,
    and is not an audit mechanism. 
    We believe that having well-architected systems greatly increases the likelihood of business success.
    
    AWS also provides a service for reviewing your workloads at no charge. The AWS Well-Architected
    Tool (AWS WA Tool) is a service in the cloud that provides a consistent process for you to review and
    measure your architecture using the AWS Well-Architected Framework. The AWS WA Tool provides
    recommendations for making your workloads more reliable, secure, efficient, and cost-effective.

###  The Pillars of the AWS Well-Architected Framework

    Operational Excellence 
        The ability to support development and run workloads effectively, gain insight into their operations,
        and to continuously improve supporting processes and procedures to deliver business value.
        
    Security
        The security pillar describes how to take advantage of cloud technologies to protect data, systems,
        and assets in a way that can improve your security posture.
        
    Reliability
        The reliability pillar encompasses the ability of a workload to perform its intended function
        correctly and consistently when it’s expected to. This includes the ability to operate and test the
        workload through its total lifecycle. This paper provides in-depth, best practice guidance for
        implementing reliable workloads on AWS.
        
    Performance Efficiency
        The ability to use computing resources efficiently to meet system requirements, and to maintain
        that efficiency as demand changes and technologies evolve.
        
    Cost Optimization
        The ability to run systems to deliver business value at the lowest price point.
        
    Sustainability
        The ability to continually improve sustainability impacts by reducing energy consumption and 
        increasing efficiency across all components of a workload by maximizing the benefits from the
        provisioned resources and minimizing the total resources required.

### Overview
    Technology architecture teams typically include a set of roles such as: Technical Architect 
    (infrastructure), Solutions Architect (software), Data Architect, Networking Architect, and 
    Security Architect.
        
    At AWS, we prefer to distribute capabilities into teams rather than having a centralized team with that
    capability. There are risks when you choose to distribute decision-making authority; for example, ensuring
    that teams are meeting internal standards. We mitigate these risks in two ways. First, we have practices 
    that focus on enabling each team to have that capability, and we put in place experts who ensure that teams
    raise the bar on the standards they need to meet. Second, we implement mechanisms that carry out automated
    checks to ensure standards are being met. 

### General Design Principles

    Stop guessing your capacity needs: 
        If you make a poor capacity decision when deploying a workload, you might end up sitting on expensive
        idle resources or dealing with the performance implications of limited capacity. 
        With cloud computing, these problems can go away. You can use as much or as little capacity as you 
        need, and scale up and down automatically.
        
    Test systems at production scale: 
        In the cloud, you can create a production-scale test environment on demand, complete your testing, 
        and then decommission the resources. Because you only pay for the test environment when it's 
        running, you can simulate your live environment for a fraction of the cost of testing on premises.
        
    Automate to make architectural experimentation easier: 
        Automation allows you to create and replicate your workloads at low cost and avoid the expense 
        of manual effort. You can track changes to your automation, audit the impact, and revert to previous
        parameters when necessary.
        
    Allow for evolutionary architectures: 
        In a traditional environment, architectural decisions are often implemented as static, one-time 
        events, with a few major versions of a system during its lifetime. As a business and its context
        continue to evolve, these initial decisions might hinder the system's ability to deliver changing
        business requirements. In the cloud, the capability to automate and test on demand lowers the risk 
        of impact from design changes. This allows systems to evolve over time so that 
        businesses can take advantage of innovations as a standard practice.
        
    Drive architectures using data: 
        In the cloud, you can collect data on how your architectural choices affect the behavior of your
        workload. This lets you make fact-based decisions on how to improve your workload. 
        Your cloud infrastructure is code, so you can use that data to inform your architecture choices and 
        improvements over time.
        
    Improve through game days: 
        Test how your architecture and processes perform by regularly scheduling game days to simulate 
        events in production. This will help you understand where improvements can be made and can help
        develop organizational experience in dealing with events.
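
The "stop guessing your capacity needs" principle can be illustrated with a small sketch of a proportional scaling calculation. This is a simplified model for illustration, not the algorithm used by any AWS service; the function name `desired_capacity`, the utilization figures, and the minimum and maximum bounds are all assumptions.

```python
import math


def desired_capacity(current_instances: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_instances: int = 1,
                     max_instances: int = 20) -> int:
    """Return the instance count that would bring average utilization near the target.

    Hypothetical helper: a simplified proportional model, for illustration only.
    """
    if current_instances < 1 or target_utilization <= 0:
        raise ValueError("need at least one instance and a positive target")
    # Assume total load stays constant and spread it over enough instances.
    needed = math.ceil(current_instances * current_utilization / target_utilization)
    # Clamp to the configured bounds so scaling never runs away.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    print(desired_capacity(4, 90.0, 60.0))  # scale out under heavy load
    print(desired_capacity(4, 15.0, 60.0))  # scale in when mostly idle
```

Recomputing the count from observed utilization, rather than fixing it at deployment time, is what lets capacity follow demand in both directions.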

### Operational Excellence
    The Operational Excellence pillar includes the ability to support development and run workloads
    effectively, gain insight into their operations, and to continuously improve supporting processes 
    and procedures to deliver business value.

#### Design Principles

    Perform operations as code:
        In the cloud, you can apply the same engineering discipline that you use for application code to 
        your entire environment. You can define your entire workload (applications, infrastructure) as 
        code and update it with code. You can implement your operations procedures as code and automate 
        their execution by triggering them in response to events. By performing operations as code, you 
        limit human error and enable consistent responses to events.
         
    Make frequent, small, reversible changes: 
        Design workloads to allow components to be updated regularly. Make changes in small increments 
        that can be reversed if they fail (without affecting customers when possible).
        
    Refine operations procedures frequently: 
        As you use operations procedures, look for opportunities to improve them. As you evolve your 
        workload, evolve your procedures appropriately. Set up regular game days to review and validate
        that all procedures are effective and that teams are familiar with them.
        
    Anticipate failure: 
        Perform “pre-mortem” exercises to identify potential sources of failure so that they can be 
        removed or mitigated. Test your failure scenarios and validate your understanding of their 
        impact. Test your response procedures to ensure that they are effective, and that teams are 
        familiar with their execution. Set up regular game days to test workloads and team responses 
        to simulated events.
        
    Learn from all operational failures: 
        Drive improvement through lessons learned from all operational events and failures. 
        Share what is learned across teams and through the entire organization.
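
As a minimal sketch of "perform operations as code", the snippet below registers operations procedures as functions and runs them in response to events. The event names (`disk_usage_high`, `health_check_failed`), the event fields, and the handler bodies are hypothetical; a real procedure would call whatever tooling your environment uses.

```python
from typing import Callable, Dict

# Registry mapping an event type to the procedure that handles it.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(event_type: str):
    """Decorator that registers a function as the procedure for an event type."""
    def register(func):
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_usage_high")  # hypothetical event name
def rotate_logs(event: dict) -> str:
    return f"rotated logs on {event['host']}"


@runbook("health_check_failed")  # hypothetical event name
def restart_service(event: dict) -> str:
    return f"restarted {event['service']} on {event['host']}"


def handle(event: dict) -> str:
    """Run the registered procedure, or escalate when none exists."""
    procedure = RUNBOOKS.get(event["type"])
    if procedure is None:
        return f"no runbook for {event['type']}; escalate to on-call"
    return procedure(event)
```

Because each procedure is ordinary code, it can be versioned, reviewed, and tested like the application itself, which is where the reduction in human error comes from.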

#### Practice Areas 
    
    • Organization
    • Prepare
    • Operate
    • Evolve

#### Organization
    
    Evaluate threats to the business (for example, business risk and liabilities, and information security
    threats) and maintain this information in a risk registry. Evaluate the impact of risks, and tradeoffs
    between competing interests or alternative approaches. For example, accelerating speed to market for
    new features may be emphasized over cost optimization, or you may choose a relational database for
    non-relational data to simplify the effort to migrate a system without refactoring. 
    Manage benefits and risks to make informed decisions when determining where to focus efforts. 
    Some risks or choices may be acceptable for a time, it may be possible to mitigate associated risks,
    or it may become unacceptable to allow a risk to remain, in which case you will take action to 
    address the risk.
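
A risk registry can be as simple as a list of scored entries. The sketch below assumes a 1 to 5 likelihood and impact scale and a threshold of 12; the field names, the scale, and the threshold are illustrative choices, not something the framework prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain), assumed scale
    impact: int      # 1 (minor) to 5 (severe), assumed scale
    owner: str
    mitigated: bool = False

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def needs_action(registry: List[Risk], threshold: int = 12) -> List[Risk]:
    """Return unmitigated risks at or above the threshold, highest score first."""
    open_risks = [r for r in registry if not r.mitigated and r.score >= threshold]
    return sorted(open_risks, key=lambda r: r.score, reverse=True)
```

Keeping an owner on every entry ties the registry back to the ownership practice, and the threshold makes explicit which risks are accepted for now and which require action.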
    
    Ensure that there are identified owners for each application, workload, platform, and infrastructure
    component, and that each process and procedure has an identified owner responsible for its definition,
    and owners responsible for their performance.
    
    Senior leadership should be the sponsor, advocate, and driver for the adoption of best practices and evolution of 
    the organization.
    Teams must grow their skill sets to adopt new technologies, and to support changes in demand and
    responsibilities.
    
    The Well-Architected Framework emphasizes learning, measuring, and improving. 
    You should use tools or services that enable you to centrally govern your environments across accounts,
    such as AWS Organizations, to help manage your operating models. 
    
    Services like AWS Control Tower expand this management capability by enabling you to define blueprints
    (supporting your operating models) for the setup of accounts, apply ongoing governance using AWS
    Organizations, and automate provisioning of new accounts.
    
    AWS provides the AWS Well-Architected Tool to help you review your approach prior to development, the 
    state of your workloads prior to production, and the state of your workloads in production. You can 
    compare workloads to the latest AWS architectural best practices, monitor their overall status, and 
    gain insight into potential risks. AWS Trusted Advisor is a tool that provides access to a core set
    of checks that recommend optimizations that may help shape your priorities.

#### Operational Excellence Organization Questions        
       
    OPS 1:  How do you determine what your priorities are?
        Everyone needs to understand their part in enabling business success. Have shared goals in order to set
        priorities for resources. This will maximize the benefits of your efforts.
    
    OPS 2:  How do you structure your organization to support your business outcomes?
        Your teams must understand their part in achieving business outcomes. Teams need to understand 
        their roles in the success of other teams, the role of other teams in their success, and have 
        shared goals. Understanding responsibility, ownership, how decisions are made, and who has authority 
        to make decisions will help focus efforts and maximize the benefits from your teams.
    
    OPS 3:  How does your organizational culture support your business outcomes?
        Provide support for your team members so that they can be more effective in taking action and 
        supporting your business outcome.

        Organizational culture has a direct impact on team member job satisfaction and retention. Enable 
        the engagement and capabilities of your team members to enable the success of your business. 
        Experimentation is required for innovation to happen and turn ideas into outcomes. Recognize that an 
        undesired result is a successful experiment that has identified a path that will not lead to success.
    
    Prepare
        Design your workload so that it provides the information necessary for you to understand its 
        internal state (for example, metrics, logs, events, and traces).
        Iterate to develop the elements necessary to monitor the health of your workload.
        Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that 
        do not have desired outcomes. 
        
        AWS enables you to view your entire workload (applications, infrastructure, policy, governance,
        and operations) as code.
        Using AWS CloudFormation enables you to have consistent, templated, sandbox development, test,
        and production environments with increasing levels of operations control.
        
    OPS 4:  How do you design your workload so that you can understand its state?
    Design your workload so that it provides the information necessary across all components (for example, 
    metrics, logs, and traces) for you to understand its internal state. This enables you to provide effective
    responses when appropriate.
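
One way to make a workload's internal state observable is to emit structured log lines that carry a trace ID and metric fields. The sketch below uses only the Python standard library; `handle_request`, the field names, and the `order_id` parameter are hypothetical.

```python
import json
import time
import uuid


def log_event(message: str, trace_id: str, **fields) -> str:
    """Build one JSON log line carrying a trace ID and arbitrary metric fields."""
    record = {"timestamp": time.time(), "trace_id": trace_id, "message": message}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def handle_request(order_id: str) -> list:
    """Hypothetical request handler that emits telemetry at each step."""
    trace_id = uuid.uuid4().hex  # one ID ties together all lines for this request
    started = time.perf_counter()
    lines = [log_event("order received", trace_id, order_id=order_id)]
    # ... business logic would run here ...
    elapsed_ms = (time.perf_counter() - started) * 1000
    lines.append(log_event("order processed", trace_id,
                           order_id=order_id, latency_ms=round(elapsed_ms, 3)))
    return lines
```

Machine-readable lines with a shared trace ID let a log aggregator reconstruct what happened to a single request across components.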
    
    OPS 5:  How do you reduce defects, ease remediation, and improve flow into production?
    Adopt approaches that improve flow of changes into production, that enable refactoring, fast feedback on
    quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, 
    and enable rapid identification and remediation of issues introduced through deployment activities.
    
    OPS 6:  How do you mitigate deployment risks?
    Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not 
    have desired outcomes. Using these practices mitigates the impact of issues introduced through the 
    deployment of changes.
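
Fast feedback and rapid recovery can be sketched as a rollback decision for a canary deployment: compare the new version's error rate with the current one and roll back when it is clearly worse. The function name, the 1% tolerance, and the minimum request count are assumptions chosen for illustration.

```python
def should_roll_back(baseline_errors: int, baseline_requests: int,
                     canary_errors: int, canary_requests: int,
                     tolerance: float = 0.01,
                     min_requests: int = 100) -> bool:
    """Return True when the canary error rate exceeds the baseline by more than `tolerance`.

    Hypothetical helper: thresholds are illustrative, not recommended values.
    """
    if canary_requests < min_requests:
        return False  # not enough traffic yet to judge the new version
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    return canary_rate - baseline_rate > tolerance
```

Evaluating this check automatically after each deployment step keeps the blast radius of a bad change limited to the canary's share of traffic.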
    
    OPS 7:  How do you know that you are ready to support a workload?
    Evaluate the operational readiness of your workload, processes and procedures, and personnel to understand 
    the operational risks related to your workload.
    
    Operate
        Define expected outcomes, determine how success will be measured, and identify metrics that will be 
        used in those calculations to determine if your workload and operations are successful.
         
        Operational health includes both the health of the workload and the health and success of the 
        operations activities performed in support of the workload (for example, deployment and incident 
        response). Establish metrics baselines for improvement, investigation, and intervention, collect and 
        analyze your metrics, and then validate your understanding of operations success and how it changes 
        over time. 
        
        You can leverage CloudWatch or third-party applications to aggregate and present business, workload, 
        and operations level views of operations activities. AWS provides workload insights through logging
        capabilities including AWS X-Ray, CloudWatch, CloudTrail, and VPC Flow Logs, enabling the identification
        of workload issues in support of root cause analysis and remediation.
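
Establishing a metrics baseline and flagging deviations from it can be sketched with the standard library. The three-sigma rule and the function name `find_anomalies` are assumptions made for this example; real alarms would be tuned to the metric in question.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(baseline: List[float], recent: List[float],
                   sigmas: float = 3.0) -> List[float]:
    """Return recent samples more than `sigmas` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    center = mean(baseline)
    spread = stdev(baseline)
    return [x for x in recent if abs(x - center) > sigmas * spread]
```

For example, with a latency baseline of `[100.0, 102.0, 98.0, 101.0, 99.0]`, a recent sample of `140.0` is flagged for investigation while `101.0` and `97.0` are not.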
        
    OPS 8:  How do you understand the health of your workload?
    Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take
    appropriate action.
    
    OPS 9:  How do you understand the health of your operations?
    Define, capture, and analyze operations metrics to gain visibility to operations events so that you can 
    take appropriate action.
    
    OPS 10:  How do you manage workload and operations events?
    Prepare and validate procedures for responding to events to minimize their disruption to your workload.
    
    Evolve
        On AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-term 
        storage. Using AWS Glue, you can discover and prepare your log data in Amazon S3 for analytics, and 
        store associated metadata in the AWS Glue Data Catalog. Amazon Athena, through its native integration 
        with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a 
        business intelligence tool like Amazon QuickSight, you can visualize, explore, and analyze your data,
        discovering trends and events of interest that may drive improvement.
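
The idea of querying log data with standard SQL can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Amazon Athena, so it shows the shape of the analysis rather than the Athena API; the table layout and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical log records: (timestamp, service, status_code, latency_ms)
LOG_ROWS = [
    ("2024-01-01T10:00:00", "checkout", 200, 120),
    ("2024-01-01T10:00:05", "checkout", 500, 900),
    ("2024-01-01T10:00:09", "search", 200, 45),
    ("2024-01-01T10:00:12", "checkout", 500, 870),
    ("2024-01-01T10:00:20", "search", 200, 50),
]


def error_counts(rows):
    """Load rows into an in-memory table and count 5xx responses per service."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE logs (ts TEXT, service TEXT, status INTEGER, latency_ms INTEGER)"
        )
        conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT service, COUNT(*) AS errors FROM logs "
            "WHERE status >= 500 GROUP BY service ORDER BY errors DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

A query like this surfaces which service produces the most server errors, the kind of trend that can then drive an improvement item.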
        
    OPS 11:  How do you evolve operations?
    Dedicate time and resources for continuous incremental improvement to evolve the effectiveness and 
    efficiency of your operations.

### Security
    Design Principles
        Implement a strong identity foundation
        Enable traceability
        Apply security at all layers
        Automate security best practices
        Protect data in transit and at rest
        Keep people away from data
        Prepare for security events
    
    Security
        SEC 1:  How do you securely operate your workload?
        To operate your workload securely, you must apply overarching best practices to every area of security. 
        Take requirements and processes that you have defined in operational excellence at an organizational and
        workload level, and apply them to all areas. Staying up to date with AWS and industry recommendations 
        and threat intelligence helps you evolve your threat model and control objectives. Automating 
        security processes, testing, and validation allows you to scale your security operations.
        
    Identity and Access Management
        SEC 2:  How do you manage identities for people and machines?
        There are two types of identities you need to manage when approaching operating secure AWS workloads.
        Understanding the type of identity you need to manage and grant access helps you ensure the right
        identities have access to the right resources under the right conditions. 
        Human Identities: Your administrators, developers, operators, and end users require an identity to 
        access your AWS environments and applications. These are members of your organization, or external 
        users with whom you collaborate, and who interact with your AWS resources via a web browser, client
        application, or interactive command-line tools. 
        Machine Identities: Your service applications, operational tools, and workloads require an identity 
        to make requests to AWS services - for example, to read data. These identities include machines 
        running in your AWS environment such as Amazon EC2 instances or AWS Lambda functions. You may also 
        manage machine identities for external parties who need access. Additionally, you may have machines
        outside of AWS that need access to your AWS environment.
        
        SEC 3:  How do you manage permissions for people and machines?
        Manage permissions to control access to people and machine identities that require access to AWS and 
        your workload. Permissions control who can access what, and under what conditions.
        
        Programmatic access including API calls to AWS services should be performed using temporary and 
        limited-privilege credentials such as those issued by the AWS Security Token Service.
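
Limited-privilege access can be made concrete with a narrowly scoped IAM policy document. The sketch below builds one as a Python dictionary; the helper name `read_only_bucket_policy` and the bucket name are placeholders, and a policy like this could be attached to a role that machines assume to obtain temporary credentials.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that allows only object reads from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # "example-reports-bucket" is a placeholder, not a real resource.
    print(json.dumps(read_only_bucket_policy("example-reports-bucket"), indent=2))
```

Granting a single action on a single resource answers "who can access what" explicitly, and anything not listed is denied by default.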
        
    Detection
        CloudTrail logs, AWS API calls, and CloudWatch provide monitoring of metrics with alarming, and AWS 
        Config provides configuration history. Amazon GuardDuty is a managed threat detection service that
        continuously monitors for malicious or unauthorized behavior to help you protect your AWS accounts 
        and workloads. Service-level logs are also available; for example, you can use Amazon Simple Storage
        Service (Amazon S3) to log access requests.
        
        SEC 4:  How do you detect and investigate security events?
        Capture and analyze events from logs and metrics to gain visibility. Take action on security events 
        and potential threats to help secure your workload.
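
Capturing and analyzing events from logs can be sketched as a simple aggregation over sign-in records. The event fields used here (`event`, `outcome`, `source_ip`) are hypothetical and do not reflect the CloudTrail record format; the failure threshold is also an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List


def suspicious_sources(events: Iterable[Dict[str, str]],
                       max_failures: int = 3) -> List[str]:
    """Return source IPs with more than `max_failures` failed sign-in events, sorted."""
    failures = Counter(
        e["source_ip"] for e in events
        if e.get("event") == "sign_in" and e.get("outcome") == "failure"
    )
    return sorted(ip for ip, count in failures.items() if count > max_failures)
```

The returned addresses are candidates for investigation or automated response, which is the "take action" half of the practice.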
        
    Infrastructure Protection
        SEC 5:  How do you protect your network resources?
        Any workload that has some form of network connectivity, whether it’s the internet or a private network,
        requires multiple layers of defense to help protect from external and internal network-based threats.
        
        SEC 6:  How do you protect your compute resources?
        Compute resources in your workload require multiple layers of defense to help protect from
        external and internal threats. Compute resources include EC2 instances, containers, AWS Lambda 
        functions, database services, IoT devices, and more.
        
    Data Protection
        SEC 7:  How do you classify your data?
        Classification provides a way to categorize data, based on criticality and sensitivity in order to help 
        you determine appropriate protection and retention controls.
        
        SEC 8:  How do you protect your data at rest?
        Protect your data at rest by implementing multiple controls to reduce the risk of unauthorized access
        or mishandling.
        
        SEC 9:  How do you protect your data in transit?
        Protect your data in transit by implementing multiple controls to reduce the risk of unauthorized 
        access or loss.
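
Classification becomes actionable when each level maps to a minimum set of protection and retention controls. The levels, control names, and retention periods below are assumptions for illustration only; substitute your organization's own policy.

```python
from enum import Enum


class Classification(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3


# Assumed mapping from classification to minimum controls; adjust to your own policy.
CONTROLS = {
    Classification.PUBLIC: {
        "encrypt_at_rest": False, "encrypt_in_transit": True, "retention_days": 30},
    Classification.INTERNAL: {
        "encrypt_at_rest": True, "encrypt_in_transit": True, "retention_days": 90},
    Classification.CONFIDENTIAL: {
        "encrypt_at_rest": True, "encrypt_in_transit": True, "retention_days": 365},
}


def required_controls(tags: dict) -> dict:
    """Look up controls from a resource's 'classification' tag, defaulting to the strictest."""
    label = tags.get("classification", "").upper()
    level = Classification.__members__.get(label, Classification.CONFIDENTIAL)
    return CONTROLS[level]
```

Defaulting untagged or mislabeled data to the strictest level means a missing classification fails safe rather than open.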
        
    Incident Response
        SEC 10:  How do you anticipate, respond to, and recover from incidents?
        Preparation is critical to timely and effective investigation, response to, and recovery from 
        security incidents to help minimize disruption to your organization.

### Reliability
    Design Principles
        Automatically recover from failure
        Test recovery procedures
        Scale horizontally to increase aggregate workload availability
        Stop guessing capacity
        Manage change in automation
        
    Foundations
        