# Reliable, Scalable and Maintainable Applications

* Reliability - Tolerating hardware & software faults, human error
* Scalability - Measuring load & performance, Latency percentiles/throughput
* Maintainability - Operability, simplicity & evolvability

The amount, complexity, and rate of change of data cause most of the problems in today's applications

### Standard Building Blocks for most applications

* **Databases** - Store data so that it can be found again
* **Caches** - Remember the result of an expensive operation, to speed up reads
* **Search indexes** - Search or filter data based on keywords
* **Stream processing** - Asynchronous handling of messages
* **Batch processing** - Periodically crunch a large amount of accumulated data

A single building block/tool is typically not enough to build an application. Often we need to break the problem down into multiple tasks, each handled by a suitable tool, and stitch those tools together in application code

## Reliability

Continuing to work correctly, even when things go wrong. A reliable application:
* Performs the function that the user expects
* Tolerates user mistakes and unexpected usage
* Performs well enough under the expected load and data volume
* Prevents unauthorised access and abuse

**Fault is not the same as Failure**

* A fault is typically one component deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user
* Faults are hard to avoid entirely, so it is best to design fault-tolerance mechanisms that prevent faults from causing failures
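
The difference between a fault and a failure can be sketched in a few lines. This is a minimal illustration, not a real client library: the replica functions, the `ReplicaDown` exception, and the node names are all hypothetical. One replica is faulty, but the read still succeeds because the caller fails over to the next one.

```python
class ReplicaDown(Exception):
    """Raised by a (hypothetical) replica that has suffered a fault."""


def make_replica(name, healthy=True):
    """Build a toy replica: a function that serves reads unless it is faulty."""
    def read(key):
        if not healthy:
            raise ReplicaDown(f"{name} is unavailable")
        return f"{key}@{name}"
    return read


def read_with_failover(replicas, key):
    """Try each replica in turn; fail only if every replica is faulty."""
    errors = []
    for read in replicas:
        try:
            return read(key)
        except ReplicaDown as exc:
            # A fault: one component deviated from its spec.
            errors.append(str(exc))
    # A failure: the system as a whole cannot serve the request.
    raise RuntimeError(f"all replicas failed: {errors}")


replicas = [make_replica("node-a", healthy=False), make_replica("node-b")]
print(read_with_failover(replicas, "user:42"))  # user:42@node-b
```

The fault in `node-a` is absorbed by the redundancy; only when every replica is down does the caller see a failure.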

### Hardware Faults

* When you have a lot of machines in large datacenters, the probability that some hardware is faulty at any given time is high
* Causes include hard disk crashes, RAM errors, power grid blackouts, and human error
* Redundancy (e.g. disks set up in RAID) - We add redundant components so that when one component dies, another can take over
* Software fault-tolerance techniques are increasingly preferred over, or used in addition to, hardware redundancy. They also have operational advantages: a system that tolerates the loss of a whole machine can be patched with a rolling upgrade (one node at a time) instead of requiring downtime
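
A rolling upgrade can be sketched as a loop that takes one node out of rotation at a time. This is only an illustration under simple assumptions: the node dictionaries, the `min_serving` threshold, and the version strings are hypothetical, and a real deployment would also run health checks before moving on.

```python
def rolling_upgrade(nodes, new_version, min_serving=1):
    """Upgrade nodes one at a time; return the lowest number of nodes
    that were serving traffic at any point during the upgrade."""
    lowest = len(nodes)
    for node in nodes:
        others = sum(1 for n in nodes if n is not node and n["serving"])
        if others < min_serving:
            raise RuntimeError("not enough serving nodes to continue safely")
        node["serving"] = False        # take this node out of rotation
        lowest = min(lowest, sum(1 for n in nodes if n["serving"]))
        node["version"] = new_version  # patch or reboot it
        node["serving"] = True         # put it back before touching the next
    return lowest


cluster = [{"name": f"node-{i}", "version": "1.0", "serving": True}
           for i in range(3)]
lowest = rolling_upgrade(cluster, "1.1")
print([n["version"] for n in cluster])  # ['1.1', '1.1', '1.1']
print(lowest)                           # 2
```

At least two of the three nodes keep serving throughout, so the upgrade needs no downtime for the system as a whole.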

### Software Faults
* Hardware faults are typically random and independent of each other (one machine failing does not usually imply that another will fail)
* Systematic errors are a different class of fault. They are harder to anticipate because they are correlated across nodes. Examples:
  * A software bug that crashes the application server when given some bad input
  * A runaway process that uses up a shared resource (CPU time, memory, disk space, network bandwidth)
  * A service that the system depends on slows down, becomes unresponsive, or sends corrupted responses
  * Cascading failures, where a fault in one component triggers a fault in another component

### Human Errors

* How do we make systems reliable, in spite of unreliable humans designing, building, and operating them?
  * Design systems in a way that minimises opportunities for error - well-designed abstractions, APIs, and admin interfaces
  * Provide sandbox environments where people can explore and experiment safely using real data, without affecting real users
  * Test rigorously at all levels, from unit tests to whole-system integration tests, including corner cases
  * Make recovery from human errors easy - enable fast rollback of configuration changes and code
  * Set up detailed, clear monitoring - performance metrics and error rates (called telemetry in other engineering disciplines)

## Scalability

* Even if a system is working reliably today, it may not necessarily mean that it will continue to do so in future
* This could be because of increased load in number of users or data volume
* Scalability is the term we use to describe a system's ability to cope with increased load

### Defining Load

* We define load with a few numbers referred to as load parameters

* Best choice of parameters depends on the architecture of system. Examples:
  * Requests per second (web server)
  * Ratio of reads to writes (database)
  * Number of concurrent users (chatroom)
  * Hit rate (cache)
* Whether the average case or a small number of extreme cases matters most depends on the use case
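
As a rough sketch of how load parameters might be derived in practice, the snippet below computes a few of them from a request log. The log format (timestamp in seconds, operation name) and the sample entries are made up for illustration.

```python
from collections import Counter

# Hypothetical request log: (timestamp in seconds, operation)
log = [
    (0.1, "read"), (0.4, "read"), (0.7, "write"),
    (1.2, "read"), (1.3, "read"), (1.9, "read"),
    (2.5, "write"), (2.8, "read"),
]


def load_parameters(log):
    """Summarise a request log as a handful of load parameters."""
    ops = Counter(op for _, op in log)
    # Bucket requests into one-second windows to find the peak.
    per_second = Counter(int(ts) for ts, _ in log)
    duration = max(per_second) - min(per_second) + 1
    return {
        "avg_requests_per_sec": len(log) / duration,
        "peak_requests_per_sec": max(per_second.values()),
        # Assumes the log contains at least one write.
        "read_write_ratio": ops["read"] / ops["write"],
    }


params = load_parameters(log)
print(params["peak_requests_per_sec"])  # 3
print(params["read_write_ratio"])       # 3.0
```

Reporting both the average and the peak reflects the point above: depending on the use case, either the typical load or the extreme case may be the one that matters.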


* Twitter example (2012 data)
    * Posting a tweet - User can publish a new message to their followers (4.6k requests per second on average, over 12k requests per second at peak)
    * Home timeline - User can view tweets posted by people they follow (300k requests per sec)
    * The main challenge here is *fanout*: each user follows many people, and each user is followed by many people. The term is borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate's output; the output needs to supply enough current to drive all the attached inputs. In transaction processing systems, it describes the number of requests to other services that we need to make in order to serve one incoming request
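
To make the idea of fanout concrete, here is a toy sketch in which posting a tweet writes a copy into each follower's home timeline. It shows one possible way to handle the two operations, not Twitter's actual implementation; the follower graph, user names, and in-memory timelines are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical follower graph: author -> set of followers
followers = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice"},
}

# user -> list of (author, text) tweets delivered to that user
home_timelines = defaultdict(list)


def post_tweet(author, text):
    """Deliver a tweet to every follower; return the fanout of this request."""
    writes = 0
    for follower in followers.get(author, ()):
        home_timelines[follower].append((author, text))
        writes += 1
    return writes


print(post_tweet("alice", "hello"))  # 3
print(home_timelines["bob"])         # [('alice', 'hello')]
```

One incoming request (a tweet from `alice`) turns into three downstream writes, so the cost of a post grows with the author's follower count rather than with tweet volume alone.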