# Designing Data-Intensive Applications Notes

## Chapter 1: Reliable, Scalable, and Maintainable Applications

> The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?
—Alan Kay, in interview with Dr Dobb’s Journal (2012)

* Shifts Today: Many applications are *data-intensive* rather than *compute-intensive*
* Example Application Functions:
    * Storing data for later usage -> databases
    * Remembering the result of an expensive operation -> caches
    * Allowing users to search / filter data -> search indexes
    * Sending a message to another process to be handled asynchronously -> stream processing
    * Periodically crunching a large amount of accumulated data -> batch processing
* So then why are there multiple different ways to do the above?
    * Different applications have different requirements
    
### Thinking About Data Systems
* Many tools no longer fit into their traditional categories
    * Redis: A datastore that is also used as a message queue
    * Kafka: Message Queues with database-like durability guarantees
* More and more apps are moving towards a microservice framework
    * Work gets broken down into tasks that go to their specialized tool
    * The tools are then stitched together using code to work as one
    * The resulting service's API hides this back-end complexity from its clients
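
As a minimal sketch of tools stitched together behind one API, the hypothetical `ProfileService` below combines a primary store and a cache. Both are plain dictionaries standing in for real systems, and the class name, method names, and `database_reads` counter are assumptions made for illustration only.

```python
class ProfileService:
    """Hypothetical service that hides a database and a cache behind one API."""

    def __init__(self):
        self._database = {}      # stand-in for a durable primary database
        self._cache = {}         # stand-in for an in-memory cache
        self.database_reads = 0  # counter used only to make cache hits visible

    def save(self, user_id, profile):
        # Write to the primary store, then invalidate the stale cache entry.
        self._database[user_id] = profile
        self._cache.pop(user_id, None)

    def get(self, user_id):
        # Serve from the cache when possible; fall back to the database on a miss.
        if user_id in self._cache:
            return self._cache[user_id]
        self.database_reads += 1
        profile = self._database[user_id]
        self._cache[user_id] = profile
        return profile


service = ProfileService()
service.save("u1", {"name": "Ada"})
first = service.get("u1")   # cache miss: reads the database
second = service.get("u1")  # cache hit: the database is not touched
```

Callers only see `save` and `get`; keeping the cache consistent with the database is the job of the application code inside the service.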
    
* The big ideas behind reliable, scalable, & maintainable systems
    * Reliability:
        * The system should continue to work correctly in the face of adversity
            * Hardware or software faults & human error
    * Scalability:
        * As the system grows, there should be ways of dealing with that growth
        * Measuring load and performance
        * Latency percentiles, throughput
    * Maintainability:
        * Over time, many different people will work on the system, and they should all be able to work on it productively
        * Operability
        * Simplicity
        * Evolvability   
---

### Reliability
* Typical definitions of reliability:
    * The application performs as expected and can tolerate unexpected mistakes
        * "Continuing to work correctly, even when things go wrong"
* Faults != Failures
    * Faults: When one component deviates from its spec
    * Failures: When the system as a whole stops providing its service
* How can we best ensure reliability?
    * Increase the rate of faults by triggering them deliberately!
    * Test how fault-tolerant your system truly is
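
The difference between a fault and a failure, and the idea of triggering faults on purpose, can be sketched roughly as follows. The `Replica` class and `read_with_failover` function are hypothetical names invented for this illustration; real fault-injection tools work on live infrastructure rather than in-memory objects.

```python
class Replica:
    """Hypothetical storage node that can be switched into a faulty state."""

    def __init__(self, data):
        self.data = dict(data)
        self.faulty = False

    def read(self, key):
        if self.faulty:
            raise ConnectionError("replica unavailable")  # a fault in one component
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn; fail only if every replica is faulty."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ConnectionError:
            continue  # tolerate the fault and move on to the next replica
    raise RuntimeError("all replicas failed")  # a failure of the whole system


replicas = [Replica({"k": "v"}) for _ in range(3)]

# Deliberately inject faults into two of the three replicas.
replicas[0].faulty = True
replicas[1].faulty = True
value = read_with_failover(replicas, "k")  # still served by the third replica
```

Two injected faults are tolerated here; only when all three replicas are faulty does the read turn into a failure.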
    
#### Hardware Faults
* Common Faults:
    * Hard disks crashing
    * RAM becoming faulty
    * Unplugged cables
* Given a storage cluster of 10,000 disks, each with a mean time to failure (MTTF) of 10 to 50 years, expect on average about one disk failure per day
* As data volumes and computing demands grow, applications use more machines, which proportionally increases the rate of hardware faults per cluster
* How do we deal with this:
    * Add redundancy to reduce the system failure rate
        * Physical: Hardware components can be duplicated
        * Software: Systems that tolerate the loss of a machine can be patched one node at a time, without downtime (a rolling upgrade)
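
The disk failure estimate above can be checked with a rough calculation. This sketch assumes failures are spread evenly over time and a 365-day year; the function and constant names are made up for the example.

```python
DISKS = 10_000
DAYS_PER_YEAR = 365


def expected_failures_per_day(disks, mttf_years):
    """Expected disk failures per day, assuming failures are spread evenly over time."""
    return disks / (mttf_years * DAYS_PER_YEAR)


pessimistic = expected_failures_per_day(DISKS, 10)  # about 2.7 failures per day
optimistic = expected_failures_per_day(DISKS, 50)   # about 0.55 failures per day
```

Both ends of the MTTF range land within a small factor of one failure per day, which is where the rule of thumb comes from.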

#### Software Errors
* Systematic errors are correlated across nodes rather than confined to a single node
* How do we deal with this?
    * Evaluate and question your assumptions and system interactions
    * Create robust and detailed error logging

#### Human Errors
* Humans design, build, and operate software systems
    * Humans are unreliable :(
* How do we deal with this?
    * Minimize opportunities for errors
        * APIs and other interfaces
    * Separate the places where people make frequent errors from the places of high importance (e.g. a sandbox environment apart from production)
    * Utilize automated unit testing and system integration tests
    * Set up detailed and clear monitoring / logging

#### How Important Is Reliability?
* Application bugs can lead to huge physical, monetary, and legal problems 
* It's all a balancing act between reliability and profitability

---
### Scalability

* Scalability: A system's ability to cope with increased load
* How do we grow?
* How can we add computing resources to handle more load?

#### Describing Load
* Load Parameters:
    * Architecture specific
    * Ex: requests / second, read/write ratio, # of simultaneous users

#### Describing Performance
* Once you define your load parameters, then you can see what happens as the load increases. 
* Potential paths to check:
    * If you increase a load parameter and hold resources constant, what happens to system performance?
    * If you increase a load parameter and want to keep performance constant, how much do you need to increase resources?
* Latency: Time a request spends waiting to be handled
* Response Time: Time from the client sending a request to receiving the result (service time plus network and queueing delays)

* Response time must be viewed as a distribution of values across many requests, not as a single number
    * As such, pay attention to the right-most tail (high percentiles such as p95 and p99): the slowest requests often belong to the users with the most data -> heavy users
* It's all about balancing performance and cost
* Head-of-line blocking: When a few slow requests hold up the processing of the faster requests queued behind them
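
To make percentiles and head-of-line blocking concrete, here is a small sketch. It uses the nearest-rank percentile method (one of several common definitions), made-up sample data, and a deliberately simplified queue in which all requests arrive at once and a single worker serves them in order. The function names are assumptions for this example.

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile: smallest sample with at least `percent`% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def fifo_response_times(service_times_ms):
    """Response times when all requests arrive together and one worker serves them in order."""
    clock = 0
    response_times = []
    for service_time in service_times_ms:
        clock += service_time  # time spent queued plus the request's own service time
        response_times.append(clock)
    return response_times


# Made-up sample: 90 fast requests, 8 slower ones, and 2 outliers.
samples_ms = [20] * 90 + [50] * 8 + [900, 1500]
mean_ms = sum(samples_ms) / len(samples_ms)  # 46.0
p50 = percentile(samples_ms, 50)             # 20
p95 = percentile(samples_ms, 95)             # 50
p99 = percentile(samples_ms, 99)             # 900

# One slow request (100 ms) delays the fast 1 ms requests queued behind it.
blocked = fifo_response_times([1, 1, 100, 1, 1])  # [1, 2, 102, 103, 104]
```

The mean of 46 ms says little about what users experience: half of the requests finish in 20 ms, while one in a hundred takes 900 ms or more. In the queue example, the last two requests need only 1 ms of work each but see response times above 100 ms.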

#### Approaches for Coping with Load
* Vertical Scaling:
    * Scaling ***Up***
    * Moving to a more powerful machine
* Horizontal Scaling:
    * Scaling ***Out***
    * Distributing the load across multiple machines
* Good designs typically use a pragmatic mix of both scaling types
* Adjusting Resources to Load:
    * Elastic: 
        * The system automatically adds or removes computing resources when it detects a change in load
        * Generally useful for highly unpredictable loads
    * Manual:
        * A human analyzes capacity and decides when to add machines to the system
        * Generally simpler, with fewer operational surprises
    * Architectures for systems that operate at large scale are usually highly specific to the application
    * 100,000 requests / sec at 1 kB each != 3 requests / min at 2 GB each
        * Despite having the same data throughput!
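
The comparison above can be checked with simple arithmetic. This sketch assumes decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes); the function name is made up for the example. Both workloads come out at 100 MB per second, even though they would call for very different designs.

```python
KB = 1_000          # assumption: decimal kilobyte
GB = 1_000_000_000  # assumption: decimal gigabyte


def throughput_bytes_per_second(requests, request_size_bytes, period_seconds):
    """Data throughput for a number of equally sized requests per period."""
    return requests * request_size_bytes / period_seconds


many_small = throughput_bytes_per_second(100_000, 1 * KB, 1)  # 100,000 x 1 kB per second
few_large = throughput_bytes_per_second(3, 2 * GB, 60)        # 3 x 2 GB per minute
same_throughput = many_small == few_large                     # both are 100 MB per second
```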
---
### Maintainability

* Maintainability: Fixing bugs, investigating failures, modifying old code for new use cases, and adding new features

#### Operability: Making Life Easy for Operations
* Good operability means making routine tasks easy


---
#### Simplicity
* As projects get larger, they often become complex and difficult to understand
* Symptoms of Complexity:
    * Inconsistent naming / terminology
    * Hacks made for solving performance issues
    * Special cases to work around issues
* ^ Complexity = ^ Maintenance Costs
* Removing complexity != Reducing Functionality
    * Accidental Complexity: 
        * Complexity that arises only from the implementation and is not inherent in the problem the software solves for its users
    * Abstraction:
        * Hides implementation detail behind a simple interface and allows reuse of efficient implementations

---
#### Evolvability

* Plan for system requirements to change over time
* The ease of modifying a system is closely linked to its simplicity and its abstractions!
