# Data-Intensive Applications

- data-intensive: the limiting factor is rarely CPU power, but rather the amount of data, its complexity, and the speed at which it changes
- a data-intensive application is built from standard building blocks that provide commonly needed functionality
    - store data so applications can find it later (databases)
    - remember the result of an expensive operation to speed up reads (caches)
    - allow users to search data by keyword or filter it in various ways (search indexes)
    - send a message to another process to be handled asynchronously (stream processing)
    - periodically crunch a large amount of accumulated data (batch processing)




## Reliability 

- the system continues to work correctly even when things go wrong

### Hardware Faults

- e.g., hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years, so on a storage cluster with 10,000 disks we should expect on average one disk to die per day
- first response is usually to add redundancy to the individual hardware
- as data volumes and applications' computing demands increase, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults
- there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques
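
The "one disk per day" figure can be checked with a quick calculation. This is a minimal sketch that assumes disks fail independently at a constant rate of 1 / MTTF; the function name `expected_failures_per_day` is made up for illustration.

```python
DAYS_PER_YEAR = 365


def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent disks that
    each fail at a constant rate of 1 / MTTF (a simplifying assumption)."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)


# Cluster size and MTTF range taken from the notes above.
for mttf_years in (10, 30, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {rate:.2f} failures/day")
```

Across the 10 to 50 year range the result lands between roughly half a disk and three disks per day, which is where the "about one per day" rule of thumb comes from.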


### Software Errors
- hardware faults are random and independent from each other (usually)
- systematic errors within the system: harder to anticipate, correlated across nodes, and tend to cause many more system failures than uncorrelated hardware faults
    - software bugs
    - a runaway process that uses up a shared resource
    - a service that the system depends on fails
    - cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults
- potential solutions
    - thorough testing
    - process isolation
    - allowing processes to crash and restart
    - measuring, monitoring, and analyzing system behavior
    - SLA monitoring

### Human Errors
- design systems in a way that minimizes opportunities for error
    - well-designed abstractions and APIs that encourage the "golden path" and discourage risky operations
    - decouple error-prone areas from failure-sensitive areas
        - e.g., provide development sandbox 
    - test thoroughly
    - allow quick and easy recovery from human errors, to minimize the impact
        - make it easy and fast to roll back config changes
        - roll out new code gradually (rolling deploys)
    - detailed and clear monitoring, such as performance metrics and error rates
    


## Scalability

- the system's ability to cope with increased load
- if a system grows in a particular way, what are the options for coping with the growth?

### Describing Load
- load can be described with a few numbers called *load parameters*
    - requests per second to a web server
    - ratio of reads to writes in a database
    - number of simultaneously active users in a chat room
    - hit rate on a cache
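
As a rough illustration, some of these load parameters can be derived from a request log. The `Request` record and `load_parameters` function below are hypothetical, not part of any real monitoring tool; the sketch only shows how each number is defined.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation window
    is_write: bool    # True for writes, False for reads
    cache_hit: bool   # only meaningful for reads


def load_parameters(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize a window of requests as a few load parameters."""
    reads = sum(1 for r in requests if not r.is_write)
    writes = len(requests) - reads
    hits = sum(1 for r in requests if not r.is_write and r.cache_hit)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "read_write_ratio": reads / writes if writes else float("inf"),
        "cache_hit_rate": hits / reads if reads else 0.0,
    }


# Made-up sample data: three reads (two served from cache) and one write.
sample = [
    Request(0.1, is_write=False, cache_hit=True),
    Request(0.4, is_write=False, cache_hit=False),
    Request(0.9, is_write=True, cache_hit=False),
    Request(1.5, is_write=False, cache_hit=True),
]
print(load_parameters(sample, window_seconds=2.0))
```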

**Example: Twitter**
- operations involved: post tweet and browse home timeline
    - post tweet: publish a new message to followers (4.6k requests/sec on avg, over 12k requests/sec at peak)
    - home timeline: view tweets posted by users they follow (300k requests/sec)
- handling 12,000 writes per second would be fairly easy; Twitter's scaling challenge is not primarily due to tweet volume, but due to *fan-out*
    - *fan-out*: a term borrowed from electronic engineering, where it describes the number of logic gate inputs attached to another gate's output (the output must supply enough current to drive all the attached inputs); in transaction processing systems, it refers to the number of requests to other services that we need to make in order to serve one incoming request
- implementation 1: global collection
    - posting simply inserts the new tweet into a global collection of tweets
    - when a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them

```
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
```

<img src="img/Snip20190913_61.png"/>

- implementation 2: mailbox
    - maintain a cache for each user's home timeline
    - when a user posts a tweet, look up all the people who follow the user, and insert the new tweet into each of their home timeline caches
    - reading the home timeline is then cheap because the result is computed ahead of time
    
<img src="img/Snip20190913_62.png"/>

- the global collection approach struggled to keep up with the load of home timeline queries
- the mailbox approach works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so in this case it's preferable to do more work at write time and less at read time
- the mailbox approach is extremely expensive when the tweet poster has many followers
- so the distribution of followers per user (maybe weighted by how often those users tweet) is a **key load parameter** for scalability as it determines the fan-out load
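
The trade-off between the two implementations can be sketched with in-memory data structures. This is an illustrative toy, not how Twitter is built: the class and method names are invented, tweets are plain tuples, and the mailbox version does not backfill a timeline when someone follows a user after they have tweeted.

```python
from collections import defaultdict


class GlobalCollectionTimeline:
    """Implementation 1: cheap writes, expensive reads."""

    def __init__(self):
        self.tweets = []                   # (sequence number, sender, text)
        self.following = defaultdict(set)  # follower -> set of followees

    def follow(self, follower, followee):
        self.following[follower].add(followee)

    def post(self, sender, text):
        # A single insert, no matter how many followers the sender has.
        self.tweets.append((len(self.tweets), sender, text))

    def home_timeline(self, user):
        # All the work happens at read time (the JOIN in the SQL query).
        followees = self.following[user]
        return [t for t in reversed(self.tweets) if t[1] in followees]


class MailboxTimeline:
    """Implementation 2: expensive writes (fan-out), cheap reads."""

    def __init__(self):
        self.followers = defaultdict(set)   # followee -> set of followers
        self.timelines = defaultdict(list)  # user -> cached home timeline
        self.seq = 0

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def post(self, sender, text):
        tweet = (self.seq, sender, text)
        self.seq += 1
        # Fan-out: one cache write per follower of the sender.
        for follower in self.followers[sender]:
            self.timelines[follower].append(tweet)
        return len(self.followers[sender])  # number of cache writes made

    def home_timeline(self, user):
        # The result was computed at write time; just return it, newest first.
        return list(reversed(self.timelines[user]))


for impl in (GlobalCollectionTimeline(), MailboxTimeline()):
    impl.follow("bob", "alice")
    impl.follow("carol", "alice")
    impl.post("alice", "hello")
    print(type(impl).__name__, impl.home_timeline("carol"))
```

Both versions return the same timeline; they differ in where the cost is paid. In the mailbox version the number of writes per `post` equals the poster's follower count, which is why the follower distribution matters so much.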

### Describing Performance

- investigate what happens when the load increases
    - what is the performance impact when a certain load parameter is increased if system resources remain unchanged
    - how much system resources need to be scaled up if a load parameter is increased, to keep performance unchanged
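- in batch processing, the metric that usually matters is *throughput*: the number of records processed per second, or the total time it takes to run a job on a dataset of a certain size
- in online systems, what usually matters more is *response time*: the time between a client sending a request and receiving a response
- *percentiles*: sort the list of response times from fastest to slowest and read off the value at a given position
    - *p50* (the median), *p95*, *p99*, *p999* (the 99.9th percentile)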
    - if the 95th percentile response time is 1.5 seconds, then 95 out of 100 requests take less than 1.5 seconds
    - high percentiles of response times, also known as *tail latencies*, are important because they directly affect users' experience of the service
- optimizing for very high percentiles can be hard and expensive: they are easily affected by random events outside of your control, and the benefits are diminishing
- percentiles are often used in *service level objectives* (SLOs) and *service level agreements* (SLAs), contracts that define the expected performance and availability of a service
- queueing delays often account for a large part of the response time at high percentiles
    - a server can only process a small number of things in parallel (e.g., limited by its number of CPU cores)
    - *head-of-line blocking*: it only takes a small number of slow requests to hold up the processing of subsequent requests
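
To make the percentile definition concrete, here is a minimal sketch using the nearest-rank method (one of several common definitions; monitoring tools may interpolate or approximate instead). The `percentile` helper and the sample response times are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the values are less than or equal to it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)                  # sort from low to high
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the list
    return ordered[max(rank, 1) - 1]


# Made-up response times in milliseconds, with two slow outliers.
response_times_ms = [12, 15, 11, 240, 14, 13, 18, 16, 950, 17]
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
```

With this sample the median stays low while the high percentiles are dominated by the outliers, which is why the mean or median alone hides tail latency.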

**Percentiles in Practice**
- high percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request
    - it takes one slow call to make the entire end-user request slow
    - even if only a small percentage of backend calls are slow, the chance of getting a slow call increases w/ the # of calls needed
    - this is called *tail latency amplification*
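
The effect can be quantified under a simplifying assumption: if each backend call is slow independently with probability `p`, and the end-user request has to wait for all `n` calls, the request is slow with probability `1 - (1 - p)^n`. The function name below is made up for illustration.

```python
def chance_of_slow_request(p_slow_call: float, num_calls: int) -> float:
    """Probability that at least one of num_calls backend calls is slow,
    assuming each call is slow independently with probability p_slow_call."""
    return 1 - (1 - p_slow_call) ** num_calls


# Example: 1% of backend calls are slow (i.e. slower than the backend's p99).
for n in (1, 10, 100):
    print(f"{n:>3} backend calls: {chance_of_slow_request(0.01, n):.1%} of requests are slow")
```

Even though only 1 in 100 backend calls is slow in this example, a request that needs 100 such calls is slow more often than not.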


### Approaches for Coping with Load

- ***scaling up***: vertical scaling, moving to a more powerful machine
- ***scaling out***: horizontal scaling, distributing the load across multiple smaller machines
- *shared-nothing architecture*: another name for distributing load across multiple machines
- a system that can run on a single machine is often simpler, but high-end machines can become very expensive, so intensive workloads often can't avoid scaling out
- *elastic* systems can automatically add computing resources when they detect a load increase
- distributing stateless services across multiple machines is fairly straightforward, but taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity
    - common wisdom until recently was to keep the database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed
    - it is conceivable that distributed data systems will become the default in the future, even for use cases that don't handle large volumes of data or traffic, as tooling and abstractions for distributed systems get better
- there is no generic, one-size-fits-all scalable architecture (*magic scaling sauce*)
- an architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare

## Maintainability

- design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves

### Operability
- make it easy for operations teams to keep the system running smoothly
- operations responsibilities:
    - monitoring the health of the system and restoring service if it goes into a bad state
    - tracking down the cause of problems, such as system failures or degraded performance
    - keeping software and platforms up to date, including security patches
    - keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
    - anticipating future problems and solving them before they occur (e.g., capacity planning)
    - establishing good practices and tools for development, configuration management, and more
    - performing complex maintenance tasks, such as moving an application from one platform to another
    - defining processes that make operations predictable and help keep the production environment stable
    - preserving the organization's knowledge about the system
- data systems can do various things to make routine operations tasks easy
    - providing **visibility into the runtime behavior and internals** of the system, with good monitoring
    - providing good support for **automation and integration** with standard tools
    - **avoiding dependency** on individual machines
    - providing good **documentation** and an easy-to-understand operational model
    - providing good **default behavior**, but also giving admins the freedom to override defaults when needed
    - self-healing where appropriate, but also giving admins manual control over the system state when needed
    - exhibiting predictable behavior, minimizing surprises

### Simplicity
- make it easy for new engineers to understand the system, by removing as much complexity as possible from the system
- various possible symptoms of complexity:
    - explosion of the state space
    - tight coupling of modules
    - tangled dependencies
    - inconsistent naming and terminology
    - hacks aimed at solving performance problems
    - special-casing to work around issues elsewhere
    - ...

### Evolvability
- make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change

