## Characteristics of Distributed Systems (SERAM)
* Scalability
* Efficiency
* Reliability 
* Availability
* Manageability


## Scalability
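- Reason - increasing data volume or increasing workload (e.g. number of transactions)
- a scalable system grows to meet that demand without performance loss
- performance usually still declines with system size due to management or environment costs, e.g. network speed, or tasks that cannot be distributed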

**Horizontal scaling and Vertical Scaling**
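- Horizontal Scaling - adding more servers to the pool of resources
- often easier to scale dynamically, since machines are simply added to the existing pool
- E.g: Cassandra, MongoDB

- Vertical Scaling - adding more power to an existing server, e.g: CPU, RAM, Storage, etc
- limited by the capacity of a single server
- E.g: MySQL (scaling up often requires downtime)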


## Reliability
- probability that a system will not fail in a given period
- a distributed system is reliable if it keeps delivering its services even when one or more of its software or hardware components fail
- achieve resilience for services by removing every single point of failure
- redundancy has a cost

## Availability
- time a system remains operational to perform its required function in a specific period
- takes into account maintainability, repair time, spares availability, etc
- Reliability is availability over time, considering every possible real-world condition that can occur
- An aircraft that can make it through any possible weather safely is more reliable than one that is vulnerable to some conditions


**Reliability vs Availability**
- a reliable system is available
- an available system is not necessarily reliable
- it is possible to achieve high availability even with an unreliable product by minimising repair time
- E.g: A system was launched without information security testing. Users were happy as they didn't realize it was vulnerable. After some time there were security breaches and the system was less available for an extended period of time
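
The point about repair time can be made concrete with the usual steady-state formula, availability = MTBF / (MTBF + MTTR). The terms MTBF (mean time between failures) and MTTR (mean time to repair) and the numbers below are illustrative assumptions, not taken from these notes. A minimal sketch:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same (poor) reliability: one failure every 100 hours on average.
# Only the repair time differs.
slow_repair = availability(mtbf_hours=100, mttr_hours=10)
fast_repair = availability(mtbf_hours=100, mttr_hours=0.1)

print(f"slow repair: {slow_repair:.4f}")  # slow repair: 0.9091
print(f"fast repair: {fast_repair:.4f}")  # fast repair: 0.9990
```

Both systems fail equally often, yet the one that is repaired quickly is far more available.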

## Efficiency
- Two Standard Measures - Response Time (Latency) and Throughput (Bandwidth)
- Response Time - delay to obtain the first item
- Throughput - number of items delivered in a given time unit
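
To show how the two measures differ, here is a rough sketch that times a made-up workload. `fetch_items` is a hypothetical stand-in for any operation that delivers items one by one, and the 1 ms delay is arbitrary.

```python
import time

def fetch_items(n):
    """Hypothetical workload: yields n items, each taking about 1 ms."""
    for i in range(n):
        time.sleep(0.001)
        yield i

def measure(n):
    """Return (response_time_seconds, throughput_items_per_second) for n >= 1."""
    start = time.perf_counter()
    first_item_at = None
    count = 0
    for _ in fetch_items(n):
        if first_item_at is None:
            first_item_at = time.perf_counter()  # first item has arrived
        count += 1
    end = time.perf_counter()
    response_time = first_item_at - start        # delay to obtain the first item
    throughput = count / (end - start)           # items per unit of time
    return response_time, throughput

response_time, throughput = measure(50)
print(f"response time: {response_time * 1000:.2f} ms")
print(f"throughput:    {throughput:.0f} items/s")
```

The exact numbers depend on the machine, so none are shown here.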

## Serviceability or Manageability
- how easy the system is to operate and maintain
- simplicity and speed with which a system can be repaired/maintained
- If time to repair increases -> availability decreases
- Factors - ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, how simple the system is to operate
- early detection of faults can decrease or avoid system downtime
- e.g: the system automatically notifying the call centre of a failure

**Pizza Shop Example**
1. 


## Load Balancers
* one of the critical components of distributed systems
* spread traffic across the servers in a cluster to improve responsiveness, balance load, and enhance availability of databases & applications, etc
* keep track of the status of all resources while distributing the requests

<img src="Image/Load_Balancer.JPG" width="600" />

To get full scalability and redundancy, we can add a load balancer at each layer of the system, between:
* Users & web servers
* Web servers & internal platform layer (application server or cache server)
* Internal platform layer & database

<img src="Image/Load_Balancer_1.JPG" width="800" />

### Benefits
* fast user experience, uninterrupted service
* less downtime and higher throughput
* decrease wait time for users
* smart load balancers - use predictive analytics that determine traffic bottlenecks before they happen


### Types of Load balancers
1. Hardware LB - expensive
2. Software LB - e.g. HAProxy (open source LB)

### Load Balancing Algorithms
1. Round Robin - suited to servers with the same configuration and few persistent connections
2. Weighted Round Robin - suited to servers with different processing power
3. Least Connections - useful when there is a large number of persistent connections
4. Least Response Time - picks the server with the fewest active connections and the lowest average response time
5. Source IP Hash - a hash of the client IP address picks the server
6. URL Hash - a hash of the requested URL picks the server
7. Least Bandwidth Method - picks the server currently serving the least traffic, measured in megabits per second (Mbps)
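
A few of these selection strategies can be sketched in a handful of lines. This is a simplified illustration, not how any particular load balancer implements them: the server names are made up, the weighted variant simply repeats servers instead of interleaving them, and `zlib.crc32` is just one convenient stable hash.

```python
import itertools
import zlib

def round_robin(servers):
    """Endless iterator that hands out servers in order."""
    return itertools.cycle(servers)

def weighted_round_robin(weights):
    """Naive weighting: a server with weight w appears w times per cycle."""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

def least_connections(active_connections):
    """Pick the server with the fewest active connections."""
    return min(active_connections, key=active_connections.get)

def source_ip_hash(client_ip, servers):
    """The same client IP always maps to the same server."""
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]

servers = ["web-1", "web-2", "web-3"]  # hypothetical server names

rr = round_robin(servers)
print([next(rr) for _ in range(4)])    # ['web-1', 'web-2', 'web-3', 'web-1']

wrr = weighted_round_robin({"web-1": 2, "web-2": 1})
print([next(wrr) for _ in range(3)])   # ['web-1', 'web-1', 'web-2']

print(least_connections({"web-1": 12, "web-2": 3, "web-3": 7}))  # web-2

same = source_ip_hash("203.0.113.7", servers) == source_ip_hash("203.0.113.7", servers)
print(same)                            # True
```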

### Redundant Load Balancers
* a load balancer can itself be a single point of failure
* a second LB can be connected to the first to form a cluster in which each monitors the other
* both are equally capable, so either one can take over if the other fails


## Cache
* Principle: Recently requested data is likely to be requested again
* like short-term memory: limited space, but faster than the original data source, and holds the most recently accessed items
* can be added at almost every layer, e.g. hardware, OS, browser, web server, web app

### Types of Cache
1. **Application Server Cache**
- a cache on a request layer node enables local storage of response data
- If the application runs on many nodes, then each node has its own cache
- as the load balancer sends requests randomly to different nodes, the same request can land on different nodes, which increases cache misses
- two options to overcome this - Distributed and Global Cache
2. **Distributed Cache** - Consistent Hashing
3. **Global Cache**
4. **CDN - Content Delivery Network**
 - used by sites serving large amounts of static media that is common to all users
 - a request first asks the CDN for a piece of static media
 - the CDN serves it if it has it locally, otherwise it queries the back-end servers, caches the file locally and serves it to the requester
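
Item 2 above names consistent hashing as the way a distributed cache decides which node holds a key. The following is a minimal sketch of a hash ring, assuming MD5 as the hash function and 100 virtual points per node. Both choices, the class name and the node names are illustrative and not from these notes.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Stable hash that places a string on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes   # virtual points per node, smooths the distribution
        self._hashes = []      # sorted positions on the ring
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove_node(self, node):
        self._hashes = [h for h in self._hashes if self._owner[h] != node]
        self._owner = {h: n for h, n in self._owner.items() if n != node}

    def get_node(self, key):
        """Walk clockwise from the key's position to the next node point."""
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.remove_node("cache-c")
after = {k: ring.get_node(k) for k in keys}

# Only keys that lived on the removed node are remapped.
moved = [k for k in keys if before[k] != after[k]]
print(all(before[k] == "cache-c" for k in moved))  # True
```

Because only the removed node's keys move, the other cache nodes keep their contents when the cluster changes size.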
 

### Cache Invalidation
* When data is modified in DB, it should be invalidated in the Cache 
* if data is not invalidated, it can cause inconsistent app behaviour

Types:
1. **Write through Cache**
- Data is written into the cache and the database at the same time
- complete data consistency between cache and storage
- ensures nothing gets lost in case of a crash or power failure
- every write operation needs to be done twice before returning success to the client
- disadvantage of higher latency for write operations

2. **Write around Cache**  
- Similar to Write through Cache, but data is written directly to permanent storage, bypassing the cache
- reduces flooding of the cache with writes that will not be re-read
- Disadv - a read request for recently written data causes a cache miss and must be served from slower back-end storage, resulting in higher latency

3. **Write back Cache**
- data is written to the cache alone and completion is immediately confirmed to the client
- the write to permanent storage is done after a specified interval or under certain conditions
- results in low latency and high throughput for write-intensive apps
- risk of data loss in case of a crash, as the cache holds the only copy of the written data
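
The three write policies can be compared side by side in a toy model. This is a sketch under simplifying assumptions: the cache and the database are plain dictionaries, `CachedStore` and its method names are invented for illustration, and write-back flushes only when `flush()` is called explicitly.

```python
class CachedStore:
    def __init__(self, policy):
        self.policy = policy   # "write-through", "write-around" or "write-back"
        self.cache = {}
        self.db = {}           # stands in for permanent storage
        self.dirty = set()     # keys written to cache but not yet to the db

    def write(self, key, value):
        if self.policy == "write-through":
            self.cache[key] = value        # both writes happen before success
            self.db[key] = value
        elif self.policy == "write-around":
            self.db[key] = value           # bypass the cache
            self.cache.pop(key, None)      # assumption: drop any stale cached copy
        elif self.policy == "write-back":
            self.cache[key] = value        # confirm to the client right away
            self.dirty.add(key)
        else:
            raise ValueError(f"unknown policy: {self.policy}")

    def flush(self):
        """Write-back only: persist everything still held only in the cache."""
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def read(self, key):
        if key in self.cache:
            return self.cache[key], "hit"
        value = self.db[key]               # slower back-end read
        self.cache[key] = value
        return value, "miss"

around = CachedStore("write-around")
around.write("k", 1)
print(around.read("k"))   # (1, 'miss')

back = CachedStore("write-back")
back.write("k", 1)
print("k" in back.db)     # False
back.flush()
print(back.db["k"])       # 1
```

The write-around read is a miss because the write skipped the cache, and the write-back value is absent from the database until the flush, which is exactly the window in which a crash would lose it.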


### Cache Eviction Policies
1. FIFO
2. LIFO
3. Least Recently Used
4. Most Recently Used
5. Least Frequently Used
6. Random Replacement
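
As an illustration of one of these policies, here is a small Least Recently Used cache built on `collections.OrderedDict`. The class name and the capacity of 2 are arbitrary choices for the example.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest use first, most recent use last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now more recently used than "b"
cache.put("c", 3)      # over capacity: "b" is evicted

print(cache.get("b"))  # None
print(cache.get("a"))  # 1
print(cache.get("c"))  # 3
```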

## Sharding/Data Partitioning
* break a big database into many smaller parts (e.g: slicing a pizza among six people)
* this improves the manageability, performance, availability and load balancing of an application
* after a certain scale it is cheaper to add more machines than to add resources to an existing machine

#### Sharding - Partitioning Methods
1. **Horizontal Partitioning**
- put different rows into different tables
- e.g: locations with ZIP codes less than 10000 are stored in one table and those with greater ZIP codes in another - range based sharding
- Issue - Range Based Sharding - a poorly chosen range leads to unbalanced servers
2. **Vertical Partitioning**
- store the tables related to a specific feature on their own server
- easy to implement and low impact on the application
- if the application keeps growing, a feature-specific DB may itself need to be partitioned across several servers
3. **Directory Based Partitioning**
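
The ZIP code example and its imbalance problem can be sketched as follows. The boundary of 10000 comes from the example above; treating exactly 10000 as part of the upper range, the sample ZIP codes, and the directory contents are assumptions made for illustration. The directory variant reflects the common reading of directory based partitioning as a lookup table from key to shard.

```python
import bisect
from collections import Counter

# Range based: each boundary is the first ZIP code of the next shard.
BOUNDARIES = [10000]

def range_shard(zip_code):
    """Shard 0 holds ZIP codes below 10000, shard 1 holds the rest."""
    return bisect.bisect_right(BOUNDARIES, zip_code)

zip_codes = [501, 10001, 30301, 60601, 94105, 98101]  # sample data
load = Counter(range_shard(z) for z in zip_codes)
print(dict(load))  # {0: 1, 1: 5}

# Directory based: a lookup table decides where each key lives,
# so keys can be moved by updating the table instead of the scheme.
directory = {"customer-1": "shard-a", "customer-2": "shard-b"}  # hypothetical

def directory_shard(key):
    return directory[key]

print(directory_shard("customer-2"))  # shard-b
```

With this boundary, five of the six sample rows land on the same shard, which is the unbalanced-server issue noted above.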

#### Sharding - Partitioning Criteria
* List Partitioning
* Key or Hash Based Partitioning
* Composite Partitioning
* Round Robin Partitioning
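
Two of these criteria are simple enough to sketch directly. The function names and the use of `zlib.crc32` as the hash are illustrative assumptions. The last lines also show why plain modulo hashing makes rebalancing painful: changing the shard count remaps most keys.

```python
import zlib

def hash_shard(key, num_shards):
    """Key or hash based: a stable hash of the key picks the shard."""
    return zlib.crc32(key.encode()) % num_shards

def round_robin_shard(row_number, num_shards):
    """Round robin: the i-th row goes to shard i mod n."""
    return row_number % num_shards

print(hash_shard("user:42", 4) == hash_shard("user:42", 4))  # True
print([round_robin_shard(i, 3) for i in range(5)])           # [0, 1, 2, 0, 1]

# Growing from 4 to 5 shards with modulo hashing moves most keys.
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(hash_shard(k, 4) != hash_shard(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shard")
```

For a uniform hash, a key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about 1 key in 5, so roughly 80% of keys are expected to move.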

#### Sharding - Issues
* Referential Integrity
* Joins and De-normalization
* Rebalancing


## Indexes
* an index on a database table makes searching faster
* can be created using one or more columns of a database table
* speeds up data retrieval
* decreases write performance (insert, delete, update), since the index must be updated as well
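
The effect of an index on a lookup can be observed with Python's built-in `sqlite3` module. The table, column and index names below are made up for the example, and the exact wording of the query plan differs between SQLite versions, so no literal output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(1000)],
)

def query_plan(sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user42@example.com",)

plan_before = query_plan(lookup, args)   # full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, args)    # search using idx_users_email

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists SQLite has to scan the whole table. Afterwards the plan mentions `idx_users_email`, at the cost of maintaining that index on every insert, update and delete.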

## Proxies
* intermediate server between client and the back-end server
* piece of software or hardware that acts as an intermediary for requests from clients seeking resources from other servers
* used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compressing a resource)
* its cache can serve a lot of requests


<img src="Image/Proxies.JPG" width="600" />

#### Proxies - Types
1. Open Proxy - accessible by any internet user  
 * Anonymous Proxy - reveals its identity as a server but does not disclose the initial IP address
 * Transparent Proxy - able to cache websites
2. Reverse Proxy 
 * retrieves resources from one or more servers on behalf of a client and returns them to the client as if they originated from the proxy itself


## Redundancy
* duplication of critical components or functions of a system to increase the reliability of the system
* removes single point of failure


## Replication
* sharing information to ensure consistency between redundant resources
* improves reliability, fault-tolerance, or accessibility
* widely used in DBMSs, usually with a master-slave relationship between the original and the copies
* the master gets all updates and forwards them to the slaves; each slave replies with a message confirming that it has received the update
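
The update-and-confirm flow described above can be modelled in a few lines. This is a toy, in-process sketch: the class and method names are invented, replication is synchronous, and there is no network or failure handling.

```python
class Slave:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply_update(self, key, value):
        self.data[key] = value
        return True  # confirmation message back to the master

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, key, value):
        """Apply the update locally, forward it, and collect confirmations."""
        self.data[key] = value
        return {s.name: s.apply_update(key, value) for s in self.slaves}

slaves = [Slave("slave-1"), Slave("slave-2")]
master = Master(slaves)
acks = master.write("user:1", "Alice")

print(acks)                                           # {'slave-1': True, 'slave-2': True}
print(all(s.data == master.data for s in slaves))     # True
```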

## SQL
* Relational databases are structured
* have predefined schemas like phone books
* E.g: MySQL, Oracle, MS SQL Server, SQLite, Postgres, and MariaDB

## No-SQL
* Non-relational databases are unstructured and distributed
* have a dynamic schema, like file folders that hold everything from a person’s address and phone number to their Facebook ‘likes’ and online shopping preferences


**1. Document Databases**
* Data is stored in documents and these documents are grouped together in collections
*  Each document can have an entirely different structure
* E.g: CouchDB and MongoDB  

**2. Wide Column Databases**
* don’t need to know all the columns up front
* each row doesn’t have to have the same number of columns
* best suited for analyzing large datasets
* E.g: Cassandra and HBase

**3. Graph Database**
* used to store data whose relations are best represented in a graph
* Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities)
* E.g: Neo4J and InfiniteGraph

**4. Key Value Stores**
* Data is stored in an array of key-value pairs
* Well-known key-value stores include Redis, Voldemort, and Dynamo

**High level differences between SQL and NoSQL**
* Storage
* Schema
* Querying
* Scalability
* Reliability or ACID Compliance (Atomicity, Consistency, Isolation, Durability)
  * the vast majority of SQL databases are ACID compliant
  * most NoSQL databases sacrifice ACID compliance for performance and scalability
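
What ACID compliance buys can be seen in a small SQLite example of atomicity: a transfer that fails halfway leaves no partial update behind. The `accounts` table, the names and the amounts are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))

def transfer(src, dst, amount):
    """Credit dst, then debit src; both succeed or neither is kept."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

# The debit would make alice's balance negative, so the CHECK constraint
# fails and the credit to bob that already ran is rolled back as well.
print(transfer("alice", "bob", 500))  # False
print(balances())                     # {'alice': 100, 'bob': 50}
```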
