# Transactions & Concurrency

Example "business transactions":

* Insert one row in *Order* table, then several in *OrderItem* table
* Insert one row in supertype table, then one row in subtype table
* Check amount < balance. If so, subtract from one row in bank account table, then add amount to another row
* For all rows in *Customer* table, send out monthly statements

Each requires several distinct database operations.

Transaction: A logical unit of work that must either be entirely completed or aborted (indivisible, atomic).

DML statements are already atomic.

RDBMS also allows for user-defined transactions.

A successful transaction changes the database from one consistent state to another. A consistent state is one in which all data integrity constraints are satisfied.

Transactions solve TWO problems:

* users need the ability to define a unit of work
* concurrent access to data by >1 user or program

## Problem 1: Unit of work

Single DML or DDL command (implicit transaction):

* Update $700$ records, but database crashes after $200$ records processed
* Restart server: you will find no changes to any records
* Changes are "all or none"

Multiple statements (user-defined transaction):
* START TRANSACTION; (or, 'BEGIN')
    * SQL statement;
    * SQL statement;
    * SQL statement;
    * ...
* COMMIT; (commits the whole transaction)
    * Or ROLLBACK (to undo everything)
    
SQL keywords: **begin**, **commit**, **rollback**

Each transaction consists of several SQL statements, embedded within a larger application program.

Transaction needs to be treated as an indivisible unit of work. "Indivisible" means that either the whole job gets done, or none gets done: If an error occurs, we don't leave the database with the job half done, in an inconsistent state.

In the case of an error: 
* Any SQL statements already completed must be reversed
* Show an error message to the user
* When ready, the user can try the transaction again
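
The unit-of-work idea above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production pattern: the `account` table, its `CHECK (balance >= 0)` constraint, and the helper names `make_db`, `balances`, and `transfer` are all assumptions made for the example.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two accounts (hypothetical schema)."""
    # isolation_level=None turns off sqlite3's implicit transactions,
    # so BEGIN / COMMIT / ROLLBACK below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    return conn


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one unit of work. Returns True on commit."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Debit first; the CHECK constraint rejects an overdraft.
        cur.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        if cur.rowcount != 1:
            raise LookupError(f"no such account: {src}")
        # Credit second; if this fails, the debit above must be reversed.
        cur.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        if cur.rowcount != 1:
            raise LookupError(f"no such account: {dst}")
        cur.execute("COMMIT")
        return True
    except (sqlite3.Error, LookupError):
        cur.execute("ROLLBACK")  # undo every statement since BEGIN
        return False


conn = make_db()
print(transfer(conn, 1, 2, 30), balances(conn))   # True {1: 70, 2: 80}
print(transfer(conn, 1, 99, 10), balances(conn))  # False {1: 70, 2: 80}
```

In the second call the debit succeeds but the credit finds no row, so the rollback restores the debited balance and the database is never left half done.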

Transaction Properties (ACID):
* Atomicity: A transaction is treated as a single, indivisible, logical unit of work. All operations in a transaction must be completed; if not, then the transaction is aborted.
* Consistency: Constraints that hold before a transaction must also hold after it. Multiple users accessing the same data see the same value.
* Isolation: Changes made during execution of a transaction cannot be seen by other transactions until this one is completed.
* Durability: When a transaction is complete, the changes made to the database are permanent, even if the system fails.

## Problem 2: Concurrent access

Concurrent access means DML from several users or programs executing against a shared database at the same time. Sharing data among multiple users is where much of the benefit of databases derives: users communicate and collaborate via shared data.

What could go wrong?
* Lost updates
* Uncommitted data
* Inconsistent retrievals

### Lost Update problem

<img src="img/img60.png" width="400">
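
A small simulation may make the lost update concrete. The numbers (a starting quantity of 35, one transaction adding 100, another subtracting 30) and the `run_schedule` helper are made up for illustration and are not taken from the figure. Each transaction reads the shared item into a private copy, changes the copy, then writes it back.

```python
def run_schedule(schedule, initial):
    """Run (transaction, operation, argument) steps against one shared item."""
    shared = initial   # the value stored in the database
    private = {}       # each transaction's private working copy
    for txn, op, arg in schedule:
        if op == "read":
            private[txn] = shared
        elif op == "add":
            private[txn] += arg
        elif op == "write":
            shared = private[txn]
        else:
            raise ValueError(f"unknown operation: {op}")
    return shared


# T1 runs to completion before T2 starts.
serial = [
    ("T1", "read", None), ("T1", "add", 100), ("T1", "write", None),
    ("T2", "read", None), ("T2", "add", -30), ("T2", "write", None),
]

# Both transactions read the old value before either one writes.
interleaved = [
    ("T1", "read", None), ("T2", "read", None),
    ("T1", "add", 100), ("T2", "add", -30),
    ("T1", "write", None), ("T2", "write", None),
]

print(run_schedule(serial, 35))       # 105
print(run_schedule(interleaved, 35))  # 5: T1's +100 has been overwritten
```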

### Uncommitted Data

<img src="img/img61.png" width="400">

### The Inconsistent Retrieval problem

It occurs when one transaction calculates an aggregate function over a set of data while other transactions are updating that data. Some rows may be read after they are changed and some before they are changed, yielding inconsistent results.
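
As a rough sketch of this problem, suppose one transaction sums two account balances while another moves 50 between them. The account names, amounts, and the `inconsistent_sum` function are hypothetical; the interleaving is written out by hand rather than produced by real concurrency.

```python
def inconsistent_sum():
    """Return (total seen by the reader, true total) for one bad interleaving."""
    accounts = {"A": 100, "B": 100}

    total = 0
    total += accounts["A"]   # reader sees A before the transfer
    accounts["A"] -= 50      # writer debits A
    accounts["B"] += 50      # writer credits B
    total += accounts["B"]   # reader sees B after the transfer

    return total, sum(accounts.values())


seen, actual = inconsistent_sum()
print(seen, actual)  # 250 200
```

The reader counts the 50 twice: once while it is still in A and once after it has arrived in B, even though the true total never changed.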

### Serializability

Transactions ideally are "serializable"
* Multiple, concurrent transactions appear as if they were executed one after another
* Ensures that the concurrent execution of several transactions yields consistent results

<img src="img/img62.png" width="400">

But true serial execution (i.e. no concurrency) is very expensive.

### Concurrency control methods

To achieve efficient execution of transactions, the DBMS creates a schedule of read and write operations for concurrent transactions.

The DBMS interleaves the execution of operations based on concurrency control algorithms such as locking and timestamping.

Several methods of concurrency control:
* Locking is the main method used
* Alternative methods: timestamping and optimistic methods

#### Locking

A lock guarantees exclusive use of a data item to the current transaction:
* T1 acquires a lock prior to data access; the lock is released when the transaction is complete
* T2 does not have access to data item currently being used by T1
* T2 has to wait until T1 releases the lock

Locking is required to prevent another transaction from reading inconsistent data.

Lock manager: Responsible for assigning and policing the locks used by the transactions

##### Lock Granularity: options

Database-level lock

* Entire database is locked
* Good for batch processing but unsuitable for multi-user DBMSs
* T1 and T2 cannot access the same database concurrently even if they use different tables

Table-level lock

* Entire table is locked - as above but not quite as bad
* T1 and T2 can access the same database concurrently as long as they use different tables
* Can cause bottlenecks, even if transactions want to access different parts of the table and would not interfere with each other
* Not suitable for highly multi-user DBMSs

Page-level lock

* An entire disk page is locked (a table can span several pages and each page can contain several rows of one or more tables)
* Not commonly used now

Row-level lock

* Allows concurrent transactions to access different rows of the same table, even if the rows are located on the same page
* Improves data availability but with high overhead (each row has a lock that must be read and written to)
* Currently the most popular approach (MySQL, Oracle)

Field-level lock

* Allows concurrent transactions to access the same row, as long as they access different attributes within that row
* Most flexible lock but requires an extremely high level of overhead
* Not commonly used

##### Types of Locks

Binary locks:

* has only two states: locked (1) or unlocked (0)
* eliminates the "Lost Update" problem (the lock is not released until the statement is completed)
* considered too restrictive to yield optimal concurrency, as it locks even for two READs (when no update is being done)
* the alternative is to allow both Exclusive and Shared locks (often called Write and Read locks)

Exclusive lock
* access is reserved for the transaction that locked the object
* must be used when transaction intends to WRITE
* granted if and only if no other locks are held on the data item
* in MySQL: "select ... for update"

Shared lock
* other transactions are also granted Read access
* issued when a transaction wants to READ data, and no Exclusive lock is held on that data item
* in MySQL: "select ... lock in share mode" (MySQL 8.0 also accepts "select ... for share")
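
The granting rules for shared and exclusive locks can be sketched as a tiny lock table. This is an illustrative model only, not how any particular DBMS implements its lock manager; the `LockTable` class and the `"S"` / `"X"` mode names are assumptions for the example.

```python
class LockTable:
    """Toy lock manager: many readers (S) or one writer (X) per data item."""

    def __init__(self):
        self._locks = {}  # item -> (mode, set of holding transactions)

    def acquire(self, txn, item, mode):
        """Return True if the lock is granted, False if `txn` would have to wait."""
        if mode not in ("S", "X"):
            raise ValueError("mode must be 'S' or 'X'")
        held = self._locks.get(item)
        if held is None:
            # No lock on the item: any request is granted.
            self._locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            # Shared locks are compatible with each other.
            holders.add(txn)
            return True
        if holders == {txn}:
            # The sole holder may re-request, or upgrade S to X.
            if mode == "X":
                self._locks[item] = ("X", holders)
            return True
        return False  # conflicts with a lock held by another transaction

    def release_all(self, txn):
        """Release every lock held by `txn` (done when the transaction ends)."""
        for item in list(self._locks):
            _, holders = self._locks[item]
            holders.discard(txn)
            if not holders:
                del self._locks[item]


table = LockTable()
print(table.acquire("T1", "row42", "S"))  # True
print(table.acquire("T2", "row42", "S"))  # True: two readers can share
print(table.acquire("T3", "row42", "X"))  # False: writer must wait for readers
```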

##### Deadlock

Condition that occurs when transactions wait for each other to unlock data
* T1 locks data item X, then wants Y
* T2 locks data item Y, then wants X
* each waits to get a data item which the other transaction is already holding
* could wait forever if not dealt with

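Deadlocks are possible only when at least one of the transactions requests an exclusive lock; shared locks alone never block one another, so they cannot deadlock.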

Deadlocks are dealt with by:
* prevention
* detection
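
One common way to frame detection is as a search for a cycle in a "waits-for" graph, where an edge from T1 to T2 means T1 is waiting for a lock that T2 holds. The sketch below assumes that representation; the `find_deadlock` function and the dictionary format are hypothetical, and a real DBMS would then pick a victim transaction to roll back.

```python
def find_deadlock(waits_for):
    """Return a list of transactions forming a wait cycle, or None if there is none.

    `waits_for` maps each transaction to the transactions it is waiting on.
    """
    finished = set()  # transactions already explored

    def visit(txn, path):
        if txn in path:
            # We came back to a transaction on the current path: a cycle.
            return path[path.index(txn):]
        if txn in finished:
            return None
        finished.add(txn)
        for other in waits_for.get(txn, ()):
            cycle = visit(other, path + [txn])
            if cycle:
                return cycle
        return None

    for txn in waits_for:
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None


# T1 holds X and wants Y; T2 holds Y and wants X.
print(find_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # ['T1', 'T2']
# T1 waits for T2, but T2 is not waiting for anyone.
print(find_deadlock({"T1": ["T2"], "T2": []}))      # None
```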

#### Alternative concurrency control methods

Timestamp
* Assigns a global unique timestamp to each transaction
* Each data item accessed by the transaction gets the timestamp
* Thus for every data item, the DBMS knows which transaction performed the last read or write on it
* When a transaction wants to read or write, the DBMS compares its timestamp with the timestamps already attached to the item and decides whether to allow access
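
The comparison step might look like the following sketch of the basic timestamp-ordering rule, which is one well-known variant and is assumed here rather than stated in these notes. The `Item` class, the `Abort` exception, and the integer timestamps are illustrative names only.

```python
class Abort(Exception):
    """Raised when an operation arrives too late and its transaction must restart."""


class Item:
    """A data item that remembers the latest read and write timestamps."""

    def __init__(self, value):
        self.value = value
        self.read_ts = 0
        self.write_ts = 0


def read(item, ts):
    """Read `item` on behalf of the transaction with timestamp `ts`."""
    if ts < item.write_ts:
        # A younger transaction has already overwritten the value.
        raise Abort(f"read at ts={ts} is older than write_ts={item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item, ts, value):
    """Write `value` to `item` on behalf of the transaction with timestamp `ts`."""
    if ts < item.read_ts or ts < item.write_ts:
        # A younger transaction has already read or written the item.
        raise Abort(f"write at ts={ts} arrives too late")
    item.value = value
    item.write_ts = ts


x = Item(10)
print(read(x, ts=5))      # 10
write(x, ts=7, value=20)  # allowed: 7 is newer than every earlier access
try:
    read(x, ts=6)         # rejected: a transaction with ts=7 already wrote x
except Abort as err:
    print("aborted:", err)
```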

Optimistic
* Based on the assumption that the majority of database operations do not conflict
* Transaction is executed without restrictions or checking
* Then when it is ready to commit, the DBMS checks whether any of the data it read has been altered - if so, rollback
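
A minimal sketch of the optimistic approach, assuming each data item carries a version counter that is checked at commit time. The `OptimisticStore` class and its method names are hypothetical, and real systems differ in how they validate.

```python
class OptimisticStore:
    """Toy key-value store with commit-time validation of what was read."""

    def __init__(self, data):
        self._data = dict(data)
        self._version = {key: 0 for key in data}

    def get(self, key):
        """Return the committed value of `key`."""
        return self._data[key]

    def begin(self):
        """Start a transaction: buffered writes plus versions seen on read."""
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]  # read your own buffered write
        txn["reads"].setdefault(key, self._version[key])
        return self._data[key]

    def write(self, txn, key, value):
        txn["writes"][key] = value     # buffered, invisible until commit

    def commit(self, txn):
        """Apply the writes if nothing read has changed; otherwise roll back."""
        for key, seen in txn["reads"].items():
            if self._version[key] != seen:
                return False           # conflict: discard the buffered writes
        for key, value in txn["writes"].items():
            self._data[key] = value
            self._version[key] = self._version.get(key, 0) + 1
        return True


store = OptimisticStore({"qty": 35})
t1, t2 = store.begin(), store.begin()
store.write(t1, "qty", store.read(t1, "qty") + 100)
store.write(t2, "qty", store.read(t2, "qty") - 30)
print(store.commit(t1))  # True
print(store.commit(t2))  # False: qty changed after T2 read it
print(store.get("qty"))  # 135
```

The rejected transaction would simply be retried, which is cheap as long as conflicts really are rare.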

### Logging transactions

We want to be able to restore the database to a previous consistent state. If a transaction cannot be completed, it must be aborted and any changes rolled back.

To enable this, DBMS tracks all updates to data.

This transaction log contains:
* a record for the beginning of the transaction
* for each SQL statement
    * operation being performed (update, delete, insert)
    * objects affected by the transaction
    * "before" and "after" values for updated fields
    * pointers to previous and next transaction log entries
* the ending (COMMIT) of the transaction

It also provides the ability to restore a corrupted database. If a system failure occurs, the DBMS will examine the log for all uncommitted or incomplete transactions and it will restore the database to a previous state.
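
The recovery step can be sketched as an undo pass over the log: scan backwards and restore the "before" value of every update whose transaction has no COMMIT record. The tuple layout of the log records and the `recover` function are assumptions for illustration, and a real DBMS also redoes committed work that had not yet reached disk, which this sketch leaves out.

```python
def recover(db, log):
    """Undo updates made by transactions that never committed.

    `db` maps item names to values as found after the crash.
    `log` holds records such as ("BEGIN", txn), ("COMMIT", txn), or
    ("UPDATE", txn, item, before_value, after_value).
    Returns the set of transactions that were rolled back.
    """
    committed = {record[1] for record in log if record[0] == "COMMIT"}
    undone = set()
    # Walk the log backwards so the oldest "before" value is restored last.
    for record in reversed(log):
        kind, txn = record[0], record[1]
        if kind == "UPDATE" and txn not in committed:
            _, _, item, before, _after = record
            db[item] = before
            undone.add(txn)
    return undone


log = [
    ("BEGIN", "T1"),
    ("UPDATE", "T1", "A", 100, 50),
    ("UPDATE", "T1", "B", 100, 150),
    ("COMMIT", "T1"),
    ("BEGIN", "T2"),
    ("UPDATE", "T2", "C", 70, 0),
    # crash happens here: T2 has no COMMIT record
]
db = {"A": 50, "B": 150, "C": 0}
print(recover(db, log))  # {'T2'}
print(db)                # {'A': 50, 'B': 150, 'C': 70}
```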

# Distributed Databases

<img src="img/img63.png" width="500">

Distributed database: a single logical database spread across multiple computers in multiple locations that are connected by a data communication link. It appears to users as though it is one database.

Decentralized database: a collection of independent databases which are not networked together as one logical database. It appears to users as many separate databases.

## Distributed DBMS Advantages

* Good fit for geographically distributed organizations/users
* Data located near the site with the greatest demand
* Faster data access (to local data)
* Faster data processing (workload spread out across many machines)
* Allows modular growth (add new servers as load increases)
* Increased reliability and availability (less danger of single-point failure)
* Supports database recovery (data is replicated across multiple sites)

## Disadvantages

* Complexity of management and control (db or application must stitch together data across sites)
* Data integrity (additional exposure to improper updating)
* Security (many server sites leads to higher chance of breach)
* Lack of standards (different DBMS vendors use different protocols)
* Increased training costs (more complex IT infrastructure)
* Increased storage requirements (if there are multiple copies of data)

## Objectives and Trade-offs

Location transparency: 
* A user (or program) needn't know where particular data are stored
* Requests to retrieve or update data from any site are automatically forwarded by the system to the site or sites related to the processing request
* All data in the network appears as a single logical database stored at one site to the users
* A single query can join data from tables in multiple sites

Local autonomy: 
* A node can continue to function for local users if connectivity to the network is lost
* Users can administer their local database
    * control local data
    * administer security
    * log transactions
    * recover when local failures occur
    * provide full access to local data

Trade-offs:
* Availability vs Consistency
* Synchronous vs Asynchronous updates

## Distribution options

* Data replication: data copied across sites
* Horizontal partitioning: table rows distributed across sites
* Vertical partitioning: table columns distributed across sites
* Combination of the above

### Replication advantages

* High reliability due to redundant copies
* Fast access to local data
* May avoid complicated distributed integrity routines if replicated data is refreshed at scheduled intervals
* Decouples nodes, transactions proceed even if some nodes are down
* Reduced network traffic at prime time if updates can be delayed
* This is currently popular as a way of achieving high availability for global systems. Most NoSQL databases offer replication.

### Replication disadvantages

* Need more storage space
* Integrity: can retrieve incorrect data if updates have not arrived
* Takes time for update operations
    * high tolerance for out-of-date data may be required
    * updates may cause performance problems for busy nodes
* Network load: propagating updates places heavy demands on communication links

### Synchronous updates

* Data is continuously kept up to date, users anywhere can access data and get the same answer
* If any copy of a data item is updated anywhere on the network, the same update is immediately applied to all other copies or it is aborted
* Ensures data integrity and minimizes the complexity of knowing where the most recent copy of data is located.
* Can result in slow response time and high network usage. The DDBMS spends considerable time checking that an update is accurately and completely propagated across the network.

### Asynchronous updates

* Some delay in propagating data updates to remote databases. Some degree of at least temporary inconsistency is tolerated. May be ok if it's temporary and well managed
* Tends to have acceptable response time. Updates happen locally and data replicas are synchronized in batches at predetermined intervals
* May be more complex to plan and design. Need to ensure the right level of data integrity and consistency.
* Suits some information systems more than others. Compare commerce/finance systems with social media

### Horizontal partitioning

Different rows of a table at different sites.

Advantages:

* Efficiency: data stored close to where it is used
* Better performance: local access optimization
* Security: only relevant data is stored locally
* Ease of query: unions across partitions

Disadvantages:

* Inconsistent access speed: accessing data across partitions
* Backup vulnerability: no data replication

<img src="img/img64.png" width="500">

### Vertical partitioning

Different columns of a table at different sites.

Advantages and disadvantages are the same as for horizontal partitioning, except that combining data across partitions is more difficult because it requires joins (instead of unions).

<img src="img/img65.png" width="500">
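
The difference between recombining by union and by join can be seen in a small sketch. The customer rows, column names, and the four helper functions are invented for illustration, with plain Python lists standing in for the table fragments held at each site.

```python
customers = [
    {"id": 1, "name": "Ann", "city": "Perth", "credit": 500},
    {"id": 2, "name": "Bob", "city": "Sydney", "credit": 300},
    {"id": 3, "name": "Cy", "city": "Perth", "credit": 900},
]


def split_horizontal(rows, column):
    """One fragment per distinct value of `column`; each keeps whole rows."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def split_vertical(rows, pk, column_groups):
    """One fragment per column group; every fragment repeats the primary key."""
    return [
        [{col: row[col] for col in (pk, *cols)} for row in rows]
        for cols in column_groups
    ]


def union(fragments, pk):
    """Recombine horizontal fragments: just concatenate the rows."""
    rows = [row for fragment in fragments for row in fragment]
    return sorted(rows, key=lambda row: row[pk])


def join(fragments, pk):
    """Recombine vertical fragments: match rows on the primary key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[pk], {}).update(row)
    return [merged[key] for key in sorted(merged)]


by_city = split_horizontal(customers, "city")
by_cols = split_vertical(customers, "id", [("name",), ("city", "credit")])

print(union(by_city.values(), "id") == customers)  # True
print(join(by_cols, "id") == customers)            # True
```

The join needs the primary key in every vertical fragment and has to match rows one by one, which is the extra work the notes refer to.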

### Comparing 5 configurations

* Centralized database, distributed access: DB is at one location, and accessed from everywhere.
* Replication with periodic snapshot update: many locations, each data copy updated periodically
* Replication with near real-time synchronization of updates: many locations, each data copy updated in near real time
* Partitioned, integrated, one logical database: data partitioned across many sites, within a logical database, and a single DBMS
* Partitioned, independent, nonintegrated segments: data partitioned across many sites, multiple DBMS, multiple computers

#### Centralized

* Reliability: poor, highly dependent on central server

* Expandability: poor, limitations are barriers to performance

* Communication overhead: very high, traffic from all locations goes to one site

* Manageability: very good, one monolithic site requires little coordination

* Data consistency: excellent, all users always access the same data

#### Replicated with Snapshots

* Reliability: good, redundancy and tolerated delays

* Expandability: very good, cheap to add new servers

* Communications Overhead: low to medium, not constant, but periodic snapshots can cause bursts of network traffic

* Manageability: very good, each copy is like every other one

* Data consistency: medium, fine as long as update delays are tolerable

#### Synchronized Replication

* Reliability: excellent, redundancy and minimal delays

* Expandability: very good, cost of additional copies may be low and synchronization work only linear

* Communications Overhead: medium, messages are constant, but some delays are tolerated

* Manageability: medium, collisions add some complexity to manageability

* Data Consistency: very good, close to precise consistency

#### Integrated Partitions

* Reliability: good, effective use of partitioning and redundancy

* Expandability: very good, new nodes get only data they need without changes in overall database design

* Communications Overhead: low to medium, most queries are local, but queries that require data from multiple sites can cause a temporary load

* Manageability: difficult, especially difficult for queries that need data from distributed tables, and updates must be tightly coordinated

* Data Consistency: very poor, keeping partitions consistent takes considerable effort, and inconsistencies are not tolerated

#### Decentralized, Independent Partitions

* Reliability: good, depends on only local database availability

* Expandability: good, new sites independent of existing ones

* Communications Overhead: low, little if any need to pass data or queries across the network (if one exists)

* Manageability: very good, easy for each site, until there is a need to share data across sites

* Data Consistency: low, no guarantees of consistency, in fact, pretty sure of inconsistency

### Functions of a distributed DBMS

* Locate data with a distributed data dictionary

* Determine location from which to retrieve data and process query components

* DBMS translation between nodes with different local DBMSs (using middleware)

* Data consistency (via multiphase commit protocols)

* Global primary key control

* Scalability

* Security, concurrency, query optimization, failure recovery
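
As one example of a multiphase commit protocol, two-phase commit can be sketched as follows. The notes do not say which protocol is meant, so treating it as two-phase commit is an assumption, and the `Participant` class and `two_phase_commit` function are hypothetical names; timeouts, logging, and coordinator failure are all left out.

```python
class Participant:
    """One site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether local work succeeded
        self.state = "working"

    def prepare(self):
        """Phase 1: vote yes only if this site can guarantee a commit."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit everywhere or roll back everywhere."""
    # Phase 1: collect a vote from every site.
    votes = [site.prepare() for site in participants]
    # Phase 2: a single "no" vote aborts the transaction at all sites.
    if all(votes):
        for site in participants:
            site.commit()
        return "committed"
    for site in participants:
        site.rollback()
    return "aborted"


sites = [Participant("Perth"), Participant("Sydney", can_commit=False)]
print(two_phase_commit(sites))         # aborted
print([site.state for site in sites])  # ['aborted', 'aborted']
```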

### 12 Commandments for Distributed Databases

* Local Site Independence: Each local site can act as an independent, autonomous, centralized DBMS. Each site is responsible for security, concurrency control, backup and recovery.

* Central Site Independence: No site in the network relies on a central site or any other site. All sites have the same capabilities.

* Failure Independence: The system is not affected by node failures. The system is in continuous operation even in the case of a node failure or an expansion of the network.

* Location Transparency: The user does not need to know the location of the data to retrieve those data.

* Fragmentation Transparency: Data fragmentation is transparent to the user, who sees only one logical database. The user does not need to know the name of the database fragments to retrieve them.

* Replication Transparency: The user sees only one logical database. The DDBMS transparently selects the database fragment(s) to access. To the user, the DDBMS manages all fragments transparently.

* Distributed Query Processing: A distributed query may be executed at several different sites. Query optimisation is performed transparently by the DDBMS.

* Distributed Transaction Processing: A transaction may update data at several different sites, and the transaction is executed transparently.

* Hardware Independence: The system must run on any hardware platform.

* Operating System Independence: The system must run on any operating system platform.

* Network Independence: The system must run on any network platform.

* Database Independence: The system must support any vendor's database product.