# Transaction

- a **transaction** is a *unit* of program execution that accesses and possibly updates various data items
- two important issues must be dealt with in transaction processing
    - **failure**: failures of various kinds, such as hardware failures and system crashes
    - **concurrency**: concurrent execution of multiple transactions

## ACID Properties

- example: a transaction to transfer 50 dollars from account A to B
    1. read(A)
    2. A := A - 50
    3. write(A)
    4. read(B)
    5. B := B + 50
    6. write(B)


- **atomicity**: either all operations of the transaction are properly reflected in the database, or none are
    - if 50 dollars is debited from A, then it is deposited in B or credited back to A
- **consistency**: execution of a transaction in isolation preserves the consistency of the database (includes integrity constraints, check constraints, and assertions)
    - the total balance of A plus B is preserved
- **durability**: after a transaction completes successfully (i.e., without aborting and rolling back), the changes it has made to the database persist, even if failures occur
    - if 50 dollars is deposited in B, it will not disappear from B
- **isolation**: although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions; intermediate transaction results should be **hidden** from other concurrently executing transactions
    - two transfers executing in parallel should not be aware of each other
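The transfer example can be sketched with Python's built-in `sqlite3` module to show atomicity in practice. This is a minimal illustration, not a production pattern: the `account` table, the `transfer` helper, and the starting balances are assumptions made up for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single transaction (hypothetical helper)."""
    with conn:  # commits if the block succeeds, rolls back if an exception escapes
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # The debit has already been applied; raising undoes it via rollback.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 20)])
conn.commit()

transfer(conn, "A", "B", 50)       # succeeds: both updates are applied
try:
    transfer(conn, "A", "B", 500)  # fails midway: the debit is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```

After the failed second transfer, the balances are the same as after the first one, and the total of A plus B is unchanged throughout.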

# Transaction State

<img src="img/Snip20191104_116.png" width=60%/>

- **active**: the initial state, the transaction remains in this state while it is executing
- **partially committed**: after the final statement has been executed, before updates are reflected in the database
- **failed**: after discovering that normal execution can no longer proceed, and hence the transaction must be rolled back
- **aborted**: after the transaction has been rolled back and the database is restored to its state prior to the start of the transaction; there are two options after an abort
    1. restart the transaction (e.g., if it aborted due to a crash failure or due to contention)
    2. kill the transaction (e.g., if it aborted due to a logic error that must be debugged)
    
    
- **committed**: after successful completion, updates are reflected in the database
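The states above can be read as a small state machine. The sketch below encodes them in Python; the transition table is an assumption based on the usual textbook state diagram (in particular, that a partially committed transaction can still fail), and the names `State`, `TRANSITIONS`, and `is_valid_path` are invented for illustration.

```python
from enum import Enum

class State(Enum):
    ACTIVE = "active"
    PARTIALLY_COMMITTED = "partially committed"
    FAILED = "failed"
    ABORTED = "aborted"
    COMMITTED = "committed"

# Assumed transitions; aborted and committed are terminal states.
TRANSITIONS = {
    State.ACTIVE: {State.PARTIALLY_COMMITTED, State.FAILED},
    State.PARTIALLY_COMMITTED: {State.COMMITTED, State.FAILED},
    State.FAILED: {State.ABORTED},
    State.ABORTED: set(),
    State.COMMITTED: set(),
}

def is_valid_path(states):
    """True if `states` starts in ACTIVE and follows only allowed transitions."""
    if not states or states[0] is not State.ACTIVE:
        return False
    return all(b in TRANSITIONS[a] for a, b in zip(states, states[1:]))

ok = is_valid_path([State.ACTIVE, State.PARTIALLY_COMMITTED, State.COMMITTED])
rolled_back = is_valid_path([State.ACTIVE, State.FAILED, State.ABORTED])
skipped = is_valid_path([State.ACTIVE, State.COMMITTED])
```

A restart after an abort would be modeled as a new transaction that begins again in the active state.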

# Concurrent Execution

- **Concurrency**: when multiple transactions run simultaneously
    - enables better performance
    - **increased processor and disk utilization**, leading to better **transaction throughput**
    - **reduced average response time** for transactions: short transactions need not wait behind long ones

- a **concurrency control mechanism** is needed to preserve ACID properties, for example when two transactions attempt to access the same row of the same table


# Schedules

- to define concepts such as serializable isolation in a precise manner we need a rigorous framework for reasoning about concurrent execution of transactions

- a **schedule** is a sequence of instructions that specifies the chronological order in which instructions of concurrent transactions are executed
- a **serial schedule** is one in which a transaction that has been started runs to completion before another transaction may start


- a transaction that successfully completes its execution ends with a **commit instruction**
- a transaction that fails to successfully complete its execution ends with an **abort instruction**


- the commit and abort instructions are sometimes omitted for brevity when the outcome of the transaction is clear from the context

- our simplified model of transactions only captures *read and write* operations on data objects, but not *creation or deletion* of such objects
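A schedule in this simplified model can be written down as a list of steps. The sketch below assumes a hypothetical encoding in which each step is a `(transaction, operation, item)` tuple with `"R"` for read and `"W"` for write, and checks whether a schedule is serial.

```python
def is_serial(schedule):
    """True if no transaction resumes after another transaction has started."""
    finished = set()   # transactions whose block of steps has ended
    current = None     # transaction whose steps we are currently scanning
    for txn, _op, _item in schedule:
        if txn != current:
            if txn in finished:
                return False  # txn was interrupted and is now resuming
            if current is not None:
                finished.add(current)
            current = txn
    return True

serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
interleaved = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A")]
```

Here `is_serial(serial)` is true and `is_serial(interleaved)` is false, because `T1` resumes after `T2` has started.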

<img src="img/Snip20191108_190.png" width=80%/>

<img src="img/Snip20191108_191.png" width=80%/>

<img src="img/Snip20191108_192.png" width=80%/>

# Serializability

- **assumption**: each transaction, when executed in isolation, preserves database consistency (e.g., integrity constraints)
    - implies that any serial schedule containing possibly many transactions also preserves database consistency

- a schedule is **serializable** if it is equivalent to a serial schedule
    - a serializable schedule also preserves database consistency even though the transactions in the schedule may not be executed serially

- different definitions of schedule equivalence give rise to the formal notions of **conflict serializability** and **view serializability**


# Conflicting Instructions

- instructions $I_i$ and $I_j$ of transactions $T_i$ and $T_j$, respectively, **conflict** if and only if there exists some item $Q$ accessed by both $I_i$ and $I_j$, and at least one of these instructions writes $Q$
    1. $I_i = read(Q), I_j = read(Q)$: no conflict
    2. $I_i = read(Q), I_j = write(Q)$: conflict
    3. $I_i = write(Q), I_j = read(Q)$: conflict
    4. $I_i = write(Q), I_j = write(Q)$: conflict


- a conflict between $I_i$ and $I_j$ **forces a logical ordering between them**

- if $I_i$ and $I_j$ are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule (commute)
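The conflict rule translates almost directly into code. The following sketch assumes instructions are encoded as hypothetical `(transaction, operation, item)` tuples, with `"R"` for read and `"W"` for write.

```python
def conflicts(a, b):
    """True if two instructions of different transactions conflict."""
    (ta, op_a, item_a), (tb, op_b, item_b) = a, b
    return ta != tb and item_a == item_b and "W" in (op_a, op_b)

cases = {
    "read/read": conflicts(("T1", "R", "Q"), ("T2", "R", "Q")),
    "read/write": conflicts(("T1", "R", "Q"), ("T2", "W", "Q")),
    "write/read": conflicts(("T1", "W", "Q"), ("T2", "R", "Q")),
    "write/write": conflicts(("T1", "W", "Q"), ("T2", "W", "Q")),
    "different items": conflicts(("T1", "W", "Q"), ("T2", "W", "P")),
}
```

Only the read/read pair and the pair on different items are non-conflicting, so only those pairs may be swapped when they are adjacent in a schedule.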

# Conflict Serializability

- def: let $S$ and $S'$ be schedules for some set $R$ of transactions; if schedule $S$ can be transformed into schedule $S'$ by a series of swaps of non-conflicting instructions, then we say that $S$ and $S'$ are **conflict equivalent**
    - none of the swaps can change the order of instructions that belong to the same transaction

- schedule $S$ is **conflict serializable** if it is conflict equivalent to a serial schedule

<img src="img/Snip20191108_193.png" width=80%/>

<img src="img/Snip20191108_195.png" width=60%/>

# View Serializability

- let $S$ and $S'$ be two schedules with the same set of transactions; $S$ and $S'$ are **view equivalent** if the following three conditions are met for each data item $Q$
    1. if in schedule $S$, a $read(Q)$ of transaction $T_i$ returns the initial value of $Q$, then in schedule $S'$ the corresponding $read(Q)$ of $T_i$ must also return the initial value of $Q$
    2. if in schedule $S$, a $read(Q)$ of transaction $T_i$ returns the value written by some $write(Q)$ of transaction $T_j$ (if any), then in schedule $S'$ the corresponding $read(Q)$ of $T_i$ must also return the value written by the corresponding $write(Q)$ of transaction $T_j$
    3. the transaction (if any) that performs the final $write(Q)$ operation in schedule $S$ must also perform the final $write(Q)$ operation in schedule $S'$
    
    
- a schedule $S$ is **view serializable** if it is view equivalent to a serial schedule
- **theorem**: every conflict serializable schedule is also view serializable
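The three conditions can be checked mechanically by comparing, for two schedules, which write each read observes and which transaction writes each item last. The sketch below is a brute-force illustration under an assumed encoding of steps as `(transaction, operation, item)` tuples; the function names are hypothetical, and trying every serial order takes time factorial in the number of transactions.

```python
from itertools import permutations

def view_profile(schedule):
    """Return (reads_from, final_writer) summarizing what a schedule 'sees'."""
    last_write = {}    # item -> (txn, k): the k-th write of txn to item
    counts = {}        # (txn, op, item) -> occurrences so far
    reads_from = set()
    for txn, op, item in schedule:
        k = counts[(txn, op, item)] = counts.get((txn, op, item), 0) + 1
        if op == "R":
            # None as the source means the read returns the initial value.
            reads_from.add(((txn, item, k), last_write.get(item)))
        elif op == "W":
            last_write[item] = (txn, k)
    final_writer = {item: w[0] for item, w in last_write.items()}
    return reads_from, final_writer

def view_equivalent(s1, s2):
    return sorted(s1) == sorted(s2) and view_profile(s1) == view_profile(s2)

def view_serializable(schedule):
    txns = list(dict.fromkeys(t for t, _, _ in schedule))
    for order in permutations(txns):
        serial = [step for t in order for step in schedule if step[0] == t]
        if view_equivalent(schedule, serial):
            return True
    return False

# Blind writes: view equivalent to the serial order T1, T2, T3.
blind = [("T1", "R", "Q"), ("T2", "W", "Q"), ("T1", "W", "Q"), ("T3", "W", "Q")]
# No serial order preserves both the initial read and the final write.
bad = [("T1", "R", "Q"), ("T2", "W", "Q"), ("T1", "W", "Q")]
```

The `blind` schedule is view serializable even though `T1` and `T2` have conflicting operations in both directions, which shows that the converse of the theorem does not hold.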

<img src="img/Snip20191108_196.png" width=80%/>



# Testing for Conflict Serializability

- consider a given schedule of a set of transactions $T_1, T_2,..., T_n$
- **precedence graph**: a directed graph where the vertices are the transactions (denoted by their names), and edges represent conflicting operations
- we draw an edge from $T_i$ to $T_j$ if the two transactions contain conflicting instructions on some data item $X$, and
    - $T_i$ does a $write(X)$ before $T_j$ does a $read(X)$, or
    - $T_i$ does a $read(X)$ before $T_j$ does a $write(X)$, or
    - $T_i$ does a $write(X)$ before $T_j$ does a $write(X)$

- **Observation**: an edge from $T_i$ to $T_j$ implies that $T_i$ must precede $T_j$ in any conflict-equivalent serial schedule (if one exists at all)

- **Theorem**: a schedule is conflict serializable if and only if its precedence graph is acyclic

<img src="img/Snip20191108_197.png" width=40%/>

- cycle detection can be done in $O(n + e)$ time where $n$ is the number of vertices in the graph and $e$ is the number of edges
- if the precedence graph is acyclic, the order of transactions in an equivalent serial schedule can be obtained by a *topological sorting* of the graph
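The test can be sketched with the standard-library `graphlib` module (Python 3.9+), which performs the topological sort and raises `CycleError` when the graph has a cycle. The schedule encoding as `(transaction, operation, item)` tuples and the function names are assumptions for this example; building the graph by comparing all pairs of steps is quadratic and chosen for brevity, not efficiency.

```python
from graphlib import CycleError, TopologicalSorter

def precedence_graph(schedule):
    """Map each transaction to the set of transactions that must precede it."""
    preds = {txn: set() for txn, _, _ in schedule}
    for i, (ti, op_i, x_i) in enumerate(schedule):
        for tj, op_j, x_j in schedule[i + 1:]:
            # Edge Ti -> Tj: an earlier step of Ti conflicts with a later step of Tj.
            if ti != tj and x_i == x_j and "W" in (op_i, op_j):
                preds[tj].add(ti)
    return preds

def equivalent_serial_order(schedule):
    """Return a conflict-equivalent serial order, or None if the graph is cyclic."""
    try:
        return list(TopologicalSorter(precedence_graph(schedule)).static_order())
    except CycleError:
        return None

good = [
    ("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
    ("T1", "R", "B"), ("T1", "W", "B"), ("T2", "R", "B"), ("T2", "W", "B"),
]
cyclic = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
```

For `good` the only edge is from `T1` to `T2`, so the serial order is `T1` followed by `T2`; for `cyclic` there are edges in both directions and the function returns `None`.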

<img src="img/Snip20191108_198.png" width=60%/>


# Test for View Serializability

- the precedence graph test for conflict serializability cannot be used directly to test for view serializability
    - if the precedence graph is acyclic then the schedule is view-serializable, but if there is a cycle in the graph then the schedule may or may not be view-serializable
    - an extension of the test to view serializability has cost exponential in the size of the precedence graph
- the problem of testing view serializability falls in the class of NP-complete problems

# Recoverable Schedules

- motivation: we must address the effect of transaction failures on concurrently executing transactions
- **Recoverable Schedule**: if a transaction $T_j$ reads a data item previously written by a transaction $T_i$, and if $T_j$ later commits successfully, then $T_i$ also commits successfully and moreover, the commit operation of $T_i$ occurs before the commit operation of $T_j$
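The definition can be checked with a single pass over a schedule. The sketch below assumes steps are encoded as `(transaction, operation, item)` tuples with `"C"` marking a commit (and `None` as its item); as a simplification it ignores the fact that an abort restores the previously written value.

```python
def is_recoverable(schedule):
    """True if every transaction commits only after all transactions it read from."""
    last_writer = {}   # item -> transaction that most recently wrote it
    reads_from = {}    # reader -> set of transactions whose writes it has read
    committed = set()
    for txn, op, item in schedule:
        if op == "W":
            last_writer[item] = txn
        elif op == "R":
            writer = last_writer.get(item)
            if writer is not None and writer != txn:
                reads_from.setdefault(txn, set()).add(writer)
        elif op == "C":
            # Every writer this transaction depends on must already be committed.
            if not reads_from.get(txn, set()) <= committed:
                return False
            committed.add(txn)
    return True

unsafe = [("T1", "W", "A"), ("T2", "R", "A"), ("T2", "C", None), ("T1", "C", None)]
safe = [("T1", "W", "A"), ("T2", "R", "A"), ("T1", "C", None), ("T2", "C", None)]
```

In `unsafe`, `T2` commits a value it read from `T1` before `T1` commits, so a later failure of `T1` could not be undone consistently.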

<img src="img/Snip20191108_199.png" width=80%/>

# Cascading Rollbacks

- **Cascading Rollback**: a single transaction failure leads to a series of transaction rollbacks

<img src="img/Snip20191108_201.png" width=60%/>

- none of the transactions has yet committed so the schedule is recoverable
- if $T_{10}$ fails, $T_{11}$ and $T_{12}$ must also be rolled back
- can lead to the undoing of a significant amount of work, hence wasted CPU cycles and disk IOs

# Cascadeless Schedules

- **Cascadeless Schedules**: cascading rollbacks cannot occur
    - for each pair of transactions $T_i$ and $T_j$ such that $T_j$ reads a data item previously written by $T_i$, transaction $T_i$ commits successfully, and moreover, the commit operation of $T_i$ occurs **before** the read operation of $T_j$

- for performance it is desirable to restrict the schedules to those that are cascadeless
- **Theorem**: every cascadeless schedule is a recoverable schedule
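The cascadeless condition amounts to forbidding reads of uncommitted writes. The sketch below assumes the same hypothetical encoding of steps as `(transaction, operation, item)` tuples, with `"C"` marking a commit, and again ignores the effect of aborts on stored values.

```python
def is_cascadeless(schedule):
    """True if no transaction reads an item last written by an uncommitted transaction."""
    last_writer = {}   # item -> transaction that most recently wrote it
    committed = set()
    for txn, op, item in schedule:
        if op == "W":
            last_writer[item] = txn
        elif op == "C":
            committed.add(txn)
        elif op == "R":
            writer = last_writer.get(item)
            if writer is not None and writer != txn and writer not in committed:
                return False  # read of uncommitted data
    return True

cascading = [
    ("T10", "R", "A"), ("T10", "W", "A"),
    ("T11", "R", "A"), ("T11", "W", "A"),
    ("T12", "R", "A"),
]
clean = [("T1", "W", "A"), ("T1", "C", None), ("T2", "R", "A"), ("T2", "C", None)]
```

Because a cascadeless schedule delays each such read until after the writer's commit, the reader's own commit necessarily comes later still, which is why every cascadeless schedule is recoverable.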




# Concurrency Control

- a database must provide a mechanism that ensures that all possible schedules are
    - conflict serializable or view serializable, and
    - recoverable and preferably cascadeless
- a policy in which only one transaction can execute at a time generates only serial schedules, but severely restricts concurrency
- testing a schedule for serializability *after* it has executed is too late
- Goal: develop concurrency control protocols that will guarantee serializability by design
- Observation: concurrency control protocols generally do not execute serializability tests internally
    - we can use serializability tests to understand why a concurrency control protocol is correct


# Weaker-than-serializable Isolation

- some applications can tolerate weak levels of transaction isolation
    - e.g., a read-only transaction that only wants to get an approximate total balance of all accounts
    - e.g., approximate db stats computed for query optimization
- the ability to choose among multiple isolation levels makes it possible to trade off correctness for performance

# Transaction Isolation Anomalies

- **Phantom Read**: the results of a query in one transaction are changed by another transaction before the former commits
    - e.g., $T_1$ executes a range query (e.g., find instructors where salary is less than 80k), then $T_2$ inserts a record, and finally $T_1$ executes the same range query again with a different result set


- **Non-repeatable Read**: repeated reads of same record in one transaction return different values because of an update made by another transaction
    - e.g., $T_1$ reads a record, then $T_2$ updates the same record, then $T_1$ reads the record again and sees the updated value
    - not a conflict-serializable schedule in this case


- **Dirty Read**: one transaction reads a value written by another transaction that has not yet committed
    - e.g., $T_1$ updates a record in a table, then $T_2$ reads the updated record before $T_1$ has committed
    - not a cascadeless schedule, not recoverable either if $T_2$ commits before $T_1$
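As an illustration of how such an anomaly shows up in a schedule, the sketch below scans for the non-repeatable read pattern: a transaction reads an item, a different transaction writes it, and the first transaction reads it again. The `(transaction, operation, item)` encoding and the function name are assumptions, and commit status is ignored for simplicity.

```python
def non_repeatable_reads(schedule):
    """Return (reader, item, writer) triples matching the read, write, re-read pattern."""
    # (reader, item) -> other transactions that wrote item since reader's last read
    writers_since_read = {}
    found = []
    for txn, op, item in schedule:
        if op == "R":
            writers = writers_since_read.setdefault((txn, item), set())
            for writer in sorted(writers):
                found.append((txn, item, writer))
            writers.clear()
        elif op == "W":
            for (reader, read_item), writers in writers_since_read.items():
                if read_item == item and reader != txn:
                    writers.add(txn)
    return found

anomalous = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "R", "X")]
harmless = [("T1", "R", "X"), ("T1", "R", "X"), ("T2", "W", "X")]
```

For `anomalous` the function reports that `T1` re-read `X` after a write by `T2`; for `harmless` it reports nothing, since the write happens after both reads.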

# Transaction Isolation Levels in SQL-92

<img src="img/Snip20191111_219.png" width=80%/>

- MySQL's InnoDB storage engine provides **repeatable read** by default