# Scaling and concurrency

* 22nd March 2018

``<jeep@cphbusiness.dk>``

## Agenda

* PostgreSQL functions
* Aggregations and grouping
* PostgreSQL statistics
* Triggers

* Concurrency in PostgreSQL
  * Read and write locks
* Scaling
  * Dealing with concurrent data

### Literature
* [PostgreSQL planner and its usage of statistics](https://www.citusdata.com/blog/2018/03/06/postgres-planner-and-its-usage-of-statistics/)
* [Pgpool](https://wiki.postgresql.org/wiki/Pgpool-II)
* [Serverless databases](https://www.simform.com/serverless-databases/)

# Learning objectives
## Knowledge
The student must have knowledge of:

 * Various database types and the underlying models
 * A specific database system’s storage organisation and query execution
 * A specific database system’s optimisation possibilities – including advantages and disadvantages
 * Database-specific security problems and their solutions
 * Concepts and issues when handling big data
 * The particular issues raised by having many simultaneous transactions, including in connection with distributed databases
 * Relational algebra (including its relationship to execution plans)

## Skills
The student can:

 * Transform logical data models into physical models in various database types
 * Implement database optimisation
 * Use parts of the administration tool to assist in the optimisation and tuning of existing databases, including the incorporation of a specific DBMS’ execution plans
 * Use a specific database system’s tools for handling simultaneous transactions
 * Use the programming and other facilities provided by a modern DBMS


## Competencies
The student can:
 
 * Analyse the application domain in order to select a database type
 * Divide responsibility for tasks between the application and DBMS during system development, to ensure the best possible implementation.


## PostgreSQL functions

PostgreSQL has a large number of functions and operators for its built-in data types.

You've already seen some of them:

* ``>``, ``<``, ``=``
* ``count``, ``sum``

There are plenty more! Look them up in PostgreSQL's documentation:

      https://www.postgresql.org/docs/10/static/functions.html

### Exercise on functions

Open a connection to your PostgreSQL database and:

* Convert ``'I HATE CAPS'`` to lowercase
* Find the smallest network where both '10.2.17.12' and '10.2.16.13' are contained
* Get the current time (both date and time to millisecond precision)
* Sleep for 6.7 seconds and prove it using ``EXPLAIN ANALYZE``
* Get the md5 sum of ``All your base``
* Find today's day of the year (somewhere between 1 and 366)

Remember, you can always execute a ``SELECT`` statement without a ``FROM`` clause, like so: ``SELECT 1;``

Documentation is available here: ``https://www.postgresql.org/docs/10/static/functions.html``

## Aggregating data

Aggregation functions allow you to get a single result from a whole set.

Example: ``SELECT count(*) FROM tweet;``

Examples of aggregation functions are:

* ``count``
* ``sum`` and ``avg``
* ``max`` and ``min``
* ``stddev``
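
As a rough illustration of "a single result from a whole set", the sketch below runs several aggregates over a tiny table. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the `retweets` column and the sample rows are made up for the example. (`stddev` is left out because SQLite does not ship it.)

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
# Hypothetical, simplified version of the tweet table.
conn.execute("CREATE TABLE tweet (lang TEXT, retweets INTEGER)")
conn.executemany(
    "INSERT INTO tweet VALUES (?, ?)",
    [("en", 4), ("en", 0), ("es", 10), ("fr", 2)],
)

# Each aggregate collapses all four rows into one value.
row = conn.execute(
    "SELECT count(*), sum(retweets), avg(retweets), min(retweets), max(retweets) "
    "FROM tweet"
).fetchone()
print(row)  # one row, one value per aggregate
```

No matter how many rows the table holds, the query returns exactly one row.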

## Grouping data

In SQL you can group data to learn more about a range of values:

* Grouping tweets based on language
* Grouping tweets based on location
* Grouping tweets based on hour


Groupings can happen on more than one column!

SQL syntax: ``GROUP BY``

* Grouping on a column takes each unique value in that column and aggregates all tuples that share that value
* Example: 

        SELECT lang, count(*) FROM tweet GROUP BY lang ORDER BY count(*) DESC;
         lang | count
        ------+--------
         en   | 172206
         es   |  27062
         fr   |   1695
         it   |    959
         pt   |    737

## How not to do grouping

    SELECT * FROM tweet GROUP BY lang;
    ERROR:  column "tweet.id" must appear in the GROUP BY clause or be used in an aggregate function
    LINE 1: SELECT * FROM tweet GROUP BY lang;

For grouping to work you need:

1. A column to group by (``lang``)
2. An aggregate function for every other column you select (here ``count(*)``)

    SELECT count(*) FROM tweet GROUP BY lang;
     count
    --------
        442
        636
     172206
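
The two rules above, and grouping on more than one column, can be sketched as follows. This is a minimal example on made-up rows, again using Python's `sqlite3` as a stand-in for PostgreSQL. The `country` column is an assumption about the table layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE tweet (lang TEXT, country TEXT)")  # hypothetical columns
conn.executemany(
    "INSERT INTO tweet VALUES (?, ?)",
    [("en", "US"), ("en", "US"), ("en", "GB"), ("es", "ES"), ("es", "MX")],
)

# One output row per distinct lang; count(*) aggregates the rows in each group.
by_lang = conn.execute(
    "SELECT lang, count(*) FROM tweet GROUP BY lang ORDER BY count(*) DESC, lang"
).fetchall()

# Grouping on two columns: one output row per distinct (lang, country) pair.
by_lang_country = conn.execute(
    "SELECT lang, country, count(*) FROM tweet "
    "GROUP BY lang, country ORDER BY lang, country"
).fetchall()

print(by_lang)
print(by_lang_country)
```

Note that SQLite is more permissive than PostgreSQL here: it would accept ``SELECT * ... GROUP BY lang`` without an error, so the failing query above is best tried against PostgreSQL itself.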

## Plotting in your SQL notebook



In [None]:
%load_ext sql

In [None]:
%sql postgresql://appdev@0.0.0.0/appdev

In [None]:
# This will allow you to plot your graphs
%matplotlib inline

In [None]:
result = %sql SELECT lang, count(*) FROM tweet GROUP BY lang ORDER BY count(*) DESC;

In [None]:
result.bar()

### Grouping exercise

Using ``GROUP BY`` on the ``public.tweet`` table:

* Group tweets by country and aggregate on the total number of tweets from that country
* Group tweets by place and aggregate on the earliest timestamp (a ``date`` and a ``time`` field can be combined into a timestamp with ``+``; strings are concatenated with ``||``)
* Find the 10 places with the most users registered

# PostgreSQL execution plan statistics

* ``ANALYZE``, ``VACUUM``
* PostgreSQL planning source
* Execution statistics

## PostgreSQL housekeeping

* Query planning is based on table statistics
  * Tables stored in 
* The statistics need to be kept up to date
  * ``ANALYZE``

## Execution plan information

* Source: https://www.postgresql.org/docs/9.2/static/using-explain.html

``EXPLAIN`` shows the execution plan PostgreSQL would choose for a query, without running the query:
    
    EXPLAIN SELECT * FROM tweet;                                                                    
                           QUERY PLAN                                                                    
    ----------------------------------------------------------------                                         
     Seq Scan on tweet  (cost=0.00..14758.20 rows=204820 width=468)                                          
    (1 row)               
                               ^1      ^2           ^3          ^4


1. Estimated start-up cost: time expended *before* the output phase can begin, e.g. for sorting.

2. Estimated total cost, assuming the plan runs to completion. The actual cost can be lower, e.g. with ``LIMIT``.

3. Estimated number of rows output, again assuming the plan runs to completion. The actual number can be smaller.

4. Estimated average width of the output rows (in bytes).
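
To make the four fields concrete, here is a small sketch that pulls them out of the plan line shown above with a regular expression. The pattern and the field names (`startup`, `total`, `rows`, `width`) are chosen for this example only and are not part of PostgreSQL.

```python
import re

# The plan line from the EXPLAIN output above.
plan_line = " Seq Scan on tweet  (cost=0.00..14758.20 rows=204820 width=468)"

# Hypothetical helper pattern: start-up cost, total cost, rows, width.
pattern = re.compile(
    r"cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)"
)

match = pattern.search(plan_line)
estimate = {
    "startup": float(match.group("startup")),  # 1. start-up cost
    "total": float(match.group("total")),      # 2. total cost
    "rows": int(match.group("rows")),          # 3. estimated rows
    "width": int(match.group("width")),        # 4. average row width in bytes
}
print(estimate)
```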


## A note on cost

* Arbitrary unit of "execution time"
  * By convention, fetching one page sequentially from disk costs 1.0 (``seq_page_cost``)
  
* Different planning strategies with different parameters
  * Example: the genetic query optimizer (GEQO), used for queries that join many tables
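
As a worked example of the cost unit: the PostgreSQL documentation gives the cost of a sequential scan as (pages read × `seq_page_cost`) + (rows scanned × `cpu_tuple_cost`), with defaults of 1.0 and 0.01. The sketch below reproduces the total cost from the `EXPLAIN` output earlier. The page count of 12710 is not stated anywhere in these notes; it is an assumption derived by working backwards from the total cost with default settings.

```python
SEQ_PAGE_COST = 1.0    # PostgreSQL default: cost of one sequential page fetch
CPU_TUPLE_COST = 0.01  # PostgreSQL default: cost of processing one row


def seq_scan_cost(pages, rows):
    """Estimated total cost of a plain sequential scan."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


rows = 204820  # from the EXPLAIN output
pages = 12710  # assumption: implied by 14758.20 - 204820 * 0.01
cost = seq_scan_cost(pages, rows)
print(f"{cost:.2f}")
```

On your own database you can compare against the real page count with ``SELECT relpages, reltuples FROM pg_class WHERE relname = 'tweet';``.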

## PostgreSQL secret planning sauce

PostgreSQL bases its planning on statistics exposed through the ``pg_stats`` view.

* Documented here: https://www.postgresql.org/docs/current/static/view-pg-stats.html

    => \d pg_stats                                                                                                                                                                                               
              View "pg_catalog.pg_stats"                                                                                                                                                                               
             Column         |   Type   | Modifiers 
    ------------------------+----------+-----------
     schemaname             | name     | 
     tablename              | name     | 
     attname                | name     | 
     inherited              | boolean  | 
     null_frac              | real     | 
     avg_width              | integer  | 
     n_distinct             | real     | 
     most_common_vals       | anyarray | 
     most_common_freqs      | real[]   | 
     histogram_bounds       | anyarray | 
     correlation            | real     | 
     most_common_elems      | anyarray | 
     most_common_elem_freqs | real[]   | 
     elem_count_histogram   | real[]   |

## Creating statistics in PostgreSQL

* Statistics are updated using ``ANALYZE``
  * By default only for individual columns
  
* You can create statistics for multiple columns, if you know they are often included in queries
  * ``CREATE STATISTICS mystat (ndistinct) ON latitude, longitude FROM tweet;``
  
* This will help PostgreSQL to take multicolumn queries into account, and allow it to optimise your query
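
Why do per-column statistics fall short? The sketch below is an illustration on made-up data, not the planner's actual formula. When two columns are correlated, guessing the number of distinct pairs from the per-column counts alone overshoots badly. A multi-column ``ndistinct`` statistic gives the planner the real number instead.

```python
# Hypothetical data: col2 is fully determined by col1.
rows = [(i % 1000, (i % 1000) * 2) for i in range(10_000)]

distinct_col1 = len({col1 for col1, _ in rows})
distinct_col2 = len({col2 for _, col2 in rows})

# Naive guess, treating the columns as independent (capped at the row count).
independent_guess = min(distinct_col1 * distinct_col2, len(rows))

# What a statistic over both columns would record.
actual_pairs = len(set(rows))

print(distinct_col1, distinct_col2, independent_guess, actual_pairs)
```

The gap between the guess and the actual number of groups is the kind of misestimate that can push the planner towards a sort-based plan when a hash-based one would do.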

## Example: From grouping to hashing

    EXPLAIN ANALYZE SELECT col1,col2,count(*) from tbl group by col1, col2;                   
                                                         QUERY PLAN                                                          
    -----------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=1990523.20..2091523.04 rows=100000 width=16) (actual time=2697.246..4470.789 rows=1001 loops=1)
       Group Key: col1, col2
       ->  Sort  (cost=1990523.20..2015523.16 rows=9999984 width=8) (actual time=2695.498..3440.880 rows=10000000 loops=1)
             Sort Key: col1, col2
             Sort Method: external sort  Disk: 176128kB
             ->  Seq Scan on tbl  (cost=0.00..144247.84 rows=9999984 width=8) (actual time=0.008..665.689 rows=10000000 loops=1)
     Planning time: 0.072 ms
     Execution time: 4494.583 ms

## Example: From grouping to hashing

    CREATE STATISTICS s2 (ndistinct) on col1, col2 from tbl;                                  
    ANALYZE tbl;

    EXPLAIN ANALYZE SELECT col1,col2,count(*) from tbl group by col1, col2;                   
                                                          QUERY PLAN                                                       
    -----------------------------------------------------------------------------------------------------------------------
     HashAggregate  (cost=219247.63..219257.63 rows=1000 width=16) (actual time=2431.767..2431.928 rows=1001 loops=1)
       Group Key: col1, col2
       ->  Seq Scan on tbl  (cost=0.00..144247.79 rows=9999979 width=8) (actual time=0.008..643.488 rows=10000000 loops=1)
     Planning time: 0.129 ms
     Execution time: 2432.010 ms
    (5 rows)

# Functions and triggers

* DBMSs contain tons of functions
  * And naturally you can construct your own
* Triggers are a useful use case

## Functions in PostgreSQL

A function can be created with the ``CREATE FUNCTION`` statement:

    CREATE FUNCTION function_name(p1 type, p2 type)
     RETURNS type AS
    $$ -- This starts a multiline string
    BEGIN
     -- logic
    END;
    $$ -- This ends a multiline string
    LANGUAGE language;
    
* This creates a function where the ``BEGIN .. END`` block is executed whenever the function is called
* Often also called a **stored procedure**
* Can return either 
  * singular values (``int``, ``varchar``) or 
  * whole relations (``TABLE(column type, ...)``)
* Can be dropped with ``DROP FUNCTION``

In [None]:
%%sql
CREATE FUNCTION myfun(s varchar)
  RETURNS int AS 
$$
BEGIN
  RETURN length(s);
END;
$$
LANGUAGE PLPGSQL;

In [None]:
%sql SELECT myfun('hullubullu');

In [None]:
%%sql
CREATE FUNCTION tweet_from_country(s varchar)
  RETURNS TABLE(count bigint) AS -- count(*) returns bigint, so the column type must match
$$
BEGIN
  RETURN QUERY SELECT count(*) FROM tweet WHERE country = s;
END;
$$
LANGUAGE PLPGSQL;

In [None]:
%sql SELECT tweet_from_country('US');

## Triggers

* Triggers are basically event-driven function calls
  * On ``INSERT``, ``UPDATE``, ``DELETE``, ``TRUNCATE``
* Particularly useful for
  * Logging (auditing)
  * Checking extra constraints
  * Performing periodic operations
* Created with ``CREATE TRIGGER``
  * Requires a stored procedure that returns a ``trigger``
  * Gives access to the ``OLD`` and ``NEW`` variables
    * ``RETURN OLD`` in a ``BEFORE UPDATE`` row trigger will, for instance, discard any changes

## Trigger example


In [None]:
%%sql
CREATE FUNCTION crash()
  RETURNS trigger AS
$$
BEGIN
  RAISE EXCEPTION 'You shall not pass! %', NEW.language;
END
$$
LANGUAGE PLPGSQL;

In [None]:
%%sql
CREATE TRIGGER crash_trigger
    BEFORE UPDATE ON hello                 
    FOR EACH ROW
    EXECUTE PROCEDURE crash();

In [None]:
%sql SELECT * FROM hello WHERE language = 'Emacs';

In [None]:
result = %sql UPDATE hello SET language = 'Emacs, the awesome editor' WHERE language = 'Emacs';
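
If you want to try the same idea without a PostgreSQL server, the sketch below reproduces the "refuse every update" trigger in SQLite through Python's `sqlite3` module. This is a stand-in with different syntax: SQLite has no separate trigger function, so the body sits inside ``CREATE TRIGGER`` and uses ``RAISE(ABORT, ...)`` instead of ``RAISE EXCEPTION``. The sample row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE hello (language TEXT, hello TEXT)")
conn.execute("INSERT INTO hello VALUES ('Emacs', 'M-x hello')")  # made-up row
conn.commit()

# SQLite keeps the trigger body inside the trigger definition itself.
conn.execute(
    """
    CREATE TRIGGER crash_trigger
    BEFORE UPDATE ON hello
    FOR EACH ROW
    BEGIN
        SELECT RAISE(ABORT, 'You shall not pass!');
    END
    """
)

try:
    conn.execute(
        "UPDATE hello SET language = 'Emacs, the awesome editor' "
        "WHERE language = 'Emacs'"
    )
    blocked = False
    message = ""
except sqlite3.DatabaseError as exc:
    # The trigger aborted the statement before any row was changed.
    blocked = True
    message = str(exc)

unchanged = conn.execute("SELECT language FROM hello").fetchall()
print(blocked, message, unchanged)
```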

## Exercise on triggers

The table ``hello`` contains two columns (``language`` and ``hello``) and holds translations of "hello" in different languages. We want to audit whenever someone inserts a new language.

1. Create a table called ``hello_log`` containing two columns: ``language varchar`` and ``time timestamp``
2. Create a trigger function that inserts a row into your ``hello_log`` containing the name of the new language (``NEW.language``) and the current timestamp on insertion (``now()``)
  1. Note: This is NOT a trigger, just a function that returns a trigger
3. Create a trigger that triggers your trigger function when someone inserts a new language
4. Try to insert the new language ``brainfuck`` with the value ``++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.``

# Controlling concurrency in PostgreSQL

PostgreSQL comes with tons of concurrency logic. We will focus on two important aspects:

* Isolation through transactions
* Locks and lock types

## Isolating transactions

* Remember ACID?
  * What did Isolation stand for?


* You define what belongs to a transaction by wrapping statements in ``BEGIN`` and ``COMMIT`` (or ``ROLLBACK`` to undo them):

      BEGIN; ... COMMIT;
      BEGIN; ... ROLLBACK;

**NOTE:** Your ``sql`` multiline magic does **NOT** support this (yet)!

Example: ``BEGIN; TRUNCATE hello; ROLLBACK;``
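
The effect of ``ROLLBACK`` can be sketched outside the notebook magic as well. The example below uses Python's `sqlite3` as a stand-in for PostgreSQL, with made-up rows. SQLite has no ``TRUNCATE``, so ``DELETE FROM`` is used instead. ``isolation_level=None`` is set so that the module sends exactly the ``BEGIN`` and ``ROLLBACK`` written here.

```python
import sqlite3

# isolation_level=None: sqlite3 opens no implicit transactions for us.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE hello (language TEXT, hello TEXT)")
conn.execute("INSERT INTO hello VALUES ('Danish', 'Hej'), ('English', 'Hello')")

conn.execute("BEGIN")
conn.execute("DELETE FROM hello")  # stand-in for TRUNCATE hello
inside = conn.execute("SELECT count(*) FROM hello").fetchone()[0]
conn.execute("ROLLBACK")  # undo everything since BEGIN

after = conn.execute("SELECT count(*) FROM hello").fetchone()[0]
print(inside, after)
```

Inside the transaction the table looks empty, but after the rollback both rows are back.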

## Locks and lock types

Locks exist in many variants, but it is useful to know two types:

  * Read locks
    * Can be obtained by many different threads at once
    * Prevents any writing
  * Write locks
    * Can be obtained by only *one thread at a time*
    * Prevents any other reads and writes
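
The compatibility rules of the two lock types can be sketched as a toy, single-threaded lock table. This is a minimal model of the generic read/write idea only; the class and method names are invented for the example, and it is not how PostgreSQL's lock manager is implemented.

```python
class TableLock:
    """Toy read/write lock for one table (non-blocking, illustration only)."""

    def __init__(self):
        self.readers = 0     # number of read locks currently held
        self.writer = False  # whether the write lock is currently held

    def try_read(self):
        # Read locks are shared, but refused while someone is writing.
        if self.writer:
            return False
        self.readers += 1
        return True

    def try_write(self):
        # The write lock is exclusive: no readers and no other writer.
        if self.writer or self.readers > 0:
            return False
        self.writer = True
        return True

    def release_read(self):
        self.readers -= 1

    def release_write(self):
        self.writer = False


lock = TableLock()
results = [
    lock.try_read(),   # first reader gets the lock
    lock.try_read(),   # second reader shares it
    lock.try_write(),  # writer is refused while readers hold the lock
]
lock.release_read()
lock.release_read()
results.append(lock.try_write())  # writer succeeds once the readers are gone
results.append(lock.try_read())   # reader is refused while the writer holds it
print(results)
```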

## Lock types and SQL statements

What kinds of locks will be provoked by these statements:

* ``SELECT * FROM tweet;``
* ``UPDATE tweet SET country = 'Ukraine' WHERE uname = 'C';``
* ``DELETE FROM tweet;``
* ``SELECT * FROM tweet WHERE uname = 'C';``
* ``TRUNCATE tweet;``

## Locking a table in a transaction

You can specifically lock an entire table during a transaction: ``LOCK TABLE table;``

    BEGIN;
    LOCK TABLE tweet;
    SELECT * FROM tweet;
    COMMIT;

... Although I don't really know why you'd need this

## Locks and transactions

More importantly you should consider your level of concurrency in transactions:

    BEGIN;
    SELECT * FROM tweet;
    DELETE FROM tweet WHERE uname = 'C';
    SELECT * FROM tweet;
    COMMIT;
    
Versus

    BEGIN;
    SELECT * FROM tweet;
    DELETE FROM tweet WHERE uname = 'C';
    SELECT * FROM tweet;
    COMMIT;


# Scaling a database

* Why databases are hard to scale
* Serverless databases
* Scaling PostgreSQL

## Why databases are hard to scale

* Typical scaling requires three things
  * Redundancy
  * Somewhat linear performance
  * Availability
  
* ... And **always** consider *online* backups

## Redundancy, linear scaling and availability in a DBMS

* ACID
  * We talked about what that means for a database on a single computer
  * What happens when you need these guarantees over a network of computers?

* Normalisation
  * What does that mean for scaling?

## Serverless databases benefits

* Real-time Access
  * You have access to your data at a granular level. Whatever data you store is indexed automatically, and the indexes are available immediately.

* Infinite Scalability
  * Serverless databases can be scaled up or down anytime you want... they start-up or shut down as per the application’s need.

* High Security
  * Most traditional databases implement schema-level user authentication only.

* Availability
  * As businesses go global, it is imperative to have your data replicated across geographic locations, close to where your users are.

* Schemaless
  * This enables you to handle any data output from your functions.

## Limitations of traditional databases

1. Overspending on Resources
  1. Traditional database infrastructure means they benefit very little from resource sharing.
2. Locality of Data
  1. Who doesn’t have a global customer base?
3. Higher Fulfillment Time
  1. It is quite hard for the development team to add functionalities.

## The problem of serverless

* Who owns your data?
  * General data protection regulation (GDPR)
* Does it actually spread geographically?
  * Depends on vendor
* Price models
  * Ad-hoc scaling sounds good, but then the bill arrives

## Naïve solution to database scaling

Replicate your queries. Or better: only your write queries.

* Solves the problem of redundancy
* Partly solves availability
* Semi-poor scalability

This is the approach of [pgpool](https://wiki.postgresql.org/wiki/Pgpool-II)

* Probably sufficient when you have more reads than writes
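
The idea of replicating write queries can be sketched in a few lines. This is a toy model under stated assumptions, not how pgpool is implemented: the replicas are independent in-memory SQLite databases, and the ``ReplicatedDatabase`` class and its methods are invented for the example.

```python
import itertools
import sqlite3


class ReplicatedDatabase:
    """Toy query replicator: writes go everywhere, reads go to one replica."""

    def __init__(self, replica_count):
        self.replicas = [sqlite3.connect(":memory:") for _ in range(replica_count)]
        self._next_reader = itertools.cycle(range(replica_count))

    def write(self, sql, params=()):
        # Every replica must apply every write to stay identical.
        for replica in self.replicas:
            replica.execute(sql, params)
            replica.commit()

    def read(self, sql, params=()):
        # Reads are spread round-robin, so read capacity grows with replicas.
        index = next(self._next_reader)
        return index, self.replicas[index].execute(sql, params).fetchall()


db = ReplicatedDatabase(replica_count=3)
db.write("CREATE TABLE tweet (uname TEXT)")
db.write("INSERT INTO tweet VALUES (?)", ("C",))

reads = [db.read("SELECT uname FROM tweet") for _ in range(4)]
print(reads)
```

Each write costs as many executions as there are replicas, while each read costs one. That is why this approach fits read-heavy workloads and scales poorly for writes.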

## Complex solution

Handle this internally in the database (clustering)

* Solves the problem of redundancy
* Solves availability
* Fair scalability

This is the approach of many NoSQL databases, as well as:

* [Postgres-BDR](https://www.2ndquadrant.com/en/resources/bdr/)
* [Citus](https://www.citusdata.com)

## A good database setup

* Quick
  * Time to market is extremely important
* Efficient
  * Performance seriously matters, especially at scale
* Cheap
  * Opaque price models can kill your business
* Offers full control
  * Think long term

# A query within a query

In the query ``SELECT X FROM Y`` you select column $X$ from relation $Y$. 

The output of that is a relation with one column: $X$

So if the output is also a relation, why not use a selection within a selection?

    SELECT X FROM (SELECT X FROM T);

This is particularly useful in joins:

    SELECT A FROM T
    INNER JOIN 
      (SELECT B FROM U)

## Named subqueries

PostgreSQL can sometimes complain that subqueries do not have a name. 

By name, it means an alias that refers to the subquery, so PostgreSQL knows where to find the data.

Solution:

    SELECT A FROM T
    INNER JOIN
      (SELECT B FROM U) AS subquery
    ORDER BY subquery.B

This won't be a problem with views (because they're already named)
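
A runnable version of the named subquery is sketched below, using Python's `sqlite3` as a stand-in for PostgreSQL with made-up rows. The join condition ``T.A = subquery.B`` is an assumption added for the example, since a real ``INNER JOIN`` in PostgreSQL needs an ``ON`` clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.executescript(
    """
    CREATE TABLE T (A INTEGER);
    CREATE TABLE U (B INTEGER);
    INSERT INTO T VALUES (1), (2), (3);
    INSERT INTO U VALUES (2), (3), (4);
    """
)

# The alias "subquery" gives the inner SELECT a name that the
# ON and ORDER BY clauses can refer to.
rows = conn.execute(
    """
    SELECT T.A
    FROM T
    INNER JOIN (SELECT B FROM U) AS subquery
        ON T.A = subquery.B
    ORDER BY subquery.B
    """
).fetchall()
print(rows)
```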

# Today's curious read: How to create a re-arrangeable order

https://begriffs.com/posts/2018-03-20-user-defined-order.html

# Assignment 7: Grouping and more joins

This assignment focuses on the use of ``GROUP BY`` and different types of joins. Proceed stepwise and take everything very slowly to avoid getting very confused.

1. Using a single join and a grouping, write a query that allows you to find the ``forename``, ``surname``, ``driverid`` and the total number of wins for each driver in the ``f1db`` schema.
2. Write a query that joins the tables ``results``, ``constructors`` and ``drivers`` to show the number of times a driver has driven a car from a constructor. In other words, how many times one driver (for instance Schumacher) has driven for a constructor (for instance Mercedes). Your table should have three fields: ``drivers.driverref``, ``constructors.name`` and ``count(*)`` (showing the number of times the driver has driven for the constructor).
  1. Hint: You can group on more than one value
3. Now we have found the most driven vehicles even for those who didn't finish the race. Extend the query from 2 by removing all tuples from the result where the status is not "Finished".
4. Create a third (and new) query that finds the number of milliseconds spent in pitstop (see the ``pitstops`` relation) for each unique combination of ``driverid`` and ``raceid``.
  1. Hint: use the ``sum`` aggregate function to find the total number of milliseconds
  2. Hint: I recommend making this into a view
5. Time to put everything together. Use the query for 4 as a join subquery in query 3 (see slides on "a query within a query") to find the total amount of pitstop time, for each result in query 3
6. As a last thing we want to find the driver with the least pitstop time. However, we want to control for the number of races (see explanation below), so we need to include one last column: the average pitstop time per race. Your table should be sorted by that average pitstop time in ascending order.
  1. Explanation: If we just sum the pitstop time, we only get the total time a driver spent in the pit, no matter how many races he participated in. A driver who participated in 100 races would have more pitstop time than a driver who participated in one race!
  2. Hint: Use the ``sum`` function to accumulate the total pitstop time and divide that by the number of races the driver has participated in

Hand in your query as text and your resulting table either as text or as an image.