In [ ]:
#;.pykx.disableJupyter()

In [ ]:
# https://code.kx.com/pykx/3.0/examples/jupyter-integration.html#q-first-mode
import pykx as kx
kx.util.jupyter_qfirst_enable()

In [None]:
//run before moving on to rest of notebook
\l buildtaq.q
\l ./db/taq

<img src="../qbies.png" width="50px" style="width: 100px;padding-right:5px;padding-top:1px;padding-left:5px;" align="left"/>

# Practical Guidance

# SQL vs qSQL

**SQL**
```
select [b,] [a] from t [where c] [group by b order by b]
update t set [a] [where c] 
```

**qSQL**
```
q)select [a] [by b] from t [where c]
q)update [a] [by b] from t [where c]
```

qSQL relational queries are generally about half the size of the corresponding SQL queries. We can take advantage of `fby` and the full range of the kdb+/q programming language to do things that are difficult in SQL.

##### Comparing constraints, aggregations
In SQL the `where` and `group` clauses are atomic, and the `select` and `update` clauses are atomic or aggregate if grouping. In `q` the `where` and `by` clauses are uniform, and the `select` and `update` clauses are uniform or aggregate if grouping (by). All clauses execute on the columns and `q` can therefore take advantage of order. SQL can't tell the difference.

SQL repeats the `group by` expressions in the `select` clause, and its `where` clause is a single boolean expression. The q `where` clause is a cascading list of constraints, which neatly avoids some complex SQL correlated sub-queries and also gets rid of some parentheses.
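As a rough illustration of the point about correlated sub-queries, the sketch below uses `fby` to keep, for each sym, the row with the highest price. The table `t` and its values are made up for this example and are not part of the course database.

```q
// hypothetical sample table, for illustration only
t:([]sym:`a`b`a`b`a;price:1 5 3 2 4)

// (aggregate;column) fby group computes max price per sym,
// aligned with the original rows, so it can be used as a filter
r:select from t where price=(max;price) fby sym
show r
```

In SQL the same filter would typically need a sub-query that recomputes the per-group maximum and joins it back to the outer query.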

# Gotchas - Vertical filters
There are cases when changing the order of the constraints can affect the results returned. 

Let's take the following example:

In [None]:
show smalltrade2:([]sym:10#`JPM`GE`IBM;size:10#30 40);

Normally, the order of constraints does not affect the result. For example, `where size=10,price=10` and `where price=10,size=10` both return the records where price and size are 10.

The order does matter when we use vertical functions, i.e. functions that refer to other indices in the column. In the example below, switching the order changes the result:

In [None]:
select from smalltrade2 where size=first size                  // first size = 30
select from smalltrade2 where size=first size,sym=`GE         // and filter these by sym=`GE

In [None]:
select from smalltrade2 where sym=`GE                 
select from smalltrade2 where sym=`GE,size=first size  // now first size refers to the first size of the GE records

This phenomenon can occur with many other functions that refer to order indices in the column or to the column as a whole e.g. `first`, `last`, `avg`, `med`, `next`, `prev` etc.
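A quick way to confirm that the two orderings really differ is to compare the results directly. This is a minimal sketch that rebuilds `smalltrade2` as defined earlier; the names `a` and `b` are introduced here for illustration.

```q
smalltrade2:([]sym:10#`JPM`GE`IBM;size:10#30 40)

// first size is taken over the whole table (30), then filtered to GE
a:select from smalltrade2 where size=first size,sym=`GE

// first size is taken over the GE rows only (40)
b:select from smalltrade2 where sym=`GE,size=first size

count a    // 1
count b    // 2
a~b        // 0b - the two orderings return different tables
```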

# Using `exec` to return distinct tables

A lesser-known behaviour of `exec` is that if we set the `by` clause to a boolean and return more than one column, the result is a table rather than a dictionary.

In [None]:
//see the table first before the output - multiple rows
show egTrade: select from trade where date = last date, sym like "A*"  

In [None]:
exec sym, date  from egTrade where sym like "A*"          //no by clause 

In [None]:
exec sym, date  by 0b from egTrade where sym like "A*"    //by 0b will return the output as a table

In [None]:
exec sym, date  by 1b from egTrade where sym like "A*"    //by 1b will return the distinct items in the table!

# Optimizing qSQL queries
<a id='optim'></a>

There are a few things that we need to consider when optimizing a query. These mostly revolve around being aware of the table structure, and structuring the qSQL constraints in a restrictive fashion.

## Compound Filtering
Probably the most important thing to bear in mind when thinking about the order of the constraints is that qSQL statements (`select` / `exec` / `update` / `delete`) work by whittling down the result set with each constraint i.e. the output of constraint N is the input to constraint N+1. This is called compound filtering.

Let's create a smaller table so we can see exactly what is going on:

In [None]:
smalltrade:([]sym:5#`JPM`GE`IBM;size:5#100 200);
smalltrade

In [None]:
select from smalltrade where sym=`JPM             // reduces result set to 2 rows

In [None]:
select from smalltrade where sym=`JPM,size=200    // the second constraint here only operates on 2 rows

In the above example, the comparison on sym yields two rows that are in turn the input to the size constraint.
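The cascade can be made explicit by applying the constraints one query at a time. The sketch below rebuilds `smalltrade` as defined above; `step1`, `step2` and `combined` are names introduced here for illustration.

```q
smalltrade:([]sym:5#`JPM`GE`IBM;size:5#100 200)

// apply the constraints one at a time
step1:select from smalltrade where sym=`JPM    // 2 rows survive
step2:select from step1 where size=200         // size is checked on those 2 rows only

// the comma-separated form gives the same result in a single query
combined:select from smalltrade where sym=`JPM,size=200
step2~combined    // 1b
```

By contrast, writing the filter as a single expression such as ``(sym=`JPM)&size=200`` evaluates both comparisons over every row before combining them.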

<img src="../qbies.png" width="50px" style="width: 50px;padding-right:5px;padding-top:2px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i> Keep in mind the order of evaluation when constructing queries - remove as much data as possible at each step! </i></p>

### Guidance when Ordering Constraints

Now, let's suppose we wish to extract the records from the trade table where the date is the 2nd of January, ``sym=`AAPL``, `price>85` and `size=60`. The question is: how should we order our constraints?

The rules for ordering constraints are as follows:

    a) If the table is partitioned on disk, filter on the partition(s) required 
    b) Move expensive (slow) operations to the right (end) of the query 
    c) Leverage attributes if possible 
    d) Reduce the size of the result-set as quickly as possible
    
The vast majority of the time in the world of kdb+, tables are partitioned by date. Partitioning is not covered in this course, but in essence each partition is a directory on disk in which that date's slice of each table is stored.
In our example, trade is partitioned by date, so we know that the query will start like this:

In [None]:
.Q.pf // this built-in variable holds the partition field of the loaded database
select from trade where date=2020.01.02

**Why is it important to put this first?**

If we do not put this constraint first, the query will need to "look" in each partition folder instead of being able to restrict itself to the partition(s) we've specified.

We don't have any particularly slow operations in the query so point b is not applicable in this case. 

Next, we have a parted attribute on the sym column so this should follow as the next constraint:

In [None]:
meta trade

In [None]:
select from trade where date=2020.01.02,sym=`AAPL 

For the next constraint, we want to reduce the size of the result set as quickly as possible. We are performing an equality check on `size` and a `>` check on `price`. The equality check should be somewhat faster and, better yet, it is likely to restrict our data set more than the range check.


In [None]:
count select from trade where date=2020.01.02,sym=`AAPL,size=60

And finally: 

In [None]:
select from trade where date=2020.01.02,sym=`AAPL,size=60,price>85

We can play around with the query to get a feel for the impact on speed of each change:

In [None]:
\t:100 select from trade where date=2020.01.02,sym=`AAPL,size=60,price>85 // optimum query

In [None]:
\t:100 select from trade where date=2020.01.02,sym=`AAPL,price>85,size=60

In [None]:
\t:100 select from trade where date=2020.01.02,size=60,price>85,sym=`AAPL

In [None]:
\t:100 select from trade where date=2020.01.02,price>85,sym=`AAPL,size=60

<img src="../qbies.png" width="50px" style="width: 50px;padding-right:5px;padding-top:10px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i> The performance of the above queries has not changed significantly because the trade table is very small. If we were applying these queries to a bigger dataset, we would definitely see the difference!</i></p>

It is worth noting in the above example that changing the order of constraints does not alter the actual result; it merely affects the speed at which it is returned.
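To get a feel for the effect on a larger table without touching the course database, one option is to time the same kind of query on a synthetic in-memory table. Everything below is an assumption made for illustration: the table `big`, its row count and its random contents are invented here, and the timings will vary by machine.

```q
n:1000000
// hypothetical in-memory table; sym is sorted and given the parted attribute
big:([]sym:`p#asc n?`AAPL`GE`IBM;size:n?100;price:n?100f)

// same constraints, different order
r1:select from big where sym=`AAPL,size=60,price>85
r2:select from big where price>85,size=60,sym=`AAPL
r1~r2    // 1b - the order does not change the result here

// compare timings of the two orderings
\t:10 select from big where sym=`AAPL,size=60,price>85
\t:10 select from big where price>85,size=60,sym=`AAPL
```

None of these constraints is a vertical function, which is why the two results match regardless of order.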

Now, let's suppose we didn't want the records where size = 60 -  instead we wanted the records where the size is an even number. 

Our hitherto optimum query order would run slower and it would make sense to shift the expensive (slow) operation to the end of the query:

In [None]:
\t:1000 select from trade where date=2020.01.02,sym=`AAPL,0= size mod 2,price>85

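The query above runs more slowly because the `mod` operation is performed on a larger result set. Moving it to the end, after the `price` filter, means it is applied to fewer rows: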

In [None]:
\t:1000 select from trade where date=2020.01.02,sym=`AAPL,price>85,0= size mod 2

The same principle would hold if the attribute constraint were the expensive operation: it should be moved to the right of the query.

## Further Reading 

The below materials provide additional information on q-SQL and query optimization.

The following whitepapers are good resources focused on optimization. 
* [Kdb+ and q documentation Columnar database and query optimization](https://code.kx.com/q/wp/columnar-database/)
* [Kdb+ and q documentation Kdb+ query scaling](https://code.kx.com/q/wp/query-scaling/)

The relevant extract from Q for Mortals: 
* [Queries: q-sql](https://code.kx.com/q4m3/9_Queries_q-sql/)

# Advanced topic -  Pivot tables (using `exec`)

In kdb+, pivot tables are used to reorganize or summarize the data stored in the database. They can be a useful tool for calculating group averages and for looking at values for specific syms. They allow us to turn the distinct values of one column into columns of their own for clearer visibility.

Given the following source table:

In [None]:
show t:([]k:1 2 3 2 3;p:`xx`yy`zz`xx`yy;v:10 20 30 40 50)

Suppose we want to obtain the following pivot table:

In [None]:
show pvt:([k:1 2 3]xx:10 40 0N;yy:0N 20 50;zz:0N 0N 30)

So looking at the above, what we really want is a table broken down by `k`, where we associate the categorization of `p` with their corresponding values `v`. 

Let's see if we can write that as a qSQL statement: 

In [None]:
exec p!v       //we are associating the p categories with our values v 
    by k       //broken down by k -> by k
    from t 

Hmm, that's starting to look a bit better, but something is odd. We can see that for each row we have a dictionary association between the `k` value and the corresponding dictionary of `p!v`. 

If we recall from the tables section, tables are really just lists of dictionaries where the keys are symbols and the keys are consistent between each dictionary. So let's try to make the dictionaries consistent!
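As a small reminder of that equivalence, the sketch below builds a list of dictionaries by hand. The names `d1`, `d2` and `d3` are made up for this illustration.

```q
d1:`a`b!1 2
d2:`a`b!3 4

// a list of dictionaries with the same symbol keys is a table
type (d1;d2)                // 98h
(d1;d2)~([]a:1 3;b:2 4)     // 1b

// if the keys differ, the list stays a general list of dictionaries
d3:`a`c!5 6
type (d1;d3)                // 0h
```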

We first need to determine what we would like our new columns to be - for us, they're the values associated with the column `p`. So we can pull out the distinct pivot values (these will later become our column names):

In [None]:
show P:asc exec distinct p from t

Great! Now let's use these keys to ensure each of our dictionaries is consistent:

In [None]:
//refresher on dictionaries!
d: `a`b! 1 2 
`a`c`d`b#d

In [None]:
exec P#p!v by k from t   //by taking the keys we want from each of our dictionaries we ensure consistency
type exec P#p!v by k from t

Almost there! The first column doesn't appear to have a column name. That's because it isn't actually a column yet: it's just a list of our `k` values (since we are using `exec`, not `select`). To turn the result into a proper keyed table, we need to assign a column name to our `k` values:

In [None]:
show pvt:exec P#(p!v) by k:k from t

We can't do this with `select` because `select` always returns a table, so it automatically creates the column name from the input provided:

In [None]:
select P#p!v by k from t

By using `exec` we instead return a list, which in this case is a list of dictionaries. When those dictionaries share a common set of symbol keys, they form a table!

Can we think of a case where this wouldn't work? 

In [None]:
// example: p is a list of strings instead of symbols
show P:asc exec string distinct p from t
show pvt:exec P#(string[p]!v) by k:k from t

This also won't work if we try to pivot the other way around, i.e. `v` to `p`, since `v` is not a symbol column, or for any other column whose type is not symbol. We also can't do this for multiple columns.