NOTE: This notebook gets rendered with all cells executed in the `docs` directory.     

## Working with Nodes

In [None]:
from funsql import *

To start constructing queries, we first need to specify the database model: the tables and schemas we write queries against.

In [None]:
concept = SQLTable(S.concept, [S.concept_id, S.vocab_id, S.concept_code])

location = SQLTable(S.location, [S.location_id, S.city, S.state])

person = SQLTable(S.person, [S.person_id, S.year_of_birth, S.month_of_birth, S.day_of_birth, S.birth_datetime, S.location_id])

visit_occurence = SQLTable(S.visit_occurence, [S.visit_occurence_id, S.person_id, S.visit_start_date, S.visit_end_date])

measurement = SQLTable(S.measurement, [S.measurement_id, S.person_id, S.measurement_concept_id, S.measurement_date])

observation = SQLTable(S.observation, [S.observation_id, S.person_id, S.observation_concept_id, S.observation_date])

FunSQL code contains many objects of the form `S.{...}`, which is shorthand for creating `Symbol` objects. A `Symbol` is a wrapper around a string that lets us distinguish between identifiers (table/column/function names)
and literal string values (say, values in the TEXT column _user_name_). So:
* `SELECT(S("user_name"))` corresponds to: SELECT user_name
* `SELECT("user_name")` corresponds to: SELECT 'user_name'



However, most class constructors accept both strings and Symbols if it is clear that an identifier is expected. 
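
The identifier/literal distinction can be sketched in plain Python. This is only an illustration, not FunSQL's implementation: the `Sym` class and the `to_sql` helper are hypothetical names, and quoting identifiers with double quotes is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """Hypothetical stand-in for a Symbol: a string marked as an identifier."""
    name: str


def to_sql(value):
    # Identifiers are emitted in double quotes, plain strings as quoted literals.
    if isinstance(value, Sym):
        return '"' + value.name.replace('"', '""') + '"'
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported value: {value!r}")


print("SELECT", to_sql(Sym("user_name")))  # column reference
print("SELECT", to_sql("user_name"))       # string literal
```

Because the wrapper carries the "this is a name" information, the serializer never has to guess whether a bare string was meant as a column or as a value.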

### Writing a query

FunSQL generates SQL queries by constructing a tree of SQL nodes. The node objects correspond to regular SQL keywords (or something close to them), and are connected together using the `>>` (rshift) operator.

In [None]:
q = From(person) >> Where(Fun(">", Get.year_of_birth, 2000)) >> Select(Get.person_id)
q
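
As a rough mental model of how `>>` can chain nodes, here is a minimal sketch built on Python's `__rshift__` hook. The `Node` class and its `chain` method are hypothetical and much simpler than FunSQL's node classes; the sketch also assumes the right-hand operand is a single node rather than an existing chain.

```python
class Node:
    """Hypothetical query node; `parent` is the node it was piped from."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

    def __rshift__(self, other):
        # `a >> b` returns a copy of b whose input is a
        # (assumes b is a single node without a parent of its own).
        return Node(other.label, parent=self)

    def chain(self):
        # Walk from the last node back to the first, then reverse.
        labels = []
        node = self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return list(reversed(labels))


q = Node("From person") >> Node("Where year_of_birth > 2000") >> Node("Select person_id")
print(q.chain())
```

Since `>>` is left-associative, the last node in the expression ends up as the root of the structure, with each earlier node reachable through `parent`.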

The SQL query can be generated using the `render` function. 

In [None]:
q = From(person) >> Where(Fun(">", Get.year_of_birth, 2000)) >> Select(Get.person_id)
render(q, depth=RenderDepth.SERIALIZE)

Queries with a parameter are rendered with a placeholder in the query string and a list of all the parameter names. 

In [None]:
q = From(location) >> Where(Fun("=", Get.city, Var.CITY_INPUT)) >> Select(Get.state)
render(q, depth=RenderDepth.SERIALIZE)
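
To show what a placeholder-based query looks like when it is executed, here is a small sketch using the standard-library `sqlite3` module. The SQL string is hand-written for illustration and is not FunSQL's rendered output; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (location_id INTEGER, city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?)",
    [(1, "Boston", "MA"), (2, "Austin", "TX")],
)

# The query text carries a named placeholder; the value is supplied separately.
sql = "SELECT state FROM location WHERE city = :CITY_INPUT"
rows = conn.execute(sql, {"CITY_INPUT": "Austin"}).fetchall()
print(rows)
```

Keeping the value out of the query string lets the same SQL be reused with different inputs and avoids quoting problems.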

Ill-formed queries raise an error.

In [None]:
q = From(person) >> Agg.Count() >> Select(Get.person_id)
# render(q)

## Node definitions

### Literals

The `Lit` node is used to create SQL values. 

In [None]:
q = Lit("SQL is fun!")
q

In a `SELECT` clause, literal expressions without a name get the alias `_`.

In [None]:
q = Select("SQL is fun!")
render(q)

Regular Python datatypes are automatically converted to SQL literals when used in the context of a SQL node.

In [None]:
import datetime
q = Select(
    aka(None, "null"), 
    aka(10, S.number), 
    aka("funsql", S.string), 
    aka(datetime.datetime(2020, 1, 1, 0, 0, 0), "time")
)
q

### Attributes

#### Creating references

The `Get` node is used to create table/column references. 

In [None]:
q = Get(S.person_id)
q

Hierarchical references can also be created. 

In [None]:
q = Get.person.person_id # equivalent to: Get.person >> Get.person_id
q

For reference names starting with an underscore, use the function call syntax instead of the `.` accessor. This avoids name conflicts with Python's internal methods.

In [None]:
Get("_person")

`Get` can also be used to create bound references. 

In [None]:
q = From(person)
q = q >> Where(Fun("=", q >> Get(S.year_of_birth), 2000))
q = q >> Get.person_id
q

Contrast this with an unbound reference, where the reference is resolved at render time by looking into the references available at the parent node. 

In [None]:
q = From(location) >> Group(Get.city) >> Select(Get.city)
q

`Get` is used to dereference an alias created using the `As` node.

In [None]:
q = From(person) >> As(S.p) >> Select(Get.p.person_id)
render(q)

This is useful when, say, disambiguating the results of a `Join`.

In [None]:
q = (
    From(person) >> 
    As(S.p) >> 
    Join(From(location) >> As(S.l), 
         on = Fun("=", Get.p.location_id, Get.l.location_id)) >>
    Select(Get.p.person_id, Get.l.state)
)
render(q)

This could also be done using bound references.

In [None]:
s1 = From(person)
s2 = From(location)
q = (
    s1 >> Join(s2, 
               on = Fun("=", s1 >> Get.location_id, s2 >> Get.location_id)) >> 
    Select(s1 >> Get.person_id, s2 >> Get.state)
)
render(q)

#### Incorrect references

An error is raised when `Get` refers to an unknown attribute. 

In [None]:
q = Select(Get.person_id)
# render(q)

In [None]:
q = From(person) >> As(S.p) >> Select(Get.person_id)
# render(q)

An error is also raised when a reference can't be resolved unambiguously. 

In [None]:
q = From(person) >> Join(From(person), on=True) >> Select(Get.person_id)
# render(q)

Unexpected hierarchical references also raise an error.

In [None]:
q = From(person) >> Select(Get.person_id.year_of_birth)
# render(q)

In [None]:
q = From(person) >> As(S.p) >> Select(Get.p)
# render(q)

A node-bound reference that is bound to an unrelated node raises an error.

In [None]:
s1 = From(person)
q = From(location) >> Where(Fun("=", s1 >> Get.year_of_birth, 2000))
# render(q)

A node-bound reference that can't be resolved unambiguously also raises an error.

In [None]:
s1 = From(person)
q = s1 >> Join(aka(s1, "another"), 
               on = Fun("!=", Get.person_id, Get.another.person_id)) >> Select(s1 >> Get.person_id)
# render(q)

#### Define

`Define` can be used to create a new expression, and attach it to a query. 

In [None]:
age = Fun("-", Fun.now(), Get.birth_datetime)
q = From(person) >> Define(aka(age, "age")) >> Select(Get.person_id, Get.age)
render(q)

The column added by `Define` is like a regular table/query column. 

In [None]:
age = Fun("-", Fun.now(), Get.birth_datetime)
person_w_age = From(person) >> Define(aka(age, "age"))
q = person_w_age >> Where(Fun(">=", Get.age, 32)) >>  Select(Get.person_id, Get.age)
render(q)

`Define` can be used to overwrite an existing field. 

In [None]:
q = From(person) >> Define(aka(Get.person_id, "location_id"), aka(Get.location_id, "person_id"))
render(q)

`Define` can be used after a `Select`. 

In [None]:
age = Fun("-", Fun.now(), Get.year_of_birth)
q = From(person) >> Select(Get.person_id, Get.year_of_birth) >> Define(aka(age, "age"))
render(q)

### Variables

`Var` is used to create a query variable. 

In [None]:
Var.location_id

Unbound variables get serialized as query parameters. 

In [None]:
q = From(person) >> Where(Fun("=", Get.location_id, Var.location_id)) >> Select(Get.person_id)
render(q)

Query variables can be bound using the `Bind` operator.

In [None]:
def q(p_id):
    return From(visit_occurence) >> Where(Fun("=", Get.person_id, Var.PERSON)) >> Bind(aka(p_id, S.PERSON))

render(q(210))

`Bind` can also be used to create correlated queries. 

In [None]:
def has_visit(p_id):
    return (
        From(visit_occurence) >> 
        Where(Fun("=", Get.person_id, Var.PERSON)) >> 
        Bind(aka(p_id, S.PERSON))
    )

q = From(person) >> Where(Fun.exists(has_visit(Get.person_id))) >> Select(Get.person_id)
render(q)
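
For readers less familiar with correlated subqueries, the following `sqlite3` sketch shows the kind of SQL this pattern corresponds to. The query is hand-written for illustration (it is not FunSQL's rendered output), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER);
    CREATE TABLE visit_occurence (visit_occurence_id INTEGER, person_id INTEGER);
    INSERT INTO person VALUES (1), (2), (3);
    INSERT INTO visit_occurence VALUES (10, 1), (11, 1), (12, 3);
""")

# The inner query refers to p.person_id from the outer query,
# so it is evaluated once per outer row.
rows = conn.execute("""
    SELECT p.person_id
    FROM person AS p
    WHERE EXISTS (
        SELECT 1
        FROM visit_occurence AS v
        WHERE v.person_id = p.person_id
    )
    ORDER BY p.person_id
""").fetchall()
print(rows)
```

Person 2 has no visits in the sample data and is therefore filtered out.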

`Bind` can also be used to build a lateral `Join`.

In [None]:
def visit_for_person(p_id):
    return (
        From(visit_occurence) >> 
        Where(Fun("=", Get.person_id, Var.PERSON)) >> 
        Bind(aka(p_id, S.PERSON))
    )

q = (
    From(person) >> 
    Join(visit_for_person(Get.person_id) >> As("visit"), on=True, left=True) >>
    Select(Get.person_id, Get.visit.visit_start_date)
)
render(q)

### Functions and Operators

Functions and operators are represented using the `Fun` node.

In [None]:
q = Fun(">", Get.year_of_birth, Lit(1940))
q

Function args can be nested queries. 

In [None]:
p = From(person) >> Where(Fun("<", Get.year_of_birth, 2000)) 
q = Select(Fun.exists(p))
render(q)

All kinds of SQL expressions and operators can be represented using the `Fun` node. 

In [None]:
q = (
    From(person) >>
    Where(Fun("and", Fun("is null", Get.birth_datetime), Fun("is not null", Get.year_of_birth))) >>
    Select(aka(Fun.cast(Fun.extract("YEAR", Get.birth_datetime), "INT"), "year_of_birth"))
)
render(q)

Redundant function expressions are not rendered. 

In [None]:
q = From(person) >> Select(Get.person_id) >> Where(Fun.AND())
render(q)

### Append

The `Append` node represents a SQL `UNION ALL`; that is, it concatenates the output of multiple queries.

In [None]:
q1 = From(measurement) >> Define(aka(Get.measurement_date, "date"))
q2 = From(observation) >> Define(aka(Get.observation_date, "date"))
q = q1 >> Append(q2)
render(q)
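
If `Append` is read as plain concatenation, the closest SQL form is `UNION ALL`, which keeps duplicate rows, whereas a bare `UNION` removes them. The `sqlite3` sketch below contrasts the two on made-up data; the SQL is hand-written and the `event_date` alias is chosen only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement (person_id INTEGER, measurement_date TEXT);
    CREATE TABLE observation (person_id INTEGER, observation_date TEXT);
    INSERT INTO measurement VALUES (1, '2020-01-01'), (2, '2020-02-01');
    INSERT INTO observation VALUES (1, '2020-01-01'), (3, '2020-03-01');
""")

template = """
    SELECT person_id, measurement_date AS event_date FROM measurement
    {op}
    SELECT person_id, observation_date AS event_date FROM observation
    ORDER BY person_id, event_date
"""
concatenated = conn.execute(template.format(op="UNION ALL")).fetchall()
deduplicated = conn.execute(template.format(op="UNION")).fetchall()
print(len(concatenated), len(deduplicated))
```

The row `(1, '2020-01-01')` appears in both tables, so it shows up twice in the concatenated result and once in the deduplicated one.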

Another example

In [None]:
q1 = From(measurement) >> Define(aka(Get.measurement_concept_id, "concept_id")) >> Group(Get.person_id)
q2 = From(observation) >> Define(aka(Get.observation_concept_id, "concept_id")) >> Group(Get.person_id)
q = q1 >> Append(q2) >> Select(Get.person_id, 
                               aka(Agg.count(), "count"), 
                               aka(Agg.count(Get.concept_id, distinct=True), "count_distinct")
                              )
render(q)

`Append` aligns the columns of its subqueries before doing a UNION. 

In [None]:
q1 = From(measurement) >> Select(Get.person_id, aka(Get.measurement_date, "date"))
q2 = From(observation) >> Select(aka(Get.observation_date, "date"), Get.person_id)
q = q1 >> Append(q2)
render(q)

If an explicit `Select` is missing, the output includes only the columns common to the nested queries. 

In [None]:
q = From(measurement) >> Append(From(observation))
render(q)

### Iterate

The `Iterate` node can be used to create a recursive CTE. 

In [None]:
q = (
    Define(aka(1, "n"), aka(1, "prod")) >>
    Iterate(
        From(S.factorial) >>
        Define(aka(Fun("+", Get.n, 1), "n")) >>
        Define(aka(Fun("*", Get.n, Get.prod), "prod")) >>
        Where(Fun("<=", Get.n, 10)) >>
        As(S.factorial)
    )
)
render(q)
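
For comparison, a hand-written recursive CTE computing the same factorial table can be run with the standard-library `sqlite3` module. This is a sketch of the SQL idea only; the text FunSQL actually renders may be structured differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base case: (n=1, prod=1). Recursive step: increment n and multiply,
# stopping once the new n would exceed 10.
rows = conn.execute("""
    WITH RECURSIVE factorial(n, prod) AS (
        SELECT 1, 1
        UNION ALL
        SELECT n + 1, (n + 1) * prod
        FROM factorial
        WHERE n + 1 <= 10
    )
    SELECT n, prod FROM factorial ORDER BY n
""").fetchall()
print(rows[-1])
```

The base query plays the role of the `Define` before `Iterate`, and the recursive branch plays the role of the iterator query that reads from `factorial`.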

The `Iterate` node output preserves only the columns present in both the base query and the iterator query. 

In [None]:
q = (
    Define(aka(0, "k"), aka(0, "m")) >>
    Iterate(
        From(S.self) >>
        As(S.previous) >>
        Where(Fun("<", Get.previous.m, 10)) >>
        Define(aka(0, "n"), aka(Fun("+", Get.previous.m, 1), "m")) >>
        As(S.self)
    )
)
render(q)

### As

`As` node creates an alias for an expression. 

In [None]:
q = From(person) >> Select(Get.person_id >> As("user")) >> Select(Get.user)
render(q)

`As` can also create an alias for a subquery. 

In [None]:
q = From(person) >> As(S.p) >> Select(Get.p.person_id)
render(q)

This blocks the columns in the subquery from the output. To reference them, you need to subscript the alias first. 

In [None]:
# error 
q = From(person) >> As(S.p) >> Select(Get.person_id)
# render(q)

In [None]:
# works
q = From(person) >> As(S.p) >> Select(Get.p.person_id)
render(q)

Node bound references are not blocked. 

In [None]:
s1 = From(person)
q = s1 >> As(S.p) >> Select(s1 >> Get.person_id)
render(q)

### From

`From` can be used to select columns from the table specified. 

In [None]:
q = From(person)
q

By default, all the columns are selected. 

In [None]:
q = From(person)
render(q)

If the table has a schema specified, the qualifier gets added in the rendered query. 

In [None]:
tab = SQLTable("madeup_table", ["colA", "colB"], schema="madeup_schema")
q = From(tab)
render(q)

Queries with a `VALUES` query can be generated. 

In [None]:
tab = ValuesTable(("name", "year"), [("SQL", 1974), ("Python", 1990), ("FunSQL", 2022)])
q = From(tab)
render(q)

Only columns used in the query are serialized for a `VALUES` clause. 

In [None]:
tab = ValuesTable(("name", "year"), [("SQL", 1974), ("Python", 1990), ("FunSQL", 2022)])
q = From(tab) >> Select(Get.name)
render(q)

If no columns are selected, the values are replaced with nulls. 

In [None]:
tab = ValuesTable(("name", "year"), [("SQL", 1974), ("Python", 1990), ("FunSQL", 2022)])
q = From(tab) >> Group() >> Select(Agg.Count())
render(q)

The `VALUES` clause requires at least one row of data. 

In [None]:
tab = ValuesTable(("name", "year"), [])
q = From(tab)
render(q)

A null source generates a dataset with one row. 

In [None]:
q = From(None)
render(q)

### With

SQL has a `WITH` clause to create temporary tables for reuse in a query. They can be created using the `With` node. 

In [None]:
q = From(S.thirty) >> With(From(person) >> Where(Fun("=", Get.year_of_birth, 1990)) >> As("thirty")) >> Select(Get.person_id)
render(q)
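
As a point of reference, this is roughly what a `WITH` clause looks like in hand-written SQL, executed here through `sqlite3` on made-up rows. It is an illustrative sketch rather than the output of `render`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER);
    INSERT INTO person VALUES (1, 1990), (2, 1985), (3, 1990);
""")

# In SQL the temporary table is declared first and used afterwards.
rows = conn.execute("""
    WITH thirty AS (
        SELECT person_id FROM person WHERE year_of_birth = 1990
    )
    SELECT person_id FROM thirty ORDER BY person_id
""").fetchall()
print(rows)
```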

NOTE: Coming from SQL, the order of `From` and `With` nodes might seem odd since the `With` definition usually comes first. The reason is that FunSQL builds a query starting from the last node, and propagates up. Since the `From` node refers to a temporary table declared using `With`, to resolve it correctly, we must have encountered the `With` node first. 

`With` nodes can also declare multiple subqueries. 

In [None]:
q = (
    Select(
        From(S.thirty) >> Group() >> Select(Agg.Count()) >> As("count_30s"),
        From(S.forty) >> Group() >> Select(Agg.Count()) >> As("count_40s"),
    ) >>
    With(
        From(person) >> Where(Fun("=", Get.year_of_birth, 1990)) >> As("thirty"),
        From(person) >> Where(Fun("=", Get.year_of_birth, 1980)) >> As("forty")        
    )
)
render(q)

Tables defined using a `With` node must have explicit, unique labels. 

In [None]:
q = From(S.person) >> With(From(person))
# render(q)

### Group

The `Group` node is used to partition rows by the given keys and summarize over them.

In [None]:
q = From(person) >> Group(Get.year_of_birth) >> Select(Get.year_of_birth, Agg.count())
render(q)

By splitting the grouping logic from the aggregate expressions, queries get easier to construct.

In [None]:
visit_group = From(visit_occurence) >> Group(Get.person_id) >> As("visit_group")
num_visits = lambda: Agg.count(over=Get.visit_group) # regular assignment instead of a function works too
q = (
    From(person) >> 
    Join(visit_group, on = Fun("=", Get.person_id, Get.visit_group.person_id)) >>
    Where(Fun(">", num_visits(), 2)) >>
    Select(Get.person_id, num_visits())
)

render(q)

Grouping can be done in succession. 

In [None]:
# counting measurements for each concept, then counting frequency for each count
q = (
    From(measurement) >> 
    Group(Get.measurement_concept_id) >> 
    Group(aka(Agg.count(), "count_for_measure")) >>
    Select(Get.count_for_measure, aka(Agg.count(), "size"))
)
render(q)

`Group` can work with an empty list of keys. 

In [None]:
q = From(person) >> Group() >> Select(Agg.count(), Agg.max(Get.year_of_birth), Agg.min(Get.year_of_birth))
render(q)

Each aggregate expression gets a unique alias. 

In [None]:
visit_group = From(visit_occurence) >> Group(Get.person_id) >> As("visit_group")
person_visits = From(person) >> Join(visit_group, on = Fun("=", Get.person_id, Get.visit_group.person_id))

max_start_date = aka(Get.visit_group >> Agg.max(Get.visit_start_date), "max_start_date")
max_end_date = aka(Get.visit_group >> Agg.max(Get.visit_end_date), "max_end_date")
q = person_visits >> Select(Get.person_id, max_start_date, max_end_date)

render(q)

Aggregate expressions can be applied to only the distinct values in a partition. 

In [None]:
q = From(person) >> Group() >> Select(Agg.count(Get.year_of_birth, distinct=True))
render(q)

Aggregates can be applied to filtered portion of a partition. 

In [None]:
measure = Agg.count(filter_ = Fun("<", Get.year_of_birth, 2000))
q = From(person) >> Group() >> Select(measure)
render(q)

Aggregate expressions can't be used without a `Group` node. 

In [None]:
q = From(person) >> Select(Agg.max(Get.year_of_birth))
# render(q)

Aggregate expressions need to unambiguously determine the corresponding `Group` node.

In [None]:
q1 = From(person)
q2 = From(measurement) >> Group(Get.person_id)
q3 = From(visit_occurence) >> Group(Get.person_id)

q = (
    q1 >> 
    Join(q2, on = Fun("=", Get.person_id, q2 >> Get.person_id)) >> 
    Join(q3, on = Fun("=", Get.person_id, q3 >> Get.person_id)) >>
    Select(q1 >> Get.person_id, Agg.count())
)
# render(q)

### Partition

The `Partition` node creates a subquery that partitions rows by the specified keys. For each row, an aggregate can be calculated across all the rows in its partition (these are called window functions in SQL).

In [None]:
q = (
    From(person) >> 
    Partition(Get.year_of_birth, order_by=[Get.month_of_birth]) >>
    Select(Get.person_id, Agg.row_number())
)
render(q)

A Partition node may specify a window frame. 

In [None]:
births_by_year = From(person) >> Group(Get.year_of_birth) >> Select(Get.year_of_birth, Agg.count())
cumulative_births_by_year = (
    births_by_year >> 
    Partition(order_by=[Get.year_of_birth], 
                frame=Frame(FrameMode.ROWS, FrameEdge(FrameEdgeSide.PRECEDING, None), FrameEdge(FrameEdgeSide.CURRENT_ROW))) >> 
    Select(Get.year_of_birth, Agg.sum(Get.count))
)

render(cumulative_births_by_year)
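
The window frame used above (from the first row up to the current row) produces a running total. The sketch below shows the same idea in hand-written SQL through `sqlite3`; it assumes the bundled SQLite is version 3.25 or newer (window function support), and the `births` table with its `n_births` column is a made-up stand-in for the grouped subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE births (year_of_birth INTEGER, n_births INTEGER);
    INSERT INTO births VALUES (1990, 2), (1991, 3), (1992, 5);
""")

# Each row sums n_births over all rows from the start of the ordering
# up to and including itself.
rows = conn.execute("""
    SELECT year_of_birth,
           SUM(n_births) OVER (
               ORDER BY year_of_birth
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative
    FROM births
    ORDER BY year_of_birth
""").fetchall()
print(rows)
```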

Spelling out a `Frame` node is a bit of a mouthful, so the common constructs are available through an alias class, `F`.

In this example, Partition nodes are used one after the other to simplify a nested SQL query. We want to get the set of non overlapping visits made by a person. 

In [None]:
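# For each visit by a person, look at all of that person's earlier visits
# and take the latest end date among them (the "boundary").
# A visit that starts after the boundary opens a new group (new = 1);
# the running sum of `new` then numbers the groups,
# and each group is collapsed into one non-overlapping interval.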

q = (
    From(visit_occurence) >> 
    Partition(Get.person_id, 
              order_by=[Get.visit_start_date], 
              frame = Frame(F.ROWS, F.pre(None), F.pre(1))) >> 
    Define(aka(Agg.max(Get.visit_end_date), "boundary")) >>
    Define(aka(Fun("-", Get.visit_start_date, Get.boundary), "gap")) >>
    Define(aka(Fun.case(Fun("<=", Get.gap, 0), 0, 1), "new")) >>
    Partition(Get.person_id, 
              order_by=[Get.visit_start_date, Fun("-", Get.new)],
              frame=Frame(F.ROWS, F.pre(None), F.curr_row())) >>
    Define(aka(Agg.sum(Get.new), "group")) >>
    Group(Get.person_id, Get.group) >>
    Define(aka(Agg.min(Get.visit_start_date), "start_date"), 
           aka(Agg.max(Get.visit_end_date), "end_date")) >>
    Select(Get.person_id, Get.start_date, Get.end_date)
)

render(q)
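
The logic of this query may be easier to follow as plain Python. The sketch below merges overlapping visits for a single person; `merge_visits` is a hypothetical helper written for this explanation, not part of FunSQL, and the dates are made up.

```python
from datetime import date


def merge_visits(visits):
    """Merge overlapping (start, end) visits for one person."""
    merged = []
    for start, end in sorted(visits):
        if merged and start <= merged[-1][1]:
            # Starts on or before the latest end seen so far: same group.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Starts after every earlier visit ended: open a new group.
            merged.append([start, end])
    return [tuple(span) for span in merged]


visits = [
    (date(2020, 1, 1), date(2020, 1, 5)),
    (date(2020, 1, 3), date(2020, 1, 4)),
    (date(2020, 1, 5), date(2020, 1, 9)),
    (date(2020, 2, 1), date(2020, 2, 2)),
]
print(merge_visits(visits))
```

Here `merged[-1][1]` plays the role of `boundary`, the `else` branch corresponds to `new = 1`, and each entry of `merged` corresponds to one `(person_id, group)` row after the final `Group`.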

### Join

A Join query combines the output of two subqueries. 

In [None]:
q = (
    From(person) >> 
    Join(aka(From(location), "location"), 
         on=Fun("=", Get.location_id, Get.location.location_id))
)
render(q)

Different variants of the SQL `JOIN` operation can be constructed using the relevant keyword arguments.

In [None]:
# right join
q = (
    From(person) >> 
    Join(aka(From(location), "location"), 
         on=Fun("=", Get.location_id, Get.location.location_id), 
         right=True)
)
render(q)

Joins with correlated subqueries are supported too. 

In [None]:
# gets the _second_ visit made by a person
def second_visit(p_id):
    return (
        From(visit_occurence) >>
        Where(Fun("=", Get.person_id, Var.PERSON_ID)) >>
        Partition(order_by=[Get.visit_start_date]) >>
        Where(Fun("=", Agg.row_number(), 2)) >>
        Bind(aka(p_id, "PERSON_ID"))
    )

# gets all people and if they made any second visits
q = (
    From(person) >>
    Join(aka(second_visit(Get.person_id), "visit"), on=True, left=True) >>
    Select(Get.person_id, Get.visit.visit_occurence_id, Get.visit.visit_start_date)
)
render(q)

### Order

The `Order` operator creates a subquery to sort the output. 

In [None]:
q = From(person) >> Order(Get.year_of_birth)
render(q)

The number of rows in the result set can also be limited.

In [None]:
q = (
    From(person) >> 
    Order(Get.year_of_birth) >> 
    Limit(10) >> 
    Order(Get.person_id) >> 
    Select(Get.person_id, Get.location_id)
)
render(q)

The direction of the sort column can be specified too. 

In [None]:
q = (
    From(person) >> 
    Order(Get.year_of_birth >> Desc(NullsOrder.FIRST), Get.person_id >> Asc())
)

render(q)

### Limit

The `Limit` node selects a fixed number of rows from a subquery, typically in conjunction with an `Order` node. 

In [None]:
q = From(person) >> Order(Get.person_id) >> Limit(20)
render(q)

`Limit` also lets you specify an offset. 

In [None]:
q = From(person) >> Order(Get.person_id) >> Limit(100, offset=10)
render(q)

You could also specify just the offset. 

In [None]:
q = From(person) >> Order(Get.person_id) >> Limit(offset=10)
render(q)
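
To see limit and offset in action, here is a small `sqlite3` sketch on made-up data. The SQL is hand-written; note that SQLite itself has no offset-only syntax, so the sketch uses `LIMIT -1` (no limit) for that case, which is a SQLite convention and not necessarily what FunSQL emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER)")
conn.executemany("INSERT INTO person VALUES (?)", [(i,) for i in range(1, 11)])

# Skip the first two rows, then take three.
page = conn.execute(
    "SELECT person_id FROM person ORDER BY person_id LIMIT 3 OFFSET 2"
).fetchall()

# Skip the first eight rows and take everything that remains.
tail = conn.execute(
    "SELECT person_id FROM person ORDER BY person_id LIMIT -1 OFFSET 8"
).fetchall()
print(page, tail)
```

Without the `ORDER BY`, the database is free to return rows in any order, which is why `Limit` is typically paired with `Order`.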

### Select

The `Select` node specifies the output columns for a subquery.

In [None]:
q = From(person) >> Select(Get.person_id, Get.year_of_birth)
render(q)

`Select` always creates a complete subquery. So, it creates nesting if it isn't the last node. 

In [None]:
q = (
    From(person) >> 
    Select(Get.year_of_birth) >>
    Where(Fun(">", Get.year_of_birth, 2000))
)

render(q)

All columns passed to a Select node must have unique aliases.

In [None]:
# doesn't work
# q = From(person) >> Select(Get.person_id, Get.person_id)

# works
q = From(person) >> Select(Get.person_id, aka(Get.person_id, "duplicate_id"))

render(q)

### Where

`Where` can be used to filter the query output by a condition. 

In [None]:
q = From(person) >> Where(Fun("=", Get.year_of_birth, 2000))
render(q)

Multiple `Where` nodes in sequence are collapsed into a single clause. 

In [None]:
q = (
    From(person) >> 
    Where(Fun(">", Get.year_of_birth, 1980)) >>
    Where(Fun("<", Get.year_of_birth, 2000)) >>
    Where(Fun("!=", Get.year_of_birth, 1990))
)
render(q)

A `Where` node following a `Group` subquery is translated into a `HAVING` clause.

In [None]:
q = (
    From(location) >> 
    Group(Get.state) >>
    Where(Fun(">", Agg.count(Get.city, distinct=True), 5)) >>
    Where(Fun("<", Agg.count(Get.city, distinct=True), 10))
)

render(q)
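
For reference, a hand-written `HAVING` query with two combined conditions can be tried out with `sqlite3`. This is an illustrative sketch on made-up rows, with a lower threshold than the example above so that the tiny dataset returns something; it is not the output of `render`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (location_id INTEGER, city TEXT, state TEXT);
    INSERT INTO location VALUES
        (1, 'Boston', 'MA'),
        (2, 'Salem', 'MA'),
        (3, 'Boston', 'MA'),
        (4, 'Austin', 'TX');
""")

# HAVING filters whole groups after aggregation, unlike WHERE,
# which filters individual rows before grouping.
rows = conn.execute("""
    SELECT state, COUNT(DISTINCT city) AS n_cities
    FROM location
    GROUP BY state
    HAVING COUNT(DISTINCT city) > 1 AND COUNT(DISTINCT city) < 10
""").fetchall()
print(rows)
```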