NOTE: This notebook gets rendered with all cells executed in the `docs` directory.     

## Working with Nodes

In [1]:
from funsql import *

To start constructing queries, we first need to specify the database model: the tables and schemas we write queries against. 

In [2]:
concept = SQLTable(S.concept, [S.concept_id, S.vocab_id, S.concept_code])

location = SQLTable(S.location, [S.location_id, S.city, S.state])

person = SQLTable(
    S.person,
    [
        S.person_id,
        S.year_of_birth,
        S.month_of_birth,
        S.day_of_birth,
        S.birth_datetime,
        S.location_id,
    ],
)

visit_occurence = SQLTable(
    S.visit_occurence,
    [S.visit_occurence_id, S.person_id, S.visit_start_date, S.visit_end_date],
)

measurement = SQLTable(
    S.measurement,
    [S.measurement_id, S.person_id, S.measurement_concept_id, S.measurement_date],
)

observation = SQLTable(
    S.observation,
    [S.observation_id, S.person_id, S.observation_concept_id, S.observation_date],
)

FunSQL code contains many objects of the form `S.{...}`, which is shorthand for creating `Symbol` objects. A `Symbol` is a wrapper around a string that lets us distinguish identifiers (table, column, and function names)
from literal string values (say, values in the TEXT column _user_name_). So, 
* `SELECT(S("user_name"))` corresponds to: SELECT user_name
* `SELECT("user_name")` corresponds to: SELECT 'user_name'



However, most class constructors accept both strings and Symbols if it is clear that an identifier is expected. 
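
The distinction between identifiers and literals can be sketched in plain Python. This is only an illustration, not FunSQL's implementation: the `Ident` class and the `to_sql` helper are hypothetical names introduced here, and the quoting rules are the usual SQL ones (double quotes for identifiers, single quotes for strings).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ident:
    """Hypothetical stand-in for a symbol: a string known to be an identifier."""

    name: str


def to_sql(value):
    """Render an identifier with double quotes and a plain string as a literal."""
    if isinstance(value, Ident):
        return '"' + value.name.replace('"', '""') + '"'
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported value: {value!r}")


column_sql = to_sql(Ident("user_name"))  # treated as a column name
literal_sql = to_sql("user_name")  # treated as a string value
```

Because the wrapper carries the intent, the same text can be rendered either as a column reference or as a quoted value without guessing.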

### Writing a query

FunSQL generates SQL queries by constructing a tree of SQL nodes. The node objects correspond closely to SQL keywords and are connected together using the `>>` (rshift) operator. 

In [3]:
q = From(person) >> Where(Fun(">", Get.year_of_birth, 2000)) >> Select(Get.person_id)
q

let person = SQLTable(person, ...),
    q1 = From(person),
    q2 = q1 >> Where(Fun.">"(Get.year_of_birth, Lit(2000))),
    q3 = q2 >> Select(Get.person_id),
    q3
end
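
For readers unfamiliar with operator overloading, here is a rough sketch of how `>>` chaining can be built with `__rshift__`. The `Node` class is hypothetical and much simpler than FunSQL's real nodes; it only records which node each step was chained onto.

```python
class Node:
    """Hypothetical query node that remembers the node it was chained onto."""

    def __init__(self, label, over=None):
        self.label = label
        self.over = over  # the upstream node, or None for the first node

    def __rshift__(self, other):
        # `a >> b` returns a new node labelled like `b` whose input is `a`.
        return Node(other.label, over=self)

    def chain(self):
        # Walk back through the upstream nodes and return labels, first node first.
        labels = []
        node = self
        while node is not None:
            labels.append(node.label)
            node = node.over
        return list(reversed(labels))


q = Node("From(person)") >> Node("Where(...)") >> Node("Select(...)")
labels = q.chain()
```

Python evaluates `a >> b >> c` from left to right, so each step wraps the result of the previous one, which mirrors the `q1`, `q2`, `q3` breakdown in the output above.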

The SQL query can be generated using the `render` function. 

In [4]:
q = From(person) >> Where(Fun(">", Get.year_of_birth, 2000)) >> Select(Get.person_id)
render(q, depth=RenderDepth.SERIALIZE)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" > 2000)

Queries with a parameter are rendered with a placeholder in the query string and a list of all the parameter names. 

In [5]:
q = From(location) >> Where(Fun("=", Get.city, Var.CITY_INPUT)) >> Select(Get.state)
render(q, depth=RenderDepth.SERIALIZE)

query: 
SELECT "location_1"."state"
FROM "location" AS "location_1"
WHERE ("location_1"."city" = $1)

vars: [CITY_INPUT]
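
A query string with placeholders is meant to be handed to a database driver together with the parameter values. The sketch below uses the standard library's `sqlite3` as a stand-in database, so the placeholder is written `?` rather than `$1`; the table contents and the `params` dictionary are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (location_id INTEGER, city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?)",
    [(1, "Chicago", "IL"), (2, "Boston", "MA")],
)

# The SQL text and the parameter values travel separately; the driver binds them.
sql = (
    'SELECT "location_1"."state" FROM "location" AS "location_1" '
    'WHERE ("location_1"."city" = ?)'
)
var_names = ["CITY_INPUT"]  # placeholder order, like the `vars` list above
params = {"CITY_INPUT": "Chicago"}
rows = conn.execute(sql, [params[name] for name in var_names]).fetchall()
conn.close()
```

Keeping the list of variable names next to the query text makes it straightforward to supply the values in the order the placeholders expect.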

Ill-formed queries raise an error when rendered. 

In [6]:
q = From(person) >> Agg.Count() >> Select(Get.person_id)
# render(q)

## Node definitions

### Literals

The `Lit` node is used to create SQL values. 

In [7]:
q = Lit("SQL is fun!")
q

Lit("SQL is fun!")

In a `SELECT` clause, literal expressions without a name get the alias `_`.

In [8]:
q = Select("SQL is fun!")
render(q)

query: 
SELECT 'SQL is fun!' AS "_"

Regular Python datatypes are automatically converted to SQL literals when used in the context of a SQL node. 

In [9]:
import datetime

q = Select(
    aka(None, "null"),
    aka(10, S.number),
    aka("funsql", S.string),
    aka(datetime.datetime(2020, 1, 1, 0, 0, 0), "time"),
)
q

Select(Lit(NULL) >> As(null),
       Lit(10) >> As(number),
       Lit("funsql") >> As(string),
       Lit(TIMESTAMP '2020-01-01T00:00:00') >> As(time))
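
To make the conversion concrete, here is a small sketch of how Python values might be formatted as SQL literal text. The `to_sql_literal` helper is hypothetical and covers only a few types; it is not the conversion FunSQL performs internally.

```python
import datetime


def to_sql_literal(value):
    """Hypothetical helper: format a Python value as SQL literal text."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # checked before int, since bool subclasses int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # Single quotes inside the string are doubled, as SQL requires.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, datetime.datetime):
        return "TIMESTAMP '" + value.isoformat() + "'"
    raise TypeError(f"no SQL literal for {type(value).__name__}")


literals = [
    to_sql_literal(v)
    for v in (None, 10, "funsql", datetime.datetime(2020, 1, 1, 0, 0, 0))
]
```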

### Attributes

#### Creating references

The `Get` node is used to create table/column references. 

In [10]:
q = Get(S.person_id)
q

Get.person_id

Hierarchical references can also be created. 

In [11]:
q = Get.person.person_id  # equivalent to: Get.person >> Get.person_id
q

Get.person.person_id

For reference names starting with an underscore, use the function call syntax instead of the `.` accessor. This avoids name conflicts with Python's internal methods.

In [12]:
Get("_person")

Get._person
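
The reason for this restriction is easier to see with a sketch of an attribute-based builder. The `Ref` class below is hypothetical and only loosely modelled on `Get`; it shows how `__getattr__` can build references while refusing underscore-prefixed names that Python itself may look up.

```python
class Ref:
    """Hypothetical reference builder, loosely modelled on attribute access."""

    def __init__(self, path=()):
        self.path = tuple(path)

    def __call__(self, name):
        # Call syntax accepts any name, including ones starting with "_".
        return Ref(self.path + (name,))

    def __getattr__(self, name):
        # Only reached for attributes Python cannot find normally.
        # Underscore-prefixed names are refused so that lookups such as
        # `__deepcopy__` or `_repr_html_` are not mistaken for column names.
        if name.startswith("_"):
            raise AttributeError(name)
        return Ref(self.path + (name,))

    def __repr__(self):
        return "Ref." + ".".join(self.path)


ref = Ref()
nested = ref.person.person_id
private = ref("_person")
```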

`Get` can also be used to create bound references. 

In [13]:
q = From(person)
q = Where(Fun("=", q >> Get(S.year_of_birth), 2000))
q = q >> Get.person_id
q

Where(Fun."="(From(SQLTable(person, ...)) >> Get.year_of_birth, Lit(2000))) >>
Get.person_id

Contrast this with an unbound reference, where the reference is resolved at render time by looking into the references available at the parent node. 

In [14]:
q = From(person) >> Group(Get.city) >> Select(Get.city)
q

let person = SQLTable(person, ...),
    q1 = From(person),
    q2 = q1 >> Group(Get.city),
    q3 = q2 >> Select(Get.city),
    q3
end

`Get` is used to dereference an alias created using the `As` node.

In [15]:
q = From(person) >> As(S.p) >> Select(Get.p.person_id)
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"

This is useful, for example, when disambiguating the columns produced by a `Join`.

In [16]:
q = (
    From(person)
    >> As(S.p)
    >> Join(
        From(location) >> As(S.l), on=Fun("=", Get.p.location_id, Get.l.location_id)
    )
    >> Select(Get.p.person_id, Get.l.state)
)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "location_1"."state"
FROM "person" AS "person_1"
INNER JOIN "location" AS "location_1" ON ("person_1"."location_id" = "location_1"."location_id")

This could also be done using `bound` references. 

In [17]:
s1 = From(person)
s2 = From(location)
q = (
    s1
    >> Join(s2, on=Fun("=", s1 >> Get.location_id, s2 >> Get.location_id))
    >> Select(s1 >> Get.person_id, s2 >> Get.state)
)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "location_1"."state"
FROM "person" AS "person_1"
INNER JOIN "location" AS "location_1" ON ("person_1"."location_id" = "location_1"."location_id")

#### Incorrect references

An error is raised when `Get` refers to an unknown attribute. 

In [18]:
q = Select(Get.person_id)
# render(q)

In [19]:
q = From(person) >> As(S.p) >> Select(Get.person_id)
# render(q)

An error is also raised when a reference can't be resolved unambiguously. 

In [20]:
q = From(person) >> Join(From(person), on=True) >> Select(Get.person_id)
# render(q)
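
The same ambiguity is rejected by SQL engines themselves. As an illustration only, the sketch below runs a comparable self-join in the standard library's `sqlite3` and captures the error; the exact message is SQLite's and is not what FunSQL reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER)")

try:
    # Both sides of the join expose "person_id", so the bare name is ambiguous.
    conn.execute("SELECT person_id FROM person AS a JOIN person AS b ON 1")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
finally:
    conn.close()
```

Qualifying the column through an alias or a bound reference removes the ambiguity.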

Unexpected hierarchical references also raise an error.

In [21]:
q = From(person) >> Select(Get.person_id.year_of_birth)
# render(q)

In [22]:
q = From(person) >> As(S.p) >> Select(Get.p)
# render(q)

A node-bound reference that is bound to an unrelated node raises an error.

In [23]:
s1 = From(person)
q = From(location) >> Where(Fun("=", s1 >> Get.year_of_birth, 2000))
# render(q)

A node-bound reference that can't be resolved unambiguously also raises an error. 

In [24]:
s1 = From(person)
q = (
    s1
    >> Join(aka(s1, "another"), on=Fun("!=", Get.person_id, Get.another.person_id))
    >> Select(s1 >> Get.person_id)
)
# render(q)

#### Define

`Define` can be used to create a new expression, and attach it to a query. 

In [25]:
age = Fun("-", Fun.now(), Get.birth_datetime)
q = From(person) >> Define(aka(age, "age")) >> Select(Get.person_id, Get.age)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  (NOW() - "person_1"."birth_datetime") AS "age"
FROM "person" AS "person_1"

The column added by `Define` is like a regular table/query column. 

In [26]:
age = Fun("-", Fun.now(), Get.birth_datetime)
person_w_age = From(person) >> Define(aka(age, "age"))
q = person_w_age >> Where(Fun(">=", Get.age, 32)) >> Select(Get.person_id, Get.age)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  (NOW() - "person_1"."birth_datetime") AS "age"
FROM "person" AS "person_1"
WHERE ((NOW() - "person_1"."birth_datetime") >= 32)

`Define` can be used to overwrite an existing field. 

In [27]:
q = From(person) >> Define(
    aka(Get.person_id, "location_id"), aka(Get.location_id, "person_id")
)
render(q)

query: 
SELECT
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."person_id" AS "location_id", 
  "person_1"."location_id" AS "person_id"
FROM "person" AS "person_1"

`Define` can be used after a `Select`. 

In [28]:
age = Fun("-", Fun.now(), Get.year_of_birth)
q = From(person) >> Select(Get.person_id, Get.year_of_birth) >> Define(aka(age, "age"))
render(q)

query: 
SELECT
  "person_2"."person_id", 
  "person_2"."year_of_birth", 
  (NOW() - "person_2"."year_of_birth") AS "age"
FROM (
  SELECT
    "person_1"."person_id", 
    "person_1"."year_of_birth"
  FROM "person" AS "person_1"
) AS "person_2"

### Variables

`Var` is used to create a query variable. 

In [29]:
Var.location_id

Var.location_id

Unbound variables get serialized as query parameters. 

In [30]:
q = (
    From(person)
    >> Where(Fun("=", Get.location_id, Var.location_id))
    >> Select(Get.person_id)
)
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE ("person_1"."location_id" = $1)

vars: [location_id]

Query variables can be bound using the `Bind` operator. 

In [31]:
def q(p_id):
    return (
        From(visit_occurence)
        >> Where(Fun("=", Get.person_id, Var.PERSON))
        >> Bind(aka(p_id, S.PERSON))
    )


render(q(210))

query: 
SELECT
  "visit_occurence_1"."visit_occurence_id", 
  "visit_occurence_1"."person_id", 
  "visit_occurence_1"."visit_start_date", 
  "visit_occurence_1"."visit_end_date"
FROM "visit_occurence" AS "visit_occurence_1"
WHERE ("visit_occurence_1"."person_id" = 210)

`Bind` can also be used to create correlated queries. 

In [32]:
def has_visit(p_id):
    return (
        From(visit_occurence)
        >> Where(Fun("=", Get.person_id, Var.PERSON))
        >> Bind(aka(p_id, S.PERSON))
    )


q = From(person) >> Where(Fun.exists(has_visit(Get.person_id))) >> Select(Get.person_id)
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE (EXISTS (
  SELECT NULL
  FROM "visit_occurence" AS "visit_occurence_1"
  WHERE ("visit_occurence_1"."person_id" = "person_1"."person_id")
))
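
To see what the correlated `EXISTS` computes, the rendered SQL can be run against a toy database. This sketch uses the standard library's `sqlite3`; the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (person_id INTEGER);
    CREATE TABLE visit_occurence (visit_occurence_id INTEGER, person_id INTEGER);
    INSERT INTO person VALUES (1), (2), (3);
    INSERT INTO visit_occurence VALUES (10, 1), (11, 1), (12, 3);
    """
)

# The inner query refers to "person_1" from the outer query (the correlation).
sql = """
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE (EXISTS (
  SELECT NULL
  FROM "visit_occurence" AS "visit_occurence_1"
  WHERE ("visit_occurence_1"."person_id" = "person_1"."person_id")
))
"""
people_with_visits = sorted(row[0] for row in conn.execute(sql))
conn.close()
```

Person 2 has no visits and is filtered out; person 1 appears once even though they have two visits.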

`Bind` can also be used to build a lateral `Join`.

In [33]:
def visit_for_person(p_id):
    return (
        From(visit_occurence)
        >> Where(Fun("=", Get.person_id, Var.PERSON))
        >> Bind(aka(p_id, S.PERSON))
    )


q = (
    From(person)
    >> Join(visit_for_person(Get.person_id) >> As("visit"), on=True, left=True)
    >> Select(Get.person_id, Get.visit.visit_start_date)
)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "visit_1"."visit_start_date"
FROM "person" AS "person_1"
LEFT JOIN LATERAL (
  SELECT "visit_occurence_1"."visit_start_date"
  FROM "visit_occurence" AS "visit_occurence_1"
  WHERE ("visit_occurence_1"."person_id" = "person_1"."person_id")
) AS "visit_1" ON TRUE

### Functions and Operators

Functions or Operators are represented using the `Fun` node. 

In [34]:
q = Fun(">", Get.year_of_birth, Lit(1940))
q

Fun.">"(Get.year_of_birth, Lit(1940))

Function args can be nested queries. 

In [35]:
p = From(person) >> Where(Fun("<", Get.year_of_birth, 2000))
q = Select(Fun.exists(p))
render(q)

query: 
SELECT (EXISTS (
  SELECT NULL
  FROM "person" AS "person_1"
  WHERE ("person_1"."year_of_birth" < 2000)
)) AS "exists"

All kinds of SQL expressions and operators can be represented using the `Fun` node. 

In [36]:
q = (
    From(person)
    >> Where(
        Fun(
            "and",
            Fun("is null", Get.birth_datetime),
            Fun("is not null", Get.year_of_birth),
        )
    )
    >> Select(
        aka(Fun.cast(Fun.extract("YEAR", Get.birth_datetime), "INT"), "year_of_birth")
    )
)
render(q)

query: 
SELECT CAST(EXTRACT(YEAR FROM "person_1"."birth_datetime") AS INT) AS "year_of_birth"
FROM "person" AS "person_1"
WHERE (("person_1"."birth_datetime" IS NULL) AND ("person_1"."year_of_birth" IS NOT NULL))

Redundant function expressions are not rendered. 

In [37]:
q = From(person) >> Select(Get.person_id) >> Where(Fun.AND())
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"

### Append

The `Append` node represents a SQL `UNION ALL`; that is, it concatenates the output of multiple queries.

In [38]:
q1 = From(measurement) >> Define(aka(Get.measurement_date, "date"))
q2 = From(observation) >> Define(aka(Get.observation_date, "date"))
q = q1 >> Append(q2)
render(q)

query: 
SELECT
  "measurement_1"."person_id", 
  "measurement_1"."measurement_date" AS "date"
FROM "measurement" AS "measurement_1"
UNION ALL
SELECT
  "observation_1"."person_id", 
  "observation_1"."observation_date" AS "date"
FROM "observation" AS "observation_1"

`Append` can also combine grouped queries; the aggregates are then computed inside each branch of the union.

In [39]:
q1 = (
    From(measurement)
    >> Define(aka(Get.measurement_concept_id, "concept_id"))
    >> Group(Get.person_id)
)
q2 = (
    From(observation)
    >> Define(aka(Get.observation_concept_id, "concept_id"))
    >> Group(Get.person_id)
)
q = (
    q1
    >> Append(q2)
    >> Select(
        Get.person_id,
        aka(Agg.count(), "count"),
        aka(Agg.count(Get.concept_id, distinct=True), "count_distinct"),
    )
)
render(q)

query: 
SELECT
  "union_1"."person_id", 
  "union_1"."count", 
  "union_1"."count_2" AS "count_distinct"
FROM (
  SELECT
    "measurement_1"."person_id", 
    count(*) AS "count", 
    count(DISTINCT "measurement_1"."measurement_concept_id") AS "count_2"
  FROM "measurement" AS "measurement_1"
  GROUP BY "measurement_1"."person_id"
  UNION ALL
  SELECT
    "observation_1"."person_id", 
    count(*) AS "count", 
    count(DISTINCT "observation_1"."observation_concept_id") AS "count_2"
  FROM "observation" AS "observation_1"
  GROUP BY "observation_1"."person_id"
) AS "union_1"

`Append` aligns the columns of its subqueries before doing a UNION. 

In [40]:
q1 = From(measurement) >> Select(Get.person_id, aka(Get.measurement_date, "date"))
q2 = From(observation) >> Select(aka(Get.observation_date, "date"), Get.person_id)
q = q1 >> Append(q2)
render(q)

query: 
SELECT
  "measurement_1"."person_id", 
  "measurement_1"."measurement_date" AS "date"
FROM "measurement" AS "measurement_1"
UNION ALL
SELECT
  "observation_2"."person_id", 
  "observation_2"."date"
FROM (
  SELECT
    "observation_1"."observation_date" AS "date", 
    "observation_1"."person_id"
  FROM "observation" AS "observation_1"
) AS "observation_2"
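
The alignment matters because SQL matches the branches of a `UNION ALL` by column position, not by name. The sketch below demonstrates this with the standard library's `sqlite3` and made-up rows; it illustrates the SQL behaviour rather than FunSQL's internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measurement (person_id INTEGER, measurement_date TEXT);
    CREATE TABLE observation (person_id INTEGER, observation_date TEXT);
    INSERT INTO measurement VALUES (1, '2020-01-01');
    INSERT INTO observation VALUES (2, '2021-06-30');
    """
)

# The second branch lists its columns in a different order.
misaligned = conn.execute(
    """
    SELECT person_id, measurement_date AS "date" FROM measurement
    UNION ALL
    SELECT observation_date AS "date", person_id FROM observation
    """
).fetchall()

# Wrapping the second branch in a subquery restores the expected order.
aligned = conn.execute(
    """
    SELECT person_id, measurement_date AS "date" FROM measurement
    UNION ALL
    SELECT person_id, "date" FROM (
      SELECT observation_date AS "date", person_id FROM observation
    )
    """
).fetchall()
conn.close()
```

In the misaligned result the date of the second row lands in the `person_id` column, which is the mistake the extra subquery prevents.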

If an explicit `Select` is missing, the output includes only the columns common to the nested queries. 

In [41]:
q = From(measurement) >> Append(From(observation))
render(q)

query: 
SELECT "measurement_1"."person_id"
FROM "measurement" AS "measurement_1"
UNION ALL
SELECT "observation_1"."person_id"
FROM "observation" AS "observation_1"

### Iterate

The `Iterate` node can be used to create a recursive CTE. 

In [42]:
q = Define(aka(1, "n"), aka(1, "prod")) >> Iterate(
    From(S.factorial)
    >> Define(aka(Fun("+", Get.n, 1), "n"))
    >> Define(aka(Fun("*", Get.n, Get.prod), "prod"))
    >> Where(Fun("<=", Get.n, 10))
    >> As(S.factorial)
)
render(q)

query: 
WITH RECURSIVE "factorial_1" ("n", "prod")  AS (
  SELECT
    1 AS "n", 
    1 AS "prod"
  UNION ALL
  SELECT
    ("factorial_2"."n" + 1) AS "n", 
    (("factorial_2"."n" + 1) * "factorial_2"."prod") AS "prod"
  FROM "factorial_1" AS "factorial_2"
  WHERE (("factorial_2"."n" + 1) <= 10)
)
SELECT
  "factorial_1"."n", 
  "factorial_1"."prod"
FROM "factorial_1"
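
As a sanity check, the recursive query above can be executed directly. The sketch below runs it with the standard library's `sqlite3` (an assumption about the target database, chosen only because it needs no setup) and compares the rows with `math.factorial`.

```python
import math
import sqlite3

sql = """
WITH RECURSIVE "factorial_1" ("n", "prod") AS (
  SELECT 1 AS "n", 1 AS "prod"
  UNION ALL
  SELECT
    ("factorial_2"."n" + 1) AS "n",
    (("factorial_2"."n" + 1) * "factorial_2"."prod") AS "prod"
  FROM "factorial_1" AS "factorial_2"
  WHERE (("factorial_2"."n" + 1) <= 10)
)
SELECT "factorial_1"."n", "factorial_1"."prod"
FROM "factorial_1"
"""

conn = sqlite3.connect(":memory:")
rows = sorted(conn.execute(sql).fetchall())
conn.close()

# One row per n from 1 to 10, each paired with n!.
expected = [(n, math.factorial(n)) for n in range(1, 11)]
```

The base query seeds the row `(1, 1)`, and each iteration derives the next row from the previous one until the `WHERE` condition stops the recursion.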

The `Iterate` node output preserves only the columns present in both the base query and the iterator query. 

In [43]:
q = Define(aka(0, "k"), aka(0, "m")) >> Iterate(
    From(S.self)
    >> As(S.previous)
    >> Where(Fun("<", Get.previous.m, 10))
    >> Define(aka(0, "n"), aka(Fun("+", Get.previous.m, 1), "m"))
    >> As(S.self)
)
render(q)

query: 
WITH RECURSIVE "self_1" ("m")  AS (
  SELECT 0 AS "m"
  UNION ALL
  SELECT ("self_2"."m" + 1) AS "m"
  FROM "self_1" AS "self_2"
  WHERE ("self_2"."m" < 10)
)
SELECT "self_1"."m"
FROM "self_1"

### As

`As` node creates an alias for an expression. 

In [44]:
q = From(person) >> Select(Get.person_id >> As("user")) >> Select(Get.user)
render(q)

query: 
SELECT "person_2"."user"
FROM (
  SELECT "person_1"."person_id" AS "user"
  FROM "person" AS "person_1"
) AS "person_2"

`As` can also create an alias for a subquery. 

In [45]:
q = From(person) >> As(S.p) >> Select(Get.p.person_id)
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"

This hides the subquery's columns from direct reference. To reach them, go through the alias first. 

In [46]:
# error
q = From(person) >> As(S.p) >> Select(Get.person_id)
# render(q)

In [47]:
# works
q = From(person) >> As(S.p) >> Select(Get.p.person_id)
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"

Node bound references are not blocked. 

In [48]:
s1 = From(person)
q = s1 >> As(S.p) >> Select(s1 >> Get.person_id)
render(q)

query: 
SELECT "person_1"."person_id"
FROM "person" AS "person_1"

### From

`From` can be used to select columns from the table specified. 

In [49]:
q = From(person)
q

From(SQLTable(person, ...))

By default, all the columns are selected. 

In [50]:
q = From(person)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"

If the table has a schema specified, the qualifier gets added in the rendered query. 

In [51]:
tab = SQLTable("madeup_table", ["colA", "colB"], schema="madeup_schema")
q = From(tab)
render(q)

query: 
SELECT
  "madeup_table_1"."colA", 
  "madeup_table_1"."colB"
FROM "madeup_schema"."madeup_table" AS "madeup_table_1"

A `ValuesTable` source is rendered as a `VALUES` clause. 

In [52]:
tab = ValuesTable(("name", "year"), [("SQL", 1974), ("Python", 1990), ("FunSQL", 2022)])
q = From(tab)
render(q)

query: 
SELECT
  "values_1"."name", 
  "values_1"."year"
FROM (
  VALUES
    ('SQL', 1974),
    ('Python', 1990),
    ('FunSQL', 2022)
) AS "values_1" ("name", "year") 

Only columns used in the query are serialized for a `VALUES` clause. 

In [53]:
tab = ValuesTable(("name", "year"), [("SQL", 1974), ("Python", 1990), ("FunSQL", 2022)])
q = From(tab) >> Select(Get("name"))
render(q)

query: 
SELECT "values_1"."name"
FROM (
  VALUES
    ('SQL'),
    ('Python'),
    ('FunSQL')
) AS "values_1" ("name") 

If no columns are selected, the values are replaced with nulls. 

In [54]:
tab = ValuesTable(("name", "year"), [("SQL", 1974), ("Python", 1990), ("FunSQL", 2022)])
q = From(tab) >> Group() >> Select(Agg.Count())
render(q)

query: 
SELECT Count(*) AS "Count"
FROM (
  VALUES (NULL, NULL, NULL)
) AS "values_1" ("_") 

The `VALUES` clause requires at least one row of data, so an empty `ValuesTable` is rendered as a `SELECT` of nulls that returns no rows. 

In [55]:
tab = ValuesTable(("name", "year"), [])
q = From(tab)
render(q)

query: 
SELECT
  NULL AS "name", 
  NULL AS "year"
WHERE FALSE

A null source generates a dataset with one row. 

In [56]:
q = From(None)
render(q)

query: 
SELECT NULL

### With

SQL has a `WITH` clause to create temporary tables for reuse in a query. They can be created using the `With` node. 

In [57]:
q = (
    From(S.thirty)
    >> With(From(person) >> Where(Fun("=", Get.year_of_birth, 1990)) >> As("thirty"))
    >> Select(Get.person_id)
)
render(q)

query: 
WITH "thirty_1" ("person_id")  AS (
  SELECT "person_1"."person_id"
  FROM "person" AS "person_1"
  WHERE ("person_1"."year_of_birth" = 1990)
)
SELECT "thirty_2"."person_id"
FROM "thirty_1" AS "thirty_2"

NOTE: Coming from SQL, the order of `From` and `With` nodes might seem odd since the `With` definition usually comes first. The reason is that FunSQL builds a query starting from the last node, and propagates up. Since the `From` node refers to a temporary table declared using `With`, to resolve it correctly, we must have encountered the `With` node first. 

`With` nodes can also declare multiple subqueries. 

In [58]:
q = Select(
    From(S.thirty) >> Group() >> Select(Agg.Count()) >> As("count_30s"),
    From(S.forty) >> Group() >> Select(Agg.Count()) >> As("count_40s"),
) >> With(
    From(person) >> Where(Fun("=", Get.year_of_birth, 1990)) >> As("thirty"),
    From(person) >> Where(Fun("=", Get.year_of_birth, 1980)) >> As("forty"),
)
render(q)

query: 
WITH "thirty_1" ("_")  AS (
  SELECT NULL
  FROM "person" AS "person_1"
  WHERE ("person_1"."year_of_birth" = 1990)
), 
"forty_1" ("_")  AS (
  SELECT NULL
  FROM "person" AS "person_2"
  WHERE ("person_2"."year_of_birth" = 1980)
)
SELECT
  (
    SELECT Count(*) AS "Count"
    FROM "thirty_1" AS "thirty_2"
  ) AS "count_30s", 
  (
    SELECT Count(*) AS "Count"
    FROM "forty_1" AS "forty_2"
  ) AS "count_40s"

Tables defined using a `With` node must have explicit, unique labels. 

In [59]:
q = From(S.person) >> With(From(person))
# render(q)

### Group

`Group` node is used to partition rows with the given keys, and summarize over them. 

In [60]:
q = From(person) >> Group(Get.year_of_birth) >> Select(Get.year_of_birth, Agg.count())
render(q)

query: 
SELECT
  "person_1"."year_of_birth", 
  count(*) AS "count"
FROM "person" AS "person_1"
GROUP BY "person_1"."year_of_birth"

By splitting the grouping logic from the aggregate expressions, queries get easier to construct.

In [61]:
visit_group = From(visit_occurence) >> Group(Get.person_id) >> As("visit_group")
num_visits = lambda: Agg.count(
    over=Get.visit_group
)  # regular assignment instead of a function works too
q = (
    From(person)
    >> Join(visit_group, on=Fun("=", Get.person_id, Get.visit_group.person_id))
    >> Where(Fun(">", num_visits(), 2))
    >> Select(Get.person_id, num_visits())
)

render(q)

query: 
SELECT
  "person_1"."person_id", 
  "visit_group_1"."count"
FROM "person" AS "person_1"
INNER JOIN (
  SELECT
    "visit_occurence_1"."person_id", 
    count(*) AS "count"
  FROM "visit_occurence" AS "visit_occurence_1"
  GROUP BY "visit_occurence_1"."person_id"
) AS "visit_group_1" ON ("person_1"."person_id" = "visit_group_1"."person_id")
WHERE ("visit_group_1"."count" > 2)

Grouping can be done in succession. 

In [62]:
# counting measurements for each concept, then counting frequency for each count
q = (
    From(measurement)
    >> Group(Get.measurement_concept_id)
    >> Group(aka(Agg.count(), "count_for_measure"))
    >> Select(Get.count_for_measure, aka(Agg.count(), "size"))
)
render(q)

query: 
SELECT
  "measurement_2"."count" AS "count_for_measure", 
  count(*) AS "size"
FROM (
  SELECT count(*) AS "count"
  FROM "measurement" AS "measurement_1"
  GROUP BY "measurement_1"."measurement_concept_id"
) AS "measurement_2"
GROUP BY "measurement_2"."count"

`Group` can work with an empty list of keys. 

In [63]:
q = (
    From(person)
    >> Group()
    >> Select(Agg.count(), Agg.max(Get.year_of_birth), Agg.min(Get.year_of_birth))
)
render(q)

query: 
SELECT
  count(*) AS "count", 
  max("person_1"."year_of_birth") AS "max", 
  min("person_1"."year_of_birth") AS "min"
FROM "person" AS "person_1"

Each aggregate expression gets a unique alias. 

In [64]:
visit_group = From(visit_occurence) >> Group(Get.person_id) >> As("visit_group")
person_visits = From(person) >> Join(
    visit_group, on=Fun("=", Get.person_id, Get.visit_group.person_id)
)

max_start_date = aka(Get.visit_group >> Agg.max(Get.visit_start_date), "max_start_date")
max_end_date = aka(Get.visit_group >> Agg.max(Get.visit_end_date), "max_end_date")
q = person_visits >> Select(Get.person_id, max_start_date, max_end_date)

render(q)

query: 
SELECT
  "person_1"."person_id", 
  "visit_group_1"."max" AS "max_start_date", 
  "visit_group_1"."max_2" AS "max_end_date"
FROM "person" AS "person_1"
INNER JOIN (
  SELECT
    "visit_occurence_1"."person_id", 
    max("visit_occurence_1"."visit_start_date") AS "max", 
    max("visit_occurence_1"."visit_end_date") AS "max_2"
  FROM "visit_occurence" AS "visit_occurence_1"
  GROUP BY "visit_occurence_1"."person_id"
) AS "visit_group_1" ON ("person_1"."person_id" = "visit_group_1"."person_id")

Aggregate expressions can be applied to only the distinct values in a partition. 

In [65]:
q = From(person) >> Group() >> Select(Agg.count(Get.year_of_birth, distinct=True))
render(q)

query: 
SELECT count(DISTINCT "person_1"."year_of_birth") AS "count"
FROM "person" AS "person_1"

Aggregates can be applied to a filtered portion of a partition. 

In [66]:
measure = Agg.count(filter_=Fun("<", Get.year_of_birth, 2000))
q = From(person) >> Group() >> Select(measure)
render(q)

query: 
SELECT (count(*) FILTER (WHERE ("person_1"."year_of_birth" < 2000))) AS "count"
FROM "person" AS "person_1"

Aggregate expressions can't be used without a `Group` node. 

In [67]:
q = From(person) >> Select(Agg.max(Get.year_of_birth))
# render(q)

Aggregate expressions need to unambiguously determine the corresponding `Group` node. 

In [68]:
q1 = From(person)
q2 = From(measurement) >> Group(Get.person_id)
q3 = From(visit_occurence) >> Group(Get.person_id)

q = (
    q1
    >> Join(q2, on=Fun("=", Get.person_id, q2 >> Get.person_id))
    >> Join(q3, on=Fun("=", Get.person_id, q3 >> Get.person_id))
    >> Select(q1 >> Get.person_id, Agg.count())
)
# render(q)

### Partition

The `Partition` node creates a subquery that partitions rows by the specified keys. For each row, an aggregate can be calculated across all the rows in its partition (these are called window functions in SQL). 

In [69]:
q = (
    From(person)
    >> Partition(Get.year_of_birth, order_by=[Get.month_of_birth])
    >> Select(Get.person_id, Agg.row_number())
)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  (row_number() OVER (PARTITION BY "person_1"."year_of_birth" ORDER BY "person_1"."month_of_birth")) AS "row_number"
FROM "person" AS "person_1"
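
For intuition, `row_number()` over a partition can be emulated in plain Python: sort by the partition key and the ordering key, then restart the counter for each partition. The sample rows below are invented, and the code is only a model of the window function's semantics.

```python
from itertools import groupby
from operator import itemgetter

people = [
    {"person_id": 1, "year_of_birth": 1990, "month_of_birth": 7},
    {"person_id": 2, "year_of_birth": 1990, "month_of_birth": 2},
    {"person_id": 3, "year_of_birth": 1985, "month_of_birth": 5},
]

# Sort by partition key, then by the ordering key inside each partition.
ordered = sorted(people, key=itemgetter("year_of_birth", "month_of_birth"))

row_numbers = {}
for _, partition in groupby(ordered, key=itemgetter("year_of_birth")):
    # Numbering restarts at 1 for every partition.
    for number, row in enumerate(partition, start=1):
        row_numbers[row["person_id"]] = number
```

Unlike `Group`, the partition keeps every input row; the window function only adds a value computed from the row's neighbours.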

A Partition node may specify a window frame. 

In [70]:
births_by_year = (
    From(person) >> Group(Get.year_of_birth) >> Select(Get.year_of_birth, Agg.count())
)
cumulative_births_by_year = (
    births_by_year
    >> Partition(
        order_by=[Get.year_of_birth],
        frame=Frame(
            FrameMode.ROWS,
            FrameEdge(FrameEdgeSide.PRECEDING, None),
            FrameEdge(FrameEdgeSide.CURRENT_ROW),
        ),
    )
    >> Select(Get.year_of_birth, Agg.sum(Get.count))
)

render(cumulative_births_by_year)

query: 
SELECT
  "person_2"."year_of_birth", 
  (sum("person_2"."count") OVER (ORDER BY "person_2"."year_of_birth" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS "sum"
FROM (
  SELECT
    "person_1"."year_of_birth", 
    count(*) AS "count"
  FROM "person" AS "person_1"
  GROUP BY "person_1"."year_of_birth"
) AS "person_2"

Spelling out a `Frame` node is verbose, so the common constructs are also available through a shorthand class, `F`. 

In this example, `Partition` nodes are used one after the other to simplify a nested SQL query. We want to get the set of non-overlapping visits made by a person. 

In [71]:
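# for each visit by a person, find the latest end date among that person's earlier visits ("boundary")
# a visit starts a new group ("new" = 1) when it begins after that boundary
# a running sum of "new" numbers the groups
# each (person, group) pair then collapses into one non-overlapping span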

q = (
    From(visit_occurence)
    >> Partition(
        Get.person_id,
        order_by=[Get.visit_start_date],
        frame=Frame(F.ROWS, F.pre(None), F.pre(1)),
    )
    >> Define(aka(Agg.max(Get.visit_end_date), "boundary"))
    >> Define(aka(Fun("-", Get.visit_start_date, Get.boundary), "gap"))
    >> Define(aka(Fun.case(Fun("<=", Get.gap, 0), 0, 1), "new"))
    >> Partition(
        Get.person_id,
        order_by=[Get.visit_start_date, Fun("-", Get.new)],
        frame=Frame(F.ROWS, F.pre(None), F.curr_row()),
    )
    >> Define(aka(Agg.sum(Get.new), "group"))
    >> Group(Get.person_id, Get.group)
    >> Define(
        aka(Agg.min(Get.visit_start_date), "start_date"),
        aka(Agg.max(Get.visit_end_date), "end_date"),
    )
    >> Select(Get.person_id, Get.start_date, Get.end_date)
)

render(q)

query: 
SELECT
  "visit_occurence_3"."person_id", 
  min("visit_occurence_3"."visit_start_date") AS "start_date", 
  max("visit_occurence_3"."visit_end_date") AS "end_date"
FROM (
  SELECT
    "visit_occurence_2"."person_id", 
    (sum("visit_occurence_2"."new") OVER (PARTITION BY "visit_occurence_2"."person_id" ORDER BY "visit_occurence_2"."visit_start_date", (- "visit_occurence_2"."new") ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS "group", 
    "visit_occurence_2"."visit_start_date", 
    "visit_occurence_2"."visit_end_date"
  FROM (
    SELECT
      "visit_occurence_1"."person_id", 
      (CASE WHEN (("visit_occurence_1"."visit_start_date" - (max("visit_occurence_1"."visit_end_date") OVER (PARTITION BY "visit_occurence_1"."person_id" ORDER BY "visit_occurence_1"."visit_start_date" ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING))) <= 0) THEN 0 ELSE 1 END) AS "new", 
      "visit_occurence_1"."visit_start_date", 
      "visit_occurence_1"."visit_end_date"
    FROM "visit_o
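
As far as the query can be read, the two `Partition` steps amount to a standard interval-merging pass. The plain-Python sketch below follows the same steps on invented data; `merge_visits` is a hypothetical helper, and it processes rows sequentially instead of using window frames, so treat it as a model of the logic and not as a translation of the SQL.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def merge_visits(visits):
    """Collapse overlapping (person_id, start, end) visits into episodes."""
    episodes = []
    ordered = sorted(visits, key=itemgetter(0, 1))
    for person_id, rows in groupby(ordered, key=itemgetter(0)):
        boundary = None  # latest end date among this person's earlier visits
        for _, start, end in rows:
            if boundary is None or start > boundary:
                # Positive gap (or no earlier visit): open a new episode.
                episodes.append([person_id, start, end])
                boundary = end
            else:
                # Gap <= 0: the visit overlaps, so extend the current episode.
                episodes[-1][2] = max(episodes[-1][2], end)
                boundary = max(boundary, end)
    return [tuple(episode) for episode in episodes]


visits = [
    (1, date(2020, 1, 1), date(2020, 1, 5)),
    (1, date(2020, 1, 3), date(2020, 1, 10)),
    (1, date(2020, 2, 1), date(2020, 2, 2)),
    (2, date(2020, 1, 1), date(2020, 1, 2)),
]
episodes = merge_visits(visits)
```

The first two visits of person 1 overlap and merge into a single span, while the February visit starts a new one.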

### Join

The `Join` node combines the output of two subqueries. 

In [72]:
q = From(person) >> Join(
    aka(From(location), "location"),
    on=Fun("=", Get.location_id, Get.location.location_id),
)
render(q)

query: 
SELECT
  "person_1"."location_id", 
  "person_1"."birth_datetime", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."person_id"
FROM "person" AS "person_1"
INNER JOIN "location" AS "location_1" ON ("person_1"."location_id" = "location_1"."location_id")

Different variants of the SQL `JOIN` operation can be constructed using the relevant keyword args. 

In [73]:
# right join
q = From(person) >> Join(
    aka(From(location), "location"),
    on=Fun("=", Get.location_id, Get.location.location_id),
    right=True,
)
render(q)

query: 
SELECT
  "person_1"."location_id", 
  "person_1"."birth_datetime", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."person_id"
FROM "person" AS "person_1"
RIGHT JOIN "location" AS "location_1" ON ("person_1"."location_id" = "location_1"."location_id")

Joins with correlated subqueries are supported too. 

In [74]:
# gets the _second_ visit made by a person
def second_visit(p_id):
    return (
        From(visit_occurence)
        >> Where(Fun("=", Get.person_id, Var.PERSON_ID))
        >> Partition(order_by=[Get.visit_start_date])
        >> Where(Fun("=", Agg.row_number(), 2))
        >> Bind(aka(p_id, "PERSON_ID"))
    )


# gets all people and if they made any second visits
q = (
    From(person)
    >> Join(aka(second_visit(Get.person_id), "visit"), on=True, left=True)
    >> Select(Get.person_id, Get.visit.visit_occurence_id, Get.visit.visit_start_date)
)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "visit_1"."visit_occurence_id", 
  "visit_1"."visit_start_date"
FROM "person" AS "person_1"
LEFT JOIN LATERAL (
  SELECT
    "visit_occurence_2"."visit_occurence_id", 
    "visit_occurence_2"."visit_start_date"
  FROM (
    SELECT
      "visit_occurence_1"."visit_occurence_id", 
      "visit_occurence_1"."visit_start_date", 
      (row_number() OVER (ORDER BY "visit_occurence_1"."visit_start_date")) AS "row_number"
    FROM "visit_occurence" AS "visit_occurence_1"
    WHERE ("visit_occurence_1"."person_id" = "person_1"."person_id")
  ) AS "visit_occurence_2"
  WHERE ("visit_occurence_2"."row_number" = 2)
) AS "visit_1" ON TRUE

### Order

The `Order` operator creates a subquery to sort the output. 

In [75]:
q = From(person) >> Order(Get.year_of_birth)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth"

The number of rows in the result set can also be limited with a `Limit` node. 

In [76]:
q = (
    From(person)
    >> Order(Get.year_of_birth)
    >> Limit(10)
    >> Order(Get.person_id)
    >> Select(Get.person_id, Get.location_id)
)
render(q)

query: 
SELECT
  "person_2"."person_id", 
  "person_2"."location_id"
FROM (
  SELECT
    "person_1"."person_id", 
    "person_1"."location_id"
  FROM "person" AS "person_1"
  ORDER BY "person_1"."year_of_birth"
  LIMIT 10
) AS "person_2"
ORDER BY "person_2"."person_id"
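
The nesting is what keeps the two sorts apart: the inner query picks rows by birth year, and the outer query only reorders the rows that survived the limit. A small check with the standard library's `sqlite3` and invented rows, using a limit of 2 to keep the data short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER, location_id INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 1990, 10), (2, 1950, 20), (3, 1970, 30), (4, 1940, 40)],
)

sql = """
SELECT "person_2"."person_id", "person_2"."location_id"
FROM (
  SELECT "person_1"."person_id", "person_1"."location_id"
  FROM "person" AS "person_1"
  ORDER BY "person_1"."year_of_birth"
  LIMIT 2
) AS "person_2"
ORDER BY "person_2"."person_id"
"""
rows = conn.execute(sql).fetchall()
conn.close()
```

The two people with the earliest birth years (ids 4 and 2) are selected first and then returned in `person_id` order.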

The direction of the sort column can be specified too. 

In [77]:
q = From(person) >> Order(
    Get.year_of_birth >> Desc(NullsOrder.FIRST), Get.person_id >> Asc()
)

render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY
  "person_1"."year_of_birth" DESC NULLS FIRST, 
  "person_1"."person_id" ASC
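
The ordering requested here can be reproduced in plain Python, which may help when checking results by hand. This is only a sketch of the semantics of `DESC NULLS FIRST` followed by an ascending tie-breaker, using made-up rows where `None` stands in for SQL `NULL`.

```python
# Hypothetical (person_id, year_of_birth) rows; None plays the role of NULL.
people = [(1, 1980), (2, None), (3, 1995), (4, 1980), (5, None)]

def sort_key(row):
    person_id, year = row
    return (
        year is not None,                 # NULLS FIRST: False sorts before True
        -year if year is not None else 0, # DESC on year_of_birth
        person_id,                        # ASC on person_id breaks ties
    )

ordered = sorted(people, key=sort_key)
print([person_id for person_id, _ in ordered])  # [2, 5, 3, 1, 4]
```

Rows without a birth year come first, then the remaining rows from the latest year to the earliest, with `person_id` deciding the order among equal years.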

### Limit

The `Limit` node selects a fixed number of rows from a subquery, typically in conjunction with an `Order` node. 

In [78]:
q = From(person) >> Order(Get.person_id) >> Limit(20)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."person_id"
LIMIT 20

`Limit` also lets you specify an offset. 

In [79]:
q = From(person) >> Order(Get.person_id) >> Limit(100, offset=10)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."person_id"
LIMIT 100
OFFSET 10

You can also specify just the offset. In the rendered query, `LIMIT -1` stands for "no upper bound", a convention used by SQLite. 

In [80]:
q = From(person) >> Order(Get.person_id) >> Limit(offset=10)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."person_id"
LIMIT -1
OFFSET 10
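
The effect of both forms can be checked against SQLite directly. The sketch below uses Python's standard `sqlite3` module and a made-up `person` table with ids 1 to 15; the queries follow the rendered `LIMIT ... OFFSET ...` shape, with a smaller limit for the first one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER)")
# Hypothetical data: fifteen people with ids 1..15.
conn.executemany("INSERT INTO person VALUES (?)", [(i,) for i in range(1, 16)])

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

base = (
    'SELECT "person_1"."person_id" FROM "person" AS "person_1" '
    'ORDER BY "person_1"."person_id"'
)

page = ids(base + " LIMIT 3 OFFSET 10")   # skip ten rows, keep three
rest = ids(base + " LIMIT -1 OFFSET 10")  # skip ten rows, keep the rest
print(page)  # [11, 12, 13]
print(rest)  # [11, 12, 13, 14, 15]
```

Other databases spell an offset without a limit differently, so the `LIMIT -1` form should be treated as specific to SQLite.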

### Select

The `Select` node specifies the output columns for a subquery. 

In [81]:
q = From(person) >> Select(Get.person_id, Get.year_of_birth)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth"
FROM "person" AS "person_1"

`Select` always closes the current subquery, so any node that follows it is rendered as an outer query wrapped around it. 

In [82]:
q = (
    From(person)
    >> Select(Get.year_of_birth)
    >> Where(Fun(">", Get.year_of_birth, 2000))
)

render(q)

query: 
SELECT "person_2"."year_of_birth"
FROM (
  SELECT "person_1"."year_of_birth"
  FROM "person" AS "person_1"
) AS "person_2"
WHERE ("person_2"."year_of_birth" > 2000)
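
The extra nesting changes the shape of the SQL, not the rows that come back. As a rough check, the sketch below runs the nested query and a hand-written flat equivalent against an in-memory SQLite database (standard `sqlite3` module, made-up rows) and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER)")
# Hypothetical rows, invented for this illustration.
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, 1990), (2, 2005), (3, 2010), (4, 1985)],
)

# The nested query, as rendered above.
nested = """
SELECT "person_2"."year_of_birth"
FROM (
  SELECT "person_1"."year_of_birth"
  FROM "person" AS "person_1"
) AS "person_2"
WHERE ("person_2"."year_of_birth" > 2000)
"""

# A hand-written flat query with the filter applied before the projection.
flat = """
SELECT "person_1"."year_of_birth"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" > 2000)
"""

nested_rows = sorted(conn.execute(nested).fetchall())
flat_rows = sorted(conn.execute(flat).fetchall())
print(nested_rows)  # [(2005,), (2010,)]
```

Both queries return the same rows here, so placing `Select` last is mainly a way to keep the generated SQL flat and easier to read.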

All columns passed to a `Select` node must have unique aliases.

In [83]:
# doesn't work: both columns would share the alias "person_id"
# q = From(person) >> Select(Get.person_id, Get.person_id)

# works
q = From(person) >> Select(Get.person_id, aka(Get.person_id, "duplicate_id"))

render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."person_id" AS "duplicate_id"
FROM "person" AS "person_1"

### Where

`Where` can be used to filter the query output by a condition. 

In [84]:
q = From(person) >> Where(Fun("=", Get.year_of_birth, 2000))
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" = 2000)

Multiple `Where` nodes in sequence are collapsed into a single clause. 

In [85]:
q = (
    From(person)
    >> Where(Fun(">", Get.year_of_birth, 1980))
    >> Where(Fun("<", Get.year_of_birth, 2000))
    >> Where(Fun("!=", Get.year_of_birth, 1990))
)
render(q)

query: 
SELECT
  "person_1"."person_id", 
  "person_1"."year_of_birth", 
  "person_1"."month_of_birth", 
  "person_1"."day_of_birth", 
  "person_1"."birth_datetime", 
  "person_1"."location_id"
FROM "person" AS "person_1"
WHERE (("person_1"."year_of_birth" > 1980) AND ("person_1"."year_of_birth" < 2000) AND ("person_1"."year_of_birth" <> 1990))
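
Collapsing the conditions with `AND` gives the same rows as applying the filters one after another. The sketch below checks this on made-up data: it runs the combined `WHERE` clause in an in-memory SQLite database (standard `sqlite3` module) and compares it with three successive Python filters.

```python
import sqlite3

# Hypothetical (person_id, year_of_birth) rows, invented for this illustration.
people = [(1, 1975), (2, 1985), (3, 1990), (4, 1995), (5, 2000), (6, 2005)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)", people)

# The single collapsed clause, as in the rendered query above.
sql = """
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE (("person_1"."year_of_birth" > 1980)
  AND ("person_1"."year_of_birth" < 2000)
  AND ("person_1"."year_of_birth" <> 1990))
"""
collapsed = sorted(row[0] for row in conn.execute(sql))

# The same three conditions applied one step at a time.
step = [p for p in people if p[1] > 1980]
step = [p for p in step if p[1] < 2000]
step = [p for p in step if p[1] != 1990]
chained = sorted(p[0] for p in step)

print(collapsed)  # [2, 4]
```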

A `Where` node that follows a `Group` node is translated into a `HAVING` clause. 

In [86]:
q = (
    From(location)
    >> Group(Get.state)
    >> Where(Fun(">", Agg.count(Get.city, distinct=True), 5))
    >> Where(Fun("<", Agg.count(Get.city, distinct=True), 10))
)

render(q)

query: 
SELECT "location_1"."state"
FROM "location" AS "location_1"
GROUP BY "location_1"."state"
HAVING ((count(DISTINCT "location_1"."city") > 5) AND (count(DISTINCT "location_1"."city") < 10))
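
To see the `HAVING` clause at work, the rendered query can be run as-is against an in-memory SQLite database. The sketch below uses Python's standard `sqlite3` module and a made-up `location` table in which the three states have 3, 7 and 12 distinct cities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (location_id INTEGER, city TEXT, state TEXT)")

# Hypothetical data: "AA" has 3 distinct cities, "BB" has 7, "CC" has 12.
rows = []
for state, n_cities in [("AA", 3), ("BB", 7), ("CC", 12)]:
    for i in range(n_cities):
        rows.append((len(rows) + 1, f"{state}-city-{i}", state))
# A repeated city must not change the DISTINCT count for "BB".
rows.append((len(rows) + 1, "BB-city-0", "BB"))
conn.executemany("INSERT INTO location VALUES (?, ?, ?)", rows)

# The rendered query from above, unchanged.
sql = """
SELECT "location_1"."state"
FROM "location" AS "location_1"
GROUP BY "location_1"."state"
HAVING ((count(DISTINCT "location_1"."city") > 5) AND (count(DISTINCT "location_1"."city") < 10))
"""
states = conn.execute(sql).fetchall()
print(states)  # [('BB',)]
```

Only the state with more than 5 and fewer than 10 distinct cities passes both conditions, which are evaluated per group rather than per row.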