<img align="right" src="tf-small.png"/>

# tfQuery

Do we need a query language in TF, like MQL?

Yes, it is convenient to have a more declarative way of getting a set of interesting nodes to work with.
But should it be MQL?

Experience shows that MQL may give you a very good first try, 
until you realize that you may not have queried for all cases.
You forgot to query for some elements in a different order.
You have not reckoned with gaps.
And the query does not give you interesting things from the context with the results.
Also, MQL does not work nicely with object types that are scattered through other object types, such as the
[*lexeme*](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/otype.html)
type.

Look in SHEBANQ for examples of how unwieldy MQL queries may become.

Here is a good example:

[Dirk Roorda: Yesh](https://shebanq.ancient-data.org/hebrew/query?version=4b&id=556)

```
select all objects where
[book [chapter [verse
[clause
    [clause_atom
        [phrase
            [phrase_atom
                [word focus lex="JC/" OR lex=">JN/"]
            ]
        ]
    ]
]
]]]
```

Well, this is not too complicated, but the query misses results.
See [here](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_mql.html)
to see what would be needed to make it right.

Also, I wonder: do we want a new language?
Suppose we make a TFQL, then we need a parser for it,
we need to define a syntax, we need to refine the syntax, update the parser, etc.
It will become a cumbersome straitjacket.

In our case, we do not have the requirement that non-coders should be able to use TFQL in a stand-alone manner.

On the contrary, TFQL should live in a programming environment, and we can take advantage of that.

Here are initial thoughts for **tfQuery**, a query *mechanism* inside TF, not a *language*.

* tfQuery defines queries as data structures in Python, more precisely: as a graph
* it does not matter how you build up a query, tfQuery processes the value of a data structure
  that you pass to it. The surface syntax will not be seen by tfQuery
* a query is a graph representation where the nodes are things like
  
  `('phrase', dict(det='und'))`
  
  or
  
  `('word', dict(sp='verb', gn='f', ps='p3'))`

* the edges specify relations between the nodes, like: *is contained in*, *follows*,
  *precedes*
  
In MQL you also specify a graph, by means of a template, but this template forces you to *overspecify*: the template often implies more constraints than you really want.

So how do we specify edges? As constraints.

Let us formulate a query for

* clauses that are object clauses
* containing two phrases (both undetermined)
* one of which contains a verb in the third person feminine
* and the other phrase contains a feminine, plural noun

In MQL

```
[clause rela='Objc'
    [phrase det='und'
        [word sp='verb' AND gn='f' AND ps='p3']
    ]
    [phrase det='und'
        [word sp='subs' AND gn='f' AND nu='pl']
    ]
]
```

Here is how we are going to do it,
and note that we are going to write executable code!

In [1]:
c = ('clause', dict(rela='Objc'))
p1 = ('phrase', dict(det='und'))
p2 = ('phrase', dict(det='und'))
w1 = ('word', dict(sp='verb', gn='f', ps='p3'))
w2 = ('word', dict(sp='subs', gn='f', nu='pl'))

In [2]:
nodes = [c, p1, p2, w1, w2]
edges = [
    (c, [p1,p2]),
    (p1, [w1]),
    (p2, [w2]),
    (p1, p2),
]

query = (nodes, edges)

An edge like `(x, [y,z])` means that `y` and `z` are embedded in `x`, but does not mean
that `y` comes before `z`.

An edge like `(x, y)` means that `x` comes before `y`.
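
Both kinds of edge have the shape `(x, something)`, so whatever consumes the query has to tell them apart by the type of the second member. Here is a minimal sketch of that dispatch, assuming the tuple-and-list encoding used above. The helpers `indexOf` and `splitEdges` are hypothetical names, not part of TF. Note that `p1` and `p2` are equal as values, so the sketch identifies query nodes by object identity and refers to them by their position in `nodes`.

```python
# The query from the cells above, repeated so that this sketch runs on its own.
c = ('clause', dict(rela='Objc'))
p1 = ('phrase', dict(det='und'))
p2 = ('phrase', dict(det='und'))
w1 = ('word', dict(sp='verb', gn='f', ps='p3'))
w2 = ('word', dict(sp='subs', gn='f', nu='pl'))

nodes = [c, p1, p2, w1, w2]
edges = [(c, [p1, p2]), (p1, [w1]), (p2, [w2]), (p1, p2)]


def indexOf(queryNode, nodes):
    """Position of a query node in `nodes`, compared by identity.

    p1 == p2 is True (same type, same features), so comparing by
    value cannot tell the two phrases apart; `is` can.
    """
    for (i, candidate) in enumerate(nodes):
        if candidate is queryNode:
            return i
    raise ValueError('query node not in node list')


def splitEdges(nodes, edges):
    """Hypothetical helper: turn the edges into two lists of index pairs.

    (i, j) in embeds   : query node j is embedded in query node i
    (i, j) in precedes : query node i comes before query node j
    """
    embeds, precedes = [], []
    for (x, y) in edges:
        i = indexOf(x, nodes)
        if isinstance(y, list):
            embeds.extend((i, indexOf(child, nodes)) for child in y)
        else:
            precedes.append((i, indexOf(y, nodes)))
    return embeds, precedes


embeds, precedes = splitEdges(nodes, edges)
print(embeds)    # [(0, 1), (0, 2), (1, 3), (2, 4)]
print(precedes)  # [(1, 2)]

# Leaving out the last edge yields the same query without order constraints.
embedsOnly, noOrder = splitEdges(nodes, edges[:-1])
print(noOrder)   # []
```

Working with positions rather than with the tuples themselves sidesteps two issues of the raw encoding: equal-looking nodes stay distinct, and the dicts inside the tuples never need to be hashed.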

## Increased flexibility

Note that it is very easy to remove the `(p1, p2)` condition, which states that the first
phrase comes before the second one.

If we wanted to do that in MQL, the query would become:

```
[clause rela='Objc'
    [phrase det='und'
        [word sp='verb' AND gn='f' AND ps='p3']
    ]
    [phrase det='und'
        [word sp='subs' AND gn='f' AND nu='pl']
    ]
    OR
    [phrase det='und'
        [word sp='subs' AND gn='f' AND nu='pl']
    ]
    [phrase det='und'
        [word sp='verb' AND gn='f' AND ps='p3']
    ]
]
```

This quickly gets out of hand, see e.g.
[Dirk Roorda: Object clauses of verbless mothers](https://shebanq.ancient-data.org/hebrew/query?id=984) and accompanying
[notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/shebanq/VerblessMothers.html)

## Query results

What should we return as query results?
Do we want every instantiation of the nodes that satisfy the criteria?

That can become overwhelming. 
If for example you search for a word in a book, another word in the same book, and a third word in the same book without further constraints, then for a book with 10,000 words you'll get 10,000 * 10,000 * 10,000 results, that is 10^12 or one tera results, which is, even for a computer, a bit much.

This is why Ulrik invented the sheaf.

Our way of solving this problem could be like this:

* we return a collection of node lists: for each node in the query we return the 
  corresponding node list;
* these lists consist of TF nodes which are guaranteed to occur in at least one
  instantiation of the whole graph;
* we return nothing else.
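
To make the difference between instantiations and node lists concrete, here is a small sketch on a toy "book" of five words. It is an illustration only: it computes the node lists by brute force from the full set of instantiations, which is exactly the work a real implementation would want to avoid. The query (three words in a fixed order) and the names `instantiations` and `nodeLists` are assumptions made for this example.

```python
from itertools import combinations

# Toy "book": its words are just the slot numbers 0..4.
words = range(5)

# Query: three words x, y, z in the book, with x before y before z.
# Brute force enumerates every instantiation of the whole query graph.
instantiations = list(combinations(words, 3))

# Proposed result: one node list per query node, holding exactly the
# nodes that occur in at least one instantiation.
nodeLists = [
    sorted({inst[i] for inst in instantiations})
    for i in range(3)
]

print(len(instantiations))  # 10
print(nodeLists)            # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

# Without any constraint, a book of 10,000 words gives 10,000 ** 3
# instantiations, while the three node lists hold at most 3 * 10,000 nodes.
print(10_000 ** 3)  # 1000000000000
print(3 * 10_000)   # 30000
```

The node lists grow with the sum of the candidate counts, the instantiations with their product.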

It is up to the user to pick one of these node lists and to process it further.
If they need context around the result nodes, they can easily draw the info from there.

It is even possible to generate the code to get full results from the original query graph.

So the real result is two things:

* a collection of node lists
* a function to look up other result nodes in the context of a given result node.

## Implementation

How could we implement this search efficiently?

First idea:

* build for each query graph node
  (which corresponds to a local feature condition on an object)
  the set of nodes that satisfy the condition.
  This is the easy part:
  A single walk over all nodes could construct these sets in one go, in a fraction
  of a second;
* then work through all edges, where every edge is an instruction to weed out non-results
  from the earlier obtained sets.
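
The first step could look like the sketch below. It runs on a toy corpus held in two plain dicts, `otype` and `features`; these stand in for the real data and are not the TF API. The helpers `matches` and `candidateSets` are hypothetical names.

```python
# Toy corpus, NOT the TF API: TF node number -> object type and features.
otype = {1: 'clause', 2: 'phrase', 3: 'phrase', 4: 'word', 5: 'word', 6: 'word'}
features = {
    1: dict(rela='Objc'),
    2: dict(det='und'),
    3: dict(det='det'),
    4: dict(sp='verb', gn='f', ps='p3'),
    5: dict(sp='subs', gn='f', nu='pl'),
    6: dict(sp='subs', gn='m', nu='sg'),
}

queryNodes = [
    ('clause', dict(rela='Objc')),
    ('phrase', dict(det='und')),
    ('word', dict(sp='subs', gn='f', nu='pl')),
]


def matches(tfNode, queryNode):
    """Does the TF node have the type and all feature values asked for?"""
    (wantedType, wantedFeatures) = queryNode
    if otype[tfNode] != wantedType:
        return False
    have = features[tfNode]
    return all(have.get(name) == value for (name, value) in wantedFeatures.items())


def candidateSets(queryNodes):
    """A single walk over all TF nodes; one candidate set per query node."""
    sets = [set() for _ in queryNodes]
    for tfNode in otype:
        for (i, queryNode) in enumerate(queryNodes):
            if matches(tfNode, queryNode):
                sets[i].add(tfNode)
    return sets


print(candidateSets(queryNodes))  # [{1}, {2}, {5}]
```

Only local conditions are checked here: phrase 2 becomes a candidate whether or not it lies inside clause 1. Weeding out along the edges is what the second step is for.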
  
How would that work, filtering along an edge in the graph?

Suppose there is an edge from n1 to n2 (in the query graph).
This edge specifies a relationship between nodes in the result nodeset of n1 and nodes 
in the result nodeset of n2.

In English: such an edge says: hey TF node in result set of n1: 
do you have a parent (or child, or older brother or younger sister)
that occurs in the result of n2?

If so: you can stay. If not: you're OUT.
So this weeds out TF nodes from the result set of n1.

But we can also reduce the nodes in the result set of n2.
Every TF node in the result set of n2 that does not figure as the parent
(or child or brother/sister) of a TF node in the result set of n1 is also out.
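
Filtering in both directions along one edge can be sketched as follows. The relation is passed in as a function `related(a, b)`, and the toy `parent` dict stands in for whatever precomputed relationship data is available; neither is the TF API. The name `filterEdge` is hypothetical.

```python
# Toy containment data, NOT the TF API: child TF node -> parent TF node.
parent = {10: 1, 11: 1, 12: 2, 13: 3}


def isContainedIn(a, b):
    return parent.get(a) == b


def filterEdge(set1, set2, related):
    """Weed out both result sets along one query edge.

    related(a, b) says whether TF node a (from set1) stands in the
    relation of the edge to TF node b (from set2).
    Returns the two reduced sets; the inputs are left untouched.
    """
    # a may stay if it has at least one partner in set2
    keep1 = {a for a in set1 if any(related(a, b) for b in set2)}
    # b may stay if it is the partner of at least one surviving a
    keep2 = {b for b in set2 if any(related(a, b) for a in keep1)}
    return keep1, keep2


# Query edge: n1 (words) is contained in n2 (phrases).
words = {10, 11, 12, 13}
phrases = {1, 2, 4}

keptWords, keptPhrases = filterEdge(words, phrases, isContainedIn)
print(sorted(keptWords))    # [10, 11, 12]
print(sorted(keptPhrases))  # [1, 2]
```

Word 13 goes out because its parent 3 is not a candidate phrase, and phrase 4 goes out because no candidate word lies inside it. The nested scan is quadratic and only serves the illustration; with precomputed lookups each check would be a direct access.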

In this way we can take every edge, one by one, and perform the filtering.
This is also a fast operation, provided we can make the elementary relationship checks
quickly. (And we can, in TF, thanks to precomputing).

When we have done all edges, we probably have to iterate again over all edges.

Because an edge between n2 and n3 could have weeded out additional TF nodes from the 
result set of n2, and this influences the validity of the TF nodes in the result set of n1.

Probably we have to repeat until the result sets do not change anymore.
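
Repeating until nothing changes can be sketched as a simple loop. The sets only ever shrink, so the loop must terminate. The toy `parent` dict, the relation `inside` and the function `propagate` are assumptions for this example, not TF code.

```python
# Toy containment data, NOT the TF API: child TF node -> parent TF node.
# Words 10 and 11 lie in phrases 1 and 2, which lie in clauses 100 and 200.
parent = {10: 1, 11: 2, 1: 100, 2: 200}


def inside(a, b):
    return parent.get(a) == b


def propagate(sets, edges):
    """Repeat the edge filtering until no result set changes anymore.

    sets  : dict, query node name -> set of candidate TF nodes
    edges : list of (name1, name2, related), with related(a, b) -> bool
    Returns the reduced sets and the number of passes over the edges.
    """
    sets = {name: set(tfNodes) for (name, tfNodes) in sets.items()}
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for (n1, n2, related) in edges:
            keep1 = {a for a in sets[n1] if any(related(a, b) for b in sets[n2])}
            keep2 = {b for b in sets[n2] if any(related(a, b) for a in keep1)}
            if keep1 != sets[n1] or keep2 != sets[n2]:
                sets[n1], sets[n2] = keep1, keep2
                changed = True
    return sets, passes


start = dict(w={10, 11}, p={1, 2}, c={100})
edges = [('w', 'p', inside), ('p', 'c', inside)]

result, passes = propagate(start, edges)
print({name: sorted(tfNodes) for (name, tfNodes) in result.items()}, passes)
# {'w': [10], 'p': [1], 'c': [100]} 3
```

In this run the first pass removes phrase 2 only when it reaches the second edge, so word 11 survives until the second pass. The third pass confirms that nothing changes anymore. In constraint satisfaction this kind of repeated pairwise filtering is known as enforcing arc consistency. One caveat from that field: if the query graph contains cycles, pairwise filtering alone may leave TF nodes that do not occur in any full instantiation.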

Or we can find a way to order the edges, so that we can do them in one or two passes.

Maybe we need a bit of math here.

My gut feeling is that this is all very doable and that it corresponds to people's query
needs.

# Just playing around

The stuff below is not yet meaningful.

In [6]:
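def makeNode(otype, features):
    # freeze the features into (name, value) pairs, so that the node is hashable
    return (otype, tuple(features.items()))
def makeEdge(node, nodes):
    if type(nodes) is list:
        frozenNodes = tuple(nodes)
    elif type(nodes) is set:
        frozenNodes = frozenset(nodes)
    else:
        # assumption: anything else is a single node, as in an order edge (x, y)
        frozenNodes = nodes
    # assumption: an edge is returned as a (node, targets) pair
    return (node, frozenNodes)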