Adding first-class relational support to Morphir #181
AttilaMihaly started this conversation in Ideas
Problem Statement
While programming languages have come a long way in making it easy for developers to write data processing pipelines declaratively, relational algebra is still the most widely used and well-known method of processing large amounts of data. Even data processing solutions that offer native language integrations (such as Apache Spark) end up exposing APIs that abandon the semantics of the programming language and mimic the behavior of relational algebra instead. As a result, there's still friction between the programming language and the data processing pipelines.
Proposal
One of the most successful solutions to this problem is Microsoft's Language Integrated Query (LINQ), which seamlessly integrates relational algebra into a programming language. What if we did something similar in Morphir?
To be as non-intrusive as possible, we could start by adding a separate language building block that captures relational operations in a separate data structure and refers back to existing Morphir IR constructs for the column level operations.
Benefits
There are a number of benefits to this approach:
Implementation Details
We can start by defining a relation as a recursive data structure:
While we could represent these to a certain extent as function calls with the existing Value constructs, the advantage of having dedicated structures is that we can represent the semantics of relational algebra more accurately and map it to values efficiently. We will define the exact semantics later, but to demonstrate the concept we will examine what goes into the predicate of a Where node.
In SQL, which column or object names you can use in a where clause depends on what is available at that point in the relation, which in turn depends on what is in the from clause and what was joined. The most direct mapping of that behavior in Morphir is to treat the predicate as a function body where the variables in scope are derived from the source relation. For example, in the query below:
The expression inside the parentheses is a Morphir value where only the variable a is available, which is a record with fields that can be derived from Foo's schema. On the other hand, in the query below:
Now a and b are both variables that are available in the predicate's scope.
We would need new name resolution and type inference tooling on the relation level that follows the semantics described here (and expanded later), but for the column level we could simply refer back to the existing tooling.
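To make the scoping idea concrete, here is a rough sketch in Python rather than in Morphir's own IR. Everything in it is an assumption made for illustration: the node names From, Join and Where, the scope_of helper, the table name Bar, and the field names id and fooId are all hypothetical, and a plain Python function stands in for the Morphir value that the predicate would really be. It only shows how the variables visible to a predicate could be derived by walking the source relation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Dict, Union

# A row is modelled as a record: field name -> value.
Record = Dict[str, object]
# The variables a predicate can see: variable name -> record.
Scope = Dict[str, Record]


@dataclass(frozen=True)
class From:
    """Leaf node: reads a table and binds its rows to a variable."""
    table: str  # e.g. "Foo"
    alias: str  # e.g. "a"


@dataclass(frozen=True)
class Join:
    """Combines two relations; both sides' variables stay visible."""
    left: Relation
    right: Relation


@dataclass(frozen=True)
class Where:
    """Filters `source`; `predicate` stands in for a Morphir value."""
    source: Relation
    predicate: Callable[[Scope], bool]


# The recursive relation type (hypothetical, not the proposed definition).
Relation = Union[From, Join, Where]


def scope_of(relation: Relation) -> Dict[str, str]:
    """Return the variables a predicate over `relation` may refer to,
    mapped to the table whose schema gives each variable its fields."""
    if isinstance(relation, From):
        return {relation.alias: relation.table}
    if isinstance(relation, Join):
        return {**scope_of(relation.left), **scope_of(relation.right)}
    if isinstance(relation, Where):
        # Filtering does not add or remove variables.
        return scope_of(relation.source)
    raise TypeError(f"not a relation: {relation!r}")


# One table: only `a` is in scope for the predicate.
single = Where(From("Foo", "a"), lambda s: s["a"]["id"] == 1)

# Two joined tables: both `a` and `b` are in scope.
joined = Where(
    Join(From("Foo", "a"), From("Bar", "b")),
    lambda s: s["a"]["id"] == s["b"]["fooId"],
)

print(scope_of(single.source))  # {'a': 'Foo'}
print(scope_of(joined.source))  # {'a': 'Foo', 'b': 'Bar'}
```

In this sketch, relation-level name resolution is just the scope_of walk, while whatever happens inside the predicate would be left to the existing column-level tooling. A real design would also need join conditions, projections, and a decision on how to handle clashing variable names.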
What does everyone think?